Simple HTML DOM Parser: Japanese Translation

2011/04/14

My own summary of the online manual for PHP Simple HTML DOM Parser.

Quick Start

Get HTML elements

Create a DOM from a URL or file: $html = file_get_html('http://www.google.com/');
Get the src attribute of every img tag in the DOM: foreach($html->find('img') as $element){
  echo $element->src . '<br>';
}
Get the href attribute of every a tag in the DOM: foreach($html->find('a') as $element){
  echo $element->href . '<br>';
}

Modify HTML elements

Create a DOM from an HTML string

// Parse the HTML string and assign the resulting DOM object to $html.
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// Add a class attribute with the value "bar" to the div at index 1 (the second div).
$html->find('div', 1)->class = 'bar';

// Replace the inner content of the first div (index 0) whose id is "hello" with "foo".
$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>

Extract contents from HTML

Extract the text without tags (dump contents from HTML): echo file_get_html('http://www.google.com/')->plaintext;

Scraping Slashdot!

Sample that creates a DOM from a URL and extracts values

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
foreach($html->find('div.article') as $article) {
// For each <div class="article">, store the tag-free text
// of its title, intro, and details blocks in the $item array,
// then append $item to $articles.
$item['title'] = $article->find('div.title', 0)->plaintext;
$item['intro'] = $article->find('div.intro', 0)->plaintext;
$item['details'] = $article->find('div.details', 0)->plaintext;
$articles[] = $item;
}

print_r($articles);

How to create HTML DOM object?

Not yet translated.

How to find HTML elements?

Basics

All <a> tags: $ret = $html->find('a');
The nth <a> tag among all matches (index 0 below; as with arrays, the first one is 0): $ret = $html->find('a', 0);
The last <a> tag (index -1): $ret = $html->find('a', -1);
All <div> tags that have an id attribute: $ret = $html->find('div[id]');
All <div> tags whose id attribute is "foo": $ret = $html->find('div[id=foo]');
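
The index argument can be pictured as ordinary array indexing in which negative values count from the end. The sketch below is not the library's code: `pick_by_index` is a hypothetical helper, the list of links is made up, and returning null for an out-of-range index is an assumption about how a missing match is reported.

```php
<?php
// Hypothetical helper illustrating the index semantics of find($selector, $idx).
// Assumption: an out-of-range index yields null rather than an error.
function pick_by_index(array $found, int $idx)
{
    if ($idx < 0) {
        // -1 is the last match, -2 the one before it, and so on.
        $idx = count($found) + $idx;
    }
    return $found[$idx] ?? null;
}

// Made-up stand-in for the list of matched elements.
$links = ['first.html', 'second.html', 'third.html'];

echo pick_by_index($links, 0), "\n";   // first.html
echo pick_by_index($links, -1), "\n";  // third.html
var_dump(pick_by_index($links, 5));    // NULL
```

If the real call behaves this way, reading a property directly off find('a', 5) would fail when no such element exists, so checking the result before using it is the safer habit.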

Advanced

All elements whose id is "foo": $ret = $html->find('#foo');
All elements whose class is "foo": $ret = $html->find('.foo');
All elements that have an id attribute: $ret = $html->find('*[id]');
All <a> and <img> tags: $ret = $html->find('a, img');
All <a> and <img> tags that have a title attribute: $ret = $html->find('a[title], img[title]');

Descendant selectors

All <li> tags inside <ul> tags: $es = $html->find('ul li');
All <div> tags inside a <div> that is itself inside a <div>: $es = $html->find('div div div');
All <td> tags inside <table class="hello">: $es = $html->find('table.hello td');
All <td align=center> tags inside <table> tags: $es = $html->find('table td[align=center]');

Nested selectors

All <li> tags inside each <ul>: foreach($html->find('ul') as $ul){
  foreach($ul->find('li') as $li){
    // do something…
  }
}
The first <li> inside the first <ul>: $e = $html->find('ul', 0)->find('li', 0);
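
To see that a nested search and a descendant selector reach the same elements, here is a small sketch that deliberately swaps in PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM, so it runs without the library. The sample markup is made up, and the XPath expressions are only rough counterparts of 'ul li' and of the nested find() loops.

```php
<?php
// Made-up sample markup with two lists.
$doc = new DOMDocument();
$doc->loadHTML('<ul><li>a</li><li>b</li></ul><ul><li>c</li></ul>');
$xpath = new DOMXPath($doc);

// Descendant style: every <li> that sits anywhere inside a <ul>.
$flat = [];
foreach ($xpath->query('//ul//li') as $li) {
    $flat[] = $li->textContent;
}

// Nested style: visit each <ul>, then search only within that <ul>.
$nested = [];
foreach ($xpath->query('//ul') as $ul) {
    foreach ($xpath->query('.//li', $ul) as $li) {
        $nested[] = $li->textContent;
    }
}

echo implode(',', $flat), "\n";    // a,b,c
echo implode(',', $nested), "\n";  // a,b,c
```

The nested form is mainly useful when something has to happen once per <ul>, for example grouping items by the list they belong to.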

Attribute Filters

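Filter	Description
[attribute]	Matches elements that have the attribute
[!attribute]	Matches elements that do not have the attribute
[attribute=value]	Matches elements whose attribute equals the value
[attribute!=value]	Matches elements whose attribute does not equal the value
[attribute^=value]	Matches elements whose attribute starts with the value
[attribute$=value]	Matches elements whose attribute ends with the value
[attribute*=value]	Matches elements whose attribute contains the value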
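
The meaning of each operator can be spelled out on a plain array of attributes. This is only a sketch of the semantics listed above, not the library's matching code: `attr_matches` is a hypothetical helper, the sample attributes are made up, and the rule that a value comparison never matches a missing attribute is an assumption.

```php
<?php
// Hypothetical helper: applies one attribute filter to an array of attributes.
// $op is '' for [attribute], '!' for [!attribute], or one of =, !=, ^=, $=, *=.
// Assumption: a value comparison never matches when the attribute is missing.
function attr_matches(array $attrs, string $name, string $op = '', string $value = ''): bool
{
    $exists = array_key_exists($name, $attrs);
    if ($op === '')  { return $exists; }
    if ($op === '!') { return !$exists; }
    if (!$exists)    { return false; }

    $actual = (string) $attrs[$name];
    switch ($op) {
        case '=':  return $actual === $value;
        case '!=': return $actual !== $value;
        case '^=': return strncmp($actual, $value, strlen($value)) === 0;
        case '$=': return $value === '' || substr($actual, -strlen($value)) === $value;
        case '*=': return $value === '' || strpos($actual, $value) !== false;
    }
    throw new InvalidArgumentException('Unknown operator: ' . $op);
}

// Made-up attributes of an <img> element.
$img = ['src' => 'images/logo.png', 'alt' => 'Logo'];

var_dump(attr_matches($img, 'alt'));                  // bool(true)
var_dump(attr_matches($img, 'title', '!'));           // bool(true)
var_dump(attr_matches($img, 'src', '^=', 'images/')); // bool(true)
var_dump(attr_matches($img, 'src', '$=', '.gif'));    // bool(false)
var_dump(attr_matches($img, 'src', '*=', 'logo'));    // bool(true)
```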

Text & Comments

Text blocks (newline characters are included): $es = $html->find('text');
Comments (<!--...-->): $es = $html->find('comment');

How to access the HTML element’s attributes?

Get, Set and Remove attributes

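Get the value of an attribute: // Reads the href attribute below.
$value = $e->href;
Set the value of an attribute: // Sets the href attribute below.
$e->href = 'my link';
Remove an attribute by setting its value to null: $e->href = null;
Check whether an attribute exists: if(isset($e->href)){
  echo 'href exists!';
}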
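
The property-style access above can be pictured with PHP's magic methods. The class below is a hypothetical stand-in written for illustration, not the library's element class; in particular, returning null for a missing attribute and removing the entry at assignment time are choices made for this sketch.

```php
<?php
// Hypothetical stand-in for an element; not the library's class.
class ElementSketch
{
    private $attr = [];

    public function __get($name)
    {
        // Choice made for this sketch: a missing attribute reads as null.
        return $this->attr[$name] ?? null;
    }

    public function __set($name, $value)
    {
        if ($value === null) {
            unset($this->attr[$name]);   // assigning null removes the attribute
        } else {
            $this->attr[$name] = $value;
        }
    }

    public function __isset($name)
    {
        return isset($this->attr[$name]);
    }
}

$e = new ElementSketch();
$e->href = 'my link';
var_dump(isset($e->href)); // bool(true)
echo $e->href, "\n";       // my link

$e->href = null;
var_dump(isset($e->href)); // bool(false)
```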

Magic attributes

Attribute Name	Usage
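$e->tag	The tag name of the element.
$e->outertext	The outer HTML: the element's own tags plus its contents, markup included.
$e->innertext	The inner HTML: the contents of the element, markup included.
$e->plaintext	The contents of the element with all tags removed.

plaintext: line breaks in the source are replaced with spaces.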

// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag; // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"

Tips

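I'm not really sure when these come in handy...

Extract the contents from HTML: echo $html->plaintext;
Wrap an element in another element: $e->outertext = '<div class="wrap">' . $e->outertext . '</div>';
Remove an element by assigning an empty string: $e->outertext = '';
Append an element after this one: $e->outertext = $e->outertext . '<div>foo</div>';
Insert an element before this one: $e->outertext = '<div>foo</div>' . $e->outertext;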

How to traverse the DOM tree?

Not yet translated.

How to dump contents of DOM object?

Not yet translated.

How to customize the parsing behavior?

Not yet translated.