PHP: basic notes on parsing HTML (Simple HTML DOM Parser)

2011/02/03

Some things are still unclear to me after reading the PHP Simple HTML DOM Parser manual, so here are my notes.

Loading the library

It is a library, so it has to be included before use.

include_once('simple_html_dom.php');

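Loading data

I'm not sure when each one is the better fit, but loading HTML from a string written in your source and loading a web page from a URL use slightly different functions.

Loading HTML from a string

$html = str_get_html("<html>aa</html>");

Loading a web page from a URL

$html = file_get_html("https://tips.recatnap.info/");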
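
Both loaders can fail, for example when the input is empty or the page cannot be fetched. Below is a minimal sketch of a guard. It assumes the library signals failure by returning false, which recent releases do for empty or oversized input; older releases may behave differently. The helper name load_html_or_fail is made up for this note and is not part of the library.

```php
<?php
// Assumes simple_html_dom.php sits next to this script.
include_once('simple_html_dom.php');

// Hypothetical helper (not part of the library): parse an HTML string and
// stop early instead of passing a failed result further down the script.
function load_html_or_fail($source)
{
    $html = str_get_html($source);
    if (!$html) {
        throw new RuntimeException('Could not parse the HTML source');
    }
    return $html;
}

$html = load_html_or_fail('<html><body><b>aa</b></body></html>');
echo $html->find('b', 0)->innertext . "\n"; // prints: aa
```

The same check works for file_get_html(); just swap the loader call inside the helper.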

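Extracting something specific from the loaded data

The calls above load the whole document, so the next step is to pull out just the parts you need.

Extract the contents of the <b> tag at index 0 (the first one) inside <body>

Note: if the contents include other tags, they are returned as-is.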

Pattern 1

$gVal = $html->find('body b',0)->innertext;

Pattern 2

$gVaB = $html->find('body b');
$gVal = $gVaB[0]->innertext;
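
The difference between the two patterns is what find() hands back: without an index you get an array of all matches, and with an index you get a single element, or null when nothing exists at that index. A small sketch of both cases follows; the HTML string is made up for illustration.

```php
<?php
// Assumes simple_html_dom.php sits next to this script.
include_once('simple_html_dom.php');

$html = str_get_html('<body><b>one</b><p>text</p><b>two</b></body>');

// Without an index, find() returns an array of every matching element.
foreach ($html->find('body b') as $i => $b) {
    echo $i . ': ' . $b->innertext . "\n";
}

// With an index, find() returns one element, or null if there is no match
// at that position. Check before reading a property from it.
$missing = $html->find('body b', 5);
if ($missing === null) {
    echo "no <b> at index 5\n";
}
```

Checking for null first avoids an error from reading ->innertext on a missing element.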

Extract the <b> tag at index 1 (the second one) inside <body>

Note: this returns the tag itself, not just its contents.

$gVal = $html->find('body b',1)->outertext;
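
To see the difference between innertext and outertext side by side, here is a short sketch on a single element with a nested tag. The HTML string is made up for illustration.

```php
<?php
// Assumes simple_html_dom.php sits next to this script.
include_once('simple_html_dom.php');

$html = str_get_html('<body><b>he<em>he</em>he</b></body>');
$b = $html->find('body b', 0);

// innertext: everything between <b> and </b>, nested tags included.
echo $b->innertext . "\n"; // prints: he<em>he</em>he

// outertext: the same content plus the <b> tag itself.
echo $b->outertext . "\n"; // prints: <b>he<em>he</em>he</b>
```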

Extract the href attribute of the <a> tag at index 0 inside <body>

$gVal = $html->find('body a',0)->href;

Extract the title attribute of the <a> tag at index 0 inside <body>

$gVal = $html->find('body a',0)->title;
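
Attributes are read as properties on the element, so a link without a title needs a little care. This sketch checks with isset() before reading; the HTML string and the '(no title)' fallback are made up for illustration.

```php
<?php
// Assumes simple_html_dom.php sits next to this script.
include_once('simple_html_dom.php');

$html = str_get_html('<body><a href="http://recatnap.info/">link</a></body>');

$a = $html->find('body a', 0);
$title = '(no title)';
if ($a !== null) {
    // href is present on this link.
    echo $a->href . "\n"; // prints: http://recatnap.info/

    // title is not set on this link, so keep the fallback text.
    if (isset($a->title)) {
        $title = $a->title;
    }
    echo $title . "\n"; // prints: (no title)
}
```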

Sample source

<?php
// PHP Simple HTML DOM Parser
include_once('simple_html_dom.php');
$str = "";
$str .= '<html>' . "\n";
$str .= '<head>' . "\n";
$str .= '<title>test Simple HTML DOM Parser</title>' . "\n";
$str .= '</head>' . "\n";
$str .= '<body>' . "\n";
$str .= '<div>' . "\n";
$str .= 'momomo' . "\n";
$str .= '<b>hahahahaha</b>' . "\n";
$str .= 'mamama' . "\n";
$str .= '<b>hehehe</b>' . "\n";
$str .= '<b>he<em>he</em>he</b>' . "\n";
$str .= '<a href="http://recatnap.info/" title="linkTitle">hahahahaha</a>' . "\n";
$str .= '</div>' . "\n";
$str .= '</body>' . "\n";
$str .= '</html>' . "\n";
// Load the HTML string
$html = str_get_html($str);
// To load a page from a URL instead, use the line below.
// $html = file_get_html("https://tips.recatnap.info/");
// Print the contents (innertext) of the <b> at index 0 inside <body>
$b0 = $html->find('body b',0)->innertext;
echo $b0 . "\n";
// Result: hahahahaha
// Print the contents (innertext) of the <b> at index 0 inside <body>: pattern 2
$b3 = $html->find('body b');
echo $b3[0]->innertext . "\n";
// Result: hahahahaha
// Print the tag and its contents (outertext) of the <b> at index 1 inside <body>
$b1 = $html->find('body b',1)->outertext;
echo $b1 . "\n";
// Result: <b>hehehe</b>
// Print the contents (innertext) of the <b> at index 2 inside <body>
$em = $html->find('body b',2)->innertext;
echo $em . "\n";
// Result: he<em>he</em>he
// Print the href attribute of the <a> at index 0 inside <body>
$a = $html->find('body a',0)->href;
echo $a . "\n";
// Result: http://recatnap.info/
// Print the title attribute of the <a> at index 0 inside <body>
$at = $html->find('body a',0)->title;
echo $at . "\n";
// Result: linkTitle
?>

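Summary of the sample source results

Description	Result	Code
Pattern 1: contents of the <b> at index 0	hahahahaha	$b0 = $html->find('body b',0)->innertext;
Pattern 2: contents of the <b> at index 0	hahahahaha	$b3 = $html->find('body b'); echo $b3[0]->innertext;
Tag and contents of the <b> at index 1	<b>hehehe</b>	$b1 = $html->find('body b',1)->outertext;
Contents of the <b> at index 2	he<em>he</em>he	$em = $html->find('body b',2)->innertext;
href attribute of the <a> at index 0	http://recatnap.info/	$a = $html->find('body a',0)->href;
title attribute of the <a> at index 0	linkTitle	$at = $html->find('body a',0)->title;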

Reference: a library for parsing HTML [PHP Simple HTML DOM Parser]