Memory leaks caused by improper use of simple_html_dom
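Today I wrote a simple web crawler with simple_html_dom (PHP 5.2.3). It ran slowly and quickly exceeded the default 8 MB memory limit; it only completed after I raised the limit to memory_limit = 128M. That obviously did not meet my needs. I then printed memory_get_usage() at the end of every loop iteration: memory grew by more than 1 MB per iteration, and 60 iterations ended up using over 80 MB.
Looking more closely, I added output to the destructors of the simple_html_dom and simple_html_dom_node objects the script uses. While the loop was running, not a single one of these objects was reclaimed. unset($var), $var = null and similar operations had no effect; all objects were released only after the script ended.
I then tried the same script under PHP 5.3, and memory usage dropped sharply. At the end of every iteration it printed "call destruct" and similar messages, which shows that the objects were reclaimed at the end of each iteration.
Later I found that the simple_html_dom object has a clear method, so I called clear every time I finished with a simple_html_dom object. With that change the script no longer uses large amounts of memory on either 5.2 or 5.3.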
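Another article is attached below. It solves exactly the problem I had today; had I found it earlier, I would not have had to fumble around for so long.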
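When you write a PHP script you normally do not have to think about memory leaks or garbage collection, because the script usually finishes and exits quickly. The question only matters when the script processes a lot of data or runs for a long time.
A colleague of mine ran into this problem. He needed to fetch and analyze several thousand pages. To process each page he used the open-source tool simple_html_dom, creating a new simple_html_dom object per page. After running for a while the PHP script used too much memory and exited with an error (PHP Fatal error: Allowed memory size of ... bytes exhausted). In principle, the simple_html_dom object created for a page should be destroyed once that page has been processed, but in practice it was not. Clearly, memory was leaking.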
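1. PHP's garbage collection mechanism
Before PHP 5.3, garbage collection was plain reference counting: every value in memory has a counter. When a variable starts referring to the value, the counter is incremented; when that reference goes away, it is decremented; when the counter reaches 0, the value is no longer in use and is destroyed, which completes the collection.
Reference counting has a problem: when two or more objects reference each other and form a cycle, their counters never drop to 0. The group of objects is no longer usable but cannot be reclaimed, which leaks memory.
Starting with PHP 5.3, a new garbage collection mechanism is used: on top of reference counting it implements a more complex algorithm that detects reference cycles among objects, to avoid this kind of leak.
For the algorithm, see the following article, which is the main reference for this short write-up :) : 浅谈PHP5中垃圾回收算法(Garbage Collection)的演化 (A brief look at the evolution of the garbage collection algorithm in PHP 5)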
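The difference between the two mechanisms can be seen in a few lines. The following is a minimal sketch, not taken from either article: the Tracked class is hypothetical, and calling gc_disable() is only an assumption used to approximate the pre-5.3 behaviour on a modern PHP.

```php
<?php
// Hypothetical class for illustration; it records when its destructor runs.
class Tracked {
    public static $destroyed = array();
    public $peer = null;
    private $name;
    function __construct($name) { $this->name = $name; }
    function __destruct() { self::$destroyed[] = $this->name; }
}

gc_disable(); // assumption: approximates pre-5.3 pure reference counting

// No cycle: the count drops to zero at unset(), so the destructor runs at once.
$solo = new Tracked('solo');
unset($solo);
$afterSolo = Tracked::$destroyed;

// Cycle: x and y hold each other, so neither count reaches zero.
$x = new Tracked('x');
$y = new Tracked('y');
$x->peer = $y;
$y->peer = $x;
unset($x, $y);
$afterCycle = Tracked::$destroyed;   // unchanged: only 'solo' so far

// The cycle collector (PHP 5.3 and later) finds and frees the unreachable pair.
gc_enable();
$collected = gc_collect_cycles();
$afterGc = Tracked::$destroyed;

echo "after unset(\$solo):        ", implode(',', $afterSolo), "\n";
echo "after unset(\$x, \$y):       ", implode(',', $afterCycle), "\n";
echo "after gc_collect_cycles(): ", count($afterGc), " objects destroyed\n";
```

With the collector switched off, the x/y pair stays in memory until the script ends, which is the situation a PHP 5.2 script is always in.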
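2. Checking for a memory leak
To see whether memory that should have been released was not, you can simply call memory_get_usage and watch the memory usage. The figures returned by memory_get_usage are said to be not very accurate; PHP's xdebug extension can give more accurate and detailed memory information.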
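One way to apply this is to take a reading after every round of work and compare the first and last values. The sketch below assumes nothing beyond memory_get_usage itself; sample_memory and the two workloads are hypothetical helpers written for this illustration.

```php
<?php
// Hypothetical helper: runs $work $rounds times and returns the memory
// reading (in bytes) taken after each round.
function sample_memory($work, $rounds) {
    $samples = array();
    for ($i = 0; $i < $rounds; $i++) {
        $work($i);
        $samples[] = memory_get_usage();
    }
    return $samples;
}

// A deliberately leaky workload: every round appends to an outer array.
$leak = array();
$leaky = function ($i) use (&$leak) {
    $leak[] = str_repeat('x', 100000 + $i);
};

// A clean workload: the buffer is local and is freed when the call returns.
$clean = function ($i) {
    $buffer = str_repeat('x', 100000 + $i);
    return strlen($buffer);
};

$leakySamples = sample_memory($leaky, 20);
$cleanSamples = sample_memory($clean, 20);

// Growth between the first and the last reading of each run.
$leakyGrowth = end($leakySamples) - $leakySamples[0];
$cleanGrowth = end($cleanSamples) - $cleanSamples[0];

echo "leaky growth: $leakyGrowth bytes\n";
echo "clean growth: $cleanGrowth bytes\n";
```

A steadily rising series is the signal to look for; a flat one suggests each round gives back what it allocated.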
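class A {
    private $b;
    function __construct(){
        // every A creates a B and hands itself to it
        $this->b = new B($this);
    }
    function __destruct(){
        //echo "A destruct\n";
    }
}
class B {
    private $a;
    function __construct($a){
        // B keeps a reference back to its A, which closes the cycle
        $this->a = $a;
    }
    function __destruct(){
        //echo "B destruct\n";
    }
}
// endless loop on purpose: watch the memory figure printed every 1000 rounds
for($i=0;;$i++){
    $a = new A();
    if($i%1000 == 0){
        echo memory_get_usage()."\n";
    }
}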
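The code above constructs an example that produces a circular reference. Each time an instance a of class A is created, a creates an instance b of class B and lets b reference a. Every A object is therefore always referenced by a B object, and every B object is in turn referenced by an A object. That is how the reference cycle arises.
Run under PHP 5.2, memory usage rises monotonically, and none of the "A/B destruct" messages that the destructors of A and B would print (with their echo lines uncommented) ever appear, until memory runs out and the script prints "PHP Fatal error: Allowed memory size of ... bytes exhausted (tried to allocate 40 bytes)".
Run under PHP 5.3, memory usage jumps up and down but never exceeds a certain bound, and the program prints a large number of "A/B destruct" messages, which shows that the destructors are being called.
My colleague's program contained exactly this kind of reference cycle, and his script was running under PHP 5.2.3. simple_html_dom has two classes, simple_html_dom and simple_html_dom_node. The former has an array member nodes, each element of which is a simple_html_dom_node object; each simple_html_dom_node object in turn has a member dom whose value is that same simple_html_dom object. This forms a neat reference cycle and causes the leak. The fix is simple: when you are done with a simple_html_dom object, explicitly call its clear function. That empties its nodes member, the cycle is broken, and the leak no longer occurs.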
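The shape of that cycle, and the effect of breaking it by hand, can be reproduced without the library. In the sketch below MiniDom and MiniNode are hypothetical stand-ins that copy only the nodes/dom structure described above; they are not the library's classes, and gc_disable() is an assumption used to approximate PHP 5.2.

```php
<?php
// Hypothetical stand-ins: the document keeps a list of nodes and every
// node points back at the document.
class MiniNode {
    public $dom;
    function __construct($dom) {
        $this->dom = $dom;
        $dom->nodes[] = $this;     // the node registers itself with the document
    }
    function clear() { $this->dom = null; }
}

class MiniDom {
    public static $destructed = 0;
    public $nodes = array();
    function __construct($count) {
        for ($i = 0; $i < $count; $i++) {
            new MiniNode($this);   // kept alive through $this->nodes
        }
    }
    // Break the cycle by hand: drop the back-references, then the node list.
    function clear() {
        foreach ($this->nodes as $n) { $n->clear(); }
        $this->nodes = array();
    }
    function __destruct() { self::$destructed++; }
}

gc_disable(); // assumption: approximates PHP 5.2, which had no cycle collector

// Without clear(): the cycle keeps every document alive after unset().
for ($i = 0; $i < 5; $i++) {
    $dom = new MiniDom(10);
    unset($dom);
}
$withoutClear = MiniDom::$destructed;

// With clear(): each document is destroyed as soon as it is unset.
for ($i = 0; $i < 5; $i++) {
    $dom = new MiniDom(10);
    $dom->clear();
    unset($dom);
}
$withClear = MiniDom::$destructed - $withoutClear;

echo "destroyed without clear(): $withoutClear\n";
echo "destroyed with clear():    $withClear\n";
```

With the real library the equivalent pattern would be to call clear() on each parsed document before moving on to the next page.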
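1) When garbage collection happens
In PHP, memory is released as soon as the reference count reaches 0. In other words, a variable that is not part of a circular reference has its memory released immediately once it goes out of scope.
Cycle detection, by contrast, is triggered only when certain conditions are met, which is why memory usage fluctuates widely in the example above. You can also run cycle detection on demand by calling gc_collect_cycles.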
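Both triggers can be observed directly. This is a rough sketch under one assumption: that the collector starts a pass on its own once enough candidate cycles have accumulated (the PHP manual describes a root buffer on the order of 10,000 entries by default). SelfRef is a hypothetical class.

```php
<?php
// Hypothetical class: each instance points at itself, the smallest possible cycle.
class SelfRef {
    public static $destructed = 0;
    public $me;
    function __construct() { $this->me = $this; }
    function __destruct() { self::$destructed++; }
}

gc_enable();

// Create far more garbage cycles than the collector buffers before it runs.
for ($i = 0; $i < 30000; $i++) {
    $obj = new SelfRef();
    unset($obj);               // the count stays at 1 because of $obj->me
}
// Destructors that ran without any explicit request.
$autoDestructed = SelfRef::$destructed;

// Force a pass to pick up whatever is still waiting.
$freed = gc_collect_cycles();
$total = SelfRef::$destructed;

echo "destroyed automatically:   $autoDestructed\n";
echo "destroyed after the call:  $total\n";
```

The automatic passes are what produce the sawtooth memory curve; the explicit call is useful at known quiet points, such as between pages in a crawler.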
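2) Effect of the & operator
Explicitly referencing a variable increases the reference count of that memory:
    $a = "something";
    $b = &$a;
If you now call unset($a), $b still refers to that memory, so it is not released.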
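A short check of that statement, written as a sketch for illustration; the variable names $aExists and $bValue are not from the article.

```php
<?php
$a = "something";
$b = &$a;            // $a and $b are now two names for the same value

unset($a);           // removes the name $a only

$aExists = isset($a);    // false: the name is gone
$bValue  = $b;           // "something": the value is still reachable through $b

echo var_export($aExists, true), "\n";
echo $bValue, "\n";
```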
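3) Effect of unset
unset only breaks the link between a variable and a region of memory and decrements that region's reference count by 1. In the example above, $a = new A(); unset($a); inside the loop body does not bring the reference count of the object down to zero, because the B instance still references it.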
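4) Effect of = null
$a = null overwrites the value held in $a with null, which lowers the old value's reference count by one, just as unset does. The difference shows when $a is bound by reference (&) to other variables: the null is written into the shared slot, so every one of those names loses the old value at once. It does not by itself force the count of an object inside a cycle to 0.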
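How = null behaves with and without a & binding can be checked with a destructor counter. A minimal sketch; Probe is a hypothetical class used only to observe when the object is destroyed.

```php
<?php
// Hypothetical class that counts how many instances have been destroyed.
class Probe {
    public static $destructed = 0;
    function __destruct() { self::$destructed++; }
}

// Two ordinary variables hold the same object.
$a = new Probe();
$b = $a;                 // a second handle to the same object
$a = null;               // drops one handle; $b still keeps the object alive
$afterFirstNull = Probe::$destructed;    // 0

$b = null;               // last handle gone, the destructor runs immediately
$afterSecondNull = Probe::$destructed;   // 1

// $d is bound to $c by reference, so both names share one slot.
$c = new Probe();
$d = &$c;
$c = null;               // writes null into the shared slot: $d is null too
$afterRefNull = Probe::$destructed;      // 2
$dIsNull = is_null($d);                  // true

echo "$afterFirstNull $afterSecondNull $afterRefNull ", var_export($dIsNull, true), "\n";
```

Only in the by-reference case does a single assignment release the object, because there is just one slot holding it.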
5) Effect of script termination
When a script finishes, all memory it used is released, whether or not there are reference cycles.
simple html dom
define('HDOM_TYPE_ELEMENT', 1);
define('HDOM_TYPE_COMMENT', 2);
define('HDOM_TYPE_TEXT',    3);
define('HDOM_TYPE_ENDTAG',  4);
define('HDOM_TYPE_ROOT',    5);
define('HDOM_TYPE_UNKNOWN', 6);
define('HDOM_QUOTE_DOUBLE', 0);
define('HDOM_QUOTE_SINGLE', 1);
define('HDOM_QUOTE_NO',     3);
define('HDOM_INFO_BEGIN',   0);
define('HDOM_INFO_END',     1);
define('HDOM_INFO_QUOTE',   2);
define('HDOM_INFO_SPACE',   3);
define('HDOM_INFO_TEXT',    4);
define('HDOM_INFO_INNER',   5);
define('HDOM_INFO_OUTER',   6);
define('HDOM_INFO_ENDSPACE',7);
// helper functions
// -----------------------------------------------------------------------------
// get html dom from file
function file_get_html() {
    $dom = new simple_html_dom;
    $args = func_get_args();
    $dom->load(call_user_func_array('file_get_contents', $args), true);
    return $dom;
}

// get html dom from string
function str_get_html($str, $lowercase=true) {
    $dom = new simple_html_dom;
    $dom->load($str, $lowercase);
    return $dom;
}

// dump html dom tree
function dump_html_tree($node, $show_attr=true, $deep=0) {
    $lead = str_repeat('    ', $deep);
    echo $lead.$node->tag;
    if ($show_attr && count($node->attr)>0) {
        echo '(';
        foreach($node->attr as $k=>$v)
            echo "[$k]=>\"".$node->$k.'", ';
        echo ')';
    }
    echo "\n";

    foreach($node->nodes as $c)
        dump_html_tree($c, $show_attr, $deep+1);
}

// get dom from file (deprecated)
function file_get_dom() {
    $dom = new simple_html_dom;
    $args = func_get_args();
    $dom->load(call_user_func_array('file_get_contents', $args), true);
    return $dom;
}

// get dom from string (deprecated)
function str_get_dom($str, $lowercase=true) {
    $dom = new simple_html_dom;
    $dom->load($str, $lowercase);
    return $dom;
}

// simple html dom node
// -----------------------------------------------------------------------------
class simple_html_dom_node {
    public $nodetype = HDOM_TYPE_TEXT;
    public $tag = 'text';
    public $attr = array();
    public $children = array();
    public $nodes = array();
    public $parent = null;
    public $_ = array();
    private $dom = null;

    function __construct($dom) {
        $this->dom = $dom;
        $dom->nodes[] = $this;
    }

    function __destruct() {
        $this->clear();
    }

    function __toString() {
        return $this->outertext();
    }

    // clean up memory due to php5 circular references memory leak...
    function clear() {
        $this->dom = null;
        $this->nodes = null;
        $this->parent = null;
        $this->children = null;
    }

& & // dump node's
& & function
dump($show_attr=true) {
& dump_html_tree($this, $show_attr);
& & // returns the parent of
& & function parent()
& return $this-&
& & // returns children of
& & function
children($idx=-1) {
& if ($idx===-1) return
& if (isset($this-&children[$idx]))
return $this-&children[$idx];
& & // returns the first
child of node
& & function first_child()
(count($this-&children)&0) return
$this-&children[0];
& & // returns the last
child of node
& & function last_child()
(($count=count($this-&children))&0)
return $this-&children[$count-1];
& & // returns the next
sibling of node & &
& & function next_sibling()
& if ($this-&parent===null) return
& $idx = 0;
& $count =
count($this-&parent-&children);
& while ($idx&$count
$this!==$this-&parent-&children[$idx])
& if (++$idx&=$count) return
$this-&parent-&children[$idx];
& & // returns the previous
sibling of node
& & function prev_sibling()
& if ($this-&parent===null) return
& $idx = 0;
& $count =
count($this-&parent-&children);
& while ($idx&$count
$this!==$this-&parent-&children[$idx])
& if (--$idx&0)
$this-&parent-&children[$idx];
& & // get dom node's inner
& & function innertext()
(isset($this-&_[HDOM_INFO_INNER])) return
$this-&_[HDOM_INFO_INNER];
(isset($this-&_[HDOM_INFO_TEXT])) return
$this-&dom-&restore_noise($this-&_[HDOM_INFO_TEXT]);
& $ret = '';
& foreach($this-&nodes as $n)
& & & $ret .=
$n-&outertext();
& return $
& & // get dom node's outer
text (with tag)
& & function outertext()
& if ($this-&tag==='root') return
$this-&innertext();
& // trigger callback
($this-&dom-&callback!==null)
call_user_func_array($this-&dom-&callback,
array($this));
(isset($this-&_[HDOM_INFO_OUTER])) return
$this-&_[HDOM_INFO_OUTER];
(isset($this-&_[HDOM_INFO_TEXT])) return
$this-&dom-&restore_noise($this-&_[HDOM_INFO_TEXT]);
& // render begin tag
$this-&dom-&nodes[$this-&_[HDOM_INFO_BEGIN]]-&makeup();
& // render inner text
(isset($this-&_[HDOM_INFO_INNER]))
& & & $ret .=
$this-&_[HDOM_INFO_INNER];
foreach($this-&nodes as $n)
& & $ret .=
$n-&outertext();
& // render end tag
& if(isset($this-&_[HDOM_INFO_END])
$this-&_[HDOM_INFO_END]!=0)
& & & $ret .=
'&/'.$this-&tag.'&';
& return $
& & // get dom node's plain
& & function text() {
(isset($this-&_[HDOM_INFO_INNER])) return
$this-&_[HDOM_INFO_INNER];
& switch ($this-&nodetype) {
& & & case
HDOM_TYPE_TEXT: return
$this-&dom-&restore_noise($this-&_[HDOM_INFO_TEXT]);
& & & case
HDOM_TYPE_COMMENT: return '';
& & & case
HDOM_TYPE_UNKNOWN: return '';
& if (strcasecmp($this-&tag,
'script')===0) return '';
& if (strcasecmp($this-&tag,
'style')===0) return '';
& $ret = '';
& foreach($this-&nodes as $n)
& & & $ret .=
$n-&text();
& return $
& & function xmltext()
& $ret = $this-&innertext();
& $ret = str_ireplace('&![CDATA[',
'', $ret);
& $ret = str_replace(']]&', '',
& return $
& & // build node's text
& & function makeup()
& // text, comment, unknown
(isset($this-&_[HDOM_INFO_TEXT])) return
$this-&dom-&restore_noise($this-&_[HDOM_INFO_TEXT]);
'&'.$this-&
& $i = -1;
& foreach($this-&attr as
$key=&$val) {
& & & // skip
removed attribute
($val===null || $val===false)
& & & $ret .=
$this-&_[HDOM_INFO_SPACE][$i][0];
& & & //no value
attr: nowrap, checked selected...
($val===true)
& & $ret .= $
& & & else
switch($this-&_[HDOM_INFO_QUOTE][$i]) {
& case HDOM_QUOTE_DOUBLE: $quote = '"';
& case HDOM_QUOTE_SINGLE: $quote = '\'';
& default: $quote = '';
& & $ret .=
$key.$this-&_[HDOM_INFO_SPACE][$i][1].'='.$this-&_[HDOM_INFO_SPACE][$i][2].$quote.$val.$
$this-&dom-&restore_noise($ret);
& return $ret .
$this-&_[HDOM_INFO_ENDSPACE] .
& & // find elements by css
& & function find($selector,
$idx=null) {
& $selectors =
$this-&parse_selector($selector);
& if (($count=count($selectors))===0) return
& $found_keys = array();
& // find each selector
& for ($c=0; $c&$ ++$c)
(($levle=count($selectors[0]))===0) return array();
(!isset($this-&_[HDOM_INFO_BEGIN])) return
& & & $head =
array($this-&_[HDOM_INFO_BEGIN]=&1);
& & & // handle
descendant selectors, no recursive!
& & & for ($l=0;
$l&$ ++$l) {
& & $ret = array();
& & foreach($head as
& $n = ($k===-1) ?
$this-&dom-&root :
$this-&dom-&nodes[$k];
& $n-&seek($selectors[$c][$l],
& & $head = $
foreach($head as $k=&$v) {
(!isset($found_keys[$k]))
& $found_keys[$k] = 1;
& // sort keys
& ksort($found_keys);
& $found = array();
& foreach($found_keys as
& & & $found[] =
$this-&dom-&nodes[$k];
& // return nth-element or array
& if (is_null($idx)) return $
else if ($idx&0) $idx = count($found) +
& return (isset($found[$idx])) ? $found[$idx] :
& & // seek for given
conditions
& & protected function
seek($selector, &$ret) {
& list($tag, $key, $val, $exp, $no_key) =
& // xpath index
& if ($tag && $key
&& is_numeric($key)) {
& & & $count =
& & & foreach
($this-&children as $c) {
& & if ($tag==='*' ||
$tag===$c-&tag) {
& if (++$count==$key) {
$ret[$c-&_[HDOM_INFO_BEGIN]] = 1;
(!empty($this-&_[HDOM_INFO_END])) ?
$this-&_[HDOM_INFO_END] : 0;
& if ($end==0) {
& & & $parent =
& & & while
(!isset($parent-&_[HDOM_INFO_END])
&& $parent!==null) {
& & $end -= 1;
& & $parent =
& & & $end +=
$parent-&_[HDOM_INFO_END];
for($i=$this-&_[HDOM_INFO_BEGIN]+1;
$i&$ ++$i) {
& & & $node =
$this-&dom-&nodes[$i];
& & & $pass =
($tag==='*' && !$key) {
& & if (in_array($node,
$this-&children, true))
& $ret[$i] = 1;
& & & // compare
& & & if ($tag
&& $tag!=$node-&tag
&& $tag!=='*') {$pass=}
& & & // compare
& & & if ($pass
&& $key) {
& & if ($no_key) {
& if (isset($node-&attr[$key]))
& & else if
(!isset($node-&attr[$key])) $pass=
& & & // compare
& & & if ($pass
&&& $val!=='*')
& & $check =
$this-&match($exp, $val,
$node-&attr[$key]);
& & // handle multiple
& & if (!$check
&& strcasecmp($key, 'class')===0)
& foreach(explode('
',$node-&attr[$key]) as $k) {
& & & $check =
$this-&match($exp, $val, $k);
& & if (!$check) $pass =
& & & if ($pass)
$ret[$i] = 1;
unset($node);
& & protected function
match($exp, $pattern, $value) {
& switch ($exp) {
& & & case
& & return
($value===$pattern);
& & & case
& & return
($value!==$pattern);
& & & case
& & return
preg_match("/^".preg_quote($pattern,'/')."/", $value);
& & & case
& & return
preg_match("/".preg_quote($pattern,'/')."$/", $value);
& & & case
& & if ($pattern[0]=='/')
& return preg_match($pattern, $value);
& & return
preg_match("/".$pattern."/i", $value);
& & protected function
parse_selector($selector_string) {
& // pattern of CSS selectors, modified from
& $pattern =
"/([\w-:\*]*)(?:\#([\w-]+)|\.([\w-]+))?(?:\[@?(!?[\w-]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/,
& preg_match_all($pattern,
trim($selector_string).' ', $matches, PREG_SET_ORDER);
& $selectors = array();
& $result = array();
& //print_r($matches);
& foreach ($matches as $m) {
& & & $m[0] =
trim($m[0]);
($m[0]==='' || $m[0]==='/' || $m[0]==='//')
& & & // for
borwser grnreated xpath
($m[1]==='tbody')
& & & list($tag,
$key, $val, $exp, $no_key) = array($m[1], null, null, '=',
if(!empty($m[2])) {$key='id'; $val=$m[2];}
if(!empty($m[3])) {$key='class'; $val=$m[3];}
if(!empty($m[4])) {$key=$m[4];}
if(!empty($m[5])) {$exp=$m[5];}
if(!empty($m[6])) {$val=$m[6];}
& & & // convert
to lowercase
($this-&dom-&lowercase)
{$tag=strtolower($tag); $key=strtolower($key);}
& & & //elements
that do NOT have the specified attribute
(isset($key[0]) && $key[0]==='!')
{$key=substr($key, 1); $no_key=}
& & & $result[]
= array($tag, $key, $val, $exp, $no_key);
(trim($m[7])===',') {
& & $selectors[] = $
& & $result = array();
& if (count($result)&0)
$selectors[] = $
& return $
& & function __get($name)
& if (isset($this-&attr[$name]))
return $this-&attr[$name];
& switch($name) {
& & & case
'outertext': return $this-&outertext();
& & & case
'innertext': return $this-&innertext();
& & & case
'plaintext': return $this-&text();
& & & case
'xmltext': return $this-&xmltext();
& & & default:
return array_key_exists($name, $this-&attr);
& & function __set($name,
& switch($name) {
& & & case
'outertext': return $this-&_[HDOM_INFO_OUTER] =
& & & case
'innertext':
(isset($this-&_[HDOM_INFO_TEXT])) return
$this-&_[HDOM_INFO_TEXT] = $
& & return
$this-&_[HDOM_INFO_INNER] = $
& if (!isset($this-&attr[$name]))
$this-&_[HDOM_INFO_SPACE][] = array(' ', '',
$this-&_[HDOM_INFO_QUOTE][] =
HDOM_QUOTE_DOUBLE;
& $this-&attr[$name] =
& & function __isset($name)
& switch($name) {
& & & case
'outertext':
& & & case
'innertext':
& & & case
'plaintext':
& //no value attr: nowrap, checked
selected...
& return (array_key_exists($name,
$this-&attr)) ? true :
isset($this-&attr[$name]);
& & function __unset($name)
(isset($this-&attr[$name]))
unset($this-&attr[$name]);
& & // camel naming
conventions
& & function
getAllAttributes() {return $this-&}
& & function
getAttribute($name) {return
$this-&__get($name);}
& & function
setAttribute($name, $value) {$this-&__set($name,
& & function
hasAttribute($name) {return
$this-&__isset($name);}
& & function
removeAttribute($name) {$this-&__set($name,
& & function
getElementById($id) {return $this-&find("#$id",
& & function
getElementsById($id, $idx=null) {return
$this-&find("#$id", $idx);}
& & function
getElementByTagName($name) {return
$this-&find($name, 0);}
& & function
getElementsByTagName_r($name, $idx=null) {return
$this-&find($name, $idx);}
& & function parentNode()
{return $this-&parent();}
& & function
childNodes($idx=-1) {return
$this-&children($idx);}
& & function firstChild()
{return $this-&first_child();}
& & function lastChild()
{return $this-&last_child();}
& & function nextSibling()
{return $this-&next_sibling();}
& & function
previousSibling() {return
$this-&prev_sibling();}
// simple html dom parser
-----------------------------------------------------------------------------
class simple_html_dom {
& & public $root =
& & public $nodes =
& & public $callback =
& & public $lowercase =
& & protected $
& & protected $
& & protected $
& & protected $
& & protected $
& & protected $
& & protected $noise =
& & protected $token_blank =
" \t\r\n";
& & protected $token_equal =
& & protected $token_slash =
" /&\r\n\t";
& & protected $token_attr =
& & // use isset instead of
in_array, performance boost about 30%...
& & protected
$self_closing_tags = array('img'=&1,
'br'=&1, 'input'=&1,
'meta'=&1, 'link'=&1,
'hr'=&1, 'base'=&1,
'embed'=&1, 'spacer'=&1);
& & protected $block_tags =
array('root'=&1, 'body'=&1,
'form'=&1, 'div'=&1,
'span'=&1, 'table'=&1);
& & protected
$optional_closing_tags = array(
'tr'=&array('tr'=&1,
'td'=&1, 'th'=&1),
'th'=&array('th'=&1),
'td'=&array('td'=&1),
'li'=&array('li'=&1),
'dt'=&array('dt'=&1,
'dd'=&array('dd'=&1,
'dl'=&array('dd'=&1,
'p'=&array('p'=&1),
'nobr'=&array('nobr'=&1),
& & function
__construct($str=null) {
& if ($str) {
(preg_match("/^http:\/\//i",$str) ||
is_file($str))&
$this-&load_file($str);&
& & & else
$this-&load($str);
& & function __destruct()
& $this-&clear();
& & // load html from
& & function load($str,
$lowercase=true) {
& // prepare
& $this-&prepare($str,
$lowercase);
& // strip out comments
$this-&remove_noise("'&!--(.*?)--&'is");
& // strip out cdata
$this-&remove_noise("'&!\[CDATA\[(.*?)\]\]&'is",
& // strip out
&style& tags
$this-&remove_noise("'&\s*style[^&]*[^/]&(.*?)&\s*/\s*style\s*&'is");
$this-&remove_noise("'&\s*style\s*&(.*?)&\s*/\s*style\s*&'is");
& // strip out
&script& tags
$this-&remove_noise("'&\s*script[^&]*[^/]&(.*?)&\s*/\s*script\s*&'is");
$this-&remove_noise("'&\s*script\s*&(.*?)&\s*/\s*script\s*&'is");
& // strip out preformatted tags
$this-&remove_noise("'&\s*(?:code)[^&]*&(.*?)&\s*/\s*(?:code)\s*&'is");
& // strip out server side scripts
$this-&remove_noise("'(&\?)(.*?)(\?&)'s",
& // strip smarty scripts
$this-&remove_noise("'(\{\w)(.*?)(\})'s",
& // parsing
& while ($this-&parse());
$this-&root-&_[HDOM_INFO_END] =
& & // load html from
& & function load_file()
& $args = func_get_args();
$this-&load(call_user_func_array('file_get_contents',
$args), true);
& & // set callback
& & function
set_callback($function_name) {
& $this-&callback =
$function_
& & // remove callback
& & function
remove_callback() {
& $this-&callback =
& & // save dom as
& & function
save($filepath='') {
$this-&root-&innertext();
& if ($filepath!=='') file_put_contents($filepath,
& return $
& & // find dom node by css
& & function find($selector,
$idx=null) {
$this-&root-&find($selector,
    // clean up memory due to php5 circular references memory leak...
    function clear() {
        foreach($this->nodes as $n) {$n->clear(); $n = null;}
        if (isset($this->parent)) {$this->parent->clear(); unset($this->parent);}
        if (isset($this->root)) {$this->root->clear(); unset($this->root);}
        unset($this->doc);
        unset($this->noise);
    }

& & function
dump($show_attr=true) {
$this-&root-&dump($show_attr);
& & // prepare HTML data and
init everything
& & protected function
prepare($str, $lowercase=true) {
& $this-&clear();
& $this-&doc = $
& $this-&pos = 0;
& $this-&cursor = 1;
& $this-&noise = array();
& $this-&nodes = array();
& $this-&lowercase =
& $this-&root = new
simple_html_dom_node($this);
& $this-&root-&tag =
$this-&root-&_[HDOM_INFO_BEGIN] =
$this-&root-&nodetype =
HDOM_TYPE_ROOT;
& $this-&parent =
& // set the length of content
& $this-&size = strlen($str);
& if ($this-&size&0)
$this-&char = $this-&doc[0];
& & // parse html
& & protected function
& if (($s =
$this-&copy_until_char('&'))==='')
& & & return
$this-&read_tag();
& $node = new simple_html_dom_node($this);
& ++$this-&
& $node-&_[HDOM_INFO_TEXT] =
& $this-&link_nodes($node,
& & // read tag info
& & protected function
read_tag() {
($this-&char!=='&') {
$this-&root-&_[HDOM_INFO_END] =
& & & return
& $begin_tag_pos =
& $this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& // end tag
& if ($this-&char==='/') {
$this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
$this-&skip($this-&token_blank_t);
& & & $tag =
$this-&copy_until_char('&');
& & & // skip
attributes in end tag
& & & if (($pos
= strpos($tag, ' '))!==false)
& & $tag = substr($tag, 0,
$parent_lower =
strtolower($this-&parent-&tag);
& & & $tag_lower
= strtolower($tag);
($parent_lower!==$tag_lower) {
(isset($this-&optional_closing_tags[$parent_lower])
isset($this-&block_tags[$tag_lower])) {
$this-&parent-&_[HDOM_INFO_END] =
& $org_parent =
(($this-&parent-&parent)
strtolower($this-&parent-&tag)!==$tag_lower)
$this-&parent =
$this-&parent-&
(strtolower($this-&parent-&tag)!==$tag_lower)
$this-&parent = $org_ // restore origonal
($this-&parent-&parent)
$this-&parent =
$this-&parent-&
$this-&parent-&_[HDOM_INFO_END] =
& & & return
$this-&as_text_node($tag);
& & else if
(($this-&parent-&parent)
isset($this-&block_tags[$tag_lower])) {
$this-&parent-&_[HDOM_INFO_END] =
& $org_parent =
(($this-&parent-&parent)
strtolower($this-&parent-&tag)!==$tag_lower)
$this-&parent =
$this-&parent-&
(strtolower($this-&parent-&tag)!==$tag_lower)
$this-&parent = $org_ // restore origonal
$this-&parent-&_[HDOM_INFO_END] =
& & & return
$this-&as_text_node($tag);
& & else if
(($this-&parent-&parent)
strtolower($this-&parent-&parent-&tag)===$tag_lower)
$this-&parent-&_[HDOM_INFO_END] =
& $this-&parent =
$this-&parent-&
$this-&as_text_node($tag);
$this-&parent-&_[HDOM_INFO_END] =
($this-&parent-&parent)
$this-&parent =
$this-&parent-&
$this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& & & return
& $node = new simple_html_dom_node($this);
& $node-&_[HDOM_INFO_BEGIN] =
& ++$this-&
$this-&copy_until($this-&token_slash);
& // doctype, cdata &
comments...
& if (isset($tag[0])
&& $tag[0]==='!') {
$node-&_[HDOM_INFO_TEXT] = '&' . $tag
. $this-&copy_until_char('&');
(isset($tag[2]) && $tag[1]==='-'
&& $tag[2]==='-') {
$node-&nodetype = HDOM_TYPE_COMMENT;
& & $node-&tag =
'comment';
& & & } else
$node-&nodetype = HDOM_TYPE_UNKNOWN;
& & $node-&tag =
'unknown';
($this-&char==='&')
$node-&_[HDOM_INFO_TEXT].='&';
$this-&link_nodes($node, true);
$this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& & & return
& if ($pos=strpos($tag,
'&')!==false) {
& & & $tag =
'&' . substr($tag, 0, -1);
$node-&_[HDOM_INFO_TEXT] = $
$this-&link_nodes($node, false);
$this-&char =
$this-&doc[--$this-&pos]; //
& & & return
& if (!preg_match("/^[\w-:]+$/", $tag)) {
$node-&_[HDOM_INFO_TEXT] = '&' . $tag
$this-&copy_until('&&');
($this-&char==='&') {
$this-&link_nodes($node, false);
($this-&char==='&')
$node-&_[HDOM_INFO_TEXT].='&';
$this-&link_nodes($node, false);
$this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& & & return
& // begin tag
& $node-&nodetype =
HDOM_TYPE_ELEMENT;
& $tag_lower = strtolower($tag);
& $node-&tag =
($this-&lowercase) ? $tag_lower : $
& // handle optional closing tags
(isset($this-&optional_closing_tags[$tag_lower]) )
& & & while
(isset($this-&optional_closing_tags[$tag_lower][strtolower($this-&parent-&tag)]))
$this-&parent-&_[HDOM_INFO_END] =
& & $this-&parent
= $this-&parent-&
$node-&parent = $this-&
& $guard = 0; // prevent infinity loop
& $space =
array($this-&copy_skip($this-&token_blank),
& // attributes
($this-&char!==null
&& $space[0]==='')
& & & $name =
$this-&copy_until($this-&token_equal);
if($guard===$this-&pos) {
& & $this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& & & $guard =
& & & // handle
endless '&'
if($this-&pos&=$this-&size-1
$this-&char!=='&') {
$node-&nodetype = HDOM_TYPE_TEXT;
$node-&_[HDOM_INFO_END] = 0;
$node-&_[HDOM_INFO_TEXT] = '&'.$tag .
$space[0] . $
& & $node-&tag =
$this-&link_nodes($node, false);
& & & // handle
mismatch '&'
if($this-&doc[$this-&pos-1]=='&')
$node-&nodetype = HDOM_TYPE_TEXT;
& & $node-&tag =
& & $node-&attr =
$node-&_[HDOM_INFO_END] = 0;
$node-&_[HDOM_INFO_TEXT] =
substr($this-&doc, $begin_tag_pos,
$this-&pos-$begin_tag_pos-1);
& & $this-&pos -=
& & $this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
$this-&link_nodes($node, false);
($name!=='/' && $name!=='') {
& & $space[1] =
$this-&copy_skip($this-&token_blank);
& & $name =
$this-&restore_noise($name);
($this-&lowercase) $name = strtolower($name);
($this-&char==='=') {
& $this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& $this-&parse_attr($node, $name,
& & else {
& //no value attr: nowrap, checked
selected...
& $node-&_[HDOM_INFO_QUOTE][] =
HDOM_QUOTE_NO;
& $node-&attr[$name] =
($this-&char!='&')
$this-&char =
$this-&doc[--$this-&pos]; //
$node-&_[HDOM_INFO_SPACE][] = $
& & $space =
array($this-&copy_skip($this-&token_blank),
& & & else
while($this-&char!=='&'
$this-&char!=='/');
& $this-&link_nodes($node,
& $node-&_[HDOM_INFO_ENDSPACE] =
$space[0];
& // check self closing
($this-&copy_until_char_escape('&')==='/')
$node-&_[HDOM_INFO_ENDSPACE] .= '/';
$node-&_[HDOM_INFO_END] = 0;
& & & // reset
(!isset($this-&self_closing_tags[strtolower($node-&tag)]))
$this-&parent = $
& $this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& & // parse
attributes
& & protected function
parse_attr($node, $name, &$space) {
& $space[2] =
$this-&copy_skip($this-&token_blank);
& switch($this-&char) {
& & & case
$node-&_[HDOM_INFO_QUOTE][] =
HDOM_QUOTE_DOUBLE;
& & $this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
$node-&attr[$name] =
$this-&restore_noise($this-&copy_until_char_escape('"'));
& & $this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& & & case
$node-&_[HDOM_INFO_QUOTE][] =
HDOM_QUOTE_SINGLE;
& & $this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
$node-&attr[$name] =
$this-&restore_noise($this-&copy_until_char_escape('\''));
& & $this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
$node-&_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;
$node-&attr[$name] =
$this-&restore_noise($this-&copy_until($this-&token_attr));
& & // link node's
& & protected function
link_nodes(&$node, $is_child) {
& $node-&parent =
$this-&parent-&nodes[] = $
& if ($is_child)
$this-&parent-&children[] =
& & // as a text node
& & protected function
as_text_node($tag) {
& $node = new simple_html_dom_node($this);
& ++$this-&
& $node-&_[HDOM_INFO_TEXT] =
'&/' . $tag . '&';
& $this-&link_nodes($node,
& $this-&char =
(++$this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& & protected function
skip($chars) {
& $this-&pos +=
strspn($this-&doc, $chars,
$this-&pos);
& $this-&char =
($this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& & protected function
copy_skip($chars) {
& $pos = $this-&
& $len = strspn($this-&doc, $chars,
& $this-&pos += $
& $this-&char =
($this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& if ($len===0) return '';
& return substr($this-&doc, $pos,
& & protected function
copy_until($chars) {
& $pos = $this-&
& $len = strcspn($this-&doc,
$chars, $pos);
& $this-&pos += $
& $this-&char =
($this-&pos&$this-&size)
? $this-&doc[$this-&pos] : //
& return substr($this-&doc, $pos,
& & protected function
copy_until_char($char) {
& if ($this-&char===null) return
& if (($pos = strpos($this-&doc,
$char, $this-&pos))===false) {
& & & $ret =
substr($this-&doc, $this-&pos,
$this-&size-$this-&pos);
$this-&char =
$this-&pos = $this-&
& & & return
& if ($pos===$this-&pos) return
& $pos_old = $this-&
& $this-&char =
$this-&doc[$pos];
& $this-&pos = $
& return substr($this-&doc,
$pos_old, $pos-$pos_old);
& & protected function
copy_until_char_escape($char) {
& if ($this-&char===null) return
& $start = $this-&
& while(1) {
& & & if (($pos
= strpos($this-&doc, $char, $start))===false)
& & $ret =
substr($this-&doc, $this-&pos,
$this-&size-$this-&pos);
& & $this-&char =
& & $this-&pos =
& & return $
($pos===$this-&pos) return '';
($this-&doc[$pos-1]==='\\') {
& & $start = $pos+1;
& & & $pos_old =
$this-&char = $this-&doc[$pos];
$this-&pos = $
& & & return
substr($this-&doc, $pos_old, $pos-$pos_old);
& & // remove noise from
html content
& & protected function
remove_noise($pattern, $remove_tag=false) {
& $count = preg_match_all($pattern,
$this-&doc, $matches,
PREG_SET_ORDER|PREG_OFFSET_CAPTURE);
& for ($i=$count-1; $i&-1; --$i)
& & & $key =
'___noise___'.sprintf('% 3d',
count($this-&noise)+100);
& & & $idx =
($remove_tag) ? 0 : 1;
$this-&noise[$key] = $matches[$i][$idx][0];
$this-&doc =
substr_replace($this-&doc, $key,
$matches[$i][$idx][1], strlen($matches[$i][$idx][0]));
& // reset the length of content
& $this-&size =
strlen($this-&doc);
& if ($this-&size&0)
$this-&char = $this-&doc[0];
& & // restore noise to html
& & function
restore_noise($text) {
& while(($pos=strpos($text,
'___noise___'))!==false) {
& & & $key =
'___noise___'.$text[$pos+11].$text[$pos+12].$text[$pos+13];
(isset($this-&noise[$key]))
& & $text = substr($text, 0,
$pos).$this-&noise[$key].substr($text,
& return $
& & function __toString()
$this-&root-&innertext();
& & function __get($name)
& switch($name) {
& & & case
'outertext': return
$this-&root-&innertext();
& & & case
'innertext': return
$this-&root-&innertext();
& & & case
'plaintext': return
$this-&root-&text();
& & // camel naming
conventions
& & function
childNodes($idx=-1) {return
$this-&root-&childNodes($idx);}
& & function firstChild()
$this-&root-&first_child();}
& & function lastChild()
$this-&root-&last_child();}
& & function
getElementById($id) {return $this-&find("#$id",
& & function
getElementsById($id, $idx=null) {return
$this-&find("#$id", $idx);}
& & function
getElementByTagName($name) {return
$this-&find($name, 0);}
& & function
getElementsByTagName_r($name, $idx=-1) {return
$this-&find($name, $idx);}
& & function loadFile()
func_get_args();$this-&load(call_user_func_array('file_get_contents',
$args), true);}
}