Is there a way to keep entities intact while parsing HTML with DOMDocument?
Question
I have this function to ensure that every img tag has an absolute URL:
function absoluteSrc($html, $encoding = 'utf-8')
{
    $dom = new DOMDocument();
    // Workaround to use proper encoding
    $prehtml = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset={$encoding}\"></head><body>";
    $posthtml = "</body></html>";
    if ($dom->loadHTML($prehtml . trim($html) . $posthtml)) {
        foreach ($dom->getElementsByTagName('img') as $img) {
            if ($img instanceof DOMElement) {
                $src = $img->getAttribute('src');
                if (strpos($src, 'http://') !== 0) {
                    $img->setAttribute('src', 'http://my.server/' . $src);
                }
            }
        }
        $html = $dom->saveHTML();
        // Remove remains of workaround / DomDocument additions
        $cut_start = strpos($html, '<body>') + 6;
        $cut_length = -1 * (1 + strlen($posthtml));
        $html = substr($html, $cut_start, $cut_length);
    }
    return $html;
}
It works fine, but it returns the entities decoded as Unicode characters:
$html = <<< EOHTML
<p><img src="images/lorem.jpg" alt="lorem" align="left">
Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
Cum magna. Suscipit sed vel tincidunt urna.<br>
Vel consequat pretium Curabitur faucibus justo adipiscing elit.
<img src="others/ipsum.png" alt="ipsum" align="right"></p>
<center>&copy; Dr&nbsp;Jekyll &amp; Mr&nbsp;Hyde</center>
EOHTML;
echo absoluteSrc($html);
Output:
<p><img src="http://my.server/images/lorem.jpg" alt="lorem" align="left">
Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
Cum magna. Suscipit sed vel tincidunt urna.<br>
Vel consequat pretium Curabitur faucibus justo adipiscing elit.
<img src="http://my.server/others/ipsum.png" alt="ipsum" align="right"></p>
<center>© Dr Jekyll & Mr Hyde</center>
As you can see in the last line:
- &copy; is translated to © (U+00A9),
- &nbsp; to a non-breaking space (U+00A0),
- &amp; to &.
I would like them to remain the same as in the input string.
Recommended answer
I would also like to know the answer.
I ended up converting &..; entities to **ENTITY-...-ENTITY** before parsing and converting them back after it is done.
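The placeholder approach described in this answer could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (protectEntities, restoreEntities, absoluteSrcKeepingEntities) are hypothetical, the regular expression for what counts as an entity is a guess rather than something the answer specifies, and the base URL is taken from the question. Only the **ENTITY-...-ENTITY** placeholder shape comes from the answer itself.

```php
<?php
// Sketch only: helper names and the entity regex are assumptions, not from the answer.

// Replace &name;, &#123; and &#x7B; references with inert text the parser will not decode.
function protectEntities(string $html): string
{
    return preg_replace('/&(#?[A-Za-z0-9]+);/', '**ENTITY-$1-ENTITY**', $html);
}

// Turn the placeholders back into the original entity references.
function restoreEntities(string $html): string
{
    return preg_replace('/\*\*ENTITY-(#?[A-Za-z0-9]+)-ENTITY\*\*/', '&$1;', $html);
}

// Variant of the question's absoluteSrc() that hides entities from DOMDocument.
function absoluteSrcKeepingEntities(string $html, string $base = 'http://my.server/'): string
{
    $pre  = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>';
    $post = '</body></html>';

    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings for legacy markup.
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($pre . protectEntities(trim($html)) . $post);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $html;
    }

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (strpos($src, 'http://') !== 0) {
            $img->setAttribute('src', $base . $src);
        }
    }

    // Serialize only the children of <body>, so the wrapper markup is never emitted.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return restoreEntities($out);
}
```

Calling absoluteSrcKeepingEntities($html) on the sample input from the question should then leave &copy;, &nbsp; and &amp; as they were written. One caveat of this sketch: any text in the input that already looks like a placeholder would also be turned into an entity on the way back, so a less guessable marker may be preferable in practice.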