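I have a string value from which I am trying to extract list items. I want to extract the text and any child nodes, but DOMDocument is converting entities to characters instead of leaving them in their original state.
I tried setting DOMDocument::resolveExternals and DOMDocument::substituteEntities to false, but this had no effect. Note that I am running PHP 5.2.17 on Win7.
Example code: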

$example = '<ul><li>text</li>'.
    '<li>&frac12; of this is <strong>strong</strong></li></ul>';

echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;

$doc->loadHTML($example);

$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;

for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));
    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;
}

function _get_inner_html( $node ) {
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML( $child );
    }

    return $innerHTML;
}

The &frac12; ends up converted to ½ (the single-character UTF-8 version, not the entity version), which is not the desired format.

Best answer

A solution for PHP versions before 5.3.6:

$html = <<<HTML
<ul><li>text</li>
<li>&frac12; of this is <strong>strong</strong></li></ul>
HTML;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
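foreach ($doc->getElementsByTagName('li') as $node)
{
  // nodeValue is UTF-8 text only, so child tags such as <strong> are dropped.
  // htmlentities() defaults to ISO-8859-1 before PHP 5.4, hence the iconv() call.
  echo htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue)), "\n";
}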
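
If the child markup has to survive as well, one possible variation is to keep the question's saveXML() loop and re-encode only the non-ASCII characters afterwards. This is a minimal sketch, not a tested fix for PHP 5.2.17 specifically: the helper names _encode_non_ascii and _get_inner_html_entities are made up for illustration, and it assumes saveXML() returns the node as raw UTF-8, as described in the question.

```php
<?php
// Hypothetical callback for preg_replace_callback(): receives one run of
// non-ASCII characters, which by construction contains no <, > or &.
function _encode_non_ascii($match) {
    // Characters with a named HTML entity are converted (for example U+00BD);
    // characters without one are returned unchanged.
    return htmlentities($match[0], ENT_NOQUOTES, 'UTF-8');
}

// Hypothetical variant of the question's _get_inner_html(): serialise the
// child nodes, then turn non-ASCII characters back into entities.
function _get_inner_html_entities($node) {
    $innerHTML = '';
    foreach ($node->childNodes as $child) {
        $innerHTML .= $child->ownerDocument->saveXML($child);
    }
    // Only runs of non-ASCII characters are touched, so tags and the
    // already-escaped &amp; / &lt; / &gt; sequences are left alone.
    return preg_replace_callback('/[^\x00-\x7F]+/u', '_encode_non_ascii', $innerHTML);
}

$html = '<ul><li>text</li>'.
    '<li>&frac12; of this is <strong>strong</strong></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$items = array();
foreach ($doc->getElementsByTagName('li') as $li) {
    $items[] = trim(_get_inner_html_entities($li));
}

echo implode(PHP_EOL, $items), PHP_EOL;
```

The idea is that the second list item keeps both the &frac12; entity and the <strong> element. A named callback is used rather than a closure because closures are only available from PHP 5.3.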
