Problem description
Consider the following example, test.php:
<?php
$mystr = "<p>Hello, με काचं ça øy jeść</p>";
var_dump($mystr);
$domdoc = new DOMDocument('1.0', 'utf-8'); //DOMDocument();
$domdoc->loadHTML($mystr); // already here corrupt UTF-8?
var_dump($domdoc);
?>
If I run this with PHP 5.5.9 (cli), I get the following in the terminal:
$ php test.php
string(50) "<p>Hello, με काचं ça øy jeść</p>"
object(DOMDocument)#1 (34) {
["doctype"]=>
string(22) "(object value omitted)"
...
["actualEncoding"]=>
NULL
["encoding"]=>
NULL
["xmlEncoding"]=>
NULL
...
["textContent"]=>
string(70) "Hello, με à¤à¤¾à¤à¤ ça øy jeÅÄ"
}
Clearly, the original string is correct UTF-8, but the textContent of the DOMDocument is incorrectly encoded.
So, how can I get the content as correct UTF-8 in the DOMDocument?
Recommended answer
The DOM extension is built on libxml2, whose HTML parser was written for HTML 4, where the default encoding is ISO-8859-1. Unless it encounters an appropriate meta tag or XML declaration stating otherwise, loadHTML() will assume the content is ISO-8859-1.
Specifying the encoding when creating the DOMDocument, as you have, does not influence what the parser does: loading HTML (or XML) replaces both the XML version and the encoding that you gave its constructor.
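To see this for yourself, a small check along the following lines should do. It is only a sketch: the variable names $before and $after are mine, and the exact value left in the encoding property after loading may vary between PHP and libxml2 versions (in the question's output it is NULL).

```php
<?php
// Construct the document with an explicit encoding, as in the question.
$domdoc = new DOMDocument('1.0', 'utf-8');
$before = $domdoc->encoding;   // value taken from the constructor

// Loading markup replaces the document, including its encoding.
$domdoc->loadHTML('<p>plain ASCII</p>');
$after = $domdoc->encoding;    // value left behind by the parser

var_dump($before, $after);
```

If the two values differ, the constructor argument was discarded by loadHTML().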
Either first use mb_convert_encoding() to translate anything above the ASCII range into its HTML entity equivalent:
$domdoc->loadHTML(mb_convert_encoding($mystr, 'HTML-ENTITIES', 'UTF-8'));
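Note that PHP 8.2 deprecated the 'HTML-ENTITIES' pseudo-encoding of mbstring, so the call above raises a deprecation notice there. A possible substitute, which is my own suggestion and not part of the original answer, is mb_encode_numericentity(), which turns every non-ASCII character into a numeric entity. The helper name loadUtf8Html() below is hypothetical.

```php
<?php
// Hypothetical helper (not from the original answer): encode every non-ASCII
// character as a numeric entity so the parser's ISO-8859-1 default is harmless.
function loadUtf8Html($html)
{
    $doc = new DOMDocument();
    // Map code points 0x80..0x10FFFF to numeric entities, with no offset.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $doc->loadHTML(mb_encode_numericentity($html, $convmap, 'UTF-8'));
    return $doc;
}

$doc = loadUtf8Html("<p>Hello, με काचं ça øy jeść</p>");
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// Compare the parsed text with the original UTF-8 text.
var_dump($text === "Hello, με काचं ça øy jeść");
```

Because the parser only ever sees ASCII, its encoding guess no longer matters; the entities are decoded to proper UTF-8 inside the DOM.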
Or hack in a meta tag or XML declaration specifying UTF-8 (use one of the two, not both):
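// Option 1: prepend a meta tag that declares the charset
$domdoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $mystr);
// Option 2: prepend an XML declaration that names the encoding
$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr);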
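As a rough end-to-end sketch of the XML declaration variant, the snippet below loads the string from the question and reads the paragraph text back. The cleanup loop is an addition of mine, based on the assumption that the prefix is kept in the tree as a processing instruction node; the variable names $p and $expected are illustrative.

```php
<?php
$mystr = "<p>Hello, με काचं ça øy jeść</p>";
$expected = "Hello, με काचं ça øy jeść";

$domdoc = new DOMDocument();
// The prefix only tells the parser to read the input as UTF-8.
$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr);

// Assumption: the hint survives as a processing instruction under the
// document node; remove it so it does not reappear on serialization.
foreach ($domdoc->childNodes as $node) {
    if ($node->nodeType === XML_PI_NODE) {
        $domdoc->removeChild($node);
        break; // stop after the first removal; the node list is live
    }
}

$p = $domdoc->getElementsByTagName('p')->item(0);
var_dump($p->textContent === $expected);
```

Reading textContent from the p element rather than from the document keeps the comparison independent of any wrapper elements the parser adds.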