Problem description
I am using DOMDocument to manipulate and modify HTML before it is output to the page. This is only an HTML fragment, not a complete page. My initial problem was that all French characters got garbled, which I was able to correct after some trial and error. Now only one problem seems to remain: the ’ character gets transformed into ?.
Code:
<?php
$dom = new DOMDocument('1.0','utf-8');
$dom->loadHTML(utf8_decode($row->text));
//Some pretty basic modification here, not even related to text
//reinsert HTML, and make sure to remove DOCTYPE, html and body that get added auto.
$row->text = utf8_encode(preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML())));
?>
I know it's getting messy with the utf8 decode/encode, but this is the only way I could make it work so far. Here is a sample string:
Input: Sans doute parce qu’il vient d’atteindre une date déterminante dans son spectaculaire cheminement
Output: Sans doute parce qu?il vient d?atteindre une date déterminante dans son spectaculaire cheminement
If I find any more details, I'll add them. Thank you for your time and support!
Recommended answer
Don't use utf8_decode. If your text is in UTF-8, pass it as such.
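The reason the apostrophe in particular turns into ? can be shown with a small sketch. It assumes the sample text uses the typographic apostrophe ’ (U+2019), which has no ISO-8859-1 equivalent, whereas é does. The variable names are illustrative only.

```php
<?php
// utf8_decode() is deprecated as of PHP 8.2; silence that notice for this demo.
error_reporting(E_ALL & ~E_DEPRECATED);

$apostrophe = "\u{2019}"; // ’ RIGHT SINGLE QUOTATION MARK, not part of ISO-8859-1
$eAcute     = "\u{00E9}"; // é, part of ISO-8859-1

// utf8_decode() converts UTF-8 to ISO-8859-1 and substitutes "?" for
// any character that ISO-8859-1 cannot represent.
$decodedApostrophe = utf8_decode($apostrophe);
$decodedEAcute     = utf8_decode($eAcute);

var_dump($decodedApostrophe === '?');   // bool(true): the character is lost
var_dump($decodedEAcute === "\xE9");    // bool(true): a single Latin-1 byte
```

Once the character has been replaced by ?, a later utf8_encode cannot bring it back, which matches the output shown in the question.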
Unfortunately, DOMDocument defaults to LATIN1 in case of HTML. It seems the behavior is this:
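- If fetching a remote document, the encoding should be deduced from the headers.
- If no header was sent or the file is local, look for the corresponding meta http-equiv tag.
- Otherwise, default to LATIN1.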
A working example:
<?php
$s = <<<HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
Sans doute parce qu’il vient d’atteindre une date déterminante
dans son spectaculaire cheminement
</body>
</html>
HTML;
libxml_use_internal_errors(true);
$d = new domdocument;
$d->loadHTML($s);
echo $d->textContent;
And with XML (default is UTF-8):
<?php
$s = '<x>Sans doute parce qu’il vient d’atteindre une date déterminante '.
'dans son spectaculaire cheminement</x>';
libxml_use_internal_errors(true);
$d = new domdocument;
$d->loadXML($s);
echo $d->textContent;
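For an HTML fragment like the one in the question, the same idea might be applied by wrapping the fragment in a minimal page that declares UTF-8, then reading back only the children of body. This is a rough sketch, not a definitive implementation: the helper names loadUtf8Fragment and bodyInnerHtml are hypothetical, and it assumes the input is already valid UTF-8.

```php
<?php
// Hypothetical helper: wrap a UTF-8 fragment in a page that declares its
// charset, so the HTML parser does not fall back to LATIN1.
function loadUtf8Fragment(string $fragment): DOMDocument
{
    $wrapped = '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
        . '</head><body>' . $fragment . '</body></html>';

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Hypothetical helper: serialize only the children of <body>, which avoids
// stripping DOCTYPE, html and body tags with string replacements.
function bodyInnerHtml(DOMDocument $dom): string
{
    $body = $dom->getElementsByTagName('body')->item(0);
    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$dom = loadUtf8Fragment('<p>Sans doute parce qu’il vient d’atteindre une date déterminante</p>');
// ... modify $dom here ...
$result = bodyInnerHtml($dom);
echo $result;
```

With this approach neither utf8_decode nor utf8_encode is needed, so the apostrophe should survive the round trip. Depending on the PHP and libxml versions, non-ASCII characters may come back either as raw UTF-8 or as HTML entities.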