Question
I'm parsing a lot of sites. Everything works fine, and I also read the charset declarations in order to convert encodings. Now I have a problem with http://celleheute.de/sonntagsfuhrung-3/.
The HTML meta tag says that the content is encoded as ISO-8859-2, but the HTTP header says it's UTF-8. The content really is UTF-8 encoded, so when my parser tries to convert it according to the ISO-8859-2 declaration, some characters get broken.
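A minimal sketch of the kind of breakage described above, assuming the parser decodes the UTF-8 bytes as ISO-8859-2. The sample word is only an illustration and is not taken from the page itself.

```python
# Hypothetical sample text containing a non-ASCII character.
text = "Sonntagsführung"

# What the server actually sends: UTF-8 bytes ("ü" becomes two bytes).
raw = text.encode("utf-8")

# What a parser produces if it trusts the ISO-8859-2 meta declaration:
# each of the two bytes of "ü" is decoded as a separate character.
garbled = raw.decode("iso-8859-2")

print(repr(text), len(text))
print(repr(garbled), len(garbled))

# Decoding with the encoding from the HTTP header recovers the text.
print(raw.decode("utf-8") == text)
```

The garbled string is one character longer than the original, because the single two-byte UTF-8 sequence is read as two single-byte characters.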
Now my question is: which declaration should I prefer? Should I ignore meta tags when I can find the declaration in the HTTP header, or vice versa? What do most web browsers do?
Recommended answer
To understand what modern browsers do, start reading at http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding
Steps one and two are most relevant to the question. They say:
- If the user has explicitly instructed the user agent to override the document's character encoding with a specific encoding, optionally return that encoding with the confidence certain and abort these steps.
- If the transport layer specifies an encoding, and it is supported, return that encoding with the confidence certain, and abort these steps.
This means that the real HTTP header takes precedence over everything except a user override.
Beyond that, it can get complex. A byte order mark, for example, can take precedence over the meta tag.
UPDATE: Since this answer was written, the spec changed (around mid-2012) so that the byte order mark now takes precedence over the HTTP header.
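For a parser, the precedence described above could be sketched roughly as follows. This is a simplified illustration, not the full algorithm from the spec: it skips the user-override step, uses a naive regular expression instead of the spec's meta prescan, and the function name `pick_encoding`, the 1024-byte scan window, and the `windows-1252` fallback are assumptions made for this example.

```python
import codecs
import re
from typing import Optional

# Byte order marks to check first (per the update above, a BOM wins).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)

# Naive patterns; real pages may need a proper HTML prescan.
_HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
_META_RE = re.compile(rb"<meta[^>]*?charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def _supported(name: str) -> bool:
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def pick_encoding(body: bytes, content_type: Optional[str],
                  fallback: str = "windows-1252") -> str:
    """Choose an encoding: BOM, then HTTP header, then meta tag, then fallback."""
    # 1. Byte order mark.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 2. charset parameter of the HTTP Content-Type header, if supported.
    if content_type:
        match = _HEADER_RE.search(content_type)
        if match and _supported(match.group(1)):
            return match.group(1).lower()
    # 3. meta declaration near the start of the document.
    match = _META_RE.search(body[:1024])
    if match:
        name = match.group(1).decode("ascii", "replace")
        if _supported(name):
            return name.lower()
    # 4. Assumed default when nothing is declared.
    return fallback


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2"></head>'
print(pick_encoding(page, "text/html; charset=UTF-8"))  # header is consulted before meta
print(pick_encoding(page, None))                        # no header: meta tag is used
```

For the page in the question, a function like this would pick the encoding from the HTTP header, because the header is checked before the meta tag.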