Problem description
I'm parsing a lot of sites. Everything works fine, and I also read the charset declarations to convert encodings. Now I have a problem with http://celleheute.de/sonntagsfuhrung-3/.
The HTML meta tag says that the content is encoded as ISO-8859-2, but the HTTP header says it's UTF-8. The content really is UTF-8 encoded, so when my parser tries to decode it as ISO-8859-2, some characters break.
My question is: which declaration should I prefer? Should I ignore the meta tag when I can find a declaration in the HTTP header, or vice versa? What do most web browsers do?
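The breakage described above can be reproduced in a few lines. This is only an illustrative sketch: the sample string is an assumption (a plausible word taken from the URL slug), not the actual page content.

```python
# Hypothetical sample text; the real page content is not shown in the question.
original = "Sonntagsführung"

# The server actually sends UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Trusting the meta tag means decoding those bytes as ISO-8859-2,
# which turns the two-byte sequence for "ü" into two unrelated characters.
wrong = utf8_bytes.decode("iso-8859-2")

# Trusting the HTTP header means decoding as UTF-8, which round-trips cleanly.
right = utf8_bytes.decode("utf-8")

print(wrong)
print(right)
```

Every non-ASCII character comes out as two wrong characters, which matches the "broken chars" symptom.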
To understand what modern browsers do, start by reading http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding
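Steps one and two are most relevant to the question. In summary, they say that an explicit user override of the encoding is honored first, and that an encoding specified by the transport layer is used next. This means that the real HTTP header takes precedence over everything except a user override.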
Beyond that, it can get complex. A byte order mark can, for example, take precedence over the meta tag.
UPDATE: Since this answer was written, the spec changed (around mid-2012), so the byte order mark now takes precedence over the HTTP header.
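For a parser, the precedence described above could be sketched roughly as follows. This is a simplified illustration, not the spec's algorithm: the function names, the 1024-byte scan window, the regular expression standing in for the spec's meta prescan, and the `windows-1252` fallback are all assumptions, and the user-override step is left out because a crawler has no user.

```python
import codecs
import re
from email.message import Message

# Byte order marks, checked first (per the post-2012 precedence).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)

# Rough stand-in for the spec's meta prescan; a real parser is stricter.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def pick_encoding(body, content_type=None, default="windows-1252"):
    """Pick an encoding name: BOM, then HTTP header, then meta tag, then default."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    header_charset = charset_from_header(content_type)
    if header_charset:
        return header_charset
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


# The situation from the question: header says UTF-8, meta says ISO-8859-2.
page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-2"></head><body></body></html>')
print(pick_encoding(page, "text/html; charset=UTF-8"))
```

With this ordering, the conflicting page from the question is decoded using the HTTP header's charset, and the meta tag is only consulted when neither a byte order mark nor a header charset is present.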