Prefer the charset declaration in the HTML meta tag or in the HTTP header?

Problem description

I'm parsing a lot of sites. Everything works fine, and I also read the charset declarations so I can convert encodings. Now I have a problem with http://celleheute.de/sonntagsfuhrung-3/.

The HTML meta tag says the content is encoded as ISO-8859-2, but the HTTP header says it's UTF-8. The content really is UTF-8 encoded, so when my parser decodes it as ISO-8859-2, some characters break.
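The breakage is easy to reproduce. The following is a minimal Python sketch, assuming a made-up sample string (`Sonntagsführung`, not taken from the page itself): it encodes the text as UTF-8 and then decodes the same bytes with the charset the meta tag wrongly claims.

```python
# Hypothetical sample text; any non-ASCII character shows the effect.
original = "Sonntagsführung"

# What the server actually sends: UTF-8 bytes.
raw = original.encode("utf-8")

# Decoding with the charset from the (wrong) meta tag: each byte of the
# two-byte UTF-8 sequence for "ü" becomes its own ISO-8859-2 character.
garbled = raw.decode("iso-8859-2")

# Decoding with the charset from the HTTP header gives the text back.
restored = raw.decode("utf-8")

print(garbled)
print(restored)
```

The garbled string is one character longer than the original, because the single "ü" was split into two unrelated characters.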

So my question is: which declaration should I prefer? Should I ignore the meta tag when I can find a declaration in the HTTP header, or vice versa? What do most web browsers do?

Solution

To understand what modern browsers do, start with the HTML5 spec's algorithm for determining the character encoding: http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

Steps one and two are most relevant to the question. Taken together, they mean that the real HTTP header takes precedence over everything except a user override.

Beyond that, it can get complex. A byte order mark, for example, can take precedence over the meta tag.

UPDATE: Since this answer was written, the spec has changed (around mid-2012), so the byte order mark now takes precedence over the HTTP header.
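For a parser, the precedence described here (byte order mark, then HTTP header, then meta tag) could be sketched roughly as follows. This is a simplified illustration, not the spec's full algorithm: the function name `pick_encoding`, the regular expressions, the 1024-byte scan window, and the `utf-8` fallback are all assumptions made for the example, and the user-override step is left out because a batch parser has no user.

```python
import codecs
import re
from typing import Optional

# Byte order marks to check first, paired with the encoding they imply.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Matches "charset=..." in a Content-Type header value.
_HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)

# Rough match for <meta charset=...> and <meta http-equiv ... charset=...>.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def pick_encoding(body: bytes, content_type: Optional[str] = None,
                  default: str = "utf-8") -> str:
    """Pick an encoding: BOM first, then HTTP header, then meta tag."""
    # 1. A byte order mark wins over everything else.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name

    # 2. The charset parameter of the HTTP Content-Type header.
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()

    # 3. A meta declaration near the start of the document.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()

    # 4. Nothing declared: fall back to an assumed default.
    return default
```

For the page in the question, `pick_encoding(body, "text/html; charset=UTF-8")` would return `utf-8` even though the meta tag says ISO-8859-2, which matches the actual content. A production parser would probably also check that the returned name is an encoding it supports, for example with `codecs.lookup`.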
