Question
The OpenTag FAQ states:
If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).
The BOM is a special Unicode marker placed at the top of the file that indicates its encoding. The BOM is optional for UTF-8.
First bytes          Encoding assumed
-----------------------------------------
EF BB BF             UTF-8
FE FF                UTF-16 (big-endian)
FF FE                UTF-16 (little-endian)
00 00 FE FF          UTF-32 (big-endian)
FF FE 00 00          UTF-32 (little-endian)
None of the above    UTF-8
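The table can be read as a simple prefix lookup. Below is a minimal Python sketch of that lookup; `BOM_TABLE` and `guess_encoding` are hypothetical names chosen for illustration, not part of any XML library. One detail the table does not spell out: the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM, so the longer signatures have to be tested first.

```python
# BOM signatures from the table above, longest first so that the
# UTF-32 (little-endian) BOM FF FE 00 00 is not mistaken for UTF-16 FF FE.
BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def guess_encoding(data: bytes) -> str:
    """Return the encoding implied by the first bytes (hypothetical helper)."""
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding
    return "utf-8"  # the "None of the above" row


print(guess_encoding(b"\xff\xfet\x00e\x00s\x00t\x00"))  # utf-16-le
print(guess_encoding(b"<a/>"))                          # utf-8
```

This only covers the case where no encoding declaration is present; a real parser also looks at the declaration and at any external information such as HTTP headers.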
Is there a dumbed-down explanation of the above paragraph?
Recommended answer
Either you use a line like
<?xml version="1.0" encoding="iso-8859-1" ?>
to specify which encoding is used, or you leave the encoding unspecified, in which case a byte order mark (BOM) can be present. If a BOM for either UTF-16 or UTF-32 is present, that encoding is used. Otherwise the encoding is UTF-8. (The BOM for UTF-8 is optional.)
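To see these rules in action, here is a small sketch using Python's standard `xml.etree.ElementTree` parser. The element name `a` and the sample text are made up for illustration, and the behaviour shown is that of this particular parser, which may differ in detail from others.

```python
import xml.etree.ElementTree as ET

# 1. Encoding declared explicitly: the bytes really are ISO-8859-1.
declared = '<?xml version="1.0" encoding="iso-8859-1" ?><a>é</a>'.encode("iso-8859-1")

# 2. No declaration, but a UTF-16 BOM: Python's "utf-16" codec writes a BOM.
with_bom = "<a>é</a>".encode("utf-16")

# 3. No declaration and no BOM: the parser falls back to UTF-8.
plain = "<a>é</a>".encode("utf-8")

texts = [ET.fromstring(doc).text for doc in (declared, with_bom, plain)]
print(texts)

# ISO-8859-1 bytes without a declaration are not valid UTF-8,
# so the UTF-8 fallback fails on them.
try:
    ET.fromstring("<a>é</a>".encode("iso-8859-1"))
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```

The last case is the practical reason to add the declaration whenever a file is not UTF-8 or UTF-16.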
Edit
The BOM is an invisible character, but there is no need to see it: applications take care of it automatically. When you use Windows Notepad, you can select the encoding when you save the file. Notepad will automatically insert the BOM at the start of the file. When you later reopen the file, Notepad will recognise the BOM and use the proper encoding to read the file. There is no need for you to ever modify the BOM; if you did, characters could take on a different meaning, so the text would no longer be the same.
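I will try to explain with an example. Consider a text file containing just the characters "test". By default Notepad will use ANSI encoding, and the text file will look like this when you view it in hex mode: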
C:\>C:\gnuwin32\bin\hexdump -C test-ansi.txt
00000000 74 65 73 74 |test|
00000004
(As you can see, I am using hexdump from gnuwin32, but you can also use a hex editor like Frhed to see this.)
There is no BOM at the start of this file. There could not be one, because the character used for the BOM does not exist in ANSI encoding. (Because there is no BOM, editors which don't support ANSI encoding would treat this file as UTF-8.)
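A quick way to check this claim in Python, assuming that "ANSI" here means Windows-1252 (`cp1252`), which is what Notepad uses on Western-locale systems; other locales map "ANSI" to a different code page.

```python
bom = "\ufeff"  # the BOM character, U+FEFF

# Assumption: "ANSI" is Windows-1252; this code page has no slot for U+FEFF.
try:
    bom.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False

print(representable)  # False
```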
When I now save the file as UTF-8, you will see 3 extra bytes (the BOM) in front of "test":
C:\>C:\gnuwin32\bin\hexdump -C test-utf8.txt
00000000 ef bb bf 74 65 73 74 |test|
00000007
(If you opened this file with a text editor which does not support UTF-8, you would actually see those bytes as the characters "ï»¿".)
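The same effect can be reproduced by decoding the bytes by hand. This sketch assumes the non-UTF-8 editor uses Windows-1252 (`cp1252`), in which the three BOM bytes EF BB BF happen to be the printable characters ï, » and ¿.

```python
data = b"\xef\xbb\xbftest"  # the UTF-8 file from the hexdump above

# An editor that assumes Windows-1252 shows the BOM bytes as three characters.
as_cp1252 = data.decode("cp1252")

# Plain UTF-8 decoding keeps the BOM as the invisible character U+FEFF ...
as_utf8 = data.decode("utf-8")

# ... while Python's "utf-8-sig" codec strips it.
as_utf8_sig = data.decode("utf-8-sig")

print(repr(as_cp1252))    # 'ï»¿test'
print(repr(as_utf8))      # '\ufefftest'
print(repr(as_utf8_sig))  # 'test'
```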
Notepad can also save the file as "Unicode", which means UTF-16 little-endian (UTF-16LE):
C:\>C:\gnuwin32\bin\hexdump -C test-unicode.txt
00000000 ff fe 74 00 65 00 73 00 74 00 |ÿþt.e.s.t.|
0000000a
And here is the version saved as "Unicode (big endian)" (UTF-16BE):
C:\>C:\gnuwin32\bin\hexdump -C test-unicode-big-endian.txt
00000000 fe ff 00 74 00 65 00 73 00 74 |þÿ.t.e.s.t|
0000000a
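If you do not have Notepad and a hex dump tool at hand, the byte sequences above can be reproduced with a short Python sketch. The labels in `variants` are my own, and "ANSI" is again assumed to mean Windows-1252.

```python
import codecs

text = "test"

# Each entry mimics one of Notepad's "Save as" encodings (labels are illustrative).
variants = {
    "ANSI (cp1252)": text.encode("cp1252"),
    "UTF-8 with BOM": text.encode("utf-8-sig"),
    "UTF-16LE with BOM": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "UTF-16BE with BOM": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
}

for name, raw in variants.items():
    print(f"{name:18} {raw.hex(' ')}")
```

The hex strings printed should match the byte columns of the four hexdump listings.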
Now consider a text file with the 4 Chinese characters "琀攀猀琀". When I save that as Unicode (big endian), the result looks like this:
C:\>C:\gnuwin32\bin\hexdump -C test2-unicode-big-endian.txt
00000000 fe ff 74 00 65 00 73 00 74 00 |þÿt.e.s.t.|
0000000a
As you can see, the word "test" in UTF-16LE is stored the same way as the word "琀攀猀琀" in UTF-16BE. But because the BOM is stored differently, you can tell whether the file contains "test" or "琀攀猀琀". Without a BOM you would have to guess.
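The ambiguity is easy to demonstrate in Python. The sketch below decodes the same eight payload bytes both ways, then shows that Python's generic `utf-16` codec uses the BOM to pick the byte order; the variable names are illustrative only.

```python
import codecs

# The 8 payload bytes shared by both files, without any BOM.
payload = bytes.fromhex("74 00 65 00 73 00 74 00")

as_le = payload.decode("utf-16-le")  # read as little-endian
as_be = payload.decode("utf-16-be")  # read as big-endian
print(as_le, as_be)  # test 琀攀猀琀

# With a BOM in front, the generic "utf-16" codec chooses the byte order itself.
from_le_file = (codecs.BOM_UTF16_LE + payload).decode("utf-16")
from_be_file = (codecs.BOM_UTF16_BE + payload).decode("utf-16")
print(from_le_file, from_be_file)  # test 琀攀猀琀
```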