Problem description
When I use iconv to convert from UTF-16 to UTF-8, all is fine, but the reverse conversion does not work. I have these files:
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The text looks OK in an editor. When I run this:
iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings
Then I get this result:
b-16.strings: data
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The file utility does not show the expected file format, and the text does not look right in the editor either. Could it be that iconv does not create a proper BOM? I run it on the Mac command line.
Why is b-16 not in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16?
More detail is below.
$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings
$ file *s
a-16.strings: Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings: UTF-8 Unicode c program text, with very long lines
b-16be.strings: Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings: data
$ od -c a-16.strings | head
0000000 377 376 / \0 * \0 \0 \f 001 E \0 S \0 K \0
$ od -c a-8.strings | head
0000000 / * * * Č ** E S K Y ( J V O
$ od -c b-16be.strings | head
0000000 376 377 \0 / \0 * \0 * \0 * \0 001 \f \0 E
$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
$ od -c b-16le-BAD-fromUTF8.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
It is clear that the BOM is missing whenever I convert to UTF-16LE. Any help on this?
Solution
UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.
UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.
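To see the difference between the two encoding names directly, a small sketch like the following dumps the bytes iconv emits for a single character. The helper name show_bytes is made up for this illustration, and the BOM behavior of plain UTF-16 is described as observed above; it may differ between iconv implementations.

```shell
# Hypothetical helper: print, as hex, the bytes iconv produces for the letter "A".
show_bytes() {
    # $1 is the target encoding name handed to iconv -t
    printf 'A' | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

show_bytes UTF-16LE   # prints 4100: just "A" in little-endian order, no BOM
show_bytes UTF-16     # typically a BOM followed by "A"; byte order depends on the iconv in use
```

If the second line starts with fffe or feff, that prefix is the BOM that the UTF-16LE output lacks.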
I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run
iconv -f UTF-16LE -t UTF-8 b-16.strings
you should get a valid UTF-8 version of the original file.
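One way to confirm that the BOM-less file is still valid UTF-16LE is a round trip compared byte for byte against the original. This is only a sketch: it uses scratch files in a temporary directory rather than the real a-8.strings and b-16.strings, and the sample text is invented.

```shell
# Scratch copies standing in for a-8.strings and b-16.strings (hypothetical contents).
tmp=$(mktemp -d)
printf '/* sample */\n' > "$tmp/a-8.strings"

# UTF-8 -> UTF-16LE; the result has no BOM.
iconv -f UTF-8 -t UTF-16LE "$tmp/a-8.strings" > "$tmp/b-16.strings"

# Convert back, naming the byte order explicitly since there is no BOM to detect.
iconv -f UTF-16LE -t UTF-8 "$tmp/b-16.strings" > "$tmp/roundtrip.strings"

# cmp -s is silent and only sets the exit status.
if cmp -s "$tmp/a-8.strings" "$tmp/roundtrip.strings"; then
    roundtrip=ok
else
    roundtrip=differs
fi
echo "round trip: $roundtrip"
rm -rf "$tmp"
```

If the round trip matches, the conversion itself is fine and only the missing BOM is confusing file and the editor.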
Try running od -c on the files to see their actual contents.
UPDATE:
It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:
( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE
The behavior of printf might depend on your locale settings; I have LANG=en_US.UTF-8.
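A slightly more portable variant of the same idea, as a sketch: \x escapes are not part of the POSIX printf specification, while octal escapes are, so writing the BOM as \377\376 should behave the same across shells. The function name utf8_to_utf16le_bom is invented for this example.

```shell
# Hypothetical helper: emit the little-endian BOM, then the BOM-less iconv output.
utf8_to_utf16le_bom() {
    printf '\377\376'                   # bytes FF FE, the UTF-16LE byte order mark
    iconv -f UTF-8 -t UTF-16LE "$1"     # $1 is the UTF-8 input file
}

# Small self-check on a one-character scratch file.
tmp=$(mktemp)
printf 'A' > "$tmp"
result=$(utf8_to_utf16le_bom "$tmp" | od -An -tx1 | tr -d ' \n')
echo "$result"                          # fffe4100: BOM followed by "A"
rm -f "$tmp"
```

With the files from the question, the call would look like utf8_to_utf16le_bom a-8.strings > b-16.strings.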
(Can anyone suggest a more elegant solution?)
Another workaround, if you know the endianness of the output produced by -t utf-16:
iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
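The byte swap is only wanted when -t utf-16 produces the opposite byte order from the one you need. A rough way to find out what your iconv does is to look at the BOM it writes. The function name utf16_default_order is made up for this sketch, and it assumes the BOM is the first two bytes of the output.

```shell
# Hypothetical helper: report the byte order of "iconv -t UTF-16", judged by its BOM.
utf16_default_order() {
    # -N2 limits od to the first two bytes, where the BOM would be.
    bom=$(printf 'A' | iconv -f UTF-8 -t UTF-16 | od -An -tx1 -N2 | tr -d ' \n')
    case "$bom" in
        fffe) echo little ;;
        feff) echo big ;;
        *)    echo unknown ;;    # no BOM at the start of the output
    esac
}

utf16_default_order
```

If this reports big and you want little-endian output, the dd conv=swab pipeline applies; if it already reports little, plain -t utf-16 gives the desired file.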