Problem Description
Here's my problem; I have an InputStream that I've converted to a byte array, but I don't know the character set of the InputStream at runtime. My original thought was to do everything in UTF-8, but I see strange issues with streams that are encoded as ISO-8859-1 and have foreign characters. (Those crazy Swedes)
Here's the code in question:
IOUtils.toString(inputstream, "utf-8")
// Fails on iso8859-1 foreign characters
To simulate this, I have:
new String("\u00F6")
// Returns ö as expected, since the default encoding is UTF-8
new String("\u00F6".getBytes("utf-8"), "utf-8")
// Also returns ö as expected.
new String("\u00F6".getBytes("iso-8859-1"), "utf-8")
// Returns \uFFFD, the Unicode replacement ("unknown") character
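Dumping the raw bytes makes the mismatch visible (a minimal sketch using only the JDK; StandardCharsets just avoids the checked UnsupportedEncodingException):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "\u00F6";                                      // "ö"
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);         // 0xC3 0xB6
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);  // 0xF6

        System.out.println(Arrays.toString(utf8));    // [-61, -74]
        System.out.println(Arrays.toString(latin1));  // [-10]

        // 0xF6 on its own is not a valid UTF-8 sequence, so decoding the
        // ISO-8859-1 bytes as UTF-8 substitutes U+FFFD for it.
        System.out.println(new String(latin1, StandardCharsets.UTF_8));
    }
}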
What am I missing?
You should have the source of the data telling you the encoding, but if that cannot happen you either need to reject it or guess the encoding if it's not UTF-8.
For western languages, guessing ISO-8859-1 if it's not UTF-8 is probably going to work most of the time:
ByteBuffer bytes = ByteBuffer.wrap(IOUtils.toByteArray(inputstream));
CharBuffer chars;
try {
    try {
        chars = Charset.forName("UTF-8").newDecoder().decode(bytes);
    } catch (MalformedInputException e) {
        throw new RuntimeException(e);
    } catch (UnmappableCharacterException e) {
        throw new RuntimeException(e);
    } catch (CharacterCodingException e) {
        throw new RuntimeException(e);
    }
} catch (RuntimeException e) {
    // The failed UTF-8 decode has advanced the buffer's position, so rewind
    // before falling back to ISO-8859-1 (which accepts any byte sequence).
    bytes.rewind();
    chars = Charset.forName("ISO-8859-1").newDecoder().decode(bytes);
}
System.out.println(chars.toString());
All this boilerplate is there so that the encoding exceptions actually get raised and so that the same data can be read more than once.
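If you need this in more than one place, the same logic collapses into a small helper (a sketch under the same assumptions; the method name decodeUtf8OrLatin1 is just illustrative):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public final class Decoding {
    /** Decodes as UTF-8 when the bytes are valid UTF-8, otherwise falls back to ISO-8859-1. */
    public static String decodeUtf8OrLatin1(byte[] data) {
        try {
            // Unlike new String(bytes, charset), the default CharsetDecoder
            // reports malformed/unmappable input instead of silently replacing it.
            return StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // ISO-8859-1 maps every byte to a character, so this cannot fail.
            return new String(data, StandardCharsets.ISO_8859_1);
        }
    }
}

With that, the original snippet becomes decodeUtf8OrLatin1(IOUtils.toByteArray(inputstream)).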
You can also use Mozilla Chardet, which uses more sophisticated heuristics to determine the encoding if it's not UTF-8. But it's not perfect; for instance I recall it detecting Finnish text in Windows-1252 as Hebrew Windows-1255.
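For example, with juniversalchardet, a Java port of Mozilla's detector (a sketch; it assumes the org.mozilla.universalchardet.UniversalDetector API and an illustrative method name, so check the version you actually pull in):

import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

static String decodeWithDetector(byte[] data) {
    UniversalDetector detector = new UniversalDetector(null);
    detector.handleData(data, 0, data.length);
    detector.dataEnd();

    String encoding = detector.getDetectedCharset();  // may be null if nothing was detected
    detector.reset();

    // Fall back to ISO-8859-1 when the detector gives up.
    return new String(data, Charset.forName(encoding != null ? encoding : "ISO-8859-1"));
}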
Also note that arbitrary binary data is valid ISO-8859-1, which is why you detect UTF-8 first (it is extremely likely that data which passes a UTF-8 decode without exceptions really is UTF-8), and why you cannot try to detect anything else after ISO-8859-1: since every byte sequence is valid ISO-8859-1, that check has to come last.
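A quick way to convince yourself of that last point (a minimal sketch):

import java.nio.charset.StandardCharsets;
import java.util.Random;

byte[] junk = new byte[1024];
new Random().nextBytes(junk);

// Never throws and never produces U+FFFD: every byte value 0x00-0xFF maps to
// exactly one character in ISO-8859-1, so "detecting" it always succeeds.
String decoded = new String(junk, StandardCharsets.ISO_8859_1);
System.out.println(decoded.length());  // always 1024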