Problem description
After some research, I found that there are several encoding detection projects in the Java world to turn to when getEncoding in InputStreamReader does not work:
- juniversalchardet
- jchardet
- cpdetector
- ICU4J
However, I really do not know which of them is the best. Can anyone with hands-on experience tell me which one is the best in Java?
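For context on why getEncoding does not help with detection: it reports the charset the reader was constructed with and does not inspect the bytes. Below is a minimal sketch using only the JDK. The class name GetEncodingDemo and the helper reportedCharset are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class GetEncodingDemo {
    // Opens the bytes with the declared charset and returns what the reader reports.
    static Charset reportedCharset(byte[] data, Charset declared) {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), declared)) {
            // getEncoding() may return a historical name such as "UTF8";
            // Charset.forName resolves that alias to the canonical charset.
            return Charset.forName(reader.getEncoding());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hebrew text encoded as UTF-8.
        byte[] utf8Bytes = "\u05e9\u05dc\u05d5\u05dd".getBytes(StandardCharsets.UTF_8);
        // The reader echoes the declared charset, whatever the bytes really are.
        System.out.println(reportedCharset(utf8Bytes, StandardCharsets.ISO_8859_1));
        System.out.println(reportedCharset(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```

The same UTF-8 bytes are reported as ISO-8859-1 in the first call, because that is what the reader was told to use. This is why a separate detection library is needed.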
Recommended answer
I've checked juniversalchardet and ICU4J on some CSV files, and the results are inconsistent. juniversalchardet had better results in most of the cases below:
- UTF-8: both detected it.
- Windows-1255: juniversalchardet detected it once the file had enough Hebrew letters, while ICU4J still thought it was ISO-8859-1. With even more Hebrew letters, ICU4J detected it as ISO-8859-8, which is the other Hebrew encoding (and so the text was OK).
- SHIFT_JIS (Japanese): juniversalchardet detected it, and ICU4J thought it was ISO-8859-2.
- ISO-8859-1: detected by ICU4J, not supported by juniversalchardet.
So one should consider which encodings one will most likely have to deal with. In the end I chose ICU4J.
Note that ICU4J is still maintained.
Also note that you may want to use ICU4J first and, if it returns null because detection did not succeed, fall back to juniversalchardet. Or the other way around.
AutoDetectReader of Apache Tika does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
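The "try one detector, then fall back to the next" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the Detector interface and the detectWithFallback helper are hypothetical names, and the real library calls (for example through ICU4J's CharsetDetector or juniversalchardet's UniversalDetector) are deliberately left out so that the sketch runs with the JDK alone.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper class, not part of ICU4J, juniversalchardet or Tika.
class FallbackDetection {
    // Hypothetical adapter: returns a charset name, or null when detection fails.
    interface Detector {
        String detect(byte[] data);
    }

    // Tries each detector in order and returns the first non-null answer.
    static Optional<String> detectWithFallback(byte[] data, List<Detector> detectors) {
        for (Detector detector : detectors) {
            String name = detector.detect(data);
            if (name != null) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-ins for the real libraries; in practice these lambdas would
        // wrap ICU4J and juniversalchardet (in whichever order you prefer).
        Detector primary = data -> null;              // pretends detection failed
        Detector secondary = data -> "WINDOWS-1255";  // pretends detection succeeded
        System.out.println(
            detectWithFallback(new byte[0], Arrays.asList(primary, secondary)));
    }
}
```

Keeping the detectors behind one small interface makes the order easy to swap, which matters here because neither library won on every encoding in the tests above.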