Problem description
The file in question is not under my control. Most of its byte sequences are valid UTF-8; it is not ISO-8859-1 (or any other encoding). I want to do my best to extract as much information as possible.
The file contains a few illegal byte sequences; those should be replaced with the replacement character.
It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.
Oracle has a wrapper which does what I need:
UTF8ValidationFilter javadoc
Is there something like that available (commercially or as free software)?
Thanks
-stephan
Solution:
import java.io.*;
import java.nio.charset.*;
// istream is the raw InputStream for the file
final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
// Substitute U+FFFD for invalid byte sequences instead of throwing
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);
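To see what this reader does with bad input, here is a minimal self-contained sketch. The class name LenientUtf8Reader and its helper methods are made up for illustration; only the java.io and java.nio.charset calls are standard. It feeds a byte array containing a stray 0xFF byte (which can never occur in valid UTF-8) through the same decoder setup and checks that U+FFFD comes out in its place.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class LenientUtf8Reader {

    // Wraps a byte stream in a Reader that substitutes U+FFFD for invalid UTF-8.
    static Reader open(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new InputStreamReader(in, decoder);
    }

    // Reads the whole stream into a String (fine for small inputs).
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = open(in)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "ab", then a stray 0xFF byte, then "cd".
        byte[] data = {'a', 'b', (byte) 0xFF, 'c', 'd'};
        String text = readAll(new ByteArrayInputStream(data));
        System.out.println(text.equals("ab\uFFFDcd"));
    }
}
```

The valid bytes on either side of the bad one survive untouched, which is the "extract as much as possible" behaviour asked for in the question.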
Recommended answer
java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).
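As a rough illustration of those user-definable actions, the sketch below decodes the same bytes under each of the three CodingErrorAction values. The class DecoderActionsDemo and its decode helper are invented for this example; the decoder methods themselves are the standard ones named above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class DecoderActionsDemo {

    // Decodes bytes as UTF-8, applying the given action to invalid input.
    static String decode(byte[] bytes, CodingErrorAction action) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] data = {'o', 'k', (byte) 0xFF, '!'};  // 0xFF is never valid in UTF-8
        CodingErrorAction[] actions = {
            CodingErrorAction.REPLACE,  // substitute U+FFFD
            CodingErrorAction.IGNORE,   // silently drop the bad bytes
            CodingErrorAction.REPORT    // throw CharacterCodingException
        };
        for (CodingErrorAction action : actions) {
            try {
                System.out.println(action + ": " + decode(data, action));
            } catch (CharacterCodingException e) {
                System.out.println(action + ": rejected (" + e + ")");
            }
        }
    }
}
```

REPLACE is the action that matches the question; REPORT is the default for a freshly created decoder, which is why the actions have to be set explicitly.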
CharsetDecoder itself works on a ByteBuffer and a CharBuffer rather than on streams. If you write the decoded (and re-encoded) output to an OutputStream, you can pipe it into an InputStream using java.io.PipedOutputStream and java.io.PipedInputStream, effectively creating a filtered InputStream.
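One possible way to wire this up is sketched below; it is an assumption about what the answer has in mind, not a tested recipe. The class Utf8SanitizingPipe and its sanitize method are hypothetical names. A background thread decodes the source leniently, re-encodes the characters as UTF-8, and writes them into a PipedOutputStream; the caller reads clean bytes from the connected PipedInputStream. The separate thread matters, because reading and writing a pipe from the same thread can deadlock once the pipe's buffer fills up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class Utf8SanitizingPipe {

    // Returns an InputStream that yields the source re-encoded as valid UTF-8,
    // with U+FFFD in place of every invalid byte sequence.
    static InputStream sanitize(InputStream source) throws IOException {
        PipedInputStream result = new PipedInputStream();
        PipedOutputStream sink = new PipedOutputStream(result);

        Thread pump = new Thread(() -> {
            try (Reader reader = new InputStreamReader(source,
                         StandardCharsets.UTF_8.newDecoder()
                                 .onMalformedInput(CodingErrorAction.REPLACE)
                                 .onUnmappableCharacter(CodingErrorAction.REPLACE));
                 Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
                char[] buf = new char[4096];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    writer.write(buf, 0, n);
                }
            } catch (IOException e) {
                // Sketch only: real code should report this to the reading side.
                e.printStackTrace();
            }
            // Closing the writer closes the sink, so the reader sees end of stream.
        }, "utf8-sanitizer");
        pump.setDaemon(true);
        pump.start();
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {'h', 'i', (byte) 0xFF};
        try (InputStream clean = sanitize(new ByteArrayInputStream(data))) {
            byte[] out = clean.readAllBytes();  // requires Java 9 or later
            System.out.println(new String(out, StandardCharsets.UTF_8).equals("hi\uFFFD"));
        }
    }
}
```

If the consumer only needs characters, the plain InputStreamReader with a configured decoder is simpler; the pipe is only worth the extra thread when downstream code insists on an InputStream.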