How to detect illegal UTF-8 byte sequences and replace them in a Java input stream?

Problem description

The file in question is not under my control. Most of its byte sequences are valid UTF-8; it is not ISO-8859-1 (or any other encoding). I want to do my best to extract as much information as possible.

The file contains a few illegal byte sequences; those should be replaced with the replacement character (U+FFFD).

It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.

Oracle has a wrapper which does what I need:
UTF8ValidationFilter javadoc

Is there something like that available (commercially or as free software)?

Thanks
-stephan

Solution:

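// istream is the raw java.io.InputStream for the file.
// The decoder classes come from java.nio.charset; the stream and reader classes from java.io.
final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
// Substitute the replacement character instead of throwing on illegal input.
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);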
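
To see the effect of the snippet above on a concrete input, here is a minimal self-contained sketch. The class name LenientUtf8Reader and the method readLenient are hypothetical names chosen for illustration; the test bytes are made up, using a lone 0xFF byte, which can never occur in valid UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class LenientUtf8Reader {

    /** Reads the whole stream as UTF-8, substituting U+FFFD for illegal byte sequences. */
    static String readLenient(InputStream istream) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(istream, decoder)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'a', then a lone 0xFF byte (illegal in UTF-8), then 'b'.
        byte[] data = {'a', (byte) 0xFF, 'b'};
        String text = readLenient(new ByteArrayInputStream(data));
        // The illegal byte should come back as the replacement character.
        System.out.println(text.equals("a\uFFFDb"));
    }
}
```

The valid characters around the bad byte survive, which matches the goal of extracting as much information as possible.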


Recommended answer

java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions for different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).
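
For the detection half of the question, the same decoder can be set to CodingErrorAction.REPORT, in which case illegal input raises a CharacterCodingException instead of being replaced. The following is only a sketch; Utf8Check and isValidUtf8 are hypothetical names, and it assumes the data fits in memory as a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class Utf8Check {

    /** Returns true if the whole array decodes as UTF-8 without errors. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // With REPORT, decode() throws on the first illegal sequence.
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] truncated = {'a', (byte) 0xC3}; // lead byte with no continuation byte
        System.out.println(isValidUtf8(valid));
        System.out.println(isValidUtf8(truncated));
    }
}
```

Switching the two actions between REPORT and REPLACE is the only difference between detecting illegal sequences and silently repairing them.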

CharsetDecoder itself turns bytes into characters (a ByteBuffer into a CharBuffer) rather than writing to a stream. To get a filtered InputStream, you can re-encode the decoded characters as UTF-8 into a java.io.PipedOutputStream and read the result from the connected java.io.PipedInputStream.
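
One way the filtered InputStream idea might look in practice is sketched below. It is an assumption-laden illustration, not a definitive implementation: Utf8SanitizingStream and sanitize are hypothetical names, a background thread is used because piped streams need a separate producer, and I/O errors in that thread are simply dropped. It uses InputStream.readAllBytes(), which requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class Utf8SanitizingStream {

    /** Returns a stream of valid UTF-8 in which illegal input sequences became U+FFFD. */
    static InputStream sanitize(InputStream raw) throws IOException {
        PipedInputStream filtered = new PipedInputStream();
        PipedOutputStream sink = new PipedOutputStream(filtered);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        Thread pump = new Thread(() -> {
            // Decode leniently, then re-encode as UTF-8 into the pipe.
            try (Reader reader = new InputStreamReader(raw, decoder);
                 Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
                char[] buf = new char[4096];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    writer.write(buf, 0, n);
                }
            } catch (IOException e) {
                // Sketch only: the error is dropped and the consumer just sees end of stream.
            }
        }, "utf8-sanitizer");
        pump.setDaemon(true);
        pump.start();
        return filtered;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {'a', (byte) 0xFF, 'b'};
        try (InputStream in = sanitize(new ByteArrayInputStream(data))) {
            String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            System.out.println(text.equals("a\uFFFDb"));
        }
    }
}
```

If the consumer only needs characters, wrapping the raw stream in an InputStreamReader with the configured decoder avoids the extra thread and the re-encoding step.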
