Problem description
The difference between `InputStream` and `InputStreamReader` is that `InputStream` reads bytes, while `InputStreamReader` reads chars. For example, if the text in a file is `abc`, then both of them work fine. But if the text is `a你们`, which is composed of an `a` and two Chinese characters, then the `InputStream` does not work.
So we should use `InputStreamReader`, but my question is:
How does `InputStreamReader` recognize characters?
`a` is one byte, but a Chinese character is two bytes. Does it read `a` as one byte and recognize the other characters as two bytes each, or does the `InputStreamReader` read every character in this text as two bytes?
An `InputStream` reads raw octet (8-bit) data. In Java, the `byte` type plays the role of the `char` type in C: in C, that type can be used to represent either character data or binary data. The Java `char` type is a 16-bit value and shares greater similarities with the C `wchar_t` type.
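To make the size difference concrete, here is a minimal sketch. The class name `ByteVsCharDemo` and the helper `countBytes` are made up for illustration; the sketch only assumes the sample text is stored as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class ByteVsCharDemo {
    // Hypothetical helper: reads raw bytes one at a time, the way any
    // InputStream does, and counts how many were read.
    public static int countBytes(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int count = 0;
        while (in.read() != -1) { // read() yields one byte (0..255), or -1 at end of stream
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(Byte.SIZE);      // 8: a Java byte is 8 bits
        System.out.println(Character.SIZE); // 16: a Java char is one UTF-16 code unit

        // "a\u4F60\u4EEC" is the sample text from the question, written with escapes
        String text = "a\u4F60\u4EEC";
        System.out.println(text.length()); // 3 chars
        System.out.println(countBytes(text.getBytes(StandardCharsets.UTF_8))); // 7 bytes
    }
}
```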
An `InputStreamReader` then transforms the data from some encoding into UTF-16. If "a你们" is encoded as UTF-8 on disk, it will be the byte sequence `61 E4 BD A0 E4 BB AC`. When you pass the `InputStream` to an `InputStreamReader` with the UTF-8 encoding, it will be read as the char sequence `0061 4F60 4EEC`.
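The conversion described above can be reproduced with a short sketch. The class name `DecodeDemo` and the helper `decodeToHex` are hypothetical, and a `ByteArrayInputStream` stands in for the file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper: decodes UTF-8 bytes through an InputStreamReader
    // and returns each resulting char (UTF-16 code unit) as four hex digits.
    public static String decodeToHex(byte[] utf8Bytes) {
        StringBuilder hex = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) { // read() returns one char, or -1 at end
                if (hex.length() > 0) {
                    hex.append(' ');
                }
                hex.append(String.format("%04X", c));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The byte sequence from the answer: "a" followed by two Chinese characters
        byte[] onDisk = {0x61, (byte) 0xE4, (byte) 0xBD, (byte) 0xA0,
                         (byte) 0xE4, (byte) 0xBB, (byte) 0xAC};
        System.out.println(decodeToHex(onDisk)); // prints 0061 4F60 4EEC
    }
}
```

Seven bytes go in and three chars come out. In UTF-8, `a` occupies one byte and each of these two Chinese characters occupies three, so the reader does not apply a fixed width: the encoding itself determines how many bytes make up each character.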
The character encoding API in Java contains the algorithms to perform this transformation. You can find a list of encodings supported by the Oracle JRE here. The ICU project is a good place to start if you want to understand the internals of how this works in practice.
As Alexander Pogrebnyak points out, you should almost always provide the encoding explicitly. `byte`-to-`char` methods that do not specify an encoding rely on the JRE default, which is dependent on operating systems and user settings.
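As a rough illustration of why the explicit encoding matters, the sketch below decodes the same bytes with two different charsets. The class name `ExplicitCharsetDemo` and the helper `decode` are invented for this example; ISO-8859-1 is used only as a stand-in for "some default other than UTF-8".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Hypothetical helper: byte-to-char conversion with an explicit charset.
    public static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        byte[] onDisk = {0x61, (byte) 0xE4, (byte) 0xBD, (byte) 0xA0,
                         (byte) 0xE4, (byte) 0xBB, (byte) 0xAC};

        // What the JRE would use when no encoding is given; varies by environment
        System.out.println(Charset.defaultCharset());

        // Correct charset: 7 bytes become the 3 intended chars
        System.out.println(decode(onDisk, StandardCharsets.UTF_8).length());      // 3

        // Wrong charset: every byte becomes its own char, so the text is garbled
        System.out.println(decode(onDisk, StandardCharsets.ISO_8859_1).length()); // 7
    }
}
```

The same applies to `InputStreamReader`: the constructor that takes a `Charset` keeps the result independent of the machine the code runs on.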