Problem description
Given a stream of bytes (that represent characters) and the encoding of the stream, how would I obtain the code points of the characters?
InputStreamReader r = new InputStreamReader(bla, Charset.forName("UTF-8"));
int whatIsThis = r.read();
What is returned by read() in the above snippet? Is it the Unicode code point?
Recommended answer
Reader.read() returns a value that can be cast to char, or -1 if no more data is available.
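As a rough illustration of that return value, the sketch below decodes a small UTF-8 byte array and collects whatever read() hands back. The class name ReadDemo and the helper readAll are made up for this example, and the IOException is wrapped only to keep the helper easy to call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original answer.
public class ReadDemo {
    // Collects every value read() returns before the end-of-stream marker (-1).
    public static List<Integer> readAll(byte[] utf8Bytes) {
        List<Integer> units = new ArrayList<>();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            int value;
            while ((value = r.read()) != -1) {
                units.add(value); // each value fits in a char (0..0xFFFF)
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return units;
    }

    public static void main(String[] args) {
        // 'a' is one byte in UTF-8; U+00E9 is two bytes but still a single char.
        byte[] bytes = {0x61, (byte) 0xC3, (byte) 0xA9};
        for (int unit : readAll(bytes)) {
            System.out.printf("U+%04X%n", unit); // prints U+0061, then U+00E9
        }
    }
}
```

Three input bytes yield two read() results here, so the values are decoded chars rather than raw bytes.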
A char is (implicitly) a 16-bit UTF-16 code unit; byte order such as UTF-16BE only becomes relevant once chars are serialized to bytes. UTF-16 can represent basic multilingual plane characters with a single char. The supplementary range is represented using two-char sequences.
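To make the one-char versus two-char distinction concrete, here is a minimal sketch that counts the UTF-16 code units needed for a given code point. The class CodeUnitCount and the helper unitsFor are names invented for this example; Character.toChars and String.codePointCount are standard library methods.

```java
// Hypothetical demo class, not part of the original answer.
public class CodeUnitCount {
    // Number of UTF-16 code units (chars) needed to hold one code point.
    public static int unitsFor(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int bmp = 0x00E9;            // inside the basic multilingual plane
        int supplementary = 0x1F600; // outside the basic multilingual plane

        System.out.println(unitsFor(bmp));           // 1
        System.out.println(unitsFor(supplementary)); // 2

        String s = new String(Character.toChars(supplementary));
        System.out.println(s.length());                      // 2 chars
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
    }
}
```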
The Character type contains methods for translating UTF-16 code units to Unicode code points.
A code point that requires two chars is stored as a surrogate pair: of two sequential values from a sequence, the first satisfies isHighSurrogate and the second satisfies isLowSurrogate. The codePointAt methods can be used to extract code points from code unit sequences. There are similar methods, such as toChars, for working from code points to UTF-16 code units.
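The methods mentioned above can be tried out with a short sketch like the following. SurrogateDemo and decodeAt are hypothetical names used only for illustration; decodeAt pairs surrogates by hand and is compared against the library's own Character.codePointAt.

```java
// Hypothetical demo class, not part of the original answer.
public class SurrogateDemo {
    // Decodes the code point that starts at index i, pairing surrogates by hand.
    public static int decodeAt(char[] units, int i) {
        char first = units[i];
        if (Character.isHighSurrogate(first)
                && i + 1 < units.length
                && Character.isLowSurrogate(units[i + 1])) {
            return Character.toCodePoint(first, units[i + 1]);
        }
        return first; // BMP character, or an unpaired surrogate passed through
    }

    public static void main(String[] args) {
        // Code point -> UTF-16 code units.
        char[] units = Character.toChars(0x1D11E);
        System.out.println(units.length); // 2

        // UTF-16 code units -> code point, by hand and via the library.
        System.out.printf("U+%X%n", decodeAt(units, 0));              // U+1D11E
        System.out.printf("U+%X%n", Character.codePointAt(units, 0)); // U+1D11E
    }
}
```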
A sample implementation of a code point stream reader:
import java.io.*;

public class CodePointReader implements Closeable {
    private final Reader charSource;
    // The next UTF-16 code unit to decode, or -1 at end of stream.
    private int codeUnit;

    public CodePointReader(Reader charSource) throws IOException {
        this.charSource = charSource;
        codeUnit = charSource.read(); // read ahead so hasNext() can answer
    }

    public boolean hasNext() { return codeUnit != -1; }

    // Call only while hasNext() is true; at end of stream this returns -1.
    public int nextCodePoint() throws IOException {
        try {
            char high = (char) codeUnit;
            if (Character.isHighSurrogate(high)) {
                // A high surrogate must be followed by a low surrogate.
                int next = charSource.read();
                if (next == -1) { throw new IOException("malformed character"); }
                char low = (char) next;
                if (!Character.isLowSurrogate(low)) {
                    throw new IOException("malformed sequence");
                }
                return Character.toCodePoint(high, low);
            } else {
                // BMP character: the code unit is the code point.
                return codeUnit;
            }
        } finally {
            // Read ahead for the next call, even when an exception was thrown.
            codeUnit = charSource.read();
        }
    }

    public void close() throws IOException { charSource.close(); }
}
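When the input is small enough to buffer in memory, a different approach is to collect the chars first and let the standard library pair up the surrogates. The sketch below is an alternative to the streaming reader above, not a replacement for it: BufferedCodePoints and readCodePoints are hypothetical names, and it assumes Java 8 or later for CharSequence.codePoints().

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of the original answer.
public class BufferedCodePoints {
    // Reads the whole Reader into memory, then converts chars to code points.
    public static int[] readCodePoints(Reader source) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int n;
            while ((n = source.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // codePoints() combines well-formed surrogate pairs into one value.
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        String input = "a" + new String(Character.toChars(0x1F600));
        for (int cp : readCodePoints(new StringReader(input))) {
            System.out.printf("U+%X%n", cp); // prints U+61, then U+1F600
        }
    }
}
```

Unlike CodePointReader, this version passes unpaired surrogates through as individual values instead of raising an IOException, so it suits input that is trusted to be well formed.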