Question
Having ignored it all this time, I am now forcing myself to learn more about Unicode in Java. I have an exercise that involves converting a UTF-16 string to 8-bit ASCII. Can someone explain how to do this in Java? I understand that ASCII can't represent all possible Unicode values, so in this case I want a code that exceeds 0xFF to simply be added anyway (bad data should also just be added silently).
Thanks!
Recommended answer
How about this:
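import java.nio.charset.StandardCharsets;
String input = ... // my UTF-16 string
StringBuilder sb = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
    char ch = input.charAt(i);
    // keep only characters that fit in one byte; drop everything else
    if (ch <= 0xFF) {
        sb.append(ch);
    }
}
// the Charset overload avoids the checked UnsupportedEncodingException
byte[] ascii = sb.toString().getBytes(StandardCharsets.ISO_8859_1); // aka LATIN-1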
This is probably not the most efficient way to do the conversion for large strings, since we copy the characters twice. However, it has the advantage of being straightforward.
BTW, strictly speaking there is no such character set as 8-bit ASCII. ASCII is a 7-bit character set. LATIN-1 (ISO-8859-1) is the nearest thing there is to an "8-bit ASCII" character set (the first 256 Unicode code points, U+0000 to U+00FF, are equivalent to LATIN-1), so I'll assume that's what you mean.
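To make the 7-bit versus 8-bit distinction concrete, here is a minimal sketch that encodes the same string with both charsets. The class and method names are made up for illustration; the only library calls are `String.getBytes(Charset)` and the `StandardCharsets` constants.

```java
import java.nio.charset.StandardCharsets;

class Latin1VersusAscii {
    // Encodes text as ISO-8859-1 (LATIN-1): one byte per char for U+0000..U+00FF.
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Encodes text as 7-bit US-ASCII; chars above U+007F come out as '?'.
    static byte[] toAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // the last char (e with acute) is in LATIN-1 but not in ASCII
        System.out.println(toLatin1(text)[3] & 0xFF); // 233 (0xE9)
        System.out.println(toAscii(text)[3] & 0xFF);  // 63 ('?')
    }
}
```

The LATIN-1 byte has the same numeric value as the Unicode code point, which is why a plain per-character copy works for anything up to 0xFF.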
EDIT: in the light of the update to the question, the solution is even simpler:
String input = ... // my UTF-16 string
byte[] ascii = new byte[input.length()];
for (int i = 0; i < input.length(); i++) {
    ascii[i] = (byte) input.charAt(i);
}
This solution is more efficient. Since we now know how many bytes to expect, we can preallocate the byte array and copy the (truncated) characters into it without using a StringBuilder as an intermediate buffer.
However, I'm not convinced that dealing with bad data in this way is sensible.
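One reason for the doubt can be shown with a small sketch (the class and method names are hypothetical). The `(byte)` cast keeps only the low 8 bits of each char, so a character outside LATIN-1 silently turns into an unrelated one.

```java
class TruncationDemo {
    // Same narrowing cast as in the loop above: keeps only the low 8 bits.
    static byte[] truncate(String input) {
        byte[] out = new byte[input.length()];
        for (int i = 0; i < input.length(); i++) {
            out[i] = (byte) input.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // U+0141 (capital L with stroke) has low byte 0x41, which is 'A'.
        byte[] bytes = truncate("\u0141");
        System.out.println((char) bytes[0]); // A
    }
}
```

The output contains no sign that the original character was ever anything other than 'A'.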
EDIT 2: there is one more obscure "gotcha" with this. Unicode actually defines code points (characters) to be "roughly 21-bit" values, 0x000000 to 0x10FFFF, and UTF-16 uses surrogates to represent code points > 0x00FFFF. In other words, a Unicode code point > 0x00FFFF is actually represented in UTF-16 as two "characters". Neither my answer nor any of the others takes account of this (admittedly esoteric) point. In fact, dealing with code points > 0x00FFFF in Java is rather tricky in general. This stems from the fact that 'char' is a 16-bit type and String is defined in terms of 'char'.
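A rough sketch of what taking surrogates into account could look like is below. It walks the string by code point using `String.codePointAt` and `Character.charCount`. The class and method names are hypothetical, and substituting '?' for anything above 0xFF is an assumption chosen for this sketch, not something the question asked for.

```java
class CodePointDemo {
    // Walks the string by code point rather than by char, so a surrogate
    // pair counts as one character and yields one output byte.
    static byte[] toLatin1ByCodePoint(String input) {
        byte[] out = new byte[input.codePointCount(0, input.length())];
        int n = 0;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            // assumption: anything outside LATIN-1 becomes '?'
            out[n++] = (cp <= 0xFF) ? (byte) cp : (byte) '?';
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (a surrogate pair), 'b'
        System.out.println(s.length());                    // 4
        System.out.println(toLatin1ByCodePoint(s).length); // 3
    }
}
```

A loop over `charAt(i)` would see four chars here and emit four bytes, two of them for the single code point U+1F600.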
EDIT 3: maybe a more sensible way of dealing with unexpected characters that don't convert to ASCII is to replace them with the standard replacement character:
String input = ... // my UTF-16 string
byte[] ascii = new byte[input.length()];
for (int i = 0; i < input.length(); i++) {
    char ch = input.charAt(i);
    ascii[i] = (ch <= 0xFF) ? (byte) ch : (byte) '?';
}
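For comparison, `String.getBytes(Charset)` is documented to replace unmappable characters with the charset's default replacement bytes, which for ISO-8859-1 is '?'. The sketch below (class and method names are hypothetical) checks that the manual loop and the built-in call agree on a string without surrogate pairs. It is meant as an illustration, not as a claim that the two behave identically on every input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BuiltInReplacementDemo {
    // The manual loop from above, wrapped in a method.
    static byte[] manual(String input) {
        byte[] ascii = new byte[input.length()];
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            ascii[i] = (ch <= 0xFF) ? (byte) ch : (byte) '?';
        }
        return ascii;
    }

    // Lets the LATIN-1 encoder do the substitution.
    static byte[] builtIn(String input) {
        return input.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "a\u0141b"; // U+0141 is outside LATIN-1
        System.out.println(Arrays.equals(manual(s), builtIn(s))); // true
    }
}
```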