Problem description
A Java String containing special characters such as ç takes two bytes for each special character, but neither the String length() method nor the length of the byte array returned by getBytes() counts those characters as two bytes.
How can I correctly count the number of bytes in a String?
Example:
The word endereço should give me a length of 9 instead of 8.
Recommended answer
If you expect a size of 9 bytes for the String "endereço", which has a length of 8 characters (7 ASCII characters and 1 non-ASCII character), I suppose you want to use the UTF-8 charset, which uses 1 byte for each character in the ASCII table and 2 to 4 bytes for the others.
The String length() method does not answer the question "how many bytes are used?" but rather "how many UTF-16 code units, or more simply chars, does the String contain?"
String length() Javadoc:
Returns the length of this string. The length is equal to the number of Unicode code units in the string.
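To make the difference between code units and bytes concrete, here is a minimal sketch. The class name LengthVersusBytes is only illustrative, and the ç is written as the escape \u00e7 so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

public class LengthVersusBytes {
    public static void main(String[] args) {
        // \u00e7 is the character ç
        String word = "endere\u00e7o";

        // length() counts UTF-16 code units (char values), not bytes
        System.out.println("length() = " + word.length());

        // the byte count depends on the charset chosen for the encoding
        System.out.println("UTF-8 bytes = " + word.getBytes(StandardCharsets.UTF_8).length);

        // a character outside the Basic Multilingual Plane takes two code units
        String emoji = "\uD83D\uDE00";
        System.out.println("emoji length() = " + emoji.length());
        System.out.println("emoji code points = " + emoji.codePointCount(0, emoji.length()));
    }
}
```

This prints 8 for length(), 9 for the UTF-8 byte count, and 2 code units versus 1 code point for the emoji, so length() is neither a byte count nor, strictly speaking, a character count.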
The byte[] getBytes() method with no argument encodes the String into a byte array. You can use the length property of the returned array to know how many bytes the encoded String uses, but the result depends on the charset used for the encoding. The no-argument getBytes() method does not let you specify the charset: it uses the platform's default charset.
So it may not give the expected result if the default charset of the underlying OS is not the one you want to use to encode your Strings into bytes.
Besides, the way the String is encoded into bytes may change depending on the platform where the application is deployed, which may be undesirable.
Finally, if the String cannot be encoded in the default charset, the behavior is unspecified.
So this method should be used with great caution, or not used at all.
byte[] getBytes() Javadoc:
The behavior of this method when this string cannot be encoded in the default charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
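As a rough illustration of what "more control" can look like, the sketch below uses CharsetEncoder with CodingErrorAction.REPORT so that an unencodable character raises an exception instead of being handled silently. The class StrictByteCount and the helper strictByteLength are hypothetical names introduced for this example, not part of the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictByteCount {

    // Hypothetical helper: returns the encoded size in bytes,
    // or throws if the string cannot be encoded in the given charset
    static int strictByteLength(String s, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // encode() returns a buffer positioned at the start of the encoded bytes
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(s));
        return encoded.remaining();
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // \u00e7 is the character ç
        try {
            System.out.println("UTF-8 size = " + strictByteLength(word, StandardCharsets.UTF_8));
            // ç has no US-ASCII mapping, so this call throws
            System.out.println("US-ASCII size = " + strictByteLength(word, StandardCharsets.US_ASCII));
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode: " + e);
        }
    }
}
```

With this approach an encoding problem becomes an explicit error rather than a byte count that silently includes replacement bytes.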
In your example "endereço", if getBytes() returns an array of size 8 and not 9, it means that the default charset of your OS is not UTF-8 but a charset with a fixed width of 1 byte per character, such as ISO 8859-1 or one of its derived charsets such as windows-1252 on Windows-based systems.
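The effect of a single-byte charset can be reproduced on any platform by naming the charset explicitly. The following sketch (the class name and the hex helper are illustrative only) encodes the same word with ISO 8859-1 and with UTF-8 and prints the bytes of the ç.

```java
import java.nio.charset.StandardCharsets;

public class FixedWidthVersusUtf8 {

    // Illustrative helper: formats bytes as space-separated hexadecimal
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // \u00e7 is the character ç
        String cedilla = "\u00e7";

        System.out.println("ISO-8859-1 size = " + word.getBytes(StandardCharsets.ISO_8859_1).length);
        System.out.println("UTF-8 size      = " + word.getBytes(StandardCharsets.UTF_8).length);

        // ISO 8859-1 stores ç in one byte, UTF-8 needs two
        System.out.println("ISO-8859-1 bytes of the cedilla = " + hex(cedilla.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8 bytes of the cedilla      = " + hex(cedilla.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The sizes come out as 8 and 9, and the ç is the single byte E7 in ISO 8859-1 but the two bytes C3 A7 in UTF-8.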
To find out the default charset of the Java virtual machine running the application, you can use this utility method: Charset defaultCharset = Charset.defaultCharset().
Solution
The byte[] getBytes() method comes with two other very useful overloads:
- byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException
- byte[] java.lang.String.getBytes(Charset charset)
Contrary to the getBytes() method with no argument, these methods let you specify the charset to use for the byte encoding.
byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException Javadoc:
The behavior of this method when this string cannot be encoded in the given charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
byte[] java.lang.String.getBytes(Charset charset) Javadoc:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
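The replacement behavior described for getBytes(Charset charset) can be observed with a charset that cannot represent ç. In this sketch (the class name is illustrative), US-ASCII is used on purpose: the ç is replaced by the charset's replacement byte, which for US-ASCII is the question mark.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementBytes {
    public static void main(String[] args) {
        String word = "endere\u00e7o"; // \u00e7 is the character ç

        // ç cannot be mapped to US-ASCII, so getBytes(Charset) substitutes a replacement byte
        byte[] ascii = word.getBytes(StandardCharsets.US_ASCII);

        System.out.println("size = " + ascii.length);
        System.out.println("decoded back = " + new String(ascii, StandardCharsets.US_ASCII));
    }
}
```

This prints a size of 8 and the text endere?o: the byte count looks plausible, but a character was lost, which is why the count is only meaningful for a charset that can actually represent the String.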
You may use either one (although there are some subtle differences between them) to encode your String into a byte array with UTF-8 or any other charset, and so get its size for that specific charset.
For example, to get a UTF-8 encoded byte array by using getBytes(String charsetName), you can do this:
String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;
And you will get a length of 9 bytes, as you wish.
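If only the size is needed and allocating the byte array is a concern, the UTF-8 length can also be computed from the code points. This is a sketch under stated assumptions, not something from the answer above: utf8Length is a hypothetical helper, and it assumes the String contains no unpaired surrogates (for those, getBytes substitutes a replacement byte and the two results would differ).

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {

    // Hypothetical helper: counts the bytes UTF-8 needs for each code point.
    // Assumes well-formed input (no unpaired surrogates).
    static int utf8Length(CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int codePoint = Character.codePointAt(s, i);
            if (codePoint < 0x80) {
                bytes += 1;          // ASCII
            } else if (codePoint < 0x800) {
                bytes += 2;          // e.g. ç (U+00E7)
            } else if (codePoint < 0x10000) {
                bytes += 3;          // rest of the Basic Multilingual Plane
            } else {
                bytes += 4;          // supplementary characters
            }
            i += Character.charCount(codePoint);
        }
        return bytes;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // \u00e7 is the character ç
        System.out.println("computed = " + utf8Length(word));
        System.out.println("getBytes = " + word.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Both lines print 9 for this word. For everyday use, getBytes(StandardCharsets.UTF_8).length remains the simpler choice.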
Here is a more comprehensive example that displays the default charset and the byte sizes obtained with the platform's default charset, UTF-8 and UTF-16:
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// The class name is only illustrative
public class StringBytesExample {

    public static void main(String[] args) throws UnsupportedEncodingException {
        // default charset of the running JVM
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);

        // String sample
        String yourString = "endereço";

        // getBytes() with the platform's default charset
        System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());

        // getBytes() with the specific charset UTF-8
        System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
        System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());

        // getBytes() with the specific charset UTF-16
        System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
        System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
    }
}
Output on a Windows-based machine:
getBytes() with default charset, size = 8
getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9
getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18
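The UTF-16 size of 18 may look surprising for 8 characters at 2 bytes each. The extra 2 bytes are the byte order mark that Java's UTF-16 encoder writes at the start of the output. A small sketch (the class name is illustrative) compares UTF-16 with UTF-16BE, which writes no byte order mark.

```java
import java.nio.charset.StandardCharsets;

public class Utf16ByteOrderMark {
    public static void main(String[] args) {
        String word = "endere\u00e7o"; // \u00e7 is the character ç

        byte[] withBom = word.getBytes(StandardCharsets.UTF_16);
        byte[] bigEndian = word.getBytes(StandardCharsets.UTF_16BE);

        System.out.println("UTF-16 size   = " + withBom.length);
        System.out.println("UTF-16BE size = " + bigEndian.length);

        // the first two bytes of the UTF-16 output are the byte order mark
        System.out.printf("first two UTF-16 bytes = %02X %02X%n", withBom[0] & 0xFF, withBom[1] & 0xFF);
    }
}
```

This prints 18 for UTF-16, 16 for UTF-16BE, and FE FF as the first two bytes, so when counting UTF-16 payload bytes it can be preferable to name UTF-16BE or UTF-16LE explicitly.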