

A Java String containing special characters such as ç takes two bytes for each such character, but neither the String length() method nor the length of the byte array returned by getBytes() counts these characters as two bytes.


How can I count correctly the number of bytes in a String?



For example, the word endereço should give me a length of 9 instead of 8.



If you expect a size of 9 bytes for the String "endereço", which has a length of 8 characters (7 ASCII characters and 1 non-ASCII character), I suppose that you want to use the UTF-8 charset, which uses 1 byte for each character in the ASCII table and more for the others.

The String length() method doesn't answer the question "how many bytes are used?" but rather "how many UTF-16 code units, or more simply chars, does the String contain?"

String length() Javadoc:

Returns the length of this string. The length is equal to the number of Unicode code units in the string.
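
To make the difference between chars, code points and bytes concrete, here is a minimal sketch. The class and helper names (LengthVersusBytes, codeUnits, codePoints, utf8Bytes) are only illustrative, and the ç is written as a Unicode escape so that the result doesn't depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

public class LengthVersusBytes {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes once the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The word from the question, with the c-cedilla written as an escape.
        String word = "endere\u00e7o";
        System.out.println(codeUnits(word));   // 8
        System.out.println(codePoints(word));  // 8
        System.out.println(utf8Bytes(word));   // 9

        // U+1F600 lies outside the Basic Multilingual Plane: one code point
        // stored as a surrogate pair, that is two chars.
        String emoji = "\uD83D\uDE00";
        System.out.println(codeUnits(emoji));  // 2
        System.out.println(codePoints(emoji)); // 1
        System.out.println(utf8Bytes(emoji));  // 4
    }
}
```

So length() matches neither the byte count nor, in general, the number of code points.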

The byte[] getBytes() method with no argument encodes the String into a byte array. You can use the length property of the returned array to know how many bytes the encoded String uses, but the result depends on the charset used for the encoding, and getBytes() with no argument doesn't let you specify it: it uses the platform's default charset.
So it may not give the expected result if the default charset of the underlying OS is not the one you want to use to encode your Strings into bytes.
Besides, the way Strings are encoded into bytes may change with the platform where the application is deployed, which may be undesirable.
Finally, if the String cannot be encoded in the default charset, the behavior is unspecified.
So this method should be used with great caution, or not at all.

byte[] getBytes() Javadoc:


The behavior of this method when this string cannot be encoded in the default charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.

In your example "endereço", if getBytes() returns an array of size 8 and not 9, it means that your OS doesn't use UTF-8 by default but a charset that encodes each character in a fixed width of 1 byte, such as ISO 8859-1 or one of its derived charsets, such as windows-1252 on Windows.

To find out the default charset of the Java virtual machine where the application runs, you can use this utility method: Charset defaultCharset = Charset.defaultCharset().
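
As a quick check, the sketch below prints the default charset and compares the size produced by getBytes() with the sizes for two explicit charsets. The class name DefaultCharsetCheck and the helper encodedSize are hypothetical; ISO-8859-1 stands in for the single-byte charsets mentioned above because every Java platform is required to provide it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {

    // Size in bytes of s once encoded in the given charset.
    static int encodedSize(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // The word from the question, with the c-cedilla written as an escape.
        String word = "endere\u00e7o";

        // The value printed here depends on the JVM and the platform.
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);

        // getBytes() with no argument encodes with the default charset.
        System.out.println("default    : " + word.getBytes().length);

        // Explicit charsets give the same size on every platform.
        System.out.println("ISO-8859-1 : " + encodedSize(word, StandardCharsets.ISO_8859_1)); // 8
        System.out.println("UTF-8      : " + encodedSize(word, StandardCharsets.UTF_8));      // 9
    }
}
```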


The byte[] getBytes() method comes with two other very useful overloads:

  • byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException
  • byte[] java.lang.String.getBytes(Charset charset)


Contrary to the getBytes() method with no argument, these methods let you specify the charset to use for the byte encoding.
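
The practical difference between the two signatures can be sketched as follows. The class and method names (GetBytesOverloads, sizeByName, sizeByCharset) and the charset name "no-such-charset" are made up for the illustration.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesOverloads {

    // Name-based overload: the charset is looked up by name at run time,
    // so callers have to handle the checked UnsupportedEncodingException.
    static int sizeByName(String s, String charsetName) throws UnsupportedEncodingException {
        return s.getBytes(charsetName).length;
    }

    // Charset-based overload: the charset object already exists,
    // so there is no checked exception to handle.
    static int sizeByCharset(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // The word from the question, with the c-cedilla written as an escape.
        String word = "endere\u00e7o";
        System.out.println(sizeByCharset(word, StandardCharsets.UTF_8)); // 9
        try {
            System.out.println(sizeByName(word, "UTF-8")); // 9
            // Hypothetical name that no JVM is expected to support.
            sizeByName(word, "no-such-charset");
        } catch (UnsupportedEncodingException e) {
            System.out.println("unsupported charset name: " + e.getMessage());
        }
    }
}
```

When the charset is known at compile time, the Charset overload with a StandardCharsets constant is usually the simpler choice.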

byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException Javadoc:


The behavior of this method when this string cannot be encoded in the given charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.

byte[] java.lang.String.getBytes(Charset charset) Javadoc:


This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
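
To illustrate what "more control" can mean, here is a minimal sketch that contrasts the silent replacement done by getBytes(Charset) with a CharsetEncoder configured to report errors. The class name StrictEncoding and the helper strictEncodedSize are hypothetical, and US-ASCII is chosen only because it cannot represent ç.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncoding {

    // Encodes s in the given charset and returns the number of bytes produced,
    // or throws if some character cannot be represented in that charset.
    static int strictEncodedSize(String s, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(s));
        // The returned buffer starts at position 0 and its limit follows the last byte.
        return encoded.remaining();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // The word from the question, with the c-cedilla written as an escape.
        String word = "endere\u00e7o";

        // getBytes(Charset) substitutes the replacement byte for the
        // unmappable character, so the size alone hides the loss: 8 bytes.
        System.out.println(word.getBytes(StandardCharsets.US_ASCII).length);

        // UTF-8 can represent every character of the word: 9 bytes.
        System.out.println(strictEncodedSize(word, StandardCharsets.UTF_8));

        try {
            strictEncodedSize(word, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("not encodable in US-ASCII: " + e);
        }
    }
}
```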

You may use either one (although there are some intricacies between them) to encode your String into a byte array with UTF-8 or any other charset, and so get its size for that specific charset.

For example, to get a UTF-8 encoded byte array by using getBytes(String charsetName), you can do this:

String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;


And you will get a length of 9 bytes as you wish.


Here is a more comprehensive example that displays the default charset, then the encoded size with the platform's default charset, with UTF-8 and with UTF-16:

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringByteSize {

    public static void main(String[] args) throws UnsupportedEncodingException {

        // default charset of the running JVM
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);

        // String sample
        String yourString = "endereço";

        // getBytes() with the platform's default charset
        System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());

        // getBytes() with specific charset UTF-8
        System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
        System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());

        // getBytes() with specific charset UTF-16 (the result includes a 2-byte byte order mark)
        System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
        System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
    }
}


getBytes() with default charset, size = 8

getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9

getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18


