Question
The obvious answer is to use Charset.defaultCharset(), but we recently found out that this might not be the right answer. I was told that its result differs from the real default charset used by the java.io classes on several occasions. It looks like Java keeps two sets of default charsets. Does anyone have any insight into this issue?
We were able to reproduce one failure case. It is a kind of user error, but it may still expose the root cause of all the other problems. Here is the code:
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CharSetTest {

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        // Deliberately set an invalid charset name at runtime.
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + getDefaultCharSet());
    }

    // Reports the encoding a writer picks when none is given explicitly.
    private static String getDefaultCharSet() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        String enc = writer.getEncoding();
        return enc;
    }
}
Our server requires the default charset to be Latin-1 to deal with some mixed encodings (ANSI/Latin-1/UTF-8) in a legacy protocol, so all our servers run with this JVM parameter:
-Dfile.encoding=ISO-8859-1
Here is the result on Java 5:
Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=ISO8859_1
Someone tried to change the encoding at runtime by setting file.encoding in the code. We all know that doesn't work. However, this apparently throws off defaultCharset(), while it doesn't affect the real default charset used by OutputStreamWriter.
Is this a bug or a feature?
The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5, because it is not the default encoding used by the I/O classes. It looks like Java 6 corrects this issue.
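One way to check whether a particular JVM shows the mismatch described here is to compare the two values directly. The following is a minimal sketch and not part of the original post: the class name DefaultCharsetCheck and its helper methods are made up for illustration. It resolves the historical name returned by OutputStreamWriter.getEncoding() through Charset.forName so that, for example, ISO8859_1 and ISO-8859-1 compare as the same charset. It assumes the historical name is a registered alias, which may not hold for every encoding on every JVM.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class DefaultCharsetCheck {

    // Charset that a writer created without an explicit encoding actually uses.
    static Charset charsetUsedByWriter() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        // getEncoding() may return a historical name such as "ISO8859_1".
        String historicalName = writer.getEncoding();
        try {
            writer.close();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        // Charset.forName resolves registered aliases to the canonical charset.
        return Charset.forName(historicalName);
    }

    // True when Charset.defaultCharset() agrees with the writer's charset.
    static boolean defaultsAgree() {
        return Charset.defaultCharset().equals(charsetUsedByWriter());
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset() = " + Charset.defaultCharset());
        System.out.println("Charset used by writer   = " + charsetUsedByWriter());
        System.out.println("Agree                    = " + defaultsAgree());
    }
}
```

Running this with and without a runtime System.setProperty("file.encoding", ...) call should show whether the JVM at hand keeps the two values in sync.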
Recommended Answer
This is really strange... Once set, the default charset is cached, and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called, it returns the cached charset.
Here are my results:
Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1
I'm using JVM 1.6, though.
(Update)
OK. I did reproduce your bug with JVM 1.5.
Looking at the source code of 1.5, the cached default charset is never set. I don't know whether this is a bug, but 1.6 changes this implementation and uses the cached charset:
JVM 1.5:
public static Charset defaultCharset() {
    synchronized (Charset.class) {
        if (defaultCharset == null) {
            java.security.PrivilegedAction pa =
                new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                return cs;
            return forName("UTF-8");
        }
        return defaultCharset;
    }
}
JVM 1.6:
public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa =
                new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}
When you set the property to file.encoding=Latin-1, the next call to Charset.defaultCharset() in 1.5 goes like this: because the cached default charset is never set, the method reads the property again and tries to find a charset named Latin-1. This name isn't found, because it's incorrect, so the method returns the fallback, UTF-8.
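To see which spellings a given JVM actually accepts, the names can be looked up directly. This is a small sketch rather than anything from the original answer: CharsetNameLookup and canonicalNameOrNull are hypothetical names chosen for illustration. It uses only Charset.isSupported, Charset.forName, Charset.name and Charset.aliases from the standard library.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, for illustration only.
final class CharsetNameLookup {

    // Returns the canonical charset name, or null when the name is
    // unknown to this JVM or syntactically illegal.
    static String canonicalNameOrNull(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalCharsetNameException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {"Latin-1", "latin1", "ISO8859_1", "ISO-8859-1"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + canonicalNameOrNull(candidate));
        }
        // Lists every alias this JVM registers for ISO-8859-1.
        System.out.println("Aliases: " + Charset.forName("ISO-8859-1").aliases());
    }
}
```

A name that maps to null here is one that a lookup like the one in defaultCharset() would fail to resolve.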
As for why I/O classes such as OutputStreamWriter return an unexpected result: the implementation of sun.nio.cs.StreamEncoder (which is used by these I/O classes) also differs between JVM 1.5 and JVM 1.6. The JVM 1.6 implementation relies on the Charset.defaultCharset() method to get the default encoding when none is provided to the I/O classes. The JVM 1.5 implementation uses a different method, Converters.getDefaultEncodingName(), to get the default charset. That method uses its own cache of the default charset, which is set at JVM initialization:
JVM 1.6:
public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                  Object lock,
                                                  String charsetName)
    throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Charset.defaultCharset().name();
    try {
        if (Charset.isSupported(csn))
            return new StreamEncoder(out, lock, Charset.forName(csn));
    } catch (IllegalCharsetNameException x) { }
    throw new UnsupportedEncodingException (csn);
}
JVM 1.5:
public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                  Object lock,
                                                  String charsetName)
    throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Converters.getDefaultEncodingName();
    if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
        try {
            if (Charset.isSupported(csn))
                return new CharsetSE(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
    }
    return new ConverterSE(out, lock, csn);
}
But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.
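One way to avoid depending on the default at all is to pass the charset explicitly wherever bytes and characters are converted. The sketch below is an illustration under assumptions, not code from the original thread: the class ExplicitCharsetExample and its encode method are hypothetical, and StandardCharsets requires Java 7 or later (on older JVMs, Charset.forName("ISO-8859-1") would play the same role).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ExplicitCharsetExample {

    // Encodes text with the given charset, ignoring the JVM default entirely.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println("ISO-8859-1 bytes: " + encode(text, StandardCharsets.ISO_8859_1).length);
        System.out.println("UTF-8 bytes:      " + encode(text, StandardCharsets.UTF_8).length);
    }
}
```

With the charset spelled out at each call site, the output no longer depends on file.encoding or on which internal cache a particular JVM version consults.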