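java - Reading content including the euro sign from a FileItem with code page 1252

My problem has the following setup:

In a client/server architecture with web service communication, I receive CSV files from the client on the server side. The API gives me an org.apache.commons.fileupload.FileItem.

The allowed code pages for these files are code page 850 and code page 1252.

Everything works fine; the only problem is the euro sign (€). With code page 1252, my code does not handle the euro sign correctly. Instead, when I print it to the console in Eclipse, I see the character with Unicode U+00A4: ¤.

At the moment I use the following code. It is spread across several classes; I have extracted the relevant lines.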

byte[] inputData = call.getImportDatei().get();

// the following method works correctly
// it returns Charset.forName("CP850") or Charset.forName("CP1252")
final Charset charset = retrieveCharset(inputData);

char[] stringContents;
final StringBuffer sb = new StringBuffer();

final String s = new String(inputData, charset.name());

// here I see the problem with the euro sign already
// the following code shouldn't be the problem

// here some special characters are converted, but this doesn't affect the problem, so I removed those lines
stringContents = s.toCharArray();
for(final char c : stringContents){
  sb.append(c);
}
final Reader stringReader = new StringReader(sb.toString());


// org.supercsv.io.CsvListReader
CsvListReader reader = new CsvListReader(stringReader, CsvPreference.EXCEL_NORTH_EUROPE_PREFERENCE);
// now this reader is used to read the CSV content...

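I tried different things:

FileItem.getInputStream()

I used FileItem.getInputStream() to get the byte[], but the result is the same.

FileItem.getString()

When I use FileItem.getString(), it works perfectly with code page 1252: the euro sign is read correctly. I see it when I print it to the console in Eclipse.
But with code page 850, many special characters are wrong.

FileItem.getString(String encoding)

So my idea was to use FileItem.getString(String encoding). However, none of the Strings I tried in order to tell it to use code page 1252 produced an exception, yet the result was wrong.

For example, getString(Charset.forName("CP1252").name()) results in a question mark instead of the euro sign.

How do I specify the encoding when using org.apache.commons.fileupload.FileItem?

Or is this the wrong approach?

Thanks for your help!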

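Best answer

"I see it when I print it to the console in Eclipse. But with code page 850, many special characters are wrong."

You are being misled by concentrating too much on the results as presented by the Eclipse console. The underlying data is correct, but Eclipse displays it wrongly. On Windows, it is configured by default to use cp1252 to display characters printed by System.out.println(). That way, characters that were originally decoded using a different charset will obviously not be displayed properly.

You'd better reconfigure the Eclipse console to use UTF-8 to present those characters. UTF-8 covers every character known to the human world. You can do that by setting the Window > Preferences > General > Workspace > Text file encoding property to UTF-8.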
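
To check what was actually decoded without trusting how the console renders it, you could dump the code points of the string instead of printing the characters themselves. The following is only a minimal sketch: the class CodePointDump and its toCodePoints helper are made up for illustration, and the single byte 0x80 stands in for a euro sign in a cp1252 upload.

```java
import java.nio.charset.Charset;

class CodePointDump {
    // Formats every code point of s as U+XXXX, so the result is plain ASCII
    // and does not depend on the encoding the console uses for display.
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // In windows-1252 the euro sign is the single byte 0x80.
        byte[] euroInCp1252 = {(byte) 0x80};
        String decoded = new String(euroInCp1252, Charset.forName("windows-1252"));
        System.out.println(toCodePoints(decoded)); // prints "U+20AC"
    }
}
```

If this prints U+20AC for your uploaded data, the decoding is fine and only the display is off; if it prints U+00A4, the bytes really were decoded into the wrong character.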

Then, given that you are apparently using FileItem from Apache Commons FileUpload, you can get the FileItem content as a properly decoded Reader in the following, simpler way.

byte[] content = fileItem.get();
Charset charset = retrieveCharset(content); // No idea what you're doing there, but kudos that it's returning the right charset.
Reader reader = new InputStreamReader(new ByteArrayInputStream(content), charset);
// ...
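
For trying this out without a servlet container, here is a self-contained sketch of the same idea. The class UploadDecoding, the helpers openReader and firstLine, and the hard-coded sample bytes are assumptions for illustration; in the real code the bytes would come from fileItem.get() and the charset from your own retrieveCharset().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class UploadDecoding {
    // Hypothetical helper: wraps raw upload bytes in a Reader that decodes
    // them with the given charset instead of the platform default.
    static Reader openReader(byte[] content, Charset charset) {
        return new InputStreamReader(new ByteArrayInputStream(content), charset);
    }

    // Hypothetical helper: returns the first decoded line of the content.
    static String firstLine(byte[] content, Charset charset) {
        try (BufferedReader reader = new BufferedReader(openReader(content, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // Stand-in for fileItem.get(): "12,50 " followed by the euro sign (0x80) and CRLF.
        byte[] content = {'1', '2', ',', '5', '0', ' ', (byte) 0x80, '\r', '\n'};
        String line = firstLine(content, cp1252);
        System.out.println(line.endsWith("\u20AC")); // prints "true"
    }
}
```

The Reader returned by openReader could be passed to a CSV reader such as CsvListReader in place of the StringReader from the question.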

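Note that when you later write this CSV to a character-based output stream by means other than System.out.println(), for example a FileWriter, don't forget to explicitly specify the charset as UTF-8 there as well! You can do that with an OutputStreamWriter (or, since Java 11, with the FileWriter constructors that accept a Charset). Otherwise the platform default encoding is still used, which is cp1252 on Windows.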
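
As a rough sketch of specifying the charset on output: the class ExplicitCharsetOutput and the helper encodeUtf8 are hypothetical, and a ByteArrayOutputStream stands in for a FileOutputStream so that the example runs without touching the disk.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetOutput {
    // Hypothetical helper: writes text through a Writer with an explicit
    // UTF-8 charset and returns the bytes that reached the underlying stream.
    static byte[] encodeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encodeUtf8("\u20AC")) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        System.out.println(hex); // prints "E2 82 AC", the UTF-8 encoding of the euro sign
    }
}
```

For a real file you would pass a FileOutputStream to the OutputStreamWriter instead; the point is only that the charset is named explicitly rather than left to the platform default.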

See also:

Unicode - How to get the characters right?