This article explains how to convert an ISO 8859-1 encoded text file containing Latin-1 accented characters to UTF-8 using .NET.

Question

I am being sent text files saved in ISO 8859-1 format that contain accented characters from the Latin-1 range (as well as normal ASCII a-z, etc.). How do I convert these files to UTF-8 using C# so that the single-byte accented characters in ISO 8859-1 become valid UTF-8 characters?

I have tried using a StreamReader with ASCIIEncoding, and then converting the ASCII string to UTF-8 by instantiating an ASCII encoding and a UTF-8 encoding and calling Encoding.Convert(ascii, utf8, ascii.GetBytes(asciiString)), but the accented characters are rendered as question marks.

What step am I missing?

Recommended answer

You need to get the proper Encoding object. ASCII is exactly what its name says: it only supports 7-bit ASCII characters, so the accented bytes in your files (all above 0x7F) cannot be decoded and come out as question marks. If what you want to do is convert files, then letting StreamReader and StreamWriter handle the encodings is likely easier than dealing with the byte arrays directly.

using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName,
                                       Encoding.GetEncoding("iso-8859-1")))
{
    using (System.IO.StreamWriter writer = new System.IO.StreamWriter(
                                           outFileName, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }
}
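
To see why the ASCII attempt in the question produced question marks, here is a minimal sketch that decodes the same bytes with both encodings. The class and method names (`AsciiVersusLatin1`, `DecodeAsAscii`, `DecodeAsLatin1`) are made up for illustration and are not part of the original answer.

```csharp
using System.Text;

// Hypothetical helper class, named only for this illustration.
public static class AsciiVersusLatin1
{
    // Decodes with 7-bit ASCII: any byte above 0x7F has no ASCII meaning
    // and is replaced with '?', so the original character is lost.
    public static string DecodeAsAscii(byte[] raw)
    {
        return Encoding.ASCII.GetString(raw);
    }

    // Decodes with ISO 8859-1, where every byte value 0x00-0xFF maps
    // directly to the Unicode code point with the same number.
    public static string DecodeAsLatin1(byte[] raw)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(raw);
    }
}
```

For the four bytes 0x63 0x61 0x66 0xE9 ("café" in ISO 8859-1), `DecodeAsAscii` returns "caf?" while `DecodeAsLatin1` returns "café". Once the string contains '?', no later conversion to UTF-8 can recover the accent.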

However, if you want to work with the byte arrays yourself, that is easy enough to do with Encoding.Convert.

byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"),
    Encoding.UTF8, data);
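
As a small worked example of that call, the sketch below wraps it in a helper so the byte-level effect is visible. The names `ConvertDemo` and `Latin1ToUtf8` are assumptions introduced here for illustration.

```csharp
using System.Text;

// Hypothetical wrapper, named only for this illustration.
public static class ConvertDemo
{
    // Re-encodes ISO 8859-1 bytes as UTF-8 bytes. ASCII bytes pass
    // through unchanged; bytes 0x80-0xFF each become two UTF-8 bytes.
    public static byte[] Latin1ToUtf8(byte[] latin1Bytes)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return Encoding.Convert(latin1, Encoding.UTF8, latin1Bytes);
    }
}
```

Passing the bytes 63 61 66 E9 ("café" in ISO 8859-1) yields 63 61 66 C3 A9: the single byte E9 becomes the two-byte UTF-8 sequence C3 A9, and no byte order mark is added.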

It's important to note here, however, that if you go down this road you should not use an encoding-based text reader like StreamReader for your file I/O. FileStream is better suited, as it reads the actual bytes of the file.

In the interest of fully exploring the issue, something like this would work:

using (System.IO.FileStream input = new System.IO.FileStream(fileName,
                                    System.IO.FileMode.Open,
                                    System.IO.FileAccess.Read))
{
    byte[] buffer = new byte[input.Length];

    int readLength = 0;

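    // Read may return fewer bytes than requested, so loop until the buffer
    // is full; a return value of 0 means the stream ended early.
    while (readLength < buffer.Length)
    {
        int bytesRead = input.Read(buffer, readLength, buffer.Length - readLength);
        if (bytesRead == 0)
            throw new System.IO.EndOfStreamException();
        readLength += bytesRead;
    }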

    byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"),
                       Encoding.UTF8, buffer);

    using (System.IO.FileStream output = new System.IO.FileStream(outFileName,
                                         System.IO.FileMode.Create,
                                         System.IO.FileAccess.Write))
    {
        output.Write(converted, 0, converted.Length);
    }
}

In this example, the buffer variable is filled with the actual data in the file as a byte[], so no conversion is done while reading. Encoding.Convert takes the source and destination encodings and returns the converted bytes, which are stored in the variable named converted. That array is then written to the output file directly.

Like I said, the first option using StreamReader and StreamWriter will be much simpler if this is all you're doing, but the latter example should give you more of a hint as to what's actually going on.
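
For whole-file conversion, the same idea can also be sketched with File.ReadAllText and File.WriteAllText. This is a variation, not the code from the answer above: the class name `Latin1FileConverter`, the method `ConvertFile`, and the `emitBom` parameter are assumptions for illustration. One detail worth knowing is that Encoding.UTF8 writes a byte order mark at the start of a new file, whereas new UTF8Encoding(false) does not.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, named only for this illustration.
public static class Latin1FileConverter
{
    // Reads inputPath as ISO 8859-1 and writes the text to outputPath as
    // UTF-8. emitBom controls whether the output starts with EF BB BF.
    // The whole file is held in memory, so this suits modest file sizes.
    public static void ConvertFile(string inputPath, string outputPath, bool emitBom = false)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Encoding utf8 = new UTF8Encoding(emitBom);

        string text = File.ReadAllText(inputPath, latin1);
        File.WriteAllText(outputPath, text, utf8);
    }
}
```

Calling `Latin1FileConverter.ConvertFile("in.txt", "out.txt")` with placeholder paths would write UTF-8 without a byte order mark; pass `true` as the third argument if the consumer of the file expects one.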


07-31 09:28