Detecting the encoding of a text file with C#

Problem description

I have a set of Markdown files to pass to a Jekyll project, and I need to find the encoding of each one (UTF-8 with BOM, UTF-8 without BOM, or ANSI) using a program or an API.

If I pass in the location of the files, the files should be listed and read, and the encoding of each should be produced as the result.

Is there any code or API for this?

I have already tried sr.CurrentEncoding on a StreamReader, as mentioned in "Effective way to find any file's Encoding", but the result differs from what Notepad++ reports.

I also tried https://github.com/errepi/ude (Mozilla Universal Charset Detector), as suggested in https://social.msdn.microsoft.com/Forums/vstudio/en-US/862e3342-cc88-478f-bca2-e2de6f60d2fb/detect-encoding-of-the-file?forum=csharpgeneral, by referencing ude.dll in the C# project. The result does not match Notepad++ either: Notepad++ shows the file encoding as UTF-8, but the program reports UTF-8 with BOM.

Both approaches should give the same result, so where does the problem come from?

Recommended answer

Detecting encoding is always a tricky business, but detecting BOMs is dead simple. To get the BOM as a byte array, call the GetPreamble() method of the encoding object and compare it with the first bytes of the file. This should allow you to detect a whole range of encodings by their preamble.
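As a rough illustration of checking several preambles, here is a minimal sketch. The class name BomSniffer, the method name DetectByPreamble, and the choice of candidate encodings are assumptions made for this example; only GetPreamble() and the encoding classes come from the framework.

```csharp
using System;
using System.Linq;
using System.Text;

public static class BomSniffer
{
    // Candidate encodings that define a preamble. UTF-32 LE is listed before
    // UTF-16 LE because its BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
    private static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(false, true),   // UTF-32 LE: FF FE 00 00
        new UTF32Encoding(true, true),    // UTF-32 BE: 00 00 FE FF
        new UTF8Encoding(true),           // UTF-8:     EF BB BF
        new UnicodeEncoding(false, true), // UTF-16 LE: FF FE
        new UnicodeEncoding(true, true),  // UTF-16 BE: FE FF
    };

    // Returns the first candidate whose preamble matches the start of the data,
    // or null when none of the listed BOMs is present.
    public static Encoding DetectByPreamble(byte[] bytes)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] preamble = candidate.GetPreamble();
            if (preamble.Length > 0
                && bytes.Length >= preamble.Length
                && preamble.SequenceEqual(bytes.Take(preamble.Length)))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

One caveat with this ordering: a UTF-16 LE file whose first character is U+0000 also starts with FF FE 00 00 and would be reported as UTF-32 LE. For Markdown text files that case is unlikely, but it shows that even BOM matching involves a judgment call.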

Now, as for detecting UTF-8 without a preamble, that is actually not very hard either. UTF-8 has strict bitwise rules about which byte values are expected in a valid sequence, and you can construct a UTF8Encoding object in a way that makes it throw an exception when it meets an invalid sequence.
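The strict check on its own can be sketched as follows. Utf8Check and IsValidUtf8 are hypothetical names for this example; the relevant framework piece is the UTF8Encoding(bool, bool) constructor, whose second argument (throwOnInvalidBytes) turns invalid sequences into exceptions.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Hypothetical helper: returns true when the bytes decode as UTF-8
    // without any invalid sequence.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes = true makes the decoder throw instead of
        // silently substituting U+FFFD for bad input.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (ArgumentException)
        {
            // The decoder throws DecoderFallbackException,
            // which derives from ArgumentException.
            return false;
        }
    }
}
```

For instance, the bytes 63 61 66 C3 A9 ("café" in UTF-8) pass the check, while 63 61 66 E9 21 ("café!" in Windows-1252) fail it, because E9 announces a three-byte sequence and is not followed by continuation bytes.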

So if you first do the BOM check, then the strict decoding check, and finally fall back to Windows-1252 encoding (what you call "ANSI"), your detection is done.

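// Requires: using System; using System.IO; using System.Linq; using System.Text;
// 'filename' is the path of the file to inspect.
Byte[] bytes = File.ReadAllBytes(filename);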
Encoding encoding = null;
String text = null;
// Test UTF8 with BOM. This check can easily be copied and adapted
// to detect many other encodings that use BOMs.
UTF8Encoding encUtf8Bom = new UTF8Encoding(true, true);
Boolean couldBeUtf8 = true;
Byte[] preamble = encUtf8Bom.GetPreamble();
Int32 prLen = preamble.Length;
if (bytes.Length >= prLen && preamble.SequenceEqual(bytes.Take(prLen)))
{
    // UTF8 BOM found; use encUtf8Bom to decode.
    try
    {
        // GetString does not strip the preamble even for an encoding that
        // declares one, so skip the BOM bytes explicitly when decoding.
        text = encUtf8Bom.GetString(bytes, prLen, bytes.Length - prLen);
        encoding = encUtf8Bom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
        couldBeUtf8 = false;
    }
}
// use boolean to skip this if it's already confirmed as incorrect UTF-8 decoding.
if (couldBeUtf8 && encoding == null)
{
    // test UTF-8 on strict encoding rules. Note that on pure ASCII this will
    // succeed as well, since valid ASCII is automatically valid UTF-8.
    UTF8Encoding encUtf8NoBom = new UTF8Encoding(false, true);
    try
    {
        text = encUtf8NoBom.GetString(bytes);
        encoding = encUtf8NoBom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
    }
}
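// fall back to default ANSI encoding.
// Note: on .NET Core / .NET 5+, code page 1252 is only available after calling
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).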
if (encoding == null)
{
    encoding = Encoding.GetEncoding(1252);
    text = encoding.GetString(bytes);
}

Note that Windows-1252 (US / Western European ANSI) is a one-byte-per-character encoding, so practically any byte sequence decodes to technically valid characters. Unless you go for heuristic methods, no further detection can be done to distinguish it from other one-byte-per-character encodings.
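To cover the "pass a location, list the files, report the encoding" part of the question, the same three-step logic could be wrapped in a small report. This is only a sketch: EncodingReport, Classify, PrintReport, the "*.md" filter, and the label strings are assumptions made for this example, and anything that is not valid UTF-8 is simply labeled as ANSI without further checks.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class EncodingReport
{
    // Classifies raw bytes into the three categories from the question.
    public static string Classify(byte[] bytes)
    {
        byte[] bom = new UTF8Encoding(true).GetPreamble();
        bool hasBom = bytes.Length >= bom.Length
                      && bom.SequenceEqual(bytes.Take(bom.Length));
        try
        {
            // Strict decoder: throws on any invalid UTF-8 sequence.
            // The BOM bytes themselves are valid UTF-8, so they can stay in.
            new UTF8Encoding(false, true).GetString(bytes);
            return hasBom ? "UTF-8 with BOM" : "UTF-8 without BOM";
        }
        catch (ArgumentException)
        {
            // Not valid UTF-8: assume a single-byte ANSI code page.
            return "ANSI (assumed Windows-1252)";
        }
    }

    // Lists every *.md file under the folder together with its classification.
    public static void PrintReport(string folder)
    {
        foreach (string path in Directory.EnumerateFiles(
                     folder, "*.md", SearchOption.AllDirectories))
        {
            Console.WriteLine($"{path}: {Classify(File.ReadAllBytes(path))}");
        }
    }
}
```

A call such as EncodingReport.PrintReport(@"C:\site\_posts") (a made-up path) would print one line per file. Pure ASCII files are reported as "UTF-8 without BOM", since valid ASCII is also valid UTF-8.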


08-21 07:28