How to correctly determine the character encoding of a text file?

Problem description

Here is my situation: I need to correctly determine which character encoding is used for a given text file. Ideally, the check returns one of the following types:

enum CHARACTER_ENCODING
{
    ANSI,
    Unicode,
    Unicode_big_endian,
    UTF8_with_BOM,
    UTF8_without_BOM
};

So far, I can correctly tell whether a text file is Unicode, Unicode big endian, or UTF-8 with BOM by calling the following function. It also correctly reports ANSI, provided the file is not actually UTF-8 without BOM. The problem is that when the text file is UTF-8 without BOM, the function mistakenly treats it as an ANSI file.

#include <windows.h>
#include <stdexcept>

CHARACTER_ENCODING get_text_file_encoding(const char *filename)
{
    const BYTE utf16leBom[] = {0xFF, 0xFE};       // "Unicode" (UTF-16 little endian) file header
    const BYTE utf16beBom[] = {0xFE, 0xFF};       // Unicode big endian (UTF-16 BE) file header
    const BYTE utf8Bom[]    = {0xEF, 0xBB, 0xBF}; // UTF-8 file header (the BOM is three bytes)

    // CreateFileA, because filename is a narrow (char) string.
    HANDLE hFile = CreateFileA(filename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        throw std::runtime_error("cannot open file"); // no valid handle to close here

    BYTE header[3] = {0};
    DWORD dwBytesRead = 0;
    BOOL ok = ReadFile(hFile, header, sizeof(header), &dwBytesRead, NULL);
    CloseHandle(hFile);
    if (!ok)
        throw std::runtime_error("cannot read file");

    // Only compare as many bytes as were actually read.
    if (dwBytesRead >= 2 && header[0] == utf16leBom[0] && header[1] == utf16leBom[1])
        return CHARACTER_ENCODING::Unicode;
    if (dwBytesRead >= 2 && header[0] == utf16beBom[0] && header[1] == utf16beBom[1])
        return CHARACTER_ENCODING::Unicode_big_endian;
    if (dwBytesRead >= 3 && header[0] == utf8Bom[0] && header[1] == utf8Bom[1] && header[2] == utf8Bom[2])
        return CHARACTER_ENCODING::UTF8_with_BOM;

    // No BOM found: everything else, including UTF-8 without BOM, ends up here.
    return CHARACTER_ENCODING::ANSI;
}

This problem has blocked me for a long time and I still cannot find a good solution. Any hint will be appreciated.

Recommended answer

For starters, there's no such physical encoding as "Unicode". What you probably mean by this is UTF-16. Secondly, any file is valid in "ANSI", or any single-byte encoding for that matter. The only thing you can do is guess, testing the candidate encodings in the order that is most likely to throw out invalid matches.

In order:

  • Is there a UTF-16 BOM at the beginning? Then it's probably UTF-16. Use the BOM as an indicator of whether it's big endian or little endian, then check whether the rest of the file conforms.
  • Is there a UTF-8 BOM at the beginning? Then it's probably UTF-8. Check the rest of the file.
  • If the above didn't result in a positive match, check if the entire file is valid UTF-8. If it is, it's probably UTF-8.
  • If the above didn't result in a positive match, it's probably ANSI.
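The order above can be sketched in portable C++ roughly as follows. This is a minimal sketch, not the answerer's code: it assumes the whole file has already been read into a std::string, the names DetectedEncoding, is_valid_utf8 and detect_encoding are hypothetical, and it does not verify that the bytes after a UTF-16 BOM are well-formed UTF-16 (nor does it distinguish a UTF-32 BOM).

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch (not the enum from the question).
enum class DetectedEncoding {
    Utf16LittleEndian, Utf16BigEndian, Utf8WithBom, Utf8WithoutBom, Ansi
};

// Returns true if the bytes from pos to the end are well-formed UTF-8
// (no overlong forms, no surrogates, nothing above U+10FFFF).
bool is_valid_utf8(const std::string& data, std::size_t pos = 0)
{
    const std::size_t n = data.size();
    while (pos < n) {
        const unsigned char b0 = static_cast<unsigned char>(data[pos]);
        if (b0 <= 0x7F) { ++pos; continue; }       // plain ASCII byte

        std::size_t len = 0;                       // total sequence length
        unsigned char lo = 0x80, hi = 0xBF;        // allowed range for the 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;             // rejects overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;             // rejects UTF-16 surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;             // rejects overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;             // rejects values above U+10FFFF
        } else {
            return false;                          // stray continuation or invalid lead byte
        }

        if (pos + len > n) return false;           // sequence truncated at end of data
        for (std::size_t i = 1; i < len; ++i) {
            const unsigned char b = static_cast<unsigned char>(data[pos + i]);
            if (b < lo || b > hi) return false;
            lo = 0x80; hi = 0xBF;                  // later bytes use the normal range
        }
        pos += len;
    }
    return true;
}

// Applies the checks in the order given above: BOMs first, then a full
// UTF-8 validation, with "ANSI" as the fallback guess.
DetectedEncoding detect_encoding(const std::string& data)
{
    auto byte_at = [&data](std::size_t i) {
        return static_cast<unsigned char>(data[i]);
    };

    if (data.size() >= 2) {
        if (byte_at(0) == 0xFF && byte_at(1) == 0xFE) return DetectedEncoding::Utf16LittleEndian;
        if (byte_at(0) == 0xFE && byte_at(1) == 0xFF) return DetectedEncoding::Utf16BigEndian;
    }
    if (data.size() >= 3 && byte_at(0) == 0xEF && byte_at(1) == 0xBB && byte_at(2) == 0xBF) {
        // A UTF-8 BOM followed by invalid UTF-8 falls through to the ANSI guess.
        return is_valid_utf8(data, 3) ? DetectedEncoding::Utf8WithBom : DetectedEncoding::Ansi;
    }
    if (is_valid_utf8(data)) return DetectedEncoding::Utf8WithoutBom;
    return DetectedEncoding::Ansi;
}
```

Note that a pure ASCII file (or an empty one) is reported as UTF-8 without BOM here, because ASCII is a subset of UTF-8. If such files should count as ANSI instead, the sketch would need an extra check for any byte above 0x7F.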

If you expect UTF-16 files without BOM as well (this is possible for, for example, XML files that specify the encoding in the XML declaration), then you have to add that rule to the list as well. Any of the above may still produce a false positive, falsely identifying an ANSI file as UTF-* (though that's unlikely). You should always have metadata that tells you what encoding a file is in; detecting it after the fact is not possible with 100% accuracy.
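For the XML case specifically, one possible extra rule is to look for the first two characters of an XML declaration, "<?", encoded as UTF-16 without a BOM. The answer does not say how to implement this rule; the following is only a sketch under that assumption, with hypothetical names (Utf16Guess, guess_bomless_utf16_xml), and it recognizes nothing except files that start directly with "<?".

```cpp
#include <string>

// Hypothetical result type for this sketch.
enum class Utf16Guess { None, LittleEndian, BigEndian };

// Guesses BOM-less UTF-16 from the leading "<?" of an XML declaration:
// 3C 00 3F 00 suggests little endian, 00 3C 00 3F suggests big endian.
Utf16Guess guess_bomless_utf16_xml(const std::string& data)
{
    if (data.size() < 4) return Utf16Guess::None;

    // The explicit length keeps the embedded zero bytes in the patterns.
    const std::string little("<\0?\0", 4);
    const std::string big("\0<\0?", 4);

    if (data.compare(0, 4, little) == 0) return Utf16Guess::LittleEndian;
    if (data.compare(0, 4, big) == 0) return Utf16Guess::BigEndian;
    return Utf16Guess::None;
}
```

Such a rule would most naturally run after the BOM checks and before the whole-file UTF-8 validation, since zero bytes are valid UTF-8 and mostly-ASCII UTF-16 text could otherwise pass as UTF-8 without BOM.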

