c++ - How to replace/ignore invalid Unicode/UTF-8 characters from C stdio.h getline()?

In Python, the open function has the option errors='ignore':

open( '/filepath.txt', 'r', encoding='UTF-8', errors='ignore' )

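With this option, reading a file that contains invalid UTF-8 characters replaces them with nothing, i.e., they are ignored. For example, a file with the characters FÃ¸Ã¶»BÃ¥r will be read as FøöBår.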

If a line containing FÃ¸Ã¶»BÃ¥r is read with getline() from stdio.h, it is read as Føö�Bår:

FILE* cfilestream = fopen( "/filepath.txt", "r" );
size_t linebuffersize = 131072;   // getline() expects a size_t*, not an int*
char* readline = (char*) malloc( linebuffersize );

while( true )
{
    // getline() returns -1 on end of file or on error
    if( getline( &readline, &linebuffersize, cfilestream ) != -1 ) {
        std::cerr << "readline=" << readline << std::endl;
    }
    else {
        break;
    }
}

How can I make stdio.h getline() read it as FøöBår instead of Føö�Bår, i.e., ignore invalid UTF-8 characters?

One overkill solution I can think of is to iterate over every character of each line read and build a new readline without any of these characters. For example:

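FILE* cfilestream = fopen( "/filepath.txt", "r" );
size_t linebuffersize = 131072;   // getline() expects a size_t*, not an int*
char* readline = (char*) malloc( linebuffersize );
char* fixedreadline = (char*) malloc( linebuffersize );

ssize_t index;
ssize_t charsread;                // getline() returns ssize_t
ssize_t invalidcharsoffset;

while( true )
{
    if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
    {
        invalidcharsoffset = 0;
        for( index = 0; index < charsread; ++index )
        {
            // NOTE: '�' occupies several bytes in UTF-8, so a single char
            // can never compare equal to it; this only sketches the idea.
            if( readline[index] != '�' ) {
                fixedreadline[index-invalidcharsoffset] = readline[index];
            }
            else {
                ++invalidcharsoffset;
            }
        }
        // terminate the copy so it can be printed as a C string
        fixedreadline[charsread-invalidcharsoffset] = '\0';
        std::cerr << "fixedreadline=" << fixedreadline << std::endl;
    }
    else {
        break;
    }
}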

Related questions:

Fixing invalid UTF8 characters

Replacing non UTF8 characters

python replace unicode characters

Python unicode: how to replace character that cannot be decoded using utf8 with whitespace?

Best answer

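You are confusing what you see with what is really going on. The getline function does not replace any characters. [Note 1]
You are seeing a replacement character (U+FFFD) because your console outputs that character when it is asked to render an invalid UTF-8 code. Most consoles will do this if they are in UTF-8 mode; that is, if the current locale is UTF-8.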
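Also, saying that a file contains the "characters FÃ¸Ã¶»BÃ¥r" is at best imprecise. A file does not really contain characters. It contains byte sequences which may be interpreted as characters according to some encoding (for example, by a console or other user presentation software which renders them as glyphs). Different encodings produce different results. In this particular case, you have a file which was created by software using the Windows-1252 encoding (or, roughly equivalently, ISO 8859-15), and you are rendering it on a console using UTF-8.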
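
To make the point about encodings concrete, here is a minimal sketch of how a single byte is re-encoded when it is interpreted as ISO 8859-1 and written out as UTF-8. The function name latin1_byte_to_utf8 is hypothetical. It assumes plain ISO 8859-1, whose byte values map directly to U+0000 through U+00FF; Windows-1252 assigns different characters to the bytes 0x80 to 0x9F, which this sketch does not handle.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes one ISO 8859-1 byte as UTF-8 into 'dst', which must have room for
 * two bytes. Returns the number of bytes written (1 or 2).
 * Hypothetical helper for illustration; not part of any library. */
size_t latin1_byte_to_utf8(unsigned char byte, char* dst) {
  assert(dst != NULL);
  if (byte < 0x80) {               /* ASCII is identical in both encodings */
    dst[0] = (char)byte;
    return 1;
  }
  /* U+0080..U+00FF take two bytes in UTF-8: 110000xx 10xxxxxx */
  dst[0] = (char)(0xC0 | (byte >> 6));
  dst[1] = (char)(0x80 | (byte & 0x3F));
  return 2;
}
```

Under that assumption, the lone byte 0xBB stands for the character » and becomes the two bytes C2 BB in UTF-8, whereas the same byte on its own is not a valid UTF-8 sequence at all.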
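What this means is that the data read by getline contains an invalid UTF-8 sequence, but it (probably) does not contain the replacement character code. Based on the character string you show, it contains the byte \xbb, which is a right-pointing guillemet (») in Windows code page 1252.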
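
One way to check this is to look at the raw bytes instead of at what the console draws. The following is only a sketch, and the helper names has_replacement_char and hex_format are hypothetical. It relies on the fact that U+FFFD is encoded in UTF-8 as the three bytes EF BF BD; if those bytes are absent from the buffer, the � on screen came from the terminal and not from the file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the NUL-terminated string 's' contains the UTF-8 encoding of
 * U+FFFD (the bytes EF BF BD), and 0 otherwise. */
int has_replacement_char(const char* s) {
  assert(s != NULL);
  return strstr(s, "\xEF\xBF\xBD") != NULL;
}

/* Formats the first 'len' bytes of 's' into 'dst' as space-separated,
 * two-digit hex values. Output is truncated if 'dst' is too small.
 * Returns 'dst' so the call can be used directly in a print statement. */
char* hex_format(char* dst, size_t dstsize, const char* s, size_t len) {
  assert(dst != NULL && dstsize > 0 && s != NULL);
  size_t used = 0;
  dst[0] = '\0';
  /* each byte needs at most 3 characters plus the terminating NUL */
  for (size_t i = 0; i < len && used + 4 <= dstsize; ++i) {
    used += (size_t)snprintf(dst + used, dstsize - used, "%s%02X",
                             i ? " " : "", (unsigned)(unsigned char)s[i]);
  }
  return dst;
}
```

Applied to a line returned by getline, hex_format would show a lone BB byte at the position where the console prints �.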
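Finding all the invalid UTF-8 sequences in a string read by getline (or by any other C library function which reads files) requires scanning the string, but not for one particular code sequence. Rather, you need to decode UTF-8 sequences one at a time, looking for the ones which are not valid. That is not a simple task, but the mbtowc function can help (if you have enabled a UTF-8 locale). As the mbtowc man page explains, mbtowc returns the number of bytes contained in a valid "multibyte sequence" (which is UTF-8 in a UTF-8 locale), or -1 to indicate an invalid or incomplete sequence. During the scan, you should step over the bytes of each valid sequence, or remove/ignore the single byte that starts an invalid sequence, and then continue scanning until you reach the end of the string.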
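
As a rough illustration of that scan, the sketch below only counts the offending bytes and does not remove anything. The names enable_utf8_locale and count_invalid_bytes are hypothetical, and the locale names "C.UTF-8" and "en_US.UTF-8" are assumptions, since the set of installed locales varies from system to system.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Tries to activate a UTF-8 locale. The locale names are assumptions; which
 * ones exist depends on the system. Returns 1 on success, 0 on failure. */
int enable_utf8_locale(void) {
  return setlocale(LC_ALL, "C.UTF-8") != NULL ||
         setlocale(LC_ALL, "en_US.UTF-8") != NULL;
}

/* Counts the bytes within the first 'len' bytes of 's' that do not start a
 * valid multibyte sequence according to mbtowc(). Stops at a NUL byte.
 * Assumes a UTF-8 locale is already active. */
size_t count_invalid_bytes(const char* s, size_t len) {
  assert(s != NULL);
  size_t bad = 0;
  (void)mbtowc(NULL, NULL, 0);        /* start from the initial state */
  while (len) {
    int seqlen = mbtowc(NULL, s, len);
    if (seqlen == 0) break;           /* reached a NUL byte */
    if (seqlen < 0) {                 /* invalid or incomplete sequence */
      (void)mbtowc(NULL, NULL, 0);    /* discard any partial state */
      ++bad;
      seqlen = 1;                     /* skip just the offending byte */
    }
    s += seqlen;
    len -= (size_t)seqlen;
  }
  return bad;
}
```

A caller would invoke enable_utf8_locale once at startup and treat a return value of 0 as a sign that the mbtowc results cannot be trusted for UTF-8 input.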
Here is some lightly tested example code (in C):

#include <stdlib.h>
#include <string.h>

/* Removes in place any invalid UTF-8 sequences from at most 'len' characters of the
 * string pointed to by 's'. (If a NUL byte is encountered, conversion stops.)
 * If the length of the converted string is less than 'len', a NUL byte is
 * inserted.
 * Returns the length of the possibly modified string (with a maximum of 'len'),
 * not including the NUL terminator (if any).
 * Requires that a UTF-8 locale be active; since there is no way to test for
 * this condition, no attempt is made to do so. If the current locale is not UTF-8,
 * behaviour is undefined.
 */
size_t remove_bad_utf8(char* s, size_t len) {
  char* in = s;
  /* Skip over the initial correct sequence. Avoid relying on mbtowc returning
   * zero if n is 0, since Posix is not clear whether mbtowc returns 0 or -1.
   */
  int seqlen = 0;
  (void)mbtowc(NULL, NULL, 0);  /* Start from the initial conversion state. */
  while (len && (seqlen = mbtowc(NULL, in, len)) > 0) { len -= seqlen; in += seqlen; }
  char* out = in;

  if (len && seqlen < 0) {
    /* Reset the state so a partial sequence is not carried into the next call. */
    (void)mbtowc(NULL, NULL, 0);
    ++in;
    --len;
    /* If we find an invalid sequence, we need to start shifting correct sequences.  */
    for (; len; in += seqlen, len -= seqlen) {
      seqlen = mbtowc(NULL, in, len);
      if (seqlen > 0) {
        /* Shift the valid sequence (if one was found) */
        memmove(out, in, seqlen);
        out += seqlen;
      }
      else if (seqlen < 0) {
        /* Skip the single byte that starts the invalid sequence. */
        (void)mbtowc(NULL, NULL, 0);
        seqlen = 1;
      }
      else /* (seqlen == 0) */ break;
    }
    /* Terminate without counting the NUL in the returned length. */
    *out = 0;
  }
  return out - s;
}
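
If activating a UTF-8 locale is not an option, the same scan can be done with hand-written checks instead of mbtowc. The sketch below is a different technique from the code above, not a drop-in equivalent: the names utf8_valid_len and strip_bad_utf8 are hypothetical, the byte ranges follow the well-formed UTF-8 table in the Unicode Standard, and unlike remove_bad_utf8 it does not stop at embedded NUL bytes. It assumes the buffer has room for len + 1 bytes, which holds for a buffer filled by getline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the length (1 to 4) of the well-formed UTF-8 sequence at the start
 * of 's', looking at no more than 'len' bytes, or 0 if the bytes there are
 * invalid or truncated. Rejects overlong forms, UTF-16 surrogates and code
 * points above U+10FFFF. */
static size_t utf8_valid_len(const unsigned char* s, size_t len) {
  if (len == 0) return 0;
  unsigned char c = s[0];
  unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the second byte */
  size_t need;
  if (c < 0x80) return 1;
  else if (c >= 0xC2 && c <= 0xDF) need = 2;
  else if (c >= 0xE0 && c <= 0xEF) {
    need = 3;
    if (c == 0xE0) lo = 0xA0;          /* excludes overlong encodings */
    if (c == 0xED) hi = 0x9F;          /* excludes surrogates */
  }
  else if (c >= 0xF0 && c <= 0xF4) {
    need = 4;
    if (c == 0xF0) lo = 0x90;          /* excludes overlong encodings */
    if (c == 0xF4) hi = 0x8F;          /* excludes values above U+10FFFF */
  }
  else return 0;                       /* stray continuation or bad lead byte */
  if (len < need) return 0;            /* sequence is cut off */
  if (s[1] < lo || s[1] > hi) return 0;
  for (size_t i = 2; i < need; ++i)
    if (s[i] < 0x80 || s[i] > 0xBF) return 0;
  return need;
}

/* Removes in place every byte of s[0..len) that is not part of a well-formed
 * UTF-8 sequence, writes a NUL after the result, and returns the new length.
 * The buffer must have room for len + 1 bytes. */
size_t strip_bad_utf8(char* s, size_t len) {
  assert(s != NULL);
  unsigned char* in = (unsigned char*)s;
  unsigned char* out = in;
  while (len) {
    size_t n = utf8_valid_len(in, len);
    if (n == 0) { ++in; --len; continue; }  /* drop one offending byte */
    memmove(out, in, n);                    /* out never runs ahead of in */
    out += n; in += n; len -= n;
  }
  *out = '\0';
  return (size_t)(out - (unsigned char*)s);
}
```

In the question's loop, this would be called on each line right after getline succeeds, passing the buffer and the byte count that getline returned.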

Notes

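1. Except for possible line-ending translation by the underlying I/O library, which replaces CR-LF with a single \n on systems such as Windows, where the two-character CR-LF sequence marks the end of a line.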