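I am trying to read and manipulate Urdu text from a file. However, it seems that a character is not read whole into a wchar_t variable. Here is my code, which reads the text and prints each character on a new line: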

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

void main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");
    printf("This program tests Urdu reading:\n");
    wchar_t c;
    FILE *f = fopen("urdu.txt", "r");
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc\n", c);
    }
    fclose(f);
}


Here is my sample text:

میرا نام ابراھیم ھے۔

میں وینڈربلٹ یونیورسٹی میں پڑھتا ھوں۔


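However, the program seems to print about twice as many characters as there are letters in the text. I understand that wide or multibyte characters take more than one byte, but I thought the wchar_t type would store all the bytes that belong to one letter of the alphabet together.

How can I read the text so that, at any given time, a whole character is stored in a variable?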

Details about my environment:
gcc: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 5.3.0
OS: Windows 10 64 bit
Text file encoding: UTF-8

This is what my text looks like in hexadecimal format:

d9 85 db 8c d8 b1 d8 a7 20 d9 86 d8 a7 d9 85 20 d8 a7 d8 a8 d8 b1 d8 a7 da be db 8c d9 85 20 da be db 92 db 94 0a d9 85 db 8c da ba 20 d9 88 db 8c d9 86 da 88 d8 b1 d8 a8 d9 84 d9 b9 20 db 8c d9 88 d9 86 db 8c d9 88 d8 b1 d8 b3 d9 b9 db 8c 20 d9 85 db 8c da ba 20 d9 be da 91 da be d8 aa d8 a7 20 da be d9 88 da ba db 94 0a

Best answer

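Windows support for Unicode is mostly proprietary, and it is not possible to write portable software that uses UTF-8 and relies on the native Windows libraries. If you are willing to consider a non-portable solution, here is one: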

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <fcntl.h>
#include <io.h>

int main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");

    // The next line is needed to output wchar_t data to the console. Note that
    // Urdu characters are not supported by standard console fonts. You may
    // have to install appropriate fonts to see Urdu on the console.
    // Failing that, redirecting to a file and opening it with a text editor
    // should show the Urdu characters.

    _setmode(_fileno(stdout), _O_U16TEXT);

    // Mixing wide-character and narrow-character output to stdout is not
    // a good idea, so wprintf is used throughout. (Not Windows-specific)

    wprintf(L"This program tests UTF-8 reading:\n");

    // WEOF is not guaranteed to fit into wchar_t. A wint_t is needed both
    // to hold the result of fgetwc and to pass a character to %lc.
    // (Not Windows-specific)

    wint_t c;

    // The next line passes a non-standard parameter to fopen, ccs=...
    // This is the Windows way to support different file encodings, since
    // the C runtime targeted here provides no UTF-8 locale.

    FILE *f = fopen("urdu.txt", "r,ccs=UTF-8");
    if (f == NULL) {
        // Without this check, fgetwc would be called on a null stream.
        return 1;
    }

    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc", c);
    }
    fclose(f);
    return 0;
}
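
As for why the program in the question printed roughly twice as many characters: most likely the stream was not decoding UTF-8 at all, so each byte was widened into its own wchar_t, and every Urdu letter in the sample occupies two bytes in UTF-8. The sketch below illustrates the arithmetic on the first line of the hex dump. The helper name utf8_count_code_points and the array sample_first_line are made up for this illustration, and the helper assumes well-formed UTF-8.

```c
#include <stddef.h>

/* First line of the sample text, copied from the hex dump in the question
   (without the trailing newline). */
static const unsigned char sample_first_line[] = {
    0xd9, 0x85, 0xdb, 0x8c, 0xd8, 0xb1, 0xd8, 0xa7, 0x20,
    0xd9, 0x86, 0xd8, 0xa7, 0xd9, 0x85, 0x20,
    0xd8, 0xa7, 0xd8, 0xa8, 0xd8, 0xb1, 0xd8, 0xa7,
    0xda, 0xbe, 0xdb, 0x8c, 0xd9, 0x85, 0x20,
    0xda, 0xbe, 0xdb, 0x92, 0xdb, 0x94
};

/* Hypothetical helper: counts code points in a UTF-8 byte sequence by
   counting every byte that is not a continuation byte (bit pattern
   10xxxxxx). Assumes the input is well-formed UTF-8. */
size_t utf8_count_code_points(const unsigned char *bytes, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((bytes[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

Calling utf8_count_code_points(sample_first_line, sizeof sample_first_line) yields 20 code points for 37 bytes: 17 two-byte letters plus 3 single-byte spaces. A byte-per-character reader would therefore emit 37 items for that line instead of 20.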


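With glibc, on the other hand (or in a POSIX-style environment such as Cygwin), none of the Windows extensions used above (_setmode and the ccs=UTF-8 flag) are needed, because the C library decodes UTF-8 internally once a UTF-8 locale is selected.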
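
If the program should depend on neither the ccs= extension nor a UTF-8 locale, the bytes can also be decoded by hand. What follows is only a minimal sketch under stated assumptions: utf8_decode is a hypothetical helper name, not a library function, and for brevity it does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of any library: decodes one UTF-8 sequence
   starting at s, with at most len bytes available. On success it stores the
   code point in *out and returns the number of bytes consumed (1 to 4).
   It returns 0 on truncated or malformed input. Overlong encodings and
   surrogate code points are NOT rejected in this sketch. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out) {
    if (len == 0) {
        return 0;
    }

    unsigned char b0 = s[0];
    size_t need;   /* total length of the sequence, per the lead byte */
    uint32_t cp;   /* code point accumulated so far */

    if (b0 < 0x80) {
        need = 1; cp = b0;               /* 0xxxxxxx: ASCII */
    } else if ((b0 & 0xE0) == 0xC0) {
        need = 2; cp = b0 & 0x1F;        /* 110xxxxx */
    } else if ((b0 & 0xF0) == 0xE0) {
        need = 3; cp = b0 & 0x0F;        /* 1110xxxx */
    } else if ((b0 & 0xF8) == 0xF0) {
        need = 4; cp = b0 & 0x07;        /* 11110xxx */
    } else {
        return 0;                        /* stray continuation or invalid lead byte */
    }

    if (need > len) {
        return 0;                        /* sequence is cut off */
    }

    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return 0;                    /* expected a 10xxxxxx continuation byte */
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *out = cp;
    return need;
}
```

To use it, the file would be opened in binary mode ("rb"), read into a buffer, and walked by repeatedly calling utf8_decode and advancing by the returned length. For the first two bytes of the sample, d9 85, the helper consumes 2 bytes and produces U+0645. Printing the resulting code points on Windows still needs the wide-console setup shown in the answer's code.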

10-06 09:12