c - How do I replace multi-character symbols with other symbols in C?

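I have a UTF-8 text file containing several symbols that I want to replace with others (only the symbols between |( and |)). The problem is that some of these symbols are not treated as single characters but as multi-character constants. (I mean they cannot be written as '∞', only as "∞", so as a char *?)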

Here is my text file:

Text : |(abc∞∪v=|)

For example:

∞ should be replaced by ¤c

∪ by ¸!

= by "

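So, because some of the symbols (∞ and ∪) are multi-character, I decided to read all the text word by word with fscanf. The problem with this method is that I have to put a space between every character... My file would have to look like this: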

Text : |( a b c ∞ ∪ v = |)

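I cannot use fgetc, because a character like ∞ cannot be read as a single char. If I used it, I could not strcmp a char against each symbol (a char *). I tried converting my char to a char *, but strcmp still returned a nonzero value.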

Here is the code I wrote in C to help you understand my problem:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(void){
    char *carac[]={"∞","=","∪"}; //array with our signs
    FILE *flot,*flot3;
    flot=fopen("fichierdeTest2.txt","r"); // input text file
    flot3=fopen("resultat.txt","w"); //output file
    int i=0,j=0;
    char a[1024]; //array that will contain each read word.
    while(!feof(flot))
    {
        fscanf(flot,"%s",&a[i]);
        if (strstr(&a[i], "|(") != NULL){ // if the word read contains |(  then j=1
            j=1;
            fprintf(flot3,"|(");
        }
        if (strcmp(&a[i], "|)") == 0)
            j=0;
        if(j==1) { //it means we are between |( and |) so the conversion can begin
            if (strcmp(carac[0], &a[i]) == 0) { fprintf(flot3, "¤c"); }
            else if (strcmp(carac[1], &a[i]) == 0) { fprintf(flot3,"\"" ); }
            else if (strcmp(carac[2], &a[i]) == 0) { fprintf(flot3, " ¸!"); }
            else fprintf(flot3,"%s",&a[i]); // when it's a letter, number or sign that doesn't need to be converted
        }
        else { // when we are not between |( and |) just copy the word to the output file with a space after it
            fprintf(flot3, "%s", &a[i]);
            fprintf(flot3, " ");
        }
        i++;
    }
}

Thanks in advance for your help!

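Edit: If I put a space between each symbol, every symbol is replaced correctly. Without the spaces it does not work, and that is the problem I am trying to solve.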

Best answer

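First, get the terminology right. The correct terms are a bit confusing, but at least other people will understand what you are talking about.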

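In C, a char is the same as a byte. A character, however, is an abstract thing, such as ∞ or ¤ or c. A single character may consist of several bytes (that is, several chars). Such characters are called multibyte characters.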
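To make the byte/character distinction concrete, here is a minimal sketch that prints the bytes of a string. The helper name `dump_bytes` is hypothetical, and the escapes assume the UTF-8 encoding of ∞.

```c
#include <stdio.h>
#include <string.h>

/* Print every byte of a C string in hexadecimal and return the byte count.
   dump_bytes is a hypothetical helper, not part of any library. */
size_t dump_bytes(const char *s)
{
    size_t n = strlen(s); /* strlen counts bytes (chars), not characters */
    for (size_t k = 0; k < n; k++)
        printf("%02x ", (unsigned char)s[k]);
    printf("(%zu bytes)\n", n);
    return n;
}
```

Calling `dump_bytes("\xe2\x88\x9e")` (∞ in UTF-8) reports three bytes, while `dump_bytes("=")` reports one. This is why a single fgetc call cannot return ∞ as one char.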

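Converting characters to byte sequences (encoding) is not trivial. Different systems do it differently: some use UTF-8, while others may use UTF-16 big-endian, UTF-16 little-endian, an 8-bit code page, or some other encoding.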

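When your C program has something in double quotes, such as "∞", it is a C string: several bytes terminated by a zero byte. When your code compares strings with strcmp, it compares the two strings byte by byte to make sure they are equal. So if your source code and your input file use different encodings, the strings (byte sequences) will not match, even though you see the same characters when you inspect them!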
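As an illustration of such a mismatch, the sketch below compares the character ¤ encoded in two different ways. The macro names and the `same_bytes` helper are hypothetical. The byte values are the UTF-8 and ISO-8859-1 (Latin-1) encodings of ¤.

```c
#include <string.h>

/* The same character, ¤, in two encodings (hypothetical macro names). */
#define CURRENCY_UTF8   "\xc2\xa4" /* two bytes in UTF-8 */
#define CURRENCY_LATIN1 "\xa4"     /* one byte in ISO-8859-1 */

/* Return 1 when two strings are byte-for-byte identical, as strcmp sees them. */
int same_bytes(const char *x, const char *y)
{
    return strcmp(x, y) == 0;
}
```

Here `same_bytes(CURRENCY_UTF8, CURRENCY_LATIN1)` returns 0 even though both strings display as ¤ in an editor that uses the matching encoding.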

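So, to rule out any encoding mismatch, you may want to use byte sequences instead of characters in your source code. For example, if you know your input file is encoded in UTF-8: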

char *carac[]={
    "\xe2\x88\x9e", // ∞
    "=",
    "\xe2\x88\xaa"}; // ∪

Alternatively, make sure the encodings (of your source code and of your program's input file) are the same.

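Another, less subtle problem: when comparing strings, you actually have a big string and a small string, and you want to check whether the big string starts with the small one. strcmp does the wrong thing here! You have to use strncmp instead: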

if (strncmp(carac[0], &a[i], strlen(carac[0])) == 0)
{
    fprintf(flot3, "\xC2\xA4""c"); // ¤c
}
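The prefix test can be wrapped in a small helper so that the caller also learns how many bytes to skip. This is only a sketch, and the name `match_prefix` is hypothetical.

```c
#include <string.h>

/* Return the byte length of `sym` if `text` begins with it, otherwise 0.
   match_prefix is a hypothetical helper name. */
size_t match_prefix(const char *text, const char *sym)
{
    size_t n = strlen(sym);
    return strncmp(text, sym, n) == 0 ? n : 0;
}
```

For the text `"\xe2\x88\x9e\xe2\x88\xaav=|)"` and the symbol `"\xe2\x88\x9e"` (∞), the helper returns 3, whereas strcmp on the same two strings reports a difference because the text continues after the symbol.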

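Another problem (actually a major bug): the fscanf function reads words (text delimited by whitespace) from the input file. If you check only the first byte of each word, the remaining bytes are never processed. To fix this, loop over all the bytes: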

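fscanf(flot,"%s",a);
for (i = 0; a[i] != '\0'; )
{
    if (strncmp(&a[i], "|(", 2) == 0) // start pattern
    {
        now_replacing = 1;
        fputs("|(", output); // keep the delimiter in the output
        i += 2;
        continue;
    }
    if (strncmp(&a[i], "|)", 2) == 0) // end pattern
    {
        now_replacing = 0;
        fputs("|)", output);
        i += 2;
        continue;
    }
    if (now_replacing && strncmp(&a[i], whatever, strlen(whatever)) == 0)
    {
        fprintf(...); // write the replacement for `whatever`
        i += strlen(whatever); // skip every byte of the matched symbol
        continue;
    }
    fputc(a[i], output); // anything else is copied through unchanged
    i += 1; // processed just one char
}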
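Putting the pieces together, here is a self-contained sketch of the byte loop that works on an in-memory string rather than a file. The function name `convert`, the `from`/`to` tables, and the output-buffer interface are assumptions made for illustration. The replacements follow the question (∞ to ¤c, = to ", ∪ to ¸!), written as UTF-8 byte sequences.

```c
#include <string.h>

/* Symbols to replace and their replacements, as UTF-8 byte sequences. */
static const char *const from[] = { "\xe2\x88\x9e", "=", "\xe2\x88\xaa" }; /* ∞ = ∪ */
static const char *const to[]   = { "\xc2\xa4" "c", "\"", "\xc2\xb8" "!" }; /* ¤c " ¸! */

/* Copy `in` to `out` (capacity `cap` bytes), replacing the symbols above
   only between "|(" and "|)". Returns the number of bytes written,
   not counting the terminating zero. Output is truncated if it does not fit. */
size_t convert(const char *in, char *out, size_t cap)
{
    size_t len = 0;    /* bytes written so far */
    int replacing = 0; /* nonzero while between |( and |) */

    if (cap == 0)
        return 0;
    while (*in != '\0') {
        const char *emit = in; /* default: copy one byte unchanged */
        size_t emit_len = 1, consumed = 1;

        if (strncmp(in, "|(", 2) == 0) {        /* start pattern */
            replacing = 1;
            emit_len = consumed = 2;
        } else if (strncmp(in, "|)", 2) == 0) { /* end pattern */
            replacing = 0;
            emit_len = consumed = 2;
        } else if (replacing) {
            for (size_t k = 0; k < sizeof from / sizeof from[0]; k++) {
                size_t n = strlen(from[k]);
                if (strncmp(in, from[k], n) == 0) { /* prefix match */
                    emit = to[k];
                    emit_len = strlen(to[k]);
                    consumed = n;
                    break;
                }
            }
        }
        if (len + emit_len >= cap) /* keep room for the terminating zero */
            break;
        memcpy(out + len, emit, emit_len);
        len += emit_len;
        in += consumed;
    }
    out[len] = '\0';
    return len;
}
```

With a buffer such as `char buf[128]`, calling `convert("Text : |(abc\xe2\x88\x9e\xe2\x88\xaav=|)", buf, sizeof buf)` leaves `Text : |(abc¤c¸!v"|)` in `buf`, with no spaces needed between the symbols. Symbols outside the |( ... |) region are copied unchanged.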

Regarding "c - How do I replace multi-character symbols with other symbols in C?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/41265571/