Why does the C runtime on Mac OS allow both precomposed and decomposed UTF-8?

Problem description

So we all know that the filesystem on Mac OS has this wacky feature of using fully decomposed UTF-8. If you call POSIX APIs like realpath(), for example, you'll get such a fully decomposed UTF-8 string back from Mac OS. When using APIs like fopen(), however, passing precomposed UTF-8 seems to work as well.

Here is a little demo program which attempts to open a file named ä. The first call to fopen() passes a precomposed UTF-8 string, the second call passes a decomposed UTF-8 string, and to my surprise both work. I'd expect only the second one to work, but precomposed UTF-8 works as well.

#include <stdio.h>

int main(int argc, char *argv[])
{
    FILE *fp, *fp2;

    fp = fopen("\xc3\xa4", "rb");       // ä as precomposed UTF-8
    fp2 = fopen("\x61\xcc\x88", "rb");  // ä as decomposed UTF-8

    printf("CHECK: %p %p\n", fp, fp2);

    if(fp) fclose(fp);
    if(fp2) fclose(fp2);

    return 0;
}

Now to my questions:

  1. Is this defined behaviour? i.e. is it allowed to pass precomposed UTF-8 to POSIX APIs or should I always pass decomposed UTF-8?

  2. How can functions like fopen() even know whether the file passed contains precomposed or decomposed UTF-8? Couldn't this even lead to all sorts of issues, e.g. wrong files being opened because the passed string can be interpreted in two different ways and thus potentially point to two different files? This is somewhat confusing me.

EDIT

To make the confusion complete, this weird behaviour doesn't even seem to be limited to file I/O. Take a look at this code:

#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("\xc3\xa4\n");
    printf("\x61\xcc\x88\n");

    return 0;
}

Both printf calls do exactly the same, i.e. they both print the character ä, the first call using precomposed UTF-8 and the second one using decomposed UTF-8. It's really weird.

Solution

There are two different types of equivalence between Unicode strings: canonical equivalence and compatibility equivalence. Since your question is about strings that the software seems to consider identical, let's focus on canonical equivalence (compatibility equivalence, OTOH, allows for semantic differences, so it's off-topic for this question).

Citing from Unicode equivalence in Wikipedia:

In other words, if two strings are canonically equivalent, software should treat them as representing exactly the same thing. So Mac OS is doing the correct thing here: you have two different UTF-8 strings (one decomposed, the other precomposed), but they are canonically equivalent, so they map to the same object (the same file name in your example). That's correct (remember the "should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other" line in the quote above).
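One way to check the "same object" claim from C is to stat() both spellings and compare device and inode numbers. This is only a minimal sketch: same_file() and report_same_file() are hypothetical helper names that are not part of the question, and it assumes a file named ä already exists in the current directory.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if both paths resolve to the same file,
 * 0 if they resolve to different files, -1 if either stat() call fails. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;

    /* Two paths name the same file when device and inode both match. */
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Hypothetical helper: prints the result of same_file() in words. */
void report_same_file(const char *a, const char *b)
{
    int r = same_file(a, b);

    printf("%s\n", r == 1 ? "same file"
                 : r == 0 ? "different files"
                          : "stat() failed");
}
```

Calling report_same_file("\xc3\xa4", "\x61\xcc\x88") on the setup described in the question would be expected to print "same file", since both fopen() calls succeeded there. On a filesystem that compares names byte for byte, the second stat() will most likely fail instead.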

I don't really understand your second example about printf(). Yes, both a decomposed character and a precomposed character render the same output. Note that printf() itself just writes the bytes it is given; it is whatever displays the text (your terminal) that draws both sequences as the same glyph. That's precisely the point of the dual representation of characters supported by Unicode: you can choose whether to represent a combined character with a precomposed sequence of bytes or a decomposed sequence of bytes. They print the same visual result, but their representation is different. If both representations are canonically equivalent (in some cases they are, in some cases they are not), then the system must consider them two representations of the same object.
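To see that the representation really is different even though the rendered result is the same, you can dump the bytes of each string. A small sketch; utf8_hex() is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s into out as lowercase hex,
 * separated by spaces, and returns out. Stops early if out is too small. */
char *utf8_hex(const char *s, char *out, size_t cap)
{
    size_t w = 0;

    assert(s != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    for (size_t i = 0; s[i] != '\0'; i++) {
        int n = snprintf(out + w, cap - w, i ? " %02x" : "%02x",
                         (unsigned char)s[i]);
        if (n < 0 || (size_t)n >= cap - w) {
            out[w] = '\0';   /* drop the partially written byte */
            break;
        }
        w += (size_t)n;
    }
    return out;
}
```

With a buffer of, say, 32 bytes, utf8_hex("\xc3\xa4", buf, sizeof buf) yields "c3 a4" and utf8_hex("\x61\xcc\x88", buf, sizeof buf) yields "61 cc 88": two bytes versus three, so strcmp() on the raw strings reports them as different.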

In order to manage all of this more comfortably in your software, you should normalize your Unicode strings before working with them.
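As an illustration of what "normalize before working with them" buys you, here is a deliberately tiny toy. It is NOT Unicode normalization: compose_diaeresis() is a hypothetical function that only folds the letters a, o, u, A, O, U followed by U+0308 (COMBINING DIAERESIS) into their precomposed forms, and copies everything else unchanged. In real code you would use a proper library instead, for example ICU's unorm2_normalize() or CoreFoundation's CFStringNormalize() on Mac OS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy table (assumption for this sketch only): base letter and the second
 * UTF-8 byte of its precomposed form; the first byte is always 0xc3. */
static const struct { char base; unsigned char tail; } PAIRS[] = {
    {'a', 0xa4}, {'o', 0xb6}, {'u', 0xbc},
    {'A', 0x84}, {'O', 0x96}, {'U', 0x9c},
};

/* Copies in to out, replacing <letter> + U+0308 (bytes cc 88) with the
 * precomposed letter when the letter is in PAIRS. Returns the number of
 * bytes written (excluding the NUL), or -1 if out is too small. */
int compose_diaeresis(const char *in, char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    const unsigned char *s = (const unsigned char *)in;
    size_t n = strlen(in), w = 0, i = 0;

    while (i < n) {
        int tail = -1;

        /* Is the current byte followed by the UTF-8 encoding of U+0308? */
        if (i + 2 < n && s[i + 1] == 0xcc && s[i + 2] == 0x88) {
            for (size_t k = 0; k < sizeof PAIRS / sizeof PAIRS[0]; k++)
                if (in[i] == PAIRS[k].base)
                    tail = PAIRS[k].tail;
        }

        if (tail >= 0) {
            if (w + 2 >= cap) return -1;   /* two bytes plus final NUL */
            out[w++] = (char)0xc3;
            out[w++] = (char)tail;
            i += 3;
        } else {
            if (w + 1 >= cap) return -1;   /* one byte plus final NUL */
            out[w++] = in[i++];
        }
    }

    if (w >= cap) return -1;
    out[w] = '\0';
    return (int)w;
}
```

Running both "\xc3\xa4" and "\x61\xcc\x88" through compose_diaeresis() gives the same two bytes, so a plain strcmp() on the results treats them as equal, which is the behaviour you want once strings have been brought to one normalization form.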

10-28 13:40