Question

It's not clear to me what encodings are used where in C's argv. In particular, I'm interested in the following scenario:


  • A user uses locale L1 to create a file whose name, N, contains non-ASCII characters
  • Later on, a user uses locale L2 to tab-complete the name of that file on the command line, which is fed into a program P as a command line argument

What sequence of bytes does P see on the command line?

I have observed that on Linux, creating a filename in a UTF-8 locale and then tab-completing it in (e.g.) the zh_TW.big5 locale seems to cause my program P to be fed UTF-8 rather than Big5. However, on OS X the same series of actions results in my program P getting a Big5-encoded filename.

Here is what I think is going on so far (long, and I'm probably wrong and need to be corrected):

On Windows, file names are stored on disk in some Unicode format. So Windows takes the name N, converts it from L1 (the current code page) to a Unicode version of N that we will call N1, and stores N1 on disk.

What I then assume happens is that when tab-completing later on, the name N1 is converted to locale L2 (the new current code page) for display. With luck, this will yield the original name N -- but this won't be true if N contained characters unrepresentable in L2. We call the new name N2.

When the user actually presses enter to run P with that argument, the name N2 is converted back into Unicode, yielding N1 again. This N1 is now available to the program in UTF-16 form via GetCommandLineW or wmain (or _tmain in a Unicode build), but users of GetCommandLine/main will see the name N2 in the current locale (code page).

On OS X, the disk-storage story is the same, as far as I know: OS X stores file names as Unicode.

With a Unicode terminal, I think what happens is that the terminal builds the command line in a Unicode buffer. So when you tab complete, it copies the file name as a Unicode file name to that buffer.

When you run the command, that Unicode buffer is converted to the current locale, L2, and fed to the program via argv, and the program can decode argv with the current locale into Unicode for display.

On Linux, everything is different and I'm extra-confused about what is going on. Linux stores file names as byte strings, not in Unicode. So if you create a file with name N in locale L1, then N as a byte string (encoded in L1) is what is stored on disk.

When I later run the terminal and try and tab-complete the name, I'm not sure what happens. It looks to me like the command line is constructed as a byte buffer, and the name of the file as a byte string is just concatenated onto that buffer. I assume that when you type a standard character it is encoded on the fly to bytes that are appended to that buffer.

When you run a program, I think that buffer is sent directly to argv. Now, what encoding does argv have? It looks like any characters you typed in the command line while in locale L2 will be in the L2 encoding, but the file name will be in the L1 encoding. So argv contains a mixture of two encodings!

I'd really like it if someone could let me know what is going on here. All I have at the moment is half-guesses and speculation, and it doesn't really fit together. What I'd really like to be true is for argv to be encoded in the current code page (Windows) or the current locale (Linux / OS X) but that doesn't seem to be the case...

Here is a simple candidate program P that lets you observe encodings for yourself:

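#include <stdio.h>

/* Print each byte of argv[1] as a decimal value, then the byte count. */
int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    int len = 0;
    for (char *c = argv[1]; *c; c++, len++) {
        /* Cast through unsigned char so that bytes >= 0x80 print as
           128..255 rather than as negative numbers where char is signed. */
        printf("%d ", (int)(unsigned char)(*c));
    }

    printf("\nLength: %d\n", len);

    return 0;
}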

You can use locale -a to see available locales, and use export LC_ALL=my_encoding to change your locale.

Recommended answer

Thanks everyone for your responses. I have learnt quite a lot about this issue and have discovered the following things, which have resolved my question:


  1. As discussed, on Windows argv is encoded using the current code page. However, you can retrieve the command line as UTF-16 using GetCommandLineW. Use of argv is not recommended for modern Windows apps with Unicode support because code pages are deprecated.

  2. On Unixes, argv has no fixed encoding:

a) File names inserted by tab-completion/globbing will occur in argv verbatim as exactly the byte sequences by which they are named on disk. This is true even if those byte sequences make no sense in the current locale.

b) Input entered directly by the user using their IME will occur in argv in the locale encoding. (Ubuntu seems to use the locale settings to decide how to encode IME input, whereas OS X uses the Terminal.app encoding preference.)

This is annoying for languages such as Python, Haskell or Java, which want to treat command-line arguments as strings. They need to decide how to decode argv into whatever representation is used internally for a String (UTF-16 in Java; a sequence of Unicode code points in Python 3 and Haskell). However, if they simply decode with the locale encoding, then perfectly valid file names in the input may fail to decode, causing an exception.

The solution to this problem adopted by Python 3 is a surrogate-byte encoding scheme (http://www.python.org/dev/peps/pep-0383/), which represents each undecodable byte in argv as a special Unicode code point (a lone surrogate). When that code point is encoded back to a byte stream, it becomes the original byte again. This allows data from argv that is not valid in the current encoding (i.e. a file name created under some other locale) to round-trip through the native Python string type and back to bytes with no loss of information.

As you can see, the situation is pretty messy :-)
