Does the multibyte-to-wide-string conversion function "mbstowcs" use the source file's encoding when passed a string literal?

Question
ADDENDUM: A tentative answer of my own appears at the bottom of the question.

I am converting an archaic VC6 C++/MFC project to VS2013 and Unicode, based on the recommendations at utf8everywhere.org.

Along the way, I have been studying Unicode, UTF-16, UCS-2, UTF-8, the standard library's and STL's support for Unicode and UTF-8 (or rather, the standard library's lack of support), ICU, Boost.Locale, and of course the Windows SDK and MFC APIs that require UTF-16 wchar_t's.

As I have been studying these issues, one question keeps recurring that I have not been able to answer to my satisfaction in a clarified way.

Consider the C library function mbstowcs. This function has the following signature:

```cpp
size_t mbstowcs(wchar_t* dest, const char* src, size_t max);
```

The second parameter src is, according to the documentation, a "C-string with the multibyte characters to be interpreted. The multibyte sequence shall begin in the initial shift state."

My question is in regard to this multibyte string. It is my understanding that the encoding of a multibyte string can differ from string to string, and that the encoding is not specified by the standard. Nor does the MSVC documentation for this function seem to specify a particular encoding.

My understanding at this point is that on Windows, this multibyte string is expected to be encoded with the ANSI code page of the active locale. But my clarity begins to fade at this point.
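To make that locale dependence concrete, here is a minimal sketch I put together (the ANSI locale name is an assumption about a typical US/Western-European Windows setup) showing the same two bytes being decoded differently depending on which locale happens to be active:

```cpp
#include <clocale>
#include <cstdlib>
#include <cstdio>

int main()
{
    // The bytes 0xC3 0xA9 encode "é" in UTF-8, but "Ã©" in Windows-1252.
    const char bytes[] = "\xC3\xA9";
    wchar_t dest[8] = {};

    // What mbstowcs produces depends entirely on the active C locale.
    // In the default "C" locale, handling of non-ASCII bytes is
    // implementation-defined (MSVC zero-extends each byte; glibc fails).
    std::setlocale(LC_ALL, "C");
    std::size_t n1 = std::mbstowcs(dest, bytes, 8);

    // Under an ANSI locale (name assumed; adjust for your system), each
    // byte is decoded as one CP1252 character - the runtime cannot know
    // the buffer was really UTF-8.
    std::setlocale(LC_ALL, "English_United States.1252");
    std::size_t n2 = std::mbstowcs(dest, bytes, 8);

    std::printf("converted %u wide chars in \"C\", %u in CP1252\n",
                (unsigned)n1, (unsigned)n2);
}
```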
I have been wondering whether the encoding of the source code file itself makes a difference in the behavior of mbstowcs, at least on Windows. And I'm also confused about what happens at compile time vs. what happens at run time. Suppose you have a string literal passed to mbstowcs, like this:

```cpp
wchar_t dest[1024];
mbstowcs(dest, "Hello, world!", 1024);
```

Suppose this code is compiled on a Windows machine. Suppose that the code page of the source code file itself is different from the code page of the current locale on the machine on which the compiler runs. Will the compiler take the source code file's encoding into consideration? Will the resulting binary be affected by the fact that the code page of the source code file differs from the code page of the active locale under which the compiler runs?

On the other hand, maybe I have it wrong - maybe the active locale of the runtime machine determines the code page that is expected of the string literal. Would the code page with which the source code file is saved therefore need to match the code page of the computer on which the program ultimately runs? That seems so whacked to me that I find it hard to believe it would be the case. But as you can see, my clarity is lacking here.

On the other hand, if we change the call to mbstowcs to explicitly pass a UTF-8 string:

```cpp
wchar_t dest[1024];
mbstowcs(dest, u8"Hello, world!", 1024);
```

... I assume that mbstowcs will always do the right thing - regardless of the code page of the source file, the current locale of the compiler, or the current locale of the computer on which the code runs. Am I correct about this?

I would appreciate clarity on these matters, in particular in regard to the specific questions I have raised above. If any or all of my questions are ill-formed, I would appreciate knowing that as well.
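For concreteness, here is a small probe I would use to test the u8 scenario, using a non-ASCII character so that the ANSI and UTF-8 encodings actually differ (a sketch only; it assumes a C++11/14 compiler):

```cpp
#include <clocale>
#include <cstdlib>
#include <cstdio>

int main()
{
    std::setlocale(LC_ALL, "");  // adopt the user's default (ANSI) locale

    wchar_t dest[16] = {};
    // u8"\u00E9" is guaranteed to be the UTF-8 bytes 0xC3 0xA9, regardless
    // of the source file's code page or the execution character set.
    // (C++11/14; in C++20 u8 literals become char8_t and need a cast.)
    std::mbstowcs(dest, u8"\u00E9", 16);

    if (dest[0] == L'\u00E9' && dest[1] == L'\0')
        std::puts("locale decoded the UTF-8 bytes correctly");
    else
        std::puts("mojibake: the active locale is not UTF-8");
}
```

On a Windows machine with a CP1252-based locale, I would expect the mojibake branch: mbstowcs decodes the two UTF-8 bytes as two separate ANSI characters.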
ADDENDUM

From the lengthy comments beneath @TheUndeadFish's answer, and from the answer to a question on a very similar topic here, I believe I have a tentative answer to my own question that I'd like to propose.

Let's follow the raw bytes of the source code file to see how the actual bytes are transformed through the entire process of compilation to runtime behavior:

- The C++ standard ostensibly requires that all characters in any source code file be a (particular) 96-character subset of ASCII called the basic source character set. (But see the following bullet points.)
- In terms of the actual byte-level encoding of these 96 characters in the source code file, the standard does not specify any particular encoding, but all 96 characters are ASCII characters, so in practice there is never a question about what encoding the source file is in, because every encoding in existence represents these 96 ASCII characters using the same raw bytes.
- However, character literals and code comments might commonly contain characters outside these basic 96. This is typically supported by the compiler (even though it isn't required by the C++ standard). The source code's character set is called the source character set. But the compiler needs to have these same characters available in its internal character set (called the execution character set), or else those missing characters will be replaced by some other (dummy) character (such as a square or a question mark) before the compiler actually processes the source code - see the discussion that follows.
- How the compiler determines the encoding that is used to encode the characters of the source code file (when characters appear that are outside the basic source character set) is implementation-defined.
- Note that it is possible for the compiler to use a different character set (encoded however it likes) for its internal execution character set than the character set represented by the encoding of the source code file!
- This means that even if the compiler knows the encoding of the source code file (which implies that the compiler also knows all the characters in the source code's character set), the compiler might still be forced to convert some characters in the source code's character set to different characters in the execution character set (thereby losing information). The standard states that this is acceptable, but that the compiler must not convert any character in the source character set to the NULL character in the execution character set.
- The C++ standard says nothing about the encoding used for the execution character set, just as it says nothing about the characters that are required to be supported in the execution character set (other than the characters in the basic execution character set, which includes all characters in the basic source character set plus a handful of additional ones such as the NULL character and the backspace character).
- It is not clearly documented anywhere, even by Microsoft, how any of this process is handled in MSVC - i.e., how the compiler figures out what the encoding and corresponding character set of the source code file are, what the choice of execution character set is, and what encoding will be used for the execution character set during compilation of the source code file.
- It seems that in the case of MSVC, the compiler makes a best-guess effort in its attempt to select an encoding (and corresponding character set) for any given source code file, falling back on the current locale's default code page of the machine the compiler is running on. Alternatively, you can take special steps to save the source code files as Unicode using an editor that will write the proper byte-order mark (BOM) at the beginning of each source code file. This includes UTF-8, for which the BOM is typically optional or excluded - in the case of source code files read by the MSVC compiler, you must include the UTF-8 BOM.
- As for the execution character set and its encoding for MSVC, continue on with the next bullet point.
- The compiler proceeds to read the source file and converts the raw bytes of the characters of the source code file from the encoding of the source character set into the (potentially different) encoding of the corresponding character in the execution character set (which will be the same character, if the given character is present in both character sets).
- Ignoring code comments and character literals, all such characters are typically in the basic execution character set noted above. This is a subset of the ASCII character set, so encoding issues are irrelevant (in practice, all of these characters are encoded identically on all compilers).
- Regarding the code comments and character literals, though: the code comments are discarded, and if the character literals contain only characters in the basic source character set, then there is no problem - these characters will belong to the basic execution character set and still be ASCII.
- But if the character literals in the source code contain characters outside the basic source character set, then these characters are, as noted above, converted to the execution character set (possibly with some loss). But as noted, neither the characters nor the encoding of this character set is defined by the C++ standard. Again, the MSVC documentation seems to be very weak on what this encoding and character set will be. Perhaps it is the default ANSI encoding indicated by the active locale on the machine on which the compiler runs? Perhaps it is UTF-16?
- In any case, the raw bytes that will be burned into the executable for the string literal correspond exactly to the compiler's encoding of the characters in the execution character set (one way to check this for yourself is the byte-dump probe sketched after this list).
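A quick way to take the guesswork out of the compile-time half is to dump the bytes the compiler actually burned into a literal. The following probe is a sketch of that idea, using \u00E9 ("é") as a convenient non-ASCII test character:

```cpp
#include <cstdio>

static void dump(const char* label, const char* s)
{
    std::printf("%s:", label);
    for (; *s; ++s)
        std::printf(" %02X", (unsigned char)*s);
    std::printf("\n");
}

int main()
{
    // The narrow literal's bytes come from the execution character set;
    // the u8 literal's bytes are guaranteed to be UTF-8 (C++11/14).
    dump("narrow", "\u00E9");   // e.g. E9 under a CP1252 execution charset
    dump("u8    ", u8"\u00E9"); // always C3 A9
}
```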
- At runtime, mbstowcs is called and is passed the bytes described above, unchanged.
- It is now time for the C runtime library to interpret the bytes that are passed to mbstowcs.
- Because no locale is provided with the call to mbstowcs, the C runtime has no idea what encoding to use when it receives these bytes - this is arguably the weakest link in this chain.
- The C++ (and C) standards do not document what encoding should be used to read the bytes passed to mbstowcs. I am not sure whether the standard states that the input to mbstowcs is expected to be in the same execution character set as the compiler's, or whether the encoding is expected to be the same for the compiler as for the C runtime implementation of mbstowcs.
- But my tentative guess is that in the MSVC C runtime, the locale of the currently running thread will be used to determine both the runtime execution character set and the encoding representing that character set, which will be used to interpret the bytes passed to mbstowcs.
- This means that it will be very easy for these bytes to be misinterpreted as characters different from those that were encoded in the source code file - very ugly, as far as I'm concerned.
- If I'm right about all this, then if you want to force the C runtime to use a particular encoding, you should call the Windows SDK's MultiByteToWideChar, as @HarryJohnston's comment indicates, because you can pass the desired encoding to that function.
- Due to the above mess, there really isn't an automatic way to deal with character literals in source code files.
- Therefore, as https://stackoverflow.com/a/1866668/368896 mentions, if there's a chance you'll have non-ASCII characters in your character literals, you should use resources (such as GetText's method, which also works via Boost.Locale on Windows in conjunction with the xgettext.exe that ships with Poedit), and in your source code simply write functions to load the resources as raw (unchanged) bytes.
- Make sure to save your resource files as UTF-8, and then make sure to call functions at runtime that explicitly support UTF-8 for their char*'s and std::string's - for example (per the recommendations at utf8everywhere.org), using Boost.Nowide (not really in Boost yet, I think) to convert from UTF-8 to wchar_t at the last possible moment prior to calling any Windows API functions that write text to dialog boxes, etc. (and using the W forms of those Windows API functions). For console output, you must call the SetConsoleOutputCP-type functions, as is also described at https://stackoverflow.com/a/1866668/368896. (See the sketch after this list.)
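Here is what that last recommendation can look like in practice - a sketch only, assuming the Boost-flavored header layout for Boost.Nowide (the standalone version of the library used <nowide/...> paths instead):

```cpp
#include <boost/nowide/convert.hpp>   // boost::nowide::widen
#include <boost/nowide/iostream.hpp>  // boost::nowide::cout
#include <windows.h>
#include <string>

int main()
{
    // Program text stays UTF-8 in a plain std::string throughout.
    std::string msg = u8"Gr\u00FC\u00DFe, world!";  // "Grüße, world!"

    // Console: boost::nowide::cout expects UTF-8; with the plain CRT
    // streams you would call SetConsoleOutputCP(CP_UTF8) instead.
    boost::nowide::cout << msg << "\n";

    // GUI: widen to UTF-16 only at the API boundary, and use the W form.
    MessageBoxW(nullptr, boost::nowide::widen(msg).c_str(), L"Demo", MB_OK);
}
```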
Thanks to those who took the time to read this lengthy proposed answer.

Solution

The encoding of the source code file doesn't affect the behavior of mbstowcs. After all, the internal implementation of the function is unaware of what source code might be calling it.

The MSDN documentation you linked says:

"mbstowcs uses the current locale for any locale-dependent behavior; _mbstowcs_l is identical except that it uses the locale passed in instead. For more information, see Locale."

That linked page about locales in turn references setlocale, which is how the behavior of mbstowcs can be affected.

Now, taking a look at your proposed way of passing UTF-8:

```cpp
mbstowcs(dest, u8"Hello, world!", 1024);
```

Unfortunately, that isn't going to work properly as far as I know once you use interesting data. If it even compiles, it does so only because the compiler is treating the u8 literal the same as a plain char*. And as far as mbstowcs is concerned, it will believe the string is encoded under whatever locale is set.

Even more unfortunately, I don't believe there's any way (on the Windows / Visual Studio platform) to set a locale such that UTF-8 would be used.

So that would happen to work for ASCII characters (the first 128 characters) only, because they happen to have exactly the same binary values in the various ANSI encodings as well as in UTF-8. If you try any characters beyond that (for instance, anything with an accent or umlaut), you'll see problems.

Personally, I think mbstowcs and the like are rather limited and clunky. I've found the Windows API function MultiByteToWideChar to be more effective in general. In particular, it can easily handle UTF-8 just by passing CP_UTF8 for the code page parameter.
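To illustrate that suggestion, here is a minimal UTF-8-to-UTF-16 helper using the usual two-call pattern (a sketch, not the only way to do it; MB_ERR_INVALID_CHARS makes malformed input fail loudly instead of being silently replaced):

```cpp
#include <windows.h>
#include <stdexcept>
#include <string>

// Convert UTF-8 to UTF-16: first query the required length, then convert.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();

    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8.data(), (int)utf8.size(), nullptr, 0);
    if (n == 0)
        throw std::runtime_error("invalid UTF-8 sequence");

    std::wstring out(n, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &out[0], n);
    return out;
}
```

Because the desired code page is passed explicitly, the result no longer depends on the active locale - which is exactly the guarantee mbstowcs cannot give.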