The statement in C++2003 is quite vague. But in C++0x, when counting the length of a string literal, a wide string literal (wchar_t) shall be treated the same as char32_t, and differently from char16_t.
There's a post that clearly states how Windows implements wchar_t: http://stackoverflow.com/questions/402283?tab=votes%23tab-top
In short, wchar_t on Windows is 16 bits and encoded using UTF-16. The statement in the standard apparently leaves something conflicting on Windows.
For example,
wchar_t kk[] = L"\U000E0005";
This character exceeds 16 bits, and UTF-16 needs two 16-bit code units to encode it (a surrogate pair).
However, according to the standard, kk is an array of 2 wchar_t (1 for the universal-character-name \U000E0005, 1 for \0).
But in its internal storage, Windows needs three 16-bit wchar_t objects to store it: 2 wchar_t for the surrogate pair and 1 wchar_t for the \0. Therefore, by the definition of an array, kk is an array of 3 wchar_t.
These two results apparently conflict with each other.
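To make the discrepancy concrete, here is a minimal test sketch (assuming a C++0x-capable compiler; the comments state the element counts I would expect from each implementation):

#include <iostream>

int main() {
    // One character outside the BMP, plus the implicit terminating L'\0'.
    wchar_t kk[] = L"\U000E0005";

    // With a 32-bit wchar_t (e.g. GCC/Clang on Linux) this should print 2.
    // With MSVC's 16-bit wchar_t the character is stored as a UTF-16
    // surrogate pair, so the array ends up with 3 elements.
    std::cout << "elements in kk:  " << sizeof kk / sizeof kk[0] << '\n';
    std::cout << "sizeof(wchar_t): " << sizeof(wchar_t) << '\n';
}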
I think the simplest solution for Windows is to "ban" anything that requires a surrogate pair in wchar_t (i.e., "ban" any Unicode character outside the BMP).
Is there anything wrong with my understanding?
Thanks.
The standard requires that wchar_t be large enough to hold any character in the supported character set. Based on this, I think your premise is correct -- it is wrong for VC++ to represent the single character \U000E0005 using two wchar_t units.
Characters outside the BMP are rarely used, and Windows itself internally uses UTF-16 encoding, so it is simply convenient (even if incorrect) for VC++ to behave this way. However, rather than "banning" such characters, it is likely that the size of wchar_t will increase in the future while char16_t takes its place in the Windows API.
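For comparison, the new character types fix the encoding explicitly, so the number of code units needed for a non-BMP character is the same on every platform (a minimal sketch, assuming a compiler with C++0x char16_t/char32_t support):

#include <iostream>

int main() {
    // char16_t literals are UTF-16 by definition, so a non-BMP character is
    // deliberately stored as a surrogate pair: 2 code units + u'\0' = 3 elements.
    char16_t u16[] = u"\U000E0005";

    // char32_t literals are UTF-32: 1 code unit + U'\0' = 2 elements everywhere.
    char32_t u32[] = U"\U000E0005";

    std::cout << "char16_t elements: " << sizeof u16 / sizeof u16[0] << '\n';  // 3
    std::cout << "char32_t elements: " << sizeof u32 / sizeof u32[0] << '\n';  // 2
}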
The answer you linked to is somewhat misleading as well:
The size of wchar_t depends solely on the compiler and has nothing to do with the operating system. It just happens that VC++ uses 2 bytes for wchar_t, but once again, this could very well change in the future.
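To check what a particular toolchain actually does, you can simply query it (a minimal sketch; the values in the comments are the typical ones, not a guarantee):

#include <cwchar>    // for WCHAR_MAX
#include <iostream>

int main() {
    // Typically 2 bytes with MSVC/MinGW and 4 bytes with GCC/Clang on Linux
    // or macOS; this is a property of the compiler, not of the operating system.
    std::cout << "sizeof(wchar_t): " << sizeof(wchar_t) << '\n';

    // If wchar_t cannot reach U+10FFFF, a character outside the BMP cannot
    // fit in a single wchar_t object.
    std::cout << "covers all of Unicode: "
              << (WCHAR_MAX >= 0x10FFFF ? "yes" : "no") << '\n';
}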