The statement in C++2003 is quite vague. But in C++0x, when counting the length of a string literal, a wide string literal (wchar_t) shall be treated the same as char32_t, and differently from char16_t.
There's a post that clearly states how Windows implements wchar_t: http://stackoverflow.com/questions/402283?tab=votes%23tab-top
In short, wchar_t on Windows is 16 bits and encoded using UTF-16. The statement in the standard apparently leaves something conflicting on Windows.
For example,
wchar_t kk[] = L"\U000E0005";
This character exceeds 16 bits, and UTF-16 needs two 16-bit code units to encode it (a surrogate pair).
However, according to the standard, kk is an array of 2 wchar_t (1 for the universal-character-name \U000E0005, 1 for \0).
But in its internal storage, Windows needs three 16-bit wchar_t objects to store it: 2 wchar_t for the surrogate pair and 1 wchar_t for the \0. Therefore, by the definition of an array, kk is an array of 3 wchar_t.
These two results apparently conflict with each other.
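To make the discrepancy concrete, here is a minimal test sketch (assuming a C++0x-capable compiler; the comments state the element counts I would expect from each implementation):

#include <iostream>

int main() {
    // One character outside the BMP, plus the implicit terminating L'\0'.
    wchar_t kk[] = L"\U000E0005";

    // With a 32-bit wchar_t (e.g. GCC/Clang on Linux) this should print 2.
    // With MSVC's 16-bit wchar_t the character is stored as a UTF-16
    // surrogate pair, so the array ends up with 3 elements.
    std::cout << "elements in kk:  " << sizeof kk / sizeof kk[0] << '\n';
    std::cout << "sizeof(wchar_t): " << sizeof(wchar_t) << '\n';
}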
I think the simplest solution for Windows is to "ban" anything that requires a surrogate pair in wchar_t (i.e., "ban" any Unicode character outside the BMP).
Is there anything wrong with my understanding?
Thanks.
The standard requires that wchar_t be large enough to hold any character in the supported character set. Based on this, I think your premise is correct -- it is wrong for VC++ to represent the single character \U000E0005 using two wchar_t units.
Characters outside the BMP are rarely used, and Windows itself internally uses UTF-16 encoding, so it is simply convenient (even if incorrect) for VC++ to behave this way. However, rather than "banning" such characters, it is likely that the size of wchar_t will increase in the future while char16_t takes its place in the Windows API.
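For comparison, the new character types fix the encoding explicitly, so the number of code units needed for a non-BMP character is the same on every platform (a minimal sketch, assuming a compiler with C++0x char16_t/char32_t support):

#include <iostream>

int main() {
    // char16_t literals are UTF-16 by definition, so a non-BMP character is
    // deliberately stored as a surrogate pair: 2 code units + u'\0' = 3 elements.
    char16_t u16[] = u"\U000E0005";

    // char32_t literals are UTF-32: 1 code unit + U'\0' = 2 elements everywhere.
    char32_t u32[] = U"\U000E0005";

    std::cout << "char16_t elements: " << sizeof u16 / sizeof u16[0] << '\n';  // 3
    std::cout << "char32_t elements: " << sizeof u32 / sizeof u32[0] << '\n';  // 2
}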
The answer you linked to is somewhat misleading as well:
The size of wchar_t depends solely on the compiler and has nothing to do with the operating system. It just happens that VC++ uses 2 bytes for wchar_t, but once again, this could very well change in the future.
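To check what a particular toolchain actually does, you can simply query it (a minimal sketch; the values in the comments are the typical ones, not a guarantee):

#include <cwchar>    // for WCHAR_MAX
#include <iostream>

int main() {
    // Typically 2 bytes with MSVC/MinGW and 4 bytes with GCC/Clang on Linux
    // or macOS; this is a property of the compiler, not of the operating system.
    std::cout << "sizeof(wchar_t): " << sizeof(wchar_t) << '\n';

    // If wchar_t cannot reach U+10FFFF, a character outside the BMP cannot
    // fit in a single wchar_t object.
    std::cout << "covers all of Unicode: "
              << (WCHAR_MAX >= 0x10FFFF ? "yes" : "no") << '\n';
}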