Problem description
You can write UTF-8/16/32 string literals in C++11 by prefixing the string literal with u8/u/U respectively. How must the compiler interpret a UTF-8 file that has non-ASCII characters inside these new types of string literals? I understand that the standard does not specify file encodings, and that fact alone would seem to leave the interpretation of non-ASCII characters in source code completely undefined, making the feature just a tad less useful.
I understand you can still escape single Unicode characters with \uNNNN, but that is not very readable for, say, a full Russian or French sentence, which typically contains more than one non-ASCII character.
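To make the \uNNNN route concrete, here is a minimal sketch (the helper name code_units is made up for this example). Because a universal-character-name names a code point directly, the resulting literal should not depend on how the compiler decodes the source file; only the prefix decides how many code units the character occupies.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating NUL.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// "hotel" with a circumflex on the o, written with a universal-character-name
// (U+00F4) so the source file itself stays pure ASCII.
static_assert(code_units(u8"h\u00F4tel") == 6, "UTF-8: 4 ASCII units + 2 units for U+00F4");
static_assert(code_units(u"h\u00F4tel") == 5, "UTF-16: one unit per character here");
static_assert(code_units(U"h\u00F4tel") == 5, "UTF-32: always one unit per character");

// A code point outside the Basic Multilingual Plane needs \UNNNNNNNN.
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: single unit");
```

The readability complaint still stands: this is robust, but nobody wants to write a whole sentence this way.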
What I understand from various sources is that L should be equivalent to u on current Windows implementations (16-bit wchar_t) and to U on e.g. Linux implementations (32-bit wchar_t). So with that in mind, I'm also wondering what the required behavior is for the old string literal modifiers...
For the code-sample monkeys:
// C++11 types; note that from C++20 on a u8 literal has type const char8_t[] and no longer initializes std::string
std::string a = u8"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u16string b = u"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u32string c = U"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
In an ideal world, all of these strings produce the same content (as in: the same characters after conversion), but my experience with C++ has taught me that this is most definitely implementation-defined and probably only the first will do what I want.
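One way to check the "same characters after conversion" expectation is to decode the UTF-8 and UTF-16 literals back to code points and compare. The following is only a sketch: decode_utf8, decode_utf16, from_utf8 and from_utf16 are names invented here, and the decoders assume well-formed input and do no validation.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 code units into code points. Assumes well-formed input.
// Templated so it accepts both char (C++11) and char8_t (C++20) literals.
template <typename Char8>
std::u32string decode_utf8(const Char8* s, std::size_t n) {
    std::u32string out;
    for (std::size_t i = 0; i < n;) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // Number of continuation bytes announced by the lead byte.
        int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
        char32_t cp = extra == 0 ? b : (b & (0x3F >> extra));
        for (int k = 1; k <= extra; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += static_cast<std::size_t>(extra) + 1;
    }
    return out;
}

// Decode UTF-16 code units into code points, combining surrogate pairs.
inline std::u32string decode_utf16(const char16_t* s, std::size_t n) {
    std::u32string out;
    for (std::size_t i = 0; i < n; ++i) {
        char32_t cp = s[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[++i] - 0xDC00);
        }
        out.push_back(cp);
    }
    return out;
}

// Convenience wrappers that take a string literal and drop its NUL.
template <typename Char8, std::size_t N>
std::u32string from_utf8(const Char8 (&lit)[N]) { return decode_utf8(lit, N - 1); }

template <std::size_t N>
std::u32string from_utf16(const char16_t (&lit)[N]) { return decode_utf16(lit, N - 1); }
```

With these helpers, from_utf8(u8"...") == from_utf16(u"...") and a comparison against the matching U"..." literal should hold for the same literal text, provided the compiler decoded the source file correctly in the first place.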
In GCC, use -finput-charset=charset to tell the compiler which encoding the source file is in.
Also check out the options -fexec-charset and -fwide-exec-charset.
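As a rough illustration of how these options interact with the literal prefixes, consider the sketch below. The file name charset_demo.cpp and the function names are made up for this example, and the second command assumes the system's iconv knows the name ISO-8859-1.

```cpp
// Hypothetical file name: charset_demo.cpp
//
//   g++ -std=c++11 -finput-charset=UTF-8 -fexec-charset=UTF-8      charset_demo.cpp
//   g++ -std=c++11 -finput-charset=UTF-8 -fexec-charset=ISO-8859-1 charset_demo.cpp
#include <cstddef>

// Plain narrow literal: its bytes follow the execution character set,
// so this should be 2 with a UTF-8 execution charset and 1 with Latin-1.
inline std::size_t narrow_size() { return sizeof("\u00E9") - 1; }

// u8 literal: encoded as UTF-8 regardless of the execution character set,
// so U+00E9 takes 2 bytes either way.
inline std::size_t utf8_size() { return sizeof(u8"\u00E9") - 1; }
```

In other words, -finput-charset governs how the file is read, while -fexec-charset and -fwide-exec-charset only affect unprefixed and L-prefixed literals; the u8, u and U literals keep their fixed encodings.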
Finally, about string literals:
char a[] = "Hello";
wchar_t b[] = L"Hello";
char16_t c[] = u"Hello";
char32_t d[] = U"Hello";
The size modifier of the string literal (L, u, U) merely determines the type of the literal.
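That claim can be checked at compile time. A small sketch (unit_count is a name made up here): a string literal is an lvalue, so decltype yields a reference to a const array of the corresponding character type.

```cpp
#include <cstddef>
#include <type_traits>

// The prefix selects the element type; the array length includes the NUL.
static_assert(std::is_same<decltype("Hello"), const char (&)[6]>::value, "no prefix: char");
static_assert(std::is_same<decltype(L"Hello"), const wchar_t (&)[6]>::value, "L: wchar_t");
static_assert(std::is_same<decltype(u"Hello"), const char16_t (&)[6]>::value, "u: char16_t");
static_assert(std::is_same<decltype(U"Hello"), const char32_t (&)[6]>::value, "U: char32_t");

// Hypothetical helper: array length of a literal, including the NUL.
template <typename CharT, std::size_t N>
constexpr std::size_t unit_count(const CharT (&)[N]) { return N; }

// Same number of elements for pure-ASCII text; only the element size differs.
static_assert(unit_count(u"Hello") == unit_count(U"Hello"), "6 elements each");
static_assert(sizeof(U"Hello") == 6 * sizeof(char32_t), "size scales with the element type");
```

The size of wchar_t is left out on purpose, since it differs between platforms.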