Problem description
I am writing a program that needs to work with text in all languages. My understanding is that UTF-8 will do the job, but I am running into a few problems with it.
Am I right to say that UTF-8 can be stored in a plain `char` in C++? If so, why do I get the following warning when I build a program that uses `char`, `string` and `stringstream`: `warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252)`? (I do not get that warning when I use `wchar_t`, `wstring` and `wstringstream`.)
Additionally, I know that UTF-8 is variable length. When I use the `at` or `substr` string methods, will I get the wrong answer?
Recommended answer
To use UTF-8 string literals you need to prefix them with `u8`; otherwise you get the implementation's execution character set (in your case, it seems to be Windows-1252). `u8"\uFFFD"` is a null-terminated sequence of bytes holding the UTF-8 representation of the replacement character (U+FFFD). It has type `char const[4]` (`char8_t const[4]` since C++20).
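As a minimal sketch of the point above, the following checks the size of the `u8"\uFFFD"` literal and copies its bytes into a `std::string`. The helper name `replacement_character_utf8` is hypothetical and only serves the illustration. The cast is there so the same code compiles whether the literal's element type is `char` or `char8_t`.

```cpp
#include <cstddef>
#include <string>

// The literal holds 3 UTF-8 bytes (EF BF BD) plus the terminating null.
// sizeof gives 4 whether the element type is char or char8_t.
static_assert(sizeof(u8"\uFFFD") == 4, "3 UTF-8 bytes plus the null terminator");

// Hypothetical helper: returns the UTF-8 bytes of U+FFFD as a std::string.
inline std::string replacement_character_utf8() {
    const auto& literal = u8"\uFFFD";  // reference to an array of 4 code units
    // Reinterpret the code units as plain chars and drop the null terminator.
    return std::string(reinterpret_cast<const char*>(literal), sizeof(literal) - 1);
}
```

The resulting string has a size of 3, because `size()` counts bytes (code units), not characters.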
Since UTF-8 is a variable-length encoding, all indexing on such a string is done in code units, not code points. Random access to code points in a UTF-8 sequence is not possible because of its variable-length nature. If you want random access, you need a fixed-length encoding such as UTF-32. For that you can use the `U` prefix on string literals.
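The difference between code units and code points can be sketched as follows. The function names (`count_code_points`, `sample_utf8`, `sample_utf32`) are hypothetical and introduced only for this illustration, and the counting loop assumes its input is already valid UTF-8. The sample text is 'a', U+00E9 and U+20AC, spelled with explicit byte escapes so the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Count code points by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; no validation is performed.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // a lead byte or an ASCII byte starts a new code point
        }
    }
    return count;
}

// 'a' (1 byte), U+00E9 (2 bytes: C3 A9), U+20AC (3 bytes: E2 82 AC).
inline std::string sample_utf8() {
    return "a\xC3\xA9\xE2\x82\xAC";
}

// The same text in UTF-32: one code unit per code point,
// so operator[] and at() index code points directly.
inline std::u32string sample_utf32() {
    return U"a\u00E9\u20AC";
}
```

Here `sample_utf8().size()` is 6 while `count_code_points(sample_utf8())` is 3, and `sample_utf8().at(2)` returns the continuation byte 0xA9 rather than the third character. By contrast, `sample_utf32()[2]` is the code point U+20AC.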