Problem description
I come from Python, where you can use 'string[10]' to access a character by its position in the sequence, and if the string is encoded in Unicode it gives the expected result. In C++, indexing a string works as long as the characters are ASCII, but when the string contains a Unicode character and I index into it, the output is an octal representation like /201. For example:
string ramp = "ÐðŁłŠšÝýÞþŽž";
cout << ramp << "\n";
cout << ramp[5] << "\n";
Output:
ÐðŁłŠšÝýÞþŽž
/201
Why is this happening, and how can I access that character in its string representation, or convert the octal representation to the actual character?
Recommended answer
Standard C++ is not equipped for proper handling of Unicode, which gives you problems like the one you observed.
The problem here is that C++ predates Unicode by a comfortable margin. This means that even that string literal of yours will be interpreted in an implementation-defined manner, because those characters are not defined in the basic source character set (which is, basically, the ASCII-7 characters minus @, $, and the backtick).
C++98 does not mention Unicode at all. It mentions wchar_t, and wstring being based on it, specifying wchar_t as being capable of "representing any character in the current locale". But that did more harm than good...
Microsoft defined wchar_t as 16 bits, which was enough for the Unicode code points at that time. However, Unicode has since been extended beyond the 16-bit range, and Windows' 16-bit wchar_t is not "wide" anymore, because you need two of them to represent characters beyond the BMP. The Microsoft docs are also notoriously ambiguous as to whether wchar_t means UTF-16 (a multibyte encoding with surrogate pairs) or UCS-2 (a wide encoding with no support for characters beyond the BMP).
All the while, a Linux wchar_t is 32 bits, which is wide enough for UTF-32...
C++11 made significant improvements in this area, adding char16_t and char32_t along with their associated string variants to remove the ambiguity, but it is still not fully equipped for Unicode operations.
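As a rough illustration of what the C++11 additions do provide: in a std::u32string every element is one UTF-32 code unit, which is one code point, so plain indexing returns the character the question expected. The helper name nth_code_point below is hypothetical, and the string is written with universal character names so the sketch does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// The string from the question, spelled with universal character names:
// Ð ð Ł ł Š š Ý ý Þ þ Ž ž
const std::u32string ramp =
    U"\u00D0\u00F0\u0141\u0142\u0160\u0161\u00DD\u00FD\u00DE\u00FE\u017D\u017E";

// Hypothetical helper: returns the code point at position `index`.
// In UTF-32 one code unit is one code point, so indexing is enough.
char32_t nth_code_point(const std::u32string& text, std::size_t index)
{
    return text.at(index); // throws std::out_of_range on a bad index
}
```

Here nth_code_point(ramp, 5) yields U+0161 (š). Printing that value as a character is a separate problem, because the standard streams have no notion of how to encode a char32_t for output.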
Just as one example, try to convert German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions, handling one character in and one character out at a time, cannot do.)
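A minimal sketch of that limitation, using only the standard library: std::towupper maps one wide character to one wide character, so a loop over it can never make the string longer. The helper name upper_per_char is hypothetical, and what happens to 'ß' itself depends on the active locale.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: uppercases a wide string one character at a time.
// Each std::towupper call takes one character and returns one character,
// so the result always has the same length as the input.
std::wstring upper_per_char(const std::wstring& text)
{
    std::wstring result;
    result.reserve(text.size());
    for (wchar_t ch : text) {
        result.push_back(
            static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(ch))));
    }
    return result;
}
```

Calling upper_per_char(L"Fu\u00DF") returns three characters, whereas the correct uppercase form "FUSS" has four. Full case mapping of that kind is what a library such as ICU is for.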
However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for specifying special characters in source code, you will have to use the u8"", u"", and U"" prefixes to enforce interpretation of the string literal as UTF-8, UTF-16, and UTF-32 respectively, either using octal / hexadecimal escapes or relying on your compiler implementation to handle non-ASCII-7 encodings appropriately.
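To make the byte-versus-character distinction concrete, here is a sketch that walks a UTF-8 encoded std::string and decodes the code point at a given character position. It assumes the input is well-formed UTF-8 and does no validation beyond a truncation check; utf8_code_point_at is a hypothetical helper, not a standard or ICU function.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes code point number `index` (0-based) from a
// UTF-8 string. Assumes well-formed input; continuation bytes are not checked.
char32_t utf8_code_point_at(const std::string& utf8, std::size_t index)
{
    std::size_t pos = 0;
    while (pos < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[pos]);
        std::size_t length = 1;   // bytes in this sequence
        char32_t value = lead;    // payload bits collected so far
        if (lead >= 0xF0)      { length = 4; value = lead & 0x07; }
        else if (lead >= 0xE0) { length = 3; value = lead & 0x0F; }
        else if (lead >= 0xC0) { length = 2; value = lead & 0x1F; }
        if (pos + length > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte contributes its low six bits.
        for (std::size_t i = 1; i < length; ++i)
            value = (value << 6) | (static_cast<unsigned char>(utf8[pos + i]) & 0x3F);
        if (index == 0)
            return value;
        --index;
        pos += length;
    }
    throw std::out_of_range("code point index out of range");
}
```

With the first six characters of the question's string written as hexadecimal escapes, "\xC3\x90\xC3\xB0\xC5\x81\xC5\x82\xC5\xA0\xC5\xA1", byte 5 is 0x81 (octal 201, the second byte of 'Ł'), while utf8_code_point_at(s, 5) returns U+0161 ('š'). Note that this lookup is linear in the length of the string, unlike indexing a byte.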
And even then you will get an integer value for std::cout << ramp[5], because for C++ a character is just an integer with semantic meaning. ICU's ustream.h provides operator<< overloads for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned short were suddenly interpreted as a character. You need the C API functions u_fputs() / u_printf() / u_fprintf() for that.
#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/ustdio.h>

#include <iostream>

int main()
{
    // make sure your source file is UTF-8 encoded...
    icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
    std::cout << ramp << "\n";          // whole string, via ustream.h
    std::cout << ramp[5] << "\n";       // a code unit, printed as an integer
    u_printf( "%C\n", ramp[5] );        // the same code unit, printed as a character
}
Compile with g++ -std=c++11 testme.cpp -licuio -licuuc.
ÐðŁłŠšÝýÞþŽž
353
š
(1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the various other ways to index a Unicode string.
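To show what "one half of a surrogate pair" means in practice, the following sketch uses a plain std::u16string rather than ICU. The helper names are hypothetical; ICU has its own code-point-aware accessors (for instance UnicodeString::char32At), which are preferable in real code.

```cpp
#include <cstddef>
#include <string>

// Code units in these ranges are halves of a surrogate pair, not characters.
bool is_lead_surrogate(char16_t unit)  { return unit >= 0xD800 && unit <= 0xDBFF; }
bool is_trail_surrogate(char16_t unit) { return unit >= 0xDC00 && unit <= 0xDFFF; }

// Hypothetical helper: returns the code point that starts at code-unit
// offset `offset`, combining a lead/trail surrogate pair when one is present.
char32_t utf16_code_point_at(const std::u16string& text, std::size_t offset)
{
    const char16_t first = text.at(offset); // throws on a bad offset
    if (is_lead_surrogate(first) && offset + 1 < text.size()
        && is_trail_surrogate(text[offset + 1])) {
        return 0x10000
             + ((static_cast<char32_t>(first) - 0xD800) << 10)
             + (static_cast<char32_t>(text[offset + 1]) - 0xDC00);
    }
    return first; // BMP character (or an unpaired surrogate, returned as-is)
}
```

For the string u"\U0001F600", indexing with [0] gives the lead surrogate 0xD83D, while utf16_code_point_at(s, 0) gives the full code point U+1F600. Characters in the BMP, such as the 'š' in the example above, occupy a single code unit, which is why ramp[5] happened to work there.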