Question
I don't understand the difference between std::string and std::wstring. I know that wstring supports wide characters such as Unicode characters. I have the following questions:
- When should I use std::wstring over std::string?
- Can std::string hold the entire ASCII character set, including the special characters?
- Is std::wstring supported by all popular C++ compilers?
- What exactly is a "wide character"?
Recommended answer
string? wstring?
std::string is a basic_string templated on char, and std::wstring is a basic_string templated on wchar_t.
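A quick way to see this for yourself is to let the compiler check it. The sketch below uses only standard facilities; bytes_per_unit is a helper name made up for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// std::string and std::wstring are two instantiations of the same template.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "std::wstring is basic_string<wchar_t>");

// Hypothetical helper: size in bytes of one element of a given string type.
template <typename String>
constexpr std::size_t bytes_per_unit()
{
    return sizeof(typename String::value_type);
}
```

bytes_per_unit<std::string>() is always 1, while bytes_per_unit<std::wstring>() is whatever sizeof(wchar_t) happens to be on the platform you compile for.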
char is supposed to hold a character, usually an 8-bit one. wchar_t is supposed to hold a wide character, and that is where things get tricky: on Linux, a wchar_t is 4 bytes, while on Windows it is 2 bytes.
The problem is that neither char nor wchar_t is directly tied to Unicode.
Let's take Linux as an example: my Ubuntu system is already Unicode-aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string of chars). The following code:
#include <cstring>
#include <cwchar>
#include <iostream>

int main()
{
    const char text[] = "olé";

    std::cout << "sizeof(char) : " << sizeof(char) << "\n";
    std::cout << "text : " << text << "\n";
    std::cout << "sizeof(text) : " << sizeof(text) << "\n";
    std::cout << "strlen(text) : " << std::strlen(text) << "\n";

    std::cout << "text(ordinals) :";

    for (std::size_t i = 0, iMax = std::strlen(text); i < iMax; ++i)
    {
        // go through unsigned char so bytes above 127 are not printed as negative values
        unsigned char c = static_cast<unsigned char>(text[i]);
        std::cout << " " << static_cast<unsigned int>(c);
    }

    std::cout << "\n";

    // - - -

    const wchar_t wtext[] = L"olé";

    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
    // std::cout << "wtext : " << wtext << "\n"; // <- error: a narrow stream cannot print wchar_t text
    std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << "\n";
    std::wcout << L"wtext : " << wtext << "\n";

    std::cout << "sizeof(wtext) : " << sizeof(wtext) << "\n";
    std::cout << "wcslen(wtext) : " << std::wcslen(wtext) << "\n";

    std::cout << "wtext(ordinals) :";

    for (std::size_t i = 0, iMax = std::wcslen(wtext); i < iMax; ++i)
    {
        // unsigned long is wide enough for a 4-byte wchar_t
        std::cout << " " << static_cast<unsigned long>(wtext[i]);
    }

    std::cout << "\n";
}
outputs the following text:
sizeof(char) : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(ordinals) : 111 108 195 169
sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol�
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(ordinals) : 111 108 233
You'll see that the "olé" text in char is really made of four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)
So, when working with char on Linux, you will usually end up using Unicode without even knowing it. And since std::string works with char, std::string is already Unicode-ready.
Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not 3. So be cautious when truncating or otherwise manipulating Unicode chars, because some combinations of chars are invalid in UTF-8.
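To make that caution concrete, here is a minimal sketch that counts code points instead of chars and truncates without cutting a multi-byte sequence in half. It assumes the input is already valid UTF-8; is_continuation, count_code_points and truncate_utf8 are names invented for this example, not standard functions.

```cpp
#include <cstddef>
#include <string>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Counts code points by counting every byte that does not continue a sequence.
inline std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if (!is_continuation(byte))
            ++count;
    }
    return count;
}

// Keeps at most max_bytes bytes, backing up so no sequence is split.
inline std::string truncate_utf8(const std::string& utf8, std::size_t max_bytes)
{
    if (utf8.size() <= max_bytes)
        return utf8;

    std::size_t end = max_bytes;
    // If the first dropped byte is a continuation byte, the cut is inside a sequence.
    while (end > 0 && is_continuation(static_cast<unsigned char>(utf8[end])))
        --end;
    return utf8.substr(0, end);
}
```

With "ol\xC3\xA9" (the UTF-8 bytes of "olé"), size() is 4 but count_code_points gives 3, and truncate_utf8 with a limit of 3 bytes returns "ol" rather than "ol" followed by a dangling 0xC3 byte.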
On Windows, this is a bit different. Win32 had to support a lot of applications working with char and with the different charsets/codepages produced all over the world before the advent of Unicode.
So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage of the machine, which for a long time could not be UTF-8. For example, "olé" would be "olé" on a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
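The "olé" versus "olй" effect comes from the same byte, 0xE9, being looked up in two different tables. The sketch below illustrates this for the 0xC0..0xFF range only, where Windows-1252 matches Latin-1 and Windows-1251 holds the Cyrillic letters starting at U+0410. CodePage and decode_high_byte are hypothetical names for this illustration; real code would go through the platform's conversion APIs.

```cpp
enum class CodePage { Windows1252, Windows1251 };

// Maps one byte in 0xC0..0xFF to a Unicode code point.
// Returns 0 for bytes outside that range, which this sketch does not handle.
constexpr char32_t decode_high_byte(unsigned char byte, CodePage page)
{
    return byte < 0xC0
        ? char32_t(0)
        : (page == CodePage::Windows1252
            ? char32_t(byte)                      // identical to Latin-1 in this range
            : char32_t(0x0410 + (byte - 0xC0)));  // Cyrillic letters from U+0410 upward
}
```

decode_high_byte(0xE9, CodePage::Windows1252) yields U+00E9 (é), while the same byte under CodePage::Windows1251 yields U+0439 (й).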
For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and is encoded in UTF-16, that is, Unicode encoded in 2-byte units (or at the very least in UCS-2, which just lacks surrogate pairs and thus characters outside the BMP, i.e. code points >= 64K).
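The surrogate-pair mechanism mentioned here can be sketched in a few lines. The function below encodes a single code point as UTF-16; to_utf16 is a name chosen for this example, and it assumes the input is a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <string>

// Encodes one code point as one or two UTF-16 code units.
inline std::u16string to_utf16(char32_t code_point)
{
    if (code_point < 0x10000)
    {
        // Inside the BMP: a single 2-byte unit is enough.
        return std::u16string(1, static_cast<char16_t>(code_point));
    }

    // Outside the BMP: split the remaining 20 bits into a surrogate pair.
    const char32_t offset = code_point - 0x10000;
    std::u16string units;
    units += static_cast<char16_t>(0xD800 + (offset >> 10));   // high surrogate
    units += static_cast<char16_t>(0xDC00 + (offset & 0x3FF)); // low surrogate
    return units;
}
```

U+00E9 (é) comes out as one unit, while U+1F600 comes out as the pair 0xD83D 0xDE00. UCS-2 is what you get if you only ever take the first branch.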
Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.
Thus, if you work on Windows, you really want to use wchar_t (unless you use a framework that hides it, like GTK or Qt...). The fact is that, behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when calling APIs like SetWindowText() (the low-level API function that sets the label on a Win32 GUI).
UTF-32 is 4 bytes per character, so there is not much to add, except that UTF-8 text and UTF-16 text will always use less memory than, or the same amount as, the equivalent UTF-32 text (and usually less).
If memory is a concern, then you should know that for most Western languages, UTF-8 text will use less memory than the same text in UTF-16.
Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same or larger for UTF-8 than for UTF-16: most of their characters take 3 bytes in UTF-8 but 2 bytes in UTF-16.
All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.
See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
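The size comparison above follows directly from the code point ranges each encoding covers. As a rough sketch (the function names are made up for this example), the number of bytes per code point can be computed like this:

```cpp
#include <cstddef>

// Bytes needed to store one code point in UTF-8.
constexpr std::size_t utf8_bytes(char32_t code_point)
{
    return code_point < 0x80    ? 1
         : code_point < 0x800   ? 2
         : code_point < 0x10000 ? 3
                                : 4;
}

// Bytes needed in UTF-16: one 2-byte unit inside the BMP, a surrogate pair outside.
constexpr std::size_t utf16_bytes(char32_t code_point)
{
    return code_point < 0x10000 ? 2 : 4;
}

// UTF-32 always uses 4 bytes.
constexpr std::size_t utf32_bytes(char32_t)
{
    return 4;
}
```

For "olé" this gives 1 + 1 + 2 = 4 bytes in UTF-8, 6 bytes in UTF-16 and 12 bytes in UTF-32 (terminators not counted), while a character such as U+4E2D (中) costs 3 bytes in UTF-8 and 2 in UTF-16.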
When should I use std::wstring over std::string?
On Linux? Almost never (§). On Windows? Almost always (§). On cross-platform code? Depends on your toolkit...
(§): unless you use a toolkit/framework saying otherwise
Can std::string hold the entire ASCII character set, including special characters?
Note: a std::string is suitable for holding a "binary" buffer, whereas a std::wstring is not!
On Linux? Yes. On Windows? Only the special characters available in the current locale of the Windows user.
Edit (after a comment from Johann Gerell): a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:
- ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
- A char from 0 to 127 will be held correctly.
- A char from 128 to 255 will have a meaning that depends on your encoding (Unicode, non-Unicode, etc.), but a std::string will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
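The 0 to 127 rule is easy to check in code. Here is a small sketch (is_ascii is a name chosen for this example) that tells whether a std::string contains only ASCII, that is, whether its meaning is independent of the encoding in use:

```cpp
#include <string>

// Returns true if every char in the string is in the ASCII range 0..127.
inline bool is_ascii(const std::string& text)
{
    for (unsigned char byte : text)
    {
        // Reading through unsigned char avoids negative values where char is signed.
        if (byte > 127)
            return false;
    }
    return true;
}
```

is_ascii("ole") is true, while is_ascii("ol\xC3\xA9") is false: the string is stored without loss either way, but the last two bytes only mean "é" if the reader decodes them as UTF-8.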
Is std::wstring supported by almost all popular C++ compilers?
Mostly, with the exception of GCC-based compilers that are ported to Windows. It works on my g++ 4.3.2 (under Linux), and I have used the Unicode API on Win32 since Visual C++ 6.
What exactly is a wide character?
In C/C++, it is a character type written wchar_t, which is larger than the simple char character type. It is supposed to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).