Question
libxml2 seems to store all its strings in UTF-8, as xmlChar *.
/**
 * xmlChar:
 *
 * This is a basic byte in an UTF-8 encoded string.
 * It's unsigned allowing to pinpoint case where char * are assigned
 * to xmlChar * (possibly making serialization back impossible).
 */
typedef unsigned char xmlChar;
As libxml2 is a C library, it provides no routine to get a std::wstring out of an xmlChar *. I'm wondering whether the prudent way to convert an xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):
std::wstring xmlCharToWideString(const xmlChar *xmlString) {
    if(!xmlString){abort();} //provided string was null
    int charLength = xmlStrlen(xmlString); //excludes null terminator
    wchar_t *wideBuffer = new wchar_t[charLength];
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
    if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed
    std::wstring wideString(wideBuffer, wcharLength);
    delete[] wideBuffer;
    return wideString;
}
Edit: Just an FYI, I'm well aware of what xmlStrlen returns: the number of xmlChar units used to store the string, that is, the number of unsigned char bytes rather than the number of characters. Naming the variable byteLength would have been less confusing, but I thought charLength read more clearly alongside wcharLength. As for the correctness of the code, I believe wideBuffer will always be at least as large as the converted string requires, and I think characters that need more space than a wchar_t provides will be truncated.
Recommended answer
xmlStrlen() returns the number of UTF-8 code units (bytes) in the xmlChar* string. That is generally not the number of wchar_t code units needed once the data is converted, so do not use xmlStrlen() to size your wchar_t string. Instead, call std::mbstowcs() once with a null destination to get the required length (behavior specified by POSIX rather than by the C standard), allocate the memory, and then call mbstowcs() again to fill it. You will also have to use std::setlocale() to tell mbstowcs() to use UTF-8 (messing with the locale may not be a good idea, especially if multiple threads are involved). For example:
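#include <clocale>
#include <cstdlib>
#include <string>
#include <libxml/xmlstring.h>

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null
    std::wstring wideString;
    if (xmlStrlen(xmlString) > 0)
    {
        //copy the current locale name; a later setlocale() call may invalidate the returned pointer
        const char *current = setlocale(LC_CTYPE, NULL);
        std::string origLocale = current ? current : "C";
        if (!setlocale(LC_CTYPE, "en_US.UTF-8")) { abort(); } //UTF-8 locale not available
        //first pass: with a null destination, mbstowcs() returns the required length (POSIX behavior), excluding the null terminator
        size_t wcharLength = mbstowcs(NULL, (const char*) xmlString, 0);
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            //second pass: fill the already-sized string
            mbstowcs(&wideString[0], (const char*) xmlString, wcharLength);
        }
        setlocale(LC_CTYPE, origLocale.c_str());
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }
    return wideString;
}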
A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead, so you do not have to deal with locales (note that both are deprecated as of C++17):
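#include <codecvt>
#include <cstdlib>
#include <locale>
#include <stdexcept>
#include <string>
#include <libxml/xmlstring.h>

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null
    try
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
        return conv.from_bytes((const char*)xmlString); //input must be null-terminated
    }
    catch(const std::range_error& e)
    {
        abort(); //wstring_convert failed (invalid UTF-8)
    }
}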
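To make the difference between byte length and wide-character length concrete, here is a minimal sketch of the same wstring_convert approach that compiles without libxml2. The local xmlChar typedef is a stand-in for the library's own, and the name utf8ToWide is hypothetical. Unlike the version above, it lets std::range_error propagate instead of calling abort().

```cpp
#include <codecvt>
#include <cstring>
#include <locale>
#include <string>

// Stand-in for libxml2's typedef so the sketch builds without the library (assumption).
typedef unsigned char xmlChar;

// Hypothetical helper: converts a null-terminated UTF-8 string to std::wstring.
// Throws std::range_error on invalid UTF-8 rather than aborting.
std::wstring utf8ToWide(const xmlChar *s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
    return conv.from_bytes(reinterpret_cast<const char *>(s));
}

// Hypothetical helper: byte length of the UTF-8 input, which is what xmlStrlen() reports.
std::size_t utf8ByteLength(const xmlChar *s)
{
    return std::strlen(reinterpret_cast<const char *>(s));
}
```

For the input "caf\xC3\xA9" (the word "café" in UTF-8), the byte length is 5 while the converted wide string holds 4 characters, which is why the byte count should not be used as the final size of the wide string.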
An alternative is to use a dedicated conversion library, such as ICU or iconv, to handle the Unicode conversion.
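As a rough illustration of the iconv route, the sketch below converts a null-terminated UTF-8 string with iconv_open(), iconv() and iconv_close(). It assumes an iconv implementation that accepts the "WCHAR_T" encoding name (glibc and GNU libiconv do) and whose iconv() takes a char ** input pointer. The function name utf8ToWideIconv is hypothetical, and on some platforms you may need to link with -liconv.

```cpp
#include <iconv.h>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: UTF-8 to std::wstring via iconv.
// Assumes the "WCHAR_T" target encoding name is supported by the local iconv.
std::wstring utf8ToWideIconv(const char *utf8)
{
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)(-1)) throw std::runtime_error("iconv_open failed");

    size_t inLeft = std::strlen(utf8);
    // Valid UTF-8 never needs more wchar_t units than it has bytes,
    // so the byte count is a safe upper bound for the output buffer.
    std::wstring out(inLeft, L'\0');
    char *inPtr = const_cast<char *>(utf8);
    char *outPtr = reinterpret_cast<char *>(&out[0]);
    size_t outLeft = out.size() * sizeof(wchar_t);

    size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)(-1)) throw std::runtime_error("iconv failed");

    // Trim the buffer down to the number of wide characters actually written.
    out.resize(out.size() - outLeft / sizeof(wchar_t));
    return out;
}
```

With libxml2, the xmlChar * would be cast to const char * before the call. Throwing an exception here is a design choice made for the sketch; the abort() style used earlier would work just as well.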