Problem description
I am trying to detect certain combinations of Unicode characters (such as â€‹) in order to clean up a string. Detection works for a single Unicode character, but a combination of them is not detected.
I use these strings to build an HTML page from another HTML page that needs cleaning up. I only want to clean strings made up of this kind of Unicode, which is not even visible when the HTML page is rendered in a browser.
Here is the sample code:
void detect_Unicode(string& str) {
    if(!str.empty() && str.find_first_not_of(" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039")==string::npos)
        str.assign(" ");
    return;
}
Input strings:
1. " â€‹ â€‹ "
2. "are   there is something    â€‹ combination â€‹"
3. " Â Â "
4. "â€‹   â€‹"
5. "Â Â â â"
Expected output:
1. " "
2. "are   there is something    â€‹ combination â€‹"
3. " "
4. " "
5. " "
Please also let me know if there is another way to do this.
Recommended answer
OK, following on from the comments above, I think it's highly likely that the input string is in UTF-8 (after all, in an HTML context, what else would it be?).
On that basis, I humbly submit this:
#include <string>
#include <codecvt>
#include <locale>

// Convert a wide string to UTF-8.
std::string narrow (const std::wstring& ws)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (ws);
}

// Convert a UTF-8 string to a wide string.
std::wstring widen (const std::string& s)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (s);
}

std::string detect_Unicode (const std::string& s)
{
    std::wstring ws = widen (s);
    // As written: returns " " when s is empty or contains any character outside the set.
    if (ws.empty() || ws.find_first_not_of (L" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") != std::wstring::npos)
        return " ";
    return s;
}
#include <iostream>

int main ()
{
    std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n");
    std::cout << "0. \"" << detect_Unicode (u8"abcde") << "\"\n";
    std::cout << "1. \"" << detect_Unicode (u8" â€‹ â€‹ ") << "\"\n";
    std::cout << "2. \"" << detect_Unicode (u8"are   there is something    â€‹ combination â€‹") << "\"\n";
    std::cout << "3. \"" << detect_Unicode (u8" Â Â ") << "\"\n";
    std::cout << "4. \"" << detect_Unicode (u8"â€‹   â€‹") << "\"\n";
    std::cout << "5. \"" << detect_Unicode (u8"Â Â â â") << "\"\n";
}
Output:
  Â â € ‹
0. " "
1. " â€‹ â€‹ "
2. " "
3. " Â Â "
4. "â€‹   â€‹"
5. "Â Â â â"
Now this is not the output the OP expects, but I think that's simply because the logic (as opposed to the implementation) of detect_Unicode() looks flawed. The point here is that converting the input string to a wide string means you can reliably use the standard basic_string operations on it, because there are no longer any multibyte issues.
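To make the multibyte issue concrete, here is a minimal sketch, assuming the strings are UTF-8. The helper name count_code_points and the two sample constants are made up for illustration; the bytes are written as hex escapes so the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++n;
    return n;
}

// U+00C2 (Â) encoded as UTF-8 is the two bytes 0xC3 0x82.
const std::string a_circumflex = "\xC3\x82";

// U+20AC (€) encoded as UTF-8 is the three bytes 0xE2 0x82 0xAC.
const std::string euro = "\xE2\x82\xAC";
```

With these definitions, a_circumflex.size() is 2 while count_code_points(a_circumflex) is 1. Likewise, std::string("\xC3\xA9").find_first_not_of(a_circumflex) returns 1 rather than 0, because the first byte of é (0xC3) happens to match the first byte of Â. That kind of partial, byte-level match is what makes the narrow-string search unreliable.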
An alternative, slightly radical, implementation of detect_Unicode() might be:
for (auto wide_char : ws)
{
    if (wide_char > 0xff)
        return " ";
}
return s;
But really, now that you have a wide string to hand in detect_Unicode, anything is possible, so go wild, OP.
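As one possible direction, the sketch below follows the logic of the question's original function (collapse the string to a single space only when every character is whitespace or one of the listed characters), but applied to the widened string. The names is_only_junk and clean_if_junk are hypothetical and not part of the answer; the wide string is assumed to come from widen() above.

```cpp
#include <string>

// Hypothetical helper, not from the original answer: true when ws is
// non-empty and every character is whitespace or one of the listed
// characters (U+00A0, U+00C2, U+00E2, U+20AC, U+2039).
bool is_only_junk(const std::wstring& ws)
{
    static const std::wstring junk =
        L" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039";
    return !ws.empty() && ws.find_first_not_of(junk) == std::wstring::npos;
}

// Takes the original UTF-8 string plus its widened form and collapses
// junk-only strings to a single space; anything else is returned as-is.
std::string clean_if_junk(const std::string& s, const std::wstring& ws)
{
    return is_only_junk(ws) ? " " : s;
}
```

For example, clean_if_junk(s, widen(s)) would give " " for the input " Â Â " and leave a string such as "abc" untouched.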
Other notes:
- std::wstring_convert and std::codecvt_utf8 (from <codecvt>) are deprecated in C++17, but since there is no other obvious choice in the standard library you might as well run with them. You can always change the implementations of narrow and widen if it comes to it.
- Depending on platform, std::wstring might not be the best choice (wchar_t is only 16 bits wide on Windows, for example), but it's probably fine. You could also look at std::u16string and std::u32string.
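If the deprecation is a concern, one option is to decode UTF-8 by hand into a std::u32string. The following is a rough sketch under that assumption, not a drop-in from any library; the function name utf8_to_utf32 is hypothetical, and for brevity it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical replacement for widen(): decodes UTF-8 into UTF-32 by hand.
// Throws std::runtime_error on truncated or malformed sequences.
std::u32string utf8_to_utf32(const std::string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells us how long the sequence is.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > s.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The same find_first_not_of approach then works on the result with a U"..." literal for the character set, and every element is a full code point regardless of platform.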
Live demo.
Inspired by this.