Problem description
I need to find Unicode text inside binary data (files).
I'm looking for any C or C++ code or library that I can use on macOS. Since I expect this to be useful on other platforms as well, I'd rather not make this question specific to macOS.
On macOS, the NSString functions, which would meet my Unicode-savviness needs, can't be used because they do not work on binary data.
As an alternative, I've tried the POSIX-compliant regex functions provided on macOS, but they have some limitations:
- They are not normalization-savvy: if I search for a precomposed (NFC) character, it is not found when it appears in the target data in decomposed (NFD) form.
- Case-insensitive search does not work for Latin NFC characters (searching for Ü does not find ü).
Example code showing these results is below.
What code or library is out there that fulfills these needs?
I do not need regex capabilities, but if there's a regex library that can handle these requirements, I'm fine with that, too.
Basically, I need to do a Unicode text search with these options:
- case-insensitive
- normalization-insensitive
- diacritics-insensitive
- works on arbitrary binary data, finding matching UTF-8 text fragments
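One way to picture these four requirements together is to fold both the search term and the data to a simplified byte form before comparing. The following is only a minimal sketch under stated assumptions: `kFold`, `fold_bytes` and `contains_folded` are hypothetical names, the table covers just a few Latin letters, and the fixed buffer sizes are arbitrary. A real solution would derive its mappings from Unicode data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mini table: second UTF-8 byte (after 0xC3) of a few
 * precomposed letters and the ASCII base letter each one folds to. */
static const struct { unsigned char second; char base; } kFold[] = {
    { 0x84, 'a' }, { 0xA4, 'a' },   /* Ä ä */
    { 0x96, 'o' }, { 0xB6, 'o' },   /* Ö ö */
    { 0x9C, 'u' }, { 0xBC, 'u' },   /* Ü ü */
};

/* Folds n bytes of `in` into `out` (which must hold at least n bytes) and
 * returns the folded length. Lengths are explicit, so NUL bytes are fine. */
size_t fold_bytes(const unsigned char *in, size_t n, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        unsigned char c = in[i];
        if (c == 0xC3 && i + 1 < n) {
            int hit = 0;
            for (size_t k = 0; k < sizeof kFold / sizeof kFold[0]; k++) {
                if (in[i + 1] == kFold[k].second) {
                    out[o++] = (unsigned char)kFold[k].base;
                    hit = 1;
                    break;
                }
            }
            if (hit) { i += 2; continue; }
        }
        /* Drop combining marks U+0300..U+036F (CC 80..CC BF, CD 80..CD AF). */
        if (i + 1 < n &&
            ((c == 0xCC && in[i + 1] >= 0x80 && in[i + 1] <= 0xBF) ||
             (c == 0xCD && in[i + 1] >= 0x80 && in[i + 1] <= 0xAF))) {
            i += 2;
            continue;
        }
        /* Lowercase ASCII letters; copy every other byte unchanged. */
        out[o++] = (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 32) : c;
        i++;
    }
    return o;
}

/* Returns 1 if the folded needle occurs in the folded data, else 0. */
int contains_folded(const void *data, size_t dlen,
                    const void *needle, size_t nlen) {
    unsigned char fd[1024], fn[256];   /* arbitrary sketch-sized buffers */
    assert(dlen <= sizeof fd && nlen <= sizeof fn);
    size_t d = fold_bytes(data, dlen, fd);
    size_t m = fold_bytes(needle, nlen, fn);
    if (m == 0) return 1;
    for (size_t i = 0; i + m <= d; i++)
        if (memcmp(fd + i, fn, m) == 0) return 1;
    return 0;
}
```

Calling `contains_folded(buf, len, "\xC3\x84", 2)` would look for Ä in any case, normalization form, or undecorated form. The trade-offs are that match offsets in the original data are lost, and that byte pairs in non-text regions which happen to look like combining marks are dropped too, so false positives are possible.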
Here's the test code showing the results from using the TRE regex implementation on macOS:
#include <stdio.h>
#include <regex.h>

// Searches for the literal string `what` in `data`, ignoring case, and prints the result.
// whatPre/dataPre only label the output: 1 = precomposed (NFC), 0 = decomposed (NFD).
void findIn (const char *what, const char *data, int whatPre, int dataPre) {
    regex_t re;
    // REG_LITERAL is a non-POSIX extension offered by the macOS regex implementation.
    if (regcomp (&re, what, REG_ICASE | REG_LITERAL) != 0) {
        printf ("Could not compile pattern %s\n", what);
        return;
    }
    int found = regexec (&re, data, 0, NULL, 0) == 0;
    regfree (&re);  // release the compiled pattern
    printf ("Found %s (%s) in %s (%s): %s\n", what, whatPre?"pre":"dec", data, dataPre?"pre":"dec", found?"yes":"no");
}

// Searches for `what` in both a precomposed and a decomposed "<ä>".
void findInBoth (const char *what, int whatPre) {
    char dataPre[] = { '<', 0xC3, 0xA4, '>', 0};        // precomposed
    char dataDec[] = { '<', 0x61, 0xCC, 0x88, '>', 0};  // decomposed
    findIn (what, dataPre, whatPre, 1);
    findIn (what, dataDec, whatPre, 0);
}

int main(int argc, const char * argv[]) {
    char a_pre[] = { 0xC3, 0xA4, 0};        // precomposed ä
    char a_dec[] = { 0x61, 0xCC, 0x88, 0};  // decomposed ä
    char A_pre[] = { 0xC3, 0x84, 0};        // precomposed Ä
    char A_dec[] = { 0x41, 0xCC, 0x88, 0};  // decomposed Ä
    findInBoth (a_pre, 1);
    findInBoth (a_dec, 0);
    findInBoth (A_pre, 1);
    findInBoth (A_dec, 0);
    return 0;
}
The output is:
Found ä (pre) in <ä> (pre): yes
Found ä (pre) in <ä> (dec): no
Found ä (dec) in <ä> (pre): no
Found ä (dec) in <ä> (dec): yes
Found Ä (pre) in <ä> (pre): no
Found Ä (pre) in <ä> (dec): no
Found Ä (dec) in <ä> (pre): no
Found Ä (dec) in <ä> (dec): yes
Desired output: all cases should give "yes".
Recommended answer
I've solved the issue by writing my own pre-processor, which generates a regular expression that combines all the alternatives (case and normalization, but not diacritics), and passing that to the regex function.
The complete solution is documented here.
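To illustrate the general idea of such a pre-processor, here is a minimal sketch. It is not the code from the linked solution. The table `kVariants` and the functions `expand_pattern` and `matches` are hypothetical names, and only ä and ü are covered. Each search character with known variants becomes a parenthesized alternation of its NFC/NFD and upper/lower byte sequences, ASCII letters become two-character bracket expressions, and regex metacharacters are escaped.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical equivalence table: each row lists UTF-8 byte sequences that
 * should match one another. A real tool would derive rows from Unicode data. */
static const char *const kVariants[][4] = {
    { "\xC3\xA4", "a\xCC\x88", "\xC3\x84", "A\xCC\x88" },  /* ä, a+U+0308, Ä, A+U+0308 */
    { "\xC3\xBC", "u\xCC\x88", "\xC3\x9C", "U\xCC\x88" },  /* ü, u+U+0308, Ü, U+U+0308 */
};
enum { kRows = sizeof kVariants / sizeof kVariants[0] };

/* Appends n bytes of s to out; returns 0 on success, -1 if out is too small. */
static int append(char *out, size_t cap, size_t *len, const char *s, size_t n) {
    if (*len + n + 1 > cap) return -1;
    memcpy(out + *len, s, n);
    *len += n;
    out[*len] = '\0';
    return 0;
}

/* Expands needle into a POSIX extended regex in out; returns 0 or -1. */
int expand_pattern(const char *needle, char *out, size_t cap) {
    size_t len = 0;
    assert(needle != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    while (*needle) {
        const char *const *row = NULL;
        size_t matched = 0;
        /* Does any known variant start at the current position? */
        for (size_t r = 0; r < kRows && !matched; r++) {
            for (size_t v = 0; v < 4; v++) {
                size_t n = strlen(kVariants[r][v]);
                if (strncmp(needle, kVariants[r][v], n) == 0) {
                    matched = n;
                    row = kVariants[r];
                    break;
                }
            }
        }
        unsigned char c = (unsigned char)*needle;
        if (matched) {
            /* Emit (v0|v1|v2|v3) and skip the matched variant. */
            for (size_t v = 0; v < 4; v++) {
                if (append(out, cap, &len, v ? "|" : "(", 1)) return -1;
                if (append(out, cap, &len, row[v], strlen(row[v]))) return -1;
            }
            if (append(out, cap, &len, ")", 1)) return -1;
            needle += matched;
        } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
            /* ASCII letter: match either case, e.g. [xX]. */
            char cls[4] = { '[', (char)(c | 0x20), (char)(c & ~0x20), ']' };
            if (append(out, cap, &len, cls, 4)) return -1;
            needle++;
        } else {
            /* Escape ERE metacharacters, copy everything else unchanged. */
            if (strchr(".[()*+?{|^$\\", c) && append(out, cap, &len, "\\", 1))
                return -1;
            if (append(out, cap, &len, needle, 1)) return -1;
            needle++;
        }
    }
    return 0;
}

/* Returns 1 if any variant of needle occurs in the NUL-terminated data. */
int matches(const char *needle, const char *data) {
    char pattern[256];
    regex_t re;
    if (expand_pattern(needle, pattern, sizeof pattern) != 0) return 0;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return 0;
    int found = regexec(&re, data, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

With this sketch, `matches("\xC3\x84", data)` should succeed for both the precomposed and the decomposed `<ä>` buffers from the test program in the question. Note that `regexec` still stops at the first NUL byte, so real binary data would need to be searched in NUL-delimited pieces or through a length-aware extension.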