Regex replace performance

Question

I'm a beginner C++ programmer working on a small C++ project for which I have to process a number of relatively large XML files and remove the XML tags from them. I've succeeded in doing so using the C++0x regex library. However, I'm running into some performance issues. Just reading in the files and executing the regex_replace function over their contents takes around 6 seconds on my PC. I can bring this down to 2 seconds by adding some compiler optimization flags. Using Python, however, I can get it done in less than 100 milliseconds. Obviously, I'm doing something very inefficient in my C++ code. What can I do to speed this up a bit?

My C++ code:

std::regex xml_tags_regex("<[^>]*>");

for (std::vector<std::string>::iterator it = _files.begin(); it !=
        _files.end(); it++) {

    std::ifstream file(*it);
    file.seekg(0, std::ios::end);
    size_t size = file.tellg();

    std::string buffer(size, ' ');

    file.seekg(0);
    file.read(&buffer[0], size);

    buffer = regex_replace(buffer, xml_tags_regex, "");

    file.close();
}

My Python code:

import re

regex = re.compile('<[^>]*>')

for filename in filenames:
    with open(filename) as f:
        content = f.read()
        content = regex.sub('', content)

P.S. I don't particularly need to process the complete file at once. I just found that reading a file line by line, word by word, or character by character slowed it down considerably.

Recommended answer

I don't think you're doing anything "wrong" per se; the C++ regex library just isn't as fast as the Python one (for this use case, at this time at least). This isn't too surprising: the Python regex code is all C/C++ under the hood as well, and it has been tuned over the years to be pretty fast, since regex handling is a fairly important feature in Python.
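If you want to see how the regex version behaves on your own machine and compiler flags, a minimal timing sketch might look like the following. The helper names strip_tags_regex and time_ms are made up for illustration and are not part of any library. The optimize flag is a standard hint, but whether it helps at all depends on the standard library implementation, so treat it as something to measure rather than a fix.

```cpp
#include <chrono>
#include <regex>
#include <string>

// Hypothetical helper: removes tags with std::regex, as in the question.
std::string strip_tags_regex(const std::string& text) {
    // Constructed once and reused across calls. `optimize` asks the library
    // to favor matching speed over construction time; its effect is
    // implementation-dependent (an assumption to verify by measuring).
    static const std::regex tags("<[^>]*>",
                                 std::regex::ECMAScript | std::regex::optimize);
    return std::regex_replace(text, tags, "");
}

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

With a file already read into buffer, something like time_ms([&] { result = strip_tags_regex(buffer); }) gives a rough number to compare between optimization levels. A single run is noisy, so repeat it a few times before drawing conclusions.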

But there are other options in C++ for making things faster if you need them. I've used PCRE (http://pcre.org/) in the past with great results, though I'm sure there are other good libraries out there these days as well.

For this case in particular, however, you can also achieve what you're after without regexes, which in my quick tests yielded a 10x performance improvement. For example, the following code scans your input string and copies everything to a new buffer; when it hits a <, it skips characters until it sees the closing >:

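// `size` is the file size in bytes, obtained as in the question's code.
std::string buffer(size, ' ');
std::string outbuffer(size, ' ');

// ... read the file contents into buffer ...

size_t outbuffer_len = 0;
for (size_t i = 0; i < buffer.size(); ++i) {
    if (buffer[i] == '<') {
        // Skip ahead to the closing '>'; check the bound before indexing.
        // Note: an unterminated '<' drops the rest of the input.
        while (i < buffer.size() && buffer[i] != '>') {
            ++i;
        }
    } else {
        outbuffer[outbuffer_len] = buffer[i];
        ++outbuffer_len;
    }
}
// Trim the output to the number of characters actually copied.
outbuffer.resize(outbuffer_len);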
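For reuse, the same idea can be wrapped in a function. The sketch below is a variant rather than the exact loop above: it uses std::string::find to jump between delimiters, and it keeps an unterminated < and everything after it, which is what the <[^>]*> regex would do since it cannot match without a closing >. The name strip_tags is hypothetical.

```cpp
#include <string>

// Hypothetical helper: returns `in` with every "<...>" span removed.
std::string strip_tags(const std::string& in) {
    std::string out;
    out.reserve(in.size());  // the result is never longer than the input

    std::size_t pos = 0;
    while (pos < in.size()) {
        std::size_t open = in.find('<', pos);
        if (open == std::string::npos) {
            // No more tags: copy the remainder and stop.
            out.append(in, pos, std::string::npos);
            break;
        }
        // Copy the text that precedes the tag.
        out.append(in, pos, open - pos);

        std::size_t close = in.find('>', open + 1);
        if (close == std::string::npos) {
            // Unterminated '<': keep it verbatim, as the regex would.
            out.append(in, open, std::string::npos);
            break;
        }
        pos = close + 1;  // resume right after the closing '>'
    }
    return out;
}
```

Calling strip_tags(buffer) on the file contents replaces the regex_replace call in the question's loop. Like the regex, this treats every < ... > span as a tag, so it is not aware of comments, CDATA sections, or a > inside an attribute value.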


08-22 13:29