I'm just grepping some Xliff files for the pattern approved="no". I have a shell script and a Python script, and the difference in performance is huge: for a set of 393 files totalling 3,686,329 lines, the shell script takes 0.1s of user time and the Python script 6.6s.
Shell: grep 'approved="no"' FILE
Python:
import codecs
import re

def grep(pattern, file_path):
    ret = False
    with codecs.open(file_path, "r", encoding="utf-8") as f:
        while 1 and not ret:
            lines = f.readlines(100000)
            if not lines:
                break
            for line in lines:
                if re.search(pattern, line):
                    ret = True
                    break
    return ret
Any ideas to improve performance with a multiplatform solution?
Results
Here are a couple of results after applying some of the proposed solutions.
Tests were run on a RHEL6 Linux machine, with Python 2.6.6.
Working set: 393 Xliff files, 3,686,329 lines in total.
Numbers are user time in seconds.
grep_1 (io, joining 100,000 file lines): 50s
grep_3 (mmap): 0.7s
Shell version (Linux grep): 0.130s
Python, being an interpreted language, will always be slower than a compiled C program such as grep.
Apart from that, your Python implementation is not the same as your grep example. It does not return the matching lines; it merely tests whether the pattern matches the characters on any one line. A closer comparison would be:
grep -q 'approved="no"' FILE
which will return as soon as a match is found and not produce any output.
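For reference, a line-by-line early-exit scan in Python that mirrors grep -q's found-or-not behaviour might look like this (grep_q is a hypothetical name, not from the original post):

```python
import re

def grep_q(pattern, file_path):
    # Stop at the first matching line, like grep -q: we only care
    # whether a match exists anywhere, not which lines matched.
    regex = re.compile(pattern)
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            if regex.search(line):
                return True
    return False
```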
You can substantially speed up your code by writing your grep() function more efficiently:
import io
import re

def grep_1(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            lines = f.readlines(100000)
            if not lines:
                return False
            if re.search(pattern, ''.join(lines)):
                return True
This uses io instead of codecs, which I found was a little faster. The while loop condition does not need to check ret, and you can return from the function as soon as the result is known. There's no need to run re.search() for each individual line - just join the lines and perform a single search.
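A further marginal gain, not measured in the original answer: pre-compiling the pattern once per file avoids the repeated pattern lookup (cached, but not free) inside each re.search() call. A sketch of that variant (grep_1c is a hypothetical name):

```python
import io
import re

def grep_1c(pattern, file_path):
    # Same chunked strategy as grep_1, but the pattern is compiled
    # once up front instead of being looked up on every search.
    regex = re.compile(pattern)
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            lines = f.readlines(100000)
            if not lines:
                return False
            if regex.search("".join(lines)):
                return True
```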
At the cost of memory usage you could try this:
import io
import re

def grep_2(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, f.read())
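Assuming the pattern is plain ASCII (as approved="no" is), much of the remaining gap to grep comes from UTF-8 decoding, which a binary read skips entirely. A sketch under that assumption (grep_2b is a hypothetical name; note the pattern must be bytes):

```python
import re

def grep_2b(pattern, file_path):
    # Binary mode: no UTF-8 decode pass over the data. The pattern
    # must be a bytes object, e.g. b'approved="no"'.
    with open(file_path, "rb") as f:
        return re.search(pattern, f.read()) is not None
```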
If memory is an issue, you could mmap the file and run the regex search on the mmap:
import io
import mmap
import re

def grep_3(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))
mmap will efficiently read the data from the file in pages without consuming a lot of memory. Also, you'll probably find that mmap runs faster than the other solutions.
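One portability note not in the original answer (which used Python 2.6): on Python 3 an mmap exposes bytes, so the file must be opened in binary mode and the pattern must be a bytes object. A sketch of the Python 3 equivalent (grep_3b is a hypothetical name):

```python
import mmap
import re

def grep_3b(pattern, file_path):
    # Python 3 variant: binary open plus a bytes pattern, because
    # re cannot mix a str pattern with the bytes-like mmap buffer.
    with open(file_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return re.search(pattern, m) is not None
```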
Using timeit for each of these functions shows that this is the case:
10 loops, best of 3: 639 msec per loop    # grep()
10 loops, best of 3: 78.7 msec per loop   # grep_1()
10 loops, best of 3: 19.4 msec per loop   # grep_2()
100 loops, best of 3: 5.32 msec per loop  # grep_3()
The file was /usr/share/dict/words, containing approx. 480,000 lines, and the search pattern was zymurgies, which occurs near the end of the file. For comparison, when the pattern is near the start of the file, e.g. abaciscus, the times are:
10 loops, best of 3: 62.6 msec per loop     # grep()
1000 loops, best of 3: 1.6 msec per loop    # grep_1()
100 loops, best of 3: 14.2 msec per loop    # grep_2()
10000 loops, best of 3: 37.2 usec per loop  # grep_3()
which again shows that the mmap version is fastest.
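For completeness, timings like the ones quoted above can be collected with a small timeit harness along these lines (time_grep is a hypothetical helper; the original answer did not show its driver script):

```python
import timeit

def time_grep(func, pattern, file_path, number=10):
    # Time `number` calls of one grep variant against one file and
    # return the total elapsed wall-clock seconds, as timeit measures it.
    return timeit.timeit(lambda: func(pattern, file_path), number=number)
```

For example, time_grep(grep_3, 'zymurgies', '/usr/share/dict/words') would time ten mmap-based searches of the dictionary file.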
Now comparing the grep command with the Python mmap version:
$ time grep -q zymurgies /usr/share/dict/words
real 0m0.010s
user 0m0.007s
sys 0m0.003s
$ time python x.py grep_3 # uses mmap
real 0m0.023s
user 0m0.019s
sys 0m0.004s
Which is not too bad considering the advantages that grep has.