Problem Description
I need to egrep a large CSV file with 2 million lines, and I want to cut the egrep time down to 0.5 seconds. Is this possible at all? No, I don't want to use a database (sqlite3 or MySQL) at this time.
$ time wc foo.csv
2000000 22805420 334452932 foo.csv
real 0m3.396s
user 0m3.261s
sys 0m0.115s
I've been able to cut the run time down from 40 seconds to 1.75 seconds:
$ time egrep -i "storm|broadway|parkway center|chief financial" foo.csv|wc -l
108292
real 0m40.707s
user 0m40.137s
sys 0m0.309s
$ time LC_ALL=C egrep -i "storm|broadway|parkway center|chief financial" foo.csv|wc -l
108292
real 0m1.751s
user 0m1.590s
sys 0m0.140s
But I want the egrep real time to be less than half a second; any tricks would be greatly appreciated. The file changes continuously, so I can't use any caching mechanism...
Recommended Answer
If you are just searching for keywords, you could use fgrep (or grep -F) instead of egrep:
LC_ALL=C grep -F -i -e storm -e broadway -e "parkway center" -e "chief financial"
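For example, you could time the fixed-string version against foo.csv exactly as in the timings above; the match count should still be 108292, since the original egrep pattern is just an alternation of literals (actual timings will of course vary by machine). Keeping LC_ALL=C matters here too: it keeps grep in single-byte mode and avoids the multibyte case conversion that makes -i expensive in UTF-8 locales.

$ time LC_ALL=C grep -F -i -e storm -e broadway -e "parkway center" -e "chief financial" foo.csv | wc -l
108292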
The next thing to try would be factoring out -i, which is probably now the bottleneck. If you're sure that only the first letter might be capitalized, for example, you could do:
LC_ALL=C grep -F \
-e{S,s}torm -e{B,b}roadway -e{P,p}"arkway "{C,c}enter -e{C,c}"hief "{F,f}inancial
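As a quick sanity check (not part of the original answer), you can have the shell print what one of those brace expansions actually passes to grep, one argument per line:

$ printf '%s\n' -e{P,p}"arkway "{C,c}enter
-eParkway Center
-eParkway center
-eparkway Center
-eparkway center

Each expanded word becomes a separate -e fixed-string pattern, so the four capitalization variants are matched without the per-character case folding that -i would require.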