Problem Description
I’m trying to do something pretty simple; grep from a list, an exact match for the string, on the files in a directory:
#try grep each line from the files
for i in $(cat /data/datafile); do
  LOOK=$(echo $i)
  fgrep -r $LOOK /data/filestosearch >> /data/output.txt
done

The file with the patterns to grep for has 20 million lines, and the directory has ~600 files, with a total of ~40 million lines. I can see that this is going to be slow, but we estimated it will take 7 years. Even if I use 300 cores on our HPC, splitting the job by files to search, it looks like it could take over a week.
There are similar questions here, and although they are on different platforms, I think they might possibly help me. There is also fgrep, which is potentially faster (but seems to be a bit slow as I'm testing it now). Can anyone see a faster way to do this? Thank you in advance.
Solution

Sounds like the -f flag for grep would be suitable here:
-f FILE, --file=FILE
    Obtain patterns from FILE, one per line. The empty file
    contains zero patterns, and therefore matches nothing.
    (-f is specified by POSIX.)

so grep can already do what your loop is doing, and you can replace the loop with:
grep -F -r -f /data/datafile /data/filestosearch >> /data/output.txt

Now I'm not sure about the performance of 20 million patterns, but at least you aren't starting 20 million processes this way, so it's probably significantly faster.
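To see that the single grep -F -f invocation really is equivalent to the per-line loop, here is a minimal sketch on toy data (the /tmp/grepdemo paths and file contents are made up for illustration; the real paths are the /data ones above):

```shell
# Build a tiny stand-in for /data/datafile and /data/filestosearch.
mkdir -p /tmp/grepdemo/filestosearch
printf 'apple\nbanana\n' > /tmp/grepdemo/patterns
printf 'apple pie\ncherry tart\n' > /tmp/grepdemo/filestosearch/a.txt
printf 'banana split\n' > /tmp/grepdemo/filestosearch/b.txt

# One grep process reads all patterns at once (-f) as fixed strings (-F)
# and searches the directory recursively (-r), instead of spawning one
# fgrep per pattern line as the loop does.
grep -F -r -f /tmp/grepdemo/patterns /tmp/grepdemo/filestosearch | sort
```

With 20 million pattern lines, the win comes from starting one process instead of 20 million, and from grep matching all fixed strings in a single pass over each file.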
This concludes the article on a very slow loop using grep or fgrep on a large dataset. We hope the recommended answer is helpful.