This article covers how to deal with a very slow loop that uses grep or fgrep on a large data set; the original question and the accepted answer follow.

Problem description

I’m trying to do something pretty simple; grep from a list, an exact match for the string, on the files in a directory:

#try grep each line from the files
for i in $(cat /data/datafile); do          # one iteration per pattern line
  LOOK=$(echo $i)                           # the pattern for this iteration
  fgrep -r $LOOK /data/filestosearch >>/data/output.txt   # launches a new fgrep process per pattern
done

The file with the strings to grep for has 20 million lines, and the directory has ~600 files with a total of ~40 million lines. I can see that this is going to be slow, but we estimated it will take 7 years. Even if I use 300 cores on our HPC, splitting the job by files to search, it looks like it could take over a week.

There are similar questions:

Loop Running VERY Slow

Very slow foreach loop

here, and although they are on different platforms, I think possibly if/else might help me. Or fgrep, which is potentially faster (but seems to be a bit slow as I'm testing it now). Can anyone see a faster way to do this? Thank you in advance.

Solution

Sounds like the -f flag for grep would be suitable here:

-f FILE, --file=FILE
    Obtain  patterns  from  FILE,  one  per  line.   The  empty file
    contains zero patterns, and therefore matches nothing.   (-f  is
    specified by POSIX.)
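
For instance (a hypothetical two-pattern file, just to illustrate -f; none of these names come from the question):

printf 'foo\nbar\n' > /tmp/patterns     # hypothetical pattern file, one fixed string per line
grep -F -f /tmp/patterns /some/file     # prints every line of /some/file containing foo or bar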

so grep can already do what your loop is doing, and you can replace the loop with:

grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt

Now I'm not sure about the performance of 20 million patterns, but at least you aren't starting 20 million processes this way, so it's probably significantly faster.
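
If memory or runtime is still a problem, two standard tricks may help, though the gains here are an assumption rather than something measured in this answer: forcing the C locale avoids multibyte-locale overhead in GNU grep, and splitting the 20-million-line pattern file bounds the memory of each grep and gives natural units of work for the 300 HPC cores the question mentions. A minimal sketch, assuming GNU grep and coreutils and the paths from the question (the chunk size is arbitrary):

# split the patterns into 1M-line chunks: /tmp/patterns.aa, /tmp/patterns.ab, ...
split -l 1000000 /data/datafile /tmp/patterns.
# one grep per chunk; each iteration could instead be submitted as its own HPC job
# add -x if "exact match" means the pattern must match a whole line
for p in /tmp/patterns.*; do
  LC_ALL=C grep -F -r -f "$p" /data/filestosearch
done >>/data/output.txt

The trade-off: each chunk rescans all ~600 files, but every grep stays small enough to fit in memory, and the chunks are independent, so they parallelize trivially.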

That concludes this article on the very slow loop using grep or fgrep on a large data set; we hope the answer above helps.
