algorithm - Searching for common strings in two given input files

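I have two files of 20 GB each, and I have to find the strings common to both. Assume the maximum length of a string is 20 bytes. To solve this I am using the following algorithm, on a system with 8 GB of RAM and a quad-core i3 CPU.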

sort the files using any suitable sorting utility
open files A and B for reading
read wordA from A
read wordB from B
while (A not EOF and B not EOF)
{
    if (wordA < wordB)
        read wordA from A
    else if (wordA > wordB)
        read wordB from B
    else
    {
        /* match found: store it in the output file */
        write wordA into output
        read wordA from A
    }
}
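
For reference, the merge loop above could be written in runnable form roughly as follows. This is a minimal Python sketch, not the asker's actual program. It assumes both inputs are already sorted in the same order that the comparison operators use (for files read in binary mode, that means byte order, as produced by `LC_ALL=C sort`), and the function name `common_sorted` is hypothetical.

```python
def common_sorted(a, b):
    """Yield items present in both sorted inputs.

    Only one item from each input is held in memory at a time.
    Assumes a and b are sorted ascending under the ordering used by < and >.
    """
    it_a, it_b = iter(a), iter(b)
    word_a = next(it_a, None)  # None marks end of input
    word_b = next(it_b, None)
    while word_a is not None and word_b is not None:
        if word_a < word_b:
            word_a = next(it_a, None)
        elif word_a > word_b:
            word_b = next(it_b, None)
        else:
            # Match found: emit it, then advance A only, as in the pseudocode,
            # so a string repeated in A is reported once per repetition.
            yield word_a
            word_a = next(it_a, None)


# Usage with hypothetical file names; binary mode compares raw bytes.
# Every line, including the last, is assumed to end with a newline.
# with open("A.sorted", "rb") as fa, open("B.sorted", "rb") as fb, \
#         open("common.out", "wb") as out:
#     out.writelines(common_sorted(fa, fb))
```

Because file objects are iterated lazily, memory use of the loop itself stays constant regardless of file size; the cost is dominated by the external sort that precedes it.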

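With the system configuration above it ran successfully, but when I ran the same algorithm on a system with 6 GB of RAM, 120 GB of free disk space and a 6-core i3 processor, the system shut down many times. Why does this happen?
Please tell me about any other way to solve this problem! Can we improve its performance?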

Best answer

Yes, you can improve the performance significantly with a very short awk one-liner:

awk 'FNR==NR{a[$0]++;next}a[$0]' file1 file2

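With awk you can find the common lines without having to sort the files first. You didn't say what you want to do with the common lines, so I assumed you want to print them out.
If you want to print each common line only once, no matter how many times it is repeated, you can use the following command: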

awk 'FNR==NR{a[$0]=1;next}a[$0]-- > 0' file1 file2
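
To make the logic of the two one-liners explicit, here is a rough Python equivalent. It is only a sketch: the function name `common_lines` and the `once` parameter are hypothetical, and it assumes newline-terminated lines. Like the awk array, the set holds every distinct line of the first file in memory, so for 20 GB inputs this approach is likely practical only if the first file contains many duplicates or the machine has enough RAM.

```python
def common_lines(path1, path2, once=False):
    """Yield lines of path2 that also occur in path1, without sorting.

    once=False mirrors the first one-liner: every matching line of path2
    is yielded, including repeats.
    once=True mirrors the second one-liner: each common line is yielded
    only the first time it is seen in path2.
    """
    with open(path1, "rb") as f1:
        # All distinct lines of the first file are kept in memory.
        seen = {line.rstrip(b"\n") for line in f1}
    with open(path2, "rb") as f2:
        for raw in f2:
            line = raw.rstrip(b"\n")
            if line in seen:
                if once:
                    seen.discard(line)  # later repeats no longer match
                yield line


# Usage with hypothetical file names:
# for line in common_lines("file1", "file2", once=True):
#     print(line.decode())
```

If only one of the two files fits in memory as a set of distinct lines, passing that file as `path1` keeps the footprint lower, since the second file is only streamed.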