Problem Description
I have a 12Gb file of combined hash lists. I need to find the duplicates in it but I've been having some issues.
Some 920 (uniq'd) lists were merged using cat *.txt > _uniq_combined.txt
resulting in a huge list of hashes. Once merged, the final list WILL contain duplicates.
I thought I had it with awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt && say finished ya jabroni
awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt
results in a file with a size of 4574766572 bytes.
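For reference, here is what that awk filter prints on a tiny made-up sample (sample.txt and the values in it are placeholders, not real hashes):

printf 'aaa\nbbb\naaa\nccc\nbbb\naaa\n' > sample.txt
awk '!seen[$0]++' sample.txt
# prints each distinct line the first time it is seen, i.e. it keeps the first occurrence of every line:
# aaa
# bbb
# ccc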
I was told that a file that large is not possible and to try again.
sort _uniq_combined.txt | uniq -c | grep -v '^ *1 ' > _SORTEDC_duplicates.txt
results in a file with a size of 1624577643 bytes. Significantly smaller.
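On the same toy sample.txt as above, that pipeline keeps only the counted lines whose count is greater than 1:

sort sample.txt | uniq -c | grep -v '^ *1 '
# output is the count followed by the hash, roughly:
#   3 aaa
#   2 bbb

Note that the count prefix and its padding are part of every output line.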
sort _uniq_combined.txt | uniq -d > _UNIQ_duplicates.txt
results in a file with a size of 1416298458 bytes.
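And with uniq -d on the same toy file:

sort sample.txt | uniq -d
# prints each line that occurs more than once, exactly once and without a count:
# aaa
# bbb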
I'm beginning to think I don't know what these commands do since the file sizes should be the same.
Again, the goal is to look through a giant list and save instances of hashes seen more than once. Which (if any) of these results is correct? I thought they all do the same thing.
Answer
sort is designed especially to cope with huge files too. You could do:
cat *.txt | sort >all_sorted                                # merge every list and sort the result
uniq all_sorted >unique_sorted                              # collapse each run of identical hashes to one line
sdiff -sld all_sorted unique_sorted | uniq >all_duplicates  # the surplus copies (present only in all_sorted) are the duplicated hashes
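As a quick sanity check, one could exercise the same three steps on a tiny made-up sample (sample.txt and its values are placeholders, not real hashes):

printf 'aaa\nbbb\naaa\nccc\nbbb\naaa\n' > sample.txt
sort sample.txt > all_sorted
uniq all_sorted > unique_sorted
sdiff -sld all_sorted unique_sorted | uniq
# the duplicated hashes (aaa and bbb) should each come out once,
# annotated with sdiff's '<' change marker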