This article looks at how to filter duplicates out of a very large list using sort, uniq, or awk. It may be a useful reference for anyone facing the same problem.

Problem Description

I have a 12 GB file of combined hash lists. I need to find the duplicates in it, but I've been having some issues.

Some 920 (uniq'd) lists were merged using cat *.txt > _uniq_combined.txt, resulting in a huge list of hashes. Once merged, the final list WILL contain duplicates.

I thought I had it handled with awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt && say finished ya jabroni. That awk command results in a file with a size of 4574766572 bytes.
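
For what it's worth, awk '!seen[$0]++' prints the first occurrence of every line, i.e. it de-duplicates the whole list rather than extracting the duplicates, which would explain why this output is by far the largest of the three. A minimal sketch on a toy input (the toy data and the duplicate-only variant are my own illustration, not part of the original question):

printf 'aaa\nbbb\naaa\nccc\n' | awk '!seen[$0]++'      # aaa bbb ccc -- every distinct hash, kept once
printf 'aaa\nbbb\naaa\nccc\n' | awk 'seen[$0]++ == 1'  # aaa         -- only hashes that appear more than once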

I was told that a file that large is not possible and to try again.

sort _uniq_combined.txt | uniq -c | grep -v '^ *1 ' > _SORTEDC_duplicates.txt results in a file with a size of 1624577643 bytes. Significantly smaller.
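
On a toy input (again my own illustration, not from the original question), this pipeline keeps one line per repeated hash and prefixes it with its count; that leading count column is extra data that the uniq -d variant below does not carry:

printf 'aaa\nbbb\naaa\nccc\n' | sort | uniq -c | grep -v '^ *1 '
# prints something like "      2 aaa"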

sort _uniq_combined.txt | uniq -d > _UNIQ_duplicates.txt results in a file with a size of 1416298458 bytes.
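
On the same toy input, uniq -d prints each repeated hash exactly once with no count column, which is why this file comes out somewhat smaller than the uniq -c version even though it lists the same hashes:

printf 'aaa\nbbb\naaa\nccc\n' | sort | uniq -d
# prints "aaa"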

I'm beginning to think I don't know what these commands do, since the file sizes should be the same.

Again, the goal is to look through a giant list and save the hashes that occur more than once. Which of these results (if any) is correct? I thought they all did the same thing.

Recommended Answer

sort is designed especially to cope with huge files, too. You could do:

cat *.txt | sort > all_sorted                                  # every hash from every list, sorted
uniq all_sorted > unique_sorted                                # each distinct hash exactly once
sdiff -sld all_sorted unique_sorted | uniq > all_duplicates    # whatever is left over is a repeated hash
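
Roughly: all_sorted holds every hash in sorted order, unique_sorted holds each distinct hash once, and the sdiff step keeps only the lines present in all_sorted but not in unique_sorted, i.e. the surplus occurrences; the final uniq collapses those so each repeated hash is listed once. Assuming the goal is simply the set of hashes seen more than once, the result should essentially match the sort ... | uniq -d approach above, modulo sdiff's column formatting.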

