Problem Description
I have a 12Gb file of combined hash lists. I need to find the duplicates in it but I've been having some issues.
Some 920 (uniq'd) lists were merged using cat *.txt > _uniq_combined.txt
resulting in a huge list of hashes. Once merged, the final list WILL contain duplicates.
I thought I had it with awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt && say finished ya jabroni
awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt
results in a file with a size of 4574766572 bytes.
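For reference, here is what that awk filter prints on a tiny made-up sample (sample.txt and the values in it are placeholders, not real hashes):

printf 'aaa\nbbb\naaa\nccc\nbbb\naaa\n' > sample.txt
awk '!seen[$0]++' sample.txt
# prints each distinct line the first time it is seen, i.e. it keeps the first occurrence of every line:
# aaa
# bbb
# ccc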
I was told that a file that large is not possible and to try again.
sort _uniq_combined.txt | uniq -c | grep -v '^ *1 ' > _SORTEDC_duplicates.txt
results in a file with a size of 1624577643 bytes. Significantly smaller.
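On the same toy sample.txt as above, that pipeline keeps only the counted lines whose count is greater than 1:

sort sample.txt | uniq -c | grep -v '^ *1 '
# output is the count followed by the hash, roughly:
#   3 aaa
#   2 bbb

Note that the count prefix and its padding are part of every output line.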
sort _uniq_combined.txt | uniq -d > _UNIQ_duplicates.txt
results in a file with a size of 1416298458 bytes.
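And with uniq -d on the same toy file:

sort sample.txt | uniq -d
# prints each line that occurs more than once, exactly once and without a count:
# aaa
# bbb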
I'm beginning to think I don't know what these commands do since the file sizes should be the same.
Again, the goal is to look through a giant list and save instances of hashes seen more than once. Which (if any) of these results is correct? I thought they all do the same thing.
Answer
sort is designed especially to cope with huge files too. You could do:
cat *.txt | sort >all_sorted                                # merge every list and sort the result
uniq all_sorted >unique_sorted                              # collapse each run of identical hashes to one line
sdiff -sld all_sorted unique_sorted | uniq >all_duplicates  # the surplus copies (present only in all_sorted) are the duplicated hashes
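As a quick sanity check, one could exercise the same three steps on a tiny made-up sample (sample.txt and its values are placeholders, not real hashes):

printf 'aaa\nbbb\naaa\nccc\nbbb\naaa\n' > sample.txt
sort sample.txt > all_sorted
uniq all_sorted > unique_sorted
sdiff -sld all_sorted unique_sorted | uniq
# the duplicated hashes (aaa and bbb) should each come out once,
# annotated with sdiff's '<' change marker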