Problem description
How do I find duplicates in a column?
$ head countries_lat_long_int_code3.csv | cat -n
1 country,latitude,longitude,name,code
2 AD,42.546245,1.601554,Andorra,376
3 AE,23.424076,53.847818,United Arab Emirates,971
4 AF,33.93911,67.709953,Afghanistan,93
5 AG,17.060816,-61.796428,Antigua and Barbuda,1
6 AI,18.220554,-63.068615,Anguilla,1
7 AL,41.153332,20.168331,Albania,355
8 AM,40.069099,45.038189,Armenia,374
9 AN,12.226079,-69.060087,Netherlands Antilles,599
10 AO,-11.202692,17.873887,Angola,244
For instance, this file has duplicates in the 5th column:
5 AG,17.060816,-61.796428,Antigua and Barbuda,1
6 AI,18.220554,-63.068615,Anguilla,1
How do I view all the other duplicates in this file?
I know I can do this:
awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort
I can then eyeball the output to see whether there are any duplicates, but is there a better way?
Or I can do this. Find out how many values there are in total:
$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | wc -l
210
Find out how many unique values there are:
$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | uniq | wc -l
183
Therefore 27 (210 - 183) rows repeat a code that appears earlier in the file, so at most 27 distinct codes are duplicated.
EDIT1
My desired output would be something like the following: all the columns, but showing only the rows that are duplicates:
5 AG,17.060816,-61.796428,Antigua and Barbuda,1
6 AI,18.220554,-63.068615,Anguilla,1
Recommended answer
This will give you the duplicated codes:
awk -F, 'a[$5]++{print $5}' countries_lat_long_int_code3.csv
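Note that the one-liner above prints a code once for every extra occurrence, so a code shared by three rows shows up twice. As a minimal sketch (not part of the original answer), comparing the counter against 1 lists each duplicated code exactly once. The sample file below is hypothetical: the AD, AG and AI rows come from the question, while the XX row is made up to give code 1 a third occurrence.

```shell
# Hypothetical sample file; the XX row is invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
country,latitude,longitude,name,code
AD,42.546245,1.601554,Andorra,376
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
XX,0,0,Example,1
EOF

# Original form: prints the code on every occurrence after the first.
all_hits=$(awk -F, 'a[$5]++{print $5}' "$sample")

# a[$5]++ yields the number of earlier rows with this code, so "== 1"
# is true only on the second occurrence: each duplicated code prints once.
dup_codes=$(awk -F, 'NR>1 && a[$5]++ == 1 {print $5}' "$sample")

echo "every repeat:  $(echo "$all_hits" | tr '\n' ' ')"
echo "distinct dups: $dup_codes"
rm -f "$sample"
```

The `NR>1` guard simply skips the header line, as in the commands from the question.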
If you're only interested in the count of duplicate codes:
awk -F, 'a[$5]++{count++} END{print count+0}' countries_lat_long_int_code3.csv
To print the duplicated rows, try this:
awk -F, '$5 in a{print a[$5]; print} {a[$5]=$0}' countries_lat_long_int_code3.csv
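One caveat: because a[$5] is overwritten with the most recent row, a code that occurs three or more times has its middle rows printed twice. A possible variant, offered as a sketch rather than as the answerer's method, remembers the first row for each code and emits it only once. The array names `seen` and `first` and the sample file are assumptions for illustration; the XX row is made up.

```shell
# Hypothetical sample file; the XX row is invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
country,latitude,longitude,name,code
AD,42.546245,1.601554,Andorra,376
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
XX,0,0,Example,1
EOF

dup_rows=$(awk -F, '
  NR == 1 { next }                          # skip the header line
  seen[$5]++ == 0 { first[$5] = $0; next }  # first occurrence: remember it, print nothing yet
  seen[$5] == 2 { print first[$5] }         # second occurrence: emit the remembered first row
  { print }                                 # every repeated row is printed exactly once
' "$sample")

echo "$dup_rows"
rm -f "$sample"
```

Rows come out in file order, with the first row of each group appearing just before its first repeat; piping the result through `sort -t, -k5,5` would group them by code if that is preferred.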