Problem description
I have a CSV file with multiple entries. Example csv:
user, phone, email
joe, 123, [email protected]
mary, 456, [email protected]
ed, 123, [email protected]
I'm trying to remove duplicates based on a specific column in the CSV. However, with the code below I'm getting a "list index out of range" error. I thought that by comparing row[1] with newrows[1] I would find all duplicates and rewrite only the unique entries to file2.csv. This doesn't work, though, and I can't understand why.
import csv

f1 = csv.reader(open('file1.csv', 'rb'))
newrows = []
for row in f1:
    if row[1] not in newrows[1]:
        newrows.append(row)
writer = csv.writer(open("file2.csv", "wb"))
writer.writerows(newrows)
The end result I want is a list that maintains the order of the file (a set won't work... right?), which should look like this:
user, phone, email
joe, 123, [email protected]
mary, 456, [email protected]
Recommended answer
row[1] refers to the second column in the current row (phone). That's all well and good.
However, your newrows.append(row) call adds the entire row to the list.
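When you check row[1] in newrows, you are checking the individual phone number against a list of complete rows, so it never matches. That's not what you want. (As written, newrows[1] also raises the "list index out of range" error, because newrows is empty on the first iteration and has no element at index 1.) You need to check against a list or set of just the phone numbers. For that, you probably want to keep track of both the rows and a set of the phone numbers seen so far.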
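To make the difference concrete, here is a minimal sketch that uses in-memory rows instead of the files. The rows list is a hypothetical stand-in for what csv.reader would yield, and the email values are made-up placeholders, not the ones from the question.

```python
# Hypothetical rows standing in for what csv.reader would yield.
# The email values are placeholders invented for this illustration.
rows = [
    ["user", "phone", "email"],
    ["joe", "123", "joe@example.com"],
    ["mary", "456", "mary@example.com"],
    ["ed", "123", "ed@example.com"],
]

newrows = []

# 1) Indexing an empty list fails before any comparison happens.
try:
    newrows[1]
    index_error_raised = False
except IndexError:
    index_error_raised = True

# 2) Comparing a phone number (a string) against whole rows (lists)
#    never matches, so every row is kept.
for row in rows:
    if row[1] not in newrows:
        newrows.append(row)

# 3) Comparing against a set of phone numbers drops later duplicates
#    while the list keeps the original order.
seen_phones = set()
unique_rows = []
for row in rows:
    if row[1] not in seen_phones:
        unique_rows.append(row)
        seen_phones.add(row[1])
```

The set is only used for membership tests; the order comes from the list (or from writing rows as they are read), which is why a set does not get in the way of preserving the file's sequence.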
Something like this:
import csv

f1 = csv.reader(open('file1.csv', 'rb'))
writer = csv.writer(open("file2.csv", "wb"))
phone_numbers = set()  # phone numbers already written
for row in f1:
    # Write a row only the first time its phone number appears.
    if row[1] not in phone_numbers:
        writer.writerow(row)
        phone_numbers.add(row[1])
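The 'rb' and 'wb' file modes used above are Python 2 idioms; on Python 3 the csv module expects text-mode files opened with newline=''. A rough Python 3 sketch of the same idea follows. The function name dedupe_by_column and its parameters are assumptions made for illustration, and stripping whitespace from the key and skipping short rows are choices added here, not part of the original answer.

```python
import csv


def dedupe_by_column(src_path, dst_path, column=1):
    """Copy src_path to dst_path, keeping the first row for each value in `column`.

    Hypothetical helper for illustration; not part of the csv module.
    """
    seen = set()
    # newline='' lets the csv module handle line endings itself.
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            if len(row) <= column:
                continue  # assumption: blank or short rows are dropped
            # Strip so that " 123" and "123" count as the same key.
            key = row[column].strip()
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
```

Calling dedupe_by_column('file1.csv', 'file2.csv') would then play the role of the loop above, with the with statement closing both files when the loop finishes.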