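I am trying to compare two CSV files with Python, but my code does not show all the mismatches. It reports only the first mismatch in each row, and I need every mismatch in a given row.
Python code: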
import csv, itertools
column_names = ['id','name','amount']
source_data = csv.reader(open('src.csv'))
target_data = csv.reader(open('tgt.csv'))
counter = 1
def rowElementCompare(sourceRow, targetRow):
    row_length = min(len(sourceRow), len(targetRow))
    for i in range(row_length):
        if sourceRow[i] != targetRow[i]:
            print i
            return i
    return None
for source_row,target_row in itertools.izip(source_data,target_data):
    comparison_result = None
    comparison_result = rowElementCompare(source_row, target_row)
    #print (comparison_result)
    if comparison_result is not None: #comparison_result is the column index at which the mismatch occurred
        print "Mismatch in column %s on row number %d , source value %s, target value %s" % (column_names[comparison_result], counter, source_row[comparison_result], target_row[comparison_result])
    counter += 1
File 1:
id,name,amount
1,bob,20
3,eva,8
3,sarah,7
4,jeff,19
6,fred,10
File 2:
id,name,amount
1,bob,23
3,sarah,7
4,jeff,19
5,mira,81
6,fred,13
Output of my code:
Mismatch in column amount on row number 2 , source value 20, target value 23
Mismatch in column name on row number 3 , source value eva, target value sarah
Mismatch in column id on row number 4 , source value 3, target value 4
Mismatch in column id on row number 5 , source value 4, target value 5
Mismatch in column amount on row number 6 , source value 10, target value 13
Expected output:
Mismatch in column amount on row number 2 , source value 20, target value 23
Mismatch in column name on row number 3 , source value eva, target value sarah
Mismatch in column id on row number 4 , source value 3, target value 4
Mismatch in column name on row number 4 , source value sarah, target value jeff
Mismatch in column amount on row number 4 , source value 7, target value 19
Mismatch in column id on row number 5 , source value 4, target value 5
Mismatch in column name on row number 5 , source value jeff, target value mira
Mismatch in column amount on row number 5 , source value 19, target value 81
...
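Best answer
The problem is that rowElementCompare is called only once per row. Calling it repeatedly would not help either, because it always starts at the beginning of the row and stops at the first mismatch it finds.
One way to fix this is to have rowElementCompare yield each result instead of returning it. That way you can iterate over all the mismatches in the row.
Here is the updated code. The changed lines are marked with an # UPDATED comment.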
import csv, itertools
column_names = ['id','name','amount']
source_data = csv.reader(open('foo1.csv'))
target_data = csv.reader(open('foo2.csv'))
counter = 1
def rowElementCompare(sourceRow, targetRow):
    row_length = min(len(sourceRow), len(targetRow))
    for i in range(row_length):
        if sourceRow[i] != targetRow[i]:
            print i
            yield i # UPDATED
    return # UPDATED
for source_row,target_row in itertools.izip(source_data,target_data):
    comparison_result = None
    for comparison_result in rowElementCompare(source_row, target_row): # UPDATED
        print "Mismatch in column %s on row number %d , source value %s, target value %s" % (column_names[comparison_result], counter, source_row[comparison_result], target_row[comparison_result])
    counter += 1
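For readers on Python 3, the same generator idea might look roughly like the sketch below. It is an illustrative adaptation, not the answer's original code: itertools.izip is replaced by the built-in zip, print is a function, and the names row_mismatches and compare_rows are hypothetical. It works on rows that are already parsed, so it can be tried without any files on disk.

```python
def row_mismatches(source_row, target_row):
    """Yield the index of every column where the two rows differ."""
    row_length = min(len(source_row), len(target_row))
    for i in range(row_length):
        if source_row[i] != target_row[i]:
            yield i  # keep scanning after a mismatch instead of returning


def compare_rows(source_rows, target_rows, column_names):
    """Return one message per mismatching field, pairing rows by position."""
    messages = []
    counter = 1  # row numbers start at 1, so the header row is row 1
    for source_row, target_row in zip(source_rows, target_rows):
        for i in row_mismatches(source_row, target_row):
            messages.append(
                "Mismatch in column %s on row number %d , source value %s, target value %s"
                % (column_names[i], counter, source_row[i], target_row[i])
            )
        counter += 1
    return messages


if __name__ == "__main__":
    # First rows of the two sample files from the question, already split into fields.
    source = [["id", "name", "amount"], ["1", "bob", "20"], ["3", "eva", "8"], ["3", "sarah", "7"]]
    target = [["id", "name", "amount"], ["1", "bob", "23"], ["3", "sarah", "7"], ["4", "jeff", "19"]]
    for line in compare_rows(source, target, ["id", "name", "amount"]):
        print(line)
```

Because row_mismatches is a generator, the inner for loop receives every differing column index in turn, which is what produces several messages for a single row.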
Another small suggestion for cleaning up the code: you can use enumerate to avoid updating the counter variable by hand:
for counter,(source_row,target_row) in enumerate(itertools.izip(source_data,target_data), start=1):
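A minimal sketch of the enumerate suggestion, assuming Python 3 (so the built-in zip stands in for itertools.izip). The helper name numbered_row_pairs is hypothetical, and io.StringIO objects are used here only as in-memory stand-ins for the two opened CSV files.

```python
import csv
import io


def numbered_row_pairs(source_file, target_file):
    """Yield (row_number, source_row, target_row), counting rows from 1."""
    pairs = zip(csv.reader(source_file), csv.reader(target_file))
    # enumerate supplies the counter, so no manual "counter += 1" is needed
    for counter, (source_row, target_row) in enumerate(pairs, start=1):
        yield counter, source_row, target_row


if __name__ == "__main__":
    src = io.StringIO("id,name,amount\n1,bob,20\n")
    tgt = io.StringIO("id,name,amount\n1,bob,23\n")
    for counter, source_row, target_row in numbered_row_pairs(src, tgt):
        print(counter, source_row, target_row)
```

With start=1 the header row is numbered 1 and the first data row 2, matching the row numbers in the output shown in the question.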
This question and answer, about comparing two CSV files field by field in Python and listing all mismatches, come from Stack Overflow: https://stackoverflow.com/questions/29167815/