问题描述
我有两个CSV文件,每个文件有10列,其中第一列称为主键".
I have two CSV files with 10 columns each where the first column is called the "Primary Key".
我需要使用Python查找两个CSV文件之间的公共区域.例如,我应该能够检测到CSV1中的第27-45行等于CSV2中的第125-145行,依此类推.
I need to use Python to find the common region between the two CSV files. For example, I should be able to detect that rows 27-45 in CSV1 is equal to rows 125-145 in CSV2 and so on.
我只比较主键(第一列).其余数据不考虑进行比较.我需要将这些公共区域提取到两个单独的CSV文件中(一个用于CSV1,一个用于CSV2).
I am only comparing the Primary Key (Column One). The rest of the data is not considered for comparison. I need to extract these common regions in two separate CSV files (one for CSV1 and one for CSV2).
我已经解析了两个CSV文件的行并将其存储在两个列表列表"(lstCAN_LOG_TABLE
和lstSHADOW_LOG_TABLE
)中,因此该问题可以减少到比较这两个列表列表中.
I have already parsed and stored the rows of the two CSV files in two 'list of lists', lstCAN_LOG_TABLE
and lstSHADOW_LOG_TABLE
, so the problem reduces down to comparing these two list of lists.
我目前假设的是,如果以后有10个匹配项(MAX_COMMON_THRESHOLD
),则说明我已经到达一个公共区域的开头.我不能记录单行(与true相比),因为会有相等的区域(按主键)和需要标识的区域.
I am currently assuming is that if there are 10 subsequent matches (MAX_COMMON_THRESHOLD
), I have reached the beginning of a common region. I must not log single rows (comparing to true) because there would be regions equal (As per primary key) and those regions I need to identify.
for index in range(len(lstCAN_LOG_TABLE)):
for l_index in range(len(lstSHADOW_LOG_TABLE)):
if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]): #Consider for comparison only CAN IDs
index_can_log = index #Position where CAN Log is to be compared
index_shadow_log = l_index #Position from where CAN Shadow Log is to be considered
start = index_shadow_log
if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
end = index_shadow_log + MAX_COMMON_THRESHOLD
else:
end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
can_index = index
bPreScreened = 1
for num in range(start,end):
if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
if((can_index + 1) < (input_file_one_row_count-1)):
can_index = can_index + 1
else:
break
else:
bPreScreened = 0
print("No Match")
break
#we might have found start of common region
if(bPreScreened == 1):
print("Start={0} End={1} can_index={2}".format(start,end,can_index))
for number in range(start,end):
if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):
writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
writer_one.writerow(lstCAN_LOG_TABLE[index][0])
if((index + 1) < (input_file_one_row_count-1)):
index = index + 1
else:
dump_file.close()
print("\nCommon Region in Two CSVs identifed and recorded\n")
return
dump_file.close()
print("\nCommon Region in Two CSVs identifed and recorded\n")
我得到奇怪的输出.即使第一个CSV文件也只有1880行,但是在第一个CSV的公共区域CSV中,我得到了更多的条目.我没有得到想要的输出.
I am getting strange output. Even the first CSV file has only 1880 Rows but in the common region CSV for the first CSV I am getting many more entries. I am not getting desired output.
在此处编辑
CSV1:
216 0.000238225 F4 41 C0 FB 28 0 0 0 MS CAN
109 0.0002256 15 8B 31 0 8 43 58 0 HS CAN
216 0.000238025 FB 47 C6 1 28 0 0 0 MS CAN
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
CSV2:
216 0.0002361 0F 5C DB 14 28 0 0 0 MS CAN
216 0.000236225 16 63 E2 1B 28 0 0 0 MS CAN
109 0.0001412 16 A3 31 0 8 63 58 0 HS CAN
216 0.000234075 1C 6A E9 22 28 0 0 0 MS CAN
40A 0.000259925 C1 1 46 54 30 44 47 36 HS CAN
4A 0.000565975 2 0 0 0 0 0 0 C0 MS CAN
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
预期输出CSV1:
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
预期输出CSV2:
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
观察到的输出CSV1
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
以及数千个冗余行数据
已编辑-已按建议解决(更改为白色):
学习: 在Python FOR中无法在运行时更改循环索引
dump_file=open("MATCH_PATTERN.txt",'w+')
print("Number of Entries CAN LOG={0}".format(len(lstCAN_LOG_TABLE)))
print("Number of Entries SHADOW LOG={0}".format(len(lstSHADOW_LOG_TABLE)))
index = 0
while(index < (input_file_one_row_count - 1)):
l_index = 0
while(l_index < (input_file_two_row_count - 1)):
if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]): #Consider for comparison only CAN IDs
index_can_log = index #Position where CAN Log is to be compared
index_shadow_log = l_index #Position from where CAN Shadow Log is to be considered
start = index_shadow_log
can_index = index
if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
end = index_shadow_log + MAX_COMMON_THRESHOLD
else:
end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
bPreScreened = 1
for num in range(start,end):
if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
if((can_index + 1) < (input_file_one_row_count-1)):
can_index = can_index + 1
else:
break
else:
bPreScreened = 0
break
#we might have found start of common region
if(bPreScreened == 1):
print("Shadow Start={0} Shadow End={1} CAN INDEX={2}".format(start,end,index))
for number in range(start,end):
if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):
writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
writer_one.writerow(lstCAN_LOG_TABLE[index][0])
if((index + 1) < (input_file_one_row_count-1)):
index = index + 1
if((l_index + 1) < (input_file_two_row_count-1)):
l_index = l_index + 1
else:
dump_file.close()
print("\nCommon Region in Two CSVs identifed and recorded\n")
return
else:
l_index = l_index + 1
else:
l_index = l_index + 1
index = index + 1
dump_file.close()
print("\nCommon Region in Two CSVs identifed and recorded\n")
推荐答案
index
是for
循环中的迭代器.如果您在循环中更改了它,它将在每次迭代后重新分配.
index
is the iterator in your for
loop. Should you changed it inside the loop, it will be reassigned after each iteration.
说,在for
循环中的index = 5
和index += 1
被执行3次时.现在index = 8
.但是,在此迭代结束之后,当您的代码返回到for
时,索引将被分配给index x = 6
.
Say, when index = 5
in your for
loop, and index += 1
was execute 3 times. Now index = 8
. But after this iteration ends, when your code go back to for
, index will be assigned to index x = 6
.
尝试以下示例:
for index in range(0,5):
print 'iterator:', index
index = index + 2
print 'index:', index
输出将是:
iterator: 0
index: 2
iterator: 1
index: 3
iterator: 2
index: 4
iterator: 3
index: 5
iterator: 4
index: 6
要解决此问题,您可能需要将for
循环更改为while
循环
To fix this issue, you might want to change your for
loop to while
loop
如果我不理解错误,则说明您试图在两个文件中查找相同"列并将其存储.如果是这种情况,实际上可以通过使用以下代码轻松完成您的工作:
If I didn't understand wrong, you were trying to find 'same' columns in two files and store them.If this is the case, actually your work can be done easily by using following code:
import csv # import csv module to read csv files
file1 = 'csv1.csv' # input file 1
file2 = 'csv2.csv' # input file 2
outfile = 'csv3.csv' # only have one output file since two output files will be the same
read1 = csv.reader(open(file1, 'r')) # read input file 1
write = csv.writer(open(outfile, 'w')) # write to output file
# for each row in input file 1, compare it with each row in input file 2
# if they are the same, write that row into output file
for row1 in read1:
read2 = csv.reader(open(file2, 'r'))
for row2 in read2:
if row1 == row2:
write.writerow(row1)
read1.close()
write.close()
这篇关于在PYTHON中的两个CSV文件中查找公共区域的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!