Finding common regions in two CSV files in Python

Problem description

I have two CSV files with 10 columns each, where the first column is called the "Primary Key".

I need to use Python to find the common regions between the two CSV files. For example, I should be able to detect that rows 27-45 in CSV1 are equal to rows 125-145 in CSV2, and so on.

I am only comparing the Primary Key (column one). The rest of the data is not considered for comparison. I need to extract these common regions into two separate CSV files (one for CSV1 and one for CSV2).
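
The extraction step described above could look roughly like the sketch below. It assumes the common regions are already known as (start_one, start_two, length) tuples; that tuple layout and the name write_regions are hypothetical and do not come from the question. Rows are written whole, as lists of fields, using the standard csv module.

```python
import csv
import io

def write_regions(rows_one, rows_two, regions, out_one, out_two):
    """Copy the rows of each common region into two separate CSV outputs.

    `regions` is a hypothetical list of (start_one, start_two, length) tuples.
    """
    writer_one = csv.writer(out_one, lineterminator="\n")
    writer_two = csv.writer(out_two, lineterminator="\n")
    for start_one, start_two, length in regions:
        # Each region is a contiguous slice of the parsed rows.
        writer_one.writerows(rows_one[start_one:start_one + length])
        writer_two.writerows(rows_two[start_two:start_two + length])

# Small in-memory demonstration; real code would pass open file objects.
rows_one = [["216", "a"], ["340", "b"], ["216", "c"]]
rows_two = [["4A", "x"], ["340", "y"], ["216", "z"]]
demo_one, demo_two = io.StringIO(), io.StringIO()
write_regions(rows_one, rows_two, [(1, 1, 2)], demo_one, demo_two)
```

With real files, out_one and out_two would be opened with newline="" as the csv documentation recommends.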

I have already parsed the rows of the two CSV files and stored them in two lists of lists, lstCAN_LOG_TABLE and lstSHADOW_LOG_TABLE, so the problem reduces to comparing these two lists of lists.
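
For reference, a minimal sketch of how such lists of lists might be built and reduced to the key column. The helper names load_rows and key_column, the comma delimiter, and key_index=0 are assumptions for illustration; the code in the question compares element [1] of each row, so the index has to match however the rows were actually stored.

```python
import csv
import io

def load_rows(stream, delimiter=","):
    """Parse a CSV stream into a list of lists, skipping blank lines."""
    return [row for row in csv.reader(stream, delimiter=delimiter) if row]

def key_column(rows, key_index=0):
    """Project each row onto its primary key (assumed position: key_index)."""
    return [row[key_index] for row in rows]

# In-memory sample standing in for a real file.
sample = io.StringIO("216,0.000238225,F4\n\n109,0.0002256,15\n")
lst_rows = load_rows(sample)
lst_keys = key_column(lst_rows)
```

Working on the flat key lists keeps the comparison logic separate from the row data that is later written out.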

My current assumption is that if there are 10 consecutive matches (MAX_COMMON_THRESHOLD), I have reached the beginning of a common region. I must not log individual rows whose comparison is true, because what I need to identify are whole regions that are equal by primary key.
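
One way to express the threshold idea is sketched below. This is not the approach used in the question's code: it swaps in a dynamic-programming scan for common contiguous runs and keeps only runs of at least min_len keys (playing the role of MAX_COMMON_THRESHOLD). The name common_runs and its return format are hypothetical. It needs time proportional to len(keys_a) * len(keys_b), which should be acceptable for a few thousand rows.

```python
def common_runs(keys_a, keys_b, min_len):
    """Return (start_a, start_b, length) for each maximal run of equal keys.

    A run is a stretch where keys_a[start_a + k] == keys_b[start_b + k]
    for k in range(length); only runs with length >= min_len are reported.
    """
    runs = []
    prev = [0] * (len(keys_b) + 1)  # run lengths ending at the previous row of keys_a
    for i, key_a in enumerate(keys_a):
        curr = [0] * (len(keys_b) + 1)
        for j, key_b in enumerate(keys_b):
            if key_a != key_b:
                continue
            curr[j + 1] = prev[j] + 1  # extend the run along the diagonal
            at_end = i + 1 == len(keys_a) or j + 1 == len(keys_b)
            if at_end or keys_a[i + 1] != keys_b[j + 1]:
                length = curr[j + 1]  # the run cannot be extended further
                if length >= min_len:
                    runs.append((i - length + 1, j - length + 1, length))
        prev = curr
    return runs

# Primary keys taken from the CSV1 and CSV2 samples in this question.
keys_one = ["216", "109", "216", "340", "216", "216", "216"]
keys_two = ["216", "216", "109", "216", "40A", "4A", "340", "216", "216", "216"]
regions = common_runs(keys_one, keys_two, 4)
```

On the sample keys with a threshold of 4, the only run found starts at index 3 in the first list and index 6 in the second and is 4 keys long, which corresponds to the expected output in the question. Overlapping runs on different diagonals are all reported, so repeated keys may need extra filtering.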

for index in range(len(lstCAN_LOG_TABLE)):
    for l_index in range(len(lstSHADOW_LOG_TABLE)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]):  #Consider for comparison only CAN IDs
            index_can_log = index                                           #Position where CAN Log is to be compared
            index_shadow_log = l_index                                      #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
            can_index = index
            bPreScreened = 1
            for num in range(start,end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1
                    else:
                        break
                else:
                    bPreScreened = 0
                    print("No Match")
                    break
            #we might have found start of common region
            if(bPreScreened == 1):
                print("Start={0} End={1} can_index={2}".format(start,end,can_index))
                for number in range(start,end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1
                        else:
                            dump_file.close()
                            print("\nCommon Region in Two CSVs identified and recorded\n")
                            return
dump_file.close()
print("\nCommon Region in Two CSVs identified and recorded\n")

I am getting strange output. The first CSV file has only 1880 rows, yet the common-region CSV generated for it contains many more entries than that. This is not the output I want.

Edit:

CSV1:

216 0.000238225 F4  41  C0  FB  28  0   0   0   MS CAN
109 0.0002256   15  8B  31  0   8   43  58  0   HS CAN
216 0.000238025 FB  47  C6  1   28  0   0   0   MS CAN
340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

CSV2:

216 0.0002361   0F  5C  DB  14  28  0   0   0   MS CAN
216 0.000236225 16  63  E2  1B  28  0   0   0   MS CAN
109 0.0001412   16  A3  31  0   8   63  58  0   HS CAN
216 0.000234075 1C  6A  E9  22  28  0   0   0   MS CAN
40A 0.000259925 C1  1   46  54  30  44  47  36  HS CAN
4A  0.000565975 2   0   0   0   0   0   0   C0  MS CAN
340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

Expected output CSV1:

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

Expected output CSV2:

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

Observed output CSV1:

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

followed by thousands of redundant rows.

Edited: solved as suggested (changed the for loops to while loops):

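Lesson learned: in a Python for loop, the loop index cannot be changed at run time; any value assigned to it inside the body is overwritten at the next iteration.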
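
A small illustration of that lesson, independent of the CSV code. The function names are made up for the example. In the for version the increment is discarded because range() supplies the next value regardless; in the while version the increment persists.

```python
def visited_with_for(n):
    """Try to skip ahead inside a for loop; the skip has no effect."""
    visited = []
    for index in range(n):
        visited.append(index)
        index += 2  # rebinding only lasts until the next iteration
    return visited

def visited_with_while(n):
    """Skip ahead inside a while loop; the skip is kept."""
    visited = []
    index = 0
    while index < n:
        visited.append(index)
        index += 2  # the loop condition sees the updated value
    return visited

for_result = visited_with_for(6)      # [0, 1, 2, 3, 4, 5]
while_result = visited_with_while(6)  # [0, 2, 4]
```

This is why advancing index inside the inner loops of the original for-based code did not stop the same rows from being visited and written again.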

dump_file=open("MATCH_PATTERN.txt",'w+')
print("Number of Entries CAN LOG={0}".format(len(lstCAN_LOG_TABLE)))
print("Number of Entries SHADOW LOG={0}".format(len(lstSHADOW_LOG_TABLE)))
index = 0
while(index < (input_file_one_row_count - 1)):
    l_index = 0
    while(l_index < (input_file_two_row_count - 1)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]):  #Consider for comparison only CAN IDs
            index_can_log = index                                           #Position where CAN Log is to be compared
            index_shadow_log = l_index                                      #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            can_index = index
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
            bPreScreened = 1
            for num in range(start,end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1
                    else:
                        break
                else:
                    bPreScreened = 0
                    break
            #we might have found start of common region
            if(bPreScreened == 1):
                print("Shadow Start={0} Shadow End={1} CAN INDEX={2}".format(start,end,index))
                for number in range(start,end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1
                        if((l_index + 1) < (input_file_two_row_count-1)):
                            l_index = l_index + 1
                        else:
                            dump_file.close()
                            print("\nCommon Region in Two CSVs identified and recorded\n")
                            return
            else:
                l_index = l_index + 1
        else:
            l_index = l_index + 1
    index = index + 1
dump_file.close()
print("\nCommon Region in Two CSVs identified and recorded\n")

Recommended answer