I have a file from which I need to remove the duplicate pairs (marked in bold).

Input file:

AT1G01010 = 0005634
**AT1G01010 = 0006355**
AT1G01010 = 0003677
AT1G01010 = 0007275
**AT1G01010 = 0006355
AT1G01010 = 0006355**
AT1G01010 = 0006888
**AT1G01020 = 0016125**
AT1G01020 = 0016020
**AT1G01020 = 0005739**
**AT1G01020 = 0016125**
AT1G01020 = 0003674
AT1G01020 = 0005783
**AT1G01020 = 0005739**
**AT1G01020 = 0006665
AT1G01020 = 0006665**


Expected output:

AT1G01010 = 0005634
AT1G01010 = 0006355
AT1G01010 = 0003677
AT1G01010 = 0007275
AT1G01010 = 0006888
AT1G01020 = 0016125
AT1G01020 = 0016020
AT1G01020 = 0005739
AT1G01020 = 0003674
AT1G01020 = 0005783
AT1G01020 = 0006665


To remove the duplicates, I first built a dictionary. After creating the dictionary, I tried the following code:

import sys

ara_go_file = open (sys.argv[1]).readlines()

ara_id_list = []
ara_go_list  = []


for lines in ara_go_file:
    split_lines = lines.split(' ')
    ara_id      = split_lines[0]
    ara_id_list.append(ara_id)

    go_id_split = split_lines[-1]
    go_id       = go_id_split.split('\n')[0]
    ara_go_list.append(go_id)

ara_id_go_dic = dict (zip(ara_id_list, ara_go_list))  ##ara_id_go_dic  (this is the name of the dict I have created)

new_dict = {}  # made a new dict to copy the data into this n remove the duplicate pairs

for k in ara_id_go_dic.items():
    if k[0] in new_dict:
        if k[1] not in new_dict[k[0]]:
            new_dict[k[0]].append(k[1])
        else:
            new_dict[k[0]]=[k[1]]

print new_dict


I don't know exactly where I am going wrong.

Please point out my mistake, or let me know if there is another way to remove the duplicate pairs.

Best answer

You can use a set to remove the duplicate elements (note that a set does not keep the original line order):

>>> s="""AT1G01010 = 0005634
... AT1G01010 = 0006355
... AT1G01010 = 0003677
... AT1G01010 = 0007275
... AT1G01010 = 0006355
... AT1G01010 = 0006355
... AT1G01010 = 0006888
... AT1G01020 = 0016125
... AT1G01020 = 0016020
... AT1G01020 = 0005739
... AT1G01020 = 0016125
... AT1G01020 = 0003674
... AT1G01020 = 0005783
... AT1G01020 = 0005739
... AT1G01020 = 0006665
... AT1G01020 = 0006665"""
>>> for j in set(s.split('\n')):
...     print(j)
...
AT1G01010 = 0005634
AT1G01020 = 0016020
AT1G01010 = 0007275
AT1G01010 = 0006355
AT1G01020 = 0006665
AT1G01010 = 0003677
AT1G01020 = 0005783
AT1G01020 = 0016125
AT1G01020 = 0005739
AT1G01020 = 0003674
AT1G01010 = 0006888
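
As for the code in the question, two things appear to go wrong. `dict(zip(ara_id_list, ara_go_list))` keeps only the last GO ID for each gene ID, because dictionary keys are unique, so most pairs are lost before the loop starts. In the loop, the `else` is paired with the inner `if` instead of the outer one, so `new_dict` is never populated. If the original order matters, a rough sketch along these lines may help. The function names `dedupe_lines` and `group_pairs` are made up for illustration, and the code assumes every non-blank line has the form `ID = GO`:

```python
from collections import OrderedDict


def dedupe_lines(lines):
    """Return the lines with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line or line in seen:
            continue  # skip blank lines and pairs that were already kept
        seen.add(line)
        unique.append(line)
    return unique


def group_pairs(lines):
    """Map each gene ID to its list of distinct GO IDs, in first-seen order.

    Assumption: each line contains exactly one '=' separating ID and GO term.
    """
    grouped = OrderedDict()
    for line in dedupe_lines(lines):
        gene_id, go_id = [part.strip() for part in line.split('=')]
        grouped.setdefault(gene_id, []).append(go_id)
    return grouped


# Small hypothetical sample taken from the input shown in the question.
sample = """AT1G01010 = 0005634
AT1G01010 = 0006355
AT1G01010 = 0006355
AT1G01020 = 0016125
AT1G01020 = 0005739
AT1G01020 = 0016125""".split('\n')

for line in dedupe_lines(sample):
    print(line)
```

For a real file, something like `dedupe_lines(open(sys.argv[1]))` should work the same way, since the function only iterates over the lines it is given. `group_pairs` is closer to what the question was attempting: one entry per gene ID, holding a list of its distinct GO IDs.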
