我有一个类似于以下内容的输入文件:

RefID|FirstName|MiddleName|LastName|SSN|DOB|School Year|Age|District LEA|District Description|School LEA|Location Description|title|frng_amt
1|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|P|014
2|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|S|13100
3|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|P|014
4|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|P|014
5|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|S|15000
6|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|S|13100
7|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|S|13100
8|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|P|014
9|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|S|13100
11|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|P|014
12|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|P|014
13|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|A|13100
14|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|S|15000
15|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|A|13100
16|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|P|014


我想执行数据匹配,并在输出中,我想根据他们的SSN为Julie分配一个唯一的ID,为Shirley分配一个另一个ID。所以我的想法输出将是:

ID|RefID|FirstName|MiddleName|LastName|SSN|DOB|School Year|Age|District LEA|District Description|School LEA|Location Description|title|frng_amt
10001|1|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|2|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|S|13100
10001|3|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|4|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|5|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|S|15000
10001|6|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|7|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|8|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|9|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|10|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|11|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|12|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|13|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|A|13100
10002|14|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|S|15000
10002|15|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|A|13100
10002|16|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|P|014


如何在Python中完成此操作?我正在尝试通过创建输入文件的副本进行if循环,但是我觉得这是实现这一目标的效率非常低下且错误的方法。有人可以帮我找出办法吗?

我现在的代码:

inputReader = open(inputFile,'r')
inputReaderCopy = open(inputFile, 'r')
outputWriter = open(outputFile, 'w')
counter = 100000
headers = inputReader.readline()
for x in inputReader:
    for y in inputReaderCopy:
        if x.split("|")[4] == y.split("|")[4]:
            outputWriter.write(str(counter) + "|" +y)
            counter+=1
        else:
            outputWriter.write("no match|"+ y)

最佳答案

只需使用将唯一ID映射到每个SSN的字典来保存所看到的SSN记录,您只需要对这些行进行一次传递,并使用csv module来解析文件即可为您进行拆分。如果您想要一个全新的文件:

import csv

cn = 10001

with open("test.txt") as f, open("out.txt","w") as tmp:
    r, wr = csv.reader(f, delimiter="|"), csv.writer(tmp, delimiter="|")
    head, d = next(r), {}
    wr.writerow(["ID"]+head)
    for row in r:
        v = row[4]
        # if we have already seen the SSN, use the id assigned
        if v in d:
            wr.writerow([d[v]] + row)
        else:
            # else create new id, add pairing to dict and write
            d[v] = cn
            wr.writerow([cn] + row)
            cn += 1


输出:

ID|RefID|FirstName|MiddleName|LastName|SSN|DOB|School Year|Age|District LEA|District Description|School LEA|Location Description|title|frng_amt
10001|1|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|2|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|S|13100
10001|3|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|4|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|5|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|S|15000
10001|6|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|7|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|8|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|9|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|10|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|11|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|12|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|13|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|A|13100
10002|14|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|S|15000
10002|15|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|A|13100
10002|16|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|P|014


如果要更新原始文件,可以写入临时文件并执行shutil.move

import csv
from shutil import move
from tempfile import NamedTemporaryFile
import os

cn = 100001
try:
    with open("test.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as tmp:
        r, wr = csv.reader(f, delimiter="|"), csv.writer(tmp, delimiter="|")
        head, d = next(r), {}
        wr.writerow(["ID"] + head)
        for row in r:
            v = row[4]
            if v in d:
                wr.writerow([d[v]] + row)
            else:
                d[v] = cn
                wr.writerow([cn] + row)
                cn += 1
    # replace original file
    move(tmp.name, "test.txt"))
finally:
    if os.path.isfile(tmp.name):
        os.unlink(tmp.name)


如果实际上像输入一样对数据进行排序,则可以groupby

import csv
from itertools import groupby
from operator import itemgetter

cn = 10001
with open("test.txt") as f, open("out.txt", "w") as tmp:
    r, wr = csv.reader(f, delimiter="|"), csv.writer(tmp, delimiter="|")
    head, d = next(r), {}
    wr.writerow(["ID"] + head)
    for k, v in groupby(r, key=itemgetter(4)):
        wr.writerows([cn]+sub for sub in v)
        cn += 1

关于python - 将属性彼此匹配,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/34189294/

10-11 04:13
查看更多