    ^
    School\s*=\s*(?P<school_name>.+)
    (?P<school_content>[\s\S]+?)
    (?=^School|\Z)

Grade part (another demo on regex101.com):

    ^
    Grade\s*=\s*(?P<grade>.+)
    (?P<students>[\s\S]+?)
    (?=^Grade|\Z)

Student/score part (last demo on regex101.com):

    ^
    Student\ number,\ Name[\n\r]
    (?P<student_names>(?:^\d+.+[\n\r])+)
    \s*
    ^
    Student\ number,\ Score[\n\r]
    (?P<student_scores>(?:^\d+.+[\n\r])+)

The patterns are written in verbose mode, so whitespace inside them is ignored unless it is escaped (hence "\ " for a literal space). Each outer pattern captures its block lazily up to the next School or Grade header, or up to the end of the string (\Z).

The rest is a generator expression which is then fed into the DataFrame constructor (along with the column names).

The code:

    import re
    import pandas as pd

    # `string` holds the raw text to parse.

    rx_school = re.compile(r'''
        ^
        School\s*=\s*(?P<school_name>.+)
        (?P<school_content>[\s\S]+?)
        (?=^School|\Z)
    ''', re.MULTILINE | re.VERBOSE)

    rx_grade = re.compile(r'''
        ^
        Grade\s*=\s*(?P<grade>.+)
        (?P<students>[\s\S]+?)
        (?=^Grade|\Z)
    ''', re.MULTILINE | re.VERBOSE)

    rx_student_score = re.compile(r'''
        ^
        Student\ number,\ Name[\n\r]
        (?P<student_names>(?:^\d+.+[\n\r])+)
        \s*
        ^
        Student\ number,\ Score[\n\r]
        (?P<student_scores>(?:^\d+.+[\n\r])+)
    ''', re.MULTILINE | re.VERBOSE)

    # Walk schools -> grades -> student blocks, pairing each name line
    # with the score line at the same position.
    result = ((school.group('school_name'), grade.group('grade'), student_number, name, score)
        for school in rx_school.finditer(string)
        for grade in rx_grade.finditer(school.group('school_content'))
        for student_score in rx_student_score.finditer(grade.group('students'))
        # [:-1] drops the trailing newline before splitting into lines
        for student in zip(student_score.group('student_names')[:-1].split("\n"),
                           student_score.group('student_scores')[:-1].split("\n"))
        for student_number in [student[0].split(", ")[0]]
        for name in [student[0].split(", ")[1]]
        for score in [student[1].split(", ")[1]]
    )

    df = pd.DataFrame(result, columns = ['School', 'Grade', 'Student number', 'Name', 'Score'])
    print(df)

Condensed:

    rx_school = re.compile(r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)', re.MULTILINE)
    rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)', re.MULTILINE)
    rx_student_score = re.compile(r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)', re.MULTILINE)

This yields

                School Grade Student number      Name Score
    0   Riverdale High     1              0    Phoebe     3
    1   Riverdale High     1              1    Rachel     7
    2   Riverdale High     2              0    Angela     6
    3   Riverdale High     2              1   Tristan     3
    4   Riverdale High     2              2    Aurora     9
    5         Hogwarts     1              0     Ginny     8
    6         Hogwarts     1              1      Luna     7
    7         Hogwarts     2              0     Harry     5
    8         Hogwarts     2              1  Hermione    10
    9         Hogwarts     3              0      Fred     0
    10        Hogwarts     3              1    George     0

As for timing, this is the result of running it ten thousand times (makedf is not defined in the answer; presumably it is a function wrapping the code above):

    import timeit
    print(timeit.timeit(makedf, number=10**4))
    # 11.918397722000009 s
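To try the approach without pandas, here is a minimal, self-contained sketch. It uses the three condensed patterns from the answer, with the regex escapes (\s, \S, \d, \Z, \n, \r) written out. The sample text is an assumption: its layout is inferred from the patterns and the output table, not copied from the question. The names SAMPLE, RX_SCHOOL, RX_GRADE, RX_STUDENT_SCORE and parse_rows are hypothetical, and splitlines() stands in for the answer's [:-1].split("\n").

```python
import re

# Hypothetical sample input; the layout is inferred from the patterns,
# not taken from the original question.
SAMPLE = """\
School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

Grade = 2
Student number, Name
0, Angela

Student number, Score
0, 6

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna

Student number, Score
0, 8
1, 7
"""

# Condensed (non-verbose) patterns, so literal spaces need no escaping.
RX_SCHOOL = re.compile(
    r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)',
    re.MULTILINE)
RX_GRADE = re.compile(
    r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)',
    re.MULTILINE)
RX_STUDENT_SCORE = re.compile(
    r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*'
    r'^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)',
    re.MULTILINE)


def parse_rows(text):
    """Return (school, grade, student number, name, score) tuples, all strings."""
    rows = []
    for school in RX_SCHOOL.finditer(text):
        for grade in RX_GRADE.finditer(school.group('school_content')):
            for block in RX_STUDENT_SCORE.finditer(grade.group('students')):
                names = block.group('student_names').splitlines()
                scores = block.group('student_scores').splitlines()
                # Pair the n-th name line with the n-th score line.
                for name_line, score_line in zip(names, scores):
                    number, name = name_line.split(', ', 1)
                    _, score = score_line.split(', ', 1)
                    rows.append((school.group('school_name'),
                                 grade.group('grade'), number, name, score))
    return rows
```

The list returned by parse_rows(SAMPLE) can be passed to pd.DataFrame with the same column names as in the answer. Note that lines are paired by position, as in the original generator, so the sketch assumes the name and score lists appear in the same order.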