Here is the code:

import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    reader = csv.reader(csv_f)
    city_lst = cities.readlines()

    for row in reader:
        for city in city_lst:
            city = city.strip()
            match = re.search((r'\b{0}\b').format(city), row[0])
            if match:
                writer.writerow(row)
                break


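"alcohol_rehab_ltp.csv" has 145 rows and "cities2.txt" has 18,895 lines (so the list built from it has 18,895 elements). The script takes a while to run. I haven't timed it, but it is probably around 5 minutes. Is there something simple (or more complex) that I am overlooking that would make this script run faster? I will be running other .csv files against the large "cities.txt" text file, and those csv files could have as many as 1000 rows. Any ideas on how to speed this up would be greatly appreciated!

Here is the csv file:

Keyword (144),Avg. CPC,Local Searches,Advertiser Competition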
[alcohol rehab san diego],$49.54,90,High
[alcohol rehab dallas],$86.48,110,High
[alcohol rehab atlanta],$60.93,50,High
[free alcohol rehab centers],$11.88,110,High
[christian alcohol rehab centers],–,70,High
[alcohol rehab las vegas],$33.40,70,High
[alcohol rehab cost],$57.37,110,High


A few lines from the text file:

san diego
dallas
atlanta
dallas
los angeles
denver

Best answer

Build a single regex from all the city names:

city_re = re.compile(r'\b('+ '|'.join(c.strip() for c in cities.readlines()) + r')\b')


Then do:

for row in reader:
    match = city_re.search(row[0])
    if match:
        writer.writerow(row)


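This lets the regex engine do its best at string prefix matching across the 18,895 city names, and it reduces the number of loop iterations from 145 x 18,895 to just 145: one search per csv row instead of one search per row and city.

For your convenience and testing, here is the full listing: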
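The change in the number of searches can be checked at toy scale. The sketch below is not part of the original answer: it uses a handful of the sample keywords and cities from the question, and the helper names `filter_nested` and `filter_combined` are made up for illustration. It only counts regex searches and confirms that both approaches keep the same rows; it says nothing about actual timings.

```python
import re

# Sample data copied from the question; the real files are much larger.
keywords = [
    "[alcohol rehab san diego]",
    "[alcohol rehab dallas]",
    "[free alcohol rehab centers]",
    "[alcohol rehab las vegas]",
    "[alcohol rehab cost]",
]
cities = ["san diego", "dallas", "atlanta", "dallas", "los angeles", "denver"]


def filter_nested(keywords, cities):
    """Question's approach: one regex search per (keyword, city) pair."""
    searches = 0
    kept = []
    for kw in keywords:
        for city in cities:
            searches += 1
            if re.search(r"\b{0}\b".format(city), kw):
                kept.append(kw)
                break  # first matching city is enough
    return kept, searches


def filter_combined(keywords, cities):
    """Answer's approach: one search per keyword against one alternation."""
    city_re = re.compile(r"\b(" + "|".join(cities) + r")\b")
    searches = 0
    kept = []
    for kw in keywords:
        searches += 1
        if city_re.search(kw):
            kept.append(kw)
    return kept, searches


nested_kept, nested_searches = filter_nested(keywords, cities)
combined_kept, combined_searches = filter_combined(keywords, cities)

print(nested_kept == combined_kept)          # both approaches keep the same rows
print(nested_searches, combined_searches)    # searches issued by each approach
```

Note that "las vegas" is not in the sample city list above, so that keyword is dropped by both versions.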

import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    reader = csv.reader(csv_f)

    city_re = re.compile(r'\b('+ '|'.join(c.strip() for c in cities.readlines()) + r')\b')

    for row in reader:
        match = city_re.search(row[0])
        if match:
            writer.writerow(row)
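
The listing above is written for Python 2 (binary file modes for `csv`). As a rough sketch only, here is how the same idea might look on Python 3, with a few precautions that are my additions rather than part of the original answer: city names are passed through `re.escape` so that characters such as `.` or `(` are matched literally, blank lines are skipped so the alternation never contains an empty branch, and duplicates are removed. The function names `build_city_re` and `filter_rows` are hypothetical, and the demo uses in-memory `io.StringIO` objects in place of the real files.

```python
import csv
import io
import re


def build_city_re(city_lines):
    """Compile one alternation from an iterable of city-name lines."""
    # Strip whitespace, drop blank lines, and de-duplicate (assumed cleanup,
    # not in the original answer).
    names = {line.strip() for line in city_lines if line.strip()}
    if not names:
        raise ValueError("no city names given")
    # re.escape keeps characters like '.' from acting as regex operators.
    alternation = "|".join(re.escape(name) for name in sorted(names))
    return re.compile(r"\b(" + alternation + r")\b")


def filter_rows(csv_in, city_lines, csv_out):
    """Copy rows whose first column mentions a known city; return the count."""
    city_re = build_city_re(city_lines)
    reader = csv.reader(csv_in)
    writer = csv.writer(csv_out, delimiter=",")
    kept = 0
    for row in reader:
        if row and city_re.search(row[0]):
            writer.writerow(row)
            kept += 1
    return kept


# In-memory stand-ins for the files, using sample lines from the question.
csv_in = io.StringIO(
    "[alcohol rehab san diego],$49.54,90,High\n"
    "[free alcohol rehab centers],$11.88,110,High\n"
    "[alcohol rehab atlanta],$60.93,50,High\n"
)
city_lines = io.StringIO("san diego\ndallas\natlanta\ndallas\n\n")
csv_out = io.StringIO()

kept = filter_rows(csv_in, city_lines, csv_out)
print(kept)
print(csv_out.getvalue())
```

With real files on Python 3, the three arguments would be file objects opened in text mode with `newline=''` (for example `open('alcohol_rehab_ltp.csv', newline='')`), which is what the `csv` module documentation recommends in place of `'rb'` and `'wb'`.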

Regarding "python - a faster alternative to a double for loop when iterating over a large list (18,895 elements)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28737987/
