



This is my second day working in Python. I worked on this in C++ for a while, but decided to try Python. My program works as expected. However, when I process one file at a time without the glob loop, it takes about half an hour per file. When I include the glob, the loop takes about 12 hours to process 8 files.


My question is this: is there anything in my program that is definitely slowing it down? Is there anything I should be doing to make it faster?


I have a folder of large files. For example


file1.txt (6 GB)
file2.txt (5.5 GB)
file3.txt (6 GB)


If it helps, each line of data begins with a character that tells me how the rest of the characters are formatted, which is why I have all of the if elif statements. A line of data would look like this: T35201M352RZNGA AC


I am trying to read each file, do some parsing using splits, and then save the file.


The computer has 32 GB of RAM, so my method is to read each file into RAM, loop through it, and then save the result, clearing RAM for the next file.


I've included the file so you can see the methods that I am using. I use an if elif statement that uses about 10 different elif commands. I have tried a dictionary, but I couldn't figure that out to save my life.


Any answers would be helpful.

import csv
import glob

for filename in glob.glob("/media/3tb/5may/*.txt"):
    f = open(filename,'r')
    c = csv.writer(open(filename + '.csv','wb'))

    for line in f.readlines():
       #print line
        variable = line[0:1]

        if variable is 'T':
           second = line[1:6]
           second = second

        if variable is 'R':
           ticker = line[1:7]
           marketCategory = line[7:8]
        elif variable is ...
        elif variable is ...
        elif ...
        elif ...
        elif ...
        elif ...

        if variable not in ('T', 'M'):
            c.writerow([second,mill,event ....])


UPDATE: Each of the elif statements is nearly identical. The only part that changes is the way I split the line. Here are two of the elif statements (there are 13 in total, and they are almost all identical except for the way the line is split).

elif variable is 'C':
    order = line[1:10]
    Shares = line[10:16]
    match = line[16:25]
    printable = line[25:26]
    price = line[26:36]
elif variable is 'P':
    ticker = line[17:23]
    order = line[1:10]
    buy = line[10:11]
    shares = line[11:17]
    price = line[23:33]
    match = line[33:42]


UPDATE 2: I have run the code using for file in f two different times. The first time, I ran a single file without for filename in glob.glob("/media/3tb/file.txt"):, hard-coding the file path for that one file, and it took about 30 minutes.


I ran it again with for filename in glob.glob("/media/3tb/*file.txt") and it took an hour just for one file in the folder. Does the glob code add that much time?



Replace

for line in f.readlines():

with

for line in f:


The former reads the entire file into a list of lines, then iterates over that list. The latter does it incrementally, which should drastically reduce the total memory allocated and later freed by your program.
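As a rough illustration of the incremental approach, here is a minimal sketch in Python 3 (the question's code looks like Python 2, so the open() modes differ). The function name convert, the variable names, and the choice to write one row per 'R' record are assumptions made for the example; only the 'T' and 'R' slice offsets come from the question, and the other record types are left out.

```python
import csv

def convert(src_path, dst_path):
    """Stream src_path one line at a time and write parsed rows to dst_path.

    Hypothetical helper: only the 'T' and 'R' record types from the
    question are handled, and the output row layout is an assumption.
    """
    second = ''        # most recent value taken from a 'T' record
    rows_written = 0
    with open(src_path, 'r') as src, open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for line in src:           # lazy iteration: no list of all lines is built
            kind = line[0:1]
            if kind == 'T':
                second = line[1:6]
            elif kind == 'R':
                ticker = line[1:7]
                market_category = line[7:8]
                writer.writerow([second, ticker, market_category])
                rows_written += 1
            # the remaining record types would be handled by further branches
    return rows_written
```

It could be called once per input file, for example convert(filename, filename + '.csv') inside the glob loop. The with block also closes both files when each conversion finishes, so handles do not accumulate across iterations.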


