

I have a file with almost 100,000 lines. I want to run a cleaning process on it (lowercasing, removing stopwords, etc.), but it takes a long time.

For example, for 10,000 lines the script needs 15 minutes. For the whole file I expected it to take 150 minutes, but it actually takes 5 hours.


fileinput = open('tweets.txt', 'r')

lines = fileinput.read().lower()  # lowercases the text, but loads the whole file into memory

for line in fileinput:
    lines = line.lower()

Question: Is there a way to read the first 10,000 lines, run the cleaning process on them, then read the next block of lines, and so on?


I would highly suggest operating line by line instead of reading in the entire file all at once (in other words, don't use .read()).

with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.lower()
        # ... do something with line ...
        # (for example, write the line to a new file, or print it)

Iterating over the file object this way automatically takes advantage of Python's built-in buffering.
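
If you really do want to work in blocks of 10,000 lines, as asked in the question, here is a minimal sketch using itertools.islice from the standard library. The names read_in_chunks and clean_line are hypothetical helpers made up for illustration, and the cleaning step is only a guess at what the original script does (lowercase, then drop stopwords). It assumes the stopwords are passed in as a set, since membership tests on a set are much cheaper than on a list.

```python
from itertools import islice


def read_in_chunks(path, chunk_size=10000):
    """Yield lists of up to chunk_size lines from the file at path."""
    with open(path, 'r') as handle:
        while True:
            # islice pulls at most chunk_size lines from the open file
            chunk = list(islice(handle, chunk_size))
            if not chunk:
                break  # end of file reached
            yield chunk


def clean_line(line, stopwords):
    """Lowercase a line and drop any word found in stopwords (a set)."""
    words = line.lower().split()
    return ' '.join(word for word in words if word not in stopwords)
```

To use it, loop with "for chunk in read_in_chunks('tweets.txt'):" and call clean_line on each line of the chunk, writing the results out before moving to the next chunk. Only one block of lines is held in memory at a time, although for plain per-line cleaning the simple loop above should do just as well.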

