Question
I have a file with almost 100,000 lines. I want to run a cleaning process on it (lowercasing, removing stopwords, etc.), but it takes a long time.
For example, for 10,000 lines the script needs 15 minutes. For the whole file I expected it to take 150 minutes, but it actually takes 5 hours.
At the start of the script I use:
fileinput = open('tweets.txt', 'r')
lines = fileinput.read().lower()  # lowercases the text, but loads the whole file at once
for line in fileinput:
    lines = line.lower()
Question: Can I read the first 10,000 lines, run the cleaning process on them, then read the next block of lines, and so on?
Recommended answer
I would highly suggest operating line by line instead of reading in the entire file all at once (in other words, don't use .read()).
with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.lower()
        # ... do something with line ...
        # (for example, write the line to a new file, or print it)
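
If you really do want to work in blocks of 10,000 lines as the question asks, here is a minimal sketch of one way to do it with itertools.islice. The names clean_line, read_batches, clean_file, and the tiny STOPWORDS set are all hypothetical placeholders, not part of the original script; substitute your own cleaning steps and stopword list. Note that batching only limits how much is held in memory at once and will not necessarily make the cleaning itself faster.

```python
from itertools import islice

# Hypothetical stopword list for illustration only; use your real one.
# A set is used so that each membership test is fast.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to"}


def clean_line(line, stopwords=STOPWORDS):
    """Lowercase a line and drop any word found in stopwords."""
    words = line.lower().split()
    return " ".join(word for word in words if word not in stopwords)


def read_batches(lines, batch_size=10000):
    """Yield successive lists of at most batch_size lines from an iterable."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def clean_file(in_path, out_path, batch_size=10000):
    """Clean in_path block by block and write the result to out_path."""
    with open(in_path, "r", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for batch in read_batches(src, batch_size):
            dst.writelines(clean_line(line) + "\n" for line in batch)
```

For example, clean_file('tweets.txt', 'tweets_clean.txt') would process the file 10,000 lines at a time (the output filename is made up here). If the run time still grows faster than the number of lines, the cause is more likely inside the cleaning step, such as looking words up in a long list instead of a set, than in how the file is read.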