Question
I am trying to parse a gigantic log file (around 5 GB).
I only want to parse the first 500,000 lines, and I don't want to read the whole file into memory.
Basically, I want to do what the code below is doing, but with a while loop instead of a for loop and an if conditional. I also want to be sure not to read the entire file into memory.
import re
from collections import defaultdict

FILE = open('logs.txt', 'r')
count_words = defaultdict(int)
import pickle
i = 0
for line in FILE.readlines():
    if i < 500000:
        m = re.search('key=([^&]*)', line)
        count_words[m.group(1)] += 1
    i += 1

csv = []
for k, v in count_words.iteritems():
    csv.append(k + "," + str(v))
print "\n".join(csv)
Answer
Calling readlines() will read the entire file into memory, so you'll have to read line by line until you reach line 500,000 or hit EOF, whichever comes first. Here's what you should do instead:
i = 0
while i < 500000:
    line = FILE.readline()
    if line == "":  # stop if the end of the file is reached
        break
    m = re.search('key=([^&]*)', line)
    count_words[m.group(1)] += 1
    i += 1
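For comparison, here is a minimal sketch of the same idea using itertools.islice, which also consumes the file one line at a time instead of loading it all. It assumes the same logs.txt format as above, adds a guard for lines that contain no key= parameter, and is written for Python 3 (the snippets above are Python 2):

import re
from collections import defaultdict
from itertools import islice

count_words = defaultdict(int)

# Iterating over the file object yields one line at a time;
# islice stops after the first 500,000 lines without reading the rest.
with open('logs.txt', 'r') as f:
    for line in islice(f, 500000):
        m = re.search('key=([^&]*)', line)
        if m:  # skip lines that have no key= parameter
            count_words[m.group(1)] += 1

for k, v in count_words.items():
    print(k + "," + str(v))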