Question
I am trying to parse a gigantic log file (around 5 GB).
I only want to parse the first 500,000 lines, and I don't want to read the whole file into memory.
Basically, I want to do what the code below is doing, but with a while loop instead of a for loop and an if conditional. I also want to be sure not to read the entire file into memory.
import re
from collections import defaultdict

FILE = open('logs.txt', 'r')
count_words = defaultdict(int)
import pickle
i = 0
for line in FILE.readlines():
    if i < 500000:
        m = re.search('key=([^&]*)', line)
        count_words[m.group(1)] += 1
    i += 1

csv = []
for k, v in count_words.iteritems():
    csv.append(k + "," + str(v))
print "\n".join(csv)
Answer
Calling readlines() will read the entire file into memory, so you'll have to read line by line until you reach line 500,000 or hit EOF, whichever comes first. Here's what you should do instead:
i = 0
while i < 500000:
    line = FILE.readline()
    if line == "":  # stop if the end of the file is reached
        break
    m = re.search('key=([^&]*)', line)
    count_words[m.group(1)] += 1
    i += 1
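For comparison, here is a minimal sketch of the same idea using itertools.islice, which also consumes the file one line at a time instead of loading it all. It assumes the same logs.txt format as above, adds a guard for lines that contain no key= parameter, and is written for Python 3 (the snippets above are Python 2):

import re
from collections import defaultdict
from itertools import islice

count_words = defaultdict(int)

# Iterating over the file object yields one line at a time;
# islice stops after the first 500,000 lines without reading the rest.
with open('logs.txt', 'r') as f:
    for line in islice(f, 500000):
        m = re.search('key=([^&]*)', line)
        if m:  # skip lines that have no key= parameter
            count_words[m.group(1)] += 1

for k, v in count_words.items():
    print(k + "," + str(v))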