This article describes an approach to parsing a large text file with regular expressions; it should be a useful reference for anyone facing the same problem.

Problem description

I have a huge text file (1 GB), where each "line" is separated by ##.
For example:

## sentence 1 ## sentence 2
## sentence 3

I'm trying to print the file according to the ## separation.

I tried the following code, but the read() call crashes (because of the size of the file).

import re

dataFile = open('post.txt', 'r')
p = re.compile('##(.+)')

# read() loads the entire 1 GB file into memory at once, which is what crashes
iterator = p.finditer(dataFile.read())
for match in iterator:
    print (match.group())

dataFile.close()

Any ideas?

Recommended answer

This will read the file in chunks (of chunksize bytes), thus avoiding memory issues related to reading too much of the file at once:

import re
def open_delimited(filename, delimiter, *args, **kwargs):
    """
    Lazily yield pieces of the file split on `delimiter`, reading the
    file chunksize bytes at a time so the whole file never has to fit
    in memory.

    http://stackoverflow.com/a/17508761/190597
    """
    with open(filename, *args, **kwargs) as infile:
        chunksize = 10000
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.split(delimiter, remainder + chunk)
            # The last piece may be cut off mid-sentence (or even
            # mid-delimiter), so hold it back and prepend it to the
            # next chunk instead of yielding it.
            for piece in pieces[:-1]:
                yield piece
            remainder = pieces[-1]
        if remainder:
            yield remainder

filename = 'post.txt'
for chunk in open_delimited(filename, '##', 'r'):
    print(chunk)
    print('-'*80)
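As a quick sanity check (not part of the original answer), the generator can be exercised on a tiny sample file with a deliberately small chunksize, so that the ## delimiter is forced to straddle chunk boundaries. The helper is reproduced here so the snippet runs on its own; the temporary file and the chunksize of 3 are illustrative choices, not from the source:

```python
import os
import re
import tempfile

def open_delimited(filename, delimiter, *args, **kwargs):
    # Same logic as the answer above, but with a tiny chunksize to
    # force the ## delimiter to be split across chunk boundaries.
    with open(filename, *args, **kwargs) as infile:
        chunksize = 3
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.split(delimiter, remainder + chunk)
            for piece in pieces[:-1]:
                yield piece
            remainder = pieces[-1]
        if remainder:
            yield remainder

# Write a small sample file mirroring the question's format.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, 'w') as f:
    f.write('## sentence 1 ## sentence 2\n## sentence 3')

# The text before the first ## is an empty string, so filter it out.
pieces = [p for p in open_delimited(path, '##', 'r') if p]
os.remove(path)
print(pieces)
# → [' sentence 1 ', ' sentence 2\n', ' sentence 3']
```

Because the remainder is carried over to the next chunk before splitting, the output is identical to splitting the whole file at once, regardless of where the chunk boundaries fall.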

That concludes this article on parsing a large text file with regular expressions. We hope the recommended answer is helpful.
