Fastest way to process large files


Problem description

I have multiple 3 GB tab-delimited files. There are 20 million rows in each file. All the rows have to be processed independently; there is no relation between any two rows. My question is, what will be faster: A. Reading line by line using:

with open(path) as infile:    # path: one of the 3 GB tab-delimited files
    for line in infile:
        process(line)         # per-line work (see below)

Or B. Reading the file into memory in chunks and processing it, say 250 MB at a time?

The processing is not very complicated: I am just grabbing the value in column1 into List1, column2 into List2, and so on. I might need to add some column values together.
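
(For concreteness, and not part of the original question, a per-line routine of the kind described might look like the sketch below; the column positions and list names are placeholders.)

list1 = []
list2 = []

def process(line):
    # split one tab-delimited row into its columns
    fields = line.rstrip('\n').split('\t')
    list1.append(fields[0])
    list2.append(fields[1])
    # adding a couple of numeric columns together would look like:
    # total = float(fields[2]) + float(fields[3])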

I am using Python 2.7 on a Linux box that has 30 GB of memory. The files are ASCII text.

Is there any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Is using any CSVReader module going to help? I don't have to do it in Python; ideas for any other language or a database are welcome.

Recommended answer

It sounds like your code is I/O-bound. This means that multiprocessing isn't going to help: if you spend 90% of your time reading from disk, having 7 extra processes waiting on the next read isn't going to help anything.

And while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
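
For reference, the csv-module route mentioned above is only a few lines. This is a hedged sketch in which path and process_row are placeholders; the file is opened in binary mode, as the Python 2 csv module expects.

import csv

with open(path, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for row in reader:
        process_row(row)   # row is already split into a list of column values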

Still, it's worth checking that you really are I/O-bound instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of a core. Do what Amadan suggested in a comment: run your program with just pass as the processing and see whether that cuts the time by 5% or by 70%. You may even want to compare against a loop over os.open and os.read(1024*1024) and see if that's any faster.
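
As a concrete way to run that last comparison, a sketch like the one below (path is a placeholder for one of the 3 GB files) times a bare os.read loop with no processing at all, giving a lower bound on how long the I/O alone takes:

import os
import time

start = time.time()
fd = os.open(path, os.O_RDONLY)
try:
    while True:
        chunk = os.read(fd, 1024 * 1024)   # pull 1 MB at a time
        if not chunk:                      # empty read means end of file
            break
finally:
    os.close(fd)
print 'raw read took %.1f seconds' % (time.time() - start)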

Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) with some large bufsize. (You can try different numbers and measure to see where the peak is. In my experience, anything from 64 KB to 8 MB is usually about the same, but it may differ on your system, especially if you are, for example, reading off a network filesystem with great throughput but terrible latency, which swamps the throughput-vs-latency tradeoff of the actual physical drive and whatever caching the OS does.)

For example:

bufsize = 65536
with open(path) as infile:
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)


Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be, depending on your system. For example:

import mmap

with open(path) as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)  # needs a file descriptor

A Python mmap is sort of a weird object: it acts like a str and like a file at the same time, so you can, for example, manually iterate over it scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python, although maybe you can get around that with re or with a simple Cython extension), but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
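
For example, a hedged sketch of the manual newline-scanning approach (path and process are placeholders, as above) might look like this:

import mmap

with open(path) as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        start = 0
        while True:
            nl = m.find('\n', start)    # index of the next newline, or -1 at EOF
            if nl == -1:
                break
            process(m[start:nl])        # one line, without its trailing newline
            start = nl + 1
    finally:
        m.close()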

Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of letting the kernel guess, or forcing transparent huge pages), but you can ctypes the function out of libc.
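
A hedged sketch of that ctypes route follows. Nothing here comes from the original answer: the MADV_SEQUENTIAL value of 2 is the Linux one, and pulling the mapping's address out of the mmap object relies on CPython 2.7's buffer API, so treat it as a starting point to adapt rather than anything portable.

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
MADV_SEQUENTIAL = 2   # Linux value; check <sys/mman.h> on your system

def advise_sequential(mm):
    # Get the base address and length of the mapping from the mmap object
    # via the CPython 2.7 buffer API (an implementation detail, not a public API).
    addr = ctypes.c_void_p()
    length = ctypes.c_ssize_t()
    ctypes.pythonapi.PyObject_AsReadBuffer(
        ctypes.py_object(mm), ctypes.byref(addr), ctypes.byref(length))
    # Ask the kernel to optimize for sequential reads of this mapping.
    if libc.madvise(addr, ctypes.c_size_t(length.value), MADV_SEQUENTIAL) != 0:
        raise OSError(ctypes.get_errno(), 'madvise failed')

You would call advise_sequential(m) right after creating the mapping, before starting to read from it.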
