Reading multiple files using threads/multiprocessing

Problem description

I am currently pulling .txt files from the list of paths in FileNameList, and that works. My main problem is that it is too slow when there are too many files.

I am using this code to print the list of txt files:

import os

# FileNameList is my set of folders from my path
for filefolder in FileNameList:
    for file in os.listdir(filefolder):
        # match the extension only, not "txt" anywhere in the name
        if file.endswith(".txt"):
            filename = os.path.join(filefolder, file)
            print(filename)

Any help or suggestions on using threads or multiprocessing to make the reading faster would be appreciated. Thanks in advance.

Recommended answer

The first rule of optimization is to ask yourself whether you should bother. If your program is run only once or a couple of times, optimizing it is a waste of time.

The second rule is that before you do anything else, you measure where the problem lies.

Write a simple program that sequentially reads the files, splits them into lines and stuffs those into a database. Run that program under a profiler to see where it spends most of its time.

Only then do you know which part of the program needs speeding up.
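The measuring step could look roughly like the sketch below. It is only an illustration: the names load_files and profile_load are made up for this example, and an in-memory sqlite3 database stands in for whatever database is actually used, which will change the numbers considerably.

```python
import cProfile
import io
import pstats
import sqlite3


def load_files(paths, conn):
    """Read each file sequentially, split it into lines, insert the lines."""
    conn.execute("CREATE TABLE IF NOT EXISTS lines (path TEXT, line TEXT)")
    total = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            lines = handle.read().splitlines()
        conn.executemany(
            "INSERT INTO lines (path, line) VALUES (?, ?)",
            ((path, line) for line in lines),
        )
        total += len(lines)
    conn.commit()
    return total


def profile_load(paths, top=10):
    """Run load_files under cProfile; return (line count, report text)."""
    # Assumption: in-memory SQLite as a stand-in for the real database.
    conn = sqlite3.connect(":memory:")
    profiler = cProfile.Profile()
    try:
        total = profiler.runcall(load_files, paths, conn)
    finally:
        conn.close()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return total, buffer.getvalue()


if __name__ == "__main__":
    import sys

    # Usage: python profile_load.py file1.txt file2.txt ...
    line_count, report = profile_load(sys.argv[1:])
    print(f"{line_count} lines loaded")
    print(report)
```

In the report, compare the cumulative time of the open/read calls with that of executemany and commit. That comparison shows whether reading or the database is the part worth speeding up.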

Here are some pointers nevertheless.

Speeding up the reading of files can be done using mmap.
You could use multiprocessing.Pool to spread the reading of multiple files over different cores. But then the data from those files ends up in different processes and has to be sent back to the parent process using IPC. This has significant overhead for large amounts of data.
In the CPython implementation of Python, only one thread at a time can execute Python bytecode. The actual reading from files isn't inhibited by that, but processing the results is. So it is questionable whether threads would offer an improvement.
Stuffing the lines into a database will probably always be a major bottleneck, because that is where everything comes together. How much of a problem this is depends on the database: is it in memory or on disk, does it allow multiple programs to update it simultaneously, et cetera.
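The mmap and multiprocessing.Pool pointers above can be combined in a small sketch. It assumes each file can be reduced to a small result inside the worker (here simply a count of newline bytes), so that only a tiny tuple travels back to the parent over IPC. The function names find_txt_files, count_lines and count_all are made up for this illustration.

```python
import mmap
import os
from multiprocessing import Pool


def find_txt_files(folders):
    """Return the sorted paths of all .txt files directly inside folders."""
    paths = []
    for folder in folders:
        with os.scandir(folder) as entries:
            for entry in entries:
                if entry.is_file() and entry.name.lower().endswith(".txt"):
                    paths.append(entry.path)
    return sorted(paths)


def count_lines(path):
    """Count newline bytes in one file via mmap; runs in a worker process."""
    if os.path.getsize(path) == 0:
        return path, 0  # an empty file cannot be memory-mapped
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = 0
            position = view.find(b"\n")
            while position != -1:
                count += 1
                position = view.find(b"\n", position + 1)
            return path, count


def count_all(paths, processes=None):
    """Spread the files over worker processes; return {path: count}."""
    with Pool(processes=processes) as pool:
        # Only (path, count) tuples cross the process boundary.
        return dict(pool.imap_unordered(count_lines, paths, chunksize=16))


if __name__ == "__main__":
    import sys

    # Usage: python count_txt.py folder1 folder2 ...
    results = count_all(find_txt_files(sys.argv[1:]))
    for path, count in sorted(results.items()):
        print(f"{count:8d}  {path}")
```

If the parent needs the full file contents instead of a summary, the IPC cost described above returns, and a plain sequential loop may well be just as fast. Measuring both variants is the only way to know.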