I am trying to write a multithreaded program in Python to accelerate the copying of (under 1000) .csv files. The multithreaded code runs even slower than the sequential approach. I timed the code with profile.py. I am sure I must be doing something wrong but I'm not sure what.
The Environment:
- Quad core CPU.
- 2 hard drives, one containing source files. The other is the destination.
- 1000 csv files ranging in size from several KB to 10 MB.
The Approach:
I put all the file paths in a Queue and create 4-8 worker threads that pull file paths from the queue and copy the designated file. In no case is the multithreaded code faster:
- sequential copy takes 150-160 seconds
- threaded copy takes over 230 seconds
I assumed this was an I/O-bound task, so I expected multithreading to speed the operation up.
The Code:
import Queue
import threading
import cStringIO
import os
import shutil
import timeit   # time the code exec with gc disabled
import glob     # file wildcard list, glob.glob('*.py')
import profile

fileQueue = Queue.Queue()  # global
srcPath = 'C:\\temp'
destPath = 'D:\\temp'
tcnt = 0
ttotal = 0

def CopyWorker():
    while True:
        fileName = fileQueue.get()
        fileQueue.task_done()
        shutil.copy(fileName, destPath)
        #tcnt += 1
        print 'copied: ', tcnt, ' of ', ttotal

def threadWorkerCopy(fileNameList):
    print 'threadWorkerCopy: ', len(fileNameList)
    ttotal = len(fileNameList)
    for i in range(4):
        t = threading.Thread(target=CopyWorker)
        t.daemon = True
        t.start()
    for fileName in fileNameList:
        fileQueue.put(fileName)
    fileQueue.join()

def sequentialCopy(fileNameList):
    # around 160.446 seconds, 152 seconds
    print 'sequentialCopy: ', len(fileNameList)
    cnt = 0
    ctotal = len(fileNameList)
    for fileName in fileNameList:
        shutil.copy(fileName, destPath)
        cnt += 1
        print 'copied: ', cnt, ' of ', ctotal

def main():
    print 'this is main method'
    fileCount = 0
    fileList = glob.glob(srcPath + '\\' + '*.csv')
    #sequentialCopy(fileList)
    threadWorkerCopy(fileList)

if __name__ == '__main__':
    profile.run('main()')
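One caveat about the timing method: the pure-Python profile module adds heavy instrumentation overhead and only traces the thread that profile.run() is called in, so it can distort a threaded-vs-sequential comparison. For wall-clock numbers, timeit (already imported above) is the safer tool; a minimal sketch, where run_copy_job is a stand-in for the real main():

```python
import timeit

def run_copy_job():
    # Stand-in for main(); substitute sequentialCopy(fileList)
    # or threadWorkerCopy(fileList) to compare the two approaches.
    return sum(range(100000))

# number=1: each run processes the whole file set once,
# so a single pass per measurement is enough.
elapsed = timeit.timeit(run_copy_job, number=1)
print('elapsed: %.3f s' % elapsed)
```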
Of course it's slower. The hard drive heads have to seek back and forth between files constantly. Your belief that multithreading would make this task faster is unjustified: the limiting factor is how fast you can read data from or write data to the disk, and every seek from one file to another is time lost that could have been spent transferring data.
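Separately, there is a genuine bug in the posted worker: it calls fileQueue.task_done() before shutil.copy(), so fileQueue.join() can return while copies are still in flight, which also skews the threaded timing. A corrected version, sketched with Python 3 module names (queue rather than Queue) rather than as a drop-in replacement:

```python
import queue
import shutil
import threading

def threaded_copy(file_names, dest_path, workers=4):
    """Copy files using worker threads; returns only after
    every copy has actually completed."""
    file_queue = queue.Queue()

    def copy_worker():
        while True:
            file_name = file_queue.get()
            try:
                shutil.copy(file_name, dest_path)
            finally:
                # task_done() after the copy, not before, so that
                # file_queue.join() waits for the work to finish.
                file_queue.task_done()

    for _ in range(workers):
        threading.Thread(target=copy_worker, daemon=True).start()
    for file_name in file_names:
        file_queue.put(file_name)
    file_queue.join()
```

Even with the ordering fixed, the point above still stands: once both drives are saturated by seeks, adding threads only makes the read and write patterns less sequential.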