Problem description
I am trying to read several thousand hours of WAV files in Python and get their durations. This essentially requires opening each wav file, getting the number of frames, and factoring in the sampling rate. Below is the code for that:
import wave

def wav_duration(file_name):
    # Duration = number of frames / frames per second.
    wv = wave.open(file_name, 'r')
    nframes = wv.getnframes()
    samp_rate = wv.getframerate()
    duration = nframes / samp_rate
    wv.close()
    return duration

def build_datum(wav_file):
    # Use the last three path components (minus the ".wav" suffix) as the label key.
    key = "/".join(wav_file.split('/')[-3:])[:-4]
    try:
        datum = {"wav_file": wav_file,
                 "labels": all_labels[key],
                 "duration": wav_duration(wav_file)}
        return datum
    except KeyError:
        return "key_error"
    except:
        return "wav_error"
Doing this sequentially would take too long. My understanding was that multi-threading should help here, since it is essentially an IO task. Hence, I do just that:
import concurrent.futures
import time

all_wav_files = all_wav_files[:1000000]
data, key_errors, wav_errors = list(), list(), list()
start = time.time()

# max_workers was varied between runs (1, 2, 10, 100; see the table below).
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    # Submit jobs and keep the mapping from futures back to wav_file.
    future2wav = {executor.submit(build_datum, wav_file): wav_file
                  for wav_file in all_wav_files}
    for future in concurrent.futures.as_completed(future2wav):
        wav_file = future2wav[future]
        try:
            datum = future.result()
            if datum == "key_error":
                key_errors.append(wav_file)
            elif datum == "wav_error":
                wav_errors.append(wav_file)
            else:
                data.append(datum)
        except:
            print("Generated exception from thread processing: {}".format(wav_file))

print("Time : {}".format(time.time() - start))
To my dismay, however, I get the following results (in seconds):
Num threads | 100k wavs | 1M wavs
1 | 4.5 | 39.5
2 | 6.8 | 54.77
10 | 9.5 | 64.14
100 | 9.07 | 68.55
Is this expected? Is this a CPU-intensive task? Will multiprocessing help? How can I speed things up? I am reading the files from a local drive, and this is running in a Jupyter notebook, on Python 3.5.
EDIT: I am aware of the GIL. I just assumed that opening and closing a file is essentially IO. People's analyses have shown that in IO-bound cases it can be counterproductive to use multiprocessing. Hence I decided to use multi-threading instead.
I guess the question now is: is this task IO bound?
EDIT EDIT: For those wondering, I think it was CPU bound (a core was maxing out at 100%). The lesson here is not to make assumptions about a task, but to check it for yourself.
Recommended answer
Some things to check, by category:
Code
- How efficient is wave.open? Is it loading the entire file into memory when it could simply be reading the header information?
- Why is max_workers set to 1?
- Have you tried using cProfile or even timeit to get an idea of which particular part of the code is taking the most time? (A minimal profiling sketch follows this list.)
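As an illustrative sketch of that last suggestion, assuming the wav_duration function and all_wav_files list from the question are in scope (the 1000-file sample size is an arbitrary choice for a quick measurement):

import cProfile
import pstats
import timeit

# Profile a small subset first; 1000 is an arbitrary sample size.
sample_files = all_wav_files[:1000]

# timeit gives a single overall number for the sequential baseline.
elapsed = timeit.timeit(lambda: [wav_duration(f) for f in sample_files], number=1)
print("Sequential time for {} files: {:.2f}s".format(len(sample_files), elapsed))

# cProfile breaks that time down per function, showing whether the cost
# sits in wave.open's header parsing or in raw file opening.
profiler = cProfile.Profile()
profiler.enable()
for f in sample_files:
    wav_duration(f)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)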
Hardware
Re-run your existing setup with hard disk activity, memory usage, and CPU monitoring to confirm that hardware is not your limiting factor. If you see your hard disk running at maximum IO, your memory filling up, or all CPU cores at 100%, then one of those could be at its limit. A quick way to take those readings from Python itself is sketched below.
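One option for in-Python monitoring is the third-party psutil package (an illustrative sketch; psutil is an assumption, not something the question's environment is known to have installed):

import psutil

# Sample system-wide load while the wav-reading job runs elsewhere.
print("Per-core CPU (%):", psutil.cpu_percent(interval=1, percpu=True))
print("Memory used: {:.1f}%".format(psutil.virtual_memory().percent))
io = psutil.disk_io_counters()
print("Disk reads: {} calls, {} bytes".format(io.read_count, io.read_bytes))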
Global Interpreter Lock (GIL)
If there are no obvious hardware limitations, you are most likely running into problems with Python's Global Interpreter Lock (GIL), as described well in this answer. This behavior is to be expected if your code is limited to running on a single core, or if there is no effective concurrency between running threads. In this case, I would most certainly change to multiprocessing, starting by creating one process per CPU core, running that, and then comparing the hardware monitoring results with the previous run. A sketch of that switch follows.
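A minimal sketch, reusing build_datum from the question (the pool size and chunksize below are illustrative guesses, executor.map's chunksize argument needs Python 3.5+, and build_datum's all_labels global must be available in the worker processes):

import concurrent.futures
import os

# Processes sidestep the GIL for CPU-bound work, at the cost of
# pickling arguments and results between processes.
with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
    # A generous chunksize amortizes inter-process overhead across many
    # small tasks; map preserves input order in its results.
    results = list(executor.map(build_datum, all_wav_files, chunksize=256))

data = [r for r in results if isinstance(r, dict)]
key_errors = [f for f, r in zip(all_wav_files, results) if r == "key_error"]
wav_errors = [f for f, r in zip(all_wav_files, results) if r == "wav_error"]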