python - Parallelizing operations on a pandas DataFrame is slow

I have a DataFrame on which I perform some operations and write out the results. To do this, I have to iterate over every row.

for count, row in final_df.iterrows():
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file

I decided to parallelize this using Python's multiprocessing module:

def write_site_files(row):
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file

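import multiprocessing

# Assumption: num_proc is not defined in the original snippet
num_proc = multiprocessing.cpu_count()
pkg_num = 0
total_runs = final_df.shape[0]  # Total number of rows in final_df
threads = []  # Holds Process objects despite the name

while pkg_num < total_runs or threads:
    if len(threads) < num_proc and pkg_num < total_runs:
        print(pkg_num, total_runs)
        # One process per row; args must match write_site_files(row)
        t = multiprocessing.Process(target=write_site_files,
                                    args=(final_df.iloc[pkg_num],))
        pkg_num += 1
        t.start()
        threads.append(t)
    else:
        # Rebuild the list rather than removing items while iterating over it
        threads = [t for t in threads if t.is_alive()]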

However, the latter (parallelized) approach is far slower than the simple iteration-based one. Is there something I'm missing?
Thanks!

Best answer

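Unless the actual operation takes a long time (on the order of seconds per row), doing this in a single process will be much more efficient.
Parallelization is usually the last tool to reach for: first profile, then vectorize locally, then optimize locally, and only then parallelize.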
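
As a rough sketch of the "vectorize before you parallelize" step: the question does not say what the per-row operation is, so assume purely for illustration that it is a simple arithmetic combination of param_a and param_b. The helper names combine_rowwise and combine_vectorized and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def combine_rowwise(df):
    # Row-by-row version, mirroring the iterrows() loop in the question.
    out = []
    for _, row in df.iterrows():
        out.append(row["param_a"] * 2.0 + row["param_b"])
    return pd.Series(out, index=df.index)


def combine_vectorized(df):
    # Same arithmetic (an assumed stand-in operation) applied to whole columns.
    return df["param_a"] * 2.0 + df["param_b"]


# Hypothetical sample data with the column names used in the question.
final_df = pd.DataFrame({
    "param_a": np.arange(1000, dtype=float),
    "param_b": np.ones(1000),
})

slow = combine_rowwise(final_df)
fast = combine_vectorized(final_df)
```

Both functions return the same values. If the real operation can be written in the column-wise form, that is typically worth trying before reaching for multiple processes.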
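You are spending time just doing the slicing, then spinning up new processes (generally a constant overhead per process), then pickling a single row (it is not clear from your example how big a row is).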
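
One way to get a feel for the serialization part of that overhead is to compare the bytes produced by pickling every row separately with pickling the whole frame once. This is only a sketch on made-up data. Whether arguments are actually pickled depends on the platform and the multiprocessing start method, so treat the numbers as indicative rather than exact.

```python
import pickle

import numpy as np
import pandas as pd

# Hypothetical sample data with the column names used in the question.
final_df = pd.DataFrame({
    "param_a": np.arange(1000, dtype=float),
    "param_b": np.ones(1000),
})

# Total bytes if each row (a Series) is pickled on its own.
per_row_bytes = sum(
    len(pickle.dumps(final_df.iloc[i])) for i in range(len(final_df))
)

# Bytes if the whole frame is pickled once.
whole_frame_bytes = len(pickle.dumps(final_df))

print(per_row_bytes, whole_frame_bytes)
```

Each single-row Series carries its own index and metadata, so shipping rows one at a time sends far more bytes than the data alone would suggest.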
At the very least, you should process the rows in chunks, e.g. df.iloc[i*chunksize:(i+1)*chunksize].
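
A minimal sketch of the chunked approach, using multiprocessing.Pool so that a fixed set of worker processes is reused instead of starting one process per row. The names process_chunk, iter_chunks and run_chunked, the chunk size, the worker count, and the per-row operation are all assumptions made for illustration.

```python
import multiprocessing

import numpy as np
import pandas as pd


def process_chunk(chunk):
    # Hypothetical stand-in for "perform operation, write to output file":
    # it only computes one value per row and returns them as a list.
    return (chunk["param_a"] + chunk["param_b"]).tolist()


def iter_chunks(df, chunksize):
    # Yield consecutive blocks of at most `chunksize` rows.
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize]


def run_chunked(df, chunksize=250, num_proc=2):
    # Each task sent to a worker is a whole chunk, not a single row.
    with multiprocessing.Pool(processes=num_proc) as pool:
        parts = pool.map(process_chunk, iter_chunks(df, chunksize))
    # Flatten the per-chunk lists back into one list in row order.
    return [value for part in parts for value in part]


if __name__ == "__main__":
    final_df = pd.DataFrame({
        "param_a": np.arange(1000, dtype=float),
        "param_b": np.ones(1000),
    })
    results = run_chunked(final_df)
    print(len(results))
```

With 1000 rows and a chunk size of 250, only four tasks are dispatched, so the per-task overhead is paid four times rather than a thousand. Whether this beats the single-process loop still depends on how expensive the real operation is.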
Hopefully there will be some support for a parallel apply in pandas 0.14; see https://github.com/pydata/pandas/issues/5751

Source: a similar question on Stack Overflow about parallelizing operations on a pandas DataFrame being slow: https://stackoverflow.com/questions/22468279/