Most of NumPy's functions enable multithreading by default.
For example, I work on an 8-core Intel CPU workstation. If I run this script:
import numpy as np
x=np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)
the Linux top command shows 800% CPU usage while the script runs, which means NumPy automatically detects that my workstation has 8 cores, and np.sqrt automatically uses all 8 cores to accelerate the computation.
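Besides watching top, the effect can also be measured by timing the loop directly; this is a minimal sketch (not in the original question, with the iteration count reduced just to keep the measurement quick) that can be run with and without the pandas lines below to compare wall-clock time:
import time
import numpy as np

x = np.random.random(1000000)
start = time.perf_counter()
for i in range(1000):
    np.sqrt(x)
print(f"1000 calls to np.sqrt took {time.perf_counter() - start:.2f} s")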
However, I found a weird bug. If I run this script:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.random((10,10)))
df+df
x=np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)
the CPU usage is only 100%! This means that if you add two pandas DataFrames before running any NumPy function, NumPy's automatic multithreading is gone without any warning. This makes no sense: why would a pandas DataFrame calculation affect NumPy's threading settings? Is it a bug? How can I work around it?
PS:
I dug further using the Linux perf tool.
Profiling the two scripts with perf shows that both involve libmkl_vml_avx2.so, while the first script additionally involves libiomp5.so, which seems to be related to OpenMP.
And since VML stands for Intel Vector Math Library, according to the VML documentation I guess that at least the elementwise VML functions (sqrt, sin, cos, and so on) are all automatically multithreaded.
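A quick way to confirm that the NumPy build in use is the MKL-backed one (as the Anaconda default is) - a minimal check, not specific to this bug:
import numpy as np

# An MKL-backed build (e.g. Anaconda's default NumPy) lists "mkl"
# in the BLAS/LAPACK sections of the build configuration.
np.show_config()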
Pandas uses numexpr under the hood for some of its operations, and numexpr sets the maximal number of VML threads to 1 when it is imported:
# The default for VML is 1 thread (see #39)
set_vml_num_threads(1)
and it gets imported by pandas when df+df
is evaluated in expressions.py:
from pandas.core.computation.check import _NUMEXPR_INSTALLED
if _NUMEXPR_INSTALLED:
    import numexpr as ne
However, the Anaconda distribution also uses VML functionality for functions such as sqrt, sin, cos and so on - and once numexpr sets the maximal number of VML threads to 1, the NumPy functions no longer use parallelization.
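If that is the case, the pandas part is only incidental: merely importing numexpr before the loop should reproduce the slowdown. A minimal sketch to check this (an expectation based on the mechanism above, not a test from the original answer):
import numpy as np
import numexpr  # importing numexpr is expected to cap VML at 1 thread

x = np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)  # if the explanation above holds, top should stay near 100%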
The problem can be easily seen in gdb (using your slow script):
>>> gdb --args python slow.py
(gdb) b mkl_serv_domain_set_num_threads
function "mkl_serv_domain_set_num_threads" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mkl_serv_domain_set_num_threads) pending.
(gdb) run
Thread 1 "python" hit Breakpoint 1, 0x00007fffee65cd70 in mkl_serv_domain_set_num_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0 0x00007fffee65cd70 in mkl_serv_domain_set_num_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1 0x00007fffe978026c in _set_vml_num_threads(_object*, _object*) () from /home/ed/anaconda37/lib/python3.7/site-packages/numexpr/interpreter.cpython-37m-x86_64-linux-gnu.so
#2 0x00005555556cd660 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:694
...
(gdb) print $rdi
$1 = 1
i.e. we can see that numexpr sets the number of threads to 1, which is later used when the VML sqrt function is called:
(gdb) b mkl_serv_domain_get_max_threads
Breakpoint 2 at 0x7fffee65a900
(gdb) c
Continuing.
Thread 1 "python" hit Breakpoint 2, 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1 0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#2 0x00007fffedf78563 in vdSqrt () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_lp64.so
#3 0x00007ffff5ac04ac in trivial_two_operand_loop () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so
So we can see that NumPy uses VML's implementation of vdSqrt, which utilizes mkl_vml_serv_threader_d_1i_1o to decide whether the calculation should be done in parallel, and which looks up the number of threads:
(gdb) fin
Run till exit from #0 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) print $rax
$2 = 1
The register %rax holds the maximal number of threads, and it is 1.
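As an alternative to gdb, the mkl-service Python package (if it is installed) wraps the same MKL domain-threading calls; the following is only a sketch, assuming your mkl-service version exposes domain_get_max_threads and domain_set_num_threads with a "vml" domain:
import mkl  # the mkl-service package; assumed to be installed

# Expected to print 1 after pandas/numexpr has been imported.
print(mkl.domain_get_max_threads(domain="vml"))
# Lifting the cap through MKL directly (same effect as numexpr's setter below).
mkl.domain_set_num_threads(8, domain="vml")
print(mkl.domain_get_max_threads(domain="vml"))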
Now we can use numexpr to increase the number of VML threads, i.e.:
import numpy as np
import numexpr as ne
import pandas as pd
df=pd.DataFrame(np.random.random((10,10)))
df+df
# HERE: reset the number of VML threads
ne.set_vml_num_threads(8)
x=np.random.random(1000000)
for i in range(10000):
    np.sqrt(x)  # now in parallel
Now multiple cores are utilized!
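As a small follow-up to the workaround above, the hard-coded 8 can be replaced by the detected core count (a sketch using the same set_vml_num_threads call as above):
import os
import numexpr as ne

# Size the VML thread pool to the machine's core count instead of hard-coding 8.
ne.set_vml_num_threads(os.cpu_count())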