本文介绍了快速对数计算的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述 29岁程序员,3月因学历无情被辞! 所有代码都在linux上的同一台计算机上运行。 在python中: import numpy as np drr = abs(np.random.randn(100000,50))%timeit np.log2(drr) 10个循环,最好的3:每个循环77.9 ms 在C ++使用g ++ -o log ./log.cpp -std = c ++ 11 -O3编译): #include< iostream> #include< iomanip> #include< string> #include< map> #include< random> #include< ctime> int main() { std :: mt19937 e2(0); std :: normal_distribution<> dist(0,1); const int n_seq = 100000; const int l_seq = 50; static double x [n_seq] [l_seq]; for(int n = 0; n for(int k = 0; k x [n] k] = abs(dist(e2)); if(x [n] [k] x [n] [k] = 0.1; } } clock_t begin = clock(); for(int n = 0; n< n_seq; ++ n){ for(int k = 0; k x [n] [k] = std :: log2(x [n] [k]); } } clock_t end = clock(); 以60 ms运行 : abr = abs(randn(100000,50)); tic; abr = log2(abr); toc 已用时间为7.8 ms。 / p> 我可以理解C ++和numpy之间的速度差异,但MATLAB打败了一切。 我遇到了 http://fastapprox.googlecode.com /svn/trunk/fastapprox/src/fastonebigheader.h 但是这只是float,不是double,我不知道如何将它转换为double。 我也尝试过: http: //hackage.haskell.org/package/approximate-0.2.2.1/src/cbits/fast.c 它具有快速日志功能,编译为numpy ufunc时运行时间为20 ms,这是伟大的,但精度的损失是显着的。 有关如何实现MATLAB神奇的log2速度的任何想法? UPDATE 感谢大家的意见,这非常快速,非常有帮助!实际上,答案是并行化,即在几个线程上分散负载。根据@morningsun建议, %timeit numexpr.evaluate('log(drr)') ms,这是与MATLAB一样,谢谢! numexpr is MKL enabled 解决方案请注意,以下所有都是float32,而不是双精度。 UPDATE :我完全弃用了gcc以支持英特尔的icc。当性能至关重要,当你没有时间微调你的编译器提示来强制执行gcc向量化时,这一切都会有所不同(参见例如 here ) log_omp.c , GCC:gcc -o log_omp.so -fopenmp log_omp.c -lm -O3 -fPIC -shared -std = c99 ICC:icc -o log_omp.so -openmp loge_omp.c -lm -O3 -fPIC -shared -std = c99 -vec-report1 -xAVX -I / opt / intel / composer / mkl / include #include< math.h> #includeomp.h #includemkl_vml.h #define restrict __restrict inline void log_omp * restrict a,float * restrict c); void log_omp(int m,float * restrict a,float * restrict c) { int i; #pragma omp parallel for default(none)shared(m,a,c)private(i) for(i = 0; i a [i] log(c [i]); } } // VML / icc only: void log_VML(int m,float * restrict a,float * restrict c) { int i; int split_to = 14; int iter = m / split_to; int additional = m%split_to; // vsLn(m,c,a); #pragma omp parallel for default(none)shared(m,a,c,additional,iter)private(i)num_threads(split_to) for(i = 0; i vsLog10(iter,c + i,a + i) // vmsLn(iter,c + i,a + i,VML_HA); if(additional> 0) vsLog10(附加,c + m附加,a + m附加); // vmsLn(additional,c + m-additional,a + m-additional,VML_HA); } in python: $ b 来自ctypes import CDLL,c_int,c_void_p def log_omp(xs,out): lib = CDLL('./ log_omp .so') lib.log_omp.argtypes = [c_int,np.ctypeslib.ndpointer(dtype = np.float32),np.ctypeslib.ndpointer(dtype = np.float32)] lib.log_omp .restype = c_void_p n = xs.shape [0] out = np.empty(n,np.float32) lib.log_omp(n,out,xs) return out Cython代码(在ipython notebook中,因此是%% magic): %% cython --compile-args = -fopenmp --link-args = -fopenmp import numpy as np cimport numpy as np 从libc.math cimport log 从cython.parallel cimport prange import cython @cython .boundscheck(False) def cylog(np.ndarray [np.float32_t,ndim = 1] a not None, np.ndarray [np.float32_t,ndim = 1] out = None):如果out为None: out = np.empty((a.shape [0]),dtype = a.dtype) cdef Py_ssize_t i with nogil: for i in prange(a.shape [0]): out [i] = log(a [i]) return out 计时: numexpr。 detect_number_of_cores()// 2 28 %env OMP_NUM_THREADS = 28 x = np.abs(np.random.randn(50000000).astype('float32'))y = x.copy() #GCC %timeit log_omp(x,y) 10循环,最好的3:每循环21.6 ms #ICC %timeit log_omp(x,y) 100循环,最好是3:每循环9.6 ms %timeit log_VML(x,y) 100循环3:每个循环10 ms %timeit cylog(x,out = y) 10个循环,最好的3:每个循环21.7 ms numexpr.set_num_threads (28)%timeit out = numexpr.evaluate('log(x)') 100个循环,最好的3:每个循环13 ms pre> 所以numexpr似乎比gcc编译好,但是icc胜过。 一些资源我发现有用和可耻使用的代码: http://people.duke.edu/~ccc14/sta-663/Optimization_Bakeoff.html https://gist.github.com/zed/2051661 All the code was run on the same machine on linux.In python:import numpy as npdrr = abs(np.random.randn(100000,50))%timeit np.log2(drr)10 loops, best of 3: 77.9 ms per loopIn C++ (compiled with g++ -o log ./log.cpp -std=c++11 -O3):#include <iostream>#include <iomanip>#include <string>#include <map>#include <random>#include <ctime>int main(){std::mt19937 e2(0);std::normal_distribution<> dist(0, 1);const int n_seq = 100000;const int l_seq = 50;static double x[n_seq][l_seq];for (int n = 0;n < n_seq; ++n) { for (int k = 0; k < l_seq; ++k) { x[n][k] = abs(dist(e2)); if(x[n][k] <= 0) x[n][k] = 0.1; } } clock_t begin = clock(); for (int n = 0; n < n_seq; ++n) { for (int k = 0; k < l_seq; ++k) { x[n][k] = std::log2(x[n][k]); } } clock_t end = clock();Runs in 60 msIn MATLAB:abr = abs(randn(100000,50));tic;abr=log2(abr);tocElapsed time is 7.8 ms.I can understand the speed difference between C++ and numpy, but MATLAB beats everything.I've come acrosshttp://fastapprox.googlecode.com/svn/trunk/fastapprox/src/fastonebigheader.hbut this does only float, not double, and I'm not sure how to convert it to double.I also tried this:http://hackage.haskell.org/package/approximate-0.2.2.1/src/cbits/fast.cwhich has fast log functions, and when compiled as a numpy ufunc, runs in 20 ms, which is great, but the loss in accuracy is significant.Any ideas on how to achieve the magical log2 speed that MATLAB gets?UPDATEThank you all for comments, that was very fast and very helpful! Indeed, the answer is parallelisation, i.e. spreading the load on several threads. Following @morningsun suggestion,%timeit numexpr.evaluate('log(drr)')gives 5.6 ms, which is on par with MATLAB, thank you! numexpr is MKL enabled 解决方案 Note that ALL below are float32, not double precision.UPDATE:I've ditched gcc completely in favour of Intel's icc. It makes ALL the difference when performance is critical and when you don't have time to fine-tune your "compiler hints" to enforce gcc vectorization (see, e.g. here)log_omp.c,GCC: gcc -o log_omp.so -fopenmp log_omp.c -lm -O3 -fPIC -shared -std=c99ICC: icc -o log_omp.so -openmp loge_omp.c -lm -O3 -fPIC -shared -std=c99 -vec-report1 -xAVX -I/opt/intel/composer/mkl/include#include <math.h>#include "omp.h"#include "mkl_vml.h"#define restrict __restrictinline void log_omp(int m, float * restrict a, float * restrict c);void log_omp(int m, float * restrict a, float * restrict c){ int i;#pragma omp parallel for default(none) shared(m,a,c) private(i) for (i=0; i<m; i++) { a[i] = log(c[i]); }}// VML / icc only:void log_VML(int m, float * restrict a, float * restrict c){ int i; int split_to = 14; int iter = m / split_to; int additional = m % split_to;// vsLn(m, c, a);#pragma omp parallel for default(none) shared(m,a,c, additional, iter) private(i) num_threads(split_to) for (i=0;i < (m-additional); i+=iter) vsLog10(iter,c+i,a+i); //vmsLn(iter,c+i,a+i, VML_HA); if (additional > 0) vsLog10(additional, c+m-additional, a+m-additional); //vmsLn(additional, c+m-additional, a+m-additional, VML_HA);}in python:from ctypes import CDLL, c_int, c_void_pdef log_omp(xs, out): lib = CDLL('./log_omp.so') lib.log_omp.argtypes = [c_int, np.ctypeslib.ndpointer(dtype=np.float32), np.ctypeslib.ndpointer(dtype=np.float32)] lib.log_omp.restype = c_void_p n = xs.shape[0] out = np.empty(n, np.float32) lib.log_omp(n, out, xs) return outCython code (in ipython notebook, hence the %% magic):%%cython --compile-args=-fopenmp --link-args=-fopenmpimport numpy as npcimport numpy as npfrom libc.math cimport logfrom cython.parallel cimport prangeimport [email protected](False)def cylog(np.ndarray[np.float32_t, ndim=1] a not None, np.ndarray[np.float32_t, ndim=1] out=None): if out is None: out = np.empty((a.shape[0]), dtype=a.dtype) cdef Py_ssize_t i with nogil: for i in prange(a.shape[0]): out[i] = log(a[i]) return outTimings:numexpr.detect_number_of_cores() // 228%env OMP_NUM_THREADS=28x = np.abs(np.random.randn(50000000).astype('float32'))y = x.copy()# GCC%timeit log_omp(x, y)10 loops, best of 3: 21.6 ms per loop# ICC%timeit log_omp(x, y)100 loops, best of 3: 9.6 ms per loop%timeit log_VML(x, y)100 loops, best of 3: 10 ms per loop%timeit cylog(x, out=y)10 loops, best of 3: 21.7 ms per loopnumexpr.set_num_threads(28)%timeit out = numexpr.evaluate('log(x)')100 loops, best of 3: 13 ms per loopSo numexpr seems to be doing a better job than (poorly) compiled gcc code, but icc wins.Some resources I found useful and shamefully used code from:http://people.duke.edu/~ccc14/sta-663/Optimization_Bakeoff.htmlhttps://gist.github.com/zed/2051661 这篇关于快速对数计算的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持! 上岸,阿里云!
08-13 17:47