Problem description
As an OpenMP & Rcpp performance test, I wanted to check how fast I could calculate the Mandelbrot set in R using the most straightforward and simple Rcpp + OpenMP implementation. Currently, what I did was:
#include <Rcpp.h>
#include <omp.h>
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::NumericMatrix mandelRcpp(const double x_min, const double x_max,
                               const double y_min, const double y_max,
                               const int res_x, const int res_y, const int nb_iter) {
  Rcpp::NumericMatrix ret(res_x, res_y);
  double x_step = (x_max - x_min) / res_x;
  double y_step = (y_max - y_min) / res_y;
  int r, c;
  #pragma omp parallel for default(shared) private(c) schedule(dynamic,1)
  for (r = 0; r < res_y; r++) {
    for (c = 0; c < res_x; c++) {
      double zx = 0.0, zy = 0.0, new_zx;
      double cx = x_min + c*x_step, cy = y_min + r*y_step;
      int n = 0;
      for (n = 0; (zx*zx + zy*zy < 4.0) && (n < nb_iter); n++) {
        new_zx = zx*zx - zy*zy + cx;
        zy = 2.0*zx*zy + cy;
        zx = new_zx;
      }
      ret(c,r) = n;
    }
  }
  return ret;
}
And then in R:
library(Rcpp)
sourceCpp("mandelRcpp.cpp")
xlims=c(-0.74877,-0.74872);
ylims=c(0.065053,0.065103);
x_res=y_res=1080L; nb_iter=10000L;
system.time(m <- mandelRcpp(xlims[[1]], xlims[[2]], ylims[[1]], ylims[[2]], x_res, y_res, nb_iter))
# 0.92s
rainbow=c(rgb(0.47,0.11,0.53),rgb(0.27,0.18,0.73),rgb(0.25,0.39,0.81),rgb(0.30,0.57,0.75),rgb(0.39,0.67,0.60),rgb(0.51,0.73,0.44),rgb(0.67,0.74,0.32),rgb(0.81,0.71,0.26),rgb(0.89,0.60,0.22),rgb(0.89,0.39,0.18),rgb(0.86,0.13,0.13))
cols=c(colorRampPalette(rainbow)(100),rev(colorRampPalette(rainbow)(100)),"black") # palette
par(mar=c(0, 0, 0, 0))
system.time(image(m^(1/7), col=cols, asp=diff(ylims)/diff(xlims), axes=FALSE, useRaster=TRUE))
# 0.5s
I was unsure, though, whether there are any other obvious speed improvements I could take advantage of aside from OpenMP multithreading, e.g. via simd vectorization? (Using simd options in the OpenMP #pragma didn't seem to do anything.)
PS: At first my code was crashing, but I later found this was solved by replacing ret[r,c] = n; with ret(r,c) = n;. Using Armadillo classes as suggested in the answer below makes things very slightly faster, though the timings are almost the same. I also flipped x and y around so that the result comes out in the right orientation when plotted with image(). Using 8 threads, it is ca. 350 times faster than the vectorized plain R Mandelbrot version and about 7.3 times faster than the (non-multithreaded) Python/Numba version (similar to PyCUDA or PyOpenCL speeds), so I am quite happy with that... Rasterizing/display now seems to be the bottleneck in R....
Recommended answer
Do not use OpenMP with Rcpp's *Vector or *Matrix objects, as they mask SEXP functions / memory allocations that are single-threaded. OpenMP is a multi-threaded approach.
This is why the code is crashing.
One way to get around this limitation is to use a non-R data structure to store the results. Any one of the following will be sufficient: arma::mat, Eigen::MatrixXd, or std::vector<T>. As I favor Armadillo, I will change the ret matrix from Rcpp::NumericMatrix to arma::mat. Thus, the following will execute your code in parallel:
#include <RcppArmadillo.h> // Note the changed include and new attribute
// [[Rcpp::depends(RcppArmadillo)]]

// Avoid including the header if OpenMP is not available on the system
#ifdef _OPENMP
#include <omp.h>
#endif
// [[Rcpp::plugins(openmp)]]

// Note the changed return type
// [[Rcpp::export]]
arma::mat mandelRcpp(const double x_min, const double x_max,
                     const double y_min, const double y_max,
                     const int res_x, const int res_y, const int nb_iter) {
  arma::mat ret(res_x, res_y); // note change
  double x_step = (x_max - x_min) / res_x;
  double y_step = (y_max - y_min) / res_y;

  // Loop counters are declared inside the loops so that each thread has its
  // own copy; a counter declared outside the parallel region would be shared.
  #pragma omp parallel for shared(ret)
  for (int r = 0; r < res_y; r++) {
    for (int c = 0; c < res_x; c++) {
      double zx = 0.0, zy = 0.0, new_zx;
      double cx = x_min + c*x_step, cy = y_min + r*y_step;
      int n = 0;
      for (; (zx*zx + zy*zy < 4.0) && (n < nb_iter); n++) {
        new_zx = zx*zx - zy*zy + cx;
        zy = 2.0*zx*zy + cy;
        zx = new_zx;
      }
      // Points that never escaped are reset to 0
      if (n == nb_iter) {
        n = 0;
      }
      // Rows index x and columns index y, matching ret(res_x, res_y)
      ret(c, r) = n;
    }
  }
  return ret;
}
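The same idea also works with std::vector<T>, one of the other containers mentioned above. The following is only a minimal sketch in plain C++ (no Rcpp headers), so that it can be compiled and checked on its own; the function name mandel_counts and the column-major layout are assumptions made for illustration and are not part of the original answer. Unlike the Armadillo version, it keeps the raw iteration count for points that never escape.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): escape-time counts stored
// column-major in a plain std::vector, so no R API is touched inside the
// parallel region. Element (c, r) lives at index c + r * res_x.
std::vector<double> mandel_counts(double x_min, double x_max,
                                  double y_min, double y_max,
                                  int res_x, int res_y, int nb_iter) {
    std::vector<double> out(static_cast<std::size_t>(res_x) * res_y);
    const double x_step = (x_max - x_min) / res_x;
    const double y_step = (y_max - y_min) / res_y;

    // If OpenMP is not enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int r = 0; r < res_y; ++r) {
        // c is declared in the loop header, so every thread has its own copy.
        for (int c = 0; c < res_x; ++c) {
            double zx = 0.0, zy = 0.0;
            const double cx = x_min + c * x_step;
            const double cy = y_min + r * y_step;
            int n = 0;
            for (; (zx * zx + zy * zy < 4.0) && (n < nb_iter); ++n) {
                const double new_zx = zx * zx - zy * zy + cx;
                zy = 2.0 * zx * zy + cy;
                zx = new_zx;
            }
            // Each (c, r) pair maps to a distinct slot, so threads never collide.
            out[c + static_cast<std::size_t>(r) * res_x] = n;
        }
    }
    return out;
}
```

In an Rcpp function, the buffer would then be copied into an R matrix after the parallel region has finished, for example with std::copy into an Rcpp::NumericMatrix of dimensions res_x by res_y, so that all R allocations stay on the main thread.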
With the test code for mandelRcpp (note y and x were not defined, thus I assumed y = ylims and x = xlims) we have:
xlims = ylims = c(-2.0, 2.0)
x_res = y_res = 400L
nb_iter = 256L
system.time(m <-
              mandelRcpp(xlims[[1]], xlims[[2]],
                         ylims[[1]], ylims[[2]],
                         x_res, y_res, nb_iter))

rainbow = c(
  rgb(0.47, 0.11, 0.53),
  rgb(0.27, 0.18, 0.73),
  rgb(0.25, 0.39, 0.81),
  rgb(0.30, 0.57, 0.75),
  rgb(0.39, 0.67, 0.60),
  rgb(0.51, 0.73, 0.44),
  rgb(0.67, 0.74, 0.32),
  rgb(0.81, 0.71, 0.26),
  rgb(0.89, 0.60, 0.22),
  rgb(0.89, 0.39, 0.18),
  rgb(0.86, 0.13, 0.13)
)

cols = c(colorRampPalette(rainbow)(100),
         rev(colorRampPalette(rainbow)(100)),
         "black") # palette

par(mar = c(0, 0, 0, 0))

image(m,
      col = cols,
      asp = diff(range(ylims)) / diff(range(xlims)),
      axes = FALSE)
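On the SIMD part of the question: one plausible reason a simd clause appears to do nothing is that the innermost escape loop stops after a data-dependent number of iterations, which compilers generally cannot vectorize as written. A common workaround is to swap the loops, so that a fixed number of iterations runs over a whole row and pixels that have already escaped are masked out. The sketch below illustrates that idea only; the function name mandel_row_simd and its parameters are hypothetical, and whether the loop is actually vectorized depends on the compiler and its flags.

```cpp
#include <vector>

// Hypothetical helper (not from the original post): escape-time counts for one
// image row at imaginary coordinate cy. The iteration loop is outermost and
// the pixel loop is innermost, so the inner loop has a fixed trip count.
std::vector<int> mandel_row_simd(double x_min, double x_step, double cy,
                                 int res_x, int nb_iter) {
    std::vector<double> zx(res_x, 0.0), zy(res_x, 0.0);
    std::vector<int> count(res_x, 0);

    for (int it = 0; it < nb_iter; ++it) {
        // The pragma is ignored if OpenMP (or OpenMP SIMD) support is not enabled.
        #pragma omp simd
        for (int c = 0; c < res_x; ++c) {
            const double cx = x_min + c * x_step;
            const double x2 = zx[c] * zx[c];
            const double y2 = zy[c] * zy[c];
            // A pixel stays active until |z|^2 reaches 4; after that its state
            // is frozen, so it remains inactive and cannot overflow.
            const bool active = (x2 + y2 < 4.0);
            const double new_zx = x2 - y2 + cx;
            const double new_zy = 2.0 * zx[c] * zy[c] + cy;
            zx[c] = active ? new_zx : zx[c];
            zy[c] = active ? new_zy : zy[c];
            count[c] += active ? 1 : 0;
        }
    }
    return count;
}
```

This yields the same counts as the scalar loop, but it always performs nb_iter passes over the row, so it only pays off if the vectorized inner loop is fast enough to compensate. A practical version would probably process pixels in small blocks and stop a block early once none of its pixels is still active.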