Why is the addition of two std::vector slower than a raw array from new[]?

Problem description
I'm looking into OpenMP, partly because my program needs to add very large vectors (millions of elements). However, I see a quite large difference depending on whether I use std::vector or a raw array, which I cannot explain. I insist that the difference is only in the loop, not the initialisation of course.
The time difference I refer to comes from timing only the addition, specifically not taking into account any initialization difference between vectors, arrays, etc. I'm really talking only about the sum part. The size of the vectors is not known at compile time. I use g++ 5.x on Ubuntu 16.04.
Edit: I tested what @Shadow said, and it got me thinking: is there something going on with optimization? If I compile with -O2, then, using initialized raw arrays, I get back loop scaling with the number of threads. But with -O3 or -funroll-loops, it is as if the compiler kicks in early and optimizes before the pragma is seen.
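For reference, the invocations being compared would presumably look like this (the exact flag combinations are my assumption based on the description above):
g++ -O2 -fopenmp -std=c++11 main.cpp
g++ -O3 -fopenmp -std=c++11 main.cpp
g++ -O2 -funroll-loops -fopenmp -std=c++11 main.cpp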
I came up with the following simple test:
#include <iostream>
#include <vector>
#include <omp.h>   // omp_get_wtime

#define SIZE 10000000
#define TRIES 200

int main(){
    std::vector<double> a, b, c;
    a.resize(SIZE);
    b.resize(SIZE);
    c.resize(SIZE);

    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for( t = 0; t < TRIES; t++){
            #pragma omp for
            for( i = 0; i < SIZE; i++){
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;
    return 0;
}
I compile with
g++ -O3 -fopenmp -std=c++11 main.cpp
and get, for one thread:
>time ./a.out
finished in 2.5638
./a.out 2.58s user 0.04s system 99% cpu 2.619 total
For two threads, the loop takes 1.2s, 1.23s total.
Now if I use raw arrays:
#include <iostream>
#include <omp.h>   // omp_get_wtime

#define SIZE 10000000
#define TRIES 200

int main(){
    double *a, *b, *c;
    a = new double[SIZE];
    b = new double[SIZE];
    c = new double[SIZE];

    double start = omp_get_wtime();
    unsigned long int i, t;
    #pragma omp parallel shared(a,b,c) private(i,t)
    {
        for( t = 0; t < TRIES; t++)
        {
            #pragma omp for
            for( i = 0; i < SIZE; i++)
            {
                c[i] = a[i] + b[i];
            }
        }
    }
    std::cout << "finished in " << omp_get_wtime() - start << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
I get (1 thread):
>time ./a.out
finished in 1.92901
./a.out 1.92s user 0.01s system 99% cpu 1.939 total
std::vector is 33% slower!
For two threads:
>time ./a.out
finished in 1.20061
./a.out 2.39s user 0.02s system 198% cpu 1.208 total
As a comparison, with Eigen or Armadillo for exactly the same operation (using the c = a + b overload on vector objects), I get a total real time of ~2.8s. They are not multi-threaded for vector additions.
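The exact Eigen code is not shown; presumably the variant being timed looks roughly like the following sketch (Eigen 3 assumed; VectorXd::Zero and the operator+ overload are standard Eigen API):
#include <Eigen/Dense>

int main(){
    const int SIZE = 10000000;
    Eigen::VectorXd a = Eigen::VectorXd::Zero(SIZE);  // zero-initialized
    Eigen::VectorXd b = Eigen::VectorXd::Zero(SIZE);
    Eigen::VectorXd c(SIZE);
    for (int t = 0; t < 200; t++){  // same TRIES as above
        c = a + b;  // expression-template overload, single-threaded
    }
    return 0;
}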
Now, I thought std::vector had almost no overhead? What is happening here? I'd like to use the nice standard library objects.
I cannot find any reference anywhere on a simple example like this.
Recommended answer
Meaningful benchmarking is hard
The answer from Xirema has already outlined in detail the difference in the code: std::vector::resize value-initializes the data to zero, whereas new double[size] does not. Note that you can use new double[size]() to force initialization.
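A minimal sketch of the three initialization behaviours just mentioned (this is standard C++ semantics, not code from the question):
#include <vector>

int main(){
    std::vector<double> v;
    v.resize(10);                  // value-initializes: every element is 0.0
    double *p = new double[10];    // default-initialized: values are indeterminate
    double *q = new double[10]();  // value-initialized: every element is 0.0
    delete[] p;
    delete[] q;
    return 0;
}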
However, your measurement doesn't include initialization, and the number of repetitions is so high that the loop costs should outweigh the small initialization even in Xirema's example. So why do the very same instructions in the loop take more time just because the data is initialized?
Let's dig to the core of this with code that dynamically determines whether the memory is initialized or not (based on Xirema's code, but timing only the loop itself):
#include <vector>
#include <chrono>
#include <iostream>
#include <memory>
#include <iomanip>
#include <cstring>
#include <string>
#include <sys/types.h>
#include <unistd.h>

constexpr size_t size = 10'000'000;

auto time_pointer(size_t reps, bool initialize, double init_value) {
    double * a = new double[size];
    double * b = new double[size];
    double * c = new double[size];

    if (initialize) {
        for (size_t i = 0; i < size; i++) {
            a[i] = b[i] = c[i] = init_value;
        }
    }

    auto start = std::chrono::steady_clock::now();
    for (size_t t = 0; t < reps; t++) {
        for (size_t i = 0; i < size; i++) {
            c[i] = a[i] + b[i];
        }
    }
    auto end = std::chrono::steady_clock::now();

    delete[] a;
    delete[] b;
    delete[] c;
    return end - start;
}

int main(int argc, char* argv[]) {
    bool initialize = (argc == 3);
    double init_value = 0;
    if (initialize) {
        init_value = std::stod(argv[2]);
    }
    auto reps = std::stoll(argv[1]);
    std::cout << "pid: " << getpid() << "\n";
    auto t = time_pointer(reps, initialize, init_value);
    std::cout << std::setw(12) << std::chrono::duration_cast<std::chrono::milliseconds>(t).count() << "ms" << std::endl;
    return 0;
}
The results are consistent:
./a.out 50 # no initialization
657ms
./a.out 50 0. # with initialization
1005ms
A first look at performance counters
Using the excellent Linux perf tool:
$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50
pid: 12481
626ms
Performance counter stats for './a.out 50':
101.589.231 LLC-loads
105.415 dTLB-misses
0,629369979 seconds time elapsed
$ perf stat -e LLC-loads -e dTLB-misses ./a.out 50 0.
pid: 12499
1008ms
Performance counter stats for './a.out 50 0.':
145.218.903 LLC-loads
1.889.286 dTLB-misses
1,096923077 seconds time elapsed
Linear scaling with an increasing number of repetitions also tells us that the difference comes from within the loop. But why would initializing the memory cause more last-level cache loads and data TLB misses?
To understand that, we need to understand how memory is allocated. Just because malloc/new returns some pointer to virtual memory doesn't mean that there is physical memory behind it. The virtual memory can be in a page that is not backed by physical memory, and the physical memory is only assigned on the first page fault. Now here is where page-types (from linux/tools/vm in the kernel sources) and the pid we print as output come in handy: as root, something like page-types -p <pid> produces the statistics below. Looking at the page statistics during a long execution of our little benchmark:
With initialization:
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000804 1 0 __R________M______________________________ referenced,mmap
0x000000000004082c 392 1 __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c 335 1 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800 56721 221 ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868 1807 7 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868 111 0 ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c 1 0 __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 59368 231
Most of the virtual memory is in a normal mmap,anonymous region, i.e. something that is mapped to a physical address.
Without initialization:
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000001000000 1174 4 ________________________z_________________ zero_page
0x0000000001400000 37888 148 ______________________t_z_________________ thp,zero_page
0x0000000000000800 1 0 ___________M______________________________ mmap
0x000000000004082c 388 1 __RU_l_____M______u_______________________ referenced,uptodate,lru,mmap,unevictable
0x000000000000086c 347 1 __RU_lA____M______________________________ referenced,uptodate,lru,active,mmap
0x0000000000401800 18907 73 ___________Ma_________t___________________ mmap,anonymous,thp
0x0000000000005868 633 2 ___U_lA____Ma_b___________________________ uptodate,lru,active,mmap,anonymous,swapbacked
0x0000000000405868 37 0 ___U_lA____Ma_b_______t___________________ uptodate,lru,active,mmap,anonymous,swapbacked,thp
0x000000000000586c 1 0 __RU_lA____Ma_b___________________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 59376 231
Now here, only 1/3 of the memory is backed by dedicated physical memory, and 2/3 is mapped to a zero page. The data behind a and b is all backed by a single read-only 4 KiB page filled with zeros. c (and a, b in the other test) has already been written to, so it has to have its own memory.
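To make the lazy allocation visible, here is a small sketch of my own (not part of the original answer) that counts minor page faults with getrusage while first-touching a freshly allocated array; expect roughly one fault per touched page, fewer when transparent huge pages kick in:
#include <cstddef>
#include <cstdio>
#include <sys/resource.h>

static long minor_faults() {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(){
    const size_t size = 10'000'000;
    double *a = new double[size];  // virtual memory only, nothing touched yet
    long before = minor_faults();
    for (size_t i = 0; i < size; i++)
        a[i] = 0.0;                // first touch: kernel assigns physical pages
    long after = minor_faults();
    std::printf("minor faults during first touch: %ld\n", after - before);
    delete[] a;
    return 0;
}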
Now it may look weird: everything here is zero, so why does it matter how it became zero? Whether you memset(0), write a[i] = 0., or use std::vector::resize, everything causes explicit writes to memory, and hence a page fault if you do it on a zero page. I don't think you can/should prevent physical page allocation at that point. The only thing you could do for the memset/resize case is to use calloc to explicitly request zeroed memory, which could then be backed by a zero_page, but I doubt it is done (or would make a lot of sense). Remember that for new double[size]; or malloc there is no guarantee about what kind of memory you get, but that includes the possibility of zero-memory.
Note: remember that the double 0.0 has all bits set to zero.
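As a sketch of that calloc alternative (my illustration; whether a given allocator actually backs it with zero pages is up to the implementation):
#include <cstdlib>

int main(){
    const size_t size = 10'000'000;
    // calloc explicitly requests zeroed memory; the kernel may satisfy this
    // lazily (e.g. copy-on-write zero pages), since no explicit write happens here
    double *a = static_cast<double*>(std::calloc(size, sizeof(double)));
    // ... read a[i] as 0.0, or write to trigger per-page allocation ...
    std::free(a);
    return 0;
}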
In the end, the performance difference really comes only from the loop, but it is caused by the initialization. std::vector carries no overhead for the loop; in the benchmark code, the raw arrays merely benefit from an optimization that applies to the abnormal case of uninitialized data.
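A practical consequence, offered as a suggestion rather than as part of the original answer: to compare like with like in the benchmark, first-touch the raw arrays before starting the timer, for example:
// pay the page-fault cost outside the measured region
for (unsigned long int i = 0; i < SIZE; i++)
    a[i] = b[i] = c[i] = 0.0;

double start = omp_get_wtime();  // only now start timing the summation loop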