Problem description
I have a class like this:
//Array of Structures
class Unit
{
public:
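float v;
float u;
float i; // assumed to be members: update() uses i and t, but the snippet never declares them
float t;
//And similarly many other variables of float type, up to 10-12 of them.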
void update()
{
v+=u;
v=v*i*t;
//And many other equations
}
};
I create an array of objects of type Unit and call update on each of them.
int NUM_UNITS = 10000;
void ProcessUpdate()
{
Unit *units = new Unit[NUM_UNITS];
for(int i = 0; i < NUM_UNITS; i++)
{
units[i].update();
}
}
In order to speed things up, and possibly autovectorize the loop, I converted the AoS to a structure of arrays.
//Structure of Arrays:
class Unit
{
public:
Unit(int NUM_UNITS)
{
v = new float[NUM_UNITS];
u = new float[NUM_UNITS]; // u must be allocated as well, since update() reads it
}
float *v;
float *u;
//Many other variables
void update()
{
for(int i = 0; i < NUM_UNITS; i++)
{
v[i]+=u[i];
//Many other equations
}
}
};
When the loop fails to autovectorize, I get very bad performance from the structure of arrays. For 50 units, SoA's update is slightly faster than AoS. But from 100 units onwards, SoA is slower than AoS. At 300 units, SoA is almost twice as slow. At 100K units, SoA is 4x slower than AoS. While cache might be an issue for SoA, I didn't expect the performance difference to be this large. Profiling with cachegrind shows a similar number of misses for both approaches. The size of a Unit object is 48 bytes. L1 cache is 256K, L2 is 1MB and L3 is 8MB. What am I missing here? Is this really a cache issue?
Edit:
I am using gcc 4.5.2. Compiler options are -O3 -msse4 -ftree-vectorize.
I did another experiment with SoA. Instead of dynamically allocating the arrays, I allocated "v" and "u" at compile time. With 100K units, this is 10x faster than the SoA with dynamically allocated arrays. What's happening here? Why is there such a performance difference between statically and dynamically allocated memory?
Recommended answer
Structure of arrays is not cache friendly in this case.
You use both u and v together, but when they live in two different arrays they are not loaded together into one cache line, and the resulting cache misses carry a large performance penalty.
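To make the layout argument concrete, here is a minimal sketch of how far apart v and u of the same unit sit in each representation. The two-field UnitAoS, the UnitsSoA container and both helper functions are hypothetical names for illustration (the Unit in the question has 10-12 floats), and std::vector stands in for the raw new[] arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical two-field version of the question's Unit (AoS element).
struct UnitAoS {
    float v;
    float u;
};

// Hypothetical SoA container: one separately allocated array per field.
struct UnitsSoA {
    std::vector<float> v;
    std::vector<float> u;
    explicit UnitsSoA(std::size_t n) : v(n, 0.0f), u(n, 0.0f) {}
};

// AoS: byte distance between v and u inside one unit.
constexpr std::size_t aos_field_distance() {
    return offsetof(UnitAoS, u) - offsetof(UnitAoS, v);
}

// SoA: byte distance between v[i] and u[i], which live in different buffers.
std::size_t soa_field_distance(const UnitsSoA& s, std::size_t i) {
    assert(i < s.v.size() && i < s.u.size());
    const auto pv = reinterpret_cast<std::uintptr_t>(&s.v[i]);
    const auto pu = reinterpret_cast<std::uintptr_t>(&s.u[i]);
    return static_cast<std::size_t>(pv > pu ? pv - pu : pu - pv);
}
```

In the AoS layout the two fields are adjacent, so one cache line fetch typically brings in both. In the SoA layout the distance is at least the size of one whole array, so each iteration touches two separate memory streams.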
_mm_prefetch can be used to make the AoS representation even faster.
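A rough sketch of what that could look like on the AoS loop, assuming an x86 target with SSE (where _mm_prefetch and _MM_HINT_T0 come from <xmmintrin.h>). The two-field Unit, the update_all function and the prefetch distance kPrefetchAhead are illustrative assumptions, not part of the original code, and whether prefetching helps at all has to be measured on the actual hardware.

```cpp
#include <cassert>      // assert, for quick checks by the caller
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0 (x86 with SSE)

// Hypothetical two-field version of the question's Unit.
struct Unit {
    float v;
    float u;
};

// Assumed prefetch distance in elements; tune it by measurement.
constexpr std::size_t kPrefetchAhead = 8;

// Updates every unit, hinting the cache to fetch a unit a few iterations ahead.
void update_all(std::vector<Unit>& units) {
    const std::size_t n = units.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kPrefetchAhead < n) {
            // Only a hint: it requests the cache line but does not change results.
            _mm_prefetch(reinterpret_cast<const char*>(&units[i + kPrefetchAhead]),
                         _MM_HINT_T0);
        }
        units[i].v += units[i].u;
    }
}
```

For a simple sequential walk like this one, the hardware prefetcher often does the job already, so the explicit hint may make no measurable difference.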