问题描述
我在玩另一天,试图看看我可以优化一些东西多远。我决定从一个简单的地图开始,只是做一个线性搜索,以找到一个元素是否存在,然后尝试优化它的大部分。此外,为了比较,我做同样的std :: map和std :: vector使用std :: find。
I was playing the other day, trying to see how far could I optimize something. I decided to start from a simple map that just does a linear search to find if an element is there, and then try to optimize the most of it. Also, to compare, I do the same with a std::map and a std::vector using std::find.
地图的结果是预期的,比我的地图更慢的创建和破坏,但更快的速度(实际上,我已经无法测试它,它返回0 allways)。
问题是使用std :: vector。我预计它比我的实现慢,但不是,我真的不明白如何可以是相同或更快,因为我的实现跳过一个最坏的情况(值不在向量中),并且是使用结果缓存。
The results with the map are the expected ones, slower creation and destruction than my map, but much more speed( actually, I have been unable to mesure it, it returns 0 allways).The problem is with std::vector. I expected it to be slower than my implementation, but is not, and I really don't understand how can it be the same or faster, as my implementation is skipping a worst case( the value isn't in the vector) and is using a cache of results.
任何人都可以在这里露出一些光芒?我知道stl背后的家伙是半神,但仍然没有意义。
Can anyone shed some light here? I know that the guys behind stl are semi-gods, but still, this doesn't make sense.
基准测试结果(i3,Windows 8.1 Pro 64,Visual Studio 2013 ):
Benchmark results ( i3, Windows 8.1 Pro 64, Visual Studio 2013 ):
std::vector :
Build : 85.0042 ms
Loop : 37.0011 ms
Find : 1.82259 ms -> First : Found, Second : Found, Third : Not Found
Release : 0 ms
--------------------
std::map :
Build : 6929.41 ms
Loop : 570.032 ms
Find : 0 ms -> First : Found, Second : Found, Third : Not Found
Release : 1425.08
--------------------
Linear Map V0:
Build : 194.012 ms
Loop : 49.0052 ms
Find : 1.88915 ms -> First : Found, Second : Found, Third : Not Found
Release : 109.004
地图代码:
template<typename T>
class LinearMap0
{
public:
LinearMap0()
{
_end = _root = new Node;
_prebuffer = nullptr;
prebufferCapacity = 0;
_alive = true;
prebufferMarker = 0;
_cache = _mm_set1_epi32(-1);
for (auto& ptr : _cacheBuffer) ptr = nullptr;
MinID = INT32_MAX - 1;
MaxID = -1;
}
void PreAllocate(int Count)
{
prebufferCapacity = Count;
_prebuffer = new Node[Count];
}
~LinearMap0()
{
if (_alive)
{
Release();
}
}
void Release()
{
Node* marker = _end;
while (marker->Prev)
{
marker = marker->Prev;
if (!marker->Next->IsPreAllocated) delete marker->Next;
}
if (!_root->IsPreAllocated) delete _root;
delete[] _prebuffer;
_alive = false;
}
void AddElement(int ID,T element)
{
Node* tmp = nullptr;
if (prebufferMarker < prebufferCapacity)
{
// Use a pre-allocated object
tmp = &_prebuffer[prebufferMarker];
prebufferMarker++;
tmp->IsPreAllocated = true;
}
else
{
tmp = new Node;
}
tmp->ID = ID;
tmp->Data = element;
// Update list
_end->Next = tmp;
Node* prevEnd = _end;
_end = tmp;
_end->Prev = prevEnd;
bool isMin = ID < MinID; MinID = ID * isMin + (1 - isMin) * MinID;
bool isMax = ID > MaxID; MaxID = ID * isMax + (1 - isMax) * MaxID;
}
void DeleteLast()
{
Node* tmp = _end;
_end = _end->Prev;
_end->Next = nullptr;
delete tmp;
}
template<class Function>
void Loop(Function&& f, bool Forward = true)
{
if (Forward)
{
Node* marker = _root;
while (marker->Next)
{
marker = marker->Next;
f(marker->Data);
}
}
else
{
Node* marker = _end;
while (marker->Prev)
{
marker = marker->Prev;
f(marker->Data);
}
}
}
T* Find(int ID)
{
// Bounds check
if (ID < MinID || ID > MaxID) return nullptr;
// Check it it's in the cache
// Compare the value to every value in the cache
__m128i idxSSE = _mm_set1_epi32(ID);
__m128i C = _mm_cmpeq_epi32(_cache, idxSSE);
// To change form -1 to 1
C = _mm_mul_epi32(C, _mm_set1_epi32(-1));
// Now C holds 1 if true, or 0 if false (in each of its 4 members). It should only be ONE set at 1
__m128i tmp = _mm_set1_epi32(1);
__m128i S = _mm_sub_epi32(tmp, C);
// Now find the index
int i = S.m128i_i32[0] * (C.m128i_i32[1] + S.m128i_i32[1] * (2 * C.m128i_i32[2] + S.m128i_i32[2] * (3 * C.m128i_i32[3] + S.m128i_i32[3] * -1)));
if (i != -1)
return _cacheBuffer[i];
// Traverse the list
Node* marker0 = _root;
T* obj = nullptr;
while (true)
{
if (marker0->ID == ID)
{
obj = &marker0->Data;
}
if (marker0->Next) marker0 = marker0->Next; else break;
}
// Cache value and return
_cache.m128i_i32[cacheMarker] = ID;
_cacheBuffer[cacheMarker] = obj;
cacheMarker = (cacheMarker + 1) & 3; // x & 3 = x % 4
return obj;
}
private:
struct Node
{
Node()
{
Prev = nullptr;
Next = nullptr;
IsPreAllocated = false;
ID = -1;
}
T Data;
Node* Prev;
Node* Next;
bool IsPreAllocated;
int ID;
};
Node* _root;
Node* _end;
Node* _prebuffer;
int prebufferCapacity;
int prebufferMarker;
bool _alive;
__m128i _cache;
T* _cacheBuffer[4];
int cacheMarker;
int MinID, MaxID;
};
这里是基准:
// Initialize seeds
const __int64 ecount = 5 * 1000*1000;
vector<__int64> seed(ecount);
for (__int64 i = 0; i < ecount; i++)
{
seed[i] = i;
}
random_shuffle(seed.begin(), seed.end());
///////////// std::vector
vector<__int64> v;
cout << "--------------------" << endl;
cout << "std::vector :" << endl;
cout << " Build : " << time_call([&]()
{
v.resize(ecount/2);
for (__int64 i = 0; i < ecount; i++)
{
if (i < (ecount / 2))
v[i] = seed[i];
else
v.push_back(seed[i]);
}
}) << " ms" << endl;
cout << " Loop : " << time_call([&]()
{
for (auto& n : v)
n /= 2;
}) << " ms" << endl;
bool found1, found2, found3;
cout << " Find : " << (((float)time_call([&]()
{
for (int i = 0; i < 15; i++)
{
// Should exist
found1 = find(v.begin(), v.end(), seed[5] / 2) != v.end();//find(seed[5]) != m.end();
found2 = find(v.begin(), v.end(), seed[1000] / 2) != v.end();
// Shouldn't exist
found3 = find(v.begin(), v.end(), -1234) != v.end();
}
})) / 15.0) / 3.0;
cout << " ms " << " -> First : " << ((found1) ? "Found" : "Not Found") << ", Second : " << ((found2) ? "Found" : "Not Found") << ", Third : " << ((found3) ? "Found" : "Not Found") << endl;
cout << " Release : " << time_call([&]()
{
v.clear();
}) << " ms" << endl;
///////////// std::map
map<__int64, __int64> m;
cout << "--------------------" << endl;
cout << "std::map :" << endl;
cout << " Build : " << time_call([&]()
{
for (__int64 i = 0; i < ecount; i++)
{
m[seed[i]] = seed[i];
}
}) << " ms" << endl;
cout << " Loop : " << time_call([&]()
{
for (auto& n : m)
n.second /= 2;
}) << " ms" << endl;
cout << " Find : " << (((float)time_call([&]()
{
for (int i = 0; i < 15; i++)
{
// Should exist
found1 = m.find(seed[5]) != m.end();
found2 = m.find(seed[1000]) != m.end();
// Shouldn't exist
found3 = m.find(-1234) != m.end();
}
})) / 15.0) / 3.0;
cout << " ms " << " -> First : " << ((found1) ? "Found" : "Not Found") << ", Second : " << ((found2) ? "Found" : "Not Found") << ", Third : " << ((found3) ? "Found" : "Not Found") << endl;
cout << " Release : " << time_call([&]()
{
m.clear();
}) << endl;
///////////// Linear Map V0
LinearMap0<__int64> c;
cout << "--------------------" << endl;
cout << "Linear Map V0:" << endl;
cout << " Build : " << time_call([&]()
{
c.PreAllocate(ecount / 2);
for (__int64 i = 0; i < ecount; i++)
{
c.AddElement(seed[i],seed[i]);
}
}) << " ms" << endl;
cout << " Loop : " << time_call([&]()
{
c.Loop([](__int64& Data)
{
Data /= 2;
});
}) << " ms" << endl;
cout << " Find : " << (((float)time_call([&]()
{
for (int i = 0; i < 15; i++)
{
// Should exist
found1 = c.Find(seed[5]);
found2 = c.Find(seed[1000]);
// Shouldn't exist
found3 = c.Find(-1234);
}
})) / 15.0) / 3.0;
cout << " ms -> First : " << ((found1) ? "Found" : "Not Found") << ", Second : " << ((found2) ? "Found" : "Not Found") << ", Third : " << ((found3) ? "Found" : "Not Found") << endl;
cout << " Release : " << time_call([&]()
{
c.Release();
}) << endl;
EDIT:time_call is:
time_call is:
template <class Function>
double time_call(Function&& f)
{
chrono::time_point<chrono::high_resolution_clock> start, end;
start = chrono::high_resolution_clock::now();
f();
end = chrono::high_resolution_clock::now();
return ((double)(chrono::duration_cast<chrono::nanoseconds>(end - start).count())) / 1000000.0;
}
推荐答案
,而 std :: vector
是一个动态大小的数组。
Your container is a linked list, whereas std::vector
is a dynamically-sized array.
链表方法有好处,例如
但是数组方法有一些显着的优点:
However the array approach has some significant advantages:
- 线性搜索只是扫描内存,这正是缓存和预提取程序的构建。对链表的扫描效率较低,因为每次跳转到未缓存的内存意味着昂贵的缓存未命中。
- 线性阵列扫描很容易矢量化。如果使用
-O3
编译,那么编译器可能使用向量化版本的std :: find
。由于内存依赖关系,无法将链接列表扫描向量化。 - 使用的内存量。您的链表必须维护一个
下一个
指针,有效地使您的元素更大。此外,每个非预分配的Node
必须支付分配器的开销(即new
和delete
)。这意味着您可以更快地达到内存带宽限制,并且可以在缓存中容纳更少的元素。
- a linear search simply scans memory, which is exactly what caches and pre-fetchers are built for. A scan of a linked list will be less efficient because each jump into uncached memory means an expensive cache miss.
- a linear array scan is easy to vectorize. If you compile with
-O3
then the compiler will likely use a vectorized version ofstd::find
. It's impossible to vectorize a linked list scan due to memory dependencies. - amount of memory used. Your linked list has to maintain a
next
pointer which effectively makes your elements larger. Also, each non-preallocatedNode
has to pay the overhead of the allocator (i.e. accounting data fornew
anddelete
). That means you're hitting memory bandwidth limits sooner, and you can fit fewer elements in cache.
这篇关于为什么std :: vector这么快(或者是我的实现太慢)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!