Problem description
I have a 123 MB int array, and it is basically used like this:
private static int[] data = new int[32487834];

static int eval(int[] c)
{
    int p = data[c[0]];
    p = data[p + c[1]];
    p = data[p + c[2]];
    p = data[p + c[3]];
    p = data[p + c[4]];
    p = data[p + c[5]];
    return data[p + c[6]];
}
eval() is called a lot (~50B times) with different c, and I would like to know if (and how) I could speed it up.
I already use an unsafe function with a fixed array, and the work is spread across all CPU cores. It's a C# port of the TwoPlusTwo 7 card evaluator by RayW. The C++ version is only marginally faster.
Can the GPU be used to speed this up?
Recommended answer
- Cache the array reference in a local variable. Static field accesses are generally slower than local accesses for several reasons; one is that a field can change, so it has to be reloaded every time, whereas the JIT can optimize locals much more freely.
- Don't use an array as the argument to the method. Hard-code 7 integer indices. That avoids allocating an array per call, removes one level of indirection, and removes the bounds checks on c.
- Use unsafe code to index into the array. This will eliminate bounds checking. Use a GCHandle to pin the array and cache the pointer in a static field (don't just use a fixed block; I believe entering it has a certain small overhead, though I'm not sure).
- As an alternative to pinning the array, allocate the 123 MB array using VirtualAlloc and use large pages. That cuts down on TLB misses.
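The first three suggestions can be combined roughly as follows. This is a minimal sketch, not the original poster's code: the class name PinnedTable and the methods Init and Release are hypothetical, and it assumes the project is compiled with unsafe code enabled (AllowUnsafeBlocks). Because pointer indexing skips bounds checks, an out-of-range index silently reads arbitrary memory. Whether the cached GCHandle pointer actually beats a fixed block is, as the answer itself says, uncertain, so measure both.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper; names are illustrative and not from the original post.
public static unsafe class PinnedTable
{
    private static GCHandle handle; // keeps the array pinned so the GC cannot move it
    private static int* table;      // cached raw pointer to element 0

    // Pin the lookup table once, before the first Eval call.
    public static void Init(int[] data)
    {
        if (handle.IsAllocated) handle.Free();
        handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        table = (int*)handle.AddrOfPinnedObject();
    }

    // Seven hard-coded indices instead of an int[] argument.
    // No bounds checks: the caller must guarantee every index is valid.
    public static int Eval(int c0, int c1, int c2, int c3, int c4, int c5, int c6)
    {
        int* d = table; // read the static field into a local once
        int p = d[c0];
        p = d[p + c1];
        p = d[p + c2];
        p = d[p + c3];
        p = d[p + c4];
        p = d[p + c5];
        return d[p + c6];
    }

    // Unpin the array when the table is no longer needed.
    public static void Release()
    {
        if (handle.IsAllocated) handle.Free();
        table = null;
    }
}
```

A caller would run PinnedTable.Init(data) once at startup, call PinnedTable.Eval(...) in the hot loop, and call PinnedTable.Release() at shutdown. The array stays pinned for as long as the handle is allocated.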
All of these are hardcore low-level optimizations. They only apply if you need maximum performance.
I think we are pretty much at the limit when it comes to optimizing this function in isolation. We can probably only do better if you show the caller of the function, so that the two can be optimized as a single unit.
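As an illustration of what optimizing caller and callee together could look like: the question does not show the caller, so the sketch below assumes (purely as an assumption) that it enumerates every 7-card combination in nested loops. In that case the partial lookup for the first k cards can be carried from one loop level to the next, so the innermost loop does one table read per hand instead of seven. The class HandEnumerator and the method SumAllHands are hypothetical names, and summing the results is only a stand-in for whatever the real caller does with each value.

```csharp
using System;

// Hypothetical caller; the original post does not show the real one.
public static class HandEnumerator
{
    // Sums the final lookup over every combination first <= c0 < c1 < ... < c6 <= last.
    // Each loop level reuses the partial lookup computed by the level above it.
    public static long SumAllHands(int[] data, int first, int last)
    {
        int[] d = data; // local copy of the array reference
        long sum = 0;
        for (int c0 = first; c0 <= last - 6; c0++)
        {
            int p0 = d[c0];
            for (int c1 = c0 + 1; c1 <= last - 5; c1++)
            {
                int p1 = d[p0 + c1];
                for (int c2 = c1 + 1; c2 <= last - 4; c2++)
                {
                    int p2 = d[p1 + c2];
                    for (int c3 = c2 + 1; c3 <= last - 3; c3++)
                    {
                        int p3 = d[p2 + c3];
                        for (int c4 = c3 + 1; c4 <= last - 2; c4++)
                        {
                            int p4 = d[p3 + c4];
                            for (int c5 = c4 + 1; c5 <= last - 1; c5++)
                            {
                                int p5 = d[p4 + c5];
                                // Only one lookup per hand remains in the hot loop.
                                for (int c6 = c5 + 1; c6 <= last; c6++)
                                    sum += d[p5 + c6];
                            }
                        }
                    }
                }
            }
        }
        return sum;
    }
}
```

This only helps when hands are enumerated in an order that shares prefixes. If the caller evaluates random or unrelated hands, there is nothing to reuse and the per-call optimizations are what is left.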