Question
Recently, I answered a question about optimizing a likely parallelizable method for generating every permutation of arbitrary base numbers. I posted an answer similar to the "Parallelized, poor implementation" code block listed below, and someone almost immediately pointed this out:
"This is pretty much guaranteed to give you false sharing and will probably be many times slower." (credit)
and they were right, it was deathly slow. That said, I researched the topic and found some interesting material and suggestions for combating it. If I understand it correctly, when threads access contiguous memory (in, say, the array that's likely backing that ConcurrentStack), false sharing likely occurs.
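To make the idea of false sharing concrete, here is a minimal sketch, not taken from the original post. It assumes a cache line is at most 128 bytes, and the type and method names (PackedCounter, PaddedCounter, FalseSharingSketch) are hypothetical. Each worker only ever touches its own array slot, so the two methods are logically identical; the only difference is whether neighbouring slots can land on the same cache line.

```csharp
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Hypothetical illustration types; none of these names come from the original post.

// 8 bytes per element: several neighbouring elements fit on one cache line.
struct PackedCounter
{
    public long Value;
}

// Padded to 128 bytes (an assumed upper bound on the cache-line size), so the
// Value fields of neighbouring array elements never share a cache line.
[StructLayout(LayoutKind.Explicit, Size = 128)]
struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

static class FalseSharingSketch
{
    // Every worker increments only its own slot, but adjacent slots may
    // still compete for the same cache line.
    public static long[] RunPacked(int workers, int iterations)
    {
        var counters = new PackedCounter[workers];
        Parallel.For(0, workers, w =>
        {
            for (int i = 0; i < iterations; i++) counters[w].Value++;
        });

        var results = new long[workers];
        for (int w = 0; w < workers; w++) results[w] = counters[w].Value;
        return results;
    }

    // Same work, but each slot is padded onto its own cache line.
    public static long[] RunPadded(int workers, int iterations)
    {
        var counters = new PaddedCounter[workers];
        Parallel.For(0, workers, w =>
        {
            for (int i = 0; i < iterations; i++) counters[w].Value++;
        });

        var results = new long[workers];
        for (int w = 0; w < workers; w++) results[w] = counters[w].Value;
        return results;
    }
}
```

Both methods return the same counts. Whether the padded version is measurably faster depends on the hardware and the runtime, so timing each call with a Stopwatch on the target machine is the only way to know.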
For the code below, a Bytes is:
struct Bytes {
public byte A; public byte B; public byte C; public byte D;
public byte E; public byte F; public byte G; public byte H;
}
For my own testing, I wanted to get a parallel version of this running that was genuinely faster, so I created a simple example based on the original code. Using 6 as limits[0] was a lazy choice on my part - my computer has 6 cores.
Single thread block. Average run time: 10s0059ms
var data = new List<Bytes>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };
for (byte a = 0; a < limits[0]; a++)
for (byte b = 0; b < limits[1]; b++)
for (byte c = 0; c < limits[2]; c++)
for (byte d = 0; d < limits[3]; d++)
for (byte e = 0; e < limits[4]; e++)
for (byte f = 0; f < limits[5]; f++)
for (byte g = 0; g < limits[6]; g++)
for (byte h = 0; h < limits[7]; h++)
data.Add(new Bytes {
A = a, B = b, C = c, D = d,
E = e, F = f, G = g, H = h
});
Parallelized, poor implementation. Average run time: 81s729ms, ~8700 contentions
var data = new ConcurrentStack<Bytes>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };
Parallel.For(0, limits[0], (a) => {
for (byte b = 0; b < limits[1]; b++)
for (byte c = 0; c < limits[2]; c++)
for (byte d = 0; d < limits[3]; d++)
for (byte e = 0; e < limits[4]; e++)
for (byte f = 0; f < limits[5]; f++)
for (byte g = 0; g < limits[6]; g++)
for (byte h = 0; h < limits[7]; h++)
data.Push(new Bytes {
A = (byte)a,B = b,C = c,D = d,
E = e,F = f,G = g,H = h
});
});
Parallelized, ? implementation. Average run time: 5s833ms, 92 contentions
var data = new ConcurrentStack<List<Bytes>>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };
Parallel.For (0, limits[0], () => new List<Bytes>(),
(a, loop, localList) => {
for (byte b = 0; b < limits[1]; b++)
for (byte c = 0; c < limits[2]; c++)
for (byte d = 0; d < limits[3]; d++)
for (byte e = 0; e < limits[4]; e++)
for (byte f = 0; f < limits[5]; f++)
for (byte g = 0; g < limits[6]; g++)
for (byte h = 0; h < limits[7]; h++)
localList.Add(new Bytes {
A = (byte)a, B = b, C = c, D = d,
E = e, F = f, G = g, H = h
});
return localList;
}, x => {
data.Push(x);
});
I'm glad that I got an implementation that is faster than the single-threaded version. I expected a result closer to around 10s / 6, or around 1.6 seconds, but that's probably a naive expectation.
My question is: for the parallelized implementation that is actually faster than the single-threaded version, are there further optimizations that could be applied to the operation? I'm wondering about optimizations related to parallelization, not improvements to the algorithm used to compute the values. Specifically:
- I know about the optimization to store and populate as a struct instead of byte[], but it's not related to parallelization (or is it?)
- I know that a desired value could be lazily evaluated with a ripple-carry adder, but the same applies as for the struct optimization.
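For readers unfamiliar with the ripple-carry idea mentioned in the list above, here is a minimal sketch of one way it could look. It is not from the original post: the RippleCarrySketch class and its methods are hypothetical, and it uses a byte[] rather than the Bytes struct to keep the carry loop short. Index 0 is treated as the most significant digit so the order matches the nested loops (a outermost, h innermost).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; these names are not from the original post.
static class RippleCarrySketch
{
    // Advances digits to the next combination in place. The last index is the
    // fastest-changing digit. Returns false once every digit has wrapped to 0.
    public static bool Increment(byte[] digits, byte[] limits)
    {
        for (int i = digits.Length - 1; i >= 0; i--)
        {
            if (++digits[i] < limits[i]) return true; // no carry, done
            digits[i] = 0;                            // wrap and carry left
        }
        return false; // carried out of the most significant digit
    }

    // Lazily yields every combination. Each result is a fresh copy, so callers
    // may keep it around without seeing later increments.
    public static IEnumerable<byte[]> Enumerate(byte[] limits)
    {
        // A limit of 0 means that digit has no valid value at all.
        if (Array.IndexOf(limits, (byte)0) >= 0) yield break;

        var digits = new byte[limits.Length];
        do
        {
            yield return (byte[])digits.Clone();
        } while (Increment(digits, limits));
    }
}
```

Calling RippleCarrySketch.Enumerate(new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 }) would walk the same values as the nested loops without materializing them all up front. As the question notes, this is an algorithmic change rather than a parallelization technique.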
Recommended Answer
First off, my initial assumption regarding Parallel.For() and Parallel.ForEach() was wrong.
The poor parallel implementation very likely has 6 threads all attempting to write to a single ConcurrentStack at once. The good implementation using thread locals (explained more below) only accesses the shared variable once per task, nearly eliminating any contention.
When using Parallel.For() and Parallel.ForEach(), you cannot simply replace a for or foreach loop with them in-line. That's not to say a blind swap could never be an improvement, but without examining the problem and instrumenting it, using them amounts to throwing multithreading at a problem because it might make it faster.
Parallel.For() and Parallel.ForEach() have overloads that allow you to create a local state for the Task they ultimately create, and to run an expression before and after each task's execution.
If you have an operation you parallelize with Parallel.For() or Parallel.ForEach(), it's likely a good idea to use this overload:
public static ParallelLoopResult For<TLocal>(
int fromInclusive,
int toExclusive,
Func<TLocal> localInit,
Func<int, ParallelLoopState, TLocal, TLocal> body,
Action<TLocal> localFinally
)
For example, calling For() to sum all integers from 1 to 100:
var total = 0;
Parallel.For(0, 101, () => 0, // <-- localInit
(i, state, localTotal) => { // <-- body
localTotal += i;
return localTotal;
}, localTotal => { // <-- localFinally
Interlocked.Add(ref total, localTotal);
});
Console.WriteLine(total);
localInit should be a lambda that initializes the local state type, which is passed to the body and localFinally lambdas. Please note I am not recommending summing 1 to 100 using parallelization; it is just a simple case that keeps the example short.
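The same local-state pattern is available on Parallel.ForEach(). The following is a minimal sketch of the equivalent sum over an arbitrary sequence; the ForEachLocalStateSketch class and its Sum method are hypothetical names used only for illustration. It relies on the ForEach<TSource, TLocal> overload that takes localInit, body, and localFinally delegates.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration; these names are not from the original answer.
static class ForEachLocalStateSketch
{
    public static long Sum(IEnumerable<int> source)
    {
        long total = 0;
        Parallel.ForEach(
            source,
            () => 0L,                                     // localInit: once per task
            (item, state, localTotal) => localTotal + item, // body: no shared writes
            localTotal => Interlocked.Add(ref total, localTotal)); // localFinally: once per task
        return total;
    }
}
```

The shared total is only touched in localFinally, so the number of contended writes scales with the number of tasks rather than the number of items, which is the same reason the list-per-task version in the question sees so few contentions.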