问题描述
编辑:maaartinus给出了我正在寻找的答案,而tmyklebu关于这个问题的数据帮助了很多,所以谢谢两者! :)我已经读过一些关于HotSpot如何注入代码的内在函数,特别是Java标准数学库()
所以我决定尝试一下,看看如何很多不同HotSpot可以直接进行比较(特别是因为我听说min / max可以编译为无分支asm)。
public static final int max(final int a,final int b)
{
if(a> b)
{
return a;
}
返回b;
}
这是我的实施。从另一个SO问题我已经读过,使用三元运算符使用额外的寄存器,我没有发现在执行if块和使用三元运算符之间存在显着差异(即返回(a> b)?a:b)。
分配一个8Mb的int数组(即200万个值)并随机化它,我做了以下测试:
尝试(最终Benchmark bench = new Benchmark(millis to max))
{
int max = Integer.MIN_VALUE;
for(int i = 0; i< array.length; ++ i)
{
max = OpsMath.max(max,array [i]);
// max = Math.max(max,array [i]);
}
}
我正在尝试使用Benchmark对象与资源块。完成后,它会调用对象上的close()并打印该块完成所需的时间。通过在上面的代码中注释/输出最大调用来单独完成测试。
'max'被添加到基准块之外的列表中并稍后打印,所以避免JVM优化整个块。
每次测试运行时,数组都是随机的。
运行测试6次,它给出了以下结果:
Java标准数学:
毫升至最大9.242167
毫升至最大2.1566199999999998
毫升至最大2.046396
毫升至最大2.048616
毫升至最大2.035761
毫当至最大2.001044
第一次运行后相当稳定,再次运行测试会得到类似的结果。
OpsMath:
millis到最大值8.65418
millis到最大值1.161559
millis到最大0.955851
millis到最大0.946642
millis到最大0.994543
millis到最大0.9469069999999999
A第一次运行后获得非常稳定的结果。
问题是:为什么?那里有很大的不同。我不明白为什么。即使我实现我的max()方法完全像Math.max()(即返回(a> = b)?a:b)我仍然会得到更好的结果!这没有任何意义。
规格:
CPU:Intel i5 2500,3,3Ghz。
Java版本:JDK 8(公开行军18版),x64。
Debian Jessie(测试版)x64。
我还没试过32位JVM。
//恒定纳米到毫秒。
最后双倍TO_MILLIS = 1.0d / 1000000.0d;
// 8Mb alloc。
final int [] array = new int [(8 * 1024 * 1024)/ 4];
//结果和时间数组。
final ArrayList< Integer> results = new ArrayList<>();
final ArrayList< Double> times = new ArrayList<>();
//测试次数。
final int itcount = 6;
//调用Math和OpsMath方法,以便JVM初始化类。
System.out.println(initialize classes+
OpsMath.max(Math.max(20.0f,array.length),array.length / 2.0f));
final Random r = new Random();
for(int it = 0; it< itcount; ++ it)
{
int max = Integer.MIN_VALUE;
//随机化数组。
for(int i = 0; i< array.length; ++ i)
{
array [i] = r.nextInt();
}
final long start = System.nanoTime();
for(int i = 0; i< array.length; ++ i)
{
max = Math.max(array [i],max);
//如上所述实现的OpsMath.max()方法。
// max = OpsMath.max(array [i],max);
}
//计算时间。
final double end =(System.nanoTime() - start);
//存储结果。
times.add(Double.valueOf(end));
results.add(Integer.valueOf(max));
}
//打印所有内容。
for(int i = 0; i< itcount; ++ i)
{
System.out.println(IT+ i +result:+ results.get(一世 ) );
System.out.println(IT+ i +millis:+ times.get(i)* TO_MILLIS);
}
Java Math.max结果:
IT0结果:2147477409
IT0毫秒:9.636998
IT1结果:2147483098
IT1毫秒:1.901314
IT2结果:2147482877
IT2毫秒:2.095551
IT3结果:2147483286
IT3毫秒:1.9232859999999998
IT4结果:2147482828
IT4毫秒:1.9455179999999999
IT5结果:2147482475
IT5毫秒:1.882047
OpsMath.max结果:
IT0结果:2147482689
IT0毫秒:9.003616
IT1结果:2147483480
IT1毫秒:0.882421
IT2结果:2147483186
IT2 millis:1.079143
IT3结果:2147478560
IT3 millis:0.8861169999999999
IT4结果:2147477851
IT4 millis:0.916383
IT5结果:2147481983
IT5 millis:0.873984
总体结果仍然相同。我试过将数组随机化一次,并在同一个数组上重复测试,总体上得到更快的结果,但是Java Math.max和OpsMath.max之间的差异相同。
很难说为什么 Math.max
慢于 Ops.max
,但很容易说出为什么这个基准强烈支持对条件移动进行分支:在 n
-th迭代中,概率为
Math.max(array [i],max);
不等于 max
是概率那个数组[n-1]
比以前的所有元素都大。显然,随着 n
的增长,这种可能性越来越低,而且
final int [] array = new int [(8 * 1024 * 1024)/ 4];
大部分时间它都可以忽略不计。条件移动指令对分支概率不敏感,它总是花费相同的时间来执行。条件移动指令比分支预测更快,如果分支很难预测。另一方面,如果可以高概率地预测分支,则分支预测更快。目前,我不确定条件移动的速度与分支的最佳和最差情况相比。
在你的情况下,除了第一个很少有分支是可以预测的。从大约 n == 10
开始,使用条件移动是没有意义的,因为分支可以保证正确预测并且可以与其他指令并行执行(我猜你每次迭代只需要一个循环。)
这似乎发生在算法计算最小值/最大值或进行一些低效排序(良好的分支可预测性意味着每步的低熵)。 / p>
条件移动和预测分支都需要一个周期。前者的问题是它需要两个操作数,这需要额外的指令。最后,当分支单元空闲时,关键路径可能变得更长和/或ALU饱和。通常,但并非总是如此,在实际应用中可以很好地预测分支;这就是为什么分支预测首先被发明的原因。
关于时间条件移动与分支预测最佳和最差情况的血腥细节,请参阅下面的评论中的讨论。我的表明,当分支预测遇到最坏情况时,条件移动明显快于分支预测,但我不能忽视 。我们需要一些解释才能确切地发挥作用。一些更多的基准和/或分析可能有所帮助。
EDIT: maaartinus gave the answer I was looking for and tmyklebu's data on the problem helped a lot, so thanks both! :)
I've read a bit about how HotSpot has some "intrinsics" that injects in the code, specially for Java standard Math libs (from here)
So I decided to give it a try, to see how much difference HotSpot could make against doing the comparison directly (specially since I've heard min/max can compile to branchless asm).
public static final int max ( final int a, final int b )
{
if ( a > b )
{
return a;
}
return b;
}
That's my implementation. From another SO question I've read that using the ternary operator uses an extra register, I haven't found significant differences between doing an if block and using a ternary operator (ie, return ( a > b ) ? a : b ).
Allocating a 8Mb int array (ie, 2 million values), and randomizing it, I do the following test:
try ( final Benchmark bench = new Benchmark( "millis to max" ) )
{
int max = Integer.MIN_VALUE;
for ( int i = 0; i < array.length; ++i )
{
max = OpsMath.max( max, array[i] );
// max = Math.max( max, array[i] );
}
}
I'm using a Benchmark object in a try-with-resources block. When it finishes, it calls close() on the object and prints the time the block took to complete. The tests are done separately by commenting in/out the max calls in the code above.
'max' is added to a list outside the benchmark block and printed later, so to avoid the JVM optimizing the whole block away.
The array is randomized each time the test runs.
Running the test 6 times, it gives these results:
Java standard Math:
millis to max 9.242167
millis to max 2.1566199999999998
millis to max 2.046396
millis to max 2.048616
millis to max 2.035761
millis to max 2.001044
So fairly stable after the first run, and running the tests again gives similar results.
OpsMath:
millis to max 8.65418
millis to max 1.161559
millis to max 0.955851
millis to max 0.946642
millis to max 0.994543
millis to max 0.9469069999999999
Again, very stable results after the first run.
The question is: Why? Thats quite a big difference there. And I have no idea why. Even if I implement my max() method exactly like Math.max() (ie, return (a >= b) ? a : b ) I still get better results! It makes no sense.
Specs:
CPU: Intel i5 2500, 3,3Ghz.Java Version: JDK 8 (public march 18 release), x64.Debian Jessie (testing release) x64.
I have yet to try with 32 bit JVM.
EDIT: Self contained test as requested. Added a line to force the JVM to preload Math and OpsMath classes. That eliminates the 18ms cost of the first iteration for OpsMath test.
// Constant nano to millis.
final double TO_MILLIS = 1.0d / 1000000.0d;
// 8Mb alloc.
final int[] array = new int[(8*1024*1024)/4];
// Result and time array.
final ArrayList<Integer> results = new ArrayList<>();
final ArrayList<Double> times = new ArrayList<>();
// Number of tests.
final int itcount = 6;
// Call both Math and OpsMath method so JVM initializes the classes.
System.out.println("initialize classes " +
OpsMath.max( Math.max( 20.0f, array.length ), array.length / 2.0f ));
final Random r = new Random();
for ( int it = 0; it < itcount; ++it )
{
int max = Integer.MIN_VALUE;
// Randomize the array.
for ( int i = 0; i < array.length; ++i )
{
array[i] = r.nextInt();
}
final long start = System.nanoTime();
for ( int i = 0; i < array.length; ++i )
{
max = Math.max( array[i], max );
// OpsMath.max() method implemented as described.
// max = OpsMath.max( array[i], max );
}
// Calc time.
final double end = (System.nanoTime() - start);
// Store results.
times.add( Double.valueOf( end ) );
results.add( Integer.valueOf( max ) );
}
// Print everything.
for ( int i = 0; i < itcount; ++i )
{
System.out.println( "IT" + i + " result: " + results.get( i ) );
System.out.println( "IT" + i + " millis: " + times.get( i ) * TO_MILLIS );
}
Java Math.max result:
IT0 result: 2147477409
IT0 millis: 9.636998
IT1 result: 2147483098
IT1 millis: 1.901314
IT2 result: 2147482877
IT2 millis: 2.095551
IT3 result: 2147483286
IT3 millis: 1.9232859999999998
IT4 result: 2147482828
IT4 millis: 1.9455179999999999
IT5 result: 2147482475
IT5 millis: 1.882047
OpsMath.max result:
IT0 result: 2147482689
IT0 millis: 9.003616
IT1 result: 2147483480
IT1 millis: 0.882421
IT2 result: 2147483186
IT2 millis: 1.079143
IT3 result: 2147478560
IT3 millis: 0.8861169999999999
IT4 result: 2147477851
IT4 millis: 0.916383
IT5 result: 2147481983
IT5 millis: 0.873984
Still the same overall results. I've tried with randomizing the array only once, and repeating the tests over the same array, I get faster results overall, but the same 2x difference between Java Math.max and OpsMath.max.
It's hard to tell why Math.max
is slower than a Ops.max
, but it's easy to tell why this benchmark strongly favors branching to conditional moves: On the n
-th iteration, the probability of
Math.max( array[i], max );
being not equal to max
is the probability that array[n-1]
is bigger than all previous elements. Obviously, this probability gets lower and lower with growing n
and given
final int[] array = new int[(8*1024*1024)/4];
it's rather negligible most of the time. The conditional move instruction is insensitive to the branching probability, it always take the same amount of time to execute. The conditional move instruction is faster than branch prediction if the branch is very hard to predict. On the other hand, branch prediction is faster if the branch can be predicted well with high probability. Currently, I'm unsure about the speed of conditional move compared to best and worst case of branching.
In your case all but first few branches are fairly predictable. From about n == 10
onward, there's no point in using conditional moves as the branch is rather guaranteed to be predicted correctly and can execute in parallel with other instructions (I guess you need exactly one cycle per iteration).
This seems to happen for algorithms computing minimum/maximum or doing some inefficient sorting (good branch predictability means low entropy per step).
Both conditional move and predicted branch take one cycle. The problem with the former is that it needs its two operands and this takes additional instruction. In the end the critical path may get longer and/or the ALUs saturated while the branching unit is idle. Often, but not always, branches can be predicted well in practical applications; that's why branch prediction was invented in the first place.
As for the gory details of timing conditional move vs. branch prediction best and worst case, see the discussion below in comments. My my own benchmark shows that conditional move is significantly faster than branch prediction when branch prediction encounters its worst case, but I can't ignore contradictory results. We need some explanation for what exactly makes the difference. Some more benchmarks and/or analysis could help.
这篇关于Java Math.min / max性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!