为什么在繁忙的旋转暂停后，JVM对于相同的代码块显示更多的延迟?

本文介绍了为什么在繁忙的旋转暂停后，JVM对于相同的代码块显示更多的延迟?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

下面的代码明确地演示了该问题，即:

The code below demonstrates the problem unequivocally, which is:

请注意，我当然不使用Thread.sleep.还要注意，因为我使用数学运算而不是IF来更改暂停，所以没有条件导致HotSpot/JIT取消优化.

Note that of course I'm not using Thread.sleep. Also note that there are no conditionals leading to a HotSpot/JIT de-optimization as I'm changing the pause using a math operation, not an IF.

我想定时进行一些数学运算.
首先，我在开始测量之前将块暂停1纳秒.我做了20,000次.
然后，我将暂停时间从1纳秒更改为5秒，然后像往常一样测量延迟.我做了15次.
然后我打印最近的30个测量值，因此您可以看到15个测量值(暂停时间为1纳秒)和15个测量值(暂停时间为5秒).

如下所示，差异很大，尤其是在暂停更改后的第一次测量中. 为什么?！

As you can see below, the discrepancy is big, especially in the very first measurement after the pause change. Why is that!?

$ java -server -cp . JvmPauseLatency
Sat Apr 29 10:34:28 EDT 2017 => Please wait 75 seconds for the results...
Sat Apr 29 10:35:43 EDT 2017 => Calculation: 4.0042328611017236E11
Results:
215
214
215
214
215
214
217
215
216
214
216
213
215
214
215
2343 <----- FIRST MEASUREMENT AFTER PAUSE CHANGE
795
727
942
778
765
856
762
801
708
692
765
776
780
754

代码:

import java.util.Arrays;
import java.util.Date;
import java.util.Random;

public class JvmPauseLatency {

    private static final int WARMUP = 20000;
    private static final int EXTRA = 15;
    private static final long PAUSE = 5 * 1000000000L; // in nanos

    private final Random rand = new Random();
    private int count;
    private double calculation;
    private final long[] results = new long[WARMUP + EXTRA];
    private long interval = 1; // in nanos

    private long busyPause(long pauseInNanos) {
        final long start = System.nanoTime();
        long until = Long.MAX_VALUE;
        while(System.nanoTime() < until) {
           until = start + pauseInNanos;
        }
        return until;
    }

    public void run() {

        long testDuration = ((WARMUP * 1) + (EXTRA * PAUSE)) / 1000000000L;
        System.out.println(new Date() +" => Please wait " + testDuration + " seconds for the results...");

        while(count < results.length) {

            double x = busyPause(interval);

            long latency = System.nanoTime();

            calculation += x / (rand.nextInt(5) + 1);
            calculation -= calculation / (rand.nextInt(5) + 1);
            calculation -= x / (rand.nextInt(6) + 1);
            calculation += calculation / (rand.nextInt(6) + 1);

            latency = System.nanoTime() - latency;

            results[count++] = latency;
            interval = (count / WARMUP * (PAUSE - 1)) + 1; // it will change to PAUSE when it reaches WARMUP
        }

        // now print the last (EXTRA * 2) results so you can compare before and after the pause change (from 1 to PAUSE)
        System.out.println(new Date() + " => Calculation: " + calculation);
        System.out.println("Results:");
        long[] array = Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length);
        for(long t: array) System.out.println(t);
    }

    public static void main(String[] args) {
        new JvmPauseLatency().run();
    }
}

TL; DR

http://www.brendangregg.com/activebenchmarking.html

问题N1.暂停更改后的第一个测量.

您似乎遇到了堆栈替换.发生OSR时，VM暂停，目标功能的堆栈帧被等效的帧替换.

Problem N1. The very first measurement after the pause change.

It looks like you are faced with on-stack replacement. When OSR occurs, the VM is paused, and the stack frame for the target function is replaced by an equivalent frame.

根本原因是微基准测试错误-未正确预热.只需在while循环之前将以下行插入您的基准测试即可对其进行修复:

The root case is wrong microbenchmark - it was not properly warmed up. Just insert the following line into your benchmark before while loop in order to fix it:

System.out.println("WARMUP = " + busyPause(5000000000L));

如何检查-只需使用-XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls运行基准测试.我已经修改了您的代码-现在，它将在每次调用之前将间隔打印到系统输出中:

How to check this - just run your benchmark with -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls. I've modified your code - now it prints interval into system output before every call:

interval = 1
interval = 1
interval = 5000000000
    689  145       4       JvmPauseLatency::busyPause (19 bytes)   made not entrant
    689  146       3       JvmPauseLatency::busyPause (19 bytes)
Installing method (3) JvmPauseLatency.busyPause(J)J
    698  147 %     4       JvmPauseLatency::busyPause @ 6 (19 bytes)
Installing osr method (4) JvmPauseLatency.busyPause(J)J @ 6
    702  148       4       JvmPauseLatency::busyPause (19 bytes)
    705  146       3       JvmPauseLatency::busyPause (19 bytes)   made not entrant
Installing method (4) JvmPauseLatency.busyPause(J)J
interval = 5000000000
interval = 5000000000
interval = 5000000000
interval = 5000000000

通常OSR发生在第4层，因此要禁用它，可以使用以下选项:

Usually OSR occurs at tier 4 so in order to disable it you can use the following options:

-XX:-TieredCompilation禁用分层编译
-XX:-TieredCompilation -XX:TieredStopAtLevel=3禁用4级分层编译
-XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement禁用OSR

-XX:-TieredCompilation disable tiered compilation
-XX:-TieredCompilation -XX:TieredStopAtLevel=3 disable tiered compilation to 4 level
-XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement disable OSR

让我们从文章> https://shipilev.net/blog/2014/nanotrusting-nanotime开始.简而言之:

Let's start from the article https://shipilev.net/blog/2014/nanotrusting-nanotime. In few words:

JIT只能编译方法-在您的测试中，您只有一个循环，因此只有OSR可用于您的测试
您正在尝试测量可能小于nanoTime()调用的较小内容(请参见易失性写入的成本是多少?)
微体系结构级别–缓存，CPU管道停顿很重要，例如，TLB丢失或分支错误预测花费的时间比测试执行时间要长

JIT can compile only method - in your test you have one loop, so only OSR is available for your test
you are trying to measure something small, maybe smaller than nanoTime() call(see What is the cost of volatile write?)
microarchitecture level – caches, CPU pipeline stalls are important, for example, TLB miss or branch misprediction take more time than the test execution time

因此，为了避免所有这些陷阱，您可以使用基于JMH的基准测试，如下所示:

So in order to avoid all these pitfalls you can use JMH based benchmark like this:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
public class LatencyTest {

    public static final long LONG_PAUSE = 5000L;
    public static final long SHORT_PAUSE = 1L;
    public Random rand;

    @Setup
    public void initI() {
        rand = new Random(0xDEAD_BEEF);
    }

    private long busyPause(long pauseInNanos) {
        Blackhole.consumeCPU(pauseInNanos);
        return pauseInNanos;
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseShort() {
        return busyPause(SHORT_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseLong() {
        return busyPause(LONG_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyFunc() {
        return doCalculation(1);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureShort() {
        long x = busyPause(SHORT_PAUSE);
        return doCalculation(x);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureLong() {
        long x = busyPause(LONG_PAUSE);
        return doCalculation(x);
    }

    private long doCalculation(long x) {
        long calculation = 0;
        calculation += x / (rand.nextInt(5) + 1);
        calculation -= calculation / (rand.nextInt(5) + 1);
        calculation -= x / (rand.nextInt(6) + 1);
        calculation += calculation / (rand.nextInt(6) + 1);
        return calculation;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LatencyTest.class.getName())
                .verbosity(VerboseMode.NORMAL)
                .build();
        new Runner(options).run();
    }
}

请注意，为了避免与操作系统相关的影响，我已将繁忙循环实现更改为Blackhole#consumeCPU().所以我的结果是:

Please note that I've changed busy loop implementation to Blackhole#consumeCPU() in order to avoid OS related effects. So my results are:

Benchmark                          Mode  Cnt      Score     Error  Units
LatencyTest.latencyBusyPauseLong   avgt    4  15992.216 ± 106.538  ns/op
LatencyTest.latencyBusyPauseShort  avgt    4      6.450 ±   0.163  ns/op
LatencyTest.latencyFunc            avgt    4     97.321 ±   0.984  ns/op
LatencyTest.measureLong            avgt    4  16103.228 ± 102.338  ns/op
LatencyTest.measureShort           avgt    4    100.454 ±   0.041  ns/op

请注意，结果几乎是累加的，即 latencyFunc +等待时间BusyPauseShort = measureShort

Please note that the results are almost additive, i.e. latencyFunc + latencyBusyPauseShort = measureShort

您的考试出了什么问题?它无法正确预热JVM，即它使用一个参数进行预热，而使用另一个参数进行测试.为什么这很重要? JVM使用配置文件引导的优化，例如，它计算分支采用的频率，并为特定配置文件生成最佳"(无分支)代码.因此，然后我们尝试使用参数1预热JVM，使JVM达到基准，JVM生成最优代码"，而分支从未进入while循环.这是来自JIT编译日志(-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation)的事件:

What is wrong with your test? It does not warm-up JVM properly, i.e. it uses one parameter to warm-up and another to test. Why is this important? JVM uses profile-guided optimizations, for example, it counts how often a branch has been taken and generates "the best"(branch-free) code for the particular profile. So then we are trying to warm-up JVM our benchmark with parameter 1, JVM generates "optimal code" where branch in while loop has been never taken. Here is an event from JIT compilation log(-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation):

<branch prob="0.0408393" not_taken="40960" taken="1744" cnt="42704" target_bci="42"/>

属性更改后，JIT使用不常见的陷阱来处理不是最佳的代码.我创建了一个基准，该基准基于原始基准进行了微小的更改:

After property change JIT uses uncommon trap in order to process your code which is not optimal. I've created a benchmark which is based on original one with minor changes:

busyPause被JMH的消耗CPU取代，以便获得纯Java基准测试，而无需与系统进行交互(实际上，nano time使用用户界面功能vdso clock_gettime，我们无法分析此代码)
所有计算均已删除

busyPause replaced by consumeCPU from JMH in order to have pure java benchmark without interactions with system(actually nano time uses userland function vdso clock_gettime and we unable to profile this code)
all calculations are removed

import java.util.Arrays;

public class JvmPauseLatency {

    private static final int WARMUP = 2000 ;
    private static final int EXTRA = 10;
    private static final long PAUSE = 70000L; // in nanos
    private static volatile long consumedCPU = System.nanoTime();

    //org.openjdk.jmh.infra.Blackhole.consumeCPU()
    private static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    public void run(long warmPause) {
        long[] results = new long[WARMUP + EXTRA];
        int count = 0;
        long interval = warmPause;
        while(count < results.length) {

            consumeCPU(interval);

            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;

            results[count++] = latency;
            if (count == WARMUP) {
                interval = PAUSE;
            }
        }

        System.out.println("Results:" + Arrays.toString(Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
    }

    public static void main(String[] args) {
        int totalCount = 0;
        while (totalCount < 100) {
            new JvmPauseLatency().run(0);
            totalCount ++;
        }
    }
}

结果是

Results:[62, 66, 63, 64, 62, 62, 60, 58, 65, 61, 127, 245, 140, 85, 88, 114, 76, 199, 310, 196]
Results:[61, 63, 65, 64, 62, 65, 82, 63, 67, 70, 104, 176, 368, 297, 272, 183, 248, 217, 267, 181]
Results:[62, 65, 60, 59, 54, 64, 63, 71, 48, 59, 202, 74, 400, 247, 215, 184, 380, 258, 266, 323]

要修复此基准，只需将new JvmPauseLatency().run(0)替换为new JvmPauseLatency().run(PAUSE);，结果如下:

In order to fix this benchmark just replace new JvmPauseLatency().run(0) with new JvmPauseLatency().run(PAUSE); and here is the results:

Results:[46, 45, 44, 45, 48, 46, 43, 72, 50, 47, 46, 44, 54, 45, 43, 43, 43, 48, 46, 43]
Results:[44, 44, 45, 45, 43, 46, 46, 44, 44, 44, 43, 49, 45, 44, 43, 49, 45, 46, 45, 44]

如果要动态更改暂停"，则必须动态预热JVM，即

If you want to change "pause" dynamically - you have to warm-up JVM dynamically, i.e.

    while(count < results.length) {

        consumeCPU(interval);

        long latency = System.nanoTime();
        latency = System.nanoTime() - latency;

        results[count++] = latency;
        if (count >= WARMUP) {
            interval = PAUSE;
        } else {
            interval =  rnd.nextBoolean() ? PAUSE : 0;
        }
    }

问题N4.解释器-Xint呢?

在基于开关的解释器中，我们有很多问题，主要是间接分支指令.我做了3个实验:

Problem N4. What about interpreter -Xint?

In case of switch-based interpreter we have a lot of problems and the main is indirect branch instructions. I've made 3 experiments:

随机热身
具有0个暂停的恒定预热
整个测试使用暂停0，包括

每个实验都是通过以下命令sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency开始的，结果是:

Every experiment was started by the following command sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency and the results are:

 Performance counter stats for 'java -Xint JvmPauseLatency':

   272,822,274,275      cycles
   723,420,125,590      instructions              #    2.65  insn per cycle
        26,994,494      cache-references
         8,575,746      cache-misses              #   31.769 % of all cache refs
     2,060,138,555      bus-cycles
         2,930,155      branch-misses

      86.808481183 seconds time elapsed

 Performance counter stats for 'java -Xint JvmPauseLatency':

     2,812,949,238      cycles
     7,267,497,946      instructions              #    2.58  insn per cycle
         6,936,666      cache-references
         1,107,318      cache-misses              #   15.963 % of all cache refs
        21,410,797      bus-cycles
           791,441      branch-misses

       0.907758181 seconds time elapsed

 Performance counter stats for 'java -Xint JvmPauseLatency':

       126,157,793      cycles
       158,845,300      instructions              #    1.26  insn per cycle
         6,650,471      cache-references
           909,593      cache-misses              #   13.677 % of all cache refs
         1,635,548      bus-cycles
           775,564      branch-misses

       0.073511817 seconds time elapsed

在分支未命中的情况下，由于巨大的内存占用，延迟和占用空间会非线性增长.

In case of branch miss latency and footprint grows non-linearly due to huge memory footprint.

这篇关于为什么在繁忙的旋转暂停后，JVM对于相同的代码块显示更多的延迟?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！

results