在使用iperf测试容器网络性能时发现IPIP隧道性能非常的低,同样在VXLAN模式下网络性能明显高于IPIP隧道;这个问题是在Thunderx2-02与Thunderx2-04 服务发现。针对这个问题,对内核进行tunning。
1. 将iperf client绑定在特定的CPU上,利用perf工具获取程序运行过程中cache,cycle,context-switches 的一些信息;
perf stat -a -e cache-references,cache-misses,cycles,instructions,context-switches,irq:irq_handler_entry,irq:softirq_raise,irq:softirq_entry -C 8,9,10,11 -- taskset -c 8,9,10,11 iperf -c 172.200.50.102 -t 10 -l 8k -i 1 -w 100K -P 4 |
这个过程中发现 04上下文切换特别频繁;
thunderx2-04: Performance counter stats for 'system wide': 2,507,215,365 cache-references 111,415,469 cache-misses # 4.444 % of all cache refs 67,086,874,844 cycles 795,370 context-switches 10.047327675 seconds time elapsed |
thunderx2-02: Performance counter stats for 'system wide': 8,643,300,648 cache-references 704,501,629 cache-misses # 8.151 % of all cache refs 88,676,115,363 cycles 31,310 context-switches 10.050243036 seconds time elapsed |
但是还不清楚有哪些进程跑在这些cpu上,利用下面工具:
可以看到几个内核线程switch频繁,但此时我还不知道这个线程是做什么用的。
perf sched record -C 8,9,10,11 -- taskset -c 8,9,10,11 iperf -c 172.200.50.102 -t 10 -l 8k -i 1 -w 100K -P 4 perf sched latency |
----------------------------------------------------------------------------------------------------------------- Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at | ----------------------------------------------------------------------------------------------------------------- kworker/u449:3-:202481 | 0.029 ms | 2 | avg: 2184.930 ms | max: 4016.295 ms | max at: 910804.377561 s iperf:(6) | 2574.831 ms | 27683 | avg: 0.027 ms | max: 31.598 ms | max at: 910809.489586 s migration/11:78 | 0.000 ms | 3 | avg: 0.005 ms | max: 0.006 ms | max at: 910805.141550 s kworker/9:1H-kb:3907 | 0.029 ms | 2 | avg: 0.004 ms | max: 0.005 ms | max at: 910802.574529 s kworker/9:1-eve:211650 | 2643.633 ms | 53689 | avg: 0.003 ms | max: 2.335 ms | max at: 910801.817809 s kworker/11:0-ev:212453 | 2013.207 ms | 82402 | avg: 0.003 ms | max: 13.222 ms | max at: 910810.381545 s kworker/8:0-eve:182175 | 2395.868 ms | 84278 | avg: 0.003 ms | max: 6.304 ms | max at: 910809.171858 s kworker/10:2-ev:179435 | 2914.797 ms | 90490 | avg: 0.003 ms | max: 6.503 ms | max at: 910804.112214 s kworker/9:2-eve:217678 | 62.932 ms | 7036 | avg: 0.001 ms | max: 0.004 ms | max at: 910806.190321 s kworker/11:2-ev:216856 | 31.314 ms | 3509 | avg: 0.001 ms | max: 0.002 ms | max at: 910807.091390 s kworker/8:1:217889 | 26.993 ms | 3381 | avg: 0.001 ms | max: 0.002 ms | max at: 910803.190522 s kworker/10:0-ev:217348 | 20.548 ms | 2776 | avg: 0.001 ms | max: 0.004 ms | max at: 910804.786117 s perf:218630 | 15.906 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s sshd:213979 | 0.932 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s jenkins.util.Ti:5749 | 0.029 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s ----------------------------------------------------------------------------------------------------------------- TOTAL: | 12774.563 ms | 355893 | --------------------------------------------------- --------------------------------------------------- |
那就用火焰图来分析吧
sudo perf record -g -e branch-misses,cycles,instructions,cache-misses -C 8,9,10,11 -o iperf-11-6 taskset -c 8,9,10,11 iperf -c 172.200.50.102 -t 10 -l 8k -i 1 -w 100K -P 4 sudo perf report --no-children -i iperf-11-6 |
perf script -i iperf-11-6 &> perf.unfold code/FlameGraph/stackcollapse-perf.pl perf.unfold &> perf.folded code/FlameGraph/flamegraph.pl perf.folded > perf-thunderx2-04-iperf-11-6.svg |
可以看到4个kworker,每一个的栈都是类似的,发现了od_dbs_update;这个函数是在kernledriver/cpufreq目录下; 没接触过这个模块,查看Documentation\cpu-freq\, 了解关键信息“savebattery power”,
google下相关资料还有同事给的指令,可以将这个threaddisable;
echo performance > /sys/devices/system/cpu/cpu8/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu9/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu10/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu11/cpufreq/scaling_governor |
之后的perf结果为:
perf stat -a -e cache-references,cache-misses,cycles,instructions,context-switches,irq:irq_handler_entry,irq:softirq_raise,irq:softirq_entry -C 8,9,10,11 -- taskset -c 8,9,10,11 iperf -c 172.200.50.102 -t 10 -l 8k -i 1 -w 100K -P 4 Performance counter stats for 'system wide': 1,934,218,498 cache-references 162,761,564 cache-misses # 8.415 % of all cache refs 111,003,587,975 cycles 5,523,437,453 instructions # 0.05 insn per cycle 43,713 context-switches 5,498 irq:irq_handler_entry 8,700 irq:softirq_raise 8,698 irq:softirq_entry |
root@net-arm-thunderx2-04:/home/jianlin# perf sched latency ----------------------------------------------------------------------------------------------------------------- Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at | ----------------------------------------------------------------------------------------------------------------- :12373:12373 | 39.073 ms | 3 | avg: 20.733 ms | max: 62.070 ms | max at: 89536.961252 s :12394:12394 | 0.701 ms | 2 | avg: 15.095 ms | max: 30.186 ms | max at: 89536.991686 s kworker/8:1-mm_:124291 | 0.657 ms | 16 | avg: 0.016 ms | max: 0.160 ms | max at: 89536.982505 s kworker/10:1:12361 | 0.004 ms | 1 | avg: 0.016 ms | max: 0.016 ms | max at: 89529.641059 s migration/10:72 | 0.000 ms | 3 | avg: 0.008 ms | max: 0.012 ms | max at: 89537.848265 s kworker/10:2-ev:132535 | 0.976 ms | 15 | avg: 0.007 ms | max: 0.021 ms | max at: 89532.884291 s ksoftirqd/10:73 | 0.014 ms | 1 | avg: 0.006 ms | max: 0.006 ms | max at: 89531.128267 s kworker/9:0-mm_:11957 | 0.361 ms | 12 | avg: 0.005 ms | max: 0.012 ms | max at: 89534.076292 s kworker/11:5-mm:219074 | 0.894 ms | 19 | avg: 0.005 ms | max: 0.014 ms | max at: 89538.968274 s iperf:(6) | 2116.851 ms | 16204 | avg: 0.004 ms | max: 0.258 ms | max at: 89529.628820 s migration/11:78 | 0.000 ms | 3 | avg: 0.003 ms | max: 0.003 ms | max at: 89529.912260 s migration/9:66 | 0.000 ms | 3 | avg: 0.003 ms | max: 0.004 ms | max at: 89529.784259 s :12388:12388 | 1.713 ms | 1 | avg: 0.003 ms | max: 0.003 ms | max at: 89536.976836 s migration/8:60 | 0.000 ms | 3 | avg: 0.003 ms | max: 0.003 ms | max at: 89537.720259 s kworker/10:0:4978 | 0.052 ms | 1 | avg: 0.002 ms | max: 0.002 ms | max at: 89529.640951 s :12397:12397 | 2.157 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s ----------------------------------------------------------------------------------------------------------------- TOTAL: | 2163.453 ms | 16288 | --------------------------------------------------- |
没有了内核thread的频繁切换,但是相比02来说,04的iperf switches还是很高;
root@net-arm-thunderx2-02:/home/jianlin# perf sched latency ----------------------------------------------------------------------------------------------------------------- Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at | ----------------------------------------------------------------------------------------------------------------- iperf:(6) | 21872.641 ms | 44 | avg: 0.031 ms | max: 0.180 ms | max at: 348025.791558 s jbd2/sda2-8:1934 | 2.566 ms | 2 | avg: 0.003 ms | max: 0.006 ms | max at: 348035.269940 s |
为什么? 难道对端响应不及时导致的阻塞吗?
Brendan Gregg有blog提及如何使用perf来抓取off-CPU信息;
http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html
参照这位神的指令,抓取了04上iperf clientoff-CPU时间火焰图,
这里的宽度代表了代码路径睡眠的总持续时间。
in this graph the width spans the total duration that a codepath was sleeping.
消息发送时等待memory( tcp_sendmsg_locked),导致了thread被换出;
在02上,tcp_sendmsg_locked上睡眠时间仅仅是04上1/3;
Server端是什么情况呐; 02做server时,tcp_recvmsg代码路径上睡眠时间是12,571ms;
对应04上代码路径上睡眠时间是3,453
看样子04 send数据也不给力;
关键点也许就是tcp_sendmsg_locked,数据发送的过程中被阻塞了,线程被切换出去;等待资源的释放。
问题还没定位,后续继续进行tunning.