I use the following two makefiles to compile my Gaussian blur program.

  • g++ -Ofast -ffast-math -march=native -flto -fwhole-program -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp
  • g++ -O3 -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp

My two test environments are:

  • i7 4710HQ (4 cores, 8 threads)
  • E5 2650

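    However, the first build runs at 2x speed on the E5 but only at 0.5x speed on the i7.
    The second build is faster on the i7 but slower on the E5.

    Can anyone explain this?

    Here is the source code: https://github.com/makeapp007/interpolateFloatImg

    I will add more details as soon as possible.

    On the i7 the program runs on 8 threads.
    I don't know how many threads the program spawns on the E5.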
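    One way to answer the thread-count question is to ask the OpenMP runtime directly. The following is a minimal sketch, not code from the project: the helper names `hardwareThreads` and `openmpThreads` are made up for illustration, and it assumes the program relies on OpenMP's default thread count (that is, neither `OMP_NUM_THREADS` nor a `num_threads` clause overrides it).

```cpp
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: number of logical processors the OS reports.
// The standard allows this to return 0 when the value is unknown.
unsigned hardwareThreads() {
    return std::thread::hardware_concurrency();
}

// Hypothetical helper: upper bound on the threads a default OpenMP
// parallel region would use. Falls back to 1 when built without -fopenmp.
int openmpThreads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

    Printing both values at startup on each machine shows whether the two runs really use the thread counts you expect. Build with `-fopenmp`, as in the two command lines above, to get the OpenMP value.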

    ====UPDATE====

    I am a teammate of the project's original author. Here are the results.
    Arch-Lenovo-Y50 ~/project/ca/3/12 (git)-[master] % perf stat -d ./interpolateFloatImg lobby.bin out.bin 255 20
    Kernel kernelSize  : 255
    Standard deviation : 20
    Kernel maximum: 0.000397887
    Kernel minimum: 1.22439e-21
    Reading width 20265 height  8533 = 172921245
    Micro seconds: 211199093
    Performance counter stats for './interpolateFloatImg lobby.bin out.bin 255 20':
    1423026.281358      task-clock:u (msec)       #    6.516 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             2,604      page-faults:u             #    0.002 K/sec
    4,167,572,543,807      cycles:u                  #    2.929 GHz                      (46.79%)
    6,713,517,640,459      instructions:u            #    1.61  insn per cycle           (59.29%)
    725,873,982,404      branches:u                #  510.092 M/sec                    (57.28%)
    23,468,237,735      branch-misses:u           #    3.23% of all branches          (56.99%)
    544,480,682,764      L1-dcache-loads:u         #  382.622 M/sec                    (37.00%)
    545,000,783,842      L1-dcache-load-misses:u   #  100.10% of all L1-dcache hits    (31.44%)
    38,696,703,292      LLC-loads:u               #   27.193 M/sec                    (26.68%)
    1,204,703,652      LLC-load-misses:u         #    3.11% of all LL-cache hits     (35.70%)
    218.384387536 seconds time elapsed
    

    These are the results from the workstation:
    workstation:~/mossCAP3/repos/liuyh1_liujzh/12$  perf stat -d ./interpolateFloatImg ../../../lobby.bin out.bin 255 20
    Kernel kernelSize  : 255
    Standard deviation : 20
    Kernel maximum: 0.000397887
    Kernel minimum: 1.22439e-21
    Reading width 20265 height  8533 = 172921245
    Micro seconds: 133661220
    Performance counter stats for './interpolateFloatImg ../../../lobby.bin out.bin 255 20':
    2035379.528531      task-clock (msec)         #   14.485 CPUs utilized
             7,370      context-switches          #    0.004 K/sec
               273      cpu-migrations            #    0.000 K/sec
             3,123      page-faults               #    0.002 K/sec
    5,272,393,071,699      cycles                    #    2.590 GHz                     [49.99%]
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle
    7,425,570,600,025      instructions              #    1.41  insns per cycle         [62.50%]
    370,199,835,630      branches                  #  181.882 M/sec                   [62.50%]
    47,444,417,555      branch-misses             #   12.82% of all branches         [62.50%]
    591,137,049,749      L1-dcache-loads           #  290.431 M/sec                   [62.51%]
    545,926,505,523      L1-dcache-load-misses     #   92.35% of all L1-dcache hits   [62.51%]
    38,725,975,976      LLC-loads                 #   19.026 M/sec                   [50.00%]
     1,093,840,555      LLC-load-misses           #    2.82% of all LL-cache hits    [49.99%]
    140.520016141 seconds time elapsed
    

    ====UPDATE====
    Specs of the E5:
    workstation:~$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
         20  Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
    workstation:~$ dmesg | grep cache
    [    0.041489] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
    [    0.047512] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
    [    0.050088] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
    [    0.050121] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes)
    [    0.558666] PCI: pci_cache_line_size set to 64 bytes
    [    0.918203] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
    [    0.948808] xhci_hcd 0000:00:14.0: cache line size of 32 is not supported
    [    1.076303] ehci-pci 0000:00:1a.0: cache line size of 32 is not supported
    [    1.089022] ehci-pci 0000:00:1d.0: cache line size of 32 is not supported
    [    1.549796] sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [    1.552711] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    [    1.552955] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    

    Best answer

    Your program has a very high cache miss rate. Is that good or bad for the program?

    545,000,783,842 L1-dcache-load-misses:u   #  100.10% of all L1-dcache hits

    545,926,505,523 L1-dcache-load-misses     #   92.35% of all L1-dcache hits
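    To make the comparison concrete, the ratios perf prints can be recomputed from the raw counters in the two runs. This is only an arithmetic sketch: the struct and function names are made up, and the numbers are copied from the two `perf stat -d` outputs in the question.

```cpp
// Counters copied from the `perf stat -d` outputs quoted in the question.
struct PerfRun {
    double taskClockMs;   // task-clock (msec), summed over all threads
    double elapsedSec;    // seconds time elapsed (wall clock)
    double cycles;
    double instructions;
    double l1Loads;       // L1-dcache-loads
    double l1LoadMisses;  // L1-dcache-load-misses
};

// Fraction of L1 data-cache loads that missed.
double l1MissRatio(const PerfRun& r) { return r.l1LoadMisses / r.l1Loads; }

// Instructions retired per cycle.
double instructionsPerCycle(const PerfRun& r) { return r.instructions / r.cycles; }

// Average number of CPUs kept busy: CPU time divided by wall time.
double cpusUtilized(const PerfRun& r) { return r.taskClockMs / 1000.0 / r.elapsedSec; }

// Laptop (i7) run.
PerfRun laptopRun() {
    return {1423026.281358, 218.384387536, 4167572543807.0,
            6713517640459.0, 544480682764.0, 545000783842.0};
}

// Workstation (E5) run.
PerfRun workstationRun() {
    return {2035379.528531, 140.520016141, 5272393071699.0,
            7425570600025.0, 591137049749.0, 545926505523.0};
}
```

    The miss ratio comes out at about 1.001 on the laptop and 0.924 on the workstation. A value above 100% is most likely a side effect of perf multiplexing the counters (the percentages at the end of each line show each event was only counted part of the time), so treat both figures as "almost every load misses L1". The same numbers also show that the workstation burns more total CPU time (about 2035 s versus 1423 s) and finishes sooner only because it keeps about 14.5 CPUs busy instead of about 6.5.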

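    The cache sizes of the i7 and the E5 may differ, so that is one source of the difference. Others are different assembly code, different gcc versions, and different gcc options.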
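    The `dmesg | grep cache` output in the question does not list the CPU caches, so here is one way to read them on Linux. This is a sketch under assumptions: it relies on the sysfs files `/sys/devices/system/cpu/cpu0/cache/index<N>/size` being present, the function names are made up, and it simply returns 0 when a file is missing or cannot be parsed.

```cpp
#include <fstream>
#include <string>

// Convert a sysfs size string such as "32K" to bytes.
// Returns 0 for an empty or non-numeric string.
long parseCacheSize(const std::string& text) {
    if (text.empty()) return 0;
    std::size_t pos = 0;
    long value = 0;
    try {
        value = std::stol(text, &pos);
    } catch (...) {
        return 0;  // not a number
    }
    if (pos < text.size()) {
        if (text[pos] == 'K') value *= 1024L;
        else if (text[pos] == 'M') value *= 1024L * 1024L;
    }
    return value;
}

// Size in bytes of cache entry `index` of cpu0, or 0 if it cannot be read.
// Which index is L1d, L2 or L3 is given by the sibling "level" and "type" files.
long cacheSizeBytes(int index) {
    std::ifstream in("/sys/devices/system/cpu/cpu0/cache/index" +
                     std::to_string(index) + "/size");
    std::string text;
    if (!(in >> text)) return 0;
    return parseCacheSize(text);
}
```

    Calling `cacheSizeBytes(0)` up to `cacheSizeBytes(3)` on both machines gives the numbers to compare against the working set of the hot loop.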

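    You should look inside the code, find the hotspot, and analyze how many pixels each instruction processes and which processing order would suit the CPU and memory better. Rewriting the hotspot (the part of the code where most of the time is spent) is the key to solving your task (http://shtech.org/course/ca/projects/3/).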

    You can find the hotspot with the perf profiler in its record/report/annotate modes (this is easier if you recompile the project with the -g option added):

    # Profile program using cpu cycle performance counter; write profile to perf.data file
    perf record ./test test_arg1 test_arg2
    # Read perf.data file and report functions where time was spent
    #  (Do not change ./test file, or recompile it after record and before report)
    perf report
    # Find the hotspot in the top functions by annotation
    #  you may use Arrows and Enter to do "annotate" action from report; or:
    perf annotate -s top_function_name
    perf annotate -s top_function_name > annotate_func1.txt
    

    I was able to make it 7 times faster for a small bin file with the arguments 277 10 on a mobile i5-4* (Intel Haswell) with 2 cores (4 virtual cores with HT enabled) and AVX2 + FMA.

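    This required rewriting some loops / loop nests. You should understand how the CPU cache works and what is easier for it: missing often or missing rarely. Also, gcc can be dumb and may not always detect the pattern in which the data is read; that detection may be needed for it to process several pixels in parallel.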
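    To illustrate the kind of loop rewrite meant here, below is a sketch of a vertical 1-D blur pass written two ways. It is not the project's code and it has not been benchmarked: it assumes a row-major `float` image, clamps at the borders, and all function names are made up. The first version walks down a column, so consecutive reads are `width` floats apart. The second computes the same sums but keeps the innermost loop on contiguous memory, which is friendlier to the cache and easier for gcc to vectorize.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Clamp a (possibly negative) row index into [0, height - 1].
static std::size_t clampRow(std::ptrdiff_t y, std::size_t height) {
    const std::ptrdiff_t last = static_cast<std::ptrdiff_t>(height) - 1;
    return static_cast<std::size_t>(std::max<std::ptrdiff_t>(0, std::min(y, last)));
}

// Strided order: the inner loop reads one pixel per image row.
std::vector<float> verticalPassStrided(const std::vector<float>& img,
                                       std::size_t width, std::size_t height,
                                       const std::vector<float>& kernel) {
    const std::ptrdiff_t radius = static_cast<std::ptrdiff_t>(kernel.size() / 2);
    std::vector<float> out(img.size(), 0.0f);
    for (std::size_t x = 0; x < width; ++x) {
        for (std::size_t y = 0; y < height; ++y) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < kernel.size(); ++k) {
                const std::size_t sy =
                    clampRow(static_cast<std::ptrdiff_t>(y + k) - radius, height);
                sum += kernel[k] * img[sy * width + x];
            }
            out[y * width + x] = sum;
        }
    }
    return out;
}

// Row-major order: same arithmetic, but the inner loop streams along a row.
std::vector<float> verticalPassRowMajor(const std::vector<float>& img,
                                        std::size_t width, std::size_t height,
                                        const std::vector<float>& kernel) {
    const std::ptrdiff_t radius = static_cast<std::ptrdiff_t>(kernel.size() / 2);
    std::vector<float> out(img.size(), 0.0f);
    for (std::size_t y = 0; y < height; ++y) {
        float* dst = out.data() + y * width;
        for (std::size_t k = 0; k < kernel.size(); ++k) {
            const std::size_t sy =
                clampRow(static_cast<std::ptrdiff_t>(y + k) - radius, height);
            const float* src = img.data() + sy * width;
            const float w = kernel[k];
            for (std::size_t x = 0; x < width; ++x) {
                dst[x] += w * src[x];  // contiguous reads and writes
            }
        }
    }
    return out;
}
```

    Both functions add the kernel taps in the same order, so they should agree up to rounding. Whether the second one is actually faster on the i7 or the E5 has to be measured, for example with the perf commands above.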

    08-27 01:38