Benchmarking: how to count the number of instructions sent to the CPU to find consumed MIPS

Question

Suppose I have a piece of software and want to study its behavior using a black-box approach. I have a 3.0 GHz CPU with 2 sockets and 4 cores. As you know, to find instructions per second (IPS) we use the following formula:

IPS = sockets*(cores/sockets)*clock*(instructions/cycle)
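
As a rough illustration of the formula, here is a minimal sketch in Python. It assumes the "4 cores" are the total across both sockets, and the IPC value of 1.0 is only a placeholder, since the real instructions-per-cycle figure is exactly what the question is trying to find. The function name peak_ips is made up for this example.

```python
def peak_ips(sockets, total_cores, clock_hz, ipc):
    """IPS = sockets * (cores / sockets) * clock * (instructions / cycle)."""
    cores_per_socket = total_cores / sockets
    return sockets * cores_per_socket * clock_hz * ipc


# Assumptions: 4 cores in total (2 per socket); ipc=1.0 is a placeholder,
# not a measured value.
ips = peak_ips(sockets=2, total_cores=4, clock_hz=3.0e9, ipc=1.0)
print(f"{ips / 1e6:.0f} MIPS")
```

Note that the sockets term cancels out, so the result depends only on the total core count, the clock, and the IPC.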

At first, I wanted to find the number of instructions per cycle for my specific algorithm. Then I realised it's almost impossible to count that using a black-box approach, and that I would need to do an in-depth analysis of the algorithm.

But now I have two questions: regardless of what kind of software is running on my machine and its CPU usage, is there any way to count the number of instructions per second sent to the CPU (millions of instructions per second, MIPS)? And is it possible to find the types of instructions executed (add, compare, in, jump, etc.)?

Any script or tool recommendation would be appreciated (in any language).

Recommended answer

perf stat --all-user ./my_program on Linux will use CPU performance counters to record how many user-space instructions it ran and how many core clock cycles it took. It also reports how much CPU time was used, and calculates the average instructions per core clock cycle for you, e.g.

3,496,129,612      instructions:u            #    2.61  insn per cycle

It calculates IPC for you; this is usually more interesting than instructions per second. uops per clock is usually even more interesting in terms of how close you are to maxing out the front-end, though. You can manually calculate MIPS from instructions and task-clock. For most other events, perf prints a comment with a per-second rate.
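
The manual MIPS calculation can be sketched in Python as follows. The instructions line is the sample shown above; the task-clock value of 1000 ms is a made-up number for illustration (the answer does not show it), and parse_count and mips are hypothetical helper names, not part of perf.

```python
import re


def parse_count(line):
    """Return (count, event_name) from a human-readable perf stat line."""
    m = re.match(r"\s*([\d,]+)\s+(\S+)", line)
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    return int(m.group(1).replace(",", "")), m.group(2)


def mips(instructions, task_clock_ms):
    """Millions of instructions per second of CPU time."""
    return instructions / (task_clock_ms / 1000.0) / 1e6


line = "3,496,129,612      instructions:u            #    2.61  insn per cycle"
count, event = parse_count(line)

task_clock_ms = 1000.0  # hypothetical value; read the real one from perf stat
print(event, count, f"{mips(count, task_clock_ms):.1f} MIPS")
```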

(If you don't use --all-user, you can use perf stat -e task-clock:u,instructions:u,... to have those specific events count in user-space only, while other events can count always, including inside interrupt handlers and system calls.)

But see How to calculate MIPS using perf stat for more detail on instructions / task-clock vs. instructions / elapsed_time, if you do actually want total or average MIPS across cores, and on whether to count time spent sleeping.
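
To make the difference between the two denominators concrete, here is a small Python sketch with made-up numbers: a hypothetical run in which 4 threads are busy for the whole 2 seconds of wall-clock time, so task-clock (CPU time summed over threads) is 8 seconds. The function name rate_mips and all figures are assumptions for illustration only.

```python
def rate_mips(instructions, seconds):
    """Millions of instructions per second over the given time base."""
    return instructions / seconds / 1e6


# Hypothetical run: 24e9 instructions, 4 threads busy for 2 s of wall time.
instructions = 24_000_000_000
task_clock_s = 8.0  # CPU time summed over all threads
elapsed_s = 2.0     # wall-clock time

per_cpu_second = rate_mips(instructions, task_clock_s)  # average per busy core
per_wall_second = rate_mips(instructions, elapsed_s)    # total across cores

print(f"{per_cpu_second:.0f} MIPS per core-second of CPU time")
print(f"{per_wall_second:.0f} MIPS in total per wall-clock second")
```

If the program sleeps for part of the run, elapsed time grows while task-clock does not, so the wall-clock figure drops and the CPU-time figure stays the same.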

For example output from using it on a tiny microbenchmark loop in a static executable, see Can x86's MOV really be "free"? Why can't I reproduce this at all?

Do you mean from within the program, to profile only part of it? There's a perf API where you can call perf_event_open or something similar. Or use a different library for direct access to the HW perf counters.

perf stat is great for microbenchmarking a loop that you've isolated into a stand-alone program that just runs the hot loop for a second or so.

Or maybe you mean something else. perf stat -I 1000 ... ./a.out will print counter values every 1000 ms (1 second), so you can see how program behaviour changes in real time with whatever time window you want (down to 10 ms intervals).
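
Turning interval counts into a MIPS-over-time series could look like the Python sketch below. It does not parse real perf output: the samples list is hypothetical data in the form (timestamp in seconds, instructions counted), and it assumes each count covers only its own interval rather than being cumulative, which you should check against your perf version.

```python
def interval_mips(samples):
    """Convert (timestamp_s, count) pairs into (timestamp_s, MIPS) pairs.

    Assumes each count covers only the interval ending at its timestamp
    and that the first interval starts at time 0.
    """
    rates = []
    prev_t = 0.0
    for t, count in samples:
        rates.append((t, count / (t - prev_t) / 1e6))
        prev_t = t
    return rates


# Hypothetical samples, as if collected at 1-second intervals.
samples = [(1.0, 2_000_000_000), (2.0, 3_000_000_000), (3.0, 500_000_000)]

for t, rate in interval_mips(samples):
    print(f"t={t:.1f}s  {rate:.0f} MIPS")
```

These rates are per wall-clock second, so an interval in which the program mostly slept shows up as a low value.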

sudo perf top is system-wide, slightly like Unix top.

There's also perf record --timestamp to record a timestamp with each event sample. perf report -D might be useful along with this. See http://www.brendangregg.com/perf.html; he mentions something about -T (--timestamp). I haven't really used this; I mostly isolate single loops I'm tuning into a static executable I can run under perf stat.

Intel x86 CPUs at least have a counter for branch instructions, but other types aren't differentiated, other than FP instructions. This is probably common to most architectures that have perf counters at all.

For Intel CPUs, there's ocperf.py, a wrapper for perf with symbolic names for more microarchitectural events. (Update: plain perf now knows the names of most uarch-specific counters, so you don't need ocperf.py anymore.)

perf stat -e task_clock,cycles,instructions,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.scalar_double,uops_executed.x87 ./my_program

It's not designed to tell you which instructions are running; you can already tell that from tracing execution. Most instructions are fully pipelined, so the interesting thing is which ports have the most pressure. The exception is the divide/sqrt unit: there's a counter for arith.divider_active: "Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations". The divider isn't fully pipelined, so a new divps or sqrtps can't always start even if no older uops are ready to execute on port 0. (http://agner.org/optimize/)

Related: linux perf: how to interpret and find hotspots, for using perf to identify hotspots. Especially with top-down profiling, you have perf sample the call stack to see which functions make a lot of expensive child calls. (I mention this in case that's what you really wanted to know, rather than instruction mix.)

Related:

  • How do I determine the number of x86 machine instructions executed in a C program?
  • How to characterize a workload by obtaining the instruction type breakdown?
  • How do I monitor the amount of SIMD instruction usage

For exact dynamic instruction counts, you might use an instrumentation tool like Intel PIN, if you're on x86. https://software.intel.com/zh-cn/articles/pin-a-dynamic-binary-instrumentation-tool .

The perf stat counts for the instructions:u hardware event should also be more or less exact, and are in practice very repeatable across runs of the same program doing the same work.

On recent Intel CPUs, there's HW support for recording which way conditional / indirect branches went (Intel PT), so you can reconstruct exactly which instructions ran in which order, assuming no self-modifying code and that you can still read any JIT buffers.

Sorry, I don't know what the equivalents are on AMD CPUs.


08-29 13:37