本章是系列文章的第二章,介绍了基于控制流图的一些优化方法。包括DAG、值标记、相同子表达式等方法。这章的后面介绍了llvm的一些基本概念,并引导大家写了个简单的pass。
本章是系列文章的第二章,介绍了基于控制流图的一些优化方法。包括DAG、值标记、相同子表达式等方法。这章的后面介绍了llvm的一些基本概念,并引导大家写了个简单的pass。
为了找到更多的CSE(Common SubExpressions),需要制定更多的定理:
交换律:对+运算符,x+y和y+x等同。
特性转换:x<y一般转换成t=x-y; t<0
结合律: 对
a=b+c; t=c+d; e=t+b;
等同于:
a=b+c; e=a+d;
算术特性转换:
x+0=0+x=x;
x*1=1*x=x;
x-0=x;
x/1=x;
计算强度降维转换:
x2=x*x;
2*x=x+x;
常量折叠:在编译阶段计算表达式的值,并将表达式替换成对应的值。
死代码(Dead Code)是满足下面2个条件的DAG结点:
窥孔优化的实例:
局部寄存器分配的伪代码类似这样:
1 allocate(Block b) { 2 for (Inst i : b.instructions()) { 3 for (Operand o : i.operands()) { 4 if (o is in memory m) { 5 r = find_reg(i) assign r to o add "r = load m" before i 6 } 7 } 8 for (Operand o : i.operands()) { 9 if (i is the last use of o) { 10 return the register bound to o to the list of free registers 11 } 12 } 13 v = i.definition r = find_reg(i) assign r to v 14 } 15 }
溢出(spilling):大多数情况下寄存器是有限的,需要在寄存器不够用的情况下将之前保存在寄存器里面的内容映射回内存,这个操作叫做溢出。
find_reg函数的伪代码是这样的:
1 find_reg(i) { 2 if there is free register r 3 return r 4 else 5 let v be the latest variable to be used after i, that is in a register 6 if v does not have a memory slot 7 let m be a fresh memory slot 8 else 9 let m be the memory slot assigned to v 10 add "store v m" right after the definition of v 11 return r 12 }
可以看出局部寄存器的分配,主要依赖在变量使用前插入"r = load m"指令,并在变量使用完之后插入"store v m"来实现。
但当寄存器不足的时候,要选择将哪个变量从寄存器溢出到内存里面?
伯克利算法策略:溢出时通常选择离溢出点最远的变量,也称为LRU(Least Recently Used,最近最少使用算法)。该算法在各种缓存溢出过程中广泛采用,包括页面置换,cache miss等过程。
对只有2个寄存器的机器,要实现下面的计算:
1 a = load m0 2 b = load m1 3 c = a + b 4 d = 4 * c 5 e = b + 1 6 f = e + a 7 g = d / e 8 h = f - g 9 ret h
实际完成局部寄存器分配之后的代码和各变量在寄存器,内存里面的生命周期是这样的:
上面的算法在这次运算中其实不是最优解。如果在"c=a+b"计算之前不把b踢出寄存器,而是在"d=4*c"计算中让d复用c的寄存器,就可以少store一次b并且少load一次b。
但在1998年就有科学家证明了,找到每次分配寄存器的最优解是NP完全问题(NP-completeness,是"nondeterministic polynomial-time completeness"的简称,也就是不确定的多项式时间完全问题,其中不确定性指的是不确定图灵机,是数学上形式化描述的暴力搜索算法。对确定性的算法,只需要进行一次迭代就能得出结果,对不确定的算法,需要遍历整个空间。)。也就是说,如果遍历所有分配选项,当然是能找出一个最优解的,但时间消耗非常大,所以各个类似领域都是用LRU作为较优解。
llvm有很多编译工具:
1 root@e6db4f256fba:/home/.devtools/efb/clang11/bin# cd /home/.devtools/efb/clang11/bin/ 2 root@e6db4f256fba:/home/.devtools/efb/clang11/bin# ls 3 bugpoint ld64.lld llvm-gsymutil llvm-rtdyld 4 c-index-test llc llvm-ifs llvm-size 5 clang lld llvm-install-name-tool llvm-special-case-list-fuzzer 6 clang++ lld-link llvm-isel-fuzzer llvm-split 7 clang-11 lldb llvm-itanium-demangle-fuzzer llvm-stress 8 clang-apply-replacements lldb-argdumper llvm-jitlink llvm-strings 9 clang-change-namespace lldb-instr llvm-lib llvm-strip 10 clang-check lldb-server llvm-link llvm-symbolizer 11 clang-cl lldb-vscode llvm-lipo llvm-tblgen 12 clang-cpp lli llvm-lit llvm-undname 13 clang-doc llvm-addr2line llvm-locstats llvm-xray 14 clang-extdef-mapping llvm-ar llvm-lto llvm-yaml-numeric-parser-fuzzer 15 clang-format llvm-as llvm-lto2 mlir-cpu-runner 16 clang-include-fixer llvm-bcanalyzer llvm-mc mlir-edsc-builder-api-test 17 clang-move llvm-c-test llvm-mca mlir-linalg-ods-gen 18 clang-offload-bundler llvm-cat llvm-microsoft-demangle-fuzzer mlir-opt 19 clang-offload-wrapper llvm-cfi-verify llvm-ml mlir-reduce 20 clang-query llvm-config llvm-modextract mlir-sdbm-api-test 21 clang-refactor llvm-cov llvm-mt mlir-tblgen 22 clang-rename llvm-cvtres llvm-nm mlir-translate 23 clang-reorder-fields llvm-cxxdump llvm-objcopy modularize 24 clang-scan-deps llvm-cxxfilt llvm-objdump obj2yaml 25 clang-tblgen llvm-cxxmap llvm-opt-fuzzer opt 26 clang-tidy llvm-diff llvm-opt-report pp-trace 27 clangd llvm-dis llvm-pdbutil sancov 28 diagtool llvm-dlltool llvm-profdata sanstats 29 dsymutil llvm-dwarfdump llvm-ranlib scan-build 30 find-all-symbols llvm-dwp llvm-rc scan-view 31 git-clang-format llvm-elfabi llvm-readelf verify-uselistorder 32 hmaptool llvm-exegesis llvm-readobj wasm-ld 33 ld.lld llvm-extract llvm-reduce yaml2obj
1 root@e6db4f256fba:/home/.devtools/efb/clang11/bin# opt --help 2 OVERVIEW: llvm .bc -> .bc modular optimizer and analysis printer 3 4 USAGE: opt [options] <input bitcode file> 5 6 OPTIONS: 7 8 Color Options: 9 10 --color - Use colors in output (default=autodetect) 11 12 General options: 13 14 --Emit-dtu-info - Enable DTU info section generation 15 --O0 - Optimization level 0. Similar to clang -O0 16 --O1 - Optimization level 1. Similar to clang -O1 17 --O2 - Optimization level 2. Similar to clang -O2 18 --O3 - Optimization level 3. Similar to clang -O3 19 --Os - Like -O2 with extra optimizations for size. Similar to clang -Os 20 --Oz - Like -Os but reduces code size further. Similar to clang -Oz 21 -S - Write output as LLVM assembly 22 --aarch64-neon-syntax=<value> - Choose style of NEON code to emit from AArch64 backend: 23 =generic - Emit generic NEON assembly 24 =apple - Emit Apple-style NEON assembly 25 --addrsig - Emit an address-significance table 26 --analyze - Only perform analysis, no optimization 27 --asm-show-inst - Emit internal instruction representation to assembly file 28 --atomic-counter-update-promoted - Do counter update using atomic fetch add for promoted counters only 29 Optimizations available: 30 --X86CondBrFolding - X86CondBrFolding 31 --aa - Function Alias Analysis Results 32 --aa-eval - Exhaustive Alias Analysis Precision Evaluator 33 --aarch64-a57-fp-load-balancing - AArch64 A57 FP Load-Balancing 34 --aarch64-branch-targets - AArch64 Branch Targets 35 --aarch64-ccmp - AArch64 CCMP Pass 36 --aarch64-collect-loh - AArch64 Collect Linker Optimization Hint (LOH) 37 --aarch64-condopt - AArch64 CondOpt Pass 38 --aarch64-copyelim - AArch64 redundant copy elimination pass 39 --aarch64-dead-defs - AArch64 Dead register definitions 40 --aarch64-expand-pseudo - AArch64 pseudo instruction expansion pass 41 --aarch64-fix-cortex-a53-835769-pass - AArch64 fix for A53 erratum 835769 42 --aarch64-jump-tables - AArch64 compress jump tables pass 43 --aarch64-ldst-opt - AArch64 load / store optimization pass 44 --aarch64-local-dynamic-tls-cleanup - AArch64 Local Dynamic TLS Access Clean-up 45 --aarch64-prelegalizer-combiner - Combine AArch64 machine instrs before legalization 46 --aarch64-promote-const - AArch64 Promote Constant Pass 47 --aarch64-simd-scalar - AdvSIMD Scalar Operation Optimization 48 --aarch64-simdinstr-opt - AArch64 SIMD instructions optimization pass 49 --aarch64-speculation-hardening - AArch64 speculation hardening pass 50 --aarch64-stack-tagging-pre-ra - AArch64 Stack Tagging PreRA Pass 51 --aarch64-stp-suppress - AArch64 Store Pair Suppression 52 --adce - Aggressive Dead Code Elimination 53 …………
不同优化级别的优化使能的优化选项:
1 root@e6db4f256fba:/home/.devtools/efb/clang11/bin# llvm-as < /dev/null | opt -O0 -disable-output -debug-pass=Arguments 2 Pass Arguments: -tti -verify -ee-instrument 3 Pass Arguments: -targetlibinfo -tti -assumption-cache-tracker -profile-summary-info -forceattrs -basiccg -always-inline -verify 4 root@e6db4f256fba:/home/.devtools/efb/clang11/bin# llvm-as < /dev/null | opt -O3 -disable-output -debug-pass=Arguments 5 Pass Arguments: -tti -tbaa -scoped-noalias -assumption-cache-tracker -targetlibinfo -verify -ee-instrument -simplifycfg -domtree -sroa -early-cse -lower-expect 6 Pass Arguments: -targetlibinfo -tti -tbaa -scoped-noalias -assumption-cache-tracker -profile-summary-info -forceattrs -inferattrs -domtree -callsite-splitting -ipsccp -called-value-propagation -attributor -globalopt -domtree -mem2reg -deadargelim -domtree -basicaa -aa -loops -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instcombine -simplifycfg -basiccg -globals-aa -prune-eh -inline -functionattrs -argpromotion -domtree -sroa -basicaa -aa -memoryssa -early-cse-memssa -speculative-execution -basicaa -aa -lazy-value-info -jump-threading -correlated-propagation -simplifycfg -domtree -aggressive-instcombine -basicaa -aa -loops -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instcombine -libcalls-shrinkwrap -loops -branch-prob -block-freq -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -pgo-memop-opt -basicaa -aa -loops -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -tailcallelim -simplifycfg -reassociate -domtree -basicaa -aa -memoryssa -loops -loop-simplify -lcssa-verification -lcssa -scalar-evolution -loop-rotate -licm -loop-unswitch -simplifycfg -domtree -basicaa -aa -loops -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instcombine -loop-simplify -lcssa-verification -lcssa -scalar-evolution -indvars -loop-idiom -loop-deletion -loop-unroll -mldst-motion -phi-values -basicaa -aa -memdep -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -gvn -phi-values -basicaa -aa -memdep -memcpyopt -sccp -demanded-bits -bdce -basicaa -aa -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instcombine -lazy-value-info -jump-threading -correlated-propagation -basicaa -aa -phi-values -memdep -dse -basicaa -aa -memoryssa -loops -loop-simplify -lcssa-verification -lcssa -scalar-evolution -licm -postdomtree -adce -simplifycfg -domtree -basicaa -aa -loops -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instcombine -barrier -elim-avail-extern -basiccg -rpo-functionattrs -globalopt -globaldce -basiccg -globals-aa -domtree -float2int -lower-constant-intrinsics -domtree -basicaa -aa -memoryssa -loops -loop-simplify -lcssa-verification -lcssa -scalar-evolution -loop-rotate -loop-accesses -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -loop-distribute -branch-prob -block-freq -scalar-evolution -basicaa -aa -loop-accesses -demanded-bits -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -inject-tli-mappings -loop-vectorize -loop-simplify -scalar-evolution -basicaa -aa -loop-accesses -lazy-branch-prob -lazy-block-freq -loop-load-elim -basicaa -aa -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instcombine -simplifycfg -domtree -loops -scalar-evolution -basicaa -aa -demanded-bits -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -slp-vectorizer -opt-remark-emitter -instcombine -loop-simplify -lcssa-verification -lcssa -scalar-evolution -loop-unroll -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instcombine -memoryssa -loop-simplify -lcssa-verification -lcssa -scalar-evolution -licm -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -transform-warning -alignment-from-assumptions -strip-dead-prototypes -globaldce -constmerge -domtree -loops -branch-prob -block-freq -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -block-freq -loop-sink -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instsimplify -div-rem-pairs -simplifycfg -verify 7 Pass Arguments: -domtree 8 Pass Arguments: -targetlibinfo -domtree -loops -branch-prob -block-freq 9 Pass Arguments: -targetlibinfo -domtree -loops -branch-prob -block-freq
pass是llvm特有的概念。Llvm的paas框架是llvm系统中的重要部分,也是编译器中最有趣的部分,llvm通过应用一连串的pass来达到优化效果。部分pass为编译器提供转换或者优化功能,而这些pass又依赖其他pass提供这些转换和优化需要的分析结果。Pass是llvm提供的编译代码的一种结构化的技术。
所有llvm的pass都是Pass类的子类,它们重载继承自Pass类的虚拟函数,根据pass的功能需要,可以选择继承自ModulePass, CallGraphSCCPass, FunctionPass, LoopPass, RegionPass, 或者 BasicBlockPass 类,这些类相对于最上层的Pass类,提供了更多上下文信息。
例如下面这个函数:
1 int main() { 2 int c1 = 17; 3 int c2 = 25; 4 int c3 = c1 + c2; 5 printf("Value = %d\n", c3); 6 }
编译生成的llvm ir是这样的(省略了一些属性,只显示了函数体):
1 clang -S -emit-llvm mem2reg.cc 2 llvm-dis mem2reg.bc 3 cat mem2reg.ll 4 ; ModuleID = 'mem2reg.bc' 5 source_filename = "mem2reg.cc" 6 target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" 7 target triple = "x86_64-unknown-linux-gnu" 8 9 @.str = private unnamed_addr constant [12 x i8] c"Value = %d\0A\00", align 1 10 11 ; Function Attrs: noinline norecurse optnone uwtable 12 define dso_local i32 @main() #0 { 13 %1 = alloca i32, align 4 14 %2 = alloca i32, align 4 15 %3 = alloca i32, align 4 16 store i32 17, i32* %1, align 4 17 store i32 25, i32* %2, align 4 18 %4 = load i32, i32* %1, align 4 19 %5 = load i32, i32* %2, align 4 20 %6 = add nsw i32 %4, %5 21 store i32 %6, i32* %3, align 4 22 %7 = load i32, i32* %3, align 4 23 %8 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0), i32 %7) 24 ret i32 0 25 } 26 27 declare dso_local i32 @printf(i8*, ...) #1 28 29 attributes #0 = { noinline norecurse optnone uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" } 30 attributes #1 = { "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" } 31 32 !llvm.module.flags = !{!0} 33 !llvm.ident = !{!1} 34 35 !0 = !{i32 1, !"wchar_size", i32 4} 36 !1 = !{!"clang version 11.1.0"}
直接对上面生成的bc文件跑opt,发现优化之后的llvm ir和优化前的llvm ir几乎完全一样,没有达到预期的优化效果。问了一圈,有同事怀疑是属性里面的optnone 再使坏,删掉之后确实能正常优化了。
下面是优化之后的llvm ir和优化过程中运行的命令,可以看出main函数从之前的8行指令优化成了2行。在寄存器足够的情况下,仅仅mem2reg这个优化pass的效果也是非常可观的。
1 root@e6db4f256fba:~/DCC888# opt --mem2reg mem2reg.ll > mem2reg_after.bc 2 root@e6db4f256fba:~/DCC888# llvm-dis mem2reg_after.bc 3 root@e6db4f256fba:~/DCC888# cat mem2reg_after.ll 4 ; Function Attrs: noinline norecurse uwtable 5 define dso_local i32 @main() #0 { 6 %1 = add nsw i32 17, 25 7 %2 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0), i32 %1) 8 ret i32 0 9 }
经过一轮常量折叠后main函数代码可以精简到1行,下面是命令和优化后的llvm ir。
1 root@e6db4f256fba:~/DCC888# opt --constprop mem2reg_after.ll > mem2reg_constprop.bc 2 root@e6db4f256fba:~/DCC888# llvm-dis mem2reg_constprop.bc 3 root@e6db4f256fba:~/DCC888# cat mem2reg_constprop.ll 4 define dso_local i32 @main() #0 { 5 %1 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0), i32 42) 6 ret i32 0 7 }
再来一个CSE的例子,源代码如下:
1 root@e6db4f256fba:~/DCC888# cat cse.cc 2 #include<stdio.h> 3 int main(int argc, char** argv) { 4 char c1 = argc + 1; 5 char c2 = argc - 1; 6 char c3 = c1 + c2; 7 char c4 = c1 + c2; 8 char c5 = c4 * 4; 9 if (argc % 2) 10 printf("Value = %d\n", c3); 11 else 12 printf("Value = %d\n", c5); 13 }
先秀一下没做CSE之前的llir(llvm ir可以简写成llir,省去不关心的各种attributes,但自己跑的时候注意把optnone的attribute删除掉再跑):
root@e6db4f256fba:~/DCC888# clang -S -emit-llvm cse.cc root@e6db4f256fba:~/DCC888# opt --mem2reg cse.ll > cse_mem2reg.bc root@e6db4f256fba:~/DCC888# llvm-dis cse_mem2reg.bc root@e6db4f256fba:~/DCC888# cat cse_mem2reg.ll define dso_local i32 @main(i32 %0, i8** %1) #0 { %3 = add nsw i32 %0, 1 %4 = trunc i32 %3 to i8 %5 = sub nsw i32 %0, 1 %6 = trunc i32 %5 to i8 %7 = sext i8 %4 to i32 %8 = sext i8 %6 to i32 %9 = add nsw i32 %7, %8 %10 = trunc i32 %9 to i8 %11 = sext i8 %4 to i32 %12 = sext i8 %6 to i32 %13 = add nsw i32 %11, %12 %14 = trunc i32 %13 to i8 %15 = sext i8 %14 to i32 %16 = mul nsw i32 %15, 4 %17 = trunc i32 %16 to i8 %18 = srem i32 %0, 2 %19 = icmp ne i32 %18, 0 br i1 %19, label %20, label %23 20: ; preds = %2 %21 = sext i8 %10 to i32 %22 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0), i32 %21) br label %26 23: ; preds = %2 %24 = sext i8 %17 to i32 %25 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0), i32 %24) br label %26 26: ; preds = %23, %20 ret i32 0 }
经过cse优化之后的结果,注意,优化前的7~10和11~14行的代码一样,被优化成一份了:
1 root@e6db4f256fba:~/DCC888# opt --early-cse cse_mem2reg.ll | llvm-dis 2 ; Function Attrs: noinline norecurse uwtable 3 define dso_local i32 @main(i32 %0, i8** %1) #0 { 4 %3 = add nsw i32 %0, 1 5 %4 = trunc i32 %3 to i8 6 %5 = sub nsw i32 %0, 1 7 %6 = trunc i32 %5 to i8 8 %7 = sext i8 %4 to i32 9 %8 = sext i8 %6 to i32 10 %9 = add nsw i32 %7, %8 11 %10 = trunc i32 %9 to i8 12 %11 = sext i8 %10 to i32 13 %12 = mul nsw i32 %11, 4 14 %13 = trunc i32 %12 to i8 15 %14 = srem i32 %0, 2 16 %15 = icmp ne i32 %14, 0 17 br i1 %15, label %16, label %18 18 19 16: ; preds = %2 20 %17 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0), i32 %11) 21 br label %21 22 23 18: ; preds = %2 24 %19 = sext i8 %13 to i32 25 %20 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0), i32 %19) 26 br label %21 27 28 21: ; preds = %18, %16 29 ret i32 0 30 }
上面代码里面的trunc指令负责把大的数据类型转换成小的数据类型,因为函数入参是int,但计算时需要转换成char,但计算时编译器又会自动把它扩展成int32,这时又需要调用sext指令,加法运算完又要调用trunc指令转回char。
这就是为何经常有人建议,除非涉及协议对接,要不然不要用char当做整数处理,如果编译器不做优化的话,性能比直接用int会差很多。
写pass之前需要先了解一下llvm对代码的抽象。llvm代码抽象的最上层是Module,每个Module由一个或者多个Function组成,再往下依次是BasicBlock,Instruction,每条Instruction由一个OpCode和一个或者多个个operand组成。
我们的pass可以是针对下面任意一个层次的处理。
原ppt里面的写pass的流程是llvm2的,当前比较新的版本已经不可用,下面的例子是基于llvm11验证通过,参考了官网的例子https://www.llvm.org/docs/WritingAnLLVMPass.html#quick-start-writing-hello-world。
例如要写一个基于Function的Pass,需要继承自FunctionPass类,重载基于该类的runOnFunction虚函数。下面是一个计算函数内操作符个数的Pass:
1 #define DEBUG_TYPE "opCounter" 2 #include "llvm/IR/Function.h" 3 #include "llvm/Pass.h" 4 #include "llvm/Support/raw_ostream.h" 5 #include <map> 6 using namespace llvm; 7 namespace { 8 struct CountOp : public FunctionPass { 9 std::map<std::string, int> opCounter; 10 static char ID; 11 CountOp() : FunctionPass(ID) {} 12 virtual bool runOnFunction(Function &F) { 13 errs() << "Function " << F.getName() << '\n'; 14 for (Function::iterator bb = F.begin(), e = F.end(); bb != e; ++bb) { 15 for (BasicBlock::iterator i = bb->begin(), e = bb->end(); i != e; ++i) { 16 if (opCounter.find(i->getOpcodeName()) == opCounter.end()) { 17 opCounter[i->getOpcodeName()] = 1; 18 } else { 19 opCounter[i->getOpcodeName()] += 1; 20 } 21 } 22 } 23 std::map<std::string, int>::iterator i = opCounter.begin(); 24 std::map<std::string, int>::iterator e = opCounter.end(); 25 while (i != e) { 26 errs() << i->first << ": " << i->second << "\n"; 27 i++; 28 } 29 errs() << "\n"; 30 opCounter.clear(); 31 return false; 32 } 33 }; 34 } // namespace 35 char CountOp::ID = 0; 36 static RegisterPass<CountOp> X("opCounter", "Counts opcodes per functions");
通过runOnFunction的入参就是Function,通过遍历Function找到BasicBlock,通过遍历BasicBlock找到Instruction,获得Instruction的OpcodeName,并增加累加功能。函数遍历完之后,使用errors()错误输出流将计算的操作符的次数打印出来。
每个Pass都会定义一个ID,看大家代码都是统一赋值成0,看起来不可思议,实际上传给FunctionPass的是一个引用,这个引用再作为引用传递给FunctionPass的父类Pass,Pass将ID的地址作传给PassID,最终实现用每个类里面定义的ID的地址作为类的真实ID的效果。下面贴的代码为了表达这个传递关系,省略了其他无关代码:
282 class FunctionPass : public Pass { 283 public: 384 explicit FunctionPass(char &pid) : Pass(PT_Function, pid) {}
78 class Pass { 79 AnalysisResolver *Resolver = nullptr; // Used to resolve analysis 80 const void *PassID; 81 PassKind Kind; 82 83 84 public: 85 explicit Pass(PassKind K, char &pid) : PassID(&pid), Kind(K) {}
将上面写好的CountOp.cpp拷贝到llvm/lib/Transforms/CountOP目录下面,并将llvm/lib/Transforms/Hello/CMakeLists.txt拷贝到本目录下,将其中的Hello替换成我们新创建的CountOP:
1 # If we don't need RTTI or EH, there's no reason to export anything 2 # from the hello plugin. 3 if( NOT LLVM_REQUIRES_RTTI ) 4 if( NOT LLVM_REQUIRES_EH ) 5 set(LLVM_EXPORTED_SYMBOL_FILE ${CMAKE_CURRENT_SOURCE_DIR}/CountOP.exports) 6 endif() 7 endif() 8 9 if(WIN32 OR CYGWIN) 10 set(LLVM_LINK_COMPONENTS Core Support) 11 endif() 12 13 add_llvm_library( LLVMCountOP MODULE BUILDTREE_ONLY 14 CountOP.cpp 15 16 DEPENDS 17 intrinsics_gen 18 PLUGIN_TOOL 19 opt 20 )
修改llvm/lib/Transforms/CMakeLists.txt,增加一行“add_subdirectory(CountOP)”,然后启动llvm的编译。编译完会在当前build目录下面生成lib/LLVMCountOP.so库,将这个库拷贝到当前代码目录,执行下面命令就可以看到这个pass的执行结果:
1 root@e6db4f256fba:~/DCC888# opt -load LLVMCountOP.so --opCounter mem2reg.bc -disable-output 2 Function main 3 add: 1 4 alloca: 3 5 call: 1 6 load: 3 7 ret: 1 8 store: 3