Problem description
I wrote this simple C program:
int main() {
    int i;
    int count = 0;
    for (i = 0; i < 2000000000; i++) {
        count = count + 1;
    }
}
I wanted to see how the gcc compiler optimizes this loop (clearly, adding 1 2000000000 times should become "add 2000000000 once"). So:
Compiling with gcc test.c and then running time on a.out gives:
real 0m7.717s
user 0m7.710s
sys 0m0.000s
Compiling with gcc -O2 test.c and then running time on a.out gives:
real 0m0.003s
user 0m0.000s
sys 0m0.000s
Then I produced assembly listings for both with gcc -S. The first one seems quite clear:
.file "test.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl $0, -8(%rbp)
movl $0, -4(%rbp)
jmp .L2
.L3:
addl $1, -8(%rbp)
addl $1, -4(%rbp)
.L2:
cmpl $1999999999, -4(%rbp)
jle .L3
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",@progbits
L3 adds, L2 compares -4(%rbp) with 1999999999 and loops to L3 if i < 2000000000.
Now the optimized one:
.file "test.c"
.text
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
rep
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",@progbits
I can't understand at all what's going on there! I've got little knowledge of assembly, but I expected something like
addl $2000000000, -8(%rbp)
I even tried with gcc -c -g -Wa,-a,-ad -O2 test.c to see the C code together with the assembly it was converted to, but the result was no clearer than the previous one.
Can someone briefly explain:
- The gcc -S -O2 output.
- Whether the loop is optimized as I expected (one addition instead of many).
The compiler is even smarter than that. :)
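In fact, it realizes that you aren't using the result of the loop, so it removed the entire loop completely! All that is left of main in the -O2 listing is rep ret, which simply returns (the rep prefix does not change what ret does here; as far as I know, GCC emits it as padding that helps some AMD branch predictors).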
This is called Dead Code Elimination.
A better test is to print the result:
#include <stdio.h>
int main(void) {
    int i; int count = 0;
    for (i = 0; i < 2000000000; i++) {
        count = count + 1;
    }
    // Print result to prevent Dead Code Elimination
    printf("%d\n", count);
}
EDIT: I've added the required #include <stdio.h>; the MSVC assembly listing corresponds to a version without the #include, but it should be the same.
I don't have GCC in front of me at the moment, since I'm booted into Windows. But here's the disassembly of the version with the printf() on MSVC:
EDIT: I had the wrong assembly output. Here's the correct one.
; 57 : int main(){
$LN8:
sub rsp, 40 ; 00000028H
; 58 :
; 59 :
; 60 : int i; int count = 0;
; 61 : for(i = 0; i < 2000000000; i++){
; 62 : count = count + 1;
; 63 : }
; 64 :
; 65 : // Print result to prevent Dead Code Elimination
; 66 : printf("%d\n",count);
lea rcx, OFFSET FLAT:??_C@_03PMGGPEJJ@?$CFd?6?$AA@
mov edx, 2000000000 ; 77359400H
call QWORD PTR __imp_printf
; 67 :
; 68 :
; 69 :
; 70 :
; 71 : return 0;
xor eax, eax
; 72 : }
add rsp, 40 ; 00000028H
ret 0
So yes, Visual Studio does this optimization. I'd assume GCC probably does too.
And yes, GCC performs a similar optimization. Here's an assembly listing for the same program with gcc -S -O2 test.c (gcc 4.5.2, Ubuntu 11.10, x86):
.file "test.c"
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "%d\n"
.text
.p2align 4,,15
.globl main
.type main, @function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl $2000000000, 8(%esp)
movl $.LC0, 4(%esp)
movl $1, (%esp)
call __printf_chk
leave
ret
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",@progbits