C++ vs. Java: Why is ICC-generated code slower than VC?

Problem description
The following is a simple loop in C++. The timer uses QueryPerformanceCounter() and is quite accurate. I found Java to take 60% of the time the C++ takes, and this can't be?! What am I doing wrong here? Even strict aliasing (which is not included in the code here) doesn't help at all...

```cpp
long long var = 0;
std::array<int, 1024> arr;
int* arrPtr = arr.data();
CHighPrecisionTimer timer;

for(int i = 0; i < 1024; i++)
    arrPtr[i] = i;

timer.Start();

for(int i = 0; i < 1024 * 1024 * 10; i++)
{
    for(int x = 0; x < 1024; x++)
    {
        var += arrPtr[x];
    }
}

timer.Stop();

printf("Unrestricted: %lld us, Value = %lld\n", (Int64)timer.GetElapsed().GetMicros(), var);
```

This C++ runs through in about 9.5 seconds. I am using the Intel Compiler 12.1 with host-processor optimization (specifically for mine) and everything maxed, so this is the Intel Compiler at its best! Auto-parallelization funnily consumes 70% CPU instead of 25% but doesn't get the job done any faster ;)...

Now I use the following Java code for comparison:

```java
long var = 0;
int[] arr = new int[1024];

for(int i = 0; i < 1024; i++)
    arr[i] = i;

// warm-up pass (not timed)
for(int i = 0; i < 1024 * 1024; i++)
{
    for(int x = 0; x < 1024; x++)
    {
        var += arr[x];
    }
}

long nanos = System.nanoTime();

for(int i = 0; i < 1024 * 1024 * 10; i++)
{
    for(int x = 0; x < 1024; x++)
    {
        var += arr[x];
    }
}

nanos = (System.nanoTime() - nanos) / 1000;

System.out.print("Value: " + var + ", Time: " + nanos);
```

The Java code is invoked with aggressive optimization and the server VM (no debug). It runs in about 7 seconds on my machine (and only uses one thread).

Is this a failure of the Intel Compiler, or am I just too dumb again?

[EDIT]: Ok, now here's the thing... it seems more like a bug in the Intel compiler ^^.

[Please note that I am running on an Intel quad-core Q6600, which is rather old,
and it might be that the Intel Compiler performs way better on recent CPUs, like Core i7.]

The results so far:

- Intel x86 (without vectorization): 3 seconds
- MSVC x64: 5 seconds
- Java x86/x64 (Oracle Java 7): 7 seconds
- Intel x64 (with vectorization): 9.5 seconds
- Intel x86 (with vectorization): 9.5 seconds
- Intel x64 (without vectorization): 12 seconds
- MSVC x86: 15 seconds (uhh)

[EDIT]: Another nice case ;). Consider the following trivial lambda expression:

```cpp
#include <stdio.h>
#include <tchar.h>
#include <Windows.h>
#include <vector>
#include <boost/function.hpp>
#include <boost/lambda/bind.hpp>
#include <boost/typeof/typeof.hpp>

template<class TValue>
struct ArrayList
{
private:
    std::vector<TValue> m_Entries;

public:
    template<class TCallback>
    void Foreach(TCallback inCallback)
    {
        for(int i = 0, size = m_Entries.size(); i < size; i++)
        {
            inCallback(i);
        }
    }

    void Add(TValue inValue)
    {
        m_Entries.push_back(inValue);
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    auto t = [&]() {};
    ArrayList<int> arr;
    int res = 0;

    for(int i = 0; i < 100; i++)
    {
        arr.Add(i);
    }

    long long freq, t1, t2;
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);

    for(int i = 0; i < 1000 * 1000 * 10; i++)
    {
        arr.Foreach([&](int v)
        {
            res += i;
        });
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);

    printf("Time: %lld\n", ((t2 - t1) * 1000000) / freq);

    if(res == 4950)
        return -1;

    return 0;
}
```

The Intel compiler shines again:

- MSVC x86/x64: 12 milliseconds
- Intel x86/x64: 1 second

Uhm?! Well, I guess 90 times slower is not a bad thing...

I am not really sure anymore that this applies: based on an answer to this thread, the Intel compiler is known (and I knew that too, but I just didn't think they could drop support for their own processors) to have terrible performance on processors which are not "known" to the compiler, like AMD processors, and maybe even outdated Intel processors like mine... 
So if someone with a recent Intel processor could try this out it would be nice ;).Here is the x64 output of the Intel Compiler: std::array<int, 1024> arr; int* arrPtr = arr.data(); QueryPerformanceFrequency((LARGE_INTEGER*)&freq);000000013F05101D lea rcx,[freq]000000013F051022 call qword ptr [__imp_QueryPerformanceFrequency (13F052000h)] for(int i = 0; i < 1024; i++) arrPtr[i] = i;000000013F051028 mov eax,4000000013F05102D movd xmm0,eax000000013F051031 xor eax,eax000000013F051033 pshufd xmm1,xmm0,0000000013F051038 movdqa xmm0,xmmword ptr [__xi_z+28h (13F0521A0h)]000000013F051040 movdqa xmmword ptr arr[rax*4],xmm0000000013F051046 paddd xmm0,xmm1000000013F05104A movdqa xmmword ptr [rsp+rax*4+60h],xmm0000000013F051050 paddd xmm0,xmm1000000013F051054 movdqa xmmword ptr [rsp+rax*4+70h],xmm0000000013F05105A paddd xmm0,xmm1000000013F05105E movdqa xmmword ptr [rsp+rax*4+80h],xmm0000000013F051067 add rax,10h000000013F05106B paddd xmm0,xmm1000000013F05106F cmp rax,400h000000013F051075 jb wmain+40h (13F051040h) QueryPerformanceCounter((LARGE_INTEGER*)&t1);000000013F051077 lea rcx,[t1]000000013F05107C call qword ptr [__imp_QueryPerformanceCounter (13F052008h)] var += arrPtr[x];000000013F051082 movdqa xmm1,xmmword ptr [__xi_z+38h (13F0521B0h)] for(int i = 0; i < 1024 * 1024 * 10; i++){000000013F05108A xor eax,eax var += arrPtr[x];000000013F05108C movdqa xmm0,xmmword ptr [__xi_z+48h (13F0521C0h)] long long var = 0, freq, t1, t2;000000013F051094 pxor xmm6,xmm6 for(int x = 0; x < 1024; x++){000000013F051098 xor r8d,r8d var += arrPtr[x];000000013F05109B lea rdx,[arr]000000013F0510A0 xor ecx,ecx000000013F0510A2 movq xmm2,mmword ptr arr[rcx] for(int x = 0; x < 1024; x++){000000013F0510A8 add r8,8 var += arrPtr[x];000000013F0510AC punpckldq xmm2,xmm2 for(int x = 0; x < 1024; x++){000000013F0510B0 add rcx,20h var += arrPtr[x];000000013F0510B4 movdqa xmm3,xmm2000000013F0510B8 pand xmm2,xmm0000000013F0510BC movq xmm4,mmword ptr [rdx+8]000000013F0510C1 psrad xmm3,1Fh000000013F0510C6 punpckldq xmm4,xmm4000000013F0510CA pand xmm3,xmm1000000013F0510CE por xmm3,xmm2000000013F0510D2 movdqa xmm5,xmm4000000013F0510D6 movq xmm2,mmword ptr [rdx+10h]000000013F0510DB psrad xmm5,1Fh000000013F0510E0 punpckldq xmm2,xmm2000000013F0510E4 pand xmm5,xmm1000000013F0510E8 paddq xmm6,xmm3000000013F0510EC pand xmm4,xmm0000000013F0510F0 movdqa xmm3,xmm2000000013F0510F4 por xmm5,xmm4000000013F0510F8 psrad xmm3,1Fh000000013F0510FD movq xmm4,mmword ptr [rdx+18h]000000013F051102 pand xmm3,xmm1000000013F051106 punpckldq xmm4,xmm4000000013F05110A pand xmm2,xmm0000000013F05110E por xmm3,xmm2000000013F051112 movdqa xmm2,xmm4000000013F051116 paddq xmm6,xmm5000000013F05111A psrad xmm2,1Fh000000013F05111F pand xmm4,xmm0000000013F051123 pand xmm2,xmm1 for(int x = 0; x < 1024; x++){000000013F051127 add rdx,20h var += arrPtr[x];000000013F05112B paddq xmm6,xmm3000000013F05112F por xmm2,xmm4 for(int x = 0; x < 1024; x++){000000013F051133 cmp r8,400h var += arrPtr[x];000000013F05113A paddq xmm6,xmm2 for(int x = 0; x < 1024; x++){000000013F05113E jb wmain+0A2h (13F0510A2h) for(int i = 0; i < 1024 * 1024 * 10; i++){000000013F051144 inc eax000000013F051146 cmp eax,0A00000h000000013F05114B jb wmain+98h (13F051098h) } } QueryPerformanceCounter((LARGE_INTEGER*)&t2);000000013F051151 lea rcx,[t2]000000013F051156 call qword ptr [__imp_QueryPerformanceCounter (13F052008h)] printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);000000013F05115C mov r9,qword ptr [t2] long long var = 0, freq, t1, t2;000000013F051161 movdqa xmm0,xmm6 
printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);000000013F051165 sub r9,qword ptr [t1]000000013F05116A lea rcx,[string "Unrestricted: %lld ms, Value = %"... (13F0521D0h)]000000013F051171 imul rax,r9,3E8h000000013F051178 cqo000000013F05117A mov r10,qword ptr [freq]000000013F05117F idiv rax,r10 long long var = 0, freq, t1, t2;000000013F051182 psrldq xmm0,8 printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);000000013F051187 mov rdx,rax long long var = 0, freq, t1, t2;000000013F05118A paddq xmm6,xmm0000000013F05118E movd r8,xmm6 printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);000000013F051193 call qword ptr [__imp_printf (13F052108h)]And this one is the assembly of the MSVC x64 build:int _tmain(int argc, _TCHAR* argv[]){000000013FF61000 push rbx000000013FF61002 mov eax,1050h000000013FF61007 call __chkstk (13FF61950h)000000013FF6100C sub rsp,rax000000013FF6100F mov rax,qword ptr [__security_cookie (13FF63000h)]000000013FF61016 xor rax,rsp000000013FF61019 mov qword ptr [rsp+1040h],rax long long var = 0, freq, t1, t2; std::array<int, 1024> arr; int* arrPtr = arr.data(); QueryPerformanceFrequency((LARGE_INTEGER*)&freq);000000013FF61021 lea rcx,[rsp+28h]000000013FF61026 xor ebx,ebx000000013FF61028 call qword ptr [__imp_QueryPerformanceFrequency (13FF62000h)] for(int i = 0; i < 1024; i++) arrPtr[i] = i;000000013FF6102E xor r11d,r11d000000013FF61031 lea rax,[rsp+40h]000000013FF61036 mov dword ptr [rax],r11d000000013FF61039 inc r11d000000013FF6103C add rax,4000000013FF61040 cmp r11d,400h000000013FF61047 jl wmain+36h (13FF61036h) QueryPerformanceCounter((LARGE_INTEGER*)&t1);000000013FF61049 lea rcx,[rsp+20h]000000013FF6104E call qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]000000013FF61054 mov r11d,0A00000h000000013FF6105A nop word ptr [rax+rax] for(int i = 0; i < 1024 * 1024 * 10; i++){ for(int x = 0; x < 1024; x++){000000013FF61060 xor edx,edx000000013FF61062 xor r8d,r8d000000013FF61065 lea rcx,[rsp+48h]000000013FF6106A xor r9d,r9d000000013FF6106D mov r10d,100h000000013FF61073 nop word ptr [rax+rax] var += arrPtr[x];000000013FF61080 movsxd rax,dword ptr [rcx-8]000000013FF61084 add rcx,10h000000013FF61088 add rbx,rax000000013FF6108B movsxd rax,dword ptr [rcx-14h]000000013FF6108F add r9,rax000000013FF61092 movsxd rax,dword ptr [rcx-10h]000000013FF61096 add r8,rax000000013FF61099 movsxd rax,dword ptr [rcx-0Ch]000000013FF6109D add rdx,rax000000013FF610A0 dec r10000000013FF610A3 jne wmain+80h (13FF61080h) for(int i = 0; i < 1024 * 1024 * 10; i++){ for(int x = 0; x < 1024; x++){000000013FF610A5 lea rax,[rdx+r8]000000013FF610A9 add rax,r9000000013FF610AC add rbx,rax000000013FF610AF dec r11000000013FF610B2 jne wmain+60h (13FF61060h) } } QueryPerformanceCounter((LARGE_INTEGER*)&t2);000000013FF610B4 lea rcx,[rsp+30h]000000013FF610B9 call qword ptr [__imp_QueryPerformanceCounter (13FF62008h)] printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);000000013FF610BF mov rax,qword ptr [rsp+30h]000000013FF610C4 lea rcx,[string "Unrestricted: %lld ms, Value = %"... 
(13FF621B0h)]000000013FF610CB sub rax,qword ptr [rsp+20h]000000013FF610D0 mov r8,rbx000000013FF610D3 imul rax,rax,3E8h000000013FF610DA cqo000000013FF610DC idiv rax,qword ptr [rsp+28h]000000013FF610E1 mov rdx,rax000000013FF610E4 call qword ptr [__imp_printf (13FF62138h)] return 0;000000013FF610EA xor eax,eaxIntel Compiler configured without Vectorization, 64-Bit, highest optimizations (this is surprisingly slow, 12 seconds):000000013FC0102F lea rcx,[freq] double var = 0; long long freq, t1, t2;000000013FC01034 xorps xmm6,xmm6 std::array<double, 1024> arr; double* arrPtr = arr.data(); QueryPerformanceFrequency((LARGE_INTEGER*)&freq);000000013FC01037 call qword ptr [__imp_QueryPerformanceFrequency (13FC02000h)] for(int i = 0; i < 1024; i++) arrPtr[i] = i;000000013FC0103D mov eax,2000000013FC01042 mov rdx,100000000h000000013FC0104C movd xmm0,eax000000013FC01050 xor eax,eax000000013FC01052 pshufd xmm1,xmm0,0000000013FC01057 movd xmm0,rdx000000013FC0105C nop dword ptr [rax]000000013FC01060 cvtdq2pd xmm2,xmm0000000013FC01064 paddd xmm0,xmm1000000013FC01068 cvtdq2pd xmm3,xmm0000000013FC0106C paddd xmm0,xmm1000000013FC01070 cvtdq2pd xmm4,xmm0000000013FC01074 paddd xmm0,xmm1000000013FC01078 cvtdq2pd xmm5,xmm0000000013FC0107C movaps xmmword ptr arr[rax*8],xmm2000000013FC01081 paddd xmm0,xmm1000000013FC01085 movaps xmmword ptr [rsp+rax*8+60h],xmm3000000013FC0108A movaps xmmword ptr [rsp+rax*8+70h],xmm4000000013FC0108F movaps xmmword ptr [rsp+rax*8+80h],xmm5000000013FC01097 add rax,8000000013FC0109B cmp rax,400h000000013FC010A1 jb wmain+60h (13FC01060h) QueryPerformanceCounter((LARGE_INTEGER*)&t1);000000013FC010A3 lea rcx,[t1]000000013FC010A8 call qword ptr [__imp_QueryPerformanceCounter (13FC02008h)] for(int i = 0; i < 1024 * 1024 * 10; i++){000000013FC010AE xor eax,eax for(int x = 0; x < 1024; x++){000000013FC010B0 xor edx,edx var += arrPtr[x];000000013FC010B2 lea ecx,[rdx+rdx] for(int x = 0; x < 1024; x++){000000013FC010B5 inc edx for(int x = 0; x < 1024; x++){000000013FC010B7 cmp edx,200h var += arrPtr[x];000000013FC010BD addsd xmm6,mmword ptr arr[rcx*8]000000013FC010C3 addsd xmm6,mmword ptr [rsp+rcx*8+58h] for(int x = 0; x < 1024; x++){000000013FC010C9 jb wmain+0B2h (13FC010B2h) for(int i = 0; i < 1024 * 1024 * 10; i++){000000013FC010CB inc eax000000013FC010CD cmp eax,0A00000h000000013FC010D2 jb wmain+0B0h (13FC010B0h) } } QueryPerformanceCounter((LARGE_INTEGER*)&t2);000000013FC010D4 lea rcx,[t2]000000013FC010D9 call qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]Intel Compiler without vectorization, 32-Bit and highest optimization (this one clearly is the winner now, runs in about 3 seconds and the assembly looks much better):00B81088 lea eax,[t1]00B8108C push eax00B8108D call dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]00B81093 xor eax,eax00B81095 pxor xmm0,xmm000B81099 movaps xmm1,xmm0 for(int x = 0; x < 1024; x++){00B8109C xor edx,edx var += arrPtr[x];00B8109E addpd xmm0,xmmword ptr arr[edx*8]00B810A4 addpd xmm1,xmmword ptr [esp+edx*8+40h]00B810AA addpd xmm0,xmmword ptr [esp+edx*8+50h]00B810B0 addpd xmm1,xmmword ptr [esp+edx*8+60h] for(int x = 0; x < 1024; x++){00B810B6 add edx,800B810B9 cmp edx,400h00B810BF jb wmain+9Eh (0B8109Eh) for(int i = 0; i < 1024 * 1024 * 10; i++){00B810C1 inc eax00B810C2 cmp eax,0A00000h00B810C7 jb wmain+9Ch (0B8109Ch) double var = 0; long long freq, t1, t2;00B810C9 addpd xmm0,xmm1 } } QueryPerformanceCounter((LARGE_INTEGER*)&t2);00B810CD lea eax,[t2]00B810D1 push eax00B810D2 movaps xmmword ptr [esp+4],xmm000B810D7 call dword ptr 
[__imp__QueryPerformanceCounter@4 (0B82004h)]
00B810DD movaps xmm0,xmmword ptr [esp]

Solution

tl;dr: What you're seeing here seems to be ICC's failed attempt at vectorizing the loop.

Let's start with MSVC x64. Here's the critical loop:

```asm
$LL3@main:
movsxd rax, DWORD PTR [rdx-4]
movsxd rcx, DWORD PTR [rdx-8]
add rdx, 16
add r10, rax
movsxd rax, DWORD PTR [rdx-16]
add rbx, rcx
add r9, rax
movsxd rax, DWORD PTR [rdx-12]
add r8, rax
dec r11
jne SHORT $LL3@main
```

What you see here is standard loop unrolling by the compiler. MSVC is unrolling to 4 iterations and splitting the var variable across four registers: r10, rbx, r9, and r8. At the end of the loop, these 4 registers are summed back together.

Here's where the 4 sums are recombined:

```asm
lea rax, QWORD PTR [r8+r9]
add rax, r10
add rbx, rax
dec rdi
jne SHORT $LL6@main
```

Note that MSVC currently does not do automatic vectorization.

Now let's look at part of your ICC output:

```asm
000000013F0510A2 movq xmm2,mmword ptr arr[rcx]
000000013F0510A8 add r8,8
000000013F0510AC punpckldq xmm2,xmm2
000000013F0510B0 add rcx,20h
000000013F0510B4 movdqa xmm3,xmm2
000000013F0510B8 pand xmm2,xmm0
000000013F0510BC movq xmm4,mmword ptr [rdx+8]
000000013F0510C1 psrad xmm3,1Fh
000000013F0510C6 punpckldq xmm4,xmm4
000000013F0510CA pand xmm3,xmm1
000000013F0510CE por xmm3,xmm2
000000013F0510D2 movdqa xmm5,xmm4
000000013F0510D6 movq xmm2,mmword ptr [rdx+10h]
000000013F0510DB psrad xmm5,1Fh
000000013F0510E0 punpckldq xmm2,xmm2
000000013F0510E4 pand xmm5,xmm1
000000013F0510E8 paddq xmm6,xmm3
...
```

What you're seeing here is an attempt by ICC to vectorize this loop. It is done in a manner similar to what MSVC did (splitting into multiple sums), but using SSE registers instead, with two sums per register.

But it turns out that the overhead of vectorization happens to outweigh the benefits of vectorizing.

If we walk these instructions down one by one, we can see how ICC tries to vectorize it:

```asm
// Load two ints using a 64-bit load: {x, y, 0, 0}
movq xmm2,mmword ptr arr[rcx]

// Shuffle the data into this form:
punpckldq xmm2,xmm2      // xmm2 = {x, x, y, y}
movdqa xmm3,xmm2         // xmm3 = {x, x, y, y}

// Mask out indices 1 and 3.
pand xmm2,xmm0           // xmm2 = {x, 0, y, 0}

// Arithmetic right-shift to copy the sign bit across the word.
psrad xmm3,1Fh           // xmm3 = {sign(x), sign(x), sign(y), sign(y)}

// Mask out indices 0 and 2.
pand xmm3,xmm1           // xmm3 = {0, sign(x), 0, sign(y)}

// Combine to get the sign-extended values.
por xmm3,xmm2            // xmm3 = {x, sign(x), y, sign(y)}, i.e. two 64-bit lanes {x, y}

// Add to the accumulator...
paddq xmm6,xmm3
```

So it's doing some very messy unpacking just to vectorize. The mess comes from needing to sign-extend the 32-bit integers to 64-bit using only SSE instructions.

SSE4.1 actually provides the PMOVSXDQ instruction for exactly this purpose. But either the target machine doesn't support SSE4.1, or ICC isn't smart enough to use it in this case.

But the point is: the Intel compiler is trying to vectorize the loop, and the overhead it adds seems to outweigh the benefit of vectorizing it in the first place, which is why it's slower.
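To make the sign-extension point concrete, here is a minimal intrinsics sketch, hypothetical code rather than anything either compiler emitted (the helper names sum_sse2 and sum_sse41 are made up): the SSE2-only shift-and-unpack dance that ICC has to fall back on, versus the single-instruction SSE4.1 path via PMOVSXDQ (_mm_cvtepi32_epi64).

```cpp
// Sketch of sign-extending pairs of 32-bit ints to 64-bit and accumulating
// them with SSE - the job ICC's generated code above is doing the hard way.
#include <emmintrin.h>   // SSE2
#include <smmintrin.h>   // SSE4.1 (_mm_cvtepi32_epi64 / PMOVSXDQ)

long long sum_sse2(const int* a, int n)            // n assumed a multiple of 2
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 2)
    {
        // movq-style load of two ints: {x, y, 0, 0}
        __m128i v    = _mm_loadl_epi64((const __m128i*)(a + i));
        // psrad: broadcast each element's sign bit -> {sign(x), sign(y), ...}
        __m128i sign = _mm_srai_epi32(v, 31);
        // punpckldq: interleave value and sign -> {x, sign(x), y, sign(y)},
        // which viewed as 64-bit lanes is {(long long)x, (long long)y}
        __m128i wide = _mm_unpacklo_epi32(v, sign);
        acc = _mm_add_epi64(acc, wide);            // paddq
    }
    long long out[2];
    _mm_storeu_si128((__m128i*)out, acc);
    return out[0] + out[1];
}

long long sum_sse41(const int* a, int n)           // n assumed a multiple of 2
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 2)
    {
        __m128i v    = _mm_loadl_epi64((const __m128i*)(a + i));
        __m128i wide = _mm_cvtepi32_epi64(v);      // pmovsxdq: one instruction
        acc = _mm_add_epi64(acc, wide);
    }
    long long out[2];
    _mm_storeu_si128((__m128i*)out, acc);
    return out[0] + out[1];
}
```

Per pair of elements, the SSE2 path spends extra shuffle/shift work just to widen the values before the paddq, which is exactly the kind of overhead that can eat the gain from processing two elements at a time, especially on an older core like the Q6600.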
EDIT: Update with the OP's results for:

- ICC x64, no vectorization
- ICC x86, with vectorization

You changed the data type to double, so now it's floating-point, and there is no more of that ugly sign-fill shifting that was plaguing the integer version. But since you disabled vectorization for the x64 version, it obviously becomes slower.

ICC x86 with vectorization:

```asm
00B8109E addpd xmm0,xmmword ptr arr[edx*8]
00B810A4 addpd xmm1,xmmword ptr [esp+edx*8+40h]
00B810AA addpd xmm0,xmmword ptr [esp+edx*8+50h]
00B810B0 addpd xmm1,xmmword ptr [esp+edx*8+60h]
00B810B6 add edx,8
00B810B9 cmp edx,400h
00B810BF jb wmain+9Eh (0B8109Eh)
```

Not much here - standard vectorization plus 4x loop unrolling.

ICC x64 with no vectorization:

```asm
000000013FC010B2 lea ecx,[rdx+rdx]
000000013FC010B5 inc edx
000000013FC010B7 cmp edx,200h
000000013FC010BD addsd xmm6,mmword ptr arr[rcx*8]
000000013FC010C3 addsd xmm6,mmword ptr [rsp+rcx*8+58h]
000000013FC010C9 jb wmain+0B2h (13FC010B2h)
```

No vectorization, and only 2x loop unrolling.

All things being equal, disabling vectorization hurts performance in this floating-point case.
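The common thread in the fast builds above is partial sums kept in independent registers: MSVC x64 keeps four integer accumulators, and ICC's 32-bit build keeps two addpd accumulators (xmm0/xmm1). Written out by hand in C++, the transformation looks like the sketch below (my own illustration of the idea, not code from either compiler; the helper name sum_unrolled is made up).

```cpp
// Hand-written equivalent of the 4-accumulator unrolling MSVC emits for
//   for (int x = 0; x < 1024; x++) var += arrPtr[x];
// Assumes n is a multiple of 4, as it is for the 1024-element array here.
long long sum_unrolled(const int* arrPtr, int n)
{
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // like r10, rbx, r9, r8
    for (int x = 0; x < n; x += 4)
    {
        s0 += arrPtr[x + 0];
        s1 += arrPtr[x + 1];
        s2 += arrPtr[x + 2];
        s3 += arrPtr[x + 3];
    }
    // Recombine the partial sums, like the lea/add sequence after MSVC's loop.
    return (s0 + s1) + (s2 + s3);
}
```

Splitting the running sum like this lets the unrolled additions overlap instead of forming one long dependency chain on a single register. By contrast, the ICC x64 no-vectorization code above funnels every addsd into the single register xmm6, so each addition has to wait for the previous one, which limits how much its 2x unrolling can help.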