Problem description
My current understanding is that in some cases (when a large number of YMM reads/writes occur) 2nd-generation Intel CPUs execute them improperly; when each YMM register access is replaced by the corresponding four QWORD accesses, the code works. The test case:
/*
; 'Tsubame' decompression loop, 0x96-0x15+6=135 bytes long, 40 instructions:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-TP -O3 -QxSSE4.1 -D_N_YMM -D_N_prefetch_4096 -D_N_HIGH_PRIORITY -FAcs";
.B16.3::
00015 41 0f 18 8b 00 10 00 00 prefetcht0 BYTE PTR [4096+r11]
0001d 41 8b 13 mov edx, DWORD PTR [r11]
00020 89 d1 mov ecx, edx
00022 83 e1 03 and ecx, 3
00025 75 34 jne .B16.7
.B16.4::
00027 0f b6 d2 movzx edx, dl
0002a 85 d2 test edx, edx
0002c 74 0a je .B16.6
.B16.5::
0002e c4 c1 7e 6f 43 01 vmovdqu ymm0, YMMWORD PTR [1+r11]
00034 c5 fe 7f 00 vmovdqu YMMWORD PTR [rax], ymm0
.B16.6::
00038 89 d1 mov ecx, edx
0003a 41 b9 01 00 00 00 mov r9d, 1
00040 ba 00 00 00 00 mov edx, 0
00045 41 0f 44 d1 cmove edx, r9d
00049 c1 e9 03 shr ecx, 3
0004c c1 e2 04 shl edx, 4
0004f 03 d1 add edx, ecx
00051 ff c1 inc ecx
00053 48 03 c2 add rax, rdx
00056 4c 03 d9 add r11, rcx
00059 eb 38 jmp .B16.8
.B16.7::
0005b c1 e1 03 shl ecx, 3
0005e 41 b9 ff ff ff ff mov r9d, -1
00064 41 d3 e9 shr r9d, cl
00067 44 23 ca and r9d, edx
0006a 83 e2 0c and edx, 12
0006d 41 c1 e9 04 shr r9d, 4
00071 f7 da neg edx
00073 83 c2 10 add edx, 16
00076 49 f7 d9 neg r9
00079 4c 03 c8 add r9, rax
0007c c1 e9 03 shr ecx, 3
0007f f7 d9 neg ecx
00081 83 c1 04 add ecx, 4
00084 c4 c1 7e 6f 01 vmovdqu ymm0, YMMWORD PTR [r9]
00089 c5 fe 7f 00 vmovdqu YMMWORD PTR [rax], ymm0
0008d 48 03 c2 add rax, rdx
00090 4c 03 d9 add r11, rcx
.B16.8::
00093 4d 3b d8 cmp r11, r8
00096 0f 82 79 ff ff ff jb .B16.3
*/
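In the match branch at .B16.7, the source pointer r9 is computed as rax minus an offset, so the 32-byte load can start less than 32 bytes behind the 32-byte store. The sketch below is an assumption-laden illustration, not a diagnosis of the failure. It shows that in this overlapping situation a single 32-byte load/store and four sequential 8-byte load/stores are not interchangeable on any CPU, because each 8-byte step may read bytes written by an earlier step. The helper names `copy32_single`, `copy32_qwords`, and `copies_agree` are hypothetical and do not come from the Nakamichi source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy 32 bytes as ONE load followed by ONE store,
   modelling a vmovdqu load/store pair. */
static void copy32_single(unsigned char *dst, const unsigned char *src)
{
    unsigned char tmp[32];
    memcpy(tmp, src, 32);   /* all 32 bytes are read before any byte is written */
    memcpy(dst, tmp, 32);
}

/* Hypothetical helper: copy 32 bytes as FOUR 8-byte load/store pairs,
   modelling the 4 x QWORD replacement. */
static void copy32_qwords(unsigned char *dst, const unsigned char *src)
{
    for (size_t i = 0; i < 32; i += 8) {
        uint64_t q;
        memcpy(&q, src + i, 8);   /* may read bytes stored by an earlier step */
        memcpy(dst + i, &q, 8);
    }
}

/* Returns 1 if both strategies leave identical buffers when the source
   lies `offset` bytes behind the destination, 0 otherwise. */
static int copies_agree(size_t offset)
{
    unsigned char a[64], b[64];
    assert(offset >= 1 && offset <= 32);
    for (size_t i = 0; i < 64; i++) {
        a[i] = b[i] = (unsigned char)i;   /* distinct value in every byte */
    }
    copy32_single(a + 32, a + 32 - offset);
    copy32_qwords(b + 32, b + 32 - offset);
    return memcmp(a, b, 64) == 0;
}
```

With this test pattern, `copies_agree(32)` returns 1 because the two regions do not overlap, while `copies_agree(8)` and `copies_agree(16)` return 0. Whether the Tsubame stream actually contains offsets below 32 is an assumption to verify against the compressor.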
Since I have only a Core 2 and an i5-2540M, I cannot check whether the following decompression function works properly on 3rd-generation and later Intel CPUs, so I am asking someone to run this command line and report whether it prints 'FAILED':
D:\Tsubame\buggy_AVX_compile>Nakamichi_Tsubame_YMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe alice29.txt
Nakamichi 'Tsubame', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Note: Conor Stokes' LZSSE2(FASTEST Textual Decompressor) is embedded, all credits along with many thanks go to him.
Limitation: Uncompressed 8192 MB of filesize.
Current priority class is HIGH_PRIORITY_CLASS.
Allocating Source-Buffer 0 MB ...
Allocating Target-Buffer 32 MB ...
Allocating Verification-Buffer 0 MB ...
Compressing 152,089 bytes ...
-; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 4
NumberOf(Tiny)Matches[Tiny]Window (4): 157
NumberOf(Short)Matches[Tiny]Window (8): 52
NumberOf(Medium)Matches[Tiny]Window (12): 11
RAM-to-RAM performance: 11 KB/s.
Compressed to 73,071 bytes.
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x1366,78ee
Target-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x8cec,be70
Decompressing 73,071 (being the compressed stream) bytes ...
RAM-to-RAM performance: 1152 MB/s.
Verification (input and output sizes match) OK.
Verification (input and output blocks mismatch) FAILED!
The command line that interests me:
D:\Tsubame\buggy_AVX_compile>Nakamichi_Tsubame_YMM_PREFETCH_4096_Intel_15.0_64bit_SSE41.exe alice29.txt
The test suite: a 241 KB zip file with executables, source, and test data file.
I asked the same question on Intel's forum; sadly, no one seems to care:
YMMWORD != 4xQWORD
What I have tried:
Toshiba laptop (i5-2540M), Windows 7, Intel C++ Compiler 15.0
Recommended answer