Problem description
I am trying to check for any unaligned reads in my program. I enable the unaligned-access processor exception (x86_64, g++, Linux kernel 3.19) via:
asm volatile("pushf \n"
"pop %%rax \n"
"or $0x40000, %%rax \n"
"push %%rax \n"
"popf \n" ::: "rax");
I do an optional forced unaligned read which triggers the exception, so I know it's working. After I disable that, I get an error in a piece of code which otherwise seems fine:
char fullpath[eMaxPath];
snprintf(fullpath, eMaxPath, "%s/%s", "blah", "blah2");
The stack trace shows a failure via __memcpy_sse2, which leads me to suspect that the standard library is using SSE to fulfill my memcpy, but it doesn't realize that I have now made unaligned reads unacceptable.
Is my thinking correct, and is there any way around this (i.e. can I make the standard library use an unaligned-safe sprintf/memcpy instead)?
thanks
Like I commented on the question, that asm isn't safe, because it steps on the red-zone. Instead, use
asm volatile ("add $-128, %rsp\n\t"
"pushf\n\t"
"orl $0x40000, (%rsp)\n\t"
"popf\n\t"
"sub $-128, %rsp\n\t"
);
(-128 fits in a sign-extended 8-bit immediate, but 128 doesn't, hence using add $-128 to subtract 128.)
Or in this case, there are dedicated instructions for toggling that bit, like there are for the carry and direction flags:
asm("stac"); // Set AC flag
asm("stac"); // Clear AC flag
It's a good idea to have some idea when your code uses unaligned memory. It's not necessarily a good idea to change your code to avoid it in every case. Sometimes better locality from packing data closer together is more valuable.
Given that you shouldn't necessarily aim to eliminate all unaligned accesses anyway, I don't think this is the easiest way to find the ones you do have.
Modern x86 hardware has fast hardware support for unaligned loads/stores. When they don't span a cache-line boundary or lead to store-forwarding stalls, there's literally no penalty.
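To make the cache-line-split case concrete, here is a hypothetical example: with 64-byte lines and a 64-byte-aligned buffer, an 8-byte access starting at offset 60 necessarily touches two lines.
#include <cstdint>
#include <cstring>

// Hypothetical illustration: buf is 64-byte aligned, so an 8-byte access to
// bytes 60..67 straddles the boundary between its first and second cache line.
alignas(64) static unsigned char buf[128];

std::uint64_t split_load() {
    std::uint64_t v;
    std::memcpy(&v, buf + 60, sizeof v);  // 8-byte load spanning two 64-byte lines
    return v;
}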
What you might try is looking at performance counters for some of these events:
misalign_mem_ref.loads [Speculative cache line split load uops dispatched to L1 cache]
misalign_mem_ref.stores [Speculative cache line split STA uops dispatched to L1 cache]
ld_blocks.store_forward [This event counts loads that followed a store to the same address, where the data could not be forwarded inside the pipeline from the store to the load.
The most common reason why store forwarding would be blocked is when a load's address range overlaps with a preceding smaller uncompleted store.
See the table of not supported store forwards in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
The penalty for blocked store forwarding is that the load must wait for the store to complete before it can be issued.]
(from ocperf.py list output on my Sandybridge CPU).
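For instance, the ld_blocks.store_forward case described above can be provoked on purpose with something like this sketch (hypothetical code, just to show the pattern the counter counts):
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a store-forwarding block: the wide load's address
// range overlaps a preceding narrower store that hasn't committed yet, so the
// data can't be forwarded from the store buffer and the load has to wait.
std::uint64_t reload_after_narrow_store(unsigned char* p) {
    p[3] = 0x5a;                   // narrow 1-byte store into the middle of the range
    std::uint64_t v;
    std::memcpy(&v, p, sizeof v);  // wider 8-byte load overlapping that store
    return v;
}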
There are probably other ways to detect unaligned memory access. Maybe valgrind? I searched on "valgrind detect unaligned" and found a mailing list discussion from 13 years ago. Probably still not implemented.
The hand-optimized library functions do use unaligned accesses, because it's the fastest way for them to get their job done. E.g. copying bytes 6 to 13 of a string to somewhere else can and should be done with just a single 8-byte load/store.
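In C++ terms that example is just a fixed-size memcpy; a sketch (helper name made up) that compilers and the library routines alike turn into one 8-byte load plus one 8-byte store, whatever the alignment:
#include <cstring>

// Hypothetical helper: copy bytes 6..13 of src to dst.  A constant 8-byte
// memcpy like this typically compiles to a single 8-byte load and a single
// 8-byte store, regardless of how src + 6 or dst happens to be aligned.
void copy_bytes_6_to_13(char* dst, const char* src) {
    std::memcpy(dst, src + 6, 8);
}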
So yes, you would need special slow&safe versions of library functions.
If your code would have to execute extra instructions to avoid using unaligned loads, it's often not worth it. Esp. if the input is usually aligned, having a loop that does the first up-to-alignment-boundary elements before starting the main loop may just slow things down. In the aligned case, everything works optimally, with no overhead of checking alignment. In the unaligned case, things might work a few percent slower, but as long as the unaligned cases are rare, it's not worth avoiding them.
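The prologue being described would look roughly like this hypothetical byte-summing loop (only to show the shape of the extra work):
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the "do the first up-to-alignment-boundary elements,
// then run the main loop" pattern discussed above.
long sum_bytes(const unsigned char* p, std::size_t n) {
    long total = 0;
    // Prologue: peel elements one at a time until p is 16-byte aligned.
    while (n != 0 && (reinterpret_cast<std::uintptr_t>(p) & 15) != 0) {
        total += *p++;
        --n;
    }
    // Main loop: p is now 16-byte aligned, so a vectorized version could use
    // aligned loads here.  This is the extra bookkeeping that may not pay for
    // itself when unaligned inputs are rare.
    for (std::size_t i = 0; i < n; ++i)
        total += p[i];
    return total;
}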
Esp. if it's not SSE code, since non-AVX legacy SSE can only fold loads into memory operands for ALU instructions when alignment is guaranteed.
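For the SSE point, the distinction shows up directly in the intrinsics (sketch; the wrapper function names are made up): _mm_load_si128 requires 16-byte alignment and lets the compiler fold the load into a memory operand of the ALU instruction, while _mm_loadu_si128 tolerates any alignment but, without AVX, stays a separate movdqu.
#include <emmintrin.h>

// Hypothetical sketch: with legacy (non-AVX) SSE, the aligned load below can
// be folded into a memory operand (e.g. paddd xmm0, [mem]); the unaligned
// load must remain a separate movdqu instruction before the paddd.
__m128i add_aligned(const __m128i* a, const __m128i* b) {
    return _mm_add_epi32(_mm_load_si128(a), _mm_load_si128(b));
}

__m128i add_unaligned(const __m128i* a, const __m128i* b) {
    return _mm_add_epi32(_mm_loadu_si128(a), _mm_loadu_si128(b));
}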
The benefit of having good-enough hardware support for unaligned memory ops is that software can be faster in the aligned case. It can leave alignment-handling to hardware, instead of running extra instructions to handle pointers that are probably aligned. (Linus Torvalds had some interesting posts about this on the http://realworldtech.com/ forums, but they're not searchable, so I can't find them.)