Problem description
I'm currently trying to write code that is supposed to work on a wide range of machines, from handheld devices and sensors to big servers in data centers.
One of the (many) differences between these architectures is the requirement for aligned memory access.
Aligned memory access is not required on "standard" x86 CPUs, but many other CPUs need it and raise an exception if the rule is not respected.
Up to now, I've been dealing with it by forcing the compiler to be cautious on specific data accesses that are known to be risky, using the packed attribute (or pragma). And it works fine.
The problem is, the compiler is so cautious that a lot of performance is lost in the process.
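For readers unfamiliar with the packed approach mentioned above, here is a minimal sketch of what it can look like. It assumes GCC or Clang (`__attribute__((packed))` is a compiler extension; MSVC uses `#pragma pack` instead), and the `struct record` layout and the `record_value` helper are hypothetical names made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire-format record. Without the packed attribute the
   compiler would insert 3 padding bytes so that `value` is 4-byte aligned. */
struct record {
    uint8_t  tag;
    uint32_t value;   /* sits at offset 1 once the struct is packed */
} __attribute__((packed));

/* Because the struct is packed, the compiler knows `value` may be
   misaligned and emits whatever access sequence is safe for the target
   (typically byte loads on strict-alignment CPUs). */
static uint32_t record_value(const struct record *r)
{
    return r->value;
}
```

On x86 this usually compiles to a single load, while on a strict-alignment target the same source tends to expand into several byte loads and shifts, which is the cost described above.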
Since performance is important, we would be better off rewriting some portions of the code to work specifically on strict-alignment CPUs. Such code would, on the other hand, be slower on CPUs that support unaligned memory access (such as x86), so we want to use it only on CPUs that require strictly aligned memory access.
And now the question: how to detect, at compile time, that the target architecture requires strictly aligned memory access? (Or the other way round.)
No C implementation that I know of provides any preprocessor macro to help you figure this out. Since your code supposedly runs on a wide range of machines, I assume that you have access to a wide variety of machines for testing, so you can figure out the answer with a test program. Then you can write your own macro, something like below:
#if defined(__sparc__)
/* Unaligned access will crash your app on a SPARC */
#define ALIGN_ACCESS 1
#elif defined(__ppc__) || defined(__POWERPC__) || defined(_M_PPC)
/* Unaligned access is too slow on a PowerPC (maybe?) */
#define ALIGN_ACCESS 1
#elif defined(__i386__) || defined(__x86_64__) || \
defined(_M_IX86) || defined(_M_X64)
/* x86 / x64 are fairly forgiving */
#define ALIGN_ACCESS 0
#else
#warning "Unsupported architecture"
#define ALIGN_ACCESS 1
#endif
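One possible way to consume such a macro is sketched below. The helper name `read_le32` is hypothetical, and the fallback definition of `ALIGN_ACCESS` is an assumption added only so that the snippet compiles on its own. The two branches are deliberately simple. A real strict-alignment path would more likely restructure whole loops so that word loads only happen at aligned addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: ALIGN_ACCESS normally comes from a detection header like
   the one above. Fall back to the safe value if it is missing. */
#ifndef ALIGN_ACCESS
#define ALIGN_ACCESS 1
#endif

/* Hypothetical helper: read a little-endian 32-bit value from any address. */
static uint32_t read_le32(const void *src)
{
    const unsigned char *p = src;
    assert(p != NULL);
#if ALIGN_ACCESS
    /* Strict-alignment path: only byte loads, so no alignment trap. */
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
#else
    /* Forgiving path: a fixed-size memcpy, which compilers typically
       turn into one load. The result is in native byte order, which is
       little-endian on the x86 targets that select this branch. */
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
#endif
}
```

Calling `read_le32(buf + 1)` on a byte buffer is then safe on either kind of target, and only the generated code differs.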
Note that the speed of an unaligned access will depend on the boundaries which it crosses. For example, if the access crosses a 4k page boundary it will be much slower, and there may be other boundaries which cause it to be slower still. Even on x86, some unaligned accesses are not handled by the processor and are instead handled by the OS kernel. That is incredibly slow.
There is also no guarantee that a future (or current) implementation will not suddenly change the performance characteristics of unaligned accesses. This has happened in the past and may happen in the future; the PowerPC 601 was very forgiving of unaligned access but the PowerPC 603e was not.
Complicating things even further is the fact that the code you'd write to make an unaligned access would differ in implementation across platforms. For example, on PowerPC it's simplified by the fact that x << 32 and x >> 32 are always 0 if x is 32 bits, but on x86 you have no such luck.
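To make the shift issue concrete, here is a sketch of the usual "combine two aligned words" technique. In C itself, shifting a 32-bit value by 32 is undefined behavior, so portable code cannot rely on what the hardware does: the PowerPC shift instructions produce 0 for a count of 32, while x86 masks the count to 5 bits, so a shift by 32 behaves like a shift by 0. The function name `combine_le` is hypothetical, and a little-endian layout is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Given two adjacent aligned 32-bit words, return the 32-bit value that
   starts `offset` bytes into `lo` (little-endian layout assumed). */
static uint32_t combine_le(uint32_t lo, uint32_t hi, unsigned offset)
{
    assert(offset < 4);
    unsigned shift = offset * 8;

    /* With offset 0 the general formula would compute hi << 32, which is
       undefined in C, so that case is handled separately. */
    if (shift == 0)
        return lo;

    return (lo >> shift) | (hi << (32 - shift));
}
```

The explicit `shift == 0` branch is what a platform with PowerPC-style shift semantics could drop in hand-written assembly, and what code written in C has to keep.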