Question
I am trying to create an ldm (resp. stm) instruction with inline assembly, but have trouble expressing the operands (especially their order).
A trivial attempt such as
void *ptr;
unsigned int a;
unsigned int b;
__asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));
does not work, because it might put a into r1 and b into r0:
ldm ip!, {r1, r0}
ldm expects registers in ascending order (as they are encoded in a bitfield), so I need a way to say that the register used for a is lower than the one used for b.
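To illustrate why the order written in the source cannot be honoured, here is a minimal sketch in plain C of how the register-list field is built. It assumes the classic ARM encoding, where bit n of the low 16 bits selects register rn; the helper name reglist_mask is hypothetical and only serves this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Build the 16-bit register-list field of an LDM/STM encoding:
 * bit n is set when register rn appears in the list.
 * (Hypothetical helper, for illustration only.) */
uint16_t reglist_mask(const unsigned *regs, size_t count)
{
    uint16_t mask = 0;
    for (size_t i = 0; i < count; i++)
        mask |= (uint16_t)(1u << (regs[i] & 15u));
    return mask;
}
```

Both {r0, r1} and {r1, r0} yield the same mask, so the instruction has no way to record which register was named first; the lowest-numbered register is always transferred from the lowest address.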
A trivial way is the fixed assignment of registers:
register unsigned int a asm("r0");
register unsigned int b asm("r1");
__asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));
But this removes a lot of flexibility and might make the generated code suboptimal.
Does gcc (4.8) support special constraints for ldm/stm? Or are there better ways to solve this (e.g. some __builtin function)?
Because there are recommendations to use "higher level" constructs: the problem I want to solve is packing 20 bits out of each 32-bit word (e.g. input is 8 words, output is 5 words). Pseudocode is
asm("ldm %[in]!,{ %[a],%[b],%[c],%[d] }" ...)
asm("ldm %[in]!,{ %[e],%[f],%[g],%[h] }" ...) /* splitting of ldm generates better code;
gcc gets out of registers else */
/* do some arithmetic on a - h */
asm volatile("stm %[out]!,{ %[a],%[b],%[c],%[d],%[e] }" ...)
Speed matters here, and ldm is 50% faster than ldr. The arithmetic is tricky, and because gcc generates much better code than I do ;) I would like to solve it in inline assembly while giving some hints about optimized memory access.
Recommended answer
I have recommended the same solution in ARM memtest, i.e., explicitly assign the registers. The analysis on gcc-help is wrong. There is no need to rewrite GCC's register allocation. The only thing needed is to allow the ordering of registers in an assembler specification.
That said, take the following:
int main(void)
{
    void *ptr;
    register unsigned int a __asm__("r1");
    register unsigned int b __asm__("r0");
    __asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));
    return 0;
}
This will not compile with my gcc, because it produces an illegal ARM instruction, ldm r3!,{r1,r0}. One solution is to use the -S flag to stop after generating assembly and then run a script that orders the ldm/stm operands. Perl can easily do this with,
$reglist = join(',', sort(split(',', $reglist)));
Or any other way. Unfortunately, there doesn't appear to be any way to do this using assembler constraints. If we had access to the assigned register number, inline alternatives or conditional compilation could be used.
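One caveat for the sorting step: a plain string sort places r10 before r2, so the register numbers should be compared numerically. The following is a minimal sketch of such a reordering in C, under the assumption that the list contains only names of the form rN (aliases such as sp, lr or pc are not handled). The function name sort_reglist is hypothetical.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for register numbers. */
static int cmp_regnum(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Rewrite a list such as "r10,r2,r0" in place as "r0,r2,r10".
 * Returns 0 on success, or -1 (leaving the input untouched) if an
 * entry is not of the form rN with N in 0..15, or if there are more
 * than 16 entries. The result is never longer than the input. */
int sort_reglist(char *list)
{
    int regs[16];
    size_t n = 0;
    const char *p = list;

    while (*p != '\0') {
        char *end;
        long num;

        while (*p == ' ' || *p == ',')      /* skip separators */
            p++;
        if (*p == '\0')
            break;
        if (*p != 'r' || !isdigit((unsigned char)p[1]) || n == 16)
            return -1;
        num = strtol(p + 1, &end, 10);
        if (num > 15)
            return -1;
        if (*end != '\0' && *end != ',' && *end != ' ')
            return -1;                      /* entries must be separated */
        regs[n++] = (int)num;
        p = end;
    }

    qsort(regs, n, sizeof regs[0], cmp_regnum);

    char *out = list;
    for (size_t i = 0; i < n; i++)
        out += sprintf(out, "%sr%d", i ? "," : "", regs[i]);
    *out = '\0';
    return 0;
}
```

A post-processing tool would call this on the text between the braces of each ldm/stm line in the compiler's assembly output.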
Probably the easiest solution is to use explicit register assignment, unless you are writing a vector library that needs to load/store multiple values and you want to give the compiler some freedom to generate better code. In that case, it is probably better to use structures, as the higher-level gcc optimizations will be able to detect unneeded operations (such as multiplication by one or addition of zero).
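Edit:
"Because there are recommendations to use 'higher level' constructs... The problem I want to solve is packing of 20 bits of a 32 bit word (e.g. input is 8 words, output is 5 words)."
This will probably give better results,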
u32 *ip, *op;
u32 in, out, mask;
int shift = 0;
const u32 *op_end = op + 5;
while(op != op_end) {
    in = *ip++;
    /* mask and accumulate... */
    if(shift >= 32) {
        *op++ = out;
        shift -= 32;
    }
}
The reasoning is that the ARM pipeline generally has several stages, with a separate load/store unit. The ALU (arithmetic) may proceed in parallel with the load and the store, so you can be working on the first word while later words are still loading. In this case, you may also replace the values in place, which gives a cache benefit, unless you need to reuse the 20-bit values. Once the code is in the cache, ldm/stm has little benefit if you stall on data. That will be your case.
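For reference, a plain C version of the packing loop might look like the sketch below. It is not the asker's actual arithmetic: it assumes the payload is the low 20 bits of each input word and that the first value lands in the least significant bits of the output. The name pack20 is hypothetical, and the 64-bit accumulator is chosen for clarity, not for speed on a 32-bit core.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack the low 20 bits of each input word into a contiguous bit
 * stream. Assumption: the first value occupies the least significant
 * bits. n_in should be a multiple of 8 so that the output is exactly
 * n_in * 5 / 8 words with no leftover bits. */
void pack20(const uint32_t *in, size_t n_in, uint32_t *out)
{
    uint64_t acc = 0;   /* bit accumulator */
    unsigned bits = 0;  /* number of valid bits currently in acc */

    for (size_t i = 0; i < n_in; i++) {
        acc |= (uint64_t)(in[i] & 0xFFFFFu) << bits;
        bits += 20;
        if (bits >= 32) {           /* a full output word is ready */
            *out++ = (uint32_t)acc;
            acc >>= 32;
            bits -= 32;
        }
    }
}
```

Written this way, the compiler is free to schedule the loads and stores around the arithmetic itself, which is the point made above.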
2nd edit: The main job of a compiler is to avoid loading values from memory; i.e., register assignment is crucial. Generally, ldm/stm are most useful in memory transfer functions, i.e., a memory test, a memcpy() implementation, etc. If you are doing computation with the data, then the compiler may have better knowledge of pipeline scheduling. You probably need to either accept plain C code or move to complete assembler. Remember, ldm has its first operands available for use immediately; using the ALU with subsequent registers can cause a stall while the data loads. Similarly, stm needs the first register calculations to be complete when it executes, but this is less critical.