问题描述
我有一个用C语言编写的代码,其执行方式与人类相同,因此例如,如果我有两个数组A[0..n-1]
和B[0..n-1]
,则该方法将执行C[0]=A[0]+B[0]
,C[1]=A[1]+B[1]
.
I have a code in C which does additions in the same way as a human does, so if for example I have two arrays A[0..n-1]
and B[0..n-1]
, the method will do C[0]=A[0]+B[0]
, C[1]=A[1]+B[1]
...
即使解决方案使用内在函数,我也需要帮助以加快此功能的运行时间.
I need help in making this function faster, even if the solution is using intrinsics.
我的主要问题是我有一个非常大的依赖性问题,因为迭代i+1
依赖于迭代i
的进位,只要我使用10为底.因此,如果A[0]=6
和B[0]=5
,C[0]
必须为1
,下一个加法则为1
.
My main problem is that I have a really big dependency problem, as the iteration i+1
depends on the carry of the iteration i
, as long as I use base 10. So if A[0]=6
and B[0]=5
, C[0]
must be 1
and I have a carry of 1
for the next addition.
我能做的更快的代码是这样的:
The faster code I could do was this one:
void LongNumAddition1(unsigned char *Vin1, unsigned char *Vin2,
unsigned char *Vout, unsigned N) {
for (int i = 0; i < N; i++) {
Vout[i] = Vin1[i] + Vin2[i];
}
unsigned char carry = 0;
for (int i = 0; i < N; i++) {
Vout[i] += carry;
carry = Vout[i] / 10;
Vout[i] = Vout[i] % 10;
}
}
但是我也尝试了这些方法,结果却变得更慢:
But I also tried these approaches which turned out being slower:
void LongNumAddition1(unsigned char *Vin1, unsigned char *Vin2,
unsigned char *Vout, unsigned N) {
unsigned char CARRY = 0;
for (int i = 0; i < N; i++) {
unsigned char R = Vin1[i] + Vin2[i] + CARRY;
Vout[i] = R % 10; CARRY = R / 10;
}
}
void LongNumAddition1(char *Vin1, char *Vin2, char *Vout, unsigned N) {
char CARRY = 0;
for (int i = 0; i < N; i++) {
char R = Vin1[i] + Vin2[i] + CARRY;
if (R <= 9) {
Vout[i] = R;
CARRY = 0;
} else {
Vout[i] = R - 10;
CARRY = 1;
}
}
}
我一直在Google中进行研究,发现一些伪代码与我已经实现的伪代码相似,而且在GeeksforGeeks内部,还有另一种实现此问题的方法,但它也较慢.
I've been researching in google and found some pseudocodes which were similar to what I have implemented, also inside GeeksforGeeks there's another implementation to this problem but it is also slower.
你能帮我吗?
推荐答案
如果您不想更改数据格式,可以尝试SIMD.
If you don't want to change the format of the data, you can try SIMD.
typedef uint8_t u8x16 __attribute__((vector_size(16)));
void add_digits(uint8_t *const lhs, uint8_t *const rhs, uint8_t *out, size_t n) {
uint8_t carry = 0;
for (size_t i = 0; i + 15 < n; i += 16) {
u8x16 digits = *(u8x16 *)&lhs[i] + *(u8x16 *)&rhs[i] + (u8x16){carry};
// Get carries and almost-carries
u8x16 carries = digits >= 10; // true is -1
u8x16 full = digits == 9;
// Shift carries
carry = carries[15] & 1;
__uint128_t carries_i = ((__uint128_t)carries) << 8;
carry |= __builtin_add_overflow((__uint128_t)full, carries_i, &carries_i);
// Add to carry chains and wrap
digits += (((u8x16)carries_i) ^ full) & 1;
// faster: digits = (u8x16)_mm_min_epu8((__m128i)digits, (__m128i)(digits - 10));
digits -= (digits >= 10) & 10;
*(u8x16 *)&out[i] = digits;
}
}
这是每位数〜2条指令.您需要添加代码来处理尾端.
This is ~2 instructions per digit. You'll need to add code to handle the tail-end.
这是算法的一个贯穿部分.
Here's a run-through of the algorithm.
首先,我们将数字与最后一次迭代的进位相加:
First, we add our digits with our carry from the last iteration:
lhs 7 3 5 9 9 2
rhs 2 4 4 9 9 7
carry 1
+ -------------------------
digits 9 7 9 18 18 10
我们计算哪些数字将产生进位(≥10),哪些将传播(= 9).无论出于何种原因,SIMD的值为-1.
We calculate which digits will produce carries (≥10), and which would propagate them (=9). For whatever reason, true is -1 with SIMD.
carries 0 0 0 -1 -1 -1
full -1 0 -1 0 0 0
我们将carries
转换为整数并将其移位,还将full
转换为整数.
We convert carries
to an integer and shift it over, and also convert full
to an integer.
_ _ _ _ _ _
carries_i 000000001111111111110000
full 111100001111000000000000
现在,我们可以将它们加在一起以传播进位.请注意,只有最低位是正确的.
Now we can add these together to propagate carries. Note that only the lowest bit is correct.
_ _ _ _ _ _
carries_i 111100011110111111110000
(relevant) ___1___1___0___1___1___0
有两个指标需要注意:
-
carries_i
的最低位设置为digit ≠ 9
.这个广场上有一个提包.
carries_i
has its lowest bit set, anddigit ≠ 9
. There has been a carry into this square.
carries_i
的最低位设置为 un ,digit = 9
.在这个方块上有一个进位,重置了位.
carries_i
has its lowest bit unset, and digit = 9
. There has been a carry over this square, resetting the bit.
我们用(((u8x16)carries_i) ^ full) & 1
计算得出,然后加到digits
.
We calculate this with (((u8x16)carries_i) ^ full) & 1
, and add to digits
.
(c^f) & 1 0 1 1 1 1 0
digits 9 7 9 18 18 10
+ -------------------------
digits 9 8 10 19 19 10
然后我们删除所有已经携带的10s.
Then we remove the 10s, which have all been carried already.
digits 9 8 10 19 19 10
(d≥10)&10 0 0 10 10 10 10
- -------------------------
digits 9 8 0 9 9 0
我们还跟踪可能发生在两个地方的执行情况.
We also keep track of carries out, which can happen in two places.
这篇关于有没有办法使此功能更快? (C)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!