Problem description
The WebSocket spec (RFC 6455) defines unmasking data as
j = i MOD 4
transformed-octet-i = original-octet-i XOR masking-key-octet-j
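For reference, the byte-at-a-time loop the question wants to replace is a direct transcription of that formula. A minimal sketch, assuming `data[0]` is the first payload byte (mask phase j = 0); the function name `unmask_bytewise` is hypothetical. Because XOR is its own inverse, the same function masks and unmasks.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: applies j = i MOD 4,
// transformed-octet-i = original-octet-i XOR masking-key-octet-j
// to every byte of the payload, one byte per iteration.
void unmask_bytewise(uint8_t *data, size_t len, const uint8_t key[4])
{
    for (size_t i = 0; i < len; i++)
        data[i] ^= key[i % 4];   // i % 4 selects the masking-key octet j
}
```

As a usage check: with masking key `37 fa 21 3d` and masked payload `7f 9f 4d 51 58` (the single-frame masked "Hello" example in RFC 6455, section 5.7), the call turns the buffer back into `Hello`.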
where the mask is 4 bytes long and unmasking has to be applied per byte.
Is there a way to do this more efficiently than just looping over the bytes?
The server running the code can be assumed to have a Haswell CPU, and the OS is Linux with kernel > 3.2, so SSE etc. are all available. The code is written in C, but I can use asm as well if necessary.
I tried to look up the solution myself, but was unable to figure out whether there is an appropriate instruction in any of the dozens of SSE1-5/AVX/(whatever) extensions; I've lost track of the many over the years.
Thank you very much!
Edit: After rereading the spec a couple of times, it seems that it's actually only XOR'ing the data bytes with the mask bytes, which I can do 8 bytes at a time until the last few bytes. The question is still open, as I think there is probably still a way to optimize this using SSE or the like (maybe processing 16 bytes at a time, or letting the processor do the loop?).
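The 8-bytes-at-a-time approach from the edit can be sketched as follows. This is one possible version, assuming `data[0]` is the first payload byte; the name `unmask_u64` is hypothetical. The 4-byte key is written twice into an 8-byte buffer so that a single 64-bit XOR covers two full mask periods, and the last `len % 8` bytes fall back to the byte loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: 64-bit XOR for the bulk, byte loop for the tail.
void unmask_u64(uint8_t *data, size_t len, const uint8_t key[4])
{
    uint8_t key8[8];
    memcpy(key8, key, 4);          // k0 k1 k2 k3 ...
    memcpy(key8 + 4, key, 4);      // ... k0 k1 k2 k3
    uint64_t k64;
    memcpy(&k64, key8, 8);         // memory byte order preserved on any endianness

    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, data + i, 8);   // memcpy avoids unaligned-access and aliasing UB
        w ^= k64;
        memcpy(data + i, &w, 8);
    }
    for (; i < len; i++)           // tail: i % 4 still yields the correct mask phase
        data[i] ^= key[i % 4];
}
```

With optimization enabled, GCC and Clang compile the fixed-size `memcpy` calls into ordinary 8-byte loads and stores, so the copies add no overhead.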
Recommended answer
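Yes, you can XOR 16 bytes in one instruction using SSE2, or 32 bytes at a time with AVX2 (Haswell and later). Because 16 and 32 are multiples of 4, you only need to repeat the 4-byte masking key across the whole vector register; every block then starts at mask phase j = 0.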
SSE2:
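#include <stddef.h>     // size_t
#include <stdint.h>     // uint8_t, uint32_t
#include <string.h>     // memcpy
#include <emmintrin.h>  // SSE2 intrinsics
// illustrative wrapper: buff must be 16 byte aligned, N must be a multiple of 16
void unmask_sse2(uint8_t *buff, size_t N, const uint8_t key[4])
{
    uint32_t k;
    memcpy(&k, key, 4);                      // 4-byte masking key, in memory order
    __m128i v_mask = _mm_set1_epi32((int)k); // repeat the key 4 times across 16 bytes
    for (size_t i = 0; i < N; i += 16)
    {
        __m128i v = _mm_load_si128((const __m128i *)&buff[i]); // load 16 bytes
        v = _mm_xor_si128(v, v_mask);                          // XOR with mask
        _mm_store_si128((__m128i *)&buff[i], v);               // store 16 unmasked bytes
    }
}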
AVX2:
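#include <stddef.h>     // size_t
#include <stdint.h>     // uint8_t, uint32_t
#include <string.h>     // memcpy
#include <immintrin.h>  // AVX2 intrinsics (compile with -mavx2)
// illustrative wrapper: N must be a multiple of 32; unaligned loads/stores
// are used, so buff needs no alignment, but 32 byte alignment is preferable
void unmask_avx2(uint8_t *buff, size_t N, const uint8_t key[4])
{
    uint32_t k;
    memcpy(&k, key, 4);                         // 4-byte masking key, in memory order
    __m256i w_mask = _mm256_set1_epi32((int)k); // repeat the key 8 times across 32 bytes
    for (size_t i = 0; i < N; i += 32)
    {
        __m256i w = _mm256_loadu_si256((const __m256i *)&buff[i]); // load 32 bytes
        w = _mm256_xor_si256(w, w_mask);                           // XOR with mask
        _mm256_storeu_si256((__m256i *)&buff[i], w);               // store 32 unmasked bytes
    }
}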
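The loops in the answer require the length to be a multiple of the vector width, and the SSE2 one also requires an aligned buffer. A sketch that lifts both restrictions, shown for SSE2 (baseline on x86-64, so no extra compiler flags), uses unaligned loads/stores for the 16-byte blocks and the byte loop for the remainder. It assumes `data[0]` is the first payload byte; the name `unmask_sse2_any` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: any length, any alignment.
void unmask_sse2_any(uint8_t *data, size_t len, const uint8_t key[4])
{
    uint32_t k;
    memcpy(&k, key, 4);                          // key bytes in memory order (x86 is little-endian)
    const __m128i vkey = _mm_set1_epi32((int)k); // k0 k1 k2 k3 repeated 4 times

    size_t i = 0;
    for (; i + 16 <= len; i += 16) {             // each block starts at a multiple of 4: phase 0
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        v = _mm_xor_si128(v, vkey);
        _mm_storeu_si128((__m128i *)(data + i), v);
    }
    for (; i < len; i++)                         // remaining len % 16 bytes
        data[i] ^= key[i % 4];
}
```

On Haswell, the unaligned load/store instructions cost essentially the same as the aligned ones when the data happens to be aligned, so this version gives up little for the extra flexibility. An AVX2 variant follows the same pattern with `__m256i`, the `_mm256_*` intrinsics from the answer, a 32-byte step, and `-mavx2`.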