Vectorization of a loop over arrays from Cython


Problem description

Consider the following example of doing an in-place add on a Cython memoryview:

    #cython: boundscheck=False, wraparound=False, initializedcheck=False, nonecheck=False, cdivision=True
    from libc.stdlib cimport malloc, free
    from libc.stdio cimport printf
    cimport numpy as np
    import numpy as np


    cdef extern from "time.h":
        int clock()


    cdef void inplace_add(double[::1] a, double[::1] b):
        cdef int i
        for i in range(a.shape[0]):
            a[i] += b[i]


    cdef void inplace_addlocal(double[::1] a, double[::1] b):
        cdef int i, n = a.shape[0]
        for i in range(n):
            a[i] += b[i]


    def main(int N):
        cdef:
            int rep = 1000000, i
            double* pa = <double*>malloc(N * sizeof(double))
            double* pb = <double*>malloc(N * sizeof(double))
            double[::1] a = <double[:N]>pa
            double[::1] b = <double[:N]>pb
            int start
        start = clock()
        for i in range(N):
            a[i] = b[i] = 1. / (1 + i)
        for i in range(rep):
            inplace_add(a, b)
        printf("loop %i\n", clock() - start)
        print(np.asarray(a)[:4])
        start = clock()
        for i in range(N):
            a[i] = b[i] = 1. / (1 + i)
        for i in range(rep):
            inplace_addlocal(a, b)
        printf("loop_local %i\n", clock() - start)
        print(np.asarray(a)[:4])

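With these Cython directives, the seemingly equivalent inplace_add and inplace_addlocal both compile to tight C loops. But for N=128 (the approximate size I'm expecting), inplace_addlocal is twice (!) as fast as inplace_add after compilation with gcc -Ofast (and a directly written C function taking (int, double*, double*) is more or less as fast as inplace_addlocal, with or without #pragma omp simd). Passing -fopt-info to gcc shows that inplace_addlocal gets vectorized, but inplace_add does not.

Is this an issue with the C code that Cython generates (i.e., gcc truly cannot infer whatever guarantees it needs to vectorize the code), with gcc (i.e., some optimization is missing), or something else?

Thanks.

(cross-posted to cython-users)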
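
For reference, the hand-written C baseline mentioned in the question is not shown there. A minimal sketch of what such a function might look like follows; the name inplace_add_c is hypothetical, and only the (int, double*, double*) signature comes from the question.

```c
/* Hypothetical baseline: in-place add with an int length and raw pointers.
 * The question only states the (int, double*, double*) signature; the body
 * here is an assumption about what the asker wrote. */
void inplace_add_c(int n, double *a, double *b)
{
    int i;
    /* A "#pragma omp simd" line could be placed directly above this loop
     * when compiling with gcc -fopenmp-simd; it is left out here so the
     * sketch builds cleanly without extra flags. */
    for (i = 0; i < n; i++) {
        a[i] += b[i];
    }
}
```

Here the loop counter and the bound are both int, the same combination that inplace_addlocal ends up with.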

Solution
The only difference in the generated C code is that in inplace_addlocal the end variable of the loop is an int, while in inplace_add it is a Py_ssize_t. Since your loop counter is an int, the inplace_add version incurs additional overhead from casting between the two types each time the comparison is performed.

inplace_add (relevant section):

    Py_ssize_t __pyx_t_1;
    int __pyx_t_2;
    int __pyx_t_3;
    int __pyx_t_4;
    __pyx_t_1 = (__pyx_v_a.shape[0]);
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {
      __pyx_v_i = __pyx_t_2;

inplace_addlocal (relevant section):

    int __pyx_t_1;
    int __pyx_t_2;
    int __pyx_t_3;
    int __pyx_t_4;
    __pyx_v_n = (__pyx_v_a.shape[0]);
    __pyx_t_1 = __pyx_v_n;
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {
      __pyx_v_i = __pyx_t_2;

Another answer mentions that it is preferable to use Py_ssize_t for indices (and it must be assumed by default in Cython), which would solve this problem.
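
To experiment with this outside Cython, the two loop shapes can be reduced to plain C. This is a minimal sketch, not the generated code: the function names add_mixed and add_matched are hypothetical, and ptrdiff_t stands in for Py_ssize_t (an assumption; both are signed and pointer-sized on common platforms).

```c
#include <stddef.h>

/* Same shape as the generated inplace_add loop: the counter is an int,
 * while the bound has a wider signed type (ptrdiff_t stands in for
 * Py_ssize_t here, which is an assumption of this sketch). */
void add_mixed(double *a, const double *b, ptrdiff_t n)
{
    int i;
    for (i = 0; i < n; i += 1) {
        a[i] += b[i];
    }
}

/* Counter and bound share one type, which is what the loop looks like
 * once the index is declared with the same type as the shape. */
void add_matched(double *a, const double *b, ptrdiff_t n)
{
    ptrdiff_t i;
    for (i = 0; i < n; i += 1) {
        a[i] += b[i];
    }
}
```

Both functions compute the same result. Compiling with gcc -Ofast -fopt-info-vec should report which of the two loops gets vectorized with a given compiler version. In the Cython source, the matched form corresponds to declaring the counter as cdef Py_ssize_t i instead of cdef int i.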