Vectorization of a loop over an array in Cython


Problem description


Consider the following example of doing an inplace-add on a Cython memoryview:

#cython: boundscheck=False, wraparound=False, initializedcheck=False, nonecheck=False, cdivision=True
from libc.stdlib cimport malloc, free
from libc.stdio cimport printf
cimport numpy as np
import numpy as np


cdef extern from "time.h":
    int clock()


cdef void inplace_add(double[::1] a, double[::1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]


cdef void inplace_addlocal(double[::1] a, double[::1] b):
    cdef int i, n = a.shape[0]
    for i in range(n):
        a[i] += b[i]


def main(int N):
    cdef:
        int rep = 1000000, i
        double* pa = <double*>malloc(N * sizeof(double))
        double* pb = <double*>malloc(N * sizeof(double))
        double[::1] a = <double[:N]>pa
        double[::1] b = <double[:N]>pb
        int start
    start = clock()
    for i in range(N):
        a[i] = b[i] = 1. / (1 + i)
    for i in range(rep):
        inplace_add(a, b)
    printf("loop %i\n", clock() - start)
    print(np.asarray(a)[:4])
    start = clock()
    for i in range(N):
        a[i] = b[i] = 1. / (1 + i)
    for i in range(rep):
        inplace_addlocal(a, b)
    printf("loop_local %i\n", clock() - start)
    print(np.asarray(a)[:4])
    # Release the malloc'd buffers; the memoryviews only borrow them.
    free(pa)
    free(pb)

With these Cython directives, the seemingly equivalent inplace_add and inplace_addlocal both compile to tight C loops. But for N=128 (the approximate size I'm expecting), inplace_addlocal is twice(!) as fast as inplace_add after compilation with gcc -Ofast. (A C function written directly, taking (int, double*, double*), is more or less as fast as inplace_addlocal, with or without #pragma omp simd.) Passing -fopt-info to gcc shows that inplace_addlocal gets vectorized, but inplace_add does not.
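
For reference, the hand-written C function mentioned above might look like the following minimal sketch. The name inplace_add_c and the const qualifier on b are assumptions for illustration; the original post only gives the argument types.

```c
#include <assert.h>

/* Hypothetical stand-in for the hand-written C function described in the
 * question: the loop counter and the loop bound are both plain int. */
void inplace_add_c(int n, double *a, const double *b)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        a[i] += b[i];   /* element-wise in-place add */
    }
}
```

Compiling a file containing this function with gcc -Ofast -fopt-info-vec should print the vectorizer's report for the loop; the exact message depends on the gcc version and target.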

Is this an issue with the C code that Cython generates (i.e., gcc truly cannot infer whatever guarantees it needs to vectorize the code), or with gcc (i.e., some optimization is missing), or something else?

Thanks.

(cross-posted to cython-users)

Solution

The only difference in the generated C code is that in inplace_addlocal the loop's end variable is an int, while in inplace_add it is a Py_ssize_t.

Since your loop counter is an int, the inplace_add version incurs additional overhead from converting between the two types each time the comparison is performed.

inplace_add (relevant section)

Py_ssize_t __pyx_t_1;
int __pyx_t_2;
int __pyx_t_3;
int __pyx_t_4;

__pyx_t_1 = (__pyx_v_a.shape[0]);
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {
  __pyx_v_i = __pyx_t_2;

inplace_addlocal (relevant section)

int __pyx_t_1;
int __pyx_t_2;
int __pyx_t_3;
int __pyx_t_4;

__pyx_v_n = (__pyx_v_a.shape[0]);
__pyx_t_1 = __pyx_v_n;
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {
  __pyx_v_i = __pyx_t_2;
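
The two loop headers above can be reproduced outside Cython to experiment with the compiler directly. The following is a minimal sketch, not the generated code itself: the names add_mixed, add_local and ssize_like_t are hypothetical, and ptrdiff_t is assumed as a stand-in for Py_ssize_t so that the file builds without Python.h.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: ptrdiff_t stands in for Py_ssize_t. */
typedef ptrdiff_t ssize_like_t;

/* Shape of inplace_add: an int counter compared against a wider bound. */
void add_mixed(double *a, const double *b, ssize_like_t shape0)
{
    assert(shape0 >= 0);
    ssize_like_t end = shape0;
    int i;
    for (i = 0; i < end; i += 1) {
        a[i] += b[i];
    }
}

/* Shape of inplace_addlocal: the bound is narrowed to int before the loop,
 * so counter and bound have the same type. */
void add_local(double *a, const double *b, ssize_like_t shape0)
{
    assert(shape0 >= 0);
    int n = (int)shape0;
    int end = n;
    int i;
    for (i = 0; i < end; i += 1) {
        a[i] += b[i];
    }
}
```

Both functions compute the same result; compiling with gcc -Ofast -fopt-info-vec should show which of the two loops a given gcc version vectorizes, which may or may not match the behaviour reported in the question.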

Another answer mentions that it is preferable to use Py_ssize_t for indices (and Cython presumably assumes it by default); declaring the counter as Py_ssize_t would give the counter and the bound the same type, which would solve this problem.
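
In Cython, the change amounts to declaring the counter with cdef Py_ssize_t i instead of cdef int i. In C terms, the resulting loop has roughly the shape sketched below; add_wide and ssize_like_t are hypothetical names, and ptrdiff_t is again assumed as a stand-in for Py_ssize_t.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: ptrdiff_t stands in for Py_ssize_t. */
typedef ptrdiff_t ssize_like_t;

/* Counter and bound share one type, so the comparison needs no conversion. */
void add_wide(double *a, const double *b, ssize_like_t shape0)
{
    assert(shape0 >= 0);
    for (ssize_like_t i = 0; i < shape0; i += 1) {
        a[i] += b[i];
    }
}
```

Whether this variant is vectorized is best confirmed with gcc -Ofast -fopt-info-vec on the compiler actually in use.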
