This article covers how to build a NumPy array from multiple custom index ranges without an explicit loop.

Problem description

In NumPy, is there a Pythonic way to create array3 from custom ranges given by array1 and array2 without a loop? The straightforward solution of iterating over the ranges works, but since my arrays run into millions of items, I am looking for a more efficient solution (maybe syntactic sugar too).

For example:

import numpy as np

array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
# Loop-based baseline: build each range separately, then concatenate
array3 = np.concatenate([np.arange(array1[i], array2[i]) for i in
                         np.arange(0,len(array1))])

print(array3)

Result: [10,11,12,13,65,66,67,68,69,200,201,202,203].

Recommended answer

Prospective Approach

I will work backwards from the desired result to arrive at the approach.

Take the sample listed in the question. We have:

array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])

Now, look at the desired result:

result: [10,11,12,13,65,66,67,68,69,200,201,202,203]

Let's calculate the group lengths, as we will need them to explain the solution approach next.

In [58]: lens = array2 - array1

In [59]: lens
Out[59]: array([4, 5, 4])

The idea is to start with an array initialized to 1's which, when cumulatively summed across its entire length, gives the desired result. This cumulative summation is the last step of the solution. Why initialize with 1's? Because the desired output increases in steps of 1, except at the specific places where a new group begins and a shift is needed.

Now, since cumsum is the last step, the step before it should give us something like:

array([ 10,   1,   1,   1,  52,   1,   1,   1,   1, 131,   1,   1,   1])
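As a quick sanity check (a minimal sketch, with the intermediate array typed in by hand rather than computed), cumulatively summing that array should reproduce the desired result. The variable name `pre_cumsum` is only an illustrative choice.

```python
import numpy as np

# Intermediate array from the text: 1's everywhere, except at the
# position where each group starts, which holds a shift value
pre_cumsum = np.array([10, 1, 1, 1, 52, 1, 1, 1, 1, 131, 1, 1, 1])

result = pre_cumsum.cumsum()
print(result.tolist())
# [10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203]
```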

As discussed before, it is an array of 1's with [10, 52, 131] placed at specific positions. The 10 comes from the first element of array1, but what about the rest? The second value, 52, is 65 - 13 (looking at the result), where 13 is the last element of the first group, which starts at 10 and has length 4. So 65 - 10 - 4 gives 51, and adding 1 to account for the exclusive stop boundary gives 52, the desired shifting value. Similarly, 131 is 200 - 65 - 5 + 1.

Thus, those shifting-values can be computed like so:

In [62]: np.diff(array1) - lens[:-1]+1
Out[62]: array([ 52, 131])

Next, to get the shifting-places where those shifts occur, we can simply take the cumulative sum of the group lengths:

In [65]: lens[:-1].cumsum()
Out[65]: array([4, 9])

For completeness, we need to prepend 0 to the array of shifting-places and array1[0] to the array of shifting-values.

So, we are set to present our approach in a step-by-step format!

1] Get the length of each group:

lens = array2 - array1

2] Get the indices at which shifts occur and the values to be put into the 1's-initialized array:

shift_idx = np.hstack((0,lens[:-1].cumsum()))
shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))

3] Set up the 1's-initialized ID array and insert those values at the indices from the previous step:

id_arr = np.ones(lens.sum(),dtype=array1.dtype)
id_arr[shift_idx] = shift_vals

4] Finally, take the cumulative sum of the ID array:

output = id_arr.cumsum()
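Run end to end on the sample from the question, the four steps can be sketched as follows. The values in the comments are the intermediates worked out in the discussion above.

```python
import numpy as np

array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])

# Step 1: group lengths
lens = array2 - array1                               # [4, 5, 4]

# Step 2: where each group starts, and the value to place there
shift_idx = np.hstack((0, lens[:-1].cumsum()))       # [0, 4, 9]
shift_vals = np.hstack((array1[0],
                        np.diff(array1) - lens[:-1] + 1))  # [10, 52, 131]

# Step 3: 1's everywhere except at the group starts
id_arr = np.ones(lens.sum(), dtype=array1.dtype)
id_arr[shift_idx] = shift_vals

# Step 4: the cumulative sum turns the steps into the ranges
output = id_arr.cumsum()
print(output.tolist())
```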

Listed as a function, we would have:

def using_ones_cumsum(array1, array2):
    lens = array2 - array1
    shift_idx = np.hstack((0,lens[:-1].cumsum()))
    shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
    id_arr = np.ones(lens.sum(),dtype=array1.dtype)
    id_arr[shift_idx] = shift_vals
    return id_arr.cumsum()
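One way to gain confidence in the function is to compare it against the loop-based version from the question on random input. The sketch below assumes every range is non-empty (each end is strictly greater than its start); `using_loop` and the random test data are illustrative names and values, not part of the original answer.

```python
import numpy as np

def using_ones_cumsum(array1, array2):
    lens = array2 - array1
    shift_idx = np.hstack((0, lens[:-1].cumsum()))
    shift_vals = np.hstack((array1[0], np.diff(array1) - lens[:-1] + 1))
    id_arr = np.ones(lens.sum(), dtype=array1.dtype)
    id_arr[shift_idx] = shift_vals
    return id_arr.cumsum()

def using_loop(array1, array2):
    # Hypothetical helper: the question's loop-based baseline
    return np.concatenate([np.arange(s, e) for s, e in zip(array1, array2)])

rng = np.random.default_rng(0)
starts = rng.integers(0, 1000, size=1000)
ends = starts + rng.integers(1, 11, size=1000)  # lengths 1..10, never empty

same = np.array_equal(using_ones_cumsum(starts, ends),
                      using_loop(starts, ends))
print(same)
```

The starts here are unsorted and the ranges may overlap; the shift values simply become negative where a group starts below the end of the previous one.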

It works for overlapping ranges too!

In [67]: array1 = np.array([10, 11, 200])
    ...: array2 = np.array([14, 18, 204])
    ...:

In [68]: using_ones_cumsum(array1, array2)
Out[68]:
array([ 10,  11,  12,  13,  11,  12,  13,  14,  15,  16,  17, 200, 201,
       202, 203])


Runtime test

Let's time the proposed approach against the other vectorized approach in @unutbu's flatnonzero based solution, which already proved to be much better than the loopy approach:

In [38]: array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True)
    ...:                   .cumsum().reshape(2, -1, order='F'))

In [39]: %timeit using_flatnonzero(array1, array2)
1000 loops, best of 3: 889 µs per loop

In [40]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 235 µs per loop


Improvement!

Now, code-wise, NumPy doesn't like appending. So those np.hstack calls can be avoided, giving the slightly improved version listed below:

def get_ranges_arr(starts,ends):
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1],dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()
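One caveat worth noting: as written, the function appears to assume that every range is non-empty. With a zero-length range (start equal to end), two shift positions coincide and one shift value overwrites the other, and with empty input `counts_csum[-1]` raises an IndexError. A minimal sketch of a guard is shown below; `get_ranges_arr_safe` is a hypothetical wrapper, not part of the original answer, and it simply drops empty or reversed ranges before delegating.

```python
import numpy as np

def get_ranges_arr(starts, ends):
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1], dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()

def get_ranges_arr_safe(starts, ends):
    # Hypothetical wrapper: keep only ranges with at least one element
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    keep = ends > starts
    if not keep.any():
        # Nothing to generate (empty input or only empty ranges)
        return np.empty(0, dtype=int)
    return get_ranges_arr(starts[keep], ends[keep])

# The middle range (20, 20) is empty and is skipped
print(get_ranges_arr_safe([10, 20, 30], [14, 20, 34]).tolist())
# [10, 11, 12, 13, 30, 31, 32, 33]
```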

Let's time it against our original approach:

In [151]: array1,array2 = (np.random.choice(range(1, 11),size=10**4, replace=True)\
     ...:                                      .cumsum().reshape(2, -1, order='F'))

In [152]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 276 µs per loop

In [153]: %timeit get_ranges_arr(array1, array2)
10000 loops, best of 3: 193 µs per loop

So, we get a performance boost of about 30% there (276 µs down to 193 µs)!
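The timings above come from IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module, as below. The loop counts (200 runs, 3 repeats) are arbitrary choices, and absolute numbers will differ by machine and NumPy version, so no expected timings are shown.

```python
import timeit
import numpy as np

def using_ones_cumsum(array1, array2):
    lens = array2 - array1
    shift_idx = np.hstack((0, lens[:-1].cumsum()))
    shift_vals = np.hstack((array1[0], np.diff(array1) - lens[:-1] + 1))
    id_arr = np.ones(lens.sum(), dtype=array1.dtype)
    id_arr[shift_idx] = shift_vals
    return id_arr.cumsum()

def get_ranges_arr(starts, ends):
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1], dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()

# Same test data as in the timings: strictly increasing values,
# split alternately into starts (array1) and ends (array2)
array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True)
                  .cumsum().reshape(2, -1, order='F'))

# Both versions should agree before their speed is compared
agree = np.array_equal(using_ones_cumsum(array1, array2),
                       get_ranges_arr(array1, array2))
print("outputs match:", agree)

n_runs = 200
for fn in (using_ones_cumsum, get_ranges_arr):
    best = min(timeit.repeat(lambda: fn(array1, array2),
                             number=n_runs, repeat=3))
    print(f"{fn.__name__}: {best / n_runs * 1e6:.1f} us per loop")
```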


07-24 10:00