Problem Description
In Numpy, is there a pythonic way to create array3 with custom ranges from array1 and array2 without a loop? The straightforward solution of iterating over the ranges works but since my arrays run into millions of items, I am looking for a more efficient solution (maybe syntactic sugar too).
For example,
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
array3 = np.concatenate([np.arange(array1[i], array2[i])
                         for i in np.arange(0, len(array1))])
print(array3)
result: [10,11,12,13,65,66,67,68,69,200,201,202,203]
Recommended Answer
Prospective Approach
I will go backwards on how to approach this problem.
Take the sample listed in the question. We have -
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
Now, look at the desired result -
result: [10,11,12,13,65,66,67,68,69,200,201,202,203]
Let's calculate the group lengths, as we will need them to explain the solution approach next.
In [58]: lens = array2 - array1
In [59]: lens
Out[59]: array([4, 5, 4])
The idea is to use a 1's-initialized array which, when cumulatively summed across the entire length, would give us the desired result. This cumulative summation would be the last step of our solution. Why initialized with 1's? Well, because we have an array that increases in steps of 1, except at specific places where we have shifts corresponding to new groups coming in.
Now, since cumsum would be the last step, the step before it should give us something like -
array([ 10, 1, 1, 1, 52, 1, 1, 1, 1, 131, 1, 1, 1])
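As a quick sanity check (just a sketch, assuming NumPy imported as np and using an illustrative variable name pre), cumulatively summing that array reproduces the desired result -

import numpy as np

# Pre-cumsum array discussed above; its running sum rebuilds the target output.
pre = np.array([10, 1, 1, 1, 52, 1, 1, 1, 1, 131, 1, 1, 1])
print(pre.cumsum())
# [ 10  11  12  13  65  66  67  68  69 200 201 202 203]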
As discussed before, it's 1's filled with [10, 52, 131] at specific places. That 10 seems to be coming from the first element in array1, but what about the rest? The second one, 52, came in as 65-13 (looking at the result), where 13 is the last element of the group that started with 10 and ran for the length of the first group, 4. So, if we do 65 - 10 - 4, we get 51, and then add 1 to accommodate the boundary stop, giving us 52, which is the desired shifting value. Similarly, we would get 131.
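Here is a minimal sketch of that arithmetic (the group lengths 4 and 5 come from lens computed earlier) -

# Shift value at the second group: 65 - 10 - 4 + 1
print(65 - 10 - 4 + 1)   # 52
# Shift value at the third group: 200 - 65 - 5 + 1
print(200 - 65 - 5 + 1)  # 131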
Thus, those shifting-values could be computed, like so -
In [62]: np.diff(array1) - lens[:-1]+1
Out[62]: array([ 52, 131])
Next up, to get those shifting-places where such shifts occur, we can simply do a cumulative summation on the group lengths -
In [65]: lens[:-1].cumsum()
Out[65]: array([4, 9])
For completeness, we need to pre-append 0 to the array of shifting-places and array1[0] to the shifting-values.
So, we are set to present our approach in a step-by-step format!
1] Get lengths of each group :
lens = array2 - array1
2] Get indices at which shifts occur and the values to be put into the 1's-initialized array :
shift_idx = np.hstack((0,lens[:-1].cumsum()))
shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
3] Setup the 1's-initialized ID array for inserting those values at the indices listed in the step before :
id_arr = np.ones(lens.sum(),dtype=array1.dtype)
id_arr[shift_idx] = shift_vals
4] Finally do cumulative summation on the ID array :
output = id_arr.cumsum()
Listed in a function format, we would have -
def using_ones_cumsum(array1, array2):
lens = array2 - array1
shift_idx = np.hstack((0,lens[:-1].cumsum()))
shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
id_arr = np.ones(lens.sum(),dtype=array1.dtype)
id_arr[shift_idx] = shift_vals
return id_arr.cumsum()
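A quick usage sketch on the question's sample input (assuming NumPy is imported as np); the output should match the loopy concatenate result -

array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
print(using_ones_cumsum(array1, array2))
# [ 10  11  12  13  65  66  67  68  69 200 201 202 203]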
This works for overlapping ranges as well!
In [67]: array1 = np.array([10, 11, 200])
...: array2 = np.array([14, 18, 204])
...:
In [68]: using_ones_cumsum(array1, array2)
Out[68]:
array([ 10, 11, 12, 13, 11, 12, 13, 14, 15, 16, 17, 200, 201,
202, 203])
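As a cross-check (just a sketch, with ref as an illustrative name), the loopy concatenate approach from the question produces the same output on this overlapping sample -

# Reference result built the straightforward way, range by range.
ref = np.concatenate([np.arange(a, b) for a, b in zip(array1, array2)])
print(np.array_equal(ref, using_ones_cumsum(array1, array2)))  # True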
Runtime test
Let's time the proposed approach against the other vectorized approach in @unutbu's flatnonzero based solution, which already proved to be much better than the loopy approach -
In [38]: array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True)
...: .cumsum().reshape(2, -1, order='F'))
In [39]: %timeit using_flatnonzero(array1, array2)
1000 loops, best of 3: 889 µs per loop
In [40]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 235 µs per loop
Improvement!
Now, codewise NumPy doesn't like appends. So, those np.hstack calls could be avoided with a slightly improved version, as listed below -
def get_ranges_arr(starts,ends):
counts = ends - starts
counts_csum = counts.cumsum()
id_arr = np.ones(counts_csum[-1],dtype=int)
id_arr[0] = starts[0]
id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
return id_arr.cumsum()
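A quick sanity check (sketch) that this hstack-free version agrees with the earlier implementation on the question's sample -

array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
print(np.array_equal(get_ranges_arr(array1, array2),
                     using_ones_cumsum(array1, array2)))  # True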
Let's time it against our original approach -
In [151]: array1,array2 = (np.random.choice(range(1, 11),size=10**4, replace=True)\
...: .cumsum().reshape(2, -1, order='F'))
In [152]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 276 µs per loop
In [153]: %timeit get_ranges_arr(array1, array2)
10000 loops, best of 3: 193 µs per loop
So, we have a 30% performance boost there!