Problem description
Forgive me for the vague title. I honestly don't know which title would suit this question. If you have a better one, feel free to change it so that it fits the problem at hand.
Let's say result is a 2D array and values is a 1D array. values holds some values associated with the elements of result. The mapping from an element of values to a position in result is stored in x_mapping and y_mapping. A position in result can be associated with several different values. I have to find the sum of the values, grouped by these associations.
An example to make this clearer.
result array:
[[0, 0],
[0, 0],
[0, 0],
[0, 0]]
values array:
[ 1., 2., 3., 4., 5., 6., 7., 8.]
Note: Here result and values have the same number of elements, but that need not be the case. There is no relation between the two sizes at all.
x_mapping and y_mapping hold the mapping from the 1D values to the 2D result. x_mapping, y_mapping and values all have the same size.
x_mapping: [0, 1, 0, 0, 0, 0, 0, 0]
y_mapping: [0, 3, 2, 2, 0, 3, 2, 1]
Here, the 1st value (values[0]) has x = 0 and y = 0 (x_mapping[0] and y_mapping[0]) and is therefore associated with result[0, 0]. If we were counting the number of associations, the element at result[0, 0] would be 2, since the 1st and 5th values are both associated with result[0, 0]. If we are taking the sum, then result[0, 0] = values[0] + values[4], which is 6.
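The arithmetic for that single cell can be checked directly. This is only a sketch using the example mappings listed above; the boolean mask and the names count_00 and sum_00 are illustrative and not part of the original question.

```python
import numpy as np

values = np.arange(1.0, 9.0)  # [1., 2., ..., 8.]
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 1])

# Entries of `values` whose mapping is x == 0 and y == 0
mask = (x_mapping == 0) & (y_mapping == 0)

count_00 = int(mask.sum())          # number of associations with result[0, 0]
sum_00 = float(values[mask].sum())  # sum of the associated values
print(count_00, sum_00)
```

Only values[0] and values[4] satisfy the mask, which is where the count of 2 and the sum of 6 come from.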
import numpy as np

# Initialisation. No connection with the solution.
# (The outputs shown below correspond to the example mappings listed above,
# not to a random draw.)
result = np.zeros([4,2], dtype=np.int16)
values = np.linspace(start=1, stop=8, num=8)
# Mappings must be valid indices into result, so bound them by result's shape
y_mapping = np.random.randint(low=0, high=result.shape[0], size=values.shape[0])
x_mapping = np.random.randint(low=0, high=result.shape[1], size=values.shape[0])
# Summing the values associated with x,y (current solution.)
for i in range(values.size):
    x = x_mapping[i]
    y = y_mapping[i]
    result[-y, x] = result[-y, x] + values[i]
result
[[6, 0],
[ 6, 2],
[14, 0],
[ 8, 0]]
Failed solution; but why?
test_result = np.zeros_like(result)
test_result[-y_mapping, x_mapping] = test_result[-y_mapping, x_mapping] + values # solution
To my surprise, elements are overwritten in test_result. The values in test_result are:
[[5, 0],
[6, 2],
[7, 0],
[8, 0]]
Question
1. Why is every element overwritten in the second solution?
As @Divakar pointed out in a comment on his answer, NumPy does not assign accumulated/summed values when indices are repeated in test_result[-y_mapping, x_mapping] = ... . The right-hand side is computed once from the original contents, so only one of the repeated writes survives, and which one should not be relied upon.
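A smaller 1D case may make the overwriting easier to see. This is a minimal sketch, not taken from the original post; the arrays a and idx are made up for illustration.

```python
import numpy as np

a = np.zeros(3, dtype=int)
idx = np.array([0, 0, 1])  # index 0 appears twice

# Step by step, `a[idx] = a[idx] + 1` does the following:
tmp = a[idx]   # a copy of the selected elements: [0, 0, 0]
tmp = tmp + 1  # [1, 1, 1]
a[idx] = tmp   # both writes to index 0 carry the same value, 1
print(a)
```

Index 0 ends up incremented once rather than twice, because each write is based on the original contents rather than on the result of the previous write.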
Approach #2 in @Divakar's answer gives me good results. For 23315 associations, the for loop took 50 ms and Approach #1 took 1.85 ms. Beating both, Approach #2 took 668 µs.
I'm using NumPy version 1.14.3 with Python 3.5.2 on an i7 processor.
Recommended answer
Approach #1
The most intuitive option for those repeated indices would be np.add.at -
# Pass the indices as a tuple of arrays, one array per axis
np.add.at(result, (-y_mapping, x_mapping), values)
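As a self-contained check, here is the same call run on the example mappings from the question. One assumption is made purely for this sketch: result is created with the same float dtype as values, so that no float-to-integer cast is involved (the question uses int16).

```python
import numpy as np

values = np.arange(1.0, 9.0)  # [1., 2., ..., 8.]
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 1])

# Assumption for this sketch: a float result, matching the dtype of `values`
result = np.zeros((4, 2), dtype=values.dtype)

# Unbuffered in-place add: repeated (row, column) pairs accumulate
np.add.at(result, (-y_mapping, x_mapping), values)
print(result)
```

The printed array should hold the same sums as the for loop in the question: 6 at [0, 0], 6 and 2 in row 1, 14 at [2, 0] and 8 at [3, 0].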
Approach #2
We need to perform binned summations, since the x, y indices may repeat. Hence, another way is to use NumPy's binned-summation function, np.bincount, with an implementation like so -
# Get linear index equivalents off the x and y indices into result array
m,n = result.shape
out_dtype = result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping
# Get binned summations off values based on linear index as bins
binned_sums = np.bincount(lidx, values, minlength=m*n)
# Finally add into result array
result += binned_sums.astype(result.dtype).reshape(m,n)
If you are always starting off with a zeros array for result, the last step can be made more performant with -
result = binned_sums.astype(out_dtype).reshape(m,n)
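To sanity-check the bincount approach against the plain loop, the steps above can be wrapped in a small function. This is a sketch only: binned_sum is a hypothetical helper name, the random test data is made up, and a zero-initialised float result is assumed.

```python
import numpy as np

def binned_sum(shape, y_mapping, x_mapping, values):
    """Hypothetical helper: sum `values` into a zero array of `shape`."""
    m, n = shape
    # Wrap the negated row indices the same way result[-y, x] does,
    # then flatten (row, column) into a linear index
    lidx = ((-y_mapping) % m) * n + x_mapping
    return np.bincount(lidx, weights=values, minlength=m * n).reshape(m, n)

# Made-up test data
rs = np.random.RandomState(0)
m, n, k = 4, 2, 1000
values = rs.random_sample(k)
y_mapping = rs.randint(0, m, size=k)
x_mapping = rs.randint(0, n, size=k)

# Reference result from the straightforward loop
expected = np.zeros((m, n))
for i in range(k):
    expected[-y_mapping[i], x_mapping[i]] += values[i]

got = binned_sum((m, n), y_mapping, x_mapping, values)
print(np.allclose(got, expected))
```

The modulo on the negated row index matters because np.bincount only accepts non-negative bins, while result[-y, x] relies on negative indexing.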