问题描述
任务示例:
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])idx = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])
预期结果:
binned = np.array([2, 6, 3, 4, 7, 8, 1, 5, 9])
约束:
应该很快.
应该是
O(n+k)
其中 n 是数据的长度,k 是 bin 的数量.应该是稳定的,即保持 bin 内的顺序.
明显的解决方案
data[np.argsort(idx, kind='stable')]
是O(n log n)
.
O(n+k)
解
def sort_to_bins(idx, data, mx=-1):如果 mx==-1:mx = idx.max() + 1cnts = np.zeros(mx + 1, int)对于我在范围内(idx.size):cnts[idx[i] + 1] += 1对于范围内的 i(1,cnts.size):cnts[i] += cnts[i-1]res = np.empty_like(数据)对于我在范围内(数据大小):res[cnts[idx[i]]] = 数据[i]cnts[idx[i]] += 1返回资源
又乱又慢.
纯numpy
有没有更好的方法<scipy
熊猫
numba
/pythran
?
这里有几个解决方案:
无论如何使用
np.argsort
,毕竟它是快速编译的代码.使用
np.bincount
获取 bin 大小和np.argpartition
是O(n)
对于固定数量的垃圾箱.缺点:目前没有稳定的算法可用,因此我们必须对每个 bin 进行排序.使用
scipy.ndimage.measurements.labeled_comprehension
.这大致满足了要求,但不知道它是如何实现的.使用
pandas
.我是一个完整的pandas
菜鸟,所以我在这里使用groupby
拼凑的东西可能不是最理想的.使用
scipy.sparse
在压缩稀疏行和压缩稀疏列格式之间切换恰好实现了我们正在寻找的确切操作.在问题中的循环代码上使用
pythran
(我确定numba
也能正常工作).所需要做的就是在 numpy 导入后在顶部插入
.
#pythran export sort_to_bins(int[:], float[:], int)
然后编译
# pythran stb_pthr.py
基准 100 个箱子,可变数量的项目:
带回家:
如果你对 numba
/pythran
没问题,那就是要走的路,如果不是,scipy.sparse
可以很好地扩展.>
代码:
将 numpy 导入为 np从 scipy 导入稀疏从 scipy.ndimage.measurements 导入labeled_comprehension从 stb_pthr 导入 sort_to_bins 作为 sort_to_bins_pythran将熊猫导入为 pddef sort_to_bins_pandas(idx, data, mx=-1):df = pd.DataFrame.from_dict(data=data)out = np.empty_like(data)j = 0对于 df.groupby(idx).groups.values() 中的 grp:out[j:j+len(grp)] = 数据[np.sort(grp)]j += len(grp)回来def sort_to_bins_ndimage(idx, data, mx=-1):如果 mx==-1:mx = idx.max() + 1out = np.empty_like(data)j = 0定义收集(bin):非局部 jout[j:j+len(bin)] = np.sort(bin)j += len(bin)返回 0标记理解(数据,idx,np.arange(mx),收集,data.dtype,无)回来def sort_to_bins_partition(idx, data, mx=-1):如果 mx==-1:mx = idx.max() + 1返回数据[np.argpartition(idx, np.bincount(idx, None, mx)[:-1].cumsum())]def sort_to_bins_partition_stable(idx, data, mx=-1):如果 mx==-1:mx = idx.max() + 1split = np.bincount(idx, None, mx)[:-1].cumsum()srt = np.argpartition(idx, split)对于 np.split(srt, split) 中的 bin:bin.sort()返回数据[srt]def sort_to_bins_sparse(idx, data, mx=-1):如果 mx==-1:mx = idx.max() + 1返回 sparse.csr_matrix((data, idx, np.arange(len(idx)+1)), (len(idx), mx)).tocsc().datadef sort_to_bins_argsort(idx, data, mx=-1):返回数据[idx.argsort(kind='stable')]从时间导入时间exmpls = [np.random.randint(0, K, (N,)) for K, N in np.c_[np.full(16, 100), 1<
The task by example:
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])
Expected result:
binned = np.array([2, 6, 3, 4, 7, 8, 1, 5, 9])
Constraints:
Should be fast.
Should be
O(n+k)
where n is the length of data and k is the number of bins.Should be stable, i.e. order within bins is preserved.
Obvious solution
data[np.argsort(idx, kind='stable')]
is O(n log n)
.
O(n+k)
solution
def sort_to_bins(idx, data, mx=-1):
if mx==-1:
mx = idx.max() + 1
cnts = np.zeros(mx + 1, int)
for i in range(idx.size):
cnts[idx[i] + 1] += 1
for i in range(1, cnts.size):
cnts[i] += cnts[i-1]
res = np.empty_like(data)
for i in range(data.size):
res[cnts[idx[i]]] = data[i]
cnts[idx[i]] += 1
return res
is loopy and slow.
Is there a better method in pure numpy
< scipy
< pandas
< numba
/pythran
?
Here are a few solutions:
Use
np.argsort
anyway, after all it is fast compiled code.Use
np.bincount
to get the bin sizes andnp.argpartition
which isO(n)
for fixed number of bins. Downside: currently, no stable algorithm is available, thus we have to sort each bin.Use
scipy.ndimage.measurements.labeled_comprehension
. This does roughly what is required, but no idea how it is implemented.Use
pandas
. I'm a completepandas
noob, so what I cobbled together here usinggroupby
may be suboptimal.Use
scipy.sparse
switching between compressed sparse row and compressed sparse column formats happens to implement the exact operation we are looking for.Use
pythran
(I'm surenumba
works as well) on the loopy code in the question. All that is required is to insert at the top after numpy import
.
#pythran export sort_to_bins(int[:], float[:], int)
and then compile
# pythran stb_pthr.py
Benchmarks 100 bins, variable number of items:
Take home:
If you are ok with numba
/pythran
that is the way to go, if not scipy.sparse
scales rather well.
Code:
import numpy as np
from scipy import sparse
from scipy.ndimage.measurements import labeled_comprehension
from stb_pthr import sort_to_bins as sort_to_bins_pythran
import pandas as pd
def sort_to_bins_pandas(idx, data, mx=-1):
df = pd.DataFrame.from_dict(data=data)
out = np.empty_like(data)
j = 0
for grp in df.groupby(idx).groups.values():
out[j:j+len(grp)] = data[np.sort(grp)]
j += len(grp)
return out
def sort_to_bins_ndimage(idx, data, mx=-1):
if mx==-1:
mx = idx.max() + 1
out = np.empty_like(data)
j = 0
def collect(bin):
nonlocal j
out[j:j+len(bin)] = np.sort(bin)
j += len(bin)
return 0
labeled_comprehension(data, idx, np.arange(mx), collect, data.dtype, None)
return out
def sort_to_bins_partition(idx, data, mx=-1):
if mx==-1:
mx = idx.max() + 1
return data[np.argpartition(idx, np.bincount(idx, None, mx)[:-1].cumsum())]
def sort_to_bins_partition_stable(idx, data, mx=-1):
if mx==-1:
mx = idx.max() + 1
split = np.bincount(idx, None, mx)[:-1].cumsum()
srt = np.argpartition(idx, split)
for bin in np.split(srt, split):
bin.sort()
return data[srt]
def sort_to_bins_sparse(idx, data, mx=-1):
if mx==-1:
mx = idx.max() + 1
return sparse.csr_matrix((data, idx, np.arange(len(idx)+1)), (len(idx), mx)).tocsc().data
def sort_to_bins_argsort(idx, data, mx=-1):
return data[idx.argsort(kind='stable')]
from timeit import timeit
exmpls = [np.random.randint(0, K, (N,)) for K, N in np.c_[np.full(16, 100), 1<<np.arange(5, 21)]]
timings = {}
for idx in exmpls:
data = np.arange(len(idx), dtype=float)
ref = None
for x, f in (*globals().items(),):
if x.startswith('sort_to_bins_'):
timings.setdefault(x.replace('sort_to_bins_', '').replace('_', ' '), []).append(timeit('f(idx, data, -1)', globals={'f':f, 'idx':idx, 'data':data}, number=10)*100)
if x=='sort_to_bins_partition':
continue
if ref is None:
ref = f(idx, data, -1)
else:
assert np.all(f(idx, data, -1)==ref)
import pylab
for k, v in timings.items():
pylab.loglog(1<<np.arange(5, 21), v, label=k)
pylab.xlabel('#items')
pylab.ylabel('time [ms]')
pylab.legend()
pylab.show()
这篇关于将数组排序到由索引数组指定的 bin 的最有效方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!