Problem description
I have a set of data and want to make a histogram of it. I need the bins to have the same size, by which I mean that they must contain the same number of objects, rather than being equally spaced as in the more common case (numpy.histogram). This will naturally come at the expense of the bin widths, which can, and in general will, be different.
I will specify the number of desired bins and the data set, and get the bin edges in return.
Example:
data = numpy.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = somefunc(data, nbins=3)
print(bins_edges)
>> [1.,1.3,2.1,2.12]
So the bins all contain 2 points, but their widths (0.3, 0.8, 0.02) are different.
There are two limitations:
- if a group of data points is identical, the bin containing them could be bigger.
- if there are N data points and M bins are requested, each bin holds N/M points (integer division), and there is one extra bin for the leftover points if N%M is not 0.
This piece of code is some cruft I've written, which worked nicely for small data sets. What if I have 10**9+ points and want to speed up the process?
import numpy as np

def def_equbin(in_distr, binsize=None, bin_num=None):
    # Note: binsize is accepted but never used.
    distr_size = len(in_distr)

    # Points per full bin (integer division) and leftover points.
    bin_size = distr_size // bin_num
    odd_bin_size = distr_size % bin_num

    # Indices that would sort the data.
    args = in_distr.argsort()

    # One row per bin, each holding that bin's sorted values.
    hist = np.zeros((bin_num, bin_size))

    for i in range(bin_num):
        hist[i, :] = in_distr[args[i * bin_size: (i + 1) * bin_size]]

    if odd_bin_size == 0:
        odd_bin = None
        bins_limits = np.arange(bin_num) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = np.concatenate((in_distr[bins_limits],
                                      [in_distr[args[-1]]]))
    else:
        # The leftover points form one extra, smaller bin.
        odd_bin = in_distr[args[bin_num * bin_size:]]
        bins_limits = np.arange(bin_num + 1) * bin_size
        bins_limits = args[bins_limits]
        bins_limits = in_distr[bins_limits]
        bins_limits = np.concatenate((bins_limits, [in_distr[args[-1]]]))

    return (hist, odd_bin, bins_limits)
Recommended answer
Using your example case (bins of 2 points, 6 total data points):
from scipy import stats
bin_edges = stats.mstats.mquantiles(data, [0, 2./6, 4./6, 1])
>> array([1. , 1.24666667, 2.05333333, 2.12])
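The same idea can be generalized to any number of bins by generating the probabilities instead of typing them by hand. The following is a minimal sketch, not part of the original answer: the helper name equal_count_edges is hypothetical, and it uses numpy.quantile rather than mquantiles. numpy.quantile interpolates linearly by default, so the inner edges will differ slightly from the mquantiles values above, while still separating the same points.

```python
import numpy as np

def equal_count_edges(data, nbins):
    """Return nbins + 1 edges so that each bin holds about len(data) / nbins points.

    Hypothetical helper for illustration only.
    """
    data = np.asarray(data, dtype=float)
    # Evenly spaced probabilities 0, 1/nbins, ..., 1 pick out the edge quantiles.
    probs = np.linspace(0.0, 1.0, nbins + 1)
    return np.quantile(data, probs)

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
edges = equal_count_edges(data, nbins=3)

# Feed the edges back to numpy.histogram to check the occupancy of each bin.
counts, _ = np.histogram(data, bins=edges)
print(edges)
print(counts)
```

Passing the edges back to numpy.histogram is a quick way to confirm that every bin holds the same number of points. With many repeated values the edges can coincide, which is the first limitation mentioned in the question.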
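On the question of speed for very large inputs, one option that may help is to avoid the full sort. The sketch below is an assumption on my part, not something from the original answer: it uses numpy.partition, which only guarantees that the requested order statistics end up in their sorted positions. The helper name equal_count_edges_partition is hypothetical. Edges are taken at actual data values, as in the question's expected output, and any remainder from N % M is folded into the last bin instead of forming an extra one.

```python
import numpy as np

def equal_count_edges_partition(data, nbins):
    """Edges at actual data values, computed without fully sorting the input.

    Hypothetical helper for illustration only.
    """
    data = np.asarray(data)
    n = data.size
    # Sorted-order positions of the first point of each bin, plus the last point.
    idx = np.unique(np.append(np.arange(nbins) * (n // nbins), n - 1))
    # np.partition places the k-th smallest values at the requested positions.
    # It returns a copy; ndarray.partition would do the same in place.
    part = np.partition(data, idx)
    return part[idx]

small = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
print(equal_count_edges_partition(small, nbins=3))

# A larger random sample to check that the bins are equally populated.
rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)
edges_p = equal_count_edges_partition(sample, nbins=10)
counts_p, _ = np.histogram(sample, bins=edges_p)
print(counts_p)
```

Whether this is actually faster than argsort for 10**9 points depends on the data and the number of bins, so it is worth timing on a representative subset first. At that scale memory is also a concern, which is why the in-place ndarray.partition variant may be preferable.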