Problem description
I have a dataset in the following format, and I'm trying to find the kernel density estimate with the optimal bandwidth.
data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
[2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
[1, 4, 3], [5, .5, 0], [2, .5, 1.2]])
But I couldn't figure out how to approach it, or how to find the Σ matrix.
Update
I tried the KernelDensity class from the scikit-learn toolkit to compute a univariate (1D) KDE:
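import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV  # replaces the removed sklearn.grid_search

# x (samples, assumed shape (n_samples, 1)) and x_grid (1D evaluation points)
# are assumed to be defined elsewhere

# kde function
def kde_sklearn(x, x_grid, bandwidth):
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(x)
    log_pdf = kde.score_samples(x_grid[:, np.newaxis])
    return np.exp(log_pdf)

# optimal bandwidth selection
grid = GridSearchCV(KernelDensity(), {'bandwidth': np.linspace(.1, 1.0, 30)}, cv=20)
grid.fit(x)
bw = grid.best_params_['bandwidth']  # best_params_ is a dict, so extract the value

# pdf using kde
pdf = kde_sklearn(x, x_grid, bw)
fig, ax = plt.subplots()
ax.plot(x_grid, pdf, label='bw={}'.format(bw))
ax.legend(loc='best')
plt.show()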
Can anyone help me extend this to multivariate data, in this case 3D?
Recommended answer
Interesting problem. You have a few options:
- Continue with scikit-learn.
- Use a different library. For instance, if the kernel you are interested in is Gaussian, you could use scipy.stats.gaussian_kde, which is arguably easier to understand and apply. There is a very good example of this technique in this question.
- Roll your own from first principles. This is very difficult and I don't recommend it.
This blog post goes into detail about the relative merits of various library implementations of kernel density estimation (KDE).
I'm going to show you what is, in my opinion (and yes, this is a bit opinion-based), the simplest way, which I think is option 2 in your case.
NOTE: This method uses a rule of thumb, as described in the linked docs, to determine the bandwidth. The exact rule used is Scott's rule. Your mention of the Σ matrix makes me think rule-of-thumb bandwidth selection is OK for you, but you also talk about optimal bandwidth, and the example you present uses cross-validation to determine the best bandwidth. Therefore, if this method doesn't suit your purposes, let me know in the comments.
import numpy as np
from scipy import stats
data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
[2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
[1, 4, 3], [5, .5, 0], [2, .5, 1.2]])
data = data.T #The KDE takes N vectors of length K for K data points
#rather than K vectors of length N
kde = stats.gaussian_kde(data)
# You now have your kde!! Interpreting it / visualising it can be difficult with 3D data
# You might like to try 2D data first - then you can plot the resulting estimated pdf
# as the height in the third dimension, making visualisation easier.
# Here is the basic way to evaluate the estimated pdf on a regular n-dimensional mesh
# Create a regular N-dimensional grid with (arbitrary) 20 points in each dimension
minima = data.T.min(axis=0)
maxima = data.T.max(axis=0)
space = [np.linspace(mini,maxi,20) for mini, maxi in zip(minima,maxima)]
grid = np.meshgrid(*space)
#Turn the grid into N-dimensional coordinates for each point
#Note - coords will get very large as N increases...
coords = np.vstack([axis.ravel() for axis in grid])  # a list, since map() is lazy in Python 3
#Evaluate the KD estimated pdf at each coordinate
density = kde(coords)
#Do what you like with the density values here..
#plot them, output them, use them elsewhere...
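On the Σ matrix: with Scott's rule, the kernel covariance that gaussian_kde uses is the sample covariance of the data scaled by the squared Scott factor n^(-1/(d+4)). The sketch below recomputes it by hand and compares it with the fitted object's factor and covariance attributes. The names scott_factor and sigma are illustrative only, and it assumes the default unweighted case.

```python
import numpy as np
from scipy import stats

# Same sample data as above, transposed to shape (d, n) = (3, 9)
data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]]).T

kde = stats.gaussian_kde(data)  # default bw_method is Scott's rule

d, n = data.shape
scott_factor = n ** (-1.0 / (d + 4))      # Scott's rule scaling factor
sigma = np.cov(data) * scott_factor ** 2  # kernel covariance (the Σ matrix), shape (d, d)

print(kde.factor)      # factor chosen by gaussian_kde
print(kde.covariance)  # covariance matrix gaussian_kde uses for its kernel
```

If the two agree, sigma is the Σ you were asking about for this particular bandwidth rule. A different bandwidth choice only changes the scalar factor, not the shape of the matrix.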
Caveats
This may give terrible results, depending on your particular problem. Things to bear in mind:
- As the number of dimensions goes up, the number of observed data points you need goes up exponentially. Your sample data of 9 points in 3 dimensions is pretty sparse, although I assume the dots indicate that in fact you have many more.
- As mentioned in the main body, the bandwidth is selected by a rule of thumb (Scott's rule). This may result in over-smoothing (or, conceivably but less likely, under-smoothing) of the estimated pdf.
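If the rule-of-thumb bandwidth turns out to be a poor fit, one possible alternative is to pick the gaussian_kde scaling factor by leave-one-out likelihood cross-validation. This is only a sketch and not the GridSearchCV approach from the question: the helper loo_log_likelihood and the search range in factors are hypothetical choices made for illustration.

```python
import numpy as np
from scipy import stats

# Same sample data as above, transposed to shape (d, n) = (3, 9)
data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]]).T

def loo_log_likelihood(dataset, factor):
    """Sum of log densities at each point, with that point held out of the fit."""
    total = 0.0
    for i in range(dataset.shape[1]):
        train = np.delete(dataset, i, axis=1)   # all points except the i-th
        held_out = dataset[:, i:i + 1]          # the i-th point, shape (d, 1)
        kde = stats.gaussian_kde(train, bw_method=factor)
        total += kde.logpdf(held_out)[0]
    return total

factors = np.linspace(0.2, 2.0, 19)  # hypothetical search range for the factor
scores = np.array([loo_log_likelihood(data, f) for f in factors])
best_factor = factors[np.argmax(scores)]

# Refit on all the data with the selected factor
kde_cv = stats.gaussian_kde(data, bw_method=best_factor)
print(best_factor)
```

A scalar bw_method is used directly as the factor that scales the data covariance, so this still tunes a single number rather than a full bandwidth matrix. With so few points, and with duplicated rows in the sample, the selected value should be treated with caution.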