python - BIC with scikit-learn's GaussianMixture overfits the number of components in an image segmentation model

I am using a GMM to segment/cluster hyperspectral image data of 800x800 pixels with 4 bands.

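I take an image and apply a GMM to cluster its pixels.
In my current case it is easy for me to determine by hand how many components the image contains (grass, water, rock, etc.).
I manually ran GMMs on the data for n_components = 3..8 and concluded that 5 components is probably the best n_components for modeling reality.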

In future applications I will need to identify automatically which n_components to use in my GMM, because determining it by hand will not be possible.

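I therefore decided to use BIC as the cost function for choosing an appropriate n_components for the model.
I ran BIC on my test data, for which I had manually determined that n_components = 5 best models reality, and found that BIC overfits my data:
it suggests that I use as many components as possible.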

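import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Flatten the 800x800 image into one row per pixel and one column per band
newdata = img_data.reshape(800 * 800, 4)
n_components = np.arange(1, 15)
BIC = np.zeros(n_components.shape)

for i, n in enumerate(n_components):
    gmm = GaussianMixture(n_components=n,
                          covariance_type='tied')
    gmm.fit(newdata)
    # Store the BIC score for this number of components
    BIC[i] = gmm.bic(newdata)

# Plot BIC against the number of components
plt.plot(n_components, BIC)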

Ideally, I would expect the BIC score to reach its minimum at 5, but as you can see above, it just keeps decreasing as n_components increases.
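For context on why the curve can keep falling: BIC is `n_parameters * ln(n_samples) - 2 * log_likelihood`, and with 640,000 pixels the penalty for one extra component is small next to the likelihood term. The sketch below counts the free parameters of a tied-covariance mixture and computes that penalty. The helper name `tied_gmm_n_parameters` is made up for illustration, and this is only a rough back-of-the-envelope check, not a diagnosis of this particular data set.

```python
import math


def tied_gmm_n_parameters(n_components, n_features):
    """Free parameters of a GMM whose components share one full covariance."""
    cov_params = n_features * (n_features + 1) // 2   # one shared covariance
    mean_params = n_components * n_features           # one mean per component
    weight_params = n_components - 1                  # weights sum to 1
    return cov_params + mean_params + weight_params


n_samples = 800 * 800   # one sample per pixel
n_features = 4          # one feature per band

# Extra BIC penalty paid for going from k to k + 1 components
params_step = (tied_gmm_n_parameters(2, n_features)
               - tied_gmm_n_parameters(1, n_features))
penalty_step = params_step * math.log(n_samples)

# BIC still decreases when the mean per-pixel log-likelihood
# improves by more than this amount
min_gain_per_pixel = penalty_step / (2 * n_samples)

print(f"penalty per extra component: {penalty_step:.2f}")
print(f"per-pixel log-likelihood gain needed: {min_gain_per_pixel:.2e}")
```

With these numbers each extra component adds roughly 67 to the penalty, so any improvement in the average per-pixel log-likelihood above about 5e-5 still lowers the BIC.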

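Does anyone know what might be going on here? Do I perhaps need to smooth the data somehow to reduce noise before using BIC? Or am I using the BIC function incorrectly?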

Best answer

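So, after some googling, I decided to apply a simple Gaussian smoothing filter to my array, and it seems to produce a local minimum in my list of BIC scores at the n_components I expected. I wrote a small script to pick out the first local minimum and use it as the parameter for my GMM model.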

import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter
from sklearn.mixture import GaussianMixture

newdata = img_data.reshape(800 * 800, 4)
# Apply a Gaussian smoothing filter (sigma = 1.5 along each axis).
# Note: on the reshaped (n_pixels, 4) array this smooths along the
# flattened pixel axis and across the 4 bands.
newdata = gaussian_filter(newdata, (1.5, 1.5))
# Create the vector of n_components values to test with the BIC
n_components = np.arange(1, 10)
# Create an empty vector in which to store the BIC scores
BIC = np.zeros(n_components.shape)

for i, n in enumerate(n_components):
    # Fit a GMM to the data for each value in n_components
    gmm = GaussianMixture(n_components=n,
                          covariance_type='tied')
    gmm.fit(newdata)
    # Store the BIC score for this value of n_components
    BIC[i] = gmm.bic(newdata)

# Plot the resulting BIC scores against n_components
plt.plot(n_components, BIC)
plt.show()

BIC Scores With Smoothing
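The script that picks the first local minimum is not shown in the answer. A minimal sketch of what it could look like follows. The function name `first_local_minimum`, the fallback to the global minimum, and the example scores are all assumptions made for illustration, not the answerer's actual code or results.

```python
def first_local_minimum(scores):
    """Return the index of the first score lower than both of its neighbours.

    Falls back to the index of the global minimum when there is no interior
    local minimum (for example, when the curve decreases monotonically).
    """
    scores = list(scores)
    for i in range(1, len(scores) - 1):
        if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]:
            return i
    return min(range(len(scores)), key=scores.__getitem__)


# Made-up BIC values for n_components = 1..9, purely for illustration
example_n_components = list(range(1, 10))
example_bic = [9.0, 7.5, 6.1, 5.2, 4.8, 5.0, 4.9, 4.7, 4.6]

best_index = first_local_minimum(example_bic)
best_n = example_n_components[best_index]
print(best_n)
```

The function also accepts a NumPy array such as the `BIC` vector above, so `n_components[first_local_minimum(BIC)]` would give the value to pass to `GaussianMixture`. Taking the first local minimum rather than the global one is the design choice described in the answer: the smoothed curve may still drift lower at larger n_components.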