This article covers how to fit a scikit-learn GMM on a large data set; hopefully it is a useful reference for anyone facing the same problem.
Problem description
I have a large data set (I can't fit the entire data in memory), and I want to fit a GMM on it.

Can I use GMM.fit() (sklearn.mixture.GMM) repeatedly on mini-batches of the data?
Recommended answer
There is no reason to fit it repeatedly. Just randomly sample as many data points as you think your machine can compute in a reasonable time. If variation is not very high, the random sample will have approximately the same distribution as the full dataset.
import numpy as np
from sklearn.mixture import GMM

gmm = GMM(n_components=5)  # instantiate a model first; the component count is an example
# Sample row indices rather than the array itself (np.random.choice needs 1-D input);
# if the data does not fit in memory, you can randomly sample rows as you read them instead
idx = np.random.choice(len(full_dataset), size=10000, replace=False)
gmm.fit(full_dataset[idx])
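The comment above leaves open how to sample when the file cannot be loaded at all. One option, as a minimal sketch: memory-map the file and pull out only the sampled rows. The file name, dtype, and shape here are assumptions for illustration.

import numpy as np

# Hypothetical on-disk array; the name, dtype and shape are illustrative
data = np.memmap("full_dataset.dat", dtype=np.float64, mode="r",
                 shape=(50_000_000, 3))
idx = np.random.choice(len(data), size=10000, replace=False)
idx.sort()  # sorted indices make the disk reads mostly sequential
randomly_sampled = np.asarray(data[idx])  # only the sampled rows enter RAM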
Then use

gmm.predict(full_dataset)
# Again, you can predict one by one or batch by batch if you cannot read it all into memory

to classify the rest.
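One caveat: sklearn.mixture.GMM was deprecated and later removed from scikit-learn; in current versions the equivalent class is sklearn.mixture.GaussianMixture. A minimal end-to-end sketch under that API, using synthetic stand-in data and chunked prediction so only one batch is in memory at a time:

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the real data: 1M points, 3 features (illustrative)
full_dataset = np.random.randn(1_000_000, 3)

# Fit once on a 10k-row random subsample, as in the answer above
idx = np.random.choice(len(full_dataset), size=10000, replace=False)
gmm = GaussianMixture(n_components=5, random_state=0).fit(full_dataset[idx])

# Classify the full data set one chunk at a time so only one batch is in RAM
labels = np.concatenate(
    [gmm.predict(chunk) for chunk in np.array_split(full_dataset, 100)]
)

Here np.array_split stands in for however you actually read batches (for example, slices of the memmap above). Since predict classifies each point independently, chunked prediction gives exactly the same labels as a single call on the whole array.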
That concludes this article on Sklearn-GMM for large data sets; hopefully the recommended answer above is helpful.