This article covers how to fit a scikit-learn GMM on a large data set; hopefully it is a useful reference for anyone facing the same problem.
Problem description
I have a large data set (I can't fit the entire data in memory), and I want to fit a GMM on it.

Can I use GMM.fit() (sklearn.mixture.GMM) repeatedly on mini-batches of the data?
Recommended answer
There is no reason to fit it repeatedly. Just randomly sample as many data points as you think your machine can compute in a reasonable time. If variation is not very high, the random sample will have approximately the same distribution as the full dataset.
import numpy as np
from sklearn.mixture import GMM

gmm = GMM(n_components=5)  # instantiate a model first; the component count is an example
# Sample row indices rather than the array itself (np.random.choice needs 1-D input);
# if the data does not fit in memory, you can randomly sample rows as you read them instead
idx = np.random.choice(len(full_dataset), size=10000, replace=False)
gmm.fit(full_dataset[idx])
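The comment above leaves open how to sample when the file cannot be loaded at all. One option, as a minimal sketch: memory-map the file and pull out only the sampled rows. The file name, dtype, and shape here are assumptions for illustration.

import numpy as np

# Hypothetical on-disk array; the name, dtype and shape are illustrative
data = np.memmap("full_dataset.dat", dtype=np.float64, mode="r",
                 shape=(50_000_000, 3))
idx = np.random.choice(len(data), size=10000, replace=False)
idx.sort()  # sorted indices make the disk reads mostly sequential
randomly_sampled = np.asarray(data[idx])  # only the sampled rows enter RAM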
Then use

gmm.predict(full_dataset)
# Again, you can predict one by one or batch by batch if you cannot read it all into memory

to classify the rest.
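One caveat: sklearn.mixture.GMM was deprecated and later removed from scikit-learn; in current versions the equivalent class is sklearn.mixture.GaussianMixture. A minimal end-to-end sketch under that API, using synthetic stand-in data and chunked prediction so only one batch is in memory at a time:

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the real data: 1M points, 3 features (illustrative)
full_dataset = np.random.randn(1_000_000, 3)

# Fit once on a 10k-row random subsample, as in the answer above
idx = np.random.choice(len(full_dataset), size=10000, replace=False)
gmm = GaussianMixture(n_components=5, random_state=0).fit(full_dataset[idx])

# Classify the full data set one chunk at a time so only one batch is in RAM
labels = np.concatenate(
    [gmm.predict(chunk) for chunk in np.array_split(full_dataset, 100)]
)

Here np.array_split stands in for however you actually read batches (for example, slices of the memmap above). Since predict classifies each point independently, chunked prediction gives exactly the same labels as a single call on the whole array.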
That concludes this article on Sklearn-GMM for large data sets; hopefully the recommended answer above is helpful.