This article covers how to vectorize a large data set with scikit-learn when the corpus does not fit in memory.
Problem Description
I have 9GB of segmented documents on my disk, and my VPS has only 4GB of memory.
How can I vectorize the whole data set without loading the entire corpus at initialization? Is there any sample code?
My code is as follows:
from sklearn.feature_extraction.text import CountVectorizer

# Reads every document fully into memory, which exceeds the 4GB available.
contents = [open('./seg_corpus/' + filename).read()
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(contents)
Recommended Answer
Try this: instead of loading all the texts into memory, pass only the file handles to the fit method, but you must specify input='file' in the CountVectorizer constructor.
from sklearn.feature_extraction.text import CountVectorizer

# Open file handles instead of reading the contents up front; fit() reads
# each document one at a time, so the whole corpus never sits in memory at once.
contents = [open('./seg_corpus/' + filename)
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words, input='file')
vectorizer.fit(contents)
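Note that fit() reads each handle to the end, so the same handles cannot be reused for a later transform. A minimal sketch of the complete workflow, assuming the filenames and stop_words variables from the question, which reopens the files for the transform step and closes them explicitly:

from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from file handles; only one document's text
# (plus the growing vocabulary) is held in memory at any time.
handles = [open('./seg_corpus/' + filename) for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words, input='file')
vectorizer.fit(handles)
for f in handles:
    f.close()

# fit() exhausted the handles, so reopen the files before transforming.
handles = [open('./seg_corpus/' + filename) for filename in filenames]
X = vectorizer.transform(handles)  # sparse matrix, one row per document
for f in handles:
    f.close()

If the corpus contains too many files to hold open at once, CountVectorizer(input='filename') accepts the paths themselves and opens and closes each file internally, which sidesteps the OS limit on open file descriptors.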