This article covers how to vectorize a large data set with scikit-learn when the corpus does not fit in memory.
Problem Description
I have 9GB of segmented documents on my disk, and my VPS has only 4GB of memory.
How can I vectorize the whole data set without loading the entire corpus at initialization? Is there any sample code?
My code is as follows:
from sklearn.feature_extraction.text import CountVectorizer

# Reads every document fully into memory, which exceeds the 4GB available.
contents = [open('./seg_corpus/' + filename).read()
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(contents)
Recommended Answer
Try this: instead of loading all the texts into memory, pass only the file handles to the fit method, but you must specify input='file' in the CountVectorizer constructor.
from sklearn.feature_extraction.text import CountVectorizer

# Open file handles instead of reading the contents up front; fit() reads
# each document one at a time, so the whole corpus never sits in memory at once.
contents = [open('./seg_corpus/' + filename)
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words, input='file')
vectorizer.fit(contents)
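Note that fit() reads each handle to the end, so the same handles cannot be reused for a later transform. A minimal sketch of the complete workflow, assuming the filenames and stop_words variables from the question, which reopens the files for the transform step and closes them explicitly:

from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from file handles; only one document's text
# (plus the growing vocabulary) is held in memory at any time.
handles = [open('./seg_corpus/' + filename) for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words, input='file')
vectorizer.fit(handles)
for f in handles:
    f.close()

# fit() exhausted the handles, so reopen the files before transforming.
handles = [open('./seg_corpus/' + filename) for filename in filenames]
X = vectorizer.transform(handles)  # sparse matrix, one row per document
for f in handles:
    f.close()

If the corpus contains too many files to hold open at once, CountVectorizer(input='filename') accepts the paths themselves and opens and closes each file internally, which sidesteps the OS limit on open file descriptors.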