Problem Description
TfidfVectorizer uses a lot of memory: vectorizing 100k documents (about 470 MB of text) takes over 6 GB, so scaling up to 21 million documents would not fit in the 60 GB of RAM we have.

So we switched to HashingVectorizer, but we still need to know how to distribute the hashing vectorizer. Its fit and partial_fit appear to do nothing, so how should we work with a huge corpus?
Recommended Answer
I would strongly recommend using the HashingVectorizer. The HashingVectorizer is data independent; only the parameters from vectorizer.get_params() matter. Hence (un)pickling a HashingVectorizer instance should be very fast.
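A minimal sketch of that point (the example strings below are made up for illustration): because a HashingVectorizer stores no vocabulary, it can transform text without any fit step, and the pickled object contains only its constructor parameters.

```python
import pickle
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)

# No fit step is needed: transform can be called directly on any batch of text.
X = vectorizer.transform(["first toy document", "second toy document"])
print(X.shape)  # (2, 1048576) sparse matrix

# All of the vectorizer's state is in its parameters, so pickling is cheap.
print(len(pickle.dumps(vectorizer)))  # a few hundred bytes, independent of the data
print(vectorizer.get_params())        # the only "state" that matters
```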
The vocabulary-based vectorizers are better suited for exploratory analysis on small datasets.
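To address the "huge corpus" part of the question: a common pattern is out-of-core learning, where raw text is streamed in mini-batches, each batch is hashed on the fly, and a model that supports partial_fit is updated incrementally. Below is a hedged sketch; iter_minibatches is a hypothetical placeholder you would replace with your own corpus reader.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
clf = SGDClassifier()
all_classes = [0, 1]  # partial_fit needs the full set of classes up front

def iter_minibatches(batch_size=2):
    # Hypothetical placeholder: replace with streaming from disk or a database.
    toy_corpus = [("good product", 1), ("bad product", 0),
                  ("great value", 1), ("poor quality", 0)]
    for i in range(0, len(toy_corpus), batch_size):
        texts, labels = zip(*toy_corpus[i:i + batch_size])
        yield list(texts), list(labels)

for texts, labels in iter_minibatches():
    X = vectorizer.transform(texts)  # stateless, nothing to fit or distribute
    clf.partial_fit(X, labels, classes=all_classes)
```

Because the hashing step is stateless, the same transform can also be run independently on multiple processes or machines, which covers the "how to distribute" part of the question.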