This article explains how to run the sklearn TFIDF vectorizer as a parallel job.

Question

How can the sklearn TFIDF vectorizer (and Count vectorizer) be run as parallel jobs? Something similar to the n_jobs=-1 parameter in other sklearn models.

Answer

This is not directly possible, because there is no way to parallelize/distribute access to the vocabulary that these vectorizers need.

To perform parallel document vectorization, use the HashingVectorizer instead. The scikit-learn docs provide an example that uses this vectorizer to train (and evaluate) a classifier in batches. A similar workflow also works for parallelization, because input terms are mapped to the same vector indices without any communication between the parallel workers.
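A minimal sketch of why this works: HashingVectorizer is stateless, so it can transform documents without fitting a vocabulary first (the documents below are illustrative placeholders).

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the quick brown fox", "jumps over the lazy dog"]

# HashingVectorizer is stateless: no fit step, no vocabulary stored.
# n_features fixes the output width, so every worker produces
# matrices with identical column indexing.
vectorizer = HashingVectorizer(n_features=2**18)
X = vectorizer.transform(docs)
print(X.shape)  # (2, 262144)
```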

Simply compute the partial term-doc matrices separately and concatenate them once all jobs are done. At that point you can also run TfidfTransformer on the concatenated matrix.
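The workflow above could be sketched with joblib for the parallel jobs (the corpus and chunking scheme are placeholder assumptions; note that interleaved chunking reorders the rows, so use contiguous slices if document order matters):

```python
from joblib import Parallel, delayed
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

corpus = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "the dog barks",
    "the fox runs",
]

def vectorize_chunk(chunk):
    # Each worker builds its own stateless vectorizer; as long as
    # n_features matches, the column indexing is identical everywhere.
    # alternate_sign=False and norm=None keep raw non-negative counts
    # suitable for TF-IDF weighting afterwards.
    vec = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)
    return vec.transform(chunk)

# Vectorize the chunks in parallel jobs.
chunks = [corpus[i::2] for i in range(2)]
parts = Parallel(n_jobs=2)(delayed(vectorize_chunk)(c) for c in chunks)

# Concatenate the partial term-doc matrices, then apply TF-IDF weighting.
X_counts = vstack(parts)
X_tfidf = TfidfTransformer().fit_transform(X_counts)
print(X_tfidf.shape)  # (4, 262144)
```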

The most significant drawback of not storing a vocabulary of input terms is that it is difficult to find out which terms are mapped to which column in the final matrix (i.e. the inverse transform). The only efficient mapping is to apply the hashing function to a term to see which column/index it is assigned to. For an inverse transform, you would need to do this for all unique terms (i.e. your vocabulary).
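That lookup could be sketched as follows (the vocabulary list is a placeholder; note that hash collisions can map two terms to the same column, so any such inverse mapping is lossy by construction):

```python
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2**20, alternate_sign=False)

# The only way to find a term's column is to hash the term itself.
col = vec.transform(["elephant"]).nonzero()[1][0]

# An approximate inverse transform: hash every term you know about
# and record which column each one lands in.
vocab = ["zebra", "lion", "elephant"]
inverse = {vec.transform([t]).nonzero()[1][0]: t for t in vocab}
print(inverse[col])  # elephant
```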

