How to initialize a gensim corpus variable with a csr_matrix?

Problem description

I have X, a csr_matrix that I obtained using scikit-learn's TF-IDF vectorizer, and y, which is an array.

My plan is to create features using LDA. However, I could not find out how to initialize a gensim corpus variable with X as a csr_matrix. In other words, I don't want to download a corpus as shown in gensim's documentation, nor convert X to a dense matrix, since that would consume a lot of memory and the computer could hang.

In short, my questions are the following:

  1. If I have a csr_matrix (sparse) that represents the whole corpus, how do I initialize a gensim corpus from it?
  2. How do I use LDA to extract features?

Recommended answer

Gensim has a semi-well-hidden class that can more or less do this for you:

http://radimrehurek.com/gensim/matutils.html#gensim.matutils.Sparse2Corpus

"class gensim.matutils.Sparse2Corpus(sparse, documents_columns=True) Convert a matrix in scipy.sparse format into a streaming gensim corpus."

I've had some success with it using a corpus extracted with CountVectorizer and then loaded into gensim.
