Problem Description
I want to compute all-pairs document similarity in Python using cosine similarity on TF-IDF features. My basic approach is the following:
from sklearn.feature_extraction.text import TfidfVectorizer

# c = [doc1, doc2, ..., docn]  -- the corpus as a list of document strings
vec = TfidfVectorizer()
X = vec.fit_transform(c)   # sparse (n_docs, n_features) tf-idf matrix
del vec                    # the vectorizer is no longer needed, only X is
Y = X * X.T                # pairwise cosine similarities (rows are L2-normalised by default)
This works perfectly fine, but unfortunately not for my very large dataset. X has shape (350363, 2526183), so the output matrix Y should have shape (350363, 350363). X is very sparse thanks to the tf-idf weighting and easily fits into memory (only around 2 GB). However, the multiplication gives me a memory error after running for some time (even though memory is not yet full; I suppose scipy is clever enough to anticipate the memory usage).
I have already tried playing around with the dtypes without any success. I have also made sure that numpy and scipy have their BLAS libraries linked, although this has no effect on the csr_matrix dot functionality, since it is implemented in C. I thought of maybe using something like memmap, but I am not sure about that.
Does anyone have an idea of how to best approach this?
Recommended Answer
You may want to look at the random_projection module in scikit-learn. The Johnson-Lindenstrauss lemma says that a random projection matrix is guaranteed to preserve pairwise distances up to some tolerance eta, which is a hyperparameter when you calculate the number of random projections needed.
To cut a long story short, the scikit-learn class SparseRandomProjection seen here is a transformer that does this for you. If you run it on X after vec.fit_transform, you should see a fairly large reduction in feature size.
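For illustration, a minimal sketch of what this could look like (X is the tf-idf matrix from the question; the eps and random_state values are arbitrary choices for this sketch, not taken from the answer):

from sklearn.random_projection import SparseRandomProjection

# with n_components='auto' the output dimension is derived from the
# Johnson-Lindenstrauss bound for the requested tolerance eps
srp = SparseRandomProjection(n_components='auto', eps=0.1, random_state=42)
X_small = srp.fit_transform(X)
print(X_small.shape)   # roughly (350363, 10942) for eps=0.1

Pairwise similarities can then be computed on X_small instead of X.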
The formula from sklearn.random_projection.johnson_lindenstrauss_min_dim shows that to preserve up to a 10% tolerance, you only need johnson_lindenstrauss_min_dim(350363, .1) = 10942 features. This is an upper bound, so you may be able to get away with much less. Even a 1% tolerance would only need johnson_lindenstrauss_min_dim(350363, .01) = 1028192 features, which is still significantly less than you have right now.
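As a quick check of those numbers (the printed values are the ones quoted above; exact output may vary slightly between scikit-learn versions):

from sklearn.random_projection import johnson_lindenstrauss_min_dim

print(johnson_lindenstrauss_min_dim(350363, eps=0.1))    # ~10942
print(johnson_lindenstrauss_min_dim(350363, eps=0.01))   # ~1028192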
A simple thing to try: if your data is dtype='float64', try using 'float32'. That alone can save a massive amount of space, especially if you do not need double precision.
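For example, assuming X is the sparse tf-idf matrix from the question:

import numpy as np

# cast the stored values to single precision; roughly halves the memory footprint
X = X.astype(np.float32)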
If the issue is that you cannot store the "final matrix" in memory either, I would recommend working with the data in an HDF5 store (as seen in pandas using pytables). This link has some good starter code, and you could iteratively calculate chunks of your dot product and write them to disk. I have been using this approach extensively in a recent project on a 45 GB dataset, and could provide more help if you decide to go this route.
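One possible shape for that chunked loop, sketched here with h5py rather than the pandas HDFStore the answer refers to (the file name and chunk size are arbitrary assumptions):

import h5py

n = X.shape[0]
chunk = 1000  # rows of the similarity matrix computed per iteration

with h5py.File('similarity.h5', 'w') as f:
    Y = f.create_dataset('Y', shape=(n, n), dtype='float32')
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # dense (chunk, n) block of pairwise similarities, written straight to disk
        block = (X[start:stop] * X.T).toarray().astype('float32')
        Y[start:stop, :] = block

Note that at float32 the full 350363 x 350363 matrix is on the order of 490 GB on disk, so the chunk size (and whether you really need every pair) is worth considering before going this route.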