Problem description
I have a large csr_matrix and I want to sum groups of its rows to obtain a new csr_matrix with the same number of columns but fewer rows. (Context: the matrix is a document-term matrix obtained from sklearn's CountVectorizer, and I want to be able to quickly combine documents according to codes associated with them.)
For a minimal example, this is my matrix:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())
[[1 0 0 0 0]
[0 0 3 0 0]
[0 5 0 0 0]
[4 0 0 0 0]
[0 0 2 0 0]]
Now let's say I want a new matrix B in which rows (1, 4) and (2, 3, 5) are combined by summing them, which would look something like this:
[[5 0 0 0 0]
[0 5 5 0 0]]
The result should again be in sparse format (because the real data I'm working with is large). I tried to sum over slices of the matrix and then stack them:
idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))
But this gives me the summed-up values only for the non-zero columns in each slice, so I can't combine it with the other slices because the summed slices have different numbers of columns.
I feel like there must be an easy way to do this, but I couldn't find any discussion of it online or in the documentation. What am I missing?
Thanks for your help.
Recommended answer
Note that you can do this by carefully constructing another matrix and multiplying by it. Here's how it would work for a dense matrix:
>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
       [0, 5, 5, 0, 0]])
The sparse version is only a little more complicated. The information about which rows should be summed together is encoded in row:
col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
# matrix product; for csr_matrix operands, S * A is equivalent
result = S.dot(A)
# check that the result is another sparse matrix
print(type(result))
# check that the values are the ones we want
print(result.toarray())
Output:
<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
[0 5 5 0 0]]
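As a cross-check, the slice-and-stack idea from the question can be made to agree with the matrix product. The sketch below assumes zero-based row indices and sums each slice with axis=0 (down the rows), so every partial sum keeps all five columns. The names groups, B_slices and B_product are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# Zero-based equivalents of rows (1, 4) and (2, 3, 5) from the question.
groups = [[0, 3], [1, 2, 4]]

# Sum each slice down the rows (axis=0); each result is a dense 1 x 5 row,
# so it is wrapped in csr_matrix before stacking.
B_slices = vstack([csr_matrix(A[idx, :].sum(axis=0)) for idx in groups]).tocsr()

# The indicator-matrix product from the answer, for comparison.
S = csr_matrix(
    (np.ones(5, dtype=int), ([0, 1, 1, 0, 1], np.arange(5))), shape=(2, 5)
)
B_product = S.dot(A)

print(B_slices.toarray())
print(np.array_equal(B_slices.toarray(), B_product.toarray()))
```

The product form avoids the Python-level loop over groups, which is likely to matter when there are many groups.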
You can handle more rows in your output by including higher values in row and extending the shape of S accordingly.
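For the use case in the question, where each document carries a code, row can be derived from the codes instead of being written by hand. The following is a minimal sketch under that assumption. The helper name sum_rows_by_group and the example codes are hypothetical, not part of scipy. It uses np.unique with return_inverse=True to map arbitrary codes to group numbers 0..k-1.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sum_rows_by_group(A, codes):
    """Hypothetical helper: sum the rows of sparse matrix A that share a code.

    Returns (labels, B), where B[k] is the sum of all rows whose code is labels[k].
    """
    codes = np.asarray(codes)
    n_rows = A.shape[0]
    if codes.shape != (n_rows,):
        raise ValueError("codes must be 1-D with one entry per row of A")
    # labels: sorted unique codes; group[i]: output row for input row i
    labels, group = np.unique(codes, return_inverse=True)
    # S has a single 1 per column: S[group[i], i] = 1
    S = csr_matrix(
        (np.ones(n_rows, dtype=A.dtype), (group, np.arange(n_rows))),
        shape=(len(labels), n_rows),
    )
    return labels, S.dot(A)


row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# Made-up codes: rows 0 and 3 share "a", rows 1, 2 and 4 share "b".
labels, B = sum_rows_by_group(A, ["a", "b", "b", "a", "b"])
print(labels)
print(B.toarray())
```

Because np.unique sorts the codes, the rows of B follow the sorted order of the distinct codes, and labels records which code each output row corresponds to.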