Problem description
I have two scipy.sparse.csr_matrix objects, 'a' and a boolean 'mask', and I want to set the elements of 'a' to zero wherever the corresponding element of 'mask' is True.
For example:
>>> a
<3x3 sparse matrix of type '<type 'numpy.int32'>'
        with 4 stored elements in Compressed Sparse Row format>
>>> a.todense()
matrix([[0, 0, 3],
        [0, 1, 5],
        [7, 0, 0]])
>>> mask
<3x3 sparse matrix of type '<type 'numpy.bool_'>'
        with 4 stored elements in Compressed Sparse Row format>
>>> mask.todense()
matrix([[ True, False,  True],
        [False, False,  True],
        [False,  True, False]], dtype=bool)
Then I want to obtain the following result:
>>> result
<3x3 sparse matrix of type '<type 'numpy.int32'>'
        with 2 stored elements in Compressed Sparse Row format>
>>> result.todense()
matrix([[0, 0, 0],
        [0, 1, 0],
        [7, 0, 0]])
I can do it with an operation like
result = a - a.multiply(mask)
or
a -= a.multiply(mask)  # In-place or copy, either is fine for me.
But I think the operations above are inefficient. The actual shape of 'a' and 'mask' is 67,108,864 × 2,000,000, so these operations take several seconds even on a high-spec server (64-core Xeon CPU, 512 GB memory). For example, when 'a' has about 30,000,000 non-zero elements and 'mask' has about 1,800,000 non-zero (True) elements, the operation above takes about 2 seconds.
Is there a more efficient way to do this?
The conditions are as follows:
- a.getnnz() != mask.getnnz()
- a.shape == mask.shape
Thanks!
Another way (tried)
a.data *= ~np.array(mask[a.astype(bool)]).flatten(); a.eliminate_zeros()  # This takes twice as long as the method above.
Recommended answer
My initial impression is that this multiply-and-subtract approach is a reasonable one. Quite often sparse code implements operations as some sort of multiplication, even when the dense equivalents use more direct methods. The sparse sum over rows or columns uses a matrix multiplication with the appropriate row or column matrix of 1s. Even row or column indexing uses matrix multiplication (at least on the csr format).
Sometimes we can improve on operations by working directly with the matrix attributes (data, indices, indptr). But that requires a lot more thought and experimentation.
For dense arrays my first try would be
In [611]: a.A*~(mask.A)
Out[611]:
array([[0, 0, 0],
[0, 1, 0],
[7, 0, 0]], dtype=int32)
But there isn't a direct way of applying not to a sparse matrix. If mask is indeed sparse, ~mask will not be. In your example mask has 4 True terms and 5 False, so a dense version would work just as well:
In [612]: nmask=sparse.csr_matrix(~(mask.A))
In [615]: a.multiply(nmask)
Out[615]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in Compressed Sparse Row format>
The earlier question "CSR scipy matrix does not update after updating its values" explores setting the diagonal of a sparse matrix to 0. It is possible to set values of the data attribute to 0, and then call eliminate_zeros once at the end.
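A minimal sketch of that pattern, using the example matrix from the question: the stored count stays the same after writing zeros into data, and only drops once eliminate_zeros is called. The condition a.data == 5 is just a stand-in for whatever selection is needed.

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.array([[0, 0, 3], [0, 1, 5], [7, 0, 0]], dtype=np.int32))

# Overwrite selected stored values; the sparsity structure is untouched,
# so the zero is still stored explicitly.
a.data[a.data == 5] = 0
stored_before = a.nnz

# Remove all explicit zeros in a single pass.
a.eliminate_zeros()
stored_after = a.nnz

print(stored_before, stored_after)
```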
Another dense approach is
In [618]: a1=a.A
In [619]: a1[mask.A]=0
This also works in sparse, sort of:
In [622]: a2=a.copy()
In [624]: a2[mask]
Out[624]: matrix([[0, 3, 5, 0]], dtype=int32)
In [625]: a2[mask]=0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [626]: a2
Out[626]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 6 stored elements in Compressed Sparse Row format>
As noted in the previous question, we need to eliminate the zeros:
In [628]: a2.eliminate_zeros()
In [629]: a2
Out[629]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in Compressed Sparse Row format>
Taking a hint from the sparsity warning, let's try the lil format:
In [638]: al=a.tolil()
In [639]: al[mask]
Out[639]:
<1x4 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in LInked List format>
In [640]: al[mask]=0
In [641]: al
Out[641]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in LInked List format>
It's interesting that al[mask] is still sparse, whereas a[mask] is dense. The two formats use different indexing methods.
When mask is sparse enough, it might be worth iterating over the True (nonzero) elements of mask, setting the corresponding terms of a to zero directly.
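One way such an iteration might look, as a sketch only: convert a to lil (which tolerates element-wise assignment better than csr), loop over the coordinates returned by mask.nonzero(), and convert back. The final eliminate_zeros call is a precaution in case any explicit zeros remain. A Python-level loop like this is only plausible for a very small number of True entries.

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.array([[0, 0, 3], [0, 1, 5], [7, 0, 0]], dtype=np.int32))
mask = sparse.csr_matrix(np.array([[True, False, True],
                                   [False, False, True],
                                   [False, True, False]]))

al = a.tolil()  # lil is the format suggested by the efficiency warning
rows, cols = mask.nonzero()  # coordinates of the True entries of mask
for r, c in zip(rows, cols):
    al[r, c] = 0  # set one element at a time

result = al.tocsr()
result.eliminate_zeros()  # precaution: drop any explicit zeros
print(result.toarray())
```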
I'm not going to guess at the relative speeds of these methods. That needs to be tested on realistic data.
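A timing harness for such a test could look roughly like the following. The shape and densities are small placeholders chosen so the sketch runs quickly; they are not the sizes from the question, and the function name multiply_subtract is only a label for the approach shown there. Other candidates can be timed the same way on matrices that resemble the real data.

```python
import timeit

from scipy import sparse

# Placeholder sizes and densities; substitute realistic ones when testing.
a = sparse.random(2000, 2000, density=0.01, format="csr", random_state=0)
mask = sparse.random(2000, 2000, density=0.001, format="csr",
                     random_state=1).astype(bool)


def multiply_subtract(a, mask):
    """The approach from the question: subtract the masked part of `a`."""
    return a - a.multiply(mask)


runs = 5
seconds = timeit.timeit(lambda: multiply_subtract(a, mask), number=runs) / runs
print(f"multiply_subtract: {seconds:.6f} s per call")
```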