Efficient way to set elements to zero where mask is True in a scipy sparse matrix


Question

I have two matrices: a scipy.sparse.csr_matrix 'a' and a boolean scipy.sparse.csr_matrix 'mask'. I want to set the elements of 'a' to zero wherever the corresponding element of 'mask' is True.

For example:

>>>a
<3x3 sparse matrix of type '<type 'numpy.int32'>'
    with 4 stored elements in Compressed Sparse Row format>
>>>a.todense()
matrix([[0, 0, 3],
        [0, 1, 5],
        [7, 0, 0]])

>>>mask
<3x3 sparse matrix of type '<type 'numpy.bool_'>'
    with 4 stored elements in Compressed Sparse Row format>
>>>mask.todense()
matrix([[ True, False,  True],
        [False, False,  True],
        [False,  True, False]], dtype=bool)

Then I want to obtain the following result.

>>>result
<3x3 sparse matrix of type '<type 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>
>>>result.todense()
matrix([[0, 0, 0],
        [0, 1, 0],
        [7, 0, 0]])

I can do it with an operation like

result = a - a.multiply(mask)

or

a -= a.multiply(mask)  # I don't care whether it is in-place or a copy.
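For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the example above. The variable names follow the question; the final eliminate_zeros() call is a precaution I added and is not part of the original one-liner.

```python
import numpy as np
from scipy import sparse

# The 3x3 example from the question, built from dense arrays for clarity.
a = sparse.csr_matrix(np.array([[0, 0, 3],
                                [0, 1, 5],
                                [7, 0, 0]], dtype=np.int32))
mask = sparse.csr_matrix(np.array([[True, False, True],
                                   [False, False, True],
                                   [False, True, False]]))

# Multiply-and-subtract: a.multiply(mask) keeps only the masked entries of a,
# and subtracting them from a leaves zeros at those positions.
result = a - a.multiply(mask)
result.eliminate_zeros()  # precaution: drop any explicitly stored zeros

print(result.toarray())
```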

But I think the above operations are inefficient. Since the actual shape of 'a' and 'mask' is 67,108,864 × 2,000,000, these operations take several seconds on a high-spec server (64-core Xeon CPU, 512 GB memory). For example, when 'a' has about 30,000,000 non-zero elements and 'mask' has about 1,800,000 non-zero (True) elements, the above operation takes about 2 seconds.

Is there a more efficient way?

The conditions are as follows.

a.getnnz() != mask.getnnz()
a.shape == mask.shape

Thanks!

Another approach (tried)

a.data *= ~np.array(mask[a.astype(bool)]).flatten(); a.eliminate_zeros()  # This takes about twice as long as the method above.

Recommended answer

My initial impression is that this multiply-and-subtract approach is a reasonable one. Sparse code quite often implements operations as some sort of multiplication, even when the dense equivalents use more direct methods. The sparse sum over rows or columns uses a matrix multiplication with the appropriate row or column matrix of 1s. Even row or column indexing uses matrix multiplication (at least on the csr format).

Sometimes we can improve on operations by working directly with the matrix attributes (data, indices, indptr). But that requires a lot more thought and experimentation.
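As one possible illustration of working with the attributes directly, here is a rough sketch. It is an assumption of mine, not something tested on the question's data: the helper name zero_where_mask is hypothetical, and the idea is to turn every stored (row, column) pair into a single int64 key so that np.isin can find which stored entries of a are covered by mask.

```python
import numpy as np
from scipy import sparse

def zero_where_mask(a, mask):
    """Hypothetical helper: copy of CSR `a` with entries removed where `mask` is True."""
    a = a.tocsr(copy=True)
    mask = mask.tocsr()
    n_cols = a.shape[1]
    # Expand indptr into one row index per stored entry.
    a_rows = np.repeat(np.arange(a.shape[0], dtype=np.int64), np.diff(a.indptr))
    m_rows = np.repeat(np.arange(mask.shape[0], dtype=np.int64), np.diff(mask.indptr))
    # Linear key = row * n_cols + col, computed in int64 to avoid overflow.
    a_keys = a_rows * n_cols + a.indices
    m_keys = m_rows * n_cols + mask.indices
    # Ignore entries that mask stores explicitly as False.
    m_keys = m_keys[mask.data.astype(bool)]
    # Zero the matching stored values, then drop them from the structure once.
    a.data[np.isin(a_keys, m_keys)] = 0
    a.eliminate_zeros()
    return a

a = sparse.csr_matrix(np.array([[0, 0, 3], [0, 1, 5], [7, 0, 0]], dtype=np.int32))
mask = sparse.csr_matrix(np.array([[True, False, True],
                                   [False, False, True],
                                   [False, True, False]]))
result = zero_where_mask(a, mask)
print(result.toarray())
```

Whether this beats multiply-and-subtract depends on the data; the row-index expansion alone allocates one int64 per stored entry, so it should be timed before being relied on.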

For dense arrays my first try would be

In [611]: a.A*~(mask.A)
Out[611]: 
array([[0, 0, 0],
       [0, 1, 0],
       [7, 0, 0]], dtype=int32)

But there isn't a direct way of applying not to a sparse matrix. Even if mask were truly sparse, ~mask would not be. In your example mask has 4 True terms and 5 False terms, so a dense version would work just as well:

In [612]: nmask=sparse.csr_matrix(~(mask.A))
In [615]: a.multiply(nmask)
Out[615]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>

The question "CSR scipy matrix does not update after updating its values" explores setting the diagonal of a sparse matrix to 0. It is possible to set values of the data attribute to 0, and then call eliminate_zeros once at the end.
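A small sketch of that pattern, using the diagonal case as the example. The matrix m and its values are made up for illustration. Setting entries of data to 0 leaves them stored as explicit zeros, so the stored-element count only drops after eliminate_zeros.

```python
import numpy as np
from scipy import sparse

# Made-up 3x3 matrix with a fully populated diagonal.
m = sparse.csr_matrix(np.array([[1, 2, 0],
                                [0, 3, 4],
                                [5, 0, 6]], dtype=np.int32))

# One row index per stored entry, recovered from indptr.
rows = np.repeat(np.arange(m.shape[0]), np.diff(m.indptr))

# Zero the stored diagonal values in place; the sparsity structure is untouched.
m.data[rows == m.indices] = 0
nnz_before = m.nnz  # explicit zeros still count as stored elements

# Remove the explicit zeros in a single pass at the end.
m.eliminate_zeros()
nnz_after = m.nnz

print(nnz_before, nnz_after)
print(m.toarray())
```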

Another dense approach is

In [618]: a1=a.A
In [619]: a1[mask.A]=0

This also works in sparse, sort of:

In [622]: a2=a.copy()
In [624]: a2[mask]
Out[624]: matrix([[0, 3, 5, 0]], dtype=int32)
In [625]: a2[mask]=0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
In [626]: a2
Out[626]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 6 stored elements in Compressed Sparse Row format>

As noted in the previous question, we need to eliminate the zeros:

In [628]: a2.eliminate_zeros()
In [629]: a2
Out[629]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>

Taking a hint from the sparsity warning, let's try the lil format:

In [638]: al=a.tolil()
In [639]: al[mask]
Out[639]: 
<1x4 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in LInked List format>
In [640]: al[mask]=0
In [641]: al
Out[641]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in LInked List format>

It's interesting that al[mask] is still sparse, whereas a[mask] is dense. The two formats use different indexing methods.

At some low level of sparsity, it might be worth iterating over the True (nonzero) elements of mask, setting the corresponding terms of a to zero directly.
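A rough sketch of what that iteration could look like, assuming both matrices are in csr format. The function name zero_by_iterating_mask is hypothetical. It only visits rows where mask stores something, which is the case where this idea might pay off.

```python
import numpy as np
from scipy import sparse

def zero_by_iterating_mask(a, mask):
    """Hypothetical helper: loop over the non-empty rows of `mask` only."""
    a = a.tocsr(copy=True)
    mask = mask.tocsr()
    # Rows in which mask has at least one stored entry.
    for i in np.flatnonzero(np.diff(mask.indptr)):
        m_lo, m_hi = mask.indptr[i], mask.indptr[i + 1]
        # Columns where mask is True in this row (skip explicitly stored False).
        m_cols = mask.indices[m_lo:m_hi][mask.data[m_lo:m_hi].astype(bool)]
        a_lo, a_hi = a.indptr[i], a.indptr[i + 1]
        # Stored entries of a in this row that fall on a masked column.
        hit = np.isin(a.indices[a_lo:a_hi], m_cols)
        row_data = a.data[a_lo:a_hi]  # a view, so the assignment updates a.data
        row_data[hit] = 0
    a.eliminate_zeros()
    return a

a = sparse.csr_matrix(np.array([[0, 0, 3], [0, 1, 5], [7, 0, 0]], dtype=np.int32))
mask = sparse.csr_matrix(np.array([[True, False, True],
                                   [False, False, True],
                                   [False, True, False]]))
result = zero_by_iterating_mask(a, mask)
print(result.toarray())
```

The Python-level loop costs something per row, so with many non-empty mask rows this may well be slower than the vectorized alternatives.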

I'm not going to guess as to the relative speeds of these methods. That needs to be tested on realistic data.
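For such a test, a small timing harness along these lines might be a starting point. The sizes and densities below are arbitrary placeholders, far smaller than the 67,108,864 × 2,000,000 case in the question, and the function name time_multiply_subtract is hypothetical. They would need to be scaled up to say anything about the real workload.

```python
import timeit
from scipy import sparse

def time_multiply_subtract(n_rows=2000, n_cols=2000,
                           a_density=0.01, mask_density=0.001, repeats=3):
    """Hypothetical harness: best-of-`repeats` time for a - a.multiply(mask)."""
    # Random test data; the seeds are fixed only to make runs comparable.
    a = sparse.random(n_rows, n_cols, density=a_density,
                      format="csr", random_state=0)
    mask = sparse.random(n_rows, n_cols, density=mask_density,
                         format="csr", random_state=1).astype(bool)
    timings = timeit.repeat(lambda: a - a.multiply(mask),
                            number=1, repeat=repeats)
    return min(timings)

elapsed = time_multiply_subtract()
print(f"multiply-and-subtract: {elapsed:.6f} s")
```

Swapping the lambda for any of the other approaches gives a like-for-like comparison on the same random matrices.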
