
Question

I am trying to understand how scipy CSR works.

https://docs.scipy.org/doc/scipy/reference/sparse.html

For example, take the following matrix from https://en.wikipedia.org/wiki/Sparse_matrix

( 0 0 0 0 )
( 5 8 0 0 )
( 0 0 3 0 )
( 0 6 0 0 )

It says the CSR representation is the set of three arrays (V, COL_INDEX, ROW_INDEX) listed at the end of this question.

Must V list the non-zero elements one row after another, with the elements within each row listed from left to right?

I can understand that COL_INDEX holds the column index (column 1 is indexed as 0) of each element in V.

I don't understand ROW_INDEX. Could anybody show me how ROW_INDEX was created from the original matrix? Thanks.

   V         = [ 5 8 3 6 ]
   COL_INDEX = [ 0 1 2 1 ]
   ROW_INDEX = [ 0 0 2 3 4 ]

Recommended answer

coo format

I think it's best to start with the coo definition. It's easier to understand, and widely used:

In [90]: A = np.array([[0,0,0,0],[5,8,0,0],[0,0,3,0],[0,6,0,0]])
In [91]: M = sparse.coo_matrix(A)

The values are stored in 3 attributes:

In [92]: M.row
Out[92]: array([1, 1, 2, 3], dtype=int32)
In [93]: M.col
Out[93]: array([0, 1, 2, 1], dtype=int32)
In [94]: M.data
Out[94]: array([5, 8, 3, 6])

We can make a new matrix from those 3 arrays:

In [95]: sparse.coo_matrix((_94, (_92, _93))).A
Out[95]:
array([[0, 0, 0],
       [5, 8, 0],
       [0, 0, 3],
       [0, 6, 0]])

Oops, I need to add a shape, since the last column is all 0s:

In [96]: sparse.coo_matrix((_94, (_92, _93)), shape=(4,4)).A
Out[96]:
array([[0, 0, 0, 0],
       [5, 8, 0, 0],
       [0, 0, 3, 0],
       [0, 6, 0, 0]])

Another way to display this matrix:

In [97]: print(M)
  (1, 0)    5
  (1, 1)    8
  (2, 2)    3
  (3, 1)    6

np.where(A) gives the same non-zero coordinates:

In [108]: np.where(A)
Out[108]: (array([1, 1, 2, 3]), array([0, 1, 2, 1]))
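
As a minimal sketch of the steps so far (the names `inferred` and `explicit` are mine, not from the session above), the coordinates from np.where can be fed straight back into coo_matrix. Without a shape argument the size is inferred from the largest indices, which is why the all-zero last column went missing earlier:

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0],
              [5, 8, 0, 0],
              [0, 0, 3, 0],
              [0, 6, 0, 0]])

# Coordinates and values of the nonzero entries, in row-major order.
row, col = np.where(A)
data = A[row, col]

# Shape inferred from the largest row and column index.
inferred = sparse.coo_matrix((data, (row, col)))
# Shape given explicitly, so the empty last column is kept.
explicit = sparse.coo_matrix((data, (row, col)), shape=A.shape)

print(inferred.shape)  # (4, 3)
print(explicit.shape)  # (4, 4)
```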

conversion to csr

Once we have coo, we can easily convert it to csr. In fact sparse often does that for us:

In [98]: Mr = M.tocsr()
In [99]: Mr.data
Out[99]: array([5, 8, 3, 6], dtype=int64)
In [100]: Mr.indices
Out[100]: array([0, 1, 2, 1], dtype=int32)
In [101]: Mr.indptr
Out[101]: array([0, 0, 2, 3, 4], dtype=int32)

In the conversion, sparse does several things: it sorts the indices, sums duplicates, and replaces the row array with an indptr array. indptr[i] is the position in data where row i starts, so row i holds indptr[i+1] - indptr[i] elements. Here indptr is actually longer than row (5 entries versus 4), but in general it will be shorter, since it holds just one value per row (plus 1), while row holds one value per nonzero element. But perhaps more important, most of the fast calculation routines, especially matrix multiplication, have been written using the csr format.
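
To illustrate the sorting and duplicate summing, here is a small sketch with made-up input (not the matrix from the question): the coordinates are given out of order and (1, 1) appears twice.

```python
import numpy as np
from scipy import sparse

# Hypothetical coo input: unsorted, with the coordinate (1, 1) repeated.
row = np.array([3, 1, 1, 1])
col = np.array([1, 1, 0, 1])
data = np.array([6, 8, 5, 2])
M = sparse.coo_matrix((data, (row, col)), shape=(4, 4))

Mr = M.tocsr()

# coo keeps all 4 stored entries; csr merges the two (1, 1) entries into one.
print(M.nnz, Mr.nnz)
print(Mr.indptr)
print(Mr.toarray())
```

The two values at (1, 1) are added together (8 + 2), so the csr matrix stores three elements instead of four.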

I've used this package a lot, and MATLAB as well, where the default definition is in the coo style but the internal storage is csc (though not as exposed to users as in scipy). But I've never tried to derive indptr from scratch. I could, but I don't need to.
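
For readers who do want to see where indptr (ROW_INDEX in the question) comes from, here is one possible way to derive it by hand from the coo arrays. This is only a sketch of the idea, not how scipy does it internally: count the nonzeros in each row, then take a running total with a leading 0.

```python
import numpy as np

n_rows = 4
row = np.array([1, 1, 2, 3])
col = np.array([0, 1, 2, 1])
data = np.array([5, 8, 3, 6])

# Order the entries row by row, and left to right within each row.
order = np.lexsort((col, row))
indices = col[order]   # COL_INDEX
values = data[order]   # V

# Number of nonzeros in each row, including empty rows.
counts = np.bincount(row, minlength=n_rows)

# indptr[i] is where row i starts in values; the last entry is the total count.
indptr = np.concatenate(([0], np.cumsum(counts)))

print(counts)   # [0 2 1 1]
print(indptr)   # [0 0 2 3 4]
```

Row 0 is empty, so indptr starts 0, 0; row 1 has two elements, so the next entry is 2; rows 2 and 3 add one element each, giving 3 and 4.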

csr_matrix accepts inputs in the coo format, but also in the (data, indices, indptr) format. I wouldn't recommend the latter unless you already have those inputs calculated (say from another matrix). It's more error prone, and probably not much faster.
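
As a quick check, the three arrays from the question can be passed to csr_matrix in that second form. The variable names below simply reuse the Wikipedia labels:

```python
import numpy as np
from scipy import sparse

V = np.array([5, 8, 3, 6])
COL_INDEX = np.array([0, 1, 2, 1])
ROW_INDEX = np.array([0, 0, 2, 3, 4])

# The tuple order is (data, indices, indptr).
Mr = sparse.csr_matrix((V, COL_INDEX, ROW_INDEX), shape=(4, 4))
print(Mr.toarray())
```

This reproduces the original 4x4 matrix, which confirms that ROW_INDEX plays the role of indptr.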

However, sometimes it is useful to iterate on indptr and perform calculations directly on data. Often this is faster than working with the provided methods.

For example, we can list the nonzero values by row:

In [104]: for i in range(Mr.shape[0]):
     ...:     pt = slice(Mr.indptr[i], Mr.indptr[i+1])
     ...:     print(i, Mr.indices[pt], Mr.data[pt])
     ...:
0 [] []
1 [0 1] [5 8]
2 [2] [3]
3 [1] [6]

Keeping the initial 0 makes this iteration easier. When the matrix is (10000,90000) there's not much incentive to reduce the size of indptr by 1.
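
The same indptr slicing can drive simple per-row calculations. A small sketch, assuming the matrix from the question (the names `counts` and `row_sums` are mine):

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0],
              [5, 8, 0, 0],
              [0, 0, 3, 0],
              [0, 6, 0, 0]])
Mr = sparse.csr_matrix(A)

# Consecutive differences of indptr give the number of stored values per row.
counts = np.diff(Mr.indptr)

# Sum each row by slicing data with consecutive indptr entries.
# An empty slice sums to 0, so empty rows need no special case.
row_sums = np.array([Mr.data[Mr.indptr[i]:Mr.indptr[i + 1]].sum()
                     for i in range(Mr.shape[0])])

print(counts)    # [0 2 1 1]
print(row_sums)  # [ 0 13  3  6]
```

Because indptr starts with 0 and ends with the total count, every row, including the first and last, is handled by the same pair of lookups.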

The lil format stores the matrix in a similar manner:

In [105]: Ml = M.tolil()
In [106]: Ml.data
Out[106]: array([list([]), list([5, 8]), list([3]), list([6])], dtype=object)
In [107]: Ml.rows
Out[107]: array([list([]), list([0, 1]), list([2]), list([1])], dtype=object)

In [110]: for i,(r,d) in enumerate(zip(Ml.rows, Ml.data)):
     ...:     print(i, r, d)
     ...:
0 [] []
1 [0, 1] [5, 8]
2 [2] [3]
3 [1] [6]

Because of how rows are stored, lil actually allows us to fetch a view of a row:

In [167]: Ml.getrowview(2)
Out[167]:
<1x4 sparse matrix of type '<class 'numpy.longlong'>'
    with 1 stored elements in List of Lists format>
In [168]: for i in range(Ml.shape[0]):
     ...:     print(Ml.getrowview(i))
     ...:

  (0, 0)    5
  (0, 1)    8
  (0, 2)    3
  (0, 1)    6
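
To see what "view" means here, a short sketch, assuming getrowview shares the row's underlying lists with the parent matrix (which is what its "without copying" description suggests): overwriting a value through the view should change the original matrix too.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0],
              [5, 8, 0, 0],
              [0, 0, 3, 0],
              [0, 6, 0, 0]])
Ml = sparse.lil_matrix(A)

# A 1x4 lil matrix backed by the same lists as row 2 of Ml.
view = Ml.getrowview(2)

# Overwrite the existing entry (the 3) through the view.
view[0, 2] = 7

print(Ml[2, 2])
print(Ml.toarray())
```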

