Question
I am trying to understand how scipy CSR works.
https://docs.scipy.org/doc/scipy/reference/sparse.html
For example, take the following matrix from https://en.wikipedia.org/wiki/Sparse_matrix
( 0 0 0 0 )
( 5 8 0 0 )
( 0 0 3 0 )
( 0 6 0 0 )
The article gives its CSR representation as the three arrays V, COL_INDEX and ROW_INDEX, listed at the end of this question.
Must V list the non-zero elements one row after another, with the elements within each row listed from left to right?
I can understand COL_INDEX: it is the column index (column 1 is indexed as 0) corresponding to each element in V.
I don't understand ROW_INDEX. Could anybody show me how ROW_INDEX was created from the original matrix? Thanks.
V = [ 5 8 3 6 ]
COL_INDEX = [ 0 1 2 1 ]
ROW_INDEX = [ 0 0 2 3 4 ]
Recommended answer
coo format
I think it's best to start with the coo definition. It's easier to understand, and widely used:
# assumes: import numpy as np; from scipy import sparse
In [90]: A = np.array([[0,0,0,0],[5,8,0,0],[0,0,3,0],[0,6,0,0]])
In [91]: M = sparse.coo_matrix(A)
The values are stored in 3 attributes:
In [92]: M.row
Out[92]: array([1, 1, 2, 3], dtype=int32)
In [93]: M.col
Out[93]: array([0, 1, 2, 1], dtype=int32)
In [94]: M.data
Out[94]: array([5, 8, 3, 6])
We can make a new matrix from those 3 arrays:
In [95]: sparse.coo_matrix((_94, (_92, _93))).A
Out[95]:
array([[0, 0, 0],
[5, 8, 0],
[0, 0, 3],
[0, 6, 0]])
Oops, I need to add a shape: the last column is all 0s, so the shape inferred from the largest indices is one column short:
In [96]: sparse.coo_matrix((_94, (_92, _93)), shape=(4,4)).A
Out[96]:
array([[0, 0, 0, 0],
[5, 8, 0, 0],
[0, 0, 3, 0],
[0, 6, 0, 0]])
Another way to display this matrix:
In [97]: print(M)
(1, 0) 5
(1, 1) 8
(2, 2) 3
(3, 1) 6
np.where(A) gives the same non-zero coordinates.
In [108]: np.where(A)
Out[108]: (array([1, 1, 2, 3]), array([0, 1, 2, 1]))
Conversion to csr
Once we have coo, we can easily convert it to csr. In fact sparse often does that for us:
In [98]: Mr = M.tocsr()
In [99]: Mr.data
Out[99]: array([5, 8, 3, 6], dtype=int64)
In [100]: Mr.indices
Out[100]: array([0, 1, 2, 1], dtype=int32)
In [101]: Mr.indptr
Out[101]: array([0, 0, 2, 3, 4], dtype=int32)
sparse does several things: it sorts the indices, sums duplicates, and replaces the row array with an indptr array. Here indptr is actually longer than row (5 entries versus 4), but in general it will be shorter, since it has one entry per row plus 1, while row has one entry per non-zero value. But perhaps more important, most of the fast calculation routines, especially matrix multiplication, have been written using the csr format.
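For readers who want to see where indptr (the ROW_INDEX of the question) comes from, here is a minimal sketch. It is not how scipy does the conversion internally; it only assumes that indptr[i] is the number of non-zero values stored in all rows before row i, so that row i occupies data[indptr[i]:indptr[i+1]]. The names counts and indptr are chosen for illustration.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0],
              [5, 8, 0, 0],
              [0, 0, 3, 0],
              [0, 6, 0, 0]])
M = sparse.coo_matrix(A)

# Count the non-zero values in each row; rows without entries get 0.
counts = np.bincount(M.row, minlength=M.shape[0])

# indptr is the running total of those counts, with a leading 0.
indptr = np.concatenate(([0], np.cumsum(counts)))

print(counts)             # [0 2 1 1]
print(indptr)             # [0 0 2 3 4]
print(M.tocsr().indptr)   # [0 0 2 3 4]
```

The first row is empty, so the first two entries are both 0; the last entry is always the total number of stored values. For data and indices to line up with this indptr, the values must be listed row by row, which is the order asked about in the question.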
I've used this package a lot, and MATLAB as well, where the default definition is in the coo style but the internal storage is csc (though not as exposed to users as in scipy). But I've never tried to derive indptr from scratch. I could, but I don't need to.
csr_matrix accepts inputs in the coo format, but also in the (data, indices, indptr) format. I wouldn't recommend the latter unless you already have those inputs calculated (say from another matrix). It's more error prone, and probably not much faster.
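As an illustration of the second input form, this sketch feeds the three arrays from the question straight to csr_matrix. The arrays are typed in by hand here, which is exactly the error-prone step mentioned above; the shape is passed explicitly because the last column is empty.

```python
import numpy as np
from scipy import sparse

data = np.array([5, 8, 3, 6])        # V in the question
indices = np.array([0, 1, 2, 1])     # COL_INDEX
indptr = np.array([0, 0, 2, 3, 4])   # ROW_INDEX

# Build the csr matrix directly from (data, indices, indptr).
Mr = sparse.csr_matrix((data, indices, indptr), shape=(4, 4))
dense = Mr.toarray()
print(dense)
```

This should reproduce the original 4x4 matrix.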
However, sometimes it is useful to iterate on indptr and perform calculations directly on data. Often this is faster than working with the provided methods.
For example we can list the nonzero values by row:
In [104]: for i in range(Mr.shape[0]):
     ...:     pt = slice(Mr.indptr[i], Mr.indptr[i+1])
     ...:     print(i, Mr.indices[pt], Mr.data[pt])
     ...:
0 [] []
1 [0 1] [5 8]
2 [2] [3]
3 [1] [6]
Keeping the initial 0 makes this iteration easier. When the matrix is (10000,90000) there's not much incentive to reduce the size of indptr by 1.
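A small sketch of that kind of direct calculation, assuming the same example matrix: the differences of indptr give the number of stored values per row, and slicing data with consecutive indptr entries gives each row's values. The names nnz_per_row and row_max are made up for this example.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0],
              [5, 8, 0, 0],
              [0, 0, 3, 0],
              [0, 6, 0, 0]])
Mr = sparse.csr_matrix(A)

# Number of stored values in each row, without any loop.
nnz_per_row = np.diff(Mr.indptr)

# Largest stored value in each row; empty rows are left at 0.
row_max = np.zeros(Mr.shape[0], dtype=Mr.dtype)
for i in range(Mr.shape[0]):
    start, stop = Mr.indptr[i], Mr.indptr[i + 1]
    if stop > start:
        row_max[i] = Mr.data[start:stop].max()

print(nnz_per_row)   # [0 2 1 1]
print(row_max)       # [0 8 3 6]
```

Because indptr starts with 0, the loop needs no special case for the first row.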
The lil format stores the matrix in a similar manner:
In [105]: Ml = M.tolil()
In [106]: Ml.data
Out[106]: array([list([]), list([5, 8]), list([3]), list([6])], dtype=object)
In [107]: Ml.rows
Out[107]: array([list([]), list([0, 1]), list([2]), list([1])], dtype=object)
In [110]: for i,(r,d) in enumerate(zip(Ml.rows, Ml.data)):
     ...:     print(i, r, d)
     ...:
0 [] []
1 [0, 1] [5, 8]
2 [2] [3]
3 [1] [6]
Because of how rows are stored, lil actually allows us to fetch a view:
In [167]: Ml.getrowview(2)
Out[167]:
<1x4 sparse matrix of type '<class 'numpy.longlong'>'
with 1 stored elements in List of Lists format>
In [168]: for i in range(Ml.shape[0]):
     ...:     print(Ml.getrowview(i))
     ...:
(0, 0) 5
(0, 1) 8
(0, 2) 3
(0, 1) 6
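To show what "view" means here, the following sketch writes through a row view and checks that the parent matrix changes. It assumes, as I understand the lil implementation, that getrowview shares the row's underlying Python lists with the parent rather than copying them; the value 7 and the column 3 are arbitrary choices for the example.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0],
              [5, 8, 0, 0],
              [0, 0, 3, 0],
              [0, 6, 0, 0]])
Ml = sparse.lil_matrix(A)

# A 1x4 lil matrix that refers to row 2 of Ml.
view = Ml.getrowview(2)

# Writing through the view inserts into the shared lists,
# so the new value shows up in Ml as well.
view[0, 3] = 7

row2 = Ml.toarray()[2]
print(row2)   # [0 0 3 7]
```

This matches the earlier printout, where every row view reports its entries with row index 0: each view is a one-row matrix of its own.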