see Spare Matrix wikipedia item,
and scipy's documentation on different choices of sparse matrix type
sparse matrix storage, only store non-zero entries. there're multiple possible data structures for this, and can be divided into 2 groups
- support efficient modification
- DOK (dictory of keys)
- LIL (list of lists)
- COO (coordiate list)
- support efficient access
- CSR/CSC (compressed sparse row/column)
Dictionary of Keys (DOK)
- a dictionary that maps (row, col)-pair to the value;
- good for incremental build;
- poor for iterating;
- often used for building matrix, and convert to another format
List of Lists (LIL)
- matrix is a list of lists, one list for each row;
- each row list stores the (col, val) pair list;
- efficient for creation/insertion
Coordinate List (COO) aka IJV format
- sotre a list of (row, col, value) triplets, and ideally sorted by row then col;
- also known as IJV or Triplet format.
Compressed Sparse Row (CSR)
- an m*n matrix is represented as 3 vectors:
vals
,row_ptr
,col_idx
; vals
: all values in row-major; length is number of non-zero matrix elements;col_idx
: all values' column index in row-major order; same length with vals;row_ptr
:row_ptr[0] = 0
,row_ptr[k] = number-of-vals in first k rows
; i.e.row_ptr[k+1]-row_ptr[k]
is number of elements at row k;- this is extremely optimized for row-by-row iteration: only access current portion of vals and col_idx, and 2 elements of row_ptr to determine the portion - super cache friendly;
- thus very suitable for cases like matrix-multiplication, matrix-vector-multiplication;