Problem description
I have a scipy.sparse.csc_matrix that I am trying to convert to an array with scipy.sparse.csc_matrix.toarray(). With a small dataset this works fine. With a large dataset, however, the Python interpreter crashes as soon as the function is called, and the window closes without an error message. The matrix was created with sklearn.feature_extraction.text.CountVectorizer. I am running Python 2.7.3 on Ubuntu 12.04. To complicate matters, when I run the script from the terminal to capture any error message, the log records no error and in fact stops much earlier in the script (even though the log is complete when toarray() is not called).
Recommended answer
You cannot call toarray on a large sparse matrix, because it will try to store all the values (including the zeros) explicitly in a contiguous chunk of memory.
Let's take an example. Assume you have a sparse matrix A:
>>> A.shape
(10000, 100000)
>>> A.nnz # non zero entries
47231
>>> A.dtype.itemsize
8
The size of the non-zero data in MB is:
>>> (A.nnz * A.dtype.itemsize) / 1e6
0.377848
You can check that this matches the size of the data array of the sparse matrix data structure:
>>> A.data.nbytes / 1e6
0.377848
Depending on the kind of sparse matrix data structure (CSR, CSC, COO...), it also stores the locations of the non-zero entries in various ways. In general this approximately doubles the memory usage, so the total memory used by A is on the order of 700 kB.
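To make the "data plus locations" accounting concrete, here is a minimal sketch for the CSC/CSR case, where the matrix is backed by three arrays (data, indices, indptr). The helper name sparse_nbytes and the small demo matrix are hypothetical and only for illustration. The exact overhead depends on the index dtype SciPy picks and on the matrix shape, so "roughly double" is a rule of thumb rather than a guarantee.

```python
import numpy as np
from scipy import sparse


def sparse_nbytes(m):
    """Bytes held by the three arrays backing a CSC or CSR matrix.

    Hypothetical helper for illustration; COO and other formats
    store their coordinates in differently named attributes.
    """
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Small demo matrix (an assumption, not the matrix from the question):
# 5000 ones placed at distinct (row, column) positions.
nnz = 5000
rows = np.arange(nnz) % 1000
cols = np.arange(nnz) * 2
vals = np.ones(nnz, dtype=np.float64)
A = sparse.csc_matrix((vals, (rows, cols)), shape=(1000, 10000))

print("values only  (MB):", A.data.nbytes / 1e6)
print("with indices (MB):", sparse_nbytes(A) / 1e6)
```

The first figure corresponds to nnz * itemsize from the example above. The second adds the row indices and the column pointer array that CSC uses to record where the non-zero entries are.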
Converting to the contiguous array representation would materialize all the zeros in memory, and the resulting size in MB would be:
>>> A.shape[0] * A.shape[1] * A.dtype.itemsize / 1e6
8000.0
That's 8 GB for this example, compared to less than 1 MB for the original sparse representation.
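The same arithmetic can be run before calling toarray, so that an oversized conversion fails with a readable message instead of taking the interpreter down. The following is a sketch only: dense_nbytes, toarray_if_small and the 500 MB default limit are hypothetical names and values chosen for illustration, not part of SciPy.

```python
import numpy as np
from scipy import sparse


def dense_nbytes(m):
    """Bytes a dense copy of m would need: rows * cols * itemsize."""
    n_rows, n_cols = m.shape
    return int(n_rows) * int(n_cols) * m.dtype.itemsize


def toarray_if_small(m, max_bytes=500 * 10**6):
    """Densify m only if the result fits in max_bytes (an arbitrary limit)."""
    needed = dense_nbytes(m)
    if needed > max_bytes:
        raise MemoryError(
            "dense copy would need %.1f MB, limit is %.1f MB"
            % (needed / 1e6, max_bytes / 1e6)
        )
    return m.toarray()


# A small matrix converts without trouble.
small = sparse.csc_matrix(np.eye(3))
dense = toarray_if_small(small)
print(dense.shape)
```

With a matrix of shape (10000, 100000) and 8-byte items, dense_nbytes reports the 8 GB computed above and toarray_if_small raises instead of attempting the allocation.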