Problem description
I'm writing a machine learning algorithm on huge, sparse data: my matrix has shape (347, 5 416 812 801) but is very sparse, with only 0.13% of the entries non-zero.
My sparse matrix's size is 105 000 bytes (<1 MB) and it is of csr type.
I'm trying to separate the train and test sets by choosing a list of example indices for each. So I want to split my dataset in two using:
training_set = matrix[train_indices], of shape (len(training_indices), 5 416 812 801), still sparse
testing_set = matrix[test_indices], of shape (347-len(training_indices), 5 416 812 801), also sparse
with training_indices and testing_indices being two lists of int.
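For reference, the intended split can be sketched on a small stand-in matrix. This is only an illustration: the column count (1000), the 80/20 ratio, and the shuffling step are assumptions made for the example, not details from the question.

```python
import numpy as np
from scipy import sparse

# Small stand-in for the real matrix (assumption: 347 rows, only 1000 columns)
matrix = sparse.random(347, 1000, density=0.0013, format="csr", random_state=0)

# Hypothetical split: shuffle the row numbers, first 80% train, rest test
rng = np.random.default_rng(0)
perm = rng.permutation(matrix.shape[0])
n_train = int(0.8 * matrix.shape[0])
train_indices = perm[:n_train].tolist()   # list of int
test_indices = perm[n_train:].tolist()    # list of int

# Row selection with a list of int keeps the result sparse
training_set = matrix[train_indices]
testing_set = matrix[test_indices]
```

At this scale the indexing is unproblematic; the failure described in the question only shows up with the very large second dimension.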
But training_set = matrix[train_indices] seems to fail and returns a Segmentation fault (core dumped).
It is probably not a memory problem, as I'm running this code on a server with 64 GB of RAM.
Any clue on what could be the cause?
Recommended answer
I think I've recreated the csr row indexing with:
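import numpy as np
from scipy import sparse

def extractor(indices, N):
    # Row i of the result holds a single 1 in column indices[i]
    indptr = np.arange(len(indices) + 1)
    data = np.ones(len(indices))
    shape = (len(indices), N)
    return sparse.csr_matrix((data, indices, indptr), shape=shape)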
Testing on a csr matrix I had hanging around:
In [185]: M
Out[185]:
<30x40 sparse matrix of type '<class 'numpy.float64'>'
with 76 stored elements in Compressed Sparse Row format>
In [186]: indices=np.r_[0:20]
In [187]: M[indices,:]
Out[187]:
<20x40 sparse matrix of type '<class 'numpy.float64'>'
with 57 stored elements in Compressed Sparse Row format>
In [188]: extractor(indices, M.shape[0])*M
Out[188]:
<20x40 sparse matrix of type '<class 'numpy.float64'>'
with 57 stored elements in Compressed Sparse Row format>
As with a number of other csr methods, row indexing uses matrix multiplication to produce the final value, in this case with a sparse matrix that has a 1 in each selected row's column. The time of the explicit product is actually a bit better:
In [189]: timeit M[indices,:]
1000 loops, best of 3: 515 µs per loop
In [190]: timeit extractor(indices, M.shape[0])*M
1000 loops, best of 3: 399 µs per loop
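To check that the two approaches return the same values and not just the same shape, a comparison along these lines may help. It is a sketch on a random 30x40 matrix (an assumption; the M in the session above is not available), and it uses `@`, which for csr_matrix operands is the same matrix product as the `*` used above.

```python
import numpy as np
from scipy import sparse

def extractor(indices, N):
    # Row i of the result holds a single 1 in column indices[i]
    indptr = np.arange(len(indices) + 1)
    data = np.ones(len(indices))
    return sparse.csr_matrix((data, indices, indptr), shape=(len(indices), N))

# Random stand-in for the 30x40 test matrix
M = sparse.random(30, 40, density=0.06, format="csr", random_state=0)
indices = np.r_[0:20]

by_indexing = M[indices, :]
by_product = extractor(indices, M.shape[0]) @ M

# Small enough to compare densely
same = np.array_equal(by_indexing.toarray(), by_product.toarray())
```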
In your case the extractor matrix has shape (len(training_indices), 347), with only len(training_indices) values. So it is not big.
But if matrix is so large (or at least its 2nd dimension is so big) that it produces some error in the matrix multiplication routines, it could give rise to a segmentation fault without python/numpy trapping it.
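One possible reading of "the 2nd dimension so big" (a guess, not something confirmed here) is that column numbers this large do not fit in 32-bit integers. The arithmetic is easy to check, and the dtypes of a csr matrix's index arrays can be inspected directly; the small matrix below is only a placeholder for the real one.

```python
import numpy as np
from scipy import sparse

n_cols = 5_416_812_801                 # second dimension from the question
int32_max = np.iinfo(np.int32).max     # largest value a 32-bit signed int can hold

# True when column indices cannot all be stored as 32-bit integers
needs_int64 = n_cols > int32_max

# Inspect the index dtypes of a csr matrix (placeholder matrix for illustration)
small = sparse.csr_matrix(np.eye(3))
index_dtypes = (small.indices.dtype, small.indptr.dtype)
```

Running the same dtype inspection on the real matrix would show whether its indices are stored as 32-bit or 64-bit integers.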
Does matrix.sum(axis=1) work? That too uses a matrix multiplication, though with a dense matrix of 1s. Or sparse.eye(347)*M, a matrix multiplication of similar size?
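The two checks can be sketched as follows on a small stand-in matrix (the 1000 columns and the random data are assumptions for illustration). On the real matrix, the same two statements would show whether other operations fail in the same way as the row indexing.

```python
import numpy as np
from scipy import sparse

# Small stand-in with the same number of rows as the real matrix
M = sparse.random(347, 1000, density=0.0013, format="csr", random_state=0)

# Check 1: row sums, returned as a dense (347, 1) result
row_sums = M.sum(axis=1)

# Check 2: product with a sparse identity, which should reproduce M
identity_product = sparse.eye(M.shape[0], format="csr") @ M
```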