Storing a sparse matrix as HDF5

Question

I want to compress and store a humongous Scipy matrix in HDF5 format. How do I do this? I've tried the below code:

a = csr_matrix((dat, (row, col)), shape=(947969, 36039))
f = h5py.File('foo.h5','w')
dset = f.create_dataset("init", data=a, dtype = int, compression='gzip')

I get errors like these:

TypeError: Scalar datasets don't support chunk/filter options
IOError: Can't prepare for writing data (No appropriate function for conversion path)

I can't convert it to a dense NumPy array because that would exhaust memory. What is the best method?

Recommended answer

You can use the scipy.sparse.save_npz method.
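As a minimal sketch of that suggestion: the snippet below builds a small CSR matrix the same way the question does, writes it with scipy.sparse.save_npz, and reads it back with scipy.sparse.load_npz. The matrix size, the random data, and the temporary file path are illustrative assumptions, not values from the question. Note that the result is a compressed NumPy .npz archive rather than an HDF5 file.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Illustrative stand-in for the 947969 x 36039 matrix in the question.
rng = np.random.default_rng(0)
row = rng.integers(0, 1000, size=5000)
col = rng.integers(0, 500, size=5000)
dat = rng.integers(1, 10, size=5000)
m = sparse.csr_matrix((dat, (row, col)), shape=(1000, 500))

# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "matrix.npz")

# compressed=True (the default) stores the arrays in a compressed archive.
sparse.save_npz(path, m, compressed=True)

# load_npz restores the matrix in its original sparse format,
# without ever building a dense array.
loaded = sparse.load_npz(path)
print(loaded.shape, loaded.format)
```

Only the stored (non-zero) entries and their index arrays are written, so memory use stays proportional to the number of stored elements.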

Alternatively, consider using pandas.SparseDataFrame, but be aware that this method is very slow (thanks to @hpaulj for testing and pointing it out). SparseDataFrame and SparseSeries were removed in pandas 1.0, so the demo below requires an older pandas release.
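If the file really has to be HDF5, a commonly used workaround (not part of the original answer) is to store the three arrays that make up a CSR matrix, data, indices and indptr, as separate gzip-compressed datasets, plus the shape as an attribute. h5py cannot convert a scipy sparse matrix object into a dataset directly, which is the likely cause of the errors in the question. The helper names save_csr_to_hdf5 and load_csr_from_hdf5, the group layout, and the example data are all hypothetical; this sketch assumes the matrix has at least one stored element.

```python
import os
import tempfile

import h5py
import numpy as np
from scipy import sparse


def save_csr_to_hdf5(path, name, matrix):
    """Write the CSR component arrays of `matrix` into the HDF5 group `name`."""
    matrix = sparse.csr_matrix(matrix)  # no-op copy of structure if already CSR
    with h5py.File(path, "w") as f:
        grp = f.create_group(name)
        for part in ("data", "indices", "indptr"):
            # Each component is an ordinary 1-D array, so gzip filters apply.
            grp.create_dataset(part, data=getattr(matrix, part), compression="gzip")
        grp.attrs["shape"] = matrix.shape


def load_csr_from_hdf5(path, name):
    """Rebuild the CSR matrix from the arrays written by save_csr_to_hdf5."""
    with h5py.File(path, "r") as f:
        grp = f[name]
        shape = tuple(int(n) for n in grp.attrs["shape"])
        return sparse.csr_matrix(
            (grp["data"][:], grp["indices"][:], grp["indptr"][:]), shape=shape
        )


# Illustrative data; the real matrix would be built as in the question.
rng = np.random.default_rng(0)
row = rng.integers(0, 1000, size=5000)
col = rng.integers(0, 500, size=5000)
dat = rng.integers(1, 10, size=5000)
m = sparse.csr_matrix((dat, (row, col)), shape=(1000, 500))

path = os.path.join(tempfile.mkdtemp(), "foo.h5")  # hypothetical location
save_csr_to_hdf5(path, "init", m)
restored = load_csr_from_hdf5(path, "init")
print(restored.shape, restored.nnz)
```

The dense matrix is never materialized at any point: only the stored values and their index arrays are written and read back.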

Demo:

Generating a sparse matrix and a SparseDataFrame:

In [54]: import numpy as np

In [55]: import pandas as pd

In [56]: from scipy.sparse import csr_matrix

In [57]: m = csr_matrix((20, 10), dtype=np.int8)

In [58]: m
Out[58]:
<20x10 sparse matrix of type '<class 'numpy.int8'>'
        with 0 stored elements in Compressed Sparse Row format>

In [59]: sdf = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel(), fill_value=0)
    ...:                           for i in np.arange(m.shape[0])])
    ...:

In [61]: type(sdf)
Out[61]: pandas.sparse.frame.SparseDataFrame

In [62]: sdf.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
0    20 non-null int8
1    20 non-null int8
2    20 non-null int8
3    20 non-null int8
4    20 non-null int8
5    20 non-null int8
6    20 non-null int8
7    20 non-null int8
8    20 non-null int8
9    20 non-null int8
dtypes: int8(10)
memory usage: 280.0 bytes

Saving the SparseDataFrame to an HDF file:

In [64]: sdf.to_hdf('d:/temp/sparse_df.h5', 'sparse_df')

Reading from the HDF file:

In [65]: store = pd.HDFStore('d:/temp/sparse_df.h5')

In [66]: store
Out[66]:
<class 'pandas.io.pytables.HDFStore'>
File path: d:/temp/sparse_df.h5
/sparse_df            sparse_frame

In [67]: x = store['sparse_df']

In [68]: type(x)
Out[68]: pandas.sparse.frame.SparseDataFrame

In [69]: x.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 10 columns):
0    20 non-null int8
1    20 non-null int8
2    20 non-null int8
3    20 non-null int8
4    20 non-null int8
5    20 non-null int8
6    20 non-null int8
7    20 non-null int8
8    20 non-null int8
9    20 non-null int8
dtypes: int8(10)
memory usage: 360.0 bytes

