Problem description
I want to compress and store a very large SciPy sparse matrix in HDF5 format. How do I do this? I tried the code below:
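import h5py
from scipy.sparse import csr_matrix
# dat, row, col are the value and index arrays of the matrix (not shown)
a = csr_matrix((dat, (row, col)), shape=(947969, 36039))
f = h5py.File('foo.h5', 'w')
dset = f.create_dataset("init", data=a, dtype=int, compression='gzip')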
I get errors like these:
TypeError: Scalar datasets don't support chunk/filter options
IOError: Can't prepare for writing data (No appropriate function for conversion path)
I can't convert it to a NumPy array because the dense array would not fit in memory. What is the best approach?
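Recommended answer
You can use the scipy.sparse.save_npz method, which writes the matrix to a compressed .npz file (a NumPy archive rather than HDF5) without converting it to a dense array.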
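As a minimal sketch of that suggestion, the round trip below saves a small CSR matrix with scipy.sparse.save_npz and reads it back with scipy.sparse.load_npz. The matrix size, the random data, and the file name matrix.npz are assumptions made up for illustration; they stand in for the much larger matrix in the question.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small stand-in for the large matrix in the question (illustrative sizes).
rng = np.random.default_rng(0)
rows = rng.integers(0, 1000, size=5000)
cols = rng.integers(0, 500, size=5000)
vals = rng.integers(1, 10, size=5000)
m = sparse.csr_matrix((vals, (rows, cols)), shape=(1000, 500))

# Hypothetical output location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "matrix.npz")

# compressed=True is the default; it is spelled out here for clarity.
sparse.save_npz(path, m, compressed=True)

# load_npz returns the matrix in the sparse format it was saved in.
loaded = sparse.load_npz(path)
```

Only the stored (non-zero) entries and their index arrays are written, so the dense array is never materialized.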
Alternatively, consider using pandas.SparseDataFrame, but be aware that this method is very slow (thanks to @hpaulj for testing and pointing it out). SparseDataFrame and SparseSeries were removed in pandas 1.0, so the demo that follows applies only to older pandas versions.
Demo:
Generating a sparse matrix and a SparseDataFrame
In [54]: import numpy as np
In [55]: import pandas as pd
In [56]: from scipy.sparse import *
In [57]: m = csr_matrix((20, 10), dtype=np.int8)
In [58]: m
Out[58]:
<20x10 sparse matrix of type '<class 'numpy.int8'>'
        with 0 stored elements in Compressed Sparse Row format>
In [59]: sdf = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel(), fill_value=0)
    ...:                            for i in np.arange(m.shape[0])])
    ...:
In [61]: type(sdf)
Out[61]: pandas.sparse.frame.SparseDataFrame
In [62]: sdf.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
0 20 non-null int8
1 20 non-null int8
2 20 non-null int8
3 20 non-null int8
4 20 non-null int8
5 20 non-null int8
6 20 non-null int8
7 20 non-null int8
8 20 non-null int8
9 20 non-null int8
dtypes: int8(10)
memory usage: 280.0 bytes
Saving the SparseDataFrame to an HDF file
In [64]: sdf.to_hdf('d:/temp/sparse_df.h5', 'sparse_df')
Reading from the HDF file
In [65]: store = pd.HDFStore('d:/temp/sparse_df.h5')
In [66]: store
Out[66]:
<class 'pandas.io.pytables.HDFStore'>
File path: d:/temp/sparse_df.h5
/sparse_df sparse_frame
In [67]: x = store['sparse_df']
In [68]: type(x)
Out[68]: pandas.sparse.frame.SparseDataFrame
In [69]: x.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 10 columns):
0 20 non-null int8
1 20 non-null int8
2 20 non-null int8
3 20 non-null int8
4 20 non-null int8
5 20 non-null int8
6 20 non-null int8
7 20 non-null int8
8 20 non-null int8
9 20 non-null int8
dtypes: int8(10)
memory usage: 360.0 bytes
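Since the question asks for HDF5 specifically, another option (not part of the answer above, just a commonly used pattern) is to store the three arrays that make up a CSR matrix as separate gzip-compressed h5py datasets. The original create_dataset call most likely fails because h5py cannot turn a sparse matrix object into a regular array. The helper names save_csr_h5 and load_csr_h5, the group name "init", and the file name are hypothetical, and the sketch assumes the matrix has at least one stored element.

```python
import os
import tempfile

import h5py
import numpy as np
from scipy import sparse


def save_csr_h5(path, name, mat):
    """Write the CSR arrays and the shape of `mat` into one HDF5 group."""
    mat = sparse.csr_matrix(mat)
    with h5py.File(path, "a") as f:
        grp = f.create_group(name)
        for key in ("data", "indices", "indptr"):
            # Each component is an ordinary 1-D array, so gzip works here.
            grp.create_dataset(key, data=getattr(mat, key), compression="gzip")
        grp.attrs["shape"] = mat.shape


def load_csr_h5(path, name):
    """Rebuild the CSR matrix from the group written by save_csr_h5."""
    with h5py.File(path, "r") as f:
        grp = f[name]
        shape = tuple(int(n) for n in grp.attrs["shape"])
        return sparse.csr_matrix(
            (grp["data"][:], grp["indices"][:], grp["indptr"][:]), shape=shape
        )


# Small illustrative matrix; the real one would be built the same way.
rng = np.random.default_rng(1)
rows = rng.integers(0, 200, size=1000)
cols = rng.integers(0, 50, size=1000)
vals = rng.integers(1, 5, size=1000)
a = sparse.csr_matrix((vals, (rows, cols)), shape=(200, 50))

h5_path = os.path.join(tempfile.mkdtemp(), "foo.h5")
save_csr_h5(h5_path, "init", a)
restored = load_csr_h5(h5_path, "init")
```

The file then holds only the stored values and their indices, which is what keeps memory use proportional to the number of non-zero entries rather than to the full 947969 x 36039 shape.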