I want to store a DataFrame with columns of different types in an HDF5 file (an excerpt of the data types is shown below).
In [1]: mydf.dtypes
Out[1]:
endTime uint32
distance float16
signature category
anchorName category
stationList object
Before converting some columns to category (signature and anchorName in the excerpt above), I used code like the following to store it (which works fine):
path = 'tmp4.hdf5'
key = 'journeys'
mydf.to_hdf(path, key, mode='w', complevel=9, complib='bzip2')
But this does not work with category columns, so I tried the following:
path = 'tmp4.hdf5'
key = 'journeys'
mydf.to_hdf(path, key, mode='w', format='t', complevel=9, complib='bzip2')
It works fine if I remove the column stationList, where each entry is a list of strings. But with this column I get the following exception:
Cannot serialize the column [stationList] because
its data contents are [mixed] object dtype
How do I need to change my code to get the data frame stored?
pandas version: 0.17.1
python version: 2.7.6 (cannot change it due to compatibility reasons)
edit1 (some sample code):
import pandas as pd
mydf = pd.DataFrame({'endTime' : pd.Series([1443525810,1443540836,1443609470]),
'distance' : pd.Series([454.75,477.25,242.12]),
'signature' : pd.Series(['ab','cd','ab']),
'anchorName' : pd.Series(['tec','ing','pol']),
'stationList' : pd.Series([['t1','t2','t3'],['4','t2','t3'],['t3','t2','t4']])
})
# this works fine (no category)
mydf.to_hdf('tmp_without_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')
for col in ['anchorName', 'signature']:
mydf[col] = mydf[col].astype('category')
# this crashes now because of category data
# mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')
# switching to format='t'
# this caused problems because of "mixed data" in column stationList
mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')
# this again works fine (stationList dropped before writing)
mydf.drop('stationList', axis=1).to_hdf('tmp_with_cat_without_stationList.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')
edit2: In the meantime I tried different things to get rid of this problem. One of them was to convert the entries of column stationList to tuples (possible since they will not be changed) and to also convert the column to category. But it did not change anything. Here are the lines I added after the conversion loop (just for completeness):
mydf.stationList = [tuple(x) for x in mydf.stationList.values]
- You want to store categorical data in an HDF5 file;
- You're trying to store arbitrary objects (here, the lists of strings in stationList) in an HDF5 file.
As you discovered, categorical data is (currently?) only supported in the "table" format for HDF5.
However, storing arbitrary objects (lists of strings, etc.) is really not something that the HDF5 format itself supports. Pandas works around that for you by serializing these objects with pickle and then storing the pickle as an arbitrary-length string (which is not supported by all HDF5 formats, I think). But that will be slow and inefficient, and will never be supported well by HDF5.
In my mind, you have two options:
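- Pivot your data so that you have one row of data per station name. Then you can store everything in a table-format HDF5 file. (In general, this is good practice; see Hadley Wickham on Tidy Data.)
- If you really want to keep this format, then you are best off saving the whole data frame with to_pickle(). It has no trouble dealing with any kind of object you throw at it (e.g. lists of strings, etc.).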
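For option 2, a minimal sketch of the pickle round trip could look like this. It reuses the sample frame from the question; the temporary directory and the file name journeys.pkl are only assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Sample frame from the question, including the list-valued column.
mydf = pd.DataFrame({
    'endTime': [1443525810, 1443540836, 1443609470],
    'distance': [454.75, 477.25, 242.12],
    'signature': ['ab', 'cd', 'ab'],
    'anchorName': ['tec', 'ing', 'pol'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})
for col in ['anchorName', 'signature']:
    mydf[col] = mydf[col].astype('category')

# Hypothetical target path; any writable location works.
path = os.path.join(tempfile.mkdtemp(), 'journeys.pkl')

# Pickle stores the frame as-is, so categories and lists both survive.
mydf.to_pickle(path)
restored = pd.read_pickle(path)
```

Keep in mind that a pickle is Python-specific and should only be loaded from a source you trust.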
Personally, I would recommend option 1. You get to use a fast, binary file format. And the pivot will also make other operations with your data easier.
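Option 1 could be sketched roughly as follows. The helper names to_long and save_long and the new column names journey, position and station are made up for this example and are not part of pandas. The reshaping is done with a plain list comprehension so that it should also run on an old pandas such as 0.17.1.

```python
import pandas as pd

# Sample frame from the question.
mydf = pd.DataFrame({
    'endTime': [1443525810, 1443540836, 1443609470],
    'distance': [454.75, 477.25, 242.12],
    'signature': ['ab', 'cd', 'ab'],
    'anchorName': ['tec', 'ing', 'pol'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})


def to_long(df, list_col='stationList', value_col='station'):
    """Return one row per (journey, station) instead of one list per journey."""
    rows = [
        (journey, position, station)
        for journey, stations in zip(df.index, df[list_col])
        for position, station in enumerate(stations)
    ]
    stations = pd.DataFrame(rows, columns=['journey', 'position', value_col])
    # Repeat the scalar columns of each journey for every one of its stations.
    scalars = df.drop(list_col, axis=1)
    return stations.join(scalars, on='journey')


def save_long(df, path, key='journeys'):
    """Write the long frame in table format (needs PyTables installed)."""
    out = df.copy()
    for col in ['signature', 'anchorName', 'station']:
        out[col] = out[col].astype('category')
    out.to_hdf(path, key=key, mode='w', format='table',
               complevel=9, complib='bzip2')


long_df = to_long(mydf)
```

Calling save_long(long_df, 'tmp4.hdf5') would then write the file. Every column is now a plain scalar or a category, so the table format no longer has a mixed object column to deal with, and the original lists can be rebuilt by grouping on journey and sorting by position.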