Storing a pandas DataFrame with mixed data and category columns into HDF5

Problem description

I want to store a DataFrame with columns of different dtypes in an HDF5 file (an excerpt of the dtypes is shown below).

In  [1]: mydf.dtypes
Out [1]:
endTime             uint32
distance           float16
signature         category
anchorName        category
stationList         object

Before converting some columns to category (signature and anchorName in the excerpt above), I stored the DataFrame with code like the following, which worked fine:

path = 'tmp4.hdf5'
key = 'journeys'
mydf.to_hdf(path, key, mode='w', complevel=9, complib='bzip2')

But this does not work with category columns, so I tried the following:

path = 'tmp4.hdf5'
key = 'journeys'
mydf.to_hdf(path, key, mode='w', format='t', complevel=9, complib='bzip2')

This works fine if I remove the stationList column, in which each entry is a list of strings. With that column present, I get the following exception:

Cannot serialize the column [stationList] because
its data contents are [mixed] object dtype
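
To illustrate what the message refers to: for the table format, pandas appears to check the inferred content type of each object column and to accept only plain strings. The sketch below is a diagnostic only, not a fix. It assumes a newer pandas release where `infer_dtype` is exposed under `pandas.api.types` (on 0.17.1 it is believed to live at `pandas.lib.infer_dtype`), and the two small Series are made-up stand-ins for the real columns.

```python
import pandas as pd
# Assumption: newer pandas; on 0.17.1 try `from pandas.lib import infer_dtype` instead.
from pandas.api.types import infer_dtype

# Stand-in for a plain string column such as 'signature' before conversion.
strings = pd.Series(['ab', 'cd', 'ab'])

# Stand-in for 'stationList': every entry is a Python list of strings.
lists = pd.Series([['t1', 't2', 't3'], ['4', 't2', 't3']])

# A column of plain strings is inferred as 'string' ...
strings_kind = infer_dtype(strings)

# ... while a column whose cells are lists is inferred as 'mixed',
# which is the word that shows up in the exception text.
lists_kind = infer_dtype(lists)

print(strings_kind, lists_kind)
```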

How do I need to change my code so that the DataFrame can be stored?

pandas version: 0.17.1
python version: 2.7.6 (cannot be changed for compatibility reasons)

edit1 (some sample code):

import pandas as pd

mydf = pd.DataFrame({'endTime' : pd.Series([1443525810,1443540836,1443609470]),
                    'distance' : pd.Series([454.75,477.25,242.12]),
                    'signature' : pd.Series(['ab','cd','ab']),
                    'anchorName' : pd.Series(['tec','ing','pol']),
                    'stationList' : pd.Series([['t1','t2','t3'],['4','t2','t3'],['t3','t2','t4']])
                    })

# this works fine (no category)
mydf.to_hdf('tmp_without_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')

for col in ['anchorName', 'signature']:
    mydf[col] = mydf[col].astype('category')

# this crashes now because of category data
# mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')

# switching to format='t'   
# this caused problems because of "mixed data" in column stationList
mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')

mydf.pop('stationList')

# this again works fine
mydf.to_hdf('tmp_with_cat_without_stationList.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')

edit2: In the meantime I tried several ways to get around this problem. One of them was to convert the entries of the stationList column to tuples (possible, since they will not be modified) and to convert that column to category as well. It did not change anything. Here are the lines I added after the conversion loop (just for completeness):

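mydf['stationList'] = [tuple(x) for x in mydf['stationList'].values]
# astype returns a new Series, so the result has to be assigned back
mydf['stationList'] = mydf['stationList'].astype('category')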
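
For comparison, one workaround sometimes used for list-valued columns is to flatten each list into a single delimited string before writing and to split it again after reading. The sketch below is only an illustration under stated assumptions, not a confirmed fix for pandas 0.17.1. The separator `SEP` and the helpers `lists_to_strings` and `strings_to_lists` are hypothetical names introduced here, and the approach assumes the separator never occurs inside a station name and that no list is empty.

```python
import pandas as pd

SEP = '|'  # hypothetical separator; assumed never to appear in a station name


def lists_to_strings(series, sep=SEP):
    """Join each list of strings into one delimited string."""
    return series.apply(sep.join)


def strings_to_lists(series, sep=SEP):
    """Split each delimited string back into a list of strings."""
    return series.apply(lambda s: s.split(sep))


mydf = pd.DataFrame({
    'signature': pd.Series(['ab', 'cd', 'ab']).astype('category'),
    'stationList': pd.Series([['t1', 't2', 't3'],
                              ['4', 't2', 't3'],
                              ['t3', 't2', 't4']]),
})

# Work on a copy so the original list-valued column stays available.
flat = mydf.copy()
flat['stationList'] = lists_to_strings(flat['stationList'])

# The flattened frame holds only plain strings in stationList, so a table-format
# write along these lines may then be accepted (requires PyTables; not verified here):
# flat.to_hdf('tmp_with_cat.hdf5', key='journeys', mode='w', format='t',
#             complevel=9, complib='bzip2')

# After reading the frame back, the lists can be rebuilt from the strings.
restored = strings_to_lists(flat['stationList'])
```

The trade-off is that the list structure is no longer visible inside the file itself, so every reader has to know the separator.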

Recommended answer