python - A better solution for handling missing data in DataFrame objects exported from a relational database

A few days ago I posted a question about "how to make pandas HDFStore 'put' operation faster", and thanks to Jeff's answer I found a more efficient way to extract data from the db and store it in an hdf5 file.

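But with this approach I have to fill the missing data of every single column according to its type, and repeat that for every table (in most cases the work is repetitive). Otherwise, the None objects in the DataFrame cause performance problems when I put the DataFrame into the hdf5 file.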

Is there a better way to do this?

I have just read this issue: "ENH: sql to provided NaN/NaT conversions".

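Can NaT be used with other types (besides datetime64)?
Can I use it to replace all None objects in a DataFrame without worrying about performance when storing the DataFrame in an hdf5 file?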
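
As a point of reference for the NaT question, here is a minimal sketch of how current pandas releases (not the 0.10.1 version used in this question) represent missing values per dtype. NaT is the marker for datetime-like data; float columns use NaN, and object columns simply keep whatever object was stored. The series names are made up for illustration.

```python
import pandas as pd

# Datetime-like data: a missing value is represented by NaT.
dates = pd.Series(pd.to_datetime(["2013-01-01", None]))

# Numeric data: None is converted to NaN and the column becomes float64.
numbers = pd.Series([1.5, None])

# Object data: None is stored as-is, so the column stays a generic object column.
texts = pd.Series(["a", None], dtype=object)

for label, series in [("dates", dates), ("numbers", numbers), ("texts", texts)]:
    # isna() recognises NaT, NaN and None alike.
    print(label, series.dtype, series.isna().tolist())
```

So NaT does not replace None in non-datetime columns; each column needs the marker that matches its dtype.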

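Update 1

pd.version: 0.10.1
I am now using np.nan to fill the missing data, but I have run into two problems.

A column that contains both np.nan and datetime.datetime objects cannot be converted to the 'datetime64[ns]' type, and putting it into the HDFStore raises an exception.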

    In [155]: len(df_bugs.lastdiffed[df_bugs.lastdiffed.isnull()])
    Out[155]: 150

    In [156]: len(df_bugs.lastdiffed)
    Out[156]: 1003387

    In [158]: df_bugs.lastdiffed.astype(df_bugs.creation_ts.dtype)

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-158-...> in <module>()
    ----> 1 df_bugs.lastdiffed.astype(df_bugs.creation_ts.dtype)

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/core/series.pyc in astype(self, dtype)
        777         See numpy.ndarray.astype
        778         """
    --> 779         casted = com._astype_nansafe(self.values, dtype)
        780         return self._constructor(casted, index=self.index, name=self.name)
        781

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/core/common.pyc in _astype_nansafe(arr, dtype)
       1047     ...
       1048         # work around NumPy brokenness, #1987
    -> 1049         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
       1050
       1051     return arr.astype(dtype)

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/lib.so in pandas.lib.astype_intsafe (pandas/lib.c:11886)()

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/lib.so in util.set_value_at (pandas/lib.c:44436)()

    ValueError: Must be a datetime.date or datetime.datetime object

    # df_bugs_sample1 = df_bugs.ix[:10000]
    In [147]: %prun store.put('df_bugs_sample1', df_bugs_sample1, table=True)

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in put(self, key, value, table, append, **kwargs)
        456             table
        457         """
    --> 458         self._write_to_group(key, value, table=table, append=append, **kwargs)
        459
        460     def remove(self, key, where=None, start=None, stop=None):

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in _write_to_group(self, key, value, index, table, append, complib, **kwargs)
        786                 raise ValueError('Compression not supported on non-table')
        787
    --> 788         s.write(obj=value, append=append, complib=complib, **kwargs)
        789         ...
        790             s.create_index(columns=index)

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, **kwargs)
       2489         # create the axes
       2490         self.create_axes(axes=axes, obj=obj, validate=append,
    -> 2491                          min_itemsize=min_itemsize, **kwargs)
       2492
       2493         if not self.is_exists:

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
       2252                 raise
       2253             except (Exception), detail:
    -> 2254                 raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
       2255             j += 1
       2256

    Exception: cannot find the correct atom type -> [dtype->object,items->Index([bug_file_loc, bug_severity, bug_status, cf_branch, cf_bug_source, cf_eta, cf_public_severity, cf_public_summary, cf_regression, cf_reported_by, cf_type, guest_op_sys, host_op_sys, keywords, lastdiffed, priority, rep_platform, resolution, short_desc, status_whiteboard, target_milestone], dtype=object)] object of type 'datetime.datetime' has no len()
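
The exception complains about an object column that holds something other than strings. A small diagnostic like the following sketch can show which object columns mix value types before they are handed to the store. The helper name `value_types` and the two-row sample frame are made up for illustration, and the code targets current pandas rather than 0.10.1.

```python
import datetime as dt

import numpy as np
import pandas as pd


def value_types(frame):
    """Map each object-dtype column to the sorted names of the Python types it holds."""
    report = {}
    for name in frame.columns:
        column = frame[name]
        if column.dtype == object:
            report[name] = sorted({type(value).__name__ for value in column})
    return report


# Tiny stand-in for df_bugs: one clean string column, one mixed column.
sample = pd.DataFrame({
    "bug_status": pd.Series(["NEW", "CLOSED"], dtype=object),
    "lastdiffed": pd.Series([dt.datetime(2013, 1, 1), np.nan], dtype=object),
})

report = value_types(sample)
print(report)
```

A column that reports both `datetime` and `float` (np.nan is a float) is the kind of column that triggers the error above.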

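Another df does not seem to be stored completely. In the example below it has 13742515 entries, but after I put it into the HDFStore and get it back, the number of entries becomes 1041998. That is strange.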

    In [123]: df_bugs_activity
    Out[123]:

    Int64Index: 13742515 entries, 0 to 13742514
    Data columns:
    added        13111366  non-null values
    attach_id    1041998   non-null values
    bug_id       13742515  non-null values
    bug_when     13742515  non-null values
    fieldid      13742515  non-null values
    id           13742515  non-null values
    removed      13612258  non-null values
    who          13742515  non-null values
    dtypes: datetime64[ns](1), float64(1), int64(4), object(2)

    In [121]: %time store.put('df_bugs_activity2', df_bugs_activity, table=True)

    CPU times: user 35.31 s, sys: 4.23 s, total: 39.54 s
    Wall time: 39.65 s

    In [122]: %time store.get('df_bugs_activity2')

    CPU times: user 7.56 s, sys: 0.26 s, total: 7.82 s
    Wall time: 7.84 s
    Out[122]:

    Int64Index: 1041998 entries, 2012 to 13354656
    Data columns:
    added        1041981  non-null values
    attach_id    1041998  non-null values
    bug_id       1041998  non-null values
    bug_when     1041998  non-null values
    fieldid      1041998  non-null values
    id           1041998  non-null values
    removed      1041991  non-null values
    who          1041998  non-null values
    dtypes: datetime64[ns](1), float64(1), int64(4), object(2)

Update 2

The code used to create the DataFrame:

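    import numpy as np
    import pandas.io.sql as psql
    from pandas import DataFrame

    # `conn` is an open DB connection and `cur` a cursor created from it
    def catch_data(table_name, size_of_page=20000):
        '''
        Grab data from a db table

        size_of_page: the second parameter of the SQL LIMIT clause
        '''
        cur.execute('select count(*) from %s' % table_name)
        records_number = cur.fetchone()[0]
        loop_number = records_number // size_of_page + 1
        print('****\nStart grabbing %s\n****\nrecords_number: %s\nloop_number: %s' % (table_name, records_number, loop_number))

        start_position = 0
        df = DataFrame()  # WARNING: this DataFrame will hold all records of the table, so watch the memory usage!

        for i in range(0, loop_number):
            sql_export = 'select * from %s limit %s, %s' % (table_name, start_position, size_of_page)
            df = df.append(psql.read_frame(sql_export, conn), verify_integrity=False, ignore_index=True)

            start_position += size_of_page
            print('start_position: %s' % start_position)

        return df

    df_bugs = catch_data('bugs')
    df_bugs = df_bugs.fillna(np.nan)
    df_bugs = df_bugs.convert_objects()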
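
For comparison, here is a sketch of the same page-by-page read written against current pandas, where `psql.read_frame` and `DataFrame.append` no longer exist. It is not the code used in this question: the helper name `read_table_in_chunks` is made up, and an in-memory SQLite table stands in for the real database. The pages are collected first and concatenated once, which avoids re-copying the growing frame on every iteration.

```python
import sqlite3

import pandas as pd


def read_table_in_chunks(conn, table_name, size_of_page=20000):
    """Read a whole table page by page and concatenate the pages once."""
    # table_name is interpolated directly, so it must come from trusted code.
    query = "select * from %s" % table_name
    pages = pd.read_sql_query(query, conn, chunksize=size_of_page)
    return pd.concat(list(pages), ignore_index=True)


# In-memory stand-in for the real 'bugs' table.
conn = sqlite3.connect(":memory:")
conn.execute("create table bugs (bug_id integer, short_desc text)")
conn.executemany(
    "insert into bugs values (?, ?)",
    [(i, "bug %d" % i) for i in range(50)],
)
conn.commit()

df_bugs = read_table_in_chunks(conn, "bugs", size_of_page=20)
print(len(df_bugs))
```

The whole table still ends up in memory, so the warning in the original code applies here as well.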

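The structure of df_bugs:

    Int64Index: 1003387 entries, 0 to 1003386
    Data columns:
    alias                 0        non-null values
    assigned_to           1003387  non-null values
    bug_file_loc          498160   non-null values
    bug_id                1003387  non-null values
    bug_severity          1003387  non-null values
    bug_status            1003387  non-null values
    category_id           1003387  non-null values
    cclist_accessible     1003387  non-null values
    cf_attempted          102160   non-null values
    cf_branch             691834   non-null values
    cf_bug_source         1003387  non-null values
    cf_build              357920   non-null values
    cf_change             324933   non-null values
    cf_doc_impact         1003387  non-null values
    cf_eta                7223     non-null values
    cf_failed             102123   non-null values
    cf_i18n_impact        1003387  non-null values
    cf_on_hold            1003387  non-null values
    cf_public_severity    1003387  non-null values
    cf_public_summary     587944   non-null values
    cf_regression         1003387  non-null values
    cf_reported_by        1003387  non-null values
    cf_reviewer           1003387  non-null values
    cf_security           1003387  non-null values
    cf_test_id            13475    non-null values
    cf_type               1003387  non-null values
    cf_viss               1423     non-null values
    component_id          1003387  non-null values
    creation_ts           1003387  non-null values
    deadline              0        non-null values
    delta_ts              1003387  non-null values
    estimated_time        1003387  non-null values
    everconfirmed         1003387  non-null values
    found_in_phase_id     1003387  non-null values
    found_in_product_id   1003387  non-null values
    found_in_version_id   1003387  non-null values
    guest_op_sys          1003387  non-null values
    host_op_sys           1003387  non-null values
    keywords              1003387  non-null values
    lastdiffed            1003237  non-null values
    priority              1003387  non-null values
    product_id            1003387  non-null values
    qa_contact            1003387  non-null values
    remaining_time        1003387  non-null values
    rep_platform          1003387  non-null values
    reporter              1003387  non-null values
    reporter_accessible   1003387  non-null values
    resolution            1003387  non-null values
    short_desc            1003387  non-null values
    status_whiteboard     1003387  non-null values
    target_milestone      1003387  non-null values
    votes                 1003387  non-null values
    dtypes: datetime64[ns](2), float64(10), int64(19), object(21)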

Update 3

Writing to csv and reading back from csv:

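    In [184]: df_bugs.to_csv('df_bugs.sv')
    In [185]: df_bugs_from_scv = pd.read_csv('df_bugs.sv')
    In [186]: df_bugs_from_scv
    Out[186]:

    Int64Index: 1003387 entries, 0 to 1003386
    Data columns:
    Unnamed: 0            1003387  non-null values
    alias                 0        non-null values
    assigned_to           1003387  non-null values
    bug_file_loc          0        non-null values
    bug_id                1003387  non-null values
    bug_severity          1003387  non-null values
    bug_status            1003387  non-null values
    category_id           1003387  non-null values
    cclist_accessible     1003387  non-null values
    cf_attempted          102160   non-null values
    cf_branch             345133   non-null values
    cf_bug_source         1003387  non-null values
    cf_build              357920   non-null values
    cf_change             324933   non-null values
    cf_doc_impact         1003387  non-null values
    cf_eta                7223     non-null values
    cf_failed             102123   non-null values
    cf_i18n_impact        1003387  non-null values
    cf_on_hold            1003387  non-null values
    cf_public_severity    1003387  non-null values
    cf_public_summary     588      non-null values
    cf_regression         1003387  non-null values
    cf_reported_by        1003387  non-null values
    cf_reviewer           1003387  non-null values
    cf_security           1003387  non-null values
    cf_test_id            13475    non-null values
    cf_type               1003387  non-null values
    cf_viss               1423     non-null values
    component_id          1003387  non-null values
    creation_ts           1003387  non-null values
    deadline              0        non-null values
    delta_ts              1003387  non-null values
    estimated_time        1003387  non-null values
    everconfirmed         1003387  non-null values
    found_in_phase_id     1003387  non-null values
    found_in_product_id   1003387  non-null values
    found_in_version_id   1003387  non-null values
    guest_op_sys          805088   non-null values
    host_op_sys           806344   non-null values
    keywords              532941   non-null values
    lastdiffed            1003237  non-null values
    priority              1003387  non-null values
    product_id            1003387  non-null values
    qa_contact            1003387  non-null values
    remaining_time        1003387  non-null values
    rep_platform          424213   non-null values
    reporter              1003387  non-null values
    reporter_accessible   1003387  non-null values
    resolution            922282   non-null values
    short_desc            1003287  non-null values
    status_whiteboard     0        non-null values
    target_milestone      423276   non-null values
    votes                 1003387  non-null values
    dtypes: float64(12), int64(20), object(21)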
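
One plausible contributor to the lower non-null counts after the CSV round trip (for example in `keywords`) is that `read_csv` treats empty fields as missing by default, so empty strings and real missing values become indistinguishable. The sketch below reproduces that effect on a three-row frame invented for illustration, using current pandas; it also writes the file with `index=False`, which avoids the extra `Unnamed: 0` column.

```python
import io

import pandas as pd

# 'keywords' holds a real value, an empty string and a missing value.
original = pd.DataFrame({
    "bug_id": [1, 2, 3],
    "keywords": ["crash", "", None],
})

buffer = io.StringIO()
original.to_csv(buffer, index=False)  # index=False: no 'Unnamed: 0' on re-read
buffer.seek(0)
restored = pd.read_csv(buffer)

# The empty string counts as non-null before the round trip but not after it.
print(original["keywords"].notna().sum(), restored["keywords"].notna().sum())
```

Passing `keep_default_na=False` to `read_csv` keeps empty fields as empty strings, although real missing values then also come back as empty strings.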

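Best answer

I will answer this myself, with thanks to Jeff for his help.

First, the second problem in Update 1 (the df that apparently could not be stored completely) has been fixed.

The biggest problem I ran into was handling columns that contain both Python datetime objects and None objects. Fortunately, starting with 0.11, pandas provides a more convenient way. I used the following code in my project and added comments to some lines, hoping it helps others :)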

    import numpy as np
    from pandas import DataFrame, Series

    # `cur` is an open DB-API cursor
    cur.execute('select * from table_name')
    result = cur.fetchall()

    # For details: http://www.python.org/dev/peps/pep-0249/#description
    db_description = cur.description
    columns = [col_desc[0] for col_desc in db_description]

    # As the pandas doc says, `coerce_float`: attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point.
    # `coerce_float` is a parameter of DataFrame.from_records, not of the DataFrame constructor.
    df = DataFrame.from_records(result, columns=columns, coerce_float=True)

    # deal with the missing data
    for column_name in df.columns:
        # Currently, calling `fillna(np.nan)` on a `datetime64[ns]` column raises an exception
        if df[column_name].dtype.str != '<M8[ns]':
            # fillna returns a new Series, so assign the result back
            df[column_name] = df[column_name].fillna(np.nan)

    # convert columns that hold both np.nan and datetime objects from 'object' to 'datetime64[ns]' ('<M8[ns]' for short)
    # find the table columns whose type is Date or Datetime (type codes 10 and 12, presumably the driver's DATE and DATETIME codes, as in MySQL)
    column_name_type_tuple = [column[:2] for column in db_description if column[1] in (10, 12)]
    # keep those whose dtype is still 'object'
    columns_need_conv = [column_name for column_name, column_type in column_name_type_tuple if str(df[column_name].dtype) == 'object']

    # do the type conversion
    for column_name in columns_need_conv:
        df[column_name] = Series(df[column_name].values, dtype='M8[ns]')

    df = df.convert_objects()

After this, the df should be suitable for storing in an h5 file, and no "pickling" is needed any more.
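
For readers on current pandas, where `convert_objects` has been removed, the same idea can be sketched with `pd.to_datetime`, which turns None into NaT. This is an illustration rather than the code from the answer: `rows` and `description` are hypothetical stand-ins for `cur.fetchall()` and `cur.description`, and the type codes are assumed values in the style of the `(10, 12)` check above.

```python
import datetime as dt

import pandas as pd

# Hypothetical stand-ins for cur.fetchall() and cur.description;
# only the first two fields (name, type_code) of each entry are used.
rows = [
    (1, dt.datetime(2013, 1, 1, 12, 0), "NEW"),
    (2, None, None),
]
description = [("bug_id", 3), ("lastdiffed", 12), ("bug_status", 253)]
DATE_TYPE_CODES = (10, 12)  # assumed DATE/DATETIME codes, as in the answer

df = pd.DataFrame.from_records(rows, columns=[d[0] for d in description])

for name, type_code in description:
    if type_code in DATE_TYPE_CODES:
        # None becomes NaT and the column gets a datetime64 dtype.
        df[name] = pd.to_datetime(df[name])

print(df.dtypes)
```

Driving the conversion from the cursor description means only columns the database declares as dates are touched, so string columns are never parsed by accident.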

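P.S.

Some stats:
complib: 'lzo', complevel: 1
table1: 7,810,561 records, with 2 int columns and 1 datetime column; the put operation took 49 s

table2: 1,008,794 records, with 4 datetime cols, 4 float64 cols, 19 int cols and 24 object (string) cols; the put operation took 170 s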