I am trying to build groups of data with the pandas groupby function on a DataFrame that has a DatetimeIndex. I want to use pd.TimeGrouper to group by day.

When I define the DataFrame as follows, n.groupby(pd.TimeGrouper("d")) does not work:

n = pd.DataFrame(
    {"value": [5462,5462,3185]},
    index=[pd.to_datetime("2013-10-13 19:03:54"),
           pd.to_datetime("2013-10-12 19:03:54"),
           pd.to_datetime("2013-10-11 13:19:23")])


Error:

n.groupby(pd.TimeGrouper("d"))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-248-120eaa65b064> in <module>()
----> 1 n.groupby(pd.TimeGrouper("d"))

\lib\site-packages\pandas\core\generic.pyc in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze)
    184         return groupby(self, by, axis=axis, level=level, as_index=as_index,
    185                        sort=sort, group_keys=group_keys,
--> 186                        squeeze=squeeze)
    187
    188     def asfreq(self, freq, method=None, how=None, normalize=False):

\lib\site-packages\pandas\core\groupby.pyc in groupby(obj, by, **kwds)
    531         raise TypeError('invalid type: %s' % type(obj))
    532
--> 533     return klass(obj, by, **kwds)
    534
    535

\lib\site-packages\pandas\core\groupby.pyc in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze)
    195         if grouper is None:
    196             grouper, exclusions = _get_grouper(obj, keys, axis=axis,
--> 197                                                level=level, sort=sort)
    198
    199         self.grouper = grouper

\lib\site-packages\pandas\core\groupby.pyc in _get_grouper(obj, key, axis, level, sort)
   1268
   1269     if isinstance(key, CustomGrouper):
-> 1270         gpr = key.get_grouper(obj)
   1271         return gpr, []
   1272     elif isinstance(key, Grouper):

\lib\site-packages\pandas\tseries\resample.pyc in get_grouper(self, obj)
    106     def get_grouper(self, obj):
    107         # Only return grouper
--> 108         return self._get_time_grouper(obj)[1]
    109
    110     def _get_time_grouper(self, obj):

\lib\site-packages\pandas\tseries\resample.pyc in _get_time_grouper(self, obj)
    112
    113         if self.kind is None or self.kind == 'timestamp':
--> 114             binner, bins, binlabels = self._get_time_bins(axis)
    115         else:
    116             binner, bins, binlabels = self._get_time_period_bins(axis)

\lib\site-packages\pandas\tseries\resample.pyc in _get_time_bins(self, axis)
    146
    147         # general version, knowing nothing about relative frequencies
--> 148         bins = lib.generate_bins_dt64(ax_values, bin_edges, self.closed)
    149
    150         if self.closed == 'right':

\lib\site-packages\pandas\lib.pyd in pandas.lib.generate_bins_dt64 (pandas\lib.c:16139)()

ValueError: Invalid length for values or for binner


Surprisingly, it works fine when I define the DataFrame as follows. Note that I changed the last day to 2013-10-12 instead of 2013-10-11.

n = pd.DataFrame(
    {"value": [5462,5462,3185]},
    index=[pd.to_datetime("2013-10-13 19:03:54"),
           pd.to_datetime("2013-10-13 19:03:54"),
           pd.to_datetime("2013-10-12 13:19:23")])


In this case, I get the correct group object:

n.groupby(pd.TimeGrouper("d"))
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000A3D84E0>


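I have looked through some of the pandas core source code, but I am not sure whether this is a bug or whether I simply do not know how to use the function correctly.

Note also that aggregating by month works fine.

Thanks for your help.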

Best answer

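This is a bug: it occurs because the index is not monotonically sorted (see here). That said, there is no reason to use TimeGrouper, which is internal at the moment; use resample instead: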

In [3]: df
Out[3]:
                     value
2013-10-13 19:03:54   5462
2013-10-12 19:03:54   5462
2013-10-11 13:19:23   3185

In [4]: df.resample('d')
Out[4]:
            value
2013-10-11   3185
2013-10-12   5462
2013-10-13   5462
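
For readers on a more recent pandas, the same idea can be sketched as follows. This is only an illustration under some assumptions: in newer releases resample returns a lazy Resampler, so an explicit aggregation is needed (mean is assumed here, matching the old default behaviour), and pd.TimeGrouper has been deprecated and later removed in favour of pd.Grouper(freq=...). The names n_sorted, daily and daily_grouped are made up for the example.

```python
import pandas as pd

# Same data as in the question; the index is in descending order.
n = pd.DataFrame(
    {"value": [5462, 5462, 3185]},
    index=pd.to_datetime([
        "2013-10-13 19:03:54",
        "2013-10-12 19:03:54",
        "2013-10-11 13:19:23",
    ]),
)

# The index is not monotonically increasing, which is what triggered the error.
print(n.index.is_monotonic_increasing)

# Sorting the index first sidesteps the unsorted-index problem.
n_sorted = n.sort_index()

# Daily resampling with an explicit aggregation (assumption: mean).
daily = n_sorted.resample("D").mean()

# Equivalent grouping through the public pd.Grouper API.
daily_grouped = n_sorted.groupby(pd.Grouper(freq="D")).mean()

print(daily)
print(daily_grouped)
```

Sorting with sort_index() before grouping is a defensive step: it costs little on small frames and avoids depending on how a particular pandas version handles unsorted time indexes.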
