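I am trying to build groups of data with pandas' groupby on a DataFrame that has a DatetimeIndex. I want to use pd.TimeGrouper to group by day.
When I define the DataFrame as follows, the call n.groupby(pd.TimeGrouper("d")) does not work.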
import pandas as pd

n = pd.DataFrame(
    {"value": [5462, 5462, 3185]},
    index=[pd.to_datetime("2013-10-13 19:03:54"),
           pd.to_datetime("2013-10-12 19:03:54"),
           pd.to_datetime("2013-10-11 13:19:23")])
The error:
n.groupby(pd.TimeGrouper("d"))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-248-120eaa65b064> in <module>()
----> 1 n.groupby(pd.TimeGrouper("d"))
\lib\site-packages\pandas\core\generic.pyc in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze)
184 return groupby(self, by, axis=axis, level=level, as_index=as_index,
185 sort=sort, group_keys=group_keys,
--> 186 squeeze=squeeze)
187
188 def asfreq(self, freq, method=None, how=None, normalize=False):
\lib\site-packages\pandas\core\groupby.pyc in groupby(obj, by, **kwds)
531 raise TypeError('invalid type: %s' % type(obj))
532
--> 533 return klass(obj, by, **kwds)
534
535
\lib\site-packages\pandas\core\groupby.pyc in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze)
195 if grouper is None:
196 grouper, exclusions = _get_grouper(obj, keys, axis=axis,
--> 197 level=level, sort=sort)
198
199 self.grouper = grouper
\lib\site-packages\pandas\core\groupby.pyc in _get_grouper(obj, key, axis, level, sort)
1268
1269 if isinstance(key, CustomGrouper):
-> 1270 gpr = key.get_grouper(obj)
1271 return gpr, []
1272 elif isinstance(key, Grouper):
\lib\site-packages\pandas\tseries\resample.pyc in get_grouper(self, obj)
106 def get_grouper(self, obj):
107 # Only return grouper
--> 108 return self._get_time_grouper(obj)[1]
109
110 def _get_time_grouper(self, obj):
\lib\site-packages\pandas\tseries\resample.pyc in _get_time_grouper(self, obj)
112
113 if self.kind is None or self.kind == 'timestamp':
--> 114 binner, bins, binlabels = self._get_time_bins(axis)
115 else:
116 binner, bins, binlabels = self._get_time_period_bins(axis)
\lib\site-packages\pandas\tseries\resample.pyc in _get_time_bins(self, axis)
146
147 # general version, knowing nothing about relative frequencies
--> 148 bins = lib.generate_bins_dt64(ax_values, bin_edges, self.closed)
149
150 if self.closed == 'right':
\lib\site-packages\pandas\lib.pyd in pandas.lib.generate_bins_dt64 (pandas\lib.c:16139)()
ValueError: Invalid length for values or for binner
Surprisingly, it works fine when I define the DataFrame as follows. Note that I changed the last day to 2013-10-12 instead of 2013-10-11.
n = pd.DataFrame(
    {"value": [5462, 5462, 3185]},
    index=[pd.to_datetime("2013-10-13 19:03:54"),
           pd.to_datetime("2013-10-13 19:03:54"),
           pd.to_datetime("2013-10-12 13:19:23")])
In this case I get the expected groupby object:
n.groupby(pd.TimeGrouper("d"))
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000A3D84E0>
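I have looked through some of the pandas core code, but I am not sure whether this is a bug or whether I simply do not know how to use the function correctly.
Note also that aggregating by month works fine.
Thanks for your help.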
Best answer
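This is a bug, triggered because the index is not monotonically sorted; see here. That said, there is no reason to use TimeGrouper, which is internal at the moment. Use resample instead.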
In [3]: df
Out[3]:
                     value
2013-10-13 19:03:54   5462
2013-10-12 19:03:54   5462
2013-10-11 13:19:23   3185

In [4]: df.resample('d')
Out[4]:
            value
2013-10-11   3185
2013-10-12   5462
2013-10-13   5462
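
For anyone reading this on a more recent pandas, here is a minimal sketch of the same idea. It assumes a version where resample returns a deferred object that needs an explicit aggregation, and where pd.Grouper(freq=...) is the public way to group by time. The names n_sorted, daily_resample and daily_grouped are only illustrative, and mean() is chosen here as the aggregation purely for the example.

```python
import pandas as pd

# Same data as in the question: the index is in descending order.
n = pd.DataFrame(
    {"value": [5462, 5462, 3185]},
    index=pd.to_datetime([
        "2013-10-13 19:03:54",
        "2013-10-12 19:03:54",
        "2013-10-11 13:19:23",
    ]),
)

# The answer attributes the failure to a non-monotonic index,
# so check it and sort before grouping.
print(n.index.is_monotonic_increasing)
n_sorted = n.sort_index()
print(n_sorted.index.is_monotonic_increasing)

# Daily bins via resample, with an explicit aggregation.
daily_resample = n_sorted.resample("D").mean()

# Equivalent daily bins via groupby and pd.Grouper.
daily_grouped = n_sorted.groupby(pd.Grouper(freq="D")).mean()

print(daily_resample)
print(daily_grouped)
```

Sorting first is not necessarily required on current versions, but it is cheap and sidesteps the ordering issue described above.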