I have a DatetimeIndex object, for example:

DatetimeIndex(['2007-11-01 00:00:00', '2008-01-01 00:00:00',
               '2008-02-01 00:00:00', '2008-03-01 00:00:00',
               '2008-04-01 00:00:00', '2012-09-01 00:10:00',
               '2012-09-01 00:20:00', '2012-09-01 00:30:00',
               '2012-09-01 00:40:00', '2012-09-01 00:50:00',
               ...
               '2012-09-30 22:40:00', '2012-09-30 22:50:00',
               '2012-09-30 23:00:00', '2012-09-30 23:10:00',
               '2012-09-30 23:20:00', '2012-09-30 23:30:00',
               '2012-09-30 23:40:00', '2012-09-30 23:50:00',
               '2012-10-01 00:00:00', '2015-07-01 00:00:00'],
              dtype='datetime64[ns]', length=4326, freq=None, tz=None)


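Its freq and inferred_freq are both None, I believe because, although the data actually has a 10-minute frequency, that frequency cannot be detected due to the missing parts. I would like to extract those missing parts, or equivalently the available parts, as efficiently as possible. That is, I want to get something like the following list of ranges: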

[('2007-11-01 00:00:00', '2007-11-01 00:00:00'),
 ('2008-01-01 00:00:00', '2008-01-01 00:00:00'),
 ('2008-02-01 00:00:00', '2008-02-01 00:00:00'),
 ('2008-03-01 00:00:00', '2008-03-01 00:00:00'),
 ('2008-04-01 00:00:00', '2008-04-01 00:00:00'),
 ('2012-09-01 00:10:00', '2012-10-01 00:00:00'),
 ('2015-07-01 00:00:00', '2015-07-01 00:00:00')]


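How should I do this? I have looked at PeriodIndex, but it seems to be aimed at a different kind of application, or perhaps it does not yet handle arbitrary time intervals.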
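To make the symptom concrete, here is a minimal sketch with made-up data (the names `regular` and `gapped` are illustrative, not from the question). It assumes that removing a single timestamp from an otherwise regular 10-minute index is enough to make frequency inference fail.

```python
import pandas as pd

# Hypothetical sample data: a regular 10-minute index and a copy with one stamp removed.
regular = pd.date_range("2012-09-01 00:10", periods=6, freq="10min")
gapped = regular.delete(3)  # drop one timestamp from the middle

# The regular index reports a 10-minute frequency alias
# (its exact spelling depends on the pandas version).
print(regular.inferred_freq)

# With a gap, no single frequency fits, so both attributes are None.
print(gapped.freq)
print(gapped.inferred_freq)
```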

Best answer

I think you can convert the index to a Series, groupby a grouper Series, and aggregate with min and max.

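The grouper is created by comparing the diff (the difference between consecutive timestamps) with 10 minutes and taking the cumsum of the result, so the counter increases every time a gap breaks the 10-minute spacing.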
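As a small illustration of that step, the sketch below uses made-up timestamps and shows the two intermediate results. The names `step`, `starts_new_run` and `run_id` are only for illustration.

```python
import pandas as pd

# Made-up timestamps: two 10-minute runs separated by a longer gap.
s = pd.Series(pd.to_datetime([
    "2012-09-01 00:10", "2012-09-01 00:20", "2012-09-01 00:30",
    "2012-09-01 02:00", "2012-09-01 02:10",
]))

step = pd.Timedelta("10min")

# True where the distance to the previous timestamp is not exactly 10 minutes.
# The first element has no predecessor (its diff is NaT), so it is also True.
starts_new_run = s.diff().ne(step)

# The running count of True values gives each run its own label.
run_id = starts_new_run.cumsum()

print(starts_new_run.tolist())  # [True, False, False, True, False]
print(run_id.tolist())          # [1, 1, 1, 2, 2]
```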

import pandas as pd

rng = pd.DatetimeIndex(['2007-11-01 00:00:00', '2008-01-01 00:00:00',
               '2008-02-01 00:00:00', '2008-03-01 00:00:00',
               '2008-04-01 00:00:00', '2012-09-01 00:10:00',
               '2012-09-01 00:20:00', '2012-09-01 00:30:00',
               '2012-09-01 00:40:00', '2012-09-01 00:50:00',
               '2012-09-30 22:40:00', '2012-09-30 22:50:00',
               '2012-09-30 23:00:00', '2012-09-30 23:10:00',
               '2012-09-30 23:20:00', '2012-09-30 23:30:00',
               '2012-09-30 23:40:00', '2012-09-30 23:50:00',
               '2012-10-01 00:00:00', '2015-07-01 00:00:00'])

s = pd.Series(rng)
# A new group starts wherever the gap to the previous timestamp is not 10 minutes
grouper = s.diff().ne(pd.to_timedelta('10min')).cumsum()
print(grouper)
0     1
1     2
2     3
3     4
4     5
5     6
6     6
7     6
8     6
9     6
10    7
11    7
12    7
13    7
14    7
15    7
16    7
17    7
18    7
19    8
dtype: int32




# Take the first and last timestamp of each group and format them as string tuples
print(s.groupby(grouper).agg(['min', 'max']).astype(str).apply(tuple, axis=1).tolist())
[('2007-11-01 00:00:00', '2007-11-01 00:00:00'),
 ('2008-01-01 00:00:00', '2008-01-01 00:00:00'),
 ('2008-02-01 00:00:00', '2008-02-01 00:00:00'),
 ('2008-03-01 00:00:00', '2008-03-01 00:00:00'),
 ('2008-04-01 00:00:00', '2008-04-01 00:00:00'),
 ('2012-09-01 00:10:00', '2012-09-01 00:50:00'),
 ('2012-09-30 22:40:00', '2012-10-01 00:00:00'),
 ('2015-07-01 00:00:00', '2015-07-01 00:00:00')]
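The same idea can be wrapped in a pair of helpers. This is only a sketch. The function names `available_ranges` and `missing_ranges` are hypothetical. It assumes the index is sorted and that all timestamps lie on the same grid of `step`-sized intervals. The second helper covers the "missing parts" side of the question by looking at the space between consecutive runs.

```python
import pandas as pd

def available_ranges(index, step="10min"):
    """Return (start, end) Timestamp pairs for each run spaced exactly `step` apart."""
    s = pd.Series(index)
    run_id = s.diff().ne(pd.Timedelta(step)).cumsum()
    bounds = s.groupby(run_id).agg(["min", "max"])
    return list(zip(bounds["min"], bounds["max"]))

def missing_ranges(index, step="10min"):
    """Return (first_missing, last_missing) pairs for the gaps between runs.

    Assumes every timestamp sits on the same `step` grid; off-grid data
    would need extra handling that this sketch does not attempt.
    """
    step = pd.Timedelta(step)
    runs = available_ranges(index, step)
    return [(prev_end + step, next_start - step)
            for (_, prev_end), (next_start, _) in zip(runs, runs[1:])]

# Usage with made-up data: two short runs with a gap in between.
idx = pd.DatetimeIndex(["2012-09-01 00:10", "2012-09-01 00:20",
                        "2012-09-01 01:00", "2012-09-01 01:10"])
print(available_ranges(idx))
print(missing_ranges(idx))
```

Returning Timestamp objects rather than strings keeps the results usable for further slicing; they can be formatted with strftime when strings are needed.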
