python - Efficiently taking variable-length time slices from a DataFrame

I want to efficiently slice a DataFrame that has a DatetimeIndex (similar to a resample or groupby operation), but the time slices I need have different lengths.

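This is fairly easy to do with a loop (see the code below), but with a large time series and many slices it quickly becomes slow. Any suggestions for vectorizing this or otherwise speeding it up?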

import pandas as pd, datetime as dt, numpy as np

#Example DataFrame with a DatetimeIndex
idx = pd.date_range(start=dt.datetime(2017,1,1), end=dt.datetime(2017,1,31), freq='h')
df = pd.Series(index = idx, data = np.random.rand(len(idx)))

#The slicer dataframe contains a series of start and end windows
slicer_df = pd.DataFrame(index = [1,2])
slicer_df['start_window'] = [dt.datetime(2017,1,2,2), dt.datetime(2017,1,6,12)]
slicer_df['end_window'] = [dt.datetime(2017,1,6,12), dt.datetime(2017,1,15,2)]

#The results should be stored to a dataframe, indexed by the index of the slicer dataframe
#This is the loop that I would like to vectorise
slice_results = pd.DataFrame()
slice_results['total'] = None
for index, row in slicer_df.iterrows():
    slice_results.loc[index,'total'] = df[(df.index >= row.start_window) &
                                          (df.index <= row.end_window)].sum()

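Note: I have just realized that my particular dataset has adjacent windows (i.e. the start of one window coincides with the end of the previous one), but the windows have different lengths. It feels like there should be a way to do a groupby or something similar in a single pass over df.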
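For adjacent windows like these, a single pass could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `ser`, `edges` and `window_ids` are made up for the example, the data is a small deterministic series, and each window is treated as half-open, `[start, end)`, so that a sample sitting on a shared boundary is counted once (the inclusive loop above would count it in both windows).

```python
import numpy as np
import pandas as pd

# Small deterministic series: values 0..9 at hourly timestamps.
idx = pd.date_range("2017-01-01", periods=10, freq="h")
ser = pd.Series(np.arange(10, dtype=float), index=idx)

# Hypothetical shared edges: window k covers [edges[k], edges[k + 1]).
edges = pd.to_datetime(["2017-01-01 02:00", "2017-01-01 05:00", "2017-01-01 09:00"])
window_ids = np.array([1, 2])  # labels of the two windows defined by the edges

# For every timestamp, find the position of the window it falls into.
# -1 means "before the first edge"; len(edges) - 1 means "at or after the last edge".
pos = np.searchsorted(edges.to_numpy(), ser.index.to_numpy(), side="right") - 1
inside = (pos >= 0) & (pos < len(edges) - 1)

# One groupby pass over the samples that belong to some window.
totals = ser[inside].groupby(window_ids[pos[inside]]).sum()
```

A window that contains no samples would simply be missing from `totals`; reindexing on `window_ids` afterwards is one way to make such windows explicit.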

Best answer

You can vectorize this with searchsorted (assuming the datetime index is sorted; if it is not, sort it first):

In [11]: inds = np.searchsorted(df.index.values, slicer_df.values)

In [12]: s = df.cumsum().values  # only sum once! (plain array, so s[i] is positional)

In [13]: pd.Series([s[end] - s[start-1] if start else s[end] for start, end in inds], slicer_df.index)
Out[13]:
1     36.381155
2    111.521803
dtype: float64

There is still a loop in there, but it is now much cheaper!

This leads us to a fully vectorized (if somewhat cryptic) solution:

In [21]: inds2 = np.maximum(1, inds)  # see note

In [22]: inds2[:, 0] -= 1

In [23]: inds2
Out[23]:
array([[ 23,  96],
       [119, 336]])

In [24]: x = s[inds2]

In [25]: x
Out[25]:
array([[  11.4596498 ,   47.84080472],
       [  55.94941276,  167.47121538]])

In [26]: x[:, 1] - x[:, 0]
Out[26]: array([  36.38115493,  111.52180263])

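Note: when a start date is earlier than the first date in the index, we want to avoid the start index rolling back from 0 to -1 (which would refer to the end of the array, i.e. an underflow).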
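One way to sidestep the rollover issue altogether is to prepend a zero to the cumulative sum, so that no `start - 1` lookup is needed. The sketch below is not part of the original answer: it uses a small made-up series (`ser`) and made-up windows (`windows`), and it additionally assumes `side="right"` for the window ends, so that an end time falling between two samples (or after the last one) is still handled with inclusive semantics.

```python
import numpy as np
import pandas as pd

# Small deterministic series so the result can be checked by hand.
idx = pd.date_range("2017-01-01", periods=10, freq="h")
ser = pd.Series(np.arange(10, dtype=float), index=idx)

# Hypothetical windows, inclusive on both ends; the first starts before the data.
windows = pd.DataFrame(
    {
        "start_window": pd.to_datetime(["2016-12-31 12:00", "2017-01-01 03:00"]),
        "end_window": pd.to_datetime(["2017-01-01 02:00", "2017-01-01 06:30"]),
    },
    index=[1, 2],
)

# Prefix sums with a leading zero: csum[k] is the sum of the first k values.
csum = np.concatenate(([0.0], ser.to_numpy().cumsum()))

# lo: first position whose timestamp is >= start.
# hi: first position whose timestamp is > end.
times = ser.index.to_numpy()
lo = np.searchsorted(times, windows["start_window"].to_numpy(), side="left")
hi = np.searchsorted(times, windows["end_window"].to_numpy(), side="right")

# Sum over positions lo..hi-1 for every window, with no Python-level loop.
totals = pd.Series(csum[hi] - csum[lo], index=windows.index, name="total")
```

On this data `totals` is 3.0 for window 1 and 18.0 for window 2, which matches what the boolean-mask loop from the question gives for the same windows.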

For more on "python - Efficiently taking variable-length time slices from a DataFrame", see the similar question on Stack Overflow: https://stackoverflow.com/questions/46902567/