问题描述
我有一些分层数据,这些数据从底到时间序列数据,如下所示:
I have some hierarchical data which bottoms out into time series data which looks something like this:
df = pandas.DataFrame(
{'value_a': values_a, 'value_b': values_b},
index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 0 10
2012-01-02 1 11
2012-01-03 2 12
2012-01-04 3 13
Savanna 2012-01-01 4 14
2012-01-02 5 15
2012-01-03 6 16
2012-01-04 7 17
Alabama Mobile 2012-01-01 8 18
2012-01-02 9 19
2012-01-03 10 20
2012-01-04 11 21
Montgomery 2012-01-01 12 22
2012-01-02 13 23
2012-01-03 14 24
2012-01-04 15 25
我想对每个城市进行时间重采样,所以类似
I'd like to perform time resampling per city, so something like
df.resample("2D", how="sum")
将输出
value_a value_b
State City Date
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
按原样,df.resample('2D', how='sum')
让我
TypeError: Only valid with DatetimeIndex or PeriodIndex
足够公平,但是我有点希望它能起作用:
Fair enough, but I'd sort of expect this to work:
>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex
在那一点上我真的没什么主意了...堆栈和堆栈可能有什么方法可以帮助我吗?
at which point I'm really running out of ideas... is there some way stack and unstack might be able to help me?
推荐答案
pd.Grouper
允许您指定目标对象的groupby指令".在特别是,即使df.index
不是DatetimeIndex
,您也可以使用它按日期分组:
pd.Grouper
allows you to specify a "groupby instruction for a target object". Inparticular, you can use it to group by dates even if df.index
is not a DatetimeIndex
:
df.groupby(pd.Grouper(freq='2D', level=-1))
level=-1
告诉pd.Grouper
在MultiIndex的最后一级中查找日期.此外,您可以将其与索引中的其他级别值结合使用:
The level=-1
tells pd.Grouper
to look for the dates in the last level of the MultiIndex.Moreover, you can use this in conjunction with other level values from the index:
level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0,1]]
+[pd.Grouper(freq='2D', level=-1)]).sum())
看起来有点尴尬,但是using_Grouper
却比我原来的要快得多建议,using_reset_index
:
It looks a bit awkward, but using_Grouper
turns out to be much faster than my originalsuggestion, using_reset_index
:
import numpy as np
import pandas as pd
import datetime as DT
def using_Grouper(df):
level_values = df.index.get_level_values
return (df.groupby([level_values(i) for i in [0,1]]
+[pd.Grouper(freq='2D', level=-1)]).sum())
def using_reset_index(df):
df = df.reset_index(level=[0, 1])
return df.groupby(['State','City']).resample('2D').sum()
def using_stack(df):
# http://stackoverflow.com/a/15813787/190597
return (df.unstack(level=[0,1])
.resample('2D').sum()
.stack(level=[2,1])
.swaplevel(2,0))
def make_orig():
values_a = range(16)
values_b = range(10, 26)
states = ['Georgia']*8 + ['Alabama']*8
cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)
df = pd.DataFrame(
{'value_a': values_a, 'value_b': values_b},
index = [states, cities, dates])
df.index.names = ['State', 'City', 'Date']
return df
def make_df(N):
dates = pd.date_range('2000-1-1', periods=N)
states = np.arange(50)
cities = np.arange(10)
index = pd.MultiIndex.from_product([states, cities, dates],
names=['State', 'City', 'Date'])
df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,
columns=['value_a', 'value_b'])
return df
df = make_orig()
print(using_Grouper(df))
收益
value_a value_b
State City Date
Alabama Mobile 2012-01-01 17 37
2012-01-03 21 41
Montgomery 2012-01-01 25 45
2012-01-03 29 49
Georgia Atlanta 2012-01-01 1 21
2012-01-03 5 25
Savanna 2012-01-01 9 29
2012-01-03 13 33
这是在5000行DataFrame上比较using_Grouper
,using_reset_index
,using_stack
的基准:
Here is a benchmark comparing using_Grouper
, using_reset_index
, using_stack
on a 5000-row DataFrame:
In [30]: df = make_df(10)
In [34]: len(df)
Out[34]: 5000
In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop
In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop
In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
这篇关于在Pandas MultiIndex中重新采样的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!