Question
Say I have a dataframe with 3 columns: Date, Ticker, Value (no index, at least to start with). I have many dates and many tickers, but each (ticker, date) tuple is unique. (Obviously the same date will show up in many rows, since it will be there for multiple tickers, and the same ticker will show up in many rows, since it will be there for many dates.)
Initially, my rows are in a specific order, but not sorted by any of the columns.
I would like to compute first differences (daily changes) of each ticker (ordered by date) and put these in a new column in my dataframe. Given this context, I cannot simply do
df['diffs'] = df['value'].diff()
because adjacent rows do not come from the same ticker. Sorting like this:
df = df.sort_values(['ticker', 'date'])  # df.sort() was removed from pandas; sort_values() replaces it
df['diffs'] = df['value'].diff()
doesn't solve the problem because there will be "borders". That is, after that sort, the last value for one ticker will sit directly above the first value for the next ticker, so computing differences would take a difference between two different tickers. I don't want this. I want the earliest date for each ticker to wind up with a NaN in its diff column.
This seems like an obvious time to use groupby but, for whatever reason, I can't seem to get it to work properly. To be clear, I would like to perform the following process:
- Group rows based on their ticker
- Within each group, sort rows by their date
- Within each sorted group, compute differences of the value column
- Put these differences into the original dataframe in a new diffs column (ideally leaving the original dataframe order intact)
I have to imagine this is a one-liner. But what am I missing?
Edit at 9:00pm 2013-12-17
Ok... some progress. I can do the following to get a new dataframe:
result = df.set_index(['ticker', 'date'])\
.groupby(level='ticker')\
.transform(lambda x: x.sort_index().diff())\
.reset_index()
But if I understand the mechanics of groupby, my rows will now be sorted first by ticker and then by date. Is that correct? If so, would I need to do a merge to append the differences column (currently in result['current']) to the original dataframe df?
Recommended answer
Wouldn't it be easier to do what you describe yourself, namely
df.sort_values(['ticker', 'date'], inplace=True)  # df.sort() was removed from pandas; sort_values() replaces it
df['diffs'] = df['value'].diff()
and then correct for the borders:
mask = df.ticker != df.ticker.shift(1)
df.loc[mask, 'diffs'] = np.nan  # .loc avoids chained assignment; np is numpy
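To see why the border correction is needed, here is a minimal sketch on hypothetical toy data (the two tickers and their values are invented for illustration, not taken from the question). It records the difference at the first row of the second ticker before masking, then applies the mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two tickers, rows deliberately out of order.
df = pd.DataFrame({
    "date":   [2, 1, 1, 3, 2],
    "ticker": ["B", "A", "B", "A", "A"],
    "value":  [10.0, 1.0, 7.0, 6.0, 3.0],
})

df = df.sort_values(["ticker", "date"])
df["diffs"] = df["value"].diff()

# Before masking, the first B row holds B's first value minus A's last value.
border_before = df.loc[df["ticker"] == "B", "diffs"].iloc[0]

# Mark the first row of each ticker and blank out its difference.
mask = df["ticker"] != df["ticker"].shift(1)
df.loc[mask, "diffs"] = np.nan

print(border_before)
print(df)
```

The mask is True wherever the ticker differs from the one on the previous row, which includes the very first row because shift(1) produces a missing value there.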
To maintain the original index, you may do idx = df.index at the beginning and df = df.reindex(idx) at the end (reindex returns a new dataframe, so assign the result back); or, if it is a huge dataframe, perform the operations on
df.filter(['ticker', 'date', 'value'])
and then join the two dataframes at the end.
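Both ways of getting back to the original row order can be sketched as follows, using the ten-row sample from this answer. The extra note column and the variable names out, small, restored and joined are hypothetical, added only to show that other columns survive; this is one possible arrangement, not necessarily what the answer's author had in mind.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date":   [63, 88, 22, 76, 72, 34, 92, 22, 32, 59],
    "ticker": ["C", "C", "C", "A", "B", "A", "B", "A", "A", "B"],
    "value":  [1.65, -1.93, -1.29, -0.79, -1.24, -0.23, 2.43, 0.55, -2.50, -1.01],
    "note":   list("abcdefghij"),  # hypothetical extra column
})

# Option 1: remember the row order, sort, diff, mask, then restore the order.
idx = df.index
out = df.sort_values(["ticker", "date"])
out["diffs"] = out["value"].diff()
out.loc[out["ticker"] != out["ticker"].shift(1), "diffs"] = np.nan
restored = out.reindex(idx)

# Option 2: work on a narrow copy, then join the new column back on the index.
small = df.filter(["ticker", "date", "value"]).sort_values(["ticker", "date"])
small["diffs"] = small["value"].diff()
small.loc[small["ticker"] != small["ticker"].shift(1), "diffs"] = np.nan
joined = df.join(small[["diffs"]])

print(restored)
print(joined)
```

Because join aligns on the index, the second option returns the rows in the order of df without an explicit reindex.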
Edit: alternatively (though still not using groupby):
df.set_index(['ticker','date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan
for idx in df.index.levels[0]:
    # write through .loc (no chained assignment); .values skips index alignment
    df.loc[idx, 'diffs'] = df.loc[idx, 'value'].diff().values
For the input
   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01
this will produce:
             value  diffs
ticker date
A      22     0.55    NaN
       32    -2.50  -3.05
       34    -0.23   2.27
       76    -0.79  -0.56
B      59    -1.01    NaN
       72    -1.24  -0.23
       92     2.43   3.67
C      22    -1.29    NaN
       63     1.65   2.94
       88    -1.93  -3.58
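For completeness, the groupby one-liner the question asks about can be sketched like this. It is a swapped-in technique, not part of the answer above: it assumes the pandas groupby diff method and relies on column assignment aligning on the index, so the original row order is left untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "date":   [63, 88, 22, 76, 72, 34, 92, 22, 32, 59],
    "ticker": ["C", "C", "C", "A", "B", "A", "B", "A", "A", "B"],
    "value":  [1.65, -1.93, -1.29, -0.79, -1.24, -0.23, 2.43, 0.55, -2.50, -1.01],
})

# Sort by date so each ticker's rows are in time order, then diff within each ticker.
# The result keeps the original row labels, so the assignment aligns on the index
# and the first date of every ticker gets NaN.
df["diffs"] = df.sort_values("date").groupby("ticker")["value"].diff()

print(df)
```

The values match the table above, only arranged in the original row order rather than sorted by ticker and date.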