Problem description
I am using a recurrent neural network to consume time-series events (a click stream). My data needs to be formatted so that each row contains all the events for an id. The data is one-hot encoded, and I have already grouped it by id. I also limit the total number of events per id (e.g. 2), so the final width is always known (#one-hot cols x #events). I need to preserve the order of the events, because they are ordered by time.
Current data state:
    id  page.A  page.B  page.C
0  001       0       1       0
1  001       1       0       0
2  002       0       0       1
3  002       1       0       0
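The frame above can be reproduced with a small sketch like the following. Storing the ids as strings is an assumption made here so that the leading zeros in `001` are kept; the variable name `df` matches the one used in the answer.

```python
import pandas as pd

# Sample click-stream frame, one one-hot encoded event per row.
# Assumption: ids are strings, so "001" keeps its leading zeros.
df = pd.DataFrame(
    {
        "id": ["001", "001", "002", "002"],
        "page.A": [0, 1, 0, 1],
        "page.B": [1, 0, 0, 0],
        "page.C": [0, 0, 1, 0],
    }
)
print(df)
```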
Desired data state:
    id  page.A1  page.B1  page.C1  page.A2  page.B2  page.C2
0  001        0        1        0        1        0        0
1  002        0        0        1        1        0        0
This looks like a pivot problem to me, but my resulting dataframes are not in the format I need. Any suggestions on how I should approach this?
Recommended answer
The idea here is to call reset_index within each 'id' group, which gives a count of which row of that particular 'id' we are at. Then follow that up with unstack and sort_index to put the columns where they are supposed to be.
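To make the intermediate shapes visible, the steps described above can be sketched one at a time. This is an illustration under assumptions: the sample frame is rebuilt from the question with string ids, and the names `stacked` and `wide` are made up for this example.

```python
import pandas as pd

# Sample data from the question (ids assumed to be strings).
df = pd.DataFrame(
    {
        "id": ["001", "001", "002", "002"],
        "page.A": [0, 1, 0, 1],
        "page.B": [1, 0, 0, 0],
        "page.C": [0, 0, 1, 0],
    }
)

# Step 1: renumber the rows inside each id group (0, 1, ...).
# The result has a two-level index: (id, position within the id).
stacked = (
    df.set_index("id")
    .groupby(level=0, group_keys=True)
    .apply(lambda g: g.reset_index(drop=True))
)

# Step 2: move the position level from the index into the columns,
# giving (page, position) column pairs.
wide = stacked.unstack()

# Step 3: order the columns by position first, then by page.
wide = wide.sort_index(axis=1, level=1)
print(wide)
```

Because the position comes from the row order within each group, the time ordering of the events is preserved as long as the input rows are already sorted by time.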
Finally, flatten the MultiIndex.
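# Number the rows within each id, pivot the positions into columns, then order by position
df1 = df.set_index('id').groupby(level=0) \
    .apply(lambda g: g.reset_index(drop=True)) \
    .unstack().sort_index(axis=1, level=1)  # Thx @jezrael for sort reminder
# Flatten the (page, position) MultiIndex into names such as page.A1
df1.columns = ['{}{}'.format(x[0], int(x[1]) + 1) for x in df1.columns]
df1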
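As a possible alternative (not part of the original answer), the per-group position can be computed with `groupby(...).cumcount()` instead of `apply`, which avoids a Python-level lambda. The sketch below assumes the sample frame from the question with string ids; the names `pos` and `wide` are illustrative.

```python
import pandas as pd

# Sample data from the question (ids assumed to be strings).
df = pd.DataFrame(
    {
        "id": ["001", "001", "002", "002"],
        "page.A": [0, 1, 0, 1],
        "page.B": [1, 0, 0, 0],
        "page.C": [0, 0, 1, 0],
    }
)

# Position of each event within its id: 0 for the first, 1 for the second, ...
pos = df.groupby("id").cumcount()

wide = (
    df.set_index(["id", pos])        # index becomes (id, position)
    .unstack()                       # position moves into the columns
    .sort_index(axis=1, level=1)     # order by position, then by page
)

# Flatten (page, position) pairs into names such as page.A1.
wide.columns = [f"{page}{n + 1}" for page, n in wide.columns]
print(wide)
```

With either approach, an id that has fewer events than the maximum gets NaN in its missing slots, so those would need to be filled (for example with zeros) before the rows are fed to a network.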