Problem description
I have the following Pandas DataFrame:
There is one row per ID per year (2008 - 2015). For the columns Max Temp, Min Temp, and Rain, each cell contains an array of values, one for each day in that year, i.e. for the frame above:
- frame3.iloc[0]['Max Temp'][0] is the value for January 1st 2011
- frame3.iloc[0]['Max Temp'][364] is the value for December 31st 2011.
I'm aware this is badly structured, but this is the data I have to deal with. It is stored in MongoDB in this way (where one of these rows equates to a document in Mongo).
I want to split these nested arrays so that, instead of one row per ID per year, I have one row per ID per day. While splitting the arrays, however, I would also like to create a new column that captures the day of the year, based on the current array index. I would then use this day, plus the Year column, to create a DatetimeIndex.
I searched here for relevant answers, but only found this one, which doesn't really help me.
Recommended answer
You can run .apply(pd.Series) on each of your columns, then stack and concatenate the results.
For a Series:
import pandas as pd
s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])
s
Out[103]:
2011 [0, 1]
2012 [2, 3, 4]
dtype: object
it works as follows:
s.apply(pd.Series).stack()
Out[104]:
2011  0    0.0
      1    1.0
2012  0    2.0
      1    3.0
      2    4.0
dtype: float64
The elements of the series have different lengths (this matters because 2012 was a leap year). The intermediate result, i.e. the one before stack, had a NaN value that was later dropped.
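To make the intermediate step visible, here is a minimal sketch using the same series. The names wide, n_missing and long are illustrative only. Older pandas versions drop the padding NaN inside stack by default, while newer releases offer a stack implementation that keeps it, so the sketch calls dropna() explicitly to behave the same either way.

```python
import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

# Expanding each list into columns pads the shorter list with NaN.
wide = s.apply(pd.Series)
n_missing = int(wide.isna().sum().sum())

# stack() moves the column labels (the array positions) into an inner
# index level; dropna() removes the padding regardless of pandas version.
long = wide.stack().dropna()

print(wide)
print(long)
```

The inner index level of long is the position inside the original array, which is what later serves as the day offset.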
Now, let's take a frame:
a = list(range(14))
b = list(range(20, 34))
df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})
df
Out[108]:
                  A                 B     ID  Year
0         [0, 1, 2]      [20, 21, 22]  11111  2011
1      [3, 4, 5, 6]  [23, 24, 25, 26]  11111  2012
2         [7, 8, 9]      [27, 28, 29]  11112  2011
3  [10, 11, 12, 13]  [30, 31, 32, 33]  11112  2012
Then we can run:
# set an index (each column will inherit it)
df2 = df.set_index(['ID', 'Year'])
# the trick
unnested_lst = []
for col in df2.columns:
    unnested_lst.append(df2[col].apply(pd.Series).stack())
result = pd.concat(unnested_lst, axis=1, keys=df2.columns)
and get:
result
Out[115]:
                   A     B
ID    Year
11111 2011 0     0.0  20.0
           1     1.0  21.0
           2     2.0  22.0
      2012 0     3.0  23.0
           1     4.0  24.0
           2     5.0  25.0
           3     6.0  26.0
11112 2011 0     7.0  27.0
           1     8.0  28.0
           2     9.0  29.0
      2012 0    10.0  30.0
           1    11.0  31.0
           2    12.0  32.0
           3    13.0  33.0
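As a possible alternative to the apply/stack loop (not part of the original answer), DataFrame.explode can unnest several list columns at once. This is a sketch under two assumptions: pandas 1.3 or later, and lists of equal length within each row. The Day column name is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [11111, 11111],
                   'Year': [2011, 2012],
                   'A': [[0, 1, 2], [3, 4, 5, 6]],
                   'B': [[20, 21, 22], [23, 24, 25, 26]]})

# Multi-column explode (pandas >= 1.3) repeats ID and Year for each element
# and keeps the original row label in the index.
long = df.explode(['A', 'B'])

# The position within each original row is the zero-based day offset.
long['Day'] = long.groupby(level=0).cumcount()

# explode leaves object dtype behind, so cast the values back to integers.
long = long.astype({'A': 'int64', 'B': 'int64'}).reset_index(drop=True)

print(long)
```

Unlike the stack approach, this keeps the values as integers, because no NaN padding is ever introduced.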
The rest (building the datetime index) is more or less straightforward. For example:
# DatetimeIndex
years = pd.to_datetime(result.index.get_level_values(1).astype(str), format='%Y')
# TimedeltaIndex
days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
# If the above line doesn't work (a bug in pandas), try this:
# days = result.index.get_level_values(2).astype('timedelta64[D]')
# the sum is again a DatetimeIndex
dates = years + days
dates.name = 'Date'
new_index = pd.MultiIndex.from_arrays([result.index.get_level_values(0), dates])
result.index = new_index
result
Out[130]:
                     A     B
ID    Date
11111 2011-01-01   0.0  20.0
      2011-01-02   1.0  21.0
      2011-01-03   2.0  22.0
      2012-01-01   3.0  23.0
      2012-01-02   4.0  24.0
      2012-01-03   5.0  25.0
      2012-01-04   6.0  26.0
11112 2011-01-01   7.0  27.0
      2011-01-02   8.0  28.0
      2011-01-03   9.0  29.0
      2012-01-01  10.0  30.0
      2012-01-02  11.0  31.0
      2012-01-03  12.0  32.0
      2012-01-04  13.0  33.0
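With full-length yearly arrays, the same year-plus-offset arithmetic should map index 364 to December 31st in a normal year and index 365 to December 31st in a leap year. A small check, using hand-picked years and days values rather than the real data:

```python
import pandas as pd

# Hand-picked (year, zero-based day offset) pairs for illustration.
years = pd.Index([2011, 2011, 2012, 2012])
days = pd.Index([0, 364, 59, 365])

# January 1st of each year plus the day offset gives the calendar date.
dates = pd.to_datetime(years.astype(str), format='%Y') + pd.to_timedelta(days, unit='D')

print(dates.strftime('%Y-%m-%d').tolist())
```

Offset 59 in 2012 lands on February 29th, so leap years need no special handling.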