Problem description
I have a dataframe as given below:
import numpy as np
import pandas as pd
data_file = pd.DataFrame({'person_id': [1,1,1,1,2,2,2,3,3,3], 'ob.date': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'observation': ['Age','interviewdate','marital_status','interviewdate','Age','interviewdate','marital_status','Age','interviewdate','marital_status'],
'answer': [21,'21/08/2017','Single','22/05/2217', 26,'11/03/2010','Single',41,'31/09/2012','Married']
})
What I would like to do is fetch the date values from the answer column and put them in the ob.date column. The dataframe provided shows that person_id = 1 answered the question about Age on 21/08/2017, and on 22/05/2017 he answered the question about marital_status.
This is what I tried, based on a suggestion from another SO post:
s = data_file[(data_file.observation == 'interviewdate')].set_index('person_id')['answer']
data_file['ob.date'] = data_file['person_id'].map(s)
But this does not work, as I get a duplicate index error. How can I avoid that issue and make it efficient enough?
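For anyone trying to reproduce the problem, here is a minimal sketch of where the duplicate index comes from. The small frame `df` is a stand-in for the data above (an assumption for illustration, not the full dataset). Because person_id 1 has two interviewdate rows, the lookup Series built with set_index ends up with a repeated label, which is presumably what map objects to.

```python
import pandas as pd

# Stand-in for the question's data: person 1 has two interviewdate rows.
df = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 2, 2],
    'observation': ['Age', 'interviewdate', 'marital_status',
                    'interviewdate', 'Age', 'interviewdate'],
    'answer': [21, '21/08/2017', 'Single', '22/05/2217', 26, '11/03/2010'],
})

# Same lookup Series as in the attempt above.
s = df[df.observation == 'interviewdate'].set_index('person_id')['answer']

# The index is not unique, so a person_id -> date lookup is ambiguous.
print(s.index.is_unique)
# Labels that occur more than once in the index.
print(s.index[s.index.duplicated()].tolist())
```

A lookup keyed on person_id alone cannot tell which of the two dates belongs to which row, so the fill has to take row order within each person into account.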
Any elegant and efficient solution would be helpful. person_id = 1 has two date values, so all rows above each interviewdate observation should be filled with the value from the answer column of that interviewdate observation.
How can I get my output to look like that?
Recommended answer
It all depends on the data. First set the new column from answer by condition, then replace the missing values within each group by back filling and forward filling:
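# Copy answer into ob.date on interviewdate rows only; other rows become NaN
data_file['ob.date'] = data_file.loc[data_file.observation == 'interviewdate', 'answer']
# Fill within each person: backward first, then forward for trailing NaNs.
# transform keeps the original row index, so the result assigns back cleanly.
data_file['ob.date'] = (data_file.groupby('person_id')['ob.date']
                                 .transform(lambda x: x.bfill().ffill()))
print(data_file)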
   person_id     ob.date     observation      answer
0          1  21/08/2017             Age          21
1          1  21/08/2017   interviewdate  21/08/2017
2          1  22/05/2217  marital_status      Single
3          1  22/05/2217   interviewdate  22/05/2217
4          2  11/03/2010             Age          26
5          2  11/03/2010   interviewdate  11/03/2010
6          2  11/03/2010  marital_status      Single
7          3  31/09/2012             Age          41
8          3  31/09/2012   interviewdate  31/09/2012
9          3  31/09/2012  marital_status     Married
Details:
Back filling per group is used first, because the interviewdate rows are edge rows: all rows before each of them belong to the same subgroup. Forward filling is then added to replace the last NaNs in each group, which bfill does not replace:
# bfill alone leaves NaN after the last interviewdate row of a group
data_file['ob.date'] = (data_file.groupby('person_id')['ob.date']
                                 .transform(lambda x: x.bfill()))
print(data_file)
   person_id     ob.date     observation      answer
0          1  21/08/2017             Age          21
1          1  21/08/2017   interviewdate  21/08/2017
2          1  22/05/2217  marital_status      Single
3          1  22/05/2217   interviewdate  22/05/2217
4          2  11/03/2010             Age          26
5          2  11/03/2010   interviewdate  11/03/2010
6          2         NaN  marital_status      Single
7          3  31/09/2012             Age          41
8          3  31/09/2012   interviewdate  31/09/2012
9          3         NaN  marital_status     Married
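The same two-step fill can also be sketched without a lambda, using the built-in groupby bfill and ffill methods. This is only one possible variant, not necessarily faster on every dataset; the names `df`, `dates` and `filled` are chosen here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'observation': ['Age', 'interviewdate', 'marital_status', 'interviewdate',
                    'Age', 'interviewdate', 'marital_status',
                    'Age', 'interviewdate', 'marital_status'],
    'answer': [21, '21/08/2017', 'Single', '22/05/2217', 26, '11/03/2010',
               'Single', 41, '31/09/2012', 'Married'],
})

# Keep the answer only on interviewdate rows; every other row becomes NaN.
dates = df['answer'].where(df['observation'].eq('interviewdate'))

# Back fill within each person, then forward fill what bfill could not reach.
filled = dates.groupby(df['person_id']).bfill()
filled = filled.groupby(df['person_id']).ffill()

df['ob.date'] = filled
print(df)
```

Both groupby calls return a Series aligned with the original row index, so the result can be assigned straight back as a column.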