Problem description
Individuals (indexed from 0 to 5) choose between two locations: A and B. My data is in wide format, containing characteristics that vary by individual (ind_var) and characteristics that vary only by location (location_var).
For example, I have:
In [281]:
df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})
df_reshape_test
Out[281]:
   dist_to_A  dist_to_B  ind_var location  location_var
0          0         50        3        A            10
1          0         50        8        A            10
2          0         50       10        A            10
3         50          0        1        B            14
4         50          0        3        B            14
5         50          0        4        B            14
The variable 'location' is the location chosen by the individual. dist_to_A is the distance to location A from the location chosen by the individual (likewise for dist_to_B).
I'd like my data to have this form:
   choice  dist_S  ind_var location  location_var
0       1       0        3        A            10
0       0      50        3        B            14
1       1       0        8        A            10
1       0      50        8        B            14
2       1       0       10        A            10
2       0      50       10        B            14
3       0      50        1        A            10
3       1       0        1        B            14
4       0      50        3        A            10
4       1       0        3        B            14
5       0      50        4        A            10
5       1       0        4        B            14
where choice == 1 indicates that the individual has chosen that location, and dist_S is the distance from the chosen location.
I read about the .stack method but couldn't figure out how to apply it to this case. Thanks for your time!
NOTE: this is just a simple example. The datasets I'm looking at have varying numbers of locations and varying numbers of individuals per location, so I'm looking for a flexible solution if possible.
Recommended answer
In fact, pandas has a wide_to_long function that can conveniently do what you intend to do.
import pandas as pd

df = pd.DataFrame({'location': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'dist_to_A': [0, 0, 0, 50, 50, 50],
                   'dist_to_B': [50, 50, 50, 0, 0, 0],
                   'location_var': [10, 10, 10, 14, 14, 14],
                   'ind_var': [3, 8, 10, 1, 3, 4]})
df['ind'] = df.index
# `location` and `location_var` describe the choices rather than the individuals,
# so record them as dictionaries and drop them before reshaping.
ind_to_loc = df['location'].to_dict()  # individual -> chosen location
# location -> location_var (location_var is constant within a location)
loc_dict = df.groupby('location')['location_var'].first().to_dict()
df = df.drop(['location_var', 'location'], axis=1)
# Now reshape. Recent pandas versions need suffix=r'\w+' here, because the
# default suffix pattern only matches numeric suffixes, not 'A' and 'B'.
df_long = pd.wide_to_long(df, ['dist_to_'], i='ind', j='location', suffix=r'\w+')
# Use the dictionaries to get the variables `choice` and `location_var` back.
df_long['choice'] = df_long.index.get_level_values('ind').map(ind_to_loc)
df_long['location_var'] = df_long.index.get_level_values('location').map(loc_dict)
print(df_long.sort_index())
This gives you the table you asked for:
              ind_var  dist_to_ choice  location_var
ind location
0   A               3         0      A            10
    B               3        50      A            14
1   A               8         0      A            10
    B               8        50      A            14
2   A              10         0      A            10
    B              10        50      A            14
3   A               1        50      B            10
    B               1         0      B            14
4   A               3        50      B            10
    B               3         0      B            14
5   A               4        50      B            10
    B               4         0      B            14
Of course, you can generate a choice variable that takes the values 0 and 1 if that's what you want.
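As a rough sketch of that last step, the 0/1 indicator can be built by comparing each long-format row's location with the location the individual actually picked. This is one possible way to get the layout requested in the question, not the only one. It assumes location_var is constant within each location and a pandas version whose wide_to_long accepts the suffix argument. The names chosen, loc_var and long_df are illustrative and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({'location': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'dist_to_A': [0, 0, 0, 50, 50, 50],
                   'dist_to_B': [50, 50, 50, 0, 0, 0],
                   'location_var': [10, 10, 10, 14, 14, 14],
                   'ind_var': [3, 8, 10, 1, 3, 4]})

# Lookup from location to location_var
# (assumption: location_var is constant within a location).
loc_var = df.groupby('location')['location_var'].first()

# Keep the chosen location under a different name so that the new
# long-format 'location' column does not clash with it.
wide = df.drop(columns='location_var').rename(columns={'location': 'chosen'})
wide['ind'] = wide.index

# suffix=r'\w+' lets wide_to_long pick up non-numeric suffixes such as 'A', 'B'.
long_df = pd.wide_to_long(wide, stubnames='dist_to_', i='ind', j='location',
                          suffix=r'\w+').reset_index()

# choice is 1 on the row whose location is the one the individual picked.
long_df['choice'] = (long_df['location'] == long_df['chosen']).astype(int)
long_df['location_var'] = long_df['location'].map(loc_var)

result = (long_df.rename(columns={'dist_to_': 'dist_S'})
                 .drop(columns='chosen')
                 .sort_values(['ind', 'location'])
                 .set_index('ind')
                 [['choice', 'dist_S', 'ind_var', 'location', 'location_var']])
print(result)
```

Because the set of locations is read from the dist_to_ column names and from the groupby, the same steps should carry over to data with more locations or uneven numbers of individuals per location, as long as every location has a matching dist_to_ column.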