Problem description
I have a data frame like this:
Index ID Industry years_spend asset
6646 892 4 4 144.977037
2347 315 10 8 137.749138
7342 985 1 5 104.310217
137 18 5 5 156.593396
2840 381 11 2 229.538828
6579 883 11 1 171.380125
1776 235 4 7 217.734377
2691 361 1 2 148.865341
815 110 15 4 233.309491
2932 393 17 5 187.281724
I want to create dummy variables for the Industry x years_spend interaction, which would create len(df.Industry.value_counts()) * len(df.years_spend.value_counts()) variables. For example, d_11_4 = 1 for all rows that have Industry == 11 and years_spend == 4, and d_11_4 = 0 otherwise. I can then use these variables in some regression work.
I know I can form the groups I want with df.groupby(['Industry','years_spend']), and I know I can create such variables for a single column using patsy syntax in statsmodels:
import statsmodels.formula.api as smf
mod = smf.ols("income ~ C(Industry)", data=df).fit()
But if I try to do this with two columns, I get an error: IndexError: tuple index out of range
How can I do that with pandas or with some function inside statsmodels?
Recommended answer
You could do something like this, where you first create a calculated field that combines Industry and years_spend:
import pandas as pd
df = pd.DataFrame({'Industry': [4, 3, 11, 4, 1, 1], 'years_spend': [4, 5, 8, 4, 4, 1]})
df['industry_years'] = df['Industry'].astype('str') + '_' + df['years_spend'].astype('str') # this is the calculated field
df looks like this:
Industry years_spend industry_years
0 4 4 4_4
1 3 5 3_5
2 11 8 11_8
3 4 4 4_4
4 1 4 1_4
5 1 1 1_1
Now you can apply get_dummies:
df = pd.get_dummies(df, columns=['industry_years'])
That will give you what you want :)
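For reference, the steps above can be put together into one runnable sketch. It uses the same toy data as the answer. The prefix "d" (to get names like d_11_8, matching the naming in the question) and the variable names key, observed, all_keys and full are illustrative choices, not part of the original answer. Note that get_dummies only creates columns for combinations that actually occur, so the second part, which builds the full Industry x years_spend grid via a categorical, is an assumption about what the question's column count implies.

```python
import pandas as pd

# Same toy data as in the answer above.
df = pd.DataFrame({
    "Industry": [4, 3, 11, 4, 1, 1],
    "years_spend": [4, 5, 8, 4, 4, 1],
})

# Combined key, e.g. "11_8" for Industry == 11 and years_spend == 8.
key = df["Industry"].astype(str) + "_" + df["years_spend"].astype(str)

# One 0/1 column per combination that actually occurs in the data.
# prefix="d" gives names such as d_11_8; dtype=int gives 0/1 instead of booleans.
observed = pd.get_dummies(key, prefix="d", dtype=int)

# Optional: one column for every possible Industry x years_spend pair,
# including pairs that never occur (those columns are all zeros).
all_keys = [
    f"{i}_{y}"
    for i in sorted(df["Industry"].unique())
    for y in sorted(df["years_spend"].unique())
]
full = pd.get_dummies(
    pd.Series(pd.Categorical(key, categories=all_keys)),
    prefix="d",
    dtype=int,
)

# Attach the observed dummies to the original frame.
result = pd.concat([df, observed], axis=1)
print(result)
print(observed.shape[1], "observed combinations;", full.shape[1], "possible combinations")
```

For regression, the all-zero columns in the full grid carry no information, so the observed-only version is usually the one to pass on; one dummy would also normally be dropped (or the intercept removed) to avoid perfect collinearity.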