Creating dummy variables for the interaction of two columns using pandas or statsmodels

Question

I have a data frame like this:

Index ID  Industry  years_spend       asset
6646  892         4            4  144.977037
2347  315        10            8  137.749138
7342  985         1            5  104.310217
137    18         5            5  156.593396
2840  381        11            2  229.538828
6579  883        11            1  171.380125
1776  235         4            7  217.734377
2691  361         1            2  148.865341
815   110        15            4  233.309491
2932  393        17            5  187.281724

I want to create dummy variables for Industry x years_spend, which gives len(df.Industry.value_counts()) * len(df.years_spend.value_counts()) variables. For example, d_11_4 = 1 for all rows that have Industry == 11 and years_spend == 4, and d_11_4 = 0 otherwise. I can then use these variables in regressions.

I know I can form the groups I want with df.groupby(['Industry','years_spend']), and I know I can create such variables for a single column using patsy syntax in statsmodels:

import statsmodels.formula.api as smf

mod = smf.ols("income ~   C(Industry)", data=df).fit()

But if I try to do this with two columns, I get an error: IndexError: tuple index out of range

How can I do this with pandas, or with some function inside statsmodels?

Recommended answer

You could do something like this: first create a calculated field that combines Industry and years_spend:

import pandas as pd

df = pd.DataFrame({'Industry': [4, 3, 11, 4, 1, 1], 'years_spend': [4, 5, 8, 4, 4, 1]})
df['industry_years'] = df['Industry'].astype(str) + '_' + df['years_spend'].astype(str)  # calculated field: "<Industry>_<years_spend>"

df then looks like this:

   Industry  years_spend industry_years
0         4            4            4_4
1         3            5            3_5
2        11            8           11_8
3         4            4            4_4
4         1            4            1_4
5         1            1            1_1

Now you can apply get_dummies:

df = pd.get_dummies(df, columns=['industry_years'])
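Note that get_dummies only creates columns for the combinations that actually occur in the data (five in this sample), while the question asks for one variable per Industry x years_spend pair. A minimal sketch of one way to get the full grid, assuming it is acceptable to store the calculated field as a categorical; the names all_labels, key and full, and the d prefix, are illustrative choices rather than anything from the original answer:

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({'Industry': [4, 3, 11, 4, 1, 1], 'years_spend': [4, 5, 8, 4, 4, 1]})

# Every Industry x years_spend label, including pairs that never occur together.
industries = sorted(df['Industry'].unique())
years = sorted(df['years_spend'].unique())
all_labels = [f"{i}_{y}" for i, y in product(industries, years)]

# Same calculated field as above, stored as a categorical with the full label set.
key = df['Industry'].astype(str) + '_' + df['years_spend'].astype(str)
key = pd.Series(pd.Categorical(key, categories=all_labels), index=df.index)

# get_dummies emits one column per category, whether or not it is observed.
full = pd.get_dummies(key, prefix='d', dtype=int)
print(full.shape)  # (6, 16): 4 industries x 4 years_spend values
```

Columns for unobserved pairs, such as d_11_4 here, are all zero, so they would typically be dropped before fitting a regression.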

That should give you what you want :)
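For the statsmodels side of the question, the patsy formula language has a `:` operator that crosses two categorical terms. The sketch below is not part of the original answer; it builds only the design matrix with patsy's dmatrix so the columns can be inspected, and it assumes patsy is installed (it is a statsmodels dependency):

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({'Industry': [4, 3, 11, 4, 1, 1], 'years_spend': [4, 5, 8, 4, 4, 1]})

# "0 +" drops the intercept; C(...) marks a column as categorical; ":" crosses the two.
X = dmatrix("0 + C(Industry):C(years_spend)", data=df, return_type="dataframe")

print(X.shape)        # one row per observation, one column per Industry x years_spend pair
print(list(X.columns)[:3])
```

The same right-hand side should be usable in an smf.ols formula such as the one in the question, given a data frame that actually contains the outcome column. When the formula keeps its intercept, patsy adjusts the dummy coding to avoid redundancy with it, so the number of columns will differ from the no-intercept matrix shown here.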


09-15 03:27