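I have a data frame whose columns I intend to treat as categorical variables.
The first column is country, with values such as SGP, AUS, MYS, etc. The second column is time of day, with values in 24-hour format, e.g. 00, 11, 14, 15, etc. event is a binary variable with 1/0 flags.
I understand that to treat them as categorical I need to use patsy before running the logistic regression. I build the design matrices with dmatrices.
Use case: consider only the interaction of country and time_day (and, later, other attributes such as "OS").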
f= 'event_int ~ time_day:country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[AUS]', u'time_day[T.03]:country[AUS]', u'time_day[T.04]:country[AUS]', u'time_day[T.05]:country[AUS]', u'time_day[T.06]:country[AUS]', u'time_day[T.07]:country[AUS]', u'time_day[T.08]:country[AUS]', u'time_day[T.09]:country[AUS]', u'time_day[T.10]:country[AUS]', u'time_day[T.11]:country[AUS]', u'time_day[T.12]:country[AUS]', u'time_day[T.NA]:country[AUS]', u'time_day[T.02]:country[HKG]', u'time_day[T.03]:country[HKG]', u'time_day[T.04]:country[HKG]', u'time_day[T.05]:country[HKG]', u'time_day[T.06]:country[HKG]', u'time_day[T.07]:country[HKG]', u'time_day[T.08]:country[HKG]', u'time_day[T.09]:country[HKG]', u'time_day[T.10]:country[HKG]', u'time_day[T.11]:country[HKG]', u'time_day[T.12]:country[HKG]', u'time_day[T.NA]:country[HKG]', u'time_day[T.02]:country[IDN]', u'time_day[T.03]:country[IDN]', u'time_day[T.04]:country[IDN]', u'time_day[T.05]:country[IDN]', u'time_day[T.06]:country[IDN]', u'time_day[T.07]:country[IDN]', u'time_day[T.08]:country[IDN]', u'time_day[T.09]:country[IDN]', u'time_day[T.10]:country[IDN]', u'time_day[T.11]:country[IDN]', u'time_day[T.12]:country[IDN]', u'time_day[T.NA]:country[IDN]', u'time_day[T.02]:country[IND]', u'time_day[T.03]:country[IND]', u'time_day[T.04]:country[IND]', u'time_day[T.05]:country[IND]', u'time_day[T.06]:country[IND]', u'time_day[T.07]:country[IND]', u'time_day[T.08]:country[IND]', u'time_day[T.09]:country[IND]', u'time_day[T.10]:country[IND]', u'time_day[T.11]:country[IND]', u'time_day[T.12]:country[IND]', u'time_day[T.NA]:country[IND]', u'time_day[T.02]:country[MYS]', u'time_day[T.03]:country[MYS]', u'time_day[T.04]:country[MYS]', u'time_day[T.05]:country[MYS]', u'time_day[T.06]:country[MYS]', u'time_day[T.07]:country[MYS]', u'time_day[T.08]:country[MYS]', u'time_day[T.09]:country[MYS]', u'time_day[T.10]:country[MYS]', 
u'time_day[T.11]:country[MYS]', u'time_day[T.12]:country[MYS]', u'time_day[T.NA]:country[MYS]', u'time_day[T.02]:country[NZL]', u'time_day[T.03]:country[NZL]', u'time_day[T.04]:country[NZL]', u'time_day[T.05]:country[NZL]', u'time_day[T.06]:country[NZL]', u'time_day[T.07]:country[NZL]', u'time_day[T.08]:country[NZL]', u'time_day[T.09]:country[NZL]', u'time_day[T.10]:country[NZL]', u'time_day[T.11]:country[NZL]', u'time_day[T.12]:country[NZL]', u'time_day[T.NA]:country[NZL]', u'time_day[T.02]:country[PHL]', u'time_day[T.03]:country[PHL]', u'time_day[T.04]:country[PHL]', u'time_day[T.05]:country[PHL]', u'time_day[T.06]:country[PHL]', u'time_day[T.07]:country[PHL]', u'time_day[T.08]:country[PHL]', u'time_day[T.09]:country[PHL]', u'time_day[T.10]:country[PHL]', u'time_day[T.11]:country[PHL]', u'time_day[T.12]:country[PHL]', u'time_day[T.NA]:country[PHL]', u'time_day[T.02]:country[SGP]', u'time_day[T.03]:country[SGP]', u'time_day[T.04]:country[SGP]', u'time_day[T.05]:country[SGP]', u'time_day[T.06]:country[SGP]', u'time_day[T.07]:country[SGP]', u'time_day[T.08]:country[SGP]', u'time_day[T.09]:country[SGP]', ...], dtype='object')
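I expected to see only column names involving both country and time_day, but that is not the case. I can take a subset manually by specifying
X = X.ix[:,range(7,len(X.columns))]
but that would mean hard-coding the offset for each dataset. My understanding was that A:B differs from A*B in that it does not include the main effects A + B.
Interestingly, I do not see A on its own, i.e. the categorical values of time_day alone, in the output above.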
Also, explicitly excluding country on its own from the X data frame does not work either: when I do the following, I get the same output as above.
f='event_int ~ time_day:country-country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[AUS]', u'time_day[T.03]:country[AUS]', u'time_day[T.04]:country[AUS]', u'time_day[T.05]:country[AUS]', u'time_day[T.06]:country[AUS]', u'time_day[T.07]:country[AUS]', u'time_day[T.08]:country[AUS]', u'time_day[T.09]:country[AUS]', u'time_day[T.10]:country[AUS]', u'time_day[T.11]:country[AUS]', u'time_day[T.12]:country[AUS]', u'time_day[T.NA]:country[AUS]', u'time_day[T.02]:country[HKG]', u'time_day[T.03]:country[HKG]', u'time_day[T.04]:country[HKG]', u'time_day[T.05]:country[HKG]', u'time_day[T.06]:country[HKG]', u'time_day[T.07]:country[HKG]', u'time_day[T.08]:country[HKG]', u'time_day[T.09]:country[HKG]', u'time_day[T.10]:country[HKG]', u'time_day[T.11]:country[HKG]', u'time_day[T.12]:country[HKG]', u'time_day[T.NA]:country[HKG]', u'time_day[T.02]:country[IDN]', u'time_day[T.03]:country[IDN]', u'time_day[T.04]:country[IDN]', u'time_day[T.05]:country[IDN]', u'time_day[T.06]:country[IDN]', u'time_day[T.07]:country[IDN]', u'time_day[T.08]:country[IDN]', u'time_day[T.09]:country[IDN]', u'time_day[T.10]:country[IDN]', u'time_day[T.11]:country[IDN]', u'time_day[T.12]:country[IDN]', u'time_day[T.NA]:country[IDN]', u'time_day[T.02]:country[IND]', u'time_day[T.03]:country[IND]', u'time_day[T.04]:country[IND]', u'time_day[T.05]:country[IND]', u'time_day[T.06]:country[IND]', u'time_day[T.07]:country[IND]', u'time_day[T.08]:country[IND]', u'time_day[T.09]:country[IND]', u'time_day[T.10]:country[IND]', u'time_day[T.11]:country[IND]', u'time_day[T.12]:country[IND]', u'time_day[T.NA]:country[IND]', u'time_day[T.02]:country[MYS]', u'time_day[T.03]:country[MYS]', u'time_day[T.04]:country[MYS]', u'time_day[T.05]:country[MYS]', u'time_day[T.06]:country[MYS]', u'time_day[T.07]:country[MYS]', u'time_day[T.08]:country[MYS]', u'time_day[T.09]:country[MYS]', u'time_day[T.10]:country[MYS]', 
u'time_day[T.11]:country[MYS]', u'time_day[T.12]:country[MYS]', u'time_day[T.NA]:country[MYS]', u'time_day[T.02]:country[NZL]', u'time_day[T.03]:country[NZL]', u'time_day[T.04]:country[NZL]', u'time_day[T.05]:country[NZL]', u'time_day[T.06]:country[NZL]', u'time_day[T.07]:country[NZL]', u'time_day[T.08]:country[NZL]', u'time_day[T.09]:country[NZL]', u'time_day[T.10]:country[NZL]', u'time_day[T.11]:country[NZL]', u'time_day[T.12]:country[NZL]', u'time_day[T.NA]:country[NZL]', u'time_day[T.02]:country[PHL]', u'time_day[T.03]:country[PHL]', u'time_day[T.04]:country[PHL]', u'time_day[T.05]:country[PHL]', u'time_day[T.06]:country[PHL]', u'time_day[T.07]:country[PHL]', u'time_day[T.08]:country[PHL]', u'time_day[T.09]:country[PHL]', u'time_day[T.10]:country[PHL]', u'time_day[T.11]:country[PHL]', u'time_day[T.12]:country[PHL]', u'time_day[T.NA]:country[PHL]', u'time_day[T.02]:country[SGP]', u'time_day[T.03]:country[SGP]', u'time_day[T.04]:country[SGP]', u'time_day[T.05]:country[SGP]', u'time_day[T.06]:country[SGP]', u'time_day[T.07]:country[SGP]', u'time_day[T.08]:country[SGP]', u'time_day[T.09]:country[SGP]', ...], dtype='object')
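This makes me feel that ":" is a reduced form of "*", in that it is only missing one of the categorical variables. Is patsy perhaps unable to tell that both are categorical variables?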
f='event_int ~ time_day*country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'time_day[T.02]', u'time_day[T.03]', u'time_day[T.04]', u'time_day[T.05]', u'time_day[T.06]', u'time_day[T.07]', u'time_day[T.08]', u'time_day[T.09]', u'time_day[T.10]', u'time_day[T.11]', u'time_day[T.12]', u'time_day[T.NA]', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[T.HKG]', u'time_day[T.03]:country[T.HKG]', u'time_day[T.04]:country[T.HKG]', u'time_day[T.05]:country[T.HKG]', u'time_day[T.06]:country[T.HKG]', u'time_day[T.07]:country[T.HKG]', u'time_day[T.08]:country[T.HKG]', u'time_day[T.09]:country[T.HKG]', u'time_day[T.10]:country[T.HKG]', u'time_day[T.11]:country[T.HKG]', u'time_day[T.12]:country[T.HKG]', u'time_day[T.NA]:country[T.HKG]', u'time_day[T.02]:country[T.IDN]', u'time_day[T.03]:country[T.IDN]', u'time_day[T.04]:country[T.IDN]', u'time_day[T.05]:country[T.IDN]', u'time_day[T.06]:country[T.IDN]', u'time_day[T.07]:country[T.IDN]', u'time_day[T.08]:country[T.IDN]', u'time_day[T.09]:country[T.IDN]', u'time_day[T.10]:country[T.IDN]', u'time_day[T.11]:country[T.IDN]', u'time_day[T.12]:country[T.IDN]', u'time_day[T.NA]:country[T.IDN]', u'time_day[T.02]:country[T.IND]', u'time_day[T.03]:country[T.IND]', u'time_day[T.04]:country[T.IND]', u'time_day[T.05]:country[T.IND]', u'time_day[T.06]:country[T.IND]', u'time_day[T.07]:country[T.IND]', u'time_day[T.08]:country[T.IND]', u'time_day[T.09]:country[T.IND]', u'time_day[T.10]:country[T.IND]', u'time_day[T.11]:country[T.IND]', u'time_day[T.12]:country[T.IND]', u'time_day[T.NA]:country[T.IND]', u'time_day[T.02]:country[T.MYS]', u'time_day[T.03]:country[T.MYS]', u'time_day[T.04]:country[T.MYS]', u'time_day[T.05]:country[T.MYS]', u'time_day[T.06]:country[T.MYS]', u'time_day[T.07]:country[T.MYS]', u'time_day[T.08]:country[T.MYS]', u'time_day[T.09]:country[T.MYS]', u'time_day[T.10]:country[T.MYS]', u'time_day[T.11]:country[T.MYS]', u'time_day[T.12]:country[T.MYS]', 
u'time_day[T.NA]:country[T.MYS]', u'time_day[T.02]:country[T.NZL]', u'time_day[T.03]:country[T.NZL]', u'time_day[T.04]:country[T.NZL]', u'time_day[T.05]:country[T.NZL]', u'time_day[T.06]:country[T.NZL]', u'time_day[T.07]:country[T.NZL]', u'time_day[T.08]:country[T.NZL]', u'time_day[T.09]:country[T.NZL]', u'time_day[T.10]:country[T.NZL]', u'time_day[T.11]:country[T.NZL]', u'time_day[T.12]:country[T.NZL]', u'time_day[T.NA]:country[T.NZL]', u'time_day[T.02]:country[T.PHL]', u'time_day[T.03]:country[T.PHL]', u'time_day[T.04]:country[T.PHL]', u'time_day[T.05]:country[T.PHL]', u'time_day[T.06]:country[T.PHL]', u'time_day[T.07]:country[T.PHL]', u'time_day[T.08]:country[T.PHL]', u'time_day[T.09]:country[T.PHL]', u'time_day[T.10]:country[T.PHL]', u'time_day[T.11]:country[T.PHL]', u'time_day[T.12]:country[T.PHL]', u'time_day[T.NA]:country[T.PHL]', u'time_day[T.02]:country[T.SGP]', u'time_day[T.03]:country[T.SGP]', u'time_day[T.04]:country[T.SGP]', u'time_day[T.05]:country[T.SGP]', u'time_day[T.06]:country[T.SGP]', u'time_day[T.07]:country[T.SGP]', u'time_day[T.08]:country[T.SGP]', u'time_day[T.09]:country[T.SGP]', ...], dtype='object')
If I explicitly declare them as categorical variables, I get:
f='event_int ~ C(time_day):C(country)'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'C(country)[T.HKG]', u'C(country)[T.IDN]', u'C(country)[T.IND]', u'C(country)[T.MYS]', u'C(country)[T.NZL]', u'C(country)[T.PHL]', u'C(country)[T.SGP]', u'C(time_day)[T.02]:C(country)[AUS]', u'C(time_day)[T.03]:C(country)[AUS]', u'C(time_day)[T.04]:C(country)[AUS]', u'C(time_day)[T.05]:C(country)[AUS]', u'C(time_day)[T.06]:C(country)[AUS]', u'C(time_day)[T.07]:C(country)[AUS]', u'C(time_day)[T.08]:C(country)[AUS]', u'C(time_day)[T.09]:C(country)[AUS]', u'C(time_day)[T.10]:C(country)[AUS]', u'C(time_day)[T.11]:C(country)[AUS]', u'C(time_day)[T.12]:C(country)[AUS]', u'C(time_day)[T.NA]:C(country)[AUS]', u'C(time_day)[T.02]:C(country)[HKG]', u'C(time_day)[T.03]:C(country)[HKG]', u'C(time_day)[T.04]:C(country)[HKG]', u'C(time_day)[T.05]:C(country)[HKG]', u'C(time_day)[T.06]:C(country)[HKG]', u'C(time_day)[T.07]:C(country)[HKG]', u'C(time_day)[T.08]:C(country)[HKG]', u'C(time_day)[T.09]:C(country)[HKG]', u'C(time_day)[T.10]:C(country)[HKG]', u'C(time_day)[T.11]:C(country)[HKG]', u'C(time_day)[T.12]:C(country)[HKG]', u'C(time_day)[T.NA]:C(country)[HKG]', u'C(time_day)[T.02]:C(country)[IDN]', u'C(time_day)[T.03]:C(country)[IDN]', u'C(time_day)[T.04]:C(country)[IDN]', u'C(time_day)[T.05]:C(country)[IDN]', u'C(time_day)[T.06]:C(country)[IDN]', u'C(time_day)[T.07]:C(country)[IDN]', u'C(time_day)[T.08]:C(country)[IDN]', u'C(time_day)[T.09]:C(country)[IDN]', u'C(time_day)[T.10]:C(country)[IDN]', u'C(time_day)[T.11]:C(country)[IDN]', u'C(time_day)[T.12]:C(country)[IDN]', u'C(time_day)[T.NA]:C(country)[IDN]', u'C(time_day)[T.02]:C(country)[IND]', u'C(time_day)[T.03]:C(country)[IND]', u'C(time_day)[T.04]:C(country)[IND]', u'C(time_day)[T.05]:C(country)[IND]', u'C(time_day)[T.06]:C(country)[IND]', u'C(time_day)[T.07]:C(country)[IND]', u'C(time_day)[T.08]:C(country)[IND]', u'C(time_day)[T.09]:C(country)[IND]', u'C(time_day)[T.10]:C(country)[IND]', u'C(time_day)[T.11]:C(country)[IND]', u'C(time_day)[T.12]:C(country)[IND]', u'C(time_day)[T.NA]:C(country)[IND]', 
u'C(time_day)[T.02]:C(country)[MYS]', u'C(time_day)[T.03]:C(country)[MYS]', u'C(time_day)[T.04]:C(country)[MYS]', u'C(time_day)[T.05]:C(country)[MYS]', u'C(time_day)[T.06]:C(country)[MYS]', u'C(time_day)[T.07]:C(country)[MYS]', u'C(time_day)[T.08]:C(country)[MYS]', u'C(time_day)[T.09]:C(country)[MYS]', u'C(time_day)[T.10]:C(country)[MYS]', u'C(time_day)[T.11]:C(country)[MYS]', u'C(time_day)[T.12]:C(country)[MYS]', u'C(time_day)[T.NA]:C(country)[MYS]', u'C(time_day)[T.02]:C(country)[NZL]', u'C(time_day)[T.03]:C(country)[NZL]', u'C(time_day)[T.04]:C(country)[NZL]', u'C(time_day)[T.05]:C(country)[NZL]', u'C(time_day)[T.06]:C(country)[NZL]', u'C(time_day)[T.07]:C(country)[NZL]', u'C(time_day)[T.08]:C(country)[NZL]', u'C(time_day)[T.09]:C(country)[NZL]', u'C(time_day)[T.10]:C(country)[NZL]', u'C(time_day)[T.11]:C(country)[NZL]', u'C(time_day)[T.12]:C(country)[NZL]', u'C(time_day)[T.NA]:C(country)[NZL]', u'C(time_day)[T.02]:C(country)[PHL]', u'C(time_day)[T.03]:C(country)[PHL]', u'C(time_day)[T.04]:C(country)[PHL]', u'C(time_day)[T.05]:C(country)[PHL]', u'C(time_day)[T.06]:C(country)[PHL]', u'C(time_day)[T.07]:C(country)[PHL]', u'C(time_day)[T.08]:C(country)[PHL]', u'C(time_day)[T.09]:C(country)[PHL]', u'C(time_day)[T.10]:C(country)[PHL]', u'C(time_day)[T.11]:C(country)[PHL]', u'C(time_day)[T.12]:C(country)[PHL]', u'C(time_day)[T.NA]:C(country)[PHL]', u'C(time_day)[T.02]:C(country)[SGP]', u'C(time_day)[T.03]:C(country)[SGP]', u'C(time_day)[T.04]:C(country)[SGP]', u'C(time_day)[T.05]:C(country)[SGP]', u'C(time_day)[T.06]:C(country)[SGP]', u'C(time_day)[T.07]:C(country)[SGP]', u'C(time_day)[T.08]:C(country)[SGP]', u'C(time_day)[T.09]:C(country)[SGP]', ...], dtype='object')
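Questions:
1. How do I include only the interaction and no other variables?
2. Why does excluding country with `-country` in the second case have no effect? Related: Statsmodels formula API (patsy): How to exclude a subset of interaction components?
Edited, based on the answer below from @Nathaniel J. Smith, to show how to troubleshoot this yourself: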
f2='event_int ~ country:time_day'
y2,X2 = patsy.dmatrices(f2, df, return_type='dataframe')
X2.design_info.term_names
['Intercept', 'country:time_day']
f1='event_int ~ country:time_day-1'
y1,X1 = patsy.dmatrices(f1, df, return_type='dataframe')
X1.design_info.term_names
['country:time_day']
Best answer
Short answer: try `event_int ~ -1 + time_day:country`
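As a minimal sketch of the short answer, assuming a tiny made-up dataset with two countries and two hours (the dict `toy` and its values are invented for illustration and are not the asker's data), dropping the intercept should leave only interaction columns:

```python
import patsy

# Hypothetical toy data: every country/hour combination appears twice.
toy = {
    "event_int": [1, 0, 0, 1, 1, 1, 0, 0],
    "time_day": ["00", "12", "00", "12", "00", "12", "00", "12"],
    "country": ["AUS", "AUS", "SGP", "SGP", "AUS", "AUS", "SGP", "SGP"],
}

# "-1" removes the intercept that patsy would otherwise add by default.
y, X = patsy.dmatrices("event_int ~ -1 + time_day:country", toy)

# One column per (time_day, country) combination, and nothing else.
interaction_only = X.design_info.column_names
for name in interaction_only:
    print(name)
```

The same formula string can be passed with `return_type='dataframe'` and a pandas data frame, as in the question.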
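Long answer:
The first thing to understand is that patsy decides how to build your design matrix in two distinct stages. First, it determines which terms to include. Terms are things like `a` or `a:b`. (The `a` and `b` in `a:b` are called factors; the term `a` contains a single factor, which is also spelled `a`.) Figuring out which terms exist involves expanding and simplifying the formula you give until it is an expression that uses only `+` and `:`. `a*b` expands to `a + b + a:b`, and so on. Subtraction is an operation that happens at this stage: `a + b - a` simplifies to plain `b`. So `a*b - a` expands to `a + b + a:b - a`, which simplifies to `b + a:b`, but `a:b - a` is the same as `a:b`, because there is no `a` term there to subtract, so the `- a` is ignored. This is why writing `time_day:country - country` is the same as writing `time_day:country`.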
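The term-expansion stage described above can be inspected without any data. The following is a small sketch using `patsy.ModelDesc.from_formula`; the helper name `rhs_terms` is made up for this illustration:

```python
from patsy import ModelDesc


def rhs_terms(formula):
    """Return the names of the right-hand-side terms patsy extracts from a formula."""
    return [term.name() for term in ModelDesc.from_formula(formula).rhs_termlist]


full = rhs_terms("y ~ a*b")         # a*b expands to a + b + a:b
minus_a = rhs_terms("y ~ a*b - a")  # the term a exists, so it is removed
noop = rhs_terms("y ~ a:b - a")     # there is no term a to remove
plain = rhs_terms("y ~ a:b")

print(full)
print(minus_a)
print(noop == plain)  # subtracting a missing term changes nothing
```

Note that the default intercept shows up as a term here as well, which matters for the second stage.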
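Then, in the second stage, once patsy has worked out which terms to include, it has to decide how to code those terms. This is the stage where you are running into trouble.
The general rule is that patsy goes through each term that contains categorical factors and figures out a set of columns that makes the model flexible enough to include the specified interaction, without being redundant with any terms that have already been added.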
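In this case, your trouble is caused by the intercept term that patsy adds by default: `event_int ~ time_day:country` is interpreted like `event_int ~ 1 + time_day:country`. This tells patsy that you want one column that stands alone to represent the intercept, and then a second set of columns to cover the interaction, but without overlapping with the intercept. The obvious way of dummy-coding `time_day` and `country` is redundant (collinear) with the intercept, so patsy finds a more complicated scheme that does not have this property. If you remove the intercept, that tells patsy it can go ahead and use the simple scheme, and so it does.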
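To make the effect of the intercept concrete, here is a hedged sketch on an invented two-by-two dataset (the dict `toy` is an assumption, not the asker's data). It builds both design matrices and checks with NumPy that they have the same number of columns and span the same column space, so only the coding differs:

```python
import numpy as np
import patsy

# Hypothetical toy data covering every country/hour combination.
toy = {
    "time_day": ["00", "12", "00", "12"],
    "country": ["AUS", "AUS", "SGP", "SGP"],
}

with_int = patsy.dmatrix("time_day:country", toy)      # intercept added by default
no_int = patsy.dmatrix("-1 + time_day:country", toy)   # intercept removed

print(with_int.design_info.column_names)
print(no_int.design_info.column_names)

a = np.asarray(with_int)
b = np.asarray(no_int)
rank_with = np.linalg.matrix_rank(a)
rank_without = np.linalg.matrix_rank(b)
# If stacking the two matrices does not raise the rank, they span the same space.
rank_joint = np.linalg.matrix_rank(np.hstack([a, b]))
print(rank_with, rank_without, rank_joint)
```

In other words, the two formulas describe the same model; the version with an intercept just spends one column on the intercept and a block of reduced-rank `country[T.…]` columns instead of a full set of dummy columns.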
The details of how patsy picks coding schemes are described here: http://patsy.readthedocs.org/en/latest/formulas.html#redundancy-and-categorical-factors
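The first part of that manual section is perhaps too heavy on the math, but if you scroll down there are some hopefully helpful diagrams that may make it clearer what is going on (and give some context to the math). If you search for `y ~ 1 + a:b`, you will see the diagram that shows specifically what happens when you type `event_int ~ time_day:country`. And if you search for `y ~ 1 + a + b + a:b`, you will see a picture of what happens in the `event_int ~ time_day*country` case.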
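In addition to looking at `X.columns`, it can also be useful to look at `X.design_info.term_names` and `X.design_info.term_slices`, which show which "terms" patsy thinks exist and which columns they correspond to. (`a` and `a:b` are terms; each of them can generate multiple columns.) The thick outlines in the `y ~ 1 + a:b` diagram are intended to indicate that, in this case, the single term `a:b` generates two sets of columns: one set that codes for `b` using treatment coding, and a second set that codes for the pairwise products of dummy-coded `b` and treatment-coded `a`.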
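As a sketch of that inspection, again on an invented two-by-two dataset (the dict `toy` and the variable `by_term` are assumptions made for this example), `term_slices` can be used to group column names by the term that produced them:

```python
import patsy

# Hypothetical toy data covering every country/hour combination.
toy = {
    "time_day": ["00", "12", "00", "12"],
    "country": ["AUS", "AUS", "SGP", "SGP"],
}

X = patsy.dmatrix("time_day:country", toy)
info = X.design_info

print(info.term_names)

# term_slices maps each Term object to the slice of columns it generated.
by_term = {
    term.name(): info.column_names[cols]
    for term, cols in info.term_slices.items()
}
for term_name, columns in by_term.items():
    print(term_name, columns)

# Both the country[T.…] columns and the product columns belong to the one interaction term.
interaction_cols = by_term["time_day:country"]
```

Looking columns up by term in this way avoids hard-coding column positions for each dataset.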
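Finally, two hints for interpreting your output: (1) You can be sure that patsy is in fact treating your factors as categorical, because the column names look like `varname[something involving the var's value]`. Numeric factors look like just `varname` or (in the rare case where you pass a 2-d matrix as a predictor) `varname[column index]`. (2) Notice the difference between `country[T.HKG]` and `country[HKG]`: the former indicates that patsy is using the reduced-rank "treatment" coding to avoid redundancy, while the latter indicates simple dummy coding. Of course, it turns out that in terms of the individual column these are the same, but conceptually the distinction is very important: the `T.` mode means that one of the columns has been dropped (notice that there is no `country[T.AUS]`), so subsetting the columns the way you were thinking will not work well!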
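The naming conventions in these two hints can be checked on a small invented example (the dict `toy` and the numeric column `hour` are assumptions made for this sketch):

```python
import patsy

# Hypothetical toy data: a numeric hour and a string-valued country.
toy = {
    "hour": [0, 12, 0, 12],
    "country": ["AUS", "AUS", "SGP", "SGP"],
}


def column_names(formula):
    """Build a design matrix for the toy data and return its column names."""
    return patsy.dmatrix(formula, toy).design_info.column_names


numeric = column_names("hour")          # numeric factor: plain variable name
categorical = column_names("C(hour)")   # C() forces categorical treatment
reduced = column_names("country")       # with intercept: treatment coding, one level dropped
full = column_names("0 + country")      # without intercept: one dummy column per level

for names in (numeric, categorical, reduced, full):
    print(names)
```

The `T.` prefix appears only in the reduced-rank case, where the first level (here AUS) has no column of its own.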
Hope this helps!
Source: "python 2.7 - patsy with patsy.dmatrices: interaction ':' gives duplicate columns like '+' or '*'", Stack Overflow: https://stackoverflow.com/questions/23672466/