Question
pandas.factorize encodes input values as an enumerated type or categorical variable.
But how can I easily and efficiently convert many columns of a data frame? And what about the reverse mapping step?
Example: This data frame contains columns with string values such as "type 2", which I would like to convert to numerical values and possibly translate back later.
Recommended answer
You can use apply if you need to factorize each column separately:
import pandas as pd

df = pd.DataFrame({'A':['type1','type2','type2'],
                   'B':['type1','type2','type3'],
                   'C':['type1','type3','type3']})
print (df)
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3
# pd.factorize returns (codes, uniques); [0] keeps only the codes
print (df.apply(lambda x: pd.factorize(x)[0]))
   A  B  C
0  0  0  0
1  1  1  1
2  1  2  1
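To keep a reverse mapping in the per-column case, one option is to hold on to the uniques that pd.factorize returns alongside the codes. A minimal sketch, assuming the same df as above and no missing values (factorize codes missing values as -1, which this sketch does not handle); the names codes, uniques, encoded and decoded are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Factorize each column separately and remember its uniques,
# so the integer codes can be translated back later.
codes = {}
uniques = {}
for col in df.columns:
    codes[col], uniques[col] = pd.factorize(df[col])

encoded = pd.DataFrame(codes, index=df.index)

# Reverse step: index each column's uniques with that column's codes.
decoded = pd.DataFrame(
    {col: uniques[col].to_numpy()[encoded[col].to_numpy()]
     for col in encoded.columns},
    index=encoded.index,
)
print(encoded)
print(decoded)
```

Because every column has its own uniques, the same code can stand for different strings in different columns (code 1 is 'type2' in B but 'type3' in C).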
If you need the same string value to get the same number in every column:
print (df.stack().rank(method='dense').unstack())
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0
If you need to apply the function to only some columns, use a subset:
df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()
print (df)
       A    B    C
0  type1  1.0  1.0
1  type2  2.0  3.0
2  type2  3.0  3.0
Solution with factorize:
stacked = df[['B','C']].stack()
df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
print (df)
       A  B  C
0  type1  0  0
1  type2  1  2
2  type2  2  2
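The same idea gives a reverse step for the shared mapping: Series.factorize also returns the uniques, and indexing them with the codes restores the original values. A sketch under the assumption that the frame holds the original strings and has no missing values; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# One shared mapping: stack all columns into a single Series first.
stacked = df.stack()
codes, uniques = stacked.factorize()
encoded = pd.Series(codes, index=stacked.index).unstack()

# Reverse step: look up each code in the shared uniques.
restacked = encoded.stack()
decoded = pd.Series(uniques.to_numpy()[restacked.to_numpy()],
                    index=restacked.index).unstack()
print(encoded)
print(decoded)
```

Here uniques plays the role of the lookup table, so no separate dict has to be built.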
Translating the values back is possible via map with a dict, where you need to remove duplicates with drop_duplicates:
# df here is the original all-string DataFrame from the top of the answer
uniq = df.stack().drop_duplicates()
# map each dense rank back to the string it was computed from
d1 = dict(zip(uniq.rank(method='dense'), uniq))
print (d1)
{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}
df1 = df.stack().rank(method='dense').unstack()
print (df1)
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0
print (df1.stack().map(d1).unstack())
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3