pandas.factorize on an entire data frame

Question

pandas.factorize encodes input values as an enumerated type or categorical variable.

But how can I easily and efficiently convert many columns of a data frame at once? And how do I reverse the mapping afterwards?

Example: this data frame contains columns with string values such as "type 2", which I would like to convert to numerical values and possibly translate back later.

Recommended answer

If you need to factorize each column separately, you can use apply. Each column then gets its own numbering, so the same string may receive different codes in different columns:

import pandas as pd

df = pd.DataFrame({'A':['type1','type2','type2'],
                   'B':['type1','type2','type3'],
                   'C':['type1','type3','type3']})

print (df)
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3

print (df.apply(lambda x: pd.factorize(x)[0]))
   A  B  C
0  0  0  0
1  1  1  1
2  1  2  1
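
The apply call above keeps only the codes, so the per-column mapping is lost. A minimal sketch of a reversible variant follows. The names `factorized`, `codes`, `uniques` and `restored` are illustrative and not part of the original answer, and the sketch assumes there are no missing values (factorize marks those with -1, which would not index back correctly).

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Factorize each column once and keep both parts of the result:
# the integer codes and the unique values they point to.
factorized = {col: pd.factorize(df[col]) for col in df.columns}
codes = pd.DataFrame({col: c for col, (c, _) in factorized.items()},
                     index=df.index)
uniques = {col: u for col, (_, u) in factorized.items()}

# Reverse mapping: use each code as a position into that column's uniques.
restored = pd.DataFrame(
    {col: uniques[col].to_numpy()[codes[col].to_numpy()]
     for col in codes.columns},
    index=codes.index)

print(codes)
print(restored)
```

Because every column has its own `uniques`, the codes are only meaningful together with the column they came from.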

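If the same string value must map to the same number in every column, stack the frame into one Series first. Note that rank returns floats starting at 1.0, numbered in sorted order of the values: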

print (df.stack().rank(method='dense').unstack())
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

If you need to apply this to only some columns, use a subset:

df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()
print (df)
       A    B    C
0  type1  1.0  1.0
1  type2  2.0  3.0
2  type2  3.0  3.0

The same idea with factorize, which returns integer codes starting at 0, numbered in order of first appearance:

stacked = df[['B','C']].stack()
df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
print (df)
       A  B  C
0  type1  0  0
1  type2  1  2
2  type2  2  2
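
For the reverse step with shared codes, the uniques returned by factorize can serve as the lookup table. The sketch below is an alternative to stack/unstack rather than the answer's own method: it flattens the values with `to_numpy().ravel()` and reshapes the codes afterwards. It starts again from the original all-string frame, the names `flat`, `encoded` and `decoded` are illustrative, and it assumes no missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Flatten all cells row by row so one factorize call sees every value
# and assigns a single shared code per distinct string.
flat = df.to_numpy().ravel()
codes, uniques = pd.factorize(flat)
uniques = np.asarray(uniques)

# Put the codes back into the original shape and labels.
encoded = pd.DataFrame(codes.reshape(df.shape),
                       index=df.index, columns=df.columns)

# Reverse mapping: each code is a position in uniques.
decoded = pd.DataFrame(uniques[encoded.to_numpy()],
                       index=encoded.index, columns=encoded.columns)

print(encoded)
print(decoded)
```

Keeping `uniques` alongside the encoded frame is what makes the round trip possible; without it the codes cannot be translated back.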

Translating the numbers back is possible with map and a dict. Build the dict from the distinct values, which you get by removing duplicates with drop_duplicates:

# df here is the original all-string frame from the first example
uniques = df.stack().drop_duplicates()
vals = uniques.values
b = uniques.rank(method='dense').tolist()

d1 = dict(zip(b, vals))
print (d1)
{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}

df1 = df.stack().rank(method='dense').unstack()
print (df1)
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

print (df1.stack().map(d1).unstack())
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3
