Label encoding multiple columns that share the same categories

Question

Consider the following DataFrame:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"], ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
# apply() works column by column by default, so each column is encoded independently
df = df.apply(LabelEncoder().fit_transform)
print(df)

Current output:

   a  b  c
0  0  1  0
1  1  0  0

My goal is to get output like the following by passing in the columns that should share categorical values:

   a  b  c
0  0  1  2
1  1  0  2

Recommended answer

Pass axis=1 to call LabelEncoder().fit_transform once for each row. (By default, df.apply(func) calls func once for each column.) Note that fit_transform refits the encoder on every row, so the labels are consistent across rows here only because each row contains the same set of categories.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

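encoder = LabelEncoder()

# axis=1 calls the function once per row. Wrapping the result in a Series keeps
# the column labels; recent pandas versions otherwise return a Series of arrays.
df = df.apply(lambda row: pd.Series(encoder.fit_transform(row), index=row.index), axis=1)
print(df)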

This yields:

   a  b  c
0  1  2  0
1  2  1  0
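
The row-wise approach relies on every row holding the same set of values. The following sketch, using hypothetical data that is not part of the question, illustrates what happens when that assumption does not hold: the same country receives different codes in different rows.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: the two rows do not contain the same set of countries
df = pd.DataFrame([["France", "Italy"], ["Italy", "Spain"]], columns=["a", "b"])

encoder = LabelEncoder()

# The encoder is refitted on each row, so codes are only meaningful within a row
row_codes = df.apply(
    lambda row: pd.Series(encoder.fit_transform(row), index=row.index), axis=1
)
print(row_codes)
```

Here "Italy" is encoded as 1 in the first row and as 0 in the second, so row-wise encoding is only safe when all rows share one set of categories.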


Alternatively, you could convert the data to the category dtype and use the category codes as labels:

import pandas as pd

df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

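# stack() flattens all columns into one Series, so they share a single set of categories
stacked = df.stack().astype('category')
# cat.codes holds the integer code of each value; unstack() restores the original shape
result = stacked.cat.codes.unstack()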
print(result)

This also yields:

   a  b  c
0  1  2  0
1  2  1  0
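
The codes above follow the alphabetical order of the categories (Belgium=0, France=1, Italy=2), which differs from the numbering shown in the question. If the exact numbering matters, one possible variant (not part of the original answer) is pd.factorize, which numbers values in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

# ravel() flattens the values row by row; factorize assigns codes
# in order of first appearance (France=0, Italy=1, Belgium=2)
codes, uniques = pd.factorize(df.to_numpy().ravel())

# Reshape the flat codes back into the original layout
result = pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)
print(result)
```

For this data the result matches the output requested in the question, and `uniques` can be used to map the codes back to country names.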

The category-based approach should be significantly faster, since it does not call encoder.fit_transform once for each row (which can perform terribly if you have many rows).
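
The question also asks about passing in the columns that should share categorical values. A minimal sketch of that idea, assuming a hypothetical helper named encode_shared (not from the original answer), fits a single LabelEncoder on the values of all selected columns and then transforms each column with it:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_shared(df, columns):
    """Label-encode `columns` of `df` with one encoder fitted on all their values.

    Hypothetical helper for illustration; returns the encoded copy and the encoder.
    """
    encoder = LabelEncoder()
    # Fit once on the values of every selected column so they share one mapping
    encoder.fit(df[columns].to_numpy().ravel())
    out = df.copy()
    for col in columns:
        out[col] = encoder.transform(df[col])
    return out, encoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

encoded, encoder = encode_shared(df, ["a", "b", "c"])
print(encoded)
```

Because the encoder is fitted only once, the mapping stays consistent even when rows contain different subsets of values, and encoder.inverse_transform can recover the original labels. Columns left out of the list are returned unchanged.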

