Problem Description
I'm trying to use scikit-learn's LabelEncoder
to encode a pandas DataFrame
of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder
object for each column; I'd rather just have one big LabelEncoder
object that works across all my columns of data.
Throwing the entire DataFrame
into LabelEncoder
creates the below error. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string labeled data, so need a solution that doesn't reference any columns by name.
import pandas
from sklearn import preprocessing
df = pandas.DataFrame({
'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
'New_York']
})
le = preprocessing.LabelEncoder()
le.fit(df)
Any thoughts on how to get around this problem?
You can easily do this, though:
df.apply(LabelEncoder().fit_transform)
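As a minimal runnable sketch using the question's dummy data, the one-liner looks like this (note that `apply` passes each column to `fit_transform`, which refits the single encoder instance on every column, so no column is referenced by name):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York'],
})

# fit_transform refits the encoder per column, so each column gets
# its own integer codes (assigned in sorted order of its categories).
encoded = df.apply(LabelEncoder().fit_transform)
print(encoded)
```

One caveat with this approach: the fitted state is overwritten on every column, so you cannot call `inverse_transform` afterwards, which is what the dictionary trick below addresses.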
EDIT2:
In scikit-learn 0.20, the recommended way is
OneHotEncoder().fit_transform(df)
as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.
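A quick sketch of that call, again with the question's dummy columns (the default output is a sparse matrix, so `.toarray()` is used here only to inspect it):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
})

enc = OneHotEncoder()
onehot = enc.fit_transform(df).toarray()  # densify the sparse result

# One binary column per distinct category: 3 pets + 4 owners = 7 columns.
print(onehot.shape)  # prints (6, 7)
```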
EDIT:
Since this original answer is over a year ago, and generated many upvotes (including a bounty), I should probably extend this further.
For inverse_transform and transform, you have to do a little bit of a hack.
from collections import defaultdict
d = defaultdict(LabelEncoder)
With this, you now retain all of the columns' LabelEncoders in a dictionary.
# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))
# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))
# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))
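Put together as a self-contained sketch (dummy data as in the question), the dictionary approach round-trips cleanly because each column keeps its own fitted encoder:

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
})

# defaultdict builds a fresh LabelEncoder the first time a column name is seen.
d = defaultdict(LabelEncoder)

# Encoding the variable: fit one encoder per column.
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Inverse the encoded: each column's encoder inverts its own codes.
restored = fit.apply(lambda x: d[x.name].inverse_transform(x))

# Using the dictionary to label future data with the same categories.
encoded_again = df.apply(lambda x: d[x.name].transform(x))
```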
MOAR EDIT:
Using Neuraxle's FlattenForEach
step, it's possible to do this as well to use the same LabelEncoder
on all the flattened data at once:
FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df)
For using separate LabelEncoders depending on your columns of data, or if only some of your columns need to be label-encoded and not others, then using a ColumnTransformer
is a solution that allows for more control on your column selection and your LabelEncoder instances.
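Here is a hedged sketch of that idea. Since LabelEncoder is meant for targets, this example uses OrdinalEncoder (its multi-column analogue for features) inside a ColumnTransformer; the column names are illustrative, not from any real dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat'],   # categorical: should be encoded
    'age': [1, 2, 3],                # numeric: should pass through unchanged
})

ct = ColumnTransformer(
    [('cat_cols', OrdinalEncoder(), ['pets'])],
    remainder='passthrough',  # columns not listed are left as-is
)
result = ct.fit_transform(df)
print(result)
```

`remainder='passthrough'` is what gives you the fine-grained control mentioned above: only the listed columns are encoded, and everything else flows through untouched.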