Problem Description
The main goals are as follows:
1) Apply StandardScaler to continuous variables
2) Apply LabelEncoder and OneHotEncoder to categorical variables
The continuous variables need to be scaled, but at the same time a couple of the categorical variables are also of integer type, so applying StandardScaler blindly to every numeric column would result in undesired effects.
In particular, StandardScaler would scale the integer-based categorical variables, which is also not what we want.
Since continuous and categorical variables are mixed in a single Pandas DataFrame, what's the recommended workflow to approach this kind of problem?
The best example to illustrate my point is the Kaggle Bike Sharing Demand dataset, where season and weather are integer categorical variables.
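For concreteness, here is a minimal sketch of such a mixed DataFrame (column names borrowed from the Bike Sharing dataset; the values are made up):

import pandas as pd

# Toy frame mixing continuous columns with integer-coded categorical ones
df = pd.DataFrame({
    "temp":     [9.84, 13.64, 17.22],  # continuous, should be scaled
    "humidity": [81.0, 52.0, 30.0],    # continuous, should be scaled
    "season":   [1, 2, 4],             # categorical coded as int, should not be scaled
    "weather":  [1, 3, 2],             # categorical coded as int, should not be scaled
})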
Recommended Answer
Check out the sklearn_pandas.DataFrameMapper meta-transformer. Use it as the first step in your pipeline to perform column-wise data engineering operations:
# Requires the sklearn-pandas package (pip install sklearn-pandas)
from sklearn_pandas import DataFrameMapper
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelBinarizer

mapper = DataFrameMapper(
    # Wrap each continuous column name in a list so StandardScaler receives a 2-D array
    [([continuous_col], StandardScaler()) for continuous_col in continuous_cols] +
    # LabelBinarizer one-hot encodes each integer-coded categorical column
    [(categorical_col, LabelBinarizer()) for categorical_col in categorical_cols]
)
pipeline = Pipeline(
    [("mapper", mapper),
     ("estimator", estimator)]
)
pipeline.fit(df, df["y"])
Also, you should use sklearn.preprocessing.LabelBinarizer instead of the [LabelEncoder(), OneHotEncoder()] combination.
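As a quick illustration (toy data, not from the original answer), LabelBinarizer maps a single categorical column directly to a one-hot matrix, which is exactly what the two-step LabelEncoder plus OneHotEncoder combination would otherwise produce:

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
# One output column per distinct category value (here: 1, 2 and 4)
print(lb.fit_transform([1, 2, 4, 1]))
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]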