Problem Description
The main goals are as follows:
1) Apply StandardScaler to continuous variables
2) Apply LabelEncoder and OneHotEncoder to categorical variables
The continuous variables need to be scaled, but at the same time a couple of the categorical variables are also of integer type, so applying StandardScaler blindly to every numeric column would result in undesired effects.
In particular, StandardScaler would scale the integer-based categorical variables, which is also not what we want.
Since continuous and categorical variables are mixed in a single Pandas DataFrame, what's the recommended workflow to approach this kind of problem?
The best example to illustrate my point is the Kaggle Bike Sharing Demand dataset, where season and weather are integer categorical variables.
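For concreteness, here is a minimal sketch of such a mixed DataFrame (column names borrowed from the Bike Sharing dataset; the values are made up):

import pandas as pd

# Toy frame mixing continuous columns with integer-coded categorical ones
df = pd.DataFrame({
    "temp":     [9.84, 13.64, 17.22],  # continuous, should be scaled
    "humidity": [81.0, 52.0, 30.0],    # continuous, should be scaled
    "season":   [1, 2, 4],             # categorical coded as int, should not be scaled
    "weather":  [1, 3, 2],             # categorical coded as int, should not be scaled
})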
Recommended Answer
Check out the sklearn_pandas.DataFrameMapper meta-transformer. Use it as the first step in your pipeline to perform column-wise data engineering operations:
# Requires the sklearn-pandas package (pip install sklearn-pandas)
from sklearn_pandas import DataFrameMapper
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelBinarizer

mapper = DataFrameMapper(
    # Wrap each continuous column name in a list so StandardScaler receives a 2-D array
    [([continuous_col], StandardScaler()) for continuous_col in continuous_cols] +
    # LabelBinarizer one-hot encodes each integer-coded categorical column
    [(categorical_col, LabelBinarizer()) for categorical_col in categorical_cols]
)
pipeline = Pipeline(
    [("mapper", mapper),
     ("estimator", estimator)]
)
pipeline.fit(df, df["y"])
Also, you should use sklearn.preprocessing.LabelBinarizer instead of the [LabelEncoder(), OneHotEncoder()] combination.
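As a quick illustration (toy data, not from the original answer), LabelBinarizer maps a single categorical column directly to a one-hot matrix, which is exactly what the two-step LabelEncoder plus OneHotEncoder combination would otherwise produce:

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
# One output column per distinct category value (here: 1, 2 and 4)
print(lb.fit_transform([1, 2, 4, 1]))
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]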