sklearn: combining GridSearchCV with ColumnTransformer and Pipeline

Question

I am struggling with a machine learning project in which I am trying to combine:

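  • an sklearn ColumnTransformer, to apply different transformers to my numerical and categorical features
  • a Pipeline, to apply my different transformers and estimators
  • a GridSearchCV, to search for the best parameters.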

As long as I fill in the parameters of my different transformers manually in my pipeline, the code works perfectly. But as soon as I try to pass lists of different values to compare in my grid search parameters, I get all kinds of "invalid parameter" error messages.

Here is my code:

First I split my features into numerical and categorical:

import numpy as np
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder


numerical_features=make_column_selector(dtype_include=np.number)
cat_features=make_column_selector(dtype_exclude=np.number)

Then I created 2 different preprocessing pipelines for the numerical and categorical features:

numerical_pipeline= make_pipeline(KNNImputer())
cat_pipeline=make_pipeline(SimpleImputer(strategy='most_frequent'),OneHotEncoder(handle_unknown='ignore'))

I combined both into another pipeline, set my parameters, and ran my GridSearchCV code:

model=make_pipeline(preprocessor, LinearRegression() )

params={
    'columntransformer__numerical_pipeline__knnimputer__n_neighbors':[1,2,3,4,5,6,7]
}

grid=GridSearchCV(model, param_grid=params,scoring = 'r2',cv=10)
cv = KFold(n_splits=5)
all_accuracies = cross_val_score(grid, X, y, cv=cv,scoring='r2')

I tried different ways to declare the parameters, but never found the proper one. I always get an "invalid parameter" error message.

Could you please help me understand what went wrong?

Thanks a lot for your support, and take good care!

Recommended answer

I am assuming that you might have defined preprocessor as follows:

preprocessor = Pipeline([('numerical_pipeline',numerical_pipeline),
                        ('cat_pipeline', cat_pipeline)])

Then you need to change your parameter name as follows:

pipeline__numerical_pipeline__knnimputer__n_neighbors
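
Rather than guessing the name, one option is to ask the estimator itself: every key returned by get_params() is a name that GridSearchCV will accept. The following is a minimal sketch, assuming preprocessor is defined as the named Pipeline above (the asker's actual definition is not shown in the question):

```python
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

# Assumed definition of preprocessor (it is not shown in the question).
preprocessor = Pipeline([('numerical_pipeline', numerical_pipeline),
                         ('cat_pipeline', cat_pipeline)])
model = make_pipeline(preprocessor, LinearRegression())

# Each key of get_params() is a valid param_grid name; filter for the one we want.
knn_params = sorted(name for name in model.get_params() if 'n_neighbors' in name)
print(knn_params)
```

The prefix of each name is the chain of step names, joined by double underscores. make_pipeline names a step after the lowercased class of the object it wraps, which is why the outer step here is called pipeline and not columntransformer.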

However, there are a couple of other problems with the code:

  1. You don't have to call cross_val_score after performing GridSearchCV. The output of GridSearchCV itself holds the cross-validation result for each combination of hyperparameters.

  2. KNNImputer will not work when your data contains string data. You need to apply cat_pipeline before numerical_pipeline.
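
To illustrate the first point, the per-combination scores can be read directly from the fitted search object. This is a minimal sketch on made-up numeric data; the arrays X and y and the grid values are illustrative assumptions, not the asker's data:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic data, purely illustrative: 40 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

# Knock out a few values so that KNNImputer has something to impute.
X[::7, 0] = np.nan
X[3::9, 2] = np.nan

model = make_pipeline(KNNImputer(), LinearRegression())
params = {'knnimputer__n_neighbors': [1, 3, 5]}

grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=5)
grid.fit(X, y)

# cv_results_ holds one mean test score per candidate in the grid.
for candidate, score in zip(grid.cv_results_['params'],
                            grid.cv_results_['mean_test_score']):
    print(candidate, round(score, 3))
print('best:', grid.best_params_)
```

After fitting, grid.best_params_ and grid.best_score_ summarize the winning candidate, and grid.best_estimator_ is the pipeline refit on all the data with those parameters.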

Complete example:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'rating': [5, 3, 4, 5]})
y = [1, 0, 1, 1]

numerical_pipeline = make_pipeline(KNNImputer())
# KNNImputer needs dense input. sparse_output=False requires scikit-learn >= 1.2;
# on older versions the same option is spelled sparse=False.
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore', sparse_output=False))
# cat_pipeline runs first, so KNNImputer only ever sees numeric data
preprocessor = Pipeline([('cat_pipeline', cat_pipeline),
                         ('numerical_pipeline', numerical_pipeline)])
model = make_pipeline(preprocessor, LinearRegression())

params = {
    'pipeline__numerical_pipeline__knnimputer__n_neighbors': [1, 2]
}

grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=2)

grid.fit(X, y)
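
Note that a plain Pipeline applies its steps in sequence, so each step sees every column. If the goal is what the question describes, routing the numerical and categorical columns to separate pipelines, a ColumnTransformer can be used as the preprocessor instead. The sketch below is a variant and not part of the original answer: the transformer names numerical_pipeline and cat_pipeline are chosen here for illustration, and the toy data is made up. With these names, the parameter key from the question becomes valid as written.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, purely illustrative.
X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan, 'Paris', 'London'],
                  'rating': [5.0, 3.0, np.nan, 5.0, 4.0, 2.0]})
y = [1.0, 0.0, 1.0, 1.0, 0.5, 0.0]

numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))

# Each pipeline only receives the columns picked by its selector.
# The transformer names are illustrative choices, not from the original answer.
preprocessor = ColumnTransformer([
    ('numerical_pipeline', numerical_pipeline, numerical_features),
    ('cat_pipeline', cat_pipeline, cat_features),
])
model = make_pipeline(preprocessor, LinearRegression())

# make_pipeline names the first step 'columntransformer'.
params = {
    'columntransformer__numerical_pipeline__knnimputer__n_neighbors': [1, 2]
}

grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=2)
grid.fit(X, y)
print(grid.best_params_)
```

Because the string column never reaches KNNImputer in this layout, the ordering workaround from point 2 is not needed here.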

