I am working in scikit-learn and I am trying to tune my XGBoost. I made an attempt at nested cross-validation, using a pipeline to rescale the training folds (to avoid data leakage and overfitting), GridSearchCV for parameter tuning, and cross_val_score to get the roc_auc score at the end.

```python
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  # missing from the original snippet
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

std_scaling = StandardScaler()
algo = XGBClassifier()

steps = [('std_scaling', std_scaling), ('algo', algo)]
pipeline = Pipeline(steps)

parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}

# inner CV, used by GridSearchCV for hyper-parameter selection
inner_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
clf_auc = GridSearchCV(pipeline, cv=inner_cv, param_grid=parameters,
                       scoring='roc_auc', n_jobs=-1, return_train_score=False)

# outer CV, used by cross_val_score to estimate generalization
outer_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv=outer_cv, scoring='roc_auc')
```

Question 1. How do I fit cross_val_score to the training data?

Question 2. Since I included StandardScaler() in the pipeline, does it make sense to pass X_train to cross_val_score, or should I use a standardized form of X_train (i.e.
std_X_train)?

```python
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)
```

Solution

You chose the right way to avoid data leakage, as you say: nested CV.

The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existing "meta-estimator" that also describes your model-selection process.

Meaning: in every round of the outer cross-validation (in your case represented by cross_val_score), the estimator clf_auc undergoes an internal CV that selects the best model for the given fold of the external CV. Therefore, for every fold of the external CV you are scoring a different estimator, chosen by the internal CV. For example, in one external CV fold the scored model may be one for which the parameter algo__min_child_weight was selected as 1, and in another, one for which it was selected as 2.

The score of the external CV therefore represents a higher-level quantity: "under a reasonable model-selection process, how well will my selected model generalize?"

Now, if you want to finish the process with a real model in hand, you have to select it in some way (cross_val_score will not do that for you). The way to do that is to fit your internal model over the entire data, meaning to perform:

```python
clf_auc.fit(X, y)
```

This is the moment to understand what you've done here: you have a model you can use, which is fitted over all the available data. When you're asked "how well does that model generalize on new data?", the answer is the score you got during the nested CV, which captured the model-selection process as part of the model's scoring.

And regarding Question 2: if the scaler is part of the pipeline, there is no reason to manipulate X_train externally. cross_val_score refits the whole pipeline, scaler included, on each training fold, so the scaler never sees the held-out fold.
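Putting the whole procedure together, here is a minimal runnable sketch of the answer above. It is illustrative only: the make_classification data stands in for the asker's X_train/y_train, the grid is trimmed for speed, and sklearn.pipeline.Pipeline is used since no resampling step is involved (imblearn's pipeline behaves identically here).

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold, GridSearchCV, cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for the asker's training data.
X, y = make_classification(n_samples=500, n_features=20, random_state=15)

pipeline = Pipeline([('std_scaling', StandardScaler()),
                     ('algo', XGBClassifier())])

# Trimmed grid, purely to keep the sketch fast.
parameters = {'algo__min_child_weight': [1, 2],
              'algo__max_depth': [4, 6]}

# Inner CV: GridSearchCV selects hyper-parameters within each outer training fold.
inner_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
clf_auc = GridSearchCV(pipeline, param_grid=parameters, cv=inner_cv,
                       scoring='roc_auc', n_jobs=-1)

# Outer CV: scores the whole selection procedure, not one fixed model.
outer_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
nested_scores = cross_val_score(clf_auc, X, y, cv=outer_cv, scoring='roc_auc')
print('nested CV roc_auc: %.3f +/- %.3f' % (nested_scores.mean(), nested_scores.std()))

# Final step: refit the search on ALL the data to obtain one usable model.
clf_auc.fit(X, y)
print('selected params:', clf_auc.best_params_)
# clf_auc.predict_proba(new_X) now uses the refitted best pipeline.
```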
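And for Question 2, a small sketch of why the pipeline makes external standardization unnecessary. LogisticRegression is substituted for XGBoost purely to keep the example light; the point is where the scaler's statistics come from, not which estimator sits after it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_tr, y_tr)  # the scaler is fitted here, on the training split only

# The scaler's learned statistics match the training split, not the full data:
assert np.allclose(pipe.named_steps['scale'].mean_, X_tr.mean(axis=0))

# At predict time the same training-split statistics are applied to X_te --
# exactly what cross_val_score does internally on every fold, which is why
# pre-scaling X_train outside the pipeline would leak held-out statistics.
print('test accuracy: %.3f' % pipe.score(X_te, y_te))
```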