Problem description
I am working on a binary classification problem and would like to use nested cross-validation to assess the classification error. I am using nested CV because the sample size is small (N_0 = 20, N_1 = 10), where N_0 and N_1 are the numbers of instances in classes 0 and 1, respectively.
My code is pretty simple:
>> pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': logspace(-4,1,50)}
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=5)
>> cross_val_score(grid_search, X, y, cv=5)
So far, so good. If I want to change the CV scheme from random splitting to StratifiedShuffleSplit in both the outer and inner CV loops, I run into a problem: how do I pass the class vector y, which the StratifiedShuffleSplit function requires?
Naively:
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=StratifiedShuffleSplit(y_inner_loop, 5, test_size=0.5, random_state=0))
>> cross_val_score(grid_search, X, y, cv=StratifiedShuffleSplit(y, 5, test_size=0.5, random_state=0))
So the question is: how do I specify y_inner_loop?
** My data set is slightly imbalanced (20/10), and I would like to keep this class ratio in the splits used for training and assessing the model.
Recommended answer
I have since resolved this problem, and the solution might be of interest to ML novices. In scikit-learn 0.18 (the newest version at the time of writing), the cross-validation tools have moved to the sklearn.model_selection module and their API has changed slightly. To make a long story short:
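>> import numpy as np
>> from sklearn.linear_model import LogisticRegression
>> from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score
>> from sklearn.pipeline import Pipeline
>> from sklearn.preprocessing import StandardScaler
>> sss_outer = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)
>> sss_inner = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=16)
>> # solver='liblinear' supports the L1 penalty; the default solver in newer releases does not
>> pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression(penalty='l1', solver='liblinear'))])
>> parameters = {'clf__C': np.logspace(-4, 1, 50)}
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=sss_inner)
>> cross_val_score(grid_search, X, y, cv=sss_outer)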
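For readers who want to run the snippet end to end, here is a minimal, self-contained sketch of the same nested scheme. The data are synthetic (an assumption, since the original X and y are not shown), the grid is reduced to 10 values of C to keep the run short, and solver='liblinear' is set explicitly because it supports the L1 penalty; treat it as an illustration rather than the exact setup used above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 20/10 data set (assumption: the real X, y are not shown).
rng = np.random.RandomState(0)
y = np.array([0] * 20 + [1] * 10)
X = rng.normal(size=(30, 5))
X[y == 1] += 1.0  # shift class 1 so there is some signal to learn

# Outer loop estimates performance; inner loop tunes C on each outer training set.
sss_outer = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)
sss_inner = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=16)

pipe_logistic = Pipeline([
    ("scl", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])
parameters = {"clf__C": np.logspace(-4, 1, 10)}  # reduced grid for a quick run

grid_search = GridSearchCV(
    estimator=pipe_logistic,
    param_grid=parameters,
    scoring="f1",
    cv=sss_inner,
)

# One F1 score per outer split; no y is passed to either splitter by hand.
scores = cross_val_score(grid_search, X, y, cv=sss_outer)
print(scores, scores.mean())
```

With very small values of C the L1-penalized model can predict a single class, in which case scikit-learn may warn that F1 is ill-defined and score that candidate as 0; this affects only the inner model selection.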
UPD: in the newest version, we no longer need to pass the target vector y explicitly to the splitter (which was my original problem); we only specify the desired number of splits.
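The reason y no longer has to be given up front is that, in the sklearn.model_selection API, the splitter object only stores its configuration, and the labels are supplied when its split(X, y) method is called. cross_val_score and GridSearchCV make that call themselves, each with the y they were given, so the inner search stratifies on the labels of the current outer training set. The sketch below, on synthetic 20/10 labels (an assumption), calls split directly to show that each outer split keeps the 2:1 class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the 20/10 data set (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(30, 4))
y = np.array([0] * 20 + [1] * 10)

# The splitter holds only its settings; y is passed at split time.
sss_outer = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)

split_counts = []
for train_idx, test_idx in sss_outer.split(X, y):
    # Count how many samples of each class land in the train and test parts.
    train_counts = np.bincount(y[train_idx], minlength=2)
    test_counts = np.bincount(y[test_idx], minlength=2)
    split_counts.append((tuple(train_counts), tuple(test_counts)))
    print("train:", train_counts, "test:", test_counts)
```

With 30 samples and test_size=0.4, each split has 18 training and 12 test samples, so the 2:1 ratio can be preserved exactly in both parts.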