This article covers Python / machine learning: how to perform a grid search against a custom, user-defined validation set.

Problem description

I am dealing with an unbalanced classification problem, where my negative class is 1000 times more numerous than my positive class. My strategy is to train a deep neural network on a balanced (50/50 ratio) training set (I have enough simulated samples), and then use an unbalanced (1/1000 ratio) validation set to select the best model and optimise the hyperparameters.
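The balanced-train / imbalanced-validation setup described above can be sketched as follows. All names and sizes here are illustrative placeholders, not taken from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 200 positives, 110_000 negatives.
X_pos = rng.normal(1.0, 1.0, size=(200, 5))
X_neg = rng.normal(0.0, 1.0, size=(110_000, 5))

# Balanced 50/50 training set: 100 positives + 100 negatives.
X_train = np.vstack([X_pos[:100], X_neg[:100]])
y_train = np.concatenate([np.ones(100), np.zeros(100)])

# Imbalanced validation set at the realistic 1:1000 ratio:
# 100 held-out positives + 100_000 unseen negatives.
X_val = np.vstack([X_pos[100:], X_neg[10_000:110_000]])
y_val = np.concatenate([np.ones(100), np.zeros(100_000)])
```

Note that the positives used for validation are kept disjoint from the training positives, to avoid leaking validation samples into training.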

Since the number of parameters is significant, I want to use scikit-learn's RandomizedSearchCV, i.e. a randomized grid search.

To my understanding, sk-learn's GridSearch applies a metric on the training set to select the best set of hyperparameters. In my case, however, this means that the GridSearch will select the model that performs best against a balanced training set, and not against more realistic unbalanced data.

My question is: is there a way to grid search with the performance estimated on a specific, user-defined validation set?

Answer

As suggested in the comments, the thing you need is PredefinedSplit. It is described in the question here.

As for how it works, you can look at the example given in the documentation:

import numpy as np
from sklearn.model_selection import PredefinedSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# This is what you need
test_fold = [0, 1, -1, 1]

ps = PredefinedSplit(test_fold)
print(ps.get_n_splits())
# OUTPUT: 2

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# OUTPUT:
# TRAIN: [1 2 3] TEST: [0]
# TRAIN: [0 2] TEST: [1 3]

As you can see here, you assign test_fold a list of indices, which will be used to split the data. A value of -1 marks the samples that are never included in any validation set.

So in the code above, test_fold = [0, 1, -1, 1] says that the first validation set consists of the samples whose value in test_fold is 0 (i.e. index 0), and the second of the samples whose value is 1 (i.e. indices 1 and 3).
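A quick way to confirm the role of -1 is to collect every index that ever appears in a test split and check that index 2 (the one marked -1) is missing:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

test_fold = [0, 1, -1, 1]
ps = PredefinedSplit(test_fold)

# Gather all test indices across both splits.
all_test = np.concatenate([test_idx for _, test_idx in ps.split()])
print(sorted(all_test.tolist()))  # → [0, 1, 3]; index 2 never appears
```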

But since you say you already have X_train and X_test, if you want your validation set to come only from X_test, then you need to do the following:

import numpy as np
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

# -1 keeps these samples in the training set on every split
my_test_fold = [-1] * len(X_train)

# 0 puts the remaining samples in the (single) validation fold
my_test_fold += [0] * len(X_test)

clf = RandomizedSearchCV(...,  # estimator and param_distributions as before
                         cv=PredefinedSplit(test_fold=my_test_fold))

# Combine X_train and X_test into one array, and likewise the labels:
clf.fit(np.concatenate((X_train, X_test), axis=0),
        np.concatenate((y_train, y_test), axis=0))
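Putting it all together, here is a minimal runnable sketch of the approach. The estimator, parameter grid, and data are placeholders chosen for illustration, not from the original question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 4))
y_train = rng.integers(0, 2, size=80)
X_test = rng.normal(size=(20, 4))
y_test = rng.integers(0, 2, size=20)

# -1 -> always in the training set, 0 -> in the single validation fold.
my_test_fold = [-1] * len(X_train) + [0] * len(X_test)

clf = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=4,
    cv=PredefinedSplit(test_fold=my_test_fold),
    random_state=0,
)
clf.fit(np.concatenate((X_train, X_test), axis=0),
        np.concatenate((y_train, y_test), axis=0))
print(clf.best_params_)
```

Every candidate is scored only on the predefined validation fold, so the hyperparameters are selected against the user-defined validation set rather than random CV folds of the training data.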
