I am trying to understand how to get the scorer values from GridSearchCV. The example code below sets up a small pipeline on text data.

It then sets up a grid search over different n-gram ranges.

Scoring is done with the f1 metric:

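from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.grid_search import GridSearchCV  # older module that exposes grid_scores_
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# set up the pipeline: TF-IDF features followed by a linear SVM
tfidf_vec = TfidfVectorizer(analyzer='word', min_df=0.05, max_df=0.95)
linearsvc = LinearSVC()
clf = Pipeline([('tfidf_vec', tfidf_vec), ('linearsvc', linearsvc)])

# set up the grid search over unigrams vs. unigrams plus bigrams
parameters = {'tfidf_vec__ngram_range': [(1, 1), (1, 2)]}
gs_clf = GridSearchCV(clf, parameters, n_jobs=-1, scoring='f1')
gs_clf = gs_clf.fit(docs_train, y_train)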


Now I can print the scores with:

print(gs_clf.grid_scores_)

[mean: 0.81548, std: 0.01324, params: {'tfidf_vec__ngram_range': (1, 1)},
 mean: 0.82143, std: 0.00538, params: {'tfidf_vec__ngram_range': (1, 2)}]


print(gs_clf.grid_scores_[0].cv_validation_scores)

array([ 0.83234714,  0.8       ,  0.81409002])
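
As a quick sanity check on the numbers above, the mean and std reported for the first parameter setting can be recomputed from the three per-fold scores. This is a minimal sketch in plain Python, with the values copied from the printed output. The variable names are illustrative, and using a plain mean with a population standard deviation is an assumption that happens to reproduce the printed figures:

```python
import statistics

# Per-fold scores printed above for ngram_range (1, 1).
fold_scores = [0.83234714, 0.8, 0.81409002]

# Plain (unweighted) mean and population standard deviation of the folds.
mean_score = sum(fold_scores) / len(fold_scores)
std_score = statistics.pstdev(fold_scores)

print(round(mean_score, 5))  # 0.81548, the mean shown in grid_scores_[0]
print(round(std_score, 5))   # 0.01324, the std shown in grid_scores_[0]
```

Both values match the first entry of grid_scores_, which suggests that the summary line is simply an aggregate of the per-fold array.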


The documentation is not clear to me:


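Is gs_clf.grid_scores_[0].cv_validation_scores an array of the per-fold scores defined by the scoring parameter (in this case, f1 measured on each fold)? If not, what is it?
If I instead choose another metric, such as scoring='f1_micro', will each array in gs_clf.grid_scores_[i].cv_validation_scores contain the f1_micro score of each fold for that particular grid search parameter setting?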

Best answer

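I wrote the following function to convert a grid_scores_ object into a pandas.DataFrame. Hopefully the DataFrame view clears up your confusion, since it is a more intuitive format: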

def grid_scores_to_df(grid_scores):
    """
    Convert a sklearn.grid_search.GridSearchCV.grid_scores_ attribute to a tidy
    pandas DataFrame where each row is a hyperparameter-fold combination.
    """
    rows = list()
    for grid_score in grid_scores:
        for fold, score in enumerate(grid_score.cv_validation_scores):
            row = grid_score.parameters.copy()
            row['fold'] = fold
            row['score'] = score
            rows.append(row)
    df = pd.DataFrame(rows)
    return df


You need the following import for this to work: import pandas as pd
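
To check directly whether the per-fold values are computed with the metric named in scoring, one option is to compare them against cross_val_score run on the same folds. The sketch below makes several assumptions: a recent scikit-learn, where grid_scores_ is no longer available and the per-fold values are stored in cv_results_ under keys such as split0_test_score; synthetic data from make_classification in place of the text corpus; and a grid over C rather than ngram_range. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic binary data stands in for the TF-IDF features (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Fixed, unshuffled folds so both code paths evaluate identical splits.
n_splits = 3
cv = StratifiedKFold(n_splits=n_splits)

results = {}
for scoring in ('f1', 'f1_micro'):
    gs = GridSearchCV(LinearSVC(random_state=0), {'C': [0.1, 1.0]},
                      scoring=scoring, cv=cv)
    gs.fit(X, y)

    # Per-fold scores of the first parameter setting (C=0.1).
    fold_scores = np.array([gs.cv_results_['split%d_test_score' % k][0]
                            for k in range(n_splits)])

    # The same model scored on the same folds with the same metric.
    direct = cross_val_score(LinearSVC(C=0.1, random_state=0), X, y,
                             scoring=scoring, cv=cv)
    results[scoring] = (fold_scores, direct)
    print(scoring, fold_scores, direct)
```

If the two arrays agree for each metric, that supports reading the per-fold values as the scoring metric evaluated on each held-out fold, for 'f1' and 'f1_micro' alike.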

Regarding python - GridSearchCV scoring and grid_scores_, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37014564/
