There are several steps in my classification scheme, including:

  1. SMOTE (Synthetic Minority Oversampling Technique)
  2. Fisher criterion for feature selection
  3. Standardization (Z-score normalization)
  4. SVC (Support Vector Classifier)

  • The main parameters to be tuned in the scheme above are the percentile (step 2) and the hyperparameters of the SVC (step 4), and I want to tune them through grid search.
    The current solution builds a "partial" pipeline covering steps 3 and 4, clf = Pipeline([('normal', preprocessing.StandardScaler()), ('svc', svm.SVC(class_weight='auto'))]),
    and breaks the scheme into two parts:
  • Tune the percentile of features to keep through the first grid search

    skf = StratifiedKFold(y)
    for train_ind, test_ind in skf:
        X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
        # SMOTE synthesizes the training data (we want to keep the test data intact)
        X_train, y_train = SMOTE(X_train, y_train)
        for percentile in percentiles:
            # Fisher returns the indices of the selected features, specified by the parameter 'percentile'
            selected_ind = Fisher(X_train, y_train, percentile)
            X_train_selected, X_test_selected = X_train[selected_ind, :], X_test[selected_ind, :]
            model = clf.fit(X_train_selected, y_train)
            y_predict = model.predict(X_test_selected)
            f1 = f1_score(y_predict, y_test)

    The f1 scores are stored and then averaged over all fold partitions for each percentile, and the percentile with the best CV score is returned. The purpose of putting the 'percentile for-loop' as the inner loop is to allow fair competition, since we have the same training data (including the synthesized data) across all fold partitions for all percentiles.
  • After determining the percentile, tune the hyperparameters through the second grid search

    skf = StratifiedKFold(y)
    for train_ind, test_ind in skf:
        X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
        # SMOTE synthesizes the training data (we want to keep the test data intact)
        X_train, y_train = SMOTE(X_train, y_train)
        for parameters in parameter_comb:
            # select the features according to the tuned percentile
            selected_ind = Fisher(X_train, y_train, best_percentile)
            X_train_selected, X_test_selected = X_train[selected_ind, :], X_test[selected_ind, :]
            clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
            model = clf.fit(X_train_selected, y_train)
            y_predict = model.predict(X_test_selected)
            f1 = f1_score(y_predict, y_test)

  • It is done in a very similar way, except that here we tune the hyperparameters for the SVC rather than the percentile of features to select.
    My questions are:
    I) In the current solution, I only involve steps 3 and 4 in clf, and do steps 1 and 2 "manually" in the two nested loops as described above. Is there any way to include all four steps in a pipeline and do the whole process at once?
    II) If it is okay to keep the first nested loop, then is it possible (and how) to simplify the next nested loop using a single pipeline
    clf_all = Pipeline([('smote', SMOTE()),
                        ('fisher', Fisher(percentile=best_percentile)),
                        ('normal', preprocessing.StandardScaler()),
                        ('svc', svm.SVC(class_weight='auto'))])
    
    and simply use GridSearchCV(clf_all, parameter_comb) for the tuning?
    Please note that both SMOTE and Fisher (the ranking criterion) have to be done only on the training data of each fold partition.
    Any comment would be highly appreciated.
    EDIT The Fisher (Fscore) and SMOTE functions are as follows:
    from numpy import shape, argsort, ceil

    def Fscore(X, y, percentile=None):
        # split the samples by class and compute per-feature means
        X_pos, X_neg = X[y==1], X[y==0]
        X_mean = X.mean(axis=0)
        X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
        # Fisher criterion per feature: between-class scatter over within-class scatter
        deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
        num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
        F = num/deno
        # rank features by descending score and keep the top 'percentile' percent
        sort_F = argsort(F)[::-1]
        n_feature = (float(percentile)/100)*shape(X)[1]
        ind_feature = sort_F[:int(ceil(n_feature))]
        return(ind_feature)
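
    For reference, a minimal usage sketch (not from the original post): since Fscore computes its statistics along axis=0, the returned indices refer to feature columns, so they would typically be applied along the second axis. The toy data below is purely hypothetical.

    import numpy as np

    # hypothetical toy data: 20 samples, 10 features, imbalanced binary labels
    X_demo = np.random.randn(20, 10)
    y_demo = np.array([1]*8 + [0]*12)

    # keep the top 30% of features by Fisher score, then select those columns
    selected_ind = Fscore(X_demo, y_demo, percentile=30)
    X_demo_selected = X_demo[:, selected_ind]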
    
    SMOTE comes from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py, which returns the synthetic data. I modified it to return the original input data stacked together with the synthetic data, along with its labels and the labels of the synthetic samples.
    import numpy as np
    from numpy import shape

    def smote(X, y):
        # counts of positive and negative samples
        n_pos, n_neg = sum(y==1), sum(y==0)
        # how many synthetic samples per positive sample are needed to balance the classes
        n_syn = (n_neg-n_pos)/float(n_pos)
        X_pos = X[y==1]
        # generate synthetic positive samples with the external SMOTE implementation linked above
        X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
        y_syn = np.ones(shape(X_syn)[0])
        # stack the synthetic samples (all labeled 1) onto the original data
        X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
        return(X, y)
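
    A quick usage sketch (not from the original post), assuming X_train and y_train come from a StratifiedKFold training split as in the loops above; the wrapper is applied to the training fold only so that the test fold stays intact:

    # oversample only the training fold; the class counts are roughly balanced afterwards
    X_train_os, y_train_os = smote(X_train, y_train)
    print(sum(y_train_os == 1), sum(y_train_os == 0))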
    

    Best Answer

    I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes, you can definitely do this. In order to do so you will need to write wrapper classes around those functions. The easiest way to do that is to inherit sklearn's BaseEstimator and TransformerMixin classes; see this example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

    If this doesn't make sense to you, please post the details of at least one of the functions (the library it comes from, or the code you wrote yourself) and we can go from there.

    EDIT:

    I apologize, I did not look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target, so you will have to do those beforehand, as you originally were. For your reference, here is what writing a custom class for your Fisher process would look like; it works as long as the function itself does not need to affect the target variable.

    >>> from sklearn.base import BaseEstimator, TransformerMixin
    >>> from sklearn.preprocessing import StandardScaler
    >>> from sklearn.svm import SVC
    >>> from sklearn.pipeline import Pipeline
    >>> from sklearn.grid_search import GridSearchCV
    >>> from sklearn.datasets import load_iris
    >>>
    >>> class Fisher(BaseEstimator, TransformerMixin):
    ...     def __init__(self,percentile=0.95):
    ...             self.percentile = percentile
    ...     def fit(self, X, y):
    ...             from numpy import shape, argsort, ceil
    ...             X_pos, X_neg = X[y==1], X[y==0]
    ...             X_mean = X.mean(axis=0)
    ...             X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    ...             deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
    ...             num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
    ...             F = num/deno
    ...             sort_F = argsort(F)[::-1]
    ...             n_feature = (float(self.percentile)/100)*shape(X)[1]
    ...             self.ind_feature = sort_F[:ceil(n_feature)]
    ...             return self
    ...     def transform(self, x):
    ...             return x[self.ind_feature,:]
    ...
    >>>
    >>> data = load_iris()
    >>>
    >>> pipeline = Pipeline([
    ...     ('fisher', Fisher()),
    ...     ('normal',StandardScaler()),
    ...     ('svm',SVC(class_weight='auto'))
    ... ])
    >>>
    >>> grid = {
    ...     'fisher__percentile':[0.75,0.50],
    ...     'svm__C':[1,2]
    ... }
    >>>
    >>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
    >>> model.fit(data.data,data.target)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
        return self._fit(X, y, ParameterGrid(self.param_grid))
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
        for parameters in parameter_iterable
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
        self.dispatch(function, args, kwargs)
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
        job = ImmediateApply(func, args, kwargs)
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
        self.results = func(*args, **kwargs)
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
        estimator.fit(X_train, y_train, **fit_params)
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
        self.steps[-1][-1].fit(Xt, y, **fit_params)
      File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
        (X.shape[0], y.shape[0]))
    ValueError: X and y have incompatible shapes.
    X has 1 samples, but y has 75.
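
    The error here most likely comes from transform() indexing rows instead of columns: with iris and percentile=0.75, ceil(0.75/100 * 4) is 1, so self.ind_feature holds a single feature index, and x[self.ind_feature,:] returns a single row, which is why X ends up with 1 sample against 75 labels. A minimal sketch of a corrected transform (selecting feature columns instead) would be:

        def transform(self, x):
            # keep all samples, select only the chosen feature columns
            return x[:, self.ind_feature]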
    

    Regarding machine-learning - putting custom functions in an sklearn Pipeline, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31259891/
