There are several steps in my classification scheme, including:

1. Oversample the training data using SMOTE
2. Select features using the Fisher criterion, keeping a given percentile of them
3. Standardize the training and testing data (StandardScaler)
4. Classify with an SVC (support vector classifier)

The main parameters to be tuned in the scheme above are the percentile (step 2.) and the hyperparameters of the SVC (step 4.), and I want to tune them through grid search.
My current solution builds a "partial" pipeline that covers steps 3 and 4:

clf = Pipeline([('normal', preprocessing.StandardScaler()), ('svc', svm.SVC(class_weight='auto'))])

and breaks the scheme into two parts:
skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep the testing data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for percentile in percentiles:
        # Fisher returns the indices of the selected features specified by the parameter 'percentile'
        selected_ind = Fisher(X_train, y_train, percentile)
        X_train_selected, X_test_selected = X_train[selected_ind, :], X_test[selected_ind, :]
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_predict, y_test)
The f1 scores will be stored and then averaged across all fold partitions for each percentile, and the percentile with the best CV score is returned. The purpose of putting the "percentile for loop" as the inner loop is to allow fair competition, since we then have the same training data (including the synthesized data) across all fold partitions for all percentiles.
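To make that book-keeping concrete, here is a minimal sketch of how the fold scores could be collected and averaged (the names percentiles and scores, and the dummy numbers, are purely illustrative and not part of my actual code):

import numpy as np

# Hypothetical candidate percentiles and a container for the per-fold f1 scores
percentiles = [10, 20, 30]
scores = {p: [] for p in percentiles}

# Inside the nested loops above, each fold/percentile pair would append its score:
#     scores[percentile].append(f1)
# For illustration, fill the container with dummy numbers instead:
scores = {10: [0.61, 0.64], 20: [0.66, 0.70], 30: [0.65, 0.67]}

# Average over the folds and keep the percentile with the best CV score
mean_scores = {p: np.mean(s) for p, s in scores.items()}
best_percentile = max(mean_scores, key=mean_scores.get)
print(best_percentile)  # -> 20 with the dummy numbers above

The second part, which tunes the SVC hyperparameters with this best_percentile, then looks like this: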
skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep the testing data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for parameters in parameter_comb:
        # select the features based on the tuned best percentile
        selected_ind = Fisher(X_train, y_train, best_percentile)
        X_train_selected, X_test_selected = X_train[selected_ind, :], X_test[selected_ind, :]
        clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_predict, y_test)
It is implemented in a very similar way, except that here we tune the hyperparameters of the SVC rather than the percentile of features to select.
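For completeness: parameter_comb above is assumed to be a list of dicts with 'C' and 'gamma' keys. One way to build such a list is sklearn's ParameterGrid; the value ranges below are made up for illustration:

from sklearn.grid_search import ParameterGrid  # lives in sklearn.model_selection in newer releases

# Hypothetical value ranges; parameter_comb becomes a list of {'C': ..., 'gamma': ...} dicts
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
parameter_comb = list(ParameterGrid(param_grid))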
My questions are:

I) In the current solution, I only involve steps 3. and 4. in clf, and do 1. and 2. "manually" in the two nested loops described above. Is there any way to include all four steps in a pipeline and do the whole process at once?

II) If it is okay to keep the first nested loop, then is it possible (and how) to simplify the next nested loop using a single pipeline
clf_all = Pipeline([('smote', SMOTE()),
                    ('fisher', Fisher(percentile=best_percentile)),
                    ('normal', preprocessing.StandardScaler()),
                    ('svc', svm.SVC(class_weight='auto'))])
and simply use GridSearchCV(clf_all, parameter_comb) to do the tuning? Please note that both SMOTE and Fisher (the ranking criterion) have to be done only on the training data of each fold partition. Any comment would be highly appreciated.
EDIT
SMOTE and Fisher are shown below:

def Fscore(X, y, percentile=None):
    X_pos, X_neg = X[y==1], X[y==0]
    X_mean = X.mean(axis=0)
    X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
    num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
    F = num/deno
    sort_F = argsort(F)[::-1]
    n_feature = (float(percentile)/100)*shape(X)[1]
    ind_feature = sort_F[:ceil(n_feature)]
    return(ind_feature)
SMOTE comes from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py, which returns the synthesized data. I modified it so that it returns the original input data stacked together with the synthesized data, along with their labels and the labels of the synthesized samples:

def smote(X, y):
    n_pos, n_neg = sum(y==1), sum(y==0)
    n_syn = (n_neg-n_pos)/float(n_pos)
    X_pos = X[y==1]
    X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
    y_syn = np.ones(shape(X_syn)[0])
    X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
    return(X, y)
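As a quick sanity check of what this wrapper is meant to produce, here is a small usage sketch; the toy class counts are made up, and the call is commented out because it needs the nyan SMOTE helper:

import numpy as np

# Hypothetical imbalanced toy data: 10 positive vs. 40 negative samples, 5 features
X = np.random.randn(50, 5)
y = np.concatenate([np.ones(10), np.zeros(40)])

# X_res, y_res = smote(X, y)      # wrapper above; requires the nyan SMOTE helper
# With n_pos=10 and n_neg=40, n_syn is 3, so 300% oversampling adds 30 synthetic positives:
# X_res.shape        -> (80, 5)   # 50 original + 30 synthetic samples
# np.sum(y_res == 1) -> 40        # positives now match the 40 negatives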
Best Answer
I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes, you can definitely do this. In order to do so you will need to write wrapper classes around those functions. The easiest way is to inherit sklearn's BaseEstimator and TransformerMixin classes; see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

If this isn't making sense to you, post the details of at least one of your functions (the library it comes from, or the code if you wrote it yourself) and we can go from there.
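For orientation, the general shape of such a wrapper is sketched below; this is a bare skeleton with a placeholder identity transform, not your Fisher function:

from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, some_param=1):
        # parameters set in __init__ become tunable through GridSearchCV ('step__some_param')
        self.some_param = some_param
    def fit(self, X, y=None):
        # learn whatever state is needed, from the training fold only
        return self
    def transform(self, X):
        # apply the learned transformation; placeholder: return X unchanged
        return X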
EDIT:

I apologize, I did not look closely enough at your functions to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target, so you will have to do those beforehand as you originally were. For your reference, here is what it would look like to write a custom class for your Fisher process, which would work if the function itself did not need to affect your target variable.
>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>>
>>> class Fisher(BaseEstimator, TransformerMixin):
... def __init__(self,percentile=0.95):
... self.percentile = percentile
... def fit(self, X, y):
... from numpy import shape, argsort, ceil
... X_pos, X_neg = X[y==1], X[y==0]
... X_mean = X.mean(axis=0)
... X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
... deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
... num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
... F = num/deno
... sort_F = argsort(F)[::-1]
... n_feature = (float(self.percentile)/100)*shape(X)[1]
... self.ind_feature = sort_F[:ceil(n_feature)]
... return self
... def transform(self, x):
... return x[self.ind_feature,:]
...
>>>
>>> data = load_iris()
>>>
>>> pipeline = Pipeline([
... ('fisher', Fisher()),
... ('normal',StandardScaler()),
... ('svm',SVC(class_weight='auto'))
... ])
>>>
>>> grid = {
... 'fisher__percentile':[0.75,0.50],
... 'svm__C':[1,2]
... }
>>>
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
for parameters in parameter_iterable
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
self.dispatch(function, args, kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
self.results = func(*args, **kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
(X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.
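For what it is worth, the traceback above seems to come from transform indexing rows instead of columns: with cv=2 on iris, each training fold has 75 samples, but x[self.ind_feature, :] keeps only the selected row(s), hence "X has 1 samples, but y has 75". In addition, the grid values 0.75 and 0.50 go through float(self.percentile)/100, which keeps almost no features; passing 75 and 50 (or dropping the division by 100) would match the intended meaning of "percentile", and on recent NumPy the slice bound also needs to be an integer, e.g. sort_F[:int(ceil(n_feature))]. A sketch of the adjusted transform, assuming features are along the columns as in the rest of the code:

def transform(self, x):
    # select the chosen feature columns for every sample, rather than selecting rows
    return x[:, self.ind_feature]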
On machine-learning - putting customized functions in a Sklearn pipeline, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31259891/