I am using Python 3.5 and the Python implementation of XGBoost (version 0.6).
I have built a forward feature selection routine in Python that iteratively builds the optimal set of features (i.e. the set yielding the best score; the metric here is the binary classification error).
On my dataset, using the xgb.cv routine, the error rate can be brought down to about 0.21 by increasing max_depth (of the trees) to 40.
But if I run a custom cross-validation with the same XGBoost parameters, the same folds, the same metric and the same dataset, the best score I reach is 0.70 with a max_depth of 4 ... and if I use the optimal max_depth obtained with my xgb.cv routine, my score drops to 0.65 ... I just cannot understand what is happening ...
My best guess is that xgb.cv is using different folds (i.e. shuffling the data before partitioning), but I believe I am passing the folds as input to xgb.cv (with the option shuffle=False) ... so it may be something else entirely ...
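(For completeness, here is a minimal sketch of how I understand explicit folds can be shared between xgb.cv and a manual loop. X, y and params are placeholders, not my actual data, and I am assuming the installed xgboost accepts a list of (train_idx, test_idx) tuples for the folds argument — newer versions document this, I am less sure about 0.6:)

import xgboost as xgb
from sklearn.model_selection import KFold

# Build the fold indices once, so xgb.cv and the manual loop
# are guaranteed to see exactly the same partitions.
k_fold = KFold(n_splits=13, shuffle=False)
fold_indices = list(k_fold.split(X))      # list of (train_idx, test_idx) tuples

dtrain = xgb.DMatrix(X, label=y)
cv_res = xgb.cv(params, dtrain, num_boost_round=30, folds=fold_indices)

# ... and the manual cross-validation reuses the very same indices:
for train_idx, test_idx in fold_indices:
    pass  # train on train_idx, evaluate on test_idx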
Here is the code of the forward_feature_selection (using xgb.cv):
def Forward_Feature_Selection(train, y_train, params, num_round=30, threshold=0, initial_score=0.5, to_exclude=[], nfold=5):
    k_fold = KFold(n_splits=13)
    selected_features = []
    gain = threshold + 1
    previous_best_score = initial_score
    train = train.drop(train.columns[to_exclude], axis=1)  # df.columns is zero-based pd.Index
    features = train.columns.values
    selected = np.zeros(len(features))
    scores = np.zeros(len(features))
    while (gain > threshold):  # we start a add-a-feature loop
        for i in range(0, len(features)):
            if (selected[i] == 0):  # take only features not yet selected
                selected_features.append(features[i])
                new_train = train.iloc[:][selected_features]
                selected_features.remove(features[i])
                dtrain = xgb.DMatrix(new_train, y_train, missing=None)
                # dtrain = xgb.DMatrix(pd.DataFrame(new_train), y_train, missing=None)
                if (i % 10 == 0):
                    print("Launching XGBoost for feature " + str(i))
                xgb_cv = xgb.cv(params, dtrain, num_round, nfold=13, folds=k_fold, shuffle=False)
                if params['objective'] == 'binary:logistic':
                    scores[i] = xgb_cv.tail(1)["test-error-mean"]  # classification
                else:
                    scores[i] = xgb_cv.tail(1)["test-rmse-mean"]  # regression
            else:
                scores[i] = initial_score  # discard already selected variables from candidates
        best = np.argmin(scores)
        gain = previous_best_score - scores[best]
        if (gain > 0):
            previous_best_score = scores[best]
            selected_features.append(features[best])
            selected[best] = 1
            print("Adding feature: " + features[best] + " increases score by " + str(gain) + ". Final score is now: " + str(previous_best_score))
    return (selected_features, previous_best_score)
And here is my "custom" cross-validation:
mean_error_rate = 0
for train, test in k_fold.split(ds):
    dtrain = xgb.DMatrix(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"], missing=None)
    gbm = xgb.train(params, dtrain, 30)
    dtest = xgb.DMatrix(pd.DataFrame(ds.iloc[test]), dc.iloc[test]["bin_spread"], missing=None)
    res.ix[test, "pred"] = gbm.predict(dtest)
    cv_reg = reg.fit(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"])
    res.ix[test, "lasso"] = cv_reg.predict(pd.DataFrame(ds.iloc[test]))
    res.ix[test, "y_xgb"] = res.loc[test, "pred"] > 0.5
    res.ix[test, "xgb_right"] = (res.loc[test, "y_xgb"] == res.loc[test, "bin_spread"])
    print(str(100 * np.sum(res.loc[test, "xgb_right"]) / (N / 13)))
    mean_error_rate += 100 * (np.sum(res.loc[test, "xgb_right"]) / (N / 13))
print("mean_error_rate is : " + str(mean_error_rate / 13))
with the following parameters:
params = {"objective": "binary:logistic",
          "booster": "gbtree",
          "max_depth": 4,
          "eval_metric": "error",
          "eta": 0.15}
res = pd.DataFrame(dc["bin_spread"])
k_fold = KFold(n_splits=13)
N = dc.shape[0]
num_trees = 30
and finally the call to my forward feature selection:
selfeat = Forward_Feature_Selection(dc,
                                    dc["bin_spread"],
                                    params,
                                    num_round=num_trees,
                                    threshold=0,
                                    initial_score=999,
                                    to_exclude=[0, 1, 5, 30, 31],
                                    nfold=13)
Any help understanding what is going on would be greatly appreciated! Thanks in advance for any tips!
Best answer
This is normal. I have experienced the same. First, KFold splits differently each time. You have specified the folds for XGBoost, but KFold not splitting consistently is normal.
Next, the initial state of the model is different each time.
There are also internal random states within XGBoost that can cause this. Try changing the evaluation metric to see whether the variance decreases. If a particular metric suits your needs, try averaging the best parameters and use those as your optimal parameters.
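As a rough illustration of how to pin down those sources of randomness, here is a minimal sketch (X and y are placeholders for your feature matrix and labels, not taken from your code): fix the KFold random_state, pass the precomputed folds to xgb.cv, and set an explicit seed both in the parameter dict and in the xgb.cv call.

import xgboost as xgb
from sklearn.model_selection import KFold

params = {"objective": "binary:logistic",
          "booster": "gbtree",
          "max_depth": 4,
          "eval_metric": "error",
          "eta": 0.15,
          "seed": 42}                      # fix XGBoost's own random seed

# Deterministic splits: with shuffle=True the random_state pins the partition;
# with shuffle=False, KFold is deterministic anyway.
k_fold = KFold(n_splits=13, shuffle=True, random_state=42)
folds = list(k_fold.split(X))

dtrain = xgb.DMatrix(X, label=y)
cv_res = xgb.cv(params, dtrain, num_boost_round=30, folds=folds, seed=42)
print(cv_res.tail(1)["test-error-mean"])

Running this twice in a row should give identical test-error-mean values; if the numbers still drift, the remaining variance comes from the model itself rather than from the fold assignment.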
On the topic of python - Unable to reproduce xgb.cv cross-validation results, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43258188/