Problem description
I'm trying to make a classifier on a data set. I first used XGBoost:
import xgboost as xgb
import pandas as pd
import numpy as np
train = pd.read_csv("train_users_processed_onehot.csv")
labels = train["Buy"].map({"Y":1, "N":0})
features = train.drop("Buy", axis=1)
data_dmat = xgb.DMatrix(data=features, label=labels)
params = {"max_depth": 5, "min_child_weight": 2, "eta": 0.1, "subsample": 0.9, "colsample_bytree": 0.8, "objective": "binary:logistic", "eval_metric": "logloss"}
rounds = 180
result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds, early_stopping_rounds=50, as_pandas=True, seed=23333)
print(result)
The result is:
test-logloss-mean test-logloss-std train-logloss-mean
0 0.683539 0.000141 0.683407
179 0.622302 0.001504 0.606452
We can see it is around 0.622;
But when I switch to sklearn using exactly the same parameters (I think), the result is quite different. Below is my code:
from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBClassifier
import pandas as pd
train_dataframe = pd.read_csv("train_users_processed_onehot.csv")
train_labels = train_dataframe["Buy"].map({"Y":1, "N":0})
train_features = train_dataframe.drop("Buy", axis=1)
estimator = XGBClassifier(learning_rate=0.1, n_estimators=190, max_depth=5, min_child_weight=2, objective="binary:logistic", subsample=0.9, colsample_bytree=0.8, seed=23333)
print(cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss"))
and the result is: [-4.11429976 -2.08675843 -3.27346662], which even after flipping the sign is still far from 0.622.
I set a breakpoint inside cross_val_score and saw that the classifier was making crazy predictions, assigning about 0.99 probability to the negative class for every tuple in the test set.
I'm wondering where I have gone wrong. Could someone help me?
Recommended answer
This question is a bit old, but I ran into the problem today and figured out why the results given by xgboost.cv and sklearn.model_selection.cross_val_score are quite different.
By default, cross_val_score uses KFold or StratifiedKFold with shuffle=False, so the folds are not drawn randomly from the data.
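To see what that means, here is a minimal sketch (using a hypothetical toy label array, unrelated to the data above) that prints the test indices produced with and without shuffling:

from sklearn.model_selection import StratifiedKFold
import numpy as np

# toy target: two classes, rows stored in a non-random order
y_toy = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1])
X_toy = np.zeros((len(y_toy), 1))

for name, kf in [("no shuffle", StratifiedKFold(n_splits=3, shuffle=False)),
                 ("shuffle", StratifiedKFold(n_splits=3, shuffle=True, random_state=23333))]:
    # print which rows end up in each test fold
    print(name, [test.tolist() for _, test in kf.split(X_toy, y_toy)])

With shuffle=False the rows of each class are dealt into folds in file order, so if the CSV is sorted in some way (by user, time, etc.) every fold sees a different slice of the data, which can make the cross-validated log loss look much worse than the randomly drawn folds used by xgboost.cv.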
So if you do this, you should get the same result:
from sklearn.model_selection import StratifiedKFold

cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss",
                cv=StratifiedKFold(shuffle=True, random_state=23333))
Keep the random_state in StratifiedKFold and the seed in xgboost.cv the same to get exactly reproducible results.
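For example, one way to line the two up (a sketch reusing params, data_dmat, rounds, estimator, train_features and train_labels from the snippets above; as far as I know xgboost.cv also accepts a scikit-learn splitter through its folds argument) is to hand the same splitter to both:

from sklearn.model_selection import StratifiedKFold, cross_val_score
import xgboost as xgb

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23333)

# xgboost side: evaluate on the folds produced by kf
cv_result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds,
                   folds=kf, early_stopping_rounds=50, as_pandas=True, seed=23333)

# sklearn side: the very same folds, so the log loss should now be comparable
scores = cross_val_score(estimator, X=train_features, y=train_labels,
                         scoring="neg_log_loss", cv=kf)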