Problem description
I have a dataset consisting of both numeric and categorical data and I want to predict adverse outcomes for patients based on their medical characteristics. I defined a prediction pipeline for my dataset like so:
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# dataset is a pandas DataFrame with a 'target' column
X = dataset.drop(columns=['target'])
y = dataset['target']

# define categorical and numeric transformers
numeric_transformer = Pipeline(steps=[
    ('knnImputer', KNNImputer(n_neighbors=2, weights="uniform")),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# dispatch object columns to the categorical_transformer
# and the remaining columns to the numeric_transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, selector(dtype_exclude="object")),
    ('cat', categorical_transformer, selector(dtype_include="object"))])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
However, when running this code, I get the following warning message:
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
model score: 0.988
Can someone explain to me what this warning means? I am new to machine learning so am a little lost as to what I can do to improve the prediction model. As you can see from the numeric_transformer, I scaled the data through standardisation. I am also confused as to how the model score is quite high and whether this is a good or bad thing.
Recommended answer
The warning means mainly what it says: it suggests things to try to make the solver (the algorithm) converge.
lbfgs stands for "Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm". It is one of the solver algorithms provided by the Scikit-Learn library.
The term limited-memory simply means it stores only a few vectors, which implicitly represent an approximation of the gradient.
It has better convergence on relatively small datasets.
But what is algorithm convergence?
In simple words: if the solver's error stays within a very small range (i.e., it is almost not changing), then the algorithm has reached a solution. That solution is not necessarily the best one, since the algorithm might be stuck at a so-called "local optimum".
On the other hand, if the error varies noticeably (even if the error is relatively small, like in your case where the score was good) and the difference between the errors per iteration is greater than some tolerance, then we say the algorithm did not converge.
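This tolerance-based stopping rule can be sketched with a toy gradient-descent loop. This is only an illustration of the idea, not scikit-learn's actual lbfgs implementation; the function being minimised, the learning rate, and the tolerance are all made up for the example:

```python
def minimize(f, grad, x, lr=0.1, tol=1e-6, max_iter=100):
    """Toy gradient descent: stop when the loss change drops below tol."""
    prev = f(x)
    for i in range(1, max_iter + 1):
        x = x - lr * grad(x)           # one solver iteration
        cur = f(x)
        if abs(prev - cur) < tol:      # error almost not changing -> converged
            return x, i, True
        prev = cur
    return x, max_iter, False          # iteration limit reached -> warning case

# Minimising f(x) = (x - 3)**2 from x = 0 converges well before 100 iterations.
x, n_iter, converged = minimize(lambda v: (v - 3) ** 2,
                                lambda v: 2 * (v - 3), 0.0)
```

If `max_iter` were set too low for the problem, the loop would exit with `converged == False`, which is exactly the situation the ConvergenceWarning reports.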
Now, you need to know that the Scikit-Learn API sometimes gives the user the option to specify the maximum number of iterations the algorithm should take while it searches for the solution iteratively:
LogisticRegression(... solver='lbfgs', max_iter=100 ...)
As you can see, the default solver in LogisticRegression is 'lbfgs' and the maximum number of iterations is 100 by default.
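For instance, a larger iteration budget can be passed directly to the classifier. Here is a minimal sketch on synthetic data; the `max_iter` value of 1000 is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Give lbfgs up to 1000 iterations instead of the default 100.
clf = LogisticRegression(solver='lbfgs', max_iter=1000)
clf.fit(X, y)
print(clf.n_iter_)  # iterations actually used by the solver
```

After fitting, `n_iter_` shows how many iterations the solver actually needed, which tells you whether it stopped because it converged or because it ran out of budget.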
Final words: please note, however, that increasing the maximum number of iterations does not necessarily guarantee convergence, but it certainly helps!
Based on your comment below, some tips to try (out of many) that might help the algorithm to converge are:
- Increase the number of iterations: As in this answer;
- Try a different optimizer: Look here;
- Scale your data: Look here;
- Add engineered features: Look here;
- Data pre-processing: Look here - use case and here;
- Add more data: Look here.
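To illustrate the "try a different optimizer" and "scale your data" tips together, the available solvers can be compared on the same data. This is a sketch on synthetic data; the solver list and `max_iter` value are just examples:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)  # scaling also helps convergence

# Fit the same model with several solvers and record the training accuracy.
scores = {}
for solver in ['lbfgs', 'liblinear', 'newton-cg', 'saga']:
    clf = LogisticRegression(solver=solver, max_iter=500)
    clf.fit(X, y)
    scores[solver] = clf.score(X, y)
```

On well-scaled data all of these solvers usually converge to very similar solutions; the differences mainly show up in speed and in how sensitive each one is to unscaled features.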