Question
There are standard ways of predicting proportions, such as logistic regression (without thresholding) and beta regression. There have already been discussions about this:
http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression
I cannot tell if there exists a work-around within the sklearn framework.
Answer
There exists a workaround, but it is not intrinsically within the sklearn framework.
If you have a proportional target variable (value range 0-1), you run into two basic difficulties with scikit-learn:
- Classifiers (such as logistic regression) only handle class labels as target variables. As a workaround, you could threshold the probabilities to 0/1 and interpret them as class labels, but you would lose a lot of information.
- Regression models (such as linear regression) do not restrict the target variable. You can train them on proportion data, but there is no guarantee that the output on unseen data will stay within the 0-1 range. However, there is a strong workaround in this case (see below, and the short demonstration right after this list).
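A minimal demonstration of both difficulties on synthetic data (my own illustration, not part of the original answer; the data and numbers are arbitrary):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.uniform(-1.0, 1.0, size=(50, 1))
p = 0.5 * x.ravel() + 0.5  # proportions, all inside [0, 1] on this input range

# Difficulty 1: classifiers reject continuous targets outright, e.g.
# LogisticRegression().fit(x, p) raises ValueError ("Unknown label type:
# continuous"; the exact message may vary across sklearn versions).

# Difficulty 2: a plain linear regression extrapolates beyond the valid
# range on unseen inputs.
model = LinearRegression().fit(x, p)
print(model.predict([[2.0]]))  # ~[1.5], outside the 0-1 range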
There are different ways to mathematically formulate logistic regression. One of them is the generalized linear model, which basically defines logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.
In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -oo to oo and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed again to result in probabilities p = 1 / (exp(-y) + 1).
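As a side note (my addition, not part of the original answer): SciPy provides exactly these two transforms as scipy.special.logit and scipy.special.expit, which is handy for sanity-checking the formulas:

import numpy as np
from scipy.special import expit, logit

p = np.array([0.1, 0.5, 0.9])
y = logit(p)                                 # log(p / (1 - p))
print(np.allclose(y, np.log(p / (1 - p))))   # True
print(np.allclose(expit(y), p))              # True: expit inverts logit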
import numpy as np
from sklearn.linear_model import LinearRegression


class LogitRegression(LinearRegression):

    def fit(self, x, p):
        # train on the logit-transformed proportions
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        # map the linear predictions back into (0, 1)
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)
if __name__ == '__main__':
    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5

    model = LogitRegression()
    model.fit(x, p)
    print(model.predict([[-10], [0.0], [1]]))
    # [[ 2.06115362e-09]
    #  [ 5.00000000e-01]
    #  [ 8.80797078e-01]]
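One practical caveat (my addition): the logit transform is infinite at p = 0 and p = 1, so training data containing exact 0 or 1 proportions will break the fit. A common remedy is to clip the proportions away from the boundaries by a small epsilon before transforming; a minimal sketch, with eps as an arbitrary illustrative choice:

import numpy as np

def safe_logit(p, eps=1e-6):
    # clip so that log(p / (1 - p)) stays finite at the boundaries
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

print(safe_logit([0.0, 0.5, 1.0]))  # finite: roughly [-13.8, 0.0, 13.8]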
- There are also numerous other alternatives. Some non-linear regression models can work naturally in the 0-1 range. For example, Random Forest Regressors will never exceed the range of the target variables they were trained on (a short sketch follows below). Simply put probabilities in and you will get probabilities out. Neural networks with appropriate output activation functions (tanh, I guess) will also work well with probabilities, but if you want to use those there are more specialized libraries than sklearn.
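To illustrate the Random Forest point, a minimal sketch on synthetic data (my own example): tree predictions are averages of training targets, so they cannot leave the observed range.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(42)
x = rng.randn(200, 1)
p = np.tanh(x.ravel()) / 2 + 0.5  # probabilities in (0, 1)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(x, p)
pred = forest.predict([[-10], [0], [10]])
print(pred.min() >= 0, pred.max() <= 1)  # True True, even far outside the training inputs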
[*] You could in fact plug in any linear regression model, which can make the method more powerful, but then it is no longer exactly equivalent to logistic regression.
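As a generic way to follow that footnote (my addition, not the original answer's code): scikit-learn ships sklearn.compose.TransformedTargetRegressor, which wraps any regressor with a target transform and its inverse. Ridge is just an example base model here:

import numpy as np
from scipy.special import expit, logit
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# train the base regressor on logit(p), return expit(y) at predict time
model = TransformedTargetRegressor(regressor=Ridge(alpha=1.0),
                                   func=logit, inverse_func=expit)

x = np.linspace(-2, 2, 100).reshape(-1, 1)
p = np.tanh(x.ravel()) / 2 + 0.5
model.fit(x, p)
print(model.predict([[0.0]]))  # close to [0.5]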