I am implementing logistic regression using batch gradient descent. The input samples are to be classified into two classes: 1 and 0. While training the data, I am using the following sigmoid function:
t = 1 ./ (1 + exp(-z));
where
z = x*theta
And I am using the following cost function to calculate the cost and determine when to stop training.
function cost = computeCost(x, y, theta)
    htheta = sigmoid(x*theta);
    cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));
end
I am getting the cost at each step to be NaN, as the values of htheta are either 1 or 0 in most cases. What should I do to determine the cost value at each iteration?
This is the gradient descent code for logistic regression:
function [theta,cost_history] = batchGD(x,y,theta,alpha)
    cost_history = zeros(1000,1);
    for iter=1:1000
        htheta = sigmoid(x*theta);
        new_theta = zeros(size(theta,1),1);
        for feature=1:size(theta,1)
            % batch update for one parameter: step down the gradient of the cost
            new_theta(feature) = theta(feature) - alpha * sum((htheta - y) .* x(:,feature));
        end
        theta = new_theta;
        cost_history(iter) = computeCost(x,y,theta);
    end
end
There are two possible reasons why this may be happening to you.
The data is not normalized
This is because when you apply the sigmoid / logit function to your hypothesis, the output probabilities are almost all approximately 0 or 1, and with your cost function, log(1 - 1) or log(0) will produce -Inf. The accumulation of all of these individual terms in your cost function will eventually lead to NaN.
Specifically, if y = 0 for a training example and the output of your hypothesis is some value x very close to 0, then examining the first part of the cost function, -y .* log(htheta), gives us 0*log(x), which is 0*(-Inf) once x underflows to 0 and will in fact produce NaN. Similarly, if y = 1 for a training example and the output of your hypothesis is very close to 1, then 1 - htheta is a very small number x, and the second part, -(1-y) .* log(1-htheta), again gives us 0*log(x) and will produce NaN. Simply put, the output of your hypothesis is either very close to 0 or very close to 1.
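To see the mechanics concretely, here is a quick illustration of my own that you can reproduce at the MATLAB prompt:
log(0)         % -Inf
0 * log(0)     % 0 * (-Inf) evaluates to NaN under IEEE floating point rules
-1 * log(0)    % Inf; a single NaN term makes the whole sum in computeCost NaN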
This is most likely due to the fact that the dynamic range of each feature is widely different, so a part of your hypothesis, specifically the weighted sum x*theta for each training example, will give you either very large negative or positive values. If you apply the sigmoid function to these values, you'll get very close to 0 or 1.
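For instance (my own illustration, not part of the original answer), the sigmoid saturates well within the range of double precision:
sigmoid = @(z) 1 ./ (1 + exp(-z));
sigmoid(40)     % returns exactly 1: exp(-40) is ~4.2e-18, smaller than eps
sigmoid(-40)    % returns ~4.2e-18, still positive
sigmoid(-800)   % exp(800) overflows to Inf, so this returns exactly 0, and log(0) = -Inf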
One way to combat this is to normalize the data in your matrix before performing training using gradient descent. A typical approach is to normalize with zero mean and unit variance. Given an input feature x_k, where k = 1, 2, ... n and n is the number of features, the new normalized feature x_k^{new} can be found by:
x_k^{new} = (x_k - m_k) / s_k
Here, m_k is the mean of feature k and s_k is the standard deviation of feature k. This is also known as standardizing data. You can read up on more details about this on another answer I gave here: How does this code for standardizing data work?
Because you are using the linear algebra approach to gradient descent, I'm assuming you have prepended your data matrix with a column of all ones. Knowing this, we can normalize your data like so:
mX = mean(x,1);     % mean of each column (feature)
mX(1) = 0;          % don't shift the bias column of ones
sX = std(x,[],1);   % standard deviation of each column (feature)
sX(1) = 1;          % don't scale the bias column
xnew = bsxfun(@rdivide, bsxfun(@minus, x, mX), sX);   % per column: (x - mX) ./ sX
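As a side note (my addition, not part of the original answer): on MATLAB R2016b or newer, implicit expansion lets you write the same normalization without bsxfun:
xnew = (x - mX) ./ sX;   % equivalent to the bsxfun version on R2016b+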
The mean and standard deviation of each feature are stored in mX and sX respectively. You can learn how this code works by reading the post I linked to above; I won't repeat that material here because it is outside the scope of this post. To ensure proper normalization, I've set the mean and standard deviation of the first column to 0 and 1 respectively. xnew contains the new normalized data matrix; use xnew with your gradient descent algorithm instead. Now once you find the parameters, to perform any predictions you must normalize any new test instances with the mean and standard deviation from the training set. Because the parameters learned are with respect to the statistics of the training set, you must also apply the same transformations to any test data you want to submit to the prediction model.
Assuming you have new data points stored in a matrix called xx, you would normalize them and then perform the predictions:
xxnew = bsxfun(@rdivide, bsxfun(@minus, xx, mX), sX);
Now that you have this, you can perform your predictions:
pred = sigmoid(xxnew*theta) >= 0.5;
You can change the threshold of 0.5 to whatever you believe best determines whether examples belong in the positive or negative class.
The learning rate is too large
As you mentioned in the comments, once you normalize the data the costs appear to be finite, but then suddenly go to NaN after a few iterations. Normalization can only get you so far. If your learning rate or alpha is too large, each iteration will overshoot in the direction of the minimum, making the cost at each iteration oscillate or even diverge, which appears to be what is happening. In your case, the cost is diverging or increasing at each iteration to the point where it is so large that it can't be represented using floating point precision.
As such, one other option is to decrease your learning rate alpha until you see that the cost function is decreasing at each iteration. A popular method to determine the best learning rate is to perform gradient descent over a range of logarithmically spaced values of alpha, see what the final cost function value is for each, and choose the learning rate that resulted in the smallest cost.
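A minimal sketch of that search, assuming the batchGD and computeCost functions from the question and the normalized matrix xnew from above (the particular range of alphas is just an illustrative guess):
alphas = logspace(-5, 0, 6);          % candidate learning rates: 1e-5, 1e-4, ..., 1
finalCosts = zeros(size(alphas));
for i = 1:numel(alphas)
    theta0 = zeros(size(xnew,2), 1);  % start each run from the same initial theta
    [~, hist] = batchGD(xnew, y, theta0, alphas(i));
    finalCosts(i) = hist(end);        % cost after the last iteration
end
[~, best] = min(finalCosts);          % min skips NaN entries from diverged runs
alpha = alphas(best);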
Using the two facts above together should allow gradient descent to converge quite nicely, assuming that the cost function is convex. In this case for logistic regression, it most certainly is.