Question
I am employing L1 regularization on my neural network parameters in Keras with keras.regularizers.l1(0.01) to obtain a sparse model. I am finding that, while many of my coefficients are close to zero, few of them are actually zero.
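For reference, a minimal sketch of the kind of setup I mean (the toy data, layer sizes, and training budget are made up for illustration); after training, most kernel weights come out small but rarely exactly zero:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Toy data, purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 20)).astype("float32")
y = (x[:, 0] > 0).astype("float32")

dense = layers.Dense(32, activation="relu",
                     kernel_regularizer=regularizers.l1(0.01))
model = keras.Sequential([
    keras.Input(shape=(20,)),
    dense,
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, epochs=20, verbose=0)

w = dense.get_weights()[0]  # kernel matrix of the regularized layer
print("exactly zero:", int(np.sum(w == 0.0)), "of", w.size)
print("below 1e-3  :", int(np.sum(np.abs(w) < 1e-3)), "of", w.size)
```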
Looking at the source code for the regularization, it appears that Keras simply adds the L1 norm of the parameters to the loss function.
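Concretely, on that reading of the source, the regularizer just contributes a penalty of the form l1 * sum(|w|) to the loss, with no thresholding of the weights themselves; a rough sketch of the equivalent computation:

```python
import tensorflow as tf

def l1_penalty(weights, l1=0.01):
    # Roughly what keras.regularizers.l1(0.01) contributes to the loss:
    # a plain penalty term, no thresholding or clipping of the weights.
    return l1 * tf.reduce_sum(tf.abs(weights))

w = tf.constant([-0.3, 0.0, 0.2])
print(l1_penalty(w).numpy())  # roughly 0.01 * (0.3 + 0.0 + 0.2) = 0.005
```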
This would be incorrect, because the parameters would then almost certainly never become exactly zero (to within floating-point error), as intended with L1 regularization. The L1 norm is not differentiable when a parameter is zero, so subgradient methods need to be used, where a parameter is set to zero in the optimization routine if it gets close enough to zero. See the soft-thresholding operator max(0, ..) here.
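For comparison, here is a minimal sketch of the soft-thresholding (proximal) operator I have in mind, which is what actually drives small weights to exact zeros; the threshold value is illustrative:

```python
import numpy as np

def soft_threshold(w, thresh):
    # Soft-thresholding (the proximal operator of the L1 norm):
    # shrink every weight toward zero by `thresh`, and clamp it to
    # exactly zero once its magnitude falls below `thresh`.
    return np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)

w = np.array([-0.30, -0.004, 0.0, 0.006, 0.25])
print(soft_threshold(w, thresh=0.01))
# approximately [-0.29 -0.  0.  0.  0.24] -- the small weights collapse to zero
```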
Does TensorFlow/Keras do this, or is this impractical to do with stochastic gradient descent?
Also, here is a superb blog post explaining the soft-thresholding operator for L1 regularization.
Answer
So, despite @Joshua's answer, there are three other things worth mentioning:
- There is no problem connected with the gradient at 0: keras automatically sets it to 1, similarly to the relu case.
- Remember that values smaller than 1e-6 are effectively equal to 0, as this is float32 precision (a post-training pruning sketch based on this cutoff is given at the end of this answer).
- The problem of most values not being set exactly to 0 may arise for computational reasons, due to the nature of a gradient-descent based algorithm (and a high l1 value): oscillations can occur because of the gradient discontinuity. To see this, imagine that for a given weight w = 0.005 your learning rate is 0.01 and the gradient of the main loss with respect to w is 0. Your weight would then be updated in the following manner:
w = 0.005 - 1 * 0.01 = -0.005 (because the gradient is equal to 1, as w > 0),
and after the second update:
w = -0.005 + 1 * 0.01 = 0.005 (because the gradient is equal to -1, as w < 0).
As you can see, the absolute value of w has not decreased even though you applied l1 regularization, and this happens due to the nature of the gradient-based algorithm. Of course, this is a simplified situation, but you can run into such oscillating behavior quite often when using the l1 norm regularizer.
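A quick way to see this oscillation is to simulate the simplified update rule from the example above (the main-loss gradient is assumed to be 0 throughout, as in the example):

```python
# Simulating the simplified oscillation: the main loss contributes no
# gradient, so every step is just  w <- w - lr * sign(w)  from the L1 term.
lr = 0.01
w = 0.005
for step in range(6):
    grad_l1 = 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)
    w = w - lr * grad_l1
    print(f"step {step + 1}: w = {w:+.3f}")
# w keeps flipping between -0.005 and +0.005 instead of settling at 0.
```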
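Finally, given the float32 point above, a pragmatic workaround (plain post-processing, not something Keras' l1 regularizer does for you) is to clip near-zero weights to exact zeros after training; a hedged sketch using the 1e-6 cutoff mentioned earlier, with the helper name made up for illustration:

```python
import numpy as np

def sparsify_weights(model, cutoff=1e-6):
    # Zero out every weight whose magnitude is below `cutoff` and write the
    # result back into the model. This is post-processing, separate from
    # Keras' own L1 regularization.
    for layer in model.layers:
        pruned = [np.where(np.abs(w) < cutoff, 0.0, w)
                  for w in layer.get_weights()]
        layer.set_weights(pruned)
```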