Problem Description
In TensorFlow, can you use a non-smooth function as the loss function, for example a piece-wise function (or one written with if-else)? If not, why can ReLU be used?
In this link, it says:
"For example, we might want to minimize log loss, but our metrics of interest might be F1 score, or Intersection Over Union score (which are not differentiable, and therefore cannot be used as losses)."
Does "not differentiable" here mean not differentiable anywhere, as with such set-based scores? Because ReLU is also not differentiable at the point x = 0.
- If you use such a customized loss function, do you need to implement the gradient yourself, or can TensorFlow do it for you automatically? I checked some custom loss functions, and they did not implement gradients for their loss functions.
Recommended Answer
The problem is not that the loss is piece-wise or non-smooth. The problem is that we need a loss function that can send a non-zero gradient back to the network parameters (dloss/dparameter) when there is an error between the output and the expected output. This applies to almost any function used inside the model (e.g. loss functions, activation functions, attention functions).
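A minimal sketch of this (the function, threshold and variable names are illustrative, not from the original post): a piece-wise, if-else style loss written with tf.where. TensorFlow's automatic differentiation handles the branches, so you do not need to write the gradient yourself, and a non-zero gradient reaches the parameters whenever there is an error.

```python
import tensorflow as tf

# A piece-wise loss: squared error below a threshold, linear error above it
# (similar in spirit to the Huber loss). Purely illustrative.
def piecewise_loss(y_true, y_pred, delta=1.0):
    err = tf.abs(y_true - y_pred)
    return tf.reduce_mean(tf.where(err < delta,
                                   0.5 * tf.square(err),
                                   delta * err - 0.5 * delta ** 2))

y_true = tf.constant([1.0, 2.0, 3.0])
w = tf.Variable([0.5])                      # a toy "network parameter"

with tf.GradientTape() as tape:
    y_pred = w * tf.constant([1.0, 2.0, 3.0])
    loss = piecewise_loss(y_true, y_pred)

# TensorFlow differentiates the selected branch automatically.
print(tape.gradient(loss, w))               # non-zero gradient -> usable for training
```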
For example, perceptrons use the unit step H(x) as the activation function (H(x) = 1 if x > 0, else 0). Since the derivative of H(x) is always zero (and undefined at x = 0), no gradient coming from the loss can pass through it back to the weights (chain rule), so no weights before that function in the network can be updated with gradient descent. Because of that, gradient descent cannot be used for perceptrons, but it can be used for conventional neurons that use the sigmoid activation function (since the gradient is non-zero for all x).
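A small sketch of that difference, using a toy scalar variable (values chosen only for illustration): the unit step cuts the chain rule, while sigmoid lets gradients through.

```python
import tensorflow as tf

x = tf.Variable(0.7)

# Unit step H(x): its derivative is zero wherever it is defined,
# so no gradient can flow back through it.
with tf.GradientTape() as tape:
    step_out = tf.cast(x > 0, tf.float32)
print(tape.gradient(step_out, x))   # None: the step function cuts the chain rule

# Sigmoid: the derivative is non-zero for every x, so gradients reach earlier weights.
with tf.GradientTape() as tape:
    sig_out = tf.sigmoid(x)
print(tape.gradient(sig_out, x))    # ~0.22 at x = 0.7
```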
For ReLU, the derivative is 1 for x > 0 and 0 otherwise. While the derivative is undefined at x = 0, we can still back-propagate the loss gradient through it whenever x > 0. That is why it can be used.
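A quick check of this with tf.GradientTape (toy values chosen for illustration):

```python
import tensorflow as tf

x = tf.Variable([-2.0, 0.0, 3.0])

with tf.GradientTape() as tape:
    y = tf.reduce_sum(tf.nn.relu(x))

# Gradient is 0 where x < 0 and 1 where x > 0; at exactly x = 0 TensorFlow
# simply uses one of the one-sided values, so training is unaffected in practice.
print(tape.gradient(y, x))   # [0., 0., 1.]
```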
That is why we need a loss function with a non-zero gradient. Functions like accuracy and F1 have zero gradients almost everywhere (or are undefined at some values of x), so they cannot be used, while functions like cross-entropy, L2 and L1 have non-zero gradients, so they can be used. (Note that L1, the "absolute difference", is piece-wise and not smooth at x = 0, but it can still be used.)
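As a hedged example (model shape, optimizer and metric are arbitrary choices, not from the original answer), an L1 loss built from differentiable ops can be passed to Keras like any built-in loss, while metrics such as accuracy or MAE are only used for monitoring:

```python
import tensorflow as tf

# L1 "absolute difference": piece-wise and non-smooth at 0, yet its gradient
# (the sign of the error) is non-zero whenever there is an error, so it trains fine.
def l1_loss(y_true, y_pred):
    return tf.reduce_mean(tf.abs(y_true - y_pred))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=l1_loss, metrics=["mae"])
```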
If you must use a function that does not meet the above criteria, try reinforcement learning methods instead (e.g. policy gradient).
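A minimal REINFORCE-style sketch under assumed toy settings (the variable names, the ±1 reward shaping, and the single-step setup are illustrative, not from the original answer): the non-differentiable reward only scales the gradient of the log-probability of the sampled action, so no gradient ever needs to flow through the reward itself.

```python
import tensorflow as tf

logits = tf.Variable(tf.zeros([1, 3]))          # toy "policy" parameters
y_true = tf.constant([2], dtype=tf.int64)

with tf.GradientTape() as tape:
    sample = tf.random.categorical(logits, num_samples=1)     # sample an action
    action = tf.squeeze(sample, axis=-1)
    match = tf.cast(tf.equal(action, y_true), tf.float32)     # non-differentiable "metric"
    reward = 2.0 * match - 1.0                                 # +1 if correct, -1 otherwise
    log_prob = -tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=action, logits=logits)
    loss = -tf.reduce_mean(tf.stop_gradient(reward) * log_prob)

# Non-zero: the gradient flows through log_prob, not through the reward.
print(tape.gradient(loss, logits))
```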