This article looks at the question "What are the differences between all these cross-entropy losses in Keras and TensorFlow?" and its accepted answer.

Question

What are the differences between all these cross-entropy losses?

Keras is talking about

  • Binary cross-entropy
  • Categorical cross-entropy
  • Sparse categorical cross-entropy

While TensorFlow has

  • Softmax cross-entropy with logits
  • Sparse softmax cross-entropy with logits
  • Sigmoid cross-entropy with logits

What are the differences and relationships between them? What are their typical applications? What is the mathematical background? Are there other cross-entropy types one should know about? Are there any cross-entropy types without logits?

Answer

There is just one cross (Shannon) entropy, defined as:

H(P||Q) = - SUM_i P(X=i) log Q(X=i)

In machine learning usage, P is the actual (ground truth) distribution and Q is the predicted distribution. All the functions you listed are just helper functions that accept different ways of representing P and Q.
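As a quick numerical illustration of the formula above (plain NumPy; P and Q are made-up example distributions over three classes, not part of the original answer):

import numpy as np

# Made-up ground-truth distribution P and predicted distribution Q over 3 classes
P = np.array([1.0, 0.0, 0.0])      # hard target: the true class is 0
Q = np.array([0.7, 0.2, 0.1])      # the model's predicted probabilities

# H(P||Q) = - SUM_i P(X=i) log Q(X=i)
cross_entropy = -np.sum(P * np.log(Q))
print(cross_entropy)               # ~0.357, i.e. -log(0.7)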

There are basically 3 main things to consider:

  • there are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then Q(X=1) = 1 - Q(X=0), so a single float in (0,1) identifies the whole distribution; this is why a neural network for binary classification has a single output (and so does logistic regression). If there are K>2 possible outcomes, one has to define K outputs (one per Q(X=...))

  • one either produces proper probabilities (meaning that Q(X=i) >= 0 and SUM_i Q(X=i) = 1), or one just produces a "score" and has some fixed method of transforming the score into a probability. For example, a single real number can be "transformed to probability" by taking the sigmoid, and a set of real numbers can be transformed by taking their softmax, and so on (see the short sketch after this list)

  • there is a j such that P(X=j) = 1 (there is one "true class", targets are "hard", like "this image represents a cat"), or there are "soft targets" (like "we are 60% sure this is a cat, but for 40% it is actually a dog")
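The second point can be sketched in a few lines of NumPy (the score values here are arbitrary): a single score is turned into a probability with a sigmoid, and a vector of K scores with a softmax:

import numpy as np

# a single real-valued score (logit) -> probability via sigmoid
logit = 1.5
q1 = 1.0 / (1.0 + np.exp(-logit))      # Q(X=1), roughly 0.82
q0 = 1.0 - q1                          # Q(X=0) = 1 - Q(X=1)

# a vector of K scores -> probability distribution via softmax
logits = np.array([2.0, 1.0, 0.1])
exps = np.exp(logits - logits.max())   # subtract the max for numerical stability
probs = exps / exps.sum()              # non-negative and sums to 1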

Depending on these three aspects, a different helper function should be used:

                                  outcomes     what is in Q    targets in P
-------------------------------------------------------------------------------
binary CE                                2      probability         any
categorical CE                          >2      probability         soft
sparse categorical CE                   >2      probability         hard
sigmoid CE with logits                   2      score               any
softmax CE with logits                  >2      score               soft
sparse softmax CE with logits           >2      score               hard
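As a rough illustration of how the rows of this table map onto the actual helpers, here is a sketch assuming TensorFlow 2.x (the label and logit tensors are made-up examples, not part of the original answer):

import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.3]])   # raw scores for 3 classes
probs  = tf.nn.softmax(logits)             # proper probabilities
onehot = tf.constant([[1.0, 0.0, 0.0]])    # hard target as one-hot (a soft target looks the same, just not 0/1)
sparse = tf.constant([0])                  # hard target as a class index

# "probability" rows: the Keras losses expect probabilities by default
cce  = tf.keras.losses.CategoricalCrossentropy()(onehot, probs)
scce = tf.keras.losses.SparseCategoricalCrossentropy()(sparse, probs)

# "score" rows: the *_with_logits helpers take raw scores
softmax_ce = tf.nn.softmax_cross_entropy_with_logits(labels=onehot, logits=logits)
sparse_ce  = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=sparse, logits=logits)

# binary case: a single probability or a single logit per example
bin_label = tf.constant([[1.0]])
bin_logit = tf.constant([[0.7]])
bce        = tf.keras.losses.BinaryCrossentropy()(bin_label, tf.sigmoid(bin_logit))
sigmoid_ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=bin_label, logits=bin_logit)

Note that the Keras losses can also be constructed with from_logits=True, in which case they take raw scores directly and use the stable logits-based implementations internally.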

In the end, one could just use "categorical cross-entropy", as this is how it is mathematically defined; however, since things like hard targets or binary classification are very popular, modern ML libraries provide these additional helper functions to make things simpler. In particular, "stacking" sigmoid and cross-entropy might be numerically unstable, but if one knows these two operations are applied together, there is a numerically stable version of them combined (which is implemented in TF).
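A small sketch of that instability (the extreme logit value is chosen only to force float32 rounding; it is not from the original answer): stacking sigmoid and cross-entropy by hand produces an infinite loss, while the combined TF helper returns the correct finite value:

import numpy as np
import tensorflow as tf

logit = np.float32(-200.0)     # an extreme raw score
label = np.float32(1.0)        # the true class is "1"

# naive stack: sigmoid(-200) evaluates to 0.0 in float32, so log(0) = -inf
p = np.float32(1.0) / (np.float32(1.0) + np.exp(-logit))
naive_loss = -label * np.log(p)                        # inf

# combined, numerically stable version implemented in TF
stable_loss = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.constant(label), logits=tf.constant(logit))
print(naive_loss, stable_loss.numpy())                 # inf vs ~200.0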

It is important to notice that if you apply the wrong helper function, the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper to binary classification with one output, your network will be considered to always produce "True" at the output.
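That failure mode is easy to reproduce: the softmax of a single logit is always 1, so a softmax-based helper applied to a one-output network reports probability 1 for its only class no matter what the score is. A tiny sketch (the scores are arbitrary):

import numpy as np

for logit in [-5.0, 0.0, 5.0]:
    scores = np.array([logit])          # a single-output network
    e = np.exp(scores - scores.max())   # softmax over a length-1 vector
    print(e / e.sum())                  # prints [1.] every time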

As a final note, this answer considers classification. Things are slightly different in the multi-label case (when a single point can have multiple labels): then the Ps do not sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units.
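For that multi-label setting, a minimal sketch (hypothetical 4-label example, TensorFlow 2.x assumed) treats each output unit as its own binary problem:

import tensorflow as tf

# one example with 4 independent labels; more than one can be "on",
# so the targets do not sum to 1
labels = tf.constant([[1.0, 0.0, 1.0, 0.0]])
logits = tf.constant([[2.3, -1.2, 0.4, -3.0]])   # one raw score per label

# one sigmoid cross-entropy per label, despite multiple output units
per_label = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(per_label)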

