Problem description
I have some questions about PyTorch's backward function; I don't think I'm getting the right output:
import numpy as np
import torch
from torch.autograd import Variable
a = Variable(torch.FloatTensor([[1,2,3],[4,5,6]]), requires_grad=True)
out = a * a
out.backward(a)
print(a.grad)
The output is
tensor([[ 2., 8., 18.],
[32., 50., 72.]])
which is maybe 2*a*a,
but I think the output should be
tensor([[ 2., 4., 6.],
[8., 10., 12.]])
i.e. 2*a, since d(x^2)/dx = 2x.
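For reference, a minimal sketch of the same snippet in current PyTorch, where the deprecated Variable wrapper is no longer needed and requires_grad is set directly on the tensor; it reproduces the same output:
import torch

# same computation without the (now deprecated) Variable wrapper
a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
out = a * a
out.backward(a)   # a is passed as the gradient argument
print(a.grad)     # tensor([[ 2.,  8., 18.], [32., 50., 72.]])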
Recommended answer
Please read the documentation on backward() carefully to better understand it.
By default, PyTorch expects backward() to be called on the last output of the network - the loss function. The loss function always outputs a scalar, and therefore the gradients of the scalar loss w.r.t. all other variables/parameters are well defined (using the chain rule).
Thus, by default, backward() is called on a scalar tensor and expects no arguments.
For example:
a = torch.tensor([[1,2,3],[4,5,6]], dtype=torch.float, requires_grad=True)
for i in range(2):
    for j in range(3):
        out = a[i,j] * a[i,j]
        out.backward()
print(a.grad)
yields
tensor([[ 2., 4., 6.],
[ 8., 10., 12.]])
As expected: d(a^2)/da = 2a.
However, when you call backward() on the 2-by-3 out tensor (no longer a scalar function), what do you expect a.grad to be? You would actually need a 2-by-3-by-2-by-3 output: d out[i,j] / d a[k,l] (!)
PyTorch does not support derivatives of non-scalar functions like this.
Instead, PyTorch assumes out is only an intermediate tensor and that somewhere "upstream" there is a scalar loss function which, through the chain rule, provides d loss / d out[i,j]. This "upstream" gradient has size 2-by-3, and in this case it is exactly the argument you pass to backward: out.backward(g), where g_ij = d loss / d out_ij.
The gradients are then calculated by the chain rule: d loss / d a[i,j] = (d loss / d out[i,j]) * (d out[i,j] / d a[i,j]).
Since you provided a as the "upstream" gradient, you got
a.grad[i,j] = 2 * a[i,j] * a[i,j]
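To make the chain rule concrete, here is a minimal sketch (assuming the same 2-by-3 tensor a as above) that applies the chain rule by hand and compares the result with what autograd produces:
import torch

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
out = a * a
g = a.detach().clone()          # the "upstream" gradient passed to backward()
out.backward(g)
# chain rule by hand: d loss / d a[i,j] = g[i,j] * d out[i,j] / d a[i,j] = g[i,j] * 2 * a[i,j]
manual = g * 2 * a.detach()
print(torch.allclose(a.grad, manual))   # True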
If you were instead to provide all ones as the "upstream" gradient:
out.backward(torch.ones(2,3))
print(a.grad)
this yields
tensor([[ 2., 4., 6.],
[ 8., 10., 12.]])
as expected.
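Passing an all-ones "upstream" gradient is equivalent to first summing out into a scalar loss and calling backward() on that, since d sum(out) / d out[i,j] = 1; a minimal sketch:
import torch

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
out = a * a
loss = out.sum()    # scalar loss; d loss / d out[i,j] = 1
loss.backward()
print(a.grad)       # tensor([[ 2.,  4.,  6.], [ 8., 10., 12.]])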
It is all about the chain rule.