I'm trying to write a custom autograd function in PyTorch.

But I'm having trouble deriving the analytical backpropagation for y = x / sum(x, dim=0),

where the tensor x has size (Height, Width), i.e. x is 2-dimensional.

Here's my code:

class MyFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        # save the unnormalized input; backward reads it from ctx.saved_tensors
        ctx.save_for_backward(input)
        return input / torch.sum(input, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        input = ctx.saved_tensors[0]
        sum = torch.sum(input, dim=0)
        # element-wise gradient (the formula in question)
        grad_input = grad_output * (1/sum - input*1/sum**2)
        return grad_input

I used gradcheck (from torch.autograd) to compare the analytical and numerical Jacobian matrices:

from torch.autograd import gradcheck
func = MyFunc.apply
input = (torch.randn(3,3,dtype=torch.double,requires_grad=True))
test = gradcheck(func, input)

and the result was

Could someone please help me get the correct backpropagation result?



Thanks for the answers!

With your help, I was able to implement backpropagation for an (H,W) tensor.

However, when I implemented backpropagation for an (N,H,W) tensor, I ran into a problem. I think the problem is in how the new tensor is initialized.

Here's my new code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyFunc(torch.autograd.Function):
  @staticmethod
  def forward(ctx, input):
    # backward reads the input from ctx.saved_tensors
    ctx.save_for_backward(input)

    N = input.size(0)
    for n in range(N):
      input[n] /= torch.sum(input[n], dim=0)

    return input

  @staticmethod
  def backward(ctx, grad_output):
    input = ctx.saved_tensors[0]
    N, H, W = input.size()
    I = torch.eye(H).unsqueeze(-1)
    sum = input.sum(1)

    grad_input = torch.zeros((N,H,W), dtype = torch.double, requires_grad=True)
    for n in range(N):
      grad_input[n] = ((sum[n] * I - input[n]) * grad_output[n] / sum[n]**2).sum(1)

    return grad_input

The gradcheck code is:

from torch.autograd import gradcheck
func = MyFunc.apply
input = (torch.rand(2,2,2,dtype=torch.double,requires_grad=True))
test = gradcheck(func, input)

and the result is

I don't know why the error occurs...

Your help would go a long way toward implementing my own convolutional network.

Thanks! Have a nice day.


Let's look at an example with a single column, for instance: [[x1], [x2], [x3]].

Let sum be x1 + x2 + x3, then normalizing x will give y = [[y1], [y2], [y3]] = [[x1/sum], [x2/sum], [x3/sum]]. You're looking for dL/dx1, dL/dx2, and dL/dx3 - we'll just write them as: dx1, dx2, and dx3. Same for all dL/dyi, written dy1, dy2, and dy3.

So dx1 is equal to dL/dy1*dy1/dx1 + dL/dy2*dy2/dx1 + dL/dy3*dy3/dx1. That's because x1 contributes to all output elements on the corresponding column: y1, y2, and y3.

We have:

  • dy1/dx1 = d(x1/sum)/dx1 = (sum - x1)/sum²

  • dy2/dx1 = d(x2/sum)/dx1 = -x2/sum²

  • similarly, dy3/dx1 = d(x3/sum)/dx1 = -x3/sum²

Therefore dx1 = (sum - x1)/sum²*dy1 - x2/sum²*dy2 - x3/sum²*dy3. Same for dx2 and dx3. As a result, the Jacobian is [dxi]_i = (sum - xi)/sum² and [dxi]_j = -xj/sum² (for all j different from i), where [dxi]_j denotes the coefficient of dyj in dxi.

In your implementation, you seem to be missing all non-diagonal components.
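To make the difference concrete, here is a minimal sketch that compares the element-wise formula from the question with the full expression derived above, on the same one-column example. The upstream gradient `dy` holds arbitrary values chosen for illustration, and `total` is just a name that avoids shadowing Python's built-in `sum`; neither comes from the original post.

```python
import torch

# One-column example from the text.
x = torch.tensor([[2.], [3.], [5.]], dtype=torch.double, requires_grad=True)
# Arbitrary upstream gradient dL/dy (illustrative values, an assumption).
dy = torch.tensor([[1.], [2.], [4.]], dtype=torch.double)

# Reference gradient computed by autograd.
y = x / x.sum(0)
y.backward(dy)

xd = x.detach()
total = xd.sum(0)

# Element-wise formula from the question: keeps only the dyi/dxi terms.
dx_diag_only = dy * (1 / total - xd / total**2)

# Full formula: dxi = dyi/sum - (x1*dy1 + x2*dy2 + x3*dy3)/sum**2,
# which is the expression for dx1 above with the terms regrouped.
dx_full = dy / total - (xd * dy).sum(0) / total**2

matches_full = torch.allclose(x.grad, dx_full)
matches_diag = torch.allclose(x.grad, dx_diag_only)
print(matches_full, matches_diag)
```

Only the full formula should agree with autograd, since the element-wise version drops the contribution of x1 to y2 and y3.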

Keeping the same one-column example, with x1=2, x2=3, and x3=5:

>>> x = torch.tensor([[2.], [3.], [5.]])

>>> sum = x.sum(0)
tensor([10.])

The Jacobian will be:

>>> J = (sum*torch.eye(x.size(0)) - x)/sum**2
tensor([[ 0.0800, -0.0200, -0.0200],
        [-0.0300,  0.0700, -0.0300],
        [-0.0500, -0.0500,  0.0500]])
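As a sanity check, the matrix above can be compared with the Jacobian that autograd computes. This is a sketch using `torch.autograd.functional.jacobian`; the variable name `total` (instead of `sum`) and the reshape to (3, 3) are choices made here, not part of the original answer.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.tensor([[2.], [3.], [5.]])
total = x.sum(0)  # named `total` to avoid shadowing the built-in `sum`

# Hand-written Jacobian from the text: entry [i, j] is dyi/dxj.
J = (total * torch.eye(x.size(0)) - x) / total**2

# jacobian() returns shape (3, 1, 3, 1) for a (3, 1) output and (3, 1) input;
# reshaping to (3, 3) gives the same [i, j] = dyi/dxj layout.
J_auto = jacobian(lambda t: t / t.sum(0), x).reshape(3, 3)

same = torch.allclose(J, J_auto, atol=1e-6)
print(same)
```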

For an implementation with multiple columns, it's a bit trickier, more specifically for the shape of the diagonal matrix. It's easier to keep the column axis last so we don't have to bother with broadcasting:

>>> x = torch.tensor([[2., 1], [3., 3], [5., 5]])
>>> sum = x.sum(0)
tensor([10.,  9.])

>>> diag = sum*torch.eye(3).unsqueeze(-1).repeat(1, 1, len(sum))
tensor([[[10.,  9.],
         [ 0.,  0.],
         [ 0.,  0.]],

        [[ 0.,  0.],
         [10.,  9.],
         [ 0.,  0.]],

        [[ 0.,  0.],
         [ 0.,  0.],
         [10.,  9.]]])

The diag tensor above has a shape of (3, 3, 2), where the two columns are on the last axis. Notice how we didn't need to reshape sum for it to broadcast.

What I wouldn't have done is: torch.eye(3).unsqueeze(0).repeat(len(sum), 1, 1). Since with this kind of shape - (2, 3, 3) - you will have to use sum[:, None, None], and will need further broadcasting down the road...
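The sketch below illustrates that trade-off by building the diagonal tensor in both layouts and subtracting x from each. The names `total`, `diag_last`, and `diag_first` are introduced here for illustration only.

```python
import torch

x = torch.tensor([[2., 1.], [3., 3.], [5., 5.]])
total = x.sum(0)  # shape (2,), one sum per column

# Column axis last, shape (3, 3, 2): `total` and `x` broadcast as they are.
diag_last = total * torch.eye(3).unsqueeze(-1).repeat(1, 1, len(total))
num_last = diag_last - x

# Column axis first, shape (2, 3, 3): both operands need explicit reshaping.
diag_first = total[:, None, None] * torch.eye(3).unsqueeze(0).repeat(len(total), 1, 1)
num_first = diag_first - x.T[:, None, :]

# Both layouts hold the same numbers, only with the axes in a different order.
same = torch.equal(num_last, num_first.permute(1, 2, 0))
print(num_last.shape, num_first.shape, same)
```

With the column axis first, a plain `diag_first - x` would not broadcast, because the trailing shapes (3, 3) and (3, 2) are incompatible.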

The Jacobian is simply:

>>> J = (diag - x)/sum**2
tensor([[[ 0.0800,  0.0988],
         [-0.0300, -0.0370],
         [-0.0500, -0.0617]],

        [[-0.0200, -0.0123],
         [ 0.0700,  0.0741],
         [-0.0500, -0.0617]],

        [[-0.0200, -0.0123],
         [-0.0300, -0.0370],
         [ 0.0500,  0.0494]]])

You can check the results by backpropagating through the operation using an arbitrary dy tensor (not torch.ones though: you'll get 0s, because the entries of J sum to zero along the output axis b). After backpropagating, x.grad should equal torch.einsum('abc,bc->ac', J, dy).
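Putting the pieces together, here is one possible sketch of a custom function whose backward pass uses the contracted form of J, namely dx = dy/sum - sum_b(x_b*dy_b)/sum², so the full (H, H, W) tensor is never materialized. The class name `ColumnNormalize`, the choice of `dim=-2`, and the test values are assumptions made for this example. Summing over `dim=-2` should also cover an (N, H, W) input normalized over H, though that case is checked here only numerically with gradcheck.

```python
import torch
from torch.autograd import gradcheck


class ColumnNormalize(torch.autograd.Function):
    """y = x / x.sum(dim=-2). Hypothetical name, not from the original post."""

    @staticmethod
    def forward(ctx, x):
        total = x.sum(dim=-2, keepdim=True)
        ctx.save_for_backward(x, total)
        return x / total  # out-of-place, so the saved x is left untouched

    @staticmethod
    def backward(ctx, grad_output):
        x, total = ctx.saved_tensors
        # J contracted with dy: dx = dy/sum - sum_b(x_b * dy_b)/sum**2
        weighted = (x * grad_output).sum(dim=-2, keepdim=True)
        return grad_output / total - weighted / total**2


# Compare with the explicit Jacobian from the text on the two-column example.
x = torch.tensor([[2., 1.], [3., 3.], [5., 5.]], dtype=torch.double, requires_grad=True)
dy = torch.tensor([[1., -1.], [2., 0.5], [4., 3.]], dtype=torch.double)  # arbitrary, not all ones
ColumnNormalize.apply(x).backward(dy)

xd = x.detach()
total = xd.sum(0)
J = (total * torch.eye(3, dtype=torch.double).unsqueeze(-1) - xd) / total**2
matches_einsum = torch.allclose(x.grad, torch.einsum('abc,bc->ac', J, dy))

# Numerical checks; strictly positive inputs keep the sums away from zero.
x2 = (torch.rand(3, 3, dtype=torch.double) + 0.5).requires_grad_()
x3 = (torch.rand(2, 3, 4, dtype=torch.double) + 0.5).requires_grad_()
ok_2d = gradcheck(ColumnNormalize.apply, (x2,))
ok_3d = gradcheck(ColumnNormalize.apply, (x3,))
print(matches_einsum, ok_2d, ok_3d)
```

Returning a new tensor from forward, rather than dividing the input in place, keeps the saved input valid for the backward pass.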

