PyTorch Study Notes (11)

1. To modify a tensor's values without having autograd record the change, operate on tensor.data.

Input:

import torch as t

a = t.ones(3, 4, requires_grad=True)
b = t.ones(3, 4, requires_grad=True)
c = a * b

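print(a.data)  # still the same tensor (shares storage with a)

print(a.data.requires_grad)  # but it is now outside the computation graph

d = a.data.sigmoid_()  # sigmoid_ is an in-place operation, so it changes the values of a itself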
print(a)
print(d.requires_grad)

print(a.requires_grad)

# Roughly equivalent to tensor = a.data, but if tensor is modified in place, backward may raise an error
tensor = a.detach()
print(tensor.requires_grad)

# Compute some statistics of the tensor without recording them in the graph
mean = tensor.mean()
std = tensor.std()
maximum = tensor.max()
print(mean, std, maximum)

tensor[0] = 1
print(a)
# The call below raises: RuntimeError: one of the variables needed for gradient
#              computation has been modified by an inplace operation.
# Because c = a * b, the gradient of b depends on a. Modifying tensor also modifies a, so the gradient would no longer be correct
# c.sum().backward()
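
The difference between the two access paths can be reproduced in a few lines. The following is a minimal sketch, assuming a recent PyTorch release; the variable names are illustrative only. An in-place write through detach() is caught when backward runs, while the same write through .data goes unnoticed and the gradient is computed from the modified values.

```python
import torch as t

# Case 1: an in-place write through detach() is detected by autograd.
a = t.ones(3, requires_grad=True)
b = t.ones(3, requires_grad=True)
c = a * b                      # the backward pass for b needs the saved value of a
a.detach()[0] = 5.0            # shares storage and the version counter with a
try:
    c.sum().backward()
    detach_error = None
except RuntimeError as err:
    detach_error = err         # "... modified by an inplace operation"
print('write through detach() raised an error:', detach_error is not None)

# Case 2: the same write through .data is not detected.
a2 = t.ones(3, requires_grad=True)
b2 = t.ones(3, requires_grad=True)
c2 = a2 * b2
a2.data[0] = 5.0               # not visible to autograd's version check
c2.sum().backward()
print(b2.grad)                 # computed from the modified a2, not the forward-time value
```

This is why detach() is generally the safer choice when a tensor only needs to be read outside the graph.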

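2. During backpropagation, the gradients of non-leaf nodes are freed as soon as they have been computed. To inspect the gradients of these variables, there are two approaches:

Use the autograd.grad function;
Use a hook.

The hook approach is recommended, but in practice you should avoid modifying grad values as far as possible.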

import torch as t

# First approach: use grad to obtain the gradient of an intermediate variable
x = t.ones(3, requires_grad=True)
w = t.rand(3, requires_grad=True)
y = w * x
z = y.sum()
# Gradient of z with respect to y; this implicitly runs a backward pass
print(t.autograd.grad(z, y))  # (tensor([1., 1., 1.]),)

import torch as t


# Second approach: use a hook
# A hook is a function that takes the gradient as input; this one returns nothing
def variable_hook(grad):
    print('gradient of y:', grad)


x = t.ones(3, requires_grad=True)
w = t.rand(3, requires_grad=True)
y = w * x
# Register the hook
hook_handle = y.register_hook(variable_hook)
z = y.sum()
z.backward()

# Unless you need the hook on every pass, remember to remove it when you are done
hook_handle.remove()
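
To see that the two approaches agree, the sketch below captures the gradient of a non-leaf tensor with a hook and compares it with the result of autograd.grad. The `captured` dictionary and the `save_grad` helper are hypothetical names introduced for this illustration, and the input values are arbitrary.

```python
import torch as t

captured = {}  # hypothetical container for gradients seen by hooks


def save_grad(name):
    # Build a hook that stores a copy of the incoming gradient under `name`.
    def hook(grad):
        captured[name] = grad.clone()
    return hook


x = t.ones(3, requires_grad=True)
w = t.tensor([0.5, 1.0, 2.0], requires_grad=True)
y = w * x                                  # non-leaf node
handle = y.register_hook(save_grad('y'))
z = (y * y).sum()                          # dz/dy = 2 * y

z.backward(retain_graph=True)              # the hook fires during this pass
handle.remove()                            # the hook is no longer needed

(grad_y,) = t.autograd.grad(z, y)          # same gradient, obtained directly
print(captured['y'])
print(grad_y)
```

Storing a clone inside the hook keeps the captured value independent of anything autograd does with the gradient afterwards.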

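3. Next, consider what a variable's grad attribute and the grad_variables argument of backward mean. (Current PyTorch names this argument gradient in Tensor.backward and grad_tensors in torch.autograd.backward.)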

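The gradient of a variable x is the gradient of the objective function $f(x)$ with respect to x, $\frac{df(x)}{dx}=\left(\frac{\partial f(x)}{\partial x_{0}},\frac{\partial f(x)}{\partial x_{1}},\ldots,\frac{\partial f(x)}{\partial x_{N}}\right)$, and it has the same shape as x;
In y.backward(grad_variables), grad_variables plays the role of $\frac{\partial z}{\partial y}$ in the chain rule $\frac{\partial z}{\partial x}=\frac{\partial z}{\partial y}\cdot\frac{\partial y}{\partial x}$. Here z is the objective function, usually a scalar, so $\frac{\partial z}{\partial y}$ has the same shape as the variable y, and z.backward() is to some extent equivalent to y.backward(grad_y). z.backward() can omit the grad_variables argument because z is a scalar and $\frac{\partial z}{\partial z}=1$.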

import torch as t

x = t.arange(0, 3).float()
x.requires_grad_()
y = x ** 2 + x * 2
z = y.sum()
z.backward()  # backpropagate starting from z
print(x.grad)  # tensor([2., 4., 6.])

import torch as t

x = t.arange(0, 3).float()
x.requires_grad_()
y = x ** 2 + x * 2
z = y.sum()
y_gradient = t.Tensor([1, 1, 1])  # dz/dy
y.backward(y_gradient)  # backpropagate starting from y
print(x.grad)  # tensor([2., 4., 6.])
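
With an all-ones vector the role of the argument is easy to miss, so here is a small sketch that uses a non-uniform $\frac{\partial z}{\partial y}$. The `weights` tensor is a hypothetical choice made for illustration; it corresponds to the objective z = sum(weights * y). Both routes should produce the same gradient for x.

```python
import torch as t

weights = t.tensor([1.0, 0.5, 2.0])  # hypothetical dz/dy, chosen for illustration

# Route 1: start from y and pass dz/dy explicitly.
x1 = t.arange(0, 3).float().requires_grad_()
y1 = x1 ** 2 + x1 * 2
y1.backward(weights)

# Route 2: build the scalar z = sum(weights * y) and start from z.
x2 = t.arange(0, 3).float().requires_grad_()
y2 = x2 ** 2 + x2 * 2
z2 = (weights * y2).sum()
z2.backward()

# Both equal weights * (2 * x + 2), evaluated at x = [0, 1, 2].
print(x1.grad)
print(x2.grad)
```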

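Also note that autograd only tracks operations performed on the variable itself. Operations applied directly to a variable's data are not recorded and cannot be backpropagated through. Apart from parameter initialization, we generally do not modify variable.data.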
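
A short sketch of both points, assuming a recent PyTorch release: initializing a parameter through .data leaves it a leaf that still requires gradients, while a result computed from .data carries no gradient information at all. The shapes and the standard deviation are arbitrary illustrative values.

```python
import torch as t

w = t.zeros(2, 3, requires_grad=True)
w.data.normal_(mean=0.0, std=0.01)   # initialization, the usual legitimate use of .data
print(w.is_leaf, w.requires_grad)    # w is still a leaf that requires gradients

tracked = (w * 2).sum()              # built from w itself, recorded by autograd
untracked = (w.data * 2).sum()       # built from w.data, not recorded
print(tracked.requires_grad, untracked.requires_grad)

tracked.backward()
print(w.grad)                        # every entry is 2
```

Calling untracked.backward() would raise an error instead, because that result has no grad_fn to start from.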

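Summary

The characteristics of the computation graph in PyTorch can be summarized as follows:

autograd builds the computation graph from the operations the user applies to variables, and each operation is abstracted as a Function;
Nodes that are created by the user rather than produced as the output of a function are called leaf nodes. A leaf node's grad_fn is None, and leaf variables that require gradients carry an AccumulateGrad marker because their gradients are accumulated;
By default a variable does not require gradients, that is, requires_grad defaults to False. If requires_grad is set to True on a node, every node that depends on it also has requires_grad set to True;
A variable's volatile attribute defaults to False. If volatile is set to True on a variable, every node that depends on it also has volatile set to True. Nodes with volatile set to True are not differentiated, and volatile takes precedence over requires_grad (volatile was removed in PyTorch 0.4, where torch.no_grad() serves the same purpose);
With repeated backward passes the gradients accumulate. The intermediate buffers of a backward pass are freed afterwards, so running backward more than once requires retain_graph=True to keep them;
Gradients of non-leaf nodes are freed as soon as they have been computed; autograd.grad or a hook can be used to obtain the gradients of non-leaf nodes;
A variable's grad has the same shape as its data. Avoid modifying variable.data directly, because operations on data cannot be backpropagated through by autograd;
The grad_variables argument of backward can be viewed as an intermediate result of the chain rule. If backward is called on a scalar, the argument can be omitted and defaults to 1;
PyTorch uses a dynamic graph design, which makes it easy to inspect the outputs of intermediate layers and to build the computation graph structure dynamically.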
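
Several of the points above can be checked directly. The sketch below is illustrative only and assumes a recent PyTorch release, so it uses torch.no_grad() where older code would have set volatile.

```python
import torch as t

x = t.ones(2, requires_grad=True)    # leaf node created by the user
y = x * 3                            # non-leaf node produced by an operation
print(x.is_leaf, x.grad_fn)          # a leaf node has grad_fn None
print(y.is_leaf, y.requires_grad)    # requires_grad propagates to dependent nodes

z = y.sum()
z.backward(retain_graph=True)        # keep the buffers for a second backward pass
first = x.grad.clone()
z.backward()                         # gradients accumulate into x.grad
second = x.grad.clone()
print(first, second)

x.grad.zero_()                       # reset the accumulated gradient before reuse

with t.no_grad():                    # nothing inside this block is recorded
    frozen = x * 3
print(frozen.requires_grad)
```

Because gradients accumulate, training loops normally zero them between iterations, as the x.grad.zero_() call does here.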

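4. Most functions can already be differentiated by autograd, but what if you need to write a complex function of your own that does not support automatic differentiation? In that case you write your own Function and implement its forward and backward passes.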

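Once you have implemented your own Function, you can use gradcheck to verify that the implementation is correct. gradcheck computes gradients by numerical approximation, so some error is to be expected; eps sets the finite-difference step size, and the atol and rtol arguments set how much error is tolerated.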

import torch as t
from torch.autograd import Function


class MultiplyAdd(Function):

    @staticmethod
    def forward(ctx, w, x, b):
        ctx.save_for_backward(w, x)
        output = w * x + b
        return output

    @staticmethod
    def backward(ctx, grad_output):
        w, x = ctx.saved_tensors
        grad_w = grad_output * x
        grad_x = grad_output * w
        grad_b = grad_output * 1
        return grad_w, grad_x, grad_b


x = t.ones(1)
w = t.rand(1, requires_grad=True)
b = t.rand(1, requires_grad=True)

# Forward pass
z = MultiplyAdd.apply(w, x, b)
# Backward pass
z.backward()
# x does not require gradients; backward still computes its gradient, but it is then discarded
print(x.grad, w.grad, b.grad)  # (None, tensor([1.]), tensor([1.]))

import torch as t
from torch.autograd import Function


class Sigmoid(Function):

    @staticmethod
    def forward(ctx, x):
        output = 1 / (1 + t.exp(-x))
        ctx.save_for_backward(output)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        output, = ctx.saved_tensors
        grad_x = output * (1 - output) * grad_output
        return grad_x


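# Verify the gradient formula by numerical approximation
# gradcheck is designed for double-precision inputs and warns on float32
test_input = t.randn(3, 4, dtype=t.double)
test_input.requires_grad_()
print(t.autograd.gradcheck(Sigmoid.apply, (test_input,), eps=1e-3))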
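
gradcheck is most useful when it catches a mistake. The sketch below defines two hypothetical Functions, GoodSquare and BadSquare, which are not part of the original notes; BadSquare deliberately drops a factor of 2 in its backward pass. Passing raise_exception=False makes gradcheck return False on a mismatch instead of raising.

```python
import torch as t
from torch.autograd import Function


class GoodSquare(Function):
    # Hypothetical example: y = x * x with the correct gradient 2 * x.

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * 2 * x


class BadSquare(Function):
    # Hypothetical example: same forward pass, deliberately wrong backward pass.

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * x   # wrong on purpose: the factor 2 is missing


# Fixed double-precision input so that the check is deterministic.
inp = t.linspace(-2, 2, 12, dtype=t.double).reshape(3, 4).requires_grad_()

good_ok = t.autograd.gradcheck(GoodSquare.apply, (inp,), eps=1e-6, atol=1e-4)
bad_ok = t.autograd.gradcheck(BadSquare.apply, (inp,), eps=1e-6, atol=1e-4,
                              raise_exception=False)
print(good_ok, bad_ok)
```

With the default raise_exception=True, the second call would raise an error that reports the numerical and analytical Jacobians, which helps locate the faulty term.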

Source of these notes: pytorch-book