PyTorch Series: Batch Normalization and Residual Networks | Convex Optimization | Gradient Descent

Batch Normalization and Residual Networks | Convex Optimization | Gradient Descent

Posted by Young on February 16, 2020

Task06: Batch Normalization and Residual Networks | Convex Optimization | Gradient Descent

Batch Normalization

Training deep neural nets is difficult. And getting them to converge in a reasonable amount of time can be tricky. In this section, we describe batch normalization (BN) [Ioffe & Szegedy, 2015], a popular and effective technique that consistently accelerates the convergence of deep nets.

  • Standardizing the input (shallow models)

    After preprocessing, every feature has mean 0 and standard deviation 1 over all samples in the dataset. Standardizing the input makes the distributions of the different features comparable (a minimal sketch follows after this list).

  • Batch normalization (deep models)

    Uses the mean and standard deviation of each mini-batch to continually adjust the intermediate outputs of the network, so that the values of the intermediate outputs at every layer stay numerically more stable.
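
As a quick illustration, dataset-level standardization of the inputs can be sketched as follows (a minimal example; the random feature matrix is made up for the demo):

import torch

X = torch.randn(1000, 5) * 3.0 + 2.0        # fake dataset: 1000 samples, 5 features
mu, sigma = X.mean(dim=0), X.std(dim=0)     # per-feature mean and std over the whole dataset
X_std = (X - mu) / sigma                    # every feature now has mean ~0 and std ~1
print(X_std.mean(dim=0), X_std.std(dim=0))  # means ~0, stds ~1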

Batch normalization for fully connected layers

  • For fully connected layers, compute the mean and variance of each feature over the examples in the mini-batch

  • Position: between the affine transformation and the activation function of the fully connected layer

    • Fully connected layer: \(\begin{aligned} \boldsymbol{x}=\boldsymbol{W} \boldsymbol{u}+\boldsymbol{b} \\ \text {output}=\phi(\boldsymbol{x}) \end{aligned}\)

    • Batch normalization: \(\begin{aligned} \text { output } &=\phi(\mathrm{BN}(\boldsymbol{x})) \\ \boldsymbol{y}^{(i)} &=\mathrm{BN}\left(\boldsymbol{x}^{(i)}\right) \\ \boldsymbol{\mu}_{\mathcal{B}} & \leftarrow \frac{1}{m} \sum_{i=1}^{m} \boldsymbol{x}^{(i)} \\ \boldsymbol{\sigma}_{\mathcal{B}}^{2} & \leftarrow \frac{1}{m} \sum_{i=1}^{m}\left(\boldsymbol{x}^{(i)}-\boldsymbol{\mu}_{\mathcal{B}}\right)^{2} \\ \hat{\boldsymbol{x}}^{(i)} & \leftarrow \frac{\boldsymbol{x}^{(i)}-\boldsymbol{\mu}_{\mathcal{B}}}{\sqrt{\boldsymbol{\sigma}_{\mathcal{B}}^{2}+\epsilon}} \end{aligned}\)

      Here $\epsilon > 0$ is a small constant that keeps the denominator positive. The layer then computes $\boldsymbol{y}^{(i)} \leftarrow \boldsymbol{\gamma} \odot \hat{\boldsymbol{x}}^{(i)}+\boldsymbol{\beta}$, which introduces two learnable parameters: the scale parameter $\boldsymbol{\gamma}$ and the shift parameter $\boldsymbol{\beta}$. If the network learns $\boldsymbol{\gamma}=\sqrt{\boldsymbol{\sigma}_{\mathcal{B}}^{2}+\epsilon}$ and $\boldsymbol{\beta}=\boldsymbol{\mu}_{\mathcal{B}}$, batch normalization has no effect, i.e., the original $\boldsymbol{x}^{(i)}$ is recovered. A small numerical check follows below.
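
As a quick sanity check of these formulas (a sketch; the toy mini-batch is made up), the per-feature statistics can be computed by hand and compared against PyTorch's built-in nn.BatchNorm1d in training mode, whose scale and shift are initialized to 1 and 0:

import torch
from torch import nn

x = torch.randn(8, 4)                          # mini-batch of 8 samples with 4 features
mu = x.mean(dim=0)                             # per-feature mean over the batch
var = x.var(dim=0, unbiased=False)             # per-feature (biased) variance
x_hat = (x - mu) / torch.sqrt(var + 1e-5)      # normalized activations

bn = nn.BatchNorm1d(4, eps=1e-5)               # gamma = 1, beta = 0 at initialization
bn.train()
print(torch.allclose(bn(x), x_hat, atol=1e-6)) # True (up to floating-point error)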

Batch normalization for convolutional layers

  • For 2D convolutional layers, compute the mean and variance on the channel dimension (axis=1), i.e., per channel across the batch, height, and width dimensions
  • Position: after the convolution and before the activation function
  • If the convolution produces multiple output channels, batch normalization is applied to each of these channels separately, and each channel has its own scale and shift parameters
    • The scale and shift parameters are learnable parameters, not hyperparameters

Batch normalization at prediction time

  • Training: statistics are computed per batch; each batch uses its own mean and variance. Prediction: moving averages accumulated during training are used to estimate the mean and variance of the whole training set.
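
The network shown after this sketch uses a custom BatchNorm layer (as in the course's d2lzh_pytorch code) rather than the built-in nn.BatchNorm2d/nn.BatchNorm1d. A minimal sketch of such a layer, assuming the moving-average scheme described above (details such as the momentum value 0.9 are assumptions), might look like:

import torch
from torch import nn

def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    if not is_training:
        # Prediction mode: normalize with the moving statistics accumulated during training.
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: per-feature statistics over the batch.
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # Convolutional layer: per-channel statistics over batch, height, and width.
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages that will be used at prediction time.
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift with the learnable parameters
    return Y, moving_mean, moving_var

class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        super(BatchNorm, self).__init__()
        shape = (1, num_features) if num_dims == 2 else (1, num_features, 1, 1)
        self.gamma = nn.Parameter(torch.ones(shape))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(shape))   # learnable shift
        self.moving_mean = torch.zeros(shape)          # not learned, updated by moving average
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        Y, self.moving_mean, self.moving_var = batch_norm(
            self.training, X, self.gamma, self.beta,
            self.moving_mean, self.moving_var, eps=1e-5, momentum=0.9)
        return Y

A LeNet-style network using this layer (and d2l.FlattenLayer from the helper package) can then be assembled as follows:
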
net = nn.Sequential(
            nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
            BatchNorm(6, num_dims=4),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2), # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            BatchNorm(16, num_dims=4),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),
            d2l.FlattenLayer(),
            nn.Linear(16*4*4, 120),
            BatchNorm(120, num_dims=2),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            BatchNorm(84, num_dims=2),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

Residual Networks (ResNet)

A problem in deep learning: once a deep CNN reaches a certain depth, blindly adding more layers does not bring any further improvement in classification performance; instead, the network converges more slowly and the accuracy becomes worse.

  • ResNet addresses the degradation problem of deep networks through residual learning, which allows us to train much deeper networks

  • Residual block

    Identity mapping

    Left: the ordinary block has to fit f(x) = x directly

    Right: the residual block only has to fit f(x) - x = 0, which makes it easier to capture small fluctuations around the identity mapping

    In a residual block, the input can propagate forward more quickly through the cross-layer shortcut connection

ResNet Block

  • ResNet follows VGG’s full 3×3 convolutional layer design. The residual block has two 3×3 convolutional layers with the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. Then, we skip these two convolution operations and add the input directly before the final ReLU activation function. This kind of design requires that the output of the two convolutional layers be of the same shape as the input, so that they can be added together. If we want to change the number of channels or the stride, we need to introduce an additional 1×1 convolutional layer to transform the input into the desired shape for the addition operation.

class Residual(nn.Module):  # This class is saved in the d2lzh_pytorch package for later reuse
    # You can specify the number of output channels, whether to use an extra 1x1 convolution to change the channel count, and the stride of the convolution.
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super(Residual, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        return F.relu(Y + X)
        
        
>>> blk = Residual(3, 3)
    X = torch.rand((4, 3, 6, 6))
    blk(X).shape # torch.Size([4, 3, 6, 6])
    
>>> blk = Residual(3, 6, use_1x1conv=True, stride=2)
    blk(X).shape # torch.Size([4, 6, 3, 3])

ResNet Model

  • The first two layers of ResNet are the same as those of the GoogLeNet we described before: the 7×7 convolutional layer with 64 output channels and a stride of 2 is followed by the 3×3 max pooling layer with a stride of 2. The difference is the batch normalization layer added after each convolutional layer in ResNet.

net = nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
        nn.BatchNorm2d(64), 
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
def resnet_block(in_channels, out_channels, num_residuals, first_block=False):
    if first_block:
        assert in_channels == out_channels # the first block keeps the channel count equal to the input channels
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(in_channels, out_channels, use_1x1conv=True, stride=2))
        else:
            blk.append(Residual(out_channels, out_channels))
    return nn.Sequential(*blk)

net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
net.add_module("resnet_block2", resnet_block(64, 128, 2))
net.add_module("resnet_block3", resnet_block(128, 256, 2))
net.add_module("resnet_block4", resnet_block(256, 512, 2))

net.add_module("global_avg_pool", d2l.GlobalAvgPool2d()) # GlobalAvgPool2d的输出: (Batch, 512, 1, 1)
net.add_module("fc", nn.Sequential(d2l.FlattenLayer(), nn.Linear(512, 10))) 

>>> 0  output shape:	 torch.Size([1, 64, 112, 112])
    1  output shape:	 torch.Size([1, 64, 112, 112])
    2  output shape:	 torch.Size([1, 64, 112, 112])
    3  output shape:	 torch.Size([1, 64, 56, 56])
    resnet_block1  output shape:	 torch.Size([1, 64, 56, 56])
    resnet_block2  output shape:	 torch.Size([1, 128, 28, 28])
    resnet_block3  output shape:	 torch.Size([1, 256, 14, 14])
    resnet_block4  output shape:	 torch.Size([1, 512, 7, 7])
    global_avg_pool  output shape:	 torch.Size([1, 512, 1, 1])
    fc  output shape:	 torch.Size([1, 10])

Densely Connected Networks (DenseNet)

  • The main difference between ResNet (left) and DenseNet (right) in cross-layer connections: use of addition and use of concatenation.
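
The two kinds of cross-layer connection can be contrasted on toy tensors (a minimal illustration; the shapes are arbitrary):

import torch

X = torch.rand(4, 3, 8, 8)
Y = torch.rand(4, 3, 8, 8)

print((X + Y).shape)                   # ResNet-style addition keeps the channel count: [4, 3, 8, 8]
print(torch.cat((X, Y), dim=1).shape)  # DenseNet-style concatenation stacks channels: [4, 6, 8, 8]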

Main building blocks: dense blocks and transition layers

  • Dense block: defines how the input and output are concatenated.

    number of output channels = number of input channels + (number of conv layers) × (output channels per conv layer)

    def conv_block(in_channels, out_channels):
        blk = nn.Sequential(nn.BatchNorm2d(in_channels), 
                            nn.ReLU(),
                            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        return blk
      
    class DenseBlock(nn.Module):
        def __init__(self, num_convs, in_channels, out_channels):
            super(DenseBlock, self).__init__()
            net = []
            for i in range(num_convs):
                in_c = in_channels + i * out_channels
                net.append(conv_block(in_c, out_channels))
            self.net = nn.ModuleList(net)
            self.out_channels = in_channels + num_convs * out_channels # compute the number of output channels
      
        def forward(self, X):
            for blk in self.net:
                Y = blk(X)
                X = torch.cat((X, Y), dim=1)  # concatenate the input and output along the channel dimension
            return X
              
      
    >>> blk = DenseBlock(2, 3, 10)
        X = torch.rand(4, 3, 8, 8)
        Y = blk(X)
        Y.shape # torch.Size([4, 23, 8, 8])
    
  • Transition layer: controls the number of channels so that it does not grow too large.

    A 1×1 convolution reduces the number of channels; an average pooling layer with stride 2 halves the height and width

    def transition_block(in_channels, out_channels):
        blk = nn.Sequential(
                nn.BatchNorm2d(in_channels), 
                nn.ReLU(),
                nn.Conv2d(in_channels, out_channels, kernel_size=1),
                nn.AvgPool2d(kernel_size=2, stride=2))
        return blk
      
    >>> blk = transition_block(23, 10)
        blk(Y).shape # torch.Size([4, 10, 4, 4])
    

Optimization and Deep Learning

Optimization and Estimation

  • Goal of an optimization method: the value of the loss function on the training set

  • Goal of deep learning: the value of the loss function on the test set (generalization)

  • Challenges of optimization in deep learning

    • Local minima

    • Saddle points

      • A saddle point is a point at which all first-order partial derivatives are zero and the Hessian matrix has both positive and negative eigenvalues (see the numeric check after this list)

    • Vanishing gradients
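
For example, f(x, y) = x^2 - y^2 has a saddle point at the origin; the definition above can be checked numerically with torch.autograd.functional.hessian (a minimal sketch; the test function is made up):

import torch
from torch.autograd.functional import hessian

def f(xy):
    x, y = xy
    return x ** 2 - y ** 2

H = hessian(f, torch.zeros(2))   # Hessian at the critical point (0, 0)
print(torch.linalg.eigvalsh(H))  # tensor([-2., 2.]): one negative and one positive eigenvalue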

Convexity

  • Sets

    • Sets are the basis of convexity. Simply put, a set $X$ in a vector space is convex if for any $a, b \in X$ the line segment connecting $a$ and $b$ is also in $X$. In mathematical terms this means that for all $\lambda \in[0,1]$ we have $\lambda \cdot a+(1-\lambda) \cdot b \in X$ whenever $a, b \in X$

    • Three shapes, the left one is nonconvex, the others are convex

    • The intersection between two convex sets is convex

    • The union of two convex sets need not be convex

  • Functions

    • Given a convex set $X$, a function defined on it $f: X \rightarrow \mathbb{R}$ is convex if for all $x, x^{\prime} \in X$ and for all $\lambda \in[0,1]$ we have $\lambda f(x)+(1-\lambda) f\left(x^{\prime}\right) \geq f\left(\lambda x+(1-\lambda) x^{\prime}\right)$

    • As expected, the cosine function is nonconvex, whereas the parabola and the exponential function are.
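
      A quick numeric check of the defining inequality on these three functions (a sketch; the sample points and lambda values are arbitrary):

      import torch

      def violates_convexity(f, xs, lams):
          # True if some pair (x, x') and some lambda violate
          # lam * f(x) + (1 - lam) * f(x') >= f(lam * x + (1 - lam) * x').
          for x in xs:
              for xp in xs:
                  for lam in lams:
                      lhs = lam * f(x) + (1 - lam) * f(xp)
                      rhs = f(lam * x + (1 - lam) * xp)
                      if lhs < rhs - 1e-6:
                          return True
          return False

      xs = torch.linspace(-5.0, 5.0, steps=21, dtype=torch.float64)
      lams = torch.linspace(0.0, 1.0, steps=11, dtype=torch.float64)
      print(violates_convexity(torch.cos, xs, lams))         # True: the cosine is not convex
      print(violates_convexity(lambda x: x ** 2, xs, lams))  # False: the parabola is convex
      print(violates_convexity(torch.exp, xs, lams))         # False: the exponential is convex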

  • Jensen’s Inequality

    • the expectation of a convex function is at least the convex function of the expectation (a quick numeric check follows after these bullets)

      \[\sum_{i} \alpha_{i} f\left(x_{i}\right) \geq f\left(\sum_{i} \alpha_{i} x_{i}\right) \text { and } E_{x}[f(x)] \quad \geq f\left(E_{x}[x]\right)\]
    • One of the common applications of Jensen’s inequality is with regard to the log-likelihood of partially observed random variables.

      \[E_{y \sim P(y)}[-\log P(x | y)] \geq-\log P(x)\]
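
    A quick numeric check of the expectation form, using the convex function f = exp and samples of a standard normal variable (a minimal sketch):

    import torch

    x = torch.randn(100000)      # samples of x ~ N(0, 1)
    print(torch.exp(x).mean())   # E[exp(x)] ~ e^0.5 ≈ 1.65
    print(torch.exp(x.mean()))   # exp(E[x]) ~ e^0 = 1
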
  • Properties of Convexity

    • No local minima: for a convex function, any local minimum is also a global minimum

      • Counterexample (for a nonconvex function there can be points whose value is smaller than a local minimum)

    • Relationship with convex sets: the sublevel set $S_{\lambda}=\{x \mid f(x) \leq \lambda\}$ of a convex function is convex

      • Counterexample (the sublevel set of a nonconvex function need not be convex)

    • Second-order condition

      $f^{\prime \prime}(x) \ge 0 \Longleftrightarrow f(x)$ is convex

      Necessity ($\Leftarrow$):

      For a convex function:

      \[\frac{1}{2} f(x+\epsilon)+\frac{1}{2} f(x-\epsilon) \geq f\left(\frac{x+\epsilon}{2}+\frac{x-\epsilon}{2}\right)=f(x)\]

      Hence:

      \[f^{\prime \prime}(x)=\lim _{\varepsilon \rightarrow 0} \frac{\frac{f(x+\epsilon) - f(x)}{\epsilon}-\frac{f(x) - f(x-\epsilon)}{\epsilon}}{\epsilon}\] \[f^{\prime \prime}(x)=\lim _{\varepsilon \rightarrow 0} \frac{f(x+\epsilon)+f(x-\epsilon)-2 f(x)}{\epsilon^{2}} \geq 0\]

      Sufficiency ($\Rightarrow$):

      Let $a < x < b$ be three points in the domain of $f$; by the Lagrange mean value theorem:

      \[\begin{array}{l}{f(x)-f(a)=(x-a) f^{\prime}(\alpha) \text { for some } \alpha \in[a, x] \text { and }} \\ {f(b)-f(x)=(b-x) f^{\prime}(\beta) \text { for some } \beta \in[x, b]}\end{array}\]

      Since $f^{\prime \prime}(x) \geq 0$ means that $f^{\prime}$ is monotonically non-decreasing, we have $f^{\prime}(\beta) \geq f^{\prime}(\alpha)$, hence:

      \[\begin{aligned} f(b)-f(a) &=f(b)-f(x)+f(x)-f(a) \\ &=(b-x) f^{\prime}(\beta)+(x-a) f^{\prime}(\alpha) \\ & \geq(b-a) f^{\prime}(\alpha) \end{aligned}\]

      Therefore $f^{\prime}(\alpha) \leq \frac{f(b)-f(a)}{b-a}$, so $f(x)=f(a)+(x-a) f^{\prime}(\alpha) \leq f(a)+\frac{x-a}{b-a}(f(b)-f(a))$, i.e., $f$ lies below the chord joining $(a, f(a))$ and $(b, f(b))$, which is exactly the convexity condition.
  • Constraints

    • One of the nice properties of convex optimization is that it allows us to handle constraints efficiently. That is, it allows us to solve problems of the form:

      \[\begin{array}{l}{\underset{\mathbf{x}}{\operatorname{minimize}} f(\mathbf{x})} \\ {\text { subject to } c_{i}(\mathbf{x}) \leq 0 \text { for all } i \in\{1, \ldots, N\}}\end{array}\]

      Here $f$ is the objective and the functions $c_{i}$ are constraint functions. To see what this does, consider the case where $c_{1}(\mathbf{x})=\|\mathbf{x}\|_{2}-1$. In this case the parameters $\mathbf{x}$ are constrained to the unit ball. If a second constraint is $c_{2}(\mathbf{x})=\mathbf{v}^{\top} \mathbf{x}+b$, then this corresponds to all $\mathbf{x}$ lying in a half-space. Satisfying both constraints simultaneously amounts to selecting a slice of a ball as the constraint set.

    • The constrained problem can be solved by introducing the Lagrangian with multipliers $\alpha_{i}$:

      \[L(\mathbf{x}, \alpha)=f(\mathbf{x})+\sum_{i} \alpha_{i} c_{i}(\mathbf{x}) \text { where } \alpha_{i} \geq 0\]
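
      As a minimal practical illustration of handling the constraint $c_{1}(\mathbf{x})=\|\mathbf{x}\|_{2}-1 \leq 0$, the sketch below uses projected gradient descent (projecting back onto the unit ball after each step) rather than an explicit Lagrangian; the quadratic objective and its target point are made up for the demo:

      import torch

      a = torch.tensor([2.0, 2.0])     # the unconstrained minimizer lies outside the unit ball

      def f(x):
          return ((x - a) ** 2).sum()  # objective: squared distance to a

      x = torch.zeros(2, requires_grad=True)
      eta = 0.1
      for _ in range(100):
          f(x).backward()
          with torch.no_grad():
              x -= eta * x.grad        # gradient step
              if x.norm() > 1:         # project back onto the feasible set ||x||_2 <= 1
                  x /= x.norm()
              x.grad.zero_()

      print(x)                         # ≈ (0.7071, 0.7071), the closest feasible point to a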

Gradient Descent

Gradient descent in one dimension

Claim: moving the variable in the direction opposite to the gradient reduces the function value.

Taylor expansion:

\[f(x+\epsilon)=f(x)+\epsilon f^{\prime}(x)+\mathcal{O}\left(\epsilon^{2}\right)\]

Substituting the displacement $\epsilon=-\eta f^{\prime}(x)$, i.e., a step of size $\eta$ against the gradient:

\[f\left(x-\eta f^{\prime}(x)\right)=f(x)-\eta f^{\prime 2}(x)+\mathcal{O}\left(\eta^{2} f^{\prime 2}(x)\right)\] \[f\left(x-\eta f^{\prime}(x)\right) \lesssim f(x)\] \[x \leftarrow x-\eta f^{\prime}(x)\]
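
A minimal numeric demonstration of the update $x \leftarrow x-\eta f^{\prime}(x)$ on $f(x)=x^{2}$ (the starting point and learning rate are arbitrary choices for the demo):

def gd(eta, x=10.0, num_iters=10):
    results = [x]
    for _ in range(num_iters):
        x -= eta * 2 * x  # f(x) = x**2, so f'(x) = 2x
        results.append(x)
    return results

print(gd(0.2))  # the iterates shrink geometrically towards the minimizer x* = 0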

Gradient descent in multiple dimensions

\[\nabla f(\mathbf{x})=\left[\frac{\partial f(\mathbf{x})}{\partial x_{1}}, \frac{\partial f(\mathbf{x})}{\partial x_{2}}, \dots, \frac{\partial f(\mathbf{x})}{\partial x_{d}}\right]^{\top}\] \[f(\mathbf{x}+\epsilon)=f(\mathbf{x})+\epsilon^{\top} \nabla f(\mathbf{x})+\mathcal{O}\left(\|\epsilon\|^{2}\right)\] \[\mathbf{x} \leftarrow \mathbf{x}-\eta \nabla f(\mathbf{x})\]

Adaptive methods

  • Newton's method

    • Taylor-expand $f(\mathbf{x}+\epsilon)$ around $\mathbf{x}$ to second order:

      \[f(\mathbf{x}+\epsilon)=f(\mathbf{x})+\epsilon^{\top} \nabla f(\mathbf{x})+\frac{1}{2} \epsilon^{\top} \nabla \nabla^{\top} f(\mathbf{x}) \epsilon+\mathcal{O}\left(\|\epsilon\|^{3}\right)\]

      At a minimum the gradient vanishes, i.e., $\nabla f(\mathbf{x})=0$, so we want $\nabla f(\mathbf{x}+\epsilon)=0$. Taking the gradient of the expansion above with respect to $\epsilon$ and ignoring higher-order terms gives:

      \[\nabla f(\mathbf{x})+\boldsymbol{H}_{f} \boldsymbol{\epsilon}=0 \text { and hence } \epsilon=-\boldsymbol{H}_{f}^{-1} \nabla f(\mathbf{x})\]
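
      A single Newton step $\epsilon=-\boldsymbol{H}_{f}^{-1} \nabla f(\mathbf{x})$ can be sketched with PyTorch's autograd utilities (the quadratic test function is made up; for a quadratic, one step lands exactly on the minimizer):

      import torch
      from torch.autograd.functional import hessian, jacobian

      def f(x):
          return x[0] ** 2 + 4 * x[1] ** 2       # minimizer at (0, 0)

      x = torch.tensor([3.0, 2.0])
      grad = jacobian(f, x)                      # gradient of f at x
      H = hessian(f, x)                          # Hessian of f at x
      x_new = x - torch.linalg.solve(H, grad)    # Newton update: x - H^{-1} grad
      print(x_new)                               # tensor([0., 0.]) for this quadratic
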
  • Preconditioning

    • Quite unsurprisingly computing and storing the full Hessian is very expensive. It is thus desirable to find alternatives. One way to improve matters is by avoiding to compute the Hessian in its entirety but only compute the diagonal entries. While this is not quite as good as the full Newton method, it is still much better than not using it. Moreover, estimates for the main diagonal elements are what drives some of the innovation in stochastic gradient descent optimization algorithms. This leads to update algorithms of the form

      \[\mathbf{x} \leftarrow \mathbf{x}-\eta \operatorname{diag}\left(H_{f}\right)^{-1} \nabla f(\mathbf{x})\]

Stochastic Gradient Descent

  • Parameter updates

    • For a training dataset with $n$ examples, let $f_i(\mathbf{x})$ denote the loss function of the $i$-th example; the objective function is:

      \[f(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} f_{i}(\mathbf{x})\]

      Its gradient is:

      \[\nabla f(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} \nabla f_{i}(\mathbf{x})\]

      One update using this full gradient has time complexity $\mathcal{O}(n)$.

      The stochastic gradient descent update, where the index $i$ is sampled uniformly at random, costs $\mathcal{O}(1)$ per update:

      \[\mathbf{x} \leftarrow \mathbf{x}-\eta \nabla f_{i}(\mathbf{x})\]

      and the stochastic gradient is an unbiased estimate of the full gradient:

      \[\mathbb{E}_{i} \nabla f_{i}(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} \nabla f_{i}(\mathbf{x})=\nabla f(\mathbf{x})\]

import time
import numpy as np
import torch
import d2lzh_pytorch as d2l  # course helper package (assumed available)

def sgd(params, states, hyperparams):
    for p in params:
        p.data -= hyperparams['lr'] * p.grad.data
        

def train_ch7(optimizer_fn, states, hyperparams, features, labels,
              batch_size=10, num_epochs=2):
    # initialize the model
    net, loss = d2l.linreg, d2l.squared_loss
    
    w = torch.nn.Parameter(torch.tensor(np.random.normal(0, 0.01, size=(features.shape[1], 1)), dtype=torch.float32), requires_grad=True)
    b = torch.nn.Parameter(torch.zeros(1, dtype=torch.float32), requires_grad=True)

    def eval_loss():
        return loss(net(features, w, b), labels).mean().item()

    ls = [eval_loss()]
    data_iter = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(features, labels), batch_size, shuffle=True)
    
    for _ in range(num_epochs):
        start = time.time()
        for batch_i, (X, y) in enumerate(data_iter):
            l = loss(net(X, w, b), y).mean()  # use the mean loss of the mini-batch
            
            # reset gradients to zero
            if w.grad is not None:
                w.grad.data.zero_()
                b.grad.data.zero_()
                
            l.backward()
            optimizer_fn([w, b], states, hyperparams)  # update the model parameters
            if (batch_i + 1) * batch_size % 100 == 0:
                ls.append(eval_loss())  # record the current training loss every 100 examples
    # print the final loss and plot the training curve
    print('loss: %f, %f sec per epoch' % (ls[-1], time.time() - start))
    d2l.set_figsize()
    d2l.plt.plot(np.linspace(0, num_epochs, len(ls)), ls)
    d2l.plt.xlabel('epoch')
    d2l.plt.ylabel('loss')
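
A hypothetical call, assuming features and labels have already been loaded as float32 tensors (the course uses a NASA airfoil noise dataset at this point; the learning rate below is arbitrary):

train_ch7(sgd, None, {'lr': 0.05}, features, labels, batch_size=10, num_epochs=2)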