ResNet Explained

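ResNet is a landmark paper published by Kaiming He and colleagues in 2015, and its influence has been far-reaching. Hats off to the authors.
Here is the original paper first: https://arxiv.org/abs/1512.03385
Background of ResNet
Why is deep learning called "deep" learning in the first place? Because its models have many layers, and in general the more layers a network has, the stronger its learning capacity. My understanding is that a deep learning model is just a complicated function, and the more parameters it has, the more cases that function can fit (given a suitable structure, of course). Hence the saying: the deeper the better.
But does simply stacking more layers work? The answer is no. The first obstacle is the notorious pair of vanishing and exploding gradients. These two problems have been studied continuously, and the fixes proposed for them, such as better activation functions like ReLU and Mish, have greatly advanced deep learning. The paper points out, though, that even when normalized initialization and normalization layers let very deep plain networks converge, their accuracy still degrades as depth grows.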

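[Figure: experimental results for 20-layer and 56-layer networks, from the paper]
This is the original figure from the paper. The 56-layer model has a clearly higher error rate than the 20-layer model, on both the training set and the test set. The cause is not overfitting, since an overfit model would show a lower training error, not a higher one. As mentioned above, a deeper model has more parameters, which makes it harder to optimize. My understanding is that with more parameters the model can fit more functions, but nothing guarantees that it ends up fitting the one we want, or that optimization finds a good fit at all.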
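The paper then proposes residual learning. Don't be intimidated by the fancy name; the idea is not that complicated.
Originally, a stack of layers can be written as a function x -> H(x): the input is x and the output is H(x). Learning H(x) directly becomes too hard when the network is very deep. So instead we let the model learn a difference, x -> F(x) + x, where F(x) = H(x) - x. Intuitively, part of the learning task is carried by x itself. Here is an example that is not fully rigorous but helps the intuition: with x = 100 and H(x) = 100.1, we have F(x) = 0.1, so the layers only need to learn 0.1 instead of 100.1.
The paper also brings up identity mapping, i.e. the function x -> x. With residual learning the layers only need to drive F(x) to 0, which is easier than fitting H(x) = x with a stack of nonlinear layers.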
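To make the identity-mapping argument concrete, here is a minimal toy sketch in plain Python. The function `residual_block` and its elementwise "weights" are hypothetical stand-ins for illustration only; a real ResNet block uses stacked convolutions for F(x).

```python
def residual_block(x, weights):
    """Toy residual block on a list of floats: y = F(x) + x.

    F is a hypothetical elementwise linear map followed by ReLU; real
    ResNet blocks use stacked convolutions instead.
    """
    f_x = [max(0.0, w * v) for w, v in zip(weights, x)]
    return [f + v for f, v in zip(f_x, x)]


x = [100.0, -3.0, 0.5]

# With all weights at zero, F(x) = 0 and the block is an identity mapping.
identity_out = residual_block(x, [0.0, 0.0, 0.0])

# A small residual only nudges the input: roughly 100 -> 100.1 as in the text.
nudged_out = residual_block(x, [0.001, 0.0, 0.0])
```

The point of the sketch is that "do nothing" corresponds to all-zero weights, which is an easy target for an optimizer, whereas a plain block would have to learn weights that reproduce x exactly.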
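The figure below shows a simple example: x is taken out and added to the convolution output F(x), giving the final result F(x) + x. The paper calls this connection a "shortcut". The actual code, covered later, is fairly simple too.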

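[Figure: a residual block with a shortcut connection]
ResNet network architecture
ResNet comes in several versions that differ in the number of layers, but the overall structure matches the rightmost network in the figure below. "conv" stands for convolution, 7x7 and 3x3 are kernel sizes, "pool" stands for pooling, and /2 means the height and width of the feature map are halved. The network starts with a 7x7 convolution followed by a 3x3 pooling layer, and then a long stack of 3x3 convolutions with shortcuts.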
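The effect of the repeated /2 steps can be traced with a few lines of arithmetic. This is only a sketch: the helper names `halve` and `stage_sizes` are made up here, the stage names follow the paper's architecture table, and it assumes every stride-2 layer outputs ceil(size / 2).

```python
import math


def halve(size, stride=2):
    """Output size of a strided layer, assuming 'SAME'-style padding."""
    return math.ceil(size / stride)


def stage_sizes(input_size=224):
    """Trace the feature-map height/width through the stages of the paper."""
    sizes = {}
    size = halve(input_size)      # 7x7 conv, /2
    sizes["conv1"] = size
    size = halve(size)            # 3x3 max pool, /2
    sizes["conv2_x"] = size
    for name in ("conv3_x", "conv4_x", "conv5_x"):
        size = halve(size)        # each later stage downsamples once
        sizes[name] = size
    return sizes


sizes = stage_sizes(224)
```

For a 224x224 input the feature map shrinks by a factor of 32 overall before global pooling.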

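[Figure: ResNet network architecture]
Code walkthrough
Code: resnet_utils.py, resnet_v1.py
If the above still feels hazy, looking at code is more concrete. So let's walk through the ResNet50 implementation in the slim library.
The figure below shows the network configurations of the ResNet variants. We mainly care about the 50-layer column.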

[Figure: network configurations of the ResNet variants]
In resnet_v1.py, look at the resnet_v1_50 function, which is the entry point for building ResNet50.

def resnet_v1_50(inputs,
                 num_classes=None,
                 is_training=True,
                 global_pool=True,
                 output_stride=None,
                 spatial_squeeze=True,
                 store_non_strided_activations=False,
                 reuse=None,
                 scope='resnet_v1_50'):
  """ResNet-50 model of [1]. See resnet_v1() for arg and return description."""
  blocks = [
      resnet_v1_block('block1', base_depth=64, num_units=3, stride=2),
      resnet_v1_block('block2', base_depth=128, num_units=4, stride=2),
      resnet_v1_block('block3', base_depth=256, num_units=6, stride=2),
      resnet_v1_block('block4', base_depth=512, num_units=3, stride=1),
  ]
  return resnet_v1(inputs, blocks, num_classes, is_training,
                   global_pool=global_pool, output_stride=output_stride,
                   include_root_block=True, spatial_squeeze=spatial_squeeze,
                   store_non_strided_activations=store_non_strided_activations,
                   reuse=reuse, scope=scope)
resnet_v1_50.default_image_size = resnet_v1.default_image_size

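The implementation is organized around blocks and bottleneck units, which makes the code look quite "elegant" compared with stacking layers one by one. The four resnet_v1_block calls correspond to conv2_x through conv5_x in the architecture table.
Now look at resnet_v1_block. It returns a Block made of num_units bottleneck units: the first num_units - 1 units use stride 1 and the last one uses the block's stride. These arguments are consumed later by the bottleneck function.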

def resnet_v1_block(scope, base_depth, num_units, stride):
  return resnet_utils.Block(scope, bottleneck, [{
      'depth': base_depth * 4,
      'depth_bottleneck': base_depth,
      'stride': 1
  }] * (num_units - 1) + [{
      'depth': base_depth * 4,
      'depth_bottleneck': base_depth,
      'stride': stride
  }])
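To see what that list of dictionaries looks like, and why the model is called ResNet50, the argument construction can be mirrored in plain Python without TensorFlow. The helper `unit_args` and the constant `RESNET50_BLOCKS` are hypothetical names used only for this sketch; the numbers are the ones passed in resnet_v1_50.

```python
def unit_args(base_depth, num_units, stride):
    """Mirror the per-unit keyword arguments built by resnet_v1_block."""
    unit = {"depth": base_depth * 4, "depth_bottleneck": base_depth}
    return ([dict(unit, stride=1) for _ in range(num_units - 1)]
            + [dict(unit, stride=stride)])


# (base_depth, num_units, stride) as passed in resnet_v1_50.
RESNET50_BLOCKS = [(64, 3, 2), (128, 4, 2), (256, 6, 2), (512, 3, 1)]

block1 = unit_args(*RESNET50_BLOCKS[0])
total_units = sum(num_units for _, num_units, _ in RESNET50_BLOCKS)

# Each bottleneck unit has 3 conv layers; add conv1 and the final logits layer.
total_layers = 3 * total_units + 2
```

With 3 + 4 + 6 + 3 = 16 units, the count comes out to 3 * 16 + 2 = 50 weighted layers. Note that only the last unit of block1 carries the stride of 2.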

Next, resnet_v1_50 calls the resnet_v1() function. Let's keep reading.

def resnet_v1(inputs,
              blocks,
              num_classes=None,
              is_training=True,
              global_pool=True,
              output_stride=None,
              include_root_block=True,
              spatial_squeeze=True,
              store_non_strided_activations=False,
              reuse=None,
              scope=None):
  with tf.variable_scope(scope, 'resnet_v1', [inputs], reuse=reuse) as sc:
    end_points_collection = sc.original_name_scope + '_end_points'
    with slim.arg_scope([slim.conv2d, bottleneck,
                         resnet_utils.stack_blocks_dense],
                        outputs_collections=end_points_collection):
      with (slim.arg_scope([slim.batch_norm], is_training=is_training)
            if is_training is not None else NoOpScope()):
        net = inputs
        if include_root_block:
          if output_stride is not None:
            if output_stride % 4 != 0:
              raise ValueError('The output_stride needs to be a multiple of 4.')
            output_stride /= 4
          net = resnet_utils.conv2d_same(net, 64, 7, stride=2, scope='conv1')
          net = slim.max_pool2d(net, [3, 3], stride=2, scope='pool1')
        net = resnet_utils.stack_blocks_dense(net, blocks, output_stride,
                                              store_non_strided_activations)
        # Convert end_points_collection into a dictionary of end_points.
        end_points = slim.utils.convert_collection_to_dict(
            end_points_collection)

        if global_pool:
          # Global average pooling.
          net = tf.reduce_mean(net, [1, 2], name='pool5', keep_dims=True)
          end_points['global_pool'] = net
        if num_classes:
          net = slim.conv2d(net, num_classes, [1, 1], activation_fn=None,
                            normalizer_fn=None, scope='logits')
          end_points[sc.name + '/logits'] = net
          if spatial_squeeze:
            net = tf.squeeze(net, [1, 2], name='SpatialSqueeze')
            end_points[sc.name + '/spatial_squeeze'] = net
          end_points['predictions'] = slim.softmax(net, scope='predictions')
        return net, end_points
resnet_v1.default_image_size = 224

The code is fairly long, and I removed some of the comments. Let's go through it piece by piece.

if include_root_block:
  if output_stride is not None:
    if output_stride % 4 != 0:
      raise ValueError('The output_stride needs to be a multiple of 4.')
    output_stride /= 4
  # 7x7 convolution
  net = resnet_utils.conv2d_same(net, 64, 7, stride=2, scope='conv1')
  # pooling
  net = slim.max_pool2d(net, [3, 3], stride=2, scope='pool1')
  net = slim.utils.collect_named_outputs(end_points_collection, 'pool2', net)

This part of the code is the initial 7x7 convolution and pooling of ResNet. The thing to pay attention to is the conv2d_same function.

def conv2d_same(inputs, num_outputs, kernel_size, stride, rate=1, scope=None):
  if stride == 1:
    return slim.conv2d(inputs, num_outputs, kernel_size, stride=1, rate=rate,
                       padding='SAME', scope=scope)
  else:
    kernel_size_effective = kernel_size + (kernel_size - 1) * (rate - 1)
    pad_total = kernel_size_effective - 1
    pad_beg = pad_total // 2
    pad_end = pad_total - pad_beg
    inputs = tf.pad(inputs,
                    [[0, 0], [pad_beg, pad_end], [pad_beg, pad_end], [0, 0]])
    return slim.conv2d(inputs, num_outputs, kernel_size, stride=stride,
                       rate=rate, padding='VALID', scope=scope)

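This function makes sure the output size after the convolution is predictable: under a stride-2 convolution, the height and width of the result must be half of the input's. When the stride is greater than 1, it pads the input explicitly with tf.pad and then runs a 'VALID' convolution. Note that the tensor has 4 dimensions, including batch_size, and only the h and w dimensions are padded.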
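The padding arithmetic can be checked by hand with a small sketch that copies the formulas from conv2d_same into plain Python. The helper names `fixed_padding` and `valid_conv_output` are hypothetical, and the output-size formula is the standard one for a convolution without implicit padding.

```python
def fixed_padding(kernel_size, rate=1):
    """Padding amounts (begin, end) used by conv2d_same when stride > 1."""
    kernel_size_effective = kernel_size + (kernel_size - 1) * (rate - 1)
    pad_total = kernel_size_effective - 1
    pad_beg = pad_total // 2
    pad_end = pad_total - pad_beg
    return pad_beg, pad_end


def valid_conv_output(size, kernel_size, stride, rate=1):
    """Output size of a 'VALID' convolution applied after the explicit padding."""
    pad_beg, pad_end = fixed_padding(kernel_size, rate)
    kernel_size_effective = kernel_size + (kernel_size - 1) * (rate - 1)
    padded = size + pad_beg + pad_end
    return (padded - kernel_size_effective) // stride + 1


stem_padding = fixed_padding(7)                 # 7x7 conv in the root block
stem_output = valid_conv_output(224, 7, 2)      # 224 -> 230 padded -> output
```

For the 7x7 stem convolution this pads 3 pixels on each side, so a 224-pixel side becomes 230 and the stride-2 'VALID' convolution yields (230 - 7) // 2 + 1 = 112. The padding depends only on the kernel size and rate, not on the input size.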
Now for the main event. The single line below is the heart of the whole network.

net = resnet_utils.stack_blocks_dense(net, blocks, output_stride)

It calls the stack_blocks_dense function, so let's look at that next.

def stack_blocks_dense(net, blocks, output_stride=None,
                       store_non_strided_activations=False,
                       outputs_collections=None):
  current_stride = 1

  # The atrous convolution rate parameter.
  rate = 1

  for block in blocks:
    with tf.variable_scope(block.scope, 'block', [net]) as sc:
      block_stride = 1
      for i, unit in enumerate(block.args):
        if store_non_strided_activations and i == len(block.args) - 1:
          # Move stride from the block's last unit to the end of the block.
          block_stride = unit.get('stride', 1)
          unit = dict(unit, stride=1)

        with tf.variable_scope('unit_%d' % (i + 1), values=[net]):
          if output_stride is not None and current_stride == output_stride:
            net = block.unit_fn(net, rate=rate, **dict(unit, stride=1))
            rate *= unit.get('stride', 1)

          else:
            net = block.unit_fn(net, rate=1, **unit)
            current_stride *= unit.get('stride', 1)
            if output_stride is not None and current_stride > output_stride:
              raise ValueError('The target output_stride cannot be reached.')

      # Collect activations at the block's end before performing subsampling.
      net = slim.utils.collect_named_outputs(outputs_collections, sc.name, net)

      # Subsampling of the block's output activations.
      if output_stride is not None and current_stride == output_stride:
        rate *= block_stride
      else:
        net = subsample(net, block_stride)
        current_stride *= block_stride
        if output_stride is not None and current_stride > output_stride:
          raise ValueError('The target output_stride cannot be reached.')

  if output_stride is not None and current_stride != output_stride:
    raise ValueError('The target output_stride cannot be reached.')

  return net

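Two variables here, output_stride and current_stride, control the size of the output feature map. For example, current_stride = 2 means the height and width of the output are 1/2 of the input's. Normally you don't set output_stride when using ResNet, though you can if you have special needs. (Once output_stride is reached, the remaining strided convolutions are replaced by stride-1 atrous (dilated) convolutions, and the atrous rate is multiplied by each stride that was skipped.) We will ignore output_stride from here on.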
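The stride and rate bookkeeping is easier to follow when replayed without any tensors. The sketch below is a simplified, hypothetical re-creation (`trace_strides` is not part of slim): it keeps only the current_stride and rate updates and assumes store_non_strided_activations is False.

```python
def trace_strides(unit_strides, output_stride=None):
    """Replay the current_stride / rate updates of stack_blocks_dense.

    unit_strides: the nominal stride of every unit, in order.
    Returns (trace, current_stride), where trace holds one
    (applied_stride, rate_used) pair per unit.
    """
    current_stride = 1
    rate = 1
    trace = []
    for stride in unit_strides:
        if output_stride is not None and current_stride == output_stride:
            # Target reached: keep the resolution and use atrous conv instead.
            trace.append((1, rate))
            rate *= stride
        else:
            trace.append((stride, 1))
            current_stride *= stride
            if output_stride is not None and current_stride > output_stride:
                raise ValueError("The target output_stride cannot be reached.")
    if output_stride is not None and current_stride != output_stride:
        raise ValueError("The target output_stride cannot be reached.")
    return trace, current_stride


# Unit strides of the four ResNet50 blocks: 3, 4, 6 and 3 units.
resnet50_strides = [1, 1, 2] + [1, 1, 1, 2] + [1, 1, 1, 1, 1, 2] + [1, 1, 1]

plain_trace, plain_stride = trace_strides(resnet50_strides)
dense_trace, dense_stride = trace_strides(resnet50_strides, output_stride=4)
```

Without output_stride the blocks downsample by 8 on top of the root block's factor of 4. With a target of 4 inside the stack (16 overall, since resnet_v1 divides by 4 first), block3's final stride is skipped and every unit of block4 runs with rate 2.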
The function iterates over blocks, which are the four blocks we defined at the start. For every unit in every block, it essentially performs the following operation.

net = block.unit_fn(net, depth=unit_depth,
                    depth_bottleneck=unit_depth_bottleneck,
                    stride=unit_stride, rate=1)
current_stride *= unit_stride

Here unit_fn is a function. Going back to where we defined the blocks, we can see that unit_fn points to the bottleneck function.

def bottleneck(inputs, depth, depth_bottleneck, stride, rate=1,
               outputs_collections=None, scope=None,
               use_bounded_activations=False):
  with tf.variable_scope(scope, 'bottleneck_v1', [inputs]) as sc:
    depth_in = slim.utils.last_dimension(inputs.get_shape(), min_rank=4)
    if depth == depth_in:
      shortcut = resnet_utils.subsample(inputs, stride, 'shortcut')
    else:
      shortcut = slim.conv2d(
          inputs, depth, [1, 1], stride=stride,
          activation_fn=tf.nn.relu6 if use_bounded_activations else None,
          scope='shortcut')

    residual = slim.conv2d(inputs, depth_bottleneck, [1, 1], stride=1,
                           scope='conv1')
    residual = resnet_utils.conv2d_same(residual, depth_bottleneck, 3, stride,
                                        rate=rate, scope='conv2')
    residual = slim.conv2d(residual, depth, [1, 1], stride=1,
                           activation_fn=None, scope='conv3')

    if use_bounded_activations:
      # Use clip_by_value to simulate bandpass activation.
      residual = tf.clip_by_value(residual, -6.0, 6.0)
      output = tf.nn.relu6(shortcut + residual)
    else:
      output = tf.nn.relu(shortcut + residual)

    return slim.utils.collect_named_outputs(outputs_collections,
                                            sc.name,
                                            output)

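We have reached the final piece. The arguments the bottleneck function needs are exactly the block arguments we saw earlier. Let's go through the function in detail.
First it reads the last dimension of inputs, which is the number of channels.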

depth_in = slim.utils.last_dimension(inputs.get_shape(), min_rank=4)
if depth == depth_in:
  shortcut = resnet_utils.subsample(inputs, stride, 'shortcut')
else:
  shortcut = slim.conv2d(
      inputs, depth, [1, 1], stride=stride,
      activation_fn=tf.nn.relu6 if use_bounded_activations else None,
      scope='shortcut')

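This part of the code makes sure that shortcut and residual have the same height, width, and number of channels, so that they can be added. The height and width are kept equal by giving the shortcut the same stride as the residual branch, either through subsampling or through a strided convolution. The subsample operation returns the input unchanged when the stride is 1; otherwise it applies a max pool with a 1x1 window and the given stride.
The channel counts have to match as well: when the input depth differs from depth, a 1x1 convolution on the shortcut changes the number of channels.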
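A max pool whose window is 1x1 sees a single value per window, so with a stride it simply keeps every stride-th pixel. The sketch below illustrates that on a nested list; it is a stand-in written for this article, not the slim function itself.

```python
def subsample(grid, factor):
    """Keep every `factor`-th row and column of a 2-D list.

    This mimics what a 1x1 max pool with stride `factor` does to a
    single-channel feature map.
    """
    if factor == 1:
        return grid
    return [row[::factor] for row in grid[::factor]]


feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
unchanged = subsample(feature_map, 1)
halved = subsample(feature_map, 2)   # [[0, 2], [8, 10]]
```

No values are mixed or averaged, which keeps the shortcut as close to an identity mapping as a downsampling step allows.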
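The final output is output = tf.nn.relu(shortcut + residual) by default (tf.nn.relu6 is used only when use_bounded_activations is set), which is the F(x) + x from earlier followed by an activation.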
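To check that the two branches really line up, the shapes can be traced without TensorFlow. This is a hedged sketch: `bottleneck_shapes` is a hypothetical helper, shapes are (height, width, channels) tuples, and it assumes the height and width are divisible by the stride.

```python
def bottleneck_shapes(in_shape, depth, depth_bottleneck, stride):
    """Trace (height, width, channels) through one bottleneck unit.

    Assumes height and width are divisible by `stride`, so every strided
    operation outputs exactly size // stride.
    """
    h, w, depth_in = in_shape
    # Shortcut: subsample if channels already match, else strided 1x1 conv.
    shortcut_kind = "subsample" if depth == depth_in else "conv1x1"
    shortcut = (h // stride, w // stride, depth)
    # Residual branch: 1x1 reduce, 3x3 (carries the stride), 1x1 expand.
    conv1 = (h, w, depth_bottleneck)
    conv2 = (conv1[0] // stride, conv1[1] // stride, depth_bottleneck)
    conv3 = (conv2[0], conv2[1], depth)
    assert shortcut == conv3, "shortcut and residual must match to be added"
    return shortcut_kind, conv3


# The three units of block1 in ResNet50, starting from the pool1 output.
first = bottleneck_shapes((56, 56, 64), depth=256, depth_bottleneck=64, stride=1)
second = bottleneck_shapes(first[1], depth=256, depth_bottleneck=64, stride=1)
third = bottleneck_shapes(second[1], depth=256, depth_bottleneck=64, stride=2)
```

Only the first unit of a block needs the 1x1 convolution on the shortcut, because that is where the channel count changes; the remaining units reuse the input directly.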