Neural network architectures


A review of a few important neural network architectures: VGG, ResNet, GoogleNet (Inception), and MobileNet.

Since AlexNet was published in 2012, many architectures have been developed that significantly improve accuracy, increase the depth of neural networks, and reduce model size and the number of operations. Here I study and review a few important developments.

An analysis of deep neural network models for practical applications

Let's first get a big picture of these neural architectures in terms of accuracy, model size, operations, inference time, and power usage. The paper is from 2016, so it doesn't cover MobileNet and other later developments.

Figure 1 shows 1-crop top-1 accuracies of the most relevant entries submitted to the ImageNet challenge, from AlexNet (Krizhevsky et al., 2012) on the far left to the best performing Inception-v4 (Szegedy et al., 2016). The newest ResNet and Inception architectures surpass all other architectures by a significant margin of at least 7%. Note that 1-crop, 5-crop or 10-crop refers to the number of crops taken from each image at test time, explained here.

Figure 2 shows model size (# of parameters) and the amount of operations required for a single forward pass (inference), in addition to the top-1 accuracy. The first thing that is very apparent is that VGG, even though it is widely used in many applications, is by far the most expensive architecture, both in terms of computational requirements and number of parameters. Its 16- and 19-layer implementations are in fact isolated from all other networks. The other architectures form a steep straight line that seems to start flattening with the latest incarnations of Inception and ResNet. This might suggest that models are reaching an inflection point on this data set, where the costs in terms of complexity start to outweigh the gains in accuracy.

Figure 3 reports the inference time per image for each architecture as a function of batch size (from 1 to 64). (Is this the batch size used at inference time?) We notice that VGG processes one image in about a fifth of a second, making it a less likely contender for real-time applications on an NVIDIA TX1.

In Figure 7, for a batch of 16 images, there is a linear relationship between the operation count and the inference time per image. Therefore, at design time we can place a constraint on the number of operations to keep the processing speed in a usable range for real-time applications or resource-limited deployments.

VGG

The main contribution is to increase the depth to 16–19 layers while using only small (3*3) convolution filters.

Pros

  • simple, generic structure
  • deeper network; depth matters
  • stacked smaller filters cover the same receptive field as one larger filter, but with more non-linearity and fewer parameters (see the sketch after this list)
  • multi-scale training to augment images
  • fully convolutional network at test time
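To make the filter-size point concrete, here is a minimal sketch (the values are illustrative, assuming C input and C output channels and ignoring biases) comparing two stacked 3*3 conv layers against a single 5*5 conv layer. Both cover a 5*5 receptive field, but the stacked version has fewer parameters and two non-linearities instead of one.

# Parameter count: two stacked 3*3 convs vs. one 5*5 conv,
# both mapping C input channels to C output channels (biases ignored).
def conv_params(k, c_in, c_out):
    # weights of a single k*k convolution layer
    return k * k * c_in * c_out

C = 256  # illustrative VGG-style layer width

stacked_3x3 = 2 * conv_params(3, C, C)  # 2 * 9 * C^2 = 18 * C^2
single_5x5 = conv_params(5, C, C)       # 25 * C^2
print(stacked_3x3, single_5x5)          # 1179648 1638400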

Cons

  • despite being much shallower, VGG-16 and VGG-19 have computation complexity comparable to ResNet-152
  • a large number of parameters, so the model size is large

More reading in Mandarin:

深度学习VGG模型核心拆解

GoogleNet + Inception

GoogleNet is also called Inception-v1. It was later developed into Inception-v2, v3, and v4; Inception-v4 combines the inception block with the residual block. In contrast to ResNet, which goes deeper, GoogleNet makes the network "wider" by running convolution filters at multiple scales in parallel inside an Inception block and concatenating the resulting feature maps.
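As a rough illustration of an Inception-style block (a sketch only; the branch widths below are made up, not the published GoogleNet values), several filter sizes run in parallel and their outputs are concatenated along the channel dimension:

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Simplified Inception-style block: parallel 1*1, 3*3, 5*5 and pooling
    # branches, concatenated along the channel axis.
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 32, kernel_size=1)
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, 48, kernel_size=1),   # 1*1 reduction
            nn.Conv2d(48, 64, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),   # 1*1 reduction
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # every branch keeps the spatial size, so concatenation is on channels
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

x = torch.randn(1, 64, 28, 28)
print(InceptionBlock(64)(x).shape)  # torch.Size([1, 160, 28, 28])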

More readings in Mandarin:

深入浅出-网络模型中Inception的作用与结构全解析

从Inception V1 到 Inception V4网络结构的变化

ResNet

ResNet is a milestone that increases the depth of neural nets to 50, 100, even 1000 layers with reasonable training and test accuracy. Before ResNet, VGGNet and GoogleNet had around 20 layers.

There is a paradox shown in the ResNet paper: deeper plain networks (without skip connections) can have higher training error.

The training error is higher for the 56-layer network than for the 20-layer one.

If only the test error were higher, it would probably be due to overfitting. But it turns out the training error is also higher for deeper networks. In theory, adding more layers to a network is like adding more polynomial terms to an equation: the goodness of fit on the training data should never get worse. As stated in the paper, "Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model." ResNet solves this paradox by adding the residual block.

Residual block
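Here is a minimal sketch of a basic residual block (a simplified two-conv version, not the exact bottleneck block used in ResNet-50/152): the conv layers fit the residual F(x), and the skip connection adds the input back, so the block only needs to learn a correction on top of the identity mapping.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Basic residual block: output = ReLU(F(x) + x), with F being two 3*3 convs.
    # Simplified sketch; assumes the input and output shapes are identical.
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        # if F(x) is pushed to zero, the block reduces to the identity mapping,
        # the "solution by construction" mentioned in the paper
        return torch.relu(self.f(x) + x)

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])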

Right, problem solved. But here are two questions to think about: (1) what is the cause of the paradox? (2) how can we explain that ResNet solves it?

The first question: what is the cause? Vanishing/exploding gradients due to deeper networks? As the paper states, "this problem, however, has been largely addressed by normalized initialization and intermediate normalization layers." For example, Xavier initialization for linear activations, He initialization for ReLU, and batch normalization to normalize the activations. Thus, the degradation is due to the optimization difficulty of adding more layers. Although in theory the new layers could simply be identity mappings, it is not easy to fit them exactly as identity mappings because of the complex and stochastic optimization. This brings us to the second question. It turns out that by fitting the residual, it is easier to find a comparably good or even better solution. The hypothesis is that it is easier to optimize the residual mapping than the original mapping.

Compared to VGG-16 or VGG-19, the memory and computation complexity of ResNet-152 is actually lower! This is because VGG uses so many conv filters. But the training time for ResNet is still long.

More readings about what problem ResNet solves:

The Shattered Gradients Problem: If resnets are the answer, then what is the question?

Identity mapping in deep residual networks

Resnet到底在解决一个什么问题呢?

"The correlation between gradients keeps decaying as the number of layers increases."

"The second interpretation is that introducing skip connections actually gives the model itself a more 'flexible' structure: during training, for each part, the model can choose between 'doing more convolution and non-linear transformation' and 'leaning toward doing nothing at all', or a combination of the two. The model can thus adapt its own structure during training, which sounds like a really cool thing!"

MobileNet v1

MobileNet v1 uses the depthwise separable convolution to replace the standard convolution. Given an input, a depthwise separable convolution keeps the same input and output dimensions but has fewer parameters in the conv layer than a regular convolution. Here is a good visualization comparing the standard and depthwise separable conv.

The standard convolution
Depthwise convolution
Pointwise convolution
Computation reduction
# image: 32*32*3, output feature map: 32*32*16
# kernel: H*W = 3*3, input channels: C = 3, filters: F = 16, output spatial size: N*N = 32*32

# regular conv
parameters:   H*W*C*F = 3*3*3*16 = 432
calculations: (H*W*C)*(N*N)*F = (3*3*3)*(32*32)*16 = 442368

# depthwise separable conv (depthwise conv H*W*C, pointwise conv 1*1*C*F)
parameters:   H*W*C + 1*1*C*F = (H*W+F)*C = 3*3*3 + 1*1*3*16 = 75
calculations: H*W*(N*N)*C + C*(N*N)*F = (H*W+F)*(N*N)*C = 3*3*32*32*3 + 3*32*32*16 = 76800

# compare (depthwise separable / regular)
parameters:   (H*W+F)*C / (H*W*C*F) = (H*W+F) / (H*W*F)
calculations: (H*W+F)*(N*N)*C / ((H*W*C)*(N*N)*F) = (H*W+F) / (H*W*F)
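The same comparison can be sketched in PyTorch for the 32*32*3 -> 32*32*16 example above: a depthwise conv with groups equal to the number of input channels, followed by a 1*1 pointwise conv. The parameter counts match the numbers above when biases are excluded.

import torch
import torch.nn as nn

C, F = 3, 16  # input channels and output channels, as in the example above

# regular 3*3 convolution: 3*3*3*16 = 432 weights
regular = nn.Conv2d(C, F, kernel_size=3, padding=1, bias=False)

# depthwise separable convolution:
#   depthwise 3*3 (one filter per input channel): 3*3*3 = 27 weights
#   pointwise 1*1 (mixes channels):               1*1*3*16 = 48 weights
separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C, bias=False),
    nn.Conv2d(C, F, kernel_size=1, bias=False))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(regular), count(separable))  # 432 75

x = torch.randn(1, C, 32, 32)
print(regular(x).shape, separable(x).shape)  # both torch.Size([1, 16, 32, 32])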

Question: but with fewer parameters, how can a depthwise separable conv achieve the same accuracy as a regular conv?

In addition to reducing the number of parameters and calculations compared to standard convolutions, depthwise separable convolution offers another benefit. Explanation from the paper: “It is not enough to simply define networks in terms of a small number of Mult-Adds. It is also important to make sure these operations can be efficiently implementable. For instance, unstructured sparse matrix operations are not typically faster than dense matrix operations until a very high level of sparsity. Our model structure puts nearly all of the computation into dense 1*1 convolutions. This can be implemented with highly optimized general matrix multiply (GEMM) functions. Often convolutions are implemented by a GEMM but require an initial reordering in memory called im2col in order to map it to a GEMM. … 1*1 convolutions do not require this reordering in memory and can be implemented directly with GEMM which is one of the most optimized numerical linear algebra algorithms. MobileNet spends 95% of its computation time in 1*1 convolutions which also has 75% of the parameters as can be seen in Table 2.”
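The point about 1*1 convolutions can be seen directly: a 1*1 convolution is just one matrix multiplication between the flattened (H*W, C) input and a (C, F) weight matrix, with no im2col reordering needed. A small numpy sketch (shapes reused from the example above):

import numpy as np

H, W, C, F = 32, 32, 3, 16
x = np.random.randn(H, W, C)  # input feature map (HWC layout)
w = np.random.randn(C, F)     # 1*1 conv weights: one row per input channel

# a 1*1 convolution is a per-pixel linear map over channels,
# i.e. a single GEMM over the flattened spatial positions
y = x.reshape(H * W, C) @ w   # (H*W, C) x (C, F) -> (H*W, F)
y = y.reshape(H, W, F)
print(y.shape)                # (32, 32, 16)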

MobileNet v2

Compare the block structure of MobileNet v1 and v2

Common: both use depthwise and pointwise convolutions instead of regular convolutions, which reduces the computation cost to roughly 1/k² of a regular convolution, where k is the kernel size.

Difference:

  • MobileNet v2 adds a pointwise conv before the depthwise conv in the block to increase the number of channels. A depthwise conv doesn't change the number of channels, so it may not work well on a low-dimensional input.
  • MobileNet v2 replaces the ReLU after the second pointwise conv with a linear bottleneck. Because the second pointwise conv is used to reduce the number of channels, a non-linear activation may not work well in the lower-dimensional space.

Compare the MobileNet v2 inverted residual block with the ResNet residual block

Common:

  • both have a 1*1 -> 3*3 -> 1*1 structure
  • both have a skip connection from the block input to the output of the last pointwise conv

Difference

  • the residual block uses regular convs, while the inverted residual block uses a depthwise conv
  • residual block (hourglass-shaped): 1*1 reduces channels -> conv2d -> 1*1 increases channels, vs. inverted residual block (spindle-shaped): 1*1 increases channels -> depthwise -> 1*1 reduces channels (see the sketch below)
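A sketch of the inverted residual block under the structure described above (the expansion factor and channel count are illustrative, and batch normalization is omitted for brevity): a 1*1 conv expands the channels, a 3*3 depthwise conv filters in the expanded space, and a final 1*1 conv projects back down with no ReLU (the linear bottleneck); the skip connection is used when the input and output shapes match.

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    # MobileNet v2 style inverted residual block (simplified sketch):
    # 1*1 expand -> 3*3 depthwise -> 1*1 project (linear, no ReLU),
    # with a skip connection when the input and output shapes match (stride 1).
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),  # expand
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                    # depthwise
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False))  # project, linear

    def forward(self, x):
        return x + self.block(x)  # skip connection

x = torch.randn(1, 24, 56, 56)
print(InvertedResidual(24)(x).shape)  # torch.Size([1, 24, 56, 56])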

Readings about mobilenet v2

MobileNet V2论文初读

Optional readings in Mandarin:

im2col的原理和实现

MobileNet网络的理解

深度解读谷歌MobileNet

“Replacing traditional convolution with depthwise convolution combined with 1x1 convolution is not only more efficient in theory; since 1x1 convolutions are used heavily, highly optimized math libraries can be used directly for this operation. Taking Caffe as an example, to use these math libraries the data must first be rearranged with im2col to satisfy the input format such libraries expect, whereas 1x1 convolutions do not need this preprocessing.” im2col is an operation that optimizes the convolution computation; in other words, computing a regular conv requires something like im2col, while a 1*1 conv does not. “In MobileNet, 95% of the computation and 75% of the parameters belong to 1x1 convolutions.”

A brief review of im2col and GEMM

im2col converts a standard convolution on an image to general matrix-matrix multiplication (GEMM). The following diagram is from this paper.

GEMM is a level-3 operation in BLAS (Basic Linear Algebra Subprograms), a specification (API) that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication (from Wikipedia). Thus, im2col transforms the intermediate steps of a convolution on an image into two matrices, and then GEMM is applied to perform the matrix-matrix multiplication. Many blogs discuss why GEMM is at the heart of deep learning.
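A minimal im2col sketch in numpy (stride 1, no padding, a single image; simplified for illustration): each k*k*C patch becomes one row of a matrix, and the convolution then reduces to a single GEMM with the flattened filters.

import numpy as np

def im2col(x, k):
    # unfold k*k patches of x with shape (C, H, W) into rows of a
    # (out_h*out_w, k*k*C) matrix; stride 1, no padding
    C, H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.zeros((out_h * out_w, k * k * C))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[:, i:i + k, j:j + k].ravel()
    return cols

C, F, k = 3, 16, 3
x = np.random.randn(C, 32, 32)
w = np.random.randn(F, C, k, k)    # F filters of size k*k*C

# convolution as one GEMM: (out_h*out_w, k*k*C) x (k*k*C, F)
y = im2col(x, k) @ w.reshape(F, -1).T
y = y.T.reshape(F, 30, 30)         # back to (F, out_h, out_w)
print(y.shape)                     # (16, 30, 30)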

More reading in Mandarin:

深度学习中GEMM的前世今生

im2col的原理和实现

在Caffe中如何计算卷积
