Pandy's Blog

Pandy Song

Deep Learning, Pytorch

What is a CNN

As I understand it, a convolutional neural network (ConvNet or CNN) is a neural network in which convolution layers extract features and an ordinary neural network then does the classification.

A typical CNN uses several convolution layers (each followed by ReLU and pooling) to extract features, then flattens the output into a vector that feeds the FC (fully connected) layers, which are just an ordinary neural network.

Refer to the link for more detailed reading.
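To make the "flatten, then FC" step concrete, here is a minimal sketch. The tensor sizes (16 feature maps of 6 x 6, a batch of 4) and the 10 output classes are my own assumptions for illustration, not something fixed by the description above.

```python
import torch
import torch.nn as nn

# Stand-in for the output of the convolution layers:
# a batch of 4 samples, each with 16 feature maps of 6 x 6 (assumed sizes).
features = torch.rand(4, 16, 6, 6)

# Flatten everything except the batch dimension: 16 * 6 * 6 = 576 values per sample.
flat = torch.flatten(features, 1)

# One fully connected layer mapping 576 features to 10 class scores
# (10 classes is an arbitrary choice for this sketch).
fc = nn.Linear(16 * 6 * 6, 10)
scores = fc(flat)

print(flat.shape)    # torch.Size([4, 576])
print(scores.shape)  # torch.Size([4, 10])
```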

The number of input channels and the number of output channels can differ

For example, the following Python (torch) code creates a convolution layer. The first parameter, 1, is the number of input channels, 6 is the number of output channels, and 3 is the kernel size. How does it work?

self.conv1 = nn.Conv2d(1, 6, 3)

It means that 6 filters are convolved over the same locations of the image, so a one-channel image produces 6 different output channels.

Refer to input and output channels for the discussion.
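A quick way to check this is to look at the shapes of the layer's weights and of its output. This is only a sketch: the 32 x 32 random input is an assumption chosen for illustration.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(1, 6, 3)  # 1 input channel, 6 output channels, 3 x 3 kernel

# One filter per output channel; each filter spans all input channels.
print(conv1.weight.shape)  # torch.Size([6, 1, 3, 3])
print(conv1.bias.shape)    # torch.Size([6])

# Assumed input: a batch of one single-channel 32 x 32 image,
# laid out as (batch, channels, height, width).
image = torch.rand(1, 1, 32, 32)
out = conv1(image)
print(out.shape)  # torch.Size([1, 6, 30, 30])
```

The weight tensor holds 6 separate 3 x 3 filters, and each one produces its own output channel from the same input image.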

size reduced after convolution

The following example creates a convolution layer, conv2d_test, which takes 1 input channel and produces 1 output channel with kernel size 3, that is, a 3 x 3 kernel.

>>> import torch
>>> import torch.nn as nn
>>> conv2d_test = nn.Conv2d(1,1,3)
>>> x=torch.rand(1,1,5,5)
>>> conv2d_test(x)
tensor([[[[ 0.1419,  0.0967, -0.1615],
          [-0.1023,  0.2995, -0.0513],
          [ 0.0512,  0.1443, -0.0801]]]], grad_fn=<MkldnnConvolutionBackward>)

We can see that the output is 3 x 3.

Input size 5 - Kernel Size 3 + 1 = Output size 3
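The same arithmetic can be written as a small helper. The function name conv_output_size is my own, and the stride and padding parameters go beyond what the example above uses (it has stride 1 and no padding), so treat this as a sketch of the general rule.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one dimension of a convolution (or pooling) layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(5, 3))             # 3, the example above
print(conv_output_size(32, 5))            # 28
print(conv_output_size(5, 3, padding=1))  # 5, padding of 1 keeps the size
print(conv_output_size(28, 2, stride=2))  # 14, 2 x 2 pooling halves the size
```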

understand an example


The example defines a network:

One-channel 32 x 32 image input -> (convolution with kernel size 5) 6 channels of 28 x 28 images (32 - 5 + 1 = 28) -> (ReLU, then 2 x 2 pooling, that is, downsampling) 6 channels of 14 x 14 images -> (convolution with kernel size 3) 16 channels of 12 x 12 images (14 - 3 + 1 = 12) -> (ReLU, then 2 x 2 pooling, that is, downsampling) 16 channels of 6 x 6 images
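To double-check the sizes, the pipeline can be traced in code. This is a sketch that follows the kernel sizes listed above (5, then 3). Using max pooling is my assumption, since the text only says 2 x 2 pooling, and the input is random data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 6, 5)   # 1 -> 6 channels, 5 x 5 kernel
conv2 = nn.Conv2d(6, 16, 3)  # 6 -> 16 channels, 3 x 3 kernel

x = torch.rand(1, 1, 32, 32)  # one single-channel 32 x 32 image
shapes = [tuple(x.shape)]

x = conv1(x)
shapes.append(tuple(x.shape))   # after first convolution
x = F.max_pool2d(F.relu(x), 2)  # ReLU, then 2 x 2 max pooling (assumed pooling type)
shapes.append(tuple(x.shape))

x = conv2(x)
shapes.append(tuple(x.shape))   # after second convolution
x = F.max_pool2d(F.relu(x), 2)
shapes.append(tuple(x.shape))

for s in shapes:
    print(s)
# (1, 1, 32, 32)
# (1, 6, 28, 28)
# (1, 6, 14, 14)
# (1, 16, 12, 12)
# (1, 16, 6, 6)
```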

What is the purpose of “accumulating” the gradient?

Refer to github

Gradients add up at forks. The forward expression involves the variables x,y
multiple times, so when we perform backpropagation we must be careful to use +=
instead of = to accumulate the gradient on these variables (otherwise we would
overwrite it). This follows the multivariable chain rule in Calculus, which
states that if a variable branches out to different parts of the circuit, then
the gradients that flow back to it will add.

Put simply, the same variable or parameter can be used several times in the forward pass, so its gradient is the sum of the gradients arriving from the different back-propagation paths.
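Here is a hand-written sketch of that rule in plain Python. The function f = (x * y) * (x + y) and the input values are my own choice for illustration. Both x and y feed two branches, so their gradients are built up with +=.

```python
# Forward pass: x and y are each used in two places.
x, y = 3.0, -4.0
a = x * y      # branch 1
b = x + y      # branch 2
f = a * b

# Backward pass, starting from df/df = 1.
df = 1.0
da = b * df    # f = a * b, so df/da = b
db = a * df    # and df/db = a

dx, dy = 0.0, 0.0
dx += y * da   # through a = x * y
dy += x * da
dx += 1.0 * db # through b = x + y; += so the first contribution is kept
dy += 1.0 * db

print(dx, dy)  # -8.0 -15.0

# Check against the analytic derivatives of f = x^2 * y + x * y^2.
print(2 * x * y + y * y, x * x + 2 * x * y)  # -8.0 -15.0
```

Replacing the second pair of += with = would keep only the contribution from the x + y branch and give the wrong gradients.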

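Because PyTorch accumulates gradients in this way, you have to zero them in each training iteration before performing back-propagation; otherwise the gradients from the previous iteration are added in. Refer to the end of the tutorial.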
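A small sketch shows what happens when the gradients are not zeroed. The function y = x * x + 2 * x and the value x = 3 are assumptions picked so the numbers are easy to check by hand.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# x is used in two branches: dy/dx = 2x + 2 = 8 at x = 3.
y = x * x + 2 * x
y.backward()
first_grad = x.grad.item()
print(first_grad)        # 8.0

# A second backward pass without zeroing adds to the stored gradient.
y = x * x + 2 * x
y.backward()
accumulated_grad = x.grad.item()
print(accumulated_grad)  # 16.0

# Zero the gradient first, and the result is correct again.
x.grad.zero_()
y = x * x + 2 * x
y.backward()
fresh_grad = x.grad.item()
print(fresh_grad)        # 8.0
```

In a real training loop the same reset is usually done for all parameters at once with optimizer.zero_grad().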

Different models on ImageNet

Refer to




What is Deep learning?

CNN layers can be arranged in many different ways, and many designs have been proposed. In the early days, people proposed AlexNet, VGG, and GoogLeNet, which have roughly 8 to 22 layers and were already considered “deep” at the time. Later, ResNet was introduced, with 50 to 200 layers.

Hence “deep” seems to mean that the number of layers is large, so that features can be extracted at different levels across the layers.

Which state-of-the-art deep CNN model architecture to use?

Refer to

Inception-ResNet, which was introduced alongside Inception V4, is a kind of combination of ResNet and Inception.

Transfer Learning

These state-of-the-art models have a generic ability to recognize images. A pre-trained model contains generic knowledge that can be reused for other image recognition tasks. This method is called “transfer learning”.

Refer to

The basic premise of transfer learning is simple: take a model trained on a
large dataset and transfer its knowledge to a smaller dataset. For object
recognition with a CNN, we freeze the early convolutional layers of the network
and only train the last few layers which make a prediction. The idea is the
convolutional layers extract general, low-level features that are applicable
across images — such as edges, patterns, gradients — and the later layers
identify specific features within an image such as eyes or wheels.
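The freezing step can be sketched in PyTorch as follows. Everything here is a stand-in: the tiny model is hypothetical and has random weights (a real case would load a pre-trained network), and the 5 target classes, the layer sizes and the learning rate are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained network with a 1000-class head.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3),       # index 0: "early" convolutional layer
    nn.ReLU(),                # index 1
    nn.AdaptiveAvgPool2d(1),  # index 2
    nn.Flatten(),             # index 3
    nn.Linear(8, 1000),       # index 4: original prediction layer
)

# Freeze every existing parameter so it is not updated during training.
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer with a new head for the smaller task (5 classes assumed).
# Newly created layers have requires_grad=True by default.
model[4] = nn.Linear(8, 5)

# Only the parameters that still require gradients go to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

out = model(torch.rand(2, 3, 32, 32))
print(len(trainable))  # 2, the new layer's weight and bias
print(out.shape)       # torch.Size([2, 5])
```

Only the new last layer is trained here; unfreezing a few more of the later layers follows the same pattern by setting requires_grad back to True on their parameters.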