Analysis of Parameter Influence on CNN

I was curious about the real influence of each parameter on CNN training results, and I wanted to see whether the results align with expectations. I used the MNIST dataset [1] and tuned the parameters of an old network called LeNet-5 [3].

CNN Layers

The CNN layout is mostly the same as LeNet-5, with small modifications such as adding Dropout to avoid overfitting. I used LeNet-5 mostly because it is small and trains very fast with little GPU requirement, which matters since I had to change the parameters across a few hundred training runs. The full layer list is below, followed by a code sketch of the same stack.

1   Image Input           28×28×1 images with 'zerocenter' normalization
2   Convolution           6 9×9×1 convolutions with stride [1 1] and padding [0 0 0 0]
3   Batch Normalization   Batch normalization with 6 channels
4   ReLU                  ReLU
5   Average Pooling       2×2 average pooling with stride [2 2] and padding [0 0 0 0]
6   Convolution           16 9×9×6 convolutions with stride [1 1] and padding [0 0 0 0]
7   Batch Normalization   Batch normalization with 16 channels
8   ReLU                  ReLU
9   Average Pooling       2×2 average pooling with stride [2 2] and padding [0 0 0 0]
10  Fully Connected       128 fully connected layer
11  ReLU                  ReLU
12  Fully Connected       64 fully connected layer
13  ReLU                  ReLU
14  Dropout               50% dropout
15  Fully Connected       10 fully connected layer
16  Regression Output     Mean-squared-error with response 'Response'
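
This stack can be reproduced in MATLAB's Deep Learning Toolbox roughly as follows. It is a minimal sketch based on the table, parameterised so the later sweeps can reuse it; the function name buildNet and its argument names are my own, not from the original code.

    % Minimal sketch of the network in the table, parameterised for the sweeps below.
    % filterSize, numFilters1, numFilters2, fc1 and fc2 are the knobs varied in the
    % tests; the defaults in the table are 9, 6, 16, 128 and 64.
    function layers = buildNet(filterSize, numFilters1, numFilters2, fc1, fc2)
        layers = [
            imageInputLayer([28 28 1])                  % 'zerocenter' normalization by default
            convolution2dLayer(filterSize, numFilters1) % stride [1 1], padding [0 0 0 0] by default
            batchNormalizationLayer
            reluLayer
            averagePooling2dLayer(2, 'Stride', 2)
            convolution2dLayer(filterSize, numFilters2)
            batchNormalizationLayer
            reluLayer
            averagePooling2dLayer(2, 'Stride', 2)
            fullyConnectedLayer(fc1)
            reluLayer
            fullyConnectedLayer(fc2)
            reluLayer
            dropoutLayer(0.5)                           % 50% dropout
            fullyConnectedLayer(10)
            regressionLayer                             % mean-squared-error output
            ];
    end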

Testing Results

Convolutional Layer Filter Size

The filter size is varied from 1 to 9, the maximum this network supports: with a 10×10 filter, the first convolution and pooling stage would leave a 9×9 feature map, too small for the second convolutional layer's filter. The result is shown in Figure 1.

Figure 1: Accuracy as the filter size varies from 1 to 9, the maximum for the network. The best accuracy occurs at a filter size of 3 or 5.

The result shows that networks with a filter size of 3 or 5 have the best accuracy. A filter of size 1 is likely inaccurate because it does not correlate any pixel with its neighbours. The accuracy drop beyond size 5 is probably due to the increased number of parameters, which can lead to overfitting or slow training (the maximum training iteration count is reached before convergence). This aligns with the suggestion by Szegedy et al. [2] that a relatively small convolution kernel can be sufficient for training. Considering that increasing the filter size significantly increases the amount of computation, 3 is a good enough filter size.
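
For illustration, the sweep can be scripted roughly as follows, reusing the buildNet sketch above. The data variables (XTrain, YTrain, XTest, digitsTest) and the training options are placeholders I am assuming, not the original settings; in particular I assume the regression responses are one-hot digit vectors, matching the 10-unit output layer.

    % Sweep the convolution filter size from 1 to 9 and record test accuracy.
    opts = trainingOptions('sgdm', 'MaxEpochs', 10, 'Verbose', false);
    acc = zeros(1, 9);
    for f = 1:9
        net = trainNetwork(XTrain, YTrain, buildNet(f, 6, 16, 128, 64), opts);
        YPred = predict(net, XTest);       % N-by-10 regression outputs
        [~, pred] = max(YPred, [], 2);     % largest response = predicted digit
        acc(f) = mean(pred == digitsTest); % digitsTest: true digits as indices 1..10
    end
    plot(1:9, acc), xlabel('Filter size'), ylabel('Accuracy')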

Convolutional Layer Filter Number

I changed the filter number of each convolutional layer separately, and then of both together. The effect of the filter number of the first convolutional layer is shown in Figure 2, and that of the second convolutional layer in Figure 3. Figure 4 shows the effect of changing both together, with the second layer holding twice as many filters as the first.

Figure 2: Accuracy for different filter numbers in the first convolutional layer. The first layer's filter count has a relatively small impact on accuracy, and with roughly 10 or more filters there is no further improvement.
Figure 3: Accuracy for different filter numbers in the second convolutional layer. The second layer's filter count has a more significant impact on the training result, though the accuracy gain is relatively small beyond 20 filters.
Figure 4: Accuracy for different filter numbers in both convolutional layers. The x-axis is the number of filters in the first layer; the second layer has twice as many. The accuracy gain is relatively small beyond 10 first-layer filters.

The general conclusion is that increasing the filter number results in better accuracy, but the gain drops dramatically after a certain point. So it may not be worth trading training speed for a very large filter number.
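
The paired sweep in Figure 4 can be expressed the same way. This is only a sketch on the same assumptions as above, and the range of filter counts is my own choice, not taken from the experiments.

    % Sketch for Figure 4: vary both filter counts together,
    % second layer twice the first; filter size fixed at 3.
    for n = 2:2:32                           % first-layer filter counts (my choice of range)
        layers = buildNet(3, n, 2*n, 128, 64);
        % ... train and score exactly as in the filter-size sweep above ...
    end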

Fully Connected Layer Size

The sizes of both fully connected layers are changed together, with the first layer twice the size of the second. The result is shown in Figure 5.

Figure 5: Accuracy for different fully connected layer sizes. The x-axis is the size of the second fully connected layer; the first layer is twice that size. The result suggests that increasing the fully connected layer size has a large impact on the training success rate (counting results with less than 95% accuracy as failures).

The accuracy remains almost flat, with a slight increase past a certain point; in Figure 5 that point is a second-layer size of 40. The interesting thing is that when the fully connected layers are small, some training runs end with significantly low accuracy. I am not quite sure about the reason behind this.
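
For completeness, the fully connected sweep follows the same pattern (again only a sketch; the size range is my assumption, chosen to cover the point of interest around 40):

    % Sketch for Figure 5: vary both fully connected layers together,
    % first layer twice the second.
    for s = 10:10:120                        % second-layer sizes (my choice of range)
        layers = buildNet(3, 6, 16, 2*s, s);
        % ... train and score as before; runs below 95% count as failures ...
    end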

References

[1] Kaggle. (2018) Digit recognizer. [Online]. Available: https://www.kaggle.com/c/digit-recognizer#description
[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jun. 2016.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
