A convolutional neural network, also called a ConvNet or CNN, is a class of artificial neural network widely used in computer vision, mainly for classifying images and detecting objects in them. At the most basic level, it works on the pixel data of the image to be classified. That might immediately make you think that this is exactly what a normal neural network does. That is not wrong: a CNN is in fact a kind of ANN, but a more sophisticated one that is specially designed for image-related problems, mainly classification and detection.
Layers of a convolutional neural network
There are typically five kinds of layers used in a CNN:
- Input layer
- Convolutional layer
- Pooling layer
- Flatten layer
- Classification layer (Fully connected layer)
The input layer
Being the first layer, it is simply a collection of all the pixel data of the image. If you have done some simple image classification of grayscale images using an ANN, you might remember that the input layer is a flattened array of the image's pixel values. We don't do that here. Instead, we keep the data in the same shape as the image.
For example, suppose you have an image that is 700×700 pixels. It is probably an RGB image, which means it has 3 channels holding the intensity of each color, i.e., red, green and blue. It will therefore be stored as a 700 × 700 × 3 array of data, just like the image I used above, which shows an Ent from The Lord of the Rings.
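To make the difference in shape concrete, here is a small sketch in plain Python. It assumes a hypothetical all-zero 700×700 RGB image stored as nested lists in height × width × channels order (real libraries such as NumPy use arrays, and some frameworks put the channels first), and compares the 3D shape a CNN keeps with the flat vector a plain ANN would use.

```python
# Hypothetical 700x700 RGB image stored as height x width x channels,
# filled with zeros just to illustrate the shape.
HEIGHT, WIDTH, CHANNELS = 700, 700, 3

image = [[[0] * CHANNELS for _ in range(WIDTH)] for _ in range(HEIGHT)]


def shape(nested):
    """Return the dimensions of a regularly nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# A CNN's input layer keeps this 3D shape...
print(shape(image))  # (700, 700, 3)

# ...whereas a plain ANN would flatten it into one long vector.
flat = [value for row in image for pixel in row for value in pixel]
print(len(flat))  # 1470000
```

Both hold the same 1,470,000 numbers. The CNN simply keeps track of which pixels are neighbours.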
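The convolutional layer
This is the most important part of a convolutional neural network. In plain English, to convolve means to roll something together, and that is a good picture of what happens here: a small matrix rolls (slides) over the image. The purpose of a convolution layer is to extract features from the image, and in doing so it usually also reduces the image's size.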
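We do this using something called a feature (more commonly called a filter or kernel). A feature is a small matrix that resembles a specific pattern in an image, such as a straight line or a circle. If you are using a CNN for face classification, your features can even end up resembling a nose, an eye, and so on. In a trained CNN these features are learned from the data rather than designed by hand. Convolving the image with a feature produces high values wherever that pattern appears in the image.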
At each position, we multiply the corresponding values of the feature and the patch of the image underneath it, then add up the products to get a single output value. In the image above, these outputs are 4, 3, 4, and so on.
Also notice that in the image above the feature resembles an "X": the pixels forming the X are nonzero and all the rest are zero.
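The multiply-and-add step can be sketched in plain Python as follows. This is a minimal illustration, assuming stride 1 and no padding; the 5×5 image and the 3×3 "X" feature are made-up values, so the outputs will not match the numbers in the figure. Like most deep-learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map.

    At each position, overlapping values are multiplied element-wise and summed.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# A tiny made-up 5x5 binary image containing an "X".
image = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]

# A 3x3 "X" feature: ones on the diagonals, zeros everywhere else.
x_feature = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, x_feature)
for row in feature_map:
    print(row)
# [3, 0, 3]
# [0, 5, 0]
# [3, 0, 3]
```

The largest value, 5, appears in the centre, exactly where the whole X in the image lines up with the feature. Note also that the 5×5 input shrank to a 3×3 feature map.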
The pooling layer
This layer comes after the convolution layer. It also reduces the computation of the model, but by using a different technique: it shrinks each feature map by summarizing small regions of it. There are different kinds of pooling. The most common, max pooling, keeps the most dominant (largest) value in each region, while average pooling keeps the mean instead. Just like the feature slides over the image, the pooling window slides over the feature map, but instead of multiplying and adding, it simply picks the dominant value.
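Here is a short plain-Python sketch of pooling, assuming the common setup of a 2×2 window with stride 2 (so windows do not overlap). The 4×4 feature map is made up for illustration, and the `reduce` parameter is a hypothetical convenience for switching between max and average pooling.

```python
def pool2d(feature_map, size=2, stride=2, reduce=max):
    """Slide a size x size window over the map and keep one value per window."""
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


# A made-up 4x4 feature map, e.g. the output of a convolution layer.
fm = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [3, 1, 4, 6],
]

print(pool2d(fm))               # max pooling:     [[6, 4], [7, 9]]
print(pool2d(fm, reduce=mean))  # average pooling: [[3.75, 1.75], [3.25, 6.75]]
```

Each 2×2 block collapses to one number, so the map shrinks from 4×4 to 2×2 and the next layer has a quarter as many values to process.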
After this, you don't have to jump straight to the flatten layer. In most cases, convolution and pooling layers are repeated several times, one pair after another.
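To see how the data shrinks as convolution and pooling pairs are stacked, the sketch below tracks the shape through a hypothetical stack. It assumes a 28×28 grayscale input, no padding, 3×3 convolutions with stride 1 and 2×2 max pooling with stride 2; the filter counts (32 and 64) are illustrative choices. The side length after each layer follows (n - window) // stride + 1.

```python
def output_size(n, window, stride=1):
    """Side length after a window slides over n pixels with no padding."""
    return (n - window) // stride + 1


# Hypothetical stack of (description, window size, stride, filters);
# filters is None for pooling, which keeps the number of feature maps unchanged.
layers = [
    ("conv 3x3, 32 filters", 3, 1, 32),
    ("max pool 2x2", 2, 2, None),
    ("conv 3x3, 64 filters", 3, 1, 64),
    ("max pool 2x2", 2, 2, None),
]

size, channels = 28, 1  # a hypothetical 28x28 grayscale input
for name, window, stride, filters in layers:
    size = output_size(size, window, stride)
    if filters is not None:
        channels = filters  # one feature map per filter
    print(f"{name} -> {size} x {size} x {channels}")
# conv 3x3, 32 filters -> 26 x 26 x 32
# max pool 2x2 -> 13 x 13 x 32
# conv 3x3, 64 filters -> 11 x 11 x 64
# max pool 2x2 -> 5 x 5 x 64
```

The spatial size keeps shrinking while the number of feature maps grows, which is the usual pattern in a CNN.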
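The flatten layer
Once the last convolution and pooling pair is done, the flatten layer takes the output of the last pooling layer, a stack of 2D feature maps, and unrolls it into a single flat (1D) vector so that it can be fed into the fully connected layers. Remember that you can have multiple pairs of convolution and pooling layers before this point.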
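Flattening itself is just a reshape. The sketch below assumes a hypothetical pooled output of 2×2 spatial size with 3 feature maps, stored height × width × channels. Real frameworks flatten in whatever order matches their memory layout, so the exact element order may differ, but the length is always height × width × channels.

```python
def flatten(volume):
    """Unroll a height x width x channels nested list into one flat list (row-major)."""
    return [value for row in volume for pixel in row for value in pixel]


# Hypothetical output of the last pooling layer: 2 x 2 spatial size, 3 feature maps.
pooled = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

vector = flatten(pooled)
print(vector)       # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(len(vector))  # 12, i.e. 2 * 2 * 3
```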
Fully connected layer
After this we simply do what we do in a simple ANN: we add some hidden (fully connected) layers after the flatten layer, give them an activation function such as ReLU, and we are almost done. At the end, the output layer uses a softmax activation to turn its raw outputs into class probabilities, and the class with the highest probability is our prediction.
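A minimal plain-Python sketch of this last stage follows: one hidden fully connected layer with ReLU, then an output layer with softmax. The flattened features, weights and biases are all made-up numbers for a hypothetical 2-class problem. In a real CNN they would be learned during training, and a library such as Keras or PyTorch would handle this for you.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: each output is a weighted sum of *all* inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical flattened features and made-up weights: 3 hidden units, 2 classes.
features = [1.0, 0.5, -0.5, 2.0]
hidden = relu(dense(features,
                    weights=[[0.2, -0.1, 0.4, 0.3],
                             [-0.3, 0.8, 0.1, -0.2],
                             [0.5, 0.5, -0.5, 0.1]],
                    biases=[0.0, 0.1, -0.2]))
probs = softmax(dense(hidden,
                      weights=[[1.0, -1.0, 0.5],
                               [-0.5, 1.0, 0.2]],
                      biases=[0.0, 0.0]))

predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

With these made-up numbers class 0 gets the higher probability, so it is the prediction.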