By: Andrés Felipe Echeverri Guevara

This guide is the first in a series of posts on Neural Networks, Machine Learning, and AI. This brief introduction will give you some hints on how to build your own dataset and your own Neural Network. A full guide to Neural Networks will take several posts and some time, so be patient; we are just getting started. I know that some key topics are missing, such as data augmentation, dropout, one-hot encoding, and hyperparameter tuning; upcoming posts will explore these and other Machine Learning techniques in more depth. Along the way, I will also try to answer some of the questions that arise when Machine Learning is applied to a specific problem.

Before we start coding, first things first: data, lots of data. For this example, we are going to classify different vehicles. Let’s keep it simple for now and classify between two classes, “car” and “motorcycle”. I tried to find a good dataset, but most of my findings lacked generalization, and the number of samples wasn’t representative. For this reason, I have created a car and motorcycle dataset that comprises a variety of “styles”: SUVs, sports cars, compacts, hatchbacks, sedans, dirt bikes, choppers, enduros, motocross bikes, and naked bikes, with different colors, shapes, positions, illumination, etc. How much data is enough? That’s a tricky question, but a rule of thumb says you might need about 10 times as many samples as there are degrees of freedom in your model. The more complex the model, the more data you might need; in this context, complexity also grows with the number of classes. Keep in mind that Machine Learning approaches are data-hungry, and “enough” sometimes won’t be enough. I used a Google Images downloader to create my dataset. However, the dataset I’m using is still not fully representative; you can guess why by taking a look at some of the pictures, and I will elaborate on this later on. For now, bear with me and let’s use it.

What framework should I use? TensorFlow (TF) is one of the best-known and most widely used frameworks in Machine Learning; it is also open-source and very flexible. It can be deployed across different devices, from mobile phones to embedded systems. TF has become a complete ecosystem that comprises much more than just Machine Learning techniques.

Let’s read the images and label them: every car will be assigned the value “1”, and every motorcycle the value “0”. The function img_read will return the list of image paths and the list of labels corresponding to each image.
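A minimal sketch of what img_read could look like, assuming the images are stored in two folders named cars and motorcycles (the folder names and the exact signature are assumptions for illustration, not the author's original code):

```python
import os

def img_read(cars_dir="cars", motorcycles_dir="motorcycles"):
    """Collect image paths and their labels: 1 for car, 0 for motorcycle."""
    image_paths, labels = [], []
    for directory, label in [(cars_dir, 1), (motorcycles_dir, 0)]:
        for filename in os.listdir(directory):
            if filename.lower().endswith((".jpg", ".jpeg", ".png")):
                image_paths.append(os.path.join(directory, filename))
                labels.append(label)
    return image_paths, labels
```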

Moreover, some preprocessing is needed. Depending on the source of the images, data cleaning should be applied; in this case, that step was already addressed at data-collection time. Nonetheless, the images come in different sizes, and a standardized size is needed for the neural network. For our case, an 80x80x3 image is used (80 pixels high by 80 pixels wide, with the full color space). In addition, value normalization is good practice when working with Machine Learning models: most activation functions work best with values in [0,1] or [-1,1], so in this case each pixel will be scaled to a value between 0 and 1. Other normalizations that rely on the mean and standard deviation of the images can be used too.
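As a sketch of this preprocessing step, the following uses TensorFlow's image utilities to resize each image to 80x80x3 and scale the pixels to [0, 1] (the helper name load_and_preprocess is an illustrative choice, not necessarily the author's):

```python
import numpy as np
import tensorflow as tf

IMG_SIZE = 80  # target size: 80x80x3

def load_and_preprocess(image_paths):
    """Read each file, resize it to 80x80x3, and scale pixel values to [0, 1]."""
    images = []
    for path in image_paths:
        raw = tf.io.read_file(path)
        img = tf.io.decode_image(raw, channels=3, expand_animations=False)
        img = tf.image.resize(img, (IMG_SIZE, IMG_SIZE))
        images.append(img.numpy() / 255.0)  # normalization to [0, 1]
    return np.array(images, dtype=np.float32)
```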

Example of a resized image

Data needs to be split into training and validation sets. Most Data Scientists use a training and a validation set, and different splitting ratios can be found in practice. Usually, 70% of the data is used for training and the remainder is used to validate how well the model performs on unseen data.
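A simple way to perform this split, assuming images is the NumPy array returned by the preprocessing sketch above and labels is the list returned by img_read (the 70/30 ratio follows the text; the helper name is illustrative):

```python
import numpy as np

def train_val_split(images, labels, train_fraction=0.7, seed=42):
    """Shuffle the data and split it: 70% training, 30% validation."""
    labels = np.array(labels)
    indices = np.random.default_rng(seed).permutation(len(images))
    cut = int(train_fraction * len(images))
    train_idx, val_idx = indices[:cut], indices[cut:]
    return (images[train_idx], labels[train_idx],
            images[val_idx], labels[val_idx])

train_images, train_labels, val_images, val_labels = train_val_split(images, labels)
```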

Supervised or unsupervised? Well, that depends on the approach used. In supervised learning, an algorithm learns the mapping function from the input to the output; the goal is to find a function that you can use to predict new output values. In our case, we are using a supervised algorithm: our approach compares each image with its label and, during training, comes up with a function that will be used to predict the labels of future images we feed into our Neural Network.

Which topology should we use? Convolutional Neural Networks (CNNs) are known for performing well on image classification tasks. A CNN takes an input image, processes it, and classifies it under a certain category; in our case, “car” or “motorcycle”.

Let’s see what a convolution looks like in our case. As mentioned before, the input images have a size of 80x80x3. They will be convolved with 32 filters of size 2×2, and each filter learns to identify a specific feature of the image. Since we are using strides of 2, each filter slides two pixels at a time, so each filter produces a feature map of size 40×40, and the convolutional layer outputs a volume of size 40x40x32.
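To sanity-check these shapes, a single Keras Conv2D layer with 32 filters of size 2×2 and a stride of 2, applied to an 80x80x3 input, indeed yields a 40x40x32 output (this snippet is only a shape check, not the full model):

```python
import tensorflow as tf

# 32 filters, 2x2 kernel, stride 2, as described in the text.
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=(2, 2),
                              strides=2, activation="relu")

dummy_batch = tf.zeros((1, 80, 80, 3))  # a single 80x80x3 image
print(conv(dummy_batch).shape)          # (1, 40, 40, 32)
```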

The topology of the CNN used

Kernels used in the CNN; these kernels slide two pixels at a time across the image in both the x and y directions.

Each value in a feature map is the sum of the element-wise products of the kernel with the corresponding image patch. Once the non-linearity is applied, a rectified feature map can be displayed; the convolved filters produce maps in which certain features become more prominent.

Rectified feature map

Pooling is another building block of a CNN. It has several uses: it downsamples the feature maps, which helps reduce the number of parameters and the amount of computation in the network. Pooling operates on each feature map independently. There are several pooling methods, such as average pooling, sum pooling, and max pooling; max pooling is often preferred due to its simplicity. Max pooling takes the largest element from each patch of the rectified feature map to create the downsampled output.
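Here is a tiny numerical example of max pooling with a 2×2 window and a stride of 2, using a made-up 4×4 feature map:

```python
import tensorflow as tf

# A single 4x4 feature map with one channel, just to illustrate max pooling.
feature_map = tf.constant([[1., 3., 2., 1.],
                           [4., 6., 5., 2.],
                           [7., 2., 1., 0.],
                           [3., 8., 4., 9.]])
feature_map = tf.reshape(feature_map, (1, 4, 4, 1))  # (batch, height, width, channels)

pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)(feature_map)
print(tf.reshape(pooled, (2, 2)).numpy())
# [[6. 5.]
#  [8. 9.]]
```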

Example of pooling using a kernel of size 2×2 and a stride of 2

The output of the pooling layer has a spatial size of 20×20 per feature map; flattening it gives the feature vector at the very end of our convolutional stage (20×20×32 = 12,800 values for the topology above). Keep in mind that up to this point we haven’t done any classification yet; that is accomplished with a fully connected layer, in which each resulting feature is assigned a weight. In the end, we just want a binary value that tells us whether we are seeing a car or a motorcycle, which is why the output is a single neuron.
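Putting the pieces together, a possible Keras version of the topology described above might look like the sketch below. The choice of the Adam optimizer and the binary cross-entropy loss are assumptions on my part; the author's exact code may differ.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(80, 80, 3)),
    tf.keras.layers.Conv2D(32, (2, 2), strides=2, activation="relu"),   # -> 40x40x32
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2),          # -> 20x20x32
    tf.keras.layers.Flatten(),                                          # -> 12,800 features
    tf.keras.layers.Dense(1, activation="sigmoid"),                     # single output neuron
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The sigmoid on the single output neuron produces a value between 0 and 1, which maps directly onto our “motorcycle” (0) and “car” (1) labels.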

As you can see, the previous CNN has several hyperparameters that have an impact on its performance. Hyperparameter tuning and topology selection are an important part of the Machine Learning field and will be covered in future posts.

How do we know if our CNN is performing well? There are several metrics and plots that can be displayed to show how well our approach is doing. For now, though, the training and validation loss and accuracy can give us a lot of insight just by looking at the graphs. Ideally, we expect a small loss and a high accuracy in both training and validation, but that is not always the case. To better assess our approach, it is good to understand two concepts used in Machine Learning: overfitting and underfitting.
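As a sketch of how those curves could be produced, assuming the model and the train/validation arrays from the snippets above (the epoch and batch-size values are arbitrary choices for illustration):

```python
import matplotlib.pyplot as plt

history = model.fit(train_images, train_labels,
                    validation_data=(val_images, val_labels),
                    epochs=20, batch_size=32)

# Plot training vs. validation curves to spot over- or underfitting.
for metric in ("loss", "accuracy"):
    plt.figure()
    plt.plot(history.history[metric], label=f"training {metric}")
    plt.plot(history.history[f"val_{metric}"], label=f"validation {metric}")
    plt.xlabel("epoch")
    plt.legend()
plt.show()
```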

Training and validation accuracy

Overfitting is a typical issue when it comes to neural networks, and it refers to a lack of generalization in the model. In Machine Learning, generalization is the model’s ability to give sensible outputs for inputs it has never seen before; without it, the model will not be able to predict new incoming data correctly. Overfitting can be seen when the training accuracy is high but the validation accuracy is low. It can be addressed with data augmentation or by trying a different topology that includes regularization techniques such as dropout.

Underfitting is another issue; it refers to poor generalization caused by the model not learning enough from the data. It can be seen when the training and validation accuracy are both low. We will cover these metrics and more in future posts. For now, bear in mind that a properly fitting model is one that captures the complexity of the data, allowing it to find insights and make accurate predictions once new images are fed to the network.

It is also possible to check how well the CNN is working by making predictions. Below are just a few of the random images used in validation, so let’s take a look at how well it works.
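A quick way to make such predictions, assuming the trained model and val_images from the earlier sketches (the 0.5 threshold follows from the sigmoid output):

```python
import numpy as np

# Predict on a handful of validation images; probabilities above 0.5 -> "car".
sample = val_images[:8]
probs = model.predict(sample).ravel()
predicted_labels = ["car" if p > 0.5 else "motorcycle" for p in probs]
print(list(zip(np.round(probs, 2), predicted_labels)))
```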

Result of some predicted images with the trained model

Summing up

If you have come this far, you might have guessed why the dataset is not the best representation of “vehicles”, and yes, I did it intentionally. I would like to challenge you to find some good parameters for the model and train it yourself. Moreover, try feeding it a couple of pictures you find on your own and see the results (hint: find pictures with a background). You can find the full code on GitHub and Colab. The previous CNN will work with any dataset; nevertheless, if you want to use more classes, one-hot encoding is highly suggested, so try to find the best-performing parameters. We will explore this and more techniques and features that data scientists are using. Up to this point, we have covered a very basic implementation of a CNN.

If you would like to get in touch with us, you can do so by clicking on this link: https://yuxiglobal.com/contact