Boosting Image Classification Accuracy

A guide to training better CNNs

Mashrur Mahmud


Deep Learning encompasses a number of learning tasks, and the underlying concepts do not always overlap. As such, I’ll be tackling one of its more commonly practiced problem domains: Image Classification.

Convolutional Neural Networks (CNNs) have dominated the field of Computer Vision since the advent of AlexNet. They have great off-the-shelf performance, there’s no denying that. You can find a myriad of boilerplate code roaming around on the vast expanses of GitHub; it’s easy enough to hot-wire one to train on your own data.

What is not so easy, however, is knowing what to do when the model you’ve trained isn’t quite up to the mark. I intend to discuss exactly that — a set of concepts and guidelines to train more performant image classification models.

Simply feeding raw data into your model is not the recommended practice when it comes to training. Representation is extremely important in machine learning, and this holds true to some extent for deep learning as well.

Let me illustrate with a very basic example:

[Figure: a circular data distribution in Cartesian coordinates versus its Polar-coordinate representation]

Consider that you have a very simple logistic regression model at your disposal. It has a linear separator, but the problem you’ve been tasked with seems to have a circular distribution, giving you no end of trouble.

However, a circle in Cartesian coordinates is a line in Polar coordinates! Simply transforming the representation of the data can make it a lot easier for you to solve the task. There’s a fine balance between the complexity and the meaningfulness of your data; strive to minimize the former and maximize the latter.
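To make this concrete, here is a minimal sketch in Python (my own illustration, with synthetic data): once points drawn from a circular two-class layout are re-expressed in polar coordinates, a single threshold on the radius, i.e. a linear separator, classifies them perfectly.

```python
# A minimal sketch of how a circular class boundary becomes linearly
# separable after a Cartesian-to-polar transformation (synthetic data).
import numpy as np

rng = np.random.default_rng(0)

# Two classes: points inside the unit circle are class 0, outside are class 1.
x = rng.uniform(-2, 2, size=(500, 2))
y = (np.linalg.norm(x, axis=1) > 1.0).astype(int)

# Polar representation: radius and angle.
r = np.linalg.norm(x, axis=1)
theta = np.arctan2(x[:, 1], x[:, 0])

# In (r, theta) space the boundary is the straight line r = 1,
# so a single threshold on r (a linear separator) is enough.
pred = (r > 1.0).astype(int)
print("accuracy after the transform:", (pred == y).mean())  # 1.0
```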

Let’s see how the concept of better representation can be applied when we’re working with images:

[Figure: a low-contrast hillside image (left) and the same image after histogram equalization (right)]

Above, you’ll see two similar yet starkly different images of hillside scenery. Although it doesn’t seem like it, the amount of information encoded in either image is exactly the same! The only difference is in how the information is being represented.

Pixel intensity space for an 8-bit image is defined as a range of 256 possible values (0 to 255). The original image, on the left, uses this space poorly: most of its intensity values lie in a similar, grayish region.

This is what we call a low-contrast image.

The second image, on the right, distributes these compressed intensities much better, covering as much of the intensity-space as possible. It is the exact same image as the first one, save for the fact that it has undergone a histogram equalization operation. We’ve transformed the representation of the original image and have gained more meaningfulness.

Good preprocessing is vital to a model's performance. Depending on the task at hand, you may find dozens of possible preprocessing operations to try. When working with images, common methods include grayscale conversion, quantization, color-channel normalization, contrast stretching, and histogram equalization (which is what we did in the last picture!).
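As a concrete example, here is a minimal sketch of histogram equalization with OpenCV; the file name is a placeholder, and for color images one common approach (assumed here) is to equalize only the luminance channel.

```python
# A minimal sketch of histogram equalization with OpenCV.
# "hillside.jpg" is a placeholder file name.
import cv2

# Grayscale case: equalize the single intensity channel directly.
gray = cv2.imread("hillside.jpg", cv2.IMREAD_GRAYSCALE)
gray_eq = cv2.equalizeHist(gray)

# Color case: convert to YCrCb and equalize only the luminance (Y) channel,
# so the colors themselves are not shifted.
bgr = cv2.imread("hillside.jpg", cv2.IMREAD_COLOR)
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
bgr_eq = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

cv2.imwrite("hillside_equalized.jpg", bgr_eq)
```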

Learning rate scheduling is pretty much always a good idea. If you find yourself reading a Deep Learning paper, scroll down and take a look at the training details; it's highly likely that the authors have employed some form of scheduling.

The step size, or learning rate, that a model requires naturally changes as the iterations pass. For one, the loss curve becomes flatter the closer the model steers towards the optimum, so the original learning rate may overshoot and no longer suffice for convergence.

Hence, the need for scheduling. Rather than selecting a learning rate before training and having it remain static throughout time, it makes more sense for the learning rate to be dynamic, changing when necessary.

Some scheduling policies involve manually choosing reduced learning rate values at certain points of time. Other schedulers may involve constant decays, like the Exponential scheduler, or are perhaps defined by a mathematical formulation, such as the Cosine Annealing scheduler.

Here’s an actual visualization of how learning rates are adjusted by various policies. I’ll also show you how to generate this GIF yourself, since I can’t resist sneaking in a bit of code.

[Animation: learning rate curves produced by different scheduling policies over the course of training]

I’ll tap into PyTorch’s torch.optim module to make this.
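Here's a minimal sketch along those lines; the particular schedulers and their parameters are my own illustrative choices, and it saves a static plot (stitch one frame per epoch together if you want the animated GIF).

```python
# A minimal sketch: record and plot the learning rate produced by a few
# torch.optim.lr_scheduler policies over 100 epochs.
import torch
import matplotlib.pyplot as plt

EPOCHS = 100
BASE_LR = 0.1

def lr_history(make_scheduler):
    """Run a dummy optimizer for EPOCHS steps and record its learning rate."""
    params = [torch.nn.Parameter(torch.zeros(1))]
    optimizer = torch.optim.SGD(params, lr=BASE_LR)
    scheduler = make_scheduler(optimizer)
    history = []
    for _ in range(EPOCHS):
        history.append(scheduler.get_last_lr()[0])
        optimizer.step()   # a real training step would go here
        scheduler.step()
    return history

schedules = {
    "StepLR (halve every 20 epochs)":
        lambda opt: torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.5),
    "ExponentialLR (gamma=0.95)":
        lambda opt: torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95),
    "CosineAnnealingLR (T_max=100)":
        lambda opt: torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=EPOCHS),
}

for name, make in schedules.items():
    plt.plot(lr_history(make), label=name)

plt.xlabel("Epoch")
plt.ylabel("Learning rate")
plt.legend()
plt.savefig("lr_schedules.png")
```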

In layman's terms, label smoothing is a method that penalizes a model for making overconfident predictions. If that doesn't really make sense yet, bear with me for a bit.

Let’s take a look at a simple N-class classification problem. A one-hot encoded target vector for such a problem could be formulated as:

OneHot(y, N) = [ 1 if i=y, otherwise 0 :: for i in 1…N ]

Consequently, a label smoothed target vector would then be defined as:

LabelSmoothed(OneHot, α, N) = (1-α)*OneHot + α/N

Where α is a constant and a parameter to be chosen.

The one-hot target vector defined previously has what we call hard probabilities, or definite odds. An example one-hot vector for a 5-class problem would be [0,0,0,1,0].

Let's apply label smoothing to this vector with α = 0.1. Following the formula defined above, we get: [0.02, 0.02, 0.02, 0.92, 0.02].

Our targets are no longer hard probabilities. They are soft targets, but they have kept an interesting property: the probabilities still sum to 1! This preserves the fact that, in any sample space, the probabilities of all mutually exclusive events which cover every possible outcome must add up to 1.
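Here's a minimal sketch of that formula, reproducing the five-class example; the helper is my own, though recent versions of PyTorch also expose the same idea directly through nn.CrossEntropyLoss(label_smoothing=...).

```python
# A minimal sketch of label smoothing, reproducing the 5-class example above.
import torch

def smooth_one_hot(target: int, num_classes: int, alpha: float) -> torch.Tensor:
    """LabelSmoothed(OneHot, alpha, N) = (1 - alpha) * OneHot + alpha / N."""
    one_hot = torch.zeros(num_classes)
    one_hot[target] = 1.0
    return (1.0 - alpha) * one_hot + alpha / num_classes

smoothed = smooth_one_hot(target=3, num_classes=5, alpha=0.1)
print(smoothed)        # tensor([0.0200, 0.0200, 0.0200, 0.9200, 0.0200])
print(smoothed.sum())  # tensor(1.) -- the soft targets still sum to 1

# Recent PyTorch versions bake the same idea into the loss itself:
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```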

What is important to note here is the subtle, but powerful effect this has on our models. I’ve mentioned overconfident assumptions earlier, but what does that really mean?

When training classification models, we measure the predicted probabilities against the targets. These probabilities are calculated by passing the logits, the raw outputs of your model's final layer, through a softmax transformation.

softmax(z) = [ exp(i)/sum(exp(z)) :: for i in z ]

Effectively, our model's aim is to produce a z for which the difference between softmax(z) and OneHot is minimized. Let's do some backward calculation.

The logit z = [0,0,0,12,0] gives a softmaxed value of [0,0,0,0.9999,0] approximately, which is a very close approximation of the hard target [0,0,0,1,0].

But to get a prediction close to the label-smoothed target [0.02, 0.02, 0.02, 0.92, 0.02], we'd need a logit z = [0,0,0,3.83,0]. The gap between the logit value of the predicted class and those of the other classes in the first case is more than three times as large as in the second!
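You can verify both logits in a couple of lines (this quick check is my own addition):

```python
# Verifying the backward calculation above with a softmax.
import torch
import torch.nn.functional as F

hard = F.softmax(torch.tensor([0.0, 0.0, 0.0, 12.0, 0.0]), dim=0)
soft = F.softmax(torch.tensor([0.0, 0.0, 0.0, 3.83, 0.0]), dim=0)
print(hard)  # ~[0.0000, 0.0000, 0.0000, 1.0000, 0.0000]
print(soft)  # ~[0.0200, 0.0200, 0.0200, 0.9201, 0.0200]
```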

This is what we mean by overconfident assumptions. Due to the nature of hard targets, they force the model to assume larger confidence values than necessary.

In a real test scenario, a particular data instance may have only a moderate probability (say 0.5) of belonging to a certain class. An overconfident model will predict a much higher probability, 0.9 or more, giving us a miscalibrated confidence estimate. The model will also be prone to confident mistakes, predicting certain classes when it should do otherwise.

With label smoothing, predicted outcomes are no longer forced to be skewed towards unnecessarily high probabilities, thus solving this problem.

Augmentation refers to the process of diversifying data through various transformations. This one's one of the basics, really, but I nevertheless wish to stress its importance.

Although powerful, CNNs are prone to overfitting. Suppose you’re training a model to classify chairs. All your training images have chairs in their natural state: standing upright. Later, while testing, your model encounters a flipped chair — and utterly fails.

[Figure: a chair in its usual upright orientation next to a flipped chair]

If you randomly rotate some chair images during training, however, your model might just be able to classify that flipped chair!

Augmented datasets are harder to fit, which places stronger constraints on models. Moreover, augmentation provides a form of model-independent regularization, which is not the case with L2 weight decay and other hyperparameter-based methods.

A general set of transformations when working on image classification includes scaling, random rotation, random crops, random horizontal and vertical flips, etc.
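As a concrete starting point, here's a typical torchvision pipeline; the exact transforms and their parameters are illustrative defaults rather than a prescription.

```python
# A typical training-time augmentation pipeline with torchvision
# (the transforms and parameters are illustrative defaults).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),            # random scale + crop
    transforms.RandomHorizontalFlip(p=0.5),       # mirror half the images
    transforms.RandomRotation(degrees=15),        # small random rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Augmentation belongs only in training; keep validation deterministic.
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```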

Here’s some empirical evidence on how effective augmentation can be:

[Figure: training images with random square regions zeroed out by Cutout]

The above method is called Cutout. It involves zeroing out one or more square regions of a certain size in an image.

Seems simple enough, right? However, simply adding Cutout on top of existing methods allowed its authors to reach SOTA results of their time — on three datasets!
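Here's a minimal sketch of Cutout as a tensor transform, written from that description (torchvision's RandomErasing is a closely related built-in alternative):

```python
# A minimal sketch of Cutout: zero out a random square patch of a CxHxW tensor.
import torch

class Cutout:
    def __init__(self, size: int = 16):
        self.size = size  # side length of the square patch

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        _, h, w = img.shape
        # Pick a random center; the patch may partially fall outside the image.
        cy = torch.randint(0, h, (1,)).item()
        cx = torch.randint(0, w, (1,)).item()
        y1, y2 = max(0, cy - self.size // 2), min(h, cy + self.size // 2)
        x1, x2 = max(0, cx - self.size // 2), min(w, cx + self.size // 2)
        img = img.clone()
        img[:, y1:y2, x1:x2] = 0.0
        return img

# Usage: append it after ToTensor() in the training transform pipeline,
# e.g. transforms.Compose([..., transforms.ToTensor(), Cutout(16)]).
```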

Augmentation has further evolved with the advent of AutoML methods. Much like Neural Architecture Search (NAS) which finds optimal model architectures, AutoAugment finds the best set of augmentation policies for a specific dataset.

If I ask you what optimizer you’ve used the most while dabbling in deep learning, chances are that the answer is Adam.

Its popularity is well rooted in its faster convergence. However, what most people don't realize is that Adam, and even the more recent Rectified Adam, still fails to surpass the final performance of plain, well-tuned Stochastic Gradient Descent!

So why exactly does this happen? As it turns out, Adam confidently takes some large and potentially sub-optimal steps during the initial stages of training.

Don’t get me wrong. Adam is still a very good optimizer choice, as you’ll certainly finish training your model a lot faster and get decent results. However, if you’re not hurting for resources and if that extra percentage of accuracy is an absolute must, then you might as well give SGD with Warmup a go.
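Here's a minimal sketch of SGD with a linear warmup followed by cosine annealing, using schedulers available in recent PyTorch; the warmup length, learning rate, and other values are placeholders.

```python
# A minimal sketch of SGD with linear warmup, then cosine annealing.
# The model, learning rate, and epoch counts are placeholder values.
import torch

model = torch.nn.Linear(10, 5)   # stand-in for your CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

warmup_epochs, total_epochs = 5, 100
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        # Ramp the LR from 10% to 100% of its base value during warmup.
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1,
                                          total_iters=warmup_epochs),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=total_epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    # ... one epoch of training would go here ...
    optimizer.step()      # normally called once per batch after backward()
    scheduler.step()      # advance the schedule once per epoch
```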


This one's something that bothers me, and possibly quite a lot of people out there: Deep Learning in general involves a lot of hyperparameters. While there are good or recommended settings, hyperparameters are rather like clothes: there is no one size that fits all.

A better choice of hyperparameters will give you better results. But how do we actually go about finding these good choices?

Hyperparameter tuning can be loosely categorized into four schools of thought:

  • Babysitting
  • Grid Search
  • Random Search
  • Bayesian Optimization

Don't let the name put you off; babysitting is still the most widely practiced method of finding hyperparameters. It is exactly as it sounds: you make an educated guess at the hyperparameters, or perhaps simply use the default settings, and then watch your model train. If it isn't converging right, you make some manual changes based on intuition, then watch your model train again. You repeat this until you run out of patience, or resources, or both.

Moving on to the next two search methods: Grid Search involves searching among combinations from a fixed set of values for each parameter. Random Search, on the other hand, involves choosing a random combination of values for each training run.

One of these two is better than the other, and I want you to take a moment and think about it. Take a look at the image below and see if you can figure it out:

[Figure: nine Grid Search trials versus nine Random Search trials over an important and an unimportant hyperparameter]

Consider an N-dimensional space, where N is the number of parameters you have to tune. For each parameter, let's say you want to try M possible values. Grid Search will train a model M^N times to figure out an approximately optimal configuration. Looking at the figure above, although Grid Search has trained a model nine times, it has effectively tried only three values for each hyperparameter!

Random Search, however, has tried nine distinct values for each hyperparameter, even though it trained the same number of times as Grid Search. With the same budget, Random Search samples M^(N-1) times as many distinct values of each hyperparameter. What makes it worse for Grid Search is the fact that not every hyperparameter is equally important; it has explored the search space poorly. Random Search is clearly the winner here.
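Here's a minimal sketch of the difference in plain Python; train_and_score is a hypothetical stand-in for an actual training run, and the hyperparameter ranges are arbitrary.

```python
# A minimal sketch of Grid Search vs. Random Search over two hyperparameters.
import itertools
import random

def train_and_score(lr: float, weight_decay: float) -> float:
    ...  # hypothetical: train a model and return its validation accuracy

# Grid Search: 3 values per hyperparameter -> 3^2 = 9 runs,
# yet only 3 distinct values of each hyperparameter are ever tried.
lrs = [1e-3, 1e-2, 1e-1]
wds = [1e-5, 1e-4, 1e-3]
grid_trials = list(itertools.product(lrs, wds))

# Random Search: the same 9-run budget, but every run samples fresh values,
# so each hyperparameter sees 9 distinct values.
random_trials = [(10 ** random.uniform(-3, -1), 10 ** random.uniform(-5, -3))
                 for _ in range(9)]

for lr, wd in grid_trials + random_trials:
    train_and_score(lr, wd)
```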

The last method, Bayesian Optimization, is a bit different from the other two. In both Grid and Random Search, we’re not taking past results into account. We choose a set of values and train our models a certain number of times; it really makes no difference if we train the models in parallel or in sequence.

But what if the outcome of each training run affected the choice of the hyperparameters for the next run? The Bayesian scenario comes into play here — now, in each trial, the choices are no longer independent of past trials. We use the probabilistic model P( score | hyperparameters ) to find our approximately optimal set of hyperparams. Here’s a great read on the topic that I still refer to from time to time.
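If you want a hands-on starting point, here's a minimal sketch with Optuna, whose default TPE sampler is one popular flavor of this sequential, history-aware tuning; train_and_score is again a hypothetical stand-in (a toy surrogate here just so the sketch runs end to end).

```python
# A minimal sketch of history-aware hyperparameter tuning with Optuna.
# train_and_score is a hypothetical stand-in for a real training run;
# the toy surrogate below just keeps the sketch runnable end to end.
import math
import optuna

def train_and_score(lr: float, weight_decay: float) -> float:
    # Replace with real training + validation; this toy score peaks at lr=1e-2.
    return 1.0 - 0.1 * (math.log10(lr) + 2.0) ** 2

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True)
    return train_and_score(lr, weight_decay)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)   # each trial is informed by the previous ones
print(study.best_params)
```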

You really deserve an award if you’ve powered through all of that. Kudos!

I hope I’ve given you several things to think about and further investigate. Perhaps you could even implement and try out some of them!

Lastly, I’d like to point you to an inspired read, if you have the time for it later. It’s called Bag of Tricks for Image Classification with Convolutional Neural Networks — the title says it all.
