Deep Learning

Profiling Neural Networks to improve model training and inference speed

Amitabha · Published in Towards AI · Dec 16, 2021



In a previous post, we studied how to teach the Anki Vector robot to recognize human sign language. Specifically, we trained a custom Convolutional Neural Network (CNN) with a labeled dataset of 8500 images of human signs taken from Vector’s camera. We demonstrated how the trained CNN can be used to detect human signs in this video. We also explored the tradeoffs between a small, custom-built CNN model and a large-scale, well-recognized ResNet model. Similar efforts have been made by other researchers, such as an effort to teach Anki Cozmo to recognize human sign language.

In this post, we will figure out ways to optimize this model in terms of the time it takes to train the model (speed of training) and the time it takes to classify an image with it (speed of inference). We can break down this process into several steps.

  1. Profile your existing model, and find opportunities for improvement.
  2. Make the corresponding changes in your model to achieve the desired improvement.
  3. Rerun your training pipeline, re-profile, and measure whether you achieved the improvement.

Profiling for performance

If your project is based on TensorFlow, the easiest way to profile your model is with the TensorBoard Profiler. Similar performance profilers are available for PyTorch. This article walks you through the steps of optimizing GPU performance. In our specific case, we will profile the performance of the ResNet model trained for human sign language recognition. This Colab notebook demonstrates how to run the TensorBoard profiler while training ResNet. In short, we trained ResNet for 5 epochs and achieved an accuracy of 94%; at the same time, we collected profiles to understand the performance of training the model.
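
If you want to collect this kind of profile yourself, the profiler can be attached to a Keras training run through the standard TensorBoard callback. Below is a minimal sketch; the ResNet50 stand-in, the random 4-class data, and the log directory are placeholders rather than the exact setup from the Colab notebook.

import tensorflow as tf

# Placeholder stand-ins for the ResNet model and the sign-language dataset;
# swap in the real model and tf.data pipeline from your own project.
model = tf.keras.applications.ResNet50(weights=None, classes=4,
                                       input_shape=(224, 224, 3))
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((64, 224, 224, 3)),
     tf.random.uniform((64,), maxval=4, dtype=tf.int32))
).batch(8)

# Profile batches 2 through 6 of training; view the result with:
#   tensorboard --logdir logs/resnet_profile
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs/resnet_profile",
                                             profile_batch=(2, 6))

model.fit(train_ds, epochs=1, callbacks=[tb_callback])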

Here is a snapshot from the profiler.

Figure 1: Output from TensorBoard Profiler

A few important observations can be made from the above profile.

  1. Each step of training the model took 631 ms. A step consists of one iteration of updating the parameters of the neural network based on a batch of training data. For each step, the CPU needs to offload the model parameters to the GPU and fire off the computations. In other words, one round of updates to the model parameters took 631 ms.
  2. The majority of the time (~590 ms) falls under device compute time (see the light-green curve), which is the time the GPU spends on the matrix multiplications that compute the loss and the gradients used to update the model (it's okay if you do not understand the details of how a CNN is trained for this exercise).
  3. About 23 ms (3.7%) falls in the category of kernel launch time. This is the time it takes for the CPU to launch the computations (kernels) on the GPU. The TensorBoard profile notes an optimization that can be made in this step; we will discuss this optimization later.
  4. The TensorBoard profile notes that none of the computations are based on Floating Point 16 (FP16) arithmetic (noted in the black circle). This presents ripe ground for improvement, and we will discuss how to make this improvement, and the tradeoff it presents, in the next part of this post.

Mixed Precision Arithmetic

Let us first understand the difference between Floating Point 16 (FP16) and Floating Point 32 (FP32) arithmetic.

FP16 vs FP32 arithmetic

Using FP16 buys us two significant computational advantages:

  1. The computational speed of operating on FP16 (say, multiplying two FP16 numbers) is many times faster than FP32 arithmetic. There are two reasons: (i) there is roughly 4x less arithmetic work to do per operation, and (ii) many processors, such as those from Nvidia, have specialized units (Tensor Cores) that handle FP16.
  2. FP16 data consumes half the memory bandwidth of FP32, reducing the probability of running into a memory-access bottleneck before a computational bottleneck.

However, the consequence of using FP16 is lower precision. If we simply used FP16 to represent all the coefficients of a neural network, there is a high likelihood that training would not converge because of the loss of precision.
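
To make this concrete, here is a tiny NumPy sketch (not from the original post) of how a small update simply vanishes in FP16: the spacing between representable FP16 values around 1.0 is roughly 0.001, so any update smaller than that is rounded away.

import numpy as np

weight = np.float16(1.0)
update = np.float16(1e-4)   # a small gradient step

# FP16 cannot represent 1.0001: the nearest representable value is 1.0,
# so the update is lost and the weight never moves.
print(weight + update)                      # 1.0
print(np.float32(1.0) + np.float32(1e-4))   # 1.0001
print(np.finfo(np.float16).eps)             # ~0.000977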

This has led to the introduction of mixed-precision training. In mixed-precision training, we perform most operations in FP16 but preserve some core parts of the network in FP32 (usually, model activations and gradients are stored in a 16-bit floating-point format while model weights and optimizer states use 32-bit precision) so that information loss is minimized. NVIDIA’s documentation shows how mixed-precision training can achieve a 3x boost in speed while converging to the same level of accuracy.

ML Libraries support Mixed Precision Training

Most ML libraries have built-in native support for mixed precision. As an example, in TensorFlow, enabling mixed precision training just involves adding the following lines:

from tensorflow.keras import mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

In the above lines of code, we set a global policy to use mixed precision with FP16. A detailed document explaining the intricacies in TensorFlow is available here. Next, we need to measure the effect of this code change.
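
One related detail from the TensorFlow mixed-precision guide is worth noting before we measure anything: under the mixed_float16 policy, layers compute in float16 while their variables stay in float32, the final softmax should be kept in float32 for numerical stability, and loss scaling is applied automatically when training through model.fit (a custom training loop would need mixed_precision.LossScaleOptimizer). Here is a minimal sketch; the layer sizes and the 4-class output are placeholders, not the ResNet architecture from the notebook.

import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

mixed_precision.set_global_policy("mixed_float16")

# Layers now compute in float16 while their variables remain float32.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(4)(x)
# Force the final softmax back to float32 for numerical stability.
outputs = layers.Activation("softmax", dtype="float32")(x)

model = tf.keras.Model(inputs, outputs)
# With model.fit(), Keras handles loss scaling for us automatically.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")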

Quantifying model changes

Here are a few principles on how to scientifically quantify model changes:

  1. Make an apples-to-apples comparison: Make sure that you thoroughly know and understand what you are comparing. Often there are subtle changes that happen in the background which can cloud any comparisons that we make between two experiments. In the context of this work, it is important to check that the training is performed on the same type of GPU, on the same training dataset, and for the same number of epochs.
  2. Be careful about run-to-run variation: When we run an experiment twice, the end results are rarely identical. The differences arise from the many stochastic processes that are always at work in a system. Hence, it is important not to judge an experiment from a single run; rather, run the experiment multiple times. Results presented from multiple runs give readers a much higher degree of confidence.
  3. Use multiple data points to compare and contrast results: Once we have the results from multiple runs of an experiment, we need different measures to interpret them. Often, the mean is used as the sole point of comparison, but the mean alone is usually not enough. Additional useful metrics include i) standard deviation, ii) median, and iii) percentiles, such as the 90th percentile (a small sketch of these summaries follows this list).
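
Here is a small NumPy sketch of those summary statistics, applied to a list of per-run training times; the values shown are placeholders and would come from your own repeated runs.

import numpy as np

def summarize(run_times_s):
    """Summarize wall-clock times (seconds) collected from repeated runs."""
    times = np.asarray(run_times_s, dtype=float)
    return {
        "mean": times.mean(),
        "std": times.std(ddof=1),        # sample standard deviation
        "median": np.median(times),
        "p90": np.percentile(times, 90),
    }

# Placeholder values only; replace with your measured run times.
print(summarize([312.0, 305.5, 318.2, 309.9, 307.4]))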

Understanding the effect of Mixed Precision Training

Now that we understand how to compare the results of experiments, let us compare the performance of training the ResNet model on the human sign language recognition dataset. In this case, we want to compare the following:

  1. Training ResNet with the default 32-bit precision.
  2. Training ResNet with mixed precision.

The training was terminated at 5 epochs. (If we were actually training a model to be deployed in a production environment, we would train for more epochs, until the accuracy converges; but for a performance comparison, 5 epochs is good enough.) We ran five iterations of each set of experiments with the help of this Google Colab notebook.
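
Each run was timed end to end. A minimal sketch of such a timing harness is shown below; build_and_train is a hypothetical helper (not from the notebook) that builds a fresh model, trains it for 5 epochs, and returns the final validation accuracy.

import time

def timed_runs(build_and_train, n_runs=5):
    """Run the training pipeline n_runs times, recording wall-clock time.

    build_and_train is a hypothetical helper: it should build a fresh model,
    train it for 5 epochs, and return the final validation accuracy.
    """
    times, accuracies = [], []
    for _ in range(n_runs):
        start = time.perf_counter()
        val_acc = build_and_train()
        times.append(time.perf_counter() - start)
        accuracies.append(val_acc)
    return times, accuracies

Here are the results.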

Figure 2: Box and Whisker plot of time to train the model to 5 epochs
Figure 3: Box and Whisker plot of Accuracy of the model on the validation dataset after 5 epochs of training
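
(If you want to produce similar plots from your own measurements, a few lines of matplotlib suffice; the lists below hold dummy values to be replaced with the timings collected above, not data from this experiment.)

import matplotlib.pyplot as plt

# Dummy placeholder lists; fill them with your measured per-run times.
fp32_times = [1.00, 1.05, 0.98, 1.10, 1.02]
mixed_times = [0.90, 0.92, 0.88, 0.95, 0.91]

plt.boxplot([fp32_times, mixed_times], labels=["FP32", "Mixed precision"])
plt.ylabel("Time to train 5 epochs (seconds)")
plt.title("Training time across repeated runs")
plt.show()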

These plots are commonly known as Box and Whisker plots. The top and bottom bars denote the maximum and minimum, while the colored region of the plot denotes the range between the upper quartile (75th percentile) and the lower quartile (25th percentile). With the help of these plots, we can make the following conclusions.

  1. Mixed precision training leads to a faster time to train the model, as well as lower variance in the training time. Using the mean as the comparison metric, mixed precision training is 6% faster.
  2. Mixed precision training shows a far larger variance in model accuracy. In the worst case, the accuracy can be significantly worse (a worst case of 66% accuracy after 5 epochs, compared to 86% after 5 epochs for 32-bit training). Worse accuracy might mean the model needs to be trained for more epochs, or retrained altogether… both options would increase the time to train.

By the way, a model trained with Mixed Precision would have a lower time for inference as well… but we will discuss that in a subsequent post.

If you have any questions or thoughts, please leave them in the comments below. Please follow my publication, “Programming Robots,” for more interesting articles. I also have an online course that teaches AI with the help of Vector, available at https://robotics.thinkific.com. I would be honored to have you as a student.


