Speeding up deep learning

By Michaelstone428 [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons


There’s a lot going on in the world of AI.  Significant research effort is applied to a seemingly infinite number of aspects of deep learning.  One that caught my eye today was this story in The Register, discussing how researchers have been working to speed up neural network training times. There were a few key points about these speed-up techniques that I thought I’d share.  And by speed up, I mean for a specific neural network architecture and training set.  So I’m excluding other techniques like changing the number of hidden layers, convolution size, downsampling, dimensionality reduction and so on.

Make a coffee…

Neural networks take a long time to train to any reasonable/useful accuracy unless you can throw a heap of resources at them.  For example, when training convolutional neural networks (CNNs) for image classification tasks, the researchers were using 1024 GPUs – not exactly within the reach of your average data science team.

Arguably I could scale up massively using services like Amazon SageMaker, although quite sensibly the default service limits are set quite low, especially for the GPU-enabled p instances.  The default limit gets you up to 16 GPUs (at an on-demand rate of circa $50/hour); beyond that, you need to contact AWS to increase your service limits.  That’s something we do all the time for a range of services, so it’s not a one-off exception, it’s the norm. Within the default limits, you could end up running a 5-day training job :).  It’s less “make a coffee”, and more like “make a coffee machine”.  These kinds of run times are what deep learning engineers have to live with.
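As a back-of-the-envelope sanity check, those numbers multiply out like this (the $50/hour and 5-day figures are the rough ones quoted above, not exact SageMaker pricing):

```python
# Rough cost sketch for a long multi-GPU training job.
# The $50/hour figure is the approximate on-demand rate mentioned above;
# real SageMaker pricing varies by instance type and region.
hourly_rate = 50.0           # USD/hour for a 16-GPU allocation (assumption)
job_days = 5                 # a worst-case "make a coffee machine" run
total_hours = job_days * 24
cost = hourly_rate * total_hours
print(f"{total_hours} hours at ${hourly_rate:.0f}/hour = ${cost:,.0f}")
```

So a week-ish of waiting is also a few thousand dollars of compute, which is why the speed-up techniques below matter commercially and not just academically.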

Of course, there’s no such thing as a free lunch here. Throwing more GPUs at a problem does not scale training times down linearly, because of the inherent communication and coordination overhead in large distributed processing architectures.  In the cloud world, we’ve grown used to assuming that compute scales linearly, with cost going up and run times coming down in step. We’ve been spoiled!  The researchers got a 1.3x speed-up when they doubled their number of GPUs.  Not everything in life is embarrassingly parallelisable after all – shocker!
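The sub-linear scaling can be sketched with a simple Amdahl’s-law-style model, where some fraction of each training step is communication and coordination that doesn’t parallelise. The 25% overhead figure below is purely illustrative, not a number from the article:

```python
def speedup(n_gpus, comm_fraction):
    """Amdahl-style model: comm_fraction of each step cannot be parallelised."""
    t1 = 1.0                                           # single-GPU step time
    tn = comm_fraction + (1.0 - comm_fraction) / n_gpus
    return t1 / tn

# With ~25% of step time spent on communication (an assumed figure),
# doubling the GPU count is a long way from a 2x win:
for n in (1, 2, 4, 8):
    print(f"{n} GPUs -> {speedup(n, 0.25):.2f}x")
```

Under this toy model, going from 4 to 8 GPUs buys you roughly a 1.3x improvement, which is in the same ballpark as the researchers’ observed doubling gain.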

Make your maths worse…

Lower calculation precision is often a good compromise. This is well known in the deep learning world, but not immediately obvious outside it.  The power of neural network architectures comes not from uber-precise matrix maths, but from the sheer volume of it. So, somewhat unintuitively, you can drop the calculation bit-size and use lower-precision multiplications for the vector processing. If that gets you more calculations per second out of each GPU, you can train faster with limited loss of model accuracy.  This team used a mix of 32-bit full-precision and 16-bit half-precision floating-point operations.  Quite the opposite of the relentless march we’ve seen over the decades in general computing towards longer and longer register sizes.
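A quick way to see why this trade-off is tolerable is to compare a half-precision matrix multiplication against a full-precision one. This NumPy sketch (matrix sizes and seed are arbitrary choices, not from the article) shows the error introduced by dropping to 16 bits is small relative to the result:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)

full = a @ b                                                  # 32-bit reference
half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

# Worst-case element error, relative to the largest value in the result.
rel_err = np.abs(full - half).max() / np.abs(full).max()
print(f"max relative error from fp16 matmul: {rel_err:.4f}")
```

For a network whose weights get nudged millions of times during training, a rounding error at this scale is largely noise, which is why mixed 16/32-bit training works as well as it does.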

Make your batches bigger…

Training batch size is another key trade-off.  This hyperparameter defines how many of your input data samples you randomly group together and use for each training step of your neural network.  A bigger batch size means your inputs are more “averaged” before they are used to update all the weights in your network.  There are fewer weight updates per pass through the data, saving processing time.  Model accuracy may degrade as a result, but not necessarily by too much.
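The arithmetic behind the saving is straightforward: with a fixed-size training set, the number of weight updates per epoch falls as the batch size grows. A quick sketch using the commonly quoted ImageNet-1k training-set size (batch sizes here are illustrative):

```python
# Updates per epoch shrink as batch size grows, for a fixed dataset.
dataset_size = 1_281_167             # ImageNet-1k training images
for batch_size in (256, 1024, 8192):
    steps = -(-dataset_size // batch_size)      # ceiling division
    print(f"batch {batch_size:>5}: {steps:>5} updates per epoch")
```

Each update carries fixed per-step overhead (gradient synchronisation across GPUs, for one), so cutting the step count from thousands to hundreds is a direct win on wall-clock time.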

Results

Using these kinds of techniques, the Chinese research team managed to train a ResNet-50 architecture (an industry reference point) on the classic ImageNet dataset in an amazing 8.7 minutes.  However, accuracy is very poor, i.e. not exactly going to be very useful for commercial applications – at 76.2%.   But that’s not the point – they could train for longer and allow the network to converge further.  The classic time/accuracy trade-off.  And sometimes a lower-accuracy network is all you need to prove a point during research.  Interesting stuff!


Robin Meehan
robin@inawisdom.com