Based on a collaboration between Facebook's Artificial Intelligence Research and Applied Machine Learning groups, a new paper details how Facebook researchers developed a way to significantly speed up the training of image classification models.
Facebook explains in the paper that deep learning techniques thrive on large neural networks and large datasets, but these come with longer training times that can impede research and development progress. Distributed synchronous stochastic gradient descent (SGD) offers a potential solution by dividing SGD minibatches over a pool of parallel workers; to make this method efficient, however, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size, according to Facebook.
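The idea behind dividing a minibatch over workers can be sketched in a few lines. In the toy example below, each worker computes a gradient on its own shard of the minibatch and the gradients are averaged before a single shared update; the least-squares loss, worker count, and data are illustrative, and the averaging step stands in for the all-reduce a real distributed system would perform:

```python
# Sketch of one step of synchronous SGD across parallel workers.
# Each worker computes the gradient on its own shard of the minibatch;
# the gradients are then averaged (an all-reduce in a real system)
# and every worker applies the same update, keeping replicas in sync.

def local_gradient(weights, shard):
    """Hypothetical per-worker gradient for a 1-D least-squares loss."""
    grad = 0.0
    for x, y in shard:
        grad += 2 * (weights * x - y) * x
    return grad / len(shard)

def sync_sgd_step(weights, shards, lr):
    # One gradient per worker, computed in parallel in a real system.
    grads = [local_gradient(weights, s) for s in shards]
    # Synchronous step: average gradients across all workers.
    avg_grad = sum(grads) / len(grads)
    # Identical update applied everywhere.
    return weights - lr * avg_grad

# Toy minibatch split across 4 workers.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
w = 0.0
for _ in range(200):
    w = sync_sgd_step(w, shards, lr=0.01)
print(round(w, 3))  # converges toward the true slope, 3.0
```

Because every worker sees the same averaged gradient, the result is mathematically equivalent to running plain SGD on the full minibatch on one machine, which is what makes the synchronous variant attractive for scaling.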
In the paper, the researchers explain that on the ImageNet dataset, large minibatches cause optimization difficulties, but that once these are addressed, the trained networks exhibit good generalization. Training on the ImageNet-1k dataset of over 1.2 million images previously took multiple days, but Facebook has found a way to reduce that time to one hour while maintaining classification accuracy.
"They can say 'OK, let’s start my day, start one of my training runs, have a cup of coffee, figure out how it did,'" Pieter Noordhuis, a software engineer on Facebook’s Applied Machine Learning team, told VentureBeat.“And using the performance that [they] get out of that, form a new hypothesis, run a new experiment, and do that until the day ends. And using that, [they] can probably do six sequenced experiments in a day, whereas otherwise that would set them back a week."
Specifically, the researchers report that with a large minibatch size of 8,192 spread across 256 GPUs, they trained ResNet-50 in one hour while maintaining the same level of accuracy as a 256-image minibatch baseline. They accomplished this with two techniques: a linear scaling rule that adjusts the learning rate in proportion to the minibatch size, and a new warmup scheme that overcomes optimization challenges early in training by gradually ramping the learning rate from a small value up to the full scaled value.
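The two learning-rate techniques amount to a simple schedule, sketched below. The baseline values (a learning rate of 0.1 per 256-image minibatch and a 5-epoch warmup) follow the conventions reported for ResNet-50 training, but treat the helper function and its parameters as illustrative rather than as the paper's exact implementation:

```python
# Sketch of the linear scaling rule plus gradual warmup.
# Linear scaling: when the minibatch grows by a factor k, multiply the
# learning rate by k as well. Warmup: ramp the rate linearly from the
# small baseline value up to the full scaled value over the first few
# epochs, to avoid optimization trouble early in training.
# base_lr=0.1 per 256 images and 5 warmup epochs are illustrative defaults.

def learning_rate(epoch, minibatch_size, base_lr=0.1,
                  base_batch=256, warmup_epochs=5):
    # Linear scaling rule: lr grows in proportion to minibatch size.
    scaled_lr = base_lr * minibatch_size / base_batch
    if epoch < warmup_epochs:
        # Gradual warmup: linear ramp from base_lr toward scaled_lr.
        frac = epoch / warmup_epochs
        return base_lr + frac * (scaled_lr - base_lr)
    return scaled_lr

# With an 8,192-image minibatch (32x the 256-image baseline):
print(learning_rate(0, 8192))   # 0.1  (start of warmup)
print(learning_rate(5, 8192))   # 3.2  (full scaled rate: 0.1 * 32)
```

At the baseline minibatch size the schedule is a no-op (the scaled rate equals the base rate), so the same code path covers both the small- and large-batch regimes.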
With these techniques, noted the paper, a Caffe2-based system trained ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy.
"Using commodity hardware, our implementation achieves ∼90% scaling efficiency when moving from 8 to 256 GPUs," notes the paper’s abstract. "This system enables us to train visual recognition models on internet-scale data with high efficiency."
In summary, according to Facebook's Lauren Rugani, the paper demonstrates how creative infrastructure design can contribute to more efficient deep learning at scale.
"With these findings, machine learning researchers will be able to experiment, test hypotheses, and drive the evolution of a range of dependent technologies — everything from fun face filters to 360 video to augmented reality," wrote Rugani.