
Distributed Deep Learning (part 1)
Joseph E. Gonzalez
Co-director of the RISE Lab
[email protected]

What is the Problem Being Solved?

Ø Training models is time consuming
  Ø Convergence can be slow
  Ø Training is computationally intensive
Ø Not all models fit in single machine or GPU memory
Ø Less of a problem: big data
  Ø Problem for data preparation / management
  Ø Not a problem for training … why?

On Dataset Size and Learning

Ø Data is a resource! (e.g., like processors and memory)
  Ø Is having lots of processors a problem?
Ø You don’t have to use all the data!
  Ø Though using more data can often help

Ø More data often* dominates models and algorithms

*More data also enables more sophisticated models.

[Slide shows the first page of “The Unreasonable Effectiveness of Data,” Alon Halevy, Peter Norvig, and Fernando Pereira (Google), IEEE Intelligent Systems, 2009.]


What are the Metrics of Success?

Ø Marketing Team: Maximize number of GPUs/CPUs used
  Ø A bad metric … why?
Ø Machine Learning: Minimize passes through the training data
  Ø Easy to measure, but not informative … why?
Ø Systems: Minimize time to complete a pass through the training data
  Ø Easy to measure, but not informative … why?

Ideal Metric of Success

How do we measure Learning?

“Learning” / Second = (“Learning” / Record) × (Records / Second)

Ø “Learning” / Record: Convergence (Machine Learning Property)
Ø Records / Second: Throughput (System Property)

Training and Validation

[Plot: Training Error and Test* Error vs. Time. Test error is the thing we want to minimize; training error is what our training algorithm tries to minimize.]

*If you are making modeling decisions based on this then it should be called validation error.

Metrics of Success

Ø Minimize training time to “best model”
  Ø Best model measured in terms of test error
Ø Other concerns?
  Ø Complexity: Does the approach introduce additional training complexity (e.g., hyper-parameters)?
  Ø Stability: How consistently does the system train the model?

Two papers

NIPS 2012 (Same Year as AlexNet):
Large Scale Distributed Deep Networks
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng (Google Inc., Mountain View, CA)

2018 (Unpublished, on arXiv):
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He (Facebook)

[Slide shows the first pages of both papers side by side, including Figure 1 from the Goyal et al. paper: ImageNet top-1 validation error vs. minibatch size, with an error range of plus/minus two standard deviations. With a linear learning-rate scaling rule and a warmup phase for the first few epochs, top-1 error is unchanged for minibatches up to 8k images, and ResNet-50 trains on 256 GPUs in 1 hour with roughly 90% scaling efficiency.]

Large Scale Distributed Deep Networks
NIPS 2012 (Same Year as AlexNet)

[Slide shows the first page of the DistBelief paper (Dean et al.) again.]

Ø Described the system for the 2012 ICML Paper

[Slide shows the first page of “Building High-level Features Using Large Scale Unsupervised Learning” (Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng), annotated: unlabeled images → DistBelief → discovers cat features → label.]

Building High-Level Features Using Large Scale Unsupervised Learning (ICML 2012)

[Slide shows Figure 1 from the paper: the architecture and parameters in one layer of the network; the overall network replicates this structure three times.]

Ø Pre-convolutional* architecture, 9x “deep”, with sparse (local) connectivity
Ø Each neuron has separate weights: the parameters are not shared across different locations in the image, so it is not convolutional
Ø ~1 billion parameters: 30x bigger than other deep nets of the time

*This pre-dates AlexNet but is two decades after LeNet.

More Context (ML circa 2012)

Ø Focus of existing distributed ML research
  Ø Convex optimization (e.g., SVMs, Lasso)
  Ø Matrix factorization
  Ø Graphical models
Ø Key systems at the time
  Ø MapReduce → not great for iterative computation (why?)
  Ø Spark really wasn’t visible to the ML community
  Ø GraphLab → a truly wonderful system*
    Ø We worked with Quoc/Andrew to get their model running on GraphLab, but it wasn’t as performant.

*Developed by the speaker.

Key Problems Addressed in DistBelief Paper

Main Problem
Ø Speed up training for large models

Sub Problems
Ø How to partition models and data
Ø Variance in worker performance → Stragglers
Ø Failures in workers → Fault-Tolerance

Crash Course on Stochastic Gradient Descent

The Gradient Descent Algorithm

θ^(0): initial model parameters (random, warm start)

For t from 1 to convergence:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta_t \left( \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} L\big(y_i, f(x_i; \theta)\big) \Big|_{\theta = \theta^{(t)}} \right)$$

Here η_t is the learning rate, and the sum is the average gradient over the training dataset, evaluated at θ = θ^(t).

How do we distribute this computation?

Ø Data parallelism: divide the data across machines, compute local gradient sums, and then aggregate across machines. Repeat. (A minimal sketch follows below.)
Ø Issues? Repeatedly scanning the data… what if we cache it? (This is what Spark did; it also substantially improved the programming API over Hadoop.)
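A minimal sketch of this data-parallel scheme in Python/NumPy, assuming a squared-error loss as a stand-in for L. The `partitions` list plays the role of per-machine data shards, and the list comprehension marked below would be a parallel map across machines in a real system; nothing here is any particular framework's API.

```python
import numpy as np

def local_grad_sum(theta, X_part, y_part):
    """Sum over one shard of the gradients of the squared-error loss 0.5 * (x^T theta - y)^2."""
    return X_part.T @ (X_part @ theta - y_part)

def data_parallel_gd(partitions, theta, lr=0.1, steps=100):
    """Full-batch gradient descent where each shard contributes its local gradient sum."""
    n = sum(len(y) for _, y in partitions)
    for _ in range(steps):
        grad_sums = [local_grad_sum(theta, X, y) for X, y in partitions]  # parallel map across machines in practice
        theta = theta - lr * sum(grad_sums) / n                           # aggregate, then take one descent step
    return theta

# Toy usage: two "machines", each holding half of a small regression dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
shards = [(X[:100], y[:100]), (X[100:], y[100:])]
print(data_parallel_gd(shards, theta=np.zeros(3)))  # approaches [1, -2, 0.5]
```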

The Gradient Descent Algorithm

θ^(0): initial model parameters (random, warm start)

Can we use statistics to improve this algorithm?

For t from 1 to convergence:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta_t \left( \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} L\big(y_i, f(x_i; \theta)\big) \Big|_{\theta = \theta^{(t)}} \right)$$

The empirical gradient is an approximation of what I really want:

$$\frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} L\big(y_i, f(x_i; \theta)\big) \;\approx\; \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[ \nabla_{\theta} L\big(y, f(x; \theta)\big) \big]$$

Law of large numbers → more data provides a better approximation (the variance of the estimator decreases linearly).

Do I really need to use all the data?

The empirical gradient is an approximation of what I really want:

$$\frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} L\big(y_i, f(x_i; \theta)\big) \;\approx\; \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[ \nabla_{\theta} L\big(y, f(x; \theta)\big) \big]$$

Law of large numbers → more data provides a better approximation (the variance of the estimator decreases linearly).

$$\frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} L\big(y_i, f(x_i; \theta)\big) \;\approx\; \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\theta} L\big(y_i, f(x_i; \theta)\big)$$

Ø B is a random subset of the data
Ø Small B: fast but less accurate
Ø Large B: slower but more accurate
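A small numerical illustration of the small-B versus large-B trade-off, again with a squared-error loss standing in for L; the dataset and batch sizes are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + 0.1 * rng.normal(size=10_000)
theta = np.zeros(5)   # gradients are evaluated at an arbitrary fixed parameter vector

def avg_grad(idx):
    """Average gradient of 0.5 * (x^T theta - y)^2 over the examples indexed by idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

full = avg_grad(np.arange(len(y)))   # the empirical gradient over all n examples
for B in (10, 100, 1000):
    estimates = [avg_grad(rng.choice(len(y), size=B, replace=False)) for _ in range(200)]
    mse = np.mean([np.linalg.norm(g - full) ** 2 for g in estimates])
    print(f"|B| = {B:5d}: mean squared error of the minibatch gradient ~ {mse:.4f}")
# The error shrinks roughly like 1/|B|: a larger minibatch is more accurate but costs more per step.
```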

Assuming decomposable loss functions:

Gradient Descent
θ^(0): initial vector (random, zeros …)
For t from 1 to convergence:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta_t \left( \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} L\big(y_i, f(x_i; \theta)\big) \Big|_{\theta = \theta^{(t)}} \right)$$

Stochastic Gradient Descent
θ^(0): initial vector (random, zeros …)
For t from 0 to convergence:
  Draw a random subset of indices B

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta_t \left( \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\theta} L\big(y_i, f(x_i; \theta)\big) \Big|_{\theta = \theta^{(t)}} \right)$$

[Figure: contours of the loss over (θ_1, θ_2) with the noisy SGD trajectory converging toward the optimum θ̂.]
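A minimal single-machine sketch of the SGD loop above, with a squared-error loss standing in for L; the decaying step-size schedule is one common choice, not something the slides prescribe.

```python
import numpy as np

def sgd(X, y, theta, batch_size=32, steps=2000, eta0=0.5):
    """Stochastic gradient descent: each update uses only a random subset B of the data."""
    rng = np.random.default_rng(0)
    n = len(y)
    for t in range(steps):
        B = rng.choice(n, size=batch_size, replace=False)     # random subset of indices
        grad = X[B].T @ (X[B] @ theta - y[B]) / batch_size    # minibatch estimate of the gradient
        eta_t = eta0 / (1 + t / 100)                          # decaying learning rate
        theta = theta - eta_t * grad
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 4))
theta_star = np.array([2.0, -1.0, 0.0, 3.0])
y = X @ theta_star + 0.05 * rng.normal(size=5000)
print(sgd(X, y, theta=np.zeros(4)))  # close to [2, -1, 0, 3]
```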

How do you distribute SGD?

θ^(0): initial vector (random, zeros …)
For t from 0 to convergence:
  Draw a random subset of indices B

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta_t \left( \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\theta} L\big(y_i, f(x_i; \theta)\big) \Big|_{\theta = \theta^{(t)}} \right)$$

Ø Model Parallelism: speed up the gradient computation. Depends on the model.
Ø Data Parallelism: speed up the sum. Depends on the size of B. (See the sketch below.)
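For the data-parallel option, here is a sketch of a single distributed SGD step: the minibatch B is split across a few workers, each computes its local gradient sum, and the sums are aggregated (an all-reduce in practice) before the update. Workers are simulated sequentially, a squared-error loss stands in for L, and all names are illustrative.

```python
import numpy as np

def distributed_sgd_step(theta, X, y, batch_idx, eta, num_workers=4):
    """One data-parallel SGD step: split the minibatch B across workers and aggregate their gradient sums."""
    shards = np.array_split(batch_idx, num_workers)                  # each worker gets a slice of B
    grad_sums = [X[s].T @ (X[s] @ theta - y[s]) for s in shards]     # computed in parallel in practice
    grad = sum(grad_sums) / len(batch_idx)                           # aggregate (an all-reduce), then average over |B|
    return theta - eta * grad

# Toy usage for a single step.
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, 3.0])
B = rng.choice(1000, size=256, replace=False)     # the minibatch for this step
print(distributed_sgd_step(np.zeros(3), X, y, batch_idx=B, eta=0.5))
# The larger |B| is, the more work each worker gets per step, which is what
# makes the per-step communication cost worth paying.
```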

(End digression ... for now)

NIPS 2012 (Same Year as AlexNet)

Key Innovations in Large Scale Distributed Deep Networks

[Slide shows the first page of the DistBelief paper: title, authors, abstract, and the start of the introduction.]

Combine Model and Data Parallelism

[Diagram: Model Parallelism (one model partitioned across Machines 1-4) vs. Data Parallelism (replicas of the model on separate machines)]

This appears in earlier work on graph systems …

Figure 1: An example of model parallelism in DistBelief. A five layer deep neural network with local connectivity is shown here, partitioned across four machines (blue rectangles). Only those nodes with edges that cross partition boundaries (thick lines) will need to have their state transmitted between machines. Even in cases where a node has multiple edges crossing a partition boundary, its state is only sent to the machine on the other side of that boundary once. Within each partition, computation for individual nodes will be parallelized across all available CPU cores.

3 Model parallelism

To facilitate the training of very large deep networks, we have developed a software framework, DistBelief, that supports distributed computation in neural networks and layered graphical models. The user defines the computation that takes place at each node in each layer of the model, and the messages that should be passed during the upward and downward phases of computation.2 For large models, the user may partition the model across several machines (Figure 1), so that responsibility for the computation for different nodes is assigned to different machines. The framework automatically parallelizes computation in each machine using all available cores, and manages communication, synchronization and data transfer between machines during both training and inference. The performance benefits of distributing a deep network across multiple machines depends on the connectivity structure and computational needs of the model. Models with a large number of parameters or high computational demands typically benefit from access to more CPUs and memory, up to the point where communication costs dominate. We have successfully run large models with up to 144 partitions in the DistBelief framework with significant speedups, while more modestly sized models show decent speedups for up to 8 or 16 partitions. (See Section 5, under the heading Model Parallelism Benchmarks, for experimental results.) Obviously, models with local connectivity structures tend to be more amenable to extensive distribution than fully-connected structures, given their lower communication requirements. The typical cause of less-than-ideal speedups is variance in processing times across the different machines, leading to many machines waiting for the single slowest machine to finish a given phase of computation. Nonetheless, for our largest models, we can efficiently use 32 machines where each machine achieves an average CPU utilization of 16 cores, for a total of 512 CPU cores training a single large neural network. When combined with the distributed optimization algorithms described in the next section, which utilize multiple replicas of the entire neural network, it is possible to use tens of thousands of CPU cores for training a single model, leading to significant reductions in overall training times.
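To make the partitioning idea concrete, below is a toy sketch of one simple form of model parallelism: consecutive layers are assigned to different machines, so only the activations at an assignment boundary would need to cross the network. This is a simplification under my own assumptions (DistBelief also partitions the nodes within a layer, and the weights and ReLU choice here are illustrative, not the paper's configuration).

import numpy as np

def forward_model_parallel(x, layer_weights, n_machines):
    """Toy forward pass with consecutive layers assigned to 'machines'.
    Only the activation at each machine boundary would cross the network."""
    assignments = np.array_split(np.arange(len(layer_weights)), n_machines)
    h = x
    for machine_id, layer_ids in enumerate(assignments):
        for lid in layer_ids:
            h = np.maximum(0.0, layer_weights[lid] @ h)  # ReLU layer on this machine
        # here `h` would be transmitted to machine machine_id + 1
    return h

layers = [np.random.randn(64, 128), np.random.randn(64, 64), np.random.randn(10, 64)]
out = forward_model_parallel(np.random.randn(128), layers, n_machines=2)
print(out.shape)  # (10,)

For locally connected models the boundary activations are small relative to the compute, which is why they distribute better than fully-connected ones.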

4 Distributed optimization algorithms

Parallelizing computation within the DistBelief framework allows us to instantiate and run neural networks considerably larger than have been previously reported. But in order to train such large models in a reasonable amount of time, we need to parallelize computation not only within a single

2In the case of a neural network ‘upward’ and ‘downward’ might equally well be called ‘feedforward’ and ‘backprop’, while for a Hidden Markov Model, they might be more familiar as ‘forward’ and ‘backward’.

Combine Model and Data Parallelism

[Diagram (repeated): model parallelism across Machines 1-4 combined with data parallelism across replicas; Downpour SGD is the asynchronous method, Sandblaster L-BFGS the synchronous one]


Sandblaster L-BFGS

Ø L-BFGS
Ø Commonly used for convex optimization problems
Ø Requires repeated scans of all data (synchronous)
Ø Robust, minimal tuning
Ø Naturally fits the map-reduce pattern
Ø Innovations:
Ø Accumulate gradients and store outputs in a sharded key-value store (parameter server)
Ø Tiny tasks + backup tasks to mitigate stragglers (see the toy scheduler sketch below)
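A toy illustration of the tiny-task + backup-task idea (my own simulation, not code from the paper): with the work split into many small tasks, re-issuing the unfinished tail of tasks near the end of a batch keeps a couple of slow machines from stalling the whole synchronous step.

import random

def batch_time(n_tasks, n_workers, slow_workers, backups, seed=0):
    """When does a synchronous batch of tiny gradient tasks finish?
    A few workers run at 1/10 speed; with backups, tasks still unfinished
    once 90% of the batch is done are re-issued on a fast, idle worker."""
    rng = random.Random(seed)
    speed = [0.1 if w in slow_workers else 1.0 for w in range(n_workers)]
    work = [rng.uniform(0.5, 1.5) for _ in range(n_tasks)]
    worker_clock = [0.0] * n_workers
    end_time = []
    for i, amount in enumerate(work):
        w = i % n_workers                      # round-robin task assignment
        worker_clock[w] += amount / speed[w]
        end_time.append(worker_clock[w])
    if not backups:
        return max(end_time)                   # wait for the slowest machine
    cutoff = sorted(end_time)[int(0.9 * n_tasks)]
    redo = [min(t, cutoff + work[i]) for i, t in enumerate(end_time)]
    return max(redo)                           # each task: earlier of its two copies

print(batch_time(1000, 50, slow_workers={0, 1}, backups=False))  # dominated by stragglers
print(batch_time(1000, 50, slow_workers={0, 1}, backups=True))   # close to the fast workers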


Downpour SGD

Claimed Innovations
Ø Parameter Server
Ø Combine model and data parallelism in an asynchronous execution
Ø Adagrad stabilization
Ø Warmstarting

Parameter Servers
Ø Essentially a sharded key-value store
Ø Support for put, get, add (see the sketch below)
Ø Idea appears in earlier papers:
Ø "An Architecture for Parallel Topic Models", Smola and Narayanamurthy (VLDB'10)
Ø "Scalable Inference in Latent Variable Models", Ahmed, Aly, Gonzalez, Narayanamurthy, and Smola (WSDM'12)
Ø DistBelief was probably the first paper to call a sharded key-value store a Parameter Server.

[The slide reproduces Algorithm 1 (state synchronization) and Figure 3 from the earlier work: each sampler keeps processing the subset of data associated with it while synchronization threads reconcile the local and global state tables through a cluster of memcached servers.]
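A minimal sketch of the sharded key-value interface the slide describes (put/get/add). The shard count, the hashing scheme, and the grad_fn helper in the Downpour-style worker loop are assumptions for illustration, not the DistBelief implementation.

import numpy as np

class ShardedParameterServer:
    """Sharded key-value store with put/get/add. In a real deployment each
    shard lives on a different machine; `add` applies updates in place."""

    def __init__(self, n_shards=4):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = np.asarray(value, dtype=float).copy()

    def get(self, key):
        return self._shard(key)[key].copy()

    def add(self, key, delta):
        self._shard(key)[key] += delta        # e.g., a scaled gradient

def downpour_worker_step(ps, key, grad_fn, lr):
    theta = ps.get(key)                       # fetch (possibly stale) parameters
    g = grad_fn(theta)                        # gradient on a local minibatch (assumed helper)
    ps.add(key, -lr * g)                      # push the update asynchronously

Because replicas call get and add without coordinating, updates are applied to stale parameters, which is exactly why the stabilization discussed next matters.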

Adagrad Stabilization

Ø Address the large variability in the magnitudes of gradients
Ø Rescale the gradients by an estimate of the diagonal variance
Ø More recently superseded by Adam
Ø Stability is needed here to support asynchronous gradient updates (see the sketch below)
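For reference, a one-step Adagrad sketch (standard formulation; the learning rate and epsilon values are placeholder assumptions): per-coordinate step sizes shrink for coordinates that have accumulated large squared gradients, which damps the variability mentioned above.

import numpy as np

def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad step with a running sum of squared gradients per coordinate."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum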

Warmstarting

Ø Starting closer to a solution can help!
Ø Recall gradient descent:

\(\theta^{(0)}\): initial vector (random, zeros, …)

For t from 0 to convergence, sample a random subset of indices B and take the stochastic gradient step

\[
\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta_t \, \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\theta L\big(y_i, f(x_i; \theta)\big)\Big|_{\theta = \theta^{(t)}}
\]

[Figure: loss curve with a steep gradient region near the initialization]

See http://www.ds100.org/fa17/assets/notebooks/26-lec/Logistic_Regression_Part_2.html

From the paper, on Sandblaster's communication pattern: "… portions of data to the same worker makes data access a non-issue. In contrast with Downpour SGD, which requires relatively high frequency, high bandwidth parameter synchronization with the parameter server, Sandblaster workers only fetch parameters at the beginning of each batch (when they have been updated by the coordinator), and only send the gradients every few completed portions (to protect against replica failures and restarts)."

5 Experiments

We evaluated our optimization algorithms by applying them to training models for two different deep learning problems: object recognition in still images and acoustic processing for speech recognition. The speech recognition task was to classify the central region (or frame) in a short snippet of audio as one of several thousand acoustic states. We used a deep network with five layers: four hidden layer with sigmoidal activations and 2560 nodes each, and a softmax output layer with 8192 nodes. The input representation was 11 consecutive overlapping 25 ms frames of speech, each represented by 40 log-energy values. The network was fully-connected layer-to-layer, for a total of approximately 42 million model parameters. We trained on a data set of 1.1 billion weakly labeled examples, and evaluated on a hold out test set. See [27] for similar deep network configurations and training procedures. For visual object recognition we trained a larger neural network with locally-connected receptive fields on the ImageNet data set of 16 million images, each of which we scaled to 100x100 pixels [28]. The network had three stages, each composed of filtering, pooling and local contrast normalization, where each node in the filtering layer was connected to a 10x10 patch in the layer below. Our infrastructure allows many nodes to connect to the same input patch, and we ran experiments varying the number of identically connected nodes from 8 to 36. The output layer consisted of 21 thousand one-vs-all logistic classifier nodes, one for each of the ImageNet object categories. See [29] for similar deep network configurations and training procedures.
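A sketch of the described speech network in plain NumPy (shapes follow the text: 11 x 40 = 440 inputs, four sigmoidal hidden layers of 2560 units, and a softmax over 8192 acoustic states, which comes to roughly 42M weights; the random initialization and the absence of any training loop are my simplifications).

import numpy as np

def speech_net_forward(x, weights, biases):
    """Forward pass: sigmoidal hidden layers followed by a softmax output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))     # sigmoid hidden layers
    logits = weights[-1] @ h + biases[-1]
    p = np.exp(logits - logits.max())
    return p / p.sum()                             # softmax over acoustic states

sizes = [440, 2560, 2560, 2560, 2560, 8192]        # roughly 42M parameters in total
weights = [np.random.randn(o, i) * 0.01 for i, o in zip(sizes, sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
print(speech_net_forward(np.random.randn(440), weights, biases).shape)  # (8192,)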

Model parallelism benchmarks: To explore the scaling behavior of DistBelief model parallelism (Section 3), we measured the mean time to process a single mini-batch for simple SGD training as a function of the number of partitions (machines) used in a single model instance. In Figure 3 we quantify the impact of parallelizing across N machines by reporting the average training speed-up: the ratio of the time taken using only a single machine to the time taken using N. Speedups for inference steps in these models are similar and are not shown here. The moderately sized speech model runs fastest on 8 machines, computing 2.2x faster than using a single machine. (Models were configured to use no more than 20 cores per machine.)

Key Results

Model Parallelism

Ø Measured speedup to compute a single mini-batch
Ø Is this a good metric?
Ø Results are not that strong…

[Figure 3 plot: training speed-up vs. machines per model instance (1 to 128) for four models: Speech, 42M parameters; Images, 80M, 330M, and 1.7B parameters]

Figure 3: Training speed-up for four different deep networks as a function of machines allocated to a single DistBelief model instance. Models with more parameters benefit more from the use of additional machines than do models with fewer parameters.

Key Results: Training and Test Error

Ø Which would you use: Accuracy on the Training Set or Accuracy on the Test Set?
Ø Weird 20K error metric
Ø Looks like a learning rate reset
Ø Wall clock time is good

[Figure 4 plots: Average Frame Accuracy (%) vs. Time (hours) for SGD [1], GPU [1], DownpourSGD [20], DownpourSGD [20] w/Adagrad, DownpourSGD [200] w/Adagrad, and Sandblaster L-BFGS [2000]]

Figure 4: Left: Training accuracy (on a portion of the training set) for different optimization methods. Right: Classification accuracy on the hold out test set as a function of training time. Downpour and Sandblaster experiments initialized using the same ~10 hour warmstart of simple SGD.

Partitioning the model on more than 8 machines actually slows training, as network overhead starts to dominate in the fully-connected network structure and there is less work for each machine to perform with more partitions. In contrast, the much larger, locally-connected image models can benefit from using many more machines per model replica. The largest model, with 1.7 billion parameters, benefits the most, giving a speedup of more than 12x using 81 machines. For these large models using more machines continues to increase speed, but with diminishing returns.

Optimization method comparisons: To evaluate the proposed distributed optimization procedures, we ran the speech model described above in a variety of configurations. We consider two baseline optimization procedures: training a DistBelief model (on 8 partitions) using conventional (single replica) SGD, and training the identical model on a GPU using CUDA [27]. The three distributed optimization methods we compare to these baseline methods are: Downpour SGD with a fixed learning rate, Downpour SGD with Adagrad learning rates, and Sandblaster L-BFGS. Figure 4 shows classification performance as a function of training time for each of these methods on both the training and test sets. Our goal is to obtain the maximum test set accuracy in the minimum amount of training time, regardless of resource requirements. Conventional single replica SGD (black curves) is the slowest to train. Downpour SGD with 20 model replicas (blue curves) shows a significant improvement. Downpour SGD with 20 replicas plus Adagrad (orange curve) is modestly faster. Sandblaster L-BFGS using 2000 model replicas (green curves) is considerably faster yet again. The fastest, however, is Downpour SGD plus Adagrad with 200 model replicas (red curves). Given access to sufficient CPU resources, both Sandblaster L-BFGS and Downpour SGD with Adagrad can train models substantially faster than a high performance GPU. Though we did not confine the above experiments to a fixed resource budget, it is interesting to consider how the various methods trade off resource consumption for performance. We analyze this by arbitrarily choosing a fixed test set accuracy (16%), and measuring the time each method took to reach that accuracy as a function of machines and utilized CPU cores, Figure 5. One of the four points on each trace corresponds to a training configuration shown in Figure 4, the other three points are alternative configurations. In this plot, points closer to the origin are preferable in that they take less time while using fewer resources. In this regard Downpour SGD using Adagrad appears to be the best trade-off: For any fixed budget of machines or cores, Downpour SGD with Adagrad takes less time to reach the accuracy target than either Downpour SGD with a fixed learning rate or Sandblaster L-BFGS. For any allotted training time to reach the accuracy target, Downpour SGD with Adagrad used fewer resources than Sandblaster L-BFGS, and in many cases Downpour SGD with a fixed learning rate could not even reach the target within the deadline. The Sandblaster L-BFGS system does show promise in terms …

Why are they in the NY Times?

Ø Trained a 1.7 billion parameter model (30x larger than state of the art) (was it necessary?)
Ø Using 16,000 cores (efficiently?)
Ø Achieves 15.8% accuracy on ImageNet 20K (a 70% improvement over the state of the art)
Ø Non-standard benchmark
Ø Qualitatively interesting results

Long-term Impact

Ø The parameter server appears in many later machine learning systems
Ø Downpour (asynchronous) SGD has been largely replaced by synchronous systems for supervised training
Ø Asynchrony is still popular in RL research. Why?
Ø Model parallelism is still used for large language models
Ø Predated this work
Ø The neural network architectures studied here have been largely replaced by convolutional networks

2018 (Unpublished, on arXiv)

Second Paper

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He (Facebook)
arXiv:1706.02677v2 [cs.CV], 30 Apr 2018

Ø Generated a lot of press
Ø Recently (Aug) surpassed by Fast.ai: "Now anyone can train ImageNet in 18 minutes for $40." (blog post)
Ø Popularized linear learning rate scaling

Abstract

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.

Figure 1. ImageNet top-1 validation error vs. minibatch size (64 to 64k). Error range of plus/minus two standard deviations is shown. We present a simple and general technique for scaling distributed synchronous SGD to minibatches of up to 8k images while maintaining the top-1 error of small minibatch training. For all minibatch sizes we set the learning rate as a linear function of the minibatch size and apply a simple warmup phase for the first few epochs of training. All other hyper-parameters are kept fixed. Using this simple approach, accuracy of our models is invariant to minibatch size (up to an 8k minibatch size). Our techniques enable a linear reduction in training time with ~90% efficiency as we scale to large minibatch sizes, allowing us to train an accurate 8k minibatch ResNet-50 model in 1 hour on 256 GPUs.

1. Introduction

Scale matters. We are in an unprecedented era in AI research history in which the increasing data and model scale is rapidly improving accuracy in computer vision [22, 41, 34, 35, 36, 16], speech [17, 40], and natural language processing [7, 38]. Take the profound impact in computer vision as an example: visual representations learned by deep convolutional neural networks [23, 22] show excellent performance on previously challenging tasks like ImageNet classification [33] and can be transferred to difficult perception problems such as object detection and segmentation [8, 10, 28]. Moreover, this pattern generalizes: larger datasets and neural network architectures consistently yield improved accuracy across all tasks that benefit from pre-training [22, 41, 34, 35, 36, 16]. But as model and data scale grow, so does training time; discovering the potential and limits of large-scale deep learning requires developing novel techniques to keep training time manageable. The goal of this report is to demonstrate the feasibility of, and to communicate a practical guide to, large-scale training with distributed synchronous stochastic gradient descent (SGD). As an example, we scale ResNet-50 [16] training, originally performed with a minibatch size of 256 images (using 8 Tesla P100 GPUs, training time is 29 hours), to larger minibatches (see Figure 1). In particular, we show that with a large minibatch size of 8192, we can train ResNet-50 in 1 hour using 256 GPUs while maintaining …

Contrasting to the first paper

Ø Synchronous SGD
Ø Much of the recent work has focused on the synchronous setting
Ø Easier to reason about
Ø Focus exclusively on data parallelism: batch-size scaling
Ø Focuses on the generalization gap problem

How do you distribute SGD?

\(\theta^{(0)}\): initial vector (random, zeros, …)

For t from 0 to convergence, sample a random subset of indices B and take the stochastic gradient step

\[
\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta_t \, \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\theta L\big(y_i, f(x_i; \theta)\big)\Big|_{\theta = \theta^{(t)}}
\]

Data parallelism: is this step slow? (~150 ms) Parallelizing the sum helps, depending on the size of B.

Batch Size Scaling

Ø Increase the batch size by adding machines

\[
\theta^{(t+1)} \leftarrow \theta^{(t)} - \hat{\eta} \, \frac{1}{k} \sum_{j=1}^{k} \left( \frac{1}{|\mathcal{B}_j|} \sum_{i \in \mathcal{B}_j} \nabla_\theta L\big(y_i, f(x_i; \theta)\big)\Big|_{\theta = \theta^{(t)}} \right)
\]

Ø Each server processes a fixed batch size (e.g., n = 32)
Ø As more servers are added (k), the effective overall batch size increases linearly
Ø Why do these additional servers help? (see the sketch below)
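A sketch of one synchronous step under batch-size scaling, mirroring the averaging-of-averages structure in the update above. The per-example gradient helper grad_loss and the per-server batch size of 32 are assumptions taken from the slide's example.

import numpy as np

def distributed_sgd_step(theta, shards, grad_loss, lr_hat):
    """One synchronous step with k 'servers'; shards is a list of k
    (xs, ys) local batches, so the effective batch size is k * len(xs)."""
    per_server_means = []
    for xs, ys in shards:                     # in practice, one server each
        g = np.mean([grad_loss(theta, x, y) for x, y in zip(xs, ys)], axis=0)
        per_server_means.append(g)
    grad = np.mean(per_server_means, axis=0)  # average of per-server averages
    return theta - lr_hat * grad

Adding servers only grows the effective batch (and shrinks gradient noise); it does not by itself change how far each step moves.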

Bigger isn't Always Better

Ø Motivation for larger batch sizes
Ø More opportunities for parallelism → but is it useful?
Ø Recall (1/n variance reduction): the minibatch gradient approximates the full-data gradient,

\[
\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\theta L\big(y_i, f(x_i; \theta)\big) \;\approx\; \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L\big(y_i, f(x_i; \theta)\big)
\]

Ø Is a variance reduction helpful?
Ø Only if it lets you take bigger steps (move faster)
Ø Doesn't affect the final answer…
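For completeness, the standard variance argument behind the variance-reduction bullet (assuming the batch indices are sampled i.i.d. with per-example gradient covariance \(\Sigma\)):

\[
\operatorname{Var}\!\left[\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}} \nabla_\theta L\big(y_i, f(x_i;\theta)\big)\right]
= \frac{1}{|\mathcal{B}|^{2}} \sum_{i\in\mathcal{B}} \operatorname{Var}\big[\nabla_\theta L\big(y_i, f(x_i;\theta)\big)\big]
= \frac{\Sigma}{|\mathcal{B}|}
\]

So doubling the batch size halves the variance of the gradient estimate while leaving its expectation unchanged, which only pays off if the lower noise lets you take larger steps.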

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: Generalization Gap Problem

Ø Larger batch sizes harm generalization performance.

[Slide repeats the paper's first page; see the abstract and Figure 1 above. Figure 1 shows ImageNet top-1 validation error beginning to increase for minibatch sizes beyond 8k.]

Rough "Intuition"

Ø Small batch gradient descent acts as a regularizer

Sharp Minima Hypothesis

[Figure: loss plotted against parameter values along some direction]

Key problem: Addressing the generalization gap for large batch sizes. Solution: Linear Scaling Rule

Ø Scale the learning rate linearly with the batch size: with k servers, use \(\hat{\eta} = k\eta\) in the update

\[
\theta^{(t+1)} \leftarrow \theta^{(t)} - \hat{\eta} \, \frac{1}{k} \sum_{j=1}^{k} \frac{1}{|\mathcal{B}_j|} \sum_{i \in \mathcal{B}_j} \nabla_\theta L\big(y_i, f(x_i; \theta)\big)\Big|_{\theta = \theta^{(t)}}
\]

Ø Addresses generalization performance by taking larger steps (also improves training convergence)
Ø Sub-problem: large learning rates can be destabilizing in the beginning. Why?
Ø Gradual warmup solution: increase the learning rate scaling from constant to linear over the first few epochs (see the schedule sketch below)
Ø Doesn't help for very large k…
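A sketch of the linear scaling rule with gradual warmup as a learning-rate schedule. The reference rate 0.1, k, and the 5-epoch warmup mirror the paper's recipe, but the function itself is my illustration, and the usual step-decay schedule that follows warmup is omitted.

def learning_rate(epoch, it, iters_per_epoch, base_lr=0.1, k=32, warmup_epochs=5):
    """Gradual warmup: ramp linearly from base_lr to the scaled rate k*base_lr
    over the first warmup_epochs epochs, then hold the scaled rate."""
    target = k * base_lr
    if epoch >= warmup_epochs:
        return target
    progress = (epoch * iters_per_epoch + it) / float(warmup_epochs * iters_per_epoch)
    return base_lr + progress * (target - base_lr)

# e.g., at the start of training vs. after warmup:
print(learning_rate(epoch=0, it=0, iters_per_epoch=5000))   # 0.1
print(learning_rate(epoch=5, it=0, iters_per_epoch=5000))   # 3.2 (matches kn=8k in Figure 4)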

Other Details

Ø Independent Batch Norm: the batch norm calculation applies only to the local batch size (n)
Ø All-Reduce: recursive halving and doubling algorithm
Ø Used instead of the popular ring reduction (fewer rounds; see the sketch below)
Ø Gloo: a library for efficient collective communications
Ø Big Basin GPU servers: designed for deep learning workloads
Ø Analysis of communication requirements → latency bound
Ø No discussion of stragglers or fault tolerance. Why?!
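A toy allreduce with a logarithmic number of communication rounds (recursive doubling). This is a simplification: the paper's recursive halving-and-doubling algorithm does a reduce-scatter followed by an allgather, which is more bandwidth-efficient, but the log2(p) round count versus the 2(p-1) steps of a ring is the point being illustrated.

import numpy as np

def allreduce_recursive_doubling(vectors):
    """Every 'worker' ends up with the sum of all gradient vectors after
    log2(p) rounds; in round r, worker i pairs with worker i XOR 2**r."""
    p = len(vectors)
    assert p & (p - 1) == 0, "this sketch assumes a power-of-two worker count"
    buf = [v.copy() for v in vectors]
    dist = 1
    while dist < p:
        buf = [buf[i] + buf[i ^ dist] for i in range(p)]  # simultaneous pairwise exchange
        dist *= 2
    return buf

grads = [np.random.randn(4) for _ in range(8)]
result = allreduce_recursive_doubling(grads)
assert np.allclose(result[0], sum(grads))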

Key Results

Training vs Validation
Ø All curves closely match using the linear scaling rule.
Ø Note the learning rate schedule drops.

[Plots of training/validation error (%) vs. epochs for kn=256, η=0.1 and kn=8k, η=3.2; and for kn=256 with η=0.1 (23.60%) vs. η=0.2 (23.68%).]

Figure 4. Training and validation curves for large minibatch SGD with gradual warmup vs. small minibatch SGD. Both sets of curves match closely after training for sufficient epochs. We note that the BN statistics (for inference only) are computed using running average, which is updated less frequently with a large minibatch and thus is noisier in early training (this explains the larger variation of the validation error in early epochs).

Figure 5. Training curves for small minibatches with different learning rates η. As expected, changing η results in curves that do not match. This is in contrast to changing batch-size (and linearly scaling η), which results in curves that do match, e.g. see Figure 3.

[The slide also shows the paper's Section 5.3 analysis: minibatch size vs. error, training curves for various minibatch sizes, alternative learning rate rules, and Batch Normalization initialization.]

Key Results

[Plots repeated from the paper: training/validation error vs. epochs (Figures 4 and 5); time per iteration (seconds) and time per epoch (minutes) vs. mini-batch size (256 to 11k); images/second throughput vs. number of GPUs (8 to 352), ideal vs. actual.]

Ø Slide annotations contrast "Learning" per Epoch (the machine learning view) with Epochs per Second (the system view).

Figure 7. Distributed synchronous SGD timing. Time per iteration (seconds) and time per ImageNet epoch (minutes) for training with different minibatch sizes. The baseline (kn = 256) uses 8 GPUs in a single server, while all other training runs distribute training over (kn/256) servers. With 352 GPUs (44 servers) our implementation completes one pass over all ~1.28 million ImageNet training images in about 30 seconds.

Figure 8. Distributed synchronous SGD throughput. The small overhead when moving from a single server with 8 GPUs to multi-server distributed training (Figure 7, blue curve) results in linear throughput scaling that is marginally below ideal scaling (~90% efficiency). Most of the allreduce communication time is hidden by pipelining allreduce operations with gradient computation. Moreover, this is achieved with commodity Ethernet hardware.

[The slide also reproduces the paper's Section 5.4 (generalization of large-minibatch pre-trained models to COCO detection and segmentation with Mask R-CNN) and Section 5.5 (run time), plus the acknowledgements.]

Key Results

Ø Train ResNet-50 to state-of-the-art accuracy on 256 GPUs in 1 hour
Ø ~90% scaling efficiency

Ø Fairly careful study of the linear scaling rule
Ø Observed limits to linear scaling do not depend on dataset size
Ø Cannot scale parallelism with dataset size

Long-term Impact

Ø Still early (this paper is not yet published)
Ø Ideas that will appear in other papers:
Ø Linear rate scaling
Ø Independent batch norm
Ø Paper points to limits of synchronous SGD → need new methods to accelerate training
Ø E.g., Fast.ai → curriculum learning through scale variation

Questions for Discussion

Ø Distributed model training is not very common. Why?
Ø Should we return to asynchrony?
Ø What is needed to address issues with asynchronous training?
Ø How will changes in hardware affect distributed training?
Ø E.g., faster GPUs → larger batches; faster networks → smaller batches …
Ø How will the emergence of "dynamic models" and large "mixture of experts" models affect distributed training?