Panel – The Impact of AI Workloads on Datacenter Compute and Memory

Panelists:

• Marc Tremblay (Microsoft) – Distinguished Engineer, Azure H/W. Responsible for the silicon / systems roadmap for AI; full-stack expert from application requirements down to silicon. Previously CTO of Microelectronics @ Sun.
• Sumit Gupta (IBM) – Vice President, AI, ML and HPC business. Responsible for strategy and HW/SW products for Watson ML accelerator and Spectrum compute. Previously GM of the AI & GPU data center business at Nvidia.
• Cliff Young (Google) – Software Engineer, Google Brain team. HW/SW codesign and machine learning; one of the TPU designers. Previously at D.E. Shaw Research and Bell Labs.
• Maxim Naumov (Facebook) – Research Scientist, Facebook AI Research. Deep learning, parallel algorithms and systems; co-developed many of Nvidia's GPU-accelerated libraries. Previously at Nvidia.
• Greg Diamos (Baidu) – AI Research Lead, Baidu SVAIL. Co-developed Deep Speech and Deep Voice; contributor to the Volta SIMT scheduler, compiler and microarchitecture. Previously at Nvidia.
• Rob Ober (Nvidia) – Chief Platform Architect, Datacenter products. GPU deployments for AI and DL at hyperscalers; CPU architecture, storage systems, SSD, networking, wireless and power management. Previously at SanDisk, LSI, Apple, etc.
• Anand Iyer (Samsung) – Director, Planning/Technology, Semiconductor products. Data center and AI; HBM, accelerators, CPU, network infrastructure and near-memory compute. Previously at Broadcom, Cavium, Digital and Intel.

Samsung @ The Heart of Your Data – Unparalleled Product Breadth & Technology Leadership

Virtuous cycle of big data vs. the compute growth chasm

[Figure: Data vs. Compute growth – 350M photos, 2 trillion searches/year, 4 PB/day]

Data is being created faster than our ability to make sense of it. Flu outbreak: Google knows before the CDC.

ML/DL Systems – Market Demand and Key Application Domains

[Figure: DL inference in data centers [1] – representative model architectures for Recommenders (dense and sparse features, embedding lookups, interaction NNs, probability of a click), Vision (image/video NNs with convolutions, e.g. "cat" classification), and NMT (attention over hidden states, e.g. "I like cats" → "мне нравятся коты")]
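To make the recommender panel of the figure concrete, the following is a minimal NumPy sketch of that structure: sparse categorical features go through embedding-table lookups, dense features through a bottom MLP, the pieces are combined by a simple interaction, and a top MLP emits a click probability. All sizes, the single-layer MLPs, and the concatenation-style interaction are illustrative assumptions, not the production configuration described in [1].

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, N_TABLES, ROWS_PER_TABLE, N_DENSE = 32, 8, 10_000, 13

# Embedding tables hold one vector per categorical value (sparse features).
tables = [rng.standard_normal((ROWS_PER_TABLE, EMB_DIM)) for _ in range(N_TABLES)]
W_bottom = rng.standard_normal((N_DENSE, EMB_DIM)) * 0.1            # bottom MLP (one layer here)
W_top = rng.standard_normal((EMB_DIM * (N_TABLES + 1), 1)) * 0.1    # top MLP (one layer here)

def predict_click(dense_x, sparse_ids):
    """dense_x: (N_DENSE,) float features; sparse_ids: one row index per table."""
    dense_vec = np.tanh(dense_x @ W_bottom)                  # dense features -> embedding space
    emb_vecs = [t[i] for t, i in zip(tables, sparse_ids)]    # embedding lookups (memory bound)
    interacted = np.concatenate([dense_vec] + emb_vecs)      # simple concat "interaction"
    logit = (interacted @ W_top).item()
    return 1.0 / (1.0 + np.exp(-logit))                      # probability of a click

p = predict_click(rng.standard_normal(N_DENSE),
                  rng.integers(0, ROWS_PER_TABLE, N_TABLES))
print(f"p(click) = {p:.3f}")
```

Note how the compute-heavy part (the MLPs) is tiny compared with the sheer capacity of the embedding tables, which in production are the multi-billion-row structures listed in the resource table below.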

[1] "Deep Learning Inference in Data Centers: Characterization, Performance Optimizations and Hardware Implications", arXiv, 2018. See also: https://code.fb.com/ai-research/scaling-neural-machine-translation-to-bigger-data-sets-with-faster-training-and-inference and https://code.fb.com/ml-applications/expanding-automatic-machine-translation-to-more-languages

Resource Requirements

| Category | Model Types | Model Size (W) | Max. Activations (X) | Op. Intensity (W) | Op. Intensity (X and W) |
|---|---|---|---|---|---|
| RecSys | FCs | 1-10M | >10K | 20-200 | 20-200 |
| RecSys | Embeddings | >10 Billion | >10K | 1-2 | 1-2 |
| CV | ResNeXt101-32x4-48 | 43-829M | 2-29M | Avg. 380 / Min. 100 | Avg. 188 / Min. 28 |
| CV | Faster-RCNN-ShuffleNet | 6M | 13M | Avg. 3.5K / Min. 2.5K | Avg. 145 / Min. 4 |
| CV | ResNeXt3D-101 | 21M | 58M | Avg. 22K / Min. 2K | Avg. 172 / Min. 6 |
| NLP | seq2seq | 100M-1B | >100K | 2-20 | 2-20 |

AI Application Roofline Performance

System balance: Memory <-> Compute <-> Communication. For the foreseeable future, off-chip memory bandwidth will often be the constraining resource in system performance.
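As a quick illustration of how the operational intensities in the table translate into that roofline bound, here is a small sketch. It assumes the same hypothetical accelerator as [1] (100 int8 Top/s peak compute, 100 GB/s DRAM bandwidth); the intensity values passed in are representative points from the table above, not measurements.

```python
# Roofline sketch: attainable throughput is the minimum of peak compute and
# (operational intensity x memory bandwidth).

def attainable_tops(ops_per_byte: float,
                    peak_tops: float = 100.0,
                    dram_gbs: float = 100.0) -> float:
    """Roofline bound in Top/s for a given operational intensity (ops/byte)."""
    memory_bound_tops = ops_per_byte * dram_gbs / 1e3   # Gop/s -> Top/s
    return min(peak_tops, memory_bound_tops)

for name, oi in [("RecSys embeddings", 1.5),
                 ("NLP seq2seq",       10.0),
                 ("RecSys FCs",        100.0),
                 ("CV ResNeXt101",     380.0)]:
    print(f"{name:18s} {oi:7.1f} ops/B -> {attainable_tops(oi):6.2f} Top/s attainable")
```

With only 100 GB/s of DRAM bandwidth, every one of these workloads lands on the memory-bound part of the roofline, which is exactly the point of the slide: off-chip bandwidth, not peak compute, sets the ceiling.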

Research + Development: time to train and accuracy; multiple runs for exploration, sometimes overnight.
Production: optimal work/$, optimal work/watt, time to train.

Memory access for: network / program config and control flow; training-data mini-batch compute flow.
Compute consumes: mini-batch data.
Communication for: all-reduce; embedding table insertion.
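A back-of-the-envelope sketch of how those three legs show up in one data-parallel training step may help. The model size, precision, sample size, and worker count below are illustrative assumptions; the communication estimate uses the standard ring all-reduce volume of roughly 2(N-1)/N of the gradient bytes per worker.

```python
# Rough per-step data movement: memory traffic for weights and the mini-batch,
# plus all-reduce communication for the gradients.

def step_traffic_gb(params_millions, bytes_per_param, batch_size,
                    bytes_per_sample, num_workers):
    weight_bytes = params_millions * 1e6 * bytes_per_param
    minibatch_bytes = batch_size * bytes_per_sample
    # Ring all-reduce moves ~2*(N-1)/N of the gradient bytes per worker.
    allreduce_bytes = 2 * (num_workers - 1) / num_workers * weight_bytes
    return (weight_bytes + minibatch_bytes) / 1e9, allreduce_bytes / 1e9

mem_gb, comm_gb = step_traffic_gb(params_millions=100, bytes_per_param=2,  # fp16
                                  batch_size=512, bytes_per_sample=4 * 1024,
                                  num_workers=8)
print(f"~{mem_gb:.2f} GB of weights + mini-batch touched, "
      f"~{comm_gb:.2f} GB all-reduced per worker per step")
```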

Hardware Trends

[Figure: time spent in Caffe2 operators in data centers [1] (Figure 4 of [1]: CPU time breakdown across data centers)]

[Figure 3 of [1]: runtime roofline analysis of different ML models, varying the on-chip memory capacity of a hypothetical accelerator with 100 int8 Top/s compute and 100 GB/s DRAM bandwidth; the importance of 1 TB/s (solid) vs. 10 TB/s (dashed) on-chip bandwidth is shown under a variety of workloads]

Hardware implications for DL inference [1]:
• High memory bandwidth and capacity for embeddings
• Support for powerful matrix and vector engines
• Large on-chip memory for inference with small batch sizes
• Support for half-precision floating-point computation

From [1]: The FC layers in recommendation and NMT models use small batch sizes, so performance is bound by off-chip memory bandwidth unless the parameters fit on-chip. The batch size can be increased while maintaining latency given the higher compute throughput of accelerators, but only up to a point due to other application requirements. The number of operations per weight in CV models is generally high, but the number of operations per activation is not as high (some layers in the ShuffleNet and ResNeXt-3D models are as low as 4 or 6). This is why the performance of ShuffleNet and ResNeXt-3D varies considerably with on-chip memory bandwidth, as shown in Figure 3 of [1]. Had we only considered their minimum of 2K operations per weight, we would expect 1 TB/s of on-chip memory bandwidth to be sufficient to saturate the peak 100 Top/s compute throughput of the hypothetical accelerator; since the application would then be compute bound at 1 TB/s, there would be no performance difference between 1 TB/s and 10 TB/s.

Common primitive operations are not just canonical multiplications of square matrices, but often involve tall-and-skinny matrices or vectors. These problem shapes arise from group/depth-wise convolutions that have recently become popular in CV, and from the small batch sizes used in recommendation/NMT models due to their latency constraints. It is therefore desirable to have a combination of 1) matrix-matrix engines that execute the bulk of the FLOPs from compute-intensive models in an energy-efficient manner and 2) sufficiently powerful vector engines that handle the remaining operations.

Computation kernels: Figure 4 of [1] shows the breakdown of operations across all data centers. CPU operations are counted because inference often runs with small batch sizes to meet latency constraints, so GPUs are not widely used. FCs are the most time-consuming operation, followed by tensor manipulations and embedding lookups. Figure 5 of [1] shows common activation and weight matrix shapes (X: MxK, W: KxN) in these DL inference workloads: FCs, group convolutions with few channels per group (depth-wise convolution being the extreme case of one channel per group), and all other operations. For activation matrices in convolution layers, the dimensions are those of the lowered (i.e. im2col'd) matrices, although this does not necessarily mean lowering is used; that is, the reduction dimension is multiplied by the filter size (e.g., 9 for 3x3 filters). The number of rows of a convolution's activation matrix is batch_size x H_out x W_out, where H_out x W_out is the size of each output channel; this row count is called the effective batch size or batch/spatial dimension.
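To make the effective batch size and the lowered matrix shapes above concrete, here is a small sketch that computes the equivalent GEMM dimensions of a convolution layer. The layer parameters are hypothetical, not taken from the production models in [1].

```python
# GEMM shape of a convolution after im2col lowering, following the
# "effective batch size" definition quoted above.

def conv_as_gemm(batch, c_in, h, w, c_out, k, stride=1, pad=0):
    """Return (M, K, N) of the equivalent activation (MxK) x weight (KxN) GEMM."""
    h_out = (h + 2 * pad - k) // stride + 1
    w_out = (w + 2 * pad - k) // stride + 1
    M = batch * h_out * w_out   # effective batch size (batch/spatial dimension)
    K = c_in * k * k            # reduction dim multiplied by the filter size (9 for 3x3)
    N = c_out                   # output channels
    return M, K, N

M, K, N = conv_as_gemm(batch=1, c_in=64, h=56, w=56, c_out=64, k=3, pad=1)
print(f"M={M}, K={K}, N={N}  (tall-and-skinny when the batch is small), "
      f"~{2 * M * K * N / 1e9:.2f} GFLOPs")
```

With batch size 1 this already gives M much larger than N, i.e. the tall-and-skinny shapes that motivate pairing matrix engines with capable vector engines.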

Sample Workload Characterization

Embedding table hit rates and access histograms [2]
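For intuition about what such hit-rate and access-histogram measurements look like, here is a small simulation sketch: it drives a Zipf-skewed stream of embedding-row accesses through an LRU-managed DRAM cache sitting in front of a much larger table on slower memory (the NVM setting studied in [2]). The table size, cache size, and skew parameter are illustrative assumptions, not values from the paper.

```python
from collections import OrderedDict
import numpy as np

rng = np.random.default_rng(0)
TABLE_ROWS, CACHE_ROWS, N_ACCESSES = 1_000_000, 50_000, 200_000

accesses = rng.zipf(a=1.2, size=N_ACCESSES) % TABLE_ROWS  # skewed row ids
cache, hits = OrderedDict(), 0

for row in accesses:
    row = int(row)
    if row in cache:
        hits += 1
        cache.move_to_end(row)         # LRU: refresh on hit
    else:
        cache[row] = True
        if len(cache) > CACHE_ROWS:
            cache.popitem(last=False)  # evict the least recently used row

hist = np.bincount(accesses, minlength=TABLE_ROWS)
top1pct = np.sort(hist)[::-1][: TABLE_ROWS // 100].sum()
print(f"DRAM cache hit rate: {hits / N_ACCESSES:.1%}; "
      f"top 1% of rows serve {top1pct / N_ACCESSES:.1%} of accesses")
```

The skew is what makes a small fast tier in front of a huge embedding table worthwhile: a few percent of the rows absorb most of the traffic.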

[2] "Bandana: Using Non-Volatile Memory for Storing Deep Learning Models", SysML, 2019.

MLPerf

The Move to the Edge

[Figure: edge data flow across CPU, Memory and Storage]

By 2022, 7 out of every 10 bytes of data created will never see a data center.

Considerations – Let Data Speak for Itself!
• Compute closer to Data
• Smarter Data Movement
• Faster Time to Insight

Panel – The Impact of AI Workloads on Datacenter Compute and Memory

Panelists: Marc Tremblay (Microsoft), Sumit Gupta (IBM), Cliff Young (Google), Maxim Naumov (Facebook), Greg Diamos (Baidu), Rob Ober (Nvidia), Anand Iyer (Samsung)


Samsung @ The Heart of Your Data – Unparalleled Product Breadth & Technology Leadership

Cloud AI | Edge computing AI | On-device AI

• Cloud AI (HBM): 6x faster training time, 8x training cost effectiveness
• Edge computing AI (GDDR6): 2x faster data access, 2x hot data feeding
• On-device AI (LPDDR5): 1.5x bandwidth, privacy & fast response

Samsung @ The Heart of Your Data – Visit us at Booth #726