Machine Learning at Scale

AccML 2020: HiPEAC Accelerated Machine Learning

$.056;2#2.?;6;4A)0.92 *(& %!*- #!#!*-5.992;42@3

.?<92!2.;-B .02/<<8 (2@2.?05I )F@$# Machine Learning at Facebook’s Scale

TRANSLATION NEWS FEED ( Machine learning is used extensively FACE ( Ranking posts in Newsfeed TAGGING

( Content understanding ADS ( Object detection, segmentation, and tracking ( Speech recognition/translation

( From data centers to the edge Keypoints Augmented Reality Segmentation with Smart Camera $#E20BA6<;9

.A. *?.6;6;4 ;32?2;02 $.;6=B9.A6<;

! "! $ $#$<129*?.6;6;4.A.02/<<8

D228@ (%)#*!&%

1.F@ ,) 5

:6;BA2@ 5B2;0F -5.A./

5-//-21 27%/5)(-'7-216)5%; -//-21 %1+8%+) 5%16/%7-216)5%; -//-216 %.)''28176)029)(52%'7-9)/; &;8720%7)( ;67)069)5;%; 6?@AD6A5B@A<:2@64;21)F@A2:)<9BA6<;@ .02/<<8L@=569<@<=5F6@A< ( 5.?.0A2?6G2.;1/B082A6G2$#D

&"$ &)"$ " #'&

! "! $

6459F)0.9./92 ;3?.@A?B0AB?2

'%/-1+!3 '%/-1+87

# # &BA96;2

( &C2?C62D3

Support Vector Gradient-Boosted Multi-Layer Convolutional Recurrent Machines Decision Trees Perceptron Neural Nets Neural Nets

SVM GBDT MLP CNN RNN

Facer Sigma News Feed Facer Language Translation Ads Speech Rec. Search Content Sigma Understanding

Hazelwood et al., “Applied Machine learning at Facebook: A Datacenter Infrastructure Perspective”, HPCA 2018. 6C2?@6AF6;$#$<129@.A.02/<<8

Support Vector Gradient-Boosted Multi-Layer Convolutional Recurrent Machines Decision Trees Perceptron Neural Nets Neural Nets

SVM GBDT MLP CNN RNN

Facer Sigma News Feed Facer Language Translation Ads Speech Rec. Search Content Sigma Understanding

Hazelwood et al., “Applied Machine learning at Facebook: A Datacenter Infrastructure Perspective”, HPCA 2018. 6C2?@6AF6;%%+@2.@2@

?6A5:2A60 ;A2;@6AF %%&"! %

$2:

&%(&$$%*!&%+))) *

27%/1*)5)1');'/)6 )'200)1(%7-212()/6 $#*<=60@<3 ;A2?2@A/FA52(2@2.?05 <::B;6AF

5AA=@ DDD@64.?05B2@)AB1621/FA52 (2@2.?05<::B;6AF %2B?.9'2?@<;.96G21 (20<::2;1.A6<;)F@A2:@ *52+@2.@25.992;42 ;E.:=92<3(20<::2;1.A6<;

"" !&"!" !&"!" %" !"#! &&& #!")

(20<::2;1.A6<; -5.A6@22=#2.?;6;4'2?@<;.96G21

-7)0-7)

5)'200)1(%7-2148)5; &$$%*!&%%'+*) 2;@22.AB?2@ 2.AB?2@ 2.AB?2@ )=.?@2 )=.?@2

!&"$% $!% % %) ) :/2116;4 :/2116;4 2;@2%%@ *./92 *./92

Memory Capacity Dominated

Memory Bandwidth:/2116;444?24.A6<;:/211 6;4 Dominated

Communication Dominated)=.?@22;@2)=.?@2 ;A24?.A6<; Computation Dominated '?2160A

)5-7)0 6C2?@6AF6;(20<::2;1.A6<;$<129@

Facebook Google RMC3 RMC2

RMC1

Alibaba $#&=2?.A

B=A.2A.9J*52?056A20AB?.9 :=960.A6<;@<3.02/<<8L@%%/.@21 '2?@<;.96G21(20<::2;1.A6<;)F@A2:@K ' :/2116;4*./92002@@2@ ;0B? 645## $'" D6A5#

H : :

B=A.2A.9J*52?056A20AB?.9 :=960.A6<;@<3.02/<<8L@%%/.@21 '2?@<;.96G21(20<::2;1.A6<;)F@A2:@K ' $.7

$" $ ($!! #

#-()5 21

AdIndexer

%-//%7)1';-6,)%9-/;-1*/8)1')(&;'%',),-)5%5',; # <6-1'/96 <6):'/

( &C2?C62D3

How do we 2000+ optimize system SoCs designs for real- time ML inference? FRAGMENTED SMARTPHONE ECOSYSTEM POSES UNIQUE CHALLENGES FOR EDGE INFERENCE

$.056;2#2.?;6;4.A.02/<<8+;12?@A.;16;4 ;32?2;02.AA52142-B2A.9 ' Lay of the Land FRAGMENTATION

Taking a Closer Look at Smartphones Facebook Runs on

1 95% 0.9 ( Qualcomm Snapdragon 0.8 ( Samsung Exynos 0.7 65% ( MediaTek Helio 0.6 225 SoCs ( 0.5 HiSilicon Kirin et al.

0.4 50 SoCs 0.3

CDF of SoCs 0.2

0.1

0 1 53 105 157 209 261 313 365 417 469 521 573 625 677 729 781 833 885 937 989 1041 1093 1145 1197 1249 1301 1353 1405 1457 1509 1561 1613 1665 1717 1769 1821 1873 1925 1977 2029 2081 Unique SoCs

IS THERETHERE A DOMINATINGIS NO STANDARD MOBILE SOC SOC TO OPTIMIZE TO OPTIMIZE FOR FOR? Lay of the Land FRAGMENTATION

In 2018, ~28% of SoCs Use CPUs Designed in 2013 or Later

ARM Cortex A53 72% OF THE WORLD’S CELL PHONES ARE MORE THAN 7 YEARS OLD

MOBILE CPUS SHOW LITTLE DIVERSITY Lay of the Land PERFORMANCE

The Performance Difference between a Mobile CPU and GPU is Narrow

SoC GPU flops/CPU GPU SoC flops

15% Marketshare ON A MEDIAN SMARTPHONE, THE GPU PROVIDES AS MUCH THEORETICAL PEAK PERFORMANCE AS ITS CPU

LESS THAN 15% SMARTPHONES HAVE A GPU THAT IS 3 TIMES AS POWERFUL AS ITS CPU Lay of the Land PROGRAMMABILITY Programmability is a Primary Roadblock for Using Mobile Co-processors ( OpenCL, OpenGL ES, Vulkan for Android GPUs

! OpenGL 3.2 " % ! ##'& OpenGL 3.1 ! ! ##' " $ ! #$$ OpenGL 3.0 !& " OpenGL 2.0 % !& %

ANDROID GPUS HAVE FRAGILE USABILITY AND POOR PROGRAMMABILITY WHILE IOS HAS BETTER SUPPORT WITH METAL Quantitative Approach to Edge Inference Designs

State of the Practice for Mobile Inference is Using CPUs

FRAGMENTATION PERFORMANCE PROGRAMMABILITY

( There are more than 2000+ ( Performance difference between ( Programmability is a major road different SoCs but mobile mobile CPUs and GPUs is block for co-processors (e.g. CPUs show little diversity narrow Android GPUs) with ARM’s Cortex A53 dominating the market

MOBILE INFERENCE OPTIMIZATION IS TARGETED FOR THE COMMON DENOMINATOR OF Machine Learning at Facebook: THEUnderstanding FRAGMENTED Inference SOC at theECOSYSTEM Edge. Wu et al. HPCA-2019. $

( $.056;2#2.?;6;4.A.02/<<8+;12?@A.;16;4 ;32?2;02.AA52142+ *# (

'+ )'

;32?2;02@=2?)20<;1

)24:2;A.A6<; .;1*?.086;4 '<@2@A6:.A6<; :.42$<129$ :.42$<129$ 5283 )37,#-6)219 Vertical Integrated Systems Making Inference on DSPs Leads to Consistent Performance

DSP

CPU FPS

CPU thermal throttling CPU causes sudden FPS drop

Power (W) DSP The primary reason for using

co-processors and accelerators Thermal are for lower power and Threshold CPU more stable performance

Temp (C) Temp DSP

0 100 200 300 400 500 Time (Sec) &BA96;2

( &C2?C62D3

( Diversity of ML Models in Facebook’s Datacenter DeepRecSys ( A Variety of Neural Personalized Recommendation Models Dominate AI Inference Cycles MLPerf Benchmark Suite ( Legacy Devices Matter; Performance Differences at the Edge Are Huge ( ItWithout is important solid performanceper toformance consider analysis full-picture fforor ML models, and systemwe effects for aarere in tthehe darkdark efficient, practical at-scale ML infrastructure designs

K. Hazelwood et al., “Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective,” HPCA 2018. C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” HPCA 2019. U. Gupta et al., “The Architectural Implications of Facebook’s DNN-based Personalized Recommendation,” HPCA 2020.