AccML 2020: HiPEAC Accelerated Machine Learning
Machine Learning at Scale [...] Challenges [...]
Carole-Jean Wu
Facebook AI Research, SysML

Machine Learning at Facebook’s Scale
• Machine learning is used extensively
• Ranking posts in News Feed
• Content understanding
• Object detection, segmentation, and tracking
• Speech recognition/translation
[Figure labels: Translation, News Feed, Face Tagging, Ads]
• From data centers to the edge
[Figure labels: Keypoints, Augmented Reality, Segmentation with Smart Camera]

ML Execution Flow
[Figure labels: Data, Training, Inference, Manipulation]

ML Model Training at Facebook
[Figure: training duration (minutes, hours, days, weeks) and frequency]

• [...] trillion total predictions per day
• [...] billion language translations per day
• Millions of fake accounts removed proactively by automated systems every day

First with Custom-Designed System Solutions
Facebook’s philosophy is to
• Characterize and bucketize ML workloads

Highly Scalable Infrastructure
• Scaling up
• Scaling out

Outline
• Overview of [...]

Diversity in ML Models at Facebook
Model families: Support Vector Machines (SVM), Gradient-Boosted Decision Trees (GBDT), Multi-Layer Perceptron (MLP), Convolutional Neural Nets (CNN), Recurrent Neural Nets (RNN)
Services: Facer, Sigma, News Feed, Ads, Search, Language Translation, Speech Rec., Content Understanding
Hazelwood et al., “Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective”, HPCA 2018.

Diversity in DNN Use Cases
[Figure labels: Arithmetic Intensity; Total Inference Cycles; Recommendation Models]

ML Topics of Interest by the Research Community
https://www.sigarch[...]
[...] Studied by the Research Community

Neural Personalized Recommendation Systems

The Use Case Challenge
An Example of Recommendation
What is Deep Learning Personalized Recommendation?
[Figure labels: recommendation query; dense features; sparse features; embedding tables; embedding aggregation; sparse/dense integration; dense NNs; predict per item]
[Bottleneck labels: Memory Capacity Dominated; Memory Bandwidth Dominated; Communication Dominated; Computation Dominated]

Diversity in Recommendation Models
[Figure: ML operators across recommendation models; labels: Facebook, Google, Alibaba, RMC1, RMC2, RMC3]
Gupta et al., “The Architectural Implications of Facebook’s DNN-based Personalized Recommendation”, HPCA 2020.

Embedding Table Accesses Incur High LLC MPKI

Tail latency is heavily influenced by cache hierarchy
[Figure label: AdIndexer]

How do we optimize system designs for real-time ML inference?
2000+ SoCs
FRAGMENTED SMARTPHONE ECOSYSTEM POSES UNIQUE CHALLENGES FOR EDGE INFERENCE
Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge”, HPCA 2019.

Lay of the Land: FRAGMENTATION
Taking a Closer Look at Smartphones Facebook Runs on
• Qualcomm Snapdragon
• Samsung Exynos
• MediaTek Helio
• HiSilicon Kirin et al.
[Figure: CDF of SoCs over unique SoCs (1 to 2081); 50 SoCs cover 65%, 225 SoCs cover 95%]
IS THERE A DOMINATING SOC TO OPTIMIZE FOR?
THERE IS NO STANDARD MOBILE SOC TO OPTIMIZE FOR
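The coverage figures on the fragmentation slide (50 SoCs for 65%, 225 SoCs for 95%) are points on a cumulative distribution over SoCs ranked by popularity. A minimal sketch of that calculation follows. The function name `socs_needed` and the Zipf-like share data are assumptions made for illustration; they are not Facebook's measurements and will not reproduce the slide's numbers.

```python
from itertools import accumulate


def socs_needed(shares, target):
    """Return how many of the most common SoCs are needed to cover
    at least `target` (a fraction between 0 and 1) of all devices."""
    ordered = sorted(shares, reverse=True)   # most popular SoC first
    total = sum(ordered)
    for count, covered in enumerate(accumulate(ordered), start=1):
        if covered / total >= target:
            return count
    return len(ordered)                      # target not reached before the tail


# Hypothetical long-tailed popularity data: 2000 SoCs, share proportional to 1/rank.
hypothetical_shares = [1.0 / rank for rank in range(1, 2001)]

for target in (0.65, 0.95):
    needed = socs_needed(hypothetical_shares, target)
    print(f"{needed} SoCs cover {target:.0%} of devices")
```

With a long-tailed distribution the count grows quickly as the target approaches 100%, which is the shape the slide uses to argue that there is no single SoC to optimize for.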
Lay of the Land: FRAGMENTATION
In 2018, ~28% of SoCs Use CPUs Designed in 2013 or Later
[Figure label: ARM Cortex A53]
72% OF THE WORLD’S CELL PHONES ARE MORE THAN 7 YEARS OLD
MOBILE CPUS SHOW LITTLE DIVERSITY

Lay of the Land: PERFORMANCE
The Performance Difference between a Mobile CPU and GPU is Narrow
[Figure: ratio of GPU flops to CPU flops per SoC, against market share; 15% market share marked]
ON A MEDIAN SMARTPHONE, THE GPU PROVIDES AS MUCH THEORETICAL PEAK PERFORMANCE AS ITS CPU
LESS THAN 15% OF SMARTPHONES HAVE A GPU THAT IS 3 TIMES AS POWERFUL AS ITS CPU

Lay of the Land: PROGRAMMABILITY
Programmability is a Primary Roadblock for Using Mobile Co-processors
• OpenCL, OpenGL ES, Vulkan for Android GPUs
[Figure: OpenGL version support (2.0, 3.0, 3.1, 3.2)]
ANDROID GPUS HAVE FRAGILE USABILITY AND POOR PROGRAMMABILITY WHILE IOS HAS BETTER SUPPORT WITH METAL

Quantitative Approach to Edge Inference Designs
State of the Practice for Mobile Inference is Using CPUs
FRAGMENTATION
• There are more than 2000 different SoCs, but mobile CPUs show little diversity, with ARM’s Cortex A53 dominating the market
PERFORMANCE
• The performance difference between mobile CPUs and GPUs is narrow
PROGRAMMABILITY
• Programmability is a major roadblock for co-processors (e.g. Android GPUs)
MOBILE INFERENCE OPTIMIZATION IS TARGETED FOR THE COMMON DENOMINATOR OF THE FRAGMENTED SOC ECOSYSTEM
Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge”, HPCA 2019.
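The two performance claims above (the median smartphone's GPU roughly matches its CPU in peak flops, and fewer than 15% of smartphones have a GPU 3 times as powerful) are both market-share-weighted statistics over per-SoC GPU/CPU ratios. The sketch below shows one way such statistics could be computed. The helper names and the sample ratios and shares are hypothetical and chosen only to illustrate the arithmetic.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    running = 0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value


def share_at_or_above(values, weights, threshold):
    """Fraction of total weight carried by values >= threshold."""
    total = sum(weights)
    return sum(w for v, w in zip(values, weights) if v >= threshold) / total


# Hypothetical per-SoC data: peak GPU flops divided by peak CPU flops,
# and each SoC's market share in percent (assumed values, not measurements).
gpu_to_cpu_ratio = [0.5, 1.0, 1.0, 2.0, 4.0]
market_share = [20, 30, 20, 20, 10]

print("median ratio:", weighted_median(gpu_to_cpu_ratio, market_share))
print("share with GPU >= 3x CPU:", share_at_or_above(gpu_to_cpu_ratio, market_share, 3.0))
```

Weighting by market share rather than counting SoCs equally matters here, because a handful of common SoCs account for most devices.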
[Figure: inferences per second for segmentation and tracking, pose estimation, and image models]

Vertically Integrated Systems
Making Inference on DSPs Leads to Consistent Performance
[Figure: FPS, power (W), and temperature (C) over time (0 to 500 sec) for CPU and DSP, with the thermal threshold marked]
• CPU thermal throttling causes sudden FPS drop
• The primary reasons for using co-processors and accelerators are lower power and more stable performance

Outline
• Overview of [...]
• Diversity of ML Models in Facebook’s Datacenter
• A Variety of Neural Personalized Recommendation Models Dominate AI Inference Cycles
• Legacy Devices Matter; Performance Differences at the Edge Are Huge
• It is important to consider full-picture and system effects for efficient, practical at-scale ML infrastructure designs
[Callouts: DeepRecSys, MLPerf Benchmark Suite]
Without solid performance analysis for ML models, we are in the dark

K. Hazelwood et al., “Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective,” HPCA 2018.
C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” HPCA 2019.
U. Gupta et al., “The Architectural Implications of Facebook’s DNN-based Personalized Recommendation,” HPCA 2020.