Path Towards Better Deep Learning Models and Implications for Hardware
Natalia Vassilieva, Product Manager, ML, Cerebras Systems

AI has massive potential, but is compute-limited today.

"Historically, progress in neural networks and deep learning research has been greatly influenced by the available hardware and software tools."

"Hardware capabilities and software tools both motivate and limit the type of ideas that AI researchers will imagine and will allow themselves to pursue. The tools at our disposal fashion our thoughts more than we care to admit."

"Is DL-specific hardware really necessary? The answer is a resounding yes. One interesting property of DL systems is that the larger we make them, the better they seem to work."

Yann LeCun, "1.1 Deep Learning Hardware: Past, Present and Future." 2019 IEEE International Solid-State Circuits Conference (ISSCC 2019), 12-19

Cerebras Systems
Founded in 2016
● Visionary leadership
● Smart, strategic backing
● 200+ world-class engineering team
Building systems to drastically change the landscape of compute for AI.

The Cerebras Wafer Scale Engine (WSE)
The most powerful processor for AI:
● 46,225 mm² of silicon
● 1.2 trillion transistors
● 400,000 AI-optimized cores
● 18 gigabytes of on-chip memory
● 9 PByte/s memory bandwidth
● 100 Pbit/s fabric bandwidth
● TSMC 16nm process
● Hosted in the CS-1

Cerebras CS-1: Cluster-Scale DL Performance in a Single System
● Powered by the WSE
● Programming accessibility of a single node, using TensorFlow, PyTorch, and other frameworks
● Deploys into standard datacenter infrastructure
● Multiple units delivered, installed, and in use today across multiple verticals
● Built from the ground up for AI acceleration

Let's look at modern DL models…

Modern models: computer vision
Image classification accuracy and model sizes
[Chart: Top-1 accuracy on ImageNet (%) vs. year, 2010-2022, with models labeled by size in millions of parameters (from about 1M to over 800M): AlexNet, ZFNet, Inception-v1, VGG-19, ResNet-152, Inception-ResNet, SqueezeNet, ResNeXt-101, AmoebaNet-A, SENet-154, FixEfficientNet. Data from https://paperswithcode.com/]
● Accuracy steadily improves over time
● Model sizes grow gradually, from tens to hundreds of millions of parameters

Modern Models: Natural Language Processing
Language modelling on WikiText-103
[Chart: test perplexity vs. model size in millions of parameters, log scale, with labeled points at 24M, 1.5B (GPT-2), and 175B (GPT-3); also shown: 2-layer skip-LSTM, Transformer-XL, Mogrifier LSTM, BERT-Large-CAS. Data from https://paperswithcode.com/]
● Model quality improves linearly while model size grows exponentially, confirming a power law reported by OpenAI (see the sketch below)
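The power-law relationship the chart points to can be made concrete. Below is a minimal sketch, assuming the parameter-count scaling law from OpenAI's "Scaling Laws for Neural Language Models" (Kaplan et al., 2020), L(N) = (N_c / N)^alpha, with the paper's fitted constants; the loss is in nats per token, so the printed perplexities illustrate the trend rather than reproduce the per-word WikiText-103 numbers above.

```python
import math

# Parameter-count scaling law from Kaplan et al. (2020),
# "Scaling Laws for Neural Language Models":
#   L(N) = (N_c / N) ** alpha, with loss in nats per token.
# N_c and alpha are the paper's reported fit, used here only to
# illustrate the trend on the slide.
N_C = 8.8e13
ALPHA = 0.076

def predicted_loss(n_params: float) -> float:
    """Cross-entropy loss predicted for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA

# Every 10x in model size multiplies the loss by 10**-ALPHA (about 0.84),
# so quality improves almost linearly against a log-size axis, as in the chart.
for name, n in [("24M", 24e6), ("GPT-2, 1.5B", 1.5e9), ("GPT-3, 175B", 175e9)]:
    loss = predicted_loss(n)
    print(f"{name:>12}: loss {loss:.2f} nats/token, perplexity {math.exp(loss):.1f}")
```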
Modern Models: Natural Language Processing
Sentiment analysis on SST-2
[Chart: accuracy (%) vs. model size in millions of parameters, log scale: TinyBERT (14.5M), BERT Base (110M), BERT Large (340M), ALBERT, XLNet, T5-3B (3B), T5-11B (11B). Data from https://paperswithcode.com/]
● A similar trend holds on downstream tasks
● Further improvements are possible with similar model sizes, but with more compute

Memory and compute requirements
[Chart: total training compute in PFLOP-days vs. model memory requirement in GB, both log scale: BERT Base (2 GB), BERT Large (5.4 GB), GPT-2, Megatron-LM, T-NLG (272 GB), T5-11B, GPT-3 (2.8 TB, 3,738 PFLOP-days), Microsoft 1T-param (16 TB, 23,220 PFLOP-days)]
● 1 PFLOP-day is about 1 x DGX-2H or 1 x DGX-A100 busy for a day
● NVIDIA Megatron-LM: trained on 512 V100s (32 DGX-2H) for about 10 days
● OpenAI GPT-3: trained on 1024 V100s (64 DGX-2H) for about 116 days (a back-of-the-envelope check follows below)
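The chart's figures can be reproduced with simple arithmetic. The sketch below assumes 16 bytes of training state per parameter (fp16 weights and gradients plus fp32 master weights and two Adam moments, an accounting that reproduces the slide's memory labels) and a sustained per-GPU throughput of ~31 TFLOPS, chosen so the GPT-3 point lands near the slide's 3,738 PFLOP-days; both constants are assumptions, not published Cerebras numbers.

```python
BYTES_PER_PARAM = 16  # assumption: fp16 weights (2 B) + fp16 grads (2 B)
                      # + fp32 master weights (4 B) + Adam m, v (8 B)

def train_memory_gb(n_params: float) -> float:
    """Weights-plus-optimizer-state footprint; activations excluded."""
    return n_params * BYTES_PER_PARAM / 1e9

def pflop_days(n_gpus: int, days: float, sustained_tflops: float = 31.0) -> float:
    """Total training compute. 1 PFLOP-day = 1 PFLOP/s sustained for one day,
    i.e. roughly one DGX-2H or DGX-A100 kept busy for a day (per the slide)."""
    return n_gpus * sustained_tflops * days / 1000.0

for name, n in [("BERT Base", 110e6), ("BERT Large", 340e6),
                ("T-NLG (17B)", 17e9), ("GPT-3 (175B)", 175e9),
                ("1T-param", 1e12)]:
    print(f"{name:>12}: {train_memory_gb(n):>9,.1f} GB")

print(f"Megatron-LM (512 V100s, ~10 days): ~{pflop_days(512, 10):,.0f} PFLOP-days")
print(f"GPT-3 (1024 V100s, ~116 days):     ~{pflop_days(1024, 116):,.0f} PFLOP-days")
```

With these two assumptions the memory labels on the slide (2 GB, 5.4 GB, 272 GB, 2.8 TB, 16 TB) and the GPT-3 compute figure (within a few percent of 3,738 PFLOP-days) all fall out of the same arithmetic.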
Future path: larger and smarter models
Brute-force scaling is the historical path to better models. This is challenging:
● Memory needs to grow
● Compute needs to grow
Algorithmic innovations give more efficient models. These are promising, but they challenge existing hardware.

Dynamic sparsity and conditional computations are key
[Diagram: training data (Sample 1, Sample 2, …, Sample N) linked selectively to model parameters]
The key is the ability to choose which input element should influence which model parameter, and which model parameter should influence the output for a given input element (illustrated in the sketch below).
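The deck does not name a specific mechanism, so as one concrete, well-known instance of conditional computation, here is a minimal sketch of a mixture-of-experts layer with top-k routing; all names and sizes (d_model, n_experts, moe_layer, …) are illustrative assumptions, not Cerebras code.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, k = 64, 8, 2  # route each input to its top-2 experts

W_gate = rng.standard_normal((d_model, n_experts)) * 0.02   # gating weights
experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Conditional computation for one input vector x of shape (d_model,).

    The gate decides, per input, which experts (blocks of parameters)
    participate; the remaining n_experts - k experts are never touched,
    so compute scales with the *active* parameters, not the total."""
    logits = x @ W_gate
    top = np.argsort(logits)[-k:]          # dynamic, input-dependent choice
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the chosen experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_layer(rng.standard_normal(d_model))
print(y.shape)  # (64,)
```

The difficulty for conventional dense accelerators is that which parameters are active is only known at run time, per input, which is exactly the pattern the deck argues challenges existing hardware.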