Path towards better models and implications for hardware

Natalia Vassilieva, Product Manager, ML, Cerebras Systems

AI has massive potential, but is compute-limited today.

“Historically, progress in neural networks and deep learning research has been greatly influenced by the available hardware and software tools.”

“Hardware capabilities and software tools both motivate and limit the type of ideas that AI researchers will imagine and will allow themselves to pursue. The tools at our disposal fashion our thoughts more than we care to admit.”

“Is DL-specific hardware really necessary? The answer is a resounding yes. One interesting property of DL systems is that the larger we make them, the better they seem to work.”

Yann LeCun, "1.1 Deep Learning Hardware: Past, Present and Future." 2019 IEEE International Solid-State Circuits Conference (ISSCC 2019), 12-19

Cerebras Systems

Founded in 2016
● Visionary leadership
● Smart, strategic backing
● A world-class engineering team of 200+

Building systems to drastically change the landscape of compute for AI

The Cerebras Wafer Scale Engine (WSE): the most powerful processor for AI

46,225 mm² of silicon
1.2 trillion transistors
400,000 AI-optimized cores
18 gigabytes of on-chip memory
9 PByte/s memory bandwidth
100 Pbit/s fabric bandwidth
TSMC 16 nm process

Hosted in CS-1

Cerebras CS-1: Cluster-Scale DL Performance in a Single System

Powered by the WSE

Programming accessibility of a single node, using TensorFlow, PyTorch, and other frameworks

Deploys into standard datacenter infrastructure

Multiple units delivered, installed, and in-use today across multiple verticals

Built from the ground up for AI acceleration

Let’s look at modern DL models…

Modern models: computer vision

[Chart: top-1 accuracy on ImageNet (%) vs. year (2010-2022) for image-classification models (AlexNet, ZFNet, Inception-v1, VGG-19, ResNet-152, Inception-ResNet, SqueezeNet, ResNeXt-101, AmoebaNet-A, SENet-154, FixEfficientNet), with each point labelled with the model size in millions of parameters]

● Accuracy steadily improves over time
● Model sizes grow gradually (from tens to hundreds of millions of parameters)

Data from https://paperswithcode.com/

Modern Models: Natural Language Processing

Language modelling on WikiText-103

[Chart: test perplexity on WikiText-103 vs. model size in millions of parameters (log scale) for 2-skip-LSTM, Mogrifier LSTM, Transformer-XL, BERT-Large-CAS, GPT-2 and GPT-3, with labelled sizes ranging from ~24M up to 1.5B and 175B parameters]

● Quality of models improves linearly while model size grows exponentially, confirming a power law reported by OpenAI

Data from https://paperswithcode.com/
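The power law mentioned above can be made concrete with a short sketch. The exponent and constant below are the ones reported for loss vs. non-embedding parameter count in OpenAI's scaling-laws paper (Kaplan et al., 2020); they describe that paper's setup, not a fit to the WikiText-103 points on this chart, so the printed numbers are illustrative only.

```python
import math

# Power law for LM loss vs. model size, L(N) = (N_c / N) ** alpha_N,
# with the constants reported by Kaplan et al. (2020). Perplexity = exp(loss).
ALPHA_N = 0.076
N_C = 8.8e13  # non-embedding parameters

def predicted_loss(num_params: float) -> float:
    return (N_C / num_params) ** ALPHA_N

for n in (24e6, 1.5e9, 175e9):  # roughly the sizes labelled on the chart
    loss = predicted_loss(n)
    print(f"{n:>12.0f} params -> loss {loss:.2f}, perplexity {math.exp(loss):.1f}")

# Each 10x increase in parameters shaves a roughly constant amount off the loss,
# i.e. quality improves ~linearly while model size grows exponentially.
```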

Modern Models: Natural Language Processing

Sentiment analysis on SST-2

[Chart: accuracy (%) on SST-2 vs. model size in millions of parameters (log scale) for Tiny BERT (14.5M), BERT Base (110M), BERT Large (340M), ALBERT, XLNet, T5-3B (3B) and T5-11B (11B)]

● A similar trend holds on downstream tasks
● Further improvements are possible with similar model sizes, but with more compute

Data from https://paperswithcode.com/

Memory and compute requirements

[Chart: total training compute in PFLOP-days (log scale) vs. model memory requirement in GB (log scale) for BERT Base, BERT Large, GPT-2, Megatron-LM, T-NLG, T5-11B, GPT-3 and a projected Microsoft 1T-parameter model; labelled memory footprints range from 2 GB to 16 TB, and labelled compute ranges from a few PFLOP-days for BERT up to ~3,738 PFLOP-days for GPT-3 and ~23,220 PFLOP-days for the 1T-parameter model]

● 1 PFLOP-day is about one DGX-2H or one DGX-A100 busy for a day (see the conversion sketch below)
● NVIDIA Megatron-LM: trained on 512 V100s (32 DGX-2H) for about 10 days
● OpenAI GPT-3: trained on 1024 V100s (64 DGX-2H) for about 116 days
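The bullets above amount to a simple unit conversion; a rough sketch follows. The sustained-throughput figure per V100 is an assumption made for illustration (real utilization varies), and `training_days` is a hypothetical helper, not a published formula.

```python
# Rough conversion from total training compute (PFLOP-days) to wall-clock days
# on a GPU cluster. The sustained TFLOPS per GPU is an assumed utilization figure
# for mixed-precision training, purely for illustration.

def training_days(total_pflop_days: float, num_gpus: int,
                  sustained_tflops_per_gpu: float = 30.0) -> float:
    cluster_pflops = num_gpus * sustained_tflops_per_gpu / 1000.0  # sustained PFLOPS
    # "PFLOP-days" already folds in the seconds-per-day factor, so the ratio of
    # total compute to sustained throughput gives wall-clock days directly.
    return total_pflop_days / cluster_pflops

# e.g. GPT-3-scale compute (~3,700 PFLOP-days, per the chart) on 1024 V100s:
print(f"{training_days(3700, 1024):.0f} days")   # ~120 days at 30 TFLOPS sustained
# and a ~100 PFLOP-day model on 512 V100s:
print(f"{training_days(100, 512):.0f} days")     # ~7 days under the same assumption
```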

Future path: larger and smarter models

Brute-force scaling is the historical path to better models.
• This is challenging
• Memory needs to grow
• Compute needs to grow

Algorithmic innovations give more efficient models.
• These are promising, but challenge existing hardware

Dynamic sparsity and conditional computations are key

[Diagram: training data (Sample 1, Sample 2, …, Sample N) mapped selectively onto subsets of the model parameters]

Ability to choose which input element should influence which model parameter, and which model parameter should influence the output for a given input element
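In today's frameworks, one familiar embodiment of this idea is mixture-of-experts-style routing, where each sample touches (and updates) only the parameters of the expert it is routed to. The PyTorch sketch below illustrates that generic pattern; it is not a description of how Cerebras implements dynamic sparsity, the class and names are made up for the example, and the hard argmax router is used only to keep it short (a real MoE layer uses a differentiable gate).

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Conditional computation: each sample is routed to one expert, so only that
    expert's parameters influence (and are updated by) that sample."""
    def __init__(self, d_in: int, d_out: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_in, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        expert_idx = self.router(x).argmax(dim=-1)        # hard top-1 routing, shape [batch]
        out = torch.zeros(x.size(0), self.experts[0].out_features, device=x.device)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                                # experts with no samples do no work
                out[mask] = expert(x[mask])
        return out

x = torch.randn(8, 16)
y = TinyMoE(16, 32)(x)   # each of the 8 samples touches exactly one of the 4 experts
```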

This requires flexible hardware…

WSE Designed for Flexibility

Each of the 400,000 cores is individually programmable
• Full array of general instructions with ML extensions
• Optimized tensor ops for data processing
• Perform different ops on different data

Dataflow scheduling in hardware
• Data and control arrival triggers instruction lookup
• State machine schedules datapath cycles
• Output is written back to memory or fabric

Intrinsic sparsity harvesting
• Sender filters out sparse zero data
• Receiver skips unnecessary processing
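As a software analogy for the sender/receiver behaviour above, the sketch below transmits only (index, value) pairs for non-zero activations and performs multiply-accumulates only for the entries that arrive. It is a conceptual illustration of sparsity harvesting with made-up helper names, not the WSE's actual dataflow mechanism.

```python
import numpy as np

def send_sparse(activations: np.ndarray):
    """Sender side: filter out zeros, transmit only (index, value) pairs."""
    nz = np.flatnonzero(activations)
    return nz, activations[nz]

def receive_matvec(indices, values, weights: np.ndarray) -> np.ndarray:
    """Receiver side: accumulate only the columns that actually arrived."""
    out = np.zeros(weights.shape[0])
    for idx, val in zip(indices, values):
        out += weights[:, idx] * val        # work is skipped for every zero activation
    return out

x = np.maximum(np.random.randn(1024), 0.0)   # ReLU output: roughly half zeros
W = np.random.randn(256, 1024)
idx, vals = send_sparse(x)
assert np.allclose(receive_matvec(idx, vals, W), W @ x)
print(f"harvested sparsity: {1 - len(idx) / x.size:.0%} of MACs skipped")
```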

Sparsity → Speed-up

34-layer FC performance with induced sparsity

[Chart: relative performance (x factor) of GPU and CS-1 for four configurations: dense with identity activation, dense with ReLU, 25% induced sparsity with ReLU, and 50% induced sparsity with ReLU]

● 1.7x perf gain with ReLU
● 2.4x perf gain with ReLU + 50% induced sparsity

Natural Sparsity in Transformer

[Diagram: Transformer feed-forward block (FC1 → ReLU → FC2 → Dropout), with the ReLU and Dropout stages marked as sparse]

• Transformer uses ReLU and Dropout
• ReLU is 90% naturally sparse
• Dropout is 30% naturally sparse
• 1.2x perf gain vs. a dense non-linearity and no dropout
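The 90% and 30% figures are measured activation sparsities, and measuring them in a framework is straightforward. A minimal PyTorch sketch is below; with a random tensor ReLU gives only ~50% zeros, whereas trained Transformer FFN activations reach the ~90% quoted above.

```python
import torch
import torch.nn.functional as F

def zero_fraction(t: torch.Tensor) -> float:
    return (t == 0).float().mean().item()

h = torch.randn(32, 128, 3072)                          # stand-in for FFN pre-activations
relu_out = F.relu(h)
drop_out = F.dropout(relu_out, p=0.3, training=True)    # dropout zeroes ~30% of what remains

print(f"ReLU sparsity:    {zero_fraction(relu_out):.0%}")   # ~50% here; ~90% in trained models
print(f"Dropout sparsity: {zero_fraction(drop_out):.0%}")   # ReLU zeros plus ~30% of the rest
```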

Inducing Sparsity in BERT

[Chart: BERT relative performance (x factor, from 1.0 to about 2.0) vs. induced sparsity (0%, 25%, 50%, 75%)]

• BERT has no natural sparsity
• But sparsity can be induced on most layers, in both the forward and backward pass
• Up to 1.5x perf gain with 50% sparsity and no or minimal accuracy loss
• The ML user controls the performance/accuracy trade-off
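One common way to induce activation sparsity at a chosen level is to keep only the largest-magnitude fraction of each activation tensor and zero the rest. The slide does not say which mechanism is used here, so the sketch below (with a made-up helper name) is only a generic illustration of the idea.

```python
import torch

def induce_sparsity(t: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude `sparsity` fraction of each row."""
    k = int(t.size(-1) * (1.0 - sparsity))               # elements to keep per row
    keep_idx = t.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(t).scatter_(-1, keep_idx, 1.0)
    return t * mask

acts = torch.randn(4, 1024)                              # stand-in for one layer's activations
sparse_acts = induce_sparsity(acts, sparsity=0.5)        # half of each row forced to zero
print((sparse_acts == 0).float().mean().item())          # ~0.5
```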

WSE is designed to unlock smarter techniques and scale

WSE has a dataflow architecture
• Flexibility to stream token by token
• Inherent sparsity harvesting

WSE is a MIMD architecture
• Can program each core independently
• Perform different operations on different data

WSE was built to enable the next generation of models that are otherwise limited by today's hardware.

WSE + new techniques → efficient, extreme-scale models

Summarizing…

• We need larger and smarter models
• We need powerful and flexible hardware to explore new models
• Cerebras has built a special-purpose chip (WSE), system (CS-1), and software
• CS-1 is the most powerful single node: the performance of a cluster with the ease of use of a single device
• WSE is flexible and dynamic: it enables fine-grained sparsity and conditional & dynamic ML techniques
• We are keen to see new AI ideas from researchers that can leverage this new hardware
• If you are a researcher and interested, check out the Neocortex Early User Program: https://www.cmu.edu/psc/aibd/neocortex/early-user-program.html

Thank you

[email protected]