1 V100
Cerebras Groq Habana
GraphCore A100 SambaNova Cambricon
2 Deep Learning Accelerator Craze: The Tale of Two Trends
• Fast-growing computation demand: compute requirements double every 3.5 months!
• Slow-down of Moore’s law: the number of transistors on a chip doubles every 24 months
Source: https://blog.openai.com/ai-and-compute/
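To make the gap between the two trends concrete, here is a minimal sketch that compounds both doubling periods quoted on the slide over the same horizon. The 24-month horizon and the use of transistor count as a stand-in for available compute are assumptions made for illustration, not figures from the talk.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


# Assumption: compare both trends over one 24-month window.
HORIZON_MONTHS = 24.0

# Doubling periods quoted on the slide.
demand = growth_factor(HORIZON_MONTHS, 3.5)   # compute requirements
supply = growth_factor(HORIZON_MONTHS, 24.0)  # transistors on chip

# Ratio of demand growth to transistor growth over the window.
gap = demand / supply

print(f"Demand grows ~{demand:.0f}x, transistors {supply:.0f}x, gap ~{gap:.0f}x")
```

Under these assumptions, demand grows by roughly two orders of magnitude in the time the transistor count doubles once.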
3 If we don’t do anything
Wait for 40 years to train 100 times larger models!
4 Alternatively…
• Explore Parallelism
  • Model
  • Data
  • Pipeline
  • Hybrid
• Design Specialized Hardware Accelerator
  • 99+ Hardware Startup companies
5 Embarrassment of Riches
Parallelism Strategy Exploration (Fixed Hardware)
• HyPar
• FlexFlow
• MeshTensorFlow
• …
Specialized Hardware Accelerator (99+) (Parallelism-Agnostic)
• Nvidia V100
• Nvidia A100
• Cerebras
• SambaNova
• Groq
• GraphCore
• Habana
• …
6 Low Utilization at Scale
• 17 B parameters, 1000 GPUs: 6% efficiency
• 8 B parameters, 512 GPUs: 20% efficiency
Data Source: Microsoft blog (https://syncedreview.com/2020/02/12/17-billion-parameters-microsoft-deepspeed-breeds-worlds-largest-nlp-model/) and Kunle Olukotun’s presentation at ScaledML
7 Analysis Paralysis
8 What do we need?
[Diagram labels: Application; Hardware Config.; Parallelism Strategy; Magic Box; Execution Time]
9 What do we need?
[Diagram labels as on slide 8, plus: Best* Time; Best* Hardware; Best* Parallelism Strategy]
10 What do we need?
[Diagram labels as on slide 9, plus: Design Constraints; Technology Parameters; Power Budget; Area Budget]
11 What do we need?
[Diagram labels as on slide 10, with "Applications" in place of "Application"]
• Today: Which accelerator meets my need?
• Tomorrow: Which technology is the most promising?
12 MechaFlow: A Software/Hardware/Technology Co-design Space Exploration Framework
13 MechaFlow: Telescopic View
[Diagram labels: Applications; Technology Parameters; Power Budget; Area Budget; Hardware Config.; Parallelism Strategy; Magic Box; MechaFlow; Execution Time; Best* Time; Best* Hardware; Best* Parallelism Strategy]
14 Case Studies
1. How much performance gain from co-designing hardware and parallelism strategy?
2. How much performance gain from upcoming packaging technologies?
15 Methodology
• Language Modeling
  • Word-language model (RNN-based LSTM)
  • SOTA: 18 billion parameters
  • Desired: 256 billion parameters (hidden: 19968, layers: 2, vocab: 800K, seq.: 20)
• Parallelism:
  • 64-way parallelism
  • Data parallelism
  • Model/Kernel parallelism: Row-Column (RC) and Column-Row (CR)
  • Pipeline/Layer parallelism
  • {RC or CR}-k{i}-k{j}-d{k}-l{m}: e.g. RC-k8-k2-d4-l1
• Hardware:
  • Baseline: V100
  • Design constraints: 300 W, 1230 mm²/node, 815 mm²/core
  • Technology: 14 nm
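The strategy naming scheme above, {RC or CR}-k{i}-k{j}-d{k}-l{m}, can be unpacked mechanically. The sketch below parses a label and multiplies the four degrees. It assumes that the two `k` fields are kernel-parallel degrees, `d` is the data-parallel degree, `l` is the layer (pipeline) degree, and that their product is the total parallelism (64 for the example label). The `Strategy` class and `parse_strategy` function are hypothetical names for illustration, not part of MechaFlow.

```python
import re
from dataclasses import dataclass

# Matches labels such as "RC-k8-k2-d4-l1" (format taken from the slide).
_LABEL = re.compile(r"^(RC|CR)-k(\d+)-k(\d+)-d(\d+)-l(\d+)$")


@dataclass(frozen=True)
class Strategy:
    kernel_order: str  # "RC" (Row-Column) or "CR" (Column-Row)
    k1: int            # first kernel-parallel degree (assumed meaning)
    k2: int            # second kernel-parallel degree (assumed meaning)
    data: int          # data-parallel degree (assumed meaning)
    layer: int         # layer/pipeline-parallel degree (assumed meaning)

    @property
    def total_degree(self) -> int:
        # Assumption: the overall parallelism is the product of all degrees.
        return self.k1 * self.k2 * self.data * self.layer


def parse_strategy(label: str) -> Strategy:
    """Parse a strategy label; raise ValueError if it does not match."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized strategy label: {label!r}")
    order, k1, k2, d, l = match.groups()
    return Strategy(order, int(k1), int(k2), int(d), int(l))


example = parse_strategy("RC-k8-k2-d4-l1")
print(example, example.total_degree)
```

With this reading, the slide's example RC-k8-k2-d4-l1 multiplies out to 8 × 2 × 4 × 1 = 64, matching the 64-way parallelism listed in the methodology.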
16 Q1. Co-design Hardware and Parallelism Strategy: How much performance gain?
17 Co-design Parallelism Strategy and HW Design?
[Chart: Execution Time/Step (Sec.), y-axis 0 to 12. Series: Best Hardware per Parallelism Strategy; V100]
18 Not so much gain from specialization to parallelism strategy
[Chart: Execution Time/Step (Sec.), y-axis 0 to 15. Series: Best Hardware per Parallelism Strategy; V100; Best HW for Best Parallelism]
19 No Single “Best”
[Figure labels: Relative Parameter (Specialized Hardware w.r.t. V100); Parallelism Strategy; Rel. Speedup; Per Parallelism Strategy (PPS) Specialized Hardware Configurations; Hardware Parameters]
20 Q1 Summary
• Observation 1: Not so much gain from hardware specialization to each parallelism strategy.
• Observation 2: There is no single best hardware; there are many distinct and universally good hardware design configurations.
21 Q2. Technology Trends: Which packaging technology is most promising?
22 To “SiIF” or Not to “SiIF”?
• SiIF: 64 nodes/wafer, 1 wafer
• MCM: 4 nodes/wafer, 16 wafers
• Single: 1 node/wafer, 64 wafers
[Chart: Time/Step (Sec.), y-axis 1.00 to 8.00, vs. Parallelism Strategy (Sorted for SiIF), x-axis 0 to 50. Series: SiIF; Single; MCM]
23 Conclusion
• Slow-down of Moore’s law
• Fast-growing computation demand
V100, A100, Cerebras, Groq, Habana, GraphCore, SambaNova, Cambricon
24 Joel Hestness Greg Diamos Kenneth Church
Saptadeep Pal Puneet Gupta