1 [Title slide: collage of deep learning accelerator logos (Nvidia V100, Nvidia A100, Cerebras, Groq, Habana, GraphCore, SambaNova, Cambricon)]

2 Deep Learning Accelerator Craze: The Tale of Two Trends

• Fast-growing computation demand: compute requirements double every 3.5 months!
• Slow-down of Moore's law: the number of transistors on a chip doubles only every 24 months.

Source: https://blog.openai.com/ai-and-compute/

3 If we don’t do anything

Wait for 40 years to train 100 times larger models!
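One way to read the 40-year figure, sketched as a back-of-the-envelope calculation in Python (my illustration; the 6-year post-slowdown doubling period is an assumption chosen to match the slide, not a number stated on it):

```python
# Back-of-the-envelope: years until per-chip compute alone grows by 100x.
# Assumption (not on the slide): after the Moore's-law slowdown, chip
# performance doubles only about every 6 years instead of every 2.
import math

target_growth = 100            # want to train ~100x larger models
doubling_period_years = 6      # assumed post-slowdown doubling period

years = math.log2(target_growth) * doubling_period_years
print(f"~{years:.0f} years")   # ~40 years, matching the slide's estimate
```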

4 Alternatively…

• Explore Parallelism
  • Model
  • Data
  • Pipeline
  • Hybrid
• Design Specialized Hardware Accelerators
  • 99+ hardware startup companies

5 Embarrassment of Riches

Parallelism Strategy Exploration (Fixed Hardware):
• HyPar
• FlexFlow
• MeshTensorFlow
• …

Specialized Hardware Accelerators (99+, Parallelism-Agnostic):
• Nvidia V100
• Nvidia A100
• Cerebras
• SambaNova
• Groq
• GraphCore
• Habana
• …

6 Low Utilization at Scale

• 17 billion parameters, 1000 GPUs: 6% efficiency
• 8 billion parameters, 512 GPUs: 20% efficiency

Data source: https://syncedreview.com/2020/02/12/17-billion-parameters-microsoft-deepspeed-breeds-worlds-largest-nlp-model/ and Kunle Olukotun's presentation at ScaledML

7 Analysis Paralysis

8 What do we need?

Application + Hardware Config. + Parallelism Strategy → Magic Box → Execution Time

9 What do we need?

Application + Hardware Config. + Parallelism Strategy → Magic Box → Execution Time
→ Best* Time, Best* Hardware, Best* Parallelism Strategy

10 What do we need?

Application + Design Constraints (Technology Parameters, Power Budget, Area Budget) + Hardware Config. + Parallelism Strategy → Magic Box → Execution Time
→ Best* Time, Best* Hardware, Best* Parallelism Strategy

11 What do we need?

Applications + Design Constraints (Technology Parameters, Power Budget, Area Budget) + Hardware Config. + Parallelism Strategy → Magic Box → Execution Time
→ Best* Time, Best* Hardware, Best* Parallelism Strategy

• Today: Which accelerator meets my need?
• Tomorrow: Which technology is the most promising?

12 MechaFlow: A Software/Hardware/Technology Co-design Space Exploration Framework

13 MechaFlow: Telescopic View

Applications + Design Constraints (Technology Parameters, Power Budget, Area Budget) + Hardware Config. + Parallelism Strategy → MechaFlow (the Magic Box) → Execution Time
→ Best* Time, Best* Hardware, Best* Parallelism Strategy
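The diagram implies a search over the co-design space. Below is a minimal sketch of that loop, assuming an exhaustive sweep; the function and field names (estimate_time, power_w, area_mm2) are hypothetical illustrations, not MechaFlow's actual API:

```python
# Illustrative co-design sweep (hypothetical interface, not MechaFlow's real API):
# enumerate hardware configs and parallelism strategies, drop points that violate
# the power/area budgets, and keep the lowest predicted execution time.
from itertools import product

def explore(app, hw_configs, strategies, power_budget_w, area_budget_mm2, estimate_time):
    """Return (best_time, best_hw, best_strategy) under the design constraints."""
    best = (float("inf"), None, None)
    for hw, strategy in product(hw_configs, strategies):
        if hw["power_w"] > power_budget_w or hw["area_mm2"] > area_budget_mm2:
            continue  # violates a design constraint
        t = estimate_time(app, hw, strategy)  # analytical performance model
        if t < best[0]:
            best = (t, hw, strategy)
    return best
```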

14 Case Studies

1. How much performance gain from co-designing hardware and parallelism strategy?
2. How much performance gain from new upcoming packaging technologies?

15 Methodology

• Language Modeling
  • Word language model (RNN-based LSTM)
  • SOTA: 18 billion parameters
  • Desired: 256 billion parameters (hidden: 19968, layers: 2, vocab: 800K, seq.: 20)
• Parallelism
  • 64-way parallelism
  • Model/kernel parallelism: Row-Column (RC) and Column-Row (CR)
  • Pipeline/layer parallelism
  • Naming: {RC or CR}-k{i}-k{j}-d{k}-l{m}, e.g. RC-k8-k2-d4-l1
• Hardware
  • Baseline: V100
  • Design constraints: 300 W, 1230 mm²/node, 815 mm²/core
• Technology: 14 nm
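To make the strategy naming concrete, here is a small Python sketch that enumerates candidate 64-way strategies. It assumes the two kernel degrees, the data degree, and the layer degree multiply to the total parallelism, which is consistent with the slide's example RC-k8-k2-d4-l1 (8 × 2 × 4 × 1 = 64); the helper names are mine, not part of any released tooling:

```python
# Enumerate 64-way parallelism strategies named {RC or CR}-k{i}-k{j}-d{k}-l{m},
# assuming the kernel (i, j), data (k), and layer (m) degrees multiply to 64.
from itertools import product

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def strategies(total=64):
    names = []
    for scheme in ("RC", "CR"):                      # kernel-parallelism layout
        for i, j, k, m in product(divisors(total), repeat=4):
            if i * j * k * m == total:
                names.append(f"{scheme}-k{i}-k{j}-d{k}-l{m}")
    return names

configs = strategies()
print(len(configs))   # number of candidate 64-way strategies
print(configs[:2])    # ['RC-k1-k1-d1-l64', 'RC-k1-k1-d2-l32']
```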

16 Q1. Co-design Hardware and Parallelism Strategy: How much performance gain?

17 Co-design Parallelism Strategy and HW Design?

[Bar chart: Execution Time/Step (sec.) of the best hardware per parallelism strategy vs. V100]

18 Not so much gain from specialization to parallelism strategy

[Bar chart: Execution Time/Step (sec.) for the best hardware per parallelism strategy, V100, and the best HW for the best parallelism strategy]

19 No Single “Best”

[Charts: hardware parameters of the specialized hardware configurations per parallelism strategy (PPS), relative to V100 and relative to the specialized hardware, and relative speedup across PPS hardware configurations]

20 Q1 Summary

• Observation 1: Not much gain from specializing hardware to each individual parallelism strategy.
• Observation 2: There is no single best hardware; there are many distinct and universally good hardware design configurations.

21 Q2. Technology Trends: Which packaging technology is most promising?

22 To “SiIF” or Not to “SiIF”?

• SiIF: 64 nodes/wafer, 1 wafer
• MCM: 4 nodes/wafer, 16 wafers
• Single: 1 node/wafer, 64 wafers
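The wafer counts follow from packaging 64 nodes in total (matching the 64-way configuration in the methodology); a quick check:

```python
# Wafers needed to package 64 nodes under each packaging technology
# (assumes 64 nodes total, matching the 64-way parallel configuration).
nodes_total = 64
for tech, nodes_per_wafer in {"SiIF": 64, "MCM": 4, "Single": 1}.items():
    wafers = -(-nodes_total // nodes_per_wafer)  # ceiling division
    print(f"{tech}: {wafers} wafer(s)")          # SiIF: 1, MCM: 16, Single: 64
```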

[Line chart: Time/Step (sec.) vs. parallelism strategy (sorted for SiIF) for SiIF, MCM, and Single packaging]

23 Conclusion

Slow-down of Moore's law vs. fast-growing computation demand

[Recap collage of deep learning accelerators: Nvidia V100, Nvidia A100, Cerebras, Groq, Habana, GraphCore, SambaNova, Cambricon]

24 Joel Hestness, Greg Diamos, Kenneth Church, Saptadeep Pal, Puneet Gupta
