1 [Title slide: collage of deep learning accelerator logos (Nvidia V100, Nvidia A100, Cerebras, Groq, Habana, GraphCore, SambaNova, Cambricon)]

2 Deep Learning Accelerator Craze: The Tale of Two Trends

• Fast-growing computation demand: compute requirements double every 3.5 months!
• Slow-down of Moore's law: the number of transistors on a chip doubles only every 24 months.

Source: https://blog.openai.com/ai-and-compute/

3 If we don’t do anything

Wait for 40 years to train 100 times larger models!
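One way to read the 40-year figure, sketched as a back-of-the-envelope calculation in Python (my illustration; the 6-year post-slowdown doubling period is an assumption chosen to match the slide, not a number stated on it):

```python
# Back-of-the-envelope: years until per-chip compute alone grows by 100x.
# Assumption (not on the slide): after the Moore's-law slowdown, chip
# performance doubles only about every 6 years instead of every 2.
import math

target_growth = 100            # want to train ~100x larger models
doubling_period_years = 6      # assumed post-slowdown doubling period

years = math.log2(target_growth) * doubling_period_years
print(f"~{years:.0f} years")   # ~40 years, matching the slide's estimate
```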

4 Alternatively…

• Explore Parallelism
  • Model
  • Data
  • Pipeline
  • Hybrid
• Design Specialized Hardware Accelerators
  • 99+ hardware startup companies

5 Embarrassment of Riches

Parallelism Strategy Exploration (Fixed Hardware):
• HyPar
• FlexFlow
• MeshTensorFlow
• …

Specialized Hardware Accelerators (99+, Parallelism-Agnostic):
• Nvidia V100
• Nvidia A100
• Cerebras
• SambaNova
• Groq
• GraphCore
• Habana
• …

6 Low Utilization at Scale

• 17 billion parameters, 1000 GPUs: 6% efficiency
• 8 billion parameters, 512 GPUs: 20% efficiency

Data source: https://syncedreview.com/2020/02/12/17-billion-parameters-microsoft-deepspeed-breeds-worlds-largest-nlp-model/ and Kunle Olukotun's presentation at ScaledML

7 Analysis Paralysis

8 What do we need?

Application + Hardware Config. + Parallelism Strategy → Magic Box → Execution Time

9 What do we need?

Application + Hardware Config. + Parallelism Strategy → Magic Box → Execution Time
→ Best* Time, Best* Hardware, Best* Parallelism Strategy

10 What do we need?

Application + Design Constraints (Technology Parameters, Power Budget, Area Budget) + Hardware Config. + Parallelism Strategy → Magic Box → Execution Time
→ Best* Time, Best* Hardware, Best* Parallelism Strategy

11 What do we need?

Applications + Design Constraints (Technology Parameters, Power Budget, Area Budget) + Hardware Config. + Parallelism Strategy → Magic Box → Execution Time
→ Best* Time, Best* Hardware, Best* Parallelism Strategy

• Today: Which accelerator meets my need?
• Tomorrow: Which technology is the most promising?

12 MechaFlow: A Software/Hardware/Technology Co-design Space Exploration Framework

13 MechaFlow: Telescopic View

Applications + Design Constraints (Technology Parameters, Power Budget, Area Budget) + Hardware Config. + Parallelism Strategy → MechaFlow (the Magic Box) → Execution Time
→ Best* Time, Best* Hardware, Best* Parallelism Strategy
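The diagram implies a search over the co-design space. Below is a minimal sketch of that loop, assuming an exhaustive sweep; the function and field names (estimate_time, power_w, area_mm2) are hypothetical illustrations, not MechaFlow's actual API:

```python
# Illustrative co-design sweep (hypothetical interface, not MechaFlow's real API):
# enumerate hardware configs and parallelism strategies, drop points that violate
# the power/area budgets, and keep the lowest predicted execution time.
from itertools import product

def explore(app, hw_configs, strategies, power_budget_w, area_budget_mm2, estimate_time):
    """Return (best_time, best_hw, best_strategy) under the design constraints."""
    best = (float("inf"), None, None)
    for hw, strategy in product(hw_configs, strategies):
        if hw["power_w"] > power_budget_w or hw["area_mm2"] > area_budget_mm2:
            continue  # violates a design constraint
        t = estimate_time(app, hw, strategy)  # analytical performance model
        if t < best[0]:
            best = (t, hw, strategy)
    return best
```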

14 Case Studies

1. How much performance gain from co-designing hardware and parallelism strategy?
2. How much performance gain from new upcoming packaging technologies?

15 Methodology

• Language Modeling
  • Word language model (RNN-based LSTM)
  • SOTA: 18 billion parameters
  • Desired: 256 billion parameters (hidden: 19968, layers: 2, vocab: 800K, seq.: 20)
• Parallelism
  • 64-way parallelism
  • Model/kernel parallelism: Row-Column (RC) and Column-Row (CR)
  • Pipeline/layer parallelism
  • Naming: {RC or CR}-k{i}-k{j}-d{k}-l{m}, e.g. RC-k8-k2-d4-l1
• Hardware
  • Baseline: V100
  • Design constraints: 300 W, 1230 mm²/node, 815 mm²/core
• Technology: 14 nm
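To make the strategy naming concrete, here is a small Python sketch that enumerates candidate 64-way strategies. It assumes the two kernel degrees, the data degree, and the layer degree multiply to the total parallelism, which is consistent with the slide's example RC-k8-k2-d4-l1 (8 × 2 × 4 × 1 = 64); the helper names are mine, not part of any released tooling:

```python
# Enumerate 64-way parallelism strategies named {RC or CR}-k{i}-k{j}-d{k}-l{m},
# assuming the kernel (i, j), data (k), and layer (m) degrees multiply to 64.
from itertools import product

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def strategies(total=64):
    names = []
    for scheme in ("RC", "CR"):                      # kernel-parallelism layout
        for i, j, k, m in product(divisors(total), repeat=4):
            if i * j * k * m == total:
                names.append(f"{scheme}-k{i}-k{j}-d{k}-l{m}")
    return names

configs = strategies()
print(len(configs))   # number of candidate 64-way strategies
print(configs[:2])    # ['RC-k1-k1-d1-l64', 'RC-k1-k1-d2-l32']
```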

16 Q1. Co-design Hardware and Parallelism Strategy: How much performance gain?

17 Co-design Parallelism Strategy and HW Design?

[Bar chart: Execution Time/Step (sec.) of the best hardware per parallelism strategy vs. V100]

18 Not so much gain from specialization to parallelism strategy

[Bar chart: Execution Time/Step (sec.) for the best hardware per parallelism strategy, V100, and the best HW for the best parallelism strategy]

19 No Single “Best”

[Charts: hardware parameters of the specialized hardware configurations per parallelism strategy (PPS), relative to V100 and relative to the specialized hardware, and relative speedup across PPS hardware configurations]

20 Q1 Summary

• Observation 1: Not much gain from specializing hardware to each individual parallelism strategy.
• Observation 2: There is no single best hardware; there are many distinct and universally good hardware design configurations.

21 Q2. Technology Trends: Which packaging technology is most promising?

22 To “SiIF” or Not to “SiIF”?

• SiIF: 64 nodes/wafer, 1 wafer
• MCM: 4 nodes/wafer, 16 wafers
• Single: 1 node/wafer, 64 wafers
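The wafer counts follow from packaging 64 nodes in total (matching the 64-way configuration in the methodology); a quick check:

```python
# Wafers needed to package 64 nodes under each packaging technology
# (assumes 64 nodes total, matching the 64-way parallel configuration).
nodes_total = 64
for tech, nodes_per_wafer in {"SiIF": 64, "MCM": 4, "Single": 1}.items():
    wafers = -(-nodes_total // nodes_per_wafer)  # ceiling division
    print(f"{tech}: {wafers} wafer(s)")          # SiIF: 1, MCM: 16, Single: 64
```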

[Line chart: Time/Step (sec.) vs. parallelism strategy (sorted for SiIF) for SiIF, MCM, and Single packaging]

23 Conclusion

Slow-down of Moore's law vs. fast-growing computation demand

[Recap collage of deep learning accelerators: Nvidia V100, Nvidia A100, Cerebras, Groq, Habana, GraphCore, SambaNova, Cambricon]

24 Joel Hestness, Greg Diamos, Kenneth Church, Saptadeep Pal, Puneet Gupta
