<<

The Need for New Hardware for

‘DOTA’ Neural network visualization from POPLARTM OUR IPU LETS INNOVATORS CREATE THE NEXT BREAKTHROUGHS IN MACHINE INTELLIGENCE MACHINE INTELLIGENCE COMPUTE IS FAST GROWING

300,000x growth

Source: OpenAI # Parameters (Millions) 1000 1200 1400 1600 1800 200 400 600 800 0 2016 Jan ResNet50 25M MODELS AREMODELS ONLY BIGGER GETTING BERT 330M - Large 2018 Oct 1.55Bn GPT2 2019 Mar COLOSSUS IPU the worlds most complex processor chip with 23.6Bn transistors

- 125TFlops @ 120W

- 1216 independent processor cores

- Complete model held inside processor

- 45TB/s memory bandwidth

- 8TB/s on-chip exchange between cores

- 2.5Tbps chip-to-chip IPU-Links C2 IPU PROCESSOR CARD

2 – COLOSSUS GC2 IPU PROCESSORS CARD-TO-CARD IPU-LINKSTM (2.5TBps) 250 TERA-FLOP MIXED PRECISION IPU COMPUTE @ 300W DELL-EMC IPU SERVER 8x C2 IPU PCIe CARDS | 2PFLOP IPU COMPUTE COMPUTE 2.0 DEVELOPMENT FLOW

COMPUTE 1.0 COMPUTE 2.0

Integrated Design Environment: ML Graph Framework: e.g. Visual Studio/ Eclipse e.g. TensorFlow/ PyTorch

Toolchain: e.g. ICC/ Cuda Graph Toolchain: POPLAR™

Compiler: e.g. GCC/ LLVM Graph Compiler

Debugger: e.g. GDB Graph Engine

Profiler: e.g. VTune Graph Debug / Optimization

Program/ Learn Knowledge from Model Data DeepVoice GRAPH VISUALIZATION FROM POPLAR®

POPLAR® expands the ML Framework output to a full compute graph. THE POPLAR® SDK

High-level graph APIs built on Poplar® native graph

abstraction ™)

Other frames PopART to be ML supported Poplar Graph

With Training RUNTIME ( ADV. POPLAR Control Program POPLAR®

STANDARD FRAMEWORKS

POPLAR® : Graph Toolchain for IPU

Graph Libraries Graph Framework Graph Compiler popnn poputils poplin popops poprandom popsparse

IPU-Processor IPU Servers and PCIe Card System DEVELOPMENT ENVIRONMENT

High Level Abstraction Poplar SDK Native RunTime Lower Level Coding

PopART™ POPLAR® API

/linear algebra • Poplar Advanced RunTime • API for close to the metal • Tailored for efficiency on • Supports ONNX file format development mainstream deep learning algos • Supports both inference and • Anything is possible (e.g. supervised SGD on a neural training • Can create new paradigms net) • High performance and flexible • Also used for custom operators • Has flexibility for different • Can be used as backend to other within other frameworks optimizers/modifications ML frameworks BULK SYNCHRONOUS PARALLEL (BSP) Software bridging model for parallel computing

Compute BSP Sync Exchange

10,000s of compute threads All threads are Data is exchanged so that every thread all operating in parallel synchronized has all the data that it needs for the each with all the data that next phase of Compute they need, held locally

BSP phase.1 BSP phase.2 BSP phase.3 THE IMPORTANCE OF BSP

CPU/GPU

- Optimization is difficult DRAM (Knowledge model held in DRAM) - Timing hazards occur Convolution Max Convolution Max Convolution Max - Hard to modify or change PROCESSOR pooling Layer pooling Layer pooling - Data handling challenging Step.1 Step.2 Step.3 Step.4

BSP Sync between layers IPU

- High performance IPU Convolution Max Convolution Max Convolution Max straight out of the box Layer pooling Layer pooling Layer pooling

- No timing hazards BSP compute phases - Easy to modify makes modifying layers easy

- Whole Knowledge Model Convolution Max Convolution NEW Convolution Max held inside the IPU IPU Layer pooling Layer LAYER Layer pooling POPLAR™ : Graph Toolchain

POPLAR Graph Libraries POPLAR Framework

popnn poputils poplin

popops poprandom popsparse POPLAR® C++ / Python Graph Framework

Graph Toolchain POPLAR POPLAR Visualization & Debug Tools Compute Graph

POPLAR Graph Compiler POPLAR Graph Engine IPU-Processor IPU Servers and PCIe Card System OPEN-SOURCE GRAPH LIBRARIES

> 50 open-source GRAPH FUNCTIONS example GRAPH 32in_32out_Fully_Connected_Layer available including (matmul, conv, etc) built from…

> 750 optimized COMPUTE ELEMENTS such as (ReduceAdd, AddToChannel, Zero, etc) easily create new GRAPH FUNCTIONS using the library of COMPUTE ELEMENTS modify and create new COMPUTE ELEMENTS

share library elements and new innovations POPLAR GRAPH FRAMEWORK

C++ / PYTHON – POPLAR GRAPH FRAMEWORK LETS YOU EASILY MODIFY OR CREATE YOUR OWN GRAPH FUNCTIONS POPLAR® MAPS AND COMPILES GRAPH TO IPUs

POPLAR® GRAPH COMPILER:

Load balances code across IPU-CORES

Allocates data to IN-PROCESSOR-MEMORY Orchestrates data exchanges POPLAR® GRAPH ENGINE:

Executes graph under BSP on IPU or multiple IPUs ResNet50 network visualization from POPLARTM

WE HAVE DEVELOPED NEW HARDWARE AND SOFTWARE THAT LETS INNOVATORS CREATE THE NEXT GENERATION OF MACHINE INTELLIGENCE THANK YOU [email protected] @fleurdevie