TECHNOLOGIES FOR AI

Marc Duranton CEA Tech (Leti and List)

Energy considerations of AI, and perspectives

Marc Duranton, CEA Fellow, 22 September 2020. INS2I seminar: integrated hardware-software systems and architectures for artificial intelligence.

KEY ELEMENTS OF AI

“…as soon as it works, no one calls it AI anymore.” (John McCarthy, who coined the term “Artificial Intelligence” in 1956)
“AI is whatever hasn't been done yet.” (D. Hofstadter, 1980)

• Traditional AI: symbolic, rules… (“classical” approaches)
• Analysis of “big data”: data analytics
• ML-based AI: Bayesian, …, Deep Learning* (*plus One-shot Learning, Generative Adversarial Networks, etc.)

Our focus today: optimization of energy in datacenters and of algorithms, to reduce energy consumption:
- Considerations on state-of-the-art systems (learning + inference)
- Edge computing (inference, federated learning, neuromorphic)

From Greg S. Corrado, Google Brain team co-founder:
– “Traditional AI systems are programmed to be clever.
– Modern ML-based AI systems learn to be clever.”

CONTEXT AND HISTORY: STATE-OF-THE-ART SYSTEMS

2012: DEEP NEURAL NETWORKS RISE AGAIN

They give state-of-the-art performance, e.g. in image classification:
• ImageNet classification (Hinton’s team, hired by Google)
• 14,197,122 images, 1,000 different classes
• Top-5 error rate of 17% in 2012, a huge improvement at the time (now ~3.5%)

“Supervision” network (2012):
• 650,000 neurons
• 60,000,000 parameters
• 630,000,000 synapses

• Facebook’s ‘DeepFace’ program (labs headed by Y. LeCun)
• 4.4 million images, 4,030 identities
• 97.35% accuracy, vs. 97.53% human performance

The 2018 Turing Award recipients are Google VP Geoffrey Hinton, Facebook’s Yann LeCun, and Yoshua Bengio, Scientific Director of the AI research center Mila.

From: Y. Taigman, M. Yang, M.A. Ranzato, “DeepFace: Closing the Gap to Human-Level Performance in Face Verification”


COMPETITION ON IMAGENET!

Name of the algorithm | Date | Error on test set
Supervision | 2012 | 15.3%
Clarifai | 2013 | 11.7%
GoogLeNet | 2014 | 6.66%
Human level (Andrej Karpathy) | - | ~5%
Microsoft | 05/02/2015 | 4.94%
Google | 02/03/2015 | 4.82%
Baidu / Deep Image | 10/05/2015 | 4.58%
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences | 10/12/2015 | 3.57% (the CNN has 152 layers!)
Google Inception-v3 | 2015 | 3.5% (arXiv)
WMW (Momenta) | 2017 | 2.2%

Now?

SPECIALIZATION LEADS TO MORE EFFICIENCY

Type of device | Energy / operation
CPU | 1690 pJ
GPU | 140 pJ
Fixed function | 10 pJ

Source: Bill Dally (NVIDIA), “Challenges for Future Computing Systems”, HiPEAC conference 2015.
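These orders of magnitude convert directly into specialization gains; a quick sketch of the arithmetic (figures from the Dally table above, the code itself is illustrative):

    # Energy per operation, in picojoules (Bill Dally, HiPEAC 2015)
    energy_pj = {"CPU": 1690, "GPU": 140, "fixed function": 10}

    for device, e in energy_pj.items():
        ratio = energy_pj["CPU"] / e
        print(f"{device}: {ratio:.0f}x less energy per operation than a CPU")
    # -> CPU: 1x, GPU: ~12x, fixed function: ~169x

A fixed-function block thus spends roughly 169 times less energy per operation than a general-purpose CPU, which is the whole argument for specialized AI hardware.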

2017: GOOGLE’S CUSTOMIZED HARDWARE…
…required to increase energy efficiency, with accuracy adapted to the use (e.g. float16).

Google’s TPU2: training and inference in a 180 teraflops16 board (over 200 W per TPU2 chip, judging from the size of the heat sink).

DEEP LEARNING AND VOICE RECOGNITION

“The need for TPUs really emerged about six years ago, when we started using computationally expensive deep learning models in more and more places throughout our products. The computational expense of using these models had us worried. If we considered a scenario where people use Google voice search for just three minutes a day and we ran deep neural nets for our system on the processing units we were using, we would have had to double the number of Google data centers!”

[https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html]

2017: GOOGLE’S CUSTOMIZED HARDWARE…
…required to increase energy efficiency, with accuracy adapted to the use (e.g. float16).

Google’s TPU2 pod: 11.5 petaflops16 of number crunching (and guessing about 400+ kW…, 100+ GFlops16/W). Peta = 10^15 (a million billions). From Google.

GOOGLE’S CUSTOMIZED TPU (V3) HARDWARE…


From https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/

2017: THE GAME OF GO

Ke Jie (human world champion of the game of Go), after being defeated by AlphaGo on May 27th, 2017, will work with DeepMind to turn AlphaGo into a tool that helps Go players enhance their game.

ALPHAGO ZERO: SELF-PLAYING TO LEARN

From doi:10.1038/nature24270 (received 07 April 2017)

ALPHAZERO: COMPUTING RESOURCES

The follow-up is called MuZero (2019).

× 5,000 first-generation TPUs ≈ 200 kW*, running for 40 days…

AlphaZero is a generic reinforcement learning algorithm. Peta = 10^15 (a million billions). From Google DeepMind.
* https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

2019: OPENAI WINS DOTA 2 ESPORT GAME

OpenAI Five Defeats World Champions

OpenAI Five is the first AI to beat the world champions in an esports game, having won two back-to-back games versus the world champion Dota 2 team. It is also the first time an AI has beaten esports pros on livestream.

Cooperative mode

OpenAI Five’s ability to play with humans presents a compelling vision for the future of human-AI interaction, one where AI systems collaborate and enhance the human experience. Our testers reported feeling supported by their bot teammates, that they learned from playing alongside these advanced systems, and that it was generally a fun experience overall.

2019: OPENAI WINS DOTA 2 ESPORT GAME

OpenAI Five Defeats Dota 2 World Champions

In total, the current version of OpenAI Five has consumed 800 petaflop/s-days and experienced about 45,000 years of Dota self-play over 10 realtime months (up from about 10,000 years over 1.5 realtime months as of The International), for an average of 250 years of simulated experience per day. The Finals version of OpenAI Five has a 99.9% winrate versus the TI (tournament) version.

Peta = 10^15 (a million billions).

2019: OPENAI WINS DOTA 2 ESPORT GAME

Green500 List, June 2020: OpenAI Five Defeats Dota 2 World Champions

The ~38 MW annotation sets the training figures above against the Green500 list: even at the efficiency of the greenest supercomputers of June 2020 (~21 GFLOPS/W at the top of the list), sustaining 800 petaflop/s would draw roughly 38 MW.

Giga = 10^9, Tera = 10^12, Peta = 10^15 (a million billions).
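A sanity check of those figures (my arithmetic; pairing the 800 petaflop/s rate with the Green500 top efficiency is my reading of the slide):

    # OpenAI Five training compute (OpenAI blog figure)
    total_flops = 800 * 1e15 * 86400     # 800 petaflop/s-days -> ~6.9e22 FLOPs

    # Power needed to sustain 800 Pflop/s at the June 2020 Green500
    # top efficiency (MN-3, ~21.1 GFLOPS/W)
    power_mw = 800e15 / 21.1e9 / 1e6
    print(f"{total_flops:.1e} FLOPs, {power_mw:.0f} MW")   # ~6.9e22 FLOPs, ~38 MW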

EXPONENTIAL INCREASE OF COMPUTING POWER FOR AI TRAINING

“Since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5-month doubling time (by comparison, Moore’s Law had an 18-month doubling period).*”

* https://blog.openai.com/ai-and-compute/
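To see how fast that is, convert the doubling times into yearly growth factors (a worked step, not from the slides):

    \text{growth per year} = 2^{12 / T_{\text{double (months)}}}, \qquad
    2^{12/3.5} \approx 10.8 \quad\text{vs.}\quad 2^{12/18} \approx 1.6

The compute of the largest training runs thus grew roughly 10x per year, against 1.6x per year for Moore’s Law.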

ALGORITHMIC IMPROVEMENT IS A KEY FACTOR DRIVING THE ADVANCE OF AI

“Since 2012, the amount of compute needed to train a neural net to the same performance on ImageNet classification has been decreasing by a factor of 2 every 16 months.*”

* https://openai.com/blog/ai-and-efficiency/
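The same conversion applies here (again my arithmetic): a halving time of 16 months means

    2^{12/16} \approx 1.7

so the compute needed for a fixed result shrinks about 1.7x per year, while the compute spent on the largest runs grows about 10x per year: demand outpaces the efficiency gains.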

2020: OPENAI’S GPT-3, AN AUTOREGRESSIVE LANGUAGE MODEL WITH 175 BILLION PARAMETERS

“GPT-3 shows that language model performance scales as a power-law of model size, dataset size, and the amount of computation. GPT-3 demonstrates that a language model trained on enough data can solve NLP tasks that it has never encountered. That is, GPT-3 studies the model as a general solution for many downstream jobs without fine-tuning. The size of state-of-the-art (SOTA) language models is growing by at least a factor of 10 every year.*” GPT-3 175B is trained with 499 billion tokens.

“The GPT-3 175B model required 3.14×10^23 FLOPs of computing for training (314 trilliard (10^21) FLOPs, i.e. about 87 hours on an exaflop (10^18 FLOPS) machine, or 302 days on Tera-1000). Even at a theoretical 28 TFLOPS for a V100 and the lowest 3-year reserved cloud pricing we could find, this will take 355 GPU-years and cost $4.6M for a single training run. Similarly, a single RTX 8000, assuming 15 TFLOPS, would take 665 years to run.*”

* https://lambdalabs.com/blog/demystifying-gpt-3/
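The GPU-year figures follow directly from the FLOP count; a quick check (numbers taken from the quote above):

    SECONDS_PER_YEAR = 3600 * 24 * 365

    total_flops   = 3.14e23    # GPT-3 175B training compute
    v100_flops    = 28e12      # theoretical V100 throughput
    rtx8000_flops = 15e12      # assumed RTX 8000 throughput

    print(total_flops / v100_flops    / SECONDS_PER_YEAR)   # ~355 GPU-years
    print(total_flops / rtx8000_flops / SECONDS_PER_YEAR)   # ~664 years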

HOUSTON, WE HAVE A PROBLEM…

The problem: IT is projected to challenge the future electricity supply

Exponential power consumption

From “Total Consumer Power Consumption Forecast”, Anders S.G. Andrae, October 2017

PROCESSOR ARCHITECTURE EVOLUTION

Moore’s law slow-down; end of Dennard scaling.

• ~2006: mono-core architecture for single-thread performance (CPU, cache, memory, bus)
• ~2016: many-core architecture for parallelism (generic processing cores, NoC + LLC, close memories, cache-coherent links, HW accelerators, NIC (Network InterConnect))
• ~2026…: heterogeneous architecture for energy efficiency, and disruptive architectures for data management: data-centric, far memory, memory invades logic (In-Memory Computing, Neuromorphic Computing), quantum computing

From Denis Dutoit

OPEN QUESTIONS…

• Will only GAFAM or BAITX be able to train those gigantic systems?
• Use in transfer learning:
  • Once learnt, the system can be tuned for a particular application.
  • This can be done with minimal computing resources (only the last layers need training), as in the sketch below.
  • The energy cost of learning is spread over the many usages and instances in inference, like the NRE for building chips.
• Will we have access to the resulting systems for adaptation?
• Is it possible to optimize those systems by post-processing, so they can be used for inference/transfer learning in embedded systems?
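A minimal sketch of that “only the last layers” idea, using PyTorch and an ImageNet-pretrained backbone (the model choice and class count are illustrative, not from the slides):

    import torch.nn as nn
    import torchvision.models as models

    # Pretrained backbone: the expensive training has already been paid for
    model = models.resnet18(pretrained=True)

    # Freeze every parameter: no gradients, hence almost no training cost
    for p in model.parameters():
        p.requires_grad = False

    # Replace only the final layer for the target application (here, 10 classes)
    model.fc = nn.Linear(model.fc.in_features, 10)
    # From here on, training only updates model.fc.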

EDGE COMPUTING

THE COMPUTING CONTINUUM

Intelligent sensors (pre)process data locally.

From smart sensors, through fog and edge computing, to the Cloud/HPC: the Internet of Things, Cyber-Physical Systems, physical entanglement. Enabling intelligent data processing at the edge: processing, stream analytics, fast data… understanding as soon as possible. Data analytics, cognitive computing, Big Data. New services come from a true collaboration between edge devices and the HPC/cloud ➪ creating a continuum.

ENABLING EDGE INTELLIGENCE: transforming data into information as early as possible.

This initiative currently involves: ETP4HPC, ECSO, BDVA, 5GIA, EUMATHS, CLAIRE, HiPEAC… More to come…

Edge processing, Fog, …

Embedded intelligence needs local high-end computing

The system should be autonomous, so that it makes good decisions in all conditions.

Safety will impose that basic autonomous functions (“Should I brake?”) do not rely on being “always connected” or “always available” (“Transmission error, please retry later”).

And they should not consume most of the power (and space!) of an electric car!

N2D2, NEURAL NETWORK DESIGN & DEPLOYMENT

A unique platform for the design and exploration of DNN applications, with:
+ Easy database access to facilitate the learning phase
+ Automatic code generation on various hardware targets thanks to the export modules
+ Multiple comparison criteria: latency, hardware cost, memory need, power consumption…

Considered criteria: applicative performance metrics, memory requirement, computational complexity.

Software DNN libraries: C/OpenMP (+ custom SIMD), C++/TensorRT (+ CUDA/cuDNN), C++/OpenCL, C/C++ (+ custom API), assembly.
COTS targets: STM (STM32, ASMP), Nvidia (all GPUs), Renesas (R-Car), Kalray (MPPA).
Hardware DNN libraries: C/HLS, custom RTL, for the custom accelerators: ASIC (PNeuro) and FPGA (DNeuro).

Flow: ONNX model and learning & test databases → modeling → learning → test → quantization and optimization (2- to 8-bit integers + rescaling, based on the dataset distribution, with validation of the quantized applicative performance metrics) → data conditioning → cross-code generation → code execution on target.
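A minimal sketch of such post-training quantization with rescaling (my illustration of the principle; N2D2’s actual implementation may differ):

    import numpy as np

    def quantize(weights: np.ndarray, n_bits: int = 8):
        """Map float weights to signed n-bit integers plus a rescaling factor."""
        qmax = 2 ** (n_bits - 1) - 1             # e.g. 127 for 8 bits
        scale = np.abs(weights).max() / qmax     # rescaling from the distribution
        q = np.round(weights / scale).clip(-qmax - 1, qmax).astype(np.int8)
        return q, scale

    w = np.random.randn(64, 64).astype(np.float32)
    q, scale = quantize(w)
    # Inference then uses integer MACs; multiplying by 'scale' restores the
    # dynamic range, and the quantized network is re-validated on the test set.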

EDGE INTELLIGENCE NEEDS LOCAL HIGH-END COMPUTING

Latency: if you need a response in less than 1 ms, the server has to be less than 150 km away. Fog computing, federated learning. From dumb sensors to smart sensors: streaming and distributed data analytics.

Bandwidth (and cost) will require more local processing: embedded intelligence needs local high-end computing. Example: detecting elderly people falling in their home.

Privacy will impose that some processing be done locally and not be sent to the cloud.

With minimum power and wiring!

Detecting elderly people falling in their home

• Ultra-low-power local processing, detecting lying people in a room
• Raw data (before post-processing): standing, crouching, lying

ENERGY-EFFICIENT PROGRAMMABLE H/W ACCELERATOR FOR DEEP LEARNING: PNEURO

Target | Frequency | Performance | Energy efficiency
Quad ARM A7 | 900 MHz | 480 images/s | 380 images/s/W
Quad ARM A15 | 2000 MHz | 870 images/s | 350 images/s/W
Tegra K1 | 850 MHz | 3,550 images/s | 600 images/s/W
PNeuro (FPGA) | 100 MHz | 5,000 images/s | 2,000 images/s/W
PNeuro (ASIC, 28nm FD-SOI) | 500 MHz | 25,000 images/s | 125,000 images/s/W
PNeuro (ASIC) | 1,000 MHz | 50,000 images/s | 125,000 images/s/W

1.8 TOPS/W @ 500 MHz; one cluster of P-Neuro NPU: 0.4 mm², 6 mW. Performance measured on an NPU with 4 clusters (elementary computing units, cluster neuro-controller cores, cluster interconnect).

• Designed for deep neural network processing chains
• Also supports traditional image processing chains (filtering, etc.): no need for an external image processor
• Clustered and modular SIMD architecture:
  − optimized operators for MAC & non-linearity approximation
  − optimized memory accesses, for efficient data transfers to the operators
  − ISA of ~50 instructions (control + computing)
• Can be sized to fit the best area/performance per application
• 571 GMACs/s/W/cluster
• The PNeuro Neural Processing Unit for ASIC @ 1 GHz is 14 times faster and 200 times more energy-efficient than an embedded GPU
• Entirely developed in Europe

PNEURO IN AN ULTRA-LOW-POWER SOC: SAMURAI

SAMURAI ARCHITECTURE OVERVIEW

Block diagram, organized in an Always-Responsive Sub-System and an On-Demand Sub-System: ULV CPU, SRAM, PNeuro + SRAM, GPIO, TP SRAM, RISC-V WuC, interconnect, SPI_m, I2C, UART, synchro, security IPs, WuR, DBB, FLL.

• Always-responsive domain running @ 0.4 V
• Ultra-fast wake-up time, with ultra-low-power modes
• WuR with DBB decoder, to wake up on radio signaling
• 8 KB of ULV TP SRAM
• AI accelerator with embedded 256 KB SRAM
• Embedded security

SAMURAI MAIN FIGURES

• Technology: ST 28nm FDSOI
• Die area: ~4.52 mm² (2564 µm × 1764 µm)
• Memories: 424 KB SRAM (incl. 32 KB with retention mode, 264 KB in PNeuro); 8 KB TP SRAM
• 5 power domains
• 64-pin QFN package
• Performance: 100 MHz @ 0.7 V (220 MHz @ 1.0 V); PNeuro: 6.4 GMAC/s @ 100 MHz; 5,895 image classifications per second @ 100 MHz

Ref: 2020 Symposia on VLSI Technology and Circuits.

SOLVING THE ENERGY CHALLENGE: COST OF MOVING DATA

×800 more! (the energy of fetching data from off-chip memory vs. performing an operation on it)

Avoiding data movement: Computing in/near memory

Source: Bill Dally, “To ExaScale and Beyond”, www.nvidia.com/content/PDF/sc_2010/theater/Dally_SC10.pdf

HOW TO REDUCE ENERGY AND POWER

Example: an image sensor transmitting its pixels to a processor.

Using an energy-efficient technology: FDSOI (Fully Depleted Silicon On Insulator). LETI’s ISSCC 2014 demonstrator: Fclk < 2.6 GHz @ 1.3 V; Fclk < 450 MHz @ 0.39 V; ×10 less power at 0.39 V (at the same frequency).

In a “classical” system with “sequential” transmission, ×2 resolution ➔ ×2 transmission frequency (f). P = C·V²·f, but a lower f allows a lower V.
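The claimed ×10 follows from the dynamic-power formula (a worked step; the voltages are the demonstrator’s):

    P = C V^2 f, \qquad
    \frac{P(0.39\,\text{V})}{P(1.3\,\text{V})} = \left(\frac{0.39}{1.3}\right)^2 = 0.09 \approx \frac{1}{11}

so running at 0.39 V instead of 1.3 V divides the dynamic power by roughly 10 at a given frequency.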

HOW TO REDUCE ENERGY AND POWER

The same image sensor, now feeding n simpler processors running at a lower frequency (each has less to do): P = n·C·V²·(f/n), and the lower f allows a lower V.
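Worked out (my step): at constant voltage the n slower processors consume exactly as much as one fast one,

    P = n \cdot C V^2 \cdot \frac{f}{n} = C V^2 f

but since each core now only needs frequency f/n, the supply voltage can be lowered to V' < V, and the total power drops by the factor (V'/V)^2, the same quadratic gain as on the previous slide.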

IR SENSOR WITH BUILT-IN 1ST-LEVEL IMAGE PROCESSING

• Power/pixel reduced by over 90% compared to the best-in-class low-power IR product (7.03 µW/pixel => 0.54 µW/pixel)
• 13 times less power for the same system functionality, obtained by adding 128 RISC processors
• Dedicated algorithms for presence sensing, localization, and people counting
• Privacy respected: pre-processing happens on the sensor, video is not transmitted

(Figure: raw image vs. transmitted synthetic data)

SMART RETINA USING 3D STACKING

Stack: lens, sensor layer (Back-Side Illumination, L1), processing layer (L2), passive interposer or PCB; direct pixel ➔ processor interconnect, with a low-frequency processor.

Advantages:
➪ energy-efficient system
➪ very fast pixel processing
➪ perfect scalability, no size limitation due to bandwidth
➪ complex processing in the sensor
➪ suitable for privacy-sensitive applications

Each pixel can be processed independently, and heterogeneous integration is possible ➪ opening a new range of algorithms.

SMART RETINA USING 3D STACKING

• Demonstrator “Retine”: L1 @ 130 nm / L2 @ 130 nm; die size: 160 mm²
• Image sensor: 192×256 @ 5,500 fps or 768×1,024 @ 60 fps; 12 µm pixel, 75% fill factor
• 192 processors (3,072 PEs): 72 GOPS, 11.7 MOPS/mW; 1k instructions/pixel @ 1,000 fps; distributed memory; each processor can execute a different code in a set
• ×100 computing power, ×10 energy efficiency, ÷15 processing latency

RETINE: A 3D-STACKED SMART RETINA

Note: the board is only an interface, used to drive an HDMI display and to debug the chip. All processing is done inside the retina.

SPIKE CODING OF DNNs FOR IMPROVED EFFICIENCY

1) Standard CNN topology, offline learning; 2) lossless spike transcoding.

Example network (MNIST digits): input of 24×24 pixels → conv layer, 16 maps of 11×11 neurons (16 kernels of 4×4 synapses) → conv layer, 24 maps of 4×4 neurons (90 kernels of 5×5 synapses) → fully-connected layer of 150 neurons → output layer of 10 neurons; the correct output accumulates over time.

Rate-based input coding: pixel brightness is transcoded into a spiking frequency between fMIN and fMAX.
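A minimal sketch of that rate-based transcoding (my illustration; the fMIN/fMAX values are arbitrary):

    import numpy as np

    def rate_code(brightness, f_min=10.0, f_max=1000.0, t_sim=0.1):
        """Turn pixel brightness in [0, 1] into regular spike trains.

        Each pixel gets a spike frequency interpolated linearly between
        f_min and f_max (Hz); returns the spike times per pixel."""
        freqs = f_min + np.asarray(brightness).ravel() * (f_max - f_min)
        return [np.arange(0.0, t_sim, 1.0 / f) for f in freqs]

    pixels = np.random.rand(24, 24)        # stand-in for a 24x24 input digit
    trains = rate_code(pixels)
    print(len(trains), "trains,", sum(len(t) for t in trains), "spikes")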

“Standard” MAC vs. spiking ACC, energy per operation (estimates for 45 nm, 0.9 V)*:

“Standard” MAC: 16-bit MULT 1.17 pJ + 32-bit ADD 0.1 pJ = 1.27 pJ
Spiking ACC: 8-bit ADD 0.03 pJ (interpolated) = 0.03 pJ
⇒ 42× reduction per operation

The promises of spike-coding NNs:
• Reduced computing complexity ⇒ replace 32-bit MACs with 8-bit ACCs
• Natural spatial and temporal parallelism
• Simple and efficient performance-tunability capabilities
• Spiking NNs best exploit NVMs, as massively parallel synaptic memory

* Mark Horowitz, “Computing’s Energy Problem (and what we can do about it)”, Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International.

SPIKE-CODING RESULTS

• Sparsity: results on the MNIST and GTSRB datasets (the decision threshold ΔS trades accuracy against the number of accumulations, i.e. performance vs. computing-time tunability with approximated computing):

Dataset | Standard (ReLU) | Spiking ΔS | Score | ACCs (spikes/MAC)
MNIST | 99.42%, 126k MACs | 2 | 98.24% | 40.30k (0.32)
MNIST | | 3 | 99.33% | 51.32k (0.41)
MNIST | | 4 | 99.42% | 63.02k (0.50)
GTSRB | 98.42%, 818k MACs | 5 | 98.18% | 1.63M (2.00)
GTSRB | | 10 | 98.38% | 2.30M (2.82)
GTSRB | | 15 | 98.39% | 2.87M (3.51)

Standard DNN vs. spiking DNN:
• Quantization: yes / yes
• Approximate computing: no / yes
• Base operation: multiply-accumulate (MAC) / accumulate only
• Non-linear function: ReLU / threshold (+ refractory period, not required for the ReLU activation function)
• Parallelism: spatial / spatial and temporal
• Memory reutilisation: yes / no

⇒ Best reported* spike recognition rate, with 50% fewer operations and >40× energy efficiency per operation on MNIST!
⇒ Tunable error rate vs. compute time.
⇒ N2D2 is the first public deep learning framework to integrate spiking-DNN transcoding and simulation.

* Previous state of the art: D. Neil, M. Pfeiffer, and S.-C. Liu, “Learning to be efficient: algorithms for training low-latency, low-compute deep spiking neural networks”, ACM SAC, 2016.

SPIKING NEURAL NETWORKS

• Communication: Address-Event Representation (AER)
• Spiking neurons: Leaky Integrate-and-Fire (LIF)
• Learning: Spike-Timing-Dependent Plasticity (STDP)
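A minimal leaky integrate-and-fire model, to make the neuron concrete (a sketch with arbitrary constants, not CEA’s implementation):

    import numpy as np

    def lif(input_current, dt=1e-3, tau=20e-3, v_thresh=1.0, v_reset=0.0):
        """Leaky integrate-and-fire: leak, integrate input, fire on threshold."""
        v, spike_times = 0.0, []
        for step, i_in in enumerate(input_current):
            v += (dt / tau) * (i_in - v)        # leaky integration
            if v >= v_thresh:                   # threshold crossed: emit a spike
                spike_times.append(step * dt)
                v = v_reset                     # reset the membrane potential
        return spike_times

    # A constant supra-threshold input yields a regular spike train
    print(lif(np.full(200, 2.0)))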

NEW ELEMENT: RRAM AS SYNAPSES

Analog computing: using some aspects of physical phenomena to model the problem being solved

• Thermal effect: PCM (GST, GeTe, GST + HfO2)
• Electrochemical effect: CBRAM (Ag/GeS2)
• Electronic effect (oxygen vacancies): OXRAM (TiN/HfO2/Ti/TiN)

M. Suri et al., IEDM 2011; M. Suri et al., IMW 2012 and JAP 2012; O. Bichler et al., IEEE TED 2012; M. Suri et al., EPCOS 2013; D. Garbin et al., IEEE Nano 2013; D. Garbin et al., IEDM 2014; D. Garbin et al., IEEE TED 2015.

MONOLITHIC 3D PRINCIPLE

CMOS/CMOS, 14 nm, vs. 2D: area gain = 55%, performance gain = 23%, power gain = 12%

LETI, DAC 2014

3D INTEGRATION COUPLED WITH RRAM FOR SYNAPTIC WEIGHTS

Introduction: short-term structure.
→ RRAM on the top level, to avoid contamination issues
→ Reuse of existing masks, plus e-beam lithography to build the 1T1R cell (one base e-beam step required for RRAM definition)
→ “Synapses” are integrated in the very fabric of communication (Ti/TiN / HfO2 / Ti/TiN stack)

The RRAM is based on HfO2/Ti/TiN low-temperature materials (~350°C) → no critical problems to integrate on the top level.

Managing complexity

Cognitive solutions for complex computing systems:
• Using AI and optimization techniques for computing systems
• Creating new hardware
• Generating code
• Optimizing systems
• Similar to generative design in mechanical engineering

USING AI FOR MAKING CPS SYSTEMS: THE “GENERATIVE DESIGN” APPROACH

The user only states the desired goals and constraints -> the complexity wall might prevent explaining the solution.

“Autodesk”

Motorcycle swingarm: the piece that hinges the rear wheel to the bike’s frame.

EXAMPLE: DESIGN-SPACE EXPLORATION FOR DESIGNING MULTI-CORE PROCESSORS¹ (2010)

• Ne-XVP project, follow-up of the TriMedia VLIW (https://en.wikipedia.org/wiki/Ne-XVP)
• 1,105,747,200 heterogeneous multicores in the design space
• 2 million years to evaluate all design points
• AI-inspired techniques reduced the exploration time to only a few days

=> x16 performance increase

¹ M. Duranton et al., “Rapid Technology-Aware Design Space Exploration for Embedded Heterogeneous Multiprocessors”, in Processor and System-on-Chip Simulation, Ed. R. Leupers, 2010.

WHAT’S NEXT FOR DEEP LEARNING AND AI?

1st Winter, 1970s: the Lighthill report (published in 1973)

2nd Winter, 1987: Minsky & Papert, Perceptrons (2nd ed.)

3rd Winter, 1993: SVMs, Cortes & Vapnik (building on Vapnik’s 1963 work)

4th Winter or Plateau of Productivity

Or a winter due to energy?

CONCLUSION: WE LIVE IN AN EXCITING TIME!

“The best way to predict the future is to invent it.” Alan Kay
