
Deep Dive into Centaur’s New x86 AI Coprocessor (Ncore)
World’s First High-Performance x86 Processor with Integrated AI Coprocessor
Linley Spring Conference 2020, April 8, 2020
Glenn Henry, Chief AI Architect
Dr. Parviz Palangpour, AI Software

• Background
• Motivations
• Constraints
• Architecture
• Software
• Benchmarks
• Conclusion

Demonstrated working silicon for video-analytics edge, Nov. 2019

Centaur Technology Background

• 25-year-old startup in Austin, owned by Via Technologies
• We design, from scratch, low-cost x86 processors
• Everything needed to produce a custom x86 SoC, with ~100 people:
  • Architecture and logic design
  • Design, verification, and layout
  • Physical build, fab interface, and tape-out
• Shipped by IBM, HP, Dell, Samsung, Lenovo…

Genesis of the AI Coprocessor (Ncore)

• Centaur was developing an SoC (CHA) with new x86 cores
  • Targeted at the edge/cloud server market (high-end x86 features)
• Huge inference markets beyond hyperscale cloud: IoT and mobile
  • Video analytics, edge computing, on-premise servers
• However, x86 isn’t efficient at inference
  • High-performance inference requires an external accelerator
  • CHA has 44 PCIe lanes to support GPUs, etc.
  • But that adds cost, power, another point of failure, etc.

Why not integrate a coprocessor?

• Very low cost
  • Many components already on the SoC are “free” to Ncore: caches, memory, clock, power, package, pins, busses, etc.
  • There is often “free” space on complex SoCs due to I/O & pins
• Having many high-performance x86 cores allows flexibility
  • The x86 cores can do some of the work, in parallel
  • We didn’t have to implement all strange/new functions
  • Allows fast prototyping of new things
• For the customer: nothing extra to buy

Coprocessor Challenges
• Have to live with the constraints of the rest of the chip
  • Technology choice, clock speeds, power distribution, memory, etc.
  • Especially the size and aspect ratio of the available “hole,” which may change during development
• There is no good coprocessor architecture for x86
  • We had to invent one. Will it work? Will it last?
• Who is going to work on it?
  • Everyone is already busy working on the x86 cores and the rest of the SoC
  • RTL, physical design, verification, and software must be done with a very small team

Our Original Objectives/Focus
• Target application is edge-server inference
  • Support applications without retraining
• Best performance per total system $
  • For systems with a “significant” level of performance (>1,000 fps on ResNet-50)
• Secondary goal is lowest latency, which is important for edge servers
• Key internal design metric: MAC efficiency with a limited number of MACs
• Will NOT have:
  • Best performance/W (attached to a processor & optimized for performance)
  • Best raw performance (limited by CHA die-size objectives)

CHA Chip Structure
[Die diagram: eight x86 cores with L3 slices on two rings; Southbridge, PCIe I/O functions, and misc I/O (44 PCIe lanes) at top; 4x DDR4 DRAM controllers at bottom]
• Eight x86 cores
  • Similar IPC to Haswell
  • Plus AVX512 instructions
  • Totally new core vs. our previous x86 core
• 16MB L3
• 2x 512-bit ring interconnects
• 4 DDR4 controllers
• Lots of PCIe lanes
• Multi-socket support
• Running at 2.5 GHz

CHA Chip Structure (with Ncore)
[Same die diagram: Ncore sits on the ring alongside the Southbridge/PCIe I/O functions, the eight x86 cores with L3, and the 4x DDR4 DRAM controllers]
• Ncore is attached to the ring
• Thus, it can easily communicate with cores, L3, DRAM & I/O

CHA Die Plot
[Die plot, 194 mm² in TSMC 16nm FFC: Ncore RAM and Ncore compute along one edge; two groups of 4 x86 cores and 16MB L3 in the center; I/O & PCI (44 lanes), interconnect rings, 4x DDR controllers, and dual-socket & debug I/O around the periphery]

Ncore Connectivity
[Block diagram: Ncore on the ring alongside the x86 cores (L1/L2), L3 caches, 4x DDR4 DRAM controllers, and PCI & I/O]
• Two DMA engines (read, write)
  • Ncore issues DMA requests with tags & can wait on tags (see the sketch below)
  • Ring-3 (user) access via virtual addresses given to the device driver at open(); memory space allocated by the device driver; rd/wr controlled by Ncore
• Slave (to-core) interface: load/store 64B/clk to/from x86 core to Ncore RAM; status, control, compute instructions, etc.
  • New 64b x86 streaming insts to speed Ncore read/write (optional; can also use AVX512 mov)
• Interrupts to the cores via IPI or APIC
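The tag-and-wait flow is easy to miss in the diagram, so here is a minimal toy model of the pattern in Python. All names (DmaEngine, read, wait, the address strings) are invented for illustration; the real mechanism is Ncore instructions plus the Linux device driver.

```python
from dataclasses import dataclass, field

@dataclass
class DmaEngine:
    """Toy model: DMA requests carry tags; compute continues until
    the sequencer explicitly waits on a tag."""
    pending: set = field(default_factory=set)

    def read(self, dst: str, src: str, nbytes: int, tag: int) -> None:
        self.pending.add(tag)  # request queued; compute continues
        print(f"DMA tag {tag}: {nbytes}B {src} -> {dst}")

    def wait(self, tag: int) -> None:
        # in hardware this stalls until the tagged transfer completes;
        # here the transfer is modeled as instantaneous
        self.pending.discard(tag)

dma = DmaEngine()
dma.read("W-RAM", "weights_va", 4096, tag=1)  # weights into W-RAM
dma.read("D-RAM", "input_va", 4096, tag=2)    # activations into D-RAM
# ... compute that doesn't need the new data can issue here ...
dma.wait(tag=1)
dma.wait(tag=2)
```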

Ncore Basic Architecture
• Remember: efficient, scalable, and flexible (and low latency)
• A systolic array of MACs was not very attractive
  • Hard to do (fast) non-MAC things
  • Can we achieve high utilization with our limited area?
• Instead, we chose a very wide SIMD architecture organized into vertical “slices”
  • Very efficient with respect to area
  • Easy to subdivide into vertical slices for scalability
  • Allows easy addition of non-MAC functions (like data functions)
  • Very good latency (all accumulators available in 1 clock)
• We know how to make SIMD fast (GHz) and area efficient
  • Run as fast (GHz) as possible, since clock rate directly contributes to performance, but not faster than the cores

The Final Ncore Structure
[Diagram: instruction memory and instruction sequencer feeding 16 vertical slices over shared buses]
• 16 slices, each 256 bytes wide = 4,096-byte-wide logic (“AVX32768”)
• All running at 2.5 GHz → 20 Teraops/sec
• 2,048 RAM “rows” × 4KB/row × 2 RAMs = 16MB of RAM
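The headline throughput falls directly out of the slice geometry. A quick sanity check, assuming one 8-bit MAC per byte lane per clock and counting each MAC as two ops (multiply + accumulate):

```python
# Sanity check of the 20 Teraops/sec figure from the slice geometry.
slices = 16
bytes_per_slice = 256
lanes = slices * bytes_per_slice      # 4,096 byte lanes
clock_hz = 2.5e9
teraops = lanes * 2 * clock_hz / 1e12 # MAC = multiply + accumulate
print(f"{teraops:.1f} Teraops/sec")   # -> 20.5
```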

Ncore Subsystem Architecture
[Block diagram: external buses (each 512 bits) feed the bus control; an instruction ROM and instruction memory feed the instruction decoder & sequencer (zero-clock branches); RAM arbitration and RAM controls front the D-RAM and W-RAM; the 4096-byte-wide pipeline, running at 2.5 GHz, flows through the NDU (Neural Data Unit), the NPU and its accumulators, and the OUT-unit, each with its own controls]

Ncore Subsystem Architecture (cont.)
• Instruction ROM holds 256 insts; instruction memory holds 2x256 insts (banked)
• One instruction controls the entire pipeline
  • Instructions are 128-bit, low-level, and ucode-like
• All instructions execute in one clock (incl. 0-clock branches, loops, call/ret, etc.)
• Each pipeline stage executes in one clock
• Instructions can also control DMA
• Except… some OUT functions are muxed down (pipelined) 8:1, but the pipeline above them can continue to run

The Execution Pipeline: NDU (Neural Data Unit)
[Pipeline diagram: bus subsystem → D-RAM/W-RAM → RAM outputs → NDU → MOV outputs → NPU accumulators → OUT-unit → out data]
• The NDU is attached to the side-slice NDUs such that it can rotate an entire 4,096-byte “row” in 1 clock
• NDU functions have 8 possible inputs & 4 output registers (all 4,096B)
• The NDU performs 3 parallel functions from:
  • Pass input to output reg(s)
  • “Compress” blocks (for pooling)
  • Rotate the entire row (4,096B) by 1, 2, 4, 8, 16, 32, 64, or -1 bytes
  • “Broadcast” bytes (expand weights)
  • “Edge swap”
  • Merge input with an NDU output reg using another NDU reg as a byte mask
  • etc.
• All functions execute in one clock (see the sketch below)
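To make the NDU’s data-movement operations concrete, here is a functional sketch in NumPy. The slide only names the operations, so the rotate direction, the broadcast grouping, and the choice of max for the pooling “compress” are assumptions:

```python
import numpy as np

ROW_BYTES = 4096  # one Ncore "row"

def ndu_rotate(row: np.ndarray, amount: int) -> np.ndarray:
    """Whole-row rotate; hardware does any of these amounts in 1 clock."""
    assert amount in (1, 2, 4, 8, 16, 32, 64, -1)
    return np.roll(row, amount)  # direction is an assumption

def ndu_broadcast(row: np.ndarray, group: int) -> np.ndarray:
    """"Broadcast" bytes (expand weights): repeat each of the first
    ROW_BYTES//group bytes across a group of adjacent lanes."""
    return np.repeat(row[: ROW_BYTES // group], group)

def ndu_compress(row: np.ndarray, block: int) -> np.ndarray:
    """"Compress" blocks (for pooling): reduce each block of lanes to
    one value; max pooling shown, the reduction op is an assumption."""
    return row.reshape(-1, block).max(axis=1)

row = np.arange(ROW_BYTES, dtype=np.uint8)
assert ndu_rotate(row, 4)[4] == row[0]                 # byte 0 moved over 4 lanes
assert ndu_broadcast(row, 64)[:64].tolist() == [0] * 64  # weight 0 fills a group
```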

The Execution Pipeline: NPU
[Same pipeline diagram: D-RAM/W-RAM → RAM outputs → NDU → MOV outputs → NPU accumulators → OUT-unit → out data]
• The NPU supports 9b & 16b integer & bfloat16
  • A 9b integer MAC takes 1 clk; a bfloat16 MAC takes 3 clks
• Functions = MACs, adds, subtracts, min/max, logic, etc., with lots of variations
• Each NPU has 8 predication registers & ops to conditionally update the accumulator
• Saturating 32b accumulator (32b integer or single-precision FP)
• Inputs come from NDU registers & forwarded NPU inputs
• Built-in 8b→9b zero-quantization (subtracts offsets); see the sketch below
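A functional sketch of one predicated NPU MAC step with the built-in 8b→9b zero-quantization. The lane count, argument names, and per-tensor (rather than per-lane) zero points are assumptions:

```python
import numpy as np

def npu_mac(acc, data_u8, weight_u8, data_zp, weight_zp, pred):
    """One predicated MAC step into the saturating 32b accumulator.
    The 8b->9b zero-quantization subtracts the offsets first, so the
    products are computed on signed 9-bit values."""
    d = data_u8.astype(np.int16) - data_zp     # 8b -> 9b signed
    w = weight_u8.astype(np.int16) - weight_zp
    prod = d.astype(np.int64) * w
    new_acc = np.clip(acc.astype(np.int64) + prod, -2**31, 2**31 - 1)
    return np.where(pred, new_acc, acc).astype(np.int32)  # predicated update

acc = np.zeros(8, dtype=np.int32)
data = np.full(8, 200, dtype=np.uint8)
wts = np.full(8, 50, dtype=np.uint8)
pred = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
print(npu_mac(acc, data, wts, data_zp=128, weight_zp=128, pred=pred))
# lanes with pred=0 keep their old accumulator value
```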

The Execution Pipeline: OUT Unit
• The OUT unit performs quantization of the 32b accumulator to 8b, 16b & bfloat16
  • Quantization based on the Google, PyTorch, etc. schemes (sketched below)
• Also does ReLU, tanh & sigmoid
• Results go into 2 “out” registers that can be stored directly to RAM or forwarded to the NDU input
• The full 32b accumulator can also be stored “horizontally” or “vertically”
• All functions execute in 1 or 3 clocks, but some functions are pipelined/muxed 8:1 (≈10 clocks worst case OUT)
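A sketch of the OUT unit’s requantization step in the style of the Google/TFLite scheme the slide cites (real_value = scale × (q − zero_point)); the rounding mode and the fused-ReLU formulation are assumptions:

```python
import numpy as np

def out_quantize_u8(acc32, scale, zero_point, relu=True):
    """Requantize 32b accumulators to uint8: scale, add the output
    zero point, and saturate. With ReLU fused, the lower clamp is the
    zero point itself (real values below 0 are cut off)."""
    q = np.rint(acc32.astype(np.float64) * scale) + zero_point
    lo = zero_point if relu else 0
    return np.clip(q, lo, 255).astype(np.uint8)

acc = np.array([-5616, 0, 12345, 10**6], dtype=np.int32)
print(out_quantize_u8(acc, scale=0.01, zero_point=3))
# -> [  3   3 126 255]: negatives clamp to the zero point (ReLU),
#    large accumulators saturate at 255
```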

• OUTs are relatively rare
• And the pipeline above (more MACs, etc.) can continue to run overlapped with OUT

Ncore Implementation Challenges
• It’s 32,768 bits wide!
  • Connecting data & controls must be fanned out/staged many clocks ahead
  • Multiple clocks & thousands of clock enables
  • The datapath is not always regular (different bytes can have different logic)
  • Etc. → lots of RTL just managing connections
• Critical “loop” stage timing (2.5 GHz)
  • Sequencer: only ~150ps left to do anything (after the inst RAM fetch)
  • The NPU = MAC + lots of other functions, predication, saturation…
  • The NDU = massive wiring and rotate connections
• Large, tightly coupled 4KB-wide RAM (w/ECC)
  • Timing, managing arbitration
  • Getting data to/from 4 write sources and 4 read sinks
• Tight interaction between RTL & the physical build is critical!
  • Also need a really good build person (we have one)

©2020 Centaur Technology. All Rights Reserved Non- NDU RAM

Portion NPU of Ncore & OUT seq I/O processing 2.1K I/O ~11mm2 in 16nm

~1.3M regs & 19.5M gates Ncore Machine Instructions • Instruction = 128b of fields controlling various pieces of pipeline • For instance, a single instruction could have: 30b: control of 2 RAM reads & index operations 22b: branch control 20b: control of NPU 15b: control of NDU write to RAM 26b: control of NDU 15b: misc • Only about ½ – ¾ of instruction bits used on most instructions • But some important functions use more • Several hardware considerations dictate the 128b size • Hardware dependent → will change with new hardware version • Also, understanding of hardware required for efficient programming • Thus, detailed instruction definition will not be made public

• But, nobody needs to know because we’ve got SW tools & a stack…
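Since the real encoding is not public, here is a purely hypothetical sketch of how 128 bits of fields could be packed, using only the field widths from the example above; the names, ordering, and semantics are invented:

```python
# Hypothetical 128-bit instruction packing using the slide's example
# field widths. Field names, order, and semantics are invented; the
# real Ncore encoding is not public.
FIELDS = [                  # (name, width) -- widths sum to 128
    ("ram_read_ctl", 30),   # control of 2 RAM reads & index ops
    ("branch_ctl", 22),
    ("npu_ctl", 20),
    ("ndu_write_ctl", 15),  # NDU write to RAM
    ("ndu_ctl", 26),
    ("misc", 15),
]
assert sum(w for _, w in FIELDS) == 128

def pack(values: dict) -> int:
    """Pack named fields into one 128-bit instruction word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert v < (1 << width), f"{name} overflows {width} bits"
        word |= v << shift
        shift += width
    return word

inst = pack({"npu_ctl": 0x5, "branch_ctl": 0x3})
print(f"{inst:032x}")  # one word drives the whole pipeline for one clock
```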

Ncore Software Stack
[Stack diagram: TFLite flatbuffer → TFLite interpreter → GCL → NKL → loadable → run-time lib → driver → ncore + x86]
• Standard TensorFlow (Lite) application flow
  • In progress: support for more platforms (TF, PyTorch, etc.)
• The current TFLite interpreter uses the delegate interface (usage sketched below)
• GCL: graph-level functions
• NKL: low-level, generates Ncore code; this component knows the machine insts
• Run-time lib: includes some x86-optimized functions (AVX512), the new streaming x86 insts, etc.
  • Significant optimization of the x86 code has been done
• Driver: Linux device driver (we run on Ubuntu); this component knows the coprocessor interface & manages access for user processes
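On the application side, the standard TFLite delegate flow looks like the sketch below; "libncore_delegate.so" and "model.tflite" are placeholder names, since the actual delegate library has not been published:

```python
import tensorflow as tf

# Standard TFLite delegate flow: ops the delegate claims run on the
# accelerator; unsupported ops fall back to the (AVX512-optimized)
# x86 path.
delegate = tf.lite.experimental.load_delegate("libncore_delegate.so")
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",           # the TFLite flatbuffer
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()
interpreter.invoke()
```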

MLPerf Benchmarking
“Best benchmark effort I’ve seen (& I’m a real skeptic)”
• Everyone has participated: ~100 companies & universities
• Defined detailed scenarios with reference implementations
• The “Closed” category must run the exact reference models (hard)
• Minimum accuracy required (hard)
• Multiple results reported: fps, accuracy, latency, etc.
• Results have been audited (in detail)!

First submissions 10/11/19, submitting in the “Closed/Preview” category… 1 month after working silicon!
• Chip vendors: Intel, Nvidia, Qualcomm, Centaur (Preview)
• Cloud services: Google, Alibaba
• System integrators (use Intel): DellEMC, Inspur, Tencent
• Chip startups: Habana Labs, FuriosaAI, Hailo

First results published 11/6/19: https://mlperf.org/inference-results/ (MLPerf name and logo are trademarks.)

MLPerf Results
Submitted Results: Latency (ms)

Model       CHA-NCORE  Intel       NVIDIA      Qualcomm   Intel 2x   Intel 2x
            (Preview)  i3-1005G1   AGX Xavier  855        CLX 9282   NNP-I 1000
                       (Available) (Available) (Available) (Available) (Preview)
MobileNet   0.33       3.55        0.58        3.02       0.49       --
SSD         1.54       6.67        1.50        --         1.40       --
ResNet-50   1.05       13.58       2.04        8.95       1.37       --
GNMT        --         --          --          --         --         --

MLPerf v0.5 Inference Closed Single Stream. Retrieved from www.mlperf.org 27 January 2020, entries 0.5-22, 0.5-23, 0.5-24, 0.5-28, 0.5-29, 0.5-32, 0.5-33. MLPerf name and logo are trademarks.

MLPerf Results
Submitted Results: Throughput (fps)

Model       CHA-NCORE  Intel       NVIDIA      Qualcomm   Intel 2x   Intel 2x
            (Preview)  i3-1005G1   AGX Xavier  855        CLX 9282   NNP-I 1000
                       (Available) (Available) (Available) (Available) (Preview)
MobileNet   6,042      508         6,521       --         29,203     --
SSD         652*       218         2,486       --         9,468      --
ResNet-50   1,218      101         2,159       --         5,966      10,567
GNMT        12         --          --          --         --         --

* The SSD submission was much lower than today’s (≈2,000) due to inadequate time on hardware.

MLPerf v0.5 Inference Closed Offline. Retrieved from www.mlperf.org 27 January 2020, entries 0.5-22, 0.5-23, 0.5-24, 0.5-28, 0.5-29, 0.5-32, 0.5-33. MLPerf name and logo are trademarks.

MLPerf Results Analysis

• Top Intel¹ = 112 cores, 154MB L3, 24 memory channels, UltraPath, etc.
  • Plus, their cores have 2x AVX512 units plus the new VNNI insts (for 8/16b neural)
• 5,966 fps (ResNet-50) / 112 cores ≈ 53 fps/core
  • So, Ncore ≈ 24 of the world’s most powerful cores (1,218/53)
  • Important note: in the Ncore case, the x86 cores are mainly free for other tasks!
• Or, consider MobileNet… 29,203/112 ≈ 260 fps/core (Intel 9282)
  • So, again, Ncore ≈ 24 of the world’s most powerful cores
• Now, consider the 24-core Xeon 6252 part
  • 24 of the world’s most powerful cores (2x AVX512, VNNI)
  • Should be equivalent in DL performance to Ncore (arithmetic sketched below)

[1] MLPerf Inf-0.5-23. Dual Intel® Xeon® Platinum 9282 (112 total cores). Offline/Available category v0.5.
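The per-core comparison is simple arithmetic on the throughput table’s numbers, landing at roughly 23, which the slide rounds up to about 24 Xeon-class cores:

```python
# Per-core comparison using the throughput table's numbers.
xeon_cores = 112                      # dual Xeon Platinum 9282
for model, xeon_fps, ncore_fps in [("ResNet-50", 5966, 1218),
                                   ("MobileNet", 29203, 6042)]:
    per_core = xeon_fps / xeon_cores  # fps per Xeon core
    print(f"{model}: {per_core:.0f} fps/core; "
          f"Ncore ~ {ncore_fps / per_core:.0f} such cores")
# -> ResNet-50: 53 fps/core; Ncore ~ 23 such cores
#    MobileNet: 261 fps/core; Ncore ~ 23 such cores
```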

Conclusions / Observations
• A very small team can do a lot!
  • Centaur has proven this over and over for 25 years
• A coprocessor approach is the way to go
  • Minimizes cost, can off-load some ops to the CPUs, etc.
• A 32,768b-wide SIMD approach works very well!
  • There were lots of doubters when we started
  • The original goal of good density & scalability was achieved
  • In retrospect, we could have gone even wider
• Our original architecture/function “guesses” worked very well
  • Our utilization (“useful MACs per clock”) is very good
• But, now that we have running applications… this design is not as efficient as it could be
  • The RAM subsystem is slightly overkill & not optimally shaped
  • A few additional sequencer/NDU functions would help performance
  • A different slice geometry may be better
  • We could likely fit more performance in the same area

Thank You