
Deep Dive into Centaur’s New x86 AI Coprocessor (Ncore)
World’s First High-Performance x86 Processor with Integrated AI Coprocessor
Linley Spring Conference 2020, April 8, 2020
Glenn Henry, Chief AI Architect
Dr. Parviz Palangpour, AI Software

• Background
• Motivations
• Constraints
• Architecture
• Software
• Benchmarks
• Conclusion

Demonstrated working silicon for video-analytics edge, Nov. 2019

Centaur Technology Background

• 25-year-old startup in Austin, owned by Via Technologies
• We design, from scratch, low-cost x86 processors
• Everything needed to produce a custom x86 SoC, with ~100 people:
  • Architecture and logic design
  • Design, verification, and layout
  • Physical build, fab interface, and tape-out
• Shipped by IBM, HP, Dell, Samsung, Lenovo…

Genesis of the AI Coprocessor (Ncore)

• Centaur was developing an SoC (CHA) with new x86 cores
  • Targeted at the edge/cloud server market (high-end x86 features)
• Huge inference markets beyond hyperscale cloud: IoT and mobile
  • Video analytics, edge computing, on-premise servers
• However, x86 isn’t efficient at inference
  • High-performance inference requires an external accelerator
  • CHA has 44 PCIe lanes to support GPUs, etc.
  • But that adds cost, power, another point of failure, etc.

Why not integrate a coprocessor?

• Very low cost
  • Many components already on the SoC are “free” to Ncore: caches, memory, clock, power, package, pins, busses, etc.
  • There is often “free” space on complex SoCs due to I/O & pins
• Having many high-performance x86 cores allows flexibility
  • The x86 cores can do some of the work, in parallel
  • We didn’t have to implement all strange/new functions
  • Allows fast prototyping of new things
• For the customer: nothing extra to buy

Coprocessor Challenges
• Have to live with the constraints of the rest of the chip
  • Technology choice, clock speeds, power distribution, memory, etc.
  • Especially the size and aspect ratio of the available “hole,” which may change during development
• There is no good coprocessor architecture for x86
  • We had to invent one. Will it work? Will it last?
• Who is going to work on it?
  • Everyone is already busy working on the x86 cores and the rest of the SoC
  • RTL, physical design, verification, and software must be done with a very small team

Our Original Objectives/Focus
• Target application is edge-server inference
  • Support applications without retraining
• Best performance per total system $
  • For systems with a “significant” level of performance (>1,000 fps on ResNet-50)
• Secondary goal is lowest latency, which is important for edge servers
• Key internal design metric: MAC efficiency with a limited number of MACs
• Will NOT have:
  • Best performance/W (attached to a processor & optimized for performance)
  • Best raw performance (limited by CHA die-size objectives)

CHA Chip Structure
[Die diagram: eight x86 cores with L3 slices on two rings; Southbridge, PCIe I/O functions, and misc I/O (44 PCIe lanes) at top; 4x DDR4 DRAM controllers at bottom]
• Eight x86 cores
  • Similar IPC to Haswell
  • Plus AVX512 instructions
  • Totally new core vs. our previous x86 core
• 16MB L3
• 2x 512-bit ring interconnects
• 4 DDR4 controllers
• Lots of PCIe lanes
• Multi-socket support
• Running at 2.5 GHz

CHA Chip Structure (with Ncore)
[Same die diagram: Ncore sits on the ring alongside the Southbridge/PCIe I/O functions, the eight x86 cores with L3, and the 4x DDR4 DRAM controllers]
• Ncore is attached to the ring
• Thus, it can easily communicate with cores, L3, DRAM & I/O

CHA Die Plot
[Die plot, 194 mm² in TSMC 16nm FFC: Ncore RAM and Ncore compute along one edge; two groups of 4 x86 cores and 16MB L3 in the center; I/O & PCI (44 lanes), interconnect rings, 4x DDR controllers, and dual-socket & debug I/O around the periphery]

Ncore Connectivity
[Block diagram: Ncore on the ring alongside the x86 cores (L1/L2), L3 caches, 4x DDR4 DRAM controllers, and PCI & I/O]
• Two DMA engines (read, write)
  • Ncore issues DMA requests with tags & can wait on tags (see the sketch below)
  • Ring-3 (user) access via virtual addresses given to the device driver at open(); memory space allocated by the device driver; rd/wr controlled by Ncore
• Slave (to-core) interface: load/store 64B/clk to/from x86 core to Ncore RAM; status, control, compute instructions, etc.
  • New 64b x86 streaming insts to speed Ncore read/write (optional; can also use AVX512 mov)
• Interrupts to the cores via IPI or APIC
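The tag-and-wait flow is easy to miss in the diagram, so here is a minimal toy model of the pattern in Python. All names (DmaEngine, read, wait, the address strings) are invented for illustration; the real mechanism is Ncore instructions plus the Linux device driver.

```python
from dataclasses import dataclass, field

@dataclass
class DmaEngine:
    """Toy model: DMA requests carry tags; compute continues until
    the sequencer explicitly waits on a tag."""
    pending: set = field(default_factory=set)

    def read(self, dst: str, src: str, nbytes: int, tag: int) -> None:
        self.pending.add(tag)  # request queued; compute continues
        print(f"DMA tag {tag}: {nbytes}B {src} -> {dst}")

    def wait(self, tag: int) -> None:
        # in hardware this stalls until the tagged transfer completes;
        # here the transfer is modeled as instantaneous
        self.pending.discard(tag)

dma = DmaEngine()
dma.read("W-RAM", "weights_va", 4096, tag=1)  # weights into W-RAM
dma.read("D-RAM", "input_va", 4096, tag=2)    # activations into D-RAM
# ... compute that doesn't need the new data can issue here ...
dma.wait(tag=1)
dma.wait(tag=2)
```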

Ncore Basic Architecture
• Remember: efficient, scalable, and flexible (and low latency)
• A systolic array of MACs was not very attractive
  • Hard to do (fast) non-MAC things
  • Can we achieve high utilization with our limited area?
• Instead, we chose a very wide SIMD architecture organized into vertical “slices”
  • Very efficient with respect to area
  • Easy to subdivide into vertical slices for scalability
  • Allows easy addition of non-MAC functions (like data functions)
  • Very good latency (all accumulators available in 1 clock)
• We know how to make SIMD fast (GHz) and area efficient
  • Run as fast (GHz) as possible, since clock rate directly contributes to performance, but not faster than the cores

The Final Ncore Structure
[Diagram: instruction memory and instruction sequencer feeding 16 vertical slices over shared buses]
• 16 slices, each 256 bytes wide = 4,096-byte-wide logic (“AVX32768”)
• All running at 2.5 GHz → 20 Teraops/sec
• 2,048 RAM “rows” × 4KB/row × 2 RAMs = 16MB of RAM
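The headline throughput falls directly out of the slice geometry. A quick sanity check, assuming one 8-bit MAC per byte lane per clock and counting each MAC as two ops (multiply + accumulate):

```python
# Sanity check of the 20 Teraops/sec figure from the slice geometry.
slices = 16
bytes_per_slice = 256
lanes = slices * bytes_per_slice      # 4,096 byte lanes
clock_hz = 2.5e9
teraops = lanes * 2 * clock_hz / 1e12 # MAC = multiply + accumulate
print(f"{teraops:.1f} Teraops/sec")   # -> 20.5
```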

Ncore Subsystem Architecture
[Block diagram: external buses (each 512 bits) feed the bus control; an instruction ROM and instruction memory feed the instruction decoder & sequencer (zero-clock branches); RAM arbitration and RAM controls front the D-RAM and W-RAM; the 4096-byte-wide pipeline, running at 2.5 GHz, flows through the NDU (Neural Data Unit), the NPU and its accumulators, and the OUT-unit, each with its own controls]

Ncore Subsystem Architecture (cont.)
• Instruction ROM holds 256 insts; instruction memory holds 2x256 insts (banked)
• One instruction controls the entire pipeline
  • Instructions are 128-bit, low-level, and ucode-like
• All instructions execute in one clock (incl. 0-clock branches, loops, call/ret, etc.)
• Each pipeline stage executes in one clock
• Instructions can also control DMA
• Except… some OUT functions are muxed down (pipelined) 8:1, but the pipeline above them can continue to run

The Execution Pipeline: NDU (Neural Data Unit)
[Pipeline diagram: bus subsystem → D-RAM/W-RAM → RAM outputs → NDU → MOV outputs → NPU accumulators → OUT-unit → out data]
• The NDU is attached to the side-slice NDUs such that it can rotate an entire 4,096-byte “row” in 1 clock
• NDU functions have 8 possible inputs & 4 output registers (all 4,096B)
• The NDU performs 3 parallel functions from:
  • Pass input to output reg(s)
  • “Compress” blocks (for pooling)
  • Rotate the entire row (4,096B) by 1, 2, 4, 8, 16, 32, 64, or -1 bytes
  • “Broadcast” bytes (expand weights)
  • “Edge swap”
  • Merge input with an NDU output reg using another NDU reg as a byte mask
  • etc.
• All functions execute in one clock (see the sketch below)
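To make the NDU’s data-movement operations concrete, here is a functional sketch in NumPy. The slide only names the operations, so the rotate direction, the broadcast grouping, and the choice of max for the pooling “compress” are assumptions:

```python
import numpy as np

ROW_BYTES = 4096  # one Ncore "row"

def ndu_rotate(row: np.ndarray, amount: int) -> np.ndarray:
    """Whole-row rotate; hardware does any of these amounts in 1 clock."""
    assert amount in (1, 2, 4, 8, 16, 32, 64, -1)
    return np.roll(row, amount)  # direction is an assumption

def ndu_broadcast(row: np.ndarray, group: int) -> np.ndarray:
    """"Broadcast" bytes (expand weights): repeat each of the first
    ROW_BYTES//group bytes across a group of adjacent lanes."""
    return np.repeat(row[: ROW_BYTES // group], group)

def ndu_compress(row: np.ndarray, block: int) -> np.ndarray:
    """"Compress" blocks (for pooling): reduce each block of lanes to
    one value; max pooling shown, the reduction op is an assumption."""
    return row.reshape(-1, block).max(axis=1)

row = np.arange(ROW_BYTES, dtype=np.uint8)
assert ndu_rotate(row, 4)[4] == row[0]                 # byte 0 moved over 4 lanes
assert ndu_broadcast(row, 64)[:64].tolist() == [0] * 64  # weight 0 fills a group
```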

The Execution Pipeline: NPU
[Same pipeline diagram: D-RAM/W-RAM → RAM outputs → NDU → MOV outputs → NPU accumulators → OUT-unit → out data]
• The NPU supports 9b & 16b integer & bfloat16
  • A 9b integer MAC takes 1 clk; a bfloat16 MAC takes 3 clks
• Functions = MACs, adds, subtracts, min/max, logic, etc., with lots of variations
• Each NPU has 8 predication registers & ops to conditionally update the accumulator
• Saturating 32b accumulator (32b integer or single-precision FP)
• Inputs come from NDU registers & forwarded NPU inputs
• Built-in 8b→9b zero-quantization (subtracts offsets); see the sketch below
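A functional sketch of one predicated NPU MAC step with the built-in 8b→9b zero-quantization. The lane count, argument names, and per-tensor (rather than per-lane) zero points are assumptions:

```python
import numpy as np

def npu_mac(acc, data_u8, weight_u8, data_zp, weight_zp, pred):
    """One predicated MAC step into the saturating 32b accumulator.
    The 8b->9b zero-quantization subtracts the offsets first, so the
    products are computed on signed 9-bit values."""
    d = data_u8.astype(np.int16) - data_zp     # 8b -> 9b signed
    w = weight_u8.astype(np.int16) - weight_zp
    prod = d.astype(np.int64) * w
    new_acc = np.clip(acc.astype(np.int64) + prod, -2**31, 2**31 - 1)
    return np.where(pred, new_acc, acc).astype(np.int32)  # predicated update

acc = np.zeros(8, dtype=np.int32)
data = np.full(8, 200, dtype=np.uint8)
wts = np.full(8, 50, dtype=np.uint8)
pred = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
print(npu_mac(acc, data, wts, data_zp=128, weight_zp=128, pred=pred))
# lanes with pred=0 keep their old accumulator value
```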

The Execution Pipeline: OUT Unit
• The OUT unit performs quantization of the 32b accumulator to 8b, 16b & bfloat16
  • Quantization based on the Google, PyTorch, etc. schemes (sketched below)
• Also does ReLU, tanh & sigmoid
• Results go into 2 “out” registers that can be stored directly to RAM or forwarded to the NDU input
• The full 32b accumulator can also be stored “horizontally” or “vertically”
• All functions execute in 1 or 3 clocks, but some functions are pipelined/muxed 8:1 (≈10 clocks worst case OUT)
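A sketch of the OUT unit’s requantization step in the style of the Google/TFLite scheme the slide cites (real_value = scale × (q − zero_point)); the rounding mode and the fused-ReLU formulation are assumptions:

```python
import numpy as np

def out_quantize_u8(acc32, scale, zero_point, relu=True):
    """Requantize 32b accumulators to uint8: scale, add the output
    zero point, and saturate. With ReLU fused, the lower clamp is the
    zero point itself (real values below 0 are cut off)."""
    q = np.rint(acc32.astype(np.float64) * scale) + zero_point
    lo = zero_point if relu else 0
    return np.clip(q, lo, 255).astype(np.uint8)

acc = np.array([-5616, 0, 12345, 10**6], dtype=np.int32)
print(out_quantize_u8(acc, scale=0.01, zero_point=3))
# -> [  3   3 126 255]: negatives clamp to the zero point (ReLU),
#    large accumulators saturate at 255
```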

• OUTs are relatively rare
• And the pipeline above (more MACs, etc.) can continue to run overlapped with OUT

Ncore Implementation Challenges
• It’s 32,768 bits wide!
  • Connecting data & controls must be fanned out/staged many clocks ahead
  • Multiple clocks & thousands of clock enables
  • The datapath is not always regular (different bytes can have different logic)
  • Etc. → lots of RTL just managing connections
• Critical “loop” stage timing (2.5 GHz)
  • Sequencer: only ~150ps left to do anything (after the inst RAM fetch)
  • The NPU = MAC + lots of other functions, predication, saturation…
  • The NDU = massive wiring and rotate connections
• Large, tightly coupled 4KB-wide RAM (w/ECC)
  • Timing, managing arbitration
  • Getting data to/from 4 write sources and 4 read sinks
• Tight interaction between RTL & the physical build is critical!
  • Also need a really good build person (we have one)

©2020 Centaur Technology. All Rights Reserved Non- NDU RAM

Portion NPU of Ncore & OUT seq I/O processing 2.1K I/O ~11mm2 in 16nm

~1.3M regs & 19.5M gates Ncore Machine Instructions • Instruction = 128b of fields controlling various pieces of pipeline • For instance, a single instruction could have: 30b: control of 2 RAM reads & index operations 22b: branch control 20b: control of NPU 15b: control of NDU write to RAM 26b: control of NDU 15b: misc • Only about ½ – ¾ of instruction bits used on most instructions • But some important functions use more • Several hardware considerations dictate the 128b size • Hardware dependent → will change with new hardware version • Also, understanding of hardware required for efficient programming • Thus, detailed instruction definition will not be made public

• But, nobody needs to know because we’ve got SW tools & a stack…
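Since the real encoding is not public, here is a purely hypothetical sketch of how 128 bits of fields could be packed, using only the field widths from the example above; the names, ordering, and semantics are invented:

```python
# Hypothetical 128-bit instruction packing using the slide's example
# field widths. Field names, order, and semantics are invented; the
# real Ncore encoding is not public.
FIELDS = [                  # (name, width) -- widths sum to 128
    ("ram_read_ctl", 30),   # control of 2 RAM reads & index ops
    ("branch_ctl", 22),
    ("npu_ctl", 20),
    ("ndu_write_ctl", 15),  # NDU write to RAM
    ("ndu_ctl", 26),
    ("misc", 15),
]
assert sum(w for _, w in FIELDS) == 128

def pack(values: dict) -> int:
    """Pack named fields into one 128-bit instruction word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert v < (1 << width), f"{name} overflows {width} bits"
        word |= v << shift
        shift += width
    return word

inst = pack({"npu_ctl": 0x5, "branch_ctl": 0x3})
print(f"{inst:032x}")  # one word drives the whole pipeline for one clock
```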

Ncore Software Stack
[Stack diagram: TFLite flatbuffer → TFLite interpreter → GCL → NKL → loadable → run-time lib → driver → ncore + x86]
• Standard TensorFlow (Lite) application flow
  • In progress: support for more platforms (TF, PyTorch, etc.)
• The current TFLite interpreter uses the delegate interface (usage sketched below)
• GCL: graph-level functions
• NKL: low-level, generates Ncore code; this component knows the machine insts
• Run-time lib: includes some x86-optimized functions (AVX512), the new streaming x86 insts, etc.
  • Significant optimization of the x86 code has been done
• Driver: Linux device driver (we run on Ubuntu); this component knows the coprocessor interface & manages access for user processes
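On the application side, the standard TFLite delegate flow looks like the sketch below; "libncore_delegate.so" and "model.tflite" are placeholder names, since the actual delegate library has not been published:

```python
import tensorflow as tf

# Standard TFLite delegate flow: ops the delegate claims run on the
# accelerator; unsupported ops fall back to the (AVX512-optimized)
# x86 path.
delegate = tf.lite.experimental.load_delegate("libncore_delegate.so")
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",           # the TFLite flatbuffer
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()
interpreter.invoke()
```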

MLPerf Benchmarking
“Best benchmark effort I’ve seen (& I’m a real skeptic)”
• Everyone has participated: ~100 companies & universities
• Defined detailed scenarios with reference implementations
• The “Closed” category must run the exact reference models (hard)
• Minimum accuracy required (hard)
• Multiple results reported: fps, accuracy, latency, etc.
• Results have been audited (in detail)!

First submissions 10/11/19, submitting in the “Closed/Preview” category… 1 month after working silicon!
• Chip vendors: Intel, Nvidia, Qualcomm, Centaur (Preview)
• Cloud services: Google, Alibaba
• System integrators (use Intel): DellEMC, Inspur, Tencent
• Chip startups: Habana Labs, FuriosaAI, Hailo

First results published 11/6/19: https://mlperf.org/inference-results/ (MLPerf name and logo are trademarks.)

MLPerf Results
Submitted Results: Latency (ms)

Model       CHA-NCORE  Intel       NVIDIA      Qualcomm   Intel 2x   Intel 2x
            (Preview)  i3-1005G1   AGX Xavier  855        CLX 9282   NNP-I 1000
                       (Available) (Available) (Available) (Available) (Preview)
MobileNet   0.33       3.55        0.58        3.02       0.49       --
SSD         1.54       6.67        1.50        --         1.40       --
ResNet-50   1.05       13.58       2.04        8.95       1.37       --
GNMT        --         --          --          --         --         --

MLPerf v0.5 Inference Closed Single Stream. Retrieved from www.mlperf.org 27 January 2020, entries 0.5-22, 0.5-23, 0.5-24, 0.5-28, 0.5-29, 0.5-32, 0.5-33. MLPerf name and logo are trademarks.

MLPerf Results
Submitted Results: Throughput (fps)

Model       CHA-NCORE  Intel       NVIDIA      Qualcomm   Intel 2x   Intel 2x
            (Preview)  i3-1005G1   AGX Xavier  855        CLX 9282   NNP-I 1000
                       (Available) (Available) (Available) (Available) (Preview)
MobileNet   6,042      508         6,521       --         29,203     --
SSD         652*       218         2,486       --         9,468      --
ResNet-50   1,218      101         2,159       --         5,966      10,567
GNMT        12         --          --          --         --         --

* The SSD submission was much lower than today’s (≈2,000) due to inadequate time on hardware.

MLPerf v0.5 Inference Closed Offline. Retrieved from www.mlperf.org 27 January 2020, entries 0.5-22, 0.5-23, 0.5-24, 0.5-28, 0.5-29, 0.5-32, 0.5-33. MLPerf name and logo are trademarks.

MLPerf Results Analysis

• Top Intel¹ = 112 cores, 154MB L3, 24 memory channels, UltraPath, etc.
  • Plus, their cores have 2x AVX512 units plus the new VNNI insts (for 8/16b neural)
• 5,966 fps (ResNet-50) / 112 cores ≈ 53 fps/core
  • So, Ncore ≈ 24 of the world’s most powerful cores (1,218/53)
  • Important note: in the Ncore case, the x86 cores are mainly free for other tasks!
• Or, consider MobileNet… 29,203/112 ≈ 260 fps/core (Intel 9282)
  • So, again, Ncore ≈ 24 of the world’s most powerful cores
• Now, consider the 24-core Xeon 6252 part
  • 24 of the world’s most powerful cores (2x AVX512, VNNI)
  • Should be equivalent in DL performance to Ncore (arithmetic sketched below)

[1] MLPerf Inf-0.5-23. Dual Intel® Xeon® Platinum 9282 (112 total cores). Offline/Available category v0.5.
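The per-core comparison is simple arithmetic on the throughput table’s numbers, landing at roughly 23, which the slide rounds up to about 24 Xeon-class cores:

```python
# Per-core comparison using the throughput table's numbers.
xeon_cores = 112                      # dual Xeon Platinum 9282
for model, xeon_fps, ncore_fps in [("ResNet-50", 5966, 1218),
                                   ("MobileNet", 29203, 6042)]:
    per_core = xeon_fps / xeon_cores  # fps per Xeon core
    print(f"{model}: {per_core:.0f} fps/core; "
          f"Ncore ~ {ncore_fps / per_core:.0f} such cores")
# -> ResNet-50: 53 fps/core; Ncore ~ 23 such cores
#    MobileNet: 261 fps/core; Ncore ~ 23 such cores
```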

Conclusions / Observations
• A very small team can do a lot!
  • Centaur has proven this over and over for 25 years
• A coprocessor approach is the way to go
  • Minimizes cost, can off-load some ops to the CPUs, etc.
• A 32,768b-wide SIMD approach works very well!
  • There were lots of doubters when we started
  • The original goal of good density & scalability was achieved
  • In retrospect, we could have gone even wider
• Our original architecture/function “guesses” worked very well
  • Our utilization (“useful MACs per clock”) is very good
• But, now that we have running applications… this design is not as efficient as it could be
  • The RAM subsystem is slightly overkill & not optimally shaped
  • A few additional sequencer/NDU functions would help performance
  • A different slice geometry may be better
  • We could likely fit more performance in the same area

Thank You