<<

Pixel Visual Core: 's Fully Programmable Image, Vision, and AI Processor For Mobile Devices Jason Redgrave, Albert Meixner, Nathan Goulding-Hotta, Artem Vasilyev, and Ofer Shacham

1 Motivation: The Vision

No HDR+ HDR+

2 Software Motivation

● Imaging, Vision, and AI are fast evolving fields ● Productizing new algorithms is difficult ○ ASIC design adds long delay into Research→Product cycle ○ CPU efficiency limits complexity in mobile space ○ Reliance on external vendors limits changes to the stack ● Software wants control, flexibility, and efficiency

3 Hardware Motivation

CPU 10x

Programmable GPU >100 pJ/Op Programmable Image ~1-10 Giga Ops/Sec Processing Unit (IPU) 10-20x <1 pJ/Op 10-20x >1 Tera Ops/Sec

PVC’s IPU Not-programmable ASIC <1 pJ/Op >1 Tera Ops/Sec Energy per Op (pJ/Op)

Performance (Op/sec)

4 Domain-specific Software

● High-level programming model is Halide ○ Domain-specific language for Image Processing ● IPU supports a subset of Halide language ○ No floating point ○ Limits on available memory access patterns ● Halide backend generates kernels and all API calls ○ Proprietary API for resource allocation and execution control

5 Compilation

● Halide backend generates high-level, virtual ISA (vISA) ○ RISC ISA with streaming-friendly memory model ○ Cross-generational & Architecture-independent ● Final compilation into physical ISA (pISA) ○ Compilation can be offline or on-device ○ Generation-specific VLIW ISA ○ All memory movements are explicit (no caches)

6 Conceptual View of Hardware: DAG of Kernels

Camera 2D Stencil Processor

LineBuffer 2D Stencil 2D Stencil LineBuffer DRAM Processor Processor LineBuffer LineBuffer DRAM 2D Stencil

LineBuffer Processor LineBuffer LineBuffer

● Many cores, each fully programmable ● Configurable DAG topology

Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem7 Hardware resources view

From DRAM Bus Ifc Line Buffer 0 Shift Reg Shift Reg

Line Buffer 1 Stencil processor 0

Line Buffer 2 Bus Ifc

Shift Reg Shift Reg

Stencil processor M-1 Line Buffer N

To DRAM

Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem8 Visual Core Architecture MIPI A53 IPU IO Block ● A53 Chip Specs:

IPU IPU ● LPDDR4 TSMC 28nm Core 2 Core 1 6.0 x 7.2 mm ● MIPI 426 MHz IPU IPU ● PCIe 512 MB DRAM LPDDR4 Core 4 Core 3 ● IPU Power <4500 mW IPU IPU PCIe Core 6 Core 5 DRAM IPU SoC IPU IPU Core 8 Core 7

Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat sem9 IPU Architecture

● 8x Cores ○ STP ○ LBP ● I/O ○ DMA ○ MMU ○ SSP (Buffer) ● Ring NoC

Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat10 sem Stencil Processor (STP)

● Scalar lane ○ Instruction RAM ● SHG (Load/Store) ● 2D Array ○ 256 Compute lanes ○ 144 Halo lanes ○ Shared RAMs ○ Shift network

Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat11 sem Compute Lane

● Single cycle! ● Dual 16b ALU ● 16b x 16b Multiply-Add ● Memory ○ Register File ○ Shared RAM (L0)

Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat12 sem Read Neighbor Network

● Output pixel depends on neighboring inputs

● One vertical or horizontal ALU ALU shift per cycle ● Shifts are circular (toroidal) ● 1/2/3/4 hops in each ALU ALU direction

Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat13 sem Line Buffer Pool (LBP)

● Data storage ● Synchronization ● 2D Line Buffer abstraction (2D FIFO)

Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat14 sem Virtually Tall Line Buffer

Reused from ● Memory saving over full last pass img_width buffer fb_rows full width buffer memory (fb_memory)

● Elasticity set by SB width Read Sheet ● FB height set by kernel sb_rows Sliding buffer memory (sb_memory) Used only by the reuse current pass

fb_rows extended fullwidth Will be reused in buffer the next pass sb_cols

Source: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis non erat15 sem Results Performance 5.8x Energy Efficiency 16.7x

2.8x 2.8x 6.9x 7.1x

Align Merge Finish Align Merge Finish

Mobile SoC (10nm) Pixel Visual Core (28nm)

7-16x more energy-efficient despite 3-generation process gap!

16 T.J. Alumbaugh, Ke Bai, Fabrizio Basso, Trevor Bunker, Dinesh Chandrasekaran, Ed Chang, Robert Chapman, Arun Chauhan, Albert Chen, Chien-Yu Chen, Matt Cockrell, Jamison Collins, Joe Dao, Neeti Desai, Dusty DeWeese, Andrea Di Blas, Daniel Finchelstein, Arnd Geis, Filippo Gioachin, Richard Goe, Nathan Goulding-Hotta, Ben Gribstad, Cheng Gu, Penny Gur, Nico Hailey, Ashok Halambi, Adam Hampson, Mahdi Hamzeh, Vince Harron, Sam Hasinoff, Zhijun He, Viresh Hingarh, Sean Howarth, Richard Hsu, John Keen, Asif Khan, Ji Yun Kim, Timothy Knight, Victor Leung, Marc Levoy, Yuan Lin, Joshua Litt, Chenjie Luo, Bill Mark, Albert Meixner, Jon Michelson, Rob Mohr, Michael Moreno, Ben Mossawir, Rolf Mueller, Sean O'Boyle, Sarang Padalkar, Hyunchul Park, David Patterson, Alexander Perez, Karthika Periyathambi, Bob Pflederer, Theodore Popp, Todd Poynor, Jing Pu, Shahriar Rabii, Ramesh Ramarao, Jason Redgrave, Masumi Reynders, Shac Ron, Sabarish Sankaranarayanan, Vaidyanathan Seetharaman, Ofer Shacham, Dillon Sharlet, Sean Silva, Saravana Soundararajan, Don Stark, Eino-Ville Talvala, Pei Zhao Tang, David Tolnay, Michelle Tomasko, Artem Vasilyev, Jaakko Ventela, Nate Voorhies, David Warren, Kevin Watts, Huachang Xu, Catherine Yu, Rong Zhou, Qiuling Zhu

17 Questions?

18