The Bifrost GPU Architecture and the ARM Mali-G71 GPU
Jem Davies, ARM Fellow and VP of Technology
Hot Chips 28, Aug 2016
© ARM 2016

Introduction to ARM Soft IP
- ARM licenses Soft IP cores (amongst other things) to our Silicon Partners
- They then make chips and sell to OEMs, who sell consumer devices
- “ARM doesn’t make chips”…
- We provide all the RTL, integration testbenches, memory lists, reference floorplans, example synthesis scripts, sometimes models, sometimes FPGA images, sometimes with implementation advice, always with memory system requirements/recommendations
- Consequently, silicon area, power, frequencies, performance and benchmark scores can vary quite a bit in real silicon…

ARM Mali: The world’s #1 shipping GPU
- 750M Mali-based GPUs shipped in 2015
- 140 total licenses
- 65 total licensees
- 27 new Mali licenses in FY15
- Mali is in ~75% of DTVs, ~50% of tablets, ~40% of smartphones

Mali graphics based IC shipments (units)
    Year    Shipments
    2011    <50m
    2012    150m
    2013    400m
    2014    550m
    2015    750m

ARM Mali graphics processor generations
- BIFROST: Mali-G71 GPU …
    - Unified shader cores, scalar ISA, clause execution, full coherency, Vulkan, OpenCL
- MIDGARD: Mali-T600, Mali-T700 and Mali-T800 GPU series
    - Unified shader cores, SIMD ISA, OpenGL ES 3.x, OpenCL, Vulkan
    - Presented at HotChips 2015
- UTGARD: Mali-200, Mali-300, Mali-400, Mali-450 and Mali-470 GPUs
    - Separate shader cores, SIMD ISA, OpenGL ES 2.x

Bifrost features
- Leverages Mali’s scalable architecture
    - Scalable to 32 shader cores
- Major shader core redesign
    - New scalar, clause-based ISA
    - New quad-based arithmetic units
- New geometry data flow
    - Reduces memory bandwidth and footprint
- Support for fine grain buffer sharing with the CPU
- Headline figures: scalable to 32 shader cores; 20% higher energy efficiency*; 40% better performance density*; 20% bandwidth improvement*
*Compared to Mali-T880 on the same process node under the same conditions.

Bifrost GPU design
[Figure: GPU block diagram. Driver Software sits above the Job Manager; Shader Core 0, Core 1, Core 2 through Core 31 (up to 32 shader cores supported); GPU Fabric; Tiler; MMU; four L2 Cache Segments, each with its own ACE Memory Bus.]
Shader core improvements
[Figure: the GPU block diagram again. Driver Software, Job Manager, Shader Core 0 through Core 31, GPU Fabric, Tiler, MMU, four L2 Cache Segments with ACE Memory Buses.]

Mali-G71 shader core design
[Figure: shader core block diagram. Compute Frontend and Fragment Frontend, each feeding a Quad Creator; Quad Manager; Execution Engines 0, 1 and 2, each with its own Quad State; Shader Core Fabric; Load/store Unit, Attribute Unit, Varying Unit, Texture Unit; ZS Memory; Blender & Tile Access; Depth & Stencil; Tile Memory; Tile Writeback; three paths to the L2 memory system.]

Quad execution
[Figure: the shader core block diagram from the previous slide, repeated.]

Recap: SIMD vectorization
- Midgard GPUs use SIMD vectorization
    - One thread at a time executes in each pipeline stage
    - Each thread must fill the width of the hardware
- Sensitive to shader code
    - Code always evolving
- Compiler vectorization is not perfect
    - Have to detect combinations of operations which can be merged to fill idle lanes
    - Scalar operations cannot always be merged into vectors

               Lane0   Lane1   Lane2   Lane3
    Cycle 1    T0.x    T0.y    T0.z    Idle
    Cycle 2    T1.x    T1.y    T1.z    Idle
    Cycle 3    T2.x    T2.y    T2.z    Idle
    Cycle 4    T3.x    T3.y    T3.z    Idle

Quad vectorization
- Bifrost uses quad-parallel execution
    - Four scalar threads executed in lockstep in a “quad”
    - One quad at a time executes in each pipeline stage
    - Each thread fills one 32-bit lane of the hardware
    - 4 threads doing a vec3 FP32 add takes 3 cycles
    - Improves utilization
- Quad vectorization is compiler friendly
    - Each thread only sees a stream of scalar operations
    - Vector operations can always be split into scalars

               Lane0   Lane1   Lane2   Lane3
    Cycle 1    T0.x    T1.x    T2.x    T3.x
    Cycle 2    T0.y    T1.y    T2.y    T3.y
    Cycle 3    T0.z    T1.z    T2.z    T3.z

Bifrost execution engine
- Executes quad-parallel scalar operations
    - 4x32-bit multiplier (FMA)
    - 4x32-bit adder (ADD)
    - Adder includes special function unit
    - Smaller and more area efficient
- Simplified layout eases compilation
    - Better scheduling in today’s code
    - Better utilization
- One instruction word contains two instructions
[Figure: execution engine datapath. Main Regs Read, FMA, ADD/SF, Temp Registers, Main Regs Write.]

Bifrost execution engine: Special arithmetic ops
- Special function hardware is smaller than Midgard VLUT equivalent
- Many transcendental functions supported
- Special functions provide building blocks for compiled shader code
    - Part of the built-in function libraries

Bifrost execution engine: Functional units
- Retains support for smaller width data types
    - 2x performance for FP16 useful for pixel shaders

    Contents of one 32-bit lane    Data type
    int8 int8 int8 int8            8-bit integers
    int16 int16                    16-bit integers
    int32                          32-bit integers
    float16 float16                16-bit floating point
    float32                        32-bit floating point

Clause execution
- A group of instructions which executes atomically
- Architectural state visible after clause completion
- Bypass path registers exposed to the compiler
- Non-deterministic instructions on clause boundaries

Bifrost shader clause example

Linear Source

    LOAD.32 r0, [r10]
    FADD.32 r1,r0,r0
    FADD.32 r2,r1,r1
    FADD.32 r3,r2,r2
    FADD.32 r4,r3,r3
    FADD.32 r3,r3,r4
    FADD.32 r0,r3,r3
    STORE.32 r0, [r10]

Basic Clause Compile
- Load is variable length, so clause must be split from use of r0
- Next clause uses r0, so must wait for load to complete

    {
        LOAD.32 r0, [r10]
    }
    wait(load) {
        FADD.32 r1,r0,r0
        FADD.32 r2,r1,r1
        FADD.32 r3,r2,r2
        FADD.32 r4,r3,r3
        FADD.32 r3,r3,r4
        FADD.32 r0,r3,r3
        STORE.32 r0, [r10]
    }

Optimized Clause Compile
- Use temporaries: only 2 register file accesses in 6 cycles
- Store may also stall, so split to help scheduling

    {
        LOAD.32 r0, [r10]
    }
    wait(load) {
        FADD.32 t0,r0,r0
        FADD.32 t1,t0,t0
        FADD.32 t0,t1,t1
        FADD.32 t1,t0,t0
        FADD.32 t0,t1,t1
        FADD.32 r0,t0,t0
    }
    {
        STORE.32 r0, [r10]
    }

Tiler changes
- Bifrost uses the same underlying hierarchical binning design as Midgard
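
Two figures quoted earlier in the deck can be reproduced with a small model: the cycle counts on the vectorization slides (four threads doing a vec3 FP32 add) and the "only 2 register file accesses in 6 cycles" annotation on the optimized clause. The following is a minimal Python sketch, not ARM tooling. The function names (`simd_cycles`, `quad_cycles`, `register_file_accesses`) are hypothetical, and the counting rule is an assumption chosen to match the slide annotation: each distinct main register (`rN`) read by an instruction counts as one access, each main-register destination counts as one access, and the temporaries `t0`/`t1` never touch the main register file.

```python
def simd_cycles(num_threads, vector_width, lanes=4):
    """Midgard-style SIMD model: one thread per cycle, its vector spread over the lanes."""
    assert vector_width <= lanes
    cycles = num_threads
    busy_lane_slots = num_threads * vector_width
    return cycles, busy_lane_slots / (cycles * lanes)


def quad_cycles(num_threads, vector_width, lanes=4):
    """Bifrost-style quad model: `lanes` threads in lockstep, one scalar component per cycle."""
    quads = -(-num_threads // lanes)  # ceiling division
    cycles = quads * vector_width
    busy_lane_slots = num_threads * vector_width
    return cycles, busy_lane_slots / (cycles * lanes)


# Arithmetic instructions copied from the clause example slides.
BASIC = [
    "FADD.32 r1,r0,r0",
    "FADD.32 r2,r1,r1",
    "FADD.32 r3,r2,r2",
    "FADD.32 r4,r3,r3",
    "FADD.32 r3,r3,r4",
    "FADD.32 r0,r3,r3",
]
OPTIMIZED = [
    "FADD.32 t0,r0,r0",
    "FADD.32 t1,t0,t0",
    "FADD.32 t0,t1,t1",
    "FADD.32 t1,t0,t0",
    "FADD.32 t0,t1,t1",
    "FADD.32 r0,t0,t0",
]


def register_file_accesses(instructions):
    """Count main register file accesses under the assumed rule described above."""
    accesses = 0
    for text in instructions:
        _, operands = text.split(None, 1)
        dest, *sources = [op.strip() for op in operands.split(",")]
        # Assumption: reading the same register twice in one instruction is one access.
        accesses += len({s for s in sources if s.startswith("r")})
        # Assumption: writes to t0/t1 stay on the bypass path and are not counted.
        if dest.startswith("r"):
            accesses += 1
    return accesses


print(simd_cycles(4, 3))                  # (4, 0.75): one lane idle every cycle
print(quad_cycles(4, 3))                  # (3, 1.0): all four lanes busy
print(register_file_accesses(BASIC))      # 13
print(register_file_accesses(OPTIMIZED))  # 2
```

Under this model the quad arrangement finishes the four vec3 adds in 3 cycles instead of 4, and the optimized clause touches the main register file twice across its six arithmetic instructions, which is consistent with the numbers on the slides. A different counting rule (for example, counting both `r0` operands of the first `FADD` separately) would give different totals.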