The Bifrost GPU architecture and the ARM Mali-G71 GPU
Jem Davies ARM Fellow and VP of Technology
Hot Chips 28 Aug 2016 Introduction to ARM Soft IP
. ARM licenses Soft IP cores (amongst other things) to our Silicon Partners . They then make chips and sell to OEMs, who sell consumer devices . “ARM doesn’t make chips”… . We provide all the RTL, integration testbenches, memories lists, reference floorplans, example synthesis scripts, sometimes models, sometimes FPGA images, sometimes with implementation advice, always with memory system requirements/recommendations . Consequently silicon area, power, frequencies, performance, benchmark scores can therefore vary quite a bit in real silicon…
2 © ARM2016 ARM Mali: The world’s #1 shipping GPU 750M 65 27 Mali-based new Mali GPUs shipped 140 Total licenses in in 2015 Total licenses licensees FY15
Mali is in: Mali graphics based IC shipments (units) 750m
~75% ~50% ~40% 550m of of of 400m DTVs… tablets… smartphones 150m <50m 2011 2012 2013 2014 2015
3 © ARM2016 ARM Mali graphics processor generations
Mali-G71 BIFROST … GPU Unified shader cores, scalar ISA, clause execution, full coherency, Vulkan, OpenCL
MIDGARD Mali-T600 GPU series Mali-T700 GPU series Mali-T800 GPU series
Unified shader cores, SIMD ISA, OpenGL ES 3.x, OpenCL, Vulkan Presented at HotChips 2015
Mali-200 Mali-300 Mali-400 Mali-450 Mali-470 UTGARD GPU GPU GPU GPU GPU Separate shader cores, SIMD ISA, OpenGL ES 2.x
4 © ARM2016 Bifrost features
. Leverages Mali’s scalable architecture . Scalable to 32 shader cores
20% Scalable to 32 Higher energy Shader cores . Major shader core redesign efficiency* . New scalar, clause-based ISA . New quad-based arithmetic units
. New geometry data flow
. Reduces memory bandwidth and footprint 40% 20% Better Bandwidth performance Improvement* density* . Support for fine grain buffer sharing with the
CPU *Compared to Mali-T880 on same
5 © ARM2016 process node under the same conditions.
Bifrost features
. Leverages Mali’s scalable architecture . Scalable to 32 shader cores
20% Scalable to 32 Higher energy Shader cores . Major shader core redesign efficiency* . New scalar, clause-based ISA . New quad-based arithmetic units
. New geometry data flow
. Reduces memory bandwidth and footprint 40% 20% Better Bandwidth performance Improvement* density* . Support for fine grain buffer sharing with the
CPU *Compared to Mali-T880, on same
6 © ARM2016 process node under the same conditions.
Bifrost GPU design
Driver Software Up to 32 shader cores supported
Shader Shader Shader Shader Job Manager Core 0 Core 1 Core 2 Core 31
GPU Fabric
L2 Cache L2 Cache L2 Cache L2 Cache Tiler MMU Segment Segment Segment Segment
ACE Memory Bus ACE Memory Bus ACE Memory Bus ACE Memory Bus
7 © ARM2016 Bifrost features
. Leverages Mali’s scalable architecture . Scalable to 32 shader cores
20% Scalable to 32 Higher energy Shader cores . Major shader core redesign efficiency* . New quad-based arithmetic units . New scalar, clause-based ISA
. New geometry data flow
. Reduces memory bandwidth and footprint 40% 20% Better Bandwidth performance Improvement* density* . Support for fine grain buffer sharing with the
CPU *Compared to Mali-T880, on same
8 © ARM2016 process node under the same conditions.
Shader core improvements
Driver Software
Shader Shader Shader Shader Job Manager Core 0 Core 1 Core 2 Core 31
GPU Fabric
L2 Cache L2 Cache L2 Cache L2 Cache Tiler MMU Segment Segment Segment Segment
ACE Memory Bus ACE Memory Bus ACE Memory Bus ACE Memory Bus
9 © ARM2016 Mali-G71 shader core design
Compute Frontend Fragment Frontend
Quad Creator Quad Creator Execution Execution Execution
Engine 0 Engine 1 Engine 2
Quad Quad Manager Quad State Quad State Quad State Control
Shader Core Fabric ZS Memory
Load/store Attribute Varying Texture Blender & Depth & Unit Unit Unit Unit Tile Access Stencil
Tile Memory To L2 Mem Sys To L2 Mem Sys Tile Writeback
To L2 Mem Sys
10 © ARM2016 Quad execution
Compute Frontend Fragment Frontend
Quad Creator Quad Creator Execution Execution Execution
Engine 0 Engine 1 Engine 2
Quad Manager Quad State Quad State Quad State
Shader Core Fabric ZS Memory
Load/store Attribute Varying Texture Blender & Depth & Unit Unit Unit Unit Tile Access Stencil
Tile Memory To L2 Mem Sys To L2 Mem Sys Tile Writeback
To L2 Mem Sys
11 © ARM2016
Recap: SIMD vectorization
Lane0 Lane1
Lane 2 Lane Lane 3 Lane . Midgard GPUs use SIMD vectorization . One thread at a time executes in each pipeline stage T0.x T0.y T0.z Idle Cycle 1 . Each thread must fill the width of the hardware T1.x T1.y T1.z Idle Cycle 2
T2.x T2.y T2.z Idle Cycle 3 . Sensitive to shader code T3.x T3.y T3.z Idle Cycle 4 . Code always evolving . Compiler vectorization is not perfect . Have to detect combinations of operations which can be merged to fill idle lanes . Scalar operations can not always be merged into vectors
12 © ARM2016
Quad vectorization
Lane0 Lane1
Lane 2 Lane Lane 3 Lane . Bifrost uses quad-parallel execution . Four scalar threads executed in lockstep in a “quad” T0.x T1.x T2.x T3.x Cycle 1 . One quad at a time executes in each pipeline stage . Each thread fills one 32-bit lane of the hardware T0.y T1.y T2.y T3.y Cycle 2 . 4 threads doing a vec3 FP32 add takes 3 cycles . Improves utilization T0.z T1.x T2.z T3.z Cycle 3
. Quad vectorization is compiler friendly . Each thread only sees a stream of scalar operations . Vector operations can always be split into scalars
13 © ARM2016 Bifrost execution engine
. Executes quad-parallel scalar operations . 4x32-bit multiplier FMA Main Regs Read . 4x32-bit adder ADD . Adder includes special function unit
. Smaller and more area efficient FMA
. Simplified layout eases compilation . Better scheduling in today’s code
. Better utilization ADD/SF Registers Temp
. One instruction word contains two instructions
Main Regs Write
14 © ARM2016 Bifrost execution engine: Special arithmetic ops
. Special function hardware is smaller than Main Regs Read Midgard VLUT equivalent . Many transcendental functions supported
. Special functions provide building blocks for compiled
shader code FMA . Part of the built-in function libraries
ADD/SF Registers Temp
Main Regs Write
15 © ARM2016 Bifrost execution engine: Functional units
. Retains support for smaller width data types . 2x performance for FP16 useful for pixel shaders Main Regs Read
int8 int8 int8 int8 8-bit integers FMA int16 int16 16-bit integers int32 32-bit integers
float16 float16 16-bit floating point ADD/SF Registers Temp float32 32-bit floating point
Main Regs Write
16 © ARM2016 Bifrost features
. Leverages Mali’s scalable architecture . Scalable to 32 shader cores
20% Scalable to 32 Higher energy Shader cores . Major shader core redesign efficiency* . New quad-based arithmetic units . New scalar, clause-based ISA
. New geometry data flow
. Reduces memory bandwidth and footprint 40% 20% Better Bandwidth performance Improvement* density* . Support for fine grain buffer sharing with the
CPU *Compared to Mali-T880, on same
17 © ARM2016 process node under the same conditions.
Clause execution
. A group of instructions which executes atomically
. Architectural state visible after clause completion
. Bypass path registers exposed to the compiler
. Non-deterministic instructions on clause boundaries
18 © ARM2016 Bifrost shader clause example
Linear Source
LOAD.32 r0, [r10] FADD.32 r1,r0,r0 FADD.32 r2,r1,r1 FADD.32 r3,r2,r2 FADD.32 r4,r3,r3 FADD.32 r3,r3,r4 FADD.32 r0,r3,r3 STORE.32 r0, [r10]
19 © ARM2016 Bifrost shader clause example
Linear Source Basic Clause Compile Load is variable LOAD.32 r0, [r10] length, so clause { FADD.32 r1,r0,r0 must be split LOAD.32 r0, [r10] FADD.32 r2,r1,r1 from use of r0 } FADD.32 r3,r2,r2 FADD.32 r4,r3,r3 wait(load) { FADD.32 r3,r3,r4 Next clause uses FADD.32 r1,r0,r0 FADD.32 r0,r3,r3 r0, so must wait FADD.32 r2,r1,r1 STORE.32 r0, [r10] for load to FADD.32 r3,r2,r2 complete FADD.32 r4,r3,r3 FADD.32 r3,r3,r4 FADD.32 r0,r3,r3 STORE.32 r0, [r10] }
20 © ARM2016 Bifrost shader clause example
Linear Source Basic Clause Compile Optimized Clause Compile Load is variable LOAD.32 r0, [r10] length, so clause { { FADD.32 r1,r0,r0 must be split LOAD.32 r0, [r10] LOAD.32 r0, [r10] FADD.32 r2,r1,r1 from use of r0 } } FADD.32 r3,r2,r2 FADD.32 r4,r3,r3 wait(load) { wait(load) { FADD.32 r3,r3,r4 Next clause uses FADD.32 r1,r0,r0 FADD.32 t0,r0,r0 FADD.32 r0,r3,r3 r0, so must wait FADD.32 r2,r1,r1 Use temporaries: FADD.32 t1,t0,t0 STORE.32 r0, [r10] for load to FADD.32 r3,r2,r2 only 2 register FADD.32 t0,t1,t1 complete FADD.32 r4,r3,r3 file accesses in 6 FADD.32 t1,t0,t0 FADD.32 r3,r3,r4 cycles FADD.32 t0,t1,t1 FADD.32 r0,r3,r3 FADD.32 r0,t0,t0 STORE.32 r0, [r10] } } Store may also stall, so split to { help scheduling STORE.32 r0, [r10] }
21 © ARM2016 Bifrost features
. Leverages Mali’s scalable architecture . Scalable to 32 shader cores
20% Scalable to 32 Higher energy Shader cores . Major shader core redesign efficiency* . New scalar, clause-based ISA . New quad-based arithmetic units
. New geometry data flow
. Reduces memory bandwidth and footprint 40% 20% Better Bandwidth performance Improvement* density* . Support for fine grain buffer sharing with the
CPU *Compared to Mali-T880, on same
22 © ARM2016 process node under the same conditions.
Tiler changes
. Bifrost uses the same underlying hierarchical binning design as Midgard
. Significantly redesigned tiler memory structures . Minimum buffer allocations eliminated . Buffer allocation granularity now finer . Micro-triangle elimination reduces the number of primitives stored in bin buffers for geometry- dense scenes
. Cumulative effect of all changes is up to 95% reduction in tiler memory footprint
23 © ARM2016 Index-driven position shading
½ x Read/write bandwidth Tiler Position Tiler Varying Fragment [x times of storage size] Assembly Shading Culling Shading Shading
1x 1x ½ x ½ x ½ x
Processing
Memory 3.5x 2.0x 2.5x 1.5x PositionsPositions AttributesAttribs Trans- Polygon Vertex Vertex Indices Positions formed Midgard List Attributes Varyings Positions Bifrost
Bandwidth used relative to 1x memory storage size
24 © ARM2016 Index-driven position shading • Leverages existing coherency flows
• Multiple shader cores write Driver Up to 32 shader cores supported Software transformed positions into a shared memory fifo. Shader Shader Shader Shader Job Manager Core 0 Core 1 Core 2 Core 31 • The fixed function Tiler reads the transformed positions directly GPU Fabric via shared memory reads. No manual flushing required (fifo L2 Cache L2 Cache L2 Cache L2 Cache values are most likely resident in Tiler MMU Segment Segment Segment Segment the L2C, but don’t have to be)
ACE Memory Bus ACE Memory Bus ACE Memory Bus ACE Memory Bus • Once the tiler has read the Bandwidth used 1x relative to memory storage size positions, they are no longer GPU coherency domain needed and may be discarded
25 © ARM2016 Bifrost features
. Leverages Mali’s scalable architecture . Scalable to 32 shader cores
20% Scalable to 32 Higher energy Shader cores . Major shader core redesign efficiency* . New scalar, clause-based ISA . New quad-based arithmetic units
. New geometry data flow
. Reduces memory bandwidth and footprint 40% 20% Better Bandwidth performance Improvement* density* . Support for fine grain buffer sharing with the
CPU *Compared to Mali-T880, on same
26 © ARM2016 process node under the same conditions.
Memory system
Driver Software
Shader Shader Shader Shader Job Manager Core 0 Core 1 Core 2 Core 31
Control Fabric
L2 Cache L2 Cache L2 Cache L2 Cache Tiler MMU Segment Segment Segment Segment
ACE Memory Bus ACE Memory Bus ACE Memory Bus ACE Memory Bus Full coherency using AMBA ACE protocol
27 © ARM2016 Memory system
. Full system coherency support Cortex-A73 Mali-G71 . Supports tightly coupled CPU+GPU use cases CPU GPU
. L2 cache improvements . Single logical L2 cache makes software easier CoreLink CCI-550 . Fewer partial lines written to memory which improves LPDDR4 performance DMC-500 . Supports TrustZone DRAM
28 © ARM2016 Next-generation heterogeneous computing
Workload balancing – . OpenCL 2.0 Introduces Shared Virtual memory overhead reduction 1 Memory
0.8
. Mali-G71 goes one step further with fine -81% -90%
grained buffers 0.6 Initialization and processing
. Significantly eases development and 0.4 Memory
Normalized Normalized runtime operations enables truly heterogeneous use case
0.2
0 Coarse grained Coarse grained Fine grained with with SW with Full Full Coherency Coherency Coherency
29 © ARM2016 Summary
. Leverages Mali’s scalable architecture . Scalable to 32 shader cores
20% Scalable to 32 Higher energy Shader cores . Major shader core redesign efficiency* . New scalar, clause-based ISA . New quad-based arithmetic units
. New geometry data flow
. Reduces memory bandwidth and footprint 40% 20% Better Bandwidth performance Improvement* density* . Support for fine grain buffer sharing with the
CPU *Compared to Mali-T880, on same
30 © ARM2016 process node under the same conditions.
Thank you!
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright © 2016 ARM Limited