Title 44pt sentence case ARM Mali GPU Midgard Architecture
Mathias Palmqvist Affiliations 24pt sentence case Senior Software Engineer/ARM MPG
20pt sentence case Lund University, Faculty of Engineering LTH December 6th 2016
© ARM 2016 Title 40pt sentence case Agenda
Bullets 24pt sentence case . Midgard Architecture Overview . Power Consumption & Bandwidth Reduction Sub-bullets 20pt sentence case . Device Driver Work building . Power Consumption . Forward Pixel Kill . GPU Software Interface . Transaction Elimination . GPU Hardware Jobs . ARM Framebuffer Compression . Job Manager . Pixel local storage . Unified Shader Core . Shader Core: Vertex Work . ARM Lund . Tiling Unit . ARM Lund GPU teams . Shader Core: Fragment Work . GPU HW/SW development . Shader Core: Tripipe . Party! . Student Oppurtunities
2 © ARM 2016
Title 40pt sentence case Mathias Palmqvist
Bullets 24pt sentence case Work Interests Sub-bullets 20pt sentence case . Senior Software Engineer, ARM ~5 years • Game Development(often non-graphics related) GPU driver, EGL on Android/X11, Multimedia. • Real-time techniques with mobile focus
. Previously at Sony Ericsson 7 years Personal Android Graphics Integration. • Married to California girl from UCSD since 13 Education years • Two children, 5 and 2. Aiden and Avery. . LTH, M.S.EE (e98). Fourth year at U.C. San Diego.
Hardware and graphics main focus.
3 © ARM 2016
Text 54pt sentence case Midgard GPU Architecture
4 © ARM 2016 Title 40pt sentence case Acrynoms
Bullets 24pt sentence case AMBA Advanced Microcontroller Bus Architecture Sub-bullets 20pt sentence case AXI AMBA Advanced eXtensible Interface APB AMBA Advanced Peripherial Bus ACE AMBA AXI Coherency Extensions GPU Graphics Processing Unit VPU Video Processing Unit DPU Display Processing Unit ISA Instruction Set Architecture SIMD Single Instruction Multiple Data
5 © ARM 2016 Title 40pt sentence case Midgard Architecture Overview
Bullets 24pt sentence case . Tile-based, Unified Shading Architecture Sub-bullets 20pt sentence case . Latest Midgard GPU is T880 . Maximum of 16 shader cores . Tile size 16x16 (4x4-32x32 internally) . GLES 3.2, Vulkan 1.0, CL 1.2, DX FL11_2 . MSAA 4x, 8x,16x . T880 in Samsung GS7 and Huawei P9
. First product T604 in 2011q4 . SoC: Samsung Exynos 5250 . Products: Nexus10, Samsung Chromebook
6 © ARM 2016 Title 40pt sentence case GPU top level
Bullets 24pt sentence case Interconnect Sub-bullets 20pt sentence case Shader Shader Shader core Job corecore MMU Tiling Unit Manager
Interconnect
L2 Cache #0 L2 Cache #1
Driver External Shared Memory
7 © ARM 2016 Title 40pt sentence case Simplified Application-Driver Interaction
Bullets 24pt sentence case Sub-bullets 20pt sentence case App Frame 0 Frame 1 Frame 2 Frame 3
Flushed Flushed Flushed Flushed
Driver F0 F1 F2 F3
Scheduled Scheduled Scheduled
GPU F0 F1 F2
Post Post Post
DPU F0 F1
8 © ARM 2016 Title 40pt sentence case Driver building GPU resource structures
App Driver Tex Structure Shader1 Structure Bullets 24pt sentence case glTexImageX() Build Texture structure Format, Dimensions ProgramPointer Sub-bullets 20pt sentence case Data Pointer
Buffer Structure glShaderX() Build Shader structure Format, Dimensions Data Pointer VertexJob1 Structure VertexJob2 glBufferX() Build Vertex/Index BufferStructure Pointer Index Pointer buffer structures Shader2 Structure Buffer Pointer Shader Pointer ProgramPointer Index Pointer glDraw() Build Vertex Job Structure Shader Pointer glShaderX() Link with vertex/index/shader glDraw() structures FragmentJob ... Structure eglSwapBuffers() Build Fragment Job Structure FrameBuffer Pointer Link with FB and last Vertex Job Last Job Pointer
9 © ARM 2016 Title 40pt sentence case GPU hardware job types
Vertex Job(V) Vertex shader running on a set of vertices Bullets 24pt sentence case Sub-bullets 20pt sentence case Tiler Job(T) Tiling Unit (Fixed Function) work to split transformed primitives into affected tiles
Fragment Job(F) Single render target job running over all tiles
Job Chain A chain of jobs
Job Chain 1 Job Chain 2
V V V V
T T T T F
10 © ARM 2016 Title 40pt sentence case GPU job progression
Bullets 24pt sentence case Sub-bullets 20pt sentence case
11 © ARM 2016 Title 40pt sentence case GPU Software Interface
CPU GPU Bullets 24pt sentence case Job Sub-bullets 20pt sentence case Kernel User A Manager MMU P Power B Interrup Submit tJobs chains
AXI
External Shared Memory
AXI VertexJob1 Structure TilingJob1 Structure FragmentJob Structure VertexJob2 Structure TilingJob2 Structure
12 © ARM 2016 Title 40pt sentence case Job Manager
VertexJob (96 vertices) TilingJob FragmentJob (Full Frame) Bullets 24pt sentence case Sub-bullets 20pt sentence case Job Manager
VTask1(32) VTask2(32) VTaskN(32) TTask1 FTask1(Tile1) FTask2(Tile2)
VTaskN TTaskN FTask3
VTask2 TTask2 FTask2 Shader Core VTask1 TTask1 FTask1 Shader Core Shader Core Shader Core Shader Core Tiling Unit Shader Core
External Shared Memory
13 © ARM 2016 Title 40pt sentence case Unified Shader Core
Vertex Frontend Fragment Frontend External Shared Memory Bullets 24pt sentence case Sub-bullets 20pt sentence case Tiler Data Tripipe
Attributes/Uniforms in Varyings out
Textures
Framebuffer
Tile Tile Buffer TileBuffer Fragment Backend Buffer
14 © ARM 2016 Title 40pt sentence case Shader Core: Vertex Work
Vertex Thread Creator . Vertex Threads do not write to tile buffers but Bullets 24pt sentence case directly to main memory Sub-bullets 20pt sentence case Thread 1 Thread 2 . Vertex Tasks contain multiple of 4 vertices
Thread 3 Thread 4
External Shared Memory
Attributes/Uniforms in Varyings out
15 © ARM 2016 Title 40pt sentence case Tiling Unit Tiler’s goal Hierarchy Level 0 - 16x16 tile bins
Bullets 24pt sentence case Hierarchy Level 1 - 32x32 tile bins Find out what tiles are covered by a Sub-bullets 20pt sentence case Hierarchy Level 2 - 64x64 tile bins primtive. Update tile structure with that info. 1 2 3 4 5
Assumptions 6 7 8 9 10 Small primitive -> Affect few tiles -> Use low hierarchy
11 12 13 14 15 level -> Save read bandwidth
Large primitive -> Affect many tiles -> Use high 16 17 18 19 20 hierarchy level -> Save write bandwidth
21 22 23 24 25 Heuristic approach
Determine what is best given distribution
16 © ARM 2016 Title 40pt sentence case Shader Core: Fragment Work
Fragment Front-End Fragment Back-End Bullets 24pt sentence case Polygon List Reader Sub-bullets 20pt sentence case
Depth-bounds & Hierarchical ZS Test
Tile Writeback Rasterizer
Tile Memory Early ZS Testing
FPK Queue (128 Quads) Blending
Fragment Thread Creator Tripipe Late ZS Testing
17 © ARM 2016 Title 40pt sentence case Shader Core: Tripipe details
Bullets 24pt sentence case . SIMD processing core that executes Sub-bullets 20pt sentence case instructions from the Midgard ISA
. 3 types of pipes, but 5 total . 3 Arithmetic . 1 Load/Store . 1 Texture
. 128-bit operands, float or integer . 2xFP64, 4xFP32, 8xFP16
. Up to 256 active threads at the same time
18 © ARM 2016 Power Consumption Text 54pt sentence case and Bandwidth Reduction
19 © ARM 2016 Title 40pt sentence case Power Consumption Approximate GPU power Bullets 24pt sentence case . Power comparision consumption distribution Sub-bullets 20pt sentence case . Phone SoC 1-3W TexturePipe . Tablet ~10W ArithmeticPipes . Desktop Gfx card ~100-200W Other
. Constraints . Battery life time & heat dissapation . Static and Dynamic power consumption . Static power. Related to technology and area. Improved through power management(gating) . Dynamic power. Heavily tied to use case. Responsability of designer.
. Performance Density FPS/mm^2. Smaller Area and less dynamic power -> possibly more cores
20 © ARM 2016 Title 40pt sentence case Reduce Overdraw: Forward Pixel Kill
Bullets 24pt sentence case Sub-bullets 20pt sentence case Insert small FIFO for 2x2 quads after EarlyZ but before entering tripipe.
Reject quads ”in front” before they reach expensive tripipe execution
21 © ARM 2016 Title 40pt sentence case Transaction Elimination
. Signature/Hash calculated of color tilebuffer before write-out and saved. Bullets 24pt sentence case Sub-bullets 20pt sentence case . If it matches previous signature, write-out is skipped saving bandwidth.
22 © ARM 2016 Title 40pt sentence case ARM Framebuffer Compression/AFBC
. Real-time, lossless, small-area framebuffer compression technology Bullets 24pt sentence case Sub-bullets 20pt sentence case . Used on both external(window) and internal(FBO) buffers for GPU . Formats: RGB/YUV (including 10-bit), depth/stencil
23 © ARM 2016 Title 40pt sentence case EXT_shader_pixel_local_storage
Bullets 24pt sentence case . Bandwidth Efficient Deferred Shading Sub-bullets 20pt sentence case . Define G-Buffer in Pixel Local Storage space(in tile memory)
G-Buffer initialization Lighting accumulation Final shading
24 © ARM 2016 Text 54pt sentence case ARM Lund
25 © ARM 2016 Title 40pt sentence case ARM Lund GPU teams
Bullets 24pt sentence case . Hardware design/verification 15 (Fragment front/back-end) Sub-bullets 20pt sentence case . GPU Modelling 17 . Graphics/Backend Compiler team 16 . HW/SW regression testing 10 . Multimedia(GPU+VPU+DPU) development+regression 4
ARM Lund total ~100. Remaining are Video IP development.
Cambridge/Trondheim major GPU development sites
26 © ARM 2016 Title 40pt sentence case GPU HW Development
Bullets 24pt sentence case . HDL: Verilog + System Verilog Sub-bullets 20pt sentence case . git/gerrit . Cluster based RTL simulation framework . HW Engineers usually with low-spec desktop/laptop. . Everything runs on massive CPU clusters in Cambridge
27 © ARM 2016 Title 40pt sentence case GPU SW Development
Bullets 24pt sentence case . Sw tools: git/gerrit/svn, scons(DDK configuration), codecollaborator, JIRA Sub-bullets 20pt sentence case . GPU DDK release cycle: scrum-based. Release every iteration. . Iteration: ~2 months . Sprints: 2 weeks. 4 sprints/iteration . Regression Server Test Framework: Internal . Handles FPGA board flashing, driver building, board allocation and test execution, automatic bug filing with git bisect . Hardware platforms . ARM Juno/VersatileExpress + single/multiple FPGA tiles . Silicon development platforms with Mali . Mali GPU Model (bit accurate, close to cycle accurate)
28 © ARM 2016 Title 40pt sentence case FPGA Lab
Bullets 24pt sentence case Sub-bullets 20pt sentence case
29 © ARM 2016 Title 40pt sentence case ARM Lund Open-House Event TONIGHT(6/12)
Bullets 24pt sentence case . Where Sub-bullets 20pt sentence case . Tuesday December 6:th @17.15 Emdalavägen 6, 4:th floor
. What . Food/Drinks/Snacks . Lund Team ”stations”. Talk to a hw designer or compiler engineer . Graphics & Video HW/SW presentations . Demos . Mingle
30 © ARM 2016 Title 40pt sentence case ARM Lund Student opportunities
Bullets 24pt sentence case . Graduate job offerings – apply through www.arm.com/careers Sub-bullets 20pt sentence case . Internship (part/full time) – apply through www.arm.com/careers . Master thesis work – see adds on hand outs, apply to [email protected]
. If you have any questions or specific proposals on how to engage with ARM, just drop an email to [email protected]
31 © ARM 2016 Q & A
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright © 2016 ARM Limited
© ARM 2016 Title 40pt sentence case Backup slides
Bullets 24pt sentence case Sub-bullets 20pt sentence case
33 © ARM 2016 Title 40pt sentence case Midgard Shader Core: Fragment Frontend
. Polygon List Reader / Triangle Setup Bullets 24pt sentence case Polygon List Reader Sub-bullets 20pt sentence case . Loads primitives from the tiler polygon list; ~13 cycles a triangle
. Depth-bounds and Hierarchical-Z S Testing Depth-bounds & . Conservative fast cull of primitives based on line equations Hierarchical ZS Test
. Rasterizer . Generate 2x2 fragment quads with per per-fragment coverage mask Rasterizer
. Early ZS Testing . Cull quads based on ZS values, if possible Early ZS Testing . Those which pass ZS may cull old quads in FPK Queue
. FPK Queue FPK Queue (128 Quads) . Buffer with 128 quads; Freya supports priority quad issue into FTC
. Fragment Thread Creator Fragment Thread Creator . Spawn a quad as four fragment threads over four cycles
34 © ARM 2016 Title 40pt sentence case Midgard Shader Core: Fragment Backend
Bullets 24pt sentence case . Late ZS Testing Late ZS Testing Sub-bullets 20pt sentence case . Resolve any outstanding ZS tests we couldn’t do early
. Blending Blending . Blend up to 4 samples per clock for subset of blend equations . 4x MSAA needs 1 cycle, 8x needs 2 cycles, and 16x needs 4 cycles . Complex blend operations implemented using blend shaders
. Tile Memory Tile Memory . Storage for color and depth+stencil data . Mali-T760 onwards provide flexible allocation for MRT and/or MSAA
. Tile Writeback Tile Writeback . Requires one cycle per output pixel . Provides the CRC generation for transaction elimination . AFBC compression and YUV writeback supported (Mali-T760 onwards)
35 © ARM 2016 Title 40pt sentence case Smart Composition
Bullets 24pt sentence case . Selectively update parts of a frame Sub-bullets 20pt sentence case . Khronos EGL extensions . KHR_partial_update . EXT_buffer_age
. Producer can utilize age of existing backbuffer contents to draw frame with multiple bounding boxes
. Similar to TE, but here we wont even write to the tilebuffer
36 © ARM 2016