Hybrid Threading Processor (HTP)

Hybrid Threaded Processing for Sparse Data Kernels

Tony Brewer
Chief Architect, Advanced Computing Solutions
May 9, 2018

Distribution Statement “A”: Approved for Public Release, Distribution Unlimited

“This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.”

©2016 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications are subject to change without notice. All information is provided on an “AS IS” basis without warranties of any kind. Statements regarding products, including regarding their features, availability, functionality, or compatibility, are provided for informational purposes only and do not modify the warranty, if any, applicable to any product. Drawings may not be to scale. Micron, the Micron logo, and all other Micron trademarks are the property of Micron Technology, Inc. All other trademarks are the property of their respective owners.

The Challenge

• Sparse data sets that greatly exceed a processor’s cache size are a challenge for most systems
  – Processors are typically optimized for high cache hit rate (>90%)
    • Low cache hit rate results in idle cores
  – Memory accesses are cache line size (64B)
    • Sparse data sets result in memory accesses where the majority of accessed data is not used

Hybrid Threading Processor (HTP)

• RISC-V ISA (RV64G)
  – Extensions for thread and message management
• High thread count barrel processor
  – Similar to Cray’s MTA architecture
  – One instruction per thread per scheduling interval (avoids register hazard checking)
• Event driven processor
  – Pause for memory response
  – Pause for thread join
  – Pause for message reception
• Efficient memory usage
  – Memory access size 8, 16, 32 or 64B
• Software managed coherency
  – Small cache per thread
  – Atomics performed at memory
• User space only
  – Host processor provides system support
• Standard GCC compiler
  – Runtime provides access to new instructions

New RISC-V Instructions

• Thread Management
  – Thread Create, Return, Join
• Message Management Instructions
  – Message Send, Broadcast, Receive, Listen
• Non-Cached Loads and Stores
  – Integer and Float

System Architecture

[Figure: block diagram. DRAM and Storage Class Memory (SCM) devices attach to memory controllers (Ctrl), each paired with an SRAM Cache and an Atomic Operations unit. A Network on Chip (NOC) connects these to the Hybrid Threading Processors (HTP). A second version of the slide adds the labels "Memory Controller (MC) Chiplet" and "Compute Near Memory (CNM) Chiplet".]

Modeled Configurations
• Focused on two primary configurations:
  – 1x1 CNM Config.
    • 1 CNM Chiplet
    • 2 MC Chiplets
    • 4 GDDR6 Memories
  – 2x2 CNM Config.
    • 4 CNM Chiplets
    • 8 MC Chiplets
    • 16 GDDR6 Memories

[Figures: "1x1 CNM Configuration" and "2x2 CNM Configuration" floorplans, showing GDDR6 memories, MC chiplets, and Compute Near Memory chiplets containing HTPs and NOC, with CPI links.]

Performance and Power

• Simulator models functionality and performance
  – Clocked simulation model
  – Models functionality
  – Models data paths, arbitration and queueing
• Power Estimation Methodology
  – Break design into major components: NOC, HTP, MC, GDDR6
  – Identify all RAM structures, ALUs, I/O, long signal runs (NOC), etc.
  – Determine power through foundry power estimation tools
  – Determine application activity factors through simulation
  – Determine power for each chiplet and total solution power

Graph Spectral Clustering
GRAPH COMMUNITY DETECTION USING SPECTRAL METHODS

• Community detection using spectral methods uses linear algebra to compute eigenvalues for the adjacency matrix associated with a graph. The lowest eigenvalues can be used to partition the graph.
• Sparse data structures store the graph vertices, edges and properties.
• Profile on an x86 system

  Overhead   Symbol
  13.82%     [.] svd_ATxb
  13.36%     [.] svd_ATxb2
  10.70%     [.] svd_Axb
  10.01%     [.] substruct
   9.93%     [.] svd_Axb2
   6.59%     [.] _IO_vfscanf

Graph Spectral Clustering
SENSITIVITY ANALYSIS

• Sensitivity analysis to determine optimal configuration
• Other parameters
  – Clock rate
  – Cores per HTP processor

[Figure: two plots of Million Edges per Second (0 to 25) versus Edges per Vertex (0 to 200). Panel "Memory Access Size": series 8B, 16B, 32B, 64B. Panel "Total Thread Count": series 1024, 512, 256, 128.]

Graph Spectral Clustering
THREAD STATE MONITORING

• Provides insights into run time dynamics of application

[Figure: "1x1 1.2GHz 70ms". Thread Count (0 to 900) versus Simulation Time (ns) (0.00E+00 to 6.00E+07). Series: Idle, RdyToRun, PausedMem, PausedEvent.]
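
The thread states in the monitoring plot (RdyToRun, PausedMem, Idle) can be illustrated with a toy barrel scheduler. This is a minimal sketch and not the simulator described in the deck: the names `simulate_barrel` and `issue_utilization`, the fixed memory latency, and the assumption that every instruction is a load are all hypothetical, and the PausedEvent state (thread join, message reception) is not modeled.

```python
from collections import deque


def simulate_barrel(num_threads, mem_latency, cycles, loads_per_thread=10**6):
    """Toy model: one instruction issues per cycle from the next ready thread.

    Assumption: every instruction is a load, so the issuing thread pauses
    until its memory response arrives mem_latency cycles later.
    """
    ready = deque(range(num_threads))        # RdyToRun threads, round-robin order
    remaining = [loads_per_thread] * num_threads
    wake_at = {}                             # PausedMem: thread id -> wake-up cycle
    history = []
    for cycle in range(cycles):
        # Memory responses that have arrived make their threads ready again.
        for tid in [t for t, c in wake_at.items() if c <= cycle]:
            del wake_at[tid]
            if remaining[tid] > 0:
                ready.append(tid)
        issued = 0
        if ready:
            tid = ready.popleft()
            remaining[tid] -= 1
            wake_at[tid] = cycle + mem_latency
            issued = 1
        history.append({
            "RdyToRun": len(ready),
            "PausedMem": len(wake_at),
            "Idle": num_threads - len(ready) - len(wake_at),  # finished threads
            "issued": issued,
        })
    return history


def issue_utilization(history):
    """Fraction of cycles in which some thread issued an instruction."""
    return sum(h["issued"] for h in history) / len(history)


if __name__ == "__main__":
    for n in (4, 8, 16):
        util = issue_utilization(simulate_barrel(n, mem_latency=8, cycles=800))
        print(n, "threads:", util)
```

In this toy model, with an 8-cycle latency, 4 threads keep the issue slot busy half the time, while 8 or more threads keep it full. That is the basic reason a barrel processor wants a high thread count when most instructions wait on memory.
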
Graph Spectral Clustering
COMPARISON TO REFERENCE PLATFORMS

Platform                                   Time         Power        Energy        Energy vs. 1x1
Haswell, 8 threads, 1 socket               13.5 Sec     140 Watts    1890 Joules   27.4x
Nvidia K80 (Host + 1 GPU)                  5.0 Sec¹     340 Watts    1703 Joules   24.7x
Nvidia DGX-1 (Host + P100), 1 P100 time    3.95 Sec¹    (Note: system was at a cloud provider, no power info)
NOC 1x1 Config, 1.2GHz, simulated          2.90 Sec     23.8 Watts   69 Joules     1.0x
NOC 2x2 Config, 1.2GHz, simulated          0.814 Sec    90.9 Watts   74 Joules     1.07x

[Figure: floorplans of the simulated configurations, with labels GDDR6 DRAM, MC, HTP, HTF Cluster, NoC Edge and NoC Hub.]

Notes:
1.
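
The energy figures in the comparison follow from energy = power × time. The sketch below recomputes them from the time and power values on the slide. The dictionary layout and function names are illustrative assumptions, and small differences from the slide (for example 1700 versus 1703 Joules for the K80) presumably come from rounding of the published times.

```python
# (seconds, watts) as printed on the comparison slide; the DGX-1 is omitted
# because the slide gives no power figure for it.
PLATFORMS = {
    "Haswell": (13.5, 140.0),
    "Nvidia K80": (5.0, 340.0),
    "NOC 1x1": (2.90, 23.8),
    "NOC 2x2": (0.814, 90.9),
}


def energy_joules(seconds, watts):
    """Energy consumed by a run: power multiplied by elapsed time."""
    return seconds * watts


def relative_energy(platforms, baseline="NOC 1x1"):
    """Energy of each platform divided by the baseline platform's energy."""
    base = energy_joules(*platforms[baseline])
    return {name: energy_joules(*sw) / base for name, sw in platforms.items()}


if __name__ == "__main__":
    for name, ratio in relative_energy(PLATFORMS).items():
        joules = energy_joules(*PLATFORMS[name])
        print(f"{name:12s} {joules:8.1f} J  {ratio:5.2f}x")
```

Taking the 1x1 configuration as the baseline reproduces the slide's ratios to within rounding, which suggests the "x" figures are energy relative to the 1x1 configuration.
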
