Advanced Computer Architecture
Non-Uniform Cache Architectures

Νεκτάριος Κοζύρης & Διονύσης Πνευματικάτος {nkoziris,pnevmati}@cslab.ece.ntua.gr

Slides from Andreas Moshovos, University of Toronto

8th semester, ECE NTUA, Academic Year 2019-20
http://www.cslab.ece.ntua.gr/courses/advcomparch/

Modern Processors Have Lots of Cores and Large Caches
• Sun Niagara T1

From http://jinsatoh.jp/ennui/archives/2006/03/opensparc.html

Modern Processors Have Lots of Cores and Large Caches
• Intel i7 (Nehalem)

From http://www.legitreviews.com/article/824/1/

Modern Processors Have Lots of Cores and Large Caches
• AMD Shanghai

From http://www.chiparchitect.com

Modern Processors Have Lots of Cores and Large Caches
• IBM Power 5

From http://www.theinquirer.net/inquirer/news/1018130/ibms-power5-the-multi-chipped-monster-mcm-revealed

Why?
• Helps with Performance and Energy
• Find graph with perfect vs. realistic memory system

What Cache Design Used to be About

Core
L1I, L1D: 1-3 cycles (latency limited)
L2: 10-16 cycles (capacity limited)
Main Memory: > 200 cycles

• L2: Worst Latency == Best Latency
• Key Decision: what to keep in each cache level

What Has Changed

[Figure from ISSCC 2003]

What Has Changed
• Where something is matters
• More time for longer distances

NUCA: Non-Uniform Cache Architecture

[Figure: core with L1I/L1D next to a grid of L2 tiles]
• Tiled Cache
• Variable Latency
• Closer tiles = Faster
• Key Decisions:
  – Not only what to cache
  – Also where to cache

NUCA Overview
• Initial research focused on uniprocessors
• Data Migration Policies
  – When to move data among tiles
• L-NUCA: Fine-Grained NUCA

Another Development: Chip Multiprocessors

[Figure: four cores, each with L1I/L1D, sharing one L2]
• Easily utilize on-chip transistors
• Naturally exploit thread-level parallelism
• Dramatically reduce design complexity
• Future CMPs will have more cores
• Future CMPs will have more cache

Text from Michael Zhang & Krste Asanovic, MIT

Initial Chip Multiprocessor Designs

[Figure: a 4-node CMP with a large L2 cache: cores with L1$ above an intra-chip switch, L2 cache below]
• Layout: “Dance-Hall”
  – Core + L1 cache
  – Intra-chip switch
  – L2 cache
• Small L1 cache: very low access latency
• Large L2 cache

Slide from Michael Zhang & Krste Asanovic, MIT

Chip Multiprocessor w/ Large Caches

[Figure: a 4-node CMP with a large L2 cache divided into slices, below an intra-chip switch]
• Layout: “Dance-Hall”
  – Core + L1 cache
  – Intra-chip switch
  – L2 cache
• Small L1 cache: very low access latency
• Large L2 cache: divided into slices to minimize access latency and power usage

Slide from Michael Zhang & Krste Asanovic, MIT

Chip Multiprocessors + NUCA

[Figure: a 4-node CMP with a large, sliced L2 cache]
• Current: caches are designed with (long) uniform access latency for the worst case:
  Best Latency == Worst Latency
• Future: must design with non-uniform access latencies depending on the on-die location of the data:
  Best Latency << Worst Latency
• Challenge: how to minimize average cache access latency:
  Average Latency → Best Latency
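One hedged way to write this optimization target (our notation, not from the slide): if f_i is the fraction of L2 hits served by bank i and L_i is that bank's round-trip latency, then

    \text{AvgLatency} = \sum_i f_i \cdot L_i

and NUCA placement/migration policies try to steer the large f_i toward banks with small L_i, so the sum approaches the best-case latency.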

Slide from Michael Zhang & Krste Asanovic, MIT

Tiled Chip Multiprocessors

[Figure: tile = core + L1$ + switch + L2$ slice (data + tag); 16 tiles in a 4x4 grid]
• Tiled CMPs for scalability
  – Minimal redesign effort
  – Use directory-based protocol for scalability
• Managing the L2s to minimize the effective access latency
  – Keep data close to the requestors
  – Keep data on-chip

Slide from Michael Zhang & Krste Asanovic, MIT

Option #1: Private Caches

[Figure: four cores, each with private L1I/L1D and a private L2, above Main Memory]
• + Low latency
• - Fixed allocation

Option #2: Shared Caches

[Figure: four cores, each with L1I/L1D, sharing a banked L2, above Main Memory]
• Higher, variable latency
• One core can use all of the cache

Data Cache Management for CMP Caches
• Get the best of both worlds
  – Low latency of private caches
  – Capacity adaptability of shared caches

NUCA: A Non-Uniform Cache Access Architecture for Wire-Delay Dominated On-Chip Caches

Changkyu Kim, D.C. Burger, and S.W. Keckler, 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), October, 2002.

Some material from slides by Prof. Hsien-Hsin S. Lee, ECE, Georgia Tech

Conventional – Monolithic Cache
• UCA: Uniform Access Cache
[Figure: single monolithic UCA array]
• Best Latency = Worst Latency
• Time to access the farthest possible bank

UCA Design
• Partitioned in banks

[Figure: banked UCA organization: banks with sub-banks, a shared address bus and data bus, predecoder, sense amplifiers, tag array, wordline drivers and decoders]
• Conceptually a single address and a single data bus
• Pipelining can increase throughput
• See the CACTI tool:
  • http://www.hpl.hp.com/research/cacti/
  • http://quid.hpl.hp.com:9081/cacti/

Experimental Methodology
• SPEC CPU 2000
• sim-alpha
• CACTI
• 8 FO4 cycle time
• 132 cycles to main memory
• Skip and execute a sample
• Technology nodes: 130nm, 100nm, 70nm, 50nm

UCA Scaling – 130nm to 50nm
[Charts: miss rate and unloaded latency (cycles) for 130nm/2MB, 100nm/4MB, 70nm/8MB, 50nm/16MB]

[Charts: loaded latency and IPC for 130nm/2MB, 100nm/4MB, 70nm/8MB, 50nm/16MB]
Relative latency and performance degrade as technology improves

UCA Discussion
• Loaded Latency: contention
  – Bank
  – Channel
• A bank may be free but the path to it is not

Multi-Level Cache
• Conventional hierarchy

[Figure: conventional hierarchy with an L2 backed by an L3]
• Common usage: serial access for energy and bandwidth reduction
• This paper: parallel access
  – Prove that even then their design is better

ML-UCA Evaluation
• Better than UCA
• Performance saturates at 70nm
• No benefit from a larger cache at 50nm
• Aggressively banked
• Multiple parallel accesses
[Charts: unloaded L2/L3 latency and IPC for 130nm/512K/2MB, 100nm/512K/4M, 70nm/1M/8M, 50nm/1M/16M]

S-NUCA-1
• Static NUCA with per-bank-set busses

[Figure: S-NUCA-1 organization: banks (with sub-banks) grouped into bank sets, each bank set with its own address and data bus; the address is split into Tag | Set | Offset]
• Use a private channel per bank set
• Each bank has its own, distinct access latency
• A given address maps to a given bank set
  – Selected by the lower bits of the block address (see the sketch below)
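A minimal sketch (our own C, not from the paper) of the static mapping just described: the bank-set index comes from the low-order bits of the block address, so a given address always lands in the same bank set. The line size and bank-set count are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

/* Illustrative S-NUCA-1-style static mapping (assumed parameters). */
#define LINE_SIZE     64u   /* bytes per cache block                    */
#define NUM_BANK_SETS 32u   /* e.g., the 50nm/16MB/32 design point      */

/* Strip the block offset, then take the low bits of the block address. */
static unsigned bank_set_of(uint64_t paddr) {
    uint64_t block_addr = paddr / LINE_SIZE;
    return (unsigned)(block_addr % NUM_BANK_SETS);
}

int main(void) {
    uint64_t a = 0x1234ABC0;
    /* Every access to the same block goes to the same bank set. */
    printf("address 0x%llx -> bank set %u\n",
           (unsigned long long)a, bank_set_of(a));
    return 0;
}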

S-NUCA-1
• How fast can we initiate requests? (c = scheduler delay)
• Conservative / Realistic:
  – Bank + 2 x interconnect + c
• Aggressive / Unrealistic:
  – Bank + c
  (a small worked example follows this slide)
• What is the optimal number of bank sets?
  – Exhaustive evaluation of all options
  – Which gives the highest IPC
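A hedged worked example of the two request-initiation models above, with our illustrative numbers rather than the paper's: assume a bank access of 3 cycles, a one-way interconnect delay of 8 cycles, and a scheduler delay c = 1. The conservative model can then issue a new request on that bank set's channel every 3 + 2*8 + 1 = 20 cycles, while the aggressive model assumes the channel is effectively pipelined and re-issues every 3 + 1 = 4 cycles.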

S-NUCA-1 Latency Variability
[Chart: unloaded min/avg/max latency (cycles) for 130nm/2M/16, 100nm/4M/32, 70nm/8M/32, 50nm/16M/32]
• Variability increases for finer technologies
• The number of banks does not increase beyond 4M
  – Overhead of additional channels
  – Banks become larger and slower

S-NUCA-1 Loaded Latency

[Chart: loaded latency (cycles) for 130nm/2M/16, 100nm/4M/32, 70nm/8M/32, 50nm/16M/32; series: Aggr., Loaded]
• Better than ML-UCA

S-NUCA-1: IPC Performance
[Chart: IPC of S1 across the same design points]

• Per-bank channels become an overhead
• Prevent finer partitioning at 70nm or smaller

S-NUCA-2
• Use a 2-D mesh P2P interconnect

[Figure: S-NUCA-2 organization: banks with tag arrays connected through the switches of a 2-D mesh; predecoder, wordline drivers and decoders]
• Wire overhead much lower: S1: 20.9% vs. S2: 5.9% at 50nm and 32 banks
• Reduces contention
• 128-bit bi-directional links

S-NUCA-2 vs. S-NUCA-1 Unloaded Latency

[Chart: unloaded min/avg/max latency of S1 vs. S2 for 130nm/2M/16, 100nm/4M/32, 70nm/8M/32, 50nm/16M/32]
• S-NUCA-2 almost always better

S-NUCA-2 vs. S-NUCA-1 IPC Performance

[Chart: IPC of S1 (aggressive scheduling, channel fully pipelined) vs. S2 for 130nm/2M/16, 100nm/4M/32, 70nm/8M/32, 50nm/16M/32]
• S2 better than S1

Dynamic NUCA

[Figure: processor core with d-groups ordered from fast (near) to slow (far); each d-group holds one way (way 0 ... way n-1) of tag and data]
• Data can dynamically migrate
• Move frequently used cache lines closer to the CPU
• One way of each set in the fast d-group; blocks compete within the set
• Cache blocks “screened” for fast placement

Part of slide from Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar

Dynamic NUCA – Mapping #1
• Where can a block map to?

[Figure: banks organized as 8 bank sets; one set spans way 0 ... way 3 within a bank set]
• Simple Mapping
• All 4 ways of each bank set need to be searched
• Farther bank sets → longer access

Dynamic NUCA – Mapping #2
[Figure: banks organized as 8 bank sets, ways 0-3]
• Fair Mapping
• Average access times across all bank sets are equal

Dynamic NUCA – Mapping #3
[Figure: banks organized as 8 bank sets, ways 0-3; the closest banks are shared by several bank sets]
• Shared Mapping
• Sharing the closest banks → every set has some fast storage
• If n bank sets share a bank, then all banks must be n-way set associative

Dynamic NUCA – Searching
• Where is a block?
• Incremental Search
  – Search the banks in order, closest first
• Multicast
  – Search all of them in parallel
• Partitioned Multicast
  – Search groups of them in parallel (a sketch of these policies follows)
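A minimal sketch of how the lookup policies differ, under assumptions of ours: 4 banks per bank set, a fixed per-bank latency array, and a has_block() stub standing in for a real tag check. It only illustrates which banks each policy probes and what the resulting hit/miss latency looks like.

#include <stdbool.h>
#include <stdio.h>

#define WAYS 4                                       /* banks per bank set (assumed) */
static const int bank_latency[WAYS] = {4, 8, 12, 16}; /* nearest to farthest, cycles */

/* Stand-in for the tag check inside bank `way`. */
static bool has_block(int way, unsigned long blk) {
    return (blk % WAYS) == (unsigned long)way;       /* fake placement */
}

/* Incremental search: probe the closest bank first and stop on a hit;
 * a miss is known only after the last (farthest) probe returns. */
static int incremental_latency(unsigned long blk) {
    int t = 0;
    for (int w = 0; w < WAYS; w++) {
        t += bank_latency[w];
        if (has_block(w, blk)) break;
    }
    return t;
}

/* Multicast search: probe every bank in parallel; a hit costs that bank's
 * latency, a miss costs the farthest bank's latency. Partitioned multicast
 * would apply the same idea to one group of banks at a time. */
static int multicast_latency(unsigned long blk) {
    for (int w = 0; w < WAYS; w++)
        if (has_block(w, blk)) return bank_latency[w];
    return bank_latency[WAYS - 1];
}

int main(void) {
    unsigned long blk = 2;    /* lands in way 2 under the fake placement */
    printf("incremental: %d cycles, multicast: %d cycles\n",
           incremental_latency(blk), multicast_latency(blk));
    return 0;
}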

D-NUCA – Smart Search
• Tags are distributed
  – May search many banks before finding a block
  – The farthest bank determines the miss-determination latency
• Solution: Centralized Partial Tags
  – Keep a few bits of every tag (e.g., 6) at the cache controller
  – If no match → the bank does not have the block
  – If match → must access the bank to find out

Partial Tags: R.E. Kessler, R. Jooss, A. Lebeck, and M.D. Hill. Inexpensive implementations of set-associativity. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 131–139, May 1989.

Partial Tags / Smart Search Policies
• SS-Performance:
  – Partial tags and banks accessed in parallel
  – Early miss determination: go to main memory if no partial match
  – Reduces latency for misses
• SS-Energy:
  – Partial tags first
  – Banks accessed only on a potential match
  – Saves energy
  – Increases delay
(a partial-tag sketch follows)
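A minimal sketch (ours, with assumed sizes) of the centralized partial-tag filter: a few low-order tag bits per block are kept at the controller, a mismatch safely rules a bank out, and a match still requires probing the bank, mirroring the SS-Energy flow above.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BANKS        16   /* banks a block could live in (assumed)         */
#define PARTIAL_BITS 6    /* low-order tag bits kept at the controller     */

/* partial[b] holds the partial tag of the candidate block in bank b. */
static uint8_t partial[BANKS];

static uint8_t partial_of(uint64_t full_tag) {
    return (uint8_t)(full_tag & ((1u << PARTIAL_BITS) - 1));
}

/* Stand-in for a real bank probe (full tag compare inside the bank). */
static bool probe_bank(int bank, uint64_t full_tag) {
    (void)bank; (void)full_tag;
    return false;                 /* pretend the data is not there */
}

/* SS-Energy-style lookup: consult partial tags first, probe only matches. */
static int smart_search(uint64_t full_tag) {
    uint8_t p = partial_of(full_tag);
    for (int b = 0; b < BANKS; b++) {
        if (partial[b] != p)
            continue;             /* definite miss in this bank: skip it   */
        if (probe_bank(b, full_tag))
            return b;             /* partial match confirmed by the bank   */
    }
    return -1;                    /* no bank matched: early miss, go to memory */
}

int main(void) {
    printf("hit bank: %d\n", smart_search(0x3f2a1));
    return 0;
}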

Migration
• Want data that will be accessed to be close
• Use LRU ordering?
  – Bad idea: must shift all the other blocks
[Figure: blocks ordered from LRU (far) to MRU (near)]

• Generational Promotion
  – Move to the next (closer) bank on a hit
  – Swap with the block already there

Initial Placement
• Where to place a new block coming from memory?
• Closest bank?
  – May force another important block to move away
• Farthest bank?
  – Takes several accesses before the block comes close

Victim Handling
• A new block must replace an older block (the victim)
• What happens to the victim?
• Zero Copy
  – Gets dropped completely
• One Copy
  – Moved away to a slower bank (the next bank)

DN-Best
• DN-BEST: Shared Mapping + SS-Energy
  – Balances performance and access count / energy
  – Maximum performance is only 3% higher
• Insert at tail
  – Insert at head reduces avg. latency but increases misses
• Promote on hit
  – No major differences with other policies

Baseline D-NUCA
• Simple Mapping
• Multicast Search
• One-bank promotion on hit
• Replace from the slowest bank
(a promotion/replacement sketch follows)
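A hedged sketch of the baseline D-NUCA behavior listed above: on a hit, the block swaps one bank closer to the core (generational promotion); on a fill, the new block enters the farthest bank and the old occupant is simply dropped (zero-copy victim handling). The bank count and the find_block() helper are illustrative assumptions, not the paper's implementation.

#include <stdio.h>

#define BANKS 8                      /* banks in one bank set, 0 = closest */

/* One bank set: banks[i] holds the block address cached in bank i (0 = empty). */
static unsigned long banks[BANKS];

/* Generational promotion: on a hit in bank b > 0, swap with bank b-1. */
static void on_hit(int b) {
    if (b > 0) {
        unsigned long tmp = banks[b - 1];
        banks[b - 1] = banks[b];
        banks[b] = tmp;
    }
}

/* Fill on a miss: insert at the tail (slowest bank); the previous occupant
 * is dropped (zero copy). */
static void on_fill(unsigned long blk) {
    banks[BANKS - 1] = blk;
}

static int find_block(unsigned long blk) {
    for (int b = 0; b < BANKS; b++)
        if (banks[b] == blk) return b;
    return -1;
}

int main(void) {
    unsigned long blk = 0x40;
    on_fill(blk);                                   /* enters the slowest bank */
    for (int i = 0; i < 3; i++)                     /* three hits ...          */
        on_hit(find_block(blk));
    printf("block is now in bank %d\n", find_block(blk)); /* ... three banks closer */
    return 0;
}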

D-NUCA Unloaded Latency
[Chart: unloaded min/avg/max latency (cycles) for 130nm/2M/4x4, 100nm/4MB/8x4, 70nm/8MB/16x8, 50nm/16M/16x16]

IPC Performance: D-NUCA vs. S-NUCA-2 vs. ML-UCA

[Chart: IPC of D-NUCA vs. S-NUCA-2 vs. ML-UCA across the four technology nodes]

Performance Comparison

UPPER = all hits are in the closest bank (3-cycle latency)
• D-NUCA and S-NUCA-2 scale well
• D-NUCA outperforms all other designs
• ML-UCA saturates
• UCA degrades

Distance Associativity for High-Performance Non-Uniform Cache Architectures
Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar
36th Annual International Symposium on Microarchitecture (MICRO), December 2003.

Slides mostly directly from the authors’ conference presentation

Motivation
Large Cache Design
• L2/L3 growing (e.g., 3 MB in Itanium II)
• Wire delay becoming dominant in access time

Conventional large cache
• Many subarrays => wide range of access times
• Uniform cache access => access time of the slowest subarray
• Oblivious to the access frequency of data

Want often-accessed data faster: improve access time

Previous work: NUCA (ASPLOS ’02)
Pioneered Non-Uniform Cache Architecture
• Access time: divides the cache into many distance-groups (d-groups)
  – D-group closer to the core => faster access time
• Data mapping: conventional
  – Set determined by block index; each set has n ways
• Within a set, place frequently-accessed data in the fast d-group
  – Place blocks in the farthest way; bubble closer if needed

D-NUCA

[Figure: processor core; d-groups ordered from fast (near the core) to slow (far), each d-group holding one way (way 0 ... way n-1) of tag and data]
One way of each set in the fast d-group; blocks compete within the set
Cache blocks “screened” for fast placement
Want to change this restriction: more flexible data placement

NUCA

• Artificial coupling between set-associative way # and d-group
  – Only one way in each set can be in the fastest d-group
  – Hot sets have > 1 frequently-accessed way
  – Hot sets can place only one way in the fastest d-group
• Swapping of blocks is bandwidth- and energy-hungry
  – D-NUCA uses a switched network for fast swaps

Common Large-Cache Techniques

Sequential Tag-Data: e.g., Alpha 21164 L2, Itanium II L3
• Access the tag first, then access only the matching data
• Saves energy compared to parallel access
Data Layout: Itanium II L3
• Spread a block over many subarrays (e.g., 135 in Itanium II)
• For area efficiency and hard- and soft-error tolerance

These issues are important for large caches

Contributions
Key observation:
• Sequential tag-data => indirection through the tag array
• Data may be located anywhere
Distance Associativity: decouple tag and data => flexible mapping for sets
• Any # of ways of a hot set can be in the fastest d-group
NuRAPID cache: Non-uniform access with Replacement And Placement usIng Distance associativity
Benefits:
• More accesses to faster d-groups
• Fewer swaps => less energy, less bandwidth
But: more tags + pointers are needed

Outline
• Overview

• NuRAPID Mapping and Placement

• NuRAPID Replacement

• NuRAPID layout

• Results

• Conclusion

NuRAPID Mapping and Placement
Distance-Associative Mapping:
• Decouple tag from data using a forward pointer
• A tag access returns the forward pointer, i.e., the data location
Placement: a data block can be placed anywhere
• Initially place all data in the fastest d-group
• Small risk of displacing an often-accessed block
(a data-structure sketch follows)
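A hedged C sketch (our own field names and sizes) of the decoupled structures: each tag entry carries a forward pointer to the data frame holding its block, and each data frame carries a reverse pointer back to its tag entry, so either side can be moved independently and the other side patched.

#include <stdint.h>

/* Illustrative sizes, not the paper's. */
#define NUM_DGROUPS       4
#define FRAMES_PER_DGROUP 1024

/* Forward pointer stored with every tag: where the data currently lives. */
struct fwd_ptr {
    uint8_t  dgroup;           /* which distance-group            */
    uint16_t frame;            /* which frame inside that d-group */
};

/* One tag entry: conventional tag bits plus the forward pointer. */
struct tag_entry {
    uint64_t       tag;
    uint8_t        valid;
    struct fwd_ptr where;      /* follow this to reach the data */
};

/* Reverse pointer stored with every data frame: which tag owns it. */
struct rev_ptr {
    uint16_t set;              /* tag-array set */
    uint8_t  way;              /* tag-array way */
};

/* One data frame: the block itself plus the reverse pointer, so a demotion
 * can find and update the owning tag's forward pointer. */
struct data_frame {
    uint8_t        block[64];
    uint8_t        used;
    struct rev_ptr owner;
};

/* The data side is organized as d-groups, fastest first. */
static struct data_frame dgroups[NUM_DGROUPS][FRAMES_PER_DGROUP];

int main(void) {
    /* Link one tag entry to frame 1 of the fastest d-group, and back. */
    struct tag_entry t = { .tag = 0xABC, .valid = 1,
                           .where = { .dgroup = 0, .frame = 1 } };
    dgroups[t.where.dgroup][t.where.frame].used = 1;
    dgroups[t.where.dgroup][t.where.frame].owner =
        (struct rev_ptr){ .set = 0, .way = 0 };
    return 0;
}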

NuRAPID Mapping: Placing a block
[Figure: tag array (sets x ways) and data arrays organized as d-group 0 (fast) through d-group 2 (slow), each with numbered frames; tag entry Atag holds the forward pointer grp0,frm1]
All blocks are initially placed in the fastest d-group

NuRAPID: Hot set can be in fast d-group

[Figure: tag entries Atag (grp0,frm1) and Btag (grp0,frmk) from the same set both point into d-group 0]
Multiple blocks from one set in the same d-group

NuRAPID: Unrestricted placement

[Figure: tag entries Ctag (grp2,frm0) and Dtag (grp1,frm1) point into slower d-groups]
No coupling between tag and data mapping

Outline
• Overview

• NuRAPID Mapping and Placement

• NuRAPID Replacement

• NuRAPID layout

• Results

• Conclusion

NuRAPID Replacement

Two forms of replacement:

• Data Replacement: like conventional
  – Evicts blocks from the cache due to tag-array limits

• Distance Replacement: moving blocks among d-groups
  – Determines which block to demote from a d-group
  – Decoupled from data replacement
  – No blocks evicted; blocks are swapped

NuRAPID: Replacement

[Figure sequence: tag array and data arrays with forward pointers (tag → frame) and reverse pointers (frame → tag)]
• Place a new block, A, in set 0. Space must be created in the tag set: data-replace Z. Z may not be in the target d-group.
• After Z is data-replaced, its tag entry and its data frame are empty.
• Place Atag in set 0. An empty data frame must be created in the fast d-group: B is selected for demotion; its reverse pointer is used to locate Btag.
• B is demoted to the empty frame and Btag is updated. (There was an empty frame because Z was evicted; this may not always be the case.)
• A is placed in d-group 0 and the pointers are updated.

Replacement details

• There is always an empty block somewhere for a distance-replacement demotion
  – May require multiple demotions to find it
• The example showed only demotion
  – A block could get stuck in a slow d-group
  – Solution: promote upon access (see paper)
• How to choose the block for demotion?
  – Ideal: LRU within the d-group
  – LRU is hard; random works OK (see paper): promotions fix errors made by the random choice
(a demotion-selection sketch follows)
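A minimal sketch (sizes and setup are our assumptions) of the demotion behavior described above: instead of per-d-group LRU, pick a frame at random and demote its block to the next slower d-group, chaining demotions until a free frame is found; a frequently used block that gets unlucky is pulled back later by promotion-on-access.

#include <stdio.h>
#include <stdlib.h>

#define NUM_DGROUPS 4      /* 0 = fastest, 3 = slowest (assumed)      */
#define FRAMES      4      /* tiny d-groups, purely for illustration  */

/* occupancy[g][f] != 0 means frame f of d-group g currently holds a block.
 * (Forward/reverse pointer updates are omitted from this sketch.)          */
static int occupancy[NUM_DGROUPS][FRAMES];

/* Return a free frame in d-group g, demoting a randomly chosen block to the
 * next slower d-group if g is full. Assumes some slower d-group has a free
 * frame (as the preceding data replacement guarantees), so no block is ever
 * evicted here; it may take several chained demotions to find it.           */
static int free_frame_in(int g) {
    for (int f = 0; f < FRAMES; f++)
        if (!occupancy[g][f]) return f;              /* already free          */
    int victim = rand() % FRAMES;                    /* random, not LRU       */
    int below  = free_frame_in(g + 1);               /* chain the demotion    */
    occupancy[g + 1][below] = occupancy[g][victim];  /* demote victim's block */
    occupancy[g][victim] = 0;
    return victim;
}

int main(void) {
    for (int g = 0; g < NUM_DGROUPS; g++)            /* fill everything ...   */
        for (int f = 0; f < FRAMES; f++)
            occupancy[g][f] = 1;
    occupancy[NUM_DGROUPS - 1][0] = 0;               /* ... except one frame  */
    printf("freed frame %d of d-group 0\n", free_frame_in(0));
    return 0;
}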

Outline
• Overview
• NuRAPID Mapping and Placement

• NuRAPID Replacement

• NuRAPID layout

• Results

• Conclusion

Layout: small vs. large d-groups

Key: conventional caches spread a block over subarrays
  + Splits the “decoding” into the address decoder and muxes at the output of the subarrays (e.g., a 5-to-1 decoder + two 2-to-1 muxes is better than a 10-to-1 decoder; 9-to-1 decoder?)
  + More flexibility to deal with defects
  + More tolerant to transient errors
• Non-uniform cache: can spread only over one d-group
  – So all bits in a block have the same access time
Small d-groups (e.g., 64KB of 4 16-KB subarrays)
• Fine granularity of access times
• Blocks spread over few subarrays
Large d-groups (e.g., 2 MB of 128 16-KB subarrays)
• Coarse granularity of access times
• Blocks spread over many subarrays
Large d-groups are superior for spreading data

Outline
• Overview

• NuRAPID Mapping and Placement

• NuRAPID Replacement

• NuRAPID layout

• Results

• Conclusion

Methodology

• 64 KB, 2-way L1s. 8 MSHRs on d-cache

NuRAPID: 8 MB, 8-way, 1-port, no banking
• 4 d-groups (14, 18, 36, 44 cycles)
• 8 d-groups (12, 19, 20, ..., 49 cycles) shown in paper
Compare to:
• BASE: 1 MB, 8-way L2 (11 cycles) + 8 MB, 8-way L3 (43 cycles)
• 8 MB, 16-way D-NUCA (4 to 31 cycles)
  – Multi-banked, infinite-bandwidth interconnect

• SA vs. DA placement (paper figure 4)

[Chart: fraction of accesses to the fastest d-group; annotation: as high as possible]

Results
• 3.0% better than D-NUCA, and up to 15% better

Results
• Energy effects are much more significant

Conclusions

• NuRAPID
  – Leverages sequential tag-data
  – Flexible placement and replacement for non-uniform caches
  – Achieves 7% overall processor E-D savings over conventional cache and D-NUCA
  – Reduces L2 energy by 77% over D-NUCA

NuRAPID: an important design for wire-delay-dominated caches

Managing Wire Delay in Large CMP Caches

Bradford M. Beckmann and David A. Wood

Multifacet Project, University of Wisconsin-Madison, MICRO 2004

Overview
• Managing wire delay in shared CMP caches
• Three techniques extended to CMPs
  1. On-chip Strided Prefetching (not in talk, see paper)
     – Scientific workloads: 10% average reduction
     – Commercial workloads: 3% average reduction
  2. Cache Block Migration (e.g., D-NUCA)
     – Block sharing limits the average reduction to 3%
     – Depends on a difficult-to-implement smart search
  3. On-chip Transmission Lines (e.g., TLC)
     – Reduce runtime by 8% on average
     – Bandwidth contention accounts for 26% of L2 hit latency
• Combining techniques
  + Potentially alleviates isolated deficiencies
  – Up to 19% reduction vs. baseline
  – Implementation complexity

Baseline: CMP-SNUCA
[Figure: 8 CPUs (each with L1 I$ and D$) placed around a shared, banked NUCA L2]

Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
  – Methodology
  – Block Migration: CMP-DNUCA
  – Transmission Lines: CMP-TLC
  – Combination: CMP-Hybrid

Block Migration: CMP-DNUCA
[Figure: the CMP-SNUCA layout with two blocks, A and B, migrating toward the CPUs that use them]

On-chip Transmission Lines
• Similar to contemporary off-chip communication
• Provides a different latency / bandwidth tradeoff
• Wires behave more “transmission-line”-like as frequency increases
  – Utilize transmission-line qualities to our advantage
  – No repeaters: route directly over large structures
  – ~10x lower latency across long distances
• Limitations
  – Requires thick wires and dielectric spacing
  – Increases manufacturing cost
• See “TLC: Transmission Line Caches”, Beckmann & Wood, MICRO ’03

RC vs. TL Communication

[Figure: signal propagation on a conventional global RC wire vs. an on-chip transmission line: voltage vs. distance from driver to receiver]

RC Wire vs. TL Design
[Figure: conventional global RC wire, repeated ~0.375 mm segments, RC-delay dominated; on-chip transmission line, ~10 mm driver to receiver, LC-delay dominated]

On-chip Transmission Lines
• Why now? → 2010 technology
  – Relative RC delay ↑
  – Improve latency by 10x or more
• What are their limitations?
  – Require thick wires and dielectric spacing
  – Increase wafer cost
• Presents a different latency/bandwidth tradeoff

Latency Comparison

[Chart: link latency (cycles) vs. length (cm) for repeated RC wires vs. a single transmission line]

Bandwidth Comparison
[Figure: 2 transmission-line signals occupy the same width as 50 conventional signals]
Key observation
• Transmission lines: route over large structures
• Conventional wires: need substrate area and vias for repeaters

Transmission Lines: CMP-TLC

[Figure: CMP-TLC layout: 8 CPUs with L1 I$/D$ connected to the central L2 by 16 8-byte transmission-line links]

Combination: CMP-Hybrid

[Figure: CMP-Hybrid layout: the CMP-SNUCA organization augmented with 8 32-byte transmission-line links to the central banks]

Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
  – Methodology
  – Block Migration: CMP-DNUCA
  – Transmission Lines: CMP-TLC
  – Combination: CMP-Hybrid

Methodology
• Full-system simulation
  – Simics
  – Timing model extensions
    • Out-of-order processor
    • Memory system
• Workloads
  – Commercial: apache, jbb, oltp, zeus
  – Scientific
    • SPLASH: barnes & ocean
    • SpecOMP: apsi & fma3d

System Parameters

Memory System:
• L1 I & D caches: 64 KB, 2-way, 3 cycles
• Unified L2 cache: 16 MB, 256 x 64 KB, 16-way, 6-cycle bank access
• L1 / L2 cache block size: 64 bytes
• Memory latency: 260 cycles
• Memory bandwidth: 320 GB/s
• Memory size: 4 GB of DRAM
• Outstanding memory requests / CPU: 16

Dynamically Scheduled Processor:
• Clock frequency: 10 GHz
• Reorder buffer / scheduler: 128 / 64 entries
• Pipeline width: 4-wide fetch & issue
• Pipeline stages: 30
• Direct branch predictor: 3.5 KB YAGS
• Return address stack: 64 entries
• Indirect branch predictor: 256 entries (cascaded)

Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
  – Methodology
  – Block Migration: CMP-DNUCA
  – Transmission Lines: CMP-TLC
  – Combination: CMP-Hybrid

CMP-DNUCA: Organization
[Figure: banks grouped into bankclusters: local (next to each CPU), intermediate, and center]

Hit Distribution: Grayscale Shading
[Figure: the 8-CPU layout; darker shading indicates a greater % of L2 hits]

CMP-DNUCA: Migration
• Migration policy
  – Gradual movement: other bankclusters → my center bankcluster → my inter. bankcluster → my local bankcluster
  – Increases local hits and reduces distant hits

CMP-DNUCA: Hit Distribution, Ocean, per CPU
[Figure: per-CPU hit maps for CPU 0 - CPU 7]

CMP-DNUCA: Hit Distribution, Ocean, all CPUs
[Figure] Block migration successfully separates the data sets

CMP-DNUCA: Hit Distribution, OLTP, all CPUs
[Figure]

CMP-DNUCA: Hit Distribution, OLTP, per CPU
[Figure: per-CPU hit maps for CPU 0 - CPU 7]
Hit clustering: most L2 hits are satisfied by the center banks

CMP-DNUCA: Search
• Search policy
  – Uniprocessor D-NUCA solution: partial tags
    • Quick summary of the L2 tag state at the CPU
    • No known practical implementation for CMPs
      – Size impact of multiple partial tags
      – Coherence between block migrations and partial-tag state
  – CMP-DNUCA solution: two-phase search
    • 1st phase: CPU's local, inter., and the 4 center banks
    • 2nd phase: remaining 10 banks
    • Slow 2nd-phase hits and L2 misses
(a two-phase search sketch follows)
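A hedged sketch of the two-phase search just described; the bank counts match the slide (6 banks in phase 1, the remaining 10 in phase 2), but the probe_bank() stub and the bank IDs are illustrative assumptions.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for sending a tag probe to one L2 bank. */
static bool probe_bank(int bank, unsigned long block_addr) {
    (void)bank; (void)block_addr;
    return false;                 /* pretend the block is elsewhere / absent */
}

/* Two-phase search: first the banks most likely to hold the block (this
 * CPU's local and intermediate bankclusters plus the 4 center banks), then
 * the remaining 10 banks. A 2nd-phase hit or a miss is slower because it
 * must wait for the 1st phase to come back empty. */
static int two_phase_search(const int phase1[6], const int phase2[10],
                            unsigned long block_addr) {
    for (int i = 0; i < 6; i++)             /* phase 1: multicast to 6 banks */
        if (probe_bank(phase1[i], block_addr)) return phase1[i];
    for (int i = 0; i < 10; i++)            /* phase 2: the remaining banks  */
        if (probe_bank(phase2[i], block_addr)) return phase2[i];
    return -1;                              /* L2 miss: go off-chip          */
}

int main(void) {
    int phase1[6]  = {0, 1, 2, 3, 4, 5};                   /* assumed bank IDs */
    int phase2[10] = {6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
    printf("hit bank: %d\n", two_phase_search(phase1, phase2, 0x1000));
    return 0;
}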

CMP-DNUCA: L2 Hit Latency
[Chart: L2 hit latency (cycles) of CMP-SNUCA, CMP-DNUCA, and perfect CMP-DNUCA for jbb, oltp, ocean, apsi]

CMP-DNUCA Summary
• Limited success
  – Ocean successfully splits
    • Regular scientific workload, little sharing
  – OLTP congregates in the center
    • Commercial workload, significant sharing
• Smart search mechanism
  – Necessary for performance improvement
  – No known implementations
  – Upper bound: perfect search

Outline
• Global interconnect and CMP trends
• Latency Management Techniques
• Evaluation
  – Methodology
  – Block Migration: CMP-DNUCA
  – Transmission Lines: CMP-TLC
  – Combination: CMP-Hybrid

L2 Hit Latency
[Chart: bars labeled D: CMP-DNUCA, T: CMP-TLC, H: CMP-Hybrid]

Overall Performance
[Chart: normalized runtime of CMP-SNUCA, perfect CMP-DNUCA, CMP-TLC, and perfect CMP-Hybrid for jbb, oltp, ocean, apsi]
Transmission lines improve L2 hit and L2 miss latency

Conclusions
• Individual latency management techniques
  – Strided Prefetching: subset of misses
  – Cache Block Migration: sharing impedes migration
  – On-chip Transmission Lines: limited bandwidth
• Combination: CMP-Hybrid
  – Potentially alleviates bottlenecks
  – Disadvantages
    • Relies on a smart-search mechanism
    • Manufacturing cost of transmission lines

Recap
• Initial NUCA designs → uniprocessors
  – NUCA:
    • Centralized partial tag array
  – NuRAPID:
    • Decouples tag and data placement
    • More overhead
  – L-NUCA:
    • Fine-grain NUCA close to the core

Recap – NUCAs for CMPs
• Beckmann & Wood:
  – Move data close to the user
  – Two-phase multicast search
  – Gradual migration
  – Scientific: data mostly “private” → moves close / fast
  – Commercial: data mostly “shared” → moves to the center / “slow”
• CMP-NuRapid:
  – Per-core L2 tag array
    • Area overhead
    • Tag coherence