Oracle's Next Generation SPARC Cache Hierarchy

Next Generation SPARC Processor Cache Hierarchy
Ram Sivaramakrishnan, Hardware Director
Sumti Jairath, Sr. Hardware Architect
Copyright © 2014 Oracle and/or its affiliates. All rights reserved.

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Presentation Outline
• Overview
• Partitioned L3 Cache – Design Objectives
• Core Cluster and L3 Partition
• On-Chip Network and SMP Coherence
• On-Chip Protocol Features
• Workload Performance Optimization
• Conclusion

Processor Overview
The next generation SPARC processor contains
• 32 cores organized as 8 core clusters
• 64MB L3 cache
• 8 DDR4 memory schedulers (MCU) providing 160GBps of sustained memory bandwidth
• A coherency subsystem for 1 to 32 socket scaling
• A coherency and IO interconnect that provides >336GBps of peak bandwidth
• 8 Data Analytics Accelerators (DAX) for query acceleration and messaging
(Die diagram: eight core clusters around the L3, on-chip network, coherence/SMP & I/O subsystems, accelerators, and memory controllers.)

Why a Partitioned Last Level Cache?
• A Logically Unified Last Level Cache
  – Latency scales up with core count and cache size
  – Benefits large shared workloads
  – Workloads with little or no sharing see the impact of longer latency
• Partitioned Shared Last Level Cache
  – A core can allocate only in its local partition
  – A thread sees a lower access latency to its local partition
  – Low latency sharing between partitions makes them appear as a larger shared cache

Last Level Cache Design Objectives
• Minimize the effective L2 cache miss latency for response-time-critical applications
• Provide an on-chip interconnect for low latency inter-partition communication
• Provide protocol features that enable the collection of partitions to appear as a larger shared cache

Core Cluster and Partition Overview
• Four S4 SPARC cores
  – 2-issue OOO pipeline and 8 strands per core
  – 16KB L1 I$ and a 16KB write-through L1 D$ per core
• 256KB 8-way L2 I$ shared by all four cores
• 256KB 8-way dual-banked write-back L2 D$ shared by a core pair
• Dual-banked 8MB 8-way L3$
• 41-cycle (<10ns) load-to-use latency to the L3
• >192GBps interface between a core cluster and its local L3 partition
(Cluster diagram: core0–core3 share the L2 I$, each core pair shares an L2 D$, and the cluster attaches to its 8MB 8-way L3 partition.)
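The cache sizes and associativities above fix the basic indexing geometry of each cache. The short C sketch below works out the set counts and index bits for the 256KB 8-way L2 caches and the 8MB 8-way L3 partition; the 64-byte line size is an assumption (the presentation does not state it), so the numbers are illustrative only.

```c
/* Minimal sketch: derive set count and index bits for the caches described
 * above, assuming 64-byte lines (not stated in the presentation). */
#include <stdio.h>

static unsigned log2u(unsigned long long x)
{
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

static void geometry(const char *name, unsigned long long bytes,
                     unsigned ways, unsigned line)
{
    /* sets = size / (line size * associativity); index bits = log2(sets) */
    unsigned long long sets = bytes / ((unsigned long long)ways * line);
    printf("%-26s %6llu sets, %2u ways, %2u index bits\n",
           name, sets, ways, log2u(sets));
}

int main(void)
{
    const unsigned line = 64;  /* assumed 64-byte cache line */
    geometry("L2 I$ (256KB, 8-way)", 256ULL << 10, 8, line);
    geometry("L2 D$ (256KB, 8-way)", 256ULL << 10, 8, line);
    geometry("L3 partition (8MB, 8-way)", 8ULL << 20, 8, line);
    return 0;
}
```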
Processor Block Diagram
• An On-Chip Network (OCN) that connects
  – 8 L3 partitions
  – 4 SMP & IO gateways
  – 4 memory scheduler (MCU) and 4 DAX instances on each side
• 64GBps per OCN port per direction
• Low latency cross-partition access
(Block diagram: eight 4-core clusters with their 8MB L3 partitions, four SMP & IO gateways, and the MCUs and DAXs, all attached to the on-chip network.)

Salient Features of the L3
• Supplier state per cache line to prevent multiple sharing partitions from sourcing a line
• Provides tracking of application request history for optimal use of the on-chip protocol features
• Idle partitions can be used as a victim cache by other active partitions
• Allows the DAX to directly deposit the results of a data analytics query, or a message to be used by a consuming thread
• Pointer version per cache line for application data protection

On-Chip Network (OCN)
The OCN consists of three networks:
• Requests are broadcast on four address-plane rings, one per SMP gateway on chip
  – No queuing on the ring; maximum hop count is 11
• The Data network is constructed as a mesh with 10-ported switches
  – Provides ~512GBps of cross-section bandwidth
  – Unloaded latency <6ns
• The Response network is a point-to-point network that aggregates snoop responses from all partitions before delivery to the requester

SMP Coherence
• Distributed directory-based SMP coherence for 8-way glue-less and 32-way glued scaling
• Directory is implemented as an inclusive tag rather than an L3 mirror
  – Reduces the needed associativity in the tag
• Inclusive tag sized at 80MB, 20-way set associative, to minimize distributed cache evictions for throughput workloads
• Dynamic organization supports all socket counts from 1–8 in glue-less systems

OCN Protocol
The OCN protocol supports
• Mediated and Peer-to-Peer Requests, Memory Prefetch
  – Protocol choice based on sharing characteristics of a workload
• Dynamically Partitioned Cache
  – Fair re-distribution of the last level cache among a varying number of active threads
• DAX Allocation
  – Coherent communication protocol between core and DAX

Mediated and Peer-to-Peer Requests
The choice of request type is driven by the address-space sharing characteristics of a workload.
• Peer-to-Peer Requests
  – Broadcast to all partitions on the request network
  – Lowest latency path to get data from another partition (also referred to as a C2C transfer)
• Mediated Requests
  – Unicast to the coherence gateway; other agents ignore the request
  – The fastest and most power-efficient path to data from memory

Peer-to-Peer Request Flow
• L3 miss broadcast to all agents on the ring
• All L3 partitions look up their tag and respond with an Ack/Nack
• The supplier partition sources the data
• Lowest latency cross-partition transfer
• Turns into a mediated request on failure
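To make the Ack/Nack flow above concrete, here is a small software model of the decision: broadcast the miss, let every partition snoop its tag, take the cache-to-cache path if a supplier acks, and otherwise fall back to a mediated request. The function names, the toy tag lookup, and the printouts are hypothetical stand-ins for hardware behavior, not the actual protocol.

```c
/* Software model of the peer-to-peer request flow with mediated fallback.
 * Purely illustrative; all lookups and addresses are made up. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_PARTITIONS 8

/* Toy tag lookup: true if 'partition' holds 'line_addr' in supplier state.
 * Here only partition 5 supplies line 0x40, as an example. */
static bool partition_snoop_ack(int partition, unsigned long long line_addr)
{
    return line_addr == 0x40 && partition == 5;
}

static void cache_to_cache_transfer(int supplier, unsigned long long line_addr)
{
    printf("line 0x%llx: supplied cache-to-cache by partition %d\n",
           line_addr, supplier);
}

static void mediated_request(unsigned long long line_addr)
{
    printf("line 0x%llx: no supplier, retry as a mediated request to the SMP gateway\n",
           line_addr);
}

static void handle_l3_miss_peer_to_peer(unsigned long long line_addr)
{
    int supplier = -1;

    /* Broadcast on the request ring: every partition looks up its tag and
     * answers Ack or Nack. The per-line supplier state guarantees that at
     * most one partition acks. */
    for (int p = 0; p < NUM_PARTITIONS; p++)
        if (partition_snoop_ack(p, line_addr))
            supplier = p;

    if (supplier >= 0)
        cache_to_cache_transfer(supplier, line_addr);  /* lowest-latency C2C path */
    else
        mediated_request(line_addr);                   /* fall back to the memory path */
}

int main(void)
{
    handle_l3_miss_peer_to_peer(0x40);  /* served by another partition */
    handle_l3_miss_peer_to_peer(0x80);  /* falls back to a mediated request */
    return 0;
}
```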
Mediated Request Flow
• Unicast to the SMP gateway
• SMP directory lookup and memory prefetch
• Memory read sent on the request network to the appropriate MCU instance (not shown)
• Memory provides data to the requester
• Lowest latency path to memory

Dynamic Request
• Each partition tracks how its L3 misses are served and determines whether a mediated or a peer-to-peer request is appropriate
• A high cache-to-cache rate drives peer-to-peer requests
• A high memory return rate drives mediated requests
• Temporal sharing behavior is effectively tracked using a 64-bit history register (see the sketch after these slides)
• Code and data are tracked independently
(Diagram: per-partition data-supplier and code-supplier tracking, distinguishing lines returned from memory from lines supplied cache-to-cache.)

Memory Prefetch
• The partition taking an L3 miss can request a speculative memory read
• Supported for peer-to-peer and mediated requests
• Data returning from memory is preserved in a 64-entry buffer (256 entries per chip) until a memory read is received
• Allows for faster memory latency for workloads with low sharing
• Is dynamically enabled using code and data source tracking

Workload Caching Characteristics
• Individual Programs, Virtual Machines — Instructions: not shared; Data: not shared; Data source: instructions and data from memory; Performance needs: memory prefetch, predictable QoS
• Multiple Programs Working on Common Data, e.g. Database Shared Global Area (SGA) — Instructions: not shared; Data: shared; Data source: instructions from memory, some data as C2C; Performance needs: instruction memory prefetch, fast data C2C
• Partitioned Data Processing, e.g. Analytics — Instructions: shared; Data: not shared; Data source: instructions as C2C, data from memory; Performance needs: fast instruction C2C, data memory prefetch, high-bandwidth low-latency LLC hits, QoS/predictability
• Linked Lists, Control Data Structures — Instructions: shared; Data: shared; Data source: both instructions and data as C2C; Performance needs: fast instruction and data C2C

Dynamically Partitioned Cache
• For optimal use of the cache under variable load, a partition can use another partition as a victim cache
• A victim partition can either accept or reject the request based on a heuristic

Dynamically Partitioned Cache (workload active-threads timeline)
• Partitioned L3 Cache
  – Optimal for the "Throughput Duration" of the workload (large thread count, parallelized)
• Unified L3 Cache
  – Optimal for the "Serial Duration" of the workload (small thread count: result collection, garbage collection, scheduling)
• Dynamically Partitioned Cache
  – Adapts with workload behavior
  – L3 partitions join or disjoin based on the active threads in a partition
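The Dynamic Request slide above describes per-partition supplier tracking with a 64-bit history register. Below is a minimal sketch of that idea, assuming one history bit per miss (1 = served cache-to-cache, 0 = served from memory) and an assumed majority threshold for choosing peer-to-peer over mediated requests; the presentation specifies only the 64-bit register and independent code/data tracking, so the policy details here are illustrative.

```c
/* Sketch of supplier tracking for dynamic request selection. The shift-in
 * encoding and the >50% threshold are assumptions for illustration. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t data_history;  /* one bit per data miss: 1 = C2C, 0 = memory */
    uint64_t code_history;  /* instruction misses are tracked independently */
} supplier_tracker_t;

/* Shift one outcome bit into the 64-bit history register. */
static void record_miss(uint64_t *history, bool served_c2c)
{
    *history = (*history << 1) | (served_c2c ? 1u : 0u);
}

/* Prefer the peer-to-peer broadcast when most recent misses were served by
 * another partition; otherwise issue a mediated request and let memory
 * prefetch hide latency. */
static bool prefer_peer_to_peer(uint64_t history)
{
    return __builtin_popcountll(history) > 32;  /* GCC/Clang builtin popcount */
}

int main(void)
{
    supplier_tracker_t t = { 0, 0 };

    /* Example: a workload whose data misses are mostly served cache-to-cache. */
    for (int i = 0; i < 64; i++)
        record_miss(&t.data_history, i % 4 != 0);  /* ~75% C2C */

    printf("data misses favor a %s request\n",
           prefer_peer_to_peer(t.data_history) ? "peer-to-peer" : "mediated");
    return 0;
}
```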