Next Generation SPARC Processor Cache Hierarchy

Ram Sivaramakrishnan, Hardware Director
Sumti Jairath, Sr. Hardware Architect
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Presentation Outline
• Overview
• Partitioned L3 Cache – Design Objectives
• Core Cluster and L3 Partition
• On-Chip Network and SMP Coherence
• On-Chip Protocol Features
• Workload Performance Optimization
• Conclusion
Processor Overview

The next generation SPARC processor contains:
• 32 cores organized as 8 core clusters
• 64MB L3 cache
• 8 DDR4 memory schedulers (MCU) providing 160GBps of sustained memory bandwidth
• A coherency subsystem for 1- to 32-socket scaling
• A coherency and IO interconnect that provides >336GBps of peak bandwidth
• 8 Data Analytics Accelerators (DAX) for query acceleration and messaging
[Chip block diagram: eight core clusters surrounding the L3$ and on-chip network, with the coherence/SMP & I/O subsystem, accelerators, and memory control blocks on the periphery]
Why a Partitioned Last-Level Cache?

• A logically unified last-level cache
  – Latency scales up with core count and cache size
  – Benefits large shared workloads
  – Workloads with little or no sharing see the impact of the longer latency
• A partitioned shared last-level cache
  – A core can allocate only in its local partition
  – A thread sees a lower access latency to its local partition
  – Low-latency sharing between partitions makes them appear as a larger shared cache
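The tradeoff on this slide can be sketched with a toy latency model. All latencies below are illustrative placeholders, not figures from the talk; the point is only that a partitioned LLC wins when sharing is low, and that low-latency inter-partition transfers are what keep it competitive when sharing is high.

```python
# Toy model of the latency tradeoff behind a partitioned last-level cache.
# All latencies (ns) are illustrative placeholders, not figures from the talk.

def avg_llc_latency(local_hit_rate, remote_hit_rate,
                    lat_local=10, lat_remote=25, lat_memory=90):
    """Average L2-miss service time (ns) for a partitioned LLC.

    local_hit_rate  -- fraction of L2 misses served by the local partition
    remote_hit_rate -- fraction served by a peer partition (cache-to-cache)
    The remainder goes to memory.
    """
    miss_rate = 1.0 - local_hit_rate - remote_hit_rate
    return (local_hit_rate * lat_local
            + remote_hit_rate * lat_remote
            + miss_rate * lat_memory)

# A workload with little sharing mostly sees the fast local partition...
low_sharing = avg_llc_latency(local_hit_rate=0.85, remote_hit_rate=0.05)
# ...while a heavily shared workload leans on cross-partition transfers,
# which the low-latency inter-partition links keep cheap.
high_sharing = avg_llc_latency(local_hit_rate=0.40, remote_hit_rate=0.50)
```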
Last-Level Cache Design Objectives

• Minimize the effective L2 cache miss latency for response-time-critical applications
• Provide an on-chip interconnect for low-latency inter-partition communication
• Provide protocol features that enable the collection of partitions to appear as a larger shared cache
Core Cluster and Partition Overview
• Four S4 SPARC cores
  – 2-issue OOO pipeline and 8 strands per core
  – 16KB L1 I$ and a 16KB write-through L1 D$ per core
• 256KB 8-way L2 I$ shared by all four cores
• 256KB 8-way dual-banked write-back L2 D$ shared by a core pair
• Dual-banked 8MB 8-way L3$
• 41-cycle (<10ns) load-to-use latency to the L3
• >192GBps interface between a core cluster and its local L3 partition
[Cluster diagram: core0–core3 share one 256KB 8-way L2I; each core pair shares a 256KB 8-way L2D; the cluster attaches to its 8MB 8-way L3 partition]
Processor Block Diagram

• An On-Chip Network (OCN) that connects
  – 8 partitions
  – 4 SMP & IO gateways
  – 4 memory scheduler and 4 DAX instances on each side
• 64GBps per OCN port per direction
• Low-latency cross-partition access
[Block diagram: eight 4-core/8MB-L3 partitions arranged around the OCN, with four SMP & IO gateways and the MCUs & DAXs on either side]
Salient Features of the L3

• Supplier state per cache line to prevent multiple sharing partitions from sourcing a line
• Tracks application request history for optimal use of the on-chip protocol features
• Idle partitions can be used as a victim cache by other, active partitions
• Allows the DAX to directly deposit the results of a data analytics query, or a message, to be used by a consuming thread
• Pointer version per cache line for application data protection
On-Chip Network (OCN)

The OCN consists of three networks:
• Requests are broadcast on four address-planed rings, one per SMP gateway on chip
  – No queuing on the ring; maximum hop count is 11
• The data network is constructed as a mesh with 10-ported switches
  – Provides ~512GBps per cross-section
  – Unloaded latency <6ns
• The response network is a point-to-point network that aggregates snoop responses from all partitions before delivery to the requester
SMP Coherence

• Distributed directory-based SMP coherence for 8-way glueless and 32-way glued scaling
• The directory is implemented as an inclusive tag rather than an L3 mirror
  – Reduces the needed associativity in the tag
• The inclusive tag is sized at 80MB, 20-way set-associative, to minimize distributed cache evictions for throughput workloads
• A dynamic organization supports all socket counts from 1 to 8 in glueless systems
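As a sanity check on these numbers, the tag geometry can be worked out, assuming the conventional 64-byte cache line (an assumption; the slide does not state the line size). Note that the 80MB of coverage exceeds the 64MB of on-chip L3 per socket, which is what keeps distributed cache evictions rare.

```python
# Back-of-the-envelope geometry for the 80MB, 20-way inclusive directory tag.
# LINE_BYTES = 64 is an assumption; the slide does not state the line size.
LINE_BYTES = 64
COVERAGE_BYTES = 80 * 1024 * 1024              # cache capacity the tag tracks
WAYS = 20

lines_tracked = COVERAGE_BYTES // LINE_BYTES   # lines the directory can hold
sets = lines_tracked // WAYS                   # sets in the 20-way tag array
```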
OCN Protocol

The OCN protocol supports:
• Mediated and peer-to-peer requests, memory prefetch
  – The protocol choice is based on the sharing characteristics of a workload
• Dynamically partitioned cache
  – Fair redistribution of the last-level cache among a varying number of active threads
• DAX allocation
  – A coherent communication protocol between core and DAX
Mediated and Peer-to-Peer Requests

The choice depends on the address-space sharing characteristics of a workload.
• Peer-to-peer requests
  – Broadcast to all partitions on the request network
  – Lowest-latency path to get data from another partition (also referred to as a C2C transfer)
• Mediated requests
  – Unicast to the coherence gateway; other agents ignore the request
  – The fastest and most power-efficient path to data from memory
Peer-to-Peer Request Flow

• An L3 miss is broadcast to all agents on the ring
• All L3 partitions look up their tags and respond with an Ack/Nack
• The supplier partition sources the data
• Lowest-latency cross-partition transfer
• Turns into a mediated request on failure
[Diagram: the miss broadcast travels the request ring past all eight partitions, the four SMP gateways, and the MCUs; the supplier partition returns the line]
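The flow above can be sketched in a few lines of Python. The data structures (one dict per partition, one dict for memory) are stand-ins for real tag arrays and coherence state; the single-supplier assumption mirrors the per-line supplier state described earlier.

```python
# Minimal sketch of the peer-to-peer flow with fallback to a mediated
# request. Partitions and memory are modeled as plain dicts mapping line
# address -> data; real coherence state machines are far richer.

def peer_to_peer_request(addr, partitions, memory):
    """Broadcast addr to every partition; the supplier (if any) sources data.

    Returns (data, source) where source is 'c2c' or 'mediated'.
    """
    # Every partition looks up its tags and responds Ack/Nack; at most one
    # partition holds the line in the supplier state and sources it.
    for part in partitions:
        if addr in part:
            return part[addr], "c2c"
    # No supplier: the request turns into a mediated request, which the
    # SMP gateway resolves and memory satisfies.
    return memory[addr], "mediated"
```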
Mediated Request Flow

• Unicast to the SMP gateway
• SMP directory lookup and memory prefetch
• A memory read is sent on the request network to the appropriate MCU instance (not shown)
• Memory provides data to the requester
• Lowest-latency path to memory
[Diagram: the request travels point-to-point to an SMP gateway, which directs the read to an MCU; data returns from memory to the requesting partition]
Dynamic Request

• Each partition tracks how its L3 misses are served and determines whether a mediated or a peer-to-peer request is appropriate
• A high cache-to-cache rate drives peer-to-peer requests
• A high memory-return rate drives mediated requests
• Temporal sharing behavior is effectively tracked using a 64-bit history register
• Code and data are tracked independently
[Diagram: per-partition data-supplier and code-supplier tracking, distinguishing lines returned from memory from lines returned via cache-to-cache transfer]
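A minimal sketch of such a tracker, assuming a simple popcount-over-threshold decision (the actual decision logic and threshold are not disclosed in the talk). A partition would keep one instance for code and one for data.

```python
# Sketch of the dynamic-request heuristic: a 64-bit shift register records,
# per L3 miss, whether data arrived cache-to-cache (1) or from memory (0).
# The 50% threshold is a placeholder assumption, not from the talk.

class SupplierTracker:
    MASK = (1 << 64) - 1        # keep only the last 64 outcomes

    def __init__(self):
        self.history = 0

    def record(self, was_c2c: bool):
        """Shift the newest outcome into the 64-bit history register."""
        self.history = ((self.history << 1) | int(was_c2c)) & self.MASK

    def prefer_peer_to_peer(self) -> bool:
        """High cache-to-cache rate -> peer-to-peer; else -> mediated."""
        return bin(self.history).count("1") > 32
```

Because old outcomes shift out of the register, the tracker follows temporal changes in sharing behavior rather than a lifetime average.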
Memory Prefetch

• The partition that misses in the L3 can request a speculative memory read
• Supported for both peer-to-peer and mediated requests
• Data returning from memory is preserved in a 64-entry buffer (256 entries per chip) until a memory read is received
• Allows faster memory latency for workloads with low sharing
• Dynamically enabled using code and data source tracking
[Diagram: the prefetch buffer sits in the MCU between memory and the requesting partition]
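The buffer's behavior can be modeled as below. The 64-entry size is from the slide; the FIFO eviction policy and the dict-based modeling are assumptions for illustration.

```python
# Toy model of the 64-entry memory-prefetch buffer: a speculative read
# launched at L3-miss time parks its data until the demand memory read
# arrives, hiding part of the DRAM latency for low-sharing workloads.
# FIFO eviction is an assumption; the talk does not specify the policy.
from collections import OrderedDict

class PrefetchBuffer:
    def __init__(self, entries=64):
        self.entries = entries
        self.data = OrderedDict()              # addr -> prefetched line

    def speculative_read(self, addr, memory):
        """Launch a speculative memory read and park the returned line."""
        if addr not in self.data:
            if len(self.data) == self.entries:
                self.data.popitem(last=False)  # FIFO eviction of oldest line
            self.data[addr] = memory[addr]

    def demand_read(self, addr, memory):
        """The later demand read hits the buffer if the line is still parked."""
        if addr in self.data:
            return self.data.pop(addr), "buffer-hit"
        return memory[addr], "dram"
```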
Workload Caching Characteristics

| Workload Examples | Instructions | Data | Data Source | Performance Needs |
| Individual programs, virtual machines | Not shared | Not shared | Instructions and data from memory | Instruction and data memory prefetch; predictable QoS |
| Multiple programs working on common data – e.g. database Shared Global Area (SGA) | Not shared | Shared | Instructions from memory, some data as C2C | Instruction memory prefetch; fast data C2C |
| Partitioned data processing – e.g. analytics | Shared | Not shared | Instructions as C2C, data from memory | Fast instruction C2C; data memory prefetch; high-bandwidth, low-latency LLC hits; QoS – predictability |
| Linked lists, control data structures | Shared | Shared | Both instructions and data as C2C | Fast instruction and data C2C |
Dynamically Partitioned Cache

• For optimal use of the cache under variable load, a partition can use another partition as a victim cache
• A victim partition can either accept or reject the request based on a heuristic
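The accept/reject heuristic is not disclosed in the talk; one plausible sketch is a partition that admits foreign victim lines only while it is idle or barely using its own capacity. Both thresholds below are invented for illustration.

```python
# Hypothetical victim-partition admission heuristic: an idle partition
# absorbs a peer's evictions, a busy one rejects them. The thresholds and
# the "local miss rate" signal are assumptions, not from the talk.

def accept_victim_line(active_threads, local_miss_rate,
                       thread_limit=0, miss_rate_limit=0.05):
    """Return True if this partition should cache a peer's evicted line."""
    idle = active_threads <= thread_limit          # no local threads to serve
    barely_used = local_miss_rate <= miss_rate_limit  # local working set fits
    return idle or barely_used
```

A heuristic of this shape gives the behavior the next slides describe: partitions lend capacity during the serial duration and reclaim it when their own threads wake up.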
Dynamically Partitioned Cache

• Partitioned L3 cache
  – Optimal for the "throughput duration" of the workload (large thread count, parallelized)
• Unified L3 cache
  – Optimal for the "serial duration" of the workload (small thread count – result collection, garbage collection, scheduling)
• Dynamically partitioned cache
  – Adapts with workload behavior
  – L3 partitions join or disjoin based on the active threads in the partition
[Timeline: the workload's active-thread count alternates between throughput durations (threads T0…Tn) and serial durations (a single thread T1)]
Dynamically Partitioned Cache – Example

• An active workload on a partition evicts cache lines
• The evictions are absorbed by other, idle partitions
• This increases the last-level-cache size available to the active threads
[Diagram: during the throughput duration each partition serves its own threads; during the serial duration the one active partition spills its evictions into the idle partitions]
Core-DAX Handshake

• The core-DAX handshake is cache coherent
• Data produced by cores can be read from the cache or memory by the DAX
• Results of the DAX operation are deposited in the cache or memory for consumption by the core
[Diagram: a core cluster sends a command to the DAX; the DAX reads operands through the L3 partitions and returns results the same way]
DAX Allocation

• The DAX can directly deposit messages and the results of a query into any L3 partition
• DAX DMA buffers are reused in order to mitigate dirty evictions from the cache
  – A probe instruction is used to determine whether the buffer was previously allocated in any partition
  – If so, that partition is chosen for allocation
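The probe-then-allocate sequence can be sketched as below. The partition dicts are simplified stand-ins for real cache state, and the `default_partition` fallback (e.g. the consuming thread's partition, per the next slide) is an assumption.

```python
# Sketch of the DAX buffer-reuse probe: before depositing results, the DAX
# probes all partitions for a previous allocation of the DMA buffer and
# reuses that partition, avoiding dirty evictions from a fresh allocation.

def dax_allocate(buffer_addr, results, partitions, default_partition=0):
    """Deposit DAX results into the partition that already holds the buffer.

    partitions -- list of dicts (line addr -> data), one per L3 partition
    Returns the index of the partition chosen for allocation.
    """
    # Probe: was this buffer previously allocated in any partition?
    for idx, part in enumerate(partitions):
        if buffer_addr in part:
            part[buffer_addr] = results   # reuse; no new dirty lines elsewhere
            return idx
    # Otherwise fall back to a default (assumed: the consumer's partition).
    partitions[default_partition][buffer_addr] = results
    return default_partition
```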
Dynamic Core-DAX Pairing

• The DAX broadcasts a probe request to discover the recipient thread's partition
• Results from the DAX (e.g. decompressed data) are deposited in the recipient thread's partition
[Diagram: left, the DAX's cache probe locates the partition holding the recipient thread; right, the DAX decompresses the compressed data and cache-allocates the results into that partition]
Conclusion

The next generation SPARC processor cache hierarchy:
• Significantly improves the SOC response time over the previous generation
• Provides a low-latency, high-bandwidth, dynamically partitioned cache
  – Actively tracks workload behavior
  – Applies the appropriate flavor of the on-chip protocol for optimal performance
• Enables a low-latency, high-bandwidth DAX interface for effective off-load of acceleration tasks