Next Generation SPARC Hierarchy

Ram Sivaramakrishnan, Hardware Director

Sumti Jairath, Sr. Hardware Architect

Copyright © 2014 Oracle and/or its affiliates. All rights reserved.

Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Presentation Outline

• Overview
• Partitioned L3 Cache: Design Objectives
• Core Cluster and L3 Partition
• On-Chip Network and SMP Coherence
• On-Chip Protocol Features
• Workload Performance Optimization
• Conclusion

Processor Overview

The next generation SPARC processor contains:

• 32 cores organized as 8 core clusters
• 64MB L3 cache
• 8 DDR4 memory schedulers (MCU) providing 160GBps of sustained memory bandwidth
• A coherency subsystem for 1 to 32 socket scaling
• A coherency and IO interconnect that provides >336GBps of peak bandwidth
• 8 Data Analytics Accelerators (DAX) for query acceleration and messaging

[Block diagram: eight core clusters surrounding the L3$ and on-chip network, with the coherence, SMP & I/O subsystem, memory control, and accelerators]

Why a Partitioned Last Level Cache?

• A Logically Unified Last Level Cache
  – Latency scales up with core count and cache size
  – Benefits large shared workloads
  – Workloads with little or no sharing see the impact of longer latency
• Partitioned Shared Last Level Cache
  – A core can allocate only in its local partition
  – A thread sees a lower access latency to its local partition
  – Low latency sharing between partitions makes them appear as a larger shared cache

Last Level Cache Design Objectives

• Minimize the effective L2 cache miss latency for response-time-critical applications
• Provide an on-chip interconnect for low latency inter-partition communication
• Provide protocol features that enable the collection of partitions to appear as a larger shared cache

Core Cluster and Partition Overview

• Four S4 SPARC cores
  – 2-issue OOO pipeline and 8 strands per core
  – 16KB L1 I$ and a 16KB write-through L1 D$ per core
• 256KB 8-way L2 I$ shared by all four cores
• 256KB 8-way dual-banked write-back L2 D$ shared by a core pair
• Dual-banked 8MB 8-way L3$
• 41 cycle (<10ns) load-to-use latency to the L3 (see the arithmetic check below)
• >192GBps interface between a core cluster and its local L3 partition

[Cluster diagram: core0–core3 share a 256KB 8-way L2I$; each core pair shares a 256KB 8-way L2D$; the cluster connects to its 8MB 8-way L3 partition]
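The figures above and on the Processor Overview slide are mutually consistent; as a quick back-of-the-envelope check, the small C sketch below recomputes the chip-level totals. The implied clock frequency is an inference from the 41-cycle / <10ns load-to-use figure, not a number stated in the deck.

```c
#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* Figures taken from the slides. */
    const int clusters           = 8;     /* core clusters per chip       */
    const int cores_per_cluster  = 4;     /* S4 cores per cluster         */
    const int strands_per_core   = 8;     /* hardware threads per core    */
    const int l3_mb_per_cluster  = 8;     /* local L3 partition size (MB) */
    const int l3_load_use_cycles = 41;    /* L3 load-to-use latency       */
    const double l3_load_use_ns  = 10.0;  /* stated upper bound (<10 ns)  */

    /* Chip-level totals quoted on the overview slide. */
    assert(clusters * cores_per_cluster == 32);               /* 32 cores */
    assert(clusters * l3_mb_per_cluster == 64);               /* 64MB L3  */
    printf("hardware threads per chip: %d\n",
           clusters * cores_per_cluster * strands_per_core);  /* 256      */

    /* 41 cycles in under 10 ns implies a core clock above ~4.1 GHz
       (an inference, not a number from the deck).                   */
    printf("implied minimum clock: %.2f GHz\n",
           l3_load_use_cycles / l3_load_use_ns);
    return 0;
}
```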

Processor Block Diagram

• An On-Chip Network (OCN) that connects
  – 8 partitions
  – 4 SMP & IO Gateways
  – 4 memory scheduler (MCU) and 4 DAX instances on each side
• 64GBps per OCN port per direction
• Low latency cross-partition access

[Block diagram: eight 4-core clusters with their 8MB L3 partitions and four SMP & IO gateways around the on-chip network, with MCUs and DAXs on each side]

Salient Features of the L3

• Supplier state per cache line to prevent multiple sharing partitions from sourcing a line (see the metadata sketch below)
• Provides tracking of application request history for optimal use of the on-chip protocol features
• Idle partitions can be used as a victim cache by other active partitions
• Allows the DAX to directly deposit the results of a data analytics query or a message to be used by a consuming thread
• Pointer version per cache line for application data protection
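For illustration only, here is a minimal sketch of what per-line L3 metadata carrying a supplier bit and a pointer version could look like. The coherence states, field names, and widths are assumptions for readability; the slide does not describe the actual tag format.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical coherence states for an L3 line; the real protocol
 * states are not described on the slide.                          */
typedef enum { L3_INVALID, L3_SHARED, L3_EXCLUSIVE, L3_MODIFIED } l3_state_t;

/* Sketch of per-line L3 metadata; field names and widths are
 * illustrative, not the actual tag format.                        */
typedef struct {
    uint32_t   tag;          /* address tag                                  */
    l3_state_t state;        /* coherence state                              */
    bool       supplier;     /* set in exactly one sharing partition, so only
                                that partition sources the line on a
                                peer-to-peer request                         */
    uint8_t    ptr_version;  /* pointer version checked on access for
                                application data protection                  */
} l3_line_meta_t;

/* Only the supplier copy responds with data; other sharers just Ack/Nack. */
static bool should_source_data(const l3_line_meta_t *m)
{
    return m->state != L3_INVALID && m->supplier;
}

int main(void)
{
    l3_line_meta_t sharer   = { .tag = 0x1A, .state = L3_SHARED,
                                .supplier = false, .ptr_version = 3 };
    l3_line_meta_t supplier = sharer;
    supplier.supplier = true;

    printf("plain sharer sources data:  %s\n",
           should_source_data(&sharer)   ? "yes" : "no");
    printf("supplier copy sources data: %s\n",
           should_source_data(&supplier) ? "yes" : "no");
    return 0;
}
```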

On-Chip Network (OCN)

The OCN consists of three networks:
• Requests are broadcast on four address-plane rings, one per SMP gateway on chip
  – No queuing on the ring; maximum hop count is 11
• The Data network is constructed as a mesh with 10-ported switches
  – Provides ~512GBps per cross-section (see the bandwidth arithmetic below)
  – Unloaded latency <6ns
• The Response network is a point-to-point network that aggregates snoop responses from all partitions before delivery to the requester
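A small consistency check, under the assumption that the quoted ~512GBps data-mesh cross-section is built from the same 64GBps ports used elsewhere in the OCN; the derived port count is an inference, not a figure from the slide.

```c
#include <stdio.h>

int main(void)
{
    const double port_gbps     = 64.0;   /* per OCN port, per direction (slide) */
    const double xsection_gbps = 512.0;  /* ~ data-mesh cross-section (slide)   */
    const int    max_ring_hops = 11;     /* request-ring worst case (slide)     */

    /* If the cross-section is made of standard 64GBps ports, it spans
       about 8 of them in each direction (an inference).               */
    printf("ports per cross-section: %.0f\n", xsection_gbps / port_gbps);
    printf("worst-case request-ring hops: %d\n", max_ring_hops);
    return 0;
}
```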

SMP Coherence

• Distributed directory-based SMP coherence for 8-way glue-less and 32-way glued scaling
• Directory is implemented as an inclusive tag rather than an L3 mirror (a lookup sketch follows below)
  – Reduces the needed associativity in the tag
• Inclusive tag sized at 80MB, 20-way set associative, to minimize distributed cache evictions for throughput workloads
• Dynamic organization supports all socket counts from 1-8 in glue-less systems
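A schematic sketch of how an SMP gateway might consult an inclusive-tag directory, based only on the bullets above. The entry layout, the socket-ID encoding, and the miss handling are illustrative assumptions, not the documented implementation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DIR_WAYS 20            /* 20-way set associative inclusive tag (slide) */

/* One directory entry: which socket currently holds the line.
 * Real entries would carry more state; this is a sketch.        */
typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  owner_socket;     /* 0..7 for 8-way glue-less scaling */
} dir_entry_t;

typedef struct {
    dir_entry_t way[DIR_WAYS];
} dir_set_t;

/* On an incoming request, the gateway checks the inclusive tag:
 * hit  -> snoop only the owning socket,
 * miss -> the line is not cached anywhere it tracks, read memory. */
static void gateway_handle_request(const dir_set_t *set, uint64_t tag)
{
    for (int w = 0; w < DIR_WAYS; w++) {
        if (set->way[w].valid && set->way[w].tag == tag) {
            printf("directory hit: snoop socket %u\n",
                   (unsigned)set->way[w].owner_socket);
            return;
        }
    }
    printf("directory miss: issue memory read\n");
}

int main(void)
{
    dir_set_t set = {0};
    set.way[3] = (dir_entry_t){ .valid = true, .tag = 0xABC, .owner_socket = 5 };
    gateway_handle_request(&set, 0xABC);   /* hit  -> snoop socket 5 */
    gateway_handle_request(&set, 0x123);   /* miss -> memory read    */
    return 0;
}
```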

OCN Protocol

The OCN protocol supports:
• Mediated and Peer-to-Peer Requests, Memory Prefetch
  – Protocol choice based on sharing characteristics of a workload
• Dynamically Partitioned Cache
  – Fair re-distribution of the last level cache among a varying number of active threads
• DAX Allocation
  – Coherent communication protocol between core and DAX

Mediated and Peer-to-Peer Requests

The choice is based on the address space sharing characteristics of a workload:
• Peer-to-Peer Requests
  – Broadcast to all partitions on the request network
  – Lowest latency path to get data from another partition (also referred to as a C2C transfer)
• Mediated Requests
  – Unicast to the coherence gateway; other agents ignore the request
  – The fastest and most power-efficient path to data from memory

Peer-to-Peer Request Flow

• L3 miss broadcast to all agents on the ring (see the flow sketch below)
• All L3 partitions look up their tag and respond with an Ack/Nack
• Supplier partition sources data
• Lowest latency cross-partition transfer
• Turns into a mediated request on failure

[Diagram: a requesting cluster's miss is broadcast across the ring of L3 partitions, SMP gateways, and MCUs; the supplier partition returns the data]
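The sketch below mirrors the flow described on this slide and the mediated-request slide that follows: broadcast the miss, collect Ack/Nack from every partition, take the data from the supplier, and fall back to a mediated request if no partition can supply. The types and helper functions (partition_snoop, mediated_request) are illustrative stand-ins, not real interfaces.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_PARTITIONS 8

/* Snoop response from one L3 partition after its tag lookup. */
typedef struct {
    bool ack;   /* true: this partition is the supplier and will
                   source the data (C2C transfer)                 */
} snoop_resp_t;

/* Toy stand-in for a partition's tag lookup: pretend partition 5
 * supplies even addresses.                                        */
static snoop_resp_t partition_snoop(int part, unsigned long addr)
{
    return (snoop_resp_t){ .ack = (part == 5 && addr % 2 == 0) };
}

/* Unicast to the SMP gateway: directory lookup plus memory read. */
static void mediated_request(unsigned long addr)
{
    printf("0x%lx: mediated request to gateway -> memory\n", addr);
}

static void peer_to_peer_request(unsigned long addr)
{
    int supplier = -1;

    /* Broadcast on the request ring: every partition looks up its
       tag and answers Ack (will supply) or Nack.                   */
    for (int p = 0; p < NUM_PARTITIONS; p++) {
        if (partition_snoop(p, addr).ack)
            supplier = p;
    }

    if (supplier >= 0) {
        /* Lowest-latency cross-partition (C2C) transfer. */
        printf("0x%lx: data sourced by partition %d (C2C)\n", addr, supplier);
    } else {
        /* No supplier on chip: the request turns into a mediated one. */
        mediated_request(addr);
    }
}

int main(void)
{
    peer_to_peer_request(0x40);   /* supplied by a peer partition */
    peer_to_peer_request(0x41);   /* falls back to mediated       */
    return 0;
}
```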

Mediated Request Flow

• Unicast to the SMP gateway
• SMP directory lookup and Memory Prefetch
• Memory read sent on the request network to the appropriate MCU instance (not shown)
• Memory provides data to the requester
• Lowest latency path to memory

[Diagram: the requesting cluster's miss goes to one SMP gateway, which looks up the directory and has the MCU return data from memory]

Dynamic Request

• Each L3 partition tracks how its misses are served and determines whether a mediated or peer-to-peer request is appropriate
• A high cache-to-cache rate drives peer-to-peer requests
• A high memory return rate drives mediated requests
• Temporal sharing behavior is effectively tracked using a 64-bit history register (modeled in the sketch below)
• Code and Data are tracked independently

[Diagram: per-partition Data-Supplier and Code-Supplier tracking, recording for each miss whether code/data came from memory or via a cache-to-cache transfer]
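A minimal model of the supplier tracking described above: one 64-bit history register for code and one for data, shifted on every L3 miss with a 1 for a cache-to-cache return and a 0 for a memory return. The 50% popcount threshold is an assumption; the slide only says that a high C2C rate drives peer-to-peer requests and a high memory-return rate drives mediated requests.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Per-partition tracking state: code and data are tracked independently. */
typedef struct {
    uint64_t code_history;   /* bit = 1: miss served cache-to-cache (C2C) */
    uint64_t data_history;   /* bit = 0: miss served from memory          */
} supplier_tracker_t;

/* Record how the latest L3 miss was served. */
static void track_miss(uint64_t *history, bool served_c2c)
{
    *history = (*history << 1) | (served_c2c ? 1u : 0u);
}

/* Count the C2C returns among the last 64 misses. */
static int c2c_count(uint64_t history)
{
    int n = 0;
    while (history) { n += (int)(history & 1u); history >>= 1; }
    return n;
}

/* Choose the request flavor for the next miss.  The 50% threshold is an
 * illustrative assumption, not a documented value.                       */
static bool prefer_peer_to_peer(uint64_t history)
{
    return c2c_count(history) > 32;
}

int main(void)
{
    supplier_tracker_t t = {0};

    /* Simulate a data-sharing phase: most data misses come back C2C. */
    for (int i = 0; i < 64; i++)
        track_miss(&t.data_history, i % 4 != 0);

    printf("data misses -> %s requests\n",
           prefer_peer_to_peer(t.data_history) ? "peer-to-peer" : "mediated");
    printf("code misses -> %s requests\n",
           prefer_peer_to_peer(t.code_history) ? "peer-to-peer" : "mediated");
    return 0;
}
```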

Memory Prefetch

• The L3 miss partition can request a speculative memory read
• Supported for peer-to-peer and mediated requests
• Data returning from memory is preserved in a 64-entry (256 entries per chip) buffer until a memory read is received (see the buffer sketch below)
• Allows for faster memory latency for workloads with low sharing
• Is dynamically enabled using code and data source tracking

[Diagram: the missing cluster's L3 partition sends the speculative read through the SMP gateway to the MCU, which holds the returning data in a prefetch buffer in front of memory]
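A sketch of the speculative-read buffer behavior described above: data returned by the speculative memory read is parked in a small buffer and handed over when the demand memory read arrives. The entry format, the indexing, and the single-use policy are assumptions; only the 64-entry sizing comes from the slide.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PREFETCH_ENTRIES 64   /* per buffer; 256 per chip (slide) */

typedef struct {
    bool     valid;
    uint64_t addr;
    uint8_t  line[64];        /* one cache line worth of data */
} prefetch_entry_t;

static prefetch_entry_t buf[PREFETCH_ENTRIES];

/* Speculative read issued alongside the coherence lookup: park the
 * returning data until (and unless) a demand memory read shows up.  */
static void speculative_fill(uint64_t addr, const uint8_t *data)
{
    int slot = (int)(addr % PREFETCH_ENTRIES);       /* placement: assumption */
    buf[slot].valid = true;
    buf[slot].addr  = addr;
    memcpy(buf[slot].line, data, sizeof buf[slot].line);
}

/* Demand memory read: if the speculative data is waiting, return it
 * immediately instead of paying the full DRAM latency again.         */
static bool demand_read(uint64_t addr, uint8_t *out)
{
    int slot = (int)(addr % PREFETCH_ENTRIES);
    if (buf[slot].valid && buf[slot].addr == addr) {
        memcpy(out, buf[slot].line, sizeof buf[slot].line);
        buf[slot].valid = false;                     /* entry is consumed */
        return true;                                 /* fast path         */
    }
    return false;                                    /* go to DRAM        */
}

int main(void)
{
    uint8_t line[64] = { 0x5A }, out[64];
    speculative_fill(0x1000, line);
    printf("demand read hit prefetch buffer: %s\n",
           demand_read(0x1000, out) ? "yes" : "no");
    return 0;
}
```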

Workload Caching Characteristics

| Workload Examples | Instructions | Data | Data Source | Performance Needs |
| Individual Programs, Virtual Machines | Not Shared | Not Shared | Instructions and Data from Memory | Instruction and Data Memory Prefetch; Predictable QoS |
| Multiple Programs Working on Common Data – e.g. Database Shared Global Area (SGA) | Not Shared | Shared | Instructions from Memory, some Data as C2C | Instruction Memory Prefetch; Fast Data C2C |
| Partitioned Data Processing – e.g. Analytics | Shared | Not Shared | Instructions as C2C, Data from Memory | Fast Instruction C2C; Data Memory Prefetch; High Bandwidth Low-Latency LLC Hits; QoS - Predictability |
| Linked Lists, Control Data Structures | Shared | Shared | Both Instructions and Data as C2C | Fast Instruction and Data C2C |

Dynamically Partitioned Cache

• For optimal use of the cache under variable load, a partition can use another partition as a victim cache
• A victim partition can either accept or reject the request based on a heuristic (an illustrative heuristic is sketched below)
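The deck does not disclose the heuristic, so the sketch below is just one plausible policy consistent with the later example slide: a partition accepts victim lines while its own threads are idle and it has not already given away most of its capacity, and rejects them otherwise. All names, counters, and thresholds here are assumed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative per-partition state used by the accept/reject decision. */
typedef struct {
    int      active_local_threads;  /* threads running on this partition's
                                       core cluster                        */
    uint32_t remote_victim_lines;   /* lines held on behalf of other
                                       partitions                          */
    uint32_t capacity_lines;        /* total lines in the 8MB partition    */
} partition_state_t;

/* Accept a victim line only when the partition is effectively idle and
 * has not already given most of its capacity away (assumed policy).     */
static bool accept_victim(const partition_state_t *p)
{
    const uint32_t remote_limit = (p->capacity_lines * 3) / 4;  /* assumption */
    return p->active_local_threads == 0 &&
           p->remote_victim_lines < remote_limit;
}

int main(void)
{
    /* 8MB / 64B lines = 131072 lines per partition. */
    partition_state_t idle   = { .active_local_threads = 0,
                                 .remote_victim_lines  = 1000,
                                 .capacity_lines       = 131072 };
    partition_state_t active = idle;
    active.active_local_threads = 8;

    printf("idle partition accepts victim:   %s\n",
           accept_victim(&idle)   ? "yes" : "no");
    printf("active partition accepts victim: %s\n",
           accept_victim(&active) ? "yes" : "no");
    return 0;
}
```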

Dynamically Partitioned Cache

• Partitioned L3 Cache: optimal for the "Throughput Duration" of the workload
• Unified L3 Cache: optimal for the "Serial Duration" of the workload
• Dynamically Partitioned Cache: adapts with workload behavior; L3 partitions join or disjoin based on the active threads in each partition

[Workload Active-Threads Timeline: a Throughput Duration (large thread count, parallelized workload), then a Serial Duration (small thread count: result collection, garbage collection, scheduling), then another Throughput Duration]

Dynamically Partitioned Cache - Example

• An active workload on a partition evicts cache lines
• Evictions are absorbed by other idle partitions
• Increases the Last-Level-Cache size for active threads

[Diagram: Throughput Duration vs. Serial Duration; during the Serial Duration the L3 partitions of idle (ZZZ…) clusters absorb evictions from the one partition with active threads]

Core-DAX Handshake

• The Core-DAX handshake is cache coherent
• Data produced by cores can be read from the cache or memory by the DAX
• Results of the DAX operation are deposited in the cache or memory for consumption by the core (see the handshake sketch below)

[Diagram: a cluster's L3 partition sends a Command to the DAX; the DAX deposits Results into an L3 partition]
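A schematic view of the coherent command/result handshake described above, using ordinary shared-memory structures as stand-ins: the core fills a command descriptor that the DAX reads, and the DAX deposits its result where the consuming thread will find it. The descriptor layout, the done flag, and the bit-count operation are illustrative, not the actual core-DAX interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative command descriptor shared between a core and a DAX. */
typedef struct {
    const uint8_t *src;        /* data produced by the core             */
    size_t         src_len;
    uint8_t       *dst;        /* where results should be deposited     */
    volatile bool  done;       /* set by the DAX when results are ready */
} dax_command_t;

/* Stand-in for the DAX: reads the (coherent) source data and deposits
 * a result for the consuming thread.  Here it just counts set bits.   */
static void dax_execute(dax_command_t *cmd)
{
    uint8_t count = 0;
    for (size_t i = 0; i < cmd->src_len; i++)
        for (uint8_t b = cmd->src[i]; b; b >>= 1)
            count += b & 1u;
    cmd->dst[0] = count;       /* result lands in cache/memory */
    cmd->done   = true;
}

int main(void)
{
    uint8_t data[4] = { 0xFF, 0x0F, 0x01, 0x00 };  /* produced by a core  */
    uint8_t result  = 0;
    dax_command_t cmd = { data, sizeof data, &result, false };

    dax_execute(&cmd);                              /* core -> DAX -> core */
    if (cmd.done)
        printf("DAX result consumed by core: %u set bits\n", result);
    return 0;
}
```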

DAX Allocation

• The DAX can directly deposit messages and results of a query into any L3 partition
• DAX DMA buffers are reused in order to mitigate dirty evictions from the cache
  – A probe instruction is used to determine if the buffer was previously allocated in any partition
  – If so, that partition is chosen for allocation (a probe-then-allocate sketch follows below)
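A sketch of the probe-then-allocate decision from this slide and the pairing slide that follows: before depositing a result, the DAX probes the partitions for the destination buffer, and if one already holds it, the result is allocated there. The probe is modeled as a plain function and the fallback choice is an assumption; the real mechanism is a probe instruction the deck does not detail further.

```c
#include <stdio.h>

#define NO_PARTITION (-1)

/* Stand-in for the broadcast cache probe: in hardware, every L3
 * partition checks whether it holds the DMA buffer and answers.
 * Here the answer is supplied directly for demonstration.          */
static int probe_partitions(int partition_holding_buffer)
{
    return partition_holding_buffer;          /* NO_PARTITION if uncached */
}

/* Allocation policy sketched from the slide: deposit results in the
 * partition that already holds the (reused) buffer; otherwise fall
 * back to a default partition (the fallback choice is an assumption). */
static int choose_allocation_partition(int probe_result, int default_part)
{
    return (probe_result != NO_PARTITION) ? probe_result : default_part;
}

int main(void)
{
    /* Buffer was last used by a thread whose partition is #5. */
    printf("reused buffer -> allocate in partition %d\n",
           choose_allocation_partition(probe_partitions(5), 0));

    /* Fresh buffer, cached nowhere: use the default partition. */
    printf("fresh buffer  -> allocate in partition %d\n",
           choose_allocation_partition(probe_partitions(NO_PARTITION), 0));
    return 0;
}
```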

Dynamic Core-DAX Pairing

• The DAX broadcasts a probe request to discover the recipient thread's partition
• Results from the DAX are deposited in the recipient thread's partition

[Diagram: a DAX decompressing data issues a Cache Probe to find the recipient thread's partition, then performs a Cache Allocate of the results into that partition]

Conclusion

The next generation SPARC processor cache hierarchy:
• Significantly improves the SOC response time over the previous generation
• Provides a low latency, high bandwidth, dynamically partitioned cache
  – Actively tracks workload behavior
  – Applies the appropriate flavor of the on-chip protocol for optimal performance
• Enables a low latency, high bandwidth DAX interface for effective off-load of acceleration tasks
