IBM zEC12 Processor Subsystem: The Foundation for a Highly Reliable, High Performance Mainframe SMP System
Robert Sonnelitter, System z Processor Development, Systems & Technology Group, IBM Corp.
[email protected]
© 2013 IBM Corporation

IBM zEC12 Processor Subsystem
- Historical Background
- Performance Characteristics
- CP/L3 Design Highlights
- SC/L4 Design Highlights
- System RAS Features

System z Shared Cache History
- 1990 – Fully shared second-level cache
- 1995 – Clustered shared L2
- 1997 – Distributed cache topology
- 1998 – Bi-nodal distributed cache design
- 2003 – Modular nodal design, ring topology
- 2008 – Three-level cache hierarchy, fully connected topology
- 2010 – Four-level cache hierarchy, eDRAM caches

[Timeline: machine generations and SMP width]
  Year  Machine  SMP width
  1990  H2       6-way
  1993  H5       6-way
  1995  H6       10-way
  1994  G1       6-way
  1995  G2       10-way
  1996  G3       10-way
  1997  G4       10-way
  1998  G5       10-way
  1999  G6       12-way
  2000  z900     20-way
  2003  z990     32-way
  2005  z9       54-way
  2008  z10      64-way
  2010  z196     80-way
  2012  zEC12    101-way

Customer Environment & Workload Characteristics
- Highly virtualized workloads
  – Heavily shared system environment
  – Sustained high processor utilization
  – Tasks dynamically dispatched across the system
- Large single-image workloads
- High data sharing across processors
- Response-time-sensitive workloads
- Large memory footprints
- Extremely high system reliability
- z/OS, z/VM, z/VSE, TPF, and zLinux operating systems

Performance Benchmarks
- Ensure per-thread and SMP performance growth with increased system capacity
- Guarantee customer performance targets with constant software
- Workloads
  – Large System Performance Reference (LSPR): HIDI / MIDI / LODI; CB-L / WASDB / OLTP / etc.
  – Internal custom stressors
  – External benchmarks
  – Only LSPR metrics are published

[Charts: System Capacity and Per-Thread Performance, both growing across z9 (90nm), z10 (65nm), z196 (45nm), and zEC12 (32nm)]

Design / Performance Intersection
- Private core cache
  – Low-latency, high-bandwidth access
  – Caching for performance-critical data
- High-capacity, fully shared system caches
  – Fast shared latency and high bandwidth for smooth SMP scaling
  – Caching for cross-processor sharing and reentrant data use
- Tiered, clustered, multi-level cache structure
  – Localizes processor and cache affinity
  – Ensures consistent SMP scaling within a book
  – Interconnect bandwidth matched to caching effects
- Distributed switch design
  – Ensures SMP flatness as the system scales up
  – Balanced system effects

[Chart: Performance Contributions: % of Data Source and % of Total CPI (0–60%) by hierarchy level: L2, L3, L4, R4, Memory; *LSPR workloads]
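The per-level CPI contributions in the chart above can be reasoned about with a standard weighted-latency model. The sketch below is illustrative only: the hit fractions, latencies, and miss rate are invented placeholders, not zEC12 measurements (IBM publishes only LSPR metrics); only the five-level breakdown (L2/L3/L4/R4/Memory) comes from the deck.

```c
/* Minimal sketch of a finite-cache CPI adder. All numbers below are
 * HYPOTHETICAL placeholders chosen for illustration. */
#include <stdio.h>

int main(void) {
    const char *level[] = { "L2", "L3", "L4", "R4", "Memory" };
    /* fraction of L1-miss data sourced from each level (sums to 1.0) */
    double src[]    = { 0.55, 0.25, 0.10, 0.05, 0.05 };
    /* illustrative access latencies in core cycles */
    double cycles[] = { 25, 75, 200, 350, 600 };
    double misses_per_instr = 0.01;  /* hypothetical L1 misses per instruction */

    double cpi_adder = 0.0;
    for (int i = 0; i < 5; i++) {
        double c = misses_per_instr * src[i] * cycles[i];
        printf("%-6s sources %4.0f%% of misses -> %.3f CPI\n",
               level[i], 100.0 * src[i], c);
        cpi_adder += c;
    }
    printf("Total finite-cache CPI adder: %.3f\n", cpi_adder);
    return 0;
}
```

Under a model like this, moving source percentage from the memory and R4 columns into the on-node L3/L4 columns is what flattens the CPI curve as the SMP scales, which is the design intersection the slide is arguing for.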
Logical Node Overview

[Figure: one node: 36 cores, each with private L1 caches and a private 2MB L2; six 48MB eDRAM inclusive L3s, each serving six L2s; one shared 384MB eDRAM inclusive L4. LRU cast-outs and CP stores flow from the L3s down to the L4; data fetch returns flow back up]

Logical System Overview

[Figure: four nodes (Node0–Node3), each pairing an L4 with its six L3s; the four L4s are fully interconnected]

Node Multi-Chip Module (MCM)

[Photo: the node MCM carrying the CP and SC chips]

CP (Central Processor) Chip Overview
- 6 cores
- 48 MB shared eDRAM L3
- 32 nm SOI technology
- 5.5 GHz constant core frequency
- 4:1 L3 clock gear ratio
- 2.75 billion transistors
- 7.68 miles of wire
- 598 mm²

[Die photo: four L3 cache quadrants (L3Q) labeled]

L3 Cache Features
- Two independent slices, selected by a low-order address bit (see the address-decode sketch at the end of this CP/L3 section)
- Each slice connects to the six cores, the memory controller (MC), and the I/O controllers
- 12-way set associative
- 16k congruence classes
- Byte-merge stations for DMA partial-line operations
- HW accelerators for page-based operations
- Recent Store History for preemptive exclusive data access

[Figure: CP chip floorplan: six cores surrounding the two L3 slices (L30, L31), with interfaces to the two L4 chips (L40, L41) and the memory controller (MC)]

L3 Pipeline

[Figure: L3 pipeline, cycles C-1 (request) and C0 (priority) through C6: directory lookup; hit results available; cache model update; resource checking with reject responses to L2; eDRAM cache access; cache ECC check/correct; intervention broadcast to L4; store to eDRAM; data return to L2]

L3 eDRAM Structure
- Two independent banks
- Four-way interleave per bank
- Three-cycle eDRAM busy time
- eDRAM busy time matched to data bus busy time
- 256B cache line spread across the four interleaves

[Figure: an L3 bank (L3B) with interleaves ILV0–ILV3; line sections DW0–DW3 map across the four interleaves]

L3 eDRAM Management
- eDRAM busy time vs. write-back cache fetches
  – Stores from the cores represent the majority of operations processed by the L3
- Fetch vs. store management
  – Fetches schedule eDRAM access for continuous data streaming
  – Flexible store scheduling minimizes resource conflicts

[Figure: blocking vs. flexible store scheduling: with a blocking scheduler, a store occupying the interleaves stalls a subsequent fetch for several cycles; the flexible scheduler places stores around the fetch's interleave usage so the fetch starts unblocked]

L3 Dataflow Challenges
- Wiring and reach
  – 384 eDRAM macros
  – ~100 cache-line data buffers
  – Symmetric core latency
- Request concurrency
  – 6 fetch and 12 store requests per core
  – 12 fetch and 8 store requests per I/O port
  – 16 requests from the L4
- Balanced bandwidth

[Figure: dataflow around the L3B arrays: L2 Store 90GB/s ×6; L2 Fetch 90GB/s ×6; L3 Fetch and L4 Fetch at 45GB/s ×1 and 90GB/s ×8; L4 Store and L3 Store at 45GB/s ×1 and 90GB/s ×8; Memory Fetch 45GB/s ×1; Memory Store 45GB/s ×1; I/O DMA Read 45GB/s ×1; I/O DMA Write 45GB/s ×1]
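Tying the L3 geometry together: two slices picked by a low-order line-address bit, four-way interleaving, 256B lines, and 12 ways × 16k congruence classes multiply out to the 48 MB quoted above. The sketch below makes that concrete; the exact bit positions for the slice, interleave, and index fields are assumptions chosen only to be self-consistent, since the deck does not give them.

```c
/* Sketch of how a physical address could map onto the L3 geometry described
 * above. The bit-field positions are GUESSES; the deck states only the facts
 * encoded in the constants and the final capacity check. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 256u     /* 256B cache line */
#define WAYS       12u      /* 12-way set associative */
#define CCLASSES   16384u   /* 16k congruence classes, assumed to span both slices */

int main(void) {
    uint64_t addr = 0x12345678ABCull;   /* arbitrary example address */
    uint64_t line = addr / LINE_BYTES;  /* cache-line address */

    unsigned slice = line & 1;                 /* low-order line-address bit */
    unsigned ilv   = (addr >> 6) & 3;          /* assumed 64B quarter-line per interleave */
    unsigned index = (unsigned)((line >> 1) & (CCLASSES / 2 - 1)); /* 8k classes per slice */

    printf("addr=0x%llx -> slice %u, interleave %u, congruence class %u\n",
           (unsigned long long)addr, slice, ilv, index);

    /* Capacity check: 12 ways x 16k classes x 256B lines = 48 MB, as quoted. */
    assert(WAYS * CCLASSES * LINE_BYTES == 48u * 1024 * 1024);
    return 0;
}
```

Whatever the real bit choice, slicing on a low-order line-address bit spreads consecutive lines across both slices, which keeps request traffic balanced between them.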
SC (System Controller) Chip Overview
- 192 MB shared eDRAM L4
- 6 CP chip interfaces
- 3 SMP interfaces
- 3.3 billion transistors
- 32 nm SOI technology
- 4:1 L4 clock gear ratio
- 526 mm²

[Die photo: four cache-array quadrants (L4Q) around the central cache directory and control logic (L4C), flanked by two L4B regions]

L4 Cache Features
- L4 spread across two SC chips
  – Matches the L3 address slicing
- Each SC chip connects to the six CP chips and to the three SC chips on the other nodes
- 24-way set associative
- 64k congruence classes
- L4 is the system coherency manager
- HW pattern-based prefetching

[Figure: SC0 and SC1 each connect to CP0–CP5 and, via fabric interfaces FBC0–FBC2, to the other nodes]

L4 Pipeline

[Figure: L4 pipeline, cycles C-1 through C8: request (C-1); priority (C0); directory lookup (C1); hit results available, bank address compares valid, model update (C2); eDRAM read access (C3); reject/needs-resource checking (C4); store data to L4 cache (C5); cache ECC check/correct (C6); intervention to L3 and fabric broadcast (C7); data return to L3/R4 (C8)]

L4 Design Challenges
- Integration, wireability, and reach
  – 1024 eDRAM macros and their management logic
  – 114 cache-line data buffers
  – 230 address registers with compare logic
- Request concurrency
  – Support for 196 concurrent operations per chip
  – 16 fetch and 16 store requests from each of the six processor chips
  – 32 fetch and 32 store requests from each of the three remote nodes
- Intelligent request scheduling
  – Operation address interlocks
  – Central fairness/ordering
- Bandwidth balancing (see the consistency sketch at the end)
  – L3<->L4: 22GB/s per port, 132GB/s in/out
  – L4<->L4: 22GB/s per port, 66GB/s in/out
  – The L4 cache services >60% of L3 requests under storage-hierarchy-intense workloads

Intra-Node Cache Management

[Figure: node hierarchy: 36 L1/L2 pairs above six L3s above the shared L4]

- MESI-derived protocol
- L2, L3, and L4 are inclusive
- L1 and L2 are write-through
- L3 and L4 are write-back
- All L3-to-L3 communication goes through the L4
- L3 miss requests
  – L4 hit, line shared by one or more L3s: guaranteed intervention processing time by the other L3s on shared lines; data sourced from the L4
  – L4 hit, line exclusive or modified in another L3: data sourced from the other L3

Inter-Node System Coherency Protocol
- Enhanced MOESI protocol
  – Intervention Master (IM)
  – Memory Master (MM)
  – Multi-Copy (MC)
  – Exclusive
  – Invalid
  – Ghost
- Fabric broadcast
  – Point-to-point communication
  – Local state information sent to the remote nodes
- Partial response broadcast
  – Any-to-any, expediting system state information
  – Partial responses are ordered; no response identifier tags

[Figure: four nodes (N0–N3) exchanging point-to-point fabric broadcasts and any-to-any partial responses]
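As a closing sanity check, the L4 geometry and the fabric bandwidths quoted in the design-challenges slide multiply out as stated. The only assumption in the sketch below is the 256B line size carried over from the L3 description, and treating the quoted 24 ways × 64k congruence classes as the node-level L4 (two SC chips × 192 MB); the per-port bandwidths and port counts come directly from the deck.

```c
/* Consistency checks for the L4 numbers quoted above. */
#include <stdio.h>

int main(void) {
    /* 24 ways x 64k congruence classes x 256B lines = 384 MB,
     * i.e. the full node-level L4 (2 SC chips x 192 MB each). */
    unsigned long long l4_bytes = 24ull * 65536 * 256;
    printf("L4 capacity: %llu MB\n", l4_bytes >> 20);   /* prints 384 */

    /* L3<->L4: 22 GB/s per port across 6 CP interfaces = 132 GB/s in/out. */
    printf("L3<->L4 aggregate: %u GB/s\n", 22u * 6);    /* prints 132 */

    /* L4<->L4: 22 GB/s per port across 3 SMP interfaces = 66 GB/s in/out. */
    printf("L4<->L4 aggregate: %u GB/s\n", 22u * 3);    /* prints 66 */
    return 0;
}
```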