Understanding SPARC Processor Performance

Total Page:16

File Type:pdf, Size:1020Kb

Understanding SPARC Processor Performance Understanding SPARC Processor Performance MAY 15 & 16, 2019 CLEVELAND PUBLIC AUDITORIUM, CLEVELAND, OHIO WWW.NEOOUG.ORG/GLOC About the Speaker • Akiva Lichtner • Physics background • Twenty years experience in IT • Enterprise production support analyst • Java developer • Oracle query plan manager … • Spoke here at G.L.O.C. about TDD and Java dynamic tracing Audience • Developers • System administrators • Tech support analysts • IT managers Motivation • I have been working in tech support for a large application • We have run SPARC T4 servers and now we run T7 servers • Application servers, database servers • Environments are all different • Users complained for years about “environment X” being slow, finally figured out why • What I learned can be very useful for users of SPARC servers What is SPARC? • First released in 1987, created by Sun Microsystems to replace the Motorola 68000 in its workstation products • During the .com boom Solaris/SPARC and Windows/Intel were the only supported platforms for the JVM • In 2000 the bubble burst and Sun server sales plunged • Sun acquired Afara Websystems, which had built an interesting new processor, and renamed it the UltraSPARC T1 • Was followed by T2 through M8, evolutions of the same design • More recently Oracle has added significant new functionality Processor Design • High core count (even in the early days) • Many threads per core • “Barrel” processor • Designed to switch efficiently • Non-uniform memory access • Per-processor shared cache • Core-level shared cache A picture speaks a thousand words … What happens if my threads run on the same core? • Core runs one thread at a time • Context switch is instantaneous • Ideally memory latency hides context switch entirely • However the caches have a limited size • Performance will depend on cache utilization • Are the threads on the same core “related” or not? • I had a CPU-intensive application that would slow down by a factor of five at random times, in the presence of another CPU- intensive application – coincidence? • To Solaris one core looks like 8 virtual processors What performance should I expect in general? • It all depends • Common for a Java application to have hundreds of threads • These servers often consolidate independent applications • By default, Solaris uses all the virtual processors • Not a massive SMP machine • Not a traditional CC-NUMA machine Some experiments to illustrate • One Java process, N threads • Random reads/writes on one array of data • Used pbind to influence choice of virtual processors One core 1 core 20 15 10 5 0 7 5 3 1 0-5 5-10 10-15 15-20 Two cores 2 non-adjacent cores 60 2 adjacent cores 50 40 30 20 20 15 10 13 10 0 5 9 13 0 9 5 4 16 5 64 1 256 1 1024 4096 16384 65536 262144 0-10 10-20 20-30 30-40 40-50 50-60 0-5 5-10 10-15 15-20 Four cores 4 non-adjacent cores 4 adjacent cores 100 80 60 60 40 40 20 20 0 0 1 4 1 4 4 9 5 7 64 13 32 10 13 21 17 1024 256 16 19 29 25 16384 2048 22 25 28 262144 16384 31 4194304 131072 1048576 8388608 0-20 20-40 40-60 0-20 20-40 40-60 60-80 80-100 Multi-processor server • For several years I had a problematic server • I finally noticed that it had two processors • Applications were mostly idle, but load average was high • I think thread migration overhead was inflating the run queue • A simple Java memory allocation test was 60% slower than on a single-processor server • Once a single-processor server was installed, the problem disappeared Choose application software carefully • Well-designed concurrent applications • Learn good concurrent programming • Streaming applications – Each core has a Data Analytics Acceleration (DAX) pipeline – compare / filter / translate / compress – Oracle Database, Apache Spark query acceleration – Java Streams API • Neural networks? Solaris Resource Management • Also known as Solaris Zones • Not virtualization, still just one server • Solaris “projects” • Assign application processes to projects • Create (virtual) processor sets • Allocate virtual processors to processor sets Solaris VM Server for SPARC • Also known as Logical Domains • This is virtualization • Hypervisor • Multiple instances of Solaris on the same server • Allocate processors to different VMs Dynamic Hardware Domains • Hardware feature in M-Series servers • Can add/delete/replace processor boards to domains Take-aways • There’s a lot to a modern SPARC processor • This is not an SMP, not a CC-NUMA • Complex, hierarchical structure • With great power comes great responsibility • If performance is a problem, consider your workload sharing • Keep different applications on different cores • Try to keep related apps on adjacent cores • VMs and Zones are not nice to have, you need them.
Recommended publications
  • Installation and Upgrade General Checklist Report
    VeritasTM Services and Operations Readiness Tools Installation and Upgrade Checklist Report for Storage Foundation for Sybase 5.1, Solaris 10, SPARC Index Back to top Important Notes System requirements Product documentation Patches for Storage Foundation for Sybase and Platform Platform configuration Host bus adapter (HBA) parameters and switch parameters Operations Manager Array Support Libraries (ASLs) Additional tasks to consider Important Notes Back to top Array Support Libraries (ASLs) When installing Veritas products, please be aware that ASLs updates are not included in the patches update bundle. Please go to the Array Support Libraries (ASLs) to get the latest updates for your disk arrays. System requirements Back to top Required number of CPUs: 1 Required disk space: Partitions Minimum space required Maximum space required Recommended space /opt 102 MB 441 MB 327 MB /root 52 MB 53 MB 52 MB /usr 154 MB 154 MB 154 MB /var 1 MB 1 MB 1 MB Supported architectures: SPARC M5 series [1][2] SPARC M6 series [1][2] SPARC T3 series [2] SPARC T4 series [2][3] SPARC T5 series [2][4] SPARC64 X+ series SPARC64-V series SPARC64-VI series 1 of 9 VeritasTM Services and Operations Readiness Tools SPARC64-VII/VII+ series SPARC64-X series UltraSPARC II series UltraSPARC III series UltraSPARC IV series UltraSPARC T1 series [2] UltraSPARC T2/T2+ series [2] 1. A minimum version of Storage Foundation 5.1SP1RP3, and Solaris 10 1/13 are required for support. 2. Oracle VM Server for SPARC supported. See the following TechNote: http://www.veritas.com/docs/DOC4397. 3. Installation of Storage Foundation products may encounter issue on Oracle T4 servers, see TechNote http://www.veritas.com/docs/TECH177307.
    [Show full text]
  • SPARC S7 Servers
    Oracle’s SPARC S7 Servers Technical Overview Rainer Schott Oracle Systems Sales Consulting September 2016 Copyright © 2016, Oracle and/or its affiliates. All rights reserved. | Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle. Copyright © 2016, Oracle and/or its affiliates. All rights reserved. | 3 • App Data Integrity SPARC @ Oracle Including • DB Query Acceleration Software in Silicon } • Inline Decompression 7 Processors in 6 Years • More…. 2010 2011 2013 2013 2013 2015 2016 SPARC T3 SPARC T4 SPARC T5 SPARC M5 SPARC M6 SPARC M7 SPARC S7 16 x 2nd Gen cores 8 x 3rd Gen Cores 16 x 3rd Gen Cores 16x 3 rd Gen Cores 12 x 3rd Gen Cores 32 x 4th Gen Cores 8 x 4th Gen Cores 4MB L3 Cache 4MB L3 Cache 8MB L3 Cache 48MB L3 Cache 48MB L3 Cache 64MB L3 Cache 16MB L3 cache 1.65 GHz 3.0 GHz 3.6 GHz 3.6 GHz 3.6 GHz 4.1 GHz 4.5 GHz Copyright © 2016, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential Advancing the State-of-the-Art M7 Microprocessor – World’s First Implementation of Software Features in Silicon • SQL in Silicon – High-Speed Memory Decompression… – Accelerates In-Memory Database • Always-On Security in Silicon – Memory intrusion detection • High-Speed Encryption – Near zero performance impact Copyright © 2016, Oracle and/or its affiliates.
    [Show full text]
  • Antikernel: a Decentralized Secure Hardware-Software Operating
    Antikernel A Decentralized Secure Hardware-Software Operating System Andrew Zonenberg (@azonenberg) Senior Security Consultant, IOActive Bülent Yener Professor, Rensselaer Polytechnic Institute This work is based on Zonenberg’s 2015 doctoral dissertation, advised by Yener. IOActive, Inc. Copyright ©2016. All Rights Reserved. Kernel mode = full access to all state • What OS code needs this level of access? – Memory manager only needs heap metadata – Scheduler only needs run queue – Drivers only need their peripheral – Nothing needs access to state of user-mode apps • No single subsystem that needs access to all state • Any code with ring 0 privs is incompatible with LRP! IOActive, Inc. Copyright ©2016. All Rights Reserved. Monolithic kernel, microkernel, … Huge Huge attack surface Small Better 0 code size - Ring Can we get here? None Monolithic Micro ??? IOActive, Inc. Copyright ©2016. All Rights Reserved. Exokernel (MIT, 1995) • OS abstractions can often hurt performance – You don’t need a full FS to store temporary data on disk • Split protection / segmentation from abstraction Word proc FS Cache Disk driver DHCP server IOActive, Inc. Copyright ©2016. All Rights Reserved. Exokernel (MIT, 1995) • OS does very little: – Divide resources into blocks (CPU time quanta, RAM pages…) – Provide controlled access to them IOActive, Inc. Copyright ©2016. All Rights Reserved. But wait, there’s more… • By removing non-security abstractions from the kernel, we shrink the TCB and thus the attack surface! IOActive, Inc. Copyright ©2016. All Rights Reserved. So what does the kernel have to do? • Well, obviously a few things… – Share CPU time between multiple processes – Allow processes to talk to hardware/drivers – Allow processes to talk to each other – Page-level RAM allocation/access control IOActive, Inc.
    [Show full text]
  • Mali-400 MP: a Scalable GPU for Mobile Devices
    Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson Director, Graphics Research, ARM Outline . ARM and Mobile Graphics . Design Constraints for Mobile GPUs . Mali Architecture Overview . Multicore Scaling in Mali-400 MP . Results 2 About ARM . World’s leading supplier of semiconductor IP . Processor Architectures and Implementations . Related IP: buses, caches, debug & trace, physical IP . Software tools and infrastructure . Business Model . License fees . Per-chip royalties . Graphics at ARM . Acquired Falanx in 2006 . ARM Mali is now the world’s most widely licensed GPU family 3 Challenges for Mobile GPUs . Size . Power . Memory Bandwidth 4 More Challenges . Graphics is going into “anything that has a screen” . Mobile . Navigation . Set Top Box/DTV . Automotive . Video telephony . Cameras . Printers . Huge range of form factors, screen sizes, power budgets, and performance requirements . In some applications, a huge difference between peak and average performance requirements 5 Solution: Scalability . Address a wide variety of performance points and applications with a single IP and a single software stack. Need static scalability to adapt to different peak requirements in different platforms / markets . Need dynamic scalability to reduce power when peak performance isn’t needed 6 Options for Scalability . Fine-grained: Multiple pipes, wide SIMD, etc . Proven approach, efficient and effective . But, adding pipes / lanes is invasive . Hard for IP licensees to do on their own . And, hard to partition to provide dynamic scalability . Coarse-grained: Multicore . Easy for licensees to select desired performance . Putting cores on separate power islands allows dynamic scaling 7 Mali 400-MP Top Level Architecture Asynch Mali-400 MP Top-Level APB Geometry Pixel Processor Pixel Processor Pixel Processor Pixel Processor Processor #1 #2 #3 #4 CLKs MaliMMUs RESETs IRQs IDLEs MaliL2 AXI .
    [Show full text]
  • How the SPARC T4 Processor Optimizes Throughput Capacity: a Case Study
    An Oracle Technical White Paper April 2012 How the SPARC T4 Processor Optimizes Throughput Capacity: A Case Study How the SPARCT4 Processor Optimizes Throughput Capacity: A Case Study Introduction ....................................................................................... 1 Summary of the Performance Results ............................................... 4 Performance Characteristics of the CSP1 and CSP2 Programs ........ 5 The UltraSPARC T2+ Pipeline Architecture ................................... 5 Instruction Scheduling Analysis for the CSP1 Program on the UltraSPARC T2+ Processor .......................................................... 6 Instruction Scheduling Analysis for the CSP2 Program on the UltraSPARC T2+ Processor .......................................................... 9 UltraSPARC T2+ Performance Results ........................................... 10 Test Framework........................................................................... 10 UltraSPARC T2+ Performance Results for the CSP1 Program .... 11 UltraSPARC T2+ Performance Results for the CSP2 Program .... 12 SPARC T4 Performance Results ..................................................... 13 Summary of the SPARC T4 Core Architecture ............................ 13 Test Framework........................................................................... 15 Cycle Count Estimates and Performance Results for the CSP1 Program on the SPARC T4 Processor ......................................... 15 Cycle Count Estimates and Performance Results for the
    [Show full text]
  • SPARC Enterprise Oracle VM Server for SPARC Important Information
    SPARC Enterprise Oracle VM Server for SPARC Important Information C120-E618-06EN October 2012 Copyright © 2007, 2012, Oracle and/or its affiliates and FUJITSU LIMITED. All rights reserved. Oracle and/or its affiliates and Fujitsu Limited each own or control intellectual property rights relating to products and technology described in this document, and such products, technology and this document are protected by copyright laws, patents, and other intellectual property laws and international treaties. This document and the product and technology to which it pertains are distributed under licenses restricting their use, copying, distribution, and decompilation. No part of such product or technology, or of this document, may be reproduced in any form by any means without prior written authorization of Oracle and/or its affiliates and Fujitsu Limited, and their applicable licensors, if any. The furnishings of this document to you does not give you any rights or licenses, express or implied, with respect to the product or technology to which it pertains, and this document does not contain or represent any commitment of any kind on the part of Oracle or Fujitsu Limited, or any affiliate of either of them. This document and the product and technology described in this document may incorporate third-party intellectual property copyrighted by and/or licensed from the suppliers to Oracle and/or its affiliates and Fujitsu Limited, including software and font technology. Per the terms of the GPL or LGPL, a copy of the source code governed by the GPL or LGPL, as applicable, is available upon request by the End User.
    [Show full text]
  • Nyami: a Synthesizable GPU Architectural Model for General-Purpose and Graphics-Specific Workloads
    Nyami: A Synthesizable GPU Architectural Model for General-Purpose and Graphics-Specific Workloads Jeff Bush Philip Dexter†, Timothy N. Miller†, and Aaron Carpenter⇤ San Jose, California †Dept. of Computer Science [email protected] ⇤ Dept. of Electrical & Computer Engineering Binghamton University {pdexter1, millerti, carpente}@binghamton.edu Abstract tempt to bridge this gap, Intel developed Larrabee (now called Graphics processing units (GPUs) continue to grow in pop- Xeon Phi) [25]. Larrabee is architected around small in-order ularity for general-purpose, highly parallel, high-throughput cores with wide vector ALUs to facilitate graphics rendering and systems. This has forced GPU vendors to increase their fo- multi-threading to hide instruction latencies. The use of small, cus on general purpose workloads, sometimes at the expense simple processor cores allows many cores to be packed onto a of the graphics-specific workloads. Using GPUs for general- single die and into a limited power envelope. purpose computation is a departure from the driving forces be- Although GPUs were originally designed to render images for hind programmable GPUs that were focused on a narrow subset visual displays, today they are used frequently for more general- of graphics rendering operations. Rather than focus on purely purpose applications. However, they must still efficiently per- graphics-related or general-purpose use, we have designed and form what would be considered traditional graphics tasks (i.e. modeled an architecture that optimizes for both simultaneously rendering images onto a screen). GPUs optimized for general- to efficiently handle all GPU workloads. purpose computing may downplay graphics-specific optimiza- In this paper, we present Nyami, a co-optimized GPU archi- tions, even going so far as to offload them to software.
    [Show full text]
  • Oracle's SPARC T5-2, SPARC T5-4, SPARC T5-8, and SPARC T5-1B Server Architecture Oracle's SPARC T5-2, SPARC T5-4, SPARC T5-8, and SPARC T5-1B Server Architecture
    An Oracle White Paper February 2014 Oracle's SPARC T5-2, SPARC T5-4, SPARC T5-8, and SPARC T5-1B Server Architecture Oracle's SPARC T5-2, SPARC T5-4, SPARC T5-8, and SPARC T5-1B Server Architecture Introduction ....................................................................................... 1 Comparison of SPARC T5–Based Server Features........................... 2 SPARC T5 Processor ........................................................................ 3 Taking Oracle’s Multicore/Multithreaded Design to the Next Level 5 SPARC T5 Processor Architecture ................................................ 6 SPARC T5 Processor Cache Architecture ..................................... 8 SPARC T5 Core Architecture ........................................................ 9 Oracle Solaris for Multicore Scalability............................................. 16 Oracle Solaris 11 Operating System ................................................ 18 Oracle Solaris Predictive Self Healing, Fault Management Architecture, and Service Management Facility ....................................................... 19 Oracle Solaris Cryptographic Frameworks................................... 19 End-to-End Virtualization Technology .............................................. 19 A Multithreaded Hypervisor ......................................................... 20 Oracle VM Server for SPARC ...................................................... 20 Oracle Solaris Zones ................................................................... 21 Enterprise-Class
    [Show full text]
  • Exploiting Simple Analytical Models for Modeling Hardware Accelerators
    Exploiting Simple Analytical Models for Modeling Hardware Accelerators by Muhammad Shoaib Bin Altaf A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical & Computer Engineering) at the UNIVERSITY OF WISCONSIN–MADISON 2016 Date of final oral examination: 12/08/2016 The dissertation is approved by the following members of the Final Oral Committee: Mark Hill, Professor, Computer Science Mikko Lipasti, Professor, Electrical & Computer Engineering Karthikeyan Sankaralingam, Associate Professor, Computer Science Michael Swift, Associate Professor, Computer Science David Wood, Professor, Computer Science © Copyright by Muhammad Shoaib Bin Altaf 2016 All Rights Reserved i To my parents Tehseen Kausar and Sheikh Altaf Hussain, and my wife Iram Majeed for their love and support. ii acknowledgments I consider myself fortunate enough to work under the guidance of my advisor, David Wood. I would not have completed my thesis without his support. Working with David, can be a challenge in the beginning and you take time in getting settled with his unique style of mentoring. He gave me the freedom to choose a problem of my own choice but made sure that I stayed on the right path. He has a knack for communicating ideas succinctly, and expects (and forces) his students to develop the same. Thanks to David, I consider myself a better writer and researcher. Thanks David. I am also thankful to my committee members for providing useful feedback and com- ments on my work. Mark Hill encouraged and showed excitement about the modeling framework right form the beginning. His advice on making slides has helped me become a better presenter.
    [Show full text]
  • Table of Contents
    1 Copyright © 2013, Oracle and/or its affiliates. All rights reserved. Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle. 2 Copyright © 2013, Oracle and/or its affiliates. All rights reserved. Eine phatastische Reise ins Innere der Hardware Franz Haberhauer Stefan Hinker Oracle Hardware in 3D 5 Copyright © 2013, Oracle and/or its affiliates. All rights reserved. T5 and M5 PCIe Carrier Card . Supports standard low-profile PCIe cards Air Flow PCIe Retimer x16 Connector (x8 electrical) 6 Copyright © 2013, Oracle and/or its affiliates. All rights reserved. PCIe Data Paths: Full System . Two root complexes per T5 processor . Each PCIe port on a T5 processor controls a single PCIe slot 7 Copyright © 2013, Oracle and/or its affiliates. All rights reserved. T5-2 Block Diagram DIMM DIMM DIMM DIMM DIMM DIMM DIMM DIMM DIMM DIMM DIMM DIMM DIMM DIMM DIMM DIMM BoB BoB BoB BoB BoB BoB BoB BoB BoB BoB BoB BoB BoB BoB BoB BoB T5-0 T5-1 CPU CPU TPM Host & CPU PCIe Debug CPU PCIe Debug Data Flash DC/DCs 0 1 Port DC/DCs 0 1 Port x8 x8 FPGA x8 x4 x8 x1 HDD0 DBG SAS/SATA x1 HDD0 IO Controller x4 x4 PCIe PCIe SP Module HDD0 get rid of all inside x8 x8 SAS/SATA smallSwitch boxes 0 Switch 1 FRUID HDD0 IO Controller Sideband Mgmt DRAM HDD0 USB 1.1 Keyboard Mouse Service SPI x8 USB 3.0 x8 USB 2.0 Storage Flash HDD0 Host Processor SATA DVD NAND USB 2.0 Hub USB USB 3.0 USB Internal USB Hub VGA VGA REAR IO Board USB2 USB3 VGA USB0 USB1 VGA Serial Enet Quad 10Gig Enet DB15 Mgmt Mgmt Slot 2 (8) 2 Slot (8) 3 Slot (8) 4 Slot (8) 5 Slot (8) 6 Slot (8) 7 Slot (8) 8 Slot Slot 1 (8) 1 Slot 10/100 FAN BOARD REAR IO 8 Copyright © 2013, Oracle and/or its affiliates.
    [Show full text]
  • Computer Architectures an Overview
    Computer Architectures An Overview PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 25 Feb 2012 22:35:32 UTC Contents Articles Microarchitecture 1 x86 7 PowerPC 23 IBM POWER 33 MIPS architecture 39 SPARC 57 ARM architecture 65 DEC Alpha 80 AlphaStation 92 AlphaServer 95 Very long instruction word 103 Instruction-level parallelism 107 Explicitly parallel instruction computing 108 References Article Sources and Contributors 111 Image Sources, Licenses and Contributors 113 Article Licenses License 114 Microarchitecture 1 Microarchitecture In computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch), also called computer organization, is the way a given instruction set architecture (ISA) is implemented on a processor. A given ISA may be implemented with different microarchitectures.[1] Implementations might vary due to different goals of a given design or due to shifts in technology.[2] Computer architecture is the combination of microarchitecture and instruction set design. Relation to instruction set architecture The ISA is roughly the same as the programming model of a processor as seen by an assembly language programmer or compiler writer. The ISA includes the execution model, processor registers, address and data formats among other things. The Intel Core microarchitecture microarchitecture includes the constituent parts of the processor and how these interconnect and interoperate to implement the ISA. The microarchitecture of a machine is usually represented as (more or less detailed) diagrams that describe the interconnections of the various microarchitectural elements of the machine, which may be everything from single gates and registers, to complete arithmetic logic units (ALU)s and even larger elements.
    [Show full text]
  • Dynamic Task Scheduling and Binding for Many-Core Systems Through Stream Rewriting
    Dynamic Task Scheduling and Binding for Many-Core Systems through Stream Rewriting Dissertation zur Erlangung des akademischen Grades Doktor-Ingenieur (Dr.-Ing.) der Fakultät für Informatik und Elektrotechnik der Universität Rostock vorgelegt von Lars Middendorf, geb. am 21.09.1982 in Iserlohn aus Rostock Rostock, 03.12.2014 Gutachter Prof. Dr.-Ing. habil. Christian Haubelt Lehrstuhl "Eingebettete Systeme" Institut für Angewandte Mikroelektronik und Datentechnik Universität Rostock Prof. Dr.-Ing. habil. Heidrun Schumann Lehrstuhl Computergraphik Institut für Informatik Universität Rostock Prof. Dr.-Ing. Michael Hübner Lehrstuhl für Eingebettete Systeme der Informationstechnik Fakultät für Elektrotechnik und Informationstechnik Ruhr-Universität Bochum Datum der Abgabe: 03.12.2014 Datum der Verteidigung: 05.03.2015 Acknowledgements First of all, I would like to thank my supervisor Prof. Dr. Christian Haubelt for his guidance during the years, the scientific assistance to write this thesis, and the chance to research on a very individual topic. In addition, I thank my colleagues for interesting discussions and a pleasant working environment. Finally, I would like to thank my family for supporting and understanding me. Contents 1 INTRODUCTION........................................................................................................................1 1.1 STREAM REWRITING ...................................................................................................................5 1.2 RELATED WORK .........................................................................................................................7
    [Show full text]