Overview of Multiprocessors

Total Page:16

File Type:pdf, Size:1020Kb

Overview of Multiprocessors Review of Concept of ILP Processors • Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model CS 211 Computer Architecture • Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice ILP to Multiprocessing • Processors of last 5 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple- issue processors announced in 1995 – Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units ⇒ performance 8 to 16X • Peak v. delivered performance gap increasing 2 Next… Limits to ILP • Conflicting studies of ILP amount • Quick review of Limits to ILP: Chapter 3 – Benchmarks • Thread Level Parallelism: Chapter 3 – Hardware sophistication – Multithreading – Compiler sophistication – Simultaneous Multithreading • How much ILP is available using existing • Multiprocessing: Chapter 4 mechanisms with increasing HW budgets? – Fundamentals of multiprocessors » Synchronization, memory, … • Do we need to invent new HW/SW – Chip level multiprocessing: Multi-Core mechanisms to keep on processor performance curve? – Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints – Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock – Motorola AltaVec: 128 bit ints and FPs – Supersparc Multimedia ops, etc. 3 4 NOW Handout Page 1 1 Overcoming Limits Limits to ILP • Advances in compiler technology + significantly new and different hardware Assumptions for ideal/perfect machine to start: techniques may be able to overcome 1. Register renaming – infinite virtual registers limitations assumed in studies => all register WAW & WAR hazards are avoided • However, unlikely such advances when 2. Branch prediction – perfect; no mispredictions coupled with realistic hardware will 3. Jump prediction – all jumps perfectly predicted overcome these limits in near future (returns, case statements) 2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; 1&4 eliminates all but RAW 5. Window size is infinite 5 6 Some numbers… Upper Limit to ILP: Ideal Machine (Figure 3.1) • Initial MIPS HW Model here; MIPS compilers. 160 150.1 FP: 75 - 150 • perfect caches; 1 cycle latency for all 140 instructions (FP *,/); unlimited instructions Integer: 18 - 60 118.7 issued/clock cycle 120 100 75.2 80 62.6 60 54.8 40 17.9 20 0 Instructions Per Clock Instructions gcc espresso li fpppp doducd tomcatv Programs 7 8 NOW Handout Page 2 2 ILP Limitations: In Reality – ILP Limitations: Architecture “Parameters” Architecture“Parameters” • Window size • Window size – Large window size could lead to more ILP, but more time to examine – Large window size could lead to more ILP, but more time to instruction packet to determine parallelism examine instruction packet to determine parallelism • Register file size – Finite size of register file (virtual and physical) introduces more name dependencies • Branch prediction – Effects of realistic branch predictor • Memory aliasing – Idealized model assumes we can analyze all memory dependencies: but compile time analysis cannot be perfect – Realistic: » Global/stack: assume perfect predictions for global and stack data but all heap references will conflict » Inspection: deterministic compile time references • R1(10) and R1(100) cannot conflict if R1 has not changed between the two » None: all memory accesses are assumed to conflict 9 10 More Realistic HW: Window Impact ILP Limitations: Figure 3.2 Architecture“Parameters” Change from Infinite window 2048, 512, 128, 32 FP: 9 - 150 160 150 • Register file size – Finite size of register file (virtual and physical) introduces 140 more name dependencies 119 k 120 Integer: 8 - 63 100 75 80 63 61 59 60 55 IPC 60 49 45 41 36 35 40 34 Instructions Per Cloc 18 16 1513 1512 14 15 14 20 101088119 9 0 gcc espresso li fpppp doduc tomcatv Inf inite 2048 512 128 32 11 12 NOW Handout Page 3 3 More Realistic HW: ILP Limitations: Renaming Register Impact (N int + N fp +64) Architecture“Parameters” Figure 3.5 70 FP: 11 - 45 Change 2048 instr 59 60 window, 64 instr 54 49 50 issue, 8K 2 level 45 Prediction 44 • Branch prediction – Effects of realistic branch predictor 40 35 30 Integer: 5 - 15 29 28 IPC 20 20 16 15 15 15 13 12 12 12 11 11 11 10 10 10 10 9 6 7 5 5 5 555 5 4 4 4 0 gcc espresso li fpppp doducd tomcatv Program Infinite 256 128 64 32 None Infinite 256 128 64 32 None 13 14 More Realistic HW: Branch Impact Figure 3.3 Misprediction Rates Change from Infinite 61 FP: 15 -60 45 60 window to examine to 58 35% 2048 and maximum 30% 30% 50 issue of 64 instructions 48 46 46 per clock cycle 45 45 45 25% 23% 41 40 20% 18% 18% 35 16% 14% 14% Integer: 6 - 12 15% 30 29 12% 12% 10% 6% Misprediction Rate Misprediction IPC 20 19 5% 4% 16 3% 15 5% 2% 2% 13 14 1%1% 12 0% 10 10 9 0% 7 7 6 666 4 tomcatv doduc fpppp li espresso gcc 2 2 2 0 gcc espresso li fpppp doducd tomcatv Profile-based 2-bit counter Tournament Program Perfect Selective predictor Standard 2-bit Static None 15 16 Perfect Tournament BHT (512) Profile No prediction NOW Handout Page 4 4 ILP Limitations: More Realistic HW: Architecture“Parameters” Memory Address Alias Impact Figure 3.6 49 49 • Memory aliasing 50 45 45 – Idealized model assumes we can analyze all memory 45 Change 2048 instr dependencies: compile time analysis cannot be perfect 40 window, 64 instr FP: 4 - 45 – Realistic: issue, 8K 2 level 35 (Fortran, » Global/stack: assume perfect predictions for global and Prediction, 256 30 no heap) stack data but all heap references will conflict renaming registers » Inspection: deterministic compile time references 25 • R1(10) and R1(100) cannot conflict if R1 has not changed between the 20 Integer: 4 - 9 16 16 two 15 15 12 » None: all memory accesses are assumed to conflict 10 9 IPC 10 7 7 5 5 6 4 4 4 5 5 3 3 3 4 4 0 gcc espresso li fpppp doducd tomcatv Program Perfect Global/stack Perfect Inspection None Perfect Global/Stack perf; Inspec. None 17 heap conflicts Assem. 18 How to Exceed ILP Limits of this study? HW v. SW to increase ILP • These are not laws of physics; just practical limits • Memory disambiguation: HW best for today, and perhaps overcome via research • Speculation: • Compiler and ISA advances could change results – HW best when dynamic branch prediction • WAR and WAW hazards through memory: better than compile time prediction eliminated WAW and WAR hazards through – Exceptions easier for HW register renaming, but not in memory usage – Can get conflicts via allocation of stack frames as a called – HW doesn’t need bookkeeping code or procedure reuses the memory addresses of a previous frame compensation code on the stack – Very complicated to get right • Scheduling: SW can look ahead to schedule better • Compiler independence: does not require new compiler, recompilation to run well 19 20 NOW Handout Page 5 5 Next….Thread Level Parallelism Performance beyond single thread ILP • There can be much higher natural parallelism in some applications (e.g., Database or Scientific codes) • Explicit Thread Level Parallelism or Data Level Parallelism • Thread: process with own instructions and data – thread may be a process part of a parallel program of multiple processes, or it may be an independent program – Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute • Data Level Parallelism: Perform identical operations on data, and lots of data – Example: Vector operations, matrix computations 21 22 Thread Level Parallelism (TLP) New Approach: Mulithreaded Execution • ILP exploits implicit parallel operations • Multithreading: multiple threads to share the within a loop or straight-line code functional units of 1 processor via segment overlapping • TLP explicitly represented by the use of – processor must duplicate independent state of each thread multiple threads of execution that are e.g., a separate copy of register file, a separate PC, and for inherently parallel running independent programs, a separate page table – memory shared through the virtual memory mechanisms, • Goal: Use multiple instruction streams to which already support multiple processes improve – HW for fast thread switch; much faster than full process 1. Throughput of computers that run many switch ≈ 100s to 1000s of clocks programs • When switch? 2. Execution time of multi-threaded programs – Fine Grain: Alternate instruction per thread • TLP could be more cost-effective to – Coarse grain: When a thread is stalled, perhaps for a cache exploit than ILP miss, another thread can be executed 23 24 NOW Handout Page 6 6 Multithreaded Processing Fine-Grained Multithreading Fine Grain Coarse Grain • Switches between threads on each instruction, causing the execution of multiples threads to be interleaved – Usually done in a round-robin fashion, skipping any stalled threads • CPU must be able to switch threads every clock • Advantage: – can hide both short and long stalls, since instructions from other threads executed when one thread stalls • Disadvantage: – slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads • Used on Sun’s Niagara (see textbook) Thread 1 Thread 3 Thread 5 Thread 2 Thread 4 Idle slot 25 26 Course-Grained
Recommended publications
  • Development of a Dynamically Extensible Spinnaker Chip Computing Module
    FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO Development of a Dynamically Extensible SpiNNaker Chip Computing Module Rui Emanuel Gonçalves Calado Araújo Master in Electrical and Computers Engineering Supervisor: Jörg Conradt Co-Supervisor: Diamantino Freitas January 27, 2014 Resumo O projeto SpiNNaker desenvolveu uma arquitetura que é capaz de criar um sistema com mais de um milhão de núcleos, com o objetivo de simular mais de um bilhão de neurónios em tempo real biológico. O núcleo deste sistema é o "chip" SpiNNaker, um multiprocessador System-on-Chip com um elevado nível de interligação entre as suas unidades de processamento. Apesar de ser uma plataforma de computação com muito potencial, até para aplicações genéricas, atualmente é ape- nas disponibilizada em configurações fixas e requer uma estação de trabalho, como uma máquina tipo "desktop" ou "laptop" conectada através de uma conexão Ethernet, para a sua inicialização e receber o programa e os dados a processar. No sentido de tirar proveito das capacidades do "chip" SpiNNaker noutras áreas, como por exemplo, na área da robótica, nomeadamente no caso de robots voadores ou de tamanho pequeno, uma nova solução de hardware com software configurável tem de ser projetada de forma a poder selecionar granularmente a quantidade do poder de processamento. Estas novas capacidades per- mitem que a arquitetura SpiNNaker possa ser utilizada em mais aplicações para além daquelas para que foi originalmente projetada. Esta dissertação apresenta um módulo de computação dinamicamente extensível baseado em "chips" SpiNNaker com a finalidade de ultrapassar as limitações supracitadas das máquinas SpiN- Naker atualmente disponíveis. Esta solução consiste numa única placa com um microcontrolador, que emula um "chip" SpiNNaker com uma ligação Ethernet, acessível através de uma porta série e com um "chip" SpiNNaker.
    [Show full text]
  • Wind Rose Data Comes in the Form >200,000 Wind Rose Images
    Making Wind Speed and Direction Maps Rich Stromberg Alaska Energy Authority [email protected]/907-771-3053 6/30/2011 Wind Direction Maps 1 Wind rose data comes in the form of >200,000 wind rose images across Alaska 6/30/2011 Wind Direction Maps 2 Wind rose data is quantified in very large Excel™ spreadsheets for each region of the state • Fields: X Y X_1 Y_1 FILE FREQ1 FREQ2 FREQ3 FREQ4 FREQ5 FREQ6 FREQ7 FREQ8 FREQ9 FREQ10 FREQ11 FREQ12 FREQ13 FREQ14 FREQ15 FREQ16 SPEED1 SPEED2 SPEED3 SPEED4 SPEED5 SPEED6 SPEED7 SPEED8 SPEED9 SPEED10 SPEED11 SPEED12 SPEED13 SPEED14 SPEED15 SPEED16 POWER1 POWER2 POWER3 POWER4 POWER5 POWER6 POWER7 POWER8 POWER9 POWER10 POWER11 POWER12 POWER13 POWER14 POWER15 POWER16 WEIBC1 WEIBC2 WEIBC3 WEIBC4 WEIBC5 WEIBC6 WEIBC7 WEIBC8 WEIBC9 WEIBC10 WEIBC11 WEIBC12 WEIBC13 WEIBC14 WEIBC15 WEIBC16 WEIBK1 WEIBK2 WEIBK3 WEIBK4 WEIBK5 WEIBK6 WEIBK7 WEIBK8 WEIBK9 WEIBK10 WEIBK11 WEIBK12 WEIBK13 WEIBK14 WEIBK15 WEIBK16 6/30/2011 Wind Direction Maps 3 Data set is thinned down to wind power density • Fields: X Y • POWER1 POWER2 POWER3 POWER4 POWER5 POWER6 POWER7 POWER8 POWER9 POWER10 POWER11 POWER12 POWER13 POWER14 POWER15 POWER16 • Power1 is the wind power density coming from the north (0 degrees). Power 2 is wind power from 22.5 deg.,…Power 9 is south (180 deg.), etc… 6/30/2011 Wind Direction Maps 4 Spreadsheet calculations X Y POWER1 POWER2 POWER3 POWER4 POWER5 POWER6 POWER7 POWER8 POWER9 POWER10 POWER11 POWER12 POWER13 POWER14 POWER15 POWER16 Max Wind Dir Prim 2nd Wind Dir Sec -132.7365 54.4833 0.643 0.767 1.911 4.083
    [Show full text]
  • 1 Introduction
    Cambridge University Press 978-0-521-76992-1 - Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors Jean-Loup Baer Excerpt More information 1 Introduction Modern computer systems built from the most sophisticated microprocessors and extensive memory hierarchies achieve their high performance through a combina- tion of dramatic improvements in technology and advances in computer architec- ture. Advances in technology have resulted in exponential growth rates in raw speed (i.e., clock frequency) and in the amount of logic (number of transistors) that can be put on a chip. Computer architects have exploited these factors in order to further enhance performance using architectural techniques, which are the main subject of this book. Microprocessors are over 30 years old: the Intel 4004 was introduced in 1971. The functionality of the 4004 compared to that of the mainframes of that period (for example, the IBM System/370) was minuscule. Today, just over thirty years later, workstations powered by engines such as (in alphabetical order and without specific processor numbers) the AMD Athlon, IBM PowerPC, Intel Pentium, and Sun UltraSPARC can rival or surpass in both performance and functionality the few remaining mainframes and at a much lower cost. Servers and supercomputers are more often than not made up of collections of microprocessor systems. It would be wrong to assume, though, that the three tenets that computer archi- tects have followed, namely pipelining, parallelism, and the principle of locality, were discovered with the birth of microprocessors. They were all at the basis of the design of previous (super)computers. The advances in technology made their implementa- tions more practical and spurred further refinements.
    [Show full text]
  • Instruction Latencies and Throughput for AMD and Intel X86 Processors
    Instruction latencies and throughput for AMD and Intel x86 processors Torbj¨ornGranlund 2019-08-02 09:05Z Copyright Torbj¨ornGranlund 2005{2019. Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved. This report is work-in-progress. A newer version might be available here: https://gmplib.org/~tege/x86-timing.pdf In this short report we present latency and throughput data for various x86 processors. We only present data on integer operations. The data on integer MMX and SSE2 instructions is currently limited. We might present more complete data in the future, if there is enough interest. There are several reasons for presenting this report: 1. Intel's published data were in the past incomplete and full of errors. 2. Intel did not publish any data for 64-bit operations. 3. To allow straightforward comparison of an important aspect of AMD and Intel pipelines. The here presented data is the result of extensive timing tests. While we have made an effort to make sure the data is accurate, the reader is cautioned that some errors might have crept in. 1 Nomenclature and notation LNN means latency for NN-bit operation.TNN means throughput for NN-bit operation. The term throughput is used to mean number of instructions per cycle of this type that can be sustained. That implies that more throughput is better, which is consistent with how most people understand the term. Intel use that same term in the exact opposite meaning in their manuals. The notation "P6 0-E", "P4 F0", etc, are used to save table header space.
    [Show full text]
  • Power4 Focuses on Memory Bandwidth IBM Confronts IA-64, Says ISA Not Important
    VOLUME 13, NUMBER 13 OCTOBER 6,1999 MICROPROCESSOR REPORT THE INSIDERS’ GUIDE TO MICROPROCESSOR HARDWARE Power4 Focuses on Memory Bandwidth IBM Confronts IA-64, Says ISA Not Important by Keith Diefendorff company has decided to make a last-gasp effort to retain control of its high-end server silicon by throwing its consid- Not content to wrap sheet metal around erable financial and technical weight behind Power4. Intel microprocessors for its future server After investing this much effort in Power4, if IBM fails business, IBM is developing a processor it to deliver a server processor with compelling advantages hopes will fend off the IA-64 juggernaut. Speaking at this over the best IA-64 processors, it will be left with little alter- week’s Microprocessor Forum, chief architect Jim Kahle de- native but to capitulate. If Power4 fails, it will also be a clear scribed IBM’s monster 170-million-transistor Power4 chip, indication to Sun, Compaq, and others that are bucking which boasts two 64-bit 1-GHz five-issue superscalar cores, a IA-64, that the days of proprietary CPUs are numbered. But triple-level cache hierarchy, a 10-GByte/s main-memory IBM intends to resist mightily, and, based on what the com- interface, and a 45-GByte/s multiprocessor interface, as pany has disclosed about Power4 so far, it may just succeed. Figure 1 shows. Kahle said that IBM will see first silicon on Power4 in 1Q00, and systems will begin shipping in 2H01. Looking for Parallelism in All the Right Places With Power4, IBM is targeting the high-reliability servers No Holds Barred that will power future e-businesses.
    [Show full text]
  • NVIDIA CUDA on IBM POWER8: Technical Overview, Software Installation, and Application Development
    Redpaper Dino Quintero Wei Li Wainer dos Santos Moschetta Mauricio Faria de Oliveira Alexander Pozdneev NVIDIA CUDA on IBM POWER8: Technical overview, software installation, and application development Overview The exploitation of general-purpose computing on graphics processing units (GPUs) and modern multi-core processors in a single heterogeneous parallel system has proven highly efficient for running several technical computing workloads. This applied to a wide range of areas such as chemistry, bioinformatics, molecular biology, engineering, and big data analytics. Recently launched, the IBM® Power System S824L comes into play to explore the use of the NVIDIA Tesla K40 GPU, combined with the latest IBM POWER8™ processor, providing a unique technology platform for high performance computing. This IBM Redpaper™ publication discusses the installation of the system, and the development of C/C++ and Java applications using the NVIDIA CUDA platform for IBM POWER8. Note: CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and programming model created by NVIDIA and implemented by the GPUs that they produce. The following topics are covered: Advantages of NVIDIA on POWER8 The IBM Power Systems S824L server Software stack System monitoring Application development Tuning and debugging Application examples © Copyright IBM Corp. 2015. All rights reserved. ibm.com/redbooks 1 Advantages of NVIDIA on POWER8 The IBM and NVIDIA partnership was announced in November 2013, for the purpose of integrating IBM POWER®-based systems with NVIDIA GPUs, and enablement of GPU-accelerated applications and workloads. The goal is to deliver higher performance and better energy efficiency to companies and data centers. This collaboration produced its initial results in 2014 with: The announcement of the first IBM POWER8 system featuring NVIDIA Tesla GPUs (IBM Power Systems™ S824L).
    [Show full text]
  • Openpower AI CERN V1.Pdf
    Moore’s Law Processor Technology Firmware / OS Linux Accelerator sSoftware OpenStack Storage Network ... Price/Performance POWER8 2000 2020 DRAM Memory Chips Buffer Power8: Up to 12 Cores, up to 96 Threads L1, L2, L3 + L4 Caches Up to 1 TB per socket https://www.ibm.com/blogs/syst Up to 230 GB/s sustained memory ems/power-systems- openpower-enable- bandwidth acceleration/ System System Memory Memory 115 GB/s 115 GB/s POWER8 POWER8 CPU CPU NVLink NVLink 80 GB/s 80 GB/s P100 P100 P100 P100 GPU GPU GPU GPU GPU GPU GPU GPU Memory Memory Memory Memory GPU PCIe CPU 16 GB/s System bottleneck Graphics System Memory Memory IBM aDVantage: data communication and GPU performance POWER8 + 78 ms Tesla P100+NVLink x86 baseD 170 ms GPU system ImageNet / Alexnet: Minibatch size = 128 ADD: Coherent Accelerator Processor Interface (CAPI) FPGA CAPP PCIe POWER8 Processor ...FPGAs, networking, memory... Typical I/O MoDel Flow Copy or Pin MMIO Notify Poll / Int Copy or Unpin Ret. From DD DD Call Acceleration Source Data Accelerator Completion Result Data Completion Flow with a Coherent MoDel ShareD Mem. ShareD Memory Acceleration Notify Accelerator Completion Focus on Enterprise Scale-Up Focus on Scale-Out and Enterprise Future Technology and Performance DriVen Cost and Acceleration DriVen Partner Chip POWER6 Architecture POWER7 Architecture POWER8 Architecture POWER9 Architecture POWER10 POWER8/9 2007 2008 2010 2012 2014 2016 2017 TBD 2018 - 20 2020+ POWER6 POWER6+ POWER7 POWER7+ POWER8 POWER8 P9 SO P9 SU P9 SO 2 cores 2 cores 8 cores 8 cores 12 cores w/ NVLink
    [Show full text]
  • IBM Powerpc 970 (A.K.A. G5)
    IBM PowerPC 970 (a.k.a. G5) Ref 1 David Benham and Yu-Chung Chen UIC – Department of Computer Science CS 466 PPC 970FX overview ● 64-bit RISC ● 58 million transistors ● 512 KB of L2 cache and 96KB of L1 cache ● 90um process with a die size of 65 sq. mm ● Native 32 bit compatibility ● Maximum clock speed of 2.7 Ghz ● SIMD instruction set (Altivec) ● 42 watts @ 1.8 Ghz (1.3 volts) ● Peak data bandwidth of 6.4 GB per second A picture is worth a 2^10 words (approx.) Ref 2 A little history ● PowerPC processor line is a product of the AIM alliance formed in 1991. (Apple, IBM, and Motorola) ● PPC 601 (G1) - 1993 ● PPC 603 (G2) - 1995 ● PPC 750 (G3) - 1997 ● PPC 7400 (G4) - 1999 ● PPC 970 (G5) - 2002 ● AIM alliance dissolved in 2005 Processor Ref 3 Ref 3 Core details ● 16(int)-25(vector) stage pipeline ● Large number of 'in flight' instructions (various stages of execution) - theoretical limit of 215 instructions ● 512 KB L2 cache ● 96 KB L1 cache – 64 KB I-Cache – 32 KB D-Cache Core details continued ● 10 execution units – 2 load/store operations – 2 fixed-point register-register operations – 2 floating-point operations – 1 branch operation – 1 condition register operation – 1 vector permute operation – 1 vector ALU operation ● 32 64 bit general purpose registers, 32 64 bit floating point registers, 32 128 vector registers Pipeline Ref 4 Benchmarks ● SPEC2000 ● BLAST – Bioinformatics ● Amber / jac - Structure biology ● CFD lab code SPEC CPU2000 ● IBM eServer BladeCenter JS20 ● PPC 970 2.2Ghz ● SPECint2000 ● Base: 986 Peak: 1040 ● SPECfp2000 ● Base: 1178 Peak: 1241 ● Dell PowerEdge 1750 Xeon 3.06Ghz ● SPECint2000 ● Base: 1031 Peak: 1067 Apple’s SPEC Results*2 ● SPECfp2000 ● Base: 1030 Peak: 1044 BLAST Ref.
    [Show full text]
  • IBM Power Systems Performance Report Apr 13, 2021
    IBM Power Performance Report Power7 to Power10 September 8, 2021 Table of Contents 3 Introduction to Performance of IBM UNIX, IBM i, and Linux Operating System Servers 4 Section 1 – SPEC® CPU Benchmark Performance 4 Section 1a – Linux Multi-user SPEC® CPU2017 Performance (Power10) 4 Section 1b – Linux Multi-user SPEC® CPU2017 Performance (Power9) 4 Section 1c – AIX Multi-user SPEC® CPU2006 Performance (Power7, Power7+, Power8) 5 Section 1d – Linux Multi-user SPEC® CPU2006 Performance (Power7, Power7+, Power8) 6 Section 2 – AIX Multi-user Performance (rPerf) 6 Section 2a – AIX Multi-user Performance (Power8, Power9 and Power10) 9 Section 2b – AIX Multi-user Performance (Power9) in Non-default Processor Power Mode Setting 9 Section 2c – AIX Multi-user Performance (Power7 and Power7+) 13 Section 2d – AIX Capacity Upgrade on Demand Relative Performance Guidelines (Power8) 15 Section 2e – AIX Capacity Upgrade on Demand Relative Performance Guidelines (Power7 and Power7+) 20 Section 3 – CPW Benchmark Performance 19 Section 3a – CPW Benchmark Performance (Power8, Power9 and Power10) 22 Section 3b – CPW Benchmark Performance (Power7 and Power7+) 25 Section 4 – SPECjbb®2015 Benchmark Performance 25 Section 4a – SPECjbb®2015 Benchmark Performance (Power9) 25 Section 4b – SPECjbb®2015 Benchmark Performance (Power8) 25 Section 5 – AIX SAP® Standard Application Benchmark Performance 25 Section 5a – SAP® Sales and Distribution (SD) 2-Tier – AIX (Power7 to Power8) 26 Section 5b – SAP® Sales and Distribution (SD) 2-Tier – Linux on Power (Power7 to Power7+)
    [Show full text]
  • Exam 1 Solutions
    Midterm Exam ECE 741 – Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points Score 1 40 2 20 3 15 4 20 5 25 6 20 7 (bonus) 15 Total 140+15 • This is a closed book midterm. You are allowed to have only two letter-sized cheat sheets. • No electronic devices may be used. • This exam lasts 1 hour 50 minutes. • If you make a mess, clearly indicate your final answer. • For questions requiring brief answers, please provide brief answers. Do not write an essay. You can be penalized for verbosity. • Please show your work when needed. We cannot give you partial credit if you do not clearly show how you arrive at a numerical answer. • Please write your name on every sheet. EXAM 1 SOLUTIONS Problem 1 (Short answers – 40 points) i. (3 points) A cache has the block size equal to the word length. What property of program behavior, which usually contributes to higher performance if we use a cache, does not help the performance if we use THIS cache? Spatial locality ii. (3 points) Pipelining increases the performance of a processor if the pipeline can be kept full with useful instructions. Two reasons that often prevent the pipeline from staying full with useful instructions are (in two words each): Data dependencies Control dependencies iii. (3 points) The reference bit (sometimes called “access” bit) in a PTE (Page Table Entry) is used for what purpose? Page replacement The similar function is performed by what bit or bits in a cache’s tag store entry? Replacement policy bits (e.g.
    [Show full text]
  • Introduction to the Cell Multiprocessor
    Introduction J. A. Kahle M. N. Day to the Cell H. P. Hofstee C. R. Johns multiprocessor T. R. Maeurer D. Shippy This paper provides an introductory overview of the Cell multiprocessor. Cell represents a revolutionary extension of conventional microprocessor architecture and organization. The paper discusses the history of the project, the program objectives and challenges, the design concept, the architecture and programming models, and the implementation. Introduction: History of the project processors in order to provide the required Initial discussion on the collaborative effort to develop computational density and power efficiency. After Cell began with support from CEOs from the Sony several months of architectural discussion and contract and IBM companies: Sony as a content provider and negotiations, the STI (SCEI–Toshiba–IBM) Design IBM as a leading-edge technology and server company. Center was formally opened in Austin, Texas, on Collaboration was initiated among SCEI (Sony March 9, 2001. The STI Design Center represented Computer Entertainment Incorporated), IBM, for a joint investment in design of about $400,000,000. microprocessor development, and Toshiba, as a Separate joint collaborations were also set in place development and high-volume manufacturing technology for process technology development. partner. This led to high-level architectural discussions A number of key elements were employed to drive the among the three companies during the summer of 2000. success of the Cell multiprocessor design. First, a holistic During a critical meeting in Tokyo, it was determined design approach was used, encompassing processor that traditional architectural organizations would not architecture, hardware implementation, system deliver the computational power that SCEI sought structures, and software programming models.
    [Show full text]
  • IBM Power Roadmap
    POWER6™ Processor and Systems Jim McInnes [email protected] Compiler Optimization IBM Canada Toronto Software Lab © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only Role .I am a Technical leader in the Compiler Optimization Team . Focal point to the hardware development team . Member of the Power ISA Architecture Board .For each new microarchitecture I . help direct the design toward helpful features . Design and deliver specific compiler optimizations to enable hardware exploitation 2 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only POWER5 Chip Overview High frequency dual-core chip . 8 execution units 2LS, 2FP, 2FX, 1BR, 1CR . 1.9MB on-chip shared L2 – point of coherency, 3 slices . On-chip L3 directory and controller . On-chip memory controller Technology & Chip Stats . 130nm lithography, Cu, SOI . 276M transistors, 389 mm2 . I/Os: 2313 signal, 3057 Power/Gnd 3 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only POWER6 Chip Overview SDU RU Ultra-high frequency dual-core chip IFU FXU . 8 execution units L2 Core0 VMX L2 2LS, 2FP, 2FX, 1BR, 1VMX QUAD LSU FPU QUAD . 2 x 4MB on-chip L2 – point of L3 coherency, 4 quads L2 CNTL CNTL . On-chip L3 directory and controller . Two on-chip memory controllers M FBC GXC M Technology & Chip stats C C .
    [Show full text]