Energy-Effective Issue Logic

Total pages: 16

File type: PDF, size: 1020 KB

Daniele Folegnani and Antonio González
Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya
Jordi Girona 1-3, Mòdul D6, 08034 Barcelona, Spain
[email protected]

Abstract

The issue logic of a dynamically-scheduled superscalar processor is a complex mechanism devoted to starting the execution of multiple instructions every cycle. Due to its complexity, it is responsible for a significant percentage of the energy consumed by a microprocessor. The energy consumption of the issue logic depends on several architectural parameters, the instruction issue queue size being one of the most important. In this paper we present a technique to reduce the energy consumption of the issue logic of a high-performance superscalar processor. The proposed technique is based on the observation that the conventional issue logic wastes a significant amount of energy on useless activity. In particular, the wake-up of empty entries and of operands that are already ready represents an important source of energy waste. In addition, we propose a mechanism to dynamically reduce the effective size of the instruction queue. We show that on average the effective instruction queue size can be reduced by 26% with minimal impact on performance. This reduction, together with the energy saved for empty and ready entries, results in about a 90.7% reduction in the energy consumed by the wake-up logic, which represents 14.9% of the total energy of the assumed processor.

Keywords: issue logic; energy consumption; low power; adaptive hardware.

1. Introduction

Power consumption has become an important concern in processor design. More than 95% of the microprocessors produced today are used in embedded systems, so low power requirements are nowadays critical [30]. Mobile systems need an extremely efficient use of energy due to their limited battery life, which is not expected to increase drastically in the near future. On the other hand, high-performance microprocessors have to deal with heat dissipation and high current peaks. This imposes very expensive cooling systems, whose relative cost with respect to the total chip cost keeps growing. For current cooling systems and technology, beyond 50 watts of total dissipation the cost of dissipating a further watt becomes superlinear and may soon reach an unreasonable level [27]. Besides, several failure mechanisms such as thermal runaway, gate dielectric and junction fatigue, electromigration diffusion, electrical-parameter shift and silicon interconnection fatigue become significantly worse as temperature increases [28]. As an example of the current trend in power consumption, microarchitectures such as the Alpha 21364 and PowerPC 704 dissipate about 85 and 100 watts respectively [31]. In addition, the growing demand for multimedia functionality in computer systems [16] requires increasing computing power, which is sometimes achieved through higher clock frequencies and more sophisticated architectural techniques, with an obvious impact on power consumption.

With the fast increase in design complexity and the reduction in design time, new-generation CAD tools for VLSI design such as PowerMill [13] and QuickPower [12] are crucial to evaluate the energy consumption at different points of the design, helping to make important decisions early in the design process. The importance of continuous feedback between the functional system description and its power impact requires better and faster evaluation tools and models, so that changes can be introduced rapidly across various design scenarios. Recent research has shown that architectural power estimation tools like SimplePower [36], Wattch [6] and Architectural Power Evaluation [7] can estimate the power consumption of a whole microprocessor with reasonable overhead and accuracy.

In this paper we first evaluate the energy consumed by the different parts of a superscalar processor through the Architectural Power Evaluation tool [7], a detailed cycle-level simulator that includes both performance and power consumption estimation techniques. This analysis demonstrates that one of the critical components for power consumption in modern superscalar processors is the part devoted to extracting parallelism from applications at run time. In particular, the wake-up function, which analyzes the data dependences of the program through an associative search in the instruction queue, is the main power-consuming part of the issue logic. This is reflected in a high-complexity circuit and high logic activity. For a typical microarchitecture, the instruction queue and its associated issue logic are responsible for about 25% of the total power consumption on average. This observation motivates an analysis of the effectiveness of a large instruction queue.

We first note that both empty entries and ready entries consume energy on useless activity: the former do not contain any valid data, whereas the latter are already known to be ready, so they do not need to be checked by the wake-up logic. Besides, we observe that the instruction queue must be large if its size is fixed and high performance is desired for a wide range of applications. However, different applications may require different queue sizes, as already pointed out by other authors [2]. In addition, the queue size requirements may vary significantly during the execution of a program. Based on these observations, we propose a technique to dynamically resize the instruction queue according to the characteristics of the dynamic instruction stream. This technique reduces the effective size of the instruction queue by about 26% on average. We show that these mechanisms reduce the energy consumption of the wake-up logic by about 90% with respect to a fixed-size instruction queue, while performance is hardly affected.
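The waste described above is easy to see in a toy model. The following sketch (plain C; the queue size, entry layout, occupancy and the use of a tag-comparison count as an energy proxy are illustrative assumptions, not details taken from the paper) contrasts a conventional wake-up broadcast, in which every queue entry is driven on every result broadcast, with a gated one that skips empty entries and operands that are already ready.

```c
/*
 * A minimal behavioral sketch, not the authors' circuit: it models the wake-up
 * broadcast over the instruction queue and counts tag comparisons as a crude
 * proxy for wake-up energy. Queue size, entry layout and occupancy are
 * illustrative assumptions.
 */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define IQ_SIZE 128   /* assumed fixed instruction queue size */
#define NUM_OPS 2     /* source operands per entry */

typedef struct {
    bool valid;              /* entry holds a dispatched, not yet issued instruction */
    int  src_tag[NUM_OPS];   /* physical register tags of the source operands */
    bool ready[NUM_OPS];     /* operand already available */
} iq_entry;

/* Conventional wake-up: every entry and operand is driven on every broadcast. */
static long wakeup_conventional(iq_entry *iq, int tag)
{
    long comparisons = 0;
    for (int i = 0; i < IQ_SIZE; i++)
        for (int op = 0; op < NUM_OPS; op++) {
            comparisons++;   /* the CAM cell fires whether or not it is useful */
            if (iq[i].valid && !iq[i].ready[op] && iq[i].src_tag[op] == tag)
                iq[i].ready[op] = true;
        }
    return comparisons;
}

/* Gated wake-up: empty entries and already-ready operands are not driven. */
static long wakeup_gated(iq_entry *iq, int tag)
{
    long comparisons = 0;
    for (int i = 0; i < IQ_SIZE; i++) {
        if (!iq[i].valid)
            continue;                    /* empty entry: no activity */
        for (int op = 0; op < NUM_OPS; op++) {
            if (iq[i].ready[op])
                continue;                /* ready operand: nothing left to wake up */
            comparisons++;
            if (iq[i].src_tag[op] == tag)
                iq[i].ready[op] = true;
        }
    }
    return comparisons;
}

int main(void)
{
    iq_entry iq_a[IQ_SIZE] = {0}, iq_b[IQ_SIZE];

    /* Toy occupancy: 40 valid entries, half of their second operands ready. */
    for (int i = 0; i < 40; i++) {
        iq_a[i].valid = true;
        iq_a[i].src_tag[0] = i;
        iq_a[i].src_tag[1] = i + 64;
        iq_a[i].ready[1] = (i % 2 == 0);
    }
    memcpy(iq_b, iq_a, sizeof(iq_a));

    printf("conventional wake-up: %ld comparisons\n", wakeup_conventional(iq_a, 5));
    printf("gated wake-up:        %ld comparisons\n", wakeup_gated(iq_b, 5));
    return 0;
}
```

With this toy occupancy the gated version performs fewer than a quarter of the comparisons; dynamically shrinking the effective queue, as proposed later in the paper, reduces the activity further.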
The rest of this paper is organized as follows. Section 2 reviews previous work. Section 3 describes the experimental framework. Section 4 analyzes the energy consumption of a typical superscalar processor. Section 5 presents the proposed techniques to reduce energy consumption and Section 6 evaluates their performance. Finally, Section 7 summarizes the main conclusions of this work.

2. Related work

A large number of techniques for low power have been proposed in the literature. The problem has been attacked from different standpoints, ranging from a high-level perspective, such as the operating system, down to the circuit and technology level. The two main objectives of low-power research are peak power reduction (i.e., reducing the maximum power dissipated by the die) and total energy reduction for a given workload.

Power consumption in CMOS technology is quadratic in the supply voltage and linear in the switched capacitance and the switching activity. Orchestrated methods that scale voltage and frequency simultaneously can therefore be very effective at drastically reducing the power requirements of microarchitectures. As examples, the Intel StrongARM SA-110 [11] and the Transmeta Crusoe [14] use combinations of these techniques to considerably reduce power consumption.
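Written out, this is the standard first-order expression for dynamic power in CMOS; the notation below is the conventional one and is assumed here, since the paper does not spell the formula out:

\[ P_{\mathrm{dyn}} = \alpha \, C \, V_{dd}^{2} \, f \]

where \(\alpha\) is the switching activity, \(C\) the switched capacitance, \(V_{dd}\) the supply voltage and \(f\) the clock frequency. Since the achievable frequency falls roughly in proportion to \(V_{dd}\), scaling voltage and frequency together trades performance for a nearly cubic reduction in power, which is why the orchestrated schemes cited above are so effective.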
Another example used in commercial products is the TAU (Thermal Assist Unit) of the PowerPC G3 and G4 family, where an interrupt is generated based on a programmable thermal threshold [25]. This causes a reduction of the instruction flow between the instruction cache and the instruction buffer in order to decrease the temperature of the chip. Typically, voltage and frequency scaling or interrupt-based power-aware techniques are managed by the operating system or the applications. Recent studies [5] have investigated adaptive thermal management and power-conscious adaptation of processor resources to the application requirements [3][2]. Marculescu proposes a mechanism to dynamically adapt the fetch and execution bandwidth based on profiling at the basic-block level [22].

Code and data compression [24][15][8] can reduce activity in memories and buses, although they introduce decompression overhead.

Some compiler techniques can also be found in the literature. For instance, instruction scheduling approaches [9] can be used to avoid large current peaks in the execution core of high-performance processors.

3. Experimental Framework

In this section we present the framework used to estimate the performance and power consumption of a typical superscalar processor. For this purpose, we have used the Architectural Power Evaluation tool [7], based on the SimpleScalar simulator [4], with extensions to compute the energy spent every cycle.

3.1. Power Consumption Estimation Model

The Architectural Power Evaluation tool [7] considers the processor microarchitecture partitioned into 32 blocks, each corresponding to a basic functional block and a circuit block. The internal connections of a block are modeled as components of the block. The power consumption of the interconnections among blocks is not considered, since it is usually a negligible part of the total power consumption. Every block is divided into subparts, depending on the hardware implementation, and the simulator counts the number of times that each subpart is used during the program execution. The power consumption of each subpart is characterized by three parameters: power density, occupied area and circuit type. Five different types of circuits are considered: memory, static logic, dynamic logic, PLA and clock distribution. In the model, the power characterization of a digital synchronous functional block of the microarchitecture is …
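The subpart-level bookkeeping described in this section can be sketched as follows. Everything concrete in the sketch is an assumption made for illustration: the subpart names and numbers are invented, and the per-access energy expression (power density times area times cycle time) is a plausible stand-in, since the excerpt does not give the tool's exact formula.

```c
/* Sketch of per-subpart energy accounting in the spirit of the model described
 * above. Characterization values and the per-access energy formula are assumed. */
#include <stdio.h>

typedef enum { MEMORY, STATIC_LOGIC, DYNAMIC_LOGIC, PLA, CLOCK_DIST } circuit_type;

typedef struct {
    const char  *name;
    circuit_type type;           /* one of the five circuit classes */
    double       power_density;  /* W per mm^2 when active (characterization input) */
    double       area_mm2;       /* area occupied by the subpart */
    long         accesses;       /* counted by the simulator during execution */
} subpart;

/* Assumed model: energy charged per access = power density * area * cycle time. */
static double subpart_energy(const subpart *s, double cycle_time_s)
{
    return s->power_density * s->area_mm2 * cycle_time_s * (double)s->accesses;
}

int main(void)
{
    subpart parts[] = {
        { "IQ wake-up CAM", DYNAMIC_LOGIC, 0.8, 1.5, 90000000L },   /* invented values */
        { "IQ selection",   STATIC_LOGIC,  0.3, 0.6, 45000000L },   /* invented values */
    };
    const double cycle_time = 1.0 / 1.0e9;   /* 1 GHz clock assumed */

    double total = 0.0;
    for (int i = 0; i < 2; i++) {
        double e = subpart_energy(&parts[i], cycle_time);
        total += e;
        printf("%-16s %.4f J\n", parts[i].name, e);
    }
    printf("issue-logic total %.4f J\n", total);
    return 0;
}
```

Summing such contributions over all 32 blocks is what lets the tool attribute a share of the total energy to each structure, which is the breakdown Section 4 uses to single out the issue logic.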