Simultaneous Multithreading: Exploiting Instruction-Level and Thread-Level Parallelism in Microprocessors

Jeffrey Butera
Senior Thesis, Computer Science
Advised by Dave Wonnacott
Haverford College
April 25, 2011

Abstract

Increasing the performance of microprocessors has always been a pressing issue in the fields of Computer Science and Computer Engineering. The first processors functioned by executing a single stream of sequential instructions. The initial approach to increasing the speed of instruction execution involved using smaller and faster transistors so that the clock frequency could be increased. This approach was stymied by slow memory accesses that could not keep up with a fast clock cycle. Cache memories were developed to keep frequently accessed data closer to the processor to prevent these types of stalls. RISC (reduced instruction set computing) architectures were also designed to allow for faster clock cycles and execution.

After RISC, the focus shifted away from speeding up the execution of a single instruction and moved towards increasing the number of instructions that can be active in a single clock cycle. Different designs, such as pipelining and superscalars, allowed multiple instructions to be issued in the same clock cycle, but were limited by increased power consumption, heat, and chip size. Multi-core and multiprocessor designs allowed multiple threads of instructions to be simultaneously executed on different processors or cores. These designs were improved by the development of multithreading, which aimed to increase thread-level parallelism on a single core.

This thesis addresses the intricacies of a specific type of hardware multithreading: Simultaneous Multithreading (SMT). SMT is a technique used to increase the performance of a microprocessor by exploiting parallelism in all available forms. SMT combines instruction-level and thread-level parallelism and enables multiple instructions from multiple threads to be issued in the same clock cycle. This paper is a detailed examination of Simultaneous Multithreading and will specifically address: how multiple instructions are fetched and issued simultaneously, how hardware resources are divided amongst threads, the performance improvement associated with SMT in terms of instructions per cycle, the energy-efficiency of the design, and how SMT is utilized in general-purpose microprocessors such as the Intel Pentium 4 with Hyper-Threading.

Contents

1 Background
2 Exploiting Instruction-Level Parallelism
3 Maximizing Thread-Level Parallelism
4 Increasing ILP and TLP
5 How SMT Works
    5.1 The Instruction Fetch Unit
        5.1.1 Thread Selection
        5.1.2 Instruction Selection
        5.1.3 Clearing Up the Fetch Unit
    5.2 Resource Sharing
6 SMT Performance
7 The Energy-Efficiency of SMT
8 Intel Hyper-Threading
9 Conclusion

1 Background

A microprocessor, or central processing unit (CPU), is the brain of a computer. Much like a human brain, it processes information and sends out instructions to tell the different parts of the computer what to do. The operation of a microprocessor can be broken down into four major steps. First, an instruction is fetched from memory. Second, the instruction is decoded so that the CPU can be told what to do. Next, the instruction is executed by the hardware, and finally the result of the execution is written back to some form of memory.
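To make these four steps concrete, here is a minimal sketch of the fetch-decode-execute-writeback cycle for a toy accumulator machine, written in C. The opcode names (OP_LOAD, OP_ADD, OP_STORE, OP_HALT) and the two-byte instruction encoding are invented for illustration and do not correspond to any real instruction set.

```c
#include <stdio.h>
#include <stdint.h>

/* Invented toy ISA: each instruction is one opcode byte followed by
   one operand-address byte. Not a real instruction set. */
enum { OP_LOAD, OP_ADD, OP_STORE, OP_HALT };

int main(void) {
    uint8_t program[] = { OP_LOAD, 0, OP_ADD, 1, OP_STORE, 2, OP_HALT, 0 };
    int memory[3] = { 20, 22, 0 };   /* data memory */
    int acc = 0;                     /* accumulator register */
    int pc = 0;                      /* program counter */

    for (;;) {
        /* 1. Fetch: read the next instruction from program memory. */
        uint8_t opcode  = program[pc];
        uint8_t operand = program[pc + 1];
        pc += 2;

        /* 2. Decode and 3. Execute: work out what the instruction
           asks for, then compute its result. */
        int result = 0;
        switch (opcode) {
        case OP_LOAD:  result = memory[operand];       break;
        case OP_ADD:   result = acc + memory[operand]; break;
        case OP_STORE: result = acc;                   break;
        case OP_HALT:  printf("memory[2] = %d\n", memory[2]); return 0;
        }

        /* 4. Writeback: commit the result to a register or to memory. */
        if (opcode == OP_STORE) memory[operand] = result;
        else                    acc = result;
    }
}
```

A real CPU performs these steps in hardware rather than in a software loop, but the control flow is the same: each pass through the loop retires one instruction.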
In the 1970s, the first microprocessor-based computers contained a single processor that executed a single stream of instructions sequentially. The first approaches to speeding up a processor involved creating smaller and faster transistors so that the clock speed could be increased. Eventually, the clock speed could no longer be increased because slow memory accesses held back performance gains. Cache memories were developed to keep frequently accessed data in smaller, faster memories closer to the CPU. If, during the execution of an instruction, the CPU needs to access data that is stored in memory, it first checks the cache for the data. If the data is found, a cache hit, then it simply reads the data. If the data is not in the cache, a cache miss, then the data must be fetched from main memory. The processor stalls while waiting for the data to arrive from memory, increasing the instruction's execution time.

The next approach to increasing processor performance dealt with the instructions themselves. The reduced instruction set computing (RISC) strategy was developed to simplify instructions. The idea behind RISC was to eliminate complex instructions that take multiple, longer clock cycles to complete. By converting a complex instruction into a sequence of simpler instructions (for example, turning a single memory-to-memory add into a load, an add, and a store), each step fits in a shorter clock cycle, so the clock rate could be increased and the same result achieved faster. The development of RISC-based architectures influenced designers to come up with more ways to exploit instruction-level parallelism to increase performance.

2 Exploiting Instruction-Level Parallelism

The first major attempt to maximize instruction-level parallelism (ILP) was a technique called pipelining. Pipelining breaks a processor down into multiple stages and creates a pipeline that instructions pass through. This pipeline functions much like an assembly line: an instruction enters at one end, passes through the different stages of the pipe, and exits at the other end. The pipeline allows multiple instructions to flow through the pipe at the same time. Different instructions can be executed in parallel, one in each stage of the pipeline, thus increasing the throughput, or number of instructions leaving the pipeline per unit time [4].

Most pipelined processors are broken down into five stages: Fetch, Decode, Execute, Memory, and Writeback. This means that five instructions can be in flight at the same time, one in each stage of the pipeline. Also, each stage of the pipeline only has to deal with roughly 1/5 of the processor's logic, making the stages faster than full instructions and allowing the clock frequency to be increased [3]. A pipelined architecture must be aware of the length of each pipeline stage, because the clock rate is dependent on the slowest stage of the pipeline. That is to say, no instruction can move to the next stage of the pipeline until all instructions have finished at their current stage.
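To see why pipelining raises throughput, the following sketch prints the classic pipeline diagram under an idealized model: one cycle per stage and no stalls, hazards, or dependencies between instructions (a simplifying assumption). Instruction i occupies stage s during cycle i + s, so n instructions complete in n + 4 cycles rather than 5n.

```c
#include <stdio.h>

/* Ideal 5-stage pipeline: Fetch, Decode, Execute, Memory, Writeback. */
#define STAGES 5

int main(void) {
    const char *stage[STAGES] = { "F", "D", "E", "M", "W" };
    int n = 6; /* number of independent instructions */

    /* Instruction i enters stage s in cycle i + s (0-indexed). */
    for (int i = 0; i < n; i++) {
        printf("I%d: ", i);
        for (int c = 0; c < n + STAGES - 1; c++) {
            int s = c - i;
            printf("%2s ", (s >= 0 && s < STAGES) ? stage[s] : ".");
        }
        printf("\n");
    }
    printf("pipelined: %d cycles; unpipelined: %d cycles\n",
           n + STAGES - 1, n * STAGES);
    return 0;
}
```

For six instructions this gives 10 cycles instead of 30. The gain comes entirely from overlap: no individual instruction finishes any faster, but one instruction leaves the pipeline nearly every cycle once the pipe is full.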
The next development in increasing instruction-level parallelism was the superscalar processor. Superscalars achieve ILP and execute multiple instructions per cycle by having multiple copies of the hardware datapath. A superscalar with two datapaths can fetch two instructions simultaneously from memory. This means that the processor must also have double the logic to fetch and decode two instructions at the same time [3]. Superscalars also utilize pipelining to fully maximize parallelism. A sample execution of a superscalar processor can be seen in Figure 1a. The superscalar is executing a single program, and each box represents a potential issue slot for an instruction. This particular superscalar has four datapaths and attempts to fill these issue slots each processor cycle to achieve ILP.

Figure 1: Sample execution sequence for (a) a superscalar processor, (b) a multithreaded superscalar processor, (c) an SMT processor. [1]

Further advances in pipelining and superscalars are limited. In the case of pipelining, increasing the clock rate means an increase in power consumption and heat. Superscalars are limited by their width, or number of datapaths: as the width increases, the amount of hardware needed on the chip grows as well, and more hardware means bigger chips and more power consumed. These issues make it impractical to rely solely on instruction-level parallelism to increase processor performance, and they have become the driving force behind the switch to increasing thread-level parallelism.

3 Maximizing Thread-Level Parallelism

A second form of parallelism, thread-level parallelism (TLP), arose to make use of multiprocessor and multi-core designs. A multiprocessor exploits TLP by connecting two or more processors. Rather than overlapping individual instructions from one thread (ILP), multiple streams of instructions are simultaneously executed on the different processors [4]. In a multi-core architecture, multiple processor cores are situated on a single chip. The core of the processor is regarded as the unit that actually interprets and executes instructions. Each core usually has its own cache but shares lower-level caches and the input/output buses [4]. Multi-core designs achieve TLP by assigning each core its own stream of instructions to execute.

A technique called multithreading takes thread-level parallelism a step further by increasing TLP within a single core. A core with multithreading capabilities is assigned multiple threads from which it can execute instructions. While one thread is waiting for a response from memory, the processor can switch threads and execute another instruction. Each thread also has its own program counter and set of registers to hold its architectural state, so that the processor can pick up where it left off when it switches back to a thread. This ensures that more than one thread can be active at one time [3].

The basic idea of enhancing TLP through the use of multithreading can be explained with a simple example. Imagine enhancing the quality of a picture using some sort of image-processing software. A single-core, non-multithreaded processor would work on enhancing the picture pixel by pixel, from left to right and top to bottom. Multithreading can be used to speed up this process, but the software orchestrating the image enhancement must be written to support multiple threads. If a program were written to support four threads, then each thread could handle enhancing a quarter of the image. A quad-core processor running this software could enhance the image up to four times faster than the single-core processor by sending one thread of instructions to each core of the processor.
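A rough sketch of this four-thread division of labor, using POSIX threads in C, appears below. The image dimensions, the per-pixel brightening operation, and the hard-coded thread count are all invented for illustration.

```c
#include <pthread.h>
#include <stdio.h>

#define WIDTH    640
#define HEIGHT   480
#define NTHREADS 4          /* one thread per quarter of the image */

static unsigned char image[HEIGHT][WIDTH];

struct slice { int row_begin, row_end; };

/* Hypothetical per-pixel "enhancement": brighten with saturation. */
static void *enhance_slice(void *arg) {
    struct slice *s = arg;
    for (int y = s->row_begin; y < s->row_end; y++)
        for (int x = 0; x < WIDTH; x++) {
            int v = image[y][x] + 40;
            image[y][x] = (unsigned char)(v > 255 ? 255 : v);
        }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct slice slices[NTHREADS];
    int rows = HEIGHT / NTHREADS;

    /* Hand each thread its own quarter of the rows. */
    for (int i = 0; i < NTHREADS; i++) {
        slices[i].row_begin = i * rows;
        slices[i].row_end   = (i == NTHREADS - 1) ? HEIGHT : (i + 1) * rows;
        pthread_create(&tid[i], NULL, enhance_slice, &slices[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("done: corner pixel = %d\n", image[0][0]);
    return 0;
}
```

Because each thread writes only to its own block of rows, no locking is needed. On a quad-core processor, or on a four-way multithreaded core, the operating system can schedule each thread onto its own core or hardware thread context, which is exactly the parallelism the example above describes.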