ECE 685 Digital Computer Structure/Architecture

ECE 685-001 DIGITAL COMPUTER STRUCTURE/ARCHITECTURE
COURSE SYLLABUS, FALL SEMESTER 2002

• Instructor: Dr. J. Robert (Bob) Heath
• Office: 313 Electrical Engineering Annex
• Office Phone Number: (859) 257-3124
• Email: [email protected]
• Web Page: http://www.engr.uky.edu/~heath
• Office Hours: M (3:30 pm-5:00 pm), W (2:30 pm-4:00 pm)
• Text: J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Third Edition, Morgan Kaufmann, 2003.
• References:
  H.S. Stone, High Performance Computer Architecture, Addison Wesley, 1990.
  C. Hamacher, Z. Vranesic, and S. Zaky, Computer Organization, Fifth Edition, McGraw Hill, 2002.
  S.G. Shiva, Computer Design & Architecture, Second Edition, Harper Collins, 1991.
  D.A. Patterson and J.L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Second Edition, Morgan Kaufmann, San Mateo, CA, 1998.
  W. Stallings, Computer Organization and Architecture: Designing for Performance, Fourth Edition, Prentice Hall, 1996.
  S. Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis, Prentice Hall, 1996.
  M. Ciletti, Modeling, Synthesis, and Rapid Prototyping with the Verilog HDL, Prentice Hall, 1999 (available for purchase in local bookstores).
• Meeting Schedule: MWF (10:00 am-10:50 am), 303 Slone Research Bldg.
• Course Description: Study of fundamental concepts in digital computer system architecture/structure and design. Topics include computer system modeling based on instruction set architecture models, and the architecture and design of datapaths, control units, processors, memory system hierarchies, and input/output systems. Special topics include floating-point arithmetic, multiple-level cache design, pipeline design techniques, multiple-issue processors, and an introduction to parallel computer architectures. Hardware Description Languages (HDLs) are used for architecture/design verification/validation via pre-synthesis simulation. Prereq: EE380 and EE581, or consent of instructor.
• Topical Outline:
  1. Introduction to Computer Architecture and Design Fundamentals.
  2. Instruction Set Architecture Models.
  3. Introduction to Computer Architecture/Design Verification via Use of a Hardware Description Language (HDL): Verilog.
  4. Instruction Set Principles and Examples.
  5. Pipelining.
  6. Advanced Pipelining and Instruction-Level Parallelism.
  7. Memory Hierarchy Design.
  8. Storage Systems.
  9. Input/Output Systems.
  10. Computer Design and Design Documentation and Verification via Verilog.
  11. Interconnection Networks.
  12. Introduction to Multiprocessors and Models.
  13. Introduction to Vector Processors.
• Grade:
  Test 1 (October 11): 25%
  Test 2 (November 25): 25%
  Homework; Design and Design-Verification Projects: 25%
  Final Exam, Comprehensive (Fri., Dec. 20, 10:30 am): 25%
  Your final grade will generally be determined by the number of points you have accumulated out of 100 possible points, as follows: A: 90-100 pts. B: 80-89 pts. C: 70-79 pts. E: 69 pts. or below. An equitable grade scale will be applied when warranted.
• Make-Up Examinations: Make-up examinations will be given only to students who miss examinations as a result of excused absences, according to applicable university policy.
Make-up exams may be of a different format from the regular exam format (for example, an oral format).
• Cheating: Cheating will not be allowed or tolerated. Anyone who cheats will be dealt with according to applicable university policy (assignment of a grade of E for the course).
• Class Attendance: Attendance at all class lectures is required to assure maximum course performance. You are responsible for all business conducted within a class.
• Homework Assignments: Homework assignments will be made periodically. Not all assignments will necessarily be graded. Assignments are due at the beginning of the class period on their due dates.

VERILOG

• A Hardware Description Language (HDL) used to describe the design and structure of digital system hardware.
• An HDL can be used for digital system design verification/validation and implementation via HDL simulation, synthesis, and implementation.
• Systems may be described at three levels: behavioral, register-transfer (dataflow, equation), and gate (structural) levels.
• Example: Binary Full Adder

Verilog Coding Example: Binary Full Adder (bfa)
Gate (Structural) Level Coding Style

module fulladd1 (I2, I1, I0, so, co); // Module name and input/output signal declaration
  input I2, I1, I0;                   // Input, output, and wire declarations
  output so;
  output co;
  wire wx0, wa0, wa1, wa2;
  xor X0 (wx0, I2, I1);               // Gate type, instantiation name, and output/input(s)
  xor X1 (so, wx0, I0);
  and A0 (wa0, I1, I0);
  and A1 (wa1, I2, I1);
  and A2 (wa2, I2, I0);
  or (co, wa0, wa1, wa2);             // Instance names are optional for gate primitives
endmodule                             // End of module
// Note: writing a gate as "and #2 A1 (wa1, I2, I1)" would give it a delay of
// 2 simulation time units.

Verilog Coding Example: Binary Full Adder (bfa)
Register-Transfer-Level (RTL), Dataflow, or Equation-Level Coding Style

// "bfa" RTL coding style.
module fulladd2 (I2, I1, I0, so, co); // Module name and input/output signal declaration.
  input I2, I1, I0;                   // Input and output declarations.
  output so;
  output co;
  assign so = I2 ^ I1 ^ I0; // Equations of the binary full adder (bfa). Bitwise exclusive-OR (^).
  assign co = (I2 && I0) || (I1 && I0) || (I2 && I1); // Logical AND (&&); logical OR (||).
                                                      // Compare bitwise AND (&); bitwise OR (|).
endmodule // End of module.

Verilog Coding Example: Binary Full Adder (bfa)
Behavioral Level Coding Style

// "bfa" behavioral coding style.
module fulladd3 (I2, I1, I0, so, co); // Module name and input/output signal declaration.
  input I2, I1, I0;                   // Input and output declarations.
  output co, so;
  reg co, so;                         // Port (signal) values are held until they change.
  always @ (I2 or I1 or I0)           // The following code executes any time I2, I1, or I0
  begin                               // changes value.
    case ({I2, I1, I0})               // Behavioral-level "case" structure implements the truth table.
      3'b000: begin co = 1'b0; so = 1'b0; end
      3'b001: begin co = 1'b0; so = 1'b1; end
      3'b010: begin co = 1'b0; so = 1'b1; end // 3'b010 denotes a 3-bit binary value, 010.
      3'b011: begin co = 1'b1; so = 1'b0; end
      3'b100: begin co = 1'b0; so = 1'b1; end
      3'b101: begin co = 1'b1; so = 1'b0; end
      3'b110: begin co = 1'b1; so = 1'b0; end
      3'b111: begin co = 1'b1; so = 1'b1; end
    endcase
  end
endmodule
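The gate-level module above can itself serve as a building block in a larger structural design. The sketch below is not from the course slides; the module name rca4 and its port names are illustrative assumptions. It chains four instances of fulladd1 into a 4-bit ripple-carry adder, using the same positional port-connection style that the testbench which follows uses for its MUT instantiation.

// Hypothetical example (not from the slides): a 4-bit ripple-carry adder
// built by instantiating four copies of the gate-level "fulladd1" module.
module rca4 (A, B, cin, S, cout); // 4-bit operands A and B, carry-in, sum, carry-out
  input  [3:0] A, B;
  input        cin;
  output [3:0] S;
  output       cout;
  wire c1, c2, c3; // Internal carry chain between the full-adder stages
  fulladd1 FA0 (A[0], B[0], cin, S[0], c1); // Port order matches fulladd1: I2, I1, I0, so, co
  fulladd1 FA1 (A[1], B[1], c1,  S[1], c2);
  fulladd1 FA2 (A[2], B[2], c2,  S[2], c3);
  fulladd1 FA3 (A[3], B[3], c3,  S[3], cout);
endmodule

Each stage's carry-out feeds the next stage's carry-in, which is what makes the carry "ripple" and the worst-case delay grow linearly with the word width.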
AUTOMATED TESTBENCH FOR THE "fulladd1" MODULE

module testfulladd1;
  reg I2, I1, I0, Cot, Sot, flag; // Signals declared as registers (they hold values until changed)
  wire Sos, Cos;
  fulladd1 ADD0 (I2, I1, I0, Sos, Cos); // Instantiation of the Module Under Test (MUT)

  initial // This process defines the signals we want to view as the system is simulated,
  begin   // and the code we represent signals in; choices are binary, octal, or hex.
    $monitor($time, " I2=%b I1=%b I0=%b Cos=%b Cot=%b Sos=%b Sot=%b flag=%b",
             I2, I1, I0, Cos, Cot, Sos, Sot, flag);
  end

  initial        // This process generates stimulus inputs to the MUT
  begin: I2_loop // and, for each, a theoretically correct output.
    integer m;
    for (m = 0; m < 2; m = m + 1)
    begin: I1_loop
      integer n;
      for (n = 0; n < 2; n = n + 1)
      begin: I0_loop
        integer o;
        for (o = 0; o < 2; o = o + 1)
        begin
          #1 I2 = m; I1 = n; I0 = o; // Stimulus applied to MUT inputs
          if (m == 0 && n == 0 && o == 0) Cot = 0; // Generation of
          if (m == 0 && n == 0 && o == 0) Sot = 0; // theoretically
          if (m == 0 && n == 0 && o == 1) Cot = 0; // correct outputs
          if (m == 0 && n == 0 && o == 1) Sot = 1;
          if (m == 0 && n == 1 && o == 0) Cot = 0;
          if (m == 0 && n == 1 && o == 0) Sot = 1;
          if (m == 0 && n == 1 && o == 1) Cot = 1;
          if (m == 0 && n == 1 && o == 1) Sot = 0;
          if (m == 1 && n == 0 && o == 0) Cot = 0;
          if (m == 1 && n == 0 && o == 0) Sot = 1;
          if (m == 1 && n == 0 && o == 1) Cot = 1;
          if (m == 1 && n == 0 && o == 1) Sot = 0;
          if (m == 1 && n == 1 && o == 0) Cot = 1;
          if (m == 1 && n == 1 && o == 0) Sot = 0;
          if (m == 1 && n == 1 && o == 1) Cot = 1;
          if (m == 1 && n == 1 && o == 1) Sot = 1;
          #1 if ((Cos == Cot) && (Sos == Sot)) flag = 0; // Comparison of the theoretically
             else flag = 1;                              // correct output to the MUT output.
        end                                              // "==" is logical equality.
      end
    end
    #5 flag = 1;
  end

  initial // Process which will shut down a run-away simulation
  begin   // (i.e., one stuck in an endless loop).
    #150 $finish;
  end
endmodule

SIMUCAD SILOS III DEMONSTRATIONS
(Use files "bfa_gate_level" and "testbfa_gate_level")
(Use files "mux4x1_gate_level" and "testmux4x1_gate_level")

• Examples of "exhaustive, automated" testbenches: every input combination is applied and the response checked automatically.

CHAP. 1: FUNDAMENTALS OF COMPUTER DESIGN

• Electronic computing originated in 1945.
• Performance Growth:
  1945-1970: ?% per year; technology and architectural innovation.
  1970-1980: 25%-35% per year; technology only.
  1980-1995: 50% per year; RISC (microprocessor driven): technology and architectural
             innovation (instruction-level parallelism (ILP) and cache memories);
             fewer mainframes and supercomputers.
  1995-2002: 50% per year; technology and architectural innovation (superscalar
             pipelining, ILP, and parallel processing); application classes:
             PCs, servers, embedded computers.

[Figure: Growth in microprocessor performance due to technology and architectural innovation; vertical axis scaled 0 to 1600 in relative performance.]

TASK OF A COMPUTER DESIGNER (ARCHITECT)

1. Determine the important requirements/attributes of a new computer to be developed.
2. Design to meet those requirements/attributes, to maximize performance (when required), and to lower cost.

• Design Steps:
  1. Assembly-language instruction set design.
  2. Functional organization (steps 1 and 2 comprise ISA design).
  3. Logic design of functional units, to RTL and gate level only when required.
  4. Design capture (Verilog or VHDL).
  5. Organization/architecture/design verification/validation via HDL simulation (see the compact testbench sketch below).
  6. Implementation: IC design, layout, etc.; power; cooling; low-power design; testing.

• Computer Architecture includes:
  1. Instruction Set Architecture Design.
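As a closing illustration of design step 5, the 16-branch if-chain in testfulladd1 can be collapsed into a one-line behavioral reference model, since the carry/sum pair of a full adder is just the 2-bit sum of its three input bits. This is a sketch, not from the course slides: the module name testfulladd1_compact is hypothetical, and the run-away-simulation watchdog is folded into a single $finish at the end of the stimulus loop.

// Hypothetical compact variant of the automated testbench above: the
// reference outputs {Cot, Sot} are computed arithmetically rather than by
// a 16-way if-chain. Signal names follow the original testbench.
module testfulladd1_compact;
  reg I2, I1, I0, Cot, Sot, flag;
  wire Sos, Cos;
  integer k;

  fulladd1 ADD0 (I2, I1, I0, Sos, Cos); // Module Under Test (MUT)

  initial
    $monitor($time, " I2=%b I1=%b I0=%b Cos=%b Cot=%b Sos=%b Sot=%b flag=%b",
             I2, I1, I0, Cos, Cot, Sos, Sot, flag);

  initial begin
    for (k = 0; k < 8; k = k + 1) begin
      #1 {I2, I1, I0} = k;       // Apply one row of the truth table
      {Cot, Sot} = I2 + I1 + I0; // Reference model: 2-bit sum of the three input bits
      #1 flag = !((Cos == Cot) && (Sos == Sot)); // 0 = pass, 1 = mismatch
    end
    #5 $finish;
  end
endmodule

The single for-loop still exercises all eight input combinations, so the bench remains exhaustive; the arithmetic reference simply trades the explicit truth table for the defining equation of the full adder.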