WCAE 2003 Workshop on Computer Architecture Education

Total Page:16

File Type:pdf, Size:1020Kb

WCAE 2003 Workshop on Computer Architecture Education WCAE 2003 Proceedings of the Workshop on Computer Architecture Education in conjunction with The 30th International Symposium on Computer Architecture DQG 2003 Federated Computing Research Conference Town and Country Resort and Convention Center b San Diego, California June 8, 2003 Workshop on Computer Architecture Education Sunday, June 8, 2003 Session 1. Welcome and Keynote 8:45–10:00 8:45 Welcome Edward F. Gehringer, workshop organizer 8:50 Keynote address, “Teaching and teaching computer Architecture: Two very different topics (Some opinions about each),” Yale Patt, teacher, University of Texas at Austin 1 Break 10:00–10:30 Session 2. Teaching with New Architectures 10:30–11:20 10:30 “Intel Itanium floating-point architecture,” Marius Cornea, John Harrison, and Ping Tak Peter Tang, Intel Corp. ................................................................................................................................. 5 10:50 “DOP — A CPU core for teaching basics of computer architecture,” Miloš BeþváĜ, Alois Pluháþek and JiĜí DanƟþek, Czech Technical University in Prague ................................................................. 14 11:05 Discussion Break 11:20–11:30 Session 3. Class Projects 11:30–12:30 11:30 “Superscalar out-of-order demystified in four instructions,” James C. Hoe, Carnegie Mellon University ......................................................................................................................................... 22 11:50 “Bridging the gap between undergraduate and graduate experience in computer systems studies,” Lori Carter and Scott Rae, Point Loma Nazarene University ........................................................... 28 12:05 “Integration of computer security laboratories into computer architecture courses to enhance undergraduate curriculum,” Jayantha Herath and Ajantha Herath, St. Cloud State U. and U. of Dubuque ............................................................................................................................................ 36 12:20 Discussion Lunch 12:30–2:00 Session 4. Teaching Techniques 2:00–3:30 2:00 “Combining learning strategies in a first course in computer architecture,” Patricia Teller, Manuel Nieto, and Steve Roach, U. of Texas at El Paso ............................................................................... 41 2:25 “Building resources for teaching computer architecture through electronic peer review,” Edward F. Gehringer, North Carolina State University ................................................................................. 49 2:40 “Laboratory options for the computer science major," Christopher Vickery and Tamara Blain, Queens College of CUNY ................................................................................................................ 57 2:55 “Activating computer architecture with Classroom Presenter,” Beth Simon, U. of San Diego; and Richard Anderson and Steven Wolfman, U. of Washington ............................................................ 64 3:15 Discussion Break 3:30–4:00 Session 5. Simulation Environments 4:00–5:40 4:00 “The Liberty simulation environment as a pedagogical tool,” Jason Blome, Manish Vachhajarani, Neil Vachhajarani, and David I. August, Princeton University ........................................................ 72 4:25 “Multimedia components for the visualization of dynamic behavior in computer architectures,” Peter Marwedel and Brigit Sirocic, U. of Dortmund ........................................................................ 79 4:40 “Didactic architectures and simulator for network processor learning,” Henrique Cota de Freitas and Carlos Augusto P. S. Martins, Pontifical Catholic U. of Minas Gerais ...................................... 86 5:00 “On the introduction of reconfigurable hardware into computer architecture education,” Ross Brennan and Michael Manzke, Trinity College, Dublin ................................................................... 96 5:15 “Use of HDLs in teaching of computer hardware courses,” Zvonko Vranesic and Stephen Brown, U. of Toronto .................................................................................................................................. 103 5:30 Discussion Teaching and Teaching Computer Architecture: Two Very different topics (Some Opinions about each) Yale N. Patt teacher, The University of Texas at Austin Abstract Distance Learning produces better education, not cheaper education This year’s Computer Architecture Education work- We pay teachers enough that those who would opt shop is remarkable in its recognition that to teach com- for this career don’t opt for medical school instead puter architecture well, one has to pay attention to two things (a) teaching and (b) computer architecture. Hav- We teach high school English teachers enough En- ing been doing both for a good number of years, I har- glish that students at the University can write two bor a fair number of opinions on what one should do consecutive coherent sentences and what one should not do with respect to each. This We get past this insane preoccupation with political talk will get into some of those opinions. With respect correctness, so we can get on with the business of to teaching, I will discuss some of my Ten Command- teaching and learning ments of Good Teaching, what I think of distance learn- We stop canonizing the use of high tech education. ing, political correctness, emphasis on memorization, the Bad pedagogy is NOT good pedagogy if draped in inability of American students to write English, the value technology of having students study in groups, and what I feel is of- ten the sad misuse of technology. Most importantly, I We stop rewarding memorization ability, so maybe will discuss my motivated bottom up approach to learn- students will learn to think,...and perhaps under- ing. With respect to teaching computer architecture, I stand believe the single most important point to get across is that computer architecture, if it is a science at all, is a Figure 1. Visions (Re: Education) science of tradeoffs. The student is best served if he/she thoroughly understands the fundamental principles so as to be able to make the appropriate tradeoffs in reaching a particular design objective. I also plan to discuss the The remaining figures deal with the two parts of my use (and too often, misuse) of measurements, simulation, talk, a focus on teaching, and a focus on teaching com- and real ISAs as opposed to concocted ISAs. puter architecture. 1 Introduction I have been asked to provide copies of the transparen- 2 Focus on Teaching cies I will use in my Keynote Address. I have added Teaching involves at least three things: how to teach some annotated text to hopefully provide some context. (Section 2.1), what to teach (Section 2.2), and what aids Knowing from past experience that the dynamic to use in the process (Section 2.3). schedule of my talks usually bears only casual resem- blance to the static schedule that I prepared ahead of time, I provide these copies somewhat timidly. Figure 1 is from a talk I gave last fall at the annual 2.1 My Ten Commandments of Good Teaching Visions Lecture of the Computer Sciences Department Someone suggested I come up with a set of command- at The University of Texas at Austin. The subject was ments for good teaching. For historical reasons, they Education. I was asked to give my dream for an ideal thought ten would be a good number. So, I set out to future with respect to education. do it, and came up with nine. Ergo, note the tenth one. On sober reflection, that one in itself is a tenth command- 1 ment. So, the list on Figure 2 really does contain ten, not nine as some have commented. Know the material Want to teach Genuinely respect your students and show it Set the bar high; students will measure up Emphasize understanding; de-emphasize memo- rization Take responsibility for what is covered Don’t even try to cover the material Encourage interruptions; don’t be afraid to digress Don’t forget those three little words Reserved for future use Figure 2. My Ten Commandments of Good Teaching Figure 4. My motivated bottom-up ap- proach 2.2 Emphasis on the Fundamentals My emphasis on teaching the fundamentals has been ”Aha!” happens. How well someone can do that depends part of me forever. No doubt my PhD students get tired on how well that person has mastered the fundamentals. of hearing about it. For example, I believe that an ap- My views crystallized particularly strongly with the propriate PhD qualifier is not a written test on some ad- development of our ”new” introduction to computing, vanced course in the graduate curriculum, but rather an which we first developed at Michigan in the mid-90s[1] oral exam on the fundamental concepts found in relevant and later turned into a textbook, published by McGraw- senior level undergraduate courses. Hill[2]. My view of what is important is expressed in My view is simple: research, development, prob- Figure 3. More specific detail of the motivated bottom- lem solving are all about breaking problems down into up approach is shown in Figure 4. small pieces and working with the small pieces until the 2.3 High Tech in the Classroom Top-down design, Bottom-up learning for under- There seems to be a preoccupation with using tech- standing nology in the classroom. Certainly, there is much that can be done with technology to improve learning. I am Abstraction is vital, but... concerned that in our leap to technologize everything, we Not bottom-up, but ”motivated”
Recommended publications
  • Review Memory Disambiguation Review Explicit Register Renaming
    5HYLHZ5HRUGHU%XIIHU 52% &6 *UDGXDWH&RPSXWHU$UFKLWHFWXUH 8VHRIUHRUGHUEXIIHU /HFWXUH ² ,QRUGHULVVXH2XWRIRUGHUH[HFXWLRQ,QRUGHUFRPPLW ² +ROGVUHVXOWVXQWLOWKH\FDQEHFRPPLWWHGLQRUGHU ,QVWUXFWLRQ/HYHO3DUDOOHOLVP ª 6HUYHVDVVRXUFHRIYDOXHVXQWLOLQVWUXFWLRQVFRPPLWWHG ² 3URYLGHVVXSSRUWIRUSUHFLVHH[FHSWLRQV6SHFXODWLRQVLPSO\WKURZRXW *HWWLQJWKH&3, LQVWUXFWLRQVODWHUWKDQH[FHSWHGLQVWUXFWLRQ ² &RPPLWVXVHUYLVLEOHVWDWHLQLQVWUXFWLRQRUGHU ² 6WRUHVVHQWWRPHPRU\V\VWHPRQO\ZKHQWKH\UHDFKKHDGRIEXIIHU 6HSWHPEHU ,Q2UGHU&RPPLW LVLPSRUWDQWEHFDXVH 3URI-RKQ.XELDWRZLF] ² $OORZVWKHJHQHUDWLRQRISUHFLVHH[FHSWLRQV ² $OORZVVSHFXODWLRQDFURVVEUDQFKHV &6.XELDWRZLF] &6.XELDWRZLF] /HF /HF 5HYLHZ0HPRU\'LVDPELJXDWLRQ 5HYLHZ([SOLFLW5HJLVWHU5HQDPLQJ 4XHVWLRQ*LYHQDORDGWKDWIROORZVDVWRUHLQSURJUDP 0DNHXVHRIDSK\VLFDO UHJLVWHUILOHWKDWLVODUJHUWKDQ RUGHUDUHWKHWZRUHODWHG" QXPEHURIUHJLVWHUVVSHFLILHGE\,6$ ² 7U\LQJWRGHWHFW5$:KD]DUGVWKURXJKPHPRU\ .H\LQVLJKW$OORFDWHDQHZSK\VLFDOGHVWLQDWLRQUHJLVWHU ² 6WRUHVFRPPLWLQRUGHU 52% VRQR:$5:$:PHPRU\KD]DUGV IRUHYHU\LQVWUXFWLRQWKDWZULWHV ,PSOHPHQWDWLRQ ² 5HPRYHVDOOFKDQFHRI:$5RU:$:KD]DUGV ² .HHSTXHXHRIVWRUHVLQSURJRUGHU ² 6LPLODUWRFRPSLOHUWUDQVIRUPDWLRQFDOOHG6WDWLF6LQJOH$VVLJQPHQW ² :DWFKIRUSRVLWLRQRIQHZORDGVUHODWLYHWRH[LVWLQJVWRUHV ª /LNHKDUGZDUHEDVHGG\QDPLFFRPSLODWLRQ" :KHQKDYHDGGUHVVIRUORDGFKHFNVWRUHTXHXH 0HFKDQLVP".HHSDWUDQVODWLRQWDEOH ² ,IDQ\ VWRUHSULRUWRORDGLVZDLWLQJIRULWVDGGUHVVVWDOOORDG ² ,6$UHJLVWHU⇒ SK\VLFDOUHJLVWHUPDSSLQJ ² ,IORDGDGGUHVVPDWFKHVHDUOLHUVWRUHDGGUHVV DVVRFLDWLYHORRNXS ² :KHQUHJLVWHUZULWWHQUHSODFHHQWU\ZLWKQHZUHJLVWHUIURPIUHHOLVW WKHQZHKDYHDPHPRU\LQGXFHG
    [Show full text]
  • Computer Science 246 Computer Architecture Spring 2010 Harvard University
    Computer Science 246 Computer Architecture Spring 2010 Harvard University Instructor: Prof. David Brooks [email protected] Dynamic Branch Prediction, Speculation, and Multiple Issue Computer Science 246 David Brooks Lecture Outline • Tomasulo’s Algorithm Review (3.1-3.3) • Pointer-Based Renaming (MIPS R10000) • Dynamic Branch Prediction (3.4) • Other Front-end Optimizations (3.5) – Branch Target Buffers/Return Address Stack Computer Science 246 David Brooks Tomasulo Review • Reservation Stations – Distribute RAW hazard detection – Renaming eliminates WAW hazards – Buffering values in Reservation Stations removes WARs – Tag match in CDB requires many associative compares • Common Data Bus – Achilles heal of Tomasulo – Multiple writebacks (multiple CDBs) expensive • Load/Store reordering – Load address compared with store address in store buffer Computer Science 246 David Brooks Tomasulo Organization From Mem FP Op FP Registers Queue Load Buffers Load1 Load2 Load3 Load4 Load5 Store Load6 Buffers Add1 Add2 Mult1 Add3 Mult2 Reservation To Mem Stations FP adders FP multipliers Common Data Bus (CDB) Tomasulo Review 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 LD F0, 0(R1) Iss M1 M2 M3 M4 M5 M6 M7 M8 Wb MUL F4, F0, F2 Iss Iss Iss Iss Iss Iss Iss Iss Iss Ex Ex Ex Ex Wb SD 0(R1), F0 Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss M1 M2 M3 Wb SUBI R1, R1, 8 Iss Ex Wb BNEZ R1, Loop Iss Ex Wb LD F0, 0(R1) Iss Iss Iss Iss M Wb MUL F4, F0, F2 Iss Iss Iss Iss Iss Ex Ex Ex Ex Wb SD 0(R1), F0 Iss Iss Iss Iss Iss Iss Iss Iss Iss M1 M2
    [Show full text]
  • System Design for a Computational-RAM Logic-In-Memory Parailel-Processing Machine
    System Design for a Computational-RAM Logic-In-Memory ParaIlel-Processing Machine Peter M. Nyasulu, B .Sc., M.Eng. A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Doctor of Philosophy Ottaw a-Carleton Ins titute for Eleceical and Computer Engineering, Department of Electronics, Faculty of Engineering, Carleton University, Ottawa, Ontario, Canada May, 1999 O Peter M. Nyasulu, 1999 National Library Biôiiothkque nationale du Canada Acquisitions and Acquisitions et Bibliographie Services services bibliographiques 39S Weiiington Street 395. nie WeUingtm OnawaON KlAW Ottawa ON K1A ON4 Canada Canada The author has granted a non- L'auteur a accordé une licence non exclusive licence allowing the exclusive permettant à la National Library of Canada to Bibliothèque nationale du Canada de reproduce, ban, distribute or seU reproduire, prêter, distribuer ou copies of this thesis in microform, vendre des copies de cette thèse sous paper or electronic formats. la forme de microficbe/nlm, de reproduction sur papier ou sur format électronique. The author retains ownership of the L'auteur conserve la propriété du copyright in this thesis. Neither the droit d'auteur qui protège cette thèse. thesis nor substantial extracts fkom it Ni la thèse ni des extraits substantiels may be printed or otherwise de celle-ci ne doivent être imprimés reproduced without the author's ou autrement reproduits sans son permission. autorisation. Abstract Integrating several 1-bit processing elements at the sense amplifiers of a standard RAM improves the performance of massively-paralle1 applications because of the inherent parallelism and high data bandwidth inside the memory chip.
    [Show full text]
  • ARM Cortex-A* Brian Eccles, Riley Larkins, Kevin Mee, Fred Silberberg, Alex Solomon, Mitchell Wills
    ARM Cortex-A* Brian Eccles, Riley Larkins, Kevin Mee, Fred Silberberg, Alex Solomon, Mitchell Wills The ARM Cortex­A product line has changed significantly since the introduction of the Cortex­A8 in 2005. ARM’s next major leap came with the Cortex­A9 which was the first design to incorporate multiple cores. The next advance was the development of the big.LITTLE architecture, which incorporates both high performance (A15) and high efficiency(A7) cores. Most recently the A57 and A53 have added 64­bit support to the product line. The ARM Cortex series cores are all made up of the main processing unit, a L1 instruction cache, a L1 data cache, an advanced SIMD core and a floating point core. Each processor then has an additional L2 cache shared between all cores (if there are multiple), debug support and an interface bus for communicating with the rest of the system. Multi­core processors (such as the A53 and A57) also include additional hardware to facilitate coherency between cores. The ARM Cortex­A57 is a 64­bit processor that supports 1 to 4 cores. The instruction pipeline in each core supports fetching up to three instructions per cycle to send down the pipeline. The instruction pipeline is made up of a 12 stage in order pipeline and a collection of parallel pipelines that range in size from 3 to 15 stages as seen below. The ARM Cortex­A53 is similar to the A57, but is designed to be more power efficient at the cost of processing power. The A57 in order pipeline is made up of 5 stages of instruction fetch and 7 stages of instruction decode and register renaming.
    [Show full text]
  • Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo's Algorithm
    EEF011 Computer Architecture 計算機結構 Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm 吳俊興 高雄大學資訊工程學系 October 2004 A Dynamic Algorithm: Tomasulo’s Algorithm • For IBM 360/91 (before caches!) – 3 years after CDC • Goal: High Performance without special compilers • Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations – This led Tomasulo to try to figure out how to get more effective registers — renaming in hardware! • Why Study 1966 Computer? • The descendants of this have flourished! – Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, … Example to eleminate WAR and WAW by register renaming • Original DIV.D F0, F2, F4 ADD.D F6, F0, F8 S.D F6, 0(R1) SUB.D F8, F10, F14 MUL.D F6, F10, F8 WAR between ADD.D and SUB.D, WAW between ADD.D and MUL.D (Due to that DIV.D needs to take much longer cycles to get F0) • Register renaming DIV.D F0, F2, F4 ADD.D S, F0, F8 S.D S, 0(R1) SUB.D T, F10, F14 MUL.D F6, F10, T Tomasulo Algorithm • Register renaming provided – by reservation stations, which buffer the operands of instructions waiting to issue – by the issue logic • Basic idea: – a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register (WAR) – pending instructions designate the reservation station that will provide their input (RAW) – when successive writes to a register overlap in execution, only the last one is actually used to update the register (WAW) As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station, which provides register renaming • more reservation stations than real registers Properties of Tomasulo Algorithm 1.
    [Show full text]
  • Dop – a Cpu Core for Teaching Basics of Computer Architecture
    DOP – A CPU CORE FOR TEACHING BASICS OF COMPUTER ARCHITECTURE Milos Becvar, Alois Pluhacek and Jiri Danecek Department of Computer Science and Engineering Faculty of Electrical Engineering Czech Technical University in Prague, Abstract: A simple 16-bit processor core called DOP and its teaching environment is presented. The DOP processor illustrates the basic principles of computer organization and is therefore used in the introductory hardware course. Its major features are simplicity, availability of an FPGA implementation and a C compiler. This paper presents the description of the core, HW and SW tools and teaching methodology. computing performance. The DOP processor core was 1. INTRODUCTION developed at our department together with various SW and HW visualization tools (Danecek et al., 1994a). An introductory computer hardware course should teach students to the fundamental principles of computer The goal of this paper is to describe this processor core internal functionality. Students, who are familiar with and its learning environment for teaching basics of programming in high-level languages, are required to computer organization. The paper is organized as understand the interaction between a processor, a follows - section 2 outlines the introductory course and memory and I/O devices, an internal organization of characterizes the students, section 3 describes the DOP processor, computer arithmetic and basics of digital processor core, section 4 describes the SW and HW design. Our experience has shown that it is not an easy tools supporting this processor and finally section 5 task for most of them. The functionality of the processor outlines the use of the DOP in our introductory course.
    [Show full text]
  • Multiple Instruction Issue and Completion Per Clock Cycle Using Tomasulo’S Algorithm – a Simple Example
    Multiple Instruction Issue and Completion per Clock Cycle Using Tomasulo’s Algorithm – A Simple Example Assumptions: . The processor has in-order issue but execution may be out-of-order as it is done as soon after issue as operands are available. Instructions commit as they finish execu- tion. There is no speculative execution on Branch instructions because out-of-order com- pletion prevents backing out incorrect results. The reason for the out-of-order com- pletion is that there is no buffer between results from the execution and their com- mitment in the register file. This restriction is just to keep the example simple. It is possible for at least two instructions to issue in a cycle. The processor has two integer ALU’s with one Reservation Station each. ALU op- erations complete in one cycle. We are looking at the operation during a sequence of arithmetic instructions so the only Functional Units shown are the ALU’s. NOTES: Customarily the term “issue” is the transition from ID to the window for dy- namically scheduled machines and dispatch is the transition from the window to the FU’s. Since this machine has an in-order window and dispatch with no speculation, I am using the term “issue” for the transition to the reservation stations. It is possible to have multiple reservation stations in front of a single functional unit. If two instructions are ready for execution on that unit at the same clock cycle, the lowest station number executes. There is a problem with exception processing when there is no buffering of results being committed to the register file because some register entries may be from instructions past the point at which the exception occurred.
    [Show full text]
  • Out-Of-Order Execution & Register Renaming
    Asanovic/Devadas Spring 2002 6.823 Out-of-Order Execution & Register Renaming Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology Asanovic/Devadas Spring 2002 6.823 Scoreboard for In-order Issue Busy[unit#] : a bit-vector to indicate unit’s availability. (unit = Int, Add, Mult, Div) These bits are hardwired to FU's. WP[reg#] : a bit-vector to record the registers for which writes are pending Issue checks the instruction (opcode dest src1 src2) against the scoreboard (Busy & WP) to dispatch FU available? not Busy[FU#] RAW? WP[src1] or WP[src2] WAR? cannot arise WAW? WP[dest] Asanovic/Devadas Spring 2002 Out-of-Order Dispatch 6.823 ALU Mem IF ID Issue WB Fadd Fmul • Issue stage buffer holds multiple instructions waiting to issue. • Decode adds next instruction to buffer if there is space and the instruction does not cause a WAR or WAW hazard. • Any instruction in buffer whose RAW hazards are satisfied can be dispatched (for now, at most one dispatch per cycle). On a write back (WB), new instructions may get enabled. Asanovic/Devadas Spring 2002 6.823 Out-of-Order Issue: an example latency 1 LD F2, 34(R2) 1 1 2 2 LD F4, 45(R3) long 3 MULTD F6, F4, F2 3 4 3 4 SUBD F8, F2, F2 1 5 5 DIVD F4, F2, F8 4 6 ADDD F10, F6, F4 1 6 In-order: 1 (2,1) . 2 3 4 4 3 5 . 5 6 6 Out-of-order: 1 (2,1) 4 4 . 2 3 . 3 5 .
    [Show full text]
  • Dynamic Register Renaming Through Virtual-Physical Registers
    Dynamic Register Renaming Through Virtual-Physical Registers † Teresa Monreal [email protected] Antonio González* [email protected] Mateo Valero* [email protected] †† José González [email protected] † Víctor Viñals [email protected] †Departamento de Informática e Ing. de Sistemas. Centro Politécnico Superior - Univ. de Zaragoza, Zaragoza, Spain. *Departament d’ Arquitectura de Computadors. Universitat Politècnica de Catalunya, Barcelona, Spain. ††Departamento de Ingeniería y Tecnología de Computadores. Universidad de Murcia, Murcia, Spain. Abstract Register file access time represents one of the critical delays of current microprocessors, and it is expected to become more critical as future processors increase the instruction window size and the issue width. This paper present a novel dynamic register renaming scheme that delays the allocation of physical registers until a late stage in the pipeline. We show that it can provide important savings in number of physical registers so it can significantly shorter the register file access time. Delaying the allocation of physical registers requires some artifact to keep track of dependences. This is achieved by introducing the concept of virtual-physical registers, which are tags that do not require any storage location. The proposed renaming scheme shortens the average number of cycles that each physical register is allocated, and allows for an early execution of instructions since they can obtain a physical register for its destination earlier than with the conventional scheme. Early execution is especially beneficial for branches and memory operations, since the former can be resolved earlier and the latter can prefetch their data in advance. 1. Introduction Dynamically-scheduled superscalar processors exploit instruction-level parallelism (ILP) by overlapping the execution of instructions in an instruction window.
    [Show full text]
  • Computer Architecture Out-Of-Order Execution
    Computer Architecture Out-of-order Execution By Yoav Etsion With acknowledgement to Dan Tsafrir, Avi Mendelson, Lihu Rappoport, and Adi Yoaz 1 Computer Architecture 2013– Out-of-Order Execution The need for speed: Superscalar • Remember our goal: minimize CPU Time CPU Time = duration of clock cycle × CPI × IC • So far we have learned that in order to Minimize clock cycle ⇒ add more pipe stages Minimize CPI ⇒ utilize pipeline Minimize IC ⇒ change/improve the architecture • Why not make the pipeline deeper and deeper? Beyond some point, adding more pipe stages doesn’t help, because Control/data hazards increase, and become costlier • (Recall that in a pipelined CPU, CPI=1 only w/o hazards) • So what can we do next? Reduce the CPI by utilizing ILP (instruction level parallelism) We will need to duplicate HW for this purpose… 2 Computer Architecture 2013– Out-of-Order Execution A simple superscalar CPU • Duplicates the pipeline to accommodate ILP (IPC > 1) ILP=instruction-level parallelism • Note that duplicating HW in just one pipe stage doesn’t help e.g., when having 2 ALUs, the bottleneck moves to other stages IF ID EXE MEM WB • Conclusion: Getting IPC > 1 requires to fetch/decode/exe/retire >1 instruction per clock: IF ID EXE MEM WB 3 Computer Architecture 2013– Out-of-Order Execution Example: Pentium Processor • Pentium fetches & decodes 2 instructions per cycle • Before register file read, decide on pairing Can the two instructions be executed in parallel? (yes/no) u-pipe IF ID v-pipe • Pairing decision is based… On data
    [Show full text]
  • Tomasulo's Algorithm
    Out-of-Order Execution Several implementations • out-of-order completion • CDC 6600 with scoreboarding • IBM 360/91 with Tomasulo’s algorithm & reservation stations • out-of-order completion leads to: • imprecise interrupts • WAR hazards • WAW hazards • in-order completion • MIPS R10000/R12000 & Alpha 21264/21364 with large physical register file & register renaming • Intel Pentium Pro/Pentium III with the reorder buffer Autumn 2006 CSE P548 - Tomasulo 1 Out-of-order Hardware In order to compute correct results, need to keep track of: • which instruction is in which stage of the pipeline • which registers are being used for reading/writing & by which instructions • which operands are available • which instructions have completed Each scheme has different hardware structures & different algorithms to do this Autumn 2006 CSE P548 - Tomasulo 2 1 Tomasulo’s Algorithm Tomasulo’s Algorithm (IBM 360/91) • out-of-order execution capability plus register renaming Motivation • long FP delays • only 4 FP registers • wanted common compiler for all implementations Autumn 2006 CSE P548 - Tomasulo 3 Tomasulo’s Algorithm Key features & hardware structures • reservation stations • distributed hazard detection & execution control • forwarding to eliminate RAW hazards • register renaming to eliminate WAR & WAW hazards • deciding which instruction to execute next • common data bus • dynamic memory disambiguation Autumn 2006 CSE P548 - Tomasulo 4 2 Hardware for Tomasulo’s Algorithm Autumn 2006 CSE P548 - Tomasulo 5 Tomasulo’s Algorithm: Key Features Reservation
    [Show full text]
  • 10. Assembly Language, Models of Computation
    10. Assembly Language, Models of Computation 6.004x Computation Structures Part 2 – Computer Architecture Copyright © 2015 MIT EECS 6.004 Computation Structures L10: Assembly Language, Models of Computation, Slide #1 Beta ISA Summary • Storage: – Processor: 32 registers (r31 hardwired to 0) and PC – Main memory: Up to 4 GB, 32-bit words, 32-bit byte addresses, 4-byte-aligned accesses OPCODE rc ra rb unused • Instruction formats: OPCODE rc ra 16-bit signed constant 32 bits • Instruction classes: – ALU: Two input registers, or register and constant – Loads and stores: access memory – Branches, Jumps: change program counter 6.004 Computation Structures L10: Assembly Language, Models of Computation, Slide #2 Programming Languages 32-bit (4-byte) ADD instruction: 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 opcode rc ra rb (unused) Means, to the BETA, Reg[4] ß Reg[2] + Reg[3] We’d rather write in assembly language: Today ADD(R2, R3, R4) or better yet a high-level language: Coming up a = b + c; 6.004 Computation Structures L10: Assembly Language, Models of Computation, Slide #3 Assembly Language Symbolic 01101101 11000110 Array of bytes representation Assembler 00101111 to be loaded of stream of bytes 10110001 into memory ..... Source Binary text file machine language • Abstracts bit-level representation of instructions and addresses • We’ll learn UASM (“microassembler”), built into BSim • Main elements: – Values – Symbols – Labels (symbols for addresses) – Macros 6.004 Computation Structures L10: Assembly Language, Models
    [Show full text]