Algorithms and Architectures for Multimedia and Beamforming in Communications


Yao, K. "Algorithms and Architectures for Multimedia and Beamforming Communications." The VLSI Handbook. Ed. Wai-Kai Chen. Boca Raton: CRC Press LLC, 2000. © 2000 by CRC Press LLC.

79 Algorithms and Architectures for Multimedia and Beamforming in Communications

Flavio Lorenzelli, ST Microelectronics, Inc.
Kung Yao, University of California at Los Angeles

79.1 Introduction
79.2 Multimedia Support for General Purpose Computers
    Extended Instruction Set and Multimedia Support for General Purpose Processors • Multimedia Processors and Accelerators • Architectures and Concepts for Multimedia Processors
79.3 Beamforming Array Processing and Architecture
    Interference Rejecting Beamforming Antenna Arrays • Smart Antenna Beamforming Arrays

79.1 Introduction

In recent years, we have experienced explosive growth in technologies for information processing, transmission, distribution, and storage. In this chapter, we address two distinct but equally important sets of algorithms and architectures in communication systems: those for multimedia processing and those for beamforming array processing. The first problem is motivated by the need for efficient, real-time presentation of still images, live-action video, and audio information, commonly called "multimedia." To most users, a multimedia presentation is easier and more natural to comprehend than the traditional textual form of presentation on paper. The second problem is motivated by a desire to transfer tremendous amounts of information from one location to another under limited frequency-polarization-space-time channel constraints. Beamforming array processing technologies are used to coherently transmit or receive information under these channel constraints and can significantly improve on the performance of a single transmit or receive antenna system. Since both of these problems are of fundamental interest in VLSI and practical implementations of modern communication systems, we consider in detail the basic signal processing algorithmic and architectural limitations of these problems.

In Section 79.2, we first consider the intense computational requirements for displaying images and video on a general purpose PC. The pros and cons of using a separate processor or accelerator to enhance the main CPU are also introduced. Then, "Extended Instruction Set and Multimedia Support for General Purpose Processors" discusses issues related to extended instruction sets for media support on a general purpose PC, while "Multimedia Processors and Accelerators" considers media processors and accelerators in some detail. "Architectures and Concepts for Multimedia Processors" deals with the architectures and concepts for multimedia processors. The very long instruction word (VLIW) architecture and the SIMD and associated subword parallelism issues are presented and compared. Four tables are used in Section 79.2 to summarize some of the basic operations involved.

In Section 79.3, we discuss the use of the beamforming operation, originally motivated by various military and aerospace applications but more recently found in many civilian applications. In "Interference Rejecting Beamforming Antenna Arrays", we consider some early, simple sidelobe-canceller beamforming arrays based on the LMS adaptive criterion, and then discuss in some detail various aspects of a recursive least-squares criterion, QR decomposition-based beamforming array. In "Smart Antenna Beamforming Arrays", we consider the motivation and evolution of the smart antenna system and its implications.
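For orientation, the LMS adaptive criterion mentioned above adjusts the array weight vector by a stochastic-gradient step toward minimizing the mean-square error between a reference signal and the beamformer output. The C sketch below shows one adaptation step in its textbook form; the function name, data layout, and step-size parameter are illustrative choices, not details taken from Section 79.3.

```c
#include <complex.h>
#include <stddef.h>

/* One complex LMS adaptation step for an N-element beamforming array:
 *   y(n)   = w^H(n) x(n)                    (beamformer output)
 *   e(n)   = d(n) - y(n)                    (error against reference d)
 *   w(n+1) = w(n) + mu * x(n) * conj(e(n))  (stochastic-gradient update)
 * Returns the error sample so a caller can monitor convergence. */
float complex lms_step(float complex w[], const float complex x[],
                       float complex d, float mu, size_t N)
{
    float complex y = 0.0f;
    for (size_t i = 0; i < N; i++)
        y += conjf(w[i]) * x[i];          /* inner product w^H x */

    float complex e = d - y;
    for (size_t i = 0; i < N; i++)
        w[i] += mu * x[i] * conjf(e);     /* adapt each antenna weight */

    return e;
}
```

The recursive least-squares, QR decomposition-based arrays discussed later replace this simple gradient step with a numerically more robust, and more computation-intensive, update of a triangularized data matrix.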
79.2 Multimedia Support for General Purpose Computers

In any computer store, it is fairly common nowadays to see rows of computer screens running video and audio clips. Voices, sounds, and moving pictures are attracting large numbers of users, who now expect software applications such as voice mail and videoconferencing as basic features. Advertisements for home computers virtually never miss the buzz words "multimedia," "interactive," and "Internet." Over time, audio and video, as well as network connectivity, have become integral parts of many user applications. This change in users' expectations has caused a major shift in the operations required of a general-purpose computer. The focus of computer designers and vendors has moved from plain word processing and spreadsheet applications to highly demanding tasks, such as real-time compression and decompression of audio and video streams. CPUs are now required to process large amounts of data quickly and to allow different processes to run simultaneously. For instance, one might have several windows open at the same time, one running a streaming video, another playing an audio file, another showing an open Internet connection, and so on. In addition, PCs are now expected to provide high-quality graphics, 3-D moving pictures, good quality sound, and full-motion support (e.g., MPEG-2 decoders). Most computer vendors and processor designers are therefore making huge efforts to alter or enrich their products so as to be able to provide these new multimedia services.

To better envision the complexity of some multimedia tasks, consider what is involved in operations such as 3-D image rotation and zooming. Each 3-D object consists of hundreds or even thousands of polygons, usually triangles. The smaller (and consequently the more numerous) the polygons, the more detailed the object. When an object moves on the screen (rotates or moves forward/backward), the program must recalculate every vertex of every polygon by performing a number of matrix computations. A typical 3-D object might have 1000 polygons, which means that each time it moves, at least 84,000 multiplies and adds are executed.
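The figure of 84,000 operations is consistent with a simple count: if each of the 1000 triangles carries three unshared vertices and each vertex is transformed by a 4 x 4 homogeneous matrix (16 multiplies and 12 adds), then 3000 x 28 = 84,000 multiply/add operations are needed per movement. The following C sketch shows that inner loop; it is a minimal illustration under those assumptions, not code from this chapter.

```c
#include <stddef.h>

/* Apply a 4 x 4 homogeneous transform M to n vertices given in
 * homogeneous coordinates (x, y, z, w).  Each output component costs
 * 4 multiplies and 3 adds, so each vertex costs 16 multiplies + 12 adds.
 * For 1000 triangles with unshared vertices (3000 vertices) that is
 * 3000 * 28 = 84,000 multiply/add operations per object movement. */
typedef struct { float v[4]; } Vertex4;

void transform_vertices(const float M[4][4], Vertex4 *verts, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const float x = verts[i].v[0], y = verts[i].v[1];
        const float z = verts[i].v[2], w = verts[i].v[3];
        for (int r = 0; r < 4; r++)
            verts[i].v[r] = M[r][0] * x + M[r][1] * y
                          + M[r][2] * z + M[r][3] * w;
    }
}
```

Dedicated SIMD-style instructions and graphics accelerators, discussed below, aim precisely at speeding up loops of this kind.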
The new MPEG-4 audio/video standard (release date November 1998) addresses, among other things, the need for highly interactive functionality, i.e., interactive access to and manipulation of audio-visual data, and content-based interactivity. The standard provides an object-layered bit stream; each object can be separately decoded and manipulated (scaled, rotated, translated). Even 300-MHz Pentium IIs cannot supply enough coordinates for more than a million polygons per second. Accelerator cards or graphics chips are therefore often required for high-end, graphics-intensive applications, since they can render polygons faster than the CPU alone can keep up.

One of the major drivers of multimedia PCs has been Microsoft's APIs, particularly the recent DirectX interfaces. With DirectX, it is easier to replace audio and video hardware while retaining software compatibility. However, as long as older applications remain alive, multimedia PCs will have to provide compatibility with existing register-level standards, e.g., SuperVGA and SoundBlaster. Register-level interfaces were not designed to support more advanced audio/video features, e.g., 3-D graphics. Most game programs avoid the old APIs because of their poor performance; they run directly under DOS and access audio and video chips directly. Microsoft created DirectX, a new set of APIs, so that existing programs do not have to be modified to support new graphics devices as they appear on the market.

Multimedia is also likely to reach the consumer electronics market in ways that are still unclear, for instance in what has been dubbed home entertainment. Vendors envision a gamut of devices that combine the functions of a PC with those of traditional TV sets. PCs with dual-purpose monitors and set-top boxes will soon offer functionalities ranging from DVD to videophone and interactive 3-D gaming. The PC might become the basis for a living room entertainment center which includes a TV tuner, DVD, Dolby audio, 3-D graphics, etc.

Different companies have chosen different approaches to multimedia and made widely different decisions. Some companies, notably Intel and Sun Microsystems among others, have added circuitry to their new-generation processors and enhanced their instruction sets in order to handle multimedia data. Other companies, e.g., Philips and Chromatics, have preferred to build whole new processors from the ground up, which can work either independently or in alliance with another processor and are solely in charge of audio, video, and network connectivity. There are also those who have been thinking of thoroughly new architectures and concepts, which could open entirely different perspectives and computation paradigms. In the following, we provide a (necessarily incomplete and temporary) snapshot of the situation as it appears at the end of the 1990s, when the authors are writing.

The solutions pursued by the various companies involve different tradeoffs. Which approach will be the most successful is open to debate and will probably be understood only in hindsight. Nonetheless, some considerations can be made now. The main drawback of external multimedia accelerators is undoubtedly their cost [15]. The home PC market has been very sensitive to cost, and most vendors are unwilling to add expensive features that could deter possible customers and therefore reduce their market penetration. On the other hand, software developers are unlikely to generate applications tailored for multimedia processors and accelerators unless there is a sufficiently large installed base. The alternative to a separate processor or accelerator is to enhance the main CPU to allow it to process multimedia data [23]. As of today, there are few CPUs that have enough "horsepower" to simultaneously handle the operating system, all standard applications, as well as audio decoding
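The instruction-set enhancements mentioned above, such as Intel's MMX and Sun's VIS, rely largely on subword parallelism: one wide register is treated as several narrow lanes (for example, eight 8-bit pixels) processed by a single instruction, a theme taken up in "Extended Instruction Set and Multimedia Support for General Purpose Processors." The C sketch below emulates the idea in portable code by performing four saturating 8-bit additions inside one 32-bit word; it illustrates the concept only and is not an actual MMX- or VIS-style intrinsic.

```c
#include <stdint.h>

/* Emulate packed, unsigned-saturating 8-bit addition on four lanes packed
 * into a 32-bit word -- the kind of operation that a single MMX/VIS-class
 * instruction performs on wider registers in one cycle. */
uint32_t packed_add_u8_sat(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint32_t sum = x + y;
        if (sum > 0xFFu)              /* saturate instead of wrapping */
            sum = 0xFFu;
        result |= sum << (8 * lane);
    }
    return result;
}
```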