Vector Vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks


Christoforos Kozyrakis, Electrical Engineering Department, Stanford University (christos@ee.stanford.edu)
David Patterson, Computer Science Division, University of California at Berkeley (pattrsn@cs.berkeley.edu)

Abstract

Multimedia processing on embedded devices requires an architecture that leads to high performance, low power consumption, reduced design complexity, and small code size. In this paper, we use EEMBC, an industrial benchmark suite, to compare the VIRAM vector architecture to superscalar and VLIW processors for embedded multimedia applications. The comparison covers the VIRAM instruction set, vectorizing compiler, and the prototype chip that integrates a vector processor with DRAM main memory.

We demonstrate that executable code for VIRAM is up to 10 times smaller than VLIW code and comparable to x86 CISC code. The simple, cache-less VIRAM chip is 2 times faster than a 4-way superscalar RISC processor that uses a 5 times faster clock frequency and consumes 10 times more power. VIRAM is also 10 times faster than cache-based VLIW processors. Even after manual optimization of the VLIW code and insertion of SIMD and DSP instructions, the single-issue VIRAM processor is 60% faster than 5-way to 8-way VLIW designs.

1 Introduction

The exponentially increasing performance and generality of superscalar processors has led many to believe that vector architectures are doomed to extinction. Even in the supercomputing domain, the traditional application of vector processors, it is widely considered that interconnecting superscalar processors into large-scale MPP systems is the most promising approach [4]. Nevertheless, vector architectures provide us with frequent reminders of their capabilities. The recently announced Japanese Earth Simulator, a supercomputer based on NEC SX-6 vector processors, provides 5 times the performance with half the number of nodes of ASCI White, the most powerful supercomputer based on superscalar technology. Vector processors remain the most effective way to exploit data-parallel applications [20].

This paper studies the efficiency of vector architectures for the emerging computing domain of multimedia programs running on embedded systems. Multimedia programs, such as video, speech recognition, and 3D graphics, constitute the fastest growing class of applications [5]. They require real-time performance guarantees for data-parallel tasks that operate on narrow numbers with limited temporal locality [6]. Embedded systems include entertainment devices, such as set-top-boxes and game consoles, and portable electronics, such as PDAs and cellular phones. They call for low power consumption, small code size, and reduced design and programming complexity in order to meet the cost and time-to-market requirements of consumer electronics. The complexity, power consumption, and lack of explicit support for data-level parallelism suggest that superscalar processors are not necessarily a suitable approach for embedded multimedia processing.

To prove that vector architectures meet the requirements of embedded media-processing, we evaluate the VIRAM vector architecture with the EEMBC benchmarks, an industrial suite for embedded systems. Our evaluation covers all three components of VIRAM: the instruction set, the vectorizing compiler, and the processor microarchitecture. We show that the compiler can extract a high degree of data-level parallelism from media tasks described in C and can express it with vector instructions. The VIRAM code is significantly smaller than code for RISC and VLIW architectures and is comparable to that for x86 CISC processors. We describe a simple, low power, prototype chip that integrates the VIRAM architecture with embedded DRAM. The cache-less vector processor is 2 times faster than a 4-way superscalar processor running at a 5 times higher clock frequency. Despite issuing a single instruction per cycle, it is also 10 times faster than 5-way to 8-way VLIW designs. We demonstrate that the vector processor provides performance advantages for both highly vectorizable benchmarks and partially vectorizable tasks with short vectors.

The rest of this paper is structured as follows. Section 2 summarizes the basic features of the VIRAM architecture. Section 3 describes the EEMBC embedded benchmarks. Section 4 evaluates the vectorizing compiler and the use of the vector instruction set. It also presents a code size comparison between RISC, CISC, VLIW, and vector architectures. Section 5 proceeds with a microarchitecture evaluation in terms of performance, power consumption, design complexity, and scalability. Section 6 presents related work and Section 7 concludes the paper.

0-7695-1859-1/02 $17.00 © 2002 IEEE

2 Vector Architecture for Multimedia

In this section, we provide an overview of the three components of the VIRAM architecture: the instruction set, the prototype processor chip, and the vectorizing compiler.

  Scalar Core        Single-issue 64-bit MIPS pipeline
                     8K/8K direct-mapped L1 I/D caches
  Vector             8K vector register file (32 registers)
  Coprocessor        2 pipelined arithmetic units
                     4 64-bit datapaths per arithmetic unit
                     1 load-store unit (4 address generators)
                     256-bit memory interface
  Memory System      13 MBytes in 8 DRAM banks
                     25ns random access latency
                     256-bit crossbar interconnect
  Technology         0.18 µm CMOS process (IBM)
                     6 layers copper interconnect
  Transistors        120M (7.5M logic, 112.5M DRAM)
  Clock Frequency    200 MHz
  Power Dissipation  2 Watts
  Peak Performance   Int: 1.6/3.2/6.4 Gop/s (64b/32b/16b)
                     FP: 1.6 Gflop/s (32b)

Table 1. The characteristics of the VIRAM vector processor chip.

2.1 Instruction Set Overview

VIRAM is a complete, load-store, vector instruction set defined as a coprocessor extension to the MIPS architecture. The vector architecture state includes a vector register file with 32 entries that can store integer or floating-point elements, a 16-entry flag register file that contains vectors with single-bit elements, and a few scalar registers for control values and memory addresses. The instruction set contains integer and floating-point arithmetic instructions that operate on vectors stored in the register file, as well as logical functions and operations such as population count that use the flag registers. Vector load and store instructions support the three common access patterns: unit stride, strided, and indexed. Overall, VIRAM introduces 90 unique instructions, which, due to variations, consume 660 opcodes in the coprocessor 2 space of the MIPS architecture.

To enable the vectorization of multimedia applications, VIRAM includes a number of media-specific enhancements. The elements in the vector registers can be 64, 32, or 16 bits wide. Multiple narrow elements are placed in the storage location for one wide element. Similarly, each 64-bit datapath is partitioned in order to execute multiple narrower element operations in parallel. Instead of specifying the element and operation width in the instruction opcode, we use a control register which is typically set once per group of nested loops. Integer instructions support saturated and fixed-point arithmetic. Specifically, VIRAM includes a flexible multiply-add model that supports arbitrary fixed-point formats without using accumulators or extended-precision registers. Three vector instructions implement element permutations within vector registers. Their scope is limited to the vectorization of dot-products (reductions) and FFTs, which makes them regular and simple to implement. Finally, VIRAM supports conditional execution of element operations for virtually all vector instructions using the flag registers as sources of element masks [21].

The VIRAM architecture includes several features that help with the development of general-purpose systems, which are not typical in traditional vector supercomputers. It provides full support for paged virtual addressing using a separate TLB for vector memory accesses. It also provides a mechanism that allows the operating system to defer the saving and restoring of vector state during context switches, until it is known that the new process uses vector instructions. In addition, the architecture defines valid and dirty bits for all vector registers that are used to minimize the amount of vector state involved in a context switch. Detailed descriptions of the features and instructions in the VIRAM architecture are available in [13].

2.2 Microarchitecture Overview

The VIRAM prototype processor is a simple implementation of the VIRAM architecture. It includes an on-chip main memory system based on embedded DRAM technology that provides the high bandwidth necessary for a vector processor at moderate latency. (The processor can address additional, off-chip, main memory. Data transfers between on-chip and off-chip are under software control.) Table 1 summarizes the basic features of the chip.

Figure 1 presents the chip microarchitecture, focusing on the vector hardware and the memory system. The register and datapath resources in the vector coprocessor are partitioned vertically into four identical vector lanes. Each lane contains a number of elements from each vector and flag

[Figure 1: the chip microarchitecture, showing vector lanes LANE 0 through LANE 3 with their 64-bit datapaths.]

2.3 Vectorizing Compiler

The VIRAM compiler is based on the PDGCS compilation system for Cray supercomputers such as the C90-YMP, T3E, and SV2. The front-end allows
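The three vector memory access patterns named in Section 2.1 can be made concrete with a small sketch. The helper functions below are illustrative only, not VIRAM instructions; they model a load of `vl` elements under each pattern.

```python
# Sketch of the three vector load patterns supported by VIRAM-style ISAs.
# Function names are ours, chosen for clarity; they are not part of the ISA.

def load_unit_stride(mem, base, vl):
    """Elements at consecutive addresses: mem[base], mem[base+1], ..."""
    return [mem[base + i] for i in range(vl)]

def load_strided(mem, base, stride, vl):
    """Elements a fixed stride apart, e.g. one column of a row-major matrix."""
    return [mem[base + i * stride] for i in range(vl)]

def load_indexed(mem, base, index_vector):
    """Gather: per-element offsets come from another vector register."""
    return [mem[base + idx] for idx in index_vector]

mem = list(range(100, 132))                 # a toy 32-word memory
print(load_unit_stride(mem, 4, 4))          # [104, 105, 106, 107]
print(load_strided(mem, 0, 8, 4))           # [100, 108, 116, 124]
print(load_indexed(mem, 0, [3, 1, 3, 30]))  # [103, 101, 103, 130]
```

Indexed stores work symmetrically (a scatter), which is what lets the compiler vectorize loops with table lookups.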
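Section 2.1 notes that multiple narrow elements share the storage of one wide element, with the width taken from a control register rather than the opcode. The sketch below models that packing; the `width_bits` argument stands in for the element-width control register, and the little-endian layout is our assumption, not the documented VIRAM storage format.

```python
# Sketch: packing narrow unsigned elements into 64-bit storage locations.
# The layout (little-endian within the word) is an illustrative assumption.

def pack_elements(values, width_bits):
    """Pack narrow unsigned elements into 64-bit words."""
    per_word = 64 // width_bits
    mask = (1 << width_bits) - 1
    words = []
    for i in range(0, len(values), per_word):
        word = 0
        for j, v in enumerate(values[i:i + per_word]):
            word |= (v & mask) << (j * width_bits)
        words.append(word)
    return words

def unpack_elements(words, width_bits, count):
    """Recover `count` narrow elements from packed 64-bit words."""
    per_word = 64 // width_bits
    mask = (1 << width_bits) - 1
    out = []
    for word in words:
        for j in range(per_word):
            out.append((word >> (j * width_bits)) & mask)
    return out[:count]

vals = [1, 2, 3, 4, 5]
packed = pack_elements(vals, 16)   # four 16-bit elements fit per 64-bit word
assert len(packed) == 2            # five elements need two storage locations
assert unpack_elements(packed, 16, 5) == vals
```

The same idea explains the peak-performance row of Table 1: halving the element width doubles the operations each 64-bit datapath completes per cycle (1.6/3.2/6.4 Gop/s at 64b/32b/16b).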
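Section 2.1 also mentions saturated and fixed-point arithmetic, including a multiply-add model that needs no extended-precision accumulators. As a sketch of what per-element saturating fixed-point multiply-add means, here is the operation in a Q1.15 format; the format choice and truncating rounding are our assumptions, since VIRAM itself supports arbitrary fixed-point formats.

```python
# Sketch of saturating fixed-point multiply-add on 16-bit Q1.15 operands.
# Q1.15 and truncation toward negative infinity are illustrative choices.

Q = 15                                 # fractional bits
MAX16, MIN16 = 2**15 - 1, -2**15

def saturate(x):
    """Clamp to the representable 16-bit range instead of wrapping."""
    return max(MIN16, min(MAX16, x))

def fixed_madd(a, b, c):
    """Compute saturate((a * b) >> Q + c): multiply, rescale, accumulate."""
    prod = (a * b) >> Q                # rescale the wide product back to Q1.15
    return saturate(prod + c)

half = 1 << (Q - 1)                    # 0.5 in Q1.15
assert fixed_madd(half, half, 0) == 1 << (Q - 2)   # 0.5 * 0.5 = 0.25
assert fixed_madd(MAX16, MAX16, MAX16) == MAX16    # overflow saturates
```

Saturation matters for media kernels: wrapping a bright pixel value around to black is far more visible than clamping it.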
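The conditional execution described at the end of Section 2.1 — flag registers used as element masks — can be sketched as follows. The function names are ours; the point is that a comparison produces one bit per element, and a subsequent vector operation updates only the elements whose bit is set.

```python
# Sketch of conditional element execution with a flag (mask) register.
# Illustrative helpers, not VIRAM instruction names.

def vector_cmp_gt(va, vb):
    """Compare element-wise, producing a flag register: one bit per element."""
    return [1 if a > b else 0 for a, b in zip(va, vb)]

def masked_add(va, vb, flags):
    """Add only where the flag bit is set; masked-off elements pass through."""
    return [a + b if f else a for a, b, f in zip(va, vb, flags)]

va = [5, 0, 7, 2]
vb = [1, 3, 1, 9]
flags = vector_cmp_gt(va, vb)              # [1, 0, 1, 0]
print(masked_add(va, vb, flags))           # [6, 0, 8, 2]
```

This is how a loop body containing an `if` statement can still be vectorized: the branch becomes a flag computation instead of a control-flow decision.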
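Finally, the vectorizing compiler of Section 2.3 must map C loops of arbitrary trip count onto registers of a fixed maximum vector length. The standard technique is strip-mining, sketched below; the `MVL` value is illustrative, not the VIRAM hardware limit (which depends on the selected element width).

```python
# Sketch of strip-mining: process a long loop in chunks of at most MVL
# elements, setting the vector length register for each chunk.

MVL = 64   # illustrative maximum vector length, an assumption

def vectorized_scale(src, factor):
    """Equivalent of: for i in range(n): dst[i] = src[i] * factor."""
    dst = []
    i = 0
    while i < len(src):
        vl = min(MVL, len(src) - i)            # set the vector length register
        chunk = src[i:i + vl]                  # one unit-stride vector load
        dst.extend(v * factor for v in chunk)  # one vector multiply + store
        i += vl
    return dst

assert vectorized_scale(list(range(150)), 2) == [2 * v for v in range(150)]
```

Because the last chunk simply uses a shorter vector length, no scalar clean-up loop is needed — one reason the paper reports small code size even for partially vectorizable tasks with short vectors.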