Scalable Vector Media-processors for Embedded Systems
Christoforos Kozyrakis
Report No. UCB/CSD-02-1183 May 2002
Computer Science Division (EECS) University of California Berkeley, California 94720
Scalable Vector Media-processors for Embedded Systems
by
Christoforos Kozyrakis
Grad. (University of Crete, Greece) 1996
M.S. (University of California at Berkeley) 1999
A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy
in
Computer Science
in the
GRADUATE DIVISION of the UNIVERSITY of CALIFORNIA at BERKELEY
Committee in charge:
Professor David A. Patterson, Chair
Professor Katherine Yelick
Professor John Chuang
Spring 2002

Scalable Vector Media-processors for Embedded Systems

Copyright Spring 2002
by
Christoforos Kozyrakis

Abstract
Scalable Vector Media-processors for Embedded Systems

by Christoforos Kozyrakis

Doctor of Philosophy in Computer Science

University of California at Berkeley

Professor David A. Patterson, Chair
Over the past twenty years, processor designers have concentrated on superscalar and VLIW architectures that exploit the instruction-level parallelism (ILP) available in engineering applications for workstation systems. Recently, however, the focus in computing has shifted from engineering to multimedia applications and from workstations to embedded systems. In this new computing environment, the performance, energy consumption, and development cost of ILP processors render them ineffective despite their theoretical generality. This thesis focuses on the development of efficient architectures for embedded multimedia systems. We argue that it is possible to design processors that deliver high performance, have low energy consumption, and are simple to implement. The basis for the argument is the ability of vector architectures to efficiently exploit the data-level parallelism in multimedia applications. Furthermore, the increasing density of CMOS chips enables the design of cost-effective, on-chip memory systems that can support the high bandwidth necessary for a vector processor. To test our hypothesis, we present VIRAM, a vector architecture for multimedia processing. We demonstrate that the vector instructions in VIRAM can capture the data-level parallelism in multimedia tasks and lead to smaller code size than RISC, CISC, and VLIW architectures. We also describe two scalable microarchitectures for vector media-processors: VIRAM-1 and CODE. VIRAM-1 integrates a simple, yet highly parallel, vector processor with an embedded DRAM memory system in a prototype chip with 120 million transistors. CODE uses a composite and decoupled organization for the vector processor in order to simplify the vector register file design, tolerate high memory latency, and allow for precise exception support. Both microarchitectures provide up to 10 times higher performance than alternative approaches without using the out-of-order or wide instruction issue techniques that exacerbate energy consumption and design complexity.
To my parents, Litsa and Manolis, for their unconditional love and support.
Στους γονείς μου, Λίτσα και Μανώλη, για την απεριόριστη αγάπη και στήριξή τους.
Contents
List of Figures v
List of Tables viii
1 Introduction 1
2 Background and Motivation 4
2.1 Applications, Systems, and Technology Trends 4
2.1.1 Multimedia Applications 4
2.1.2 Embedded Systems 5
2.1.3 Technology Constraints 6
2.2 The Case against Superscalar and VLIW Processors for Embedded Multimedia Processing 6
2.3 Research Goals 8

3 Multimedia Benchmarks 10
3.1 The EEMBC Benchmark Suite 10
3.2 Benchmark Description 12
3.2.1 Consumer Benchmarks 12
3.2.2 Telecommunications Benchmarks 13
3.3 Discussion 14
3.4 Related Work 15
3.5 Summary 16

4 Vector Instruction Set Architecture for Multimedia 17
4.1 Introduction to Vector Instruction Set Architectures 17
4.2 Vector Architecture Enhancements for Multimedia 20
4.2.1 Support for Narrow Data Types 20
4.2.2 Support for Fixed-point Arithmetic 21
4.2.3 Support for Element Permutations 22
4.2.4 Support for Conditional Execution 23
4.3 Vector Architecture Enhancements for General Purpose Systems 24
4.3.1 Support for Virtual Memory 25
4.3.2 Support for Arithmetic Exceptions 25
4.3.3 Support for Context Switching 26
4.4 The VIRAM Instruction Set Extension for MIPS 26
4.5 Instruction Level Analysis of Multimedia Benchmarks 29
4.5.1 Benchmark Vectorization 29
4.5.2 Dynamic Instruction Set Use 30
4.5.3 Code Size 32
4.5.4 Basic Block Size 34
4.5.5 Comparison to SIMD Extensions 35
4.6 Evaluation of the VIRAM ISA Decisions 36
4.7 Related Work 37
4.8 Summary 38

5 The Microarchitecture of the VIRAM-1 Processor 39
5.1 Project Background 39
5.2 The VIRAM-1 Organization 40
5.2.1 Scalar Core Organization 40
5.2.2 Vector Coprocessor Organization 41
5.2.3 Vector Pipeline 42
5.2.4 Chaining Control 43
5.2.5 Memory System and IO 43
5.2.6 System Support 44
5.3 The VIRAM-1 Implementation 45
5.3.1 Scalable Design Using Vector Lanes 45
5.3.2 Design Statistics 46
5.3.3 Design Methodology 47
5.4 Performance Evaluation 48
5.4.1 Performance of Compiled Code 49
5.4.2 Performance Comparison 50
5.4.3 Performance Scaling 57
5.5 Lessons from VIRAM-1 58

6 A New Vector Microarchitecture for Multimedia Execution 60
6.1 Basic Design Approach 60
6.1.1 Composite Vector Organization 60
6.1.2 Decoupled Execution 61
6.2 Microarchitecture Description 62
6.2.1 Vector Core Organization 63
6.2.2 Communication Network 63
6.2.3 Vector Issue Logic 65
6.3 Issue Policies and Execution Control 67
6.3.1 Issue Policies 67
6.3.2 Vector Chaining 68
6.3.3 Execution Example 68
6.4 Multi-lane Implementation of Composite Organizations 70
6.5 CODE vs. VIRAM-1 71
6.5.1 Centralized vs. Distributed Vector Register File 71
6.5.2 Decoupling vs. Delayed Pipeline 73
6.6 CODE vs. Out-of-Order Processors 74
6.7 Microarchitecture Tuning 76
6.7.1 Core Selection Policy 77
6.7.2 Register Replacement Policy 78
6.7.3 Number of Local Vector Registers 79
6.8 Related Work 80
6.8.1 Composite Processors 80
6.8.2 Decoupled Processors 81
6.9 Summary 81

7 Precise Virtual Memory Exceptions for a Vector Architecture 82
7.1 The Challenges of Precise Exceptions 83
7.2 Vector Architecture Support for Precise Virtual Memory Exceptions 84
7.3 Precise Virtual Memory Exceptions in CODE 85
7.3.1 Hardware Support 85
7.3.2 Implications to Performance 87
7.4 Evaluation of Performance Impact 87
7.5 Related Work 91
7.6 Summary 92

8 Embedded Memory System Architecture 93
8.1 Memory System Background 93
8.2 Embedded DRAM Technology 95
8.3 Memory System Design Space for Embedded DRAM 97
8.3.1 Memory Banks and Sub-banks 98
8.3.2 Basic Bank Configuration 100
8.3.3 Caching in the DRAM Memory System 100
8.3.4 Address Interleaving 102
8.3.5 Memory to Processor Interconnect 103
8.4 Memory System Evaluation 104
8.4.1 Effect of Memory Latency 104
8.4.2 Number of DRAM Banks and Sub-banks 107
8.5 Related Work 110
8.6 Summary 110

9 Performance and Scalability Analysis 112
9.1 CODE vs. VIRAM-1 112
9.2 The Impact of Communication Network Bandwidth 116
9.3 Scaling CODE with Cores and Lanes 119
9.3.1 Consumer Benchmarks 121
9.3.2 Telecommunications Benchmarks 121
9.3.3 Discussion 123
9.4 Summary 124
10 Conclusions 125
Bibliography 127
List of Figures
2.1 The evolution of the number of transistors and the clock frequency across six generations of x86 microprocessors by Intel. 7
2.2 The evolution of the integer performance and the power consumption across six generations of x86 microprocessors by Intel. 8

4.1 The difference between scalar and vector instructions. 18
4.2 The multiply-add model for a vector architecture for multimedia. 21
4.3 An add reduction of a vector with eight elements using the vhalf permutation instruction. 23
4.4 Example use of the vhalfup and vhalfdn permutation instructions. 24
4.5 The architecture state in the VIRAM instruction set. 27

5.1 The block diagram of the VIRAM-1 vector processor. 41
5.2 The simple and delayed pipeline models for a vector processor. 42
5.3 The vector datapath and register file resources of VIRAM-1 organized in four vector lanes. 45
5.4 The floorplan of the VIRAM-1 processor. 47
5.5 The performance speedup for VIRAM-1 after manually tuning the basic kernels of each benchmark in assembly. 50
5.6 Comparison of the composite scores for the consumer and telecommunications categories in the EEMBC suite. 52
5.7 Detailed performance comparison between VIRAM-1 and four embedded processors for the consumer benchmarks. 53
5.8 Detailed performance comparison between VIRAM-1 and five embedded processors for the telecommunications benchmarks. 54
5.9 Normalized performance comparison between VIRAM-1 and four embedded processors for the consumer benchmarks. 55
5.10 Normalized performance comparison between VIRAM-1 and five embedded processors for the telecommunications benchmarks. 56
5.11 The speedup of multi-lane implementations of the VIRAM-1 microarchitecture over a processor with a single lane. 57
5.12 The composite scores for the consumer and telecommunications benchmarks for VIRAM-1 as a function of the number of lanes. 58

6.1 The block diagram of the CODE microarchitecture. 62
6.2 The internal organization of the three classes of vector cores: execution, load-store, and state core. 64
6.3 The control data structures in VIL. 66
6.4 Two execution cases for a vector add instruction. 69
6.5 An implementation of the CODE microarchitecture with four vector lanes. 70
6.6 The two scaling dimensions of the CODE microarchitecture. 71
6.7 The register file area, access latency, and energy consumption for an element access as a function of the number of functional units in VIRAM-1 and CODE. 73
6.8 The total energy consumed by a vector instruction for reading, writing, and transferring elements between the register file(s) and the functional unit as a function of the number of functional units. 74
6.9 The minimum number of functional units for which the distributed register file of CODE leads to lower energy consumption per instruction for accessing register operands than the centralized register file organization in VIRAM-1. 75
6.10 The execution of five instructions on the delayed and decoupled pipelines. 76
6.11 The average number of inter-core register transfers per vector instruction for the three core selection policies. 77
6.12 The speedup of the multi-core CODE configuration over the reference design for the three core selection policies. 78
6.13 The average number of inter-core register transfers per vector instruction for the three register replacement policies. 79
6.14 The average number of inter-core register transfers per vector instruction as a function of the number of local vector registers per core. 80

7.1 The data structures for implementing precise virtual memory exceptions in CODE. 86
7.2 The performance loss due to hardware support for precise virtual memory exceptions in CODE. 88
7.3 The performance loss due to hardware support for precise exceptions for both virtual memory and arithmetic faults in vector instructions. 90

8.1 The processor-memory performance gap. 94
8.2 The evolution of the cell area, random access latency, and maximum bandwidth for embedded DRAM technology. 97
8.3 The block diagram of an embedded DRAM bank with multiple sub-banks. 99
8.4 The block diagram of a DRAM sub-bank with four row buffers. 101
8.5 Four simple address interleaving schemes for an embedded DRAM memory system. 102
8.6 The effect of memory latency on the execution time of the consumer benchmarks on CODE. 105
8.7 The effect of memory latency on the execution time of the telecommunications benchmarks on CODE. 106
8.8 The average memory latency for an element access in the consumer benchmarks as a function of the number of DRAM banks and sub-banks per bank. 108
8.9 The average memory latency for an element access in the telecommunications benchmarks as a function of the number of DRAM banks and sub-banks per bank. 109

9.1 The performance of CODE and VIRAM-1 for the consumer benchmarks as a function of the number of lanes. 114
9.2 The performance of CODE and VIRAM-1 for the telecommunications benchmarks as a function of the number of lanes. 115
9.3 The composite scores for the consumer and telecommunications benchmarks for VIRAM-1 and CODE as a function of the number of lanes. 116
9.4 The performance of CODE for the consumer benchmarks as a function of the bandwidth available in the inter-core communication network. 117
9.5 The performance of CODE for the telecommunications benchmarks as a function of the bandwidth available in the inter-core communication network. 118
9.6 The performance of CODE for the consumer benchmarks as a function of the number of vector cores and lanes. 120
9.7 The performance of CODE for the telecommunications benchmarks as a function of the number of vector cores and lanes. 122
9.8 The composite scores for CODE for the consumer and telecommunications benchmarks as a function of the number of vector cores and vector lanes. 123
List of Tables
3.1 The five categories in the first release of the EEMBC suite and the benchmarks they include. 11
3.2 The characteristics of the EEMBC consumer benchmarks on the NEC VR5000 processor. 13
3.3 The characteristics of the EEMBC telecommunications benchmarks on the NEC VR5000 processor. 14

4.1 The three addressing modes for vector memory accesses. 19
4.2 The VIRAM instruction set summary. 28
4.3 The dynamic instruction set counts for the multimedia benchmarks. 30
4.4 The distribution of vector operations for the multimedia benchmarks. 31
4.5 The distribution of strides for the multimedia benchmarks. 32
4.6 Static code size comparison for the consumer benchmarks. 33
4.7 Static code size comparison for the telecommunications benchmarks. 34
4.8 The average basic block size for the multimedia benchmarks in VIRAM. 35

5.1 The chip statistics for VIRAM-1. 46
5.2 The area breakdown for VIRAM-1. 48
5.3 The design methodology statistics for VIRAM-1. 49
5.4 The characteristics of the six embedded processors used for performance comparisons with VIRAM-1. 51

6.1 The common types of execution and load-store cores. 65

7.1 The CODE configuration used for evaluating the performance impact of precise virtual memory exception support. 89

8.1 The basic parameters of the IBM SA27E CMOS process for embedded DRAM technology. 96

9.1 The CODE configuration for comparison with VIRAM-1. 113
Acknowledgements
Research in computer architecture requires a team effort. During the course of my studies, I was fortunate to work with a number of great people who influenced the direction and the quality of my work.

First, I would like to thank Dave Patterson, my thesis advisor, for his overall guidance, support, and friendship. He helped me develop a taste for research and encouraged me to pursue the problems and ideas I found most intriguing. Moreover, he showed me the importance of a balanced professional and personal life. I also enjoyed our frequent conversations on soccer, a rare delight for a European living in California.

I am especially grateful to Manolis Katevenis, my undergraduate advisor at the University of Crete, for initiating me into computer architecture. His enthusiastic teaching and good advice motivated me to pursue a graduate degree in the first place. Krste Asanovic, now a professor at MIT, introduced me to vector architectures and their potential with multimedia applications. Our numerous discussions had a strong influence on the microarchitecture and design of VIRAM-1. I also want to thank professors Katherine Yelick, John Wawrzynek, John Kubiatowicz, Christos Papadimitriou, and John Chuang at U.C. Berkeley for their help and feedback at various stages of my studies.

I am indebted to Sam Williams and Joe Gebis, my project partners and officemates. Without their persistence and selflessness, the VIRAM-1 prototype chip would not have been possible. They also had to deal with my stubborn character and Greek accent on a daily basis. I hope they enjoyed our long arguments on social, political, and economic issues as much as I did. By the way, we never agreed on anything.

My gratitude to all the graduate students and staff involved with hardware or software development for the IRAM project: David Martin, Iakovos Mavroidis, Ioannis Mavroidis, Hiro Hamasaki, Brian Gaeke, Dave Judd, Rich Fromm, and Mani Narayanan. I also want to acknowledge the help from Paul Gutwin, Bill Tetzlaff, and Subu Iyer from IBM, as well as from Darren Jones from MIPS Technologies.

My favorite part of graduate school was the open-ended discussions on research and educational issues that I had with numerous people in the EECS department. John Hauser, Tim Callahan, Kees Visser, Jim Beck, David Oppenheimer, Eylon Caspi, Joe Yeh, and Kim Keeton facilitated several insightful conversations and should be blamed for all the time I wasted talking in the corridors of Soda Hall instead of working.

Special thanks to Iason Vasiliou for being a truly good friend during my six years in Berkeley. He often had to force me to have some fun and forget about the hard work and stress of graduate school. Maria Moreno, Kostas Adam, Nikos Chronis, Stelios Perissakis, Manolis Terrovitis, Regina Soufli, Kristina Varga, Pamela Erickson, Amanda Cramp, and all the players of the Tzatziki Turbos, the soccer team of Greek students at U.C. Berkeley, have been great friends and good companions as well. As for the last year of my studies, it was Vicky Kalivitis who brought me the peace of mind necessary to conclude my work.

Finally, I want to thank my parents, Litsa and Manolis, and my two beautiful sisters, Maria and Natasa, for more than I can say with words.
The IRAM project at U.C. Berkeley was supported in part by the Advanced Research Projects Agency of the Department of Defense under contract DABT63-96-C-0056, by the California State MICRO Program, and by the Department of Energy. IBM, MIPS Technologies, Cray Inc., and Avanti Corp. have made significant software or hardware contributions to the IRAM project. This thesis was also supported by a U.C. Regents fellowship (1996-97) and an IBM Research Ph.D. fellowship (2001-02).
Chapter 1
Introduction
“There is at the back of every artist’s mind, a pattern or type of architecture.” Gilbert Chesterton
Over the past twenty years, microprocessor designers have concentrated on accelerating engineering applications on workstation systems. This approach has led to the development of the superscalar and VLIW architectures for exploiting the instruction-level parallelism (ILP) available in applications with complex control flow. ILP processors have taken advantage of the exponential improvements in the density and speed of circuits in semiconductor chips in order to deliver exponentially increasing performance. Recently, however, the focus in computing has shifted from engineering to multimedia applications and from workstations to embedded systems. In this new computing environment, the energy consumption and design complexity of ILP architectures render them ineffective, despite their theoretical generality and flexibility. Moreover, it is becoming gradually more difficult for ILP processors to translate improvements in circuit technology into proportional gains in application performance.

This thesis focuses on the development of efficient microprocessors for embedded multimedia systems. We argue that it is possible to design processors that deliver high performance for multimedia tasks, have low energy consumption, and are simple to implement and scale with modern CMOS technology. The basis for the argument is the existence of data-level parallelism in multimedia applications and the ability to exploit it efficiently with a vector architecture. Furthermore, the increasing density of semiconductor dies enables the design of cost-effective, on-chip memory systems that can support the high bandwidth necessary for a vector processor.
Thesis Contributions

The main contributions of this dissertation are the following:
• We introduce the VIRAM vector instruction set architecture (ISA) for embedded multimedia systems. The vector instructions in the VIRAM ISA can express the data-level parallelism in multimedia applications in an explicit and compact manner.
• We present the microarchitecture, design, and evaluation of the VIRAM-1 media-processor. VIRAM-1 integrates a simple, yet highly parallel, vector processor with an embedded DRAM memory system. It demonstrates that a vector processor can provide high performance for multimedia tasks at low energy consumption and low design complexity.
• We propose the CODE vector microarchitecture for the VIRAM ISA that combines composite organization with decoupled execution. The simplified vector register file and the ability to tolerate high memory latency allow CODE to extend the performance and energy advantages of VIRAM-1 across a larger design space. It can also support precise exceptions with a minimal impact on performance.
• We demonstrate that embedded DRAM is a suitable technology for the memory system of vector media-processors. Embedded DRAM provides the high memory bandwidth required by a vector processor at low energy consumption and moderate access latency.
Thesis Outline

The outline of the rest of this thesis is as follows.

Chapter 2 provides the background and motivation for this work. It discusses the characteristics and requirements of multimedia applications and embedded systems. It also argues that the superscalar and VLIW architectures for high performance processors are not suitable for embedded multimedia systems.

Chapter 3 describes the function and characteristics of the EEMBC benchmark suite for embedded processors. We focus mostly on the consumer and telecommunications benchmarks, which are representative of multimedia tasks.

Chapter 4 introduces the VIRAM vector instruction set. It describes a set of enhancements to traditional vector architectures that provide support for multimedia processing and virtual memory. We demonstrate that the vector instructions in VIRAM can capture more than 90% of the dynamic instruction count of the EEMBC benchmarks. We also show that the use of vector instructions leads to significant advantages in terms of code size over CISC, RISC, and VLIW architectures.

Chapter 5 presents and evaluates the VIRAM-1 prototype microprocessor. We describe the VIRAM-1 pipeline structure and how it interacts with embedded DRAM. We also present its scalable design based on the concept of vector lanes. We demonstrate that VIRAM-1 outperforms superscalar and VLIW processors by at least a factor of 2, despite the low clock frequency that results from its focus on low energy consumption.

Chapter 6 introduces the CODE microarchitecture for vector media-processors. We discuss its two basic elements, composite organization and decoupled execution, and how they allow for performance, energy, and complexity improvements over VIRAM-1. We describe the issue logic and operation control in CODE and discuss its implementation based on two orthogonal concepts: vector cores and vector lanes. Finally, we use the EEMBC benchmarks to derive the optimal values for the key parameters of the CODE issue logic.

Chapter 7 tackles the problem of precise virtual memory exceptions in vector microprocessors. We introduce an alternative definition of precise exceptions for vector instructions that places relaxed requirements on processor implementations. We also describe a set of minor modifications to the issue logic of CODE that implements precise exceptions. We demonstrate that the support for precise virtual memory exceptions has negligible impact on performance for the multimedia benchmarks.

Chapter 8 examines the use of embedded DRAM for the memory system of vector media-processors. We introduce the basics of embedded DRAM technology and discuss the impact on performance and energy consumption of various design parameters, such as the number of banks and sub-banks, the address interleaving scheme, and the type of processor-to-memory interconnect. We demonstrate that the CODE microarchitecture works well with embedded DRAM, as it can tolerate high memory latency if sufficient bandwidth is available. We also show that the use of multiple banks and sub-banks in the memory system is crucial, especially for applications with non-sequential access streams.
Chapter 9 provides a detailed performance evaluation of CODE for the EEMBC benchmarks. Assuming equal die area, CODE outperforms VIRAM-1 by 26% for the consumer benchmarks and 12% for the telecommunications benchmarks. We also demonstrate that CODE can exploit additional hardware resources by executing independent vector instructions on multiple vector cores or by executing several element operations for each vector instruction in parallel on multiple vector lanes.

Finally, Chapter 10 concludes the thesis and suggests directions for future work.
Chapter 2
Background and Motivation
“One’s mind has a way of making itself up in the background, and it suddenly becomes clear what one means to do.” Arthur Benson
The main drivers for microprocessor technology have traditionally been workstation and server systems running engineering applications. The requirements of such systems have led to the development of the superscalar and VLIW architectures that exploit the instruction-level parallelism (ILP) available in applications with complex control flow. In addition, the exponential growth in CMOS semiconductor technology has allowed the design of faster, larger, and increasingly complicated processors. In the last few years, however, we have experienced significant changes in the requirements and underlying assumptions for general-purpose microprocessors. In this chapter, we discuss the applications, systems, and technology trends that set the background and motivate the research work in this thesis.

Section 2.1 presents the requirements of two emerging trends in computing: multimedia applications and embedded systems. It also discusses the semiconductor technology and manufacturing challenges that threaten to limit the potential of future processor designs. Section 2.2 argues that superscalar and VLIW processors do not match the requirements of embedded multimedia in deep sub-micron CMOS technology. Finally, Section 2.3 introduces the basic requirements for efficient microprocessors for embedded multimedia systems and sets the research goals for this thesis.
2.1 Applications, Systems, and Technology Trends
To develop successful microprocessor architectures, we must carefully examine the requirements of the applications that the processors will run and the characteristics of the computer systems that the processors will go into. In addition, we should keep in mind the capabilities and, most important, the limitations of the underlying manufacturing technology for semiconductor chips.
2.1.1 Multimedia Applications

The continuing improvements in circuit technology and recent algorithmic innovations have enabled the use of real-time media data such as video, sound, and animation. Applications with multimedia features, such as 3-D graphics, video or visual imaging, speech or handwriting recognition, and high fidelity music, are already among the most popular and consume the majority of processing cycles on desktop systems. They have the ability to greatly improve the usability, quality, productivity, and enjoyment of computer systems. They also expand the applicability of computer-based products from the office environment to every aspect of our lives. Hence, it is common knowledge that the influence of multimedia applications on computing will only increase in the future [BG97, Gro98, Dal98, Kil98]. Multimedia applications exhibit a set of distinguishing characteristics [DD97]:
• Data-level parallelism is inherent in multimedia programs, as they typically repeat a small set of operations over a sequence of input pixels, video frames, or sound samples. This form of parallelism is explicit in the algorithmic description of multimedia functions (see the sketch after this list).
• They operate mostly on narrow data types, as 8-bit or 16-bit numbers are sufficient to encode the limited input range of human vision and hearing.
• They require real-time response guarantees. Most multimedia applications rely on qualitative real-time perception. Hence, the sustained performance under worst-case conditions is much more important than the peak performance or the accuracy of arithmetic results. With video decoding, for example, the rate of 30 frames per second defines both the minimum acceptable and the maximum required performance level. It is preferable to produce a few erroneous pixels per frame rather than drop below the required frame rate at any point in time.
• Due to the streaming nature of multimedia applications, their input data exhibit limited temporal locality.
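To make the first two characteristics concrete, the short sketch below shows a hypothetical media kernel in Python: the same saturating operation repeats for every 8-bit pixel, with no dependence between iterations. The kernel and its data are made up for illustration; they are not taken from any benchmark code.

    # A hypothetical media kernel: brighten an 8-bit gray-scale frame.
    # Every pixel undergoes the same saturating add, no iteration depends
    # on another, and 8 bits per element suffice -- exactly the data-level
    # parallelism on narrow types that a vector unit can exploit.

    def brighten(pixels: bytes, delta: int) -> bytes:
        # Saturating add: results clamp at 255 instead of wrapping around,
        # as fixed-point media arithmetic requires.
        return bytes(min(p + delta, 255) for p in pixels)

    frame = bytes([0, 100, 200, 250])   # a tiny four-pixel "frame"
    print(list(brighten(frame, 30)))    # [30, 130, 230, 255]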
In contrast, most engineering workloads use 64-bit numbers for maximum arithmetic accuracy and demonstrate good temporal and spatial locality in their data accesses. Their complex control flow includes short data dependence distances that limit the potential for extracting data-level parallelism. Since the goal with engineering applications is minimum time to completion, they place emphasis on peak performance over real-time response guarantees. The apparent differences in the characteristics of engineering and multimedia applications indicate that a processor optimized for the former is unlikely to be efficient for the latter.
2.1.2 Embedded Systems

In parallel with the appearance of multimedia applications, the focus in systems development has been switching from the desktop to the embedded domain. Embedded systems include portable devices such as personal digital assistants (PDAs), digital cameras, palmtop computers, and cellular phones, as well as entertainment systems such as video game consoles, set-top boxes, and DVD players [Lew98]. In the last few years, there has been rapid growth in the variety and functionality of such embedded consumer products, driven mostly by their huge market potential. Even though one or two desktop computers are sufficient for most households, a single person may own and use several embedded devices. The characteristics and requirements of embedded systems are considerably different from those of desktop machines:
• They require low energy consumption. Portable devices must operate for a long time using conventional battery technology. In addition, electronics for embedded devices are limited to cheap cooling systems and packages.
• Embedded systems store application code in some form of non-volatile memory like ROM or Flash because hard disks are too expensive for most consumer products. Compact code size lowers the system cost because the application can use a smaller ROM or Flash chip.
• Due to the consumer nature of embedded products, they call for low development and manufacturing cost. Hence, the electronic components for embedded systems must be easy to design in the first place and easy to scale for follow-up products.
• To reduce the overall size of embedded devices, it is desirable to use highly integrated chips that incorporate a large number of the processing, memory, and IO components on a single die.
As desktop and workstation systems become more cost-oriented, some of the requirements of embedded systems become more general. However, the conventional wisdom has been that power consumption is a secondary issue for desktop systems. Code size has also been irrelevant due to the availability of hard disks in PCs and workstations and the success of instruction caches with the code for engineering applications.
2.1.3 Technology Constraints

Until recently, it was widely believed that the continuing validity of Moore's law [Moo65] meant that we could proceed with the design of larger and increasingly complex processors without significant concerns about the underlying technology. Even though the capacity of semiconductor chips still grows at the exponential rate predicted by Moore's law, there are certain limitations that could prohibit the conversion of increased capacity into increased performance in future microprocessors.

The first technology constraint is the exponentially increasing performance gap between microprocessors and DRAM memory [HP02]. While processor chips have been optimized for performance, DRAM chips have been targeting maximum capacity and minimum cost. The consequence is that main memory accesses are becoming increasingly slower from the processor's perspective. No matter how fast the processor can execute arithmetic operations or how many operations it can execute in parallel, it cannot deliver a high level of sustained performance if it cannot quickly access the input and output data for the applications [WM95]. We provide a further discussion of the processor-memory performance gap in Chapter 8.

An additional problem of semiconductor technology is the latency of long, on-chip wires. As the feature size of CMOS processes shrinks, transistors and logic gates become faster. However, as we also increase the amount of hardware resources in microprocessor chips, the propagation latency of long wires that implement cross-chip communication remains constant [HMH01]. Therefore, the relative cost of computation versus communication decreases with every generation of CMOS technology, and global interconnect patterns within a processor chip are becoming progressively more expensive. The consequence for processor architectures that rely on global, low-latency communication among their subcomponents will be either slower clock frequency or repeated stalls during global communication events.

The final technology constraint refers to processor development costs. As the capacity and functional complexity of microprocessors increase at exponential rates, the same holds for their design and verification complexity. For complicated architectures with limited component modularity, the design and verification costs and cycles can easily exceed the manufacturing costs and cycles [AEJ+02]. Complicated designs require large development teams that are expensive to assemble and difficult to manage. They also place a heavy burden on the CAD tools for automated design and the equipment for testing semiconductor products.
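A back-of-the-envelope model makes the memory-gap argument above concrete. The sketch below is a minimal illustration, not a measurement; the ideal CPI, memory-access fraction, miss rate, and miss latency are all assumed values chosen only for the example.

    # Illustrative model of how the processor-memory gap caps sustained
    # performance. All parameters are assumptions for the example.

    core_cpi  = 0.5   # ideal cycles per instruction for a 2-way issue core
    mem_frac  = 0.3   # fraction of instructions that access memory
    miss_rate = 0.05  # cache miss rate for those accesses
    miss_cost = 200   # main-memory latency in processor cycles

    # Average stall cycles that cache misses add to every instruction.
    stall_cpi = mem_frac * miss_rate * miss_cost

    effective_cpi = core_cpi + stall_cpi
    print(f"effective CPI = {effective_cpi:.1f}")  # 3.5: memory, not the
    # datapath, determines the sustained performance.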
2.2 The Case against Superscalar and VLIW Processors for Embedded Multimedia Processing
The prevailing approaches to high performance processors have been the superscalar and VLIW architectures that exploit the instruction-level parallelism available in applications with wide-issue organizations [SS95, Fis83]. These architectures have delivered exponential growth in performance over the past two decades. Many expect that the performance of ILP processors will keep improving at the current rate indefinitely and that superscalar and VLIW designs can provide an efficient computing platform for all applications and all systems. In this section, we argue that ILP processors are running out of steam. They are increasingly inefficient even for the engineering applications they were developed for. Moreover, they are poor matches for the requirements of multimedia applications running on embedded systems.
Figure 2.1: The evolution of the number of transistors and the clock frequency across six generations of x86 microprocessors by Intel. Both graphs use a logarithmic scale for the y axis. The four Pentium processors use superscalar organizations and are able to issue and execute more than one instruction per clock cycle.

Figure 2.1 presents the evolution of the number of transistors and the clock frequency in six generations of superscalar microprocessors from Intel. As the feature size for CMOS technology has shrunk, we have been able to increase the number of transistors in microprocessor chips by a factor of 2.8 per generation. The additional hardware resources enabled an increase in the number of instructions issued and executed per cycle by using additional functional units, larger caches, and larger branch predictors. At the same time, faster gates, better circuit techniques, and deeper pipelines have allowed clock frequency to increase by a factor of 2.2 per generation. Each generation can execute basic operations twice as fast as the previous one. Therefore, one would expect an overall performance improvement for superscalar processors by a factor of approximately 6.1 (2.8 x 2.2) per generation.

Figure 2.2 presents the performance of the same microprocessors as measured with the integer SPEC benchmarks for engineering workloads. Sustained performance has increased by nearly 300 times, which implies an improvement factor of 3.1 per generation. This is half of the expected rate. In addition, the rate of improvement has dropped to approximately 2.0 for the three most recent microarchitectures (Pentium II to Pentium 4). Figure 2.2 also presents the evolution of power consumption. Despite the feature shrinks and the continuous reductions in the power supply voltage, power dissipation has been doubling with each generation and is fast approaching the dissipation limit for air-cooled systems.

The conclusion from Figures 2.1 and 2.2 is that superscalar processors are becoming increasingly ineffective at turning increased circuit capacity and high clock frequency into sustained performance, even for the engineering workloads they have been optimized for. As the amount of instruction-level parallelism in applications is inherently limited [Wal91a] and difficult to extract [AHBK00], scaling superscalar designs requires expensive investments in die area, circuit design, and engineering effort for diminishing returns in performance. The same holds for VLIW processors, since they also rely on instruction-level parallelism.

The picture for ILP processors becomes even worse if we consider the requirements of multimedia applications. Superscalar processors use a strictly sequential instruction stream that hides any data-level or instruction-level parallelism. In order to discover any parallelism and exploit it for higher performance, the hardware must use complicated issue logic that is wasteful in terms of both hardware resources and energy consumption.
Figure 2.2: The evolution of integer performance and the power consumption across six generations of x86 microprocessors by Intel. Both graphs use a logarithmic scale for the y axis. Integer performance refers to the score for the integer benchmarks in the SPEC suite.
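The per-generation factors quoted above follow from simple arithmetic; the minimal sketch below reproduces them from the endpoint data in Figures 2.1 and 2.2, assuming that six chips span five generation steps.

    # Reproducing the per-generation arithmetic behind Figures 2.1 and 2.2.
    transistor_growth = 2.8   # per generation (Figure 2.1)
    clock_growth      = 2.2   # per generation (Figure 2.1)
    print(transistor_growth * clock_growth)    # ~6.1-6.2 expected speedup

    total_speedup = 300       # SPECint gain from the 80386 to the Pentium 4
    generations   = 5         # five steps across the six chips
    print(total_speedup ** (1 / generations))  # ~3.1 measured speedup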
VLIW processors can capture some amount of data-level parallelism in their long instructions at the cost of increased code size (see Chapter 4). Both superscalar and VLIW processors typically provide limited support for narrow data types. They also use hierarchies of caches in order to bridge the processor-memory performance gap. However, caches rely on the premise of temporal locality, which is not available in the streaming data for multimedia applications. Superscalar and VLIW processors also rely on probabilistic techniques such as caching, hardware or software speculation, and out-of-order execution. Such techniques make it difficult to provide guarantees for real-time response, as there is a large performance penalty associated with cache misses and incorrect speculation.

ILP processors are poor matches for embedded systems as well. Superscalar processors consume excessive amounts of energy to discover which instructions in a sequential stream are independent and can execute in parallel. VLIW processors, on the other hand, consume energy on fetching and decoding instructions from bloated executables. They both waste energy on cache references for multimedia applications with limited temporal locality.

ILP processors rely on global communication patterns for instruction issue and result forwarding. In most cases, propagation of critical signals across the chip must occur in a single clock cycle. This property makes superscalar and VLIW designs vulnerable to the latency problem of long wires in CMOS technology.

Finally, high performance superscalar and VLIW designs suffer from increased design complexity. The design nonrecurring expenses (NRE) for such microprocessors routinely reach the tens of millions of dollars [AEJ+02]. In addition, their design and verification cycles are measured in years despite the use of hundreds of engineers. For reference, the manufacturing NRE for mask production and probe-card design is just reaching the $1 million mark, and manufacturing cycle times for microprocessor chips are measured in weeks. The trend of increasing complexity is expected to continue for superscalar and VLIW designs as they implement wider and increasingly complicated control functions.
2.3 Research Goals
Given the applications, systems, and technology trends, we consider the following as the fundamental characteristics for an efficient processor architecture for embedded multimedia systems:
• It expresses the data-level parallelism in multimedia applications in an explicit and compact way.
• It exploits data-level parallelism to deliver high performance at low energy consumption.

• It provides sufficient performance under worst-case conditions in order to simplify guarantees for real-time response.

• It leads to modular hardware implementations with mostly local interconnect that are easy to design, verify, and scale.

• It is easy to program efficiently using high-level languages and compilers.
The above features define the basic research goals for the architecture and microarchitecture techniques presented in the following chapters.
Chapter 3
Multimedia Benchmarks
“In theory, there is no difference between theory and practice. But, in practice, there is.” Jan van de Snepscheut
The development and evaluation of a new processor microarchitecture is impossible without measuring its efficiency for a set of applications. Ideally, we would like to evaluate the processor using full-size, end-user applications running within the environment of a complete product. Such an evaluation is rarely possible, however, especially during the development stages of a new microarchitecture. The chip-level simulators used to explore, tune, and verify a proposed design are too slow to allow experimentation with complete applications running on top of an operating system. Furthermore, we typically design new processors for the next generation of user software, hence the exact characteristics of the applications are not always available during development. Therefore, processor designers evaluate and compare processors using a benchmark suite: a collection of short programs that represent the applications of interest and stress the key processor features necessary to run them efficiently.

This chapter describes the benchmarks used in this thesis to develop and evaluate a set of microarchitectures for vector media-processors for embedded systems. Section 3.1 introduces the EEMBC benchmark suite for embedded processors. Section 3.2 describes the kernels included in the consumer and telecommunications categories of the EEMBC suite, which we use in the rest of this thesis. Section 3.3 discusses some basic characteristics of the EEMBC benchmarks, including their advantages and shortcomings. Finally, Section 3.4 presents related work in the area of benchmarking embedded and multimedia processor architectures.
3.1 The EEMBC Benchmark Suite
The EEMBC (EDN Embedded Microprocessor Benchmark Consortium) suite is a collection of benchmarks for the evaluation of microprocessors for a wide range of embedded applications [Hal99b]. Table 3.1 summarizes the benchmarks included in the suite. The goal of EEMBC is to become the “yardstick” for embedded processors in a similar way that the SPEC [Hen00] and TPC [PF00] suites are the standard benchmarks for desktop and server processors respectively.

The field of embedded processors is extremely diverse, as it includes anything from 8-bit and 16-bit, low-cost microcontrollers to powerful, 32-bit and 64-bit microprocessors. There are more than ten instruction sets in use with embedded processors today and, even though some are more popular than the rest, there is no dominant approach. Furthermore, there is no dominant microarchitecture. One can find an embedded chip that implements any possible hardware technique: RISC, CISC, VLIW, SIMD, and MIMD [HP02]. The fundamental reason for the diversity is the wide assortment of applications for embedded processors, ranging from industrial control systems to cellular phones and digital cameras. Each domain of embedded applications has different requirements in terms of performance, power consumption, and cost. The high sales volume of embedded products motivates vendors to develop processors customized to the needs of each specific domain.
Consumer: RGB to CMYK Conversion, RGB to YIQ Conversion, High-pass Gray-scale Filter, JPEG Compress, JPEG Decompress

Telecommunications: Autocorrelation, Convolutional Encoder, Bit Allocation, Fast Fourier Transform, Viterbi Decoder

Networking: Dijkstra Routing, Packet Flow, Patricia Table Lookup

Office Automation: Bezier Curve Calculation, Dithering, Image Rotation

Automotive & Industrial: Table Lookup & Interpolation, Tooth to Spark, Angle to Time Conversion, Pulse-width Modulation, Remote Data Request, Road-speed Calculation, Infinite Impulse Response Filter, Finite Impulse Response Filter, Bit Manipulation, Basic Arithmetic, Pointer Chasing, Matrix Arithmetic, Cache Buster, Inverse Discrete Cosine Transform, Fast Fourier Transform
Table 3.1: The five categories in the first release of the EEMBC suite and the benchmarks they include. All benchmarks are coded in C. They can execute on a wide range of embedded processors and microcontrollers. This thesis focuses on the consumer and telecommunications categories.

To allow for fair comparisons within the diverse space of embedded processors, the EEMBC suite includes five benchmark categories, one for each major embedded application domain:
• The consumer category includes algorithms used in digital-cameras, set-top-boxes, and per- sonal digital assistants (PDAs).
• The telecommunications class contains basic kernels from modem, ADSL, and wireless applications.
• The networking group features workloads from network devices such as switches and routers.
• The office automation category includes functions that represent office machinery such as printers, fax machines, and word processors.
• The automotive & industrial class contains tasks derived from industrial controllers and automotive applications such as engine and airbag control.
All benchmarks are coded in standard C and include multiple reference sets of input and output data.
The main metric for the EEMBC suite is execution throughput: the number of times a processor can repeat a specific benchmark within one second (iterations/second). Higher throughput scores mean higher performance. EEMBC reports throughput individually for each benchmark. It also summarizes each category with a composite metric, which is proportional to the geometric mean of the individual benchmark scores. Additional metrics are the static code and static data sizes in bytes. Naturally, smaller code and data sizes are preferable. The static data size does not include dynamically allocated memory. However, static memory use is a representative metric for embedded applications, because embedded programs typically allocate all the necessary buffer space in a static manner to avoid the overhead of dynamic memory management.

EEMBC allows two modes of measurement. For the “out-of-the-box” mode, the evaluation reflects the results achieved with straightforward compilation of the original benchmark code. This mode allows no optimizations other than what the compiler can achieve using various flags. The “optimized” mode allows modification of the benchmark code, use of intrinsic language extensions and optimized libraries, data re-ordering or restructuring, and even hand-tuning in assembly. Practically, the only modification not allowed in the optimized mode is changing the underlying algorithm of the benchmark. To ensure that the optimized or modified versions of each benchmark still produce correct results, EEMBC requires that the produced outputs are either bit-identical or within a predefined error margin from the provided reference outputs.
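As a minimal sketch of the composite calculation, the snippet below computes a geometric mean over the VR5000 consumer scores from Table 3.2; the scale constant is a placeholder, since EEMBC defines the exact proportionality factor.

    import math

    # Iterations/second for the consumer benchmarks on the VR5000 (Table 3.2).
    scores = {"Rgb2cmyk": 95.64, "Rgb2yiq": 41.52, "Filter": 57.51,
              "Cjpeg": 10.48, "Djpeg": 13.34}

    def composite(scores, scale=1.0):
        # The composite score is proportional to the geometric mean:
        # the n-th root of the product of the n benchmark scores.
        # 'scale' stands in for EEMBC's published constant.
        n = len(scores)
        return scale * math.prod(scores.values()) ** (1.0 / n)

    print(f"consumer composite ~ {composite(scores):.1f}")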
3.2 Benchmark Description
For the microarchitectures presented in this thesis, we focus on the consumer and telecommunications categories of the EEMBC suite. The two benchmark groups represent the typical workload of modern media-processors in consumer devices that combine multimedia applications with high-bandwidth, wired or wireless connectivity. The following subsections summarize the function and the characteristics of the 10 consumer and telecommunications benchmarks.
3.2.1 Consumer Benchmarks

The consumer category includes five fixed-point, multimedia benchmarks:

• RGB to CMYK Conversion (Rgb2cmyk): The benchmark converts a digital image to the CMYK format used by most printers. All three of the RGB components of each input pixel are necessary to compute the four CMYK components of the output pixel. The benchmark explores the processor's ability to perform basic matrix arithmetic and fixed-point or trigonometric operations.

• RGB to YIQ Conversion (Rgb2yiq): The benchmark converts a digital image to the YIQ format that complies with NTSC television standards. The output can be used as an overlay on a standard TV picture. The benchmark applies a special 3x3 matrix to each RGB pixel (see the sketch at the end of this subsection). It explores the processor's ability to perform basic matrix arithmetic and fixed-point or trigonometric operations.

• High-pass Gray-scale Filter (Filter): The program receives a dark or blurry gray-scale image and sharpens it with a high-pass filter or smoothens it with a low-pass filter. The filters apply a 3x3 convolutional kernel that requires the values of the eight neighbors of an input pixel in order to calculate its output value. The benchmark explores the processor's ability to perform matrix arithmetic.

• JPEG Compress (Cjpeg): It implements the JPEG standard for “lossy” compression of digital images [Wal91b]. It first performs a discrete cosine transform (DCT) on 8x8 image blocks. The DCT coefficients are subsequently quantized and compressed using a variable-length Huffman code.

• JPEG Decompress (Djpeg): It decompresses a JPEG image to retrieve the original RGB format. It recovers the quantized DCT coefficients from the compressed data and performs the inverse transformations on the 8x8 blocks. Just like the compression part, this benchmark explores the processor's ability to perform pre-loading, handle frequent branches, and execute arithmetic operations on short vectors.
Benchmark   Iterations/second   Code Size (KBytes)   Data Size (KBytes)
Rgb2cmyk    95.64               1.8                  230.8
Rgb2yiq     41.52               1.6                  230.8
Filter      57.51               2.0                  77.1
Cjpeg       10.48               58.6                 777.9
Djpeg       13.34               58.7                 1,042.9
Table 3.2: The characteristics of the EEMBC consumer benchmarks on the NEC VR5000 processor [Tur98]. The VR5000 is a dual-issue MIPS processor running at 250MHz. Its 32-KByte first-level instruction and data caches are two-way set associative. In early 2002, the VR5000 is representative of embedded processors used in high-end consumer products.
Table 3.2 presents the execution throughput and the code and data sizes of the consumer benchmarks for a 250MHz, dual-issue MIPS processor for consumer and telecommunications applications. Most benchmarks have a small code size that fits completely in an 8-KByte cache. Even though Cjpeg and Djpeg have larger code sizes, they exhibit negligible miss rates with an 8-KByte instruction cache due to the high degree of locality in their dynamic instruction streams [FWL99]. The data sizes for the benchmarks are considerably larger, as they are proportional to the size of the input images. Even though there is little temporal locality in the data accesses, there is significant spatial locality that allows for efficient prefetching.
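The Rgb2yiq kernel referenced above is, at its core, one 3x3 matrix-vector product per pixel. The sketch below uses the textbook NTSC coefficients in 8.8 fixed point; the exact constants, rounding, and scaling in the EEMBC source may differ.

    # RGB -> YIQ conversion: a 3x3 matrix applied to every pixel, here in
    # 8.8 fixed point. Coefficients are the textbook NTSC values; EEMBC's
    # exact constants and rounding may differ.

    FIX = 256  # 8.8 fixed-point scale factor

    # Rows are the (R, G, B) weights for Y, I, and Q, scaled and rounded.
    YIQ_MATRIX = [
        [round(c * FIX) for c in (0.299, 0.587, 0.114)],    # Y
        [round(c * FIX) for c in (0.596, -0.274, -0.322)],  # I
        [round(c * FIX) for c in (0.212, -0.523, 0.311)],   # Q
    ]

    def rgb_to_yiq(r, g, b):
        # One dot product per output component; the same three multiply-adds
        # repeat for every pixel, which is why the kernel vectorizes so well.
        return tuple((row[0] * r + row[1] * g + row[2] * b) >> 8
                     for row in YIQ_MATRIX)

    print(rgb_to_yiq(255, 128, 0))  # a single orange pixel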
3.2.2 Telecommunications Benchmarks

The telecommunications category includes five fixed-point benchmarks:

• Autocorrelation (Autocor): This telephony kernel compresses 8-kilosample/second voice data into a much smaller data stream. It uses autocorrelation to determine the short-term redundancy of the speech signal due to the filtering by the vocal tract. The autocorrelation coefficients are processed with a code-excited linear prediction (CELP) algorithm to find the filter that closely matches the vocal tract's transfer function. The benchmark explores the processor's ability to handle dot products (see the sketch at the end of this subsection).

• Convolutional encoder (Convenc): The benchmark implements a key function for a V.xx modem. It creates an output data stream with error detection and correction capabilities using a linear shift register and table look-ups. The program explores the processor's ability to perform bit-wise exclusive-or operations and table lookups.

• Bit allocation (Bital): This benchmark simulates a key function for asynchronous digital subscriber line (ADSL). The processor must distribute data into a series of “frequency bins.” The ADSL modem then modulates and transmits these bins over the telephone line. The benchmark explores the processor's ability to progressively spread a stream of data over a series of buffers using a “water-level” algorithm.

• Fast Fourier transform (Fft): This program performs the fixed-point, fast Fourier transform on 256 complex points for an ADSL application. It converts time domain data into frequency domain information. The benchmark explores the processor's ability to perform complex mathematical and memory access functions.

• Viterbi decoder (Viterbi): The benchmark implements the Viterbi decoder used in modem and wireless applications [Vit67]. It receives an input packet encoded with an IS-136 1/2 rate convolutional encoder. It uses a series of add-compare-select (ACS) steps and a back-tracking stage to recover the original information in the presence of transmission errors. The benchmark explores the processor's ability to perform bit-wise operations, comparisons, and table lookups.

Benchmark   Iterations/second   Code Size (KBytes)   Data Size (KBytes)
Autocor     206,378             1.1                  0.05
Convenc     2,555               1.6                  0.05
Bital       11,222              1.5                  1.66
Fft         3,656               5.4                  3.37
Viterbi     1,038               3.8                  3.80

Table 3.3: The characteristics of the EEMBC telecommunications benchmarks on the NEC VR5000 processor [Tur98].
Table 3.3 presents the execution throughput and the code and data sizes of the telecommunications benchmarks for the same 250MHz, dual-issue MIPS processor. Both code and data sizes are small. Data references have limited temporal locality but plenty of spatial locality. The major performance limitation for Viterbi, Bital, and Convenc is the existence of dependencies in their output data streams, which can prohibit parallel execution. For Fft, the bottleneck is typically the complex memory access pattern for the butterfly permutations.
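The Autocor kernel reduces to a handful of dot products: the coefficient for each lag k is the inner product of the signal with a copy of itself shifted by k. A minimal illustration follows (toy data, plain Python rather than the benchmark's fixed-point C):

    # Autocorrelation of a speech frame: one dot product per lag. Each
    # r[k] is independent and each dot product is a long, vectorizable
    # reduction -- the property Autocor stresses.

    def autocorrelation(x, num_lags):
        n = len(x)
        return [sum(x[i] * x[i + k] for i in range(n - k))
                for k in range(num_lags)]

    frame = [1, 2, 3, 4, 3, 2, 1]     # toy seven-sample "speech" frame
    print(autocorrelation(frame, 3))  # [44, 40, 31] for lags 0, 1, 2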
3.3 Discussion
As we already noted in the introduction of this chapter, we use benchmarks to evaluate microprocessors because of the difficulties associated with using full-size applications during the development stages of a design or with porting full applications to a large number of competitive processors. Therefore, any benchmark suite is bound to have some advantages and some basic shortcomings. It is important for processor designers to keep both in mind in order to correctly interpret the benchmarking results and understand their limitations.

The EEMBC suite is the first real attempt to develop a non-trivial benchmarking methodology for embedded processors. Its major advantage is the coverage of a large number of embedded application domains. It allows designers to focus their attention on the domain(s) in which they intend to employ their processor without worrying about their benchmark scores in other application areas. EEMBC reports results both with a composite metric for each category and with individual scores for each benchmark. The former enables quick comparisons and the latter allows for detailed studies. The benchmarks in the EEMBC suite are not synthetic. They are directly derived from real embedded applications and represent their computationally or memory intensive components. Hence, there is increased confidence that the benchmark results are meaningful. Finally, the EEMBC benchmarks are small and easy to port, tune, and run across a wide range of embedded platforms, regardless of the operating system or IO environment.
On the other hand, the EEMBC benchmarks are not complete applications. Hence, they do not capture interference between kernels in the same application or the overhead of the remaining part of the code. In addition, they cannot evaluate the contribution to performance of system-level components such as DMA engines, streaming buffers, integrated memory, external memory interfaces, interrupt handling logic, and so on. Embedded processors typically include on the same die a variety of memory and IO interfaces, and their actual performance can vary significantly depending on the way the applications use the additional hardware. Furthermore, energy and power consumption are not currently included in the EEMBC reports. Even though energy and power consumption are not as easy to measure or estimate as execution time, they are important metrics for embedded processors. For several embedded applications, it is the ratio of performance to power consumption that determines the most appropriate processor.

In terms of the results and comparisons presented in this thesis, one should note the following deficiency of the EEMBC suite. The EEMBC procedure for measuring performance specifies that benchmarks should execute a large number of times (typically a few hundred) in order to obtain reliable timing information. However, repeated execution, along with the tiny data sets of several benchmarks, allows caches to capture the data sets, so the performance results overlook the cost of accessing main memory. This situation of warmed-up data caches would not happen with real embedded applications, where a processor would never decode the same network data twice and would rarely attempt to compress the same image more than once. The EEMBC measurement procedure gives processors with cache hierarchies an unfair advantage over processors that implement data prefetching or streaming buffers, the mechanisms most appropriate for handling streaming multimedia data.

The microarchitectures we present in this thesis do not use caches. They employ architecture and microarchitecture techniques that implement prefetching. In contrast, the various commercial processors we compare against use cache hierarchies and benefit from the measurement deficiency.
3.4 Related Work
The EEMBC suite and its process for certification of results follow the example of the SPEC [Hen00] and TPC [PF00] organizations for benchmarking desktop and server systems respectively. SPEC and TPC have been continuously updating their benchmarks and procedures for over a decade and have won wide respect within their domains. The EEMBC consortium has the potential to play a similar role for embedded processors but is still in its very early stages.

Before the release of the EEMBC suite, the standard benchmark for embedded processors was Dhrystone, a short synthetic program representative of system programming with integer arithmetic from the early 1980s [Wei84]. Despite its limited relevance to the workload of modern embedded systems, most vendors still quote Dhrystone MIPS as a performance metric, the ratio of their processor's execution time over that of the VAX 11/785 [AF88]. Other benchmarks used for embedded processors are Whetstone [GW76] and CPU2. Whetstone is a FORTRAN program that runs a set of loops with integer, boolean, and floating-point arithmetic operations. It performs many iterative calls to programmed subroutines and in-line transcendental functions. CPU2 is also a FORTRAN program that includes portions of frequently used single and double-precision floating-point kernels.

Digital signal processors (DSPs) have typically used separate benchmarks from other embedded designs. The BDTI suite is a popular collection of DSP benchmarks that includes FFT, FIR, IIR, and other related kernels [EB98]. DSPstone is a similar suite with academic origins [ZMM94]. Recently, DSP vendors have shown interest in the EEMBC suite because it allows comparisons between DSP chips and embedded processors with DSP capabilities.

In the last few years, some academic groups have attempted to define benchmarks for evaluating multimedia processors. The UCLA Mediabench includes a number of image, video,
and voice handling applications available in the public domain [LPMS97]. The Berkeley Multimedia Workload extends the Mediabench suite both in terms of the number of applications and the size of the input data [SS01a]. Neither of these suites has become popular outside the academic environment. First, they represent the multimedia applications of workstations and desktop PCs, not the applications in embedded devices. In addition, their code relies on the services of a full-size operating system, which is not always available in embedded systems. Finally, the groups that proposed these suites did not attempt to define and support an acceptable process for measurement and certification of results.

MiBench is a recent academic effort to create a free version of the EEMBC suite [GRE+01]. MiBench introduces a new benchmark category for security applications such as encryption and digital signatures. In the five original categories, MiBench includes representative kernels, but not necessarily the same ones as EEMBC. In addition, MiBench focuses on high-end embedded processors (32-bit or 64-bit) and not on 8-bit or 16-bit microcontrollers. Consequently, the MiBench input data are at least one order of magnitude larger than the input data in EEMBC. Hence, MiBench results are not directly comparable to EEMBC results. As of early 2002, MiBench results are not available for any commercial or research processor.
3.5 Summary
In this chapter, we presented the EEMBC suite for comparing embedded processors in a variety of application domains. We will use the consumer and telecommunications benchmarks of EEMBC to evaluate the vector microarchitectures proposed in this thesis. The five consumer benchmarks represent image processing applications in digital cameras. The five telecommunications programs represent applications for ADSL and wireless communication.

The EEMBC benchmarks capture the basic kernels of the corresponding application domains and exercise the execution component of an embedded microprocessor. However, their small and static data sets are not representative of streaming input data in multimedia applications and fail to stress the streaming capabilities of the memory system. Nevertheless, the EEMBC suite is currently the only realistic option for evaluating a new microprocessor with a variety of embedded applications and comparing it to a number of commercially available designs.
Chapter 4
Vector Instruction Set Architecture for Multimedia
“When you do the common things in life in an uncommon way, you will command the attention of the world.” George Washington Carver
The instruction set (ISA) is the portion of an architecture that is visible to software. It defines the register and memory state of the architecture and a set of instructions that can operate on that state. Scalar instruction sets allow software to express only a single operation on the state with each instruction. Vector instruction sets, on the other hand, allow software to express multiple independent operations with one instruction, which makes it easier to implement efficient hardware. This chapter focuses on adjusting vector architectures to multimedia processing and introduces the VIRAM instruction set, a vector architecture developed for embedded media processors.

Section 4.1 provides a brief overview of traditional vector instruction sets used with supercomputing applications. In Sections 4.2 and 4.3, we introduce a set of architectural enhancements that target the characteristics of multimedia programs and general purpose systems. Section 4.4 presents the VIRAM architecture with a summary of its state and instructions. In Section 4.5, we analyze the use of the VIRAM instruction set with multimedia benchmarks and compare it with alternative architectural approaches such as VLIW and SIMD. Section 4.6 utilizes the experience from using the VIRAM ISA with the multimedia benchmarks in order to evaluate some of the basic decisions we made during its development. Finally, Section 4.7 overviews related work in instruction set architectures for multimedia processing.
4.1 Introduction to Vector Instruction Set Architectures
Parallelism is the key to achieving high performance in modern microprocessors, for it allows hardware to accelerate the execution of an application by processing multiple of its operations concurrently. Vector architectures provide a set of instructions for explicitly representing the data-level parallelism in an application to the hardware used to execute it. They have been used commercially for nearly three decades in the areas of scientific and high performance computing [Rus78, MU84, Jon89, HL96]. This section reviews the fundamental concepts in register-based vector architectures and provides the background for the instruction set issues addressed in the rest of this thesis. Hennessy and Patterson provide a longer introduction to vector processing in Appendix G of [HP02].
[Figure 4.1 diagram: (a) a scalar addition, add $r2,$r0,$r1, operating on individual registers; (b) a vector addition, vadd.vv $vr2,$vr0,$vr1, operating element-wise on all elements of two vector registers.]
Figure 4.1: The difference between scalar and vector instructions. A scalar instruction (a) defines a single operation on a pair of scalar operands. An addition instruction reads two individual numbers and produces a single sum. On the other hand, a vector instruction (b) defines a set of identical operations on the elements of two linear arrays of numbers. A vector addition instruction reads two vector operands, performs element-wise addition, and produces a similar array of sums.
Vector instruction sets include the basic arithmetic and memory instructions for operating on individual numbers, similar to those found in all RISC architectures. In addition, vector architectures define high-level instructions that operate on linear arrays of numbers, known as vectors. Figure 4.1 shows a vector instruction that specifies two vectors as input operands and produces a result vector by executing the same operation on each pair of elements from the input arrays. In other words, a single opcode and set of operands define a large number of identical, yet independent, operations on the elements of two arrays. A vector instruction set typically defines vector instructions for all logical, integer, and floating-point operations.

Vector architectures store array operands in a vector register file, in the same way a register array holds the operands for scalar instructions in RISC architectures. A vector register file is a two-dimensional storage array, where each row holds all the elements for a single vector. The number of elements per register is a property either of the ISA itself or of each processor that implements it.

A compiler can set the degree of data-level parallelism expressed by vector instructions using the vector length register (VL). Each instruction reads this control register as a default input operand and uses it to determine the number of element operations to execute. For example, an addition instruction with vector length set to five will only add the first five element pairs of the input operands and will not affect the remaining elements of the output vector register. Depending on the amount of parallelism available in the application, the compiler can set the vector length register to any value between one and the maximum number of elements stored per register, known as the maximum vector length (MVL). If the data-level parallelism in the application exceeds that of one vector instruction at maximum vector length, software can use strip-mining, a technique that expresses a long sequence of identical operations as a loop of vector instructions [GBS94]; a minimal sketch appears at the end of this section. If all vector instructions in a sequence are supposed to execute the same number of element operations, the compiler needs to set the vector length register only once at the beginning of the sequence.

In addition to instructions for integer and floating-point operations, vector architectures define instructions for exchanging data between vector registers and the memory system. Similar to arithmetic instructions, vector memory operations perform a large number of element accesses and operate under vector length control. Table 4.1 describes the three addressing modes supported by virtually all vector architectures: unit stride mode fetches data from sequential locations; strided mode addresses data in memory locations separated by a constant distance (stride); and indexed mode uses the corresponding elements of a vector register as pointers to memory for exchanging data for each element operation. The three modes allow applications to express their data-level parallelism with vector operations on consecutive elements in the vector register file, regardless of the exact layout of data in memory.
Mode         Example instruction    Meaning                           When used
Unit stride  vld  $vr0,addr         for i=0 to VL:                    Sequential accesses; stepping
                                      vr0[i] ← Mem[addr+i*d]          through arrays within a loop
Strided      vlds $vr0,addr,str     for i=0 to VL:                    Accessing columns of arrays;
                                      vr0[i] ← Mem[addr+i*d*str]      accessing color components
                                                                      of pixels
Indexed      vldx $vr0,$vr1,addr    for i=0 to VL:                    Accessing sparse matrices;
                                      vr0[i] ← Mem[addr+vr1[i]]       pointer chasing
Table 4.1: The three addressing modes for vector memory accesses. The variable d designates the size of the data item in memory (1, 2, 4, or 8 bytes). A vector instruction set typically defines one load and one store instruction per addressing mode and data size combination. When the size of elements in registers is larger than the size of data in memory, the instruction implies a conversion using sign extension or truncation of the most significant bits.

As with scalar loads and stores, vector memory instructions come in several variations to support all the possible sizes of data in memory (1, 2, 4, and 8 bytes). The typical alignment restriction is that the memory address for each vector element must be a multiple of the memory data size in bytes. There are no alignment restrictions for the whole vector.

There are a few vector instructions that deviate from the element-wise execution model. The most common exceptions are instructions that allow shuffling of elements within a vector register. Although there is no uniform support across all architectures, commonly supported primitives include operations for shifting element locations within a vector register (insert and extract) or packing and unpacking of elements using a bit-mask (compress and expand). There are also instructions that allow manipulation of a few control registers, such as the vector length, and that support the exchange of data between the vector register file and the scalar register file that holds operands for scalar instructions.

It should be clear from this brief introduction that vector architectures have several important advantages for applications with a high degree of data-level parallelism:
• A single vector instruction defines a large number of independent operations. By executing multiple element operations concurrently, a vector processor can keep multiple, deeply pipelined datapaths busy without the need for high instruction fetch and decode bandwidth. Since element operations are explicitly independent, there is no need to implement hardware for hazard checks within one vector instruction.
• Vector memory instructions have a known access pattern. Prefetching techniques can accelerate sequential or strided access for a large number of elements. The latency for initializing a memory access can be amortized by fetching multiple elements in parallel or in a pipelined manner. In other words, if sufficient memory bandwidth is available, memory latency is exposed once for the entire vector, instead of once for each element in the vector.
• The roles for the processor (hardware) and the compiler (software) are truly complementary and allow each one to focus on what it can do best. The compiler discovers the data-level parallelism available in the application code and expresses it with vector instructions. The hardware uses the explicit knowledge of parallelism to execute multiple element operations concurrently for vector arithmetic and memory instructions.
• A single vector instruction is equivalent to an entire loop. Hence, applications require fewer overhead instructions for incrementing loop indices or branching. The result is compact code size and reduced dynamic instruction count.
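Since strip-mining underlies most of the vectorized code discussed in this thesis, the following sketch makes the technique concrete. It is plain C with the vector strips written out explicitly; the inner loop stands in for the single vector load, multiply-add, and store sequence a vectorizing compiler would emit, and MVL is an assumed implementation parameter rather than a value fixed by the ISA.

    #include <stddef.h>

    enum { MVL = 64 };  /* assumed maximum vector length of the implementation */

    /* y[i] += a * x[i] over n elements, strip-mined into vector-sized chunks. */
    void saxpy_stripmined(float *y, const float *x, float a, size_t n)
    {
        for (size_t i = 0; i < n; i += MVL) {
            /* one write to the vector length register (VL) per strip */
            size_t vl = (n - i < MVL) ? (n - i) : MVL;
            /* this loop models one vector load, multiply-add, and store at VL */
            for (size_t j = 0; j < vl; j++)
                y[i + j] += a * x[i + j];
        }
    }

Note that when every strip executes the same operations, only the final, shorter strip requires a new VL value, which matches the observation above that the compiler often needs to set the vector length register just once per instruction sequence.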
4.2 Vector Architecture Enhancements for Multimedia
The data-level parallelism available in multimedia is similar to that in scientific applications, as they both include computationally intensive kernels that repeat the same set of operations on their input data. Still, in order to express multimedia applications efficiently using vector instructions, a set of enhancements to traditional vector architectures is necessary. The modifications target some of the distinctive characteristics of multimedia programs, such as the use of narrow data types, fixed-point arithmetic, and reduction operations.
4.2.1 Support for Narrow Data Types

Unlike scientific applications, where double-precision (64-bit) floating-point numbers are the predominant data type, multimedia programs process video or audio streams using narrower data. Pixels and audio samples are typically stored in memory using sequences of 8-bit or 16-bit data. In addition, human vision and hearing have such a limited range of inputs that 16-bit and 32-bit accuracy is generally sufficient during data processing. Therefore, a vector instruction set for multimedia must also define vector operations on 16-bit and 32-bit numbers in a way that allows for efficient implementation.

A vector architecture can support all three data types within a single vector register file by allowing different registers to store elements of different size. In order to use hardware resources efficiently regardless of the data width, the storage space for a 64-bit element can hold multiple narrower elements in every vector register. Similarly, segmented 64-bit datapaths for arithmetic operations can execute multiple narrower operations in parallel. Therefore, four 16-bit vector elements fit in the space for a single 64-bit one, and a segmented 64-bit adder can calculate four 16-bit element additions in parallel.

The compiler selects the width of vector elements and operations using the virtual processor width register (VPW) [HT72]. Just like vector length, this control register is a default input operand to all vector instructions. It selects between 64-bit, 32-bit, or 16-bit data types and instructs the hardware to translate vector elements and operations accordingly. By properly setting VPW before accessing each register, the compiler can store vectors of different element sizes in the same register file. When the virtual processor width selects a narrow data type, the maximum vector length (MVL) increases. A processor with a maximum vector length of 16 elements for 64-bit data can store up to 32 elements per vector register for 32-bit data, or 64 elements for 16-bit data. As discussed in Chapter 5, switching to a narrower VPW also increases the peak performance of a vector processor.

An alternative means to the VPW register for selecting the width of vector elements and operations would be to encode it within the opcode of every vector instruction. All recent SIMD extensions for RISC and CISC architectures employ this method for treating 64-bit scalar registers as short vectors of narrower data. This approach consumes a large amount of opcode space, as every operation requires one opcode for each supported data type. Given the large number of operations in modern instruction sets and the variants necessary to select important options (signed or unsigned arithmetic, vector or scalar operands, and so on), it is difficult to use this method without limiting the number of operations supported or significantly complicating the instruction encoding. The VPW approach, on the other hand, requires only one new control register and no additional opcode space. The run-time overhead for setting the control register is also small. Because multimedia kernels typically use a single data type width for all arithmetic operations within each loop, a compiler can set the VPW register once per loop to the widest data type used by any loop instruction.
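The relationship between VPW and MVL follows directly from the fixed storage per vector register. A minimal sketch, assuming a register file sized for 16 elements of 64 bits (the example used above):

    /* Each vector register provides a fixed pool of element storage; narrower
       elements simply mean more of them.  REG_BITS matches the 16-element,
       64-bit example in the text and is an assumption, not an ISA constant. */
    enum { REG_BITS = 16 * 64 };

    unsigned mvl_for_vpw(unsigned vpw_bits)   /* vpw_bits is 16, 32, or 64 */
    {
        return REG_BITS / vpw_bits;           /* 64 -> 16, 32 -> 32, 16 -> 64 */
    }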
[Figure 4.2 diagram: the n/2-bit halves of X and Y are multiplied into an n-bit product, scaled by a programmable right shift with rounding, and added with saturation to W to produce the n-bit result Z.]
Figure 4.2: The multiply-add model for a vector architecture for multimedia. All operands (X, Y, Z, W) are vector register elements. The variable n designates their width (VPW) and the selected accuracy of the overall result. The multiplier uses only half of the bits from X and Y, either the upper or the lower part, in order to produce an n-bit result. Shifting right by a programmable amount can scale the multiplication result to any fixed-point format. Rounding after scaling and saturating after adding improve the final accuracy. If Z and W are elements of the same register, the operation becomes multiply-accumulate.
4.2.2 Support for Fixed-point Arithmetic

Apart from narrow data types, multimedia applications frequently use saturating and fixed-point arithmetic. Saturation replaces modulo arithmetic to reduce the error introduced by overflow and underflow in signal and image processing algorithms. Fixed-point operations allow decimal calculations within narrow integer formats. They require fewer hardware resources and are faster to execute than floating-point operations.

Digital signal processor (DSP) architectures have always provided support for saturation and fixed-point arithmetic. Their instruction sets include special versions of add, subtract, and shift instructions that saturate the result in the case of overflow. Fixed-point support is more complicated, as the result of a multiply-accumulate operation, the most common fixed-point primitive, has twice as many bits as the input operands. DSP architectures use one of the following two methods in order to store as many result bits as possible and achieve high accuracy for the overall computation [LBSL97]. The first approach introduces one or more special registers, called accumulators, that can store more bits than the regular architecture registers. For example, a 16-bit DSP architecture may include a 40-bit accumulator. Successive multiply-accumulate operations store their wide result in an accumulator. The alternative approach uses a pair of consecutive registers in a register file to store the wide result. Both approaches are efficient in terms of hardware resources, but significantly complicate the task of compiling high-level code for such architectures. Register allocation and instruction scheduling are particularly difficult in the presence of a few extended-precision registers or register pairs that can be manipulated, loaded, or stored only with special instructions and complicated ordering rules. It is not surprising that most DSP chips perform significantly better using hand-optimized assembly code rather than compiled executables [EB98, Lev00].

Figure 4.2 presents the multiply-add model that allows a vector architecture to provide flexible support for fixed-point numbers of arbitrary format using regular vector registers for all input and output operands. There are three basic steps to this model: first, multiply the upper or lower halves of the first two inputs; then, scale the multiplication result to the desired fixed-point format by shifting and rounding; finally, perform a saturating add with the third input operand. The architecture can provide a separate instruction for each step or a single instruction for the whole operation. In the latter case, if there are not sufficient bits in the instruction encoding for the four register operands, the third input and the output can be the same vector register, turning the operation into a multiply-accumulate. Control registers can provide the shift amount for scaling and
the rounding mode, since the fixed-point format rarely changes halfway through a computation.

The arithmetic accuracy of the multiply-add model is a function of the data type width (VPW) selected during its execution. With a large VPW, vector elements can hold more decimal bits, leading to a small rounding error. On the other hand, with a small VPW, vector registers hold more elements and wide datapaths execute more narrow operations in parallel. Therefore, there is a trade-off between accuracy and performance, and the compiler can select the appropriate setting for each application. For example, consider an application that multiplies and accumulates 256 vector pairs of 8-bit data in memory. Executing the multiply-add operations with VPW set to 32 bits leads to full accuracy, with no decimal bits rounded off, regardless of the exact data values. On the other hand, if the application does not require full accuracy or the data values are relatively small, the application can execute twice as fast by setting VPW to 16 bits. Since most multimedia applications can tolerate small deviations in arithmetic precision, with careful selection of the shift amount and the rounding mode, narrow data types are often appropriate to use with the proposed multiply-add model.

Regardless of the type of registers used for multiply-add operations, programming language issues prohibit compilers from extensively using fixed-point instructions. Languages like C and C++ do not provide straightforward methods for expressing fixed-point arithmetic or saturation in the application code. Consequently, calls to special library functions and assembly programming are currently the only ways to exploit fixed-point support in the instruction set. Once the recently drafted extensions that add fixed-point data types to the C language become widely used [Org99], compilers will likely overcome this limitation. An alternative approach is to use compiler extensions that allow users to describe application-level characteristics to the compiler [ECCH00]. The programmer can describe to the compiler how fixed-point arithmetic is expressed with existing language semantics, which allows the compiler to use fixed-point instructions whenever possible.
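To make the three steps of Figure 4.2 concrete, the following scalar sketch models one element operation of the lower-halves multiply-add (the vxlmadd instruction of Table 4.2) for VPW = 32. The round-to-nearest scaling and the explicit shift parameter stand in for the control registers described above; the exact VIRAM semantics may differ in detail.

    #include <stdint.h>

    static int32_t sat32(int64_t v)        /* saturate a wide value to 32 bits */
    {
        if (v > INT32_MAX) return INT32_MAX;
        if (v < INT32_MIN) return INT32_MIN;
        return (int32_t)v;
    }

    int32_t vxlmadd_element(int32_t x, int32_t y, int32_t w, unsigned shift)
    {
        /* Step 1: multiply the lower 16-bit halves; the product fits in 32 bits. */
        int64_t prod = (int64_t)(int16_t)x * (int16_t)y;
        /* Step 2: scale to the target fixed-point format, rounding to nearest. */
        int64_t scaled = shift ? (prod + (1LL << (shift - 1))) >> shift : prod;
        /* Step 3: saturating add with the third operand. */
        return sat32(scaled + w);
    }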
4.2.3 Support for Element Permutations

Reduction and butterfly primitives are frequent in multimedia applications that include dot-products or transformations like the FFT. These primitives are difficult to express with vector instructions because either they produce a scalar output instead of a whole vector, in the case of reductions, or they do not operate on corresponding elements from the input vectors, in the case of the FFT. To address this inadequacy, we introduce three vector instructions that perform a restricted set of permutations on the elements of a vector register.

Figure 4.3 presents the vhalf permutation instruction for implementing reduction primitives. The instruction splits the elements of a vector into two registers. Adding the two registers at half the vector length performs the first step of an add reduction. By iteratively applying the permutation, the vector length reduction, and the vector addition of the two registers, we can reduce the original vector to a single element sum. Replacing the addition with other instructions, such as multiply or exclusive or, allows the implementation of a variety of arithmetic and logical reductions.

Figure 4.4 presents the operation of the vhalfup and vhalfdn permutation instructions for vectorized butterflies. Each instruction copies elements between two registers towards one direction using a programmable distance. Along with the corresponding vector addition, the two instructions implement one stage of a floating-point or fixed-point FFT. Even though a single instruction could implement both permutation patterns [Asa98], individual instructions are preferred because they require fewer hardware resources and simpler control for interlocks.

A single instruction that can execute any random permutation of vector elements could replace the three proposed instructions. Certain SIMD extensions for RISC architectures include a general permutation instruction for this purpose [Phi98]. Yet, due to the generality of a random permutation, such an instruction has complicated control and is either slow or requires an expensive crossbar structure for fast execution. On the other hand, the three simpler permutation instructions implement only the small set of permutation patterns that are frequent in multimedia applications.
[Figure 4.3 diagram: starting with elements 0-7 in $vr0, three vhalf/vadd steps at vector lengths 8, 4, and 2 accumulate partial sums until $vr0 holds the single sum of elements 0-7.]
Figure 4.3: An add reduction of a vector with eight elements using the vhalf permutation instruc- tion. The instruction reads the second half of the source register and copies the elements at the beginning of the destination register. The vector length register, which must be a power of two, indicates the number of elements in the register at any time. A reduction step consists of vhalf followed by a vector addition of the source and destination registers of the permutation at half the vector length. Three iterative steps are necessary to reduce the initial vector to a single element value.
Due to their regularity, they are easy to implement and execute fast with modest hardware resources (see Chapter 5).

Another alternative is to implement the permutations with the shuffling instructions in traditional vector architectures (extract, insert, compress, and expand) or with strided and indexed memory operations. The shuffling instructions are more complicated to implement than the three simple permutations and do not directly provide the desired functionality. The use of memory instructions puts unnecessary pressure on the memory system, the most significant performance bottleneck for most modern processors. The modest hardware resources needed for the permutation instructions should be easier to provide than increased memory system performance.

To utilize the permutation instructions for reductions, a compiler must recognize linear recurrences on operations like addition, maximum and minimum, and logical exclusive or. On the other hand, recognizing the many algorithmic descriptions for the FFT is a more difficult task. Hand-optimized libraries with FFT routines are an easier way to use the butterfly permutation instructions.
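The reduction loop of Figure 4.3 is easy to express once vhalf is available. The sketch below models the sequence in plain C, with the inner loop standing in for one vhalf followed by one vector add at the halved vector length:

    #include <stddef.h>

    /* Add-reduce a vector of vl elements, where vl is a power of two.
       Each iteration models one vhalf (copy the upper half down) followed
       by a vector add executed at the new, halved vector length. */
    long reduce_add(long *v, size_t vl)
    {
        while (vl > 1) {
            vl /= 2;                       /* halve the vector length */
            for (size_t i = 0; i < vl; i++)
                v[i] += v[i + vl];         /* vhalf + vadd at length vl */
        }
        return v[0];
    }

Replacing the addition with a maximum, minimum, or exclusive-or yields the other reductions mentioned above.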
4.2.4 Support for Conditional Execution

Several multimedia kernels include conditional statements, such as if-then-else constructs, in the main loop body. Without special architectural support, the branch instruction necessary to implement a conditional statement prohibits the use of vector instructions.

A vector architecture can support vectorized execution of conditional statements using a flag register file, which stores vector registers with single-bit elements [Asa98]. Each vector instruction uses one of the flag registers as a source of masks for conditional execution of its element operations.
[Figure 4.4 diagram: vhalfup (a) and vhalfdn (b) applied to $vr0 holding elements 0-7, shown for distances 1, 2, and 4.]
Figure 4.4: Example use of the vhalfup (a) and vhalfdn (b) permutation instructions. Each instruction performs one half of a butterfly permutation by copying elements from the input to the output register based on the distance stored in a control register. They differ only in the direction of the element move. The instructions do not modify the remaining elements of the output register, which are marked with gray. The vector length must be a power of two for both instructions.

An operation writes its result back to the register file only if the corresponding bit in the flag register is set. Comparisons between elements of vector registers, memory loads, and logical operations can produce the proper values for flag registers. If there are not enough bits in the encoding of every instruction to specify any flag register, instructions can select one of two default flag registers using a single opcode bit. Logical operations between flag registers can move values to and from the default flags if more than two flag registers are necessary, as with nested conditional statements.

Masked execution of instructions using flag registers is not the only way to support conditional execution in vector architectures. In [SFS00], Smith, Faanes, and Sugumar present a thorough study of the alternatives for conditional execution, including methods based on conditional merges, indexed memory operations, and vector shuffling instructions (compress and expand). The last two approaches compress vectors so that no masked elements are included in a register. The study reviews the ease of use with a variety of programming constructs, as well as the complexity and speed of potential hardware implementations for each alternative. It concludes that masked execution of vector instructions is simpler to use and implement, especially in highly parallel vector processors. Even though masked execution is not always the fastest alternative for certain sparse matrix kernels, it is sufficiently fast for all programming constructs, including nested if statements.
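As an illustration of flag-controlled execution, the sketch below vectorizes a simple if statement: a vector comparison fills a flag register, and the subsequent masked operation writes back only the elements whose mask bit is set. The two loops model two vector instructions operating at vector length vl; the clamping operation itself is only an example.

    #include <stddef.h>

    #define MVL 64                       /* assumed maximum vector length */

    /* for (i...) if (a[i] > limit) a[i] = limit;  -- vectorized with a mask.
       Assumes vl <= MVL, i.e., one strip of a strip-mined loop. */
    void clamp_masked(int *a, size_t vl, int limit)
    {
        unsigned char flag[MVL];         /* models one flag register */

        for (size_t i = 0; i < vl; i++)  /* vector compare: set mask bits */
            flag[i] = (a[i] > limit);

        for (size_t i = 0; i < vl; i++)  /* masked vector op: conditional write */
            if (flag[i])
                a[i] = limit;
    }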
4.3 Vector Architecture Enhancements for General Purpose Systems
With vector architectures for scientific computing, system issues are not a major consideration. In this environment, a single application may run uninterrupted for hours and an auxiliary computer handles input-output and other system activities. For vector processors employed in embedded systems, however, this is not necessarily the case. Even though the scalar component of a vector architecture implements most system features, a few potential issues, such as virtual memory and context switch time, call for the introduction of system concepts in the heart of the vector instruction set.
4.3.1 Support for Virtual Memory
Virtual memory is likely the single most important system feature of architectures for desktop computers and servers. It enables address space protection, data and code sharing, large address spaces with paging to disk, sparse address spaces, and memory-mapped IO. For embedded systems, several of these features are not particularly important, because such systems tend to be simpler and used for a small set of tasks. However, as embedded software development becomes an increasingly difficult task, some embedded systems require an operating system as complicated as Linux. In this case, address space protection is necessary and the architecture must support virtual memory.

There are several alternatives and choices in providing architectural support for virtual memory: pages or segments, software- or hardware-assisted exceptions, and so forth [JM98]. For this work, we have chosen to support paged virtual memory using a software-managed translation lookaside buffer (TLB), as defined in the MIPS architecture [Hei98]. Still, all issues would be similar with any other choice. If virtual memory is enabled in a specific system, the TLB must perform protection checks and translate from virtual to physical all addresses generated for vector load and store instructions.

There are two issues to address regarding TLB accesses. First, with vector processors attempting to generate multiple addresses per cycle in order to fetch many elements in parallel, the TLB can become an additional performance bottleneck. In Chapter 5, we discuss a vector TLB implementation that addresses this problem. The second issue is the state of the processor when an address lookup misses in the TLB or generates a protection violation. Since this is a software-managed TLB, in either case a virtual memory exception is generated and execution continues with an operating system handler.

Ideally, the state of a processor after any exception is precise: all the instructions before the faulting one have completed their execution, and all other instructions, including the faulting one, have not updated the architecture state. In practice, this is hard to implement because modern processors execute multiple instructions in parallel in a pipelined and, sometimes, out-of-order manner. It is even more complicated for vector architectures because every instruction defines a large number of operations that may execute over a long period of time, and any one of them can cause an exception. In Chapter 5, we present a simple vector processor with support for imprecise, yet restartable, memory exceptions. Several instructions before or after the faulting one may be partially completed at the time the software handler initiates, but the processor provides sufficient mechanisms for the handler to run and for execution of the interrupted program to resume correctly. In Chapter 7, we revisit the definition of precise state for a vector processor and discuss a mechanism for implementing precise exceptions and its cost in terms of performance and hardware resources.
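For concreteness, the following sketch shows the check that each vector element address must pass when virtual memory is enabled. It models a simplified, fully associative software-managed TLB in the MIPS style; the field names, the 4 KB page size, and the linear search are illustrative assumptions, and a miss or protection fault would trap to the operating system handler rather than return a flag.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_BITS 12   /* assumed 4 KB pages */

    typedef struct {
        uint64_t vpn, pfn;        /* virtual and physical page numbers */
        bool valid, writable;
    } TlbEntry;

    /* Translate one element address; false models a TLB miss or a
       protection violation, either of which raises an exception. */
    bool tlb_translate(const TlbEntry *tlb, int entries,
                       uint64_t vaddr, bool is_store, uint64_t *paddr)
    {
        uint64_t vpn = vaddr >> PAGE_BITS;
        for (int i = 0; i < entries; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                if (is_store && !tlb[i].writable)
                    return false;                   /* protection violation */
                *paddr = (tlb[i].pfn << PAGE_BITS)
                       | (vaddr & ((1u << PAGE_BITS) - 1));
                return true;
            }
        }
        return false;                               /* miss: software refill */
    }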
4.3.2 Support for Arithmetic Exceptions
Arithmetic instructions can also produce exceptions, especially those operating on floating-point numbers. For vector supercomputers, architectural support for detecting and servicing floating-point exceptions at full execution speed is critical, because the accuracy of applications like weather prediction often depends on the ability to correctly handle denormals, overflow, and underflow. For multimedia applications, on the other hand, elaborate support for fast arithmetic exceptions is not an important requirement. In most cases, arithmetic exceptions are disabled altogether, since a few incorrect pixels in a video frame are preferred over the impact on real-time performance of running exception handlers whenever exceptions occur. Support for arithmetic exceptions is necessary only for application debugging, during which reduced execution speed is not an issue.
Asanovic has proposed a method for keeping track of arithmetic exceptions using the flag register file [Asa98]. Element operations that generate exceptions set the corresponding bits in flag registers, with a separate register used for each type of integer or floating-point exception. The application can periodically check these flag registers and raise the proper exceptions if any bits are set. Coupled with a processor mode for executing a single vector instruction at a time, this method provides sufficient support for application debugging.
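A minimal sketch of this scheme follows, modeling one element operation: an overflowing addition sets the element's bit in a sticky flag register dedicated to overflow, and software polls the register later. The static variable stands in for a VIRAM flag register, and the per-exception-type partitioning follows the description above.

    #include <stdint.h>

    static uint64_t vf_overflow;   /* models the flag register for overflow */

    /* One element operation of a vector add that records overflow in the
       per-exception flag register instead of trapping immediately. */
    int32_t element_add(int32_t a, int32_t b, int lane)
    {
        int64_t wide = (int64_t)a + b;
        if (wide > INT32_MAX || wide < INT32_MIN)
            vf_overflow |= 1ull << lane;   /* set this element's sticky bit */
        return (int32_t)wide;
    }

    /* Periodic software check: raise or log the exception if any bit is set. */
    int overflow_pending(void) { return vf_overflow != 0; }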
4.3.3 Support for Context Switching

With each vector register containing tens of elements, the amount of architectural state in a vector processor is considerable. Saving and restoring all the vector state can increase the cost of context switching, despite the presence of a high-bandwidth memory system for vector loads and stores. To alleviate this impact, it is important to reduce both the number of context switches that manipulate vector state and the amount of state saved or restored each time.

In a well-balanced system for real-time applications, input-output handlers and periodic operating system routines are the most common reasons for interrupting the execution of a program. These handlers use scalar instructions and rarely update any vector state. To avoid saving and restoring vector registers in such cases, we need a mechanism for marking vector state and instructions as "unusable" before switching out a vector application. If the new application, whether a handler or a user program, tries to issue a vector instruction or access vector state, an exception is generated to instruct the operating system to save and restore the vector state. Otherwise, the previous application will find its vector state intact when it resumes execution. A control register that stores the identification number of the last process that updated vector state is necessary for this technique. On issuing a vector instruction, we compare this register to the identification number of the currently running process and generate an exception if they differ (see the sketch below). Chapter 5 discusses how this method allows a vector processor to continue executing vector instructions for an application while an interrupt handler runs using scalar instructions.

To reduce the amount of vector state involved in a context switch, the architecture can define valid and dirty bits for each vector register [PMSB88]. The valid bits indicate registers with useful data that the operating system must restore during context switches. The dirty bits indicate registers with data updated since the last context switch that the operating system must save. For a system with 32 vector registers, a couple of control registers are sufficient to store all dirty and valid bits. Writes to a vector register should automatically set the corresponding valid and dirty bits. The compiler should produce code that clears the valid bit every time it deallocates a register, and the operating system must clear all dirty bits when saving the state of a vector process. The hardware can also keep track of the largest vector length used by an application in order to reduce the number of elements saved and restored for each register [Asa98].
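The ownership check described above is sketched below. The control register holding the last vector user's process identifier is modeled as a static variable, and the trap is modeled as a return code; in hardware, this comparison happens when a vector instruction issues.

    #include <stdbool.h>

    static int vector_owner_pid = -1;   /* control register: last process that
                                           updated vector state (-1: none) */

    /* Conceptual check on issuing a vector instruction.  Returning false
       models the "vector unusable" exception: the OS then saves the owner's
       dirty vector registers, restores the new process's, and retries. */
    bool vector_issue_allowed(int current_pid)
    {
        if (current_pid != vector_owner_pid)
            return false;               /* trap: lazy save/restore in the OS */
        return true;                    /* vector state already belongs to us */
    }

    /* Called by the OS handler after the lazy save/restore completes. */
    void vector_claim(int pid) { vector_owner_pid = pid; }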
4.4 The VIRAM Instruction Set Extension for MIPS
To explore in practice the ideas presented in Sections 4.2 and 4.3, we developed the Vector IRAM (VIRAM) instruction set extension for the MIPS-64 RISC architecture [MIP01]. Figure 4.5 presents the vector state introduced by the VIRAM architecture, including the vector and flag register files. The instruction set does not define the maximum number of elements per vector or flag register. An implementation can select the proper MVL value based on its performance requirements and available hardware resources. A read-only control register makes this value available to software at run time; hence, with proper strip-mining of vectorized code, the same binary executable can run on every processor implementation of the architecture. Table 4.2 presents a summary of the VIRAM instructions. There are 91 instructions that, due to the various instruction options, occupy 660 opcodes in the coprocessor 2 space of MIPS-64.
[Figure 4.5 diagram: 32 vector registers (vr0-vr31) of MVL elements, each VPW bits wide; 32 control registers (vc0-vc31) and 32 memory registers (vm0-vm31), each 32 bits; 16 flag registers (vf0-vf15) of MVL single-bit elements; and 16 scalar registers (vs0-vs15), each 64 bits.]
Figure 4.5: The architecture state in the VIRAM instruction set. The vector and flag register files have 32 and 16 registers respectively. The control register file holds values such as the vector length (VL) and the virtual processor width (VPW). Two additional scalar register files hold scalar data for vector instructions and memory addresses, strides, and increments for vector load and store operations. This state is in addition to any state defined by the MIPS-64 architecture.
The majority of opcodes define arithmetic operations for integer, fixed-point, and floating-point numbers. Vector loads and stores allow an arbitrary post-increment to the scalar register that holds the base address. All arithmetic and memory operations can select one of the first two flag registers to provide masks for conditional execution. For a detailed presentation of all the instruction set features, refer to the VIRAM instruction set manual [Mar99].

Defining VIRAM as a coprocessor extension instead of a stand-alone instruction set allows us to take advantage of any orthogonal developments in the MIPS architecture and use the large amount of development tools available for it. Since coprocessor specifications at the instruction set level are similar for most RISC architectures, we can easily port the vector extensions to any other architecture. We can also use any MIPS processor core with coprocessor support as the basis for a VIRAM processor implementation.

The penalty for this flexibility is the need for the three scalar register files in the vector architecture state (see Figure 4.5). Vector instructions must access scalar registers for control values, memory addresses, and scalar operands. If the scalar and vector hardware are implemented separately and communicate only through the coprocessor interface, vector instructions cannot directly access the register files in the scalar core. Move-to and move-from coprocessor instructions are then the only way to transfer scalar data to the register files in the vector hardware for vector instructions to use. Even though the area for the three register files may be negligible for most implementations, the move instructions can introduce significant run-time overhead if not carefully optimized by the compiler. If close integration of the scalar and vector hardware is an option, however, the vector instructions can read scalar values directly from the MIPS-64 register file; hence, the extra register files and the move instructions are no longer necessary.

The vectorizing compiler used with the VIRAM architecture is based on the Cray PDGS system for vector supercomputers [Cra00]. It has C, C++, and Fortran front-ends and performs an extensive list of optimizations, including outer-loop vectorization. We developed a back-end that generates executable code for VIRAM implementations. To simplify the back-end development, we included in the architecture several traditional vector instructions not obviously necessary with multimedia applications, such as the compress and expand shuffling operations.
Vector Integer Arithmetic
  vabs     Absolute value
  vadd     Add
  vsub     Subtract
  vmullo   Multiply lo
  vmulhi   Multiply hi
  vdiv     Divide
  vmod     Modulo
  vsra     Arithmetic right shift
  vcmp     Compare
  vmin     Minimum
  vmax     Maximum

Vector Floating-Point Arithmetic
  vabs.{s,d}    Absolute value
  vadd.{s,d}    Add
  vsub.{s,d}    Subtract
  vmul.{s,d}    Multiply
  vmadd.{s,d}   Multiply add
  vmsub.{s,d}   Multiply subtract
  vnmadd.{s,d}  Negative multiply add
  vnmsub.{s,d}  Negative multiply subtract
  vdiv.{s,d}    Divide
  vrecip.{s,d}  Reciprocal
  vsqrt.{s,d}   Square root
  vrsqrt.{s,d}  Reciprocal square root
  vneg.{s,d}    Negate
  vcmp.{s,d}    Compare
  vcvt          Convert
  vtrunc        Truncate
  vround        Round
  vceil         Ceiling
  vfloor        Floor

Vector Fixed-Point Arithmetic
  vsat      Saturate
  vsadd     Saturating add
  vssub     Saturating subtract
  vsrr      Saturating shift right
  vsls      Saturating shift left
  vxumul    Multiply upper halves
  vxlmul    Multiply lower halves
  vxumadd   Multiply add upper halves
  vxumsub   Multiply subtract upper halves
  vxlmadd   Multiply add lower halves
  vxlmsub   Multiply subtract lower halves

Vector Logical
  vand   And
  vor    Or
  vxor   Exclusive or
  vnor   Nor
  vsll   Shift left logical
  vsrl   Shift right logical

Vector Load and Store
  vld    Unit stride load
  vst    Unit stride store
  vlds   Strided load
  vsts   Strided store
  vldx   Indexed load
  vstx   Indexed store
  vstxo  Ordered indexed store
  vfld   Flag load
  vfst   Flag store

Vector Processing
  vins       Insert
  vext       Extract
  vcompress  Compress
  vexpand    Expand
  vmerge     Merge
  vfins      Flag insert
  vhalf      Reduction permutation
  vhalfup    Left butterfly
  vhalfdn    Right butterfly

Flag Logical
  vfand  And
  vfor   Or
  vfxor  Exclusive or
  vfnor  Nor
  vfclr  Clear
  vfset  Set

Flag Processing
  viota    Iota
  vciota   Continuous iota
  vfpop    Population count
  vfff1    Find first one
  vffl1    Find last one
  vfsetbf  Set before first one
  vfsetif  Set including first one
  vfsetof  Set only first one
Table 4.2: The VIRAM instruction set summary. Instruction options, such as signed or unsigned arithmetic and the data size in memory, have been omitted for simplicity.

The compiler is able to recognize linear recurrences on operations like addition and logical exclusive or, and to generate vector code for the reduction using the permutation instructions. On the other hand, recognizing the many algorithmic descriptions for the FFT is a difficult task for the compiler. Instead, we provide a set of hand-optimized FFT routines as an easier way to use the available butterfly permutation instructions.

The VIRAM instruction set was in development from 1997 to 2000. The original definition was missing certain features, which we added during the design of the VIRAM-1 processor implementation (see Chapter 5). We introduced the post-increment for vector loads and stores to reduce the overhead of coprocessor moves, and the permutation instructions for vectorizing dot-products and FFTs. The need to support arbitrary fixed-point formats motivated the current form of the vector multiply-add model. In terms of implementation difficulty, most instruction set features, including support for multiple data sizes, proved straightforward. Vector memory operations were the most difficult to implement, due to their interaction with the memory system and the importance of high-performance loads and stores for a vector processor. The regularity of the permutation instructions made their control logic simpler than we initially expected. On the other hand, the traditional vector instructions for shuffling elements were extremely difficult to implement and debug.
4.5 Instruction Level Analysis of Multimedia Benchmarks
Before exploring the efficiency of hardware implementations of the VIRAM architecture, it is important to perform an instruction level characterization of the multimedia benchmarks. Specific information on the use of the instruction set features in each benchmark is the key to understanding potential performance deficiencies and proposing hardware mechanisms to overcome them.
4.5.1 Benchmark Vectorization

All benchmarks in the consumer and telecommunications categories of the EEMBC suite are vectorizable to some extent. With the exception of Fft, the VIRAM compiler is able to vectorize each benchmark in a way similar to the one an experienced assembly programmer would use.

Rgb2cmyk and Rgb2yiq contain a single loop that operates separately on every pixel in the image. They are trivial to vectorize using strided accesses to separate the color components of each pixel. On the other hand, Filter contains two nested loops that iterate across the rows and columns of the image respectively. We vectorized the loop for columns because it allows the use of unit stride accesses and requires fewer loads per pixel. Because digital images contain thousands of pixels, Rgb2cmyk, Rgb2yiq, and Filter perform arithmetic and memory operations on long vectors with tens of elements.

Unlike the other EEMBC benchmarks that are single-function kernels, Cjpeg and Djpeg are larger applications with tens of subroutines. They both perform 2-D discrete cosine transforms (DCT) on 8×8 pixel blocks, which account for more than half of the overall run time. We vectorized the two constituent 1-D DCTs across the rows and columns of the block respectively. The vector length in the DCT code is 8. Alternatively, we could perform outer-loop vectorization and process tens of 8×8 blocks concurrently in order to generate longer vectors. However, outer-loop vectorization of DCT would require global changes to the function structure and the buffering scheme in the two benchmarks. Apart from DCT, we vectorized several image manipulation routines (color conversion, up-sampling, down-sampling) and functions that emulate file-system operations. These functions operate on whole rows or columns of images and have long vectors. For Cjpeg, we also vectorized the discovery of runs of zero coefficients in the function for Huffman encoding.

For the Autocor telecommunications benchmark, we vectorized the inner loop that performs a dot-product on the input data for each time-delay unit defined by the outer loop.
           Op Count   Inst Count    % Vector        % Scalar         VPW        Vector Length
           (×10³)     (×10³)       Op     Inst     Op     Inst                  Avg    (Max)
Rgb2cmyk    1,000.2       11.4    99.6%   68.4%    0.4%   31.4%   16 (100%)   128.0   (128)
Rgb2yiq     1,386.0       36.0    98.9%   59.9%    1.1%   40.1%   32 (100%)    64.0   ( 64)
Filter      1,147.7       20.1    99.2%   53.2%    0.8%   46.8%   16 (100%)   106.0   (128)
Cjpeg      13,682.3    5,427.4    64.8%   11.5%   35.2%   88.5%   16 ( 27%)    41.4   (128)
                                                                  32 ( 73%)    13.3   ( 64)
Djpeg      11,997.3    4,505.3    67.2%   12.7%   32.8%   87.3%   16 ( 39%)    98.1   (128)
                                                                  32 ( 61%)    13.0   ( 64)
Autocor        54.5        4.1    94.7%   29.2%    5.3%   70.8%   32 (100%)    43.4   ( 64)
Convenc         8.4        0.3    97.1%   20.3%    2.9%   79.7%   16 (100%)   128.0   (128)
Bital         192.8       13.1    95.7%   36.9%    4.3%   63.1%   32 (100%)    38.4   ( 64)
Fft            22.5        0.7    98.9%   63.2%    1.1%   36.8%   32 (100%)    63.7   ( 64)
Viterbi       147.6       19.6    92.1%   40.1%    7.9%   59.9%   16 (100%)    17.8   (128)
Table 4.3: The dynamic instruction set counts for the multimedia benchmarks. The first two columns present the total number of operations and instructions executed in each benchmark. The four following columns present the percentages (%) of operations and instructions that execute in vector and scalar mode respectively. The VPW column specifies the element width used in each benchmark. Because Cjpeg and Djpeg perform both 16-bit and 32-bit operations, we also provide in parentheses the percentage of vector operations that used each VPW value. The last column presents the average vector length along with the maximum vector length for the VPW in each benchmark. For Cjpeg and Djpeg, we report the average vector length for each VPW value separately.
With Convenc, we used indexed memory accesses to vectorize the inner loop that examines each input data word and updates the output branch-words. In Bital, we vectorized across the number of output buffers. A reduction is also necessary in each iteration of the outer loop to count the number of bits allocated thus far. For Viterbi, we vectorized the generation of branch metrics and the forward sweep of decoding states that includes the add-compare-select operations. However, we could not vectorize the back-tracking loop that produces the final output due to dependencies. Finally, we vectorized Fft manually in assembly, using the intra-register permutation instructions for the butterfly stages.

The vector length in the telecommunications benchmarks depends on the size of the input data and the values of the control parameters for each kernel. For the datasets supplied by EEMBC, all benchmarks excluding Viterbi operate on relatively long vectors, with a few tens of elements each. In Viterbi, the vector length is determined by the number of decoding states, which is typically 8 or 16.
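As an example of what these vectorized loops look like in source form, the sketch below gives the Autocor inner loop: a dot-product between the input signal and a time-shifted copy of itself. The compiler strip-mines this loop and maps the final sum onto the vhalf reduction of Section 4.2.3; the function signature here is illustrative, not EEMBC's code.

    #include <stddef.h>

    /* Autocorrelation coefficient for one time-delay (lag): a dot-product
       of the signal with a shifted copy of itself.  The loop vectorizes as
       a multiply-accumulate followed by a vhalf-based reduction. */
    long autocor_lag(const short *x, size_t n, size_t lag)
    {
        long sum = 0;
        for (size_t i = 0; i + lag < n; i++)
            sum += (long)x[i] * x[i + lag];
        return sum;
    }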
4.5.2 Dynamic Instruction Set Use

Table 4.3 presents the dynamic execution counts for the ten multimedia benchmarks. It differentiates between operations and instructions. A scalar instruction defines a single operation; hence, for a scalar architecture, the terms operation and instruction are interchangeable. On the other hand, a vector instruction specifies a number of element operations equal to the value of the vector length at the time. For a vector architecture, the instruction count indicates only the number of instructions fetched and decoded by the processor. It is the operation count that specifies the actual workload for each benchmark.

The first interesting point of Table 4.3 is the degree of vectorization for the benchmarks, in other words, the percentage of the total number of operations defined by vector instructions.
             % Arithmetic                  % Misc.     % Load        % Store     Arith./
          Simple Shift Mul MAdd Cmp       FL   VP     U   S   X     U   S   X     Mem.
Rgb2cmyk    31     –    –    –   15        –    –     –  23   –     –  31   –     0.9
Rgb2yiq      –    17   17   32    –        –    –     –  17   –     –  17   –     2.0
Filter      53     7   13    –    –        –    –    19   –   –     7   1   –     2.7
Cjpeg       29     8   14    –    3        2    1    12  11   1    15   4   –     1.3
Djpeg       31     8   11    3    3        –    –    20   4   1    10   8   1     1.3
Autocor      8     –    –   30    –        –    4    58   –   –     –   –   –     0.6
Convenc     57     –    –    –    –        –    –    31   –   –     –  12   –     1.3
Bital       38    13    –    –   13       22    2     6   –   –     6   –   –     5.3
Fft         35     9   18    –    –        –   14    10   –   7     5   2   –     2.6
Viterbi     40     9    –    –    9        1    –    16   4   –     5  16   –     1.4
Average     32     7    7    7    4        3    2    17   6   1     5   9   0     1.9
Table 4.4: The distribution of vector operations for the multimedia benchmarks. All columns except the last one present percentages (%) of the total number of vector operations. The four major classes of vector operations are arithmetic, miscellaneous (misc.), load, and store. The operation categories for the arithmetic class are: simple arithmetic and logical (Simple), shift (Shift), multiply (Mul), multiply-add (MAdd), and comparisons (Cmp). The miscellaneous class includes flag operations (FL) and operations for permutations and vector shuffling (VP). The load and store operations are divided into unit-stride (U), strided (S), and indexed (X) accesses. The last column presents the ratio of vector arithmetic to memory (load-store) operations for each benchmark.

For most benchmarks, the degree of vectorization exceeds 90%, which demonstrates the effectiveness of the vector architecture in expressing the data-level parallelism in the benchmarks. Cjpeg and Djpeg exhibit the lowest degrees of vectorization because they execute Huffman coding and decoding using mostly scalar instructions. Still, approximately 65% of the operations in Cjpeg and Djpeg are due to vector instructions. It is also interesting to notice that the number of vector operations is high despite the low percentage of vector instructions. For example, only 20% of the instructions in Convenc are vector, but they define 97% of the overall operations.

The second noteworthy point in Table 4.3 is the average vector length. Even though long vectors are not necessary, they are desirable. Instructions that operate on long vectors can keep the functional units in a vector processor busy for several clock cycles. Hence, the existence of long vectors translates to reduced instruction issue bandwidth requirements. For five benchmarks (Rgb2cmyk, Rgb2yiq, Filter, Convenc, and Fft), the average vector length is almost equal to the maximum. For Autocor and Bital, the average vector length is limited to approximately 60% of the maximum because of the frequent use of dot-products that progressively reduce the size of vectors. Cjpeg, Djpeg, and Viterbi operate mostly on short vectors, with 13 to 18 elements on average. It is interesting to note that the distribution of average vector lengths for the multimedia benchmarks is similar to the distribution reported for supercomputing applications [Esp97].

Table 4.4 describes the distribution of vector operations in each benchmark. Simple arithmetic operations and unit stride load accesses are the most frequently used categories across all benchmarks. Nevertheless, for almost every operation category in the instruction set, we can find at least one benchmark that makes frequent use of it. The permutation instructions are rarely used in general, but their existence is critical for the vectorization of reductions and butterflies in Autocor, Bital, and Fft. The only operation categories in the VIRAM architecture not used with the EEMBC benchmarks are divides and indexed stores. However, divides are frequent in 3-D graphics applications, which are not represented in the EEMBC suite.
           Stride in bytes (% of vector operations)
Rgb2cmyk   3 (23%)    4 (31%)
Rgb2yiq    3 (34%)
Filter     320 (1%)
Cjpeg      2 (1%)     3 (4%)     4 (7%)     32 (3%)
Djpeg      2 (2%)     3 (7%)     32 (1%)
Convenc    2 (12%)
Fft        8 (2%)
Viterbi    688 (4%)   4 (16%)
Table 4.5: The distribution of strides for the multimedia benchmarks. We express stride as the distance in bytes between the memory locations of two consecutive elements in a strided load or store access. The figure in parentheses next to each stride value presents the number of memory operations that use this stride as a percentage (%) of the total number of vector operations. Unit-stride accesses are not included; in a unit-stride access, consecutive elements occupy adjacent memory locations.
graphics applications, which are not represented in the EEMBC suite.

The last column of Table 4.4 presents the ratio of arithmetic to memory operations. This ratio is important for balancing the mix of arithmetic and load-store functional units in implementations of the VIRAM architecture. All benchmarks excluding Autocor and Rgb2cmyk perform more than one arithmetic operation per memory access. The average ratio is approximately two arithmetic operations per memory access. For supercomputing benchmarks, the average ratio is typically closer to one [Esp97].

Finally, Table 4.5 presents the distribution of strides for the benchmarks with strided accesses. Small strides, between 2 and 4 bytes, are the most frequent across all benchmarks. Larger strides are less common and occur in kernels that process columns of two-dimensional data structures, such as Cjpeg and Djpeg.
4.5.3 Code Size

The static code size of applications is important for any kind of system because it affects the size, performance, and power consumption of the instruction cache. For the embedded domain, there is an additional reason to aim for small code size. Most embedded systems store executable code in some form of non-volatile memory such as ROM or Flash. Compact code lowers the system cost because the application can use a smaller ROM or Flash chip.

Tables 4.6 and 4.7 present the code size of the multimedia benchmarks for VIRAM and a number of alternative architectural approaches: RISC, CISC, VLIW, and DSP. For VIRAM, we report the code size achieved both with direct compilation and after tuning the basic kernels of each benchmark in assembly. Assembly tuning is beneficial for performance because the VIRAM compiler produces unscheduled code. However, it is also advantageous for code size. Because the PDGS system was originally written for the Cray-1 architecture, which had only 8 vector registers, the VIRAM compiler is often confused and produces unnecessary code that spills vector registers. It also attempts to use disjoint sets of registers for different classes of operations, which leads to needless copying of vector registers. Finally, the scalar portion of the code is sub-optimal both in terms of scheduling and code density. The compiler fails to move the calculation of constants outside the loop body (code motion) and does not perform extensive elimination of common sub-expressions. However, the compiler is extremely effective at vectorization. Its shortcomings are acceptable for an academic research project and straightforward to improve for commercial use.
           Vector (VIRAM)            CISC (x86)    RISC (MIPS)   VLIW (Trimedia)
           cc           as           cc            cc            cc             cc-opt
Rgb2cmyk    0.67 (0.9)   0.27 (0.4)   0.72 (1.0)    1.78 (2.5)    2.56 (3.6)     6.14 ( 8.5)
Rgb2yiq     0.52 (0.6)   0.41 (0.5)   0.89 (1.0)    1.57 (1.8)    4.35 (4.8)    34.56 (38.6)
Filter      1.32 (1.4)   0.70 (0.7)   0.94 (1.0)    1.99 (2.1)    4.67 (4.9)     3.58 ( 3.8)
Cjpeg      60.25 (2.0)  58.88 (1.9)  30.01 (1.0)   58.65 (1.9)  114.94 (3.8)   180.03 ( 6.0)
Djpeg      70.30 (2.0)  68.44 (1.9)  35.96 (1.0)   58.17 (1.6)  117.44 (3.3)   163.00 ( 4.5)
Average          (1.4)        (1.1)        (1.0)         (2.0)         (4.1)          (12.3)
Table 4.6: Static code size comparison for the consumer benchmarks. Code sizes are reported in KBytes. Next to each code size, we present in parentheses its ratio to the code size for the x86 CISC architecture; a larger ratio means bigger code. We differentiate between code generated with direct compilation (cc), with significant restructuring of the C code (cc-opt), and with assembly tuning of important kernels (as). The code sizes for architectures other than VIRAM are those from official EEMBC reports [Lev00]. The optimized code for the Trimedia VLIW architecture includes SIMD instructions [SRD96].
Table 4.6 compares the code size of VIRAM to the other architectures for the consumer benchmarks. The x86 CISC architecture produces the most compact code for the consumer benchmarks because it includes variable-length instructions and memory-to-memory operations. Compiled code for VIRAM is 1.4 times larger on average than code for the x86 CISC architecture. Tuning in assembly reduces the ratio to x86 down to 1.1. VIRAM code is actually smaller than x86 code for Rgb2cmyk, Rgb2yiq, and Filter. Cjpeg and Djpeg include thousands of lines of code for error checking and for emulating file-system services, which are not vectorized and lead to large code size. RISC code for the MIPS architecture is 2.0 times larger than x86. VLIW code with direct compilation for the Trimedia architecture is 4.1 times larger than x86. If the C code is modified for maximum performance on a VLIW processor, the ratio between VLIW and x86 grows to 12.3.

Table 4.7 compares the code size of VIRAM to the other architectures for the telecommunications benchmarks. The DSP architecture by Analog Devices (ADI) produces the most compact code for the telecommunications benchmarks because it includes features such as zero-overhead loops and special addressing modes for reductions and FFT. Compiled code for VIRAM is 4.6 times larger on average than code for the DSP. Tuning in assembly reduces the ratio down to 2.0. Code for x86 is 7.5 times larger than DSP, mostly because of the poor code density for Convenc and Fft. RISC code for MIPS is 8.0 times larger than DSP. The VLIW architecture in this case is the TMS320C6 [Tru97], which implements VLIW instructions on top of a basic DSP architecture. Straightforward compilation produces code 5.8 times larger than DSP. C-level optimizations for VLIW lead to code sizes 7.8 times larger than DSP.

The compact code size of VIRAM is due to four basic reasons. First, each vector instruction captures a large amount of data-level parallelism. Therefore, there is little need for static scheduling techniques such as software pipelining and loop unrolling [Muc97], which increase instruction-level parallelism but lead to bloated code size. This advantage is particularly obvious when comparing VIRAM to the VLIW approaches, especially if the VLIW code is tuned for performance. Optimized VLIW code is 2 to 10 times larger than code for VIRAM. Of course, VLIW code incurs the additional overhead of empty slots in long instructions [Fis83]. The second beneficial factor for VIRAM is that vector memory instructions capture operations for address handling such as indexing, scaling, and auto-increment. RISC, CISC, and VLIW architectures must execute several instructions around each load or store to capture this functionality. Furthermore, the permutation instructions in VIRAM allow compact descriptions of dot-products and butterfly primitives. In contrast, CISC, RISC, and VLIW architectures require complicated loops to express the necessary permutation patterns. Finally, the use of vector instructions allows us to eliminate the instructions for maintaining the loop index and branching in several small loops.

It is interesting to notice that VIRAM produces smaller code than MIPS, the RISC architecture it is based on. The vector instructions we added to the MIPS architecture function as useful "macro-instructions" that describe a large number of operations with a single 32-bit instruction word.
          Vector (VIRAM)         CISC (x86)    RISC (MIPS)   VLIW (TI TMS320C6)        DSP (ADI)
          cc          as         cc            cc            cc           cc-opt       cc
Autocor   1.0 (3.7)   0.3 (1.1)   0.5 ( 1.9)    1.1 ( 3.9)    0.9 ( 3.2)   1.5 ( 5.1)   0.3 (1.0)
Convenc   0.7 (6.2)   0.3 (3.3)   0.8 ( 7.5)    1.6 (15.4)    1.1 (10.6)   2.0 (19.2)   0.1 (1.0)
Bital     1.0 (2.9)   0.6 (1.7)   0.7 ( 1.9)    1.5 ( 4.2)    2.3 ( 6.5)   1.4 ( 4.0)   0.4 (1.0)
Fft         – (–)     0.7 (1.1)  15.7 (23.0)    5.5 ( 8.0)    3.0 ( 4.3)   3.5 ( 5.2)   0.7 (1.0)
Viterbi   2.6 (5.7)   1.1 (2.5)   1.3 ( 3.0)    3.8 ( 8.4)    1.9 ( 4.2)   2.6 ( 5.6)   0.4 (1.0)
Average       (4.6)       (2.0)       ( 7.5)        ( 8.0)        ( 5.8)       ( 7.8)       (1.0)
Table 4.7: Static code size comparison for the telecommunications benchmarks. Code sizes are reported in KBytes. Next to each code size, we present in parentheses its ratio to the code size for the DSP architecture by Analog Devices (ADI); a larger ratio means bigger code. We differentiate between code generated with direct compilation (cc), with significant restructuring of the C code (cc-opt), and with assembly tuning of important kernels (as). The code sizes for architectures other than VIRAM are those from official EEMBC reports [Lev00]. The TMS320C6 VLIW architecture includes several features of DSP architectures [Tru97].
4.5.4 Basic Block Size

Table 4.8 presents the average basic block size for the multimedia benchmarks. A basic block is a sequence of instructions that does not include branches, jumps, or other instructions for control transfer [HP02]. The average basic block size ranges from 4 to 29 instructions, with an average of 14.0. Previous studies for scalar architectures have reported basic block sizes for consumer and telecommunications benchmarks between 4 and 7 [GRE+01]. There are two reasons for this difference. Vectorization removes the loops for routines with short vectors, such as 2-D DCT or small reductions. Hence, the vector code includes fewer branches. In addition, vectorization introduces extra coprocessor instructions in each basic block for moving scalar values between the scalar and vector components of a vector processor.

However, the most interesting point of Table 4.8 is the average basic block size in terms of operations. It represents the average amount of work between two branch instructions. The number of operations per basic block ranges from 35 for benchmarks with short vector lengths (Cjpeg and Djpeg) to 1666 for kernels that process long vectors. The average size is 502.1 operations. The large basic block size implies that the latency of resolving branches is not a critical performance issue for vector processors. Even if it takes 10 cycles to determine whether a branch is taken in a long vector pipeline, the cost is amortized over hundreds of vector operations and becomes negligible. Hence, a vector processor can use a long pipeline and does not require hardware such as branch predictors or branch target buffers.

On the other hand, a processor that implements a scalar architecture has a very small number of instructions (operations) between every two branches. Therefore, it must use a rather short pipeline and include sophisticated branch prediction logic in order to keep the branch latency and its impact on performance low. Superscalar implementations exacerbate the problem because they can execute the few instructions in each basic block faster.
           Basic Block       Basic Block      Op/Inst
           in Instructions   in Operations    Ratio
Rgb2cmyk   18.0              1666.3            92.6
Rgb2yiq    28.9              1161.1            40.2
Filter      4.0               533.3           133.3
Cjpeg      12.4                35.3             2.8
Djpeg      14.8                45.4             3.1
Autocor     7.6               114.6            15.1
Convenc     3.9               128.9            33.1
Bital      10.7               314.5            29.4
Fft        24.8               903.5            36.4
Viterbi    15.0               117.7             7.8
Average    14.0               502.1            39.4
Table 4.8: The average basic block size for the multimedia benchmarks in VIRAM. The first column presents the basic block size in terms of instructions. The second column presents the basic block size in terms of operations. The last column presents the ratio of the basic block size in operations to the size in instructions.
4.5.5 Comparison to SIMD Extensions
Most commercial versions of RISC, CISC, and VLIW architectures have introduced SIMD extensions for handling the data-level parallelism in multimedia benchmarks [Ric96, SRD96, Phi98, Int00]. Even though SIMD extensions are similar to vector instructions, there are significant differences that limit the benefits of SIMD. SIMD extensions define short vectors of narrow data within the scalar registers of the corresponding architectures and a set of instructions to perform arithmetic operations on them. With 64 to 128 bits per register, a SIMD instruction can operate on up to 4 32-bit elements or up to 8 16-bit elements [SS01b]. However, Table 4.3 (page 30) shows that the multimedia benchmarks have significantly higher degrees of data-level parallelism. Even for Cjpeg and Djpeg, vectors of 13 32-bit elements are available. By capturing only a small percentage of the available data-level parallelism with each instruction, a processor with SIMD extensions must execute a larger number of instructions than a vector processor. Consequently, the processor with SIMD extensions requires higher instruction issue bandwidth and consumes more power. In addition, the basic block size with SIMD extensions is significantly smaller than that with vector instructions, hence sophisticated branch resolution support is also necessary.

An additional shortcoming of SIMD extensions is the lack of support for accessing vectors in memory. A compiler for SIMD must use scalar load and store instructions to emulate the functionality of unit-stride, strided, and indexed vector accesses. To handle stride, indexing, and alignment issues, a variety of pack, unpack, rotate, and merge instructions are also necessary. The overhead of these instructions can often cancel any performance or code size benefits from using SIMD arithmetic. On the other hand, the load and store instructions in a vector processor handle strides, indexing, and alignment in hardware. They also encapsulate the memory access pattern and allow efficient implementation of prefetching hardware.
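To make the memory-access overhead concrete, the following minimal C sketch (our own illustration; the function and its names are not taken from any SIMD toolchain) shows the gather loop a compiler must emit before contiguous SIMD loads can be applied to strided data, work that a vector processor performs with a single strided load instruction:

    #include <stddef.h>
    #include <stdint.h>

    /* Emulating a strided vector load on a SIMD ISA that supports only
     * contiguous loads: every element costs a scalar load plus a store
     * into a staging buffer. A vector ISA performs the same access with
     * one strided load instruction and no staging buffer. */
    void gather_strided_u16(const uint16_t *src, size_t stride_elems,
                            uint16_t *dst, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i * stride_elems];  /* one scalar load per element */
        /* dst is now contiguous; 128-bit SIMD loads can consume it,
         * eight 16-bit elements at a time. */
    }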
4.6 Evaluation of the VIRAM ISA Decisions
In this section, we use the experience from the EEMBC benchmarks to revisit some of the basic decisions in the VIRAM instruction set and discuss their effectiveness. It is likely that some of the following conclusions would be somewhat different with another benchmark set. However, we consider the consumer and telecommunications benchmarks in the EEMBC suite to be representative of multimedia applications, hence it is meaningful to draw conclusions from them.
The Use of the Coprocessor Model
Defining VIRAM as a coprocessor extension instead of a stand-alone instruction set introduces some overhead due to the instructions that move data between the scalar register files in the MIPS core and the scalar register files in the vector unit. For the EEMBC benchmarks, however, the additional move instructions account for less than 6% of the static code size and less than 2.5% of the dynamic operation count. Hence, the use of the coprocessor model does not have a significant effect on code size or benchmark performance.
Specifying Data Width with the VPW Register
The use of the VPW control register for specifying the data width of vector operations reduces the number of opcodes necessary for vector instructions by a factor of three when compared to an ISA that encodes the data width in each instruction word. In addition, the overhead of setting the VPW register with coprocessor move instructions is negligible. In none of the EEMBC benchmarks does a single loop require more than one data width for its vector instructions. Therefore, setting VPW once per loop does not create the potential for wasting performance or energy in processor implementations of the VIRAM architecture.
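A minimal functional model of this encoding decision (our own sketch with simplified semantics, assuming a little-endian host; it is not the exact VIRAM definition) shows how a single vadd opcode serves every width once VPW is set at the top of a loop:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef enum { VPW16 = 2, VPW32 = 4, VPW64 = 8 } vpw_t;  /* bytes/element */

    static vpw_t vpw = VPW32;  /* written once per loop by a coprocessor move */

    /* One vadd opcode: the element width comes from VPW, not the opcode. */
    void vadd(void *vd, const void *vs, const void *vt, size_t vl)
    {
        for (size_t i = 0; i < vl; i++) {
            uint64_t a = 0, b = 0, r;
            memcpy(&a, (const uint8_t *)vs + i * vpw, vpw);
            memcpy(&b, (const uint8_t *)vt + i * vpw, vpw);
            r = a + b;  /* storing the low bytes wraps per width (little-endian) */
            memcpy((uint8_t *)vd + i * vpw, &r, vpw);
        }
    }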
The Use of Vector Registers
VIRAM defines 32 vector registers. However, only two EEMBC benchmarks, Cjpeg and Djpeg, reference more than 15 vector registers in their code. Hence, with proper use of the valid and dirty bits, we have to save less than half of the vector architecture state during context switches for most vector applications. In addition, the functions for DCT transformations in Cjpeg and Djpeg, which reference 25 vector registers, use rather short vectors of 13 32-bit elements on average. If an implementation can track the maximum vector length used by an application, we can significantly reduce the amount of vector state involved in context switches for these two benchmarks by saving only the valid elements in each register. Therefore, despite the large amount of vector state defined in VIRAM, the actual overhead for context switches of vector applications can be kept low.
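The following sketch (the data structures are our own assumptions, not the VIRAM-1 hardware interface) illustrates how an operating system could use the dirty bits and the tracked maximum vector length to bound the state saved on a context switch:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define VREGS      32
    #define VREG_BYTES 256   /* 8-KByte register file / 32 vector registers */

    typedef struct {
        uint8_t  data[VREGS][VREG_BYTES];
        uint32_t dirty;        /* one bit per register, set on first write */
        size_t   max_vl_bytes; /* max vector length the application used   */
    } vstate_t;

    /* Save only registers that were written, and only their live elements. */
    size_t save_vector_state(const vstate_t *v, uint8_t *buf)
    {
        size_t off = 0;
        for (unsigned r = 0; r < VREGS; r++) {
            if (!(v->dirty & (1u << r)))
                continue;                  /* clean register: skip it */
            memcpy(buf + off, v->data[r], v->max_vl_bytes);
            off += v->max_vl_bytes;
        }
        return off;                        /* bytes actually saved */
    }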
The Use of Flag Registers
Only three of the EEMBC benchmarks make use of the support for conditional execution available in the VIRAM ISA (Cjpeg, Bital, and Viterbi). None of these programs references more than 4 flag registers at any time. Thus, the 16 flag registers defined in VIRAM are more than sufficient. In addition, the three benchmarks use 3 or fewer distinct sets of masks per loop iteration for conditional execution of arithmetic and memory instructions, including the fully set mask for unconditional execution. Thus, the use of a single bit in each VIRAM opcode to identify one of two default flag registers as the source of masks for conditional execution is a reasonable compromise between efficient use of opcode space and ease of use. The overhead of moving up to three sets of masks in and out of the two default flag registers is small.
Shuffling vs. Permutation Instructions

None of the ten benchmarks uses the traditional instructions for shuffling elements in a vector register (insert, extract, compress, and expand). Hence, from the EEMBC benchmarks alone, one could conclude that the shuffling instructions are not necessary in a vector architecture for multimedia. On the other hand, the simpler permutation instructions of VIRAM were instrumental in vectorizing the reductions in Autocor and Bital, and the butterflies in Fft.
Address Post-increment for Load-Store Instructions

All ten benchmarks use the address post-increment capability of vector load and store instructions for most of their memory accesses. We could instead increment the base address register for vector instructions using coprocessor moves and scalar add instructions. However, specifying the increment operation with each vector load and store instruction reduces static code size by up to 20% for some of the smaller benchmarks and helps minimize the overhead of communication over the coprocessor interface. In addition, address post-increment is easy for the compiler to use.
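As a rough C rendering of the pattern (our own example; the pointer bumps stand in for the post-increment side effect, and the strip size of 64 elements is an assumption for 32-bit data):

    #include <stddef.h>

    /* Strip-mined loop as a vectorizing compiler emits it. With address
     * post-increment, each vector load/store also advances its base
     * register (modeled here by src += vl; dst += vl), so no separate
     * scalar adds and coprocessor moves are needed per strip. */
    void scale_i32(const int *src, int *dst, size_t n, int k)
    {
        const size_t mvl = 64;                     /* elements per register */
        for (size_t left = n; left > 0; ) {
            size_t vl = left < mvl ? left : mvl;   /* set vector length */
            for (size_t i = 0; i < vl; i++)        /* one load-multiply-store */
                dst[i] = src[i] * k;
            src += vl; dst += vl;                  /* post-increment effect */
            left -= vl;
        }
    }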
4.7 Related Work
Although the majority of work on vector architectures has focused on supercomputing environments, certain research groups have investigated their advantages for multimedia applications. The T0 extension to the MIPS-II architecture was one of the first vector instruction sets for multimedia [AJ97], and it has been a strong influence on the VIRAM architecture. Developed for speech processing applications, T0 provides vector instructions for fixed-point operations on 32-bit data and supports conditional execution using conditional merges. It does not support multiple data sizes, element permutations, or running an operating system. The T0 vector microprocessor implemented the T0 instruction set [WAK+96]. Lee extended the T0 architecture to support vector floating-point operations for a set of architectural and compiler studies using simulations [SL99].

The popularity of multimedia applications has also motivated researchers to support them within instruction sets other than vector. SIMD extensions to existing architectures and streaming instruction sets are the most prominent approaches.

Virtually all RISC, CISC, and VLIW architectures include SIMD extensions for exploiting sub-word parallelism in multimedia programs [Ric96, SRD96, Phi98, Int00]. Several commercial processors have implemented the corresponding extensions. Even though there are great differences in their features, the trend is towards introducing new wide register files for multimedia data and supporting an ever-increasing variety of operations for both integer and floating-point arithmetic. We discussed the general disadvantages of SIMD extensions when compared to vector architectures in Section 4.5.5. Due to their shortcomings and the frequent updates to their features, few compilers can generate code for SIMD extensions. In addition, several studies have demonstrated that simple vector processors can outperform complicated out-of-order designs with SIMD extensions for multimedia by factors larger than 3 [SL99, Asa98].

The MOM matrix instruction set is a proposal for a SIMD variant that supports operations on fixed-size, two-dimensional arrays of narrow numbers [CVE99]. It was motivated by video processing algorithms, where motion estimation operates on 8 × 8 or 16 × 16 arrays of pixels. MOM inherits all the deficiencies of SIMD extensions and has limited use outside video processing. It has never been implemented in a commercial or research chip.

The Imagine stream architecture defines a two-level register file hierarchy with the lower level associated with each functional unit [RDK+98]. Microcoded instructions configure the operation executed in each unit and the way functional units are connected. A program executes by streaming data through the processor and having each unit apply its operation on them. Imagine
provides a programmer with flexible control over a large number of register and execution resources at the cost of complicated programming. To generate code for a stream processor, a compiler would likely adopt a vector or SIMD approach for modeling the parallelism in the application, and then map the vector or SIMD operations onto the stream hardware. Stream architectures are practical for specialized accelerators with a separate memory system, but provide limited support for general-purpose environments running an operating system.
4.8 Summary
In this chapter, we presented the VIRAM instruction set for multimedia processing on general-purpose systems. Vector instructions already provide a compact way to express data-level parallelism. The additional features required for multimedia applications are support for narrow data types and fixed-point numbers, instructions for permutations of vector elements, and a method for conditional execution of element operations. To simplify software development and use with complex operating systems, the instruction set includes support for virtual memory and semantics that allow fast context switches.

The instruction-level analysis demonstrates that the degree of vectorization for the multimedia benchmarks with the VIRAM architecture exceeds 90% in most cases. Even for applications that are partially vectorized and include short vectors, such as Cjpeg and Viterbi, the degree of vectorization is higher than 65%. The code size for VIRAM executables is comparable to that of the most compact architectural approach in both consumer and telecommunications benchmarks. VIRAM code is 2 times smaller than code for RISC architectures and up to 10 times smaller than code for VLIW processors.

In the next two chapters, we will discuss two microarchitectures that implement the VIRAM instruction set. Chapter 5 presents a simple microarchitecture with wide functional units, which uses deep pipelining and embedded DRAM technology. Chapter 6 introduces a composite and decoupled organization, which is the basis of a scalable family of vector processors.
Chapter 5
The Microarchitecture of the VIRAM-1 Processor
“What we have to learn, we learn by doing it.” Aristotle
A high-end microprocessor typically relies on high-frequency operation in order to deliver high performance. Consequently, it consumes a large amount of power. Moreover, it requires hundreds of engineers for a period of three to five years for development, verification, and testing. This chapter presents the microarchitecture and implementation of VIRAM-1, a processor that exploits the explicit parallelism expressed by vector instructions to provide high performance at low energy consumption and reduced design complexity.

Section 5.1 highlights the goals of the VIRAM-1 development. In Section 5.2, we present the microarchitecture in detail, focusing mostly on the vector and memory hardware. Section 5.3 describes the implementation methodology that allowed a group of six graduate students to design VIRAM-1 within an academic environment¹. Section 5.4 presents a performance analysis of VIRAM-1 for the multimedia benchmarks. Finally, in Section 5.5 we discuss the lessons learned from the VIRAM-1 development, its basic advantages, and some further challenges we need to address.
5.1 Project Background
The motivation for implementing VIRAM-1 was to gain practical experience and realistic insights into developing vector microprocessors for the emerging domain of multimedia processing in embedded systems. As explained in Chapter 2, the characteristics of this application domain are significantly different from those of desktop and server systems. Therefore, we embarked on a full processor implementation in order to provide a convincing proof of concept and expose the whole spectrum of issues in microarchitecture and design methodology. In addition, a working processor prototype provides an attractive platform for compiler and software development, which are both necessary for a complete system demonstration.

The design objectives for VIRAM-1 were directly derived from the project motivation discussed in Chapters 1 and 2. We aimed at developing a vector processor that provides high performance for multimedia tasks at low energy consumption. To simplify the software development of real-time applications, we wanted performance to be predictable and not to rely on probabilistic
¹The development of VIRAM-1 was joint work with Joseph Gebis and Samuel Williams, under the guidance of David Patterson and Katherine Yelick at U.C. Berkeley. Hiroyuki Hamasaki, Ioannis Mavroidis, and Iakovos Mavroidis also made significant contributions to the VIRAM-1 design.
techniques such as caching and speculation. An equally important goal was to develop a simple and scalable design. Unlike superscalar hardware, the vector processor should be easy to build with a small group of designers, and scaling to the next-generation implementation should be an incremental effort, not a complete redesign. Finally, by utilizing the mature compiler technology for vector architectures, we aimed at developing an efficient processor with a software development model based on high-level languages and automatic compilation.

To maintain the project focus on the most important goals and simplify the design task, certain potential features became secondary or were precluded. The performance and energy characteristics of VIRAM-1 rely on architectural and microarchitectural features and not on specialized circuitry for high clock frequency or low power consumption. We did not engage in advanced circuit development because of its complexity and the fact that any performance or power benefits it introduces are orthogonal to those achieved with architectural methods. We also focused on single-processor systems, providing no special support for shared memory or message-passing in a multiprocessor environment. Finally, we did not integrate a large number of IO interfaces on the same die with the processor. Even though most embedded processors include a variety of interfaces, pushing their functionality to an external chip-set allowed for faster prototyping. A future, commercial implementation of the VIRAM-1 microarchitecture could reverse any of these choices with little impact on the basic characteristics reported in this thesis.
5.2 The VIRAM-1 Organization
The VIRAM-1 microarchitecture [Koz99, KGM+00] relies on two basic technologies: vector processing and embedded DRAM. Each technology contributes a set of complementary features towards meeting the overall design goals.

The vector architecture allows for high performance for multimedia applications by executing multiple element operations in parallel. The control logic for parallel execution has reduced complexity and energy overhead because element operations within a vector instruction are independent by definition. In Section 5.3, we also present a modular implementation for vector hardware, which reduces design complexity and improves scalability.

Embedded DRAM technology enables the integration of a large amount of DRAM memory on the same die with the processor. It provides the high-bandwidth memory system required for a vector processor [ST92] in a cost-effective way. Vector memory instructions can hide long latencies for element transfers by utilizing the high bandwidth for sequential and random accesses. The high density of embedded DRAM reduces the energy consumption in the memory system by decreasing or eliminating the number of transfers that must access off-chip memory through high-capacitance board busses. The embedded DRAM memory system can also match the modularity of the vector processor if organized as a collection of independent banks.

Figure 5.1 shows the overall block diagram for VIRAM-1 with its four basic components: the MIPS scalar core, the vector coprocessor, the embedded DRAM main memory, and the external IO interface. We discuss each component in detail in the subsequent subsections.
5.2.1 Scalar Core Organization

VIRAM-1 uses the m5Kc scalar core by MIPS Technologies [Hal99a]. It is a single-issue, in-order core with a six-stage pipeline, which implements the MIPS-64 architecture [MIP01]. It contains 8-KByte, first-level instruction and data caches and a 32-entry TLB. It also includes a coprocessor interface, to which we have attached a floating-point unit for scalar operations on single-precision numbers.

The coprocessor interface is also the only way the scalar core connects to the vector hardware. Consequently, vector instructions cannot access scalar registers in the MIPS core, and additional scalar register files are necessary in the vector hardware, as we discussed in Chapter 4. In addition, the two components cannot share the TLB or the logic for memory instructions. Apart from saving die area, sharing these resources would also simplify exception processing and memory consistency for scalar and vector accesses. However, the clear separation of vector and scalar hardware allowed us to start developing the vector hardware long before selecting the specific scalar core. It also allowed independent verification and testing of the two components, which was simpler and faster.
[Figure 5.1 diagram: the MIPS64 m5Kc core (8-KByte instruction and data caches, TLB, FPU) connects through the coprocessor interface (CP IF) to the vector unit, which contains the vector register file (8KB), the flag register file (256 B), arithmetic units 0 and 1, the flag unit, and the memory unit with its own TLB. A memory crossbar with DMA and JTAG support links the processor to eight 1.6-MByte DRAM banks and to an external chipset.]
Figure 5.1: The block diagram of the VIRAM-1 vector processor.
5.2.2 Vector Coprocessor Organization
The vector hardware executes the VIRAM instructions and connects to the MIPS core as coprocessor 2. An instruction queue decouples the vector coprocessor from the scalar core, allowing vector execution to proceed even when the scalar core stalls due to cache misses or external interrupts.

Figure 5.1 also shows the internal organization of the vector coprocessor. There are four functional units: two for arithmetic instructions, one for flag operations, and one for vector load and store accesses. Arithmetic unit 0 can execute integer, fixed-point, and single-precision floating-point operations. Due to area restrictions, arithmetic unit 1 has neither floating-point hardware nor an integer multiplier. However, arithmetic unit 1 implements the element permutation instructions. Each arithmetic unit includes four 64-bit partitioned datapaths, which allows the execution of four 64-bit, eight 32-bit, or sixteen 16-bit element operations in parallel. In other words, for narrower data types (smaller VPW), VIRAM-1 can sustain a larger number of operations per cycle.

The memory unit handles vector and flag load or store instructions. On every clock cycle, it can exchange up to 256 data bits with the memory system. It generates up to four memory addresses per cycle for strided and indexed accesses. For sequential accesses, a single address is sufficient for a 256-bit transfer. The memory unit contains a 32-entry TLB, similar to the one in the MIPS core. Even though the TLB itself has a single access port, it can translate four addresses in parallel using a four-entry, four-ported micro-TLB, which provides caching for the most frequently accessed TLB entries. The micro-TLB is controlled by hardware, while software manages any misses or exceptions in the main TLB.
[Figure 5.2 diagram: pipeline models (A) and (B). Scalar pipe: F D R X M W. Vector load pipe: AG, AT, DRAM access latency, VW. Vector arithmetic pipe: VR, VX, VW, padded with delay stages in (B). Vector store pipe: AG, AT, VR, with delay stages in (B). Legend: AG = address generation stage, AT = address translation stage, VR = vector register file read stage, VX = vector operation execute stage, VW = vector register file write stage. The accompanying timing diagrams show two iterations of a loop (vld, vadd, vmul): with the simple pipeline (A), each vadd stalls for the memory latency of the preceding vld; with the delayed pipeline (B), the instructions of each iteration issue back to back.]
Figure 5.2: The simple and delayed pipeline models for a vector processor and the corresponding timings for executing two iterations of a simple vector loop. With the simple model (A), dependencies between memory and arithmetic instructions within a loop lead to long stall periods on every iteration. The delayed pipeline (B) introduces idle stages in order to match the pipeline length for arithmetic instructions to that for worst-case memory accesses. In other words, we delay the execution of arithmetic operations until the necessary data arrive from memory, without stalling the pipeline.
The 8-KByte vector register file is at the heart of the coprocessor. It can store 32 64-bit, 64 32-bit, or 128 16-bit elements per vector register. The register file would need seven read and three write ports to provide operands to the arithmetic and memory units. To reduce the area and energy consumption of the register file by approximately 30%, we used two SRAM banks, each with four read and three write ports. The first bank stores all the odd-numbered 64-bit elements of each vector, with the second bank holding all the even-numbered elements. When executing a vector instruction, the functional units alternate between the two banks on every cycle. If all units try to access the same bank on the same cycle, we need to stall one of them for one cycle in order to restore conflict-free access to the register file.
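A two-line C sketch of the element-to-bank mapping (which bank we label as holding the odd elements is our own choice of convention):

    /* Interleaved mapping of a vector register's 64-bit elements onto the
     * two SRAM banks: consecutive elements alternate banks, so a unit that
     * streams through a register touches a different bank each cycle. */
    static inline unsigned bank_of(unsigned elem) { return elem & 1u; } /* parity */
    static inline unsigned row_of(unsigned elem)  { return elem >> 1; } /* row in bank */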
5.2.3 Vector Pipeline

VIRAM-1 executes vector operations in a pipelined manner, without any additional startup overhead for new instructions. On every clock cycle, the control logic pushes a set of element operations down the pipeline of the proper functional unit. The number of element operations processed concurrently may be 4, 8, or 16, depending on the virtual processor width used at the time.

The main challenge for the pipeline structure is to tolerate long latencies for load and store accesses to DRAM memory. A random access to embedded DRAM is pipelined but requires up to eight clock cycles. Without special support, the pipeline would have to stall for every arithmetic operation dependent on a memory access. Figure 5.2 presents the delayed pipeline model used in VIRAM-1 to tolerate DRAM latency. It introduces sufficient pipeline stages to handle the worst-case memory latency and pads the arithmetic pipelines so that operations execute after any previously issued loads have fetched the necessary data. Consequently, dependent load and arithmetic operations can issue to the pipeline back to back without any stall cycles. The pipeline length for all functional units in VIRAM-1 is 15 stages.

An alternative organization for tolerating memory latency is the decoupled pipeline [EV96]. It uses data and instruction queues to allow memory accesses to execute as soon as possible and eliminate unnecessary pipeline stalls. Although the decoupled pipeline can tolerate longer latencies than the delayed one, we chose the latter for VIRAM-1. The delayed pipeline is simple to implement, has no power and area overhead for queues, and can handle the moderate latency of embedded DRAM. In Chapter 7, we introduce a vector microarchitecture that implements the decoupled pipeline without data queues.
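A back-of-the-envelope model (our own, with an assumed 8-cycle worst-case DRAM latency and the three-instruction loop body of Figure 5.2) contrasts the two pipelines:

    #include <stdio.h>

    int main(void)
    {
        const int mem_latency = 8;   /* assumed worst-case DRAM access, cycles */
        const int iters       = 100; /* iterations of the loop: vld; vadd; vmul */
        const int insts       = 3;

        /* Simple pipeline: the dependent vadd stalls every iteration until
         * the vld data returns from DRAM. */
        int simple = iters * (insts + mem_latency);

        /* Delayed pipeline: delay stages pad the arithmetic pipes, so one
         * instruction issues per cycle; the latency is paid once while the
         * deeper pipeline fills. */
        int delayed = iters * insts + mem_latency;

        printf("simple: %d cycles, delayed: %d cycles\n", simple, delayed);
        return 0;
    }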
5.2.4 Chaining Control

Chaining is the vector equivalent of pipeline forwarding [HP02]. It allows dependent vector instructions to overlap their execution by forwarding the results of element operations before the whole instruction completes. As long as dependencies between vector elements are preserved, dependencies between vector registers are relaxed. VIRAM-1 allows instructions executing in different functional units to chain for all three types of dependencies on vector registers: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW). The vector register file implements the result forwarding without any additional busses. For a RAW case, the second instruction can read a vector element in the same cycle the previous instruction writes it to the register file.

The implementation of chaining is simple with the delayed pipeline, because it requires chaining decisions to be made only once per instruction. Before issuing the first set of element operations for a new instruction, the control logic checks the first few stages in the pipeline of all functional units for data dependencies with previously issued instructions. The check is distributed, with every pipeline stage checking for dependencies between the first element operation of the new instruction and the element operations it holds. Once the first set of element operations can proceed, it is safe to issue the remaining operations in the following clock cycles without checking, because the pipeline is in-order and preserves any dependencies between the two instructions. Chaining is not allowed for the four shuffling instructions, because of the complex logic required to determine dependencies between their element operations. On the other hand, it is easy to support chaining for the permutation instructions, for which the dependency checks are simple due to their regularity.
5.2.5 Memory System and IO

The memory system of VIRAM-1 consists of 13 MBytes of embedded DRAM. It serves as main memory, not a cache, for both vector and scalar accesses. There is no SRAM cache between the vector coprocessor and DRAM.

We organized the memory system as eight independent DRAM banks. Multiple banks allow overlapping of transfers for indexed and strided accesses, but introduce some area overhead for control logic. Eight banks was a reasonable compromise between performance and area efficiency. The bank design available to us is a monolithic array of 6,656 rows with eight 256-bit columns per row.
A column access to a previously open row can occur every one and a half cycles. A random access that requires opening a new row can take place every five processor cycles. The bank interface includes separate 256-bit buses for input and output data.

Alternatively, we could organize each DRAM bank as a collection of sub-banks sharing a single data and control interface. Multiple sub-banks allow overlapping of accesses to different rows within one bank in a pipelined manner, reducing the number of stalls due to bank conflicts. Unfortunately, the bank design available for VIRAM-1 did not include sub-banks, resulting in a performance penalty for applications with indexed and strided accesses [KJG+01]. We will discuss and evaluate sub-banks and other options for organizing an embedded DRAM memory system in Chapter 8.

A custom crossbar connects the memory system to the vector and scalar hardware. It can transfer 256 data bits per direction, load or store, to the eight DRAM banks and issue up to four memory addresses per cycle. We did not use a simpler bus structure because it would be difficult to support many concurrent transfers and the large number of configurations required for indexed and strided accesses. In addition, a bus is difficult to scale up to eight or more banks. On the other hand, because the memory and the processor are on the same die, it is easy to provide the wiring and switching resources necessary for the crossbar.

VIRAM-1 has a simple input-output interface that includes a system bus and a DMA engine. The system bus uses the SysAD protocol [Hei94] and provides connectivity to chipsets for 64-bit MIPS processors. The chipsets allow VIRAM-1 to connect to off-chip memory and a variety of peripheral devices such as audio and video interfaces, hard disks, and networking chips. The DMA engine includes two channels that can transfer data between the on-chip memory and external devices without occupying the processor. The DMA capability is important for many multimedia tasks because it allows fetching the next set of audio samples or video frames to the on-chip memory while the application processes the current set [KMK01].
5.2.6 System Support

Virtual memory exceptions on VIRAM-1 are imprecise but restartable. On a vector memory fault, the delayed pipeline stalls and all stages maintain their state. The scalar core receives the exception and invokes the proper handler, which can service the exception through a series of instructions that manipulate the vector TLB. When the exception condition has been removed, the pipeline is released and vector execution continues from the point where it was interrupted. If the handler decides to preempt the application currently running, it must also save the state of the first three stages of the delayed pipeline. When the application resumes, the handler must restore the pipeline state during the context switch. The original implementation of the Alpha architecture also used imprecise exceptions with saving and restoring of microarchitecture state for floating-point operations [Sit92, Dig96].

As discussed in Chapter 4, VIRAM-1 maintains dirty and valid bits for vector registers and tracks the maximum vector length used by an application in order to reduce the context switch time. It also allows marking the vector coprocessor unusable after a context switch in order to avoid saving and restoring vector state when switching to short handlers or applications without vector instructions. Moreover, the instruction queue that decouples the scalar core from the coprocessor allows the vector hardware to proceed with the instructions it has received while the MIPS core executes an interrupt handler. Because vector instructions specify a large number of element operations, small IO handlers can run on the scalar core without underutilizing the vector coprocessor.

Special support is also available for maintaining memory consistency between scalar and vector memory accesses. Because the two components have separate paths to memory, a scalar access can reach DRAM before a previously issued vector access to the same address. To assist compilers with preventing errors due to such cases, VIRAM-1 provides a set of synchronization instructions that can stall the whole processor until all currently issued accesses have completed.
[Figure 5.3 diagram: four identical vector lanes (lane 0 to lane 3). Each lane contains one 64-bit datapath from arithmetic unit 0, one from arithmetic unit 1, a partition of the vector register file elements, flag elements with their datapath, and a memory interface; the arithmetic units, flag unit, vector and flag register files, and memory unit thus span all four lanes.]
Figure 5.3: The vector datapath and register file resources of VIRAM-1 organized in four vector lanes. Functional units span across the design by contributing one 64-bit datapath to each lane. The elements of each vector register are assigned to lanes in an interleaved fashion. Two 64-bit, inter-lane busses are necessary to implement the permutation instructions for reductions and FFTs.
To reduce the overhead, we support variations that wait only for a specific type of access to complete (vector or scalar, load or store). On vector stores, we also check the data cache of the scalar core for data from the same address and invalidate them if present. To avoid unnecessary cache stalls due to invalidation traffic, we maintain a duplicate set of tags for the invalidation checks.
5.3 The VIRAM-1 Implementation
VIRAM-1 includes a large amount of hardware resources, which typically implies high design complexity and long interconnects within the chip. In this section, we present the design approach and methodology that organizes the hardware in a modular fashion and leads to locality of interconnect.
5.3.1 Scalable Design Using Vector Lanes

Figure 5.3 presents the implementation of the vector coprocessor using four parallel lanes. Each lane contains a 64-bit datapath from each functional unit, a 64-bit interface to the memory crossbar, and a vertical partition of the two banks of the vector register file. All lanes are identical and receive the same control on every cycle. To execute a vector instruction, the datapaths in each lane operate on the elements stored in the local partition of the register file. Therefore, most vector instructions can execute without any data communication between lanes.

The use of lanes makes the vector coprocessor modular. We only need to design and verify a single 64-bit block and replicate it four times, which is simpler than implementing a separate 256-bit block for each functional unit and interconnecting them later. We can also scale the performance, area, and energy consumption of the coprocessor without significant redesign by allocating the proper number of lanes. A single design database can produce a number of coprocessor implementations with a large or small number of lanes. All implementations are balanced in terms of hardware resources, as each lane contributes both execution datapaths and storage for vector elements.

Lanes also allow us to trade off die area for reduced power consumption [KJG+01, BCS93]. Increasing the number of lanes allows for a proportional decrease in the clock frequency and power supply without changing the peak computational throughput of the system. Because dynamic power consumption is a quadratic function of the power supply voltage but only a linear function of the switching capacitance of the circuits and the clock frequency, the new processor with more lanes and a lower clock frequency consumes less power than the original design. Hence, we can exploit the shrinking geometries in CMOS technology to introduce more lanes and reduce power consumption without hurting performance.

Furthermore, the ability to integrate additional lanes with future semiconductor processes does not create long interconnect wires that could limit performance. Excluding memory and permutation instructions, all other instructions require no inter-lane communication during their execution. Most wires are limited within each lane and are short, regardless of the size of the overall processor. Hence, the increasing delay of cross-chip wires is not a significant scaling problem for the vector coprocessor. For the long wires needed for memory transfers and permutations, pipelining is an appropriate solution, because a vector processor can tolerate latency as long as high bandwidth is available.
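As a first-order check of the lane/power argument (an illustrative derivation from the standard dynamic power model, not a measurement of VIRAM-1), consider doubling the number of lanes. Twice the lanes double the switched capacitance C but sustain the same peak throughput at half the clock frequency f, and the slower clock tolerates a reduced supply V' < V_dd:

$$P_{dyn} = \alpha\,C\,V_{dd}^{2}\,f, \qquad P' = \alpha\,(2C)\,{V'}^{2}\,\frac{f}{2} = \alpha\,C\,{V'}^{2}\,f = P_{dyn}\left(\frac{V'}{V_{dd}}\right)^{2}.$$

For example, a supply reduction to 0.8 V_dd cuts dynamic power by roughly 36% at unchanged peak operation throughput.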
Technology           IBM SA-27E 0.18µm CMOS process,
                     6 metal layers (copper), deep trench DRAM cell (0.56µm²)
Area                 318.5 mm² (17.5mm × 18.2mm)
Transistors          120 million (7.5M logic, 112.5M DRAM)
Clock Frequency      200 MHz
Power Supply         1.2V logic, 1.8V DRAM, 3.3V IO
Power Consumption    2 Watts average (vector 1 W, scalar 0.5 W, DRAM 0.5 W)
Package              304-pin quad ceramic package
Peak Performance     Integer: 1.6/3.2/6.4 Gops/second (64-bit/32-bit/16-bit)
                     Fixed-point: 2.4/4.8/9.6 Gops/second (64-bit/32-bit/16-bit)
                     Floating-point: 1.6 Gflops/second (32-bit)
Table 5.1: The chip statistics for VIRAM-1. Peak performance for integer and fixed-point operations indicates the maximum operation throughput for the three supported virtual processor widths (64-bit, 32-bit, 16-bit). Fixed-point performance ratings assume two operations per multiply-add. One Gop and one Gflop are 10⁹ integer and floating-point operations, respectively.
5.3.2 Design Statistics

We designed VIRAM-1 in a 0.18µm bulk CMOS process with embedded DRAM. Table 5.1 summarizes its design statistics. The die occupies 17.5mm × 18.2mm and includes approximately 120 million transistors. It is designed to operate at 200 MHz, a conservative clock frequency for this technology, in order to simplify circuit design and decrease energy and power consumption. However, it can support up to 6.4 billion operations per second for 16-bit integer numbers by executing element operations in parallel on the four vector lanes. The average power consumption is 2 Watts, but the use of low-power circuitry and a reduced power supply for the DRAM could drop it to less than 1 Watt in more aggressive implementations. Nevertheless, at 3.2 billion operations per second per Watt, VIRAM-1 is an order of magnitude better in power-performance than most high-performance processors in 2002 [JYK+00, Wor01, Emb01].

Figure 5.4 presents the VIRAM-1 chip floorplan. It illustrates the overall design modularity, as the chip consists of four lanes and eight memory banks. Apart from its benefits to design complexity and scalability, modularity is also good for improving yield. A modular vector processor can include an additional vector lane or DRAM bank for redundancy. Alternatively, a chip with permanent defects on some lane or memory bank can still function and be useful as a lower-performance or lower-capacity part.

Table 5.2 presents the area breakdown for the chip. It indicates that the majority of the area implements hardware components directly controlled by software: main memory, register arrays, and execution datapaths. By issuing the proper arithmetic and memory instructions, software can control their use and even turn them off to save energy when their functionality is not necessary. In contrast, superscalar processors devote more than half of their die area to resources transparent to software, such as caches and speculation logic [PJS96].
[Figure 5.4 floorplan: four DRAM banks across the top of the die and four across the bottom, each group of four connected through its own copy of the memory crossbar to the middle row, which contains the MIPS core with IO and FPU, the four vector lanes, and the vector control block.]
Figure 5.4: The floorplan of the VIRAM-1 processor. To fit in a square die, the DRAM banks are organized in two groups of four, with a copy of the crossbar connecting each group to the scalar and vector hardware. All blocks are drawn to scale. The total die area is 318.5 mm².
5.3.3 Design Methodology
Table 5.3 presents the design methodology statistics for VIRAM-1. To reduce development time and maximize the functionality supported, VIRAM-1 includes intellectual property from four sources. The use of simple and standard interfaces, such as the MIPS coprocessor interface, was a key factor in integrating them successfully. We also tried to avoid full-custom circuit development as much as possible. We synthesized most of the logic from the RTL description and used macro compilers for the SRAM arrays in the MIPS caches and the scalar register files. Only the vector register file banks and the memory crossbar were implemented using full-custom techniques, since their function could not be synthesized efficiently. The full-custom logic accounts for 7.4% of the total die area (see Table 5.2).

We verified the functionality of the original RTL description and the synthesized circuits using a combination of directed and random tests. Directed tests allowed us to quickly correct most design errors, but random tests were necessary to remove the last few bugs and increase our overall confidence. The testing methodology and the infrastructure that allowed us to reuse test code on a number of simulators and design blocks are described in [Wil02]. Compared to superscalar processors, the verification of VIRAM-1 was much simpler due to its overall design modularity and its lack of complex logic structures for out-of-order or speculative execution.

In retrospect, it was the modular and simple microarchitecture of VIRAM-1 that made its development possible with the available manpower. It allowed us to quickly scale the design when the area budget changed. We could also easily experiment with a number of alternatives when we encountered a CAD tool bug or limitation. The small design team made it easier to communicate changes or issues and adapt to them fast.

Naturally, certain design decisions were not necessarily optimal. Eliminating all need for full-custom circuits, for example, could have saved us a lot of time and the need to struggle with a mixed synthesized-custom design flow. We could have assembled the register file partition in each vector lane from four 3-ported SRAM macro-blocks from IBM for approximately the same area. However, such a register file design would prohibit the implementation of the fixed-point multiply-add instruction. It would also have a negative impact on overall performance due to the higher probability of bank conflicts between vector instructions during their register file accesses for source and destination operands. We could also have replaced the full-custom crossbar with a simpler circular ring topology that is easier to implement using synthesized logic. Since the actual layout for both approaches is limited mostly by wire density, the area requirements should be nearly identical. However, the ring interconnect would add 4 clock cycles to the latency of every memory access. Due to the nature of the delayed pipeline, the higher memory latency could lead to a significant performance loss for applications with reductions (Autocor and Bital), short vectors (Cjpeg, Djpeg, and Viterbi), or a low ratio of arithmetic to memory operations (Rgb2cmyk and Autocor).
Component                  Area (mm²)   % Area    Design
Scalar Core                   12.25       3.8%    Synthesized
Vector Control                11.40       3.6%    Synthesized
Vector Lane (4x)               9.90      12.4%
  Arithmetic Unit 0            4.06               Synthesized
  Arithmetic Unit 1            1.08               Synthesized
  Memory Unit                  1.18               Synthesized
  Flag Unit & Registers        1.06               Synthesized
  Vector Register File         2.31               Full Custom
  Miscellaneous                0.21               Synthesized
Crossbar (2x)                  7.20       4.5%    Full Custom
FPU                            2.10       0.7%    Synthesized
IO                             1.40       0.4%    Synthesized
DRAM Bank (8x)                16.90      42.4%    Macro block
Pad Ring                      35.70      11.2%
PLL, Decoupling Capacitors,
  Guard Ring                  66.50      21.0%
Total Area                   318.50     100.0%
Table 5.2: The area breakdown for VIRAM-1. The scalar core area includes the first-level instruction and data caches. The vector control area includes the vector TLB and the replica tags for invalidations in the scalar core data cache. The "% Area" column gives the area occupied by each component as a percentage of the total die area. The last column presents the design technique used for each logic or memory block: synthesized (standard cells), full custom, or automatically generated macro blocks.
5.4 Performance Evaluation
In this section, we analyze the performance of VIRAM-1 for the multimedia benchmarks in the EEMBC suite. We also compare VIRAM-1 to a number of embedded processors that implement alternative architectures such as CISC, RISC, VLIW, and DSP.

We measured the performance of VIRAM-1 using an execution-based simulator that provides near cycle-accurate modeling of the chip microarchitecture with a variable number of lanes. The simulator models all sources of stalls except DRAM refresh. However, refresh accesses to DRAM banks are periodic events with low frequency. In addition, the memory unit in the vector coprocessor has direct control of refresh and attempts to schedule refresh accesses to each bank during idle cycles. Hence, the practical impact of DRAM refresh on performance is negligible.
Design Methodology   Synthesized: scalar core, vector control, vector datapaths, IO
                     Full-custom: vector register file, memory crossbar
                     Macro-blocks: DRAM banks, SRAMs
IP Sources           UC Berkeley (vector coprocessor, crossbar, IO)
                     MIPS Technologies (MIPS core)
                     IBM (DRAM, SRAMs)
                     MIT (original floating-point datapath design)
RTL Model            170K lines of Verilog
Verification         566K lines of directed tests (10M lines of assembly)
                     4 months of random testing on 20 Linux workstations
Design Team          3 full-time graduate students
                     3 part-time graduate students and staff
Design Time          2.5 years
Table 5.3: The design methodology statistics for VIRAM-1.
5.4.1 Performance of Compiled Code

Before we proceed with the performance evaluation, it is interesting to examine the quality of the code produced by the VIRAM compiler in terms of performance and compare it with code optimized in assembly. By taking a close look at the output of our research compiler, we can notice the following deficiencies:

1. Our compiler cannot always schedule the code within a basic block so that dependent instructions are separated by at least one independent instruction. This increases the probability that a register dependency will cause stall cycles in the in-order (statically scheduled) pipeline of VIRAM-1 (a minimal illustration of this optimization follows below).

2. The code schedule does not attempt to alternate instructions to the four functional units in VIRAM-1. This leads to an unbalanced workload for the functional units. Some of the functional units may idle for long periods despite the existence of independent instructions later in the program that they could execute.

3. Our compiler cannot move the code for calculating loop invariants out of the loop body (code motion). This causes unnecessary repetition of calculations.

4. Our compiler frequently generates spill code for kernels that use more than 8 vector registers, even if fewer than the 32 architectural registers are necessary. Spill code causes unnecessary memory traffic and increases the workload of the memory unit in VIRAM-1.

5. The MIPS scalar code quality is low both in terms of density and scheduling.

It is interesting to notice that none of the compiler shortcomings are related to vectorization, the most challenging part of generating code for a vector processor.
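The following Python sketch illustrates the scheduling optimization missing in deficiency 1. It is a toy greedy scheduler over a made-up instruction format, not the VIRAM compiler's algorithm, and it assumes the example block contains no other dependencies that would make the reorder unsafe.

    # Toy sketch of basic-block scheduling: when legal, pick a successor
    # that does not read the result of the instruction just scheduled, so
    # dependent pairs end up separated by an independent instruction. A
    # real scheduler must check all data dependencies; this sketch only
    # checks against the immediately preceding instruction, which is safe
    # for the example block below. Instructions are (name, dests, srcs).

    def schedule(block):
        scheduled, pending = [], list(block)
        while pending:
            last_dests = set(scheduled[-1][1]) if scheduled else set()
            pick = next((op for op in pending if not last_dests & set(op[2])),
                        pending[0])
            pending.remove(pick)
            scheduled.append(pick)
        return scheduled

    block = [("vld",  ["v1"], ["a0"]),
             ("vadd", ["v2"], ["v1", "v1"]),   # depends on the load above
             ("vld",  ["v3"], ["a1"]),         # independent; can move up
             ("vmul", ["v4"], ["v3", "v3"])]

    for name, dests, srcs in schedule(block):
        print(name, dests, srcs)
    # Order becomes vld, vld, vadd, vmul: no back-to-back dependent pairs.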
[Figure 5.5 chart: bars for Rgbcmyk, Rgbyiq, Filter, Cjpeg, Djpeg, Autocor, Convenc, Bital, and Viterbi; vertical axis: speedup with assembly tuning (0 to 12); values range from 1.3 to 11.1.]
Figure 5.5: The performance speedup for VIRAM-1 after manually tuning the basic kernels of each benchmark in assembly. In each case, we present the ratio of the performance achieved with the optimized code over that achieved with the original compiled code. A ratio larger than 1 implies higher performance with assembly code. We do not present the speedup for Fft because we can only vectorize this benchmark in assembly. For a comparison of code sizes for the two approaches, refer to Tables 4.6 (page 33) and 4.7 (page 34).
To evaluate the performance cost of the compiler shortcomings, we used assembly programming to remove the first four deficiencies from the basic kernels in each benchmark. Figure 5.5 presents the performance improvement for VIRAM-1 when running code tuned in assembly instead of the original compiled code. The speedup varies between 1.6 and 11, depending on the benchmark. The effect of assembly tuning is most obvious for applications with mostly unit-stride accesses, such as Filter and Convenc. For benchmarks with many strided accesses, such as Rgb2cmyk and Rgb2yiq, performance is limited by the address generation bandwidth of the memory unit in VIRAM-1, hence the effect of assembly tuning is less pronounced. The same holds for benchmarks with a large percentage of scalar operations, such as Cjpeg and Djpeg, where scalar optimizations are also necessary in order to achieve higher speedup with tuned code.

During assembly tuning, we applied code scheduling only within each basic block. We did not apply loop unrolling or software pipelining. These techniques require major code restructuring and would not yield more than 10% performance improvement. As we demonstrate in the following sub-sections, VIRAM-1 can reach high levels of performance even without such optimizations. In addition, loop unrolling and software pipelining lead to significant increases in static code size (see Section 4.5).

Overall, the quality of the code produced by the VIRAM-1 compiler is satisfactory for a research project. The compiler can handle vectorization, which is the most critical optimization for this architecture. However, the optimizations we applied during assembly tuning are straightforward and are typically available in most industrial compilers. Hence, the VIRAM-1 results with code tuned in assembly are good indications of the performance potential for commercial implementations of the VIRAM-1 microarchitecture and its compiler.
5.4.2 Performance Comparison

To better understand the performance advantages of VIRAM-1, we compare it to six embedded processors that implement alternative architectures. Table 5.4 presents their basic characteristics.
Processor   Vendor     Architecture     Clock Freq.   Issue Width   Cache Size (L1I/L1D/L2)   Power
K6-III+     AMD        CISC (x86)       550 MHz       2 (6)         32K/32K/256K              21.6 W
MPC7455     Motorola   RISC (PowerPC)   1000 MHz      4             32K/32K/256K              21.3 W
VR5000      NEC        RISC (MIPS)      250 MHz       2             32K/32K/—                 12.1 W
TM1300      Trimedia   VLIW+SIMD        166 MHz       5             32K/16K/—                 2.7 W
C6203       TI         VLIW+DSP         300 MHz       8             96K/512K/—                1.7 W
21065L      ADI        DSP              60 MHz        1             0.2K/—/68K                1.5 W
Table 5.4: The characteristics of the six embedded processors used for performance comparisons with VIRAM-1. K6-III+ and MPC7455 use out-of-order execution. K6-III+ can decode 2 CISC instructions per cycle into sequences of simpler operations and can execute up to 6 of these simpler operations per cycle. All cache sizes are in KBytes and refer to on-chip SRAMs. 21065L includes a 68-KByte SRAM that serves as both a second-level instruction cache and the primary data cache. The power consumption numbers refer to typical usage and are derived from vendor data-sheets.

The selected group includes a variety of design points not only in terms of the underlying architecture (CISC, RISC, VLIW, and DSP), but also in terms of clock frequency, issue width, power consumption, and execution style (in-order, out-of-order). However, all processors rely on SRAM caches for fast memory accesses. We retrieved the performance results for the six processors from the corresponding EEMBC reports [Lev00]. Results for the consumer benchmarks are not available for C6203 and 21065L. Results for the telecommunications benchmarks are not available for TM1300. For the two VLIW processors (TM1300 and C6203), we report separately the performance level achieved with direct compilation (cc) and after extensive optimization of the C code (opt). Similarly, for VIRAM-1 we report separately the performance with compiled code (cc) and with code optimized in assembly (as).

Figure 5.6 presents the ConsumerMark and TeleMark composite scores for VIRAM-1 and the six embedded processors. Running compiled code, VIRAM-1 outperforms all processors excluding MPC7455, a 1-GHz, 4-way superscalar, out-of-order design, and the two VLIW processors (TM1300 and C6203) when running optimized code. The use of code optimized in assembly allows VIRAM-1 to achieve scores at least 70% and 40% higher than any other processor for the consumer and telecommunications categories respectively.

Figures 5.7 and 5.8 present the individual benchmark scores. The performance advantage of VIRAM-1 is more noticeable for benchmarks with long vectors, such as Rgb2cmyk, Rgb2yiq, Filter, Autocor, Convenc, and Fft. Each vector instruction in these benchmarks can keep the parallel datapaths across the four lanes busy for multiple clock cycles, which leads to high performance. For Rgb2cmyk and Rgb2yiq, VIRAM-1 uses the four address generators in the memory unit to load or store four elements per cycle for strided accesses. Despite the lack of SRAM caches for vector memory references, the high memory bandwidth of embedded DRAM and the delayed vector pipeline lead to high performance for memory accesses. For Bital, the performance of VIRAM-1 is limited by the reduction in each iteration of the outer loop, which detects the termination condition for the benchmark. The code for reductions fails to utilize all vector lanes during its last stages. It also exposes the length of the delayed pipeline, because its result must be sent to the scalar core, where the termination condition is checked. Still, the performance of VIRAM-1 for Bital with compiled code is equal to that of K6-III+. With optimized code for Bital, VIRAM-1 can match the performance of MPC7455.

Despite the lack of long vectors, the performance of VIRAM-1 for Viterbi is comparable to the highest scores achieved by other processors. Instructions with short vectors can keep the four lanes busy for only one or two clock cycles.
[Figure 5.6 charts: ConsumerMark scores of 201.4, 122.6, 110.0, 81.2, 34.2, 23.3, and 14.5 across VIRAM-1 (cc), VIRAM-1 (as), K6-III+, MPC7455, VR5000, TM1300 (cc), and TM1300 (opt), with VIRAM-1 (as) at 201.4; TeleMark scores of 61.7, 44.6, 27.2, 12.4, 8.7, 6.8, 2.0, and 1.1 across VIRAM-1 (cc), VIRAM-1 (as), K6-III+, MPC7455, VR5000, C6203 (cc), C6203 (opt), and 21065L, with VIRAM-1 (as) at 61.7.]
Figure 5.6: Comparison of the composite scores for the consumer and telecommunications categories in the EEMBC suite. Each composite score is proportional to the geometric mean of the performance (iterations per second) achieved by each processor for the benchmarks in the corresponding category. A higher score indicates higher performance.

With an instruction issue bandwidth of one instruction per cycle, short vectors translate to low utilization of the functional units in the vector coprocessor. The other two benchmarks with short vectors are Cjpeg and Djpeg, for which the scalar code overhead dominates all other performance bottlenecks. The scalar code for Huffman coding in both benchmarks accounts for approximately 70% of execution time. The high overhead for scalar operations is due to the lower degree of vectorization for Cjpeg and Djpeg (65%) and the poor quality of scalar code produced by the VIRAM-1 compiler.

Figures 5.9 and 5.10 report the same benchmark scores normalized by the clock frequency of each processor. Normalization allows us to isolate the contributions of architectural and microarchitectural features to performance from the benefits of advanced circuit design. It is worthwhile to notice that the normalized performance of MPC7455 is significantly lower than that of VIRAM-1 running compiled code. Despite its four-way superscalar issue and out-of-order capabilities, MPC7455 relies mostly on high frequency operation (1 GHz). However, high clock frequency also leads to high power consumption (21 W). VIRAM-1 can achieve higher performance at low power consumption through parallel execution of element operations. In addition, VIRAM-1 is in-order and single-issue. Hence, assuming equal design efforts, it should be possible to achieve higher clock frequencies for VIRAM-1 than for MPC7455.

The main architectural competition for VIRAM-1 comes from the two VLIW designs, TM1300 and C6203.
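As a side note, a composite score like those in Figure 5.6 can be reproduced from per-benchmark rates with a few lines of code. The Python sketch below forms a geometric mean; the benchmark rates and the scale factor are illustrative placeholders, since the exact EEMBC normalization constant is not given here.

    # Sketch of an EEMBC-style composite score: a (scaled) geometric mean
    # of per-benchmark throughput in iterations per second. The rates
    # below are made-up values, not measured results.

    from math import prod

    def composite(rates, scale=1.0):
        return scale * prod(rates) ** (1.0 / len(rates))

    rates = [1200.0, 950.0, 4000.0, 55.0, 70.0]   # hypothetical benchmark rates
    print(f"composite score: {composite(rates):.1f}")

A geometric mean is the natural choice here because it weighs a relative improvement on every benchmark equally, regardless of each benchmark's absolute iteration rate.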
[Figure 5.7 charts: five panels (Rgbcmyk, Rgbyiq, Filter, Cjpeg, Djpeg) plotting iterations per second for VIRAM-1 (cc), VIRAM-1 (as), K6-III+, MPC7455, VR5000, TM1300 (cc), and TM1300 (opt).]
Figure 5.7: Detailed performance comparison between VIRAM-1 and four embedded processors for the consumer benchmarks. Scores are in iterations per second and a higher score indicates higher performance.
[Figure 5.8 charts: five panels (Autocor, Convenc, Bital, Fft, Viterbi) plotting iterations per second for VIRAM-1 (cc), VIRAM-1 (as), K6-III+, MPC7455, VR5000, C6203 (cc), C6203 (opt), and 21065L.]
Figure 5.8: Detailed performance comparison between VIRAM-1 and five embedded processors for the telecommunications benchmarks. Scores are in iterations per second and a higher score indicates higher performance. The VIRAM-1 performance for Fft is only available for assembly code.
[Figure 5.9 charts: five panels (Rgbcmyk, Rgbyiq, Filter, Cjpeg, Djpeg) plotting iterations per million cycles for the same processors as Figure 5.7.]
Figure 5.9: Normalized performance comparison between VIRAM-1 and four embedded processors for the consumer benchmarks. Scores are in iterations per million cycles and a higher score indicates higher performance. These scores were derived by normalizing the scores in Figure 5.7 by the clock frequency of each processor.
[Figure 5.10 charts: five panels (Autocor, Convenc, Bital, Fft, Viterbi) plotting iterations per million cycles for the same processors as Figure 5.8.]
Figure 5.10: Normalized performance comparison between VIRAM-1 and five embedded processors for the telecommunications benchmarks. Scores are in iterations per million cycles and a higher score indicates higher performance. These scores were derived by normalizing the scores in Figure 5.8 by the clock frequency of each processor. The VIRAM-1 performance for Fft is only available for assembly code.
[Figure 5.11 charts: speedup (0 to 8) over a single-lane configuration for 1, 2, 4, and 8 lanes; top panel: Rgbcmyk, Rgbyiq, Filter, Cjpeg, Djpeg; bottom panel: Autocor, Convenc, Bital, Fft, Viterbi.]
Figure 5.11: The speedup of multi-lane implementations of the VIRAM-1 microarchitecture over a processor with a single lane. For each configuration, we set the number of address generators in the memory unit equal to the number of lanes.

VIRAM-1 outperforms both processors by at least a factor of 2 when they all run compiled code. The same holds with optimized code for all three processors. Of course, restructuring C code is easier than tuning in assembly. However, the optimization capabilities missing from the VIRAM-1 compiler are straightforward and have been available for years in commercial compilers. On the other hand, efficient compilation for VLIW processors is a relatively new area for industrial compilers. Hence, it is reasonable to expect that a vector compiler will be able to match the performance of assembly code on VIRAM-1 long before a VLIW compiler can eliminate the need for restructuring of C code in order to achieve high performance with VLIW architectures.
5.4.3 Performance Scaling

One of the most interesting features of the VIRAM-1 microarchitecture is the ability to easily scale its performance, area, and energy consumption by allocating the proper number of vector lanes. Figure 5.11 presents the effect of the number of lanes on the performance for each consumer and telecommunications benchmark. To simplify the charts, we present performance as speedup over a VIRAM-1 configuration with a single lane. Ideally, the speedup with 2, 4, and 8 lanes is 2, 4, and 8 respectively.

Only two benchmarks, Rgb2cmyk and Fft, come close to the ideal speedup in all cases. The performance for other benchmarks with long vectors scales well for up to four lanes (Rgb2yiq, Filter, and Convenc). With 8 lanes, however, a vector instruction at maximum vector length executes in only 4 clock cycles. Hence, the overhead introduced by scalar code, inefficiencies in scheduling, and stalls due to RAW hazards become much more significant than with four or fewer lanes.
[Figure 5.12 charts: ConsumerMark scores of 83.9, 134.3, 201.4, and 271.3 and TeleMark scores of 23.7, 40.8, 61.7, and 85.5 for 1, 2, 4, and 8 lanes respectively.]
Figure 5.12: The composite scores for the consumer and telecommunications benchmarks for VIRAM-1 as a function of the number of lanes. The results represent the performance achieved using VIRAM-1 code tuned with assembly programming.

With benchmarks that include dot-products, such as Autocor and Bital, the benefit of additional lanes is limited by the fact that the last few stages of a reduction cannot use the datapaths across all lanes. Finally, the number of lanes has limited effect on benchmarks with very short vectors or high scalar overhead, such as Cjpeg, Djpeg, and Viterbi.

Figure 5.12 presents the composite scores of VIRAM-1 as a function of the number of lanes. For both benchmark categories, the use of a second lane leads to roughly 75% performance improvement. The two subsequent doublings of the number of lanes lead to roughly 55% and 40% improvement respectively. It is also interesting to compare the scores in Figure 5.12 with the results in Figure 5.6 (page 52). Even with one or two lanes, VIRAM-1 running at 200 MHz can achieve performance scores comparable to or better than those of complex out-of-order and VLIW designs running at higher clock frequencies.
5.5 Lessons from VIRAM-1
VIRAM-1 demonstrates the feasibility of using vector processing with embedded DRAM for multimedia applications in embedded systems. The microarchitecture is simple but possesses a set of key properties that lead to performance and energy efficiency:

• Vector processing provides flexibility in architecting efficient processors. We achieve high performance by relying on concurrent execution of element operations on parallel datapaths. The fact that high clock frequency is not necessary simplifies design and reduces energy consumption. Furthermore, we can trade off area for reduced power consumption. Vector processing provides this flexibility without complex control structures for speculation or out-of-order execution.

• Vector lanes are an efficient method for exploiting parallelism in the form of long vectors. Even though a single lane has a simple structure and is easy to design, multiple operations can execute concurrently across parallel lanes. No additional instruction issue bandwidth and virtually no change in control logic are necessary for more lanes, assuming instructions specify a large number of element operations.

• High memory bandwidth can lead to high performance. VIRAM-1 uses the high bandwidth available from embedded DRAM to maintain a high throughput of element operations on its parallel datapaths. Both sequential and random access bandwidth are necessary to meet the requirements of the three addressing modes for vector loads and stores.
• Both the processor and the memory system are modular and use a small number of replicated hardware blocks. We can generate multiple processor implementations for a variety of performance, area, and energy requirements by mixing and matching the proper number of vector lanes and memory banks. Instead of a complete chip redesign, we only need to focus on the top-level interconnect in each case.

Nevertheless, prototyping VIRAM-1 also revealed certain inefficiencies that we did not initially anticipate. Most of the issues do not affect the effectiveness of the VIRAM-1 chip, but they limit the potential of its microarchitecture for future processor implementations:
• Vector lanes are not efficient with short vectors. To keep the instruction issue requirements low, a vector instruction should keep a functional unit busy for more than four cycles. Vector instructions with only 8 to 16 element operations, which are frequent in image and video processing, cannot make effective use of a processor with four or more lanes.

• The complexity of the vector register file is a bottleneck for exploiting non-vectorizable parallelism. For applications with short vectors, we can organize additional hardware resources in the form of more functional units within each lane. Using chaining, each functional unit can execute another vector instruction in parallel, even in the presence of dependencies. However, each functional unit requires a set of read and write ports in the vector register file, and the area, energy consumption, and latency of the register file grow rapidly with the number of access ports. Therefore, the vector register file becomes the major bottleneck to executing more vector instructions in parallel within each vector lane.

• The delayed pipeline is not efficient with memory latency larger than that of VIRAM-1. The low clock frequency of VIRAM-1 and the moderate latency of the embedded DRAM macro it uses allow the delayed pipeline to hide memory latency. However, a higher latency would translate to a longer pipeline with a larger area and power cost, and higher overhead for filling and draining between vector loops or functions. Hence, if we use the same microarchitecture with a slower memory system or with a higher clock frequency target for the processor, a significant performance drop will occur. In addition, introducing the additional pipeline stages requires non-trivial redesign.

• Imprecise virtual memory exceptions complicate software development. Restartable, imprecise exceptions with saving and restoring of microarchitecture state are sufficient for correct operation of virtual memory, but they are cumbersome to use and often discourage software and operating system development. Supporting precise memory exceptions with the VIRAM-1 microarchitecture would require stalling all other functional units when a memory instruction executes, which would reduce performance to unacceptable levels.
In the following chapters, we introduce an alternative vector microarchitecture for embedded media-processors. The new design builds on the experience of VIRAM-1 and addresses its shortcomings while maintaining its basic advantages.
Chapter 6
A New Vector Microarchitecture for Multimedia Execution
“The significant problems we face cannot be solved at the same level of thinking we were at when we created them.” Albert Einstein
Although VIRAM-1 provides a modular organization for a complete vector processor integrated with its memory system, it is inefficient for applications with short vectors and cannot scale to a larger number of functional units. This chapter presents CODE (composite organization for decoupled execution), a new microarchitecture for the VIRAM instruction set that overcomes the limitations of VIRAM-1. It provides two orthogonal methods for organizing vector hardware in order to exploit data-level parallelism in both long and short vectors. The CODE microarchitecture establishes the basis for a family of vector microprocessors tailored to the performance, power, and cost requirements of specific application domains.

Section 6.1 introduces the two basic elements of CODE and discusses their benefits. Section 6.2 provides a detailed description of the microarchitecture components for execution, memory access, data communication, and instruction issue. Section 6.3 presents a number of issue and control policies along with a set of instruction execution examples. Section 6.4 discusses the benefits from multi-lane implementations of the CODE microarchitecture. Sections 6.5 and 6.6 compare CODE to the VIRAM-1 microarchitecture and to organizations for out-of-order execution respectively. Section 6.7 describes an experimental study for tuning the basic microarchitecture policies. Finally, Section 6.8 reviews related work in academic and commercial microarchitectures.
6.1 Basic Design Approach
The CODE microarchitecture takes its name from the two basic ideas it combines: composite organization and decoupled execution. The two techniques replace the centralized register file and the delayed pipeline of VIRAM-1.
6.1.1 Composite Vector Organization

VIRAM-1 structures the vector coprocessor around a centralized vector register file that provides operands to all functional units and connects them to each other. In contrast, the composite approach organizes vector hardware as a collection of interconnected cores. Each core is a simple vector processor with a set of local vector registers and one functional unit. Each core can execute only a subset of the instruction set. For example, one vector core may be able to execute all integer arithmetic instructions, while another core handles vector load and store operations.

The composite organization breaks up the centralized vector register file into a distributed structure. It uses renaming logic to map the vector registers defined in the VIRAM architecture to the distributed register file. The local vector register file within each core must only support one functional unit and has a fixed and small number of access ports. Therefore, the local vector register files are simple to implement without having to resort to complicated full-custom design techniques. Furthermore, associating a local register file with each core establishes a balance between the number of functional units and the number of vector registers. Registers provide short-term storage for the operands of functional units, hence the total number of registers in a processor must be roughly proportional to the number of functional units it includes [RDKM00]. In Section 6.6, we analyze the area, access latency, and energy consumption advantages of the distributed register file structure.

Another benefit of the composite organization is the ability to scale the vector coprocessor in a flexible manner by mixing the proper number and type of vector cores. If the typical workload for a specific implementation includes a large number of integer operations, we can allocate more vector cores for integer instruction execution. Similarly, if floating-point operations are not necessary, we can remove all cores for floating-point instructions. In contrast, with VIRAM-1 we can only increase performance by allocating extra lanes, which scales integer and floating-point capabilities evenly, regardless of the specific needs of applications.

The composite organization requires a separate network for communication and data exchange between vector cores, as there is no centralized register file to provide the functionality of an all-to-all crossbar for vector operands. If a full crossbar switch were necessary to accommodate the data transfers between cores, the area and energy benefits from eliminating the centralized register file would vanish. However, the inter-core communication requirements are far smaller, so a simpler network structure is sufficient. Certain types of cores rarely need to communicate, such as the cores for integer and the cores for floating-point instructions. When communication is necessary, it is usually for a single operand, hence the network need not provide bandwidth for all instruction operands, as is the case with the centralized register file. Finally, if multiple cores can execute an instruction, we can take the locality of operands into account before assigning the instruction to a specific core, in order to minimize the need for inter-core communication.
6.1.2 Decoupled Execution

Decoupled execution [Smi84] plays a dual role in the CODE microarchitecture. First, it replaces the delayed pipeline as a more flexible technique for tolerating long memory latencies for vector accesses [EV96]. Second, it provides a mechanism for utilizing multiple vector cores with a single instruction stream.

The basic idea in decoupling is to separate an instruction sequence into multiple streams, one for each vector core. For example, memory instructions are placed in one stream to be issued to a load-store core, while integer instructions create a separate stream for an integer execution core. By providing an instruction queue in each core, the streams can slip with respect to each other. A load-store core may run further ahead and initiate memory transfers before other cores complete older instructions and long before following arithmetic operations need the fetched data. This load-store core is practically prefetching data and helps tolerate long memory latencies.

To handle register dependencies between streams, decoupling requires a pair of queues for data transfers for every two cores [SWP86]. For example, to use the result of a load instruction with a vector add, the integer core reads data from a queue where the load core has placed them. An empty or full data queue causes the receiving or transmitting core respectively to stall. Similarly, a full instruction queue causes the issue logic to stall.

Decoupled execution allows the composite microarchitecture to execute multiple instructions in parallel on the vector cores. Long latency operations, such as memory accesses or divides,
[Figure 6.1 diagram: the scalar core with instruction and data caches feeds an instruction queue to the CODE vector issue logic (VIL); the instruction bus (IBUS) connects the VIL to a state core, execution cores 1 to N, and load-store cores 1 to M; the load-store cores connect to the memory system, and all cores connect to the communication network.]
Figure 6.1: The block diagram of the CODE microarchitecture. The vector coprocessor consists of one state core, N execution cores, and M load-store cores, interconnected by a communication network. The vector issue logic (VIL) block issues instructions to the cores and communicates with the scalar core when necessary.
do not have to stall the whole system. Other cores can proceed with independent instructions from different streams. In addition, a core that runs ahead may be able to initiate any long latency operations early enough so that dependent instructions do not experience any stalls due to register hazards.

The disadvantage of decoupling is the area and power overhead introduced by data queues. A data queue in a decoupled vector processor must provide space for several vector registers, each with tens of elements [Esp97]. In addition, a system with multiple cores requires a large number of data queues. To avoid the overhead and the complexity of implementing two queues per pair of cores, the CODE microarchitecture uses the local vector registers in each core as a unified storage space for data communication. Each core writes the data it receives from any other core directly into a local vector register, from which it can access them later. Local vector registers are already necessary for instruction operands and provide a large amount of storage capacity. Using them for the data queues as well leads to a more efficient use of storage space. We can never have the situation where one data queue is empty while another one in the same core is full and causes stalls.
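The slip between streams that decoupling allows can be illustrated with a toy simulation. The Python sketch below captures the general principle, not the CODE control logic; the memory latency, issue rates, and instruction format are all assumed values.

    # Toy simulation of decoupling: a load-store core drains its own stream
    # and runs ahead, so only the first dependent add sees the full memory
    # latency. Latency and issue rates are illustrative assumptions.

    from collections import deque

    MEM_LATENCY = 20                                    # assumed, in cycles

    loads = deque(("vld", f"v{i}") for i in range(4))   # load-store stream
    adds  = deque(("vadd", f"v{i}") for i in range(4))  # execution stream
    ready_at = {}                                       # register -> cycle

    cycle = 0
    while adds:
        cycle += 1
        if loads:                        # load-store core: issue one load;
            _, reg = loads.popleft()     # result arrives MEM_LATENCY later
            ready_at[reg] = cycle + MEM_LATENCY
        _, reg = adds[0]                 # execution core: wait for operand
        if ready_at.get(reg, float("inf")) <= cycle:
            adds.popleft()
            print(f"cycle {cycle:2d}: vadd consumes {reg}")

In this run the four loads issue in cycles 1 through 4 and the adds complete back to back in cycles 21 through 24: the load-store stream has run ahead and hidden the memory latency of all but the first load.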
6.2 Microarchitecture Description
Figure 6.1 presents the block diagram for the CODE microarchitecture. The vector coprocessor consists of a collection of interconnected cores and a central issue logic block that dispatches vector instructions to them. The following subsections provide a detailed description of the structure and operation of each design component. An instruction queue decouples the scalar core and the vector coprocessor. The queue allows the scalar core to determine most control dependencies (conditional branches) ahead of time, without interrupting instruction execution in the vector coprocessor.
6.2.1 Vector Core Organization

Figure 6.2 presents the three classes of cores in the CODE microarchitecture and their internal structure. Execution cores (Figure 6.2.a) implement non-memory vector instructions. Memory operations execute in load-store cores (Figure 6.2.b). A processor may include multiple execution and load-store cores, some of which may be identical. A state core (Figure 6.2.c) provides additional storage for vector registers. There can be only one state core in the system.

Each execution core includes an instruction queue (IQ) for decoupling purposes, a local vector register file (LVRF), and one functional unit (FU). One input interface (INIF) and one output interface (OUTIF) connect to the communication network and allow data exchange with other cores. The LVRF must have sufficient read and write ports to support the functional unit(s), plus one read and one write port for the communication interfaces. Depending on the type of instructions supported by the functional unit, the number of access ports for the LVRF may range from 3 read and 2 write ports to 4 read and 2 write ports. The number of LVRF ports is independent of the total number of cores in the system.

The capabilities of the functional unit, in other words the instructions it can execute, define the type of the execution core. We can simply implement one type per vector instruction group listed in Table 4.2 (page 28), with one core type for all floating-point instructions, one for vector processing operations, and so on. A single type can support both integer and fixed-point operations because they execute on similar datapaths. For integer and floating-point instructions, we can implement separate functional units for simple (add, subtract) and complex (multiply, divide) operations. Table 6.1 presents the common types of execution cores used in this thesis.

A load-store core is similar to an execution core, but, instead of a functional unit, it includes a load-store unit. Depending on the capabilities of the unit, the core may be able to execute all memory instructions or just a subset (loads or stores, unit stride or indexed and strided accesses). A load-store unit includes address generation hardware (AG) and an output address queue (LSAQ) for pending memory accesses.

The state core includes no functional units. Its only purpose is to introduce additional vector registers. There must be storage space for at least 32 vector registers in the whole processor, which is the number of registers defined in the VIRAM instruction set. Furthermore, an execution or load-store core may run out of local vector registers if it executes a large number of instructions that use disjoint sets of architectural registers. In this case, we must transfer a few vector registers to other cores in order to make space for the operands of new instructions. Extra registers in the state core are particularly useful for this purpose.
6.2.2 Communication Network

The communication network provides the necessary interconnect for exchanging vector registers between cores. It handles transmit and receive requests from the input and output interfaces of the cores. Each matched pair of requests initiates a vector register transfer, assuming that sufficient network bandwidth is available at the time. In most practical implementations, the network will be able to transmit only a few vector elements in parallel and a register transfer will last several clock cycles. However, the transfer can start ahead of the instruction that requires the vector register. In addition, the instruction execution and the register transfer can overlap in a chained fashion.

The communication network is free of deadlocks, regardless of the number of vector cores it connects, if all cores process the instructions and register transfers assigned to them in strict issue order. We can prove this network property by noticing that the two cores involved in the oldest pending transfer cannot make transmit or receive requests for any other transfers without violating issue order. Therefore, we can establish a transmit-receive request match for the oldest transfer, which allows the network to complete it. By induction, all following transfers will also complete and
[Figure 6.2 diagrams: (a) an execution core with an IQ, LVRF, FU, INIF, and OUTIF; (b) a load-store core with an IQ, LVRF, AG, LSAQ, INIF, and OUTIF, connected to memory; (c) a state core with an IQ, LVRF, INIF, and OUTIF. All cores receive instructions from the IBUS and connect to the communication network.]
Figure 6.2: The internal organization of the three classes of vector cores: execution (a), load-store (b), and state (c). All classes include an instruction queue (IQ), a local vector register file (LVRF), one input interface (INIF), and one output interface (OUTIF). An execution core contains one functional unit (FU). A load-store core includes address generation hardware (AG) and an address queue for pending memory accesses (LSAQ). The input and output interfaces in each core connect to the communication network. They include no data storage other than a single-cycle buffer for retiming.
Core Type    Core Class   Functional Unit Capabilities       LVRF Ports (read/write)
IntSimple    Execution    Simple integer operations          3/2
IntComplex   Execution    Complex integer operations         3/2
IntFull      Execution    All integer operations             4/2
ArithRest    Execution    Vector processing operations,      3/2
                          permutation operations,
                          flag processing operations
LDLoad       Load-Store   All load operations                2/2
LDStore      Load-Store   All store operations               3/1
LDFull       Load-Store   All load and store operations      3/2
Table 6.1: The common types of execution and load-store cores. The last column lists the number of LVRF access ports (read/write) required for each core configuration, including the access ports for the input and output interfaces. The IntFull type is the only one that can execute fixed-point multiply-add instructions. No floating-point cores are listed here because the EEMBC benchmarks include no floating-point operations. In practice, cores for floating-point execution are organized similarly to integer cores.
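The core types in Table 6.1 are easy to express as configuration data. The following Python sketch does so and shows how issue logic could find the cores able to execute an instruction group; the group names and the example configuration are invented for illustration, not taken from the CODE design database.

    # The core types of Table 6.1 as data: class, executable instruction
    # groups, and LVRF (read, write) ports. Group names are hypothetical.

    CORE_TYPES = {
        "IntSimple":  ("execution",  {"int_simple"},                (3, 2)),
        "IntComplex": ("execution",  {"int_complex"},               (3, 2)),
        "IntFull":    ("execution",  {"int_simple", "int_complex"}, (4, 2)),
        "ArithRest":  ("execution",  {"vproc", "perm", "flag"},     (3, 2)),
        "LDLoad":     ("load-store", {"load"},                      (2, 2)),
        "LDStore":    ("load-store", {"store"},                     (3, 1)),
        "LDFull":     ("load-store", {"load", "store"},             (3, 2)),
    }

    def candidates(config, group):
        """Return the cores in a configuration that can execute `group`."""
        return [c for c in config if group in CORE_TYPES[c][1]]

    config = ["IntFull", "IntSimple", "ArithRest", "LDFull"]  # example mix
    print(candidates(config, "int_simple"))   # ['IntFull', 'IntSimple']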
deadlock can never occur. On the other hand, livelocks are theoretically possible. If the number of matched accesses exceeds the available bandwidth, the network may indefinitely delay a transfer for an older instruction in favor of transfers for younger instructions. However, long livelocks cannot happen in practice. Data dependencies on the operands of the delayed instruction and structural hazards on the hardware resources it uses will eventually block the issuing of any new instructions and allow the delayed transfer to complete. We can even eliminate livelocks by using a narrow tag with every instruction that identifies issue order and allows the network to give priority to the oldest transfer with matched requests.

For a specific vector processor, we can choose from a wide variety of alternatives for the network implementation: bus, ring, crossbar, ad-hoc interconnect, and so on. Each alternative leads to different bandwidth, latency, power and energy consumption, and connectivity characteristics. The basic trade-off in selecting a network implementation is between performance, cost of hardware resources, and power consumption. A high bandwidth structure, such as a crossbar, can support multiple concurrent transfers and can minimize the number of stalls due to register dependencies across cores. However, higher bandwidth comes at the cost of increased power consumption for faster transfers and a larger area penalty for wiring and switching resources. We will analyze the impact of the network characteristics on the performance of CODE in Chapter 9.
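The age-tag arbitration just described can be sketched in a few lines. This is an illustration of the policy, not the network hardware; the tag values and core names are arbitrary.

    # Sketch of livelock-free arbitration: among transfers whose transmit
    # and receive requests are both pending, grant the available network
    # slots to the oldest issue tags first.

    def arbitrate(transmits, receives, slots):
        """transmits/receives: transfer tag -> requesting core;
        slots: number of transfers the network can start this cycle."""
        matched = sorted(set(transmits) & set(receives))  # oldest tag first
        return [(t, transmits[t], receives[t]) for t in matched[:slots]]

    tx = {7: "core2", 3: "core1", 9: "core4"}   # pending transmit requests
    rx = {3: "core0", 9: "core2"}               # pending receive requests
    print(arbitrate(tx, rx, slots=1))           # [(3, 'core1', 'core0')]

Because a matched pair with the oldest tag is always granted first, an older transfer can be delayed by at most the transfers already in flight, which rules out indefinite postponement.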
6.2.3 Vector Issue Logic
The vector issue logic (VIL) block issues vector instructions and register transfers to the cores, thereby coordinating program execution on the composite organization. It does not control the actual instruction execution within the cores or the steps for register transfers. Each core contains decoding and sequencing logic that executes the element operations and cooperates with the communication network for data exchanges.
[Figure 6.3 diagram: a renaming table indexed by architectural register (0 to 31) with valid (1 bit), dirty (1 bit), core (5 bits), and register (4 bits) fields; per-core state with a free list (1 bit per local register), an IQ counter (4 bits), the maximum local register number (4 bits), and a capability mask (8 bits).]
Figure 6.3: The control data structures in VIL. The valid and dirty bits in the renaming table entries specify if the corresponding architectural register is allocated and if it has been recently updated. A large system with up to 32 cores and up to 16 registers per core requires 11 bits per entry. The state for each core includes a free list with one bit per local vector register, a counter for tracking the instruction queue size, and two read-only registers. The read-only registers identify the number of local vector registers and the execution capabilities of the core.
Basic Operation

The VIL block performs two basic tasks. First, it selects the core that will execute each vector instruction. Second, it determines if any of the instruction operands are stored in cores other than the one selected for its execution, and identifies the proper inter-core transfers. To perform the latter task, the VIL needs to maintain the correspondence between each architectural vector register in the VIRAM instruction set and the local vector register in one of the cores that stores its data. The renaming hardware used for this purpose is similar to that in superscalar processors [AST67, Tom67, Gro90, Yea96], but considerably simpler because it processes a single vector instruction per cycle.

The instruction bus (IBUS) broadcasts instructions from the VIL to the vector cores. The instruction format on the IBUS identifies the core selected for execution. It also includes an annotated description of the physical location of all operands. For example, if architectural vector register $vr12 is a source operand for an instruction to execute in core 2, a tuple that identifies the core and local register currently holding $vr12 accompanies the instruction.
Control Data Structures

Figure 6.3 describes the two data structures in the VIL, a renaming table and the per-core control state. The renaming table maintains the current mapping between the vector registers in the instruction set and the actual registers available in the cores. Each entry corresponds to an architectural register. The core and local register fields specify the exact location of the current content of the architectural register. To process an instruction, the VIL must read the entries for all its operands. Once it makes the proper issue decisions, it must update the entries for any of the operands allocated for the first time or moved across cores.
For correct register allocation, the VIL must also maintain certain state for each core. A free list uses one bit per local vector register to identify those that do not store the data for some architectural register. An up/down counter tracks the size of the instruction queue (IQ). It increments when a new instruction is assigned to the core and decrements when the core completes an instruction. The VIL also maintains in read-only registers the number of local vector registers and a bit mask that identifies the capabilities of the functional unit in each core. These registers are set at design time and allow the use of the same VIL logic with a variety of CODE configurations in terms of the number and mix of vector cores.

The renaming table and free lists are similar to the data structures maintained in superscalar, out-of-order processors for architectural (virtual) to physical register mapping. The distinctive difference is that in CODE, the VIL processes just one vector instruction per cycle. A single vector instruction specifies a large number of element operations and can keep a core busy for several cycles. Hence, the complexity of the VIL is independent of the number of cores in the system. On the other hand, a superscalar processor must issue multiple scalar instructions on every cycle to utilize a large number of functional units. Hence, it needs to perform multiple lookups and updates per cycle to the corresponding data structures [Soh90, SS95]. The complexity arises not only from the multiple access ports per data structure but also from the fact that the concurrent updates are dependent and the issue logic must maintain sequential semantics at all times. Consequently, the complexity of the issue logic in a superscalar processor is an exponential function of the number of functional units it supports.
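To make the renaming mechanism concrete, here is a minimal Python sketch of a renaming table with per-core free lists. The core names, register counts, and the simple recycle-on-reallocation policy are assumptions for illustration, not the VIL implementation.

    # Sketch of VIL renaming state: each architectural vector register maps
    # to a (core, local register) pair, and each core keeps a free list.
    # The replacement policies of Section 6.3.1 are omitted for brevity.

    class VIL:
        def __init__(self, cores):              # cores: name -> local regs
            self.table = {}                     # arch reg -> (core, local)
            self.free = {c: list(range(n)) for c, n in cores.items()}

        def lookup(self, arch):
            return self.table.get(arch)         # None if never allocated

        def allocate(self, arch, core):
            """Bind an architectural register to a free local register,
            recycling its previous location if one exists."""
            old = self.table.get(arch)
            if old:
                self.free[old[0]].append(old[1])
            local = self.free[core].pop(0)      # assumes a free register;
            self.table[arch] = (core, local)    # otherwise a register must
            return local                        # be moved to another core

    vil = VIL({"exec1": 8, "ldst1": 8})
    vil.allocate("vr1", "exec1")
    vil.allocate("vr2", "exec1")
    vil.allocate("vr0", "exec1")                # vadd vr0, vr1, vr2 on exec1
    print(vil.lookup("vr0"), vil.lookup("vr1")) # ('exec1', 2) ('exec1', 0)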
6.3 Issue Policies and Execution Control
This section discusses the options for issue policies and the implementation of chaining in CODE. It also provides an execution example that illustrates the function of the issue logic.
6.3.1 Issue Policies

The VIL block can use a number of alternative policies to select the core or allocate local registers for a vector instruction. We measure the effectiveness of each policy using two metrics: the number of register transfers between cores and, of course, performance. A large number of register transfers is undesirable, since they either require an expensive communication network or lead to long stalls if sufficient network bandwidth is not available. However, transfers may be necessary in order to enable concurrent instruction execution in the vector cores. Ideally, we want to maximize performance while minimizing the number of transfers, but the two goals can often be incompatible.

The selection of a core to execute an instruction is trivial if a single core supports the necessary operation. If multiple candidates exist, we can use one of the following selection policies: random, load balancing, and minimum communication. To balance the load between cores with identical functionality, the VIL can use the size of their instruction queues as an indication of load. To minimize inter-core communication, the VIL must calculate the number of transfers necessary for each candidate assignment given the current location of vector operands.

After selecting the core for execution, the VIL must determine the local vector registers the instruction will use for its sources and destination. For operands that are already available in the selected core, no decision is necessary. However, for input operands moved from another core, the VIL must allocate local vector registers to hold their values. The same holds for the destination register if its content is not stored in the selected core, even though no inter-core transfer is required. If the core has enough free local registers that do not store the value of any architectural register, the VIL simply selects one of them. Otherwise, it needs to make space by moving the contents of some local vector register(s) to another core. The potential policies for selecting local registers to replace are: random, least-recently-used (LRU), and most-recently-used (MRU). LRU assumes that applications
exhibit temporal locality, so subsequent instructions executing in this core will likely access the same registers as previous instructions. In contrast, MRU assumes streaming applications, in which register values are often discarded after a single use. The choice between LRU and MRU is similar to the choice between caching [Smi82, Sez93] and streaming buffers [PK94, CB94] for applications running on cache-based microprocessors. We analyze the impact of the different core selection and register replacement policies in Section 6.7.
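The two non-random core selection policies can be summarized in a short sketch. The scoring below is an illustration under stated assumptions (the queue sizes and operand locations are made up), not the tuned policies evaluated in Section 6.7.

    # Sketch of core selection: load balancing picks the candidate with the
    # shortest instruction queue; minimum communication picks the candidate
    # that already holds the most source operands locally.

    def select_core(candidates, iq_size, operand_core, srcs, policy):
        if policy == "load":
            return min(candidates, key=lambda c: iq_size[c])
        if policy == "mincomm":
            # Fewest transfers == most sources already resident in the core.
            return max(candidates,
                       key=lambda c: sum(operand_core.get(s) == c for s in srcs))
        raise ValueError(policy)

    iq  = {"int0": 5, "int1": 2}             # hypothetical queue sizes
    loc = {"vr1": "int0", "vr2": "int0"}     # hypothetical operand locations
    print(select_core(["int0", "int1"], iq, loc, ["vr1", "vr2"], "load"))
    print(select_core(["int0", "int1"], iq, loc, ["vr1", "vr2"], "mincomm"))
    # Prints int1 (shorter queue), then int0 (both sources already there).

The example also shows why the two goals can conflict: load balancing sends the instruction to int1, which would require two register transfers, while minimum communication keeps it on the more heavily loaded int0.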
6.3.2 Vector Chaining

The ability to chain the execution of vector instructions with the register transfers required for their operands is crucial for achieving high performance with CODE. Since each vector register contains a large number of elements, waiting for a whole register transfer to complete before any element operations can start would lead to frequent multi-cycle stalls.

Chaining an instruction to a register transfer requires no additional inter-core signaling in CODE. The functional unit that executes the instruction must only check with the input interface in the same core. Every time a new set of elements for the input operand arrives, the input interface writes them to the LVRF. The functional unit can immediately read them and initiate the corresponding element operations.

We support chaining between instruction execution and register transfers for all types of register dependencies: RAW, WAR, and WAW. WAW and WAR chaining allows a core to quickly reuse the storage of a local vector register as soon as its content starts being transferred elsewhere in the system, reducing the need for more local registers. The chaining logic within each core needs to monitor the functional unit and the two interfaces for register dependencies. Since the core resources are constant regardless of the size of the overall processor, there are no scaling concerns for the chaining logic. Chaining decisions on register dependencies are simple to make: the completion of an element operation in the functional unit or a transfer in the interfaces allows the dependent unit or interface to initiate the corresponding element action. Chaining control on a per-element basis is the most flexible implementation possible, because it allows each element operation to start as soon as its data are available, regardless of stalls experienced by other element operations for the same instruction.
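The timing benefit of per-element RAW chaining is easy to see in a small calculation. The transfer rate, vector length, and one-cycle forwarding delay below are assumptions chosen for illustration only.

    # Per-element RAW chaining between an incoming register transfer and a
    # dependent instruction: each element operation starts one cycle after
    # its element arrives, instead of waiting for the whole transfer.

    TRANSFER_RATE = 2     # elements delivered per cycle (assumed)
    VLEN = 8              # elements per vector register (assumed)

    transfer_done = (VLEN + TRANSFER_RATE - 1) // TRANSFER_RATE
    for e in range(VLEN):
        arrives = 1 + e // TRANSFER_RATE    # cycle the element reaches LVRF
        print(f"element {e}: chained start cycle {arrives + 1}, "
              f"unchained start cycle {transfer_done + 1}")

With these assumptions, the first element operation starts on cycle 2 instead of cycle 5, and the saving grows with the vector length and with slower networks.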
6.3.3 Execution Example

Figure 6.4 presents two execution cases for a vector add instruction assigned to core 1. In the first case (Figure 6.4.a), the renaming table indicates that both source operands, registers $vr1 and $vr2, are already available in core 1. Therefore, the annotated instruction format indicates no inter-core register transfer for them. No transfer is necessary for the destination register $vr0, even though its content is initially available in core 2, because the add instruction will overwrite it. After the instruction issues, the renaming table indicates that local register 2 in core 1 is the new location for $vr0. The execution of the instruction involves only core 1, which performs its element operations within a certain number of cycles.

In the second execution case (Figure 6.4.b), the initial location of architectural register $vr2 is core 2. Therefore, the annotated instruction format indicates a register transfer from core 2 to core 1 using a tuple that names the source core and the local register holding $vr2.
[Figure 6.4 diagram: the instruction vadd.vv $vr0,$vr1,$vr2 annotated as c1:vadd.vv, with the renaming table before and after issue and a time-line of the execution: case (a) with no register transfer, and case (b) with one register transfer from core 2 that overlaps with instruction execution in core 1.]
Figure 6.4: Two execution cases for a vector add instruction assigned to core 1. For each case, we present the annotated instruction format, the renaming table before and after the instruction issues, and a time-line of the actions required to execute the instruction. In case (a), we assume that the architectural registers $vr1 and $vr2 correspond initially to local registers 4 and 7 in core 1. In case (b), the contents of register $vr2 are initially in local register 0 in core 2. The destination register $vr0 is initially available in core 2, local register 5, in both cases. Case (a) requires no inter-core communication. Case (b) requires a register transfer from core 2 to core 1 for the contents of architectural register $vr2.
[Figure 6.5 diagram: the scalar core with instruction and data caches and the CODE vector issue logic (VIL) feed an instruction bus to four vector lanes; each lane holds a slice of the state core, execution cores 1 and 2, and load-store cores 1 and 2, and all lanes connect to the memory system.]
Figure 6.5: An implementation of the CODE microarchitecture with four vector lanes. The processor includes two execution and two load-store cores. The hardware resources of each core are distributed across the four lanes.
6.4 Multi-lane Implementation of Composite Organizations
The composite nature of CODE does not prohibit the use of vector lanes. Figure 6.5 presents an implementation of the microarchitecture with four vector lanes. Each lane contains part of each core, including a vertical partition of the LVRF, a 64-bit datapath from the functional unit, and a 64-bit slice of the interfaces to the communication network. We can replicate the core instruction queues in every lane or place them in a centralized location that connects to all lanes. The latter approach is preferable only for designs with few lanes or few cores per lane, due to the wiring resources it requires.

Figure 6.5 shows the load-store cores distributed across the four lanes. This approach simplifies the structure of the communication network. However, it may require the replication of resources such as the TLB or the output address queue. If the area overhead is significant, we can place these resources in a centralized location and allow all lanes to share them. A micro-TLB per load-store core can reduce the frequency of accesses to the main TLB and allow resource sharing without noticeable performance penalty.

Figure 6.6 shows how lanes and cores provide two orthogonal methods for scaling implementations of the CODE microarchitecture. Multiple lanes execute in parallel a large number of element operations for each instruction in progress. Multiple cores overlap the execution of multiple vector instructions.
[Figure 6.6 diagram: the CODE design space spans two axes, with the number of lanes capturing long-vector data parallelism and the number of cores capturing short-vector data parallelism; the VIRAM-1 design line varies only along the lanes axis.]
Figure 6.6: The two scaling dimensions of the CODE microarchitecture. Multiple lanes exploit parallelism in the form of long vectors, while multiple cores help with parallelism in the form of independent vector instructions. For each implementation, we can choose the point in the design space that best matches the characteristics of its applications. In contrast, the VIRAM-1 microarchitecture has a single scaling dimension, since we can only vary the number of lanes in the processor.

For applications with long vectors, we can organize hardware resources in many lanes and few cores. If long vectors are not available, we can exploit independent vector instructions by implementing many cores with only a few lanes. In other words, we can select the number of cores and lanes that matches the requirements of applications in the most efficient way given the available hardware resources. Chapter 9 explores the optimal balance between cores, lanes, and communication network bandwidth for the EEMBC benchmarks.

All processors based on CODE are binary compatible. A single design database can create multiple implementations in a two-step process. First, we assemble a lane using a set of narrow core designs from the database. Next, we construct the heart of the vector processor by allocating the proper number of lanes. In other words, we can amortize the development cost of circuits for the cores, the VIL, and the communication network over a large number of implementations with divergent characteristics. This is a major advantage for CODE in terms of design complexity, along with the fact that the circuits for its basic components are straightforward to develop due to their relatively simple structure.
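As a simple illustration of the two scaling dimensions, the sketch below enumerates (cores, lanes) splits under a fixed datapath budget; the budget and the labeling heuristic are arbitrary assumptions, not a CODE sizing rule.

    # Enumerate ways to spend a fixed budget of 64-bit datapaths on
    # execution cores versus lanes. The budget is an arbitrary example.

    BUDGET = 8   # total 64-bit datapaths across all execution cores

    for cores in (1, 2, 4, 8):
        lanes = BUDGET // cores
        target = "long vectors" if lanes > cores else "short vectors"
        print(f"{cores} core(s) x {lanes} lane(s): best suited to {target}")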
6.5 CODE vs. VIRAM-1
CODE differs from the VIRAM-1 microarchitecture in two ways. First, it breaks up the centralized register file and distributes the vector registers across all cores. Second, it uses decoupling, instead of the delayed pipeline, to tolerate long memory latencies. The following subsections discuss in detail the implications of the two differences.
6.5.1 Centralized vs. Distributed Vector Register File

The distributed register file in CODE associates a small number of vector registers with the functional unit in each core. In Section 6.1, we discussed two obvious advantages of this approach over the centralized vector register file in VIRAM-1. The local register file in each core has a small number of access ports that is independent of the number of cores in the system. Therefore, we can implement it from compiled SRAMs available in design libraries for semiconductor processes,
without resorting to the tedious full-custom techniques. In addition, the total number of vector registers becomes proportional to the number of functional units in the system. Therefore, there are always sufficient registers to stage the operands for executing one instruction in each core, regardless of the total number of cores in the system. Furthermore, the distributed register file introduces significant area, energy, and latency advantages as we increase the number of functional units in the processor. By properly adapting the analysis by Rixner in [RDKM00], we can derive the following equations for the total area for vector registers, the register file access latency, and the energy consumed for an element access:
Area = c · r · e · d · (w + p) · (h + p)
Latency = ((w + p) / ν0) · √(r · e · d / L) + log4[(Cword + Cw · (w + p)) · √(r · e · d / L)]
        + ((h + p) / ν0) · √(r · e · d / L) + log4[(Cbit + Cw · (h + p)) · √(r · e · d / L)]

Energy = (Cword + Cw · (w + p)) · E0 · √(r · e · d / L)
       + a · (Cbit + Cw · (h + p)) · E0 · √(r · e · d / L)
Parameters c, r, and p represent the number of cores in the system, the number of registers per core, and the number of access ports per register, respectively¹. For the centralized vector register file in VIRAM-1, there are 32 vector registers (r = 32) in one core (c = 1), and the number of access ports per register is typically p = 3 · FU, where FU is the number of functional units. In CODE, each functional unit is in a separate core with a local vector register file, hence c = FU. The number of access ports per local register is typically p = 5. The number of registers per core (r) is a design parameter of the core.

Figure 6.7 provides a graphical representation of the three equations for VIRAM-1 and CODE. For the centralized register file of VIRAM-1, the total area for vector registers is a square function of the number of functional units (FU), even though the register file capacity is constant (32 vector registers). The register access latency grows with the logarithm of FU because the lumped capacitance of the pass transistors for the access ports typically dominates the capacitance of the word-line and bit-line wires. The increased latency creates a serious limitation for the clock frequency of the whole processor. The energy for an element access is a linear function of FU, due to the capacitance introduced by the additional access ports for each functional unit. Consequently, implementing the VIRAM-1 microarchitecture with a large number of functional units becomes prohibitively expensive in all three metrics.

On the other hand, both the area and the capacity (number of registers) of the distributed register file in CODE are proportional to the number of functional units. The latency and the energy for an element access are independent of FU, since the local register file in each core has a fixed size and a constant number of access ports. Therefore, the distributed organization for the vector register file is appropriate for implementations with a large number of functional units.
¹ The remaining design and technology parameters in the three equations are: e is the number of elements per vector register; d is the element width in bits; L is the number of lanes; w and h are the width and height of a single-bit storage cell; Cword is the input capacitance of the word-select transistor in a register cell; Cbit is the capacitance introduced by the storage cell to each bit-line; Cw is the wire capacitance per unit length; ν0 is the wire propagation velocity; E0 is the energy required to charge a unit of capacitance.
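The model is easy to evaluate numerically. The following Python sketch restates the three equations; all technology constants are placeholder values in arbitrary units, not the SA27E parameters behind Figure 6.7, so only the trends are meaningful.

import math

def regfile_metrics(c, r, p, e=32, d=64, L=4, w=2.0, h=4.0,
                    Cword=1.0, Cbit=1.0, Cw=0.2, v0=1000.0, E0=1.0, a=1.0):
    """Evaluate the area, latency, and energy equations above.
    c: cores, r: registers per core, p: access ports per register.
    The technology constants (w, h, Cword, Cbit, Cw, v0, E0, a) are
    placeholders in arbitrary units, not calibrated process values."""
    side = math.sqrt(r * e * d / L)   # storage cells per array dimension, per lane
    area = c * r * e * d * (w + p) * (h + p)
    latency = ((w + p) / v0) * side + math.log((Cword + Cw * (w + p)) * side, 4) \
            + ((h + p) / v0) * side + math.log((Cbit + Cw * (h + p)) * side, 4)
    energy = (Cword + Cw * (w + p)) * E0 * side \
           + a * (Cbit + Cw * (h + p)) * E0 * side
    return area, latency, energy

for FU in (2, 4, 8, 16):
    viram = regfile_metrics(c=1, r=32, p=3 * FU)   # centralized: ports grow with FU
    code = regfile_metrics(c=FU, r=8, p=5)         # distributed: fixed ports per core
    print(f"FU={FU:2d}  VIRAM-1: area={viram[0]:.2e} lat={viram[1]:6.1f} en={viram[2]:7.1f}"
          f"   CODE: area={code[0]:.2e} lat={code[1]:6.1f} en={code[2]:7.1f}")

Even with made-up constants, the qualitative behavior of Figure 6.7 emerges: the centralized file's area grows quadratically and its latency and energy keep climbing with FU, while the CODE columns stay flat in latency and energy and grow only linearly in area.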
[Figure 6.7: three plots showing register file area (mm²), access latency (nsec), and access energy (pJ) versus the number of functional units (1 to 16), with curves for VIRAM-1 and for CODE with r = 4, 8, and 12.]
Figure 6.7: The register file area, register access latency, and energy for an element access as a function of the number of functional units (FU) in VIRAM-1 and CODE. For CODE, we present separate curves for the cases of r = 4, 8, and 12 local vector registers per core. The values used for the technology parameters are derived from the IBM SA27E process and are representative of 0.18µm CMOS technology.
However, the distributed vector register file incurs an additional energy overhead for inter-core register transfers. Such transfers dissipate energy on the wires of the communication network. For any efficient network design, the length of these wires is proportional to the square root of the total area occupied by the cores. Similarly, the length of the wires that connect the boundary of the centralized vector register file in VIRAM-1 to the inputs of the functional units is proportional to the square root of the total area occupied by the functional units. In VIRAM-1, energy is consumed on the long wires for every register access, while in CODE energy is consumed on long wires only for register accesses that trigger inter-core transfers. Figure 6.8 presents the total energy consumed by a vector instruction for reading, writing, and transferring elements between the register file(s) and the functional unit in VIRAM-1 and CODE. The distributed register file leads to higher energy efficiency when the average number of inter-core register transfers per vector instruction (t) in CODE is low, or when the total number of functional units is large. Figure 6.9 presents the cross-over point, in terms of the number of functional units, at which CODE exceeds VIRAM-1 in register file energy efficiency as a function of t. For t ≤ 1.2, CODE leads to better energy efficiency for all implementations with three functional units or more. Note that the maximum value for t is 2, since an instruction has at most two vector source operands that may require a transfer.
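This trade-off can be sketched by extending the energy model above with a wire term. The square-root dependence of wire length on core area follows the text, but every constant below (including the wire scale factor k) is a placeholder of our own, so the computed cross-over points are illustrative rather than a reproduction of Figure 6.9.

import math

def access_energy(p, r, e=32, d=64, L=4, w=2.0, h=4.0,
                  Cword=1.0, Cbit=1.0, Cw=0.2, E0=1.0, a=1.0):
    """Per-element register file access energy from the model above
    (placeholder constants, arbitrary units)."""
    side = math.sqrt(r * e * d / L)
    return ((Cword + Cw * (w + p)) + a * (Cbit + Cw * (h + p))) * E0 * side

def wire_energy(FU, k=1500.0):
    """Energy to move one element across the long wires; their length, and
    hence their energy, grows with the square root of the total core area.
    k is an arbitrary scale factor, not a calibrated value."""
    return k * math.sqrt(FU)

def viram_energy(FU, e=32):
    # the centralized file pays the long wires on every element access
    return e * (access_energy(p=3 * FU, r=32) + wire_energy(FU))

def code_energy(FU, t, e=32):
    # only the t inter-core transfers per instruction cross the long wires
    return e * access_energy(p=5, r=8) + t * e * wire_energy(FU)

for t in (0.0, 0.5, 1.0, 1.5, 2.0):
    fu = next((n for n in range(1, 17) if code_energy(n, t) < viram_energy(n)), None)
    print(f"t = {t:3.1f}: CODE wins from FU = {fu} (searching up to 16 units)")

The shape of the result is what matters: small t favors the distributed file at almost any scale, while large t shifts the cross-over toward implementations with more functional units.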
6.5.2 Decoupling vs. Delayed Pipeline

Figure 6.10 presents the execution of a five-instruction segment on the VIRAM-1 delayed pipeline and the CODE decoupled organization. The segment includes two long-latency instructions: an unpipelined vector divide (vdiv) and an indexed load (vldx) that causes multi-cycle stalls due to memory conflicts. No data dependencies exist between the five instructions. In the delayed
[Figure 6.8: energy (nJ) per vector instruction versus the number of functional units, with curves for VIRAM-1 and for CODE with t = 0.0, 0.5, 1.0, 1.5, and 2.0.]