Pseudo-Vector Machine For Embedded Applications

by

Lea Hwang Lee

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) at The University of Michigan, 2000.

Doctoral Committee:
    Prof. Trevor Mudge, Chairperson
    Prof. Richard Brown
    Prof. Ed Davidson
    Prof. Marios Papaefthymiou
    Prof. Karem Sakallah

"We are moving into a third stage of computing. In the first, the mainframe world, there was one computer for many people. In the second, the PC world, there was a computer for each person. In the next stage there will be many computing devices for each person..."
    Roy Want, Palo Alto Research Center, Xerox Corp., Palo Alto, CA.
    Source: PC Week Online, January 3, 2000

Lea Hwang Lee © 2000. All Rights Reserved.

To my mother, Chern, Der-Shin, and my sister, Lee, Deek Ann, for their unfailing care and support.

ACKNOWLEDGEMENTS

I joined the M-CORE™ Technology Center (MTC), Motorola Inc., Austin, Texas, as a summer intern in 1994. At that time, the group (under a different name) had just embarked on developing a new ISA targeting mid-to-low-end embedded markets. I spent the next two years (1995 and 1996) working and traveling between Austin, Texas and Ann Arbor, Michigan, and became a full-time employee towards the end of 1996.

This dissertation work was neither formally nor directly funded by any organization. However, I did receive a great deal of assistance from the MTC. In particular, they gave me access to various software tools and benchmark programs. For a brief period of time, they also kept me on the payroll while I was working full-time on my dissertation - what a perfect way to fund a research project.

I am deeply indebted to my manager, John Arends, for the constant guidance he has provided. From day one, he has always been there to render his help, technically or otherwise. I am very grateful to my co-worker and best friend, Jeff Scott. He is one of the most talented, energetic, motivated and creative engineers I have ever met. I am also very grateful to Bill Moyer for all his technical assistance and insightful advice. His insights have always made my "research life" much more interesting and challenging. I am also deeply indebted to Prof. Trevor Mudge for all his technical assistance and advice, without which this dissertation work would not have been possible. Last, but not least, I would like to thank the entire MTC team for all the support they have given me. I feel honored to have had the opportunity to work with such a talented and motivated group of people.

PREFACE

The focus of this dissertation is on designing low-cost, low-power and high-performance processors for mid-to-low-end mobile embedded applications. In particular, the main goal of this work is to explore ways to add a minimum amount of hardware to a single-issue machine to improve its performance on critical loop executions, since many of these applications spend a significant amount of their execution time in a handful of critical loops. Improving the performance of these loops provides the biggest bang for the buck.

This dissertation borrows many existing architectural ideas from vector processors and DSP processors, and combines them into a single execution model. The vector processing paradigm is well known for its excellent cost/performance trade-off. The processing paradigm proposed in this dissertation, called the pseudo-vector machine, exploits, as much as possible, the low-cost, low-power and high-performance aspects of the vector processing paradigm.

As we will see later in this dissertation, the characteristics of the critical loops found in our benchmarks vary greatly, from highly vectorizable, to difficult (and costly) to vectorize, to impossible to vectorize. For example, a loop that performs a vector operation described by C[i] = A[i] * B[i], for i = 0, ..., n-1, for some vectors A, B and C of length n, is a highly vectorizable loop. A loop that performs a vector operation described by C[i] = (A[i] > B[i]) ? A[i]^2 : A[i] + B[i], for i = 0, ..., n-1, is more difficult (or costlier) to vectorize. The vector arithmetic represented by some hard-to-vectorize loops is, in this work, called pseudo-vector arithmetic (PVA). For this type of loop, the DSP style of processing, which focuses on optimizing program loop executions, is more suitable. These optimization techniques include: (i) using "while" or "repeat" instructions to remove loop control overheads; and (ii) using data streaming buffers to remove the overhead associated with constant-stride memory operations.
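For concreteness, these two loop classes can be written in C as follows. This is only a minimal illustrative sketch; the element type and the function names are chosen here for illustration and do not appear in the benchmarks themselves.

    /* Illustrative sketch only: element type and function names are placeholders. */

    /* A highly vectorizable loop: a single element-wise vector multiply,
       C[i] = A[i] * B[i]. */
    void vectorizable_loop(int *C, const int *A, const int *B, int n)
    {
        for (int i = 0; i < n; i++)
            C[i] = A[i] * B[i];
    }

    /* A harder (costlier) loop to vectorize: each element requires a
       data-dependent choice between two different arithmetic results,
       C[i] = (A[i] > B[i]) ? A[i]^2 : A[i] + B[i]. */
    void pseudo_vector_loop(int *C, const int *A, const int *B, int n)
    {
        for (int i = 0; i < n; i++)
            C[i] = (A[i] > B[i]) ? A[i] * A[i] : A[i] + B[i];
    }

The first loop maps directly onto a single vector operation, while the second is the kind of loop this work refers to as pseudo-vector arithmetic.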
The pseudo-vector machine proposed in this dissertation can execute two types of vector arithmetic: a "true" vector arithmetic and a "pseudo" vector arithmetic, as described in the preceding paragraphs. Depending on the type of loop, the machine sometimes behaves like a vector processor and sometimes like a DSP processor. The compiler for this machine decides which execution mode to use for each critical loop. In addition, the machine uses a single datapath to execute all the vector arithmetic as well as the scalar portion (the non-loop portion) of the program code - an efficient reuse of the hardware resources.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
PREFACE
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
FREQUENTLY USED M-CORE INSTRUCTIONS

CHAPTER I  INTRODUCTION: LOW-COST, LOW-POWER DESIGNS FOR EMBEDDED APPLICATIONS
  1.1 The World Of Mobile Computing
  1.2 Important Characteristics Of Mobile Systems
    1.2.1 Alternate Operating Modes
    1.2.2 Performance vs. Instantaneous Power
  1.3 Execution Modes For Mobile Systems
  1.4 Pseudo-Vector Machine - An Architectural Overview
  1.5 The Strength of Vector Processing
  1.6 Vector Processing vs. Pseudo-Vector Processing
  1.7 The Basic Framework For This Dissertation
  1.8 Profile-Based Performance Evaluations - An Example
  1.9 Contributions Of This Dissertation
  1.10 A Note On Vector Processing Paradigm

CHAPTER II  RELATED WORK
  2.1 Software Loop Unrolling
  2.2 Software Pipelining and Register Rotation
  2.3 Stream Data Buffers In WM and SMC Architectures
  2.4 Data Address Generators
  2.5 Compute And Memory Move Instructions
  2.6 Special Loop Instructions For Removing Loop Control Overheads
    2.6.1 The TriCore™ ISA
    2.6.2 The SHARC ADSP ISA
  2.7 Vector Processing
    2.7.1 Cray-1 Vector Machine - A SISD Vector Machine
    2.7.2 PowerPC AltiVec - A SIMD Vector Processor
  2.9 Decoupled Access/Execute Machine - Astronautics ZS-1
  2.10 The Transmeta's Crusoe™ Processors
  2.11 Pseudo-Vector Machine - Comparisons With Related Work
  2.12 General Comparisons With Related Work

CHAPTER III  VECTOR ARITHMETIC
  3.1 Canonical Vector Arithmetic
    3.1.1 Compound CVA
    3.1.2 Reduction CVA
    3.1.3 Hybrid CVA
    3.1.4 Some Examples Of CVA
  3.2 Pseudo-Vector Arithmetic
  3.3 Vector Arithmetic with Early Termination

CHAPTER IV  PROGRAMMING MODELS
  4.1 Execution Modes
  4.2 Constant-Stride Load/Store Operations
    4.2.1 cs-load And cs-store For CVA Executions
    4.2.2 cs-load And cs-store For PVA Executions
  4.3 Special Registers For Vector Executions
    4.3.1 Stride and Size Register
    4.3.2 Count Index Register (CIR)
    4.3.3 Register For Storing Constant-Stride Load Addresses
    4.3.4 Register For Storing Constant-Stride Store Addresses
    4.3.5 Scalar Results For Reduction And Hybrid CVA
    4.3.6 Scalar Source Operands For CVA Executions
  4.4 Vector Instructions
  4.5 Terminating Conditions
    4.5.1 Early Termination for CVA Executions
    4.5.2 Early Termination for PVA Executions
  4.6 Register Overlay
  4.7 Machine States Maintenance For Vector Executions
    4.7.1 Saving The Execution Modes
    4.7.2 Saving The Minimum Vector Contexts
    4.7.3 Updates of Temporary and Overlaid Instances of R0 and R1
  4.8 Memory Organization
    4.8.1 Memory Bandwidth Requirements For Vector Executions
    4.8.2 Memory Map For M0, M1, TM