EFFICIENT EMBEDDED COMPUTING

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

James David Balfour

May 2010

© 2010 by James David Balfour. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/nb912df4852

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

William Dally, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Mark Horowitz

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Christoforos Kozyrakis

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Embedded computer systems are ubiquitous, and contemporary embedded applications exhibit demanding computation and efficiency requirements. Meeting these demands presently requires the use of application-specific integrated circuits and collections of complex system-on-chip components. The design and implementation of application-specific integrated circuits is both expensive and time consuming. Most of the effort and expense arises from the non-recurring engineering activities required to manually lower high-level descriptions of systems to equivalent low-level descriptions that are better suited to hardware realization. Programmable systems, particularly those that can be targeted effectively using high-level programming languages, offer reduced development costs and faster design times. They also offer the flexibility required to upgrade previously deployed systems as new standards and applications are developed. However, programmable systems are less efficient than fixed-function hardware. This significantly limits the class of applications for which programmable processors are an acceptable alternative to application-specific fixed-function hardware, as efficiency demands often preclude the use of programmable hardware. With most contemporary computer systems limited by efficiency, improving the efficiency of programmable systems is a critical challenge and an active area of computer systems research.

This dissertation describes Elm, an efficient programmable system for high-performance embedded applications. Elm is significantly more efficient than conventional embedded processors on compute-intensive kernels. Elm allows software to exploit parallelism to achieve performance while managing locality to achieve efficiency. Elm implements a novel distributed and hierarchical system organization that allows software to exploit the abundant parallelism, reuse, and locality that are present in embedded applications. Elm provides a variety of mechanisms to assist software in mapping applications efficiently to massively parallel systems. To improve efficiency, Elm allows software to explicitly schedule and orchestrate the movement and placement of instructions and data.

This dissertation proposes and critically analyzes concepts that encompass the interaction of computer architecture, compiler technology, and VLSI circuits to increase performance and efficiency in modern embedded computer systems. A central theme of this dissertation is that the efficiency of programmable embedded systems can be improved significantly by exposing deep and distributed storage hierarchies to software. This allows software to exploit temporal and spatial reuse and locality at multiple levels in applications in order to reduce instruction and data movement.

Acknowledgments

First, I would like to thank my dissertation advisor, Professor William J. Dally, who has been a great mentor and teacher. I benefited immensely from Bill's technical insight, vast experience, boundless enthusiasm, and steadfast support.

I would like to thank the other members of my dissertation committee, Professors Mark Horowitz, Christos Kozyrakis, and Boris Murmann. I had the distinct pleasure of having all three as lecturers at various times while a graduate student, and benefited greatly from their insights and experience. I imagine all will recognize some aspects of their teaching and influence within this dissertation.

I was fortunate to befriend some fantastic folks while at Stanford. In particular, I would like to thank Michael Linderman for years of engaging early morning discussions, and Catie Chang for years of engaging late night discussions and encouragement; my first few years at Stanford would not have been nearly as tolerable without their friendship. I would like to thank James Kierstead for several years of Sunday mornings spent discussing the plight of graduate students over waffles and coffee. I particularly want to thank Tanya for her friendship, patience, support, assistance, and understanding. The last year would have been difficult without her.

I should like to acknowledge the other members of the Concurrent VLSI Architecture group at Stanford that I had the benefit of interacting with. In particular, I would like to acknowledge the efforts of the other members of the efficient embedded computing group, in approximate chronological order: David Black-Schaffer, Jongsoo Park, Vishal Parikh, R. Curtis Harting, and David Sheffield. I also had the pleasure of sharing an office with Ted Jiang, which I appreciated. Ted was always around and cheery on weekends, which made Sundays in the office more enjoyable, and kept the bookshelves well stocked.

While a student at Stanford, I was funded in part by a Stanford Graduate Fellowship that was endowed by Cadence Design Systems. I remain grateful for the support.

Finally, I would like to thank my family for all the love and support they have provided through the years.

Dedicated to the memory of Harold T. Fargey, who passed away before I could share my writing with him as he always shared with me.

Contents

Abstract
Acknowledgments

1 Introduction
  1.1 Embedded Computing
  1.2 Technology and System Trends
  1.3 A Simple Problem of Productivity and Efficiency
  1.4 Collaboration and Previous Publications
  1.5 Dissertation Contributions
  1.6 Dissertation Organization

2 Contemporary Embedded Architectures
  2.1 The Efficiency Impediments of Programmable Systems
  2.2 Common Strategies for Improving Efficiency
  2.3 A Survey of Related Contemporary Architectures
  2.4 Chapter Summary

3 The Elm Architecture
  3.1 Concepts
  3.2 System Architecture
  3.3 Programming and Compilation
  3.4 Chapter Summary

4 Instruction Registers
  4.1 Concepts
  4.2 Microarchitecture
  4.3 Examples
  4.4 Allocation and Scheduling
  4.5 Evaluation and Analysis
  4.6 Related Work
  4.7 Chapter Summary

5 Operand Registers and Explicit Operand Forwarding
  5.1 Concepts
  5.2 Microarchitecture
  5.3 Example
  5.4 Allocation and Scheduling
  5.5 Evaluation
  5.6 Related Work
  5.7 Chapter Summary

6 Indexed and Address-Stream Registers
  6.1 Concepts
  6.2 Microarchitecture
  6.3 Examples
  6.4 Evaluation and Analysis
  6.5 Related Work
  6.6 Chapter Summary

7 Ensembles of Processors
  7.1 Concept
  7.2 Microarchitecture
  7.3 Examples
  7.4 Allocation and Scheduling
  7.5 Evaluation and Analysis
  7.6 Related Work
  7.7 Chapter Summary

8 The Elm Memory System
  8.1 Concepts
  8.2 Microarchitecture
  8.3 Example
  8.4 Evaluation and Analysis
  8.5 Related Work
  8.6 Chapter Summary

9 Conclusion
  9.1 Summary of Thesis and Contributions
  9.2 Future Work

A Experimental Methodology
  A.1 Performance Modeling and Estimation
  A.2 Power and Energy Modeling

B Descriptions of Kernels and Benchmarks
  B.1 Description of Kernels
  B.2 Benchmarks

C Elm Instruction Set Architecture Reference
  C.1 Control Flow and Instruction Registers
  C.2 Operand Registers and Explicit Operand Forwarding
  C.3 Indexed Registers and Address-Stream Registers
  C.4 Ensembles and Processor Corps
  C.5 Memory System

Bibliography

List of Tables

1.1 Estimates of Energy Expended Performing Common Operations
2.1 Embedded RISC Processor Information
2.2 Average Energy Per Instruction for an Embedded RISC Processor
4.1 Elm Instruction Fetch Energy
5.1 Energy Consumed by Common Operations
8.1 Execution Times and Data Transferred
8.2 Number of Remote and Stream Memory Operations in Benchmarks
C.1 Mapping of Message Registers to Communication Links

List of Figures

2.1 Embedded RISC Processor Area
2.2 Average Instructions Per Cycle for an Embedded RISC Processor
2.3 Energy Consumption by Module for an Embedded RISC Processor
2.4 Distribution of Instructions by Operation Type in Embedded Kernels
2.5 Distribution of Instructions within Embedded Kernels
2.6 Distribution of Energy Consumption within Embedded Kernels
3.1 Elm System Architecture
3.2 Elm Communication and Storage Hierarchy
3.3 Elk Actor Concept
3.4 Elk Bundle Concept
3.5 Elk Stream Concept
3.6 Elk Implementation of FFT
4.1 Instruction Register Organization
4.2 Impact of Cache Block Size on Miss Rate and Instructions Transferred
4.3 Instruction Register Microarchitecture
4.4 Assembly Code Fragment of a Loop that can be Captured in Instruction Registers
4.5 Assembly Code Produced with a Simple Allocation and Scheduling Strategy
4.6 Assembly Code Fragment Produced when Instruction Load is Advanced
4.7 Assembly Code Fragment with a Loop that Exceeds the Instruction Register Capacity
4.8 Loop after Basic Instruction Load Scheduling
4.9 Assembly Code After Partitioning Loop Body Into 3 Basic Blocks
4.10 Assembly Code After Load Scheduling
4.11 Assembly Code Fragment Illustrating a Procedure Call
4.12 Assembly Code Fragment Illustrating Function Call Using Function Pointer
4.13 Assembly Code Illustrating Indirect Jump
4.14 Assembly Fragment Containing a Loop with a Control-Flow Statement
4.15 Contents of Instruction Registers During Loop Execution
4.16 Control-Flow Graph
4.17 Illustration of Regions in the Control-Flow Graph
4.18 Resident Instruction Sets
4.19 Kernel Code Size After Instruction Register Allocation and Scheduling
4.20 Static Fraction of Instruction Pairs that Load Instructions
4.21 Normalized Kernel Execution Time
4.22 Fraction of Issued Instruction Pairs that Contain Instruction Loads
4.23 Normalized Instruction Bandwidth Demand
4.24 Direct-Mapped Filter Cache Bandwidth Demand
4.25 Fully-Associative Filter Cache Bandwidth Demand
4.26 Comparison of Instructions Loaded Using Filter Caches and Instruction Registers
4.27 Instruction Delivery Energy
5.1 Operand Register Organization
5.2 Operand Register Microarchitecture
5.3 Intermediate Representation
5.4 Assembly for Conventional Register Organization
5.5 Assembly Using Explicit Forwarding and Operand Registers
5.6 Calculation of Operand Appearances
5.7 Calculation of Operand Intensity
5.8 Impact of Operand Registers on Data Bandwidth
5.9 Operand and Result Bandwidth
5.10 Impact of Register Organization on Data Energy
5.11 Execution Time
6.1 Index Registers and Address-Stream Registers
6.2 Finite Impulse Response Filter Kernel
6.3 Finite Impulse Response Filter
6.4 Assembly Fragment Showing Scheduled Inner Loop of FIR Kernel
6.5 Assembly Fragment Showing use of Index Registers in Inner Loop of fir Kernel
6.6 Assembly Fragment Showing use of Address-Stream Registers in Inner Loop of fir Kernel
6.7 Illustration of Image Convolution
6.8 Image Convolution Kernel
6.9 Assembly Fragment for Convolution Kernel
6.10 Assembly Fragment for Convolution Kernel using Address-Stream Registers
6.11 Convolution Kernel using Indexed Registers
6.12 Convolution Kernel using Vector Loads
6.13 Data Reuse in Indexed Registers
6.14 Static Kernel Code Size
6.15 Dynamic Kernel Instruction Count
6.16 Normalized Kernel Execution Times
6.17 Normalized Kernel Instruction Energy
6.18 Normalized Kernel Data Energy
7.1 Ensemble of Processors
7.2 Group Instruction Issue Microarchitecture
7.3 Composing and Receiving Messages in Indexed Registers
7.4 Source Code Fragment for Parallelized fir
7.5 Mapping of fir Data, Computation, and Communication to an Ensemble
7.6 Assembly Listing of Parallel fir Kernel
7.7 Mapping of Instruction Blocks to Instruction Registers
7.8 Parallel fir Kernel Instruction Fetch and Issue Schedule
7.9 Shared Instruction Register Allocation
7.10 Normalized Latency of Parallelized Kernels
7.11 Normalized Throughput of Parallelized Kernels
7.12 Aggregate Instruction Energy
7.13 Normalized Instruction Energy
8.1 Elm Memory Hierarchy
8.2 Elm Address Space
8.3 Flexible Cache Hierarchy
8.4 Stream Load using Non-Unit Stride Addressing
8.5 Stream Store using Non-Unit Stride Addressing
8.6 Stream Load using Indexed Addressing
8.7 Stream Store using Indexed Addressing
8.8 Local Ensemble Memory Interleaving
8.9 Memory Tile Architecture
8.10 Cache Architecture
8.11 Stream Architecture
8.12 Formats used by read Messages
8.13 Formats used for write Messages
8.14 Formats used for compare-and-swap Messages
8.15 Formats used by fetch-and-add Messages
8.16 Formats of copy Messages
8.17 Formats of read-stream Messages
8.18 Formats of write-stream Messages
8.19 Formats of read-cache Messages
8.20 Formats of write-cache Messages
8.21 Formats of read-stream-cache Messages
8.22 Formats of write-stream-cache Messages
8.23 Format of execute Command Messages
8.24 Format of dispatch Command Messages
8.25 Examples of Array-of-Structures and Structure-of-Arrays Data Types
8.26 Scalar Conversion Between Array-of-Structures and Structure-of-Arrays
8.27 Stream Conversion Between Array-of-Structures and Structure-of-Arrays
8.28 Image Filter Kernel
8.29 Implementation using Remote Loads and Stores [Kernel 1]
8.30 Implementation using Decoupled Remote Loads and Stores [Kernel 2]
8.31 Implementation using Block Gets and Puts [Kernel 3]
8.32 Implementation using Decoupled Block Gets and Puts [Kernel 4]
8.33 Implementation using Stream Loads and Stores [Kernel 5]
8.34 Normalized Kernel Execution Times
8.35 Memory Bandwidth Demand
8.36 Normalized Memory Access Breakdown
8.37 Normalized Memory Messages
8.38 Normalized Memory Message Data
C.1 Encoding of Jump Instructions
C.2 Encoding of Instruction Load Instructions
C.3 Forwarding Registers
C.4 Index Register Format
C.5 Address-Stream Register Format
C.6 Organization of the Ensemble Instruction Register Pool
C.7 Shared Instruction Register Identifier Encoding
C.8 Address Space
C.9 Memory Interleaving
C.10 Strided Stream Addressing Load
C.11 Strided Stream Addressing Store
C.12 Indexed Stream Addressing Load
C.13 Indexed Stream Addressing Store
C.14 Record Stream Descriptor
C.15 Instruction Format Used by Stream Operations

Chapter 1

Introduction

Embedded computer systems are everywhere. Most of my contemporaries at Stanford carry mobile phone handsets that are more powerful than the computer I compiled my first program on. Embedded applications are adopting more sophisticated algorithms, evolving into more complex systems, and covering broader application areas as increasing performance and efficiency allow new and innovative technologies to be implemented in embedded systems. Embedded applications will eventually exceed general-purpose computing in prevalence and importance.

I argue that efficient programmable embedded systems are a critical enabling technology for future embedded applications. Improving the energy efficiency of computer systems is presently one of the most important problems in computer architecture. Perhaps the only other problem of similar importance is easing the construction of efficient parallel programs. As this dissertation demonstrates, data and instruction communication dominates energy consumption in modern computer systems. Consequently, improving efficiency necessarily entails reducing the energy consumed delivering instructions and data to processors and function units.

This dissertation describes Elm, an efficient programmable architecture for embedded applications. A central theme of this dissertation is that the efficiency of programmable embedded systems can be improved significantly by exposing deep and distributed storage hierarchies to software. This allows software to exploit temporal and spatial reuse and locality at multiple levels in applications in order to reduce instruction and data movement. Elm implements a novel distributed and hierarchical system organization that allows software to exploit the abundant parallelism, reuse, and locality that are present in embedded applications. Elm provides a variety of mechanisms to assist software in mapping applications efficiently to massively parallel systems. To improve efficiency, Elm allows software to explicitly schedule and orchestrate the movement of instructions and data, and affords software significant control over the placement of instructions and data.

Although this dissertation mostly contemplates efficiency, the relative complexity of mapping applications to Elm informed many of the architectural decisions. Mapping applications from high-level descriptions of systems to detailed implementations imposes substantial engineering and time costs. The Elm architecture attempts to simplify the complex and arduous task of compiling and mapping software that must satisfy real-time performance constraints to programmable architectures. Elm is designed to allow applications to be compiled from high-level languages using optimizing compilers and efficient runtime systems, eliminating the need for extensive development in low-level languages. The Elm architecture is specifically designed to allow software to embed strong assertions about dynamic system behavior and to construct deterministic execution and communication schedules.

1.1 Embedded Computing

It has become rather difficult to differentiate clearly between embedded computer systems and general-purpose computer systems. Embedded systems continue to increase in complexity, scope, and sophistication. Advances in semiconductor and information technology have made massive computing capability both inexpensive and pervasive. The following paragraphs describe ways in which embedded applications and embedded computer systems differ significantly from general-purpose computer systems.

Power and Energy Efficiency — Embedded systems have demanding power and energy efficiency constraints. Active cooling systems and packages with superior thermal characteristics increase system costs. Many embedded systems operate in harsh environments and cannot be actively cooled. The limited space within enclosures may preclude the use of heat sinks. In mobile devices such as cellular telephone handsets, energy efficiency is a paramount design constraint because devices rely on batteries for energy and must be passively cooled.

Performance and Predictability — Contemporary embedded applications have demanding computation requirements. For example, signal processing in a 3G mobile phone handset requires upwards of 30 GOPS for a 14.4 Mbps channel, while signal processing requirements for a 100 Mbps OFDM [110] channel can exceed 200 GOPS [131]. Future applications will present even more demanding requirements as more sophisticated communications standards, compression techniques, codecs, and algorithms are developed. In addition, many applications impose real-time performance requirements on embedded computer systems. For example, a digital video decoder must deliver frames to the display at a rate that is consistent with the video stream, as there is little buffering in such systems to decouple decoding from presentation. Control systems, such as those found in automotive and telecommunications applications, impose even more demanding real-time constraints.

Designing systems to satisfy real-time performance constraints is challenging. Designers must reason about the dynamic behavior of complex systems with many interacting hardware and software components. The design process is simplified significantly when hardware and software are constructed to allow reasonably strong assertions to be made about how components perform and how they interact. Predicting the performance of modern processors is notoriously difficult because structures such as caches and branch prediction units that improve performance introduce significant amounts of variability into the execution of software. Embedded processors often provide mechanisms that allow software to eliminate some of the variability, such as caches that allow specific cache blocks and ways to be locked. However, such mechanisms are typically exposed to software in such a way that programmers must program at a very low level of abstraction. Application-specific fixed-function hardware can often be designed to deliver deterministic behavior that is readily analyzed and understood, though this comes at the expense of designers explicitly specifying the behavior of the hardware in precise detail.

For most embedded applications with real-time constraints, delivering performance and capabilities beyond those required to satisfy the real-time performance requirements offers little benefit. Rather than optimizing for computational performance, the design objective is often to minimize costs and reduce energy consumption.
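To give a sense of the scale of the signal-processing requirements cited at the start of this subsection, dividing the processing rates by the corresponding channel rates yields the implied per-bit budgets; this is a back-of-the-envelope calculation of mine, not a figure from the cited sources:

    \begin{align*}
      \frac{30~\text{GOPS}}{14.4~\text{Mbps}} \approx 2.1 \times 10^{3}~\text{operations per bit},
      \qquad
      \frac{200~\text{GOPS}}{100~\text{Mbps}} = 2 \times 10^{3}~\text{operations per bit}.
    \end{align*}

Roughly two thousand operations must be applied to every received bit, which is why the cost of each individual operation matters so much in these systems.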

Cost and Scalability — Embedded applications present a broad range of performance needs and cost allowances. Components within a cellular network such as base stations may be expensive because the systems are expected to be deployed for many years. Network operators prefer for these systems to be extensible and upgradeable so that deployed equipment can be updated as new technologies and standards are developed, and operators are accordingly willing to pay a premium for upgradeable systems that preserve their capital investments. Components in edge and client devices such as cellular handsets must be inexpensive as they are typically discarded after a few years.

Cost and energy efficiency considerations favor integrating more components of a system on a single die. Historically, costs associated with fabricating, testing, and packaging chips have been dominant in most embedded systems. The modest non-recurring engineering and tooling costs associated with the design, implementation, verification, and production of mask sets were amortized over the large production volumes needed to meet the demand for the consumer products that account for most embedded systems. Low volume systems, such as those used in defense and medical applications, exhibited different cost structures, and often had considerably greater selling prices.

However, it has become increasingly difficult and expensive to design, implement, and verify chips in advanced semiconductor processes. Because programmable systems are presently not efficient enough for demanding embedded applications to be implemented in software, many embedded systems require significant amounts of application-specific fixed-function logic to perform computationally demanding tasks. The design of these systems involves the manual partitioning and mapping of applications to many different hardware and software components, and intensive hardware design and validation efforts. The implementation complexity and effort associated with developing application-specific hardware impose significant engineering costs, and the time and effort required to implement and verify application-specific fixed-function logic increases with the scale and complexity of a system. Furthermore, the inflexibility of deployed systems that rely on application-specific fixed-function hardware increases the cost of implementing new standards and slows their adoption because existing infrastructure must be replaced.

The significant cost of designing new systems often discourages the development, adoption, and deployment of innovative applications, technologies, algorithms, protocols, and standards. For example, the expenses associated with developing novel medical equipment can limit the deployment and accessibility of important medical devices, as the potential markets for new devices may not be large enough to adequately amortize development costs in addition to those costs associated with regulatory approval and clinical trials. Because designs require very large volumes to justify such engineering costs, designs will be limited to individual products with large volumes and families of related products with large aggregate volumes. This favors configurable and programmable systems, such as programmable processors, field programmable gate arrays, and systems that integrate fixed-function hardware and programmable processors on a single die. Unfortunately, partitioning applications between software and hardware requires programming at low levels of abstraction to expose the hardware capabilities to software, which increases application software development costs and development times: the design and verification of a complex system-on-chip can require hundreds of engineer-years [102] and incur non-recurring engineering costs in excess of $20 M – $40 M.
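The amortization arithmetic behind this observation is simple; the sketch below is illustrative only, assuming a $30 M non-recurring engineering cost (the midpoint of the range quoted above) and hypothetical production volumes:

    # Illustrative only: per-unit burden of a fixed non-recurring engineering
    # (NRE) cost at several production volumes. The $30M figure is the midpoint
    # of the $20M-$40M range cited above; the volumes are hypothetical.
    NRE_DOLLARS = 30e6

    for volume in (100_000, 1_000_000, 10_000_000, 100_000_000):
        print(f"{volume:>11,} units: NRE adds ${NRE_DOLLARS / volume:>10,.2f} per unit")

At one hundred thousand units the engineering cost alone adds $300 to every part; at one hundred million units it adds thirty cents, which is why custom designs are economical only for very high-volume products or product families.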

1.2 Technology and System Trends

Billions of dollars in research and development have driven decades of steady improvements in semiconductor technology. These improvements have allowed manufacturers to increase the number of transistors that can be reliably and economically integrated on a single chip. Transistor densities have historically doubled approximately every two years with the introduction of a new CMOS process generation, and densities at present allow billions of transistors to be fabricated reliably on a single die.

Until recently, most computer architecture research focused on techniques for using additional transistors to improve the single-thread performance of existing applications. Processors were designed with deeper pipelines to allow clock frequencies to increase faster than transistor switching speeds improved with scaling, while increasingly complex and aggressive microarchitectures were used to improve instruction execution rates. Many of the additional transistors were used to implement impressive collections of execution units that allowed increasing numbers of instructions to be in flight and complete out of order. As pipeline depths and relative memory latencies increased, more of the additional transistors were used to implement the larger caches, more complex superscalar issue logic, and more elaborate predictors that were needed to deliver an adequate supply of instructions and data to the ravenous execution units. Fundamental physical constraints and basic economic constraints, such as the cost of building cooling systems to dissipate heat from integrated circuits, limit the ability of these techniques to deliver further improvements in sequential processor performance without continued voltage scaling.

Interconnect and Communication

Interconnect performance benefits less than transistor performance from scaling. Consequently, improvements in wire performance fail to maintain pace with improvements in transistor performance, and communication becomes relatively more expensive in advanced technologies. Classic linear scaling causes interconnect resistance per unit length to increase while capacitance per unit length remains approximately constant [36]. Because local interconnect lengths decrease with lithographic scaling, classical scaling causes local interconnect capacitances to decrease and resistance to increase such that the characteristic RC response time of scaled local wires remains approximately constant. The reduction in capacitance improves energy efficiency and the increase in wire densities delivers greater aggregate bandwidth, but the increase in resistance results in wires that appear slow relative to a scaled transistor. To compensate, the thickness of metal layers has often been scaled at a slower rate to reduce wire resistance [68], and wires in contemporary CMOS processes are often taller than they are wide. However, effects such as carrier surface scattering cause conductance to degrade more rapidly as scaled wires become narrower [67]. Wires that span fixed numbers of transistor gate pitches shrink with lithographic scaling, and the resulting reduction in capacitance compensates for some of the increase in resistance. Though repeater insertion and the use of engineered wires can partially ameliorate the adverse consequences of interconnect scaling [67], scaling effectively renders communication relatively more expensive than computation in advanced technologies. Somewhat worryingly, interconnect conductance decreases significantly when the mean free path becomes comparable to the thickness of a wire.
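The first-order argument can be summarized with a scale factor s > 1 applied to all lithographic dimensions. This is the standard textbook form of the analysis sketched above; it ignores fringing capacitance and the surface-scattering effects just mentioned:

    % Classical scaling of a local wire whose length, width, and height
    % all shrink by 1/s.
    \begin{align*}
      R_{\text{wire}} \propto \frac{\rho L}{W H}
        &\;\longrightarrow\; \frac{\rho (L/s)}{(W/s)(H/s)} = s\,R_{\text{wire}}, \\
      C_{\text{wire}} \propto \varepsilon L
        &\;\longrightarrow\; C_{\text{wire}}/s, \\
      \tau_{\text{wire}} = R_{\text{wire}} C_{\text{wire}}
        &\;\longrightarrow\; \tau_{\text{wire}} \quad \text{(approximately unchanged)},
    \end{align*}

while gate delays improve by roughly 1/s, so local wires become slower relative to logic even though their switching energy, proportional to the wire capacitance, continues to fall.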

Scaling, Power, and Energy Efficiency

Historically, semiconductor scaling simultaneously improved the performance, density, and energy efficiency of integrated circuits by scaling voltages with lithographic dimensions, commonly referred to as Dennard scaling [36]. Theoretically, Dennard scaling maintains a constant power density while increasing switching speeds deliver faster circuits and increasing transistor densities allow more complex designs to be manufactured economically. Unfortunately, the subthreshold characteristics of devices tend to degrade when threshold voltages are reduced as devices are scaled. The reduction in threshold voltages allows supply voltages to be lowered without degrading device performance, thereby reducing switching power dissipation and energy consumption. However, the reductions in transistor threshold voltages specified for Dennard scaling result in increased subthreshold leakage currents, which effectively limits the extent to which threshold voltages can be usefully reduced. Leakage currents presently contribute a significant fraction of the total energy consumption and power dissipation in modern processes designed for high-performance integrated circuits.

With the performance and efficiency of many contemporary systems limited by power dissipation, poor subthreshold characteristics seem likely to prevent further significant threshold voltage scaling [47, 69]. Increasing device variation and mismatch, which results from scaling [115], exacerbates the problem, as it becomes increasingly difficult to control device characteristics such as threshold voltages [117]. Without continued transistor threshold voltage scaling, further supply voltage reductions will reduce device switching speeds, effectively trading reductions in area efficiency for improvements in energy efficiency. High-performance systems are limited by their ability to deliver and dissipate power; many of the future improvements in the performance of these systems will depend on improvements in efficiency, so that computation demands less power. Similarly, mobile and embedded systems are limited by cost, power, and battery capacity. Consequently, energy efficiency has become a paramount design consideration and constraint across all application domains.
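For reference, the ideal scaling relations can be written with the same scale factor s; this is the standard first-order summary of Dennard scaling [36], not a derivation from data in this dissertation, and the final relation states the departure discussed above:

    % Ideal (Dennard) scaling with scale factor s > 1.
    \begin{align*}
      L,\, W,\, t_{ox} \to \tfrac{1}{s}, \qquad
      V_{DD},\, V_{T} \to \tfrac{1}{s}, \qquad
      C \to \tfrac{1}{s}, \qquad
      \tau \propto \tfrac{C\,V_{DD}}{I} \to \tfrac{1}{s}, \\
      P_{\text{device}} \propto C\,V_{DD}^{2}\,f \to \tfrac{1}{s^{2}}, \qquad
      \frac{P_{\text{device}}}{A_{\text{device}}} \to 1, \qquad
      I_{\text{leak}} \propto e^{-q V_{T}/(n k T)} \;\text{grows as}\; V_{T} \;\text{falls},
    \end{align*}

so power density ideally stays constant, but only if the threshold voltage continues to scale, which subthreshold leakage no longer permits.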

1.3 A Simple Problem of Productivity and Efficiency

Simple in-order processor cores, such as those used in low-power embedded applications, are more efficient than the complex cores that dominate modern high-performance computer architectures. With less elaborate hardware used to orchestrate the delivery of instructions and data, more of the energy consumed in simple cores is used to perform useful computation. However, merely integrating large numbers of simple conventional cores will not deliver efficiencies that are anywhere near sufficient for demanding embedded applications that currently require application-specific integrated circuits. As data presented in this dissertation demonstrate, most of the energy consumed in microprocessors is used to move instructions and data, not to compute. Consequently, programmable systems deliver poor efficiency relative to dedicated fixed-function hardware. The efficiency of programmable embedded systems is particularly poor compared to fixed-function implementations because the operations used in embedded applications, fixed-point arithmetic and logic operations, are inexpensive compared to delivering instructions and data to the function units: performing a 32-bit fixed-point addition requires less energy than reading its operands from a conventional register file and writing back its result, as Table 1.1 shows. Because interconnect benefits less than logic from improvements in semiconductor technology, the interconnect-dominated memories and buses that deliver instructions and data to the function units consume an increasing fraction of the energy. As discussed previously, this imbalance will not improve with advances in semiconductor technology. Instead, architectures must be designed specifically to reduce communication, capturing more instruction and data bandwidth in simple structures that are close to function units, to deliver improvements in efficiency.

This dissertation focuses mostly on technologies for improving efficiency. The efficient embedded computing project from which Elm evolved developed technologies that address both the problem of programming massively parallel systems and the problem of improving system efficiency. Both are challenging and ambitious research problems. The architecture I describe attempts to allow software to construct deterministic execution schedules to simplify the arduous task of compiling parallel software to meet real-time performance constraints.

1.4 Collaboration and Previous Publications

This dissertation describes work that was performed as part of an effort by members of the Concurrent VLSI Architecture research group within the Stanford Computer Systems Laboratory to develop architecture, compiler, and circuit technologies for efficient embedded computer systems [12, 13, 19, 33]. Jongsoo Park was responsible for the implementation and maintenance of the compilers we developed as part of the project. The benchmarks that appear in this dissertation were written and maintained by Jongsoo Park, David Black-Schaffer, Vishal Parikh, and myself.

1.5 Dissertation Contributions

This dissertation makes the following contributions.

Efficiency Analysis of Embedded Processors — This dissertation presented an analysis of the energy efficiency of a contemporary embedded RISC processor, and argued that inefficiencies inherent in conventional RISC architectures will prevent them from delivering the efficiencies demanded by contemporary and emerging embedded applications.


Datapath Operations                                        Energy      Relative
  32-bit addition                                          520 fJ      1×
  16-bit multiply                                          2,200 fJ    4.2×
  32-bit pipeline register                                 330 fJ      0.63×

Embedded RISC Processor                                    Energy      Relative
  Register File [32 entries, 2R+1W]
    32-bit read                                            250 fJ      0.48×
    32-bit write                                           470 fJ      0.90×
  Data Cache [2 KB, 4-way set associative]
    32-bit load                                            3,540 fJ    6.8×
    32-bit store                                           3,530 fJ    6.8×
    miss                                                   1,410 fJ    2.7×
  Instruction Cache [2 KB, 4-way set associative]
    32-bit fetch                                           3,500 fJ    6.8×
    miss                                                   1,410 fJ    2.7×
    128-bit refill                                         9,710 fJ    19×
  Instruction Filter Cache [64-entry direct-mapped]
    32-bit fetch                                           990 fJ      1.9×
    miss                                                   430 fJ      0.82×
    128-bit refill                                         2,560 fJ    4.9×
  Instruction Filter Cache [64-entry fully-associative]
    32-bit fetch                                           1,320 fJ    2.5×
    miss                                                   980 fJ      1.9×
    128-bit refill                                         2,610 fJ    5.0×

Execute add Instruction [Lower Bound]                      5,320 fJ    10.2×

Table 1.1 – Estimates of Energy Expended Performing Common Operations. The data were derived from circuits implemented in a 45 nm CMOS process that is tailored for low standby-power applications. The process uses thick gate oxides to reduce leakage, and a nominal supply of 1.1 V to provide adequate drive current. The energy used to execute an add instruction includes fetching an instruction, reading two operands from the register file, performing the addition, writing a pipeline register, and writing the result to the register file.
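The lower-bound entry at the bottom of Table 1.1 can be reproduced from the other rows; writing the sum out makes the imbalance explicit, with the actual addition contributing roughly a tenth of the total:

    % Lower-bound energy of an add instruction, assembled from Table 1.1.
    \begin{align*}
      E_{\text{add}} &= \underbrace{3{,}500}_{\text{instruction fetch}}
        + \underbrace{2 \times 250}_{\text{operand reads}}
        + \underbrace{520}_{\text{32-bit addition}}
        + \underbrace{330}_{\text{pipeline register}}
        + \underbrace{470}_{\text{result write}}
        = 5{,}320~\text{fJ}, \\
      \frac{520~\text{fJ}}{5{,}320~\text{fJ}} &\approx 9.8\%.
    \end{align*}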

Instruction Registers — This dissertation introduced the concept of instruction registers. Instruction registers are implemented as efficient small register files that are located close to function units to improve the efficiency at which instructions are delivered.

Operand Registers — This dissertation described an efficient register organization that exposes distributed collections of operand registers to capture instruction-level producer-consumer locality at the inputs of function units. This organization exposes the distribution of registers to software, and allows software to orchestrate the movement of operands among function units.

Explicit Operand Forwarding — This dissertation described the use of explicit operand forwarding to improve the efficiency at which ephemeral operands are delivered to function units. Explicit operand forwarding uses the registers at the outputs of function units to deliver ephemeral data, which avoids the costs associated with mapping ephemeral data to entries in the register files.


Indexed Registers — This dissertation presented a register organization that exposes mechanisms for accessing registers indirectly through indexed registers. The organization allows software to capture reuse and locality in registers when data access patterns are well structured, which improves the efficiency at which data are delivered to function units.

Address Stream Registers — This dissertation presented a register organization that exposes hardware for computing structured address streams as address stream registers. Address stream registers improve efficiency by eliminating address computations from the instruction stream, and improve performance by allowing address computations to proceed in parallel with the execution of independent operations.

The Elm Ensemble Organization — This dissertation presented an Ensemble organization that allows software to use collections of coupled processors to exploit data-level and task-level parallelism efficiently. The Ensemble organization supports the concept of processor corps, collections of processors that execute a common instruction stream. This allows processors to pool their instruction registers, and thereby increase the aggregate register capacity that is available to the corps. The instruction register organization improves the effectiveness of the processor corps concept by allowing software to control the placement of shared and private instructions throughout the Ensemble.

This dissertation presented a detailed evaluation of the efficiency of distributing a common instruction stream to multiple processors. The central insight of the evaluation is that the energy expended transporting operations to multiple function units can exceed the energy expended accessing efficient local instruction stores. In exchange, the additional instruction capacity may allow additional instruction working sets to be captured in the first level of the instruction delivery hierarchy. Consequently, the improvements in efficiency depend on the characteristics of the instruction working set. Elm allows software to exploit structured data-level parallelism when profitable by reducing the expense of forming and dissolving processor corps.

The Elm Memory System — This dissertation presented a memory system that distributes memory throughout the system to capture locality and exploit extensive parallelism within the memory system. The memory system provides distributed memories that may be configured to operate as hardware-managed caches or software-managed memory. This dissertation also described how simple mechanisms provided by the hardware allow the cache organization to be configured by software, which allows a virtual cache hierarchy to be adjusted to capture the sharing and reuse exhibited by different applications. The memory system exposes mechanisms that allow software to orchestrate the movement of data and instructions throughout the memory system. This dissertation described mechanisms that are integrated with the distributed memories to improve the handling of structured data movement within the memory system.

1.6 Dissertation Organization

Before proceeding it seems prudent to sketch an outline of the remainder of this dissertation. Chapter 2 presents an analysis of contemporary embedded architectures and argues that conventional architectures will not provide the efficiencies demanded by contemporary and emerging embedded applications. Chapter 3 introduces the Elm architecture, and presents the central unifying themes and ideas of the dissertation. Chapters 4 through 6 describe aspects of the Elm processor architecture that improve the efficiency at which instructions and operands are delivered to function units. Chapter 7 describes how collections of processors are assembled into Ensembles, and discusses various issues related to using collections of processors to exploit data-level and task-level parallelism. Chapter 8 describes the Elm memory system. Chapter 9 concludes the dissertation, summarizing contributions and suggesting future research directions.

Three appendices provide additional materials that would interrupt the flow of the thesis were they presented within the dissertation proper. Appendix A describes the experimental methodology and simulators that were used to obtain the data presented throughout this dissertation. Appendix B describes the kernel and benchmark codes that are discussed throughout the dissertation and are used to evaluate various ideas and mechanisms. Appendix C provides a listing of instructions that are somewhat unique to Elm and a detailed description of the Elm instruction set architecture. The reader may find these useful when contemplating various assembly code listings that appear in this dissertation.

Chapter 2

Contemporary Embedded Architectures

Contemporary embedded computer systems comprise diverse collections of electronic components. Advances in semiconductor scaling allow complex systems to be integrated on a single system-on-chip. Most system-on-chip components comprise heterogeneous collections of microprocessors, digital signal processors, application-specific fixed-function hardware accelerators, memory controllers, and input-output interfaces that allow the chip to communicate with other devices. In most contemporary embedded systems, the majority of the computation capability is provided by a few system-on-chip components.

It is reasonable to wonder whether we could replace the fixed-function hardware with collections of simple processors similar to the RISC processors that are prevalent in contemporary embedded systems. The processors would perhaps be organized as a multicore or manycore processor system-on-chip. It is possible for such an architecture to deliver sufficient arithmetic bandwidth to satisfy the demands of most applications, as a tiled collection of small processors can provide a large aggregate number of function units. However, I argue in this chapter that the energy efficiency of simple RISC processors is insufficient for such a system to deliver acceptable power and energy efficiency.

This chapter provides a brief examination of existing embedded computer systems. We begin with a case study of a simple and efficient embedded RISC processor that is representative of contemporary commercial processors; we consider its efficiency, and examine where and how energy is expended. We then discuss strategies and technologies that have been proposed for improving energy efficiency. We conclude with a survey of a few relevant architectures that have used these technologies to improve efficiency.
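As a rough illustration of the arithmetic-bandwidth claim, consider a tile of simple single-issue cores like the one characterized in Table 2.1; the core count below is my own choice for scale, not a configuration evaluated in this dissertation:

    \begin{align*}
      64~\text{cores} \times 1~\tfrac{\text{op}}{\text{cycle}} \times 500~\text{MHz}
        = 32~\text{GOPS},
    \end{align*}

which already exceeds the 30 GOPS cited in Chapter 1 for signal processing in a 3G handset. The difficulty, as the rest of this chapter argues, is not the arithmetic but the energy spent delivering instructions and data to those cores.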

2.1 The Efficiency Impediments of Programmable Systems

Analyzing how area and energy are consumed in simple RISC processors provides significant insight into the efficiency limitations of conventional programmable processors. Table 2.1 lists significant attributes of a very simple embedded processor. The processor is derived from a synthesizable implementation of a 32-bit SPARC processor designed for embedded applications [51]. We modified the design to remove everything except the integer pipeline, the instruction and data caches, and those parts of the memory controller that are needed to fetch instructions and data on cache misses. With the hardware needed to support sophisticated architectural features such as virtual memory removed, what remains is a design that approximates an efficient minimal embedded RISC processor. We further modified the implementation to support local clock gating throughout the processor.

Instruction Set Architecture   SPARC V8 [32-bit]
Clock Frequency                500 MHz
Pipeline                       7-Stage Integer Pipeline
Hardware Multiplier            16-bit × 16-bit with 40-bit Multiply-Accumulate Unit
Register Windows               2 windows
Register File                  40 registers [2R + 1W]
Instruction Cache              4 KB Direct-Mapped
Data Cache                     4 KB Direct-Mapped

Table 2.1 – Embedded RISC Processor Information. The processor is based on a synthesizable 32-bit SPARC core. The processor contains an integer pipeline, data and instruction cache controllers, and a basic memory controller interface. Many of the applications we consider in this dissertation operate on 16-bit data. The integer unit provides a 16-bit hardware multiplier, which is adequate for most operations, and implements 32-bit multiplication instructions as multi-cycle operations.

Area Efficiency

Figure 2.1 shows the area occupied by the processor. The entire design occupies a region that is about 0.5 mm by 0.3 mm. The area reported is measured after the design is placed and routed. As the data presented in the figure clearly illustrate, despite the relative simplicity of the architecture and implementation, the processor area is dominated by circuits that are used to stage instructions and data. The integer core contributes 22.5% of the area, the register file 11.0%, and the instruction and data caches contribute the remaining 66.6%. About 53% of the standard cell area in the integer core is contributed by flip-flops and latches, most of which are used to coordinate the movement of instructions to the function units and to stage operands and results. The tag and data arrays are implemented as high-density SRAM modules that are compiled and provided by the foundry. Despite the density and small capacities of the arrays, the area of the caches is dominated by the tag and data arrays, which consume 56.3% of the entire area. The cache and memory controllers occupy about 15.4% of the cache area.

[Figure 2.1 (processor area breakdown): integer core 0.034 mm² (logic 0.016 mm², register file 0.016 mm²); instruction and data caches 0.10 mm² (tag arrays 0.021 mm², data and instruction arrays 0.063 mm²)]

Figure 2.1 – Embedded RISC Processor Area. The area reported is measured after the design is placed and routed. The design occupies a region that is about 0.5 mm by 0.3 mm.

To evaluate the area efficiency, we need to account for the rate at which the processor completes application work. The number of instructions executed per cycle (IPC) provides a reasonable estimate of the effective processor utilization, though it does not account for how efficiently application operations may be mapped to the processor operations exposed by the instruction set. Figure 2.2 shows the average number of instructions executed per cycle for several common embedded kernels. The kernels are described in detail in Appendix B. The reported data were measured after the persistent instruction and data working sets are captured in the caches; any loss of performance due to cache misses is introduced by compulsory data misses. The processor has a peak instruction execution rate of one instruction per cycle, and the kernels are carefully coded and optimized to reduce the number of avoidable pipeline hazards and stalls. However, compulsory data cache misses, branch delay penalties, and true data dependencies result in the processor sustaining between 65% and 84% of peak instruction throughput.

[Figure 2.2 (average instructions per cycle by kernel): aes 0.735, conv2d 0.776, crc 0.840, crosscor 0.808, dct 0.679, fir 0.766, huffman 0.744, jpeg 0.654, rgb2yuv 0.781, viterbi 0.799]

Figure 2.2 – Average Instructions Per Cycle for an Embedded RISC Processor. The processor has a peak instructions per cycle (IPC) of 1 instruction per cycle. The reported average number of instructions executed per cycle is measured after the system reaches a steady state of operation, and reflects the behavior that would be observed when a single kernel is mapped to a processor and executes for a long period of time.

Energy Efficiency

In order to assess the energy efficiency of the processor, we need to understand where energy is expended and the contribution of different operations to the energy consumption. We first consider the relative contribution of the processor core and caches to the aggregate energy consumption. Figure 2.3 shows the distribution of energy consumed in the caches and processor core. The data have been normalized within each kernel to illustrate clearly the contribution of each component. As the figure illustrates, the composition of the energy consumption is somewhat insensitive to the computation performed by a kernel. The cache energies include both the energy consumed in the storage arrays and the cache controllers. Though the instruction and data caches are similar in structure and identical in capacity, the data cache is more expensive to access because the data cache controller implements additional hardware to support stores and must track which cache blocks have been modified. However, the instruction cache consumes more energy because it is accessed more often. The amount of energy consumed in the processor core is similar to the energy consumed in the caches. This results in part from the small 4 KB caches used in the evaluation.

[Figure 2.3 (normalized energy consumption by kernel): instruction cache, data cache, and core components]

Figure 2.3 – Energy Consumption by Module for an Embedded RISC Processor. The cache components include the cache data and tag memory arrays and the cache controllers. The core component includes the integer pipeline and all other control logic. Both components include energy consumed in the clock tree, with energy consumed within shared levels apportioned relative to the clock load within each module.

The composition of the dynamic instruction streams provides some insight into the relative contribution of different operations to the energy expended by the processor. It also explains the differences in the relative contributions of the caches. Figure 2.4 shows the frequency of different instruction types in the dynamic instruction streams comprising the kernel executions. The control component corresponds to instructions that affect control flow, such as jumps and conditional branches. The memory component corresponds to instructions that transfer data between registers and memory, such as load and store instructions. The arithmetic and logic component corresponds to compute instructions that operate on data stored in registers.

As Figure 2.4 illustrates, the kernels are dominated by compute and memory instructions. Arithmetic and logic instructions alone account for approximately 70% of the instructions that are executed across the kernels. The small contribution of control instructions results from the compiler aggressively unrolling loops with predictable loop bounds. The notable exceptions are the huffman and jpeg kernels, both of which contain significant loops with data-dependent iteration counts and data-dependent conditional branches.

[Figure 2.4 (dynamic instruction mix by kernel): control, memory, and arithmetic and logic components]

Figure 2.4 – Distribution of Instructions by Operation Type in Embedded Kernels. The control component includes control flow instructions, the preponderance of which are conditional branches and jump instructions. The memory component includes data transfer instructions that access memory, which are almost exclusively load and store instructions. The arithmetic and logic component includes all arithmetic and logic instructions, which are dominated by integer arithmetic and logic operations. The processor implements a multiply-and-accumulate instruction, which encodes two basic arithmetic operations. The multiply-and-accumulate instruction is used extensively in the arithmetically intensive kernels.

The distribution of instruction types provides an incomplete measure for assessing the energy efficiency of the processor because it does not account for how different instructions contribute to the computation demanded by the kernels. Many of the arithmetic and logic instructions encode operations that are not fundamental to the kernels, but a consequence of their implementation. For example, arithmetic and logic instructions may appear in instruction sequences that implement control and memory access operations that cannot be encoded directly in the instruction set. Data references and control statements in programming languages often map to sequences of instructions. For example, a compiler may map an effective address calculation to an instruction sequence comprising an arithmetic instruction followed by a load instruction. It is important that we differentiate operations that are fundamental to a computation from computations that are a consequence of the architecture a computation is mapped to.

To assess correctly the impact of the processor architecture on its energy efficiency, we implemented a compiler data flow analysis to classify instructions as control and memory operations based on how the values the instructions compute are used rather than the operations the instructions perform. The data flow analysis can analyze assembly code listings produced by other compilers. It uses the results of a reaching definitions analysis to propagate data flow information backwards from control flow and memory instructions. The reaching definitions analysis follows variables that are stack allocated and temporarily spilled and restored during a procedure invocation through spills and restores to fixed memory locations. The analysis identifies instructions that compute condition operands to control flow operations and address operands to memory operations. The analysis propagates data flow information backwards through arithmetic and logic instructions to identify sets of dependent arithmetic and logic instructions that compute control and address operands. Arithmetic and logic instructions that define values that are used exclusively as condition and address operands, or exclusively to compute condition and address operands, are classified as control and memory operations.
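A simplified sketch of this classification appears below; it is not the compiler pass used in the dissertation. The sketch assumes a single straight-line block in SSA-like form, where every destination register is unique, so a single reverse walk sees every use of a value before reaching its definition; it omits the handling of control flow and of values spilled to and restored from the stack.

    # Simplified instruction classification: an arithmetic or logic result used
    # only to produce condition or address operands inherits the control or
    # memory classification of its consumers; values that also feed the
    # computation proper remain compute instructions.
    from collections import defaultdict, namedtuple

    # op is 'alu', 'load', 'store', or 'branch'; dst names the result register
    # (or None); srcs lists data operands; addr lists address operands.
    Instr = namedtuple("Instr", "op dst srcs addr")

    def classify(block):
        use_tags = defaultdict(set)          # register -> tags from its uses
        kinds = [None] * len(block)

        for idx in reversed(range(len(block))):
            ins = block[idx]
            if ins.op == "branch":
                kinds[idx] = "control"
                for r in ins.srcs:           # condition operands
                    use_tags[r].add("control")
            elif ins.op in ("load", "store"):
                kinds[idx] = "memory"
                for r in ins.addr:           # address operands
                    use_tags[r].add("memory")
                for r in ins.srcs:           # stored values feed the computation
                    use_tags[r].add("compute")
            else:                            # arithmetic or logic instruction
                tags = use_tags.get(ins.dst, set())
                if "compute" in tags or not tags:
                    kinds[idx] = "compute"
                elif tags == {"memory"}:
                    kinds[idx] = "memory"
                else:                        # control, or both control and memory
                    kinds[idx] = "control"
                # Operands of a reclassified instruction inherit its classification.
                for r in ins.srcs:
                    use_tags[r].add(kinds[idx])
        return kinds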


[Figure 2.5 plot: stacked bars showing, for each kernel, the fractions of control, memory, and compute instructions after reclassification.]

Figure 2.5 – Distribution of Instructions within Embedded Kernels. The control component includes control flow instructions and those arithmetic and logic instructions that compute operands to control flow instructions. The memory component includes data transfer instructions that access memory and those arithmetic and logic instructions that provide address operands to data transfer instructions. The compute component includes arithmetic and logic instructions that implement computations demanded by the kernels. An arithmetic or logic instruction that defines a value that is both fundamental to the computation of a kernel and used by a control or memory operation is designated a compute instruction. An arithmetic or logic instruction that defines a value that is an operand to both a control and a memory operation is designated a control instruction.

Arithmetic and logic instructions that define values that are used exclusively as condition and address operands, or exclusively to compute condition and address operands, are classified as control and memory operations. The results of the analysis appear in Figure 2.5. The data illustrated in the figure provide a more detailed accounting of the composition of the instructions comprising the kernels. Compute operations are encoded in 53% of the instructions on average. The crosscor kernel is dominated by memory operations, and only 32% of its instructions encode compute operations. The rgb2yuv kernel has very simple control and small data working sets that the compiler maps to the register file, and 77% of its instructions encode compute operations.

To assess the energy efficiency of the processor, we also need to account for the impact of the microarchitecture on the energy consumed in the processor core. A significant component of the energy expended in the processor is used to deliver operands to the function units and to deliver control signals decoded from instructions to data-path control points. Though necessary to execute instructions, such energy is an overhead imposed by the programmability afforded by the architecture. Figure 2.6 shows a detailed analysis of the contribution of instruction delivery, data delivery, and compute to the aggregate energy consumption of each kernel.


[Figure 2.6 plot: stacked bars showing, for each kernel, the fractions of energy consumed by instruction delivery, data delivery, and compute.]

Figure 2.6 – Distribution of Energy Consumption within Embedded Kernels. Instruction delivery and data delivery contribute the majority of the aggregate energy consumption for each kernel.

The instruction delivery energy reflects the preponderance of the energy used to deliver instructions to the data-path control points. Significant components include the energy used to access the instruction cache and transfer instructions to the processor, the energy expended in the instruction fetch and decode logic, and the energy expended in the flip-flops that implement the interstage pipeline registers that capture instructions and control signals. The data delivery energy reflects the preponderance of the energy used to transfer data to the function units. Significant components include the energy expended accessing the data cache and transferring data between the processor core and the data cache, the energy expended accessing the register file and driving results back to the register file, and the energy expended in the flip-flops that implement the interstage pipeline registers that capture operands and results. The compute component measures the energy used in the function units to complete operations demanded by a kernel. Perhaps the most salient and significant aspect of the data is the small component of the energy contributed by compute, which varies from 6.0% to 12.8%.

Table 2.2 provides the average energy per instruction for each of the kernels. The table separately lists the components of the average energy per instruction that we attribute to instruction delivery, data delivery, and compute for each of the kernels. Differences between the kernels reflect differences in the computations and working sets.

2.2 Common Strategies for Improving Efficiency

We can reason about many strategies for improving efficiency as mechanisms that attempt to reduce instruction and data bandwidth demand somewhere in the system. Some attempt to shift more of the bandwidth demand closer to the function units, while others attempt to eliminate instruction and operand references. Often, these schemes are described as exploiting some form of reuse and locality to keep more control and data bandwidth close to the function units. The following are some of the most pervasive and effective techniques.


              aes   conv2d   crc   crosscor   dct    fir   huffman   jpeg    rgb2yuv   viterbi
Instruction  34.9    33.2   30.3     32.2    37.7   34.0     35.4    39.9      32.8      32.5
Data         23.9    23.0   22.8     21.2    26.1   19.8     17.7    21.85     19.5      19.9
Compute       7.4     3.6    7.9      4.1     7.5    3.9      4.9     5.0       7.2       4.1
             66.2    59.8   61.0     57.5    71.3   57.7     58.0    66.7      59.4      56.5
Other         5.6     7.5    5.3      6.7    13.5    6.2      8.1    21.0      11.7      12.3
Total        71.8    67.3   66.3     64.2    84.8   63.9     66.1    87.0      71.1      68.7

Table 2.2 – Average Energy Per Instruction for an Embedded RISC Processor. The energy is reported in picojoules and is computed by averaging over all of the instructions comprising a kernel. The average energy per instruction captures the energy consumed performing the computation that is fundamental to a kernel, the energy consumed transferring instructions and data to the function units, and the energy consumed orchestrating the movement of instructions and data. The Instruction row lists the energy that is consumed delivering instructions to the function units. The Data row lists the energy consumed delivering data to the function units. The Compute row lists the energy consumed in the function units to execute those instructions that perform operations that are intrinsic to a kernel, as opposed to those operations that are executed to coordinate the movement and sequencing of instructions and data. The Other row lists the energy contributed while the processor is stalled. Mechanisms such as leakage current and activity in the clock network when the processor is stalled contribute a significant component of the average energy per instruction. Most of the flip-flops in the design are clock-gated, and consequently there is little activity near the leaf elements when the processor is stalled; however, nodes near the root of the clock tree have the largest capacitance of any nodes in the processor design and have the highest activity factors.

Exploit Locality to Reduce Data and Instruction Access Energy

We can exploit locality in instruction and operand accesses to improve energy efficiency. Typically, this entails using register organizations that keep operands close to function units, and using memory hierarchies that exploit reuse and locality to capture important instruction and data working sets close to processors. These organizations usually use small structures to reduce access energies and distribute storage structures close to function units to reduce communication and transport energies.
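As a simple, hedged illustration in generic C (not Elm code): keeping a small sliding window of samples in local variables lets most operand references be served from registers close to the function units, with only one new sample read from memory per output.

    /* Hedged illustration: a 3-tap filter whose window lives in local scalar
     * variables, so reused samples travel through registers rather than being
     * reloaded from the data cache on every iteration. */
    void fir3(const int *x, int *y, int n, int c0, int c1, int c2)
    {
        int w0 = x[0], w1 = x[1];            /* window held in registers      */
        for (int i = 2; i < n; i++) {
            int w2 = x[i];                   /* one new sample read per output */
            y[i - 2] = c0 * w0 + c1 * w1 + c2 * w2;
            w0 = w1;                         /* shift the window locally       */
            w1 = w2;
        }
    }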

Amortize Instruction Fetch and Control

We can use single-instruction multiple-data (SIMD) architectures to reduce instruction bandwidth when applications exhibit regular data-level parallelism. Traditional vector processors improve efficiency by applying an instruction to multiple operands, amortizing instruction fetch and decode energy [59]. Many modern architectures provide collections of sub-word SIMD instructions that apply an operation to multiple operands that are packed in a common register. These schemes also eliminate or amortize the dependency checks that are required to determine when an operation may be initiated safely, which effectively reduces demand for control hardware bandwidth.
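The effect of a sub-word SIMD add can be sketched in portable C as below; the function is a hedged scalar emulation for illustration, whereas a real sub-word SIMD instruction performs both 16-bit additions as a single operation, amortizing one fetch and decode over both elements.

    /* Hedged illustration: add two pairs of 16-bit values packed into 32-bit
     * words.  Each lane wraps modulo 2^16, as a packed-halfword add would. */
    #include <stdint.h>
    #include <stdio.h>

    static inline uint32_t add16x2(uint32_t a, uint32_t b)
    {
        uint16_t lo = (uint16_t)((a & 0xFFFFu) + (b & 0xFFFFu));  /* low lane  */
        uint16_t hi = (uint16_t)((a >> 16) + (b >> 16));          /* high lane */
        return ((uint32_t)hi << 16) | lo;
    }

    int main(void)
    {
        uint32_t a = (7u << 16) | 40000u;   /* packs high = 7, low = 40000 */
        uint32_t b = (9u << 16) | 30000u;   /* packs high = 9, low = 30000 */
        uint32_t s = add16x2(a, b);         /* high = 16, low = 70000 mod 65536 = 4464 */
        printf("high lane = %u, low lane = %u\n", s >> 16, s & 0xFFFFu);
        return 0;
    }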


Use Simple Control and Decode Hardware

Most commercial embedded processors use simple hardware architectures to improve area and power efficiency. Many use RISC architectures that offer simple control and decode hardware. High-end digital signal processors often use VLIW architectures to exploit efficiently the instruction-level parallelism found in many traditional signal processing algorithms [43]. These systems rely on compiler technology rather than hardware to discover and exploit instruction-level parallelism, and are able to scale to very wide issue. Many rely on predication to conditionally execute instructions to expose sufficient instruction-level parallelism. Traditional VLIW processors suffer from poor instruction density and large code sizes. Digital signal processors that use VLIW architectures rely on sophisticated instruction compression and packing schemes to eliminate the need for explicit no-ops and improve code sizes.

Exploit Operation-Level Parallelism with Compound Instructions

We can reduce the number of instructions that need to be executed to implement a computation by exploiting operation-level parallelism and encoding multiple simple dependent operations as a single machine instruction. For example, many modern architectures provide fused multiply-add instructions that encode a multiplication operation and a dependent addition operation. Like the instruction bundles used to encode independent instructions in VLIW architectures, compound instructions encode multiple arithmetic and logic operations; unlike instruction bundles, which encode multiple independent operations, compound instructions usually encode dependent operations. Many architectures fuse short sequences of simple instructions into compound instructions to improve performance and reduce instruction bandwidth and energy. These can be common compound arithmetic operations that are provided by the hardware and selected by a compiler [105, 71]. They may be constructed by configuring stages in a function unit. For example, a processor could allow software to configure the stage that implements the different rounding and saturating arithmetic modes [10]. They may rely on a combination of compiler analysis and hardware mechanisms for dynamically configuring a cluster of simple function units [23]. As with VLIW architectures, there is a limited number of useful operations that can be performed without additional register file bandwidth to deliver more operands to the compound instructions. The additional control logic, multiplexers, and pipeline registers required by the additional function units, along with the increased interconnect distances that must be traversed between the register file and the function units, increase the expense of executing simple instructions.
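For example, the inner-product update below is the canonical source pattern that a compiler can map to a single multiply-and-accumulate compound instruction on machines that provide one; the sketch is ordinary C with no Elm-specific assumptions.

    /* Hedged illustration: the multiply plus dependent add in the loop body is
     * the pattern a compiler can encode as one multiply-and-accumulate,
     * halving the arithmetic instruction count relative to separate ops. */
    int dot(const int *a, const int *b, int n)
    {
        int acc = 0;
        for (int i = 0; i < n; i++)
            acc += a[i] * b[i];   /* multiply + dependent add: one compound op */
        return acc;
    }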

Exploit Application-Specific Instructions

We can provide dedicated instructions that implement highly stereotyped application-specific operations, replacing sequences of simpler instructions, to reduce instruction and operand fetch bandwidth. Commercial digital signal processors often implement application-specific instructions to accelerate core kernels in important signal processing algorithms.


For example, some digital signal processors for mobile applications provide instructions that perform domain-specific operations, such as compare-select-max-and-store and square-distance-and-accumulate [140], to accelerate specific baseband processing algorithms in cellular telephone applications. Extensible processor architectures such as the Xtensa processor cores designed by Tensilica allow custom application-specific instructions to be introduced to a load-store instruction set architecture at design time [53]. As a more aggressive example, recent TriMedia media processors implement complex super-instructions that issue across multiple VLIW instruction slots and function units [143]. The super-instructions implement specific operations that are used in the context-based adaptive binary arithmetic coding scheme used in advanced digital video compression standards. These super-instructions essentially embed application-specific fixed-function hardware in the function units.

The complex behaviors encoded by application-specific instructions can make it difficult for compilers to use them effectively, as the compiler must be designed to expose and identify clusters of operations that can be mapped to application-specific instructions. The problem is more acute when application-specific instructions use unconventional data types and operand widths. Consequently, important application codes that can benefit from application-specific instructions often must be mapped manually to assembly instructions.

Switch to Multicore Architectures

Multicore architectures integrate multiple processor cores in a system-on-chip [112]. As microprocessor designs became increasingly complex, using the additional transistors provided by technology scaling to implement multiple simple cores required considerably less engineering effort than attempting to design, implement, and verify increasingly aggressive and complex speculative superscalar out-of-order processors [56]. Multicore architectures consciously prefer improved aggregate system performance to improved single-thread performance [5], and most multicore processors use increasing transistor budgets to exploit thread-level parallelism rather than instruction-level parallelism [15]. Because power dissipation constraints and wire delays limit the performance and efficiency of complex processors, almost all processor designs have shifted to a paradigm in which new processors provide more cores [127, 125, 144, 92, 8, 124], deliver modest microarchitecture improvements to increase single-thread performance and power efficiency [144, 92], and support additional thread-level parallelism within cores [97, 108]. To further improve energy efficiency, contemporary designs implement dynamic voltage and frequency scaling schemes [21] that allow single-thread performance to be reduced in exchange for greater energy efficiency [55, 92].

Exploit Voltage Scaling and Parallelism

We can exploit voltage scaling to improve energy efficiency in applications with task-level and data-level parallelism by mapping parallel computation across multiple processors [153, 22]. The basic idea is to use parallelism to exploit the quadratic dependence of energy on supply voltage in digital circuits [21]. If we ignore the overheads of task and data distribution and communication, mapping parallel computation across multiple processors results in a linear increase in area and switched capacitance. In exchange, we can lower the supply voltage to produce a quadratic reduction in the dynamic energy per operation.
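As an illustrative first-order calculation using the standard CMOS dynamic-energy model (the particular voltages are assumptions chosen for exposition, not design targets): suppose the work is spread across two processors so that each can run at roughly half the clock rate, allowing the supply to be lowered from 1.0 V to 0.7 V. Then

    E_{\mathrm{op}} \propto C \, V_{DD}^{2}
    \qquad\Longrightarrow\qquad
    \frac{E'_{\mathrm{op}}}{E_{\mathrm{op}}}
      = \left(\frac{0.7\,\mathrm{V}}{1.0\,\mathrm{V}}\right)^{2} \approx 0.49 ,

so the dynamic energy per operation roughly halves, while the area and the aggregate switched capacitance approximately double.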


As we should expect, there are limits. Physical design considerations and constraints such as power supply noise and device variation often limit the extent to which supply voltages can be lowered reliably. Many circuits, such as conventional static random access memory cells, have a minimum supply voltage below which they will not operate reliably. Eventually, minimum supply voltage constraints, decreasing efficiency gains, or increasing circuit delays that may not be acceptable in latency-sensitive systems limit the extent to which voltage scaling can be used to improve efficiency. Furthermore, most consumer applications are cost sensitive, and the reduction in clock frequency and the accompanying degradation in area efficiency result in what are often unacceptable increases in silicon and packaging costs.

Devise Manycore Architectures

The terms manycore and massively multicore are often used to refer to multicore architectures with tens to hundreds of cores. This definition is somewhat problematic because process scaling allows the number of cores in a multicore architecture to increase steadily. Multicore effectively describes architectures descended from single processors through processes in which the number of standard processor cores per die essentially doubles when a new semiconductor process generation doubles transistor densities [11]. The term manycore is perhaps best reserved for scalable architectures designed specifically to improve application performance and efficiency by distributing parallel computations across large numbers of highly integrated cores. These architectures often rely on explicitly parallel programming models to expose sufficient parallelism.

Manycore architectures are specifically designed to exploit data-level and task-level parallelism within applications. They focus on delivering massive numbers of function units efficiently, and support massive amounts of parallelism. Unlike the communication primitives found in multicore processors, which descend from architectures designed to support multi-chip multiprocessors typically comprising 2 to 16 processors, manycore architectures integrate scalable communication systems that exploit the low communication latencies and abundant communication bandwidth available on chip. Consequently, manycore architectures deliver significantly more arithmetic bandwidth than multicore processors, and usually achieve superior area and energy efficiency when handling applications with sufficient structure and locality. However, the ability of tiled manycore architectures that use conventional RISC microprocessors and digital signal processors [139, 52, 14, 46, 54, 153] to improve efficiency is limited by the efficiency of the processors from which they are constructed: tiling provides additional function units, which increases throughput, but does not improve the architectural efficiency of the individual cores.

2.3 A Survey of Related Contemporary Architectures

Data parallel architectures such as the Stanford Imagine processor [121], the MIT Scale processor [88], and NVIDIA graphics processors [99] amortize instruction issue and control overhead by issuing the same instruction to multiple parallel function units. However, the efficiency of data parallel architectures deteriorates when applications lack sufficient regular data parallelism.


Sub-word SIMD architectures that use mechanisms such as predicate masks to handle divergent control flow [106, 98, 128] perform particularly poorly when applications lack adequate regular data parallelism because multiple control paths must be executed and operations discarded [135, 73]. The following sections discuss in detail several architectures that are representative of alternative efficient embedded and application-specific processor architectures.

MIT Raw and Tilera TILE Processors

The MIT Raw processor is perhaps the canonical example of a tiled multicore processor. The Raw processor comprises 16 simple RISC cores connected by multiple mesh networks [137]. The Raw architecture exposes low-level computation, storage, and communication resources to software, and relies on compiler technology to efficiently map computations onto hardware [147]. The Raw architecture exposes software-scheduled scalar operand networks that are tightly coupled to the function units for routing operands between tiles [138]. The Raw architecture was designed to exploit instruction-level parallelism; its architects designed the architecture to scale beyond what were considered to be communication limitations of conventional superscalar architectures by exposing wire delays and distributed resources directly to software [139]. The Raw compiler models the machine as a distributed register architecture and attempts to map the preponderance of communication between function units onto these scalar operand networks using compiler technologies such as space-time scheduling of instruction-level parallelism [96]. The Raw architecture was better able to exploit task-level and data-level parallelism than instruction-level parallelism, and could be effectively programmed using stream programming languages such as StreamIt [141, 75, 49].

The Tilera TILE architecture is the commercial successor to the MIT Raw architecture. The TILE64 processor is a system-on-chip with 64 conventional 3-wide VLIW cores that are connected by multiple mesh networks [18]. The TILE processor architecture adds support for sub-word SIMD operations, which improve the performance of various video processing applications that process 8-bit and 16-bit data. The TILE architecture provides both software-routed scalar operand networks and general-purpose dynamically routed networks, and retains the ability to send data directly from registers and receive data in registers when communication is mapped to the software-controlled networks [148]. To decouple the order in which messages are received from the order in which software processes the messages, the software-controlled networks implement multiple queues and allow messages to be tagged for delivery to a specific queue. A shared SRAM stores messages at the receiver, and small dedicated per-flow hardware queues provide direct connections to the function units [18]. The scalar operand networks improve the efficiency at which fine-grain pipeline parallelism and task parallelism can be mapped to the architecture.

Berkeley VIRAM Vector Processor

The Berkeley VIRAM processor is a vector processor that uses embedded DRAM technology to implement a processor-in-memory. Like conventional vector processors, VIRAM has a general-purpose scalar control processor and a vector co-processor. It implements 4 64-bit vector lanes and a hardware vector length of 32 [84]. The architecture allows instructions to operate on sub-word SIMD operands within each lane, which increases the effective vector length of the machine when operating on packed data elements.


When sufficient regular data-level parallelism is available in a loop, increasing the vector length reduces instruction bandwidth, as a single vector instruction is applied to more elements. To allow data-parallel loops that contain control-flow statements to be mapped to the vector units, the VIRAM instruction set architecture includes vector flags, compress, and expand operations to support conditional execution.

Stanford Imagine and SPI Storm Stream Processors

Stream processors are optimized to exploit the regular data-level parallelism found in signal processing and multimedia applications [81, 74, 35]. Like vector processors, stream processor architectures have a scalar control processor that issues commands to a co-processor comprising a collection of data-parallel execution units. Stream applications are expressed as computation kernels that operate on streams of data. The kernels are compute-intensive functions that operate on data streams, and usually correspond to the critical loops in an application. The data streams are sequences of records. The data stream construct makes communication between kernels explicit, which greatly simplifies the compiler analyses that are required to perform code transformations that improve data reuse and locality. The control processor coordinates the movement of data between memory and the data-parallel execution units, and launches kernels that execute on the data-parallel execution units. Stream memory operations transfer data between memory and a large stream register file that stages data to and from the data-parallel units. Stream processors differ most significantly from conventional vector architectures in their use of software-managed memory hierarchies to allow software to explicitly control the staging of stream transfers through the memory hierarchy [121]. The storage hierarchy allows intra-kernel reuse and locality to be captured in registers, and inter-kernel locality to be captured in a large on-chip stream register file. The stream register file is also used to stage large bulk data transfers between DRAM and registers. The capacity of the stream register file allows it to be used to tolerate long memory access times. Irregular data-level parallelism requires the use of techniques such as conditional streams to allow arbitrary stream expansion and stream compression in space and time [73].

The Stanford Imagine processor has 8 data-parallel lanes, each of which has 7 function units [81]. Imagine uses a micro-sequencer to coordinate the sequencing of kernel invocations and the issuing of the instructions comprising a kernel to the data-parallel units, which decouples the launching of a kernel on the control processor from the execution of the data-parallel units. This differs somewhat from conventional vector processors, which issue individual vector instructions in order to the vector unit [10]. The function units within a lane allow Imagine to exploit instruction-level parallelism, and the multiple lanes allow Imagine to exploit SIMD data-level parallelism. The SPI Storm-1 processor is the commercial successor to Imagine. It integrates the control processor on the same chip as the data-parallel execution units, and increases the number of data-parallel lanes from 8 to 16 [80].

MIT Scale Processor

The MIT Scale vector-thread processor supports both vector and multithreaded processing. It implements a vector-thread architecture in which a control thread executes on a control processor and issues instructions to a collection of micro-threads that execute concurrently on parallel vector lanes [88].


The control thread allows common control and memory operations to be factored out of the instruction streams that are issued to the micro-threads, which allows regular control flow and data accesses to be handled efficiently. The control processor issues vector fetch instructions to dispatch blocks of instructions to the micro-threads. The architecture allows micro-threads to execute independent instruction streams, which improves the handling of irregular data-parallel applications. For example, the micro-threads may diverge after executing scalar branches contained in a block of instructions that was dispatched by a vector fetch command. To support divergent micro-thread execution, the Scale processor provisions more instruction fetch bandwidth than a conventional vector processor would. The Scale processor implements a clustered function unit organization that supports 4-wide VLIW execution within each vector lane [87], which allows it to exploit well-structured instruction-level parallelism. To reduce the amount of hardware and miss state required to sustain the large number of in-flight memory requests that are generated, Scale implements a vector refill unit that pre-executes vector memory operations and issues any needed cache line refills ahead of regular execution [17].

Illinois Rigel Accelerator

The Illinois Rigel Accelerator is a programmable accelerator architecture that emphasizes support for data-parallel and task-parallel computation [77]. It provides a collection of simple cores and hierarchical caches, and emphasizes support for single-program multiple-data models of computation. The memory hierarchy has two levels: a local cluster cache that is shared by 8 processors, and a distributed collection of global cache banks that are associated with off-chip memory interfaces. To simplify the mapping of applications to the accelerator, the programming model structures computation as collections of parallel tasks. Interprocessor communication uses shared memory and is implicit. The programming system provides a cached global address space, and relies on a software memory coherence scheme to reduce communication overhead. Explicit synchronization instructions are provided to allow the runtime system to implement barriers, and to accelerate the enqueuing and dequeuing of tasks from distributed work queues. The instruction set architecture exposes mechanisms that support implicit locality management, such as load and store instructions that specify where data may be cached. The memory consistency model provides coherence at a limited number of well-defined points within an application, and relies on software to enforce the cache coherence protocol [78].

2.4 Chapter Summary

This chapter presented an analysis of the efficiency of a simple embedded RISC processor that provides a reasonable approximation of an efficient modern embedded processor. The analysis reveals that the preponderance of the energy consumed in the processor is used to transfer instructions and data to the function units, not to compute. This result explains and quantifies the significant disparities between the efficiencies of programmable processors and application-specific fixed-function hardware. It also implies a limit to the efficiencies that can be achieved using conventional processor architectures in manycore systems. Mapping parallel computations to collections of microprocessors and digital signal processors provides additional function units, which improves throughput, but does not improve the architectural efficiencies of the cores. As a consequence, aggregate system efficiency will ultimately be limited by the efficiency of the individual processor cores.

Chapter 3

The Elm Architecture

This chapter introduces the Elm architecture. It motivates and develops important concepts that informed the design of Elm, and presents many of the ideas that are developed in subsequent chapters.

3.1 Concepts

Embedded applications have abundant task-level and data-level parallelism, and there is often considerable operation-level parallelism within tasks. Components of embedded systems such as application-specific integrated circuits exploit this abundant parallelism by distributing computation over many independent function units; both performance and efficiency improve because instructions and data are distributed close to the function units. Efficiency is further improved by exploiting the extensive reuse and producer-consumer locality in embedded applications. Individual components in embedded systems tend to perform the same computation repeatedly, which is why they can benefit from fixed-function logic. This behavior may be used to expose extensive instruction and data reuse.

Programmability allows tasks to be time-multiplexed onto collections of function units, which improves both area efficiency and energy efficiency. Systems that use special-purpose fixed-function logic usually provision dedicated hardware for each application and task. Depending on application demand, some fixed-function hardware will be idle, resulting in resource underutilization and poor area efficiency. For example, cellular telephone handsets that support multiple wireless protocols typically implement multiple baseband processing blocks even though multiple blocks cannot be used concurrently. Time multiplexing tasks onto programmable hardware allows a collection of function units to implement different applications and different tasks within applications as demanded by the system. Programmability improves energy efficiency when data communication dominates by allowing tasks to be dispatched to data. Rather than requiring that data be transferred to and from hardware capable of implementing some specific task, the task may be transferred to the data, reducing data communication and aggregate instruction and data bandwidth demand.



Figure 3.1 – Elm System Architecture. The system comprises a collection of processors, distributed memories, memory controllers, and input-output controllers. The modules are connected by an on-chip interconnection network.

3.2 System Architecture

The Elm architecture exploits the abundant parallelism, reuse, and locality in embedded applications to achieve performance and efficiency. Figure 3.1 provides a conceptual illustration of an Elm system-on-chip. The latency-optimized control processor handles system management tasks, while the computationally demanding parts of embedded applications are mapped to the dense fabric of efficient Elm processors. Efficiency is achieved by keeping a significant fraction of instruction and data bandwidth close to the function units, where hierarchies of distributed register files and local memories deliver operands and instructions to the function units efficiently. The processors are simple enough that custom circuit design techniques could be used throughout to improve performance and efficiency, and small enough that large amounts of arithmetic bandwidth can be implemented economically. We estimate that an Elm system implemented in a 45 nm CMOS process would deliver in excess of 500 GOPS at less than 5 W in about 10 mm × 10 mm of die area.


[Figure 3.2 diagram: the storage and communication hierarchy, from system memory beyond the chip boundary, through distributed memory and the interconnection network within a region, to an Ensemble with its Ensemble memory, and within each processor a register file, operand register files (ORFs), and instruction register files (IRFs) feeding the ALU and XMU, all linked by the local communication fabric.]

Figure 3.2 – Elm Communication and Storage Hierarchy. Instructions are issued from instruction register files (IRFs) distributed among the function units. Distributed operand register files (ORFs) provide inexpensive locations for capturing short-term reuse and locality close to function units. Collections of processors are grouped to form Ensembles, which share a local Ensemble Memory. Collections of Ensembles are grouped with a distributed memory tile to form a region. The local communication fabric provides low-latency, high-bandwidth communication links and allows processors to communicate through registers.

Figure 3.2 illustrates the distributed and hierarchical register and memory organization. The Elm processors are statically scheduled dual-issue 32-bit processors. Instructions are issued in pairs, with one instruction sent to the ALU pipeline and one to the XMU pipeline. The ALU executes complex arithmetic, logic, and shift instructions; the XMU executes simple arithmetic, memory, control, communication, and synchronization instructions. Instructions are issued from registers to reduce the cost of delivering instructions. The distributed and hierarchical register organization reduces the cost of accessing local instructions and data. The Elm processors use block instruction and data transfers to improve the efficiency of transferring instructions and data between registers and memory.


The processors communicate over an on-chip interconnection network and a distributed local communication fabric. The memory and communication systems near the processors provide predictable latencies and bandwidth to simplify the process of compiling to real-time performance and efficiency constraints. The compute, memory, and communication resources are exposed to software to provide fine-grain control over how computation unfolds, and the programming systems and compiler may explicitly orchestrate instruction and data movement. The result is that the complexity and expense of developing large embedded systems lies in the development of compilers and programming tools, the cost of which is amortized over many systems, rather than in the design and verification of special-purpose fixed-function hardware, which must be paid for in each new system.

Ensembles

Clusters of four processors are grouped into Ensembles, the primary physical design unit for compute resources in an Elm system. The processors in an Ensemble are loosely coupled, and share local resources such as a pool of software-managed memory and an interface to the interconnection network. The Ensemble organization is exposed to software, and the software concept of an Ensemble provides a metaphor for describing and encoding spatial locality among the processors within an Ensemble. The Ensemble memory captures instruction and data working sets close to the processors, and is banked to allow the local processors and the network interface controller to access it concurrently.

The compiler can exploit task-level parallelism by executing parallel tasks on the processors in an Ensemble, capturing fine-grain producer-consumer data locality within the Ensemble to improve efficiency. Reuse among the processors reduces the demand for instruction and data bandwidth outside the Ensemble. For example, executing parallel iterations of a loop on different processors in an Ensemble increases performance by exploiting data-level parallelism; amortizing the cost of moving the instructions and data to the Ensemble memory increases efficiency by exploiting instruction and data reuse across iterations.

Remote memory accesses are performed over the on-chip interconnection network. The Ensemble memory can be used to stage transfers. Hardware at the network interface supports bulk memory operations to assist the compiler in scheduling proactive instruction and data movements to hide remote memory access latencies. The Ensemble organization effectively concentrates interconnection network traffic at Ensemble boundaries and allows multiple processors to share a network interface and the associated connection to the local router. This amortizes the network interface and network connection across multiple processors, which provides an important improvement in area efficiency because the processors are relatively small.

The local communication fabric connects the processors in an Ensemble with high-bandwidth point-to-point links. The fabric can be used to capture predictable communication, and allows the compiler to map instruction sequences across multiple processors. The fabric can also be used to stream data between Ensembles deterministically at low latencies. The fabric is statically routed, and avoids the expense of cycling large memory arrays that is incurred when communicating through memory.
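The parallel-loop mapping described earlier in this section can be sketched in generic SPMD-style C; the my_id argument and the static four-way striding are illustrative assumptions rather than the Elm programming interface.

    /* Hedged sketch: parallel iterations of a loop distributed across the four
     * processors of an Ensemble.  Each processor runs the same code on a strided
     * subset of the iterations, so one copy of the loop's instructions staged in
     * the Ensemble memory serves all four processors. */
    #define ENSEMBLE_SIZE 4

    void scale_vector(int *x, int n, int k, int my_id /* 0..3, assumed supplied */)
    {
        for (int i = my_id; i < n; i += ENSEMBLE_SIZE)
            x[i] = k * x[i];
    }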


Efficient Instruction Delivery

Elm uses compiler-managed instruction registers (IRs) to store instructions close to the function units. Instructions are issued directly from instruction registers. Instruction registers keep a significant fraction of the instruction bandwidth close to the function units by capturing reuse and locality within loops and kernels. The instruction registers are organized as shallow instruction register files (IRFs) that are located close to the function units. The register files are fast and inexpensive to access, and their proximity to the function units reduces the energy required to transfer instructions to the datapaths. The shallow register files require few bits to address, allowing a short instruction counter (IC) to be used instead of a conventional program counter. An IC is less expensive to maintain, which reduces the energy consumed coordinating instruction issue and transfers of control. The contents of the instruction registers and the instruction counter implicitly define the state captured by a program counter. A complete set of control-flow instructions is implemented, and arbitrary transfers of control are permitted, provided that the target instruction resides in a register. Relative target addresses are resolved while instructions are loaded to improve efficiency: the hardware that resolves relative targets is active only when instructions are being loaded.

The instruction registers are backed by the Ensemble memory. The compiler explicitly orchestrates the movement of instructions into registers by scheduling instruction loads. An instruction load moves a contiguous block of instructions from memory into registers. The compiler specifies the number of instructions to be loaded, where the instructions reside in memory, and into which registers the instructions are to be loaded. A status bit associated with each instruction register records whether a load is pending. The processor continues to execute during instruction loads, stalling when it encounters a register with a load pending. Decoupling the loading and issuing of instructions allows the compiler to hide memory access latencies when loading instructions by anticipating control transfers and proactively scheduling loads. The instruction registers reduce the demand for instruction bandwidth at the Ensemble memory by filtering a significant fraction of the instruction references, which allows a single memory read port and a single memory write port to be used to access both instructions and data. This simplifies the memory arbitration hardware, and reduces the degree of banking required in the Ensemble memory.

Instructions can be loaded directly from remote and system memory, avoiding the cost of staging instructions in the local Ensemble memory, and the distributed memories located throughout the system can be used as hardware-managed instruction caches. The caches assist software by providing a mechanism for capturing instruction reuse across kernels and control transfers that are difficult for the compiler to discover.

Scheduling the compiler-managed instruction registers requires additional compiler analysis, allocation, and scheduling passes, which increase the complexity of the instruction scheduler implemented in the Elm compiler. In exchange, the instruction register organization provides better efficiency and performance than an instruction cache. Because they do not require tags, the instruction registers are faster and less expensive to access than a cache of the same capacity. The tag overhead in a small cache with few blocks is significant because the tag needs to cover most of the address bits, as few of the address bits are covered by the index and block offset.
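As an illustrative calculation (the cache parameters below are assumptions chosen for exposition, not the Elm instruction store design): for a 1 KB direct-mapped instruction cache with 32-byte blocks and 32-bit addresses,

    \text{offset} = \log_2 32 = 5\ \text{bits}, \qquad
    \text{index} = \log_2\!\left(\frac{1024}{32}\right) = 5\ \text{bits}, \qquad
    \text{tag} = 32 - 5 - 5 = 22\ \text{bits},

so each 32-byte (256-bit) block carries 22 bits of tag, roughly 9% storage overhead, in addition to the energy of a tag comparison on every access.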


Although increasing the block size will reduce the tag overhead, doing so increases conflict misses because there are fewer entries. It also leads to increased fragmentation within the blocks, increasing the probability that instructions will be transferred to the cache unnecessarily. The instruction register organization allows fragmentation to be avoided by specifying transfers at instruction boundaries rather than cache block boundaries.

Because the compiler has complete control over where instructions are placed in the IRF and when instructions are evicted, it can reduce misses and can schedule instruction loads well in advance of when the instructions will be issued to avoid stalls in the issue stage of the pipeline. Conversely, software has less control over where instructions are placed in an instruction cache, which makes it more difficult to determine which instructions will experience conflict misses unless the cache is direct-mapped. This becomes significant when mapping the kernels that dominate application performance to small instruction stores. Furthermore, most instruction caches load instructions in response to a miss, which causes the processor to stall each time control transfers to an instruction that is not in the cache. Software-initiated instruction prefetching allows some cache miss latencies to be hidden, but suffers two disadvantages: the compiler has little or no control over which instructions are evicted, and instruction prefetches are typically limited to transferring a single cache block. Many prefetch instructions must be executed to transfer what a single instruction register load would. Hardware prefetch mechanisms reduce the need to prefetch instructions explicitly, but cannot anticipate complicated transfers of control. Most simply prefetch ahead in the instruction stream based on the current program counter or the address of recent misses, and may transfer instructions that will not be executed to the cache.

Efficient Operand Delivery

Elm uses a distributed and hierarchical register organization to exploit reuse and locality so that a significant fraction of operands are delivered from inexpensive registers that are close to the function units. The register organization provides software with extensive control over the placement and movement of operands. Shallow operand register files (ORFs) located at the inputs of the function units capture short-term data locality and keep a significant fraction of the data bandwidth local to each function unit. The shallow register files are fast and inexpensive to access. Placing the register files at the inputs to the function units reduces the energy used to transfer operands and results between registers and the function units. Operands in an ORF are read in the same cycle in which the operation that consumes them executes; this simplifies the bypass network and reduces the number of operands that pass through pipeline registers. The distributed ORF locations are exposed to allow software to place operands close to the function units that will consume them. The pipeline registers that receive the results of the function units are exposed as result registers, which can be used to explicitly forward values to subsequent instructions. A result register always holds the result of the last instruction executed on the function unit. The compiler explicitly forwards ephemeral values through result registers to avoid writing dead values back to the register files, which can consume as much energy as simple arithmetic and logic operations. Explicit operand forwarding also keeps ephemeral values from unnecessarily consuming registers, which relieves register pressure and avoids spills.


The general-purpose register file (GRF) and the larger indexed register file (XRF) form the second level of the register hierarchy. The capacity of the GRF and XRF allows them to capture larger working sets, and to exploit data locality over longer intervals than can be captured in the first level of the register hierarchy. The general-purpose registers back the operand registers. Reference filtering in the first level of the hierarchy reduces demand for operand bandwidth at the GRF and XRF, allowing the number of read ports to the GRF and XRF to be reduced. Registers in the second level of the hierarchy are more expensive to access because the register files are larger and further from the function units, but the additional capacity allows the compiler to avoid more expensive memory accesses. Because both the ALU and the XMU can read the registers in the GRF, the compiler may allocate variables that are used in both function units to the GRF rather than replicating variables in the first level of the register hierarchy.

Bulk Memory Operations

Elm allows the XRF to be accessed indirectly through index pointer registers (XPs), which are located in the index-pointer register file (XPF). This allows the XRF to capture structured reuse within arrays, vectors, and streams of data, reuse that would otherwise be captured in significantly more expensive local memory. Each XP provides a read pointer and a write pointer. The read pointer is used to address the XRF when the XP appears as an operand, and the write pointer is used to address the XRF when the XP appears as a destination. Hardware automatically updates the pointers after each use so that the next use of the XP register accesses the next element of the array, vector, or stream. Software controls how many registers in the XRF are assigned to each XP by specifying a range for each XP, and hardware automatically wraps the pointers after each update so that they always point to an element in their assigned range.

Elm provides a variety of bulk memory operations to allow instructions and data to be moved efficiently through the memory system. To reduce the number of instructions that must be executed to move data between memory and the register hierarchy, Elm provides high-bandwidth vector load and store operations. Once initiated, vector memory operations complete asynchronously, which decouples the execution of vector loads and stores from the access of the memory port when moving data between registers and memory. For example, the XMU may execute arithmetic operations while a vector memory operation completes in the background. Vector memory operations may use the XRF to stage transfers between memory and registers. By automating address calculations, vector memory operations also reduce the number of instructions that must be executed to calculate addresses before executing loads or stores. When used to move data between the XRF and memory, vector memory operations provide twice the bandwidth of their scalar equivalents, transferring up to 64 bits each cycle. Vector gather and scatter operations can be implemented using the indexed registers to deliver indices. To allow software to schedule the movement of individual elements while exploiting the reduced overheads of vector memory operations, Elm exposes most of the hardware used to implement vector memory operations to software. This allows software to transfer elements directly between operand registers and memory, avoiding the expense of using the large indexed register files to stage data.
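The pointer behaviour described above can be modelled in a few lines of C; the structure fields, the function names, and the half-open register range are illustrative assumptions, not the Elm instruction set.

    /* Hedged software model of an index pointer (XP) with automatic
     * post-increment and wrap-around over its assigned XRF range. */
    struct xp {
        int base, limit;   /* XRF registers [base, limit) assigned to this XP */
        int rd, wr;        /* read and write pointers into the XRF            */
    };

    static int xp_read(const struct xp_unused, struct xp *p, const int *xrf);

    static int xp_read_value(struct xp *p, const int *xrf)
    {
        int v = xrf[p->rd];
        p->rd = (p->rd + 1 < p->limit) ? p->rd + 1 : p->base;  /* increment + wrap */
        return v;
    }

    static void xp_write_value(struct xp *p, int *xrf, int v)
    {
        xrf[p->wr] = v;
        p->wr = (p->wr + 1 < p->limit) ? p->wr + 1 : p->base;
    }

Successive reads through the XP walk circularly through the registers assigned to it, which is what lets the XRF behave like a small software-managed buffer for arrays, vectors, and streams.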


Local Communication

Elm allows processors to communicate through registers using the local communication fabric. Dedicated message registers are used to send and receive data over the local communication fabric. The message registers provide single-cycle transport latencies within an Ensemble: operands arrive in the cycle after being sent and can be consumed immediately. Synchronized send and receive operations allow code executing on different processors to synchronize explicitly on message registers. The synchronized communication instructions use presence bits associated with each message register to track whether a local message register is empty, and whether a remote message register is full. Non-blocking send and receive operations are provided to transfer values efficiently between processors whose execution is synchronized. The non-blocking communication instructions sustain higher transfer bandwidths because the sender does not need to wait for the receiver to acknowledge having accepted a value, and because the compiler can schedule predicated send operations between processors as it would conditional move operations within a processor.
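A hedged software model of the synchronized case is sketched below; the busy-waiting C code merely illustrates the presence-bit protocol between a single sender and a single receiver, whereas the hardware tracks the bit directly in the message register file.

    /* Hedged model: a message register guarded by a presence bit.  The flag
     * is polled in software here purely for illustration. */
    #include <stdatomic.h>

    struct msg_reg {
        _Atomic int present;   /* 1 when the register holds an unread value */
        int value;
    };

    static void sync_send(struct msg_reg *r, int v)
    {
        while (atomic_load(&r->present)) ;   /* wait until the register is empty */
        r->value = v;
        atomic_store(&r->present, 1);        /* mark the value as present        */
    }

    static int sync_receive(struct msg_reg *r)
    {
        while (!atomic_load(&r->present)) ;  /* wait until a value has arrived   */
        int v = r->value;
        atomic_store(&r->present, 0);        /* mark the register as empty       */
        return v;
    }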

Processor Corps

The Ensemble organization reduces the cost of delivering instructions when data-parallel codes are executed across the processors within an Ensemble by capturing shared instruction working sets in the Ensemble memory. For example, when executing a parallel loop across an Ensemble, a single bulk transfer may be used to load the loop’s instructions into the Ensemble memory before distributing them to the instruction registers. This amortizes the energy expended transferring instructions to the Ensemble, and reduces demand for instruction bandwidth outside the Ensemble.

Single-instruction multiple-data (SIMD) parallelism can be exploited to further reduce the cost of delivering instructions within an Ensemble. SIMD parallelism is exploited within an Ensemble using the concept of processor corps. Processors operating as a processor corps execute the same instruction stream: one processor in the corps fetches instructions from its instruction registers and broadcasts them to the others in the group. Software explicitly orchestrates the creation and destruction of processor corps. Instructions may be issued from any of the instruction registers belonging to the processors participating in a processor corps. The additional instruction register capacity improves efficiency when larger instruction working sets are captured in registers, as fewer instructions are loaded from memory. Distributing instructions to multiple processors improves energy efficiency by amortizing the cost of reading the instruction register files, although the energy expended transferring instructions to the datapath control points increases because instructions must traverse greater distances.

Processors can quickly form and leave processor corps. Processors form a processor corps by executing a distinguished jump instruction that allows control to transfer to any of the instruction registers in an Ensemble. Processors leave a processor corps by transferring control back into their local instruction registers. The processors wait until all members have joined, which creates an implicit synchronization event. This allows the compiler to use processor corps to schedule parallel iterations of loops and independent invocations of a kernel across multiple processors efficiently, even when the code contains conditional statements.


To simplify the mapping of data-parallel computations to processor corps, the address generation step of the standard load and store instructions uses the local address registers to generate the effective memory address. Each processor within the group generates its own effective address, which allows the processors to access arbitrary memory locations. Additional load and store instructions that insert the processor identifier into the bank selection bits of the effective address are available for accessing data that has been striped across the banks, which ensures that each processor accesses a different bank. Elm provides additional vector load and store instructions in which the effective address computed by one processor is forwarded to multiple memory banks to eliminate unnecessary address computations when accessing contiguous memory locations in a processor corps. Additional message register names are available to processors in an issue group to simplify the naming of other processors within the group; the additional register names are aliases for existing physical message registers.

The predicate registers are partitioned into group and local registers. Group predicate registers are kept consistent across an issue group by broadcasting updates to all of the processors in the group. This allows predicated instructions to execute consistently across the issue group. Local predicate registers are private to each processor, and updates are not visible outside the local processor. Software can use the local predicate registers to control which processors execute instructions that are issued to a group, or to hold predicates that are used when executing outside of an issue group.
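The bank-selection variant can be sketched as follows; the bit positions and the four-bank assumption are illustrative inventions, since the field layout is not specified here.

    /* Hedged illustration: forming a banked effective address by inserting the
     * processor identifier into assumed bank-selection bits [3:2]. */
    #include <stdint.h>

    static inline uint32_t banked_address(uint32_t ea, uint32_t proc_id)
    {
        const uint32_t bank_shift = 2;                 /* assumed bit position  */
        const uint32_t bank_mask  = 0x3u << bank_shift;
        return (ea & ~bank_mask) | ((proc_id & 0x3u) << bank_shift);
    }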

Distributed Stream and Cache Memory

Elm is designed to support many concurrent threads executing across an extensible fabric of processors. To improve performance and efficiency, Elm implements a hierarchical and distributed on-chip memory organization in which the local Ensemble memories are backed by a collection of distributed memory tiles. Elm exposes the memory hierarchy to software, and allows software to control the placement and communication of data explicitly. This allows software to exploit the abundant instruction and data reuse and locality present in embedded applications. Elm allows software to transfer data directly between Ensembles using Ensemble memories at the sender and receiver to stage data. This allows software to be mapped to the system so that the majority of memory references are satisfied by the local Ensemble memories, which keeps a significant fraction of the memory bandwidth local to the Ensembles. Elm provides various mechanisms that assist software in managing reuse and locality, and that assist software in scheduling and orchestrating communication. This also allows software to improve performance and efficiency by scheduling explicit data and instruction movement in parallel with computation.

3.3 Programming and Compilation

Elm relies extensively on programming systems and compiler technology, some of which is described in this dissertation. The instruction set architecture allows effective optimizing compilers to be constructed using existing compiler technology. The effectiveness of these optimizing compilers can be improved by incorporating various compiler technologies we developed during the design of the Elm architecture and the implementation of Elm. This section describes various software tools that we developed for Elm.


actor FIR {
    // [ local state ]
    const int[N] coeffs;

    // [ constructor ]
    FIR(const int[N] coeffs) {
        this.coeffs = coeffs;
    }

    // [ stream map operator ]
    (int[N] in stride=1) -> (int[1] out) {
        int sum = 0;
        for (int i = 0; i < N; i++) {
            sum += coeffs[i]*in[i];
        }
        out[0] = sum;
    }
}

Figure 3.3 – Elk Actor Concept. The code fragment shown implements a finite impulse response filter.

The Elk Programming Language

Elk is a parallel programming language for mapping embedded applications to parallel manycore architectures. The Elk language is based on a concept for a parallel programming language I designed for Elm; the language allows structured communication to be exposed as operations on data streams, and allows real-time constraints to be expressed as annotations that are applied to streams. Jongsoo Park was responsible for the principal design and implementation of the Elk language. Aspects of the Elk programming language syntax that he devised were inspired by the StreamIt language [141].

The Elk compiler and run-time systems were developed to allow applications to be mapped to Elm systems. The Elk programming language allows application developers to convey information about the parallelism and communication in applications, which allows the compiler to automatically parallelize applications and implement efficient communication schemes.

The Elm elmcc Compiler Tools

The elmcc compiler is an optimizing C and C++ compiler that we developed to compile kernels. It may be used to compile applications, though programmers must manually partition applications and explicitly map parallel kernels and threads to processors. The elmcc compiler produces load position independent code for a single thread that will execute correctly on any processor within the system. The compiler recognizes various intrinsic functions that provide low-level access to hardware mechanisms, such as the local communication fabric and stream memory operations. The primary purpose of the intrinsic functions is to provide an application programming interface for high-level programming systems and run-times such


bundle BandPassFilter {
  int[N] hpfCoeffs, lpfCoeffs;

  BandPassFilter(int lowFreq, int highFreq) {
    computeHpfCoeffs(hpfCoeffs, lowFreq);
    computeLpfCoeffs(lpfCoeffs, highFreq);
  }

  static void computeHpfCoeffs(int[N] coeffs, int cutoff) {
    ...
  }

  static void computeLpfCoeffs(int[N] coeffs, int cutoff) {
    ...
  }

  // [ stream map operator ]
  (int[] in) -> (int[] out) {
    in >> FIR(hpfCoeffs) >> FIR(lpfCoeffs) >> out;
  }
}

Figure 3.4 – Elk Bundle Concept. The code fragment shown implements a band-pass filter as a cascade of a high-pass filter and a low-pass filter.

bundle Main {
  () -> () {
    stream in rate=1MHz;
    FileReader("in.dat")
      >> in
      >> BandPassFilter<32>(1000, 3000)
      >> FileWriter("out.dat");
  }
}

Figure 3.5 – Elk Stream Concept. The code fragment shown defines a 1 MHz stream and passes the stream through a band-pass filter.

as those used to implement Elk to access low-level hardware mechanisms; they also allow programmers to program at a very low level of abstraction without resorting to assembly.

The elmcc compiler uses a standard GNU C/C++ compiler front-end and a custom optimizing back-end. The compiler uses the LLVM [94] framework to pass compiled and optimized virtual machine code from the front-end to the back-end. The back-end maps the low-level virtual machine code to Elm machine code. The elmcc back-end implements a number of machine-specific optimizations, such as instruction register scheduling and allocation, local communication scheduling, and operand register allocation; it also implements standard instruction selection and scheduling optimizations. Implementing the elmcc compiler this way allows us to leverage the extensive collections of high-level code transformations and optimizations that have been integrated in the GNU C/C++ compiler. The elmcc compiler allows a machine description file


actor CombineDft {
  complex_f[N/2] w; // coefficients

  CombineDft() {
    for (int i = 0; i < N/2; i++) {
      w[i] = complex_f(cos(-2*PI*i/N), sin(-2*PI*i/N));
    }
  }

  (complex_f[N] in) -> (complex_f[N] out) {
    for (int i = 0; i < N/2; i++) {
      complex_f y0 = in[i];
      complex_f y1 = in[N/2 + i];
      complex_f y1w = y1*w[i];
      out[i] = y0 + y1w;
      out[N/2 + i] = y0 - y1w;
    }
  }
}

actor FftReorderSimple {
  (complex_f[N] in) -> (complex_f[N] out) {
    for (int i = 0; i < N; i+=2) {
      out[i/2] = in[i];
    }
    for (int i = 1; i < N; i+=2) {
      out[N/2 + i/2] = in[i];
    }
  }
}

bundle FftReorder {
  (complex_f[] in) -> (complex_f[] out) {
    stream[log2(N)] temp;
    in >> temp[0];
    for (int i = 1; i < (N/2); i*=2) {
      temp[log2(i)] >> FftReorderSimple() >> temp[log2(i) + 1];
    }
    temp[log2(N) - 1] >> out;
  }
}

bundle Main {
  () -> () {
    int N = 256;
    stream[log2(N) + 1] temp;
    FileReader("in.dat") >> FftReorder() >> temp[0];
    for (int i = 2; i <= N; i *= 2) {
      temp[log2(i) - 1] >> CombineDft() >> temp[log2(i)];
    }
    temp[log2(N)] >> FileWriter("out.dat", 3072);
  }
}

Figure 3.6 – Elk Implementation of FFT.


to be passed as a command line argument. Machine description files describe the capabilities and resources of the target machine, and are typically used to conduct sensitivity experiments in which machine resources such as the number of operand registers are varied.

The elmcc linker allows multiple object files produced by the elmcc compiler to be combined into Ensemble and application bundles. The linker resolves and binds those symbols and locations that can be resolved at link time, such as the locations of data and code objects in the system and Ensemble address spaces. Like the elmcc compiler, the linker allows a machine description file to be passed as a command line argument.

The elmcc loader is responsible for binding any remaining unresolved symbols when applications load. The loader requires both an application bundle and an application mapping file. The application bundle contains the application code and data objects, and may contain symbolic references to Ensembles, distributed memories, processors, and so forth. The application mapping file specifies where code and data objects should be placed when an application is loaded, and is used to resolve symbolic address and location references. The modified application bundle produced by the loader may contain a combination of load position independent code, which will execute correctly on any processor within a system; load position dependent code, which only executes correctly on a specific processor; and relative load position dependent code, which will execute correctly provided that other code objects are mapped similarly. Relative load position dependent code typically contains references that access the local communication fabric or local Ensemble memory. The loader requires a machine description file that specifies the Ensemble and memory resources of a machine.

3.4 Chapter Summary

This chapter described an Elm system and introduced many of the concepts and ideas that will be developed in subsequent chapters. A common theme throughout the thesis will be ways in which we can improve efficiency by exposing distributed computation, communication, and storage resources to software. This allows software to exploit parallelism and locality in applications.

The following five chapters extend the concepts introduced in this chapter and describe the Elm architecture in greater detail. The chapters follow a common structure. We first present the major concepts developed in the chapter. We then describe how these concepts are applied in Elm, using specific mechanisms in the Elm architecture as concrete examples, and present examples that illustrate how the ideas presented in the architecture section improve efficiency. We then present a quantitative evaluation of the concepts and architectural mechanisms presented in the chapter. The evaluation is followed by a summary of related work.

Chapter 4

Instruction Registers

This chapter introduces instruction registers, distributed collections of software-managed registers that extend the memory hierarchy used to deliver instructions to function units. Instruction registers exploit instruction reuse and locality to improve efficiency. They allow critical instruction working sets such as loops and kernels to be stored close to function units. Instruction registers are implemented efficiently as distributed collections of small instruction register files. Distributing the register files allows instructions to be stored close to function units, reducing the energy consumed transferring instructions to data-path control points. The aggregate instruction storage capacity provided by the distributed collection of register files allows them to capture important working sets, while the small individual register files are fast and inexpensive to access.

The remainder of this chapter is organized as follows. I explain general concepts and describe the operation of instruction registers. I then describe microarchitectural details of the instruction register organization implemented in Elm. This is followed by examples illustrating the use of instruction registers. I then describe compiler algorithms for allocating and scheduling instruction registers. I conclude this chapter with a comparison of the performance and efficiency of instruction registers to similar mechanisms for reducing the amount of energy expended delivering instructions.

4.1 Concepts

Embedded applications are dominated by kernels and loops that have small instruction working sets and extensive instruction reuse and locality. Consequently, a significant fraction of the instruction bandwidth demand in embedded applications can be delivered from relatively small memories. The instruction caches and software-managed instruction memories found in most embedded processors have enough capacity to accommodate working sets that are larger than most embedded kernels. Often, first-level instruction caches are made as large as possible while still allowing the cache to be accessed in a single cycle at the desired clock frequency. As a result, the instruction caches found in common embedded processors are much larger than a single kernel or loop body: it is not uncommon to find embedded processors with instruction caches ranging from 4 KB to 32 KB, while many of the important kernels and loops in


(Figure 4.1 depicts a pair of instruction register files (IRFs) in the instruction register namespace, indexed by the instruction pointer (IP) and backed by a memory in the global address namespace, with issued instructions delivered from the IRFs.)

Figure 4.1 – Instruction Register Organization. Instruction registers are software-managed memory that let software capture critical instruction working sets close to the function units. The instruction registers form their own namespace, which differs from the address space used to identify instructions in the backing memory. Instruction registers are implemented as small, fast register files. The instruction register files are distributed to reduce the energy expended transferring instructions from the register files to the function units. Software orchestrates the movement of instructions from the backing memory to the instruction registers. The instruction pointer (IP) specifies the address of the next instruction to be issued from the instruction registers.

embedded applications comprise fewer than 1000 instructions and are smaller than 4 KB. Increasing the cache capacity reduces the number of instructions that must be fetched from the memories that back the instruction cache, which improves performance and often improves efficiency as well. It improves performance by reducing the number of stalls that occur when there is a miss in the instruction cache; it improves efficiency by reducing the amount of energy spent transferring instructions from backing memories, which are typically much more expensive to access. However, the energy expended accessing the first-level instruction cache increases with its capacity. In this way, larger instruction caches reduce the average energy expended accessing instructions across large instruction working sets at the expense of increasing the energy expended accessing instructions within small working sets, such as a loop nest or kernel.

Elm reduces the cost of delivering instructions by extending the instruction storage hierarchy with a distributed collection of software-managed instruction register files (IRFs) to capture the small instruction working sets found in loops and kernels. The organization is illustrated in Figure 4.1. Essentially, we can reason about instruction registers as collections of very small software-managed caches residing at the bottom of the instruction memory hierarchy. We use the term register because, like their data register equivalents, instruction registers define a unique architectural namespace. Furthermore, instruction registers are explicitly managed by software, which is responsible for allocating instruction registers and scheduling load operations to transfer instructions into registers.

Unlike caches, instruction registers do not use tags to determine which instructions are resident. Issuing


an instruction stored in an instruction register does not require a tag check, and consequently instruction register files are faster and less expensive to access than caches of equivalent capacity.

Instruction register files are much smaller than typical instruction caches. The instruction register files implemented in Elm provide 512 B of aggregate storage, enough to accommodate 64 instruction pairs, per processor. This allows the register files to be implemented close to the function units, which reduces the cost of transferring instructions from instruction registers to data-path control points. Because the loops and kernels that dominate embedded applications exhibit abundant instruction reuse and locality, their important instruction working sets are readily captured in instruction registers, which allows this organization to keep a significant fraction of the instruction bandwidth local to each function unit. In exchange, the compiler must explicitly coordinate the movement of instructions into registers.

Executing Instructions Residing In Instruction Registers

Instructions are issued from instruction registers, and only those instructions that are stored in instruction registers may be executed. A distinguished architectural register named the instruction pointer (IP) designates the next instruction to be executed. It supplies the address at which instructions are read from the instruction register files (IRFs). The instruction pointer is similar to the program counter found in conventional architectures except that it only addresses the instruction registers and therefore is shorter than a conventional program counter. The instruction pointer is updated after an instruction issues to point to the next entry in the instruction register files. With the obvious exception of an instruction that explicitly sets the instruction pointer during a transfer of control, updating the instruction pointer simply involves incrementing it. Should the updated value lie beyond the end of the instruction register file, the value is wrapped around; this convention simplifies the construction of position independent code that will execute correctly regardless of where it is loaded in the instruction registers.

Unconditional jump instructions and conditional branch instructions allow control to transfer within the instruction registers. These instructions update the instruction pointer when they execute to reflect the transfer of control. Because an instruction must reside in an instruction register for it to be executed, the destination of a control flow instruction is always an instruction register. The destination may be specified using an absolute or relative position. Control flow instructions that use absolute destinations specify the name of the destination instruction register explicitly; instructions that use relative destinations specify a displacement that is added to the instruction pointer.

Returns and indirect jumps, the destinations of which may not be known at compile time, use a register to specify their destinations. As with unconditional jump and conditional branch instructions, the destination instruction must reside in an instruction register. Unlike jump and branch instructions, the destinations of which are encoded in the instructions, the destinations of indirect jump instructions are read from registers when the instruction executes. Subroutine calls within the instruction registers use a jump and link instruction to store the return position in a distinguished register, the link register, when control transfers to the subroutine. The destination of the jump and link instruction must be an instruction that resides in an instruction register. An indirect jump to the destination stored in the link register transfers control back to the caller


when the subroutine completes.
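The following C fragment is a behavioral sketch of the instruction pointer update just described; the register file size of 64 entries follows the text, while the enumeration of transfer kinds is purely illustrative.

    #include <stdint.h>

    #define NUM_IREGS 64u

    typedef enum { SEQUENTIAL, JUMP_ABSOLUTE, JUMP_RELATIVE } transfer_t;

    /* Compute the next instruction pointer value. Sequential issue increments
     * the pointer and wraps past the last register; absolute destinations name
     * a register directly; relative destinations add a displacement.          */
    static uint32_t next_ip(uint32_t ip, transfer_t kind, int32_t operand)
    {
        switch (kind) {
        case JUMP_ABSOLUTE: return (uint32_t)operand % NUM_IREGS;
        case JUMP_RELATIVE: return (ip + (uint32_t)operand) % NUM_IREGS;
        default:            return (ip + 1u) % NUM_IREGS;
        }
    }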

Loading Instructions into Instruction Registers

Software explicitly coordinates the movement of instructions into instruction registers by executing instruction loads. An instruction load transfers a contiguous block of instructions from memory into a contiguous set of registers. Software specifies the size of the instruction block, where the block is to be loaded in the instruction register file, and where the block resides in memory. The size of the instruction block determines the number of instructions that are transferred when the load executes. The processor can continue to execute instructions while instructions load, provided that control does not transfer to an instruction register with a load pending. Should control transfer to a register with a load pending, the processor stalls until the instruction arrives from the backing memory.

Instruction registers provide an effective mechanism for decoupling the fetching of instructions from memory and the issuing of instructions to the function units. This decoupling lets software hide memory access latencies. Software anticipates transfers of control, and proactively initiates instruction loads to fetch any instructions that are not present in the instruction registers.

Instruction loads may specify absolute or relative destinations. Loads that specify absolute destinations encode the name of the instruction register at which the block is to be loaded. Loads that specify relative destinations encode a displacement that is added to the instruction pointer to determine the destination. Allowing instruction loads to specify relative destinations simplifies the generation of position independent code that will execute correctly regardless of where it is loaded in the instruction registers.

Instruction loads may specify the memory address at which the instruction block resides directly or indirectly. Loads that specify the address directly encode the effective address of the instruction block in the instruction word. Loads that specify the address indirectly encode a register name and displacement in the instruction word; the effective address is determined when the load executes by adding the displacement to the contents of the specified register. The direct addressing mode is used when the location of the instruction block is known at compile time. This commonly occurs when an instruction block is transferred to a local software-managed memory before being loaded into instruction registers. The indirect addressing mode is used when the address of the instruction block cannot be determined at compile time. The indirect addressing mode is also used when the address of the instruction block is too large to be encoded directly. Additionally, the indirect addressing mode can be used to implement register indirect jumps by executing an instruction load followed by a jump. The instruction load uses the register that contains the address of the indirect jump target to load the target instruction; the jump transfers control to the target instruction after it has been loaded.
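A similar behavioral sketch of how an instruction load might resolve its destination register and its memory address is shown below; the field layout is invented for illustration, and only the distinction between absolute and relative destinations and between direct and indirect addressing follows the text.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_IREGS 64u

    typedef struct {
        bool     dest_relative;  /* destination given as a displacement from the IP */
        uint32_t dest;           /* register name, or displacement                   */
        bool     addr_indirect;  /* address given as register contents + displacement */
        uint32_t addr;           /* direct effective address, or displacement        */
        uint32_t base_reg_value; /* contents of the base register (indirect mode)    */
    } iload_t;

    static uint32_t iload_dest_register(const iload_t *op, uint32_t ip)
    {
        return (op->dest_relative ? ip + op->dest : op->dest) % NUM_IREGS;
    }

    static uint32_t iload_memory_address(const iload_t *op)
    {
        return op->addr_indirect ? op->base_reg_value + op->addr : op->addr;
    }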

Contrasting Instruction Registers and Caches

Instead of instruction registers, we might consider using small instruction caches in the first level of the instruction memory hierarchy. These small caches are commonly referred to as filter caches. Like instruction registers, filter caches allow small instruction working sets to be captured close to the function units in memory that is less expensive to access than the first-level instruction caches typically used in most processors.


Perhaps the most obvious difference between instruction registers and filter caches is the use of tags to dynamically determine which instructions are present in a cache. More fundamentally, instruction registers and filter caches differ in the separation of management policy and responsibilities between software and hardware: instruction registers let software decide which instructions are displaced when instructions are loaded, which instructions are loaded together, and when instructions are loaded.

Naming

To request an instruction stored in a cache, a processor sends the complete address of the desired instruction to the cache. Some of the address bits are used to identify the set in which the block containing the desired instruction will reside if the instruction is present in the cache. Most of the address bits are used to check the block tag to determine whether the desired instruction is present, and most often it is. When the instruction is not present, the cache uses the address to determine the address of the cache block that needs to be requested from the backing memory. Because the entire address of the desired instruction is always needed to access the cache, the program counter must be wide enough to store a complete memory address.

To request an instruction stored in an instruction register, the processor sends a short instruction register identifier to the instruction register file. As described previously, the instruction pointer stores the identifier of the next instruction to be executed. Because there are relatively few instruction registers, the instruction pointer is short and less expensive to maintain than a program counter. The processor only needs to send a memory address to the instruction register file when loading instructions, an infrequent operation.
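The difference in naming can be made concrete with a small C sketch: a cache access must split a full address into tag, set, and offset fields and compare the stored tag, whereas an instruction register access is a plain index. The cache geometry used here is illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define SETS        16u
    #define OFFSET_BITS 4u                 /* 16-byte blocks (illustrative) */
    #define SET_BITS    4u                 /* log2(SETS)                    */

    typedef struct { bool valid; uint32_t tag; } tag_entry_t;

    static bool cache_lookup(const tag_entry_t tags[SETS], uint32_t full_addr)
    {
        uint32_t set = (full_addr >> OFFSET_BITS) & (SETS - 1u);
        uint32_t tag =  full_addr >> (OFFSET_BITS + SET_BITS);
        return tags[set].valid && tags[set].tag == tag;  /* tag check on every access */
    }

    static uint64_t ireg_lookup(const uint64_t iregs[64], uint32_t ip)
    {
        return iregs[ip % 64u];            /* the short pointer is the entire name */
    }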

Placement and Replacement

The placement or mapping policy of a cache determines where a block may be placed within the cache. Consequently, it also affects, and may completely determine, which block is displaced when a block is loaded into the cache. Caches typically implement a fixed placement policy, though some machines let software influence the placement policy. For example, some allow software to configure the set associativity of a cache [55].

When considering how the placement policy affects the cost of accessing an instruction cache, we need to consider the cost of accessing an instruction that is present in the cache, the cost of accessing an instruction that is not present in the cache and must be loaded from the backing memory, and the rate at which instructions are loaded from the backing memory. The cost of accessing an instruction that is not present in the cache is dominated by the cost of accessing the backing memory rather than by the cost of storing the instruction in the cache.

Direct mapped caches allow a block to appear in only one place. Because only one entry is searched when a direct mapped cache is accessed, direct mapped caches usually store block tags in SRAM that is accessed in parallel with the instruction array. Set associative caches allow a block to appear in a restricted set of places. Accessing a set associative cache can consume more energy than accessing a direct mapped cache of the same capacity, as multiple ways may need to be searched, but the improvement in the cache hit rate can reduce the number of accesses to the backing memory enough to provide an overall reduction in the average access


energy. Set associative caches may store block tags in SRAM and access multiple tags in parallel. Highly associative caches with many parallel ways typically store the block tags in content addressable memories to allow all of the tags to be checked in parallel [105]. The match line can be used to drive the word-line of the instruction array so that the instruction array is only accessed when there is a hit. Fully associative caches allow a block to appear anywhere in the cache. Though full associativity is impractical for large caches, filter caches are often small enough and require few enough tags to be fully associative.

As in a fully associative cache, an instruction may be placed at any location in an instruction register file. Because software controls where an instruction load places instructions, software decides which instructions are replaced when new instructions are loaded. The elmcc compiler is often able to achieve effective hit rates that are competitive with or better than those of a cache by statically analyzing code to determine which instructions are the best candidates for replacement. This is possible because a compiler can infer which instructions are likely to be executed in the future from the structure of the code.

A similar optimization can be performed for a direct mapped cache when the compiler can determine when instructions will conflict in the cache. The code layout determines which instructions will conflict in the cache, and therefore the code layout phase must place instructions to avoid conflicts. Compilers may find it difficult to perform these types of code placement optimizations across function call boundaries. The address of a function both provides the name used by callers to identify the function and determines where the function will reside in an instruction cache. To avoid conflicts between a function and its callers, the compiler must place the function such that it does not conflict with the locations at which it is called. This requires that a compiler consider both the function and its callers when placing instructions to avoid conflicts, which is similar in structure to register allocation. Instruction registers allow the caller to decide where the function will be loaded, and therefore allow the caller to decide which instructions are displaced. This greatly simplifies the problem of laying out code to avoid conflicts, as it reduces it to a local problem of deciding where to load instructions in the instruction registers rather than a global problem of deciding how to structure an application to reduce conflicts.
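The conflict condition that constrains such code placement can be stated in a few lines of C: two instruction addresses collide in a direct-mapped cache exactly when they map to the same set. The block size and set count below are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_BYTES 16u
    #define NUM_SETS    64u

    /* A call site and a function body conflict in a direct-mapped instruction
     * cache when their addresses map to the same set; this helper checks the
     * condition for a single pair of addresses.                              */
    static bool maps_to_same_set(uint32_t addr_a, uint32_t addr_b)
    {
        return ((addr_a / BLOCK_BYTES) % NUM_SETS) == ((addr_b / BLOCK_BYTES) % NUM_SETS);
    }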

Because the number of instruction registers is expected to be small, a function and its caller will often not both fit in the instruction registers unless the function is small enough that inlining it would be productive. Instruction registers allow functions to be inlined without replicating the inlined function body at every call site. Instead, a single instance of the function can be loaded into the instruction registers at the inlined call locations. This avoids the problem of code expansion when function calls are inlined.

When a function is too large to be productively inlined and the compiler is able to bound the function’s instruction register requirements, the compiler can allocate instruction registers to the called function and have the calling function restore the instruction registers when control returns. This convention is useful for leaf function calls. Because a leaf function does not call other functions, the compiler can easily bound its instruction register requirements.


(Figure 4.2 plots, for the kernels aes, conv2d, crc, crosscor, dct, fir, huffman, jpeg, rgb2yuv, and viterbi, the normalized number of instructions transferred from the backing memory and the normalized miss rate for block sizes of 2, 4, and 8 instruction pairs.)

Figure 4.2 – Impact of Cache Block Size on Miss Rate and Instructions Transferred. The instructions transferred and miss rate are for a direct-mapped cache with a capacity of 64 instruction-pairs. The upper plot shows the relative number of instructions that are transferred from the backing memory; the lower plot shows the relative miss rate. The data in both plots are normalized within each kernel. The average number of instructions that are loaded from the backing memory increases with the block size, while the miss rate decreases. The abrupt increase in the dct kernel data results from conflict misses that occur when the block size is 8 instruction-pairs. The alignment of instructions within the kernel causes conflicts between instructions at the start and end of the loop. Consequently, there are 2 conflict misses per iteration of the loop despite the cache having enough capacity to accommodate the loop.

Block Size

The block size selected for a cache affects the number of tags that must be provisioned. It also affects the number of instructions that are loaded from the backing memory when there is a cache miss. Larger blocks better exploit any spatial instruction locality that is present in an application, and increasing the block size can reduce the miss rate when executing instruction sequences that exhibit significant spatial locality. The reduction in the miss rate improves performance, but does not reduce the number of instructions that are loaded from the backing store. Increasing the block size may increase the number of instructions that are loaded from the backing store. This occurs when the larger block size increases the number of instructions that are transferred to the cache but not executed before the block is replaced, as is illustrated by the data presented in Figure 4.2. For a fixed cache capacity, increasing the block size reduces the number of tags, as each tag covers more instructions. Because the cost of searching the tags increases with the number of tags, increasing the block


size reduces the contribution of the tag search to the instruction access energy. However, increasing the block size can result in additional conflict misses. Because filter caches are small, and therefore hold relatively few blocks, the increase in the number of conflict misses caused when the block size is increased can be significant, as illustrated by the miss rate observed for the dct kernel in Figure 4.2.

Software-defined instruction blocks provide a flexible mechanism for specifying and encoding locality. The instructions in a block exhibit significant spatial and temporal locality, and typically exhibit significant reuse. Because software determines the size of the instruction block that is loaded by each instruction load operation, software can avoid inadvertently loading instructions that may not be executed when control enters the block. However, additional instructions must be executed to initiate the loading of instructions into instruction registers, which expands the code and offsets some of the reduction in instruction traffic.

4.2 Microarchitecture

This section describes the instruction register architecture implemented in the Elm prototype and discusses design considerations that influenced the microarchitecture. Figure 4.3 illustrates in detail the microarchitecture of the instruction load (IL) and instruction fetch (IF) stages of the pipeline. Instructions are issued in pairs from the instruction register files, each of which has 64 registers. The instruction register files are two-port memories, with a dedicated read port and a dedicated write port. This allows the loading of instructions to overlap with the execution of instructions from the instruction registers.

The instruction register management unit (IRMU) coordinates the movement of instructions from the backing memory to the instruction registers. Instruction load commands are sent to the IRMU from the XMU pipeline. The IRMU is decoupled from the function unit pipelines, which allows the processor to continue executing instructions while instructions are being loaded. The IRMU implemented in Elm can handle a single instruction load at a time. Attempts to initiate a load operation while one is outstanding cause the pipeline to stall.

The instruction load unit implements the address generators and control logic that coordinate the transfer of instructions from the backing memory to the instruction registers. The instruction load unit has two address generators: one for generating memory addresses, and one for generating instruction register addresses. When an instruction block is loaded from a remote memory, and therefore must traverse the on-chip interconnection network, the memory address generator is only used to compute the address of the instruction block, and not the addresses of the instructions within the block. The block address and the block size are sent to the local network interface, which performs a single block memory transfer to fetch the entire instruction block. This reduces the number of addresses that must be generated locally, and more importantly reduces the number of memory operations that traverse the network.

The IRMU uses a pending load (PL) scoreboard to track which instruction registers are waiting for instructions to arrive from the backing memory. When an instruction load is initiated, the IRMU updates the PL scoreboard to indicate that the destination registers are invalid. The scoreboard is updated when instructions arrive from the backing memory to indicate that the destination registers now contain the expected instructions. When the IRMU is loading instructions, the PL scoreboard is used to determine whether the requested


(Figure 4.3 depicts the backing memory, the IRMU with its load unit, PL scoreboard, and translate unit, the two instruction register files (IRFs) indexed by the instruction pointer (IP), and the branch unit (BU), pipeline control logic, and the XMU and ALU function units.)

Figure 4.3 – Instruction Register Microarchitecture. The instruction register management unit (IRMU) coordinates the movement of instructions from the backing memory to the instruction registers. The IRMU uses a pending load (PL) scoreboard to track which instruction registers are waiting for instructions to arrive from the backing memory.

instructions are available for issue. The same instruction pointer value is sent to the scoreboard and instruction registers, and the scoreboard and instruction registers are accessed in parallel. The scoreboard behaves like full-empty bits on the instruction register values, except that the scoreboard is only accessed when the IRMU is loading instructions so that energy is not expended accessing the scoreboard when the IRMU is idle.

The PL scoreboard in Elm is implemented as a vector of 64 registers, one per instruction register address. This allows the IRMU to track each instruction register pair independently, which is feasible because the number of instruction registers is small. Though the energy expended maintaining the scoreboard contributes to the cost of loading instructions, a more important design consideration is the impact of the scoreboard on the cycle time, particularly when designs use deeper instruction register files.
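The scoreboard behavior can be summarized with a small behavioral sketch in C; the 64 single-bit entries follow the text, and everything else, including the function names, is an illustrative simplification of the hardware.

    #include <stdint.h>
    #include <stdbool.h>

    static uint64_t pl_pending;   /* bit i set: register i awaits its instruction */

    static void pl_load_initiated(uint32_t first_reg, uint32_t count)
    {
        for (uint32_t i = 0; i < count; i++)
            pl_pending |= 1ull << ((first_reg + i) % 64u); /* destinations invalid */
    }

    static void pl_instruction_arrived(uint32_t reg)
    {
        pl_pending &= ~(1ull << (reg % 64u));  /* destination now holds its instruction */
    }

    static bool issue_must_stall(uint32_t ip)
    {
        /* Only consulted while the IRMU is loading; issue stalls if the register
         * named by the instruction pointer still has a load pending.            */
        return (pl_pending >> (ip % 64u)) & 1u;
    }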


The energy expended maintaining the scoreboard is a less significant concern because the scoreboard is only active when the IRMU is loading instructions, which tends to be infrequent. The size of the scoreboard can be reduced if it tracks contiguous sets of instruction registers rather than individual entries. Increasing the size of the sets that the scoreboard tracks will reduce its size, but the loss of resolution can cause the processor to stall unnecessarily during instruction loads. In the limit of a single block that covers all of the instruction registers, one bit could be used to track all of the registers. Such a simple scoreboard would be inexpensive to maintain and would offer fast access times, but the processor would be forced to stall during every instruction load until every instruction in the instruction block arrives from the backing memory. If such a scoreboard were used, the instruction register files could be implemented as single-port memories, as it would not be necessary to concurrently read one instruction register while writing a different register. Performance would suffer, but the instruction registers would be less expensive to access. Such a design would be appropriate when instruction reuse within the significant kernels of an application allows most instructions to be statically placed within the instruction registers.

Control flow instructions execute in the branch unit (BU), which is implemented in the decode stage and contains a dedicated link register. Conditional branch instructions are implemented as predicated jump instructions, with dedicated predicate registers used to control whether a branch is taken. When a jump and link instruction executes, the branch unit stores the return instruction pointer to the link register. The link register also provides the destination for a jump register instruction when it executes. The dedicated link register allows the branch unit to write the link register when a jump and link instruction is in the decode stage, before the instruction reaches the write-back stage, simplifying the bypass network. It also lets the branch unit read the operand of a jump register instruction before the instruction reaches the register files, reducing the delay associated with forwarding the destination to the instruction register files. When the desired destination of an indirect jump is not in the link register, the destination must be explicitly moved to the link register. In effect, the move instruction that updates the link register behaves like a prepare-to-branch instruction [48], explicitly informing the hardware of the intended branch destination. The value stored in the link register can be saved by moving it to a different register or storing it to memory.

The destinations of relative jump and branch instructions are resolved in the translation unit as they are loaded from the backing memory. Consequently, all control flow destinations are stored as absolute destinations within the instruction registers, which simplifies the branch unit because it does not need an address generator. Most processors resolve relative destinations when control flow instructions execute. Because an instruction is typically executed more than once after it is loaded, resolving the destination when instructions load reduces activity within the hardware that computes the destination address, thereby reducing the energy expended delivering instructions to the processor.

In addition to conditional branch control flow instructions, the branch unit also executes loop instructions. Loop instructions use predicate registers to control whether the loop transfers control.
A subset of the predicate registers provides counters that can be used to control the execution of loops. The predicate field of an extended predicate register is set when the counter value is zero. When a loop instruction uses one of these extended predicate registers, the branch unit decrements the counter stored in the predicate register in the decode stage. Typically this reduces loop overheads by eliminating the bookkeeping instructions that update loop induction variables after each iteration of a loop. Updating the register that controls whether the branch


is taken early in the pipeline reduces the minimum number of instructions that must appear in a loop body to prevent dependencies on the predicate register from stalling the pipeline, which allows Elm to execute short loops efficiently. Software techniques for reducing the number of branch and overhead instructions, such as loop unrolling, which replicates the loop body multiple times, provide similar benefits at the expense of expanded code size. Loop unrolling optimizations can be counterproductive with instruction registers because the increase in code size may prevent the instruction registers from accommodating important instruction working sets.
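In C terms, the bookkeeping that such a counted loop removes is the explicit trip-count update and test shown below; with a counter predicate register, the decrement and the taken decision occur in the branch unit as a side effect of the loop instruction. The kernel itself is a made-up example.

    /* Assumes n >= 1. Each iteration executes the loop body plus overhead
     * operations: the induction-variable update, the counter decrement, and
     * the test-and-branch. A hardware counted loop folds the last two into
     * the loop instruction itself.                                           */
    void scale(int *x, int n, int k)
    {
        int count = n;
        int i = 0;
        do {
            x[i] = x[i] * k;        /* loop body                        */
            i = i + 1;              /* induction-variable update        */
            count = count - 1;      /* trip-count decrement (overhead)  */
        } while (count != 0);       /* test and branch (overhead)       */
    }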

4.3 Examples

This section provides several examples that illustrate the use of instruction registers. To simplify the examples, the assembly listings that appear in this section show only those instructions that execute in the XMU function unit. To further simplify the examples, the delay slots and most of the instructions that do not affect the transfer of control and do not affect the management of the instruction registers are elided; those instructions of this kind that are shown simply serve as place markers in the code.

Loading and Executing Instructions

The assembly code fragment shown in Figure 4.4 provides a useful example for illustrating several basic concepts associated with the use of instruction registers to capture important instruction working sets. The assembly code as shown lacks instructions for loading instructions into instruction registers, which would be introduced during the instruction register allocation and scheduling compiler pass. The size of each basic block is annotated in the control-flow graph that appears in Figure 4.4. As illustrated by the control-flow graph, the code contains three basic blocks and one loop.

Before passing the code in Figure 4.4 to the assembler, the instruction register allocation phase of the compiler must allocate instruction registers and schedule instruction loads. Instruction register allocation and load scheduling are important phases in the compilation process. The allocation of instructions to registers affects efficiency by exposing or limiting opportunities for instruction reuse within different instruction working sets. The scheduling of instruction loads also affects performance. By anticipating transfers of control to instructions that are not in registers and initiating instruction loads before instructions are needed, the compiler can hide the latency associated with loading instructions into registers.

A simple though typically rather inefficient strategy for allocating and scheduling instruction registers is to load each basic block immediately before control transfers to it. This is readily accomplished by introducing jump-and-load instructions at the outgoing edges of each basic block in the control-flow graph. The resulting code can be improved using a simple optimization that removes load operations that are clearly redundant. An example of a trivially redundant load operation is one that loads the instruction block in which it appears. In effect, this simple strategy implements an allocation policy that assigns all of the instruction registers to each basic block and a scheduling policy that defers instruction loads as late as possible.
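A compiler-side sketch of this simple policy is given below, under the simplifying assumption that each basic block becomes its own instruction block: a jump-and-load is placed on every outgoing edge, and an edge that re-enters the block it leaves, such as the back edge of the loop in Figure 4.4, is recognized as trivially redundant. This is an illustration of the policy, not the elmcc implementation.

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_SUCCS 2

    typedef struct block {
        struct block *succ[MAX_SUCCS];   /* successor blocks in the control-flow graph */
        bool needs_jld[MAX_SUCCS];       /* emit a jump-and-load on this edge?         */
    } block_t;

    static void place_loads(block_t *blocks, size_t nblocks)
    {
        for (size_t b = 0; b < nblocks; b++)
            for (int e = 0; e < MAX_SUCCS; e++) {
                block_t *succ = blocks[b].succ[e];
                /* Load the successor's instruction block on the edge unless the
                 * edge stays within the same block, in which case the load would
                 * reload the block that contains it and is trivially redundant. */
                blocks[b].needs_jld[e] = (succ != NULL) && (succ != &blocks[b]);
            }
    }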


assembly code fragment:

    100  BB1:  mov r1 #1
         ...
    109        mov r1 #2
    110  BB2:  mov r1 #3
         ...
    129        loop.clear pr0 [@BB2] (4)
    130  BB3:  mov r2 #4
         ...
    139        mov r1 #5

control-flow graph: BB1 [10] falls through to BB2 [20], which loops back to itself and falls through to BB3 [10].

Figure 4.4 – Assembly Code Fragment of a Loop that can be Captured in Instruction Registers. The simplified assembly code contains a single loop of 20 instruction pairs, as the loop instruction at line 129 transfers control to the instruction pair at line 110. Instructions that do not affect control flow have been replaced with unique mov instructions. Assembly statements at the boundaries of basic blocks are numbered in the code fragment, and the illustration of the corresponding control-flow graph shows the number of instructions in each basic block.

Figure 4.5 shows the assembly code fragment produced by the simple allocation and scheduling strategy after the redundant load elimination optimization is performed. Each jump-and-load instruction loads a new instruction block when control transfers between different basic blocks. The instruction blocks are illustrated at the right of the assembly code; the assembly statements that declare and define the instruction blocks are not shown to avoid cluttering the assembly code. For comparison, the basic block structure of the original code is illustrated at the left of the assembly code. There is a precise correspondence between the basic blocks in the original assembly code and the instruction blocks, as each instruction block encompasses a single basic block.

The assembler allows assembly statements to use @ expressions to specify the address of labels in the assembly code. The assembler also allows statements to use # expressions to refer to the length of an instruction block. For example, the jump-and-load instruction at line 100, which loads instruction block IB1, specifies that the instruction block resides at the memory address that the expression @IB1 resolves to, and that the length of the instruction block is the size that #IB1 resolves to.

Because the placement of the jump-and-load instructions in Figure 4.5 ensures that only one instruction block needs to be present in the instruction registers at any point in the execution of the code fragment, all of the jump-and-load instructions in Figure 4.5 use ir0 as their targets. Consequently, each instruction load displaces the instructions of the currently loaded basic block, and an instruction load operation may


assembly code fragment:

    100        jld [ir0] [@IB1] (#IB1)
    101  BB1:  mov r1 #1
         ...
    110        mov r1 #2
    111        jld [ir0] [@IB2] (#IB2)
    112  BB2:  mov r1 #3
         ...
    131        loop.clear pr0 [@BB2] (4)
    132        jld [ir0] [@IB3] (#IB3)
    133  BB3:  mov r2 #4
         ...
    142        mov r1 #5
    143        jld [ir0] ...

instruction blocks: IB1 [11], IB2 [21], and IB3 [11]. The execution schedule at the right of the figure shows each instruction block being loaded immediately before the corresponding basic block executes, with BB0 executing while IB1 loads and BB2 executing repeatedly from IB2.

Figure 4.5 – Assembly Code Produced with a Simple Allocation and Scheduling Strategy. The corresponding instruction load and execution schedule is shown to the right of the assembly code.

displace the jump-and-load that initiated the operation. For example, the jump-and-load instruction at line 112 issues from instruction register ir10 and loads instruction block IB2 into ir0 through ir20.

The simple allocation and scheduling strategy is able to exploit what instruction reuse exists within the example. The loop involving basic block BB2 provides the only opportunity for instruction reuse, and the redundant load optimization ensures that the simple allocation and scheduling strategy loads the loop body only once. However, the processor always stalls when a jump-and-load instruction executes because it must wait for the instruction register management unit to load the instruction that control transfers to. As illustrated in the execution schedule of Figure 4.5, using jump-and-load instructions to load each basic block individually and immediately before control transfers to it causes the processor to stall whenever control transfers to a different basic block. For example, execution stalls when control transfers from basic block BB1 to basic block BB2, and again when control transfers from BB2 to basic block BB3.

The compiler can eliminate some of the stalls in the execution schedule of Figure 4.5 by scheduling some instruction loads earlier, as illustrated in Figure 4.6. In the code shown in Figure 4.6, the jump-and-load instruction at line 100 loads basic blocks BB1, BB2, and BB3 as control transfers to basic block BB1. Figure 4.6 shows the resulting execution schedule. In addition to eliminating stalls, loading multiple basic blocks as a single instruction block reduces the number of instructions inserted by the compiler to manage the instruction registers. The code fragment in Figure 4.6 contains two fewer XMU instructions, which may improve the execution time and reduce the number of instructions that are transferred from the backing memory when the


assembly code fragment:

    100        jld [ir0] [@IB1] (#IB1)
    101  BB1:  mov r1 #1
         ...
    110        mov r1 #2
    111  BB2:  mov r1 #3
         ...
    130        loop.clear pr0 [@BB2] (4)
    131  BB3:  mov r2 #4
         ...
    140        mov r1 #5
    141        jld [ir0] ...

instruction blocks: IB1 [41] covers basic blocks BB1, BB2, and BB3. The execution schedule at the right of the figure shows IB1 being loaded while BB0 executes, after which BB1, the iterations of BB2, and BB3 all execute from the instruction registers.

Figure 4.6 – Assembly Code Fragment Produced when Instruction Load is Advanced. The assembly code fragment shows the loop when the instruction load operations that load the individual basic blocks are scheduled before control enters the region. The load operations are merged into the single jump-and-load operation at line 100.

fragment is loaded into instruction registers.

Executing a Loop that Exceeds the Capacity of the Instruction Registers

The assembly code fragment shown in Figure 4.7 is similar in structure to the fragment used in the previous example. However, it differs in that the loop body exceeds the capacity of the instruction registers. The size of each basic block appears in the control-flow graph illustrated in Figure 4.7. As in the previous example, the code fragment has three basic blocks and one loop. As before, the compiler must allocate instruction registers and schedule instruction loads before passing the code shown in Figure 4.7 to the assembler. Because the code fragment contains a basic block that exceeds the capacity of the instruction registers, the simple instruction register allocation and scheduling strategy used in the previous example cannot be applied.

Before proceeding to explain how the compiler should allocate and schedule instruction registers in this example, we should be aware that in some cases the compiler may be able to apply high-level code transformations such as loop fission to split large loops into multiple smaller loops, each of which will comprise few enough instructions to allow it to reside in instruction registers. After performing such high-level code transformations to restructure the loops within an application, the compiler may be able to allocate and schedule instruction registers within all of the loops in the application using the simple allocation and scheduling


assembly code fragment:

    100  BB1:  mov r1 #1
         ...
    109        mov r1 #2
    110  BB2:  mov r1 #3
         ...
    189        loop.clear pr0 [@BB2] (2)
    190  BB3:  mov r2 #4
         ...
    199        mov r1 #5

control-flow graph: BB1 [10] falls through to BB2 [80], which loops back to itself and falls through to BB3 [10].

Figure 4.7 – Assembly Code Fragment with a Loop that Exceeds the Instruction Register Capacity. The assembly fragment contains a single loop with 80 instructions, exceeding the capacity of the 64 instruction registers. The illustration of the corresponding control-flow graph shows the number of instructions in each basic block.

strategy described in the previous example. The obvious and primary advantage of splitting larger loops is that the instructions comprising each loop would remain in instruction registers across all of the iterations of the loop, reducing the number of instructions that are transferred to instruction registers when the loops in the application execute. However, even when the compiler is able to split a loop, doing so may significantly increase the number of variables that need to be communicated between the portions of the loop. When the number of additional variables is large enough to require that some of the variables be stored in memory, the cost of storing the additional variables may offset any reductions in the instruction energy.

When the compiler cannot split a large loop, or determines that it is not profitable to do so, it must load portions of the instructions comprising the loop as each iteration of the loop executes. A simple though inefficient strategy for allocating instruction registers and scheduling instruction loads is to partition the loop body into multiple smaller basic blocks and then load each basic block immediately before control enters it.

Figure 4.8 shows the assembly code after the compiler allocates instruction registers and schedules loads using this simple strategy. The instruction blocks are illustrated at the right of the code; the assembly statements that define the instruction blocks are not shown to avoid cluttering the code. Basic block BB2, which forms the body of the loop, is split into two smaller instruction blocks: IB2a and IB2b. The jump-and-load instruction at line 110 loads the first part of the loop before control enters the loop. The jump-and-load instruction at line 151 loads the second part of the loop before control enters it, displacing instructions in the


assembly code fragment:

    100  BB1:  mov r1 #1
         ...
    109        mov r1 #2
    110        jld [ir0] [@IB2a] (#IB2a)
    111  BB2:  mov r1 #3
         ...
    151        jld [ir0] [@IB2b] (#IB2b)
    152  BB2b: ...
         ...
    191        loop.clear pr0 [@BB2a] (2)
    192        jld [ir0] [@IB3] (#IB3)
    193  BB2a: jld [ir0] [@IB2a] (#IB2a)
    194  BB3:  mov r2 #4
         ...
    203        mov r1 #5

instruction blocks: IB1 [11], IB2a [41], IB2b [42], and IB3 [10]. The execution schedule at the right of the figure shows IB2a and IB2b being loaded during the first and every subsequent iteration of the loop, with IB3 loaded when the loop exits.

Figure 4.8 – Loop after Basic Instruction Load Scheduling. The assembly fragment shows the instruction blocks that result when instruction blocks are constructed by splitting large basic blocks and assigning each basic block to its own instruction block. The execution schedule illustrates the loading of instruction blocks and the execution of basic blocks when the assembly fragment executes.

first part of the loop. Because the first part of the loop is no longer present in the instruction registers when execution reaches the end of the loop, control cannot transfer directly to the start of the loop. Instead, the loop instruction at line 191 transfers control to the jump-and-load instruction at line 193, which loads the first part of the loop before transferring control.

The execution schedule is shown in Figure 4.8. As in the previous example, using jump-and-load instructions to load each basic block individually causes the processor to stall when control transfers to a different instruction block. In this example, the simple instruction register allocation and scheduling strategy also fails to capture instruction reuse across iterations of the loop, and every instruction in the loop is loaded during each iteration despite there being enough instruction registers to capture some instruction reuse across iterations.

The compiler can reduce the number of instructions that need to be loaded when executing loops that exceed the capacity of the instruction registers by allocating instruction registers so that some part of each loop remains in the registers across consecutive loop iterations. An effective strategy used by the compiler is to partition the body of the loop into multiple instruction blocks and assign one of the blocks an exclusive set of instruction registers. The other instruction blocks are assigned to the remaining instruction registers. The block that is assigned an exclusive set of instruction registers is loaded when control enters the loop. The remaining instruction blocks are loaded during each loop iteration. Essentially, this allocation and scheduling


assembly code fragment:

    100  BB1:  mov r1 #1
         ...
    109        mov r1 #2
    110  BB2a: mov r1 #3
         ...
    152        add r2 r1 #1
    153  BB2b: add r2 r1 #2
         ...
    170        add r2 r1 #2
    171  BB2c: add r2 r1 #4
         ...
    189        loop.clear pr0 [@BB2a] (2)
    190  BB3:  mov r2 #4
         ...
    199        mov r1 #5

control-flow graph: BB1 [10] falls through to BB2a [43], BB2b [18], and BB2c [19], which form the loop body; BB2c loops back to BB2a and falls through to BB3 [10].

Figure 4.9 – Assembly Code After Partitioning Loop Body Into 3 Basic Blocks. The assembly fragment shows the basic blocks that result when the loop is partitioned into 3 basic blocks. The illustration of the control-flow graph shows the number of instruction pairs in each basic block.

strategy locks one of the instruction blocks in the instruction registers and sequences the remaining instructions through the remaining instruction registers.

Figure 4.9 shows the assembly code fragment with the body of the loop split into three basic blocks: BB2a, BB2b, and BB2c. The updated control-flow graph, annotated with the sizes of the basic blocks, is also illustrated in Figure 4.9. Though the sizes of the basic blocks might appear arbitrary, the partitioning of the loop body shown in the code fragment minimizes the number of instructions that are loaded during each iteration of the loop when the straight-line loop is split into three blocks. For now, we defer describing a more general approach to determining the number of blocks a loop should be split into and the size of the blocks until the subsequent section on instruction register scheduling and allocation, which follows the remaining examples.

Figure 4.10 shows the code produced by the compiler after instruction register allocation and scheduling. The instruction blocks are illustrated at the right of the assembly code; the assembly statements that define the instruction blocks are not shown to avoid cluttering the code. The jump-and-load instruction at line 110 loads basic block BB2a and transfers control to the loop. The jump-and-load instruction at line 111 loads basic block BB3 when the loop terminates. The code as shown assumes that the jump-and-load instruction at line 111 is loaded to instruction register ir63, which is preserved throughout the execution of the loop.


[Figure 4.10 – Assembly Code After Load Scheduling. The assembly fragment shows the basic blocks and corresponding instruction blocks after instruction load and jump-and-load operations are scheduled. The execution schedule illustrates the loading of instruction blocks and execution of basic blocks when the assembly fragment executes.]

Basic block BB2a is loaded when control first enters the loop and remains in the instruction registers until the loop completes. Basic blocks BB2b and BB2c are loaded during each iteration of the loop, and use the same instruction registers.

The ild instruction at line 154 loads basic block BB2b at instruction register ir44, immediately after BB2a in the instruction register file. This allows control to fall through to BB2b when the add instruction at line 155 executes from instruction register ir43. The load instruction at line 154 could appear earlier in the code to allow more time for the instruction register management unit to initiate the instruction load. The jump-and-load instruction at line 174 loads basic block BB2c and then transfers control to it. The loop instruction at line 193 executes from instruction register ir62. When the loop is repeated, the loop instruction transfers control to instruction register ir0, which contains the first instruction of BB2a. When the loop terminates, control falls through to instruction register ir63, which contains the jump-and-load instruction at line 111. That jump-and-load instruction loads basic block BB3 and transfers control to it.

The execution schedule for the code is shown in Figure 4.10. As illustrated, instruction block IB2a is loaded once when control enters the loop and resides in the instruction registers for the duration of the loop. Instruction blocks IB2b and IB2c are loaded during every iteration of the loop. The code shown in Figure 4.10 loads 83 instructions during the first iteration of the loop and 38 instructions during each subsequent iteration. Note that the 83 instructions loaded in the first iteration include the jump-and-load instruction at line 111, which is the last instruction in the loop. In contrast, the code shown in Figure 4.9 loads 83 instructions during every iteration of the loop.

As Figure 4.10 illustrates, the register allocation and scheduling strategy also reduces the number of stalls. The instruction load at line 154 initiates the load of instruction block IB2b before control reaches basic block BB2b, thereby allowing control to fall into basic block BB2b without requiring that the processor stall while it waits for the instruction register management unit to load the next instruction to be executed. Similarly, because basic block BB2a remains in the instruction registers across loop iterations, the loop instruction at line 193 can transfer control to basic block BB2a without the processor stalling. The only stall encountered during the execution of the loop body is the stall imposed by the jump-and-load instruction at line 174, which loads IB2c as control enters basic block BB2c.


[Figure 4.11 – Assembly Code Fragment Illustrating a Procedure Call. The instruction at line 203 stores the return address before the jump-and-load instruction at line 204 transfers control to f. The jump-and-load instruction at line 201 transfers control to g when f returns.]

Implementing Indirect Jumps and Procedure Calls as Stubs

When performing an indirect jump, the number of instructions that should be loaded may not be known at compile time. For example, an indirect jump that implements a switch statement may jump to case statements of different lengths. Similarly, an indirect jump that implements a virtual function call, or a function call that uses a function pointer to determine the address of the function, may jump to functions of different lengths. When the number of instructions that will need to be loaded cannot be determined precisely at compile time, the compiler may load a subset of the instructions before executing the jump and defer loading any additional instructions until after transferring control. For example, switch statements can be implemented using indirect instruction loads to load the initial instructions of the target case statement, deferring the loading of additional instructions until after control transfers to the case statement. To reduce the number of loads that execute, the compiler can set the size of the initial load equal to the length of the shortest case statement so that only longer case statements perform additional loads, as sketched below. Similarly, indirect jumps that implement virtual function calls can initially load the function prologue and defer loading the remainder of the function body until after control transfers to the callee.
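To make the sizing rule concrete, the following C sketch selects the initial load size for an indirect jump as the length of the shortest target block, so that only longer targets must issue a follow-on load after control arrives. The block sizes are hypothetical, not values taken from the kernels in this chapter.

#include <stdio.h>

/* Hypothetical sizes (in instruction pairs) of the blocks an indirect
 * jump may target, e.g. the case clauses of a switch statement. */
static const int target_size[] = { 6, 2, 9, 4 };

int main(void)
{
    int n = sizeof target_size / sizeof target_size[0];
    int initial = target_size[0];
    for (int i = 1; i < n; i++)
        if (target_size[i] < initial)
            initial = target_size[i];

    /* The indirect jump-and-load transfers 'initial' pairs; each target
     * that is longer schedules one more load for its remaining pairs. */
    printf("initial load: %d pairs\n", initial);
    for (int i = 0; i < n; i++)
        printf("target %d: %d additional pairs\n", i, target_size[i] - initial);
    return 0;
}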


[Figure 4.12 – Assembly Code Fragment Illustrating Function Call Using Function Pointer. Function h applies the function passed as p to the constant value 1. Because the function call at line 109 is a tail call, the return address passed to h in register r31 is passed unmodified to the function called through p.]

Figure 4.11 illustrates an assembly code fragment that implements a function call and return using a simple calling convention in which the caller specifies where control should return when the invoked function completes. The assembly code fragment assumes that both f and g may be called from multiple places in the application, and therefore the return addresses cannot be determined during compilation. The calling convention illustrated in Figure 4.11 implements a function call as a jump-and-load instruction that loads the function entry and then transfers control to it. The calling convention implements a return as a jump-and-load instruction that loads one instruction at the return address to instruction register ir0 and then transfers control to it; the instruction at the return address then loads additional instructions. The return address is passed in register r31.

Function f receives its argument in register r2 and returns its result in register r1. The jump-and-load instruction at line 201 implements the return: it loads a single instruction from the return address passed in r31 and transfers control to it. The implementation of g is somewhat more complex because it calls another function. As before, g expects the caller to pass a return address in register r31. The first instruction in g, at line 202, saves the return address. The second instruction, at line 203, moves the address to which the call of f should return into r31. The jump-and-load instruction at line 204 transfers control to f. Because the call target can be resolved statically when the compiler processes the code fragment shown in Figure 4.11, the jump-and-load instruction that implements the call of f can load f in its entirety. The jump-and-load instruction at line 205 is the return target that is passed to f; it loads the remaining instructions of g. The instructions at lines 207 and 208 restore the return address passed to g and then return to the calling context using a jump-and-load instruction.

Figure 4.12 illustrates a function call that uses a function pointer to determine the address of the called function. The example assumes that the compiler is able to determine that the argument to h is either f or g. The structure of the assembly code generated for functions f and g is identical to that generated for the function f shown in Figure 4.11. The assembly statement at line 204 moves the function pointer passed as the argument to h to register r1. The assembly statement at line 205 moves the argument to the function call at line 109 to register r2, which is the register in which f and g receive their argument. The jump-and-load at line 206 loads the function passed as the argument to h, which is either f or g, and then transfers control to the called function. The sizes of f and g are known at compile time and are the same, which allows the jump-and-load at line 206 that implements the function call to load the entire body of the called function. Because h immediately returns after the function called through p completes, the return address passed to h in register r31 is left unmodified. Consequently, the jump-and-load statements at lines 201 and 203 that implement the return statements in f and g transfer control to the return address passed to h.


[Figure 4.13 – Assembly Code Illustrating Indirect Jump. The assembly code fragment implements the switch statement at line 101 using an indirect jump-and-load. Instruction block IB1 is loaded at instruction register ir2, and the instruction at line 209 has been scheduled to execute in the delay slot of the jump instruction at line 208. The load operation at line 207 loads the address of the target case statement block using the value of k in register r2. The addresses of the case statement blocks are stored in the @STABLE vector. The break statements at lines 103, 105, and 107 are implemented as jump operations at lines 201, 203, and 205.]

Figure 4.13 illustrates the use of jump-and-load instructions to implement a switch statement. The assembly generated for the case clauses of the switch statement appears at lines 200 through 205. The assembly statements implementing the switch statement appear at lines 207 and 208. The argument to the switch statement, variable k, resides in r2. The addresses of the case clauses are stored in a table at address STABLE that is indexed by k to determine the address of the case clause to be executed. The address is loaded into register r1 at line 207, and then the jump-and-load statement at line 208 loads the instructions comprising the case clause into the instruction registers at ir0 and transfers control to them. The assembly code shown assumes that instruction block IB1 is loaded at instruction register ir2, so the case clauses transfer control to the statement after the switch statement by jumping to instruction register ir5, which holds the instruction produced by the assembly statement at line 209.

Remembering Which Instructions Have Been Loaded

Figure 4.14 shows an assembly code fragment that contains a loop with an if-else statement in the body of the loop. The control-flow graph illustrated in Figure 4.14 shows the size of each basic block before the compiler performs instruction register allocation and scheduling. As illustrated, the loop initially contains 70 instructions and exceeds the capacity of the instruction registers. Consequently, some of the instruction registers must be assigned to multiple instructions within the loop body. The instruction register allocation used to generate the assembly code assumes that basic block BB2 is more likely to be executed than BB3 and that consecutive iterations of the loop are likely to follow the same path through the loop.


[Figure 4.14 – Assembly Fragment Containing a Loop with a Control-Flow Statement. The loop comprises basic blocks BB1, BB2, BB3, and BB4, which exceed the capacity of the instruction registers. The illustration of the control-flow graph shows the number of instructions in each basic block.]

The instructions introduced by the compiler during instruction register allocation and scheduling allow software to remember whether BB2 or BB3 resides in the instruction registers so that instructions are not loaded when they are already present in the registers. After instruction register allocation and scheduling, the loop contains 79 instructions.

Figure 4.15 shows the contents of the instruction registers when BB2 is present and when BB3 is present. Basic blocks BB1 and BB4 are present in both cases and remain in the instruction registers across all iterations of the loop. The conditional jump statement in register ir10 determines whether control falls through to BB2 or conditionally transfers to BB3. The destination of the conditional jump indicates whether BB2 or BB3 is present in the instruction registers, and is updated each time a different basic block is loaded.

Basic block BB2 is loaded when control initially enters the loop. When basic block BB2 is present in the instruction registers, it resides immediately after the conditional jump statement. When the conditional jump in register ir10 is not taken, control falls through to basic block BB2. After basic block BB2 completes, control falls into basic block BB4, where the loop statement in register ir51 transfers control back to ir0 while subsequent iterations of the loop remain to be executed.


[Figure 4.15 – Contents of Instruction Registers During Loop Execution. The left half of the figure shows the contents of the instruction registers with basic block BB2 loaded. The right half of the figure shows the contents of the instruction registers with basic block BB3 loaded.]

When the conditional jump in register ir10 is taken, control transfers to the instruction residing in instruction register ir53. The instruction load residing in ir53 loads instruction block IB3, which contains basic block BB3, at instruction register ir10. The subsequent jump statement transfers control to basic block BB3. Note that instruction block IB3 modifies the conditional jump instruction residing in instruction register ir10 so that control transfers to instruction register ir13 rather than instruction register ir53 when the conditional jump transfers control. When basic block BB3 is present in the instruction registers, it occupies a subset of the instruction registers assigned to basic block BB2. The jump instruction in instruction register ir33 transfers control to basic block BB4 after basic block BB3 completes. The instructions loaded into instruction registers ir11 and ir12 along with BB3 when instruction block IB3 is loaded displace the initial two instructions of basic block BB2. When they execute, these instructions reload instruction block IB2, which replaces the instructions from BB2 that were displaced by basic block BB3 and restores the conditional jump instruction in instruction register ir10.
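The bookkeeping that the patched conditional jump performs can be viewed abstractly as a one-entry residency tag: a block is reloaded only when a different block currently occupies its registers. The C sketch below is a data-side analogy written under that assumption; the block identifiers, slot size, and iteration path are hypothetical, and on Elm the tag is encoded in the target of the conditional jump rather than held in a separate variable.

#include <stdio.h>
#include <string.h>

#define SLOT_WORDS 24                /* stands in for the shared registers ir10..ir33 */

static int resident = -1;            /* which block currently occupies the slot */
static int slot[SLOT_WORDS];
static int loads = 0;

/* Load block 'id' into the shared slot only if it is not already resident. */
static void ensure_resident(int id, const int *image, int n)
{
    if (resident == id)
        return;                      /* nothing to do: contents already present */
    memcpy(slot, image, n * sizeof *image);
    resident = id;
    loads++;
}

int main(void)
{
    int bb2[SLOT_WORDS] = {0}, bb3[SLOT_WORDS] = {0};
    /* A loop whose iterations mostly take the same path reloads rarely. */
    int path[] = { 2, 2, 2, 3, 3, 2, 2, 2 };
    for (unsigned i = 0; i < sizeof path / sizeof path[0]; i++)
        ensure_resident(path[i], path[i] == 2 ? bb2 : bb3, SLOT_WORDS);
    printf("8 iterations, %d block loads\n", loads);
    return 0;
}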

4.4 Allocation and Scheduling

This section describes several strategies for allocating instruction registers and scheduling instruction loads. The initial strategy is simple, requires minimal sophistication within the compiler, and is readily explained. Subsequent strategies become progressively more sophisticated and are better able to capture instruction reuse within the instruction registers.


[Figure 4.16 – Control-Flow Graph. The control-flow graph shows the structure of the kernel: basic blocks BB1 through BB6, each annotated with its size in instructions (BB3 contains 20 instructions and the other blocks contain 10), connected by edges numbered (1) through (11) from Entry to Exit.]

Figure 4.16 provides a control-flow graph that will be used to illustrate the instruction register allocation and load scheduling strategies described in this section. The control-flow graph corresponds to a kernel that would typically be invoked repeatedly in place, with each invocation processing new input, so we might imagine an edge connecting the Exit node of the graph to the Entry node.

A Simple Allocation and Scheduling Strategy

Perhaps the simplest strategy for allocating instruction registers and scheduling instruction loads is to assign instruction registers to each basic block independently and defer loading the block until immediately before control transfers to it. Because this allocation and scheduling strategy makes all of the instruction registers available to each basic block, each basic block within a kernel or function can be loaded at instruction register ir0, and the compiler can use a jump-and-load instruction both to load the basic block and to transfer control.


The compiler can implement the allocation and scheduling strategy as three phases. The first phase prepares the control-flow graph for allocation and scheduling. The compiler breaks large basic blocks that exceed the capacity of the instruction registers into multiple smaller blocks and determines the relative positions of the basic blocks in memory.

The second phase computes an initial load schedule. The compiler tentatively schedules jump-and-load operations at each edge in the control-flow graph. The jump-and-load operation scheduled at the edge from basic block BBx to BBy loads the successor basic block BBy at instruction register ir0. For example, the compiler would schedule jump-and-load operations to load basic block BB2 at edges (2) and (9), a jump-and-load operation to load BB3 at (3), and a jump-and-load operation to load BB5 at (8).

The third phase improves the initial load schedule using a pair of optimization passes. The first optimization pass eliminates redundant loads. A jump-and-load operation is redundant if the instruction block it loads is already present at the desired location. For example, the jump-and-load operation inserted at (7) is redundant because basic block BB4 already resides at ir0 when control reaches (7). The second optimization pass merges load operations. Two jump-and-load operations can be merged if the later operation is dominated by the earlier one and the corresponding instruction blocks are contiguous in memory. For example, the jump-and-load operation inserted at (5) in the control-flow graph shown in Figure 4.16 can be merged into the jump-and-load operation inserted at (3) if basic blocks BB3 and BB5 are contiguous in memory. This optimization reduces the code size and the number of jump-and-load instructions that are executed.

This simple allocation and scheduling strategy fails to exploit much of the instruction reuse that exists within embedded kernels, though it succeeds in capturing a reasonable amount of the spatial locality. Instruction reuse is limited to loops composed of a single basic block, such as BB4 in the example control-flow graph. The optimization pass used to merge instruction loads allows the compiler to exploit spatial locality across adjacent basic blocks. To exploit instruction reuse within larger instruction working sets, the compiler must allocate and schedule instruction registers across multiple basic blocks, as described in the next section.
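The two optimization passes of the third phase can be sketched over a simple edge list, as below. This is a minimal illustration rather than the Elm compiler's implementation: the edge list and block layout are hypothetical stand-ins modeled loosely on Figure 4.16, the redundancy test handles only the self-loop case described above, and the merge test checks contiguity only, so it reports merge candidates that a real pass would still filter by dominance.

#include <stdio.h>

/* A tentative schedule: one jump-and-load per control-flow edge, loading
 * the successor block at ir0.  Addresses and sizes are in instruction
 * pairs; the layout (indices 1..6 = BB1..BB6) is hypothetical. */
struct edge { int src, dst; int redundant; };

static const int block_addr[7] = { 0, 0, 10, 20, 50, 40, 60 };
static const int block_size[7] = { 0, 10, 10, 20, 10, 10, 10 };

int main(void)
{
    struct edge e[] = {
        {1, 2, 0}, {2, 3, 0}, {2, 4, 0}, {3, 5, 0},
        {4, 5, 0}, {4, 4, 0},            /* the self-loop on BB4 */
        {5, 2, 0}, {5, 6, 0},
    };
    int n = sizeof e / sizeof e[0];

    /* Pass 1: a load is redundant if its target is certain to be at ir0
     * already; the self-loop is the simple case handled here. */
    for (int i = 0; i < n; i++)
        if (e[i].src == e[i].dst)
            e[i].redundant = 1;

    /* Pass 2: two loads may be merged when the later one is dominated by
     * the earlier one and the blocks are contiguous in memory.  Only the
     * contiguity test is performed here. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (!e[i].redundant && !e[j].redundant && e[i].dst == e[j].src &&
                block_addr[e[j].dst] == block_addr[e[i].dst] + block_size[e[i].dst])
                printf("candidate merge (dominance still to check): fold load of BB%d "
                       "into the load at edge BB%d->BB%d\n",
                       e[j].dst, e[i].src, e[i].dst);

    for (int i = 0; i < n; i++)
        if (e[i].redundant)
            printf("drop redundant load at edge BB%d->BB%d\n", e[i].src, e[i].dst);
    return 0;
}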

An Improved Region-Based Allocation and Scheduling Strategy

The simple allocation and scheduling strategy described above fails to capture short-term instruction reuse within loops composed of multiple basic blocks because it assigns conflicting sets of instruction registers to the basic blocks within a loop. This deficiency is readily addressed by assigning basic blocks within some meaningful portion of the code to different sets of instruction registers. For example, the compiler can allocate disjoint sets of instruction registers to the instructions comprising the loop composed of basic blocks BB2 – BB5 by visiting the basic blocks in a depth-first traversal order and assigning a set of instruction registers to each basic block when it is visited. If the depth-first traversal visits BB3 before BB4, the allocator assigns instruction registers ir0 – ir9 to BB2, instruction registers ir10 – ir29 to BB3, instruction registers ir30 – ir39 to BB5, and instruction registers ir40 – ir49 to BB4. This assignment allows the entire loop body to remain in the instruction registers for the duration of the loop, significantly improving the amount of instruction reuse captured in the instruction registers. A region-based allocation and scheduling strategy allows the compiler to capture more instruction reuse within the instruction registers than the simple allocation and scheduling strategy described above.
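A minimal sketch of the depth-first assignment follows. The adjacency encoding and block sizes are hypothetical stand-ins for the loop in Figure 4.16; each block simply receives the next free contiguous range of instruction registers the first time the traversal reaches it.

#include <stdio.h>

#define NBLK 6                       /* indices 1..5 = BB1..BB5; index 0 unused */

static const int size[NBLK]    = { 0, 10, 10, 20, 10, 10 };
/* Successors inside the loop region: BB2 -> BB3, BB4; BB3 -> BB5;
 * BB4 -> BB5; BB5 -> (back edge to BB2, omitted here). */
static const int succ[NBLK][2] = { {0,0}, {0,0}, {3,4}, {5,0}, {5,0}, {0,0} };
static const int nsucc[NBLK]   = { 0, 0, 2, 1, 1, 0 };

static int base[NBLK];
static int visited[NBLK];
static int next_reg = 0;

static void assign(int b)
{
    if (visited[b]) return;
    visited[b] = 1;
    base[b] = next_reg;              /* contiguous, disjoint range for this block */
    next_reg += size[b];
    for (int i = 0; i < nsucc[b]; i++)
        assign(succ[b][i]);
}

int main(void)
{
    assign(2);                       /* BB2 is the loop header */
    for (int b = 2; b <= 5; b++)
        printf("BB%d: ir%d - ir%d\n", b, base[b], base[b] + size[b] - 1);
    return 0;
}

With this child order the traversal visits BB3 (and then BB5) before BB4, which reproduces the assignment given in the text.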


[Figure 4.17 – Illustration of Regions in the Control-Flow Graph. In addition to the 3 regions that are explicitly illustrated and labeled (R1, R2, and R3), each basic block comprises a region.]

Informally, a region of a flow graph is a portion of the flow graph with a single point of entry. The single point of entry provides a convenient location for the compiler to load the instructions in the region. More formally, a region is a set of nodes N and edges E in a flow graph such that there is a single node d that dominates all of the nodes in N, and if a node n1 can reach a node n2 in N without passing through d, then n1 is also in N. A procedure will have a hierarchy of regions, with the basic blocks in the procedure forming leaf regions. Using the hierarchy of regions in a procedure to reason about the allocation and scheduling of instruction registers within the compiler is productive because each statement in a block-structured procedure corresponds to a region, each level of statement nesting in the procedure corresponds to a level in the hierarchy of regions, and each of the natural loops in a procedure establishes a unique region. Figure 4.17 illustrates the regions corresponding to the natural loops in Figure 4.16.

The region-based allocation and scheduling strategy first processes the regions defined by the natural loops in the procedure. Regions that are not part of a natural loop are processed last because there is limited or no opportunity for instruction reuse within these regions.
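Before a candidate region is processed it must genuinely have a single point of entry. One simple way to check that property, sketched below with a hypothetical edge list modeled loosely on Figure 4.16, is to verify that every edge entering the candidate node set from outside targets the header; the full dominance test in the definition above is omitted to keep the sketch short.

#include <stdio.h>

struct edge { int src, dst; };

/* Returns 1 if every edge from outside the candidate set into it targets
 * 'header', i.e. the candidate set has a single point of entry. */
static int single_entry(const struct edge *e, int n, const int *in, int header)
{
    for (int i = 0; i < n; i++)
        if (in[e[i].dst] && !in[e[i].src] && e[i].dst != header)
            return 0;
    return 1;
}

int main(void)
{
    /* Hypothetical edges (BB numbers). */
    struct edge e[] = { {1,2}, {2,3}, {2,4}, {3,5}, {4,5}, {4,4}, {5,2}, {5,6} };
    int in[7] = { 0, 0, 1, 1, 1, 1, 0 };   /* candidate region: BB2..BB5 */
    printf("single entry at BB2: %s\n",
           single_entry(e, sizeof e / sizeof e[0], in, 2) ? "yes" : "no");
    return 0;
}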


The allocator sorts the regions so that smaller regions that are closer to the leaf regions in the hierarchy of regions are processed before larger enclosing regions. This ensures that instructions in inner loops are assigned instruction registers before instructions in enclosing loops. Allocating instruction registers within larger regions allows the instructions in those regions to be assigned to contiguous sets of instruction registers, which reduces the number of instruction load operations that are required to load composite instruction regions. Before processing a region, the allocator attempts to expand the allocation region into the immediately enclosing region. A region is expanded when the enclosing region is small enough to fit in the instruction registers. For example, when processing the procedure illustrated in Figure 4.17, the allocator would begin by processing region R1. The allocator would expand the allocation region to R2, which contains 50 instructions. The allocator cannot expand the allocation region further because region R3, the enclosing region in the hierarchy of regions, contains 70 instructions and exceeds the capacity of the instruction registers.

After expanding the allocation region, the allocator tentatively assigns instruction registers to the basic blocks in the region based on a depth-first traversal of the basic blocks in the region. The assignment is tentative in that it may later be repositioned, though the relative positions of the instructions within the region will not change. For example, the allocator would tentatively assign instruction registers ir0 – ir49 to the basic blocks in region R2. Instruction registers ir0 – ir9 would be assigned to BB2, instruction registers ir10 – ir29 to BB3, instruction registers ir30 – ir39 to BB4, and instruction registers ir40 – ir49 to BB5. When a region exceeds the capacity of the instruction registers, the allocator attempts to assign instruction registers that are not used in any of the enclosed regions that were previously processed and assigned registers. For example, region R3 contains more instructions than there are instruction registers. When processing region R3, the allocator assigns instruction registers ir0 – ir9 to basic block BB1, translates the assignments within region R2 so that the region uses registers ir10 – ir59, and assigns instruction registers ir60 – ir63 and ir0 – ir5 to basic block BB6.

After the instruction register allocation phase completes, the load scheduling phase inserts load operations. The scheduler first tentatively schedules jump-and-load operations at each basic block, as in the previous allocation and scheduling strategy. The scheduler then schedules load operations at the edges arriving at the dominator node to load regions that were allocated contiguous sets of instruction registers. These load operations are inserted to render some of the tentatively scheduled jump-and-load operations redundant. Generally, this strategy of lifting loads closer to the entry point of the procedure is productive because it reduces the number of instruction load operations. However, it may result in some regions being loaded but not executed, such as when there is a point of divergence in the control-flow graph. To avoid this, the scheduler can defer loading regions that are not certain to be executed until control transfers to the region.
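The expansion and translation steps described above can be sketched as follows. The region sizes and register count match the running example, but the representation is a deliberate simplification: regions are reduced to flat size totals and a single parent link, and the translation is shown for one hard-coded shift rather than being driven by a real allocation of the enclosing region.

#include <stdio.h>

#define R 64                         /* instruction registers available */

struct region { const char *name; int size; int parent; int base; };

int main(void)
{
    /* Innermost first: R1 (the BB4 self-loop), R2 (BB2-BB5), R3 (whole kernel). */
    struct region reg[] = {
        { "R1", 10,  1, -1 },
        { "R2", 50,  2, -1 },
        { "R3", 70, -1, -1 },
    };

    /* Expand the allocation region while the enclosing region still fits. */
    int r = 0;
    while (reg[r].parent >= 0 && reg[reg[r].parent].size <= R)
        r = reg[r].parent;
    printf("allocation region: %s (%d instructions)\n", reg[r].name, reg[r].size);

    /* The tentative assignment starts at ir0; when the enclosing region is
     * processed later, the assignment is translated (shifted), as for R2
     * inside R3 where BB1 takes ir0-ir9. */
    reg[r].base = 0;
    int shift = 10;
    reg[r].base += shift;
    printf("%s translated to ir%d - ir%d\n",
           reg[r].name, reg[r].base, reg[r].base + reg[r].size - 1);
    return 0;
}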
Redundant load operations are identified using a data-flow analysis that determines which instructions are certain to reside in the instruction registers at each point in the code. The values computed by the data-flow analysis are sets of instructions, referred to as the resident instruction sets. The data-flow analysis initializes the resident instruction sets to the empty set. The transfer function used in the data-flow analysis updates the resident instruction sets when load operations are encountered. The instructions that are loaded by the load operation are added to the resident instruction set, and any instructions that are displaced by the load operation are removed.


[Figure 4.18 – Resident Instruction Sets. The control-flow graph at the left shows the initial load schedule. The edges of the control-flow graph are annotated with the basic blocks that the compiler has scheduled to be loaded when control transfers between the basic blocks incident on the edge. The resident instruction sets for the initial load schedule are shown on the control-flow graph in the middle. Each basic block is annotated with the set of basic blocks that are in the resident instruction set computed for the basic block. The control-flow graph at the right shows the revised load schedule.]

The meet operator used to combine data-flow values at convergence points computes the intersection of the resident instruction sets over the converging paths. This ensures that only those instructions that are present along all incoming paths are added to the resident instruction set. After computing the resident instruction sets, the load scheduler optimizes the load schedule by eliminating any loads that are redundant. A load is redundant if it loads an instruction block that is already present, as computed by the resident instruction set data-flow analysis. Figure 4.18 illustrates the initial load schedule, the resident instruction sets computed by the data-flow analysis, and the optimized load schedule.
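A minimal realization of the analysis is sketched below. It tracks residency at basic-block granularity rather than per instruction, uses a hypothetical graph and per-block load/displacement sets modeled loosely on Figure 4.18, and, as is usual for an analysis whose meet is intersection, starts the interior sets optimistically full so that the first pass over a back edge does not discard information before the fixed point is reached; the entry block, which has no predecessors, starts empty.

#include <stdio.h>
#include <stdint.h>

#define NB 7    /* basic blocks BB1..BB6; index 0 unused; bit k stands for BBk */

/* Hypothetical load schedule, expressed per block: the blocks loaded when
 * control enters a block, and the resident blocks those loads displace. */
static const uint32_t loaded[NB]    = { 0, 0x3e, 0, 0, 0, 0, 0x40 }; /* BB1: load BB1-BB5; BB6: load BB6 */
static const uint32_t displaced[NB] = { 0, 0,    0, 0, 0, 0, 0x02 }; /* loading BB6 displaces BB1        */

static const int npred[NB]   = { 0, 0, 2, 1, 2, 2, 1 };
static const int pred[NB][2] = { {0,0}, {0,0}, {1,5}, {2,0}, {2,4}, {3,4}, {5,0} };

int main(void)
{
    uint32_t in[NB], out[NB];
    for (int b = 0; b < NB; b++) { in[b] = ~0u; out[b] = ~0u; }

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = 1; b < NB; b++) {
            uint32_t i = ~0u;                    /* meet: intersect over predecessors */
            if (npred[b] == 0) i = 0;            /* the entry block starts empty      */
            for (int p = 0; p < npred[b]; p++)
                i &= out[pred[b][p]];
            uint32_t o = (i & ~displaced[b]) | loaded[b];   /* transfer function */
            if (i != in[b] || o != out[b]) { in[b] = i; out[b] = o; changed = 1; }
        }
    }
    for (int b = 1; b < NB; b++)
        printf("resident on entry to BB%d: 0x%02x\n", b, in[b]);
    return 0;
}

With these inputs the fixed point reports BB1 through BB5 resident on entry to BB2 through BB6 and nothing resident on entry to BB1, which is consistent with the resident instruction sets shown in the middle control-flow graph of Figure 4.18.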


Improving Instruction Reuse Within Large Allocation Regions

The amount of instruction reuse captured in the instruction registers when executing loops that exceed the capacity of the instruction registers can be improved by extending the region-based allocation strategy described above. The basic idea is to partition the instruction registers into two disjoint sets. Registers in the first set are assigned to instructions that reside in registers for the duration of the loop. Registers in the second set are assigned to instructions that are loaded while the loop executes. We sometimes describe registers in the second set as non-exclusive within the code region because they are shared by multiple instructions. Similarly, we sometimes describe registers in the first set as exclusive within the code region. The concepts behind the extension were explored previously using the loop contained in the code fragment shown in Figure 4.7 as an example. The compiler improves the effectiveness of the allocation strategy by selecting instructions that are executed more frequently for inclusion in the set of instructions that are assigned exclusive registers. For example, within region R2 of the control-flow graph shown in Figure 4.17, basic blocks BB2 and BB5 are certain to be executed during every iteration of the loop, whereas only one of BB3 and BB4 will be executed during each iteration. Consequently, instructions in basic blocks BB2 and BB5 are better candidates for inclusion in the set of instructions assigned to exclusive registers than the instructions in BB3. Similarly, when there are enough non-exclusive registers to accommodate basic block BB4, so that instruction reuse within region R1 can be captured in the non-exclusive registers, instructions in basic blocks BB2 and BB5 are better candidates for assignment to exclusive registers than the instructions in BB4.

Let us consider allocating and scheduling instruction registers within a region that corresponds to a loop that exceeds the instruction register capacity, such as the loop shown in Figure 4.7. The instruction register allocator will partition the instruction registers into two sets. The first set will be used to hold those instructions that remain in the instruction registers for the duration of the loop; the second set will be used to hold those instructions that are loaded during each iteration of the loop. We will denote the number of registers in the first set by r1 and the number in the second set by r2. The compiler will partition the instructions within the allocation region into n + 2 instruction blocks, where n is a positive integer. The first instruction block will remain in the instruction registers for the duration of the loop; the remaining n + 1 instruction blocks will be loaded during each iteration of the loop. Thus, the first instruction block may contain up to r1 instructions, and each of the remaining n + 1 instruction blocks may contain up to r2 instructions. For all but the last instruction block, the compiler must reserve one instruction register for the load operation that transfers the next instruction block in the loop; an instruction register is not needed for this purpose in the last instruction block because the first instruction block, which executes next, remains in the instruction registers across loop iterations.
Thus, the allocator may assume that there are r1 − 1 instruction registers available for instructions in the first instruction block, that there are r2 − 1 instruction registers available for instructions in each of the next n instruction blocks, and that there are r2 instruction registers available for instructions in the final instruction block. The compiler must also reserve one instruction register to load the region in the control-flow graph that executes after the loop completes. Consequently, if there are R instruction registers available to the allocator, the allocator must choose r1 and r2 such that r1 + r2 < R. Intuitively, the number of instructions that are loaded during each iteration is minimized when the allocator uses all of the available instruction registers, and the allocator should therefore choose r1 and r2 such that r1 + r2 = R − 1.


The compiler can reduce the number of instructions that are loaded during each iteration of the loop by selecting n and r2 to minimize

    (n + 1) r2                                        (4.1)

subject to the constraints

    r1 − 1 + n(r2 − 1) + r2 = N                       (4.2)
    r1 + r2 = R − 1.                                  (4.3)

Equations (4.2) and (4.3) can be used to derive the following expressions for r2 and r1 for a given value of n:

    r2 = (N − R + 2)/n + 1                            (4.4)
    r1 = R − r2 − 1.                                  (4.5)
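The optimization can be carried out by simply evaluating equations (4.4) and (4.5) for a small range of n and keeping the candidate that minimizes (4.1); the ceiling in the integer division reflects that r2 must be a whole number of registers. The C sketch below does this for the running example (N = 80, R = 64) and reproduces the table that follows.

#include <stdio.h>

int main(void)
{
    const int N = 80, R = 64;        /* loop size and instruction register count */
    int best_n = 0, best_cost = N + 1;

    for (int n = 1; n <= 4; n++) {
        int r2 = (N - R + 2 + n - 1) / n + 1;   /* equation (4.4), rounded up */
        int r1 = R - r2 - 1;                    /* equation (4.5)             */
        int cost = (n + 1) * r2;                /* pairs loaded per iteration */
        printf("n=%d  r2=%d  r1=%d  (n+1)r2=%d\n", n, r2, r1, cost);
        if (cost < best_cost) { best_cost = cost; best_n = n; }
    }
    printf("best: n=%d, %d instructions loaded per iteration\n", best_n, best_cost);
    return 0;
}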

Returning to the loop contained in the code fragment shown in Figure 4.7, we can use the above equations to determine the partitioning of the loop that results in the smallest number of instructions being transferred to the instruction registers. Before instruction register allocation and scheduling, the loop contains 80 instructions. When the allocator is permitted to use all 64 of the instruction registers, we have N = 80 and R = 64.

The values of r1 and r2 computed using equations (4.4) and (4.5) for different values of n, with r2 rounded up to the nearest integer, are tabulated as follows.

n    r2    r1    (n + 1)r2
1    19    44    38
2    10    53    30
3     7    56    28
4     6    57    30

From the table we can see that the number of instructions transferred is minimized when n = 3. This requires partitioning the loop into 5 instruction blocks and allocating 56 instruction registers to those instructions that remain in the instruction registers for the duration of the loop.

Partitioning a region into more instruction blocks increases the number of jump-and-load or instruction load operations that need to be scheduled and executed. The compiler may be able to find empty issue slots in which it can schedule these operations without increasing the number of instructions in the region. However, the compiler will sometimes find it necessary to extend the instruction schedule to accommodate the additional instruction register management operations, thereby increasing the code size and execution time. Furthermore, the jump-and-load instructions may introduce additional stall cycles because there is limited opportunity to hide the instruction load latency with other useful operations. Consequently, when allocating and scheduling instruction registers, the compiler attempts to balance performance demands against increased instruction reuse and the energy efficiency it entails.


[Figure 4.19 – Kernel Code Size After Instruction Register Allocation and Scheduling. Code size, in instruction pairs, is shown for instruction register file capacities of 16, 32, 64, 128, and 256 registers across the benchmark kernels (aes, conv2d, crc, crosscor, dct, fir, huffman, jpeg, rgb2yuv, viterbi). The code size decreases as the number of instruction registers that are available to the compiler increases until enough instruction registers are available to capture a kernel.]

4.5 Evaluation and Analysis

Instruction Registers and Code Sizes

When constructing instruction blocks, the compiler must balance the advantage of constructing many small, precise instruction blocks against the impact of instruction load operations on execution times and code sizes. The compiler can reduce the number of instruction loads by constructing and loading larger instruction blocks. Though software controls which instructions are in an instruction block, the compiler may find that expanding an instruction block requires including instructions that may not be executed. For example, the compiler may find that it must include a basic block generated by a conditional control statement to further expand an instruction block. The compiler attempts both to reduce the number of instruction loads that are executed and to limit the number of instructions that are transferred to the instruction registers, though these are sometimes conflicting objectives.

Figure 4.19 shows the size of the kernels after instruction register allocation and scheduling. The size is reported in instruction pairs. To illustrate the impact the number of instruction registers has on code size, the figure shows the size of each kernel for different instruction register file capacities. For all of the kernels, 256 instruction registers are enough to accommodate the entire kernel, and adding more instruction registers does not reduce the code size further. Consequently, the 256 instruction register configuration provides an accurate measure of the code size that would result were an infinite number of instruction registers available to the compiler. As we should expect, code sizes decrease as more instruction registers are made available to the compiler.


[Figure 4.20 – Static Fraction of Instruction Pairs that Load Instructions. The fraction of instruction pairs that contain instruction load and jump-and-load operations is computed by dividing the number of instruction load and jump-and-load instructions by the number of instruction pairs comprising each kernel. The fraction reflects the number of load instructions in the XMU instruction stream. The black component of each bar corresponds to the frequency at which instruction loads appear in the kernel; the remainder of the bar corresponds to jump-and-load instructions.]

With more instruction registers, the compiler is able to construct larger instruction blocks and needs to schedule fewer instruction load operations. Once there are enough registers to accommodate the dominant instruction working sets, provisioning additional instruction registers provides small reductions in the static code size.

Figure 4.20 shows the fraction of the instruction pairs that contain instruction load and jump-and-load instructions. The fraction is computed by dividing the number of instruction load and jump-and-load instructions appearing in the code by the number of instruction pairs comprising the kernel. Every kernel has at least one instruction load operation in its prologue to load the body of the kernel, and one jump-and-load instruction in its epilogue to transfer control back to the context in which the kernel was invoked. Because 256 instruction registers are sufficient to accommodate any of the kernels, the code produced for the 256 instruction register configuration has exactly one jump-and-load and one instruction load instruction. The reported fraction measures the number of instruction register management instructions in the XMU instruction stream. Because some of the load instructions are scheduled in what would otherwise be empty XMU issue slots, and because the jump-and-load instructions typically replace jump instructions, we can interpret the data in Figure 4.20 as providing a bound on the increase in code size imposed by having software explicitly manage the instruction registers.

Instruction Registers and Execution Times

Kernel execution times affect both throughput and latency. The normalized kernel execution times are reported in Figure 4.21. To illustrate the impact the number of instruction registers has on execution times, the figure shows the execution times for different instruction register file capacities.


[Figure 4.21 – Normalized Kernel Execution Time. The kernel execution time decreases as the number of instruction registers that are available to the compiler increases until there are enough instruction registers to capture a kernel.]

As we should expect, the execution times decrease as more instruction registers are made available to the compiler. With more instruction registers, fewer instructions need to be loaded during the execution of a kernel, and the processor spends less time stalled waiting for instructions to arrive. The additional instruction registers also allow the compiler to construct larger instruction blocks, which reduces the number of instruction load operations that the compiler needs to schedule.

Instruction loads compete with other instructions for XMU issue slots, and may displace compute instructions. Figure 4.22 shows the fraction of the XMU issue slots that are occupied by instruction load and jump-and-load instructions. The reported fraction is computed by dividing the number of instruction load and jump-and-load instructions executed during a kernel by the number of instruction pairs issued during the execution of the kernel.

As Figure 4.22 illustrates, the fraction of issue slots consumed by instruction loads decreases as more instruction registers are made available to the compiler. Intuitively, this is because the compiler is able to construct larger instruction blocks and consequently needs to issue fewer instruction loads. The abrupt reductions in the fraction of issue slots occupied by load instructions that we observe when the number of instruction registers exceeds some critical threshold are explained by how the compiler constructs instruction blocks and schedules instruction loads. The compiler attempts to advance instruction loads above loops. Consequently, when the compiler has enough instruction registers available to lift a load out of a loop, the fraction of issue slots occupied by instruction loads decreases by an amount that is proportionate to the contribution of the loop to the execution time.

The data presented in Figure 4.22 explains part of the increase in execution times, illustrated in Figure 4.21, that we observe when we reduce the number of instruction registers. Kernels take longer to execute when there are fewer instruction registers because more time is lost executing load instructions. The remainder of the increase in execution time is due to the processor waiting for instructions to arrive from memory. The processor spends more time waiting for instructions when the instruction register file capacity is reduced because instructions are loaded more often and individual transfers are shorter, providing less time to overlap instruction transfers with computation.


[Figure 4.22 – Fraction of Issued Instruction Pairs that Contain Instruction Loads. The reported fraction is computed by dividing the number of instruction load and jump-and-load instructions executed during a kernel by the number of instruction pairs issued during the execution of the kernel.]

Instruction Bandwidth

Figure 4.23 shows the normalized instruction bandwidth demand at the instruction register file and memory. The normalized instruction bandwidth demand measures the ratio of the instruction bandwidth demand at the instruction register files and the local memory interface. We can interpret it as an estimate of the number of instruction pairs that need to be delivered from the instruction register files and memory each cycle to meet the instruction demands of the processor. The instruction bandwidths are normalized to remove the impact of stalls introduced by data communication. The differences in the instruction bandwidth demands measured for the different kernels and capacities provide insights into the spatial and temporal locality present in the instruction accesses scheduled by the compiler.

As Figure 4.23 illustrates, the normalized instruction bandwidth demand at the instruction register files does not depend on the number of instruction registers, as the demand is determined by the rate at which the processor requests instructions from the instruction registers. However, the additional registers allow the compiler to schedule fewer instruction load operations, and to structure the kernels so that the processor spends fewer cycles stalled waiting for instructions to arrive from memory. The reduction in the number of stall cycles results in an increase in the instruction bandwidth at the instruction register files, as the average number of instruction pairs issued per cycle increases; however, the instruction bandwidth demand remains constant.


[Figure 4.23 – Normalized Instruction Bandwidth Demand. The normalized instruction bandwidth demand measures the ratio of the instruction bandwidth demand at the instruction register files and the local memory interface. The instruction bandwidths are normalized to remove the impact of stalls introduced by data communication.]

The instruction bandwidth demand at the local memory interface decreases as more instruction registers are made available to the compiler, as Figure 4.23 illustrates. Eventually, enough instruction registers are available to capture the loops that dominate the kernels, and the instruction bandwidth demand at the memory interface does not decrease further when additional instruction registers are made available to the compiler. When there are enough instruction registers to capture entire kernels, the instruction bandwidth demand at the memory interface depends only on the number of instructions comprising a kernel and its execution time. Consequently, the instruction bandwidth demand at the memory interface can be made arbitrarily small simply by increasing the number of times a kernel is executed after its instructions are loaded, as this amortizes the initial loading of instructions over more executions of the kernel.

We observe significant decreases in the instruction bandwidth demand at the memory interface when additional instruction registers allow the compiler to capture the remainder of an instruction working set of some importance. For example, the loop that accounts for most of the instruction references in the crc kernel contains 27 instruction pairs. Increasing the number of instruction registers from 16 to 32 significantly reduces the instruction bandwidth demand at the local memory interface because the additional registers allow the compiler to compile the loop so that instruction loads are not required inside the loop.

The huffman kernel is distinct in that the instruction bandwidth demand at the memory interface exceeds the bandwidth demand at the instruction register files for the small instruction register file configurations. This indicates that the number of instructions that are loaded to the instruction registers exceeds the number that are executed, which may happen when software loads instructions that are not executed before being displaced. In the case of the huffman kernel, the loop that dominates the execution time and accounts for the majority of the instruction bandwidth contains several data-dependent control-flow statements within multiple nested loops, which introduce regions of code that are executed during some iterations of the loop but not others.


For those regions that contain more than a few instruction pairs, the compiler inserts a short instruction sequence after the control-flow statements that precede the regions to load each region as control enters it. For those regions that contain only a few instruction pairs, the compiler merges each region with the instruction block that contains the control-flow statement, which is more efficient in terms of both execution time and instruction bandwidth. In both cases, the compiler loads some instructions that are not executed during some iterations of the loop; consequently, the instruction bandwidth demand at the memory interface exceeds the demand at the instruction register file. Increasing the number of instruction registers allows the instruction registers to capture entire inner loops, which contain enough instruction reuse to reduce the bandwidth demand at the memory interface below the demand at the instruction registers. Though the jpeg kernel includes code with similar data-dependent control-flow statements, other parts of the jpeg kernel have sufficient instruction reuse to keep the instruction bandwidth demand at the memory interface below the demand at the instruction register file.

Comparison To Filter Caches

This section compares the efficiency of instruction registers to filter caches. Figure 4.24 and Figure 4.25 show the instruction bandwidth at the memory interface for different direct-mapped and fully-associative filter cache configurations. The fully-associative filter cache configurations assume a least-recently-used replacement policy. The instruction bandwidth demand is calculated as the ratio of the number of instructions fetched from memory to the number of instructions issued. The bandwidth demand reflects the effectiveness of the filter caches at filtering instruction references.

Two trends are immediately apparent in the instruction bandwidth demand data shown in the figures: the instruction bandwidth demand decreases as the filter cache capacity increases, and the instruction bandwidth demand increases as the cache block size increases. The increase in the instruction bandwidth demand we observe when the cache block size increases deserves some attention because we would generally prefer to use a large cache block size, as increasing the cache block size reduces the number of tags that are required to cover the filter cache entries and reduces the memory access time penalty imposed by cache misses. Figure 4.2 illustrates the impact of the cache block size on the number of instructions that must be fetched from memory and the miss rate. Increasing the cache block size may introduce additional conflict misses in codes with loops that are similar in size to the capacity of the filter cache, as instructions at the start and end of the loop may map to the same entry when the block size is increased. Typically, this occurs when the loop is not aligned to cache block boundaries, which has the effect of reducing the effective capacity of the filter cache. The crc kernel contains an inner loop that exhibits this problem for the smallest configurations when the cache block size increases from 4 instruction pairs to 8 instruction pairs. The compiler may ameliorate the problem by aligning all loops to cache block boundaries, though this increases the size of the kernel in the backing memory and may introduce additional capacity misses.
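A minimal direct-mapped filter cache model makes the effect visible. In the C sketch below the loop placement and trip count are hypothetical rather than drawn from the benchmark kernels: a 60-pair loop that starts a few pairs past a block boundary fits a 64-pair cache at block sizes of 2 and 4 pairs, but at 8 pairs its first and last blocks map to the same set and evict each other on every iteration.

#include <stdio.h>

#define CAPACITY 64                  /* filter cache capacity, in instruction pairs */

/* Simulate a direct-mapped filter cache for a loop of 'len' pairs starting
 * at pair address 'start', executed 'iters' times.  Returns the number of
 * pairs fetched from the backing memory. */
static long simulate(int block, int start, int len, int iters)
{
    int sets = CAPACITY / block;
    long tag[CAPACITY];              /* one tag per set; -1 means empty */
    long fetched = 0;
    for (int s = 0; s < sets; s++) tag[s] = -1;

    for (int it = 0; it < iters; it++)
        for (int a = start; a < start + len; a++) {
            long blk = a / block;
            int  set = (int)(blk % sets);
            if (tag[set] != blk) {   /* miss: refill the whole block */
                tag[set] = blk;
                fetched += block;
            }
        }
    return fetched;
}

int main(void)
{
    int bsizes[] = { 2, 4, 8 };
    for (int i = 0; i < 3; i++)
        printf("block size %d: %ld pairs fetched\n",
               bsizes[i], simulate(bsizes[i], 6, 60, 10));
    return 0;
}

Aligning the loop so that it starts on a block boundary removes the conflict in this model, at the cost of padding in the backing memory, which mirrors the trade-off described above.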


[Figure 4.24 – Direct-Mapped Filter Cache Bandwidth Demand. The instruction bandwidth demand, shown for cache block sizes of 2, 4, and 8 instruction pairs and capacities of 16 to 256 instruction pairs, is calculated as the ratio of the number of instructions fetched from memory to the number of instructions issued.]

Increasing the cache block size invariably results in some instructions that will not be executed being transferred unnecessarily to the cache when servicing a cache miss. For example, instructions following a control-flow instruction and residing in the same cache block are transferred with the control-flow instruction even though the subsequent instructions will not be executed when the control-flow instruction transfers control to an instruction residing in a different cache block. This effect accounts for part of the steady increase in the instruction bandwidth demand we observe for the huffman and jpeg kernels when the block size increases.

Filter caches and instruction registers improve energy efficiency by filtering instruction references that would otherwise reach more expensive memories. The instruction bandwidth demand at the memory interface provides a measure of how effective the different filter cache and instruction register configurations are at filtering instruction references. However, we cannot directly compare the miss rates to evaluate relative improvements in energy efficiency because the instruction streams differ. Instead, we need to compare the relative number of instructions fetched from memory, which is shown in Figure 4.26.


[Figure 4.25 – Fully-Associative Filter Cache Bandwidth Demand. The instruction bandwidth demand, shown for cache block sizes of 2, 4, and 8 instruction pairs and capacities of 16 to 256 instruction pairs, is calculated as the ratio of the number of instructions fetched from memory to the number of instructions issued. The filter cache controller uses a least-recently-used replacement policy.]

The filter cache configurations shown in Figure 4.26 use a cache block size of 2 instruction pairs, and correspond to the configurations with the smallest instruction bandwidth demands at the capacities shown. As the data illustrated in Figure 4.26 demonstrate, the number of instructions fetched from memory is typically a small fraction of the number of instructions executed by most kernels. The notable exception is the fully-associative filter cache with 32 entries, which performs poorly on the dct kernel. The dct kernel is dominated by an inner loop that contains 38 instruction pairs. The least-recently-used replacement policy results in the cache controller replacing every instruction stored in the cache during each execution of the loop. Similar adverse interactions between working sets and the replacement policy also explain the behavior of the fully-associative filter cache for the huffman and jpeg kernels. For those kernels where the instruction register configuration fetches more instructions than the direct-mapped filter cache, the preponderance of the additional instructions loaded from memory are introduced by the compiler to manage the instruction registers.


[Figure 4.26 – Comparison of Instructions Loaded Using Filter Caches and Instruction Registers. The number of instructions loaded is shown normalized to the number of instructions executed in the filter cache configurations, for fully-associative filter caches, direct-mapped filter caches, and instruction registers at capacities of 32, 64, and 128 instruction pairs. The filter caches have a block size of 2 instruction pairs. The instruction register management instructions increase the dynamic instruction count in the instruction register configurations, which contributes to the increased number of instructions that are loaded from memory.]

registers. We should note that the number of instructions loaded from memory for the conv2d kernel is lower for the instruction register configurations than for the filter cache configurations, despite the presence of the instruction register management instructions in the instruction stream. The compiler is able to achieve this by partitioning the instruction registers as described in the previous section.
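To make the adverse interaction between the dct working set and the least-recently-used policy concrete, the following sketch simulates a fully-associative cache with LRU replacement servicing a cyclic block reference stream; the constants and the helper name lru_misses are illustrative and not part of the Elm toolchain. When the loop touches more blocks than the cache holds, every reference misses.

#include <stdio.h>

#define CAPACITY 32   /* cache entries (blocks)           */
#define LOOP     38   /* blocks touched by the inner loop */

/* Misses incurred while the cyclic stream 0,1,...,LOOP-1 is replayed
 * `iters` times against a fully-associative LRU cache. */
long lru_misses(int iters)
{
    int cache[CAPACITY];          /* cache[0] is most recently used */
    int filled = 0;
    long misses = 0;

    for (int it = 0; it < iters; it++) {
        for (int blk = 0; blk < LOOP; blk++) {
            int hit = -1;
            for (int i = 0; i < filled; i++)
                if (cache[i] == blk) { hit = i; break; }
            if (hit < 0) {                       /* miss: evict the LRU entry */
                misses++;
                if (filled < CAPACITY) filled++;
                for (int i = filled - 1; i > 0; i--)
                    cache[i] = cache[i - 1];
                cache[0] = blk;
            } else {                             /* hit: move block to the front */
                for (int i = hit; i > 0; i--)
                    cache[i] = cache[i - 1];
                cache[0] = blk;
            }
        }
    }
    return misses;
}

int main(void)
{
    /* With LOOP > CAPACITY every reference misses: 10 * 38 = 380. */
    printf("misses = %ld\n", lru_misses(10));
    return 0;
}

If the loop instead fit within the 32 entries, only the compulsory misses of the first iteration would remain, which is the behavior the larger filter cache configurations exhibit.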


Arithmetic Operations                                      Typical Energy   Relative Energy
  32-bit addition                                              520 fJ              1×
  16-bit multiply                                            2,200 fJ            4.2×
Instruction Cache — 2048 Instruction Pairs [Direct-Mapped]
  Control Flow Logic                                           445 fJ           0.85×
  Instruction Pair Fetch                                    18,000 fJ           34.6×
Filter Cache — 64 Instruction Pairs [Direct-Mapped]
  Control Flow Logic                                           445 fJ           0.85×
  Instruction Pair Fetch                                     1,000 fJ            1.9×
  Instruction Pair Miss                                      1,000 fJ            1.9×
  Instruction Pair Refill [2 Pairs]                          1,500 fJ            2.9×
Filter Cache — 64 Instruction Pairs [Fully-Associative]
  Control Flow Logic                                           445 fJ           0.85×
  Instruction Pair Fetch                                     1,400 fJ            2.7×
  Instruction Pair Miss                                        650 fJ            1.2×
  Instruction Pair Refill [2 Pairs]                          1,600 fJ            3.0×
Instruction Registers — 64 Instruction Pairs
  Control Flow Logic                                           154 fJ            0.3×
  Instruction Pair Fetch                                       330 fJ            0.6×
  Instruction Pair Refill                                    1,160 fJ            2.2×

Table 4.1 – Elm Instruction Fetch Energy. The data were derived from circuits designed in a 45 nm CMOS process that is tailored for low standby-power applications. The process uses thick gate oxides to reduce leakage, and a nominal 1.1 V supply.

Table 4.1 lists the typical energy expended fetching instructions from the different instruction stores. The energy consumed performing common arithmetic operations is included for comparison. The data presented in the table were determined from detailed transistor-level simulations of the memory and logic circuits comprising the different stores. The circuit models were extracted after layout, and include parasitic device and interconnect capacitances.

The filter caches use a cache block size of 128 bits, which accommodates 2 instruction pairs. The direct-mapped filter cache uses an SRAM array for tag storage. The tag and instruction arrays are accessed in parallel. Consequently, the energy expended accessing the cache does not depend on whether the access hits or misses. The fully-associative filter cache uses a CAM array for tag storage. The word-line of the instruction array is driven by the match line from the CAM array so that the instruction array is only accessed when there is a match. Consequently, the energy expended accessing the fully-associative cache depends on whether there is a hit, with more energy expended on a hit than a miss. The refill energy values include the energy expended writing both the tag and instruction array. The energy expended accessing the backing instruction cache and transferring the instructions to the filter cache is included in the fetch entry for the instruction cache.

Figure 4.27 shows the energy consumed delivering instructions to the processor when executing the benchmark kernels, from which we can compare the energy efficiencies of the different architectures. To account for differences in the dynamic instruction counts, the energy reported is the aggregate energy consumed delivering instructions through the execution of the kernel. The capacity of the instruction registers and


[Figure 4.27 plots the normalized instruction delivery energy (0.0–0.4) for the aes, conv2d, crc, crosscor, dct, fir, huffman, jpeg, rgb2yuv, and viterbi kernels, broken down into backing memory, fully-associative filter cache, direct-mapped filter cache, instruction registers, and control logic components.]

Figure 4.27 – Instruction Delivery Energy. The capacity of the instruction registers and filter caches for the configurations shown is 64 instruction pairs; the local memory that backs the instruction registers and filter caches has a capacity of 1024 instruction pairs. The energy shown is normalized to the energy that would be consumed delivering instructions directly from the local memory.

filter caches for the configurations shown is 64 instruction pairs; the local memory that backs the instruction registers and filter caches has a capacity of 1024 instruction pairs. To illustrate the improvement in energy efficiency the different architectures achieve, the energy shown is normalized to the energy consumed delivering instructions directly from the local memory. The control logic component of the energy includes the program counter register, the arithmetic circuits that update the program counter and compute branch target addresses, and the control logic and multiplexers that select the program counter value used to fetch instructions. The instruction register component of the energy includes the energy expended writing instructions to the instruction registers, reading instructions from the instruction registers, and transferring instructions to the data-path. Similarly, the filter cache component of the energy includes the energy expended reading and writing instructions to the filter caches, reading the tag entries, and transferring instructions to the data-path. We have not included the energy expended in the cache controllers required for the filter caches, and consequently the energy reported for the filter cache


configurations provides a lower bound. The backing memory component of the energy includes the energy expended reading instructions from the local memory and transferring instructions to the instruction registers or filter caches. Because the different architectures transfer different numbers of instructions from the backing memory, the relative energy efficiencies of the different architectures are sensitive to the cost of accessing the backing memory. This may make instruction register organizations appear less efficient, particularly when the instructions that are required to manage the instruction registers contribute to instruction working sets that exceed the capacity of the instruction registers. Were the amount of energy expended accessing the backing memory to increase, the instruction register organization would appear less efficient when executing the jpeg kernel because it requires that a significant number of instructions be fetched from the backing memory. However, the jpeg kernel can be partitioned into multiple smaller kernels with commensurately smaller working sets, similar to the dct and huffman kernels, which are readily captured in the instruction registers.

4.6 Related Work

Unlike reactive mechanisms such as filter caches [82], software-managed instruction registers reduce instruction movement and avoid the performance penalties caused by high miss rates. Dynamically loaded loop caches [95], which are loaded on short backwards branches, do not impose miss penalties, but they cannot capture loops containing control flow, and they execute the first two iterations of each loop out of the backing instruction cache while the loop cache is loaded. Pre-loaded loop caches [50] eliminate tag checks when executing instructions in the loop cache, but cannot be dynamically loaded: they must be large to capture any significant fraction of an application’s instructions and require hardware to detect when instructions should be issued from the loop cache. The instruction register file proposed by Hines, Whalley, and Tyson [64, 65] provides a form of code compression in which the most frequently encountered instructions are issued from a small register file after a packed sequencing instruction is fetched from the instruction cache; the registers cannot be dynamically loaded.

The tagless instruction cache described by Hines, Whalley, and Tyson [66] uses a clever scheme for modifying a conventional filter cache to exploit situations in which the tag comparison is certain to produce a match. A common example of an access that is certain to produce a match is fetching consecutive instructions from within the same cache block. This observation is commonly exploited by instruction buffers [29] and cache line buffers [45]. Effectively, the tagless cache is able to operate like an extended line buffer. The scheme is not truly tagless, as a truncated identifier is used in place of a conventional tag. The scheme requires a complicated cache controller, and replaces the filter cache tags with auxiliary data that the controller uses to determine whether an instruction is present in the tagless cache. Those configurations that perform best require more auxiliary information than would be required to store the tags used in a conventional filter cache.

The explicit data-flow graph execution model implemented in the TRIPS processor prototype defines a block-atomic execution model [107] that is similar to earlier block-atomic architectures [103]. The block-atomic execution model implemented in the TRIPS processor executes groups of instructions as atomic units. Instruction groups are fetched, executed, and committed as an atomic unit. Any architectural effects of the


instructions in an atomic block may be observed by instructions in different atomic instruction blocks only after all of the instructions in the block have been executed and are ready to commit. The compiler maps instruction groups into predicated hyperblocks. A hyperblock has a single point of entry and may have multiple points of exit. The execution model does not allow control to transfer within a hyperblock. Encoding instruction groups as hyperblocks reduces the area and complexity of the branch and branch target prediction hardware implemented in the TRIPS processor, as the hardware structures maintain information at the granularity of hyperblocks rather than individual instructions. The TRIPS processor prototype uses a fixed hyperblock size defined by hardware, which results in significant overhead, as the number of instructions available in an instruction group is often insufficient to completely fill a hyperblock. Using large atomic instruction blocks reduces the cost and complexity of branch prediction. Information used for branch prediction is maintained at the block level rather than the instruction level, so a structure with a fixed number of entries is able to cover more of an application. However, the effectiveness of this technique is partially offset by the expanded code size that results from using a fixed hyperblock size [44].

The vector-thread execution model uses atomic instruction blocks to allow software to encode spatial and temporal locality in the instruction streams executed by the virtual processors [88]. The virtual processors execute RISC-like instructions that are grouped into atomic instruction blocks. An atomic instruction block has a single point of entry and may have multiple points of exit. Software specifies the number of instructions contained in an atomic instruction block. An atomic instruction block may be issued to the virtual processors by the control processor, or may be explicitly fetched by a virtual processor. An atomic instruction block may not contain internal transfers of control, as there is no notion of a program counter or implicit fetch mechanism. Instead, atomic instruction blocks must be explicitly issued by the control processor or fetched by a virtual processor. Because there is no notion of a program counter, the virtual processors can use a short instruction pointer to record the position within an atomic instruction block of the next instruction to be executed. The atomic instruction block abstraction allows the control processor to issue a large number of instructions to the virtual processors as a single operation.

The MIT Scale prototype implementation of the vector-thread model distributes atomic instruction block caches close to the execution clusters [88]. Distributing the cache memories among the execution clusters allows instructions to be cached close to the function units. Instruction fetch commands specify the address of the atomic instruction block to be fetched. The specified address is compared against a set of tags to determine whether the instructions in the atomic instruction block are present in the cache. If there is a miss, an instruction fill unit fetches the atomic instruction block from memory and stores it to the cache. Grouping instructions into atomic instruction blocks allows a single tag comparison to determine whether all of the instructions in the block are present.
Though Scale does not allow control to transfer within an atomic instruction block, instructions may be predicated, or control may conditionally transfer to a different atomic block. Scale’s vector execution model eliminates many of the control overheads associated with executing data-parallel loops and further reduces the number of instruction tag comparisons that the processor must perform when executing vectorizable codes. However, supporting an execution model in which a large vector of virtual processors is mapped onto a much smaller number of physical processors imposes considerable hardware complexity.


The block-aware instruction set architecture described by Zmily and Kozyrakis uses basic block descriptors to describe the block structure of program codes to hardware [157]. This allows the processor front-end to decouple control-flow speculation from the fetching of instructions stored in the instruction cache, which improves the performance and energy-efficiency of superscalar processors. The proposed block-aware architecture defines basic block descriptors that software uses to describe the structure of the basic blocks comprising a program. Descriptors specify information such as the size of a basic block, the type of control-flow instruction that terminates the basic block, and hints provided by the compiler for predicting the direction of conditional branches. The basic block descriptors are stored in ancillary structures to decouple control-flow speculation from the fetching and decoding of instructions from the instruction cache. The descriptors are used to improve the accuracy of instruction prefetching. Because descriptors can be fetched before control transfers to the basic blocks they describe, the front-end can tolerate longer instruction cache access times. This allows microarchitecture optimizations that improve energy efficiency to be employed in the front-end. For example, the instruction cache can be designed so that the instruction array is accessed after the tag comparison has been completed to avoid accessing the memory array on a miss.

4.7 Chapter Summary

Instruction registers provide distributed collections of software-managed registers that extend the memory hierarchy used to deliver instructions to function units. They allow critical instruction working sets such as loop bodies and computation kernels to be stored close to function units in registers that are inexpensive to access. Instruction registers are efficiently implemented as distributed collections of small register files. These small register files are fast and inexpensive to access, and distributing them allows instructions to be stored close to function units, reducing the energy consumed transferring instructions to data-path control points.

Chapter 5

Operand Registers and Explicit Operand Forwarding

This chapter introduces operand registers and explicit operand forwarding, both of which reduce the energy consumed delivering operands to a processor’s function units. Operand registers and explicit operand forwarding let software capture short-term data reuse and instruction-level producer-consumer data locality in inexpensive registers that are closely integrated with the function units in a processor. Both operand registers and explicit forwarding expose to software the spatial distribution of registers among function units, which increases software control over the movement of data between function units and allows software to place data close to where they will be consumed, reducing the energy expended transporting operands to function units. Operand registers and explicit operand forwarding allow the compiler to keep a significant fraction of the data bandwidth within the kernels of embedded applications close to the function units, which reduces the amount of energy expended staging data in registers.

The remainder of this chapter is organized as follows. I begin by introducing the basic concepts motivating explicit operand forwarding and operand registers. I then describe how explicit operand forwarding and operand registers extend a conventional register organization and affect instruction set architecture, using Elm as a specific example of a machine that uses both explicit operand forwarding and operand registers. I then discuss microarchitecture, again using Elm as an example of an architecture that implements explicit operand forwarding and operand registers. After describing the architecture, I discuss compilation issues such as instruction scheduling and register allocation. An evaluation follows. I conclude with related work and then summarize the chapter.

5.1 Concepts

Explicit operand forwarding allows software to control the routing of data through the forwarding network so that ephemeral values can be delivered directly from pipeline registers. This lets software avoid the expense of storing ephemeral values to register files when those values can be consumed while they are available


in the forwarding network.

Operand registers extend a conventional register organization with distributed collections of small, inexpensive general-purpose operand register files, each of which is integrated with a single function unit in the execute stage of the pipeline. Essentially, operand registers extend the pipeline registers that deliver operands to function units in a conventional pipeline into shallow register files; these shallow operand register files retain the low access energy of pipeline registers while capturing a greater fraction of operand references. This lets software capture short-term reuse and instruction-level producer-consumer locality near the function units. Both explicit operand forwarding and operand registers increase software control over the movement of operands between function units and register files.

Explicit Operand Forwarding

Explicit operand forwarding lets software control the routing of operands through the forwarding network. Consider the following code fragment.

int x = a + b - c;

During compilation, the compiler translates statements that involve multiple operations into a sequence of primitive operations, each of which corresponds directly to a machine instruction or a short, fixed sequence of machine instructions. Most compilers would translate the above code fragment into a sequence of operations similar to the following.

int t = a + b;
int x = t - c;

The compiler introduces the temporary variable t to forward the result of the addition operation to the subtraction operation. The corresponding sequence of machine instructions produced by the compiler after performing register allocation and instruction selection would be similar to the following sequence of assembly instructions, though the register identifiers might differ.

add r5 r2 r3;
sub r1 r5 r4;

When the above sequence of instructions executes, the value of the compiler-introduced temporary t assigned to register r5 is delivered by the forwarding network. Register r5 is unnecessarily written when the instructions execute. Register r5 simply provides an identifier that allows the logic that controls the forwarding network to detect that the result of the add instruction must be forwarded to the sub instruction. Writing the result of the addition to the register file consumes more energy than performing the addition. For example, performing a 32-bit addition in a 45 nm low-power CMOS technology consumes about 0.52 pJ; writing the result of the addition to a 32-entry register file consumes about 1.45 pJ. Furthermore, using register r5 to forward the intermediate value unnecessarily increases register pressure, as whatever value was previously in register r5 is lost after the add instruction completes.


Explicit operand forwarding lets software orchestrate the routing of operands through the forwarding network so that ephemeral values need not be written to register files. Instead, ephemeral values can be consumed while they are available in the forwarding network. Explicit operand forwarding introduces the concept of forwarding registers to provide an explicit architectural name for the result of the last instruction that a function unit executed. Forwarding registers establish architectural names for the physical pipeline registers residing at the end of the execute (EX) stage. Software can use forwarding registers as operands to explicitly forward the result of a previous instruction. Similarly, software can use a forwarding register as a destination to direct the result of an instruction to a forwarding register, which indicates that the result should not be written to the register file. For example, the following sequence of assembly instructions forwards the result of the add operation through forwarding register t0, which is the forwarding register name for the pipeline register that captures the output of the ALU.

add t0 r2 r3;
sub r1 t0 r4;

Forwarding register t0 appears in two different contexts: it appears as the destination of the add instruction to specify that the result of the addition does not need to be stored in the register file, and it appears as an operand to the sub instruction to specify that the first operand of the subtraction operation is the result produced by the previous instruction.

Operand Registers

Registers provide low-latency and high-bandwidth locations for storing temporary and intermediate values, and for staging data transfers between memory and function units. Registers are usually implemented as multi-ported register files, which are denser and faster than latches and flip-flops, though not as dense as SRAM. Most of the architectural registers that are significant to software are implemented as registers in register files. Some registers that are accessed every cycle, such as the program counter, or are accessed irregularly, such as status registers that are accessed through special instructions, are implemented independently as dedicated registers. Accessing data stored in registers is considerably less expensive than accessing data stored in memory, both in terms of execution time and energy. Consequently, optimizing compilers spend considerable effort allocating registers in order to limit the number of expensive memory accesses that are performed. Increasing the number of registers that are available to software allows more data to be stored in registers rather than memory, reducing the number of memory references. Similarly, increasing the number of registers that are exposed to the compiler allows larger working sets to be captured in registers, reducing the number of variables that must be spilled to memory when register demand exceeds register availability. In general, the ability of the register file to filter references depends on its capacity relative to the working sets present in an application. However, providing more registers requires additional register file capacity. Because larger register files are more expensive to access, the cost of accessing data stored in registers tends to increase as more registers are made available. Ignoring the complexity of encoding larger register specifiers in a fixed-length instruction word, there is a fundamental tension between the benefits and costs of increasing the


[Figure 5.1 diagram: the memory backs the general-purpose register file (GRF), which in turn backs the distributed operand register files (ORFs) attached to the function units.]

Figure 5.1 – Operand Register Organization. The distributed and hierarchical register organization lets software exploit reuse and locality to deliver a significant fraction of operands from registers that are close to the function units. The distributed operand register files form the first level of the register hierarchy, and are used to store small, intensive working sets close to the function units. The general-purpose register file forms the second level of the register hierarchy, and is used to store larger working sets. A function unit may receive its operands from its local operand register file or the general-purpose register file, and it may write its result to any of the register files. Operand registers let software capture short-term reuse and instruction-level producer-consumer locality close to the function units, which keeps a significant fraction of the operand bandwidth local to each function unit. The capacity of the general-purpose register file allows it to capture larger working sets, exploiting reuse and locality over longer intervals than can be captured in the small operand register files. Reference filtering by the operand register files reduces demand for operand bandwidth at the general-purpose register file.

capacity of register files to capture larger working sets. The additional capacity allows more temporary and intermediate values to be stored in registers rather than memory, but makes register accesses more expensive.

Operand registers extend a conventional register organization with small collections of registers that are distributed among the function units. Figure 5.1 illustrates a register organization with operand registers. Like explicit forwarding, operand registers let software capture short-term reuse and instruction-level producer-consumer locality within the kernels of embedded applications close to the function units, and they reduce the number of accesses to the larger general-purpose register file. However, unlike forwarding registers, operand registers are architectural and are preserved during exceptions. Operand registers are more flexible than forwarding registers, and are able to capture reuse and instruction-level producer-consumer locality


within longer instruction sequences. This allows operand registers to capture a greater fraction of operand bandwidth and to offer more relaxed instruction scheduling constraints.

Like explicit operand forwarding, operand registers exploit short-term reuse and locality to reduce demand for operand bandwidth at the general-purpose register file. The operand registers effectively filter data references closer to the function units than the general-purpose registers. This keeps more communication local to the function units and improves energy efficiency: the small operand register files are inexpensive to access, and the expense of traversing the forwarding network and pipeline register preceding the execute (EX) stage of the pipeline is avoided. For example, delivering two operands from a general-purpose register file and writing back the result consumes 3.1 pJ, which exceeds the 2.2 pJ consumed by a 16-bit multiplication; using operand registers to stage the operands and the result consumes only 0.78 pJ, which is close to the 0.52 pJ consumed by a 32-bit addition.
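The 3.1 pJ and 0.78 pJ figures quoted above follow directly from the per-access energies reported in Table 5.1; the short sketch below merely reproduces that arithmetic (values in femtojoules, taken from the 4R+2W general-purpose register file and the 4-word operand register file rows).

#include <stdio.h>

int main(void)
{
    /* Per-access energies from Table 5.1, in femtojoules. */
    const double grf_read = 830.0, grf_write = 1450.0;  /* 32-entry GRF (4R+2W) */
    const double orf_read = 120.0, orf_write = 540.0;   /* 4-entry operand RF   */

    /* One two-operand instruction: two reads plus one write back. */
    double grf_total = 2.0 * grf_read + grf_write;   /* = 3110 fJ ~ 3.1 pJ  */
    double orf_total = 2.0 * orf_read + orf_write;   /* =  780 fJ = 0.78 pJ */

    printf("GRF staging: %.2f pJ\n", grf_total / 1000.0);
    printf("ORF staging: %.2f pJ\n", orf_total / 1000.0);
    return 0;
}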

5.2 Microarchitecture

Forwarding registers establish architectural names for the physical pipeline registers residing at the end of the execute (EX) stage. Forwarding registers retain the result of the last instruction that wrote the execute (EX) pipeline register. To preserve the registers, data-path control logic clock-gates the pipeline registers during interlocks and when nops execute. This allows operands to be explicitly forwarded between instructions that execute multiple cycles apart when intervening instructions do not update the physical registers used to communicate the operands [142]. The value returned from the local memory when a load instruction executes cannot be explicitly forwarded because load values do not traverse the pipeline registers at the end of the execute (EX) stage; instead, architectural registers retime load values arriving from memory.

Processors with deeper pipelines generally need more registers to cover longer operation latencies. Explicit forwarding can be extended for deeper pipelines by establishing additional explicit names for the results of recently executed instructions that are available within the forwarding network. However, the pipeline registers that stage the forwarded operands are not conventional architectural registers and may be difficult to preserve when interrupts and exceptions occur. When interrupts are a concern, the compiler can be directed to disallow the explicit forwarding of operands between instructions that may cause exceptions.

Operand Registers

Operand registers are implemented as very small register files that are integrated with the function units in the execute stage of the pipeline. This allows operand registers to provide the energy efficiency of explicit forwarding while preserving the behavior of general-purpose registers. Accessing the operand register files in the execute (EX) stage may contribute to longer critical path delays. However, the operand register read occurs in parallel with the activation of the forwarding network, and much of the operand register file access time is covered by the forwarding of operands from the write-back (WB) stage. Regardless, register organizations using operand registers require fast, and consequently small, operand register files to avoid creating


[Figure 5.2 diagram: the IF, DE, EX, and WB pipeline stages, showing the general-purpose register file (GRF), the operand register files (ARF and DRF) in front of the XMU and ALU in the execute (EX) stage, the forwarding networks, and the data cache accessed by load operations.]

Figure 5.2 – Operand Register Microarchitecture. The operand register files precede the function units in the execute (EX) stage of the pipeline. The extended-memory unit (XMU) performs memory operations and simple arithmetic operations. The arithmetic and logic unit (ALU) performs simple and complex arithmetic instructions. The operand register file associated with the extended-memory unit is referred to as the address register file (ARF), and the operand register file associated with the arithmetic and logic unit is referred to as the data register file (DRF). The operand register files are backed by the deeper general-purpose register file (GRF). Reference filtering reduces demand for operand bandwidth at the general-purpose register file, reducing the number of read ports needed at the general-purpose register file.

critical paths in the execute stage. Each operand register is assigned to a single function unit, and only its associated function unit can read it. The function units can write any register, which allows the function units to communicate through operand registers or the more expensive backing registers. This organization allows software to keep a significant fraction of the operand bandwidth local to the function units, to place operands close to the function unit that will consume them, and to keep data produced and consumed on the same function unit local to the function unit.


Arithmetic Operations                                               Typical Energy   Relative Energy
  32-bit addition                                                       520 fJ              1×
  16-bit multiply                                                     2,200 fJ            4.2×
General-Purpose Register File — 32 words (4R+2W)
  32-bit read                                                           830 fJ            1.6×
  32-bit write back                                                   1,450 fJ            2.8×
General-Purpose Register File — 32 words (2R+2W)
  32-bit read                                                           800 fJ            1.5×
  32-bit write back                                                   1,380 fJ            2.7×
Operand Register File — 4 words (2R+2W)
  32-bit read                                                           120 fJ            0.2×
  32-bit write back                                                     540 fJ            1.0×
Forwarding Register — 1 word (1R+1W)
  32-bit forward                                                        530 fJ            1.0×
Data Cache — 2K words (1R+1W) [8-way set-associative with CAM tags]
  32-bit load                                                        10,100 fJ           19.4×
  32-bit store                                                       14,900 fJ           28.6×

Table 5.1 – Energy Consumed by Common Operations. The data were derived from circuits implemented in a 45 nm CMOS process that is tailored for low standby-power applications. The process uses thick gate oxides to reduce leakage, and a nominal supply of 1.1 V to provide adequate drive current.

Furthermore, reference filtering by the operand registers reduces demand for operand bandwidth from the shared general-purpose registers, which allows the number of read ports to the general-purpose register file to be reduced without adversely impacting performance. This makes the general-purpose register file smaller and shortens its bit-lines, reducing its access costs by decreasing the bit-line capacitance switched during read and write operations. It also reduces the cost of updating a register because fewer access devices contribute to the capacitance internal to each register.

5.3 Example

Consider the following code fragment, which computes the product of the complex values a and b and assigns the product to x. The operands are 16-bit fixed-point integers.

100 x.re = a.re * b.re - a.im * b.im;
101 x.im = a.re * b.im + a.im * b.re;

Table 5.1 lists typical amounts of energy expended performing the arithmetic operations, memory operations, and register file accesses performed in the example code. The register file read energy includes the cost of transferring an operand from the register file to an input of a function unit. The register file write energy includes the cost of transferring a result from the output of a function unit to the register file.

Figure 5.3 shows the sequence of operations generated by the compiler, and Figure 5.4 shows the corresponding sequence of assembly instructions for a conventional register organization. The assembly listing assumes that the address of variable a resides in register r10, the address of b resides in r11, and the address


// source
100 x.re = a.re * b.re - a.im * b.im;
101 x.im = a.re * b.im + a.im * b.re;

// compiled
200 t1 = load [a.re];
201 t2 = load [a.im];
202 t3 = load [b.re];
203 t4 = load [b.im];
204 t5 = mult t1 * t3;
205 t6 = mult t2 * t4;
206 t7 = sub t5 - t6;
207 [x.re] = store t7;
208 t8 = mult t1 * t4;
209 t9 = mult t2 * t3;
210 t10 = add t8 + t9;
211 [x.im] = store t10;

Figure 5.3 – Intermediate Representation.

// source
100 x.re = a.re * b.re - a.im * b.im;
101 x.im = a.re * b.im + a.im * b.re;

// assembly
200 ld r1 [r10+0];
201 ld r2 [r10+1];
202 ld r3 [r11+0];
203 ld r4 [r11+1];
204 mul r5 r1 r3;
205 mul r6 r2 r4;
206 sub r7 r5 r6;
207 st r7 [r12+0];
208 mul r5 r1 r4;
209 mul r6 r2 r3;
210 add r7 r5 r6;
211 st r7 [r12+1];

Figure 5.4 – Assembly for Conventional Register Organization. The address of variable a resides in register r10, the address of b in register r11, and the address of x in register r12. Registers r1–r4 are used to stage the real and imaginary components of a and b. Registers r5 and r6 are used to store intermediate results, and register r7 is used to temporarily store the real and imaginary components of the product.

of x resides in r12. Registers r1–r4 are used to stage the real and imaginary components of a and b that are loaded from memory, and registers r5–r7 are used to stage intermediate values.

As shown in Figure 5.4, computing the product of two complex values involves 6 arithmetic operations: 4 scalar multiplications, 1 addition, and 1 subtraction. As shown, the computation also requires 6 memory operations: 4 loads and 2 stores. The computation involves a total of 30 register file operations: 20 reads and 10 writes. The arithmetic operations involved in the computation of the complex product consume approximately 9.8 pJ; the memory operations consume approximately 70.2 pJ; the register file accesses consume approximately 31.1 pJ, which exceeds the energy expended performing the 6 arithmetic operations required by the computation.
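The totals quoted above can be reproduced from the per-operation energies in Table 5.1; a minimal sketch of the arithmetic follows. Costing the subtraction the same as an addition is an assumption, since Table 5.1 does not list subtraction separately.

#include <stdio.h>

int main(void)
{
    /* Per-operation energies from Table 5.1, in femtojoules. */
    const double add = 520.0, mul = 2200.0;
    const double load = 10100.0, store = 14900.0;
    const double grf_read = 830.0, grf_write = 1450.0;

    /* Figure 5.4: 4 multiplies, 1 add, 1 subtract (costed as an add),
     * 4 loads, 2 stores, 20 register reads, 10 register writes. */
    double arith = 4 * mul + 2 * add;              /* ~  9.8 pJ */
    double mem   = 4 * load + 2 * store;           /* ~ 70.2 pJ */
    double reg   = 20 * grf_read + 10 * grf_write; /* ~ 31.1 pJ */

    printf("arithmetic %.1f pJ, memory %.1f pJ, registers %.1f pJ\n",
           arith / 1000, mem / 1000, reg / 1000);
    return 0;
}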


// source
100 x.re = a.re * b.re - a.im * b.im;
101 x.im = a.re * b.im + a.im * b.re;

// assembly
200 ld dr0 [ar0+0];
201 ld dr1 [ar0+1];
202 ld dr2 [ar1+0];
203 ld dr3 [ar1+1];
204 mul gr1 dr0 dr2;
205 mul tr0 dr1 dr3;
206 sub tr0 gr1 tr0;
207 st tr0 [ar2+0];
208 mul dr0 dr0 dr3;
209 mul tr0 dr1 dr2;
210 add tr0 dr0 tr0;
211 st tr0 [ar2+1];

Figure 5.5 – Assembly Using Explicit Forwarding and Operand Registers. The address of variable a resides in operand register ar0, the address of b in operand register ar1, and the address of x in operand register ar2. Operand registers dr0–dr3 are used to stage the real and imaginary components of a and b; operand register dr0 is also used to stage the result of the multiplication performed at line 208. Forwarding register tr0 is used to explicitly forward results from the ALU.

Figure 5.5 shows the corresponding sequence of assembly instructions using operand registers and explicit forwarding to stage operands and intermediate results. The assembly listing shown in Figure 5.5 assumes that the address of variable a resides in operand register ar0, the address of b in operand register ar1, and the address of x in operand register ar2. Operand registers dr0–dr3 are used to stage the real and imaginary components of a and b; operand register dr0 is also used to stage the result of the multiplication performed at line 208. Forwarding register tr0 is used to explicitly forward results from the ALU, which are consumed in both the ALU and XMU.

The assembly listing shown in Figure 5.5 performs the same number of arithmetic and memory operations; consequently, the amount of energy consumed performing the arithmetic operations and accessing memory remains unchanged. However, fewer register file accesses are performed, and most of the register accesses are to operand registers. Explicit forwarding eliminates 10 register file accesses, reducing to 20 the number of register file operations. The number of register file reads is reduced from 20 to 15; only one read operation is performed at the general-purpose register file. The number of register file writes is reduced from 10 to 5; again, only one write operation is performed at the general-purpose register file. As a consequence, the amount of energy spent staging operands in registers decreases from 31.1 pJ to 9.3 pJ. Perhaps significantly, the use of operand registers and explicit forwarding results in less energy being expended staging data in registers than is expended performing the arithmetic operations required by the computation.

5.4 Allocation and Scheduling

Explicit operand forwarding and operand registers require additional compiler passes and optimizations. This section describes the register allocator and instruction scheduler implemented in the compiler used in the


evaluation, and addresses compilation issues associated with allocating registers and scheduling instructions for a processor that provides operand registers and supports explicit operand forwarding.

Scheduling for Explicit Forwarding

A basic explicit operand forwarding compiler pass can be implemented as a peephole optimization performed after instruction scheduling and register allocation. The peephole optimization identifies instruction sequences in which the result produced by an instruction in the sequence can be explicitly forwarded to an instruction that appears later in the sequence and replaces the operand of the later instruction with a forwarding register. In general, the optimization pass cannot replace the destination of the producing instruction with a forwarding register unless it determines that all consumers of the result can receive the result from the forwarding register. Essentially, the optimization pass must know the locations of all uses of the result, and that all uses have been updated to use the forwarding register before modifying the producing instruction. This requires information about the definitions and uses of the result of the producing instruction that is generated by data-flow analysis, such as reaching-definitions analysis, which may not be available during a peephole optimization pass.

In general, an explicit forwarding optimization pass is more effective when performed between instruction scheduling and register allocation. This allows the register allocator to avoid allocating registers to those variables that can be explicitly forwarded. Typically, variables that can be explicitly forwarded are also ideal candidates for operand registers and will displace other variables competing for operand registers; consequently, performing an explicit forwarding pass before register allocation allows more variables to be assigned to operand registers and achieves better operand register utilization.

The optimization pass implemented in the evaluation compiler is performed immediately before register allocation, which occurs after instruction scheduling. The optimization pass identifies instructions whose results are consumed before another instruction executes on the same function unit. If all uses of the variable defined by an instruction appear before another instruction executes on the same function unit, the optimization pass assigns the variable to the forwarding register. Those variables that are assigned to the forwarding registers are ignored during register allocation.

The instruction scheduler can expose opportunities for explicit forwarding by scheduling dependent instructions so that the result of the earlier instruction will be available in the forwarding register when the later instruction reaches the execute stage of the pipeline. To improve locality, the compiler schedules sets of dependent instructions, and attempts to schedule sequences of dependent instructions on the same function unit. This exposes opportunities for the compiler to explicitly forward operands, and to store those that cannot be explicitly forwarded in local operand registers.
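A minimal sketch of the pre-allocation pass described above follows. It assumes a single scheduled basic block with single definitions per variable, and a hypothetical IR record (Instr) carrying the issuing function unit, the defined variable, and the variables read; the rule it encodes is that a result remains in its unit's forwarding register until the next instruction on that unit writes its own result.

#include <stdbool.h>

/* Hypothetical scheduled-IR record; issue order is the array order. */
typedef struct {
    int unit;      /* function unit the instruction executes on    */
    int def;       /* variable defined by the instruction, or -1   */
    int uses[3];   /* variables read; unused slots hold -1         */
} Instr;

static bool reads_var(const Instr *in, int v)
{
    for (int k = 0; k < 3; k++)
        if (in->uses[k] == v)
            return true;
    return false;
}

/* Mark each variable that can live entirely in a forwarding register.
 * A definition on unit U stays in U's forwarding register until the
 * next instruction on U writes its own result, so uses at or before
 * that instruction can be forwarded; later uses cannot.
 * Assumes each variable has a single definition. */
void mark_forwardable(const Instr *sched, int n, bool *forwardable, int nvars)
{
    for (int v = 0; v < nvars; v++)
        forwardable[v] = false;

    for (int i = 0; i < n; i++) {
        int v = sched[i].def;
        if (v < 0)
            continue;

        int limit = n;                        /* index of next same-unit op */
        for (int j = i + 1; j < n; j++)
            if (sched[j].unit == sched[i].unit) { limit = j; break; }

        bool ok = true;
        for (int j = limit + 1; j < n; j++)   /* any later use disqualifies v */
            if (reads_var(&sched[j], v))
                ok = false;

        forwardable[v] = ok;
    }
}

Variables marked forwardable would simply be skipped by the register allocator, matching the treatment described above.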

Allocating Operand Registers

Operand registers could be allocated using a conventional register allocator provided that it accounts for the restricted connectivity between the operand register files and function units. Any variables that are operands to instructions that execute on different function units must be assigned to general-purpose registers. Variables


that are read by only one function unit could be allocated to the operand registers associated with the function unit. The register allocator could assign each variable to the least expensive register that is available for the variable. However, without considering the broader impact on energy efficiency of which variables are assigned to operand registers, we should not expect such a register allocator to select the best variables for operand registers.

The register allocator used in the compiler sorts the variables before assigning registers so that variables that are better candidates for operand registers are assigned registers before variables that are less suitable candidates. After sorting the variables, the compiler scans the sorted list and assigns each variable the least expensive register that is available when the scan reaches it. The compiler preferentially schedules dependent instructions on the same function unit in an attempt to expose opportunities for using the local operand registers to stage intermediates between dependent instructions.

The register allocation phase proceeds as follows. The register allocator first constructs an interference graph for all of the variables in the compilation unit. The nodes in the interference graph represent variables in the compilation unit, and the edges in the interference graph indicate that the live ranges of the two nodes connected by the edge overlap. Two variables cannot be assigned to the same register if the corresponding nodes in the interference graph are connected by an edge. After constructing the initial interference graph, the register allocator coalesces pairs of nodes that correspond to the source and destination of a copy operation. Both variables represented by a coalesced node will be assigned to the same register, which usually allows one or more move operations to be eliminated.

After coalescing nodes in the interference graph, the register allocator analyzes the compilation unit to determine the order in which variables should be assigned to registers. This phase of the register allocation process produces a sorted list of all of the nodes in the interference graph. Two strategies for ordering the nodes in the sorted list are described in the following sections.

Finally, the register allocator scans the sorted list and assigns registers to the nodes in the interference graph. The allocator assigns the least expensive register available for the node. The allocator uses the interference graph to determine which registers are available. Basically, a node cannot be assigned a register that has already been assigned to an adjacent node in the interference graph. For example, the register allocator will assign a data register to a node representing variables that are only read by the ALU if a data register is available when the allocator encounters the node; otherwise, the allocator will assign a general-purpose register to the node. After selecting a register, the register allocator colors the node in the interference graph with the assigned register.
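A minimal sketch of this final assignment scan, under simplified assumptions (a 4-entry data register file, a 4-entry address register file, a 32-entry general-purpose file, interference represented as an adjacency matrix, and all names hypothetical), might look as follows.

#include <stdbool.h>

enum { DRF_SIZE = 4, ARF_SIZE = 4, GRF_SIZE = 32 };
enum reg_class { DRF, ARF, GRF, SPILL };      /* cheapest class first */

typedef struct {
    bool alu_only;      /* read only by the ALU -> DRF candidate  */
    bool xmu_only;      /* read only by the XMU -> ARF candidate  */
    int  assigned;      /* register number, or -1 if uncolored    */
    enum reg_class cls;
} Var;

/* Assign `node` the least expensive register not already used by an
 * interfering, already-colored neighbor.  interferes[a][b] is true when
 * the live ranges of variables a and b overlap. */
void assign_register(Var *vars, int nvars, bool **interferes, int node)
{
    bool taken_drf[DRF_SIZE] = {0}, taken_arf[ARF_SIZE] = {0};
    bool taken_grf[GRF_SIZE] = {0};

    for (int m = 0; m < nvars; m++) {
        if (!interferes[node][m] || vars[m].assigned < 0)
            continue;                          /* ignore uncolored neighbors */
        if (vars[m].cls == DRF) taken_drf[vars[m].assigned] = true;
        if (vars[m].cls == ARF) taken_arf[vars[m].assigned] = true;
        if (vars[m].cls == GRF) taken_grf[vars[m].assigned] = true;
    }

    /* Prefer the operand file attached to the only reader, then the GRF. */
    if (vars[node].alu_only)
        for (int r = 0; r < DRF_SIZE; r++)
            if (!taken_drf[r]) { vars[node].cls = DRF; vars[node].assigned = r; return; }
    if (vars[node].xmu_only)
        for (int r = 0; r < ARF_SIZE; r++)
            if (!taken_arf[r]) { vars[node].cls = ARF; vars[node].assigned = r; return; }
    for (int r = 0; r < GRF_SIZE; r++)
        if (!taken_grf[r]) { vars[node].cls = GRF; vars[node].assigned = r; return; }

    vars[node].cls = SPILL;                    /* no register available */
    vars[node].assigned = -1;
}

The caller would walk the sorted node list and invoke assign_register for each node; a SPILL result corresponds to the spill-and-reschedule case described next.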

Should the register allocator encounter a variable that cannot be assigned a register because there are no registers available, it terminates and requests that the instruction schedule be modified to spill the variable to memory. The spill instruction, which stores the spilled value to memory, is inserted immediately after the value is produced so that a forwarding register can be used to forward the value, thereby avoiding the need to allocate a register for the spilled variable until it is used. The register allocation process is repeated until it completes successfully.


// source

100 x.re = a.re * b.re - a.im * b.im;
101 x.im = a.re * b.im + a.im * b.re;

// compiled

200 t1 = load [a.re];    [defs=1 uses=2 appearances=3]
201 t2 = load [a.im];    [defs=1 uses=2 appearances=3]
202 t3 = load [b.re];    [defs=1 uses=2 appearances=3]
203 t4 = load [b.im];    [defs=1 uses=2 appearances=3]
204 t5 = mult t1 * t3;   [defs=1 uses=1 appearances=2]
205 t6 = mult t2 * t4;   [defs=1 uses=1 appearances=2]
206 t7 = sub t5 - t6;    [defs=1 uses=1 appearances=2]
207 [x.re] = store t7;
208 t8 = mult t1 * t4;   [defs=1 uses=1 appearances=2]
209 t9 = mult t2 * t3;   [defs=1 uses=1 appearances=2]
210 t10 = add t8 + t9;   [defs=1 uses=1 appearances=2]
211 [x.im] = store t10;

// data-vars    = [t1 t2 t3 t4 t5 t6 t7 t8 t9 t10]
// address-vars = [a b x]

Figure 5.6 – Calculation of Operand Appearances. The code computes the product of two complex values. The order in which variables are assigned registers is listed at the bottom. Variables t6, t7, t9, and t10 could be explicitly forwarded.

Operand Appearance

One metric for ordering the nodes in the interference graph is the number of times the variable will be accessed when the code executes. In general, it is not possible to determine this, particularly not as a static compiler analysis. However, we can estimate the number of times a variable will be accessed by counting the number of times it appears in the compilation unit. Presumably, variables that appear more often will be accessed more often when the code executes, and it will be profitable to assign these frequently occurring variables to inexpensive operand registers. Of course, simply counting the number of appearances of a variable does not account for the structure of the control-flow in the compilation unit. For example, variables that are accessed within deeply nested loops are likely to be accessed more often than variables that are accessed outside of loops. A reasonable way to account for this is to weight each appearance of a variable, weighting appearances that are deeper in a loop nest more heavily. The operand appearance metric accounts for any loops in the compilation unit by weighting each definition and use of a variable by the loop depth at which it occurs.
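A small sketch of the weighted count follows. The Appearance record and the weight of 10 per level of loop nesting are assumptions chosen for illustration; the text above only specifies that deeper appearances are weighted more heavily.

/* Hypothetical appearance record: one per definition or use of a variable. */
typedef struct {
    int var;         /* variable identifier  */
    int loop_depth;  /* 0 outside any loop   */
} Appearance;

/* Weight appearances that occur deeper in a loop nest more heavily.
 * The base of 10 per nesting level is an illustrative assumption. */
double appearance_weight(int loop_depth)
{
    double w = 1.0;
    for (int d = 0; d < loop_depth; d++)
        w *= 10.0;
    return w;
}

/* Sum the weighted appearances of each variable; score[v] orders the
 * interference-graph nodes, highest score first. */
void operand_appearance(const Appearance *apps, int napps,
                        double *score, int nvars)
{
    for (int v = 0; v < nvars; v++)
        score[v] = 0.0;
    for (int i = 0; i < napps; i++)
        score[apps[i].var] += appearance_weight(apps[i].loop_depth);
}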

Operand Intensity

The operand appearance metric identifies and prioritizes variables that are likely to be accessed more often when the code executes. However, it fails to account for the lifetimes of the variables, which affect how long individual variables will occupy registers. Consider allocating registers


// source

100 x.re = a.re * b.re - a.im * b.im;
101 x.im = a.re * b.im + a.im * b.re;

// compiled

200 t1 = load [a.re];    [defs=1 uses=2 life=8 intensity=0.38]
201 t2 = load [a.im];    [defs=1 uses=2 life=8 intensity=0.38]
202 t3 = load [b.re];    [defs=1 uses=2 life=7 intensity=0.43]
203 t4 = load [b.im];    [defs=1 uses=2 life=5 intensity=0.60]
204 t5 = mult t1 * t3;   [defs=1 uses=1 life=2 intensity=1.00]
205 t6 = mult t2 * t4;   [defs=1 uses=1 life=1 intensity=2.00]
206 t7 = sub t5 - t6;    [defs=1 uses=1 life=1 intensity=2.00]
207 [x.re] = store t7;
208 t8 = mult t1 * t4;   [defs=1 uses=1 life=2 intensity=1.00]
209 t9 = mult t2 * t3;   [defs=1 uses=1 life=1 intensity=2.00]
210 t10 = add t8 + t9;   [defs=1 uses=1 life=1 intensity=2.00]
211 [x.im] = store t10;

// data-vars    = [t10 t9 t7 t6 t8 t5 t4 t3 t2 t1]
// address-vars = [a b x]

Figure 5.7 – Calculation of Operand Intensity. The code computes the product of two complex values. The order in which variables are assigned registers is listed at the bottom. Data register candidates t10, t9, t7, and t6, which are prioritized because they have the greatest operand intensity, could be explicitly forwarded.

for variables in the code shown in Figure 5.6, which has been annotated with the operand appearances metric computed for each variable where it is defined. Let us assume that the target machine has a single operand register and does not support explicit forwarding. A register allocator using the operand appearance metric would allocate one of t1 – t4 to the operand register, as these variables have the greatest operand appearance values. To make the example concrete, let us assume that t1 is assigned to the operand register and occupies the register from line 201 through 208. The operand register is used 3 times between 201 and 208. The operand register would have been more productively employed if instead the register allocator had assigned the operand register to variables t5, t7, and t8, as it would be used 6 times between line 201 and 208.

Intuitively, the operand bandwidth captured in operand registers can be increased by allocating operand registers to variables that are either short-lived or are long-lived and frequently accessed. The intuition here is that short-lived variables will not occupy an operand register for very long, so the register will be available for use by other variables. The register allocator can identify variables that exhibit these properties by computing a measure of operand intensity for each variable. Operand intensity is computed by dividing the operand appearance value for a variable by its estimated lifetime. The lifetime is estimated statically as the number of instructions for which the variable is live, as computed by a conventional live-variables data-flow analysis. The lifetime estimate does not account for the loop structure. This is intentional. The primary purpose of including the estimated lifetimes of variables is to order variables that appear at the same level of loop nesting. An example of the computation appears in Figure 5.7.
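Continuing the sketch above, the intensity of a variable divides its appearance count by its statically estimated lifetime; the helper below assumes the lifetime has already been computed, in instructions, by live-variables analysis.

/* Operand intensity: appearance count divided by estimated lifetime.
 * lifetime is the number of instructions for which the variable is
 * live, as computed by live-variables data-flow analysis. */
double operand_intensity(double appearance_score, int lifetime)
{
    if (lifetime < 1)
        lifetime = 1;               /* guard against degenerate ranges */
    return appearance_score / (double)lifetime;
}

For example, t4 in Figure 5.7 has 3 appearances over a lifetime of 5 instructions, which yields the listed intensity of 0.60.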


5.5 Evaluation

Figure 5.8 shows the normalized operand bandwidth at the register files and memory interface for several embedded kernels. The figure shows the composition of the operand read bandwidth and the result write bandwidth as additional operand registers are made available to the compiler. Pairs of adjacent columns illustrate the impact of explicit forwarding: the left column shows bandwidth composition with explicit forwarding, and the right column shows the composition without explicit forwarding. The dedicated accumulator registers, which are used by multiply-and-accumulate operations, are integrated with the ALU and are reported as operand registers. The accumulators appear in all of the configurations, and contribute to the operand register bandwidth reported for the configuration that lacks operand register files. The configuration with 8 operand registers, which associates 4 operand registers with the ALU and 4 with the XMU, corresponds to the Elm register organization.

To illustrate the impact of operand register organization, the read and write bandwidths are normalized to a conventional register organization with a unified register file with 4 read ports and 2 write ports. The data shown in the figures were collected using a general-purpose register file with 2 read ports and 2 write ports. The reduction in the read port bandwidth available at the general-purpose register file can result in additional register accesses, as some shared data may be duplicated; it also may increase kernel execution times, as additional cycles may be required to resolve read port conflicts. The data are normalized to correct for additional register accesses.

The number of general-purpose registers remains constant at 32 registers for all configurations. The additional operand registers increase the aggregate number of registers that are available to the compiler, which allows the compiler to reduce the number of variables that are spilled to memory. Both the dct and huffman kernels show a reduction in aggregate operand bandwidth because there are fewer register spills, which introduce load and store operations, both of which read registers. The jpeg kernel shows a reduction for the same reason, and includes code similar to the dct and huffman kernels. However, the aggregate benefit that can be attributed simply to the additional registers is modest, as the compiler introduces few register spills even when there are no operand registers.

Figure 5.8 shows that operand registers are an effective organization for capturing a significant fraction of operand bandwidth local to the function units. Without explicit forwarding, the configuration with 8 operand registers, which provides four operand registers per function unit, captures between 32% and 66% of all operand bandwidth in operand registers; the operand registers account for 43% to 87% of all register references. As the figure illustrates, explicit operand forwarding replaces operand register references with explicit operand forwards. With explicit operand forwarding, the configuration with 8 operand registers captures between 25% and 64% of all operand bandwidth in operand registers; the operand registers account for 38% to 81% of all register references. In addition to capturing instruction-level producer-consumer data locality within the function units, explicit operand forwarding reduces register pressure.
The crc and crosscor kernels clearly illustrate the effectiveness of explicit operand forwarding at reducing operand register pressure when the number of operand registers is limited; the configurations that use explicit operand forwarding are able to capture a greater fraction of data references in operand registers.


[Figure 5.8 shows one panel per kernel (aes, conv2d, crc, crosscor, dct, fir, huffman, jpeg, rgb2yuv, viterbi); each panel plots the normalized bandwidth composition (hardware forward, explicit forward, operand register, general-purpose register, memory) as the number of operand registers increases from 0 to 12.]

Figure 5.8 – Impact of Operand Registers on Data Bandwidth. The number of operand registers appears on the horizontal axis. Pairs of adjacent columns illustrate the impact of explicit forwarding: the left column shows bandwidth composition with explicit forwarding, and the right column shows the composition without explicit forwarding.


[Figure 5.9 comprises four stacked-bar panels (aes:read, aes:write, viterbi:read, viterbi:write). Each panel plots normalized bandwidth (0.0 to 1.0) against the number of operand registers (0 to 12), with bars decomposed into hardware forward, explicit forward, operand register, general-purpose register, and memory components.]

Figure 5.9 – Operand and Result Bandwidth. The number of operand registers appears on the horizontal axis. Pairs of adjacent columns illustrate the impact of explicit operand forwarding on the operand and result bandwidth: the left column shows the bandwidth composition with explicit forwarding; the right column shows the bandwidth composition without. The data shown are normalized to illustrate the impact clearly; the normalization accounts for differences in the number of operand reads and result writes. The result write bandwidth data are shown separately.

Perhaps more interestingly, the viterbi kernel data shows that explicit operand forwarding can compensate for the reduction in the number of read ports at the general-purpose register file. Explicit operand forwarding effectively provides additional named locations that the compiler uses to deliver operands to instructions, which compensates for the reduction in read ports when instruction sequences exhibit sufficient instruction-level producer-consumer data locality. It is possible for conventional hardware forwarding to provide the same effective increase in operand bandwidth when the control logic in the forwarding network uses the register identifiers encoded in instructions rather than the addresses that are sent to the register files to detect hazards. This requires arbitration hardware at the register file to select those register accesses that cannot be forwarded, and the compiler must be designed so that the register allocator exploits the ability of the forwarding network to deliver the operands despite apparent conflicts. The register allocator implemented in the elmcc compiler considers register file port conflicts without evaluating whether operands might be forwarded, which accounts for the behavior observed in the figure.

Figure 5.9 shows the impact of explicit forwarding on the composition of the operand and result bandwidth for the aes and viterbi kernels. As the figure illustrates, explicit forwarding does not affect the aggregate aes bandwidth. Explicit forwarding allows from 30% to 33% of all results to be forwarded without writing a register file, and allows from 22% to 24% of all operands to be explicitly forwarded. The difference in the number of results and operands that are explicitly forwarded simply reflects the average number of operands to an instruction exceeding the average number of result values. Explicit forwarding reduces operand register pressure, and allows more data bandwidth to be captured close to the function units. This is apparent from the impact explicit operand forwarding has on the aes write bandwidth, which shows that explicit forwarding directs more of the result bandwidth to registers that are integrated with the function units. The number of register file reads is not affected because the pipeline control logic detects that an operand will be forwarded and suppresses unnecessary register file accesses.

The viterbi kernel exhibits a reduction in aggregate data bandwidth when the compiler uses explicit operand forwarding. The reduction results from the compiler eliminating operand and general-purpose register accesses that communicate operands across multiple instruction pairs. Essentially, the compiler uses the temporary registers to explicitly communicate data between instructions that execute multiple cycles apart. When ephemeral data is explicitly forwarded, a register write and subsequent read are eliminated, which reduces the aggregate data bandwidth and keeps more of the data bandwidth local to the function units.

Energy Efficiency

Figure 5.10 shows the energy expended delivering data to function units. To illustrate the impact of the register organization and explicit operand forwarding, the data has been normalized within each kernel. The configuration with no operand registers uses a general-purpose register file with 4 read ports; the other configurations use a general-purpose register file with 2 read ports. Because the register file is physically smaller and the internal cell capacitances are reduced, less energy is expended accessing the register file with 2 read ports. However, to avoid conflating the reduction in the general-purpose register file access energy with the benefit of the operand register organization, we have used the 4-port register file access energy for all of the configurations. We have also assumed that the local memory that backs the register files is a 2 kB direct-mapped cache rather than the software-managed Ensemble memory used in Elm. This reflects a more conventional processor organization, but makes memory accesses more expensive. Consequently, the data presented in the figure provides a conservative estimate of the energy efficiency improvements achieved from the operand register organization.

As the figure clearly illustrates, operand registers and explicit operand forwarding both reduce the energy expended delivering data to function units. Operand registers alone reduce the energy consumed staging data transfers to function units by 19% to 32% for the configuration with 4 operand registers per function unit; they reduce the energy consumed in the register files and forwarding network by 31% to 56%. Combined, operand registers and explicit forwarding reduce the energy consumed staging data transfers to function units by 19% to 38% for the configuration with 4 operand registers per function unit; they reduce the energy consumed in the register files and forwarding network by 35% to 57%.

Operand registers effectively filter accesses to more expensive registers. Increasing the number of operand registers allows the operand registers to filter more accesses. As Figure 5.10 shows, this improves energy efficiency by reducing the energy consumed accessing the general-purpose register file. As we provision additional operand registers, the amount of energy expended accessing operand registers increases. In part, this is because operand registers are accessed more often. Additionally, the larger operand register files are more expensive to access, and the efficiency advantage offered by small register files diminishes as we introduce additional operand registers.


[Figure 5.10 comprises ten stacked-bar panels, one per kernel (aes, fir, conv2d, crc, crosscor, dct, huffman, jpeg, rgb2yuv, and viterbi). Each panel plots normalized data energy (0.0 to 1.0) against the number of operand registers (0 to 12), with bars decomposed into hardware forward, explicit forward, operand register, general-purpose register, and memory components.]

Figure 5.10 – Impact of Register Organization on Data Energy. The number of operand registers appears below each column.


To compensate, one could segment the bit-lines within the operand register files to allow some operand registers to be accessed less expensively. However, we have found that the capacity of the operand register files is typically limited by area efficiency considerations rather than energy efficiency considerations: operand register files must be small enough to provide sufficiently fast access times to avoid introducing critical paths that would limit the frequency at which the processor can operate.

As should be evident from contrasting the data presented for the different kernels, data working sets have a significant impact on the effectiveness of operand registers and explicit forwarding. Consider the breakdown of operand bandwidth for the aes and rgb2yuv kernels shown in Figure 5.8, and what we can infer about the working sets of the two kernels from the data. The slow increase in operand register bandwidth in aes as the number of operand registers increases reflects the lack of small working sets with significant short-term locality and reuse for the compiler to promote to operand registers. Conversely, the significant fraction of the operand bandwidth in rgb2yuv captured with few operand registers reflects the compiler promoting a critical set of high-intensity variables to operand registers. Four data and four address registers are sufficient to capture the critical working set of rgb2yuv, and further increasing the number of operand registers results in little additional bandwidth capture, as only low-intensity variables are promoted to operand registers. Despite contributing a small fraction of all data references, memory accesses are considerably more expensive than register file accesses and are responsible for the preponderance of the data access energy.

Compilers and Explicit Operand Forwarding

With operand registers and explicit operand forwarding providing detailed control over the movement of data throughout the pipeline, it is reasonable to ponder whether we could rely exclusively on the compiler to forward all operands explicitly. This would eliminate the need for dedicated hardware to detect dependencies between instructions within the processor pipeline. The significant disadvantage of removing the hardware that detects dependencies and automatically forwards operands is that it tends to increase the number of instructions comprising kernels. The compiler may need to schedule explicit idle cycles to ensure that results have reached their destination register before a dependent instruction reads the register. We have found this to be particularly problematic near points where control flow converges, as the compiler must conservatively schedule instructions to ensure that all dependencies are satisfied regardless of where control arrives from. Another issue is that the compiler must use operand registers to transfer results to dependent instructions when a region lacks sufficient instruction-level parallelism to hide both pipeline and function unit operation latencies; variables that are assigned to operand registers for this reason may otherwise be inferior candidates.

Figure 5.11 shows the impact of relying exclusively on the compiler to forward operands. Pairs of adjacent columns show the benefit of hardware detecting read-after-write hazards and managing the forwarding network: the left column shows the execution time when the pipeline control logic detects hazards and forwards operands as necessary; the right column shows the execution time when the compiler must detect hazards and explicitly forward operands when possible. The number of operand registers appears below each column. The operand registers are distributed evenly between the ALU and XMU. All of the configurations have a general-purpose register file with 2 read ports and 2 write ports.


[Figure 5.11 comprises ten bar-chart panels, one per kernel (aes, fir, conv2d, crc, crosscor, dct, huffman, jpeg, rgb2yuv, and viterbi). Each panel plots normalized execution time (0.0 to 1.0) against the number of operand registers (0 to 12), comparing hardware and explicit forwarding against software and explicit forwarding.]

Figure 5.11 – Execution Time. Pairs of adjacent columns show the benefit of hardware operand forwarding: the left column shows the execution time with hardware forwarding; the right column shows the execution time when the compiler must detect hazards and explicitly forward operands. The number of operand registers appears below.


Because the execution time reported is for compiled kernels, the execution times exhibit some artifacts that are introduced by the heuristics used in the register allocation and instruction scheduling phases of the compiler. These artifacts appear as small increases in execution time when the number of operand registers increases. Typically, this occurs when the compiler assigns to an operand register a variable that is used as an operand by instructions executing in different function units, and is then unable to schedule the required copy operation without adding a cycle to the instruction schedule.

Increasing the number of operand registers allows the operand register files to capture more data references, which reduces the demand for operand bandwidth at the general-purpose register file. The reduction in the execution times we observe when the number of operand registers increases results primarily from a reduction in the number of port conflicts at the general-purpose register file. The notable exception is the dct kernel, for which the additional register capacity allows more of the data working set to be captured in registers, which reduces the number of memory operations that are executed. The compiler used to compile the kernels does not exploit the ability of the forwarding network to deliver more than 2 operands in the decode stage, an ability that effectively allows more than 2 operands to be provided at the general-purpose register file when some are forwarded. Instead, the compiler conservatively assumes that an instruction pair may contain at most 2 operands that reside in general-purpose registers. This results in a perceptible performance degradation for the 8 operand register configuration of the dct kernel.

As Figure 5.11 shows, the performance of kernels may degrade significantly when we rely exclusively on the compiler to forward operands. For example, the throughput of the crosscor kernel decreases by 16%. The additional energy expended in the instruction register files exceeds the modest savings realized by eliminating the hardware that detects dependencies. Other kernels, notably the aes and crc kernels, perform reasonably well when we rely exclusively on the compiler, provided that we provision a sufficient number of operand registers to capture the short-term instruction-level dependencies within the performance-critical regions of the code.

Technology and Design Considerations

Explicit forwarding and operand registers provide greater benefits when VLSI technology and design constraints limit the efficiency of multi-ported register files and memories. For example, memory compilers are often optimized for large memories and are not designed to generate efficient small memories. Without access to efficient small memories and register files, the cost of staging operands in general-purpose registers increases, which improves the relative efficiencies provided by operand registers and explicit forwarding. Similarly, standard cell libraries rarely allow wide multiplexers to be implemented as efficiently as is possible when a datapath is constructed using custom design and layout techniques. This may increase the cost of implementing the multiplexers that are needed to read operand registers, making explicit forwarding comparatively more attractive.


5.6 Related Work

Distributed [122], clustered [40], banked [30], and hierarchical [154] register organizations improve access times and densities by partitioning registers across multiple register files and reducing the number of read and write ports to each register file. The number of ports needed to deliver a certain operand bandwidth can be reduced by using SIMD register organizations [116] to allow each port to deliver or receive multiple operands to multiple function units. Such SIMD register organizations are suitable for the subset of data parallel applications that are readily vectorized, but impose inefficiencies when data and control dependencies prevent data parallel computations from being vectorized.

The hierarchical CRAY-1 register organization [123] associates dedicated scalar and address registers with groups of scalar and address function units and provides dedicated backing register files to filter spills to memory. In contrast, operand register files are significantly smaller, optimized for access energy, and each is integrated with a single function unit in the execute stage of the pipeline to reduce the cost of transferring operands between an operand register file and its dedicated function unit. This allows operand registers to achieve greater reductions in energy consumption in processors with short operation latencies and workloads with critical working sets and short-term producer-consumer locality that can be captured in a small number of registers. Operand registers further differ by allowing both levels of the register hierarchy to be directly accessed. This avoids the overhead of executing instructions to explicitly move data between the backing register files and those that deliver operands to the function units [154]. It also avoids the overhead of replicating data in multiple register files [30].

Horizontally microcoded machines such as the FPS-164 [142] required that software explicitly route all results and operands between function units and register files. This work introduces opportunistic explicit software forwarding into a conventional pipeline, where automatic forwarding performed by the forwarding control logic avoids the increase in code size and dynamic instruction counts needed to explicitly route all data.

Hampton described a processor architecture that exposes pipeline registers to software [57] and allows the result of an instruction to be directed to a pipeline register. Like explicit bypass registers, this exposed architecture allowed software to replace some writes to a register file with writes to pipeline registers. Hampton reports similar data on the number of register file writes that can be eliminated. The architecture is limited in that a result could be written to at most one pipeline register, so only operands that are accessed by a single instruction could be communicated explicitly through the exposed pipeline registers. Operands with multiple consumers must be assigned to general-purpose registers. The architecture relied on software restart markers to recover from exceptions that might alter the state of the pipeline registers [58].

5.7 Chapter Summary

Operand register files are small, inexpensive register files that are integrated in the execute stage of the pipeline; explicit operand forwarding lets software opportunistically orchestrate the routing of operands through the forwarding network to avoid writing ephemeral values to registers. Both operand registers and explicit operand forwarding let software capture short-term reuse and locality in inexpensive registers that are integrated with the function units. This keeps a significant fraction of operand bandwidth local to the function units, and reduces the energy consumed communicating data through registers and the forwarding networks.

Chapter 6

Indexed and Address-Stream Registers

This chapter introduces the concepts of indexed registers and address-stream registers, and describes an architecture that uses indexed registers and address-stream registers to expose hardware for performing vector memory operations to software.

6.1 Concepts

Registers are used to store intermediate values and to stage data transfers between function units and memory. In an architecture such as Elm that provides operand registers, general-purpose registers are used to capture working sets that exhibit medium-term reuse and locality. Providing additional general-purpose registers allows software to capture larger working sets in registers, reducing the amount of data that must be transferred between registers and memory. This improves energy efficiency and performance, as registers are less expensive and faster to access than memory. However, exposing additional architectural registers to software requires instruction encodings that dedicate more bits to encoding register names. This leaves fewer bits available for encoding operations or increases the number of bits required to encode an instruction, neither of which is desirable.

Perhaps more significantly, many compute-intensive kernels, and particularly those with enough structured reuse to benefit from large register files, access data in ways that make it difficult to exploit large register files without using software techniques, such as loop unrolling, to expose additional scalar variables that can be assigned to registers. A kernel that operates on the elements of an array serves to illustrate this point. Before assigning the elements of the array to registers, the compiler must typically replace each reference to an element of the array with an equivalent reference to a scalar variable that is introduced explicitly for this purpose, a transformation commonly referred to as scalar replacement. If a kernel accesses the array in a loop using an index that is a non-trivial affine function of the loop induction variable, the loop must be unrolled to enable the scalar replacement transformation. Because loop unrolling replicates operations within the bodies of loops, loop unrolling increases the number of instructions in the critical instruction working sets and kernels of an application. Reductions in the energy consumed staging operands in registers may be offset by increases in the energy consumed loading instructions from memory.
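The transformation can be sketched in plain C. The example below is illustrative and is not taken from Elm's benchmark kernels: the array is accessed with the affine index 2*i, the loop is unrolled by two, and scalar replacement then carries the reused element in the temporaries t0, t1, and t2 so that it can remain in a register across the iterations that share it.

    #include <stddef.h>

    /* Original loop: consecutive iterations reuse a[2*i + 2], but the
     * varying affine index prevents the compiler from keeping the reused
     * value in a named scalar without further transformation.             */
    void diff2(int *b, const int *a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            b[i] = a[2*i + 2] - a[2*i];
    }

    /* After unrolling by 2 and applying scalar replacement, each element
     * of a is loaded once and the reused value is carried in t0.  The
     * sketch assumes n is even; a real compiler would also emit a
     * cleanup loop for the remaining iterations.                          */
    void diff2_unrolled(int *b, const int *a, size_t n)
    {
        int t0 = a[0];
        for (size_t i = 0; i < n; i += 2) {
            int t1 = a[2*i + 2];
            int t2 = a[2*i + 4];
            b[i]     = t1 - t0;
            b[i + 1] = t2 - t1;
            t0 = t2;
        }
    }

The unrolled body keeps the reused element in a register, but it also doubles the number of stores and subtractions in the loop body, which is exactly the growth in the instruction working set that the surrounding discussion cautions against.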


Consequently, software techniques such as loop unrolling that exploit additional registers but introduce additional instructions are unattractive for architectures such as Elm because the additional instructions increase instruction register pressure.

Before continuing, we should observe that the aggregate number of general-purpose registers that may be accessed by the instructions residing in the instruction registers is limited to a small multiple of the number of instruction registers. This follows from each instruction being able to reference at most 3 general-purpose registers, though many instructions will reference fewer. There are techniques for allowing software to access a larger number of physical registers. For example, vector processors are able to provide a large number of physical registers and offer compact instruction encodings by associating a large number of physical registers with a single vector register identifier. Similarly, architectures that provide register windows allow software to access different sets of physical registers by rotating through distinct windows, though these are typically used to avoid saving and restoring registers at function call boundaries, and when entering and terminating interrupt and trap handlers [114].

Index registers let software access some subset of the architectural registers indirectly. The registers that can be accessed indirectly through the index registers are referred to as indexed registers. Some of the indexed registers may be architectural registers that can be accessed using a conventional register name; others may be dedicated indexed registers that can only be accessed through index registers. Index and indexed registers improve efficiency by allowing larger data working sets to be efficiently stored in the register hierarchy; they let software increase the fraction of intra-kernel data locality that can be compactly captured in registers without resorting to software techniques such as loop unrolling that increase the instruction working set of a kernel. Thus, index and indexed registers simultaneously address the problem of allowing a processor to increase the number of registers that are available to software without requiring instruction set changes and the problem of allowing software to exploit additional registers without resorting to techniques such as loop unrolling that increase the number of instructions in the working sets of the important kernels within embedded applications. Like vector registers, index and indexed registers establish an association between a large number of physical registers and a register identifier that is exposed to software. However, index and indexed registers are more general and more flexible: they expose the association to software, let software control the assignment of physical registers to register identifiers, and allow the physical registers to be accessed using multiple register identifiers.

Vector memory operations transfer multiple data words between memory and the indexed registers, which improves code density by allowing a single vector memory instruction to encode multiple memory operations. Techniques that improve code density improve efficiency in general, and are particularly attractive for architectures such as Elm that achieve efficiency in part by delivering a significant fraction of instruction bandwidth from a small set of inexpensive instruction registers.
Vector memory operations provide a means for software to explicitly encode spatial locality within the operation, which allows hardware to improve the scheduling and handling of the memory operations. Decoupling allows the memory operations that transfer the elements of a vector to complete while the processor continues to execute instructions, which lets software schedule computation in parallel with memory operations. Decoupling vector memory operations also assists software in tolerating memory access latencies, as vector memory operations can be initiated early to allow the operations more time to complete.


The Elm architecture exposes the hardware that generates sequences of addresses for vector memory operations as address-stream registers. The address-stream registers deliver sequences of addresses and offsets that can be combined with other registers to compute the effective addresses of memory operations. Software controls the address sequences computed by the address-stream registers by loading configuration words from memory. Software can save the state of an address sequence by storing the address-stream register to memory, and can later restore the sequence by loading the register from memory. This lets software map multiple address sequences onto the address-stream registers. Software can move elements from the address sequences to other registers, which allows software to implement complex address calculations using a combination of address-stream registers and general-purpose and address registers.

Exposing the address generators as address-stream registers allows software to use the hardware to accelerate repetitive address calculations, improving both performance and efficiency by eliminating instructions that perform explicit address computations from the instruction stream. It also simplifies the task of decomposing a large transfer that can be compactly expressed using vector memory operations and address-stream registers into a sequence of smaller operations that better exploit the storage available in the register organization: the address-stream registers maintain the state of the operation between transfers.

6.2 Microarchitecture

Figure 6.1 illustrates where index, indexed, and address-stream registers reside in the register organization. The index and indexed registers reside with the general-purpose registers and are read early in the pipeline. The address-stream registers are located with the XMU in the execute stage of the pipeline, where they effectively extend the address register file. The following sections describe microarchitectural considerations and the implementation of indexed and address-stream registers in Elm.

Index and Indexed Registers

The instruction set architecture allows the index registers to be implemented as an extension of the general-purpose registers. The Elm instruction set architecture allows a processor to implement between 32 and 2048 indexed registers, and defines indexed registers xr0 through xr31 to be aliases of general-purpose registers gr0 through gr31. Consequently, indexed registers xr0 through xr31 may use the same physical memory as is used to implement the general-purpose registers. Any additional indexed registers that a processor implements require dedicated memory.

When the number of indexed registers exceeds the number of general-purpose registers by a small factor, the indexed registers and general-purpose registers can be implemented efficiently as a single unified register file. Though any additional capacity required for the indexed registers increases the cost of accessing general-purpose registers, a unified register organization improves area efficiency by avoiding the duplication of many of the peripheral circuits found in a register file.
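The aliasing rule can be summarized with a small C model. The structure layout, register count, and function name below are illustrative assumptions rather than part of the Elm instruction set definition; the model only captures the mapping described above, in which xr0 through xr31 share storage with gr0 through gr31 and any further indexed registers occupy dedicated storage.

    #include <stdint.h>

    #define NUM_GPR       32     /* gr0..gr31                                */
    #define NUM_INDEXED  128     /* illustrative value in the 32..2048 range */

    /* Storage backing the register organization described in the text.     */
    typedef struct {
        uint32_t gpr[NUM_GPR];                      /* also backs xr0..xr31  */
        uint32_t dedicated[NUM_INDEXED - NUM_GPR];  /* xr32..xr127           */
    } regfile_t;

    /* Return a pointer to the storage that backs indexed register xr[idx]. */
    static uint32_t *indexed_reg(regfile_t *rf, unsigned idx)
    {
        if (idx < NUM_GPR)
            return &rf->gpr[idx];               /* alias of gr[idx]          */
        return &rf->dedicated[idx - NUM_GPR];   /* dedicated indexed storage */
    }

In a unified organization the two arrays would occupy a single physical register file; in the partitioned organization discussed below they would be separate files with their own ports.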


[Figure 6.1 is a pipeline diagram spanning the IF, DE, EX, and WB stages. It shows the index register file (XRF) and index register management unit (XRMU) alongside the general-purpose register file (GRF) in the decode stage; the address-stream register file (ASF) and the operand register files (ARF, DRF) with the XMU and ALU in the execute stage; the forwarding networks between stages; and the memory interface used by load and xmu operations.]

Figure 6.1 – Index Registers and Address-stream Registers. Index registers allow registers in the general- purpose register file to be accessed indirectly through the index registers. The index registers reside in the index register file (XRF). The index register management unit (XRMU) implements logic for automatically updating pointers stored in the index registers after each reference. Address-stream registers extend the address registers integrated with the XMU, and are used to automate the generation of fixed sequences of addresses. Hardware associated with the address-stream registers automatically updates the value stored in an address-stream register after each reference. The address-stream registers are implemented in the address-stream register file (ASF).


When the number of indexed registers greatly exceeds the number of general-purpose registers, the additional physical registers may be implemented as a dedicated register file to avoid increasing the cost of accessing the general-purpose registers. This register organization offers the advantage of provisioning additional register file ports. A unified register organization limits the number of operands that can be delivered from the general-purpose registers and indexed registers. The partitioned register file organization allows 2 operands to be delivered from general-purpose registers and 2 operands to be delivered from indexed registers. Perhaps more significantly, it provides an additional 2 write ports, allowing data from the memory interface to be returned to the indexed registers while the XMU writes results back to the general-purpose registers.

Forwarding

Elm automatically forwards values that are stored to indexed registers. Because an index register and a general-purpose register identifier may refer to the same physical register, index register identifiers cannot be used to detect when it is necessary to forward a result. Essentially, the possibility of aliases, introduced by indirection, prevents the pipeline control logic from inspecting the index register identifiers encoded in instructions to detect data dependencies. Fortunately, this alias problem is easily solved. The key insight in forwarding indexed registers is that the physical addresses of the indexed register, not the index register identifiers encoded in instructions, should be used to detect dependencies. Accordingly, the pipeline control logic implemented in Elm uses the physical addresses of any indexed registers to detect when data should be forwarded. Similarly, the pipeline interlocks also use physical register addresses to detect data hazards that the forwarding network cannot resolve. Even if different register files were used for general-purpose registers and indexed registers, it would still be necessary for the pipeline control logic to use physical register addresses to detect dependencies because the sets of indexed registers accessed through different index registers may not be disjoint, and consequently multiple index registers may refer to the same indexed register.

Source indexed register addresses are determined when the index registers are accessed in the decode stage, and advance along with operands read from the indexed register through the pipeline. Destination indexed register addresses are determined in the execute stage of the pipeline. The destination indexed register addresses are used to forward results that are destined for indexed registers when the results are sent to the indexed register file in the write-back stage. Though the destination addresses could be determined in the write-back stage, the time required to access the index registers would contribute to the delay associated with forwarding operands through the forwarding network, and would adversely impact the frequency at which the processor could operate. Once the addresses of the destination and source indexed registers are available, the forwarding hardware simply forwards indexed registers as though they were general-purpose registers, using the indexed register addresses to determine where results should be sent through the forwarding network.
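The comparison on physical register addresses can be restated as a short C sketch. The record layout and function below are illustrative assumptions; only the comparison rule, which matches destinations and sources by physical register address rather than by the identifiers encoded in instructions, is taken from the text.

    #include <stdbool.h>
    #include <stdint.h>

    /* Physical address of a register: the file it resides in and its row.  */
    typedef struct {
        uint8_t  file;    /* e.g. the unified general-purpose/indexed file   */
        uint16_t row;     /* physical row within that file                   */
        bool     valid;   /* false when no register is named                 */
    } phys_reg_t;

    /* Result produced by an older instruction still in flight, tagged with
     * the physical address of its destination (resolved in execute).        */
    typedef struct {
        phys_reg_t dst;
        uint32_t   value;
    } inflight_result_t;

    /* Forward the in-flight result to a source operand when both name the
     * same physical register.  Comparing index register identifiers from
     * the instruction encoding instead would miss aliases, which is the
     * hazard the text describes.                                            */
    static bool forward_if_same_physical(const inflight_result_t *older,
                                         phys_reg_t src, uint32_t *operand)
    {
        if (older->dst.valid && src.valid &&
            older->dst.file == src.file && older->dst.row == src.row) {
            *operand = older->value;   /* bypass the register file read      */
            return true;
        }
        return false;                  /* no hazard: read the register file  */
    }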

Interlocks

The processor must ensure that an operation that modifies an index register appears to complete before subsequent operations that use the index register to access the indexed registers are initiated.


The forwarding hardware ensures that operations that implicitly modify the index registers appear to complete in the order that software expects. However, it does not ensure that operations that explicitly modify the index registers appear to complete as expected. For example, the ld.rest operation in the following assembly fragment must complete and write index register xp0 before the add instruction uses xp0 to read its first operand from the indexed registers.

nop ,             ld.rest xp0 [ar0+16] ;
add xp0 xp0 dr1 , nop ;

Because operations that explicitly modify the index register are infrequent, the processor handles hazards associated with software explicitly modifying an index register by stalling the pipeline. The interlock control hardware detects when an instruction in the decode stage attempts to read an operand from the indexed registers using an index register that has a load pending and prevents the instruction from departing the decode stage until the index register has been written.

Address-Stream Registers

The address-stream registers extend the address registers. As an extension of the address registers, address-stream registers are accessed in the execute (EX) stage of the pipeline.

The Elm instruction set architecture requires that updates of address-stream registers appear to be performed in parallel with the execution of the XMU operation that receives its operands from them. It is uncommon for an XMU operation to receive multiple operands from address-stream registers. However, the hardware required to route operands from address-stream registers to the arithmetic units that are responsible for updating the registers after each reference is greatly simplified if each read port of the address-stream register file has a dedicated arithmetic unit. Because the simple arithmetic operations required to update an address-stream register are inexpensive in contemporary VLSI technologies, Elm assigns a dedicated update unit to each of the address-stream register file read ports. This allows the XMU to complete 3 arithmetic operations in parallel. For example, when the following instruction pair executes, the XMU adds the contents of address-stream registers as0 and as1 and updates both address-stream registers in parallel.

nop, add dr0 as0 as1;

The following table shows the update of the operand and address-stream registers when the instruction executes. The address, increment, elements, and position components of each address-stream register are shown as the register fields .addr, .incr, .elem, and .posn. The fields that are updated when the instruction executes can be identified by comparing the values before and after execution.

                  .addr   .incr   .elem   .posn
    as0  before       8       4       1       3
         after       12       4       2       3
    as1  before      16       8       2      63
         after       32       8       3      63

    dr0  after       24        (result of add dr0 as0 as1)


Because an address-stream register specifies a limit value at which the register is updated to zero, each update actually involves an addition and comparison operation. Updated address-stream register values are latched at the output of the update units. An updated value is not written to an address-stream register until the instruction associated with the updated value enters the write-back stage of the pipeline. The updated values are forwarded to the inputs of the update unit to allow consecutive XMU operations to receive their operands from the address-stream registers. This organization keeps all of the operand bandwidth associated with the updating of address-stream registers local to the address-stream register file and update units while ensuring that implicit writes to the address-stream registers are reflected in the architecturally exposed state of the processor when the associated instruction completes.
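The per-reference update can be modeled with a few lines of C. The field names follow the table above, but the assignment of roles, with .elem counting references and .posn holding the limit at which the register wraps to zero, is an assumption made for illustration; the text states only that each update involves an addition and a comparison against a limit.

    #include <stdint.h>

    /* Model of one address-stream register, using the fields shown in the
     * table above.  Assumed roles: .addr is the offset delivered to the
     * current reference, .incr is added after each reference, .elem counts
     * references, and .posn is the limit at which the sequence wraps.       */
    typedef struct {
        uint32_t addr;
        int32_t  incr;
        uint32_t elem;
        uint32_t posn;
    } asreg_t;

    /* Deliver the current offset and perform the post-reference update: one
     * addition and one comparison, as described in the text.  The offset
     * may be combined with a base held in another register to form the
     * effective address.                                                    */
    static uint32_t asreg_reference(asreg_t *as, uint32_t base)
    {
        uint32_t effective = base + as->addr;
        if (++as->elem == as->posn) {   /* limit reached: wrap the sequence  */
            as->elem = 0;
            as->addr = 0;
        } else {
            as->addr += as->incr;       /* advance to the next address       */
        }
        return effective;
    }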

Register Save and Restore Operations

The memory controller provides a 64-bit interface between the processor and memory. This allows a 64-bit instruction pair to be transferred in a single cycle. It also allows save and restore instructions to load and store 64-bit index and address-stream registers as a single memory operation. If the memory interface were 32 bits wide, save and restore instructions would require multiple memory operations, during which either memory or the register would contain an inconsistent copy.

Vector Memory Operations

The vector memory operations execute in the XMU function unit. Because vector memory operations transfer multiple elements between registers and memory, they execute across multiple cycles. The hardware that is responsible for sequencing vector memory operations is partially decoupled from the XMU pipeline, which allows the XMU to continue to execute instructions while a vector memory operation executes. However, a vector memory operation occupies the memory interface throughout the duration of its execution, which prevents other memory operations from entering the execute stage of the pipeline until the vector memory operation departs.

Interlocks and Indexed Elements

An Elm processor may continue to execute instructions while a vector memory operation transfers data between memory and the indexed registers. To simplify the implementation of the index register file, vector memory operations are given exclusive access to that half of the index register engaged in the operation: a vector load operation requires exclusive access to the write pointer in the destination index register, and a vector store operation requires exclusive access to the read pointer. An implementation could allow subsequent instructions to access an index register that is engaged in a vector memory operation by computing the effect of the vector memory operation on the index register and updating the index register when the vector memory operation begins executing. However, it is uncommon for software to need to use an index register in such a way, and when necessary, a second index register can be used to achieve the same effect.


Consequently, the Elm implementation simply stalls the pipeline when an instruction attempts to access part of an index register that is engaged in a vector memory operation until the vector memory operation completes.

To simplify software development, the execution semantics defined for vector memory operations specify that a vector memory operation appears to complete before any operations specified by subsequent instructions are performed. These execution semantics simplify the development of software and instruction scheduling because software does not need to reason about which indexed registers may be modified or read by a vector memory operation when scheduling subsequent instructions. The alternative would be to expose the decoupling of vector memory operations and to require that software explicitly synchronize the instruction stream and the completion of the vector memory operation, which would require the equivalent of a memory fence operation. The processor must ensure that an indexed register that will be written by a vector load operation is not read until after the load writes the registers, and that an indexed register that will be read by a vector memory operation is not written by a subsequent instruction until the vector memory operation has read the indexed register.

Elm uses a straightforward scoreboard to track which indexed registers may be accessed while a vector memory operation is being performed. The vector memory operation scoreboard associates a flag with each indexed register to track whether the indexed register may be accessed. When a vector load executes, the flag associated with an indexed register is set to indicate that the indexed register may be written by the operation. Similarly, when a vector store executes, the flag associated with an indexed register is set to indicate that the indexed register will be read. The flag associated with an indexed register is cleared after the vector memory operation accesses the indexed register. To reduce the energy expended updating and accessing the scoreboard, the pipeline control logic only accesses the vector memory operation scoreboard when the processor is performing a vector memory operation. We used this scoreboard architecture in Elm because most of the hardware is required to track decoupled remote loads, which we describe in Chapter 8, and therefore can be shared between the two functions. However, the hardware required by scoreboards that track individual indexed registers becomes expensive when the number of indexed registers increases.

The cost and complexity of the scoreboard can be reduced by tracking sets of indexed registers rather than individual indexed registers. The scoreboard must always ensure that the tracking of vector memory operations is conservative. For example, the size of the scoreboard can be reduced significantly if we consider the indexed registers to be partitioned into four contiguous sets and we allocate a single scoreboard entry for each quadrant of the indexed register file. The scoreboard controller would set a quadrant flag when a vector memory operation accesses an indexed register in the quadrant; similarly, pipeline control logic would delay subsequent instructions that attempt to access any of the indexed registers in a quadrant that the scoreboard indicates has a pending vector memory operation, even though the vector memory operation may not access the same indexed register. In the extreme, a single scoreboard entry can be used to indicate that a vector memory operation is in progress and that the indexed registers are not available to other instructions.
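A quadrant-granularity scoreboard of the kind described above can be sketched in a few lines of C. The register count, interface, and timing of the release call are illustrative assumptions rather than a description of the Elm hardware; the sketch only shows how coarse tracking trades scoreboard state for false dependencies.

    #include <stdbool.h>

    #define NUM_INDEXED    128                 /* illustrative register count */
    #define NUM_QUADRANTS    4
    #define QUADRANT(r)    ((r) / (NUM_INDEXED / NUM_QUADRANTS))

    static bool quadrant_busy[NUM_QUADRANTS];  /* one flag per quadrant       */

    /* Called when a vector memory operation issues: mark every quadrant
     * that the operation will touch.                                         */
    void scoreboard_mark(unsigned first_reg, unsigned count)
    {
        for (unsigned r = first_reg; r < first_reg + count; r++)
            quadrant_busy[QUADRANT(r)] = true;
    }

    /* Called as the vector operation proceeds: once it has finished with a
     * quadrant, the quadrant is released.                                    */
    void scoreboard_release(unsigned quadrant)
    {
        quadrant_busy[quadrant] = false;
    }

    /* Pipeline interlock check: an instruction that names an indexed
     * register in a busy quadrant must stall, even if the vector operation
     * never touches that particular register (the false dependency noted
     * in the following paragraph).                                           */
    bool scoreboard_must_stall(unsigned reg)
    {
        return quadrant_busy[QUADRANT(reg)];
    }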

Tracking dependencies with reduced precision reduces the complexity and cost of the scoreboard, but introduces false dependencies and unnecessarily disrupts the execution of instructions when vector load operations and compute operations access nearby indexed registers. We expect that systems that provide more indexed registers in response to greater memory access latencies will not need to track vector memory operations with a resolution of a single indexed register.


As the typical memory access latency increases, software will tend to increase the number of indexed registers that are used to stage data transfers and to initiate memory operations further in advance. This tends to increase the distance between where compute operations access the indexed register files and where vector memory operations access the indexed register files, allowing the scoreboard to track vector memory operations at a coarser granularity without adversely impacting processor performance.

Bandwidth Considerations

The memory controller provides a 64-bit interface between the processor and memory to allow a 64-bit instruction pair to be transferred from memory to an instruction register in one cycle. The bandwidth available between the processor and memory is sufficient to allow a vector memory operation to transfer multiple operands between memory and the indexed registers in one cycle. To enable this, the indexed register file must allow multiple 32-bit words to be read and written in a single cycle.

The write ports can be widened from 32 bits to 64 bits without significant cost, as the modifications required in the column circuits at the write interface constitute a small fraction of the register file area. Most indexed register files will have at least 64 columns, as some degree of column multiplexing is necessary to align the columns of the cell array to the sense amplifiers at the read port. Typically, each column requires its own independent write circuits, regardless of whether the array is accessed as though it contains 32-bit words or 64-bit words. Furthermore, the indexed register file already requires write masks to support both 16-bit sub-word and 32-bit full-word writes, and supporting both 16-bit and 32-bit sub-word write masks requires the provisioning of at most a few additional wire tracks in the column circuits to route the additional write enable signals that distinguish between 16-bit and 32-bit sub-word writes.

Extending the read ports from 32 bits to 64 bits is considerably more expensive when additional sense amplifiers and output latches are required. Sense amplifiers can consume a considerable fraction of the area and energy of a small register file in contemporary CMOS technologies. Device mismatch and variation often require that designers use devices with longer channels to improve matching between the devices in the differential pairs of the sense amplifier. With longer channels, the device widths must be increased to achieve similar performance characteristics. Register files that use sense amplifiers implement differential read paths in which active bit-lines are partially discharged. To eliminate the need for complex clocked amplifiers, some register file architectures use single-ended read paths in which active bit-lines are completely discharged during read operations. Without clocked sense amplifiers, register files often implement clocked latch elements to hold read values stable while the bit-lines are charged in preparation for the next read operation. Without these latch elements, the output of the register file would become invalid before the next read operation begins. In either case, the increase in the area of the register file is more significant than when the width of the write ports is increased.

When considering how to provision read and write port bandwidth at the indexed register files, we should keep in mind that the vector memory operations are used to transfer operands and results that are staged in indexed registers before being transferred to a function unit or memory.


Computations that benefit from increasing the bandwidth between the indexed registers and memory typically derive most of the benefit from increasing the bandwidth from memory to the indexed registers, which is used to bring operands closer to the function units. Most operations that use the indexed registers consume one or two operands and produce one result, so the operand bandwidth is greater than the result bandwidth. An additional consideration is that vector load operations may compete with instruction loads for memory bandwidth, whereas vector store operations have exclusive access to the memory store interface. Consequently, a reasonably balanced, though somewhat asymmetric, design option is to double the bandwidth from memory to the indexed registers to 64 bits per cycle while leaving the bandwidth from the indexed registers to memory at 32 bits per cycle.

6.3 Examples

This section presents two examples that illustrate the use of index registers and address-stream registers. In both examples, the index registers are used to access frequently referenced data stored in the indexed registers rather than in memory, illustrating how index registers allow important data working sets to be captured in registers.

Finite Impulse Response Filter

The first example we consider is a finite impulse response (FIR) filter. Finite impulse response filters are common in embedded digital signal processing applications. They are often used instead of infinite impulse response filters because they are inherently stable and, unlike infinite impulse response filters, do not require feedback and therefore do not accumulate rounding error. The following difference equation describes the output y of an (N − 1)th order filter in terms of its input x and filter coefficients h.

y[n] = \sum_{i=0}^{N-1} h_i \, x[n-i]    (6.1)
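A direct C transcription of (6.1) makes the data access pattern explicit before the buffering and duplication tricks of Figure 6.2 are applied; the function is an illustrative reference implementation, with int standing in for the kernel's fixed32 type.

    /* Reference implementation of equation (6.1): y[n] is the weighted sum
     * of the N most recent input samples.  The caller must ensure that
     * x[n - i] is a valid sample for 0 <= i < N, i.e. that n >= N - 1.     */
    long fir_reference(const int *x, const int *h, int N, int n)
    {
        long y = 0;
        for (int i = 0; i < N; i++)
            y += (long)h[i] * x[n - i];
        return y;
    }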

Though a finite impulse response filter requires more computation and memory than an infinite impulse response filter with similar filter characteristics, a finite impulse response filter is readily partitioned and parallelized across multiple processor elements. Figure 6.2 shows a code fragment that contains the relevant parts of an (N − 1)th-order filter kernel. The input samples arrive periodically and are stored in an array that operates as a circular buffer. The filter coefficients are duplicated as illustrated in Figure 6.3 to simplify the calculation of the array indices in the inner loop of the kernel. Duplicating the coefficients provides the same effect as addressing the sample buffer as though it were a circular buffer while avoiding the additional operations required to access the array as a circular buffer, which typically involves performing a short sequence of operations to cause the index to wrap circularly. Either the samples or filter coefficients can be duplicated to simplify the addressing in the inner loop; however, it is more efficient to duplicate the filter coefficients. One input sample is stored to the sample buffer during each execution of the filter, and each input sample is read N times before being discarded.
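For contrast with the kernel of Figure 6.2 below, the sketch that follows keeps a single copy of the N coefficients (here in an array h, an assumption for this sketch) and instead wraps the sample index explicitly on every inner-loop iteration; it is this per-reference wrap, written here as a modulo, that duplicating the coefficients removes.

    /* Output computation with a single copy of the coefficients.  The
     * newest sample is assumed to reside at sample[j], with older samples
     * behind it in the circular buffer.  Illustrative only: the kernel in
     * Figure 6.2 avoids the wrap by duplicating the coefficients instead.  */
    static long fir_output_circular(const int *sample, const int *h,
                                    int N, int j)
    {
        long s = 0;
        for (int m = 0; m < N; m++)
            s += (long)h[m] * sample[(j - m + N) % N];
        return s;
    }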


source code fragment
100   fixed32 sample[N];
101   fixed32 coeff[2*N];
102   forever {
103     for ( int j = 0; j < N; j++ ) {
104       sample[j] = recv();
105       fixed32 s = 0;
106       for ( int i = 0; i < N; i++ ) {
107         s += sample[i] * coeff[(j+N)-i];
108       }
109       send(s);
110     }
111   }

Figure 6.2 – Finite Impulse Response Filter Kernel. The code fragment is taken from a kernel that implements a finite impulse response filter. The input samples are stored in the sample array, which operates as a circular buffer. The filter coefficients are duplicated in the coeff array to reduce the number of operations required to compute the indices used to access the sample and coefficient arrays within the performance-critical inner loop of the kernel.

[Figure 6.3 diagram: the samples circular buffer, indexed by sample[i], and the duplicated coefficients array, indexed by coeff[(j+N)-i] starting from coeff[j+N].]

Figure 6.3 – Finite Impulse Response Filter. The input samples are stored in the sample array, which operates as a circular buffer. The coefficients are stored in the coeff array. Duplicating the coefficients allows the inner loop to be implemented without circular addressing operations.

However, the filter coefficients, which control the shape of the filter response, are updated less frequently than the entries in the sample buffer. Consequently, it is more efficient to duplicate the filter coefficients rather than the samples. We should note that the code shown in Figure 6.2 affects the order in which the additions specified by the difference equation (6.1) are performed. This may affect the output computed by the filter when addition is not associative, for example when saturating arithmetic or floating-point arithmetic is used.

Both the filter coefficients and samples exhibit significant medium-term reuse, and together form the significant data working sets of the kernel. The for loop at line 106, which contains the statements that compute the output of the filter after a sample is received, dominates the execution time of the kernel and contributes most of the data references. The assembly fragment in Figure 6.4 shows the loop after scheduling and register allocation. The loop has been scheduled such that the execution time of the loop asymptotically approaches 5 cycles per iteration. As should be evident from the assembly fragment of Figure 6.4, the loop fetches 5 instruction pairs and loads two operands from memory during each iteration of the loop.


            alu operation            xmu operation
200  @F106:  nop                 ,   ld dr0 [ar0 + ar2]      ;   // load sample
201          nop                 ,   ld dr1 [ar1 + ar3]      ;   // load coefficient
202          nop                 ,   add ar2 ar2 #1          ;
203          nop                 ,   loop.clear pr0 [@F106]  ;
204          mac tr0 dr0 dr1 da0 ,   sub ar3 ar3 #1          ;

Figure 6.4 – Assembly Fragment Showing Scheduled Inner Loop of FIR Kernel. Sample sample[i] is loaded to data register dr0; address register ar0 contains the address of sample[0], and address register ar2 contains the offset value i. Coefficient coeff[(j+N)-i] is loaded to data register dr1; address register ar1 contains the address of coeff[j+N], and address register ar3 contains the offset value -i. The output value is computed using accumulator da0. The result of the mac operation, which has been scheduled in the loop delay slot, is directed to forwarding register tr0, where it is available when the loop completes.

The kernel fetches approximately 5N instruction pairs from instruction registers, stores one operand to memory, and loads 2N operands from memory during each execution.

Before explaining how indexed registers and address-stream registers allow the kernel to be executed more efficiently, we should observe that the loop can be unrolled to reduce the number of flow control and bookkeeping arithmetic instructions. When the loop is completely unrolled, the loop body can be reduced to three instructions: the loop instruction at line 203 and the add instruction can be eliminated by using the immediate offset addressing mode in the ld operation at line 200. Thus, when the loop is fully unrolled, the average execution time of a loop iteration asymptotically approaches 3 cycles. However, unrolling the loop causes the code size to expand, and eventually causes the kernel to exceed the instruction register capacity. Partially unrolling the loop reduces the execution time to somewhere between 3 and 5 cycles per iteration of the loop. Though it reduces the number of instructions executed, unrolling the inner loop does not affect the number of operands that the kernel loads from memory.

As we noted previously, the filter coefficients exhibit greater long-term reuse than the input samples, and the coefficient array elements are consequently better candidates for assignment to indexed registers. The assembly fragment in Figure 6.5 shows the inner loop of the kernel when the filter coefficients are stored in indexed registers. Promoting the filter coefficients to indexed registers reduces the length of the loop body from 5 instructions to 3 instructions. Furthermore, promoting the coefficients reduces the number of operands that are loaded from memory during each iteration of the loop from 2 operands to 1 operand. The revised kernel fetches approximately 3N instruction pairs from instruction registers and loads N operands from memory during each execution.

The assembly fragment in Figure 6.6 shows the inner loop when the elements of the sample array are accessed through an address-stream register. The inner loop now consists of two instruction pairs, and the revised kernel fetches approximately 2N instruction pairs from instruction registers and loads N operands from memory during each execution.

Image Convolution

Image convolution, or more formally two-dimensional discrete convolution, is a common mathematical operation underlying many of the algorithms used in modern digital image processing applications.


    alu operation             xmu operation
300 @F106: nop              , ld dr0 [ar0 + ar2] ;      // load sample
301        nop              , loop.clear pr0 [@F106] ;
302        mac tr0 dr0 xp0 da0 , add ar2 ar2 #1 ;

Figure 6.5 – Assembly Fragment Showing use of Index Registers in Inner Loop of fir Kernel. Sample sample[i] is loaded to data register dr0; address register ar0 contains the address of sample[0], and address register ar2 contains the offset value i. The filter coefficients are stored in indexed registers, and coefficient coeff[(j+N)-i] is accessed through index register xp0. The output value is computed using accumulator da0. The result of the mac operation, which has been scheduled in the loop delay slot, is directed to forwarding register tr0, where it is available when the loop completes.

    alu operation             xmu operation
400        nop              , ld dr0 [as0] ;            // load sample
401 @F106: nop              , loop.clear pr0 [@F106] ;
402        mac tr0 dr0 xp0 da0 , ld dr0 [as0] ;         // load sample
403        mac tr0 dr0 xp0 da0 , nop ;

Figure 6.6 – Assembly Fragment Showing use of Address-stream Registers in Inner Loop of fir Kernel. The scheduled loop body comprises the two instruction pairs at lines 401 and 402. The instruction pair at line 400 precedes the loop body and initiates the first loop iteration, and the instruction pair at line 403 follows the loop body and completes the last loop iteration. Sample sample[i] is loaded to data register dr0 using address-stream register as0 to generate the address sequence. The filter coefficients are stored in indexed registers and accessed through index register xp0. The result of the mac operation, which has been scheduled in the loop delay slot, is directed to forwarding register tr0, where it is available when the loop completes.

convolution operation computes an image O from an image I and a function H according to the following expression.

O(y,x) = ∑_v ∑_u I(y + v, x + u) · H(v,u)                    (6.2)

The values of H are often referred to as the filter weights or the convolution kernel. We shall use the term filter to avoid confusion with the computation kernel that implements the convolution. The code fragment shown in Figure 6.8 illustrates the application of a convolution filter to an input image. The code computes output image o_img by convolving input image i_img with the M × M filter h. The code contains two loop nests: an outer loop nest with a domain that corresponds to elements of the output image, and an inner loop nest with a domain that corresponds to the elements of the filter h. The input image is padded with an N-element wide frame so that the output of the convolution is well-defined over the entire range of the output image. Figure 6.7 illustrates the padding, and highlights the working sets associated with the computation of a single element in the output image. To examine how address-stream registers and indexed registers allow image convolution to be performed more efficiently, we shall examine the number of arithmetic operations and memory accesses required to perform image convolution. To make the example concrete, we will assume that the convolution applies a 3 × 3 filter to a 256 × 256 image. This corresponds to setting N = 1, M = 3, and X = Y = 256 in the source code of Figure 6.8. The computation required by the convolution is dominated by multiplication and addition operations.


[Figure 6.7 graphic: the padded input image i_img[Y+2N][X+2N], the filter h[M][M], and the output image o_img[Y][X].]

Figure 6.7 – Illustration of Image Convolution. Each element of the output image is computed as a weighted sum over the corresponding subdomain of the input image. The entries of h specify the weight- ing applied to combine elements from the input image to compute an element of the output image. The computation of an element of the output image is independent of all other output elements, and all output elements may be computed in parallel.

source code fragment
100 const int M = 2*N+1;
101 fixed32 i_img[Y+2*N][X+2*N];
102 fixed32 o_img[Y][X];
103 fixed32 h[M][M];

110 for ( int y = 0; y < Y; y++ ) {
111   for ( int x = 0; x < X; x++ ) {
112     fixed32 s = 0;
113     for ( int v = 0; v < M; v++ ) {
114       for ( int u = 0; u < M; u++ ) {
115         s += h[v][u] * i_img[y+v][x+u];
116       }
117     }
118     o_img[y][x] = s;
119   }
120 }

Figure 6.8 – Image Convolution Kernel. The input and output images are X elements wide and Y elements high. The input image is padded with an N-element wide frame so that the convolution is well-defined over all of the output image. Consequently, the element at the upper-left corner of the input image resides at i_img[N][N] rather than i_img[0][0]. The kernel contains two loop nests: an outer loop nest defined at lines 110 – 111 with a domain that corresponds to the elements of the output image, and an inner loop nest defined at lines 113 – 114 with a domain that corresponds to the elements of h and the corresponding subdomain of the input image.

The number of multiplications and additions required to compute the output is readily determined by inspecting the source code, and is calculated as follows.

(X × Y)(M × M) = (256 × 256)(3 × 3) = (65,536)(9) = 589,824 (6.3)


Each multiplication accesses one element of the input image and one coefficient of filter h. Consequently, the aggregate number of accesses to elements of the input image and the aggregate number of accesses to elements of h are also computed by equation (6.3). Though the aggregate numbers of accesses are equal, there are more accesses to an individual element of h than to an individual element of the input image, and the working sets corresponding to the elements of h and to the input image differ in significance. Each element of h contributes to the computation of all 65,536 elements of the output image, whereas an element of the input image contributes to the computation of at most 9 elements of the output image. Consequently, the working set comprising the elements of h exhibits significantly more reuse than the working sets comprising neighboring elements of the input image.

An initial implementation of the image convolution kernel is shown in Figure 6.9. As should be apparent from the assembly fragment, this implementation uses neither indexed registers nor address-stream registers. The inner loop is completely unrolled, and comprises lines 203 – 209. The number of instructions fetched when the code shown in Figure 6.9 executes is computed as follows.

256 × (256 × ((3 × 8) + 4) + 2) = 1,835,520 (6.4)

Similarly, the number of memory accesses required to fetch operands from memory and store results to memory is computed as follows.

256 × (256 × ((3 × 6) + 1) + 0) = 1,245,184 (6.5)
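The counting model behind equations (6.4) and (6.5) can be made explicit with a short C sketch. The per-loop constants follow the structure of the assembly in Figure 6.9: 8 instruction pairs and 6 loads per iteration of the v loop, 4 instruction pairs and 1 store per output element, and 2 instruction pairs per output row; the code below simply evaluates the same expressions.

#include <stdio.h>

int main(void) {
    const int X = 256, Y = 256, M = 3;

    /* Instruction pairs: M * 8 per v-loop iteration, plus 4 per output
       element and 2 per row of the output image. */
    long insns = (long)Y * (X * ((M * 8) + 4) + 2);

    /* Memory accesses: M * 6 loads per v-loop iteration (filter and image
       elements), plus 1 store per output element. */
    long mems = (long)Y * (X * ((M * 6) + 1) + 0);

    printf("instruction pairs fetched: %ld\n", insns);  /* 1,835,520 as in (6.4) */
    printf("memory accesses:           %ld\n", mems);   /* 1,245,184 as in (6.5) */
    return 0;
}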

Let us first consider using address-stream registers to eliminate some of the explicit address computations. Figure 6.10 shows an implementation of the image convolution kernel that uses the address-stream registers. The use of the address-stream registers eliminates one instruction from the for loop at line 111, which advances the output position across a row of the input image. The number of instructions fetched when the code shown in Figure 6.10 executes is computed as follows.

256 × (256 × ((3 × 8) + 3) + 2) = 1,769,984 (6.6)

This corresponds to a modest 3.6% reduction in the number of instructions that are fetched. The number of operands fetched from memory remains unchanged.

Let us next consider using indexed registers to exploit reuse within the various data working sets exhibited by the image convolution kernel. Promoting the elements of the function h to indexed registers is relatively straightforward because the access pattern is affine. Figure 6.11 shows an implementation of the image convolution kernel that assigns the elements of the function h to indexed registers and uses address-stream registers to load elements of the input image and store elements of the output image. The number of instructions fetched when the code shown in Figure 6.11 executes is calculated as follows.

256 × (256 × (3 × 5 + 3) + 2) = 1,180,160 (6.7)


    alu operation             xmu operation
200        mov gr1 @i_img   , mov dr2 @h
201 @F110: mov ar0 dr2      , mov ar1 tr0
202 @F113: nop              , ld dr0 [ar0 + 0] ;        // dr0 = h[v][0]
203        nop              , ld dr1 [ar1 + 0] ;        // dr1 = i_img[y+v][x+0]
204        mac tr0 dr0 dr1 da0 , ld dr0 [ar0 + 1] ;     // dr0 = h[u][1]
205        nop              , ld dr1 [ar1 + 1] ;        // dr1 = i_img[y+v][x+1]
206        mac tr0 dr0 dr1 da0 , ld dr0 [ar0 + 2] ;     // dr0 = h[u][2]
207        add dr2 dr2 #3   , ld dr1 [ar1 + 2] ;        // dr2 = &h[v+1][0]
208        mov ar0 tr2      , loop.clear pr0 [@F113] ;  // v++
209        mac tr0 dr0 dr1 da0 , add ar1 ar1 #258 ;     // ar1 = &i_img[y+v][x]
210        mov da0 #0       , st tr0 [ar3] ;            // out[y][x] = s;
211        mov dr2 @h       , loop.clear pr1 [@F110] ;  // x++
212        add gr1 gr1 #1   , add ar3 ar3 #1 ;          // ar3 = &o_img[y][x]
213        nop              , loop.clear pr2 [@F110] ;  // y++
214        add gr1 gr1 #2   , nop ;                     // gr1 = &i_img[y][x]

Figure 6.9 – Assembly Fragment for Convolution Kernel. The inner-most loop is completely unrolled. The instruction pairs at lines 202 – 208 comprise the body of the unrolled loop; operand registers dr0 and dr1 stage the filter coefficients and image elements as they are loaded from memory; address register ar0 supplies addresses for loading filter coefficients, and address register ar1 supplies addresses for loading image elements. Predicate register pr0 stores the loop induction variable v and controls the execution of the loop defined at line 113 in the source code; predicate register pr1 stores the loop induction variable x and controls the execution of the loop defined at line 111; predicate register pr2 stores the loop induction variable y and controls the execution of the loop defined at line 110. The add operation at line 209 advances the pointer stored in ar1 to the next row of the input image; the offset of 258 combines the image width (X=256) and the frame (2N=2). The st instruction at line 210 stores the value computed for the output image element; address register ar3 provides the address of the next element in the output image. The add operation at line 214 advances the pointer stored in gr1 to the next row of the input image, correcting for the padding columns introduced along the left and right edges of the input image.

    alu operation             xmu operation
300        nop              , mov ar2 @i_img
301 @F110: nop              , add ar1 ar2 as2 ;         // ar1 = &i_img[y][x]
302 @F111: nop              , add ar0 ar1 as1 ;         // ar0 = &i_img[y][x]
303 @F113: nop              , ld dr0 [as0] ;            // dr0 = h[v][0]
304        nop              , ld dr1 [ar0 + 0] ;        // dr1 = i_img[y+v][x+0]
305        mac tr0 dr0 dr1 da0 , ld dr0 [as0] ;         // dr0 = h[u][1]
306        nop              , ld dr1 [ar0 + 1] ;        // dr1 = i_img[y+v][x+1]
307        mac tr0 dr0 dr1 da0 , ld dr0 [as0] ;         // dr0 = h[u][2]
308        nop              , ld dr1 [ar0 + 2] ;        // dr1 = i_img[y+v][x+2]
309        mac ar3 dr0 dr1 da0 , loop.clear pr0 [@F113] ; // v++
310        nop              , add ar0 ar1 as1 ;         // ar1 = &i_img[y+v][x]
311        mov da0 #0       , loop.clear pr1 [@F111] ;  // x++
312        nop              , st ar3 [as3] ;            // out[y][x] = s;
313        nop              , loop.clear pr2 [@F111] ;  // y++
314        nop              , add ar1 ar2 as2 ;         // ar1 = &i_img[y][x]

Figure 6.10 – Assembly Fragment for Convolution Kernel using Address-Stream Registers. The assembly listing shows the code after loop unrolling.

This corresponds to a 33% reduction in the number of instructions fetched. The number of memory accesses performed when the code executes is calculated as follows.

256 × (256 × (3 × 3 + 1) + 0) + 9 = 655,369 (6.8)


    alu operation             xmu operation
400        nop              , mov ar2 @i_img
401 @F110: nop              , add ar1 ar2 as2 ;         // ar1 = &i_img[y][x]
402 @F111: nop              , add ar0 ar1 as1 ;         // ar0 = &i_img[y][x]
403 @F113: nop              , ld dr1 [ar0 + 0] ;        // dr1 = i_img[y+v][x+0]
404        mac tr0 xr0 dr1 da0 , ld dr1 [ar0 + 1] ;     // dr1 = i_img[y+v][x+1]
405        mac tr0 xr0 dr1 da0 , ld dr1 [ar0 + 2] ;     // dr1 = i_img[y+v][x+2]
406        mac ar3 xr0 dr1 da0 , loop.clear pr0 [@F113] ; // v++
407        nop              , add ar0 ar1 as1 ;         // ar1 = &i_img[y+v][x]
408        mov da0 #0       , loop.clear pr1 [@F111] ;  // x++
409        nop              , st ar3 [as3] ;            // out[y][x] = s;
410        nop              , loop.clear pr2 [@F111] ;  // y++
411        nop              , add ar1 ar2 as2 ;         // ar1 = &i_img[y][x]

Figure 6.11 – Convolution Kernel using Indexed Registers. The assembly fragment shows the kernel after loop unrolling.

Promoting the elements of h to indexed registers reduces the number of memory accesses performed to load operands and store data by 47%.

To capture reuse of the image samples, we interchange the loops in the inner loop nest so that M samples from a single column are transferred to the indexed registers in each iteration of the outer loop nest. Figure 6.13 illustrates the pattern of reuse in the indexed registers. Interchanging the order of the inner loop nest complicates the computation of addresses, though the address-stream registers absorb most of the complexity. We could instead interchange the order of the loops in the outer loop nest, though doing so would affect the order in which elements of the input image would need to be streamed to the convolution kernel to avoid buffering most of the image before initiating the kernel. For example, an implementation of the image convolution kernel that proceeds down a column before advancing to the next row would need to buffer most of the input image when the input image is transferred by streaming elements along rows corresponding to horizontal scan lines.
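The interchange can be sketched at the source level as follows; this is a hypothetical restructuring of the inner loop nest of Figure 6.8 (reusing its declarations), not the compiler's actual output, but it shows why the M samples of a single input column become available consecutively for transfer to adjacent indexed registers.

/* Inner loop nest of Figure 6.8 after interchange: u is now the outer of
   the two filter loops, so for each u the M elements i_img[y+0..M-1][x+u]
   of one input column are visited consecutively. */
fixed32 s = 0;
for ( int u = 0; u < M; u++ ) {
    for ( int v = 0; v < M; v++ ) {
        s += h[v][u] * i_img[y+v][x+u];
    }
}
o_img[y][x] = s;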

Figure 6.12 shows the resulting assembly code fragment. The number of instructions fetched when the code executes is calculated as follows.

256 × (256 × (3 × 3 + 3) + 4) = 787,456 (6.9)

This corresponds to a further 33% reduction in the number of instructions fetched, and a cumulative reduction of 57% from the initial implementation. The number of memory accesses performed is calculated as follows.

256 × (256 × (3 × 0 + 5) + 0) + 15 = 327,695 (6.10)

Using the indexed registers to capture reuse within the elements of the input image further reduces the number of memory accesses by 50%, and achieves a cumulative reduction of 74%.


    alu operation             xmu operation
500        mov dr0 @i_img   , mov ar0 @i_img ;
501        add ar0 dr0 #1   , ld.v (3) xp1 [ar0+as1] ;  // load i_img[0:2][0]
502        add dr0 dr0 #2   , ld.v (3) xp1 [ar0+as1] ;  // load i_img[0:2][1]
503 @F110: mov ar0 dr0      , nop ;                     // ar0 = &i_img[y][x+2]
504 @F111: nop              , ld.v (3) xp1 [ar0+as1] ;  // load i_img[y+0:2][x+2]
505 @F113: mac tr0 xp0 xp1 da0 , nop ;
506        mac tr0 xp0 xp1 da0 , loop.clear pr0 [@F113] ; // u++
507        mac tr0 xp0 xp1 da0 , nop ;
508        mov da0 #0       , st tr0 [as3] ;            // out[y][x] = s;
509        add dr0 dr0 #1   , loop.clear pr1 [@F111] ;  // x++
510        mov ar0 dr0      , ld.rest xp0 [as0] ;       // ar0 = &i_img[y][x+2]
511        nop              , loop.clear pr2 [@F110] ;  // y++
512        nop              , add dr0 ar0 as2 ;         // dr0 = &i_img[y][x+2]

Figure 6.12 – Convolution Kernel using Vector Loads. The assembly fragment shows the kernel after loop unrolling. Address-stream register as1 is configured to provide the offsets of elements in adjacent rows. The address increment field corresponds to the distance between adjacent rows in memory, and the elements field corresponds to the vector transfer size. The ld.rest operation at line 510 uses address-stream register as0 to iterate through the fixed sequences of offsets that are needed to rotate the alignment of the coefficients with the input samples stored in the indexed registers.

[Figure 6.13 graphic: two panels, "Filter and Input Region of Image" and "Contents of Indexed Registers", showing the filter coefficients (xp0) and image samples (xp1) held in the indexed registers before and after the input position advances one column.]

Figure 6.13 – Data Reuse in Indexed Registers. The contents of the indexed registers before the kernel advances the input position by one column are illustrated in the left half of the figure, and the contents after the kernel advances are illustrated in the right half. The image samples from the new column replace the samples from the column to the left of the current input range. Index register xp0 is updated to align the filter coefficients stored in the indexed registers with the image samples.


[Figure 6.14 chart: static kernel code size in instruction pairs (0–256) for the aes, conv2d, crc, crosscor, dct, fir, huffman, jpeg, rgb2yuv, and viterbi kernels under the default, address-stream register, and indexed and address-stream register configurations.]

Figure 6.14 – Static Kernel Code Size. The static kernel code size includes load instructions that are required to initialize the index and address-stream registers when a kernel begins executing.

6.4 Evaluation and Analysis

This section provides an evaluation and analysis of the Elm indexed registers and address-stream registers. The kernels used in the evaluation were compiled using elmcc. We present various performance and efficiency results for three processor configurations. The default configuration uses neither address-stream registers nor index registers, and provides a reference for comparison. The configuration labeled as using address-stream registers has 4 address-stream registers in addition to the general-purpose and operand registers, but does not use the index registers. The configuration labeled as using indexed and address-stream registers has 4 index registers in addition to the 4 address-stream registers of the previous configuration.

Static Code Size

Address-stream registers and indexed registers allow software to eliminate bookkeeping and memory operations. Figure 6.14 shows the static instruction counts for each of the kernels, and illustrates the impact of compiling the kernels to use address-stream registers and indexed registers on static code size. The size reported for each kernel corresponds to the number of instruction pairs in the assembly code produced by the compiler, and includes instruction pairs that are inserted into the prologue and epilogue to configure any of the address-stream and index registers used by the kernel. Using address-stream registers consistently reduces the static kernel code size, though the significance of the reduction depends on how many bookkeeping instructions are eliminated. All of the kernels listed are able to use address-stream registers to eliminate some explicit bookkeeping instructions. For kernels in which the use of address-stream registers is limited to iterating sequentially through the input and output data buffers, such as huffman and jpeg, the elimination of a modest number of bookkeeping instructions is offset by the introduction of instructions to initialize the address-stream registers; consequently, compiling


[Figure 6.15 chart: normalized dynamic instruction count (0.0–1.0) for the aes, conv2d, crc, crosscor, dct, fir, huffman, jpeg, rgb2yuv, and viterbi kernels under the default, address-stream register, and indexed and address-stream register configurations.]

Figure 6.15 – Dynamic Kernel Instruction Count. The dynamic kernel instruction count includes instructions that are executed to initialize the index and address-stream registers.

the kernels to use address-stream registers provides a modest reduction in the static code size. For kernels in which address-stream registers are used to eliminate bookkeeping instructions throughout the kernel, such as crosscor and rgb2yuv, eliminating bookkeeping instructions reduces the static kernel code size by 33% to 40%.

Using index and indexed registers consistently eliminates load and store instructions in those kernels with working sets that can be captured in the indexed registers; however, the additional instructions required to initialize and manage the index registers partially offset the reduction in the code size. The increase in the static code size of the conv2d and crosscor kernels results from restructuring the kernels to enhance the use of the indexed registers. To allow samples from the input image to be stored in indexed registers, the order of the loops that implement the filter computation in the conv2d kernel is interchanged so that samples in a column are loaded to adjacent indexed registers. This reduces the number of times each sample is loaded, but requires additional address-stream registers, which must be initialized in the kernel prologue. The increase in the static code size of the crosscor kernel results from restructuring the computation to compute multiple cross-correlation coefficients in parallel. The initial version computes a single cross-correlation coefficient, storing the coefficient being calculated in a dedicated accumulator and using multiply-accumulate operations to update the coefficient. Using the indexed registers to compute multiple coefficients in parallel allows an input sample to be incorporated into multiple coefficients after being loaded, reducing the number of times each input sample is loaded. For both kernels, restructuring the computation introduces additional instructions outside the important loops, which accounts for the increase in the static code size.
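The crosscor restructuring can be illustrated with a hedged C sketch; the function names, the fixed32 typedef, and the choice of four parallel lags are illustrative assumptions rather than the benchmark's actual code.

typedef int fixed32;   /* stand-in for the fixed-point type used in the kernels */

/* One coefficient at a time: every input sample x[i] must be reloaded for
   each lag that uses it. */
fixed32 xcorr_one(const fixed32 *x, const fixed32 *y, int n, int lag) {
    fixed32 acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i] * y[i + lag];
    return acc;
}

/* Four coefficients in parallel: each loaded sample is folded into four
   accumulators (held in indexed registers in the Elm version), so x is
   loaded once rather than four times. */
void xcorr_four(const fixed32 *x, const fixed32 *y, int n, int lag, fixed32 out[4]) {
    fixed32 a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (int i = 0; i < n; i++) {
        fixed32 xi = x[i];
        a0 += xi * y[i + lag + 0];
        a1 += xi * y[i + lag + 1];
        a2 += xi * y[i + lag + 2];
        a3 += xi * y[i + lag + 3];
    }
    out[0] = a0; out[1] = a1; out[2] = a2; out[3] = a3;
}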


Dynamic Instruction Count

Figure 6.15 shows the normalized dynamic instruction counts, and illustrates the ability of address-stream registers and indexed registers to eliminate bookkeeping operations from a computation. The dynamic instruction count includes those instruction pairs that are executed to initialize and configure the address-stream and index registers. As should be apparent from comparing Figure 6.14 and Figure 6.15, the reduction in the dynamic instruction count is typically greater than the reduction in the static instruction count. This should be expected. Because the static code sizes of the kernels are relatively small, introducing a few additional instructions to initialize and configure the address-stream and index registers produces a noticeable increase in the sizes of the kernels. However, the additional instructions are executed infrequently and consequently comprise a small fraction of the dynamic instruction count.

Throughput and Latency

Figure 6.16 shows the normalized kernel execution times. As Figure 6.16 illustrates, using address-stream and indexed registers typically reduces kernel execution time and improves throughput. Computations with relatively few bookkeeping instructions and working sets that cannot benefit from indexed registers often exhibit small, almost negligible, improvements. The crc and huffman kernels are representative of such computations. Computations with working sets that can be captured in indexed registers often exhibit significant performance improvements when compiled to use address-stream and indexed registers. For example, the execution times of the conv2d, crosscor, and fir kernels are dominated by such computations, and the throughput of these kernels increases by 1.7× to 2.7×. Though the kernel execution times are qualitatively similar to the dynamic instruction counts shown in Figure 6.15, the execution times also include any time the processor is stalled, during which instructions are not issued from the instruction registers. Vector memory operations may introduce additional stall cycles without affecting the execution time of a kernel: the additional stall cycles correspond to time that would be spent executing an equivalent sequence of memory operations were a vector memory operation not used. These additional stall cycles appear because a processor may only have one outstanding vector memory operation and cannot execute other memory operations when a vector memory operation is in progress. Consequently, attempting to execute a memory operation when a vector memory operation is in progress causes the processor to stall. If instead of scheduling a vector memory operation, software scheduled an equivalent sequence of memory operations, these stall cycles would be occupied by memory operations. Ignoring interference from other processors competing for memory bandwidth, the number of cycles required to perform the vector memory operations will not exceed the time spent executing memory operations, as the execution time depends on the number of memory operations that are performed. However, a vector memory operation reduces the number of instructions that are issued. Though not present in the execution times reported in Figure 6.16, using indexed registers instead of memory to store operands may provide additional performance improvements by reducing contention in the memory system. Typically, multiple processors compete for access to the memory system and memory bandwidth.


[Figure 6.16 chart: normalized kernel execution time (0.0–1.0) for the aes, conv2d, crc, crosscor, dct, fir, huffman, jpeg, rgb2yuv, and viterbi kernels under the default, address-stream register, and indexed and address-stream register configurations.]

Figure 6.16 – Normalized Kernel Execution Times. The execution time of each kernel is normalized to the baseline register organization.

When the memory system is unable to complete operations attempted by different processors in parallel, the memory system serializes incompatible operations. The serialization increases the effective latency of those memory operations that are deferred. In general, it is difficult for software to anticipate when such conflicts will occur and to schedule memory operations accordingly because the conflicts result from operations performed by software executing in parallel on different processors. Using the indexed registers instead of memory to store frequently accessed data reduces demand for memory bandwidth, which reduces the likelihood of collisions in the memory system. Indexed registers also provide predictable access latencies, which simplifies the process of compiling and scheduling software to satisfy real-time performance constraints.

Instruction Delivery Energy

Indexed and address-stream registers improve energy efficiency by eliminating bookkeeping instructions from the instruction stream, which accounts for the reduction in the dynamic instruction counts of Figure 6.15. This allows machines such as Elm to deliver a greater fraction of instructions from instruction registers because it reduces instruction register pressure. Instructions in the prologue of a kernel that configure the index and address-stream registers used by the kernel are typically executed once, when the kernel loads. Consequently, these instructions do not contribute significantly to instruction register pressure, though they must be loaded from the backing memory when execution transfers to a kernel. Figure 6.17 shows the energy expended delivering instructions, along with the fraction of the energy consumed loading instructions from the local memory. The reported energy is normalized within each kernel. For most of the kernels shown in Figure 6.17, the reduction in the instruction delivery energy simply reflects the reduction in the dynamic instruction count. The dct kernel is a notable exception, as it exhibits the most significant reduction.


[Figure 6.17 chart: normalized instruction delivery energy (0.0–1.0) for the aes, conv2d, crc, crosscor, dct, fir, huffman, jpeg, rgb2yuv, and viterbi kernels under the default, address-stream register, and indexed and address-stream register configurations.]

Figure 6.17 – Normalized Kernel Instruction Energy. The energy consumed staging instructions in instruction registers and memory is normalized within each kernel. The black component of each bar corresponds to the energy consumed transferring instructions to instruction registers, and includes both the energy consumed accessing the backing memory and storing the instruction to the instruction registers.

The address-stream and index registers allow a single loop to implement both the horizontal and vertical phases of the dct kernel. An address-stream register is used to load the samples from memory. The address-stream register is updated at the start of the horizontal phase to generate addresses that traverse rows of the input data, and is updated again at the start of the vertical phase to generate addresses that traverse columns of the input data. The constants used to compute the transform values are stored in indexed registers and accessed through index registers. The resulting reduction in the instruction count is sufficient to allow the entire kernel to fit in the instruction registers (Figure 6.14).
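The following sketch models how a single loop body can serve both phases when the address sequence is generated outside the loop. The struct layout, the as_next helper, and the 8 × 8 block size are illustrative assumptions, not the Elm address-stream register encoding.

typedef int fixed32;

/* Illustrative model of an address stream: a current position and a
   stride, updated as a side effect of each use. */
typedef struct {
    int next;     /* next element index in the stream */
    int stride;   /* distance between consecutive elements */
} addr_stream;

static int as_next(addr_stream *as) {
    int a = as->next;
    as->next += as->stride;   /* implicit update, as when loading through [as0] */
    return a;
}

/* One transform pass over 'count' elements; the same loop body serves both
   phases because the traversal order is encoded in the stream. */
void dct_pass(fixed32 *data, addr_stream *as, int count) {
    for (int i = 0; i < count; i++) {
        int a = as_next(as);
        /* transform arithmetic on data[a] elided; the constants would be
           supplied from indexed registers */
        data[a] += 0;
    }
}

/* Horizontal phase: { next = row_base, stride = 1 } walks a row of an
   assumed 8x8 block.  Vertical phase: the same stream is reconfigured with
   { next = col_base, stride = 8 } to walk a column; only the stream
   configuration changes between phases, not the loop body. */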

Operand Delivery Energy

Indexed registers reduce operand delivery energy by replacing expensive memory accesses with less expensive register file accesses. Though less significant, they also reduce the number of address operands read from register files by memory operations to compute the effective address of load and store operations. Address-stream registers reduce operand delivery energy by replacing general-purpose register accesses with less expensive operand register accesses. Figure 6.18 shows the energy expended delivering operands and staging results for each of the kernels. The address-stream registers are included in the operand register energy. The address-stream register energy includes the energy expended reading a register, computing the next address in the sequence, and writing the updated value to the register. Integrating the address-stream registers into the datapath of the XMU requires additional multiplexer ports that increase the address register access energy. The index register access energy appears in the general-purpose register file component, as the indexed registers are implemented in


[Figure 6.18 chart: normalized operand delivery energy (0.0–1.0), broken down into hardware forward, explicit forward, operand register, general-purpose register, and memory components, with three bars (D, A, X) per kernel for the aes, conv2d, crc, crosscor, dct, fir, huffman, jpeg, rgb2yuv, and viterbi kernels.]

Figure 6.18 – Normalized Kernel Data Energy. The energy consumed staging data in registers and memory is normalized within each kernel. The three bars within each kernel, from left to right, correspond to the baseline register organization; the baseline organization extended with address-stream registers; and the baseline organization extended with both address-stream registers and index registers.

the general-purpose register file. The index register access energy includes the energy expended reading the index registers, updating the index, and writing the updated index back to the index register. Integrating the index registers and the general-purpose register file requires additional multiplexers and logic to control the selection of the addresses sent to the general-purpose register file. The additional logic increases the cost of accessing operands stored in general-purpose registers.

The most significant reductions in operand energy arise when operands are assigned to indexed registers instead of memory. For example, the index registers provide a 91% reduction in operand energy for the crosscor kernel beyond the 12% reduction provided by the address-stream registers. Similarly, the indexed registers provide a 92% reduction for the fir kernel beyond the 7% reduction provided by the address-stream registers. As the operand energy is dominated by memory accesses without the indexed registers, indexed registers would provide greater reductions were a more expensive memory used as the local memory. For example, the reduction would be greater were a cache memory or a software-managed memory of greater capacity used as a local memory.

The address-stream registers reduce the energy expended delivering operands for the aes and jpeg kernels by allowing additional operands to be stored in operand registers rather than more expensive general-purpose registers. The increase in the operand energy incurred in the indexed register configuration results from the increased cost of accessing the general-purpose registers.


6.5 Related Work

Vector processors [123] provide large collections of physical registers that are accessed indirectly as elements of vector registers [10, 59]. Because many important kernels in embedded and multimedia applications are vectorizable, the register organizations used in vector processors are able to capture a significant fraction of intra-kernel reuse and locality in embedded applications [84]. Contemporary vector-processor architectures support both vector and multithreaded execution models [88, 120]. For example, the Scale Vector-Thread architecture [88] supports both vector and multithreaded compute models using a collection of processors to implement a vector of virtual processors when operating as a vector processor. Like Elm, the Scale Vector-Thread architecture provides both registers that are accessed directly (shared registers) and registers that are accessed indirectly (private registers) using virtual register specifiers, and allows software to configure the allocation of physical registers between the sets of directly and indirectly accessed registers. Contemporary graphics processing unit architectures also allow physical registers to be dynamically partitioned among thread contexts, though typically this is managed by dedicated hardware or device-level software [41]. Stream processors such as Imagine [121] provide a large software-managed stream register file that is accessed indirectly through stream buffers. The stream register file is partitioned among multiple active streams, and software controls the allocation of entries in the stream register file to streams. The stream register file is used to stage bulk data transfers between memory and local registers distributed among the function units. It is also used to temporarily store elements of streams transferred between computation kernels to avoid transferring intermediate data to memory. This allows a stream register file to be used to capture inter-kernel producer-consumer locality, and reduces off-chip memory bandwidth demand. Because a stream register file must be large enough to stage data between memory and the stream processor, stream register files are much larger than the indexed register files used in Elm. The early reduced instruction set computer (RISC) architectures [114] used register windows to provide software access to more physical registers than could be encoded directly in an instruction. Register window organizations typically partition the physical registers into a set of global registers and multiple sets of overlapping windows. Register windows provide software with the abstraction of an unbounded stack of windows, though only a subset near the top of the stack are maintained in physical registers; the remaining windows must be stored in memory. A distinguished control register determines which of the physical register windows is active. Procedure call and return instructions implicitly update the active window, with call instructions allocating a new set of registers for use by the called procedure. Should the physical registers assigned to the new window be in use when a call instruction executes, the contents of the registers are transferred to memory and must later be restored. Privileged software may explicitly modify the active register window, typically to end a spill or fill trap handler. Register windows continue to be used in contemporary UltraSPARC architectures [104]. Itanium implements hardware-managed register stacks [28].
The register stack mechanism is similar to the register window scheme described previously. The architecture partitions the general register file into a static and stacked subset. The static subset is visible to all procedures; the stacked subset is local to each procedure and may vary in size from 0 to 96 registers. The register stack mechanism is implemented by


renaming register addresses during procedure call and return operations. Registers in the register stack are automatically saved and restored by a hardware register stack engine. In addition to register stacks, Itanium implements a rotating register file mechanism for use in software-pipelined loops [118, 119]. A single architectural register in the rotating portion of the register file may be used in place of multiple architectural registers to specify instances of a variable defined by different iterations of a loop. Equivalent software techniques such as modulo variable expansion [93] introduce a unique name for each simultaneously live instance of a variable so that values defined in different iterations of a loop can be distinguished. This increases architectural register requirements and expands code size, as a sufficient number of interleaved iterations of the loop body must be introduced to accommodate the naming constraints of the software-pipelined loop. Itanium allows a fixed-sized region of the floating-point register files and a programmable-sized region of the general register file to rotate when a software-pipelined loop-type branch executes. The rotation is implemented by renaming register addresses based on the value of the rotating register base field contained in a control register. Each rotation decrements the contents of the affected rotating register base values modulo the size of their respective software-defined rotating regions. Aside from the change of the rotating register base values, the rotating register mechanism is not visible to software. Along with other features that distinguish explicitly parallel instruction computing (EPIC) architectures such as Itanium from their VLIW ancestors, rotating registers were introduced to improve the performance of processors in which the compiler explicitly exploits instruction-level parallelism. Rotating registers were originally used in the Hewlett-Packard Laboratories PlayDoh architecture to improve the code density of software-pipelined loops [76]. These early EPIC-style architectures originated as parametric processor architectures, developed to enable an embedded system design paradigm in which a compiler generates an application-specific processor for an embedded application [126]. This allows a system designer to describe an embedded system in a high-level programming language. Essentially, in addition to compiling an application, the compiler generates a set of specific bindings for processor parameters. Similar strategies have been adapted for designing customized chip-multiprocessors [136] and digital signal processors for embedded systems from descriptions of applications provided in high-level programming languages; for example, the AnySP signal processor relies on a similar strategy of customizing hardware for a specific application to improve efficiency [150]. Register queues provide an alternative mechanism to rotating registers for decoupling physical registers from architectural register names so that more physical registers can be provisioned, improving the applicability of software-pipelining [132]. Register queue architectures let software construct queues in the register file to buffer multiple live instances of a variable introduced during modulo variable expansion.
Register queue architectures require hardware tables for recording the mapping of architectural register names to physical registers and register queues, and for mapping each appearance of an architectural register that has been mapped to a register queue to an appropriate offset within its associated queue. Index and indexed registers can be used to achieve a similar decoupling of physical registers and architectural register names, to partition physical registers among architectural register names, and to implement queues for decoupling memory operations.


In contemporary computer systems, software pipelining delivers greater performance improvements when used to tolerate memory access latencies rather than arithmetic unit latencies, which are short compared to typical memory access times. The number of physical registers required for a compiler to software pipeline a loop effectively increases as the product of the number of variables in the loop and the number of interleaved loop iterations. The additional physical registers provided by a rotating register file allow software to schedule more interleaved loop iterations. As the number of interleaved iterations required to tolerate the latency of operations in a loop increases with the operation latencies, additional physical registers allow software pipelining to tolerate longer function unit latencies. In addition to tolerating memory access latency, indexed registers let software effectively capture medium-term reuse and locality in local register files rather than memory, which is considerably more difficult to achieve using rotating register files and register queues. The VICTORIA [37] architecture defines a set of extensions to VMX, the SIMD extensions to the PowerPC architecture [27], that uses indirection to increase the number of VMX registers that are available to software. The indirect-VMX (iVMX) extensions allow a processor to implement as many as 4,096 VMX registers. The extensions define several sets of map registers that store offsets into the VMX register file. The map registers are organized as tables that are indexed using the VMX register specifier encoded in an instruction. An independent map table is maintained for each of the 4 register specifier fields defined by the VMX extensions. The effective address of a VMX register is computed by adding the offset stored in the associated map register and the address encoded in the register specifier field of an instruction. Software is responsible for managing the map registers, and must explicitly modify map registers to access a different subset of the extended VMX registers, which limits the number of VMX registers that can be accessed without executing bookkeeping instructions that modify the map registers. However, the extensions preserve backwards compatibility with the original VMX extensions. The eLite digital signal processor [106] provides a collection of vector pointer registers that are used to index a set of vector element registers. Like the index registers described in this chapter, the vector pointer registers are automatically updated when used to access the vector element registers. However, the eLite architecture supports a single-instruction, multiple-data (SIMD) execution model in which instructions operate on disjoint data that are packed into short fixed-width vectors. The architecture uses a distinguished set of vector accumulator registers to receive the results of most vector operations, with the perhaps obvious exception of operations that transfer vector accumulator registers to vector element registers and to memory. The vector element registers are organized to allow data in vector element registers to be arbitrarily packed into operand vectors as they are read from the vector element register file. Every access to the vector element registers is performed indirectly through the vector pointer registers, and each of the 4 elements in a packed operand may use a different pointer value to access the vector element register file.
This supports a vector programming model in which data in memory and in the vector accumulator registers are accessed as packed vector elements, while data stored in the vector element registers are accessed as scalar elements that are packed into vector operands on access. The creators of the eLite architecture refer to this as a register organization that supports single-instruction, multiple disjoint data (SIMdD) operations, as opposed to the conventional single-instruction, multiple packed data (SIMpD) multimedia instruction sets found in commodity processors. The eLite register organization allows data reuse that crosses vector lanes to be captured in the


vector element register file without performing explicit permutation operations or memory operations to align vector elements. However, it does so by introducing additional register file ports. Each vector lane has its own dedicated read and write ports to the vector element register file. Though the impact of the additional ports on the area and access energy of the vector register file organization is not quantified, the additional ports render the vector element registers less area-efficient and less energy-efficient than a conventional SIMD register organization [122]. The register organization also complicates the design and implementation of the forwarding network, which must detect when a result is forwarded across vector lanes based on the pointer values delivered from the vector pointer registers. Some general-purpose architectures support addressing modes that update one of the registers used to compute the effective memory address as a side effect of load and store operations [60]. The PowerPC architecture is a representative example. It provides both a register index addressing mode that updates the base register used in the operation with the effective address, and an immediate addressing mode that similarly updates the base register [26]. Automatic update addressing modes lack much of the flexibility of address-stream registers. Address-stream registers allow both the sequence of addresses in the address stream and the state of the stream to be compactly and efficiently encoded. By providing explicit named state, address-stream registers allow software to save and restore the complete state of an address stream. Address-stream registers also decouple the updating of the address stream sequence from the computed effective address, and allow any instruction that can access an address-stream register to consume and update the address sequence. The PowerPC architecture also provides load and store operations that transfer multiple words between general-purpose registers and memory. Because the range of registers that is affected by the operation is explicitly encoded in the instruction, the use of these operations is typically limited to moving blocks of registers between the stack and the register file at procedure call boundaries, and to loading and storing aggregate data types.

6.6 Chapter Summary

Indexed registers expose mechanisms for accessing registers indirectly. The resulting register organization allows software to capture reuse and locality in registers when data access patterns are well structured, which improves the efficiency with which data are delivered to function units. Address-stream registers expose hardware for computing structured address streams. Address-stream registers improve efficiency by eliminating address computations from the instruction stream, and improve performance by allowing address computations to proceed in parallel with the execution of independent operations.

Chapter 7

Ensembles of Processors

This chapter introduces the Ensembles of processors organization used in an Elm system. An Ensemble is a coupled collection of processors, memories, and communication resources. An Elm system is a collection of Ensembles. The Ensembles share a distributed memory system and external interfaces for communicating with other chips. The Ensemble is the fundamental compute design unit in an Elm system, and is extensively replicated throughout the system. This extensive replication and reuse of a small design unit allows aggressive custom circuits and design efforts to be used effectively within an Ensemble.

7.1 Concept

Figure 7.1 illustrates the organization of the processors and memories comprising an Ensemble. The processors in an Ensemble are each assigned a unique processor identifier in the range [0,3]. We usually refer to the processors using the names EP0, EP1, EP2, and EP3 when we need to identify a particular processor.

Ensemble Memory

The Ensemble memory is a local software-managed instruction and data store that is shared by the processors in the Ensemble. The Ensemble memory serves several important functions. Perhaps most importantly, it provides local memory for storing important instruction and data working sets that exceed the capacity of the register files close to the processors. The Elm compiler uses the local Ensemble memory to store instructions that may be loaded to the instruction registers during the execution of a kernel, which reduces the performance and energy impact of loading instructions from memory while a kernel executes. The Elm compiler also uses the local Ensemble memory to temporarily spill data when it is unable to allocate registers to all of the live variables in a kernel. The Ensemble memory also provides local storage for staging data and instruction transfers between the processors in the Ensemble and memory beyond the Ensemble. The storage capacity provided by the local Ensemble memory allows software to use techniques such as prefetching and double buffering that exploit


[Figure 7.1 diagram: a network interface and the Ensemble memory, organized as four banks (Bank 0 – Bank 3) with arbitrated read and write ports, shared by processors EP0–EP3; each processor contains a GRF, IRFs, an XMU, an ALU, and an MRF, and the processors are connected by the local communication fabric.]

Figure 7.1 – Ensemble of Processors. The Ensemble organization allows the expense of transferring in- structions and data to an Ensemble to be amortized when multiple processors within the Ensemble access the instructions and data. The Ensemble memory is a local software-managed instruction and data store that is shared by the processors in the Ensemble. The local communication fabric provides a low-latency, high- bandwidth interconnect for efficiently transferring data between threads executing concurrently on different processors. These links may be used to stream data between threads that are mapped to different processors within an Ensemble, avoiding the expense of streaming data through memory.

communication and computation concurrency to effectively hide remote memory access latencies. The Ensemble organization allows the expense of transferring instructions and data to an Ensemble to be amortized when multiple processors within the Ensemble access the instructions and data. This provides an


important mechanism for significantly improving the efficiency of instruction and data delivery when data parallel codes are mapped to the processors in an Ensemble. The Ensemble memory is banked to allow it to service multiple transactions concurrently. Each processor within an Ensemble is assigned a preferred bank. Accesses by a processor to its preferred bank are prioritized ahead of accesses by other processors and the network interface. Instructions and data that are private to a single processor may be stored in its preferred bank to provide deterministic access times. The arbiters that control access to the read and write ports are biased to establish an affinity between processors and memory banks. This allows software to make stronger assumptions about bandwidth availability and latency when accessing data that may be shared by multiple processors.
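As an illustration of the prefetching and double-buffering techniques mentioned above, the following sketch overlaps transfers into the Ensemble memory with computation. The elm_dma_start and elm_dma_wait primitives, the compute_on_tile routine, the fixed32 typedef, and the tile size are all hypothetical stand-ins for whatever bulk-transfer interface the runtime actually exposes; only the buffering pattern is the point.

typedef int fixed32;
enum { TILE = 256 };

/* Hypothetical primitives, declared only so the sketch is self-contained. */
extern void elm_dma_start(fixed32 *dst, int remote_tile, int n);
extern void elm_dma_wait(const fixed32 *dst);
extern void compute_on_tile(const fixed32 *tile, int n);

void process_tiles(int num_tiles) {
    /* Two tile buffers resident in the Ensemble memory. */
    static fixed32 buf[2][TILE];
    int cur = 0;

    elm_dma_start(buf[cur], 0, TILE);                   /* prefetch the first tile   */
    for (int t = 0; t < num_tiles; t++) {
        elm_dma_wait(buf[cur]);                         /* tile t is now local       */
        if (t + 1 < num_tiles)
            elm_dma_start(buf[cur ^ 1], t + 1, TILE);   /* overlap the next transfer */
        compute_on_tile(buf[cur], TILE);                /* compute on local data     */
        cur ^= 1;                                       /* swap buffers              */
    }
}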

Local Communication Fabric

Threads executing in an Ensemble may communicate by reading and writing shared variables located in the local Ensemble memory. However, communicating through memory requires that threads execute expensive load and store instructions. The local communication fabric provides a low-latency, high-bandwidth interconnect for efficiently transferring data between threads executing concurrently on different processors. These links may be used to stream data between threads that are mapped to different processors within an Ensemble, avoiding the expense of streaming data through memory. The communication links also provide an implicit synchronization mechanism. This allows the local communication fabric to be used to transport operands between instructions from a single thread that are partitioned and mapped across multiple processors within an Ensemble. It also allows software executing on different processors to use the communication fabric to synchronize execution. In addition to the dedicated point-to-point links that connect the processors in an Ensemble, the local communication fabric provides configurable links that can be used to connect processors in nearby Ensembles. This allows the communication fabric to be used to stream data between processors in different Ensembles, which simplifies the mapping of software to processors by relaxing some of the constraints on which processors may communicate through the local communication fabric. Because it avoids the expense of multiple memory operations, it is significantly less expensive to transfer data between processors in different Ensembles using the local communication fabric than to transfer data through memory. These configurable links may be used to transport scalar operands between instructions executing in different Ensembles, though the transport latency increases with the distance the operands traverse.

Message Registers

Software accesses the local communication fabric through message registers, which provide an efficient software abstraction for the local injection and ejection ports comprising the interface to the local communication fabric. The message registers are conceptually organized as message register files, as shown in Figure 7.1. The software overheads of assembling and disassembling messages are intrinsic to message passing, as data associated with computational data structures must be copied between memory-based data structures and the


network interface. These marshaling costs limit the extent to which an architecture allows software to exploit fine-grain parallelism effectively. Exposing the local communication interface as registers reduces the marshaling overheads associated with sending and receiving messages, as data may be transferred directly between an operation that produces or consumes an operand and the network interface [32]. This allows Elm to exploit very fine-grain parallelism efficiently.

Associated with each message register is a full-empty synchronization bit that allows processors to support synchronized communication through the message registers [134]. The full-empty bits allow synchronization and data access to be performed concurrently, and support efficient fine-grained producer-consumer synchronization and communication. The full-empty bit associated with a message register is set when the register is written, and is cleared when the message register is read. Most operations that access a message register inspect the state of its associated synchronization bit and use it to synchronize the access. Typically, an operation that reads a message register will only proceed when the synchronization bit is set to indicate the message register is full, and an operation that writes a message register will only proceed when the synchronization bit indicates the message register is empty. The instruction set defines distinguished unsynchronized move operations that allow software to access message registers without regard to the state of the associated synchronization bits.
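The discipline just described can be modeled in software as follows; this is a hedged illustration of the semantics only (the types, busy-wait loops, and function names are assumptions), since the real synchronization is performed by the register hardware and by stalling the issuing instruction rather than by polling.

#include <stdbool.h>

/* Illustrative model of one message register and its full-empty bit. */
typedef struct {
    volatile int  value;
    volatile bool full;   /* set on write, cleared on read */
} message_reg;

/* Synchronized send: proceeds only when the register is empty. */
void mr_send(message_reg *mr, int v) {
    while (mr->full) { /* wait for the consumer to drain the register */ }
    mr->value = v;
    mr->full  = true;
}

/* Synchronized receive: proceeds only when the register is full. */
int mr_recv(message_reg *mr) {
    while (!mr->full) { /* wait for the producer to fill the register */ }
    int v = mr->value;
    mr->full = false;
    return v;
}

/* Unsynchronized move, corresponding to the distinguished move operations:
   accesses the register regardless of the full-empty bit. */
int mr_move_unsync(const message_reg *mr) {
    return mr->value;
}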

The Elm architecture allows a subset of the message registers in each processor to be mapped to different links in the local communication fabric. This allows software to control the connection of processors, both within an Ensemble and across Ensembles. The configurable message registers may be used to establish additional links between processors within an Ensemble. These may be used to provision more bandwidth between processors. Perhaps more importantly, these additional links may also be used to provide additional independent named connections for transporting scalar operands between instructions, which can simplify the scheduling of instructions when multiple operands are passed between processors by relaxing constraints on the order in which operands are produced and consumed.

Links in the local communication fabric that connect processors in different Ensembles may be configured to operate as streaming links or scalar links. These links may require multiple cycles to traverse and may contain internal storage elements that operate as a distributed queue. This allows a link to contain multiple words. When a link is configured to operate as a streaming link, the synchronization bit at the injection interface will indicate the associated message register is full after the link has filled back to the injection port. This decouples the synchronization between the sender and receiver, and allows the sender to inject multiple words before the receiver accepts the first word. When a link is configured to operate as a scalar link, the synchronization bit associated with the message register is set to indicate the message register is full when the register is written. It is cleared when the acknowledgment signal, generated when the receiver reads the message register, arrives at the sender. This retains the synchronization semantics defined for operations that use the message registers to communicate within an Ensemble. However, the synchronization latency limits the rate at which data can be injected into the local communication network, and prevents the processors from fully utilizing the bandwidth provided by the communication link.


Processor Corps and Instruction Register Pools

The Elm architecture allows processors in an Ensemble to form processor corps that execute a common instruction stream from a shared pool of instruction registers. This allows the architecture to support a single-instruction multiple-data execution model that improves performance and efficiency when executing codes with structured data-level parallelism. The transition between independent execution and corps execution is explicitly orchestrated by the threads executing in the Ensemble. The structure imposed by the Ensemble organization allows the instructions that switch threads between corps execution and independent execution to be implemented as low-latency operations, which allows an Ensemble to execute broad classes of data parallel tasks efficiently by interleaving the execution of thread corps and independent threads. Those parts of a task that exhibit single-instruction multiple-data parallelism may be mapped to thread corps, while those parts that exhibit less well structured parallelism may be mapped to independent threads. The ability to switch between thread corps and independent threads quickly allows programs to use independent threads to handle small regions of divergent control flow efficiently within data parallel tasks.

Mapping data parallel code to processor corps improves efficiency in several ways. Perhaps most im- portantly, the shared instruction register pool increases the instruction register capacity that is available to the processor corps. This allows the shared instruction register pool to capture larger instruction working sets, reducing instruction bandwidth demand at the local memory interface and the number of instructions that are loaded from memory. The efficiency improvements realized when larger instruction working sets are captured in the shared instruction registers are significant because loading an instruction from memory is considerably more expensive than accessing an instruction register file. Energy efficiency is further improved because the number of instructions that are fetched from the instruction register files is reduced when data parallel code is executed by processor corps. Instructions fetched from the shared instruction register pool are forwarded to all of the processors comprising the processor corps. Finally, because processors in a processor corps execute instructions in lock-step, synchronization operations within thread corps may be eliminated. Eliminating synchronization operations reduces the number of instructions that are executed, and may reduce the number of memory accesses that are performed.

Instruction register pools have some obvious limitations that should be considered when evaluating whether they are appropriate for a particular processor design. Transferring instructions from register files to remote processors consumes additional energy, which offsets some of the efficiency achieved by reducing the number of instruction register file accesses. In essence, instruction register pools trade instruction locality for additional capacity. Distributing the instructions introduces additional wire delays into the instruction fetch path, and contributes to the instruction fetch latency. The amount of time available to distribute instructions must be budgeted carefully, and not all architectures will have sufficient slack to allow instructions to be distributed without introducing an additional pipeline stage.


7.2 Microarchitecture

This section describes how the Elm prototype architecture issues instructions to processor corps and provides support for shared instruction register pools. Figure 7.2 illustrates the general organization of the instruction fetch and issue hardware within a processor, with the hardware blocks that implement support for processor corps highlighted.

Pipeline Coupling

Control logic distributed through the processors and Ensemble memory system ensures that the coupled processors in a corps follow a single-instruction multiple-data execution discipline. As illustrated in Figure 7.2, remote stall signals are collected at each processor and combined to produce local pipeline interlock signals. The small physical size of an Ensemble allows control signals to be distributed among its processors quickly. To ensure that the collection and integration of remote stall signals does not introduce critical paths, the hardware that detects stall conditions is distributed through the Ensemble. For example, the arbiters that detect bank conflicts at the Ensemble memory broadcast the arbitration results to all of the processors, allowing each to determine when a bank conflict causes a processor in the corps to stall. Similarly, the message register states, which determine when an access will cause a processor to stall, are distributed so that processors in a corps can detect when another member will stall. The pipeline control logic contains a corps register that indicates which processors belong to the same corps, which determines which remote stall signals are combined to produce the local pipeline control signals.

The instruction register organization affords fast instruction access times, and allows instructions to be distributed in the instruction fetch stage. Instructions are fetched and distributed to all of the processors in a single cycle, and the pipeline timing is the same when processors are coupled to form a processor corps. This simplifies the design of the pipeline control logic, and simplifies the programming model and instruction set architecture, as the latency of most instructions does not depend on whether an instruction is executed by an independent processor or a processor corps. The notable exception is the behavior of jump-and-link instructions, which produce a single-cycle pipeline stall when executed by a processor corps. The stall allows the processors to resolve which processor holds the destination instruction.
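The way remote stall signals are folded into the local interlock can be sketched in a few lines. This is a behavioral sketch rather than the actual control logic; the corps register is modeled as a 4-bit mask over the processors of an Ensemble, and the names are assumptions introduced here.

    #include <stdbool.h>
    #include <stdint.h>

    /* Behavioral sketch of the corps interlock. corps_mask has bit p set if
     * processor p of the Ensemble belongs to this corps; remote_stall has bit p
     * set if processor p has detected a stall condition (a bank conflict at the
     * Ensemble memory, a message register that is not ready, and so on). */
    static bool corps_stall(uint8_t corps_mask, uint8_t remote_stall) {
        /* Every member of a corps stalls whenever any member stalls, which
         * keeps the coupled pipelines in lock-step. */
        return (corps_mask & remote_stall) != 0;
    }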

Distributed Control Flow

We use the concept of a conductor to distinguish the processor that fetches and distributes instructions to its processor corps. The conductor is identified as the processor whose instruction registers hold the next instruction to be executed by its corps. The conductor is responsible for fetching instructions from its instruction registers and forwarding them to the corps. When control transfers to an instruction register that resides in a different processor, responsibility for conducting the corps transfers to the processor that holds the instruction.

The Elm prototype architecture uses an extended instruction pointer to address the shared instruction registers. The extended instruction pointer register (EIP) contains the most significant 2 bits of the extended instruction pointer; the EIP register indicates which processor contains the shared instruction register that is addressed by the extended instruction pointer. The instruction pointer register (IP) contains the remainder of the extended instruction pointer. As when processors execute independently, the IP register holds the address of the next instruction to be fetched from the instruction register file.

Figure 7.2 – Group Instruction Issue Microarchitecture. The extended instruction pointer (EIP) determines which processor holds the next instruction.

In most situations, the conductor is the only member of the corps that needs to maintain a precise instruction pointer. The other processors need to ensure that the contents of their EIP registers are correct so that they can correctly identify the conductor. Only the conductor automatically updates its instruction pointer after each instruction is fetched. The remaining processors update their extended instruction pointers when control
flow instructions are encountered. Because control flow instructions encode their destination explicitly, an accurate extended instruction pointer value is forwarded in each control flow instruction. This ensures that all of the processors have a correct extended instruction pointer immediately following a transfer of control, and it ensures that control flow instructions always produce the required transfer of responsibility for conducting the corps. The Elm prototype uses an inexpensive strategy for detecting when responsibility for conducting a corps should be transferred to a different processor. The conductor detects that control is about to transfer to a different processor when the hardware that updates the instruction pointer generates a carry into the extended instruction pointer bits. The carry signals that the extended instruction pointer addresses an instruction in a different processor. To transfer control, the conductor forwards its instruction pointer to the corps and signals the other members of the corps to update their extended instruction pointers. The conductor stalls the execution of the processor corps for one cycle during the transfer of control, during which the instruction forwarded by the conductor is invalid. The same strategy is used to broadcast the extended instruction pointer when the conductor identifies a jump-and-link operation. We could avoid the one cycle stall by having the conductor detect that control is about to transfer one cycle before control actually transfers. This would require that the conductor continuously monitor the extended instruction pointer. The additional logic required to do so would toggle whenever the instruction pointer is updated, and would contribute to the overhead of delivering instructions to processors regardless of whether they operate independently or are coupled. The additional branch delay is encountered infrequently and rarely impacts application performance, as software usually inserts control flow operations to explicitly transfer control before control reaches the last instruction in the conductor’s instruction register file.
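The carry-based handoff can be illustrated with a short sketch. The field widths below are assumptions made for illustration (a 2-bit EIP selecting one of four processors and an IP just wide enough to index one instruction register file); they are not taken from the prototype's instruction encodings.

    #include <stdbool.h>
    #include <stdint.h>

    #define IP_BITS       6                 /* assumed width of the local IP   */
    #define IR_FILE_SIZE  (1u << IP_BITS)   /* instruction registers per file  */

    typedef struct {
        uint8_t eip;   /* which processor holds the next instruction (2 bits) */
        uint8_t ip;    /* index into that processor's instruction registers   */
    } extended_ip;

    /* Sequential fetch: advance the extended instruction pointer. A carry out
     * of the IP field means the next instruction resides in a different
     * processor, so responsibility for conducting the corps must transfer and
     * the corps stalls for one cycle while the new conductor takes over. */
    static bool advance_extended_ip(extended_ip *x) {
        x->ip = (uint8_t)((x->ip + 1) % IR_FILE_SIZE);
        bool carry = (x->ip == 0);
        if (carry) {
            x->eip = (uint8_t)((x->eip + 1) & 0x3);   /* 2-bit EIP wraps */
        }
        return carry;
    }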

Resolving Control Flow Destinations

As described in a previous chapter, Elm processors resolve the destination of control flow instructions that encode their destinations as relative displacements when instructions are transferred to instruction registers. When the instruction register management unit (IRMU) transfers a control flow instruction that encodes a relative destination to the instruction registers, it resolves the destination by adding the displacement to the shared instruction register at which the instruction is loaded. The destination is stored as a shared instruction register identifier, and the instruction is converted to an equivalent control flow instruction using a fixed destination encoding.
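A sketch of that resolution step follows, under the assumption that shared instruction registers are named by simple integer identifiers; the field layout of Elm instructions is not modeled.

    #include <stdint.h>

    /* Sketch of the resolution performed while an instruction is loaded.
     * base_sir is the shared instruction register at which the branch itself
     * is placed; displacement is the relative destination encoded in the
     * instruction. The result is stored back in the instruction as an
     * absolute shared instruction register identifier. */
    static uint16_t resolve_branch_destination(uint16_t base_sir, int16_t displacement) {
        return (uint16_t)(base_sir + displacement);
    }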

7.3 Examples

This section describes how the indexed registers can be used to compose messages, and presents a parallel implementation of a finite impulse response filter that uses message registers and processor corps.



Figure 7.3 – Composing and Receiving Messages in Indexed Registers. Messages are buffered in indexed registers at the sender and receiver. The index registers are used to sequence the sending and receiving of messages.

Composing and Receiving Messages in Indexed Registers

Unlike some processor architectures that provide hardware mechanisms that support efficient message passing [34, 86, 91], Elm does not provide a dedicated set of registers for composing messages. Instead, messages may be composed and received in general-purpose registers by using the index registers to sequence the transfer of messages between the general-purpose registers and message registers. This decouples the composition and extraction of a message from its transmission.

Figure 7.3 illustrates the use of general-purpose registers to compose and receive messages. Messages are composed in registers gr16 – gr19 and received in registers gr20 – gr23. Index register xp0 is used to send messages to message register mr2, and index register xp1 is used to receive messages from mr3.
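The sequencing can be pictured with a small C model. The arrays and loop indices below are illustrative stand-ins for the general-purpose registers, the index registers, and the message registers of Figure 7.3; none of the names correspond to actual Elm instructions, and the full-empty synchronization on each transfer is elided.

    #include <stdint.h>

    enum { MESSAGE_WORDS = 4 };

    static uint32_t gr[32];   /* general-purpose registers (model)                 */
    static uint32_t mr[8];    /* message registers (model; full-empty bits elided) */

    /* Compose a message in gr16..gr19 and stream it out through mr2. The loop
     * counter plays the role of index register xp0 in Figure 7.3; in hardware
     * each write would wait until the previous word had been drained. */
    static void send_message(void) {
        for (int xp0 = 16; xp0 < 16 + MESSAGE_WORDS; xp0++) {
            mr[2] = gr[xp0];
        }
    }

    /* Extract a received message from mr3 into gr20..gr23, stepping the receive
     * index (xp1 in Figure 7.3) after each word. */
    static void receive_message(void) {
        for (int xp1 = 20; xp1 < 20 + MESSAGE_WORDS; xp1++) {
            gr[xp1] = mr[3];
        }
    }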

Parallel Finite Impulse Response Filter

The following difference equation describes the output y of an order (N − 1) filter in terms of its input x and filter coefficients h:

    y[n] = \sum_{i=0}^{N-1} h_i \, x[n-i]    (7.1)

To simplify the example, we assume that N, the number of filter coefficients, is a multiple of 4, the number of processors in an Ensemble. The computation is parallelized by expanding the convolution as follows.


100   fixed32 sum = 0;
101   for ( int i = 0; i < N; i++) {
102     sum += h[i]*x[n+N-i];
103   }
104   y[n] = sum;

200   fixed32 sum = 0; fixed32 psum[P];
201   for ( int p = 0; p < P; p++ ) {
202     psum[p] = 0;
203     for ( int i = p*N/P; i < (p+1)*N/P; i++) {
204       psum[p] += h[i]*x[n+N-i];
205     }
206   }
207   for ( int p = 0; p < P; p++ ) {
208     sum += psum[p];
209   }
210   y[n] = sum;

Figure 7.4 – Source Code Fragment for Parallelized fir. The serial code fragment appears first (lines 100 – 104); the parallel code fragment follows (lines 200 – 210). The data-parallel for loop at line 201 is mapped to P processors.

    \sum_{i=0}^{N-1} h_i \, x[N-i] = \sum_{i=0}^{N/4-1} h_i \, x[N-i]
                                   + \sum_{i=N/4}^{2N/4-1} h_i \, x[N-i]
                                   + \sum_{i=2N/4}^{3N/4-1} h_i \, x[N-i]
                                   + \sum_{i=3N/4}^{4N/4-1} h_i \, x[N-i]    (7.2)

The computation is distributed so that each thread computes one term of the expanded summation. This allows the filter coefficients and samples to be distributed across multiple processors without replicating any of the coefficients and samples. Figure 7.5 illustrates the decomposition and mapping of the coefficients and samples to the processors in an Ensemble. Processor EP0 receives the input sample stream in message register mr4. Processor EP3 computes the final output of the filter and forwards it through message register mr5. The implementation uses indexed registers to store the filter coefficients and samples. The samples are accessed through index register xp0 and the coefficients are accessed through index register xp1. The message registers are configured to allow samples to be forwarded through message register mr5 and received in message register mr4.

Figure 7.6 shows the assembly code fragment that implements the kernel of the finite impulse response filter. The code that configures the processors and loads the instruction registers is not shown, to simplify and focus the example. Basic block BB1 is executed by a processor corps comprising all four of the processors in the Ensemble. The instructions in basic block BB1 implement the multiplications and additions that form the kernel of the filter computation. The partial sums computed by the individual processors are sent to processor EP3 at line 116, where the multiply-and-accumulate operation writes message register mr3. Basic block BB2 is executed by all of the processors.



Figure 7.5 – Mapping of fir Data, Computation, and Communication to an Ensemble. The samples are accessed through index register xp0, and the filter coefficients are accessed through index register xp1. The message registers are configured to allow samples to be forwarded through message register mr5 and received in message register mr4. Processor EP0 accepts input samples through message register mr4, and processor EP3 forwards the output of the filter to message register mr5.

100   .begin-iblock @IB1
101   BB1: mac zr0 da0 xp0 xp1 , nop ;
102        mac zr0 da0 xp0 xp1 , nop ;
103        mac zr0 da0 xp0 xp1 , nop ;
104        mac zr0 da0 xp0 xp1 , nop ;
105        mac zr0 da0 xp0 xp1 , nop ;
106        mac zr0 da0 xp0 xp1 , nop ;
107        mac zr0 da0 xp0 xp1 , nop ;
108        mac zr0 da0 xp0 xp1 , nop ;
109        mac zr0 da0 xp0 xp1 , nop ;
110        mac zr0 da0 xp0 xp1 , nop ;
111        mac zr0 da0 xp0 xp1 , nop ;
112        mac zr0 da0 xp0 xp1 , nop ;
113        mac zr0 da0 xp0 xp1 , nop ;
114        mac zr0 da0 xp0 xp1 , nop ;
115        mac zr0 da0 xp0 xp1 , nop ;
116        mac mr3 da0 xp0 xp1 , nop ;
117        mov da0 zr0 , jmp.i [ir17] ;
118   .begin-iblock @IB2
119   BB2: mov mr5 xp0 , nop ;
120        mov xp0 mr4 , jmp.c [sir0] ( ep0:ep3 ) ;
121   .end-iblock @IB1, IB2
122   .begin-iblock @IB3
123   BB3: add dr0 mr0 mr1 , mov xp0 mr4 ;
124        add dr1 mr2 mr3 , mov zr1 xp0 ;
125        add mr5 dr0 dr1 , jmg.c [sir0] ( ep0:ep3 ) ;
126   .end-iblock @IB3

Figure 7.6 – Assembly Listing of Parallel fir Kernel.

The instructions in basic block BB2 forward samples between the processors. The oldest input sample is read through index register xp0 and written to message register mr5 at line 119. The transferred sample value is received in message register mr4 at line 120 and written to an indexed register through index register xp0.



Figure 7.7 – Mapping of Instruction Blocks to Instruction Registers.

Basic block BB3 is executed by processor EP3. The instructions in basic block BB3 compute the output of the filter by combining the partial sums sent from the processors. The partial sums are received in message registers mr0 – mr3. The filter output is computed at line 124 and forwarded to message register mr5. The jump as processor corps instructions at lines 120 and 126 transfer control back to basic block BB1.

Figure 7.7 shows the mapping of instruction blocks to instruction registers. The instruction blocks are mapped so that basic block BB1 resides in shared instruction registers sir0 – sir16. Basic block BB2 resides in instruction registers ir17 – ir18 in processors EP0 – EP2. Basic block BB3 resides in instruction registers ir17 – ir19 in processor EP3, and is aligned to basic block BB2. The alignment allows the jump as independent processors instruction at line 117 to correctly transfer control when the processors decouple and resume independent execution.

Figure 7.8 illustrates the instruction fetch and instruction issue schedules for each of the processors in the Ensemble. We can calculate the number of instruction pairs issued to the processors during the execution of one iteration of the loop as

    4 × 17 + 3 × 2 + 3 = 77.    (7.3)

And we can calculate the number of instructions fetched from instruction registers during the execution of one iteration of the loop as

    17 + 3 × 2 + 3 = 26.    (7.4)

From 7.3 and 7.4, we can readily calculate that executing basic block BB1 in a processor corps reduces the number of instructions fetched from instruction registers by more than 66%.
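The quoted figure follows directly from (7.3) and (7.4): without corps execution, each of the 77 issued instruction pairs would also have to be fetched from an instruction register file, so the reduction in instruction register fetches is

    1 - \frac{26}{77} = \frac{51}{77} \approx 0.66 ,

which is the "more than 66%" stated above.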

7.4 Allocation and Scheduling

When applications with both data-parallel regions and thread-parallel regions are compiled, the compiler produces code that contains both shared instruction blocks and private instruction blocks. Control may transfer
frequently between the various shared and private instruction blocks. For example, code that contains vectorizable loops can be mapped to multiple processors by distributing loop iterations across the processors in an Ensemble. To improve instruction locality and reduce instruction register pressure, the compiler places the instructions from the vectorizable loops into shared instruction blocks, and places the remaining instructions into private instruction blocks. Similarly, data-parallel kernels that have been parallelized or replicated across the processors in an Ensemble exhibit both data-parallelism and thread-parallelism, and the resulting code may contain multiple shared and private instruction blocks that are finely interwoven in execution.

Figure 7.8 – Parallel fir Kernel Instruction Fetch and Issue Schedule.

Figure 7.9 illustrates a basic scheme for allocating instruction registers in regions of code that contain both data-parallel and thread-parallel sections. The compiler partitions the instruction registers and uses partitions from each processor to form a shared instruction register pool. Registers in the shared pool are assigned to instruction blocks in data-parallel regions. The remaining instruction registers are considered private to each processor, and are assigned to instructions in thread-parallel regions, which are executed independently. The compiler allocates and schedules instruction registers using algorithms based on those described in Chapter 4, modified so that instructions in data-parallel blocks are assigned to instruction registers in the shared pool, which are managed cooperatively. Each processor is responsible for loading instruction blocks mapped to its private instruction register pool, and for loading shared instruction blocks mapped to its section of the shared pool.

Partitioning the instruction registers as illustrated in Figure 7.9 affords two significant benefits. First, it distributes the data-parallel instruction blocks across multiple processors. This allows multiple processors to load the data-parallel blocks in parallel, increasing the effective instruction register load bandwidth. Second, it provides a uniform private instruction register namespace. This allows the compiler to place thread-parallel instruction blocks at common addresses within each local private partition. As described subsequently, this simplifies code generation and improves performance.


Figure 7.9 – Shared Instruction Register Allocation. The compiler partitions the instruction register files and combines aligned partitions from each processor to form a shared instruction register pool. The remaining instruction registers are private to each processor, and are available for instructions that are executed independently. Instruction blocks that are targets of control flow instructions that cause the processors to decouple are aligned in the private instruction register partitions. The jump instructions that the compiler inserts to transfer control between shared instruction blocks residing in different processors are illustrated as arrows in the figure.

When the instruction registers are partitioned as shown in Figure 7.9, the mapping of local instruction register identifiers to shared instruction register identifiers results in the private instruction registers partitioning the pool of shared instruction registers into 4 sets. The mapping of local identifiers to shared identifiers is described in detail in Appendix C. The shared registers within each processor reside at contiguous addresses, but the sets of private instruction registers are mapped to addresses between the shared register sets. The logical organization of the shared registers is illustrated in Figure 7.9. When necessary, the compiler constructs a contiguous pool of shared registers using control flow instructions to stitch together the registers in different register files. This allows the compiler to reason about the shared instruction register pool as a single contiguous block of registers. When an instruction block spans multiple register files, the compiler inserts explicit control flow instructions to transfer control between the shared registers in different instruction register files.
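The flavor of the mapping can be conveyed with a sketch. The constants below are assumptions made for illustration only (four processors per Ensemble and 64 instruction registers per processor, which matches the sir0, sir64, sir128, sir192 spacing in Figure 7.7); the authoritative mapping is the one defined in Appendix C.

    #include <stdint.h>

    enum {
        PROCESSORS_PER_ENSEMBLE = 4,
        IR_PER_PROCESSOR        = 64   /* assumed; matches the sir spacing in Figure 7.7 */
    };

    typedef struct {
        uint8_t processor;     /* which instruction register file holds the register */
        uint8_t local_index;   /* index within that file                             */
    } ir_location;

    /* Decompose an extended (shared) register identifier into a processor and
     * a local index. Because each file contributes only its shared partition,
     * the private registers of each processor appear as gaps between the
     * per-processor blocks of shared registers, partitioning the shared pool
     * into four sets. */
    static ir_location locate_shared_register(uint16_t extended_id) {
        ir_location loc;
        loc.processor   = (uint8_t)(extended_id / IR_PER_PROCESSOR);
        loc.local_index = (uint8_t)(extended_id % IR_PER_PROCESSOR);
        return loc;
    }

    /* The compiler chooses the size of the shared partition; local indices at
     * or above it are private to the processor. */
    static int is_shared(uint8_t local_index, uint8_t shared_partition_size) {
        return local_index < shared_partition_size;
    }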

The compiler attempts to place instruction blocks so that those thread-parallel blocks that are successors of data-parallel blocks in the control flow graph are aligned at common locations in the private partitions. This allows control flow instructions that transfer control from data-parallel blocks to thread-parallel blocks to use a common destination register location to transfer control to blocks in the local private partitions. This allows the compiler to use control flow instructions that are resolved early in the pipeline to transfer control. When the compiler cannot align the destination blocks, the compiler inserts a register-indirect control transfer. Aligning the blocks improves performance and efficiency because control flow instructions that are resolved early in the pipeline impose smaller performance penalties.


Figure 7.10 – Normalized Latency of Parallelized Kernels. Independent instances of the dct and jpeg kernels are mapped to different processors. The normalized latency of the jpeg kernel improves somewhat because the additional instruction registers eliminate instruction loads. The fir kernel is partitioned and mapped to multiple processors. Its normalized latency decreases because the additional registers allow more of the data working sets to be captured in registers, eliminating a significant fraction of all load instructions.

7.5 Evaluation and Analysis

This section presents an evaluation of the Ensemble mechanisms. We analyze the performance and efficiency impacts of mapping the dct, fir, and jpeg kernels to multiple processors within an Ensemble. The dct and jpeg kernels are parallelized by mapping independent instances of the kernels to multiple processors. The fir kernel is parallelized by partitioning the computation across multiple processors. The additional registers that are available to the compiler when the fir kernel is mapped to multiple processors allow more of its data working sets to be captured in registers.

Latency

Figure 7.10 shows the normalized latency of the parallelized kernels. The normalized latency measures the average time required to complete one invocation of the kernel. The normalized latency reported is measured by repeatedly invoking the kernels and dividing the aggregate execution time by the number of invocations. This amortizes the initial overheads associated with dispatching a kernel, such as fetching and loading instructions, across many invocations, and better reflects the steady state kernel latency. To illustrate the effects of mapping the kernels to multiple processors, the latencies are shown relative to the single processor implementations.
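In symbols (the notation below is introduced here and does not appear in the original text), if T(P) denotes the aggregate execution time of K back-to-back invocations on a P-processor mapping, the quantity plotted in Figure 7.10 is

    L_{\mathrm{norm}}(P) = \frac{T(P)/K}{T(1)/K} = \frac{T(P)}{T(1)} .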

Throughput

Figure 7.11 shows the normalized throughput of the parallelized kernels. The normalized throughput is calculated by dividing the measured throughput by the number of processors. To illustrate the effects of mapping the kernels to multiple processors, the figure shows the normalized throughput relative to the single processor implementations. Mapping a kernel to multiple processors increases the computation and storage resources
that are assigned to the kernel, and the normalized throughput provides a measure of area efficiency by adjusting for the number of processors the computation is distributed across.

Figure 7.11 – Normalized Throughput of Parallelized Kernels. The normalized throughput adjusts for the number of processors to provide a measure of area efficiency.

Mapping the dct kernel to multiple processors does not affect area efficiency. The kernel iterations are applied in parallel and aggregate throughput increases, but the throughput per processor remains constant. Mapping the fir kernel to multiple processors improves efficiency by eliminating memory accesses. However, the four-processor configuration is marginally less area efficient because of the additional communication and synchronization instructions at the end of the kernel, the distribution of which leads to minor load imbalance.

The jpeg kernel exhibits some efficiency improvement because the additional instruction capacity provided by the shared instruction registers reduces the number of instructions that are loaded. The observed improvement is data dependent because the entropy encoding time depends on the data to be encoded and may vary significantly. This introduces load imbalance and the potential for poor processor utilization. Furthermore, the DC coefficient is encoded differentially across blocks, which introduces a data dependency that is carried across kernel iterations. We should note that many modern image compression standards provide coding modes that allow parts of an image to be compressed and decoded in parallel, typically by allowing the image to be partitioned into stripes or tiles that are encoded and decoded independently. The coded bit-stream is constructed by concatenating the compressed streams generated for the individual tiles or slices.

Instruction Efficiency

Figure 7.12 compares the aggregate energy consumed delivering instructions to the processors when the kernels are mapped to multiple processors. The aggregate energy includes the energy consumed fetching instructions from instruction registers and transferring instructions to instruction registers from the backing memory. The data shown in Figure 7.12 reflects the energy consumed by all of the processors during one kernel iteration, and does not account for differences in the amount of computation performed when kernels are mapped to multiple processors. For example, when mapped to one processor, the dct kernel computes a single transform during each iteration; when mapped to four processors, it computes four transforms during each iteration, and therefore the useful computation performed is four times greater.


Figure 7.12 – Aggregate Instruction Energy. The aggregate instruction energy measures the total energy consumed delivering instructions to the processors. The data have been normalized within each kernel to illustrate trends.


Figure 7.13 – Normalized Instruction Energy. The normalized instruction energy measures the ratio of the energy consumed delivering instructions to the number of processors, and provides a measure of energy efficiency.

Consequently, it is difficult to infer too much from Figure 7.12 about the efficiency of using processor corps to map kernels to multiple processors. However, the data shown in Figure 7.12 provides insights into how efficiency is improved and exposes limitations. The data for the dct kernel show that energy consumed distributing instructions to multiple processors is comparable to the energy consumed accessing instructions stored in local instruction registers. This illustrates how the effectiveness of using SIMD execution to amortize instruction supply energy is limited by the cost of distributing instructions, and how inexpensive local instruction stores reduce some of the efficiency advantages of SIMD execution. The data also illustrate that the significant improvement in the efficiency of the fir kernel results because fewer instructions are executed when the kernel is mapped to multiple processors; the reduction in the number of instructions executed is more than sufficient to compensate for the additional energy consumed distributing instructions to multiple processors. Figure 7.13 shows the normalized instruction energy, calculated by dividing the aggregate instruction energy by the amount of useful computation completed, which provides a direct measure of energy efficiency.

The dct kernel exposes limits to the improvement in efficiency that are achieved by distributing instructions to multiple processors. The aggregate instruction energy of the dct kernel is dominated by the fetching of instructions from instruction registers and the distributing of them among the processors comprising a
processor corps. Each processor has enough instruction registers to capture both kernels entirely, so the additional instruction register capacity provided by the shared instruction register pool does not reduce the number of instructions that are transferred to the instruction register files. Furthermore, the initial cost of loading instructions when the kernels are launched is negligible when amortized across many kernel iterations.

The increase in the aggregate instruction energy observed when the dct kernel is mapped to more processors exposes the expense of transferring instructions to remote processors. Interconnect capacitance increases with the physical distance instructions traverse, and the energy expended distributing instructions increases with the capacitance that is switched when instructions are delivered to remote processors. Thus, though the number of instruction register accesses remains constant, the loss of physical locality increases the cost of delivering instructions. As Figure 7.12 illustrates, even local communication can require a significant expenditure of energy, and improvements in energy efficiency are limited by the loss of physical locality. However, less energy is required to transfer instructions to remote processors than is required for additional processors to fetch instructions from their local instruction registers, and modest efficiency improvements are realized when the kernel is mapped to a processor corps, as Figure 7.13 shows.

The effectiveness of exchanging instruction locality for fewer instruction store accesses depends on the relative cost of distributing instructions and accessing local instructions. These costs can vary significantly across different architectures. For example, accessing an instruction cache consumes more energy than accessing an instruction register file. Consequently, processors that fetch instructions from instruction caches will benefit more from architectures that allow instructions to be issued to multiple processors.

Like the dct kernel, the aggregate instruction energy of the fir kernel is dominated by the fetching of instructions from instruction registers and the distributing of them among the processors comprising a processor corps. There are enough instruction registers to accommodate the kernel, and the initial cost of loading instructions when the kernel is launched is negligible when amortized across many kernel iterations. The normalized instruction energy for the fir kernel decreases because fewer instructions are executed. The additional indexed registers accommodate more of the working sets, and fewer memory operations are performed.

The normalized instruction energy for the jpeg kernel is more informative because the instruction registers lack sufficient capacity to capture instruction reuse across subsequent invocations of the kernel. The DCT transform component is performed in parallel, and the instructions in the shared instruction block are read once, amortizing the memory access across multiple processors. The Huffman encoding component is executed by the processors independently. The initial instruction block is loaded and distributed to all of the processors, after which the instruction loads diverge as the processors follow different control paths. In contrast with the dct kernel, which does not load instructions during each iteration, the jpeg kernel demonstrates the effectiveness of amortizing both instruction issue and instruction loads when code is executed by a processor corps.

7.6 Related Work

The INMOS transputer architecture [149] allows machines of different scales to be constructed as collections of many small, simple processing elements connected by point-to-point links. The architecture reflects a decentralized model of computation in which concurrent processes operate exclusively on local data and
communicate by passing messages on point-to-point channels. The transputers implement a stack-based load- store architecture that is capable of executing a small number of instructions. Like the HEP multiprocessor described subsequently, each processing element can execute multiple processes concurrently. Channels that connect processes executing on the same transputer are implemented as single words in memory. Channels that connect processes executing on different transputers are implemented by point-to-point links. In this respect, the programming model mirrors the physical machine design. The restriction that all communication be mapped to point-to-point channels complicates the task of mapping applications to a transputer. Early transputer implementations used a synchronous message passing protocol that required that both the sending process and receiving process are ready to communicate before a message can be transferred, which limits the machine’s ability to overlap computation and communication. The HEP multiprocessor system uses multithreading to hide memory latency and allows cooperating concurrent threads to communicate through shared registers and memory [134]. The system supports fine- grained communication and synchronization by associating an access state with every register memory and data memory location to record whether a word is full, empty, or reserved. Load instructions can be deferred until the addressed data word is full, and can atomically load the data and mark the location as empty. Store instructions can be deferred until the addressed word is empty, and can atomically store the data to the addressed memory location and mark the location as full. An instruction that accesses register memory can be deferred until all of its operand register locations are marked full and its destination register location is marked empty; when the instruction executes, it atomically marks its operand registers as empty and its destination register as full. To enforce atomicity, the destination register location is temporarily marked as reserved during the execution of the instruction, and other instructions may not execute if any of their registers are marked reserved. Multiprocessor data flow architectures use similar full-empty bit mechanisms to detect when the operands to an operation have been computed [113], which is fundamental to implementing the underlying model of computation on data flow machines [9]. The MIT Alewife machine implements full-empty bit synchronization using a similar scheme to improve support for fine-grain parallel computation and communication through shared memory [4]. The architecture associates a synchronization bit with each memory word. The synchronization bits are accessed through distinguished load and store operations that perform a test-and-set operation in parallel with the memory access. The architecture allows memory operations to signal a synchronization exception when the test-and- set operation fails [3]. The original state of the synchronization bit is returned in the coprocessor status word when the memory operation completes, and special branch instructions are implemented to allow software to examine the synchronization bit. The MIT Raw microprocessor [137] exposes a scalar operand network [138] that allows software to ex- plicitly route operands between distributed function units and register files. 
The distributed function units and register files in Raw are organized as scalar processors, and Raw’s scalar operand network effectively allows software to compose and receive messages in registers to reduce the latency of sending and receiving messages comprising single scalar operands. This allows instructions from a single thread to be distributed across multiple execution units. Scalar operand networks provide a general abstraction for exposing inter- connect to software, with the intention that exposing interconnect delays and allowing software to explicitly
manage interconnect resources will allow software to schedule operations to hide communication latencies between distributed function units and register files. Vector-thread architectures allow independent threads to execute on the vector lanes [88]. This allows data parallel codes that are not vectorizable to be mapped efficiently to the virtual processors of a vector-thread machine, and allows the vector lanes to be used to execute arbitrary independent threads. Vector lane thread- ing allows independent threads to execute on the lanes of a multi-lane vector processor [120]. This improves lane utilization during applications phases that exhibit insufficient data-level parallelism to fully utilize the parallel vector lanes. By decoupling hardware intended for single-instruction multiple-data execution, both vector-thread processors and vector processors that support vector-lane threading are able to use the same hardware to exploit both single-instruction multiple-data parallelism and thread-level parallelism. The vector-thread abstract machine model connects virtual processors using a unidirectional scalar operand ring network [88]. The network allows an instruction that executes on virtual processor n to transfer data di- rectly to an instruction that executes on virtual processor n + 1. Though communication between virtual processors is restricted, these cross virtual processor transfers allow loops with cross-iteration dependen- cies to be mapped onto the vector lanes. The execution cluster organization used in the Scale vector-thread processor uses queues to decouple the send and receive operations that execute on different clusters [89]. The Multiflow Trace/500 machine implements multiple function unit clusters and uses distributed local instruction caches [24]. Machines may be configured with two coupled VLIW processors, each with 14 function units, that execute a common instruction stream. The two processors appear to software as a single VLIW processor with twice the number of function units. The variable-instruction multiple-data (XIMD) architecture [151] extends conventional very-long instruc- tion word (VLIW) architectures [42] by associating independent instruction sequencers with each function unit, which allows a XIMD processor to execute multiple instruction streams in parallel. Distributed con- dition code bits and synchronization bits are used to control the instruction sequencers. The sequencers and distributed synchronization bits allow a XIMD machine to implement barrier synchronization opera- tions that support medium-grain parallelism. The independent sequences and mechanisms that support inter- function unit synchronization allow for some control flow dependencies to be resolved dynamically. This allows a XIMD machine to emulate a tightly-coupled multiple-instruction multiple-data (MIMD) machine. As with traditional VLIW architectures, function units communicate through a global register file. Like VLIW machines, XIMD machines require sophisticated compilers to identify and exploit instruction-level parallelism. In addition, many of the architectural mechanisms that distinguish XIMD machines require so- phisticated compiler technology to distribute and schedule instructions from multiple loop iterations across the distributed function units [111]. The distributed VLIW (DVLIW) architecture [155] extends a clustered VLIW architecture with dis- tributed instruction fetch and decode logic. 
Each cluster has its own program counter and local distributed instruction cache, which enables distributed instruction sequencing. Operations that execute on different clus- ters are stored in the distributed instruction caches, similar to how instructions are stored in early horizontally microcoded machines [142]. Later extensions proposed coupling processors to support both instruction-level and thread-level parallelism [156]. The resulting architecture is similar to the MIT Raw microprocessor,
except that it uses a few large VLIW cores rather than many small RISC cores. As with Raw, cores are con- nected by a scalar operand network. The network is accessed through dedicated communication units that are exposed to software. The instruction set supports both direct scalar operand puts and gets between adjacent cores, and decoupled sends and receives between all cores. Data transferred using sends and receives traverse inter-cluster decoupling queues. The compiler statically schedules instructions to exploit instruction-level parallelism within a core. A single thread may be scheduled across multiple cores using gets and puts to synthesize inter-cluster moves in a traditional clustered VLIW machine, or multiple threads may be sched- uled across the cores. The architecture requires a memory system that supports transactional coherence and consistency [62] to dynamically resolve memory dependencies, which significantly complicates the design and implementation of the memory system.

7.7 Chapter Summary

Ensembles allow software to use collections of coupled processors to exploit data-level and task-level parallelism efficiently. The Ensemble organization supports the concept of processor corps, collections of processors that execute a common instruction stream. This allows processors to pool their instruction registers, and thereby increase the aggregate register capacity that is available to the corps. The instruction register organization improves the effectiveness of the processor corps concept by allowing software to control the placement of shared and private instructions throughout the Ensemble.

Chapter 8

The Elm Memory System

This chapter introduces the Elm memory system. Elm exposes a hierarchical and distributed on-chip memory organization to software, and allows software to manage data placement and orchestrate communication. Additional details on the instruction set architecture appear in Appendix C. The chapter is organized as follows. I first explain general concepts and present the insights that informed the design of the memory system. I then describe aspects of the microarchitecture in detail. This is followed by an example and an evaluation of several of the mechanisms implemented in the memory system. Finally, I consider related work and conclude.

8.1 Concepts

Elm is designed to support many concurrent threads executing across an extensible fabric of processors. To improve performance and efficiency, Elm implements a hierarchical and distributed on-chip memory orga- nization in which the local Ensemble memories are backed by a collection of distributed memory tiles. The architectural organization of the Elm memory system is illustrated in Figure 8.1. The organization allows the memory system to support a large number of concurrent memory operations. Elm exposes the memory hierarchy to software, and lets software control the placement and orchestrate the communication of data explicitly. This allows software to exploit the abundant instruction and data reuse and locality present in em- bedded applications. Elm lets software transfer data directly between Ensembles using Ensemble memories at the sender and receiver to stage data. This allows software to be mapped to the system so that the majority of memory references are satisfied by the local Ensemble memories, which keeps a significant fraction of the memory bandwidth local to the Ensembles. Elm provides various mechanisms that assist software in manag- ing reuse and locality, and assist in scheduling and orchestrating communication. This also allows software to improve performance and efficiency by scheduling explicit data and instruction movement in parallel with computation.


Figure 8.1 – Elm Memory Hierarchy. The memory hierarchy is composed of Ensemble memories, distributed memories, and off-chip system memory. The system memory is accessed through memory controllers (MCTRL). The Ensemble memories are used to stage instruction and data transfers to Ensemble processors, and to capture small working sets locally. The distributed memories are used to capture larger instruction and data working sets, to stage instruction and data transfers up and down the memory hierarchy, and to stage data transfers between Ensembles.


Parallelism and Locality

Elm provides a global shared address space and implements a distributed shared memory system. Applica- tions are partitioned and mapped to threads that execute concurrently on different processors. Communication through shared memory is implicit, and the global shared address space supports dynamic communication patterns, where matching two-sided messages would otherwise be inconvenient. The shared address space allows software to create explicit shared structures that are accessed through pointers. As illustrated in Fig- ure 8.2, the mapping of addresses onto physical memory locations exposes the hierarchical and distributed memory organization. The partitioning of the address space allows software to control the placement of in- structions and data explicitly. It also provides software with a notion of proximity that allows it to exploit locality by mapping instructions and data to memory locations that are close to the processors that access them. This allows software to exploit locality by placing instructions and data close to the processors that require them, and to schedule computation close to resident data. Elm allows software to orchestrate the hor- izontal and vertical movement of instructions and data throughout the memory system. The memory system provides abstract block and stream operations to optimize explicit bulk communication.

The distributed and hierarchical memory organization allows a significant fraction of memory references to be mapped to local memories. This improves efficiency by reducing demand for expensive remote memory and global interconnection network bandwidth. The distributed organization allows the memory system to handle many concurrent access streams, and allows software to keep shared instructions and data sets close to collections of collaborating processors.

The Ensemble memories are used to extend the capacity of the local register files and to stage data and instruction transfers to the processors within an Ensemble. The compiler typically reserves part of each Ensemble memory to extend the capacity of the local instruction and data registers. Essentially, the compiler treats these locations as additional register locations when allocating instruction and data registers, and uses them to capture instruction and private data that is temporarily spilled to memory. Shared regions of an Ensemble memory can be used to transfer data directly between processors without moving the data through distributed and system memory.

The distributed memory tiles provide memory capacity to decouple instruction and data transfers, to cap- ture large instruction and data working sets, and to capture shared working sets. The memories are physically distributed across the chip to allow instructions and data to be captured close to the processors that access them. This also allows communication to be mapped to memories that are physically close to the communi- cating processors to reduce the energy expended transporting the data, and to reduce the demand for global interconnection network bandwidth. Software uses the distributed memories to stage data and instruction transfers between system memory and processors. Shared instructions and data can be transferred to a dis- tributed memory before being forwarded to multiple Ensembles. This reduces off-chip memory accesses, which improves energy efficiency, and reduces demand for system memory bandwidth.


0x80000000 – 0xFFFFFFFF    System Memory Region             8,096 MB    (System Memory)
0x40000000 – 0x7FFFFFFF    Distributed Memory Region        2,048 MB    (Distributed Memory)
0x00010000 – 0x3FFFFFFF    Remote Ensemble Memory Region    1,024 MB    (Remote Ensemble, [contiguous] and [interleaved])
0x00008000 – 0x0000FFFF    Local Ensemble Memory Region     128 KB      (Local Ensemble [interleaved])
0x00000000 – 0x00007FFF    Local Ensemble Memory Region     128 KB      (Local Ensemble [contiguous])

Figure 8.2 – Elm Address Space. The capacity shown for each region reflects the partitioning of the address space rather than the capacity of the physical memories. To simplify the construction of position-independent code, the local Ensemble memory is always mapped at a fixed position in the address space.
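The region boundaries shown in Figure 8.2 suggest a simple address decode, sketched below. The boundary constants are taken from the figure; the enumeration names are introduced here for illustration and are not part of the Elm architecture.

    #include <stdint.h>

    typedef enum {
        REGION_LOCAL_ENSEMBLE_CONTIGUOUS,   /* 0x00000000 - 0x00007FFF */
        REGION_LOCAL_ENSEMBLE_INTERLEAVED,  /* 0x00008000 - 0x0000FFFF */
        REGION_REMOTE_ENSEMBLE,             /* 0x00010000 - 0x3FFFFFFF */
        REGION_DISTRIBUTED_MEMORY,          /* 0x40000000 - 0x7FFFFFFF */
        REGION_SYSTEM_MEMORY                /* 0x80000000 - 0xFFFFFFFF */
    } memory_region;

    /* Sketch of the decode implied by the partitioning in Figure 8.2. */
    static memory_region decode_address(uint32_t addr) {
        if (addr < 0x00008000u) return REGION_LOCAL_ENSEMBLE_CONTIGUOUS;
        if (addr < 0x00010000u) return REGION_LOCAL_ENSEMBLE_INTERLEAVED;
        if (addr < 0x40000000u) return REGION_REMOTE_ENSEMBLE;
        if (addr < 0x80000000u) return REGION_DISTRIBUTED_MEMORY;
        return REGION_SYSTEM_MEMORY;
    }

In hardware this decode is of course folded into the memory interface rather than performed in software; the sketch only makes the mapping of address ranges to levels of the hierarchy concrete.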

Distributed Memory and Caching

Distributed memory provides additional shared storage capacity close to the Ensembles. Distributed memory serves three important purposes: it is used to stage instruction transfers from system memory to Ensemble memory and instruction registers; it is used to stage data transfers between system memory and Ensemble memory; and it is used to stage data transfers between Ensembles. The additional storage capacity provided by distributed memory allows software to use techniques such as prefetching to tolerate long memory access times. The distributed memory tiles are nominally software-managed memory. However, Elm allows distributed memory regions to operate as hardware-managed caches. The caches may be used to filter references to
system, distributed, and remote Ensemble memory. Software-managed distributed memory allows software to coordinate reuse and locality, and to orchestrate the movement of data through the memory hierarchy explicitly. This improves efficiency and predictability when mapping software that is amenable to static analysis. Caches assist software in managing reuse and locality when access patterns and data placement are difficult or impossible to determine using static analyses. The caches can also be used to filter references to shared objects that are mapped to system memory, which keeps more memory bandwidth closer to the processors. This improves efficiency by keeping more communication close to the processors and reducing accesses to more expensive system and remote memory. Caches soften the process of mapping applications to Elm. The caches exploit locality using conventional load and store memory operations. They also provide a reasonable way to capture instruction blocks close to processors without the complexity of using Ensemble memory to store both instructions and data. Elm does not implement hardware cache coherence. Instead, software may be constructed to ensure that any cached copies of shared data that may be updated are cached at a common location on the chip, or may explicitly invalidate cached copies of shared data. Elm provides efficient synchronization mechanisms that software may use to coordinate the invalidating of cached copies of shared data.

Software-Constructed Cache Hierarchies

The caches in Elm are software-allocated and hardware-managed. Software designates parts of the distributed memory tiles to operate as caches, and determines which regions of memory are accessed through the caches. Elm allows software to control the mappings from cached addresses to the distributed memory tiles that cache the addresses. This allows software to construct cache hierarchies, which affords software significant control over where instructions and data are cached, and when multiple cached copies of instructions and data may exist. The flexibility lets software map overlapping working sets to shared caches, which improves cache utilization and reduces memory bandwidth demand. Elm associates a small cached address translation table with each processor and each distributed memory tile to record the mappings from cached addresses to cache controllers. Each entry describes a contiguous region of the address space and associates a distributed memory tile or system memory controller with the region. The mapping may restrict the ways of the cache that are used to cache the data, which effectively allows software to partition cache resources within a memory tile. When a memory controller handles a cached memory operation, it uses its local cached address translation tables to determine where the operation should be forwarded to when a request cannot be handled locally. The tables are small, and typically provide 4 entries. The cached address translation tables provide a simple mechanism that allows software to partition cache capacity and to isolate access streams. For example, software may use one distributed memory tile to cache addresses that contain instructions, and a different memory tile to cache addresses that contain data. This isolates the instruction and data streams and keeps them from displacing each other in the caches. To improve locality, software may configure the translation tables so that distant processors use local tiles to distribute cached copies of shared instructions and read-only data. Similarly, to distribute memory references and


Figure 8.3 – Flexible Cache Hierarchy. The cached address translation table (CATT) associated with each processor allows software to control which distributed memory tiles are used to cache the instructions and data accessed by different processors. This allows software to capture overlapping working sets in a shared cache, and to distribute private data across multiple caches.
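As a concrete illustration of how these mappings might be expressed, the sketch below encodes the information a cached address translation table entry carries. The catt_entry structure, its field names, the catt_write helper, and the addresses are hypothetical stand-ins rather than the actual Elm hardware format or toolchain API; only the information content of an entry (a contiguous region, the tile or controller that caches it, and an optional way restriction) follows the description above.

/* Hypothetical encoding of a cached address translation table (CATT) entry;
 * the real hardware format is not specified here. */
typedef struct {
    unsigned base;      /* start of the contiguous cached address region       */
    unsigned limit;     /* end of the region                                   */
    unsigned home;      /* distributed memory tile or system memory controller
                           that caches addresses in the region                 */
    unsigned way_mask;  /* restricts which ways of the cache may hold the data */
} catt_entry;

/* Hypothetical helper that installs an entry in one of the (typically four)
 * CATT slots associated with a processor or distributed memory tile. */
void catt_write(int slot, const catt_entry *entry);

/* Example: cache instruction addresses in tile 2 and private data in tile 3,
 * so the instruction and data streams cannot displace each other. */
void configure_caches(void)
{
    catt_entry code = { 0x00100000u, 0x0010FFFFu, /* tile */ 2, /* ways */ 0xFu };
    catt_entry data = { 0x00200000u, 0x0020FFFFu, /* tile */ 3, /* ways */ 0xFu };
    catt_write(0, &code);
    catt_write(1, &data);
}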

Decoupled Memory Operations

Memory operations that access remote distributed and system memory may require tens to hundreds of cycles to complete. In certain cases, data and control dependencies in an application require that a processor wait for a memory operation to complete before executing subsequent instructions. However, it is often possible to overlap memory operations with independent computation. Elm processors complete conventional load and store operations in order. After injecting the operation into the memory system, the processor waits for the memory system to indicate that it has completed the operation before executing subsequent instructions. Consequently, long memory access times cause performance to degrade when software accesses memory beyond the local Ensemble. Elm provides decoupled load and store operations that allow a processor to continue to execute independent instructions while the memory system handles memory operations. Decoupling the initiation and completion of memory operations improves performance when accessing remote memory by allowing software to overlap independent computation with the memory access time. It also allows software to better tolerate unpredictable memory access times, for example when competing for access to the Ensemble memory. Decoupled memory operations let software exploit the available parallelism and concurrency in the memory system.


Decoupled loads are injected into the memory system when the load operation executes, and the destination register is updated when the memory system returns the load data. Hardware tracks pending memory operations, and the pipeline interlock logic stalls the pipeline when it detects a read-after-write hazard involving a pending load operation. A processor may have multiple decoupled loads pending. Decoupled stores are injected into the memory system when the store operation executes, and subsequent memory operations are allowed to proceed before the processor receives a message indicating the store has been committed. The Elm memory system does not provide strong memory ordering, and decoupled memory operations may complete in a different order from that in which they are issued by a processor. However, conventional load and store operations are guaranteed to complete in program order because at most one may be in flight, and consequently may be used when software requires strong memory ordering. The instruction set provides an instruction that causes execution to suspend until all pending memory operations have completed, which effectively implements a memory fence.
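A minimal sketch of the intended usage, assuming hypothetical load_decoupled, store_decoupled, and memory_fence intrinsics (the actual Elm mnemonics and calling conventions are not given in this chapter): the loads are injected early, independent work can be scheduled before their values are consumed, and the fence provides the ordering point.

/* Hypothetical intrinsics; the real instruction names and signatures may differ. */
int  load_decoupled(const int *addr);    /* inject the load into the memory system  */
void store_decoupled(int *addr, int v);  /* inject the store; do not wait for it    */
void memory_fence(void);                 /* suspend until pending operations finish */

void scale_remote(int *remote, int n, int k)
{
    for (int i = 0; i + 1 < n; i += 2) {
        int a = load_decoupled(&remote[i]);      /* both loads are in flight        */
        int b = load_decoupled(&remote[i + 1]);  /* before either value is needed   */
        /* independent computation could be placed here to hide the access latency */
        store_decoupled(&remote[i],     a * k);  /* interlock stalls only if 'a'    */
        store_decoupled(&remote[i + 1], b * k);  /* has not yet been returned       */
    }
    memory_fence();  /* ensure all decoupled stores have committed before returning */
}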

Stream Memory Operations

Stream memory operations provide an efficient mechanism for moving blocks of instructions and data through the memory system. Stream loads and stores operate on regular sequences of memory blocks. Elm uses stream descriptors to describe the composition of streams. Stream descriptors effectively provide a compact representation of the addresses comprising a stream. Stream memory operations support non-unit stride memory accesses, and allow sequences to be gathered and scattered. Examples of the different addressing modes are illustrated in Figure 8.4 through Figure 8.7. Streams often correspond to sequences of records in applications, and stream memory operations provide an efficient mechanism for software to assemble and distribute collections of records. Stream operations allow hardware to efficiently convert data between array-of-structures and structure-of-arrays representations when moving data through the memory system. Stream controllers distributed throughout the memory system perform stream memory operations. The stream controllers contain small collections of stream registers, which are used to store active stream descriptors. Each stream register contains a stream descriptor register and a stream state register. The stream state register records the state of a stream memory transfer. Typically, software loads a stream descriptor into a stream register and then issues a sequence of stream memory operations that reference the stream descriptor. Each stream memory operation specifies how many stream elements should be transferred, which may differ from the number of elements in the stream. Decoupling the specification and transfer of the stream descriptor from the stream memory operations provides several advantages. It allows large transfers to be decomposed into multiple smaller transfers without imposing the overhead of repeatedly transferring the stream descriptor to the stream engine. It also allows multiple clients to share a stream descriptor and use hardware to serialize and interleave concurrent accesses. Hardware completes one stream memory transaction at a time, and therefore serializes stream memory operations. For example, a stream descriptor may be configured to describe the elements of a circular buffer, and stream memory operations may be used to append and remove blocks from a queue implemented using the circular buffer. Additional coordination variables are needed to coordinate append and remove operations to ensure enough elements are in the buffer, but concurrent append and remove operations can be implemented using stream memory operations.
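As an illustration of this decoupling, the sketch below decomposes one large gather into several smaller stream loads that all reference the same stream descriptor. The stream_descriptor type and stream_sync call mirror the ones that appear later in Figure 8.27; stream_load_n, which additionally names how many records to transfer, and the sizes and stride are hypothetical stand-ins rather than the actual Elm calling convention.

/* Hypothetical stream call that also specifies how many records to transfer. */
typedef struct stream_descriptor stream_descriptor;
void stream_load_n(void *packed_dst, const void *offset,
                   const stream_descriptor *sd, int nrecords);
void stream_sync(void);

enum { RECORDS_PER_OP = 512, NUM_OPS = 8, STRIDE = 4 };  /* illustrative sizes */

/* Gather 4096 strided records in 8 pieces; the descriptor is installed once,
 * and each operation only states how many records to move and where the next
 * block of packed records should be placed. */
void gather_in_pieces(int *dst, const int *src, const stream_descriptor *sd)
{
    for (int op = 0; op < NUM_OPS; op++) {
        stream_load_n(dst + op * RECORDS_PER_OP,           /* packed destination */
                      src + op * RECORDS_PER_OP * STRIDE,  /* advancing offset   */
                      sd, RECORDS_PER_OP);
    }
    stream_sync();  /* wait for all pieces before the data is consumed */
}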


Figure 8.4 – Stream Load using Non-Unit Stride Addressing. The record addresses are computed by adding multiples of the stride to the offset address specified for the stream operation. The record size, stride, and number of records in the stream are encoded in a stream descriptor; the offset address, source addresses, and number of records transferred are specified when the operation is initiated.


Figure 8.5 – Stream Store using Non-Unit Stride Addressing. The record addresses are computed by adding multiples of the stride to the offset address specified for the stream operation. The record size, stride, and number of records in the stream are encoded in a stream descriptor; the offset address, source addresses, and number of records transferred are specified when the operation is initiated.


Stream memory operations may use memory tiles as caches. Because stream memory operations are decoupled from execution, caching is used to filter references to off-chip memory and reduce remote memory bandwidth demand rather than to improve latency.


Figure 8.6 – Stream Load using Indexed Addressing. The record addresses are computed by adding offsets specified by an index vector to the offset address specified for the stream operation. The record size, address of the index vector, and number of records in the stream are encoded in a stream descriptor; the offset address, source addresses, and number of records transferred are specified when the operation is initiated.


Figure 8.7 – Stream Store using Indexed Addressing. The record addresses are computed by adding offsets specified by an index vector to the offset address specified for the stream operation. The record size, address of the index vector, and number of records in the stream are encoded in a stream descriptor; the offset address, source addresses, and number of records transferred are specified when the operation is initiated.


Figure 8.8 – Local Ensemble Memory Interleaving. Memory operations that transfer data between a contiguous region of memory and an interleaved region of memory can be used to implement conversions between structure-of-array and array-of-structure data object layouts.

Memory Interleaving

Elm maps each physical Ensemble and distributed memory location into the global address space twice: once with the memory banks contiguous, and once with the memory banks interleaved. In the contiguous address range, the capacity of the Ensemble memories determines which address bits are used to map addresses to memory banks. In the local interleaved Ensemble address range, the least significant address bits select the bank. The difference is illustrated in Figure 8.8. The contiguous and interleaved address ranges map to the same physical memory, so each physical memory word is addressed by two different local addresses. The contiguous Ensemble addressing scheme is useful when independent threads are mapped to an Ensemble. A thread may be assigned one of the local memory banks, and any data it accesses may be stored in its assigned memory bank. This avoids bank conflicts when threads within an Ensemble attempt to access the local memory concurrently, which improves performance and allows for more predictable execution. Because consecutive addresses map to the same memory bank, all of the data accessed by the thread resides within a contiguous address range, and no special addressing is required to avoid bank conflicts.


The interleaved Ensemble addressing scheme is useful when collaborative threads are mapped to an Ensemble. Adjacent word addresses are mapped to different banks, and accesses to consecutive addresses are distributed across the banks. This allows data transfers between local Ensemble memory and remote memory to use consecutive addresses while distributing the data across multiple banks, which improves bandwidth utilization when the memories are accessed by the processors within the Ensemble. The mapping of addresses to Ensemble memory banks in the remote Ensemble address range can be configured by software. This allows software to control the granularity of bank interleaving. Configuration registers in the memory controllers let software configure the logical interleaving of memory banks by selecting which address bits determine the mapping of addresses to banks. Stream memory operations can be used when software requires greater control over the mappings of local addresses to remote addresses, and when different accesses prefer different mappings.
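The two mappings can be summarized by how they select a bank for a word address. The sketch below assumes four banks of 0x1000 words, consistent with Figure 8.8; the exact address decoding used by Elm is not spelled out here, so treat the arithmetic as illustrative.

enum { NUM_BANKS = 4, BANK_WORDS = 0x1000 };  /* example geometry from Figure 8.8 */

/* Contiguous mapping: higher-order address bits select the bank, so consecutive
 * word addresses stay in the same bank. */
unsigned bank_contiguous(unsigned word_addr)
{
    return (word_addr / BANK_WORDS) % NUM_BANKS;
}

/* Interleaved mapping: the least significant word-address bits select the bank,
 * so consecutive word addresses rotate across the banks. */
unsigned bank_interleaved(unsigned word_addr)
{
    return word_addr % NUM_BANKS;
}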

Coordination and Synchronization

An Elm application is typically partitioned into multiple kernels that are mapped onto threads executing across multiple processors. In addition to communicating through shared memory locations, these threads may communicate by passing messages through the local communication fabric. Regular local communication patterns can be effectively mapped to the local communication fabric, which provides an efficient communication substrate that combines communication and synchronization. Shared memory provides a more general communication substrate, but requires explicit synchronization. Because the local communication fabric provides efficient support for fine-grain communication and synchronization, the mechanisms implemented to support communication through shared memory are not optimized specifically for efficient fine-grain communication. Instead, they are designed to support a large number of concurrent and potentially independent communication and synchronization flows. Shared memory that is not accessed through caches can always be used to implement coordination variables [63]. Each memory controller provides a synchronization point that establishes an order on memory operations that access the shared memory managed by the controller. Coordination objects may be mapped to memory locations that are close to the processors that access them to improve locality and reduce memory access latencies. Shared memory that is accessed through a cache may be used to implement coordination variables provided that all accesses to a shared data object access the data through a shared cache hierarchy. This restriction ensures that all processors observe the current state of the shared objects. Coordination objects located in system memory may be accessed through a cache to reduce memory access times. Because the Elm memory system does not enforce cache coherence in hardware, software must ensure that copies of coordination objects are not cached in multiple locations. Software may explicitly force the release of cached data, which allows copies of shared data that may reside in caches to be returned to memory at the end of specific application phases.


Elm provides hardware synchronization primitives that let software efficiently implement more sophisticated coordination objects and concurrent data structures. Elm provides an atomic compare-and-swap instruction and an atomic fetch-and-add instruction, which are two relatively common atomic read-modify-write memory operations [90]. These operations may be performed on cached shared memory locations provided that all accesses to the shared objects access a shared cache hierarchy; this ensures that all accesses see the current state of the shared objects. The atomic compare-and-swap operation accepts three parameters: a memory address, an expected value, and an update value. The operation returns the current value stored at the memory location referenced by the address parameter. If the current value matches the expected value, the update value is stored to the memory location. The atomic compare-and-swap operation is provided to support arbitrary parallel algorithms. Memory locations that implement atomic compare-and-swap operations have an infinite consensus number [61], and a wait-free implementation of any concurrent data object can be constructed using compare-and-swap operations. The atomic fetch-and-add operation has a consensus number of exactly two [2], which limits its efficacy as a means of constructing general wait-free concurrent data objects [63]. However, it can be used to construct efficient implementations of many common and important operations on shared data objects found in embedded applications. For example, it allows multiple threads to concurrently insert and remove data objects from shared queues, and it allows updates such as those used to compute histograms to proceed concurrently. The atomic fetch-and-add instruction also accepts three parameters: a memory address, an update value, and a sequence value. The operation returns the current value stored at the memory location referenced by the address parameter. The value stored in the memory location is replaced with the sum of the current value and the update value, except when the sum is equal to the sequence value, in which case the memory location is cleared to zero. The sequence value simplifies the implementation of queues and circular buffers using atomic fetch-and-add operations, as pointers into arrays can be atomically updated and wrapped to the bounds of the arrays. The hardware that performs atomic memory operations is integrated with the memory controllers that are distributed throughout the memory system. Effectively, each of the distributed atomic memory operation units is assigned a region of memory, and is responsible for performing all atomic memory operations on memory locations within its assigned region. Distributing the hardware that performs atomic memory operations allows communication and synchronization to be localized, and supports more concurrency and memory-level parallelism. This distributed organization provides several advantages. Most importantly, it ensures that for every memory location, there is a single serialization point for all atomic memory operations that are performed on that location. This is necessary for correctness, as there must be a linear serialization of atomic memory operations on a particular memory location for the operations to be of practical use.
Distributing the atomic memory operation units allows them to be placed close to the memory locations they operate on, which improves performance by increasing locality and reducing communication bandwidth demand. Rather than fetching the value stored at a memory location to a processor and updating the value there, the update specified by an atomic memory operation is performed in the memory system close to the memory location referenced by the operation. This eliminates at least one round trip between the memory location and the processor. Finally, the distributed organization scales well. The number of concurrent atomic memory operations the memory system can support scales with the number of concurrent memory transactions the system can support, which allows software to exploit extensive concurrency and parallelism within the memory system. Because memory is distributed with the processors, the distributed organization allows software to map threads and shared data objects so that synchronization message traffic remains local to the communicating processors.

Atomic memory operations may be performed on cached data provided that the data is not cached in multiple locations and the data is consistently accessed through the cache. When atomic memory operations operate on cached data, the operations are performed at a cache controller, and operate on a cached local copy of the data. The cache controller provides the synchronization point that orders concurrent atomic memory operations.
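The role of the sequence value is easiest to see as the read-modify-write that the owning memory controller performs atomically at the addressed location. The sketch below is a behavioral model of the semantics described in this section, not a description of the hardware; claim_slot shows the circular-buffer use that motivates the wrap-around.

/* Behavioral model of Elm's atomic fetch-and-add; the owning memory controller
 * performs this read-modify-write atomically at the addressed location. */
int fetch_and_add(int *addr, int addend, int sequence)
{
    int old = *addr;
    int sum = old + addend;
    *addr = (sum == sequence) ? 0 : sum;  /* wrap to zero at the sequence limit  */
    return old;                           /* the response carries the old value  */
}

/* Usage sketch: claim the next slot of an N-entry circular buffer. Each caller
 * receives a distinct index, and the shared index wraps automatically at N. */
int claim_slot(int *shared_index, int N)
{
    return fetch_and_add(shared_index, 1, N);
}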

Initiating Computation

Elm uses a simple messaging mechanism to allow computation to be dispatched to a processor. The messages effectively encode a load-and-jump operation that is issued when the message is received. Two message types are provided, which differ in priority. The high priority message interrupts the processor and effectively terminates any task that happens to be executing on the processor. The low priority message is only dispatched if the destination processor is idle when the message arrives. If the processor is not idle, the message is discarded and a negative acknowledgment is returned to the sender. These messages are used to dispatch and distribute computation to processors. They allow Elm to implement a very limited form of active messaging as a mechanism for initiating computation at a processor [32, 146]. However, the lack of support for multiple execution contexts and priorities within a processor severely limits their general usefulness [34, 91]. Certain aspects of the Elm architecture require that software be explicitly constructed to allow the interrupted computation to be resumed. Consider the problem of resuming the interrupted thread when its instructions are displaced. The compiler and runtime system would need to enforce a convention for indicating what instruction register context should be restored when the dispatched instructions complete. Software could reserve some instruction registers for message handlers, though this convention would reduce the number of instruction registers that are available. A more efficient approach would be to pass the handler the address of an instruction block that will restore the instruction registers after the handler completes. The handler would transfer control to the instruction block, which would be responsible for restoring the instruction registers and transferring control to the instruction at which the thread was interrupted.

8.2 Microarchitecture

This section describes how the concepts discussed previously are reflected in specific aspects of the Elm microarchitecture.


Figure 8.9 – Memory Tile Architecture. The memory is banked to allow multiple concurrent memory transactions to be serviced in parallel. The cache controller implements the cache protocols, and handles cached and atomic memory operations. The stream controller implements the stream transfer protocols and handles stream and block memory operations.

Memory Tile Architecture

Figure 8.9 illustrates the distributed memory tile architecture. The memory is banked to allow multiple concurrent memory transactions to be serviced in parallel. The banking allows dense and efficient single-port memory arrays to be used. The cache controller implements the cache protocols, and handles cached and atomic memory operations. The stream controller implements the stream transfer protocols and handles stream and block memory operations. The block buffers associated with the memory banks stage data transfers between the memory arrays and the switches. The block buffers decouple the scheduling of memory access operations from the allocation of switch bandwidth, which improves memory and switch bandwidth utilization. The block buffers also provide a simple mechanism for capturing spatial locality at the memory banks. They allow data read from a memory bank to be delivered to the switch over multiple cycles, and allow write data arriving from the switch to be buffered before the data is delivered to the memory bank.

We often want the memory read and write operations to access more bits than the switch will accept or deliver in one cycle. We often prefer SRAM arrays with relatively short bit-lines because the bit-line capacitance that must be discharged when the memory is accessed is reduced, which improves access times. Reducing the number of cells connected to a bit-line also improves read reliability in modern technologies by reducing bit-line leakage. To reduce the number of rows while keeping capacity constant, we must increase the number of columns in the array. Rather than increasing the column multiplexing factor, it is often more efficient to reduce the number of times the bit-lines are cycled by capturing more of the data on the discharged bit-lines, temporarily buffering it in the block buffers, and transferring it to the switch when requested. To further improve the area efficiency of the switch, the switch data-paths could be designed to operate at a faster clock rate than the control-paths; for example, the switch could transfer data on both the rising and falling edges of the clock.

Cache Controller

Figure 8.10 provides a more detailed illustration of the cache controller. The control logic block contains the control logic and finite state machines that implement the cache control protocol. It orchestrates the handling of cached memory operations, synthesizing the sequences of commands required to implement the operations. The configuration registers hold the cache configuration state, such as the address tables that specify the addresses of the memory controllers that are used to handle cache misses. The pending transaction registers hold the state used by the cache controller to track active cached memory transactions. The message scoreboard tracks pending memory operations, and notifies the control logic when messages that signal the completion of memory operations initiated by the cache module are handled. The atomic memory unit handles all atomic memory operations that arrive at the memory tile. The atomic memory unit is implemented in the cache controller to simplify the handling of cached atomic memory operations.

Rather than provide dedicated storage for cache tags and state information, the cache tags and associated block state bits are stored in the memory banks along with the block data. Distributed memory is expected to be used mostly as software-managed memory, and this organization avoids the expense of providing dedicated tag arrays. However, for applications that rely extensively on caching to improve locality, dedicated tag arrays are more efficient. The controller uses a small tag cache to capture recently accessed tags close to the control logic. The tag cache effectively exploits locality within cache blocks, and improves efficiency by filtering tag accesses that would otherwise access the memory banks. It also reduces cache access times when memory access streams exhibit spatial and short-term temporal locality by eliminating the bank access component of the tag access time.

The cache controller supports cache block and cache way locking. To avoid deadlock and simplify the design of the controller, one way of the cache cannot be locked. This ensures that the controller can always allocate a block when handling a cache miss. Cache block locking improves performance and efficiency when cached stream operations use indexed addressing modes. The indices are stored locally, as this reduces the index access latency seen by the stream controller. Consequently, the indices and stream data compete for space in the cache.


Figure 8.10 – Cache Architecture. The control logic implements the cache protocol and orchestrates the handling of cached memory operations. The cache tags and associated state are stored in the memory banks along with the data blocks. The tag cache implements a small cache for holding recently accessed tag and state data in the cache controller. The pending transaction registers hold the state used by the cache to track active cached memory transactions. The configuration registers hold the cache configuration state; this includes the mappings used to determine the addresses of the modules that back the cache. The message scoreboard tracks pending memory operations, and notifies the control logic when messages that signal the completion of memory operations initiated by the cache module are handled. The atomic memory unit performs atomic memory operations.

Locking the cache blocks that hold the indices prevents data blocks that are brought into the cache during the handling of stream memory operations from displacing the indices.

The cache controller allows software to explicitly invalidate individual blocks, ways, and entire caches. An invalidation operation forces the cache to release any cached copies of data blocks it holds. Blocks that have been modified are written back to memory when they are released. This allows software to force data to be flushed to memory during software-orchestrated transitions from private to shared states.
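A sketch of how a phase boundary might use these operations follows, with hypothetical cache_invalidate_block and cache_invalidate_all intrinsics standing in for whatever the Elm toolchain actually exposes; because modified blocks are written back when released, the next phase observes the updated data.

/* Hypothetical invalidation intrinsics; names and granularity are illustrative. */
void cache_invalidate_block(void *addr);
void cache_invalidate_all(void);

/* At the end of a phase in which this processor privately cached and updated a
 * shared buffer, release the cached copies so later readers see the data in
 * memory rather than a stale cached copy. */
void publish_buffer(char *buf, int nbytes, int block_size)
{
    for (int off = 0; off < nbytes; off += block_size)
        cache_invalidate_block(buf + off);
    /* a coarser alternative at a phase boundary: cache_invalidate_all(); */
}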


Stream Controller

Figure 8.11 provides a more detailed illustration of the stream controller. The record address generation unit computes the effective memory addresses of the records that are referenced in a stream operation: it provides the source addresses to stream load operations, and the destination addresses to stream store operations. The stream address generation unit computes the effective memory addresses of the packed stream elements: it provides the destination addresses to a stream load operation, and the source addresses to a stream store operation. The stream state update unit updates the stream state after a stream element transfer is dispatched. The control logic block contains the control logic and finite state machines that implement the stream transfer protocol. It orchestrates the handling of stream memory operations, synthesizing the sequences of memory operations required to implement stream transfers. The message scoreboard tracks pending memory operations and notifies the control logic when messages that signal the completion of memory operations previously initiated by the stream controller are handled. The scoreboard also tracks which stream and transaction registers are in use, and the state of the associated stream memory operations.

The stream descriptor registers hold the stream descriptors, which are not modified by the stream controller. The stream state registers hold the active stream state, which is updated during a stream operation. A stream state register is updated when a stream memory operation transfers a record. The transaction registers hold state associated with pending stream memory operations. The stream controller associates a transaction register with each stream descriptor register, and uses that transaction register to handle stream memory operations that reference the stream descriptor register.

To improve the handling of stream memory operations that use indexed addressing modes, the stream controller uses an index buffer to cache elements of an index vector at the stream controller. The index vector access patterns exhibit significant spatial locality, and the index buffer allows blocks of elements to be prefetched and stored locally at the stream controller. The index buffers allow a single bank access to transfer multiple indices to the stream controller, which improves efficiency by amortizing the memory access that reads a block of indices across multiple record transfers. Rather than allocate an index buffer for each stream descriptor register, the stream controller dynamically allocates index buffers to active indexed stream transfers. This avoids the area expense of implementing an index buffer for each stream descriptor. The index buffers are invalidated after each stream memory operation, which ensures that the index buffers do not cache expired copies of index data across stream operations.

The stream controller also handles memory operations that transfer blocks of data. It handles block memory operations as a primitive stream memory operation that operates on an anonymous stream. Cached block memory operations are forwarded to the cache controller, using the same path that handles cached stream operations.

Implementation of Decoupled Loads and Stores

Decoupled load operations use the presence bits associated with registers in the general-purpose register file to indicate when there is a load operation pending. Decoupled load operations that reference addresses beyond the local memory space are translated into read request messages as they depart the Ensemble. The register identifier is embedded in the source field of the message, which is used to route the returned load data. The memory controller that responds to the read request inserts the source address from the request message into the destination field of the response message. The local memory controller extracts the register identifier and forwards the load value and register identifier to the processor that issued the request. This convention uses the additional addresses that are available in the Ensemble address ranges to avoid allocating storage within the Ensemble memory controllers to record the return register addresses.


Figure 8.11 – Stream Architecture. The transaction registers hold state associated with pending stream memory operations. The stream controller associates a transaction register with each stream descriptor register, and uses that transaction register to handle stream memory operations that reference the stream descriptor register. The stream descriptor registers hold the stream descriptors, which are not modified by the stream controller. The stream state registers hold the active stream state, which is updated during a stream operation. The index buffer allows the stream controller to prefetch indices when handling indexed stream memory operations. The index vector access patterns exhibit significant spatial locality, and the index buffer allows blocks of elements to be prefetched and stored locally at the stream controller. The message scoreboard tracks pending memory operations and notifies the control logic when messages that signal the completion of memory operations initiated by the stream controller are handled. The scoreboard also tracks which stream and transaction registers are in use, and the state of the associated stream memory operations.

Hardware Memory Operations and Messages

This section provides a brief description of how memory operations are handled. Elm allows most message classes to encode the number of data words contained in the payload section. This complicates the design of the interconnection network somewhat, but improves efficiency by avoiding the internal and external fragmentation problems associated with fixed message sizes.


request:  dst  src  id  cmd=rdq  size  addr
response: dst  src  id  cmd=rdp  size  data[1] ... data[N]

Figure 8.12 – Formats used by read Messages. The destination field (dst) encodes the address of the destination module to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The size field (size) encodes the size of the block. The address field (addr) in the request message encodes the block address; the data fields (data) in the response message contain the block data.

Messages that are issued to handle load operations encode the load destination register in the source address field of the message. Essentially, some of the addresses within the address range assigned to a processor are used to encode register locations. This avoids the need to record the destination register associated with pending read requests. Instead, buffering in the network is used to hold information about where data is to be returned; in effect, this allows the state storage for pending memory operations to scale with system size and memory access latencies.

Read Messages [ read ]

The request message [cmd=rdq] requests a block of contiguous words at a specified address from a memory controller; the response message [cmd=rdp] returns the data block read from the memory (Figure 8.12). The block size is encoded in the size field. Both messages provide an operation identifier field to allow the source controller to match the response message to the request. The message identifier is used to associate messages with pending state, such as the pending transaction state maintained in cache controllers.

Write Messages [ write ]

The request message [cmd=wrq] transfers a block of contiguous data to the destination memory controller; the response message [cmd=wrp] indicates the write operation has completed (Figure 8.13). The block size is encoded in the size field. Both messages have an operation identifier field to allow the source controller to match the response to the request. The message identifier is used to associate messages with pending state, such as the pending transaction state maintained in cache controllers.

Atomic Compare-and-Swap [ atomic.cas ]

The request message [cmd=csq] transfers the address, compare, and swap values to the destination memory controller; the response message [cmd=csp] transfers the value read at the address (Figure 8.14). Both messages have an operation identifier field to allow the source controller to match the response to the request.
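The effect of the request at the memory controller that owns the location can be modeled as below; this is a behavioral sketch of the semantics described in Section 8.1, not the controller logic, and the retry loop is just one illustrative use.

/* Behavioral model of the compare-and-swap performed atomically by the memory
 * controller that owns the addressed location; the returned value is carried
 * back in the response message. */
int compare_and_swap(int *addr, int expected, int update)
{
    int current = *addr;
    if (current == expected)
        *addr = update;
    return current;
}

/* Illustrative use: a lock-free increment built from compare-and-swap. */
void atomic_increment(int *counter)
{
    int seen, observed;
    do {
        seen = *counter;
        observed = compare_and_swap(counter, seen, seen + 1);
    } while (observed != seen);  /* retry if another processor intervened */
}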


request:  dst  src  id  cmd=wrq  size  addr  data[1] ... data[N]
response: dst  src  id  cmd=wrp

Figure 8.13 – Formats used for write Messages. The destination field (dst) encodes the address of the destination module to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The size field (size) encodes the size of the block. The address field (addr) in the request message encodes the write address; the data fields (data) contain the block data.

request:  dst  src  id  cmd=csq  addr  cmp  swp
response: dst  src  id  cmd=csp  data

Figure 8.14 – Formats used for compare-and-swap Messages. The destination field (dst) encodes the address of the destination module to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The address field (addr) encodes the address operand; the compare field (cmp) encodes the compare operand; the swap field (swp) encodes the swap operand; and the data field (data) encodes the value stored at the memory location when the request message was handled.

request:  dst  src  id  cmd=faq  addr  add  seq
response: dst  src  id  cmd=fap  data

Figure 8.15 – Formats used by fetch-and-add Messages. The destination field (dst) encodes the address of the destination module to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The address field (addr) encodes the address operand; the addend field (add) encodes the addend operand; and the sequence field (seq) encodes the sequence limit. The updated value stored at the specified memory address is cleared to zero if it matches the sequence limit. The data field (data) encodes the value stored at the memory location when the request message was handled.

Atomic Fetch-and-Add [ atomic.fad ]

The request message [cmd=faq] transfers the address, addend, and sequence values to the memory controller; the response message [cmd=fap] transfers the value read at the address (Figure 8.15). Both messages have an operation identifier field to allow the source controller to match the response to the request.


request:  dst  src  id  cmd=cpq  size  saddr  daddr
response: dst  src  id  cmd=cpp

Figure 8.16 – Formats of copy Messages. The destination field (dst) encodes the address of the destination module to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The size field (size) specifies the size of the block that will be copied. The source address field (saddr) encodes the source address of the copy operation; the destination address field (daddr) encodes the destination address of the copy operation.

request:  dst  src  id  cmd=rsq  desc  size  addr  offset
response: dst  src  id  cmd=rsp

Figure 8.17 – Formats of read-stream Messages. The destination field (dst) encodes the address of the stream controller to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The stream descriptor register is encoded in the descriptor field (desc). The size field (size) encodes the number of records to be transferred. The address field (addr) encodes the address to which records are transferred; the offset field (offset) encodes the offset that is used to compute the effective addresses of the records comprising the stream transfer.

Block Copy Messages [ copy ]

Block copy messages request that a block of data be transferred between two remote memory controllers. The request message [cmd=cpq] specifies the source and destination addresses of the copy operation and the number of words to be transferred; the response message [cmd=cpp] indicates that the transfer has completed (Figure 8.16). Both messages have an operation identifier field to allow the source controller to match the response to the request. Block copy operations are converted to read and write operations when they arrive at the destination memory controller. The memory controller performs the requested operation and returns a block copy response message when the operation completes. Processors may use block copy messages to transfer data between remote memory locations without marshaling the data through a local memory. For example, a processor may send a block copy request to a distributed memory tile to instruct it to transfer data directly from system memory to distributed memory.
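A sketch of that third-party transfer follows, with a hypothetical block_copy wrapper around the copy request message; the name, signature, and addressing are illustrative rather than the actual Elm interface.

/* Hypothetical wrapper that sends a copy request [cmd=cpq] to the memory
 * controller identified by 'tile'; completion is signalled by the [cmd=cpp]
 * response message. */
void block_copy(unsigned tile, void *dst_addr, const void *src_addr, int nwords);

/* Ask a distributed memory tile to pull a block directly from system memory
 * into its own storage; the data never passes through the requesting
 * processor's local Ensemble memory. */
void stage_table(unsigned tile, void *tile_addr,
                 const void *system_addr, int nwords)
{
    block_copy(tile, tile_addr, system_addr, nwords);
}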

Read-Stream Messages [ read-stream ]

Read-stream messages initiate stream load operations. The stream load operation gathers records to a block of memory using a stream descriptor to generate the record addresses. The request message [cmd=rsq] specifies the stream descriptor register that will be used to generate the record addresses, the number of records to be transferred, and the address at which the records are to be moved. The response message [cmd=rsp] is returned when the stream controller completes the operation. The stream controller orchestrating the operation issues sequences of read, write, and block copy messages to effect the stream transfer.


request:  dst  src  id  cmd=wsq  desc  size  addr  offset
response: dst  src  id  cmd=wsp

Figure 8.18 – Formats of write-stream messages. The destination field (dst) encodes the address of the stream controller to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The stream descriptor register is encoded in the descriptor field (desc). The size field (size) encodes the number of records to be transferred. The address field (addr) encodes the address from which records are transferred; the offset field (offset) encodes the offset that is used to compute the effective addresses of the records comprising the stream transfer.

request:  dst  src  id  cmd=rdcq  size  addr
response: dst  src  id  cmd=rdcp  size  data[1] ... data[N]

Figure 8.19 – Formats of read-cache messages. The destination field (dst) encodes the address of the cache controller to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The size field (size) encodes the size of the block. The address field (addr) in the request message encodes the block address; the data fields (data) in the response message contain the block data.

Write-Stream Messages [write-stream]

Write-stream messages initiate stream store operations. The stream store operation scatters records from a block of memory using a stream descriptor to generate the record addresses. The request message [cmd=wsq] specifies the stream descriptor register that will be used to generate the record addresses, the number of records to be transferred, and the address from which the records are to be moved. The response message [cmd=wsp] is returned when the stream controller completes the operation. The stream controller orchestrating the operation issues sequences of read, write, and block copy messages to effect the stream transfer.

Read-Cache Messages [ read-cache ]

These messages load blocks of data using a distributed memory as a block cache. The request message [cmd=rdcq] transfers the block address and block size to a cache controller; the response message [cmd=rdcp] returns the data block (Figure 8.19).


request:  dst  src  id  cmd=wrcq  size  addr  data[1] ... data[N]
response: dst  src  id  cmd=wrcp

Figure 8.20 – Formats of write-cache messages. The destination field (dst) encodes the address of the cache controller to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The size field (size) encodes the size of the block. The address field (addr) in the request message encodes the write address; the data fields (data) contain the block data.

request:  dst  src  id  cmd=rscq  size  elms  addr  offset
response: dst  src  id  cmd=rscp

Figure 8.21 – Formats of read-stream-cache messages. The destination field (dst) encodes the address of the stream controller to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The stream descriptor register is encoded in the descriptor field (desc). The size field (size) encodes the number of records to be transferred. The address field (addr) encodes the address to which records are transferred; the offset field (offset) encodes the offset that is used to compute the effective addresses of the records comprising the stream transfer.

Write-Cache Messages [ write-cache ]

These messages store blocks of data using a distributed memory as a block cache. The request message [cmd=wrcq] transfers the data block to the distributed memory and specifies the write address; the response message [cmd=wrcp] indicates that the write operation has completed (Figure 8.20).

Stream-Read-Cache Messages [ read-stream-cache ]

These messages initiate stream load operations that use distributed memory tiles to cache stream records. The stream load operation gathers records to a block of memory using a stream descriptor to generate the record addresses. The request message [cmd=rscq] specifies the stream descriptor register that will be used to generate the record addresses, the number of records to be transferred, and the address at which the records are to be moved. The response message [cmd=rscp] is returned when the stream controller completes the operation. The stream controller orchestrating the operation issues sequences of read, write, and block copy messages that are filtered through its local cache controller to effect the stream transfer (Figure 8.21).

Stream-Write-Cache Messages [ write-stream-cache ]

These messages initiate stream store operations that use distributed memory tiles to cache records. The stream store operation scatters records from a block of memory using a stream descriptor to generate the record addresses. The request message [cmd=wscq] specifies the stream descriptor register that will be used to generate the record addresses, the number of records to be transferred, and the address from which the records are to be moved. The response message [cmd=wscp] is returned when the stream controller completes the operation. The stream controller orchestrating the operation issues sequences of read, write, and block copy messages that are filtered through its local cache controller to effect the stream transfer (Figure 8.22).


request:  dst  src  id  cmd=wscq  desc  size  addr  offset
response: dst  src  id  cmd=wscp

Figure 8.22 – Formats of write-stream-cache messages. The destination field (dst) encodes the address of the stream controller to which the message is delivered; the source field (src) encodes the source address of the sending module. The identifier field (id) encodes an identifier that associates request and response message pairs. The stream descriptor register is encoded in the descriptor field (desc). The size field (size) encodes the number of records to be transferred. The address field (addr) encodes the address from which records are transferred; the offset field (offset) encodes the offset that is used to compute the effective addresses of the records comprising the stream transfer.

request:  dst  src  id  cmd=exeq  ir  size  addr
response: dst  src  id  cmd=exep

Figure 8.23 – Format of execute Command Messages. The destination field (dst) of the request message encodes the address of the processor to which the instructions are dispatched; the source field (src) encodes the address of the processor that issued the command. The identifier field (id) is used to associate request and response messages. The destination instruction register field (ir) encodes the instruction register at which the instructions are loaded, which determines the address control transfers to when the message is handled. The size field (size) encodes the number of instructions to be loaded, and the address field (addr) encodes the address of the instruction block in memory.

Execute and Dispatch Messages [ execute and dispatch ]

Execute operations allow the sender to dispatch an execute command at the receiver. The execute command tells the receiving processor to begin executing the instructions at the specified address. The request message [cmd=exeq] specifies the address (addr) and size (size) of an instruction block and the instruction register address (ir) to which the block is copied within the receiving processor. The processor transfers control to the instruction register specified in the message. The execute response message [cmd=exep] is returned to indicate that the processor has loaded the instructions. Dispatch messages transfer the instruction block in the message (Figure 8.24). This allows the instructions to be dispatched without creating a local copy of the instruction blocks.


request:  dst  src  id  cmd=dspq  size  ir  data[1] ... data[N]
response: dst  src  id  cmd=dspp

Figure 8.24 – Format of dispatch Command Messages. The destination field (dst) of the request message encodes the address of the processor to which the instructions are dispatched; the source field (src) encodes the address of the processor that issued the command. The identifier field (id) is used to associate request and response messages. The size field (size) encodes the number of instructions in the message. The destination instruction register field (ir) encodes the instruction register at which the instructions are loaded, which determines the address control transfers to when the message is handled. Instructions are transported in the data fields.

array of structures:  aos[0].rgb  aos[1].rgb  aos[2].rgb  aos[3].rgb  aos[4].rgb  aos[5].rgb

structure of arrays:  soa.r[0] ... soa.r[5]  soa.g[0] ... soa.g[5]  soa.b[0] ... soa.b[5]

Figure 8.25 – Examples of Array-of-Structures and Structure-of-Arrays Data Types.

8.3 Example

The following example illustrates the use of several mechanisms provided by the Elm memory system.

Array-of-Structures and Structure-of-Array Conversion

Applications that process sequences of aggregate data objects usually represent collections of aggregates as arrays-of-structures or structures-of-arrays. Figure 8.25 illustrates the difference in data layout that arises from representing a collection of an aggregate data type as an array-of-structures and as a structure-of-arrays. The array-of-structures representation exposes spatial locality within the individual aggregate data types. The structure-of-arrays representation exposes spatial locality within the fields of the aggregate data type. Machines that provide instructions that operate on multiple data elements in parallel usually prefer data that is organized as structures-of-arrays, as the structure-of-arrays representation simplifies the mapping of data elements to parallel function units. Stream load and store operations allow sequences of aggregates to be converted between array-of-structures and structure-of-arrays representations efficiently. Figure 8.26 shows a code fragment that uses scalar load and store operations to convert between array-of-structures and structure-of-arrays representations. The code fragment is relatively expensive. It uses registers to stage data and function units to orchestrate the movement of individual elements. The number of instructions that are executed increases in proportion to the number of data elements that are moved, which is the product of the array length and the number of elements in the aggregate data type.


struct { fixed32 r; fixed32 g; fixed32 b; } aos[6];
struct { fixed32 r[6]; fixed32 g[6]; fixed32 b[6]; } soa;

// array-of-structures to structure-of-arrays
for ( int j = 0; j < 6; j++ ) {
    soa.r[j] = aos[j].r;
    soa.g[j] = aos[j].g;
    soa.b[j] = aos[j].b;
}

// structure-of-arrays to array-of-structures
for ( int j = 0; j < 6; j++ ) {
    aos[j].r = soa.r[j];
    aos[j].g = soa.g[j];
    aos[j].b = soa.b[j];
}

Figure 8.26 – Scalar Conversion Between Array-of-Structures and Structure-of-Arrays.

struct { fixed32 r; fixed32 g; fixed32 b; } aos[6];
struct { fixed32 r[6]; fixed32 g[6]; fixed32 b[6]; } soa;
stream_descriptor sd = { 1,    // stream record-size
                         6,    // number of records in stream
                         1,    // stream element stride
                         0 };  // index address (null)

// array-of-structures to structure-of-arrays
stream_load( &soa.r, &aos[0].r, &sd );
stream_load( &soa.g, &aos[0].g, &sd );
stream_load( &soa.b, &aos[0].b, &sd );
stream_sync();

// structure-of-arrays to array-of-structures
stream_store( &aos[0].r, &soa.r, &sd );
stream_store( &aos[0].g, &soa.g, &sd );
stream_store( &aos[0].b, &soa.b, &sd );
stream_sync();

Figure 8.27 – Stream Conversion Between Array-of-Structures and Structures-of-Arrays.


Figure 8.27 shows an equivalent code fragment that uses stream load and store operations to convert between array-of-structures and structure-of-arrays representations. Data is moved between memory locations without passing through registers. The function units are used to initiate the transfers and to synchronize with their completion, but are not used to orchestrate the movement of individual elements. The number of instructions that are executed increases in proportion to the number of elements in the aggregate data type, and is independent of the array length.
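To make the scaling concrete, the scalar loops in Figure 8.26 execute a load and a store, along with addressing and loop overhead, for each of the 6 × 3 = 18 element moves in each direction, whereas the stream version in Figure 8.27 issues three stream operations and a synchronization per direction regardless of how long the arrays are.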


                                              Execution Time   Messages Exchanged   Data Transferred
                                              [ cycles ]       [ messages ]         [ words ]
Remote Loads and Stores [ Kernel 1 ]          1,728            256                  896
Decoupled Loads and Stores [ Kernel 2 ]       472              256                  896
Block Gets and Puts [ Kernel 3 ]              736              32                   224
Decoupled Block Gets and Puts [ Kernel 4 ]    498 (480)        32                   224
Stream Loads and Stores [ Kernel 5 ]          451              32 (4)               224 (140)

Table 8.1 – Execution Times and Data Transferred. The rows list the execution times, messages exchanged, and data transferred for each of the kernel implementations. The tabulated data are reported per kernel iteration. The data transferred includes all of the data comprising the messages exchanged between the local and remote memory controllers, and includes addresses and command information exchanged by the memory controllers. Parenthetical entries list the lower limits that are achieved with more aggressive software and hardware optimizations that allow multiple records to be transferred in a single memory system message.

8.4 Evaluation and Analysis

This section evaluates the impact of mechanisms provided by the Elm memory system on the efficiency of several embedded benchmark codes. The benchmarks are compiled and mapped to 16 processors within 4 Ensembles. Data are transferred between kernels through the Ensemble memory at the destination processor, which is used to receive and buffer the data.

Remote Memory Access Efficiency

To evaluate the impact that the different remote memory access operations have on the performance and memory bandwidth demands of kernels that operate on data residing in remote memory, we can analyze several implementations of an image processing kernel that applies a filter to multiple square regions of an image. We assume that the kernel is passed a sequence of independent regions that are to be filtered. The image resides in remote memory and is updated in place, so the kernel must fetch elements of the image, operate on them, and then store the updated elements back to the image. Figure 8.28 provides a simplified illustration of the sequence of operations. To simplify the example, we assume that the filter operation depends only on a single pixel and that the three color components comprising a pixel are packed in a single memory word. As the details of the filter operation are unimportant, we represent the operation as a generic filter function. To make the example concrete, we shall assume that the filter operation consumes 5 cycles, that transferring a message between the processor and remote memory consumes 5 cycles, and that all local memory operations complete in 1 cycle. This results in remote load and store operations requiring 11 cycles to complete: 5 cycles to send the request message, 1 cycle to perform the memory operation at the remote memory, and 5 cycles to send the response message. Table 8.1 summarizes the execution times, the numbers of messages exchanged, and the data transferred for the different implementations of the kernels.


[ Figure 8.28 illustrates the kernel: an 8 × 8 region of the image in remote memory, spanning pixel (x, y) at address src through pixel (x+7, y+7) at address src + (7 × Y) + 7, is (1) transferred to a local buffer spanning (0,0) at address dst through (7,7) at dst + 63 by a stream load [ get ], (2) filtered locally [ compute ], and (3) written back to remote memory by a stream store [ put ]. ]

Figure 8.28 – Image Filter Kernel. The elements comprising the region of the image to be updated are transferred to local memory, where the region is filtered in place. After the filter operation completes, the updated region is transferred back to the image.


for ( int n = 0; n < size; n++ ) {
    int x = pos[n].x;
    int y = pos[n].y;
    for ( int i = 0; i < 8; i++ ) {
        for ( int j = 0; j < 8; j++ ) {
            image[y+i][x+j] = filter( image[y+i][x+j] );
        }
    }
}

Figure 8.29 – Implementation using Remote Loads and Stores [Kernel 1]. The kernel accepts a sequence of regions that are to be processed through the pos and size variables. Iteration n of the kernel applies the filter to the 8 × 8 block of pixels located at the pixel coordinate specified by pos[n].

Remote Loads and Stores [ Kernel 1 ]

The simplest implementation of the kernel uses remote loads and stores to transfer elements of the image between remote memory and registers. Figure 8.29 shows a code fragment that implements the kernel using remote load and store operations. The image is updated in place, and individual elements of the image are accessed using remote load and store operations. Each load operation generates a remote read message, and each store operation generates a remote write message. The execution time and data transferred are listed in the first row of Table 8.1.
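
The tabulated execution time is consistent with a simple serial per-pixel accounting under the stated assumptions (the decomposition and symbols below are ours, added as a check rather than taken from the source):

\[ T_{\text{kernel 1}} = 64 \times (t_{\text{load}} + t_{\text{filter}} + t_{\text{store}}) = 64 \times (11 + 5 + 11) = 1{,}728 \text{ cycles} \]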

Decoupled Remote Loads and Stores [ Kernel 2 ]

We can improve the performance of the kernel by using decoupled memory operations to allow multiple memory operations to proceed in parallel, and to allow memory operations to complete in parallel with computation. Figure 8.30 shows a code fragment that implements the image filter using decoupled remote load and store operations. Elements of the original and filtered image are temporarily stored in registers to decouple the handling of memory operations from the computation.

The execution time and data transferred are listed in the second row of Table 8.1. The decoupling improves performance, but does not affect the number of memory transactions or the amount of data transferred. The decoupling consumes additional registers. Because the relative reduction in the execution time is smaller than the relative increase in the number of registers used, the decoupling increases aggregate register occupancy and register pressure.

Block Gets and Puts [ Kernel 3 ]

We can reduce the number of messages that are exchanged by using block get and put operations to transfer blocks of elements between remote memory and local memory. By exploiting spatial locality within blocks of elements to reduce the number of remote transfers, we amortize the overhead of transferring messages between the local and remote memory controllers across multiple elements. Consequently, the demand for bandwidth in the interconnection network is reduced, and less energy is expended moving data. Figure 8.31 shows a code fragment that implements the image filter using block get and put operations. Each iteration transfers one row of the update region to local memory, performs the filter operation locally, and then transfers the filtered row back to the image.


for ( int n = 0; n < size; n++ ) {
    int x = pos[n].x;
    int y = pos[n].y;
    fixed32 ti[8], to[8];
    for ( int i = 0; i < 8; i++ ) {
        for ( int j = 0; j < 8; j++ ) {
            ti[j] = image[y+i][x+j];
        }
        for ( int j = 0; j < 8; j++ ) {
            to[j] = filter( ti[j] );
        }
        for ( int j = 0; j < 8; j++ ) {
            image[y+i][x+j] = to[j];
        }
    }
}

Figure 8.30 – Implementation using Decoupled Remote Loads and Stores [Kernel 2]. Temporary variables ti and to provide local storage for staging transfers from and to remote memory. The temporary variables are mapped to registers, and the remote memory accesses are implemented as decoupled remote load and store operations.

for ( int n = 0; n < size; n++ ) {
    int x = pos[n].x;
    int y = pos[n].y;
    fixed32 ti[8], to[8];
    for ( int i = 0; i < 8; i++ ) {
        block_get( &ti, &image[y+i][x], 8 );
        for ( int j = 0; j < 8; j++ ) {
            to[j] = filter( ti[j] );
        }
        block_put( &to, &image[y+i][x], 8 );
    }
}

Figure 8.31 – Implementation using Block Gets and Puts [Kernel 3]. Temporary variables ti and to provide local storage for staging transfers from and to remote memory. Local memory is used to stage remote memory accesses, and the temporary variables are mapped to local memory. Data are transferred between local and remote memory using get and put operations; data are transferred between local memory and registers using load and store operations.


The execution time and data transferred are listed in the third row of Table 8.1. The block gets and puts exploit spatial locality within the rows of the update region. Using block gets and puts to transfer data between remote memory and local memory reduces the number of remote memory transactions and messages. The reduction in the number of memory transactions results in an execution time that is less than that of the version of the code that uses remote loads and stores. However, the failure to overlap memory accesses with computation results in an execution time that is greater than that of the version that uses decoupled loads and stores.


for ( int n = 0; n < size; n++ ) {
    int x = pos[n].x;
    int y = pos[n].y;
    fixed32 block[8][8];
    block_get_decoupled( &block[0][0], &image[y][x], 8 );
    for ( int i = 0; i < 8; i++ ) {
        block_get_sync();
        if ( i < 7 ) {
            block_get_decoupled( &block[i+1][0], &image[y+i+1][x], 8 );
        }
        for ( int j = 0; j < 8; j++ ) {
            block[i][j] = filter( block[i][j] );
        }
        block_put_decoupled( &block[i][0], &image[y+i][x], 8 );
    }
    block_put_sync();
}

Figure 8.32 – Implementation using Decoupled Block Gets and Puts [Kernel 4]. Temporary variable block provides a name for staging transfers from and to remote memory. Local memory is used to stage remote memory accesses, and the elements of temporary variable block are mapped to local memory. Data are transferred between local and remote memory using get and put operations; data are transferred between local memory and registers using load and store operations. Blocks of data are fetched one iteration before being accessed. Synchronization operations are inserted to ensure that data transfers have completed before local copies of data are accessed, and to ensure that all outstanding store operations have completed before the kernel completes.

Decoupled Block Gets and Puts [ Kernel 4 ]

As before, we can improve the performance of the kernel by using decoupled block get and put operations to allow memory operations to proceed in parallel with computation. We must explicitly synchronize the computation with the block memory transfers, which produces a small increase in the number of instructions that are executed. Figure 8.32 shows a code fragment that implements the image filter using decoupled block get and put operations. The code fetches the next row of the update region before performing the filter operation, which allows the memory access latency to be hidden by computation.

The execution time and data transferred are listed in the fourth row of Table 8.1. The block gets and puts exploit spatial locality within the rows of the update region, and decoupling allows memory operations to complete in parallel with computation. The overhead of issuing explicit memory and synchronization operations results in an execution time that is slightly greater than the version that uses decoupled loads and stores. However, this version of the code generates considerably fewer messages and reduces the data exchanged between the local memory and remote memory by a factor of four.

The execution time shown in parentheses corresponds to an implementation that issues multiple concurrent block get operations to fetch the next update region before processing the current update region. This increases the fraction of the memory access time that can be overlapped with computation. The time required to transfer the first update region is exposed, but subsequent remote memory access latencies are completely hidden by the filter computation. This also reduces the number of synchronization operations that need to be issued to one per region. Essentially, we use additional local storage to coarsen the basic unit of communication and thereby reduce both communication and synchronization overheads.


stream_descriptor sdl = { 8,    // stream record-size
                          8,    // number of records in stream
                          X,    // stream element stride
                          0 };  // index address (null)
stream_descriptor sds = sdl;
int k = 0;
fixed32 block[3][8][8];
stream_descriptor_load( &image, sdl );
stream_descriptor_load( &image, sds );
stream_load( &block[k], &image[ pos[0].y ][ pos[0].x ], sdl );
for ( int n = 0; n < size; n++ ) {
    int x = pos[n].x;
    int y = pos[n].y;
    stream_load_sync( sdl );
    if ( n < size - 1 ) {
        stream_load( &block[(k+1)%3], &image[ pos[n+1].y ][ pos[n+1].x ], sdl );
    }
    for ( int i = 0; i < 8; i++ ) {
        for ( int j = 0; j < 8; j++ ) {
            block[k][i][j] = filter( block[k][i][j] );
        }
    }
    stream_store( &block[k], &image[y][x], sds );
    k = (k+1)%3;
}
stream_store_sync( sds );

Figure 8.33 – Implementation using Stream Loads and Stores [Kernel 5]. The stream load and store operations use stream descriptors sdl and sds to generate the sequences of addresses that correspond to an 8 × 8 region of the image. The stream descriptors specify a record size of 8 elements, which corresponds to one row of the 8 × 8 region, and a stride of X, which corresponds to the number of elements in one row of the image. The stream descriptor that is used to load the region to local memory is loaded into a stream descriptor register at the memory controller that is nearest to the image, as this allows the memory operations comprising the stream load operation to be generated locally. Similarly, the stream descriptor that is used to store the region to remote memory is loaded into a stream descriptor register at the local memory controller, as this allows the memory operations comprising the stream store operation to be generated locally. Temporary variable block provides a name for staging transfers between local and remote memory. The kernel processes three regions of the image concurrently, with the computation of one region proceeding in parallel with the transfer of the previous region from local to remote memory and the transfer of the next region from remote to local memory.

Stream Loads and Stores [ Kernel 5 ]

Figure 8.33 shows a code fragment that implements the image filter using stream load and store operations. The code transfers an update region to local memory, performs the filter operation on the local copy, and then transfers the updated region back to the image. The code uses three local region buffers to decouple stream loads, computation, and stream stores. The execution time and data transferred are listed in the fifth row of Table 8.1. This version achieves the lowest execution time, exchanges the fewest messages, and transfers the least data. However, the extensive decoupling results in this version requiring the most local memory to stage transfers between local memory and remote memory.


[ Figure 8.34 plots normalized execution times (0.0 – 1.0) for the bitonic, dct, des, fmradio, radar, and vocoder benchmarks, comparing Remote Loads and Stores with Stream Loads and Stores. ]

Figure 8.34 – Normalized Kernel Execution Times. The execution times are normalized within each kernel to illustrate the impact of the stream memory operations on the execution times.

The values listed in parentheses under the messages exchanged and data transferred columns correspond to minimum values that would be achieved if all of the elements transferred by a stream memory operation were sent as a single read or write message. Most systems would break the read and write operations into sequences of read and write messages that each transfer a block of elements comprising one record of the stream.

Benchmark Throughput

Figure 8.34 illustrates the impact of the stream memory operations on throughput and latency. The figure compares the normalized execution times observed when the benchmarks are implemented using remote loads and stores to transfer data between kernels and when the benchmarks are implemented using stream loads and stores. Implementing remote memory transfers using stream memory operations reduces the execution time by 5% – 41%. As we should expect, the improvement in execution time depends on the rate at which remote memory accesses are performed.

Figure 8.35 shows the aggregate memory bandwidth demand for the benchmarks. The aggregate demand combines the memory operations performed by all 16 processors, which is why the demand sometimes exceeds one word per cycle. The component of the bandwidth that corresponds to local memory references is shown as separate from the component that corresponds to remote memory references.

As Figure 8.34 and Figure 8.35 show, benchmarks that perform relatively few remote memory accesses exhibit small improvements in performance when stream memory operations are used to transfer data. With the exception of the fmradio benchmark, the increase in memory bandwidth demand we observe in Figure 8.35 for the implementations that use stream memory operations is mostly caused by the reduction in the benchmark execution times. Figure 8.36 shows the number of memory accesses by component for each of the benchmarks. The reported data are normalized within each benchmark to illustrate the impact of stream memory operations. As the figure illustrates, the fmradio benchmark exhibits a small increase in the number of local memory accesses. These additional accesses are contributed by store operations that place data in local memory where it is temporarily buffered before a stream store operation is issued to send the data to the receiving processor.


[ Figure 8.35 plots memory bandwidth demand in words / cycle (0 – 4) for the bitonic, dct, des, fmradio, radar, and vocoder benchmarks, with Remote Memory and Local Memory components shown for the remote (R) and stream (S) implementations. ]

Figure 8.35 – Memory Bandwidth Demand. Remote (R) Loads and Stores. Stream (S) Loads and Stores. The data show the application bandwidth demand when the applications are partitioned and mapped to 16 processors.

[ Figure 8.36 plots the normalized breakdown (0.0 – 1.0) of Remote Memory and Local Memory accesses for the bitonic, dct, des, fmradio, radar, and vocoder benchmarks under the remote (R) and stream (S) implementations. ]

Figure 8.36 – Normalized Memory Access Breakdown. Remote (R) Loads and Stores. Stream (S) Loads and Stores. The memory accesses are normalized within each application to illustrate the breakdown of local and remote memory accesses.

Figure 8.37 shows the reduction in the number of messages that traverse the interconnection network when communication between threads is implemented using stream memory operations. The implementations of the radar and vocoder benchmarks that use stream memory operations include a significant number of remote load and store operations that access scalar variables. Table 8.2 lists the number of remote memory operations and stream memory operations performed in the implementations that use stream memory operations. Most of the scalar operations reference coordination variables that are used to synchronize the execution of the kernels and to interleave the enqueuing and dequeuing of data at distributed queues; other operations reference scalar data that is distributed across multiple Ensembles. The scalar references account for the smaller reduction in the number of messages we observe for these benchmarks in Figure 8.37.


Remote Memory Operations

  bitonic     dct       des       fmradio    radar      vocoder
  8,852       7,361     5,542     18,136     50,261     788,773

Stream Memory Operations

  bitonic          dct              des             fmradio          radar            vocoder
  (16, 1)  6,673   (16, 1)  2,766   (8, 1)     260   (8, 1)   4,320   (1, 2)  25,123   (1, 1)  316,994
  (32, 1)  1,550   (32, 1)  1,085   (16, 1)    715   (16, 1)  3,254   (2, 2)   1,049   (2, 1)   92,624
  (64, 1)    110   (64, 1)    120   (4, 2)     526   (8, 1)   3,598   (16, 2)    282   (32, 2)      198
  (15, 1)  7,710   (64, 2)     37   (64, 2)    105   (20, 1)  4,112   (96, 1)  2,569   (1, 2)   63,909

Table 8.2 – Number of Remote and Stream Memory Operations in Benchmarks. The stream operation entries use the syntax (e,w) to indicate that the stream operation transfers a sequence of e elements, with each element comprising w contiguous words.

[ Figure 8.37 plots normalized message counts (0.0 – 1.0) for the bitonic, dct, des, fmradio, radar, and vocoder benchmarks under the remote (R) and stream (S) implementations. ]

Figure 8.37 – Normalized Memory Messages. Remote (R) Loads and Stores. Stream (S) Loads and Stores. The data presented in the figure are normalized to illustrate the reduction in the number of messages that are communicated when the applications are implemented using stream load and store operations.

Remote load and store operations produce messages that transport a single word of application data. Most stream memory operations produce messages that transport multiple words of application data, which amortizes the overhead of the message header over more application data. Figure 8.38 shows the amount of data that traverses the interconnection network, which includes application data, message header data, and acknowledgment messages sent in response when required by the message communication protocol. The reported data are normalized within each benchmark to illustrate the reduction achieved using stream memory operations.


[ Figure 8.38 plots normalized message data (0.0 – 1.0) for the bitonic, dct, des, fmradio, radar, and vocoder benchmarks under the remote (R) and stream (S) implementations. ]

Figure 8.38 – Normalized Memory Message Data. Remote (R) Loads and Stores. Stream (S) Loads and Stores. The data presented in the figure are normalized to illustrate the reduction in the aggregate message data that is transferred when the applications are implemented using stream load and store operations.

Comparing Figure 8.37 and Figure 8.38, we should observe that the reduction in message data is always less than the reduction in the number of messages. This is because the use of stream memory operations has little effect on the amount of application data that moves through the memory system. Rather, stream memory operations reduce the communication overheads associated with data movement.

Decoupled Memory Accesses

The availability of registers for receiving decoupled loads limits the extent to which decoupled memory operations can be used to tolerate remote memory access times. A straightforward application of Little's law [100] allows one to determine the number of registers that are needed to cover an expected memory access time and bandwidth demand. For example, consider a code fragment that demands one load operation every 5 cycles and has an effective load bandwidth of 0.2 words per cycle. If the expected memory access time is 20 cycles, the number of registers required to decouple memory accesses is 20 × 0.2 = 4 registers. If the memory access time is instead 100 cycles, the number of registers required to decouple memory accesses increases to 100 × 0.2 = 20 registers.

The Elm memory system uses the distributed and local Ensemble memories to stage data and instruction transfers. The Ensemble memories provide enough capacity to stage transfers to and from remote memories while reserving space for capturing local instruction and data working sets. The distributed memories provide additional capacity for staging transfers between local Ensemble memories and off-chip memory. They are also used to stage transfers between processors and dedicated IO interfaces. The decoupled stream memory operations provide an effective way of hiding communication latencies, and they reduce the number of pending transactions that the processors need to monitor and track. This simplifies the design of the processors and Ensembles, which improves efficiency.
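
Stated compactly, the calculation in the preceding paragraph is the product of the expected memory access time and the sustained load bandwidth (the symbols are our own shorthand):

\[ N_{\text{reg}} = T_{\text{mem}} \times B_{\text{load}}, \qquad 20 \times 0.2 = 4, \qquad 100 \times 0.2 = 20 \]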


8.5 Related Work

Contemporary microprocessors rely extensively on hardware cache hierarchies to reduce apparent memory access latencies [133]. Processor designs commonly use cache block sizes that are larger than the machine word size, and transfer an entire cache block on a miss to provide a limited amount of hardware prefetching within the block.

Superscalar processors use out-of-order execution and non-blocking caches to hide memory access latencies by overlapping memory operations with the execution of independent instructions [6, 79]. Register renaming is used to expose additional instruction-level parallelism, and allows processors to initiate more concurrent memory operations, calculate addresses, and initiate cache refills early [152], which helps to reduce apparent memory access times. In addition to the structures used to track pending misses in the caches, superscalar processors use registers in the multi-ported renamed register file to receive data returned by load operations, and structures such as the instruction window and the memory reorder queue to record the state of pending memory operations. These structures are very expensive, which limits the extent to which superscalar processors can use out-of-order execution to hide memory access latencies.

Many high-performance processors use dedicated hardware prefetch units to speculatively fetch blocks that are expected to be accessed before the instructions that reference the blocks reach the memory system [72]. Because most attempt to predict future misses based on observed address miss patterns, hardware prefetch units are more effective when data and instruction access patterns exhibit significant amounts of structure and regularity. Consequently, they typically fail to comprehend complex access patterns and may fetch data that is never referenced before being released from the cache. Miss-based hardware prefetch schemes are typically only effective when data access patterns generate long stable address sequences because they require one or more misses to detect the access pattern. Hardware prefetch schemes typically prefetch beyond the end of a sequence. Dedicated stream buffers may be used to avoid polluting the data cache when the prefetched data is not accessed [72], though the misprediction wastes energy and memory system bandwidth.

Most miss-based hardware prefetch schemes, which typically resemble stream buffers [72], transfer linear sequences of cache blocks in response to observed cache miss patterns. These typically provide dedicated hardware to identify distinct reference streams and predict miss strides, and use the predicted strides to prefetch data within each stream. The TriMedia TM3270 media processor implements a region-based prefetch scheme that provides some software control over what data is prefetched when a load references an address within one of four software-defined memory regions [143]. The prefetch hardware allows software to configure the stride that is used to compute the prefetch address within the memory regions. This is useful when processing image data in scan-line order because the prefetch hardware can be configured to fetch data several rows ahead of the current processing position. The TriMedia processor transfers prefetch data directly to the data cache, rather than a dedicated stream buffer.
The design relies on the associativity and capacity of the data cache to prevent the prefetched data from displacing active data and introducing an excessive number of additional conflict cache misses. A disadvantage of the TriMedia design, which is common to all


schemes that prefetch on accesses rather than on misses, is that a load operation that references an address within a prefetch region incurs two tag accesses and comparisons: one to resolve the load operation, and one to resolve whether the prefetch data is present in the cache or must be prefetched.

Software prefetching uses nonblocking prefetch instructions to initiate the migration of data to local caches before subsequent memory operations reference the data [20]. The technique is particularly effective in loops, as a loop body can often be constructed so that each iteration prefetches data that is accessed in a subsequent iteration. However, the prefetch instructions compete with other instructions for issue bandwidth, and the cache must be designed to handle multiple outstanding misses. Software prefetching may perform better than hardware prefetching when data access patterns lack regular structure and there is enough independent computation available to hide the memory access latency.

Vector processors can generate large numbers of independent parallel memory requests when applications exhibit structured memory access patterns [123]. Vector microprocessors typically use caches to capture temporal locality and reduce off-chip memory bandwidth demands [38]. The number of vector registers that are available to receive loads can limit the number of vector load operations that may be in flight, and consequently the extent to which vector memory operations can be used to hide memory access latencies. However, techniques such as vector register renaming [85] can be used to increase the number of physical registers and effective buffer capacity available for receiving loads.

The non-blocking cache designs required to satisfy the bandwidth demands of vector processors are complex, as they must support large numbers of concurrent misses to allow pending memory operations to saturate the available memory bandwidth. The Cray SV1 processor allows copies of vector data to be cached in a streaming cache chip. It uses a block size of a single word to avoid wasting memory bandwidth on non unit-stride accesses. The data caches required associative tag arrays capable of tracking hundreds of pending memory transactions [39]. Later designs used larger cache blocks to reduce the area and complexity of the hardware structures needed to track pending transactions [1]. Often, vector caches use dedicated structures such as replay queues to resolve references to cache lines that are in transient states when a memory operation is attempted [1]. Typically, additional dedicated data buffers are used to stage data returned from the memory system until it can be transferred to the vector cache and vector registers. These buffers are needed to decouple the memory system and processor, as it is difficult to design deeply pipelined memory systems that allow a requesting processor to stall the delivery of data. Because data buffer entries are reserved when requests are generated, the capacity of the data return buffers limits the amount of data that may be in flight. The traversal of these data buffers imposes additional energy costs on remote memory accesses.

Vector refill/access decoupling reduces the hardware complexity and cost of vector caches by using simpler hardware to fetch data before it is accessed, and by using the cache to buffer data returned from the memory system [17].
Vector memory operations are issued to a dedicated vector refill unit that attempts to fetch data that is missing from the cache before the vector memory operations are executed. This requires that the control processor be able to run ahead and queue vector memory and compute operations for the vector units, and relies on there being enough entries in the command queues to expose a sufficient number of pending vector commands to hide memory access latencies. The additional tag accesses performed by the


vector refill unit impose an additional energy cost on data cache accesses.

The Imagine stream processor [81] is unique in its use of stream load and store operations to transfer data between off-chip memory and its stream register file. It also provides stream send and receive operations that allow data to be transferred between the stream register file and remote devices connected by an interconnection network. The stream operations transfer records, and support non-unit stride, indexed, and other addressing modes that are specific to signal processing applications. The scoreboard used to enforce dependencies between memory operations and compute operations tracks stream transfers rather than individual record transfers, and allows computation to overlap with the loading and storing of independent streams. Elements of streams stored in the stream register file are always accessed sequentially through dedicated stream buffers. The stream buffers that connect the arithmetic clusters to the stream register file exploit the sequential stream access semantics and access 8 contiguous stream register locations in parallel; this allows the 8 SIMD arithmetic clusters to access the stream register file in parallel.

8.6 Chapter Summary

This chapter described the Elm memory system. Elm is designed to support many concurrent threads executing across an extensible fabric of processors. Elm exposes a hierarchical and distributed on-chip memory organization to software, and lets software manage data placement and orchestrate communication explicitly. The organization allows the memory system to support a large number of concurrent memory operations.

Chapter 9

Conclusion

9.1 Summary of Thesis and Contributions

This dissertation presented the Elm architecture and contemplated the concepts and insights that informed its design. The central motivating insight is that the efficiency at which instructions and data are delivered to function units ultimately dictates the efficiency of modern computer systems. The major contributions of this dissertation are a collection of mechanisms that improve the efficiency of programmable processors. These mechanisms are manifest in the Elm architecture, which provides a system for exploring and understanding how the different mechanisms interact.

Efficiency Analysis — This dissertation presented an analysis of the energy efficiency of a contemporary embedded RISC processor, and argued that inefficiencies inherent in conventional RISC architectures will prevent them from delivering the efficiencies demanded by contemporary and emerging embedded applications.

Instruction Registers — This dissertation introduced the concept of instruction registers. Instruction registers are implemented as efficient small register files that are located close to function units to improve the efficiency at which instructions are delivered.

Operand Registers — This dissertation described an efficient register organization that exposes distributed collections of operand registers to capture instruction-level producer-consumer locality at the inputs of function units. This organization exposes the distribution of registers to software, and allows software to orchestrate the movement of operands among function units.

Explicit Operand Forwarding — This dissertation described the use of explicit operand forwarding to improve the efficiency at which ephemeral operands are delivered to function units. Explicit operand forwarding uses the registers at the outputs of function units to deliver ephemeral data, which avoids the costs associated with mapping ephemeral data to entries in the register files.


Indexed Registers — This dissertation presented a register organization that exposes mechanisms for accessing registers indirectly through indexed registers. The organization allows software to capture reuse and locality in registers when data access patterns are well structured, which improves the efficiency at which data are delivered to function units.

Address Stream Registers — This dissertation presented a register organization that exposes hardware for computing structured address streams as address stream registers. Address stream registers improve efficiency by eliminating address computations from the instruction stream, and improve performance by allowing address computations to proceed in parallel with the execution of independent operations.

The Elm Ensemble Organization — This dissertation presented an Ensemble organization that allows software to use collections of coupled processors to exploit data-level and task-level parallelism efficiently. The Ensemble organization supports the concept of processor corps, collections of processors that execute a common instruction stream. This allows processors to pool their instruction registers, and thereby increase the aggregate register capacity that is available to the corps. The instruction register organization improves the effectiveness of the processor corps concept by allowing software to control the placement of shared and private instructions throughout the Ensemble.

This dissertation presented a detailed evaluation of the efficiency of distributing a common instruction stream to multiple processors. The central insight of the evaluation is that the energy expended transporting operations to multiple function units can exceed the energy expended accessing efficient local instruction stores. In exchange, the additional instruction capacity may allow additional instruction working sets to be captured in the first level of the instruction delivery hierarchy. Consequently, the improvements in efficiency depend on the characteristics of the instruction working set. Elm allows software to exploit structured data-level parallelism when profitable by reducing the expense of forming and dissolving processor corps.

The Elm Memory System — This dissertation presented a memory system that distributes memory throughout the system to capture locality and exploit extensive parallelism within the memory system. The memory system provides distributed memories that may be configured to operate as hardware-managed caches or software-managed memory. This dissertation also described how simple mechanisms provided by the hardware allow the cache organization to be configured by software, which allows a virtual cache hierarchy to be adjusted to capture the sharing and reuse exhibited by different applications. The memory system exposes mechanisms that allow software to orchestrate the movement of data and instructions throughout the memory system. This dissertation described mechanisms that are integrated with the distributed memories to improve the handling of structured data movement within the memory system.

9.2 Future Work

There are many possible extensions and applications for the concepts and architectures presented in this dissertation.

Additional Applications — Mapping more applications to Elm, particularly system applications, would


provide additional insights into the performance and efficiency of the mechanisms and composition of mechanisms used in Elm. Many of the applications that we have used to evaluate Elm have regular data-level and task-level parallelism that maps naturally to the architecture. It would be interesting to explore how less regular applications map to the architecture. A real-time graphics pipeline for a system such as a mobile phone handset would provide tremendous insight. Real-time graphics applications have significant data-level and task-level parallelism while imposing demanding load balance and synchronization challenges. Graphics applications also have interesting working sets and substantial memory bandwidth demands, which would provide additional insight into the performance of the memory system.

Extensions to Multiple Chips — It would be difficult to satisfy the computation demands of certain high-performance embedded systems using a single Elm chip. Many embedded systems, such as those used in automobiles, contain multiple processors distributed throughout the physical system. It would be interesting to explore system issues that arise when systems are constructed using multiple Elm processors. Ideally, it would be possible to connect multiple chips directly without requiring additional bridge chips, as this would allow small nodes to be constructed as a collection of Elm chips on a single board, and it would be interesting to explore how the memory system and interconnection network designs could be extended to support this efficiently. It would be interesting to explore whether the physical links that are used to connect to memory chips could be reused to establish links between Elm modules, so that pins could be flexibly assigned to memory channels or inter-processor links when the system is assembled. There would be interesting programming systems issues that would arise from mapping applications across multiple chips, as the bandwidth and latency characteristics of the links connecting the chips would differ significantly from the links connecting processors that are integrated on the same die.

Extend Arithmetic Operations — One way to extend the Elm architecture would be to provide support for additional arithmetic operations. The Elm architecture described in this dissertation was designed primarily to explore certain concepts and mechanisms, and we did not spend a significant amount of time considering details such as what arithmetic and logic operations should be supported. The efficiency of the architecture on many signal, image, and video processing applications could be improved by introducing sub-word SIMD operations. We specifically decided not to implement arithmetic instructions that operate on packed sub-word operands because the efficiency benefits of such extensions are well understood and are orthogonal to the central concepts that we intended to explore and that informed the design of the architecture.

As the realized performance of modern high-performance computer systems is limited by power dissipation and cooling considerations, an interesting direction for continued research would be to explore how the concepts in Elm could be used in high-performance computer systems. This would obviously entail supporting floating point arithmetic operations, and would require that the processor architecture be extended to tolerate function units with longer operation latencies.

Efficient Circuits — Custom circuit techniques could be used to provide additional efficiency improvements. We designed the Elm architecture to allow most of the register files, forwarding networks, and communication fabrics to be realized using custom circuit design techniques without excessive design efforts. For example, low-voltage swing signaling techniques could be used to reduce the dynamic energy consumed


delivering operands and data to the processors and function units. It would be interesting to quantify the advantages of using aggressive custom circuits, and to compare the advantages of an optimized implementation to that of a system that supports instruction set customization within a conventional automated design flow.

Appendix A

Experimental Methodology

This supplementary chapter describes the experimental methodology and simulation infrastructure that we used to produce the data presented in the dissertation.

A.1 Performance Modeling and Estimation

The various performance data presented in this dissertation were collected from cycle-accurate and microarchitecture-precise simulators. We developed a configurable processor and system simulator for the Elm architecture described in this dissertation. The processor model implemented in the simulator is cycle-accurate and microarchitecture-precise, and allows for detailed data collection. The simulator model accepts a machine description file that allows parameters such as the instructions that may be executed by different function units, the number and organization of registers, and the capacity of different memories to be bound when a simulation loads. The system model similarly allows the number and organization of processors and various memories to be configured when a simulation loads. We developed a collection of test codes to test the functional correctness of the simulator and related simulation tools.

The simulator remains the primary software development infrastructure, and supports various features that are useful for testing and debugging applications, such as the ability to set breakpoints based on the contents of registers and memory locations, and the ability to dump various traces to files. The simulator implements a simple symbolic debugger that can be used to inspect and modify the state of the system as a simulation proceeds. To simplify debugging and assist with the development of regression tests, the simulator allows object code to specify symbolic information that may be used to refer to processors and memory locations. The simulator also supports features, such as the ability for any processor to handle system calls, that simplify the testing of application code but would not be supported in most Elm systems.

We developed complete register-transfer level (RTL) models of the primary modules from which an Elm system is assembled. These include an Elm Ensemble, which includes a collection of 4 processors and shared memory, and a distributed memory module. These were used to validate the correctness and accuracy of the simulator models by running collections of test codes and kernels on both the simulator model and


the hardware model. The hardware model was tested extensively, with the intention that it be of sufficient quality to allow us to fabricate a test chip. The simulator was instrumented to provide detailed information about which instructions were executed, which registers were accessed and bypassed, and so forth. We used the data provided by the instrumentation to assess the extent to which the collection of test codes exercised the hardware model. The RTL models are implemented in Verilog. The models implement fixed hardware configurations, and provide limited support for debugging and data collection.

The application data presented in this thesis was collected from compiled code. We developed a compiler and a programming system for Elm. All of the data reported in the dissertation are for codes compiled using these tools. The compiler and related tools are described in Chapter 4.

The kernel and application data reported correspond to steady state behavior. Most of the kernels and applications presented in this dissertation process streams of data and have a few small critical instruction working sets. Consequently, the applications exhibit stable instruction working sets. We always wait until the instruction working sets are stable before beginning measurement; typically, this is achieved by waiting until the first few elements of the data stream have been processed before enabling measurement capture, though it sometimes involves executing applications with complicated control-flow multiple times. For most applications, we are able to fetch the resident data working sets into the local data memory and caches before beginning measurement. Consequently, most data cache misses and remote memory accesses correspond to compulsory misses, which occur when an application first accesses an element of a data stream.

A.2 Power and Energy Modeling

The reported power and energy data are for circuits implemented in a commercial 45 nm CMOS process that is intended for low standby-power applications, such as mobile telephone handsets. The process provides transistors with gate oxides and relatively high threshold voltages that reduce leakage, and uses a nominal supply voltage of 1.1 V to compensate for the high threshold voltages.

With the exception of array circuits, we used synthesized RTL models of hardware to estimate power and energy data. The hardware is described in Verilog and VHDL. We used Synopsys Design Compiler to synthesize RTL models, and to map the models to an extensive standard cell library provided by the foundry. The RTL models use clock gating extensively throughout, and we used the automatic clock gating capabilities provided by Design Compiler to capture most of the remaining clocked elements.

We used Cadence Silicon Encounter to map the gate-level models to designs that are ready for fabrication. We used a design flow that follows that recommended by the foundry. We imported a floor plan that describes where modules should be placed, and fixed the placement of any hard macros such as memories. We used Encounter to construct the power distribution grid, and placed isolation fences and routing blockages around memories. We used Encounter to automatically place the standard cells, and then synthesized a clock tree. We then used Encounter to route and optimize the design. Encounter removes most of the inverters and buffers from the gate-level model produced by Design Compiler; the optimization phase inserts inverters and buffers, restructures logic, and resizes gates to improve timing. We closed timing using the fast library models for hold time constraints and the slow library models for setup time constraints. After inserting metal fill and fixing


design rule violations, we used Encounter to extract interconnect capacitances from the design, and output a gate-level model and accompanying standard delay format file that includes delays due to interconnect loads.

We used Mentor Graphics ModelSim to simulate the execution of application code on the gate-level model produced by Encounter. We imported the delay file into the gate-level simulation to more accurately model signal propagation delays. We captured a value change dump (VCD) file during the simulation, which we subsequently imported into Encounter to estimate power. The VCD file captures the activity at every node in the gate-level simulation. We waited until an application reached steady state before beginning the VCD capture.

To estimate the energy consumed executing some block of code, we imported the VCD file captured during its execution into Encounter and used Encounter to compute an estimate of the average power consumption. Encounter uses the VCD file to compute an average toggle rate for each node in the design. It uses the extracted interconnect capacitances and detailed activity-based power models provided in the standard cell library, which are derived from transistor-level simulations, to estimate power consumption. The estimated power consumption includes leakage power. We used the typical process library models, assumed an operating temperature of 55 °C, and used a nominal supply voltage of 1.1 V to estimate leakage and dynamic power. Encounter provides a detailed accounting of the power consumed by each standard cell in the design, which includes the power consumed driving interconnect connected to the outputs of a cell. We tabulated these to calculate the power consumed within specific modules within a design based on the module hierarchy extracted from the gate-level model. Encounter inserts some cells at the top of the module hierarchy when fixing design rule violations; we were generally able to infer which modules these are associated with based on the cell and net names. Finally, we integrated the average power to estimate the energy consumed executing a kernel.
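
The final integration step amounts to the following relation, restated here in our own notation:

\[ E = \bar{P} \cdot T_{\text{exec}} = \bar{P} \cdot \frac{N_{\text{cycles}}}{f_{\text{clk}}} \]

where \bar{P} is the average power reported by Encounter, N_cycles is the number of simulated cycles in the captured interval, and f_clk is the clock frequency.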

We used a similar procedure to estimate the energy consumed performing basic logic and arithmetic operations, such as adding two 32-bit numbers. The procedure differed in that we simulated a single hardware module performing a sequence of operations in isolation. We used the resulting estimate of average power from Encounter to calculate the average energy expended performing each operation.

We used a different methodology to estimate the energy consumed accessing array circuits, such as register files and memories. These circuits exhibit significant structure and regularity, and typical implementations use custom designed circuits extensively. When adequate models are available from the foundry, we simply used the detailed energy models provided with the models. For example, the memories used to implement the caches for the processor discussed in Chapter 3 were provided by the foundry. When models are not available, we used custom energy models. These models are based on HSPICE simulations of circuits that we designed. We implemented the energy models as a collection of software tools that accept parameterized hardware descriptions and calculate estimates for the energy consumed accessing memory arrays, register files, and caches. We validated the software models by comparing estimates to HSPICE simulations of specific configurations. The HSPICE simulations include parasitic transistor and interconnect capacitances. We used Cadence Virtuoso to input layouts, and Synopsys StarRC to extract parasitic and interconnect capacitances. The layouts include 50% metal fill above the circuit being modeled. We used HSPICE to simulate


the transistor-level models extracted from layout, and measured the current drawn from the supply to estimate the energy consumed performing different operations. The circuit models include estimates of parasitic capacitances that are extracted from the layouts.

Appendix B

Descriptions of Kernels and Benchmarks

This supplementary chapter describes the kernels and benchmark codes that were used to evaluate the concepts and implementations described in the dissertation.

B.1 Description of Kernels

This section presents kernels that are used to study the performance and efficiency of different processor architectures discussed in the dissertation. The kernels are implemented in a variant of the C programming language that supports extensions for the Elk programming language.

aes

The advanced encryption standard (AES) is the common name of the Rijndael [31] symmetric-key cryptographic algorithm that was selected by the National Institute of Standards and Technology to protect sensitive information in an effort to develop a Federal Information Processing Standard [109]. The Rijndael algorithm is an iterated block cipher; it specifies a transformation that is applied iteratively on the data block to be encrypted or decrypted. Each iteration is referred to as a round, and the transformation a round function. Rijndael also specifies an algorithm for generating a series of subkeys from a key provided by the user at the start of the algorithm. Rijndael uses a substitution-linear transformation network with 10, 12, or 14 rounds; the number of rounds depends on the key size. The aes kernel uses 10 rounds and a 128-bit key. Each of the cryptographic operations operates on bytes, and the data to be processed is partitioned into an array of bytes before the algorithm is applied. The round function consists of four layers. The first layer is a substitution layer that applies a substitution function to each byte. The second and third layers are linear mixing layers, in which the rows of the array are shifted and the columns are mixed. The fourth layer applies an exclusive-or operation to combine subkey bytes into the array. The column mixing operation is skipped in the last round.
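
The round structure described above can be sketched as follows. This is an illustrative sketch, not the Elm kernel source: it assumes AES-128 with an expanded key schedule of eleven 16-byte subkeys, the four layer functions are assumed to be implemented elsewhere, and the initial subkey addition that standard AES performs before the first round is included.

#include <stdint.h>

extern void sub_bytes(uint8_t state[16]);       /* substitution layer */
extern void shift_rows(uint8_t state[16]);      /* row shifting (linear mixing) */
extern void mix_columns(uint8_t state[16]);     /* column mixing (linear mixing) */
extern void add_round_key(uint8_t state[16], const uint8_t subkey[16]);

void aes_encrypt_block(uint8_t state[16], const uint8_t subkeys[11][16])
{
    add_round_key(state, subkeys[0]);            /* initial subkey addition */
    for (int round = 1; round <= 10; round++) {
        sub_bytes(state);
        shift_rows(state);
        if (round < 10)
            mix_columns(state);                  /* skipped in the last round */
        add_round_key(state, subkeys[round]);
    }
}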


conv2d

Two-dimensional convolution is a common operation in image and signal processing, and in computer vision applications. It is used to filter images, and is used in edge and feature detection algorithms. The structure of the data access patterns and computation is similar to that used to compute disparity maps, from which depth information may be inferred in machine vision applications, and to perform motion vector searches in digital video applications.

The conv2d kernel applies the filter kernels used by the Sobel operator, which is used within edge detection algorithms in image processing applications [101]. The Sobel operator is a discrete differentiation operator that computes an approximation of the gradient of an image intensity function. It is based on convolving the image with a small, separable integer-valued filter in the horizontal and vertical directions, and the arithmetic operations required are therefore inexpensive to compute. The operator uses two 3 × 3 filter kernels, which are convolved with an image to compute approximations to the horizontal and vertical derivatives.
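
A scalar sketch of the Sobel filtering step is shown below. It is illustrative rather than the Elm conv2d source: 8-bit grayscale pixels, a row-major image layout, clamping to 255, and skipping of the one-pixel border are all assumptions made for brevity.

#include <stdint.h>
#include <stdlib.h>

static const int gx[3][3] = { { -1, 0, 1 }, { -2, 0, 2 }, { -1, 0, 1 } };
static const int gy[3][3] = { { -1, -2, -1 }, {  0, 0,  0 }, {  1, 2,  1 } };

/* Apply the Sobel operator to an 8-bit grayscale image; border pixels are skipped. */
void sobel(const uint8_t *src, uint8_t *dst, int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int sx = 0, sy = 0;
            for (int i = -1; i <= 1; i++) {
                for (int j = -1; j <= 1; j++) {
                    int p = src[(y + i) * width + (x + j)];
                    sx += gx[i + 1][j + 1] * p;    /* horizontal derivative estimate */
                    sy += gy[i + 1][j + 1] * p;    /* vertical derivative estimate */
                }
            }
            int mag = abs(sx) + abs(sy);           /* approximate gradient magnitude */
            dst[y * width + x] = (uint8_t)(mag > 255 ? 255 : mag);
        }
    }
}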

crc

Cyclic redundancy checks (CRCs) are hash functions designed to detect modifications of data, are particularly well suited to detecting error bursts, and are commonly used in digital communication networks and data storage systems to detect data corruption. The computation of a CRC resembles polynomial long division, except that the polynomial coefficients are calculated according to the modulo arithmetic of a finite field. The input data is treated as the dividend, the generator polynomial as the divisor, and the remainder is retained as the result of the CRC calculation. The crc kernel uses the 32-bit CRC-32 polynomial recommended by the IEEE and used in the IEEE 802.3 standards defining aspects of the physical and data link layers of wired Ethernet.
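
A bitwise sketch of the computation is shown below. It uses the reflected form of the IEEE 802.3 CRC-32 polynomial (0xEDB88320) with the customary initial and final inversions; it is an illustrative reference implementation rather than the Elm crc kernel, which could equally use a table-driven or word-at-a-time formulation.

#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 over a byte stream, using the reflected IEEE 802.3 polynomial. */
uint32_t crc32(const uint8_t *data, size_t length)
{
    uint32_t crc = 0xFFFFFFFFu;              /* initial remainder */
    for (size_t i = 0; i < length; i++) {
        crc ^= data[i];                      /* fold the next byte into the remainder */
        for (int bit = 0; bit < 8; bit++)    /* one polynomial division step per bit */
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;                /* final inversion */
}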

crosscor

In signal processing applications, cross-correlation provides a measure of similarity between two signals. The computation of the cross-correlation of two sequences resembles the computation of the convolution of two sequences. It can be used to detect the presence of known signals as components of other more complex and possibly noisy signals, and has applications in passive radar systems. In signal detection applications, the cross-correlation is typically computed as a function of a displacement or lag that is applied to one of the signals. The crosscor kernel computes the cross-correlation of two discrete time signals at 16 time lags using a window of 64 samples. The signals are represented as sequences of 16-bit real-valued samples. The kernel computes a set of 16 cross-correlation coefficients, one for each time lag, and produces a set after each sample period.
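
A scalar sketch of the computation is shown below. The 64-sample window and 16 lags follow the description above, while the argument layout, the absence of rescaling, and the function name are assumptions made for illustration.

#include <stdint.h>

#define WINDOW 64
#define LAGS   16

/* Compute one set of cross-correlation coefficients: r[lag] correlates x against
   y delayed by lag samples over a 64-sample window. No rescaling is applied. */
void crosscor(const int16_t x[WINDOW], const int16_t y[WINDOW + LAGS], int32_t r[LAGS])
{
    for (int lag = 0; lag < LAGS; lag++) {
        int32_t acc = 0;
        for (int n = 0; n < WINDOW; n++)
            acc += (int32_t)x[n] * (int32_t)y[n + lag];
        r[lag] = acc;
    }
}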


dct

The discrete cosine transform [7] is an orthogonal transform representation that is similar to the discrete Fourier transform (DFT), but assumes both periodicity and even symmetry. Because the definition of the discrete cosine transform (DCT) involves only real-valued coefficients, the transformation of real-valued signals can be computed efficiently without using complex arithmetic operations. The DCT is common in digital signal and image processing applications, and the type-2 DCT is often used when coding for compression. The type-2 DCT is used in many standardized lossy compression algorithms because it tends to concentrate signal information in low-frequency components, which allows high-frequency components to be coarsely quantized or discarded. For example, both the algorithms defined by the JPEG image compression and the MPEG video compression standards use discrete cosine transforms.

The dct kernel implements the type-2 two-dimensional DCT used in JPEG image compression. The transform is applied to the luminance and chrominance components of 8 × 8 blocks of pixels, producing equivalent frequency domain representations. The transform data require more bits to represent than the input image components. The luminance and chrominance components of the input image pixels are represented as 8-bit samples; the transformed samples are represented as 16-bit fixed-point samples. The transform is implemented as a horizontal transform across the image columns followed by a vertical transform across the rows.
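
The separable structure can be sketched as follows. For clarity the sketch uses floating-point arithmetic rather than the kernel's fixed-point representation; the normalization follows the usual JPEG definition of the type-2 DCT.

#include <math.h>

static const double PI = 3.14159265358979323846;

/* 8-point type-2 DCT of one row or column. */
static void dct8(const double in[8], double out[8])
{
    for (int u = 0; u < 8; u++) {
        double cu = (u == 0) ? (1.0 / sqrt(2.0)) : 1.0;
        double acc = 0.0;
        for (int x = 0; x < 8; x++)
            acc += in[x] * cos((2 * x + 1) * u * PI / 16.0);
        out[u] = 0.5 * cu * acc;
    }
}

/* Separable 8 x 8 type-2 DCT: a horizontal pass over the rows followed by a
   vertical pass over the columns. */
void dct2d(const double block[8][8], double coeff[8][8])
{
    double tmp[8][8], col[8], res[8];

    for (int i = 0; i < 8; i++)
        dct8(block[i], tmp[i]);

    for (int j = 0; j < 8; j++) {
        for (int i = 0; i < 8; i++)
            col[i] = tmp[i][j];
        dct8(col, res);
        for (int i = 0; i < 8; i++)
            coeff[i][j] = res[i];
    }
}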

fir

Finite impulse response (FIR) filters are prevalent in digital signal processing applications. They are inherently stable and can be designed to be linear phase. Common applications include signal conditioning, in which a signal from an analog-to-digital converter is digitally filtered before subsequent signal processing is performed. The fir kernel implements a 32-tap real-valued filter with 16-bit filter coefficients.
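A direct-form C sketch of such a filter follows; the interface and the organization of the sample delay line are illustrative assumptions.

#include <stdint.h>

#define TAPS 32

/* Direct-form FIR filter: the output is the sum over k of h[k] * x[k],
 * with 16-bit coefficients and samples accumulated into a 32-bit result.
 * x[0] holds the newest sample and x[TAPS-1] the oldest. */
int32_t fir(const int16_t h[TAPS], const int16_t x[TAPS])
{
    int32_t acc = 0;
    for (int k = 0; k < TAPS; k++)
        acc += (int32_t)h[k] * (int32_t)x[k];
    return acc;
}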

huffman

Huffman coding is an entropy coding algorithm for lossless data compression that uses a variable-length binary code table to encode symbols [70]. The variable-length codes used in Huffman codes are selected so that the bit string representing one symbol is never a prefix of a bit string representing a different symbol. The Huffman algorithm specifies how to construct a code table for a set of symbols, given the expected frequencies of the symbols, to minimize the expected codeword length. Essentially, the code table is constructed so that more common symbols are always encoded using shorter bit strings than less common symbols. The term Huffman coding is often used to refer to prefix coding schemes regardless of whether the code is produced using Huffman’s algorithm. The huffman kernel implements the encoding scheme specified for the coding of quantized DC coefficients in the JPEG image compression standard. Rather than encode the DC coefficients directly, the scheme encodes the differences between the DC coefficients of successive transform blocks. The kernel computes and encodes the sequence of differences using code tables specified by the standard, assembling the resulting sequence of variable-length codes to form the coded bit-stream as symbols are coded.
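The assembly of variable-length codes into a bit-stream can be sketched in C as follows; the bit-writer structure, the MSB-first packing order, and the names are assumptions made for the example and are not taken from the kernel.

#include <stddef.h>
#include <stdint.h>

/* Simple MSB-first bit writer used to concatenate variable-length codes
 * into a byte buffer. */
typedef struct {
    uint8_t *buf;    /* output buffer                       */
    size_t   byte;   /* index of the byte currently filled  */
    int      bit;    /* number of bits used in that byte    */
} bitwriter_t;

/* Append the low len bits of code to the stream, most significant bit first. */
void put_bits(bitwriter_t *bw, uint32_t code, int len)
{
    for (int i = len - 1; i >= 0; i--) {
        if (bw->bit == 0)
            bw->buf[bw->byte] = 0;
        if ((code >> i) & 1u)
            bw->buf[bw->byte] |= (uint8_t)(0x80u >> bw->bit);
        if (++bw->bit == 8) {
            bw->bit = 0;
            bw->byte++;
        }
    }
}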


jpeg

The JPEG standard defines a set of compression algorithms for photographic digital images. The lossy compression schemes are used in most digital cameras. Typically, JPEG encoding entails converting an image from RGB to YCbCr, spatially downsampling the chrominance components, splitting the resulting components into blocks of 8 × 8 components, converting the blocks to a frequency-domain representation using a discrete cosine transform, quantizing the resulting frequency components, and coding the quantized frequency components using a lossless coding scheme. The conversion from RGB to YCbCr and the spatial downsampling of the chrominance components may be omitted. The quantization step provides most of the coding efficiency; it significantly reduces the amount of high frequency information that is encoded by coarsely quantizing high frequency components. The entropy coding scheme uses a run-length encoding scheme to encode sequences of zeros and an entropy coding scheme that is similar to Huffman coding to code non-zero symbols. The jpeg kernel implements a JPEG encoder that encodes 24-bit RGB images and produces coded bit-streams. The jpeg kernel contains the DCT operation that is implemented in the dct kernel, and the DC Huffman coding operation that is implemented in the huffman kernel.

rgb2yuv

The YUV color space is used to represent images in common digital image and video processing applications because it accounts for human perception and allows less bandwidth to be allocated for chrominance components. Digital image and video compression schemes that are lossy and rely on human perceptual models to mask compression artifacts, such as those used in the JPEG and MPEG standards, typically operate on images that are represented in the YUV color space. Most commodity digital image sensors use a color filter array that selectively passes red, green, or blue light to selected pixel sensors, with missing color samples interpolated using a demosaicing algorithm. Consequently, digital cameras and digital video recorders that store images and video in a compressed format must convert image frames from RGB space to YUV space before encoding. The rgb2yuv kernel implements the RGB to YUV conversion algorithm used in the MPEG digital video standards. The conversion uses 32-bit fixed-point arithmetic and 16-bit coefficients to map 8-bit interlaced RGB components to 8-bit interlaced YUV components. A positive bias of 128 is added to the U and V components to ensure they always assume positive values. The Y component is clamped to the range [0,235], and the U and V components are clamped to the range [0,240].
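A C sketch of the conversion follows; the bias of 128 and the clamping ranges follow the description above, while the Q15 fixed-point coefficient values are illustrative BT.601-style constants rather than the exact coefficients used by the rgb2yuv kernel.

#include <stdint.h>

/* Clamp a value to the range [lo, hi] and narrow it to 8 bits. */
static uint8_t clamp_u8(int32_t v, int32_t lo, int32_t hi)
{
    if (v < lo) v = lo;
    if (v > hi) v = hi;
    return (uint8_t)v;
}

/* Convert one RGB pixel to YUV using Q15 fixed-point coefficients
 * (illustrative BT.601-style constants).  The bias of 128 is folded into
 * the U and V sums before the shift; Y is clamped to [0,235] and U and V
 * to [0,240], as described in the text. */
void rgb2yuv(const uint8_t rgb[3], uint8_t yuv[3])
{
    int32_t r = rgb[0], g = rgb[1], b = rgb[2];

    int32_t y = ( 9798 * r + 19235 * g +  3736 * b) >> 15;
    int32_t u = (-5538 * r - 10846 * g + 16384 * b + (128 << 15)) >> 15;
    int32_t v = (16384 * r - 13730 * g -  2654 * b + (128 << 15)) >> 15;

    yuv[0] = clamp_u8(y, 0, 235);
    yuv[1] = clamp_u8(u, 0, 240);
    yuv[2] = clamp_u8(v, 0, 240);
}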

viterbi

The Viterbi algorithm [145] is a dynamic programming algorithm for finding the most likely sequence of hidden states that explains a sequence of observed events. It is used in many signal processing applications to decode bit-streams that have been encoded using forward error correction based on convolutional codes. Viterbi decoders are common in digital cellular mobile telephone systems, wireless local area network modems, and satellite communication systems. High-speed Viterbi detectors are used in magnetic disk drive read channels. The algorithm is also widely used in applications such as speech recognition, where hidden Markov models are used as language models, which construct sentences as sequences of words; word models, which represent words as sequences of phonemes; and acoustic models, which capture the evolution of acoustic signals through phonemes [83]. The algorithm operates in two phases. The first phase processes the input stream and computes branch and path metrics for paths traversing a trellis. Branch metrics are computed as normalized distances between the received symbols and symbols in the code alphabet, and represent the cost of traversing a specific branch within the trellis. The path metric computation summarizes the branch metrics, and computes the minimum cost of arriving at a specific state in the trellis. The path metric computation is dominated by the add-compare-select operations that are used to compute the minimum cost of arriving at a state. After all of the symbols in an input frame have been processed, the second phase computes the most likely sequence of transmitted data by tracing backwards. The viterbi kernel implements the decoder specified for the half-rate voice codec defined by the Global System for Mobile Communications (GSM), which is one of the most widely deployed standards for mobile telephone systems. The half-rate codec is used to compress 3.1 kHz audio to accommodate a 6.5 kbps transmission rate. The decoder has a constraint length of 5 and operates on a 189-bit frame. The input frame is represented as 3-bit soft-decision sample values, in which each input value encodes an estimate of the reliability of the corresponding symbol in the received bit-stream.
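The add-compare-select operation that dominates the path metric computation can be sketched in C as follows; the metric widths and the decision encoding are illustrative and do not reproduce the exact data layout of the GSM half-rate decoder.

#include <stdint.h>

/* One add-compare-select step: extend the two paths entering a trellis
 * state by their branch metrics, keep the cheaper one, and record the
 * decision bit for the traceback phase. */
uint32_t acs(uint32_t pm0, uint32_t bm0,
             uint32_t pm1, uint32_t bm1,
             uint8_t *decision)
{
    uint32_t cand0 = pm0 + bm0;
    uint32_t cand1 = pm1 + bm1;
    if (cand1 < cand0) {
        *decision = 1;   /* path 1 survives */
        return cand1;
    }
    *decision = 0;       /* path 0 survives */
    return cand0;
}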

B.2 Benchmarks

The benchmarks used to evaluate Elm were implemented in Elk, a parallel programming language and runtime system we developed for mapping applications to Elm systems. The Elk programming language allows application developers to convey information about the parallelism and communication in applications, which allows the compiler to automatically parallelize applications and implement efficient communication schemes. Aspects of the Elk programming language syntax were inspired by StreamIt [141], and the following benchmarks were adapted from StreamIt implementations [49].

bitonic

Bitonic sort is a parallel sorting algorithm [16]. The bitonic benchmark implements a bitonic sorting network that sorts a set of N keys by performing log2 N bitonic merge operations. The sort performs O(N log² N) comparisons.
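A compact C sketch of the comparison network follows for a power-of-two number of keys; it performs the compare-exchange operations in place, and the key type is an illustrative choice.

#include <stddef.h>

/* In-place bitonic sort of n keys, where n is a power of two.  The outer
 * loops enumerate the merge stages and their sub-stages; the inner loop
 * performs the compare-exchange operations of the sorting network. */
void bitonic_sort(int *a, size_t n)
{
    for (size_t k = 2; k <= n; k <<= 1) {          /* merge stage size   */
        for (size_t j = k >> 1; j > 0; j >>= 1) {  /* sub-stage distance */
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner > i) {
                    int ascending = ((i & k) == 0);
                    if ((ascending  && a[i] > a[partner]) ||
                        (!ascending && a[i] < a[partner])) {
                        int tmp = a[i];
                        a[i] = a[partner];
                        a[partner] = tmp;
                    }
                }
            }
        }
    }
}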

dct

The dct benchmark implements a two-dimensional inverse discrete cosine transform (IDCT) that complies with the IEEE specification of the DCT used in both the MPEG and JPEG standards. The IDCT is implemented as one-dimensional IDCT operations on the rows of the input data followed by one-dimensional IDCT operations on the resulting columns. The application is structured to allow the compiler to map individual row and column IDCT operations to different processing elements.

des

The des benchmark implements the data encryption standard (DES) block cipher, which is based on a symmetric-key algorithm that uses a 56-bit key and operates on 64-bit blocks [25]. The application is structured to allow the initial and final permutation operations, and each of the operations comprising the 16 rounds, to be mapped to different processing elements.

fmradio

The fmradio benchmark implements a software FM radio with a multi-band equalizer. The input signal stream is passed through a demodulator to recover the audio signal, which is then passed through an equalizer. The equalizer is implemented as a set of band-pass filters.

radar

Synthetic aperture radar combines a series of radar observations to produce an image of higher resolution than could be obtained from a single observation [130]. The radar benchmark implements an algorithm for reconstructing a synthetic aperture radar image using spatial frequency interpolation.

vocoder

The vocoder benchmark implements a voice bit-rate reduction coder for speech transmission systems that compresses the spectral envelope into low frequencies [129]. Mobile telephone systems use similar algorithms to reduce the channel bit-rates required to transmit voice signals with acceptable audio quality. All of the deconvolution, convolution, and filtering steps required by the scheme are performed in the frequency domain, and the coder applies an adaptive discrete Fourier transform (DFT) to successive windows of the input data to compute a high resolution spectrum of the speech.

Appendix C

Elm Instruction Set Architecture Reference

This supplementary chapter describes various aspects of the Elm instruction set architecture that reflect ideas that were presented in the dissertation. It does not provide a complete description of the instruction set and instruction set architecture.

C.1 Control Flow and Instruction Registers

This section describes aspects of the Elm instruction set architecture related to instruction registers. The Elm prototype executes statically scheduled pairs of instructions. The first instruction in the pair is executed in the ALU function unit, and the second instruction in the pair is executed in the XMU function unit. Control flow instructions and instruction register management instructions execute in the XMU function unit, and must be scheduled in the second instruction slot.

Control Flow Instructions

Control flow instructions transfer control within the instruction register file. Conditional control flow instructions such as branches are implemented using predicate registers to control whether the operation specified by a control flow instruction is performed. Control flow instructions use instruction register names to identify the instruction to which control transfers. Names may be explicit, such as ir31. Explicit names may also be relative to the address of the control flow instruction, such as ip+4. This allows instructions within an instruction block to refer to each other without requiring that an instruction block be loaded at a fixed location in an instruction register file. Assembly code typically uses labels to specify relative targets of control flow instructions. The assembler recognizes labels that use the syntax @label, and converts them into explicit instruction register names when machine code is generated.


opcode:8 dst:8 rs1:6 extend:10

Figure C.1 – Encoding of Jump Instructions. The dst field encodes the destination, which specifies the instruction register to which control transfers. The destination may be specified as an absolute destination or a relative destination. Relative targets are resolved when the instruction is loaded into the instruction register.

Jump [ jmp ]

Jump instructions transfer control to the instruction stored in the instruction register specified as the destination. Conditional jumps are implemented using predicate registers to control whether the jump is performed. Control transfers when the predicate in the register specified by rs1 agrees with the predicate condition.

jmp.always <dst>
jmp[.set|.clear] <dst> <rs1>

The instruction format used to encode jump instructions is shown in Figure C.1. This instruction causes a pipeline stall in the instruction fetch (IF) stage when the destination instruction register has a load pending. The jump instructions implemented in Elm have a single delay slot, and the instruction pair that follows the jump instruction is always executed regardless of whether the jump transfers control.

Loop [ loop ]

Loop instructions provide variants of jump instructions that expose hardware support for implementing loop constructs.

loop.always <dst> (<count>)
loop[.set|.clear] <dst> <rs1> (<count>)

Control transfers when the predicate register specified by rs1 agrees with the predicate condition. If the predicate is set, indicating that the value zero is stored in the predicate register, the counter is loaded with the loop count encoded in the instruction; otherwise, the counter is decremented and the predicate updated. The counter is decremented in the decode stage; the counter is loaded in the write-back stage. The instruction encoding used for loop instructions is the same as the encoding used for jmp instructions, which is illustrated in Figure C.1.


opcode:8 target:6 offset:12 length:6

opcode:8 target:6 rs1:6 offset:6 length:6

Figure C.2 – Encoding of Instruction Load Instructions. The target field specifies the instruction register at which the instruction block will be loaded. The target may be specified as an absolute position or relative to the instruction pointer. The address at which the instruction block resides in memory may be specified as a fixed address encoded in the offset field or as an effective address computed by adding a short offset to the contents of rs1. The length of the instruction block is encoded in the length field.

Jump and Link [ jal ]

Jump and link operations allow software to implement procedure calls within the instruction registers. Jump and link operations transfer control to the instruction located at the instruction register specified as the destination and store the return instruction pointer address in the link register lr.

jal.always <dst>
jal[.set|.clear] <dst> <rs1>

As with conventional jumps, conditional transfers of control are implemented using predicate registers. Control transfers and the link register is updated when the predicate in the register specified by rs1 agrees with the predicate condition.

Jump Register [ jr ]

Jump register operations allow software to implement procedure returns within the instruction registers. Jump register operations transfer control to the instruction in the instruction register specified by the value stored in the link register lr.

jr.always
jr[.set|.clear] <rs1>

As with conventional jumps, conditional transfers of control are implemented using predicate registers. Control transfers when the predicate in the register specified by rs1 agrees with the predicate condition. Jump register operations may be used to implement transfers of control that specify destinations indirectly. For example, some switch statements may be implemented using jump register operations.


Predicate [ pred ]

Conditionally annuls the instruction that issued with the predicate instruction. Both the ALU and XMU function units can execute predicate instructions, and the predicate instruction may be paired with any other instruction.

pred[.set|.clear] <rs1>

If the predicate value stored in the predicate register specified by rs1 agrees with the predicate condition, the instruction that issues with the pred instruction is annulled in the decode stage.

Instruction Register Management

The following instructions allow software to load instruction blocks into instruction registers.

Instruction Load [ ild ]

Loads an instruction block to instruction registers.

ild <target> [<address>](<length>)

<target>  := <ir> | ip + <offset>
<address> := <offset> | <rs1> + <offset>

The instructions are loaded at the instruction register specified by the destination field, which may encode either an absolute or relative instruction register position. The location of the instruction block can be specified as an absolute address in memory, or it can be specified as a combination of an address register and offset. The instruction formats used to encode instruction load instructions are illustrated in Figure C.2. The instruction load appears to take effect one cycle after the ild instruction executes.

Instruction Load and Jump [ jld ]

Loads an instruction block into registers and then jumps into the block. The operands and instruction formats are the same as an instruction load.

jld <target> [<address>](<length>)

<target>  := <ir> | ip + <offset>
<address> := <offset> | <rs1> + <offset>

The jump instruction has a single delay slot, and the instruction load appears to take effect one cycle after the jld instruction executes. Consequently, the instruction pair that follows a jld instruction is always executed, and is read from the instruction registers before the first instructions from the instruction block load.


Figure C.3 – Forwarding Registers. The forwarding registers (labelled %tr0 and %tr1 in the figure) are located at the outputs of the function units, between the function units and the general-purpose register file (GRF) and operand register files (ORFs). Part (a) illustrates a processor with a single function unit and forwarding register; part (b) illustrates a processor with separate ALU and XMU function units, each with its own forwarding register.

C.2 Operand Registers and Explicit Operand Forwarding

This section discusses aspects of the instruction set architecture that are affected by explicit operand forwarding and operand registers. Both explicit operand forwarding and operand registers introduce additional architectural registers, and both affect how software schedules instructions.

Explicit Operand Forwarding

Explicit operand forwarding introduces forwarding registers to the instruction set architecture. The forwarding registers essentially expose the physical pipeline registers residing at the end of the execute (EX) stage of the pipeline. To expose additional opportunities for software to forward values through forwarding registers, the forwarding registers are preserved between instructions that update the pipeline registers, and are preserved when nop instructions execute. The forwarding registers are updated when an instruction departs the execute stage of the pipeline rather than when an instruction departs the write-back stage of the pipeline.


A forwarding register may be used as a source register in an instruction to specify that the instruction should read its operand directly from the corresponding pipeline register when the instruction reaches the execute stage of the pipeline. A forwarding register may be used as a destination register to specify that the result of the instruction should be discarded after departing the forwarding register. Effectively, the instruction is annulled in the write-back stage of the pipeline, and the result value is not driven back to the architectural register files. In general, a forwarding register lets software directly access the result of the last instruction that the function unit associated with the forwarding register executed; the value becomes available when the instruction is in the write-back stage, before the instruction completes. However, forwarding registers behave differently when updated by a memory operation. The forwarding register associated with the function unit that executes memory operations receives the effective address when the operation executes, as this is what is generated by the function unit in the execute stage of the pipeline. Thus, rather than storing the value read from memory when a load instruction executes, the forwarding register will store the address from which the load value was read. Furthermore, the forwarding register is updated both when store instructions and load instructions execute. Because forwarding registers are associated with function units, software must know where an instruction will execute to determine which forwarding register will be updated when the instruction executes. Determining which forwarding register is updated when a processor has a single function unit (Figure C.3a) is trivial: there is only one forwarding register, and the forwarded value of each instruction is written to that forwarding register. Determining which forwarding register is updated when a processor has multiple function units (Figure C.3b) may be more complicated. If each function unit executes a disjoint subset of the instruction types, the instruction type will determine which forwarding register receives the result of the instruction. For example, if the ALU is the only function unit that can execute multiply instructions, the result of a multiply instruction will always be available in the forwarding register associated with the ALU. Similarly, if the XMU is the only function unit that can execute load and store instructions, the forwarded value generated by a load or store instruction will always be available in the forwarding register associated with the XMU. The Elm instruction set architecture defines a statically scheduled superscalar processor. Software decides where each instruction executes, and consequently knows statically which forwarding register will be updated by each instruction when it executes. The situation is more complicated when multiple function units can execute an instruction. For example, multiple function units may be able to perform simple arithmetic operations, allowing an add instruction to execute on several function units. It may be possible for software to determine statically where an instruction will execute because of the registers that are referenced by the instruction. For example, if one of the source operands is an operand register, the instruction will execute on the function unit associated with the operand register.
Similarly, the appearance of a forwarding register as the destination register of an instruction will force the instruction to execute on the function unit associated with the forwarding register. In general, however, it will not be possible for software to determine statically which function unit will execute an instruction in a dynamic superscalar processor that allows instructions to be executed by more than one function unit.

Operand Registers

Operand registers are distributed among a processor’s function units, as depicted in Figure C.3. The distribution of the registers is exposed to software, and the registers in each of the operand register files use different names to distinguish the registers. The Elm instruction set architecture refers to the operand registers local to the arithmetic and logic unit as data registers, and the operand registers local to the load-store unit as address registers. The data registers are named dr0 – dr3, and the address registers ar0 – ar3. The register names are needed to distinguish the registers and reflect typical usage rather than an enforced usage or allocation policy; data registers may be used in address computations and address registers in non-address computations. Communication between the operand registers and function units is limited. An operand register can only be read by its local function unit, not by a remote function unit. However, an operand register can be written by any function unit, which allows the function units to communicate through operand registers rather than more expensive general-purpose registers. The instruction set architecture exposes the distribution of operand registers to software, and software is responsible for ensuring that an instruction can access all of its operands. This affects register allocation and instruction scheduling, as software must ensure that instructions will be able to access their operands when they execute. An instruction that attempts to read operand registers assigned to a remote function unit is considered malformed, and the result of the instruction is undefined and may differ between implementations. The centralized general-purpose registers back the operand registers, as depicted in Figure C.3. The Elm instruction set architecture names the general-purpose registers gr0 – gr31. These registers can be accessed by all of the function units. Normally, we would expect the general-purpose register file to provide enough read ports to allow both instructions to read their operands from the general-purpose registers as the instruction pair passes through the decode stage of the pipeline. However, a significant fraction of instructions read one or more of their operands from operand registers, which reduces demand for operand bandwidth at the general-purpose register file. The bandwidth filtering provided by the operand registers allows the number of read ports at the general-purpose register file to be reduced without adversely affecting performance. The reduction in the number of read ports at the general-purpose register file introduces a structural hazard in the decode stage of the pipeline. Consider, for example, a pair of instructions that execute together and attempt to read four operands from general-purpose registers. With only two read ports to the general-purpose register file, all four operands cannot be read in a single cycle. Instead, the instruction pair would need to spend two cycles in the decode stage, with two of the four operands read in each cycle. A hardware interlock could be implemented to detect such conditions. Rather than providing hardware interlocks, the instruction set architecture implemented in the Elm prototype exposes the number of read ports directly to software. The instruction set architecture allows an instruction pair to read at most two operands from general-purpose registers.
Hardware determines which operands are read from general-purpose registers and routes the register specifiers and operands accordingly. Software is responsible for ensuring that an instruction pair accesses no more than two general-purpose registers, which affects instruction scheduling and register allocation.

Forwarding and Interlocks

The introduction of explicit forwarding allows one to consider eliminating conventional hardware forwarding and instead to rely exclusively on software forwarding. Effectively, software would be responsible for forwarding results back to the function units using the forwarding registers. The operand registers provide storage locations with a one-cycle define-use latency, and the general-purpose registers provide storage locations with a two-cycle define-use latency. However, relying exclusively on software to forward values complicates scheduling and increases the number of nop instructions that need to be scheduled to ensure that operands are available where subsequent instructions expect them. It eliminates the hardware that tracks dependencies, which saves energy, but increases code size and results in more instructions being issued from the instruction registers. The Elm architecture provides hardware forwarding and interlocks because the reduction in the energy consumed loading and issuing instructions exceeded the energy consumed in the additional data-path control logic.

C.3 Indexed Registers and Address-Stream Registers

This section describes aspects of the instruction set architecture that relate to the index registers and address-stream registers.

Index and Indexed Registers

Index registers xp0 – xp3 store pairs of pointers that let software access the indexed registers indirectly. Indexed registers xr0 – xr2047 extend the general-purpose registers; indexed registers xr0 – xr31 are the same physical registers as general-purpose registers gr0 – gr31. The Elm instruction set architecture allows a processor to implement between 32 and 2048 indexed registers. Figure C.4 illustrates the format used to transfer 64-bit index register values between memory and the index registers. Each index register provides independent read and write pointers and independent read and write update increments. When an index register appears as a source register in an instruction, the read pointer value stored in the index register is used to read the operand from the indexed registers; the read pointer is updated in the decode stage after the operand is read from the indexed registers. When an instruction specifies an index register as a destination register, the write pointer value stored in the index register specifies the indexed register to which the result is written; the write pointer is updated in the write-back stage after the result of the operation is written to the destination indexed register. Should an index register appear in multiple source operand positions within an instruction pair, the read pointer is only updated once. Similarly, should an index register appear in multiple destination register positions within an instruction pair, the write pointer is only updated once. Software assigns each index register a range of indexed registers, and the read and write pointers are automatically adjusted to iterate through the assigned range of registers. Dedicated hardware automatically updates a pointer after it is used by adding the value stored in the associated update increment field to the current pointer value; the hardware sets the pointer to its base value after it reaches its limit value.


rc:4 base:12 wc:4 limit:12 ri:4 rp:12 wi:4 wp:12

Figure C.4 – Index Register Format. The base and limit fields must be programmed so that repeatedly adding the read increment, decoded from the ri field, and write increment, decoded from the wi field, to the pointers will eventually cause the pointers to match limit.

elements:16 increment:16 position:16 address:16

Figure C.5 – Address-stream Register Format. The elements field specifies the number of addresses in the stream. The increment field specifies the value to be added to address to generate the next address in the stream. The position field tracks the position in the stream. Both the position and address fields are set to zero when position reaches elements.

Software can configure an index register to access the indexed registers as though the indexed registers were 16-bit registers rather than 32-bit registers. The read and write pointers within an index register are configured independently, which allows the index registers to be used to pack and extract pairs of 16-bit values using the 32-bit indexed registers. When an index register is configured to read the indexed registers as 16-bit registers, the index register can be configured to place the 16-bit value that is read from an indexed register in the upper or lower half of the 32-bit operand. The index register can also be configured to sign-extend or zero-extend a 16-bit value when it is read from an indexed register. When an index register is configured to write 16-bit values to the indexed registers, the index register can be configured to write either the upper 16 bits or the lower 16 bits of a 32-bit result to the indexed registers. Because the appearance of an index register in an instruction causes the indexed registers to be accessed, most instructions cannot access the contents of an index register directly. Instead, the index register must be transferred to memory and modified there. The register value can be transferred between an index register and memory using the register save and restore operations described later.

Address-Stream Registers

The address-stream registers as0 – as3 establish a software interface to hardware in the XMU that accelerates the computation of fixed sequences of addresses. Each register has four fields that together capture the state of the address stream: elements, increment, position, and address. The address field holds the current address in the sequence. The elements field specifies the number of address elements in the sequence, and the position field holds the index of the current element in the sequence. The increment field specifies the value that is added to the value stored in the address field to compute the next address element in the sequence. Figure C.5 illustrates the format used to transfer 64-bit address-stream register values between memory and the address-stream registers.
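The address generation performed by the hardware can be modeled approximately in C as follows; the structure mirrors the four fields described above and the wrap-around behavior described in the caption of Figure C.5, and the function name is illustrative.

#include <stdint.h>

/* Software model of an address-stream register (see Figure C.5).
 * Assumes elements > 0. */
typedef struct {
    uint16_t elements;   /* number of addresses in the stream           */
    uint16_t increment;  /* added to address to form the next address   */
    uint16_t position;   /* index of the current element in the stream  */
    uint16_t address;    /* current address in the sequence             */
} asr_t;

/* Return the current address and advance the stream; position and
 * address are reset to zero once position reaches elements. */
uint16_t asr_next(asr_t *as)
{
    uint16_t addr = as->address;
    as->position = (uint16_t)(as->position + 1);
    as->address  = (uint16_t)(as->address + as->increment);
    if (as->position == as->elements) {
        as->position = 0;
        as->address  = 0;
    }
    return addr;
}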


Vector Memory Operations

Vector memory operations transfer multiple elements between indexed registers and memory. Vector memory operations are similar to conventional memory operations except that each vector memory operation transfers multiple elements. Vector memory operations provide the same semantics as executing an equivalent sequence of scalar load and store instructions. The number of elements that a vector memory operation transfers is specified by software and encoded in the vector memory instruction. Address-stream registers can be used to generate the effective memory addresses from which elements are loaded and to which elements are stored.

Vector Load — ld.v Vector load operations load multiple elements from memory to indexed registers. Vector load operations use the following assembly syntax.

ld.v <dst> <address>(<elements>)

<address> := [ <rs1> ] | [ <rs1> + <offset> ] | [ <rs1> + <as> ]

The destination register specifies the index register that generates the sequence of indexed registers to which the elements are loaded. The value of elements encodes the number of elements that will be loaded. Except when an address-stream register appears in the address field, the effective address computed from the contents of the source registers determines the address of the first element in the vector transfer. The address of subsequent elements is computed by incrementing the effective address. For example, the following instruction loads 4 elements from consecutive memory locations starting at the effective address computed by adding 16 to the contents of ar1.

ld.v xp0 [ar1+16](4)

The following sequence of 4 scalar load operations produces the same effect as the vector load operation.

ld xp0 [ar1+16]
ld xp0 [ar1+17]
ld xp0 [ar1+18]
ld xp0 [ar1+19]

When an address-stream register is used to compute the effective address, the address-stream register is updated after each element is loaded, and the effective address of the next element is computed using the updated address-stream register value. The address-stream registers allow vector memory operations to access memory with strides other than one. For example, the following vector memory operation loads 4 elements from the memory locations specified by the address sequence produced by adding the contents of ar1 to the address stream produced by as0.

ld.v xp0 [ar1+as0](4)

The following sequence of 4 scalar load operations produces the same effect as the vector load operation.


ld xp0 [ar1+as0]
ld xp0 [ar1+as0]
ld xp0 [ar1+as0]
ld xp0 [ar1+as0]

Vector Store — st.v Vector store operations store multiple elements from indexed registers to memory. Vector store operations use an assembly syntax similar to vector load operations.

st.v <src> <address>(<elements>)

<address> := [ <rs1> ] | [ <rs1> + <offset> ] | [ <rs1> + <as> ]

The source register specifies the index register that generates the sequence of indexed registers to be stored. Though vector memory operations appear to execute as atomic operations, the memory operations that transfer individual elements may be interleaved with memory operations from other processors.

Saving and Restoring Index and Address-stream Registers

Usually, an instruction that references an index register or an address-stream register modifies the state of the register when it executes. The instruction set provides two operations, save and restore, to let software access registers such as the index registers and address-stream registers without modifying the registers and affecting the processor state. In addition to allowing software to save and restore the register state of the processor, the save and restore operations are used to configure the index and address-stream registers. The save and restore operations are effectively distinguished store and load instructions that transfer 64-bit values between memory and registers. When used to store an index register or address-stream register, a save operation transfers the 64-bit value stored in the register to memory without affecting the state of the register. Software uses save operations to store the state of the index and address-stream registers when storing the state of a processor to memory. When used to load an index register or address-stream register, a restore operation transfers a 64-bit value from memory to the destination register. Software uses restore instructions to initialize the index and address-stream registers, and to restore the state of the index and address-stream registers when restoring the state of a processor from memory. The save and restore instructions are described below.

Save Register — st.save Saves register rd to the effective address computed when the instruction executes. The save operation is effectively a store operation that will transfer an entire register regardless of its size to memory without affecting the state of the saved register.

st.save <rd> <address>

<address> := [ <rs1> ] | [ <rs1> + <offset> ] | [ <rs1> + <as> ]


When register rd is a 32-bit register, the instruction stores the 32-bit value contained in rd to the effective memory address; when register rd is a 64-bit register, the save operation stores the 64-bit value contained in rd to memory. When register rd is an index register, the save operation stores the state of the index register to memory. The state of rd is not affected by the execution of the operation.

Restore Register — ld.rest Restores register rd from the effective address computed when the instruction executes. The restore operation is effectively a load operation that transfers an entire register regardless of its size from memory without otherwise affecting the state of the restored register.

ld.rest <rd> <address>

<address> := [ <rs1> ] | [ <rs1> + <offset> ] | [ <rs1> + <as> ]

When register rd is a 32-bit register, the operation loads the 32-bit value stored at the effective address from memory to rd; when register rd is a 64-bit register, the operation loads the 64-bit value stored at the effective memory address to rd. When register rd is an index register, the restore operation loads the state of the index register from memory. The state of rd is updated to reflect the load operation, but is not subsequently modified.

C.4 Ensembles and Processor Corps

This section describes aspects of the Elm instruction set architecture that reflect the Ensemble organization of processors, memory, and interconnect.

Message Registers

Message registers mr0 – mr7 provide names for accessing the ingress and egress ports to the local communication fabric. A message register may be used where a general-purpose register is permitted as an operand source and where a general-purpose register is permitted as a result destination. The synchronization bit associated with a message register is set to indicate the register is full when the message register is written, and it is set to indicate the register is empty when the register is read. The processor stalls when an instruction attempts to read a message register that is marked as empty until its synchronization bit indicates the register is full. The processor similarly stalls when an instruction attempts to write a message register that is marked as full until its synchronization bit indicates that the register is empty. The synchronization bits associated with the message registers may be accessed through the processor control word in the processor status register. Software may inspect the synchronization bits to determine whether a message register is marked as full or empty, and may update the synchronization bits to mark message registers as full or empty without accessing the message registers. In some cases, software may prefer to inspect the synchronization bits to check whether a message register is empty before accessing it to avoid stalling the processor when the message register is empty. The synchronization bits may be transferred between memory and the processor status register using save and restore operations.


        Processor EP0          Processor EP1          Processor EP2          Processor EP3
mr0     EP0.mr0 → EP0.mr0      EP1.mr0 → EP0.mr1      EP2.mr0 → EP0.mr2      EP3.mr0 → EP0.mr3
mr1     EP0.mr1 → EP1.mr0      EP1.mr1 → EP1.mr1      EP2.mr1 → EP1.mr2      EP3.mr1 → EP1.mr3
mr2     EP0.mr2 → EP2.mr0      EP1.mr2 → EP2.mr1      EP2.mr2 → EP2.mr2      EP3.mr2 → EP2.mr3
mr3     EP0.mr3 → EP3.mr0      EP1.mr3 → EP3.mr1      EP2.mr3 → EP3.mr2      EP3.mr3 → EP3.mr3

Table C.1 – Mapping of Message Registers to Communication Links.

Message registers mr0 – mr3 are mapped to permanent connections between the processors within an Ensemble. The connectivity is shown in Table C.1. Each row of the table specifies the connection used to transfer data that are written to the specified message register. The message registers are mapped so that data written to message register mrj is transferred to processor Pj, and data arriving from processor Pk is received in message register mrk. Message registers mr4 – mr7 are mapped to configurable connections in the local communication fabric. These message registers may be mapped to configurable links in the local communication fabric that allow processors in different Ensembles to transfer data through the local communication fabric.

Local Communication and Synchronization Operations

The following instructions implement communication and synchronization operations.

Send [send]

Transfers a word from the source register rs1 to the message register rd. The destination message register is written in the write-back stage (WB) of the pipeline even when the synchronization bit associated with the message register indicates that the register is full.

send <rd> <rs1>

The synchronization bit associated with the message register is always updated to indicate the message register is full when the instruction departs the write-back stage of the pipeline.

Receive [recv]

Transfers a word from the inbound message register rs1 to the destination register rd. The message register is read in the decode stage (DE) of the pipeline even when the synchronization bit associated with the message register indicates that the register is empty.

recv <rd> <rs1>

The synchronization bit associated with the message register is always updated to indicate the message register is empty when the instruction departs the write-back stage of the pipeline.


Barrier [barrier]

The barrier instruction provides a barrier synchronization mechanism to processors within an Ensemble. The processors participating in the barrier may be specified as a list of processor identifiers, which is encoded as a bit-vector in the immediate field of the instruction, or may be specified as a bit-vector stored in a register operand.

barrier <members>

<members> ← ( <processor-list> ) | <rs1>

A processor signals that it has reached a barrier when the barrier instruction enters the write-back stage (WB) and there are no outstanding memory transactions. This ensures that all preceding instructions have cleared the pipeline and all memory operations have committed. The processor stalls with the barrier instruction in the write-back stage until all other processors participating in the barrier similarly signal that they have reached the barrier. The barrier instruction is then retired, and the processor continues to execute subsequent instructions.

Load Barrier [ld.b]

The load barrier operation combines a barrier and a load operation to provide a synchronized load operation. This ensures that a set of processors perform the load at the same time, and may be used to ensure that accesses by different processors to shared memory locations are coordinated.

ld.b <rd> <address> [ <processor-list> ]

<address> ← [ <rs1> ] | [ <rs1> + <offset> ] | [ <rs1> + <as> ]

The processor list names the processors participating in the barrier operation. The load barrier operation may be used to synchronize loads from different processors to the same memory location to ensure that memory is accessed once and the result broadcast to multiple processors.

Store Barrier [st.b]

The store barrier operation combines a barrier and a store operation to provide a synchronized store operation. This ensures that a set of processors perform the store at the same time, and may be used to ensure that accesses by different processors to shared memory locations are coordinated.

st.b <rd> <address> [ <processor-list> ]

<address> ← [ <rs1> ] | [ <rs1> + <offset> ] | [ <rs1> + <as> ]

The processor list names the processors participating in the barrier operation. The store barrier operation should not be used to synchronize stores from multiple processors to the same memory location. Which processor successfully stores its value to memory will depend on how an implementation resolves the write conflict.

Processor Corps and Instruction Register Pools

Processors in a processor corps execute a common instruction stream. The execution model differs sufficiently from conventional single-instruction multiple-data execution models to justify some attention. Instructions are fetched from the shared instruction register pool and are issued to all of the processors in the corps. The processors execute the instructions independently, and may access different registers and different non-contiguous memory locations. Because the shared instruction register pool is formed by aggregating the instruction registers of the processors participating in the processor corps, the shared instruction registers are physically distributed with the processors. The shared instruction registers appear to be accessed as though there were 4 instruction register banks. Each processor is responsible for fetching and issuing those shared instructions that reside in its local instruction registers. This improves instruction fetch and issue locality, and reduces the number of control signals that pass between the processors. Dedicated hardware orchestrates the fetching and issuing of shared instructions so that the process is transparent to software. Each processor is responsible for loading any shared instructions that reside in its local instruction registers. This keeps instruction load bandwidth local to each instruction register file and ensures that all instruction load operations originate locally, which simplifies the design of the instruction load unit. It also increases the effective instruction load bandwidth that is available when loading shared instructions, as the processors in a corps can load shared instructions in parallel. However, software must explicitly schedule instruction load operations that load shared instructions so that the correct processor performs the load. The organization of the instruction registers in an instruction register pool is illustrated in Figure C.6. Because a shared instruction register pool is formed by aggregating multiple processors’ instruction registers, the shared instruction registers are physically distributed among the processors in a processor corps. Instruction registers in an instruction register pool are addressed using extended instruction register identifiers. The extended identifiers are formed by concatenating the processor identifier and its local instruction register address, as illustrated in Figure C.7. This encoding maps instruction registers in an instruction register file to adjacent identifiers, and allows extended instruction register identifiers to encode instruction register locality by exposing the physical distribution of instruction registers.

Loading Shared Instructions

To allow all operations that access an instruction register file to be performed locally, each processor in a corps is responsible for managing its local instruction registers. For example, processor EP1 is responsible for issuing instructions from shared instruction registers sir64 – sir127, and is responsible for loading instructions to shared instruction registers sir64 – sir127. Shared instruction registers are loaded using instruction load operations that load instruction registers locally. The shared instruction register identifier is determined by where the instruction load operation executes and the instruction register address specified by the operation. For example, an instruction load operation that transfers instructions to instruction registers ir8 – ir16 will load shared instruction registers sir8 – sir16 when executed by processor EP0. The same load operation will load instructions to shared instruction registers sir72 – sir80 when executed by processor EP1.


Figure C.6 – Organization of the Ensemble Instruction Register Pool. Physically, each processor’s instruction register file remains local to that processor (EP0.ir0 – EP0.ir63 through EP3.ir0 – EP3.ir63); logically, the four files are concatenated to form shared instruction registers sir0 – sir255, with EP0 providing sir0 – sir63, EP1 providing sir64 – sir127, EP2 providing sir128 – sir191, and EP3 providing sir192 – sir255. When combined to form a shared instruction register pool, the physical instruction registers are addressed so that sequential instructions are usually located in the same physical instruction register file. This organization keeps a significant fraction of the instruction fetch bandwidth local to the distributed instruction register files.

epid:2 local-identifier:6

Figure C.7 – Shared Instruction Register Identifier Encoding. Shared instruction register identifiers encode an Ensemble processor identifier (epid) and a local identifier. The processor identifier specifies the instruction register file that contains the instruction register, and the local identifier specifies the address of the instruction register within an instruction register file.
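The mapping from an Ensemble processor identifier and a local instruction register address to a shared instruction register identifier can be expressed as the following small C helper; the function name is illustrative.

/* Form a shared instruction register identifier from a 2-bit Ensemble
 * processor identifier and a 6-bit local instruction register address
 * (Figure C.7): sir = epid * 64 + local.  For example, EP1's local ir8
 * corresponds to shared instruction register sir72. */
unsigned sir_identifier(unsigned epid, unsigned local)
{
    return ((epid & 0x3u) << 6) | (local & 0x3Fu);
}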


Predicate Registers and Processor Corps Flow Control

Processor corps execute the same standard set of flow control instructions as independent processors. The encodings used for flow control instructions allow shared instruction registers to be encoded as the targets of flow control instructions. Conditional flow control instructions use predicate registers to control whether control transfers. When the state of a predicate register differs among the processors in a corps, a conditional flow control instruction will cause the processors in the corps to transfer control to different instructions. The target instruction register of a conditional flow control instruction that may cause the processors in a corps to diverge must be a local instruction register.


Processor Corps Operations

The following instructions control the formation and dissolution of processor corps.

Jump as Processor Corps [ jmp.c ]

The jump as corps instruction allows processors in an Ensemble to form a processor corps. The instruction transfers control to the shared instruction register specified as the destination.

jmp.c <dst> [ <members> ]

The members field lists the identifiers of the processors that will form the corps. The processors begin executing as a corps after all of the processors listed in the instruction have executed the jump as corps instruction.

Jump as Independent Processors [ jmp.i ]

The jump as independent processor instruction allows processors in a processor corps to resume executing code as independent processors. The instruction transfers control to the instruction register specified as the destination.

jmp.i <dst>

The destination is local to each processor, and all of the processors in the corps will resume executing as independent processors at a common position in their local instruction registers. The processors in the corps decouple and begin executing as independent processors immediately after any processor in the corps executes a jump as independent processor instruction.

C.5 Memory System

This section describes how the memory hierarchy is exposed through the instruction set architecture.

Address Space

Elm implements a shared address space that allows a processor to reference and access any memory location in the system. The address space is partitioned into four regions: local Ensemble memory, remote Ensemble memory, distributed memory, and system memory. We often refer to the local Ensemble memory as local memory and all other memory locations as remote memory, with the distinction being whether a memory operation escapes the local Ensemble. The partitioning is illustrated in Figure C.8. The capacity shown for each region reflects the maximum capacity based on the portion of the address space allocated to that memory region, and does not necessarily reflect the provisioned capacity of the memories in any particular Elm system.

Figure C.8 – Address Space. The 32-bit address space is partitioned, from the lowest addresses to the highest, into the local Ensemble memory region (with contiguous and interleaved views of the local memory, below 0x00010000), the remote Ensemble memory region (0x00010000 – 0x3FFFFFFF), the distributed memory region (0x40000000 – 0x7FFFFFFF), and the system memory region (0x80000000 – 0xFFFFFFFF).

Local Ensemble Memory Region

The local Ensemble memory region provides a consistent mechanism for naming local Ensemble memory that is independent of where a thread is mapped in the system. Addresses in the local Ensemble address space always map to memory locations in the local Ensemble memory. This simplifies the generation of position independent code that will execute correctly regardless of where it executes. However, addresses in this range refer to different physical memory locations within different Ensembles, and addresses within the local Ensemble region have no meaning beyond the Ensemble in which they are generated. Consequently, these addresses are commonly used for objects that are private and local to a thread, such as stacks and temporaries.

Figure C.9 – Memory Interleaving. In the contiguous view, each of the four banks occupies a contiguous range of addresses; in the interleaved view, consecutive addresses rotate through the banks. Memory operations that transfer data between a contiguous region of memory and an interleaved region of memory can be used to implement conversions between structure-of-array and array-of-structure data object layouts.

Memory Interleaving

As shown in Figure C.8, each location in the local Ensemble memory appears twice in the local Ensemble address range: once with the memory banks contiguous, and once with the memory banks interleaved. In the contiguous address range, address bits in the middle of the address select the bank; the specific bits depend on the capacity of the Ensemble memory. In the interleaved address range, the least significant address bits select the bank. The difference is illustrated in Figure C.9. The contiguous and interleaved address ranges map to the same physical memory, so each physical memory word is addressed by two different local addresses. The contiguous addressing scheme is useful when independent threads are mapped to an Ensemble. A thread may be assigned one of the local memory banks, and any data it accesses may be stored in its assigned memory bank. This avoids bank conflicts when threads within an Ensemble attempt to access the local memory concurrently, which improves performance and allows for more predictable execution. Because consecutive addresses map to the same memory bank, all of the data accessed by the thread resides within a contiguous address range, and no special addressing is required to avoid bank conflicts.

The interleaved addressing scheme is useful when collaborative threads are mapped to an Ensemble. Adjacent addresses are mapped to different banks, and accesses to consecutive addresses are distributed across the banks. This allows data transfers between local Ensemble memory and remote memory to use consecutive addresses while distributing the data across multiple banks, which improves bandwidth utilization when the memories are accessed by the processors within the Ensemble.
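The difference between the two addressing schemes can be illustrated with the following C sketch; the bank count of four and the per-bank capacity are taken from Figure C.9 and are illustrative rather than mandated by the instruction set architecture.

#include <stdint.h>

#define NUM_BANKS  4        /* bank count shown in Figure C.9 (illustrative) */
#define BANK_WORDS 0x1000   /* per-bank capacity in words (illustrative)     */

/* Contiguous view: middle address bits select the bank, so consecutive
 * addresses fall in the same bank. */
unsigned bank_contiguous(uint32_t addr)
{
    return (addr / BANK_WORDS) % NUM_BANKS;
}

/* Interleaved view: the least significant address bits select the bank,
 * so consecutive addresses rotate through the banks. */
unsigned bank_interleaved(uint32_t addr)
{
    return addr % NUM_BANKS;
}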

Remote Ensemble Memory Region

The remote Ensemble memory region comprises the aggregate of the Ensemble memory in a system. Addresses within this region have meaning throughout the system, and the remote Ensemble memory region effectively corresponds to the system view of all of the Ensemble memory. Addresses within the remote Ensemble address range have meaning both inside and outside of an Ensemble. The local Ensemble memory also appears in the remote Ensemble memory address range; where it appears depends on the global address assigned to the Ensemble. This convention provides a consistent mechanism for two Ensembles to name a memory location that resides within one of their Ensemble memories. Like the local Ensemble memory, each remote Ensemble memory appears twice: once in the contiguous address range, and once in the interleaved address range. The interleaved addresses map to physical banks such that banks are interleaved within Ensembles but not across Ensembles, and the addresses assigned to an Ensemble in the interleaved and contiguous ranges are offset by a constant displacement.

Distributed Memory Region

The distributed memory is mapped into the distributed memory address range. Addresses within this range have meaning throughout the system, and are used to directly access memory locations in the distributed memories. When part of the distributed memory operates as a cache, memory accesses that attempt to access the distributed memory as though it were software-managed will access the cache control data instead. Loads return the tag and status information, and stores allow the tag and status information to be modified.

System Memory Region

The system memory region includes off-chip memory and memory mapped input-output regions. An Elm system may have multiple memory channels and independent memory controllers to provide additional parallelism within the memory system.

Software-Allocated Hardware-Managed Caches

Remote Memory Operations

The instruction set distinguishes between memory operations that access local memory and those that access remote memory. The distinction allows software to indicate which memory operations require only local hardware resources, and allows the hardware to optimize the handling of local memory operations.


Both local and remote memory operations may access local memory. However, local memory operations may only access local memory, whereas remote memory operations may access any memory location. Remote memory operations use an on-chip interconnection network to transfer data between the Ensemble and remote memories. The network interface at the remote memory provides a serialization point. Remote memory operations that address the local Ensemble memory as remote memory access the local memory through the network interface; this ensures that the operations are serialized with respect to other concurrent remote memory accesses.

Remote Load [ ld.r ]

Transfers data from memory to the destination register. The operation stalls in the execute (EX) stage if the network interface is unavailable. The operation always stalls in the write-back (WB) stage until the load value is received.

ld.r

Remote Store [ st.r ]

Transfers data from a register to memory. The operation stalls in the execute (EX) stage if the network interface is unavailable. The operation stalls in the write-back (WB) stage until a message indicating that the store has completed is returned to the processor. This ensures that sequences of stores become visible to other processors in the order in which the stores execute.

st.r

Cached Remote Load [ ld.c ]

Cached remote load. The operation stalls in the execute (EX) stage if the network interface is unavailable. The operation always stalls in the write-back (WB) stage until the load value is received.

ld.c

Cached Remote Store [ st.c ]

Cached remote store. The operation stalls in the execute (EX) stage if the network interface is unavailable. The operation stalls in the write-back (WB) stage until the message indicating that the store has completed is returned to the processor. This ensures that sequences of stores become visible to other processors in the order in which the stores execute.

st.c


Decoupled Remote Load [ ld.r.d ]

Non-blocking remote load. A presence flag associated with the destination register, which must be a GR or XR, tracks when the load value arrives. Reading the register causes the processor to stall until the load value is received. Writing the register sets the presence flag and kills the load operation (the load value is discarded when it arrives).

ld.r.d
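The presence-flag behaviour can be modelled as a small state machine per destination register. The sketch below is a behavioural model only; the type and field names are illustrative and are not architectural state exposed to software.

/* Behavioural model of a register targeted by a decoupled load (illustrative). */
#include <stdint.h>

typedef struct {
    uint32_t value;
    int      present;   /* presence flag: value has arrived or was overwritten */
    int      pending;   /* a decoupled load is outstanding for this register   */
} decoupled_reg;

/* Reading the register stalls until the load value is received. */
static uint32_t read_reg(volatile decoupled_reg *r)
{
    while (r->pending && !r->present)
        ;                            /* processor stalls here */
    return r->value;
}

/* Writing the register sets the presence flag and kills the outstanding load;
 * the load value is discarded when it eventually arrives. */
static void write_reg(volatile decoupled_reg *r, uint32_t v)
{
    r->value   = v;
    r->present = 1;
    r->pending = 0;
}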

Decoupled Remote Store [ st.r.d ]

Non-blocking remote store. Like a remote store, except the processor does not wait for the message indicating that the store has completed to be returned.

st.r.d

Decoupled Cached Remote Load [ ld.c.d ]

Non-blocking cached remote load. A presence flag associated with the destination register, which must be a GR or XR, tracks when the load value arrives. Reading the register causes the processor to stall until the load value is received. Writing the register sets the presence flag and kills the load operation (the load value is discarded when it arrives).

ld.c.d

Decoupled Cached Remote Store [ st.c.d ]

Non-blocking cached remote store. Like a cached remote store, except the processor does not wait for the message indicating that the store has completed to be returned.

st.c.d

Atomic Compare-and-Swap [ atomic.cas ]

Atomic compare-and-swap. The address is specified by rs1, the comparison value by rs2, and the update value by rs3. If the value at the memory location equals the comparison value, the update value is written to the memory location; otherwise the memory location is unchanged. The value read from the memory location is returned to rd.

atomic.cas [ ]
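Assuming the conventional compare-and-swap semantics described above, the operation can be modelled as follows; the hardware performs the read-compare-write as one indivisible operation.

/* Model of atomic.cas: rs1 holds the address, rs2 the comparison value,
 * rs3 the update value; the prior memory value is returned to rd. */
#include <stdint.h>

static uint32_t atomic_cas_model(uint32_t *addr /* rs1 */,
                                 uint32_t cmp   /* rs2 */,
                                 uint32_t upd   /* rs3 */)
{
    uint32_t old = *addr;            /* value returned to rd */
    if (old == cmp)
        *addr = upd;                 /* swap only on a match */
    return old;
}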

Atomic Fetch-and-Add [ atomic.fad ]

Atomic fetch-and-add. The memory location is specified by rs1, and its current value is returned to rd. The value of register rs2 is added to the contents of the memory location; if the sum equals the value of register rs3, the memory location is updated to 0, otherwise it is updated to the sum.


atomic.fad [ ]
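The semantics above can be modelled as follows (the hardware executes the sequence indivisibly); the wrap to zero when the sum reaches rs3 suggests uses such as maintaining circular buffer indices.

/* Model of atomic.fad: rs1 holds the address, rs2 the addend, rs3 the bound.
 * The prior memory value is returned to rd; the location is updated to the
 * sum, or to zero when the sum equals the bound. */
#include <stdint.h>

static uint32_t atomic_fad_model(uint32_t *addr  /* rs1 */,
                                 uint32_t addend /* rs2 */,
                                 uint32_t bound  /* rs3 */)
{
    uint32_t old = *addr;            /* value returned to rd */
    uint32_t sum = old + addend;
    *addr = (sum == bound) ? 0u : sum;
    return old;
}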

Block Memory Operations

Block memory operations move contiguous blocks of memory through the memory system.

Block Get [ get ]

Transfers a block of memory from a remote memory location to a local memory location.

get [ ][ ]( )

Block Put [ put ]

Transfers a block of memory from a local memory location to a remote memory location.

put [ ][ ]( )

Block Move [ move ]

Transfers a block of memory from a remote memory location to a remote memory location.

move [ ][ ]( )

Stream Memory Operations

Stream memory operations transfer blocks of instructions and data through the memory system. A stream memory operation transfers a set of records, each of which is a contiguous block of memory.

Stream Descriptors

Streams are composed of sequences of fixed-length records. A stream descriptor describes the layout of a stream in memory, specifically where the records comprising the stream are located. Effectively, stream descriptors provide a compact representation with which software can efficiently describe a sequence of addresses to hardware. Hardware uses stream descriptors to calculate the addresses of the records when stream operations move records through the memory system. A stream descriptor provides the following information.

record-size — The size of each record in words.

stream-size — The number of records in the stream.

stride — The stride used to calculate the address of consecutive records.

index-address — The address at which an index vector resides. The entries of the index vector specify the offset of each record in the stream.
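Interpreted as a data structure, a stream descriptor might be represented as below. This is a sketch: the field names follow the list above, and the 32-bit field widths follow Figure C.14, in which the stream-size field is labelled length.

/* Sketch of a stream descriptor as a C structure (field names follow the
 * text; each field is a 32-bit word as in Figure C.14). */
#include <stdint.h>

typedef struct {
    uint32_t record_size;    /* size of each record, in words                    */
    uint32_t stream_size;    /* number of records in the stream ("length")       */
    uint32_t stride;         /* stride between addresses of consecutive records  */
    uint32_t index_address;  /* address of the index vector (indexed addressing) */
} stream_descriptor;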


[ Figure C.10 depicts a strided stream load: record-size-word elements are gathered from addresses offset + i × stride and packed contiguously at the stream load address (destination). ]

Figure C.10 – Strided Stream Addressing Load.

[ Figure C.11 depicts a strided stream store: record-size-word elements are read contiguously from the stream store address (source) and scattered to addresses offset + i × stride. ]

Figure C.11 – Strided Stream Addressing Store.

A stream memory operation uses a stream descriptor and an address offset to specify the addresses of records in the memory system; the offset is added to each address generated using the stream descriptor. Each stream operation specifies the number of records to be transferred and an address to which the stream is to be loaded or from which the records are to be stored. The SMA engine performing the stream operation maintains the state of the transfer so that a stream transfer can be performed as a sequence of smaller stream transfers. For example, a stream of N elements could be transferred by performing two stream transfers of N/2 elements. Figure C.10 and Figure C.11 illustrate how record addresses are generated when a stream memory operation uses strided addressing. Figure C.12 and Figure C.13 illustrate the use of indexed addressing.
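Using the descriptor structure sketched earlier, the address of the i-th record under the two addressing modes can be computed as below. This is a sketch; whether addresses and strides are expressed in words or bytes is left as in the text.

/* Address of the i-th record of a stream transfer (sketch). Strided
 * addressing follows Figure C.10/C.11; indexed addressing follows
 * Figure C.12/C.13, where the index vector resides at index-address. */
#include <stdint.h>

static uint32_t record_address(const stream_descriptor *sd,
                               uint32_t offset, uint32_t i,
                               const uint32_t *index /* loaded from sd->index_address */)
{
    if (index != 0)
        return offset + index[i];        /* indexed: offset + index[i]   */
    return offset + i * sd->stride;      /* strided: offset + i * stride */
}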


[ Figure C.12 depicts an indexed stream load: elements are gathered from addresses offset + index[i], where the index vector of stream-size entries resides at index-address, and packed contiguously at the stream load address (destination). ]

Figure C.12 – Indexed Stream Addressing Load.

[ Figure C.13 depicts an indexed stream store: elements are read contiguously from the stream store address (source) and scattered to addresses offset + index[i], where the index vector of stream-size entries resides at index-address. ]

Figure C.13 – Indexed Stream Addressing Store.

Streaming Memory Transfers and Stream Descriptor Registers

Stream memory operations are bulk memory operations that move records through the memory system. The streaming memory access (SMA) engines accelerate stream memory operations by performing the address computations and memory operations needed to complete a stream memory operation. The engines are integrated with the memory controllers distributed through the memory system. The streams are described using stream descriptors. Each engine provides a small number of registers for storing stream descriptors. Software registers a stream descriptor with an engine before initiating a stream memory operation.


record-size:32 length:32 stride:32 index-address:32

Figure C.14 – Record Stream Descriptor.

opcode:8 rs3:6 rd:6 rs1:6 rs2:6

opcode:8 imm:6 rd:6 rs1:6 rs2:6

Figure C.15 – Instruction Format Used by Stream Operations.

Memory operations are performed over the on-chip network. The controllers exchange messages to complete memory operations. The following paragraphs describe the operations and the message formats. Each operation has a request and a response message type. The request is used to initiate an operation, and the response to complete the operation or acknowledge that it completed. Conceptually, streams are sequences of records. Stream memory operations move records through the memory hierarchy. Stream load operations collect and pack records into a contiguous range of memory, and stream store operations unpack and distribute records from a contiguous range of memory. Stream descriptors are used to describe the layout of the records in memory: the descriptors provide succinct descriptions that allow the hardware to determine where the records composing a stream reside in memory. The format of a stream descriptor is shown in Figure C.14. The processor dispatches streaming memory operations to streaming memory access (SMA) engines, which are distributed throughout the memory system, instead of executing them directly. The operations are dispatched as messages that are sent over the on-chip network. The SMA engines coordinate the sequences of primitive memory operations (loads and stores) that are needed to complete a streaming memory operation. Address generators within the SMA engines generate the sequences of memory addresses. To execute a streaming memory operation, a processor registers a stream descriptor with the SMA engine that will perform the operation and then dispatches the streaming memory operation to the SMA engine. The stream descriptors are composed in memory and then transferred to stream descriptor registers in SMA engines. The instruction format used by the stream memory operations is shown in Figure C.15.

Load Stream Descriptor [ ld.sd ]

This instruction transfers a stream descriptor from memory into a stream descriptor register in an SMA engine. The address argument provides the memory address, and the register address specifies the stream descriptor register.

ld.sd


register ← [ ] address ← [ + ] | [ + ]

Store Stream Descriptor [ st.sd ]

This instruction transfers a stream descriptor from a stream descriptor register in an SMA engine to memory. The address argument provides the memory address, and the register address specifies the stream descriptor register.

st.sd

register ← [ ] address ← [ + ] | [ + ]

Load Stream [ ld.stream ]

Loads stream elements to the address specified by dest. The remote addresses of the records are calculated by adding the offset operand to the sequence of addresses generated by the stream descriptor named by stream. The number of records to be transferred is specified by the elements operand.

ld.stream [ + ]( ) dest ← [ ] stream ← [ ] offset ← elements ← |

Store Stream [ st.stream ]

Stores stream elements from the address specified by source. The remote addresses of the records are calculated by adding the offset operand to the sequence of addresses generated by the stream descriptor named by stream. The number of records to be transferred is specified by the elements operand.

st.stream [ + ]( ) source ← [ ] stream ← [ ] offset ← elements ← |


Synchronize Stream Transfer [ sync.stream ]

Causes the processor to stall until a stream load or store completes. The operation is identified using the stream descriptor register, which is named by operand rs1.

sync.stream [ ]
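Putting the pieces together, a stream load typically composes a descriptor in memory, registers it with an SMA engine using ld.sd, dispatches the transfer with ld.stream, and waits for completion with sync.stream. The sketch below expresses that sequence with hypothetical C intrinsics; elm_ld_sd, elm_ld_stream, elm_sync_stream, and SMA_DESC_REG0 are illustrative names, not part of the instruction set, and the descriptor values are example choices.

/* Sketch of a strided stream load using hypothetical intrinsic wrappers for
 * the ld.sd, ld.stream, and sync.stream operations described above. */
#include <stdint.h>

enum { SMA_DESC_REG0 = 0 };   /* illustrative stream descriptor register name */

extern void elm_ld_sd(int desc_reg, const stream_descriptor *sd);     /* ld.sd       */
extern void elm_ld_stream(int desc_reg, void *dest,
                          uint32_t offset, uint32_t elements);        /* ld.stream   */
extern void elm_sync_stream(int desc_reg);                            /* sync.stream */

void gather_records(void *local_dest, uint32_t remote_offset, uint32_t n)
{
    stream_descriptor sd = {
        .record_size   = 16,   /* 16-word records (example)        */
        .stream_size   = n,
        .stride        = 64,   /* strided addressing (example)     */
        .index_address = 0,    /* no index vector                  */
    };

    elm_ld_sd(SMA_DESC_REG0, &sd);                               /* register descriptor */
    elm_ld_stream(SMA_DESC_REG0, local_dest, remote_offset, n);  /* start transfer      */
    elm_sync_stream(SMA_DESC_REG0);                              /* wait for completion */
}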
