High Performance and Energy Efficient Multi-core Systems for DSP Applications

By

ZHIYI YU B.S. (Fudan University) June, 2000; M.S. (Fudan University) June, 2003

DISSERTATION

Submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in

Electrical and Computer Engineering

in the

OFFICE OF GRADUATE STUDIES

of the

UNIVERSITY OF CALIFORNIA

DAVIS

Approved:

Chair, Dr. Bevan M. Baas

Member, Dr. Vojin G. Oklobdzija

Member, Dr. Rajeevan Amirtharajah

Member, Dr. Soheil Ghiasi

Committee in charge 2007

© Copyright by Zhiyi Yu 2007
All Rights Reserved

Abstract

This dissertation investigates the architecture design, physical implementation, result evaluation, and feature analysis of a multi-core processor for DSP applications. The system is composed of a 2-D array of simple single-issue programmable processors interconnected by a reconfigurable mesh network, and the processors operate completely asynchronously with respect to each other in a Globally Asynchronous Locally Synchronous (GALS) fashion. The processor is called the Asynchronous Array of simple Processors (AsAP). A 6×6 array has been fabricated in a 0.18 µm CMOS technology. The physical design addresses timing issues for a robust implementation and takes full advantage of the architecture's potential scalability. Each processor occupies 0.66 mm², is fully functional at a clock rate of 520–540 MHz at 1.8 V, and dissipates 94 mW while the clock is 100% active. Compared to the high-performance TI C62x DSP processor, AsAP achieves 0.8–9.6 times greater performance and 10–75 times greater energy efficiency, with an area 7–19 times smaller. The system is also easily scalable and is well suited to future fabrication technologies.

An asymmetric interprocessor communication architecture is proposed. It assigns different buffer resources to the nearest-neighbor interconnect and the long-distance interconnect, and it can reduce the communication circuitry area by approximately 2 to 4 times compared to a traditional Network on Chip (NoC) while providing similar routing capability. A wide design space is explored, including support for long-distance communication in GALS systems, static/dynamic routing, varying numbers of ports (buffers) for the processing core, and varying numbers of links at each edge.

The GALS style typically introduces performance penalties due to additional communication latency between clock domains. GALS chip multiprocessors with large inter-processor FIFOs, such as AsAP, can inherently hide much of the GALS performance penalty, and the penalty can even be driven to zero. Furthermore, adaptive clock and voltage scaling for each processor provides approximately 40% power savings without any performance reduction.

Acknowledgments

I would like to thank all of the individuals who made this work possible.

I want to thank Professor Bevan Baas for his academic guidance and financial support. His devotion and enthusiasm for research will affect me strongly in my future career. I want to thank my dissertation reading committee members, Professor Vojin G. Oklobdzija, Professor Rajeevan Amirtharajah, and Professor Soheil Ghiasi, for their useful comments on my research. I want to thank previous VCL group members including Michael Lai, Omar Sattari, Ryan Apperson, and Michael Meeuwsen. It was with them that I had a happy time when I first came to Davis, and their efforts and contributions helped make the AsAP project successful, which became the strong basis of my dissertation. I want to thank current VCL group members including Eric Work, Tinoosh Mohsenin, Jeremy Webb, Wayne Cheng, Toney Jocobson, Zhibin Xiao, Paul Mejia, and Anh Tran; I always found I could learn something from them, and they keep the VCL group active.

I want to thank Corporation, Intellasys Corporation, Xilinx, the National Science Foundation (grant No. 0430090 and CAREER Award 0546907), UC MICRO, ST Microelectronics, SEM, SRC, and the UC Davis Faculty Research Grant for their generous financial donations to our research. I also want to thank Artisan, MOSIS, and TSMC; it was with their help that we had a successful fabrication of the AsAP chip.

I want to thank my family, my relatives and all of my friends. You might not know the details of my research area, but the support and help I get from you all might be more important than the pure academic help. It is because of you that I am a happy person and can keep pursuing my dreams.

Contents

Abstract iii

Acknowledgments iv

List of Figures viii

List of Tables xi

1 Introduction 1
  1.1 Challenges ...... 2
    1.1.1 High performance and energy efficiency ...... 2
    1.1.2 Future fabrication technologies ...... 5
  1.2 Solution — multi-core systems ...... 6
  1.3 Contributions ...... 8
  1.4 Dissertation organization ...... 9

2 Architecture of the Multi-core System 11
  2.1 Key features of the multi-core processor ...... 11
    2.1.1 Chip multiprocessor and task level parallelism ...... 12
    2.1.2 Memory requirements of the targeted tasks ...... 13
    2.1.3 Simple single issue datapath ...... 14
    2.1.4 GALS clocking style ...... 15
    2.1.5 Wires and on chip communication ...... 17
  2.2 The AsAP processor system ...... 18
    2.2.1 Single AsAP processor design ...... 19
    2.2.2 Inter-processor communication — Reconfigurable 2-D mesh network ...... 26
  2.3 Application implementations and software ...... 29
    2.3.1 Application implementations ...... 29
    2.3.2 Software ...... 31
  2.4 Related work ...... 32
    2.4.1 Traditional DSP processors ...... 32
    2.4.2 Pioneering multiprocessor systems and multi-core processors ...... 33
    2.4.3 Modern multi-core systems ...... 36
    2.4.4 Distinguishing multi-core processors ...... 40
  2.5 Summary ...... 44

3 A Low-area Multi-link Interconnect Architecture 45
    3.0.1 Background: traditional dynamic routing architecture ...... 46
    3.0.2 Background: static nearest neighbor interconnect ...... 47
  3.1 Low-area interconnect architecture ...... 48
    3.1.1 Asymmetric architecture ...... 48
    3.1.2 Theoretical analysis ...... 50
    3.1.3 Static routing vs. dynamic routing ...... 53
  3.2 Design space exploration ...... 56
    3.2.1 Single port vs. multiple ports for the processing core ...... 56
    3.2.2 Single link vs. multiple links ...... 59
  3.3 Supporting GALS long distance communication ...... 67
    3.3.1 Source synchronization for long distance communication ...... 69
    3.3.2 Care more about the clock delay, less about skew or jitter ...... 70
  3.4 Implementation and results ...... 70
    3.4.1 Area ...... 71
    3.4.2 Performance comparison ...... 72
  3.5 Summary ...... 75

4 Physical Implementation of the GALS Multi-core Systems 77
  4.1 Timing issues of GALS multi-core systems ...... 77
    4.1.1 Inter-processor timing issues ...... 78
    4.1.2 Inter-chip timing issues ...... 81
  4.2 Scalability issues of GALS chip multiprocessors ...... 82
    4.2.1 Clocking and buffering of global signals ...... 83
    4.2.2 Power distribution ...... 84
    4.2.3 Position of IO pins ...... 85
  4.3 A design example — implementation of AsAP ...... 85
    4.3.1 Physical design flow ...... 86
    4.3.2 Implementation for high speed ...... 88
    4.3.3 Testing ...... 89
  4.4 Summary ...... 90

5 Results and Evaluation of the Multi-core System 93
  5.1 Area, speed, and power ...... 93
    5.1.1 Small area and high area efficiency ...... 94
    5.1.2 High speed ...... 96
    5.1.3 High peak performance and low average power consumption ...... 96
    5.1.4 Result of communication circuitry ...... 98
  5.2 High performance and low power consumption for DSP applications ...... 98
    5.2.1 Performance scaling with the processor number ...... 102
  5.3 Summary ...... 103

6 System Feature Analysis: GALS vs. Synchronous 105
  6.1 Exploring the key GALS chip multiprocessor design options ...... 106
    6.1.1 Clock domain partition of GALS chip multiprocessors ...... 106
    6.1.2 Inter-processor network ...... 107
  6.2 Simulation platform — the GALS and non-GALS chip multiprocessors ...... 108
  6.3 Reducing and eliminating GALS performance penalties ...... 109
    6.3.1 Related work ...... 109
    6.3.2 Comparison of application performance: GALS versus synchronous ...... 110
    6.3.3 Analysis of the performance effects of GALS ...... 111
    6.3.4 Eliminating performance penalties ...... 117
  6.4 Scalability analysis of GALS chip multiprocessors ...... 119
    6.4.1 Auto generated clock trees for different sizes of chip multiprocessors ...... 119
    6.4.2 The effect of clock tree on system performance ...... 121
  6.5 Power analysis of adaptive clock frequency scaling ...... 123
    6.5.1 Related work—adaptive clock scaling of the GALS uniprocessor ...... 124
    6.5.2 Unbalanced processor computation loads increases power savings potential ...... 125
    6.5.3 Finding the optimal clock frequency—computational load and position ...... 126
    6.5.4 Power reduction of clock/voltage scaling ...... 127
  6.6 Summary ...... 128

7 Conclusion 131
  7.1 Future work ...... 132

Glossary 135

Bibliography 139

List of Figures

1.1 Power consumption of Intel microprocessors from 1970 to 2006 ...... 4
1.2 Wire delays (in FO4s) for fixed-length (1 cm) wires ...... 5

2.1 Key features of AsAP and resulting benefits ...... 12
2.2 Multi-task application executing models ...... 12
2.3 IEEE 802.11a/g wireless LAN (54 Mbps, 5/2.4 GHz) baseband transmit path ...... 13
2.4 Area breakdown of four modern processors; memory occupying most of the area ...... 13
2.5 Block diagrams of synchronous and GALS tile-based chip multiprocessors ...... 16
2.6 Some linear pipelined algorithm models for DSP tasks and applications ...... 18
2.7 Block diagram of an AsAP processor ...... 19
2.8 AsAP 9-stage pipeline ...... 20
2.9 Programmable clock oscillator ...... 21
2.10 Physical distribution of oscillator frequencies across different processors ...... 22
2.11 Example waveform of clock halting and restarting ...... 22
2.12 Relative instruction memory cost by using embedded NOP instruction ...... 24
2.13 Comparison of three different addressing modes ...... 26
2.14 Nearest neighbor inter-processor communication diagram ...... 27
2.15 Two strategies for communication in GALS systems ...... 28
2.16 Block diagram of the dual-clock FIFO ...... 28
2.17 8 x 8 DCT implementation using 4 processors ...... 30
2.18 JPEG encoder core using 9 processors ...... 31

3.1 Interprocessor communication in NoC systems and a generalized NoC architecture ...... 46
3.2 The concept and circuitry of the nearest neighbor interconnect architecture ...... 47
3.3 The concept and circuitry diagram of the proposed communication architecture ...... 50
3.4 Buffer stall probability (p) along with buffer size (N) ...... 51
3.5 The overall system buffer stall probability along the buffer size ratio ...... 53
3.6 Diagrams of architectures with various numbers of input ports for the processing core ...... 56
3.7 The latency of architectures with different numbers of ports for one-to-one comm. ...... 57
3.8 The latency of architectures with different numbers of ports for all-to-one comm. ...... 58
3.9 The latency of architectures with different numbers of ports for all-to-all comm. ...... 59
3.10 Diagram of inter-processor connection with one link at each edge ...... 60
3.11 Inter-processor connections with double links ...... 61
3.12 Setup the interconnect path from point A to B ...... 63
3.13 Inter-processor interconnect with three and four links ...... 64
3.14 Layouts of seven processors with different communication circuitry ...... 65

3.15 Comparing the processors with different communication circuitry ...... 66
3.16 The latency of interconnect architectures for one-to-one comm. ...... 67
3.17 The latency of interconnect architectures for all-to-all comm. ...... 68
3.18 The latency of interconnect architectures for all-to-one comm. ...... 68
3.19 Synchronization strategies for GALS long distance communication ...... 69
3.20 The relative communication circuit area of several interconnection architectures ...... 71
3.21 The latency of different interconnect architectures for one-to-one comm. ...... 72
3.22 The latency of different interconnect architectures for all-to-one comm. ...... 73
3.23 Comparing the communication latency of three architectures ...... 73
3.24 The latency of application models uniformly combined by the four basic patterns ...... 74

4.1 An overview of timing issues in GALS chip multiprocessors ...... 78
4.2 Three methods for inter-processor communication ...... 79
4.3 Relative clock active time and communication power consumption ...... 79
4.4 Circuit for the Fig. 4.2 (b) inter-processor communication method ...... 80
4.5 Configurable logic at the inter-processor boundary ...... 80
4.6 Inter-chip communication ...... 82
4.7 Global signals controlled by a low-speed clock ...... 83
4.8 An example power distribution scheme ...... 84
4.9 Pin connections between two processors vertically ...... 85
4.10 Pin connections between two processors horizontally ...... 86
4.11 Chip micrograph of a 6×6 GALS array processor ...... 87
4.12 Hierarchical physical design flow of a tile-based GALS chip multiprocessor ...... 88
4.13 A standard cell based back end design flow emphasizing the verification ...... 89
4.14 Speed-up methods during synthesis and place and routing ...... 90
4.15 AsAP board and supporting FPGA-based test board ...... 91

5.1 Area evaluation of AsAP processor and several other processors ...... 94
5.2 Size of a single processing element in several chip multi-processor systems ...... 95
5.3 Processor shmoo: voltage vs. speed ...... 96
5.4 Power and performance evaluation of AsAP processor and several other processors ...... 97
5.5 Comparison of communication circuit of four chip multiprocessors ...... 98
5.6 Relative area for various implementations of several DSP kernels and apps. ...... 100
5.7 Relative execution time for various implementations of several DSP apps ...... 101
5.8 Relative energy for various implementations of several DSP apps ...... 101
5.9 Increase in system throughput with increasing number of processors ...... 102

6.1 Three example clock domain partitions ...... 107
6.2 Two chip multiprocessors ...... 108
6.3 A GALS system boundary and timing of the synchronization delay ...... 109
6.4 Pipeline control hazard penalties ...... 110
6.5 Illustration of three key latencies and application throughput in GALS systems ...... 112
6.6 FIFO operation during normal situation, and FIFO empty, and FIFO full ...... 114
6.7 System throughput in one way communication path, and communication loop path ...... 115
6.8 Data producer proc.1 and data consumer proc.2 both too slow ...... 116
6.9 Performance of synchronous and GALS array processors with different FIFO sizes ...... 117
6.10 Examples of multiple-loop links between two processors ...... 118
6.11 64-pt complex FFT implementation ...... 119

6.12 An example clock tree for a single processor ...... 120
6.13 The peak performance of GALS and synchronous array processor ...... 122
6.14 Increased clock tree delay at 1.7 V supply voltage for different clock trees ...... 123
6.15 The peak performance with the number of processors ...... 124
6.16 Clock scaling in a GALS uniprocessor ...... 125
6.17 Relative computational load of different processors in nine applications ...... 126
6.18 Throughput changes with statically configured processor clocks ...... 127
6.19 Relationship of processors in the 4-processor 8×8 DCT application ...... 127
6.20 The relationship between clock frequency and its power consumption ...... 128
6.21 Relative power of the GALS array processor ...... 129

List of Tables

1.1 Example multi-core processors presented in ISSCC from 2003 to 2007 ...... 7

2.1 Memory requirements for common DSP tasks ...... 14
2.2 AsAP 32-bit instruction types and fields ...... 23
2.3 Classes of the 54 supported instructions ...... 23
2.4 Data fetch addressing modes ...... 25
2.5 Computation load of the nine processors in JPEG encoder ...... 31
2.6 Major features of some commercial DSP processors ...... 33
2.7 Comparison of the distinguished key features of selected parallel processors ...... 41
2.8 Comparison of the selected parallel processors ...... 42
2.9 Comparison of inter-element communication of selected parallel processors ...... 43

3.1 Data traffic of router's input ports and output ports for each processor ...... 49
3.2 The required buffer sizes for defined stall probabilities ...... 52
3.3 Comparison of different interconnect approaches ...... 55
3.4 Interconnect architecture options for double links ...... 62

4.1 Typical timing constraint values for processor input and output delays ...... 82

5.1 Area breakdown in a single processor ...... 94
5.2 Estimates for a 13 mm × 13 mm AsAP array implemented in various technologies ...... 97
5.3 Area, performance and power comparison of various processors for several apps ...... 99

6.1 Clock cycles (1/throughput) of several applications ...... 111
6.2 The fraction of the time the inter-processor communication is active ...... 113
6.3 Effective latency (clock cycles) of several applications ...... 115
6.4 Data for globally synchronous clock trees for different array processors ...... 121


Chapter 1

Introduction

Applications that require the computation of complex digital signal processing (DSP) workloads are becoming increasingly commonplace. These applications often comprise multiple DSP tasks and are frequently key components in many systems such as wired and wireless communications, multimedia, large-scale multi-dimensional signal processing (e.g., medical imaging, synthetic aperture radar), some large-scale scientific computing, remote sensing and processing, and medical/biological processing. Many of these applications are embedded and are strongly energy-constrained (e.g., portable or remotely-located) and cost-constrained. In addition, many of them require very high throughputs, often dissipate a significant portion of the system power budget, and are therefore of considerable interest.

There are several design approaches for DSP applications, such as ASICs, programmable DSPs, and FPGAs. ASICs can provide very high performance and very high energy efficiency, but they have little programming flexibility. On the other hand, programmable DSPs are easy to program, but their performance and energy efficiency are normally 10–100 times lower than ASIC implementations [1]. FPGAs fall somewhere in between. One-time fabrication costs for state of the art designs (e.g., 90 nm CMOS technology) are roughly one million dollars, and total design costs of modern chips can easily run into the tens of millions of dollars [2]. Programmable processors are the platform considered in this dissertation, since they allow high one-time fabrication costs and design costs to be shared among a variety of applications; at the same time, the architecture proposed here is expected to achieve higher performance and energy efficiency than traditional programmable DSPs.

1.1 Challenges

IC designers have traditionally been concerned mainly with fabrication cost and performance: minimizing the number of transistors to reduce area is the main approach to reducing cost, and increasing the clock frequency is the main approach to increasing performance. Today, achieving energy efficiency and adapting to advanced fabrication technologies have become important additional challenges.

1.1.1 High performance and energy efficiency

High performance innovations are challenging

Increasing the processor clock frequencies and using wide issue processor architectures have worked well to improve performance but recently have become significantly more challenging.

Deeper pipelining is one of the key techniques for increasing clock frequency and performance, but its benefit eventually diminishes when the inserted flip-flop delay becomes comparable to the combinational logic delay. Moreover, deeper pipelines increase cycles-per-instruction (CPI), which negatively impacts system performance. Researchers have found that a pipeline depth of approximately 8 fanout-of-4 (FO4) inverter delays per stage gives the highest performance [3], which corresponds to 20–30 pipeline stages. The per-stage delay of some modern processors is already close to 10 FO4 [4], so the deeper-pipelining technique for high performance is reaching its limit. Also, increasing the number of pipeline stages requires more registers and control logic, further increasing design difficulty as well as power consumption. As reported by A. Hartstein [5], the optimum pipeline depth for maximum energy efficiency is about 22.5 FO4 delays per stage (about a 7-stage pipeline), using BIPS³/Watt as the metric, where BIPS is billions of instructions per second.

The other key technique—shrinking the size of transistors to increase clock frequency and integration capability—has followed Moore's Law [6] for about 40 years. Although this innovation continues, its limits are only a few generations away: either the physical limit as transistor sizes approach the size of atoms, or fabrication costs that prohibit further progress.

Wide-issue processor architectures such as VLIW and superscalar are another effective approach for high-performance computation, but their benefit also diminishes quickly once the issue width exceeds about 10, since most applications do not have that many independently parallel executable instructions per fetch. For example, an 8-way VLIW TI high-performance C6000 DSP [7] and an 8-wide superscalar RISC [8] were reported in 2002, and no examples beyond this range have appeared since.

Power dissipation becomes the key constraint

High-performance design of modern chips is also highly constrained by power dissipation as well as circuit constraints. Power consumption is generally dominated by dynamic power, with leakage power increasingly playing another key role. Dynamic power is given by Equation 1.1,

P = αCV²f     (1.1)

where α is the circuit state transition probability, C is the switched capacitance, V is the supply voltage, and f is the clock frequency. Leakage power mainly results from the reduction of the transistor threshold voltage [9] and is also highly dependent on the supply voltage. Most high-performance techniques, such as increasing clock frequencies and increasing issue width (which means more circuits and more capacitance), result in higher power consumption; all of this implies a new era of high-performance design that must now focus on energy-efficient implementations [10].

Power consumption is clearly a concern for battery-powered portable devices, since it determines the operational lifetime between battery charges. Traditional non-portable systems such as PCs are also concerned with power consumption, since it strongly influences packaging and cooling costs and even limits the operating speed and integration capacity of the system. Figure 1.1 shows the power consumption of major Intel microprocessors from 1970 to 2006. The data between 1970 and 2000 are from S. Borkar [11], who found that power consumption followed Moore's law, increasing from 0.3 W to 100 W, and estimated that processor power consumption would reach 1000 W within a couple of years if the trend continued.

Figure 1.1: Power consumption of Intel microprocessors from 1970 to 2006; data between 1970 and 2000 are from [11]; data for 2003, 2005, and 2006 are from [12, 13, 14] respectively. Bars are measured values and the dashed line shows the trend.

Similar to the total power consumption, the power density increased significantly, from a couple of Watts per mm² to about 100 Watts per mm², and became another key issue. This trend has been halted in recent years thanks to low-power techniques such as voltage and frequency control technologies.

The power consumption of recently reported microprocessors is still around 100 W, which also implies that 100 W is roughly the limit that current packaging and cooling technologies can tolerate. Power consumption has become the primary constraint for designers and limits the achievable clock frequency and processor performance [10].

The gap between dream and reality

H. D. Man describes a future ambient intelligent computing example which illustrates the gap between future requirements and current reality in both performance and power [15]. In his view, there will be three major classes of devices in a future intelligent computing system. The first is the main powerful computation component, like today's PC; its target performance is 1 TOPS with power consumption of less than 5 Watts, corresponding to an energy efficiency of 100 to 200 GOPS/W. This requirement is about 1000 times higher than today's PC, which delivers about 10 GOPS while consuming 100 W, corresponding to 0.1 GOPS/W. The second is the handheld device powered by battery, like today's mobile phone, which targets 10 to 100 GOPS at less than 1 W, corresponding to 10 to 100 GOPS/W; this requirement is about 10 times higher than current solutions, which use RISC processors and/or DSP processors. The third component is the sensor network that receives and transfers information, powered by energy harvesting methods such as mechanical vibration, with power consumption of less than 100 µW; developing such low-power components is another challenging topic.

1.1.2 Future fabrication technologies

Future fabrication technologies are also imposing new challenges such as wiring and parameter variations.

In the early days of CMOS technology, wires could be treated as ideal: they transmitted signals with effectively infinite speed, without power consumption, and without coupling effects. This assumption is no longer true. The length of global wires stays nearly constant with technology scaling if the chip size stays the same, which makes their delay nearly constant. Compared to gate delay, which scales down with technology, the delay of global and long wires therefore scales up relatively. As shown in Fig. 1.2 [16], the delay of a 1 cm global wire in a modern 65 nm technology is around 100 FO4 delays, which corresponds to more than one clock cycle in modern processors; such global wires must be pipelined or eliminated through architecture-level innovation. Similar to delay, the power consumption of long wires also scales up with technology relative to gates. Besides the effects on delay and power consumption, the inductive and capacitive coupling between wires adds signal noise and impacts system reliability.

Figure 1.2: Wire delays (in FO4s) for fixed-length (1 cm) wires [16]; global wires have larger width, height, and spacing, which result in smaller resistance than semi-global wires; aggressive and conservative are two extreme scaling projections that bound the scaling parameters.

Furthermore, future fabrication technologies are expected to have tremendous variability compared to current technologies, in both gates and wires. Fundamental issues of statistical fluctuation in submicron MOSFETs are not completely understood, but the variation increases transistor leakage and causes variation in the speed of individual transistors, which in turn leads to IC timing issues [17]. Borkar et al. reported that chips fabricated in advanced nanometer technologies can easily have 30% variation in chip frequency [18]. This variance is present at fabrication time and also has a time-varying component.

1.2 Solution — multi-core systems

In order to address the challenges in performance, power, and future technologies, innovations in computer architecture and design are needed, and multi-core systems are one of the most promising, if not the most promising, of these technologies. As pointed out by a computer architecture group in the EECS department at UC Berkeley [19], the "shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism; instead, this plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures".

Deep submicron fabrication technologies enable very high levels of integration, such as a recent dual-core chip with 1.7 billion transistors [13], reaching a key milestone in the level of circuit complexity possible on a single chip. A highly promising approach to using these circuit resources efficiently is the integration of multiple processors onto a single chip to achieve higher performance through parallel processing; such a chip is called a multi-core system or a chip multiprocessor (CMP).

Multi-core systems can provide high energy efficiency, since they allow the clock frequency and supply voltage to be reduced together to dramatically reduce power dissipation during periods when full rate computation is not needed.

Table 1.1: Major commercial general-purpose multi-core processors presented at ISSCC from 2003 to 2007 (dual-core through 8-core designs)

2003: Mitsubishi's dual-core [20]
2004: UltraSPARC [21]
2005: [13], Fujitsu's 4-core [22], CELL [23], SPARC [24], BlueGene/L [25]
2006: [14], Niagara1 [26], [27], PowerTM [28]
2007: POWER6 [29], Renesas' 4-core SoC [30], Niagara2 [31], PowerTM [32], Opteron [33], [34]

Consider a simple example: assume a uniprocessor is capable of computing one application at clock rate f and voltage v while consuming power p. If a dual-core system is used instead, and the application can be partitioned onto the two cores without any overhead, then each core needs only a clock rate of f/2, and the voltage can be reduced accordingly. Assuming a linear relation between voltage and clock frequency, so that the voltage is reduced to v/2, the power dissipation of the dual-core system will be only about p/4, as calculated from Equation 1.1. Multi-core systems also potentially provide the opportunity to independently control each processor's clock and voltage to achieve higher energy efficiency, if different processors are in separate clock and voltage domains.
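This back-of-the-envelope argument follows directly from Equation 1.1; the sketch below is only a first-order check of that arithmetic (it ignores leakage, partitioning overhead, and the fact that the voltage/frequency relation is only approximately linear), and its reference values are arbitrary since only the ratios matter.

```python
# First-order dynamic power model from Equation 1.1: P = alpha * C * V^2 * f.
# Shows why two half-rate, half-voltage cores dissipate roughly p/4 in total.

def dynamic_power(alpha, c, v, f):
    return alpha * c * v * v * f

# Arbitrary reference values for the uniprocessor; only the ratios matter here.
alpha, c, v, f = 0.15, 1e-9, 1.8, 500e6
p_uni = dynamic_power(alpha, c, v, f)

# Dual-core case: each core at f/2, supply scaled to v/2 (assuming a roughly
# linear voltage/frequency relation), ignoring leakage and partitioning cost.
p_dual = 2 * dynamic_power(alpha, c, v / 2, f / 2)

print(f"uniprocessor power: {p_uni:.3f} W")
print(f"dual-core power   : {p_dual:.3f} W ({p_dual / p_uni:.2f} of uniprocessor)")
```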

Furthermore, multi-core systems are suitable for future technologies. Their distributed nature can potentially confine wires within a single core and eliminate global (long) wires. Multi-core systems also provide flexible ways to treat each core differently, by adjusting the mapped application, supply voltage, and clock rate, so as to accommodate each core's specific characteristics under variation. For example, when one processor in the chip is much slower than the others, a light workload can be mapped onto it without affecting system performance.

Multi-core systems also have high scalability, since a single processor can be designed once and the system obtained by combining multiple copies of that processor. The system can thus be easily adjusted to the required performance and cost constraints by changing the number of cores, which is much easier than changing the issue width or the pipelining of a uniprocessor.

Table 1.1 shows the major commercial general-purpose multi-core processors presented at ISSCC from 2003 to 2007. Multi-core processors started to emerge in 2003, jumped in number in 2005, and quad-core processors have now become the mainstream style. The trend of migrating from uniprocessors to multi-core processors is clear. The research community has presented even more aggressive multi-core systems, such as the 16-core RAW [35] in 2003, the 36-core AsAP [36] in 2006, and the 80-core Intel chip [37] in 2007.

Although the trend is clear, many issues remain open in designing an efficient multi-core processor: for example, what kind of processing element should be used, how to efficiently connect the processing elements and communicate between them, and what clocking style should be used. These are the questions this dissertation investigates. Another large set of issues relates to software, such as how to describe applications and kernels to expose their parallelism and how to program a multi-core processor using a high-level language; those topics are beyond the main scope of this thesis.

1.3 Contributions

This dissertation makes several contributions.

• It presents a successful multi-core processor, called the Asynchronous Array of simple Processors (AsAP) [36, 38, 39], including its architecture design, physical implementation, application programming, and results evaluation. The chip is organized as multiple simple single-issue processors, interconnected by a reconfigurable mesh network, and operating in a globally asynchronous locally synchronous (GALS) [40] clocking style. The system can efficiently compute many DSP applications, achieving high performance and high energy efficiency, and is well suited for implementation in future fabrication technologies. Work for which I was not responsible is not included in this dissertation; please refer to our group's publications for additional information, such as the details of the dual-clock FIFO design [41], the design of the large shared memory [42], the programming of an 802.11a/g wireless transmitter [43, 36], and the design of the MAC unit [44].

• It generalizes the physical implementation techniques for multi-core processors with GALS clocking styles [45]. It presents methodologies to handle the timing issues in GALS chip multiprocessors, including inter-processor and inter-chip timing issues as well as timing within a single processor. It takes full advantage of system scalability by addressing the unavoidable global signals, the power distribution, and the processor IO pins.

• It investigates the performance, scalability, and power consumption effects of adopting the GALS style in multi-core processors, by comparing against corresponding fully synchronous systems [46]. It finds that this GALS array processor has a throughput penalty of less than 1% over a variety of DSP workloads, and that this small penalty can be avoided entirely with large enough FIFOs and by programming without multiple-loop communication links. It shows that unbalanced computational loads in chip multiprocessors increase the opportunity for independent clock frequency and voltage scaling to achieve significant power savings.

• It proposes an asymmetric inter-processor communication architecture which uses more buffer resources for the nearest-neighbor connections and fewer buffer resources for the long-distance interconnect, and can reduce area by 2 to 4 times compared to a traditional NoC system while maintaining similar communication performance. It investigates methodologies to support GALS clocking in long-distance communication. It finds that using two or three links between neighboring processors achieves a good area/performance trade-off for chip multiprocessors organized from simple single-issue processors.

1.4 Dissertation organization

The organization of this dissertation is as follows. After this introduction, Chapter 2 introduces the architecture of the multi-core processor; Chapter 3 investigates the inter-processor communication system; Chapter 4 discusses and generalizes the physical design of tile-based GALS multi-core systems; Chapter 5 evaluates the measurement results of the fabricated multi-core chip; Chapter 6 analyzes the effect of GALS on system features; and the final chapter concludes.

Chapter 2

Architecture of the Multi-core System

A multi-core processor system is proposed and designed, mainly for computationally-intensive DSP applications. The system comprises a 2-D array of simple programmable processors interconnected by a reconfigurable mesh network. Processors are each clocked by fully independent haltable oscillators in a GALS fashion.

The multi-core architecture efficiently makes use of task level parallelism in many complex DSP applications, and also efficiently computes many large DSP tasks through fine-grain parallelism to achieve high performance. The system uses processors with simple architecture and small memories to dramatically increase energy efficiency. The flexible programmable processor architecture broadens the target application domain and allows high one-time fabrication costs to be shared among a variety of applications. The GALS clocking style and the simple mesh interconnect greatly enhance scalability, and provide opportunities to mitigate effects of device variations, global wire limitations, and processor failures.

This chapter discusses the key features and architecture-level design of the proposed multi-core system, which is called AsAP [36, 39] for Asynchronous Array of simple Processors.

2.1 Key features of the multi-core processor

Several key features distinguish the AsAP processor. These features and the resulting benefits are illustrated in Fig. 2.1 and are discussed in greater detail in the following subsections.

Key features: chip multiprocessor; small memory & simple processor; globally asynchronous locally synchronous (GALS) clocking; nearest neighbor communication. Resulting benefits: high performance; high energy efficiency; technology scalability.

Figure 2.1: Key features of AsAP and resulting benefits

(a) A single processor with a large memory executing Task1, Task2, and Task3 in sequence; (b) Proc.1, Proc.2, and Proc.3 each executing one task, which improves performance and potentially reduces memory size.

Figure 2.2: Multi-task application executing on (a) a traditional architecture, and (b) a stream-oriented multi-processor well suited for task level parallelism

2.1.1 Chip multiprocessor and task level parallelism

As discussed in Section 1.2, parallelization through multi-core systems is the future of high-performance computation. Multi-core systems can exploit all types of parallelism, including instruction-level parallelism, data-level parallelism, and task-level parallelism.

Task-level parallelism cannot be easily exploited on traditional sequentially-executing processors, but it is especially well suited to many DSP applications. As shown in Fig. 2.2 (a), a traditional system normally contains one powerful processor with a large memory; it executes the tasks of the application in sequence and stores temporary results in memory. The same application may run more efficiently on multiple processors using task-level parallelism, as shown in Fig. 2.2 (b), where different processors handle different tasks of the application. Normally the data input of a DSP application is considered to be of infinite length, so these processors can execute in parallel and achieve high performance. Also, the temporary results from each processor can be sent directly to the following processor and do not need to be stored in a large global memory, so less memory is necessary compared to the traditional method.

Figure 2.3: IEEE 802.11a/g wireless LAN (54 Mbps, 5/2.4 GHz) baseband transmit path

Figure 2.4: Area breakdown of four modern processors; memory occupying most of the area

Task level parallelism is widely available in many DSP applications. Figure 2.3 shows an example of a modern complex application that exhibits abundant task-level parallelism—the transmit chain of an IEEE 802.11a/g wireless LAN transmitter. It contains more than 10 tasks, and each of them can be directly mapped to separate processors to take advantage of the available task level parallelism.
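A rough behavioral sketch of the execution model in Fig. 2.2 (b) is shown below: each task runs as a pipeline stage that reads from an input queue and writes to an output queue, much like processors connected by small FIFOs. The stage functions, queue depth, and scheduling policy are invented for illustration and are not the actual 802.11a/g kernels or the AsAP programming model.

```python
# Toy model of task-level parallelism: each task is a pipeline stage that reads
# from an input queue and writes to an output queue, mimicking processors
# connected by small FIFOs. The stages here are placeholders, not real kernels.
from collections import deque

def scramble(x):   return x ^ 0x5A        # placeholder task
def interleave(x): return ((x << 1) | (x >> 7)) & 0xFF
def modulate(x):   return (x, x ^ 0xFF)   # placeholder "I/Q" pair

STAGES = [scramble, interleave, modulate]

def run_pipeline(samples, fifo_depth=4):
    # fifos[i] feeds STAGES[i]; fifos[-1] collects the final results.
    fifos = [deque(samples)] + [deque() for _ in STAGES]
    out = []
    while any(f for f in fifos):
        if fifos[-1]:
            out.append(fifos[-1].popleft())
        # Service stages downstream-first so full FIFOs drain and unblock producers.
        for i in reversed(range(len(STAGES))):
            if fifos[i] and len(fifos[i + 1]) < fifo_depth:  # stall on full FIFO
                fifos[i + 1].append(STAGES[i](fifos[i].popleft()))
    return out

print(run_pipeline(range(8)))
```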

2.1.2 Memory requirements of the targeted tasks

With an ever increasing number of transistors possible per die, modern programmable processors typically use not only an increasing amount of on-chip memory, but also an increasing percentage of die area for memory. Figure 2.4 shows the area breakdown of four modern processors [7, 13, 24, 25] with memories that occupy 55% to 75% of the processor's area. Large memories reduce the area available for execution units, consume significant power, and require larger delays per memory transaction. Therefore, architectures that minimize the need for memory and keep data near or within processing elements can increase area efficiency, performance, and energy efficiency.

A notable characteristic of the targeted DSP and embedded tasks is that many have very limited memory requirements compared to general-purpose tasks.

Table 2.1: Memory requirements for common DSP tasks, assuming a simple single-issue processor

Task                     Instruction Mem (words)   Data Mem (words)
N-pt FIR                 6                         2N
8-pt DCT                 40                        16
8 × 8 2-D DCT            154                       72
Conv. coding (k = 7)     29                        14
Huffman encoder          200                       350
N-point convolution      29                        2N
64-point complex FFT     97                        192
Bubble sort              20                        1
N merge sort             50                        N
Square root              62                        15
Exponential              108                       32

The level of required memory must be differentiated from the amount of memory that can be used, or is typically used, to compute these kernels. For example, an N-tap filter may be programmed using a vast amount of memory even though the base kernel requires only 2N data words. Table 2.1 lists the actual amounts of instruction and data memory required for several DSP tasks commonly found in DSP applications. These numbers assume a simple single-issue fixed-point processor. The data show that several hundred words of memory are enough for many DSP and embedded tasks—far smaller than the 10 KBytes to 10 MBytes per processing element typically found in modern DSP processors. Reducing memory sizes can result in significant area and power savings.
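As one concrete instance of the 2N figure in Table 2.1, a direct-form N-tap FIR needs only the N coefficient words plus an N-word delay line. The sketch below is a generic FIR written in a high-level language, not AsAP code.

```python
# A direct-form N-tap FIR filter: the kernel's entire data-memory footprint is
# N coefficient words plus an N-word delay line, i.e. the 2N words in Table 2.1.

def make_fir(coeffs):
    n = len(coeffs)
    delay = [0] * n                 # N-word delay line (the only state kept)

    def step(sample):
        delay.insert(0, sample)     # shift in the newest sample
        delay.pop()                 # discard the oldest
        return sum(c * x for c, x in zip(coeffs, delay))

    return step

# 4-tap moving-average filter as an example: 2 * 4 = 8 data words of state.
fir = make_fir([0.25, 0.25, 0.25, 0.25])
print([round(fir(s), 3) for s in [1, 1, 1, 1, 0, 0, 0, 0]])
```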

In addition to the inherently small memory requirements of DSP applications, architectural features such as address generators and embedded NOP instructions can also help reduce the instruction memory requirement, as discussed further in Section 2.2.1.

2.1.3 Simple single issue datapath

The datapath, or execution unit, plays a key role in processor computation, and also occupies a considerable amount of chip area. Uniprocessor systems are shifting from single issue architectures to wide issue architectures in which multiple execution units are available to enhance system performance. For chip multiprocessor systems, there remains a question about the trade-off between using many small single-issue processors, versus larger but fewer wide-issue processors.

Wide-issue processors work well when the instructions fetched during the same cycle are highly independent and can take full advantage of functional unit parallelism, but this is not always the case. Furthermore, a large wide-issue processor has a centralized controller and more complex wiring and control logic, and its area and power consumption increase faster than linearly with the number of execution units. One model of area and power for processors of different issue widths, derived by J. Oliver et al. [47], shows that a single 32-issue processor occupies more than 2 times the area and dissipates approximately 3 times the power of 32 single-issue processors.

Using multiple single-issue processors, as AsAP does, usually achieves higher area efficiency and energy efficiency; such processors perform particularly well on many DSP applications, since those applications are often made up of components exhibiting task-level parallelism, so tasks are easily spread across multiple processors.

Using relatively simple (and hence small) processors in many-core chips has other advantages as well. It is easier to apply dynamic clock and voltage scaling to smaller processors to achieve higher energy efficiency; more processors fit on a single chip, providing redundant processors that enable flexible fault-tolerance techniques; and smaller processors are easier to design and verify.

Disadvantages of using large numbers of simple processors include extra inter-processor communication overhead when applications have complex communication requirements, which is discussed further in Sec. 5.1.4, and reduced efficiency when the application cannot be partitioned into independent tasks.

2.1.4 GALS clocking style

A globally synchronous clock style is normally used in modern integrated circuits, but with the larger relative wire delays and larger parameter variations of deep submicron technologies, it has become increasingly difficult to design both large chips and chips with high clock rates.

Additionally, high-speed global clocks consume a significant portion of the power budget in modern processors. For example, 1/4 of the total power dissipation in the recent 2-core Itanium [13] is consumed by the clock distribution circuits and the final clock buffers. Also, the synchronous style lacks the flexibility to independently control the clock frequency of system sub-components to achieve increased energy efficiency.

(a) Globally synchronous chip multiprocessor with H-tree clock distribution; (b) GALS chip multiprocessor.

Figure 2.5: Block diagrams of (a) synchronous and (b) GALS tile-based chip multiprocessors; the small boxes in the right figure are the local oscillators for each processor

Furthermore, a synchronous clock style in a multi-core system, as shown in Fig. 2.5 (a), has difficulty taking full advantage of scalability benefits, since the clock tree must be redesigned whenever the number of processors in the chip changes, which can require a large effort in high-performance systems. Also, clock skew in globally synchronous chips is expected to increase as the number of processors increases, which decreases performance.

The opposite of the globally synchronous clock style—fully asynchronous operation, in which no timing reference exists—has the potential for speed and power improvements, but it currently lacks EDA tool support, is difficult to design, and has large circuit overhead which reduces its efficiency.

The GALS [40] clocking style separates processing blocks such that each part is clocked by an independent clock domain. Its use enables the possibility of eliminating global clock distribution completely, which brings power and design-complexity benefits. Another significant benefit of GALS is the opportunity to easily and completely shut off a circuit block's clock (not just portions of the clock tree as with clock gating) when there is no work to do. Additionally, independent clock oscillators permit independent clock frequency scaling, which can dramatically reduce power dissipation in combination with supply voltage scaling [48]. Furthermore, when the GALS clocking style is used in multi-core systems as shown in Fig. 2.5 (b), scalability is dramatically increased and the physical design flow is greatly simplified, since only a single processor must be designed and the entire chip can be generated easily by duplicating that processor design.

2.1.5 Wires and on chip communication

As discussed in Section 1.1.2, wires are introducing greater delay, power consumption, and other impacts on chips, and traditional on-chip communication methods such as the global bus have met considerable challenges. Global chip wires will dramatically limit performance in future fabrication technologies if not properly addressed, since their delay is roughly constant with technology scaling, which leads to an increasing percentage of the clock cycle time. A number of architectures such as systolic arrays [49], RAW [35], and TRIPS [50] have specifically addressed this concern and have recommended tile-based architectures. Therefore, architectures that enable the elimination of long high-speed wires will likely be easier to design and may operate at higher clock rates [16].

There are several methods to avoid global wires. Networks on chip (NoC) [51] treat different modules in a chip as different nodes in a network and use routing techniques to transfer data. Currently, most chip multiprocessors target broad general-purpose applications and use complex inter-processor communication strategies. For example, RAW [35] uses a separate complete processor to provide powerful static and dynamic routing functions similar to a NoC, BlueGene/L [25] uses a torus network and a collective network to handle inter-processor communication, and Niagara [26] uses a crossbar to connect 8 cores and memories. These methods provide flexible communication capability, but consume a significant portion of the area and power in communication circuits.

Another method is local communication, where each processor communicates only with processors within a local domain. One of the simplest examples is nearest neighbor communication, where each processor directly connects and communicates only with immediately adjacent processors, and long-distance communication is accomplished in software by intermediate processors. Most DSP applications, especially those that are stream-like [52], have regular structure, which makes it possible to use a simple nearest neighbor communication scheme to achieve high area and energy efficiency without a large performance loss. As can be seen from several popular industry-standard DSP benchmarks such as TI [53], BDTI [54], and EEMBC [55], the most common tasks include FIR and IIR filtering, vector operations, the Fast Fourier Transform (FFT), and various control and data manipulation functions. These tasks can normally be linearly pipelined, as shown in Fig. 2.6 (a) and (b), and the result from one stage can be pumped directly to the next stage without complex global communication. Complete applications containing multiple DSP tasks have similar features, as shown by the examples in Fig. 2.6 (c) and Fig. 2.3 for the JPEG encoder and the 802.11a/g baseband transmitter. All of these examples can be handled efficiently by nearest neighbor inter-processor communication.

Figure 2.6: Some linear pipelined algorithm models for DSP tasks and applications; (a) Transposed direct FIR filter, (b) 16-pt FFT, (c) JPEG encoder

The greatest challenge when using nearest-neighbor interconnects is efficiently mapping applications that exhibit significant long-distance communication. The communication architecture will be investigated further in Chapter 3, to achieve communication ability close to NoC with cost close to the nearest neighbor interconnect.
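A minimal sketch of how long-distance traffic can be carried on a nearest-neighbor mesh is shown below: each intermediate processor simply forwards, in software, words that are not destined for it. The X-then-Y routing rule and the coordinate scheme are illustrative assumptions, not the AsAP programming model.

```python
# Illustrative software relay on a nearest-neighbor mesh: a word travelling from
# source to destination is forwarded hop by hop by intermediate processors,
# moving along X first and then along Y. This is a model, not AsAP code.

def next_hop(cur, dst):
    """One nearest-neighbor step from cur toward dst (X first, then Y)."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:
        return (cx, cy + (1 if dy > cy else -1))
    return cur   # already at the destination

def route(src, dst):
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst))
    return path

# Processor (0, 0) sending to (3, 2) on a 6x6 array: 5 hops, 4 of them relays.
path = route((0, 0), (3, 2))
print(path)
print(f"{len(path) - 1} hops, {len(path) - 2} intermediate relay processors")
```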

2.2 The AsAP processor system

Figure 2.7 shows a block diagram of an AsAP processor and the fabricated processing array. Each processor is a simple single-issue processor and contains a local clock oscillator, two dual-clock FIFOs to provide communication with other processor cores, and a simple 16-bit CPU including ALU, MAC, and control logic. Each processor contains a 64-word 32-bit instruction memory and a 128-word 16-bit data memory. Processors also contain static and dynamic configuration logic to provide configurable functions such as addressing modes and interconnections with other processors.


Figure 2.7: Block diagram of an AsAP processor

Each processor can receive data from any two neighbors and can send data to any combination of its four neighbors. Each processor contains two input ports because this meshes well with the data flow graphs of the applications we have studied: two or more input ports are clearly required to support graph fan-in, and a third input port was found to be used infrequently. AsAP supports 54 RISC-style instructions. It utilizes a memory-to-memory architecture with no register file. During the design phase, hardware was added only when it significantly increased performance and/or energy efficiency for our benchmarks.

2.2.1 Single AsAP processor design

Pipelining and datapath

Each AsAP processor has a nine-stage pipeline as shown in Fig. 2.8. The IFetch stage fetches instructions according to the program counter (PC). No branch prediction circuits are implemented. All control signals are generated in the Decode stage and pipelined appropriately. The Mem Read and Src Select stages fetch data from the data memory (Dmem), the immediate field, the asynchronous FIFO interfaces from other processors, the dynamic configuration memory (DCMem), the ACC accumulator register, or the ALU/MAC forwarding logic. The execution stages occupy three cycles, and bypass logic is used for the ALU and MAC to alleviate data hazard pipeline penalties. The Result Select and Write Back stages select results from the ALU or MAC unit and write the result to data memory, DC memory, or neighboring processors. To simplify pre-tapeout verification, pipeline interlocks are not implemented in hardware, and all code is scheduled prior to execution by the programmer or compiler.


Figure 2.8: AsAP 9-stage pipeline

The MAC unit is divided into three stages to enable a high clock rate as well as the capability of issuing MAC and multiply instructions every cycle. The first stage generates the partial products of the 16×16 multiplier. The second stage uses carry-save adders to compress the partial products into a single 32-bit carry-save output. The final stage contains a 40-bit adder to add the results from the second stage to the 40-bit accumulator register (ACC). Because the ACC is normally read infrequently, only the least-significant 16 bits of the ACC are readable. More significant ACC bits are read by shifting those bits into the 16 LSBs. This simplification reduces hardware and relaxes timing in the final MAC unit stage which is the block’s critical pipeline stage.
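The accumulate-and-shift behavior described above can be mimicked with a small behavioral model; the class and method names below are invented, the arithmetic is kept unsigned for simplicity, and this is a sketch of the described behavior rather than the actual RTL.

```python
# Behavioral sketch of the MAC unit described above: 16x16 multiplies are
# accumulated into a 40-bit ACC, only the 16 LSBs of which are directly
# readable; higher bits are exposed by shifting them down into the LSBs.

ACC_MASK = (1 << 40) - 1

class MacModel:
    def __init__(self):
        self.acc = 0

    def mac(self, a, b):
        """Multiply two 16-bit operands and accumulate into the 40-bit ACC."""
        self.acc = (self.acc + (a & 0xFFFF) * (b & 0xFFFF)) & ACC_MASK

    def read_low16(self):
        return self.acc & 0xFFFF

    def shift_right16(self):
        """Shift the ACC right so the next 16 bits become readable."""
        self.acc >>= 16

m = MacModel()
for a, b in [(30000, 30000), (40000, 40000), (12345, 6789)]:
    m.mac(a, b)
low = m.read_low16()
m.shift_right16()
mid = m.read_low16()
print(f"ACC low 16 bits: 0x{low:04X}, next 16 bits after shift: 0x{mid:04X}")
```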

Local oscillator

Each processor has its own digitally programmable clock oscillator which provides the processor's clock. There are no PLLs (phase-locked loops), DLLs (delay-locked loops), or global frequency or phase-related signals, and the system is fully GALS. While impressive low-clock-skew designs have been achieved at multi-GHz clock rates, the effort expended in clock tree management and layout is considerable [56]. Placing a clock oscillator inside each processor reduces the size of the clock tree circuit to a fraction of a mm²—the size of a processing element. Large systems can be made with arrays of processing elements with no change whatsoever to the clock trees (which are wholly contained within processing elements), simplifying overall design complexity and scalability.


Figure 2.9: Programmable clock oscillator: an inverter ring with configurable tri-state inverters, ring size, and frequency divider

The oscillator is an enhanced ring oscillator as shown in Fig. 2.9. It is built entirely with standard cells and occupies only about 0.5% of the processor's area. Three methods are used to configure the frequency of the oscillator. First, the ring size can be configured to 5 or 9 stages using the configuration signal stage_sel. Second, seven tri-state inverters are connected in parallel with each inverter; when a tri-state inverter is turned on, that stage's current drive increases and the ring's frequency increases [57]. Third, a clock divider at the output divides the clock by 1 to 128. The halt_clk signal and an SR latch allow the oscillator to halt cleanly when the processor stalls, without any partial clock pulses.

The oscillator has 2 ring-size settings × 32768 tri-state-buffer settings × 8 clock-divider settings = 524,288 frequency settings, and measured results from the fabricated chip show that its frequency range is 1.66 MHz to 702 MHz over all possible configurations. In the range of 1.66 to 500 MHz, approximately 99% of the frequency gaps are smaller than 0.01 MHz, and the largest gap is 0.06 MHz. Despite the fact that the layout of all processors is exactly the same, variations largely due to the fabrication process cause different processors on the same die to perform differently from others.
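As a rough illustration only, the sketch below enumerates the configuration space and shows how a power-of-two divider would be chosen for a target frequency; the ring frequency value is hypothetical, and only the setting counts (2 × 32768 × 8) follow the description above.

    #include <stdio.h>

    /* Illustrative sketch of the oscillator configuration space; the ring
     * frequency is hypothetical, only the setting counts follow the text. */
    int main(void) {
        const long ring_sizes  = 2;      /* 5- or 9-stage ring           */
        const long drive_codes = 32768;  /* tri-state inverter settings  */
        const long dividers    = 8;      /* divide by 1, 2, 4, ..., 128  */
        printf("settings = %ld\n", ring_sizes * drive_codes * dividers); /* 524288 */

        double f_ring   = 600.0;         /* MHz, hypothetical ring frequency */
        double f_target = 75.0;          /* MHz, hypothetical target         */
        int div = 1;
        while (div < 128 && f_ring / div > f_target)
            div *= 2;                    /* choose the power-of-two divider  */
        printf("divide by %d -> %.2f MHz\n", div, f_ring / div);
        return 0;
    }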

As Fig. 2.10 shows (data measured by group member D. Truong), at the same configuration setting on the same chip, the oscillator located in the bottom-right processor has a frequency greater than 540 MHz, while several oscillators in the top row have frequencies below 500 MHz; the frequency difference is approximately 10%.

506.7 498.5 497.4 505.9 510.1 520.9
507.6 506.2 498.9 506.8 510.6 520.8
508.4 507.2 503.7 507.3 511.2 517.1
514.7 509.9 511.6 512.1 513.9 515.5
519.3 521.0 515.8 519.3 518.2 531.5
536.2 535.2 538.6 532.4 537.1 541.2

Figure 2.10: Physical distribution of measured oscillator frequencies across different processors with the same configuration. Data are given in MHz with contour lines from 500 MHz to 540 MHz in 5 MHz steps.

Figure 2.11: Example waveform of clock halting and restarting (signals shown: osc_off, clk, halt_clk, stall_fifo)

Processors stall when they try to read data from an empty FIFO or write data to a full

FIFO. During these situations, the clock oscillator can be halted so that the processor consumes no power whatsoever except leakage. Figure 2.11 shows an example waveform for clock halting and restarting. Signal stall_fifo is asserted when FIFO access is stalled due to either an empty input or full output condition. After a nine-clock-cycle period, during which the processor’s pipeline is

flushed, the signal halt_clk goes high, which halts the clock oscillator. The signal stall_fifo returns low when the cause of the stall has been resolved; then halt_clk restarts the oscillator in less than one clock period.

Table 2.2: AsAP 32-bit instruction types and fields
Instruction type   6 bits    8 bits   8 bits            8 bits   2 bits
General            opcode    dest     src1              src2     NOP
Immediate          opcode    dest     immediate (16 bits)        NOP
Branch             opcode    –        –                 target   NOP

Table 2.3: Classes of the 54 supported instructions
Instruction class       Number of instructions
Addition                7
Subtraction             7
Logic                   11
Shift                   4
Multiply                2
Multiply-accumulate     6
Branch                  13
Miscellaneous           4

Using this method, power is reduced by 53% and 65% for a JPEG encoder and an 802.11a/g transmitter application, respectively.

Instruction set

AsAP supports 54 32-bit instructions with saturation modes, and supports three broad instruction formats. A summary of the 54 instructions is given in Tables 2.2 and 2.3. General instructions select two operands from the memories, the accumulator, the FIFOs, and three ALU bypass routes; operate on the two operands with addition, subtraction, logic operations, multiply, or multiply-accumulate; and then select one destination from the memories, the accumulator, and the output ports. Immediate instructions receive input from a 16-bit immediate field. Branch instructions include a number of conditional and unconditional branch functions.

The seven addition instructions are: add and save low 16 bits; add and save high 16 bits; add with saturation; add with carry; add with carry and save high 16 bits; add with carry and with saturation; and add by 1. Similar instructions exist for subtraction. The eleven logic instructions are: NOT, AND, NAND, OR, NOR, XOR, XNOR, MOVE, ANDWORD, ORWORD, and XORWORD. Besides the common conditional branch instructions which determine the destination according to the conditional flags (such as negative, zero, carry, and overflow), the instruction set also supports branching when the FIFO is empty or full.

Figure 2.12: Relative instruction memory cost by using the embedded NOP instruction (applications: 8-pt DCT, 8×8 DCT, zig-zag, merge sort, bubble sort, matrix multiply, 64-pt FFT, JPEG, 802.11a/g)

The four miscellaneous instructions include a 16-bit bit-reverse instruction (for FFT addressing), an accumulator shift, a 16-bit move immediate, and halt.

Two bits in each instruction define how many NOP operations (from 0 to 3) should follow after instruction processing, which allows inserting NOPs to avoid pipeline hazards without requiring additional NOP instructions and helps reduce instruction memory requirements dramatically.

Figure 2.12 shows that instruction memory requirements can be reduced by approximately 30% for nine applications. Below is one example showing how to use this instruction format, where the data memory word at address 1 is incremented by 1 and the result is sent to data memory address 2 and to the output port:

    Add  Dmem 2   Dmem 1   #1   NOP3
    Add  Obuf     Dmem 2

Other than a bit-reverse instruction and a bit-reverse mode in the address generators, which are useful for the calculation of the Fast Fourier Transform (FFT), no algorithm-specific instructions or hardware are used. While single-purpose hardware can greatly speed computation for specific algorithms, it can prove detrimental to the performance of a complex multi-algorithmic system and limits performance for future, presently-unknown algorithms, which are one of the key domains for programmable processors.

Table 2.4: Data fetch addressing modes
Addressing mode     Example               Meaning
Direct              Move Obuf Dmem 0      Obuf ← Dmem[0]
Address pointer     Move Obuf aptr0       Obuf ← Dmem[DCMem]
Address generator   Move Obuf ag0         Obuf ← Dmem[generator]
Short immediate     Add Obuf #3 #3        Obuf ← 3+3
Long immediate      Move Obuf #256        Obuf ← 256
DCMem               Move Obuf DCMem 0     Obuf ← DCMem[0]
Bypassing           Move Obuf regbp1      Obuf ← first bypass
FIFOs               Move Obuf Ibuf0       Obuf ← FIFO 0
ACC                 Move Obuf Acc         Obuf ← ACC[15:0]

Data addressing

AsAP processors fetch data at pipeline stage Mem Read, using the addressing modes listed in Table 2.4. Three methods are supported to address data memory. Direct memory addressing uses immediate data as the address to access static memory locations; four address pointers access memory according to the values in special registers located in DCMem; and four address generators provide more flexible addressing such as automatic increment and decrement, with special-purpose hardware to accelerate many tasks. In addition to the data memory, AsAP processors can also fetch data from another 6 locations: 1) short immediate data (6 bits) can be used in dual-source instructions, 2) long immediate data (16 bits) can be used in the move immediate instruction, 3) the DCMem's configuration information, 4) three bypass paths from the ALU and MAC units to accelerate execution, 5) the two processor input FIFOs, and 6) the low 16 bits of the accumulator register.

Address generators help reduce the required instruction memory for applications since they can handle many complex addressing functions without any additional instructions. The upper figure of Fig. 2.13 shows the estimated relative instruction cost for a system using three addressing modes to fulfill the same functions. Compared to systems primarily using direct memory addressing and address pointers, the system containing address generators reduces the number of required instructions by 60% and 13% respectively. Also, using address generators can increase system performance.


Figure 2.13: Comparison of relative instruction memory cost and system performance for three different addressing modes (direct addressing, address pointers, and address generators). Comparisons are made against the direct address case which uses straight-line code with pre-calculated addresses only.

As shown in the lower figure of Fig. 2.13, the system using address generators comes within 85% of the performance of a system using direct addressing with pre-calculated addresses, and achieves approximately 2 times higher performance than a system using address pointers alone.
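To make the instruction savings concrete, below is a behavioral C sketch of an auto-incrementing address generator with a wrap-around window; the field names and wrap policy are illustrative assumptions, not the actual AsAP DCMem register layout.

    #include <stdint.h>

    /* Behavioral sketch of an auto-increment/decrement address generator
     * with a wrap-around window.  Field names and wrap policy are assumed
     * for illustration; the actual AsAP generator options differ. */
    typedef struct {
        uint16_t addr;     /* current data-memory address            */
        int16_t  stride;   /* e.g. +1 or -1                          */
        uint16_t start;    /* first address of the window            */
        uint16_t end;      /* last address of the window (inclusive) */
    } addr_gen_t;

    /* Returns the current address and advances the generator as a side
     * effect, so loop bodies need no separate address-update instructions. */
    static uint16_t addr_gen_next(addr_gen_t *g) {
        uint16_t current = g->addr;
        int32_t next = (int32_t)g->addr + g->stride;
        if (next > (int32_t)g->end || next < (int32_t)g->start)
            next = g->start;            /* wrap back to the window start */
        g->addr = (uint16_t)next;
        return current;
    }

Because the address update happens as a side effect of the data fetch, an inner loop contains only the arithmetic instructions, which is the source of the instruction memory savings shown in the upper plot of Fig. 2.13.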

2.2.2 Inter-processor communication — Reconfigurable 2-D mesh network

The AsAP architecture connects processors via a configurable 2-dimensional mesh as shown in Fig. 2.14. To maintain link communication at full clock rates, inter-processor connections are made to nearest-neighbor processors only. Each processor has two asynchronous input data ports and can connect each port to any of its four nearest neighboring processors. The input connections of each processor are normally defined during the configuration sequence after powerup. The output port connections can be changed among any combination of the four neighboring processors at any time through software. Input ports are read and output ports written through reserved program variables, and inter-processor timing is in fact invisible to programs without explicit software synchronization.

A number of architectures, including wavefront [49], RAW [35], and TRIPS [50], have specifically addressed this concern and have demonstrated the advantages of a tile-based architecture.


Figure 2.14: Nearest neighbor inter-processor communication diagram

AsAP’s nearest neighbor connections result in no high-speed wires with a length greater than the linear dimension of a processing element. The inter-processor delay decreases with advancing fabrication technologies and allows clock rates to easily scale upward. Longer distance data transfers in AsAP are handled by routing through intermediary processors or by “folding” the application’s data flow graph such that communicating processing elements are placed adjacent or near each other.

Source synchronization for GALS systems

GALS systems introduce modules with different clock domains, and the communication between those modules requires special consideration. The methods can be classified into two categories.

The first method is an asynchronous handshake [58] as shown in Fig. 2.15 (a). The source sends a request signal and a single data word at each transaction, and cannot start a new data transfer until it receives the acknowledge signal from the destination. A corresponding latency therefore exists for each data transfer. In order to sustain higher throughput, coarse-grain flow control or a source synchronous method can be used as shown in Fig. 2.15 (b), where the clock of the source processor travels along with the signals to control the writing into a buffer.


Figure 2.15: Two strategies for communication in GALS systems. (a) asynchronous handshake which requires more than one clock cycle for each transaction, and (b) source synchronous flow control which can sustain one transaction per clock cycle.


Figure 2.16: Block diagram of the dual-clock FIFO used for asynchronous boundary communication

Here the data can be transmitted in each clock cycle as long as the buffer is not full. This technique generally requires larger buffers in the destination, and it also needs to support reads and writes in different clock domains, since writing is controlled by the source processor while reading is controlled by the destination processor. Only the second approach is considered due to its higher throughput, and because the larger buffer does not present a significant penalty when compared to the area of a coarse block such as a processor. The throughput of the asynchronous boundary communication can reach 1 data word per cycle when the FIFO is neither full nor empty.

The reliable transfer of data across unrelated asynchronous clock domains is accomplished by mixed-clock-domain FIFOs. The dual-clock FIFOs read and write in fully independent clock domains. A block diagram of the FIFO’s major components is shown in Fig. 2.16. The FIFO’s write clock and write data are supplied in a source-synchronous fashion by the upstream processor, and the FIFO’s read clock is supplied by the downstream processor, which is the host for the dual-clock FIFO in AsAP. The read and write addresses are transferred across the asynchronous interface and are used to decide whether the FIFO is full or empty. In order to avoid changing multiple values at the same time across the asynchronous interface, the addresses are gray coded when transferred across the clock domain boundary [41].
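The binary-to-Gray conversion used for such address transfers is a standard technique; a generic sketch (not the AsAP RTL) is shown below, and its key property is that successive binary values differ in exactly one Gray-coded bit.

    #include <stdint.h>

    /* Standard binary <-> Gray conversions, commonly used when FIFO
     * read/write addresses cross a clock-domain boundary so that only one
     * bit changes per increment.  Generic sketch, not the AsAP RTL. */
    static inline uint16_t bin_to_gray(uint16_t b) {
        return b ^ (b >> 1);
    }

    static inline uint16_t gray_to_bin(uint16_t g) {
        uint16_t b = g;
        for (unsigned shift = 1; shift < 16; shift <<= 1)
            b ^= b >> shift;    /* fold the prefix XOR back to binary */
        return b;
    }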

Configurable synchronization registers are inserted in the asynchronous interface to alleviate metastability. Although synchronization cannot drive the mean time-to-failure (MTTF) to infinity, it can make it arbitrarily high [59]. The number of synchronization registers used is a trade-off between the synchronizer’s robustness and the system performance overhead due to the synchronizer’s latency.

The latency to communicate across the asynchronous boundary is approximately 4 clock cycles in this example, which is made up of the write logic latency (1 cycle), synchronization latency (2 cycles if two registers are used), and the read logic latency (1 cycle). In a relevant example [59], it was estimated that the MTTF when using one register is measured in years and it will increase to millions of years for two registers, so a small number of synchronization registers is sufficient.

The system performance (throughput) overhead due to the synchronization latency is always quite small, as will be discussed in Section 6.3.

2.3 Application implementations and software

2.3.1 Application implementations

Dividing applications into several tasks, coding each task independently, and mapping each task onto a different processor (or processors) is the method used to program the described GALS chip multiprocessor. Partly due to the natural partitioning of applications by task-level parallelism, the programming is less challenging than first expected. This is borne out somewhat by data in Table 2.1 showing the sizes of common DSP tasks. The mapped applications include: an 8-point Discrete

Cosine Transform (DCT) using 2 processors, an 8×8 DCT using 4 processors, a zig-zag transform using 2 processors, a merge sort using 8 processors, a bubble sort using 8 processors, a matrix multiplier using 6 processors, a 64-point complex FFT using 8 processors, a JPEG encoder core using 9 processors, and an IEEE 802.11a/g wireless LAN transmitter using 22 processors [43]. The 2-D DCT and JPEG are described in detail as two examples.

Figure 2.17: 8×8 DCT implementation using 4 processors (1-D DCT, transpose, 1-D DCT, transpose)

2-D DCT

An 8×8 DCT application is shown in Fig. 2.17. Equation 2.1 shows the algorithm for the 8×8 DCT, and it can be further expressed and processed using two 1-dimensional DCTs as shown in Equation 2.2, where $a(0) = 1/\sqrt{8}$ and $a(m) = \sqrt{2/8}$ for $1 \leq m \leq 7$. The first and third processors compute 1-dimensional 8-point DCTs using an efficient algorithm [60]. The second and fourth processors perform row and column transposes of the data.

\[ G_c(m,n) = a(m)\,a(n) \sum_{i=0}^{7}\sum_{k=0}^{7} g(i,k)\,\cos\frac{(2k+1)n\pi}{16}\,\cos\frac{(2i+1)m\pi}{16} \tag{2.1} \]

\[ G_c(m,n) = a(m) \sum_{i=0}^{7}\left[\, a(n) \sum_{k=0}^{7} g(i,k)\,\cos\frac{(2k+1)n\pi}{16} \right]\cos\frac{(2i+1)m\pi}{16} \tag{2.2} \]
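A direct, unoptimized C sketch of the row-column decomposition of Equation 2.2 is shown below; it is floating point and for illustration only, while the AsAP mapping uses a fast fixed-point 1-D DCT algorithm [60].

    #include <math.h>

    #define N 8
    static const double PI = 3.14159265358979323846;

    /* Direct 1-D DCT of one length-8 vector, using the a(m) scaling above. */
    static void dct_1d(const double in[N], double out[N]) {
        for (int n = 0; n < N; n++) {
            double a = (n == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N);
            double sum = 0.0;
            for (int m = 0; m < N; m++)
                sum += in[m] * cos((2 * m + 1) * n * PI / (2.0 * N));
            out[n] = a * sum;
        }
    }

    static void transpose(double x[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++) {
                double t = x[i][j]; x[i][j] = x[j][i]; x[j][i] = t;
            }
    }

    /* 8x8 DCT as two 1-D DCT passes separated by transposes, matching the
     * 1-DCT / transpose / 1-DCT / transpose mapping of Fig. 2.17. */
    static void dct_8x8(double x[N][N]) {
        double row[N];
        for (int i = 0; i < N; i++) {
            dct_1d(x[i], row);
            for (int j = 0; j < N; j++) x[i][j] = row[j];
        }
        transpose(x);
        for (int i = 0; i < N; i++) {
            dct_1d(x[i], row);
            for (int j = 0; j < N; j++) x[i][j] = row[j];
        }
        transpose(x);
    }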

JPEG encoder

JPEG (Joint Photographic Experts Group) is a popular algorithm for still image compression [61], and Fig. 2.6 (c) shows its encoder block diagram. Level shift requires subtracting a fixed value from every sample. Quantization divides the DCT result by a specific value, and rounding is required to obtain an accurate result. Zig-zag reorders the 8 × 8 data block so that the low frequency coefficients are grouped, bringing more zero-value data together and making the Huffman encoding more efficient.


Figure 2.18: JPEG encoder core using 9 processors; thin arrows show all paths and wide arrows show the primary data flow.

Table 2.5: Computation load of the nine processors in the JPEG encoder to execute an 8×8 data block
Processor No.   Function               Computation load (clock cycles)
1               Lv-shift, 1-DCT        408
2               Trans in DCT           204
3               1-DCT                  408
4               Quant., Zigzag         652
5               Zigzag                 134
6               DC in Huffm (below)    85
7               DC in Huffm (up)       140
8               AC in Huffm (below)    1423
9               AC in Huffm (up)       1390

Huffman coding is an efficient variable-length coding technique to represent information using fewer bits. Figure 2.18 shows a JPEG encoder core using nine processors. Three processors compute the level shift and an 8×8 DCT, and four processors implement a Huffman encoder. Processing each 8×8 block requires approximately 1400 clock cycles. Table 2.5 shows the computation load of the nine processors to execute an 8×8 data block.
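One common way to express the zig-zag reorder is a precomputed scan-order table; the sketch below uses the conventional JPEG zig-zag order, which is assumed for illustration and not taken from the AsAP program code.

    #include <stdint.h>

    /* Zig-zag reorder of an 8x8 coefficient block using the conventional
     * JPEG scan-order table (assumed here for illustration). */
    static const uint8_t zigzag[64] = {
         0,  1,  8, 16,  9,  2,  3, 10,
        17, 24, 32, 25, 18, 11,  4,  5,
        12, 19, 26, 33, 40, 48, 41, 34,
        27, 20, 13,  6,  7, 14, 21, 28,
        35, 42, 49, 56, 57, 50, 43, 36,
        29, 22, 15, 23, 30, 37, 44, 51,
        58, 59, 52, 45, 38, 31, 39, 46,
        53, 60, 61, 54, 47, 55, 62, 63
    };

    /* Low-frequency coefficients come first, so zero-valued high-frequency
     * data cluster at the end of the output, which helps the Huffman stage. */
    static void zigzag_reorder(const int16_t in[64], int16_t out[64]) {
        for (int i = 0; i < 64; i++)
            out[i] = in[zigzag[i]];
    }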

2.3.2 Software

Besides using assembly language to program the processor, a high level language (called AsAP-C) and its corresponding compiler were developed to generate code for each individual AsAP processor. AsAP-C contains most standard C language functions such as arithmetic calculations (addition, subtraction, multiplication, etc.), logic calculations (and, or, not, etc.), and control functions (while loops, for loops, etc.). A saturating integer type is defined to support DSP integer calculations which are commonly used in high level DSP languages [62]. Additionally, the language contains several functions specific to AsAP such as FIFO reads and direct inter-processor communication. Both inputs and outputs are mapped into the language through the reserved variable names: Ibuf0, Ibuf1, and Obuf. The software task also includes the mapping of processor graphs to the 2-D planar network. An area of interesting future work is tools for the automatic mapping of graphs to accommodate rapid programming and to recover from hardware faults and extreme variations in circuits, environment, and workload [63].
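To give a flavor of the programming model, the sketch below is written in the style of AsAP-C; the reserved names Ibuf0 and Obuf follow the description above, while the extern declarations, the task itself, and its coefficients are hypothetical stand-ins rather than actual AsAP-C library definitions.

    /* Illustrative AsAP-C style task for one processor.  In AsAP-C, Ibuf0,
     * Ibuf1, and Obuf are reserved variables bound to the input FIFOs and
     * the output port; the externs below only stand in for them so the
     * sketch reads as plain C.  The computation itself is hypothetical. */
    extern volatile int Ibuf0;  /* each read dequeues one word from input FIFO 0 */
    extern volatile int Obuf;   /* each write sends one word to the output port  */

    void weighted_sum_task(void) {
        const int coeff[4] = {1, 2, 2, 1};
        for (;;) {
            int acc = 0;
            for (int i = 0; i < 4; i++)
                acc += coeff[i] * Ibuf0;  /* stalls if the input FIFO is empty    */
            Obuf = acc >> 2;              /* stalls if the downstream FIFO is full */
        }
    }

Blocking FIFO reads and writes provide the only synchronization a program needs, consistent with the GALS communication described in Section 2.2.2.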

2.4 Related work

This section discusses related work. Starting from traditional DSP processors, some pioneering multiprocessor and multi-core systems proposed or developed mainly in the 1970s and 1980s are then discussed; finally, some major modern multi-core processors (including those for general-purpose and domain-specific applications) are discussed and compared.

2.4.1 Traditional DSP processors

Programmable DSPs appeared in the early 1980s for high performance computation in DSP applications. In the early days, DSP processors were distinguished from general purpose processors by several features [64], such as higher arithmetic processing ability through integration of a hardware multiplier/accumulator and saturation circuitry, and higher memory access bandwidth through multiple memory banks and sophisticated addressing mode generators.

When new applications such as multimedia processing [65] emerged, both DSP processors and general purpose processors tried to enhance their features to make themselves suitable for those new applications. General purpose processors adopted new technologies such as instruction level parallelism (superscalar) and data level parallelism (SIMD or MMX technology [66]). DSP processors also adopted new architectures such as VLIW parallelism and hierarchical cache memory organization. As a result, the architectures of state-of-the-art DSPs and general purpose processors have become broadly similar to each other.

Table 2.6: Major features of some commercial DSP processors

Company             Processor    Architecture   Cache/Memory (KB)   Clock frequency (MHz)   Word width (bits)
Texas Instruments   C2000        single issue   13–295              40–150                  32
Texas Instruments   C54x         single issue   24–280              50–300                  16
Texas Instruments   C64x         8-way VLIW     1024–2048           500–1000                32
Analog Devices      ADSP         single issue   8–48                75–160                  16
Analog Devices      SHARC        single issue   500–3000            150–400                 32/40 floating pt.
Analog Devices      TigerSHARC   4-way VLIW     500–3000            250–600                 32 floating pt.
Freescale           DSP563xx     single issue   24–576              100–275                 24
Freescale           MSC8144      quad core      10500               up to 1000              16
Freescale           MSC711x      6-way VLIW     80–448              up to 1000              16

A recent 600-MHz, 64 M-transistor, eight-way VLIW DSP processor with 32 KB of level-1 caches and 1 MB of level-2 cache illustrates this point [7]. DSP processors of this type are well-suited for executing both numerically-intensive code and larger, more control-oriented general-purpose code. Table 2.6 lists some features of several major commercial DSPs [53, 67, 68] from Texas Instruments, Analog Devices, and Freescale. Startup companies keep entering this area; for example, Stretch recently announced a product for video surveillance systems [69].

Texas Instruments reported their new C64x+ DSP in 2007 [70], which contains three DSP cores and was fabricated in 65 nm technology. Although the architecture simply puts three DSPs together and can be categorized in the traditional DSP processor domain, it illustrates the trend of DSP processors shifting from uni-core to multi-core architectures. The chip operates at 1.1 GHz; each core has 8 functional units and can perform up to 8000 MIPS or 8000 16-bit MMACs per second. The chip consumes 6 W and occupies 130 mm2.

2.4.2 Pioneering multiprocessor systems and multi-core processors

Communication model: shared-memory and message passing

The idea of multiprocessor systems can be traced back to the early 1970s, when a shared-memory multiprocessor architecture was proposed [71] in which processors exchange data through a global memory. In order to reduce the traffic between processors and the global memory, each processor normally has some cache embedded, which raises the issue of multiple copies of a single memory word being manipulated by multiple processors. This so-called cache coherence issue significantly increases hardware complexity. Another major issue of the shared memory system is its limited scalability due to its centralized structure. Some work such as DASH [72] tried to improve the scalability of shared-memory systems by distributing the global memory to each processor node. Later, another multiprocessor system platform using message passing communication was proposed [73], where processors exchange information by sending and receiving data in a point-to-point fashion. Message passing architectures simplify hardware and also increase system scalability, but increase programming complexity. Some work such as FLASH [74] built systems to support both communication protocols efficiently in one hardware platform.

Interconnect topology

Under the two communication models, there are many interconnect topologies. The most basic topology is probably the global bus where all elements communicate through the global wires.

The bus topology is simple, but it doesn’t scale well since increasing the number of nodes on the bus can quickly introduce congestion and reduce communication performance; furthermore, the increasing delay/power of long wires in modern technology also makes the bus unattractive.

A crossbar is a network organized as a grid of switches to provide connections from one group of processors (or memories) to another group of processors (or memories), which makes it suitable for the shared memory architecture. It can provide more powerful communication capability than the global bus, but its complexity grows quadratically with the number of nodes.

A hybrid topology between the bus and the crossbar, called a multistage topology, has been proposed, where the network is organized in multiple stages and each stage provides part of the crossbar function; examples include the Omega network [75] and the perfect shuffle [76].

As for message passing communication, the most straightforward topology is probably the completely-connected network, where each node has a direct link to every other node. It has the obvious disadvantage of requiring too many links. Other topologies suitable for message passing communication include: the 1-D linear array, where each node is connected to its two neighbors; the 2-D mesh array, where each node is connected to its four neighbors; and the 3-D mesh (cube), where each node is connected to its 6 neighbors. The 2-D mesh is attractive since it naturally fits the 2-D chip.

Some design cases

The transputer [77] is a popular parallel processor originally developed in the 1980s. It pioneered the philosophy of using multiple relatively simple processors to achieve high performance. The transputer was designed for a multiple-processor board, where each transputer is a complete standalone system. It uses a bit-serial channel for inter-processor communication, which can support communication of different word lengths to save hardware, but with dramatically reduced communication speeds.

The systolic processor [78, 79] is another parallel computing architecture proposed in the 1980s. It pioneered the approach of partitioning an application into a number of connected subtasks, which can then be mapped to an array processor to achieve higher performance. Systolic processors contain synchronously-operating elements which send and receive data in a highly regular manner through a processor array. Due to the strict timing requirements on the data flow, the applications suitable for systolic processors are quite limited. Some projects such as ChiP [80] (Configurable Highly Parallel Computer) can be categorized as systolic systems, but ChiP provides a more flexible programmable interconnect structure so that communication is not limited to neighboring elements.

Wavefront processors [81, 49] were proposed shortly after systolic processors, mainly by S.Y. Kung. In order to relax the strict data-flow timing requirement of systolic processors, the wavefront processor permits a data-driven, self-timed approach to array processing, and substitutes the requirement of correct timing with correct sequencing, thus significantly broadening the number of suitable applications. Furthermore, S.Y. Kung studied the VLSI device scaling trend, predicted the importance of global (long) wiring [49], and suggested that each processing element could execute asynchronously with respect to the others, which pioneered the concept of GALS although he did not implement it. The work by U. Schmidt et al. [82] was an exciting experiment in using a wavefront array processor for image processing; unfortunately, there has been no follow-up work. Some of the concepts of wavefront array processors, such as GALS and data-driven communication, are similar to the techniques used in AsAP. But AsAP is distinguished from wavefront processors broadly, such as in its investigation of the granularity of each processing element, the real implementation of the chip and system, and its broader application mappings.

2.4.3 Modern multi-core systems

Although multi-core research in the 1980s showed great potential, those systems unfortunately did not dominate the market. The main reason is that simply increasing the processor clock frequency was much easier and cheaper than designing a brand new multi-core architecture, so there was not enough motivation to shift from uni-core processors to multi-core processors.

There has been a clear revival of multi-core processor research starting from the mid-1990s, in both academia and industry, because of the great challenges faced by uni-core processors; an alternative solution is urgently required. Compared to previous multiprocessors, which were normally distributed across multiple chips, modern single-chip multi-core systems have some unique concerns such as power consumption (which motivates AsAP's simple processing element decision), wiring issues (which result in a distributed instead of centralized architecture, and also motivate AsAP's GALS and mesh connection), and intra-chip communication bandwidth that is faster and higher than inter-chip bandwidth, which affects the communication technology. Also, some researchers focus on programming methods to make multiprocessor programming easier. Some modern multi-core processors are discussed in this subsection, mainly from academia since more information about them is published; the differences between AsAP and the others are then compared.

PADDI-2 [83] was developed at UC Berkeley in the mid-1990s for DSP applications.

It is a MIMD multiprocessor architecture consisting of arrays of simple elements (PEs) connected by a reconfigurable 2-level hierarchical network. The level-1 communication network consists of 6 data buses and is responsible for communication within a cluster of 4 PEs; and the level-2 communication network consists of 16 buses and handles traffic among the 12 clusters. PEs communicate with each other exclusively using data streams and control streams in a data-driven manner. The chip was fabricated in a 1 µm technology; it occupies 12 mm × 12 mm, operates at 50 MHz and provides

2.4 GOPS peak performance.

RAW [84] was developed at MIT starting in the late 1990s. Its initial target applications were stream-based multimedia computations, although its designers later argued that RAW can be a universal solution for both general- and special-purpose applications. Motivated by increasingly important wiring issues and the high economic cost of verifying new designs, they chose a tile-based architecture and use software to control inter-processor communication and wiring as well as instructions. One of the key features of the design is the static and dynamic switching circuitry, which allows an extremely low communication latency of about 3 clock cycles for inter-processor communication. Achieving this goal is not free, though: about half of the processor area is devoted to communication circuitry, and whether this cost is worthwhile is not easy to conclude and depends on specific applications. A 4 × 4 RAW array was implemented and reported in 2003 [35, 85] using IBM 0.18 µm CMOS technology. Each core contains 32 KB of instruction memory, 32 KB of Dcache, and 64 KB of SMem, and occupies 4 mm × 4 mm; the entire chip is 18.2 mm × 18.2 mm including pads. The chip core averages 18.2 Watts at 425 MHz.

Imagine [86] was developed at Stanford starting in the late 1990s, targeting stream-style media processing. Motivated by the high parallelism and high computation locality of media applications, a streaming programming model and a corresponding Imagine architecture were developed. One of the main features of the Imagine processor is that it provides a bandwidth hierarchy tailored to the demands of media applications, organized as a local register file, a global register file, and memory; most of the computation can be finished in the registers, thus improving performance as well as energy efficiency. The difference between the global register file and a cache is its high bandwidth, which allows tens of words to be fetched per cycle. A prototype Imagine processor was fabricated in a 0.15 µm CMOS technology [52]. It is an array containing 48 floating-point arithmetic units organized as eight SIMD clusters with a 6-wide VLIW processor per cluster, with 9.7 KBytes of local register file and 128 KBytes of global register file. Each cluster occupies 5.1 mm × 0.8 mm and the die size is 16 mm × 16 mm [87]. Imagine sustains 4.8 GFLOPS on QR decomposition while dissipating 7.42 Watts operating at 200 MHz. The Imagine processor was commercialized in 2007 in Stream Processors [88] in a 0.13 µm technology. That chip contains a 16-lane data-parallel unit with 5 ALUs per lane, two MIPS CPU cores, and I/Os. Each lane runs at 800 MHz at 1.0 V and occupies 3.2 mm2. The peak performance is 128 GMACs with a power consumption of about 10 W.

Hydra [89] was proposed at Stanford in the late 1990s. From a high level view of its architecture, Hydra is similar to traditional shared-memory multiprocessor systems: the chip simply connects several RISC processors (such as MIPS) and an L2 cache together by a global bus; this centralized feature might limit its scalability potential. The novel contribution of Hydra is mainly in the programming: it simplifies parallel programming as well as improves performance through hardware-supported thread-level speculation and memory renaming. Some of the technologies of Hydra were successfully used in the Sun Niagara SPARC processor [26]. That chip contains 8 symmetrical 4-way multithreaded cores sharing 3 MB of L2 cache. The die occupies 378 mm2 in a 90 nm CMOS technology, operates at 1.2 GHz, and consumes 63 Watts under 1.2 V. The second version of Niagara was reported in 2007 [31].

Pleiades Maia [90] was developed at UC Berkeley in the late 1990s; it is a reconfigurable DSP for wireless baseband digital signal processing. It combines an embedded microprocessor with an array of computational units of different granularities to achieve high energy efficiency; processing units are connected by a hierarchical reconfigurable interconnect network to provide flexibility, and handshake-style GALS signaling was adopted to allow various modules to operate at different and dynamically varying rates. A prototype chip was implemented in a 0.25 µm CMOS process. The die size is 5.2 mm × 6.7 mm. The chip operates at 40 MHz on average and consumes 1.8 mW at a supply voltage of 1 V, with energy efficiency between 10 and 100 MOPS/mW. The heterogeneous architecture is the main reason for Pleiades Maia's high energy efficiency, while it also limits its application domain.

Smart Memories [91] was proposed at Stanford in the early 2000s. It is a tile-based design made up of many processors, and has the ambitious goal of being a universal general design. In order to make it widely applicable, Smart Memories provides configurability of memories, wires, and computational elements. This flexibility comes at a cost, though: compared to programmable processors with specific domains such as the Imagine processor, Smart Memories has about 50% performance degradation. The reconfigurable memory block of Smart Memories was fabricated in TSMC 0.18 µm technology [92], but the computation engine and the whole system have not been built. One reconfigurable memory with 512 × 36b SRAM occupies 0.598 mm2 (of which 61% of the area is SRAM and the rest is the logic providing the configuration), operates up to 1.1 GHz, and consumes 141.2 mW.

TRIPS [93] was developed at UT Austin in the early 2000s, with the ambitious goal of building a system which provides high performance across a wide range of application types. In order to provide configurability for both small- and large-grain parallelism, TRIPS uses ultra-large cores; each core is a 16-wide-issue processor. TRIPS can be configured to exploit ILP, TLP, and DLP to adapt to different application features and achieve high performance. It is a little questionable whether building a universally energy efficient processor is an achievable goal, since the large core and additional configurability introduce non-negligible overhead. A prototype TRIPS chip was fabricated in a 0.13 µm technology [94]. The chip contains two cores and a separate on-chip network and occupies 336 mm2. Its clock rate is 366 MHz and its power consumption is 36 W.

PipeRench [95] was developed at CMU in the early 2000s. The motivation of PipeRench is to efficiently execute numerically intensive applications, and its key feature is a dynamically reconfigurable datapath that matches the features of applications. PipeRench is organized as sixteen stripes, and each stripe consists of sixteen processing elements. A prototype PipeRench chip was fabricated in a 0.18 µm technology. Each processing element occupies 325 µm × 225 µm, with an area dominated by interconnect resources, and the whole die is 7.3 mm × 7.6 mm. The chip can run at 120 MHz and executes a 40-tap 16-bit FIR filter at 41.8 million samples per second (MSPS). Operating at 33 MHz, the FIR filter consumes 518 mW without virtualization and 675 mW with virtualization.

WaveScalar [96] was proposed at the University of Washington in 2003, targeting general purpose applications. The proposed architecture is organized as 16 clusters, and each cluster contains tens of processing elements. Each processing element contains 8 instruction buffers, and each group of four clusters contains a traditional L1 data cache. The processing elements communicate via a shared bus within a cluster, and inter-cluster communication uses a dynamically routed on-chip network. More implementation prototypes were discussed in 2006 [97]. Compared to AsAP, WaveScalar has a hierarchical organization: its single processing element has finer granularity than an AsAP processor, while its cluster is coarser. Furthermore, WaveScalar uses a dataflow programming model which eliminates the traditional program counter to increase parallelism. WaveScalar was not fabricated but was implemented on an FPGA.

Synchroscalar [98] was proposed at UC Davis in 2004. It is a tile-based multi-core architecture designed for embedded DSP processing. It uses columns of processor tiles organized into statically-assigned frequency-voltage domains, and each column uses a SIMD controller. Each column uses a rationally related clock frequency and can provide some flexibility of adaptive clock and voltage scaling for each domain to achieve higher energy efficiency, although it is not as flexible as the fully GALS style used in AsAP. Synchroscalar uses partially synchronous design and global interconnect for the centralized controller. This design can be suitable for systems with low clock frequencies and a limited number of tiles, but its scalability is likely difficult if the number of tiles is increased significantly or the clock frequency is scaled up. Synchroscalar has not been fabricated.

The Intel 80-core processor was published in 2007 [37]. It is a network-on-chip (NoC) architecture containing 80 tiles arranged as a 10 × 8 2-D mesh network. The chip uses scalable global mesochronous clocking, which allows clock-phase-insensitive communication across tiles and synchronous operation within each tile. The chip was fabricated in a 65 nm technology; each tile occupies 3 mm2 and the whole chip is 275 mm2. The chip operates at 3.13 GHz at a supply voltage of 1 V and at 4 GHz at 1.2 V, and achieves 1.0 TFLOPS (trillion floating point operations per second) and 1.28 TFLOPS respectively.

There are other chip multiprocessors such as RaPiD [99] developed at the University of Washington, Picochip's PC102 [100], NEC's IMAP-CE [101], Xelerated AB's NPU [102], Cisco's Metro [103], IBM/Sony/Toshiba's CELL [23], Intellasys's SEAforth processor [104], Mathstar's Arrix FPOA [105], RAPPORT's KC256 [106], Ambric's processor [107], CEA-LETI's FAUST [108], and Cavium Networks' 16-core [109].

The following subsections compare these parallel processors using tables instead of describing them separately.

2.4.4 Distinguishing multi-core processors

It is not easy to distinguish and categorize different projects and processors by simply comparing their area, speed, and power without looking into them deeply. This section analyzes different multi-core processors starting from their objectives, since those are the key motivation behind their specific designs and features.

Table 2.7 summarizes the objectives of some multi-core processors and their one or two most important features. The first three are pioneering multiprocessor systems proposed or developed before the 1990s, and the others are modern multi-core processors. Some processors such as Hydra and WaveScalar aim to simplify programming, and their architectures are relatively similar to traditional styles.

Table 2.7: Comparison of the objectives and distinguished key features of selected parallel processors

Processor              Targets/objectives                  Distinguishing features
Transputer [77]        high performance multiprocessor     bit serial channel
Systolic [78]          high performance array processing   regular communication pattern
Wavefront [49]         high performance array processing   data-driven, self-timed execution
Hydra [89]             program shared-memory systems       thread speculation, memory renaming
WaveScalar [96]        program dataflow systems            memory ordering
RAW [84]               a universal system                  complex static/dynamic route
TRIPS [93]             a universal system                  large wide-issue configurable core
Smart Memories [91]    a universal system                  configurable memory system
PADDI-2 [83]           DSP applications                    data-driven control
RaPiD [99]             DSP applications                    reconfigurable pipelined datapath
PipeRench [95]         DSP applications                    dynamically configurable datapath
Ambric's proc [107]    DSP applications                    massively-parallel MIMD fabric
Synchroscalar [98]     some DSP applications               rationally related clocking and global comm.
CELL [23]              multimedia applications             one Power proc and 8 synergistic procs
Imagine [86]           stream applications                 hierarchical register file system
Pleiades Maia [90]     wireless applications               heterogeneous, reconfigurable connect
Picochip [100]         wireless applications               heterogeneous, reconfigurable
IMAP-CE [101]          video recognition                   linear array of VLIW processors
Metro [103]            network applications                embedded DMA engine
Cavium 16-core [109]   network applications                RISC array + network
NPU [102]              packet processing in networks       linear array of 200 processors
FAUST [108]            telecom baseband applications       heterogeneous, GALS
Intel 80-core [37]     beyond 1 TOPS per chip              NoC, mesochronous clocking
AsAP                   DSP applications                    fine grain processor array, GALS

Some processors such as RAW, TRIPS, and Smart Memories try to be as flexible and configurable as possible to make them suitable for a wide range of applications; to achieve this goal, they introduce overhead such as complex interconnect, extremely large cores, and configurable memories. Some processors such as PADDI, RaPiD, PipeRench, Synchroscalar, Imagine, Pleiades, Picochip, and Metro target one application domain; they use more specific architectures such as heterogeneous processing elements, specific register file systems, and configurable datapaths to make the architecture efficient for the specific features of those application domains.

Table 2.8 compares the listed parallel processors, including the features of their processing elements, homogeneous/heterogeneous styles, chip sizes, and power. Most parallel processors can be easily differentiated by their processing element architectures, which can be categorized into three broad types: heterogeneous, multiple execution units (often similar to classic SIMD), and multiple processors (MIMD).

Table 2.8: Comparison of the selected parallel processors; the data are scaled to 0.18 µm technology and a 1.8 V supply voltage, using the technology scaling factor s and the voltage scaling factor v to scale area, clock rate, and power; energy per operation is obtained from chip power divided by (issue width × element number × clock rate); the processors are sorted according to the area of each processing element; data for Synchroscalar, WaveScalar, and Smart Memories are published estimates.

Table 2.9: Comparison of inter-element communication of selected parallel processors
Processor              Interconnect       Details
Hydra [89]             bus                connects RISC processors and L2 cache
Cavium 16-core [109]   bus                connects 16 processors and L2 cache
CELL [23]              bus                high-bandwidth bus composed of four 128b data rings
Niagara [26]           crossbar           connects 8 cores and 4 L2 caches
RaPiD [99]             1-D linear array   linearly connects reconfigurable pipelined datapath
PipeRench [95]         1-D linear array   linearly connects execution clusters
IMAP-CE [101]          1-D linear array   linearly connects 128 processors
NPU [102]              1-D linear array   linearly connects 200 processors
RAW [35]               2-D mesh           including static and dynamic routing
Ambric's proc [107]    2-D mesh           configurable switches for distant comm.
TRIPS [50]             2-D mesh           dynamically routed, with ALU operand network
Smart Memories [91]    2-D mesh           packet-based, dynamically routed
FAUST [108]            2-D mesh           packet-based, dynamically routed
Intel 80-core [37]     2-D mesh           packet-based, dynamically routed
Pleiades [90]          hierarchical       2-D mesh for both local and global
PADDI-2 [83]           hierarchical       bus for local and crossbar for global
Imagine [86]           hierarchical       switch for both local and global
WaveScalar [96]        hierarchical       bus for local and 2-D mesh for global
Picochip [100]         hierarchical       special picobus for both local and global
Metro [103]            hierarchical       16 clusters, each with 12 processing elements
Synchroscalar [98]     hierarchical       4 SIMD processors, connected by bus
AsAP                   2-D mesh           statically configurable


A heterogeneous style such as the one used by Pleiades and Picochip makes the system efficient for specific applications, but results in a non-regular layout and has difficulty scaling.

Some projects such as Imagine, RaPiD, and PipeRench use multiple execution units to improve their performance. Strictly speaking, they can be categorized as parallel processing architectures but not multi-core systems, since they have a centralized instruction controller. These architectures normally use hierarchical structures: some processing elements are combined into groups (called clusters in Imagine and stripes in PipeRench), and those groups are then organized into the chip.

Quite a lot of systems can be categorized as MIMD processors, and AsAP can be distinguished from them by its small processing element granularity alone. One of the main reasons for the others' increased processor granularity is that they target a wider range of applications, including those which require large data working sets.

Table 2.9 compares the inter-processor communication networks. Different processors choose their networks according to their target applications, from the simple bus and crossbar to the 2-D mesh (including those called network-on-chip structures); some processors use hierarchical networks to treat local and long-distance communication differently.

In terms of the clocking style, most other projects use a totally synchronous clocking style. Pleiades and FAUST use a GALS clocking style, but they use a handshaking scheme to handle the asynchronous boundaries, which is quite different from the source-synchronous interprocessor communication style used in AsAP, which is able to sustain full-rate communication of one word per cycle. The GALS processing array by Ambric was presented, but no details were given regarding its operation. The Intel 80-core employs mesochronous clocking, where each processor has the same clock frequency while the clock phase can differ. Synchroscalar uses a rationally related clocking style to achieve part of the clock/voltage scaling benefit, but it is not as flexible as GALS.

Partly because of these architectural-level differences, these processors target different application domains. For example, Pleiades and Picochip target wireless applications; Imagine targets stream applications; RaPiD, PipeRench, and Synchroscalar target DSP applications; and TRIPS, Smart Memories, and RAW target all kinds of applications.

The key features of the AsAP processor, namely the small granularity of each processor with small memory and a simple processing unit, the GALS clocking style, and the reconfigurable nearest-neighbor mesh network, distinguish it from other work.

2.5 Summary

The AsAP platform is well-suited for the computation in complex DSP workloads composed of many DSP sub-tasks, as well as single highly-parallel computationally demanding tasks.

By its very nature of having independent clock domains, very small processing elements, and short interconnects, it is highly energy-efficient and capable of high throughput.

Chapter 3

A Low-area Multi-link Interconnect Architecture

Inter-processor communication is an extremely important issue in chip multiprocessor systems, and it is investigated further in this chapter based on the discussion in Section 2.1.5. Both dynamic routing architectures and static nearest neighbor interconnect architectures (explained in Sections 3.0.1 and 3.0.2) achieve significant success in specific areas, but they have some limitations. The dynamic routing architecture is flexible, but normally pays a relatively high area and power overhead for the communication circuitry. The static nearest neighbor interconnect architecture reduces the area and power overhead significantly, but it suffers high latency for long distance communication. Architectures that obtain good trade-offs between flexibility and cost are desired.

Communication within chip multiprocessors is often localized for many applications, especially many DSP algorithms [110, 111]: most communication is nearest neighbor (or local), while a little is long distance. Motivated by this reality, we propose an asymmetric structure: treat nearest neighbor communication and long distance communication differently, using more buffer resources for nearest neighbor connections and fewer buffer resources for long distance connections. Together with a relatively simple static routing approach, this asymmetric architecture achieves low area cost for the communication circuitry.

Under the static asymmetric architecture, there are several design options available, such as the number of input ports (buffers) for the processing core and the number of links at each edge.


Figure 3.1: (a) An illustration of interprocessor communication in a 2-D mesh, and (b) a generalized communication routing architecture in which only signals related to the west edge are drawn.

The area, speed, and performance of different design options are analyzed, and some conclusions are drawn based on the results. It is found that increasing the number of links between processors helps routing capability, but at some point it affects the processor area dramatically. Two or three links are suitable if each processor in the chip uses a simple single-issue architecture.

Moreover, the proposed architecture supports the GALS clocking style for long distance communication. After examining the characteristics of different approaches, the source synchronous method (explained in Section 2.2.2) is extended so that the clock is transferred with the signals along the entire path to control the data writing.

In this chapter, the nearest neighbors of a processor are its 4 nearest neighbors in a 2-D mesh, although results would be similar for other arrangements such as the 6 nearest neighbors in a 3-D cube; processing core refers to the execution unit of each processor, and each input port of the processing core is connected to a buffer.

3.0.1 Background: traditional dynamic routing architecture

Figure 3.1 (a) shows the interprocessor communication in 2-D mesh chip multiprocessors using a routing architecture: the router block in each processor receives data from neighboring processors (east, north, west, and south) and then sends data to the processor core or to the other neighboring processors.


Figure 3.2: The concept and circuitry diagram of the nearest neighbor interconnect architecture. Data from the four inputs are transferred only to the processing core to reduce the circuitry cost.

Since the communication links are not always available, due to slow data processing speed or link congestion, buffers are inserted at each input edge [112]. Figure 3.1 (b) shows a generalized diagram of the routing circuitry where only signals related to the west edge (west_in and west_out) are drawn. Input ports from each side feed data into a corresponding buffer, and the buffers supply data to the processor core or to the other output ports. Each output port selects data from the processor core and three input buffers. As the diagram shows, the communication logic includes four buffers and five muxes; there is also some control logic to support the communication flow control, which is not drawn. Other implementations are possible; for example, each output port can also have a buffer, or each input buffer can be split into multiple virtual channels [51] to reduce communication latency. The area of the communication circuitry is normally dominated by the buffers, and the logic in the four input/output ports is normally the same.

3.0.2 Background: static nearest neighbor interconnect

Most communication in chip multiprocessors is localized or can be localized, which means the data traffic going into the processor core is much larger than that on the other paths. To minimize the communication circuit overhead, another interprocessor communication strategy, as used in the first version of the AsAP processor, is to implement only the nearest neighbor interconnect logic; long distance communication is fulfilled by extra MOVE instructions in the intermediate processors. Figure 3.2 (a) shows the concept of the nearest neighbor interconnect and Fig. 3.2 (b) shows a circuitry diagram. All data from the input ports are transferred to the processing core, so instead of inserting a buffer at each input port, there is little effect on system performance from inserting buffer(s) only at the output(s) to the processing core and statically configuring the routing path. Comparing Fig. 3.1 and Fig. 3.2, the nearest neighbor interconnect reduces the number of buffers from four to one, and the muxes for each output port are all avoided, resulting in more than four times smaller area. But clearly it has the limitation that long distance communication incurs a large latency.

3.1 Low-area interconnect architecture

This section proposes the statically configurable asymmetric architecture to achieve low area cost.

3.1.1 Asymmetric architecture

Asymmetric data traffic exists at the router’s output ports universally

J. Hu et al. discovered the asymmetric data traffic in inter-processor communication and proposed using different buffer resources at input ports of routers to match the traffic [113]. One limitation of the asymmetric input ports architecture is that for different applications as well as for different processors of individual applications, the existing asymmetric data traffic on the router input ports is different, which makes the allocation of the buffer resources application specific.

Considering the router's output ports instead of its input ports, most of the data from the input ports are delivered to the core and very few go to the edges, which makes the asymmetric data traffic at the router's output ports more general and universal. Allocating asymmetric buffer resources at the output ports is therefore applicable to a much wider range of applications, which is important since NoC architectures are now used well beyond specific domains.

Table 3.1 shows the data traffic of each processor for the JPEG encoder of Fig. 2.18, which demonstrates the different asymmetric data traffic on the input and output ports of the routers. On the input ports, although each processor shows clearly asymmetric data traffic, the major input direction differs from processor to processor, which makes the overall traffic at the input ports quite uniform. On the output ports, however, each processor shows a similar asymmetric pattern, and overall about 80% of the data are delivered to the core. Similar results exist in other applications.

Table 3.1: Data traffic of the router in each processor of a 9-processor JPEG encoder processing one 8 × 8 block, assuming an architecture with four inputs and five outputs like Fig. 3.1. About 80% of the data from the inputs are delivered to the processing core, which dominates the traffic at the output ports.

  Proc.     Input ports (data words)        Output ports (data words)
  No.       East  North  West  South        Core  East  North  West  South
  1           0     64     0     0            64     0     0     0     0
  2           0     64     0     0            64     0     0     0     0
  3           0     64     0     0            64     0     0     0     0
  4           0      0    64     0            64    64     0     0     0
  5           0      0    96     0            64     0    32     0     0
  6           0      0     0    64             1     0     0    63     0
  7           0      0     0     3             3     0     0     0     0
  8          63      0     0     0            63     0     0     0     0
  9           4      0     0   252           256     0     0     0     0
  Total      67    192   160   319           643    64    32    63     0
  Relative    9%   26%   22%   43%            80%    8%    4%    8%    0%
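The 80% figure can be checked directly from the totals row of Table 3.1; the small Python sketch below (with the totals transcribed from the table, purely as a sanity check) recomputes the relative output traffic.

    # Recompute the "Relative overall (output)" row of Table 3.1 from its totals.
    output_totals = {"Core": 643, "East": 64, "North": 32, "West": 63, "South": 0}
    total = sum(output_totals.values())   # 802 data words in all
    shares = {port: round(100.0 * words / total) for port, words in output_totals.items()}
    print(shares)   # {'Core': 80, 'East': 8, 'North': 4, 'West': 8, 'South': 0}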

Proposed asymmetric architecture

An architecture with asymmetric output ports, shown in Fig. 3.3, is proposed to achieve a good trade-off between cost and flexibility. As shown in Fig. 3.3 (a), instead of equally distributing buffer resources to each output port, sufficiently large buffers are allocated to the processing core port, and the other ports use small buffers (one or several registers). Figure 3.3 (b) shows the circuitry diagram, where the connections of the west signals west_in and west_out are completely drawn while the other edges are simplified. From the point of view of area and logic cost, this scheme is similar to the static nearest neighbor interconnect shown in Fig. 3.2, adding only a couple of registers and muxes. From the point of view of routing capability, this scheme is similar to the traditional dynamic routing architecture shown in Fig. 3.1, since reducing the buffers in the ports used for long distance communication does not significantly affect system performance when the communication is localized. If one large buffer is used for the processing core, the proposed architecture saves about 4 times the area compared to the Fig. 3.1 architecture. If two large buffers are used, as will be discussed in Section 3.2.1, the area saving is still about 2 times.

Figure 3.3: The concept and circuitry diagram of the proposed inter-processor communication architecture; it has asymmetric buffer resources for the long distance interconnect and the local core interconnect

3.1.2 Theoretical analysis

This section gives a quantitative analysis of how the buffer size affects system performance. A second interesting question is, if the total buffer size is kept the same, what is the best distribution of the buffer resources over channels with different traffic patterns.

Performance effect analysis

The buffer writing procedure is generally halted when the buffer is full, so the buffer size has a strong effect on system performance. The quantitative effect of the buffer size is highly dependent on the probability function of the data traffic in the buffer, and the Poisson distribution is one of the best models for this traffic behavior [114, 115]. Assuming the number of data words in the port (channel) follows a Poisson distribution with an average of λ, the probability of having k data words is given by Equation 3.1. Although the model cannot be expected to fit reality exactly, it provides useful information.

\[ f(k, \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!} \qquad (3.1) \]

If a buffer of size N is used for that channel, then the probability that the buffer is full follows Equation 3.2:

\[ p(\lambda, N) = 1 - \sum_{k=0}^{N} \frac{e^{-\lambda} \lambda^{k}}{k!} \qquad (3.2) \]
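As a concrete illustration of Equations 3.1 and 3.2, the short Python sketch below evaluates the stall probability with only the standard library; the function names are illustrative and not part of any AsAP tool flow.

    import math

    def stall_probability(lam, buffer_size):
        """Equation 3.2: probability that a buffer of the given size is full
        when the channel occupancy is Poisson-distributed with mean lam."""
        cdf = sum(math.exp(-lam) * lam**k / math.factorial(k)
                  for k in range(buffer_size + 1))
        return 1.0 - cdf

    def required_buffer_size(lam, target):
        """Smallest buffer size whose stall probability is below the target
        (the quantity tabulated in Table 3.2)."""
        size = 0
        while stall_probability(lam, size) > target:
            size += 1
        return size

    for lam in (2, 5, 10, 20, 50):
        print(lam, [required_buffer_size(lam, t) for t in (0.10, 0.01, 0.001)])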

Fig. 3.4 shows the buffer stall probability (p) versus buffer size (N) for channels with different average numbers of data. When the buffer size is 2 times the average number of data in the channel, the buffer stall probability is close to zero. Table 3.2 shows the required buffer sizes for different target stall probabilities and different channels. These results show that the required buffer size is approximately linearly related to the traffic in the channel. This means that reducing the buffer size linearly along with the traffic has nearly no effect on performance, provided the original buffer size is sufficiently large.

Figure 3.4: Buffer stall probability (p) versus buffer size (N) for channels with different average numbers of data, modeled with a Poisson distribution

Table 3.2: The required buffer sizes for several target stall probabilities and channels with different average numbers of data, modeled with a Poisson distribution

  Stall          Required buffer size for average number of data in channel
  probability        2      5     10     20     50
  10%                4      8     14     26     59
  1%                 6     11     18     31     67
  0.1%               8     13     21     35     77

Optimal buffer resource distribution

It is also interesting to find the optimal buffer distribution if the total buffer resource is fixed at Q, and the average numbers of data in the processing core channel and the other channels are λ and nλ respectively (here n is the traffic ratio between the other channels and the processing core channel). If buffer size N is allocated to the processing core channel and mN to each of the other channels (here m is the buffer size ratio between the other channels and the processing core channel), the goal is to minimize the total buffer full probability shown in Equation 3.3:

\[ \left(1 - \sum_{k=0}^{N} \frac{e^{-\lambda}\lambda^{k}}{k!}\right) + 4\left(1 - \sum_{k=0}^{mN} \frac{e^{-n\lambda}(n\lambda)^{k}}{k!}\right) \qquad (3.3) \]

where m and N must satisfy Equation 3.4:

\[ Q = N + 4mN \qquad (3.4) \]

which means N = Q/(1 + 4m). Substituting this into Equation 3.3, the goal becomes to minimize Equation 3.5:

\[ \left(1 - \sum_{k=0}^{\frac{Q}{1+4m}} \frac{e^{-\lambda}\lambda^{k}}{k!}\right) + 4\left(1 - \sum_{k=0}^{\frac{mQ}{1+4m}} \frac{e^{-n\lambda}(n\lambda)^{k}}{k!}\right) \qquad (3.5) \]

Fig. 3.5 shows the overall system buffer stall probability at different buffer size ratios (m) for different total buffer sizes (Q), assuming λ = 50 and nλ = 5 as a representative system. When there is a sufficient buffer resource (Q = 200 in our case), the system has nearly no stalls as long as the ratio is neither too small (m < 0.1 in our case, which makes the small buffers too shallow) nor too large (m > 0.5 in our case, which leaves the large buffer too small). When there is a moderate buffer resource (Q = 100 in our case), the system achieves the best overall performance when the buffer size ratio is a little larger than the channel traffic ratio (the optimal buffer ratio is m = 0.18, while the traffic ratio is n = 0.1). When there is a very limited buffer resource (Q = 50 in our case), the system always has a non-negligible performance loss, and the overall buffer stall probability is smallest when each buffer has the same size.

Figure 3.5: The overall system stall probability (the sum of the stall probabilities of the 5 buffers for the processing core and the four edges) versus the size ratio between the small and large buffers, for different fixed total buffer resources; assuming a traffic pattern where the average number of data words in the processing core channel is 50 and each of the other four channels has an average of 5
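The sweep behind Fig. 3.5 can be reproduced approximately with the short Python sketch below; the choice of λ = 50 and n = 0.1, the 0.01 sweep step for m, and the rounding of buffer sizes to integers are illustrative assumptions.

    import math

    def stall_probability(lam, size):
        cdf = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(size + 1))
        return 1.0 - cdf

    def overall_stall(q, m, lam=50.0, n=0.1):
        """Equation 3.5: stall probability of the core-port buffer plus the four
        edge-port buffers, for a total buffer budget q and size ratio m."""
        core_size = int(q / (1 + 4 * m))    # N = Q / (1 + 4m)
        edge_size = int(m * core_size)      # mN for each edge buffer
        return (stall_probability(lam, core_size)
                + 4 * stall_probability(n * lam, edge_size))

    for q in (50, 100, 200):
        best_m = min((i / 100.0 for i in range(1, 101)),
                     key=lambda m: overall_stall(q, m))
        print(q, best_m, round(overall_stall(q, best_m), 3))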

For systems with traffic patterns different from our example, the exact results will differ, but the general conclusion drawn above should still be valid. Also, these discussions apply to dynamic routing architectures; when using static routing, as will be discussed in Section 3.1.3, only one register is necessary in each bypass channel for pipelining and no additional buffer is required.

3.1.3 Static routing vs. dynamic routing

The inter-processor connections can be configured statically before runtime (static routing) or dynamically at runtime (dynamic routing). Traditional multiprocessor systems such as message passing architectures and current chip multiprocessor research on NoCs have widely covered dynamic routing networks, but statically configured architectures have not been intensively studied. The key advantage of static configuration is that for applications with predictable traffic, such as most DSP applications, it can provide an efficient solution with small area cost and communication latency. The dynamic configuration solution suits a wider range of applications because of its flexibility, but it has non-negligible overhead in terms of circuit area and communication latency; the main overhead comes from the routing path definition, the arbiter of multiple clock-independent sources, and the signal recognition at the destination processor.

Dynamic routing and its overhead

In dynamic routing, the data transfer path must either be defined by the source processor and propagated to the corresponding downstream processors, or dynamically decided by intermediate processors. The circuitry to define and control the routing path adds area overhead, and propagating the routing path may cost extra instructions and increase the number of clock cycles needed for the data transfer.

Since each link in the dynamic routing architecture is shared by multiple sources, an arbiter is required to allow only one source to access the link at one time. Furthermore, in GALS chip multiprocessors, this arbiter becomes more complex since it has to handle the sources with unrelated clock domains. An obvious overhead is that some synchronization circuitry is required for the arbiter to receive the link occupying request from different sources, and some logic is required to avoid the glitches when the occupying path changes.

Another question is how the destination processor can identify the source processors of the receiving data. Since data can travel through multiple processors with unknown clock domains, it is not possible to assume a particular order for the coming data. One method is that an address is assigned to each processor and sent along with the data, and the destination processor uses the address to identify the source processor through software or hardware.

Combining these overheads, the communication latency for dynamic routing is normally larger than 20 clock cycles [116], and this value increases further for GALS dynamic routing networks due to the additional synchronization latency.

Table 3.3: Comparison of several routing approaches. The Area values are for a 0.18 µm technology; the values can be slightly different in specific implementations due to other effects such as routing schemes

  Architecture               Suitable applications   Latency (clk cycles)   Area (mm²)
  Static systolic routing    limited                 ~1                     –
  Dynamic routing            broad                   >20                    –
  RAW (static + dynamic)     broad                   3                      ~4
  Proposed static routing    broad                   ~5                     ~0.1

Static routing

Rather than pursuing a dynamic routing strategy, only the static routing approach is investigated in our design.

Few systems use static routing; the systolic [78] approach is one of the pioneers. Systolic systems contain synchronously operating processors which “pump” data regularly through a processor array, and the data to be processed must reach the processing unit at an exact predefined time. Due to this strict requirement on the data stream, the systolic architecture is only suitable for applications with highly regular communication such as matrix multiplication, so its suitable applications are quite limited.

Releasing the strict timing requirement of the data stream can significantly broaden the application domain. To release the timing requirement, each processor must ‘wait’ for data when the data is late, and the data must ‘wait’ to be processed when it comes early. Inserting FIFO(s) at the input(s) of processing cores for the coming data can naturally meet these requirements, in which the processor is stalled when the FIFO is empty, and the data is buffered in the FIFO when it comes early. In this way, the requirement for the data stream is only its order, not the exact time.

RAW [116] is a chip multiprocessor with extremely low latency for interprocessor communication (3 clock cycles) using both static routing and dynamic routing, but it achieves this goal with a large area cost of about 4 mm² in 0.18 µm technology. Our communication system is suitable for broad applications, with low latency (about 5 clock cycles) and low area overhead (about 0.1 mm² in 0.18 µm technology).

Table 3.3 compares the interconnect architectures described in this subsection.


Figure 3.6: Diagrams of architectures with various numbers of input ports (and therefore various numbers of buffers) for the processing core: (a) single port, (b) two ports, (c) three ports, and (d) four ports

3.2 Design space exploration

Under the statically configurable asymmetric architecture discussed in Section 3.1, there are several options for the communication logic realization. Two important options are investigated in this section:

• The number of ports (buffers) for the processing core

• The number of links at each neighboring processor pair

3.2.1 Single port vs. multiple ports for the processing core

In Fig. 3.3, each processing core has one input port (and one corresponding buffer). When using the dynamic routing approach, a single port might be sufficient since the processing core can fetch data from all directions by dynamically configuring the input mux. If using the static routing approach, a single port means each processing core can only fetch data from one source, which might be inefficient when multiple sources are required. Using multiple ports (buffers) for the processing core is considered, as shown in Fig. 3.6. The area of the communication circuitry roughly doubles if using two ports (buffers) instead of single port (buffer); and the area becomes three times and four times larger if using three and four ports (buffers).

Performance evaluation

The effect of the number of ports (buffers) on routing capability is highly dependent on the application communication patterns. The basic communication patterns, including one-to-one communication (in which two processors at opposite corners communicate with each other), one-to-all broadcast (in which one corner processor sends data to all processors), all-to-one merge (in which all processors send data to a processor in the middle), and all-to-all communication, are used to evaluate the performance of the different architectures; real applications can normally be modeled as combinations of these basic patterns.

Figure 3.7: Architectures with different numbers of ports (buffers) for the processing core always travel through 2n − 2 processors to fulfill one-to-one and one-to-all communication

For the one-to-one and one-to-all communications, in which the destination processor(s) only need one source, the single-port architecture has the same performance as the multiple-port architectures, and the communication latency is the time to travel through 2n − 2 processors in an n × n array, as shown in Fig. 3.7.

For the all-to-one communication case, shown in Fig. 3.8, increasing the number of ports (buffers) is beneficial. For the single-port (buffer) architecture, each processor can receive data from only one source, and the furthest processor needs to propagate its data through all n² processors, as if all the processors were arranged in a linear array. For architectures with multiple ports (buffers), the communication can be distributed in multiple directions and the furthest processor only needs to propagate through around n processors. The architectures with 2, 3, or 4 ports (buffers) perform similarly; the two-port architecture is slightly worse in the 5 × 5 array and the four-port architecture is slightly better in the 3 × 3 array. Here the longest traveling distance is used to represent the communication latency. When the array of processors becomes very large, or if the latency to propagate through each processor is very small, the communication latency might be limited by how fast the destination processor can accept the data rather than by how fast the furthest processor can reach the destination processor. For example, if passing one processor takes 5 clock cycles, in an n × n processor array the longest traveling latency is about 5n clock cycles; a two-port architecture can accept no more than 2 data words per clock cycle, so the time to accept all the data is at least n²/2 clock cycles. When n < 10, the traveling time dominates the latency; when n > 10, the time for the destination processor to accept the data dominates. Besides how fast the destination processor can accept the data, another consideration is how fast it can process the received data. A simple single-issue processor can normally consume (process) no more than two data words per clock cycle, which means there is little benefit in accepting more than 2 data words per cycle; hence using 3 or 4 ports (buffers) for the processing core provides little additional help.

Figure 3.8: The longest traveling distance from source to destination for architectures with different numbers of ports (buffers) for the processing core, by varying the size of the array and using the all-to-one communication pattern
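A quick numerical check of the crossover described above, using the same assumed numbers as the text (5 clock cycles to pass one processor, and a two-port core that accepts at most 2 data words per cycle):

    # Travel-limited latency (about 5n cycles) vs. acceptance-limited latency
    # (about n*n/2 cycles) for all-to-one traffic into a two-port processing core.
    for n in (4, 8, 10, 12, 16):
        travel = 5 * n
        accept = n * n / 2.0
        print(n, travel, accept, "travel-limited" if travel > accept else "acceptance-limited")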

The all-to-all communication differs from the all-to-one communication in that each processor needs to reach all the other processors. The architecture with one port (buffer) needs to arrange all processors linearly, and the latency is the time needed to propagate through all n² processors; the architectures with multiple ports (buffers) can communicate in two dimensions, and their latency is the time needed to propagate through one row and one column, a total of 2n − 1 processors. Fig. 3.9 shows the result. As in the all-to-one communication discussion, the longest traveling distance is used as the communication latency. How fast each processor can accept and process the received data can also affect the communication and computation latency.

Figure 3.9: The longest traveling distance for architectures with different numbers of ports (buffers) for the processing core, by varying the size of the array and using the all-to-all communication pattern

Considering the trade-off between area and performance, using two ports (buffers) for the processing core is a good solution. A further consideration is that if the chip were organized in a 3-dimensional shape, the best option would be a three-port architecture.

3.2.2 Single link vs. multiple links

One of the key differences between on-chip and inter-chip interconnects is that there are far more wiring resources inside the chip, while inter-chip connections are normally limited by the available chip IOs. W. J. Dally [51] suggested increasing the word width of each link in a NoC to take advantage of this. Another option is to increase the number of links at each edge to increase connection capability and flexibility.

Single link

When there is one link (one output as well as one input) at each edge, each link can potentially receive data from the other three edges and the processing core, and each processing core can receive data from its four neighbors. Fig. 3.10 shows the diagram. It is a relatively simple and straightforward architecture.

Figure 3.10: Diagram of the inter-processor connection with one link at each edge

Double links

With double links at each edge, the processing core has 8 possible sources; each edge has two outputs, and each output potentially has 7 sources (6 from the input edges and one from the processing core). Comparing the two-link architecture shown in Fig. 3.11 (a) with the single-link architecture shown in Fig. 3.10, the overhead of the additional link is significant; not only does the control logic increase, but more importantly the semi-global wires (the wires in each processor from one edge to another) increase considerably, which can significantly affect system area, speed, and power in submicron technologies.

Some methods are available to simplify the fully connected architecture. Consider the router logic at the east edge, which receives data from North, West, South, and Core and sends data to the East output; in1_N, in2_N, in1_W, in2_W, in1_S, in2_S, sig_core, out1_E, and out2_E are used to denote these signals. At first glance there is a large exploration space, since 7 inputs and 2 outputs give 2^14 connecting options. Since the three input edges are symmetric, {in1_N, in1_W, in1_S} are grouped together as input in1 and the other input group is named in2, so the number of inputs is reduced to 3 and the exploration space is reduced to 2^6 = 64, as shown in Table 3.4. In options 1-8, out1 has no connections, so they are not considered two-link architectures. Option 9 is neglected for a similar reason. Option 10 only connects the processing core to the outputs and is essentially the same as the nearest neighbor architecture. Option 11 cannot be realized since out2 is connected only to in2, which means this link has no original source. Option 12 is a potential choice, where out1 is connected only to the core and serves as the nearest neighbor link, and out2 is connected to in2 and the core and serves as the long distance link. Fig. 3.11 (b) shows the circuit diagram; each edge contains only one 4-input mux. Examining all the other options, options 15 and 44 are also found to be potentially good choices. In option 15, out1 (link1) receives data from the core while out2 (link2) receives data from the edges, and both send data to the consumer core and routers. Fig. 3.11 (c) shows the circuit diagram; each edge contains one 6-input mux. In option 44, each link receives data from the core and a corresponding link (out1 corresponds to in1 while out2 corresponds to in2), and sends data to the processing core and routers. Fig. 3.11 (d) shows the circuit diagram; each edge contains two 4-input muxes.

Figure 3.11: Inter-processor connections with double links: (a) fully connected; (b) separated nearest neighbor link and long distance link; (c) separated link from core and link from edges; (d) connections exist between ‘corresponding’ links; (b), (c), and (d) correspond to options 12, 15, and 44 in Table 3.4

Table 3.4: Interconnect architecture options for double links. Yes means a connection exists between the input and output, No means no connection exists, and xx means don't care.

  Option   in1-out1  in2-out1  core-out1  in1-out2  in2-out2  core-out2  Corresponding circuitry
  1-8      No        No        No         xx        xx        xx
  9        No        No        Yes        No        No        No
  10       No        No        Yes        No        No        Yes
  11       No        No        Yes        No        Yes       No
  12       No        No        Yes        No        Yes       Yes        Fig. 3.11 (b)
  13       No        No        Yes        Yes       No        No
  14       No        No        Yes        Yes       No        Yes
  15       No        No        Yes        Yes       Yes       No         Fig. 3.11 (c)
  16       No        No        Yes        Yes       Yes       Yes
  17-32    No        Yes       xx         xx        xx        xx
  33-40    Yes       No        No         xx        xx        xx
  41       Yes       No        Yes        No        No        No
  42       Yes       No        Yes        No        No        Yes
  43       Yes       No        Yes        No        Yes       No
  44       Yes       No        Yes        No        Yes       Yes        Fig. 3.11 (d)
  45-48    Yes       No        Yes        Yes       xx        xx
  49-64    Yes       Yes       xx         xx        xx        xx

In terms of the area cost and routing flexibility of the four architectures shown in Fig. 3.11, architecture (a) has the most flexible connections but the highest cost; architecture (b) has the most limited connections but the smallest cost; architecture (c) has connection flexibility and cost between (a) and (b). Architecture (d) has an area cost similar to (c) and, interestingly, its routing capability is the same as architecture (a). This can be demonstrated by Fig. 3.12, where A needs to communicate with B while path1 and path2 occupy some parts of the links between them. For the fully connected architecture, the path between A and B is easy to set up since it can first use link2 and then switch to link1. Using the architecture of Fig. 3.11 (d) appears difficult at first glance, but it can be handled by re-routing path1 and then using link2 as the path between A and B, achieving the same routing result as the fully connected architecture.

According to these discussions, the Fig. 3.11 (d) architecture is a good option due to its routing flexibility and moderate circuitry cost, and the Fig. 3.11 (b) architecture can also be a good option due to its small circuitry cost.

Increasing the number of links further

Increasing the number of links further can increase the routing capability further. Fig. 3.13 shows architectures with three and four links, extended from the Fig. 3.11 (b) and (d) architectures. The circuitry clearly becomes more complex as the number of links increases. Each edge contains 2 or 3 4-input muxes in the three-link architectures and 3 or 4 4-input muxes in the four-link architectures. Also, the number of sources for the processing core becomes 12 and 16 in the three- and four-link architectures respectively. Since the wires connected to this logic are all semi-global, the overhead of the additional logic might have a non-negligible effect on the system area and speed.

Figure 3.12: Setting up the interconnect path from point A to B: (a) using the fully connected architecture of Fig. 3.11 (a); (b) reorganizing the paths and using the architecture shown in Fig. 3.11 (d)

Since the architectures with various numbers of links present many options that are not easy to judge by qualitative analysis alone, more quantitative analysis is given in the following subsections. The evaluated architectures include the single-link architecture shown in Fig. 3.10, the double-link architectures shown in Fig. 3.11 (b) and (d), and the three- and four-link architectures shown in Fig. 3.13.

Area and speed

Increasing the number of communication links requires additional control logic, which is expected to increase the circuit area and affect the processor speed. The synthesis tool reports that the communication logic area for the seven discussed architectures is 0.013, 0.032, 0.047, 0.058, 0.067, 0.075, and 0.081 mm² respectively, in a 0.18 µm technology.

The synthesis result cannot tell the whole story without putting the communication circuitry into an entire processor environment and considering physical layout effects. In submicron technologies, wires introduce non-negligible delay compared to logic gates; in addition, complex wire connections might require extra area for routing. Each communication circuit is therefore embedded in a simple single-issue processor with an original area of about 0.66 mm² in a 0.18 µm technology, and the physical layout design is done with the Cadence Encounter tool. The layout utilization in the physical design can have a strong impact on the design result. In terms of area, higher utilization is always better as long as the chip remains routable. In terms of speed, too low a utilization introduces unnecessarily long wires and reduces system speed, while too high a utilization can result in wiring congestion and complicate or prohibit routing. It is found that setting 70% utilization initially is a good option, and it normally rises to around 85% after clock tree insertion and in-place optimization.

Figure 3.13: Inter-processor interconnect with three and four links. (a) and (c): separated nearest neighbor link and long distance link, similar to Fig. 3.11 (b); (b) and (d): connections exist between ‘corresponding’ links, similar to Fig. 3.11 (d)

Fig. 3.14 shows the layouts of seven processors containing different numbers of communication links. Fig. 3.15 (a) shows their area, and Fig. 3.15 (b) shows the relative area increment of each architecture compared to the type 1 architecture. Types 6 and 7 (the four-link architectures) have a noticeable area increase (close to 25%). The reason is that these two architectures cannot successfully finish routing at the initial 70% utilization used for the others, and their utilization is reduced to 64% and 65% respectively. This result provides some interesting insight into how many global wires can fit into a chip. For a processor with a 0.66 mm² area and a 0.8 mm edge, assuming a minimum 1 µm pitch between IO pins, an optimistic estimate is that each edge can fit 800 IO pins; beyond this the chip size becomes IO pin or wire dominated. This estimate holds only if these IO pins are all connected to short wires. For global wires such as those used for the communication routers, increasing the number of wires quickly results in routing congestion and increases the chip size. In our example, each processor edge in the four-link architecture has about 160 IO pins, much less than the optimistic 800 IO pins. Architectures with four or more communication links are not considered further due to their area cost.

Figure 3.14: Scaled layouts of seven processors containing different numbers of communication links; types 1 to 7 correspond to the architectures shown in Fig. 3.10, Fig. 3.11 (b), Fig. 3.11 (d), and Fig. 3.13

Fig. 3.15 (c) shows the processors' speed and Fig. 3.15 (d) shows the relative speed difference of each architecture compared to the type 1 architecture. Each processor has a similar speed and the difference is within a negligible 2%. Types 6 and 7 are slightly faster than type 1 because the freed area helps to simplify the routing.

Performance

Again, the basic communication patterns are used to evaluate the performance of architectures with different numbers of links.


Figure 3.15: Comparing the area and the speed of processors with different communication cir- cuitry containing different numbers of links; (a) the absolute area; (b) the relative area increment to architecture 1; (c) the absolute speed; (d) the relative speed decrement to architecture 1; type 6 and 7 (four links architecture) have a noticeable area increase due to their semi-global wire effect on the processor area

The communication latency between two processors can be expressed as k × p clock cycles, where p is the distance between source and destination expressed as a number of processors, and k is the number of clock cycles to cross one processor. For the nearest neighbor interconnect architecture, crossing one processor requires an extra instruction, and the latency (k) depends on the asynchronous synchronization time and the processor pipeline depth; it is 13 in our experimental system. For the proposed routing architecture, k is 1 if the data can use the routing circuitry and be registered in each processor.
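A minimal sketch of this latency model; the 13-cycle and 1-cycle per-processor values come from the text, while the optional fixed end-point overhead is a placeholder left at zero rather than a measured value.

    def comm_latency(hops, cycles_per_hop, endpoint_overhead=0):
        """Latency = k * p clock cycles, per the model described in the text."""
        return cycles_per_hop * hops + endpoint_overhead

    n = 6                            # e.g., a 6 x 6 array
    hops = 2 * n - 2                 # corner-to-corner distance for one-to-one traffic
    print(comm_latency(hops, 13))    # nearest neighbor interconnect (k = 13)
    print(comm_latency(hops, 1))     # proposed routing architecture (k = 1)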

For one-to-one or one-to-all communication, each processor only requires one source so that single link architecture is sufficient and has the same communication latency with other architectures, as shown in Fig. 3.16. A little surprisingly, all architectures have the same latencies for all-to-all communication, as shown in Fig. 3.17, because each processor needs data from both horizontal and vertical neighboring processors and both of the two ports (buffers) of each processor are occupied by the nearest neighbor connections, prohibiting the usage of the additional links for long distance communication.

These architectures have different results in all-to-one communication, as shown in Fig. 3.18.

Figure 3.16: The communication latencies (clock cycles) of architectures with different numbers of links for the one-to-one and one-to-all communication patterns; they all have the same latencies since a single link is sufficient for those communications

For the type 1 (single link) architecture, most or all of the link resources are occupied by the nearest neighbor interconnect and little can be used for direct switching, so the latency is relatively high. Increasing the number of links helps when the latency is limited by the link resources. Type 2 (double links with a separated nearest neighbor link) has little advantage over type 1 but a relatively much higher area, and type 4 (three links with a separated nearest neighbor link) has little advantage over type 3 (double links) but a relatively much higher area, so types 2 and 4 are not considered further. The type 3 architecture is about two times faster than the single-link architecture, which makes it a good candidate. Comparing the type 5 (three links) architecture with type 3, they have the same latency within a small communication domain (2 × 2 and 3 × 3 arrays), while the three-link architecture benefits as the array grows. For 4 × 4 to 6 × 6 arrays, the three-link architecture has about 25% smaller latency; for 7 × 7 to 9 × 9 arrays, it has about 30% smaller latency, and the benefit increases with larger arrays.

3.3 Supporting long distance communication with GALS clocking style

The design methodology to support nearest neighbor communication with the GALS style is already discussed in Section 2.2.2. The design becomes more complex when considering long distance communication. For the asynchronous handshake scheme, each bit of the signal might reach the destination at a very different time, and some specific logic is necessary to guarantee the arrival of all bits of the signal [117]. The logic overhead to handle this asynchronous communication is not negligible, and the propagation delay can be significant.

Figure 3.17: The communication latencies (clock cycles) of architectures with different numbers of links for the all-to-all communication pattern; they all have the same latencies since two ports are always occupied by the nearest neighbor connections and therefore the additional links cannot be used

Figure 3.18: Comparing the communication latency (clock cycles) of interconnect architecture types 1 to 5, by varying the size of the array and using the all-to-one communication pattern. Type 6 and 7 architectures are not included in the comparison due to their high area cost as shown in Fig. 3.15

Figure 3.19: Synchronization strategies for long distance communication between processors with unrelated clocks: (a) using intermediate processors’ clocks with dual-clock FIFOs, (b) using source synchronous clocking with pipeline stages in each intermediate processor, and (c) using source synchronous clocking with selectable registering in intermediate processors

Significant research has investigated clocking issues in networks on chip. Intel proposed a 2-D array with packet-switched routers that employs mesochronous clocking, where each processor has the same clock frequency while the clock phase can differ [37]. ST proposed an asynchronous network-on-chip which uses a handshake scheme to handle the asynchronous communication [108].

3.3.1 Source synchronization for long distance communication

For the source synchronous scheme in long distance communication, in order to avoid different data bits reaching the destination in different clock periods, to prevent signal delays larger than the clock period, and to minimize the antenna and/or crosstalk effects of long distance wires, the data signals likely need to be registered (pipelined) in intermediate processors along the path.

Two options exist for this task, as shown in Fig. 3.19 (a) and (b), where processor A sends data to processor C through processor B. The first option is to register the signals using intermediate pro- cessors’ clock by a dual-clock FIFO, as shown in Fig. 3.19 (a). This scheme should work well under most of the situations, but is not efficient due to significant additional circuits and increased path latency. In addition, if an intermediate processor’s clock is running slowly, it will be a bottleneck to the link’s throughput.

It is proposed to extend the source synchronous method so that the initial source clock is routed along the entire path, as shown in Fig. 3.19 (b). When the source processor is running at a slow speed, the registers in the intermediate processors are unnecessary, and the link latency can be reduced with non-pipelined paths as shown in Fig. 3.19 (c). Besides the data and the clock, information indicating FIFO full and empty conditions also needs to be transferred to guarantee correct FIFO operation.

3.3.2 Care more about the clock delay, less about skew or jitter

The dual-clock FIFO design allows arbitrary clock skew and drift, which greatly simplifies the circuit design. The most important parameter of the clock used for the synchronization is no longer the skew or jitter, but is the propagation delay.

Firstly, the delay of clock must relate to the delay of data to meet the setup/hold time requirement during the register writing. Some configurable delay logic might be needed in the path of the data [45] or the path of the clock for this purpose. Secondly, the delay of clock (and data) should be minimized if possible, since increasing the communication delay not only increases the application computation latency, but also can reduce the application computation throughput.

3.4 Implementation and results

The communication circuitry is implemented using the topology shown in Fig. 3.3, static routing with two ports (buffers) for the processing core as shown in Fig. 3.6 (b), double links at each edge as shown in Fig. 3.11 (d), and the extended source synchronous strategy shown in Fig. 3.19 (c). Each processor with the communication circuitry occupies 0.71 mm² in 0.18 µm CMOS technology, and the critical path delay is 2.24 ns.

Figure 3.20: The relative communication circuit area of several interconnect architectures. The static single-link nearest neighbor and proposed static double-link asymmetric architectures contain two ports (buffers) for the processing core, and the dynamic routing architecture contains four buffers for the router's input ports with a single link at each edge

The implemented example is compared to a traditional dynamically configurable interconnect architecture with symmetric buffer allocation and a single link between each neighboring processor pair; we use 20 clock cycles as its communication latency [116], and we also assume the destination processor receives each data word and its address separately in two clock cycles (sending the address along with the data and recognizing the source in hardware would allow one data word per clock cycle, but requires extra hardware).

3.4.1 Area

Different communication architectures, including the static nearest neighbor interconnect, the proposed double-link routing architecture, and the traditional dynamic routing architecture, are modeled in 0.18 µm CMOS technology. Fig. 3.20 compares the communication circuitry area (including the buffers and the control logic) for different sizes of each buffer, with the area of the nearest neighbor interconnect architecture scaled to 1. When the buffers are not large, the control logic plays an important role in the area. For example, when the buffer size is 32 words, the double-link architecture is about 1.5 times larger than the nearest neighbor architecture. As the buffer size increases, the area of the communication circuitry becomes dominated by the buffers. When the buffer is 128 words, the proposed double-link architecture has approximately 25% larger area than the nearest neighbor architecture, while the traditional dynamic routing architecture is more than 2 times larger.

Figure 3.21: Comparing the communication latency of the static nearest neighbor architecture, the dynamic single-link routing architecture, and the proposed static double-link architecture, by varying the size of the array and using the one-to-one and one-to-all communication patterns

3.4.2 Performance comparison

Performance of the basic communication patterns

Figures 3.21, 3.22, and 3.23 show the latency of the basic communication patterns mapped onto the different architectures for different array sizes. The one-to-one communication has the same result as the one-to-all broadcast.

The proposed double-link routing architecture has significantly lower communication latency than the nearest neighbor architecture. The latency of the dynamic single-link routing architecture is similar to that of the static double-link architecture; it is a little worse for the one-to-one and all-to-one patterns, and a little better for the all-to-all communication, where the flexibility of dynamic routing overcomes its disadvantages.

Figure 3.22: Comparing the communication latency of the static nearest neighbor architecture, the dynamic single-link routing architecture, and the proposed static double-link architecture, by varying the size of the array and using the all-to-one communication pattern

Figure 3.23: Comparing the communication latency of the static nearest neighbor architecture, the dynamic single-link routing architecture, and the proposed static double-link architecture, by varying the size of the array and using the all-to-all communication pattern

Figure 3.24: Comparing the communication latency of the static nearest neighbor architecture, the dynamic single-link routing architecture, and the proposed static double-link architecture by mapping various application models onto a 10 × 10 array. The application models are obtained by uniformly combining the four basic communication patterns and assume 80% of the communication is within a local area, where the meaning of local area varies from a 2 × 2 to a 9 × 9 array

Combining the basic patterns

Although the communications of real applications have a wide range of variation, they can normally be modeled using some combination of basic communication patterns. This subsection investigates one application model mapped onto a 10 × 10 array.

The modeled communication is organized uniformly from the four basic communication patterns and assumes 80% of the communication is within the local area, which is the value normally used in the literature [118]; in order to cover the localization parameters of different communication patterns generally, eight data points represent situations where the definition of the local area varies from 2 × 2 to 9 × 9 arrays. Fig. 3.24 shows the resulting latencies.

Not surprisingly, the static nearest neighbor architecture performs worst; the proposed static double-link routing architecture is more than 2 times faster than it, and the dynamic single-link routing architecture is a little worse than the static double-link architecture.

3.5 Summary

An asymmetric inter-processor communication architecture is proposed which uses more buffer resources for nearest neighbor connections and fewer buffer resources for the long distance interconnect; the architecture also supports long distance GALS communication by extending the source synchronous method. Several design options are explored. Static routing is emphasized due to its low cost and low communication latency. Using two ports (buffers) for the processing core and two or three links at each edge achieves a good area/performance trade-off for chip multiprocessors built from simple single-issue processors, and the optimal number of links is expected to increase if the chip is built from larger processors.

Chapter 4

Physical Implementation of the GALS Multi-core Systems

Tile-based multi-core systems with GALS clocking styles provide great benefits, such as high performance, energy efficiency and scalability, as discussed in Section 2.1.4, but they impose some design challenges during the physical implementation regarding the robust handling of timing issues and limits in taking full advantage of system scalability.

GALS multi-core systems introduce challenges in handling the timing of signals due to the multiple clock domains. The timing issues include signals within a single processor, signals between processors, and signals between chips. A dual-clock FIFO can reliably handle the asynchronous interface within a single processor and is already described in Section 2.2.2; this chapter discusses the inter-processor and inter-chip timing issues in Section 4.1. Section 4.2 investigates how to take full advantage of system scalability, mainly concerning some unavoidable global-style signals such as the configuration signals, power distribution, and processor IO pins. Finally, Section 4.3 discusses some other issues of the AsAP processor implementation such as the design flow, high speed implementation, and testing.

4.1 Timing issues of GALS multi-core systems

Figure 4.1: An overview of timing issues in GALS chip multiprocessors; each clock domain covers multiple processors and each processor contains multiple clock domains

Figure 4.1 contains an overview of important timing issues in GALS chip multiprocessors using coarse grain source synchronous flow control. All signals in such interfaces can be classified into three categories. Here is an example with processor A sending data to processor B:

• A → B clock: the clock that synchronizes the source synchronous signals traveling from A to B. This scheme requires an extra clock tree to be inserted at processor B, which brings additional timing concerns for inter-processor communication.

• A → B signals: these include the data to be sent, and can also include other signals such as a “valid” signal.

• B → A signals: processor B can send information back to the source; for example, a flow control signal such as “ready” or “hold” falls into this category.

4.1.1 Inter-processor timing issues

Figure 4.2 shows three strategies to send signals from one processor to another. Fig. 4.2 (a) is an aggressive method where the clock is sent only when there is valid data (assuming the clock is sent out one cycle later than the data although they can also be in the same cycle). This method has high energy efficiency, but imposes a strict timing requirement between the delay of data (Ddata), the delay of clock (Dclk) and the clock period (T ), as shown in Eq. 4.1.

\[ t_{hold} \le D_{data} - D_{clk} \le T - t_{setup} \qquad (4.1) \]
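A short derivation sketch for Eq. 4.1, under the stated assumption that the clock is launched one cycle after its data: data sent in cycle i arrives at time iT + D_data, and the capturing clock edge, sent in cycle i+1, arrives at (i+1)T + D_clk. Requiring setup before that edge and hold against the next data word gives the two bounds:

\begin{align*}
\text{setup:} \quad iT + D_{data} \le (i{+}1)T + D_{clk} - t_{setup} &\;\Rightarrow\; D_{data} - D_{clk} \le T - t_{setup} \\
\text{hold:} \quad (i{+}1)T + D_{data} \ge (i{+}1)T + D_{clk} + t_{hold} &\;\Rightarrow\; D_{data} - D_{clk} \ge t_{hold}
\end{align*}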

Fig. 4.2 (c) is a conservative method where the clock is always active. With this method, the delay of the data does not have to satisfy Eq. 4.1, as long as the data and valid signals reach processor B within the same clock period, which results in a system similar to wave-pipelined systems [119] with a slightly modified set of timing equations and, of course, additional constraints on minimum path times. A compromise method, shown in Fig. 4.2 (b), is proposed, where the clock starts one cycle before the data and ends one cycle after the data. This scheme has high energy efficiency similar to the aggressive method since it keeps a relatively low clock active time. Figure 4.3 compares the communication power of these three methods for a 2-dimensional 8×8 DCT application using four processors, assuming the power is zero when there is no active clock and that the power is reduced by 50% when the clock is running but there is no inter-processor data transfer. The second scheme has a much more relaxed timing requirement compared to the first method, as shown in Eq. 4.2:

\[ t_{hold} - T \le D_{data} - D_{clk} \le T - t_{setup} \qquad (4.2) \]

Figure 4.2: Three methods for inter-processor communication: (a) sends the clock only when there is valid data; (b) sends the clock one cycle earlier and one cycle later than the valid data; (c) always sends the clock

Figure 4.3: Relative clock active time and communication power consumption for the three inter-processor communication methods described in Fig. 4.2, for a 2-D 8×8 DCT application


Figure 4.4: Circuit for the Fig. 4.2 (b) inter-processor communication method


Figure 4.5: Configurable logic at the inter-processor boundary where one clock domain covers two processors; GALS style brings an additional clock tree at the consumer processor

Figure 4.4 shows a circuit to realize this second scheme.

Figure 4.5 shows a generic implementation for inter-processor communication where processor A sends signals (including data, valid, and clock) to processor B. Along the data path there are delays in processor A (D_data_A), in the inter-processor wiring (D_data_w), and in processor B (D_data_B). The clock path has a similar makeup. Not considering the clock tree buffer, the data path and clock path have roughly the same logic and similar delays. To compensate for the clock tree delay and to meet the timing requirements, configurable DLY logic is inserted in both processors A and B, as shown in Fig. 4.5. Equation 4.1 can then be simplified as follows:

\[ t_{hold} \le D_{insert} - D_{clk\_tree} \le T - t_{setup} \qquad (4.3) \]

Here D_insert and D_clk_tree are the delays of the inserted logic and the clock tree respectively. Normally t_hold, t_setup, and t_clk_to_Q can be neglected since they are generally small compared to the clock period. Equation 4.4 lists three delay possibilities for the inserted delay logic, and the optimal delay for each inserted gate is shown in Eq. 4.5 and Eq. 4.6.

\[ D_{insert} = 2D_{mux} + \{0,\; D_{DLY},\; 2D_{DLY}\} \qquad (4.4) \]

\[ D_{mux} = D_{clk\_tree}/2 \qquad (4.5) \]

\[ D_{DLY} = T/2 \qquad (4.6) \]

As a typical example, if the clock tree delay is 6 fanout-of-4 (FO4) delays and the clock period is 20 FO4, then the optimal delays for the mux and DLY gates are D_mux = 3 FO4 and D_DLY = 10 FO4.

The third type of signal consists of signals that flow in the direction opposite to the clock signal. As illustrated in Fig. 4.5, a common example of such a signal is the FIFO full signal. Such signals do not need to match delays with others, so they are relatively easy to handle, but they cannot be too slow, or they will not arrive in the correct clock period. The timing constraints discussed below handle this requirement.

Previously discussed circuits match the delay according to logic delays, but the real circuit delay is also highly dependent on wiring and gate loads. Specific input delay and output delay constraints clearly quantify circuit timing requirements. The value of input delays and output delays should follow Eqs. 4.7 and 4.8 for the architecture shown in Fig. 4.5.

output delay = T − Ddata_A    (4.7)

input delay = T − Ddata_B    (4.8)

If T = 20 FO4, DDLY = 10 FO4, Dmux = 3 FO4, and the output logic at processor A is 2 FO4, then Ddata_A = 15 FO4 and the output delay should be 5 FO4. The input delay can be calculated similarly. Table 4.1 lists delays for one example case study.
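To make the delay budgeting concrete, the short Python sketch below evaluates Eqs. 4.4–4.8 for the FO4 numbers used in this example. It is an illustration only, not part of the original design flow; the assumption that one mux and one DLY gate lie in the producer's data path (giving Ddata_A = 2 + 3 + 10 = 15 FO4) simply follows the worked numbers above, and the symmetric value for Ddata_B is likewise an assumption consistent with Table 4.1.

```python
# Minimal sketch of the delay budgeting in Eqs. 4.4-4.8.
# All delays are in FO4 (fanout-of-4 inverter) delays.

def inserted_delays(clk_tree_delay, clk_period):
    """Optimal mux and DLY gate delays (Eqs. 4.5, 4.6) and the three
    selectable total inserted delays (Eq. 4.4)."""
    d_mux = clk_tree_delay / 2.0                   # Dmux = Dclk_tree / 2
    d_dly = clk_period / 2.0                       # DDLY = T / 2
    d_insert = [2 * d_mux + k * d_dly for k in (0, 1, 2)]
    return d_mux, d_dly, d_insert

def io_constraints(clk_period, d_data_a, d_data_b):
    """Synthesis output/input delay constraints (Eqs. 4.7, 4.8)."""
    return clk_period - d_data_a, clk_period - d_data_b

T = 20.0            # clock period, FO4
D_CLK_TREE = 6.0    # consumer clock tree delay, FO4
OUT_LOGIC = 2.0     # output logic delay in processor A, FO4

d_mux, d_dly, d_insert = inserted_delays(D_CLK_TREE, T)
print(f"Dmux = {d_mux} FO4, DDLY = {d_dly} FO4, Dinsert options = {d_insert}")

# Assume one mux and one DLY gate sit in processor A's data path, so
# Ddata_A = 2 + 3 + 10 = 15 FO4; a symmetric 15 FO4 is assumed for Ddata_B,
# consistent with the 5 FO4 input delay for data_in in Table 4.1.
d_data_a = OUT_LOGIC + d_mux + d_dly
out_delay, in_delay = io_constraints(T, d_data_a, d_data_b=15.0)
print(f"output delay = {out_delay} FO4, input delay = {in_delay} FO4")
```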

4.1.2 Inter-chip timing issues

Similar to the case of inter-processor communication, inter-chip communication presents timing challenges, as illustrated in Fig. 4.6. Besides the delay at the producer processor (DA1) and the consumer processor (DB1), there is also delay at the chip A boundary (DA2), the chip B boundary (DB2), and the inter-chip delays including pads, package, and printed circuit board delays (D1→2).

Table 4.1: Typical timing constraint values for processor input and output delays

  Constraint      Signal           Reference clk   Value
  input delay     data_in          clk_dnstrm      5 FO4
  input delay     valid_in         clk_dnstrm      5 FO4
  input delay     FIFO_full_in     clk_upstrm      10 FO4
  output delay    data_out         clk_upstrm      5 FO4
  output delay    valid_out        clk_upstrm      5 FO4
  output delay    FIFO_full_out    clk_dnstrm      10 FO4

Figure 4.6: Inter-chip communication

The three communication schemes shown in Fig. 4.2 also apply to inter-chip communication, and the configurable delay logic embedded in each processor, shown in Fig. 4.5, remains useful for adjusting the data delay for inter-chip communication. Due to the more complex environment of inter-chip communication, the methods of Fig. 4.2 (b) and (c) are preferred. Synthesis timing constraints for the chip IO signals can also help with the inter-chip timing.

4.2 Scalability issues of GALS chip multiprocessors

Tile-based architectures and GALS clocking styles improve system scalability and allow more processors to be added to the chip easily. However, some additional issues must still be considered to take full advantage of this potential scalability. The key issue is that some signals are unavoidably global in nature and must be handled carefully. Correspondingly, the key idea is to avoid or isolate those signals where possible, so that multiple processors can be directly tiled without further changes.

Figure 4.7: Global signals controlled by a low-speed clock are buffered with inserted buffers and routed through processors with internal wiring

4.2.1 Clocking and buffering of global signals

The GALS style avoids making the most important signal a global one: the clock. The signals discussed in previous sections are all local signals which run within one processor or travel at the boundary of two processors, so they can be controlled by a full speed clock. But some global signals, such as configuration signals (used to configure the connections between processors, to configure the local oscillator frequency, etc.) and test signals, are likely unavoidable. These global signals can be pipelined into multiple segments [120] and still run at full speed, or can use totally asynchronous communication to avoid clock constraints [121]; both of these methods significantly increase design difficulty. In fact, many necessarily global signals are related to functions that are either seldom used during normal operation or are typically used only at powerup time. Therefore, making them run at a slow speed in many cases does not strongly affect system performance. A dedicated low-speed clock can then be used to control these less-critical global signals. These signals can be fed through each processor with internal wiring and buffering, to increase their driving strength and to enable them to be directly connected to an adjacent processor without any intermediary circuits at all. Figure 4.7 illustrates this scheme.

Figure 4.8: An example power distribution scheme with power wires reaching beyond the processor core to allow it to directly abut other processors

4.2.2 Power distribution

Power distribution can also be viewed as a global signal and deserves special consideration. In order to enable processors to directly abut each other without any further modifications, a complete power grid should be designed for each single processor. The width of the metal power wires must be chosen carefully to meet the voltage supply requirement, according to Eq. 4.9, where V2 − Vorig is the allowable voltage drop, I is the estimated current for the entire chip (not a single processor), which can be obtained from Encounter, L and w are the length and width of the metal wire, and ρ is the metal's resistivity per unit length and width.

I × (ρ L / w) = V2 − Vorig    (4.9)
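As a quick sanity check of Eq. 4.9, the sketch below solves it for the wire width needed to stay within an IR-drop budget. The sheet resistance, current, wire length, and allowed drop used here are illustrative placeholders, not values from the AsAP implementation.

```python
# Solve Eq. 4.9, I * (rho * L / w) = V2 - Vorig, for the wire width w.
# All numbers below are illustrative placeholders, not AsAP design values.

def required_width_um(current_a, rho_ohm_per_sq, length_um, allowed_drop_v):
    """Minimum total metal width (um) keeping the IR drop within budget."""
    return current_a * rho_ohm_per_sq * length_um / allowed_drop_v

I_CHIP = 2.0       # estimated current for the entire chip, A
RHO = 0.08         # metal sheet resistance, ohms per square
LENGTH = 5000.0    # power wire length across the die, um
DROP = 0.09        # allowed drop, V (5% of a 1.8 V supply)

w = required_width_um(I_CHIP, RHO, LENGTH, DROP)
# The result is an aggregate width, spread across many parallel power straps.
print(f"minimum total power-wire width: {w:.0f} um")
```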

Figure 4.8 shows a complete power distribution example for a single processor using 6 metal layers, which is common in 0.18 µm technology. Metal 5 and 6 are used for global power distribution with wide wires, and Metal 1 and 2 are used to distribute power to each gate with narrower wires. Vias are placed between Metal 6 and 5, Metal 5 and 2, and Metal 2 and 1. The power grids reach out of the core to the tile boundary and enable processors to directly abut each other at the chip level.

Figure 4.9: Pin connections between two processors in the vertical direction; nearly direct abutment enables very short wires

A related issue in processor power distribution is the power plan for the local oscillator, if one exists. To obtain clean clock power, the local oscillator should be placed and routed separately and then inserted into the processor as a hard macro. The left part of Fig. 4.8 shows an example implementation of the oscillator power grid. The oscillator should be placed away from the noisiest processor circuits to reduce clock jitter, which typically means a corner of the processor. A placement blockage (block halo) should be added around the oscillator to keep other logic from being placed at this location, to simplify the routing of the oscillator signals and to reduce interference from other logic. Finally, the oscillator in this example has a separate power ring and power grid which are not connected to the main processor power grid, in order to obtain a clean power supply.

4.2.3 Position of IO pins

The position of the IO pins of each processor is also important for scalability since they must connect with other processors. Figures 4.9 and 4.10 show example connections of two processors in the vertical and horizontal directions, where the IO pins directly abut each other with very short connecting wires.

4.3 A design example — implementation of AsAP

The single-chip tile-based 6×6 GALS AsAP multiprocessor is designed and implemented in a 0.18 µm CMOS technology [36]. Data enters the array through the top left processor and exits through one of the right column processors, selected by a mux. Input and output circuits are available on each edge of all periphery processors but most are unconnected in this test chip due to package I/O limitations.

Figure 4.10: Pin connections between two processors in the horizontal direction

The chip utilizes the Artisan standard cell library and was auto placed and routed. Figure 4.11 shows the die micrograph. The chip is fully functional at a clock rate of 520–540 MHz under typical conditions at 1.8 V, and achieves a peak performance of 19 GOPS with a power consumption of approximately 3.4 W. The size of each processor is 0.66 mm2. Each processor contains 230,000 transistors, and dedicates approximately 8% of its area to communication circuits and less than 1% to the local clock oscillator.

4.3.1 Physical design flow

Since each processor, including its clock tree, can be exactly the same in a GALS tile-based chip multiprocessor, the approach provides near-perfect scalability and greatly simplifies the physical design flow. One processor design can easily be duplicated to generate an array processor. Figure 4.12 shows the hierarchical physical design flow. A local oscillator provides the clock for each processor, and it is designed separately for more robust operation. After a single processor is auto placed and routed, processors are arrayed across the chip, and the small amount of global circuitry is auto placed and routed around the array. The right column of Fig. 4.12 shows the physical design flow for a single processor; a similar flow also applies to the oscillator and the entire chip.

Key statistics annotated on the die micrograph: 230,000 transistors per processor and 8.5 million per chip; maximum speed 530 MHz at 1.8 V; area 0.66 mm² per processor (810 µm × 810 µm tile) and 32.1 mm² per chip (5.68 mm × 5.65 mm die); power per processor at 1.8 V and 530 MHz of 36 mW for a typical application, 94 mW when 100% active, and 144 mW worst case; power per processor at 0.9 V and 116 MHz of 2.4 mW for a typical application.

Figure 4.11: Chip micrograph of a 6×6 GALS array processor

The chip is fully synthesized from verilog, except the clock oscillator which was designed by hand from standard cells. The Imem, Dmem, and 2 FIFO memories are built using memory macro blocks. Verilog was used as the front end design language and was synthesized using Synopsys Design Compiler. The synthesized netlist was then imported into the automatic placement and routing tool, Cadence Encounter, to do floorplanning, placement, clock tree insertion, routing, and in-place optimization, which resizes gates and optimizes logic to alleviate wire delay effects. The result from the place and route tool was imported into another custom layout tool (icfb) to do final layout editing such as pad bounding, IO pin labeling, and layout layer checking.

Intensive verification methods were used throughout the design process, as shown in Fig. 4.13, including: gate level dynamic simulation using NC-Verilog, static timing analysis using Primetime, DRC/LVS using Calibre, spice level simulation using Nanosim, and formal verification using Tuxedo. The entire back end design flow took approximately 4 person-months including setup of the tools and standard cell libraries.

The final 6×6 chip design was extended from a 3×3 design a few days before tapeout, going from verilog to GDS II in a total of 10 hours, clearly demonstrating the excellent scalability of this architecture and approach.

Figure 4.12: Hierarchical physical design flow of a tile-based GALS chip multiprocessor, with details specific to a synthesized standard cell design. A single processor tile can be replicated across the chip. The per-processor flow proceeds from HDL (Verilog or VHDL) through synthesis, floorplanning, placement, clock tree insertion, in-place optimization, routing, and layout editing to a final layout ready for fabrication; the oscillator and the entire chip follow similar flows.

4.3.2 Implementation for high speed

Several methods were used to increase the processor clock frequency during synthesis. The MAC unit was synthesized separately by Module Compiler; the synthesis script was tuned for high speed, for example by setting an aggressive clock period constraint and using high-effort compilation; and the RTL code of some function modules was rewritten in a parallel style to replace a serial style. Using these methods, the reported clock period from synthesis was reduced from 2.49 ns to 1.82 ns. The effect of these methods is shown in the upper part of Fig. 4.14. The critical path is in the ALU and its feedback path.

As fabrication technology advances, wire delay becomes an increasingly important issue, and placement and routing have a large impact on system speed. Timing-driven placement and routing was used in our flow. In-place optimization was used, which enlarges some gates to compensate for the wire delay introduced during placement and routing. Careful metal fill parameters were chosen to meet the metal density requirement with little effect on speed: the active spacing (the space between signal metal and filler metal) was constrained to be greater than 1 µm, and the spacing between metal fills was constrained to be greater than 0.6 µm. Using these methods, the reported clock period after placement and routing was reduced from 4.26 ns to 2.22 ns. The effect of these methods is shown in the lower part of Fig. 4.14.

Figure 4.13: A standard cell based back end design flow emphasizing the verification

The clock tree must be designed carefully for correct function and high speed, especially when there are multiple clock domains as in our case. There are four clock trees in each single processor: the main clock generated by the local oscillator, two FIFO write clocks, and the configuration clock. The reported clock skews for these four are 44 ps, 10 ps, 20 ps and 39 ps respectively, and the transition rise time at the buffers and end registers is less than 120 ps.

4.3.3 Testing

For testing purposes, 27 critical signals from each of the 36 processors, totaling 972 signals in the chip, can be selectively routed to eight chip pads for real-time viewing of these key signals, which include clocks, stall signals, FIFO signals, the PC, etc.

Figure 4.14: Speed-up methods during synthesis (upper figure: original, optimized MAC, optimized script, parallelizing) and place and routing (bottom figure: original, timing-driven P&R, in-place optimization, metal filler); the vertical axes show maximum delay (ns)

Figure 4.15 shows the test environment for the AsAP prototype, built by group member Jeremy Webb, including a printed circuit board hosting an AsAP processor and a supporting FPGA board to interface between AsAP's configuration and data ports and a host computer. An SPI-style serial port in the AsAP processor receives external information and commands for configuration and programs.

4.4 Summary

This chapter discusses implementation techniques for tile-based GALS chip multiprocessors. This architecture improves system scalability and simplifies the physical design flow. At the same time, it imposes some design challenges. These include several timing issues related to inter-processor communication, inter-chip communication, and asynchronous boundaries within single processors. By carefully addressing these timing issues, it is possible to take full advantage of the architecture's scalability, and the processor architecture makes it possible for a small design group to build a high performance system within a short time period.

Figure 4.15: AsAP board and supporting FPGA-based test board

Chapter 5

Results and Evaluation of the Multi-core System

This chapter presents and evaluates measurement results of the fabricated chip, including performance, area, and power consumption. In addition, its performance and energy efficiency on several DSP applications are evaluated thoroughly by comparison with related processors.

5.1 Area, speed, and power

Each processor occupies 0.66 mm2 and the 6×6 array occupies 32.1 mm2 including pads.

The chip operates at 520–540 MHz under typical conditions at 1.8 V, and at 116 MHz at 0.9 V. Since AsAP processors dissipate zero active power when idle for even brief periods of time (which is common in complex applications), and because different instructions dissipate varying amounts of power, it is useful to consider several power measurements. The average power consumption of each processor is 36 mW when processors are executing applications such as a JPEG encoder or an 802.11a/g baseband transmitter. Processors that are 100% active and executing a "typical application" mix of instructions dissipate 94 mW each. The absolute worst case power per processor is 144 mW and occurs when using the MAC instruction with all memories active every cycle.

Table 5.1: Area breakdown in a single processor

                      Area (µm²)   Area percentage
  Core and others     433,300      66.0 %
  Data memory         115,000      17.5 %
  Instruction mem.    56,500       8.6 %
  Two FIFOs           48,000       7.4 %
  Oscillator          3,300        0.5 %
  Single processor    656,100      100.0 %

Figure 5.1: Area evaluation of the AsAP processor and several other processors, with technology scaled to 0.13 µm: (a) area breakdown (communication, memory, core) for the TI C64x, RAW, Fujitsu 4-VLIW, BlueGene/L, and AsAP; (b) processor area in mm² for the TI C64x, CELL/SPE, Fujitsu 4-VLIW, RAW, ARM, and AsAP

5.1.1 Small area and high area efficiency

Due to its small memories and simple communication scheme, each AsAP processor devotes most of its area to the execution core and thereby achieves a high area efficiency.

Table 5.1 shows the area breakdown for each AsAP processor. Each one dedicates 8% to communication circuits, 26% to memory circuits, and a favorable 66% to the core. These data compare well to other processors [7, 35, 122, 25, 23, 123], as shown in Fig. 5.1(a), since the other processors use 20% to 45% of their area for the core. Each AsAP processor occupies 0.66 mm2 and the 6×6 array occupies 32.1 mm2 including pads, global power rings, and a small amount of chip-level circuits. Figure 5.1(b) compares the area of several processors scaled to 0.13 µm, assuming area reduces as the square of the technology's minimum feature size. The AsAP processor is 20 to 210 times smaller than these other processors.

The processor area, or equivalently the granularity of each processing element, is one of the most important variables in chip multiprocessor architecture. A wide range of granularities is possible, as shown in Fig. 5.2 [36, 35, 23, 13], which has similar information to Fig. 5.1(b) but shows real layouts.

Figure 5.2: Approximate scale die micrographs of the multicore processors AsAP, RAW, and CELL SPE, and the TI C64x processor scaled to the same technology.

The coarse grain 2-core Itanium [13] contains large wide-issue processors, each close to 300 mm2 in 90 nm technology, while the fine grain AsAP contains single-issue processors, each less than 1 mm2 in 0.18 µm technology. Size differences of tens to hundreds of times strongly impact system behavior.

Most chip multiprocessors target a broad range of applications, and each processor in such systems normally contains powerful computational resources—such as large memories, wide issue processors [13], and powerful inter-processor communication [35]—to support widely varying requirements. Extra computational resources can enable systems to provide high performance for a diverse set of applications, but they reduce energy efficiency for tasks that cannot make use of those specialized resources. Most DSP applications (the targets of AsAP) are made up of computationally intensive tasks with very small instruction and data kernels, which makes it possible to use extremely simple computational resources—small memory, a simple single-issue datapath, and nearest neighbor communication—to achieve high energy efficiency while maintaining high performance.

Figure 5.3: Processor shmoo: supply voltage (V, 0.1 V/division) vs. clock period (ns, 0.25 ns/division), showing passing and failing regions

5.1.2 High speed

Small memories and simple processing elements enable high clock frequencies. The fabricated processors run at 520–540 MHz at a supply voltage of 1.8 V. The shmoo plot of Fig. 5.3 shows processor operation as a function of supply voltage and clock speed. The AsAP processor operates at clock frequencies among the highest possible for a digital system designed with its particular design approach and fabrication technology. The clock frequency information listed in Table 5.3 supports this assertion.

5.1.3 High peak performance and low average power consumption

High clock speed and small area result in high peak performance for a fixed chip size, and the average power consumption is low due to the simple architecture. Figure 5.4 compares the peak performance density and energy efficiency of several processors [23, 122, 123, 35, 7]. All data are scaled to 0.13 µm technology. Energy efficiency is defined as the power divided by the clock frequency, with a scale factor to compensate for multiple-issue architectures. These processors have large differences that are not taken into account by these simple metrics—such as word width and workload—so this comparison is only approximate.

Figure 5.4: Power and performance evaluation of the AsAP processor and several other processors (AsAP, CELL/SPE, Fujitsu 4-VLIW, ARM, RAW, TI C64x), with technology scaled to 0.13 µm. The vertical axis is power divided by clock frequency and issue-width scale factor (mW/MHz); the horizontal axis is peak performance density (MOPS/mm²). 2 ops/cycle are assumed for the CELL/SPE and 3.3 ops/cycle for the TI C64x.

Table 5.2: Estimates for a 13 mm × 13 mm AsAP array implemented in various semiconductor technologies

  CMOS Tech   Processor Size   Num Procs   Clock Freq   Peak System Processing
  (nm)        (mm²)            per Chip    (GHz)        (Tera-Op)
  180         0.66             256         0.53         0.13
  130         0.34             500         0.73         0.36
  90          0.16             1050        1.06         1.11
  45          0.04             4200        2.12         8.90

The AsAP processor has a peak performance density that is 7 to 30 times higher than the others. Also, the AsAP processor has a power per operation that is 5 to 15 times lower than the others.
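The energy metric of Fig. 5.4 is straightforward to compute. The sketch below evaluates it for AsAP using the 94 mW, 530 MHz figures measured in this chapter (ignoring the technology scaling applied in the figure), and for a hypothetical multi-issue processor whose power and frequency values are placeholders; only the 3.3 ops/cycle scale factor comes from the figure.

```python
# Energy metric of Fig. 5.4: (power / clock frequency) / ops-per-cycle, in mW/MHz.
# AsAP numbers are the measured values quoted in this chapter (no technology
# scaling applied here); the "hypothetical VLIW" numbers are placeholders,
# only its 3.3 ops/cycle scale factor comes from the figure.

def scaled_energy_mw_per_mhz(power_mw, freq_mhz, ops_per_cycle):
    return (power_mw / freq_mhz) / ops_per_cycle

asap = scaled_energy_mw_per_mhz(94.0, 530.0, 1.0)     # single issue
vliw = scaled_energy_mw_per_mhz(1500.0, 600.0, 3.3)   # hypothetical VLIW
print(f"AsAP:              {asap:.3f} mW/MHz per op")
print(f"hypothetical VLIW: {vliw:.3f} mW/MHz per op")
```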

With advancing semiconductor fabrication technologies, the number of processors will increase with the square of the scaling factor and clock rates will increase approximately linearly, resulting in a total peak system throughput that increases with the cube of the technology scaling factor. Table 5.2 summarizes the area and performance estimates for several technologies with the corresponding peak performance. It shows that in 90 nm technology, an AsAP array can achieve 1 Tera-op/sec on a 13 mm × 13 mm chip, while dissipating only 10 W of typical application power in addition to leakage. Real applications would be unlikely to sustain this peak rate, but tremendous throughputs are nonetheless expected.
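The entries in Table 5.2 follow directly from these scaling assumptions. The short sketch below reproduces the arithmetic from the 0.18 µm baseline (one operation per processor per cycle), so small rounding differences from the table are expected.

```python
# Reproduce the scaling arithmetic behind Table 5.2 from the 0.18 um baseline:
# area scales as s^2 and clock as 1/s, so processors per die and total peak
# throughput scale as 1/s^2 and 1/s^3 respectively.

DIE_AREA_MM2 = 13.0 * 13.0          # 13 mm x 13 mm array
BASE_TECH_NM = 180.0
BASE_PROC_AREA_MM2 = 0.66           # per-processor area at 0.18 um
BASE_CLOCK_GHZ = 0.53               # per-processor clock at 0.18 um

for tech_nm in (180.0, 130.0, 90.0, 45.0):
    s = tech_nm / BASE_TECH_NM                      # linear scale factor (<= 1)
    proc_area = BASE_PROC_AREA_MM2 * s ** 2         # area shrinks as s^2
    clock_ghz = BASE_CLOCK_GHZ / s                  # clock improves roughly as 1/s
    num_procs = int(DIE_AREA_MM2 / proc_area)       # processors that fit on the die
    peak_tera_ops = num_procs * clock_ghz / 1000.0  # 1 op/cycle per processor
    print(f"{tech_nm:5.0f} nm: {proc_area:0.2f} mm^2, {num_procs:5d} procs, "
          f"{clock_ghz:0.2f} GHz, {peak_tera_ops:0.2f} Tera-op/s")
```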

Figure 5.5: Comparison of area and power consumption for the communication circuit of four chip multiprocessors (AsAP, Niagara, BlueGene/L, RAW): (a) absolute comm. area; (b) area percentage of comm. circuit to processor; (c) absolute comm. circuit power consumption; and (d) power consumption percentage of comm. circuit to processor. Values are scaled to 0.13 µm technology.

5.1.4 Result of communication circuitry

Nearest neighbor communication simplifies the inter-processor circuitry; the two dual-clock FIFOs are the major cost in this case, which results in high area and energy efficiencies. Figure 5.5 compares AsAP to three other chip multiprocessors [26, 25, 124]. The communication circuitry in the AsAP processor occupies less than 0.08 mm2 in 0.18 µm CMOS, which is approximately 8% of the processor area, and is more than 100 times smaller than the others. Under worst case conditions that maximize communication, the communication circuitry in the AsAP processor consumes around 4 mW at 470 MHz, which is about 4% of the processor power and 200 times lower than the others.

5.2 High performance and low power consumption for DSP applications

Table 5.3 lists area, performance, and power data for a number of general-purpose (MIPS VR5000 [53, 125], NMIPS [126], ARM [127]), programmable DSP (TI C62x [53, 125], PipeRench [95]), and ASIC (DCT processor [128], FFT processor [129], ARM+ASIC [127], Atheros [130]) processors for which data could be obtained.

Table 5.3: Area, performance and power comparison of various processors for several key DSP kernels and applications; all data are scaled to 0.18 µm technology assuming a 1/s² reduction in area, a factor of s increase in speed, and a 1/s² reduction in power consumption. The area is the chip core area when available.

  Benchmark      Processor        Scaled area   Scaled clock   Clock    Scaled exec.   Scaled power   Scaled energy
                                  (mm²)         freq. (MHz)    cycles   time (ns)      (mW)           (nJ)
  40-tap FIR     AsAP (8 proc.)   5.28          530            10       19             803            15
                 MIPS VR5000      N/A           347            430      1239           2600           3222
                 TI C62x          > 100         216            20       92             3200           296
                 PipeRench        55            120            2.87     24             1873           45
  8x8 DCT        AsAP (8 proc.)   5.28          530            254      479            402            192
                 NMIPS            N/A           78             10772    137000         177            24400
                 CDCT6            N/A           178            3208     18000          104            1950
                 TI C62x          > 100         216            208      963            3200           3078
                 DCT processor    1.72          555            64       115            520            60
  Radix-2        AsAP (13 proc.)  8.6           530            845      1593           759            1209
  complex        MIPS VR5000      N/A           347            15480    44610          2600           115988
  64-pt FFT      TI C62x          > 100         216            860      3981           3200           12739
                 FFT processor    3.5 (core)    27             23       852            43             37
  JPEG encoder   AsAP (9 proc.)   5.94          300            1443     4810           224            1077
  (8x8 block)    ARM              N/A           50             6372     127440         N/A            N/A
                 TI C62x          > 100         216            840      3900           3200           12400
                 ARM+ASIC         N/A           50             1023     20460          N/A            N/A
  802.11a/g      AsAP (22 proc.)  14.52         300            4000     13200          407            5372
  Trans.         TI C62x          > 100         216            27200    126800         3200           405760
  (1 symbol)     Atheros          4.8 (core)    N/A            N/A      4000           24.2           96.8

The TI C62x was chosen as the reference programmable DSP processor since it belongs to the TI VLIW C6000 series, which is TI's highest performance series. The enhanced TI C64x VLIW DSP processor [7] is also in the C6000 series and has an architecture similar to the C62x, but it contains substantial circuit-level optimizations that achieve more than 4 times higher performance with less than half the power consumption compared to the C62x. The C62x is therefore a fair comparison with the first-version AsAP processor, and a better comparison at the architecture level, without tainting from circuit-level optimizations.

In support of the assertion that the AsAP prototype has significant room for improvement, measurements show that approximately 2/3 of AsAP's power is dissipated in its clocking system. This is largely because clock gating was not implemented in this first prototype. All circuits within each processor are clocked continuously, except during idle periods when the oscillator is halted. The future addition of even coarse levels of clock gating (as distinguished from oscillator halting) is expected to reduce power consumption significantly further.

Figure 5.6: Relative area for various implementations of several key DSP kernels and applications. Source data are available in Table 5.3.

The area used by AsAP, shown in Table 5.3, is the combined area required for all processors, including those used for communication. Data for the FIR, 8×8 DCT, and FFT are deduced from measured results of larger applications. The performance of the JPEG encoder on the TI C62x was estimated using the relative performance of the C62x compared to MIPS processors [125] and a reported similar ARM processor [127].

Figures 5.6, 5.7, and 5.8 compare the relative performance and power of an AsAP processor to the other processors in Table 5.3. These comparisons make use of 8, 8, 13, 9, and 22 processors respectively. A larger number of processors (through parallelization) would increase performance further. AsAP achieves 26–286 times higher performance and 96–215 times higher energy efficiency than RISC processors (single issue MIPS and ARM). AsAP also achieves 0.8–9.6 times higher performance and 10–75 times higher energy efficiency than high-end programmable DSPs (TI C62x). Compared to ASIC implementations, AsAP achieves performance within a factor of 2–4.2 and energy efficiency within a factor of 3–50, with an area within a factor of 2.5–3. As a whole, RISC processors do not compare well, but this is to be expected since they are not optimized for DSP applications.

Another source of AsAP's high energy efficiency is its haltable clock, which is greatly aided by the GALS clocking style. Halting clocks while processors are momentarily inactive results in power reductions of 53% for the JPEG core and 65% for the 802.11a/g baseband transmitter.

Figure 5.7: Relative execution time for various implementations of several key DSP kernels and applications. Source data are available in Table 5.3.

Figure 5.8: Relative energy for various implementations of several key DSP kernels and applications. Source data are available in Table 5.3.

Figure 5.9: Increase in system throughput with increasing number of processors, for an 8-tap FIR, an 8×8 DCT, finding the maximum of 100 data, and a 64-pt complex FFT

Supply voltage scaling can be used to further improve power savings in very low power applications. Processors dissipate an average of 2.4 mW at a clock rate of 116 MHz using a supply voltage of 0.9 V while executing the described applications.

5.2.1 Performance scaling with the number of processors

Figure 5.9 shows how throughput scales for four single tasks relative to the throughput of a single processor. Programs were written in assembly by hand but are only lightly optimized and unscheduled. The memory requirements of the 8×8 DCT and 64-pt complex FFT exceed the available memory of a single AsAP processor, so data points using one processor are estimated assuming a single processor with a large enough memory. An analysis of scaling results from a 16-tap FIR filter implemented in 85 different designs using from 1–52 processors shows a factor of 9 variation in throughput per processor over this space [131].

When all processors have a balanced computation load with little communication overhead, the system throughput increases nearly linearly with the number of processors, as for the task of finding the maximum value of a data set (Max in 100 data in Fig. 5.9). Clearly, applications that are difficult to parallelize show far less scalability at some point. For example, the performance of the 8×8 DCT increases well up to 10 processors, where 4.4 times higher performance is achieved, but after that little improvement is seen and only 5.4 times higher performance is reached using 24 processors. However, there is significant improvement in the FIR filter and FFT after a certain number of processors is reached. This is because increasing the number of processors in these applications avoids extra computation in some cases. For example, the FFT avoids the calculation of data and coefficient addresses when each processor is dedicated to one stage of the FFT computation. On average, 10 processor and 20 processor systems achieve more than 5 times and 10 times higher performance compared to a single processor system, respectively.
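As a quick illustration of these diminishing returns, the per-processor efficiency (relative throughput divided by processor count) can be computed directly from the 8×8 DCT data points quoted above:

```python
# Per-processor parallel efficiency for the 8x8 DCT data points quoted in the
# text: 4.4x throughput on 10 processors and 5.4x on 24 processors,
# relative to a single processor.
dct_points = {1: 1.0, 10: 4.4, 24: 5.4}

for n_procs, speedup in dct_points.items():
    efficiency = speedup / n_procs
    print(f"{n_procs:2d} processors: {speedup:.1f}x throughput, "
          f"{efficiency:.2f} of ideal linear scaling")
```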

5.3 Summary

The AsAP scalable programmable processor array is designed for DSP applications and features a chip multiprocessor architecture, simple processors with small memories, a GALS clocking style, and nearest neighbor communication. These and other features make AsAP well-suited for future fabrication technologies and for the computation of complex multi-task DSP workloads.

The AsAP processing array is implemented in 0.18 µm CMOS and runs at 520–540 MHz at 1.8 V. It is highly energy efficient: each processor dissipates 36 mW while executing applications, and 94 mW when 100% active. It achieves a high performance density of 530 Mega-ops per 0.66 mm2.

Chapter 6

System Feature Analysis: GALS vs. Synchronous

The effect of the GALS style on system features is investigated in this chapter. GALS chip multiprocessors under the right conditions, like AsAP, can hide much of the GALS performance penalty and at the same time take full advantage of its scalability and high energy efficiency. Along with a thorough investigation of GALS effects on system performance, it is shown that such GALS chip multiprocessors have small performance penalties compared to corresponding synchronous systems. Furthermore, the small performance penalty can be completely eliminated by using sufficiently large FIFOs for inter-processor communication and by programming without multiple-loop communication links. Scalability is enhanced due to the lack of a need for a global clock tree. In addition, the potential energy savings from joint adaptive clock and supply voltage scaling are increased in the common situation when workloads have highly unbalanced computational loads across processors, thereby increasing the probability that processors can be tuned to save power. This work is distinguished from previous GALS multiprocessor evaluations [132] by not restricting the analysis to systems with global communication schemes.

After investigating several key design choices which impact the behavior of GALS systems in Section 6.1, Section 6.2 introduces the simulation platform. Section 6.3 investigates the effect of the asynchronous communication penalty on the performance of this GALS chip multiprocessor. Its scalability compared to the corresponding synchronous system is shown in Section 6.4. Section 6.5 investigates the power efficiency of this GALS chip multiprocessor when using adaptive clock/voltage scaling.

6.1 Exploring the key GALS chip multiprocessor design options

Several design options impact the behavior of GALS chip multiprocessors. This section presents the three most fundamental parameters: the clock domain partition, asynchronous boundary communication, and the inter-processor network.

Dual-clock FIFOs, which reliably and efficiently move data across asynchronous clock boundaries, are a key component of GALS systems and were discussed in Section 2.2.2. Furthermore, as will be shown in Section 6.3.3, a relatively large buffer has the benefit of hiding some communication latency. Dual-clock FIFOs introduce extra communication delay compared to a corresponding synchronous design, and the exact values vary depending on many circuit, fabrication technology, and operating condition variables.

6.1.1 Clock domain partition of GALS chip multiprocessors

The most basic issue in designing a GALS chip multiprocessor system is how to partition the system into multiple clock domains. One method is to place each processor in its own clock domain. A finer grain method is to partition each processor into several clock domains, which is called a GALS uniprocessor. A coarser grain method is to group several processors together into a single clock domain. Figure 6.1 illustrates these three methods.

Partitioning each processor into several clock domains has been studied by several researchers [48, 133]. Its key advantage is the opportunity to apply clock/voltage scaling not only to individual processors, but also to modules inside each processor. Its primary disadvantage is a relatively significant performance penalty and design overhead.

Upadhyay et al. [134] investigated coarse granularity clock domain partitioning for a GALS chip multiprocessor. They compared the power savings from the simplified clock tree versus the power overhead from the local clock generator plus asynchronous communication for different partition methods. Minimum power consumption of these function blocks can be achieved by partitioning the system according to their method, assuming no other power saving methods such as clock and voltage scaling are used.

Figure 6.1: Three example clock domain partitions, from a sub-processor fine grain partition to a multi-processor coarse grain partition: (a) each processor has four clock domains; (b) each processor has one clock domain; (c) every four processors share one clock domain. The dotted lines show the clock domain partition boundaries. Analysis and results are independent of whether modules are homogeneous or heterogeneous.

Placing each processor in its own clock domain is a simple but efficient strategy. Although power savings from the simplified clock trees may not fully compensate for the overhead of the local clock generator and asynchronous communication interface, it makes each processor highly uniform and simplifies the physical design. Furthermore, as shown in Section 6.5, it provides the flexibility to adaptively scale the clock and voltage of each processor, which can achieve more than 40% power savings compared to a fully synchronous design.

6.1.2 Inter-processor network

The networking strategy between processors also strongly impacts the behavior of GALS chip multiprocessors.

Smith [132] proposes and analyzes a GALS chip multiprocessor with multiple processors and shared memory communicating through a global bus. This scheme provides very flexible communication, but places heavy demands on the global bus; thus the system performance is highly dependent on the level and intensity of the bus traffic. Furthermore, this architecture lacks scalability since increased global bus traffic will likely significantly reduce system performance under high traffic conditions or with a large number of processors.

A distributed network strategy such as a "nearest neighbor" mesh spreads the interconnections across the whole chip, so the communication load on each inter-processor link can be greatly reduced. This architecture also provides excellent physical scalability due to its tile-based structure without global wires.

Figure 6.2: Two chip multiprocessors, one using a fully synchronous style and the other using a GALS style with per-processor clock domains: (a) 6×6 chip multiprocessor; (b) single processor in the GALS system; (c) single processor in the synchronous system

6.2 Simulation platform — the GALS and non-GALS chip multiprocessors

The AsAP platform is used to illustrate principles presented in this chapter. A special feature of the design is that it provides a mode where all processors operate fully synchronously, thereby permitting an excellent testbed for a GALS versus non-GALS comparison.

Figure 6.2(b) shows a single processor in the GALS system. In the dual-clock FIFO, two synchronization registers are used in the experiments shown in Section 6.3. The synchronous processor is shown in Fig. 6.2(c); local clock oscillators are unnecessary since a global clock is provided to all processors, and the processors' FIFOs are fully synchronous in this system. The synchronous chip multiprocessor is emulated using special configurations in the chip, and uses a global clock signal without synchronization registers between the asynchronous boundaries inside the inter-processor FIFOs.

The extra circuitry for supporting the GALS clocking style—the local oscillator and the logic in the FIFOs related to dual-clock operation—occupies only about 1% of the processor area. Considering that the GALS system has a simplified clock tree, the area difference between a GALS system and a synchronous-only system is negligible.

Figure 6.3: A GALS system boundary and the timing of the synchronization delay across the boundary: (a) a simple GALS system; (b) sync. delay = clock edge alignment time (te) + sync. circuit time (ts)

6.3 Reducing and eliminating GALS performance penalties

GALS systems require synchronization circuits between clock domains to reliably transfer data. The clock phase edge alignment time for unmatched clocks and the synchronization circuitry introduce a synchronization delay, as illustrated in Fig. 6.3. The delay normally results in a reduction of performance (throughput).
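A rough, illustrative estimate of the boundary delay of Fig. 6.3 (clock edge alignment te plus synchronizer time ts) can be written down directly. The sketch below assumes a two-register synchronizer in the receiving clock domain, as used in the experiments of this section, and an average edge alignment time of half a receive-clock period for unrelated clocks; these modeling assumptions are mine, not equations from this chapter.

```python
# Rough estimate of the boundary synchronization delay in Fig. 6.3:
#   total delay = te (clock edge alignment) + ts (synchronizer registers)
# Assumptions (for illustration only): unrelated clocks, so te averages half a
# receive-clock period, and a two-register synchronizer, so ts = 2 periods.

def avg_sync_delay_ns(rx_clock_mhz, sync_registers=2):
    period_ns = 1e3 / rx_clock_mhz
    te = 0.5 * period_ns              # average wait for the next receive-clock edge
    ts = sync_registers * period_ns   # time spent in the synchronizer registers
    return te + ts

for freq in (530.0, 116.0):           # chip clock rates quoted in Chapter 5
    print(f"{freq:5.1f} MHz receive clock: "
          f"~{avg_sync_delay_ns(freq):.2f} ns average sync delay")
```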

This section discusses in depth the principles behind how GALS clocking affects system throughput and finds several key architectural features which can hide the GALS effects. Fully avoiding any GALS performance penalty is possible for the described GALS chip multiprocessor.

To simplify the discussions, in this section both GALS and synchronous systems use the same clock frequencies.

6.3.1 Related work

Significant previous research has studied the GALS uniprocessor, in which portions of each processor are located in separate clock domains. Results have shown that GALS uniprocessors experience a non-negligible performance reduction compared to a corresponding synchronous uniprocessor. Figure 6.4 shows the control hazard of a simple DLX RISC processor [135] and gives an intuitive explanation for the performance reduction of a GALS uniprocessor. The upper subplot shows the branch penalty for a normal synchronous pipeline. The lower subplot shows a GALS uniprocessor's pipeline where each stage is in its own clock domain. During taken branch instructions, the synchronous processor has a 3-cycle control hazard, while the GALS system has a penalty of 3 + 4×SYNC cycles, significantly reducing system performance. Other pipeline hazards generate similar performance impacts.

Figure 6.4: Pipeline control hazard penalties of a 5-stage synchronous uniprocessor (branch penalty of 3 cycles) and a 5-stage GALS uniprocessor (branch penalty of 3 cycles plus 4 SYNC delays)

Studies of GALS uniprocessor performance have reported reductions of 10% [136] and 7%–11% [137] compared to fully synchronous designs. Semeraro et al. found that the performance penalty of a GALS uniprocessor can be as low as 4% [48] by adjusting the synchronization logic to minimize the synchronization delay, and this performance degradation can be further reduced to 2% [138] by adding out-of-order superscalar execution features. However, minimizing the synchronization delay is generally not easy to implement, and fully avoiding the GALS performance penalty is still not achievable using this approach.

Other related work includes the performance analysis reported by Smith [132] for a GALS chip multiprocessor with a shared memory and a global bus. The GALS multiprocessor has additional latency when data are transferred between processors and the shared memory through the global bus, thus introducing a performance penalty compared to the equivalent synchronous design. The reported performance penalty was typically higher than 5%, and varies across applications due to the different loadings of the global bus.

6.3.2 Comparison of application performance: GALS versus synchronous

The applications mentioned in Section 2.3 are mapped onto and simulated with the verilog RTL model of our chip to investigate the performance of the synchronous and GALS chip multiprocessors [46]. The RTL model is cycle-accurate with respect to the fabricated chip and therefore exactly models its performance. The synchronous system uses a single global clock and has no synchronization registers at its communication boundaries. The GALS system uses a local oscillator and two synchronization registers inside each processor.

Table 6.1: Clock cycles (1/throughput) of several applications mapped onto a synchronous array processor and a GALS array processor, using 32-word FIFOs

                          Synch. array   GALS array   GALS perf.            GALS perf.
                          (cycles)       (cycles)     reduction (cycles)    reduction (%)
  8-pt DCT                41             41           0                     0%
  8×8 DCT                 498            505          7                     1.4%
  zig-zag                 168            168          0                     0%
  merge sort              254            254          0                     0%
  bubble sort             444            444          0                     0%
  matrix multiplication   817.5          819          1.5                   0.1%
  64-pt complex FFT       11439          11710        271                   2.3%
  JPEG encoder            1439           1443         4                     0.3%
  802.11 a/g TX           69700          69971        271                   0.3%

The first two columns of Table 6.1 show the computation time in clock cycles when mapping these applications onto the synchronous and GALS chip multiprocessors. The third and fourth columns list the absolute and relative performance penalty of the GALS chip multiprocessor. The performance of the GALS system is nearly the same as that of the synchronous system, with an average of less than 1% performance reduction, which is much smaller than the 10% performance reduction of a GALS uniprocessor [136, 137] or the 5% performance reduction of the GALS chip multiprocessor with a shared memory and global bus [132].

6.3.3 Analysis of the performance effects of GALS

The very small performance reduction of the GALS chip multiprocessor motivates us to understand the factors that affect performance in GALS style processors. Figure 6.5 shows the chain of events that allow synchronization circuit latency to finally affect application throughput. It is a long path, so several methods are available to hide the GALS effect. To take a closer look at the effect in each block in Fig. 6.5:

• Synchronization circuit latency is inherent in every asynchronous boundary crossing due to mismatch in clock phases and the overhead of the synchronization circuits.

• Average communication latency takes into account the fact that the synchronization circuit path is normally not active every cycle. Thus, the synchronization latency should be weighted by the fraction of the time the path is active (x% in Fig. 6.5), giving the average communication latency.

• Effective communication latency takes into account cases where the downstream processor does not consume data immediately after it is received. In other words, the average communication latency has an impact only when a program is waiting for these delayed signals, and the impact is zero otherwise.

• Application throughput is the metric of highest importance. The effective communication latency impacts application throughput only when there is communication feedback. As shown more fully in Section 6.3.3, one-way communication does not result in throughput penalties under certain conditions.

Figure 6.5: Illustration of three key latencies and application throughput in GALS multiprocessor systems

The following three subsections discuss in further detail each step from the synchronization circuit latency to the application throughput, and show methods to hide the GALS performance effect.

Table 6.2: The fraction of the time the inter-processor communication is active for each processor executing several applications; P1–P9 represent processors 1–9 respectively

                  P1    P2    P3    P4    P5    P6    P7     P8    P9    Ave.
  8-pt DCT        0.19  0.19  –     –     –     –     –      –     –     0.19
  8×8 DCT         0.12  0.12  0.12  0.12  –     –     –      –     –     0.12
  zig-zag         0.57  0.38  –     –     –     –     –      –     –     0.47
  merge sort      0.06  0.02  0.05  0.03  0.03  0.06  0.03   0.03  –     0.04
  bubble sort     0.03  0.03  0.03  0.03  0.03  0.03  0.03   0.03  –     0.03
  matrix multip.  0.11  0.18  0.26  0.31  0.33  0.03  –      –     –     0.20
  64-pt FFT       0.01  0.02  0.03  0.02  0.01  0.02  0.03   0.02  –     0.02
  JPEG encod.     0.04  0.04  0.04  0.07  0.04  0.05  0.003  0.18  0.06  0.06

Hiding synchronization circuit latency by localizing computation

An obvious but nonetheless worthwhile point is that the asynchronous boundary circuit affects performance only when signals cross it, so the effect of this circuit latency can be greatly decreased if data communication across the asynchronous boundary is reduced. This may initially sound like a difficult goal, but in fact key parameters such as the clock domain granularity and the application partitioning can easily affect this factor by more than an order of magnitude.

In many GALS systems the asynchronous boundaries have frequent traffic. For example, in a GALS uniprocessor each instruction crosses several asynchronous boundaries while flowing through the pipeline. But in a GALS chip multiprocessor, computation is much more localized at each single processor and communication through the asynchronous boundaries is less frequent and thus the effect of GALS is much lower.

Table 6.2 shows the fraction of the time the inter-processor communication path is active for several applications. The inter-processor communication is often infrequent, especially for complex applications such as the 64-point complex FFT and JPEG encoder, which show communication time percentages of only 2% and 6% respectively.
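Following the weighting described in Section 6.3.3, the average communication latency is simply the synchronization latency scaled by the fraction of time the boundary is active. The tiny sketch below applies this weighting to the average activity fractions from Table 6.2; the per-crossing synchronization latency of 2 cycles is an assumption for illustration only.

```python
# Average communication latency = sync latency weighted by how often the
# asynchronous boundary is actually used (the x% of Fig. 6.5).
# Activity fractions are the averages from Table 6.2; the per-crossing
# synchronization latency of 2 cycles is an assumed illustrative value.

SYNC_LATENCY_CYCLES = 2.0

activity = {            # average fraction of time communication is active
    "64-pt FFT": 0.02,
    "JPEG encoder": 0.06,
    "zig-zag": 0.47,
}

for app, frac in activity.items():
    avg_added = frac * SYNC_LATENCY_CYCLES
    print(f"{app:13s}: average communication latency contribution "
          f"= {avg_added:.2f} cycles")
```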

Hiding average communication latency by FIFO buffering

Not every data word transferred across an asynchronous boundary results in effective latency. As Fig. 6.6(a) shows, the GALS system has the same computational flow as the synchronous system when the FIFO is neither empty nor full.

Figure 6.6: FIFO operation when the FIFO is: (a) neither full nor empty, (b) empty, and (c) full. Correct operation requires delaying reads when the FIFO is empty, and delaying writes when the FIFO is full. Full speed operation is permitted when the FIFO is partially full.

FIFO stalls caused by full and empty conditions introduce extra latency in GALS systems.

A FIFO empty stall occurs when a processor reads an empty FIFO and must wait (stall) until data is available, as illustrated in Fig. 6.6(b). In this case, reads in GALS systems have extra delay compared to synchronous systems. A FIFO full stall occurs when a processor writes a full FIFO and must wait until there is writable space. As shown in Fig. 6.6(c), writes in GALS systems have extra delay compared to synchronous systems.

Table 6.3 shows the effective latency difference of the GALS chip multiprocessor and its corresponding synchronous system over several applications. Due to the adequately large FIFO buffer of 32 words, there are relatively few resulting FIFO stalls, and the effective latency penalty of the GALS chip multiprocessor is small with an average of less than 2%. Interestingly, this latency penalty is larger than the throughput penalty—which is less than 1% as shown in Table 6.1.

The key point is this: latency penalties do not always result in throughput penalties. This result is further explained in the following subsection.

Throughput penalty caused by latency penalty from communication loops

Reading an empty FIFO or writing a full FIFO results in extra computation latency, but does not always reduce application throughput. Generally speaking, simple one-way communication does not affect system throughput; what really matters are communication loops, in which two units wait for information from each other.

Table 6.3: Effective latency (clock cycles) of several applications mapped onto a synchronous array processor and a GALS array processor, with 32-word FIFOs

                        Synchronous array   GALS array         Difference   Difference
                        latency (cycles)    latency (cycles)   (cycles)     (%)
  8-pt DCT              78                  82                 4            5%
  8×8 DCT               996                 1003               7            0.7%
  zig-zag               221                 224                3            1.3%
  merge sort            542                 552                10           1.8%
  bubble sort           1966                1984               18           0.9%
  matrix multiplier     1705                1720               15           0.8%
  64-pt complex FFT     26741               27319              578          2.1%
  JPEG encoder          2726                2741               15           0.5%
  802.11a/g             115477              117275             1798         1.5%

Figure 6.7: System throughput in (a) a one-way communication path, where the throughput is 1/max(T1, T2, T3), and (b) a communication loop path, where the throughput is 1/(T1 + T2 + T3 + T4). For a GALS system, unit1 and unit2 are assumed to be in different clock domains and therefore the communication delays are significant. For both synchronous and GALS systems, throughput is not reduced with one-way communication (assuming the communication time is less than the computation time), but is reduced in the loop case.

In a one-way communication path as shown in Fig. 6.7(a), the system throughput is dependent on the slowest unit and is not related to the communication, assuming communication is not the slowest unit, which is true in our case. However, throughput is significantly impacted when the communication has feedback and forms a loop, as shown in Fig. 6.7(b). If unit 1 and unit 2 both need to wait for information from each other, the throughput will depend on the sum of the unit execution times and communication times. The communication time then affects the performance of both synchronous and GALS systems, but the GALS system has a larger performance penalty due to its larger communication time.
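The two throughput expressions from Fig. 6.7 can be evaluated directly; the unit execution and communication times below are arbitrary illustrative values.

```python
# Throughput expressions from Fig. 6.7 (times in clock cycles, arbitrary
# illustrative values):
#   one-way path: 1 / max(T1, T2, T3)
#   loop path:    1 / (T1 + T2 + T3 + T4)

T1, T3 = 10.0, 12.0          # computation times of unit 1 and unit 2
T2, T4 = 3.0, 3.0            # forward and feedback communication times

one_way = 1.0 / max(T1, T2, T3)
loop = 1.0 / (T1 + T2 + T3 + T4)

print(f"one-way throughput: {one_way:.4f} results/cycle (communication hidden)")
print(f"loop throughput:    {loop:.4f} results/cycle (communication adds directly)")

# A GALS boundary makes T2 and T4 a few cycles larger than in a synchronous
# design; in the one-way case this does not change max(T1, T2, T3) as long as
# communication is not the bottleneck, but in the loop case it always lowers
# the throughput.
```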

A similar conclusion can be drawn for GALS uniprocessors. The GALS overhead increases the communication latency between pipeline stages. For instructions without pipeline hazards, the GALS uniprocessor maintains the same throughput as the synchronous uniprocessor, although with larger latency, since it has only one-way communication. However, during instructions such as taken branches (where the new PC needs feedback from the execution result), a communication loop is formed, and thus the GALS style brings a throughput penalty.

Figure 6.8: A common stall communication loop exists when the data producer proc. 1 and data consumer proc. 2 are alternately busy and idle, the FIFO alternates between being empty and full, and the processors stall appropriately. The wide arrow is the direction of data transfer, and the thin arrows show how FIFO stalls are generated

In our GALS chip multiprocessor, pure FIFO-full stalls or FIFO-empty stalls alone, as in Fig. 6.6(b) and (c), generate one-way communication and have no effect on system throughput. Figure 6.8 shows the FIFO stall communication loop. Sometimes processor 1 is too slow and causes FIFO-empty stalls; sometimes processor 2 is too slow and causes FIFO-full stalls. Coexisting FIFO-full stalls and FIFO-empty stalls (obviously at different times) in a link produce a communication loop and this reduces system performance—for both synchronous and GALS systems, albeit with less of a penalty for a synchronous system.

Results in Table 6.1 show that the GALS chip multiprocessor has nearly the same performance as the synchronous chip multiprocessor. This performance reduction is much smaller than the GALS uniprocessor's reduction. This implies that the performance penalty sources—communication across asynchronous boundaries, FIFO stalls, and FIFO stall loops—occur much less frequently than pipeline hazards in a uniprocessor. These results match well with our model. In Table 6.1, the 8-pt DCT, zig-zag, merge sort, and bubble sort have no GALS performance penalties since they have only one-way FIFO stalls. The other applications have about a 1% performance penalty on average due to FIFO stall loops.

Figure 6.9: Performance of synchronous and GALS array processors with different FIFO sizes (16, 32, 64, and 128 words). The top and middle subplots show the relative performance of the synchronous and GALS systems, and the bottom subplot shows the performance ratio of GALS to synchronous.

6.3.4 Eliminating performance penalties

Increasing FIFO sizes

Increasing the FIFO size will reduce FIFO stalls as well as FIFO stall loops, and hence increase system performance and reduce the GALS performance penalty. With a sufficiently large

FIFO, there will be no FIFO-full stalls and the number of FIFO-empty stalls can also be greatly reduced; then the communication loop in Fig. 6.8 will be broken and no GALS performance penalties will result.

The top and middle subplots of Fig. 6.9 show the performance of the synchronous and

GALS systems with different FIFO sizes, respectively. Whether using a synchronous or GALS style, increasing the FIFO size increases system performance. Also, a threshold FIFO size exists above which the performance does not change. The threshold is the point at which FIFO-full stalls become non-existent because the FIFO is large enough, so increasing the FIFO size further gives no additional benefit. The threshold depends on the application as well as the mapping method. In our case, the thresholds for the 8×8 DCT and 802.11a/g are 64 words; for JPEG and bubble sort they are 32 words; and for the 8-pt DCT and merge sort they are less than or equal to 16 words.
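The behavior of these thresholds can be reproduced with a small cycle-level sketch. The Python code below is not the simulator used for Table 6.1 and Fig. 6.9; it is a simplified stand-in with hypothetical, bursty per-word processing times, used only to show how FIFO-full stalls disappear once the FIFO depth passes a threshold.

# Cycle-level sketch of a producer/consumer pair linked by one FIFO, used to
# observe how FIFO depth removes FIFO-full stalls (hypothetical workloads).

from itertools import cycle

def simulate(depth, prod_work, cons_work, boundary_latency=1, items=2000):
    """Return (total cycles, FIFO-full stalls, FIFO-empty stalls) for moving
    `items` words through one FIFO of the given depth.  `boundary_latency`
    models the extra cycles before a written word is visible to the reader
    (larger across a GALS boundary)."""
    prod_times = cycle(prod_work)       # work cycles per word, repeated
    cons_times = cycle(cons_work)
    fifo = []                           # entries: cycle at which a word becomes visible
    t = produced = consumed = 0
    full_stalls = empty_stalls = 0
    work_prod = next(prod_times)        # cycles left on the producer's current word
    work_cons = 0                       # consumer starts by waiting for a word
    while consumed < items:
        # Producer: work on the current word, then try to push it into the FIFO.
        if produced < items:
            if work_prod > 0:
                work_prod -= 1
            elif len(fifo) < depth:
                fifo.append(t + boundary_latency)
                produced += 1
                work_prod = next(prod_times)
            else:
                full_stalls += 1        # FIFO-full stall
        # Consumer: work on its current word, or try to pop the next visible one.
        if work_cons > 0:
            work_cons -= 1
        elif fifo and fifo[0] <= t:
            fifo.pop(0)
            consumed += 1
            work_cons = next(cons_times)
        else:
            empty_stalls += 1           # FIFO-empty stall
        t += 1
    return t, full_stalls, empty_stalls

if __name__ == "__main__":
    # Hypothetical rate-matched but bursty workloads: both sides average about
    # four cycles per word, but the producer alternates fast and slow words.
    producer = [1, 1, 1, 1, 11]
    consumer = [3, 3, 3, 3, 3]
    for depth in (2, 4, 8, 16, 32):
        cycles, full, empty = simulate(depth, producer, consumer, boundary_latency=2)
        print(f"depth {depth:2d}: cycles={cycles}  full stalls={full}  empty stalls={empty}")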

Figure 6.10: Examples of multiple-loop links between two processors, with FIFO-empty and FIFO-full stalls possible in both directions

The bottom subplot of Fig. 6.9 shows the performance ratio of the GALS system to the synchronous system. The ratio normally stays above 95%. When the FIFO size is increased, the ratio tends to increase due to fewer FIFO stalls and FIFO stall loops. The ratio normally reaches 1.0 at the threshold, which means the FIFO stall loops are all broken and the GALS system has the same performance as the synchronous system. The exception in the examples is the

FFT in which the GALS system always has a noticeable performance penalty of approximately 2%.

This comes from the multiple-loop links and will be explained in the following subsection.

Breaking multiple-loop links

Figure 6.10 shows the multiple-loop communication links. In this case, processor 1 and processor 2 send data to each other, and each processor has both FIFO-full stalls and FIFO-empty stalls. When the FIFOs are large enough, there will be no FIFO-full stalls, but FIFO-empty stalls can still occur; so the communication loop in Fig. 6.10 remains possible, since FIFO-empty stalls alone can generate a stall loop. In the FFT application, several processors are used for data storage (Memory) and they both send data to and receive data from computation (Butterfly) processors [139], as shown in Fig. 6.11, thus generating multiple-loop links.
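Such problematic mappings can be spotted mechanically. The short Python sketch below (a hypothetical example, not a tool from the AsAP flow) treats the FIFO links of a mapping as a directed graph and reports any cycles, which are the candidates for multiple-loop links such as the Memory/Butterfly pair.

# Sketch: flag processor groups whose FIFO links form communication loops.
# The task graph below is a hypothetical stand-in for a mapping such as the
# FFT's Memory <-> Butterfly processors; edges are directed FIFO links.

from collections import defaultdict

def find_cycles(edges):
    """Return a list of cycles (as node lists) found by DFS in a directed graph."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    cycles, stack, on_stack, visited = [], [], set(), set()

    def dfs(node):
        visited.add(node)
        stack.append(node)
        on_stack.add(node)
        for nxt in graph[node]:
            if nxt in on_stack:                       # back edge closes a loop
                cycles.append(stack[stack.index(nxt):] + [nxt])
            elif nxt not in visited:
                dfs(nxt)
        stack.pop()
        on_stack.remove(node)

    for node in list(graph):
        if node not in visited:
            dfs(node)
    return cycles

if __name__ == "__main__":
    links = [("input", "memory"), ("memory", "butterfly"),
             ("butterfly", "memory"), ("butterfly", "output")]
    for loop in find_cycles(links):
        print("communication loop:", " -> ".join(loop))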

In order to avoid any performance penalties, programmers must avoid these types of multiple-loop link implementations. For example, in the FFT case, the Memory processor and

Butterfly processor can be combined into one processor.

Figure 6.11: 64-pt complex FFT implementation (input, bit-reverse, memory, butterfly, shuffle, and output processors) containing multiple-loop links [139]

6.4 Scalability analysis of GALS chip multiprocessors

One of the key benefits of a GALS chip multiprocessor with distributed communication is its scalability: an array can be expanded simply by inserting processors into the chip. For a synchronous system, the clock tree has to be redesigned for each different processor array, and the design difficulty increases quickly with the chip size. In addition, the clock skew normally becomes larger due to the more complex clock tree and greater circuit parameter variation effects.

6.4.1 Auto generated clock trees for different sizes of chip multiprocessors

Multiple clock trees are generated using Cadence Encounter with an Artisan 0.18 µm standard cell library. An example clock tree for a single processor is shown in Fig. 6.12. It uses 37 buffers arranged in 3 levels, has 47 ps worst-case clock skew, around 555 ps delay, 120 ps buffer transition time, and 97 ps sink transition time. The clock tree delay is the time from the clock root to the registers, and the buffer and sink transition times are the signal rise times at the inserted buffers and the clocked registers, respectively. The constrained parameters for the clock tree include:

50 ps clock skew, 2000 ps delay, and 120 ps buffer and sink transition times.

As the number of processors increases, the synchronous global clock tree becomes more complex and therefore more difficult to design. Table 6.4 lists several key parameters of clock trees for arrays made up of different numbers of processors. In general, all listed parameters tend to increase as the number of processors increases. However, in some cases, the parameters do not increase monotonically. This is due to the complex nature of the clock tree optimization problem, which is affected by many discrete parameters (e.g., there can be only an integer number of buffer levels), and also by the time-limited, non-optimal Cadence Encounter optimization runs.

Figure 6.12: An example clock tree for a single processor, showing the buffer levels from the clock root to the sinks, the buffer and sink transition times, and the total tree delay

For example, the 10 × 10 array has lower skew than the 9 × 9 array, but it also has significantly higher buffer and sink transition times, which likely come from the increased buffer loads, since both arrays have 13 levels of buffers.

Different design methods may yield results different from our experiments. Using full custom design and deskewing techniques [140] can certainly generate better results, but will dramatically increase the design effort and may also increase the area and power consumption. For example, active deskewing and scan-chain adjustments used for the Itanium [141] produced a clock tree with skew less than 10 ps, but it consumed around 25 W for the clock generation and distribution circuits. Whichever methods are used, the trend that larger chip sizes result in more complex clock trees and larger clock skew should always hold; furthermore, the impact of circuit parameter variations is also increasing.

Table 6.4: Data for globally synchronous clock trees with different numbers of processors in the array

Processor    Num of    Buffer       Sink         Total     Max.    Total tree
array size   buffer    trans. time  trans. time  num of    skew    delay
             levels    (ps)         (ps)         buffers   (ps)    (ps)
1 × 1        3         120          97           37        47      555
2 × 2        6         120          115          157       49      1081
3 × 3        8         120          117          320       49      1150
4 × 4        10        121          133          633       59      1400
5 × 5        9         177          123          787       70      1700
6 × 6        10        174          143          1097      72      1800
7 × 7        12        170          171          1569      85      1900
8 × 8        14        251          119          1992      92      2100
9 × 9        13        204          119          3012      103     2200
10 × 10      13        228          141          3388      97      2300
11 × 11      15        169          120          3762      116     2450

6.4.2 The effect of clock tree on system performance

The clock period of modern processors, expressed in fanout-of-4 (FO4) delays, normally ranges between 10 and 20 [10], and is determined by the pipelining of the processor as well as clock distribution factors such as clock skew and jitter. Each FO4 delay in 0.18 µm is about 65 ps.

Assuming the maximum logic delay (including register setup time and clock-to-Q time) within one clock period is 1000 ps, which is 15 FO4 delays and is in the range of modern high performance processor designs [4], the effect of clock tree design on system performance can be investigated. For the example of a single processor with 47 ps of clock skew, the smallest clock period (fastest clock speed) available will be 1047 ps. The relative system peak performance for different processor arrays is shown in Fig. 6.13. The peak performance of a GALS chip multiprocessor increases linearly with the number of processors, and the synchronous chip multiprocessor behaves similarly when the number of processors is not very large. For example, when containing 49 processors

(around 33 mm2 in 0.18 µm technology), the performance of the synchronous array processor is

96.5% of that of the GALS array processor. The performance of the synchronous system increases more slowly when the number of processors is larger, since its clock skew increases along with the chip size.

When it contains 121 processors (around 80 mm2 in 0.18 µm technology), the performance of the synchronous array processor is reduced to 93.8% of that of the GALS array processor.
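These ratios follow directly from the skew data of Table 6.4 and the 1000 ps (roughly 15 FO4) logic delay assumed above. The minimal Python sketch below reproduces the 96.5% and 93.8% figures; it is only a back-of-the-envelope model, not the full analysis.

# Sketch of the peak-performance comparison in Fig. 6.13, using the measured
# clock skews of Table 6.4 and the 1000 ps (about 15 FO4) logic delay assumed above.

LOGIC_DELAY_PS = 1000.0
SINGLE_PROC_SKEW_PS = 47.0     # per-processor clock tree (GALS case), Table 6.4

# array side -> global clock skew in ps, from Table 6.4
GLOBAL_SKEW_PS = {1: 47, 2: 49, 3: 49, 4: 59, 5: 70, 6: 72,
                  7: 85, 8: 92, 9: 103, 10: 97, 11: 116}

def peak_perf(num_procs, skew_ps):
    """Relative peak performance: number of processors divided by clock period."""
    return num_procs / (LOGIC_DELAY_PS + skew_ps)

if __name__ == "__main__":
    for side in (7, 11):                       # 49 and 121 processors
        n = side * side
        gals = peak_perf(n, SINGLE_PROC_SKEW_PS)
        sync = peak_perf(n, GLOBAL_SKEW_PS[side])
        print(f"{n:3d} processors: sync/GALS = {sync / gals:.3f}")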

Figure 6.13: The peak performance of GALS processor arrays scales linearly with the number of processors, and almost linearly for synchronous array processors, assuming a clock period of 15 FO4 delays of logic plus clock skew only

The globally synchronous clock becomes even less attractive when system parameter variations, which are not included in the Encounter tool simulation, are taken into account. Parameter variations have an increasing impact on system performance as technology advances. The main parameter variations include process variation, voltage variation, and temperature variation [18]. These variations affect circuit delay and hence clock tree uncertainty. Different clock trees are affected differently by parameter variations; for example, clock trees with fewer inserted buffers [142] and fewer fanout loads [143] are less affected.

Supply voltage variation is one of the most important parameter variations. The simulation result from Encounter shows that the peak voltage drop is 0.041 V for our single processor. The voltage drop will be larger as the chip size increases, so the case in which the supply is reduced from 1.8 V to 1.7 V due to voltage variation is investigated. Figure 6.14 shows the increased clock tree delay at the reduced supply voltage for different clock trees, obtained by varying the number of clock tree stages from 3 to 10 and the clock buffer loads from 4 FO4 to 7 FO4. As shown in Fig. 6.14, the increased clock delay is around 5.5% of the original clock tree delay. This means the voltage drop (or variation) can increase the clock skew by 5.5% of the clock delay.

M. Hashimoto et al. [143] show that voltage variation accounts for approximately 40% of the parameter variation effects among the primary sources: voltage variation, transistor variation, and temperature variation.


Figure 6.14: Increased clock tree delay from reducing the supply voltage from 1.8 V to 1.7 V for different clock trees with varying total delays (the increase is approximately 0.055 times the tree delay at 1.8 V). Data are from transistor-level SPICE circuit simulations. The absolute clock skew (in ps) increases dramatically for larger clock trees.

According to the voltage variation effect discussed above, it is assumed that all parameter variations together bring an extra clock skew equal to 5.5%/0.4 = 13.75% of the clock delay. Figure 6.15 shows the performance scaling of the GALS and synchronous chip multiprocessors with the number of processors, including the parameter variation effect. It clearly shows that the GALS chip multiprocessor can achieve much better performance than the synchronous chip multiprocessor. When containing 49 processors, the performance of the synchronous multiprocessor is 82.7% of that of the GALS multiprocessor. This performance gap becomes larger as the number of processors increases. When containing 121 processors, the performance of the synchronous multiprocessor is only 76.8% of that of the GALS multiprocessor.
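Under the same back-of-the-envelope model as before, these figures can be approximated by adding a variation-induced skew equal to 13.75% of each clock tree's total delay (Table 6.4). The sketch below reproduces the reported ratios only to within roughly a percentage point; the dissertation's exact accounting may differ slightly.

# Extension of the previous sketch: add a variation-induced skew equal to
# 13.75% of each clock tree's total delay (Table 6.4), as assumed in the text.

LOGIC_DELAY_PS = 1000.0
VARIATION_FRACTION = 0.1375          # 5.5% / 0.4, from the discussion above

# label -> (static skew in ps, total tree delay in ps), from Table 6.4
TREES = {"GALS (1x1)": (47, 555), "sync 7x7": (85, 1900), "sync 11x11": (116, 2450)}

def clock_period(skew_ps, tree_delay_ps):
    """Clock period = logic delay + static skew + variation-induced skew."""
    return LOGIC_DELAY_PS + skew_ps + VARIATION_FRACTION * tree_delay_ps

if __name__ == "__main__":
    gals_period = clock_period(*TREES["GALS (1x1)"])
    for label in ("sync 7x7", "sync 11x11"):
        ratio = gals_period / clock_period(*TREES[label])
        print(f"{label}: sync/GALS performance ratio = {ratio:.3f}")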

6.5 Energy efficiency analysis of adaptive clock frequency scaling

The GALS clocking style provides opportunities for using adaptive clock and voltage scaling for system submodules [144], and several researchers have reported its high energy efficiency for GALS uniprocessors [136, 48, 137, 133]. The computation load in a GALS chip multiprocessor can be quite unbalanced, which potentially increases the energy efficiency benefit when using clock and voltage scaling for each processor. It is also found that it is possible to scale down the clock frequencies of some processors without reducing system performance at all.


Figure 6.15: The peak performance of GALS and synchronous array processors, where the clock period calculation includes: 1) the 15 FO4 logic delay, 2) static clock skew, and 3) clock skew equal to 13% of the total clock tree delay due to variations, as shown in Fig. 6.14

Some other clocking styles, such as the rationally-related clocking used by Synchroscalar [98], can have similar benefits. While that approach avoids the extra hardware for asynchronous communication, it requires extra circuitry such as PLLs to guarantee the same clock phases between different clock domains, and changing the clock frequency can take hundreds of clock cycles.

6.5.1 Related work—adaptive clock scaling of the GALS uniprocessor

Figure 6.16 shows an example of a GALS uniprocessor implementation to illustrate the concept of adaptive clock scaling in GALS systems, which achieves high energy efficiency by reducing the clock frequency and voltage of modules that are less heavily used. In the figure, the frequency of the Mem module clock clk4 is reduced when executing the first code block since it has few Mem instructions. The scheme can be called static clock/voltage scaling if the clock/voltage is not changed at runtime, which is the method addressed in this work. In order to obtain greater power savings, dynamic clock/voltage scaling can be used, where the clock and voltage are changed

Figure 6.16: Clock scaling in a GALS uniprocessor: the IF, ID, EXE, MEM, and WB stages are separated by synchronization circuits and driven by clocks clk1 through clk5; clk4 can be reduced for a code block with few MEM instructions, and clk5 for a code block with few WB instructions

dynamically at runtime along with the program execution; the reduced clock domain can then become clk5 when executing the second code block, since it has few WB instructions. Unfortunately, reducing the clock of some modules in a uniprocessor reduces system performance. The static scaling method reduces energy dissipation by approximately 16% with an approximately 18% reduction in performance [136]. The dynamic scaling method achieves 20%–25% energy savings along with a 10%–15% performance reduction [48, 137].

6.5.2 Unbalanced processor computation loads increase power savings potential

Traditional parallel programming methods normally seek to balance the computational loads of different processors. When using adaptive clocking methods, on the other hand, unbalanced computational loads are no longer a problem; in fact they give more opportunities to reduce the clock frequency and supply voltage of some processors to achieve further power savings without degrading system performance. Relaxing the constraint of a balanced computational load also enables the designer to explore wider variations in other parameters such as program size, local data memory size, and communication methods.

Figure 6.17 shows the unbalanced computational load among processors when mapping our applications onto the chip multiprocessor. The computational load difference between processors in complex applications such as the JPEG encoder and the 802.11a/g transmitter can be more than a factor of 10.

Figure 6.17: Relative computational load of each processor in the nine applications (8-pt DCT, 8×8 DCT, zig-zag, merge sort, bubble sort, matrix, 64-pt FFT, JPEG, and 802.11a/g), illustrating the unbalanced loads

6.5.3 Finding the optimal clock frequency—computational load and position

When using adaptive clock/voltage scaling, system performance is normally reduced further; but in the GALS multiprocessor it is possible to scale the clock frequencies of some processors while still maintaining the same performance. The optimal clock frequency for each processor depends strongly on its computational load, and also on its position and relationship with respect to the other processors.

Figure 6.18 shows the system throughput versus the clock frequencies of the four processors in the 8×8 DCT, whose implementation is shown in Fig. 2.17. The computational loads of the four processors are 408, 204, 408, and 204 clock cycles, respectively. The throughput changes much more slowly with the scaling of the 2nd and 4th processors than with the scaling of the 1st and 3rd processors.

This illustrates the clear point that a processor with a light computational load is more likely to maintain its performance with a reduced clock frequency. Somewhat counterintuitively, however, although the 2nd and 4th processors have the same light computational load, the throughput changes much more slowly with the 4th processor's scaling than with the 2nd processor's scaling. Minimal power consumption is achieved with full throughput when the relative clock frequencies are 100%, 95%, 100%, and 57% of full speed, respectively.

Figure 6.18: Throughput changes with statically configured processor clocks for the 4-processor 8×8 DCT application, with one curve for scaling each of the four processors; the optimal clock scaling points are marked

Figure 6.19: Relationship of the processors in the 4-processor 8×8 DCT application (1-DCT, Trans., 1-DCT, Trans.), illustrating the differing scaling possibilities for the 2nd and 4th processors


The reason for the different behavior of the 2nd and 4th processors comes from their different positions and FIFO stall styles as shown in Fig. 6.19. The 2nd processor has FIFO-empty stalls when it fetches data too fast from the 1st processor, and it has FIFO-full stalls when it sends

data too fast to the 3rd processor. The 4th processor has only FIFO-empty stalls.
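A load-only estimate makes this asymmetry explicit. The short sketch below (using only the cycle counts quoted above) computes the naive lower bound on each processor's relative clock frequency; the fact that the measured optimum for the 2nd processor is 95% rather than 50% is exactly the position and FIFO-stall effect just described.

# Naive load-only estimate of how far each processor's clock could be scaled,
# for the 4-processor 8x8 DCT loads given in the text (408, 204, 408, 204 cycles).
# It ignores position/FIFO effects, which is why it disagrees with the measured
# optimum (100%, 95%, 100%, 57%) for the 2nd processor.

loads = [408, 204, 408, 204]            # clock cycles per block, from the text
max_load = max(loads)

for i, load in enumerate(loads, start=1):
    lower_bound = load / max_load       # slowest clock that load alone would allow
    print(f"processor {i}: load-only lower bound = {lower_bound:.0%}")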

6.5.4 Power reduction of clock/voltage scaling

Reducing the clock frequency allows a reduction in supply voltage to obtain further power savings. The relationship between clock frequency, voltage, and power has become much more complex in the deep submicron regime because of other parameters such as leakage power.


Figure 6.20: A processor's power consumption versus clock frequency when the supply voltage is the minimum possible voltage for that clock speed [145]

A model derived from measured data from a 0.18 µm technology [145] is used to estimate power consumption. Figure 6.20 shows the relationship between clock frequency and the corresponding power consumption after clock/voltage scaling is applied.
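For a rough feel for how such a model is applied, the Python sketch below substitutes a deliberately simple first-order model (supply voltage assumed to scale roughly linearly with frequency down to an assumed floor, dynamic power proportional to f·V²) in place of the measured-data model of [145], and applies it to the 8×8 DCT clock frequencies found in Section 6.5.3. The printed number is illustrative only and is not a result from this work.

# Illustrative stand-in for the frequency/voltage/power model of [145]: the real
# model is derived from measured 0.18 um data; here voltage is assumed to scale
# roughly linearly with frequency down to a floor, and dynamic power ~ f * V^2.

V_NOM = 1.8      # nominal supply (V)
V_MIN = 0.9      # assumed lowest usable supply (hypothetical)

def min_voltage(rel_freq):
    """Minimum supply for a given relative clock frequency (simple linear model)."""
    return max(V_MIN, V_NOM * rel_freq)

def relative_power(rel_freq):
    """Power relative to running at full frequency and nominal voltage."""
    v = min_voltage(rel_freq)
    return rel_freq * (v / V_NOM) ** 2

if __name__ == "__main__":
    # Optimal relative frequencies for the 4-processor 8x8 DCT from Section 6.5.3.
    freqs = [1.00, 0.95, 1.00, 0.57]
    chip = sum(relative_power(f) for f in freqs) / len(freqs)
    print(f"estimated relative chip power: {chip:.2f}")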

Using the optimal clock frequency for each processor as described in Section 6.5.3 together with the frequency-power model, the relative power consumption of the GALS multiprocessor with static clock frequency and supply voltage scaling is estimated, relative to the synchronous multiprocessor, for several applications. The result is shown in Fig. 6.21. The GALS system achieves an average power savings of approximately 40% without affecting performance. This power savings is much higher than that of the GALS uniprocessor, which is reported to save approximately

25% energy when operating with a performance reduction of more than 10% [136, 48, 137].

6.6 Summary

This chapter shows that the application throughput reduction of the GALS style comes from asynchronous boundary communication and communication loops, and that it is possible to design GALS multiprocessors without this performance penalty. A key advantage of the GALS chip multiprocessor with distributed interconnection over other GALS systems is that its asynchronous boundary communication and communication loops occur far less frequently, and therefore the performance penalty is significantly lower.


Figure 6.21: Relative power over several applications for the GALS array processor with static clock and supply voltage scaling compared to a synchronous array processor

The proposed GALS array processor has a throughput penalty of less than 1% over a variety of DSP workloads, and even this small penalty can be avoided with sufficiently large FIFOs and by programming without multiple-loop communication links.

Local clock oscillators in GALS multiprocessors simplify the clock tree design and enable nearly perfect system scalability. More processors can be inserted into the chip without any clock tree redesign. As the number of processors increases, the clock skew in the synchronous multiprocessor system increases quickly due to the more complex clock trees and parameter variations.

When containing 121 processors (around 80 mm2 in 0.18 µm technology), the peak performance of

the GALS multiprocessor can be approximately 20% higher than the synchronous multiprocessor.

Unbalanced computational loads in chip multiprocessors increase the opportunity for

independent clock frequency and voltage scaling to achieve significant power consumption savings.

The GALS chip multiprocessor can achieve around 40% power savings without any performance

penalty over a variety of DSP workloads using static clock and voltage scaling for each processor.

These results compare well with a reported 25% energy reduction and 10% performance reduction of GALS uniprocessors for a variety of general purpose applications.

Data presented in this chapter are based on the fabricated AsAP chip. Results from this work apply to systems with the three key features discussed in Section 6.1, namely: 1) multi-core processors (homogeneous or heterogeneous) operating in independent clock domains; 2) source synchronous flow control for asynchronous boundary communication; and 3) distributed interconnect, such as a mesh. While results certainly vary over different applications and specific architectures, systems with these features should still exhibit good scalability, small performance reductions due to asynchronous communication overhead, and large power reductions due to adaptive clock/voltage scaling over many workloads.

Chapter 7

Conclusion

This dissertation discusses the architecture design, physical design, and feature analysis of a scalable programmable processor array for DSP applications.

The processor features a multi-core architecture, simple single-issue cores with small memories, a GALS clocking style, and a mesh interconnection network. These and other features make the processor array highly energy-efficient and capable of high throughput, well-suited to future fabrication technologies, and well-suited to the computation of complex multi-task DSP workloads. It is promising that such systems can compete with traditional DSP processors, obtaining roughly 10 times higher performance with 10 times less power dissipation in a 10 times smaller area.

GALS multi-core systems improve system scalability and simplify the physical design flow. At the same time, they impose some design challenges. These include several timing issues related to inter-processor communication, inter-chip communication, and asynchronous boundaries within single processors. By carefully addressing these timing issues, it is possible to take full advantage of the scalability, and the processor architecture makes it possible to design a high performance system with a small design group within a short time period.

An asymmetric inter-processor communication architecture, which uses more buffer resources for nearest neighbor connections and fewer buffer resources for long distance connections, can save 2 to 4 times the area compared to a traditional NoC while maintaining similar routing capability. Extending the source synchronization method can support long distance GALS communication. Several design options are explored. Inserting two ports (buffers) for the processing core, and using two or three links at each edge, achieves good area/performance trade-offs for multi-core systems organized from simple single-issue processors; the optimal number of links is expected to increase if the chip is organized from larger processors.

The application throughput reduction of the GALS style comes from asynchronous boundary communication and communication loops. A key advantage of GALS multi-core systems with distributed interconnection over other GALS systems is that their asynchronous boundary communication and communication loops occur far less frequently, and therefore the performance penalty is significantly lower. The proposed GALS array processor has a throughput penalty of less than 1% over a variety of DSP workloads, and even this small penalty can be avoided with large enough FIFOs and by programming without multiple-loop links. Furthermore, the GALS array can achieve around 40% power savings without any performance penalty over a variety of DSP workloads using static clock and voltage scaling for each processor. GALS multi-core systems have nearly perfect system scalability. When containing 121 processors (around

80 mm2 in 0.18 µm technology), the peak performance of the GALS multiprocessor can be approximately 20% higher than that of the synchronous multiprocessor, whose clock skew increases quickly due to its more complex clock tree and parameter variations.

7.1 Future work

There are quite a few interesting research topics on many-core processors that are worth further investigation.

Many of them are already implemented in our second version of the AsAP chip. Examples include: supporting static/dynamic clock/voltage scaling for each individual processor using hardware and software; implementing the asymmetric double-link interconnect architecture proposed in Chapter 3; adding large shared memory modules to support applications requiring more memory than AsAP 1 can provide [42]; and adding hardware accelerators such as an FFT, a Viterbi decoder, and H.264 motion estimation to broaden the application domain of the AsAP platform.

Below are some of the topics I foresee for many-core processor research over the next few years.

• The design of the optimal single processor. The AsAP project found that an array organized from simple processors can achieve good performance and energy efficiency for many DSP applications. Answering the question of what the optimal parameters for the single processor are, such as the pipeline depth, instruction issue width, and instruction set, requires more quantitative investigation. In addition, the answer is expected to differ for different application domains.

• On-chip interconnects. In the AsAP project it was found that a simple static interconnect architecture is efficient for many DSP applications. For applications requiring complex communication, novel interconnection architectures are needed to achieve better trade-offs between flexibility and cost.

• Clocking. GALS clocking, which is used in AsAP, is a promising clocking strategy for many-core processors. In the future, it is worthwhile to investigate fully asynchronous clocking styles, which have the potential to achieve higher speed and lower power consumption, although they are not yet a mature solution.

• Power distribution. How to distribute the power supply efficiently will be an important design issue in the future, due to larger chip sizes and lower supply voltages. It is important not only for many-core processors, but also for other chips.

• Fault-tolerant computation. With larger chip sizes and lower supply voltages, the probability of failures in chips will increase significantly, and fault-tolerant computation is expected to become much more important. Many-core processors can be a good platform for fault-tolerant computation, since they provide plenty of duplicated processors and can use some of them for failure checking and recovery.

Glossary

AsAP For Asynchronous Array of simple Processors. A parallel DSP processor consisting of a

2-dimensional mesh array of very simple CPUs clocked independently of each other.

BIPS For billion instructions per second, a measure used to indicate processor performance.

CMP For Chip Multi-processor, a computer architecture which integrates multiple processors into

a single chip to improve processor performance.

CPI For Cycles-per-instruction. Normally the CPI of a pipelined processor is larger than 1 due to pipeline hazards or cache misses.

DCT For Discrete Cosine Transform, used to transform a signal or image from the spatial domain to the frequency domain.

dual-clock FIFO A First-In First-Out queue with a read clock and a write clock, used to synchronize data across a clock boundary.

DSP For digital signal processing, or a processor designed for digital signal processing.

FFT For Fast Fourier transform, an efficient algorithm to compute the discrete Fourier transform

and its inverse.

FO4 For Fanout-of-4. A method of expressing circuit delay in terms of the delay of an inverter driving a load of four inverters.

GALS For Globally Asynchronous Locally Synchronous. A design methodology in which major

design blocks are synchronous, but interface to other blocks asynchronously.

H.264 A standard for video compression. It is also known as MPEG-4 part 10.

JPEG For Joint Photographic Experts Group, a commonly used standard method of compression for photographic images.

Mbps For Megabit per second, a unit of data transfer rate.

MIMD For Multiple instruction multiple data, a parallel computer architecture where different

instructions with different data can be executed at the same time.

MMX For MultiMedia extension, one of the SIMD techniques introduced by Intel in the 1990s in their Pentium microprocessors.

MTTF For Mean Time To Failure, a common measure of reliability.

NoC For Network on Chip. An on-chip communication architecture in which modules in the chip communicate through switches/routers, as in a network.

SIMD For Single Instruction, Multiple Data. A data parallelism technique where a single instruction operates on multiple data elements in parallel.

VLIW For Very Long Instruction Word, a computer architecture which fetches multiple independent instructions in the same clock cycle and executes them in parallel to improve system performance.

Viterbi decoder An algorithm to decode a bitstream that has been encoded using forward error

correction based on a convolutional code, developed by Andrew J. Viterbi in 1967.

Related publications

1. Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari, Michael Lai, Jeremy Webb,

Eric Work, Tinoosh Mohsenin, Mandeep Singh, Bevan Baas, An Asynchronous Array of

Simple Processors for DSP Applications, IEEE International Solid-State Circuits Conference

(ISSCC), February 2006, pp:428-429.

2. Bevan Baas, Zhiyi Yu, Michael Meeuwsen, Omar Sattari, Ryan Apperson, Eric Work, Jeremy

Webb, Michael Lai, Tinoosh Mohsenin, Dean Truong, Jason Cheung, AsAP: A Fine-grain

Multi-core Platform for DSP Applications, IEEE Micro, March/April 2007, pp:34-45.

3. Ryan Apperson, Zhiyi Yu, Michael Meeuwsen, Tinoosh Mohsenin, Bevan Baas, A Scalable

Dual-Clock FIFO for Data Transfers between Arbitrary and Haltable Clock Domains, IEEE

Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI), October 2007, pp:1125-1134.

4. Michael Meeuwsen, Zhiyi Yu, Bevan Baas, A Shared Memory Module for Asynchronous

Arrays of Processors, EURASIP Journal on Embedded Systems, vol. 2007, Article ID 86273,

13 pages, 2007.

5. Zhiyi Yu, Bevan Baas, Implementing Tile-based Chip Multiprocessors with GALS Clocking

Styles, IEEE International Conference of Computer Design (ICCD), October 2006, pp:174-

180.

6. Bevan Baas, Zhiyi Yu, Michael Meeuwsen, Omar Sattari, Ryan Apperson, Eric Work, Jeremy

Webb, Michael Lai, Daniel Gurman, Chi Chen, Jason Cheung, Dean Truong, Tinoosh Mohs-

enin, Hardware and Applications of AsAP: An Asynchronous Array of Simple Processors,

IEEE HotChips Symposium on High-Performance Chips (HotChips), August 2006.

7. Zhiyi Yu, Bevan Baas, Performance and Power Analysis of Globally Asynchronous Locally

Synchronous multi-processor systems, IEEE Computer Society Annual Symposium on VLSI

(ISVLSI), March 2006, pp:378-384.

Bibliography

[1] N. Zhang and R. Brodersen. The cost of flexibility in systems on a chip design for signal processing applications. In UCB EEC225C.

[2] J. Wawrzynek. Reconfigurable computing. In UCB CS294-3, January 2004.

[3] M. S. Hrishikesh, N. P. Jouppi, K. I. Farkas, D. Burger, S. W. Keckler, and P. Shivakumar. The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays. In International Symposium on Computer Architecture (ISCA), pages 14–24, May 2002.

[4] B. Flachs, S. Asano, S. H. Dhong, P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Liberty, B. Michael, H. Oh, S. M. Mueller, O. Takahashi, A. Hatakeyama, Y. Watanabe, and N. Yano. A streaming processing unit for a CELL processor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 134–135, February 2005.

[5] A. Harstein and T.R. Puzak. Optimum power/performance pipeline depth. In IEEE interna- tional Symposium on Microarchitecture (MICRO), pages 117–125, December 2003.

[6] G. E. Moore. Cramming more components onto integrated circuits. Electronics Magazine, April 1965.

[7] S. Agarwala, T. Anderson, A. Hill, M. D. Ales, R. Damodaran, P. Wiley, S. Mullinnix, J. Leach, A. Lell, M. Gill, A. Rajagopal, A. Chachad, M. Agarwala, J. Apostol, M. Kr- ishnan, D. Bui, Q. An, N. S. Nagaraj, T. Wolf, and T. T. Elappuparackal. A 600-MHz VLIW DSP. IEEE Journal of Solid-State Circuits (JSSC), pages 1532–1544, November 2002.

[8] R.P. Preston, R.W. Badeau, D.W. Balley, S.L. Bell, L.L. Biro, W.J. Bowhill, D.E. Dever, S. Felix, R. Gammack, V. Germini, M.K. Gowan, P. Gronowshi, D.B. Jankson and S. Mehta, S.V. Morton, J.D. Pickholtz, M.H. Reilly, and M.J. Smith. Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In IEEE International Solid-State Circuits Conference (ISSCC), pages 266–267, February 2002.

[9] K. Roy, S. Mukhopadyay, and H. Mahmoodi-meimand. Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits. Proceedings of the IEEE, pages 305–327, February 2003.

[10] M. Horowitz and W. Dally. How scaling will change processor architecture. In IEEE Inter- national Solid-State Circuits Conference (ISSCC), pages 132–133, February 2004.

[11] S. Borkar. Low power design challenges for the decade. In Asia and South Pacific Design Automation Conference (ASP-DAC), pages 293–296, 2001.

[12] J. Stinson and S. Rusu. A 1.5 GHz third generation Itanium processor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 252–253, February 2003.

[13] S. Naffziger, T. Grutkowski, and B. Stackhouse. The implementation of a 2-core multi- threaded Itanium family processor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 182–183, 592, February 2005.

[14] S. Rusu, S. Tam, H. Muljono, D. Ayers, and J. Chang. A dual-core multi-threaded Xeon pro- cessor with 16MB L3 cache. In IEEE International Solid-State Circuits Conference (ISSCC), pages 102–103, February 2006.

[15] H. D. Man. Ambient intelligence: Gigascale dreams and nanoscale realities. In IEEE Inter- national Solid-State Circuits Conference (ISSCC), pages 29–35, February 2004.

[16] R. Ho, K. W. Mai, and M. A. Horowitz. The future of wires. Proceedings of the IEEE, pages 490–504, April 2001.

[17] International Roadmap Committee. International technology roadmap for semiconductors, 2005 edition. Technical report, ITRS, 2005. http://public.itrs.net/.

[18] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Parameter variations and impact on circuits and microarchitecture. In IEEE International Conference on Design Automation (DAC), pages 338–342, June 2003.

[19] K. Asanovic, R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson, W.L. Plishker, J. Shalf, S. W. Williams, and K.A. Yelick. The landscape of parallel comput- ing research: A view from berkeley. Technical Report UCB/EECS-2006-183, University of California, Berkeley, 2006.

[20] S. Kaneko, K. Sawai, N. Masui, et al. A 600 MHz single-chip multiprocessor with 4.8GB/s internal shared pipelined bus and 512kB internal memory. In IEEE International Solid-State Circuits Conference (ISSCC), pages 254–255, February 2003.

[21] T. Takayanagi, J. Shin, B. Petrick, J. Su, and A. Leon. A dual-core 64b ultraSPARC micropro- cessor for dense server applications. In IEEE International Solid-State Circuits Conference (ISSCC), February 2004.

[22] T. Shiota, K. Kawasaki, Y. Kawabe, W. Shibamoto, A. Sato, T. Hashimoto, F. Hayakawa, S. Tago, H. Okano, Y. Nakamura, H. Miyake, A. Suga, and H. Takahashi. A 51.2 GOPS 1.0 Gb/s-DMA single-chip multi-processor integrating quadruple 8-way VLIW processors. In IEEE International Solid-State Circuits Conference (ISSCC), pages 194–195, February 2005.

[23] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa. The design and implementation of a first-generation CELL processor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 184–185, February 2005.

[24] J. Hart, S. Choe, L. Cheng, C. Chou, A. Dixit, K. Ho, J. Hsu, K. Lee, and J. Wu. Implementation of a 4th-generation 1.8GHz dual-core SPARC v9 microprocessor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 186–187, February 2005.

[25] A. Bright, M. Ellavsky, A. Gara, R. Haring, G. Kopcsay, R. Lembach, J. Marcella, M. Ohmacht, and V. Salapura. Creating the BlueGene/L supercomputer from low-power SoC ASICs. In IEEE International Solid-State Circuits Conference (ISSCC), pages 188–189, February 2005.

[26] A. Leon, J. Shin, K. Tam, W. Bryg, F. Schumachier, P. Kongetira, D. Weisner, and A. Strong. A power-efficient high-throughput 32-thread SPARC processor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 98–99, February 2006.

[27] M. Golden, S. Arekapudi, G. Dabney, M. Haertel, S. Hale, L. Herlinger, Y. Kim, K. McGrath, V. Palisetti, and M. Singh. A 2.6GHz dual-core 64b x86 microprocessor with DDR2 memory support. In IEEE International Solid-State Circuits Conference (ISSCC), February 2006.

[28] E. Cohen, N. Rohrer, P. Sandon, M. Canada, C. Lichtenau, M. Ringler, P. Kartschoke adn R. Floyd, J. Heaslip, M. Ross, T. Pflueger, R. Hilgendorf, P. McCormick, G. Salem, J. Connor, S. Geissler, and D. Thygesen. A 64b cpu pair: Dual- and single-processor chips. pages 333– 334, February 2006.

[29] J. Friedrich, B. McCredie, N. James, et al. Design of the POWER6TM microprocessor. In IEEE International Solid-State Circuits Conference (ISSCC), February 2007.

[30] Y. Yoshida, T. Kamei, K. Hayase, et al. A 4320MPIS four-processor core SMP/AMP with individually managed clock frequency for low power consumption. In IEEE International Solid-State Circuits Conference (ISSCC), February 2007.

[31] U.G. Nawathe, N. Hassan, L. Warriner, K. Yen, B. Upputuri, D. Greenhill, A. Kumar, and H. Park. An 8-core 64-thread 64b power-efficient SPARC SoC. In IEEE International Solid- State Circuits Conference (ISSCC), pages 108–109, February 2007.

[32] Z. Chen, P. Ananthanarayanan, S. Biswas, et al. A 25W SoC with dual 2GHz PowerTM cores and integrated memory and I/O subsystems. In IEEE International Solid-State Circuits Conference (ISSCC), February 2007.

[33] J. Dorsey, S. Searles, M. Ciraula, et al. An integrated quad-core OpteronTM processor. In IEEE International Solid-State Circuits Conference (ISSCC), February 2007.

[34] N. Sakran, M. Yuffe, M. Mehalel, J. Doweck, E. Knoll, and A. Kovacs. Implementation of the 65nm dual-core 64b Merom processor. In IEEE International Solid-State Circuits Conference (ISSCC), February 2007.

[35] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. John- son, W. Lee, A. Saraf, N. Shnidman, V. Strumpen, S. Amarasinghe, and A. Agarwal. A 16- issue multiple-program-counter microprocessor with point-to-point scalar operand network. In IEEE International Solid-State Circuits Conference (ISSCC), pages 170–171, February 2003.

[36] Z. Yu, M. Meeuwsen, R. Apperson, O. Sattari, M. Lai, J. Webb, E. Work, T. Mohsenin, M. Singh, and B. Baas. An asynchronous array of simple processors for DSP applications. In IEEE International Solid-State Circuits Conference (ISSCC), pages 428–429, February 2006.

[37] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Lyer, A. Singh, T. Jacb, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar. An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS. In IEEE International Solid-State Circuits Conference (ISSCC), pages 98–99, February 2007.

[38] B. Baas, Z. Yu, M. Meeuwsen, O. Sattari, R. Apperson, E. Work, J. Webb, M. Lai, D. Gurman, C. Chen, J. Cheung, D. Truong, and T. Mohsenin. Hardware and applications of AsAP: An asynchronous array of simple processors. In Hotchips, August 2006.

[39] B. Baas, Z. Yu, M. Meeuwsen, O. Sattari, R. Apperson, E. Work, J. Webb, M. Lai, T. Mohsenin, D. Truong, and J. Cheung. AsAP: A fine-grain multi-core platform for DSP applications. IEEE Micro, pages 34–45, March/April 2007.

[40] D. M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis, Stanford University, Stanford, CA, October 1984.

[41] Ryan W. Apperson, Z. Yu, M. Meeuwsen, T. Mohsenin, and B. Baas. A scalable dual-clock FIFO for data transfers between arbitrary and haltable clock domains. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pages 1125–1134, October 2007.

[42] M. J. Meeuwsen, Z. Yu, and B. M. Baas. A shared memory module for asynchronous arrays of processors. Eurasip Journal on Embedded Systems, 2007:Article ID 86273, 13 pages, 2007.

[43] M. J. Meeuwsen, O. Sattari, and B. M. Baas. A full-rate software implementation of an IEEE 802.11a compliant digital baseband transmitter. In IEEE Workshop on Signal Processing Systems (SiPS ’04), pages 124–129, October 2004.

[44] M. Lai. Arithmetic units for a high performance digital signal processor. Master's thesis, University of California, Davis, CA, USA, September 2004.

[45] Z. Yu and Bevan Baas. Implementing tile-based chip multiprocessors with GALS clocking styles. In IEEE International Conference on Computer Design (ICCD), pages 174–179, October 2006.

[46] Z. Yu and B. Baas. Performance and power analysis of globally asynchronous locally synchronous multi-processor systems. In IEEE Computer Society Annual Symposium on VLSI, pages 378–384, March 2006.

[47] J. Oliver, R. Rao, M. Brown, J. Mankin, D. Franklin, F. Chong, and V. Akella. Tile size selection for low-power tile-based architecture. In ACM Computing Frontiers, pages 83–94, May 2006.

[48] G. Semeraro, G. Magklis, R. Balasubramonian, D. Albonesi, S. Dwarkadas, and M. L. Scott. Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling. In IEEE International Symposium on High-Performance Computer Archi- tecture (HPCA), pages 29–40, February 2002.

[49] S. Y. Kung. VLSI array processors. In IEEE ASSP Magazine, pages 4–22, July 1985.

[50] S. W. Keckler, D. Burger, C. R. Moore, R. Nagarajan, K. Sankaralingam, V. Agarwal, M. S. Hrishikesh, N. Ranganathan, and P. Shivakumar. A wire-delay scalable microprocessor architecture for high performance systems. In IEEE International Solid-State Circuits Conference (ISSCC), pages 168–169, February 2003.

[51] W. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks. In IEEE International Conference on Design Automation (DAC), pages 684–689, June 2001.

[52] J. H. Ahn, W. J. Dally, B. Khailany, U. J. Kapasi, and A. Das. Evaluating the Imagine stream architecture. In International Symposium on Computer Architecture (ISCA), pages 19–23, June 2004.

[53] Texas Instruments. DSP platforms benchmarks. Technical report. http://www.ti. com/.

[54] Berkeley Design Technology. Evaluating DSP Processor Performance. Berkeley, CA, USA, 2000.

[55] The Embedded Microprocessor Benchmark Consortium. Data sheets. www.eembc.org, 2006.

[56] N. Bindal, T. Kelly, N. Velastegui, and K. Wong. Scalable sub-10ps skew global clock distri- bution for a 90 nm multi-GHz IA microprocessor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 346–347, 498, February 2003.

[57] T. Olsson and P. Nilsson. A digitally controlled PLL for SOC applications. IEEE Journal of Solid-State Circuits (JSSC), pages 751–760, May 2004.

[58] E. Beigne and P. Vivet. Design of on-chip and off-chip interfaces for a GALS Noc archi- tecture. In International Symposium on Asynchronous Circuits and Systems (ASYNC), pages 13–15, March 2006.

[59] J.M.Rabaey. Digital Integrated Circuits – A Design Perspective. Prentice-Hall International, Inc, first edition, 1998.

[60] K. K. Parhi. VLSI Digital Signal Processing Systems. Wiley & Sons, 1999.

[61] G.K.Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, pages xviii–xxxiv, February 1992.

[62] J. Glossner, J. Moreno, M. Moudgill, J. Derby, E. Hokenek, D. Meltzer, U. Shvadron, and M. Ware. Trends in compilable dsp architecture. In IEEE Workshop on Signal Processing Systems, pages 181–199, October 2000.

[63] Eric W. Work. Algorithms and software tools for mapping arbitrarily connected tasks onto an asynchronous array of simple processors. Master's thesis, University of California, Davis, CA, USA, in preparation, 2007.

[64] E.A. Lee. Programmable DSP architectures: Part i. IEEE ASSP Magazine, pages 4–19, October 1988.

[65] I. Kuroda and T. Nishitani. Multimedia processors. Proceedings of IEEE, pages 1203–1221, June 1998.

[66] A. Peleg and U. Weiser. MMX technology extension to the intel architecture. IEEE Micro, pages 42–50, August 1996.

[67] Analog Devices. Embedded processing and DSP. Technical report. http://www.AnalogDevices.com/.

[68] Freescale. Digital signal processors and controllers. Technical report. http://www. freescale.com/.

[69] EE Times. Configurable processor focuses on video surveillance systems. Technical report. Aug. 23, 2007.

[70] S. Agarwala, A. Rajagopal, A. Hill, T. Anderson, M. Joshi, S. Mullinnix, T. Anderson, R. Damodaran, L. Nardini, P. Wiley, P. Groves, J. Apostol, M. Gill, J. Flores, A. Chachad, A. Hales, K. Chirca, K. Panda, R. Venkatasubramanian, P. Eyres, R. Velamuri, A. Rajaram, M. Krishnan, J. Nelson, J. Frade, M. Rahman, N. Mahmood, U. Narasimha, J. Frade, M. Rah- man, N. Mahmood, U. Narasimha, S. Sinha, S. Krishnan, W. Webster, D. Bui, S. Moharil, N. Common, R. Nair, R. Ramanujam, J. Frade, and M. Ryan. A 65nm C64x+ multi-core DSP platform for communication infrastructure. IEEE International Solid-State Circuits Confer- ence (ISSCC), pages 262–263, February 2007.

[71] W.A. Wulf and C.G. Bell. C.mmp – a multi-mini-processor. In AFIPS Conference, pages 765–777, 1972.

[72] D. Lenoshi, J. Laudon, K. Gharachorloo, W.D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M.S Lam. The stanford DASH multiprocessor. IEEE Computer, pages 63–79, March 1992.

[73] C.L. Seitz. The cosmic cube. Communications of the ACM, pages 22–33, January 1985.

[74] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, , and J. Hennessy. The stan- ford FLASH multiprocessor. In International Symposium on Computer Architecture (ISCA), pages 302–313, April 1994.

[75] D.H. Lawrie. Access and alignment of data in an array processor. IEEE Transaction of Computers, pages 1145–1155, 1975.

[76] H.S. Stone. Parallel processing with the perfect shuffle. IEEE Transaction of Computers, pages 153–161, 1971.

[77] C. Whitby-Strevens. Transputers-past, present and future. IEEE Micro, pages 16–19, De- cember 1990.

[78] H. T. Kung. Why systolic architectures? In Computer Magazine, January 1982.

[79] H. T. Kung. Systolic communication. In International Conference on Systolic Arrays, pages 695–703, May 1988.

[80] L. Snyder. Introduction to the configurable, highly parallel computer. IEEE Computer, pages 47–56, January 1982.

[81] S. Y. Kung, K. S. Arun, R. J. Gal-Ezer, and D. V. Bhaskar Rao. Wavefront array processor: Language, architecture, and applications. IEEE Transactions on Computers, pages 1054– 1066, November 1982.

[82] U. Schmidt and S. Mehrgardt. Wavefront array processor for video applications. In IEEE International Conference on Computer Design (ICCD), pages 307–310, September 1990.

[83] A. Keung and J.M. Rabaey. A 2.4 GOPS data-driven reconfigurable multiprocessor IC for DSP. In IEEE International Solid-State Circuits Conference (ISSCC), pages 108–110, Febru- ary 1995.

[84] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software: Raw machines. IEEE Computer, pages 86–93, September 1997.

[85] M.B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. John- son, J. Kim, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwi. Evaluation of the raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In International Symposium on Computer Architecture (ISCA), June 2004.

[86] S. Rixner, W.J. Dally, U.J. Kapasi, B. Khailany, A. Lopez-Laguns, P. Mattson, and J.D. Owens. A bandwidth-efficient architecture for media processing. In IEEE international Symposium on Microarchitecture (MICRO), pages 3–13, November 1998.

[87] B. Khailany, W. J. Dally, A. Chang, U. J. Kapasi, J. Namkoong, and B. Towles. VLSI design and verification of the imagine processor. In IEEE International Conference on Computer Design (ICCD), pages 289–294, September 2002.

[88] B. Khailany, T. Williams, J. Lin, E. Long, M. Rygh, D. Tovey, and W.J. Dally. A pro- grammable 512 GOPS stream processor for signal, image, and video processing. In IEEE International Solid-State Circuits Conference (ISSCC), pages 272 – 273, February 2007.

[89] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukotun. The stanford Hydra CMP. IEEE Micro, pages 71–84, March 2000.

[90] H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, and J. M. Rabaey. A 1-V heterogeneous reconfigurable DSP IC for wireless baseband digital signal processing. IEEE Journal of Solid-State Circuits (JSSC), 35(11):1697–1704, November 2000.

[91] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz. Smart memories: A modular reconfigurable architecture. In International Symposium on Computer Architecture (ISCA), pages 161–171, June 2000.

[92] K. Mai, R. Ho, E. Alon, D. Liu, Y. Kim, D. Patil, and M. Horowitz. Architecture and circuit techniques for a reconfigurable memory block. In IEEE International Solid-State Circuits Conference (ISSCC), pages 500–501, 2004.

[93] K. Sankaralingam, R. Nagarajan, H. Liu, J. Huh, C.K. Kim, D. Burger, S.W. Keckler, and C.R. Moore. Exploiting ILP, TLP, and DLP using polymorphism in the TRIPS architec- ture. In International Symposium on Computer Architecture (ISCA), pages 422–433, Febru- ary 2003.

[94] M. Saravana, S. Govindan, D. Burger, S. Keckler, et al. TRIPS: A distributed explicit data graph execution (EDGE) microprocessor. In Hotchips, August 2007.

[95] H. Schmit, D. Whelihan, M. Moe, B. Levine, and R. R. Taylor. PipeRench: A virtualized programmable datapath in 0.18 micron technology. In IEEE Custom Integrated Circuits Conference (CICC), pages 63–66, May 2002.

[96] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. Wavescalar. In IEEE international Symposium on Microarchitecture (MICRO), December 2003.

[97] S. Swanson, A. Putnam, M. Mercaldi, K. Michelson, A. Petersen, A. Schwerin, M. Oskin, and S.J. Eggers. Area-performance trade-offs in tiled dataflow architectures. In International Symposium on Computer Architecture (ISCA), pages 314–326, May 2006.

[98] J. Oliver, R. Rao, P. Sultana, J. Crandall, E. Czernikowski, L. W. Jones, D. Franklin, V. Akella, and F. T. Chong. Synchroscalar: A multiple clock domain, power-aware, tile- based embedded processor. In International Symposium on Computer Architecture (ISCA), June 2004.

[99] D.C. Cronquist, P. Franklin, C. Fisher, M. Figueroa, and C. Ebeling. Architecture design of reconfigurable pipelined datapaths. In Conference on Advanced Research in VLSI, pages 23–40, March 1999.

[100] R. Baines and D. Pulley. A total cost approach to evaluating different reconfigurable architec- tures for baseband processing in wireless receivers. IEEE Communication Magazine, pages 105–113, January 2003.

[101] S. Kyo, T. Koga, S. Okazaki, R. Uchida, S. Yoshimoto, and I. Kuroda. A 51.2GOPS scalable video recognition processor for intelligent cruise control based on a linear array of 128 4-way VLIW processing elements. In IEEE International Solid-State Circuits Conference (ISSCC), pages 48–49, February 2003.

[102] J. Carlstrom, G. Nordmark, J. Roos, T. Boden, L. Svensson, and P. Westlund. A 40Gb/s net- work processor with PISC dataflow architecture. In IEEE International Solid-State Circuits Conference (ISSCC), pages 60–61, February 2004.

[103] W. Eatherton. The push of network processing to the top of the pyramid. In Symposium on Architectures for Networking and communications systems, 2005.

[104] Intellasys. SEAforth-24B, embedded array processor. Technical report. http://www. intellasys.net/.

[105] Mathstar. Arrix family product brief. Technical report. http://www.mathstar.com/.

[106] Rapport. KC256 technical overview. Technical report. http://www. rapportincorporated.com/.

[107] A.M. Jones and M. Butts. TeraOPS hardware: A new massively-parallel MIMD computing fabric IC. In Hotchips, August 2006.

[108] D. Lattard, E. Beigne, C. Bernard, C. Bour, F. Clermidy, Y. Durand, J. Durupt, D. Varreau, P. Vivit, P. Penard, A. Bouttier, and F. Berens. A telecom baseband circuit based on an asyn- chronous network-on-chip. In IEEE International Solid-State Circuits Conference (ISSCC), pages 258–259, February 2007.

[109] V. Yalala, D. Brasili, D. Carlson, A. Hughes, A. Jain, T. Kiszely, K. Kodandapani, A. Varadhrajan, and T. Xanthopoulos. A 16-core RISC microprocessor with network extensions. In IEEE International Solid-State Circuits Conference (ISSCC), pages 100–101, February 2006.

[110] H. Zhang, M. Wan, V. George, and J. Rabaey. Interconnect architecture exploration for low- energy reconfigurable single-chip dsps. In IEEE Computer Society Workshop on VLSI, pages 2–8, April 1999.

[111] P. P. Pande, C. Crecu, M. Jones, A. Ivanov, and R. Saleh. Effect of traffic localization on energy dissipation in noc-based interconnect. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 1774–1777, May 2005.

[112] S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar, and A. Alvandpour. A 5.1GHz 0.34mm2 router for network-on-chip applications. In Symposium on VLSI Circuits, pages 42–43, June 2007.

[113] J. Hu and R. Marculescu. Application-specific buffer space allocation for network-on-chip router design. In IEEE International Conference on Computer Aided Design (ICCAD), pages 354–361, 2004.

[114] M. J. Karol, M. G. Hluchyj, and S. P. Morgan. Input versus output queueing on a space-division packet switch. IEEE Transactions on Communications, pages 1347–1356, December 1987.

[115] W. J. Dally. Virtual-channel flow control. IEEE Transactions on Parallel and Distributed Systems, pages 194–205, March 1992.

[116] M. B. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal. Scalar operand networks: On-chip interconnect for ILP in partitioned architectures. In IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 341–353, February 2003.

[117] G. Campobello, M. Castano, C. Ciofi, and D. Mangano. GALS networks on chip: a new solution for asynchronous delay-insensitive links. In Design, Automation and Test in Europe (DATE), pages 160–165, April 2006.

[118] K. Lee, S. Lee, and H. Yoo. Low-power network-on-chip for high-performance SoC design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pages 148–160, February 2006.

[119] M. Fukase, T. Sato, R. Egawa, and T. Nakamura. Scaling up of wave pipelines. In International Conference on VLSI Design, pages 439–445, January 2001.

[120] P. Cocchini. Concurrent flip-flop and repeater insertion for high performance integrated circuits. In IEEE International Conference on Computer Aided Design (ICCAD), pages 268–273, November 2002.

[121] B. R. Quinton, M. R. Greenstreet, and S. J. E. Wilton. Asynchronous IC interconnect network design and implementation using a standard ASIC flow. In IEEE International Conference on Computer Design (ICCD), pages 267–274, October 2005.

[122] A. Suga, T. Sukemura, H. Wada, H. Miyake, Y. Nakamura, Y. Takebe, K. Azegami, Y. Himura, H. Okano, T. Shiota, M. Saito, S. Wakayama, T. Ozawa, T. Satoh, A. Sakutai, T. Katayama, K. Abe, and K. Kuwano. A 4-way VLIW embedded multimedia processor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 240–241, February 2000.

[123] R. Witek and J. Montanaro. StrongARM: A high-performance ARM processor. In IEEE COMPCON, pages 188–191, February 1996.

[124] J. Sungtae, M. B. Taylor, J. Miller, and D. Wentzlaff. Energy characterization of a tiled architecture processor with on-chip network. In International Symposium on Low Power Electronics and Design (ISLPED), pages 424–427, 2003.

[125] C. Kozyrakis and D. Patterson. Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks. In IEEE International Symposium on Microarchitecture (MICRO), pages 283–293, November 2002.

[126] B. Gorjiara and D. Gajski. Custom processor design using NISC: a case-study on DCT algorithm. In Workshop on Embedded Systems for Real-Time Multimedia, pages 55–60, September 2005.

[127] T. Lin and C. Jen. Cascade – configurable and scalable DSP environment. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 26–29, May 2002.

[128] M. Matsui, H. Hara, Y. Uetani, L.-S. Kim, T. Nagamatsu, Y. Watanabe, A. Chiba, K. Matsuda, and T. Sakurai. A 200 MHz 13 mm2 2-D DCT macrocell using sense-amplifying pipeline flip-flop scheme. IEEE Journal of Solid-State Circuits (JSSC), pages 1482–1490, December 1994.

[129] K. Maharatna, E. Grass, and U. Jagdhold. A 64-point Fourier transform chip for high-speed wireless LAN application using OFDM. IEEE Journal of Solid-State Circuits (JSSC), 39:484–493, March 2004.

[130] J. Thomson, B. Baas, E. M. Cooper, J. M. Gilbert, G. Hsieh, P. Husted, A. Lokanathan, J. S. Kuskin, D. McCracken, B. McFarland, T. H. Meng, D. Nakahira, S. Ng, M. Rattehalli, J. L. Smith, R. Subramanian, L. Thon, Y.-H. Wang, R. Yu, and X. Zhang. An integrated 802.11a baseband and MAC processor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 126–127, 451, February 2002.

[131] B. M. Baas. A parallel programmable energy-efficient architecture for computationally-intensive DSP systems. In Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, pages 2185–2189, November 2003.

[132] S. F. Smith. Performance of a GALS single-chip multiprocessor. In The International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), pages 449–454, June 2004.

[133] E. Talpes and D. Marculescu. Toward a multiple clock/voltage island design style for power-aware processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pages 591–603, May 2005.

[134] A. Upadhyay, S. R. Hasan, and M. Nekili. Optimal partitioning of globally asynchronous locally synchronous processor arrays. In Great Lakes Symposium on VLSI (GLSVLSI), pages 26–28, April 2004.

[135] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, second edition, 1999.

[136] A. Iyer and D. Marculescu. Power and performance evaluation of globally asynchronous locally synchronous processors. In International Symposium on Computer Architecture (ISCA), pages 158–168, May 2002.

[137] E. Talpes and D. Marculescu. A critical analysis of application-adaptive multiple clock processor. In International Symposium on Low Power Electronics and Design (ISLPED), pages 278–281, August 2003.

[138] G. Semeraro, D. H. Albonesi, G. Magklis, M. L. Scott, S. G. Dropsho, and S. Dwarkadas. Hiding synchronization delays in a GALS processor microarchitecture. In International Symposium on Asynchronous Circuits and Systems (ASYNC), pages 159–169, April 2004.

[139] O. Sattari. Fast Fourier transforms on a distributed digital signal processor. Master's thesis, University of California, Davis, CA, USA, September 2004.

[140] C. E. Dike, N. A. Kurd, P. Patra, and J. Barkatullah. A design for digital, dynamic clock deskew. In Symposium on VLSI Circuits, pages 21–24, June 2003.

[141] P. Mahoney, E. Fetzer, B. Doyle, and S. Naffziger. Clock distribution on a dual-core multi-threaded Itanium-family processor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 292–293, February 2005.

[142] D. C. Sekar. Clock trees: differential or single ended? In International Symposium on Quality of Electronic Design, pages 545–553, March 2005.

[143] M. Hashimoto, T. Yamamoto, and H. Onodera. Statistical analysis of clock skew variation in H-tree structure. In International Symposium on Quality of Electronic Design, pages 402–407, March 2005.

[144] T. Njolstad, O. Tjore, K. Svarstad, L. Lundheim, T. O. Vedal, J. Typpo, T. Ramstad, L. Wanhammar, E. J. Aar, and H. Danielsen. A socket interface for GALS using locally dynamic voltage scaling for rate-adaptive energy saving. In Annual IEEE International ASIC/SOC Conference, pages 110–116, September 2001.

[145] K. J. Nowka, G. D. Carpenter, E. W. MacDonald, H. C. Ngo, B. C. Brock, K. I. Ishii, T. Y. Nguyen, and J. L. Burns. A 32-bit PowerPC system-on-a-chip with support for dynamic voltage scaling and dynamic frequency scaling. IEEE Journal of Solid-State Circuits (JSSC), pages 1441–1447, November 2002.