FACULTY OF ENGINEERING

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

ADAPT: Architectural and Design exploration for Application specific instruction-set Processor Technologies

Seng Lin Shee

B.E. (Hons I) Computer Engineering (UNSW)

A thesis submitted for the degree of Doctor of Philosophy in Computer Science and Engineering

September 2007

© Copyright by Seng Lin Shee 2008

All Rights Reserved

Acknowledgements

This thesis would never have been possible without the guidance, assistance and support of the individuals mentioned below. In this section, I would like to share with the reader the gratitude, enjoyment, frustration and satisfaction I felt during the course of my PhD program.

I would like to express my deep appreciation and thanks to my supervisor, A. Prof. Sri Parameswaran, for his endless support and insightful guidance throughout the course of my PhD program. Whenever there were difficulties and intricacies in my projects and research work, Sri would always provide the necessary encouragement and advice to counter such hurdles. I have appreciated the constant push and pressure needed during the course of my program, for without them, the current accomplishments I see today would not have been possible.

I would also like to thank staff members Aleksandar Ignjatovic and Annie Hui Guo for their valuable feedback and advice on the various projects and research matters during my PhD program. It was a good experience collaborating with researchers who are so passionate about their related fields. I have gained a lot of experience from working with them.

It has been a great pleasure working alongside dedicated people in my research lab ever since the start of my undergraduate thesis project. I would like to thank Jeremy Chan See Wei for his constant criticism of my work, wherever I was and whatever I did; Jorgen Peddersen for being a good guide and mentor; Newton Cheung for being the legendary PhD student; and Andhi Janapsatya for helping me with a lot of stuff along the way. Many thanks to the other members of the lab, namely Ivan Lu, Shannon Koh, Lih Wen Koh, Michael Chong, Angelo Ambrose, Krutartha Patel and Carol He for their companionship, friendship and fun throughout my life in university.

Special thanks to Andrea Erdos, who was a summer student in the research lab in 2006. Andrea worked hard alongside me on two published works. It has been a great pleasure to collaborate with such an excellent and fun individual.

Life would not be complete without my residential college friends in International House. Those boring and slow moments in research have been well compensated for with just plain fun and the interesting characters in college.

Last but not least, I would like to thank my parents for the encouragement and constant motivation which have enabled me to endure and persevere over the hardest hurdles during my PhD program. I am grateful for all the constant phone calls from overseas pestering me about the progress of my work.

To all the above, I would like to dedicate this work of literature. You may be surprised, but here it is: my PhD thesis. :)

List of Publications

1. S. L. Shee, A. Erdos, and S. Parameswaran, “Architectural Exploration of Heterogeneous Multiprocessor Systems for JPEG,” International Journal of Parallel Programming (IJPP), 36(1):140–162, February 2008.

2. K. Patel, S. Parameswaran, and S. L. Shee, “Ensuring Secure Program Execution in Multiprocessor Embedded Systems: A Case Study,” Proceedings of the 5th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’07), pp. 57–62, Salzburg, Austria, September 2007.

3. S. L. Shee and S. Parameswaran, “Design Methodology for Pipelined Heterogeneous Multiprocessor System,” Proceedings of the 44th Annual Design Automation Conference (DAC’07), pp. 811–816, San Diego, CA, June 2007.

4. S. L. Shee, A. Erdos, and S. Parameswaran, “Heterogeneous Multiprocessor Implementations for JPEG: A Case Study,” Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’06), pp. 217–222, Seoul, Korea, October 2006.

5. S. L. Shee, S. Parameswaran, and N. Cheung, “Novel Architecture for Loop Acceleration: A Case Study,” Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’05), pp. 297–302, Jersey City, NJ, September 2005.

6. J. M. D. Peddersen, S. L. Shee, A. Janapsatya, and S. Parameswaran, “Rapid Embedded Hardware/Software System Generation,” Proceedings of the 18th International Conference on VLSI Design (VLSI’05), pp. 111–116, Kolkata, India, January 2005.

Abstract

The miniaturization of the transistor has made it possible for billions of transistors to be integrated into a single chip package. With new two-dimensional methods of manufacturing chips, ever more features and functionalities can be condensed into a small area of silicon. However, as in any typical engineering situation, this silicon area is considered a resource that should be conservatively used, to minimize power and chip packaging size.

This thesis presents a suite of design automation methodologies for the design of customized processors for specific application domains on an extensible processor platform. The work first presents a single processor approach to customization: a methodology that can rapidly create different processor configurations by removing unused instruction sets from the architecture. A profile-directed approach is used to identify frequently used instructions and to eliminate unused ones from the available instruction pool.

A coprocessor approach is explored next, creating an SoC (System-on-Chip) that speeds up the application while reducing energy consumption. Loops in applications are identified and accelerated by tightly coupling a coprocessor to an ASIP (Application Specific Instruction-set Processor). Latency hiding is used to exploit the parallelism provided by this architecture. A case study has been performed on a JPEG encoding algorithm, comparing two different coprocessor approaches: a high-level synthesis approach and our custom coprocessor approach.

The thesis concludes by introducing a heterogeneous multiprocessor system using ASIPs as processing entities in a pipeline configuration. The problem of mapping each algorithmic stage in the system to an ASIP configuration is formulated. We have also proposed an estimation technique to calculate runtimes of the configured multiprocessor system without running cycle-accurate simulations, which could take a significant amount of time. We present two heuristics to efficiently search the design space of a pipeline-based multi-ASIP system and compare the results against an exhaustive approach.

In our first approach, the reduction of the instruction set and the generation of a processor can be performed within an hour. For five benchmark applications, we show that, on average, processor size can be reduced by 30% and energy consumption by 24%, while performance is improved by 24%. In the coprocessor approach, the high-level synthesis method provides a faster way of generating coprocessors. However, compared with the use of a main processor alone, a loop performance improvement of 2.57× is achieved using the custom coprocessor approach, as against 1.58× for the high-level synthesis method and 1.33× for the customized instruction approach. Energy savings within the loop are 57%, 28% and 19%, respectively. Our multiprocessor design provides a performance improvement of at least 4.03× for JPEG and 3.31× for MP3 over a single processor design. The minimum cost obtained with the use of our heuristic was within 0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks respectively.

Contents

Statement of Originality ...... i

Copyright Statement ...... ii

Authenticity Statement ...... iii

Acknowledgements ...... iv

List of Publications ...... vi

Abstract ...... viii

1 Introduction 1

1.1 Microprocessor Generations ...... 3

1.1.1 First Generation – 1940-1956 ...... 3

1.1.2 Second Generation – 1956-1963 ...... 4

1.1.3 Third Generation – 1964-1971 ...... 5

1.1.4 Fourth Generation – 1971-Present ...... 7

1.2 Design Challenges ...... 9

1.2.1 Performance ...... 10

Pipeline ...... 10

SIMD ...... 12

Superscalar ...... 13

Coprocessors ...... 14

Multicore / multiprocessor ...... 15

1.2.2 Area ...... 16

Moore’s Law...... 16

Specialization ...... 18

1.2.3 Energy ...... 18

Circuit Level ...... 18

Logic Level ...... 19

Architecture / System Level ...... 20

1.3 Extensible Processor Platform ...... 21

1.3.1 Base Configurations ...... 22

1.3.2 Extensible And Customized Instructions ...... 23

1.3.3 Architectural Configurations ...... 23

1.3.4 Heterogeneous Multiprocessor Via An Extensible Platform . . 24

1.4 Design Automation ...... 25

1.5 Motivation ...... 27

1.6 Research Goals and Contributions ...... 30

1.7 Thesis Overview ...... 31

2 Literature Review 34

2.1 Introduction ...... 34

2.2 Embedded Systems ...... 34

2.2.1 Integration of logic-based circuits ...... 39

2.2.2 Functional Upgrades ...... 39

2.2.3 Analogue Replacement ...... 39

2.3 Architectural Designs ...... 42

2.3.1 General Purpose Processors ...... 42

2.3.2 Coprocessor Systems ...... 45

2.3.3 Digital Signal Processors ...... 49

2.3.4 Multiprocessor Systems ...... 52

2.4 Customization of Architectures ...... 56

2.4.1 Field Programmable Grid Array ...... 57

2.4.2 Application Specific Integrated Circuits ...... 60

2.4.3 Extensible Processor Architectures ...... 61

Application Specific Instruction-set Processors ...... 62

Design tools and framework ...... 65

2.5 Parallelizing Architectures ...... 68

2.5.1 Instruction Level Parallelism ...... 69

2.5.2 Task Level Parallelism ...... 70

2.6 Design Space Exploration ...... 74

2.6.1 Processor Generation ...... 74

2.6.2 System Generator - coprocessor generation ...... 76

2.6.3 Multiprocessor / Heterogeneity ...... 79

3 Approach to Customization 84

3.1 Introduction ...... 84

3.2 Shortcomings of Previous Research ...... 84

3.3 Modus Operandi ...... 87

4 Customizing by Removing Instructions 94

4.1 Introduction ...... 94

4.2 Motivation ...... 95

4.3 Microprocessor Generation Framework ...... 96

4.4 Application Specific Processor Generation ...... 97

4.5 Experimental Setup ...... 98

4.5.1 Analysis of Results ...... 101

4.6 Conclusions ...... 103

5 Customizing by Coprocessors 104

5.1 Introduction ...... 104

5.2 The JPEG Encoder ...... 105

5.2.1 Loop Identification ...... 105

5.3 High-level Synthesis Approach ...... 107

5.3.1 Architecture ...... 108

5.3.2 Advantages & Limitations ...... 111

5.4 Custom Coprocessor Approach ...... 112

5.4.1 Architecture ...... 112

5.5 Discussion of the Architecture ...... 114

5.6 Experimental Setup & Tools ...... 116

5.6.1 Verification ...... 118

5.7 Results ...... 119

5.8 Conclusions ...... 120

6 Customizing by Pipelining 123

6.1 Introduction ...... 123

6.2 Background ...... 124

6.2.1 Case Study Application ...... 124

6.2.2 Baseline Processor Description ...... 125

6.3 Methodology ...... 127

6.3.1 Single Pipeline ...... 129

Five Cores ...... 130

Six Cores ...... 131

Seven Cores ...... 131

6.3.2 Multiple Pipelines ...... 132

Nine Cores ...... 132

Seven Cores ...... 133

6.4 Experimental methodology ...... 134

6.5 Results & Analysis ...... 137

6.5.1 Further Architectural Comparison ...... 144

6.6 Conclusion ...... 146

7 Design Space Exploration 147

7.1 Introduction ...... 147

7.2 Background ...... 148

7.2.1 Benchmark Applications ...... 150

JPEG Benchmarks ...... 151

MP3 Benchmarks ...... 152

7.2.2 System Architecture ...... 152

7.3 The System ...... 154

7.4 Design Exploration ...... 157

7.4.1 Problem Definition ...... 158

7.4.2 Exhaustive Search ...... 159

7.4.3 Runtime Estimation ...... 161

7.4.4 Estimation-based Search ...... 164

7.4.5 Preliminary Heuristic ...... 169

Preliminary Results ...... 171

Preliminary Analysis ...... 173

7.4.6 Heuristic ...... 174

7.5 Experimental methodology ...... 178

7.6 Results & Analysis ...... 181

7.7 Discussion ...... 190

7.8 Conclusion ...... 193

8 Conclusions 194

Bibliography 200

List of Tables

1.1 Example of a superscalar processor pipeline flow...... 13

4.1 Mediabench Benchmark Applications used in experiment...... 100

4.2 Different configuration of the SoC architecture...... 101

4.3 Table of results...... 101

5.1 Loop runtimes ...... 107

5.2 Loop Energy and Performance Table ...... 119

5.3 Program Performance and Power Table ...... 119

6.1 Processor Configuration ...... 128

6.2 Utilization in a nine core multipipeline system ...... 133

6.3 Processor configuration with multiple pipeline flows ...... 134

7.1 Processor Configuration ...... 154

7.2 Exploration time ...... 168

7.3 Configurations obtained for the JPEG benchmark (preliminary). . . . 172

7.4 Configurations obtained for the MP3 benchmark (preliminary). . . . 173

7.5 Configurations obtained for the JPEG benchmark...... 183

7.6 Configurations obtained for the MP3 benchmark...... 188

List of Figures

1.1 Ubiquitous computing trend ...... 8

1.2 A 10 stage In-Order Core Pipeline of an Itanium Processor . . . 11

1.3 Transistor count for each processor generation ...... 17

1.4 Microprocessor Development Process ...... 28

2.1 Embedded systems in a vehicle provide a multitude of features . . . 36

2.2 The Apollo Guidance Computer ...... 38

2.3 Architecture of the Intel 4004 microprocessor...... 40

2.4 Example of applications utilizing DSPs...... 41

2.5 Architecture of a lexical/parsing coprocessor...... 47

2.6 Cycle counts for BDTI block FIR filter benchmark ...... 51

2.7 The Xtensa LX Architecture ...... 64

2.8 Cascade design flow for generating coprocessors ...... 78

2.9 A systolic array example of data processing units (DPUs) ...... 81

4.1 Creating a minimized processor description ...... 97

4.2 Experimental Setup ...... 99

4.3 Experimental Results (area, performance, and energy improvement). . 102

5.1 Example code segment & coprocessor interface ...... 108

5.2 Modified loop as an ANSI-C function ...... 109

5.3 Coprocessor Integration ...... 110

5.4 Example code segment & corresponding graph ...... 114

5.5 Parallel Execution ...... 115

5.6 Experimental Setup ...... 117

5.7 Synthesis and Power Calculation Flow ...... 118

5.8 Area and Loop Energy Usage ...... 121

6.1 The main stages in a JPEG encoder...... 126

6.2 Xtensa queue interface...... 126

6.3 Stages in a pipeline processor architecture...... 130

6.4 A five core system interconnected by queues...... 130

6.5 A nine core system with three internal pipeline flows ...... 132

6.6 Experiment Methodology ...... 136

6.7 Performance of multiprocessor systems without optimizations . . . . . 138

6.8 Utilization of the seven pipeline stage systems ...... 140

6.9 Runtime improvements and area increase ...... 141

6.10 Design Space for JPEG Encoder ...... 142

6.11 Performance of multiprocessor systems without optimizations . . . . . 145

7.1 Possible design configurations for a pipelined multiprocessor system . 149

7.2 Xtensa LX Queue Interface ...... 153

7.3 Design flow ...... 155

7.4 An exhaustive search algorithm ...... 160

7.5 Error distribution when using equation 7.9 ...... 164

7.6 Simulation runs with configurations for each processor in the system . 165

7.7 An exhaustive search using the proposed estimation technique . . . . 167

7.8 A preliminary heuristic for configuration selection ...... 170

7.9 The final heuristic for configuration selection ...... 176

7.10 Comparison of complexity of estimation-based and heuristic approaches 177

7.11 Benchmark image used for JPEG encoding ...... 180

7.12 First frames of video sequence stages, best viewed in color ...... 182

7.13 JPEG encoding runtime for each frame in the video sequence . . . . . 182

7.14 JPEG multiprocessor pipeline systems design space ...... 184

7.15 MP3 multiprocessor pipeline systems design space ...... 185

7.16 Pareto points of a JPEG multiprocessor pipeline systems design space 186

7.17 Pareto points of an MP3 multiprocessor pipeline systems design space 187

Chapter 1

Introduction

The information age has brought with it numerous inventions, many of which have benefited business, the education sector, the health industry, national security and normal everyday lives. Innovation has spurred a cultural revolution leading to the development of ubiquitous devices, helping humans to perform mundane tasks which were formerly considered stressful and time-consuming. The advent and deployment of computers have revolutionized the ways that people conduct business. Millions of transaction records could now be stored and retrieved quickly, while complex scenarios could be simulated to undertake risk management activities and to help higher management in its decision making.

Hardware advancements have led to new services and applications, such as the internet, multimedia devices (MP3 players, video recorders, etc.), navigation devices and health monitoring tools. Ubiquitous computing has revolutionized the ways by which we educate the next generation and how we interact with one another. Faster completion of tasks increases productivity and competitiveness in the market place.

All of the above is enabled by the rapid development of both hardware and software that has taken place since the era of mainframes. Today, we have devices embedded in common home appliances which are taken for granted. These small computing devices each have more processing power than those first computers which occupied a large hall in a research center. This key technology has been reduced in size due to advances in fabrication technology.

The heart of an embedded system is usually implemented using either general purpose processors, ASICs (Application Specific Integrated Circuits) or a combination of both. General Purpose Processors (GPPs) are programmable, but consume more power than ASICs. Reduced time to market and minimized risk are factors which favour the use of GPPs in embedded systems. ASICs, on the other hand, cost a great deal to design and are non-programmable, making upgradability an impossible dream. However, ASICs have reduced power consumption and are smaller than GPPs.

Application Specific Instruction-set Processors (ASIPs) are processors with specialized instructions, selected co-processors, and parameterized caches applicable only to a particular program or class of programs. An ASIP will execute the application for which it was designed with great efficiency.

Industry has long been researching various ways to speed up applications, and their relevant hardware mechanisms, without being left behind by competitors. Design automation plays a major role in creating different designs to suit a particular customer base. Market forces demand performance increases at decreased prices. Functionality must also be improved, in order to meet increasing customer expectations about flexibility and ease of use. All in all, the system becomes increasingly complex over time. With the pressure of competition and decreasing product lifetimes, the automation of computer design is proving essential.

The aim of this thesis is to develop methodologies and tools to design multiprocessor systems automatically and efficiently on an extensible platform, so as to reduce design turnaround time and time to market, thus increasing productivity and competitiveness in the industry.

1.1 Microprocessor Generations

The purpose of a computer is to “compute”—to solve complex mathematical problems. World War II sparked the information revolution when governments made considerable effort to exploit advances in science as weapons to gain total supremacy over the enemy. War became the springboard which would bring industry into the IT era.

1.1.1 First Generation – 1940-1956

The first modern computer, which became the template for all current computer designs, was the ENIAC (Electronic Numerical Integrator And Computer). The ENIAC, EDVAC (Electronic Discrete Variable Automatic Computer) and UNIVAC (UNIVersal Automatic Computer) computers are examples of first-generation computing devices. The UNIVAC was the first commercial computer delivered to a business client, the U.S. Census Bureau, in 1951. These machines were called mainframes. As a result of the development of the EDVAC, von Neumann’s 1945 report, “First Draft of a Report on the EDVAC” [158], was written, describing what has since become known as the von Neumann Architecture: a computer in terms of its four basic functional units—memory, processor, input and output.

“Mainframes” [55] were originally given that name because their circuits were mounted on large metal frames, housed in cabinets. Mainframe installations consisted of a number of these cabinets, on a false floor raised a few inches above the real floor, which provided room for the numerous thick cables connecting one cabinet to another and allowed proper air circulation. The room housing this massive equipment was climate-controlled.

The first computers used vacuum tubes for circuitry and magnetic drums for memory, and were often enormous, taking up entire rooms. These machines were very expensive to operate and, in addition to using a great deal of electricity, generated a lot of heat, which was often the cause of malfunctions. First generation computers relied on machine language to perform operations, and they could only solve one problem at a time. Input was based on punched cards and paper tape, and output was produced by line printers.

1.1.2 Second Generation – 1956-1963

Transistors gradually replaced vacuum tubes, ushering in the second generation of computers. The transistor was invented in 1947 but did not see widespread use in computers until the late 1950s. The transistor was far superior to the vacuum tube, allowing computers to become smaller, faster, cheaper, more energy-efficient and more reliable than their first-generation predecessors. Though the transistor still generated a great deal of heat that subjected the computer to damage, it was a vast improvement over the vacuum tube. Second-generation computers still relied on punched cards for input and printouts for output.

IBM, which was the main developer of mainframe computers, recognized the benefits of the transistor and began transistorizing its mainframe model, the 709. The result was a large transistorized computer, the 7090, which was still considered a mainframe. The maturing of transistor technology in the late 1950s made it possible to produce solid-state computers at low cost. Machines could be made more compact, allowing multiple machines to be installed in the same sized location that once housed a single mainframe.

Transistors paved the way for the proliferation of the next generation of computers—minicomputers. The performance of a typical mainframe is limited by the word length being processed. A short word length does not provide enough bits in an instruction to specify sufficient memory addresses. Arithmetic operations are also limited to simple calculations. Minicomputers were able to get around the drawbacks of mainframes by using more complex instructions. These added complexity to the implementation when using vacuum tubes, but with the use of transistors the processor could still remain simple, inexpensive and fast.

Well-known minicomputers include the CDC 160A by the Control Data Corporation and the PDP-1 by DEC (Digital Equipment Corporation). The PDP-1 was the first commercial computer that emphasized interaction with the user rather than the efficient use of computer cycles. It was the first in a long line of computers that focused on interactivity and affordability. The introduction of the minicomputer opened new areas of application, bringing more users (e.g., engineers and scientists) into direct interaction with computing machines. This brought the notion of personal computing to the industry and represents a step closer to ubiquitous computing.

1.1.3 Third Generation – 1964-1971

“Grosch’s Law” [70] argued that a computer system that was twice as big, or cost twice as much money, would provide not twice but four times as much computing power. The reason this often-cited law was considered reasonable is that computers of that era used magnetic cores for storage, which were cheap but needed circuitry that was expensive to build. Thus, bigger capacity memory cores were considered more cost efficient than smaller memory cores.

The development of the integrated circuit sparked the creation of the third generation of computers. The integrated circuit replaced transistors, resistors and other discrete components in the processing units of computers. These circuits were miniaturized and mounted on silicon chips, drastically increasing the speed and efficiency of computers. Magnetic cores were subsequently replaced by solid-state memory units.

Integrated circuits effectively miniaturized the transistors, diodes and wiring of a printed circuit board. In 1970, integrated circuit makers adopted a standard for chips using transistors to implement what was called “transistor-transistor logic” (TTL). TTL chips were inexpensive and easy to design, and were packaged in black plastic or ceramic cases with pins arranged along either side of the chip to provide connectivity to circuit boards.

Advancement in integrated circuit manufacturing led to greater numbers of transistors being packed into a single chip, leading to the terms medium scale integration (MSI), large scale integration (LSI) and, finally, very large scale integration (VLSI) becoming used to indicate the number of transistors on a chip.

It was at this time that DEC redefined the concept of the bus architecture, which was first used in the electromechanical Mark I, and implemented it in the PDP-11 minicomputer. Nearly all the units of the machine, including memory and I/O devices, were connected to a common 56-line bus. This enabled DEC and its customers to configure installations for specialized applications or to expand them.

Keyboards and display monitors replaced punched cards and printouts, while being interfaced with an operating system that allowed many different applications to be run at any one time, with a central program monitoring and allocating the memory. Computers became more accessible to wider audiences as mass production enabled them to be made cheaper.

1.1.4 Fourth Generation – 1971-Present

The birth of the next generation of computers arose from a different paradigm from that of the minicomputer. Calculators that use integrated circuits had demonstrated this well in the marketplace. In order to achieve the same computing capabilities as minicomputers, a set of integrated circuits incorporating the basic architecture of a general-purpose, stored-program computer was needed. By the late 1960s, a new type of semiconductor, the metal-oxide semiconductor (MOS), emerged. This offered a way to place even more logic elements on a chip, enabling the concept of a computer-on-a-chip to be realized.

Intel produced a programmable calculator that incorporated many of the elements of a general-purpose computer and paved the way for the world’s first commercial microprocessor, the Intel 4004, released on November 15, 1971. The microprocessor consists of the central processing unit, memory and input/output controls, all on a single chip. Successful derivatives of these first microprocessors include the Pentium, SPARC, Athlon and Alpha processors.

Further technology advancement led to the use of processors, such as ARM and MIPS, in embedded systems, due to miniaturization and lower power consumption. The use of such processors allows the production of smart devices at work and at home, including everyday appliances such as television and radio sets, microwave ovens, watches and mobile phones.

Proliferation of embedded systems

Ubiquitous computing has revolutionized the way humans conduct daily routines that would have been considered laborious and mundane chores 50 years ago. Personal computing provides interactivity between user and computer, allowing greater pleasure and ease of use for such devices. With the advancement in chip manufacturing and design, it is now possible to develop products with computing capabilities in every aspect of life. Ubiquitous computing arrives when technology recedes into the background of daily routine. Alan Kay at Apple describes this era as the “Third Paradigm” [161] of computing.

[Figure: sales per year plotted against year for three waves of computing: Mainframe (one computer, many people), PC (one person, one computer) and Ubiquitous Computing (one person, many computers).]

Figure 1.1: Ubiquitous computing trend, sourced from M. Weiser, “Ubiquitous Computing”

Figure 1.1 illustrates the emerging trend of ubiquitous computing [161]; its penetration has exceeded that of personal computing. One avenue for ubiquitous computing is embedded systems: computing devices, already influencing human lives, that are found in computer-controlled vehicles, mobile phones, smart homes and digital home entertainment. The embedded system is not a separate entity, but is built into the device that it is controlling.

Embedded systems are application-specific computer systems designed to perform particular functions. In contrast to personal or general purpose computers, embedded systems are wired and programmed for very specific requirements. Because of this, it is possible to scale down the size and power requirements of such devices, so that they are just enough to complete the specific task. Designs can be further optimized to provide better cost efficiency and performance, while being mass produced and benefiting from economies of scale.

Examples of embedded processors include ARM, MIPS, ColdFire/68k, PowerPC, x86, PIC, 8051, AVR, Renesas and Xtensa, among others. Compared with the personal computer market, which is dominated by a few microprocessor architectures, there is a greater variety of processor designs to be considered in the embedded system space. Different architectures can be selected to suit particular functions and tasks.

In order to make it possible for embedded systems to be used in everyday devices, it is vital for designers to meet three main challenges in embedded system design – performance, area and power.

1.2 Design Challenges

Processor designs have increased in complexity in terms of the amount of logic used within the circuit, as well as in the features involved in providing fast computations and high throughput. Many of these innovations are derived from the successful development of microprocessors for personal computers and can now be exploited in embedded systems. The challenge in embedded system design is to provide a design which maximizes performance while taking into account the size of the product and the amount of power it consumes.

1.2.1 Performance

More challenging and computation intensive tasks are now being performed by computer systems as humans become more reliant on technology to improve productivity and increase efficiency in a competitive environment. Performance becomes more critical in situations where accurate and prompt responses are required to ensure the proper health care and safety of human beings. The following sections present methods that have been developed to increase the performance of microprocessor-based systems.

Pipeline

Pipelining is an implementation technique in which multiple instructions are overlapped during execution, so that more than one instruction is being executed at any one time. A pipeline, in this context, is a set of data processing elements connected in series, such that the output of one element is the input of the next. Buffer storage is often inserted between processing elements to prepare the processed data for the next processing stage in the pipeline flow.

This technique borrows from the practice in factory assembly lines, where different actions are performed at different stations of the assembly line, each action being performed simultaneously on a different item. Pipelining does not decrease the time for a single datum to pass along the processing line; it only increases the throughput of the system when it is processing a stream of data. Pipelining may also increase the time it takes for one instruction to finish, due to the additional resources needed to implement the pipeline flow. Figure 1.2 shows the pipeline stages which are implemented in an Intel Itanium Processor.
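The throughput benefit can be quantified with a simple model (an idealization added here for illustration; it assumes equal stage delays and no stalls, which real pipelines only approximate). With k stages of delay \tau each, n instructions take

\[ T_{\mathrm{seq}} = n\,k\,\tau \quad \text{unpipelined}, \qquad T_{\mathrm{pipe}} = (k + n - 1)\,\tau \quad \text{pipelined}, \]

so the speedup nk/(k + n - 1) approaches k for long instruction streams. For a 10 stage pipeline, a stall-free run of 1000 instructions would take 1009 cycles instead of 10000, a speedup of roughly 9.9.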

Figure 1.2: A 10 stage In-Order Core Pipeline of an Intel Itanium Processor, based on the Intel Itanium Processor Microarchitecture Overview [9].

Other pipelining concepts include graphics pipelines and software pipelines. Graphics pipelines are found in high performance graphics cards. The graphics pipeline consists of multiple processing units or complete CPUs which implement the various stages of graphic rendering operations.

Software pipelines represent multiple processes linked up so that the output stream of one process is the input stream to the next one. This technique is most often found in Unix systems, where a number of different individual programs can be linked up to perform complex tasks never envisaged by the individual programmers of each program.

SIMD

SIMD (Single Instruction, Multiple Data) is a technique to increase parallelism in data processing. SIMD machines have been exemplified by supercomputers such as the Cray X-MP, Connection Machine, ILLIAC IV, MasPar MP2 and the Distributed Array Processor (DAP). Supercomputers like the Cray X-MP were known as “vector processors”, as these machines operated on long vectors.

DSPs (Digital Signal Processors) are dedicated processors that perform the same tasks as SIMD instructions. However, DSPs are standalone processors with complex instruction sets. SIMD capability is integrated into general purpose processors to handle such data manipulation; the general purpose portion handles the program flow. DSPs are designed for specific data types, such as sound or video, whereas SIMD designs are general purpose.

SIMD instructions operate on small data units which can be contained in a single register. These units can then be operated upon in parallel using only a single instruction. However, reading data from memory requires that the data in memory should already be arranged suitably for such SIMD operations (i.e., address aligned and contiguous). Examples of SIMD instruction sets used for multimedia applications are MMX (Intel), SSE (Intel), SSE2 (Intel), 3DNow! (AMD) and AltiVec (Motorola/IBM).
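As a concrete illustration, the C sketch below uses Intel’s SSE intrinsics (one of the instruction sets listed above) to add two arrays of floats four lanes at a time. It is a minimal sketch: the function name is ours, n is assumed to be a multiple of four, and the arrays are assumed to be 16-byte aligned, reflecting the alignment requirement just described.

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Add two float arrays four lanes at a time. Assumes n is a
     * multiple of 4 and that a, b and out are 16-byte aligned,
     * as the aligned load/store intrinsics require. */
    void vec_add(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);   /* load 4 packed floats  */
            __m128 vb = _mm_load_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);   /* 4 adds, 1 instruction */
            _mm_store_ps(&out[i], vc);
        }
    }

A scalar version would perform one addition per loop iteration; here each _mm_add_ps performs four.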

Superscalar

In order to design faster processors, the simple pipeline processor can be extended to include longer and deeper pipeline stages. This technique is called superpipelining. For example, the Intel Pentium D has 31 pipeline stages, in order to scale the processor up to a higher frequency.

There are times when deeper pipeline stages would not be beneficial. Longer pipeline stages would normally increase the latency of execution of a single instruction, thus short operations would not benefit from an architecture with long pipeline stages.

A different trend is to replicate the internal components of the processor so that it can launch multiple instructions in every pipeline stage. A superscalar architecture is able to execute more than one instruction at a time by pre-fetching multiple instructions and simultaneously dispatching them to redundant functional units on the processor.

Configuration                 Pipeline stages
ALU or branch instruction     IF  ID  EX  MEM WB
Load or store instruction     IF  ID  EX  MEM WB
ALU or branch instruction         IF  ID  EX  MEM WB
Load or store instruction         IF  ID  EX  MEM WB
ALU or branch instruction             IF  ID  EX  MEM WB
Load or store instruction             IF  ID  EX  MEM WB
ALU or branch instruction                 IF  ID  EX  MEM WB
Load or store instruction                 IF  ID  EX  MEM WB
ALU or branch instruction                     IF  ID  EX  MEM WB
Load or store instruction                     IF  ID  EX  MEM WB

Table 1.1: Example of a superscalar processor with processing and memory access pipeline flows. The ALU and data transfer instructions can be issued at the same time.

The Intel Core 2 Duo reduced the number of pipeline stages of the Pentium D to just 14. However, every execution core within Core 2 is wider, allowing each core to complete up to four full instructions instead of just three in the Pentium D. In contrast with a pipeline processor, where instructions are still executed in sequence, a superscalar processor allows multiple instructions to be executed and committed at the same time.

Coprocessors

A coprocessor is a specialized functional unit that supplements the functions of the primary processor; by offloading processor-intensive tasks from the main processor, the coprocessor is able to increase system performance. Examples of coprocessor domains include floating-point, graphics, signal processing, string processing and encryption.

A coprocessor may not be a fully capable general-purpose processor, since it may not be able to fetch instructions from memory, execute program flow control instructions, manage memory or perform input/output operations. A main processor is required to fetch the coprocessor instructions and to handle all other operations.

There are more capable coprocessors that can carry out a limited range of functions under the close control of a supervisory processor. A system with a coprocessor is generally not considered a multiprocessor system.

Examples of coprocessor systems include the 486 DX2 with math coprocessor, the Atari 8-bit machines with their Video Display Controller, and the Commodore Amiga with its Copper graphics coprocessor. Current dedicated coprocessors include the Sound Blaster X-Fi for multimedia acceleration, GeForce processors for 3D graphics acceleration and PhysX for complex physics computations (so that the CPU and GPU do not have to perform these time- and resource-consuming calculations).

Multicore / multiprocessor

As users become more reliant on the efficiency of computer systems, a greater number of tasks is now executed on such systems at any one time. Different levels of parallelism have been explored, including instruction-level and task-level parallelism. With process-level parallelism, different applications can be executed on separate processors simultaneously without affecting the memory space of the other applications.

Multiprocessor systems are systems with two or more microprocessors within a single computer. Different applications (i.e., different processes) can be executed independently on different processors. Moreover, an application can be written and dissected into separate processes in order to make use of such a system. The technique of partitioning a program into distinct processes is called multiprogramming.

With the advent of the SoC (System-on-Chip) paradigm, more features can be fabricated into a single silicon chip. Multicore is a technique by which multiple processing cores are coupled and fabricated together into a single central processing unit (CPU) package. A multicore system is a miniaturized version of a system with separate packaging for the different processing cores. The miniaturized version saves area as well as power, thus making it possible for usable systems to be available in low-power mobile devices.

A homogeneous multiprocessor system consists of identical processing cores with the same instruction set architecture (ISA) and features. A homogeneous system would be easier to program, as the designer would not need to map applications to different processor designs. Homogeneous multicore systems are currently popular in desktop and server computer systems. For example, the Intel Core 2 Duo features two independent processor cores integrated on chip. Each core runs at the same clock frequency, and shares the L2 cache as well as the front-side bus. Systems which treat all cores equally are known as symmetric multiprocessing (SMP) systems, whereas examples of systems where resources are divided differently among the cores are asymmetric multiprocessing (ASMP) and non-uniform memory access (NUMA) multiprocessing systems.

Heterogeneous multiprocessor systems are used in performance- and energy-constrained systems. These systems contain two or more different processing core designs, which are mapped to different functions of the application. Each processing core is optimized for the functions to which it is mapped. Heterogeneous systems can be found in video, multimedia, cryptography and various computation-intensive applications. An example of a heterogeneous multiprocessor system is the Cell processor, developed by Sony, Toshiba and IBM, which contains a Power-architecture main processor and eight fully-functional RISC (reduced instruction set computer) coprocessors.

1.2.2 Area

Various techniques and methodologies have been developed to produce microprocessors on ever-decreasing silicon area, resulting in smaller chip packaging and making it possible to integrate chip technologies into a variety of applications. Further miniaturization would enable designers to include more features in a single chip design. Smaller area utilization would also result in lower power usage.

Moore’s Law

Gordon Moore, co-founder of Intel Corporation, made an observation in 1965 that each new memory integrated circuit contained roughly twice as much capacity as its predecessor, and that each chip was released within 18-24 months of the previous chip. He concluded that this trend would cause computing power to rise exponentially. Figure 1.3 shows the transistor count trend since the production of the first microprocessors. Moore’s Law still prevails despite many sceptical claims that a limit to the trend must eventually be reached.

[Figure: transistor count (log scale, 10^3 to 10^10) versus year, 1970 to 2010, for processors from the 4004, 8008 and 8080 through the 80286, 80386, 80486, Pentium series, AMD K5-K8, Itanium and Cell, up to the Core 2 Quad and Dual Core Itanium 2.]

Figure 1.3: Transistor count for each processor generation
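Moore’s observation can be restated as a simple growth model: if the transistor count doubles every T months, then

\[ N(t) = N(t_0)\, 2^{(t - t_0)/T}. \]

Taking the longer doubling period of T = 24 months and the roughly 2,300 transistors of the 1971 Intel 4004 as the starting point, the model predicts about 2,300 × 2^17 ≈ 3 × 10^8 transistors by 2005, in line with the counts plotted in Figure 1.3.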

In order to keep up with the increasing amount of logic in integrated circuits, fabrication technology has advanced well in miniaturization techniques. Very Large Scale Integration (VLSI) paved the way by fabricating transistors onto silicon wafers. Photolithography is the use of light and lasers to etch and mask layers onto silicon wafers. Producing the masking technology to fabricate and mask ever smaller wires onto silicon wafers is a challenge in itself; 130nm, 90nm, 65nm and 45nm wire sizes are currently possible, due to technological advances in silicon chip fabrication processes.

Specialization

General purpose processors are not area efficient, as they contain logic and features that would not be used in certain applications. However, these processors are programmable for a wide range of applications, and so are used in embedded systems where rapid design turnaround is crucial.

By specializing and customizing for a particular application, redundancy can be reduced. Microprocessor designs can be customized (i.e., cache sizes, bus width, instruction set architecture, etc.) depending on how computation intensive the applications are.

To achieve further area reduction, application specific integrated circuits (ASICs) are used. ASICs are popular in embedded systems, as they are small, compact and power efficient. Such circuits achieve high computing performance but are difficult and tedious to design.

1.2.3 Energy

Energy consumption has become an important factor in recent years with the widespread use of portable computing. A great deal of research and development has been done, focusing on power consumption analysis and on finding ideas and solutions to decrease power dissipation in processor devices. Power reduction can be carried out at different levels: the technology, circuit, architecture and system (software) levels. Devadas et al. have produced an excellent survey of power reduction efforts in [53].

Circuit Level

Choices about the placement of individual transistors in a gate will affect the power dissipation of the circuit design, while the ordering of gate inputs will affect both the power and the delay of the signal. Improvements in power and delay can be obtained by changing the order of transistors within individual complex gates.

Transistor sizes may also have a significant impact on circuit delay and power dissipation. Delay decreases if the transistors in a given gate are increased in size; however, this increases power dissipation, and the delay of fan-in gates increases due to the increased load capacitance. The usual approaches to this problem involve computing the slack at each gate in the circuit, where slack is the extent to which the gate can be slowed down before reaching the critical delay of the circuit. Transistor sizes in sub-circuits with positive slack are reduced until the slack becomes zero, or until the transistors are at their minimum size.
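A sketch of this slack computation is shown below. It is a simplification of what a real sizing tool does: the netlist is assumed to be topologically ordered, and slack is measured against a single global delay target rather than per-gate required times from a backward pass; all names are illustrative.

    #define MAX_FANIN 4

    /* One gate in a topologically ordered netlist (illustrative). */
    typedef struct {
        double delay;          /* intrinsic gate delay           */
        int    nin;            /* number of fan-in gates         */
        int    in[MAX_FANIN];  /* indices of driving gates       */
        double arrival;        /* computed: latest input + delay */
        double slack;          /* computed: target - arrival     */
    } Gate;

    /* Forward pass: compute arrival times, then slack against a
     * global critical-delay target. Gates with positive slack are
     * candidates for downsizing until slack reaches zero or the
     * transistors reach minimum size. */
    void compute_slack(Gate *g, int ngates, double target)
    {
        for (int i = 0; i < ngates; i++) {
            double latest = 0.0;   /* primary inputs arrive at t = 0 */
            for (int j = 0; j < g[i].nin; j++) {
                double a = g[g[i].in[j]].arrival;
                if (a > latest) latest = a;
            }
            g[i].arrival = latest + g[i].delay;
            g[i].slack   = target - g[i].arrival;
        }
    }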

Logic Level

Combinational logic optimization is performed in two distinct phases. The first phase involves technology-independent optimization, in which logic equations are manipulated to reduce area, delay or power dissipation. The next phase, technology-dependent optimization, involves mapping the equations to a particular technology library using technology mapping algorithms, again optimizing for area, delay or power.

The logic level can also be further optimized by factoring logical expressions: finding common subexpressions across multiple functions and reusing them. Algorithms are used to maximally reduce the literal count within the given expressions. However, when targeting power dissipation, the cost function is not necessarily the literal count but could be the switching activity. Logic circuits with reduced expressions result in smaller circuit area and thus in lower power usage, primarily due to leakage power.
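A small example makes the factoring step concrete. The sum-of-products form

\[ F = ab + ac + ad \]

has six literals, while the factored form F = a(b + c + d) has four. Fewer literals generally means fewer transistors, and hence smaller area and lower leakage; whether factoring also reduces switching activity depends on the signal statistics, which is why power-directed optimization may use activity rather than literal count as its cost function.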

Large VLSI designs contain several distinct components, such as register files, arithmetic units and control logic. Several of these units are clocked but would not be in use at the same time. Similarly, register values from unused sequential circuits need not be updated regularly. Power reduction can be obtained by gating the clocks of these registers, thus reducing the switching activity within the registers as well as the sequential logic to negligible levels.
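The benefit of clock gating follows directly from the standard dynamic power model

\[ P_{\mathrm{dyn}} = \alpha\, C_L\, V_{dd}^{2}\, f, \]

where \alpha is the switching activity factor, C_L the switched capacitance, V_{dd} the supply voltage and f the clock frequency. Gating the clock of an idle register bank drives its \alpha toward zero, removing its dynamic power contribution while it is unused and leaving only leakage.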

Architecture / System Level

Power analysis at the architecture and system level is a valuable way to quickly estimate and design complex SoC (System-on-Chip) systems within the power and performance constraints of the application. Power analysis tools can be of great use to designers, helping them to explore the design space manually. Power models can be obtained by estimating the capacitance that would switch when a particular module is activated [127]. Alternatively, average power costs can be assigned to individual modules, in isolation from others. The power costs of the modules involved in a given computation are then added up during simulation.

Architectural power optimization allows power-aware mapping of the control/data flow graph to functional units, and of variables to registers, as well as the definition of interconnects (i.e., multiplexors and buses). Hardware sharing and the mapping of operation sequences to functional units affect the total switched capacitance in the data path. Thus, accurate architecture and system modeling allows architectural and algorithmic tradeoffs and optimizations that can be used for low power designs.
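A minimal sketch of the second scheme, module-level accounting during simulation, is given below; the module set and per-activation energy figures are illustrative placeholders, not characterized values.

    /* Module-level energy accounting during simulation (a sketch;
     * module names and energy figures are illustrative only). */
    enum { MOD_ALU, MOD_REGFILE, MOD_ICACHE, MOD_BUS, NUM_MODULES };

    static const double nj_per_use[NUM_MODULES] = {
        0.12,  /* ALU operation         */
        0.05,  /* register file access  */
        0.30,  /* instruction cache hit */
        0.22   /* bus transaction       */
    };

    static double energy_nj[NUM_MODULES];

    /* Simulator hook: called whenever a module is activated. */
    void module_activated(int m) { energy_nj[m] += nj_per_use[m]; }

    /* Total energy is the sum over modules; dividing by the
     * simulated runtime gives an average power estimate. */
    double total_energy_nj(void)
    {
        double e = 0.0;
        for (int m = 0; m < NUM_MODULES; m++)
            e += energy_nj[m];
        return e;
    }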

The increasing use of programmable processors in embedded systems has led to the development of Application Specific Instruction-set Processors, which consist of a software and a hardware component. The software component runs on a dedicated microprocessor (which may be general-purpose), while the hardware portion consists of application-specific circuits. It is therefore important to explore the combined hardware and software design space when evaluating designs for low power. Methodologies need to be developed to decide which portion of the application should be executed in hardware and which in software.

Software techniques can also be used to decrease power consumption during execution. Mehta et al. [119] proposed a novel compilation technique which reduces energy consumption through proper register labeling during the compilation phase. The technique reduces the energy of the processor by encoding the register labels such that the sum of the switching costs between all the register labels in the transition graph is minimized, thus reducing the energy of the register file decoder.
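The objective behind such register relabeling can be stated compactly (our paraphrase of the idea, not the exact formulation of [119]): if E is the edge set of the register transition graph, w(u, v) the frequency with which registers u and v are accessed consecutively, and H the Hamming distance between binary encodings, the labeling \ell is chosen to minimize

\[ \sum_{(u,v)\in E} w(u,v)\, H\big(\ell(u), \ell(v)\big), \]

so that frequently adjacent registers receive encodings differing in few bits, reducing switching in the register file decoder.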

1.3 Extensible Processor Platform

The heart of an embedded system is usually implemented using either general purpose processors, ASICs or a combination of both. General Purpose Processors (GPPs) are programmable, but consume more power than ASICs. Reduced time to market and minimized risk are factors which favour the use of GPPs in embedded systems. ASICs, on the other hand, cost a great deal to design and are non-programmable, making upgradability an impossible dream. However, ASICs have reduced power consumption and are smaller than GPPs.

Extensible processor systems provide an alternative contender for implementing functionality in embedded systems. These are processors with specialized instructions, selected co-processors, and parameterized caches applicable only to a particular program or class of programs. An extensible processor will execute an application for which it was designed with great efficiency; it is also capable of executing any other program, but usually at a greatly reduced efficiency.

An extensible processor consists of a RISC-based processor core that can be extended for a specific application using optional custom instructions. The extensible processor platform provides a configurable microprocessor core that handles not only conventional processor tasks, but is easily augmented to deliver the performance and power of custom logic. In contrast with rigid processor cores designed for workstation architectures, the extensible processor is highly optimized and small enough to be integrated into low-powered embedded systems.

Extensible processor platforms provide designers with a design tool chain ranging from processor generation utilities to compilation tools for each different processor design. Designers obtain a compiler, linker, assembler and debugger tuned exactly for the configured hardware, and are then able to profile a specific application and mould the processor to fit it. New instructions can be developed and automatically used by the compiler when new extensions are developed for an extensible processor. Various configurable options can also be chosen to rapidly explore the design space of the processor with respect to performance, power and area use.

Tools such as ASIPmeister [85], Tensilica [20], ARCtangent [4], Jazz [10], Nios [1] and SP5-flex [15] allow rapid creation of extensible processor systems. The designer effectively reduces the design risk with post-silicon programmability using processors instead of RTL blocks.

1.3.1 Base Configurations

An extensible processor platform provides the designer with a highly optimized and flexible microprocessor core. The core is extremely efficient, small and low-power, with better code density than the conventional microprocessors used in workstation computers. Processor options (memory management unit (MMU), registers, local memory types and sizes, hardware multipliers, SIMD instructions and various other instruction-set options) are designer-configurable. An extensible processor may also provide options for different pipeline stage implementations depending on the needs of the application, as well as various instruction encodings (e.g., 16- or 24-bit) for higher code density.

1.3.2 Extensible And Customized Instructions

Extensible processor platforms provide tools and a framework to enhance the base processor core with customized hardware logic. New functional units and extensible instructions can be added using platform-specific languages. Application-specific instruction extensions provide order-of-magnitude improvements in application performance while eliminating RTL blocks from SoC designs. Synthesizable code can be obtained, together with the software tool chains for customized architectures.

Additionally, a variety of extensible platforms include pre-verified IP blocks that can be used in the customized design. Floating-point units (FPUs) can be added or removed, depending on the needs of the application. Much research has been done to automate the synthesis of extensible instructions as well as the mapping of application code to hardware or software blocks.

1.3.3 Architectural Configurations

An extensible processor platform provides a development environment and modeling tools for the system designer to rapidly explore the design space for different architectural configurations, including instruction and/or data cache sizes, communication bus widths, coprocessor integrations and multiprocessor configurations.

More advanced extensible platforms (e.g., the Tensilica Xtensa) provide an environment for processor subsystem modeling and simulation. This allows rapid assembly of system-level simulations of one or more processors and various memories and building blocks. With such a framework, designers can rapidly build and simulate complete SoC subsystems that comprise multiple homogeneous or heterogeneous processors.

1.3.4 Heterogeneous Multiprocessor Via An Extensible Platform

A single processor system is able to execute only one task at any given time. However, it is well known that an application might have some tasks which can be executed independently of each other. Thus, partitioning an application into separate independent tasks would enable execution of multiple tasks at the same time. Various systems have been implemented that provide a platform where multiple processing entities can perform computation on different parts of the system concurrently [125, 100].
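As a minimal illustration of such partitioning, the C sketch below uses POSIX threads to run two independent tasks of a hypothetical application concurrently; on a multiprocessor, each thread may execute on its own processing entity. The task bodies are placeholders.

    #include <pthread.h>

    /* Two independent tasks of a hypothetical application. */
    static void *task_a(void *arg) { (void)arg; /* e.g., decode input  */ return 0; }
    static void *task_b(void *arg) { (void)arg; /* e.g., filter output */ return 0; }

    int main(void)
    {
        pthread_t ta, tb;

        /* Launch both tasks; they execute concurrently. */
        pthread_create(&ta, 0, task_a, 0);
        pthread_create(&tb, 0, task_b, 0);

        /* Wait for both to finish before exiting. */
        pthread_join(ta, 0);
        pthread_join(tb, 0);
        return 0;
    }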

Homogeneous systems consist of processors that are identical, such as those in Symmetric Multiprocessing (SMP) systems. Such systems are single-ISA systems, whereby an identical ISA (instruction-set architecture) is used for all processing cores in the system. Heterogeneous processor systems use differing processing entities to maximize performance while minimizing area and power consumption. Such systems may consist of a network of ASIPs, DSPs, coprocessors and ASIC components fabricated on the same silicon die. Each component would be mapped and assigned to specific functions, thus executing multi-threaded applications.

Existing approaches to heterogeneous processor architectures typically map critical regions of software into hardware (i.e., DSP, ASIC, etc.). Each hardware component is optimized and suited to its specific mapped region to maximize performance. To increase the efficiency and performance of critical systems, extensible processors [1, 4, 15, 20] have been used in SoCs. The instruction set and underlying architecture of an extensible processor can be configured for specific applications in order to improve efficiency. Extensible processors provide a good trade-off between efficiency and flexibility, as the same design can be reused between different product variants and updated at little additional cost.

1.4 Design Automation

Microprocessor design has become more complex as market expectations of performance, reliability and pricing steadily increase. Ever more demanding application requirements and the ubiquity of computing devices in daily chores call for rapid deployment and greater flexibility of computer systems. The advent of computer-aided design has brought numerous technology advances, with faster processors and innovative solutions. However, market pressure for reductions in price and product turn-around time is motivating the need for comprehensive automation of computer system design.

Increasing complexity

Microprocessor complexity has steadily increased over the years, as features for improvement of parallelism and power optimization are added to designs. Figure 1.3 shows the transistor counts in major processors today, compared with the first-generation microprocessors of the 1970s. Such complexity is a burden to system designers, as microprocessor design becomes more sophisticated and requires more time for verification and testing. Additionally, comprehensive design-space exploration over a multitude of processor configurations demands a systematic and automatic approach.

Decreasing price

Increasing demand, and hence production volume, generally decreases the prices of products for end consumers, due to economies of scale. However, the ever-increasing complexity of processor design tends to curb the production of cheaper computer systems: more time and money must be spent on designing and verifying complex microprocessor systems, and greater numbers of designers are needed to address the increasing features being integrated into designs.

To meet customer expectations, manufacturers must make technological innova- tions so as to sustain the trend in price reduction. Computer-aided design and design automation is capable of greatly reducing the design effort and time involved in the design of complex SoC systems.

Decreasing product lifetime

Manufacturers and developers are required to keep up with development trends in the market. In the early days of the computer industry, new models were released no more than once every five years; software and user requirements did not change as rapidly as today. By the 1990s, a computer's saleable lifespan could be as little as 1.5 years. In today's computer industry, new microprocessor models from Intel and AMD are released more often than once a year. In the embedded systems market, such as for mobile phones and music players, new models have to be produced to keep up with the frequent technology and software advancements in the multimedia and communication areas, due to CODEC (coder-decoder) updates and new features, such as faster protocols and new user interfaces.

Shrinking product lifetimes require a high design turnaround rate, providing another motivation for an automated design approach for new products. There is a need for automated design-space exploration in order to rapidly identify the most cost-effective configuration of a particular product. A company which is able to keep up with this rapid change of product models will remain competitive in the marketplace and, in the long run, provide consumers with cheaper and more innovative products.

1.5 Motivation

A computer system consists of software and hardware that are integrated to perform the task intended of the system. The current design process for software is not fully automated and involves several stages that need the intervention of the software designer. However, some of the stages have been fully automated, including the profiling and compilation stages provided in a standard software development toolkit.

In a similar way, design automation is required at various stages of the microprocessor development process.

Figure 1.4 shows the various stages involved. The designer begins the development process with a description of the intended processor and the requirements of the intended application. The Hardware Design stage processes this information and produces a detailed list of the components and interconnections of the microprocessor. This stage also provides information relevant to building the physical design. In the Physical Design stage, the information from the previous stage is used to place and route the various components efficiently, using the available technology libraries of the fabrication process.

The next three stages involve the fabrication of the design in silicon and packaging.

Testing is then performed to determine whether the device meets all its specifications: patterns of bits, called test vectors, are applied to the device and its behaviour and performance observed. Finally, the developed product is deployed and shipped to the customers.

This thesis focuses on the integration of the hardware design and software design stages, termed hardware-software codesign.

Figure 1.4: Microprocessor Development Process

The trend of increasing numbers of transistors in chips continues, upholding Moore's Law. Figure 1.3 shows the continuing trend in the microprocessor domain, which has been made possible by the continuing improvement in miniaturization techniques by manufacturers.

The availability of this technology provides designers with the resources needed to include more and more features and components in microprocessors. Systems that were previously manufactured on circuit boards can now be fabricated in a single chip package, a System-on-Chip (SoC).

Semiconductor fabrication facilities now offer circuit design teams more transistors than they can use; in other words, engineers are not able to keep up with the rate at which manufacturers are making new transistors available for new chip designs. With such an abundance of resources, a methodology and an automated way of making efficient use of them is needed to provide better features and improved performance.

Consumer applications have become more sophisticated and consumers demand higher performance. Multimedia applications require high performance in order to provide high-quality video and music to the audience. Safety devices, such as automotive speed monitoring and anti-lock braking systems, require rapid response times in order to ensure the safety of users. In the health industry, camera pills (which are swallowed by patients) require fast processing of images while maintaining low power consumption.

Increased design costs and shorter design deadlines, combined with consumer expectations of reduced prices, can quickly erode profit margins, sometimes eventually driving a manufacturer into bankruptcy; this calls for increases in designer productivity. Higher productivity reduces both design cost and time, thus reducing product cost and eventually allowing prices to drop in the marketplace.

Application-specific designs, such as ASICs and DSPs, are fast and efficient. However, these designs are not flexible enough to run other applications or to accommodate major changes that may be needed in the algorithms. Thus, current design trends are toward embedded systems built around programmable, customizable general purpose processor cores, which require low design turnaround time and yet are power efficient. With custom processors, we can add or remove instructions, and map portions of code incrementally using extended instructions. Customization is a good approach to trading off performance and power.

Current trends have shown the need for more parallelism in designs, so as to achieve higher speed while keeping power demands to a minimum. More cores can now be added to System-on-Chips (SoCs) to take advantage of multi-threaded systems. As more cores are added, the system becomes more sophisticated to design, presenting a larger design space. There is a growing need for the automated design of multiprocessors in embedded systems, particularly for application-specific multiprocessor systems with a multitude of parameters; exploration of such a large design space is time-consuming.

1.6 Research Goals and Contributions

This thesis presents several methodologies and case studies to address the following problems:

1. The complexity of embedded applications has steadily increased to provide more services and functionality in mission-critical systems. High speed general purpose processors (GPPs) would not be appropriate for such systems, due not only to the power and area constraints in an embedded system but also to the generality of the GPP, which would not allow it to provide enough processing power in high speed applications. An extensible processor framework is used to provide the necessary customization and processor speed without using a large area.

2. Customization of processors to speed up application performance requires valuable time and resources. A formal methodology should be available that would ensure proper code identification, selection and mapping to customized hardware. This would address the increasing importance of reduction in design turnaround time.

3. Parallel processing has evolved from superscalar to multiprocessing architectures. As more processing cores are included in System-on-Chips (SoCs), more formal approaches have to be used to partition and assign code to the various processing entities (PEs) in an SoC.

1.7 Thesis Overview

This thesis explores a variety of different approaches to optimizing SoC systems, by focusing on a particular application. The approaches represent methodologies for generating a microprocessor-based design, maximizing its performance while taking into account the area use and power consumption of the completed system. The thesis explores the advantages of extensible processors that offer customization of components and instruction sets, while maintaining a certain degree of programmability and generality of the system. The work goes on to exploit parallelism within loops and tasks in generating application-specific processor systems. While custom circuits and processors for specific applications have been studied in the past, the work presented in this thesis demonstrates the use of novel techniques and methodologies for designing an extensible processor platform.

Chapter 2 is a necessary literature review of the background, including design approaches to embedded systems, general purpose processors (GPPs), Application-Specific Instruction-set Processors (ASIPs), customized instructions, parallelism in hardware and software, and coprocessor integration. This chapter also points out the importance of different levels of parallelism (i.e., instruction level (ILP), loop level and thread level) and the use of multiprocessor systems.

Chapter 3 describes the nature of the approaches presented in this thesis. This chapter provides a philosophical overview of the approaches, the actions taken and the overall flow of the thesis, justifying the relationships between the various chapters and the motivation behind each and every approach presented. The chapter also describes how the overall flow fits within the goals of this thesis.

Chapter 4 presents a methodology for the rapid creation of different processor configurations by the removal of unused instruction sets from the architecture. The scheme uses ASIPmeister, a processor generation tool which generates VHDL (Very High Speed Integrated Circuit Hardware Description Language) code for the configured processor. A profile-directed approach is used to identify frequently used instructions and to isolate unused opcodes from the available instruction pool. The methodology used to reduce and generate represents one of the fastest methods of generating an application-specific processor. Five benchmark applications are used for profiling and processor generation.
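
A minimal sketch of the profiling step, assuming the instruction trace is available as a stream of opcode numbers on standard input (the trace format and the 64-entry opcode space are assumptions made for illustration):

    #include <stdio.h>

    #define NUM_OPCODES 64          /* assumed size of the opcode space */

    /* Count how often each opcode appears in an execution trace, then
       report the unused ones: these are candidates for removal from
       the generated processor description.                            */
    int main(void)
    {
        unsigned long count[NUM_OPCODES] = { 0 };
        int op;

        while (scanf("%d", &op) == 1)   /* one opcode per trace entry */
            if (op >= 0 && op < NUM_OPCODES)
                count[op]++;

        for (op = 0; op < NUM_OPCODES; op++)
            if (count[op] == 0)
                printf("opcode %d unused: candidate for removal\n", op);
        return 0;
    }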

Chapter 5 addresses the problem of achieving further speedups on computation-intensive loops in applications. The chapter presents an approach for accelerating loops by tightly coupling a coprocessor to an ASIP. A case study on a JPEG encoding algorithm is conducted to illustrate the advantages of such an approach. One of the main loops of the JPEG algorithm is accelerated by using the proposed coprocessor architecture. The acceleration is contrasted with two other approaches: a conventional loosely coupled coprocessor and an extensible processor with custom instructions. The conventional coprocessor is synthesized via a high-level synthesis (HLS) approach, which is a faster method of generating coprocessors.

Chapter 6 explores the design space of a heterogeneous multiprocessor system via a case study on a JPEG compression algorithm. The chapter presents a novel architecture in which the sequentially streaming JPEG application is partitioned into tasks that can be executed in parallel, and executed on multiple processors arranged in a pipelined manner. Each algorithmic stage is executed on a single processor, with queues between processors for communication. By carefully customizing each processor, the pipeline is balanced (i.e., processing times are nearly equal), allowing speedups with little overhead. The case study provides an analysis of the effect of increasing the number of cores in the system and the extent to which performance improvement can be achieved. This chapter also includes a comparison of the pipeline model against a master-slave architecture.

Building on the case study of Chapter 6, Chapter 7 presents a methodology to create a heterogeneous multiprocessor system using ASIPs as processing entities in a pipeline configuration. The problem of mapping each algorithmic stage in the system to an ASIP configuration (i.e., cache sizes and instruction set) is formulated. The chapter describes two heuristics which efficiently search the design space for a pipeline-based multi-ASIP system that maximizes the performance/area ratio. An estimation technique is also developed to simplify the search space and to quickly provide the run time of an application when the individual run times of each pipeline stage are known. Two benchmark applications are used: JPEG and MP3 encoders. The heuristic solution is reached within a fraction of a second, while the exhaustive approach takes much longer.
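
The intuition behind such an estimate can be sketched as follows. This is the standard fill-then-drain pipeline model, not necessarily the exact estimator developed in Chapter 7; per-stage processing times and negligible queue overhead are assumed:

    /* Estimate the run time of a k-stage processor pipeline on n data
       items: fill time (one pass through every stage) plus (n - 1)
       further items emerging at the rate of the slowest stage. Queue
       and synchronization overheads are ignored.                      */
    double estimate_pipeline_time(const double stage_time[], int k, long n)
    {
        double fill = 0.0, slowest = 0.0;
        for (int i = 0; i < k; i++) {
            fill += stage_time[i];
            if (stage_time[i] > slowest)
                slowest = stage_time[i];
        }
        return fill + (double)(n - 1) * slowest;
    }

The expression makes the benefit of balancing explicit: once the stage times are nearly equal, the throughput term (n - 1) * slowest approaches its minimum for a given total amount of work.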

Chapter 8 finally concludes with a summary of the thesis, its achievements and future work that could be undertaken.

Chapter 2

Literature Review

2.1 Introduction

Customization has been vital in creating systems that are power efficient, portable and reliable, while also meeting performance requirements, thus providing a competitive advantage in the industry. This chapter provides a necessary review of the literature on design automation for the customization of extensible systems. The chapter opens with a historical overview of embedded systems and the ways they have evolved over time. In later sections, the chapter describes the various architectural designs available for embedded systems. Focus is then directed to the customization of these architectures and to how parallelism within code is exploited. A thorough review of previous research in design space exploration is then provided.

2.2 Embedded Systems

Information processing has traditionally been associated with computers and mainframe servers. However, advances in miniaturization have brought the possibility of integrating processing and computing technologies into smaller devices, such as communication systems, televisions, radios, vehicles and tabulating machines. An embedded system is defined as "a combination of computer hardware and software, and perhaps additional mechanical or other parts, designed to perform a dedicated function. In some cases, embedded systems are part of a larger system or product, as in the case of an antilock braking system in a car." [33].

During the era of the first digital computers, such as the ENIAC, EDVAC and UNIVAC, computers were also dedicated to a certain task but were too large to be considered embedded systems. The programmable controller concept soon evolved from computer technology, solid-state devices and traditional mechanical devices. Embedded systems form the basis of the so-called post-PC era, in which information processing is moving away more and more from PCs alone to embedded systems [118]. Embedded systems provide applications for almost every function in daily life; such types of application have been called ubiquitous computing [161], pervasive computing [76] and ambient intelligence [25].

An embedded system consists of two major components: an operating system that controls the microprocessor in the device, and the application package which runs on that operating system. However, an embedded system can also be constructed solely from hardware circuits (i.e., ASICs) that perform a highly specific function. Cisco, Wind River Systems, Sun Microsystems, Integrated Systems, Microware Systems, and QNX Software Systems are among the prominent developers of embedded systems.

Embedded systems are now deeply integrated into everyday appliances. A vehicle may have over 50 microprocessors, controlling functions such as engine management, braking with electronic anti-lock brakes, transmissions with traction control and electronically controlled gearboxes, safety with airbag systems, electric windows and air-conditioning. Figure 2.1 provides an overview of a typical 'smart car' design. A production example is the Lexus LS [12], with object-recognition pre-collision systems, self-steering Lane Keep Assist, and an automated parking system. Even a washing machine may have an embedded system that provides different washing programs, power controls for the motors and pumps, and video display controls.

Current mobile phones have more processing power than the first-generation computers. These phones have embedded processors which process video and audio, while storing and delivering information via communication networks.


Figure 2.1: Embedded systems in a vehicle provide a multitude of features

The first recognizably modern embedded system was the Apollo Guidance Computer [2], developed by Charles Stark Draper at the MIT Instrumentation Laboratory in the 1960s. The Apollo Guidance Computer was the first to use integrated circuits to reduce its size and weight; the use of the then-novel monolithic integrated circuits was considered the riskiest item in the Apollo project. The Apollo Guidance Computer (AGC) – see Figure 2.2 – had the same basic architecture as a modern day computer, with an instruction set, registers, a control unit, memory, interrupts, a user interface and software.

The first mass-produced embedded system was the Autonetics D-17 guidance computer [113] for the Minuteman missile, released in 1961. The D-17 was built using transistor logic, and its main memory was a hard disk drive. The Minuteman II (1966) replaced the D-17 with a new computer that incorporated the first high-volume use of integrated circuits (ICs). Through the use of integrated circuits, the prices of embedded systems were eventually reduced dramatically, permitting their use in consumer products.

Early microprocessors include the AL1, which was conceived by Lee Boysel [36] and developed using a small number of MOS (Metal-Oxide Semiconductor) chips; this microprocessor, however, was not a single chip and was sold only as an entire computer system. Texas Instruments offered the first microcontroller in 1972: the TMS1000 [18], containing 1 kB of ROM and 32 bytes of RAM with a simple 4-bit processor. This microcontroller controlled the Texas Instruments calculator. Rockwell International also specialized in random logic with four-phase design, aided by internally developed computerized design tools. Rockwell had developed the world's first commercial chipset for hand-held calculators for the Japanese Sharp Corporation in 1968. In 1972, Rockwell announced the PPS-4 [14], a chip that was similar to Intel's MCS-4 (the microcontroller family containing the Intel 4004) and directly competed with it. The PPS-4 provided a higher degree of parallelism than the Intel 4004; although the former was implemented using a slower technology, its performance was comparable [59].

Figure 2.2: The Apollo Guidance Computer: (a) Interface Box; (b) Bloch Logic Module. Sourced from Computer History Museum [2]

2.2.1 Integration of logic-based circuits

Until the 1970s, most control system designs were implemented using individual logic integrated circuits. As integrated circuits developed, separate logic functions were integrated to create higher-level functions. For example, a complete adder could be integrated as a single package rather than being constructed out of individual logic gates. As more logic was packaged into a single package, the number of chips was reduced. The microprocessor revolution started in 1971 with the introduction of the Intel 4004 [59] microprocessor, used in calculators.

2.2.2 Functional Upgrades

The introduction of the microprocessor was prompted by the need for faster processing at lower cost when developing calculators. However, the ability to add or remove functionality from an embedded system was also of paramount importance. As most of the functionality of a calculator is implemented in software, it is possible to create a new product with new functionality without much change to the hardware; thus, different systems are able to share the same hardware base.

Such a programmable framework allows easy maintenance when debugging the functionality of the product, and expensive repairs can again be avoided. Program updates can be performed by simply replacing the software, which is normally stored in ROM (Read-Only Memory) chips.

2.2.3 Analogue Replacement

Digital signal processing (DSP) is the transformation of signals that have a digital representation. DSP is replacing conventional analogue processing in a number of fields and has diversified into various types of processing task, such as voice compression, image recognition and robotic control systems. In the 1940s, the feasibility of using digital elements to construct a filter was discussed. By the 1950s, increasing access to mainframe computers stimulated the development of digital signal processing applications; a limited set of digital signal processing techniques was used by seismic scientists [130].

Figure 2.3: Architecture of the Intel 4004 microprocessor.

Figure 2.4: Example of applications utilizing DSPs (GPS devices, mobile phones and DVD players, using algorithms such as FFT, MDCT and DCT, Kalman and FIR filtering, GSM/GPRS/3G, H.261/H.264, MPEG and MP3).

In the 1960s, the introduction of integrated circuits offered the possibility of a complete digital signal processing system. Kaiser [91] made contributions in the area of digital filter design, while Cooley et al. [48] contributed to the development of a fast method of computing the Discrete Fourier Transform (DFT); many variations and extensions of this method are known as Fast Fourier Transforms (FFTs). It appeared that digital processing would provide less resolution than analogue processing. However, with digital processing, signals or data can be easily manipulated, and algorithms can be easily changed due to their inherent programmability. DSP demonstrates cost, size and performance advantages that are not available in analogue processing [61].

2.3 Architectural Designs

Different usage in applications has led to a wide variety of computing devices, each belonging to a different architectural family with a different method of computation. The term "computer architecture" has been a focal point of debate in the past. Dasgupta [51, 52] defined architecture as an abstraction of some hardware/firmware system, in that it describes the structure, behaviour and performance of the hardware/firmware device viewed as an abstract information processing system. For Hennessy et al. [78], computer architecture is the aspect of a digital computer that comprises the instruction set architecture (instruction set, memory address modes and registers), microarchitecture (constituent parts of the interconnected system) and system design (all other hardware components).

Microprocessors are used in a variety of disparate markets, each with different requirements in its respective field, and so must be designed specifically to cater for the expectations and workloads of the appropriate markets. The different markets include those for personal computers, graphics, audio and portable players; these markets differ in terms of processing, power and size requirements. Microprocessors can be designed to be highly multipurpose, such as general purpose processors, or more application-specific, such as Digital Signal Processors (DSPs).

2.3.1 General Purpose Processors

The Intel 4004 [59] was introduced in 1971 and is known as the world's first general purpose microprocessor; it was meant to be used in the Busicom calculator and was designed such that new functions could be programmed easily without a complete redesign of the system. General purpose processors (GPPs) are commonly used in desktop, laptop and server computers. The ability to run a wide range of programs efficiently has made this architecture among the more technically advanced, compared with other architectures. However, its disadvantages include higher cost and power consumption than special purpose processors.

US$44 billion worth of microprocessors were manufactured and sold in 2003. Nearly 50% of this revenue came from CPUs used in desktop, laptop and server computers, yet these represented only 0.2% of the number of microprocessors sold. The embedded system market thus represents the major target for microprocessor sales.

The instruction set architecture (ISA) defines the design of a GPP. General purpose processors can be categorized into two types: CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer). Early computers were naturally RISC machines due to their limited instruction sets. However, as more features were added, more complex and sophisticated instruction sets were introduced, since it was necessary to provide an assembly language which was easier to use. Consequently, the design of the processor became more complex, owing to the more complex decoding of instructions; these designs incurred a major cost by requiring more complicated logic to implement. The x86 architecture created by Intel is a CISC-type architecture.

In the 1980s, RISC architectures became more popular in microprocessor systems, especially in embedded systems. The traditional CISC architecture uses many instructions which perform long, complex operations. In contrast, a RISC instruction executes much more quickly than a CISC instruction, allowing most computational tasks to be processed faster. Most current general purpose processor systems use both CISC and RISC architectural techniques.

Software for embedded devices is usually optimized in order to provide low storage usage, low memory and power requirements and high performance in execution. A variety of optimizations and features have been added to general purpose architectures in order to exploit data-level parallelism and to perform specific tasks more efficiently [132, 79]. Architectural enhancements include superscalar execution [143] and pipelining [78]. Techniques for compiling software in order to reduce power consumption have been explored by Mehta et al. [119].

MIPS Technologies released the first commercial RISC design, the 32-bit R2000; the first 64-bit RISC processor was the R4000. Various other RISC designs include the IBM POWER, Sun SPARC, AT&T CRISP, AMD 29000, Motorola 88000, DEC Alpha and the HP-PA.

Texas Instruments (TI) and Intel developed the first microprocessors, which were 4-bit processors used in pre-programmed embedded applications. TI introduced the TMS1802NC in 1971, which implemented a calculator on a chip. In that same year, the Intel 4004 was introduced, running in a Busicom calculator. Microprocessor advancement led to the development of 8-bit microprocessors such as the Intel 8080 and Zilog Z80. These microprocessors were eventually used in personal computers, medical-grade pacemakers and defibrillators, and automotive, industrial and consumer devices.

16-bit microprocessors such as the Western Design Center (WDC) CMOS 65816 became the core of the Apple IIgs personal computer and were later used in the Super Nintendo Entertainment System. 16-bit processors from Intel include the 80186 and the 80286. Embedded systems also used 32-bit microprocessor designs. The Motorola MC68000 was used in the Apple Lisa, Macintosh, Atari ST and Commodore Amiga devices. Later versions, such as the MC68030 and MC68040, included Memory Management Units (MMUs) integrated into the chip, complemented with a Floating Point Unit (FPU) for better mathematical performance. However, later versions of the 68K series were not successful in the personal computer market, but continued to be manufactured for embedded equipment. Later, 64-bit designs came to be used in multimedia boxes such as DVD players, entertainment systems and gaming consoles; the move to 64 bits increases register widths, register counts and memory address space while providing a larger datapath. Well-known 64-bit architectures include x86-64, MIPS, SPARC, the Power Architecture and Itanium.

Embedded systems in general are power-critical: they must meet restrictive power constraints in order to be efficient and portable for consumer purposes. This challenge is being addressed in many research studies. Tiwari et al. [154] describe a systematic approach for modelling power usage in microprocessor systems. This technique has been the basis of much of the instruction-level power modelling used in today's embedded system designs. The technique quantifies the energy cost of individual instructions and of the various inter-instruction effects. The work has benefitted the industry by providing better insights into energy usage in embedded systems, and has provided a useful tool for energy-constrained designs.
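
In symbols, the model estimates program energy as E = Σᵢ Bᵢ·Nᵢ + Σᵢⱼ Oᵢⱼ·Nᵢⱼ, where Bᵢ is the base cost of instruction i, executed Nᵢ times, and Oᵢⱼ is the circuit-state overhead of executing instruction j immediately after i. A minimal sketch over an instruction trace (the cost tables and the instruction-set size are placeholders):

    #define NUM_INSNS 4   /* size of the (placeholder) instruction set */

    /* Program energy per the instruction-level model: per-instruction
       base costs plus inter-instruction (circuit-state) overheads
       accumulated over consecutive pairs in the trace.                */
    double program_energy(const int insn[], int len,
                          const double base[NUM_INSNS],
                          const double overhead[NUM_INSNS][NUM_INSNS])
    {
        double e = 0.0;
        for (int k = 0; k < len; k++) {
            e += base[insn[k]];
            if (k > 0)
                e += overhead[insn[k - 1]][insn[k]];
        }
        return e;
    }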

2.3.2 Coprocessor Systems

The performance of a general purpose processor is limited by the instruction set architecture and microarchitecture of the CPU design. A general design usually provides flexibility in the programs that can be executed, while trading off performance. To improve performance on specific operations and tasks, an additional processor that is specially designed for those tasks can be added to the system. A coprocessor is a computer processor which is used to supplement the functions of the main processor in the system. Coprocessor functions range from floating-point arithmetic to signal processing, string processing and encryption.

System performance can be accelerated with the addition of coprocessors, since processor-intensive tasks can be offloaded from the main processor. Coprocessors are an optional feature of a system: users who do not need the features of a coprocessor need not include it.

Coprocessors are specialized processors and are available both as separately packaged microprocessor chips and as softcore intellectual property (IP) blocks, which can be integrated together with the base processor. There are two major categories of coprocessors: 1) loosely coupled and 2) tightly coupled. Coprocessor systems are commonly loosely coupled and share the same communication bus as the main processor; it is desirable for the coprocessor to have minimal coupling with main memory and other processors. Instructions are received from the main processor. Coprocessors may have microcoded instructions as well as sophisticated memory access modes, and a control unit within the coprocessor executes its local instruction stream. Like the main processor, the coprocessor may have either a RISC or a CISC architecture. The characteristics of coprocessor systems are reviewed in Chu [45].

A tightly coupled coprocessor system integrates directly with internal components of the main processing core and is highly dependent on the scheduling and committed instructions of the main processor; it may have access to internal registers and components which are not available to loosely coupled coprocessors.
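
As an illustration of the loosely coupled case, the main processor typically drives the coprocessor through memory-mapped registers on the shared bus. All addresses and register offsets below are invented for the sketch:

    #include <stdint.h>

    /* Hypothetical memory-mapped register block of a loosely coupled
       coprocessor on the shared bus (addresses are invented).        */
    #define COP_BASE   0x40000000u
    #define COP_ARG0   (*(volatile uint32_t *)(COP_BASE + 0x00))
    #define COP_ARG1   (*(volatile uint32_t *)(COP_BASE + 0x04))
    #define COP_CTRL   (*(volatile uint32_t *)(COP_BASE + 0x08))
    #define COP_STATUS (*(volatile uint32_t *)(COP_BASE + 0x0C))
    #define COP_RESULT (*(volatile uint32_t *)(COP_BASE + 0x10))

    uint32_t coprocessor_op(uint32_t a, uint32_t b)
    {
        COP_ARG0 = a;                  /* operands travel over the bus */
        COP_ARG1 = b;
        COP_CTRL = 1;                  /* start the offloaded operation */
        while ((COP_STATUS & 1) == 0)  /* poll until done; an interrupt */
            ;                          /* could be used instead         */
        return COP_RESULT;
    }

The bus transfers on either side of the operation are precisely the coupling overhead that the tightly coupled alternative avoids by sitting inside the processor core.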

Coprocessor organization is also associated with parallel algorithms, as suggested in Chu [45]. This paper describes a lexical/parsing coprocessor which can deliver tokens from lexical processing. The compiler can be partitioned into pipeline stages, with the interpreting stage performed in the coprocessor and the rest of the compilation pipeline executed in the main processor. The use of a pipeline configuration speeds up compilation. Furthermore, the need to write the front-end of a compiler or an interpreter can be eliminated; the design and implementation effort of a compiler or an interpreter may thus be reduced, so that programmers can consider writing and customizing compilers on their own. Figure 2.5 illustrates the architecture of such a coprocessor in a typical microprocessor system.

Figure 2.5: Architecture of a lexical/parsing coprocessor, sourced from Chu [45].

Coprocessors may also assist in scheduling analysis for shared fieldbuses. In real-time applications such as process control, bus messages have to be scheduled to meet timeliness-constraint guarantees. Static scheduling is not feasible when flexibility is required for the addition or removal of components and for configuration changes on the bus. Dynamic scheduling imposes a high run-time overhead and is best offloaded to coprocessors. A traffic scheduling and schedulability analyzer coprocessor targeted at centralized-scheduling fieldbus systems was introduced in Martins et al. [117]. The coprocessor supports rate-monotonic, deadline-monotonic and priority-based scheduling policies and was implemented on a XC4010XL 12 MHz FPGA.

For modern communications networks (e.g., clusters of workstations connected via ATM networks), communications coprocessors (CCPs) are used to support fast communication [137]. Communications coprocessors offload from the main processor the computational workload of providing the protection, reliability and protocol handling needed for communications. Coprocessors can also be exploited to perform user-level message handling; examples of communication message types include Active Messages, RPC (Remote Procedure Call) and tagged send-and-receive.

Embedded devices such as mobile phones run portable applications which are written as Java applications. On the desktop, high-end processors provide the computational power needed to run a Java Virtual Machine (JVM), which executes the Java application (distributed as Java bytecode). However, embedded devices do not have the luxury of powerful general purpose processors, in order to minimize the power consumption and footprint of the microprocessor in the device; each Java instruction requires several base processor instructions, and, additionally, the JVM competes with other applications for cache usage. Säntti et al. [136] presented an advanced Java coprocessor to offload the execution of Java bytecode from the base processor. The coprocessor works in asynchronous mode, thus reducing energy consumption in the system. Additionally, instruction folding and stack bypassing are used to enhance the execution of Java bytecode.

A coprocessor can also be used to speed up the execution of code written in different programming models, such as the object-oriented programming model. Donzellini et al. [54] presented the design of an Object Coprocessor (OCP) that cooperates with a RISC-architecture processor (the ARM7). The coprocessor implements in hardware the low-level processing and control steps needed by the object-oriented model. Object-oriented concepts which are supported include encapsulation, inheritance, polymorphism and virtual methods. The OCP is able to take control of the execution flow when necessary. The coprocessor is equipped with special instructions for typical operations of object-oriented languages: the management of methods, and method calls and returns, in static and virtual modes. When needed, the OCP disconnects from the bus and controls both the core and the external memory system. When no coprocessor instruction is fetched, the coprocessor stays idle, while the main core (i.e., the ARM7) remains connected to the external bus and controls the whole system.

2.3.3 Digital Signal Processors

Digital Signal Processors (DSPs) are very important components in many consumer, communications, medical and industrial products. DSPs are often more cost-effective than customized hardware, especially in low-volume products; compared with general purpose microprocessors, they provide added advantages in speed, cost and energy efficiency. Eyre et al. [58] provide a good overview and describe the key architectural features of DSPs.

The design and architecture of DSP processors have been influenced by the computational requirements of DSP algorithms, and most features of a DSP processor can be traced back to the computation methods used in those algorithms. For example, finite impulse response (FIR) filters require fast multipliers, as the filter is mathematically expressed as the convolution sum Σ(x ∗ h). FIR filters have very high computational requirements and so need several independent execution units that are capable of operating in parallel. Consequently, a multiply-accumulate (MAC) unit, an arithmetic-logic unit (ALU) and a shifter are usually included in DSP processors. Good DSP performance also requires high memory bandwidth compared to general purpose processors, due to the need to fetch instructions, data samples and filter coefficients from memory in a single cycle.
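
Written out, the filter computes y[n] = Σₖ h[k]·x[n−k], one multiply-accumulate per tap, which is exactly the operation the MAC unit implements. A plain C reference version (for illustration; a real DSP kernel would use the processor's MAC and addressing hardware):

    /* N-tap FIR filter output at sample index n: one multiply-
       accumulate per coefficient. On a DSP, each iteration maps
       onto the MAC unit, with the sample and the coefficient
       fetched from memory in the same cycle.                    */
    double fir(const double h[], const double x[], int n, int taps)
    {
        double acc = 0.0;
        for (int k = 0; k < taps && k <= n; k++)
            acc += h[k] * x[n - k];   /* multiply-accumulate */
        return acc;
    }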

Most DSP processors also use a fixed-point numeric data type instead of the floating-point format most commonly used by general purpose processors in scientific applications. A DSP processor will also typically support zero-overhead looping, in which program loops can be executed without expending any clock cycles on updating and testing the loop counter or on branching back to the top of the loop. DSP processors might implement specialized serial/parallel I/O interfaces and streamlined I/O handling mechanisms, such as low-overhead interrupts and direct memory access (DMA), so as to allow data transfers without affecting the performance of the main processing unit. A DSP processor may also implement instruction sets specialized for the intended algorithm.
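
For example, in the common Q15 fixed-point format (one convention among several; formats vary by processor), a value in [−1, 1) is stored in a 16-bit integer as value × 32768, and a multiply needs a rescaling shift:

    #include <stdint.h>

    /* Q15 fixed point: value = raw / 32768. A 16x16-bit multiply
       yields a Q30 product in 32 bits; shifting right by 15
       restores the Q15 scaling.                                  */
    static inline int16_t q15_mul(int16_t a, int16_t b)
    {
        return (int16_t)(((int32_t)a * b) >> 15);
    }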

There are different ranges of DSP processors [58]. Conventional DSP processors typically provide few execution units, often only a single multiplier/MAC unit and an ALU; these processors issue and execute one instruction per clock cycle. Examples include the Motorola DSP560xx, Texas Instruments TMS320C2xx and Analog Devices ADSP-21xx. Conventional DSP processors run at low clock speeds while maintaining very modest power consumption and memory usage. Higher-end DSP processors include deeper pipelines, barrel shifters, instruction caches and higher clock speeds.

Enhanced-conventional DSP processors provide even greater clock speeds and better hardware improvements. Enhancements include parallel execution units, extended instruction sets and wider data buses. These DSP processors also use wider instruction words to encode additional parallel operations within a single instruction. Both conventional and enhanced-conventional DSP processors are hard to program, owing to their complex assembly languages and the difficulty of compiling programs for the architecture.

Multi-issue architectures for DSP processors are closely related to modern general purpose processor architectures. Multi-issue DSP processors use very simple instructions that typically encode a single operation; a high level of parallelism is achieved by issuing and executing instructions in parallel groups. The two methods of parallel instruction execution are VLIW (Very Long Instruction Word) and superscalar. VLIW and superscalar DSP processors target applications that have demanding computational requirements combined with constraints on cost or energy usage. Figure 2.6 presents sample DSP benchmark results for a selection of DSP processors, compared with a general purpose processor, the Intel Pentium III.

Figure 2.6: Cycle counts for BDTI block FIR filter benchmark, sourced from Eyre et al. [58]:

    ADI ADSP-2106x        812
    ADI ADSP-2116x        573
    LSI Logic LSI400      607
    Lucent DSP16xx       1264
    Lucent DSP16xxx       757
    Motorola DSP563xx     943
    StarCore SC140        183
    TI TMS320C54x         730
    TI TMS320C62xx        347
    Hitachi SH-DSP        904
    Intel Pentium III    1498

The design of signal processing systems is very complex, especially in the embedded system domain, and the challenge grows with increasing functionality and changing standards. There are also high performance constraints and the need for low power dissipation: for better portability and reliability at lower production cost, it is important to produce a system with low power dissipation.

As with general purpose processors, similar research in energy/power modeling has been performed. Lee et al. [105] proposed an instruction-level energy modeling methodology for DSP systems. The study used a dual-slope integrating ammeter in series with the power supply of a microprocessor so as to obtain visual current readings proportional to the power consumed by the processor. However, this method is limited, since the program under measurement must be much shorter than the sampling period of the meter in order to obtain a stable reading.

In contrast to the work by Tiwari et al. [154] and Lee et al. [105], Gebotys et al. [63] derived software power prediction models using statistical optimization, and verified these models against actual power measurements. The use of a more sophisticated meter, specifically the Fluke 867B GMM, allowed the power consumption of an entire program to be determined empirically. The general approach of this statistical methodology can be applied to any processor architecture.

2.3.4 Multiprocessor Systems

Multiprocessor systems are computer systems composed of several independent processors. The motivation for the use of multiple processors is mainly performance, since device technology limits the speed of any single processor. In order to overcome such limitations, multiple processors are used. As a consequence, the workload has to be arranged and partitioned into parallel tasks to obtain the expected performance improvements. Establishing such tasks is not a trivial matter, and represents a major research effort in the multiprocessing field.

Parallelism in execution has been discussed in the previous chapter. In single processor systems, techniques such as superscalar execution, pipelining and SIMD are ways of increasing the data throughput of a system. Pipelined machines produce high performance by executing several stages of instruction execution simultaneously. Superscalar devices dispatch instructions to readily available processing resources to minimize idle time. SIMD instructions operate on vector data or arrays of data. In each of these cases, only a single program is used to operate on multiple data items or on arrays of data.

SIMD instructions are able to achieve performance improvements only if data is arranged and accessed to suit the needs of the architecture. Conventional algorithms have to be rewritten in order to make use of such an approach, and in some cases overhead may be incurred due to the extra work needed to align data in memory. Some operations cannot easily be organized into repetitive operations on uniformly structured data, and tend to be unstructured and unpredictable.
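
A common consequence of this rearrangement (illustrated below with hypothetical structures) is rewriting the data layout from an array of structures to a structure of arrays, so that each SIMD load fetches uniformly typed, contiguous elements:

    #define N 1024

    /* Array of structures: x and y are interleaved in memory, so a
       vector load of consecutive x values is not possible.          */
    struct point_aos { float x, y; };
    struct point_aos points_aos[N];

    /* Structure of arrays: all x values are contiguous, matching
       what SIMD loads expect; the loop below is vectorizable.       */
    struct {
        float x[N];
        float y[N];
    } points_soa;

    void scale_x(float s)
    {
        for (int i = 0; i < N; i++)
            points_soa.x[i] *= s;
    }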

Thus, multiprocessor architectures offer viable solutions to the above problems. Such an architecture is referred to as a multiple-instruction stream, multiple-data stream (MIMD) architecture. A multiprocessor is composed of independent processors interconnected in ways that allow programs to exchange data and synchronize activities. Each individual processor in the system may or may not share memory and input/output units with other processors.

A multiprocessor can only operate at peak performance when all processors are engaged in useful work and no processor is idle [147]. If no extra instructions (i.e., overheads) are executed in the multiprocessor environment and all workloads take the same amount of time on each processor, then an N-processor system would effectively contribute a performance improvement by a factor of N. However, such a state is rarely achieved, due to delays in interprocessor communication and the overhead of synchronization between processors. Efficiency is lost when one or more processors run out of tasks to do; if two or more processors execute the same instructions, effort is wasted; and, finally, there are processing costs for controlling the system and scheduling operations. Exploring the design of multiprocessor systems has been a major research topic and is addressed in the next few sections.
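
The erosion of the ideal factor-of-N can be made concrete with a simple model (an illustration only; the overhead term lumps together communication and synchronization costs):

    /* Achieved speedup of an N-processor system under a simple model:
       the parallel run time is the ideal t1/n plus a lumped per-run
       communication/synchronization overhead t_ovh (same time units
       as t1, the single-processor run time).                         */
    double speedup(double t1, int n, double t_ovh)
    {
        double t_parallel = t1 / (double)n + t_ovh;
        return t1 / t_parallel;
    }

For example, with t1 = 100, n = 4 and t_ovh = 5, the model gives 100/30 ≈ 3.3 rather than the ideal 4.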

Multiprocessor systems can be either single-ISA systems or multi-ISA systems. Single-ISA heterogeneous systems allow any application stage to be assigned and mapped to any core in the system with little reconfiguration and modification [37]. Multi-ISA systems consist of entirely different processors, ranging from DSP to CPU implementations [125, 148, 134]. Systems providing a platform on which multiple processing entities perform computation on different parts of the system concurrently have been implemented [125, 100].

MPSoCs (Multiprocessor System-on-Chips) can be categorized into homogeneous and heterogeneous systems. Homogeneous systems consist of identical processors, as in Symmetric Multiprocessing (SMP) systems, and are single-ISA systems. Heterogeneous processor systems use differing processing entities to maximize performance while minimizing area and power consumption. Such systems might consist of a network of ASIPs, DSPs, coprocessors and ASIC components fabricated on the same silicon die; a heterogeneous system may be either single-ISA or multi-ISA. Each component is mapped and assigned to specific functions, thus executing multithreaded applications. Usually such systems exhibit coarse-grained parallelism.

Existing approaches to heterogeneous processor architectures typically map critical regions of software into hardware (e.g., DSPs, ASICs). Each hardware component is optimized and suited to its particular mapped region to maximize performance. To increase the efficiency and performance of critical systems, Application Specific Instruction-set Processors (ASIPs) [1, 4, 15, 20] have been introduced into such processor architectures. An ASIP's instruction set and its underlying architecture can be configured to a specific application in order to improve efficiency. ASIPs provide a good trade-off between efficiency and flexibility, as the same design can be reused between different product variants and updated at little additional cost.

Various heterogeneous multiprocessor systems have been implemented, primarily in the domains of automotive real-time systems [31] and video or image encoding. Strik et al. [148] explored the use of a heterogeneous system in a real-time video and graphics stream-management system, while Zhang et al. [167] applied an adaptive job assignment scheme to perform data partitioning for a multiprocessor implementation of MPEG2 video encoding. A high degree of computational power and parallelism is inherent in heterogeneous multiprocessor systems, motivating the development of HDTV systems based upon multiprocessors [35].

Gopalakrishnan et al. [69] used heterogeneous systems in a different manner. Their work generalizes the approach initiated by Baruah [34], which replicates recurring tasks on multiple processing units to ensure a degree of fault tolerance: maintaining replicas of a task on different processors ensures that single-processor failures are tolerated well.

2.4 Customization of Architectures

The ever-increasing complexity of applications and the demand for more performance have resulted in the search for techniques to increase computational resources while taking into account the restrictions of embedded systems: energy consumption and chip size. There are many paths to accelerating applications for embedded systems. As discussed in previous sections, general purpose processor solutions can be enhanced with parallel processing mechanisms to achieve higher data throughput. However, relying only on those enhancements would not create efficient architectures, due to the generality of such an approach.

The microprocessor design gap has been widening. The National Technology Roadmap for Semiconductors noted that the rate of increase in the number of transistors that can be put on a die has been much greater than the growth rate of the number of transistors that can be designed into new interdependent circuits. Microprocessor designers are now "wasting" transistors that could be used for greater efficiency and performance. With the abundance of transistors now available, designers can exploit modularity in designs to customize embedded systems. However, in order to develop a modular product architecture with standardized interfaces among subsystems, it is necessary to waste some of the functionality that is theoretically possible: modular design compromises performance and bleeding-edge technology in favour of greater efficiency and reliability.

Customizing processor architectures to the target application is now the major trend in embedded systems. The most popular strategy is to build a system that consists of a number of special-purpose ASICs coupled with a low-cost core processor. Such SoCs include specially designed hardware accelerators coupled with a general purpose processor. However, ASICs are costly to design and offer only a hardwired solution that does not allow reprogramming, thus precluding changes that might be needed in the future.

The current trend is directed more at reliability, customization, convenience and power efficiency, by augmenting the core processor with special-purpose hardware to increase its computational capabilities in a cost-effective manner (i.e., ASIPs). The central challenge of this approach is the large degree of human effort required to identify and create custom hardware units, as well as that needed to port the application to the extended processor.

2.4.1 Field Programmable Gate Array

The FPGA (Field Programmable Gate Array) was invented in 1984 by Ross Freeman [24], the co-founder of Xilinx. Although the FPGA is similar to CPLDs (Complex Programmable Logic Devices), it is a newer form of programmable logic. CPLDs and FPGAs belong to a relatively large family of programmable logic elements.

CPLDs are relatively small in size and density compared with FPGAs. These programmable logic devices have densities equivalent to several thousand logic gates, whereas FPGAs have densities ranging from thousands to several million logic gates. CPLDs are less flexible than FPGAs, owing to a lower number of sum-of-products logic arrays feeding a relatively small number of clocked registers; this results in more predictable timing delays and a higher logic-to-interconnect ratio. FPGAs provide high-level embedded functions such as adders, multipliers and MAC units, as well as embedded memories, to provide commonly used features in an efficient way. FPGA architectures are predominantly filled with interconnects, which provide far greater flexibility in designing different types of implementations; the disadvantage of such an architecture is the added complexity of design.

Much current research focusses on the full or partial in-system reconfiguration of FPGAs, allowing designs to be changed "on the fly" [129]. This feature is useful when a faulty design needs to be upgraded without the complete redesign and fabrication of a new chip. Dynamic reconfiguration is also useful when package size and power are critical: an FPGA can be reconfigured with features as needed. Xilinx S-FPGAs (SRAM-based FPGAs) were used in the Mars Exploration Rover (MER) landers to control critical pyrotechnics during the landing sequence; they are also used to control the rovers' wheel motors [151].

Vranesic [159] surveyed the key issues in the development of FPGAs and examined the design factors which determine their performance and effectiveness. Logic blocks play a critical role in the functionality of the FPGA; two major types are used, based either on lookup tables (LUTs) or on multiplexers. The logic blocks, in addition to the combinational circuitry, allow combinational or sequential elements to be formed. Blocks with a high number of inputs provide high functionality.

Interconnection resources provide the framework necessary to implement a large circuit on an FPGA, even when plenty of logic resources exist. The flexibility of the device increases with the number of interconnection resources; however, the introduction of multiple routing switches dramatically increases the propagation delay of a circuit path.

Reprogramming an FPGA involves the serial transfer of programming bits, so it is vital to minimize the number of bits transferred. Universal Logic Modules (ULMs) can be used in place of LUTs [110, 153]. ULMs are components that can be programmed to implement one equivalence class of functions at any one time; the number of programming bits required to select an equivalence class is significantly smaller than the number needed to configure a LUT.
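
As a worked example of the saving (using NPN equivalence, i.e., equivalence under input negation, input permutation and output negation, which is one common choice; the cited ULM work may partition functions differently): a 4-input LUT requires 2^4 = 16 truth-table bits, whereas the 2^16 = 65,536 possible 4-input functions fall into only 222 NPN classes, so on the order of ⌈log2 222⌉ = 8 bits suffice to select a class.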

The size of an FPGA can be dramatically reduced if fewer SRAM cells are used to store the programming bits. If a data item consists of a number of bits that are handled in the same way, then it is possible to optimize both the routing and the logic block resources to reduce the number of control bits needed. Sharing routing control bits minimizes the number of SRAM cells used, and logic blocks which implement a common multiplexer function can also be shared.

Long routes in the interconnect increase the latency of circuit signals from one point to another, and much research has been done to improve routability in FPGAs. Li et al. [109] presented a dynamic FPGA architecture which combines field programmable gate arrays with dynamic field programmable interconnect devices. This architecture efficiently exploits the potential communication bandwidth of interconnect resources by dynamically reconfiguring the interconnect networks; interconnect resources and existing pins can thus be reused. This greatly increases the routability of interconnect networks and the overall performance of FPGA systems.

Recent trends have been to combine the architectural approaches of FPGAs (i.e., logic blocks and interconnect resources with related peripherals) with modern SoC designs. Processor cores available in softcore format have been designed to suit the architecture of FPGAs. These cores are configurable: cache structure, peripherals and interfaces can be customized to the application. Examples include the MicroBlaze [13], PicoBlaze, Nios [1], Nios II and LatticeMico32.

Extending the notion of a customized microprocessor, Wittig et al. [164] describe the OneChip processor architecture, which combines a fixed-logic processor core with reconfigurable logic resources. The programmable components in the architecture can be used to enhance the performance of speed-critical applications. The work also eliminates the limitations of custom processor designs by tightly integrating the reconfigurable resources into a MIPS-like processor architecture.

2.4.2 Application Specific Integrated Circuits

Application Specific Integrated Circuits (ASICs) are circuits customized for a specific application and are not reprogrammable like general purpose processors. ASICs provide a performance advantage over their general purpose counterparts, while occupying a smaller packaging footprint and using less power. This is because ASICs are “hardwired” to perform a specific function and task, and so do not incur the overhead of fetching and interpreting stored instructions.

ASIC designers use a hardware description language (HDL), such as VHDL or Verilog, to describe the functionality of ASICs. Designers can use logic synthesis tools, such as the Design Compiler, to compile HDL descriptions into a gate-level netlist.

An IP core is a block of “intellectual property” designed and packaged so that it can be used as a sub-component of a larger ASIC design. IP cores can be provided either as HDL descriptions (soft cores) or as fully routed designs that can be printed directly onto an ASIC’s mask (hard cores). Intellectual property takes considerable time and investment to design and create; frequent reuse of IP cores therefore dramatically decreases the design turn-around time for new products.

HDL descriptions are vital in the design flow of ASICs and other chips, and thus represent an integral part of any CAD framework. Formal specifications, top-down design, and design-for-test have become standard practices for IC designers. HDLs, especially VHDL, can be used to explore implementation trade-offs [50]. ASIC designers can quickly assess the advantages and disadvantages of different ASIC technologies by selecting different standard cell libraries, and are able to analyze potential solutions with respect to complexity, speed and testability.

The decline of custom ASICs in embedded systems has been caused by the popularity of processors that are more flexible and quicker to design. ASIC designs are still used in products with low design turn-around times, and in those sensitive to power requirements.

Examples of ASIC designs include DCT/FFT modules, JPEG encoders, YUV/RGB colour converters and mobile phone transceivers that require low power and fast processing of information. Pok et al. [126] designed a monobit receiver ASIC, a wide-band (1 GHz) digital receiver for electronic warfare applications; the receiver block can be fabricated onto a single multi-chip module (MCM). In Shee [138], a JPEG streaming encoder core was developed that can be synthesized into an ASIC design; it was a pipelined design in which each pipeline stage corresponded to a different stage of the JPEG encoding algorithm, hard-wired as custom circuits.

There is on-going research to minimize power in ASIC designs. Various techniques are being used, such as voltage scaling, variable clock frequencies, asynchronous designs and miniaturization technologies. Taylor et al. [152] used a different type of ASIC architecture, known as the structured ASIC. Structured ASICs are similar to FPGAs in that both are conceived as two-dimensional arrays of programmable logic units that can be selectively connected using programmable switch boxes at possible junction points. However, structured ASICs are not field-programmable and are hard-wired in the later stages of the manufacturing process. The work [152] introduced an energy-aware structured ASIC architecture that used gate sizing and selective voltage scaling for power optimization.

2.4.3 Extensible Processor Architectures

Current system design methods tend towards codesign of mixed hardware/software systems targeting Systems-on-Chip (SoCs). Extensible processor systems provide designers with flexibility for future software modifications, modularity in optional and configurable components, reliability in verified computational modules, and power efficiency through full customization of the instruction set towards the target application. Extensible processor systems today provide designers with verified architectural designs as well as software design tools which are retargetable to different configurations of the extensible design.

Application Specific Instruction-set Processors

Recently a new entrant called the Application Specific Instruction Set Processor (ASIP) has taken center stage as an alternative contender for implementing functionality in embedded systems. ASIPs are processors with specialized instructions, selected co-processors, and parameterized caches applicable only to a particular program or class of programs. An ASIP will execute the application for which it was designed with great efficiency, while remaining capable of executing any other program (usually with greatly reduced efficiency). ASIPs are programmable, quick to design and consume less power than GPPs (though more than ASICs). They are particularly suited to embedded systems, where customization allows increased performance yet reduces power consumption by omitting unnecessary functional units. Programmability allows upgrades and reduces software design time. Tools such as ASIPmeister [85], Tensilica [20], ARCtangent [4], Jazz [10], Nios [1] and SP5-flex [15] allow rapid creation of ASIPs.

The Jazz DSP Processor [10], by ImprovSys, permits the modelling and simulation of a system consisting of multiple processors, memories and peripherals. Data width, the number of registers and the depth of the hardware task queue can be configured, and additional custom functionality added. It has a base ISA that supports the addition of extensible instructions to further optimize the core for specific applications. The Jazz DSP Processor has a 2–stage instruction pipeline, single cycle execution units, and supports interrupts with different priority levels. Users are able to select between 16–bit and 32–bit data paths. It also has a broad selection of optional 16–bit or 32–bit DSP execution units that are fully tested and ready to be included in the design. However, Jazz is suitable only for VLIW and DSP architectures. The Jazz DSP System can be configured to handle memories, and on-chip or off-chip bus interfaces clocked at either the same speed as the processor or twice that [22].

The ARC configurable processor [3] also provides a configurable DSP architecture. The core has the flexibility to add instructions, registers, flags and condition codes to create a processor that is highly optimized for a specific application. The ARCompact ISA provides 16- and 32-bit instructions for high code density.

The SP-5flex [15] is a 32-bit configurable DSP processor with a highly superscalar SIMD architecture. The SP-5flex is a fully synthesizable core based on the SP-5 [15] core architecture. Configurable elements include the execution unit structure, maximum multiplier size, number of shift units, number of stack pointers, general register structure and application specific instructions.

Finally, the Xtensa [20] is a configurable and scalable RISC core. It allows 24–bit and 16–bit instructions to be freely mixed at fine granularity. The base processor supports the 80 base instructions of the Xtensa ISA, with a 5–stage pipeline. New functional units and extensible instructions can be added using the Tensilica Instruction Extension (TIE) language. Synthesizable code can be obtained, together with the software tool chain, for the various architectures implemented with Xtensa.

Figure 2.7 shows that the Xtensa LX architecture, which is used in Chapters 6 and 7, consists of various standard and configurable building blocks which designers can use to customize the processor for a particular application.

Figure 2.7: The Xtensa LX Architecture: Designed for Configurability and Extensibility, sourced from Tensilica, Inc. [20].

Design tools and framework

ASIPs, such as the Tensilica Xtensa LX processor [20], consist of a basic RISC core with configurable and optional peripherals such as instruction and data caches, buses, computation modules, VLIW instructions, coprocessors and external communication interfaces. The basic core of an ASIP features a compact instruction set optimized for embedded designs. Additionally, ASIP architecture vendors provide a plethora of design tools to customize and optimize an ASIP for the desired application. Tensilica [20] provides profiling and compilation tools for any architecture generated by the processor architecture generator. TIE (Tensilica Instruction Extension) instructions allow developers to extend the processor architecture for specific applications by creating new datapath functions in the form of added instructions and registers. Possible instructions include VLIW, fusion and customized instructions. The Tensilica XPRES (Xtensa PRocessor Extension Synthesis) Compiler is a TIE-language synthesis tool that generates tailored Xtensa processors from the C/C++ code of the target application. The tool enables the rapid development of optimized SoC hardware blocks and associated software tools.
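To make the idea of instruction fusion concrete, the sketch below shows, in plain C, the kind of operation sequence an instruction-extension tool such as XPRES might collapse into a single custom instruction. The function name and the choice of a saturating multiply-accumulate are illustrative assumptions, not an example from the Tensilica documentation; the generated TIE instruction would implement the same datapath in hardware.

    /* A saturating multiply-accumulate: three dependent operations that an
     * instruction-extension tool might fuse into one custom instruction. */
    static inline int mac_sat(int acc, short a, short b)
    {
        long long t = (long long)acc + (long long)a * (long long)b;
        if (t >  2147483647LL)  t =  2147483647LL;   /* saturate high */
        if (t < -2147483648LL)  t = -2147483648LL;   /* saturate low  */
        return (int)t;
    }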

Early research into ASIPs focused on instruction set customizations to satisfy the constraints on embedded system designs. In Van Praet et al. [157], ASIPs were originally defined as mask-programmable processors that have architectures and instruction sets optimized for a specific application domain. Techniques have since been developed to define optimized microinstruction sets for the processors; methods for instruction selection when generating code for a predefined ASIP have also been addressed. The term ASIP was later applied to configurable processor systems with a basic processing core. Huang et al. [81] describe instruction set synthesis for an application on a parameterized, pipelined microarchitecture, synthesizing instruction sets from application benchmarks given a machine model, an objective function and a set of design constraints. The semantics of the instructions are modeled using binary tuples, and an instruction-formation process is integrated into the scheduling process. Schedules are solved using simulated annealing. Complex instructions which cannot be accommodated within the clock constraint have to be designed as multi-cycle instructions.
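As a reminder of how such an annealing-based scheduler operates, the following is a generic simulated-annealing skeleton in C; cost(), perturb(), the schedule representation and the cooling schedule are placeholder assumptions, not the formulation used in [81].

    #include <stdlib.h>
    #include <math.h>

    #define SCHED_LEN 256

    extern double cost(const int *sched);      /* objective, e.g. cycle count */
    extern void   perturb(int *sched);         /* random local move           */

    void anneal(int *sched, int n_moves)
    {
        double t = 1.0;                        /* initial temperature */
        double c = cost(sched);
        int trial[SCHED_LEN];

        for (int i = 0; i < n_moves; i++, t *= 0.999) {
            for (int j = 0; j < SCHED_LEN; j++)
                trial[j] = sched[j];
            perturb(trial);
            double c2 = cost(trial);
            /* always accept improvements; accept worse moves with
             * Boltzmann probability exp((c - c2) / t) */
            if (c2 < c || exp((c - c2) / t) > (double)rand() / RAND_MAX) {
                for (int j = 0; j < SCHED_LEN; j++)
                    sched[j] = trial[j];
                c = c2;
            }
        }
    }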

To overcome the limitations of previous approaches that generated only single-cycle instructions, Choi et al. [44] proposed a methodology to generate multi-cycle as well as single-cycle instructions for DSP applications. The approach was effective in making the instructions meet the given constraints without attaching special hardware accelerators, and was able not only to increase application performance but also to reduce application code size.

The work in [99] describes how an existing processor instruction set and architecture can be customized without designing and creating a new processor. This is related to the platform-based approach to architectural exploration, a concept still used in current ASIP processor systems today. There is a need to broaden the architectural space being explored; the issues which need to be addressed for ASIP design exploration are discussed in [88].

In respect of platform-based design, a platform is defined as a family of architectures satisfying a set of constraints to allow the reuse of hardware and software components. Existing hardware can be reused in order to cope with the increasing complexity of embedded systems and stringent time-to-market requirements. A platform application programming interface (API) is also needed to extend the platform to application software. In general, a platform is an abstraction that covers several possible lower-level refinements. Every platform gives a perspective from which to map higher abstraction layers into the platform, and one from which to define the class of lower-level abstractions that the platform implies [135].

The ability of compilers to retarget to various architectures has been an issue. Leupers et al. [106, 107] have provided an instruction-set modelling technique for ASIP code generation. The technique covers a broad range of instruction formats and includes detailed views of inter-instruction restrictions. The model captures peculiarities in the instruction set, such as side effects and residual control. The model can also be generated from HDL processor descriptions, and can thus be included in retargetable compilers developed for different architectures and their instruction formats.

One branch of design space exploration for ASIPs targets rapid architectural exploration with added extensible instructions. This has been explored in [41, 43, 149] through the use of the Xtensa processor from Tensilica [20]. The advent of tools to create Application Specific Instruction Set Processors has greatly enhanced the ability to reduce design turn-around time. Despite several efforts to address this [41, 46, 43, 121, 149], customization still remains an art rather than a well understood science; moreover, customization of an ASIP can take a significant amount of time.

Atomic operations within an extensible instruction can be duplicated [46] via various cut enumeration and mapping techniques, thus potentially achieving higher speedups. Extensible instructions are generated via a compilation method. This work was performed by extending the Nios [1] processor, and showed encouraging speedups of 2.75X on average.

Searching for the best extensible instruction is vital for shortening the design time of modern ASIPs. In [43], a matching algorithm is employed to match the traced program with a set of predefined extensible instructions that have been highly optimized while meeting performance and power constraints. [149] took another step ahead by generating the extensible instructions directly from the application code. An estimation method is used to meet the area and latency constraints, avoiding chip synthesis that would slow down the exploration process. Both of these approaches were demonstrated on the Xtensa [20] platform by extending the available instruction sets.

ASIPs provide a good exploration space for performance / power consumption / area trade-offs. New instructions can be added and removed to observe the effects on the performance, power and area trends of the architecture. Various architectural options can also be explored. Lee et al. [104] present an energy-efficient instruction-set synthesis technique that can comprehensively reduce the energy-delay product (EDP) of ASIPs by way of optimal instruction encoding. This approach optimizes instruction encoding by taking into account instruction bitwidth and the dynamic instruction count when executing the program. This technique for increasing energy efficiency exploits the flexibility of ASIPs at the instruction level.

Several firms [7, 17, 19] bundle IP blocks into both soft and hard cores. Soft cores are software-like descriptions of the IP blocks that can be synthesized into hardware designs, whereas hard cores are pre-verified hardware designs. Such cores enable designers to focus less on a new design and more on system integration. Microprocessors and other cores can be selected and integrated into SoC designs that can be manufactured as special purpose components for a specific product.

2.5 Parallelizing Architectures

Current microprocessors employ a variety of techniques to decrease execution time and increase data throughput. Apart from technology advances that improve microprocessor clock speeds and reduce memory latency, designers also exploit features of algorithms and tasks to improve performance. Parallel processing is a key concept behind many architectural advancements and modifications; it is the simultaneous execution of similar or different tasks on multiple processing entities, in order to obtain results faster. This section highlights two parallelizing concepts, instruction level parallelism and task / thread level parallelism, and describes the architectures that facilitate them.

2.5.1 Instruction Level Parallelism

Instruction level parallelism (ILP) is a measure of the extent to which atomic operations in a computer program can be performed simultaneously. Two instructions can be executed simultaneously if they are independent of each other or share the same dependency. Thus, computer programs have to be written and compiled specifically if they are to exploit the micro-architectural techniques used in processors. ILP enables compiler designers to develop techniques to overlap the execution of multiple instructions or change their order, so as to maximize parallel execution while minimizing dependency stalls. Micro-architectural techniques used to exploit ILP include pipelining, superscalar execution, out-of-order execution, register renaming, speculative execution and branch prediction.

ILP techniques have significantly increased the computational power of microprocessors. Consequently, the demand on the memory system has increased; unfortunately, memory speeds have not kept pace with microprocessor power. Poor cache performance leaves many scientific applications bound by main-memory access time, even though multiple levels of cache have been introduced to reduce latency.

Hiroyuki et al. [79] and Song et al. [144] have both highlighted the importance of loop unrolling techniques to further increase ILP, and of memory latency hiding to address the long latencies of memory systems. Ramakrishnan [132] proposed an iterative modulo scheduling scheme to apply software pipelining to innermost loops for a wide variety of algorithms and heuristics. Improved cache performance for better ILP has been studied in [39, 122]. Carr [38] combined scheduling methods and cache improvement techniques to drive a transformation called “unroll-and-jam” that improves ILP and cache performance simultaneously, as illustrated below.
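The following C fragment is a minimal sketch of unroll-and-jam on a matrix-vector product, not an example taken from [38]: the outer loop is unrolled by two and the two copies of the inner loop are fused (“jammed”), so each load of x[j] now feeds two multiply-accumulates.

    #define N 64
    static double a[N][N], x[N], y[N];

    /* before: one multiply per load of x[j] */
    void mv_plain(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i] += a[i][j] * x[j];
    }

    /* after unroll-and-jam (N assumed even): the outer loop is unrolled by
     * two and the copies fused, so each load of x[j] feeds two independent
     * multiplies, improving both ILP and cache behaviour */
    void mv_unroll_and_jam(void)
    {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < N; j++) {
                y[i]     += a[i][j]     * x[j];
                y[i + 1] += a[i + 1][j] * x[j];
            }
    }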

Zhang et al. [168] proposed the Dominant-Subsidiary (DS) architecture for exploiting instruction level parallelism. A DS program consists of two instruction substreams: a control flow substream and a computational task substream. These individual substreams are then executed in a superscalar fashion, delivering superscalar performance with less complex hardware and with better potential for fast clock rates. Other split-stream architectures include ZS-1 [142], PIPE [67] and MISC [156].

Arya et al. [29] presented an architecture for instruction level parallelism (Software Scheduled SuperScalar) based on control flow and data flow solutions; it is a hybrid of superscalar and VLIW designs and includes features that are anticipated to increase ILP. The architecture delegates the task of instruction scheduling to the compiler; only a small amount of logic is used to route the instructions. The compiler examines large sections of the program to perform static instruction scheduling and to use the machine resources optimally. The number, size and bandwidth of hardware resources such as functional units, condition bits and registers are increased, so as to execute as many independent instructions as fast as possible.

2.5.2 Task Level Parallelism

A single core microprocessor can only provide performance up to a certain level; it is limited by technology and by how much latency hiding can be exploited within the software code. It is evident that such a single-chip core cannot scale sufficiently well to facilitate the increasing workload of an embedded system.

Instruction level parallelism is not scalable, because of the inherent limits on the parallelism available. Task level parallelism (TLP) provides a higher level of concurrency, which can reduce the run time of the overall program execution. TLP includes both thread and process level parallelism. Thread level parallelism is the parallelism inherent in an application that runs multiple threads at once, benefitting overall execution; when one thread is delayed due to long memory latency, another thread is able to do useful work, as sketched below.
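A minimal pthreads sketch of thread level parallelism follows; the array and the doubling operation are illustrative assumptions. Two worker threads process independent halves of an array, so a memory stall in one thread can be overlapped with useful work in the other.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double data[N];

    /* each worker processes an independent half of the array */
    static void *worker(void *arg)
    {
        long half = (long)arg;                    /* 0 = lower, 1 = upper half */
        for (long i = half * (N / 2); i < (half + 1) * (N / 2); i++)
            data[i] = data[i] * 2.0 + 1.0;
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, (void *)0L);
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("data[0] = %f\n", data[0]);
        return 0;
    }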

Parallel threads for improving performance in applications are hard to find; speculative thread level parallelism has therefore been proposed as a source of parallelism. Marcuello et al. [116] found that thread-spawning associated with loop iterations is the most effective technique. The performance of this spawning policy is critically dependent on value prediction. The paper proposed an increment predictor, which outperformed conventional value predictors such as last-value, stride and context-based predictors.

Multithreading has also been investigated in multiprocessor designs. Jacobs et al. [87] implemented MPEG-4 and H.264 coders on a customized multiprocessor architecture. The work exploits thread-level parallelism (at the macroblock level) to share the computational workload between the processors in the multiprocessor system. This leads to significant power savings due to the lower frequency at which each processor is required to run. The work demonstrated a highly balanced multi-thread implementation for the two coders (MPEG-4 and H.264).

Further architectures have been researched to exploit a higher level of parallelism: process level parallelism. Like TLP, process level parallelism concurrently executes parallel software code, albeit on different processing entities and in separate memory spaces. Process level parallelism can be found in multiprocessor and multicore designs. One core can handle one task while another executes an independent function. In a mobile phone, audio could be handled by one processor, while video is implemented on another, the communication stack on yet another, and so on. Chip Multi-Thread (CMT) processors [56, 145] offer another approach, providing support for many simultaneous hardware threads or processes executed in various ways.

As a means of gauging the effectiveness of task level parallelism in a multiprocessor system, an operating system can be augmented with workload measurement mechanisms. Flautner et al. [60] and Lundberg [114] provided static mechanisms to collect information about tasks or processor utilization, by adding hooks or multi-thread libraries. Task level parallelism can be calculated offline from the information collected. The overheads of these activities do not affect efficiency, since the mechanisms are only for static analysis purposes.

To address the overhead that may have been incurred in Flautner et al. [60] and Lundberg [114], Hung et al. [83] presented several mechanisms to dynamically estimate the amount of task level parallelism in multiprocessor as well as single processor systems, albeit with a small overhead. The Linux operating system is modified to collect information about processor utilization and task activities, from which such parallelism can be calculated. The work used time stamp counter (TSC) hardware, which enabled the framework to provide an estimate of task level parallelism at fine granularity.

Coprocessor and multiprocessor systems are excellent examples of systems which make use of task level parallelism, and particularly of process level parallelism. The Linda machine [27] is a parallel computer that has been designed to support the Linda parallel programming environment in hardware. The system is in fact a grid of processors immersed in tuple space. Each processor has a computation processor and a Linda coprocessor (which handles Linda-specific operations such as communications within nodes). The framework is meant to support a thousand computing nodes, so that researchers and users would be able to design their own parallel applications in an integrated environment.

On the other hand, a simultaneous multithreading (SMT) processor exploits both instruction level and task level parallelism in a single processing core. In fine-grain multithreading, instructions from a different thread are issued each cycle. Coarse-grain multithreading switches between threads after a certain interval, or when the current thread stalls due to memory access latencies and the like. Simultaneous multithreading, however, is able to issue multiple instructions from multiple threads in a single cycle, and thus requires superscalar capabilities.

Eggers et al. [56] showed that simultaneous multithreading adds minimal hardware complexity to conventional dynamically scheduled superscalar processors. The work replicated superscalar resources to support simultaneous multithreading: state for the hardware contexts (registers and program counters) and per-thread mechanisms for pipeline flushing, instruction retirement, trapping, precise interrupts, and subroutine return. Per-thread (address space) identifiers were added to the branch target buffer and translation look-aside buffer. The instruction fetch unit and the processor pipeline were the only components to be redesigned to benefit from SMT multi-thread instruction issue. Simultaneous multithreading needs no special hardware to schedule instructions from the different threads onto the functional units.

The Intel Xeon was the first modern commercial implementation of a simultaneous multithreading (SMT) processor for the desktop market. The implementation, known as Hyper-Threading Technology (HTT) [98], provides a basic two-threaded SMT engine, emulating a two-core processor system. HTT makes a single physical processor appear as two logical processors. Other commercial implementations include the MIPS MT [95] and the IBM Power5 [92] architectures.

2.6 Design Space Exploration

2.6.1 Processor Generation

Designing processors is itself considered an art, and requires a well-structured methodology to produce designs that are both correct and feasible. The general feeling is that a description of some new architecture or component carries little credibility unless and until the design is implemented physically. This is because design descriptions lack sufficient precision to assess them for feasibility, correctness or performance. Agüero et al. [26] presented a plausibility-driven approach to computer architecture design, which categorized the nature of plausible constraints and defined a way by which plausibility statements can be developed to make claims about a design’s merits. The study also defined the main evolutionary properties of a plausible design and provided reasoning guidelines for plausibility-driven design methods. The axiomatic ADL S*M [163], which allows for the axiomatic and non-procedural specification of clocked architectures, was used to describe new architectures, suggesting that an architectural design should be viewed as a specification of constraints that are to be met by a system implemented as a combination of hardware and firmware.

Processor description language frameworks fall into three categories. Languages such as nML [62] and ISDL [73] capture the instruction set behaviour of the processor. MIMOLA [107], a structure-centric language, captures the net-list of the target processor. The last category covers languages that capture both the structure and the behaviour of the processor; LISA [160] and the EXPRESSION Architecture Description Language (ADL) [74] are languages of this kind. A LISA machine description provides model components for memory, resources, the instruction set, behaviour, timing and micro-architecture, from which the HDL code of the control path and the structure of the pipeline can be generated. The study in [86] proposed a micro-operation description-based synthesizable HDL generation; however, this work relies on the compiler to insert the necessary NOP instructions into the execution code. This has also formed the basis of the PEAS [85, 5] project.

With the demand for shorter design turnaround times, many commercial and research organizations have provided base processor cores, so that fewer modifications have to be made to the design to achieve particular performance requirements. This has led to the emergence of reconfigurable and extensible processors. Xtensa [20], Jazz [10], PEAS-III [85], ARCtangent [4], Nios [1] and SP5-flex [15] are examples of processor template-based approaches that build ASIPs around base processors.

PEAS-III [85, 5] is able to capture a target processor’s specification using a GUI. Estimates of the area, delay and power consumption of the target processor can be obtained in the architectural design phase. A micro-operation level simulation model and an RT level description for logic synthesis can be generated, along with the software tool chain. It provides support for several architecture types and a library of configurable components. The core produced follows the Harvard-style memory architecture. Several JPEG encoder designs were produced and evaluated within a short span in [96] using the PEAS-III approach.

While estimation provides fast exploration of the design space, it is vital that area, power and clock frequency be considered in justifying the selection of a certain architectural configuration. Recently, [121] used a synthesis-driven design exploration flow for rapid investigation of different processor configurations. The EXPRESSION ADL [74] was used to generate the design tool chains (compiler, assembler and simulator). A functional abstraction approach was used to facilitate the generation of HDL code via an HDL generator; chip area, clock frequency and power consumption could thus be determined from the result of synthesizing the HDL code. Consistency across the software tool chain and HDL code can be maintained for a wide range of pipelined architectures, because the framework tools, hardware model, compiler and simulator are all generated from the same ADL specification. The framework is able to add support for additional pipeline paths, interlocking, stalling, flushing and multi-cycle operations [120]. Modifications to the pipeline features and ISA can be made simply by changing the ADL specification. Architectural features such as VLIW and superscalar execution have been implemented on the DLX [78] architecture in this work. However, the incidence of data dependencies from previous instructions has not been addressed, other than by stalling the pipeline [93].

2.6.2 System Generator: Coprocessor Generation

Coprocessors have been used in applications to speed up computation, offloading much of the work performed by the main processor in the system. They come in many flavours (e.g. instruction based, functional based, SIMD based and vector processor based). Typical coprocessor examples are: graphic accelerators [101, 165]; numeric & floating point units (FPUs) [111]; Digital Signal Processors (DSPs) [112]; and Multiply & Accumulate (MAC) units [21].

These coprocessors have a common purpose: to use the specialized functional units predesigned into them, instead of general purpose instructions executed on the main general purpose processor.

In application specific processors, researchers have tried to improve program execution by accelerating loops via coprocessors. [66] showed that by using dedicated coprocessors, the system consumed less power, but was less flexible when the algorithm had to be changed.

Early approaches to hardware-software partitioning were demonstrated in [57, 71, 123] by the extraction of loop segments into coprocessors. However, these approaches often did not fully achieve the maximum possible performance improvements.

In [146], the authors implemented a framework to profile a program dynamically. The executed loops are detected via an onboard hardware profiler, decompiled, and synthesized onto an FPGA coprocessor in the SoC. A dynamic partitioning approach is used to extract the appropriate loops. However, the system has a limited amount of memory in which to cache the loops, and thus can only deal with much smaller regions of code. The framework executes only single-cycle loop bodies, and the number of iterations of a loop has to be determined before the loop executes.

CriticalBlue [49] provides a complete methodology with a toolset for converting functions to individual coprocessors. Being software programmable, the coprocessors generated by the system have some flexibility in accommodating changes to standards. Figure 2.8 shows the software development flow of CriticalBlue. The tool, Cascade, analyzes the application binaries to synthesize programmable coprocessors covering a range of performance, power and area trade-offs. Once the application binaries to be off-loaded are determined, the approach generates all coprocessor RTL design and verification descriptions.

Like CriticalBlue [49], the system by Gupta et al. [72], called SPARK, is a modular and extensible high-level synthesis system. SPARK uses parallelizing compiler technology to enhance instruction-level parallelism. The framework provides an integrated flow from behavioural C to a hardware description language; the hardware description is generated as VHDL code. The SPARK compiler incorporates ideas of mutual exclusivity of operations, resource sharing and a hardware cost model. The SPARK tool performs a set of transformations that include speculative code motions and dynamic transformations. Additionally, a heuristic scheduler employs optimizing synthesis and compiler techniques.

Figure 2.8: Cascade design flow for generating coprocessors, sourced from CriticalBlue [6].

The authors in [115] proposed a processor-coprocessor architecture for high-end video applications. High level synthesis was used to map algorithms in the application to the coprocessor and to minimize the number of computing units on the chip. For example, the number of ALUs was increased until the ALUs were no longer the bottleneck. However, the work did not address how the overall program would actually be executed on the system.

In contrast to the above approaches that use dedicated buses, [77, 164, 133] explored closely-coupled components, with the host CPU using FPGAs as the reconfigurable components. Extended instructions could be created and performed by these blocks, which share the register files and pipeline registers of the host CPU. However, performance is limited by the higher latency of FPGAs relative to the ASIC host processor. In the same category of tightly coupled components, our work explores a new architecture for accelerating loops via a coprocessor component operating in tandem with the host CPU. The coprocessor runs at the same clock rate as the host CPU, since it too is implemented as an ASIC.

As far as JPEG and image processing are concerned, accelerations have been achieved through the use of hardware IP cores. These specialized chips [11, 84, 138] are ASIC implementations of their software counterparts [8, 102]. The hardware encoders are fast and efficient, but highly inflexible if major changes are made to the system. A slightly more flexible approach would be processors that use loop accelerators [101, 112, 165] for image processing purposes. Such approaches require existing programs to be modified extensively to work with coprocessors.

Higher performance and memory access requirements demand a fundamentally more extensive architectural template for the HLS output. Huang et al. [80] proposed an architectural template for the HLS output that consists of a controller-datapath associated with a memory subsystem called computation-unit integrated memory (CIM). Instead of using the system bus to access memory, the CIM offers higher memory bandwidth to computation units located within it, and reduces the overall communication between the memory subsystem and the controller-datapath. The CIM thus provides a template which is highly suitable for deriving efficient implementations of memory-intensive applications. The paper also presented a synthesis framework to automate the design of the CIM-based architecture.

2.6.3 Multiprocessor / Heterogeneity

A multi-core system requires a number of communication schemes to provide the necessary links between the cores in the system. Kim et al. [94] developed a new CDMA-based on-chip interconnection network, using a Star NoC (Network-on-Chip) topology. To enable the rapid design of a multi-core processor system and the evaluation of its interconnect system, Wieferink et al. developed a methodology for retargetable MPSoC integration [162] at the system level, based on LISA processor models [160] and the SystemC framework [16].

Single-core applications use instruction level parallelism, enabled by pipelined processors. Hardware/software implementations can be further enhanced with pipeline scheduling techniques [40]. Extending this scheme, multiprocessors are able to exploit task level parallelism by executing different tasks on separate cores simultaneously.

Jeon et al. [89] partitioned loops into several pipeline stages; their iterative algorithm increased parallelism and reduced hardware cost. Kodaka et al. [97] combined both coarse- and fine-grained parallelism (including loop pipelining) on a single OSCAR chip multiprocessor, which exploits coarse-grained parallelism, loop parallelism and instruction level parallelism using the OSCAR compiler. The OSCAR chip comprises several processor elements (PEs) connected to local and shared memory, facilitating data transfer among processors.

Like multi-core architectures, the systolic array (refer to Figure 2.9) represents a pipe-network arrangement of data processing units, each of which operates when triggered by the arrival of a data object. Data flows across the array between neighbouring data processing units; no global communications are involved. Multiple data streams can be sent and received, thus enabling data parallelism. Arnould et al. [28] presented a high-performance systolic array computer, Warp. The processor array consists of replicas of the same processor arranged in a one-dimensional array. A systolic array assumes all operations are completed in a unit delay to maintain rhythmic data flow; the unit delay is therefore the longest delay among all operators, resulting in a low-performance systolic array. Lee et al. [103] organize each cell of a systolic array as another systolic array (i.e., a super-systolic array) in order to raise the cell utility of the systolic array.
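To make the rhythmic data flow concrete, the following C fragment is a software model of one clock tick of a 1-D systolic array in a FIR-filter-style arrangement; the cell structure and the weights are illustrative assumptions, not the Warp design. Each cell multiplies its weight by the data value passing through it and accumulates into a partial sum that travels with the data.

    #define CELLS 4

    typedef struct { double w, x, y; } dpu_t;   /* weight, data, partial sum */

    /* one global clock tick: read the finished sum leaving the array, shift
     * data and partial sums one cell to the right (neighbour-to-neighbour
     * only), inject a new sample on the left, then let every cell fire */
    static double tick(dpu_t cell[CELLS], double x_in)
    {
        double y_out = cell[CELLS - 1].y;        /* result leaving the array */
        for (int k = CELLS - 1; k > 0; k--) {
            cell[k].x = cell[k - 1].x;
            cell[k].y = cell[k - 1].y;
        }
        cell[0].x = x_in;
        cell[0].y = 0.0;
        for (int k = 0; k < CELLS; k++)          /* all cells fire in parallel */
            cell[k].y += cell[k].w * cell[k].x;
        return y_out;
    }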

A declustering technique for scheduling processes onto a multiprocessor system was proposed in [141]. This technique exposes parallelism instances in a Synchronous Data Flow (SDF) graph in order of importance, and attains a cluster granularity that fits the characteristics of the architecture, depending on the number of processors intended.

Figure 2.9: A systolic array example of data processing units (DPUs).

Cong et al. [47] further explored the technique of clustering in pipelined multiprocessor systems. The work optimizes the latency and resource usage of each processor representing a stage in the pipeline system, under a throughput constraint. The authors developed a set of efficient algorithms (e.g., labeling, clustering and packing) for the generation of an application-specific multiprocessor architecture, given the application specified as a task graph. The framework decides the number of processors to be used, communication buffer sizes, processor interconnections, and the mapping of tasks.

Banarjee et al. incorporated heterogeneous digital signal processors with macro pipeline-based scheduling. The technique [32] used a signal flow graph (SFG) as a basis for partitioning. The work shows that heterogeneous multi-cores are able to improve the throughput rate to several times that of the conventional homogeneous multiprocessor scheduling algorithms.

Sun et al. first explored the use of ASIPs in multiprocessor systems [150]. The work proposed a methodology to simultaneously select custom instructions, and assign and schedule application tasks on extensible processors.

Givargis et al. proposed a technique for efficiently exploring the power/performance design space of a parameterized system-on-chip (SoC) architecture to find all Pareto-optimal configurations [65]. A directed graph was used to capture the interdependencies among parameters, and algorithms were developed that search the configuration space incrementally and prune inferior configurations.

A case study was performed that evaluated the performance of such a pipelined multiprocessor system and compared it against a distributed systems architecture [139]. The work used ASIPs, and demonstrated that selective optimization of the individual cores provides the performance improvement necessary to balance the overall latency of the pipeline stages in the system.

Goossens et al. [68] developed a retargetable tool suite, Chess/Checkers, to enable the design of ASIPs in multicore SoCs. Chess/Checkers offers fast architectural exploration, hardware synthesis, software compilation, inter-ASIP communication and verification. The tools support a broad range of architectures, from small microprocessors, through DSP-dominated cores, to VLIW and vector processors. The toolset provides a graphical user interface for the design of a multiprocessor system. Systems designed with it include portable audio and hearing instruments, wireless modems, video coding and network processing.

Prakash et al. [128] pioneered a methodology to synthesize heterogeneous multiprocessor systems, using a mixed integer linear-programming (MILP) model. The method generates a task execution schedule along with the structure of the multiprocessor system and a mapping of subtasks to processors. The model requires a set of relationships to be satisfied in order to guarantee the proper ordering of task execution events and the correctness of the system. However, the model was not practical enough to generate optimal solutions for even a small number of tasks, because the number of model factors increases exponentially with the number of subtasks.

Rae et al. [131] proposed HeMPS, an application-specific heterogeneous multiprocessor synthesis system that uses a form of Evolutionary Computation known as Differential Evolution to rapidly and efficiently search the design space for an optimal or near-optimal solution. The heterogeneous multiprocessor system uses a point-to-point interconnection network for communication among the various processing nodes.

The use of heuristic or randomized search methods has made possible a reduction in search and synthesis time. Integrating these various optimized methodologies at different stages in the processor design flow would make possible the ultimate goal: a complete, efficient and fast automated system for generating processor systems for specific applications.

Chapter 3

Approach to Customization

3.1 Introduction

This chapter presents a philosophical view of the approaches and methods used in the course of this thesis project. I begin by discussing the shortcomings of the previous approaches described in Chapter 2, and how those shortcomings are addressed in this thesis. I then present the motivation for using ASIPs in our case study, and the importance of extensible systems for customizing embedded systems. Extensible and configurable systems allow the removal of instructions to create ’light-weight’ processors that are cost efficient. The chapter rationalizes the move from extensible processors to coprocessors, and finally to multiprocessor pipeline systems.

3.2 Shortcomings of Previous Research

Generating a suitable processor configuration is critical to satisfying the ever more demanding constraints imposed on embedded systems regarding performance, power and area. However, a large portion of previous research [44, 81, 99, 135, 157] on Application Specific Instruction-set Processors (ASIPs) has focussed only on completely customised instruction sets, by extending the work on base processors. Customizable processor cores such as Xtensa [20], Jazz [10], PEAS-III [85], ARCtangent [4], Nios [1] and SP5-flex [15] contain a minimal instruction set to be implemented in the base processor, do not provide the flexibility of major architectural changes, and are highly dependent on the features of vendor tools. Chapter 4 addresses these limitations by developing a framework that provides total control over the implementation and configuration of the base processor, providing opportunities for further design exploration, not only by extending instructions, but also by reducing the instruction set to improve the performance of the system.

The current trend to speed up ASIPs is to provide extended instructions [20, 149] to execute a group of instructions that occur frequently in the code. Coprocessors are also used when the application needs special functionality, such as floating point arithmetic and sum-absolute-difference instructions.

Computationally-intensive applications normally spend most of their execution time in a small section of executable code, typically the innermost loop. Despite all the efforts of [49, 57, 146] to create coprocessors for loop acceleration, the coprocessors created are still generally separate and distinct components connected to the main bus, so communication overhead occurs between the main processor and coprocessor during execution. Moreover, some coprocessors are complete processors with their own DMA interfaces to reduce memory access latency; such complex coprocessors increase area utilization and power in an embedded system. High-level synthesis tools such as SPARK [72] can also be used to synthesize coprocessors and hardware logic from high-level languages such as ANSI-C. These approaches likewise incur communication overhead when interacting with the main processor, and a separate unit increases the area and power utilization of the system. Additionally, the program to be accelerated has to be modified extensively in order to be integrated.

In contrast to loosely-coupled coprocessors, the research work of [77, 133, 164] integrated coprocessors directly into the architecture of the main processor, using FPGAs as configurable components. These coprocessors are able to provide extended instructions, and to share the register files and pipeline registers of the host CPU. However, FPGAs have higher latency than ASIC host processors, and hence limit the performance of the system. Chapter 5 addresses the shortcomings of such coprocessor systems.

Research studies [94, 162] in multiprocessor systems have explored many different topologies and interconnects to achieve greater performance and efficiency. Sih et al. [141] identified parallel processes in a program flow and allocated them to a predefined number of processors. However, that work mainly targets shared-memory multiprocessors. The work in this thesis aims to explore a pipelined topology configuration that can also exploit the parallelism of tasks in targeted programs.

Cong et al. [47] explored synthesis for pipelined multiprocessor systems, albeit in a homogeneous manner; the applications targeted were based on dataflow process network and synchronous dataflow models. This study partitions task graphs into convex clusters and then maps them onto processors. Inter-processor communication is implemented as FIFOs. The proposed labeling and clustering algorithm generates optimized clusterings for latency minimization. However, the framework does not address the problem of the large exploration space in a heterogeneous multiprocessor system, where the multitude of different configurations for each processor may lead to different cluster sizes and partitions.

Using heterogeneously-configured ASIPs in a multiprocessor system is a fairly new approach. Sun et al. explored the use of ASIPs in multiprocessor systems [150], proposing a methodology that simultaneously selects custom instructions, and assigns and schedules application tasks to extensible processors. However, the work did not address efficient exploration of the large design and configuration space that is inherent in a multiprocessor system. There is a need to address the problem of obtaining a near-optimal multiprocessor configuration (i.e., core count, configuration and connectivity) suited to a particular application.

Banarjee et al. [32] explored the partitioning of applications onto a multiprocessor architecture, but did not consider the possibility of different configurations for each stage in the pipeline (as is possible when ASIPs are used for each stage). In contrast to Givargis et al. [65], who explored the power/performance design space of a parameterized system-on-chip (SoC) architecture, this thesis takes a different approach by exploring the design space of pipelined multiprocessor SoC configurations and their area-performance trade-off behaviour. Shee et al. [139] did not formalize the exploration of a pipeline architecture during the case study. Chapters 6 and 7 complement the work of Sun et al. [150] by exploring a higher level of abstraction to select the different customized processors that would suit a multiprocessor pipeline architecture.

3.3 Modus Operandi

The notion of producing the best microprocessor architecture has been an aspiration of the research community for many years, and presents the design challenge not only of meeting runtime performance requirements, but also of reducing power consumption. In server environments, arrays of processors are used for intensive computations, resulting in significant power usage. In mobile and embedded systems, a low power requirement is critical to allow continuous usage of devices without the aid of external power and heavy battery backups.

In order to address the continuing need for speedup and energy reduction in embedded systems, various processor architectures and methodologies are explored. It is postulated that a general purpose processor provides an instruction set architecture that the running program would not fully use. The logic circuits used to implement the unused instructions are thus redundant and of no use to the functionality of the program. Moreover, the unused circuits result in a larger processor, with lower clock rates and greater power consumption.

A customizable processor was developed and the methodology presented in Peddersen et al. [124]. The scheme uses ASIPmeister [85], a processor generation tool. Chapter 4 describes the contribution of [124] to this thesis, in providing a method of reducing the PISA instruction set and generating a processor for a given application. The reduction of the instruction set and the generation of the processor RTL can be performed within an hour, making this one of the fastest methods of generating an application-specific processor.

The resulting processor is smaller in size, and capable of running the same application with reduced power consumption and at a higher speed (due to a reduced clock period). The initial processor was based on the SimpleScalar / PISA [30] instruction set. The selection of this particular instruction set meant that a rich set of tools was available. For five benchmark applications, we show that, on average, processor size can be reduced by 30%, energy consumption by 24%, and performance improved by 24%.

However, customizing the instruction set of the microprocessor alone would not dramatically increase the performance of critical applications. Extensible and customizable processors such as Xtensa [20], Jazz [10], ARCtangent [4], Nios [1] and SP5-flex [15] have demonstrated the need to increase processing power by integrating coprocessors into the system. Coprocessors provide additional processing power to speed up complex computations such as floating point arithmetic and vector operations. As is the case in general purpose processors, most instructions in complex coprocessors would not be fully used, and would only add unnecessary power and area usage to the system.

A scheme is therefore needed to create a customized coprocessor platform that improves execution time while taking account of power and area usage. CriticalBlue [49] provides a complete methodology, with a toolset for converting functions to individual coprocessors. Being software programmable, the coprocessors generated by the system have a degree of flexibility to accommodate changes to standards. This capability proves useful for rapidly generating customized and efficient RTL code to speed up applications. However, being a conventional coprocessor implementation, the base processor still relies on conventional buses and protocols to communicate with the coprocessor.

Chapter 5 traces an approach to creating a novel coprocessor architecture, tightly coupled with the base processor to provide faster interactions between them. This approach is prompted by the high-latency communication overhead of conventional coprocessor communication buses, as in CriticalBlue [49] and other works [115, 146]; a tightly coupled coprocessor is able to connect directly to the internal components of the base processor. Besides this reduction in communication overhead between base processor and coprocessor, other techniques were used to speed up processing. Latency hiding was used to exploit the parallelism available in this architecture: memory access (on the base processor) and loop processing (on the coprocessor) can be interleaved and performed in parallel in order to reduce the time spent waiting for memory accesses.
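The following C sketch illustrates the kind of latency hiding described above, using double buffering: while the coprocessor works on one block, the base processor fetches the next. The block size and the coproc_start()/coproc_wait() handshake routines are hypothetical names used for illustration, not the actual interface developed in Chapter 5.

    #include <string.h>

    #define BLK 64                  /* bytes per processed block (assumed) */

    extern void coproc_start(const unsigned char *in, unsigned char *out);
    extern void coproc_wait(void);

    void process_stream(const unsigned char *src, unsigned char *dst, int nblk)
    {
        static unsigned char buf[2][BLK];

        memcpy(buf[0], src, BLK);                       /* prefetch first block */
        for (int i = 0; i < nblk; i++) {
            coproc_start(buf[i & 1], dst + i * BLK);    /* coprocessor runs...  */
            if (i + 1 < nblk)                           /* ...while CPU fetches */
                memcpy(buf[(i + 1) & 1], src + (i + 1) * BLK, BLK);
            coproc_wait();                              /* per-block handshake  */
        }
    }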

A JPEG encoding algorithm was investigated, and one of its loops accelerated by implementing it in a coprocessor. The possible acceleration was assessed by implementing the critical segment with two different coprocessors and with a set of customized instructions. The two coprocessor approaches are: a high-level synthesis (HLS) approach, and a custom coprocessor approach. Approaches such as CriticalBlue [49] and SPARK [72] present high-level synthesis mechanisms that generate coprocessors from C language-type specifications; however, such coprocessors remain independent of the implementation of the base processor. By taking advantage of the architecture of the base processor, it is possible to exploit the internals of the CPU to the advantage of the program being run. A loop performance improvement of 2.57× is achieved using the custom coprocessor approach, compared to 1.58× for the HLS approach and 1.33× for the customized instruction approach, each with reference to the main processor alone. With the integrated coprocessor approach, more computations can be offloaded from the base processor to the coprocessor, achieving better performance than the high-level synthesis approach. Energy savings within the loop are 57%, 28% and 19% respectively.

Extensible processors have been shown to provide good cost efficiency and speed improvements on single processor applications [20, 149]. Designers are able to pick the best-fit CPU core and select processor peripherals based on the application being run.

On a different note, a coprocessor itself is effectively a microprocessor, so a coprocessor system is in fact a multiprocessor system. Extending this concept further, extensible processors can be used to create customized cores for parallelized, partitioned programs. Extensible on-chip multiprocessors promise to have a significant impact on embedded system design [150]. Such systems would allow the designer to configure different processors with variable instruction sets, different cache sizes and even different register widths and coprocessors. This is superior to a homogeneous

solution (a.k.a. SMP architecture) in terms of cost efficiency and area usage.

There are various ways to connect multiple processors into a single system. The topologies include a shared bus network and an interconnected network. The shared bus network allows easier system design and routing placement; however, the communication overhead would be extremely large if a significant number of processors were to use the bus at the same time. In contrast, the various processing cores in an interconnected network topology can communicate with each other over several one-to-one connections, although this approach is less flexible if communication paths need to be changed after the interconnects have been fixed at design time.

We focus our research on streaming applications, such as JPEG encoding and MP3 encoding, and believe that the designer can best judge which program segments should be partitioned onto separate processors. Automatic code partitioning and scheduling tools [32, 141] are ideal for a complete automation methodology. However, such approaches depend heavily on the represented input structures of the target algorithm.

With streaming applications, it would be possible to pipeline the application into several pipeline stages, as demonstrated by Jeon et al. [89]. The main application loop is pipelined into several stages, similar to a pipelined processor architecture [78], where each pipeline stage is mapped to a task in the program. If a pipeline stage is heavily used and represents a bottleneck in the pipeline, additional parallel pipeline stages can be introduced which accept input from the previous stage and feed the outputs of the parallel stages back to the next stage in the pipeline.

The disadvantage of this approach is that it only applies to streaming applications; loops that require output values to be fed back as input values to previous stages would not be appropriate for this approach.

A case study was first performed to investigate the pipeline multiprocessor approach. Several versions of the JPEG encoder program were made manually as pipeline multiprocessor applications (see Chapter 6). An extensible processor, Tensilica's Xtensa LX, was used for the processing cores in the multiprocessor system. Queues were used as the means of communication between cores. The application was partitioned into multiple sequential blocks, each block representing a stage in a sequential pipeline. A number of differing configurations, ranging from a single-core to a nine-core system, were implemented. Based on the performance of these systems with respect to area, the seven-core system was chosen for further optimization. By carefully customizing each processor, the pipeline is balanced (i.e. processing times are nearly equal). With selective optimization on the seven-core system, a speedup of up to 4.6× was obtained, with an area increase of only 3.1× (an area increase to speedup ratio of just 0.68).

The case study showed that a pipeline multiprocessor architecture can yield a more cost-efficient system than a single processor system. The next step was to develop a methodology for selecting the best configuration for each pipeline system, one that could be automated and executed in a short amount of time. An initial heuristic, using runtime × area as the cost value, was developed and used in configuration selection. For the JPEG and MP3 benchmarks, the minimum cost obtained through this heuristic was within 19.94% and 5.74% of the optimal values respectively. Optimal values were searched exhaustively with the aid of a branch-and-bound scheme for faster searching.
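To make the selection step concrete, the sketch below scores every candidate configuration by runtime × area and keeps the cheapest; this is a minimal illustration of the cost idea only, and the struct fields and function name are ours, not taken from the thesis toolchain.

    #include <stddef.h>

    /* One candidate pipeline configuration: its estimated runtime and area. */
    struct config {
        double runtime;  /* e.g. estimated execution cycles */
        double area;     /* e.g. gate count */
    };

    /* Return the index of the configuration minimizing runtime * area. */
    static size_t select_min_cost(const struct config *c, size_t n)
    {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (c[i].runtime * c[i].area < c[best].runtime * c[best].area)
                best = i;
        return best;
    }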

Pushing for more accuracy, a new heuristic was developed in Chapter 7. The cost functions are now transformed into linear equations. The individual runtime and area costs are normalized and scaled, using coefficient values predetermined for each pipeline design. The minimum cost obtained through our heuristic is now within

0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks respectively. The heuristic solution was reached within a fraction of a second, while some configurations took several hours to converge when the branch-and-bound method was used.

Chapter 4

Customizing by Removing Instructions

4.1 Introduction

The first methodology to be presented in this thesis is the generation of an efficient RTL scheme for the SimpleScalar / PISA instruction set architecture. This chapter shows a method of reducing the PISA instruction set and generating a processor for a given application. This reduction and generation can be performed within an hour, making this one of the fastest methods of generating an application-specific processor. For five benchmark applications, we show that, on average, processor size can be reduced by 30%, energy consumption by 24%, and performance improved by 24%.


4.2 Motivation

Searching for the best processor architecture for a particular application is vital to obtain the most cost-efficient implementation while maintaining certain performance and energy constraints. With the advent of ASIPs, this becomes possible without the need to redesign the whole instruction set architecture (ISA) and architectural configuration. By rapidly creating processors with custom instruction sets, the designer effectively reduces the design turnaround time and the design-to-market gap.

A large proportion of previous research on ASIPs has been focussed on completely customizing instruction sets through extending the work on base processors. We have developed a framework that provides total control of the implementation and configuration of the base processor, providing opportunities for further design exploration not only by extending instructions, but also by reducing the instruction set to improve the performance of the system.

Our research exploits the flexibility of ASIPmeister [85, 5] to either include or exclude any subset of the instruction set. Instead of adding instructions to the base processor, we can remove redundant instructions from it to improve performance in terms of area overhead, power dissipation and latency; the instruction set can thus be chosen to closely fit the application being run. Our processor implements the Portable Instruction Set Architecture (PISA), which is closely linked to the SimpleScalar [30] architecture. This ISA was chosen to take advantage of the tool set already available as part of the SimpleScalar framework. Our work can be further improved to include extended instructions as well, providing extra functionality by adding to the existing instruction set.

4.3 Microprocessor Generation Framework

A SimpleScalar / PISA processor generator was developed in collaboration with group members at the University of New South Wales. The methodology rapidly designs a configurable microprocessor core, generating a full SimpleScalar (integer) architecture processor core that is synthesizable into an SoC or onto an FPGA for prototyping.

The methodology is fully explained in Peddersen et al. [124]. The work in this chapter is integrated into this methodology, to generate a processor with various subsets of instructions, in contrast with other approaches that merely extend the base processor core.

The methodology consists of hardware and software library generation. Rapid generation of the processor is achieved using ASIPmeister [5], as it allows targeting of any processor description; instructions can be added and removed at will. The ASIPmeister tool produces an HDL model for the given instruction set description.

Descriptions of register forwarding, pipeline stages and hardware resources are included, to create a specification of the desired processor. This complete HDL description can be augmented with additional hardware, such as cache and memory-mapped I/O.

The software generation stage of the design involves the generation of instruction and data memories to be interfaced with the processor, and the addition of software subroutines needed to interface to hardware and possibly service interrupts. This includes the creation of a file system and structure for systems where an operating system is not present. A boot loader to initialize the memories and stack used in the system is also created. Finally, the HDL memory models used by the design, including all the initial memory maps, are generated, and all these constituents are simulated together as one complete system.

4.4 Application Specific Processor Generation

Many applications do not use the full range of instructions available in a general purpose processor's instruction set (for instance, the floating point instructions or the 'div' instruction). If a processor is being designed for one of these applications, then the hardware dedicated to decoding and executing such instructions can be removed to potentially increase performance and decrease the area and power costs of the processor. By analyzing the application, it is possible to create a minimized processor for just that application by turning off the instructions that are no longer needed.

Figure 4.1: Creating a minimized processor description. (The application written in C/C++ is assembled; an instruction profiler and an instruction removal step, together with a resource analyzer, produce a minimized processor description with a minimized instruction set.)

We have performed this task for our applications running on the SimpleScalar processor. The methodology of this process is shown in Figure 4.1. The application is first compiled to an object file by the processor compilation tools. The resulting file is then analyzed to determine which instructions are present. Instructions not present

in the compiled application can then be removed from the processor description and the processor created again with the hardware generation tool. In addition, the removed instructions are analyzed to see if any hardware resources are no longer required. If all the instructions that access a resource (e.g., the divider) are removed, the associated resource can obviously be removed from the system without harm.
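The profiling step itself reduces to collecting the set of mnemonics that appear in the disassembled application; anything outside that set is a candidate for removal. The following is a minimal sketch of the idea, not the actual tool, and it assumes a listing with one instruction per line and the mnemonic as the first token.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_OPS 512

    int main(int argc, char **argv)
    {
        char line[256];
        char *seen[MAX_OPS];
        int n = 0;
        FILE *f;

        if (argc < 2) { fprintf(stderr, "usage: %s listing\n", argv[0]); return 1; }
        f = fopen(argv[1], "r");            /* disassembled application */
        if (!f) { perror("fopen"); return 1; }
        while (fgets(line, sizeof line, f)) {
            char *mnemonic = strtok(line, " \t\n");
            int i, known = 0;
            if (!mnemonic) continue;
            for (i = 0; i < n; i++)          /* already recorded? */
                if (strcmp(seen[i], mnemonic) == 0) { known = 1; break; }
            if (!known && n < MAX_OPS)
                seen[n++] = strdup(mnemonic);
        }
        fclose(f);
        for (int i = 0; i < n; i++)          /* instructions to keep enabled */
            puts(seen[i]);
        return 0;
    }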

Another option to allow a reduction in the size and power of the design is to replace large hardware resources with software subroutines that perform the same function. This technique is best used when a large resource is used very infrequently, so the speed loss incurred will not be too high. To perform this task, the instructions that access the resource are turned into syscall-like jumps to known code locations. To demonstrate this, we have written subroutines for the division instruction to replace the divider in applications where the 'div' instructions are used only infrequently, by conversion functions such as printf() and scanf().
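As an illustration of such a subroutine, the sketch below implements unsigned division by the classic restoring shift-subtract method. The function name and interface here are ours for illustration; the real replacement routine is entered through the syscall-like jump described above.

    #include <stdint.h>

    /* Software substitute for a hardware divider: restoring
       shift-subtract division, producing one quotient bit per step. */
    uint32_t soft_divu(uint32_t num, uint32_t den, uint32_t *rem)
    {
        uint32_t q = 0, r = 0;
        int i;

        if (den == 0) {              /* behaviour on divide-by-zero is ours */
            *rem = num;
            return 0xFFFFFFFFu;
        }
        for (i = 31; i >= 0; i--) {
            r = (r << 1) | ((num >> i) & 1u);  /* bring down next bit */
            if (r >= den) {                    /* subtract when it fits */
                r -= den;
                q |= 1u << i;
            }
        }
        *rem = r;
        return q;
    }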

The process of reducing instructions and replacing large, infrequently used hardware components can reduce the processor size quite substantially, especially if entire design resources are removed; this is not possible with other ASIP design solutions, whose base processor instruction set cannot be pruned. This allows rapid processor generation targeted at the application the processor will execute.

4.5 Experimental Setup

The experimental setup consisted of the system-on-chip architecture as the Device Under Test (DUT), connected to instruction and data memory models. The memory models are generated by compiling C programs into SimpleScalar binaries and then translating them into VHDL models. The setup is shown in Figure 4.2.

Benchmark applications used in our tests are taken from Mediabench [102].

Figure 4.2: Experimental Setup. (The application written in C/C++ is compiled with the SimpleScalar compiler suite and profiled to obtain a minimized instruction set; hardware generation with ASIPmeister produces the SS_CPU, while software library generation creates the stack, file structure and syscall subroutines; the resulting simulation model with instruction and data memories feeds design analysis of area, power, maximum frequency and execution time.)

The specific applications used for testing are listed in Table 4.1. Three processors were rapidly generated for use in the experiment; they are listed in Table 4.2. Configuration A is a processor with all SimpleScalar/PISA instructions; configuration B is a processor with a minimized SimpleScalar/PISA instruction set based on the adpcm application; and configuration C is a processor with a minimized instruction set based on the pegwit application.

The component within the DUT is a synthesizable HDL model of our SimpleScalar processor, generated through the methodology outlined above.

The SoC architectures described above are used in this thesis to evaluate the area, power, and performance of the DUT. Other customizations, such as the existence of a memory hierarchy, multiple data paths, changes to pipeline depth, instruction issue width, etc., can be rapidly generated for evaluating different SoC architectures.

Application   Details of the application
adpcmenc      adpcm file encoder
adpcmdec      adpcm file decoder
pegwitkey     pgp key generation
pegwitenc     pgp encryption
pegwitdec     pgp decryption

Table 4.1: Mediabench Benchmark Applications used in experiment.

The Synopsys Design Compiler [7] and a 90nm standard cell library were used to evaluate the DUT area, power, and maximum operating frequency. To evaluate the execution time of each application, we estimated the number of cycles needed to run on the processor. A custom tool was created to calculate the total number of instructions executed in the application, plus any additional stalls that occur during execution. This count of total execution cycles provides an accurate measurement of the processor performance without the need to perform lengthy RTL simulations.
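Once the counts are known, the estimate is simple arithmetic: cycles divided by the achievable clock frequency. A hedged sketch, using the configuration B figures for adpcmenc from Table 4.3 as illustrative inputs:

    #include <stdio.h>

    int main(void)
    {
        /* cycles = instructions executed + pipeline stalls, as counted by
           the profiling tool; the figures below are those reported for
           adpcmenc on configuration B in Table 4.3 */
        unsigned long long cycles = 9565930ULL;
        double fmax_hz = 38.4e6;   /* post-synthesis maximum frequency */

        printf("estimated runtime: %.3f s\n", cycles / fmax_hz);  /* ~0.249 s */
        return 0;
    }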

Config.   Specification
A         SS CPU with full instruction set.
B         SS CPU with minimized instruction set for the adpcm application.
C         SS CPU with minimized instruction set for the pegwit application.

Table 4.2: Different configurations of the SoC architecture.

Application   Config   Area      % area      Energy   % Energy    Clock        Frequency   % Frequency
                       (gates)   reduction   (mJ)     reduction   Cycles       (MHz)       improvement
adpcmenc      A        196218    -           53.23    -           9,560,870    30.3        -
              B        128843    34.3%       41.94    21.17       9,565,930    38.4        21.2%
adpcmdec      A        196218    -           100.74   -           18,092,079   30.3        -
              B        128843    34.3%       79.37    21.20       18,094,103   38.4        21.2%
pegwitkey     A        196218    -           92.75    -           16,654,773   30.3        -
              C        133924    31.7%       64.64    30.30       16,655,279   43.4        30.3%
pegwitenc     A        196218    -           217.48   -           39,052,206   30.3        -
              C        133924    31.7%       123.75   30.29       39,059,037   43.4        30.3%
pegwitdec     A        196218    -           123.75   -           22,222,020   30.3        -
              C        133924    31.7%       86.25    30.28       22,228,345   43.4        30.3%

Table 4.3: Table of results.

4.5.1 Analysis of Results

Area and power figures are measurements of the on-chip components only and do not include the external memory. Results of these measurements are shown in Table 4.3. Column 1 in Table 4.3 shows the application executed, column 2 shows the processor configuration of the DUT, column 3 provides the area measurement, column 4 gives the percentage area reduction of the minimized instruction set processor compared to the full instruction set processor, column 5 gives the energy measurement, and column 6 gives the percentage energy reduction when comparing the different processor configurations.

Column 7 shows the total number of estimated execution cycles of the application.

Column 8 gives the maximum frequency based on the longest pipeline delay, and column 9 gives the percentage speedup of the minimized instruction set processor compared to the full instruction set processor.

Figure 4.3: Experimental results showing area, performance, and energy improvement: (a) area comparison (Kgates); (b) maximum clock period (ns); (c) energy (mJ) of SS_CPU with full and minimized ISA, for configurations A, B and C.

The graph in Figure 4.3(a) shows that the minimized instruction set processors

have an average reduction of 30% in area compared with the full instruction set processor. Figure 4.3(b) shows that a minimized instruction set architecture can achieve a speedup of 25% compared with the full instruction set processor. In Figure 4.3(c), the energy reduction of the minimized instruction set processor compared with the full instruction set processor is shown. On average, a 24.5% energy reduction is achieved.

4.6 Conclusions

This chapter presented a novel methodology for rapid processor generation. Included in this process is a method to tailor the processor to specific applications by reducing the instruction set to the minimum required to execute the application. A six-stage pipelined SimpleScalar/PISA processor implementation developed in a joint project was used as the base processor in this work. Performance figures have been calculated for this processor and for minimized versions of the processor for particular applications, which show a marked improvement with the processor reduction technique. The SimpleScalar hardware implementation we created and the software tools for generating memories for the device have been made available for download from http://www.cse.unsw.edu.au/~esl/rapid.

Chapter 5

Customizing by Coprocessors

5.1 Introduction

In this chapter, we describe a novel way to accelerate loops by tightly coupling a coprocessor to an ASIP. Latency hiding is used to exploit the parallelism available in this architecture. To illustrate the advantages of this approach, we investigate a JPEG encoding algorithm and accelerate one of its loops by implementing it with two different coprocessors and a set of customized instructions, comparing the acceleration achieved in each case. The two different coprocessor approaches are: a high-level synthesis (HLS) approach; and a custom coprocessor approach, the former providing a faster method of generating coprocessors. We show that a loop performance improvement of 2.57× is achieved using the custom coprocessor approach, compared to 1.58× for the HLS approach and 1.33× for the customized instruction approach, each compared with the main processor alone; energy savings within the loop are 57%, 28% and 19% respectively.


5.2 The JPEG Encoder

The JPEG compression algorithm being used in this case study is a lossy compression scheme that removes redundant information invisible to the human eye. This scheme has advantages for naturally occurring images that have a variety of shades. The algorithm is ubiquitous in most digital imaging products.

The benchmark program accepts a Portable PixMap (ppm) image (raw file). It reads the file, together with other parameters, and outputs the corresponding JPEG file. The application in general has two main sections: the lossy compression stage (DCT transformation + quantization) and the lossless compression stage (Huffman encoding). How the program is structured cannot easily be determined from its source, due to the complex nature of the function calls. Thus, a designer would need to understand the program and algorithm entirely in order to produce a fully customized coprocessor for such an application.

5.2.1 Loop Identification

The benchmark program is profiled using tools we developed to support the current ISA as well as to extract the necessary information not provided by other tools. We created a tool based on a loop detection scheme proposed in [155]. We then performed a detailed profile of the innermost loops in the program against a set of RAW images of different sizes.

Ideally, the theoretical maximum speedup gained from optimizing the loop could be calculated by dividing the total program runtime by the runtime after deducting the loop runtime; this assumes that the loop execution can be eliminated completely.

However, we choose a more realistic definition. We assume that the theoretical maximum improvement is achieved when all non-memory operations are eliminated completely. This is when all computations (non-memory operations) are moved to the coprocessor (see Section 5.3.1). The profiling stage detects the loop instructions executed (LIE), the memory operations executed in the loop (Memory LIE) and the total number of instructions executed (TIE). TIE consists of LIE and the instructions in the rest of the code.

The theoretical maximum improvement (TMI) is:

    TMI = \left( \frac{TIE}{TIE - (LIE - \text{Memory LIE})} - 1 \right) \times 100\%        (5.1)

where TIE = LIE + non-LIE.

The TMI determined in this way will be used to select the loop that has the greatest potential for speedup in our loop acceleration case study. For the rose.ppm image provided with the benchmark application, which has a resolution of 227 × 149 (33,823) pixels, Table 5.1 shows the seven most critical loops in the JPEG encoder. The last three columns in the table show the percentage runtime of the loop with respect to the whole program, the percentage of non-memory operations in the loop, and the theoretical maximum improvement which can be obtained. The loop starting at line 643 of jcphuff.c has the highest TMI value and is thus selected for our case study.

Loop             Cycles      RT%     COMP%   TMI%
jcphuff.c:643    4,213,978   19.77   78.82   18.46
jcdctmgr.c:232   1,329,496   6.24    87.71   5.79
jfdctint.c:220   1,101,194   5.17    90.11   4.88
jfdctint.c:155   1,087,578   5.10    89.98   4.81
jchuff.c:766     982,554     4.61    91.65   4.41
jchuff.c:684     509,413     2.39    91.73   2.24
jchuff.c:673     508,646     2.39    91.71   2.24

Table 5.1: Loop runtimes
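As a worked check of Equation (5.1) against the first row of Table 5.1, with RT% read as LIE/TIE and COMP% as the non-memory fraction (LIE − Memory LIE)/LIE of the loop:

    TMI = \left( \frac{1}{1 - 0.1977 \times 0.7882} - 1 \right) \times 100\% \approx 18.46\%

which reproduces the TMI% entry for jcphuff.c:643.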

5.3 High-level Synthesis Approach

A high-level synthesis approach is used to convert the loop, written in ANSI-C code, to synthesizable VHDL code. Slight modifications are made to the loop code before the VHDL component can be generated. Each I/O pin of the generated component corresponds to the inputs and outputs defined as arguments to the loop function definition (see Figure 5.2). These functional blocks can be used as functional units that provide extra processing power to a System-on-Chip (SoC) architecture.

The VHDL component has a start signal to control the execution of the component and a done signal to indicate end of execution. The HLS component has no memory location awareness and thus expects input values to be available at its pins at the time of execution.

A high-level synthesis approach was initially used, as it provides an option to unroll loops. Loop unrolling dramatically improves parallelism in the loop while reducing the number of conditional branches executed at the end of every loop iteration. However, loop unrolling is only beneficial if multiple resources can be used in parallel. In a single-pipeline processor system like ours, parallelism through loop unrolling was found to be overly ambitious, since the serially executed memory operations in the base processor are the bottleneck.

5.3.1 Architecture

    EOB = 0;
    for (k = cinfo->Ss; k <= Se; k++) {
        temp = (*block)[jpeg_natural_order[k]];
        if (temp < 0)
            temp = -temp;
        temp >>= Al;
        absvalues[k] = temp;
        if (temp == 1)
            EOB = k;
    }

Figure 5.1: Example code segment & coprocessor interface (inputs: Al, Ss, Se, block[.]; outputs: absvalues[.], EOB).

The loop segment selected in Section 5.2.1 is shown in Figure 5.1. Variables are identified as either inputs or outputs to the loop. The loop is then wrapped up as a function, with input variables being defined as inputs to the function and output variables being defined as pointers passed to the function; the modified function designed for the HLS framework is shown in Figure 5.2. In order to fetch values, a wrapper was made to fetch the right variables (the wrapper sends the proper address signal to the address line of the read port). We have chosen to connect the inputs of the HLS component to the read register ports of the general purpose register (GPR) and its output ports to the write register ports of the GPR.

In this HLS approach, the number of register read / write ports depends upon the number of parameters that the loop function has; a function with many parameters needs a large number of GPR ports.

The coprocessor and base processor share the same register file in the system. Our selected loop segment has 4 inputs and 2 outputs, thus requiring us to connect the necessary lines directly to the register file. Originally, the base processor has a register file with 4 read ports and 2 write ports. As these ports are used by the existing components in the pipeline, additional ports have to be assigned to the coprocessor.

    void prepass(int Al, int Ss, int Se, int *blocks, int *absvalues, int *EOB)
    {
        int k;
        register int temp;

        *EOB = 0;
        for (k = Ss; k <= Se; k++) {
            temp = blocks[k];
            if (temp < 0)
                temp = -temp;
            temp >>= Al;
            absvalues[k] = temp;
            if (temp == 1)
                *EOB = k;
        }
    }

Figure 5.2: Modified loop as an ANSI-C function

This results in an “8-read, 4-write” register file being created for the integration of the HLS coprocessor into our design.

Figure 5.3(a) shows how the HLS-based coprocessor is integrated into the existing design. We have introduced two registers which can be accessed by the base processor. One bit from the COREG register is connected to the start signal of the coprocessor wrapper. The remaining bits can be connected to additional coprocessors of the same architecture. The CODONE register set signal is connected to the done signal of the coprocessor. This register is used to notify the base processor when the coprocessor has finished executing the loop.

Two new instructions are added to the existing instruction set: SCPR (set coprocessor) is used to set specific bits in the COREG register, which in turn asserts the start signal of the coprocessor; BCPR (branch coprocessor) behaves like a normal branch instruction, except that it branches only while the CODONE register indicates that the coprocessor has not yet finished executing the loop.
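The resulting software pattern is sketched below in C. The intrinsics scpr() and bcpr_busy() are hypothetical stand-ins for the SCPR and BCPR opcodes (in the real code the new instructions are emitted directly); the comment marks where the base processor's memory operations sit.

    /* Hypothetical intrinsics standing in for the new opcodes. */
    extern void scpr(unsigned coreg_bits);  /* SCPR: set COREG, assert start  */
    extern int  bcpr_busy(void);            /* BCPR condition: CODONE not set */

    /* Sketch of the calling pattern around the accelerated loop. */
    void run_accelerated_loop(void)
    {
        scpr(0x1);            /* start coprocessor 0 */

        /* ... base processor performs the loop's loads and stores here,
           interleaved with the coprocessor's computation ... */

        while (bcpr_busy())
            ;                 /* BCPR loops back until CODONE is set */
    }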

The premise of our approach is to capitalize on the latency hiding approach [79] in the loop execution itself. The base processor performs all memory operations (fetch / store), while the coprocessor computes the values obtained from those operations.

Figure 5.3: Coprocessor Integration. (a) The HLS coprocessor has many lines connected to the GPR; (b) the custom coprocessor has only one read port and one write port connected to the GPR.


The whole body of the loop is pipelined into different stages (see Figure 5.5). The load segment of the loop is repeated before the loop body, to facilitate loop pipelining.

While the base processor performs the fetching for the second iteration of the loop, the coprocessor can perform calculation for the data fetched during the first iteration.

However, the result from the coprocessor needs to be ready before the end of the second iteration, when the value will be used and stored back to memory. The values in the Instruction Decode (ID) and Writeback (WB) pipeline registers synchronize the execution of the coprocessor with the base processor (i.e. they indicate when the values from memory are available).
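Expressed in plain C, this schedule amounts to software pipelining the loop of Figure 5.1: the operand for iteration k+1 is loaded while the value loaded for iteration k is computed and stored. The sketch below uses the simplified indexing of Figure 5.2, and the compute() call models work that, in the real system, runs concurrently on the coprocessor rather than inline.

    /* Non-memory work that is offloaded to the coprocessor. */
    static int compute(int temp, int Al)
    {
        if (temp < 0)
            temp = -temp;
        return temp >> Al;
    }

    void prepass_pipelined(int Al, int Ss, int Se,
                           const int *blocks, int *absvalues, int *EOB)
    {
        int k, loaded, result;

        *EOB = 0;
        loaded = blocks[Ss];                /* prologue: load for iteration Ss  */
        for (k = Ss; k <= Se; k++) {
            result = compute(loaded, Al);   /* coprocessor: iteration k         */
            if (k < Se)
                loaded = blocks[k + 1];     /* base CPU: load for iteration k+1 */
            absvalues[k] = result;          /* base CPU: store for iteration k  */
            if (result == 1)
                *EOB = k;
        }
    }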

The following simplifications are made to the design. We assume that there will be no preemption or interrupt requests during loop operation; in a multitasking system, a restriction has to be imposed so that swapping does not occur during loop execution. Secondly, the coprocessor must not stall the CPU. This raises the question of what happens when data is not ready for the base processor to use. Such operation latency is already known to the designer at creation time; if such a situation were to occur, NOP instructions would be inserted into the source code. This simplifies the circuitry as well as the logic for such an implementation; alternatively, additional signals could be inserted to handle such a situation.

5.3.2 Advantages & Limitations

1. The high-level synthesis approach provides a fast method for creation of the coprocessor.

2. Since we do not know the schedule of the HLS-based coprocessor, data for the coprocessor has to be available before it starts executing. This requires a large number of register ports. Additionally, we need a special wrapper to stall the coprocessor until the values are ready.

3. Addresses are not known by the automatic C-to-VHDL converter; as such, addresses have to be computed in the base processor. This prevents us from hiding the latency between the address calculations of data required for the coprocessor, and memory accesses.

5.4 Custom Coprocessor Approach

To overcome the limitations of the HLS-based coprocessor, we created a custom coprocessor for which we scheduled the operations carefully, so that they only wait for the dependent data to be ready. In addition, since only one piece of data can be read / written by the base processor at any one time, when such data becomes available, we schedule the coprocessor to fetch it from the register file and store it within the coprocessor. This reduces the amount of interconnect between the processor and coprocessor.

5.4.1 Architecture

As in the HLS architecture, the SCPR and BCPR coprocessor instructions are added to the existing instruction set. The coprocessor connects to and shares the register file of the base processor.

One of the differences, compared with the HLS approach, is that the coprocessor developed now has only one read port and one write port connected to the GPR.

The coprocessor reads the required values from the GPR and stores them in intermediate registers within the coprocessor. Figure 5.3(b) shows the similarities with the HLS approach. However, the register file size is now fixed, at 5 read ports and 3 write ports.

Our integrated coprocessor approach (as opposed to the HLS approach) takes assembly code and converts it to macroblocks. Figure 5.4 shows a code segment with the corresponding graph, where lh and lw are load instructions and sw is a store instruction. A macroblock is detected when a group of interdependent instructions is 'sandwiched' between load instructions and one store instruction. These macroblocks are implemented as components within the coprocessor (manually converted to VHDL and synthesized). Instructions which calculate memory addresses can be grouped together as macroblocks and executed as a coprocessor component (as opposed to the HLS approach, where such execution is only performed in the base processor).

Thus, memory operations (such as lh and lw) occur in the base processor while the macroblock is being executed in the coprocessor. By loop pipelining, while data is being fetched from iteration 2 (Figure 5.5), data from iteration 1 is being processed by the coprocessor. The combined calculation of addresses in the coprocessor improves the overall performance of the application. A coprocessor consists of a number of macroblocks.

The main premise of the integrated coprocessor approach is that memory operations in a single-pipeline processor can only be performed one by one, so it is unnecessary to read in all the register values at the same time. Reading from the register file one value at a time, and storing it into internal registers within the coprocessor when the data is available, is sufficient, avoiding the need to increase the size of the register file.

Instruction SCPR takes the state of the integrated coprocessor from Idle to Ready.

To start a macroblock component within the coprocessor (say macroblock 2 in Figure 5.4), special signals are implemented to signal the availability of the values from the load instructions.

          lh   $20,0($2)
          lw   $22,304($7)
          bgez $20,skip
          subu $20,$0,$20
          nop
    skip: srav $20,$20,$8
          sll  $2,$22,0x2
          addu $2,$2,$3
          sw   $20,0($2)

Figure 5.4: Example code segment & corresponding graph

5.5 Discussion of the Architecture

The execution schedule in Figure 5.5 shows how memory operations and computation can be performed at the same time using this architectural approach. The computation stage of iteration one (marked with '1's) can be hidden during the loading stage of iteration two (marked with '2's), and so on. Execution latency is hidden during memory operations (loads or stores) performed by the base processor.

The graph shows the execution stages in a single-pipeline processor and in our current approach, which is best used when there are large numbers of computation cycles that can be hidden via the loop pipelining technique [40, 89].

A large pipeline depth in any processor design has always been associated with high penalty costs for branch mispredictions, jumps and data dependencies on prior executed instructions [78].

Figure 5.5: Parallel Execution (load / compute / store phases of successive iterations overlapped, comparing a single pipeline with our approach).

Our coprocessor can read the value from a register only after the value has been written back to the register file (i.e. after the writeback stage). If the coprocessor requires a value from the register file that has been updated or written by a prior instruction, then the coprocessor has to wait six clock cycles from the time that the prior instruction was decoded. Thus, this coprocessor architecture would not be feasible for short loops, or for cases where the loop pipelining technique in use does not provide enough time to hide this latency. The custom coprocessor design approach is best used when there is a substantial number of computations

’sandwiched’ between memory operations.

If macroblock 2 in Figure 5.4 is converted to a coprocessor component, then the loaded value in instruction lh can only be read and processed by the component after the writeback stage. The component then writes back the value to the register file before the decode stage of the store instruction sw.

The component created by the HLS framework can be regarded as a function: accepting inputs during an iteration and outputting the results at the end of the iteration. As explained in Section 5.3.2, the addresses of values needed by the HLS component must be calculated by the main processor before the coprocessor needs them. These calculations are necessary because it is not possible to break the loop functionality of the coprocessor using the HLS framework. Such tasks would be more complicated in situations where indirect memory accesses are required. Note that this

only becomes a limitation when the core generated by the HLS framework is being used as a coprocessor in this architecture.

The number of register ports used by the integrated coprocessor approach remains constant (as opposed to the HLS approach; see Section 5.3.1) and would not affect the size of the base processor when the coprocessor increases in complexity. However, Figure 5.8 shows that the size of the integrated coprocessor is actually larger than that of the HLS-based coprocessor, due to the intermediate registers used in the integrated coprocessor architecture (see Section 5.4). Nevertheless, when taken together with the base processor, the integrated coprocessor approach achieves a smaller size compared with the HLS approach.

5.6 Experimental Setup & Tools

We used the SPARK [72] framework in our high-level synthesis approach. SPARK is a C-to-VHDL high-level synthesis framework that employs a set of compiler, parallelizing-compiler, and synthesis transformations. The SPARK methodology is ideal for creating functional ASIC or FPGA blocks / modules from ANSI-C functions that can be used by ASIPs or general purpose microprocessors. Slight modifications are made to the function code before the VHDL component can be generated.

The experimental setup consists of the CPU RTL Model (see Figure 5.6) that is generated by the ASIPmeister CPU generation tool. We used the CPU generation methodology proposed in [124] to provide a basic infrastructure and framework, making use of the existing SimpleScalar [30] toolset. The framework provides rapid generation of an ASIP given a set of CPU specifications. As explained in Section 5.3, two coprocessor instructions (SCPR and BCPR) are added to the existing PISA ISA in the generated CPU. The RTL Model is connected to instruction and data memory models, which are generated by compiling C programs into SimpleScalar binaries and then translating them into VHDL code.

Figure 5.6: Experimental Setup. (The C program is compiled with the SimpleScalar compiler; loops are analyzed and detected from the program source; basic block and graph manipulations drive the generation of the coprocessor RTL model, while ASIPmeister generates the base processor RTL model; together with imem.vhd and dmem.vhd these form the CPU RTL Model and simulation model.)

The coprocessors (HLS and integrated coprocessor) are added and connected to the overall system after the CPU is generated.

Software simulation is performed via ModelSim SE 6.0c, using the Simulation Model shown in Figure 5.6.

The CPU RTL Model is synthesized to gate level using Synopsys Design Compiler [7] W-2004.12 with TSMC 90nm (tcbn90g 110a) standard cell libraries. All registers with a minimum bank size of 4 bits are clock gated by Power Compiler.

The synthesized VHDL files are then simulated in ModelSim SE 6.0c together with the simulation model of data memory and instruction memory. The switching activities obtained are used by Synopsys PrimePower 2003.12 for power calculations (see Figure 5.7).

We also created customized instructions to compare against the coprocessor approaches. The customized instructions used in the extended processor were generated using the approaches developed in [42]. This framework rapidly generates customized instructions with corresponding customized components that are included in the pipeline of the base processor to accelerate the critical code segments. This approach is applied only to the selected loop segment of the Huffman encoder, for comparison with our customized coprocessor approach and the HLS approach.

Figure 5.7: Synthesis and Power Calculation Flow. (The CPU RTL Model is synthesized to gate level with Synopsys Design Compiler, producing area and timing reports; gate-level simulation in ModelSim with imem.vhd and dmem.vhd produces VCD / trace information, which Synopsys PrimePower uses for the power report.)

5.6.1 Verification

The SimpleScalar [30] simulator is modified to provide a framework for rapid prototyping of new extended and coprocessor instructions without implementing the VHDL model (synthesizable VHDL models for all three systems were created, with a complete simulation of JPEG in ModelSim usually taking approximately 150 minutes). The simulator reduces this time to a few seconds and verifies that the change in code does not adversely affect the functionality of the program. When the VHDL model is developed, the data memory dumps from the VHDL and SimpleScalar simulations are compared using the diff unix command. Both memory dumps should be identical.

For verification of new (coprocessor/extended) designs, the memory dumps of the old and new designs cannot be compared, as the program code would have been changed. A program (hexdump) is developed to extract the output files from the memory dump; the output files from the original and new designs are then compared.

5.7 Results

Tables 5.2 and 5.3 show the power and performance improvements of the four designs used in our case study. Area and power figures are measurements of the on-chip components only and do not include the external memory.

Design Approaches        Energy in Loop, Eloop   Loop Energy Savings   Loop, Cloop (cycles)   Loop Improvement
Original Processor       176.903 µJ              NA                    4,213,978              NA
Extended Processor       142.776 µJ              19.29%                3,165,066              1.33x
HLS + Processor          126.917 µJ              28.26%                2,667,332              1.58x
Integrated + Processor   75.964 µJ               57.06%                1,640,340              2.57x

Table 5.2: Loop Energy and Performance Table

Design Approaches        Program, Cprog (cycles)   Program Improvement   Idle Power Usage, Pidle   Loop Power Usage, Ploop
Original Processor       21,317,212                NA                    3.963 mW                  4.198 mW
Extended Processor       20,265,496                5.19%                 4.232 mW                  4.511 mW
HLS + Processor          19,772,662                7.81%                 4.133 mW                  4.758 mW
Integrated + Processor   18,721,162                13.87%                4.125 mW                  4.631 mW

Table 5.3: Program Performance and Power Table

Column 1 in both tables shows the different processor design approaches used in our case study. The energy in the loop segment is shown in Column 2. Column 3 shows the energy savings compared to the original processor implementation.

We define the energy in a loop as:

    E_{loop} = C_{loop} \times T_{clk} \times P_{loop}        (5.2)

where the clock period T_{clk} = 10 ns.
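As a worked check, substituting the original processor's figures from Tables 5.2 and 5.3 into Equation (5.2):

    E_{loop} = 4{,}213{,}978 \times 10\,\text{ns} \times 4.198\,\text{mW} \approx 176.9\,\mu\text{J}

which agrees with the 176.903 µJ entry in Table 5.2.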

All synthesized designs have a 10 ns clock period. The total execution cycles (including all iterations) of our loop segment (identified in Section 5.2.1) are shown in column 4. The loop improvements are shown in column 5. Our approach accelerated program runtime by up to 13.87%, which is close to the theoretical maximum improvement of 18.46% shown in Table 5.1.

In Table 5.3, column 2 gives the total execution cycles taken by the JPEG encoder benchmark program to encode the image chosen in Section 6.2.1. The percentage speedup of the new designs compared to the original processor design is shown in column 3.

Columns 4 and 5 in Table 5.3 respectively show the power used in the processor when no executions are performed (idle stage) and when executing the loop. When not in use, power consumption in our integrated coprocessor is 109.76 µW, whereas the HLS coprocessor uses 53.758 µW (these coprocessor figures are not shown in the table). The graph in Figure 5.8 shows that the total energy used in the loop execution of the integrated version is halved compared to that in the original processor.

Figure 5.8 shows the area usage of the synthesized processor designs and the individual coprocessors. The total processor size includes the size of the coprocessor as well, which is shaded in black. Although the integrated coprocessor is more than twice the size of the HLS coprocessor, the savings in the size of the register file offset this and result in a smaller and more efficient processor design compared to the HLS approach.

5.8 Conclusions

We have performed an interesting case study by exploring a novel and tightly coupled architecture to accelerate a computationally intensive loop in a JPEG encoder.

Loop pipelining and latency hiding are used to achieve near-maximum speedup and parallelism between the base processor and the coprocessor.

Figure 5.8: Area and Loop Energy Usage (loop energy consumption in µJ, and processor and coprocessor sizes in NAND gates, for the base processor, extended processor, HLS coprocessor and custom coprocessor designs).

We also found that the coprocessor approaches achieve much better speedup and lower energy consumption compared with the customized instruction approach. Additionally, using our integrated coprocessor approach, more computations can be offloaded from the base processor to the coprocessor than with the high-level synthesis approach, so as to achieve better performance.

Chapter 6

Customizing by Pipelining

6.1 Introduction

Multicore processors have been used in embedded systems and general computing applications for some time; they execute multiple applications concurrently, with each core carrying out a particular task in the system. Such systems can be found in gaming, automotive real-time systems and video / image encoding devices. They are commonly deployed to overcome deadline misses, which are primarily due to overloading of a single multitasking core. In this chapter, we explore the use of multiple cores for a single application, in contrast to multiple applications executing in a parallel fashion. A single application is parallelized using two different methods: a master-slave model; and a sequential pipeline model. The systems were implemented using Tensilica's Xtensa LX processors, with queues as the means of communication between the cores. In the master-slave model, we used a coarse-grained approach in which a main core distributes the workload to the remaining cores and reads the processed data before writing the results back to file. In the pipeline model, a lower granularity is used. The application is partitioned into multiple sequential blocks;


each block representing a stage in a sequential pipeline. For each model we applied a number of differing configurations ranging from a single core to a nine-core system.

We found that, without any optimization, the sequential pipeline approach for the seven-core system had a more efficient area usage, with an area increase to speedup ratio of 1.83, compared to the master-slave approach with a value of 4.34. With selective optimization in the pipeline approach, we obtained speedups of up to 4.6×, with an area increase of only 3.1× (an area increase to speedup ratio of only 3.1/4.6 ≈ 0.68).

6.2 Background

This case study is based on mapping different parts of the benchmark program and employing a set of industrial tools to rapidly optimize and simulate the entire system in a multiprocessor configuration. The partitioning of the program (initially based on functions in the source program) is performed by analyzing the benchmark results of the simulation. The set of industrial design tools enables us to quickly explore the extent of improvements and area usage of a heterogeneous multiprocessor system.

6.2.1 Case Study Application

A freeware JPEG compression algorithm implementation is used in this case study.

The simple nature of the program benefits this case study, since various sections of the code can be distinguished, partitioned and separated into a multiprocessor configuration. The program partitions are created based on the different stages of processing taken from a standard JPEG encoding algorithm (such as DCT, quantization, zero run-length encoding and Huffman encoding) [8]. This partitioning was done in such a manner that cores could be easily upgraded if one of the algorithms changed. For

example, to implement a better DCT algorithm, only a single stage had to be altered.

This level of partitioning allows the architecture to be retained while the software and core of a particular stage are changed to match changes in the algorithm.

Figure 6.1 shows the various partitions or stages of the program that have the potential to be allocated to different processors. The arrows indicate the flow of RAW bitstreams through the various stages of the encoding process before they are written out to file.

The JPEG encoder program initially accepts a configuration file which specifies the name of the RAW file to read, the quality factor and the format of the RAW image.

These tasks are performed in Stage 1. The program then initializes the quantization tables and writes the appropriate JFIF header information to the output file, which includes the Quantization and Huffman tables (stages 9 and 10). The program allocates two main buffers: one for the complete RAW image that is read from file, and the other for the resulting JPEG file.

The JPEG program then starts reading RGB values from the buffer and converts them to YCbCr values (stage 2). These values are then level shifted (stage 3), based on the JPEG specifications. One macroblock at a time is then selected, in the sequence Y, Cb and Cr, to be DCT transformed and quantized, with the values ordered in a zigzag manner (stages 4, 5 and 6). The pixel streams are fed into the Huffman encoder (stage 7), which processes these streams serially. The generated code is finally output to a file (stage 8).

6.2.2 Baseline Processor Description

This case study uses Tensilica's Xtensa LX processor [20]. The Xtensa LX is part of Tensilica's line of microprocessor cores, which is configurable, extensible and supported by automatic hardware and software generation tools. The synthesizable core is configurable to allow designers to precisely tailor each processor implementation to match the target application requirements. The Xtensa core ISA has a 24-bit instruction set base and allows 16-bit instructions for higher code density; all instructions can operate on 32-bit data.

1  Read RAW         5  DCT horizontal    9   Initialize QT
2  RGB to YCbCr     6  Quant / Zigzag    10  JFIF Markers
3  Level Shift      7  Huffman           11  Close bitstream
4  DCT vertical     8  Write to file

Figure 6.1: The main stages in a JPEG encoder.

The Xtensa LX, like previous Xtensa processors, can support extended instructions, written in the Tensilica Instruction Extension (TIE) language. Such instructions can do the work of multiple instructions of a general-purpose processor. Extended instructions include fusion instructions [149], SIMD/vector instructions and Flexible Length Instruction Xtensions (FLIX) [23] instructions, which are VLIW-like instructions by which multiple operations can be done in a single instruction.

Figure 6.2: Xtensa queue interface (an output queue interface with PushReq and Full signals per output queue, and an input queue interface with PopReq and Empty signals per input queue, connecting two Xtensa LX processors).

TIE queues and ports have been introduced in Tensilica's Xtensa LX processors; they are used to communicate with the world outside of the processor and can do so at a much wider bandwidth than existing interconnects. Queue interfaces are used to pop an entry from an input queue for incoming data, or to push data to an outgoing queue (refer to Figure 6.2). The logic that stalls the processor when it tries to read an empty input queue or write to a full output queue is automatically generated by the Xtensa Toolset. Ports are wires that the processor uses to directly sample the value of an external signal or to drive the value of any TIE state on external signals.

Functions are created to push and pop from the queues. These are blocking functions, since a push into a full queue or a pop from an empty queue results in a stall of that particular pipeline stage. These functions are TIE instructions and form part of the extended instructions of the Xtensa LX processor architecture.
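A pipeline stage is then just a loop around these blocking primitives. Below is a minimal sketch of the Level Shift stage (stage 3); IQ_POP and OQ_PUSH are hypothetical names standing in for the generated TIE queue intrinsics, and the subtraction of 128 follows the JPEG convention of centring 8-bit samples around zero.

    #include <stdint.h>

    /* Hypothetical stand-ins for the generated TIE queue instructions;
       both block in hardware when the queue is empty/full. */
    extern int32_t IQ_POP(void);          /* pop from the input queue  */
    extern void    OQ_PUSH(int32_t v);    /* push to the output queue  */

    /* The Level Shift stage: consume a sample, shift it, pass it on. */
    void level_shift_stage(void)
    {
        for (;;) {
            int32_t s = IQ_POP();   /* stalls until data arrives        */
            OQ_PUSH(s - 128);       /* stalls if the next stage is full */
        }
    }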

The configuration of the base processor used in the case study had been optimized to provide satisfactory results when executing the benchmark application in a single processor system, and is shown in Table 6.1 as LX1. Also shown is a highly stripped-down version of the Xtensa LX processor, LX2, which will be used to replace under-utilized cores to save area and power (see Section 6.3).

6.3 Methodology

We explore the various ways a multiprocessor system can be configured to speed up a simple application. In this section, we outline the multiprocessor architecture as we increase the number of cores in the system. Our methodology uses the queue interfaces which are available on Tensilica's Xtensa LX [20] processors. A simplified JPEG encoder is modified and partitioned to execute on such a system.

We investigate a multiprocessor architecture in a pipeline configuration. The system consists of different processors, each running a portion of a pipeline stage of a program.

Parameter               LX1                     LX2
Speed                   533 MHz                 533 MHz
Process                 90nm GT                 90nm GT
Pipeline length         5                       5
Size                    63,843 gates            39,789 gates
Core Size               0.32 mm2                0.18 mm2
Core Power              74.35 mW                41.3 mW
Memory Area             1.76 mm2                0.15 mm2
Instruction Cache       32kB                    1kB
Data Cache              32kB                    1kB
ISA Instruction         MUL32, MUL16,           density instructions,
Options                 density instructions,   boolean registers,
                        boolean registers,      zero overhead loops,
                        zero overhead loops,    TIE wide stores,
                        TIE wide stores,        TIE arbitrary bytes
                        32 bits sign extend
Max instruction width   8 bytes                 3 bytes
PIF interface width     128 bits                32 bits

Table 6.1: Processor Configuration

Each processor has the potential to be configured optimally, by instantiating only those resources that are appropriate for a particular stage of the pipeline.

A heterogeneous processor system minimizes the redundancy of resources, as those processors that deal with complex computations may be parameterized with extra resources. Communication among processors is facilitated using the ports and queues that are provided by the Xtensa LX [20] processor architecture.

The multiprocessor pipeline architecture design requires programs that can be broken up into computationally independent blocks, resembling the computational blocks in a pipeline processor architecture [78]. Transfer of data from one processor to another is facilitated by a queue.

In the case of the JPEG encoder, the pipeline architecture is ideal, as the JPEG encoder displays characteristics of a pipeline nature, given that the encoding process is divided into stages which are independent of each other. The proposed architecture

consists of standalone processors running sections of the JPEG encoder program that have been recompiled as individual programs. These subprograms, which reside in these processors, accept data via the queues of the Xtensa processor, perform the necessary computation, and finally push the results via the output queue into the next stage of the pipeline. The computed data traverses the pipeline stages until it is finally written out to file by the last processor in the pipeline. It should be noted that, while one processor is working, the rest of the system is still busy processing workloads for other stages.

The scalability of a multiprocessor pipeline architecture depends entirely on the suitability of the data structure and control flow of the targeted program. A particular configuration is considered efficient when all processors have an equal computational workload, i.e. when no processor in the pipeline has to wait for another stage to complete.

6.3.1 Single Pipeline

The rationale for a pipeline implementation is to increase throughput during execution. As in a pipeline microprocessor, a pipeline implementation of a JPEG encoder is able to encode images at a faster rate. A pipeline approach greatly increases processing speed, up to the point where it meets the minimum processing demand for a particular application (its minimum frame rate).

Figure 6.3 shows the stages in a pipeline processor implementation. Registers are used to separate the different stages in the pipeline. A pipeline processor implementation uses instruction-level parallelism (ILP) to exploit the parallelism among instructions in a code. However, such an approach is limited by the amount of code which can be overlapped within a basic block – a straight-line code sequence with no branches in except at the entry, and no branches out except at the exit.

Figure 6.3: Stages in a pipeline processor architecture (IFETCH1, IFETCH2, DECODE, EXE, MEM, WB).

The work in this section also exploits task-level parallelism in order to obtain further performance improvement. As in a pipeline processor implementation, a multiprocessor system is used, where different processors represent different stages in the pipeline. Each stage is mapped to a particular task of a benchmark program; different stages are separated via TIE queues (as opposed to the registers in a pipeline processor).

Five Cores

The JPEG encoding process (Figure 6.1) contains sequential routines that can be broken up into stages, offering the possibility of pipelining the encoding process.

These stages represent functions within the original program, which are extracted and compiled as individual programs, each to be executed on a single core.

Figure 6.4: A five core system (cores A–E) interconnected by queues, taking a RAW image as input and producing a JPEG stream. Each processor is assigned a stage of the JPEG pipeline.

The main program of the encoder sends the required information to the appropriate stages of the pipeline in order to initialize the quantization tables and JFIF [75] headers that are written out to a file. As each pipeline stage only has to wait for the data from the previous stage, the partitioned program is constructed such that one core reads the RAW image while another writes the encoded JPEG into a new file. Each stage processes data at a macroblock level (8 × 8 pixels).

We start with a five core multiprocessor configuration (Figure 6.4), pipelined into five major stages. The quantization table initialization code shares the same core as the one that implements the quantization stage of the pipeline. This stage receives initial values from the main program (core A), which reads in encoding parameters that define the quality of the resulting image. Core D has the necessary code to initiate the writing of JFIF markers and closing the JPEG bit stream. The last core (Core E) is initialized by the first core with the name of the output file and writes any bytes received from the previous stage (Core D) to file. Table 6.3 summarizes the allocated stages to the respective cores.

Six Cores

We next introduce a new core into the system and allocate the LevelShift stage to it (refer to Table 6.3). The LevelShift stage accepts YCbCr values from the previous stage, level shifts the values and then pushes them out to the queue in macroblocks of 64 Y's, 64 Cb's and 64 Cr's. As will be shown in Section 7.6, the introduction of this stage into the pipeline does not increase the overall performance of the encoding process.
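The operation itself is tiny; a sketch of the stage's computation is shown below (the function name is ours). JPEG's DCT expects samples centred on zero, so each 8-bit sample is shifted from [0, 255] to [-128, 127]:

    #include <stdint.h>

    /* Level shift one 8x8 component block in place: subtract 128 from
     * every sample so the subsequent DCT sees zero-centred data. */
    void level_shift(int16_t block[64])
    {
        for (int i = 0; i < 64; i++)
            block[i] = (int16_t)(block[i] - 128);
    }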

Seven Cores

DCT (Discrete Cosine Transform) transformations are known to be very computation intensive, and special circuits exist for performing them. In our next approach, the two-dimensional DCT function is split up into two stages: a one-dimensional DCT vertically, and a one-dimensional DCT horizontally. A seven core processor configuration benefits from such an approach (refer to Table 6.3).
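The split is possible because the 2D DCT is separable: a 1D DCT applied to every row, followed by a 1D DCT applied to every column, equals the full 2D transform, so the two passes can be placed in different pipeline stages. A sketch is shown below (dct_1d_8 stands for any 8-point 1D DCT kernel; the names are ours):

    /* Any 8-point 1D DCT kernel, transforming v in place. */
    extern void dct_1d_8(double v[8]);

    /* First pipeline stage: 1D DCT over every row of the block. */
    void dct_rows(double b[8][8])
    {
        for (int r = 0; r < 8; r++)
            dct_1d_8(b[r]);
    }

    /* Second pipeline stage: 1D DCT over every column of the block. */
    void dct_cols(double b[8][8])
    {
        for (int c = 0; c < 8; c++) {
            double col[8];
            for (int r = 0; r < 8; r++) col[r] = b[r][c];
            dct_1d_8(col);
            for (int r = 0; r < 8; r++) b[r][c] = col[r];
        }
    }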

6.3.2 Multiple Pipelines

Combining both approaches, parallel computations of macroblocks and pipelining, we are able to exploit further parallelism within the JPEG compression algorithm.

The pipelined nature of the previous core systems is maintained. However, from stage four onwards (DCT), macroblocks for luminance (Y), chrominance red (Cr) and chrominance blue (Cb) are processed in separate parallel pipelines. This reduces the processing bottleneck in the DCT and quantization/zigzag stages.

These parallel pipelines include DCT and Quantization (with zigzag) (QZ) stages.

The outputs of these parallel pipelines then converge into a single pipe where Huffman encoding is performed. Huffman encoding depends on serial input and thus, cannot process separated JPEG streams independently.

Nine Cores

Figure 6.5: A nine core system with three internal pipeline flows. A raw-file read stage (RGB conversion + levelshift) fans out into separate Y, Cb and Cr pipelines, each with its own DCT and QZ stages, which then converge for Huffman encoding and writing to file.

Following the five pipeline stage multiprocessor approach in Method II, we try to increase the throughput of the middle stages of the JPEG compression pipeline by replicating stages of the pipeline, because of the heavy utilization rates in the DCT and Quantization stages (refer to Table 6.3). The pipeline flow diverges only at stages four and five (refer to Figure 6.1) into three separate pipeline flows, before converging back into a single processor at the Huffman encoding stage. The E, F and G cores implement the Quantization stages and process the Y, Cr and Cb macroblocks separately. These cores initialize their respective quantization tables (Y, Cr, Cb). This results in a nine core processor system.

The utilization of each core in the system is shown in Table 6.2.

  Cores            A    B    C    D    E    F    G    H    I
  Utilization (%)  76   51   51   51   28   28   28   95   99

Table 6.2: Utilization in a nine core multipipeline system

Seven Cores

Table 6.2 shows that the utilization rates of the Quantization cores (E, F and G) in the nine core system are very low, prompting us to replace all three of these cores with just one. This is due to the bottleneck in the second last stage of the pipeline (H), where the Huffman encoding has already reached its maximum throughput.

Note that Stage I is not considered, since it is constantly looping. Outputs from the three separate DCT cores are now channeled into a single core that will perform quantization and zigzag transformations.

With the seven core multiprocessor system, we have methodically reduced the area consumption of the system. Based on the utilization rates of each core in the pipeline, we were able to selectively optimize the required cores using Tensilica's XPRES compiler, which automatically generates TIE instructions (SIMD, FLIX, vector, fusion).

When the runtime of the selectively optimized system closely matches that of the fully XPRES compiled version, we replace the cores which have very low utilization rates with simpler ones, including the replacement of LX1 cores with LX2 cores (refer to Section 6.2.2), and progressively reduce the sizes of the instruction and data caches (until they reach the same performance as the fully XPRES compiled version, or reach the minimal configuration of 1kB). This methodology results in a heterogeneous multiprocessor system that provides a high ratio of performance improvement to area increase.

          Stages (single pipeline)           Stages (multiple pipelines)
  Cores   5 cores    6 cores    7 cores      7 cores      9 cores
  A       1, 2, 3    1, 2       1, 2         1, 2, 3      1, 2, 3
  B       4, 5       3          3            4, 5         4, 5
  C       6, 9       4, 5       4            4, 5         4, 5
  D       7, 10, 11  6, 9       5            4, 5         4, 5
  E       8          7, 10, 11  6, 9         6, 9         6, 9
  F       -          8          7, 10, 11    7, 10, 11    6, 9
  G       -          -          8            8            6, 9
  H       -          -          -            -            7, 10, 11
  I       -          -          -            -            8

Table 6.3: Processor configuration with multiple pipeline flows

6.4 Experimental methodology

We used Tensilica’s Xtensa RA2006.4 Toolset for the Xtensa LX family of processors.

The toolset also provides a set of compilation tools to compile C/C++ code, targeted to our specially configured Xtensa LX cores (refer to Section 6.2.2). The Tensilica Instruction Set Simulator (ISS) and Xtensa Modeling Protocol (XTMP) environment were used to run the multicore systems; XTMP is an API and runtime environment for rapid multiprocessor description and analysis, using its own simulation engine and generating SystemC-compatible models.

For each system, multiple Xtensa cores were instantiated and XTMP was used to connect them to peripherals and interconnect. The ISS directly models the Xtensa pipeline and operation as a system-simulation component within the XTMP environment. With XTMP, different multiprocessor configurations could be simulated in a short amount of time.

The simulator allows for communication between the cores and peripherals using a cycle-accurate, split-transaction simulation model without using a clock. The ISS was used to generate profiling data for all cores in the system, which was then analyzed using Tensilica's gprof profiler. The profiles can include the cycle counts for all functions executed by the cores. The ISS can also print a summary of the total cycle count and the global stalls of each core.

Each individual core is connected via the queue interface provided by the Xtensa LX core using the XTMP environment. We create C-code functions and data structures to model the queues within the XTMP environment. The queues are simple FIFO (first-in, first-out) components that mainly operate via the functions push and pop, called by each of the connected cores in the simulation environment. Queues transmitting RAW bit streams between processors are modeled to have 64 entries. A full queue or an empty queue effectively stalls that section of the pipeline.
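A minimal sketch of such a queue model is given below. The names and the boolean-return convention are ours, not part of the XTMP API; the wrapper around each simulated core treats a false return as a stall of that core.

    #include <stdbool.h>
    #include <stdint.h>

    #define QUEUE_ENTRIES 64   /* depth used for the RAW-stream queues */

    /* A circular FIFO modelling one TIE queue between two cores. */
    typedef struct {
        uint32_t data[QUEUE_ENTRIES];
        int head, tail, count;
    } fifo_t;

    /* Returns false when full: the producing core is then stalled,
     * just as the generated queue hardware would stall it. */
    bool fifo_push(fifo_t *q, uint32_t word)
    {
        if (q->count == QUEUE_ENTRIES) return false;   /* PUSH stall */
        q->data[q->tail] = word;
        q->tail = (q->tail + 1) % QUEUE_ENTRIES;
        q->count++;
        return true;
    }

    /* Returns false when empty: the consuming core is stalled. */
    bool fifo_pop(fifo_t *q, uint32_t *word)
    {
        if (q->count == 0) return false;               /* POP stall */
        *word = q->data[q->head];
        q->head = (q->head + 1) % QUEUE_ENTRIES;
        q->count--;
        return true;
    }

A fifo_t is zero-initialized before use; in the XTMP simulation program, one such object backs every arrow in the pipeline topology.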

We created multicore processor systems by identifying hotspots within the single processor benchmark application (see Figure 6.6-1). The hotspots were identified by cross-compiling the benchmark application using the Tensilica Xtensa LX compilation tools and by simulating it on a selected configuration, namely LX1 (refer to Table 6.1). The hotspots were mainly the functions identified in Section 6.2.1. We partition and allocate these functions based on the methodology defined in Section 6.3 (see Figure 6.6-2). Configurations and topologies of the multiprocessor systems were created manually (see Figure 6.6-3). An XTMP simulation program, specially customized to generate profiling and other relevant benchmark information, was created for each of these multiprocessor systems, which were then simulated and their performance and area utilization recorded (see Figure 6.6-4).

Figure 6.6: Experiment Methodology. (1) The original program is cross-compiled against the processor library and profiled on the ISS; (2) the designer partitions it via its data flow graph into partitioned programs; (3) pipeline configurations are created manually and simulated; (4) the best architecture is chosen; (5) selective optimization is applied – running XPRES on critical stages, switching under-utilized cores and reducing the caches of under-utilized cores – and the loop repeats until the pipeline is balanced.

The architecture with the best ratio of performance increase to area increase is selected for further optimization. Our approach manually investigates the most appropriate stage to optimize, one step at a time; this selective optimization seeks to produce an architectural configuration that has well-balanced utilization among all stages in the pipeline (see Figure 6.6-5). The generated systems include similarly configured cores, excluding parameterized components such as the number of outgoing and incoming queues.

The toolset also includes the XPRES (Xtensa PRocessor Extension Synthesis) compiler, which creates tailored processor descriptions for the Xtensa processors from native C/C++ code. The XPRES Compiler was used to create custom instructions (by generating RTL models) for each core in the system. Using the designer-defined input of C programs to be analyzed, XPRES extends the base processor with new instructions, operations and register files using TIE extensions. It does so by automatically generating a new TIE file that can then be included for recompiling the source code. XPRES was used to create a distinct TIE file for each core in each system, to optimize each individual core using only the C files that are used on that particular core. Each individual core in the multiprocessor system is compiled through XPRES to explore the extent of improvements that can be obtained via extended instructions.

Area counting includes the base processor, instruction & data caches and the TIE instructions. Each multiprocessor system generated in the case study reads a RAW file and saves it as a JPEG format file, viewable by any standard image viewing application.

6.5 Results & Analysis

Figure 6.7 shows the runtime improvements and area increase in relation to the original core, LX1 (refer to Section 6.2.2). The graph shows the two main architectures used in this case study: a single pipeline architecture (5, 6 & 7 cores) and a multipipeline architecture (7 & 9 cores). It can be seen that the area increase to runtime improvement ratio for each of the multiprocessor systems has a value of more than one, and actually increases as more processors are added to the system. These ratios do not justify a need for increasing the number of processors in the system. With multiprocessor pipeline systems, the improvements seem to level off for seven cores and above, due to the fully saturated Huffman encoding stage. Unless the Huffman stage could be further partitioned and parallelized, this would remain a critical stage in the pipeline.

Figure 6.7: Performance of multiprocessor systems without optimizations. The chart plots area (original and XPRES, in million gates) and runtime (original and XPRES, in million cycles) for the 1 core, 5 core (SP), 6 core (SP), 7 core (SP), 7 core (MP) and 9 core (MP) systems.

As a form of measurement, the systems have been compiled with XPRES to find the maximum improvement if all cores were optimized. The maximum performance improvement is obtained from the nine processor system, with a performance increase of 3.8×; and 4.7× when run through the XPRES compiler for each of the nine processors (refer to Figure 6.7).

It is not viable to continue this approach of adding processors to improve performance, as area increases faster than improvement in performance. However, by reducing resources on non-critical processors, we can reduce area, yet keep the same level of performance. We selectively optimized critical stages of the pipeline.

We selected the seven core multipipeline system for further optimizations as it performs almost as well as the nine core architecture while using much less area. Figure 6.8 shows the utilization of each of the seven cores in a multipipeline architecture (refer to Section 6.3.2). The first two graphs in Figure 6.8 show the utilization of the system without optimizations and with XPRES optimizations respectively. Area increase is 7× and 7.7× respectively, with performance improvements of 3.8× and 4.7× relative to the base processor implementation (represented by the decreasing and steady lines in the graph).

It should be noted that the last pipeline stage is always at 100% utilization due to its software implementation, which repeatedly polls for incoming data on every simulation cycle. In Partial XPRES 1, we replaced the original core of the Huffman encoding pipeline stage with an XPRES version. This optimization step moves the critical stage elsewhere: in Partial XPRES 1, the critical path has moved on to the Quantization stage. We replace the Quantization pipeline stage with an XPRES version in Partial XPRES 2, once again making Huffman encoding the critical stage. In this implementation, it can be seen that the parallel pipelines of the DCT stages are not fully utilized. We replaced these cores with LX2 cores, resulting in a utilization jump from 62.3% to 88.3% while area is reduced from 7.1× to 3.8×. At this point, we achieve an area increase to performance improvement ratio of 0.82.

Further optimizations were achieved when reducing the cache sizes of the first core in the pipeline from 32KB to 1KB. This does not significantly affect performance, since this core mainly reads the RAW files and outputs them to the pipeline. The ratio is further reduced to 0.68 while still maintaining a performance improvement of 4.62×.

Figure 6.8: Utilization of the seven pipeline stage systems. Processor utilization (0–100%) is plotted for the initial multipipeline, fully XPRES compiled, and Partial XPRES 1–4 configurations.

Figure 6.9: Runtime improvements and area increase. Area increase and performance increase (both relative to the single core) are plotted for the initial multipipeline, fully XPRES compiled, and Partial XPRES 1–4 configurations.

A complete design space exploration is not feasible due to the huge number of combinations that can be generated when each stage in the pipeline is configured independently (cache configurations range from 1KB to 32KB for each instruction and data cache, and the base instruction set of 80 instructions is extensible to include MAC, coprocessors, SIMD instructions and custom instructions). This would require vast amounts of time to simulate each and every configuration. However, certain stages in the pipeline (i.e., critical stages) should be optimized ahead of the others.

Figure 6.10: Design Space for JPEG Encoder.

Thus, the selective approach to optimization shown in Figure 6.6 is used. This method greatly reduces the time needed to find a configuration which produces a good ratio of area increase to performance improvement.

To obtain a holistic view of the optimization problem, an approximation of the complete design space is shown in Figure 6.10, which shows the performance improvement of the various configurations. These configurations include the LX1 (XPRESed and non-XPRESed versions) and LX2 cores with various cache configurations. The line in Figure 6.10 denotes the points where the performance increase equals the area increase relative to the single processor benchmark configuration (refer to Section 6.2.2). Runtime improvements above this line represent configurations which have a good area increase to performance improvement ratio. Data sets appearing below the line are configurations which are not fully utilized and thus have the possibility of being down-scaled to decrease area use while maintaining performance improvement. It is also noted that our selective optimization method manages to closely match the maximum performance improvement in the whole design space.
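Equivalently, writing S for the speedup of a configuration over the single LX1 core and G for its area ratio (a restatement in our own notation):

    S = \frac{R_{\text{1 core}}}{R_{\text{config}}}, \qquad
    G = \frac{A_{\text{config}}}{A_{\text{1 core}}}, \qquad
    \text{break-even line: } S = G

Configurations above the line satisfy S > G, i.e., their area increase to performance improvement ratio G/S is below one.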

A thorough design space exploration would have taken months of simulation. The approximation in Figure 6.10 was obtained by simulating and profiling each stage of the pipeline independently. The runtime of a particular configuration is

    R = R^{\text{init}}(v_1) + R^{\text{process}}(v_{crit}) + R^{\text{final}}(v_J) \qquad (6.1)

where R^{init}(v_1) is the initialization time of the first stage of the pipeline, R^{process}(v_{crit}) is the longest execution time of a kernel in the pipeline, and R^{final}(v_J) is the finalization time of the last stage of the pipeline. Equation 6.1 is used and further explained in Chapter 7.

The equation is used to permute the various configurations and to calculate the total runtime of a set of configurations (different stages have different configurations).

Such calculations would result in a slight error (less than 2%), though this would not affect the overall trend of the graph.

Extending the selective optimization technique described at the beginning of this section, a formal methodology could be developed to automatically select the appropriate configuration for the critical stage of the pipeline after each configuration change. This would be equivalent to filtering out the redundant configurations that would otherwise worsen the area increase to performance improvement ratio.

6.5.1 Further Architectural Comparison

In addition to the pipeline architectures used in this chapter, we have also performed a case study of master-slave architectures in [139]; please refer to [139] for implementation details and analysis.

Master-slave models have been used in parallel computing to enable task management and parallel and distributed data structures [82]. In a master-slave model, a master program is given the responsibility to spawn processes, and to initialize, collect and display results, while the slave programs perform the computations [108]. A master-slave model of a multicore system was implemented with a varying number (from three to seven) of Xtensa LX processor cores, which were instantiated using the Xtensa LX XTMP/ISS environment. In each model, there was only one main core and (N − 1) slave cores, where N is the total number of cores in the system.

In contrast to the other approaches using shared communication buses [64], communication between cores is achieved via TIE queues. Figure 6.11 shows the performance of these master-slave models in relation to our single/multi pipeline models.

It can be seen in Figure 6.11 that the area of the master-slave implementations linearly increases as more cores are added into the system. The runtime value, however, levels off at the six and seven core implementations. The asymptotic trend shows that the runtime improvement is bounded by the processing rate of the master processor.

Figure 6.11 also shows that for the same number of cores implemented in the system, the pipeline models perform better than those using the master-slave concept.

Figure 6.11: Performance of multiprocessor systems without optimizations. Area (original and XPRES, in million gates) and runtime (original and XPRES, in million cycles) are shown for the 1 core system, the master-slave (M/S) systems with 3–7 cores, the pipeline (P) systems with 5–7 cores, and the multipipeline (MP) systems with 7 and 9 cores.

6.6 Conclusion

In conclusion, we have performed a case study exploring the use of multiple cores in master/slave and pipeline configurations. Communication amongst these cores is facilitated using the queues introduced in Tensilica's Xtensa LX [20] configurable cores. We have also analyzed the effect of increasing the number of cores in the system on the achievable performance improvement. The XPRES tool has been used to selectively optimize over-utilized cores, while under-utilized cores were replaced by cores with fewer resources. The result of the optimization is contrasted against the design space of the benchmark application. Our selective optimization approach closely matches the maximum performance gain achievable in the overall design space. We have found that such a multicore architecture has the potential to minimize the area of those cores that require less computation in the system. The pipeline architecture can be exploited to provide an increase in performance, with the possibility of far outweighing the increase in area. We have shown that a heterogeneous multiprocessor system is able to provide the necessary speedup while minimizing gate count, providing a very low ratio of area increase to performance improvement.

Chapter 7

Design Space Exploration

7.1 Introduction

Multiprocessor SoC systems have enabled the use of parallel hardware along with associated software. Approaches have included coprocessors, homogeneous processors (e.g., SMP) and application specific architectures (e.g., DSP, ASIC). Recently, ASIPs have emerged as a viable alternative to conventional processing entities (PEs) due to their configurability and programmability. In this work, we introduce a heterogeneous multiprocessor system that uses ASIPs as processing entities in a pipeline configuration. A streaming application is manually partitioned into a series of algorithmic stages, each of which is executed on a single processor, with queues between processors for communication. By carefully customizing each processor, the pipeline is balanced (i.e., processing times are nearly equal), allowing speedups with little overhead. We formulate the problem of mapping each algorithmic stage in the system to an ASIP configuration, and propose an estimation technique and a heuristic to efficiently search the design space for a pipeline-based multi-ASIP system.

We have implemented the proposed heterogeneous multiprocessor methodology using a commercial extensible processor system (Xtensa LX from Tensilica Inc.), and evaluated the resulting design by creating two benchmarks (JPEG and MP3 encoders) in a pipelined fashion. Our multiprocessor design provides a performance improvement of at least 4.03× for JPEG and 3.31× for MP3, over a single processor design system. The minimum cost (area × clock cycle) obtained through our heuristic was within 0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks respectively (using the estimation-based search method). The heuristic solution was reached within a fraction of a second, whereas some configurations took several times longer to converge when the estimation-based method was used.

7.2 Background

Our work is based on partitioning sequential streaming applications into a pipelined processing structure, where each stage is executed by at least one processor. The application is assumed to have the following characteristics:

1. Each streaming application contains a kernel (main processing code segment), which is partitioned into several pipeline stages. This kernel is executed multiple times (in JPEG, for example, it is executed once every frame). Minor loops with small repeated code segments are considered atomic and would not be further partitioned.

2. The application exhibits a dataflow software architecture. Input data is processed sequentially and deterministically, and output as results in the same order and manner.

An application with the above characteristics can be partitioned to represent different stages in a pipeline flow. The partitioned application is derived from a standard sequential program. Figure 7.1 shows possible designs that can be implemented as a pipelined multiprocessor system. Each design has a set of cores, where each core represents a stage in the pipeline process. A particular process in a stage takes inputs from the previous stage; stages are interconnected by FIFO (First In, First Out) queues. These connected systems thus allow each core to run independently of the others, provided it has the necessary input to begin data processing.

Figure 7.1: Possible design configurations for a pipelined multiprocessor system. Designs (a)–(d) are JPEG encoder configurations (five, six, five and nine cores) and designs (e) and (f) are MP3 encoder configurations (six and four cores); each design is a chain of pipeline stages (stage1–stage5), some of which are duplicated into parallel flows.

Each multiprocessor configuration in Figure 7.1 has a large design space. Cores in the various stages of the pipeline can be configured and mapped to special-purpose hardware. A particular configuration that is generated for a design has to be optimal or near-optimal (i.e., cost efficient). This near-optimal configuration can be achieved by adjusting the hardware parameters of each processor to achieve the required performance at the lowest possible cost. Finally, the system is optimized such that the increase in performance more than offsets the area increase incurred by pipelining the system.

7.2.1 Benchmark Applications

Readily partitioned benchmark programs are not freely available to the research community. We created our own set of benchmark applications based on single processor benchmarks. Two freeware compression algorithms, for MP3 and JPEG encoding, were chosen and ported to the Tensilica Xtensa LX platform architecture [20].

The data flow graphs were obtained by analyzing the data stream through the benchmark applications. These benchmarks are partitioned manually into various pipeline/data flow stages, adhering to the respective standards (JPEG & MP3). The partitions are then mapped to stages in a pipeline system (refer to Figure 7.1) and executed as standalone programs on Xtensa LX processors.

We methodically merge and duplicate pipeline flows according to profiles obtained through simulations. If neighboring pipeline stages can be combined without affecting the execution time of the entire program, they are merged. On the other hand, pipeline stages that represent bottlenecks in the flow are duplicated. By following such a procedure, it is possible to explore the design space systematically and eventually obtain a configuration that is close to the optimal solution.

We created four multiprocessor configurations for the JPEG encoder and two configurations for the MP3 encoder. Figure 7.1 shows the set of designs for the two benchmark applications. The figure also shows the connectivity of each stage in the pipeline; each arrow denotes a FIFO connection to the next stage in the pipeline.

JPEG Benchmarks

The JPEG benchmark application performs its tasks in several stages. Firstly, it reads in RAW files that contain image information in RGB (Red, Green, Blue) format. A color conversion and level shift is performed on each macroblock in the image before a two-dimensional DCT transformation is performed on it. The transformed dataset is quantized (lossy compression) and zigzag reordered. A zero-runlength encoding is performed before a Huffman encoder is used for lossless compression. The resulting output has appropriate JPEG headers appended and is written as a JPEG file.

Five multipipeline processor JPEG benchmarks were constructed based on major tasks identified in the single processor implementation. The task graph is constructed based on the flow of the dataset (in this case, the macroblock of the image) from one task to another. Communication among tasks is analyzed. As the JPEG benchmark application was originally written as a single processor implementation, shared data structures had to be decoupled so that individual tasks could be executed independently on different processors. The communications among different tasks in different processors are modelled using queues.

Figure 7.1(a) shows a simple example of the JPEG benchmark program, partitioned into five main stages: 1) read raw image, color conversion & level shifting; 2) 2D DCT; 3) quantization & zigzag transformation; 4) Huffman encoding; and 5) write back to file. In another variation (see Figure 7.1(d)), stages 2 and 3 are duplicated three ways to handle the luminance, chrominance red and chrominance blue channels.

In Figure 7.1(c), the DCT, quantization and zigzag transformation are merged into one pipeline stage, reducing the number of pipeline stages to four. The configuration in Figure 7.1(b) provides the same number of pipeline stages, with the two chrominance component pipeline flows merged into a single flow.

MP3 Benchmarks

The MP3 benchmark is partitioned into several stages of operations. The application first reads in PCM encoded bitstreams from a file. The bitstream then enters the polyphase filtering stage of the encoder. After the signal is filtered into different sub-bands, an MDCT operation is performed, followed by bit and noise allocation. The bitstream is then formatted before being written back to file as an MP3 stream.

The first implementation of the MP3 benchmark is shown in Figure 7.1(e). The system consists of four pipeline stages: 1) reading the PCM file, 2) polyphase filtering, 3) MDCT and 4) bit & noise allocation and writing of the bitstream. Stages 2 and 3 are duplicated to perform parallel computations of the stereo channels (i.e., left and right channels). Figure 7.1(f) shows an implementation with merged pipeline stages 1 and 2. Left and right channel PCM data are read simultaneously. The MDCT stage (stage 3) is adequate as a single pipeline flow (based on profiling data).

7.2.2 System Architecture

The Xtensa LX [20] is part of the Tensilica line of cores, which is configurable, extensible and supported by automatic hardware and software generation tools. The core is synthesizable and allows designers to configure each implementation to match the target application requirements. It supports extended instructions, including fusion instructions [149], SIMD/vector instructions and FLIX [23] (Flexible Length Instruction Xtension) instructions. FLIX is a configuration option that allows designer-defined instructions to consist of multiple, independent operations bundled into one instruction word.

The key feature that is used in this work is the queue interface (introduced in Xtensa LX – refer to Figure 7.2). This feature supports external communications at a much wider bandwidth than existing interconnects. Queue interfaces (defined using TIE instructions) are used to pop an entry from an input queue for incoming data, or to push data to an outgoing queue. The Xtensa Toolset automatically generates the logic to stall the processor on attempts to read an empty input queue or write to a full output queue.

Figure 7.2: Xtensa LX Queue Interface. An output queue interface (with PushReq and Full signals, per output queue) on one Xtensa LX processor connects through a queue to an input queue interface (with PopReq and Empty signals, per input queue) on another.

TIE (Tensilica Instruction Extension) is a Verilog-like language used to describe the desired custom instructions. A designer expresses the desired functionality in the TIE language to add a new instruction to the Xtensa 7 processor core. Extensible instructions include fusion and FLIX instructions; TIE is also used to define port and queue interfaces for Tensilica LX processors.

A processor implementation with a complete ISA (instruction set architecture), which we call LX1, is assigned to the MP3 benchmark program first, due to its higher computation demand compared with the JPEG benchmark program. Opcodes that are not used in the JPEG program (such as MAC16, MULUH/MULSH and NSA/NSAU) are removed, and the processor thus implemented is called LX2.

We profiled the MP3 and JPEG benchmarks on both the LX1 and LX2 cores. The mapping (refer to Table 7.1) of the benchmark programs to the processor implementations is justified, as the MP3 program achieves a lower cost value (runtime × area) with the LX1 core than with the LX2 core. This shows that the LX1 core provides a better performance/area ratio than the latter core. As expected, the JPEG benchmark program achieves a lower cost value when mapped to the LX2 core.

                                   LX1            LX2
  Benchmark                        MP3            JPEG
  Speed                            533 MHz        533 MHz
  Process                          90nm GT        90nm GT
  Pipeline length                  5              5
  Size                             79,885 gates   63,843 gates
  Core Size                        0.41 mm2       0.32 mm2
  Core Power                       81.28 mW       74.35 mW
  MAC16                            √              -
  MUL32                            √              √
  MULUH/MULSH                      √              -
  MUL16                            √              √
  NSA/NSAU                         √              -
  MIN/MAX/MINU/MAXU                √              √
  Sign Extend32                    √              √
  Enable density instructions      √              √
  Enable Boolean Registers         √              √
  Zero Overhead Loop               √              √
  Synchronize Instruction          √              -
  Conditioned Store Synchronize    √              -
  TIE arbitrary byte enables       √              √
  Enable TIE wide stores           √              √
  Max Instruction Width            8 bytes        8 bytes
  PIF Interface Width              128 bits       128 bits

Table 7.1: Processor Configuration

The overall core size for each core includes the instruction and data cache area and the extended instruction size. This is further expanded upon in Section 7.6.

7.3 The System

The design flow for obtaining the best configuration for a given application is summarized in Figure 7.3. The input to the system consists of: an application written in C/C++; a library of pre-configured processors; and a set of cache and XPRES (Xtensa PRocessor Extension Synthesis) configurations (with the respective area utilization information). The XPRES compiler is a tool which creates tailored processor descriptions for the Xtensa processors from native C/C++ code. The XPRES compiler automatically determines which functions should be accelerated in hardware.

Figure 7.3: Design flow for exploring the heterogeneous multiprocessor design space. (a) The original program is cross-compiled and profiled on the ISS; (b) the designer partitions the program using its data flow graph; (c) a methodology identifies a set of designs (partitioned programs and connection libraries); (d) configurations are simulated, analyzed and ranked by cost, with a heuristic algorithm selecting the optimum configuration for each design; (e) the design with the lowest cost function yields the final partitioned programs and core configurations.

The process starts with a program which is then compiled and profiled to detect the hotspots in the algorithm (Figure 7.3(a)). The designer derives a data flow graph that describes the flow of data through the program. This information can be used to partition the program, manually (or automatically) (Figure 7.3(b)), into multiple modules (Figure 7.3(c)), each module corresponding to a stage in a pipeline process.

The designer may produce a set of possible architectures as shown in Figure 7.1.

Each design may have a different number of pipeline stages and dissimilar parallel pipeline flows. Each design would consist of a set of individual standalone C programs, each associated with a single stage of a pipeline and capable of being compiled and executed on a single microprocessor core. A heuristic (refer to Section 7.4.6) is used to rapidly explore the design space to find the best architectural configuration (Figure 7.3(d)).

The heuristic algorithm will produce a configuration that is near optimal, given a particular architecture (i.e., number of pipeline stages and parallel pipelines – see Figure 7.1). The algorithm is run on all possible architectures to obtain a near-optimum configuration for each architecture (Figure 7.3(e)). The optimum design is then the architecture that has the lowest configuration cost (explained in Section 7.4.2). The design flow eventually results in a set of partitioned programs and their associated core configurations (i.e., cache and XPRES configurations).

7.4 Design Exploration

The selection and mapping of different regions of software to a multitude of hardware configurations have been widely explored. Previous approaches to hardware/software codesign using ASIPs in a multiprocessor configuration use various mappings to NP-hard problems, notably the 0-1 knapsack problem and its derivatives [166].

We make the following assumptions:

• Each design terminates with only one output processor.

• Each stage in the pipeline refers to a physical processor with the assigned task of the stage.

• Runtime is calculated assuming the processors are not stalled due to an empty input queue (POP stall) or a full output queue (PUSH stall). This assumption is valid since the overall computation time for a pipeline will be dominated by the stage with the longest execution time (the critical stage in the pipeline). Thus, for the purpose of estimating the pipeline design performance, these stalls can be ignored (within 2% accuracy in our experiments – refer to Figure 7.5).

• Whenever a design has parallel pipelines, we assume that all parallel parts are identical in terms of the number of pipeline stages and processors. For example, in Figure 7.1(d), following stage 1, the pipeline splits into three parallel pipelines, each with an identical number of stages (stages 2 & 3). These three flows subsequently merge back into a single pipeline in stage 4. This assumption allows us to simplify the design process. As an extension to this work, we could have asymmetrical pipeline stages; however, this is beyond the scope of this thesis.

7.4.1 Problem Definition

We are given a program which is represented as a Directed Acyclic Graph (DAG), H(V, E), where the vertices V are processes and the edges E are data communications between processes.

    V = \{ v_j : 1 \le j \le J \} \qquad (7.1)

    E = \{ (v_i, v_j) : 1 \le i < j \le J,\; v_i \in V,\; v_j \in V \} \qquad (7.2)

where J is the maximum number of nodes in the design.

A DAG, G(V', E'), is used to represent a multiprocessor pipeline design. Given N multiprocessor pipeline designs G(V', E') (see Figure 7.1), each of which is capable of executing the program, there is a direct mapping from the partitioned program H(V, E) to the multiprocessor pipeline design G(V', E'). Vertices in G(V', E') represent processors and edges represent FIFO connections between processors.

From among the N multiprocessor pipeline designs, we must find a system G(V', E'), and a corresponding configuration for each processor in the system, which provides the lowest cost (runtime × area) amongst these N possible designs. In this thesis, due to long simulation times, the configurable options for each processor in the system are limited to whether the XPRES compiler is enabled or disabled, and the cache sizes for both the instruction and data caches.

The methodology can be complemented with more configurable options, such as register width, bus size, different instruction sets, register call window size, processor pipeline depth and the number of load/store units.
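For concreteness, a single point k ∈ K in this restricted space can be represented as below; this is a sketch in our own notation, with the cache ranges taken from the exploration in Section 6.5:

    #include <stdbool.h>

    /* One processor configuration k in the set K: whether the
     * XPRES-generated TIE extensions are compiled in, plus the
     * instruction and data cache sizes (1KB to 32KB). */
    typedef struct {
        bool xpres_enabled;
        int  icache_kb;   /* 1, 2, 4, 8, 16 or 32 */
        int  dcache_kb;   /* 1, 2, 4, 8, 16 or 32 */
    } proc_config_t;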

7.4.2 Exhaustive Search

An exhaustive search approach would require a thorough exploration of each and every possible configuration available to a stage in the pipelined multiprocessor system design. The implementation area for a particular configuration is obtained via the Tensilica Xtensa Toolset [20].

The K set contains the processor configurations on which the different program stages will be running. The different processor configurations refer to the various combinations of cache configurations and the enabling of extended instructions generated via the Tensilica XPRES Tool [20].

    K = \{ \text{Set of processor configurations} \} \qquad (7.3)

We define the area cost for each particular pipeline stage implementation:

    A : V' \times K \longrightarrow \mathbb{Z} \qquad (7.4)

where V' and K are the sets of processors and processor configurations respectively.

The runtime R of each set of configurations is obtained by using an instruction set simulator (refer to Section 7.5). The total area cost of the design is the sum of the area costs of each individual processor implementation in the pipeline design.

    A = \sum_{j=1}^{J} A(v'_j, k_j) \qquad (7.5)

where k_j \in K is the corresponding configuration number of processor v'_j, and J is the total number of processors in the design. Finally, the cost function which needs to be minimized is defined as

    \Theta = R \times A \qquad (7.6)

where R and A are integer values. Given the definitions above, we now try to solve the Θ minimization problem using an exhaustive search algorithm, as shown in Figure 7.4.

    Start with one multiprocessor design with V', E' and K sets
    Reset Θ

    Find all possible design configuration combinations:
        Let K* = { (k_1 ... k_J) : ∀j, k_j ∈ K, 1 ≤ j ≤ J }

    Find the configuration with minimum cost:
        For each design k* ∈ K*
            Find the total area cost:
                A = \sum_{j=1}^{J} A(v'_j, k_j), where k_j ∈ k*
            Find the runtime for this particular implementation:
                R = simulation output with configuration k*
            Calculate the cost:
                Θ_new = R × A
            Replace the minimum: Θ ← min(Θ, Θ_new), recording k*
        End for

    The k* obtained is the configuration which provides the best
    performance per area ratio in the design space.

Figure 7.4: An exhaustive search to obtain a pipeline multiprocessor design with the lowest cost (R × A)

7.4.3 Runtime Estimation

The problem formulated above (as an algorithm), when solved by brute force, would not provide a solution quickly, due to the need to rerun the simulation program for every different configuration combination. The permutations for different implementations would result in an exponential complexity of order O(|K|^{|V'|}), where |K| is the maximum number of possible processor configurations, and |V'| the number of processors in the multiprocessor configuration. To more effectively explore the design space, we developed an estimation technique to closely match the runtime of a configuration combination, provided we have the appropriate information about each node in the pipeline. In effect, the problem space is simplified in order to dramatically improve search speed. As a trade-off, the accuracy of the search is reduced.

In order to obtain the runtime values, the partitioned benchmark programs in each pipeline stage are annotated to divide the sections of code into initialization, process and finalization stages. Separate cycle counts are calculated for each of those sections during simulation. Thus, one partitioned program (code in one stage) would have aggregated runtime values R^{init}, R^{process} and R^{final}.
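A sketch of this annotation is shown below, assuming a hypothetical read_ccount() helper that returns the core's cycle counter under the ISS (the helper and function names are ours; only the three-way sectioning is the methodology described above):

    #include <stdint.h>

    extern uint32_t read_ccount(void);    /* hypothetical cycle-counter read */
    extern void initialise_stage(void);   /* e.g., build quantization tables */
    extern void process_one_block(void);  /* the kernel loop body            */
    extern void finalise_stage(void);     /* e.g., flush trailing output     */

    uint32_t r_init, r_process, r_final;  /* aggregated per-section cycles   */

    void stage_main(int iterations)
    {
        uint32_t t0 = read_ccount();
        initialise_stage();
        r_init = read_ccount() - t0;

        t0 = read_ccount();
        for (int i = 0; i < iterations; i++)
            process_one_block();
        r_process = read_ccount() - t0;

        t0 = read_ccount();
        finalise_stage();
        r_final = read_ccount() - t0;
    }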

We define the functions below:

    R : V' \times K \longrightarrow (\mathbb{Z}, \mathbb{Z}, \mathbb{Z}) \qquad (7.7)

    L : V' \times K \longrightarrow \mathbb{Z} \qquad (7.8)

Equation 7.7 yields a set of integer tuples containing the runtime values of the initialization R^{init}, core iterations R^{process} and finalization R^{final} of the program stage. Equation 7.8 refers to the latency for a particular pipeline stage implementation to execute one iteration loop within the program. Within one loop, the program processes a fixed number of streaming data sets. Consequently, the core iteration runtime R^{process} is the product of the average latency of a pipeline stage implementation, L, and the number of iterations for which the core loop is executed. Equations 7.7 and 7.8 are functions of the partitioned processors, V', and processor configurations, K.

The different implementations of the pipeline stages would result in different execution times within the pipeline. As faster stages stall for slower ones, we assume that, on average, each pipeline stage latency would be equivalent to the latency of the critical processor, v'_{crit} (i.e., the critical pipeline stage). The critical pipeline stage is defined as the stage which has the largest loop execution runtime (i.e., R^{process}) of all processing stages. The corresponding processor configuration for the critical processor is referred to as k_{crit}. We now redefine the runtime R of a configuration to include the runtimes of each individual core. The runtime of a particular configuration can be defined as

    R = R^{\text{init}}(v'_1, k_1) + \sum_{j=1}^{J} L(v'_j, k_j)
        + (I - 1) \times L(v'_{crit}, k_{crit}) + R^{\text{final}}(v'_J, k_J)

where I is the number of iterations, and R^{init}, R^{process} and R^{final} are mapping functions to the initialization, core iteration and finalization execution times respectively (see Equation 7.7); k is the corresponding implementation for each processor v'. In pipelined systems, increasing the workload in the pipeline brings the system closer to the theoretical performance improvement [78]. The theoretical performance improvement is defined as the ratio of the running time on a single processor over the running time of the critical stage processor in the multipipeline processor system. Similarly, as the number of iterations I increases to a significantly large number, the sum of the latencies of the pipeline can be ignored. The equation above can then be simplified to

    R = R^{\text{init}}(v'_1, k_1) + R^{\text{process}}(v'_{crit}, k_{crit}) + R^{\text{final}}(v'_J, k_J) \qquad (7.9)

For example, given a set of processors with arbitrary configurations mapped into a multipipeline processor system, three processors are identified and singled out: the first, last and critical stage processors in the pipeline. Each processor would have already been profiled, and the runtime values of its initialization R^{init}, core iteration R^{process} and finalization R^{final} code segments documented and made available. The estimated runtime of the complete program would be the sum of the initialization time R^{init} of the first processor, the core iteration time R^{process} of the critical processor and the finalization time R^{final} of the last processor. The remaining runtime values of the other processors do not play a role. However, the sizes of all the processors determine the total size of the multiprocessor system and thus the cost function of the system, R × A.
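A sketch of the estimate in code form is given below (the names are ours; the profile array is assumed to be filled, for the chosen configuration of each processor, from the profiling simulations described in Section 7.4.4):

    #include <stdint.h>

    /* Profiled cycle counts of one processor under its chosen
     * configuration: initialization, total kernel-loop and
     * finalization sections. */
    typedef struct {
        uint64_t r_init, r_process, r_final;
    } stage_profile_t;

    /* Equation 7.9: runtime = init time of the first stage
     * + kernel time of the critical (slowest) stage
     * + finalization time of the last stage. */
    uint64_t estimate_runtime(const stage_profile_t p[], int nprocs)
    {
        uint64_t crit = 0;
        for (int j = 0; j < nprocs; j++)   /* locate the critical stage */
            if (p[j].r_process > crit)
                crit = p[j].r_process;
        return p[0].r_init + crit + p[nprocs - 1].r_final;
    }

Multiplying the returned estimate by the summed area of the chosen configurations gives the cost R × A of one candidate system without any further simulation.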

final processor. Figure 7.5 shows the distribution of errors to the runtime information of the JPEG system architectures when the above equation is used. These error values were produced by comparing the different runtime values obtained through simulating the entire pipeline system and through estimating the runtime values via

Equation 7.9. It can be seen from Figure 7.5 that the estimated runtimes are within

2.5% of the actual values. CHAPTER 7. DESIGN SPACE EXPLORATION 164

Figure 7.5: Error distribution when using Equation 7.9. The histogram plots the number of occurrences (0–600) against the estimation error (0–2.5%).

7.4.4 Estimation-based Search

Simulating the entire pipeline of processors for every configuration using an instruction set simulator (ISS) requires several hours. Based on Equation 7.9, we develop a methodology to obtain the estimated execution time of any configuration combination without tediously simulating every possible configuration via an ISS. With the runtime values of each possible configuration of a pipeline stage known, it is possible to estimate the total runtime of any pipeline configuration using Equation 7.9.

To minimize the number of simulations needed to obtain the individual runtime for each processor in the system, a pipeline simulation is run with all cores configured with the same settings. Effectively, each possible configuration for a particular processor is chosen and run only once. If there are |K| possible configurations, then only |K| simulations have to be run. Figure 7.6 shows the first seven simulation runs for a four-processor system. To obtain the execution times for each pipeline stage, we simulate all pipelines using the same configuration in one single simulation run. The execution times for each core are calculated and reported separately from the other cores. Stall times (waiting times for reading and writing to queues – interprocessor communications) from each core are noted and removed from the total cycle count of the core to obtain the net execution time (i.e., processing time) for each core.

           Process 1      Process 2      Process 3      Process 4
           Configuration  Configuration  Configuration  Configuration
  Sim 1    1              1              1              1
  Sim 2    2              2              2              2
  Sim 3    3              3              3              3
  Sim 4    4              4              4              4
  Sim 5    5              5              5              5
  Sim 6    6              6              6              6
  Sim 7    7              7              7              7

Figure 7.6: Table depicting which processor configurations are mapped to each processor in the pipeline system for different simulation runs, to obtain aggregate runtime values for each processor

Alternatively, the runtime values of each core can be obtained independently if the core can be simulated with data fed in, as if output from the input queue. In total, |K| × |V'| independent simulations would need to be run.

The original problem was presented in Section 7.4.1, and a solution method which is NP-complete was given in Section 7.4.2. The runtime estimation technique introduced in Section 7.4.3 reduces the NP-complete problem to one that can be solved in polynomial time. When the runtime of a particular configuration is required, Equation 7.9 is used. The aggregate runtimes needed for the total runtime calculation are only those of the first, critical and last processors. The remaining processors in the system are accounted for in the total area count. Thus, the cost of the system is the runtime R multiplied by the total area of the system, A.

The algorithm to systematically explore the transformed design space is broken down into two independent subproblems. The first is to find the optimal set of configurations assuming the selected processor configuration in a selected stage is the critical processor in the multipipeline processor system. According to the runtime estimation methodology in Section 7.4.3, the execution time of the overall system is determined by the latency of the slowest pipeline stage (i.e., the processor with the highest execution latency).

The input to the algorithm is a table of runtimes of |V'| processors with |K| configurations, similar to the table shown in Figure 7.6. An arbitrary configuration of an arbitrary processor is selected to be the critical node. In order to obtain the lowest cost of the entire system with the current processor configuration as the constraint, the total area of the entire system has to be minimized. For each other processor in the system, it is sufficient to choose the lowest area configuration that satisfies the latency constraint of the critical processor (a sketch of this selection step is given after the complexity discussion below). The complexity of such a search would be O(|V'||K|). If the runtimes of each processor's configurations are sorted, a binary search can be used, providing a complexity of O(|V'| log(|K|)).

The total runtime of the entire system is also dependent on the initialization and finalization times of the first and last processors respectively. Thus, an exhaustive search on the combinations of the first and last processors is vital to obtain the most optimal solution. The complexity of an exhaustive search on the first and last processor combinations would be O(|K|^2).

The second subproblem would be to obtain an optimal solution for each configuration in Figure 7.6. The above methodology is applied to every possible processor configuration in the input table (i.e., an order of O(|V'||K|)). If a particular configuration cannot be the critical processor, due to other processors having runtime values greater than the selected processor, that configuration is skipped. The entire search process would have a complexity of O(|V'||K| (|V'| log(|K|) + |K|^2)). The multiprocessor configuration that has the lowest cost function is the optimum solution to the design space. Thus, the complexity of the exhaustive algorithm can be rewritten as O(|V'|^2 |K| log(|K|) + |V'||K|^3).
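A sketch of the per-processor selection step follows (all names are ours). The prefix-minimum table is our own device for keeping each query at the O(log |K|) cost quoted above; the text only stipulates a binary search over runtime-sorted configurations:

    #include <stdint.h>

    typedef struct {
        uint64_t runtime;   /* per-loop latency of this configuration    */
        uint64_t area;      /* implementation area of this configuration */
    } config_t;

    /* For cfgs[] sorted by ascending runtime, precompute the index of
     * the smallest-area configuration within each prefix cfgs[0..i]. */
    void build_prefix_min(const config_t cfgs[], int n, int prefix_min[])
    {
        for (int i = 0; i < n; i++)
            prefix_min[i] = (i == 0 ||
                             cfgs[i].area < cfgs[prefix_min[i - 1]].area)
                            ? i : prefix_min[i - 1];
    }

    /* Cheapest configuration of a non-critical processor whose latency
     * does not exceed that of the assumed critical stage. Binary search
     * finds the last qualifying entry; one table lookup then finishes
     * the query in O(log |K|). Returns -1 if none qualifies. */
    int cheapest_config(const config_t cfgs[], const int prefix_min[],
                        int n, uint64_t crit_latency)
    {
        int lo = 0, hi = n - 1, last = -1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (cfgs[mid].runtime <= crit_latency) { last = mid; lo = mid + 1; }
            else                                   { hi = mid - 1; }
        }
        return (last < 0) ? -1 : prefix_min[last];
    }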

    Get optimal configurations:

    For all processors v'_1 ∈ V':
        For each configuration k in the K configurations:
            Assuming processor v'_1 with configuration k is the critical
            stage, find the configurations for the remaining processors:
                For all processors v'_2 ∈ V' except v'_1:
                    Get the configuration for v'_2 with the lowest area
                    which satisfies the runtime of v'_1 with configuration k
                Next v'_2
            If v'_1 is the first processor:
                Find the configuration of the last processor which
                minimizes total runtime × area
            Else if v'_1 is the last processor:
                Find the configuration of the first processor which
                minimizes total runtime × area
            Else:
                Find every possible combination of first and last processor
                such that total runtime × area is minimized
            Endif
        Next k
    Next v'_1

    The set of k's for each processor in V' which provides the lowest cost
    (total runtime × area) is the optimal solution for the design space.

Figure 7.7: A modified exhaustive search based on the runtime estimation technique in Section 7.4.3 to obtain a pipeline multiprocessor design with the lowest cost (R × A)

Table 7.2 shows the exploration time taken by the estimation-based search and the heuristic (introduced in Section 7.4.6). The JPEG benchmarks have 648 possible processor configurations available, while 288 processor configurations are made available to the MP3 benchmarks. Column 3 in Table 7.2 shows the computation time for exhaustively searching the design space.

Table 7.2: Exploration time

  Benchmark   System Configuration      Estimation-based Search   Heuristic
  JPEG        5 processors (4 stages)   8s                        <1s
  JPEG        5 processors (5 stages)   6s                        <1s
  JPEG        6 processors              3s                        <1s
  JPEG        9 processors              33s                       <1s
  MP3         4 processors              <1s                       <1s
  MP3         6 processors              1s                        <1s

As expected, the systems with more processors need more exploration time. The JPEG system with six processors is an exception, because the critical processor is the last processor in the system. Thus, the internal search algorithm for the best combination of first and last processors (O(|K|^2)) does not apply, and a faster search is seen compared with a five processor system.

Search times for the MP3 benchmarks were shorter due to a smaller search space.

The estimation-based search is highly influenced by the number of configuration options available to the designer. As the configuration option space |K| becomes larger (while the number of processors remains the same), the complexity asymptotically approaches O(|K|³). The configuration spaces explored in this work are relatively small compared with those in industrial use, where the configuration space is a few orders of magnitude larger. In reality, the number of possible configurations can range into the millions, as different combinations of instruction sets can be used along with a variety of configurable components and coprocessor options, such as floating point and MAC (Multiply-And-Accumulate) units. Scratch pads and local memory can be used alongside hierarchical cache systems, each configurable independently of the other components. Exploration time would then increase dramatically.

The algorithm derives its complexity from the exhaustive search for the most efficient combination of the first and last processors in the system (O(|K|²)). Additionally, every possible configuration of the critical processor is explored to find the most efficient configuration (an additional factor of O(|K|)). Thus, a heuristic is developed to reduce the complexity of the search while trading off accuracy. The heuristic does not need to search all possible configurations for the critical processor; in other words, a 'one-time pass' should be sufficient to estimate the cost efficiency of a multipipeline processor system.

7.4.5 Preliminary Heuristic

Although the estimation technique in Subsection 7.4.4 makes it easier to obtain the optimal configurations for a design space, the worst case is still of square or cubic complexity, which does not scale well when the number of configurations is increased or more processors are used in the multipipeline processor system. A heuristic is developed to reliably explore the transformed design space (refer to Section 7.4.3) and obtain a configuration whose cost is close to the optimal value. The heuristic relies on the fact that we have the runtime information (e.g., initialization, core iteration and finalization runtimes) of every node in the multiprocessor system, gathered with a simulation period of the order O(|V′| × |K|).

The heuristic approach analyzes each processor configuration only once. The configurations of each processor are evaluated based on cost efficiency and significance amongst the other configurations. This heuristic approach (refer to Figure 7.9) has a complexity of O(|V′| × |K|).

Our initial experiments resulted in a preliminary heuristic which generated configurations with cost values within acceptable error margins. This heuristic was presented and published in [140].

Based on Equation 7.9, we develop the heuristic shown in Figure 7.8. The heuristic relies on the fact that our simulation period is of the order O(|V′| × |K|), compared with the complexity of O(|V′|² |K| log(|K|) + |V′| |K|³) in Section 7.4.4.

Get minimum core iteration runtime of each processor:
    R⁻ = { r_j : ∀j, 1 ≤ j ≤ J, r_j = MIN_{k∈K}(R^process_k(v_j)) }, where v_j ∈ V

Find critical node:
    The critical node is the processor with the worst minimum core iteration
    runtime: the critical node is v_crit where MIN_{k∈K}(R^process_k(v_crit)) = MAX(R⁻)

Start with critical node:
    For each configuration k in set K:
        Calculate Cost = R^process_k(v_crit) × C_k(v_crit)
    Next k
    The configuration k with the smallest cost is selected
    Critical Runtime = R^process_k(v_crit)

Evaluate all other nodes:
    For every other node, v_j ∈ V:
        Filter out all configurations k where R^process_k(v_j) exceeds the Critical Runtime
        For each remaining configuration k in K:
            If this is the first node, calculate Cost = R^init_k(v_j) × C_k(v_j)
            If this is the last node, calculate Cost = R^final_k(v_j) × C_k(v_j)
            Otherwise, calculate Cost = C_k(v_j)
        Next k
        The configuration k with the smallest cost is selected
    Next node

Output: the set of k's obtained for each processor

Figure 7.8: A preliminary heuristic for mapping the pipeline stages to the various hardware implementations, which tries to maximize the performance per area ratio.
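A minimal Python sketch of Figure 7.8, under the assumption that each stage is given as a dictionary mapping a configuration id to a (core-iteration runtime, initialization runtime, finalization runtime, area cost C) tuple; the names are ours, not the thesis toolchain's:

    def preliminary_heuristic(stages):
        """stages: one dict per pipeline stage, in pipeline order, mapping a
        configuration id k -> (r_process, r_init, r_final, cost_c).
        Returns the chosen configuration id per stage (sketch of Figure 7.8)."""
        n = len(stages)
        # Critical node: the stage with the worst minimum core-iteration runtime.
        mins = [min(v[0] for v in st.values()) for st in stages]
        crit = mins.index(max(mins))

        # Critical stage: choose the configuration minimizing R_process x C.
        k_crit = min(stages[crit],
                     key=lambda k: stages[crit][k][0] * stages[crit][k][3])
        critical_runtime = stages[crit][k_crit][0]

        chosen = {crit: k_crit}
        for j, st in enumerate(stages):
            if j == crit:
                continue
            # Exclude configurations slower than the critical stage; each
            # stage's own minimum guarantees the feasible set is non-empty.
            feasible = {k: v for k, v in st.items() if v[0] <= critical_runtime}

            def cost(k):
                r_proc, r_init, r_final, c = feasible[k]
                if j == 0:            # first stage: weight by initialization time
                    return r_init * c
                if j == n - 1:        # last stage: weight by finalization time
                    return r_final * c
                return c              # middle stages: area cost alone
            chosen[j] = min(feasible, key=cost)
        return [chosen[j] for j in range(n)]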

The heuristic above uses the estimation technique in Section 7.4.3 and assumes that the runtime execution times of the various pipeline stages and core configurations have all been obtained (see Figure 7.6). The critical process is first identified as the process with the worst minimum core iteration runtime of all the processes. Once this process has been identified, the configuration that results in the lowest cost value is selected. For every other node, the configurations whose core iteration runtimes exceed the core iteration runtime of the selected critical process are excluded from the selection set. For each of these processes, the configuration that results in the lowest cost value is selected.

Preliminary Results

Tables 7.3 and 7.4 show the runtimes and cost values obtained via the estimation-based approach and the preliminary heuristic approach shown in Figure 7.8. The first column shows the number of processors and pipeline stages in the system. If there are more processors than pipeline stages, a parallel pipeline stage exists. The next four columns show the instruction and data cache sizes, the core area and whether the XPRES compiler option has been enabled (for the estimation-based approach).

Note that all cache configurations have an associativity of one unless otherwise stated.

This is followed by four more columns of the same description, representing the heuristic approach. The last column shows the difference in total area of the pipeline systems, in runtime cycles and in cost value (runtime × area) between the estimation-based and heuristic approaches. The area, runtime and cost values are shown in the last three rows of every configuration group. Columns four and seven show the runtime (cycle count) and cost (runtime × area) obtained via our heuristics.

Table 7.3 shows the performance of our heuristic approach using the JPEG benchmark program. The initial heuristic is able to generate configurations with a cost function within 20% of the optimal cost value (obtained via the estimation-based approach). Our best result was within 0.04% of the optimal value. The best multiprocessor configuration for JPEG is found to be the 5 core (4 pipeline stages) implementation.

In Table 7.4, our heuristic is shown to come within 5.75% of the optimum result. The best multiprocessor configuration for MP3 is the 4 core (3 pipeline stages) implementation.

                      Estimation-based                     Heuristic
  Configuration       ICache   DCache   Area      XPRES    ICache   DCache   Area      XPRES    Difference
                      (bytes)  (bytes)  (mm2)              (bytes)  (bytes)  (mm2)

  5 cores             2048     2048     0.69315            2048     2048     0.69315
  (4 stages)          1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.75445   √        1024     1024     0.75445   √
                      1024     1024     0.65502            1024     1024     0.65502
                      2048     1024     0.86497   √        8192     4096     1.05850   √
    Total Area (mm2)                    3.622621                             3.816147            5.34% ↑
    Runtime (cycles)           2,509,052                            2,507,481                    0.06% ↓
    Cost value                 9,089,344                            9,568,917                    5.28% ↑

  5 cores             2048     2048     0.85636   √        1024     1024     0.81824   √
  (5 stages)          2048     1024     1.01799   √        1024     1024     0.99893   √
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.81981   √        1024     1024     0.81981   √
                      1024     1024     0.65502            2048     2048     0.69315
    Total Area (mm2)                    4.004213                             3.985148            0.48% ↓
    Runtime (cycles)           2,385,684                            2,585,894                    8.39% ↑
    Cost value                 9,552,786                            10,305,171                   7.88% ↑

  6 cores             2048     4096     0.73678            2048     4096     0.73678
  (4 stages)          1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      4096     4096     0.97129   √        4096     2048     0.92766   √
    Total Area (mm2)                    4.328164                             4.284536            1.01% ↓
    Runtime (cycles)           2,231,559                            2,255,188                    1.06% ↑
    Cost value                 9,658,553                            9,662,435                    0.04% ↑

  9 cores             4096     4096     0.94980   √        4096     2048     0.73678
  (5 stages)          4096     2048     0.73678            4096     1024     0.71772
                      4096     2048     1.08069   √        4096     1024     0.71772
                      4096     2048     1.08069   √        4096     1024     0.71772
                      4096     2048     0.78221   √        4096     2048     0.73678
                      4096     2048     0.78221   √        4096     2048     0.73678
                      4096     2048     0.78221   √        4096     2048     0.73678
                      4096     2048     0.88646   √        4096     1024     0.71772
                      1024     1024     0.65502            2048     2048     0.69315
    Total Area (mm2)                    7.736048                             6.511132            15.83% ↓
    Runtime (cycles)           1,812,193                            2,582,436                    42.50% ↑
    Cost value                 14,019,212                           16,814,582                   19.94% ↑

Table 7.3: Configurations obtained for the JPEG benchmark (preliminary). The table compares the configurations obtained via the estimation-based and heuristic approaches.

                      Estimation-based                     Heuristic
  Configuration       ICache   DCache   Area      XPRES    ICache   DCache   Area      XPRES    Difference
                      (bytes)  (bytes)  (mm2)              (bytes)  (bytes)  (mm2)

  4 cores             2048     1024     0.76409            8192     2048     0.91399
  (3 stages)          2048     1024     0.76409            2048     1024     0.76409
                      4096*    2048     0.99377            4096*    1024     0.97471
                      4096*    4096     1.03740            4096     1024     0.80772
    Total Area (mm2)                    3.559349                             3.460499            2.78% ↓
    Runtime (cycles)           3,865,873,883                        4,204,536,116               8.76% ↑
    Cost value                 13,759,996,041                       14,549,794,245              5.74% ↑

  6 cores             1024     1024     0.74502            4096     2048     0.82678
  (4 stages)          2048     1024     0.76409            2048     1024     0.76409
                      2048     1024     0.76409            2048     1024     0.76409
                      2048     1024     0.76409            2048     1024     0.76409
                      1024     1024     0.74502            1024     1024     0.74502
                      4096     1024     0.80772            2048     1024     0.76409
    Total Area (mm2)                    4.590027                             4.628155            0.83% ↑
    Runtime (cycles)           3,555,683,787                        3,658,802,853               2.90% ↑
    Cost value                 16,320,684,017                       16,933,507,412              3.75% ↑

  * This cache configuration has an associativity of two. Note that all other configurations have a set associativity of one.

Table 7.4: Configurations obtained for the MP3 benchmark (preliminary). The table compares the configurations obtained via the estimation-based and heuristic approaches.

Preliminary Analysis

The heuristic above initially attempts to select the optimal configuration for the critical stage in the pipeline. This selection process relies on the area × runtime equation, which evaluates the area per performance ratio that determines the cost effectiveness of a particular processor. Overall, this approach provides reasonable results, as shown in Tables 7.3 and 7.4.

However, further analysis shows that the selection process for the critical stage is biased towards obtaining the most cost efficient configuration for a particular processor, rather than for the entire multiprocessor configuration. This bias arises because the algorithm does not consider the possible configurations of the other processing stages when selecting the configuration for the critical stage. Once a configuration has been selected for the critical stage, the configuration selection of the remaining stages is effectively dependent on the runtime of the critical stage's configuration.

In order to obtain better accuracy, we transform the cost function space to provide a different approach to penalizing configurations with high area and runtime values.

We propose a linearized approach to evaluating the cost, which incorporates both area usage and runtime values, in Section 7.4.6.

7.4.6 Heuristic

In order to obtain more accurate results than the preliminary heuristic, we base a different approach on normalizing the runtime and area costs. The resulting heuristic proves superior, producing configurations whose cost values are within 0.5% of the optimal values (refer to Tables 7.5 and 7.6) for the benchmark applications used.

In this heuristic, we would like to select cores that would provide overall minimal runtime and area cost. The magnitude of runtime and area cost should not affect the selection process. Thus, for selection purposes, we normalize their values to their minimum and maximum. These are used to identify runtime and area cost that are close to either the smallest or largest value in the set. We define normalized runtime,

r̄(·), and normalized area cost, ā(·), as:

    \bar{r}(v', k) = \frac{R^{\mathrm{process}}(v', k) - \min_{\check{k} \in K} R^{\mathrm{process}}(v', \check{k})}
                          {\max_{\check{k} \in K} R^{\mathrm{process}}(v', \check{k}) - \min_{\check{k} \in K} R^{\mathrm{process}}(v', \check{k})}        (7.10)

    \bar{a}(v', k) = \frac{A(v', k) - \min_{\check{k} \in K} A(v', \check{k})}
                          {\max_{\check{k} \in K} A(v', \check{k}) - \min_{\check{k} \in K} A(v', \check{k})}        (7.11)

where v′ ∈ V′ and K is the set of configurations (see Section 7.4.2). α, β and γ are coefficient constants which provide the necessary weights to the normalized area, the normalized runtime and the initialization/finalization runtime ratio respectively. These coefficients are explained in Section 7.7.
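Concretely, Equations 7.10 and 7.11 are a min-max rescaling of each processor's runtime and area values onto [0, 1]. A small Python sketch (mapping the degenerate all-equal case, where the denominator vanishes, to 0 is an arbitrary choice of ours):

    def normalize(values):
        """Min-max normalization onto [0, 1], as in Equations 7.10 and 7.11.
        If all values are equal, return 0.0 for each to avoid division by zero."""
        lo, hi = min(values), max(values)
        span = hi - lo
        return [0.0 if span == 0 else (v - lo) / span for v in values]

    # e.g. the core-iteration runtimes of one stage across its configurations:
    print(normalize([2.4e6, 2.5e6, 3.0e6]))   # -> [0.0, 0.1666..., 1.0]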

The heuristic in Figure 7.9 approximates the critical stage in the multipipeline processor system. The critical stage is the stage with the largest minimal loop execution runtime (i.e., R^process) of all the processing stages. The configuration of the critical stage is selected based on a cost function (refer to Figure 7.9). Next, the processor configurations of the other pipeline stages that do not meet the runtime constraint of the selected critical stage configuration are filtered out. The remaining configurations of each pipeline stage are evaluated based on different cost functions.

The configuration which provides the smallest cost would be selected. Finally, the set of configurations for each processor would be the resulting set that provides a performance per area ratio close to the optimum value.

The improvement from implementing the algorithm in Figure 7.9 over the estimation-based approach is clearly shown in Table 7.2. Column 4 in Table 7.2 shows the search time utilizing the heuristic approach. All search tasks were performed in less than a second. The heuristic algorithm scales linearly with increasing processor configurations and processor count.

Figure 7.10 illustrates the scalability of the estimation-based search and the heuristic approach when more processors and processor configurations are added to the design space. In the estimation-based approach (see Figure 7.10(a)), the number of possible processor configurations dramatically increases the search time, and the effect of increasing configuration count is greatly magnified when more processors are added to the system. In contrast, Figure 7.10(b) shows the order-of-magnitude difference in search time of the heuristic approach compared with the estimation-based approach. This demonstrates the scalability of the heuristic approach with respect to increasing processor configurations and processor count.

Get minimum core iteration runtime of each processor:
    R⁻ = { r_j : ∀j, 1 ≤ j ≤ J, r_j = MIN_{k∈K}(R^process(v′_j, k)) }, where v′_j ∈ V′

Find critical node:
    The critical node is the processor with the worst minimum core iteration
    runtime: the critical node is v′_crit where MIN_{k∈K}(R^process(v′_crit, k)) = MAX(R⁻)

Start with critical node:
    For each configuration k in set K:
        Calculate ā(v′_crit, k) and r̄(v′_crit, k); initialize x̄ = 0
        If this is the first node, calculate x̄ = R^init(v′_crit, k) / R^process(v′_crit, k)
        If this is the last node, calculate x̄ = R^final(v′_crit, k) / R^process(v′_crit, k)
        Cost = α · ā(v′_crit, k) + β · r̄(v′_crit, k) + γ · x̄
    Next k
    The configuration k with the smallest cost is selected
    Critical Runtime = R^process(v′_crit, k)

Evaluate all other nodes:
    For every other node, v′_j ∈ V′:
        Filter out all configurations k where R^process(v′_j, k) exceeds the Critical Runtime
        For each remaining configuration k in K:
            Calculate ā(v′_j, k); initialize x̄ = 0
            If this is the first node, calculate x̄ = R^init(v′_j, k) / Critical Runtime
            If this is the last node, calculate x̄ = R^final(v′_j, k) / Critical Runtime
            Cost = α · ā(v′_j, k) + γ · x̄
        Next k
        The configuration k with the smallest cost is selected
    Next node

The set of k's obtained for each processor provides a performance per area ratio closest to the optimum value.

Figure 7.9: A heuristic for mapping the pipeline stages to the various hardware implementations, which tries to minimize the cost function.
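A corresponding Python sketch of Figure 7.9, reusing the normalize() helper sketched earlier and defaulting the coefficients to α = β = γ = 1 as in Section 7.7; the per-stage data layout is the same illustrative assumption as in the earlier sketches:

    def normalized_heuristic(stages, alpha=1.0, beta=1.0, gamma=1.0):
        """stages: one list of (r_process, r_init, r_final, area) tuples per
        pipeline stage, in pipeline order. Returns the chosen configuration
        index per stage (a sketch of Figure 7.9)."""
        n = len(stages)
        mins = [min(c[0] for c in st) for st in stages]
        crit = mins.index(max(mins))      # worst minimum iteration runtime

        def x_bar(cfg, j, denom):
            """Initialization/finalization ratio term for stage j."""
            r_proc, r_init, r_final, _ = cfg
            d = denom if denom is not None else r_proc
            if j == 0:
                return r_init / d
            if j == n - 1:
                return r_final / d
            return 0.0

        # Critical stage: cost = alpha*a_bar + beta*r_bar + gamma*x_bar.
        st = stages[crit]
        a_bar = normalize([c[3] for c in st])
        r_bar = normalize([c[0] for c in st])
        costs = [alpha * a + beta * r + gamma * x_bar(c, crit, None)
                 for a, r, c in zip(a_bar, r_bar, st)]
        k_crit = costs.index(min(costs))
        crit_runtime = st[k_crit][0]

        chosen = {crit: k_crit}
        for j, st in enumerate(stages):
            if j == crit:
                continue
            # Normalize over the full configuration set (Equation 7.11), then
            # select only among configurations meeting the critical runtime;
            # each stage's own minimum guarantees this set is non-empty.
            a_bar = normalize([c[3] for c in st])
            keep = [i for i, c in enumerate(st) if c[0] <= crit_runtime]
            cost = {i: alpha * a_bar[i] + gamma * x_bar(st[i], j, crit_runtime)
                    for i in keep}
            chosen[j] = min(cost, key=cost.get)
        return [chosen[j] for j in range(n)]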

Figure 7.10: Comparison of complexity of estimation-based and heuristic approaches. (a) Estimation-based approach (non-heuristic); (b) heuristic approach.


As the number of configurations increases to a large value, the |K| term dominates (i.e., the complexity approaches O(|K|³)) and the exploration time of the estimation-based approach becomes several orders of magnitude greater than that of the heuristic approach (whose exploration time scales linearly with |K|). For example, if XPRES generates 20 instructions, then there are 1,048,575 possible instruction combinations (2^20 − 1). Additionally, we multiply this number by the total number of cache configurations (e.g., 288) and the number of optional processing units (e.g., 10) to obtain the total number of configurations, |K| = 3,019,896,000. The |V′| term in the estimation-based approach (a |V′|² term) likewise does not scale as well as in the heuristic approach (a linear |V′| term). Refer to Section 7.4.4 for the number of configurations used for the benchmark applications.
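The arithmetic of this worked example (whose instruction, cache and unit counts are the example's own assumed figures) can be checked directly:

    # 20 optional XPRES-generated instructions -> every non-empty subset.
    isa_combos = 2 ** 20 - 1           # 1,048,575
    cache_configs = 288                # example cache-configuration count
    optional_units = 10                # example count of optional units
    print(isa_combos * cache_configs * optional_units)   # 3019896000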

7.5 Experimental methodology

We used Tensilica’s Xtensa RA2006.4 Toolset for the Xtensa LX family of processors.

The toolset provides a set of compilation tools to compile C/C++ code for the architecture described in Table 7.1. The Tensilica Instruction Set Simulator (ISS) and Xtensa Modelling Protocol (XTMP) environment were used to simulate the multicore systems. For each system, multiple Xtensa cores were instantiated and XTMP was used to connect the cores together, including memory models and peripherals (bus and FIFO models). The ISS directly models the Xtensa pipeline and operates as a system-simulation component within the XTMP environment. With XTMP, different multiprocessor configurations can be set up and simulated simultaneously within a system-level simulation.

The simulator allows for communication between the cores and peripherals using a cycle-accurate, split-transaction simulation model without using a clock. The ISS was used to generate profiling data for all cores in the system, which were then analyzed using Tensilica's gprof profiler. The profiles include the cycles for all functions executed by the cores. The ISS can also print a summary of the total cycle count and global stalls of each core.

Each individual core is connected via the queue interface provided by the Xtensa LX core using the XTMP environment. Queue models were created and used in the XTMP environment as libraries. In our work, we simulate all queues with a very large number of queue buffers, so that no PUSH stalls occur. The ideal queue size depends on the input and output rates of the connecting processors. The current methodology emphasizes the performance of the system and thus assumes that the interconnecting queues are sufficiently large to prevent PUSH stalls. If a queue size is required, our architecture design can always be mapped to a Kahn Process Network (KPN) [90], and the optimal queue size can be calculated using the available tools for KPNs.

We created our benchmark programs by identifying the various stages of the MP3 and JPEG encoders and mapping them to individual processors. We partitioned and allocated these stages based on the open standards of the respective encoders. We created four multiprocessor configurations for the JPEG encoder and two configurations for the MP3 encoder. An XTMP simulation program, specially customized to generate profiling data, the execution time of each stage and other relevant benchmark information, was created for each of these multiprocessor systems.

The XPRES compiler is used to create tailored processor descriptions for the Xtensa processors from native C/C++ code. We are able to reuse the existing ASIP design flow to create custom RTL for each core in the system. Using the designer-defined input of C programs to be analyzed, XPRES extends the base processor with new instructions, operations and register files, using TIE extensions. It does so by automatically generating a new TIE file which can be included when recompiling the source code. Half of the configuration options defined in Equations 7.4, 7.7 and 7.8 are XPRES-enabled.

Figure 7.11: Benchmark image used for JPEG encoding

Area costs include the base processor, the instruction and data caches and the TIE instructions. A rose image from the Independent JPEG Group [8] (see Figure 7.11) of size 227 by 149 pixels is used as the raw input stream to the JPEG encoder system, while a six-second PCM-encoded music clip is used as the input stream to the MP3 system.

Note that the systems were also extensively tested on other streams. For example, the JPEG system encoded ten images of the same resolution (227 by 149 pixels), each of a type given in Figure 7.12. The runtimes for a single processor system and the multiprocessor system are given in Figure 7.13. It can be concluded that the configuration (5 cores, 4 stages; the first configuration in Table 7.5) obtained via the heuristic can be used with other data sets without much variability in performance. The slight runtime increase in the last ten frames is due to the image (see Figure 7.12(d)) having a complicated background tone and more color changes in the foreground compared with the previous three frame sets. The performance improvement over the single processor system mirrors the performance improvement seen in Figure 7.14(a).

7.6 Results & Analysis

In a similar way to Tables 7.3 and 7.4, Tables 7.5 and 7.6 show the configurations obtained via the new heuristic approach shown in Section 7.4.6.

Compared with the preliminary approach in Section 7.4.5, the new heuristic significantly improves accuracy. In two configurations for the JPEG benchmark, the heuristic obtained exactly the configuration generated via the estimation-based approach. The best configurations obtained are the 5 core (4 pipeline stages) implementation for JPEG and the 4 core (3 pipeline stages) implementation for MP3.

The results from the estimation-based search represent the configuration with the lowest cost (square points in Figures 7.16(b) and 7.17(b)).

The deviation of the heuristic results is shown as a percentage of the best possible values obtained from the estimation-based search. Note that our algorithm emphasizes reduction of the cost (via the cost function) rather than maximization of application performance. Nevertheless, our heuristic still produces runtime values close to the best possible runtime for our benchmark applications.

Figures 7.14 and 7.15 show the design space exploration for the JPEG and MP3 multiprocessor systems. Figure 7.14(a) shows the design space of the JPEG algorithm implementation. The (a) subfigures show the runtime performance of the benchmarks versus area. The group of data points in the left corner of each graph in Figures 7.14(a) and 7.15(a) corresponds to the single processor implementation. The data points with different colors on the right of the graphs correspond to different multiprocessor configurations that have been implemented (refer to Figure 7.1).

Figure 7.12: First frames of the video sequences, best viewed in color: (a) Tennis; (b) Mom; (c) Window; (d) Flower Bed.

Figure 7.13: JPEG encoding runtime for each frame in the video sequence. The plot shows runtime (clock cycles, ×10^6) against the frame sequence for the single-core processor and for the multi-core processor selected via the heuristic.

                      Estimation-based                     Heuristic
  Configuration       ICache   DCache   Area      XPRES    ICache   DCache   Area      XPRES    Difference
                      (bytes)  (bytes)  (mm2)              (bytes)  (bytes)  (mm2)

  5 cores             2048     2048     0.69315            2048     2048     0.69315
  (4 stages)          1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.75445   √        1024     1024     0.75445   √
                      1024     1024     0.65502            1024     1024     0.65502
                      2048     1024     0.86497   √        2048     1024     0.86497   √
    Total Area (mm2)                    3.622621                             3.622621            0%
    Runtime (cycles)           2,509,052                            2,509,052                    0%
    Cost value                 9,089,344                            9,089,344                    0%

  5 cores             2048     2048     0.85636   √        2048     2048     0.85636   √
  (5 stages)          2048     1024     1.01799   √        2048     2048     1.03706   √
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.81981   √        1024     1024     0.81981   √
                      1024     1024     0.65502            1024     1024     0.65502
    Total Area (mm2)                    4.004213                             4.023277            0.48% ↑
    Runtime (cycles)           2,385,684                            2,384,698                    0.04% ↓
    Cost value                 9,552,786                            9,594,300                    0.43% ↑

  6 cores             2048     4096     0.73678            2048     4096     0.73678
  (4 stages)          1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      4096     4096     0.97129   √        4096     4096     0.97129   √
    Total Area (mm2)                    4.328164                             4.328164            0%
    Runtime (cycles)           2,231,559                            2,231,559                    0%
    Cost value                 9,658,553                            9,658,553                    0%

  9 cores             4096     4096     0.94980   √        4096     2048     0.90617   √
  (5 stages)          4096     2048     0.73678            4096     2048     0.73678
                      4096     2048     1.08069   √        4096     2048     1.08069   √
                      4096     2048     1.08069   √        4096     2048     1.08069   √
                      4096     2048     0.78221   √        4096     2048     0.78221   √
                      4096     2048     0.78221   √        4096     2048     0.78221   √
                      4096     2048     0.78221   √        4096     2048     0.78221   √
                      4096     2048     0.88646   √        4096     2048     0.88646   √
                      1024     1024     0.65502            1024     1024     0.65502
    Total Area (mm2)                    7.736048                             7.692421            0.56% ↓
    Runtime (cycles)           1,812,193                            1,825,402                    0.73% ↑
    Cost value                 14,019,212                           14,041,760                   0.16% ↑

Table 7.5: Configurations obtained for the JPEG benchmark. The table compares the configurations obtained via the estimation-based and heuristic approaches.

Figure 7.14: JPEG multiprocessor pipeline systems design space. (a) Runtime (clock cycles, ×10^6) versus area (mm2) for the 1-core, 5-core (4 stages), 5-core (5 stages), 6-core (4 stages) and 9-core (5 stages) systems, with the optimal points for each topology marked. (b) Cost (runtime × area, ×10^7) versus area (mm2), with the optimal points for each topology marked.

Figure 7.15: MP3 multiprocessor pipeline systems design space. (a) Runtime (clock cycles, ×10^10) versus area (mm2) for the 1-core, 4-core (3 stages) and 6-core (4 stages) systems, with the optimal points for each topology marked. (b) Cost (runtime × area, ×10^10) versus area (mm2), with the optimal points for each topology marked.

Figure 7.16: Pareto points of a JPEG multiprocessor pipeline systems design space. (a) Runtime (clock cycles, ×10^6) versus area (mm2); (b) cost (runtime × area, ×10^7) versus area (mm2).

Figure 7.17: Pareto points of an MP3 multiprocessor pipeline systems design space. (a) Runtime (clock cycles, ×10^10) versus area (mm2); (b) cost (runtime × area, ×10^10) versus area (mm2).

                      Estimation-based                     Heuristic
  Configuration       ICache   DCache   Area      XPRES    ICache   DCache   Area      XPRES    Difference
                      (bytes)  (bytes)  (mm2)              (bytes)  (bytes)  (mm2)

  4 cores             2048     1024     0.76409            2048     1024     0.76409
  (3 stages)          2048     1024     0.76409            2048     1024     0.76409
                      2048     2048     0.89911   √        2048     1024     0.90146   √
                      4096*    4096     1.03740            4096     2048     0.82678
    Total Area (mm2)                    3.486102                             3.256417            6.59% ↓
    Runtime (cycles)           3,865,873,883                        4,150,654,337               7.37% ↑
    Cost value                 13,476,829,531                       13,516,259,783              0.29% ↑

  6 cores             1024     1024     0.74502            1024     1024     0.74502
  (4 stages)          2048     1024     0.76409            2048     1024     0.76409
                      2048     1024     0.76409            2048     1024     0.76409
                      2048     1024     0.76409            2048     1024     0.76409
                      1024     1024     0.74502            1024     1024     0.74502
                      4096     1024     0.80772            4096     2048     0.82678
    Total Area (mm2)                    4.590027                             4.609091            0.42% ↑
    Runtime (cycles)           3,555,683,787                        3,544,700,433               0.31% ↓
    Cost value                 16,320,684,017                       16,337,846,899              0.11% ↑

  * This cache configuration has an associativity of two. Note that all other configurations have a set associativity of one.

Table 7.6: Configurations obtained for the MP3 benchmark. The table compares the configurations obtained via the estimation-based and heuristic approaches.

From our heuristic, we are able to determine the near-optimal configurations (lowest cost value): the design with five processors (four stages) for JPEG and the design with four processors (three stages) for MP3. The designs for these systems are shown in Figures 7.1(b) and 7.1(f) respectively.

For JPEG encoding, our heuristic came within 0.43% of the optimum value (lowest cost, obtained via the estimation-based search; refer to Section 7.4.4), while for MP3 encoding we came within 0.29% of the lowest cost attainable. The heuristic analysis provided us with the configuration that is closest to optimal for the particular MP3 and JPEG encoders.

In Figures 7.14(a) and 7.15(a), we show that our heuristic provides points close to the optimum runtime while minimizing the cost function. With the five processor (four stages) parallel pipeline JPEG implementation, we obtain at least a 4.03× speedup over the fastest single processor implementation, and a 3.31× speedup with the four processor (three stages) parallel pipeline MP3 encoder. These comparisons are performed between the configurations obtained through the heuristic approach and the fastest implementation of the single processor system (JPEG benchmark: 32kB instruction and data caches with associativities of 2 and 4 respectively; MP3 benchmark: 32kB caches with an associativity of 4 for both instruction and data caches).

Figures 7.16(a) and 7.17(a) show the Pareto points of Figures 7.14 and 7.15. The multiprocessor configurations (the group of points on the right of each figure) provide better speedups than the single processor configuration. Figures 7.16(b) and 7.17(b) show the cost functions derived from the corresponding runtime graphs. It can be seen that as performance increases, the cost of the multiprocessor implementation increases as well.

In contrast to the optimal cost values of the multiprocessor systems shown in Tables 7.5 and 7.6, the most cost efficient single processor systems are the system with a 4096-byte instruction cache and a 2048-byte data cache without XPRES enabled for the JPEG benchmark (cost of 8,209,174), and the system with a 4096-byte instruction cache and a 2048-byte data cache without XPRES enabled for the MP3 benchmark (cost of 11,231,663,263). In other words, when comparing the most cost efficient single processor system with the heuristically selected multiprocessor system, the JPEG benchmark has a speedup of 4.44× (area increase of 4.92×) and the MP3 benchmark has a speedup of 3.51× (area increase of 4.22×).

The more cost efficient single processor system (compared with our multiprocessor system) is in line with the notion that there is inefficiency in distributing workload among processors. However, our configuration space |K| is relatively small, as we did not explore subsets of the instruction set architecture (ISA) due to simulation time constraints. If this space were explored, it would be possible to obtain an even more cost efficient multiprocessor system compared with a single processor approach.

7.7 Discussion

Our approach uses ASIPs as processing entities in a multiprocessor system; these provide configurability options not found in conventional processors and DSPs. ASIPs in multiprocessors allow different processing entities to be configured independently.

Thus, components with low computational intensity can be configured to be basic and small, while highly computation-intensive components are augmented with wider buses, extensible instructions and bigger cache sizes. However, the design space increases exponentially as more ASIPs are added to the system, and a thorough exploration is not feasible for a large design with a multitude of processors. Our heuristic is well suited to such design space exploration.

The heuristic provides a quick evaluation of a multiprocessor pipeline design and produces a set of recommended configurations, one for each processor in the design, whose cost is close to minimal. This methodology trades off accuracy for improved search time and should be used during the design stage of a multiprocessor pipeline implementation. The designer can then quickly evaluate the performance of different multiprocessor configurations (i.e., different pipeline stages and flows; refer to Figure 7.1) without performing detailed and thorough simulations. This greatly reduces design time while increasing the search space that can be covered.

Usually, a cost function for hardware is the product of the area and runtime values, and the result with the lowest product is selected. However, our heuristic in Figure 7.9 uses the sum of the normalized values of area and core iteration runtime. A characteristic of the data is that higher area values are always associated with lower core iteration runtimes, R^process (refer to Section 7.4.3), and we would like a greater balance between runtime and area. Multiplication intrinsically favors input values of very high or very low magnitude. By summing the normalized area cost and runtime, each quantity is given equal weight and an equal opportunity to be considered. When evaluating among different multiprocessor configurations, the conventional cost calculation (area × runtime) is used (see Tables 7.5 and 7.6).

The heuristic contains three coefficient values (i.e., α, β and γ) which provide weights to the normalized area, the normalized runtime and the initialization/finalization runtime ratios respectively. These coefficients have been set to the value of 1 and have provided results that are very close to the optimum values. It is understood that this methodology and heuristic require further experimentation and evaluation. However, multiprocessor benchmark programs are hard to obtain due to the complexity of finding and partitioning sequential data streaming programs. Different architectural configurations and benchmark programs may require different coefficient values in order for the heuristic to achieve an acceptable level of performance.

Examples of other possible benchmarks include MPEG1, MPEG2, MPEG4, JPEG2000, OGG, H.261, H.263, AES, DES, Triple DES and various network routing protocols, filters, etc. Most applications are optimized for single processor systems and so require considerable effort to understand the algorithm and to partition the code back into higher level partitions. Proprietary projects have also resulted in scarce availability of readily partitioned streaming programs. In this work, we obtained single processor applications for JPEG and MP3, and partitioned the programs into many variations, as demonstrated in the previous sections.

From Tables 7.5 and 7.6, the systems which achieve the lowest cost values are selected as the most cost efficient and represent the final designs for implementation. The selections are the 5-core (4 stages) system for the JPEG benchmark and the 4-core (3 stages) system for the MP3 benchmark application (refer to Tables 7.5 and 7.6).

The proposed heuristic produces recommended configurations with cost values within 0.5% of the values achieved through the estimation-based approach. As the number of cores increases, it can be observed that the runtime for each individual benchmark decreases while the total area utilization increases. However, cost values increase as more processors are added to the system. This implies that the increase in performance does not compensate for the increase in area used by the extra logic included in the design. These designs can be replaced either with systems with smaller numbers of processors or with pipeline partitions that provide a more balanced utilization rate amongst the processors in the systems. When the final design is selected and confirmed by the designer, an exhaustive search can be used to produce the optimal and most cost efficient design before the multiprocessor pipeline design is sent for implementation and fabrication.

Neighboring processors in the pipeline system are connected via FIFO queues. Queue buffers of sufficient size do not dramatically affect the cost function and performance metric of the proposed algorithm. As only full or empty queues stall the affected pipeline stages, the stage mainly responsible for performance is the critical stage in the pipeline (which is taken into account in our cost metric). The processing rate of the critical stage of the pipeline is assumed to be greater than the incoming data rate of the first processor in the multiprocessor system; thus, the issue of insufficient buffer sizes and the possibility of incoming data overloading the buffers does not arise.

The use of ASIPs with the same ISA provides a uniform platform for designers to build more complex systems. A single-ISA system allows easier code maintenance and mapping of code segments to different physical processors compared with multi-ISA systems. The extensibility option of an ASIP provides a quick and easy optimization path for code segments without introducing different variants of processing entities. Such capability improves performance without drastically increasing area usage. The common set of tools also provides easy verification in a multiprocessor design.

7.8 Conclusion

In conclusion, we have formalized the problem of mapping processor configurations in the context of an ASIP multiprocessor system that is implemented in a pipelined manner. We have also presented a heuristic to obtain a near-optimal configuration (lowest area per performance ratio), given a partitioned benchmark program. This is complemented with a full methodology that uses the heuristic to rapidly explore the architectures provided by the designer, and thus to select the architecture that provides the best performance per unit area. This framework used Tensilica's Xtensa LX [20] configurable cores, which provide the queue interface used to connect the processors in the system in a pipelined configuration. We have explored the design space of such an architecture by using the existing ASIP design flow to rapidly select the best cache configurations and extended instructions to provide the necessary speedup while minimizing area, thus providing an excellent performance to area ratio.

Chapter 8

Conclusions

System performance, area and power utilization have been the major metrics for efficiency in embedded system design. Plenty of research work has explored a variety of different systems and processor architectures to improve the performance of embedded systems, with emphasis on energy consumption and the portability of the system. The architectures explored range from processor-centric architectures, such as extensible systems, superscalarity, reconfigurability and asynchronous designs, to system-wide configurations, such as the topology of components on chip, multiprocessor topology, coprocessor systems and heterogeneous design. With the multitude of architectures to explore and a huge design exploration space, formal methodologies, algorithms, techniques and frameworks are necessary to provide a quick design turnaround time for embedded system design and to increase competitiveness in the market space.

In Chapter 1, a historical view of embedded systems was provided. The humble beginnings of the computer system and the proliferation of such systems into embedded devices were explored. The chapter ended with possible architectural enhancements for embedded systems and the motivation for an automated design methodology for such systems.


Most research to date focuses on application-specific processors that are specifically targeted at single-processor systems. However, there is a growing trend towards System-on-Chip design, particularly multiprocessor systems. The advancement in the miniaturization of silicon technology has made the integration of additional hardware logic possible. The advancement has been so great that most new designs are not able to keep up with the increasing availability of additional circuits. Thus, there is a need for a design methodology to generate application-specific designs that harness these additional processing resources, be it through coprocessors or additional processors in the system-on-chip. Chapters 2 and 3 provided the necessary background knowledge and the rationale behind the approaches taken in the course of this thesis.

Chapter 4 addresses the issues of previous processor generation methodologies, which have mainly focused on customizing instruction sets, by extending the work on base processors. The chapter presents an RTL generation scheme which uses the processor generation tool ASIPmeister for the SimpleScalar/PISA instruction set architecture. The framework supports system calls in order to implement C programs that use standard C libraries for input/output functionality. The framework provides total control of the implementation and configuration of a 6-stage pipeline base processor, thus providing opportunities for further design exploration, not only by extending instructions, but also by reducing the instruction set to improve the performance of the system. Chapter 4 provided an approach to generate a processor with various subsets of instructions, in contrast with other approaches which only extend the base processor. The framework provides the necessary tools and methodology to generate both hardware and software packages. An application is compiled and analyzed to obtain a minimized instruction set, which is used to generate a customized processor using only the necessary instructions. Software libraries (application stack, syscall data, syscall subroutines and bootloader) are generated for the compiled program, which is then emitted as instruction and data memory VHDL files. The VHDL code of the customized processor is then generated and combined with the memory components. For five benchmark applications, it was shown that, on average, processor size can be reduced by 30%, energy consumption by 24%, and performance improved by 24%.

Building upon the framework in Chapter 4, Chapter 5 explored a tightly-coupled coprocessor architecture designed to speed up loop executions within an application. The computation-intensive loops within applications are accelerated by tightly coupling a coprocessor to the ASIP, which can be generated using the methodology from Chapter 4. The availability of the RTL source of the base processor made possible the integration of a coprocessor directly into the customized core. Chapter 5 illustrated the advantages of such an approach by investigating a JPEG encoding algorithm and accelerating one of its loops by implementing it in a coprocessor. The JPEG encoder was profiled and the loop segment with the highest optimization potential (the Huffman encoder) was selected for acceleration and converted to a coprocessor. The architecture introduced achieves parallelism by allowing the coprocessor to assist the base processor during loop execution. Performance was achieved through loop pipelining and memory latency hiding. For comparison purposes, a high-level synthesis (HLS) approach was used to generate a coprocessor that would execute the loop completely within the coprocessor itself. The HLS tool (i.e., SPARK) is able to perform loop unrolling and several other optimizations. To compare with the coprocessor approaches, the JPEG encoder was also accelerated by implementing frequently used instructions as customized instructions. In summary, it was found that the coprocessor approaches achieved much better speedup and lower energy consumption than the customized instruction approach. The integrated coprocessor approach managed to offload more computations from the base processor to the coprocessor than the HLS approach, so achieving better performance. A loop performance improvement of 2.57× was achieved using the custom coprocessor approach, compared with 1.58× for the HLS approach and 1.33× for the customized instruction approach, all in comparison with the main processor. Energy savings within the loop were 57×, 28× and 19× respectively.

Chapter 7 explored the use of multiple cores to speed up sequential streaming applica- tions. Chapter 6 explored the possibility of speeding up the JPEG encoder application with heterogeneous components, such that the speedup increases at a faster rate than area increase. The multiprocessor system was implemented using Tensilica’s Xtensa

LX processor. The multiprocessor architecture explored follows a pipeline concept by which different tasks are mapped and executed in a single processor, and each pro- cessor is a pipeline stage in the overall system. Communication between processors were implemented using queue communication interfaces that are available on Xtensa

LX processors. In Chapter 6, the single processor application (JPEG encoder) was partitioned and parallelized across a number of different multiprocessor configura- tions (with each processor having identical configurations). Later on, each processor was enhanced with additional instructions if it was found to be the bottleneck (i.e., critical processor), or if the workload was small, it was diminished by the use of a less enhanced processor with a reduced cache. Parallelization was carried out with up to nine processors with utilization of between 50% to 80%. Speedups of up to 4.6 with × a seven core system with an area increase of only 3.1 were achieved. Additionally, × Chapter 6 included a case study and comparison of multiprocessors which were based on a master-slave model.

Although Chapter 6 confirmed that a heterogeneous multiprocessor system is a more efficient method for optimizing those applications that can be parallelized, there was no formal methodology available to automate the design of such an architecture. Chapter 7 formulated the problem of mapping each algorithmic stage in the system to an ASIP configuration, and presented an exhaustive algorithm and a heuristic to efficiently search the design space of a pipeline-based heterogeneous multiprocessor system. An optimal configuration is defined as the set of hardware configurations (e.g., instruction sets, cache sizes, etc.) that provides the best performance at the lowest possible cost. In addition to the JPEG benchmark program, an MP3 encoder application was used as a case study example. The benchmark applications were partitioned, and each group of partitions was mapped to a different multiprocessor topology (i.e., different pipeline stages and processor counts). Chapter 7 also presented an estimation technique to obtain the runtime of an application running on a pipeline multiprocessor architecture. Based on the estimation technique, a heuristic was developed to obtain a near-optimal set of configurations for the multiprocessor system. The multiprocessor design provided a performance improvement of at least 5.58× for JPEG and 3.47× for MP3 over a single processor design. The minimum cost (area × clock cycle) obtained through the heuristic was within 0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks respectively (using the exhaustive search method).

As future work, an operating system could be used to manage the allocation and mapping of different partitions of programs to different processors in a pipeline multiprocessor network. Such an infrastructure would prove necessary if different applications are to be used on the same platform. For example, a device might be used to process JPEG images and also for voice recording.

Plenty of research has been done on the sharing of components and interconnects within processor architectures. Building on these concepts and ideas, different heterogeneous multipipeline processor systems, as presented in Chapters 6 and 7, can be merged, with processors and queues shared among different applications. There will also be a need in the future to develop such a pipeline multiprocessor architecture optimized for a particular group of streaming applications.

The methodologies and results presented in this thesis demonstrate that customizing processor platforms specifically for targeted applications is necessary for an efficient system: the best performance for the lowest possible cost. It is also shown that a formal methodology leads to better designs, more efficient systems and reduced design times. To meet the challenge of increasing processing resources in the coming years, the thesis has thus presented a group of approaches that efficiently use such resources and that will lead to designs meeting performance, area and energy constraints.

Bibliography

[1] Altera Nios Processor. Altera Corp. (http://www.altera.com).

[2] Apollo Guidance Computer. Computer History Museum (http://www.computerhistory.org).

[3] ARC Configurable Processors. ARC International (http://www.arc.com).

[4] ARCtangent. ARC International (http://www.arc.com).

[5] ASIP Meister. ASIP Solutions (http://www.asip-solutions.com/asip meister.html).

[6] Cascade. CriticalBlue (http://www.criticalblue.com).

[7] Design Compiler. Synopsys, Inc. (http://www.synopsys.com).

[8] Independent JPEG Group. IJG (http://www.ijg.org).

[9] Intel Itanium Processor Microarchitecture Overview. Intel Corp. (http://www.intel.com/design/Itanium/microarch ovw/).

[10] Jazz DSP. Improv Inc. (http://www.improvsys.com).

[11] JPEG Encoder Core. Alma Technologies (http://www.alma-tech.com).

[12] Lexus LS. Lexus (http://www.lexus.com/models/LS/).

[13] MicroBlaze Architecture. Xilinx, Inc. (http://www.xilinx.com/ipcenter/processor central//architecture.htm).

[14] Rockwell PPS-4. The Antique Chip Collector's Page (http://www.antiquetech.com/chips/PPS-4.htm).

[15] SP-5flex. 3DSP Corp. (http://www.3dsp.com).

[16] SystemC Initiative (http://www.systemc.org).

[17] Tensilica. Tensilica, Inc. (http://www.tensilica.com).

[18] TMS1000. The Antique Chip Collector's Page (http://www.antiquetech.com/chips/TMS1000.htm).

[19] Xilinx. Xilinx, Inc. (http://www.xilinx.com).

[20] Xtensa Processor. Tensilica Inc. (http://www.tensilica.com).

[21] Intel XScale Core: Developer's Manual. Intel Corporation, 2000.

[22] Application to silicon: Understanding the Improv methodology. White paper, Improv Systems Inc., June 2001.

[23] FLIX: Fast relief for performance-hungry embedded applications. Tensilica Inc. (http://www.tensilica.com/pdf/FLIX White Paper v2.pdf), 2005.

[24] Our History. Xilinx (http://www.xilinx.com/company/history.htm), 2007.

[25] Emile Aarts, Panos Markopoulos, and Boris Ruyter. The persuasiveness of ambient intelligence. In Security, Privacy, and Trust in Modern Data Management, pages 367–381. Springer Berlin Heidelberg, 2007.

[26] Ulises Agüero and Subrata Dasgupta. A Plausibility-Driven Approach to Computer Architecture Design. Communications of the ACM, 30(11):922–932, 1987.

[27] Sudhir Ahuja, Nicholas J. Carriero, David H. Gelernter, and Venkatesh Krishnaswamy. Matching Language and Hardware for Parallel Computation in the Linda Machine. IEEE Transactions on Computers, 37(8):921–929, 1988.

[28] E. Arnould, H. T. Kung, O. Menzilcioglu, and K. Sarocky. A Systolic Array Computer. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), volume 10, pages 232–235, 1985.

[29] Siamak Arya, Howard Saches, and Sreeram Duvvuru. An Architecture for High Instruction Level Parallelism. In Proceedings of the 28th Annual Hawaii International Conference on System Sciences, volume 1, pages 153–162, Wailea, HI, USA, 1995.

[30] Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer, 35(2):59–67, 2002.

[31] Jakob Axelsson. A Case Study in Heterogeneous Implementation of Automotive Real-Time Systems. In CODES'98, Seattle, 1998.

[32] Sati Banerjee, Takeo Hamada, Paul M. Chau, and Ronald D. Fellman. Macro Pipelining Based Scheduling on High Performance Heterogeneous Multiprocessor Systems. IEEE Transactions on Signal Processing, 43(6):1468–1484, 1995.

[33] Michael Barr. Embedded Systems Glossary (http://www.netrino.com/Publications/Glossary/), 2007.

[34] Sanjoy Baruah. Task partitioning upon heterogeneous multiprocessor platforms. In RTAS'04, pages 536–543, 2004.

[35] Aleksandar Berić, Ramanathan Sethuraman, Carlos Alba Pinto, Harm Peters, Gerard Veldman, Peter van de Haar, and Marc Duranton. Heterogeneous Multiprocessor for High Definition Video. In ICCE'06, pages 401–402, 2006.

[36] Lee Boysel and J. Murphy. Four-Phase LSI Logic Offers New Approach to Computer Designer. Computer Design, pages 141–146, April 1970.

[37] Tracy D. Braun, Howard Jay Siegel, and Anthony A. Maciejewski. Heterogeneous computing: Goals, methods, and open problems. In HiPC 2001, volume 2228, pages 302–320, Hyderabad, India, 2001. Springer.

[38] Steve Carr. Combining Optimization for Cache and Instruction-Level Parallelism. In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques, pages 238–247, Boston, MA, USA, 1996.

[39] Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. Compiler Optimizations for Improving Data Locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), pages 252–262, San Jose, CA, USA, 1994.

[40] Karam S. Chatha and Ranga Vemuri. A Tool for Partitioning and Pipelined Scheduling of Hardware-Software Systems. In Proceedings of the 11th International Symposium on System Synthesis, pages 145–151, Hsinchu, 1998.

[41] Newton Cheung, J¨orgHenkel, and Sri Parameswaran. Rapid configuration &

instruction selection for an ASIP: A case study. In Norbert Wehn and Diederik

Verkest, editors, Proceedings of the Design, Automation and Test in Europe BIBLIOGRAPHY 204

Conference and Exhibition (DATE), pages 802–807, Messe Munich, Germany,

2003. IEEE Computer Society, Los Alamitos, California.

[42] Newton Cheung, Sri Parameswaran, and J¨orgHenkel. INSIDE: INstruction

selection/identification & design exploration for extensible processors. In Pro-

ceedings of the International Conference on Computer Aided Design (ICCAD),

pages 291–297, 2003.

[43] Newton Cheung, Sri Parameswaran, Jörg Henkel, and Jeremy Chan. MINCE: Matching instructions using combinational equivalence for extensible processor. In Georges Gielen and Joan Figueras, editors, Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), volume 2, pages 1020–1025, CNIT La Défense, Paris, France, 2004. IEEE Computer Society, Los Alamitos, California.

[44] Hoon Choi, Jong-Sun Kim, Chi-Won Yoon, In-Cheol Park, Sung Ho Hwang, and Chong-Min Kyung. Synthesis of application specific instructions for embedded DSP software. IEEE Transactions on Computers, 48(6):603–614, 1999.

[45] Yaohan Chu. Application-specific coprocessor computer architecture. In Proceedings of the International Conference on Application Specific Array Processors, pages 653–664, Princeton, NJ, USA, 1990.

[46] Jason Cong, Yiping Fan, Guoling Han, and Zhiru Zhang. Application-specific instruction generation for configurable processor architectures. In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, pages 183–189, Monterey, California, USA, 2004. ACM Press, New York, NY, USA.

[47] Jason Cong, Guoling Han, and Wei Jiang. Synthesis of an Application-Specific Soft Multiprocessor System. In Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field programmable gate arrays (FPGA'07), pages 99–107, Monterey, CA, USA, 2007. ACM Press, New York, NY, USA.

[48] James W. Cooley and John W. Tukey. An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation, 19:297–301, 1965.

[49] CriticalBlue. Coprocessor synthesis – increasing system on chip platform ROI. Technical report, CriticalBlue, June 2004.

[50] Richard Curtin. ASIC Design Methods Using VHDL. In Euro ASIC'90, pages 176–179, 1990.

[51] S. Dasgupta. On the Verification of Computer Architectures using an Architecture Description Language. In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 32–38, Stockholm, Sweden, 1983. IEEE Press, New York.

[52] S. Dasgupta. The Design and Description of Computer Architectures. Wiley, New York, 1984.

[53] Srinivas Devadas and Sharad Malik. A survey of optimization techniques targeting low power VLSI circuits. In Proceedings of the 32nd Design Automation Conference (DAC '95), pages 242–247, San Francisco, CA, 1995.

[54] Giuliano Donzellini, Stefano Nervi, Domenico Ponta, Sergio Rossi, and Stefano Rovetta. Object Oriented ARM7 Coprocessor. Pages 243–252. IEEE Computer Society, Washington, DC, USA, 1998.

[55] Mike Ebbers, Wayne O'Brien, and Bill Ogden. Introduction to the New Mainframe: z/OS Basics. IBM Corporation, 1st edition, 2006.

[56] Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm, and Dean M. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, 17(5):12–19, 1997.

[57] Rolf Ernst, Jörg Henkel, and Thomas Benner. Hardware-software cosynthesis for microcontrollers. IEEE Design & Test of Computers, 10(4):64–75, 1993.

[58] Jennifer Eyre and Jeff Bier. The evolution of DSP processors. IEEE Signal Processing Magazine, 17(2):43–51, March 2000.

[59] Federico Faggin. The Intel 4004 (http://www.intel4004.com/).

[60] Krisztián Flautner, Rich Uhlig, Steve Reinhardt, and Trevor Mudge. Thread Level Parallelism of Desktop Applications. In Proceedings of the Workshop on Multi-threaded Execution, Architecture and Compilation (MTEAC 2000), page 9, 2000.

[61] Gene Frantz and Ray Simar. DSP: Of processors and processing. Queue, 2(1):22–30, March 2004.

[62] M. Freericks. The nML machine description formalism. Technical report, Technische Universität Berlin, Computer Science, Berlin, 1993.

[63] C. H. Gebotys and R. J. Gebotys. Designing for Low Power in Complex Embedded DSP Systems. In Proceedings of the 32nd Hawaii International Conference on System Sciences (HICSS-32), volume 3, page 8, Maui, HI, 1999.

[64] Tayeb A. Giuma and Kevin W. Hart. Microcomputer Bus Architectures. In Southcon Conference, pages 431–437, Orlando, FL, 1996.

[65] Tony Givargis, Frank Vahid, and Jörg Henkel. System-Level Exploration for Pareto-Optimal Configurations in Parameterized System-on-a-Chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 10(4):416–422, 2002.

[66] Tilman Glökler and Heinrich Meyr. Power reduction for ASIPs: a case study. In IEEE Workshop on Signal Processing Systems, pages 235–246, Antwerp, Belgium, 2001.

[67] James R. Goodman, Jian-tu Hsieh, Koujuch Liou, Andrew R. Pleszkun, P. B. Schechter, and Honesty C. Young. PIPE: A VLSI Decoupled Architecture. In Proceedings of the 12th annual international symposium on Computer architecture, pages 20–27. IEEE Computer Society Press, Los Alamitos, CA, USA, 1985.

[68] Gert Goossens, Dirk Lanneer, Werner Geurts, and Johan Van Praet. Design of ASIPs in multi-processor SoCs using the Chess/Checkers retargetable tool suite. In International Symposium on System-on-Chip (SoC 2006), pages 1–4, Tampere, Finland, 2006.

[69] Sathish Gopalakrishnan and Marco Caccamo. Task Partitioning with Replication upon Heterogeneous Multiprocessor Systems. In RTAS'06, pages 199–207, 2006.

[70] Herbert R. J. Grosch. Computer: Bit Slices From A Life. 2003.

[71] Rajesh K. Gupta and Giovanni De Micheli. Specification and analysis of timing constraints for embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 16(3):241–256, 1997.

[72] Sumit Gupta, Nikil Dutt, Rajesh Gupta, and Alex Nicolau. SPARK: A high-level synthesis framework for applying parallelizing compiler transformations. In International Conference on VLSI Design, pages 461–466, 2003.

[73] George Hadjiyiannis, Silvina Hanono, and Srinivas Devadas. ISDL: An instruction set description language for retargetability. In Proceedings of the Design Automation Conference (DAC), pages 299–302, 1997.

[74] Ashok Halambi, Peter Grun, Vijay Ganesh, Asheesh Khare, Nikil Dutt, and Alex Nicolau. EXPRESSION: a language for architecture exploration through compiler/simulator retargetability. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), pages 485–490, Munich, Germany, 1999.

[75] Eric Hamilton. JPEG File Interchange Format. Technical report, C-Cube Microsystems, September 1, 1992.

[76] Uwe Hansmann. Pervasive Computing. Springer-Verlag, 2000.

[77] Scott Hauck, Thomas W. Fry, Matthew M. Hosler, and Jeffrey P. Kao. The Chimaera Reconfigurable Functional Unit. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 12(2):206–217, 2004.

[78] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 3rd edition, 2003.

[79] Sato Hiroyuki and Yoshida Teruhiko. Characteristics of loop unrolling effect: software pipelining and memory latency hiding. In Innovative Architecture for Future Generation High-Performance Processors and Systems, pages 63–72, Maui, HI, USA, 2001.

[80] Chao Huang, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. High-level Synthesis Using Computation-unit Integrated Memories. In IEEE/ACM International Conference on Computer Aided Design (ICCAD-2004), pages 783–790, 2004.

[81] Ing-Jer Huang and Alvin M. Despain. Synthesis of application specific instruction sets. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 14(6):663–675, 1995.

[82] Kuo-Chan Huang and Feng-Jian Wang. Design patterns for Parallel Computations of Master-Slave Model. In International Conference on Information, Communications and Signal Processing, volume 3, pages 1508–1512, 1997.

[83] Luong D. Hung and Shuichi Sakai. Dynamic Estimation of Task Level Parallelism with Operating System Support. In Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN'05), page 6, 2005.

[84] J. K. Hunter, J. V. McCanny, A. Simpson, Y. Hu, and J. G. Doherty. JPEG encoder system-on-a-chip demonstrator. In Conference Record of the Thirty-Third Asilomar Conference on Signals, Systems, and Computers, volume 1, pages 762–766, 1999.

[85] Makiko Itoh, Shigeaki Higaki, Jun Sato, Akichika Shiomi, Yoshinori Takeuchi, Akira Kitajima, and Masaharu Imai. PEAS-III: an ASIP design environment. In Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors, pages 430–436, Austin, TX, USA, 2000.

[86] Makiko Itoh, Yoshinori Takeuchi, Masaharu Imai, and Akichika Shiomi. Synthesizable HDL generation for pipelined processors from a micro-operation description. IEICE Transactions on Fundamentals, E83-A(3):394–400, 2000.

[87] Tom R. Jacobs, Vassilios A. Chouliaras, and David J. Mulvaney. Thread-Parallel MPEG-4 and H.264 Coders for System-on-Chip Multi-Processor Architectures. In Digest of Technical Papers, 2006 International Conference on Consumer Electronics (ICCE '06), pages 91–92, 2006.

[88] Manoj Kumar Jain, M. Balakrishnan, and Anshul Kumar. ASIP design methodologies: survey and issues. In Proceedings of the International Conference on VLSI Design, pages 76–81, Bangalore, India, 2001.

[89] Jinhwan Jeon and Kiyoung Choi. Loop Pipelining in Hardware-Software Partitioning. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '98), pages 361–366, Yokohama, Japan, 1998.

[90] Gilles Kahn. The semantics of a simple language for parallel programming. In IFIP'74, pages 471–475, Stockholm, Sweden, 1974.

[91] J. F. Kaiser. Considerations in the Hardware Implementation of Digital Filters. In 1974 IEEE Conference on Decision and Control including the 13th Symposium on Adaptive Processes, volume 13, pages 106–107, 1974.

[92] Ron Kalla, Balaram Sinharoy, and Joel M. Tendler. IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro, 24(2):40–47, 2004.

[93] Arun Kejariwal, Prabhat Mishra, Jonas Astrom, and Nikil Dutt. HDLGen: Architecture description language driven HDL generation for pipelined processors. Technical report, Center for Embedded Computer Systems, University of California, Irvine, CA 92697, USA, February 2003.

[94] Manho Kim, Daewook Kim, and Gerald E. Sobelman. MPEG-4 performance analysis for a CDMA network-on-chip. In Proceedings of the 2005 International Conference on Communications, Circuits and Systems, pages 493–496, 2005.

[95] Kevin D. Kissell. MIPS MT: A Multithreaded RISC Architecture for Embedded Real-Time Processing. High Performance Embedded Architectures and Compilers (HiPEAC 2008), 4917:9–21, 2008.

[96] Shinsuke Kobayashi, Kentaro Mita, Yoshinori Takeuchi, and Masaharu Imai. Rapid prototyping of JPEG encoder using the ASIP development system: PEAS-III. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 485–488, Hong Kong Exhibition and Convention Centre, Hong Kong, 2003.

[97] Takeshi Kodaka, Keiji Kimura, and Hironori Kasahara. Multigrain Parallel Processing for JPEG Encoding on a Single Chip Multiprocessor. In IWIA'02, pages 57–63, 2002.

[98] David Koufaty and Deborah T. Marr. Hyperthreading Technology in the Netburst Microarchitecture. IEEE Micro, 23(2):56–65, 2003.

[99] Kayhan Küçükçakar. An ASIP design methodology for embedded systems. In Proceedings of the Seventh International Workshop on Hardware/Software Codesign, pages 17–21, Rome, Italy, 1999.

[100] R. Kumar, D.M. Tullsen, N.P. Jouppi, and P. Ranganathan. Heterogeneous Chip Multiprocessors. Computer, 38(11):32–38, November 2005.

[101] A. Langi and W. Kinsner. An architectural design of a wavelet coprocessor. In Conference Proceedings of the 1994 Canadian Conference on Electrical and Computer Engineering, volume 2, pages 497–500, Halifax, NS, 1994.

[102] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, pages 330–335, Research Triangle Park, NC, USA, 1997.

[103] Jae-Jin Lee and Gi-Yong Song. Super-Systolic Array for 2D Convolution. In TENCON 2006, 2006 IEEE Region 10 Conference, pages 1–4, Hong Kong, 2006.

[104] Jong-eun Lee, Kiyoung Choi, and Nikil D. Dutt. Energy-Efficient Instruction Set Synthesis for Application-Specific Processors. In Proceedings of the 2003 international symposium on Low power electronics and design (ISLPED'03), pages 330–333, Seoul, Korea, 2003. ACM Press, New York, NY, USA.

[105] Mike Tien-Chien Lee, Vivek Tiwari, Sharad Malik, and Masahiro Fujita. Power Analysis and Minimization Techniques for Embedded DSP Software. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 5(1):123–135, 1997.

[106] Rainer Leupers and Peter Marwedel. Instruction-set modelling for ASIP code generation. In Proceedings of the International Conference on VLSI Design, pages 77–80, Bangalore, India, 1996.

[107] Rainer Leupers and Peter Marwedel. Retargetable Code Generation based on Structural Processor Descriptions. Design Automation for Embedded Systems, 3(1):75–108, 1998.

[108] Ted G. Lewis and Hesham El-Rewini. Introduction to Parallel Computing. Prentice Hall, 1992.

[109] Jianmin Li and Chung-Kuan Cheng. Routability Improvement Using Dynamic Interconnect Architecture. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), pages 61–67, Napa Valley, CA, 1995. IEEE Computer Society.

[110] Chih-chang Lin, Malgorzata Marek-Sadowska, and Duane Gatlin. Universal logic gate for FPGA design. In Proceedings of the 1994 IEEE/ACM international conference on Computer-aided design (ICCAD'94), pages 164–168, San Jose, California, USA, 1994. IEEE Computer Society Press, Los Alamitos, CA, USA.

[111] Hung-Yueh Lin, Tay-Jyi Lin, Chie-Min Chao, Yen-Chin Liao, Chih-Wei Liu, and Chein-Wei Jen. Static floating-point unit with implicit exponent tracking for embedded DSP. In Proceedings of the 2004 International Symposium on Circuits and Systems (ISCAS '04), volume 2, pages 821–824, 2004.

[112] Tay-Jyi Lin and Chein-Wei Jen. Cascade - configurable and scalable DSP environment. In IEEE International Symposium on Circuits and Systems (ISCAS 2002), volume 4, pages 870–873, 2002.

[113] Tony C. Lin. Development of U.S. Air Force Intercontinental Ballistic Missile Weapon Systems. Journal of Spacecraft and Rockets, 40(4):491–509, 2003.

[114] Lars Lundberg. Predicting and Bounding the Speedup of Multithreaded Solaris Programs. Journal of Parallel and Distributed Computing, 57(3):322–333, 1999.

[115] Elmar Maas, Dirk Herrmann, Rolf Ernst, Peter Rüffer, Sieghard Hasenzahl, and Martin Seitz. A processor-coprocessor architecture for high end video applications. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-97), volume 1, pages 595–598, Munich, 1997.

[116] Pedro Marcuello, Antonio González, and Jordi Tubella. Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism. IEEE Transactions on Computers, 53(2):114–125, 2004.

[117] Ernesto Martins and José A. Fonseca. Traffic Scheduling Coprocessor with Schedulability Analysis Capability. In Proceedings of the Euromicro Symposium on Digital Systems Design (DSD'01), pages 127–134. IEEE Computer Society, Washington, DC, USA, 2001.

[118] Peter Marwedel. Embedded System Design. Kluwer Academic Publishers, 2003.

[119] Huzefa Mehta, Robert Michael Owens, Mary Jane Irwin, Rita Chen, and Debashree Ghosh. Techniques for Low Energy Software. In Proceedings of the 1997 International Symposium on Low Power Electronics and Design, pages 72–75, 1997.

[120] Prabhat Mishra, Arun Kejariwal, and Nikil Dutt. Rapid exploration of pipelined processors through automatic generation of synthesizable RTL models. In Workshop on Rapid System Prototyping (RSP), 2003.

[121] Prabhat Mishra, Arun Kejariwal, and Nikil Dutt. Synthesis-driven exploration of pipelined embedded processors. In Proceedings of the International Conference on VLSI Design (VLSID'04), pages 921–926, 2004.

[122] Todd C. Mowry, Monica S. Lam, and Anoop Gupta. Design and Evaluation of a Compiler Algorithm for Prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62–75, Boston, Massachusetts, USA, 1992.

[123] Sri Parameswaran, Matthew F. Parkinson, and Peter Bartlett. Profiling in the ASP codesign environment. Journal of Systems Architecture, 46(14):1263–1274, 2000.

[124] Jorgen Matthew David Peddersen, Seng Lin Shee, Andhi Janapsatya, and Sri Parameswaran. Rapid embedded hardware/software system generation. In 18th International Conference on VLSI Design, pages 111–116, Taj Bengal, Kolkata, India, 2005.

[125] D. Pham et al. The design and implementation of a first-generation cell processor. In ISSCC 2005, pages 184–186. IEEE CS Press, 2005.

[126] David Pok, Henry Chien-In Chen, C. Montgomery, B. Y. Tsui, and John Schamus. ASIC Design for Monobit Receiver. In Proceedings of the Tenth Annual IEEE International ASIC Conference and Exhibit, pages 142–146, Portland, OR, USA, 1997.

[127] Scott R. Powell and Paul M. Chau. A Model for Estimating Power Dissipation in a Class of DSP VLSI Chips. IEEE Transactions on Circuits and Systems, 38(6):646–650, 1991.

[128] Shiv Prakash and Alice C. Parker. SOS: Synthesis of Application-Specific Heterogeneous Multiprocessor Systems. Journal of Parallel and Distributed Computing, 16(4):338–351, 1992.

[129] Xiang-Ju Qin, Ming-Cheng Zhu, Zhong-Yi Wei, and Du Chao. An Adaptive Viterbi Decoder Based on FPGA Dynamic Reconfiguration Technology. In Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology (ICFPT'04), pages 315–318, 2004.

[130] Lawrence R. Rabiner and Bernard Gold. Theory and Application of Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1975.

[131] Allan Rae and Sri Parameswaran. Application-Specific Heterogeneous Multiprocessor Synthesis Using Differential-Evolution. In Proceedings of the 11th International Symposium on System Synthesis, pages 83–88, Hsinchu, Taiwan, 1998. IEEE Computer Society.

[132] B. Ramakrishna Rau. Iterative Modulo Scheduling: An Algorithm For Software Pipelining Loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO-27), pages 63–74, San Jose, CA, USA, 1994.

[133] Rahul Razdan and Michael D. Smith. A high-performance microarchitecture with hardware-programmable functional units. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO-27), pages 172–180, 1994.

[134] F. Salice, L. Del Vecchio, L. Pomante, and W. Fornaciari. Partitioning of Embedded Applications onto Heterogeneous Multiprocessor Architectures. In ACM symposium on Applied computing, pages 661–665, Melbourne, Florida, 2003.

[135] Alberto Sangiovanni-Vincentelli and Grant Martin. Platform-Based Design and Software Design Methodology for Embedded Systems. IEEE Design & Test of Computers, 18(6):23–33, November–December 2001.

[136] Tero Säntti and Juha Plosila. Architecture for an Advanced Java Co-Processor. In International Symposium on Signals, Circuits & Systems (ISSCS 2005), 2005.

[137] Klaus E. Schauser, Chris J. Scheiman, J. Mitchell Ferguson, and Paul Z. Kolano. Exploiting the Capabilities of Communications Co-Processors. In Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), pages 109–115. IEEE Computer Society, Washington, DC, USA, 1996.

[138] Seng Lin Shee. VLSI chip implementation for communication protocols: JSCHIP Project. Undergraduate thesis, The University of New South Wales, 2003.

[139] Seng Lin Shee, Andrea Erdos, and Sri Parameswaran. Heterogeneous Multiprocessor Implementations for JPEG: A Case Study. In IEEE/ACM/IFIP International Conference on Hardware - Software Codesign and System Synthesis (CODES+ISSS), Seoul, Korea, 2006.

[140] Seng Lin Shee and Sri Parameswaran. Design Methodology for Pipelined Heterogeneous Multiprocessor System. In Design Automation Conference (DAC'07), pages 811–816, San Diego, CA, USA, 2007.

[141] Gilbert C. Sih and Edward A. Lee. Declustering: A New Multiprocessor Scheduling Technique. IEEE Transactions on Parallel and Distributed Systems, 4(6):625–637, 1993.

[142] James E. Smith. Dynamic Instruction Scheduling and the Astronautics ZS-1. IEEE Computer, 22(7):21–35, 1989.

[143] James E. Smith and Gurindar S. Sohi. The Microarchitecture of Superscalar Processors. Proceedings of the IEEE, 83(12):1609–1624, 1995.

[144] Litong Song and Krishna Kavi. What can we gain by unfolding loops? ACM SIGPLAN Notices, 39(2):26–33, February 2004.

[145] Lawrence Spracklen and Santosh G. Abraham. Chip Multithreading: Opportunities and Challenges. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA-11 2005), pages 248–252, 2005.

[146] Greg Stitt, Roman Lysecky, and Frank Vahid. Dynamic hardware/software partitioning: a first approach. In Design Automation Conference, 2003, pages 250–255, 2003.

[147] Harold S. Stone. High Performance Computer Architecture (3rd Edition). Addison-Wesley, 1993.

[148] Marino T. J. Strik, Adwin H. Timmer, Jef L. van Meerbergen, and Gert-Jan van Rootselaar. Heterogeneous multiprocessor for the management of real-time video and graphics streams. IEEE Journal of Solid-State Circuits, 35(11):1722–1731, 2000.

[149] Fei Sun, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. Custom-instruction synthesis for extensible-processor platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(2):216–228, 2004.

[150] Fei Sun, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. Synthesis of Application-specific Heterogeneous Multiprocessor Architectures using Extensible Processors. In VLSID'05, pages 551–556, 2005.

[151] Gary M. Swift, Sana Rezgui, Jeffrey George, Carl Carmichael, Matthew Napier, John Maksymowicz, Jason Moore, Austin Lesea, R. Koga, and T. F. Wrobel. Dynamic Testing of Xilinx Virtex-II Field Programmable Gate Array (FPGA) Input Output Blocks (IOBs). In IEEE Nuclear and Space Radiation Effects Conference (NSREC'04), page 7, Atlanta, GA, USA, 2004.

[152] R. Reed Taylor and Herman Schmit. Creating a Power-aware Structured ASIC. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design (ISLPED '04), pages 74–77. ACM Press, New York, NY, USA, 2004.

[153] Shashidhar Thakur and D. F. Wong. On designing ULM-based FPGA logic modules. In Proceedings of the 1995 ACM third international symposium on Field-programmable gate arrays (FPGA'95), pages 3–9, Monterey, CA, 1995. ACM Press, New York, NY, USA.

[154] Vivek Tiwari, Sharad Malik, and Andrew Wolfe. Power Analysis of Embedded Software: A First Step Towards Software Power Minimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2(4):437–445, 1994.

[155] Jordi Tubella and Antonio González. Control speculation in multithreaded processors through dynamic loop detection. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, pages 14–23. IEEE Computer Society, 1998.

[156] Gary Tyson and Matthew Farrens. Code scheduling for multiple instruction stream architectures. International Journal of Parallel Programming (IJPP), 22(3):243–272, 1994.

[157] Johan van Praet, Gert Goossens, Dirk Lanneer, and Hugo De Man. Instruction set definition and instruction selection for ASIPs. In Proceedings of the Seventh International Symposium on High-Level Synthesis, pages 11–16, Niagara-on-the-Lake, Ontario, Canada, 1994. IEEE Computer Society Press.

[158] John von Neumann. First Draft of a Report on the EDVAC. Technical report, University of Pennsylvania, June 30, 1945.

[159] Zvonko G. Vranesic. The FPGA Challenge. In Proceedings of the 28th IEEE International Symposium on Multiple-Valued Logic, pages 121–126, 1998.

[160] Vojin Živojnović, Stefan Pees, and Heinrich Meyr. LISA - machine description language and generic machine model for HW/SW co-design. In Workshop on VLSI Signal Processing, pages 127–136, 1996.

[161] Mark Weiser. Ubiquitous Computing (http://www.ubiq.com/weiser/).

[162] A. Wieferink, M. Doerper, R. Leupers, G. Ascheid, H. Meyr, T. Kogel, G. Braun, and A. Nohl. System Level Processor/Communication Co-exploration Methodology for Multiprocessor System-on-Chip Platforms. IEE Proceedings - Computers and Digital Techniques, 152(1):3–11, 2005.

[163] P. A. Wilsey, M. T. Wright, S. Dasgupta, J. Heinanen, and J. Wang. An S*M Execution Environment. Technical Report TR87-3-1, The Center for Advanced Computer Studies, University of Southwestern Louisiana, 1987.

[164] Ralph D. Wittig and Paul Chow. OneChip: An FPGA Processor With Reconfigurable Logic. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 126–135, Napa Valley, CA, 1996.

[165] Bing-Fei Wu and Chung-Fu Lin. An efficient architecture for JPEG2000 coprocessor. IEEE Transactions on Consumer Electronics, 50(4):1183–1189, 2004.

[166] Takeo Yamada, Seiji Kataoka, and Kohtaro Watanabe. Heuristic and Exact Algorithms for the Disjunctively Constrained Knapsack Problem. Information Processing Society of Japan Journal, 43(9):2864–2870, 2002.

[167] Ning Zhang and Chawn-Hwa Wu. Study on Adaptive Job Assignment for Multiprocessor Implementation of MPEG2 Video Encoding. IEEE Transactions on Industrial Electronics, 44(5):726–734, 1997.

[168] Yinong Zhang and George B. Adams III. Exploiting Instruction Level Parallelism With The DS Architecture. In Proceedings of the 1996 International Conference on Parallel Processing, pages 230–237, 1996.