FACULTY OF ENGINEERING SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

Design Automation Methodologies for Extensible Processor Platform

Newton Cheung

A thesis presented to the faculty of the University of New South Wales in candidacy for the degree of Doctor of Philosophy

March 2005

© Copyright 2005 by Newton Cheung

All rights reserved. I hereby declare that this submission is my own work and to the best of my knowledge it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis.

I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.

————————————————

Newton Lim-Tung Cheung

Abstract

This thesis addresses two ubiquitous trends in the embedded system world - the increasing importance of design turnaround time as a design metric, and the move towards closing the design productivity gap. Adopting the right choice of design approach has been recognised as an integral part of the design flow in order to meet desired characteristics such as increasing software content, satisfying the growing complexities of an application, reusing off-the-shelf components, and exploring design metric tradeoffs, which closes the design productivity gap. The importance of design turnaround time is motivated by the intensive competition between manufacturers, especially makers of mainstream electronic consumer products, who shrink the product life cycle and require faster time-to-market to maximise economic benefits.

This thesis presents a suite of design automation methodologies to automatically design embedded systems for an application in the state-of-the-art design approach - the extensible processor platform. These design automation methodologies systematise the extensible processor platform’s design flow, with particular emphasis on solving four challenging design problems: i) code segment identification; ii) instruction generation; iii) architectural customisation selection; and iv) processor evaluation.

Our suite of design automation methodologies includes: i) a semi-automatic design system - to design an extensible processor that maximises the application performance while satisfying the area constraint. By specifying a fitting function to identify suitable code segments within an application, a two-level hierarchy selection algorithm is used to first select a predefined processor and then select the right instructions, and a performance estimator is used to estimate an application's performance; ii) a tool to match instructions - to automatically match pre-designed instructions with computationally intensive code segments, reducing verification time and effort; iii) an instruction estimation model - to estimate the area overhead, latency and power consumption of extensible instructions, exploring a larger design space; and iv) an instruction generation tool - to generate new extensible instructions that maximise speedup while minimising power dissipation.

A number of techniques, such as system decomposition, combinational equivalence checking and regression analysis, have been relied upon heavily in the creation of the final design system. This thesis shows results at every stage to demonstrate the efficacy of our design methodologies in the creation of extensible processors. The methodologies and results presented in this thesis demonstrate that automating the design process for an extensible processor platform results in a significant performance increase - on average, an increase of 4.74× (up to 15.71×) compared to the original base processor. Our system achieves significant design turnaround time savings (2.5% of the full simulation time for the entire design space) with the majority of Pareto points obtained (91% on average), and can lead to fewer and faster design iterations. Our instruction matching tool is 7.3× faster on average compared to the best known approaches to the problem (partial simulations). Our estimation model has a mean absolute error as small as 3.4% (6.7% max.) for area overhead, 5.9% (9.4% max.) for latency, and 4.2% (7.2% max.) for power consumption, compared to estimation through the time-consuming synthesis and simulation steps using commercial tools. Finally, the instruction generation tool reduces energy consumption by a further 5.8% on average (up to 17.7%) compared to extensible instructions generated by previous approaches.

Acknowledgments

This thesis could not have been completed without the help and encouragement of many people, directly and indirectly, not all of whom can be mentioned here. I express my greatest gratitude to all, for making this thesis possible.

First, I would like to thank my supervisor, A/Prof. Sri Parameswaran, for his insightful guidance and continuous support throughout the course of my Ph.D. degree.

His ingenious approach to research and passionate attitude to work are qualities that anyone would appreciate. I would also like to thank Prof. Jörg Henkel, from whose invaluable advice and thoughtful comments I was lucky to benefit during the past three years. His knowledge of design automation and system-level design contributed to the development of deeper and more inventive ideas, and his kind encouragement has significantly contributed to this thesis. My working experience at NEC Laboratory America, Inc. helped me gain several practical insights and skills. I am grateful to Prof. Jörg Henkel and A/Prof. Sri Parameswaran for giving me this opportunity to work on the project. Venkata Jakkula's help during the project is greatly appreciated.

I would also like to thank the computer engineering group faculty: A/Prof. Sri Parameswaran, Dr. Oliver Diessel and Dr. Annie Guo; administrative staff: Rochelle McDonald and Karen Corrigan; and the graduate students: Andhi Janapsatya, Jorgen Peddersen, Ashley Partis, Keith So, Usama Malik, George Ferizis, Ivan Lu, Jeremy Chan, Swarna Radhakrishnan, Seng Lin Shee, Lih Wen Koh and Shannon Koh, for providing an excellent environment for learning and research. The time I spent at the school during my Ph.D. degree was memorable to say the least. I am grateful to A/Prof. Jingling Xue, A/Prof. Hossam ElGindy, Dr. Aleksandar Ignjatović, Dr. Frank Engel, Dr. Oliver Diessel, Dr. Annie Guo and Dr. Manuel Chakravarty for various interesting and informative technical discussions. I would like to thank Andhi Janapsatya, a great friend and fellow graduate student, for providing instant feedback and thoughtful comments. Sincere thanks are due to Jorgen Peddersen for kindly proofreading papers and providing constructive improvements. I would also like to thank Keith So for providing algorithmic advice and interesting challenges.

The support and encouragement of my family and friends has been the cornerstone upon which I built my thesis. My parents, David and Kitty Cheung, and my sister, Jane Cheung, have given me their genuine love and caring support for the past twenty-five years; it is to them that I dedicate my achievements. My wonderful girlfriend, Amy Tso, has supported me in a special way that no one else could have. I would like to thank all my classmates at the University of New South Wales, Sydney, and the University of Queensland, Brisbane, for their support and companionship during the course of my university life. I would like to thank all my relatives and friends for their loving care and prayers. Last but not least, I would like to thank God for His merciful grace and endless love through His works in my life.

Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
List of Publications

1 Introduction
1.1 Embedded Systems Challenges
The Trends Towards Designing Embedded Systems
Design Approach and Automation
1.2 Extensible Processor Platform
1.2.1 Featuring a Base Processor
1.2.2 Designing Extensible Instructions
1.2.3 Including/Excluding Predefined Blocks
1.2.4 Setting Architectural Parameters
Design Automation in Extensible Processor Platform
1.3 Thesis Overview

2 Literature Review
2.1 Embedded Systems and their Early History
2.2 Design Approaches for Embedded Systems
2.2.1 Application Specific Integrated Circuits
2.2.2 General Purpose Processors
2.2.3 Digital Signal Processors
2.2.4 Field Programmable Gate Arrays
2.2.5 Application Specific Instruction-set Processors
2.3 Architecture of Application Specific Processors
2.3.1 Very Long Instruction Word Processors
2.3.2 Reconfigurable Processors
2.3.3 Extensible Processors
2.4 Problems in Designing Extensible Processors
2.4.1 Code Segment Identification
2.4.2 Extensible Instruction Generation
2.4.3 Architectural Customisation Selection
2.4.4 Processor Evaluation and Estimation

3 Methodology Overview
3.1 Existing Design Flow
3.2 Overview of Our Automation Methodologies
3.3 Modified Design Flow for Extensible Processors
3.4 Contributions

4 Semi-automatic Design System

4.1 Motivations
4.2 System Overview
4.2.1 Phase I: Pre-configured Processor Selection
4.2.2 Phase II: Instruction Identification Model
4.2.3 Phase III: Extensible Instruction Selection
4.2.4 Phase IV: Performance Estimation Model
4.2.5 Overall Design Flow Algorithm
4.3 Experimental Results
4.3.1 Experimental Setup
4.3.2 Evaluation Results
4.4 Conclusions and Future Work

5 Matching Instructions Tool
5.1 Background
5.2 Related Work
5.3 Overview of the MINCE Tool
5.3.1 The Translator
5.3.2 Filtering Algorithm in MINCE
5.3.3 Combinational Equivalence Checking Model
5.4 Experimental Results
5.4.1 Evaluation Results
5.5 Conclusions and Future Work

6 Instruction Estimation Models
6.1 Motivation
6.2 Background and Theory
6.3 Extensible Instructions Model

6.3.1 Overview
6.3.2 Customisation Parameters
6.3.3 Characterisation for Various Constraints
6.3.4 Estimating Characteristics of Extensible Instructions
6.4 Experimental Results
6.4.1 Experimental Setup
6.4.2 Evaluation Results
6.5 Conclusions

7 Instructions Generation
7.1 Motivation
7.2 Background
7.3 Problem Statements and Preliminaries
7.4 Instruction Generation
7.4.1 Instruction Generation Algorithm
7.4.2 Battery-Awareness Algorithm
7.5 Experimental Results
7.5.1 Experimental Setup
7.5.2 Evaluation Results
7.6 Conclusions

8 Conclusions

Bibliography

List of Figures

1.1 Potential design complexity and designer productivity
1.2 A simplified generic design flow of an extensible processor
1.3 Reasons for using an extensible processor platform
1.4 Benefits of a base processor: performance and code size
1.5 Benefits of extensible instructions: performance
2.1 An early history of embedded systems
2.2 Performance and flexibility comparison of different design approaches
2.3 A design flow of field programmable gate arrays
2.4 Different types of application specific instruction-set processors
2.5 A VLIW architecture
2.6 A generic VLIW instruction format
2.7 A reconfigurable processor architecture
2.8 An extensible processor architecture
2.9 A simplified generic design flow of an extensible processor
2.10 An example function for demonstrating the complexity of code segment identification
2.11 A computationally intensive code segment for demonstrating the complexity of instruction generation
2.12 An example demonstrating the complexity of instructions selection
2.13 Accuracy and time tradeoffs between different approaches
3.1 A generic existing design flow of the extensible processor platform
3.2 Our design methodologies for an extensible processor platform
4.1 Motivation example
4.2 Our semi-automatic system design flow for configuring an extensible processor (double square box: commercial tools; grey box: our contributions)
4.3 An example of a code segment to demonstrate how the fitting function works
4.4 Overall algorithm of the semi-automatic design system
4.5 The relationship between the fitting function and speedup/area ratio of the instruction
4.6 GSM decoder's design space and Pareto points
4.7 MPEG2 decoder's design space and Pareto points
5.1 A motivation example: to match pre-designed extensible instructions with code segments of the application
5.2 A generic design flow for designing an extensible processor and how the MINCE tool fits in the design flow
5.3 Verification time distribution
5.4 A code segment and an extensible instruction and their BDD representations
5.5 MINCE: an automated tool for matching extensible instructions
5.6 An example for translating to Verilog HDL in a form that allows matching through the combinational equivalence checking model
5.7 Algorithm filtering for eliminating the number of extensible instructions into the combinational equivalence checking model
5.8 Experimental and verification platform
5.9 Time reduction: the comparison between MINCE and Simulation
5.10 Results in terms of computation time for the instruction matching step: Simulation vs. MINCE (part 1)
5.11 Results in terms of computation time for the instruction matching step: Simulation vs. MINCE (part 2)
6.1 A motivation example: four varieties to design an instruction which replaces a code segment
6.2 An overview for characterising and estimating the models of the extensible instructions
6.3 Experimental methodology
6.4 A design space example of the extensible instructions (around 6ns)
6.5 The accuracy of the estimation models for multiple instructions (sets of instructions: set 1 contains a single instruction; set 2 contains a group of two instructions, etc.)
6.6 The accuracy of the estimation models in real-world applications
7.1 A motivation example: separates instruction to reduce energy consumption
7.2 An overview of the automatic instruction generation tool
7.3 Algorithm InstGen for generating extensible instruction that minimises the execution time
7.4 An example for separating instructions to reduce power dissipation
7.5 An example for utilising the slack of the instruction
7.6 Algorithm BattAware for optimising battery lifetime in the instruction
7.7 An experimental platform for verifying our automatic instruction generation tool
7.8 Trendlines of energy reduction for extensible instructions

List of Tables

2.1 Summary of different design approaches for embedded systems
4.1 Characteristics of pre-configured processors
4.2 A subset of extensible instructions library
4.3 The efficacy of the fitting function
4.4 The efficiency of the heuristic algorithm
4.5 Summary of the semi-automatic design system results
5.1 A subset of complex modules with limited implementations
5.2 Experimental results on hardware instructions on different kinds of software code segments
5.3 Experimental results on hardware instructions on different kinds of software code segments (part 2)
5.4 Number of instructions matched, matching time used and speedup gained by different systems
6.1 Customisation parameters of extensible instructions
6.2 The coefficients of the extensible instructions for the purpose of calibrating through regression
6.3 The mean absolute error of the estimation models in different types of extensible instructions
7.1 The characteristics of the generated instructions (set 1 and set 2) for the fifty code segments
7.2 The characteristics of the application when different instructions generated in the tool are applied

List of Publications

Book Chapter

1. N. Cheung, J. Henkel, and S. Parameswaran, "Rapid Configuration and Instruction Selection for an ASIP: A Case Study", in Embedded Software for SoC (A. A. Jerraya, S. Yoo, and N. Wehn, editors), pages 403–417, Kluwer Academic Publishers, 2003 [56]

Journal Article

1. N. Cheung, S. Parameswaran, and J. Henkel, "Rapid Semi-automatic Design System to Configure Extensible Processors", submitted to IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Conference Papers

1. N. Cheung, S. Parameswaran, and J. Henkel, "Battery-Aware Instruction Generation for Embedded Processors", in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), pages 553–556, IEEE Computer Society, January 2005 [60]

2. N. Cheung, S. Parameswaran, and J. Henkel, "A Quantitative Study and Estimation Models for Extensible Instructions in Embedded Processors", in Proceedings of the International Conference on Computer-Aided Design (ICCAD), pages 183–189, IEEE Computer Society, November 2004 [59]

3. N. Cheung, S. Parameswaran, J. Henkel, and J. Chan, "MINCE: Matching INstructions using Combinational Equivalence for Extensible Processor", in Proceedings of Design Automation and Test in Europe (DATE), pages 1020–1025, IEEE Computer Society, February 2004 [61]

4. N. Cheung, S. Parameswaran, and J. Henkel, "INSIDE: INstruction Selection / Identification & Design Exploration for Extensible Processor", in Proceedings of the International Conference on Computer-Aided Design (ICCAD), pages 291–297, IEEE Computer Society, November 2003 [58]

5. N. Cheung, J. Henkel, and S. Parameswaran, "Rapid Configuration and Instruction Selection for an ASIP: A Case Study", in Proceedings of Design Automation and Test in Europe (DATE), pages 802–807, IEEE Computer Society, March 2003 [57]

Chapter 1

Introduction

The growing design productivity gap and intense competition amongst manufacturers present many challenges for the embedded systems industry. The challenges related to the growing design productivity gap include all aspects of design, implementation and verification of integrated circuits. In addition, the intense competition amongst manufacturers, especially makers of mainstream consumer products, shortens the product life cycle significantly. To address these integrated circuit challenges and product life cycle concerns, it is necessary to develop design approaches and automation methodologies for designing embedded systems efficiently.

Some of the recent research in the area of embedded system design has revolved around extensible processors. Extensible processors represent the state-of-the-art in application specific instruction-set processors, consisting of a base processor and a base instruction set. The designer can optimise the base processor through three architectural customisations: instructions extension, inclusion/exclusion of predefined blocks, and parameterisations. Through these customisations, an extensible processor platform creates processors that are able to meet the increasing software content requirement, satisfy the growing complexities of an application, and reuse off-the-shelf components.

Research and development work in extensible processor platforms has shown that such processors have significant performance and power benefits when compared to other design approaches such as application specific integrated circuits, general purpose processors, digital signal processors, and field programmable gate arrays. However, designing an extensible processor requires a great deal of expertise and is often undertaken manually. Therefore, it is very important to develop design automation methodologies and tools for extensible processor platforms. Several studies have shown that large exploration and design times are reduced through automation methodologies and algorithms, which relieves the time-to-market pressure significantly. In addition, automation tools can provide early feedback about designs, allowing tradeoff decisions to be made rapidly, and can explore a large design space. The aim of this thesis is to develop methodologies and tools to design extensible processors automatically, closing the design productivity gap and relieving time-to-market pressures.

1.1 Embedded Systems Challenges

Embedded systems are ubiquitous and constitute the behind-the-scenes computing power in our everyday lives, ranging from office equipment, automotive control systems, security surveillance systems, medical equipment and home automation systems to mainstream consumer products such as mobile phones, personal digital assistants (PDAs), portable music players, digital and video cameras, etc. The number of embedded systems produced annually is in the order of millions, and this number is expected to increase significantly. In fact, embedded systems contributed significantly to the US$200 billion semiconductor industry in 2004 (Gartner, Inc.).

The semiconductor industry is experiencing increasing challenges, many of which are brought about by the growing design productivity gap. That is, while manufacturing capability, measured in terms of the ability to manufacture transistors per chip, grows at an average annual rate of around 58% as predicted by Moore's Law, the productivity of designers, measured in terms of the ability to design and implement correct and testable transistors per staff-month, continues to grow at an average annual rate of under 25%. This leads to a major gap in the ability to effectively utilise the improving silicon manufacturing process.

The semiconductor industry simply does not have sufficient manpower to follow the manufacturing footsteps of the advancing silicon technology. In other words, designers cannot produce sufficiently complex and efficient circuits that utilise and satisfy all the design requirements with the current silicon technology. Furthermore, this design productivity gap is constantly growing. Thus, this unbalanced phenomenon is one of the main challenges for circuit designers. Figure 1.1 shows the potential design complexity and designer productivity.

Figure 1.1: Potential design complexity and designer productivity (logic transistors per chip vs. transistors per staff-month, 1981-2009; source: Sematech)

The challenges related to this growing design productivity gap include all aspects of design, implementation and verification of integrated circuits. First, the need to satisfy tight design constraints and emerging new standards presents a constant challenge. For example, minimum feature size, low power consumption, as well as high performance efficiency are of paramount importance, yet are contradictory goals in today's embedded systems. Failure to address the above concerns may result in reduced portability and incorrect operation due to factors such as system weight degradation, battery life shortening, thermal-related failure, and degraded performance.

Second, the effort and cost to implement integrated circuits are mounting. This is due to phenomena such as the rapid increase in the complexity of integrated circuits and the increasing non-recurring engineering (NRE) costs of making mask sets and migrating to new silicon technology. The cost of making mask sets is reaching US$1 million in the current silicon technology, with an average of only 500 wafers per set. These high costs significantly decrease the number of designs deemed worthy of implementation and thus increase the risk to which the manufacturing company is exposed in the case of implementation failure. In addition, even if designs are deemed worthy of implementation, increased NRE cost applies extreme pressure to the verification of the circuit.

Third, the time and effort invested in verification are sky-rocketing due to factors such as the increasing complexity of integrated circuits as well as the mounting design risks for implementing integrated circuits. As the design productivity gap grows, these challenges become harder to resolve.

Furthermore, challenges for embedded systems are not only caused by the design productivity gap in the semiconductor industry. Rapid growth in the embedded system industry has intensified competition between major manufacturers, especially makers of mainstream consumer products, significantly shortening the life cycle of products using embedded systems. This implies that the time for design, implementation and verification of integrated circuits must be significantly shortened as well. All of these challenges markedly affect the way methodologies and tools are developed in the electronic design automation industry for different design approaches. The design productivity gap and product life cycle concerns pose challenges for embedded system designers, effectively forcing the semiconductor industry, software industry, and the electronic design automation industry to work together to overcome them.

The Trends Towards Designing Embedded Systems

Three significant trends have emerged in recent years towards designing embedded systems to overcome the aforementioned challenges in system design. These can be summarised as increasing software content, customising segments of an application to a specific hardware unit, and reusing off-the-shelf components:

• Increasing software content: As the design time shortens, the trend is to increase software content and components such as microprocessors in embedded systems. This trend is a tradeoff between the high performance and low power consumption that designers can achieve in hardware design, and the time and effort involved in designing a complex integrated circuit. The benefits include decreased design time, reduced NRE costs, and the ability to make late design changes.

• Customising segments of an application to a specific hardware unit: As full custom design takes a significant amount of time, and given that microprocessors cannot achieve all hardware specifications (such as energy consumption, feature area and performance), the trend is to customise smaller segments of integrated circuits - targeting the critical parts which often consume large amounts of energy, occupy a large area, and have a long execution time. The main purpose of customisation (customising small segments instead of a full custom design) is to decrease the complexity of integrated circuits, providing a significant reduction in design time and effort, and hence increasing the productivity of designers. These integrated circuits are often designed for a particular application or domain of applications, in which designers can optimise the integrated circuits by exploiting application characteristics.

• Reusing off-the-shelf components: The trend is to reuse as many off-the-shelf components as possible (even those provided by different manufacturers) in embedded systems. This is achieved through advances in communication methods and protocols between components, as well as the growing number of analysis methods for multiple-component systems (multiprocessors or networks-on-chip). This trend reduces the tedious process of verifying newly designed integrated circuits, thus decreasing design time and effort.

All of these trends significantly reduce the time-to-market pressure and enable different design tradeoffs between performance, power, area and delay. More importantly, these trends are changing the way embedded systems are designed, with design approaches and automation tools for exploring the design space and utilising novel design technology.

Design Approach and Automation

Design approaches have been proposed as tradeoffs between increasing software content, customising segments of an application, and reusing off-the-shelf components in order to utilise silicon technology in the semiconductor industry. These approaches can be classified into five categories: i) Application Specific Integrated Circuits (ASICs) - integrated circuits, a hardware-only solution, that serve a particular function or application; ii) General Purpose Processors (GPPs) - microprocessors that perform a particular function in software; iii) Field Programmable Gate Arrays (FPGAs) - pre-fabricated circuits that designers can electrically configure to meet design functions; iv) Digital Signal Processors (DSPs) - specialised microprocessors which contain hardware architecture specifically designed for digital signal processing; and v) Application Specific Instruction-set Processors (ASIPs) - processors with the capability to customise new instructions as specific hardware units for a particular application. Each of these design approaches provides different design characteristics to fulfill different requirements for embedded systems. These are described in detail in the literature review in Chapter 2. Unfortunately, aggressive time-to-market requirements indicate that developing a design approach alone is not sufficient to handle the challenges in designing modern embedded systems. Hence, the electronic design automation industry has developed design automation methodologies and tools to incorporate these design approaches in order to reduce the design time for embedded systems. Automation tools often reduce tedious design effort and minimise the level of expertise necessary for designing embedded systems. Typical benefits of using automation tools include reducing the design time to relieve the mounting time-to-market pressure and exploring the design space to optimise area, power and delay. As mentioned previously, the challenges of designing embedded systems must be met by a combination of effort from the semiconductor and electronic design automation industries. Therefore, design approaches and automation tools must be carefully developed and researched in order to bridge the growing design productivity gap and handle time-to-market pressure for designing embedded systems efficiently.

1.2 Extensible Processor Platform

Some of the recent research and development efforts in designing embedded systems have revolved around the extensible processor platform. Extensible processors represent the state-of-the-art in application specific instruction-set processors, consisting of a base processor core that contains a base instruction set. These processors can be customised at three architectural levels: i) Instructions extension: the designer can extend this instruction set through new extensible instructions. Extensible instructions are customised to replace computationally intensive code segments (groups of primitive instructions) in the application; ii) Inclusion/exclusion of predefined blocks: the designer can choose to include or exclude predefined blocks as part of the base processor. Predefined block examples include the floating-point unit, digital signal processing unit, special function registers, and multiply-and-accumulate operations block; iii) Parameterisations: the designer can set extensible processor parameters such as instruction and data cache sizes.

Figure 1.2 shows a simplified generic design flow of an extensible processor. The design goal of the extensible processor is typically to maximise performance of an embedded application while meeting area and power constraints. The designer often begins by profiling the given application using an Instruction-Set Simulator (ISS) of the target processor. The profiling reveals computationally intensive code segments for which customisation might improve performance, area and energy characteristics. The designer then defines these architectural customisations in the extensible processor. To evaluate these customisations, the designer usually employs simulation tools (the ISS) to determine whether the application meets design constraints.

The designer then reiterates these steps to explore the design space until constraints are met. Once design constraints are met, the platform uses the base processor configuration and customisations to generate the extensible processor's synthesisable RTL. This synthesisable RTL can be taped out or synthesised for prototyping. However, designing extensible processors requires a great deal of expertise and is therefore often conducted manually.

Figure 1.2: A simplified generic design flow of an extensible processor (compile and profile the application; identify computationally intensive code segments; generate extensible instructions; select extensible instructions, predefined blocks and parameters; evaluate against performance, area and power constraints; generate the processor's synthesisable RTL for prototyping or tape-out)
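To make the shape of this iterative flow concrete, the sketch below recasts it in C. It is an illustration only: the helper routines (profile_application, hot_code_segments, evaluate_with_iss, generate_rtl) are hypothetical stand-ins for the profiler, instruction-set simulator and RTL generator of a real platform, not any vendor's API.

#include <stddef.h>

/* Hypothetical types: the application under design, one candidate
 * customisation (extensible instructions, predefined blocks and
 * parameter settings), and the metrics the ISS reports. */
typedef struct { int id; } Application;
typedef struct { int id; } Customisation;
typedef struct { double cycles, area, power; } Metrics;

/* Stand-ins for the commercial tools named in Figure 1.2. */
void    profile_application(const Application *app);
size_t  hot_code_segments(const Application *app, Customisation *out, size_t max);
Metrics evaluate_with_iss(const Application *app, const Customisation *c);
void    generate_rtl(const Customisation *c);

/* One plausible shape of the manual loop: profile once, then try
 * candidate customisations until one meets the constraints. */
void design_extensible_processor(const Application *app,
                                 double max_area, double max_power)
{
    Customisation cand[64];

    profile_application(app);                 /* reveals hot code segments */
    size_t n = hot_code_segments(app, cand, 64);

    for (size_t i = 0; i < n; i++) {
        Metrics m = evaluate_with_iss(app, &cand[i]);   /* slow ISS run */
        if (m.area <= max_area && m.power <= max_power) {
            generate_rtl(&cand[i]);           /* synthesisable RTL out */
            return;                           /* constraints met: stop */
        }
    }
    /* Otherwise revisit the identified code segments, redefine the
     * customisations and iterate again, exactly as the text describes. */
}

Each pass around this loop costs a full ISS run, which is why the later chapters replace the evaluation step with fast estimation models.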

By featuring a base processor, extending specific instructions, including/excluding predefined blocks, and setting architectural parameters in the extensible processor, the designer is provided with many design options to execute any software application, define specific hardware units, and reuse off-the-shelf components to optimise an embedded system. The benefits of using an extensible processor platform are summarised in Figure 1.3 and described in the next sections.

Figure 1.3: Reasons for using an extensible processor platform (the base processor and base instruction set, extensible instructions, predefined blocks, and architectural parameters, with the benefits of each)

1.2.1 Featuring a Base Processor

A base processor with a base instruction set in the extensible processor platform enables the designer to run any application. The base instruction set consists of a small, yet sufficient, number of instructions to execute any program. Each instruction is compact in bit width. For example, a 32-bit base processor often has a 16-bit and 24-bit base instruction set. A compact instruction set reduces the application's code size, minimises the instruction memory/cache required to execute the application, and reduces energy consumption and chip area. Furthermore, an efficient instruction set reduces code size significantly, enabling the application to execute in a shorter amount of time, leading to an increase in application performance. Figure 1.4 shows the benefits of a base processor in terms of performance and code size. The figure indicates that the base processors of commercial vendors, such as the Xtensa V from Tensilica Inc. [23] and the ARC Tangent A4 from ARC Inc. [1], achieve higher performance as well as smaller code sizes than several well-known general-purpose processors, such as the ARM 1020E from ARM Inc. [9], the Motorola PPC7455 from Motorola Inc. [16], and the MIPS64 from MIPS Inc. [14].

Figure 1.4: Benefits of a base processor: performance and code size (EEMBC Consumer benchmark rating vs. code size; source: EEMBC, updated February 2003)

Using the base processor, a designer can execute any software application written in a high-level language (e.g., C/C++, Java, etc.). The design process of a software application is often automated by the use of a compiler, linker, debugger and simulator. Therefore, the base processor and base instruction set reduce time-to-market pressure. As the product life cycle shrinks and the competition between major manufacturers intensifies, the need to launch a product to market in the shortest amount of time is vital. As a result, manufacturers now release at least two major product lines each year, compared with only one just a few years ago. For example, digital camera manufacturers now produce several new models of digital SLR cameras and compact digital cameras each year. For designers, the time-to-market pressure is mounting by the minute. The base processor and base instruction set allow the majority of the functionality to be designed in software. Thus, design time is significantly reduced in comparison to placing the same functionality in hardware. Increasing software content also increases flexibility, enabling any late changes in design specifications to be handled easily and seamlessly.

Figure 1.5: Benefits of extensible instructions: performance (speedups of a base processor with extensible instructions over ARM10/MIPS64- and ARM9/MIPS32-class processors for ten applications; source: EEMBC and Tensilica, Inc., updated February 2003)

1.2.2 Designing Extensible Instructions

The designer is permitted to extend the base instruction set through new extensible instructions. An extensible instruction is a specific hardware unit that executes a dedicated function in the execution stage of the base processor, replacing computationally intensive code segments in the software application. The main goal of instructions extension is to increase performance and satisfy energy consumption constraints (which the base processor cannot achieve alone). In addition, extensible instructions coexist with the base instruction set. The application is compiled into assembly code composed of base instructions and extensible instructions. The major benefit of designing extensible instructions is the ability to increase the product performance of embedded systems in a small amount of time. As product functionality, and thus the underlying complexity of embedded systems, rapidly increases, the need for high product functionality also rises. Nowadays, embedded systems feature far greater functionality than their core functionality. For example, digital cameras are not limited to taking digital photos and providing automatic focus; they also include video recording, a high-precision image stabiliser, direct print compatibility, a histogram indicator, etc. New features are continuously developed and integrated into next generation products. Thus, instructions extension allows the rapid creation of customised processors to meet tight design constraints. Figure 1.5 shows the benefits of designing extensible instructions. The figure shows the performance comparison of an extensible processor with ARM10/MIPS64- and ARM9/MIPS32-class equivalent processors. In the figure, "Base" symbolises the base processor, which typically has an area ranging from 30k to 80k gates, depending upon the silicon technology and parameter settings. The number of gates refers to the area of the extensible instructions (i.e., their complexity). As the figure shows, the performance is far better than that of off-the-shelf processors when extensible instructions are customised to different applications.
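As a concrete illustration of what instruction extension means at the source level, consider the sketch below. It is not an example from the thesis: the code segment is generic C, and DOT8 is a hypothetical intrinsic standing in for whatever name the platform's compiler would expose once such an instruction is defined.

#include <stdint.h>

/* A computationally intensive code segment of the kind profiling would
 * flag: a multiply-accumulate loop built from primitive base-processor
 * instructions (loads, multiplies, adds, branches). */
int32_t dot8(const int16_t *a, const int16_t *b)
{
    int32_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc += (int32_t)a[i] * b[i];    /* one multiply + add per element */
    return acc;
}

/* With an extensible instruction, the whole loop collapses into a single
 * custom opcode executed in the processor's execution stage. DOT8 is a
 * hypothetical compiler intrinsic, not a real vendor API. */
extern int32_t DOT8(const int16_t *a, const int16_t *b);

int32_t dot8_ext(const int16_t *a, const int16_t *b)
{
    return DOT8(a, b);   /* eight multiply-adds in one instruction */
}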

1.2.3 Including/Excluding Predefined Blocks

The designer can choose to include or exclude predefined blocks as part of the base processor. Predefined blocks are pre-designed, commonly used hardware units that perform specific functions for a domain of applications, such as a digital signal processing unit or a floating-point unit. Each predefined block has an isomorphic instruction set which coexists with the base instruction set. Once the predefined blocks are included, the compiler automatically uses the predefined blocks' instructions in the assembly code (if necessary). Communication techniques and methods between the base processor and predefined blocks are also specially designed. Floating-point units, digital signal processing units, and multiply-and-accumulate operation blocks are examples of predefined blocks for the multimedia application domain that boost performance significantly.

The need to include predefined blocks can be questioned, given that predefined blocks and extensible instructions both increase performance. The advantage of using predefined blocks is that their area is smaller than that of a group of instructions, and that verification effort is reduced. As product functionality increases, the complexity and the number of extensible instructions needed also increase. Thus, the verification time and effort for extensible instructions also mount. Therefore, the primary benefit of predefined blocks is not limited to increased performance and a reduction in verification time and effort. In fact, including and excluding predefined blocks can be considered coarse-grained customisation for an application in an extensible processor, whereas designing extensible instructions is fine-grained customisation. Enabling reusability of predefined blocks minimises verification time, thus reducing the design turnaround time for an application in the extensible processor.

1.2.4 Setting Architectural Parameters

The designer is able to set extensible processor parameters in order to further optimise the extensible processor for a particular application. Parameter examples include: configuring the memory management unit; selecting the set associativity and size of local data and instruction caches; configuring RAM and ROM areas for data and instruction storage; setting interface options, the processor interface width, and the number of interrupts and their priorities, etc. These parameters affect the design characteristics of the extensible processor, including area, power and performance. For example, if the code size of an application is 20 kbytes, then the instruction cache does not need to be larger than 20 kbytes. Configuring extra instruction cache increases the area overhead of the processor, consumes unnecessary static and dynamic power, causes extra delay on instruction cache accesses, and in turn decreases the performance of the application.

Furthermore, these settings also affect the product characteristics. For example, a compact feature size leads to a stylish design and a lightweight product; low power dissipation extends battery life and ensures portability in products (e.g., notebooks, mobile phones, personal digital assistants (PDAs), etc.); and high performance shortens execution time and reduces energy consumption. Therefore, the need to customise every aspect of the product is increasing, and the customisations discussed above are vital in optimising the application in the embedded system.
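As a sketch of how the cache-sizing rule above might be automated, assuming a hypothetical ProcessorParams structure (real platforms expose similar knobs through their own configuration tools, not this one):

#include <stdint.h>

/* Hypothetical parameter block for one processor configuration. */
typedef struct {
    uint32_t icache_bytes;
    uint32_t interface_width_bits;
    uint32_t num_interrupts;
} ProcessorParams;

/* Caches come in power-of-two sizes: largest power of two <= v. */
static uint32_t pow2_floor(uint32_t v)
{
    uint32_t p = 1;
    while ((p << 1) != 0 && (p << 1) <= v)
        p <<= 1;
    return p;
}

/* The rule from the text: an instruction cache larger than the
 * application's code only adds area, static/dynamic power and access
 * delay, so never configure more cache than code. Under this rule a
 * 20-kbyte application gets at most a 16-kbyte (power-of-two) cache. */
void size_icache(ProcessorParams *p, uint32_t code_size_bytes,
                 uint32_t platform_max_bytes)
{
    uint32_t size = pow2_floor(code_size_bytes);
    if (size > platform_max_bytes)
        size = platform_max_bytes;
    p->icache_bytes = size;
}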

Design Automation in Extensible Processor Platform

As shown previously, the extensible processor platform enables the design of embedded systems in a short amount of time with significant performance, low energy consumption and compact area. However, the existing design flow shown in Figure 1.2 requires a great deal of expertise and is often conducted manually. Four major problems are associated with the existing design flow: code segment identification, extensible instruction generation, architectural customisation selection, and processor evaluation (these are described in detail in the Chapter 2 literature review). Several research and development works on extensible processor platforms have shown that large exploration times are reduced, and that near-optimal designs are obtained, through automation methodologies and tools [33,58,67,88,193]. Therefore, the development of design automation methodologies is very important for the extensible processor platform.

1.3 Thesis Overview

This thesis presents a suite of methodologies and tools to automatically design extensible processors for embedded systems. The methodologies include generating extensible instructions, selecting predefined blocks and parameters, matching pre-designed instructions, and design space exploration to perform tradeoffs. While design automation methodologies for various parts of extensible processors have been researched in the past, the work presented in this thesis represents some of the first known design automation methodologies for the extensible processor platform.

Chapter 2 of this thesis provides the necessary literature review on various design approaches for embedded systems, ranging from application specific integrated circuits and general-purpose processors to field programmable gate arrays, digital signal processors and application specific instruction-set processors. This chapter also describes the problems encountered in designing extensible processors and some of the proposed solutions.

Chapter 3 presents an overview of the design methodologies and tools that are described in the subsequent chapters. Our methodologies include a semi-automated design system, an instruction matching tool, estimation models, and an instruction generation tool. This chapter describes how our design automation methodologies fit with the existing manual design flow, and how they work as a single system to design an extensible processor for an application. Chapter 3 also outlines the contributions of this thesis.

Chapter 4 describes a semi-automated system for configuring an extensible processor, which maximises the performance of an application, satisfies the area constraint, and significantly reduces the design turnaround time. These results are achieved through the identification of efficient code segments to implement as extensible instructions and a two-level hierarchical selection approach: first, the design space is limited through selection of a pre-configured processor (including predefined blocks); and second, a set of pre-designed extensible instructions is selected from a library for that extensible processor. In addition, execution time estimation for an application program running on an already configured extensible processor is performed. Using our semi-automated system we have demonstrated how ten different real-world benchmarks can be designed within the extensible processor platform. The design space exploration time of the system is, on average, 2.5% of the design space exploration time using full simulation for a given set of benchmarks. The fitting function for identifying the correct code segments relates to the speedup/area ratio of the instruction. In addition, our heuristic algorithm was able to locate, on average, 91% of all Pareto points from the entire design space in all benchmarks. The execution time estimation for the proposed extensible processor is, on average, within 5.68% of results obtained with an ISS, and is typically generated in less than a second. Finally, the application program execution time is reduced by up to 15.71× (4.74× on average), with an average area overhead of 65% on the benchmarks. Although matching code segments with instructions in the library and generating instructions are both performed manually in the system (effectively making it “semi-automatic”), the system is still useful in many real-world applications. Automation tools for these two manual steps are proposed in the following two chapters.
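The text states only that the fitting function relates to the speedup/area ratio of the instruction; the exact definition is given in Chapter 4. One plausible form consistent with that description (an illustration, not the thesis's definition) is:

\[
  F(s) \;=\; \frac{\text{speedup}(s)}{\text{area}(s)}
        \;=\; \frac{T_{\mathrm{sw}}(s) \,/\, T_{\mathrm{hw}}(s)}{A(s)}
\]

where, for a candidate code segment $s$, $T_{\mathrm{sw}}$ is its software execution time on the base processor, $T_{\mathrm{hw}}$ the latency of the extensible instruction replacing it, and $A$ the area overhead of that instruction; segments with the highest $F(s)$ would be implemented first.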

Chapter 5 addresses the matching of code segments with instructions in the library using an automation tool, namely MINCE (Matching INstructions using Combinational Equivalence). Designing extensible instructions for extensible processors is a computationally complex task because of the exposure to a large design space. The task of automatically matching candidate instructions in an application (e.g., written in a high-level language) to a pre-designed library of extensible instructions is especially challenging. Previous approaches have focused on identifying extensible instructions (e.g., through profiling), synthesising extensible instructions, estimating expected performance gains, etc. In this chapter, we introduce our approach of automatically matching extensible instructions - a key, missing step in automating the entire design flow of an extensible processor platform with extensible instruction capabilities. Since matching using simulation is practically infeasible (due to simulation time), and traditional pattern-matching approaches would not yield reliable results (functionally equivalent code can be represented in many different ways), combinational equivalence checking is the preferred option. The MINCE tool consists of a translator, a filtering algorithm and a combinational equivalence checking tool. Matching times of extensible instructions using MINCE are 7.3× faster on average (using Mediabench applications) compared to the best known approach to the problem (partial simulations). In all experiments, MINCE matched correctly, and the outcome of the matching step yielded an average application speedup of 2.47×. The work represents a key step towards automating the whole design flow of an extensible processor with extensible instruction capabilities.
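To see why syntactic pattern matching is unreliable, consider two code segments with identical behaviour but different shapes. The example below is illustrative, not taken from the thesis: both functions compute the overflow-safe average floor((x+y)/2).

#include <stdint.h>

/* A syntactic matcher sees different operators and operand trees in
 * these two functions; a combinational equivalence checker, comparing
 * them as Boolean circuits over all inputs, proves them identical. */
uint32_t avg_a(uint32_t x, uint32_t y)
{
    return (x >> 1) + (y >> 1) + (x & y & 1);   /* halves plus carry-in */
}

uint32_t avg_b(uint32_t x, uint32_t y)
{
    return (x & y) + ((x ^ y) >> 1);            /* same function, new form */
}

In a flow like MINCE's, both segments would be translated into HDL and the equivalence checker would report a match against a library instruction implementing this average, where a tree-comparison matcher would report a mismatch.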

Chapter 6 presents estimation models for extensible instructions that facilitate efficient exploration of the design space. In this chapter, three estimation models for extensible instructions are provided: area overhead, latency, and power consumption under a wide range of customisation parameters. System decomposition and regression analysis are used as the underlying methods to characterise and analyse extensible instructions. These estimation models are verified using automatically and manually generated extensible instructions, plus extensible instructions used in large real-world applications. The mean absolute error of our estimation models is as small as 3.4% (6.7% max.) for area overhead, 5.9% (9.4% max.) for latency, and 4.2% (7.2% max.) for power consumption, compared to estimation through the time-consuming synthesis and simulation steps using commercial tools. These estimation models achieve an average speedup of three orders of magnitude over the commercial tools and thus enable the designer to conduct a fast and extensive design space exploration that would otherwise not be possible. These estimation models are integrated into our extensible processor tool suite.
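The text names regression analysis as the underlying method; the concrete models are defined in Chapter 6, so the following is only the generic shape such a model takes. Each characteristic (here area overhead $\widehat{A}$; latency and power have analogous forms) is predicted as a weighted sum of customisation parameters $x_1,\dots,x_n$, with coefficients $\beta_i$ calibrated from synthesised sample instructions (cf. Table 6.2), and accuracy reported as mean absolute percentage error over $m$ test instructions:

\[
  \widehat{A} \;=\; \beta_0 + \sum_{i=1}^{n} \beta_i x_i,
  \qquad
  \text{error} \;=\; \frac{1}{m}\sum_{j=1}^{m}
      \frac{\lvert \widehat{A}_j - A_j \rvert}{A_j} \times 100\%
\]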

Chapter 7 describes a tool for automatic instruction generation, which is an efficient method to satisfy growing performance demands and meet design constraints. A typical approach to instruction generation is to combine a large group of primitive instructions into a single extensible instruction for maximising speedup. However, this approach often leads to large power dissipation and discharge current, posing a challenge for battery-powered products. This chapter details a proposed battery-aware automatic tool to design extensible instructions, which minimises power dissipation distribution by separating an instruction into multiple instructions. The automatic tool is verified using 50 different code segments and five large real-world applications. The tool reduces energy consumption by a further 5.8% on average (up to 17.7%) compared to extensible instructions generated by previous approaches. For real-world applications, energy consumption is reduced by 6.6% on average (up to 16.53%) and performance is increased in most cases. The automatic instruction generation tool is integrated into our extensible processor tool suite.

Chapter 8 concludes with a summary of this thesis, detailing the main achievements and outlining directions for future work.

Chapter 2

Literature Review

This chapter provides the necessary background and a literature review of design automation methodologies for an extensible processor platform. The chapter begins with an overview of embedded systems and their early history, describing various design approaches for embedded systems. The focus is then shifted to application specific instruction-set processors and associated architectures (such as very long instruction word processors, reconfigurable processors, and extensible processors). Finally, the problems related to the design automation of an extensible processor platform and proposed solutions are presented.

2.1 Embedded Systems and their Early History

An embedded system is a specialised computer system designed to perform sophisticated functions for dedicated applications. Embedded systems differentiate themselves from general-purpose computers by having the following characteristics: i) application-specific algorithms - the operation performed by embedded systems is usually very specific. Due to the sophisticated nature of functions in embedded systems, designers often take advantage of the application characteristics to optimise the design. For example, the embedded system that controls a high-definition television set-top box must perform complicated digital signal processing algorithms to optimise the image quality of the high-definition television; ii) real-time systems - embedded systems often must work in real time. That is, if data has not arrived by a certain deadline, then operational failure occurs. For example, if the high-definition television set-top box does not provide sufficient data at a frequency of 100Hz, then image degradation occurs; iii) low manufacturing costs - the cost of manufacturing must be low in many cases. The manufacturing costs are determined by many factors, such as the type of processor and the amount of memory used; iv) low power consumption - embedded systems are often low power designs. Power affects battery life, portability, as well as thermal packaging; v) part of an electronic product - an electronic product can contain multiple or even hundreds of embedded systems. For example, a high-end automobile has hundreds of embedded systems in various parts, including brake control, fuel injection, stability control, climate control, and automated seats and windows. Each part has its specific function to perform and control. Therefore, these five characteristics are closely associated with the way embedded systems are designed, ranging from hardware-only implementations to software-based solutions and hardware/software co-designs.

Electronic embedded systems are said to have begun in 1951 with the MIT Whirlwind computer, the first digital video terminal capable of displaying real-time text and graphics. The Whirlwind was intended for aircraft stability and control flight simulations, and was ultimately adopted by the United States Air Force for use in the Semi-Automatic Ground Environment (SAGE) air defence system, which became operational in 1958 for collecting, tracking and intercepting enemy bomber aircraft.

In 1964, Seiko introduced the first printing timer, built for the 1964 Tokyo Olympic Games, which was the catalyst for the development of the EP-101 digital printer. In 1968, the Apollo Guidance Computer ran the inertial guidance systems in Apollo 7 as it made its debut orbiting the earth. The Apollo Guidance Computer was the first to use integrated circuits, with approximately 9,700 NOR logic gates. In the same year, the Volkswagen 1600 used a microprocessor in its fuel injection system, making it the first microprocessor-based embedded system used in the automotive industry. In 1972, the world's first scientific pocket calculator, the "HP-35", was introduced, with transcendental functions and reverse polish notation; it used a multi-chip CPU, consisting of a control and timing (C&T) chip, an arithmetic and register (A&R) chip, etc. Also in 1972, Intel and Texas Instruments introduced 4-bit microprocessors, namely the Intel 4004 and the Texas Instruments TMC1795, which marked a major advance in general purpose processors in embedded systems. In 1979, mobile phones were tested in Japan and Chicago. Later in the year, Sony launched the first portable headphone stereo cassette Walkman, the TPS-L2. In the same year, NEC started production of the µPD7710, the world's first complete digital signal processor. Four years later in 1983, the digital signal processor produced by Texas Instruments, the TMS32010, proved to be a great success. In 1984, Psion launched the first Personal Digital Assistant (PDA), the Psion Organiser 1, with a serial port for attachment to a modem. In the mid-'80s, Xilinx introduced field programmable gate arrays (FPGAs), which changed the embedded systems industry significantly. In 1989, Sony introduced the first digital camera, the SONY ProMavica MVC-5000, which recorded images as magnetic impulses on a compact 2-inch still-video floppy disk. The early '90s saw the introduction of application specific instruction-set processors, which combined hardware specification with a software-based processor. From the mid-'90s, several variations were developed in the field of ASIPs. In the late '90s, JVC designed an ASIP for their digital video cameras. LG Electronics integrated ASIPs into their mobile phones and portable devices such as PDAs and digital televisions. Consumer electronics companies such as Fujitsu, Olympus, and Epson used ASIPs in their imaging products such as digital cameras and printers. In the new century, NEC launched a TCP/IP offload engine containing ten ASIPs and a W-CDMA infrastructure chip consisting of two ASIPs.

2.2 Design Approaches for Embedded Systems

As discussed in the previous section, different design approaches to embedded systems have been introduced at different times: Application Specific Integrated Circuits in the '50s, general purpose processors in the '70s, digital signal processors in the '80s, field programmable gate arrays in the mid-'80s, and application specific instruction-set processors in the '90s. Figure 2.1 shows the timeline of the early history of embedded systems and the introduction of different design approaches. Throughout the years, design approaches have evolved to adapt to the rapid changes in the characteristics of embedded systems. Two of the major differences between these five design approaches are their performance and flexibility. Figure 2.2 summarises the performance and flexibility of these different design approaches. Application Specific Integrated Circuits have very high performance with very little flexibility, as this design approach is a hardware-only implementation. On the other hand, general purpose processors are a software-based solution, and thus have very high flexibility with low performance. In this section, each design approach is described in terms of its characteristics, advantages/disadvantages, and market trends.

2.2.1 Application Specific Integrated Circuits

Application Specific Integrated Circuits (ASICs) are integrated circuits hard-wired and hard-coded to run a specific application. ASICs are a hardware-only implementation and do not contain any software components, with flexibility being at most run-time configurable parameters. These circuits require a great deal of design expertise and are optimised manually by integrated circuit designers. These processes include logic mapping, delay analysis, function verification, and performance optimisation. Therefore, this kind of embedded system typically has high application performance and low hardware cost. However, long design turnaround time and high initial design and fabrication costs are the drawbacks of ASICs. Another disadvantage of this approach is its inflexibility.

Figure 2.1: An early history of embedded systems (a timeline from 1950 to 2005 marking the introduction of each design approach and landmark products, from the 1951 MIT Whirlwind computer and the 1958 SAGE air defence system through to the 2003 TCP/IP offload engine)

Figure 2.2: Performance and flexibility comparison of different design approaches (performance decreases and flexibility increases in the order: Application Specific Integrated Circuits, digital signal processors, application specific instruction-set processors, field programmable gate arrays, general purpose processors)

Before the 1990s, the design process for the majority of ASICs was based on a capture-and-simulate design methodology, which simulates the transistor-level design extensively. In the '90s, logic synthesis was recognised as an integral part of the design process, and so began an evolution of the capture-and-simulate methodology into a describe-and-synthesise methodology. This paradigm shift allows a design to be described at a higher abstraction level, with the final implementation being achieved through automatic synthesis rather than manual refinement, relieving designers from the tedious tasks of logic mapping and optimisation. Logic synthesis began to grow during the '60s and '70s, as computers became more complex. Although many theoretical advances were made, it was not until the mid-'80s that the first company, Synopsys, offered logic synthesis technology [74,90]. Synopsys successfully brought logic synthesis to the commercial market, closing the design productivity gap significantly. Logic synthesis includes combinational and sequential synthesis [76,167]; two-level and multi-level optimisation methods [44,45,188]; redundancy and related ATPG methods [54]; technology mapping for area, power and delay [152]; delay analysis [53,77,155]; performance optimisation [75,125]; don't cares and other flexibilities [180-182]; and multi-valued synthesis [43,134]. One of the key improvements in logic synthesis technology was the development of the ROBDD [49] and other improved BDD packages [41,190]. In recent years, another significant improvement has been the development of more efficient SAT solvers [153,156]. Both BDD and SAT solvers have had a major impact on the way that logic expressions are represented and analysed, which is the very basis for efficient logic synthesis and verification methods in integrated circuit designs.
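To give a flavour of the equivalence checking that underpins these verification methods, the following C sketch compares two implementations of the same boolean function in the style of a miter circuit. It is a minimal illustration only: the function names are invented for this example, and the exhaustive enumeration it uses is exactly what BDD- and SAT-based checkers are designed to avoid for functions with many inputs.

#include <stdio.h>
#include <stdint.h>

/* Two candidate implementations of the same 4-input boolean function.
 * spec() is the reference; impl() is a NAND-NAND restructuring of it. */
static int spec(int a, int b, int c, int d) {
    return (a & b) | (c & d);
}
static int impl(int a, int b, int c, int d) {
    return ~(~(a & b) & ~(c & d)) & 1;   /* same function, different structure */
}

/* Miter-style check: the two are equivalent iff the XOR of their outputs
 * is 0 for every input assignment. With n inputs this loop is O(2^n);
 * BDD- and SAT-based checkers avoid this explicit enumeration. */
int main(void) {
    for (uint32_t v = 0; v < 16; v++) {
        int a = v & 1, b = (v >> 1) & 1, c = (v >> 2) & 1, d = (v >> 3) & 1;
        if ((spec(a, b, c, d) ^ impl(a, b, c, d)) & 1) {
            printf("counterexample: a=%d b=%d c=%d d=%d\n", a, b, c, d);
            return 1;
        }
    }
    printf("equivalent on all 16 input assignments\n");
    return 0;
}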

The number of new ASICs being designed has been decreasing since the mid-’90s due to increasing product complexity, high manufacturing costs and high cost of manpower.

Although design numbers are down, the market consumption of ASICs is forecast to grow from $6.8 billion in 2003 to $10.7 billion in 2008, translating to a forecast compound annual growth rate of 9.7% between 2003 and 2008 (InStat/MDR Inc.). Also contributing to this growth rate is the fact that manufacturers only produce ASICs when the expected consumption is large enough to overcome the massive costs. For example, Intel still uses ASICs for their ARM processors. However, in the embedded systems world, fewer designs are fully customised using ASICs, as a result of shrinking design time.

2.2.2 General Purpose Processors

The second type of design approach utilises General Purpose Processors (GPPs), which helps to decrease design time. A GPP is a microprocessor that executes specific functions in software. As this design approach is a software-based solution, a GPP is able to run a wide range of applications. An application is first compiled into machine code (a sequence of instructions), and the machine code is then stored in the instruction memory to be fed to the microprocessor, executing the application. GPPs have two types of architecture: complex instruction set computer (CISC) and reduced instruction set computer (RISC). Each architecture is associated with a different type of instruction set. This design approach has a short design turnaround time and is highly flexible. The specification of an application can still be changed at a later stage of the design cycle without adversely affecting the manufacturing process. In comparison to ASICs, this approach requires very low design effort. However, this approach has a number of drawbacks: high hardware area costs, high power consumption and low performance.

The design process for the majority of GPPs is automatically assisted by the use of compilers and linkers. However, the optimisation techniques of compilers and linkers were not significantly advanced in the past, and designers often hand-optimised the application's assembly code in order to satisfy design constraints. Compiler algorithms are improving in areas such as loop optimisations, data-flow optimisations, and back-end compiler optimisations. Loop optimisations are used to improve the efficiency of the executable output for repetitive (loop) code in terms of running time or resource usage. There are many optimisation techniques designed to operate on a loop, including inner and outer loop interchange [30,94,110,202]; loop unrolling [73,178,208] (illustrated below); loop tiling [24,162,200]; loop fission [121]; and loop fusion [151,159,185]. Data-flow optimisations are used to improve the propagation of data, and are often conducted by data-flow analysis [93,140]. Back-end compiler optimisations are performed at the machine language level in order to increase application performance and to optimise code size. Examples of back-end compiler optimisations are register allocation [130,131,205], instruction selection [100,144], and instruction scheduling [29,160,187].
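As a concrete illustration of one of these transformations, the following C fragment sketches a four-way loop unrolling of a simple accumulation loop. The function names are invented for this example; in practice the transformation is applied automatically by the compiler.

/* Original loop: one accumulation per iteration. */
int sum_rolled(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four-way unrolled version: the loop body is replicated so that four
 * elements are consumed per iteration, reducing branch overhead and
 * exposing independent operations that can be scheduled in parallel.
 * The epilogue handles the n % 4 leftover elements. */
int sum_unrolled(const int *a, int n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* epilogue */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}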

GPPs are still widely used in embedded systems, ranging from high-end 32-bit microprocessors to low-end 8-bit microprocessors. The 32-bit embedded architecture dominates the customer-specific and cell-based worldwide product market, such as mobile phones, DVD players and portable music players. The worldwide revenue from GPPs is expected to increase from $2.9 billion in 2002 (up 5.6% from 2001) to $5.2 billion by the year 2007. This translates to a compound annual growth rate of 12.0% between 2002 and 2007 (InStat/MDR Inc.). Although this growth rate is relatively good, it is not as strong as the growth rate of specialised processors such as digital signal processors.

2.2.3 Digital Signal Processors

Digital Signal Processors (DSPs) are off-the-shelf, specialised microprocessors that contain hardware architecture specifically designed for high-speed algorithmic and numerical computations on discrete number sequences. Dedicated hardware architecture examples include multipliers, floating-point units, multiply-and-accumulate operation units, etc. The fact that DSPs have dedicated hardware makes them significantly different from GPPs. Although both have, for example, a multiply instruction, the operation is executed as microcode in the GPP (and is thus considerably slower) but executed in dedicated multipliers in the DSP. Furthermore, DSPs have an instruction set optimised for the task of digital signal processing, so that a wide range of digital signal processing applications can be executed efficiently. Therefore, the advantages of DSPs include short design turnaround time, high flexibility, and low design effort. DSPs also offer better performance than GPPs, especially in digital signal processing domains such as multimedia, security, and networking. However, additional hardware area cost and high power consumption are the drawbacks of DSPs. This is because although executing multiple hardware units in parallel can increase application performance, it also increases power consumption.
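The kernel below makes the distinction concrete: the multiply-accumulate inside a FIR filter loop is the operation that a DSP dispatches to a dedicated MAC unit, typically one tap per cycle, whereas a GPP without such a unit executes it as separate multiply and add steps. This is a minimal sketch with an invented function name, assuming Q15 fixed-point data and an arithmetic right shift.

/* FIR filter inner loop: the multiply-accumulate in the loop body is the
 * operation a DSP executes in a dedicated MAC unit in a single cycle. */
int fir(const short *x, const short *h, int ntaps) {
    int acc = 0;                       /* wide accumulator, as on most DSPs */
    for (int k = 0; k < ntaps; k++)
        acc += (int)x[k] * (int)h[k];  /* one MAC per tap */
    return acc >> 15;                  /* rescale the Q15 result */
}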

The design process of using DSPs for embedded systems is very similar to that of GPPs, where software applications are optimised using compilers and linkers. However, the compiler is often specifically designed for DSPs in order to optimise for the available parallelism. Recent research has shown that specific compilers provide better optimisations in the software application [89,104,171,209], thus achieving large reductions in execution time [143] and energy consumption [85,139,169]. In addition, there are several variations on DSP architecture in terms of the number of dedicated hardware units that can be executed in parallel. There is a wide range of DSPs with different off-the-shelf configurations on the market, from which designers can choose for their applications.

In the early '80s, NEC and Texas Instruments enjoyed success with the µPD7710 and TMS32010 DSPs. Today, there are hundreds of different types of DSPs on the market, provided by various manufacturers. In addition, the demand for personal computing in multimedia and communications is continuously increasing, and thus the market value is increasing at a rapid pace. The market for DSP chips is set to grow by 25% in 2004, on top of a 24% growth in 2003 that saw it touch the US$6 billion mark (Forward Concepts, Inc.). Texas Instruments is the single largest DSP manufacturer, with more than 50% of the market share in this domain. The DSP market is predicted to continue to outpace the semiconductor market, with a compound annual growth rate of 23.5% to $14 billion in 2007.

2.2.4 Field Programmable Gate Arrays

The fourth design approach employs Field Programmable Gate Arrays (FPGAs), as opposed to off-the-shelf processors. FPGAs are pre-fabricated circuit modules that are electrically configured by the designer to implement specific design functions. FPGA architecture consists of programmable logic blocks, programmable interconnects and switches between the blocks. Programmable logic blocks can implement any type of logic gate but are limited to a finite number of logic gates. Programmable interconnects and switches serve as wires between programmable logic blocks to configure the FPGA to perform a specific function.

There are four design implementation phases in designing a FPGA. These are synthesis, placement, routing, and bitstream generation, all of which affect the physical characteristics of the final embedded system. In the synthesis phase, a specification in a hardware description language (HDL) is synthesised and mapped onto logic elements such as lookup tables (LUTs), flip-flops and multiplexors, which are the basic building blocks of the target FPGA architecture. The logic elements in the design netlist are then placed on the FPGA, which dictates the configuration of the logic element sites. Next, the logic elements are connected together in the routing phase. Routing determines the configuration of the routing fabric for achieving the connections of the design. Together, the configuration of the logic sites and the routing fabric constitutes the bitstream for the FPGA, which can be loaded on the device to implement the user design on the FPGA. Figure 2.3 shows the design flow for a FPGA. FPGAs are mainly used for processing drivers and prototyping semiconductor devices for verification and testing. This approach has the advantages of flexibility with programmability at the logic level, fast time-to-market, and low fixed costs. In the past few years, FPGAs have become comparable with leading edge technologies in terms of area and performance.


Figure 2.3: A design flow of field programmable gate arrays
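To make the role of the LUTs mentioned above concrete, the following C sketch models a 4-input lookup table behaviourally. The function name and field layout are inventions for this illustration: the point is that the 16-bit configuration word written by the bitstream is simply a truth table, so any 4-input boolean function can be realised purely by choosing that word.

#include <stdint.h>

/* Behavioural model of a 4-input lookup table, the basic logic element of
 * the synthesis phase described above. The 16-bit 'config' word is the
 * truth table loaded into the LUT by the bitstream. */
static int lut4(uint16_t config, int a, int b, int c, int d) {
    int addr = (d << 3) | (c << 2) | (b << 1) | a;  /* inputs select a row */
    return (config >> addr) & 1;
}

/* Example configurations: 0x8000 is '1' only at address 15, i.e. a 4-input
 * AND; 0xFFFE is '0' only at address 0, i.e. a 4-input OR. */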

However, power consumption is still the main drawback of this design approach.

When FPGAs were invented in the mid-'80s, designers had to manually implement their design functions on a FPGA. As automation came in the form of Computer Aided Design (CAD) tools, designers could enter their design in a standard schematic editor. The netlist generated by the schematic editor was processed by automatic tools that placed and routed these designs, and generated the corresponding bitstream. Furthermore, high-level synthesis has moved the abstraction level of the design to the hardware description language, allowing designers to use languages like VHDL and Verilog to design FPGAs. In recent years, many research teams have focused on CAD tools for synthesis, placement and routing to advance the automation and design process. In early FPGAs, synthesis and technology mapping were heavily influenced by ASIC approaches such as ESPRESSO [175-177] and MIS [45]. It was not until the '90s that Francis, Rose and Chung [83] proposed the first technology mapping algorithm for FPGA architectures, called Chortle. A combination of decomposition and covering allowed the number of LUTs (area) on FPGAs to be minimised. Later, the authors enhanced the algorithm to minimise delay by reducing the number of logic levels. Cong and Ding [69] presented a polynomial time algorithm for the LUT-based FPGA technology mapping problem using depth minimisation on Boolean networks. This is a significant breakthrough, since technology mapping for general ASICs using directed acyclic graphs is NP-hard [126]. In terms of placement, the problem is to place logic elements onto the FPGA sites, which is very similar to the placement problem for ASICs. Therefore, existing placement algorithms are deployed, such as the min-cut approach [122], simulated annealing [127,183], and quadratic placement optimisations [112,128]. Betz and Rose [36] presented a FPGA research tool for placement and routing. In contrast to placement, FPGA routing is very different from ASIC routing. FPGA routing algorithms were initially based on the basic maze routing methods devised by Lee [136]. As time progressed, Brown [48] developed the first routers for FPGAs. Ebeling, McMurchie, Hauck and Burns [80] presented a pathfinder algorithm for FPGA routing, which is one of the most popular FPGA routing algorithms. Nam, Aloul, Sakallah and Rutenbar [158] formulated the FPGA routing problem as a Boolean Satisfiability problem and used the GRASP SAT solver [153] to solve the routing problem.

Design starts on FPGAs are forecast to decline through 2007 (Gartner, Inc.), which has made FPGA vendors such as Xilinx, Inc. [7] and Altera, Inc. [17] look for new opportunities. These include platform-based FPGAs, such as FPGA-based embedded processors and FPGA-based co-processors, which are expected to increase over time. This type of platform FPGA covers hardware customisation and software flexibility, which is a kind of hardware/software co-design. The worldwide market for embedded FPGAs is forecast to increase from $2.9 million in 2001 to $603.1 million by 2006. This translates to a forecast compound annual growth rate, over the 2001-2006 time frame, of 191.6%.

2.2.5 Application Specific Instruction-set Processors

Application Specific Instruction-set Processors (ASIPs) are designed for specific applications or application domains in embedded systems. ASIPs typically consist of a base processor and a base instruction set, plus the capability to extend this instruction set through new specific instructions. Specific instructions are hardware modules in the execution stage that replace computationally intensive code segments in the application. Thus, code segments of the application are executed in the specific instructions rather than in the arithmetic logic unit as microcode. Such execution can improve performance and reduce energy consumption. In addition, ASIP tool suites often contain synthesis tools for specific instruction creation, such as synthesisers, hardware compilers and verification tools, as well as software tools such as compilers, linkers, instruction-set simulators, etc. An ASIP solution is a hardware/software co-design approach, which combines hardware customisability (by using specific instructions) and software flexibility (by featuring a base processor). The main advantage of designing specific instructions and application optimisation is that hardware design and software programs can be developed in parallel, shortening the design cycle significantly. In addition, ASIPs can minimise the energy consumption of embedded systems by decreasing the execution time significantly with only an incremental increase in power consumption.

Research and development into design approaches for ASIPs has been carried out for approximately ten years. (For a detailed survey of ASIPs, see Jain, Balakrishnan and Kumar [114].) Early design approaches for ASIPs can be divided into three main categories: architecture description languages [2,31,38,105,106,201], compilers [63,95,161,206], and design methodologies for ASIP design [92,111,129,132]. The first category (architecture description languages for ASIPs) is further classified into three sub-categories based on a different primary focus: the structure of the processor (such as the MIMOLA system [142]); the instruction set of the processor (as given in nML [81] and ISDL [98]); and a combination of both structure and instruction set (as in HMDES [97], EXPRESSION [99], LISAtek [106], ASIP-Meister [2], and FlexWare [164]). The architectural description language approach generates a retargetable environment from an input processor architecture description. This retargetable environment includes retargetable compilers, Instruction Set Simulators (ISS) of the target architecture, and synthesisable HDL models. The generated tools allow valid assembly code generation and performance estimation for an application on the architecture described (i.e. "retargetable"). In the second category, the compiler is the main focus of the design process, using exploratory techniques such as data flow graphs, control flow graphs, etc. The process takes an application written in a high-level description language such as C/C++, and produces a customised architecture for ASIPs. Based on the characteristics of the application, a processor for that particular application can be constructed. Zhao, Mesman and Basten [206] used static resource models to explore possible functional units that can be added to the data path to enhance performance. Onion, Nicolau and Dutt [161] proposed a feedback methodology for an optimising compiler in the design of an ASIP, so that more information is provided at the compile stage of the design cycle, producing a better hardware processor model.

In the third category, researchers proposed design methodologies for ASIPs that solved various problems, such as identifying functionalities to speed up the application and introducing the hardware resources for those functionalities [92,111,129,132]. Gschwind [92] described the ASIP selection problem as a hardware/software co-design methodology, allowing early evaluation of ASIP options in rapid prototyping techniques. Imai et al. [111,129] proposed the PEAS series ASIP environment to customise ASIPs by defining and selecting various instructions. Küçükçakar [132] proposed a methodology to design ASIPs by customising an existing processor instruction set and architecture rather than creating a new ASIP. Leupers and Marwedel [141] proposed an instruction set modelling technique for retargetable code generation in ASIPs, and explored a range of instruction formats and inter-instruction restrictions. Gong, Gajski and Nicolau [86] proposed a parameterised model and retargetable scheduler for performance evaluation of application specific architectures. Sudarsanam and Malik [191] proposed a memory bank and register allocation scheme for ASIPs, maximising the benefit of the application architectural features. In the mid-'90s, various architectures were proposed for ASIPs.

The characteristics of the different design approaches, such as hardware cost, power consumption, performance, flexibility, and design turnaround time, are summarised in Table 2.1. The ASIP approach is utilised in various application domains such as network applications [91], wireless security processing platforms [172], and multimedia applications [57].

Characteristic \ Design Approach    ASIC   DSP   ASIP   FPGA   GPP
Hardware Cost                       XXX    XX    XX     X      ×
Power Consumption                   XXX    XX    XX     X      ×
Performance                         XXX    XX    XX     X      ×
Flexibility                         ×      X     XX     XX     XXX
Design Turnaround Time              ×      X     XX     XX     XXX

Table 2.1: Summary of different design approaches for embedded systems

2.3 Architecture of Application Specific Processors

ASIPs have proven to be a design approach that combines hardware customisability and software flexibility, satisfies design constraints such as performance, power consumption and hardware cost, and relieves time-to-market pressure by allowing embedded systems to be designed within a short design turnaround time. Customisable processors with the capacity to include specific instructions are becoming increasingly popular in academia [2,97,99,106] and with commercial vendors [10,12,13,18,21,23]. Several different architectures have been proposed, such as Very Long Instruction Word processors (containing multiple execution units, which have the ability to execute multiple operations simultaneously); reconfigurable processors (which use reconfigurable logic such as FPGAs as the platform for specific instructions and have the ability to reconfigure different instructions for different situations); and extensible processors (which use integrated circuits for executing specific instructions, and often have better area and power characteristics). Figure 2.4 demonstrates the different types of ASIPs and the characteristics of each.

Figure 2.4: Different types of application specific instruction-set processors (VLIW processors use multiple execution units for instruction extension, benefiting performance and code size; reconfigurable processors use reconfigurable logic, enabling instructions to be swapped in and out of the FPGA, benefiting performance and area; extensible processors use integrated circuits for instruction extension, increasing performance with minimal power consumption)

2.3.1 Very Long Instruction Word Processors

Very Long Instruction Word (VLIW) processors are processors that contain multiple execution units with the ability to issue multiple operations simultaneously in a single instruction. The advantage of this design approach is that it achieves very high performance by improving the instruction-level parallelism of the application, and reduces energy consumption by minimising the code size. Figure 2.5 shows the VLIW architecture in the execution path. The figure shows that two types of execution units are located in the execution path. These are: i) general execution units - which can execute any operation; and ii) specific hardware units - which perform a dedicated function. Since instruction-level parallelism needs to be exploited, instructions within the application are reordered and scheduled at compile time. During compile time, all data dependencies are checked, including independent instructions and subsequent scheduling. Multiple operations are grouped together to form a complex instruction for executing in parallel. Hence, the instruction format is different from the base instruction set. Figure 2.6 shows a generic 128-bit VLIW instruction, which combines four 32-bit base instructions. After compiling and scheduling of the application, parallelised code is generated. Therefore, VLIW processors require greater compiler support than GPPs. This reliance on the compiler has two advantages: i) the compiler has the ability to look at a much larger window of instructions, thus yielding better results to improve parallelism; and ii) the compiler has specific knowledge of the program's source code, such as branches and register usage, allowing further optimisation.

Since VLIW architectures rely on compile-time scheduling, the generic design methodology revolves around techniques such as instruction encoding and optimisation. Instruction encoding involves encoding the sequential code of the application into parallelised VLIW code that satisfies design constraints.

Figure 2.5: A VLIW architecture (the execution path contains two types of execution unit - general execution units, which can execute any operation, and specific hardware units, which perform a dedicated function - fed by the instruction cache and register file and executed in parallel under a single very long instruction word)

Figure 2.6: A generic VLIW instruction format (a 128-bit instruction containing four individual 32-bit instruction slots - each with an opcode field, a destination register and two source registers - which can drive four execution units in parallel)

There are two processes involved in instruction encoding: instruction scheduling and resource binding. Instruction scheduling places concurrent, yet independent, operations in the same instruction. The number of operations executed in parallel may differ between processor architectures. For example, Philips TriMedia issues five operations at a time [108], while IBM DAISY executes eight operations [79]. Once instructions are grouped, execution units need to be bound to a particular instruction at a certain time. This process is known as resource binding. During instruction scheduling and resource binding, optimisation techniques such as loop unrolling and prediction techniques can be applied to achieve better results. Loop unrolling refers to the unrolling of loops in the application, so that multiple execution units can be utilised in the VLIW architecture. Prediction techniques exploit the locality of data to predict the results of operations [147,157].
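The sketch below shows, in C, how a scheduler's output might be packed into the 128-bit bundle of Figure 2.6. The field widths and names are assumptions made for this illustration, not the encoding of any real VLIW machine.

#include <stdint.h>

/* Illustrative 32-bit slot encoding (field widths are assumptions): an
 * 8-bit opcode, an 8-bit destination register and two 8-bit source
 * registers, matching the generic format of Figure 2.6. */
static uint32_t encode_slot(uint8_t opcode, uint8_t dst,
                            uint8_t srcA, uint8_t srcB) {
    return ((uint32_t)opcode << 24) | ((uint32_t)dst << 16) |
           ((uint32_t)srcA << 8) | srcB;
}

/* A 128-bit VLIW bundle as four independent 32-bit slots. The compiler's
 * instruction scheduler fills the slots with operations that have no data
 * dependences on each other, and resource binding assigns each slot to one
 * of the four execution units; a slot with no available operation holds a
 * NOP. */
typedef struct {
    uint32_t slot[4];
} vliw_bundle_t;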

Several commercial vendors develop tool suites for VLIW processors. PICO technology, founded by Synfora Inc. [18], is based on research previously carried out in the Hewlett-Packard Labs [8]. PICO technology enables algorithm-to-tapeout synthesis by combining accelerators and a VLIW processor to create a highly efficient application engine directly from an algorithmic description. The Jazz DSP processor is a configurable VLIW processor architecture from Improv Systems, Inc. [10]. It comes with a comprehensive tool chain including a compiler, debugger, profiler, and Instruction Set Simulator, allowing designers to create custom RTL blocks and instructions to build a designer-defined DSP core. The Media embedded Processor (MeP) is a platform for digital media SoCs developed by Toshiba, Inc. [13]. The MeP processor consists of the MeP core, extension VLIW units, and a bus interface that controls a local bus and connects the MeP core and VLIW execution units to a global bus.

Pozzi [170] proposed an application specific reconfigurable VLIW processor, which consists of a VLIW base processor and a reconfigurable FPGA co-processor. The reconfigurable FPGA co-processor, or Reprogrammable Functional Units (RFU), contains specific instructions to customise the application. Pozzi's contributions include a design methodology for application analysis, extraction for the RFU, and static and dynamic selection for the RFU. Lodi, Toma, Campi, Cappelli, Canegallo and Guerrieri [148] proposed a novel architecture for a reconfigurable embedded system based on a VLIW processor with a run-time configurable datapath.

2.3.2 Reconfigurable Processors

A reconfigurable processor is one that combines a general purpose processor and reconfigurable devices such as FPGAs, enabling the configuration of different hardware logic units. Research and development work on reconfigurable processors can be divided into two categories: reconfigurable specific instruction-set processors (RISPs) and reconfigurable co-processors. RISPs consist of reconfigurable devices in the execution path of the processor, where specific instructions are executed in the reconfigurable devices. Figure 2.7 shows the architecture of a reconfigurable specific instruction-set processor. One of the distinguishing characteristics of RISPs is that these specific instructions can be swapped in and out of the FPGAs, reducing the area cost of the instructions. Thus, the area of the reconfigurable processor is minimised. The second approach is the reconfigurable co-processor, which has a loose coupling between the core processor and the reconfigurable logic. The reconfigurable co-processor can be seen as a slave computational unit located on the same die as the processor or off-chip. With a reconfigurable co-processor, the granularity of the function to be implemented in the reconfigurable logic is much higher than in the first approach. This is because the instructions do not need to satisfy the timing constraint of the execution stage of the processor.

Research into reconfigurable processors has been undertaken for the past ten years.

(In the architecture of Figure 2.7, a reconfigurable logic block and a reconfigurable register file sit alongside the standard execution unit, instruction cache, register file, data cache and control logic. Specific instructions are pre-designed and pre-synthesised for dedicated functions; their bitstreams are stored in memory and, when the processor schedules a dedicated function, the corresponding bitstream is loaded into the reconfigurable logic and executed.)

Figure 2.7: A reconfigurable processor architecture

Athanas and Silverman [34] proposed PRISM, a reconfigurable processor with instruction set metamorphosis, consisting of a general purpose processor and a RAM-based logic device to allow fast processor reconfiguration. Razdan and Smith [173] described the PRISC system at Harvard, which extends the instruction set of a RISC processor through the implementation of particular functions on one or more Programmable Functional Units (PFUs). Wirthlin and Hutchings [198] proposed DISC, a dynamic instruction set computer, which uses partial reconfiguration to swap instructions in and out of FPGAs. Hauser and Wawrzynek [102] proposed the Garp system, which consists of a MIPS processor with a reconfigurable co-processor on the same die. The co-processor is activated by the processor when a reconfigurable function is called. In the same year, Hauck, Fry, Hosler and Kao [101] proposed a reconfigurable system, Chimaera, where the FPGA and the processor core are placed on the same chip. The primary focus of Chimaera is to minimise the reconfiguration overhead and eliminate the communication bottleneck between the FPGA and the processor. Clark, Kudlur, Park, Mahlke and Flautner [66] proposed a reconfigurable processor that allows instruction set customisation for embedded systems. This work uses dynamic subgraph identification methods that identify common subgraphs in the application. Vuletić, Pozzi and Ienne [195] introduced a virtualisation layer that utilises an operating system extension and a hardware component, reducing the complexity of interfacing and data transfers between the processor and co-processors.

Many commercial vendors provide services for reconfigurable processors [5,17,21,22]. Stretch processors are reconfigurable processors based on a core processor, the Stretch S5 engine, and the Stretch Instruction Set Extension Fabric (ISEF) [21]. The ISEF is a software-configurable datapath based on programmable logic. Tarari processors (provided by Tarari, Inc.) are processors consisting of reconfigurable logic [22]. These processors are based on dynamically reconfigurable hardware that targets specific computationally intensive tasks, and decreases the processing time required to perform those operations. The DAPDNA-2 dynamically reconfigurable processor is a dual-core processor comprised of a high performance RISC core paired with a two-dimensional processing matrix, provided by IPFlex, Inc. [5]. Finally, NIOS II / NIOS reconfigurable processors, provided by Altera, Inc. [17], feature a general purpose RISC CPU architecture with an instruction set, plus the capability to extend this instruction set through new specific instructions. In addition, designers can select different peripherals and interfaces to satisfy the needs of an application. (For a comprehensive survey of reconfigurable instruction set processors, see [35].)

(In the architecture of Figure 2.8, specific instructions sit alongside the general execution unit, fed by the instruction cache, the register file and a specific register file. Specific instructions are pre-designed for dedicated functions, and are pre-synthesised and pre-fabricated in the processor. The processor can execute either the execution unit or one specific instruction at a time.)

Figure 2.8: An extensible processor architecture

2.3.3 Extensible Processors

An extensible processor combines a general purpose processor and application specific integrated circuits to implement specific instructions. Their customisation typically addresses three architectural levels of the base processor: i) instruction extension - the designer can define customised instructions by specifying their functionality; ii) inclusion/exclusion of predefined blocks - the designer can choose to include or exclude predefined blocks as part of the extensible processor (including a floating-point unit, a digital signal processing unit, special function registers, a multiply-and-accumulate operations block, etc.); and iii) parameterisation - the designer can set extensible processor parameters such as instruction and data cache sizes. Through these architectural customisations, extensible processors are able to achieve high performance, low power consumption and compact area for a particular application. Extensible processors represent the state-of-the-art in application specific instruction-set processors. Figure 2.8 shows the execution path of an extensible processor architecture, where specific instructions are application specific hardware units.

Figure 2.9: A simplified generic design flow of an extensible processor (the application, written in C/C++, is compiled, analysed and profiled; computationally intensive code segments are identified; extensible instructions are generated for the code segments, and instructions, predefined blocks and parameters are selected while exploring the extensible processor design space; the performance and design constraints (power, area, performance) of the processor are evaluated; once a design satisfies the constraints, synthesisable RTL of the base processor, predefined blocks, extensible instructions and parameter settings is generated for synthesis and prototyping or tape-out)

Figure 2.9 shows a simplified generic design flow of an extensible processor platform.

The goal of designing an extensible processor is typically to maximise the performance of an embedded application while satisfying design constraints. The designer often begins by profiling the application using an Instruction-Set Simulator (ISS) of the target processor. The profiling reveals computationally intensive code segments for which possible instruction extensions, inclusions or exclusions of predefined blocks, or parameterisations might improve performance and energy characteristics. After identifying a set of possible extensible instructions, predefined blocks, or parameter settings, the designer defines these customisations in the extensible processor. To evaluate these customisations, designers can use retargeted tools to determine whether the application meets design constraints. In addition, designers can iterate this step during design space exploration. Once the application meets design constraints, the platform uses the base processor configuration, predefined blocks and extensible instructions to generate the extensible processor's synthesisable RTL. This synthesisable RTL is then ready to be taped out or used for prototyping.
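As a small illustration of the instruction extension step in this flow, consider a profiled hot loop performing saturating 16-bit addition. The sketch below models the customisation in plain C: SATADD16 is a hypothetical designer-defined instruction (the name and interface are inventions for this example), shown here as a software model that a retargeted compiler would map onto the single-cycle hardware instruction.

/* Software model of a hypothetical designer-defined instruction SATADD16:
 * add two 16-bit values with saturation. In the customised tool chain this
 * call would be replaced by a single-cycle extensible instruction in the
 * processor's execution stage. */
static inline short SATADD16(short x, short y) {
    int t = (int)x + (int)y;
    if (t > 32767)  t = 32767;   /* saturate high */
    if (t < -32768) t = -32768;  /* saturate low  */
    return (short)t;
}

/* The profiled hot loop after customisation: each iteration now maps to one
 * custom instruction instead of an add followed by two compare-and-clamp
 * steps executed in the arithmetic logic unit. */
void saturating_add(const short *a, const short *b, short *z, int n) {
    for (int i = 0; i < n; i++)
        z[i] = SATADD16(a[i], b[i]);
}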

Several commercial vendors provide extensible processors for designing embedded systems. ASIPmeister is an ASIP system provided by Osaka University, Japan [2]. It is an RTL description processor system that allows the designer to customise specific instructions in VHDL. The ARC processors (ARCtangent-A4, ARCtangent-A5, ARC 600 and ARC 700) are a set of 32-bit user-customisable RISC processors [1]. ARC processors typically have the architecture of a 32-bit RISC processor, with the ability to add optional DSP instructions in order to create processors with low power, high performance and low area overhead. Lexra processors consist of a 32-bit RISC core processor and DSP cores for the embedded market, similar to the ARC processors [11]. LISATek is an automated embedded processor design and optimisation environment provided by CoWare, Inc. [12]; using an architecture description language, it supports the automated design of both custom and standard processors and generates software development tools for the application. Finally, Xtensa processors, provided by Tensilica, Inc. [23], consist of a base processor core with a base instruction set and the capacity to extend this instruction set with new specific instructions (using Tensilica Instruction Extension (TIE)). In addition, the designer is able to customise the co-processor as well as processor parameters such as the instruction cache, data cache, etc. (For more detailed literature on the Xtensa processor, see [87,197].) Recently, Tensilica released their automation environment, the Xpres compiler, which is a synthesis tool that creates tailored processor descriptions for the Xtensa LX processor from an application written in C/C++ code. In addition, this system also allows the designer to fine-tune the processor manually in order to optimise the embedded system's specification. An overview of extensible processors, their benefits and problems is given in [78,103].

2.4 Problems in Designing Extensible Processors

From the design flow of extensible processors shown in Figure 2.9, it is evident that designers require a great deal of expertise to design extensible processors. This is particularly true for processes such as code segment identification, extensible instruction generation, architectural customisation selection, and processor evaluation, which are usually conducted manually (shown in blue boxes in Figure 2.9). Hence, recent research has largely revolved around these design processes, and has sought to optimise and automate different aspects of them. This section describes the problems associated with these design processes and some of the solutions proposed in the literature.

2.4.1 Code Segment Identification

Code segments are performance-critical sections of the application in which the base processor spends significant execution time. In order to increase the performance of the application, these code segments need to be sped up by generating extensible instructions, including/excluding predefined blocks, and setting parameterisations. However, these performance-critical code segments (hereafter referred to as code segments) first need to be identified from the application. While identifying code segments is somewhat simplified by profiling tools, it is still a daunting task for large applications, and further complicated when additional constraints (e.g., area and power) must be optimised as well as performance. Furthermore, the number of code segments for a given application grows exponentially with the program size. It is very common for a function with fewer than one hundred operations to contain several hundred possible code segments. Figure 2.10 shows a function named cosine, which, as its name suggests, computes the cosine of a floating-point number. Although the function is quite small, it contains hundreds of code segments. For example, each line within the loop body (lines 12-18) can be implemented as one or more instructions. Alternatively, line 12 can be one instruction, line 13 a second instruction, and lines 14-18 a third instruction. Also, the loop (lines 11-18) can be unrolled and executed in parallel. Even counting only contiguous groupings of the seven lines 12-18, there are 7 x 8 / 2 = 28 candidates; non-contiguous combinations and unrolled variants quickly push the total into the hundreds. Most real-world applications contain a function hierarchy with a large number of functions. In fact, it is often the case that there are several performance-critical functions, i.e. no single or even small set of functions is responsible for a large fraction of the total application execution time [123]. Two architectural customisations that seem to have the same speedup for a single code segment may result in hardware that impacts area overhead, latency and power consumption differently. When design constraints on hardware are present and associated with code segments, the tradeoffs involved are complex and can be difficult to identify manually, making code segment identification (for customisation) one of the most difficult problems to solve. Recent research into code segment identification can be classified into three categories: i) retargetable code generation using matching and covering algorithms; ii) finding patterns in the graph representation (control dataflow graph) of the profiled application; and iii) high-level extraction from the application.

1  /*
2   * This function computes cosine of x (x in radians)
3   * by an expansion
4   */
5  float cosine (float x) {
6      int i;
7      int factor = 1;
8      float result = 1.0;
9      float power = x;
10
11     for (i = 2; i <= 10; i++) {
12         factor = factor * i;
13         power = power * x;
14         if ((i & 1) == 0) {
15             if ((i & 3) == 0)
16                 result = result + power / factor;
17             else
18                 result = result - power / factor;
19         }
20     }
21     return (result);
22 }

Figure 2.10: An example function for demonstrating the complexity of code segment identification

These are described in detail next.

• Code Generation using Matching Algorithm. Code generation using a matching algorithm is a well-known problem, particularly in the fields of technology mapping in logic synthesis [69,83,124] and code generation in compilers [27,71,142,145,186]. Matching finds all possible instantiations of identical patterns in a structured representation, and was often the early approach to ASIP design. There are two main approaches to pattern matching: boolean and structural matching. Boolean matching is often applied to networks of boolean functions, and includes checking the equivalence of functional representations between patterns in the application. This kind of equivalence checking often uses Binary Decision Diagrams (BDDs), which are unsuitable for non-boolean functions. Structural matching focuses on a graph representation, where nodes represent functions. This approach often identifies common patterns with structural, rather than functional, equivalence. In structural matching, the type of graph representation can also vary the complexity of the graph. In the early '90s, most matching algorithms revolved around patterns with a single acyclic output [31,71,107,145]. However, Arnold [32] proposed a matching algorithm to identify patterns with multiple outputs, which expanded the search space for possible patterns significantly. Leupers and Marwedel [141] described an instruction set model to support retargetable compilation and code generation. However, the drawback of this approach is the lack of application input characteristics (e.g., simulation and profiling information), which often lowers the chance to optimise designs for a specific application. In fact, application input provides data-specific information such as estimates for indeterminate loop counts, branch taken/not-taken percentages, the range of input data, etc., which is impossible to ascertain without the use of simulation and an input data set. However, an input data set suffers from a coverage problem - that is, how many input vectors are enough to efficiently represent the application and enable optimal design of the embedded system? It is very important to select a good coverage of input data sets for an application.

• Profiled Graph Representation. The second category for identifying code segments uses profiling analysis through simulations with a set of input data. This approach first compiles the application into an executable and then simulates the executable using the input data set to obtain application specific profiling information. Using the additional profiling information, a graph representation of the application is created, with base instructions as nodes and data dependences between instructions as edges. Code analysis is then performed to identify code segments. This profiling analysis technique is applied after the source code is compiled (sometimes referred to as post-compiler optimisation). Sun, Ravi, Raghunathan and Jha [193] proposed an approach for identifying suitable code segments to implement as extensible instructions in a connected subgraph. First, the application is compiled and profiled using the input data set. Then the program dependence graph is constructed using the profiling information and the application, with the base instructions as nodes and the dependences between assembly codes as edges. All patterns (e.g., code segments) are identified using a template matching technique (a simplified sketch of such template matching on a dataflow graph is given after this list). The designer then ranks the patterns from the most frequently executed to the least frequently executed in the application using a priority function. The highly ranked patterns are selected and implemented as extensible instructions. A problem with this approach is that the template patterns need to be pre-defined (and a pre-defined template may not be best for every application) and well-constructed in order to maximise the speedup and reduce energy consumption. Clark, Zhong and Mahlke [67] proposed a similar pruning technique, a "guide function", to prune the design space of code segments searched in the connected subgraph. Rather than pruning the number of patterns, the authors proposed to prune the search direction in the graph, thus allowing the possibility that initially low-ranked patterns would amount to a useful pattern at a later stage of the design space exploration. Sun, Ravi, Raghunathan and Jha [192] proposed a scalable approach to extend the existing technique, where the matching patterns were not limited to templates. After the patterns are identified, functions can be added to or removed from the patterns in order to be well-suited to the application. These steps are performed using a cost function for area and performance tradeoffs. However, the main drawback of these two approaches is that the number of inputs and outputs to the code segments is limited. Atasu, Pozzi and Ienne [33] described a binary tree search algorithm that identifies patterns with multiple inputs and outputs in an application dataflow graph, covering an exhaustive design space. This technique achieves maximum speedup and satisfies micro-architectural constraints. This algorithm was originally described in Pozzi's doctoral thesis work on reconfigurable processors [170]. Pozzi proposed a generic approach for searching and extracting code segments from an application, where the patterns have multiple inputs and outputs. It is a tree-based searching and extracting algorithm: a binary tree graph is created from the profiling information and the application program; the search begins at the top of the graph and extends to the bottom of the tree, eliminating useless branches in order to reduce the search space. Yu and Mitra [204] proposed a scalable custom instruction identification method that extracts all possible candidate instructions in a given graph. However, the major drawback of profiling using traces and assembly code is that the graph representation is often limited to the sequence of code. In order to further reduce the time-to-market pressure, there is a need to move the identification hierarchy to a higher level of abstraction, such as a C/C++ application.

• High-level Extraction. High-level extraction identifies code segments from an application written in a high-level language (e.g., C/C++). This approach usually begins with simulation to obtain profiling information. From the profiling information, the designers identify frequently executed sections of the application. Semeria, Seawright, Mehra, Ng, Ekanayake and Pangrle [184] developed a tool to extract code segments from C code and generate a functional equivalent in RTL-C and an HDL. However, the C code is limited to a subset of C which is very close to the hardware description (RTL code). In other words, the C code needs to be written in a very similar way to the RTL code. Recently, Clarke, Kroening and Yorav [68] presented equivalent behaviour between C and Verilog HDL, which can be used to perform high-level extraction for C. Yu and Mitra [203] described the identification of code segments using the characteristics of the embedded systems application by relaxing constraints to a reasonable level. The drawback of this approach is that the granularity of an application written in C/C++, in terms of coding style, can vary between programmers, meaning that the complexity of code segment identification increases. As described in later chapters, this thesis proposes a high-level identification technique that identifies code segments using the application and profiling information, covering coarse-grained searches over functions or subroutines down to a fine-grained, line-by-line approach. Our identification scheme combines the advantages of profiled graph representation and high-level extraction, shortening the exploration time over the otherwise unmanageable number of code segments in the application.

2.4.2 Extensible Instruction Generation

Instruction generation involves designing extensible instructions to replace computationally intensive code segments by specifying new hardware resources and the operations they perform. The typical goal of generating extensible instructions is to maximise performance while satisfying design constraints such as area, power and energy. As mentioned previously, extensible instructions are designed in the execution stage of the processor. If the addition of extensible instructions causes a violation of the base processor's clock period, the designer is required to i) reduce the amount of computation performed in the instruction; ii) split it into multiple instructions; iii) multi-cycle the execution of the instruction; and/or iv) reduce the clock period of the base processor.

Figure 2.11: A computationally intensive code segment for demonstrating the complexity of instruction generation (nine alternative extensible instructions for the loop z[i] = a[i] + b[i] + c[i] + d[i], each with a different throughput, area and power)

Although the instruction generation step is somewhat simplified by specifying extensible instructions at a high level of abstraction, it is still a tedious task for large code segments, and is further complicated because a particular instruction can be designed in numerous ways. Figure 2.11 shows nine extensible instructions that can be designed for a single code segment. Each instruction has different characteristics, covering a large design space (performance ranges from 2cc to 3cc; area ranges from 19,323 to 72,347 grids; and power ranges from 8.85mW to 42.65mW). Designing extensible instructions is therefore very complex due to the large design space it occupies. Research into instruction generation began in the mid-'90s, when the entire instruction set was designed specifically for a particular application, an approach often referred to as instruction set synthesis. As time-to-market pressure mounts, recent research has focused on generating only specific instructions on top of the base instruction set, which is usually referred to as instruction generation or instruction set extension.
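To make this design space concrete, the C sketch below rewrites the code segment of Figure 2.11 against two of its possible instruction variants. The intrinsics add4 and add4x4 are hypothetical stand-ins of ours, not instructions from any real platform; on an actual extensible processor each would be a single custom instruction in the execution stage rather than a C function.

    #include <stdint.h>

    #define N 1000

    /* Variant A: a fused four-input add -- one datum per issue
     * (smaller datapath, lower area and power). */
    static inline int16_t add4(int16_t a, int16_t b, int16_t c, int16_t d)
    {
        return (int16_t)(a + b + c + d);
    }

    /* Variant B: a 4-way SIMD fused add -- four data per issue,
     * at the cost of a wider datapath (more area, more power). */
    static inline void add4x4(int16_t *z, const int16_t *a, const int16_t *b,
                              const int16_t *c, const int16_t *d)
    {
        for (int k = 0; k < 4; k++)
            z[k] = (int16_t)(a[k] + b[k] + c[k] + d[k]);
    }

    void kernel_a(int16_t *z, const int16_t *a, const int16_t *b,
                  const int16_t *c, const int16_t *d)
    {
        for (int i = 0; i < N; i++)
            z[i] = add4(a[i], b[i], c[i], d[i]);
    }

    void kernel_b(int16_t *z, const int16_t *a, const int16_t *b,
                  const int16_t *c, const int16_t *d)
    {
        for (int i = 0; i < N; i += 4)   /* assumes N divisible by 4 */
            add4x4(&z[i], &a[i], &b[i], &c[i], &d[i]);
    }

The source-level change is small in each case, yet the two variants land at very different points of the performance/area/power space, which is precisely what makes instruction generation complex.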

Early research in instruction generation focused on completely custom instruction sets that satisfy design constraints [62,107,109]. In 1994, Holmer described a methodology to find and construct the best instruction set on a predefined architecture for an application domain [107]. His method found code segments that execute in three or four cycles and recompiled them into new complex instructions. Huang and Despain [109] presented an instruction set synthesis for an application on a parameterised, pipelined micro-architecture. This system was one of the first hardware/software systems to be designed for an application with a customised instruction set; the generated instructions are single-cycle instructions. Several years later, Choi, Kim, Yoon, Park, Hwang and Kyung [62] proposed an approach to generate multi-cycle complex instructions as well as single-cycle instructions for DSP applications. The authors combined regularly executed single-cycle instructions into multi-cycle complex instructions. As pressure mounts in the consumer market for quick turnaround, it has become infeasible to perform instruction set synthesis and design the entire instruction set from scratch.

As a result, recent research has revolved around instruction set extension [39,46,67,70] and extensible instruction generation [33,88,137,192,193]. Instruction set extension is the term often used with reconfigurable processors (which combine ASIPs with reconfigurable hardware), while extensible instruction generation is the term often used with extensible processors. The major difference between the two is the need to satisfy the hard latency constraint of the base processor's clock period. For instruction set extension, Cong, Fan, Han and Zhang [70] proposed a performance-driven approach to generate instructions that maximise application performance. In addition, they allow operation duplication while searching for patterns in the matching phase. The duplication is performed on operations with multiple inputs that are on the critical path of the frequently executed code segment. When operations are duplicated, the parallelism of the code segments may increase, improving the performance of the application and enhancing design space exploration. The work was evaluated on the NIOS platform, provided by

Altera Inc. [17], which is a VHDL reconfigurable embedded processor. This approach does not perform any design space exploration (tradeoffs between performance and area, power, etc.). Brisk, Kaplan and Sarrafzadeh [46] described an instruction synthesis that uses resource sharing to minimise area efficiently. Their approach groups a set of extensible instructions into a co-processor in which common hardware blocks are shared during synthesis; the area savings are up to 40% when compared to the original extensible instructions. Biswas, Choudhary, Atasu, Pozzi, Ienne and Dutt [39] introduced an instruction set extension that includes access to local memory elements. This approach used a hardware unit to enable direct memory access in the execution stage of the processor. In order to enable local memory access, memory accesses need to be carefully scheduled, or multiple read/write ports are needed. In addition, accessing memory elements in the execution stage potentially increases pipeline hazards, thus increasing the complexity of code optimisation. Although this approach increases the performance of the application, the hardware overhead and the probability of increased pipeline hazards are relatively high. Clark, Zhong and Mahlke [67] described a compiler approach to generate instructions in a VLIW architecture without constraining their size or shape. These approaches have largely revolved around maximising the speedup of the application while minimising the area of the processor; none of them focuses on the energy consumption of the application.

Extensible instruction generation focuses on generating instructions that satisfy the latency of the base processor while maximising performance and satisfying other constraints

[33,88,137,192,193]. Lee, Choi and Dutt [137] proposed an instruction encoding scheme for generating complex instructions. The encoding scheme enables tradeoffs between the size of opcodes and operands in the instructions to enhance performance and reduce power dissipation. In addition, it contains a flexible approach for creating complex instructions as combinations of basic instructions that appear regularly in the application, exploring a greater design space and achieving improved performance.

Atasu, Pozzi and Ienne [33] described a generic method to generate extensible instructions by grouping frequently executed code segments using a tree-based searching and matching approach. The method enables the generation of extensible instructions with multiple inputs and outputs. Sun, Ravi, Raghunathan and Jha [193] described a methodology to generate custom instructions by matching operation patterns against a template pattern library. The generated instructions increase application performance by up to 2-5× with a minimal increase in area. Sun, Ravi, Raghunathan and Jha [192] described a scalable instruction synthesis in which custom instructions can be adapted by adding and removing operations, further ensuring that the given latency constraint is satisfied. This approach also optimised the area overhead of the instructions while maximising their performance for the application.

Goodwin and Petkov [88] described an automatic system to generate extensible instructions using three operation techniques: i) Very Long Instruction Word (VLIW) operations - grouping multiple instructions into a single instruction executed in parallel; ii) vector operations - parallelising data and increasing the instruction width; and iii) fused operations - combining sequential instructions into a single instruction. This system achieved significant speedup for the application while exploring millions of instruction combinations in several minutes. Tensilica Inc. later implemented this system as the Xpress system [23]. Sun, Ravi, Raghunathan and

Jha [194] also recently proposed a heterogeneous multiprocessor instruction set synthesis using extensible processors to speed up the application. Although these approaches have shown energy reductions, the reductions are achieved by combining computationally intensive code segments into extensible instructions, which shortens execution time significantly (with an incremental increase in power dissipation).

2.4.3 Architectural Customisation Selection

Architectural customisation selection involves selecting extensible instructions, predefined blocks, and parameter settings in the extensible processor to maximise application performance while satisfying design constraints. This process is often referred to as design space exploration. The selection problem can be simplified and formulated as the well-known knapsack problem, with single or multiple constraints. The single-constraint knapsack problem is defined where an item i has a value v_i and a weight w_i. The goal is to find a subset of the n items such that the total value is maximised and the weight constraint is satisfied. In our case, an item is an architectural customisation, AC, such as an extensible instruction, predefined block or parameter setting. Each customisation has a speedup factor, s_ac, compared with the software code segment it replaces, and a single design constraint such as area a_ac or power p_ac. In the single-constraint case, this simplified form of the problem is not strongly NP-hard, and effective approximation algorithms have been proposed for obtaining near-optimal solutions; a comprehensive review of the single-constraint knapsack problem and its associated exact and heuristic algorithms is given by Martello and Toth [154].
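In this notation, the single-constraint selection can be written as a standard 0-1 knapsack program (restated here for concreteness; the area budget A and selection variables x_i are our symbols, not the thesis's):

\[
\max_{x \in \{0,1\}^n} \; \sum_{i=1}^{n} s_{ac_i}\, x_i
\qquad \text{subject to} \qquad \sum_{i=1}^{n} a_{ac_i}\, x_i \le A ,
\]

where x_i = 1 if and only if architectural customisation AC_i is selected, s_{ac_i} is its speedup factor, a_{ac_i} its area, and A the area constraint.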

On the other hand, the multiple-constraints knapsack problem, defined where an item i has a value v_i and multiple weights w_ij, is strongly NP-hard. Other names for this problem in the literature are the multidimensional knapsack problem, the multi-knapsack problem and the multiple knapsack problem. A practical problem can be formulated as a multiple-constraints knapsack problem; for example, a capital budgeting problem where project j has profit p_j and consumes r_ij units of resource i. The goal is to find a subset of the n projects such that the total profit is maximised and all resource constraints are satisfied. Exact and heuristic algorithms have been proposed in the past, such as branch and bound algorithms, dynamic programming based algorithms, tabu search based heuristics, analysed heuristics, etc. (for a review of the multiple-constraints knapsack problem, refer to [65]). This section discusses the literature related to design space exploration and architectural customisation selection in extensible processors with single or multiple architectural customisations under single or multiple constraints.
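The capital budgeting instance above takes the same form with one constraint per resource (the resource budgets b_i are our symbols):

\[
\max_{x \in \{0,1\}^n} \; \sum_{j=1}^{n} p_j\, x_j
\qquad \text{subject to} \qquad \sum_{j=1}^{n} r_{ij}\, x_j \le b_i , \quad i = 1, \dots, m .
\]

With m = 1 this reduces to the single-constraint knapsack problem; it is the m >= 2 case that makes the selection of customisations under both area and power constraints strongly NP-hard.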

Research in extensible processor platform exploration has largely revolved around a single architectural customisation (either predefined blocks, extensible instructions or parameterisations) under a single constraint. A number of researchers have described methods to include/exclude predefined blocks to customise very long instruction word (VLIW) and explicitly parallel instruction computing (EPIC) processors [26,28,120]. Choi, Yi, Lee, Park and Kyung [64] presented a method to select intellectual properties to increase the performance of an application.

Gupta, Ko and Barua [95] described an automatic method to select among processor options under an area constraint. For extending instructions, Lee, Choi and Dutt [138] proposed an instruction set synthesis for reducing the energy-delay product of application specific processors through optimal instruction encoding. Various methods to generate extensible instructions automatically from basic, frequently occurring operation patterns have also been devised [33, 47, 62, 192].

Figure 2.12: An example demonstrating the complexity of instruction selection (given an application and design constraints of Area < 62,000 gates and Power < 170 mW, select among five extensible instructions with characteristics (Area, Power, Speedup) - Inst1: 30,000 gates, 85 mW, 6x; Inst2: 20,000 gates, 60 mW, 5x; Inst3: 10,000 gates, 30 mW, 2x; Inst4: 12,000 gates, 25 mW, 3x; Inst5: 13,000 gates, 20 mW, 2x - so as to maximise the performance of the application)

Parameterisation of the extensible processor platform involves setting the register file size, the instruction and data cache sizes, and the memory configuration. Jain, Wehmeyer, Steinke, Marwedel and Balakrishnan [117] proposed a method to evaluate the register file size under an area constraint. Methods to optimise the memory size of the embedded software in a processor have also been proposed [51, 199]. Finally, Abraham and Rau [25] described a scheme to select instruction and data cache sizes.

Abraham and Rau [25] and Lee, Choi and Dutt [137] presented examples of design exploration with multiple architectural customisations under a single constraint. The PICO system was proposed to explore the design space of non-programmable hardware accelerators (NPAs) and of memory and cache parameters for VLIW processors [25]. The PICO system produces a set of sub-optimal solutions using a divide-and-conquer approach and defers the final constraint tradeoffs to the designer; it is the cornerstone of the system provided by IPflex, Inc. Lee, Choi and Dutt [137] proposed a heuristic design space exploration for encoded instructions and parameter settings, trading off area overhead against performance. By introducing multiple architectural customisation selections, the design space exploration is extended in such a way that the choice can be between generating/selecting extensible instructions and including entirely different predefined blocks. For example, if an application has a floating-point (fp) multiplication that takes up significant execution time, there are at least two possible solutions to speed up the application: i) include an fp predefined block (which has fp registers and is able to execute fp addition, fp multiplication, etc.) in the processor; or ii) generate/select an fp multiplication instruction in the processor. The first solution may be overkill if the application only has one fp operation. On the other hand, since the fp multiplication instruction in the second solution is generated by designers, its speedup and area may not be ideal when compared to the vendor's fp predefined block. These kinds of situations arise when predefined blocks are included, extending the existing design space beyond simply identifying/selecting extensible instructions.

There is very little work on extensible processor platform exploration under multiple constraints when predefined blocks and extensible instructions are involved. Often, research in extensible processor platform exploration focuses only on the area constraint while the energy constraint is neglected [137, 193]. This naive assumption rests on the fact that improving performance usually reduces the energy consumption of the program running on the custom processor. However, when multiple design constraints are given, the difficulty of the tradeoffs increases exponentially. Despite this fact, the selection problem under multiple constraints has been studied for more than a decade in other research areas and is often formalised as a multidimensional knapsack problem. Chu and Beasley [65] proposed a genetic algorithm to solve the multidimensional knapsack problem, introducing a heuristic operator with problem-specific knowledge.

Chekuri and Khanna [52] described a polynomial time approximation scheme based on guessing sub-optimal items for a multi-dimensional knapsack problem.

2.4.4 Processor Evaluation and Estimation

Processor evaluation and estimation involves verifying the newly configured extensible processor (consisting of predefined blocks, extensible instructions and parameter settings) to determine whether it meets the desired design constraints. Evaluation and estimation approaches range from gate-level simulation, RTL-level estimation and instruction set simulation to abstract high-level estimation, verifying area, power dissipation and timing latency, and measuring application performance. Significant expertise and verification effort are essential in this step, and they significantly affect the accuracy of the processor's evaluation. The tradeoff between accuracy and speed of evaluation is the main difference between the various evaluation and estimation methods. Figure 2.13 shows various evaluation and estimation methods in terms of accuracy and time.

In this section, the methodologies for evaluating and estimating extensible processor characteristics are described in detail.

Gate-level simulation involves simulating the physical characteristics of the extensible processor with application data inputs, and is often used to verify power consumption and functionality. The inputs of this simulation are the gate-level description of the extensible processor, the silicon technology library, and the data inputs that capture the characteristics of the application, while the output is the target simulation results. Extensible processor platform vendors often only provide the RTL description of the extensible processor, leaving the designers to generate the gate-level description from the RTL description with their silicon technology library and synthesis tools.

Figure 2.13: Accuracy and time tradeoffs between different approaches (from high-level abstraction estimation, which is the fastest but least accurate, through instruction set simulation and RTL-level estimation/synthesis, to gate-level simulation, which is the slowest but most accurate)

These synthesis tools are often provided by logic synthesis companies such as Synopsys, Inc. [6], Cadence, Inc. [4], and Magma, Inc. [3]. A simulation from the gate-level description is one of the most accurate simulations for power consumption and timing.

However, significant simulation time may be consumed, depending on the amount of data inputs.

RTL-level estimation serves to evaluate the area, power and delay of the RTL description of the extensible processor, which is provided by the extensible processor platform vendors. Since predefined blocks and base processors are predesigned, RTL-level synthesis is often applied only to the synthesis of the extensible instructions. This synthesis determines whether the latency and delay of extensible instructions satisfy the clock period of the base processor, and whether the extensible instructions fit well inside the execution stage of the processor. The length of RTL synthesis time depends on the complexity of the extensible instructions as well as the constraints set by the designer. This approach therefore requires a great deal of expertise and is a relatively time consuming process (taking hours and possibly days). As time-to-market pressure mounts, RTL-level estimation has been introduced to estimate the area, power and delay of the RTL description. These estimation methods largely focus on estimating the power dissipation of the extensible processor. Vendors that provide such tools include Sequence Inc. (PowerTheatre) [19] and Synopsys, Inc. (Prime Power) [6]. This estimation process is relatively fast (it can be done in minutes), and the accuracy of these methods is often within 20% of the gate-level simulation results.

Instruction set simulation simulates the performance of the application on the newly configured extensible processor (consisting of the base instruction set, extensible instructions, and instructions associated with predefined blocks). This simulation requires input data sets to simulate the application performance. It is often conducted with a cycle-accurate instruction set simulator, which is specifically suited to the target extensible processor architecture. Instruction set simulators can be either generated for the target extensible processor from its architecture description language (such as in LISATek [12]) or predesigned and provided by vendors such as Tensilica, Inc. [23]. The predesigned simulators require the setting of the extensible processor configuration in order to run the application with its data sets. The runtime of these simulations can vary from minutes to days, depending on the amount of input data and the complexity of the application. In addition, input data sets must be large enough to sufficiently capture the application characteristics, so as to truly reflect the application's performance.

In order to further reduce time-to-market pressure, research into abstract high-level estimation for extensible processors has been carried out. Gupta, Sharma, Balakrishnan and Malik [96] proposed a processor evaluation methodology to quickly estimate the performance improvement when architectural modifications are made. Jain, Balakrishnan and Kumar [115, 116] proposed methodologies to evaluate the register file size, register windows and cache configuration in an extensible processor design. By selecting an optimum register file size, they were able to reduce area and energy consumption significantly. Bhatt, Balakrishnan and Kumar [37] also proposed a methodology to evaluate the number of register windows needed in processor synthesis. Fei, Ravi, Raghunathan and Jha [82] described a hybrid methodology for estimating the energy consumption of extensible processors; however, the proposed energy model does not include the schedule of operations or instructions. Jacome, Veciana and Lapinskii [113] proposed an algorithm to evaluate performance tradeoffs in VLIW processors with clustered datapaths. Jha and Dutt [118] proposed a rapid estimation scheme for area and power utilising parameterised components in high-level synthesis. Sanghavi and Wang [179] proposed a method to estimate the speed, area, and power consumption of software intellectual property at the architectural level. Bona, Sami, Sciuto, Silvano, Zaccaria and Zafalon [40] described a method for processor energy estimation based on instruction clustering. While several methods for estimating speed, area and power are presented in the high-level synthesis literature, they often relate to estimation for a fully customised circuit, whereas an extensible instruction is a partially customised circuit, surrounded by the built-in control logic of the extensible processor, storage areas, and busses connecting the instruction to the storage areas.

This chapter has described a wide range of design approaches for embedded systems and various architectures of application specific instruction-set processors. It introduced reasons for using extensible processor platforms, and showed that the extensible processor platform is the state-of-the-art design approach for today's embedded systems. This chapter also introduced the design problems related to the extensible processor platform, namely code segment identification, instruction generation, architectural customisation selection, and processor evaluation and estimation, and described the state-of-the-art work addressing these issues. In the next chapter, we present our proposed design methodologies to further address these design problems and show how our methodologies advance the existing work.

Chapter 3

Methodology Overview

This chapter presents an overview of the suite of design automation methodologies we propose for the extensible processor platform. We first review the existing design flow for extensible processors and its current problems, and then summarise the state-of-the-art research addressing these problems (detailed in the previous chapter).

Our proposed design methodologies are then presented with a description of how our methodologies differ from previous work and how they fit into the existing design

flow. Each design methodology is described individually and then presented together as a single design system. The complete system significantly improves upon the state-of-the-art research. The contributions of the thesis are presented at the end of this chapter.

3.1 Existing Design Flow

Figure 3.1 shows an existing design flow for the extensible processor platform. The design goal for the extensible processor is to maximise the performance of an embedded application while meeting design constraints such as area overhead and power consumption. The designer compiles, analyses and profiles the application using an Instruction Set Simulator of the target processor. The profiling reveals computationally intensive code segments for which possible instruction extensions, inclusion/exclusion of predefined blocks or parameterisations might improve performance and design characteristics. After identifying a set of computationally intensive code segments, the designer first generates extensible instructions for these code segments. Next, the designer selects a set of possible extensible instructions, predefined blocks, or parameter settings to speed up the code segments, by defining these customisations in the extensible processor. To evaluate these customisations, the designer can use the available

(retargetable) tools to determine whether the application can meet design constraints.

In addition, the designer can iterate this step during the design space exploration.

Once the application meets the specified design constraints, the platform uses the base processor configurations, predefined blocks and extensible instructions to generate the extensible processor’s synthesisable RTL. This synthesisable RTL can be taped out or prototyped.

As discussed in the previous chapter, there are four problems in the extensible processor design flow:

1. Code Segment Identification - Code segments are groups of computationally intensive primitive instructions that take up considerable execution time in the application. By replacing code segments with possible architectural customisations (including instruction extensions, inclusion/exclusion of predefined blocks and parameterisations), the application performance can be significantly boosted in exchange for an incremental increase in area and power consumption. The code segment identification step involves identifying computationally intensive code segments in the application. Although a variety of methods for code segment identification have been proposed (such as matching algorithms [27,32,124], profiled graph representation [33,67,170,192,193], and high-level abstraction [68,184,203]), these methods have their own disadvantages, as described in Chapter 2. In fact, the primary problem encountered during code segment identification is that there is an enormous number of candidates within each code segment that are implementable as extensible instructions. Furthermore, it is not until after synthesis that the physical characteristics of extensible instructions (such as area overhead, power consumption and latency) are known; thus, the suitability of extensible instructions is unknown until after implementation. Therefore, identifying suitable code segments is a critically important step to increase the performance of an application.

Figure 3.1: A generic existing design flow of the extensible processor platform

2. Instruction Generation - After computationally intensive code segments are identified, extensible instructions are generated by specifying new hardware resources and the operations that the code segments perform. Designing extensible instructions for extensible processors is a computationally complex task because of the large design space to which extensible instructions are exposed. There are numerous ways to generate extensible instructions, such as selecting different components, parallelism techniques and diverse schedules. Approaches to instruction generation range from instruction set synthesis [62, 107, 109] and instruction set extension [39, 46, 67, 70] to extensible instruction generation [33, 88, 137, 192, 193]. The majority of these approaches focus on combining a large group of primitive instructions into a single extensible instruction to maximise performance. However, this often leads to large power dissipation and discharge current, posing a challenge for battery-powered products. Therefore, it is vital to automatically generate instructions that explore all possible designs to satisfy the power dissipation constraint.

3. Architectural Customisation Selection - After identifying extensible instructions, predefined blocks, and parameterisations for code segments, the designer then selects amongst these architectural customisations to satisfy design constraints. This simplified form of the selection problem can be formulated as the well-known knapsack problem, with single or multiple constraints. Research into extensible processor platform exploration has largely revolved around a single architectural customisation under a single constraint [26, 28, 33, 47, 62, 120, 192]. Design exploration with multiple architectural customisations under a single constraint in embedded systems is presented in [25, 137, 193]. The design space with multiple architectural customisations is extremely large; therefore, efficient and effective algorithms must be developed to select multiple architectural customisations in the extensible processor platform, ensuring design constraints are satisfied.

4. Processor Evaluation and Estimation - After selecting architectural customisations in the extensible processor, the designer evaluates the processor's design characteristics. Evaluation and estimation methods range from gate-level simulation [4,6], RTL-level synthesis [15,19] and instruction set simulation [1,12,23] to high-level abstraction estimation [40, 82, 96, 117, 179]. However, significant expertise and verification effort are necessary for each extensible processor, which significantly affects the accuracy of evaluation. Furthermore, there are hundreds (even thousands) of extensible processor configurations for an application, meaning that the evaluation process can be extremely time consuming. Therefore, it is essential to evaluate the design characteristics of a newly configured extensible processor accurately and quickly.

3.2 Overview of Our Automation Methodologies

Our design automation methodologies aim to solve the design problems discussed, explore a greater design space in a shorter amount of time, and achieve improved embedded systems for an application. Our suite of design automation methodologies includes:

1. An Identification Scheme;

2. An Instructions Estimation Model;

3. An Instruction Generation Method;

4. A Tool To Match Instructions and Code Segments;

5. A Two-level Hierarchy Selection Algorithm; and

6. A Novel Estimation Function.

1. An Identification Scheme - An identification scheme is used to identify computationally intensive code segments. This begins with a profiled application in which the computationally intensive functions/subroutines (or lines) are identified using Instruction Set Simulation. All combinations of computationally intensive lines within each function are then grouped to form a list of code segments. This scheme uses a fitting function to quantify the characteristics of a code segment and match them to the hardware qualities (i.e., speedup and power saving) of the extensible instruction to be implemented. In other words, this function interprets the high-level characteristics of a code segment in an application to predict the physical characteristics of the implemented extensible instruction. If a code segment has a low speedup/area ratio, it is pruned from the design space to reduce the design and verification time for the extensible instruction (a minimal sketch of this pruning step is given after this list). The advantage of this fitting function is its ability to identify thousands of code segments in an application quickly. By searching within individual computationally intensive C functions, the exponential blowup problem is reduced, as each individual line is considered (rather than each operation). In addition, the boundary of searching connected subgraphs is extended (which is the limitation of the work described in [67, 193]). Our method also searches between connected subgraphs in the DFG, which advances the state-of-the-art research in this area.

2. Instructions Estimation Model - This is a fast and accurate estimation model to predict the area overhead, power consumption, and latency of instructions. The model is derived using system decomposition theory and regression analysis. The inputs of the estimation model are a code segment and the design constraints, and the output is a proposed extensible instruction, with components and parallelism techniques chosen to satisfy the design constraints. The design space of extensible instructions is enormous due to the wide range of customisations, such as selecting different components and applying various parallelism techniques. By applying the estimation model, designers are able to rapidly explore the design space for extensible instructions.

3. An Instruction Generation Method - Instruction generation automatically generates extensible instructions for a given code segment. Previous approaches to instruction generation often combined a large group of primitive instructions into a single extensible instruction to maximise performance. Our instruction generation method instead proposes the separation of instructions and the utilisation of the slack of the instructions in a fine-grain approach. This method not only achieves performance enhancement (of an order of magnitude), but also minimises the energy consumption of extensible instructions. Furthermore, the method uses an estimation model to evaluate the performance and power consumption of the instructions.

4. A Tool To Match Instructions and Code Segments - This instruction matching tool finds a functionally equivalent extensible instruction for an identified code segment using a combinational equivalence approach. The pre-designed extensible instructions are stored in a library, which has often been designed previously for an application. The purpose of this tool is to find as many pre-designed extensible instructions as possible to reuse in the processor, thus minimising the design and verification time for new extensible instructions. We propose a novel matching tool as part of our suite of design tools.

5. A Two-level Hierarchy Selection Algorithm - This two-level hierarchy selection algorithm selects extensible instructions, predefined blocks and parameterisations for an extensible processor, to maximise the performance of the application while satisfying a given area constraint. The algorithm first selects a pre-configured processor that combines predefined blocks and parameterisations with a base processor, and then selects a set of extensible instructions. The pre-configured processor uses designer inputs of predefined blocks and parameterisations to prune the design space. Our two-level hierarchical approach solves the problem of selecting multiple architectural customisations and differs from approaches that only consider instruction identification/selection [33,67,70,193,204], thereby advancing the state-of-the-art research in this area.

6. A Novel Estimation Function - An estimation function evaluates the performance of the newly created extensible processor using the profiling information and the latency of the selected extensible instructions and the pre-configured processor. The latency of the selected extensible instructions and the pre-configured processor directly affects the processor clock speed. By obtaining the latency and profiling information, this function can predict the execution time accurately and quickly.
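As a minimal sketch of the pruning step referred to in item 1 above (in C, with illustrative names of ours; the actual fitting function is defined later in the thesis), candidate code segments carrying a precomputed fitting value are ranked, and everything at or below the threshold α is discarded:

    #include <stdlib.h>

    /* Hypothetical record of a candidate code segment extracted from
     * the profiled application. */
    struct code_segment {
        const char *name;      /* e.g. function name plus line range     */
        double      exec_frac; /* fraction of execution time (profiling) */
        double      fit;       /* fitting value: predicted speedup/area  */
    };

    /* Sort in descending order of fitting value. */
    static int by_fit_desc(const void *x, const void *y)
    {
        const struct code_segment *p = x, *q = y;
        return (p->fit < q->fit) - (p->fit > q->fit);
    }

    /* Rank candidates and prune those not above alpha (0.001 in this
     * thesis); the survivors go on to instruction design. */
    size_t rank_and_prune(struct code_segment *cs, size_t n, double alpha)
    {
        qsort(cs, n, sizeof cs[0], by_fit_desc);
        size_t kept = 0;
        while (kept < n && cs[kept].fit > alpha)
            kept++;
        return kept;           /* cs[0..kept) survive the pruning step */
    }

The point of the scheme is that fit is computed from high-level features alone, so thousands of candidates can be ranked without synthesising any of them.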

3.3 Modified Design Flow for Extensible Processors

Figure 3.2 shows the modified design flow for the extensible processor platform.

Our methodologies are displayed in different colours to show how they fit into the existing design flow. They can be divided into four parts: i) a semi-automatic design system (purple boxes in Figure 3.2); ii) an instruction matching tool (orange box); iii) an instructions estimation model (blue box); and iv) an instructions generation method (green box). Each part of our design automation methodologies is described in the following chapters.

In this modified design flow, the designer begins in the same way, with compilation, analysis and profiling of the application using an Instruction Set Simulator of the target processor. The profiling reveals computationally intensive code segments for which possible instruction extensions, inclusion/exclusion of predefined blocks or parameterisations might improve performance and design characteristics. In the first phase of the semi-automatic design system, a fitting function maps the characteristics of the code segments to hardware qualities such as speedup and power saving, thus identifying suitable code segments that can be implemented as extensible instructions. After a set of computationally intensive code segments is identified, the designer has two choices: i) generate new extensible instructions; and/or ii) find pre-designed extensible instructions. If the designer chooses to generate new extensible instructions, a fast and accurate instruction estimation model explores the design space in order to generate extensible instructions for each code segment. The model estimates the area overhead, power consumption, and latency of the possible extensible instructions that can be implemented for a code segment, thereby recommending a set of extensible instructions for each code segment. Next, extensible instructions are generated using the instruction generation methodology, which satisfies the performance, area overhead, delay, and energy consumption constraints. On the other hand, if the designer chooses to look for pre-designed instructions, an automated instruction matching tool is used to find a pre-designed, functionally equivalent extensible instruction for the identified code segment. Utilising these two choices, a set of new and pre-designed extensible instructions is defined for all identified code segments. The designer then uses a two-level hierarchy selection algorithm to select extensible instructions, predefined blocks and parameterisations for the possible code segments in an extensible processor, in order

to satisfy the design constraints.

Figure 3.2: Our design methodologies for an extensible processor platform

To evaluate these architectural customisations, the design system uses the novel estimation function, combined with the latency of the selected extensible instructions and the pre-configured processor, to provide fast and accurate performance analysis. In addition, the designer can iterate this step during the design space exploration. Once the application meets the design constraints, the platform uses the base processor configurations, predefined blocks and extensible instructions to generate the extensible processor's synthesisable RTL for tape-out or prototyping.

3.4 Contributions

The main contribution of this thesis is to automate the design flow of the extensible processor platform, which significantly shortens the design time and explores a larger design space, achieving better design metric tradeoffs. This is achieved by a suite of design automation methodologies that includes a semi-automatic design system; an instruction matching tool; an instructions estimation model; and an instructions generation method. The other contributions of this thesis are as follows:

1. The semi-automatic design system maximises application performance (on average 4.74× (up to 15.71×)) while satisfying a given area constraint in a short design time (2.5% of the full simulation time), with the majority of Pareto points obtained (91% on average), by specifying:

• An identification scheme that is able to automatically identify suitable code segments within an application (as opposed to an error-prone manual process), so that these code segments can be translated to instructions within the processor. The fitting function can predict the speedup/area ratio of the extensible instruction to be implemented for a code segment.

• A two-level hierarchy selection algorithm to first select a pre-defined processor, and then to select the right instruction set for this extensible processor, so that design constraints such as performance are satisfied. By using a set of pre-configured processors and a pre-designed library of extensible instructions to prune the design space of the extensible processor, design turnaround time is reduced significantly.

• A performance estimator to estimate an application's performance (rather than running each configuration repeatedly through an Instruction Set Simulator), which minimises evaluation time.

2. An instruction matching tool automates the instruction matching step and is superior to computationally intensive and error-prone simulation approaches. The use of functional equivalence checking ensures that the results are independent of the programming style of the application. This tool enables a reduction in verification time and enhances the reusability of extensible instructions. Our instruction matching tool is 7.3× faster on average compared to the best known approaches to the problem (partial simulations).

3. A fast and accurate estimation model (for area overhead, latency, and power consumption) of extensible instructions is derived. This model simplifies the process of modelling extensible instructions by using system decomposition and regression analysis. Both parallelism techniques and schedule alternatives for instruction models are taken into account. This model enhances design exploration for extensible instructions, and is fast and accurate: it has a mean absolute error as small as 3.4% (6.7% max.) for area overhead, 5.9% (9.4% max.) for latency, and 4.2% (7.2% max.) for power consumption, compared to estimation through the time consuming synthesis and simulation steps using commercial tools.

4. An instruction generation tool reduces the power dissipation of extensible instructions by separating instructions and utilising the slack of the instruction, and explores fine-grain granularity in instruction generation. For the first time, battery lifetime (a battery behaviour model) is taken into account in generating extensible instructions, as opposed to simply shortening the execution time, which leads to energy reduction. The instruction generation tool reduces energy consumption by a further 5.8% on average (up to 17.7%) compared to extensible instructions generated by previous approaches.

We have evaluated our design methodologies through experiments in the context of a commercial design flow (using Tensilica's Xtensa processor), indicating that these methodologies can work co-operatively with existing extensible processor platforms.

Chapter 4

Semi-automatic Design System

This chapter presents a semi-automatic design system for configuring an extensible processor, which maximises the performance of an application while satisfying the area constraint. The design system consists of a methodology for identifying suitable code segments to implement as extensible instructions; a two-level hierarchy selection algorithm for selecting a pre-configured processor (with predefined blocks included and parameters configured) and extensible instructions to generate an extensible processor; and an estimation function to rapidly estimate the performance of the application on the newly configured extensible processor.

4.1 Motivations

The motivation for the work described in this chapter is in four parts. These are: i) identifying code segments; ii) generating extensible instructions; iii) exploring the design space using predefined blocks, extensible instructions, and parameters; and iv) estimating application performance.

Identifying: Although profiling identifies frequently occurring code segments, there is an enormous number of suitable candidates within each code segment that can be implemented as extensible instructions. Furthermore, the suitability of a code segment for conversion to an instruction is not known until after synthesis. Figure 4.1 shows an example of a frequently occurring code segment in an application.


1  static int fmult (int an, int srn) {
2      short anmag, anexp, anmant, wanexp, wanmant, retval;
3      anmag = (an > 0) ? an : ((-an) & 0x1FFFF);
4      anexp = quan(anmag, power2, 15) - 6;
5      anmant = (anmag == 0) ? 32 : (anexp >= 0) ? anmag >> anexp : anmag << -anexp;
6      wanexp = anexp + ((srn >> 6) & 0xF) - 13;
7      wanmant = (anmant * (srn & 077) + 0x30) >> 4;
8      retval = (wanexp >= 0) ? ((wanmant << wanexp) & 0x7FFF) : (wanmant >> -wanexp);
9      return (((an ^ srn) < 0) ? -retval : retval);
10 }

Figure 4.1: Motivation example

Each line within the code segment (lines 3-9) can be implemented as one or more extensible instructions. Alternatively, lines 3-4 can form one instruction, lines 4-6 a separate instruction, and lines 6-9 a third instruction. Even for this simple code segment, there are hundreds of candidates to be implemented as instructions. Each candidate can have different characteristics such as speedup, area overhead, latency, and power consumption. Thus, rapidly identifying code segments to implement as instructions is necessary to speed up the design process.

Generating: The process of creating an extensible instruction is error-prone and time consuming (it usually takes a number of days). Even for the simple code segment shown above, the design time for implementing the hundreds of combinations as extensible instructions is intractable. Therefore, it is essential to create extensible instructions in a reusable form.

Exploring: The extensible processor design space, with predefined blocks, extensible instructions, and additional customisable parameters, is large and complex. In addition, selecting predefined blocks, extensible instructions, and parameters for an extensible processor to maximise the performance of an application while satisfying the constraints is an NP-hard problem [84].

Evaluating: The evaluation of the application performance using an Instruction Set

Simulator of the target processor in a large design space is a time-consuming process.

4.2 System Overview

This section first presents an overview of the entire design flow, and then describes the important phases and steps in detail. Figure 4.2 shows the design flow of the semi-automatic design system. The inputs of the semi-automatic design system consist of: an application written in C/C++, a set of pre-configured processors, a library of extensible instructions, and an area constraint. The set of pre-configured processors contains different processor configurations; e.g., processor 1 contains the base processor only; processor 2 contains the base processor with a multiplier; and processor 3 is the base processor with a floating point unit. The extensible instruction library contains a set of extensible instructions and their associated characteristics such as area, latency and speedup.

The output is an extensible processor (with predefined blocks, extensible instructions, and parameters). The goal of this design system is to configure an extensible processor by selecting a pre-configured processor and extensible instructions, in order to maximise application performance while satisfying the given area constraint. The design flow of the design system involves 11 individual steps separated into four phases. These are detailed below.

The first step (step 1) in the design system compiles an application using the compiler of the target processor. The application is then simulated for each pre-configured processor1 using an Instruction Set Simulator to obtain the execution trace and to find

1Note that this step implies that at least one of the set of pre-configured processors will meet the constraints of a specific application. This first step is a major designer's input that allows the designer to provide the system with domain-specific architectural features without fixing the processor core.

Figure 4.2: Our semi-automatic system design flow for configuring an extensible processor (double square box: commercial tools; grey box: our contributions)

execution characteristics for all functions/subroutines2 within the application. Computationally intensive functions within the application are referred to as critical functions

(cf) of the application, and are considered as possible candidates for implementation as instructions.

In phase I (steps 2-3), the speedup/area ratio, EP, and the area, AE, are computed for each pre-configured processor. This phase then selects the pre-configured processor with the highest speedup/area ratio that also has an area less than the given area constraint. The reason for selecting the pre-configured processor with the highest speedup/area ratio, rather than the pre-configured processor with the highest speedup, is to take into account the additional area needed for extensible instructions to further enhance the performance of the application.

In phase II (steps 4-8), the system exhaustively searches the list of critical functions for all possible combinations of code segments consisting of consecutive lines of code, as sketched below. The code segments are then ranked according to our cost function, Fitting, and code segments with a value greater than α (in our case, α is 0.001) are selected for implementation; the choice of 0.001 is based solely on the designer's expertise. For each "selected code segment", we manually check whether one or more functionally equivalent implementations exist in the extensible instruction library. If no such implementation exists, the extensible instruction is designed and synthesised manually and inserted into the instruction library.
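To make the enumeration concrete, the following is a minimal sketch (not the system's actual implementation) of step 4: every candidate code segment is a contiguous range of lines within a critical function, so a function of n lines yields n(n+1)/2 candidates, each of which is then scored by the fitting function in step 5.

    #include <stdio.h>

    /* A minimal sketch (not the system's implementation) of step 4:
     * enumerate every code segment made of consecutive lines within a
     * hypothetical critical function of n lines. */
    int main(void) {
        int n = 7;   /* number of lines in the critical function */
        int candidates = 0;

        for (int start = 1; start <= n; start++)
            for (int end = start; end <= n; end++) {
                printf("candidate segment: lines %d-%d\n", start, end);
                candidates++;
            }
        printf("%d candidates (= n*(n+1)/2)\n", candidates);
        return 0;
    }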

In phase III (steps 9-10), the instruction selection algorithm is executed to greedily select a set of extensible instructions using the speedup/area ratio, PSAR, and the area, AE. The extensible instruction with the highest speedup/area ratio is repeatedly selected until either all extensible instructions are selected or the given area constraint is reached.

Finally, in phase IV (step 11), our performance estimation model, ETE, is applied to evaluate the execution time of the application on the newly created extensible processor.

4.2.1 Phase I: Pre-configured Processor Selection

Phase I of the design system selects a pre-configured processor with a high speedup/area ratio (outlined in steps 2-3 in Figure 4.2). The inputs of phase I are the characteristics of each pre-configured processor, such as area, clock rate, etc., and the profiling results of the application obtained from step 1. This phase first ranks each pre-configured processor using the cost function EP_i, and then selects the pre-configured processor with the highest value of the cost function. The cost function EP_i of pre-configured processor i for an application is defined as:

EP_i = 1 / (CC_i × Clk_PD_i × Area_Proc_i)    (4.1)

where CC_i is the total cycle count of the application run on pre-configured processor i, Clk_PD_i is the clock period of the processor, and Area_Proc_i is the area of the processor. This function is inversely proportional to the area-delay product.
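As an illustration, the minimal sketch below applies equation 4.1 together with the area check of step 3. The cycle counts and clock periods are illustrative only; the areas are those of P1-P3 from Table 4.1, used here as a stand-in for the full AE computation.

    #include <stdio.h>

    /* A minimal sketch of the phase I selection: compute EP_i for each
     * pre-configured processor and pick the one with the highest EP_i
     * whose area fits the constraint.  Values are illustrative. */
    int main(void) {
        double cc[]   = { 9.0e8, 6.5e8, 5.0e8 };    /* cycle counts      */
        double clk[]  = { 4.5e-9, 4.6e-9, 5.0e-9 }; /* clock periods [s] */
        double area[] = { 69680, 77670, 105200 };   /* areas [gates]     */
        double constraint = 100000;                 /* area constraint   */
        int best = -1;
        double best_ep = 0.0;

        for (int i = 0; i < 3; i++) {
            double ep = 1.0 / (cc[i] * clk[i] * area[i]);  /* eq. 4.1 */
            if (area[i] < constraint && ep > best_ep) {
                best_ep = ep;
                best = i;
            }
        }
        printf("selected pre-configured processor: P%d\n", best + 1);
        return 0;
    }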

4.2.2 Phase II: Instruction Identification Model

The inputs for the instruction identification model are the list of critical functions from the profiling step and the pre-configured processor selected during phase I. A critical function is one that contributes more than θ% of the total execution time, where θ is the designer's input (in our case, θ is 5). The identification consists of five steps:

1. Exhaustively search for the list of critical code segments within the critical functions (step 4 in Figure 4.2);

2. Identify and rank the list of critical code segments using a fitting function (step 5). The fitting function is described in the next section;

3. Check whether an equivalent implementation is part of our extensible instruction library (step 6);

4. If there is no instruction in the library equivalent to the code segment, implement the code segment as an extensible instruction and characterise the instruction using the Xtensa development tools from Tensilica, Inc. [23] and Design Compiler from Synopsys, Inc. [6] with associated scripts (steps 7a-7d);

5. If there is an equivalent instruction that matches the code segment, then move down to the next item in the list of code segments (step 8). Note that matching code segments against the instructions in the library is currently performed manually. This methodology outputs a set of extensible instructions, each with its area, latency and speedup. These extensible instructions are added to the extensible instruction library for reuse and can be selected in later phases of the design flow.

This instruction identification model is somewhat similar to the work by Sun, Ravi, Raghunathan and Jha [193] and Clark, Zhong and Mahlke [67]. Sun et al. proposed to rank computationally intensive functions and lines (identified by profiling) using a priority function, and then select the high-priority functions and lines, pruning the number of functions and lines. Clark et al. proposed to use a guide function to prune the search direction (as opposed to just pruning functions and lines), allowing lowly ranked code segments to grow into more useful code segments; however, Clark et al. only considered connected subgraphs. Our model extends Sun's work by searching all possible combinations within the computationally intensive functions/subroutines identified by profiling, exploring a larger design space to identify more potential code segments. In addition, our model breaks the boundary of connected subgraphs by searching code segments within computationally intensive functions, as opposed to only considering connected subgraphs. Furthermore, the use of a fitting function to rank code segments by speedup and area constraint, pruning code segments which do not reach a certain threshold, is most similar to Sun's priority function [193].

Fitting Function

The aim of this phase is to extract the characteristics of the code segment using a cost function, namely the fitting function, in order to predict the speedup/area ratio of each extensible instruction. The fitting function, Fitting, is derived from studying manually performed extensible processor designs and extracts four characteristics of the code segment:

1. The frequency of use, FU_x, indicates how often a code segment is executed in the application. FU_x is obtained from the execution traces of the application program. Moving frequently executed segments to extensible instructions is likely to have a great impact upon the performance of the application.

2. If the number of operands to be implemented as an instruction is three or fewer (less than or equal to two source operands and one destination operand), then the instruction can be implemented fairly easily, as processors typically have two source busses and one destination bus going to the ALU. When the number of operands exceeds this, multiple cycles are needed to ferry the operands to the newly created functional unit, increasing the latency of the operation. This is reflected as NO_x in the cost function.

3. A large proportion of bit operations in a segment favours its implementation in hardware, since such an instruction requires a small cycle count and yields a high performance gain for the application program. The amount of bit operations in a segment is reflected in the cost function as BO_x.

4. The type of operands, TO_x, in an instruction relates to the type of register file and the manipulation of the operands. If the types of operands differ, the processor needs extra registers or even custom-designed registers, which increase the area of an extensible instruction. If manipulation of the operands is needed, the latency of the instruction also increases. The increase in latency and area is reflected as TO_x in the cost function.

The fitting function, Fitting_x, is defined as:

Fitting_x = FU_x × (1 / ⌈NO_x / α⌉) × TO_x × BO_x    (4.2)

where FU_x is the frequency of use of code segment x; NO_x is the number of operands in code segment x; TO_x is the percentage of integer (short) type operands among all the operands (char is counted as an integer); BO_x is the percentage of bit operations among all the operations; and α is the ideal number of operands in a code segment (in our case, less than or equal to two inputs and one output).

Figure 4.3 gives an example of how the fitting function is used. The example consists of two segments, fmult and quan, where fmult uses up 22% of the execution time. The number of operands of this function is 3 (namely an, srn, retval). The operation types in the function are mostly bit operations (i.e. and, left shift, right shift, etc.), so BO_fmult = 0.8. The types of operands are integers, and therefore TO_fmult = 1. Thus the value of the fitting function is 0.176, whereas quan yields 0.28, indicating that there is a higher benefit in quan than in fmult (a cross-check of this arithmetic is sketched after Figure 4.3).

    static short power2[15] = {1, 2, 4, 8, 0x10, 0x20, 0x40,
                               0x80, 0x100, 0x200, 0x400,
                               0x800, 0x1000, 0x2000, 0x4000};

    static int fmult (int an, int srn) {
        short anmag, anexp, anmant, wanexp, wanmant, retval;
        anmag = (an > 0) ? an : ((-an) & 0x1FFFF);
        anexp = quan(anmag, power2, 15) - 6;
        anmant = (anmag == 0) ? 32 : (anexp >= 0)
                 ? anmag >> anexp : anmag << -anexp;
        wanexp = anexp + ((srn >> 6) & 0xF) - 13;
        wanmant = (anmant * (srn & 077) + 0x30) >> 4;
        retval = (wanexp >= 0) ? ((wanmant << wanexp)
                 & 0x7FFF) : (wanmant >> -wanexp);
        return (((an ^ srn) < 0) ? -retval : retval);
    }

    static int quan (int val, short *table, int size) {
        int i;
        for (i = 0; i < size; i++)
            if (val < table[i])
                break;
        return i;
    }

Figure 4.3: An example of a code segment to demonstrate how the fitting function works
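As a cross-check of the worked example above, a minimal sketch of equation 4.2 (assuming α = 3, i.e. two source operands and one destination) reproduces the fmult value of 0.176:

    #include <math.h>
    #include <stdio.h>

    /* A minimal sketch of equation 4.2; the fmult characteristics
     * (FU = 0.22, NO = 3, TO = 1.0, BO = 0.8) are taken from the
     * worked example in the text. */
    double fitting(double fu, int no, double to, double bo, int alpha) {
        return fu * (1.0 / ceil((double)no / alpha)) * to * bo;
    }

    int main(void) {
        printf("Fitting(fmult) = %.3f\n", fitting(0.22, 3, 1.0, 0.8, 3));
        return 0;
    }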

4.2.3 Phase III: Extensible Instruction Selection

Phase III of the design system selects a set of extensible instructions to maximise the performance of an application while satisfying the remaining area constraint. The inputs are the extensible instruction library, the profiling results, and the remaining area constraint. The extensible instruction selection model is based on the speedup/area ratio of each instruction, using the percentage of the total cycle count (%_jk), the area (Area_Inst_j), the speedup (Sp_Inst_jk), and the latency (Latency_j) of each selected extensible instruction, and is defined as:

PSAR_jk = (%_jk × Sp_Inst_jk) / (Area_Inst_j × Max(Latency_j, Clk_PD_k))    (4.3)

PSAR_jk indicates the performance gain per unit area of an application when the extensible instruction is implemented.
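The greedy selection of steps 9-10 can be sketched as follows; all instruction characteristics and the area budget below are illustrative, not taken from the instruction library.

    #include <stdio.h>

    /* A minimal sketch of the phase III greedy loop: compute PSAR for each
     * candidate instruction (equation 4.3) and select instructions in
     * decreasing PSAR order until the remaining area budget is exhausted. */
    #define N 4

    int main(void) {
        double pct[N]   = { 0.22, 0.15, 0.08, 0.05 };  /* %_jk           */
        double sp[N]    = { 3.5, 4.5, 2.0, 1.3 };      /* Sp_Inst_jk     */
        double area[N]  = { 2740, 5810, 950, 180 };    /* Area_Inst_j    */
        double lat[N]   = { 6.0, 7.5, 6.0, 4.33 };     /* Latency_j [ns] */
        double clk_pd   = 4.5;                         /* Clk_PD_k [ns]  */
        double remain   = 8000;                        /* gates left     */
        int selected[N] = { 0 };

        for (;;) {
            int best = -1;
            double best_psar = 0.0;
            for (int j = 0; j < N; j++) {
                if (selected[j] || area[j] > remain) continue;
                double l = lat[j] > clk_pd ? lat[j] : clk_pd;
                double psar = pct[j] * sp[j] / (area[j] * l);  /* eq. 4.3 */
                if (psar > best_psar) { best_psar = psar; best = j; }
            }
            if (best < 0) break;            /* nothing fits any more */
            selected[best] = 1;
            remain -= area[best];
            printf("select instruction %d (remaining area %.0f gates)\n",
                   best, remain);
        }
        return 0;
    }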

4.2.4 Phase IV: Performance Estimation Model

Phase IV estimates the performance of an application executing on the newly created extensible processor. Our performance estimation model is based on the profiling results of the application, the selected pre-configured processor, and the speedups of the selected extensible instructions. The execution time estimation, ETE_jk, for an extensible processor k with a set of selected extensible instructions (from 1 to j) is defined as:

ETE_jk = { CC_k × (1 − Σ_j %_jk) + Σ_j (CC_k × %_jk / Sp_Inst_jk) } × Latency_max    (4.4)

where CC_k is the original total cycle count of the application running on pre-configured processor k.
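A minimal sketch of equation 4.4, with illustrative values for CC_k, %_jk, Sp_Inst_jk and Latency_max:

    #include <stdio.h>

    /* A minimal sketch of equation 4.4: estimate the execution time on the
     * customised processor from the original cycle count CC_k, the cycle
     * fraction %_jk and speedup of each selected instruction, and the
     * longest latency in the design.  Values are illustrative. */
    int main(void) {
        double cc_k    = 9.0e8;              /* original cycle count    */
        double pct[]   = { 0.22, 0.15 };     /* %_jk of selected insts  */
        double sp[]    = { 3.5, 4.5 };       /* Sp_Inst_jk              */
        double lat_max = 6.0e-9;             /* Latency_max [s]         */
        double covered = 0.0, accelerated = 0.0;

        for (int j = 0; j < 2; j++) {
            covered     += pct[j];
            accelerated += cc_k * pct[j] / sp[j];
        }
        double ete = (cc_k * (1.0 - covered) + accelerated) * lat_max;
        printf("estimated execution time: %.3f s\n", ete);
        return 0;
    }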

In addition, the area estimation used in the selection of the pre-configured processor and extensible instructions is defined as:

AE_jk = Area_base + Σ Area_copr + Σ Area_inst    (4.5)

where Area_base is the number of gates used by the base processor, Area_copr is the additional gates incurred by a selected co-processor, and Area_inst is the number of gates that a selected instruction occupies. The first two terms of equation 4.5 are estimated using the Xtensa generator [23] and the final term is estimated using Design Compiler from Synopsys, Inc. [6]. Since the busses are not changed significantly by the addition of co-processors, the gate count still gives a good indication of the area. The custom register files and any extra tristates inserted for increased bus lengths are reflected by the last term.

4.2.5 Overall Design Flow Algorithm

The overall algorithm brings together the various steps in the semi-automatic design system. We first compile, simulate and profile the application program to obtain a list of critical functions, cf. We then calculate the cost functions for area and speedup/area ratio, AE and EP, of each pre-configured processor. Next, we select the processor with the highest EP value whose AE is less than the area constraint. From the list of critical functions, we exhaustively search all possible combinations of code segments that are consecutive lines of code. We rank the code segments according to our fitting function, CodeSegment, and select code segments that have a CodeSegment value greater than 0.001. Then, if any of the "selected code segments", scs, do not exist in the instruction library, we create the instructions manually and add them to the extensible instruction library; we continue to implement instructions until all the selected code segments are available in the library. Next, we run the instruction selection algorithm to select a set of extensible instructions using the cost functions PSAR (potential speedup/area ratio) and AE. Finally, we perform an estimation, ETE, to check the performance of the created extensible processor. Figure 4.4 shows the overall design flow algorithm for the design system.

4.3 Experimental Results

In this section, we describe our experimental setup and results. We first describe the libraries and applications used, followed by a discussion of our results.

    Overall Design Flow Algorithm() {
        /* Compile, simulate and profile the application */
        Compile the application, then simulate the application program using the ISS;
        Profile the application program (obtain the list of critical functions, cf_i);

        /* Select a pre-configured processor */
        for (i = 1 to w pre-configured processors) do {                    (step 2)
            EP_i = 1 / (CC_i × Clk_PD_i × Area_Proc_i);
            AE_i = Area_base_i + Σ Area_copr_i;
        }
        for (i = 1 to w pre-configured processors) do {                    (step 3)
            if (AE_i < Area_Constraint)
                Select the processor with the highest value of EP_i;
        }

        /* Identify code segments */
        for (i = 1 to x critical functions, cf_i) do {
            Search exhaustively for all code segments, cs_ij;              (step 4)
            for (j = 1 to y code segments, cs_ij, in function i) do {      (step 5)
                CodeSegment_j = FU_j × (1 / ⌈NO_j / α⌉) × TO_j × BO_j;
                if (CodeSegment_j > 0.001)
                    Insert cs_ij into the list of selected code segments, scs_j;
            }
        }

        /* Manually check whether a code segment matches an instruction */
        for (j = 1 to z selected code segments, scs_j) do {                (step 6)
            if (selected code segment scs_j is not in the library) {
                Manually implement the code segment as an instruction;     (step 7a)
                Characterise the instruction;                              (steps 7b-7c)
                Insert the extensible instruction into the library;        (step 7d)
            } else
                Continue the search;                                       (step 8)
        }

        /* Select a set of extensible instructions */
        for (all extensible instructions in the instruction library) do    (step 9)
            PSAR_jk = (%_jk × Sp_Inst_jk) / (Area_Inst_j × Max(Latency_j, Clk_PD_k));
        for (j = the highest PSAR to the lowest PSAR) do {                 (step 10)
            if (Area_Remain > Area_Inst_j) {
                Select Inst_j;
                AE_jk = Area_base_k + Σ Area_copr_k + Σ Area_inst_j;
                Area_Remain = Area_Constraint − AE_jk;
            }
        }

        /* Estimate the execution time */                                  (step 11)
        ETE_jk = {CC_k × (1 − Σ_j %_jk) + Σ_j (CC_k × %_jk / Sp_Inst_jk)} × Latency_max;
    }

Figure 4.4: Overall algorithm of the semi-automatic design system

4.3.1 Experimental Setup

We have set up our design flow (described in Section 4.2) using tools and scripts to design extensible processors for ten real-world applications. The target extensible processor used in our experiments is the Xtensa processor from Tensilica, Inc. [23]. Two libraries have also been created: a pre-configured processor library and a library of pre-designed extensible instructions, which store a set of pre-configured processors and all of the extensible instructions generated through our methodology, respectively. The experiments for synthesising and simulating the instructions were conducted on a dual Sun UltraSPARC III running at 900MHz with 4GB of RAM, while the experiments for simulating performance and evaluating the design system were conducted on an Intel Pentium IV running at 1.5GHz with 512MB of RAM.

We have pre-configured twelve extensible processors in the first library, namely P1, P2, P3, ..., P12. Table 4.1 shows the parameters of these pre-configured processors, such as the clock rate and area, as well as the additional predefined blocks included in each pre-configured processor. We used four different predefined blocks in this experiment: i) a 32-bit multiplier (32b MUL); ii) a floating-point predefined block (FPU); iii) a digital signal processing predefined block (DSP V0810-8), where the memory width is eight bits, the register width is ten bits, and the SIMD width is eight bits; and iv) a digital signal processing predefined block (DSP V1620-8), where the memory width is 16 bits, the register width is 20 bits, and the SIMD width is eight bits. These predefined blocks are the designer's input in order to prune the design space; they were chosen because the benchmarks are in the multimedia domain and the predefined blocks are closely related to multimedia algorithms. In Table 4.1, pre-configured processor P1 is the base processor with no additional predefined blocks; pre-configured processor P2 has a 32-bit multiplier as the predefined block; and pre-configured processor P3 is the base processor with the floating-point predefined block, etc. All pre-configured processors are set up with direct-mapped 1KB instruction and data caches, a 128-bit wide system bus and a generic register file with 64 32-bit registers. In addition, these processors are configured from the T1050.2 version of the Xtensa processor in 0.18µm technology.

    Pre-configured   Core Area   Core Power   Clock Rate   Configurations
    Processor        [Gates]     [mW]         [MHz]
    P1                69,680     113          222          –
    P2                77,670     119          217          32b MUL
    P3               105,200     141          199          FPU
    P4               110,900     146          196          32b MUL, FPU
    P5               130,600     162          185          DSP(V0810-8)
    P6               131,900     163          184          32b MUL, DSP(V0810-8)
    P7               161,200     186          170          FPU, DSP(V0810-8)
    P8               163,900     188          169          32b MUL, FPU, DSP(V0810-8)
    P9               186,700     203          160          DSP(V1620-8)
    P10              192,400     206          158          32b MUL, DSP(V1620-8)
    P11              217,400     218          149          FPU, DSP(V1620-8)
    P12              224,400     221          147          32b MUL, FPU, DSP(V1620-8)

Table 4.1: Characteristics of pre-configured processors

The extensible instruction library contains 45 extensible instructions. Table 4.2 shows the information available to the designer from the extensible instruction library.

The first column is the extensible instruction name. The second column lists the applications that use the extensible instruction; the application shown in bold is the one from which the instruction was derived. The next 14 columns indicate the area, the speedup of the instruction when associated with processors P1, P2, P3, etc., and the latency of the instruction, respectively. These characteristics of the instruction are obtained using an ISS of the target processor [23] and Design Compiler from Synopsys, Inc. [6]. The last column is the fitting function's value of the corresponding code segment under pre-configured processor P1. It should be noted that the fitting function is only comparable within the application from which the instruction was derived: while the cost functions of GSMS and CAL_1 are directly comparable, those of GSMS and DC3 are not.

[Table 4.2: A subset of the extensible instruction library. For each instruction (FMULT, QUAN, ADD14, MYSAT, RECONS, GSMS, CAL_1, GSMMR, GSMLM, DC1, DC2, DC3, DC4, ADD8, CC, SSIZE), the table lists the applications that use it, its area (from 50 to 23,400 gates), its speedup under each pre-configured processor P1-P12, its latency (from 4.33 to 8.15 ns), and the fitting function value of the corresponding code segment.]

We performed experiments on the design system using ten benchmarks: adpcm encoder, g721 encoder, g721 decoder, gsm encoder, gsm decoder, mpeg2 decoder, epi encoder, epi decoder, epwi decoder, and voice recognition. The first eight applications are multimedia applications obtained from MediaBench [135]. The ninth application, obtained from the GRASP Laboratory [50], is an embedded predictive wavelet image coder, an enhancement of the efficient pyramid image coder (epi coder). The final application, obtained from [57], provides user voice control over Unix commands within a Linux shell environment. For verification purposes, we simulated all possible combinations of extensible processors with pre-configured processors and extensible instructions on each benchmark, so that the entire design space (including the Pareto points) of each benchmark is obtained. In addition, we started with a tight area constraint and relaxed it during our experiments in order to obtain all possible Pareto points in the design space.

4.3.2 Evaluation Results

In this section, we discuss the evaluation of the design system. First, we discuss the efficacy of the fitting function. Second, we demonstrate the efficiency of the heuristic algorithm for selecting the pre-configured processor and extensible instructions. Third, we discuss the accuracy of the execution time estimation. Finally, we discuss the effectiveness of the overall design flow in the design system.

Table 4.3 summarises the efficacy of the fitting function, and Figure 4.5 shows the relationship between the normalised value of the fitting function and the speedup/area ratio of the corresponding extensible instruction for the ten applications. These results show that the methodology for identifying instructions suggests useful instructions to extract from an application program.

    Application         Code segments before   Code segments after
    Adpcm encoder        10                     3
    Gsm encoder          15                     4
    Gsm decoder          15                     4
    G721 encoder         20                     4
    G721 decoder         20                     4
    Mpeg2 decoder       152                    12
    Epi encoder         247                    15
    Epi decoder         362                    20
    Epwi decoder        153                    12
    Voice recognition   106                    10

Table 4.3: The efficacy of the fitting function

Table 4.4 summarises the efficiency of the heuristic algorithm for selecting the pre-configured processor and extensible instructions. The first column in Table 4.4 displays the application name. The second column shows the number of Pareto points obtained using the design system, while the third column shows the total number of Pareto points in the design space. The final column indicates the total number of configurations for the application. The heuristic algorithm (parts I and II, for selecting the pre-configured processor and extensible instructions when a tight area constraint is given and then progressively relaxed) is able to obtain, on average, 91% of the Pareto points for all the benchmarks. Although our algorithm does not obtain all of the Pareto points, the performance of the configurations it finds is on average within 7.1% of the performance of the Pareto points that were not obtained. The reason our heuristic algorithm fails to deliver all Pareto points is that it searches on ratios rather than absolute values.

Third, in order to demonstrate the efficiency and accuracy of the execution time estimation of the system, we estimated the execution time for all the obtained Pareto points for extensible processors on each benchmark and compared this with the execution time obtained using an ISS of the target processor. The estimation of the execution time is on average within 5.7% of the real execution time of an application program.

[Figure 4.5: The relationship between the normalised fitting function value and the speedup/area ratio of the corresponding instructions for each benchmark: (a) GSMenc and GSMdec; (b) ADPCMenc; (c) MPEG2dec; (d) G721enc and G721dec; (e) VOICE; (f) EPIenc and EPIdec; (g) EPIWenc and EPIWdec.]

    Application         Pareto points   Total number of   Total number of
                        obtained        Pareto points     configurations
    Adpcm encoder         4               6                   96
    Gsm encoder           6               6                  768
    Gsm decoder           6               7                  768
    G721 encoder          8               8                  192
    G721 decoder          7               7                  192
    Mpeg2 decoder        15              18                  768
    Epi encoder          36              42                 1536
    Epi decoder         215             235                12288
    Epwi decoder         35              38                 1536
    Voice recognition    19              19                  768

Table 4.4: The efficiency of the heuristic algorithm

Finally, Table 4.5 shows a summary of the semi-automatic design system. The first column displays the application name. The second and third columns represent the original solution, which is the solution on the base processor with no additional predefined blocks and extensible instructions. The fourth and fifth columns show the solution that the system obtained. The sixth column displays the speedup of each application achieved by the design system. The seventh column shows the accuracy of the performance estimation. The final two columns compare the design exploration time of the exhaustive simulation methodology and of the design system, respectively.

[Table 4.5: Summary of the semi-automatic design system results. For each of the ten applications, the table lists the original solution (base processor area of 69,680 gates; execution times from 0.042 to 7.54 seconds), our best solution (areas from 78,850 to 183,200 gates), the speedup of the application (from 1.14× to 15.71×), the error rate of the performance estimation on Pareto points (from 3.8% to 7.1%), and the exploration times of exhaustive simulation (58 to 15,059 minutes) versus our system (4 to 329 minutes).]

Table 4.5 indicates that the design system achieved, on average, a 4.74× speedup (up to 15.71×) of the application, while the performance estimation is within 5.7% of the ISS result. In addition, the design space exploration time for our design flow is on average 2.5% of the design space exploration time using the exhaustive simulation methodology. Figures 4.6 and 4.7 show the design space of the gsm decoder and the mpeg2 decoder benchmarks respectively, both with 768 configurations, and the Pareto-point walk through the design space using the design flow of the design system. For these benchmarks, exploring the entire design space takes approximately 3963 minutes (66.1 hours) and 187 minutes (3.11 hours) respectively, whereas our design flow took only 105 minutes (1.75 hours) and 4 minutes respectively to obtain these Pareto points in the design space. Furthermore, the second column of Table 4.2 also shows that, in our design system, an extensible instruction can be reused in more than one application within the same domain, eliminating repeated effort in the creation of extensible instructions. These results clearly indicate that the design system shortens the design turnaround time for an extensible processor.

[Figure 4.6: GSM decoder's design space and Pareto points: (a) full design space; (b) Pareto points.]

[Figure 4.7: MPEG2 decoder's design space and Pareto points: (a) full design space; (b) Pareto points.]

4.4 Conclusions and Future Work

This chapter has described the design system for configuring an extensible processor, which maximises the performance of an application and satisfies the area constraint, as well as significantly reducing the design turnaround time. These results are achieved through three major improvements over existing design systems:

• Our method first identifies code segments within individual computationally intensive C functions, and then uses a fitting function to rank the identified code segments according to the potential speedup and the area constraint. By considering each line (rather than each operation) within individual computationally intensive C functions, the exponential blowup problem is reduced. In addition, the boundary of searching connected subgraphs is extended, which is a limitation of previous work [67,193]. Our method also searches between connected subgraphs in the DFG, which advances the state-of-the-art research in this area.

• A two-stage hierarchical approach to exploring the design space is implemented: first, select a pre-configured processor; second, select the right extensible instruction set for that processor. The designer selects a set of predefined blocks for the application to prune the design space of extensible processors. Cost functions and a heuristic algorithm are developed to guide the two-stage selection process, ensuring that a significant portion of the design space is explored in a short time. By introducing pre-configured processor selection, the design space exploration is extended in such a way that the set of extensible instructions generated or selected can differ entirely between pre-configured processors. For example, if an application has a floating-point (fp) multiplication that takes up significant execution time, there are at least two possible ways to speed up the application: i) include an fp predefined block (which has fp registers and is able to execute fp addition, fp multiplication, etc.) in the processor; or ii) generate/select an fp multiplication instruction in the processor. The first solution may be overkill if the application only has one fp operation. On the other hand, since the fp multiplication instruction in the second solution is generated by designers, its speedup and area may not be ideal when compared to the vendor's fp predefined block. These kinds of situations arise when predefined blocks are included, extending the existing design space beyond just identifying/selecting instructions. Our two-stage hierarchical approach addresses this kind of situation in a unique way, differentiating it from the problem described in [120] and advancing the state-of-the-art research in this area.

• An estimation function is implemented to estimate an application's performance, rather than executing each configuration repeatedly through an Instruction Set Simulator; thus the evaluation time is significantly reduced. In addition, this estimation function takes into account the effect that added extensible instructions have on the clock speed of the base processor, thereby providing an accurate measure of the application performance.

One of the main ideas behind this semi-automatic design system is the hierarchical approach to designing an extensible processor: i) first limiting the design space through selection of a pre-configured processor core; and ii) then selecting an appropriate set of extensible instructions for that specific processor. This approach enables efficient searching of the design space while still allowing for designer input. This is the first system of its kind that achieves high speedups at reasonable cost without exhaustive (time-consuming) design space exploration.

This chapter has demonstrated how, using our design system, ten different real-world benchmarks can be designed within an extensible processor environment. The design space exploration time of the design system is on average 2.5% of the exploration time using full simulation for the given set of benchmarks. The fitting function for identifying appropriate code segments correlates with the speedup/area ratio of the resulting instruction. In addition, our heuristic algorithm was able to locate on average 91% of all Pareto points from the entire design space across all benchmarks. The execution time estimation for the proposed extensible processor is on average within 5.68% of results obtained with an ISS, and is generated typically in less than a second. Finally, the application program execution time is reduced by up to 15.71× (4.74× on average), with an average area overhead of 65% on the benchmarks.

Although matching code segments with instructions in the library and generating instructions are both performed manually in the design system, the system is still useful in many real-world applications.

Chapter 5

Matching Instructions Tool

This chapter builds upon the insights developed in the previous chapter to develop an automation tool for matching pre-designed extensible instructions with computationally intensive code segments. This is one of the most challenging and as yet unsolved steps in the design flow of the extensible processor platform. Given a library of pre-designed extensible instructions, each of which may or may not be included (depending on the application and its constraints) in the final design of the extensible processor, and an application written in C/C++, the goal of matching is to automatically match instructions in the library with code segments in the application, in order to automatically judge whether a specific code segment (software) of the application should be replaced by an extensible instruction. This is an inherently complex task. Figure 5.1 shows a motivation example: matching an extensible instruction with a code segment and determining whether they are functionally equivalent.

The traditional approach to instruction matching consists of instruction simulation [189] and data control graph matching techniques [71,119,145,186]. In the simulation approach, a code segment and the equivalent hand-designed instruction are simulated with the same set of input vectors, while comparing output vectors. The drawback of this approach is the need to simulate a complete set of data vectors in order to ensure that the extensible instruction and the software code segment are functionally equivalent. This makes the process not only time and computation intensive but also potentially error-prone unless 100% data set coverage is guaranteed. Another technique, data control graph matching, enables the matching of extensible instructions with a structurally equivalent representation of the appropriate code segment. Since the same segment can be represented graphically in many different ways, such a method will often result in a false negative. The differences in the graphical representation can arise from the level of granularity and the method of decomposition in a function.

    // Application in C
    int main() {
        ...
        // Code segment
        for (x = 0; x < 100000; x++) {
            tmp = a[x] * b[x] << 4;
            total += tmp;
        }
        ...
    }

            (functionally equivalent to)

    // Pre-designed specific instruction
    state total 32
    iclass ei EI {out arr, in art, in ars} {in state}
    reference EI {
        wire [31:0] tmp;
        assign tmp = TIEmul(art, ars, 1'b0) << 4;
        assign arr = tmp + state;
    }

Figure 5.1: A motivation example: matching a pre-designed extensible instruction with a code segment of the application

To overcome the shortcomings of the simulation and pattern matching techniques, a matching instructions tool, MINCE (Matching INstructions using Combinational Equivalence), is proposed. MINCE consists of a translator, a filtering algorithm and a combinational equivalence checking tool. The translator converts a code segment described in a high-level language (typically C/C++) into a combinational Verilog representation. The filtering algorithm rapidly prunes pre-designed extensible instructions that cannot match the code segment. Finally, the combinational equivalence checking tool is used to ensure that the functionality of the code segment and the extensible instruction is equivalent. The advantages of the MINCE tool are:

• It automates the step of instruction matching and is superior to computation-intensive and error-prone simulation approaches; and

• The use of functional equivalence checking ensures that the results (i.e. the found candidates for extensible instructions) are largely independent of the programming style of the application that is to be accelerated.

MINCE is an automated tool for matching extensible instructions to functionally equivalent code segments in an extensible processor platform. This new, key and hitherto missing step complements the existing extensible processor design flow, shown in the highlighted (orange) section of Figure 5.2.

5.1 Background

This section provides the necessary background on combinational equivalence checking. It first describes some basics of binary decision diagrams (BDDs) and the advantages and disadvantages of using BDDs for functional equivalence checking. Finally, an example of the BDD representation of a code segment and an extensible instruction (which are functionally equivalent) is given.

A Reduced Ordered Binary Decision Diagram (ROBDD, but often simply referred to as a "BDD") is a canonical data structure that uniquely represents a boolean function with maximal sharing of substructure [49]. Because the BDD data structure is based on the maximal sharing of substructure in a boolean function, a BDD is not as prone to exponential resource blowup as other representations. In order to further minimise the memory requirements of BDDs, Rudell [174] introduced dynamic variable ordering, which continuously changes the order of the variables (without changing the original function being represented) while the BDD application is running. There are many derivatives of BDDs, such as the multi-valued BDD (MDD), which has more than two branches and potentially a better ordering, and the free BDD (FDD), which has a different variable ordering and is not canonical. Madre and Billion [149,150] proposed the use of BDDs to solve combinational equivalence checking problems; they showed that BDDs can be used to verify the equivalence of combinational circuits and to determine where two circuits are not equivalent.

[Figure 5.2: A generic design flow for designing an extensible processor and how the MINCE tool fits into it. The application written in C/C++ is compiled, analysed and profiled; computationally intensive code segments are identified; the MINCE tool finds functionally equivalent pre-designed instructions for the identified code segments, while extensible instructions are generated for the remaining code segments; extensible instructions, predefined blocks and parameters are selected and the design space is explored against performance, power and area; and once the design satisfies its constraints, the extensible processor (synthesizable RTL of the base processor, predefined blocks, extensible instructions and parameter settings) is generated, synthesised and taped out.]
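To illustrate the canonicity property that equivalence checking relies on, the following toy sketch (our own illustration in C, not the VIS implementation) builds BDD nodes through a unique table: identical (variable, low, high) triples are shared and redundant tests are never created, so with a fixed variable order two equivalent functions end up with the same root pointer.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy ROBDD construction: reduction rule 1 removes redundant tests,
     * reduction rule 2 (the unique table) shares identical subgraphs. */
    typedef struct node { int var; struct node *lo, *hi; } node;

    static node *table[256];
    static int n_nodes;

    node *mk(int var, node *lo, node *hi) {
        if (lo == hi) return lo;                       /* rule 1 */
        for (int i = 0; i < n_nodes; i++)              /* rule 2 */
            if (table[i]->var == var &&
                table[i]->lo == lo && table[i]->hi == hi)
                return table[i];
        node *n = malloc(sizeof *n);
        n->var = var; n->lo = lo; n->hi = hi;
        table[n_nodes++] = n;
        return n;
    }

    int main(void) {
        node zero = { -1, 0, 0 }, one = { -2, 0, 0 };  /* terminals */
        /* f = x1 AND x2, built twice; sharing makes the roots identical */
        node *f = mk(1, &zero, mk(2, &zero, &one));
        node *g = mk(1, &zero, mk(2, &zero, &one));
        printf("f %s g\n", f == g ? "==" : "!=");
        return 0;
    }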

The advantage of using BDDs for combinational equivalence checking is that if two functions have the same functionality but different circuit representations, their BDDs will still be identical. On the other hand, the disadvantage of BDDs is that, during verification, the memory requirement of complex modules such as multipliers is extremely large, and the verification time may slow down significantly as a result. In addition, the verification time for checking whether two functions are functionally equivalent consists of the BDD creation time, the dynamic variable ordering time, and the equivalence checking time. Figure 5.3 shows the verification time distribution for three extensible instructions with three code segments; the equivalence checking time is often less than 50% of the total verification time.

[Figure 5.3: Verification time distribution for three extensible instructions, broken down into creation and flattening time of BDDs, dynamic variable ordering time, and equivalence checking time (0-80 minutes).]

For example, Figure 5.4a shows the high-level language representation of a code segment (S = (a + b) * 16) and an extensible instruction (S = (a + b) << 4), which are functionally equivalent. Figure 5.4b shows the BDD representations of the code segment and the extensible instruction. Since there are 32 bits in each variable (a, b, S), only the BDDs of variable S from bit 11 down to bit 4 are shown, with bit 5 of variable S expanded for clarity. Note that ci in the BDDs in Figure 5.4b is the carry-in of each bit. The BDD representation of the extensible instruction is identical to the BDD representation of the code segment, which indicates that the code segment and the extensible instruction are functionally equivalent. Figure 5.4c shows a BDD representation of a 4-bit addition.

[Figure 5.4: A code segment and an extensible instruction and their BDD representations: (a) the high-level code segment S = (a + b) * 16 and the extensible instruction S = (a + b) << 4; (b) the identical BDD representations of the two, shown for bits 11 down to 4 of S; (c) the BDD of a 4-bit addition.]
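For this particular pair, a simulation-based check would have to sweep the input space. The short C program below (our own illustration) does so exhaustively for 8-bit operands, which already takes 65,536 vectors; for the full 32-bit operands the space is 2^64, which is why the canonical BDD comparison above is preferable.

    #include <stdint.h>
    #include <stdio.h>

    /* Exhaustively compare the two expressions from Figure 5.4a over all
     * 8-bit operand pairs.  A simulation-based check must enumerate (or
     * sample) the input space like this. */
    int main(void) {
        for (uint32_t a = 0; a < 256; a++)
            for (uint32_t b = 0; b < 256; b++) {
                uint32_t seg  = (a + b) * 16;   /* code segment          */
                uint32_t inst = (a + b) << 4;   /* extensible instruction */
                if (seg != inst) {
                    printf("mismatch at a=%u b=%u\n", a, b);
                    return 1;
                }
            }
        printf("equivalent on all 8-bit operand pairs\n");
        return 0;
    }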

5.2 Related Work

This section divides the related work into two parts. First, it describes previous work on automating one or more steps of an extensible processor design flow with extensible instruction capabilities. Second, it discusses work related to automatically matching/identifying software language constructs to equivalent hardware descriptions.

Starting with the first group, Lee, Choi and Dutt [137] proposed a design flow with instruction encoding, complex instruction generation, and a heuristic design space exploration, in order to reduce the design turnaround time for extensible processors. The speedup of their complex instructions is mainly achieved through reducing the size of op-codes and operands, and shortening the instruction fetch/decode time. Cheung, Henkel and Parameswaran [57] produced a design flow that includes a methodology for rapidly selecting extensible instructions from a pre-designed instruction library; there, the extensible instructions are optimally but manually designed. Sun, Ravi, Raghunathan and Jha [193] proposed a design flow that automatically generates instructions, inserts instructions, and performs a heuristic design space exploration. Automatic instruction generation locates the regular templates derived from program dependence graphs and implements the most suitable ones as extensible instructions; it is based on matching regular templates in the graph only, and then selecting a combination of the regular templates using a graph representation and algorithm.

Matching hardware to a software code segment has been attempted in various forms during the last decade, and the work can be categorised into three research disciplines: graph matching approaches [71,119,145,186], extensive simulation [189] and equivalence verification [68,168,184].

Graph matching approaches can be further divided into template/pattern matching [71,119] and instruction-set matching [145,186]. These approaches are based on a graph representation (such as control/data flow graphs (CDFGs)) and the application of heuristic algorithms to search for equivalent pre-defined template instructions in the graph representation of the software application. The limitation of this approach is that only instructions with structurally equivalent templates/patterns can be matched. Since extensible instructions often contain special modules to meet design constraints, it is practically infeasible to find a structural match.

Extensive simulation using an Instruction Set Simulator enables the matching of functionally equivalent instructions with corresponding software code segments [1,23]. However, this approach requires the designer to locate the corresponding software code segment manually. Stadler, Rower, Kaeslin, Felber, Fichtner and Thalmann [189] proposed a simulation-based solution for verifying intellectual properties. These techniques require simulation of a large data set in order to ensure the functional equivalence of an instruction. Hence, the simulation approach is a very time-consuming process.

Several tools for verifying the combinational equivalence between C/C++ code and an HDL description have recently appeared [68,168,184]. In 1998, Pnueli, Siegel and Shtrichman [168] introduced the idea of verifying the equivalence (safety-critical) of a software implementation in C against a small BDD transition model; however, the C program is restricted to a subset of C. Semeria, Seawright, Mehra, Ng, Ekanayake and Pangrle [184] developed a tool for verifying the combinational equivalence of RTL-C and an HDL. Once again, the C code is limited to a subset of C that is very close to the hardware description (RTL code); in other words, the C code needs to be written in a very similar way to the RTL code. Recently, Clarke, Kroening and Yorav [68] presented a tool for verifying the behavioural consistency of C and Verilog HDL programs. This tool translates both C and Verilog HDL to bit-vector equations, which in turn are translated to SAT instances and used to verify equivalence using a bounded model checker. Our MINCE tool extends this approach to verify an extensible instruction against a C software code segment, but does not require the insertion of extra functions into the C program.

5.3 Overview of the MINCE Tool

This section describes the automated matching instructions tool, MINCE. It first provides an overview of the whole tool and then describes its important components in detail: the translator, the filtering algorithm, and the combinational equivalence checking model. Figure 5.5a shows the MINCE tool, which consists of a translator, a filtering algorithm, and a combinational equivalence checking model. The inputs of the MINCE tool are the extensible instruction library (in Verilog HDL) and the application (in C/C++). The goal of the MINCE tool is to automatically match instructions in the library with code segments of the application, in order to automatically judge whether a specific code segment of the application might be replaced by an extensible instruction.

[Figure 5.5: MINCE: an automated tool for matching extensible instructions. (a) The design flow of the MINCE approach: the application software in C/C++ and the extensible instruction library in Verilog HDL feed phase I (the translator), phase II (the filtering algorithm), and phase III (the combinational equivalence checking tool), which outputs functionally equivalent implementations. (b) The translator flow: separate the application into code segments, compile to assembly code, convert to a register transfer list, and map to Verilog code using the assembler instruction hardware library (Verilog HDL).]

The first phase of the tool converts a code segment in C/C++ to Verilog HDL using the translator. There are three reasons for converting a code segment to Verilog HDL:

1. The extensible instructions are designed in Verilog HDL, and no manipulation is required if the combinational equivalence checking model uses Verilog HDL files as input;

2. There is a well-developed combinational equivalence checking model that uses Verilog HDL files as input and performs functional equivalence checking; and

3. The granularity of a code segment in C/C++ is typically high, so its apparent hardware complexity may be greater than it actually is; higher complexity would slow down the verification time significantly. It is therefore advantageous to systematically convert the code segment to Verilog HDL in order to control the granularity and hardware complexity of the code segment.

Next, the filtering algorithm is applied to eliminate instructions that cannot match the code segment. The instructions that pass through the filter are then compared one by one with the code segment using a combinational equivalence checking model. The model used is VIS (Verification Interacting with Synthesis), jointly developed by the University of California at Berkeley and the University of Colorado at Boulder [42].

5.3.1 The Translator

Figure 5.5b illustrates the translator flow. The input of the translator is the application written in C/C++. The goal of the translator is to convert the application written in C/C++ into a set of code segments in Verilog HDL using a systematic approach. The translator consists of four steps: separate, compile, convert, and map. In addition, the translator contains an assembler instruction hardware library written in Verilog HDL, a self-made library for the target processor. These hardware implementations of the assembler instructions are referred to as "base hardware modules", and are used for technology mapping in the translator.

The application written in C/C++ is first separated into a set of frequently used code segments written in C/C++. In other words, the complete application written in C/C++ is first profiled and then segmented according to a ranking criterion (described in [58]).

The C/C++ code segment is then translated into assembly, which achieves the following objectives:

• Uses all of the optimisation methods available to the compiler to reduce the size of the compiled code;

• Converts the translated code into the same data types as the instructions in the library; and

• Unrolls loops with deterministic loop counts in order to convert the code segment to a combinational implementation.

[Figure 5.6: An example of translating a code segment to Verilog HDL in a form that allows matching through the combinational equivalence checking model. The application is separated (step I) into high-level code segments such as:

    int example (int sum, int input1, int input2) {
        total = sum + (input1 * input2) >> 4;
        return total;
    }

which is compiled (step II) into assembly code:

    mult R6, R1, R2
    mov  R4, R6
    sar  R4, $4
    add  R5, R4, R3

then converted (step III) into a register transfer list:

    R6 = R1 * R2;
    R4 = R6;
    R4_1 = R4 >> 4;
    R5 = R3 + R4_1;

and finally mapped (step IV) onto the base hardware modules (mult, add, sfr, sfl, cmpl, etc.) of the assembler instruction hardware library, yielding Verilog code:

    module example (total, sum, input1, input2);
        output [31:0] total;
        input  [31:0] sum, input1, input2;
        wire   [31:0] r1, r2, r3, r4, r4_1, r5;
        wire   [63:0] r6;

        assign r1 = input1;
        assign r2 = input2;
        assign r3 = sum;
        mult m0 (r6, r1, r2);
        assign r4 = r6[31:0];
        sfr  s0 (r4_1, r4, 4);
        add  a0 (r5, r3, r4_1);
        assign total = r5;
    endmodule  ]

An example of this step (code segment to assembler) is shown in Figure 5.6 (step II). The software code segment in the example contains addition, multiplication and shift-right operations (mult - multiplication, mov - move register, sar - shift right, and add - addition). The reason the assembly code contains a move instruction is that mult produces a 64-bit product, and the move instruction is used to reduce the product to 32-bit data.

The assembler code is then transformed into a list of register transfer operations. The translator converts each assembly instruction into a series of register transfers. The main goal of this conversion step is to turn any non-register transfer type operations, such as pop and push instructions, into explicit register transfer operations. In this step MINCE automatically renames variables in order to remove duplicate name assignments. Duplicate names are avoided because Verilog HDL is a static single assignment form language [72]. In Figure 5.6, this is shown as step III. In this example, the translator converts each assembly instruction into a single register transfer. The register transfer operations show the single assignment statement of each register, R4, R4_1, R5 and R6, where R4_1 is the variable renamed by our tool.
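A minimal sketch of such a renaming pass is given below in C++. The Transfer record and the versioned-name scheme are assumptions made for illustration; they are not the actual MINCE data structures.

```cpp
#include <map>
#include <string>
#include <vector>

// A register transfer: dst = op(srcs). Illustrative representation only.
struct Transfer {
    std::string dst;
    std::string op;
    std::vector<std::string> srcs;
};

// Rewrite a transfer list into single-assignment form: each time a register
// is written again, give it a fresh versioned name (R4 -> R4_1 -> R4_2 ...)
// and make later reads refer to the latest version.
void rename_to_single_assignment(std::vector<Transfer>& transfers) {
    std::map<std::string, int> version;          // writes seen per register
    std::map<std::string, std::string> current;  // register -> latest name
    for (auto& t : transfers) {
        for (auto& s : t.srcs)                   // reads use latest version
            if (current.count(s)) s = current[s];
        int v = version[t.dst]++;                // 0 on the first write
        std::string name = v == 0 ? t.dst : t.dst + "_" + std::to_string(v);
        current[t.dst] = name;
        t.dst = name;
    }
}
```

Run on the example of Figure 5.6, the second write to R4 (the shift) becomes R4_1, and the subsequent add reads R4_1, exactly as in step III.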

After the assembly code is converted to register transfer operations, the next step is technology mapping (step IV of Figure 5.6). In this step the register transfer operations are mapped to the base hardware modules given in the pre-designed assembler instruction hardware library. This library is manually designed to minimise the verification time of the functions. Once each register transfer has been mapped to a base hardware module, the translator creates a top-level Verilog HDL description interconnecting all the base hardware modules. The Verilog HDL shown in Figure 5.6 is based upon the code segment and the register transfer operations. In this example, there are three input variables (sum, input1 and input2), one output variable (total), seven temporary connection variables (r1, r2, etc.) and three hardware modules (addition, multiplication and shift right). The top-level Verilog HDL declares the corresponding number of variables and contains the mapped code of the register transfer operations.

The technology mapping step provides a system-level approach to converting register transfer operations to a combinational hardware module. One of the drawbacks of this approach is that control flow operations such as branch and jump instructions may not directly map into a single base hardware module. Those instructions map to more complex hardware modules.

5.3.2 Filtering Algorithm in MINCE

The second phase of the MINCE tool is the filtering algorithm. The inputs of the filtering algorithm are two Verilog HDL files: the extensible instruction (written in Verilog HDL and given as the input of the MINCE tool) and the code segment (written in C/C++ and translated to Verilog HDL). The goal of the filtering algorithm is to eliminate unnecessary and overly complex Verilog HDL files before they enter the combinational equivalence checking model. Because the filter itself is of low complexity, it also reduces the time spent manipulating Verilog HDL into BDDs in the combinational equivalence checking model.

Verilog HDL files can be pruned as non-matching due to:

• A differing number of ports (the code segment might have two inputs, while the extensible instruction has only one);

• Differing port sizes; and

• An insufficient number of base hardware modules to represent a complex module (for example, if the code segment only contained an XOR gate and an AND gate, while the extensible instruction contained a multiplier (a complex module), a match would be impossible).

There are two reasons to check whether the code segment has an insufficient number of base hardware modules to represent the complex modules in the extensible instruction:

1. A complex module in a Verilog HDL file requires extremely large BDDs (using on the order of 1Gb of RAM) to represent in the combinational equivalence checking model. In addition, the manipulation time (from Verilog to BDDs) for such Verilog HDL files is very large.

2. The number of ways to implement the complex modules from base hardware modules is limited in the code segment, which controls the granularity and complexity of the code segment and reduces the possibility of failure in verification.

A subset of complex modules with limited implementations is shown in Table 5.1.

Complex Module        | Implementation (Hardware Modules)
----------------------|------------------------------------------
Multiplier (32-bit)   | Add, Shift
Multiplier (32-bit)   | Multiplier (16-bit), Adder, Multiplexor
Division (32-bit)     | Multiplier (32-bit), Reciprocal
Division (32-bit)     | Subtract, Shift
Square Root (32-bit)  | Multiplier (32-bit), Add, Subtract
Sine (32-bit)         | Multiplier (32-bit), Add, Subtract
Cosine (32-bit)       | Multiplier (32-bit), Add, Subtract

Table 5.1: A subset of complex modules with limited implementations

Figure 5.7 presents the pseudo code of the filtering algorithm.

Algorithm Filtering (v1, v2) {
    if (Σ input(v1)  != Σ input(v2))  return filtered;
    if (Σ output(v1) != Σ output(v2)) return filtered;
    if (Σ |input(v1)|  != Σ |input(v2)|)  return filtered;
    if (Σ |output(v1)| != Σ |output(v2)|) return filtered;
    for all modules in v2 do {
        if (module(v2) == complex module)
            cm_list = implement(module(v2));
    }
    for all elements i in cm_list do {
        if (cm_list_i ⊆ Σ modules(v1)) return potentially equal;
    }
    return filtered;
}

Figure 5.7: The filtering algorithm, which reduces the number of extensible instructions passed into the combinational equivalence checking model
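To make the filter's logic concrete, here is a minimal C++ sketch. It assumes each Verilog HDL file has been summarised into port counts, total port bitwidths, and the multiset of base hardware modules it instantiates; the types and names are invented for illustration and are not the MINCE implementation.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical summary of a Verilog HDL file (not a MINCE data structure).
struct ModuleSummary {
    int num_inputs = 0, num_outputs = 0;      // number of ports
    int input_bits = 0, output_bits = 0;      // total port bitwidths
    std::multiset<std::string> base_modules;  // e.g. {"add32", "sfr32"}
    // Known base-module expansions of each complex module (cf. Table 5.1).
    std::vector<std::multiset<std::string>> complex_impls;
};

// Returns true if the pair may still be equivalent and should be passed on
// to the combinational equivalence checker; false means "filtered".
bool may_match(const ModuleSummary& code_seg, const ModuleSummary& instr) {
    // Cheap structural checks first: port counts and total port widths.
    if (code_seg.num_inputs != instr.num_inputs) return false;
    if (code_seg.num_outputs != instr.num_outputs) return false;
    if (code_seg.input_bits != instr.input_bits) return false;
    if (code_seg.output_bits != instr.output_bits) return false;

    // If the instruction contains complex modules, the code segment must
    // contain enough base modules to realise at least one known
    // implementation of them.
    for (const auto& impl : instr.complex_impls) {
        bool covered = true;
        for (const auto& m : impl)
            if (impl.count(m) > code_seg.base_modules.count(m)) {
                covered = false;
                break;
            }
        if (covered) return true;  // potentially equal
    }
    return instr.complex_impls.empty();  // no complex modules: keep the pair
}
```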

5.3.3 Combinational Equivalence Checking Model

After filtering out the instructions in the library that are unrelated to the given code segment, MINCE checks whether the Verilog HDL converted from the software code segment is functionally equivalent to an instruction written in Verilog HDL. The checking is performed using Verification Interacting with Synthesis (VIS) [42]. This part of the work could have been carried out with any similar verification tool.

Using a stand-alone compiler (VL2MV), this model first converts both Verilog HDL files into an intermediate format (BLIF-MV) upon which VIS operates [55]. The hierarchical BLIF-MV modules are then flattened to a gate-level description. Note that VIS uses both BDDs and their extension, MDDs, to represent Boolean and discrete functions. VIS is also able to apply dynamic variable ordering [174] to improve the possibility of convergence.

The two flattened combinational gate-level descriptions are declared combinationally equivalent if they produce the same outputs for all combinations of inputs; in that case, MINCE declares the code segment and the extensible instruction to be functionally equivalent.

[Figure: the experimental and verification platform. Code segments from the application software in C/C++ are translated to Verilog HDL by the MINCE translator, passed through the filtering algorithm, and checked against the extensible instruction library (Verilog HDL) by the combinational equivalence checking tool. In parallel, a simulation-based approach compiles each code segment, simulates it with an input dataset (100 million input vectors), and compares the simulated results with pre-computed results for the instructions. Both flows report the number of code segments matched and the verification time.]

Figure 5.8: Experimental and verification platform
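The equivalence criterion itself can be made concrete with a small sketch. VIS establishes it symbolically with BDDs; the brute-force enumeration below is only an illustration of the definition, feasible for narrow inputs, and the Circuit interface is an assumption made for this example.

```cpp
#include <cstdint>
#include <functional>

// A combinational circuit as a pure function of its (packed) input bits.
using Circuit = std::function<std::uint64_t(std::uint64_t)>;

// Two circuits are combinationally equivalent iff they agree on every
// input combination. Only feasible here for narrow inputs, since the loop
// performs 2^input_bits evaluations.
bool combinationally_equivalent(const Circuit& f, const Circuit& g,
                                unsigned input_bits) {
    const std::uint64_t combos = 1ULL << input_bits;
    for (std::uint64_t x = 0; x < combos; ++x)
        if (f(x) != g(x)) return false;
    return true;
}
```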

5.4 Experimental Results

This section describes the experimental setup and results. The target extensible processor compiler and profiler used in our experiments are the Xtensa processor’s compiler and profiler from Tensilica, Inc. [23]. Our extensible instruction library and assembler instruction library are written in Verilog HDL (see Figure 5.5). Figure 5.8 shows the experimental and verification platform for our experiments.

To evaluate the MINCE tool we conducted two separate sets of experiments. In the first, we created arbitrarily diverse instructions and matched them against artificially generated C code segments. These segments either: a) matched exactly (i.e. they were structurally identical); b) were only functionally equivalent; c) matched on I/O ports only (i.e. the code segment passes through the filter algorithm but is not functionally equivalent); or d) did not match at all. This set of experiments was conducted to show the efficiency of functional matching as opposed to finding a match through a simulation-based approach. In the simulation-based approach, the C code segment is compiled and simulated with input vectors to obtain output vectors. The output vectors are compared with the pre-computed output results of the extensible instruction.

The simulation was conducted with 100 million data sets each (approximately 5×10⁻¹⁰% of the full input space for a code segment with two 32-bit input variables). The reason for choosing 100 million as the size of the data set is the physical limit of the hard drive: each data set together with the pre-simulated result of the instruction requires approximately 1Gb of storage. If more than n differences occur in the simulation results (n = 1 million, i.e. 1% of the data set, in our experiments), computation is terminated, and we state that a match is non-existent.
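A sketch of this early-terminating comparison loop is shown below in C++; the Behaviour interface and the way the input vectors are produced are assumptions for illustration.

```cpp
#include <cstdint>
#include <functional>

// The input/output behaviour of a code segment or instruction with two
// 32-bit inputs and one 32-bit output (assumed interface).
using Behaviour = std::function<std::uint32_t(std::uint32_t, std::uint32_t)>;

bool simulation_match(const Behaviour& code_segment,
                      const Behaviour& instruction,
                      std::uint64_t num_vectors,     // e.g. 100 million
                      std::uint64_t max_mismatches)  // e.g. 1 million
{
    std::uint64_t mismatches = 0;
    // A real run would draw the input vectors from a stored dataset; a
    // simple counter-derived pattern stands in for that here.
    for (std::uint64_t i = 0; i < num_vectors; ++i) {
        std::uint32_t a = static_cast<std::uint32_t>(i * 2654435761u);
        std::uint32_t b = static_cast<std::uint32_t>(i ^ (i >> 7));
        if (code_segment(a, b) != instruction(a, b))
            if (++mismatches > max_mismatches)
                return false;  // declare non-match and terminate early
    }
    return mismatches == 0;    // a necessary condition, not a proof
}
```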

The second set of experiments used real-life C/C++ applications from Mediabench and automatically matched code segments to our pre-designed library of extensible instructions. We examined the effectiveness of the filtering algorithm by comparing the complete matching time including and excluding the filtering step. We selected the following applications: adpcm encoder, g721 encoder, g721 decoder, gsm encoder, gsm decoder and mpeg2 decoder from Mediabench [135], and a complete voice recognition system [57]. All experiments were conducted on a Sun UltraSPARC III running at 900MHz (dual) with 4Gb of RAM.

5.4.1 Evaluation Results

Table 5.2 and Table 5.3 summarise the results of our first experiment. The first column indicates the type of instruction and the hardware modules it contains. The second column displays the type of software code segment (as compared to the instruction being matched), while the third column shows the number of corresponding code segments used in the experiment. The fourth column reports the average matching time of the simulation-based approach for determining whether the extensible instruction is functionally equivalent to the corresponding software code segment. The last column displays the average matching time of MINCE. In this first experiment, we show that MINCE successfully matches various, quite diverse (since artificially generated) software code segments. In both experiments the correct result was obtained for all software code segments. Our tool performed on average 7.1× (up to 39.5×) faster than the simulation-based approach. The time reduction achieved by our tool over the simulation-based approach is shown in Figure 5.9. The negative time reduction for the Do Not Match (I/O match only) code segments is due to the large BDD creation, flattening and dynamic variable ordering times. Despite this observation, MINCE by far outperformed the simulation approach. Figures 5.10 and 5.11 summarise the matching times of simulation vs. MINCE. It should be noted that simulation does not guarantee a match and is only a necessary condition, whereas the MINCE tool guarantees a match.

Table 5.4 summarises the results for matching instructions from the library to code segments in six different, real-life multimedia applications and a voice recognition system. We compare the number of instructions matched and the time taken in matching extensible instructions between a reasonably experienced human extensible processor designer using the simulation-based approach and our tool. The extensible processor designer selects the code segments manually and simulates them using 100 million data sets. The first column of Table 5.4 indicates the application. The second column shows the speedup achieved (identically) by the extensible processor designer and our MINCE tool. The third and fourth columns represent the number of instructions matched and the matching time used by the extensible processor designer with the simulation-based approach. The next two columns show the number of instructions matched and the time used by MINCE without the filtering algorithm, and the last two columns display the same characteristics for the complete MINCE tool. Our automated tool is, on average, 7.3× (up to 9.375×) faster than manually matching extensible instructions. We also show the effectiveness of the filtering algorithm, which reduces the equivalence checking time by more than half (compare columns six and eight). In addition, we show the speedup of the embedded application that can be achieved through the automatic matching, which is 2.47× on average (up to 6.8×). Note also that both the human designer and our MINCE system made identical matches.

Instruction (Hardware modules)                    | Software Code Segment | No. of Code Segments | Simulation Time [min.] | MINCE Time [min.]
Instruction 1 (Add, logical AND)                  | Exact Match           | 1 | 79  | 2
                                                  | Functional Equ.       | 3 | 82  | 3
                                                  | I/O Match only        | 3 | < 1 | < 1
                                                  | Do Not Match          | 3 | < 1 | < 1
Instruction 2 (Shift right, logical XOR)          | Exact Match           | 1 | 46  | 2
                                                  | Functional Equ.       | 3 | 46  | 2
                                                  | I/O Match only        | 3 | < 1 | < 1
                                                  | Do Not Match          | 3 | < 1 | < 1
Instruction 3 (Add, Rotate shift right)           | Exact Match           | 1 | 65  | 2
                                                  | Functional Equ.       | 3 | 65  | 3
                                                  | I/O Match only        | 3 | < 1 | < 1
                                                  | Do Not Match          | 3 | < 1 | < 1
Instruction 4 (Add, Shift left)                   | Exact Match           | 1 | 86  | 2
                                                  | Functional Equ.       | 3 | 87  | 3
                                                  | I/O Match only        | 3 | < 1 | 2
                                                  | Do Not Match          | 3 | < 1 | < 1
Instruction 5 (Add, Shift right, logical AND)     | Exact Match           | 1 | 41  | 2
                                                  | Functional Equ.       | 3 | 42  | 3
                                                  | I/O Match only        | 3 | < 1 | 2
                                                  | Do Not Match          | 3 | < 1 | < 1
Instruction 6 (Add, shift, extra register)        | Exact Match           | 1 | 49  | 10
                                                  | Functional Equ.       | 3 | 55  | 20
                                                  | I/O Match only        | 3 | < 1 | 12
                                                  | Do Not Match          | 3 | < 1 | < 1
Instruction 7 (Shift right, multiplier)           | Exact Match           | 1 | 85  | 60
                                                  | Functional Equ.       | 3 | 90  | 85
                                                  | I/O Match only        | 3 | < 1 | 15
                                                  | Do Not Match          | 3 | < 1 | < 1
Instruction 8 (Add, multiplier)                   | Exact Match           | 1 | 102 | 70
                                                  | Functional Equ.       | 3 | 105 | 75
                                                  | I/O Match only        | 3 | < 1 | 20
                                                  | Do Not Match          | 3 | < 1 | < 1
Instruction 9 (Comparator, Shift left)            | Exact Match           | 1 | 64  | 2
                                                  | Functional Equ.       | 3 | 65  | 10
                                                  | I/O Match only        | 3 | < 1 | 7
                                                  | Do Not Match          | 3 | < 1 | < 1
Instruction 10 (Combine, logical XOR, logical OR) | Exact Match           | 1 | 35  | 5
                                                  | Functional Equ.       | 3 | 45  | 9
                                                  | I/O Match only        | 3 | < 1 | 6
                                                  | Do Not Match          | 3 | < 1 | < 1

Table 5.2: Experimental results on hardware instructions on different kinds of software code segments

Instruction (Hardware modules)                    | Software Code Segment | No. of Code Segments | Simulation Time [min.] | MINCE Time [min.]
Instruction 11 (8-bit multiplier)                 | Exact Match           | 1 | 39  | 21
                                                  | Functional Equ.       | 3 | 42  | 31
                                                  | I/O Match only        | 3 | 1   | 1
                                                  | Do Not Match          | 3 | 1   | 1
Instruction 12 (16-bit multiplier)                | Exact Match           | 1 | 68  | 48
                                                  | Functional Equ.       | 3 | 72  | 52
                                                  | I/O Match only        | 3 | 1   | 2
                                                  | Do Not Match          | 3 | 1   | 1
Instruction 13 (32-bit multiplier)                | Exact Match           | 1 | 92  | 56
                                                  | Functional Equ.       | 3 | 101 | 81
                                                  | I/O Match only        | 3 | 1   | 7
                                                  | Do Not Match          | 3 | 1   | 1
Instruction 14 (MAC unit, Rotate shift left)      | Exact Match           | 1 | 105 | 72
                                                  | Functional Equ.       | 3 | 112 | 78
                                                  | I/O Match only        | 3 | 1   | 7
                                                  | Do Not Match          | 3 | 1   | 1
Instruction 15 (Selector, Subtract, Add)          | Exact Match           | 1 | 81  | 5
                                                  | Functional Equ.       | 3 | 85  | 8
                                                  | I/O Match only        | 3 | 1   | 1
                                                  | Do Not Match          | 3 | 1   | 1
Instruction 16 (Comparator, Selector, Add)        | Exact Match           | 1 | 85  | 3
                                                  | Functional Equ.       | 3 | 89  | 5
                                                  | I/O Match only        | 3 | 1   | 3
                                                  | Do Not Match          | 3 | 1   | 1
Instruction 17 (Comparator, Selector, Multiplier) | Exact Match           | 1 | 88  | 68
                                                  | Functional Equ.       | 3 | 95  | 78
                                                  | I/O Match only        | 3 | 1   | 3
                                                  | Do Not Match          | 3 | 1   | 1
Instruction 18 (Subtract constant, Selector, Add) | Exact Match           | 1 | 72  | 5
                                                  | Functional Equ.       | 3 | 85  | 8
                                                  | I/O Match only        | 3 | 1   | 3
                                                  | Do Not Match          | 3 | 1   | 1
Instruction 19 (Add, Logical XOR, Shift left)     | Exact Match           | 1 | 72  | 3
                                                  | Functional Equ.       | 3 | 79  | 3
                                                  | I/O Match only        | 3 | 1   | 1
                                                  | Do Not Match          | 3 | 1   | 1
Instruction 20 (Comparator, Logical INV)          | Exact Match           | 1 | 85  | 3
                                                  | Functional Equ.       | 3 | 89  | 9
                                                  | I/O Match only        | 3 | 1   | 2
                                                  | Do Not Match          | 3 | 1   | 1

Table 5.3: Experimental results on hardware instructions on different kinds of software code segments (part 2)

[Figure: the time reduction achieved by our matching tool (MINCE) relative to simulation for instructions 1-20, by code segment type (Exact Match, Functional Equivalent, Do Not Match (I/O only), Totally Wrong (Do Not Match)); values range from about -25 to 100 minutes.]

Figure 5.9: Time reduction: the comparison between simulation and our matching tool

[Figure: matching time (in minutes, 0-120) of simulation vs. our matching tool (MINCE) for hardware instructions 1-10, for each kind of software code segment (EM - Exact Match; FE - Functional Equivalent; DM - Do Not Match (I/O only); TW - Totally Wrong (Do Not Match)).]

Figure 5.10: Results in terms of computation time for the instruction matching step: simulation vs. MINCE (part 1)

[Figure: matching time (in minutes, 0-120) of simulation vs. our matching tool (MINCE) for hardware instructions 11-20, for each kind of software code segment (EM - Exact Match; FE - Functional Equivalent; DM - Do Not Match (I/O only); TW - Totally Wrong (Do Not Match)).]

Figure 5.11: Results in terms of computation time for the instruction matching step: simulation vs. MINCE (part 2)

Software Application | Speedup [×] | No. matched (Designer and Simulation) | Time [hour] | No. matched (MINCE without filtering) | Time [hour] | No. matched (MINCE tool) | Time [hour]
ADPCM enc            | 2.2         | 3                                     | 80          | 3                                     | 25          | 3                        | 9
G721 enc             | 2.5         | 4                                     | 75          | 4                                     | 20          | 4                        | 8
G721 dec             | 2.3         | 4                                     | 74          | 4                                     | 20          | 4                        | 10
GSM enc              | 1.1         | 4                                     | 95          | 4                                     | 40          | 4                        | 25
GSM dec              | 1.1         | 4                                     | 105         | 4                                     | 35          | 4                        | 18
MPEG2 dec            | 1.3         | 4                                     | 115         | 4                                     | 21          | 4                        | 15
VOICE                | 6.8         | 9                                     | 205         | 9                                     | 40          | 9                        | 25

Table 5.4: Number of instructions matched, matching time used and speedup gained by the different systems

5.5 Conclusions and Future Work

This chapter has presented the MINCE tool as part of an extensible processor design framework. MINCE translates selected code segments of an embedded application to a hardware description, filters out code segments that would not match, and eventually matches code segments to a pre-defined library of extensible instructions using functional equivalence checking. Using the Mediabench suite of applications, we have shown that our approach is feasible, as the tool was able to automatically match application code segments to extensible instructions in the library. The time for matching was on average 7.3× faster than a simulation-based approach, which has been the state-of-the-art in extensible processor design so far. We have also evaluated the speedup of the embedded application that can be achieved through the automatic matching. This speedup was 2.47× on average and therefore identical to a hand-optimised design (optimum solution).

Ours is therefore the first computationally feasible approach to fully automating an extensible processor design flow, filling the gap of instruction matching.

What remains unresolved in our system is the matching of complex code segments, which include both data operations and control statements. This will form part of our future work.

Chapter 6

Instruction Estimation Models

The previous two chapters focused on automating code segment identification, architectural customisation selection, processor evaluation and the matching of pre-designed instructions. The rest of the thesis is devoted to the analysis and generation of extensible instructions, which can then be selected for the application on the extensible processor.

Since the design space for extensible instructions is infeasibly large, there is a need for accurate system-level models of instruction characteristics, so that all proposed extensible instruction structures can be explored before the extensible instruction for a code segment is generated.

Since analysis is the prerequisite for generating extensible instructions, this chapter first presents techniques for estimating the physical characteristics of extensible instructions, such as area overhead, latency (delay), and power consumption. Previous work on estimating the physical characteristics of extensible instructions has ignored parallelism techniques and scheduling alternatives in instruction estimation models. This chapter demonstrates that parallelism techniques and scheduling alternatives provide significant information regarding extensible instructions. In particular, when estimating latency (delay) and power consumption, parallelism techniques and scheduling alternatives affect the layout and connectivity of the extensible instructions. This chapter presents estimation models of extensible instructions for area overhead, latency and power consumption using system decomposition [196] and regression analysis [20]. These estimation models achieve high accuracy, which enables designers to control our previously presented techniques for semi-automatic instruction selection for extensible processors.

6.1 Motivation

Extensible instructions can be customised in numerous ways, such as by selecting and parameterising components like arithmetic operators. Designers can judiciously select from the available components and parameterise them for specific functionality. Parallelism techniques can be deployed to achieve a further speedup. There are three well-known techniques: a) Very Long Instruction Words (VLIW); b) vectorisation; and c) hard-wired operation, each with varying tradeoffs in performance, power, etc. [26, 166, 207]. These techniques can be used in conjunction with one another. Designers can also schedule the extensible instruction to run over multiple cycles. Thus, the design space of extensible instructions is almost infeasibly large.

Figure 6.1 shows four instructions (sequences) that can be designed to replace a single code segment in the original software-based application. In the code segment, four vectors a, b, c and d are summed to produce a vector z, in a loop with an iteration count of 1000. If an extensible instruction is to replace the code segment, a summation in series or a summation in parallel using 8-bit adders can be defined; these are shown in Figures 6.1b and 6.1d respectively. Designers can also group four sets of 8-bit data together and perform a 32-bit summation in parallel (shown in Figure 6.1e); this implementation loops only 250 (1000/4) times. In Figure 6.1c, an instruction using a 64-bit adder and a 32-bit adder can also be implemented, requiring just 125 (1000/8) loop iterations. Furthermore, each of these designs, while functionally equivalent, will have differing characteristics in power, performance, etc.

(a) Code segment:

short *a, *b, *c, *d, *z;
for (int i = 0; i < 1000; i++)
    z[i] = a[i] + b[i] + c[i] + d[i];

[Figure: four instruction designs for the code segment. (b) a series of 8-bit adders producing one z[i] per iteration; (c) a 64-bit and a 32-bit adder operating on eight packed elements at a time; (d) a tree of 8-bit adders summing in parallel; (e) 32-bit adders summing four packed 8-bit elements in parallel.]

Figure 6.1: A motivation example: four varieties of instruction designed to replace one code segment

To verify the area overhead, latency, and power consumption of each instruction is a time-consuming task, and not a tractable method for exploring the instruction design space. To ensure a good design, it is crucial to explore as many of the design points as possible. This requires fast and accurate estimation techniques.

6.2 Background and Theory

The Xtensa processor from Tensilica Inc. [23] was used as the platform for this work.

The processor consists of a RISC base core with approximately 80 base instructions, plus the capacity to define specific functionality through extensible instructions (using the Tensilica Instruction Extension (TIE) language), which coexist with the base instructions. An extensible instruction (in Xtensa) decomposes into five parts: the decoder, which uses data from the instruction decoding stage and assigns internal signals for the execution stage; clock gating hardware, such that the instruction can be turned on and off as needed; control logic (also known as top-logic), to schedule the operations within the instruction; customised registers, to store any additional variables (there are two types of customised register: register file and instruction register); and combinational operations, such as arithmetic and logic operations.

This estimation model uses methods described in system decomposition theory to decompose the embedded processor hardware into independent subsystems that can be analysed separately [196]. System decomposition theory originated from the ontolog- ical model of information system decomposition. The following basic definitions and theorems are obtained (for a detailed description, see Wand and Weber [196]).

Basic definitions of system decomposition theory:

• A system, σ, comprises a set of parameters.

• A parameter, c, is a discrete variable from an ordered and finite set.

• A parameter space, S, is a multidimensional discrete space, where each dimension corresponds to a parameter and each point corresponds to an extensible instruction.

• A function, f, over a parameter space, S, is a function that corresponds to a model of the extensible instruction.

• A system σ′ is a subsystem of σ, and σ is a supersystem of σ′, if and only if the composition of σ′ is a subset of the composition of σ.

• A decomposition of a system σ is a set of subsystems D(σ) = {σ_i}_{i∈I}, such that each parameter in the system is included in at least one of the subsystems.

Theorem 3.1: Let D(σ) = {σ_i}_{i∈I} be a decomposition of σ. The parameter space of the decomposition is:

    S(D) \equiv S(D(\sigma)) = \otimes_{i \in I} S(\sigma_i)    (6.1)

The parameter space S(D) will be called the parameter space of the decomposition.

Let c_i be a parameter in an ordered and finite set S_i, and let C be an n-tuple of parameters, C = {c_1, c_2, ..., c_n}. Let C_1 be an x-tuple of parameters {c_1, c_2, ..., c_x} and C_2 be an (n−x)-tuple of parameters {c_{x+1}, c_{x+2}, ..., c_n}, such that the n-tuple C can be expressed as {C_1, C_2}. In addition, a function f(C_1, C_2) is independent of C_2 if f(C_1, C_2) = f(C_1), i.e. if it can be completely represented by the parameters in C_1.
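As a small worked instance of these definitions (the parameters and the split are invented for illustration):

```latex
% A system sigma with parameters C = {c1, c2, c3}, decomposed into two
% subsystems: sigma_1 over C_1 = {c1, c2} and sigma_2 over C_2 = {c3}.
D(\sigma) = \{\sigma_1, \sigma_2\}, \qquad
S(D) = S(\sigma_1) \otimes S(\sigma_2) \quad \text{(by eqn. 6.1)}
% If a model f of sigma_1 does not change when c3 changes, then
f(C_1, C_2) = f(C_1),
% i.e. f is independent of C_2 and sigma_1 can be analysed separately.
```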

There are three requirements to ensure a valid system decomposition: i) the system must have a well-defined structure; ii) the system must be representable by a known set of parameters; and iii) a change in a parameter that belongs to a subsystem must result in a change in the function of that subsystem. System decomposition theory is applicable to modelling extensible instructions for two reasons: i) extensible instructions are well structured into five architectural parts (as described above); and ii) extensible instructions are represented using a set of customisation parameters. These parameters cover a wide range of instruction customisations, modelling differing components, dissimilar parallelism techniques, and diverse scheduling.

Regression analysis is an analysis method that expresses a model as a function of parameters. For extensible instructions, each of the area overhead, latency, and power consumption is a model that is expressed as a function of customisation parameters. For example, a model of a system, M(σ), expressed as a linear function of c_1, c_2, ..., c_n, where each c_i is a parameter of the system, can be represented as follows:

    M(\sigma) = m_0 + m_1 c_1 + m_2 c_2 + ... + m_n c_n    (6.2)

where m_0, m_1, ..., m_n are coefficients of the parameters. The function can also take other forms of expression, such as quadratic or polynomial. The coefficients of the parameters and the relationship of the model (i.e. linear, quadratic, polynomial, etc.) can be determined (if such a relationship exists) by commercial tools, when a sample dataset and the parameters are given.
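As an illustration of what such a tool computes, the following C++ sketch fits a one-parameter linear model M = m_0 + m_1 c_1 by ordinary least squares on invented sample data; a commercial statistics package (S-Plus is used later in this chapter) handles the full multi-parameter case.

```cpp
#include <cstdio>
#include <vector>

// Fit M = m0 + m1*c1 by ordinary least squares. The (c1, M) pairs below are
// invented, e.g. number of 32-bit adders vs. measured area in grids.
int main() {
    std::vector<double> c = {1, 2, 3, 4, 5};
    std::vector<double> M = {650, 1150, 1690, 2180, 2700};

    double n = c.size(), sc = 0, sM = 0, scc = 0, scM = 0;
    for (size_t i = 0; i < c.size(); ++i) {
        sc += c[i]; sM += M[i]; scc += c[i] * c[i]; scM += c[i] * M[i];
    }
    // Closed-form OLS estimates for slope and intercept.
    double m1 = (n * scM - sc * sM) / (n * scc - sc * sc);
    double m0 = (sM - m1 * sc) / n;
    std::printf("M ~= %.1f + %.1f * c1\n", m0, m1);
    return 0;
}
```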

[Figure: overview of model derivation. Inputs are the extensible instructions with a set of customisation parameters P (component, parallelism technique and schedule parameters), together with synthesis and simulation results (area overhead, latency and power dissipation). Steps I/II: the extensible instructions are decomposed into subsystems (decoder, clock gated hardware, top-logic, customised register, combinational operations) using system decomposition theory, each with a subset P1-P5 of the customisation parameters. Step III: the coefficients of each subsystem model are obtained individually using regression analysis, yielding the estimation models of the extensible instructions (area overhead, latency, power dissipation).]

Figure 6.2: An overview for characterising and estimating the models of the extensible instructions

6.3 Extensible Instructions Model

This section presents an overview of the derivation methodology, then describes how extensible instructions are represented by customisation parameters, and how the extensible instructions are divided into subsystems, each with a subset of customisation parameters, using system decomposition. Finally, this section explains how the estimation models are derived in terms of customisation parameters using regression analysis.

6.3.1 Overview

An overview of the method used to derive the estimation models is shown in Figure 6.2. The inputs are the extensible instructions with a set of customisation parameters, together with synthesis and simulation results (including the results for each subsystem, such as the decoder, top-logic, etc.). The outputs are the estimation models of the extensible instructions. An extensible instruction represented by a large set of customisation parameters is complex and therefore hard to analyse. Hence, system decomposition theory is applied to decompose an instruction into its independent structural subsystems: decoder, clock gating hardware, top-logic, customised register, and combinational operations. Each such subsystem is represented by a subset of customisation parameters. A customisation parameter belongs to a subsystem if and only if a change in the customisation parameter would affect the synthesis and simulation results of the subsystem. In addition, one and the same customisation parameter can be contained in multiple subsystems. Regression analysis is then used in each subsystem to determine: i) the relationship between the synthesis and simulation results and the subset of customisation parameters; and ii) the coefficients of the customisation parameters in the estimation models. The decomposition into subsystems is refined until the subsystem’s estimation model is satisfactory. The estimation models for the subsystems are then combined to model the extensible instruction for the purpose of estimating its characteristics. This procedure is applied separately for area overhead, latency, and power consumption.

6.3.2 Customisation Parameters

Customisation parameters are properties of the instruction that designers can customise when designing extensible instructions. They can be divided into three categories: a) component parameters, b) parallelism technique parameters, and c) schedule parameters.

Component parameters characterise the primitive operators of an instruction. They can be classified on the basis of structural similarity, as follows: i) adder and subtractor (+/−); ii) multiplier (∗); iii) conditional operators and multiplexers (<, >, ?:); iv) bitwise and reduction logic (&, |, ...); v) shifter (<<, >>); vi) built-in adder and subtractor from the library (LIB_add) (these custom-built components are used to show the versatility of the approach); vii) built-in multiplier (LIB_mul); viii) built-in selector (LIB_csa); ix) built-in mac (LIB_mac); x) register file; and xi) instruction register. The bitwidths of all primitive operators can also be altered.

Parallelism parameters characterise the various levels of parallelism during instruction execution. There are three parallelism techniques: i) VLIW - allows a single instruction to execute multiple independent operators in parallel; ii) vectorisation - increases throughput by operating on multiple data elements at a time; and iii) hard-wired operation - takes a set of single instructions with constants and composes them into one new custom complex instruction. The parallelism technique parameters include: i) the width of the instruction under the different parallelism techniques, which models the additional hardware and wider busses required to parallelise the instruction; ii) the connectivity of the components (register file, instruction register, operations, etc.), which represents the components that are commonly shared; iii) the number of operations in series; iv) the number of operations in parallel; and v) the total number of operations in the instruction.

Schedule parameters represent the scheduling of instruction execution, such as multi-cycling. The schedule parameters are: i) the number of clock cycles required to execute an instruction; ii) the maximum number of instructions that may reside in the processor; and iii) the maximum number of registers that may be used by an instruction.

Category          | Customisation parameter | Description
Components        | Num_add/sub_i  | Number of i-bit addition/subtraction operators
                  | Num_mul_i      | Number of i-bit multiplication operators
                  | Num_cond       | Number of conditional operators and multiplexors
                  | Num_logic      | Number of bitwise and reduction logics
                  | Num_shift_i    | Number of i-bit shifters
                  | Num_LIB_add_i  | Number of i-bit built-in adders
                  | Num_LIB_csa_i  | Number of i-bit built-in selectors
                  | Num_LIB_mul_i  | Number of i-bit built-in multipliers
                  | Num_LIB_mac_i  | Number of i-bit built-in macs
                  | Num_regf_j     | Number of j-bit width register files
                  | Num_ireg       | Number of instruction registers
Parallelism tech. | Wid_vliw       | Width of the VLIW instructions
                  | Wid_vector     | Width of the vectorisation instructions
                  | Wid_hwired     | Width of the hard-wired instructions
                  | Con_regf_j     | Connectivity of j-bit register files
                  | Con_ireg       | Connectivity of instruction registers
                  | Con_oper_i     | Connectivity of operations
                  | Num_oper_i     | Number of i-bit operations in total
                  | Num_ser        | Number of operations in serial
                  | Num_para       | Number of operations in parallel
Schedule          | Num_mcyc       | Number of cycles scheduled
                  | Num_minst      | Number of instructions included
                  | Use_reg_j      | Usage of the j-bit register files

Table 6.1: Customisation parameters of extensible instructions

Table 6.1 shows the notations and descriptions of the customisation parameters¹; these notations are used to refer to customisation parameters in the remainder of this chapter.

¹To handle scalability and to limit the design space, the estimation models only consider the following bitwidths: 8/16/32/64/128 for the operators with suffix i, and 32/64/128 for register files with suffix j in the table.
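For concreteness, a subset of these parameters can be captured in a small C++ record; the field names follow Table 6.1, but the type itself is an illustrative assumption rather than part of the thesis tool chain.

```cpp
#include <array>

// A cut-down customisation-parameter record for one extensible instruction.
// Indices 0..4 of the operator arrays correspond to 8/16/32/64/128 bits, and
// indices 0..2 of the register-file array to 32/64/128 bits, per Table 6.1.
struct CustomisationParams {
    // Component parameters (counts of primitive operators).
    std::array<int, 5> num_add_sub{};   // Num_add/sub_i
    std::array<int, 5> num_mul{};       // Num_mul_i
    std::array<int, 5> num_shift{};     // Num_shift_i
    int num_cond = 0;                   // Num_cond
    int num_logic = 0;                  // Num_logic
    std::array<int, 3> num_regf{};      // Num_regf_j
    int num_ireg = 0;                   // Num_ireg

    // Parallelism technique parameters.
    int wid_vliw = 0, wid_vector = 0, wid_hwired = 0;
    int num_ser = 0, num_para = 0;

    // Schedule parameters.
    int num_mcyc = 1;                   // cycles scheduled
    int num_minst = 1;                  // instructions included
};
```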

6.3.3 Characterisation for Various Constraints

Area Overhead Characterisation

Unless the subsystems share common hardware, the area overhead of an extensible instruction can be defined as the summation of the individual subsystems’ area overheads.

The decoder, the clock gating hardware, and the top-logic are built in, and are actually shared amongst extensible instructions, which must be taken into consideration.

The customisation parameters for these subsystems are: i) Con_oper; ii) Con_regf; iii) Con_ireg; iv) Num_mcyc; and v) Num_minst.

A customised register can also be shared amongst the extensible instructions in the processor. The area overhead of the customised register depends on the size and the width of the registers. The corresponding customisation parameters are: i) Num_regf; and ii) Num_ireg.

The combinational operations’ area overhead is not shared with other instructions and depends only upon the operations within the instruction. The customisation parameters for the combinational operations are: i) Num_add/sub; ii) Num_mul; iii) Num_cond; iv) Num_logic; v) Num_shift; vi) Num_LIB_add; vii) Num_LIB_mul; viii) Num_LIB_csa; and ix) Num_LIB_mac.

Latency Characterisation

The latency of extensible instructions can be defined as the maximum delay among the subsystems on the critical path when that specific extensible instruction is executed. The major part of the critical path is contributed by the combinational operations; the other subsystems either have very little effect on the latency or do not lie on the critical path.

The customisation parameters for the latency of the decoder, clock gating hardware, top-logic and customised register are similar to those of the area overhead characterisation. The reason is that these subsystems mainly revolve around the connectivity between one another (i.e. fan-ins/fan-outs), while the internal latency is relatively constant within these subsystems.

In the combinational operations, the latency depends not only on the structural components, but also on the parallelism technique parameters and schedule parameters. Component parameters represent the latency of independent operators; parallelism technique parameters describe the latency of the internal connectivity, the number of stages in the instruction, and the level of parallelism; and schedule parameters represent multi-cycle instructions.

Power Consumption Characterisation

The characterisation of power consumption is similar to that of the constraints described above.

The customisation parameters of the decoder and top-logic relate to the connectivity between the subsystems, and are therefore: i) Con_oper; ii) Con_regf; iii) Con_ireg; iv) Num_minst; and v) Num_mcyc.

For the clock gating hardware, the customisation parameters capture the connectivity and complexity of the operations as well as the scheduling: i) Num_oper; ii) Num_regf; iii) Num_minst; iv) Num_mcyc; v) Num_ser; and vi) Num_para. The last two parameters capture the power consumption of the clock tree in the extensible instruction.

For the customised register, the power consumption relates to the number of customised registers used by the instruction. The customisation parameters are: i) Num_regf; ii) Num_ireg; and iii) Use_reg.

For the combinational operations, the power consumption characterisation is further categorised by the number of stages in the instruction and the level of parallelism in each stage. The reason for capturing power consumption when operations execute in parallel, and when multi-cycle instructions are present, is that stalling increases energy dissipation significantly.

6.3.4 Estimating Characteristics of Extensible Instructions

Area Overhead Estimation

As discussed previously, the area of an extensible instruction, A(inst), can be defined using system decomposition (eqn. 6.1):

    A(inst) = \otimes_{i \in \{dec, clk, top, reg, opea\}} A(i)    (6.3)

or as:

    A(inst) = \sum_{i \in \{dec, clk, top, reg, opea\}} A(i)    (6.4)

where A(i) is the area overhead estimate of each affected subsystem. Applying regression analysis to each subsystem and its customisation parameter subset, the area overhead estimates of the subsystems are derived as follows.

The decoder has five customisation parameters (according to Table 6.1). Using regression analysis, the relationship of the estimation model is seen to be linear, and the area overhead estimate, A(dec), is hence defined as:

    A(dec) = \sum_{i \in \{32,64,128\}} A_{regf_i} Con_{regf_i} + A_{ireg} Con_{ireg} + \sum_{i \in \{8,16,32,64,128\}} A_{oper_i} Con_{oper_i} + A_{mcyc} Num_{mcyc} + A_{minst} Num_{minst}    (6.5)

where A_{regf_i}, A_{ireg}, A_{oper_i}, A_{mcyc} and A_{minst} are the respective coefficients. For the clock gating hardware, the area overhead estimate A(clk) is defined as:

    A(clk) = \sum_{i \in \{32,64,128\}} A_{regf_i} Con_{regf_i} + A_{ireg} Con_{ireg} + A_{minst} Num_{minst}    (6.6)

A(top) is the area overhead estimate of the top-logic and is defined as:

    A(top) = \sum_{i \in \{8,16,32,64,128\}} A_{oper_i} Con_{oper_i} + \sum_{i \in \{32,64,128\}} A_{regf_i} Con_{regf_i} + A_{ireg} Con_{ireg} + A_{mcyc} Num_{mcyc} + A_{minst} Num_{minst}    (6.7)

The area overhead estimate of the customised register, A(reg), is defined as:

    A(reg) = \sum_{i \in \{32,64,128\}} A_{regf_i} Num_{regf_i} + A_{ireg} Num_{ireg}    (6.8)

A(opea) is the area overhead estimate of the combinational operations and is defined as:

    A(opea) = \sum_{i \in \{8,16,32,64,128\}} \{ A_{add/sub_i} Num_{add/sub_i} + A_{LIB\_mul_i} Num_{LIB\_mul_i} + A_{LIB\_mac_i} Num_{LIB\_mac_i} + A_{LIB\_add_i} Num_{LIB\_add_i} + A_{LIB\_csa_i} Num_{LIB\_csa_i} + A_{mul_i} Num_{mul_i} + A_{shift_i} Num_{shift_i} \} + A_{cond} Num_{cond} + A_{logic} Num_{logic}    (6.9)

Latency Estimation

As described in Section 6.3.3, the latency of an extensible instruction is the maximum delay among the subsystems on the critical path of the extensible instruction. Therefore, the latency estimate, T(inst), is defined as:

    T(inst) = \max_{i \in \{dec, clk, top, reg, opea\}} T(i)    (6.10)

where T(dec) is the latency estimate of the decoder, defined as follows:

    T(dec) = \sum_{i \in \{32,64,128\}} T_{regf_i} Con_{regf_i} + T_{ireg} Con_{ireg} + \sum_{i \in \{8,16,32,64,128\}} T_{oper_i} Con_{oper_i} + T_{mcyc} Num_{mcyc} + T_{minst} Num_{minst}    (6.11)

The latency estimate of the clock gating hardware, T(clk), is:

    T(clk) = \sum_{i \in \{32,64,128\}} T_{regf_i} Con_{regf_i} + T_{ireg} Con_{ireg} + T_{mcyc} Num_{mcyc} + T_{minst} Num_{minst}    (6.12)

T(top) is the latency estimate of the top-logic:

    T(top) = \sum_{i \in \{32,64,128\}} T_{regf_i} Con_{regf_i} + T_{ireg} Con_{ireg} + \sum_{i \in \{8,16,32,64,128\}} T_{oper_i} Num_{oper_i} + T_{mcyc} Num_{mcyc} + T_{minst} Num_{minst}    (6.13)

T(reg) is the latency estimate of the customised register:

    T(reg) = \frac{1}{Num_{regf} + Num_{ireg}} \Big( \sum_{i \in \{32,64,128\}} T_{regf_i} Num_{regf_i} + T_{ireg} Num_{ireg} \Big)    (6.14)

T(opea) is the latency estimate of the combinational operations:

    T(opea) = \frac{Num_{ser}}{Num_{mcyc} \times Num_{oper}} \sum_{i \in \{8,16,32,64,128\}} \{ T_{add/sub_i} Num_{add/sub_i} + T_{LIB\_add_i} Num_{LIB\_add_i} + T_{LIB\_csa_i} Num_{LIB\_csa_i} + T_{LIB\_mul_i} Num_{LIB\_mul_i} + T_{LIB\_mac_i} Num_{LIB\_mac_i} + T_{mul_i} Num_{mul_i} + T_{shift_i} Num_{shift_i} \} + T_{cond} Num_{cond} + T_{logic} Num_{logic} + T_{VLIW} Wid_{VLIW} + T_{vector} Wid_{vector} + T_{hwired} Wid_{hwired} + T_{ser} Num_{ser} + T_{para} Num_{para} + T_{mcyc} Num_{mcyc} + T_{minst} Num_{minst}    (6.15)
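For example, with invented counts of one 32-bit register file and one instruction register, eqn. (6.14) reduces to:

```latex
% Worked instance of eqn. (6.14): Num_regf = Num_ireg = 1.
T(reg) = \frac{1}{1 + 1}\,\big( T_{regf_{32}} \cdot 1 + T_{ireg} \cdot 1 \big)
       = \tfrac{1}{2}\,\big( T_{regf_{32}} + T_{ireg} \big)
```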

Power Consumption Estimation

Similar to the previous arguments, the power consumption can be modelled as:

    P(inst) = \sum_{i \in \{dec, clk, top, reg, opea\}} P(i)    (6.16)

where

    P(dec) = \sum_{i \in \{32,64,128\}} P_{regf_i} Con_{regf_i} + P_{ireg} Con_{ireg} + \sum_{i \in \{8,16,32,64,128\}} P_{oper_i} Con_{oper_i} + P_{mcyc} Num_{mcyc} + P_{minst} Num_{minst}    (6.17)

    P(clk) = \sum_{i \in \{32,64,128\}} P_{regf_i} Con_{regf_i} + P_{ireg} Con_{ireg} + P_{mcyc} Num_{mcyc} + P_{minst} Num_{minst}    (6.18)

    P(top) = \sum_{i \in \{32,64,128\}} P_{regf_i} Con_{regf_i} + P_{ireg} Con_{ireg} + \sum_{i \in \{8,16,32,64,128\}} P_{oper_i} Con_{oper_i} + P_{mcyc} Num_{mcyc} + P_{minst} Num_{minst}    (6.19)

    P(reg) = \sum_{i \in \{32,64,128\}} \Big\{ \frac{Use_{reg_i}}{Num_{regf_i} + Num_{ireg}} \times \big( P_{regf_i} Num_{regf_i} + P_{ireg} Num_{ireg} + P_{minst} Num_{minst} \big) \Big\}    (6.20)

    P(opea) = \sum_{i \in \{8,16,32,64,128\}} \{ P_{add/sub_i} Num_{add/sub_i} + P_{LIB\_mul_i} Num_{LIB\_mul_i} + P_{LIB\_mac_i} Num_{LIB\_mac_i} + P_{LIB\_add_i} Num_{LIB\_add_i} + P_{LIB\_csa_i} Num_{LIB\_csa_i} + P_{mul_i} Num_{mul_i} + P_{shift_i} Num_{shift_i} \} + P_{cond} Num_{cond} + P_{logic} Num_{logic} + P_{VLIW} Wid_{VLIW} + P_{vector} Wid_{vector} + P_{hwired} Wid_{hwired} + P_{ser} Num_{ser} + P_{para} Num_{para} + P_{mcyc} Num_{mcyc} + P_{minst} Num_{minst}    (6.21)
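Because every subsystem model above is linear in its customisation parameters, evaluating an estimate amounts to a dot product of fitted coefficients with parameter values, with area and power summed over subsystems (eqns. 6.4 and 6.16) and latency taken as a maximum (eqn. 6.10). The following C++ sketch assumes a simple map-based representation invented for illustration:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// One fitted linear subsystem model: parameter name -> coefficient.
using Model = std::map<std::string, double>;
// One instruction: parameter name -> value (counts, widths, connectivity).
using Params = std::map<std::string, double>;

// Evaluate a linear model: sum of coefficient * parameter value.
double evaluate(const Model& coeff, const Params& inst) {
    double total = 0.0;
    for (const auto& [name, value] : inst) {
        auto it = coeff.find(name);
        if (it != coeff.end()) total += it->second * value;
    }
    return total;
}

// Area and power estimates sum the subsystem models (eqns. 6.4 and 6.16).
double estimate_sum(const std::vector<Model>& subsystems, const Params& p) {
    double a = 0.0;
    for (const auto& m : subsystems) a += evaluate(m, p);
    return a;
}

// The latency estimate takes the maximum over subsystems (eqn. 6.10).
double estimate_latency(const std::vector<Model>& subsystems, const Params& p) {
    double t = 0.0;
    for (const auto& m : subsystems) t = std::max(t, evaluate(m, p));
    return t;
}
```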

6.4 Experimental Results

The purpose of the experiments was to evaluate the estimation models for the three constraints against “measured” values, i.e. against the case where the instructions have actually been synthesised and the power has been estimated at gate level.

For evaluation purposes, the T1040.0 version of the Xtensa processor (0.18µ technology) from Tensilica Inc. [23], with a clock speed of 166.7MHz, was used. All experiments were conducted on a Sun UltraSPARC III running at 900MHz (dual) with 4Gb of RAM. Although these models are based on the Xtensa processor and 0.18µ technology, the underlying method for deriving the models is general and can be applied to any extensible processor platform of similar capability (i.e. with the ability to design custom instructions).

6.4.1 Experimental Setup

The two series of experiments detailed in this chapter sought to: i) determine the coefficients and derive the estimation models (a one-time effort); and ii) verify the estimation models. Figure 6.3 shows the verification methodology. In the first experiment, about 5000 extensible instructions² (TIE) were automatically generated (dataset 1) with a wide range of customisation parameters; Figure 6.4 shows the possible design points of the extensible instructions around a latency of 6ns. These instructions are first compiled, and Verilog implementations are generated. The instructions are then synthesised to obtain area overhead and latency using Design Compiler from Synopsys, Inc. [6] with a 0.18µ cell library. The power consumption figures are obtained using PowerTheatre from Sequence Design, Inc. [19] with simulation data (testbenches) generated using Modelsim from Mentor Graphics, Inc. [15]. Next, dataset 1 is used to determine the coefficients of the estimation models using S-Plus from Insightful, Inc. [20]. It should be noted that determining the coefficients of the estimation models is a one-time effort. Table 6.2 shows all of the obtained coefficients of the extensible instructions. The mean absolute error between the estimation models and the synthesis results is computed over all automatically generated instructions³. In addition, the error rates are further discussed for: VLIW instructions; vectorisation instructions; hard-wired operation instructions; multi-cycle instructions; and the case when multiple instructions are part of the processor.

²The number 5000 results from an analysis of the design space. We made sure that all areas of the design space were reasonably well covered.

³The reasoning for automatically generating instructions is as follows: if we were to only generate those (few) extensible instructions that are actually useful for a certain application, then we would most likely not cover all cases needed to evaluate our estimation techniques. Even though these instructions are automatically generated, they are useful for the estimation evaluation, even though they may not speed up the application at all. However, we have also generated and evaluated extensible instructions for real-world applications, such that those applications profit from the extensible instructions.

The second set of experiments used 11 extensible instructions from three real-world applications: adpcm, gsm, and mpeg2 [135]. The synthesis and simulation results were first obtained using commercial tools (as shown in Figure 6.3). The estimation results were then computed using our estimation models. These 11 instructions were grouped into sets of instructions to evaluate the estimation models for the case when multiple instructions are present/selected in the processor. The accuracy of the estimation models for individual extensible instructions and for multiple instructions was then verified.

[Table: the fitted coefficients of the estimation models, grouped by subsystem (decoder, top-logic, clock, customised register, combinational operations). For each customisation parameter of Table 6.1, the table lists the area (grids), latency (ps) and power (µW) coefficients at 32-bit, 64-bit and 128-bit widths.]

Table 6.2: The coefficients of the extensible instructions for the purpose of calibrating through regression

[Figure: the experimental methodology. The automatically generated extensible instructions (dataset 1) and the extensible instructions used in real-world applications (dataset 2) follow two paths. To obtain the synthesis and simulation results, the instructions are compiled with the xt-xcc compiler (Tensilica Inc.); area overhead and latency are obtained with Design Compiler (Synopsys Inc.) using the 0.18µ cell library; power simulation traces are generated with Modelsim (Mentor Graphics Inc.); and power dissipation is simulated with PowerTheatre (Sequence Design Inc.). In parallel, estimation results are obtained from the estimation models, whose one-time derivation via system decomposition and regression analysis is described in Figure 6.2. Finally, the synthesis/simulation results and the estimation results for datasets 1 and 2 (area overhead, latency, power dissipation) are compared to verify the estimation models by computing the mean absolute error.]

Figure 6.3: Experimental methodology

[Figure: a scatter plot of area overhead (grids, 0-45000) versus latency (ns, 4.0-7.5) for the generated extensible instructions with latencies around 6ns.]

Figure 6.4: A design space example of the extensible instructions (around 6ns)

6.4.2 Evaluation Results

In our first experiment, we examined the accuracy of the estimation models under differing customisations: VLIW; vectorisation; hard-wired operation; multi-cycle; and sequences of (multiple) extensible instructions. Table 6.3 shows the mean absolute error for the area overhead, latency, and power consumption of the estimation models in these categories. The mean absolute error (for area overhead, latency, and power consumption) for hard-wired operation is lower than for instructions using the VLIW and vectorisation techniques. This is because the estimation for VLIW and vectorisation instructions depends on a larger number of customisation parameters, and hence higher error rates are observed. In terms of schedules, the mean absolute error is relatively close to the average mean error. The mean absolute error of the estimation models across all automatically generated instructions is only 2.5% for area overhead, 4.7% for latency, and 3.1% for power consumption.

Figure 6.5 shows the mean absolute error of the estimation models for sequences of (multiple) instructions in the real-world applications. The mean absolute error for previously unseen multiple instructions ranges between 3% and 6% for the three estimation models. Figure 6.6 summarises the accuracy of the estimation models for previously unseen individual extensible instructions from the real-world applications (dataset 2).

The maximum estimation error is 6.7%, 9.4%, and 7.2% for area overhead, latency and power consumption respectively, while the mean absolute error is only 3.4%, 5.9%, and 4.2%. The estimation errors are all far below the estimation errors of the commercial estimation tools (typically around 20% absolute error at gate level) against which our models were verified. As such, we can conclude that our models are accurate enough for the purpose of high-level design space exploration for extensible instructions.

Our estimation models are also by far faster than a complete synthesis and simulation process using commercial tools.

[Table: for each category of extensible instruction (VLIW, vectorisation, hard-wired, multi-cycle, multiple instructions, and overall), the mean absolute, maximum and minimum error (%) of the area overhead, latency and power consumption estimates. Overall, the mean absolute error is 2.5% for area overhead, 4.7% for latency and 3.1% for power consumption.]

Table 6.3: The mean absolute error of the estimation models in different types of extensible instructions

[Figure: bar chart of the mean absolute error (%) of the area overhead, latency and power dissipation estimates for multiple instructions in the real-world applications (Adpcm sets 1-3, Gsm sets 1-4, Mpeg2dec sets 1-4); errors range from about 1% to 8%.]

Figure 6.5: The accuracy of the estimation models for multiple instructions (sets of instructions: set 1 contains a single instruction, set 2 contains a group of two instructions, etc.)

[Figure 6.6: The accuracy of the estimation models in real-world applications. (a) Area overhead (grids): synthesised versus estimated; (b) latency (us): synthesised versus estimated; (c) power consumption (mW): simulated versus estimated, for each instruction (Inst 1-3 of Adpcm, Inst 1-4 of Gsm, Inst 1-4 of Mpeg2dec).]

Our estimation models are far faster than a complete synthesis and simulation process using commercial tools. The time taken by Design Compiler and

PowerTheater to determine the customisation of an extensible instruction can be up to several hours, while our estimation models require at most a few seconds. This speed is another prerequisite for extensive design space exploration.

6.5 Conclusions

This chapter has presented fast and accurate techniques for estimating the area overhead, latency, and power consumption of extensible instructions. As distinct from similar work in the field, our techniques also include models for instruction parallelism and multi-cycling, which are crucial for accurate and reliable estimation. Our models, calibrated through regression analysis and verified against commercial synthesis and estimation tools, make it possible to explore the large design space of extensible processors. The estimation techniques have been integrated into our extensible processor tool suite.

Our contributions to this instruction estimation model include:

• Derivation of fast and accurate estimation models (area overhead, latency, and power consumption) of extensible instructions;

• Simplification of the process of modelling extensible instructions by using system decomposition and regression analysis; and

• The use of both parallelism techniques and schedule alternatives in the instruction models.

A summary of the results is as follows: the mean absolute error for a set of instructions used in real-world applications is 3.4% (6.7% max.) for area overhead, 5.9% (9.4% max.) for latency, and 4.2% (7.2% max.) for power consumption. Our estimation models execute in a few seconds for an instruction, whereas synthesis and subsequent estimation would take hours.

Future work will involve extending our estimation techniques to allow them to estimate more complex extensible instructions (as they will soon be available in commercial extensible processor tool suites).

Chapter 7

Instructions Generation

This chapter presents an automatic extensible instructions generation tool with battery awareness, which minimises the power dissipation of the instructions while maximising speedup. As discussed previously, extensible instructions are customised to replace computationally intensive code segments (groups of primitive instructions) in the application, satisfying performance and power dissipation constraints. A typical approach to generating extensible instructions is to replace a computationally intensive code segment with a single extensible instruction that maximises the speedup. However, the drawback of this approach is that the energy consumption of the extensible instruction is not minimised, leading to unnecessary energy consumption in the extensible processor. This chapter proposes methodologies that extend the typical approach to also minimise the energy consumption of the extensible instructions. There are two proposed methods: i) separating instructions into multiple instructions; and ii) utilising the slack of the instructions.

7.1 Motivation

Previous approaches to this problem have largely focused on identifying large computationally intensive primitive instruction groups in the application and combining them into a single extensible instruction [33, 58, 67, 70, 193]. These approaches often maximise speedup and reduce execution time, and hence minimise the energy consumption of the application. However, the drawback of these approaches is that the power dissipation of the extensible instructions is large, often accounting for up to 20% of the total power dissipation when those instructions are executed. The variance of the power dissipation distribution (between the base processor and the base processor plus extensible instructions) is not minimised. Thus, the variance of the discharge current distribution is not minimised, which often leads to shortened battery lifetime [165].

Figure 7.1 shows two designs that can be generated in order to replace a single code segment in the original software-based application. The code segment has six inputs, namely a, b, c, d, e, and f. The sum of the first four inputs is multiplied by the sum of the last two inputs, and the sum of the last two inputs is then added to the product to produce an output, z; this code segment executes for 25% of the time in the application. The base processor is a five-stage pipeline processor that runs at 222MHz with an average power dissipation of 100mW. If an extensible instruction is to replace the code segment using previous state-of-the-art approaches [33, 58, 67, 70, 193], all of the operations are combined into a single instruction (shown in Figure 7.1a). The average power dissipation of the instruction is 31.65mW, which is consumed by the different operations (i.e. five adders and a multiplier) and the registers used (i.e. seven registers). On the other hand, the designer can separate the code segment into two instructions, as shown in Figure 7.1b. The average power dissipation of these instructions is 14.18mW and 15.68mW respectively. The reduction in average power dissipation is due to the fact that each of the instructions contains fewer operations and registers. Using the design in Figure 7.1b, the energy consumption of the application (computed using the battery behaviour model [146, 165], which is described in the Background) is reduced by 7% compared with the design in Figure 7.1a. The reduction in energy consumption arises because the instructions in Figure 7.1b are executed sequentially, and an extensible instruction only dissipates power when it is executed (see Background).

[Figure 7.1: A motivation example: separating an instruction to reduce energy consumption. The code segment, z = (a + b + c + d) * (e + f) + (e + f), occupies 25% of the execution time, and the application is assumed to run for 1,000,000 clock cycles. (a) Single instruction: 5 clock cycles, area 72,347 grids, average power 31.65 mW, application energy 103.27 uJ. (b) Multiple instructions: one of 2 clock cycles, 29,008 grids, 14.18 mW, and one of 3 clock cycles, 50,535 grids, 15.68 mW; application energy 95.94 uJ.]

Therefore, energy consumption may be reduced by separating a large computationally intensive code segment into multiple instructions. In the era of cost efficiency, high performance, and portability, it is critical to reduce power dissipation (to reduce energy consumption and extend battery lifetime) and to explore as many design points as possible. Therefore, an automatic, battery-aware tool for designing extensible instructions is needed.

7.2 Background

Our instructions generation tool uses the battery behaviour model [146, 165] to define the actual energy capacity that can be drawn from a battery. The battery lifetime, BL, can be defined as

$$BL = \frac{CAP}{P_{act}} \qquad (7.1)$$

where CAP is the ideal energy capacity of the battery and $P_{act}$ is the actual power consumption of the circuit. The energy capacity of a battery is the amount of energy stored in the battery, measured in ampere-hours or watt-hours. Normally, the capacity of a battery decreases as the discharge current increases. In the analytical model, the actual current, $I_{act}$, that is drawn from the battery is

$$I_{act} = \frac{I}{\mu}, \quad 0 \leq \mu \leq 1 \qquad (7.2)$$

where µ is the battery efficiency (or utilisation) factor. The actual energy capacity, $CAP_{act}$, is

$$CAP_{act} = CAP \cdot \mu, \quad 0 \leq \mu \leq 1 \qquad (7.3)$$

From Peukert’s formula [146], the relationship between the battery capacity and the discharge current is empirically defined as

$$CAP = \frac{k}{I^{\alpha}} \qquad (7.4)$$

where k is a constant determined by the chemical and physical design of the battery, and I is the discharge current. For an ideal battery, α equals 0; for a real battery, α ranges up to 0.7 for a typical load. Using eqn. 7.2 and eqn. 7.3, the battery efficiency factor can be defined in terms of the discharge current as

$$\mu = f(I) = \frac{1}{I^{\alpha}} \qquad (7.5)$$

where f is a monotonically decreasing function [165]. In our case, the tool uses α equal to 0.7. A comprehensive survey of battery modelling techniques is given in [133].

In [165], Pedram and Wu showed that battery efficiency is affected by the average discharge current as well as by the average current profile. The actual power drawn out of the battery is defined as

$$P_{act} = V \cdot \int \frac{I}{\mu(I)} \, p(I) \, dI \qquad (7.6)$$

where V is the voltage of the circuit; µ(I) is the battery efficiency factor; and p(I) is the probability density function of I. From eqn. 7.1 to eqn. 7.6, maximum battery lifetime is achieved when the variance of the discharge current distribution is minimised. If it is assumed that the voltage is relatively constant during operation, then maximum battery lifetime is achieved when the variance of the power dissipation distribution is minimised.
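To make the model concrete, the following Python sketch (all capacity and current values are hypothetical, and the integral of eqn. 7.6 is approximated by a sum over a discretised current profile) compares two load profiles with the same mean current; the spikier profile draws more actual power and therefore shortens the battery lifetime of eqn. 7.1, which is exactly the variance effect described above.

```python
# A minimal sketch of the battery behaviour model (eqns. 7.1-7.6).
# All numbers are hypothetical; the integral in eqn. 7.6 is
# approximated by a weighted sum over a discretised current profile.

ALPHA = 0.7      # Peukert exponent alpha used by the tool (eqn. 7.5)
VOLTAGE = 1.8    # circuit voltage V, assumed constant during operation

def efficiency(current):
    """Battery efficiency factor mu = f(I) = 1 / I^alpha (eqn. 7.5)."""
    return 1.0 / current ** ALPHA

def actual_power(profile):
    """P_act = V * sum over I of (I / mu(I)) * p(I) (eqn. 7.6, discretised).

    `profile` maps a discharge current (A) to its probability p(I).
    """
    return VOLTAGE * sum(i / efficiency(i) * p for i, p in profile.items())

def battery_lifetime(capacity_wh, profile):
    """BL = CAP / P_act (eqn. 7.1); CAP here in watt-hours, BL in hours."""
    return capacity_wh / actual_power(profile)

# Two profiles with the same mean current (0.5 A):
flat = {0.5: 1.0}
spiky = {0.1: 0.5, 0.9: 0.5}
print(battery_lifetime(1.0, flat))   # ~1.80 hours
print(battery_lifetime(1.0, spiky))  # ~1.30 hours: variance hurts lifetime
```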

The Xtensa processor from Tensilica Inc. [23] was used as the design platform for this work. It consists of a five-stage pipeline RISC base core with approximately 80 base instructions, plus the capacity to define specific functionality through extensible instructions (which coexist with the base instructions) using the Tensilica Instruction Extension (TIE) language. An extensible instruction (in Xtensa) has hardware to clock-gate the instruction, so that the instruction can be turned on and off as needed. Therefore, the instruction is assumed to dissipate power only when it is executed.

7.3 Problem Statements and Preliminaries

Our automatic tool was developed for the generation step in the design flow. The computationally intensive code segments have been identified in the identification step using simulation and profiling [58].

Problem 1: Given a computationally intensive primitive instruction group, generate extensible instruction(s) that maximise the speedup of the extensible instruction(s) while minimising the average power dissipation distribution of the extensible instruction(s).

A Control Dataflow Graph (CDFG), G(V,E), is a directed acyclic graph (DAG) in which the vertices, V, represent primitive instructions from the base instruction set of the target processor, and the edges, E, represent the data dependencies between instructions. The graph has two properties: i) execution time, et(G), the time required to execute the instruction in the execution stage of the processor, which includes reading registers from the previous stage, executing operations, and writing registers; and ii) average power dissipation, apd(G). In addition, the latency (clock period) of the target processor is denoted $latency_{proc}$. Without loss of generality, it is assumed that G(V,E) is a convex CDFG [33] and contains a maximum of ten input ports and ten output ports.

Using these preliminaries, Problem 1 is redefined as follows.

Problem 2: Given a convex control dataflow graph (CDFG), G(V,E), find subgraph(s) (G′ ⊆ G) that cover all the vertices (v ∈ V) and minimise the following quantities:

1. The execution time of the graph, the sum of the execution times of all subgraphs:

$$\sum_{G' \subseteq G} et(G')$$

2. The average power dissipation of the graph, the average of the subgraphs’ average power dissipation:

$$\frac{\sum_{G' \subseteq G} apd(G')}{|G|}$$

It should be noted that minimising the execution time drives the design towards as few extensible instructions as possible, because the execution overhead time (such as the register read/write time) may increase the overall delay. On the other hand, minimising the average power dissipation of the graph compels the design to contain multiple smaller instructions. As a result of these competing demands, this instruction generation problem is complex.
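As an illustration of the two competing objectives, the following Python sketch evaluates a candidate cover of a CDFG; the power figures echo the example of Figure 7.1, while the execution times and helper names are hypothetical.

```python
# A minimal sketch of the two objectives of Problem 2. Each subgraph
# G' carries its execution time et(G') and average power dissipation
# apd(G'); |G| is taken as the number of subgraphs in the cover.
from dataclasses import dataclass

@dataclass
class Subgraph:
    et: float    # et(G') in ns
    apd: float   # apd(G') in mW

def objectives(cover):
    """Return (sum of et(G'), average of apd(G')) for a cover of G."""
    total_et = sum(g.et for g in cover)
    avg_apd = sum(g.apd for g in cover) / len(cover)
    return total_et, avg_apd

# One large instruction versus the two-instruction split of Figure 7.1
# (5 cycles versus 2 + 3 cycles at a ~4.5 ns clock period):
single = [Subgraph(et=22.5, apd=31.65)]
split = [Subgraph(et=9.0, apd=14.18), Subgraph(et=13.5, apd=15.68)]
print(objectives(single))  # (22.5, 31.65)
print(objectives(split))   # (22.5, 14.93); real splits add register overhead
```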

7.4 Instruction Generation

Our automatic instruction generation tool contains two algorithms (five phases): an instruction generation algorithm and a battery-awareness algorithm. An overview of the automatic instruction generation tool is shown in Figure 7.2.

[Figure 7.2: An overview of the automatic instruction generation tool. The inputs are the latency of the target processor and a control dataflow graph (CDFG). Instruction generation algorithm: identify vertices or groups of vertices (patterns) in the CDFG; select patterns to minimise the critical path of the graph; and estimate the clock cycles of the instruction based on the latency of the target processor, producing an extensible instruction (single or multiple cycles). Battery-awareness algorithm: separate the instruction into multiple instructions if power dissipation is reduced; and utilise the slack of the extensible instruction(s) to minimise power dissipation. The output is the extensible instruction(s), scheduled in single or multiple cycles, that minimises power dissipation.]

7.4.1 Instruction Generation Algorithm

The goal of the instruction generation algorithm is to generate an extensible instruction that maximises the speedup and schedules the instruction into clock cycles based on the latency of the processor. The inputs of this algorithm are the latency of the target processor and a control dataflow graph that represents a computationally intensive code segment. The output is an extensible instruction. This algorithm consists of three phases: pattern identification, pattern selection, and clock cycle estimation.

Pattern identification is used to identify a vertex in the graph that can be replaced by well-defined operations (hereafter referred to as patterns), which could reduce the execution time of the graph. Additionally, if a group of vertices can be merged into a single pattern (for example, where a number of sequential additions could potentially be integrated into a single adder), these are identified during this phase. This phase begins the identification with each vertex, and expands to the connected vertices in order to find patterns until all the vertices have been searched. All of these patterns are passed on to the next phase, along with the control dataflow graph.

Pattern selection aims to select the patterns that minimise the execution time of the control dataflow graph in order to maximise speedup. A heuristic scheme is proposed that replaces vertices with identified patterns only along the critical path. This phase first estimates the reduction in execution time for each identified pattern along the critical path. The heuristic scheme then selects the patterns that shorten the critical path the most. The reason for searching for patterns only along the critical path is that the search space is complex (the number of possible patterns in the graph is $2^{|V|}$); minimising the critical path is sufficient to achieve maximum speedup in a short time, which reduces design turnaround time. However, after patterns are replaced, new critical paths may be formed. Therefore, our scheme continues to minimise the critical path until the execution time of the graph is minimised.

Clock Cycle Estimation: After the critical path of the control dataflow graph has been minimised, our tool estimates the number of clock cycles that the graph will take to execute, given the latency of the processor. The reason for estimating the clock cycles is to avoid violating the processor's latency. If the graph violates the latency and is not scheduled over multiple clock cycles, then the clock period of the base processor core must be increased, decreasing performance significantly. The clock cycles of the instruction are defined as:

$$Clockcycle_{inst} = \left\lceil \frac{et(G)}{latency_{proc}} \right\rceil \qquad (7.7)$$

where et(G) is the execution time of the graph and $latency_{proc}$ is the latency of the target processor. Figure 7.3 shows the instruction generation algorithm, InstGen.

7.4.2 Battery-Awareness Algorithm

The battery-awareness algorithm optimises the battery lifetime of the product by minimising power dissipation of extensible instructions. This occurs in two phases: instruction separation and slack utilisation.

Algorithm InstGen (G, latency_proc) {
    for all vertices v ∈ V do
        if (v == patterns)
            patterns_list = add_patterns(v);
    for all patterns p ∈ patterns_list do
        for all vertices v′ connected to p do
            G′ = build_subgraph(v′, p);
            if (G′ == patterns)
                patterns_list = add_patterns(G′);
    do
        criticalpath = find_criticalpath(G);
        for all vertices v ∈ criticalpath do
            if (delay(p) ≤ Σv∈p delay(v))
                replace_patterns(p, v, G);
        tmp_criticalpath = find_criticalpath(G);
    until (tmp_criticalpath == criticalpath);
    CC = estimate_clock_cycle(G);
    return G;
}

Figure 7.3: Algorithm InstGen for generating an extensible instruction that minimises the execution time
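The following Python sketch is one possible rendering of the InstGen loop of Figure 7.3, under simplifying assumptions: the CDFG is stored as per-vertex delays plus a predecessor map, and only single-vertex pattern substitutions are considered (the thesis also merges groups of vertices, which requires graph surgery omitted here). The helper names and the toy graph are hypothetical.

```python
# A sketch of the InstGen loop (Figure 7.3). The CDFG is a DAG with
# one delay per vertex; for brevity, only single-vertex pattern
# substitutions along the critical path are considered.
import math

def topo_order(preds):
    """Kahn's algorithm over a {vertex: set(predecessors)} mapping."""
    pending = {v: set(ps) for v, ps in preds.items()}
    order = []
    while pending:
        ready = [v for v, ps in pending.items() if not ps]
        order += ready
        for v in ready:
            del pending[v]
        for ps in pending.values():
            ps -= set(ready)
    return order

def critical_path(delay, preds):
    """Length and vertex list of the longest delay-weighted path."""
    finish, via = {}, {}
    for v in topo_order(preds):
        p = max(preds[v], key=lambda u: finish[u], default=None)
        finish[v] = (finish[p] if p is not None else 0.0) + delay[v]
        via[v] = p
    v = max(finish, key=finish.get)
    path = []
    while v is not None:
        path.append(v)
        v = via[v]
    return max(finish.values()), path[::-1]

def inst_gen(ops, delay, preds, patterns, latency_proc):
    """Shorten the critical path, then estimate cycles via eqn. 7.7."""
    while True:
        before, path = critical_path(delay, preds)
        for v in path:  # replace patterns only along the critical path
            delay[v] = min([delay[v]] + patterns.get(ops[v], []))
        after, _ = critical_path(delay, preds)
        if after >= before:  # critical path no longer shrinks
            return math.ceil(after / latency_proc)  # Clockcycle_inst

# Toy CDFG: two chained adds and one parallel add feeding a multiply;
# a 222 MHz processor gives a ~4.5 ns clock period.
ops = {1: "add", 2: "add", 3: "add", 4: "mul"}
delay = {1: 4.1, 2: 4.1, 3: 4.1, 4: 6.8}
preds = {1: set(), 2: {1}, 3: set(), 4: {2, 3}}
patterns = {"add": [3.3], "mul": [5.5]}
print(inst_gen(ops, delay, preds, patterns, latency_proc=4.5))  # -> 3
```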

Instruction separation aims to separate the instruction into multiple instructions in order to reduce the power dissipation, extending the battery life of a product. The input is a single extensible instruction generated by the instruction generation algorithm, while the output is one or multiple extensible instructions. This phase searches every separation point (a cut that separates a graph into two subgraphs) with multiple fan-ins and fan-outs, constructs two subgraphs using the separation points, and evaluates the power dissipation reduction. The power dissipation evaluation is conducted using a power estimator that runs in a few seconds for each instruction. This phase then selects the separation point with the minimum power dissipation. These steps are iterated on the newly separated subgraphs in order to minimise the power dissipation until no further separation point is found. Figure 7.4 shows an example of possible separation points. There are nine vertices (v1, v2, ..., v9) in this control dataflow graph, with ten inputs (a, b, ..., j) and two outputs (y, z). Eight possible separation points (C1, C2, ..., C8) separate the instruction differently. Assume all vertices consume the same amount of power. The peak power dissipation of the instruction occurs before vertex v5. The predecessor vertex, v4, must be scheduled in parallel with one of the vertices v1, v2, or v3 in order to maximise speedup. Peak power dissipation is doubled when vertices are overlapped, which often occurs when vertices have multiple fan-ins or fan-outs. However, separating an instruction into multiple instructions may reduce the speedup, which in turn may lead to higher energy consumption. The separation function therefore compares the power dissipation and the number of clock cycles of the original instruction and of the separated multiple instructions. The separation function is derived from the actual power equation (eqn. 7.6) and Amdahl's law [163], and is defined as

$$Separation_{Inst} = 1 - \sum_{x \in Inst} \eta_x \cdot \left( \frac{I_{proc} + I_x \theta_x}{I_{proc} + I_{orig\ inst}} \right)^{1+\alpha} \qquad (7.8)$$

where x ranges over the instructions separated from the original instruction, Inst; $\eta_x$ is the percentage of the probability density function of instruction x compared with the probability density function of the original instruction (the probability density function is related to the number of clock cycles in the instruction); and $\theta_x$ is the percentage of current dissipation reduction compared to the original instruction (current dissipation is related to the power dissipation, where the voltage is assumed constant).
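The following sketch shows how the separation function of eqn. 7.8 might score one candidate cut; the current values and the η/θ shares are hypothetical, and the power estimator that produces them in the real tool is not modelled.

```python
# A sketch of the separation function of eqn. 7.8. All currents (mA)
# and the eta/theta shares below are hypothetical; a larger score
# means a larger reduction, so InstSep keeps the best-scoring cut.
ALPHA = 0.7  # Peukert exponent, as in eqn. 7.5

def separation(i_proc, i_orig_inst, parts):
    """Score a cut; `parts` holds (eta_x, i_x, theta_x) per instruction x."""
    return 1.0 - sum(
        eta * ((i_proc + i_x * theta) / (i_proc + i_orig_inst)) ** (1.0 + ALPHA)
        for eta, i_x, theta in parts
    )

# A 5-cycle instruction cut into a 2-cycle and a 3-cycle instruction:
print(separation(
    i_proc=55.0,               # base processor current
    i_orig_inst=17.5,          # current of the original single instruction
    parts=[(0.4, 7.8, 0.9),    # (eta_x, i_x, theta_x) for instruction 1
           (0.6, 8.7, 0.9)],   # ... for instruction 2
))  # > 0 here, so this cut is predicted to help
```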

[Figure 7.4: An example of separating instructions to reduce power dissipation: a control dataflow graph with nine vertices (v1–v9), ten inputs (a–j), two outputs (y, z), and eight separation points (C1–C8).]

[Figure 7.5: An example of utilising the slack of the instruction. (a) A control dataflow graph with inputs a–f and output z, consisting of three adders and two multipliers. (b) Candidate patterns with their execution time and power: add_a 3.3 ns, 2.5 mW; add_b 4.1 ns, 2.2 mW; mul_a 5.5 ns, 10.5 mW; mul_b 6.8 ns, 9.8 mW.]

Slack utilisation aims to utilise the slack of the instruction. The clock cycles of the instruction are computed using eqn. 7.7, which yields an approximate value. Therefore, this phase utilises the slack within the clock cycle time of the graph to further minimise the power dissipation of the instruction. This phase involves searching every path of the graph (a path is a group of vertices from an input port to an output port), including the non-critical paths, and ranking them, with the critical path ranked highest. For each path, patterns that reduce the power dissipation while maintaining the clock cycle time are selected and replaced. Vertices are then marked and will not be replaced on successive paths, in order to maintain the clock cycles of the instruction. Figure 7.5 shows an example of utilising the slack of the instruction. The control dataflow graph has six inputs, namely a, b, c, d, e, and f, and an output, z; it consists of three adders and two multipliers, as shown in Figure 7.5a. Figure 7.5b shows four patterns that can be used to replace vertices, together with their delay and power dissipation.

Algorithm BattAware (G) {
    G′ = InstSep(G);
    path_list = find_allpath(G′);
    rankpath_list = rank(path_list);
    do
        for all ranked paths pa ∈ rankpath_list do
            slack = estimate_slack(pa);
            Utilise(G′, pa, slack);
            Mark_Update(rankpath_list);
    until (rankpath_list == {∅});
    return G′;
}

Algorithm InstSep (G) {
    cut_list = find_allcuts(G);
    for all cuts c ∈ cut_list do
        if (Separation(c, G) > maxpower_red)
            thecut = c;
            maxpower_red = Separation(c, G);
    G′ = separate(G, thecut);
    if (G′ != G)
        G′ = InstSep(G′ → left);
        G′ = InstSep(G′ → right);
    return G′;
}

Figure 7.6: Algorithm BattAware for optimising battery lifetime in the instruction

For this example, assume the clock speed of the target processor is 222MHz. Using the instruction generation algorithm, the patterns add_a and mul_a are mapped to the additions and multiplications respectively to minimise the execution time. The resulting power dissipation and execution time are 28.5mW and 20.9ns respectively, so the number of clock cycles required to execute this instruction is ⌈4.64⌉ = 5. To utilise the slack, one of the add_a patterns is replaced with add_b, reducing the power dissipation to 28.2mW while maintaining the number of clock cycles of the instruction. This phase is effective on the non-critical paths of the graph. This approach further explores the fine-grain granularity of instruction generation. The battery-awareness algorithm is shown in Figure 7.6.
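The arithmetic of this example can be replayed directly; the pattern data come from Figure 7.5b, the clock period of the 222MHz processor is about 4.5ns, and it is assumed, as the 20.9ns figure implies, that all five operations lie on the critical path.

```python
# Replaying the slack-utilisation arithmetic of Figure 7.5 in Python:
# three additions and two multiplications, pattern data from Fig. 7.5b.
import math

PERIOD_NS = 1000.0 / 222.0                  # 222 MHz => ~4.505 ns period
add_a, add_b = (3.3, 2.5), (4.1, 2.2)       # (delay ns, power mW)
mul_a = (5.5, 10.5)

delays = [add_a[0]] * 3 + [mul_a[0]] * 2    # fastest mapping from InstGen
powers = [add_a[1]] * 3 + [mul_a[1]] * 2
cycles = math.ceil(sum(delays) / PERIOD_NS)          # ceil(4.64) = 5 cc
print(round(sum(powers), 2), cycles)                 # 28.5 mW over 5 cc

# The 5-cycle budget leaves slack, so one add_a -> add_b swap fits:
slack = cycles * PERIOD_NS - sum(delays)             # ~1.6 ns of slack
if slack >= add_b[0] - add_a[0]:                     # the 0.8 ns swap fits
    delays[0], powers[0] = add_b
print(round(sum(powers), 2),
      math.ceil(sum(delays) / PERIOD_NS))            # 28.2 mW, still 5 cc
```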

[Figure 7.7: The experimental platform for verifying our automatic instruction generation tool. The first experiment examines the efficiency of the tool by comparing the characteristics of the instructions generated by the different algorithms in the tool: the generated instructions (set 1 and set 2) are synthesised using Design Compiler to obtain area, power dissipation, and execution time, and the energy consumption of the application is estimated using the equations. The second experiment tests the effectiveness of the instructions when applied in the application: the runtime of the application, written in C/C++, is evaluated using an instruction set simulator (ISS).]

7.5 Experimental Results

7.5.1 Experimental Setup

For evaluation purposes, the T1050.2 version of the Xtensa processor (0.18µm technology) from Tensilica Inc. [23] was used. This processor has a clock speed of 222MHz and consumes 100mW of power. All experiments were conducted on a quad Intel machine running at 2.4GHz, with 4GB of RAM. Although these models are based on the Xtensa processor, the underlying method is general and can be applied to any platform of similar capability (i.e. the ability to design extensible instructions). Figure 7.7 shows the experimental platform for verifying our automatic instruction generation tool.

This chapter details two separate experiments that were conducted to determine the efficiency of the automatic instruction generation tool. Two different sets of instructions generated by the tool are compared: i) instructions generated by the instruction generation algorithm alone (set 1); and ii) instructions generated by the tool with both algorithms (set 2). In the first experiment, the characteristics of the instructions are examined, including area, power dissipation, execution time, and energy consumption of an application. Fifty code segments covering a wide range of instructions are selected. These code segments are first applied to the automatic instruction generation tool. The generated instructions are then compiled, and Verilog implementations are generated. These instructions are synthesised to obtain area, power dissipation, and execution time using Design Compiler from Synopsys, Inc. [6] with a 0.18µm cell library. The energy consumption is computed using the battery behaviour model described in Section 7.2 and the characteristics obtained. Each instruction is assumed to occupy 25% of the runtime in an application that runs for 1,000,000 clock cycles.

The second set of experiments uses the identified code segments in five real-world applications: adpcm encoder (adpcm), g721 encoder (g721e), g721 decoder (g721d), epi encoder (epie), and epi decoder (epid) [135]. Our tool is applied to these code segments, and the two different sets of generated instructions are extracted from the automated tool. The code segments are replaced by the instructions, and the runtime and energy consumption of the application are evaluated.

7.5.2 Evaluation Results

In our first experiment, we examined the area, power dissipation, execution time, and energy consumption of the instructions. Table 7.1 shows the mean and maximum values of these characteristics (columns 2-9), together with the comparison of the characteristics of the instructions in set 1 and set 2 (row 5).

                   Area Overhead   Power Consumption   Execution Time   Energy Consumption
                   [grids]         [mW]                [ns]             [uJ]
Instruction Set    Ave.    Max.    Ave.    Max.        Ave.    Max.     Ave.    Max.
Set 1              15068   66348   21.98   61.51       10.52   26.97    76.48   123.01
Set 2              15357   63014   15.43   42.27       10.85   22.39    72.01   101.21
Reduction [%]      -1.92   5.03    29.80   31.28       -3.14   16.98    5.84    17.72

Table 7.1: The characteristics of the generated instructions (set 1 and set 2) for the fifty code segments

The area and execution time of the instructions in set 2 have increased by 2% and 3% respectively when compared to the instructions in set 1. However, the power dissipation in set 2 is reduced by 29.8% on average compared to the instructions in set 1. The reduction is achieved by using multiple instructions and reducing the number of registers. In addition, the energy consumption of an application is further reduced by 5.8% on average (up to 17.7%) when compared to the instructions in set 1. Our tool is unable to reduce the energy consumption for 32% of the code segments in this experiment, as those code segments are too small to separate. If these instructions were separated, their performance would be reduced significantly, leading to a large increase in energy consumption. Figures 7.8a and 7.8b show the energy reduction versus the complexity of the instruction (the number of primitive instructions in a code segment), and the energy reduction versus the original power dissipation, respectively. Both trend lines in these figures show that our tool is more effective when the code segment is large and complex. Thus, separating large computationally intensive code segments into multiple extensible instructions and utilising slack reduces the energy consumption of the application more effectively than combining a large code segment into a single instruction.

Table 7.2 shows the characteristics of the five real-world applications when the different generated instructions are implemented, compared to the original application (without any extensible instructions). The first column indicates the application name.

[Figure 7.8: Trendlines of energy reduction for extensible instructions: (a) energy reduction (%) versus the complexity of the instruction (the number of vertices in the instruction); (b) energy reduction (%) versus the power dissipation of the instruction (mW).]

The second column displays the different extensible instructions (see Figure 7.7) used by the application. The third column shows the average speedup of the instructions, while the fourth column shows their average power dissipation. The runtime and energy consumption of the application are shown in the next two columns. The last column displays the energy reduction comparison between the application using the instructions in set 1 and the application using the instructions in set 2. The average speedup of the instructions in set 1 and set 2 is within 5% of each other. The runtimes of these applications using the instructions in set 1 and set 2 are also within 5%. The runtime of the epi encoder, epie, is reduced significantly, by 15.7×. The average power dissipation of the instructions in set 2 is 32.66% lower than that in set 1. Thus, the energy consumption of the application using set 2 is on average 6.6% less (up to 16.53% less) than the energy consumption of the application using the instructions in set 1. For the epi encoder, the energy reduction comparison is only 0.11%, which is due to an increase in the application's runtime.

7.6 Conclusions

This chapter has presented an automatic tool for generating extensible instructions for the extensible processor platform. Unlike similar work in the field, our tool includes a battery-aware algorithm that contains instruction separation and slack utilisation.

Our tool achieves two major feats:

• It separates instructions and utilises the slack of the instruction to reduce the power dissipation of extensible instructions and to explore fine-grain granularity in instruction generation; and

• For the first time, battery lifetime (via the battery behaviour model) is taken into account in generating extensible instructions, rather than just shortening the execution time to produce an energy reduction.

                          Instruction                Application                  Energy Reduction
Application  Instruction  Average    Average         Execution   Energy          Comparison
             set          Speedup    Power           Time        Consumption     (Set 1 & Set 2)
                          [x]        [mW]            [second]    [mJ]            [%]
adpcm        Original     –          –               0.111       2.21
             Set 1        2.2        15.27           0.095       2.04
             Set 2        2.1        12.98           0.094       2.00            2.18
g721e        Original     –          –               1.58        31.53
             Set 1        7.09       44.52           0.64        16.13
             Set 2        7.22       28.45           0.61        14.12           12.43
g721d        Original     –          –               1.54        30.72
             Set 1        7.09       44.52           0.65        16.33
             Set 2        7.25       28.45           0.59        13.63           16.53
epie         Original     –          –               7.54        150.44
             Set 1        18.52      34.67           0.48        11.86
             Set 2        17.54      16.48           0.52        11.84           0.11
epid         Original     –          –               0.55        10.97
             Set 1        8.15       34.67           0.16        3.94
             Set 2        7.58       21.48           0.17        3.87            1.96

Table 7.2: The characteristics of the applications when the different instructions generated by the tool are applied

Our tool is able to generate large and complex extensible instructions with low power dissipation. The energy consumption of real-world applications that use the instructions generated by our tool is reduced by up to 16.53% compared to applications using instructions generated by previous methods. The tool generates an extensible instruction for a given code segment in a few seconds, and has been integrated into our extensible processor tool suite.

Chapter 8

Conclusions

Design turnaround time has been elevated to become one of the most important metrics in the extensible processor platform, for the variety of reasons discussed in Chapter 1. It is therefore essential to automate design approaches in order to meet the requirements of embedded systems. While significant improvements in the extensible processor platform do result in large reductions in design time, they by no means address the entire design flow. As a result, it is necessary to develop design automation methodologies for the various processes in the design flow. To date, most research and commercial development work on the extensible processor platform has focused on generating specific instructions that offer performance improvements in the application. This thesis has presented a suite of design automation methodologies for the extensible processor platform: automating the processes of identifying code segments, generating instructions, matching pre-designed instructions, selecting architectural customisations, and evaluating the processor in the design flow.

Chapter 4 described a semi-automatic design system (instruction generation and the matching of instructions to code segments were conducted manually in this design system, hence "semi-automatic") to design an extensible processor that maximises application performance while satisfying a given area constraint. An important problem addressed in Chapter 4 was to develop an understanding of how high-level code segment characteristics affect the design metrics of extensible instructions. This understanding is drawn from the designer's experience. Based on this understanding, Chapter 4 proposed a fitting function to identify code segments, which involves computing sufficient profile information and extracting high-level code segment characteristics to predict the design metrics of the extensible instruction to be implemented. Using this fitting function, computationally intensive code segments were identified and replaced with extensible instructions, predefined blocks, and/or parameter settings. Chapter 4 also presented a two-level hierarchy selection algorithm that involves selecting extensible instructions, predefined blocks, and parameter settings to maximise the application performance under given design constraints. This algorithm takes into account the designer's input, which involves combining predefined blocks and parameter settings into a set of pre-configured processors to prune the design space. The algorithm first selects a pre-configured processor, and then selects a set of extensible instructions to generate an extensible processor. Next, an estimation function is used to rapidly estimate the performance of the application on the newly configured extensible processor. This function significantly reduces verification time compared to time-consuming cycle-accurate instruction set simulation. Extensive experimentation demonstrated up to a 15.71× (on average 4.74×) improvement in application performance compared to the base processor configuration meeting the same area constraint. Furthermore, the design time of our semi-automatic design system occupied only 2.5% of the full simulation time, obtaining on average 91% of all Pareto points in the design space. The estimation function for the proposed extensible processor is on average within 5.68% of the results obtained with an instruction set simulator.

Building upon the insights of Chapter 4, Chapter 5 presented an automated tool that matches code segments to pre-designed extensible instructions to reduce the design and verification time for new instructions. This tool takes into account the functional equivalence between instructions and code segments, and uses combinational equivalence checking to ensure that the results (i.e. the found candidates for extensible instructions) are largely independent of the programming style of the application, resulting in high-quality matches. The tool first translates selected code segments of an application into a hardware description, filters out those code segments that would not match, and then matches the remaining code segments to a pre-defined library of extensible instructions using functional equivalence checking. Experimental results showed that the time for matching was on average 7.3× faster than the state-of-the-art simulation-based approach. The results demonstrated that the identical hand-optimised extensible instructions were matched by this tool.

Chapters 6 and 7 presented methodologies for instruction estimation analysis and optimisation, developed so that instructions can be generated automatically. Although previous work has attempted to address instruction generation, it was observed that the design space of extensible instructions is extremely complex and infeasibly large. To accurately explore the design space, Chapter 6 presented an instruction estimation model to estimate the area overhead, latency, and power consumption of all possible extensible instructions for a given code segment. Previous work on this topic has ignored parallelism techniques and schedule alternatives in the instructions. Chapter 6 demonstrated that parallelism techniques and schedule alternatives are critical to the instruction estimation model. One major problem facing instruction estimation is the lack of component information in an instruction. Chapter 6 showed that it is possible to derive a reasonable estimation model using system decomposition and regression analysis to simplify the process of modelling extensible instructions. Extensive experimentation showed that the mean absolute error for a set of instructions used in real-world applications is 3.4% (6.7% max.) for area overhead, 5.9% (9.4% max.) for latency, and 4.2% (7.2% max.) for power consumption. Our estimation model executes in a few seconds for an instruction, while synthesis and subsequent estimation would take hours.

Although the techniques presented in Chapter 6 make it possible to explore the design space of extensible instructions, it was observed that a single extensible instruction is often implemented for a large code segment to maximise performance. At the same time, however, such instructions require large power dissipation and current discharge, making them unsuitable for battery-powered products. To that end, Chapter 7 presented a battery-aware instruction generation tool that generates instructions to minimise power dissipation while maximising application performance. Unlike other techniques that combine large code segments into a single extensible instruction, our proposed technique separates the code segment into multiple instructions and utilises the slack to maximise performance while minimising power dissipation. Hence, it incurs a lower power dissipation distribution and can be used to reduce energy consumption, which suits battery-powered products. Experimental results demonstrated that the tool reduces energy consumption by a further 5.8% on average (up to 17.7%) compared to extensible instructions generated by previous approaches.

The methodologies and results presented in this thesis demonstrate that the various automation methodologies in the design flow have a significant impact on design turnaround time. Furthermore, it has been shown that efficient exploration of the design space leads to better design metric tradeoffs and application performance improvements that are far beyond those obtained through the existing design flow. In light of the increasing importance of design turnaround time as a design metric, the results presented in this thesis show that the availability and use of design automation methodologies will enable designers to meet shortening design times, and can lead to fewer and faster design iterations for efficient design exploration.

There are several related issues of interest that can be explored in the future.

Each of the chapters has outlined a number of specific improvements that could be made to the methodologies presented in this thesis. In the near future, there will also be a need to develop processor architectures that support various extensible instruction requirements, such as pipeline structure, decoding methodology, and memory accesses, to further optimise the design metrics of extensible processors.

Finally, it is my hope that the methodologies and insights developed in this thesis will be incorporated into the next generation of extensible processor platform design tools, in order to reduce design turnaround time and close the design productivity gap.

Bibliography

[1] ARCtangent Processor. ARC, Inc. (http://www.arc.com).

[2] ASIP-Meister. (http://www.eda-meister.org/asip-meister/).

[3] Blast RTL. Magma, Inc. (http://www.magma-da.com).

[4] Cadence. Cadence, Inc. (http://www.cadence.com).

[5] DAPDNP-2 Dynamically Reconfigurable Processor. IPFlex, Inc. (http://www.ipflex.com).

[6] Design Compiler. Synopsys, Inc. (http://www.synopsys.com).

[7] FPGA. Xilinx, Inc. (http://www.xilinx.com).

[8] HP Labs. Hewlett-Packard, Inc. (http://www.hpl.hp.com).

[9] Intel processor. Intel, Inc. (http://www.intel.com).

[10] Jazz DSP. Improv Systems, Inc. (http://www.improvsys.com).

[11] Lexra Processor. Lexra, Inc. (http://www.lexra.com).

[12] LisaTek. CoWare, Inc. (http://www.coware.com).

[13] Media Embedded Processor. Toshiba, Inc. (http://www.mepcore.com).

[14] MIPS Cores. MIPS, Inc. (http://www.mips.com).


[15] ModelSim. Model, Inc. (http://www.model.com).

[16] Motorola processor. Motorola, Inc. (http://www.motorola.com).

[17] NIOS II/NIOS Embedded Processors. Altera, Inc. (http://www.altera.com).

[18] PICO Technology. Synfora, Inc. (http://www.synfora.com).

[19] PowerTheater. Sequence, Inc. (http://www.sequencedesign.com).

[20] Splus. Insightful, Inc. (http://www.insightful.com).

[21] Stretch S5 engine. Stretch, Inc. (http://www.stretchinc.com).

[22] Tarari Processing Platform. Tarari, Inc. (http://www.tarari.com).

[23] Xtensa Processor. Tensilica, Inc. (http://www.tensilica.com).

[24] J. Abella, A. Gonzalez, J. Llosa, and X. Vera. Near-optimal Loop Tiling by Means of Cache Miss Equations and Genetic Algorithms. In International Conference on Parallel Processing Workshops, pages 568–577, August 2002.

[25] S. G. Abraham and B. R. Rau. Efficient Design Space Exploration in PICO. In International Conference on Compilers, Architectures, and Synthesis for Embedded Systems, pages 71–79, October 2000.

[26] S. Aditya, B. R. Rau, and V. Kathail. Automatic Architectural Synthesis of VLIW and EPIC Processors. In International Symposium on System Synthesis, pages 107–113, November 1999.

[27] A. Aho, M. Ganapathi, and S. Tjiang. Code Generation using Tree Matching and Dynamic Programming. ACM Transactions on Programming Languages and Systems, 11(4):491–561, October 1989.

[28] C. Alippi, W. Fornaciari, L. Pozzi, and M. Sami. A DAG-based Design Approach for Reconfigurable VLIW Processors. In Design, Automation and Test in Europe Conference and Exhibition, pages 778–780, March 1999.

[29] V. H. Allan, B. Su, P. Wijaya, and J. Wang. Foresighted Instruction Scheduling Under Timing Constraints. IEEE Transactions on Computers, 41(9):1169–1172, September 1992.

[30] J. R. Allen and K. Kennedy. Automatic Loop Interchange. In International Symposium on Compiler Construction, pages 233–246, 1984.

[31] A. Alomary, T. Nakata, Y. Honma, M. Imai, and N. Hikichi. An ASIP Instruction Set Optimization Algorithm with Functional Module Sharing Constraint. In IEEE International Conference on Computer Aided Design, pages 526–532, November 1993.

[32] M. Arnold. Instruction Set Extension for Embedded Processors. Ph.D. Thesis, Delft University of Technology, March 2001.

[33] K. Atasu, L. Pozzi, and P. Ienne. Automatic Application-Specific Instruction-Set Extensions Under Microarchitectural Constraints. In ACM/IEEE Design Automation Conference, pages 256–261, June 2003.

[34] P. M. Athanas and H. F. Silverman. Processor Reconfiguration Through Instruction-set Metamorphosis. Computer, 26(3):11–18, March 1993.

[35] F. Barat, R. Lauwereins, and G. Deconinck. Reconfigurable Instruction Set Processors from a Hardware/Software Perspective. IEEE Transactions on Software Engineering, 28(9):847–862, September 2002.

[36] V. Betz and J. Rose. VPR: A New Packing, Placement and Routing Tool for FPGA Research. In International Workshop on Field-Programmable Logic and Applications, pages 213–222, June 1997.

[37] V. Bhatt, M. Balakrishnan, and A. Kumar. Exploring the Number of Register Windows in ASIP Synthesis. In International Conference on VLSI Design, pages 233–238, January 2002.

[38] N. Binh, M. Imai, and Y. Takeuchi. A Performance Maximization Algorithm to Design ASIPs under the Constraint of Chip Area Including RAM and ROM Sizes. In Asia and South Pacific Design Automation Conference, pages 367–372, February 1998.

[39] P. Biswas, V. Choudhary, K. Atasu, L. Pozzi, P. Ienne, and N. Dutt. Introduction of Local Memory Elements in Instruction Set Extensions. In ACM/IEEE Design Automation Conference, pages 729–734, June 2004.

[40] A. Bona, M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, and R. Zafalon. Energy Estimation and Optimization of Embedded VLIW Processors based on Instruction Clustering. In ACM/IEEE Design Automation Conference, pages 886–891, June 2002.

[41] K. S. Brace, R. L. Rudell, and R. E. Bryant. Efficient Implementation of a BDD Package. In ACM/IEEE Design Automation Conference, pages 40–45, June 1990.

[42] R. K. Brayton, G. Hachtel, A. Sangiovanni-Vincentelli, F. Somenzi, A. Aziz, S. Cheng, S. Edwards, S. Khatri, Y. Kukimoto, A. Pardo, S. Qadeer, R. Ranjan, S. Sarwary, T. Shiple, G. Swamy, and T. Villa. VIS: a System for Verification and Synthesis. In International Conference on Computer Aided Verification, pages 428–432, July 1996.

[43] R. K. Brayton and S. P. Khatri. Multi-valued Logic Synthesis. In International Conference on VLSI Design, pages 196–205, January 1999.

[44] R. K. Brayton and C. McMullen. The Decomposition and Factorization of Boolean Expressions. In IEEE International Symposium on Circuits and Systems, pages 49–54, November 1982.

[45] R. K. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang. Multi-Level Logic Optimization and the Rectangle Covering Problem. In IEEE International Conference on Computer Aided Design, pages 66–69, November 1987.

[46] P. Brisk, A. Kaplan, and M. Sarrafzadeh. Area-Efficient Instruction Set Synthesis for Reconfigurable System-on-Chip Designs. In ACM/IEEE Design Automation Conference, pages 395–400, June 2004.

[47] P. Brisk, A. Kaplan, R. Kastner, and M. Sarrafzadeh. Instruction Generation and Regularity Extraction for Reconfigurable Processors. In International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 262–269, October 2002.

[48] S. Brown, J. Rose, and Z. Vranesic. A Detailed Router for Field Programmable Gate Arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11(5):620–628, May 1992.

[49] R. E. Bryant. Graph-Based Algorithms for Boolean Function Manipulation. IEEE Transactions on Computers, 35(8):677–691, August 1986.

[50] R. Buccigrossi and E. Simoncelli. Progressive Wavelet Image Coding Based on a Conditional Probability Model. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2597–2600, April 1997.

[51] F. Catthoor, E. DeGreef, and S. Suytack. Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design. Kluwer Academic Publishers, Norwell, USA, 1998.

[52] C. Chekuri and S. Khanna. A PTAS for the Multiple Knapsack Problem. In International Symposium on Discrete Algorithms, pages 213–222, January 2000.

[53] H. C. Chen and D. Du. Path Sensitization in Critical Path Problem. In IEEE International Conference on Computer Aided Design, pages 208–211, November 1991.

[54] K. T. Cheng and L. A. Entrena. Multi-Level Logic Optimization by Redundancy Addition and Removal. In European Conference on Design Automation, pages 373–377, February 1993.

[55] S. Cheng, R. Brayton, G. York, K. Yelick, and A. Saldanha. Compiling Verilog Into Timed Finite State Machines. In International Conference on Verilog HDL, pages 32–39, March 1995.

[56] N. Cheung, J. Henkel, and S. Parameswaran. Embedded Software for SoC, chapter Rapid Configuration & Instruction Selection for an ASIP: A Case Study, pages 403–417. Kluwer Academic Publishers, 2003.

[57] N. Cheung, J. Henkel, and S. Parameswaran. Rapid Configuration & Instruction Selection for an ASIP: A Case Study. In Design, Automation and Test in Europe Conference and Exhibition, pages 802–807, March 2003.

[58] N. Cheung, S. Parameswaran, and J. Henkel. INSIDE: INstruction Selection/Identification & Design Exploration for Extensible Processors. In IEEE International Conference on Computer Aided Design, pages 291–297, November 2003.

[59] N. Cheung, S. Parameswaran, and J. Henkel. A Quantitative Study and Estimation Models for Extensible Instructions in Embedded Processors. In IEEE International Conference on Computer Aided Design, pages 183–189, November 2004.

[60] N. Cheung, S. Parameswaran, and J. Henkel. Battery-Aware Instruction Generation for Embedded Processors. In Asia and South Pacific Design Automation Conference, pages 553–556, January 2005.

[61] N. Cheung, S. Parameswaran, J. Henkel, and J. Chan. MINCE: Matching INstructions using Combinational Equivalence for Extensible Processor. In Design, Automation and Test in Europe Conference and Exhibition, pages 1020–1025, February 2004.

[62] H. Choi, J. Kim, C. Yoon, I. Park, S. Hwang, and C. Kyung. Synthesis of Application Specific Instructions for Embedded DSP Software. IEEE Transactions on Computers, 48(6):603–614, June 1999.

[63] H. Choi and I. Park. Coware Pipelining for Exploiting Intellectual Properties and Software Codes in Processor-based Design. In IEEE International Conference on Application Specific Integrated Circuits/System-On-Chips, pages 153–157, September 2000.

[64] H. Choi, J. Yi, J. Lee, I. Park, and C. Kyung. Exploiting Intellectual Properties in ASIP Designs for Embedded DSP Software. In ACM/IEEE Design Automation Conference, pages 939–944, June 1999.

[65] P. C. Chu and J. E. Beasley. A Genetic Algorithm for the Multidimensional Knapsack Problem. Journal of Heuristics, 4(1):63–86, June 1998.

[66] N. Clark, M. Kudlur, H. Park, S. Mahlke, and K. Flautner. Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization. In IEEE/ACM International Symposium on Microarchitecture, pages 30–40, December 2004.

[67] N. Clark, H. Zhong, and S. Mahlke. Processor Acceleration Through Automated Instruction Set Customization. In IEEE/ACM International Symposium on Microarchitecture, pages 129–140, December 2003.

[68] E. Clarke, D. Kroening, and K. Yorav. Behavioral Consistency of C and Verilog Programs Using Bounded Model Checking. In ACM/IEEE Design Automation Conference, pages 368–371, June 2003.

[69] J. Cong and Y. Ding. An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-table Based FPGA Designs. In IEEE International Conference on Computer Aided Design, pages 48–53, November 1992.

[70] J. Cong, Y. Fan, G. Han, and Z. Zhang. Application-Specific Instruction Generation for Configurable Processor Architectures. In International Symposium on Field Programmable Gate Array, pages 183–189, November 2004.

[71] M. Corazao, M. Khalaf, L. Guerra, M. Potkonjak, and J. Rabaey. Instruction Set Mapping for Performance Optimization. In IEEE International Conference on Computer Aided Design, pages 518–521, November 1993.

[72] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. An Efficient Method of Computing Static Single Assignment Form. In ACM Symposium on Principles of Programming Languages, pages 25–35, January 1989.

[73] J. W. Davidson and S. Jinturkar. Improving Instruction-level Parallelism by Loop Unrolling and Dynamic Memory Disambiguation. In IEEE/ACM International Symposium on Microarchitecture, pages 125–132, November 1995.

[74] A. J. deGeus and W. Cohen. A Rule-Based System for Optimizing Combinational Logic. IEEE Design and Test of Computers, 2(4):22–32, August 1985.

[75] G. DeMicheli. Performance-Oriented Synthesis in the Yorktown Silicon Compiler. In IEEE International Conference on Computer Aided Design, pages 138–141, November 1989.

[76] G. DeMicheli, R. K. Brayton, and A. Sangiovanni-Vincentelli. KISS: A Program for Optimal State Assignment for Finite State Machines. In IEEE International Conference on Computer Aided Design, pages 209–211, November 1984.

[77] S. Devadas, K. Keutzer, and S. Malik. Delay Computation for Combinational Logic Circuits: Theory and Algorithms. In IEEE International Conference on Computer Aided Design, pages 176–179, November 1991.

[78] N. Dutt and K. Choi. Configurable Processors for Embedded Computing. IEEE Computer Magazine, 36(1):120–123, January 2003.

[79] K. Ebcioglu, J. Fritts, S. Kosonocky, M. Gschwind, E. Altman, K. Kailas, and A. T. Bright. An Eight Issue Tree-VLIW Processor for Dynamic Binary Translation. In IEEE International Conference on Computer Design, pages 488–495, October 1998.

[80] C. Ebeling, L. McMurchie, S. A. Hauck, and S. Burns. Placement and Routing Tools for the Triptych FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 3(4):473–482, December 1995.

[81] A. Fauth. Beyond Tool-specific Machine Descriptions. Code Generation for Embedded Processors, pages 138–152, December 1995.

[82] Y. Fei, S. Ravi, A. Raghunathan, and N. Jha. A Hybrid Energy Estimation Technique for Extensible Processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(5):652–664, May 2004.

[83] R. J. Francis, J. Rose, and K. Chung. Chortle: a Technology Mapping Program for Lookup Table-based Field Programmable Gate Arrays. In ACM/IEEE Design Automation Conference, pages 613–619, June 1990.

[84] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, San Francisco, USA, 1979.

[85] C. H. Gebotys. Utilizing Memory Bandwidth in DSP Embedded Processors. In ACM/IEEE Design Automation Conference, pages 347–352, June 2001.

[86] J. Gong, D. Gajski, and A. Nicolau. Performance Evaluation for Application-Specific Architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 3(4):483–490, December 1995.

[87] R. Gonzalez. Xtensa: A Configurable and Extensible Processor. IEEE Micro Magazine, 20(2):60–70, March 2000.

[88] D. Goodwin and D. Petkov. Automatic Generation of Application Specific Processors. In International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 137–147, October 2003.

[89] G. Goossens, J. Rabaey, J. Vandewalle, and H. DeMan. An Efficient Microcode Compiler for Application Specific DSP Processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 9(9):925–937, September 1990.

[90] D. Gregory, K. Bartlett, and A. J. deGeus. Automatic Generation of Combinatorial Logic from a Functional Specification. In IEEE International Symposium on Circuits and Systems, pages 986–989, May 1984.

[91] M. Grünewald, D. Le, U. Kastens, J. Niemann, M. Porrmann, U. Rückert, A. Slowik, and M. Thies. Network Application Driven Instruction Set Extensions for Embedded Processing Clusters. In IEEE International Conference on Parallel Computing in Electrical Engineering, pages 209–214, September 2004.

[92] M. Gschwind. Instruction Set Selection for ASIP Design. In International Workshop on Hardware/Software Codesign, pages 7–11, May 1999.

[93] J. Gu and Z. Li. Efficient Interprocedural Array Data-flow Analysis for Automatic Program Parallelization. IEEE Transactions on Software Engineering, 26(3):244–261, March 2000.

[94] J. Guohua and C. Fujie. Hybrid Loop Interchange: Optimization for Parallel Programs. In International Symposium on Parallel Processing, pages 680–685, March 1992.

[95] T. Gupta, R. Ko, and R. Barua. Compiler-directed Customization of ASIP Cores. In International Symposium on Hardware/Software Co-Design, pages 97–102, May 2002.

[96] T. Gupta, P. Sharma, M. Balakrishnan, and S. Malik. Processor Evaluation in an Embedded Systems Design Environment. In International Conference on VLSI Design, pages 98–103, January 2000.

[97] J. Gyllenhaal, W. Hwu, and B. Rau. HMDES Version 2.0 Specification. Technical Report IMPACT-96-03, University of Illinois at Urbana-Champaign, March 1996.

[98] G. Hadjiyiannis, P. Russo, and S. Devadas. A Methodology for Accurate Performance Evaluation in Architecture Exploration. In ACM/IEEE Design Automation Conference, pages 927–932, June 1999.

[99] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, and A. Nicolau. EXPRESSION: A Language for Architecture Exploration Through Compiler/Simulator Retargetability. In Design, Automation and Test in Europe Conference and Exhibition, pages 485–490, March 1999.

[100] S. Hanono and S. Devadas. Instruction Selection, Resource Allocation, and Scheduling in the AVIV Retargetable Code Generator. In ACM/IEEE Design Automation Conference, pages 510–515, June 1998.

[101] S. Hauck, T. Fry, M. Hosler, and J. Kao. The Chimaera Reconfigurable Functional Unit. In IEEE Symposium on Field-Programmable Custom Computing Machines, pages 87–96, April 1997.

[102] J. R. Hauser and J. Wawrzynek. Garp: a MIPS Processor with a Reconfigurable Coprocessor. In IEEE Symposium on Field-Programmable Custom Computing Machines, pages 24–33, April 1997.

[103] J. Henkel. Closing the SoC Design Gap. IEEE Computer Magazine, 36(9):119–121, September 2003.

[104] P. Hoang and J. Rabaey. A Compiler for Multiprocessor DSP Implementation. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 581–584, March 1992.

[105] A. Hoffmann, F. Fiedler, A. Nohl, and S. Parupalli. A Methodology and Tooling Enabling Application Specific Processor Design. In International Conference on VLSI Design, pages 399–404, January 2005.

[106] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, and O. Wahlen. A Novel Methodology for the Design of Application-Specific Instruction-set Processors (ASIPs) using a Machine Description Language. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(11):1338–1354, November 2001.

[107] B. Holmer. A Tool for Processor Instruction Set Design. In European Conference on Design Automation, pages 150–155, September 1994.

[108] J. Hoogerbrugge and L. Augusteijn. Instruction Scheduling for TriMedia. Journal of Instruction-Level Parallelism, 1:1–21, February 1999.

[109] I. Huang and A. Despain. Synthesis of Application Specific Instruction Sets. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 14(6):663–675, June 1995.

[110] T. C. Huang and C. M. Yang. Further Results for Improving Loop Interchange in Non-adjacent and Imperfectly Nested Loops. In International Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 93–99, March 1998.

[111] M. Itoh, S. Higaki, J. Sato, A. Shiomi, Y. Takeuchi, A. Kitajima, and M. Imai. PEAS-III: An ASIP Design Environment. In IEEE International Conference on Computer Design, pages 430–436, September 2000.

[112] M. Jackson and E. S. Kuh. Performance-driven Placement of Cell Based IC's. In ACM/IEEE Design Automation Conference, pages 370–375, June 1989.

[113] M. Jacome, G. Veciana, and V. Lapinskii. Exploring Performance Tradeoffs for Clustered VLIW ASIPs. In IEEE International Conference on Computer Design, pages 504–510, November 2000.

[114] M. K. Jain, M. Balakrishnan, and A. Kumar. ASIP Design Methodologies: Survey and Issues. In International Conference on VLSI Design, pages 76–81, January 2001.

[115] M. K. Jain, M. Balakrishnan, and A. Kumar. Exploring Storage Organization in ASIP Synthesis. In Euromicro Symposium on Digital System Design, pages 120–127, September 2003.

[116] M. K. Jain, M. Balakrishnan, and A. Kumar. Integrated On-chip Storage Evaluation in ASIP Synthesis. In International Conference on VLSI Design, pages 274–279, January 2005.

[117] M. K. Jain, L. Wehmeyer, S. Steinke, P. Marwedel, and M. Balakrishnan. Evaluating Register File Size in ASIP Design. In International Symposium on Hardware/Software Codesign, pages 109–114, September 2001.

[118] P. K. Jha and N. D. Dutt. Rapid Estimation for Parameterized Components in High-level Synthesis. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1(3):296–303, September 1993.

[119] K. Kang and K. Choe. On the Automatic Generation of Instruction Selector Using Bottom-Up Tree Pattern Matching. Technical Report CS-TR-95-93, Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), April 1995.

[120] V. Kathail, S. Aditya, R. Schreiber, B. Rau, D. Cronquist, and M. Sivaraman. PICO: Automatically Designing Custom Computers. IEEE Computer Magazine, 35(9):39–47, September 2002.

[121] M. Kaul, R. Vemuri, S. Govindarajan, and I. Ouaiss. An Automated Temporal Partitioning and Loop Fission Approach for FPGA Based Reconfigurable Synthesis of DSP Applications. In ACM/IEEE Design Automation Conference, pages 616–622, June 1999.

[122] B. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. The Bell System Technical Journal, 49:291–307, February 1970.

[123] B. Kernighan and R. Pike. The Practice of Programming. Addison-Wesley Professional, Menlo Park, CA, USA, 1999.

[124] K. Keutzer. DAGON: Technology Binding and Local Optimization by DAG Matching. In ACM/IEEE Design Automation Conference, pages 617–623, June 1987.

[125] K. Keutzer, S. Malik, and A. Saldanha. Is Redundancy Necessary to Reduce Delay? In ACM/IEEE Design Automation Conference, pages 228–234, June 1990.

[126] K. Keutzer and D. Richards. Computational Complexity of Logic Synthesis and Optimization. In International Workshop on Logic Synthesis, pages 1–15, May 1989.

[127] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, May 1983.

[128] J. M. Kleinhans, G. Sigl, and F. M. Johannes. GORDIAN: A New Global Optimization/Rectangle Dissection Method for Cell Placement. In IEEE International Conference on Computer Aided Design, pages 506–509, November 1988.

[129] S. Kobayashi, H. Mita, Y. Takeuchi, and M. Imai. Design Space Exploration for DSP Applications using the ASIP Development System PEAS-III. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3168–3171, May 2002.

[130] D. J. Kolson, A. Nicolau, N. Dutt, and K. Kennedy. A Method for Register Allocation to Loops in Multiple Register File Architectures. In International Symposium on Parallel Processing, pages 28–33, April 1996.

[131] T. Kong and K. D. Wilken. Precise Register Allocation for Irregular Architectures. In IEEE/ACM International Symposium on Microarchitecture, pages 297–307, November 1998.

[132] K. Küçükçakar. An ASIP Design Methodology for Embedded Systems. In International Workshop on Hardware/Software Codesign, pages 17–21, May 1999.

[133] K. Lahiri, A. Raghunathan, S. Dey, and D. Panigrahi. Battery-driven System Design: A New Frontier in Low Power Design. In Asia and South Pacific Design Automation Conference/International Conference of VLSI Design, pages 261–267, January 2002.

[134] L. Lavagno, S. Malik, R. K. Brayton, and A. Sangiovanni-Vincentelli. MIS-MV: Optimization of Multi-Level Logic with Multiple-Valued Inputs. In IEEE International Conference on Computer Aided Design, pages 560–563, November 1990.

[135] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In IEEE/ACM International Symposium on Microarchitecture, pages 330–335, December 1997.

[136] C. Y. Lee. An Algorithm for Path Connections and its Applications. IRE Transactions on Electronic Computers, EC-10(2):346–365, 1961.

[137] J. Lee, K. Choi, and N. Dutt. Efficient Instruction Encoding for Automatic Instruction Set Design of Configurable ASIPs. In International Conference on Computer Aided Design, pages 649–654, November 2002.

[138] J. Lee, K. Choi, and N. Dutt. Energy-Efficient Instruction Set Synthesis for Application-Specific Processors. In International Symposium on Low Power Electronics and Design, pages 330–333, August 2003.

[139] M. T. Lee, V. Tiwari, S. Malik, and M. Fujita. Power Analysis and Minimization Techniques for Embedded DSP Software. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 5(1):123–135, March 1997.

[140] Y. F. Lee, B. G. Ryder, and M. E. Fiuczynski. Region Analysis: A Parallel Elimination Method for Data Flow Analysis. IEEE Transactions on Software Engineering, 21(11):913–926, March 1995.

[141] R. Leupers and P. Marwedel. Instruction-Set Modelling for ASIP Code Generation. In IEEE International Conference on VLSI Design, pages 77–80, January 1995.

[142] R. Leupers and P. Marwedel. Retargetable Code Generation Based on Structural Processor Descriptions. Design Automation for Embedded Systems, 3(1):1–36, January 1998.

[143] S. Liao, S. Devadas, and K. Keutzer. Code Density Optimization for Embedded DSP Processors Using Data Compression Techniques. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17(7):601–608, July 1998.

[144] S. Liao, S. Devadas, K. Keutzer, and S. Tjiang. Instruction Selection using Binate Covering for Code Size Optimization. In IEEE International Conference on Computer Aided Design, pages 393–399, November 1995.

[145] C. Liem, T. May, and P. Paulin. Instruction-Set Matching and Selection for DSP and ASIP Code Generation. In European Conference on Design Automation, pages 31–37, February 1994.

[146] H. D. Linden and T. B. Reddy. Handbook of Batteries. McGraw-Hill, New York, NY, USA, 1995.

[147] M. Lipasti. Value Locality and Speculative Execution. Ph.D. Thesis, Carnegie Mellon University, April 1997.

[148] A. Lodi, M. Toma, F. Campi, A. Cappelli, R. Canegallo, and R. Guerrieri. A VLIW Processor With Reconfigurable Instruction Set for Embedded Applications. IEEE Journal of Solid-State Circuits, 38(11):1876–1886, November 2003.

[149] J. C. Madre and J. P. Billon. Proving Circuit Correctness Using Formal Comparison Between Expected And Extracted Behaviour. In ACM/IEEE Design Automation Conference, pages 205–210, June 1989.

[150] J. C. Madre, O. Coudert, and J. P. Billon. Automating The Diagnosis And The Rectification Of Design Errors With PRIAM. In IEEE International Conference on Computer Aided Design, pages 30–33, November 1989.

[151] N. Manjikian. Combining Loop Fusion with Prefetching on Shared-memory Multiprocessors. In International Conference on Parallel Processing, pages 78–82, August 1997.

[152] R. Marculescu, D. Marculescu, and M. Pedram. Switching Activity Analysis Considering Spatiotemporal Correlations. In IEEE International Conference on Computer Aided Design, pages 294–299, November 1994.

[153] J. P. Marques-Silva and K. A. Sakallah. GRASP: A New Search Algorithm for Satisfiability. In IEEE International Conference on Computer Aided Design, pages 220–227, November 1996.

[154] S. Martello and P. Toth. Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons, Inc., New York, NY, USA, 1990.

[155] P. McGeer and R. K. Brayton. Efficient Algorithms for Computing the Longest Viable Path in a Combinational Network. In ACM/IEEE Design Automation Conference, pages 561–567, June 1989.

[156] M. Moskewicz, C. Madigan, Y. Zhao, L. Zhang, and S. Malik. Chaff: Engineering an Efficient SAT Solver. In ACM/IEEE Design Automation Conference, pages 530–535, June 2001.

[157] T. Nakra, R. Gupta, and M. Soffa. Value Prediction in VLIW Machines. In International Symposium on Computer Architecture, pages 258–269, June 1999.

[158] G. J. Nam, F. Aloul, K. A. Sakallah, and R. A. Rutenbar. A Comparative Study of Two Boolean Formulations of FPGA Detailed Routing Constraints. IEEE Transactions on Computers, 53(6):688–696, June 2004.

[159] J. Ng, D. Kulkarni, W. Li, R. Cox, and S. Bobholz. Inter-procedural Loop Fusion, Array Contraction and Rotation. In International Conference on Parallel Architectures and Compilation Techniques, pages 114–124, September 2003.

[160] C. Norris and L. L. Pollock. An Experimental Study of Several Cooperative Register Allocation and Instruction Scheduling Strategies. In IEEE/ACM International Symposium on Microarchitecture, pages 169–179, November 1995.

[161] F. Onion, A. Nicolau, and N. Dutt. Incorporating Compiler Feedback Into the Design of ASIPs. In European Design and Test Conference, pages 508–513, March 1995.

[162] P. R. Panda, H. Nakamura, N. D. Dutt, and A. Nicolau. Augmenting Loop Tiling with Data Alignment for Improved Cache Performance. IEEE Transactions on Computers, 48(2):142–149, February 1999.

[163] D. A. Patterson and J. L. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Palo Alto, CA, USA, 1989.

[164] P. G. Paulin, C. Liem, T. C. May, and S. Sutawala. FlexWare: A Flexible Firmware Development Environment for Embedded Systems. Code Generation for Embedded Processors, pages 65–84, December 1995.

[165] M. Pedram and Q. Wu. Battery-powered Digital CMOS Design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 10(5):601–607, October 2002.

[166] A. Peymandoust, L. Pozzi, P. Ienne, and G. DeMicheli. Automatic Instruction-Set Extension And Utilization For Embedded Processors. In International Conference on Application-specific Systems, Architectures and Processors, pages 108–118, June 2003.

[167] C. Pixley and G. Beihl. Calculating Resetability and Reset Sequences. In IEEE International Conference on Computer Aided Design, pages 376–379, November 1991.

[168] A. Pnueli, O. Shtrichman, and M. Siegel. The Code Validation Tool CVT: Automatic Verification of a Compilation Process. International Journal on Software Tools for Technology Transfer, 2(2):192–201, July 1998.

[169] M. Potkonjak and J. Rabaey. Power Minimization in DSP Application Specific Systems Using Algorithm Selection. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 2639–2642, May 1995.

[170] L. Pozzi. Methodologies for the Design of Application-Specific Reconfigurable VLIW Processors. Ph.D. Thesis, Politecnico di Milano, January 2000.

[171] S. P. Rajan, M. Fujita, A. Sudarsanam, and S. Malik. Development of an Optimizing Compiler for a Fujitsu Fixed-point Digital Signal Processor. In International Workshop on Hardware/Software Codesign, pages 2–6, May 1999.

[172] S. Ravi, A. Raghunathan, N. Potlapally, and M. Sankaradass. System Design Methodologies for a Wireless Security Processing Platform. In ACM/IEEE Design Automation Conference, pages 777–782, June 2002.

[173] R. Razdan and M. D. Smith. A High-Performance Microarchitecture with Hardware-Programmable Functional Units. In IEEE/ACM International Symposium on Microarchitecture, pages 172–180, November 1994.

[174] R. Rudell. Dynamic Variable Ordering for Ordered Binary Decision Diagrams. In IEEE International Conference on Computer Aided Design, pages 42–47, November 1993.

[175] R. Rudell and A. Sangiovanni-Vincentelli. Espresso-MV: Algorithms for Multiple-Valued Logic Minimization. In IEEE International Conference on Custom Integrated Circuits, pages 230–234, May 1985.

[176] R. Rudell and A. Sangiovanni-Vincentelli. Exact Minimization of Multiple-Valued Functions. In IEEE International Conference on Computer Aided Design, pages 352–355, November 1986.

[177] R. Rudell and A. Sangiovanni-Vincentelli. Multiple-Valued Minimization for PLA Optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 6(5):727–750, September 1987.

[178] J. Sanchez and A. Gonzalez. The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures. In International Conference on Parallel Processing, pages 555–562, August 2000.

[179] J. Sanghavi and A. Wang. Estimation of Speed, Area, and Power of Parameterizable Soft IP. In ACM/IEEE Design Automation Conference, pages 31–34, June 2001.

[180] H. Savoj and R. K. Brayton. The Use of Observability and External Don’t Cares for the Simplification of Multi-Level Networks. In ACM/IEEE Design Automation Conference, pages 297–301, June 1990.

[181] H. Savoj and R. K. Brayton. Observability Relations and Observability Don’t Cares. In IEEE International Conference on Computer Aided Design, pages 518–521, November 1991.

[182] H. Savoj, R. K. Brayton, and H. J. Touati. Extracting Local Don’t Cares for Network Optimization. In IEEE International Conference on Computer Aided Design, pages 514–517, November 1991.

[183] C. Sechen. Chip-planning, Placement, and Global Routing of Macro/Custom Cell Integrated Circuits Using Simulated Annealing. In ACM/IEEE Design Automation Conference, pages 73–80, June 1988.

[184] L. Semeria, A. Seawright, R. Mehra, D. Ng, A. Ekanayake, and B. Pangrle. RTL C-Based Methodology for Designing and Verifying a Multi-Threaded Processor. In ACM/IEEE Design Automation Conference, pages 123–128, June 2002.

[185] E. Sha, C. Lang, and N. L. Passos. Polynomial-time Nested Loop Fusion with Full Parallelism. In International Conference on Parallel Processing, volume 3, pages 9–16, August 1996.

[186] J. Shu, T. C. Wilson, and D. K. Banerji. Instruction-Set Matching and GA-based Selection for Embedded-Processor Code Generation. In International Conference on VLSI Design, pages 73–76, January 1996.

[187] M. Smotherman, S. Chawla, S. Cox, and B. Malloy. Instruction Scheduling for the Motorola 88110. In IEEE/ACM International Symposium on Microarchitecture, pages 257–262, November 1993.

[188] S. Søe and K. Karplus. Logic Minimization Using Two-column Rectangle Replacement. In ACM/IEEE Design Automation Conference, pages 470–474, June 1991.

[189] M. Stadler, T. Rower, H. Kaeslin, N. Felber, W. Fichtner, and M. Thalmann. Functional Verification of Intellectual Properties (IP): a Simulation-Based Solution for an Application-Specific Instruction-set Processor. In International Test Conference, pages 414–420, September 1999.

[190] T. Stornetta and F. Brewer. Implementation of an Efficient Parallel BDD Package. In ACM/IEEE Design Automation Conference, pages 641–644, June 1996.

[191] A. Sudarsanam and S. Malik. Memory Bank and Register Allocation in Software Synthesis for ASIPs. In International Conference on Computer Aided Design, pages 388–392, November 1995.

[192] F. Sun, A. Raghunathan, S. Ravi, and N. Jha. A Scalable Application Specific Processor Synthesis Methodology. In International Conference on Computer Aided Design, pages 283–290, November 2003.

[193] F. Sun, S. Ravi, A. Raghunathan, and N. Jha. Custom-Instruction Synthesis for Extensible-Processor Platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(2):216–228, February 2004.

[194] F. Sun, S. Ravi, A. Raghunathan, and N. Jha. Synthesis of Application-Specific Heterogeneous Multiprocessor Architectures Using Extensible Processors. In International Conference on VLSI Design, pages 551–556, January 2005.

[195] M. Vuletić, L. Pozzi, and P. Ienne. Virtual Memory Window for Application-Specific Reconfigurable Coprocessors. In ACM/IEEE Design Automation Conference, pages 948–953, June 2004.

[196] Y. Wand and R. Weber. An Ontological Model of an Information System. IEEE Transactions on Software Engineering, 16(11):1282–1292, November 1990.

[197] A. Wang, E. Killian, D. Maydan, and C. Rowen. Hardware/Software Instruction Set Configurability for System-on-Chip Processors. In ACM/IEEE Design Automation Conference, pages 184–188, June 2001.

[198] M. J. Wirthlin and B. L. Hutchings. A Dynamic Instruction Set Computer. In IEEE Symposium on Field-Programmable Custom Computing Machines, pages 99–107, April 1995.

[199] W. Wolf and M. Kandemir. Memory System Optimization of Embedded Software. Proceedings of the IEEE, 91(1):165–182, January 2003.

[200] J. Xue. Loop Tiling for Parallelism. Kluwer Academic Publishers, Boston, 2000.

[201] J. Yang, B. Kim, S. Nam, Y. Kwon, D. Lee, J. Lee, C. Hwang, Y. Lee, S. Hwang, I. Park, and C. Kyung. MetaCore: An Application Specific DSP Development System. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 8(2):173–183, April 2000.

[202] Q. Yi and K. Kennedy. Improving Memory Hierarchy Performance through Combined Loop Interchange and Multi-Level Fusion. International Journal of High Performance Computing Applications, 18(2):237–253, August 2004.

[203] P. Yu and T. Mitra. Characterizing Embedded Applications for Instruction-Set Extensible Processors. In ACM/IEEE Design Automation Conference, pages 723–728, June 2004.

[204] P. Yu and T. Mitra. Scalable Custom Instructions Identification for Instruction-Set Extensible Processors. In International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 69–78, October 2004.

[205] Y. Zhang, X. Hu, and D. Z. Chen. Global Register Allocation for Minimizing Energy Consumption. In International Symposium on Low Power Electronics and Design, pages 100–102, August 1999.

[206] Q. Zhao, B. Mesman, and T. Basten. Practical Instruction Set Design and Compiler Retargetability Using Static Resource Models. In Design, Automation and Test in Europe Conference and Exhibition, pages 1021–1026, March 2002.

[207] H. Zima and B. Chapman. Supercompilers for Parallel and Vector Computers. ACM Press, New York, NY, USA, 1991.

[208] N. Zingirian and M. Maresca. External Loop Unrolling of Image Processing Programs: Optimal Register Allocation for RISC Architectures. In International Workshop on Computer Architecture for Machine Perception, pages 61–65, October 1997.

[209] V. Zivojnovic, S. Pees, C. Schlager, M. Willems, R. Schoenen, and H. Meyr. DSP Processor/Compiler Co-design: a Quantitative Approach. In International Symposium on System Synthesis, pages 108–113, November 1996.