FACULTY OF ENGINEERING

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

ADAPT: Architectural and Design exploration for Application specific instruction-set Processor Technologies

Seng Lin Shee

B.E. (Hons I) Computer Engineering (UNSW)

A thesis submitted for the degree of Doctor of Philosophy in Computer Science and Engineering

September 2007

© Copyright by Seng Lin Shee 2008

All Rights Reserved

Acknowledgements

This thesis would never have been possible without the guidance, assistance and support of the individuals mentioned below. In this section, I would like to share with the reader the gratitude, enjoyment, frustration and satisfaction I felt during the course of my PhD program.

I would like to express my deep appreciation and thanks to my supervisor, A. Prof. Sri Parameswaran, for his endless support and insightful guidance throughout the course of my PhD program. Whenever there were difficulties and intricacies in my projects and research work, Sri would always provide the necessary encouragement and advice to counter such hurdles. I have appreciated the constant push and pressure needed during the course of my program, for without them, the current accomplishments I see today would not have been possible.

I would also like to thank staff members Aleksandar Ignjatovic and Annie Hui Guo for their valuable feedback and advice on the various projects and research matters during my PhD program. It was a good experience collaborating with researchers who are so passionate about their related fields. I have gained a lot of experience from working with them.

It has been a great pleasure working alongside dedicated people in my research lab ever since the start of my undergraduate thesis project. I would like to thank Jeremy Chan See Wei for his constant criticism of my work, wherever I was and whatever I did; Jorgen Peddersen for being a good guide and mentor; Newton Cheung for being the legendary PhD student; and Andhi Janapsatya for helping me with a lot of stuff along the way. Many thanks to the other members of the lab, namely Ivan Lu, Shannon Koh, Lih Wen Koh, Michael Chong, Angelo Ambrose, Krutartha Patel and Carol He for their companionship, friendship and fun throughout my life in university.

Special thanks to Andrea Erdos, who was a summer student in the research lab in 2006. Andrea worked hard alongside me on two published works. It has been a great pleasure to collaborate with such an excellent and fun individual.

Life would not be complete without my residential college friends in International House. Those boring and slow moments in research have been well compensated for with just plain fun and the interesting characters in college.

Last but not least, I would like to thank my parents for the encouragement and constant motivation which have enabled me to endure and persevere over the hardest hurdles during my PhD program. I am grateful for all the constant phone calls from overseas pestering me about the progress of my work.

To all the above, I would like to dedicate this work of literature. You may be surprised, but here it is: my PhD thesis. :)

List of Publications

1. S. L. Shee, A. Erdos, and S. Parameswaran, “Architectural Exploration of Heterogeneous Multiprocessor Systems for JPEG,” International Journal of Parallel Programming (IJPP), 36(1):140–162, February 2008.

2. K. Patel, S. Parameswaran, and S. L. Shee, “Ensuring Secure Program Execution in Multiprocessor Embedded Systems: A Case Study,” Proceedings of the 5th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’07), pp. 57–62, Salzburg, Austria, September 2007.

3. S. L. Shee and S. Parameswaran, “Design Methodology for Pipelined Heterogeneous Multiprocessor System,” Proceedings of the 44th Annual Design Automation Conference (DAC’07), pp. 811–816, San Diego, CA, June 2007.

4. S. L. Shee, A. Erdos, and S. Parameswaran, “Heterogeneous Multiprocessor Implementations for JPEG: A Case Study,” Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’06), pp. 217–222, Seoul, Korea, October 2006.

5. S. L. Shee, S. Parameswaran, and N. Cheung, “Novel Architecture for Loop Acceleration: A Case Study,” Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’05), pp. 297–302, Jersey City, NJ, September 2005.

6. J. M. D. Peddersen, S. L. Shee, A. Janapsatya, and S. Parameswaran, “Rapid Embedded Hardware/Software System Generation,” Proceedings of the 18th International Conference on VLSI Design (VLSI’05), pp. 111–116, Kolkata, India, January 2005.

Abstract

The miniaturization of the transistor has made it possible for billions of transistors to be integrated into a single chip package. With new two-dimensional methods of manufacturing chips, ever more features and functionalities can be condensed into a small area of silicon. However, as in any typical engineering situation, this silicon area is considered a resource that should be conservatively used, to minimize power and chip packaging size.

This thesis presents a suite of design automation methodologies for the design of customized processors for specific application domains on an extensible processor platform. The work first presents a single processor approach to customization: a methodology that can rapidly create different processor configurations by removing unused instruction sets from the architecture. A profile-directed approach is used to identify frequently used instructions and to eliminate unused ones from the available instruction pool.

A coprocessor approach is explored next, creating an SoC (System-on-Chip) that speeds up the application while reducing energy consumption. Loops in applications are identified and accelerated by tightly coupling a coprocessor to an ASIP (Application Specific Instruction-set Processor). Latency hiding is used to exploit the parallelism provided by this architecture. A case study has been performed on a JPEG encoding algorithm, comparing two different coprocessor approaches: a high-level synthesis approach and our custom coprocessor approach.

The thesis concludes by introducing a heterogeneous multiprocessor system using ASIPs as processing entities in a pipeline configuration. The problem of mapping each algorithmic stage in the system to an ASIP configuration is formulated. We have also proposed an estimation technique to calculate runtimes of the configured multiprocessor system without running cycle-accurate simulations, which could take a significant amount of time. We present two heuristics to efficiently search the design space of a pipeline-based multi-ASIP system and compare the results against an exhaustive approach.

In our first approach, the reduction of the instruction set and the generation of a processor can be performed within an hour. For five benchmark applications, we show that, on average, processor size can be reduced by 30% and energy consumption by 24%, while performance is improved by 24%. In the coprocessor approach, the high-level synthesis method provides a faster way of generating coprocessors. However, compared with the use of a main processor alone, a loop performance improvement of 2.57× is achieved using the custom coprocessor approach, as against 1.58× for the high-level synthesis method and 1.33× for the customized instruction approach. Energy savings within the loop are 57%, 28% and 19%, respectively. Our multiprocessor design provides a performance improvement of at least 4.03× for JPEG and 3.31× for MP3 over a single processor design. The minimum cost obtained with the use of our heuristic was within 0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks respectively.

Contents

Statement of Originality ...... i

Copyright Statement ...... ii

Authenticity Statement ...... iii

Acknowledgements ...... iv

List of Publications ...... vi

Abstract ...... viii

1 Introduction 1

1.1 Microprocessor Generations ...... 3

1.1.1 First Generation – 1940-1956 ...... 3

1.1.2 Second Generation – 1956-1963 ...... 4

1.1.3 Third Generation – 1964-1971 ...... 5

1.1.4 Fourth Generation – 1971-Present ...... 7

1.2 Design Challenges ...... 9

1.2.1 Performance ...... 10

Pipeline ...... 10

SIMD ...... 12

Superscalar ...... 13

Coprocessors ...... 14

Multicore / multiprocessor ...... 15

1.2.2 Area ...... 16

Moore’s Law...... 16

Specialization ...... 18

1.2.3 Energy ...... 18

Circuit Level ...... 18

Logic Level ...... 19

Architecture / System Level ...... 20

1.3 Extensible Processor Platform ...... 21

1.3.1 Base Configurations ...... 22

1.3.2 Extensible And Customized Instructions ...... 23

1.3.3 Architectural Configurations ...... 23

1.3.4 Heterogeneous Multiprocessor Via An Extensible Platform . . 24

1.4 Design Automation ...... 25

1.5 Motivation ...... 27

1.6 Research Goals and Contributions ...... 30

1.7 Thesis Overview ...... 31

2 Literature Review 34

2.1 Introduction ...... 34

2.2 Embedded Systems ...... 34

2.2.1 Integration of logic-based circuits ...... 39

2.2.2 Functional Upgrades ...... 39

2.2.3 Analogue Replacement ...... 39

2.3 Architectural Designs ...... 42

2.3.1 General Purpose Processors ...... 42

2.3.2 Coprocessor Systems ...... 45

2.3.3 Digital Signal Processors ...... 49

2.3.4 Multiprocessor Systems ...... 52

2.4 Customization of Architectures ...... 56

2.4.1 Field Programmable Grid Array ...... 57

2.4.2 Application Specific Integrated Circuits ...... 60

2.4.3 Extensible Processor Architectures ...... 61

Application Specific Instruction-set Processors ...... 62

Design tools and framework ...... 65

2.5 Parallelizing Architectures ...... 68

2.5.1 Instruction Level Parallelism ...... 69

2.5.2 Task Level Parallelism ...... 70

2.6 Design Space Exploration ...... 74

2.6.1 Processor Generation ...... 74

2.6.2 System Generator - coprocessor generation ...... 76

2.6.3 Multiprocessor / Heterogeneity ...... 79

3 Approach to Customization 84

3.1 Introduction ...... 84

3.2 Shortcomings of Previous Research ...... 84

3.3 Modus Operandi ...... 87

4 Customizing by Removing Instructions 94

4.1 Introduction ...... 94

4.2 Motivation ...... 95

4.3 Microprocessor Generation Framework ...... 96

4.4 Application Specific Processor Generation ...... 97

4.5 Experimental Setup ...... 98

4.5.1 Analysis of Results ...... 101

4.6 Conclusions ...... 103

5 Customizing by Coprocessors 104

5.1 Introduction ...... 104

5.2 The JPEG Encoder ...... 105

5.2.1 Loop Identification ...... 105

5.3 High-level Synthesis Approach ...... 107

5.3.1 Architecture ...... 108

5.3.2 Advantages & Limitations ...... 111

5.4 Custom Coprocessor Approach ...... 112

5.4.1 Architecture ...... 112

5.5 Discussion of the Architecture ...... 114

5.6 Experimental Setup & Tools ...... 116

5.6.1 Verification ...... 118

5.7 Results ...... 119

5.8 Conclusions ...... 120

6 Customizing by Pipelining 123

6.1 Introduction ...... 123

6.2 Background ...... 124

6.2.1 Case Study Application ...... 124

6.2.2 Baseline Processor Description ...... 125

6.3 Methodology ...... 127

6.3.1 Single Pipeline ...... 129

Five Cores ...... 130

Six Cores ...... 131

Seven Cores ...... 131

6.3.2 Multiple Pipelines ...... 132

Nine Cores ...... 132

Seven Cores ...... 133

6.4 Experimental methodology ...... 134

6.5 Results & Analysis ...... 137

6.5.1 Further Architectural Comparison ...... 144

6.6 Conclusion ...... 146

7 Design Space Exploration 147

7.1 Introduction ...... 147

7.2 Background ...... 148

7.2.1 Benchmark Applications ...... 150

JPEG Benchmarks ...... 151

MP3 Benchmarks ...... 152

7.2.2 System Architecture ...... 152

7.3 The System ...... 154

7.4 Design Exploration ...... 157

7.4.1 Problem Definition ...... 158

7.4.2 Exhaustive Search ...... 159

7.4.3 Runtime Estimation ...... 161

7.4.4 Estimation-based Search ...... 164

7.4.5 Preliminary Heuristic ...... 169

Preliminary Results ...... 171

Preliminary Analysis ...... 173

7.4.6 Heuristic ...... 174

7.5 Experimental methodology ...... 178

7.6 Results & Analysis ...... 181

7.7 Discussion ...... 190

7.8 Conclusion ...... 193

8 Conclusions 194

Bibliography 200

List of Tables

1.1 Example of a superscalar processor pipeline flow...... 13

4.1 Mediabench Benchmark Applications used in experiment...... 100

4.2 Different configuration of the SoC architecture...... 101

4.3 Table of results...... 101

5.1 Loop runtimes ...... 107

5.2 Loop Energy and Performance Table ...... 119

5.3 Program Performance and Power Table ...... 119

6.1 Processor Configuration ...... 128

6.2 Utilization in a nine core multipipeline system ...... 133

6.3 Processor configuration with multiple pipeline flows ...... 134

7.1 Processor Configuration ...... 154

7.2 Exploration time ...... 168

7.3 Configurations obtained for the JPEG benchmark (preliminary). . . . 172

7.4 Configurations obtained for the MP3 benchmark (preliminary). . . . 173

7.5 Configurations obtained for the JPEG benchmark...... 183

7.6 Configurations obtained for the MP3 benchmark...... 188

List of Figures

1.1 Ubiquitous computing trend ...... 8

1.2 A 10 stage In-Order Core Pipeline of an Itanium Processor . . . 11

1.3 Transistor count for each processor generation ...... 17

1.4 Microprocessor Development Process ...... 28

2.1 Embedded systems in a vehicle provide a multitude of features . . . 36

2.2 The Apollo Guidance Computer ...... 38

2.3 Architecture of the Intel 4004 microprocessor...... 40

2.4 Example of applications utilizing DSPs...... 41

2.5 Architecture of a lexical/parsing coprocessor...... 47

2.6 Cycle counts for BDTI block FIR filter benchmark ...... 51

2.7 The Xtensa LX Architecture ...... 64

2.8 Cascade design flow for generating coprocessors ...... 78

2.9 A systolic array example of data processing units (DPUs) ...... 81

4.1 Creating a minimized processor description ...... 97

4.2 Experimental Setup ...... 99

4.3 Experimental Results (area, performance, and energy improvement). . 102

5.1 Example code segment & coprocessor interface ...... 108

5.2 Modified loop as an ANSI-C function ...... 109

5.3 Coprocessor Integration ...... 110

5.4 Example code segment & corresponding graph ...... 114

5.5 Parallel Execution ...... 115

5.6 Experimental Setup ...... 117

5.7 Synthesis and Power Calculation Flow ...... 118

5.8 Area and Loop Energy Usage ...... 121

6.1 The main stages in a JPEG encoder...... 126

6.2 Xtensa queue interface...... 126

6.3 Stages in a pipeline processor architecture...... 130

6.4 A five core system interconnected by queues...... 130

6.5 A nine core system with three internal pipeline flows ...... 132

6.6 Experiment Methodology ...... 136

6.7 Performance of multiprocessor systems without optimizations . . . . . 138

6.8 Utilization of the seven pipeline stage systems ...... 140

6.9 Runtime improvements and area increase ...... 141

6.10 Design Space for JPEG Encoder ...... 142

6.11 Performance of multiprocessor systems without optimizations . . . . . 145

7.1 Possible design configurations for a pipelined multiprocessor system . 149

7.2 Xtensa LX Queue Interface ...... 153

7.3 Design flow ...... 155

7.4 An exhaustive search algorithm ...... 160

7.5 Error distribution when using equation 7.9 ...... 164

7.6 Simulation runs with configurations for each processor in the system . 165

7.7 An exhaustive search using the proposed estimation technique . . . . 167

7.8 A preliminary heuristic for configuration selection ...... 170

7.9 The final heuristic for configuration selection ...... 176

7.10 Comparison of complexity of estimation-based and heuristic approaches 177

7.11 Benchmark image used for JPEG encoding ...... 180

7.12 First frames of video sequence stages, best viewed in color ...... 182

7.13 JPEG encoding runtime for each frame in the video sequence . . . . . 182

7.14 JPEG multiprocessor pipeline systems design space ...... 184

7.15 MP3 multiprocessor pipeline systems design space ...... 185

7.16 Pareto points of a JPEG multiprocessor pipeline systems design space 186

7.17 Pareto points of an MP3 multiprocessor pipeline systems design space 187

Chapter 1

Introduction

The information age has brought with it numerous inventions, many of which have benefited business, the education sector, the health industry, national security and normal everyday lives. Innovation has spurred a cultural revolution leading to the development of ubiquitous devices, helping humans to perform mundane tasks which were formerly considered stressful and time-consuming. The advent and deployment of computers have revolutionized the ways that people conduct business. Millions of transaction records could now be stored and retrieved quickly, while complex scenarios could be simulated to undertake risk management activities and to help higher management in its decision making.

Hardware advancements have led to new services and applications, such as the internet, multimedia devices (MP3 players, video recorders, etc.), navigation devices and health monitoring tools. Ubiquitous computing has revolutionized the ways by which we educate the next generation and how we interact with one another. Faster completion of tasks increases productivity and competitiveness in the market place.

All of the above is enabled by the rapid development of both hardware and software that has taken place since the era of mainframes. Today, we have devices embedded in common home appliances which are taken for granted. These small computing devices each have more processing power than those first computers which occupied a large hall in a research center. This key technology has been reduced in size due to advances in fabrication technology.

The heart of an embedded system is usually implemented using either general purpose processors, ASICs (Application Specific Integrated Circuits) or a combination of both. General Purpose Processors (GPPs) are programmable, but consume more power than ASICs. Reduced time to market and minimized risk are factors which favour the use of GPPs in embedded systems. ASICs, on the other hand, cost a great deal to design and are non-programmable, making upgradability an impossible dream. However, ASICs have reduced power consumption and are smaller than GPPs.

Application Specific Instruction-set Processors (ASIPs) are processors with specialized instructions, selected co-processors, and parameterized caches applicable only to a particular program or class of programs. An ASIP will execute the application for which it was designed with great efficiency.

Industry has long been researching various ways to speed up applications, and their relevant hardware mechanisms, without being left behind by competitors. Design automation plays a major role in creating different designs to suit a particular customer base. Market forces demand performance increases at decreased prices. Functionality must also be improved, in order to meet increasing customer expectations about flexibility and ease of use. All in all, the system becomes increasingly complex over time. With the pressure of competition and decreasing product lifetimes, the automation of computer design is proving essential.

The aim of this thesis is to develop methodologies and tools to design multiprocessor systems automatically and efficiently on an extensible platform, so as to reduce design turnaround time and time to market, thus increasing productivity and competitiveness in the industry.

1.1 Microprocessor Generations

The purpose of a computer is to “compute”—to solve complex mathematical problems. World War II sparked the information revolution when governments made considerable effort to exploit advances in science as weapons to gain total supremacy over the enemy. War became the springboard which would bring industry into the IT era.

1.1.1 First Generation – 1940-1956

The first modern computer, which became the template for all current computer designs, was the ENIAC (Electronic Numerical Integrator And Computer). The ENIAC, EDVAC (Electronic Discrete Variable Automatic Computer) and UNIVAC (UNIVersal Automatic Computer) computers are examples of first-generation computing devices. The UNIVAC was the first commercial computer delivered to a business client, the U.S. Census Bureau, in 1951. These machines were called mainframes. As a result of the development of the EDVAC, von Neumann’s 1945 report, “First Draft of a Report on the EDVAC” [158], was written, describing what has since become known as the von Neumann Architecture: a computer in terms of its four basic functional units—memory, processor, input and output.

“Mainframes” [55] were originally given that name because their circuits were mounted on large metal frames, housed in cabinets. Mainframe installations consisted of a number of these cabinets, on a false floor raised a few inches above the real floor, which provided room for the numerous thick cables connecting one cabinet to another and allowed proper air circulation. The room housing this massive equipment was climate-controlled.

The first computers used vacuum tubes for circuitry and magnetic drums for memory, and were often enormous, taking up entire rooms. These machines were very expensive to operate and, in addition to using a great deal of electricity, generated a lot of heat, which was often the cause of malfunctions. First generation computers relied on machine language to perform operations, and they could only solve one problem at a time. Input was based on punched cards and paper tape, and output was produced by line printers.

1.1.2 Second Generation – 1956-1963

Transistors gradually replaced vacuum tubes, ushering in the second generation of computers. The transistor was invented in 1947 but did not see widespread use in computers until the late 1950s. The transistor was far superior to the vacuum tube, allowing computers to become smaller, faster, cheaper, more energy-efficient and more reliable than their first-generation predecessors. Though the transistor still generated a great deal of heat that subjected the computer to damage, it was a vast improvement over the vacuum tube. Second-generation computers still relied on punched cards for input and printouts for output.

IBM, which was the main developer of mainframe computers, recognized the benefits of the transistor and began transistorizing its mainframe model, the 709. The result was a large transistorized computer, the 7090, which was still considered a mainframe. The maturing of transistor technology in the late 1950s made it possible to produce solid-state computers at low cost. Machines could be made more compact, allowing multiple machines to be installed in the same sized location that once housed a single mainframe.

Transistors paved the way for the proliferation of the next generation of computers—minicomputers. The performance of a typical mainframe is limited by the word length being processed. A short word length does not provide enough bits in an instruction to specify sufficient memory addresses. Arithmetic operations are also limited to simple calculations. Minicomputers were able to get around the drawbacks of mainframes by using more complex instructions. These added complexity to the implementation when using vacuum tubes, but with the use of transistors the processor could still remain simple, inexpensive and fast.

Well-known minicomputers include the CDC 160A by the Control Data Corporation and the PDP-1 by DEC (Digital Equipment Corporation). The PDP-1 was the first commercial computer that emphasized interaction with the user rather than the efficient use of computer cycles. It was the first in a long line of computers that focused on interactivity and affordability. The introduction of the minicomputer opened new areas of application, bringing more users (e.g., engineers and scientists) into direct interaction with computing machines. This brought the notion of personal computing to the industry and represents a step closer to ubiquitous computing.

1.1.3 Third Generation – 1964-1971

“Grosch’s Law” [70] argued that a computer system that was twice as big, or cost twice as much money, would provide not twice but four times as much computing power. The reason this often-cited law was considered reasonable is that computers of that era used magnetic cores for storage, which were cheap but needed circuitry that was expensive to build. Thus, bigger capacity memory cores were considered more cost efficient than smaller memory cores.

The development of the integrated circuit sparked the creation of the third generation of computers. The integrated circuit replaced transistors, resistors and other discrete components in the processing units of computers. These circuits were miniaturized and mounted on silicon chips, drastically increasing the speed and efficiency of computers. Magnetic cores were subsequently replaced by solid-state memory units.

Integrated circuits effectively miniaturized the transistors, diodes and wiring of a printed circuit board. In 1970, integrated circuit makers adopted a standard for chips using transistors to implement what was called “transistor-transistor logic” (TTL). TTL chips were inexpensive and easy to design, and were packaged in black plastic or ceramic cases with pins arranged along either side of the chip to provide connectivity to circuit boards.

Advancement in integrated circuit manufacturing led to greater numbers of transistors being packed into a single chip, leading to the terms medium scale integration (MSI), large scale integration (LSI) and, finally, very large scale integration (VLSI) becoming used to indicate the number of transistors on a chip.

It was at this time that DEC redefined the concept of the bus architecture, which was first used in the electromechanical Mark I, and implemented it in the PDP-11 minicomputer. Nearly all the units of the machine, including memory and I/O devices, were connected to a common 56-line bus. This enabled DEC and its customers to configure installations for specialized applications or to expand them.

Keyboards and display monitors replaced punched cards and printouts, while being interfaced with an operating system that allowed many different applications to be run at any one time, with a central program monitoring and allocating the memory. Computers became more accessible to wider audiences as mass production enabled them to be made cheaper.

1.1.4 Fourth Generation – 1971-Present

The birth of the next generation of computers arose from a different paradigm from that of the minicomputer. Calculators that use integrated circuits had demonstrated this well in the marketplace. In order to achieve the same computing capabilities as minicomputers, a set of integrated circuits incorporating the basic architecture of a general-purpose, stored-program computer was needed. By the late 1960s, a new type of semiconductor, the metal-oxide semiconductor (MOS), emerged. This offered a way to place even more logic elements on a chip, enabling the concept of a computer-on-a-chip to be realized.

Intel produced a programmable calculator that incorporated many of the elements of a general-purpose computer and paved the way for the world’s first commercial microprocessor, the Intel 4004, released on November 15, 1971. The microprocessor consists of the central processing unit, memory and input/output controls, all on a single chip. Successful derivatives of these first microprocessors include the Pentium, SPARC, Athlon and Alpha processors.

Further technology advancement led to the use of processors, such as ARM and MIPS, in embedded systems, due to miniaturization and lower power consumption. The use of such processors allows the production of smart devices at work and at home, including everyday appliances such as television and radio sets, microwave ovens, watches and mobile phones.

Proliferation of embedded systems

Ubiquitous computing has revolutionized the way humans conduct daily routines that would have been considered laborious and mundane chores 50 years ago. Personal computing provides interactivity between user and computer, allowing greater pleasure and ease of use for such devices. With the advancement in chip manufacturing and design, it is now possible to develop products with computing capabilities in every aspect of life. Ubiquitous computing arrives when technology recedes into the background of daily routine. Alan Kay at Apple describes this era as the “Third Paradigm” [161] of computing.

[Figure: sales per year plotted against year for three waves of computing: Mainframe (one computer, many people), PC (one person, one computer) and Ubiquitous Computing (one person, many computers).]

Figure 1.1: Ubiquitous computing trend, sourced from M. Weiser, “Ubiquitous Computing”

Figure 1.1 illustrates the emerging trend of ubiquitous computing [161]; its penetration has exceeded that of personal computing. One avenue for ubiquitous computing is embedded systems: computing devices, already influencing human lives, that are found in computer-controlled vehicles, mobile phones, smart homes and digital home entertainment. The embedded system is not a separate entity, but is built into the device that it is controlling.

Embedded systems are application-specific computer systems designed to perform particular functions. In contrast to personal or general purpose computers, embedded systems are wired and programmed for very specific requirements. Because of this, it is possible to scale down the size and power requirements of such devices, so that they are just enough to complete the specific task. Designs can be further optimized to provide better cost efficiency and performance, while being mass produced and benefiting from economies of scale.

Examples of embedded processors include ARM, MIPS, ColdFire/68k, PowerPC, x86, PIC, 8051, AVR, Renesas and Xtensa, among others. Compared with the personal computer market, which is dominated by a few microprocessor architectures, there is a greater variety of processor designs to be considered in the embedded system space. Different architectures can be selected to suit particular functions and tasks.

In order to make it possible for embedded systems to be used in everyday devices, it is vital for designers to meet three main challenges in embedded system design – performance, area and power.

1.2 Design Challenges

Processor designs have increased in complexity in terms of the amount of logic used within the circuit, as well as in the features involved in providing fast computations and high throughput. Many of these innovations are derived from the successful development of microprocessors for personal computers and can now be exploited in embedded systems. The challenge in embedded system design is to provide a design which maximizes performance while taking into account the size of the product and the amount of power it consumes.

1.2.1 Performance

More challenging and computation intensive tasks are now being performed by computer systems as humans become more reliant on technology to improve productivity and increase efficiency in a competitive environment. Performance becomes more critical in situations where accurate and prompt responses are required to ensure the proper health care and safety of human beings. The following sections present methods that have been developed to increase the performance of microprocessor-based systems.

Pipeline

Pipelining is an implementation technique in which multiple instructions are overlapped during execution, so that more than one instruction is being executed at any one time. A pipeline, in this context, is a set of data processing elements connected in series, such that the output of one element is the input of the next. Buffer storage is often inserted between processing elements to prepare the processed data for the next processing stage in the pipeline flow.

This technique borrows from the practice in factory assembly lines, where different actions are performed at different stations of the assembly line, each action being performed simultaneously on a different item. Pipelining does not decrease the time for a single datum to pass along the processing line; it only increases the throughput of the system when it is processing a stream of data. Pipelining may also increase the time it takes for one instruction to finish, due to the additional resources needed to implement the pipeline flow. Figure 1.2 shows the pipeline stages which are implemented in an Intel Itanium Processor.
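The throughput benefit can be quantified with a simple model (an idealization added here for illustration; it assumes equal stage delays and no stalls, which real pipelines only approximate). With k stages of delay \tau each, n instructions take

\[ T_{\mathrm{seq}} = n\,k\,\tau \quad \text{unpipelined}, \qquad T_{\mathrm{pipe}} = (k + n - 1)\,\tau \quad \text{pipelined}, \]

so the speedup nk/(k + n - 1) approaches k for long instruction streams. For a 10 stage pipeline, a stall-free run of 1000 instructions would take 1009 cycles instead of 10000, a speedup of roughly 9.9.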

Figure 1.2: A 10 stage In-Order Core Pipeline of an Intel Itanium Processor, based on the Intel Itanium Processor Microarchitecture Overview [9].

Other pipelining concepts include graphics pipelines and software pipelines. Graphics pipelines are found in high performance graphics cards. The graphics pipeline consists of multiple processing units or complete CPUs which implement the various stages of graphic rendering operations.

Software pipelines represent multiple processes linked up so that the output stream of one process is the input stream to the next one. This technique is most often found in Unix systems, where a number of different individual programs can be linked up to perform complex tasks never envisaged by the individual programmers of each program.

SIMD

SIMD (Single Instruction, Multiple Data) is a technique to increase parallelism in data processing. SIMD machines have been exemplified by supercomputers such as the Cray X-MP, Connection Machine, ILLIAC IV, MasPar MP2 and the Distributed Array Processor (DAP). Supercomputers like the Cray X-MP were known as “vector processors”, as these machines operated on long vectors.

DSPs (Digital Signal Processors) are dedicated processors that perform the same tasks as SIMD instructions. However, DSPs are standalone processors with complex instruction sets. SIMD capability is integrated into general purpose processors to handle such data manipulation; the general purpose portion handles the program flow. DSPs are designed for specific data types, such as sound or video, whereas SIMD designs are general purpose.

SIMD instructions operate on small data units which can be contained in a single register. These units can then be operated upon in parallel using only a single instruction. However, reading data from memory requires that the data in memory should already be arranged suitably for such SIMD operations (i.e., address aligned and contiguous). Examples of SIMD instruction sets used for multimedia applications are MMX (Intel), SSE (Intel), SSE2 (Intel), 3DNow! (AMD) and AltiVec (Motorola/IBM).
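As a concrete illustration, the C sketch below uses Intel’s SSE intrinsics (one of the instruction sets listed above) to add two arrays of floats four lanes at a time. It is a minimal sketch: the function name is ours, n is assumed to be a multiple of four, and the arrays are assumed to be 16-byte aligned, reflecting the alignment requirement just described.

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Add two float arrays four lanes at a time. Assumes n is a
     * multiple of 4 and that a, b and out are 16-byte aligned,
     * as the aligned load/store intrinsics require. */
    void vec_add(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);   /* load 4 packed floats  */
            __m128 vb = _mm_load_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);   /* 4 adds, 1 instruction */
            _mm_store_ps(&out[i], vc);
        }
    }

A scalar version would perform one addition per loop iteration; here each _mm_add_ps performs four.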

Superscalar

In order to design faster processors, the simple pipeline processor can be extended to include longer and deeper pipeline stages. This technique is called superpipelining. For example, the Intel Pentium D has 31 pipeline stages, in order to scale the processor up to a higher frequency.

There are times when deeper pipeline stages would not be beneficial. Longer pipeline stages would normally increase the latency of execution of a single instruction, thus short operations would not benefit from an architecture with long pipeline stages.

A different trend is to replicate the internal components of the processor so that it can launch multiple instructions in every pipeline stage. A superscalar architecture is able to execute more than one instruction at a time by pre-fetching multiple instructions and simultaneously dispatching them to redundant functional units on the processor.

Configuration                 Pipeline stages
ALU or branch instruction     IF  ID  EX  MEM WB
Load or store instruction     IF  ID  EX  MEM WB
ALU or branch instruction         IF  ID  EX  MEM WB
Load or store instruction         IF  ID  EX  MEM WB
ALU or branch instruction             IF  ID  EX  MEM WB
Load or store instruction             IF  ID  EX  MEM WB
ALU or branch instruction                 IF  ID  EX  MEM WB
Load or store instruction                 IF  ID  EX  MEM WB
ALU or branch instruction                     IF  ID  EX  MEM WB
Load or store instruction                     IF  ID  EX  MEM WB

Table 1.1: Example of a superscalar processor with processing and memory access pipeline flows. The ALU and data transfer instructions can be issued at the same time.

The Intel Core 2 Duo reduced the number of pipeline stages of the Pentium D to just 14. However, every execution core within Core 2 is wider, allowing each core to complete up to four full instructions instead of just three in the Pentium D. In contrast with a pipeline processor, where instructions are still executed in sequence, a superscalar processor allows multiple instructions to be executed and committed at the same time.

Coprocessors

A coprocessor is a specialized functional unit that supplements the functions of the primary processor; by offloading processor-intensive tasks from the main processor, the coprocessor is able to increase system performance. Examples of coprocessor domains include floating-point, graphics, signal processing, string processing and encryption.

A coprocessor may not be a fully capable general-purpose processor, since it may not be able to fetch instructions from memory, execute program flow control instructions, manage memory or perform input/output operations. A main processor is required to fetch the coprocessor instructions and to handle all other operations.

There are more capable coprocessors that can carry out a limited range of functions under the close control of a supervisory processor. A system with a coprocessor is generally not considered a multiprocessor system.

Examples of coprocessor systems include the 486 DX2 with math coprocessor, the Atari 8-bit machines with their Video Display Controller, and the Commodore Amiga with its Copper graphics coprocessor. Current dedicated coprocessors include the Sound Blaster X-Fi for multimedia acceleration, GeForce processors for 3D graphics acceleration and PhysX for complex physics computations (so that the CPU and GPU do not have to perform these time- and resource-consuming calculations).

Multicore / multiprocessor

As users become more reliant on the efficiency of computer systems, a greater number of tasks is now executed on such systems at any one time. Different levels of parallelism have been explored, including instruction-level and task-level parallelism. With process-level parallelism, different applications can be executed on separate processors simultaneously without affecting the memory space of the other applications.

Multiprocessor systems are systems with two or more microprocessors within a single computer. Different applications (i.e., different processes) can be executed independently on different processors. Moreover, an application can be written and dissected into separate processes in order to make use of such a system. The technique of partitioning a program into distinct processes is called multiprogramming.

With the advent of the SoC (System-on-Chip) paradigm, more features can be fabricated into a single silicon chip. Multicore is a technique by which multiple processing cores are coupled and fabricated together into a single central processing unit (CPU) package. A multicore system is a miniaturized version of a system with separate packaging for the different processing cores. The miniaturized version saves area as well as power, thus making it possible for usable systems to be available in low-power mobile devices.

A homogeneous multiprocessor system consists of identical processing cores with the same instruction set architecture (ISA) and features. A homogeneous system would be easier to program, as the designer would not need to map applications to different processor designs. Homogeneous multicore systems are currently popular in desktop and server computer systems. For example, the Intel Core 2 Duo features two independent processor cores integrated on chip. Each core runs at the same clock frequency, and shares the L2 cache as well as the front-side bus. Systems which treat all cores equally are known as symmetric multiprocessing (SMP) systems, whereas examples of systems where resources are divided differently among the cores are asymmetric multiprocessing (ASMP) and non-uniform memory access (NUMA) multiprocessing systems.

Heterogeneous multiprocessor systems are used in performance- and energy-constrained systems. These systems contain two or more different processing core designs, which are mapped to different functions of the application. Each processing core is optimized for the functions to which it is mapped. Heterogeneous systems can be found in video, multimedia, cryptography and various computation-intensive applications. An example of a heterogeneous multiprocessor system is the Cell processor, developed by Sony, Toshiba and IBM, which contains a Power-architecture main processor and eight fully-functional RISC (reduced instruction set computer) coprocessors.

1.2.2 Area

Various techniques and methodologies have been developed to produce microprocessors on ever-decreasing silicon area, resulting in smaller chip packaging and making it possible to integrate chip technologies into a variety of applications. Further miniaturization would enable designers to include more features in a single chip design. Smaller area utilization would also result in lower power usage.

Moore’s Law

Gordon Moore, co-founder of Intel Corporation, made an observation in 1965 that each new memory integrated circuit contained roughly twice as much capacity as its predecessor, and that each chip was released within 18-24 months of the previous chip. He concluded that this trend would cause computing power to rise exponentially. Figure 1.3 shows the transistor count trend since the production of the first microprocessors. Moore’s Law still prevails despite many sceptical claims that a limit to the trend must eventually be reached.

[Figure: transistor count (log scale, 10^3 to 10^10) versus year, 1970 to 2010, for processors from the 4004, 8008 and 8080 through the 80286, 80386, 80486, Pentium series, AMD K5-K8, Itanium and Cell, up to the Core 2 Quad and Dual Core Itanium 2.]

Figure 1.3: Transistor count for each processor generation
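Moore’s observation can be restated as a simple growth model: if the transistor count doubles every T months, then

\[ N(t) = N(t_0)\, 2^{(t - t_0)/T}. \]

Taking the longer doubling period of T = 24 months and the roughly 2,300 transistors of the 1971 Intel 4004 as the starting point, the model predicts about 2,300 × 2^17 ≈ 3 × 10^8 transistors by 2005, in line with the counts plotted in Figure 1.3.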

In order to keep up with the increasing amount of logic in integrated circuits, fabrication technology has advanced well in miniaturization techniques. Very Large Scale Integration (VLSI) paved the way by fabricating transistors onto silicon wafers. Photolithography is the use of light and lasers to etch and mask layers onto silicon wafers. Producing the masking technology to fabricate and mask ever smaller wires onto silicon wafers is a challenge in itself; 130nm, 90nm, 65nm and 45nm wire sizes are currently possible, due to technological advances in silicon chip fabrication processes.

Specialization

General purpose processors are not area efficient, as they contain logic and features that would not be used in certain applications. However, these processors are programmable for a wide range of applications, and so are used in embedded systems where rapid design turnaround is crucial.

By specializing and customizing for a particular application, redundancy can be reduced. Microprocessor designs can be customized (i.e., cache sizes, bus width, instruction set architecture, etc.) depending on how computation intensive the applications are.

To achieve further area reduction, application specific integrated circuits (ASICs) are used. ASICs are popular in embedded systems, as they are small, compact and power efficient. Such circuits achieve high computing performance but are difficult and tedious to design.

1.2.3 Energy

Energy consumption has become an important factor in recent years with the widespread use of portable computing. A great deal of research and development has been done, focusing on power consumption analysis and on finding ideas and solutions to decrease power dissipation in processor devices. Power reduction can be carried out at different levels: the technology, circuit, architecture and system (software) levels. Devadas et al. have produced an excellent survey of power reduction efforts in [53].

Circuit Level

Choices about the placement of individual transistors in a gate will affect the power dissipation of the circuit design, while the ordering of gate inputs will affect both the power and the delay of the signal. Improvements in power and delay can be obtained by changing the order of transistors within individual complex gates.

Transistor sizes may also have a significant impact on circuit delay and power dissipation. Delay decreases if the transistors in a given gate are increased in size; however, this increases power dissipation, and the delay of fan-in gates increases due to the increased load capacitance. The usual approaches to this problem involve computing the slack at each gate in the circuit, where slack is the extent to which the gate can be slowed down before reaching the critical delay of the circuit. Transistor sizes in sub-circuits with positive slack are reduced until the slack becomes zero, or until the transistors are at their minimum size.
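A sketch of this slack computation is shown below. It is a simplification of what a real sizing tool does: the netlist is assumed to be topologically ordered, and slack is measured against a single global delay target rather than per-gate required times from a backward pass; all names are illustrative.

    #define MAX_FANIN 4

    /* One gate in a topologically ordered netlist (illustrative). */
    typedef struct {
        double delay;          /* intrinsic gate delay           */
        int    nin;            /* number of fan-in gates         */
        int    in[MAX_FANIN];  /* indices of driving gates       */
        double arrival;        /* computed: latest input + delay */
        double slack;          /* computed: target - arrival     */
    } Gate;

    /* Forward pass: compute arrival times, then slack against a
     * global critical-delay target. Gates with positive slack are
     * candidates for downsizing until slack reaches zero or the
     * transistors reach minimum size. */
    void compute_slack(Gate *g, int ngates, double target)
    {
        for (int i = 0; i < ngates; i++) {
            double latest = 0.0;   /* primary inputs arrive at t = 0 */
            for (int j = 0; j < g[i].nin; j++) {
                double a = g[g[i].in[j]].arrival;
                if (a > latest) latest = a;
            }
            g[i].arrival = latest + g[i].delay;
            g[i].slack   = target - g[i].arrival;
        }
    }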

Logic Level

Combinational logic optimization is performed in two distinct phases. The first phase involves technology-independent optimization, in which logic equations are manipulated to reduce area, delay or power dissipation. The next phase, technology-dependent optimization, involves mapping the equations to a particular technology library using technology mapping algorithms, again optimizing for area, delay or power.

The logic level can also be further optimized by factoring logical expressions: finding common subexpressions across multiple functions and reusing them. Algorithms are used to maximally reduce the literal count within the given expressions. However, when targeting power dissipation, the cost function is not necessarily the literal count but could be the switching activity. Logic circuits with reduced expressions result in smaller circuit area and thus in lower power usage, primarily due to leakage power.
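A small example makes the factoring step concrete. The sum-of-products form

\[ F = ab + ac + ad \]

has six literals, while the factored form F = a(b + c + d) has four. Fewer literals generally means fewer transistors, and hence smaller area and lower leakage; whether factoring also reduces switching activity depends on the signal statistics, which is why power-directed optimization may use activity rather than literal count as its cost function.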

Large VLSI designs contain several distinct components, such as register files, arithmetic units and control logic. Several of these units are clocked but would not be in use at the same time. Similarly, register values from unused sequential circuits need not be updated regularly. Power reduction can be obtained by gating the clocks of these registers, thus reducing the switching activity within the registers as well as the sequential logic to negligible levels.
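The benefit of clock gating follows directly from the standard dynamic power model

\[ P_{\mathrm{dyn}} = \alpha\, C_L\, V_{dd}^{2}\, f, \]

where \alpha is the switching activity factor, C_L the switched capacitance, V_{dd} the supply voltage and f the clock frequency. Gating the clock of an idle register bank drives its \alpha toward zero, removing its dynamic power contribution while it is unused and leaving only leakage.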

Architecture / System Level

Power analysis at the architecture and system level is a valuable way to quickly estimate and design complex SoC (System-on-Chip) systems within the power and performance constraints of the application. Power analysis tools can be of great use to designers, helping them to explore the design space manually. Power models can be obtained by estimating the capacitance that would switch when a particular module is activated [127]. Alternatively, average power costs can be assigned to individual modules, in isolation from others. The power costs of the modules involved in a given computation are then added up during simulation.

Architectural power optimization allows power-aware mapping of the control/data flow graph to functional units, and of variables to registers, as well as the definition of interconnects (i.e., multiplexors and buses). Hardware sharing and the mapping of operation sequences to functional units affect the total switched capacitance in the data path. Thus, accurate architecture and system modeling allows architectural and algorithmic tradeoffs and optimizations that can be used for low power designs.
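A minimal sketch of the second scheme, module-level accounting during simulation, is given below; the module set and per-activation energy figures are illustrative placeholders, not characterized values.

    /* Module-level energy accounting during simulation (a sketch;
     * module names and energy figures are illustrative only). */
    enum { MOD_ALU, MOD_REGFILE, MOD_ICACHE, MOD_BUS, NUM_MODULES };

    static const double nj_per_use[NUM_MODULES] = {
        0.12,  /* ALU operation         */
        0.05,  /* register file access  */
        0.30,  /* instruction cache hit */
        0.22   /* bus transaction       */
    };

    static double energy_nj[NUM_MODULES];

    /* Simulator hook: called whenever a module is activated. */
    void module_activated(int m) { energy_nj[m] += nj_per_use[m]; }

    /* Total energy is the sum over modules; dividing by the
     * simulated runtime gives an average power estimate. */
    double total_energy_nj(void)
    {
        double e = 0.0;
        for (int m = 0; m < NUM_MODULES; m++)
            e += energy_nj[m];
        return e;
    }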

The increasing use of programmable processors in embedded systems has led to the development of Application Specific Instruction-set Processors, which consist of a software and a hardware component. The software component runs on a dedicated microprocessor (which may be general-purpose), while the hardware portion consists of application-specific circuits. It is therefore important to explore the combined hardware and software design space when evaluating designs for low power. Methodologies need to be developed to decide which portion of the application should be executed in hardware and which in software.

Software techniques can also be used to decrease power consumption during execution. Mehta et al. [119] proposed a novel compilation technique which reduces energy consumption through proper register labeling during the compilation phase. The technique reduces the energy of the processor by encoding the register labels such that the sum of the switching costs between all the register labels in the transition graph is minimized, thus reducing the energy of the register file decoder.
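The objective behind such register relabeling can be stated compactly (our paraphrase of the idea, not the exact formulation of [119]): if E is the edge set of the register transition graph, w(u, v) the frequency with which registers u and v are accessed consecutively, and H the Hamming distance between binary encodings, the labeling \ell is chosen to minimize

\[ \sum_{(u,v)\in E} w(u,v)\, H\big(\ell(u), \ell(v)\big), \]

so that frequently adjacent registers receive encodings differing in few bits, reducing switching in the register file decoder.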

1.3 Extensible Processor Platform

The heart of an embedded system is usually implemented using either general purpose processors, ASICs or a combination of both. General Purpose Processors (GPPs) are programmable, but consume more power than ASICs. Reduced time to market and minimized risk are factors which favour the use of GPPs in embedded systems. ASICs, on the other hand, cost a great deal to design and are non-programmable, making upgradability an impossible dream. However, ASICs have reduced power consumption and are smaller than GPPs.

Extensible processor systems provide an alternative contender for implementing functionality in embedded systems. These are processors with specialized instructions, selected co-processors, and parameterized caches applicable only to a particular program or class of programs. An extensible processor will execute an application for which it was designed with great efficiency; it is also capable of executing any other program, but usually at a greatly reduced efficiency.

An extensible processor consists of a RISC-based processor core that can be extended for a specific application using optional custom instructions. The extensible processor platform provides a configurable microprocessor core that handles not only conventional processor tasks, but is easily augmented to deliver the performance and power of custom logic. In contrast with rigid processor cores designed for workstation architectures, the extensible processor is highly optimized and small enough to be integrated into low-powered embedded systems.

Extensible processor platforms provide designers with a design tool chain ranging from processor generation utilities to compilation tools for each different processor design. Designers obtain a compiler, linker, assembler and debugger tuned exactly for the configured hardware, and are then able to profile a specific application and mould the processor to fit it. New instructions can be developed and automatically used by the compiler when new extensions are developed for an extensible processor. Various configurable options can also be chosen to rapidly explore the design space of the processor with respect to performance, power and area use.

Tools such as ASIPmeister [85], Tensilica [20], ARCtangent [4], Jazz [10], Nios [1] and SP5-flex [15] allow rapid creation of extensible processor systems. The designer effectively reduces the design risk with post-silicon programmability using processors instead of RTL blocks.

1.3.1 Base Configurations

An extensible processor platform provides the designer with a highly optimized and flexible microprocessor core. The core is extremely efficient, small and low-power, with better code density than the conventional microprocessors used in workstation computers. Processor options (memory management unit (MMU), registers, local memory types and sizes, hardware multipliers, SIMD instructions and various other instruction-set options) are designer-configurable. An extensible processor may also provide options for different pipeline stage implementations depending on the needs of the application, as well as various instruction encodings (e.g., 16- or 24-bit) for higher code density.

1.3.2 Extensible And Customized Instructions

Extensible processor platforms provide tools and a framework to enhance the base processor core with customized hardware logic. New functional units and extensible instructions can be added using platform-specific languages. Application-specific instruction extensions provide order-of-magnitude improvements in application performance while eliminating RTL blocks from SoC designs. Synthesizable code can be obtained, together with the software tool chains for customized architectures.

Additionally, a variety of extensible platforms include pre-verified IP blocks that can be used in the customized design. Floating-point units (FPUs) can be added or removed, depending on the needs of the application. Much research has been done to automate the synthesis of extensible instructions as well as the mapping of application code to hardware or software blocks.

1.3.3 Architectural Configurations

An extensible processor platform provides a development environment and modeling tools for the system designer to rapidly explore the design space for different architectural configurations, including instruction and/or data cache sizes, communication bus widths, coprocessor integrations and multiprocessor configurations.

More advanced extensible platforms (e.g., the Tensilica Xtensa) provide an environment for processor subsystem modeling and simulation. This allows rapid assembly of system-level simulations of one or more processors and various memories and building blocks. With such a framework, designers can rapidly build and simulate complete SoC subsystems that comprise multiple homogeneous or heterogeneous processors.

1.3.4 Heterogeneous Multiprocessor Via An Extensible Platform

A single processor system is able to execute only one task at any given time. However, it is well known that an application might have some tasks which can be executed independently of each other. Thus, partitioning an application into separate independent tasks would enable execution of multiple tasks at the same time. Various systems have been implemented that provide a platform where multiple processing entities can perform computation on different parts of the system concurrently [125, 100].
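As a minimal illustration of such partitioning, the C sketch below uses POSIX threads to run two independent tasks of a hypothetical application concurrently; on a multiprocessor, each thread may execute on its own processing entity. The task bodies are placeholders.

    #include <pthread.h>

    /* Two independent tasks of a hypothetical application. */
    static void *task_a(void *arg) { (void)arg; /* e.g., decode input  */ return 0; }
    static void *task_b(void *arg) { (void)arg; /* e.g., filter output */ return 0; }

    int main(void)
    {
        pthread_t ta, tb;

        /* Launch both tasks; they execute concurrently. */
        pthread_create(&ta, 0, task_a, 0);
        pthread_create(&tb, 0, task_b, 0);

        /* Wait for both to finish before exiting. */
        pthread_join(ta, 0);
        pthread_join(tb, 0);
        return 0;
    }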

Homogeneous systems consist of processors that are identical, such as those in Symmetric Multiprocessing (SMP) systems. Such systems are single-ISA systems, whereby an identical ISA (instruction-set architecture) is used for all processing cores in the system. Heterogeneous processor systems use differing processing entities to maximize performance while minimizing area and power consumption. Such systems may consist of a network of ASIPs, DSPs, coprocessors and ASIC components fabricated on the same silicon die. Each component would be mapped and assigned to specific functions, thus executing multi-threaded applications.

Existing approaches to heterogeneous processor architectures typically map critical regions of software into hardware (i.e., DSP, ASIC, etc.). Each hardware component is optimized and suited to its specific mapped region to maximize performance. To increase the efficiency and performance of critical systems, extensible processors [1, 4, 15, 20] have been used in SoCs. The instruction set and underlying architecture of an extensible processor can be configured for specific applications in order to improve efficiency. Extensible processors provide a good trade-off between efficiency and flexibility, as the same design can be reused between different product variants and updated at little additional cost.

1.4 Design Automation

Microprocessor design has become more complex as market expectations of performance, reliability and pricing steadily increase. Ever more demanding application requirements and the ubiquity of computing devices in daily chores call for rapid deployment and greater flexibility of computer systems. The advent of computer-aided design has brought numerous technology advances, with faster processors and innovative solutions. However, market pressure for reductions in price and product turn-around time is motivating the need for comprehensive automation of computer system design.

Increasing complexity

Microprocessor complexity has steadily increased over the years, as features for improvement of parallelism and power optimization are added to designs. Figure 1.3 shows the transistor counts in major processors today, compared with the first-generation microprocessors of the 1970s. Such complexity is a burden to system designers, as microprocessor design becomes more sophisticated and requires more time for verification and testing. Additionally, comprehensive design-space exploration over a multitude of processor configurations demands a systematic and automatic approach.

Decreasing price

Increasing demand, and hence production volume, generally decreases the prices of products for end consumers, due to economies of scale. However, the ever-increasing complexity of processor design tends to curb the production of cheaper computer systems: more time and money must be spent on designing and verifying complex microprocessor systems, and greater numbers of designers are needed to address the increasing features being integrated into designs.

To meet customer expectations, manufacturers must make technological innova- tions so as to sustain the trend in price reduction. Computer-aided design and design automation is capable of greatly reducing the design effort and time involved in the design of complex SoC systems.

Decreasing product lifetime

Manufacturers and developers are required to keep up with development trends in the market. In the early days of the computer industry, new models were released no more than once every five years; software and user requirements did not change as rapidly as today. By the 1990s, a computer's saleable lifespan could be as little as 1.5 years. In today's computer industry, new microprocessor models from Intel and AMD are released more often than once a year. In the embedded systems market, such as for mobile phones and music players, new models have to be produced to keep up with the frequent technology and software advancements in the multimedia and communication areas, due to CODEC (coder-decoder) updates and new features, such as faster protocols and new user interfaces.

Shrinking product lifetimes require a high design turnaround rate, providing another motivation for an automated design approach for new products. There is a need for automated design-space exploration in order to rapidly identify the most cost-effective configuration of a particular product. A company which is able to keep up with this rapid change of product models will remain competitive in the marketplace and, in the long run, provide consumers with cheaper and more innovative products.

1.5 Motivation

A computer system consists of software and hardware that are integrated to perform the task intended of the system. The current design process for software is not fully automated and involves several stages that need the intervention of the software designer. However, some of the stages have been fully automated, including the profiling and compilation stages provided in a standard software development toolkit.

In a similar way, design automation is required at various stages of the microprocessor development process.

Figure 1.4 shows the various stages involved. The designer begins the development process with a description of the intended processor and the requirements of the intended application. The Hardware Design stage processes this information and produces a detailed list of the components and interconnections of the microprocessor. This stage also provides information relevant to building the physical design. In the Physical Design stage, the information from the previous stage is used to place and route the various components efficiently, using the available technology libraries of the fabrication process.

The next three stages involve the fabrication of the design in silicon and packaging.

Testing is then performed to determine whether the device meets all its specifications: patterns of bits, called test vectors, are applied to the device and its behaviour and performance observed. Finally, the developed product is deployed and shipped to the customers.

This thesis focuses on the integration of the hardware design and software design stages, termed hardware-software codesign.

Figure 1.4: Microprocessor Development Process

The trend of increasing numbers of transistors in chips continues, upholding Moore's Law. Figure 1.3 shows the continuing trend in the microprocessor domain, which has been made possible by the continuing improvement in miniaturization techniques by manufacturers.

The availability of this technology provides designers with the resources needed to include more and more features and components in microprocessors. Systems that were previously manufactured on circuit boards can now be fabricated in a single chip package, a System-on-Chip (SoC).

Semiconductor fabrication facilities now offer circuit design teams more transistors than they can use; in other words, engineers are not able to keep up with the rate at which manufacturers are making new transistors available for new chip designs. With such an abundance of resources, a methodology and an automated way of making efficient use of them is needed to provide better features and improved performance.

Consumer applications have become more sophisticated and consumers demand higher performance. Multimedia applications require high performance in order to provide high-quality video and music to the audience. Safety devices, such as automotive speed monitoring and anti-lock braking systems, require rapid response times in order to ensure the safety of users. In the health industry, camera pills (which are swallowed by patients) require fast processing of images while maintaining low power consumption.

Increased design costs and shorter design deadlines, combined with consumer expectations of reduced prices, can quickly erode profit margins, sometimes eventually driving a manufacturer into bankruptcy; this calls for increases in designer productivity. Higher productivity reduces both design cost and time, thus reducing product cost and eventually allowing prices to drop in the marketplace.

Application-specific designs, such as ASICs and DSPs, are fast and efficient. However, these designs are not flexible enough to run other applications or to accommodate major changes that may be needed in the algorithms. Thus, current design trends are toward embedded systems built around programmable, customizable general purpose processor cores, which require low design turnaround time and yet are power efficient. With custom processors, we can add or remove instructions, and map portions of code incrementally using extended instructions. Customization is a good approach to trading off performance and power.

Current trends have shown the need for more parallelism in designs, so as to achieve higher speed while keeping power demands to a minimum. More cores can now be added to System-on-Chips (SoCs) to take advantage of multi-threaded systems. As more cores are added, the system becomes more sophisticated to design, presenting a larger design space. There is a growing need for the automated design of multiprocessors in embedded systems, particularly for application-specific multiprocessor systems with a multitude of parameters; exploration of such a large design space is time-consuming.

1.6 Research Goals and Contributions

This thesis presents several methodologies and case studies to address the following problems:

1. The complexity of embedded applications has steadily increased to provide more services and functionality in mission-critical systems. High speed general purpose processors (GPPs) would not be appropriate for such systems, due not only to the power and area constraints in an embedded system but also to the generality of the GPP, which would not allow it to provide enough processing power in high speed applications. An extensible processor framework is used to provide the necessary customization and processor speed without using a large area.

2. Customization of processors to speed up application performance requires valuable time and resources. A formal methodology should be available that would ensure proper code identification, selection and mapping to customized hardware. This would address the increasing importance of reduction in design turnaround time.

3. Parallel processing has evolved from superscalar to multiprocessing architectures. As more processing cores are included in System-on-Chips (SoCs), more formal approaches have to be used to partition and assign code to the various processing entities (PEs) in an SoC.

1.7 Thesis Overview

This thesis explores a variety of different approaches to optimizing SoC systems, by focusing on a particular application. The approaches represent methodologies for generating a microprocessor-based design, maximizing its performance while taking into account the area use and power consumption of the completed system. The thesis explores the advantages of extensible processors that offer customization of components and instruction sets, while maintaining a certain degree of programmability and generality of the system. The work goes on to exploit parallelism within loops and tasks in generating application-specific processor systems. While custom circuits and processors for specific applications have been studied in the past, the work presented in this thesis demonstrates the use of novel techniques and methodologies for designing an extensible processor platform.

Chapter 2 is a necessary literature review of the background, including design approaches to embedded systems, general purpose processors (GPPs), Application-Specific Instruction-set Processors (ASIPs), customized instructions, parallelism in hardware and software, and coprocessor integration. This chapter also points out the importance of different levels of parallelism (i.e., instruction level (ILP), loop level and thread level) and the use of multiprocessor systems.

Chapter 3 describes the nature of the approaches presented in this thesis. This chapter provides a philosophical overview of the approaches, the actions taken and the overall flow of the thesis, justifying the relationships between the various chapters and the motivation behind each and every approach presented. The chapter also describes how the overall flow fits within the goals of this thesis.

Chapter 4 presents a methodology for the rapid creation of different processor configurations by the removal of unused instruction sets from the architecture. The scheme uses ASIPmeister, a processor generation tool which generates VHDL (Very High Speed Integrated Circuit Hardware Description Language) code for the configured processor. A profile-directed approach is used to identify frequently used instructions and to isolate unused opcodes from the available instruction pool. The methodology used to reduce and generate represents one of the fastest methods of generating an application-specific processor. Five benchmark applications are used for profiling and processor generation.
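
A minimal sketch of the profiling step, assuming the instruction trace is available as a stream of opcode numbers on standard input (the trace format and the 64-entry opcode space are assumptions made for illustration):

    #include <stdio.h>

    #define NUM_OPCODES 64          /* assumed size of the opcode space */

    /* Count how often each opcode appears in an execution trace, then
       report the unused ones: these are candidates for removal from
       the generated processor description.                            */
    int main(void)
    {
        unsigned long count[NUM_OPCODES] = { 0 };
        int op;

        while (scanf("%d", &op) == 1)   /* one opcode per trace entry */
            if (op >= 0 && op < NUM_OPCODES)
                count[op]++;

        for (op = 0; op < NUM_OPCODES; op++)
            if (count[op] == 0)
                printf("opcode %d unused: candidate for removal\n", op);
        return 0;
    }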

Chapter 5 addresses the problem of achieving further speedups on computation-intensive loops in applications. The chapter presents an approach for accelerating loops by tightly coupling a coprocessor to an ASIP. A case study on a JPEG encoding algorithm is conducted to illustrate the advantages of such an approach. One of the main loops of the JPEG algorithm is accelerated by using the proposed coprocessor architecture. The acceleration is contrasted with two other approaches: a conventional loosely coupled coprocessor and an extensible processor with custom instructions. The conventional coprocessor is synthesized via a high-level synthesis (HLS) approach, which is a faster method of generating coprocessors.

Chapter 6 explores the design space of a heterogeneous multiprocessor system via a case study on a JPEG compression algorithm. The chapter presents a novel architecture in which the sequentially streaming JPEG application is partitioned into tasks that can be executed in parallel, and executed on multiple processors arranged in a pipelined manner. Each algorithmic stage is executed on a single processor, with queues between processors for communication. By carefully customizing each processor, the pipeline is balanced (i.e., processing times are nearly equal), allowing speedups with little overhead. The case study provides an analysis of the effect of increasing the number of cores in the system and the extent to which performance improvement can be achieved. This chapter also includes a comparison of the pipeline model against a master-slave architecture.

Building on the case study of Chapter 6, Chapter 7 presents a methodology to create a heterogeneous multiprocessor system using ASIPs as processing entities in a pipeline configuration. The problem of mapping each algorithmic stage in the system to an ASIP configuration (i.e., cache sizes and instruction set) is formulated. The chapter describes two heuristics which efficiently search the design space for a pipeline-based multi-ASIP system that maximizes the performance/area ratio. An estimation technique is also developed to simplify the search space and to quickly provide the run time of an application when the individual run times of each pipeline stage are known. Two benchmark applications are used: JPEG and MP3 encoders. The heuristic solution is reached within a fraction of a second, while the exhaustive approach takes much longer.
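
The intuition behind such an estimate can be sketched as follows. This is the standard fill-then-drain pipeline model, not necessarily the exact estimator developed in Chapter 7; per-stage processing times and negligible queue overhead are assumed:

    /* Estimate the run time of a k-stage processor pipeline on n data
       items: fill time (one pass through every stage) plus (n - 1)
       further items emerging at the rate of the slowest stage. Queue
       and synchronization overheads are ignored.                      */
    double estimate_pipeline_time(const double stage_time[], int k, long n)
    {
        double fill = 0.0, slowest = 0.0;
        for (int i = 0; i < k; i++) {
            fill += stage_time[i];
            if (stage_time[i] > slowest)
                slowest = stage_time[i];
        }
        return fill + (double)(n - 1) * slowest;
    }

The expression makes the benefit of balancing explicit: once the stage times are nearly equal, the throughput term (n - 1) * slowest approaches its minimum for a given total amount of work.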

Chapter 8 finally concludes with a summary of the thesis, its achievements and future work that could be undertaken.

Chapter 2

Literature Review

2.1 Introduction

Customization has been vital in creating systems that are power efficient, portable and reliable, while also meeting performance requirements, thus providing a competitive advantage in the industry. This chapter provides a necessary review of the literature on design automation for the customization of extensible systems. The chapter opens with a historical overview of embedded systems and the ways they have evolved over time. In later sections, the chapter describes the various architectural designs available for embedded systems. Focus is then directed to the customization of these architectures and to how parallelism within code is exploited. A thorough review of previous research in design space exploration is then provided.

2.2 Embedded Systems

Information processing has traditionally been associated with computers and mainframe servers. However, advances in miniaturization have brought the possibility of integrating processing and computing technologies into smaller devices, such as communication systems, televisions, radios, vehicles and tabulating machines. An embedded system is defined as "a combination of computer hardware and software, and perhaps additional mechanical or other parts, designed to perform a dedicated function. In some cases, embedded systems are part of a larger system or product, as in the case of an antilock braking system in a car." [33].

During the era of the first digital computers, such as the ENIAC, EDVAC and UNIVAC, computers were also dedicated to a certain task but were too large to be considered embedded systems. The programmable controller concept soon evolved from computer technology, solid-state devices and traditional mechanical devices. Embedded systems form the basis of the so-called post-PC era, in which information processing is moving away more and more from PCs alone to embedded systems [118]. Embedded systems provide applications for almost every function in daily life; such types of application have been called ubiquitous computing [161], pervasive computing [76] and ambient intelligence [25].

An embedded system consists of two major components: an operating system that controls the microprocessor in the device, and the application package which runs on that operating system. However, an embedded system can also be constructed solely from hardware circuits (i.e., ASICs) that perform a highly specific function. Cisco, Wind River Systems, Sun Microsystems, Integrated Systems, Microware Systems, and QNX Software Systems are among the prominent developers of embedded systems.

Embedded systems are now deeply integrated into everyday appliances. A vehicle may have over 50 microprocessors, controlling functions such as engine management, braking with electronic anti-lock brakes, transmissions with traction control and electronically controlled gearboxes, safety with airbag systems, electric windows and air-conditioning. Figure 2.1 provides an overview of a typical 'smart car' design. A production example is the Lexus LS [12], with object-recognition pre-collision systems, self-steering Lane Keep Assist, and an automated parking system. Even a washing machine may have an embedded system that provides different washing programs, power controls for the motors and pumps, and video display controls.

Current mobile phones have more processing power than the first-generation computers. These phones have embedded processors which process video and audio, while storing and delivering information via communication networks.


Figure 2.1: Embedded systems in a vehicle provide a multitude of features

The first recognizably modern embedded system was the Apollo Guidance Computer [2], developed by Charles Stark Draper at the MIT Instrumentation Laboratory in the 1960s. The Apollo Guidance Computer was the first to use integrated circuits to reduce its size and weight; the use of the then-novel monolithic integrated circuits was considered the riskiest item in the Apollo project. The Apollo Guidance Computer (AGC) – see Figure 2.2 – had the same basic architecture as a modern day computer, with an instruction set, registers, a control unit, memory, interrupts, a user interface and software.

The first mass-produced embedded system was the Autonetics D-17 guidance computer [113] for the Minuteman missile, released in 1961. The D-17 was built using transistor logic, and its main memory was a hard disk drive. The Minuteman II (1966) replaced the D-17 with a new computer that incorporated the first high-volume use of integrated circuits (ICs). Through the use of integrated circuits, the prices of embedded systems were eventually reduced dramatically, permitting their use in consumer products.

Early microprocessors include the AL1, which was conceived by Lee Boysel [36] and developed using a small number of MOS (Metal-Oxide Semiconductor) chips; this microprocessor, however, was not a single chip and was sold only as an entire computer system. Texas Instruments offered the first microcontroller in 1972: the TMS1000 [18], containing 1 kB of ROM and 32 bytes of RAM with a simple 4-bit processor. This microcontroller controlled the Texas Instruments calculator. Rockwell International also specialized in random logic with four-phase design, aided by internally developed computerized design tools. Rockwell had developed the world's first commercial chipset for hand-held calculators for the Japanese Sharp Corporation in 1968. In 1972, Rockwell announced the PPS-4 [14], a chip that was similar to Intel's MCS-4 (the microcontroller family containing the Intel 4004) and directly competed with it. The PPS-4 provided a higher degree of parallelism than the Intel 4004; although the former was implemented using a slower technology, its performance was comparable [59].

Figure 2.2: The Apollo Guidance Computer: (a) Interface Box; (b) Bloch Logic Module. Sourced from Computer History Museum [2]

2.2.1 Integration of logic-based circuits

Until the 1970s, most control system designs were implemented using individual logic integrated circuits. As integrated circuits developed, separate logic functions were integrated to create higher-level functions. For example, a complete adder could be integrated as a single package rather than being constructed out of individual logic gates. As more logic was packaged into a single package, the number of chips was reduced. The microprocessor revolution started in 1971 with the introduction of the Intel 4004 [59] microprocessor, used in calculators.

2.2.2 Functional Upgrades

The introduction of the microprocessor was prompted by the need for faster processing at lower cost when developing calculators. However, the ability to add or remove functionality from an embedded system was also of paramount importance. As most of the functionality of a calculator is implemented in software, it is possible to create a new product with new functionality without much change to the hardware; thus, different systems are able to share the same hardware base.

Such a programmable framework allows easy maintenance when debugging the functionality of the product, and expensive repairs can again be avoided. Program updates can be performed by simply replacing the software, which is normally stored in ROM (Read-Only Memory) chips.

2.2.3 Analogue Replacement

Digital signal processing (DSP) is the transformation of signals that have a digital representation. DSP is replacing conventional analogue processing in a number of fields and has diversified into various types of processing task, such as voice compression, image recognition and robotic control systems. In the 1940s, the feasibility of using digital elements to construct a filter was discussed. By the 1950s, increasing access to mainframe computers stimulated the development of digital signal processing applications; a limited set of digital signal processing techniques was used by seismic scientists [130].

Figure 2.3: Architecture of the Intel 4004 microprocessor.

Figure 2.4: Example of applications utilizing DSPs (GPS devices, mobile phones and DVD players, using algorithms such as FFT, MDCT and DCT, Kalman and FIR filtering, GSM/GPRS/3G, H.261/H.264, MPEG and MP3).

In the 1960s, the introduction of integrated circuits offered the possibility of a complete digital signal processing system. Kaiser [91] made contributions in the area of digital filter design, while Cooley et al. [48] contributed to the development of a fast method of computing the Discrete Fourier Transform (DFT); many variations and extensions of this method are known as Fast Fourier Transforms (FFTs). It appeared that digital processing would provide less resolution than analogue processing. However, with digital processing, signals or data can be easily manipulated, and algorithms can be easily changed due to their inherent programmability. DSP demonstrates cost, size and performance advantages that are not available in analogue processing [61].

2.3 Architectural Designs

Different usage in applications has led to a wide variety of computing devices, each belonging to a different architectural family with a different method of computation. The term "computer architecture" has been a focal point of debate in the past. Dasgupta [51, 52] defined architecture as an abstraction of some hardware/firmware system, in that it describes the structure, behaviour and performance of the hardware/firmware device viewed as an abstract information processing system. For Hennessy et al. [78], computer architecture is the aspect of a digital computer that comprises the instruction set architecture (instruction set, memory address modes and registers), microarchitecture (constituent parts of the interconnected system) and system design (all other hardware components).

Microprocessors are used in a variety of disparate markets, each with different requirements in its respective field, and so must be designed specifically to cater for the expectations and workloads of the appropriate markets. The different markets include those for personal computers, graphics, audio and portable players; these markets differ in terms of processing, power and size requirements. Microprocessors can be designed to be highly multipurpose, such as general purpose processors, or more application-specific, such as Digital Signal Processors (DSPs).

2.3.1 General Purpose Processors

The Intel 4004 [59] was introduced in 1971 and is known as the world's first general purpose microprocessor; it was meant to be used in the Busicom calculator and was designed such that new functions could be programmed easily without a complete redesign of the system. General purpose processors (GPPs) are commonly used in desktop, laptop and server computers. The ability to run a wide range of programs efficiently has made this architecture among the more technically advanced, compared with other architectures. However, its disadvantages include higher cost and power consumption than special purpose processors.

US$44 billion worth of microprocessors were manufactured and sold in 2003. Nearly 50% of this revenue came from CPUs used in desktop, laptop and server computers, yet these represented only 0.2% of the number of microprocessors sold. The embedded system market thus represents the major target for microprocessor sales.

The instruction set architecture (ISA) defines the design of a GPP. General purpose processors can be categorized into two types: CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer). Early computers were naturally RISC machines due to their limited instruction sets. However, as more features were added, more complex and sophisticated instruction sets were introduced, since it was necessary to provide an assembly language which was easier to use. Consequently, the design of the processor became more complex, owing to the more complex decoding of instructions; these designs incurred a major cost by requiring more complicated logic to implement. The x86 architecture created by Intel is a CISC-type architecture.

In the 1980s, RISC architectures became more popular in microprocessor systems, especially in embedded systems. The traditional CISC architecture uses many instructions which perform long, complex operations. In contrast, a RISC instruction executes much more quickly than a CISC instruction, allowing most computational tasks to be processed faster. Most current general purpose processor systems use both CISC and RISC architectural techniques.

Software for embedded devices is usually optimized in order to provide low storage usage, low memory and power requirements and high performance in execution. A variety of optimizations and features have been added to general purpose architectures in order to exploit data-level parallelism and to perform specific tasks more efficiently [132, 79]. Architectural enhancements include superscalar execution [143] and pipelining [78]. Techniques for compiling software in order to reduce power consumption have been explored by Mehta et al. [119].

MIPS Technologies released the first commercial RISC design, the 32-bit R2000; the first 64-bit RISC processor was the R4000. Various other RISC designs include the IBM POWER, Sun SPARC, AT&T CRISP, AMD 29000, Motorola 88000, DEC Alpha and the HP-PA.

Texas Instruments (TI) and Intel developed the first microprocessors, which were 4-bit processors used in pre-programmed embedded applications. TI introduced the TMS1802NC in 1971, which implemented a calculator on a chip. In that same year, the Intel 4004 was introduced, running in a Busicom calculator. Microprocessor advancement led to the development of 8-bit microprocessors such as the Intel 8080 and Zilog Z80. These microprocessors were eventually used in personal computers, medical-grade pacemakers and defibrillators, and automotive, industrial and consumer devices.

16-bit microprocessors such as the Western Design Center (WDC) CMOS 65816 became the core of the Apple IIgs personal computer and were later used in the Super Nintendo Entertainment System. 16-bit processors from Intel include the 80186 and the 80286. Embedded systems also used 32-bit microprocessor designs. The Motorola MC68000 was used in the Apple Lisa, Macintosh, Atari ST and Commodore Amiga devices. Later versions, such as the MC68030 and MC68040, included Memory Management Units (MMUs) integrated into the chip, complemented with a Floating Point Unit (FPU) for better mathematical performance. However, later versions of the 68K series were not successful in the personal computer market, but continued to be manufactured for embedded equipment. Later, 64-bit designs came to be used in multimedia boxes such as DVD players, entertainment systems and gaming consoles; the move to 64 bits increases register widths, register counts and memory address space while providing a larger datapath. Well-known 64-bit architectures include x86-64, MIPS, SPARC, the Power Architecture and Itanium.

Embedded systems in general are power-critical: they must meet restrictive power constraints in order to be efficient and portable for consumer purposes. This challenge is being addressed in many research studies. Tiwari et al. [154] describe a systematic approach for modelling power usage in microprocessor systems. This technique has been the basis of much of the instruction-level power modelling used in today's embedded system designs. The technique quantifies the energy cost of individual instructions and of the various inter-instruction effects. The work has benefitted the industry by providing better insights into energy usage in embedded systems, and has provided a useful tool for energy-constrained designs.
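
In symbols, the model estimates program energy as E = Σᵢ Bᵢ·Nᵢ + Σᵢⱼ Oᵢⱼ·Nᵢⱼ, where Bᵢ is the base cost of instruction i, executed Nᵢ times, and Oᵢⱼ is the circuit-state overhead of executing instruction j immediately after i. A minimal sketch over an instruction trace (the cost tables and the instruction-set size are placeholders):

    #define NUM_INSNS 4   /* size of the (placeholder) instruction set */

    /* Program energy per the instruction-level model: per-instruction
       base costs plus inter-instruction (circuit-state) overheads
       accumulated over consecutive pairs in the trace.                */
    double program_energy(const int insn[], int len,
                          const double base[NUM_INSNS],
                          const double overhead[NUM_INSNS][NUM_INSNS])
    {
        double e = 0.0;
        for (int k = 0; k < len; k++) {
            e += base[insn[k]];
            if (k > 0)
                e += overhead[insn[k - 1]][insn[k]];
        }
        return e;
    }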

2.3.2 Coprocessor Systems

The performance of a general purpose processor is limited by the instruction set architecture and microarchitecture of the CPU design. A general design usually provides flexibility in the programs that can be executed, while trading off performance. To improve performance on specific operations and tasks, an additional processor that is specially designed for those tasks can be added to the system. A coprocessor is a computer processor which is used to supplement the functions of the main processor in the system. Coprocessor functions range from floating-point arithmetic to signal processing, string processing and encryption.

System performance can be accelerated with the addition of coprocessors, since processor-intensive tasks can be offloaded from the main processor. Coprocessors are an optional feature of a system: users who do not need the features of a coprocessor need not include it.

Coprocessors are specialized processors and are available both as separately packaged microprocessor chips and as softcore intellectual property (IP) blocks, which can be integrated together with the base processor. There are two major categories of coprocessors: 1) loosely coupled and 2) tightly coupled. Coprocessor systems are commonly loosely coupled and share the same communication bus as the main processor; it is desirable for the coprocessor to have minimal coupling with main memory and other processors. Instructions are received from the main processor. Coprocessors may have microcoded instructions as well as sophisticated memory access modes, and a control unit within the coprocessor executes its local instruction stream. Like the main processor, the coprocessor may have either a RISC or a CISC architecture. The characteristics of coprocessor systems are reviewed in Chu [45].

A tightly coupled coprocessor system integrates directly with internal components of the main processing core and is highly dependent on the scheduling and committed instructions of the main processor; it may have access to internal registers and components which are not available to loosely coupled coprocessors.
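
As an illustration of the loosely coupled case, the main processor typically drives the coprocessor through memory-mapped registers on the shared bus. All addresses and register offsets below are invented for the sketch:

    #include <stdint.h>

    /* Hypothetical memory-mapped register block of a loosely coupled
       coprocessor on the shared bus (addresses are invented).        */
    #define COP_BASE   0x40000000u
    #define COP_ARG0   (*(volatile uint32_t *)(COP_BASE + 0x00))
    #define COP_ARG1   (*(volatile uint32_t *)(COP_BASE + 0x04))
    #define COP_CTRL   (*(volatile uint32_t *)(COP_BASE + 0x08))
    #define COP_STATUS (*(volatile uint32_t *)(COP_BASE + 0x0C))
    #define COP_RESULT (*(volatile uint32_t *)(COP_BASE + 0x10))

    uint32_t coprocessor_op(uint32_t a, uint32_t b)
    {
        COP_ARG0 = a;                  /* operands travel over the bus */
        COP_ARG1 = b;
        COP_CTRL = 1;                  /* start the offloaded operation */
        while ((COP_STATUS & 1) == 0)  /* poll until done; an interrupt */
            ;                          /* could be used instead         */
        return COP_RESULT;
    }

The bus transfers on either side of the operation are precisely the coupling overhead that the tightly coupled alternative avoids by sitting inside the processor core.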

Coprocessor organization is also associated with parallel algorithms, as suggested in Chu [45]. This paper describes a lexical/parsing coprocessor which can deliver tokens from lexical processing. The compiler can be partitioned into pipeline stages, with the interpreting stage performed in the coprocessor and the rest of the compilation pipeline executed in the main processor. The use of a pipeline configuration speeds up compilation. Furthermore, the need to write the front-end of a compiler or an interpreter can be eliminated; the design and implementation effort of a compiler or an interpreter may thus be reduced, so that programmers can consider writing and customizing compilers on their own. Figure 2.5 illustrates the architecture of such a coprocessor in a typical microprocessor system.

Figure 2.5: Architecture of a lexical/parsing coprocessor, sourced from Chu [45].

Coprocessors may also assist in scheduling analysis for shared fieldbuses. In real-time applications such as process control, bus messages have to be scheduled to meet timeliness-constraint guarantees. Static scheduling is not feasible when flexibility is required for the addition or removal of components and for configuration changes on the bus. Dynamic scheduling imposes a high run-time overhead and is best offloaded to coprocessors. A traffic scheduling and schedulability analyzer coprocessor targeted at centralized-scheduling fieldbus systems was introduced in Martins et al. [117]. The coprocessor supports rate-monotonic, deadline-monotonic and priority-based scheduling policies and was implemented on a XC4010XL 12 MHz FPGA.

For modern communications networks (e.g., clusters of workstations connected via ATM networks), communications coprocessors (CCPs) are used to support fast communication [137]. Communications coprocessors offload from the main processor the computational workload of providing the protection, reliability and protocol handling needed for communications. Coprocessors can also be exploited to perform user-level message handling; examples of communication message types include Active Messages, RPC (Remote Procedure Call) and tagged send-and-receive.

Embedded devices such as mobile phones run portable applications which are written as Java applications. On the desktop, high-end processors provide the computational power needed to run a Java Virtual Machine (JVM), which executes the Java application (distributed as Java bytecode). However, embedded devices do not have the luxury of powerful general purpose processors, in order to minimize the power consumption and footprint of the microprocessor in the device; each Java instruction requires several base processor instructions, and, additionally, the JVM competes with other applications for cache usage. Säntti et al. [136] presented an advanced Java coprocessor to offload the execution of Java bytecode from the base processor. The coprocessor works in asynchronous mode, thus reducing energy consumption in the system. Additionally, instruction folding and stack bypassing are used to enhance the execution of Java bytecode.

A coprocessor can also be used to speed up the execution of code written in different programming models, such as the object-oriented programming model. Donzellini et al. [54] presented the design of an Object Coprocessor (OCP) that cooperates with a RISC-architecture processor (the ARM7). The coprocessor implements in hardware the low-level processing and control steps needed by the object-oriented model. Object-oriented concepts which are supported include encapsulation, inheritance, polymorphism and virtual methods. The OCP is able to take control of the execution flow when necessary. The coprocessor is equipped with special instructions for typical operations of object-oriented languages: the management of methods, and method calls and returns, in static and virtual modes. When needed, the OCP disconnects from the bus and controls both the core and the external memory system. When no coprocessor instruction is fetched, the coprocessor stays idle, while the main core (i.e., the ARM7) remains connected to the external bus and controls the whole system.

2.3.3 Digital Signal Processors

Digital Signal Processors (DSPs) are very important components in many consumer, communications, medical and industrial products. DSPs are often more cost-effective than customized hardware, especially in low-volume products; compared with general purpose microprocessors, they provide added advantages in speed, cost and energy efficiency. Eyre et al. [58] provide a good overview and describe the key architectural features of DSPs.

The design and architecture of DSP processors have been influenced by the computational requirements of DSP algorithms, and most features of a DSP processor can be traced back to the computation methods used in those algorithms. For example, finite impulse response (FIR) filters require fast multipliers, as the filter is mathematically expressed as the convolution sum Σ(x ∗ h). FIR filters have very high computational requirements and so need several independent execution units that are capable of operating in parallel. Consequently, a multiply-accumulate (MAC) unit, an arithmetic-logic unit (ALU) and a shifter are usually included in DSP processors. Good DSP performance also requires high memory bandwidth compared to general purpose processors, due to the need to fetch instructions, data samples and filter coefficients from memory in a single cycle.
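
Written out, the filter computes y[n] = Σₖ h[k]·x[n−k], one multiply-accumulate per tap, which is exactly the operation the MAC unit implements. A plain C reference version (for illustration; a real DSP kernel would use the processor's MAC and addressing hardware):

    /* N-tap FIR filter output at sample index n: one multiply-
       accumulate per coefficient. On a DSP, each iteration maps
       onto the MAC unit, with the sample and the coefficient
       fetched from memory in the same cycle.                    */
    double fir(const double h[], const double x[], int n, int taps)
    {
        double acc = 0.0;
        for (int k = 0; k < taps && k <= n; k++)
            acc += h[k] * x[n - k];   /* multiply-accumulate */
        return acc;
    }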

Most DSP processors also use a fixed-point numeric data type instead of the floating-point format most commonly used by general purpose processors in scientific applications. A DSP processor will also typically support zero-overhead looping, in which program loops can be executed without expending any clock cycles on updating and testing the loop counter or on branching back to the top of the loop. DSP processors might implement specialized serial/parallel I/O interfaces and streamlined I/O handling mechanisms, such as low-overhead interrupts and direct memory access (DMA), so as to allow data transfers without affecting the performance of the main processing unit. A DSP processor may also implement instruction sets specialized for the intended algorithm.
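
For example, in the common Q15 fixed-point format (one convention among several; formats vary by processor), a value in [−1, 1) is stored in a 16-bit integer as value × 32768, and a multiply needs a rescaling shift:

    #include <stdint.h>

    /* Q15 fixed point: value = raw / 32768. A 16x16-bit multiply
       yields a Q30 product in 32 bits; shifting right by 15
       restores the Q15 scaling.                                  */
    static inline int16_t q15_mul(int16_t a, int16_t b)
    {
        return (int16_t)(((int32_t)a * b) >> 15);
    }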

There are different ranges of DSP processors [58]. Conventional DSP processors typically provide few execution units, often only a single multiplier/MAC unit and an ALU; these processors issue and execute one instruction per clock cycle. Examples include the Motorola DSP560xx, Texas Instruments TMS320C2xx and Analog Devices ADSP-21xx. Conventional DSP processors run at low clock speeds while maintaining very modest power consumption and memory usage. Higher-end DSP processors include deeper pipelines, barrel shifters, instruction caches and higher clock speeds.

Enhanced-conventional DSP processors provide even greater clock speeds and better hardware improvements. Enhancements include parallel execution units, extended instruction sets and wider data buses. These DSP processors also use wider instruction words to encode additional parallel operations within a single instruction. Both conventional and enhanced-conventional DSP processors are hard to program, owing to their complex assembly languages and the difficulty of compiling programs for the architecture.

Multi-issue architectures for DSP processors are closely related to modern general purpose processor architectures. Multi-issue DSP processors use very simple instructions that typically encode a single operation; a high level of parallelism is achieved by issuing and executing instructions in parallel groups. The two methods of parallel instruction execution are VLIW (Very Long Instruction Word) and superscalar. VLIW and superscalar DSP processors target applications that have demanding computational requirements combined with constraints on cost or energy usage. Figure 2.6 presents sample DSP benchmark results for a selection of DSP processors, compared with a general purpose processor, the Intel Pentium III.

Figure 2.6: Cycle counts for BDTI block FIR filter benchmark, sourced from Eyre et al. [58]:

    ADI ADSP-2106x        812
    ADI ADSP-2116x        573
    LSI Logic LSI400      607
    Lucent DSP16xx       1264
    Lucent DSP16xxx       757
    Motorola DSP563xx     943
    StarCore SC140        183
    TI TMS320C54x         730
    TI TMS320C62xx        347
    Hitachi SH-DSP        904
    Intel Pentium III    1498

The design of signal processing systems is very complex, especially in the embedded system domain, and the challenge grows with increasing functionality and changing standards. There are also high performance constraints and the need for low power dissipation: for better portability and reliability at lower production cost, it is important to produce a system with low power dissipation.

As with general purpose processors, similar research in energy/power modeling has been performed. Lee et al. [105] proposed an instruction-level energy modeling methodology for DSP systems. The study used a dual-slope integrating ammeter in series with the power supply of a microprocessor so as to obtain visual current readings proportional to the power consumed by the processor. However, this method is limited, since the program under measurement must be much shorter than the sampling period of the meter in order to obtain a stable reading.

In contrast to the work by Tiwari et al. [154] and Lee et al. [105], Gebotys et al. [63] derived software power prediction models using statistical optimization, and verified these models against actual power measurements. The use of a more sophisticated meter, specifically the Fluke 867B GMM, allowed the power consumption of an entire program to be determined empirically. The general approach of this statistical methodology can be applied to any processor architecture.

2.3.4 Multiprocessor Systems

Multiprocessor systems are computer systems composed of several independent processors. The motivation for the use of multiple processors is mainly performance, since device technology limits the speed of any single processor. In order to overcome such limitations, multiple processors are used. As a consequence, the workload has to be arranged and partitioned into parallel tasks to obtain the expected performance improvements. Establishing such tasks is not a trivial matter, and represents a major research effort in the multiprocessing field.

Parallelism in execution has been discussed in the previous chapter. In single processor systems, techniques such as superscalar execution, pipelining and SIMD are ways of increasing the data throughput of a system. Pipelined machines produce high performance by executing several stages of instruction execution simultaneously. Superscalar devices dispatch instructions to readily available processing resources to minimize idle time. SIMD instructions operate on vector data or arrays of data. In each of these cases, only a single program is used to operate on multiple data items or on arrays of data.

SIMD instructions are able to achieve performance improvements only if data is arranged and accessed to suit the needs of the architecture. Conventional algorithms have to be rewritten in order to make use of such an approach, and in some cases overhead may be incurred due to the extra work needed to align data in memory. Some operations cannot easily be organized into repetitive operations on uniformly structured data, and tend to be unstructured and unpredictable.
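
A common consequence of this rearrangement (illustrated below with hypothetical structures) is rewriting the data layout from an array of structures to a structure of arrays, so that each SIMD load fetches uniformly typed, contiguous elements:

    #define N 1024

    /* Array of structures: x and y are interleaved in memory, so a
       vector load of consecutive x values is not possible.          */
    struct point_aos { float x, y; };
    struct point_aos points_aos[N];

    /* Structure of arrays: all x values are contiguous, matching
       what SIMD loads expect; the loop below is vectorizable.       */
    struct {
        float x[N];
        float y[N];
    } points_soa;

    void scale_x(float s)
    {
        for (int i = 0; i < N; i++)
            points_soa.x[i] *= s;
    }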

Thus, multiprocessor architectures offer viable solutions to the above problems. Such an architecture is referred to as a multiple-instruction stream, multiple-data stream (MIMD) architecture. A multiprocessor is composed of independent processors interconnected in ways that allow programs to exchange data and synchronize activities. Each individual processor in the system may or may not share memory and input/output units with other processors.

A multiprocessor can only operate at peak performance when all processors are engaged in useful work and no processor is idle [147]. If no extra instructions (i.e., overheads) are executed in the multiprocessor environment and all workloads take the same amount of time on each processor, then an N-processor system would effectively contribute a performance improvement by a factor of N. However, such a state is rarely achieved, due to delays in interprocessor communication and the overhead of synchronization between processors. Efficiency is lost when one or more processors run out of tasks to do; if two or more processors execute the same instructions, effort is wasted; and, finally, there are processing costs for controlling the system and scheduling operations. Exploring the design of multiprocessor systems has been a major research topic and is addressed in the next few sections.
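
The erosion of the ideal factor-of-N can be made concrete with a simple model (an illustration only; the overhead term lumps together communication and synchronization costs):

    /* Achieved speedup of an N-processor system under a simple model:
       the parallel run time is the ideal t1/n plus a lumped per-run
       communication/synchronization overhead t_ovh (same time units
       as t1, the single-processor run time).                         */
    double speedup(double t1, int n, double t_ovh)
    {
        double t_parallel = t1 / (double)n + t_ovh;
        return t1 / t_parallel;
    }

For example, with t1 = 100, n = 4 and t_ovh = 5, the model gives 100/30 ≈ 3.3 rather than the ideal 4.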

Multiprocessor systems can be either single-ISA systems or multi-ISA systems. Single-ISA heterogeneous systems allow any application stage to be assigned and mapped to any core in the system with little reconfiguration and modification [37]. Multi-ISA systems consist of entirely different processors, ranging from DSP to CPU implementations [125, 148, 134]. Systems providing a platform on which multiple processing entities perform computation on different parts of the system concurrently have been implemented [125, 100].

MPSoCs (Multiprocessor System-on-Chips) can be categorized into homogeneous and heterogeneous systems. Homogeneous systems consist of identical processors, as in Symmetric Multiprocessing (SMP) systems, and are single-ISA systems. Heterogeneous processor systems use differing processing entities to maximize performance while minimizing area and power consumption. Such systems might consist of a network of ASIPs, DSPs, coprocessors and ASIC components fabricated on the same silicon die; a heterogeneous system may be either single-ISA or multi-ISA. Each component is mapped and assigned to specific functions, thus executing multithreaded applications. Usually such systems exhibit coarse-grained parallelism.

Existing approaches to heterogeneous processor architectures typically map critical regions of software into hardware (e.g., DSPs, ASICs). Each hardware component is optimized and suited to its particular mapped region to maximize performance. To increase the efficiency and performance of critical systems, Application Specific Instruction-set Processors (ASIPs) [1, 4, 15, 20] have been introduced into such processor architectures. An ASIP's instruction set and its underlying architecture can be configured to a specific application in order to improve efficiency. ASIPs provide a good trade-off between efficiency and flexibility, as the same design can be reused between different product variants and updated at little additional cost.

Various heterogeneous multiprocessor systems have been implemented, primarily in the domains of automotive real-time systems [31] and video or image encoding. Strik et al. [148] explored the use of a heterogeneous system in a real-time video and graphics stream-management system, while Zhang et al. [167] applied an adaptive job assignment scheme to perform data partitioning for a multiprocessor implementation of MPEG2 video encoding. A high degree of computational power and parallelism is inherent in heterogeneous multiprocessor systems, motivating the development of HDTV systems based upon multiprocessors [35].

Gopalakrishnan et al. [69] used heterogeneous systems in a different manner. Their work generalizes the approach initiated by Baruah [34], which replicates recurring tasks on multiple processing units to ensure a degree of fault tolerance: maintaining replicas of a task on different processors ensures that single-processor failures are tolerated well.

2.4 Customization of Architectures

The ever-increasing complexity of applications and the demand for more performance have resulted in the search for techniques to increase computational resources while taking into account the restrictions of embedded systems: energy consumption and chip size. There are many paths to accelerating applications for embedded systems. As discussed in previous sections, general purpose processor solutions can be enhanced with parallel processing mechanisms to achieve higher data throughput. However, relying only on those enhancements would not create efficient architectures, due to the generality of such an approach.

The microprocessor design gap has been widening. The National Technology Roadmap for Semiconductors noted that the rate of increase in the number of transistors that can be put on a die has been much greater than the growth rate of the number of transistors that can be designed into new interdependent circuits. Microprocessor designers are now "wasting" transistors that could be used for greater efficiency and performance. With the abundance of transistors now available, designers can exploit modularity in designs to customize embedded systems. However, in order to develop a modular product architecture with standardized interfaces among subsystems, it is necessary to waste some of the functionality that is theoretically possible: modular design compromises performance and bleeding-edge technology in favour of greater efficiency and reliability.

Customizing processor architectures to the target application is now the major trend in embedded systems. The most popular strategy is to build a system that consists of a number of special-purpose ASICs coupled with a low-cost core processor. Such SoCs include specially designed hardware accelerators coupled with a general purpose processor. However, ASICs are costly to design and offer only a hardwired solution that does not allow reprogramming, thus precluding changes that might be needed in the future.

The current trend is directed more at reliability, customization, convenience and power efficiency, by augmenting the core processor with special-purpose hardware to increase its computational capabilities in a cost-effective manner (i.e., ASIPs). The central challenge of this approach is the large degree of human effort required to identify and create custom hardware units, as well as that needed to port the application to the extended processor.

2.4.1 Field Programmable Gate Array

The FPGA (Field Programmable Gate Array) was invented in 1984 by Ross Freeman [24], the co-founder of Xilinx. Although the FPGA is similar to CPLDs (Complex Programmable Logic Devices), it is a newer form of programmable logic. CPLDs and FPGAs belong to a relatively large family of programmable logic elements.

CPLDs are relatively small in size and density compared with FPGAs. These programmable logic devices have densities equivalent to several thousand logic gates, whereas FPGAs have densities ranging from thousands to several million logic gates. CPLDs are less flexible than FPGAs, owing to a lower number of sum-of-products logic arrays feeding a relatively small number of clocked registers; this results in more predictable timing delays and a higher logic-to-interconnect ratio. FPGAs provide high-level embedded functions such as adders, multipliers and MAC units, as well as embedded memories, to provide commonly used features in an efficient way. FPGA architectures are predominantly filled with interconnects, which provide far greater flexibility in designing different types of implementations; the disadvantage of such an architecture is the added complexity of design.

Much current research focusses on the full or partial in-system reconfiguration of FPGAs, allowing designs to be changed "on the fly" [129]. This feature is useful when a faulty design needs to be upgraded without the complete redesign and fabrication of a new chip. Dynamic reconfiguration is also useful when package size and power are critical: an FPGA can be reconfigured with features as needed. Xilinx S-FPGAs (SRAM-based FPGAs) were used in the Mars Exploration Rover (MER) landers to control critical pyrotechnics during the landing sequence; they are also used to control the rovers' wheel motors [151].

Vranesic [159] surveyed the key issues in the development of FPGAs and examined the design factors which determine their performance and effectiveness. Logic blocks play a critical role in the functionality of the FPGA; two major types are used, based either on lookup tables (LUTs) or on multiplexers. The logic blocks, in addition to the combinational circuitry, allow combinational or sequential elements to be formed. Blocks with a high number of inputs provide high functionality.

Interconnection resources provide the framework necessary to implement a large circuit on an FPGA, even when plenty of logic resources exist. The flexibility of the device increases with the number of interconnection resources; however, the introduction of multiple routing switches dramatically increases the propagation delay of a circuit path.

Reprogramming an FPGA involves the serial transfer of programming bits, so it is vital to minimize the number of bits transferred. Universal Logic Modules (ULMs) can be used in place of LUTs [110, 153]. ULMs are components that can be programmed to implement one equivalence class of functions at any one time; the number of programming bits required to select an equivalence class is significantly smaller than the number needed to configure a LUT.
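
As a worked example of the saving (using NPN equivalence, i.e., equivalence under input negation, input permutation and output negation, which is one common choice; the cited ULM work may partition functions differently): a 4-input LUT requires 2^4 = 16 truth-table bits, whereas the 2^16 = 65,536 possible 4-input functions fall into only 222 NPN classes, so on the order of ⌈log2 222⌉ = 8 bits suffice to select a class.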

The size of an FPGA can be dramatically reduced if fewer SRAM cells are used to store the programming bits. If a data item consists of a number of bits that are handled in the same way, then it is possible to optimize both the routing and the logic block resources to reduce the number of control bits needed. Sharing routing control bits minimizes the number of SRAM cells used, and logic blocks which implement a common multiplexer function can also be shared.

Long routes in the interconnect increase the latency of circuit signals from one point to another, and much research has been done to improve routability in FPGAs. Li et al. [109] presented a dynamic FPGA architecture which combines field programmable gate arrays with dynamic field programmable interconnect devices. This architecture efficiently exploits the potential communication bandwidth of interconnect resources by dynamically reconfiguring the interconnect networks; interconnect resources and existing pins can thus be reused. This greatly increases the routability of interconnect networks and the overall performance of FPGA systems.

Recent trends have been to combine the architectural approaches of FPGAs (i.e., logic blocks and interconnect resources with related peripherals) with modern SoC designs. Processor cores available in softcore format have been designed to suit the architecture of FPGAs. These cores are configurable: cache structure, peripherals and interfaces can be customized to the application. Examples include the MicroBlaze [13], PicoBlaze, Nios [1], Nios II and LatticeMico32.

Extending the notion of a customized microprocessor, Wittig et al. [164] describe the OneChip processor architecture, which combines a fixed-logic processor core with reconfigurable logic resources. The programmable components in the architecture can be used to enhance the performance of speed-critical applications. The work also eliminates the limitations of custom processor designs by tightly integrating the reconfigurable resources into a MIPS-like processor architecture.

2.4.2 Application Specific Integrated Circuits

Application Specific Integrated Circuits (ASICs) are circuits customized for a specific application and are not reprogrammable like general purpose processors. ASICs provide a performance advantage over their general purpose counterparts, while occupying a smaller packaging footprint and using less power. This is because ASICs are “hardwired” to perform a specific function and task, and so do not incur the overhead of fetching and interpreting stored instructions.

ASIC designers use a hardware description language (HDL), such as VHDL or Verilog, to describe the functionality of ASICs. Designers can use logic synthesis tools, such as the Design Compiler, to compile HDL descriptions into a gate-level netlist.

An IP core is a block of “intellectual property” designed and packaged so that it can be used as a sub-component of a larger ASIC design. IP cores can be provided either as HDL descriptions (soft cores) or as fully routed designs that can be printed directly onto an ASIC’s mask (hard cores). Intellectual property takes considerable time and investment to design and create; frequent reuse of IP cores therefore dramatically decreases the design turn-around time for new products.

HDL descriptions are vital in the design flow of ASICs and other chips, and thus represent an integral part of any CAD framework. Formal specifications, top-down design, and design-for-test have become standard practices for IC designers. HDLs, especially VHDL, can be used to explore implementation trade-offs [50]. ASIC designers can quickly assess the advantages and disadvantages of different ASIC technologies by selecting different standard cell libraries, and are able to analyze potential solutions with respect to complexity, speed and testability.

The decline of custom ASICs in embedded systems has been caused by the popularity of processors that are more flexible and quicker to design. ASIC designs are still used in products with low design turn-around times, and in those sensitive to power requirements.

Examples of ASIC designs include DCT/FFT modules, JPEG encoders, YUV/RGB colour converters and mobile phone transceivers that require low power and fast processing of information. Pok et al. [126] designed a monobit receiver ASIC, a wide-band (1 GHz) digital receiver for electronic warfare applications; the receiver block can be fabricated onto a single multi-chip module (MCM). In Shee [138], a JPEG streaming encoder core was developed that can be synthesized into an ASIC design; it was a pipelined design in which each pipeline stage corresponded to a different stage of the JPEG encoding algorithm, hard-wired as custom circuits.

There is on-going research to minimize power in ASIC designs. Various techniques are being used, such as voltage scaling, variable clock frequencies, asynchronous designs and miniaturization technologies. Taylor et al. [152] used a different type of ASIC architecture, known as the structured ASIC. Structured ASICs are similar to FPGAs in that both are conceived as two-dimensional arrays of programmable logic units that can be selectively connected using programmable switch boxes at possible junction points. However, structured ASICs are not field-programmable and are hard-wired in the later stages of the manufacturing process. The work [152] introduced an energy-aware structured ASIC architecture that used gate sizing and selective voltage scaling for power optimization.

2.4.3 Extensible Processor Architectures

Current system design methods tend towards codesign of mixed hardware/software systems targeting Systems-on-Chip (SoCs). Extensible processor systems provide designers with flexibility for future software modifications, modularity in optional and configurable components, reliability in verified computational modules, and power efficiency through full customization of the instruction set towards the target application. Extensible processor systems today provide designers with verified architectural designs as well as software design tools which are retargetable to different configurations of the extensible design.

Application Specific Instruction-set Processors

Recently a new entrant called the Application Specific Instruction Set Processor (ASIP) has taken center stage as an alternative contender for implementing functionality in embedded systems. ASIPs are processors with specialized instructions, selected co-processors, and parameterized caches applicable only to a particular program or class of programs. An ASIP will execute the application for which it was designed with great efficiency, while remaining capable of executing any other program (usually with greatly reduced efficiency). ASIPs are programmable, quick to design and consume less power than GPPs (though more than ASICs). They are particularly suited to embedded systems, where customization allows increased performance yet reduces power consumption by omitting unnecessary functional units. Programmability allows upgrades and reduces software design time. Tools such as ASIPmeister [85], Tensilica [20], ARCtangent [4], Jazz [10], Nios [1] and SP5-flex [15] allow rapid creation of ASIPs.

The Jazz DSP Processor [10], by ImprovSys, permits the modelling and simulation of a system consisting of multiple processors, memories and peripherals. Data width, the number of registers and the depth of the hardware task queue can be configured, and additional custom functionality added. It has a base ISA that supports the addition of extensible instructions to further optimize the core for specific applications. The Jazz DSP Processor has a 2–stage instruction pipeline, single cycle execution units, and supports interrupts with different priority levels. Users are able to select between 16–bit and 32–bit data paths. It also has a broad selection of optional 16–bit or 32–bit DSP execution units that are fully tested and ready to be included in the design. However, Jazz is suitable only for VLIW and DSP architectures. The Jazz DSP System can be configured to handle memories, and on-chip or off-chip bus interfaces clocked at either the same speed as the processor or twice that [22].

The ARC configurable processor [3] also provides a configurable DSP architecture. The core has the flexibility to add instructions, registers, flags and condition codes to create a processor that is highly optimized for a specific application. The ARCompact ISA provides 16- and 32-bit instructions for high code density.

The SP-5flex [15] is a 32-bit configurable DSP processor with a highly superscalar SIMD architecture. The SP-5flex is a fully synthesizable core based on the SP-5 [15] core architecture. Configurable elements include the execution unit structure, maximum multiplier size, number of shift units, number of stack pointers, general register structure and application specific instructions.

Finally, the Xtensa [20] is a configurable and scalable RISC core. It allows 24–bit and 16–bit instructions to be freely mixed at fine granularity. The base processor supports the 80 base instructions of the Xtensa ISA, with a 5–stage pipeline. New functional units and extensible instructions can be added using the Tensilica Instruction Extension (TIE) language. Synthesizable code can be obtained, together with the software tool chain, for the various architectures implemented with Xtensa.

Figure 2.7 shows that the Xtensa LX architecture, which is used in Chapters 6 and 7, consists of various standard and configurable building blocks which designers can use to customize the processor for a particular application.

Figure 2.7: The Xtensa LX Architecture: Designed for Configurability and Extensibility, sourced from Tensilica, Inc. [20].

Design tools and framework

ASIPs, such as the Tensilica Xtensa LX processor [20], consist of a basic RISC core with configurable and optional peripherals such as instruction and data caches, buses, computation modules, VLIW instructions, coprocessors and external communication interfaces. The basic core of an ASIP features a compact instruction set optimized for embedded designs. Additionally, ASIP architecture vendors provide a plethora of design tools to customize and optimize an ASIP for the desired application. Tensilica [20] provides profiling and compilation tools for any architecture generated by the processor architecture generator. TIE (Tensilica Instruction Extension) instructions allow developers to extend the processor architecture for specific applications by creating new datapath functions in the form of added instructions and registers. Possible instructions include VLIW, fusion and customized instructions. The Tensilica XPRES (Xtensa PRocessor Extension Synthesis) Compiler is a TIE-language synthesis tool that generates tailored Xtensa processors from the C/C++ code of the target application. The tool enables the rapid development of optimized SoC hardware blocks and associated software tools.
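To make the idea of instruction fusion concrete, the sketch below shows, in plain C, the kind of operation sequence an instruction-extension tool such as XPRES might collapse into a single custom instruction. The function name and the choice of a saturating multiply-accumulate are illustrative assumptions, not an example from the Tensilica documentation; the generated TIE instruction would implement the same datapath in hardware.

    /* A saturating multiply-accumulate: three dependent operations that an
     * instruction-extension tool might fuse into one custom instruction. */
    static inline int mac_sat(int acc, short a, short b)
    {
        long long t = (long long)acc + (long long)a * (long long)b;
        if (t >  2147483647LL)  t =  2147483647LL;   /* saturate high */
        if (t < -2147483648LL)  t = -2147483648LL;   /* saturate low  */
        return (int)t;
    }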

Early research into ASIPs focused on instruction set customizations to satisfy the constraints on embedded system designs. In Van Praet et al. [157], ASIPs were originally defined as mask-programmable processors that have architectures and instruction sets optimized for a specific application domain. Techniques have since been developed to define optimized microinstruction sets for the processors; methods for instruction selection when generating code for a predefined ASIP have also been addressed. The term ASIP was later applied to configurable processor systems with a basic processing core. Huang et al. [81] describe instruction set synthesis for an application on a parameterized, pipelined microarchitecture, synthesizing instruction sets from application benchmarks given a machine model, an objective function and a set of design constraints. The semantics of the instructions are modeled using binary tuples, and an instruction-formation process is integrated into the scheduling process. Schedules are solved using simulated annealing. Complex instructions which cannot be accommodated within the clock constraint have to be designed as multi-cycle instructions.
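As a reminder of how such an annealing-based scheduler operates, the following is a generic simulated-annealing skeleton in C; cost(), perturb(), the schedule representation and the cooling schedule are placeholder assumptions, not the formulation used in [81].

    #include <stdlib.h>
    #include <math.h>

    #define SCHED_LEN 256

    extern double cost(const int *sched);      /* objective, e.g. cycle count */
    extern void   perturb(int *sched);         /* random local move           */

    void anneal(int *sched, int n_moves)
    {
        double t = 1.0;                        /* initial temperature */
        double c = cost(sched);
        int trial[SCHED_LEN];

        for (int i = 0; i < n_moves; i++, t *= 0.999) {
            for (int j = 0; j < SCHED_LEN; j++)
                trial[j] = sched[j];
            perturb(trial);
            double c2 = cost(trial);
            /* always accept improvements; accept worse moves with
             * Boltzmann probability exp((c - c2) / t) */
            if (c2 < c || exp((c - c2) / t) > (double)rand() / RAND_MAX) {
                for (int j = 0; j < SCHED_LEN; j++)
                    sched[j] = trial[j];
                c = c2;
            }
        }
    }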

To overcome the limitations of previous approaches that generated only single-cycle instructions, Choi et al. [44] proposed a methodology to generate multi-cycle as well as single-cycle instructions for DSP applications. The approach was effective in making the instructions meet the given constraints without attaching special hardware accelerators, and was able not only to increase application performance but also to reduce application code size.

The work in [99] describes how an existing processor instruction set and architecture can be customized without designing and creating a new processor. This is related to the platform-based approach to architectural exploration, a concept still used in current ASIP processor systems today. There is a need to broaden the architectural space being explored; the issues which need to be addressed for ASIP design exploration are discussed in [88].

In respect of platform-based design, a platform is defined as a family of architectures satisfying a set of constraints to allow the reuse of hardware and software components. Existing hardware can be reused in order to cope with the increasing complexity of embedded systems and stringent time-to-market requirements. A platform application programming interface (API) is also needed to extend the platform to application software. In general, a platform is an abstraction that covers several possible lower-level refinements. Every platform gives a perspective from which to map higher abstraction layers into the platform, and one from which to define the class of lower-level abstractions that the platform implies [135].

The ability of compilers to retarget to various architectures has been an issue. Leupers et al. [106, 107] have provided an instruction-set modelling technique for ASIP code generation. The technique covers a broad range of instruction formats and includes detailed views of inter-instruction restrictions. The model captures peculiarities in the instruction set, such as side effects and residual control. The model can also be generated from HDL processor descriptions, and can thus be included in retargetable compilers developed for different architectures and their instruction formats.

One branch of design space exploration for ASIPs targets rapid architectural exploration with added extensible instructions. This has been explored in [41, 43, 149] through the use of the Xtensa processor from Tensilica [20]. The advent of tools to create Application Specific Instruction Set Processors has greatly enhanced the ability to reduce design turn-around time. Despite several efforts to address this [41, 46, 43, 121, 149], customization still remains an art rather than a well understood science; moreover, customization of an ASIP can take a significant amount of time.

Atomic operations within an extensible instruction can be duplicated [46] via various cut enumeration and mapping techniques, thus potentially achieving higher speedups. Extensible instructions are generated via a compilation method. This work was performed by extending the Nios [1] processor, and showed encouraging speedups of 2.75X on average.

Searching for the best extensible instruction is vital for shortening the design time of modern ASIPs. In [43], a matching algorithm is employed to match the traced program with a set of predefined extensible instructions that have been highly optimized while meeting performance and power constraints. [149] took another step ahead by generating the extensible instructions directly from the application code. An estimation method is used to meet the area and latency constraints, avoiding chip synthesis that would slow down the exploration process. Both of these approaches were demonstrated on the Xtensa [20] platform by extending the available instruction sets.

ASIPs provide a good exploration space for performance / power consumption / area trade-offs. New instructions can be added and removed to observe the effects on the performance, power and area trends of the architecture. Various architectural options can also be explored. Lee et al. [104] present an energy-efficient instruction-set synthesis technique that can comprehensively reduce the energy-delay product (EDP) of ASIPs by way of optimal instruction encoding. This approach optimizes instruction encoding by taking into account instruction bitwidth and the dynamic instruction count when executing the program. This technique for increasing energy efficiency exploits the flexibility of ASIPs at the instruction level.

Several firms [7, 17, 19] bundle IP blocks into both soft and hard cores. Soft cores are software-like descriptions of the IP blocks that can be synthesized into hardware designs, whereas hard cores are pre-verified hardware designs. Such cores enable designers to focus less on a new design and more on system integration. Microprocessors and other cores can be selected and integrated into SoC designs that can be manufactured as special purpose components for a specific product.

2.5 Parallelizing Architectures

Current microprocessors employ a variety of techniques to decrease execution time and increase data throughput. Apart from technology advances that improve microprocessor clock speeds and reduce memory latency, designers also exploit features of algorithms and tasks to improve performance. Parallel processing is a key concept behind many architectural advancements and modifications; it is the simultaneous execution of similar or different tasks on multiple processing entities, in order to obtain results faster. This section highlights two parallelizing concepts, instruction level parallelism and task / thread level parallelism, and describes the architectures that facilitate them.

2.5.1 Instruction Level Parallelism

Instruction level parallelism (ILP) is a measure of the extent to which atomic operations in a computer program can be performed simultaneously. Two instructions can be executed simultaneously if they are independent of each other or share the same dependency. Thus, computer programs have to be written and compiled specifically if they are to exploit the micro-architectural techniques used in processors. ILP enables compiler designers to develop techniques to overlap the execution of multiple instructions or change their order, so as to maximize parallel execution while minimizing dependency stalls. Micro-architectural techniques used to exploit ILP include pipelining, superscalar execution, out-of-order execution, register renaming, speculative execution and branch prediction.

ILP techniques have significantly increased the computational power of microprocessors. Consequently, the demand on the memory system has increased; unfortunately, memory speeds have not kept pace with microprocessor power. Poor cache performance leaves many scientific applications bound by main-memory access time, even though multiple levels of cache have been introduced to reduce latency.

Hiroyuki et al. [79] and Song et al. [144] have both highlighted the importance of loop unrolling techniques to further increase ILP, and of memory latency hiding to address the long latencies of memory systems. Ramakrishnan [132] proposed an iterative modulo scheduling scheme to apply software pipelining to innermost loops for a wide variety of algorithms and heuristics. Improved cache performance for better ILP has been studied in [39, 122]. Carr [38] combined scheduling methods and cache improvement techniques to drive a transformation called “unroll-and-jam” that improves ILP and cache performance simultaneously, as illustrated below.
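The following C fragment is a minimal sketch of unroll-and-jam on a matrix-vector product, not an example taken from [38]: the outer loop is unrolled by two and the two copies of the inner loop are fused (“jammed”), so each load of x[j] now feeds two multiply-accumulates.

    #define N 64
    static double a[N][N], x[N], y[N];

    /* before: one multiply per load of x[j] */
    void mv_plain(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i] += a[i][j] * x[j];
    }

    /* after unroll-and-jam (N assumed even): the outer loop is unrolled by
     * two and the copies fused, so each load of x[j] feeds two independent
     * multiplies, improving both ILP and cache behaviour */
    void mv_unroll_and_jam(void)
    {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < N; j++) {
                y[i]     += a[i][j]     * x[j];
                y[i + 1] += a[i + 1][j] * x[j];
            }
    }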

Zhang et al. [168] proposed the Dominant-Subsidiary (DS) architecture for exploiting instruction level parallelism. A DS program consists of two instruction substreams: a control flow substream and a computational task substream. These individual substreams are then executed in a superscalar fashion, delivering superscalar performance with less complex hardware and with better potential for fast clock rates. Other split-stream architectures include ZS-1 [142], PIPE [67] and MISC [156].

Arya et al. [29] presented an architecture for instruction level parallelism (Software Scheduled SuperScalar) based on control flow and data flow solutions; it is a hybrid of superscalar and VLIW designs and includes features that are anticipated to increase ILP. The architecture delegates the task of instruction scheduling to the compiler; only a small amount of logic is used to route the instructions. The compiler examines large sections of the program to perform static instruction scheduling and to use the machine resources optimally. The number, size and bandwidth of hardware resources such as functional units, condition bits and registers are increased, so as to execute as many independent instructions as fast as possible.

2.5.2 Task Level Parallelism

A single core microprocessor can only provide performance up to a certain level; it is limited by technology and by how much latency hiding can be exploited within the software code. It is evident that such a single-chip core cannot scale sufficiently well to facilitate the increasing workload of an embedded system.

Instruction level parallelism is not scalable, because of the inherent limits on the parallelism available. Task level parallelism (TLP) provides a higher level of concurrency, which can reduce the run time of the overall program execution. TLP includes both thread and process level parallelism. Thread level parallelism is the parallelism inherent in an application that runs multiple threads at once, benefitting overall execution; when one thread is delayed due to long memory latency, another thread is able to do useful work, as sketched below.
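A minimal pthreads sketch of thread level parallelism follows; the array and the doubling operation are illustrative assumptions. Two worker threads process independent halves of an array, so a memory stall in one thread can be overlapped with useful work in the other.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double data[N];

    /* each worker processes an independent half of the array */
    static void *worker(void *arg)
    {
        long half = (long)arg;                    /* 0 = lower, 1 = upper half */
        for (long i = half * (N / 2); i < (half + 1) * (N / 2); i++)
            data[i] = data[i] * 2.0 + 1.0;
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, (void *)0L);
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("data[0] = %f\n", data[0]);
        return 0;
    }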

Parallel threads for improving performance in applications are hard to find; speculative thread level parallelism has therefore been proposed as a source of parallelism. Marcuello et al. [116] found that thread-spawning associated with loop iterations is the most effective technique. The performance of this spawning policy is critically dependent on value prediction. The paper proposed an increment predictor, which outperformed conventional value predictors such as last-value, stride and context-based predictors.

Multithreading has also been investigated in multiprocessor designs. Jacobs et al. [87] implemented MPEG-4 and H.264 coders on a customized multiprocessor architecture. The work exploits thread-level parallelism (at the macroblock level) to share the computational workload between the processors in the multiprocessor system. This leads to significant power savings due to the lower frequency at which each processor is required to run. The work demonstrated a highly balanced multi-thread implementation for the two coders (MPEG-4 and H.264).

Further architectures have been researched to exploit a higher level of parallelism: process level parallelism. Like TLP, process level parallelism concurrently executes parallel software code, albeit on different processing entities and in separate memory spaces. Process level parallelism can be found in multiprocessor and multicore designs. One core can handle one task while another executes an independent function. In a mobile phone, audio could be handled by one processor, while video is implemented on another, the communication stack on yet another, and so on. Chip Multi-Thread (CMT) processors [56, 145] offer another approach, providing support for many simultaneous hardware threads or processes executed in various ways.

As a means of gauging the effectiveness of task level parallelism in a multiprocessor system, an operating system can be augmented with workload measurement mechanisms. Flautner et al. [60] and Lundberg [114] provided static mechanisms to collect information about tasks or processor utilization, by adding hooks or multi-thread libraries. Task level parallelism can be calculated offline from the information collected. The overheads of these activities do not affect efficiency, since the mechanisms are only for static analysis purposes.

To address the overhead that may have been incurred in Flautner et al. [60] and Lundberg [114], Hung et al. [83] presented several mechanisms to dynamically estimate the amount of task level parallelism in multiprocessor as well as single processor systems, albeit with a small overhead. The Linux operating system is modified to collect information about processor utilization and task activities, from which such parallelism can be calculated. The work used time stamp counter (TSC) hardware, which enabled the framework to provide an estimate of task level parallelism at fine granularity.

Coprocessor and multiprocessor systems are excellent examples of systems which make use of task level parallelism, and particularly of process level parallelism. The Linda machine [27] is a parallel computer that has been designed to support the Linda parallel programming environment in hardware. The system is in fact a grid of processors immersed in tuple space. Each processor has a computation processor and a Linda coprocessor (which handles Linda-specific operations such as communications within nodes). The framework is meant to support a thousand computing nodes, so that researchers and users would be able to design their own parallel applications in an integrated environment.

On the other hand, a simultaneous multithreading (SMT) processor exploits both instruction level and task level parallelism in a single processing core. In fine-grain multithreading, instructions from a different thread are issued each cycle. Coarse-grain multithreading switches between threads after a certain interval, or when the current thread stalls due to memory access latencies and the like. Simultaneous multithreading, however, is able to issue multiple instructions from multiple threads in a single cycle, and thus requires superscalar capabilities.

Eggers et al. [56] showed that simultaneous multithreading adds minimal hardware complexity to conventional dynamically scheduled superscalar processors. The work replicated superscalar resources to support simultaneous multithreading: state for the hardware contexts (registers and program counters) and per-thread mechanisms for pipeline flushing, instruction retirement, trapping, precise interrupts, and subroutine return. Per-thread (address space) identifiers were added to the branch target buffer and translation look-aside buffer. The instruction fetch unit and the processor pipeline were the only components to be redesigned to benefit from SMT multi-thread instruction issue. Simultaneous multithreading needs no special hardware to schedule instructions from the different threads onto the functional units.

The Intel Xeon was the first modern commercial implementation of a simultaneous multithreading (SMT) processor for the desktop market. The implementation, known as Hyper-Threading Technology (HTT) [98], provides a basic two-threaded SMT engine, emulating a two-core processor system. HTT makes a single physical processor appear as two logical processors. Other commercial implementations include the MIPS MT [95] and the IBM Power5 [92] architectures.

2.6 Design Space Exploration

2.6.1 Processor Generation

Designing processors is itself considered an art, and requires a well-structured methodology to produce designs that are both correct and feasible. The general feeling is that a description of some new architecture or component carries little credibility unless and until the design is implemented physically. This is because design descriptions lack sufficient precision to assess them for feasibility, correctness or performance. Agüero et al. [26] presented a plausibility-driven approach to computer architecture design, which categorized the nature of plausible constraints and defined a way by which plausibility statements can be developed to make claims about a design’s merits. The study also defined the main evolutionary properties of a plausible design and provided reasoning guidelines for plausibility-driven design methods. The axiomatic ADL S*M [163], which allows for the axiomatic and non-procedural specification of clocked architectures, was used to describe new architectures, suggesting that an architectural design should be viewed as a specification of constraints that are to be met by a system implemented as a combination of hardware and firmware.

Processor description language frameworks fall into three categories. Languages such as nML [62] and ISDL [73] capture the instruction set behaviour of the processor. MIMOLA [107], a structure-centric language, captures the net-list of the target processor. The last category covers languages that capture both the structure and the behaviour of the processor; LISA [160] and the EXPRESSION Architecture Description Language (ADL) [74] are languages of this kind. A LISA machine description provides model components for memory, resources, the instruction set, behaviour, timing and micro-architecture, from which the HDL code of the control path and the structure of the pipeline can be generated. The study in [86] proposed a micro-operation description-based synthesizable HDL generation; however, this work relies on the compiler to insert the necessary NOP instructions into the execution code. This has also formed the basis of the PEAS [85, 5] project.

With the demand for shorter design turnaround times, many commercial and research organizations have provided base processor cores, so that fewer modifications have to be made to the design to achieve particular performance requirements. This has led to the emergence of reconfigurable and extensible processors. Xtensa [20], Jazz [10], PEAS-III [85], ARCtangent [4], Nios [1] and SP5-flex [15] are examples of processor template-based approaches that build ASIPs around base processors.

PEAS-III [85, 5] is able to capture a target processor’s specification using a GUI. Estimates of the area, delay and power consumption of the target processor can be obtained in the architectural design phase. A micro-operation level simulation model and an RT level description for logic synthesis can be generated, along with the software tool chain. It provides support for several architecture types and a library of configurable components. The core produced follows the Harvard-style memory architecture. Several JPEG encoder designs were produced and evaluated within a short span in [96] using the PEAS-III approach.

While estimation provides fast exploration of the design space, it is vital that area, power and clock frequency be considered in justifying the selection of a certain architectural configuration. Recently, [121] used a synthesis-driven design exploration flow for rapid investigation of different processor configurations. The EXPRESSION ADL [74] was used to generate the design tool chains (compiler, assembler and simulator). A functional abstraction approach was used to facilitate the generation of HDL code via an HDL generator; chip area, clock frequency and power consumption could thus be determined from the result of synthesizing the HDL code. Consistency across the software tool chain and HDL code can be maintained for a wide range of pipelined architectures, because the framework tools, hardware model, compiler and simulator are all generated from the same ADL specification. The framework is able to add support for additional pipeline paths, interlocking, stalling, flushing and multi-cycle operations [120]. Modifications to the pipeline features and ISA can be made simply by changing the ADL specification. Architectural features such as VLIW and superscalar execution have been implemented on the DLX [78] architecture in this work. However, the incidence of data dependencies from previous instructions has not been addressed, other than by stalling the pipeline [93].

2.6.2 System Generator: Coprocessor Generation

Coprocessors have been used in applications to speed up computation, offloading much of the work performed by the main processor in the system. They come in many flavours (e.g. instruction based, functional based, SIMD based and vector processor based). Typical coprocessor examples are: graphic accelerators [101, 165]; numeric & floating point units (FPUs) [111]; Digital Signal Processors (DSPs) [112]; and Multiply & Accumulate (MAC) units [21].

These coprocessors have a common purpose: to use the specialized functional units predesigned into them, instead of general purpose instructions executed on the main general purpose processor.

In application specific processors, researchers have tried to improve program execution by accelerating loops via coprocessors. [66] showed that by using dedicated coprocessors, the system consumed less power, but was less flexible when the algorithm had to be changed.

Early approaches to hardware-software partitioning were demonstrated in [57, 71, 123] by the extraction of loop segments into coprocessors. However, these approaches often did not fully achieve the maximum possible performance improvements.

In [146], the authors implemented a framework to profile a program dynamically. The executed loops are detected via an onboard hardware profiler, decompiled, and synthesized onto an FPGA coprocessor in the SoC. A dynamic partitioning approach is used to extract the appropriate loops. However, the system has a limited amount of memory in which to cache the loops, and thus can only deal with much smaller regions of code. The framework executes only single-cycle loop bodies, and the number of iterations of a loop has to be determined before the loop executes.

CriticalBlue [49] provides a complete methodology with a toolset for converting functions to individual coprocessors. Being software programmable, the coprocessors generated by the system have some flexibility in accommodating changes to standards. Figure 2.8 shows the software development flow of CriticalBlue. The tool, Cascade, analyzes the application binaries to synthesize programmable coprocessors covering a range of performance, power and area trade-offs. Once the application binaries to be off-loaded are determined, the approach generates all coprocessor RTL design and verification descriptions.

Like CriticalBlue [49], the system by Gupta et al. [72], called SPARK, is a modular and extensible high-level synthesis system. SPARK uses parallelizing compiler technology to enhance instruction-level parallelism. The framework provides an integrated flow from behavioural C to a hardware description language; the hardware description is generated as VHDL code. The SPARK compiler incorporates ideas of mutual exclusivity of operations, resource sharing and a hardware cost model. The SPARK tool performs a set of transformations that include speculative code motions and dynamic transformations. Additionally, a heuristic scheduler employs optimizing synthesis and compiler techniques.

Figure 2.8: Cascade design flow for generating coprocessors, sourced from CriticalBlue [6].

The authors in [115] proposed a processor-coprocessor architecture for high-end video applications. High level synthesis was used to map algorithms in the application to the coprocessor and to minimize the number of computing units on the chip. For example, the number of ALUs was increased until the ALUs were no longer the bottleneck. However, the work did not address how the overall program would actually be executed on the system.

In contrast to the above approaches that use dedicated buses, [77, 164, 133] explored closely-coupled components, with the host CPU using FPGAs as the reconfigurable components. Extended instructions could be created and performed by these blocks, which share the register files and pipeline registers of the host CPU. However, performance is limited by the higher latency of FPGAs relative to the ASIC host processor. In the same category of tightly coupled components, our work explores a new architecture for accelerating loops via a coprocessor component operating in tandem with the host CPU. The coprocessor runs at the same clock rate as the host CPU, since it too is implemented as an ASIC.

As far as JPEG and image processing are concerned, accelerations have been achieved through the use of hardware IP cores. These specialized chips [11, 84, 138] are ASIC implementations of their software counterparts [8, 102]. The hardware encoders are fast and efficient, but highly inflexible if major changes are made to the system. A slightly more flexible approach would be processors that use loop accelerators [101, 112, 165] for image processing purposes. Such approaches require existing programs to be modified extensively to work with coprocessors.

Higher performance and memory access requirements demand a fundamentally more extensive architectural template for the HLS output. Huang et al. [80] proposed an architectural template for the HLS output that consists of a controller-datapath associated with a memory subsystem called computation-unit integrated memory (CIM). Instead of using the system bus to access memory, the CIM offers higher memory bandwidth to computation units located within it, and reduces the overall communication between the memory subsystem and the controller-datapath. The CIM thus provides a template which is highly suitable for deriving efficient implementations of memory-intensive applications. The paper also presented a synthesis framework to automate the design of the CIM-based architecture.

2.6.3 Multiprocessor / Heterogeneity

A multi-core system requires a number of communication schemes to provide the necessary links between the cores in the system. Kim et al. [94] developed a new CDMA-based on-chip interconnection network, using a Star NoC (Network-on-Chip) topology. To enable the rapid design of a multi-core processor system and the evaluation of its interconnect system, Wieferink et al. developed a methodology for retargetable MPSoC integration [162] at the system level, based on LISA processor models [160] and the SystemC framework [16].

Single-core applications use instruction level parallelism, enabled by pipelined processors. Hardware/software implementations can be further enhanced with pipeline scheduling techniques [40]. Extending this scheme, multiprocessors are able to exploit task level parallelism by executing different tasks on separate cores simultaneously.

Jeon et al. [89] partitioned loops into several pipeline stages; their iterative algorithm increased parallelism and reduced hardware cost. Kodaka et al. [97] combined both coarse- and fine-grained parallelism (including loop pipelining) on a single OSCAR chip multiprocessor, which exploits coarse-grained parallelism, loop parallelism and instruction level parallelism using the OSCAR compiler. The OSCAR chip comprises several processor elements (PEs) connected to local and shared memory, facilitating data transfer among processors.

Like multi-core architectures, the systolic array (refer to Figure 2.9) represents a pipe-network arrangement of data processing units, each of which operates when triggered by the arrival of a data object. Data flows across the array between neighbouring data processing units; no global communications are involved. Multiple data streams can be sent and received, thus enabling data parallelism. Arnould et al. [28] presented a high-performance systolic array computer, Warp. The processor array consists of replicas of the same processor arranged in a one-dimensional array. A systolic array assumes all operations are completed in a unit delay to maintain rhythmic data flow; the unit delay is therefore the longest delay among all operators, resulting in a low-performance systolic array. Lee et al. [103] organize each cell of a systolic array as another systolic array (i.e., a super-systolic array) in order to raise the cell utility of the systolic array.
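To make the rhythmic data flow concrete, the following C fragment is a software model of one clock tick of a 1-D systolic array in a FIR-filter-style arrangement; the cell structure and the weights are illustrative assumptions, not the Warp design. Each cell multiplies its weight by the data value passing through it and accumulates into a partial sum that travels with the data.

    #define CELLS 4

    typedef struct { double w, x, y; } dpu_t;   /* weight, data, partial sum */

    /* one global clock tick: read the finished sum leaving the array, shift
     * data and partial sums one cell to the right (neighbour-to-neighbour
     * only), inject a new sample on the left, then let every cell fire */
    static double tick(dpu_t cell[CELLS], double x_in)
    {
        double y_out = cell[CELLS - 1].y;        /* result leaving the array */
        for (int k = CELLS - 1; k > 0; k--) {
            cell[k].x = cell[k - 1].x;
            cell[k].y = cell[k - 1].y;
        }
        cell[0].x = x_in;
        cell[0].y = 0.0;
        for (int k = 0; k < CELLS; k++)          /* all cells fire in parallel */
            cell[k].y += cell[k].w * cell[k].x;
        return y_out;
    }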

A declustering technique for scheduling processes onto a multiprocessor system was proposed in [141]. This technique exposes parallelism instances in a Synchronous Data Flow (SDF) graph in order of importance, and attains a cluster granularity that fits the characteristics of the architecture, depending on the number of processors intended.

Figure 2.9: A systolic array example of data processing units (DPUs).

Cong et al. [47] further explored the technique of clustering in pipelined multiprocessor systems. The work optimizes the latency and resource usage of each processor representing a stage in the pipeline system, under a throughput constraint. The authors developed a set of efficient algorithms (e.g., labeling, clustering and packing) for the generation of an application-specific multiprocessor architecture, given the application specified as a task graph. The framework decides the number of processors to be used, communication buffer sizes, processor interconnections, and the mapping of tasks.

Banarjee et al. incorporated heterogeneous digital signal processors with macro pipeline-based scheduling. The technique [32] used a signal flow graph (SFG) as a basis for partitioning. The work shows that heterogeneous multi-cores are able to improve the throughput rate to several times that of the conventional homogeneous multiprocessor scheduling algorithms.

Sun et al. first explored the use of ASIPs in multiprocessor systems [150]. The work proposed a methodology to simultaneously select custom instructions, and assign and schedule application tasks on extensible processors.

Givargis et al. proposed a technique for efficiently exploring the power/performance design space of a parameterized system-on-chip (SoC) architecture to find all Pareto-optimal configurations [65]. A directed graph was used to capture the interdependencies among parameters, and algorithms were developed that search the configuration space incrementally and prune inferior configurations.

A case study was performed that evaluated the performance of such a pipelined multiprocessor system and compared it against a distributed systems architecture [139]. The work used ASIPs, and demonstrated that selective optimization of the individual cores provides the performance improvement necessary to balance the overall latency of the pipeline stages in the system.

Goossens et al. [68] developed a retargetable tool suite, Chess/Checkers, to enable the design of ASIPs in multicore SoCs. Chess/Checkers offers fast architectural exploration, hardware synthesis, software compilation, inter-ASIP communication and verification. The tools support a broad range of architectures, from small microprocessors, through DSP-dominated cores, to VLIW and vector processors. The toolset provides a graphical user interface for the design of a multiprocessor system. Systems designed with it include portable audio and hearing instruments, wireless modems, video coding and network processing.

Prakash et al. [128] pioneered a methodology to synthesize heterogeneous multiprocessor systems, using a mixed integer linear-programming (MILP) model. The method generates a task execution schedule along with the structure of the multiprocessor system and a mapping of subtasks to processors. The model requires a set of relationships to be satisfied in order to guarantee the proper ordering of task execution events and the correctness of the system. However, the model was not practical enough to generate optimal solutions for even a small number of tasks, because the number of model factors increases exponentially with the number of subtasks.

Rae et al. [131] proposed HeMPS, an application-specific heterogeneous multiprocessor synthesis system that uses a form of Evolutionary Computation known as Differential Evolution to rapidly and efficiently search the design space for an optimal or near-optimal solution. The heterogeneous multiprocessor system uses a point-to-point interconnection network for communication among the various processing nodes.

The use of heuristic or randomized search methods has made possible a reduction in search and synthesis time. Integrating these various optimized methodologies at different stages in the processor design flow would make possible the ultimate goal: a complete, efficient and fast automated system for generating processor systems for specific applications.

Chapter 3

Approach to Customization

3.1 Introduction

This chapter presents a philosophical view of the approaches and methods used in the course of this thesis project. I begin by discussing the shortcomings of the previous approaches described in Chapter 2, and how those shortcomings are addressed in this thesis. I then present the motivation for using ASIPs in our case study, and the importance of extensible systems for customizing embedded systems. Extensible and configurable systems allow the removal of instructions to create ’light-weight’ processors that are cost efficient. The chapter rationalizes the move from extensible processors to coprocessors, and finally to multiprocessor pipeline systems.

3.2 Shortcomings of Previous Research

Generating a suitable processor configuration is critical to satisfying the ever more demanding constraints imposed on embedded systems regarding performance, power and area. However, a large portion of previous research [44, 81, 99, 135, 157] on Application Specific Instruction-set Processors (ASIPs) has focussed only on completely customised instruction sets, by extending the work on base processors. Customizable processor cores such as Xtensa [20], Jazz [10], PEAS-III [85], ARCtangent [4], Nios [1] and SP5-flex [15] contain a minimal instruction set to be implemented in the base processor, do not provide the flexibility of major architectural changes, and are highly dependent on the features of vendor tools. Chapter 4 addresses these limitations by developing a framework that provides total control over the implementation and configuration of the base processor, providing opportunities for further design exploration, not only by extending instructions, but also by reducing the instruction set to improve the performance of the system.

The current trend to speed up ASIPs is to provide extended instructions [20, 149] to execute a group of instructions that occur frequently in the code. Coprocessors are also used when the application needs special functionality, such as floating point arithmetic and sum-absolute-difference instructions.

Computationally-intensive applications normally spend most of their execution time in a small section of executable code, typically the innermost loop. Despite all the efforts of [49, 57, 146] to create coprocessors for loop acceleration, the coprocessors created are still generally separate and distinct components connected to the main bus, so communication overhead occurs between the main processor and coprocessor during execution. Moreover, some coprocessors are complete processors with their own DMA interfaces to reduce memory access latency; such complex coprocessors increase area utilization and power in an embedded system. High-level synthesis tools such as SPARK [72] can also be used to synthesize coprocessors and hardware logic from high-level languages such as ANSI-C. These approaches likewise incur communication overhead when interacting with the main processor, and a separate unit increases the area and power utilization of the system. Additionally, the program to be accelerated has to be modified extensively in order to be integrated.

In contrast to loosely-coupled coprocessors, the research work of [77, 133, 164] integrated coprocessors directly into the architecture of the main processor, using FPGAs as configurable components. These coprocessors are able to provide extended instructions, and to share the register files and pipeline registers of the host CPU. However, FPGAs have higher latency than ASIC host processors, and hence limit the performance of the system. Chapter 5 addresses the shortcomings of such coprocessor systems.

Research studies [94, 162] in multiprocessor systems have explored many different topologies and interconnects to achieve greater performance and efficiency. Sih et al. [141] identified parallel processes in a program flow and allocated them to a predefined number of processors. However, that work mainly targets shared-memory multiprocessors. The work in this thesis aims to explore a pipelined topology configuration that can also exploit the parallelism of tasks in targeted programs.

Cong et al. [47] explored synthesis for pipelined multiprocessor systems, albeit in a homogeneous manner; the applications targeted were based on dataflow process network and synchronous dataflow models. This study partitions task graphs into convex clusters and then maps them onto processors. Inter-processor communication is implemented as FIFOs. The proposed labeling and clustering algorithm generates optimized clusterings for latency minimization. However, the framework does not address the problem of the large exploration space in a heterogeneous multiprocessor system, where the multitude of different configurations for each processor may lead to different cluster sizes and partitions.

Using heterogeneously-configured ASIPs in a multiprocessor system is a fairly new approach. Sun et al. explored the use of ASIPs in multiprocessor systems [150], proposing a methodology that simultaneously selects custom instructions, and assigns and schedules application tasks to extensible processors. However, the work did not address efficient exploration of the large design and configuration space that is inherent in a multiprocessor system. There is a need to address the problem of obtaining a near-optimal multiprocessor configuration (i.e., core count, configuration and connectivity) suited to a particular application.

Banarjee et al. [32] explored the partitioning of applications onto a multiprocessor architecture, but did not consider the possibility of different configurations for each stage in the pipeline (as is possible when ASIPs are used for each stage). In contrast to Givargis et al. [65], who explored the power/performance design space of a parameterized system-on-chip (SoC) architecture, this thesis takes a different approach by exploring the design space of pipelined multiprocessor SoC configurations and their area-performance trade-off behaviour. Shee et al. [139] did not formalize the exploration of a pipeline architecture during the case study. Chapters 6 and 7 complement the work of Sun et al. [150] by exploring a higher level of abstraction to select the different customized processors that would suit a multiprocessor pipeline architecture.

3.3 Modus Operandi

The notion of producing the best microprocessor architecture has been an aspiration of the research community for many years, and presents the design challenge not only of meeting runtime performance requirements, but also of reducing power consumption. In server environments, arrays of processors are used for intensive computations, resulting in significant power usage. In mobile and embedded systems, a low power requirement is critical to allow continuous usage of devices without the aid of external power and heavy battery backups.

In order to address the continuing need for speedup and energy reduction in embedded systems, various processor architectures and methodologies are explored. It is postulated that a general purpose processor provides an instruction set architecture that the running program would not fully use. The logic circuits used to implement the unused instructions are thus redundant and of no use to the functionality of the program. Moreover, the unused circuits result in a larger processor, with lower clock rates and greater power consumption.

A customizable processor was developed and the methodology presented in Peddersen et al. [124]. The scheme uses ASIPmeister [85], a processor generation tool. Chapter 4 describes the contribution of [124] to this thesis, in providing a method of reducing the PISA instruction set and generating a processor for a given application. The reduction of the instruction set and the generation of the processor RTL can be performed within an hour, making this one of the fastest methods of generating an application-specific processor.

The resulting processor is smaller in size, and capable of running the same application with reduced power consumption and at a higher speed (due to a reduced clock period). The initial processor was based on the SimpleScalar / PISA [30] instruction set. The selection of this particular instruction set meant that a rich set of tools was available. For five benchmark applications, we show that, on average, processor size can be reduced by 30%, energy consumption by 24%, and performance improved by 24%.

However, customizing the instruction set of the microprocessor alone would not dramatically increase the performance of critical applications. Extensible and customizable processors such as Xtensa [20], Jazz [10], ARCtangent [4], Nios [1] and SP5-flex [15] have demonstrated the need to increase processing power by integrating coprocessors into the system. Coprocessors provide additional processing power to speed up complex computations such as floating point arithmetic and vector operations. As is the case in general purpose processors, most instructions in complex coprocessors would not be fully used, and would only add unnecessary power and area usage to the system.

A scheme is therefore needed to create a customized coprocessor platform that improves execution time while taking account of power and area usage. CriticalBlue [49] provides a complete methodology, with a toolset for converting functions to individual coprocessors. Being software programmable, the coprocessors generated by the system have a degree of flexibility to accommodate changes to standards. This capability proves useful for rapidly generating customized and efficient RTL code to speed up applications. However, being a conventional coprocessor implementation, the base processor still relies on conventional buses and protocols to communicate with the coprocessor.

Chapter 5 traces an approach to creating a novel coprocessor architecture, tightly coupled with the base processor to provide faster interactions between them. This approach is prompted by the high-latency communication overhead of conventional coprocessor communication buses, as in CriticalBlue [49] and other works [115, 146]; a tightly coupled coprocessor is able to connect directly to the internal components of the base processor. Besides this reduction in communication overhead between base processor and coprocessor, other techniques were used to speed up processing. Latency hiding was used to exploit the parallelism available in this architecture: memory access (on the base processor) and loop processing (on the coprocessor) can be interleaved and performed in parallel in order to reduce the time spent waiting for memory accesses.
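The following C sketch illustrates the kind of latency hiding described above, using double buffering: while the coprocessor works on one block, the base processor fetches the next. The block size and the coproc_start()/coproc_wait() handshake routines are hypothetical names used for illustration, not the actual interface developed in Chapter 5.

    #include <string.h>

    #define BLK 64                  /* bytes per processed block (assumed) */

    extern void coproc_start(const unsigned char *in, unsigned char *out);
    extern void coproc_wait(void);

    void process_stream(const unsigned char *src, unsigned char *dst, int nblk)
    {
        static unsigned char buf[2][BLK];

        memcpy(buf[0], src, BLK);                       /* prefetch first block */
        for (int i = 0; i < nblk; i++) {
            coproc_start(buf[i & 1], dst + i * BLK);    /* coprocessor runs...  */
            if (i + 1 < nblk)                           /* ...while CPU fetches */
                memcpy(buf[(i + 1) & 1], src + (i + 1) * BLK, BLK);
            coproc_wait();                              /* per-block handshake  */
        }
    }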

A JPEG encoding algorithm was investigated, and one of its loops accelerated by implementing it in a coprocessor. The possible acceleration was assessed by implementing the critical segment with two different coprocessors and with a set of customized instructions. The two coprocessor approaches are: a high-level synthesis (HLS) approach, and a custom coprocessor approach. Approaches such as CriticalBlue [49] and SPARK [72] present high-level synthesis mechanisms that generate coprocessors from C language-type specifications; however, such coprocessors remain independent of the implementation of the base processor. By taking advantage of the architecture of the base processor, it is possible to exploit the internals of the CPU to the advantage of the program being run. A loop performance improvement of 2.57× is achieved using the custom coprocessor approach, compared to 1.58× for the HLS approach and 1.33× for the customized instruction approach, each with reference to the main processor alone. With the integrated coprocessor approach, more computations can be offloaded from the base processor to the coprocessor, achieving better performance than the high-level synthesis approach. Energy savings within the loop are 57%, 28% and 19% respectively.

Extensible processors have been shown to provide good cost efficiency and speed improvements on single processor applications [20, 149]. Designers are able to pick the best-fit CPU core and select processor peripherals based on the application being run.

On a different note, a coprocessor itself is effectively a microprocessor, so a coprocessor system is in fact a multiprocessor system. Extending this concept further, extensible processors can be used to create customized cores for parallelized, partitioned programs. Extensible on-chip multiprocessors promise to have a significant impact on embedded system design [150]. Such systems would allow the designer to configure different processors with variable instruction sets, different cache sizes and even different register widths and coprocessors. This is superior to a homogeneous

solution (a.k.a. SMP architecture) in terms of cost efficiency and area usage.

There are various ways to connect multiple processors into a single system. The topologies include a shared bus network and an interconnected network. The shared bus network allows easier system design and routing placement; however, the communication overhead would be extremely large if a significant number of processors were to use the bus at the same time. In contrast, the various processing cores in an interconnected network topology can communicate with each other over several one-to-one connections, although this approach is less flexible if communication paths need to be changed after the interconnects have been fixed at design time.

We focus our research on streaming applications, such as JPEG encoding and MP3 encoding, and believe that the designer can best judge which program segments should be partitioned onto separate processors. Automatic code partitioning and scheduling tools [32, 141] are ideal for a complete automation methodology. However, such approaches depend heavily on the represented input structures of the target algorithm.

With streaming applications, it would be possible to pipeline the application into several pipeline stages, as demonstrated by Jeon et al. [89]. The main application loop is pipelined into several stages, similar to a pipelined processor architecture [78], where each pipeline stage is mapped to a task in the program. If a pipeline stage is heavily used and represents a bottleneck in the pipeline, additional parallel pipeline stages can be introduced which accept input from the previous stage and feed the outputs of the parallel stages back to the next stage in the pipeline.

The disadvantage of this approach is that it only applies to streaming applications; loops that require output values to be fed back as input values to previous stages would not be appropriate for this approach.

A case study was first performed to investigate the pipeline multiprocessor approach. Several versions of the JPEG encoder program were made manually as pipeline multiprocessor applications (see Chapter 6). An extensible processor, Tensilica's Xtensa LX, was used for the processing cores in the multiprocessor system. Queues were used as the means of communication between cores. The application was partitioned into multiple sequential blocks, each block representing a stage in a sequential pipeline. A number of differing configurations, ranging from a single-core to a nine-core system, were implemented. Based on the performance of these systems with respect to area, the seven-core system was chosen for further optimization. By carefully customizing each processor, the pipeline is balanced (i.e. processing times are nearly equal). With selective optimization on the seven-core system, a speedup of up to 4.6× was obtained, with an area increase of only 3.1× (an area increase to speedup ratio of just 0.68).

The case study showed that a pipeline multiprocessor architecture can yield a more cost-efficient system than a single processor system. The next step was to develop a methodology for selecting the best configuration for each pipeline system, one that could be automated and executed in a short amount of time. An initial heuristic, using runtime × area as the cost value, was developed and used in configuration selection. For the JPEG and MP3 benchmarks, the minimum cost obtained through this heuristic was within 19.94% and 5.74% of the optimal values respectively. Optimal values were searched exhaustively with the aid of a branch-and-bound scheme for faster searching.
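To make the selection step concrete, the sketch below scores every candidate configuration by runtime × area and keeps the cheapest; this is a minimal illustration of the cost idea only, and the struct fields and function name are ours, not taken from the thesis toolchain.

    #include <stddef.h>

    /* One candidate pipeline configuration: its estimated runtime and area. */
    struct config {
        double runtime;  /* e.g. estimated execution cycles */
        double area;     /* e.g. gate count */
    };

    /* Return the index of the configuration minimizing runtime * area. */
    static size_t select_min_cost(const struct config *c, size_t n)
    {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (c[i].runtime * c[i].area < c[best].runtime * c[best].area)
                best = i;
        return best;
    }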

Pushing for more accuracy, a new heuristic was developed in Chapter 7. The cost functions are now transformed into linear equations. The individual runtime and area costs are normalized and scaled, using coefficient values predetermined for each pipeline design. The minimum cost obtained through our heuristic is now within

0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks respectively. The heuristic solution was reached within a fraction of a second, while some configurations took several hours to converge when the branch-and-bound method was used.

Chapter 4

Customizing by Removing Instructions

4.1 Introduction

The first methodology to be presented in this thesis is the generation of an efficient RTL scheme for the SimpleScalar / PISA instruction set architecture. This chapter shows a method of reducing the PISA instruction set and generating a processor for a given application. This reduction and generation can be performed within an hour, making this one of the fastest methods of generating an application-specific processor. For five benchmark applications, we show that, on average, processor size can be reduced by 30%, energy consumption by 24%, and performance improved by 24%.


4.2 Motivation

Searching for the best processor architecture for a particular application is vital to obtain the most cost-efficient implementation while maintaining certain performance and energy constraints. With the advent of ASIPs, this becomes possible without the need to redesign the whole instruction set architecture (ISA) and architectural configuration. By rapidly creating processors with custom instruction sets, the designer effectively reduces the design turnaround time and the design-to-market gap.

A large proportion of previous research on ASIPs has been focussed on completely customizing instruction sets through extending the work on base processors. We have developed a framework that provides total control of the implementation and configuration of the base processor, providing opportunities for further design exploration not only by extending instructions, but also by reducing the instruction set to improve the performance of the system.

Our research exploits the flexibility of ASIPmeister [85, 5] to either include or exclude any subset of the instruction set. Instead of adding instructions to the base processor, we can remove redundant instructions from it to improve performance in terms of area overhead, power dissipation and latency; the instruction set can thus be chosen to closely fit the application being run. Our processor implements the Portable Instruction Set Architecture (PISA), which is closely linked to the SimpleScalar [30] architecture. This ISA was chosen to take advantage of the tool set already available as part of the SimpleScalar framework. Our work can be further improved to include extended instructions as well, providing extra functionality by adding to the existing instruction set.

4.3 Microprocessor Generation Framework

A SimpleScalar / PISA processor generator was developed in collaboration with group members at the University of New South Wales. The methodology rapidly designs a configurable microprocessor core, generating a full SimpleScalar (integer) architecture processor core that is synthesizable into an SoC or onto an FPGA for prototyping.

The methodology is fully explained in Peddersen et al. [124]. The work in this chapter is integrated into this methodology, to generate a processor with various subsets of instructions, in contrast with other approaches that merely extend the base processor core.

The methodology consists of hardware and software library generation. Rapid generation of the processor is achieved using ASIPmeister [5], as it allows targeting of any processor description; instructions can be added and removed at will. The ASIPmeister tool produces an HDL model for the given instruction set description.

Descriptions of register forwarding, pipeline stages and hardware resources are included, to create a specification of the desired processor. This complete HDL description can be augmented with additional hardware, such as cache and memory-mapped I/O.

The software generation stage of the design involves the generation of instruction and data memories to be interfaced with the processor, and the addition of software subroutines needed to interface to hardware and possibly service interrupts. This includes the creation of a file system and structure for systems where an operating system is not present. A boot loader to initialize the memories and stack used in the system is also created. Finally, the HDL memory models used by the design, including all the initial memory maps, are generated, and all these constituents are simulated together as one complete system.

4.4 Application Specific Processor Generation

Many applications do not use the full range of instructions available in a general purpose processor's instruction set (for instance, the floating point instructions or the 'div' instruction). If a processor is being designed for one of these applications, then the hardware dedicated to decoding and executing such instructions can be removed to potentially increase performance and decrease the area and power costs of the processor. By analyzing the application, it is possible to create a minimized processor for just that application by turning off the instructions that are no longer needed.

Figure 4.1: Creating a minimized processor description. (The application written in C/C++ is assembled; an instruction profiler and an instruction removal step, together with a resource analyzer, produce a minimized processor description with a minimized instruction set.)

We have performed this task for our applications running on the SimpleScalar processor. The methodology of this process is shown in Figure 4.1. The application is first compiled to an object file by the processor compilation tools. The resulting file is then analyzed to determine which instructions are present. Instructions not present

in the compiled application can then be removed from the processor description and the processor created again with the hardware generation tool. In addition, the removed instructions are analyzed to see if any hardware resources are no longer required. If all the instructions that access a resource (e.g., the divider) are removed, the associated resource can obviously be removed from the system without harm.
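The profiling step itself reduces to collecting the set of mnemonics that appear in the disassembled application; anything outside that set is a candidate for removal. The following is a minimal sketch of the idea, not the actual tool, and it assumes a listing with one instruction per line and the mnemonic as the first token.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_OPS 512

    int main(int argc, char **argv)
    {
        char line[256];
        char *seen[MAX_OPS];
        int n = 0;
        FILE *f;

        if (argc < 2) { fprintf(stderr, "usage: %s listing\n", argv[0]); return 1; }
        f = fopen(argv[1], "r");            /* disassembled application */
        if (!f) { perror("fopen"); return 1; }
        while (fgets(line, sizeof line, f)) {
            char *mnemonic = strtok(line, " \t\n");
            int i, known = 0;
            if (!mnemonic) continue;
            for (i = 0; i < n; i++)          /* already recorded? */
                if (strcmp(seen[i], mnemonic) == 0) { known = 1; break; }
            if (!known && n < MAX_OPS)
                seen[n++] = strdup(mnemonic);
        }
        fclose(f);
        for (int i = 0; i < n; i++)          /* instructions to keep enabled */
            puts(seen[i]);
        return 0;
    }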

Another option to allow a reduction in the size and power of the design is to replace large hardware resources with software subroutines that perform the same function. This technique is best used when a large resource is used very infrequently, so the speed loss incurred will not be too high. To perform this task, the instructions that access the resource are turned into syscall-like jumps to known code locations. To demonstrate this, we have written subroutines for the division instruction to replace the divider in applications where the 'div' instructions are used only infrequently, by conversion functions such as printf() and scanf().
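As an illustration of such a subroutine, the sketch below implements unsigned division by the classic restoring shift-subtract method. The function name and interface here are ours for illustration; the real replacement routine is entered through the syscall-like jump described above.

    #include <stdint.h>

    /* Software substitute for a hardware divider: restoring
       shift-subtract division, producing one quotient bit per step. */
    uint32_t soft_divu(uint32_t num, uint32_t den, uint32_t *rem)
    {
        uint32_t q = 0, r = 0;
        int i;

        if (den == 0) {              /* behaviour on divide-by-zero is ours */
            *rem = num;
            return 0xFFFFFFFFu;
        }
        for (i = 31; i >= 0; i--) {
            r = (r << 1) | ((num >> i) & 1u);  /* bring down next bit */
            if (r >= den) {                    /* subtract when it fits */
                r -= den;
                q |= 1u << i;
            }
        }
        *rem = r;
        return q;
    }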

The process of reducing instructions and replacing large, infrequently used hardware components can reduce the processor size quite substantially, especially if entire design resources are removed; this is not possible with other ASIP design solutions, whose base processor instruction set cannot be pruned. This allows rapid processor generation targeted at the application the processor will execute.

4.5 Experimental Setup

The experimental setup consisted of the system-on-chip architecture as the Device Under Test (DUT), connected to instruction and data memory models. The memory models are generated by compiling C programs into SimpleScalar binaries and then translating them into VHDL models. The setup is shown in Figure 4.2.

Benchmark applications used in our tests are taken from Mediabench [102].

Figure 4.2: Experimental Setup. (The application written in C/C++ is compiled with the SimpleScalar compiler suite and profiled to obtain a minimized instruction set; hardware generation with ASIPmeister produces the SS_CPU, while software library generation creates the stack, file structure and syscall subroutines; the resulting simulation model with instruction and data memories feeds design analysis of area, power, maximum frequency and execution time.)

The specific applications used for testing are listed in Table 4.1. Three processors were rapidly generated for use in the experiment; they are listed in Table 4.2. Configuration A is a processor with all SimpleScalar/PISA instructions; configuration B is a processor with a minimized SimpleScalar/PISA instruction set based on the adpcm application; and configuration C is a processor with a minimized instruction set based on the pegwit application.

The component within the DUT is a synthesizable HDL model of our SimpleScalar processor, generated through the methodology outlined above.

The SoC architectures described above are used in this thesis to evaluate the area, power, and performance of the DUT. Other customizations, such as the existence of a memory hierarchy, multiple data paths, changes to pipeline depth, instruction issue width, etc., can be rapidly generated for evaluating different SoC architectures.

Application   Details of the application
adpcmenc      adpcm file encoder
adpcmdec      adpcm file decoder
pegwitkey     pgp key generation
pegwitenc     pgp encryption
pegwitdec     pgp decryption

Table 4.1: Mediabench Benchmark Applications used in experiment.

The Synopsys Design Compiler [7] and a 90nm standard cell library were used to evaluate the DUT area, power, and maximum operating frequency. To evaluate the execution time of each application, we estimated the number of cycles needed to run on the processor. A custom tool was created to calculate the total number of instructions executed in the application, plus any additional stalls that occur during execution. This count of total execution cycles provides an accurate measurement of the processor performance without the need to perform lengthy RTL simulations.
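Once the counts are known, the estimate is simple arithmetic: cycles divided by the achievable clock frequency. A hedged sketch, using the configuration B figures for adpcmenc from Table 4.3 as illustrative inputs:

    #include <stdio.h>

    int main(void)
    {
        /* cycles = instructions executed + pipeline stalls, as counted by
           the profiling tool; the figures below are those reported for
           adpcmenc on configuration B in Table 4.3 */
        unsigned long long cycles = 9565930ULL;
        double fmax_hz = 38.4e6;   /* post-synthesis maximum frequency */

        printf("estimated runtime: %.3f s\n", cycles / fmax_hz);  /* ~0.249 s */
        return 0;
    }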

Config.   Specification
A         SS CPU with full instruction set.
B         SS CPU with minimized instruction set for the adpcm application.
C         SS CPU with minimized instruction set for the pegwit application.

Table 4.2: Different configurations of the SoC architecture.

Application   Config   Area      % area      Energy   % Energy    Clock        Frequency   % Frequency
                       (gates)   reduction   (mJ)     reduction   Cycles       (MHz)       improvement
adpcmenc      A        196218    -           53.23    -           9,560,870    30.3        -
              B        128843    34.3%       41.94    21.17       9,565,930    38.4        21.2%
adpcmdec      A        196218    -           100.74   -           18,092,079   30.3        -
              B        128843    34.3%       79.37    21.20       18,094,103   38.4        21.2%
pegwitkey     A        196218    -           92.75    -           16,654,773   30.3        -
              C        133924    31.7%       64.64    30.30       16,655,279   43.4        30.3%
pegwitenc     A        196218    -           217.48   -           39,052,206   30.3        -
              C        133924    31.7%       123.75   30.29       39,059,037   43.4        30.3%
pegwitdec     A        196218    -           123.75   -           22,222,020   30.3        -
              C        133924    31.7%       86.25    30.28       22,228,345   43.4        30.3%

Table 4.3: Table of results.

4.5.1 Analysis of Results

Area and power figures are measurements of the on-chip components only and do not include the external memory. Results of these measurements are shown in Table 4.3. Column 1 in Table 4.3 shows the application executed, column 2 shows the processor configuration of the DUT, column 3 provides the area measurement, column 4 gives the percentage area reduction of the minimized instruction set processor compared to the full instruction set processor, column 5 gives the energy measurement, and column 6 gives the percentage energy reduction when comparing the different processor configurations.

Column 7 shows the total number of estimated execution cycles of the application.

Column 8 gives the maximum frequency based on the longest pipeline delay, and column 9 gives the percentage speedup of the minimized instruction set processor compared to the full instruction set processor.

Figure 4.3: Experimental results showing area, performance, and energy improvement: (a) area comparison (Kgates); (b) maximum clock period (ns); (c) energy (mJ) of SS_CPU with full and minimized ISA, for configurations A, B and C.

The graph in Figure 4.3(a) shows that the minimized instruction set processors

have an average reduction of 30% in area compared with the full instruction set processor. Figure 4.3(b) shows that a minimized instruction set architecture can achieve a speedup of 25% compared with the full instruction set processor. In Figure 4.3(c), the energy reduction of the minimized instruction set processor compared with the full instruction set processor is shown. On average, a 24.5% energy reduction is achieved.

4.6 Conclusions

This chapter presented a novel methodology for rapid processor generation. Included in this process is a method to tailor the processor to specific applications by reducing the instruction set to the minimum required to execute the application. A six-stage pipelined SimpleScalar/PISA processor implementation developed in a joint project was used as the base processor in this work. Performance figures have been calculated for this processor and for minimized versions of the processor for particular applications, which show a marked improvement with the processor reduction technique. The SimpleScalar hardware implementation we created and the software tools for generating memories for the device have been made available for download from http://www.cse.unsw.edu.au/~esl/rapid.

Chapter 5

Customizing by Coprocessors

5.1 Introduction

In this chapter, we describe a novel way to accelerate loops by tightly coupling a coprocessor to an ASIP. Latency hiding is used to exploit the parallelism available in this architecture. To illustrate the advantages of this approach, we investigate a JPEG encoding algorithm and accelerate one of its loops by implementing it with two different coprocessors and a set of customized instructions, comparing the acceleration achieved in each case. The two different coprocessor approaches are: a high-level synthesis (HLS) approach; and a custom coprocessor approach, the former providing a faster method of generating coprocessors. We show that a loop performance improvement of 2.57× is achieved using the custom coprocessor approach, compared to 1.58× for the HLS approach and 1.33× for the customized instruction approach, each compared with the main processor alone; energy savings within the loop are 57%, 28% and 19% respectively.


5.2 The JPEG Encoder

The JPEG compression algorithm being used in this case study is a lossy compression scheme that removes redundant information invisible to the human eye. This scheme has advantages for naturally occurring images that have a variety of shades. The algorithm is ubiquitous in most digital imaging products.

The benchmark program accepts a Portable PixMap (ppm) image (raw file). It reads the file, together with other parameters, and outputs the corresponding JPEG file. The application in general has two main sections: the lossy compression stage (DCT transformation + quantization) and the lossless compression stage (Huffman encoding). How the program is structured cannot easily be determined from its source, due to the complex nature of the function calls. Thus, a designer would need to understand the program and algorithm entirely in order to produce a fully customized coprocessor for such an application.

5.2.1 Loop Identification

The benchmark program is profiled using tools we developed to support the current ISA as well as to extract the necessary information not provided by other tools. We created a tool based on a loop detection scheme proposed in [155]. We then performed a detailed profile of the innermost loops in the program against a set of RAW images of different sizes.

Ideally, the theoretical maximum speedup gained from optimizing the loop could be calculated by dividing the total program runtime by the runtime after deducting the loop runtime; this assumes that the loop execution can be eliminated completely.

However, we choose a more realistic definition. We assume that the theoretical maximum improvement is achieved when all non-memory operations are eliminated completely. This is when all computations (non-memory operations) are moved to the coprocessor (see Section 5.3.1). The profiling stage detects the loop instructions executed (LIE), the memory operations executed in the loop (Memory LIE) and the total number of instructions executed (TIE). TIE consists of LIE and the instructions in the rest of the code.

The theoretical maximum improvement (TMI) is:

    TMI = \left( \frac{TIE}{TIE - (LIE - \text{Memory LIE})} - 1 \right) \times 100\%        (5.1)

where TIE = LIE + non-LIE.

The TMI determined in this way will be used to select the loop that has the greatest potential for speedup in our loop acceleration case study. For the rose.ppm image provided with the benchmark application, which has a resolution of 227 × 149 (33,823) pixels, Table 5.1 shows the seven most critical loops in the JPEG encoder. The last three columns in the table show the percentage runtime of the loop with respect to the whole program, the percentage of non-memory operations in the loop, and the theoretical maximum improvement which can be obtained. The loop starting at line 643 of jcphuff.c has the highest TMI value and is thus selected for our case study.

Loop             Cycles      RT%     COMP%   TMI%
jcphuff.c:643    4,213,978   19.77   78.82   18.46
jcdctmgr.c:232   1,329,496   6.24    87.71   5.79
jfdctint.c:220   1,101,194   5.17    90.11   4.88
jfdctint.c:155   1,087,578   5.10    89.98   4.81
jchuff.c:766     982,554     4.61    91.65   4.41
jchuff.c:684     509,413     2.39    91.73   2.24
jchuff.c:673     508,646     2.39    91.71   2.24

Table 5.1: Loop runtimes
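As a worked check of Equation (5.1) against the first row of Table 5.1, with RT% read as LIE/TIE and COMP% as the non-memory fraction (LIE − Memory LIE)/LIE of the loop:

    TMI = \left( \frac{1}{1 - 0.1977 \times 0.7882} - 1 \right) \times 100\% \approx 18.46\%

which reproduces the TMI% entry for jcphuff.c:643.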

5.3 High-level Synthesis Approach

A high-level synthesis approach is used to convert the loop, written in ANSI-C code, to synthesizable VHDL code. Slight modifications are made to the loop code before the VHDL component can be generated. Each I/O pin of the generated component corresponds to the inputs and outputs defined as arguments to the loop function definition (see Figure 5.2). These functional blocks can be used as functional units that provide extra processing power to a System-on-Chip (SoC) architecture.

The VHDL component has a start signal to control the execution of the component and a done signal to indicate end of execution. The HLS component has no memory location awareness and thus expects input values to be available at its pins at the time of execution.

A high-level synthesis approach was initially used, as it provides an option to unroll loops. Loop unrolling dramatically improves parallelism in the loop while reducing the number of conditional branches executed at the end of every loop iteration. However, loop unrolling is only beneficial if multiple resources can be used in parallel. In a single-pipeline processor system like ours, parallelism through loop unrolling was found to be overly ambitious, since the serially executed memory operations in the base processor are the bottleneck.

5.3.1 Architecture

    EOB = 0;
    for (k = cinfo->Ss; k <= Se; k++) {
        temp = (*block)[jpeg_natural_order[k]];
        if (temp < 0)
            temp = -temp;
        temp >>= Al;
        absvalues[k] = temp;
        if (temp == 1)
            EOB = k;
    }

Figure 5.1: Example code segment & coprocessor interface (inputs: Al, Ss, Se, block[.]; outputs: absvalues[.], EOB).

The loop segment selected in Section 5.2.1 is shown in Figure 5.1. Variables are identified as either inputs or outputs to the loop. The loop is then wrapped up as a function, with input variables being defined as inputs to the function and output variables being defined as pointers passed to the function; the modified function designed for the HLS framework is shown in Figure 5.2. In order to fetch values, a wrapper was made to fetch the right variables (the wrapper sends the proper address signal to the address line of the read port). We have chosen to connect the inputs of the HLS component to the read register ports of the general purpose register (GPR) and its output ports to the write register ports of the GPR.

In this HLS approach, the number of register read / write ports depends upon the number of parameters that the loop function has; a function with many parameters needs a large number of GPR ports.

The coprocessor and base processor share the same register file in the system. Our selected loop segment has 4 inputs and 2 outputs, thus requiring us to connect the necessary lines directly to the register file. Originally, the base processor has a register file with 4 read ports and 2 write ports. As these ports are used by the existing components in the pipeline, additional ports have to be assigned to the coprocessor.

    void prepass(int Al, int Ss, int Se, int *blocks, int *absvalues, int *EOB)
    {
        int k;
        register int temp;

        *EOB = 0;
        for (k = Ss; k <= Se; k++) {
            temp = blocks[k];
            if (temp < 0)
                temp = -temp;
            temp >>= Al;
            absvalues[k] = temp;
            if (temp == 1)
                *EOB = k;
        }
    }

Figure 5.2: Modified loop as an ANSI-C function

This results in an “8-read, 4-write” register file being created for the integration of the HLS coprocessor into our design.

Figure 5.3(a) shows how the HLS-based coprocessor is integrated into the existing design. We have introduced two registers which can be accessed by the base processor. One bit from the COREG register is connected to the start signal of the coprocessor wrapper. The remaining bits can be connected to additional coprocessors of the same architecture. The CODONE register set signal is connected to the done signal of the coprocessor. This register is used to notify the base processor when the coprocessor has finished executing the loop.

Two new instructions are added to the existing instruction set: SCPR (set coprocessor) is used to set specific bits in the COREG register, which in turn asserts the start signal of the coprocessor; BCPR (branch coprocessor) behaves like a normal branch instruction, except that it branches only while the CODONE register indicates that the coprocessor has not yet finished executing the loop.
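The resulting software pattern is sketched below in C. The intrinsics scpr() and bcpr_busy() are hypothetical stand-ins for the SCPR and BCPR opcodes (in the real code the new instructions are emitted directly); the comment marks where the base processor's memory operations sit.

    /* Hypothetical intrinsics standing in for the new opcodes. */
    extern void scpr(unsigned coreg_bits);  /* SCPR: set COREG, assert start  */
    extern int  bcpr_busy(void);            /* BCPR condition: CODONE not set */

    /* Sketch of the calling pattern around the accelerated loop. */
    void run_accelerated_loop(void)
    {
        scpr(0x1);            /* start coprocessor 0 */

        /* ... base processor performs the loop's loads and stores here,
           interleaved with the coprocessor's computation ... */

        while (bcpr_busy())
            ;                 /* BCPR loops back until CODONE is set */
    }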

The premise of our approach is to capitalize on the latency hiding approach [79] in the loop execution itself. The base processor performs all memory operations (fetch / store), while the coprocessor computes the values obtained from those operations.

Figure 5.3: Coprocessor Integration. (a) The HLS coprocessor has many lines connected to the GPR; (b) the custom coprocessor has only one read port and one write port connected to the GPR.


The whole body of the loop is pipelined into different stages (see Figure 5.5). The load segment of the loop is repeated before the loop body, to facilitate loop pipelining.

While the base processor performs the fetching for the second iteration of the loop, the coprocessor can perform calculation for the data fetched during the first iteration.

However, the result from the coprocessor needs to be ready before the end of the second iteration, when the value will be used and stored back to memory. The values in the Instruction Decode (ID) and Writeback (WB) pipeline registers synchronize the execution of the coprocessor with the base processor (i.e. they indicate when the values from memory are available).
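Expressed in plain C, this schedule amounts to software pipelining the loop of Figure 5.1: the operand for iteration k+1 is loaded while the value loaded for iteration k is computed and stored. The sketch below uses the simplified indexing of Figure 5.2, and the compute() call models work that, in the real system, runs concurrently on the coprocessor rather than inline.

    /* Non-memory work that is offloaded to the coprocessor. */
    static int compute(int temp, int Al)
    {
        if (temp < 0)
            temp = -temp;
        return temp >> Al;
    }

    void prepass_pipelined(int Al, int Ss, int Se,
                           const int *blocks, int *absvalues, int *EOB)
    {
        int k, loaded, result;

        *EOB = 0;
        loaded = blocks[Ss];                /* prologue: load for iteration Ss  */
        for (k = Ss; k <= Se; k++) {
            result = compute(loaded, Al);   /* coprocessor: iteration k         */
            if (k < Se)
                loaded = blocks[k + 1];     /* base CPU: load for iteration k+1 */
            absvalues[k] = result;          /* base CPU: store for iteration k  */
            if (result == 1)
                *EOB = k;
        }
    }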

The following simplifications are made to the design. We assume that there will be no preemption or interrupt requests during loop operation; in a multitasking system, a restriction has to be imposed so that swapping does not occur during loop execution. Secondly, the coprocessor must not stall the CPU. This raises the question of what happens when data is not ready for the base processor to use. Such operation latency is already known to the designer at creation time; if such a situation were to occur, NOP instructions would be inserted into the source code. This simplifies the circuitry as well as the logic for such an implementation; alternatively, additional signals could be inserted to handle such a situation.

5.3.2 Advantages & Limitations

1. The high-level synthesis approach provides a fast method for creation of the coprocessor.

2. Since we do not know the schedule of the HLS-based coprocessor, data for the coprocessor has to be available before it starts executing. This requires a large number of register ports. Additionally, we need a special wrapper to stall the coprocessor until the values are ready.

3. Addresses are not known by the automatic C-to-VHDL converter; as such, addresses have to be computed in the base processor. This prevents us from hiding the latency between the address calculations of data required for the coprocessor, and memory accesses.

5.4 Custom Coprocessor Approach

To overcome the limitations of the HLS-based coprocessor, we created a custom coprocessor for which we scheduled the operations carefully, so that they only wait for the dependent data to be ready. In addition, since only one piece of data can be read / written by the base processor at any one time, when such data becomes available, we schedule the coprocessor to fetch it from the register file and store it within the coprocessor. This reduces the amount of interconnect between the processor and coprocessor.

5.4.1 Architecture

As in the HLS architecture, the SCPR and BCPR coprocessor instructions are added to the existing instruction set. The coprocessor connects to and shares the register file of the base processor.

One of the differences, compared with the HLS approach, is that the coprocessor developed now has only one read port and one write port connected to the GPR.

The coprocessor reads the required values from the GPR and stores them in intermediate registers within the coprocessor. Figure 5.3(b) shows the similarities with the HLS approach. However, the register file size is now fixed, at 5 read ports and 3 write ports.

Our integrated coprocessor approach (as opposed to the HLS approach) takes assembly code and converts it to macroblocks. Figure 5.4 shows a code segment with the corresponding graph, where lh and lw are load instructions and sw is a store instruction. A macroblock is detected when a group of interdependent instructions is 'sandwiched' between load instructions and one store instruction. These macroblocks are implemented as components within the coprocessor (manually converted to VHDL and synthesized). Instructions which calculate memory addresses can be grouped together as macroblocks and executed as a coprocessor component (as opposed to the HLS approach, where such execution is only performed in the base processor).

Thus, memory operations (such as lh and lw) occur in the base processor while the macroblock is being executed in the coprocessor. By loop pipelining, while data is being fetched from iteration 2 (Figure 5.5), data from iteration 1 is being processed by the coprocessor. The combined calculation of addresses in the coprocessor improves the overall performance of the application. A coprocessor consists of a number of macroblocks.

The main premise of the integrated coprocessor approach is that memory operations in a single-pipeline processor can only be performed one by one, so it is unnecessary to read in all the register values at the same time. Reading from the register file one value at a time, and storing it into internal registers within the coprocessor when the data is available, is sufficient, avoiding the need to increase the size of the register file.

Instruction SCPR takes the state of the integrated coprocessor from Idle to Ready.

To start a macroblock component within the coprocessor (say macroblock 2 in Figure 5.4), special signals are implemented to signal the availability of the values from the load instructions.

          lh   $20,0($2)
          lw   $22,304($7)
          bgez $20,skip
          subu $20,$0,$20
          nop
    skip: srav $20,$20,$8
          sll  $2,$22,0x2
          addu $2,$2,$3
          sw   $20,0($2)

Figure 5.4: Example code segment & corresponding graph

5.5 Discussion of the Architecture

The execution schedule in Figure 5.5 shows how memory operations and computation can be performed at the same time using this architectural approach. The computation stage of iteration one (marked with '1's) can be hidden during the loading stage of iteration two (marked with '2's), and so on. Execution latency is hidden during memory operations (loads or stores) performed by the base processor.

The graph shows the execution stages in a single-pipeline processor and in our current approach, which is best used when there are large numbers of computation cycles that can be hidden via the loop pipelining technique [40, 89].

A large pipeline depth in any processor design has always been associated with high penalty costs for branch mispredictions, jumps and data dependencies on prior executed instructions [78].

Figure 5.5: Parallel Execution (load / compute / store phases of successive iterations overlapped, comparing a single pipeline with our approach).

Our coprocessor can read the value from a register only after the value has been written back to the register file (i.e. after the writeback stage). If the coprocessor requires a value from the register file that has been updated or written by a prior instruction, then the coprocessor has to wait six clock cycles from the time that the prior instruction was decoded. Thus, this coprocessor architecture would not be feasible for short loops, or for cases where the loop pipelining technique in use does not provide enough time to hide this latency. The custom coprocessor design approach is best used when there is a substantial number of computations

’sandwiched’ between memory operations.

If macroblock 2 in Figure 5.4 is converted to a coprocessor component, then the loaded value in instruction lh can only be read and processed by the component after the writeback stage. The component then writes back the value to the register file before the decode stage of the store instruction sw.

The component created by the HLS framework can be regarded as a function: accepting inputs during an iteration and outputting the results at the end of the iteration. As explained in Section 5.3.2, the addresses of values needed by the HLS component must be calculated by the main processor before the coprocessor needs them. These calculations are necessary because it is not possible to break the loop functionality of the coprocessor using the HLS framework. Such tasks would be more complicated in situations where indirect memory accesses are required. Note that this

only becomes a limitation when the core generated by the HLS framework is being used as a coprocessor in this architecture.

The number of register ports used by the integrated coprocessor approach remains constant (as opposed to the HLS approach; see Section 5.3.1) and would not affect the size of the base processor when the coprocessor increases in complexity. However, Figure 5.8 shows that the size of the integrated coprocessor is actually larger than that of the HLS-based coprocessor, due to the intermediate registers used in the integrated coprocessor architecture (see Section 5.4). Nevertheless, when taken together with the base processor, the integrated coprocessor approach achieves a smaller size compared with the HLS approach.

5.6 Experimental Setup & Tools

We used the SPARK [72] framework in our high-level synthesis approach. SPARK is a C-to-VHDL high-level synthesis framework that employs a set of compiler, parallelizing-compiler, and synthesis transformations. The SPARK methodology is ideal for creating functional ASIC or FPGA blocks / modules from ANSI-C functions that can be used by ASIPs or general purpose microprocessors. Slight modifications are made to the function code before the VHDL component can be generated.

The experimental setup consists of the CPU RTL Model (see Figure 5.6) that is generated by the ASIPmeister CPU generation tool. We used the CPU generation methodology proposed in [124] to provide a basic infrastructure and framework, making use of the existing SimpleScalar [30] toolset. The framework provides rapid generation of an ASIP given a set of CPU specifications. As explained in Section 5.3, two coprocessor instructions (SCPR and BCPR) are added to the existing PISA ISA in the generated CPU. The RTL Model is connected to instruction and data memory models, which are generated by compiling C programs into SimpleScalar binaries and then translating them into VHDL code.

Figure 5.6: Experimental Setup. (The C program is compiled with the SimpleScalar compiler; loops are analyzed and detected from the program source; basic block and graph manipulations drive the generation of the coprocessor RTL model, while ASIPmeister generates the base processor RTL model; together with imem.vhd and dmem.vhd these form the CPU RTL Model and simulation model.)

The coprocessors (HLS and integrated coprocessor) are added and connected to the overall system after the CPU is generated.

Software simulation is performed via ModelSim SE 6.0c, using the Simulation Model shown in Figure 5.6.

The CPU RTL Model is synthesized to gate level using Synopsys Design Compiler [7] W-2004.12 with TSMC 90nm (tcbn90g 110a) standard cell libraries. All registers with a minimum bank size of 4 bits are clock gated by Power Compiler.

The synthesized VHDL files are then simulated in ModelSim SE 6.0c together with the simulation model of data memory and instruction memory. The switching activities obtained are used by Synopsys PrimePower 2003.12 for power calculations (see Figure 5.7).

We also created customized instructions to compare against the coprocessor approaches. The customized instructions used in the extended processor were generated using the approaches developed in [42]. This framework rapidly generates customized instructions with corresponding customized components that are included in the pipeline of the base processor to accelerate the critical code segments. This approach is applied only to the selected loop segment of the Huffman encoder, for comparison with our customized coprocessor approach and the HLS approach.

Figure 5.7: Synthesis and Power Calculation Flow. (The CPU RTL Model is synthesized to gate level with Synopsys Design Compiler, producing area and timing reports; gate-level simulation in ModelSim with imem.vhd and dmem.vhd produces VCD / trace information, which Synopsys PrimePower uses for the power report.)

5.6.1 Verification

The SimpleScalar [30] simulator is modified to provide a framework for rapid prototyping of new extended and coprocessor instructions without implementing the VHDL model (synthesizable VHDL models for all three systems were created, with a complete simulation of JPEG in ModelSim usually taking approximately 150 minutes). The simulator reduces this time to a few seconds and verifies that the change in code does not adversely affect the functionality of the program. When the VHDL model is developed, the data memory dumps from the VHDL and SimpleScalar simulations are compared using the diff unix command. Both memory dumps should be identical.

For verification of new (coprocessor/extended) designs, the memory dumps of the old and new designs cannot be compared, as the program code would have been changed. A program (hexdump) is developed to extract the output files from the memory dump; the output files from the original and new designs are then compared.

5.7 Results

Tables 5.2 and 5.3 show the power and performance improvements of the four designs used in our case study. Area and power figures are measurements of the on-chip components only and do not include the external memory.

Design Approaches        Energy in Loop, Eloop   Loop Energy Savings   Loop, Cloop (cycles)   Loop Improvement
Original Processor       176.903 µJ              NA                    4,213,978              NA
Extended Processor       142.776 µJ              19.29%                3,165,066              1.33x
HLS + Processor          126.917 µJ              28.26%                2,667,332              1.58x
Integrated + Processor   75.964 µJ               57.06%                1,640,340              2.57x

Table 5.2: Loop Energy and Performance Table

Design Approaches        Program, Cprog (cycles)   Program Improvement   Idle Power Usage, Pidle   Loop Power Usage, Ploop
Original Processor       21,317,212                NA                    3.963 mW                  4.198 mW
Extended Processor       20,265,496                5.19%                 4.232 mW                  4.511 mW
HLS + Processor          19,772,662                7.81%                 4.133 mW                  4.758 mW
Integrated + Processor   18,721,162                13.87%                4.125 mW                  4.631 mW

Table 5.3: Program Performance and Power Table

Column 1 in both tables shows the different processor design approaches used in our case study. The energy in the loop segment is shown in Column 2. Column 3 shows the energy savings compared to the original processor implementation.

We define the energy in a loop as:

    E_{loop} = C_{loop} \times T_{clk} \times P_{loop}        (5.2)

where the clock period T_{clk} = 10 ns.
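As a worked check, substituting the original processor's figures from Tables 5.2 and 5.3 into Equation (5.2):

    E_{loop} = 4{,}213{,}978 \times 10\,\text{ns} \times 4.198\,\text{mW} \approx 176.9\,\mu\text{J}

which agrees with the 176.903 µJ entry in Table 5.2.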

All synthesized designs have a 10 ns clock period. The total execution cycles (including all iterations) of our loop segment (identified in Section 5.2.1) are shown in column 4. The loop improvements are shown in column 5. Our approach accelerated program runtime by up to 13.87%, which is close to the theoretical maximum improvement of 18.46% shown in Table 5.1.

In Table 5.3, column 2 gives the total execution cycles taken by the JPEG encoder benchmark program to encode the image chosen in Section 6.2.1. The percentage speedup of the new designs compared to the original processor design is shown in column 3.

Columns 4 and 5 in Table 5.3 respectively show the power used in the processor when no executions are performed (idle stage) and when executing the loop. When not in use, power consumption in our integrated coprocessor is 109.76 µW, whereas the HLS coprocessor uses 53.758 µW (these coprocessor figures are not shown in the table). The graph in Figure 5.8 shows that the total energy used in the loop execution of the integrated version is halved compared to that in the original processor.

Figure 5.8 shows the area usage of the synthesized processor designs and the individual coprocessors. The total processor size includes the size of the coprocessor as well, which is shaded in black. Although the integrated coprocessor is more than twice the size of the HLS coprocessor, the savings in the size of the register file offset this and result in a smaller and more efficient processor design compared to the HLS approach.

5.8 Conclusions

We have performed an interesting case study by exploring a novel and tightly coupled architecture to accelerate a computationally intensive loop in a JPEG encoder.

Loop pipelining and latency hiding are used to achieve near-maximum speedup and parallelism between the base processor and the coprocessor.

Figure 5.8: Area and Loop Energy Usage (loop energy consumption in µJ, and processor and coprocessor sizes in NAND gates, for the base processor, extended processor, HLS coprocessor and custom coprocessor designs).

We also found that the coprocessor approaches achieve much better speedup and lower energy consumption compared with the customized instruction approach. Additionally, using our integrated coprocessor approach, more computations can be offloaded from the base processor to the coprocessor than with the high-level synthesis approach, so as to achieve better performance.

Chapter 6

Customizing by Pipelining

6.1 Introduction

Multicore processors have been used in embedded systems and general computing applications for some time; they execute multiple applications concurrently, with each core carrying out a particular task in the system. Such systems can be found in gaming, automotive real-time systems and video / image encoding devices. They are commonly deployed to overcome deadline misses, which are primarily due to overloading of a single multitasking core. In this chapter, we explore the use of multiple cores for a single application, in contrast to multiple applications executing in a parallel fashion. A single application is parallelized using two different methods: a master-slave model; and a sequential pipeline model. The systems were implemented using Tensilica's Xtensa LX processors, with queues as the means of communication between the cores. In the master-slave model, we used a coarse-grained approach in which a main core distributes the workload to the remaining cores and reads the processed data before writing the results back to file. In the pipeline model, a lower granularity is used. The application is partitioned into multiple sequential blocks;


each block representing a stage in a sequential pipeline. For each model we applied a number of differing configurations ranging from a single core to a nine-core system.

We found that, without any optimization, the sequential pipeline approach for the seven-core system had a more efficient area usage, with an area increase to speedup ratio of 1.83, compared to the master-slave approach with a value of 4.34. With selective optimization in the pipeline approach, we obtained speedups of up to 4.6×, with an area increase of only 3.1× (an area increase to speedup ratio of only 3.1/4.6 ≈ 0.68).

6.2 Background

This case study is based on mapping different parts of the benchmark program and employing a set of industrial tools to rapidly optimize and simulate the entire system in a multiprocessor configuration. The partitioning of the program (initially based on functions in the source program) is performed by analyzing the benchmark results of the simulation. The set of industrial design tools enables us to quickly explore the extent of improvements and area usage of a heterogeneous multiprocessor system.

6.2.1 Case Study Application

A freeware JPEG compression algorithm implementation is used in this case study.

The simple nature of the program benefits this case study, since various sections of the code can be distinguished, partitioned and separated into a multiprocessor configuration. The program partitions are created based on the different stages of processing taken from a standard JPEG encoding algorithm (such as DCT, quantization, zero run-length encoding and Huffman encoding) [8]. This partitioning was done in such a manner that cores could be easily upgraded if one of the algorithms changed. For

example, to implement a better DCT algorithm, only a single stage had to be altered.

This level of partitioning allows the architecture to be retained while the software and core of a particular stage are changed to match changes in the algorithm.

Figure 6.1 shows the various partitions or stages of the program that have the potential to be allocated to different processors. The arrows indicate the flow of RAW bitstreams through the various stages of the encoding process before they are written out to file.

The JPEG encoder program initially accepts a configuration file which specifies the name of the RAW file to read, the quality factor and the format of the RAW image.

These tasks are performed in Stage 1. The program then initializes the quantization tables and writes the appropriate JFIF header information to the output file, which includes the Quantization and Huffman tables (stages 9 and 10). The program allocates two main buffers: one for the complete RAW image that is read from file, and the other for the resulting JPEG file.

The JPEG program then starts reading RGB values from the buffer and converts them to YCbCr values (stage 2). These values are then level shifted (stage 3), based on the JPEG specifications. One macroblock at a time is then selected, in the sequence Y, Cb and Cr, to be DCT transformed and quantized, with the values ordered in a zigzag manner (stages 4, 5 and 6). The pixel streams are fed into the Huffman encoder (stage 7), which processes these streams serially. The generated code is finally output to a file (stage 8).

6.2.2 Baseline Processor Description

This case study uses Tensilica's Xtensa LX processor [20]. The Xtensa LX is part of Tensilica's line of microprocessor cores, which is configurable, extensible and supported by automatic hardware and software generation tools. The synthesizable core is configurable to allow designers to precisely tailor each processor implementation to match the target application requirements. The Xtensa core ISA has a 24-bit instruction set base and allows 16-bit instructions for higher code density; all instructions can operate on 32-bit data.

1  Read RAW         5  DCT horizontal    9   Initialize QT
2  RGB to YCbCr     6  Quant / Zigzag    10  JFIF Markers
3  Level Shift      7  Huffman           11  Close bitstream
4  DCT vertical     8  Write to file

Figure 6.1: The main stages in a JPEG encoder.

The Xtensa LX, like previous Xtensa processors, can support extended instructions, written in the Tensilica Instruction Extension (TIE) language. Such instructions can do the work of multiple instructions of a general-purpose processor. Extended instructions include fusion instructions [149], SIMD/vector instructions and Flexible Length Instruction Xtensions (FLIX) [23] instructions, which are VLIW-like instructions by which multiple operations can be done in a single instruction.

Figure 6.2: Xtensa queue interface (an output queue interface with PushReq and Full signals per output queue, and an input queue interface with PopReq and Empty signals per input queue, connecting two Xtensa LX processors).

TIE queues and ports have been introduced in Tensilica's Xtensa LX processors; they are used to communicate with the world outside of the processor and can do so at a much wider bandwidth than existing interconnects. Queue interfaces are used to pop an entry from an input queue for incoming data, or to push data to an outgoing queue (refer to Figure 6.2). The logic that stalls the processor when it tries to read an empty input queue or write to a full output queue is automatically generated by the Xtensa Toolset. Ports are wires that the processor uses to directly sample the value of an external signal or to drive the value of any TIE state on external signals.

Functions are created to push and pop from the queues. These are blocking functions, since a push into a full queue or a pop from an empty queue results in a stall of that particular pipeline stage. These functions are TIE instructions and form part of the extended instructions of the Xtensa LX processor architecture.
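A pipeline stage is then just a loop around these blocking primitives. Below is a minimal sketch of the Level Shift stage (stage 3); IQ_POP and OQ_PUSH are hypothetical names standing in for the generated TIE queue intrinsics, and the subtraction of 128 follows the JPEG convention of centring 8-bit samples around zero.

    #include <stdint.h>

    /* Hypothetical stand-ins for the generated TIE queue instructions;
       both block in hardware when the queue is empty/full. */
    extern int32_t IQ_POP(void);          /* pop from the input queue  */
    extern void    OQ_PUSH(int32_t v);    /* push to the output queue  */

    /* The Level Shift stage: consume a sample, shift it, pass it on. */
    void level_shift_stage(void)
    {
        for (;;) {
            int32_t s = IQ_POP();   /* stalls until data arrives        */
            OQ_PUSH(s - 128);       /* stalls if the next stage is full */
        }
    }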

The configuration of the base processor used in the case study had been optimized to provide satisfactory results when executing the benchmark application in a single processor system, and is shown in Table 6.1 as LX1. Also shown is a highly stripped-down version of the Xtensa LX processor, LX2, which will be used to replace under-utilized cores to save area and power (see Section 6.3).

6.3 Methodology

We explore the various ways a multiprocessor system can be configured to speed up a simple application. In this section, we outline the multiprocessor architecture as we increase the number of cores in the system. Our methodology uses the queue interfaces which are available on Tensilica's Xtensa LX [20] processors. A simplified JPEG encoder is modified and partitioned to execute on such a system.

We investigate a multiprocessor architecture in a pipeline configuration. The system consists of different processors, each running a portion of a pipeline stage of a program.

Parameter               LX1                     LX2
Speed                   533 MHz                 533 MHz
Process                 90nm GT                 90nm GT
Pipeline length         5                       5
Size                    63,843 gates            39,789 gates
Core Size               0.32 mm2                0.18 mm2
Core Power              74.35 mW                41.3 mW
Memory Area             1.76 mm2                0.15 mm2
Instruction Cache       32kB                    1kB
Data Cache              32kB                    1kB
ISA Instruction         MUL32, MUL16,           density instructions,
Options                 density instructions,   boolean registers,
                        boolean registers,      zero overhead loops,
                        zero overhead loops,    TIE wide stores,
                        TIE wide stores,        TIE arbitrary bytes
                        32 bits sign extend
Max instruction width   8 bytes                 3 bytes
PIF interface width     128 bits                32 bits

Table 6.1: Processor Configuration

Each processor has the potential to be configured optimally, by instantiating only those resources that are appropriate for a particular stage of the pipeline.

A heterogeneous processor system minimizes the redundancy of resources, as those processors that deal with complex computations may be parameterized with extra resources. Communication among processors is facilitated using the ports and queues that are provided by the Xtensa LX [20] processor architecture.

The multiprocessor pipeline architecture design requires programs that can be broken up into computationally independent blocks, resembling the computational blocks in a pipeline processor architecture [78]. Transfer of data from one processor to another is facilitated by a queue.

In the case of the JPEG encoder, the pipeline architecture is ideal, as the JPEG encoder displays characteristics of a pipeline nature, given that the encoding process is divided into stages which are independent of each other. The proposed architecture

consists of standalone processors running sections of the JPEG encoder program that have been recompiled as individual programs. These subprograms, which reside in these processors, accept data via the queues of the Xtensa processor, perform the necessary computation, and finally push the results via the output queue into the next stage of the pipeline. The computed data traverses the pipeline stages until it is finally written out to file by the last processor in the pipeline. It should be noted that, while one processor is working, the rest of the system is still busy processing workloads for other stages.

The scalability of a multiprocessor pipeline architecture depends entirely on the suitability of the data structure and control flow of the targeted program. A particular configuration is considered efficient when all processors have an equal computational workload, i.e. when no processor in the pipeline has to wait for another stage to complete.

6.3.1 Single Pipeline

The rationale for a pipeline implementation is to increase throughput during execution. As in a pipeline microprocessor, a pipeline implementation of a JPEG encoder is able to encode images at a faster rate. A pipeline approach greatly increases processing speed, up to the point where it meets the minimum processing demand for a particular application (its minimum frame rate).

Figure 6.3 shows the stages in a pipeline processor implementation. Registers are used to separate the different stages in the pipeline. A pipeline processor implementation uses instruction-level parallelism (ILP) to exploit the parallelism among instructions in a code. However, such an approach is limited by the amount of code which can be overlapped within a basic block – a straight-line code sequence with no branches in except at the entry, and no branches out except at the exit.

Figure 6.3: Stages in a pipeline processor architecture (IFETCH1, IFETCH2, DECODE, EXE, MEM, WB).

The work in this section also exploits task-level parallelism in order to obtain further performance improvement. As in a pipeline processor implementation, a multiprocessor system is used, where different processors represent different stages in the pipeline. Each stage is mapped to a particular task of a benchmark program; different stages are separated via TIE queues (as opposed to the registers in a pipeline processor).

Five Cores

The JPEG encoding process (Figure 6.1) contains sequential routines that can be broken up into stages, offering the possibility of pipelining the encoding process.

These stages represent functions within the original program, which are extracted and compiled as individual programs, each to be executed on a single core.

Figure 6.4: A five core system (cores A–E) interconnected by queues, taking a RAW image as input and producing a JPEG stream. Each processor is assigned a stage of the JPEG pipeline.

The main program of the encoder sends the required information to the appropriate stages of the pipeline in order to initialize the quantization tables and JFIF [75] headers that are written out to a file. As each pipeline stage only has to wait for the data from the previous stage, the partitioned program is constructed such that one core reads the RAW image while another writes the encoded JPEG into a new file. Each stage processes data at a macroblock level (8 × 8 pixels).

We start with a five core multiprocessor configuration (Figure 6.4), pipelined into five major stages. The quantization table initialization code shares the same core as the one that implements the quantization stage of the pipeline. This stage receives initial values from the main program (core A), which reads in encoding parameters that define the quality of the resulting image. Core D has the necessary code to initiate the writing of JFIF markers and closing the JPEG bit stream. The last core (Core E) is initialized by the first core with the name of the output file and writes any bytes received from the previous stage (Core D) to file. Table 6.3 summarizes the allocated stages to the respective cores.

Six Cores

We next introduce a new core into the system and allocate the LevelShift stage to it (refer to Table 6.3). The LevelShift stage accepts YCbCr values from the previous stage, level shifts the values and then pushes them out to the queue in macroblocks of 64 Y's, 64 Cb's and 64 Cr's. As will be shown in Section 7.6, the introduction of this stage into the pipeline does not increase the overall performance of the encoding process.
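The operation itself is tiny; a sketch of the stage's computation is shown below (the function name is ours). JPEG's DCT expects samples centred on zero, so each 8-bit sample is shifted from [0, 255] to [-128, 127]:

    #include <stdint.h>

    /* Level shift one 8x8 component block in place: subtract 128 from
     * every sample so the subsequent DCT sees zero-centred data. */
    void level_shift(int16_t block[64])
    {
        for (int i = 0; i < 64; i++)
            block[i] = (int16_t)(block[i] - 128);
    }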

Seven Cores

DCT (Discrete Cosine Transform) transformations are known to be very computation intensive, and special circuits exist for performing them. In our next approach, the two-dimensional DCT function is split up into two stages: a one-dimensional DCT vertically, and a one-dimensional DCT horizontally. A seven core processor configuration benefits from such an approach (refer to Table 6.3).
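The split is possible because the 2D DCT is separable: a 1D DCT applied to every row, followed by a 1D DCT applied to every column, equals the full 2D transform, so the two passes can be placed in different pipeline stages. A sketch is shown below (dct_1d_8 stands for any 8-point 1D DCT kernel; the names are ours):

    /* Any 8-point 1D DCT kernel, transforming v in place. */
    extern void dct_1d_8(double v[8]);

    /* First pipeline stage: 1D DCT over every row of the block. */
    void dct_rows(double b[8][8])
    {
        for (int r = 0; r < 8; r++)
            dct_1d_8(b[r]);
    }

    /* Second pipeline stage: 1D DCT over every column of the block. */
    void dct_cols(double b[8][8])
    {
        for (int c = 0; c < 8; c++) {
            double col[8];
            for (int r = 0; r < 8; r++) col[r] = b[r][c];
            dct_1d_8(col);
            for (int r = 0; r < 8; r++) b[r][c] = col[r];
        }
    }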

6.3.2 Multiple Pipelines

Combining both approaches, parallel computations of macroblocks and pipelining, we are able to exploit further parallelism within the JPEG compression algorithm.

The pipelined nature of the previous core systems is maintained. However, from stage four onwards (DCT), macroblocks for luminance (Y), chrominance red (Cr) and chrominance blue (Cb) are processed in separate parallel pipelines. This reduces the processing bottleneck in the DCT and quantization/zigzag stages.

These parallel pipelines include DCT and Quantization (with zigzag) (QZ) stages.

The outputs of these parallel pipelines then converge into a single pipe where Huffman encoding is performed. Huffman encoding depends on serial input and thus, cannot process separated JPEG streams independently.

Nine Cores

Figure 6.5: A nine core system with three internal pipeline flows. A raw-file read stage (RGB conversion + levelshift) fans out into separate Y, Cb and Cr pipelines, each with its own DCT and QZ stages, which then converge for Huffman encoding and writing to file.

Following the five pipeline stage multiprocessor approach in Method II, we try to increase the throughput of the middle stages of the JPEG compression pipeline by replicating stages of the pipeline, because of the heavy utilization rates in the DCT and Quantization stages (refer to Table 6.3). The pipeline flow diverges only at stages four and five (refer to Figure 6.1) into three separate pipeline flows, before converging back into a single processor at the Huffman encoding stage. The E, F and G cores implement the Quantization stages and process the Y, Cr and Cb macroblocks separately. These cores initialize their respective quantization tables (Y, Cr, Cb). This results in a nine core processor system.

The utilization of each core in the system is shown in Table 6.2.

  Cores            A    B    C    D    E    F    G    H    I
  Utilization (%)  76   51   51   51   28   28   28   95   99

Table 6.2: Utilization in a nine core multipipeline system

Seven Cores

Table 6.2 shows that the utilization rates of the Quantization cores (E, F and G) in the nine core system are very low, prompting us to replace all three of these cores with just one. This is due to the bottleneck in the second last stage of the pipeline (H), where the Huffman encoding has already reached its maximum throughput.

Note that Stage I is not considered, since it is constantly looping. Outputs from the three separate DCT cores are now channeled into a single core that will perform quantization and zigzag transformations.

With the seven core multiprocessor system, we have methodically reduced the area consumption of the system. Based on the utilization rates of each core in the pipeline, we were able to selectively optimize the required cores using Tensilica's XPRES compiler, which automatically generates TIE instructions (SIMD, FLIX, vector, fusion).

When the runtime of the selectively optimized system closely matches that of the fully XPRES compiled version, we replace the cores which have very low utilization rates with simpler ones, including the replacement of LX1 cores with LX2 cores (refer to Section 6.2.2), and progressively reduce the sizes of the instruction and data caches (until they reach the same performance as the fully XPRES compiled version, or reach the minimal configuration of 1kB). This methodology results in a heterogeneous multiprocessor system that provides a high ratio of performance improvement to area increase.

          Stages (single pipeline)           Stages (multiple pipelines)
  Cores   5 cores    6 cores    7 cores      7 cores      9 cores
  A       1, 2, 3    1, 2       1, 2         1, 2, 3      1, 2, 3
  B       4, 5       3          3            4, 5         4, 5
  C       6, 9       4, 5       4            4, 5         4, 5
  D       7, 10, 11  6, 9       5            4, 5         4, 5
  E       8          7, 10, 11  6, 9         6, 9         6, 9
  F       -          8          7, 10, 11    7, 10, 11    6, 9
  G       -          -          8            8            6, 9
  H       -          -          -            -            7, 10, 11
  I       -          -          -            -            8

Table 6.3: Processor configuration with multiple pipeline flows

6.4 Experimental methodology

We used Tensilica’s Xtensa RA2006.4 Toolset for the Xtensa LX family of processors.

The toolset also provides a set of compilation tools to compile C/C++ code, targeted to our specially configured Xtensa LX cores (refer to Section 6.2.2). The Tensilica Instruction Set Simulator (ISS) and Xtensa Modeling Protocol (XTMP) environment were used to run the multicore systems; XTMP is an API and runtime environment for rapid multiprocessor description and analysis, using its own simulation engine and generating SystemC-compatible models.

For each system, multiple Xtensa cores were instantiated and XTMP was used to connect them to peripherals and interconnect. The ISS directly models the Xtensa pipeline and operation as a system-simulation component within the XTMP environment. With XTMP, different multiprocessor configurations could be simulated in a short amount of time.

The simulator allows for communication between the cores and peripherals using a cycle-accurate, split-transaction simulation model without using a clock. The ISS was used to generate profiling data for all cores in the system, which was then analyzed using Tensilica's gprof profiler. The profiles can include the cycle counts for all functions executed by the cores. The ISS can also print a summary of the total cycle count and the global stalls of each core.

Each individual core is connected via the queue interface provided by the Xtensa LX core using the XTMP environment. We create C-code functions and data structures to model the queues within the XTMP environment. The queues are simple FIFO (first-in, first-out) components that mainly operate via the functions push and pop, called by each of the connected cores in the simulation environment. Queues transmitting RAW bit streams between processors are modeled to have 64 entries. A full queue or an empty queue effectively stalls that section of the pipeline.
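A minimal sketch of such a queue model is given below. The names and the boolean-return convention are ours, not part of the XTMP API; the wrapper around each simulated core treats a false return as a stall of that core.

    #include <stdbool.h>
    #include <stdint.h>

    #define QUEUE_ENTRIES 64   /* depth used for the RAW-stream queues */

    /* A circular FIFO modelling one TIE queue between two cores. */
    typedef struct {
        uint32_t data[QUEUE_ENTRIES];
        int head, tail, count;
    } fifo_t;

    /* Returns false when full: the producing core is then stalled,
     * just as the generated queue hardware would stall it. */
    bool fifo_push(fifo_t *q, uint32_t word)
    {
        if (q->count == QUEUE_ENTRIES) return false;   /* PUSH stall */
        q->data[q->tail] = word;
        q->tail = (q->tail + 1) % QUEUE_ENTRIES;
        q->count++;
        return true;
    }

    /* Returns false when empty: the consuming core is stalled. */
    bool fifo_pop(fifo_t *q, uint32_t *word)
    {
        if (q->count == 0) return false;               /* POP stall */
        *word = q->data[q->head];
        q->head = (q->head + 1) % QUEUE_ENTRIES;
        q->count--;
        return true;
    }

A fifo_t is zero-initialized before use; in the XTMP simulation program, one such object backs every arrow in the pipeline topology.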

We created multicore processor systems by identifying hotspots within the single processor benchmark application (see Figure 6.6-1). The hotspots were identified by cross-compiling the benchmark application using the Tensilica Xtensa LX compilation tools and by simulating it on a selected configuration, namely LX1 (refer to Table 6.1). The hotspots were mainly the functions identified in Section 6.2.1. We partition and allocate these functions based on the methodology defined in Section 6.3 (see Figure 6.6-2). Configurations and topologies of the multiprocessor systems were created manually (see Figure 6.6-3). An XTMP simulation program, specially customized to generate profiling and other relevant benchmark information, was created for each of these multiprocessor systems, which were then simulated and their performance and area utilization recorded (see Figure 6.6-4).

Figure 6.6: Experiment Methodology. (1) The original program is cross-compiled against the processor library and profiled on the ISS; (2) the designer partitions it via its data flow graph into partitioned programs; (3) pipeline configurations are created manually and simulated; (4) the best architecture is chosen; (5) selective optimization is applied – running XPRES on critical stages, switching under-utilized cores and reducing the caches of under-utilized cores – and the loop repeats until the pipeline is balanced.

The architecture with the best ratio of performance increase to area increase is selected for further optimization. Our approach manually investigates the most appropriate stage to optimize, one step at a time; this selective optimization seeks to produce an architectural configuration that has well-balanced utilization among all stages in the pipeline (see Figure 6.6-5). The generated systems include similarly configured cores, excluding parameterized components such as the number of outgoing and incoming queues.

The toolset also includes the XPRES (Xtensa PRocessor Extension Synthesis) compiler, which creates tailored processor descriptions for the Xtensa processors from native C/C++ code. The XPRES Compiler was used to create custom instructions (by generating RTL models) for each core in the system. Using the designer-defined input of C programs to be analyzed, XPRES extends the base processor with new instructions, operations and register files using TIE extensions. It does so by automatically generating a new TIE file that can then be included for recompiling the source code. XPRES was used to create a distinct TIE file for each core in each system, to optimize each individual core using only the C files that are used on that particular core. Each individual core in the multiprocessor system is compiled through XPRES to explore the extent of improvements that can be obtained via extended instructions.

Area counting includes the base processor, instruction & data caches and the TIE instructions. Each multiprocessor system generated in the case study reads a RAW file and saves it as a JPEG format file, viewable by any standard image viewing application.

6.5 Results & Analysis

Figure 6.7 shows the runtime improvements and area increase in relation to the original core, LX1 (refer to Section 6.2.2). The graph shows the two main architectures used in this case study: a single pipeline architecture (5, 6 & 7 cores) and a multipipeline architecture (7 & 9 cores). It can be seen that the area increase to runtime improvement ratio for each of the multiprocessor systems has a value of more than one, and actually increases as more processors are added to the system. These ratios do not justify a need for increasing the number of processors in the system. With multiprocessor pipeline systems, the improvements seem to level off for seven cores and above, due to the fully saturated Huffman encoding stage. Unless the Huffman stage could be further partitioned and parallelized, this would remain a critical stage in the pipeline.

Figure 6.7: Performance of multiprocessor systems without optimizations. The chart plots area (original and XPRES, in million gates) and runtime (original and XPRES, in million cycles) for the 1 core, 5 core (SP), 6 core (SP), 7 core (SP), 7 core (MP) and 9 core (MP) systems.

As a form of measurement, the systems have been compiled with XPRES to find the maximum improvement if all cores were optimized. The maximum performance improvement is obtained from the nine processor system, with a performance increase of 3.8×; and 4.7× when run through the XPRES compiler for each of the nine processors (refer to Figure 6.7).

It is not viable to continue this approach of adding processors to improve performance, as area increases faster than improvement in performance. However, by reducing resources on non-critical processors, we can reduce area, yet keep the same level of performance. We selectively optimized critical stages of the pipeline.

We selected the seven core multipipeline system for further optimizations as it performs almost as well as the nine core architecture while using much less area. Figure 6.8 shows the utilization of each of the seven cores in a multipipeline architecture (refer to Section 6.3.2). The first two graphs in Figure 6.8 show the utilization of the system without optimizations and with XPRES optimizations respectively. Area increase is 7× and 7.7× respectively, with performance improvements of 3.8× and 4.7× relative to the base processor implementation (represented by the decreasing and steady lines in the graph).

It should be noted that the last pipeline stage is always at 100% utilization due to its software implementation, which repeatedly polls for incoming data on every simulation cycle. In Partial XPRES 1, we replaced the original core of the Huffman encoding pipeline stage with an XPRES version. This optimization step moves the critical stage elsewhere: in Partial XPRES 1, the critical path has moved on to the Quantization stage. We replace the Quantization pipeline stage with an XPRES version in Partial XPRES 2, once again making Huffman encoding the critical stage. In this implementation, it can be seen that the parallel pipelines of the DCT stages are not fully utilized. We replaced these cores with LX2 cores, resulting in a utilization jump from 62.3% to 88.3% while area is reduced from 7.1× to 3.8×. At this point, we achieve an area increase to performance improvement ratio of 0.82.

Further optimizations were achieved when reducing the cache sizes of the first core in the pipeline from 32KB to 1KB. This does not significantly affect performance, since this core mainly reads the RAW files and outputs them to the pipeline. The ratio is further reduced to 0.68 while still maintaining a performance improvement of 4.62×.

Figure 6.8: Utilization of the seven pipeline stage systems. Processor utilization (0–100%) is plotted for the initial multipipeline, fully XPRES compiled, and Partial XPRES 1–4 configurations.

Figure 6.9: Runtime improvements and area increase. Area increase and performance increase (both relative to the single core) are plotted for the initial multipipeline, fully XPRES compiled, and Partial XPRES 1–4 configurations.

A complete design space exploration is not feasible due to the huge number of combinations that can be generated when each stage in the pipeline is configured independently (cache configurations range from 1KB to 32KB for each instruction and data cache, and the base instruction set of 80 instructions is extensible to include MAC, coprocessors, SIMD instructions and custom instructions). This would require vast amounts of time to simulate each and every configuration. However, certain stages in the pipeline (i.e., critical stages) should be optimized ahead of the others.

Figure 6.10: Design Space for JPEG Encoder.

Thus, the selective approach to optimization shown in Figure 6.6 is used. This method greatly reduces the time needed to find a configuration which produces a good ratio of area increase to performance improvement.

To obtain a holistic view of the optimization problem, an approximation of the complete design space is shown in Figure 6.10, which shows the performance improvement of the various configurations. These configurations include the LX1 (XPRESed and non-XPRESed versions) and LX2 cores with various cache configurations. The line in Figure 6.10 denotes the points where the performance increase equals the area increase relative to the single processor benchmark configuration (refer to Section 6.2.2). Runtime improvements above this line represent configurations which have a good area increase to performance improvement ratio. Data sets appearing below the line are configurations which are not fully utilized and thus have the possibility of being down-scaled to decrease area use while maintaining performance improvement. It is also noted that our selective optimization method manages to closely match the maximum performance improvement in the whole design space.
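Equivalently, writing S for the speedup of a configuration over the single LX1 core and G for its area ratio (a restatement in our own notation):

    S = \frac{R_{\text{1 core}}}{R_{\text{config}}}, \qquad
    G = \frac{A_{\text{config}}}{A_{\text{1 core}}}, \qquad
    \text{break-even line: } S = G

Configurations above the line satisfy S > G, i.e., their area increase to performance improvement ratio G/S is below one.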

A thorough design space exploration would have taken months of simulation. The approximation in Figure 6.10 was obtained by simulating and profiling each stage of the pipeline independently. The runtime of a particular configuration is

    R = R^{\text{init}}(v_1) + R^{\text{process}}(v_{crit}) + R^{\text{final}}(v_J) \qquad (6.1)

where R^{init}(v_1) is the initialization time of the first stage of the pipeline, R^{process}(v_{crit}) is the longest execution time of a kernel in the pipeline, and R^{final}(v_J) is the finalization time of the last stage of the pipeline. Equation 6.1 is used and further explained in Chapter 7.

The equation is used to permute the various configurations and to calculate the total runtime of a set of configurations (different stages have different configurations).

Such calculations would result in a slight error (less than 2%), though this would not affect the overall trend of the graph.

Extending the selective optimization technique described at the beginning of this section, a formal methodology could be developed to automatically select the appropriate configuration for the critical stage of the pipeline after each configuration change. This would be equivalent to filtering out the redundant configurations that would otherwise worsen the area increase to performance improvement ratio.

6.5.1 Further Architectural Comparison

In addition to the pipeline architectures used in this chapter, we have also performed a case study of master-slave architectures in [139]; please refer to [139] for implementation details and analysis.

Master-slave models have been used in parallel computing to enable task management and parallel and distributed data structures [82]. In a master-slave model, a master program is given the responsibility to spawn processes, and to initialize, collect and display results, while the slave programs perform the computations [108]. A master-slave model of a multicore system was implemented with a varying number (from three to seven) of Xtensa LX processor cores, which were instantiated using the Xtensa LX XTMP/ISS environment. In each model, there was only one main core and (N − 1) slave cores, where N is the total number of cores in the system.

In contrast to the other approaches using shared communication buses [64], communication between cores is achieved via TIE queues. Figure 6.11 shows the performance of these master-slave models in relation to our single/multi pipeline models.

It can be seen in Figure 6.11 that the area of the master-slave implementations linearly increases as more cores are added into the system. The runtime value, however, levels off at the six and seven core implementations. The asymptotic trend shows that the runtime improvement is bounded by the processing rate of the master processor.

Figure 6.11 also shows that for the same number of cores implemented in the system, the pipeline models perform better than those using the master-slave concept.

Figure 6.11: Performance of multiprocessor systems without optimizations. Area (original and XPRES, in million gates) and runtime (original and XPRES, in million cycles) are shown for the 1 core system, the master-slave (M/S) systems with 3–7 cores, the pipeline (P) systems with 5–7 cores, and the multipipeline (MP) systems with 7 and 9 cores.

6.6 Conclusion

In conclusion, we have performed a case study exploring the use of multiple cores in master/slave and pipeline configurations. Communication amongst these cores is facilitated using the queues introduced in Tensilica's Xtensa LX [20] configurable cores. We have also analyzed the effect of increasing the number of cores in the system on the achievable performance improvement. The XPRES tool has been used to selectively optimize over-utilized cores, while under-utilized cores were replaced by cores with fewer resources. The result of the optimization is contrasted against the design space of the benchmark application. Our selective optimization approach closely matches the maximum performance gain achievable in the overall design space. We have found that such a multicore architecture has the potential to minimize the area of those cores that require less computation in the system. The pipeline architecture can be exploited to provide an increase in performance, with the possibility of far outweighing the increase in area. We have shown that a heterogeneous multiprocessor system is able to provide the necessary speedup while minimizing gate count, providing a very low ratio of area increase to performance improvement.

Chapter 7

Design Space Exploration

7.1 Introduction

Multiprocessor SoC systems have enabled the use of parallel hardware along with associated software. Approaches have included coprocessors, homogeneous processors (e.g., SMP) and application specific architectures (e.g., DSP, ASIC). Recently, ASIPs have emerged as a viable alternative to conventional processing entities (PEs) due to their configurability and programmability. In this work, we introduce a heterogeneous multiprocessor system that uses ASIPs as processing entities in a pipeline configuration. A streaming application is manually partitioned into a series of algorithmic stages, each of which is executed on a single processor, with queues between processors for communication. By carefully customizing each processor, the pipeline is balanced (i.e., processing times are nearly equal), allowing speedups with little overhead. We formulate the problem of mapping each algorithmic stage in the system to an ASIP configuration, and propose an estimation technique and a heuristic to efficiently search the design space for a pipeline-based multi-ASIP system.

We have implemented the proposed heterogeneous multiprocessor methodology using a commercial extensible processor system (Xtensa LX from Tensilica Inc.), and evaluated the resulting design by creating two benchmarks (JPEG and MP3 encoders) in a pipelined fashion. Our multiprocessor design provides a performance improvement of at least 4.03× for JPEG and 3.31× for MP3, over a single processor design system. The minimum cost (area × clock cycle) obtained through our heuristic was within 0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks respectively (using the estimation-based search method). The heuristic solution was reached within a fraction of a second, whereas some configurations took several times longer to converge when the estimation-based method was used.

7.2 Background

Our work is based on partitioning sequential streaming applications into a pipelined processing structure, where each stage is executed by at least one processor. The application is assumed to have the following characteristics:

1. Each streaming application contains a kernel (main processing code segment), which is partitioned into several pipeline stages. This kernel is executed multiple times (in JPEG, for example, it is executed once every frame). Minor loops with small repeated code segments are considered atomic and would not be further partitioned.

2. The application exhibits a dataflow software architecture. Input data is processed sequentially and deterministically, and output as results in the same order and manner.

An application with the above characteristics can be partitioned to represent different stages in a pipeline flow. The partitioned application is derived from a standard sequential program. Figure 7.1 shows possible designs that can be implemented as a pipelined multiprocessor system. Each design has a set of cores, where each core represents a stage in the pipeline process. A particular process in a stage takes inputs from the previous stage; stages are interconnected by FIFO (First In, First Out) queues. These connected systems thus allow each core to run independently of the others, provided it has the necessary input to begin data processing.

Figure 7.1: Possible design configurations for a pipelined multiprocessor system. Designs (a)–(d) are JPEG encoder configurations (five, six, five and nine cores) and designs (e) and (f) are MP3 encoder configurations (six and four cores); each design is a chain of pipeline stages (stage1–stage5), some of which are duplicated into parallel flows.

Each multiprocessor configuration in Figure 7.1 has a large design space. Cores in the various stages of the pipeline can be configured and mapped to special-purpose hardware. A particular configuration that is generated for a design has to be optimal or near-optimal (i.e., cost efficient). This near-optimal configuration can be achieved by adjusting the hardware parameters of each processor to achieve the required performance at the lowest possible cost. Finally, the system is optimized such that the increase in performance more than offsets the area increase incurred by pipelining the system.

7.2.1 Benchmark Applications

Readily partitioned benchmark programs are not freely available to the research community. We created our own set of benchmark applications based on single processor benchmarks. Two freeware compression algorithms, for MP3 and JPEG encoding, were chosen and ported to the Tensilica Xtensa LX platform architecture [20].

The data flow graphs were obtained by analyzing the data stream through the benchmark applications. These benchmarks are partitioned manually into various pipeline/data flow stages, adhering to the respective standards (JPEG & MP3). The partitions are then mapped to stages in a pipeline system (refer to Figure 7.1) and executed as standalone programs on Xtensa LX processors.

We methodically merge and duplicate pipeline flows according to profiles obtained through simulations. If neighboring pipeline stages can be combined without affecting the execution time of the entire program, they are merged. On the other hand, pipeline stages that represent bottlenecks in the flow are duplicated. By following such a procedure, it is possible to explore the design space systematically and eventually obtain a configuration that is close to the optimal solution.

We created four multiprocessor configurations for the JPEG encoder and two configurations for the MP3 encoder. Figure 7.1 shows the set of designs for the two benchmark applications. The figure also shows the connectivity of each stage in the pipeline; each arrow denotes a FIFO connection to the next stage in the pipeline.

JPEG Benchmarks

The JPEG benchmark application performs its tasks in several stages. Firstly, it reads in RAW files that contain image information in RGB (Red, Green, Blue) format. A color conversion and level shift is performed on each macroblock in the image before a two-dimensional DCT transformation is performed on it. The transformed dataset is quantized (lossy compression) and zigzag reordered. A zero-runlength encoding is performed before a Huffman encoder is used for lossless compression. The resulting output has appropriate JPEG headers appended and is written as a JPEG file.

Five multipipeline processor JPEG benchmarks were constructed based on major tasks identified in the single processor implementation. The task graph is constructed based on the flow of the dataset (in this case, the macroblock of the image) from one task to another. Communication among tasks is analyzed. As the JPEG benchmark application was originally written as a single processor implementation, shared data structures had to be decoupled so that individual tasks could be executed independently on different processors. The communications among different tasks in different processors are modelled using queues.

Figure 7.1(a) shows a simple example of the JPEG benchmark program, partitioned into five main stages: 1) read raw image, color conversion & level shifting; 2) 2D DCT; 3) quantization & zigzag transformation; 4) Huffman encoding; and 5) write back to file. In another variation (see Figure 7.1(d)), stages 2 and 3 are duplicated three ways to handle the luminance, chrominance red and chrominance blue channels.

In Figure 7.1(c), the DCT, quantization and zigzag transformation are merged into one pipeline stage, reducing the number of pipeline stages to four. The configuration in Figure 7.1(b) provides the same number of pipeline stages, with the two chrominance component pipeline flows merged into a single flow.

MP3 Benchmarks

The MP3 benchmark is partitioned into several stages of operations. The application first reads in PCM encoded bitstreams from a file. The bitstream then enters the polyphase filtering stage of the encoder. After the signal is filtered into different sub-bands, an MDCT operation is performed, followed by bit and noise allocation. The bitstream is then formatted before being written back to file as an MP3 stream.

The first implementation of the MP3 benchmark is shown in Figure 7.1(e). The system consists of four pipeline stages: 1) reading the PCM file, 2) polyphase filtering, 3) MDCT and 4) bit & noise allocation and writing of the bitstream. Stages 2 and 3 are duplicated to perform parallel computations of the stereo channels (i.e., left and right channels). Figure 7.1(f) shows an implementation with merged pipeline stages 1 and 2. Left and right channel PCM data are read simultaneously. The MDCT stage (stage 3) is adequate as a single pipeline flow (based on profiling data).

7.2.2 System Architecture

The Xtensa LX [20] is part of the Tensilica line of cores, which is configurable, extensible and supported by automatic hardware and software generation tools. The core is synthesizable and allows designers to configure each implementation to match the target application requirements. It supports extended instructions, including fusion instructions [149], SIMD/vector instructions and FLIX [23] (Flexible Length Instruction Xtension) instructions. FLIX is a configuration option that allows designer-defined instructions to consist of multiple, independent operations bundled into one instruction word.

The key feature that is used in this work is the queue interface (introduced in Xtensa LX – refer to Figure 7.2). This feature supports external communications at a much wider bandwidth than existing interconnects. Queue interfaces (defined using TIE instructions) are used to pop an entry from an input queue for incoming data, or to push data to an outgoing queue. The Xtensa Toolset automatically generates the logic to stall the processor on attempts to read an empty input queue or write to a full output queue.

Figure 7.2: Xtensa LX Queue Interface. An output queue interface (with PushReq and Full signals, per output queue) on one Xtensa LX processor connects through a queue to an input queue interface (with PopReq and Empty signals, per input queue) on another.

TIE (Tensilica Instruction Extension) is a Verilog-like language used to describe the desired custom instructions. A designer expresses the desired functionality in the TIE language to add a new instruction to the Xtensa 7 processor core. Extensible instructions include fusion and FLIX instructions; TIE is also used to define port and queue interfaces for Tensilica LX processors.

A processor implementation with a complete ISA (instruction set architecture), which we call LX1, is assigned to the MP3 benchmark program first, due to its higher computation demand compared with the JPEG benchmark program. Opcodes that are not used in the JPEG program (such as MAC16, MULUH/MULSH and NSA/NSAU) are removed, and the processor thus implemented is called LX2.

We profiled the MP3 and JPEG benchmarks on both the LX1 and LX2 cores. The mapping (refer to Table 7.1) of the benchmark programs to the processor implementations is justified, as the MP3 program achieves a lower cost value (runtime × area) with the LX1 core than with the LX2 core. This shows that the LX1 core provides a better performance/area ratio than the latter core. As expected, the JPEG benchmark program achieves a lower cost value when mapped to the LX2 core.

                                   LX1            LX2
  Benchmark                        MP3            JPEG
  Speed                            533 MHz        533 MHz
  Process                          90nm GT        90nm GT
  Pipeline length                  5              5
  Size                             79,885 gates   63,843 gates
  Core Size                        0.41 mm2       0.32 mm2
  Core Power                       81.28 mW       74.35 mW
  MAC16                            √              -
  MUL32                            √              √
  MULUH/MULSH                      √              -
  MUL16                            √              √
  NSA/NSAU                         √              -
  MIN/MAX/MINU/MAXU                √              √
  Sign Extend32                    √              √
  Enable density instructions      √              √
  Enable Boolean Registers         √              √
  Zero Overhead Loop               √              √
  Synchronize Instruction          √              -
  Conditioned Store Synchronize    √              -
  TIE arbitrary byte enables       √              √
  Enable TIE wide stores           √              √
  Max Instruction Width            8 bytes        8 bytes
  PIF Interface Width              128 bits       128 bits

Table 7.1: Processor Configuration

The overall core size for each core includes the instruction and data cache area and the extended instruction size. This is further expanded upon in Section 7.6.

7.3 The System

The design flow for obtaining the best configuration for a given application is summarized in Figure 7.3. The input to the system consists of: an application written in C/C++; a library of pre-configured processors; and a set of cache and XPRES (Xtensa PRocessor Extension Synthesis) configurations (with the respective area utilization information). The XPRES compiler is a tool which creates tailored processor descriptions for the Xtensa processors from native C/C++ code. The XPRES compiler automatically determines which functions should be accelerated in hardware.

Figure 7.3: Design flow for exploring the heterogeneous multiprocessor design space. (a) The original program is cross-compiled and profiled on the ISS; (b) the designer partitions the program using its data flow graph; (c) a methodology identifies a set of designs (partitioned programs and connection libraries); (d) configurations are simulated, analyzed and ranked by cost, with a heuristic algorithm selecting the optimum configuration for each design; (e) the design with the lowest cost function yields the final partitioned programs and core configurations.

The process starts with a program which is then compiled and profiled to detect the hotspots in the algorithm (Figure 7.3(a)). The designer derives a data flow graph that describes the flow of data through the program. This information can be used to partition the program, manually (or automatically) (Figure 7.3(b)), into multiple modules (Figure 7.3(c)), each module corresponding to a stage in a pipeline process.

The designer may produce a set of possible architectures as shown in Figure 7.1.

Each design may have a different number of pipeline stages and dissimilar parallel pipeline flows. Each design would consist of a set of individual standalone C programs, each associated with a single stage of a pipeline and capable of being compiled and executed on a single microprocessor core. A heuristic (refer to Section 7.4.6) is used to rapidly explore the design space to find the best architectural configuration (Figure 7.3(d)).

The heuristic algorithm will produce a configuration that is near optimal, given a particular architecture (i.e., number of pipeline stages and parallel pipelines – see Figure 7.1). The algorithm is run on all possible architectures to obtain a near-optimum configuration for each architecture (Figure 7.3(e)). The optimum design is then the architecture that has the lowest configuration cost (explained in Section 7.4.2). The design flow eventually results in a set of partitioned programs and their associated core configurations (i.e., cache and XPRES configurations).

7.4 Design Exploration

The selection and mapping of different regions of software to a multitude of hardware configurations have been widely explored. Previous approaches to hardware/software codesign using ASIPs in a multiprocessor configuration use various mappings to NP-hard problems, notably the 0-1 knapsack problem and its derivatives [166].

We make the following assumptions:

• Each design terminates with only one output processor.

• Each stage in the pipeline refers to a physical processor with the assigned task of the stage.

• Runtime is calculated assuming the processors are not stalled due to an empty input queue (POP stall) or a full output queue (PUSH stall). This assumption is valid since the overall computation time for a pipeline will be dominated by the stage with the longest execution time (the critical stage in the pipeline). Thus, for the purpose of estimating the pipeline design performance, these stalls can be ignored (within 2% accuracy in our experiments – refer to Figure 7.5).

• Whenever a design has parallel pipelines, we assume that all parallel parts are identical in terms of the number of pipeline stages and processors. For example, in Figure 7.1(d), following stage 1, the pipeline splits into three parallel pipelines, each with an identical number of stages (stages 2 & 3). These three flows subsequently merge back into a single pipeline in stage 4. This assumption allows us to simplify the design process. As an extension to this work, we could have asymmetrical pipeline stages; however, this is beyond the scope of this thesis.

7.4.1 Problem Definition

We are given a program which is represented as a Directed Acyclic Graph (DAG), H(V, E), where the vertices V are processes and the edges E are data communications between processes.

    V = \{ v_j : 1 \le j \le J \} \qquad (7.1)

    E = \{ (v_i, v_j) : 1 \le i < j \le J,\; v_i \in V,\; v_j \in V \} \qquad (7.2)

where J is the maximum number of nodes in the design.

A DAG, G(V', E'), is used to represent a multiprocessor pipeline design. Given N multiprocessor pipeline designs G(V', E') (see Figure 7.1), each of which is capable of executing the program, there is a direct mapping from the partitioned program H(V, E) to the multiprocessor pipeline design G(V', E'). Vertices in G(V', E') represent processors and edges represent FIFO connections between processors.

From among the N multiprocessor pipeline designs, we must find a system G(V', E'), and a corresponding configuration for each processor in the system, which provides the lowest cost (runtime × area) amongst these N possible designs. In this thesis, due to long simulation times, the configurable options for each processor in the system are limited to whether the XPRES compiler is enabled or disabled, and the cache sizes for both the instruction and data caches.

The methodology can be complemented with more configurable options, such as register width, bus size, different instruction sets, register call window size, processor pipeline depth and the number of load/store units.
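For concreteness, a single point k ∈ K in this restricted space can be represented as below; this is a sketch in our own notation, with the cache ranges taken from the exploration in Section 6.5:

    #include <stdbool.h>

    /* One processor configuration k in the set K: whether the
     * XPRES-generated TIE extensions are compiled in, plus the
     * instruction and data cache sizes (1KB to 32KB). */
    typedef struct {
        bool xpres_enabled;
        int  icache_kb;   /* 1, 2, 4, 8, 16 or 32 */
        int  dcache_kb;   /* 1, 2, 4, 8, 16 or 32 */
    } proc_config_t;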

7.4.2 Exhaustive Search

An exhaustive search approach would require a thorough exploration of each and every possible configuration available to a stage in the pipelined multiprocessor system design. The implementation area for a particular configuration is obtained via the Tensilica Xtensa Toolset [20].

The K set contains the processor configurations on which the different program stages will be running. The different processor configurations refer to the various combinations of cache configurations and the enabling of extended instructions generated via the Tensilica XPRES Tool [20].

    K = \{ \text{Set of processor configurations} \} \qquad (7.3)

We define the area cost for each particular pipeline stage implementation:

    A : V' \times K \longrightarrow \mathbb{Z} \qquad (7.4)

where V' and K are the sets of processors and processor configurations respectively.

The runtime R of each set of configurations is obtained by using an instruction set simulator (refer to Section 7.5). The total area cost of the design is the sum of the area costs of each individual processor implementation in the pipeline design.

    A = \sum_{j=1}^{J} A(v'_j, k_j) \qquad (7.5)

where k_j \in K is the corresponding configuration number of processor v'_j, and J is the total number of processors in the design. Finally, the cost function which needs to be minimized is defined as

    \Theta = R \times A \qquad (7.6)

where R and A are integer values. Given the definitions above, we now try to solve the Θ minimization problem using an exhaustive search algorithm, as shown in Figure 7.4.

    Start with one multiprocessor design with V', E' and K sets
    Reset Θ

    Find all possible design configuration combinations:
        Let K* = { (k_1 ... k_J) : ∀j, k_j ∈ K, 1 ≤ j ≤ J }

    Find the configuration with minimum cost:
        For each design k* ∈ K*
            Find the total area cost:
                A = \sum_{j=1}^{J} A(v'_j, k_j), where k_j ∈ k*
            Find the runtime for this particular implementation:
                R = simulation output with configuration k*
            Calculate the cost:
                Θ_new = R × A
            Replace the minimum: Θ ← min(Θ, Θ_new), recording k*
        End for

    The k* obtained is the configuration which provides the best
    performance per area ratio in the design space.

Figure 7.4: An exhaustive search to obtain a pipeline multiprocessor design with the lowest cost (R × A)

7.4.3 Runtime Estimation

The problem formulated above (as an algorithm), when solved by brute force, would not provide a solution quickly, due to the need to rerun the simulation program for every different configuration combination. The permutations for different implementations would result in an exponential complexity of order O(|K|^{|V'|}), where |K| is the maximum number of possible processor configurations, and |V'| the number of processors in the multiprocessor configuration. To more effectively explore the design space, we developed an estimation technique to closely match the runtime of a configuration combination, provided we have the appropriate information about each node in the pipeline. In effect, the problem space is simplified in order to dramatically improve search speed. As a trade-off, the accuracy of the search is reduced.

In order to obtain the runtime values, the partitioned benchmark programs in each pipeline stage are annotated to divide the sections of code into initialization, process and finalization stages. Separate cycle counts are calculated for each of those sections during simulation. Thus, one partitioned program (code in one stage) would have aggregated runtime values R^{init}, R^{process} and R^{final}.
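A sketch of this annotation is shown below, assuming a hypothetical read_ccount() helper that returns the core's cycle counter under the ISS (the helper and function names are ours; only the three-way sectioning is the methodology described above):

    #include <stdint.h>

    extern uint32_t read_ccount(void);    /* hypothetical cycle-counter read */
    extern void initialise_stage(void);   /* e.g., build quantization tables */
    extern void process_one_block(void);  /* the kernel loop body            */
    extern void finalise_stage(void);     /* e.g., flush trailing output     */

    uint32_t r_init, r_process, r_final;  /* aggregated per-section cycles   */

    void stage_main(int iterations)
    {
        uint32_t t0 = read_ccount();
        initialise_stage();
        r_init = read_ccount() - t0;

        t0 = read_ccount();
        for (int i = 0; i < iterations; i++)
            process_one_block();
        r_process = read_ccount() - t0;

        t0 = read_ccount();
        finalise_stage();
        r_final = read_ccount() - t0;
    }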

We define the functions below:

    R : V' \times K \longrightarrow (\mathbb{Z}, \mathbb{Z}, \mathbb{Z}) \qquad (7.7)

    L : V' \times K \longrightarrow \mathbb{Z} \qquad (7.8)

Equation 7.7 yields a set of integer tuples containing the runtime values of the initialization R^{init}, core iterations R^{process} and finalization R^{final} of the program stage. Equation 7.8 refers to the latency for a particular pipeline stage implementation to execute one iteration loop within the program. Within one loop, the program processes a fixed number of streaming data sets. Consequently, the core iteration runtime R^{process} is the product of the average latency of a pipeline stage implementation, L, and the number of iterations for which the core loop is executed. Equations 7.7 and 7.8 are functions of the partitioned processors, V', and processor configurations, K.

The different implementations of the pipeline stages would result in different execution times within the pipeline. As faster stages stall for slower ones, we assume that, on average, each pipeline stage latency would be equivalent to the latency of the critical processor, v'_{crit} (i.e., the critical pipeline stage). The critical pipeline stage is defined as the stage which has the largest loop execution runtime (i.e., R^{process}) of all processing stages. The corresponding processor configuration for the critical processor is referred to as k_{crit}. We now redefine the runtime R of a configuration to include the runtimes of each individual core. The runtime of a particular configuration can be defined as

    R = R^{\text{init}}(v'_1, k_1) + \sum_{j=1}^{J} L(v'_j, k_j)
        + (I - 1) \times L(v'_{crit}, k_{crit}) + R^{\text{final}}(v'_J, k_J)

where I is the number of iterations, and R^{init}, R^{process} and R^{final} are mapping functions to the initialization, core iteration and finalization execution times respectively (see Equation 7.7); k is the corresponding implementation for each processor v'. In pipelined systems, increasing the workload in the pipeline brings the system closer to the theoretical performance improvement [78]. The theoretical performance improvement is defined as the ratio of the running time on a single processor over the running time of the critical stage processor in the multipipeline processor system. Similarly, as the number of iterations I increases to a significantly large number, the sum of the latencies of the pipeline can be ignored. The equation above can then be simplified to

    R = R^{\text{init}}(v'_1, k_1) + R^{\text{process}}(v'_{crit}, k_{crit}) + R^{\text{final}}(v'_J, k_J) \qquad (7.9)

For example, given a set of processors with arbitrary configurations mapped into a multipipeline processor system, three processors are identified and singled out: the first, last and critical stage processors in the pipeline. Each processor would have already been profiled, and the runtime values of its initialization R^{init}, core iteration R^{process} and finalization R^{final} code segments documented and made available. The estimated runtime of the complete program would be the sum of the initialization time R^{init} of the first processor, the core iteration time R^{process} of the critical processor and the finalization time R^{final} of the last processor. The remaining runtime values of the other processors do not play a role. However, the sizes of all the processors determine the total size of the multiprocessor system and thus the cost function of the system, R × A.
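A sketch of the estimate in code form is given below (the names are ours; the profile array is assumed to be filled, for the chosen configuration of each processor, from the profiling simulations described in Section 7.4.4):

    #include <stdint.h>

    /* Profiled cycle counts of one processor under its chosen
     * configuration: initialization, total kernel-loop and
     * finalization sections. */
    typedef struct {
        uint64_t r_init, r_process, r_final;
    } stage_profile_t;

    /* Equation 7.9: runtime = init time of the first stage
     * + kernel time of the critical (slowest) stage
     * + finalization time of the last stage. */
    uint64_t estimate_runtime(const stage_profile_t p[], int nprocs)
    {
        uint64_t crit = 0;
        for (int j = 0; j < nprocs; j++)   /* locate the critical stage */
            if (p[j].r_process > crit)
                crit = p[j].r_process;
        return p[0].r_init + crit + p[nprocs - 1].r_final;
    }

Multiplying the returned estimate by the summed area of the chosen configurations gives the cost R × A of one candidate system without any further simulation.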

final processor. Figure 7.5 shows the distribution of errors to the runtime information of the JPEG system architectures when the above equation is used. These error values were produced by comparing the different runtime values obtained through simulating the entire pipeline system and through estimating the runtime values via

Equation 7.9. It can be seen from Figure 7.5 that the estimated runtimes are within

2.5% of the actual values. CHAPTER 7. DESIGN SPACE EXPLORATION 164

Figure 7.5: Error distribution when using Equation 7.9. The histogram plots the number of occurrences (0–600) against the estimation error (0–2.5%).

7.4.4 Estimation-based Search

Simulating the entire pipeline of processors for every configuration using an instruction set simulator (ISS) requires several hours. Based on Equation 7.9, we develop a methodology to obtain the estimated execution time of any configuration combination without tediously simulating every possible configuration via an ISS. With the runtime values of each possible configuration of a pipeline stage known, it is possible to estimate the total runtime of any pipeline configuration using Equation 7.9.

To minimize the number of simulations needed to obtain the individual runtime for each processor in the system, a pipeline simulation is run with all cores configured with the same settings. Effectively, each possible configuration for a particular processor is chosen and run only once. If there are |K| possible configurations, then only |K| simulations have to be run. Figure 7.6 shows the first seven simulation runs for a four-processor system. To obtain the execution times for each pipeline stage, we simulate all pipelines using the same configuration in one single simulation run. The execution times for each core are calculated and reported separately from the other cores. Stall times (waiting times for reading and writing to queues – interprocessor communications) from each core are noted and removed from the total cycle count of the core to obtain the net execution time (i.e., processing time) for each core.

           Process 1      Process 2      Process 3      Process 4
           Configuration  Configuration  Configuration  Configuration
  Sim 1    1              1              1              1
  Sim 2    2              2              2              2
  Sim 3    3              3              3              3
  Sim 4    4              4              4              4
  Sim 5    5              5              5              5
  Sim 6    6              6              6              6
  Sim 7    7              7              7              7

Figure 7.6: Table depicting which processor configurations are mapped to each processor in the pipeline system for different simulation runs, to obtain aggregate runtime values for each processor

Alternatively, the runtime values of each core can be obtained independently if the core can be simulated with data fed in, as if output from the input queue. In total, |K| × |V'| independent simulations would need to be run.

The original problem was presented in Section 7.4.1, and a solution method which is NP-complete was given in Section 7.4.2. The runtime estimation technique introduced in Section 7.4.3 reduces the NP-complete problem to one that can be solved in polynomial time. When the runtime of a particular configuration is required, Equation 7.9 is used. The aggregate runtimes needed for the total runtime calculation are only those of the first, critical and last processors. The remaining processors in the system are accounted for in the total area count. Thus, the cost of the system is the runtime R multiplied by the total area of the system, A.

The algorithm to systematically explore the transformed design space is broken down into two independent subproblems. The first is to find the optimal set of configurations assuming the selected processor configuration in a selected stage is the critical processor in the multipipeline processor system. According to the runtime estimation methodology in Section 7.4.3, the execution time of the overall system is determined by the latency of the slowest pipeline stage (i.e., the processor with the highest execution latency).

The input to the algorithm is a table of runtimes of |V'| processors with |K| configurations, similar to the table shown in Figure 7.6. An arbitrary configuration of an arbitrary processor is selected to be the critical node. In order to obtain the lowest cost of the entire system with the current processor configuration as the constraint, the total area of the entire system has to be minimized. For each other processor in the system, it is sufficient to choose the lowest area configuration that satisfies the latency constraint of the critical processor (a sketch of this selection step is given after the complexity discussion below). The complexity of such a search would be O(|V'||K|). If the runtimes of each processor's configurations are sorted, a binary search can be used, providing a complexity of O(|V'| log(|K|)).

The total runtime of the entire system is also dependent on the initialization and finalization times of the first and last processors respectively. Thus, an exhaustive search on the combinations of the first and last processors is vital to obtain the most optimal solution. The complexity of an exhaustive search on the first and last processor combinations would be O(|K|^2).

The second subproblem would be to obtain an optimal solution for each configuration in Figure 7.6. The above methodology is applied to every possible processor configuration in the input table (i.e., an order of O(|V'||K|)). If a particular configuration cannot be the critical processor, due to other processors having runtime values greater than the selected processor, that configuration is skipped. The entire search process would have a complexity of O(|V'||K| (|V'| log(|K|) + |K|^2)). The multiprocessor configuration that has the lowest cost function is the optimum solution to the design space. Thus, the complexity of the exhaustive algorithm can be rewritten as O(|V'|^2 |K| log(|K|) + |V'||K|^3).
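A sketch of the per-processor selection step follows (all names are ours). The prefix-minimum table is our own device for keeping each query at the O(log |K|) cost quoted above; the text only stipulates a binary search over runtime-sorted configurations:

    #include <stdint.h>

    typedef struct {
        uint64_t runtime;   /* per-loop latency of this configuration    */
        uint64_t area;      /* implementation area of this configuration */
    } config_t;

    /* For cfgs[] sorted by ascending runtime, precompute the index of
     * the smallest-area configuration within each prefix cfgs[0..i]. */
    void build_prefix_min(const config_t cfgs[], int n, int prefix_min[])
    {
        for (int i = 0; i < n; i++)
            prefix_min[i] = (i == 0 ||
                             cfgs[i].area < cfgs[prefix_min[i - 1]].area)
                            ? i : prefix_min[i - 1];
    }

    /* Cheapest configuration of a non-critical processor whose latency
     * does not exceed that of the assumed critical stage. Binary search
     * finds the last qualifying entry; one table lookup then finishes
     * the query in O(log |K|). Returns -1 if none qualifies. */
    int cheapest_config(const config_t cfgs[], const int prefix_min[],
                        int n, uint64_t crit_latency)
    {
        int lo = 0, hi = n - 1, last = -1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (cfgs[mid].runtime <= crit_latency) { last = mid; lo = mid + 1; }
            else                                   { hi = mid - 1; }
        }
        return (last < 0) ? -1 : prefix_min[last];
    }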

    Get optimal configurations:

    For all processors v'_1 ∈ V':
        For each configuration k in the K configurations:
            Assuming processor v'_1 with configuration k is the critical
            stage, find the configurations for the remaining processors:
                For all processors v'_2 ∈ V' except v'_1:
                    Get the configuration for v'_2 with the lowest area
                    which satisfies the runtime of v'_1 with configuration k
                Next v'_2
            If v'_1 is the first processor:
                Find the configuration of the last processor which
                minimizes total runtime × area
            Else if v'_1 is the last processor:
                Find the configuration of the first processor which
                minimizes total runtime × area
            Else:
                Find every possible combination of first and last processor
                such that total runtime × area is minimized
            Endif
        Next k
    Next v'_1

    The set of k's for each processor in V' which provides the lowest cost
    (total runtime × area) is the optimal solution for the design space.

Figure 7.7: A modified exhaustive search based on the runtime estimation technique in Section 7.4.3 to obtain a pipeline multiprocessor design with the lowest cost (R × A)

Table 7.2 shows the exploration time taken by the estimation-based search and the heuristic (introduced in Section 7.4.6). The JPEG benchmarks have 648 possible processor configurations available, while 288 processor configurations are made available to the MP3 benchmarks. Column 3 in Table 7.2 shows the computation time for exhaustively searching the design space.

Table 7.2: Exploration time

  Benchmark   System Configuration      Estimation-based Search   Heuristic
  JPEG        5 processors (4 stages)   8s                        <1s
  JPEG        5 processors (5 stages)   6s                        <1s
  JPEG        6 processors              3s                        <1s
  JPEG        9 processors              33s                       <1s
  MP3         4 processors              <1s                       <1s
  MP3         6 processors              1s                        <1s

As expected, the systems with more processors need more exploration time. The JPEG system with six processors is an exception, because the critical processor is the last processor in the system. Thus, the internal search algorithm for the best combination of first and last processors (O(|K|^2)) does not apply, and a faster search is seen compared with a five processor system.

Search times for the MP3 benchmarks were shorter due to a smaller search space.

The estimation-based search is highly influenced by the number of configuration options available to the designer. As the configuration option space |K| becomes larger (while the number of processors remains the same), the complexity asymptotically approaches O(|K|³). The configuration spaces explored in this work are relatively small compared with those in industrial use, where the configuration space is a few orders of magnitude larger. In reality, the number of possible configurations can range into the millions, as different combinations of instruction sets can be used along with a variety of configurable components and coprocessor options, such as floating point and MAC (Multiply-And-Accumulate) units. Scratch pads and local memory can be used alongside hierarchical cache systems, each configurable independently of the other components. Exploration time would then increase dramatically.

The algorithm derives its complexity from the exhaustive search for the most efficient combination of the first and last processors in the system (O(|K|²)). Additionally, every possible configuration of the critical processor is explored to find the most efficient configuration (an additional factor of O(|K|)). Thus, a heuristic is developed to reduce the complexity of the search while trading off accuracy. The heuristic does not need to search all possible configurations for the critical processor; in other words, a 'one-time pass' should be sufficient to estimate the cost efficiency of a multipipeline processor system.

7.4.5 Preliminary Heuristic

Although the estimation technique in Subsection 7.4.4 makes it easier to obtain the optimal configurations for a design space, the worst case is still of square or cubic complexity, which does not scale well when the number of configurations is increased or more processors are used in the multipipeline processor system. A heuristic is developed to reliably explore the transformed design space (refer to Section 7.4.3) and obtain a configuration whose cost is close to the optimal value. The heuristic relies on the fact that we have the runtime information (e.g., initialization, core iteration and finalization runtimes) of every node in the multiprocessor system, gathered with a simulation period of the order O(|V′| × |K|).

The heuristic approach analyzes each processor configuration only once. The configurations of each processor are evaluated based on cost efficiency and significance amongst the other configurations. This heuristic approach (refer to Figure 7.9) has a complexity of O(|V′| × |K|).

Our initial experiments resulted in a preliminary heuristic which generated configurations with cost values within acceptable error margins. This heuristic was presented and published in [140].

Based on Equation 7.9, we develop the heuristic shown in Figure 7.8. The heuristic relies on the fact that our simulation period is of the order O(|V′| × |K|), compared with the complexity of O(|V′|² |K| log(|K|) + |V′| |K|³) in Section 7.4.4.

Get minimum core iteration runtime of each processor:
    R⁻ = { r_j : ∀j, 1 ≤ j ≤ J, r_j = MIN_{k∈K}(R^process_k(v_j)) }, where v_j ∈ V

Find critical node:
    The critical node is the processor with the worst minimum core iteration
    runtime: the critical node is v_crit where MIN_{k∈K}(R^process_k(v_crit)) = MAX(R⁻)

Start with critical node:
    For each configuration k in set K:
        Calculate Cost = R^process_k(v_crit) × C_k(v_crit)
    Next k
    The configuration k with the smallest cost is selected
    Critical Runtime = R^process_k(v_crit)

Evaluate all other nodes:
    For every other node, v_j ∈ V:
        Filter out all configurations k where R^process_k(v_j) exceeds the Critical Runtime
        For each remaining configuration k in K:
            If this is the first node, calculate Cost = R^init_k(v_j) × C_k(v_j)
            If this is the last node, calculate Cost = R^final_k(v_j) × C_k(v_j)
            Otherwise, calculate Cost = C_k(v_j)
        Next k
        The configuration k with the smallest cost is selected
    Next node

Output: the set of k's obtained for each processor

Figure 7.8: A preliminary heuristic for mapping the pipeline stages to the various hardware implementations, which tries to maximize the performance per area ratio.
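A minimal Python sketch of Figure 7.8, under the assumption that each stage is given as a dictionary mapping a configuration id to a (core-iteration runtime, initialization runtime, finalization runtime, area cost C) tuple; the names are ours, not the thesis toolchain's:

    def preliminary_heuristic(stages):
        """stages: one dict per pipeline stage, in pipeline order, mapping a
        configuration id k -> (r_process, r_init, r_final, cost_c).
        Returns the chosen configuration id per stage (sketch of Figure 7.8)."""
        n = len(stages)
        # Critical node: the stage with the worst minimum core-iteration runtime.
        mins = [min(v[0] for v in st.values()) for st in stages]
        crit = mins.index(max(mins))

        # Critical stage: choose the configuration minimizing R_process x C.
        k_crit = min(stages[crit],
                     key=lambda k: stages[crit][k][0] * stages[crit][k][3])
        critical_runtime = stages[crit][k_crit][0]

        chosen = {crit: k_crit}
        for j, st in enumerate(stages):
            if j == crit:
                continue
            # Exclude configurations slower than the critical stage; each
            # stage's own minimum guarantees the feasible set is non-empty.
            feasible = {k: v for k, v in st.items() if v[0] <= critical_runtime}

            def cost(k):
                r_proc, r_init, r_final, c = feasible[k]
                if j == 0:            # first stage: weight by initialization time
                    return r_init * c
                if j == n - 1:        # last stage: weight by finalization time
                    return r_final * c
                return c              # middle stages: area cost alone
            chosen[j] = min(feasible, key=cost)
        return [chosen[j] for j in range(n)]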

The heuristic above uses the estimation technique in Section 7.4.3 and assumes that the runtime execution times of the various pipeline stages and core configurations have all been obtained (see Figure 7.6). The critical process is first identified as the process with the worst minimum core iteration runtime of all the processes. Once this process has been identified, the configuration that results in the lowest cost value is selected. For every other node, the configurations whose core iteration runtimes exceed the core iteration runtime of the selected critical process are excluded from the selection set. For each of these processes, the configuration that results in the lowest cost value is selected.

Preliminary Results

Tables 7.3 and 7.4 show the runtimes and cost values obtained via the estimation-based approach and the preliminary heuristic approach shown in Figure 7.8. The first column shows the number of processors and pipeline stages in the system. If there are more processors than pipeline stages, a parallel pipeline stage exists. The next four columns show the instruction and data cache sizes, the core area and whether the XPRES compiler option has been enabled (for the estimation-based approach).

Note that all cache configurations have an associativity of one unless otherwise stated.

This is followed by four more columns of the same description, representing the heuristic approach. The last column shows the difference in total area of the pipeline systems, in runtime cycles and in cost value (runtime × area) between the estimation-based and heuristic approaches. The area, runtime and cost values are shown in the last three rows of every configuration group. Columns four and seven show the runtime (cycle count) and cost (runtime × area) obtained via our heuristics.

Table 7.3 shows the performance of our heuristic approach using the JPEG benchmark program. The initial heuristic is able to generate configurations with a cost function within 20% of the optimal cost value (obtained via the estimation-based approach). Our best result was within 0.04% of the optimal value. The best multiprocessor configuration for JPEG is found to be the 5 core (4 pipeline stages) implementation.

In Table 7.4, our heuristic is shown to come within 5.75% of the optimum result. The best multiprocessor configuration for MP3 is the 4 core (3 pipeline stages) implementation.

                      Estimation-based                     Heuristic
  Configuration       ICache   DCache   Area      XPRES    ICache   DCache   Area      XPRES    Difference
                      (bytes)  (bytes)  (mm2)              (bytes)  (bytes)  (mm2)

  5 cores             2048     2048     0.69315            2048     2048     0.69315
  (4 stages)          1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.75445   √        1024     1024     0.75445   √
                      1024     1024     0.65502            1024     1024     0.65502
                      2048     1024     0.86497   √        8192     4096     1.05850   √
    Total Area (mm2)                    3.622621                             3.816147            5.34% ↑
    Runtime (cycles)           2,509,052                            2,507,481                    0.06% ↓
    Cost value                 9,089,344                            9,568,917                    5.28% ↑

  5 cores             2048     2048     0.85636   √        1024     1024     0.81824   √
  (5 stages)          2048     1024     1.01799   √        1024     1024     0.99893   √
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.81981   √        1024     1024     0.81981   √
                      1024     1024     0.65502            2048     2048     0.69315
    Total Area (mm2)                    4.004213                             3.985148            0.48% ↓
    Runtime (cycles)           2,385,684                            2,585,894                    8.39% ↑
    Cost value                 9,552,786                            10,305,171                   7.88% ↑

  6 cores             2048     4096     0.73678            2048     4096     0.73678
  (4 stages)          1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      4096     4096     0.97129   √        4096     2048     0.92766   √
    Total Area (mm2)                    4.328164                             4.284536            1.01% ↓
    Runtime (cycles)           2,231,559                            2,255,188                    1.06% ↑
    Cost value                 9,658,553                            9,662,435                    0.04% ↑

  9 cores             4096     4096     0.94980   √        4096     2048     0.73678
  (5 stages)          4096     2048     0.73678            4096     1024     0.71772
                      4096     2048     1.08069   √        4096     1024     0.71772
                      4096     2048     1.08069   √        4096     1024     0.71772
                      4096     2048     0.78221   √        4096     2048     0.73678
                      4096     2048     0.78221   √        4096     2048     0.73678
                      4096     2048     0.78221   √        4096     2048     0.73678
                      4096     2048     0.88646   √        4096     1024     0.71772
                      1024     1024     0.65502            2048     2048     0.69315
    Total Area (mm2)                    7.736048                             6.511132            15.83% ↓
    Runtime (cycles)           1,812,193                            2,582,436                    42.50% ↑
    Cost value                 14,019,212                           16,814,582                   19.94% ↑

Table 7.3: Configurations obtained for the JPEG benchmark (preliminary). The table compares the configurations obtained via the estimation-based and heuristic approaches.

                      Estimation-based                     Heuristic
  Configuration       ICache   DCache   Area      XPRES    ICache   DCache   Area      XPRES    Difference
                      (bytes)  (bytes)  (mm2)              (bytes)  (bytes)  (mm2)

  4 cores             2048     1024     0.76409            8192     2048     0.91399
  (3 stages)          2048     1024     0.76409            2048     1024     0.76409
                      4096*    2048     0.99377            4096*    1024     0.97471
                      4096*    4096     1.03740            4096     1024     0.80772
    Total Area (mm2)                    3.559349                             3.460499            2.78% ↓
    Runtime (cycles)           3,865,873,883                        4,204,536,116               8.76% ↑
    Cost value                 13,759,996,041                       14,549,794,245              5.74% ↑

  6 cores             1024     1024     0.74502            4096     2048     0.82678
  (4 stages)          2048     1024     0.76409            2048     1024     0.76409
                      2048     1024     0.76409            2048     1024     0.76409
                      2048     1024     0.76409            2048     1024     0.76409
                      1024     1024     0.74502            1024     1024     0.74502
                      4096     1024     0.80772            2048     1024     0.76409
    Total Area (mm2)                    4.590027                             4.628155            0.83% ↑
    Runtime (cycles)           3,555,683,787                        3,658,802,853               2.90% ↑
    Cost value                 16,320,684,017                       16,933,507,412              3.75% ↑

  * This cache configuration has an associativity of two. Note that all other configurations have a set associativity of one.

Table 7.4: Configurations obtained for the MP3 benchmark (preliminary). The table compares the configurations obtained via the estimation-based and heuristic approaches.

Preliminary Analysis

The heuristic above initially attempts to select the optimal configuration for the critical stage in the pipeline. This selection process relies on the area × runtime equation, which evaluates the area per performance ratio that determines the cost effectiveness of a particular processor. Overall, this approach provides reasonable results, as shown in Tables 7.3 and 7.4.

However, further analysis shows that the selection process for the critical stage is biased towards obtaining the most cost efficient configuration for a particular processor, rather than for the entire multiprocessor configuration. This bias arises because the algorithm does not consider the possible configurations of the other processing stages when selecting the configuration for the critical stage. Once a configuration has been selected for the critical stage, the configuration selection of the remaining stages is effectively dependent on the runtime of the critical stage's configuration.

In order to obtain better accuracy, we transform the cost function space to provide a different approach to penalizing configurations with high area and runtime values.

We propose a linearized approach to evaluating the cost, which incorporates both area usage and runtime values, in Section 7.4.6.

7.4.6 Heuristic

In order to obtain more accurate results than the preliminary heuristic, we base a different approach on normalizing the runtime and area costs. The resulting heuristic proves superior, producing configurations whose cost values are within 0.5% of the optimal values (refer to Tables 7.5 and 7.6) for the benchmark applications used.

In this heuristic, we would like to select cores that would provide overall minimal runtime and area cost. The magnitude of runtime and area cost should not affect the selection process. Thus, for selection purposes, we normalize their values to their minimum and maximum. These are used to identify runtime and area cost that are close to either the smallest or largest value in the set. We define normalized runtime,

r̄(·), and normalized area cost, ā(·), as:

    \bar{r}(v', k) = \frac{R^{\mathrm{process}}(v', k) - \min_{\check{k} \in K} R^{\mathrm{process}}(v', \check{k})}
                          {\max_{\check{k} \in K} R^{\mathrm{process}}(v', \check{k}) - \min_{\check{k} \in K} R^{\mathrm{process}}(v', \check{k})}        (7.10)

    \bar{a}(v', k) = \frac{A(v', k) - \min_{\check{k} \in K} A(v', \check{k})}
                          {\max_{\check{k} \in K} A(v', \check{k}) - \min_{\check{k} \in K} A(v', \check{k})}        (7.11)

where v′ ∈ V′ and K is the set of configurations (see Section 7.4.2). α, β and γ are coefficient constants which provide the necessary weights to the normalized area, the normalized runtime and the initialization/finalization runtime ratio respectively. These coefficients are explained in Section 7.7.
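Concretely, Equations 7.10 and 7.11 are a min-max rescaling of each processor's runtime and area values onto [0, 1]. A small Python sketch (mapping the degenerate all-equal case, where the denominator vanishes, to 0 is an arbitrary choice of ours):

    def normalize(values):
        """Min-max normalization onto [0, 1], as in Equations 7.10 and 7.11.
        If all values are equal, return 0.0 for each to avoid division by zero."""
        lo, hi = min(values), max(values)
        span = hi - lo
        return [0.0 if span == 0 else (v - lo) / span for v in values]

    # e.g. the core-iteration runtimes of one stage across its configurations:
    print(normalize([2.4e6, 2.5e6, 3.0e6]))   # -> [0.0, 0.1666..., 1.0]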

The heuristic in Figure 7.9 approximates the critical stage in the multipipeline processor system. The critical stage is the stage with the largest minimal loop execution runtime (i.e., R^process) of all the processing stages. The configuration of the critical stage is selected based on a cost function (refer to Figure 7.9). Next, the processor configurations of the other pipeline stages that do not meet the runtime constraint of the selected critical stage configuration are filtered out. The remaining configurations of each pipeline stage are evaluated based on different cost functions.

The configuration which provides the smallest cost would be selected. Finally, the set of configurations for each processor would be the resulting set that provides a performance per area ratio close to the optimum value.

The improvement from implementing the algorithm in Figure 7.9 over the estimation-based approach is clearly shown in Table 7.2. Column 4 in Table 7.2 shows the search time utilizing the heuristic approach. All search tasks were performed in less than a second. The heuristic algorithm scales linearly with increasing processor configurations and processor count.

Figure 7.10 illustrates the scalability of the estimation-based search and the heuristic approach when more processors and processor configurations are added to the design space. In the estimation-based approach (see Figure 7.10(a)), the number of possible processor configurations dramatically increases the search time, and the effect of increasing configuration count is greatly magnified when more processors are added to the system. In contrast, Figure 7.10(b) shows the order-of-magnitude difference in search time of the heuristic approach compared with the estimation-based approach. This demonstrates the scalability of the heuristic approach with respect to increasing processor configurations and processor count.

Get minimum core iteration runtime of each processor:
    R⁻ = { r_j : ∀j, 1 ≤ j ≤ J, r_j = MIN_{k∈K}(R^process(v′_j, k)) }, where v′_j ∈ V′

Find critical node:
    The critical node is the processor with the worst minimum core iteration
    runtime: the critical node is v′_crit where MIN_{k∈K}(R^process(v′_crit, k)) = MAX(R⁻)

Start with critical node:
    For each configuration k in set K:
        Calculate ā(v′_crit, k) and r̄(v′_crit, k); initialize x̄ = 0
        If this is the first node, calculate x̄ = R^init(v′_crit, k) / R^process(v′_crit, k)
        If this is the last node, calculate x̄ = R^final(v′_crit, k) / R^process(v′_crit, k)
        Cost = α · ā(v′_crit, k) + β · r̄(v′_crit, k) + γ · x̄
    Next k
    The configuration k with the smallest cost is selected
    Critical Runtime = R^process(v′_crit, k)

Evaluate all other nodes:
    For every other node, v′_j ∈ V′:
        Filter out all configurations k where R^process(v′_j, k) exceeds the Critical Runtime
        For each remaining configuration k in K:
            Calculate ā(v′_j, k); initialize x̄ = 0
            If this is the first node, calculate x̄ = R^init(v′_j, k) / Critical Runtime
            If this is the last node, calculate x̄ = R^final(v′_j, k) / Critical Runtime
            Cost = α · ā(v′_j, k) + γ · x̄
        Next k
        The configuration k with the smallest cost is selected
    Next node

The set of k's obtained for each processor provides a performance per area ratio closest to the optimum value.

Figure 7.9: A heuristic for mapping the pipeline stages to the various hardware implementations, which tries to minimize the cost function.
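A corresponding Python sketch of Figure 7.9, reusing the normalize() helper sketched earlier and defaulting the coefficients to α = β = γ = 1 as in Section 7.7; the per-stage data layout is the same illustrative assumption as in the earlier sketches:

    def normalized_heuristic(stages, alpha=1.0, beta=1.0, gamma=1.0):
        """stages: one list of (r_process, r_init, r_final, area) tuples per
        pipeline stage, in pipeline order. Returns the chosen configuration
        index per stage (a sketch of Figure 7.9)."""
        n = len(stages)
        mins = [min(c[0] for c in st) for st in stages]
        crit = mins.index(max(mins))      # worst minimum iteration runtime

        def x_bar(cfg, j, denom):
            """Initialization/finalization ratio term for stage j."""
            r_proc, r_init, r_final, _ = cfg
            d = denom if denom is not None else r_proc
            if j == 0:
                return r_init / d
            if j == n - 1:
                return r_final / d
            return 0.0

        # Critical stage: cost = alpha*a_bar + beta*r_bar + gamma*x_bar.
        st = stages[crit]
        a_bar = normalize([c[3] for c in st])
        r_bar = normalize([c[0] for c in st])
        costs = [alpha * a + beta * r + gamma * x_bar(c, crit, None)
                 for a, r, c in zip(a_bar, r_bar, st)]
        k_crit = costs.index(min(costs))
        crit_runtime = st[k_crit][0]

        chosen = {crit: k_crit}
        for j, st in enumerate(stages):
            if j == crit:
                continue
            # Normalize over the full configuration set (Equation 7.11), then
            # select only among configurations meeting the critical runtime;
            # each stage's own minimum guarantees this set is non-empty.
            a_bar = normalize([c[3] for c in st])
            keep = [i for i, c in enumerate(st) if c[0] <= crit_runtime]
            cost = {i: alpha * a_bar[i] + gamma * x_bar(st[i], j, crit_runtime)
                    for i in keep}
            chosen[j] = min(cost, key=cost.get)
        return [chosen[j] for j in range(n)]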

Figure 7.10: Comparison of complexity of estimation-based and heuristic approaches. (a) Estimation-based approach (non-heuristic); (b) heuristic approach.


As the number of configurations increases to a large value, the |K| term dominates (i.e., the complexity approaches O(|K|³)) and the exploration time of the estimation-based approach becomes several orders of magnitude greater than that of the heuristic approach (whose exploration time scales linearly with |K|). For example, if XPRES generates 20 instructions, then there are 1,048,575 possible instruction combinations (2^20 − 1). Additionally, we multiply this number by the total number of cache configurations (e.g., 288) and the number of optional processing units (e.g., 10) to obtain the total number of configurations, |K| = 3,019,896,000. The |V′| term in the estimation-based approach (a |V′|² term) likewise does not scale as well as in the heuristic approach (a linear |V′| term). Refer to Section 7.4.4 for the number of configurations used for the benchmark applications.
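The arithmetic of this worked example (whose instruction, cache and unit counts are the example's own assumed figures) can be checked directly:

    # 20 optional XPRES-generated instructions -> every non-empty subset.
    isa_combos = 2 ** 20 - 1           # 1,048,575
    cache_configs = 288                # example cache-configuration count
    optional_units = 10                # example count of optional units
    print(isa_combos * cache_configs * optional_units)   # 3019896000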

7.5 Experimental methodology

We used Tensilica’s Xtensa RA2006.4 Toolset for the Xtensa LX family of processors.

The toolset provides a set of compilation tools to compile C/C++ code for the architecture described in Table 7.1. The Tensilica Instruction Set Simulator (ISS) and Xtensa Modelling Protocol (XTMP) environment were used to simulate the multicore systems. For each system, multiple Xtensa cores were instantiated and XTMP was used to connect the cores together, including memory models and peripherals (bus and FIFO models). The ISS directly models the Xtensa pipeline and operates as a system-simulation component within the XTMP environment. With XTMP, different multiprocessor configurations can be set up and simulated simultaneously within a system-level simulation.

The simulator allows for communication between the cores and peripherals using a cycle-accurate, split-transaction simulation model without using a clock. The ISS was used to generate profiling data for all cores in the system, which were then analyzed using Tensilica's gprof profiler. The profiles include the cycles for all functions executed by the cores. The ISS can also print a summary of the total cycle count and global stalls of each core.

Each individual core is connected via the queue interface provided by the Xtensa LX core using the XTMP environment. Queue models were created and used in the XTMP environment as libraries. In our work, we simulate all queues with a very large number of queue buffers, so that no PUSH stalls occur. The ideal queue size depends on the input and output rates of the connecting processors. The current methodology emphasizes the performance of the system and thus assumes that the interconnecting queues are sufficiently large to prevent PUSH stalls. If a queue size is required, our architecture design can always be mapped to a Kahn Process Network (KPN) [90], and the optimal queue size can be calculated using the available tools for KPNs.

We created our benchmark programs by identifying the various stages of the MP3 and JPEG encoders and mapping them to individual processors. We partitioned and allocated these stages based on the open standards of the respective encoders. We created four multiprocessor configurations for the JPEG encoder and two configurations for the MP3 encoder. An XTMP simulation program, specially customized to generate profiling data, the execution time of each stage and other relevant benchmark information, was created for each of these multiprocessor systems.

The XPRES compiler is used to create tailored processor descriptions for the Xtensa processors from native C/C++ code. We are able to reuse the existing ASIP design flow to create custom RTL for each core in the system. Using the designer-defined input of C programs to be analyzed, XPRES extends the base processor with new instructions, operations and register files, using TIE extensions. It does so by automatically generating a new TIE file which can be included when recompiling the source code. Half of the configuration options defined in Equations 7.4, 7.7 and 7.8 are XPRES-enabled.

Figure 7.11: Benchmark image used for JPEG encoding

Area costs include the base processor, the instruction and data caches and the TIE instructions. A rose image from the Independent JPEG Group [8] (see Figure 7.11) of size 227 by 149 pixels is used as the raw input stream to the JPEG encoder system, while a six-second PCM-encoded music clip is used as the input stream to the MP3 system.

Note that the systems were also extensively tested on other streams. For example, the JPEG system encoded ten images of the same resolution (227 by 149 pixels), each of a type given in Figure 7.12. The runtimes for a single processor system and the multiprocessor system are given in Figure 7.13. It can be concluded that the configuration (5 cores, 4 stages; the first configuration in Table 7.5) obtained via the heuristic can be used with other data sets without much variability in performance. The slight runtime increase in the last ten frames is due to the image (see Figure 7.12(d)) having a complicated background tone and more color changes in the foreground compared with the previous three frame sets. The performance improvement over the single processor system mirrors the performance improvement seen in Figure 7.14(a).

7.6 Results & Analysis

In a similar way to Tables 7.3 and 7.4, Tables 7.5 and 7.6 show the configurations obtained via the new heuristic approach shown in Section 7.4.6.

Compared with the preliminary approach in Section 7.4.5, the new heuristic significantly improves accuracy. In two configurations for the JPEG benchmark, the heuristic obtained exactly the configuration generated via the estimation-based approach. The best configurations obtained are the 5 core (4 pipeline stages) implementation for JPEG and the 4 core (3 pipeline stages) implementation for MP3.

The results from the estimation-based search represent the configuration with the lowest cost (square points in Figures 7.16(b) and 7.17(b)).

The deviation of the heuristic results is shown as a percentage of the best possible values obtained from the estimation-based search. Note that our algorithm emphasizes reduction of the cost (via the cost function) rather than maximization of application performance. Nevertheless, our heuristic still produces runtime values close to the best possible runtime for our benchmark applications.

Figures 7.14 and 7.15 show the design space exploration for the JPEG and MP3 multiprocessor systems. Figure 7.14(a) shows the design space of the JPEG algorithm implementation. The (a) subfigures show the runtime performance of the benchmarks versus area. The group of data points in the left corner of each graph in Figures 7.14(a) and 7.15(a) corresponds to the single processor implementation. The data points with different colors on the right of the graphs correspond to different multiprocessor configurations that have been implemented (refer to Figure 7.1).

Figure 7.12: First frames of the video sequences, best viewed in color: (a) Tennis; (b) Mom; (c) Window; (d) Flower Bed.

Figure 7.13: JPEG encoding runtime for each frame in the video sequence. The plot shows runtime (clock cycles, ×10^6) against the frame sequence for the single-core processor and for the multi-core processor selected via the heuristic.

                      Estimation-based                     Heuristic
  Configuration       ICache   DCache   Area      XPRES    ICache   DCache   Area      XPRES    Difference
                      (bytes)  (bytes)  (mm2)              (bytes)  (bytes)  (mm2)

  5 cores             2048     2048     0.69315            2048     2048     0.69315
  (4 stages)          1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.75445   √        1024     1024     0.75445   √
                      1024     1024     0.65502            1024     1024     0.65502
                      2048     1024     0.86497   √        2048     1024     0.86497   √
    Total Area (mm2)                    3.622621                             3.622621            0%
    Runtime (cycles)           2,509,052                            2,509,052                    0%
    Cost value                 9,089,344                            9,089,344                    0%

  5 cores             2048     2048     0.85636   √        2048     2048     0.85636   √
  (5 stages)          2048     1024     1.01799   √        2048     2048     1.03706   √
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.81981   √        1024     1024     0.81981   √
                      1024     1024     0.65502            1024     1024     0.65502
    Total Area (mm2)                    4.004213                             4.023277            0.48% ↑
    Runtime (cycles)           2,385,684                            2,384,698                    0.04% ↓
    Cost value                 9,552,786                            9,594,300                    0.43% ↑

  6 cores             2048     4096     0.73678            2048     4096     0.73678
  (4 stages)          1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      1024     1024     0.65502            1024     1024     0.65502
                      4096     4096     0.97129   √        4096     4096     0.97129   √
    Total Area (mm2)                    4.328164                             4.328164            0%
    Runtime (cycles)           2,231,559                            2,231,559                    0%
    Cost value                 9,658,553                            9,658,553                    0%

  9 cores             4096     4096     0.94980   √        4096     2048     0.90617   √
  (5 stages)          4096     2048     0.73678            4096     2048     0.73678
                      4096     2048     1.08069   √        4096     2048     1.08069   √
                      4096     2048     1.08069   √        4096     2048     1.08069   √
                      4096     2048     0.78221   √        4096     2048     0.78221   √
                      4096     2048     0.78221   √        4096     2048     0.78221   √
                      4096     2048     0.78221   √        4096     2048     0.78221   √
                      4096     2048     0.88646   √        4096     2048     0.88646   √
                      1024     1024     0.65502            1024     1024     0.65502
    Total Area (mm2)                    7.736048                             7.692421            0.56% ↓
    Runtime (cycles)           1,812,193                            1,825,402                    0.73% ↑
    Cost value                 14,019,212                           14,041,760                   0.16% ↑

Table 7.5: Configurations obtained for the JPEG benchmark. The table compares the configurations obtained via the estimation-based and heuristic approaches.

Figure 7.14: JPEG multiprocessor pipeline systems design space. (a) Runtime (clock cycles, ×10^6) versus area (mm2) for the 1-core, 5-core (4 stages), 5-core (5 stages), 6-core (4 stages) and 9-core (5 stages) systems, with the optimal points for each topology marked. (b) Cost (runtime × area, ×10^7) versus area (mm2), with the optimal points for each topology marked.

Figure 7.15: MP3 multiprocessor pipeline systems design space. (a) Runtime (clock cycles, ×10^10) versus area (mm2) for the 1-core, 4-core (3 stages) and 6-core (4 stages) systems, with the optimal points for each topology marked. (b) Cost (runtime × area, ×10^10) versus area (mm2), with the optimal points for each topology marked.

Figure 7.16: Pareto points of a JPEG multiprocessor pipeline systems design space. (a) Runtime (clock cycles, ×10^6) versus area (mm2); (b) cost (runtime × area, ×10^7) versus area (mm2).

Figure 7.17: Pareto points of an MP3 multiprocessor pipeline systems design space. (a) Runtime (clock cycles, ×10^10) versus area (mm2); (b) cost (runtime × area, ×10^10) versus area (mm2).

                      Estimation-based                     Heuristic
  Configuration       ICache   DCache   Area      XPRES    ICache   DCache   Area      XPRES    Difference
                      (bytes)  (bytes)  (mm2)              (bytes)  (bytes)  (mm2)

  4 cores             2048     1024     0.76409            2048     1024     0.76409
  (3 stages)          2048     1024     0.76409            2048     1024     0.76409
                      2048     2048     0.89911   √        2048     1024     0.90146   √
                      4096*    4096     1.03740            4096     2048     0.82678
    Total Area (mm2)                    3.486102                             3.256417            6.59% ↓
    Runtime (cycles)           3,865,873,883                        4,150,654,337               7.37% ↑
    Cost value                 13,476,829,531                       13,516,259,783              0.29% ↑

  6 cores             1024     1024     0.74502            1024     1024     0.74502
  (4 stages)          2048     1024     0.76409            2048     1024     0.76409
                      2048     1024     0.76409            2048     1024     0.76409
                      2048     1024     0.76409            2048     1024     0.76409
                      1024     1024     0.74502            1024     1024     0.74502
                      4096     1024     0.80772            4096     2048     0.82678
    Total Area (mm2)                    4.590027                             4.609091            0.42% ↑
    Runtime (cycles)           3,555,683,787                        3,544,700,433               0.31% ↓
    Cost value                 16,320,684,017                       16,337,846,899              0.11% ↑

  * This cache configuration has an associativity of two. Note that all other configurations have a set associativity of one.

Table 7.6: Configurations obtained for the MP3 benchmark. The table compares the configurations obtained via the estimation-based and heuristic approaches.

From our heuristic, we are able to determine the near-optimal configurations (lowest cost value): the design with five processors (four stages) for JPEG and the design with four processors (three stages) for MP3. The designs for these systems are shown in Figures 7.1(b) and 7.1(f) respectively.

For JPEG encoding, our heuristic came within 0.43% of the optimum value (lowest cost, obtained via the estimation-based search; refer to Section 7.4.4), while for MP3 encoding we came within 0.29% of the lowest cost attainable. The heuristic analysis provided us with the configuration that is closest to optimal for the particular MP3 and JPEG encoders.

In Figures 7.14(a) and 7.15(a), we show that our heuristic provides points close to the optimum runtime while minimizing the cost function. With the five processor (four stages) parallel pipeline JPEG implementation, we obtain at least a 4.03× speedup over the fastest single processor implementation, and a 3.31× speedup with the four processor (three stages) parallel pipeline MP3 encoder. These comparisons are performed between the configurations obtained through the heuristic approach and the fastest implementation of the single processor system (JPEG benchmark: 32kB instruction and data caches with associativities of 2 and 4 respectively; MP3 benchmark: 32kB caches with an associativity of 4 for both instruction and data caches).

Figures 7.16(a) and 7.17(a) show the Pareto points of Figures 7.14 and 7.15. The multiprocessor configurations (the group of points on the right of each figure) provide better speedups than the single processor configuration. Figures 7.16(b) and 7.17(b) show the cost functions derived from the corresponding runtime graphs. It can be seen that as performance increases, the cost of the multiprocessor implementation increases as well.

In contrast to the optimal cost values of the multiprocessor systems shown in Tables 7.5 and 7.6, the most cost efficient single processor systems are the system with a 4096-byte instruction cache and a 2048-byte data cache without XPRES enabled for the JPEG benchmark (cost of 8,209,174), and the system with a 4096-byte instruction cache and a 2048-byte data cache without XPRES enabled for the MP3 benchmark (cost of 11,231,663,263). In other words, when comparing the most cost efficient single processor system with the heuristically selected multiprocessor system, the JPEG benchmark has a speedup of 4.44× (area increase of 4.92×) and the MP3 benchmark has a speedup of 3.51× (area increase of 4.22×).

The more cost efficient single processor system (compared with our multiprocessor system) is in line with the notion that there is inefficiency in distributing workload among processors. However, our configuration space |K| is relatively small, as we did not explore subsets of the instruction set architecture (ISA) due to simulation time constraints. If this space were explored, it would be possible to obtain an even more cost efficient multiprocessor system compared with a single processor approach.

7.7 Discussion

Our approach uses ASIPs as processing entities in a multiprocessor system; these provide configurability options not found in conventional processors and DSPs. ASIPs in multiprocessors allow different processing entities to be configured independently.

Thus, components with low computational intensity can be configured to be basic and small, while highly computation-intensive components are augmented with wider buses, extensible instructions and bigger cache sizes. However, the design space increases exponentially as more ASIPs are added to the system, and a thorough exploration is not feasible for a large design with a multitude of processors. Our heuristic is well suited to such design space exploration.

The heuristic provides a quick evaluation of a multiprocessor pipeline design and produces a set of recommended configurations, one for each processor in the design, whose cost is close to minimal. This methodology trades off accuracy for improved search time and should be used during the design stage of a multiprocessor pipeline implementation. The designer can then quickly evaluate the performance of different multiprocessor configurations (i.e., different pipeline stages and flows; refer to Figure 7.1) without performing detailed and thorough simulations. This greatly reduces design time while increasing the search space that can be covered.

Usually, a cost function for hardware is the product of the area and runtime values, and the result with the lowest product is selected. However, our heuristic in Figure 7.9 uses the sum of the normalized values of area and core iteration runtime. A characteristic of the data is that higher area values are always associated with lower core iteration runtimes, R^process (refer to Section 7.4.3), and we would like a greater balance between runtime and area. Multiplication intrinsically favors input values of very high or very low magnitude. By summing the normalized area cost and runtime, each quantity is given equal weight and an equal opportunity to be considered. When evaluating among different multiprocessor configurations, the conventional cost calculation (area × runtime) is used (see Tables 7.5 and 7.6).

The heuristic contains three coefficient values (i.e., α, β and γ) which provide weights to the normalized area, the normalized runtime and the initialization/finalization runtime ratios respectively. These coefficients have been set to the value of 1 and have provided results that are very close to the optimum values. It is understood that this methodology and heuristic require further experimentation and evaluation. However, multiprocessor benchmark programs are hard to obtain due to the complexity of finding and partitioning sequential data streaming programs. Different architectural configurations and benchmark programs may require different coefficient values in order for the heuristic to achieve an acceptable level of performance.

Examples of other possible benchmarks include MPEG1, MPEG2, MPEG4, JPEG2000, OGG, H.261, H.263, AES, DES, Triple DES and various network routing protocols, filters, etc. Most applications are optimized for single processor systems and so require considerable effort to understand the algorithm and to partition the code back into higher level partitions. Proprietary projects have also resulted in scarce availability of readily partitioned streaming programs. In this work, we obtained single processor applications for JPEG and MP3, and partitioned the programs into many variations, as demonstrated in the previous sections.

From Tables 7.5 and 7.6, the systems which achieve the lowest cost values are selected as the most cost efficient and represent the final designs for implementation. The selections are the 5-core (4 stages) system for the JPEG benchmark and the 4-core (3 stages) system for the MP3 benchmark application (refer to Tables 7.5 and 7.6).

The proposed heuristic produces recommended configurations with cost values within 0.5% of the values achieved through the estimation-based approach. As the number of cores increases, it can be observed that the runtime for each individual benchmark decreases while the total area utilization increases. However, cost values increase as more processors are added to the system. This implies that the increase in performance does not compensate for the increase in area used by the extra logic included in the design. These designs can be replaced either with systems with smaller numbers of processors or with pipeline partitions that provide a more balanced utilization rate amongst the processors in the systems. When the final design is selected and confirmed by the designer, an exhaustive search can be used to produce the optimal and most cost efficient design before the multiprocessor pipeline design is sent for implementation and fabrication.

Neighboring processors in the pipeline system are connected via FIFO queues. Queue buffers of sufficient size do not dramatically affect the cost function and performance metric of the proposed algorithm. As only full or empty queues stall the affected pipeline stages, the stage mainly responsible for performance is the critical stage in the pipeline (which is taken into account in our cost metric). The processing rate of the critical stage of the pipeline is assumed to be greater than the incoming data rate of the first processor in the multiprocessor system; thus, the issue of insufficient buffer sizes and the possibility of incoming data overloading the buffers does not arise.

The use of ASIPs with the same ISA provides a uniform platform for designers to build more complex systems. A single-ISA system allows easier code maintenance and mapping of code segments to different physical processors compared with multi-ISA systems. The extensibility option of an ASIP provides a quick and easy optimization path for code segments without introducing different variants of processing entities. Such capability improves performance without drastically increasing area usage. The common set of tools also provides easy verification in a multiprocessor design.

7.8 Conclusion

In conclusion, we have formalized the problem of mapping processor configurations in the context of an ASIP multiprocessor system that is implemented in a pipelined manner. We have also presented a heuristic to obtain a near-optimal configuration (lowest area per performance ratio), given a partitioned benchmark program. This is complemented with a full methodology that uses the heuristic to rapidly explore the architectures provided by the designer, and thus to select the architecture that provides the best performance per unit area. This framework used Tensilica's Xtensa LX [20] configurable cores, which provide the queue interface used to connect the processors in the system in a pipelined configuration. We have explored the design space of such an architecture by using the existing ASIP design flow to rapidly select the best cache configurations and extended instructions to provide the necessary speedup while minimizing area, thus providing an excellent performance to area ratio.

Chapter 8

Conclusions

System performance, area and power utilization have been the major metrics for efficiency in embedded system design. Plenty of research work has explored a variety of different systems and processor architectures to improve the performance of embedded systems, with emphasis on energy consumption and the portability of the system. The architectures explored range from processor-centric architectures, such as extensible systems, superscalarity, reconfigurability and asynchronous designs, to system-wide configurations, such as the topology of components on chip, multiprocessor topology, coprocessor systems and heterogeneous design. With the multitude of architectures to explore and a huge design exploration space, formal methodologies, algorithms, techniques and frameworks are necessary to provide a quick design turnaround time for embedded system design and to increase competitiveness in the market space.

In Chapter 1, a historical view of embedded systems was provided. The humble beginnings of the computer system and the proliferation of such systems into embedded devices were explored. The chapter ended with possible architectural enhancements for embedded systems and the motivation for an automated design methodology for such systems.


Most research to date focuses on application-specific processors that are specifically targeted at single-processor systems. However, there is a growing trend towards System-on-Chip design, particularly multiprocessor systems. The advancement in the miniaturization of silicon technology has made the integration of additional hardware logic possible. The advancement has been so great that most new designs are not able to keep up with the increasing availability of additional circuits. Thus, there is a need for a design methodology to generate application-specific designs that harness these additional processing resources, be it through coprocessors or additional processors in the system-on-chip. Chapters 2 and 3 provided the necessary background knowledge and the rationale behind the approaches taken in the course of this thesis.

Chapter 4 addresses the issues of previous processor generation methodologies, which have mainly focused on customizing instruction sets, by extending the work on base processors. The chapter presents an RTL generation scheme which uses the processor generation tool ASIPmeister for the SimpleScalar/PISA instruction set architecture. The framework supports system calls in order to implement C programs that use standard C libraries for input/output functionality. The framework provides total control of the implementation and configuration of a 6-stage pipeline base processor, thus providing opportunities for further design exploration, not only by extending instructions, but also by reducing the instruction set to improve the performance of the system. Chapter 4 provided an approach to generate a processor with various subsets of instructions, in contrast with other approaches which only extend the base processor. The framework provides the necessary tools and methodology to generate both hardware and software packages. An application is compiled and analyzed to obtain a minimized instruction set, which is used to generate a customized processor using only the necessary instructions. Software libraries (application stack, syscall data, syscall subroutines and bootloader) are generated for the compiled program, which is then emitted as instruction and data memory VHDL files. The VHDL code of the customized processor is then generated and combined with the memory components. For five benchmark applications, it was shown that, on average, processor size can be reduced by 30%, energy consumption by 24%, and performance improved by 24%.

Building upon the framework in Chapter 4, Chapter 5 explored a tightly-coupled coprocessor architecture designed to speed up loop executions within an application. The computation-intensive loops within applications are accelerated by tightly coupling a coprocessor to the ASIP, which can be generated using the methodology from Chapter 4. The availability of the RTL source of the base processor made possible the integration of a coprocessor directly into the customized core. Chapter 5 illustrated the advantages of such an approach by investigating a JPEG encoding algorithm and accelerating one of its loops by implementing it in a coprocessor. The JPEG encoder was profiled and the loop segment with the highest optimization potential (the Huffman encoder) was selected for acceleration and converted to a coprocessor. The architecture introduced achieves parallelism by allowing the coprocessor to assist the base processor during loop execution. Performance was achieved through loop pipelining and memory latency hiding. For comparison purposes, a high-level synthesis (HLS) approach was used to generate a coprocessor that would execute the loop completely within the coprocessor itself. The HLS tool (i.e., SPARK) is able to perform loop unrolling and several other optimizations. To compare with the coprocessor approaches, the JPEG encoder was also accelerated by implementing frequently used instructions as customized instructions. In summary, it was found that the coprocessor approaches achieved much better speedup and lower energy consumption than the customized instruction approach. The integrated coprocessor approach managed to offload more computations from the base processor to the coprocessor than the HLS approach, so achieving better performance. A loop performance improvement of 2.57× was achieved using the custom coprocessor approach, compared with 1.58× for the HLS approach and 1.33× for the customized instruction approach, all in comparison with the main processor. Energy savings within the loop were 57×, 28× and 19× respectively.

Chapter 7 explored the use of multiple cores to speed up sequential streaming applica- tions. Chapter 6 explored the possibility of speeding up the JPEG encoder application with heterogeneous components, such that the speedup increases at a faster rate than area increase. The multiprocessor system was implemented using Tensilica’s Xtensa

LX processor. The multiprocessor architecture explored follows a pipeline concept by which different tasks are mapped and executed in a single processor, and each pro- cessor is a pipeline stage in the overall system. Communication between processors were implemented using queue communication interfaces that are available on Xtensa

LX processors. In Chapter 6, the single processor application (JPEG encoder) was partitioned and parallelized across a number of different multiprocessor configura- tions (with each processor having identical configurations). Later on, each processor was enhanced with additional instructions if it was found to be the bottleneck (i.e., critical processor), or if the workload was small, it was diminished by the use of a less enhanced processor with a reduced cache. Parallelization was carried out with up to nine processors with utilization of between 50% to 80%. Speedups of up to 4.6 with × a seven core system with an area increase of only 3.1 were achieved. Additionally, × Chapter 6 included a case study and comparison of multiprocessors which were based on a master-slave model.

Although Chapter 6 confirmed that a heterogeneous multiprocessor system is a more efficient method for optimizing those applications that can be parallelized, there was no formal methodology available to automate the design of such an architecture. Chapter 7 formulated the problem of mapping each algorithmic stage in the system to an ASIP configuration, and presented an exhaustive algorithm and a heuristic to efficiently search the design space of a pipeline-based heterogeneous multiprocessor system. An optimal configuration is defined as the set of hardware configurations (e.g., instruction sets, cache sizes, etc.) that provides the best performance at the lowest possible cost. In addition to the JPEG benchmark program, an MP3 encoder application was used as a case study example. The benchmark applications were partitioned, and each group of partitions was mapped to a different multiprocessor topology (i.e., different pipeline stages and processor counts). Chapter 7 also presented an estimation technique to obtain the runtime of an application running on a pipeline multiprocessor architecture. Based on the estimation technique, a heuristic was developed to obtain a near-optimal set of configurations for the multiprocessor system. The multiprocessor design provided a performance improvement of at least 5.58× for JPEG and 3.47× for MP3 over a single processor design. The minimum cost (area × clock cycle) obtained through the heuristic was within 0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks respectively (using the exhaustive search method).

As future work, an operating system could be used to manage the allocation and mapping of different partitions of programs to different processors in a pipeline multiprocessor network. Such an infrastructure would prove necessary if different applications are to be used on the same platform. For example, a device might be used to process JPEG images and also for voice recording.

Plenty of research has been done on the sharing of components and interconnects within processor architectures. Building on these concepts and ideas, different heterogeneous multipipeline processor systems, as presented in Chapters 6 and 7, can be merged, with processors and queues shared among different applications. There will also be a need in the future to develop such a pipeline multiprocessor architecture optimized for a particular group of streaming applications.

The methodologies and results presented in this thesis demonstrate that customizing processor platforms specifically for targeted applications is necessary for an efficient system: the best performance for the lowest possible cost. It is also shown that a formal methodology leads to better designs, more efficient systems and reduced design times. To meet the challenge of increasing processing resources in the coming years, the thesis has thus presented a group of approaches that efficiently use such resources and that will lead to designs meeting performance, area and energy constraints.

Bibliography

[1] Altera Nios Processor. Altera Corp. (http://www.altera.com).

[2] Apollo Guidance Computer. Computer History Museum (http://www.computerhistory.org).

[3] ARC Configurable Processors. ARC International (http://www.arc.com).

[4] ARCtangent. ARC International (http://www.arc.com).

[5] ASIP Meister. ASIP Solutions (http://www.asip-solutions.com/asip meister.html).

[6] Cascade. CriticalBlue (http://www.criticalblue.com).

[7] Design Compiler. Synopsys, Inc. (http://www.synopsys.com).

[8] Independent JPEG Group. IJG (http://www.ijg.org).

[9] Intel Itanium Processor Microarchitecture Overview. Intel Corp. (http://www.intel.com/design/Itanium/microarch ovw/).

[10] Jazz DSP. Improv Inc. (http://www.improvsys.com).

[11] JPEG Encoder Core. Alma Technologies (http://www.alma-tech.com).

[12] Lexus LS. Lexus (http://www.lexus.com/models/LS/).

[13] MicroBlaze Architecture. Xilinx, Inc. (http://www.xilinx.com/ipcenter/processor central//architecture.htm).

[14] Rockwell PPS-4. The Antique Chip Collector's Page (http://www.antiquetech.com/chips/PPS-4.htm).

[15] SP-5flex. 3DSP Corp. (http://www.3dsp.com).

[16] SystemC Initiative (http://www.systemc.org).

[17] Tensilica. Tensilica, Inc. (http://www.tensilica.com).

[18] TMS1000. The Antique Chip Collector's Page (http://www.antiquetech.com/chips/TMS1000.htm).

[19] Xilinx. Xilinx, Inc. (http://www.xilinx.com).

[20] Xtensa Processor. Tensilica Inc. (http://www.tensilica.com).

[21] Intel XScale Core: Developer's Manual. Intel Corporation, 2000.

[22] Application to silicon: Understanding the Improv methodology. White paper, Improv Systems Inc., June 2001.

[23] FLIX: Fast relief for performance-hungry embedded applications. Tensilica Inc. (http://www.tensilica.com/pdf/FLIX White Paper v2.pdf), 2005.

[24] Our History. Xilinx (http://www.xilinx.com/company/history.htm), 2007.

[25] Emile Aarts, Panos Markopoulos, and Boris Ruyter. The persuasiveness of ambient intelligence. In Security, Privacy, and Trust in Modern Data Management, pages 367–381. Springer Berlin Heidelberg, 2007.

[26] Ulises Agüero and Subrata Dasgupta. A Plausibility-Driven Approach to Computer Architecture Design. Communications of the ACM, 30(11):922–932, 1987.

[27] Sudhir Ahuja, Nicholas J. Carriero, David H. Gelernter, and Venkatesh Krishnaswamy. Matching Language and Hardware for Parallel Computation in the Linda Machine. IEEE Transactions on Computers, 37(8):921–929, 1988.

[28] E. Arnould, H. T. Kung, O. Menzilcioglu, and K. Sarocky. A Systolic Array Computer. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), volume 10, pages 232–235, 1985.

[29] Siamak Arya, Howard Saches, and Sreeram Duvvuru. An Architecture for High Instruction Level Parallelism. In Proceedings of the 28th Annual Hawaii International Conference on System Sciences, volume 1, pages 153–162, Wailea, HI, USA, 1995.

[30] Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer, 35(2):59–67, 2002.

[31] Jakob Axelsson. A Case Study in Heterogeneous Implementation of Automotive Real-Time Systems. In CODES'98, Seattle, 1998.

[32] Sati Banerjee, Takeo Hamada, Paul M. Chau, and Ronald D. Fellman. Macro Pipelining Based Scheduling on High Performance Heterogeneous Multiprocessor Systems. IEEE Transactions on Signal Processing, 43(6):1468–1484, 1995.

[33] Michael Barr. Embedded Systems Glossary (http://www.netrino.com/Publications/Glossary/), 2007.

[34] Sanjoy Baruah. Task partitioning upon heterogeneous multiprocessor platforms. In RTAS'04, pages 536–543, 2004.

[35] Aleksandar Berić, Ramanathan Sethuraman, Carlos Alba Pinto, Harm Peters, Gerard Veldman, Peter van de Haar, and Marc Duranton. Heterogeneous Multiprocessor for High Definition Video. In ICCE'06, pages 401–402, 2006.

[36] Lee Boysel and J. Murphy. Four-Phase LSI Logic Offers New Approach to Computer Designer. Computer Design, pages 141–146, April 1970.

[37] Tracy D. Braun, Howard Jay Siegel, and Anthony A. Maciejewski. Heterogeneous computing: Goals, methods, and open problems. In HiPC 2001, volume 2228, pages 302–320, Hyderabad, India, 2001. Springer.

[38] Steve Carr. Combining Optimization for Cache and Instruction-Level Parallelism. In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques, pages 238–247, Boston, MA, USA, 1996.

[39] Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. Compiler Optimizations for Improving Data Locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), pages 252–262, San Jose, CA, USA, 1994.

[40] Karam S. Chatha and Ranga Vemuri. A Tool for Partitioning and Pipelined Scheduling of Hardware-Software Systems. In Proceedings of the 11th International Symposium on System Synthesis, pages 145–151, Hsinchu, 1998.

[41] Newton Cheung, J¨orgHenkel, and Sri Parameswaran. Rapid configuration &

instruction selection for an ASIP: A case study. In Norbert Wehn and Diederik

Verkest, editors, Proceedings of the Design, Automation and Test in Europe BIBLIOGRAPHY 204

Conference and Exhibition (DATE), pages 802–807, Messe Munich, Germany,

2003. IEEE Computer Society, Los Alamitos, California.

[42] Newton Cheung, Sri Parameswaran, and J¨orgHenkel. INSIDE: INstruction

selection/identification & design exploration for extensible processors. In Pro-

ceedings of the International Conference on Computer Aided Design (ICCAD),

pages 291–297, 2003.

[43] Newton Cheung, Sri Parameswaran, Jörg Henkel, and Jeremy Chan. MINCE: Matching instructions using combinational equivalence for extensible processor. In Georges Gielen and Joan Figueras, editors, Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), volume 2, pages 1020–1025, CNIT La Défense, Paris, France, 2004. IEEE Computer Society, Los Alamitos, California.

[44] Hoon Choi, Jong-Sun Kim, Chi-Won Yoon, In-Cheol Park, Sung Ho Hwang, and Chong-Min Kyung. Synthesis of application specific instructions for embedded DSP software. IEEE Transactions on Computers, 48(6):603–614, 1999.

[45] Yaohan Chu. Application-specific coprocessor computer architecture. In Proceedings of the International Conference on Application Specific Array Processors, pages 653–664, Princeton, NJ, USA, 1990.

[46] Jason Cong, Yiping Fan, Guoling Han, and Zhiru Zhang. Application-specific instruction generation for configurable processor architectures. In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, pages 183–189, Monterey, California, USA, 2004. ACM Press, New York, NY, USA.

[47] Jason Cong, Guoling Han, and Wei Jiang. Synthesis of an Application-Specific Soft Multiprocessor System. In Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field programmable gate arrays (FPGA'07), pages 99–107, Monterey, CA, USA, 2007. ACM Press, New York, NY, USA.

[48] James W. Cooley and John W. Tukey. An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation, 19:297–301, 1965.

[49] CriticalBlue. Coprocessor synthesis – increasing system on chip platform ROI. Technical report, CriticalBlue, June 2004.

[50] Richard Curtin. ASIC Design Methods Using VHDL. In Euro ASIC'90, pages 176–179, 1990.

[51] S. Dasgupta. On the Verification of Computer Architectures using an Architecture Description Language. In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 32–38, Stockholm, Sweden, 1983. IEEE Press, New York.

[52] S. Dasgupta. The Design and Description of Computer Architectures. Wiley, New York, 1984.

[53] Srinivas Devadas and Sharad Malik. A survey of optimization techniques targeting low power VLSI circuits. In Proceedings of the 32nd Design Automation Conference (DAC '95), pages 242–247, San Francisco, CA, 1995.

[54] Giuliano Donzellini, Stefano Nervi, Domenico Ponta, Sergio Rossi, and Stefano Rovetta. Object Oriented ARM7 Coprocessor. Pages 243–252. IEEE Computer Society, Washington, DC, USA, 1998.

[55] Mike Ebbers, Wayne O'Brien, and Bill Ogden. Introduction to the New Mainframe: z/OS Basics. IBM Corporation, 1st edition, 2006.

[56] Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm, and Dean M. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, 17(5):12–19, 1997.

[57] Rolf Ernst, Jörg Henkel, and Thomas Benner. Hardware-software cosynthesis for microcontrollers. IEEE Design & Test of Computers, 10(4):64–75, 1993.

[58] Jennifer Eyre and Jeff Bier. The evolution of DSP processors. IEEE Signal Processing Magazine, 17(2):43–51, March 2000.

[59] Federico Faggin. The Intel 4004 (http://www.intel4004.com/).

[60] Krisztián Flautner, Rich Uhlig, Steve Reinhardt, and Trevor Mudge. Thread Level Parallelism of Desktop Applications. In Proceedings of the Workshop on Multi-threaded Execution, Architecture and Compilation (MTEAC 2000), page 9, 2000.

[61] Gene Frantz and Ray Simar. DSP: Of processors and processing. Queue, 2(1):22–30, March 2004.

[62] M. Freericks. The nML machine description formalism. Technical report, Technische Universität Berlin, Computer Science, Berlin, 1993.

[63] C. H. Gebotys and R. J. Gebotys. Designing for Low Power in Complex Embedded DSP Systems. In Proceedings of the 32nd Hawaii International Conference on System Sciences (HICSS-32), volume 3, page 8, Maui, HI, 1999.

[64] Tayeb A. Giuma and Kevin W. Hart. Microcomputer Bus Architectures. In Southcon Conference, pages 431–437, Orlando, FL, 1996.

[65] Tony Givargis, Frank Vahid, and Jörg Henkel. System-Level Exploration for Pareto-Optimal Configurations in Parameterized System-on-a-Chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 10(4):416–422, 2002.

[66] Tilman Glökler and Heinrich Meyr. Power reduction for ASIPs: a case study. In IEEE Workshop on Signal Processing Systems, pages 235–246, Antwerp, Belgium, 2001.

[67] James R. Goodman, Jian-tu Hsieh, Koujuch Liou, Andrew R. Pleszkun, P. B. Schechter, and Honesty C. Young. PIPE: A VLSI Decoupled Architecture. In Proceedings of the 12th annual international symposium on Computer architecture, pages 20–27. IEEE Computer Society Press, Los Alamitos, CA, USA, 1985.

[68] Gert Goossens, Dirk Lanneer, Werner Geurts, and Johan Van Praet. Design of ASIPs in multi-processor SoCs using the Chess/Checkers retargetable tool suite. In International Symposium on System-on-Chip (SoC 2006), pages 1–4, Tampere, Finland, 2006.

[69] Sathish Gopalakrishnan and Marco Caccamo. Task Partitioning with Replication upon Heterogeneous Multiprocessor Systems. In RTAS'06, pages 199–207, 2006.

[70] Herbert R. J. Grosch. Computer: Bit Slices From A Life. 2003.

[71] Rajesh K. Gupta and Giovanni De Micheli. Specification and analysis of timing constraints for embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 16(3):241–256, 1997.

[72] Sumit Gupta, Nikil Dutt, Rajesh Gupta, and Alex Nicolau. SPARK: A high-level synthesis framework for applying parallelizing compiler transformations. In International Conference on VLSI Design, pages 461–466, 2003.

[73] George Hadjiyiannis, Silvina Hanono, and Srinivas Devadas. ISDL: An instruction set description language for retargetability. In Proceedings of the Design Automation Conference (DAC), pages 299–302, 1997.

[74] Ashok Halambi, Peter Grun, Vijay Ganesh, Asheesh Khare, Nikil Dutt, and Alex Nicolau. EXPRESSION: a language for architecture exploration through compiler/simulator retargetability. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), pages 485–490, Munich, Germany, 1999.

[75] Eric Hamilton. JPEG File Interchange Format. Technical report, C-Cube Microsystems, September 1, 1992.

[76] Uwe Hansmann. Pervasive Computing. Springer-Verlag, 2000.

[77] Scott Hauck, Thomas W. Fry, Matthew M. Hosler, and Jeffrey P. Kao. The Chimaera Reconfigurable Functional Unit. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 12(2):206–217, 2004.

[78] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 3rd edition, 2003.

[79] Sato Hiroyuki and Yoshida Teruhiko. Characteristics of loop unrolling effect: software pipelining and memory latency hiding. In Innovative Architecture for Future Generation High-Performance Processors and Systems, pages 63–72, Maui, HI, USA, 2001.

[80] Chao Huang, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. High-level Synthesis Using Computation-unit Integrated Memories. In IEEE/ACM International Conference on Computer Aided Design (ICCAD-2004), pages 783–790, 2004.

[81] Ing-Jer Huang and Alvin M. Despain. Synthesis of application specific instruction sets. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 14(6):663–675, 1995.

[82] Kuo-Chan Huang and Feng-Jian Wang. Design patterns for Parallel Computations of Master-Slave Model. In International Conference on Information, Communications and Signal Processing, volume 3, pages 1508–1512, 1997.

[83] Luong D. Hung and Shuichi Sakai. Dynamic Estimation of Task Level Parallelism with Operating System Support. In Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN'05), page 6, 2005.

[84] J. K. Hunter, J. V. McCanny, A. Simpson, Y. Hu, and J. G. Doherty. JPEG encoder system-on-a-chip demonstrator. In Conference Record of the Thirty-Third Asilomar Conference on Signals, Systems, and Computers, volume 1, pages 762–766, 1999.

[85] Makiko Itoh, Shigeaki Higaki, Jun Sato, Akichika Shiomi, Yoshinori Takeuchi, Akira Kitajima, and Masaharu Imai. PEAS-III: an ASIP design environment. In Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors, pages 430–436, Austin, TX, USA, 2000.

[86] Makiko Itoh, Yoshinori Takeuchi, Masaharu Imai, and Akichika Shiomi. Synthesizable HDL generation for pipelined processors from a micro-operation description. IEICE Transactions on Fundamentals, E83-A(3):394–400, 2000.

[87] Tom R. Jacobs, Vassilios A. Chouliaras, and David J. Mulvaney. Thread-Parallel MPEG-4 and H.264 Coders for System-on-Chip Multi-Processor Architectures. In Digest of Technical Papers, 2006 International Conference on Consumer Electronics (ICCE '06), pages 91–92, 2006.

[88] Manoj Kumar Jain, M. Balakrishnan, and Anshul Kumar. ASIP design methodologies: survey and issues. In Proceedings of the International Conference on VLSI Design, pages 76–81, Bangalore, India, 2001.

[89] Jinhwan Jeon and Kiyoung Choi. Loop Pipelining in Hardware-Software Partitioning. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '98), pages 361–366, Yokohama, Japan, 1998.

[90] Gilles Kahn. The semantics of a simple language for parallel programming. In IFIP'74, pages 471–475, Stockholm, Sweden, 1974.

[91] J. F. Kaiser. Considerations in the Hardware Implementation of Digital Filters. In 1974 IEEE Conference on Decision and Control including the 13th Symposium on Adaptive Processes, volume 13, pages 106–107, 1974.

[92] Ron Kalla, Balaram Sinharoy, and Joel M. Tendler. IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro, 24(2):40–47, 2004.

[93] Arun Kejariwal, Prabhat Mishra, Jonas Astrom, and Nikil Dutt. HDLGen: Architecture description language driven HDL generation for pipelined processors. Technical report, Center for Embedded Computer Systems, University of California, Irvine, CA 92697, USA, February 2003.

[94] Manho Kim, Daewook Kim, and Gerald E. Sobelman. MPEG-4 performance analysis for a CDMA network-on-chip. In Proceedings of the 2005 International Conference on Communications, Circuits and Systems, pages 493–496, 2005.

[95] Kevin D. Kissell. MIPS MT: A Multithreaded RISC Architecture for Embedded Real-Time Processing. High Performance Embedded Architectures and Compilers (HiPEAC 2008), 4917:9–21, 2008.

[96] Shinsuke Kobayashi, Kentaro Mita, Yoshinori Takeuchi, and Masaharu Imai. Rapid prototyping of JPEG encoder using the ASIP development system: PEAS-III. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 485–488, Hong Kong Exhibition and Convention Centre, Hong Kong, 2003.

[97] Takeshi Kodaka, Keiji Kimura, and Hironori Kasahara. Multigrain Parallel Processing for JPEG Encoding on a Single Chip Multiprocessor. In IWIA'02, pages 57–63, 2002.

[98] David Koufaty and Deborah T. Marr. Hyperthreading Technology in the Netburst Microarchitecture. IEEE Micro, 23(2):56–65, 2003.

[99] Kayhan Küçükçakar. An ASIP design methodology for embedded systems. In Proceedings of the Seventh International Workshop on Hardware/Software Codesign, pages 17–21, Rome, Italy, 1999.

[100] R. Kumar, D.M. Tullsen, N.P. Jouppi, and P. Ranganathan. Heterogeneous Chip Multiprocessors. Computer, 38(11):32–38, November 2005.

[101] A. Langi and W. Kinsner. An architectural design of a wavelet coprocessor. In Conference Proceedings of the 1994 Canadian Conference on Electrical and Computer Engineering, volume 2, pages 497–500, Halifax, NS, 1994.

[102] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, pages 330–335, Research Triangle Park, NC, USA, 1997.

[103] Jae-Jin Lee and Gi-Yong Song. Super-Systolic Array for 2D Convolution. In TENCON 2006, 2006 IEEE Region 10 Conference, pages 1–4, Hong Kong, 2006.

[104] Jong-eun Lee, Kiyoung Choi, and Nikil D. Dutt. Energy-Efficient Instruction Set Synthesis for Application-Specific Processors. In Proceedings of the 2003 international symposium on Low power electronics and design (ISLPED'03), pages 330–333, Seoul, Korea, 2003. ACM Press, New York, NY, USA.

[105] Mike Tien-Chien Lee, Vivek Tiwari, Sharad Malik, and Masahiro Fujita. Power Analysis and Minimization Techniques for Embedded DSP Software. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 5(1):123–135, 1997.

[106] Rainer Leupers and Peter Marwedel. Instruction-set modelling for ASIP code generation. In Proceedings of the International Conference on VLSI Design, pages 77–80, Bangalore, India, 1996.

[107] Rainer Leupers and Peter Marwedel. Retargetable Code Generation based on Structural Processor Descriptions. Design Automation for Embedded Systems, 3(1):75–108, 1998.

[108] Ted G. Lewis and Hesham El-Rewini. Introduction to Parallel Computing. Prentice Hall, 1992.

[109] Jianmin Li and Chung-Kuan Cheng. Routability Improvement Using Dynamic Interconnect Architecture. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), pages 61–67, Napa Valley, CA, 1995. IEEE Computer Society.

[110] Chih-chang Lin, Malgorzata Marek-Sadowska, and Duane Gatlin. Universal logic gate for FPGA design. In Proceedings of the 1994 IEEE/ACM international conference on Computer-aided design (ICCAD'94), pages 164–168, San Jose, California, USA, 1994. IEEE Computer Society Press, Los Alamitos, CA, USA.

[111] Hung-Yueh Lin, Tay-Jyi Lin, Chie-Min Chao, Yen-Chin Liao, Chih-Wei Liu, and Chein-Wei Jen. Static floating-point unit with implicit exponent tracking for embedded DSP. In Proceedings of the 2004 International Symposium on Circuits and Systems (ISCAS '04), volume 2, pages 821–824, 2004.

[112] Tay-Jyi Lin and Chein-Wei Jen. Cascade - configurable and scalable DSP environment. In IEEE International Symposium on Circuits and Systems (ISCAS 2002), volume 4, pages 870–873, 2002.

[113] Tony C. Lin. Development of U.S. Air Force Intercontinental Ballistic Missile Weapon Systems. Journal of Spacecraft and Rockets, 40(4):491–509, 2003.

[114] Lars Lundberg. Predicting and Bounding the Speedup of Multithreaded Solaris Programs. Journal of Parallel and Distributed Computing, 57(3):322–333, 1999.

[115] Elmar Maas, Dirk Herrmann, Rolf Ernst, Peter Rüffer, Sieghard Hasenzahl, and Martin Seitz. A processor-coprocessor architecture for high end video applications. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-97), volume 1, pages 595–598, Munich, 1997.

[116] Pedro Marcuello, Antonio González, and Jordi Tubella. Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism. IEEE Transactions on Computers, 53(2):114–125, 2004.

[117] Ernesto Martins and José A. Fonseca. Traffic Scheduling Coprocessor with Schedulability Analysis Capability. In Proceedings of the Euromicro Symposium on Digital Systems Design (DSD'01), pages 127–134. IEEE Computer Society, Washington, DC, USA, 2001.

[118] Peter Marwedel. Embedded System Design. Kluwer Academic Publishers, 2003.

[119] Huzefa Mehta, Robert Michael Owens, Mary Jane Irwin, Rita Chen, and Debashree Ghosh. Techniques for Low Energy Software. In Proceedings of the 1997 International Symposium on Low Power Electronics and Design, pages 72–75, 1997.

[120] Prabhat Mishra, Arun Kejariwal, and Nikil Dutt. Rapid exploration of pipelined processors through automatic generation of synthesizable RTL models. In Workshop on Rapid System Prototyping (RSP), 2003.

[121] Prabhat Mishra, Arun Kejariwal, and Nikil Dutt. Synthesis-driven exploration of pipelined embedded processors. In Proceedings of the International Conference on VLSI Design (VLSID'04), pages 921–926, 2004.

[122] Todd C. Mowry, Monica S. Lam, and Anoop Gupta. Design and Evaluation of a Compiler Algorithm for Prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62–75, Boston, Massachusetts, USA, 1992.

[123] Sri Parameswaran, Matthew F. Parkinson, and Peter Bartlett. Profiling in the ASP codesign environment. Journal of Systems Architecture, 46(14):1263–1274, 2000.

[124] Jorgen Matthew David Peddersen, Seng Lin Shee, Andhi Janapsatya, and Sri Parameswaran. Rapid embedded hardware/software system generation. In 18th International Conference on VLSI Design, pages 111–116, Taj Bengal, Kolkata, India, 2005.

[125] D. Pham et al. The design and implementation of a first-generation cell processor. In ISSCC 2005, pages 184–186. IEEE CS Press, 2005.

[126] David Pok, Henry Chien-In Chen, C. Montgomery, B. Y. Tsui, and John Schamus. ASIC Design for Monobit Receiver. In Proceedings of the Tenth Annual IEEE International ASIC Conference and Exhibit, pages 142–146, Portland, OR, USA, 1997.

[127] Scott R. Powell and Paul M. Chau. A Model for Estimating Power Dissipation in a Class of DSP VLSI Chips. IEEE Transactions on Circuits and Systems, 38(6):646–650, 1991.

[128] Shiv Prakash and Alice C. Parker. SOS: Synthesis of Application-Specific Heterogeneous Multiprocessor Systems. Journal of Parallel and Distributed Computing, 16(4):338–351, 1992.

[129] Xiang-Ju Qin, Ming-Cheng Zhu, Zhong-Yi Wei, and Du Chao. An Adaptive Viterbi Decoder Based on FPGA Dynamic Reconfiguration Technology. In Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology (ICFPT'04), pages 315–318, 2004.

[130] Lawrence R. Rabiner and Bernard Gold. Theory and Application of Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1975.

[131] Allan Rae and Sri Parameswaran. Application-Specific Heterogeneous Multiprocessor Synthesis Using Differential-Evolution. In Proceedings of the 11th International Symposium on System Synthesis, pages 83–88, Hsinchu, Taiwan, 1998. IEEE Computer Society.

[132] B. Ramakrishna Rau. Iterative Modulo Scheduling: An Algorithm For Software Pipelining Loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO-27), pages 63–74, San Jose, CA, USA, 1994.

[133] Rahul Razdan and Michael D. Smith. A high-performance microarchitecture with hardware-programmable functional units. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO-27), pages 172–180, 1994.

[134] F. Salice, L. Del Vecchio, L. Pomante, and W. Fornaciari. Partitioning of Embedded Applications onto Heterogeneous Multiprocessor Architectures. In ACM symposium on Applied computing, pages 661–665, Melbourne, Florida, 2003.

[135] Alberto Sangiovanni-Vincentelli and Grant Martin. Platform-Based Design and Software Design Methodology for Embedded Systems. IEEE Design & Test of Computers, 18(6):23–33, November–December 2001.

[136] Tero Säntti and Juha Plosila. Architecture for an Advanced Java Co-Processor. In International Symposium on Signals, Circuits & Systems (ISSCS 2005), 2005.

[137] Klaus E. Schauser, Chris J. Scheiman, J. Mitchell Ferguson, and Paul Z. Kolano. Exploiting the Capabilities of Communications Co-Processors. In Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), pages 109–115. IEEE Computer Society, Washington, DC, USA, 1996.

[138] Seng Lin Shee. VLSI chip implementation for communication protocols: JSCHIP Project. Undergraduate thesis, The University of New South Wales, 2003.

[139] Seng Lin Shee, Andrea Erdos, and Sri Parameswaran. Heterogeneous Multiprocessor Implementations for JPEG: A Case Study. In IEEE/ACM/IFIP International Conference on Hardware - Software Codesign and System Synthesis (CODES+ISSS), Seoul, Korea, 2006.

[140] Seng Lin Shee and Sri Parameswaran. Design Methodology for Pipelined Heterogeneous Multiprocessor System. In Design Automation Conference (DAC'07), pages 811–816, San Diego, CA, USA, 2007.

[141] Gilbert C. Sih and Edward A. Lee. Declustering: A New Multiprocessor Scheduling Technique. IEEE Transactions on Parallel and Distributed Systems, 4(6):625–637, 1993.

[142] James E. Smith. Dynamic Instruction Scheduling and the Astronautics ZS-1. IEEE Computer, 22(7):21–35, 1989.

[143] James E. Smith and Gurindar S. Sohi. The Microarchitecture of Superscalar Processors. Proceedings of the IEEE, 83(12):1609–1624, 1995.

[144] Litong Song and Krishna Kavi. What can we gain by unfolding loops? ACM SIGPLAN Notices, 39(2):26–33, February 2004.

[145] Lawrence Spracklen and Santosh G. Abraham. Chip Multithreading: Opportunities and Challenges. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA-11 2005), pages 248–252, 2005.

[146] Greg Stitt, Roman Lysecky, and Frank Vahid. Dynamic hardware/software partitioning: a first approach. In Design Automation Conference, 2003, pages 250–255, 2003.

[147] Harold S. Stone. High Performance Computer Architecture (3rd Edition). Addison-Wesley, 1993.

[148] Marino T. J. Strik, Adwin H. Timmer, Jef L. van Meerbergen, and Gert-Jan van Rootselaar. Heterogeneous multiprocessor for the management of real-time video and graphics streams. IEEE Journal of Solid-State Circuits, 35(11):1722–1731, 2000.

[149] Fei Sun, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. Custom-instruction synthesis for extensible-processor platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(2):216–228, 2004.

[150] Fei Sun, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. Synthesis of Application-specific Heterogeneous Multiprocessor Architectures using Extensible Processors. In VLSID'05, pages 551–556, 2005.

[151] Gary M. Swift, Sana Rezgui, Jeffrey George, Carl Carmichael, Matthew Napier, John Maksymowicz, Jason Moore, Austin Lesea, R. Koga, and T. F. Wrobel. Dynamic Testing of Xilinx Virtex-II Field Programmable Gate Array (FPGA) Input Output Blocks (IOBs). In IEEE Nuclear and Space Radiation Effects Conference (NSREC'04), page 7, Atlanta, GA, USA, 2004.

[152] R. Reed Taylor and Herman Schmit. Creating a Power-aware Structured ASIC. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design (ISLPED '04), pages 74–77. ACM Press, New York, NY, USA, 2004.

[153] Shashidhar Thakur and D. F. Wong. On designing ULM-based FPGA logic modules. In Proceedings of the 1995 ACM third international symposium on Field-programmable gate arrays (FPGA'95), pages 3–9, Monterey, CA, 1995. ACM Press, New York, NY, USA.

[154] Vivek Tiwari, Sharad Malik, and Andrew Wolfe. Power Analysis of Embedded Software: A First Step Towards Software Power Minimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2(4):437–445, 1994.

[155] Jordi Tubella and Antonio González. Control speculation in multithreaded processors through dynamic loop detection. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, pages 14–23. IEEE Computer Society, 1998.

[156] Gary Tyson and Matthew Farrens. Code scheduling for multiple instruction stream architectures. International Journal of Parallel Programming (IJPP), 22(3):243–272, 1994.

[157] Johan van Praet, Gert Goossens, Dirk Lanneer, and Hugo De Man. Instruction set definition and instruction selection for ASIPs. In Proceedings of the Seventh International Symposium on High-Level Synthesis, pages 11–16, Niagara-on-the-Lake, Ontario, Canada, 1994. IEEE Computer Society Press.

[158] John von Neumann. First Draft of a Report on the EDVAC. Technical report, University of Pennsylvania, June 30, 1945.

[159] Zvonko G. Vranesic. The FPGA Challenge. In Proceedings of the 28th IEEE International Symposium on Multiple-Valued Logic, pages 121–126, 1998.

[160] Vojin Živojnović, Stefan Pees, and Heinrich Meyr. LISA - machine description language and generic machine model for HW/SW co-design. In Workshop on VLSI Signal Processing, pages 127–136, 1996.

[161] Mark Weiser. Ubiquitous Computing (http://www.ubiq.com/weiser/).

[162] A. Wieferink, M. Doerper, R. Leupers, G. Ascheid, H. Meyr, T. Kogel, G. Braun, and A. Nohl. System Level Processor/Communication Co-exploration Methodology for Multiprocessor System-on-Chip Platforms. IEE Proceedings - Computers and Digital Techniques, 152(1):3–11, 2005.

[163] P. A. Wilsey, M. T. Wright, S. Dasgupta, J. Heinanen, and J. Wang. An S*M Execution Environment. Technical Report TR87-3-1, The Center for Advanced Computer Studies, University of Southwestern Louisiana, 1987.

[164] Ralph D. Wittig and Paul Chow. OneChip: An FPGA Processor With Reconfigurable Logic. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 126–135, Napa Valley, CA, 1996.

[165] Bing-Fei Wu and Chung-Fu Lin. An efficient architecture for JPEG2000 coprocessor. IEEE Transactions on Consumer Electronics, 50(4):1183–1189, 2004.

[166] Takeo Yamada, Seiji Kataoka, and Kohtaro Watanabe. Heuristic and Exact Algorithms for the Disjunctively Constrained Knapsack Problem. Information Processing Society of Japan Journal, 43(9):2864–2870, 2002.

[167] Ning Zhang and Chawn-Hwa Wu. Study on Adaptive Job Assignment for Multiprocessor Implementation of MPEG2 Video Encoding. IEEE Transactions on Industrial Electronics, 44(5):726–734, 1997.

[168] Yinong Zhang and George B. Adams III. Exploiting Instruction Level Parallelism With The DS Architecture. In Proceedings of the 1996 International Conference on Parallel Processing, pages 230–237, 1996.