
Master Thesis Project Report

Register File Exploration of Embedded Processors

Chaitanya Cherukuri

Lund University, Faculty of Engineering (LTH)
Department of Electrical and Information Technology
SE-221 00 Lund, Sweden

Abstract

Low-power design is becoming an integral part of system design with the profusion of mobile devices. The increasing computation requirements of audio/video applications in portable devices imply the need for long battery life and reduced energy dissipation. As battery improvement techniques are not able to satisfactorily address the growing energy requirements of processors, it is important to devise new low-power architectural techniques to reduce power consumption.

In current embedded system processors, the multi-ported register file is one of the most power-hungry parts of the processor. Registers play an important role in the performance and power consumption of a processor, and the register file remains costly even when clustering techniques are applied. In this thesis I describe an IMEC technology called the Very Wide Register (VWR) file architecture, which has single-ported cells and asymmetric interfaces to the memory and to the datapath.

In this report I also describe the tool LISATek. LISATek is an automated embedded processor design and optimization environment that slashes months from processor hardware design time and engineer-years from the creation of processor-specific software development tools. The key to LISATek's automation is its Language for Instruction Set Architectures, LISA 2.0.

Acknowledgments

This work would not have been completed without the help and support of many individuals. I owe a debt of gratitude to everyone who spent their time and effort along the way and helped me to successfully complete this thesis work.

I would like to express my deep and sincere gratitude to my supervisors at the Interuniversity MicroElectronics Center (IMEC), Mr. Praveen Raghavan and Dr. Murali Jayapala. Their wide knowledge and their logical way of thinking have been of great value to me. Their understanding, encouragement and personal guidance provided a good basis for my thesis.

Next, I wish to express my warm and sincere thanks to my supervisors at Lunds Tekniska Högskola, Dr. Viktor Öwall and Dr. Peter Nilsson, for introducing me to the field of digital ASIC design.

I owe my most sincere gratitude to my promoter, Professor Francky Catthoor, for giving me the opportunity to work with the Architecture Team at the Interuniversity MicroElectronics Center (IMEC), Leuven, Belgium. I wish to extend my warmest thanks to all those who have helped me with my work at IMEC. Special thanks to my friend Narasinga Rao Miniskar for his encouragement. Last, I want to thank my parents, without whom I would never have been able to achieve so much.

Chaitanya Cherukuri
Lund, 2008

Contents

Abstract

Acknowledgments

1 INTRODUCTION
  1.1 What is a Register?
  1.2 LISATek

2 Very Wide Register
  2.1 Introduction
  2.2 Very Wide Register
  2.3 Architecture Description
    2.3.1 Memory Design
    2.3.2 Foreground Memory Organization
    2.3.3 Very Wide Register and Datapath Connectivity
  2.4 Example Operation of the Very Wide Register
  2.5 Conclusion

3 Implementation of Very Wide Register
  3.1 Introduction
  3.2 Baseline Reference RISC Architecture Description with Standard Register File
    3.2.1 The Register Module
    3.2.2 The Memory Module
    3.2.3 The Pipeline Module
  3.3 Architecture Description - Very Wide Registers
    3.3.1 Register Module
    3.3.2 Memory Module
  3.4 Conclusion

4 Results
  4.1 Tool Flow
  4.2 Implementation of Benchmark
  4.3 Simulation Results
    4.3.1 Results - LISATek Processor Debugger
    4.3.2 Results - Modelsim
  4.4 Synthesis Results
  4.5 Conclusion

5 Conclusion and Future Work
  5.1 Further Work

A LISATek
  A.1 Introduction
  A.2 LISATek tools
    A.2.1 LISATek Processor Designer
    A.2.2 Generating the model
    A.2.3 LISATek Processor Debugger
    A.2.4 LISATek Processor Generator

B Assembly Code
  B.1 Assembly code using Very Wide Register
  B.2 Assembly code using Scalar Register File

Bibliography

List of Figures

1.1 Organization of multiported register file
1.2 Power consumption of processor
1.3 Overview of LISATek Processor Designer

2.1 Conceptual model of partitioned Register file
2.2 Very wide register Organization
2.3 Very wide register and Scalar Register file connectivity to the Datapath

3.1 RISC Architecture
3.2 LTRISC32ca Processor Structure
3.3 Four Stage Pipeline
3.4 RISC Architecture using VWR

4.1 Tool Flow
4.2 Simulation in Modelsim
4.3 Execution time of an application
4.4 Power Consumption of VWR RISC Architecture
4.5 Power Consumption of Reference RISC Architecture
4.6 Energy Consumption of Register File
4.7 Energy Consumption of Processor

A.1 LISATek Design flow
A.2 Example OAT Domain
A.3 Screenshot of Processor debugger
A.4 Screenshot of Processor Generator

CHAPTER 1

INTRODUCTION

Future notebook and hand-held portable multimedia wireless devices, such as palm-pilots and wireless phones, which support new multimedia and wireless communication standards with a high computational complexity, are becoming a major part of our daily life for personal management and communication functions. The cost of such human-machine interface applications depends closely on the combination of performance and energy efficiency of the processor, which must provide a long battery life with low energy dissipation.

Current state-of-the-art processors are facing several bottlenecks that prevent the required combination of performance and energy efficiency. Designing a processor with the required combination of performance and energy efficiency is therefore one of the major challenges for the designer. As battery improvements are not able to satisfy the growing energy requirements of processors, it is important to devise new low-power architectural techniques to reduce the microprocessor power consumption.

1.1 What is a Register?

Registers play an important role in the performance and energy consumption of a processor. A register can be defined as a small container that holds data and is distinct from memory. Today's architectures have converged on a view of architectural state that distinguishes between two types of storage: memory and registers. Historically, registers were fast, small, expensive, on-chip storage areas connected to specific pieces of on-chip logic, while memory was slow, cheap, big, off-chip and addressable. Recent technologies blur this abstraction, but overall these characteristics remain the same. Modern implementations are designed around fast on-chip operations and relatively slower memory operations. Most, but not all, modern architectures operate on the principle of moving data from main memory into registers and operating on it there; since values are often accessed repeatedly, holding these frequently used values in registers improves performance.

Processor registers are at the top of the memory hierarchy and provide the fastest way for a CPU to access data. Registers are normally measured by the number of bits they can hold, for example an "8-bit register" or a "32-bit register". A processor often contains several kinds of registers:

• Data registers hold numeric values such as integer and floating-point values.

• Address registers hold addresses and are used by instructions that indirectly access memory.

• Conditional registers hold truth values often used to determine whether some instruction should or should not be executed.

• General purpose registers (GPRs) can store both data and addresses, i.e., they are combined Data/Address registers.

Typically, the bit cell is the basic building block of a register: bit cells are grouped together to form an individual register, and registers are grouped together to form a register file. A register file stores a comparatively large amount of frequently used data in order to reduce the number of memory accesses, which helps to increase the performance of the processor. Apart from reducing the accesses to memory, performance can also be increased by performing multiple operations per cycle, which requires increasing the number of ports of the register file, i.e., a multi-ported register file. Multi-ported register files are mainly used in processors where multiple instructions are issued in the same cycle.

Apart from the performance benefit, there are some negative impacts of using multi-ported register files. Assume a register file is organized as a two-dimensional grid in which the horizontal lines represent the control path and the vertical lines represent the data path; a sample grid structure is shown in Figure 1.1. In a single-ported register file the vertical lines correspond to the bit positions within a word, and the horizontal control path selects a single register out of the register file. In a multi-ported register file, additional control and data lines are needed to access the individual registers. As Figure 1.1 illustrates, the area of the register file grows with the square of the number of ports. The read access time of a register file grows approximately linearly with the number of ports, which has a negative impact on the overall cycle time. As the number of ports increases, the internal bit-cell loading becomes larger, and the larger register file area causes longer wire delays. The longer wires and larger cells make the multi-ported register file a power-hungry circuit.
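These scaling trends can be summarized compactly. The relations below are only an approximate restatement of the observations above (a rough cost model, not measured data):

    A_{RF} \propto N_{ports}^{2}, \qquad t_{read} \propto N_{ports}

where A_{RF} is the register file area, t_{read} its read access time and N_{ports} the number of ports.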

The pie chart in Figure 1.2 shows the power consumption of the different blocks in a processor. From Figure 1.2 we can see that the register file consumes a large share of the total power.

Figure 1.1: Organization of multiported register file

To reduce the power consumption of a register file it is preferable to have single-ported registers. The problem can be addressed through various clustering techniques, in which the register file is broken into smaller units so that each unit is connected to only a subset of the execution units; the communication between the clusters is called inter-cluster communication. As the size of the partitions is reduced, the cost of inter-cluster communication increases, which effectively makes the register file multi-ported again. In this thesis I describe an IMEC technology called the "Very Wide Register (VWR)", which addresses the power consumption of multi-ported register files and also reduces the energy consumption of the register file.

In short, the Very Wide Register can be defined as an asymmetric register file organization that achieves a significantly higher energy efficiency than conventional organizations. Three aspects are important for the VWR organization:

• Wide Interface to the memory

• Single ported cells

• Narrow interface to the datapath

Figure 1.2: Power consumption of processor

The architecture of the VWR and the loading and storing of data in the VWR are explained in the coming chapters.

1.2 LISATek

This section gives a brief introduction to the LISATek (CoWare) Processor Designer tool, which I used to implement the Very Wide Register. CoWare Processor Designer is an automated, application-specific embedded processor design and optimization environment that slashes months from processor hardware design time and engineer-months from the creation of application- and processor-specific software development tools. Processor Designer's high degree of automation enables design teams to focus on architecture exploration and application-specific processor development, rather than on consistency checking and verification of individual tools.

LISATek can be used to develop a wide range of processor architectures, including DSP, RISC, SIMD, VLIW and superscalar. The key to LISATek's automation is its Language for Instruction Set Architectures, LISA 2.0. Operating at a high level of abstraction, LISATek not only eliminates the time and cost inherent in HDL-based processor design and manual tool development, but also enables processor design by non-experts. CoWare Processor Designer is explained in detail in the Appendix; an overview is shown in Figure 1.3.

Figure 1.3: Overview of LISATek Processor Designer

This thesis is organized as follows:

• Chapter 2 describes the Very Wide Register organization.

• Chapter 3 describes the experimental setup.

• Chapter 4 gives an idea about benchmarks used and results.

• Chapter 5 concludes and suggests future work.

• The Appendix describes the LISATek tool and the assembly code of the benchmark.

CHAPTER 2

Very Wide Register

2.1 Introduction

In chapter one, the positive and negative impacts of multi-ported register files were discussed. To overcome the main negative impact, the power consumption of the register file, this chapter first discusses traditional techniques such as clustering and then explains the IMEC technology called the very wide register (VWR).

A solution to overcome the negative impacts is to partition the register file into smaller units, so that each unit is connected to only a subset of the functional units. Partitioning the register file and the functional units in this manner is called clustering, and each pairing of a register file partition and its directly connected functional units is called a cluster. An example of a clustered register file is shown in Figure 2.1 [1].

Figure 2.1: Conceptual model of partitioned Register file

Data can be transferred from one cluster to another only through inter-cluster communication, and the price of clustering comes from this communication. As the clusters get smaller, the cost of inter-cluster communication increases and the resulting register file again behaves like a multi-ported register file. From the above discussion it is preferable to use single-ported registers for high energy efficiency. Apart from the number of ports of a register file, another important energy-efficiency bottleneck is formed by the L1 memories. This bottleneck can be reduced by improving any one of the following aspects:

• Memory Design

• Memory organization (interface between the memory and the foreground memory)

• Mapping of data onto the memory

In this chapter I explain the IMEC technology known as the very wide register (VWR) [2], an asymmetric register file organization that achieves a higher energy efficiency than conventional register file organizations.

2.2 Very Wide Register

The very wide register is an asymmetric register file organization which has a wide interface towards the memory and a narrow interface towards the datapath. This asymmetric interface is the feature that makes the very wide register different from other register file organizations. Figure 2.2 shows the very wide register organization.

Figure 2.2: very wide register Organization

The three main aspects which make very wide register different from other register file organizations are:

• Interface to the memory

• Single ported cells

• Interface to the Datapath

The interface of the very wide register (VWR) is asymmetric: wide towards the memory and narrow towards the datapath. The wide interface exploits the locality of access of an application through wide loads, while at the same time the narrow interface is able to access the individual words of smaller width which are required for computation. As shown in Figure 2.2, the VWR is made of single-ported cells and has no pre-decode circuit, but it has a post-decode circuit which selects the appropriate word for the actual computation.

As mentioned in the introduction section, clustering is a technique to reduce the energy consumption by splitting the register file into smaller parts, and the cost of clustering comes from the inter-cluster communication. In the very wide register, in contrast, the ports are asymmetric: the datapath accesses the VWR through a cheap, frequently used interface, while accesses from the memory go through a more expensive but less frequently used interface. Since a read/write operation between the VWR and the datapath operates on a single word, this makes more optimized use of energy and bandwidth.

The concept of the very wide register is similar to that of a vector register, in which a contiguous set of data is stored for data-parallel architectures. Each vector register holds multiple data elements forming a vector. Here, this set of multiple data elements is referred to as a word and each element is referred to as a sub-word. The main motivation of vector registers is to support data-level parallelism; the target of the very wide register is energy efficiency, and it can be used in both data-parallel and non-data-parallel contexts.

2.3 Architecture Description

The architecture description of the very wide register (VWR) covers the foreground memory organization (the VWRs themselves), the data memory organization and its interface, and the connectivity between the VWR and the datapath. The data memory organization and its interface with the wide registers are important for the energy consumption of a processor. As mentioned before, the energy in memories can be reduced by improving any one of the following three aspects:

• Memory design

• Mapping of data on to the memory

• Memory organization(interface)

2.3.1 Memory Design

Most of the energy in a memory is spent on the decoder and the wordline activation. Some energy is also spent on the actual storage cells and the sense amplifiers [3, 4]. The row address selects the desired row in the memory through the pre-decoder, the sense amplifiers and precharge lines are activated for these words, and the words are read out. In general, the decode price is paid for being able to access the words in any given order. The cost of the decoder can be reduced by performing as few decodes as possible and reading out more data for every decode. In the embedded-systems domain this can be achieved by exploiting the spatial locality of the data [5].

2.3.2 Foreground Memory Organization

As discussed in the previous sections, the very wide register (VWR) has asymmetric interfaces: a wide interface towards the memory and a narrow interface towards the datapath. The width of the VWR depends on the line size of the background memory; complete or partial lines are read from the memory into the VWRs. The VWRs do not have any pre-decode circuit for transferring data from the memory, but they have a post-decode circuit, consisting of a multiplexer/demultiplexer, which is used to select the desired word to be read or written from the VWR. The load/store unit of the VWR is also different: since the interface between the VWR and the memory is as wide as a complete VWR, it is capable of loading/storing complete or partial lines between the memory and the VWRs. Since the VWR is single ported, read and write accesses between the registers and the memory cannot happen in parallel with accesses from the datapath, and it is also important that data needed in the same cycle/operation are placed in different VWRs. A scalar register file (SRF) can be used alongside the VWRs to store scalar constants, iterators, addresses, etc., so as not to pollute the data in the VWRs.
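To make this organization concrete, the following C sketch models a single VWR in software: a complete memory line is transferred through the wide interface in one operation, while the datapath reads or writes one word at a time through an index, which plays the role of the post-decode multiplexer/demultiplexer. The names and sizes used here (WORDS_PER_LINE, vwr_t and the helper functions) are illustrative assumptions and are not part of the LISA model described later in this report.

#include <stdint.h>
#include <string.h>

#define WORDS_PER_LINE 4              /* VWR width = one memory line (assumed) */

typedef struct {
    uint32_t word[WORDS_PER_LINE];    /* single-ported storage cells */
} vwr_t;

/* Wide interface: copy a complete memory line into the VWR in one transfer. */
static void vwr_load_line(vwr_t *vwr, const uint32_t *mem, unsigned line)
{
    memcpy(vwr->word, &mem[line * WORDS_PER_LINE],
           WORDS_PER_LINE * sizeof(uint32_t));
}

/* Wide interface: write the whole VWR back to one memory line. */
static void vwr_store_line(const vwr_t *vwr, uint32_t *mem, unsigned line)
{
    memcpy(&mem[line * WORDS_PER_LINE], vwr->word,
           WORDS_PER_LINE * sizeof(uint32_t));
}

/* Narrow interface: the datapath accesses one word, selected by an index
 * (this models the post-decode multiplexer/demultiplexer). */
static uint32_t vwr_read_word(const vwr_t *vwr, unsigned idx)
{
    return vwr->word[idx];
}

static void vwr_write_word(vwr_t *vwr, unsigned idx, uint32_t value)
{
    vwr->word[idx] = value;
}

In such a model, a functional unit combines operands obtained with vwr_read_word() and writes the result back with vwr_write_word(), while vwr_load_line() and vwr_store_line() replace the word-by-word loads and stores of a scalar register file.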

Figure 2.3: very wide register and Scalar Register file connectivity to the Datapath

Figure 2.3 gives a general idea of the connectivity between the very wide register (VWR) and the datapath. If we have N single-ported VWRs and a scalar register file to store the scalar constants, the scratchpad width should be equal to the width of the foreground memory (the VWRs). The next section describes the connectivity between the VWR and the datapath by considering an arithmetic operation.

2.3.3 very wide register and Datapath Connectivity

The very wide registers (VWRs) and the scalar register file (SRF) can be connected to any datapath unit, such as multipliers, adders, accumulators, shifters, etc. Once the appropriate data are available in the foreground memory, the read and write operations between the foreground memory and the datapath are performed depending on the decoded instruction. In a given cycle only one word is read from the VWR to the datapath, and the result is written back to the VWR. Figure 2.3 gives an idea of the connectivity between the VWRs, the SRF and the datapath. More information on the VWR architecture can be found in [2].

2.4 Example Operation of very wide register

This section explains an arithmetic operation which uses the very wide register (VWR). Micro-architecturally, every VWR is made of single-ported cells, and the depth of the VWR, i.e., the number of single-ported registers it contains, may vary depending on the application. Each register within a VWR is accessed by indexing, which allows the programmer to distinguish the registers within the VWR. The concepts of indexing and of arithmetic operations on the VWR can be illustrated with a basic Multiply Accumulate (MAC) operation.

The example below shows the original C code and the assembly code of the Multiply Accumulate (MAC) operation using general-purpose register files and using the very wide register, in (b) and (c) respectively. Consider a 32-bit processor and a line size of 64 bits, which means that at any given instant of time we can store two words in a very wide register. The asymmetric feature of the very wide register allows a complete line to be loaded from the memory, while at the datapath end only one word is used at a time. In the MAC example, we can load two words at a time from memory into the VWR and use one word at a time; likewise, in the store operation the asymmetric feature of the VWR lets us store two words at a time.

for(i=0;i<8;i++) {
    c[i]=a[i]*b[i];
    sum=sum+c[i];
}

(a) Original C Code

.text
R0=2
R1=0
R2=8
_loop:
R3=dmem[R1+0]
R4=dmem[R1+1]
R0=R0-1
R5=R3*R4
R6=R5+R6
R1=R1+1
dmem[R2+8]=R6
R2=R2+1
if(!R0)jmp_loop
_end:

(b) Using Scalar Register Files

.text
r1=0
r2=8
VWR1=dmem[r1+0]
VWR3.0=VWR1.0*VWR1.1
VWR3.1=VWR3.0+VWR3.1
mov VWR3.1,VWR2.0
VWR1=dmem[r1+0]
VWR3.0=VWR1.0*VWR1.1
VWR3.1=VWR3.0+VWR3.1
mov VWR3.0,VWR2.1
dmem[r2+0]=VWR2
_end:

(c) Using very wide register

Example: Multiply Accumulate Operation

In the original C code, the elements of array a and array b are multiplied and the accumulation is performed using the variable sum. In the assembly code of example (b), the multiply-accumulate operation is performed using seven registers. Register R0 is used as the iterator for the loop, R1 is used as the base register to load the values from the memory and R2 is used as the base register to store the result. The values from the memory are loaded into registers R3 and R4, and after the multiply-accumulate operation the result is stored in register R6.

Example (c) illustrates the Multiply Accumulate (MAC) operation using the very wide register. Since the line size is 64 bits, we can load only two values at a time. Initially a complete line is loaded from the memory into VWR1, the results are stored in VWR3, and the accumulated results are finally moved to VWR2. After this phase, two values are stored back to the memory at a time: the load operation loads a line from the memory and the store operation stores a line to the memory. Comparing (b) and (c), we can observe that by using the very wide register the number of load/store operations is reduced.

2.5 Conclusion

In this chapter, different register organizations, including the architecture of the very wide register, were explained, and the benefits of the very wide register over other register organizations were discussed. Chapter three explains the implementation of the very wide register in a RISC architecture.

CHAPTER 3

Implementation of Very Wide Register

3.1 Introduction

In chapter two, the Very Wide Register architecture was discussed. This chapter starts with a description of the RISC processor architecture and then describes the implementation of the Very Wide Register in this RISC processor. The LTRISC32ca (LISATek 32-bit cycle-accurate RISC) processor used in my thesis is a 32-bit, four-stage pipelined architecture. The processor contains sixteen 32-bit general-purpose registers and several special-purpose registers. LTRISC32ca provides general-purpose load/store operations, complex multiplication operations and several instructions for aiding DSP applications, e.g. add-compare-select. The RISC architecture is shown in Figure 3.1.

3.2 Baseline Reference RISC Architecture Description with Standard Register File

The basic structure of the LTRISC32ca processor is shown in Figure 3.2. The important modules in the LTRISC32ca processor are:

• The Register Module

• The Memory Module

Figure 3.1: RISC Architecture

• The Pipeline Module

Figure 3.2: LTRISC32ca Processor Structure

3.2.1 The Register Module

The LTRISC32ca processor has 16 general-purpose register elements, each 32 bits wide. Registers are declared in the resource section in LISA. This section describes the two kinds of registers used in this architecture:

• Clocked Behavior of Registers

• Special Registers

Clocked Behavior of Registers

As the name suggests, the clocked behavior of registers depends on the clock: values written to registers are not applied until the next clock cycle starts. This behavior can be globally switched on or off in Processor Designer, or it can be achieved by changing the type of the resource to TClocked. For cycle-accurate models, clocked resources such as registers can be read and written in arbitrary order within one cycle: clocked resources have separate read and write ports, and a write does not influence the value of the read port in the same cycle.
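This clocked behavior can be mimicked in plain C: reads within a cycle see the old value, writes are buffered on a separate write port, and the buffered value only becomes visible when the clock edge is applied. The sketch below is only an illustrative model of the TClocked semantics, not code generated by LISATek.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t read_port;    /* value visible to all reads in the current cycle */
    uint32_t write_port;   /* value buffered by a write in the current cycle  */
    bool     written;      /* was the register written this cycle?            */
} clocked_reg_t;

static uint32_t reg_read(const clocked_reg_t *r)
{
    return r->read_port;                 /* unaffected by writes in the same cycle */
}

static void reg_write(clocked_reg_t *r, uint32_t value)
{
    r->write_port = value;               /* does not change the read port yet */
    r->written = true;
}

/* Called once per simulated clock cycle: commit the buffered write. */
static void reg_clock(clocked_reg_t *r)
{
    if (r->written) {
        r->read_port = r->write_port;
        r->written = false;
    }
}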

Special Registers

Apart from temporary storage of data, registers are also used as counters and pointers. There are two types of special registers:

• PROGRAM COUNTER

• STACK POINTER

3.2.2 The Memory Module

LISA supports multiple types of memories as listed below:

• Ideal memories that store data without any timing information

• Non-ideal memories that may have latencies and may be connected to a bus.

In my thesis work I used ideal memories, which are the simplest way to describe a memory.

Ideal Memories

An ideal memory can be imagined as an array of a fixed size, composed of elements of a defined size, which may be accessed by an index. The simplest way to model an ideal memory is to define an array of the desired size and data type within the LISA RESOURCE section. Ideal memories are declared using the MEMORY keyword. Ideal memories are of two types:

• Data Memories are used to store data.

• Program Memories are the memories that contain the executable code.

A valid declaration of an ideal memory is shown in the following example.

RESOURCE
{
    MEMORY unsigned int pmem
    {
        BLOCKSIZE(32);
        SIZE(0x10000);
        FLAGS(R|W|X);
    };
}

BLOCKSIZE, SIZE and FLAGS are the basic memory attributes. BLOCKSIZE defines the memory width of a single memory element. The SIZE attribute specifies the number of memory elements within the array. Finally, the FLAGS attribute provides information about the accessibility of the respective memory, that is, whether the memory can be used for reading/writing data (R/W) and for storage of executable code (X). A memory map is used to transform a physical memory address into the corresponding array index.

Memory Map

Mapping between the processor’s address space and the defined memories must be established, so the simulator can load object code into the model’s memories. This is performed within one or more memory maps.

The primary job of the memory map is to establish the link between the LISA resources and the processor's address space; that is, it tells the simulator where to store the code and data segments of the loaded object file. The memory map has to provide an unambiguous mapping of addresses onto resources, so that the same address is not mapped onto different resources.

The following example shows a memory map.

RESOURCE
{
    MEMORY char memory1
    {
        BLOCKSIZE(8);
        SIZE(0x200000);
    };
    MEMORY int memory2
    {
        BLOCKSIZE(16);
        SIZE(0x100000);
    };
    MEMORY char memory3
    {
        BLOCKSIZE(32);
        SIZE(0x80000);
    };
    MEMORY_MAP
    {
        RANGE(0x0000000,0x01fffff) -> memory1[(31..0)];
        RANGE(0x0800000,0x09fffff) -> memory2[(31..1)];
        RANGE(0x1000000,0x11fffff) -> memory3[(31..2)];
    }
}

3.2.3 The Pipeline Module

A pipeline provides a mechanism for processing multiple parts of different instructions at the same time; the purpose of processor pipelines is to improve instruction throughput. The execution of an instruction is split into several parts, and each pipeline stage performs one part of the complete execution. Instructions enter at one end, progress through the pipeline and exit at the other end. The pipeline consists of four stages: fetch (FE), decode (DC), execute (EX) and writeback (WB).

The pipeline must keep track of the different instructions that are in flight. After each stage, instructions are held in pipeline registers: latches that separate the pipeline stages and store the context of the instructions in the pipeline. Figure 3.3 shows the registers in a four-stage pipeline; between two pipeline registers sits the logic that performs that part of the execution of the instruction.
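As a rough illustration of how the pipeline registers separate the stages, the C sketch below moves instruction contexts through FE/DC/EX/WB latches once per cycle. The instr_t fields and the stage functions are placeholders for illustration and do not correspond to the actual LTRISC32ca implementation.

#include <stdint.h>
#include <stdbool.h>

/* Context of one instruction as it travels through the pipeline. */
typedef struct {
    bool     valid;
    uint32_t raw;        /* fetched instruction word */
    uint32_t operand_a;  /* filled in by decode      */
    uint32_t operand_b;
    uint32_t result;     /* filled in by execute     */
} instr_t;

/* Pipeline registers (latches) between the four stages. */
typedef struct {
    instr_t fe_dc;   /* fetch   -> decode    */
    instr_t dc_ex;   /* decode  -> execute   */
    instr_t ex_wb;   /* execute -> writeback */
} pipeline_t;

static instr_t fetch(uint32_t pc)   { instr_t i = { .valid = true, .raw = pc }; return i; }
static instr_t decode(instr_t i)    { /* read operands from the register file */ return i; }
static instr_t execute(instr_t i)   { i.result = i.operand_a + i.operand_b; return i; }
static void    writeback(instr_t i) { (void)i; /* write i.result back to a register */ }

/* One clock cycle: every stage works on a different instruction,
 * then all pipeline registers advance together. */
static void pipeline_step(pipeline_t *p, uint32_t pc)
{
    writeback(p->ex_wb);
    p->ex_wb = execute(p->dc_ex);
    p->dc_ex = decode(p->fe_dc);
    p->fe_dc = fetch(pc);
}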

3.3 Architecture Description-Very Wide Registers

Figure 3.3: Four Stage Pipeline

The previous section gave an idea of the RISC architecture with scalar register files, and of its memory module, pipeline module and register module. This section explains the RISC architecture with very wide registers and the modifications made in the memory module and the register module. The basic pipeline structure remains the same in both architectures. Figure 3.4 shows the RISC architecture modified to use the very wide registers.

Compared to the reference architecture (the architecture using scalar register files), the connections from the memory to the registers and the internal organization of the registers differ.

3.3.1 Register Module

Apart from the architectural difference between the very wide register and the scalar register files, there is no big difference in the register module of the two architectures; only the declaration and the usage of the registers differ. The very wide register is also declared as a clocked register, i.e., one that depends on the clock. The declaration of a very wide register is shown below; it states that VWR is a clocked register that can hold five values simultaneously.

REGISTER TClocked VWR[0..4];

As mentioned before, apart from the very wide registers, scalar/general-purpose registers are also used for counters, iterators and variables.

3.3.2 Memory Module

As mentioned in the previous chapter, a very wide register is a group of registers with an asymmetric interface: a VWR of width four has a wide interface towards the memory (modelled here as multiple ports), such that four words can be stored or loaded simultaneously, and only one port towards the datapath. As the number of very wide registers increases, the number of interfaces towards the memory increases, which may cause address conflicts. To implement the very wide register successfully in the RISC architecture, these address conflicts must be avoided. There are two options to avoid address conflicts:

• Increasing the width of the bus which connects the memory and the very wide register. Consider a very wide register of width four, which can store 4*32 bits of data: for loading the data from the memory to the register, the bus has to be 128 bits wide, so that it can transfer all four values simultaneously.

• Partitioning the ideal memory (which is a fixed array) into smaller memories (smaller arrays), so that one memory is dedicated to each very wide register. This option is well suited to ideal memories.

To implement the very wide register without any address conflicts I opted for the second option, where the ideal memory of fixed array size (fff0000-fffffff) is partitioned into smaller memories depending on the number of words in the very wide register. The memory map used in my architecture for implementing the four-word interface between the VWR and the memory is shown below:

RAM U32 data_mem
{
    SIZE(0x10000);
    BLOCKSIZE(32,8);
    FLAGS(R|W);
};
RAM U32 data1_mem
{
    SIZE(0x10000);
    BLOCKSIZE(32,8);
    FLAGS(R|W);
};
RAM U32 data2_mem
{
    SIZE(0x10000);
    BLOCKSIZE(32,8);
    FLAGS(R|W);
};
RAM U32 data3_mem
{
    SIZE(0x10000);
    BLOCKSIZE(32,8);
    FLAGS(R|W);
};

RANGE(DMEM_START, DMEM_END)   -> data_mem[(31..0)];
RANGE(DMEM_START1, DMEM_END1) -> data1_mem[(31..0)];
RANGE(DMEM_START2, DMEM_END2) -> data2_mem[(31..0)];
RANGE(DMEM_START3, DMEM_END3) -> data3_mem[(31..0)];

3.4 Conclusion

This chapter gave an idea of the RISC architecture and its different modules (memory module, pipeline module and register module) on which the very wide register was implemented, and it explained the procedure that I used to implement the very wide register without any address conflicts.

Figure 3.4: RISC Architecture using VWR

CHAPTER 4

Results

In the previous chapter, two RISC architectures, one with scalar register files and one with very wide registers, were described. This chapter deals with the implementation of a benchmark on these architectures and compares various performance metrics, such as the number of load/store operations, the energy consumption and the execution time of the application. In order to compare the performance metrics, five RISC architectures with different register file organizations were designed:

• RISC Architecture with a 32-entry scalar register file

• RISC Architecture with a 16-entry scalar register file

• RISC Architecture with 3 very wide registers, each of width 8

• RISC Architecture with 3 very wide registers, each of width 4

• RISC Architecture with 3 very wide registers, each of width 2

As explained in chapter one, the LISATek Processor Designer accelerates the design of both custom processors and programmable accelerators, including the application-specific instruction set processors (ASIPs) that are increasingly essential to convergent system-on-chip (SoC) functionality. Processor Designer can be used to develop a wide range of processor architectures, including architectures with DSP-specific and RISC-specific features as well as SIMD and VLIW architectures.

4.1 Tool Flow

Figure 4.1 gives a brief idea of my work flow and tool flow. As shown in Figure 4.1, I started by designing the very wide register in the RISC processor and finished by calculating the power and energy consumption of the register file.

4.2 Implementation of Benchmark

A simple eight-tap Finite Impulse Response (FIR) filter is implemented as the benchmark on all five RISC architectures with their different register file organizations. Each architecture has its own resources and instruction set, and the FIR filter is implemented on each architecture in assembly language. After successful verification of the assembly code, the architecture and the application are converted to a hardware description language (VHDL or Verilog) using the LISATek Processor Generator. The implementation of the FIR benchmark is described below; a reference C sketch of the computation is given after the list.

• A 16-input, 8-tap time-multiplexed Finite Impulse Response (FIR) filter is used as the benchmark [6].

• FIR operation is performed considering that all the inputs and coeffi- cients are initially present and are stored in the memory.

• The Multiply Accumulate operation is performed if the sum of each input and each coefficient is less than 16; otherwise the result is considered as zero.

Figure 4.1: Tool Flow

Architecture                              No. of loads and stores    No. of steps to compute FIR
RISC Architecture using 32 SRF                      272                       1392
RISC Architecture using 16 SRF                      272                       1392
RISC Architecture using VWR of width 8               34                        629
RISC Architecture using VWR of width 4               68                        742
RISC Architecture using VWR of width 2              136                        773

Table 4.1: Simulation results - LISATek Processor Debugger

• In the reference RISC architecture (the architecture using a scalar register file) each input and coefficient is loaded individually from the memory to the register file, while in the VWR RISC architecture (the architecture using very wide registers) a set of contiguous values is loaded into the very wide register at once.

• In the reference architecture the results are stored to memory after completion of each MAC operation, whereas in the VWR RISC architecture a set of results, whose size depends on the width of the very wide register, is stored to memory at once.
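For reference, the computation described above can be sketched in C as follows. This is an illustrative reconstruction of the benchmark (array sizes and names follow the description above); the code that was actually mapped onto the architectures was written directly in assembly, see Appendix B, and its data layout and boundary handling differ.

#define N_OUTPUTS 16
#define N_TAPS     8

/* 8-tap FIR over 16 output samples with the benchmark's guard condition:
 * accumulate only while (input + coefficient) < 16, otherwise the output
 * sample is forced to zero. */
void fir_benchmark(const int in[N_OUTPUTS + N_TAPS - 1],
                   const int coeff[N_TAPS],
                   int out[N_OUTPUTS])
{
    for (int n = 0; n < N_OUTPUTS; n++) {
        int acc = 0;
        for (int t = 0; t < N_TAPS; t++) {
            if (in[n + t] + coeff[t] >= 16) {   /* guard from the benchmark spec */
                acc = 0;
                break;
            }
            acc += in[n + t] * coeff[t];
        }
        out[n] = acc;
    }
}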

4.3 Simulation Results

The simulation results include results from the LISATek Processor Debugger and simulation results from ModelSim.

4.3.1 Results-LISATek Processor Debugger

The architecture specification, i.e., the pipeline stages, resources and assembly instructions, is described in LISATek Processor Designer. The LISATek Processor Debugger is then used to analyze the architecture while it runs the benchmark. The Finite Impulse Response (FIR) filter is implemented on the reference RISC architecture (the architecture using scalar register files) and on the VWR RISC architectures (the architectures using very wide registers). The parameters of all five architectures are tabulated in Table 4.1.

The results in Table 4.1 indicate that the RISC architecture using a very wide register of width 8 needs the fewest load/store operations and executes the FIR operation in the fewest steps of all the architectures. Overall, the RISC architectures using very wide registers need fewer load/store operations and execute the FIR operation in fewer steps than the RISC architectures using scalar register files.

4.3.2 Results-Modelsim

All five architectures and the respective applications are converted to VHDL by the LISATek Processor Generator. A testbench is also generated along with the other files. The functionality of the generated code is verified using ModelSim. Screenshots of the simulation in ModelSim are shown in Figure 4.2.

The execution times of the FIR application on the different architectures are plotted in Figure 4.3. The figure shows that in the architectures using very wide registers the execution time of the FIR operation is lower than in the architectures using scalar register files.

4.4 Synthesis Results

Logic synthesis is performed after verification of the VHDL code in ModelSim. Synthesis is performed with a 90 nm target library, including wire-load models, and with a clock period of 3 ns. A netlist and SDF (Standard Delay Format) files are written after compilation. The generated netlist is used in a post-synthesis simulation in order to ensure that the functionality is unchanged compared to the RTL-level description. Power estimation is done after the post-synthesis simulation by reading the backward-annotated SAIF file (with the switching activity) from ModelSim into Synopsys. The dynamic power (switching power + internal power + leakage power) is reported after the post-synthesis simulation.

Figure 4.2: Simulation in Modelsim

The total dynamic power consumed by the different modules in the VWR RISC architecture (using three VWRs of width 8 plus 8 scalar registers) and in the reference RISC architecture (using 32 scalar registers) is plotted in the pie charts of Figures 4.4 and 4.5.

The energy consumption of the register file in each architecture, and of the entire architecture, is calculated by multiplying the power consumption by the execution time of the application. The energy consumed by the register file in each of the five architectures is plotted in Figure 4.6, which shows that the very wide register of width 8 consumes less energy than the other register files.
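In equation form, the estimate used here is simply the product of the reported average power and the simulated execution time, applied once to the register file alone and once to the whole processor:

    E_{\mathrm{RF}} = P_{\mathrm{RF}} \times t_{\mathrm{exec}}, \qquad E_{\mathrm{proc}} = P_{\mathrm{proc}} \times t_{\mathrm{exec}}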

Figure 4.3: Execution time of an application

The energy consumed by the RISC architecture with each of the different register files is plotted in Figure 4.7; among all the architectures, the RISC architecture using three very wide registers, each of width 4, consumes the least energy.

4.5 Conclusion

From all the simulation and synthesis results it can be concluded that very wide registers give better performance and consume less energy than a scalar register file.

Figure 4.4: Power Consumption of VWR RISC Architecture

Figure 4.5: Power Consumption of Reference RISC Architecture

Figure 4.6: Energy Consumption of Register File

Figure 4.7: Energy Consumption of Processor

CHAPTER 5

Conclusion and Future Work

The register file is a power-hungry part of a processor, and an increase in the number of ports has a negative impact on its energy consumption. In this thesis I experimented with the IMEC very wide register organization, which has asymmetric interfaces to the memory and to the datapath. A Finite Impulse Response (FIR) filter was mapped onto the very wide register architecture and compared to a scalar register file; the results show about a 2X reduction in energy consumption on the FIR filter benchmark.

5.1 Further Work

I see a few directions for further work on this project:

• Optimizing the VHDL code generated by LISATek to reduce the power and energy consumption.

• Experimenting with the very wide register on a multi-issue processor.

• Implementing a few other benchmarks on the very wide register.

• Performing physical synthesis and place-and-route to ensure that the wire lengths do not worsen the VWR-based design.

APPENDIX A

LISATek

This appendix describes the LISATek tool. Application-Specific Instruction-Set Processors (ASIPs) are becoming increasingly popular in the world of customized, application-driven System-on-Chip (SoC) designs. Efficient ASIP design requires an iterative architecture exploration loop: gradual refinement of the processor architecture starting from an initial template. LISATek is one tool to accomplish this task: it helps to detect bottlenecks in embedded applications, to implement application-specific processor instructions, and it automatically generates the required software tools as well as synthesizable hardware.

A.1 Introduction

LISATek is an automated embedded processor design and software development tool generation environment. It is an optimization environment which slashes months from processor hardware design time and years from the creation of processor-specific software development tools. It enables even design teams with no processor development expertise to create advanced processors, and it can also generate software development tools for processors that have not been designed using LISATek's automated hardware design capability.

LISATek accelerates the design of both custom and standard processors, including application-specific instruction-set processors (ASIPs), and it can be used to develop a wide range of processor architectures including DSP, RISC, SIMD, VLIW and superscalar. The key to LISATek's automation is its Language for Instruction Set Architectures, LISA 2.0. LISA 2.0 enables the creation of a single processor model as the source for the automatic generation of the instruction-set simulator (ISS), the complete suite of software development tools, and the synthesizable RTL code. LISATek enables the designer to optimize the instruction-set design, the processor micro-architecture and the memory system, including caches.

The LISATek processor design platform is built around the LISA 2.0 Architecture Description Language (ADL). From a given processor model written in the LISA 2.0 ADL, several processor development tools, such as the instruction-set simulator, C compiler, assembler and linker, are automatically generated to support architecture exploration. Register Transfer Level (RTL) hardware models in languages such as VHDL, Verilog and SystemC can also be generated from the LISA model for hardware implementation. The figure below shows the processor design flow supported by LISATek.

A.2 LISATek tools

This section gives an introduction to the LISATek tools:

1. Processor Designer, to describe and generate the model.

2. Processor Debugger, to simulate and benchmark the created model.

3. Processor Generator, to automatically obtain a hardware description of the model.

Figure A.1: LISATek Design flow

A.2.1 LISATek Processor Designer

LISATek Processor Designer is a tool for the design and optimization of both custom and embedded processors. It builds on the other LISATek tools to let one model the processor, analyze and optimize its behavior, generate applications for it and then create the RTL code for the architecture. The most important feature of the tool is processor modeling, which allows one to describe the processor architecture quite easily with the necessary features, such as memories, registers, pipelines and instructions.

LISA architecture descriptions are composed of two main components:

Resource definitions: here we model the storage elements of the real hardware of the processor, like registers, memories, buses and pipelines.

Operations: operations are basic objects, abstracted from the hardware, that cover the instruction set, timing and behavior.

Resource section

The resource section consists of the definitions of all the objects which are required to build the memory model and the resource model. The handling of variables (declarations and passing data values to variables) is similar to the handling of variables in the C language. The resource section allows the declaration of the following types of objects.

– Simple resources, such as:
  Registers and register files
  Signals
  Flags
  Ideal memory arrays

– Pipeline structures for instructions and data paths

– Pipeline registers for storing data on its way from one pipeline stage to the next

– Non-ideal memories, such as:
  Caches
  Buses

– Memory maps describing:
  The mapping of memories into the processor's address space
  The connectivity between the memory and bus modules

LISATek Operations

Processor behavior is modelled by LISATek operations, which are its basic building blocks. The word operation may refer to an instruction, a set of similar instructions, a part of a single instruction and so on. Two operations must be present in any LISATek model:

OPERATION reset: this operation is invoked at simulation start-up; it initializes the registers and the PC to the right values and flushes the pipeline.

OPERATION main: this operation is invoked at every LISATek simulator clock cycle. It performs two kinds of actions, execute and shift, and it also declares the next operation to be activated, such as fetch.

LISATek operations are structured in a hierarchical way. The idea of referring between operations, where two different operations can refer to the same lower-level operation, is similar to the idea of calling functions in C/C++. An important feature of LISATek is that a high-level operation can refer to a low-level operation selected out of a given set.

A.2.2 Generating the model

Once the LISATek model has been described, it is compiled to check its correctness and to generate the tools. We can generate the tools in two different ways: using the graphical version of LISATek Processor Designer or using command-line input.

Compilation

LISATek processor generation starts by translating the code with LISATek syntax into C++ files. These C++ files are the sources for the subsequent generation of the other tools. Generation of the model is initialized by the command lmake <model>.

Generating the simulator

We can generate two types of simulators for a given LISATek model: a dynamic simulator and a static simulator. Typing lmake simlib in the command line generates the simulator libraries: the source C++ files are processed and linked either into architecture.so, which contains the generated dynamic libraries, or into a file named architecture.a containing the static libraries.

Generating Assembler-Linker-Disassembler

Typing lmake assembler generates an assembler for the processor: an executable that translates assembly code written for the processor into a binary file, using the SYNTAX and CODING sections of the LISA OPERATIONs. The generated binary file is linked into an executable file by the linker tool llnk, which is built by typing lmake linker. The disassembler tool can be built with the command lmake disassembler; it re-translates an executable file for the LISATek processor into a file that contains both the assembly code and the coding of the instructions as they are read by the processor, which checks whether the processor interprets the instructions written for it correctly.

A.2.3 LISATek Processor Debugger

The LISATek Processor Debugger is a very powerful and user-friendly tool to simulate a LISATek model and check the processor behavior. The debugger monitors, in every simulation cycle, every part of the processor model, such as the registers, memories and pipelines. One window shows the disassembly code, where it is possible to see the last instruction fetched by the processor and how many times an instruction has been executed, which is most useful for loops. Another window shows the corresponding lines of the source code, assembly or C++, depending on how the benchmark is written. Simulation can be run in three different ways:

1. Free run: runs the simulation to the end or to a previously set breakpoint.

2. Run-debug: runs the simulation showing, in each cycle, the values stored in the memory, the processor registers and the pipeline registers. This mode is slower than free run.

3. Run cycle: allows the architecture to be debugged step by step.

A.2.4 LISATek Processor Generator

Processor Generator is a powerful LISATek tool that allows the generation of Hardware Description Language (HDL) code.

LISATek Processor Generator options

Processor Generator offers various configuration options:

General settings simply allows verbose mode to be enabled or disabled and the logfile name to be specified.

Target settings is the section in which the HDL language is chosen, between Verilog and VHDL, the clock and reset are defined, and the reset properties, such as active high/low and synchronous/asynchronous, are set.

Script Generation allows one to define which scripts are generated, based on the language (VHDL or Verilog) and on the simulation and synthesis tools, e.g. Mentor ModelSim and Synopsys Design Compiler.

Resource Properties is the section regarding memories. It determines, among other things, whether a memory is internal or external, as well as the names of the memory files for hardware simulation.

Figure A.2: Example OAT Domain

Figure A.3: Screenshot of Processor debugger

LISATek HDL optimizations

The Processor Generator provides a number of options for manipulating the generation of the HDL code with respect to optimized area, timing or power consumption. There are four major options to optimize the generated code:

1. Resource Sharing: resource sharing is limited to path sharing, i.e., the automatic detection and sharing of exclusive accesses to LISA resources.

2. Decision Minimization: this option reduces the multiplexing logic for resource accesses by multiplexing only that part of a path which is actually relevant for the decision whether an access is executed or not.

3. Hierarchical Pattern Matching: hierarchical pattern matching can reduce the gate count by generating comparators with the smallest bit width; however, the critical path may increase, as the hierarchical comparator structure introduces additional levels of logic.

4. Condition Decoupling for Group Calls and Expressions:

Figure A.4: Screenshot of Processor Generator

APPENDIX B

Assembly Code

This appendix contains the assembly code of the FIR filter operation mapped onto the very wide register architecture and onto the scalar register file architecture.

B.1 Assembly code using Very Wide Register

The assembly code of the FIR filter using very wide registers is shown below. I have used three very wide registers (vr1, vr2 and vr3), each of width 4. Apart from the very wide registers, eight scalar registers (r1-r8) are also used, as variables and counters.

.include "fir.S" .text r1=16 ; iterator for the loop r5=0 ; address pointer to the inputs r6=24 ; address pointer to hold ouput mov vr3.3,10 r3=4 _loop:

46 mov vr3.1,0 r4=4

vr2.0=dmem[r5+0] ;loading inputs from memory to VWR2 vr1.0=dmem[r4+0] ;loading coefficients from memory to VWR1

mov vr3.2,0 ;intializing VWR3.2 to zero vr3.3=vr2.1 + vr1.1 ; ;Begining of FIR Operation vr3.1=vr2.0*vr1.0 vr3.2=vr3.2+vr3.1 vr3.1=vr2.1*vr1.1 nop nop vr3.2=vr3.2+vr3.1 nop vr3.1=vr2.2*vr1.2 nop nop vr3.2=vr3.1+vr3.2 vr3.1=vr2.3*vr1.3 nop nop vr3.2=vr3.1+vr3.2 r4=r4+1 r5=r5+4 nop vr1.0=dmem[r4+0] vr2.0=dmem[r5+0] nop vr3.1=vr2.0*vr1.0 nop nop

47 vr3.2=vr3.1+vr3.2 vr3.1=vr2.1*vr1.1 nop nop vr3.2=vr3.1+vr3.2 nop vr3.1=vr2.2*vr1.2 nop nop vr3.2=vr3.1+vr3.2 vr3.1=vr2.3*vr1.3 nop nop vr3.2=vr3.2+vr3.1

r2=(vr3.3 < 16) ;checking the condition r3=r3-1 r1=r1-1 r5=r5-3 if(!r2)jmp _loop3 if(!r3) jmp _store ;if VWR3 is full with outputs Store ;operation is performed if(r1) jmp _loop ;Checks whether inputs are finished or not _loop3: dmem[r6+0]=vr3.0 ;storing zero in memory if i/p + coeff > 16 r5=r5+1 jmp _loop _store: ;Storing result in Very Wide Register dmem[r6+0]=vr3.2 r6=r6+1 jmp _loop _end: ;Macros storing the inputs and coeff in memory initially

48 .section data1, "aw",@progbits _startdata1: generate_block_data(1) _startcoeff1: generate_block_coeff(7)

.section data, "aw",@progbits _startdata0: generate_block_data(0) _startcoeff0: generate_block_coeff(8)

.section data2,"aw",@progbits _startdata2: generate_block_data(2) _startcoeff2: generate_block_coeff(6)

.section data3,"aw",@progbits _startdata3: generate_block_data(3) _startcoeff3: generate_block_coeff(5)

B.2 Assembly code using Scalar Register File

The assembly code of the FIR filter using scalar register files is shown below. This assembly code is mapped onto the RISC architectures using 32 and 16 scalar register files.

.include "fir.S" .text r1=16 ; iterator for the loop r4=0 ; address pointer to the inputs

49 r5=24 ; address pointer to hold ouput _loop1: r6=16 ;Pointer to access the coeff data r2=8 ;Iterator of the _loop2 r11=0 r9=0 _loop2: r7=dmem[r4+0] ;loading inputs from memory to VWR2 r8=dmem[r6+0] ;loading coefficients from memory to VWR1 nop nop r10=r7+r8 r9=r7*r8 nop ;intializing VWR3.2 to zero r3=(r10 < 16) ;checking if the i/p+coeff < 16 ;if < 16 proceed the loop else store0 in the o/p r11+=r9 r2-=1 if(!r3)jmp _store ;if r3=0 jump to store loop r4=r4+1 r6=r6+1 if(r2)jmp _loop2 r1=r1-1 r4=r4-7 dmem[r5+0]=r11 ;store tje o/p to memory r5=r5+1 _start: if(r1) jmp_loop1 ;if r1=0 go to loop ;else continue to loop1 jmp _end _store: r11=0 r1=r1-1 nop

50 dmem[r5+0]=r11 r5=r5+1 jmp _start _end; nop nop

;Macros storing the inputs and coeff in memory initially .data _startdata: generate_data(0) _startcoeff: generate_coeff(8) nop

Bibliography

[1] S. Rixner, W. J. Dally, B. Khailany, P. R. Mattson, U. J. Kapasi and J. D. Owens, "Register Organization for Media Processing," in Proceedings of HPCA, January 2000, pp. 375-386.

[2] P. Raghavan, A. Lambrechts, M. Jayapala, F. Catthoor, D. Verkest and H. Corporaal, "Very Wide Register: An Asymmetric Register File Organization for Low Power Embedded Processors," IMEC vzw, Kapeldreef 75, 3001 Leuven, Belgium.

[3] B. Amrutur and M. Horowitz, "Speed and Power Scaling of SRAMs," IEEE Journal of Solid-State Circuits, vol. 35, February 2000.

[4] R. J. Evans and P. D. Franzon, "Energy Consumption Modeling and Optimization for SRAMs," IEEE Journal of Solid-State Circuits, vol. 35, February 2000.

[5] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele and A. Vandecappelle, Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design, Kluwer Academic Publishers, Boston, 1998, pp. 571-579.

[6] TI DSP Benchmark Suite, http://focus.ti.com/docs/toolsw/folders/print/sprc.092.html

[7] J. B. Domelevo, "Working on the Design of a Customizable Ultra-Low-Power Processor: A Few Experiments," Master's thesis, ENS Cachan Bretagne and IMEC, September 2005.
