AN ABSTRACT OF THE THESIS OF

John Mark Matson for the degree of Master of Science in Electrical and Computer

Engineering presented on May 2, 2003.

Title: Designing a Reconfigurable Embedded Processor.

Abstract Redacted for Privacy

Ben Lee

The growth of applications for embedded processors has spawned a need for highly configurable devices. Custom designs have long design cycles for a fast-paced market, whereas off-the-shelf designs often provide neither the required level of configuration nor support for system-on-chip designs. This paper presents a description of a software environment that allows designers to provide configuration options for a design, and responds by dynamically reconfiguring the environment to provide a ready-to-test design. A background survey is provided on current embedded RISC architectures, along with a proposed new embedded ISA and a cycle-level simulator. Justification is presented for a new instruction format to reduce code size with little loss of performance. A manual is also provided for the new ISA.

Designing a Reconfigurable Embedded Processor

By John Mark Matson

A THESIS

submitted to

Oregon State University

in partial fulfillment of the requirements for the degree of

Master of Science

Presented May 2, 2003

Commencement June 2003

Master of Science thesis of John Mark Matson presented on May 2, 2003.

APPROVED: Redacted for Privacy

Major Professor, representing Electrical and Computer Engineering

Redacted for Privacy

Director of School of Electrical Engineering and Computer Science

Redacted for Privacy

Dean of the Graduate School

I understand that my thesis will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my thesis to any reader upon request.

Redacted for Privacy

John Mark Matson, Author

TABLE OF CONTENTS

1. Introduction
2. Background
3. Profiling
   Register File Size
   Instruction Length
   Instruction Profiling: The ARM-PISA Comparison
4. Definition of the X32V ISA
   Default Mode
   Light Mode
   Ultra Light Mode
   Word Boundary Study
5. Creation of the X32V Cycle Accurate Simulator
   Instruction Fetch
   Instruction Decode
   Execute, Memory Access, Write Back
   Forwarding
   Memory
   System Calls
   Statistics
6. Future Work
7. Conclusion
Bibliography
Appendices
   Appendix A: X32V ISA Manual
   Appendix B: ARM SS and PISA SS Configuration Table
   Appendix C: ARM SS and PISA SS MPEG 4 Benchmark Results

LIST OF FIGURES

1. 32-bit Instruction Format
2. 24-bit Instruction Format
3. 16-bit Instruction Format
4. Default Mode
5. Light Mode
6. Ultra Light Mode
7. Code Size for each X32V Mode
8. Code Size Reduction
9. Cycle Overhead
10. Instruction Fetch Example
11. The X32V Simulator and Compiler

LIST OF TABLES

1. Various Embedded Processor Instruction Lengths
2. ARM and PISA MPEG 4 Benchmark Results
3. ARM Statistics Profile
4. PISA Statistics Profile

Designing a Reconfigurable Embedded Processor

1. Introduction

Designers of embedded products are presented with two options for a microprocessor: design their own, or purchase an off-the-shelf model. Designing a microprocessor from scratch is not very cost effective for a product that may have a shelf life of only a year. It can also push the design cycle out a few extra quarters, which can make or break a product in a highly competitive market. Off-the-shelf solutions come in a variety of forms: typical microprocessor, FPGA, CPLD, etc. The designer can choose from options such as memory size, I/O size, physical size, speed, and more. However, with the advent of System-on-Chip, it is becoming difficult to meet designers' requirements with an off-the-shelf solution. A 'flavor of the month' approach cannot meet the power, speed, area, and architectural requirements of every design.

Ideally, a designer would have an immediate turnaround between when design requirements are made and when first silicon arrives for testing. That goal, however, is unreasonable. What is reasonable is a software environment that could take the same design requirements and produce a ready-to-simulate design immediately. Simulations could then give the designer foresight into how the design will perform, where potential bottlenecks are, and the ability to experiment with new ideas that would otherwise be cost ineffective.

2. Background

The goal of this project is to design a software environment where a designer can provide design requirements, and the environment will dynamically change (reconfigure) to provide a ready-to-test design. Examples of design requirements are caches, floating-point units, digital signal processing units, multiply and accumulate units, and more. User-definable instructions are also available for controlling the additional hardware and for creating specialized operations such as multimedia instructions.

The software environment comprises three major components: a core Instruction Set Architecture (ISA), a compiler, and a simulator. Each of these components is an integral part of the environment and deserves discussion.

The designer should have the basic building blocks of a microprocessor available to expand upon; otherwise too much time is spent on recreating functionality that does not change from one design to another. Therefore, a core set of instructions and a pipeline definition are the framework for the environment.

The core ISA allows for the completion of virtually any design, while the additional hardware, instructions, and other parameters allow for enhanced performance.

A compiler is also required to transform high-level code into machine-executable instructions, and should handle the creation of new instructions to control additional hardware. The compiler is not presently completed and is beyond the scope of this paper. However, it will be referred to and the reader should be aware of its purpose.

Lastly, a simulator is needed to test and collect statistics on the design. The simulator should provide cycle-level accurate data on precisely how information moved through the machine during execution. The designer then uses this data to verify functionality and performance.

This paper describes the development of the core ISA and simulator, and is broken down as follows: Section 3: Profiling; Section 4: Definition of the X32V ISA; Section 5: Creation of the X32V Cycle Accurate Simulator; Section 6: Future Work; Section 7: Conclusion; Section 8: References; and Section 9: Appendices: A, X32V ISA Manual; B, ARM and PISA SimpleScalar Configuration Table; and C, ARM and PISA SimpleScalar MPEG 4 Benchmark Results.

3. Profiling

Before justifying the creation of an embedded RISC ISA, an analysis of processor characteristics and their impacts on performance, complexity, and size must be conducted. Recently Hennessy and Patterson [1] did a survey of desktop, server, and embedded RISC processors. In this survey the authors detailed the various architectures' core integer instructions, floating-point instructions, and unique instructions.

We investigated Hennessy and Patterson's survey of ARM, Thumb, SuperH, M32R, and MIPS16. Furthermore, we extended our survey to include Advanced Digital Chip's AE32000 [4] and Tensilica's Xtensa [5] processor. The survey focused on three main topics: register file sizes, instruction lengths, and typical core instructions.

Register File Size

All embedded processors surveyed used an integer register file with 16 or fewer entries, unlike most desktop and server platforms that use 32 or more entries. A smaller register file does limit the amount of data the processor has immediately accessible, which in turn can cause an increase in memory traffic. So why is a 16-entry register file the choice for most embedded processors? Tensilica [5] suggests that the performance loss from 32 registers to 16 registers is a mere 5%, and that the biggest drop comes between 16 registers and 8 registers. Perhaps more important than performance is the impact a register file has on power consumption and viable instruction formats.

Unsal et al. [2] compared the relative energy consumption of a 16-entry and a 32-entry register file during the simulation of the MediaBench tool set [3]. The study showed an average 25% decrease in power consumption for a 16-entry register file. The energy cost of the increased memory traffic did not outweigh the extra energy consumed by having twice as many registers and additional decoding logic.

Likewise, register file size can limit instruction format options. Typical RISC instructions are often broken down into a few broad categories: 'register-register' (RR), 'register-immediate' (RI), and 'immediate' (I). Typically an I-type instruction would simply have an operation code (op-code) and an immediate value. An RI-type instruction would specify an op-code, a register, and an immediate value. Lastly, an RR-type instruction would include an op-code and two or three registers.

RR-type instructions are of significant interest because of their dependence on the register file size; the larger the register file, the more bits required for addressing source and destination registers. For example, take an addition instruction encoded in an RR-type format that references three registers: R3 = R2 + R1. Each register must be encoded in five bits (2^5 = 32 registers), which consumes 15 bits of the instruction. Another format involves specifying only two registers, one source and one source/destination. For example, if the same instruction were encoded as R3 = R3 + R1, only 10 bits are needed for registers, and in a 16-bit instruction 6 bits remain for the op-code. This method results in the loss of R3's original data. In order to retain the present state, a copy of R3 must be made. In the worst scenario, this could double the execution time of RR-type instructions.

Of the two options for RR-type instructions, register reference consumes 10-15 bits of the instruction. As the next section shows, this severely limits the options for instruction formats.

Processor      Instruction Length (bits)
ARM Thumb      32/16
SuperH         16
M32R           32/16
MIPS16         32/16
ADC AE32000    16
Xtensa         24/16

Table 1 Various Embedded Processor Instruction Lengths

Instruction Length

Table 1 shows the instruction lengths of all the embedded processors surveyed. The importance of instruction length in an embedded architecture cannot be overlooked. Unlike desktop or server systems, embedded processors typically do not have access to a secondary storage device, which requires that all code reside in a ROM. This is costly in terms of silicon, size, and power.

As is shown in Table 1, most architects have chosen to sacrifice flexibility of instruction format in the hope of increasing code density. This is important to our analysis of RR-type instructions. If a three-register format is used with a 32-entry register file in a 16-bit instruction, only one bit remains for the op-code, which is clearly unreasonable. However, if only 4 bits are required to reference each register, as with a 16-entry register file, 4 bits still remain for the op-code.

The architects of the ARM Thumb processor chose to limit the number of addressable registers from 16 to 8 when in Thumb mode (a 16-bit instruction format). This requires that only 9 bits of the instruction be dedicated to register referencing.
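The bit budgets quoted in this section follow from simple arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes each register specifier uses the minimum number of bits needed to index the register file.

```python
def register_field_bits(num_registers: int, operands: int) -> int:
    """Bits spent on register specifiers in one instruction."""
    # Minimum bits needed to index the register file (32 registers -> 5 bits).
    bits_per_register = (num_registers - 1).bit_length()
    return bits_per_register * operands


def opcode_bits_left(instruction_bits: int, num_registers: int, operands: int) -> int:
    """Bits remaining for the op-code once the registers are encoded."""
    return instruction_bits - register_field_bits(num_registers, operands)


if __name__ == "__main__":
    # (register file size, number of register operands) in a 16-bit format
    for regs, ops in [(32, 3), (32, 2), (16, 3), (8, 3)]:
        print(regs, ops, register_field_bits(regs, ops), opcode_bits_left(16, regs, ops))
```

Under these assumptions, a 16-bit format leaves one op-code bit with three 5-bit register fields, and four op-code bits with three 4-bit register fields.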

Because most of the architectures have two formats, one large and one small, the loading of large immediate data takes place via the larger format. However, SuperH and AE32000 are limited to only 16 bits, which means immediate values larger than 12 bits could require multiple instructions. AE32000 uses an 'Extension Flag' (EF) [6] and an 'Extended' register to overcome this problem. Instructions that require large immediate values set the EF and merge two subsequent instructions in memory. This mechanism appears to have little effect on processor performance.

All of the architectures require half-word instruction memory addressing because of the 16-bit format; however, Xtensa requires byte-addressable instruction memory due to its unique 24-bit format. This can cause an interesting problem during branching and jumping, which we will discuss in Section 4.

Instruction Profiling: The ARM-PISA Comparison

Each ISA has unique instructions and ways of handling data, so it is hard to quantify what impact register file size and machine-specific instructions have on performance. To better understand, we decided to compare two ISAs: PISA (Portable ISA) [7] and ARM. PISA is a more classic RISC ISA with roots in MIPS and DLX, which provides a simple core against which to benchmark the ARM ISA. The recently developed SimpleScalar ARM simulator [8] and the SimpleScalar PISA simulator were used to conduct the study.

An MPEG 4 coder/decoder benchmark was used for the study. The program encodes and decodes a sample YUV input file to and from an MPEG 4 format. Although this does not exercise the floating-point instruction set, it does exhaustively exercise all the other facets of the ISA. Both simulators had all superscalar functionality disabled for the benchmark, to provide the greatest similarity to a five-stage pipeline. Appendix B shows the configurations used, and Table 2 shows the results of the benchmark.

Category        Statistic             ARM              PISA             Comment
Overall         sim_cycle             21,467,563,789   22,921,957,175   total simulation time in cycles
                sim_num_insn          8,483,059,671    9,815,840,093    total number of instructions
                sim_num_uops          13,466,040,213   -                total number of micro operations
Memory Access   sim_num_refs          3,260,265,879    2,110,058,029    total number of loads and stores
                sim_num_loads         2,135,057,852    1,564,723,828    total number of loads committed
                sim_num_stores        1,125,208,027    545,334,201      total number of stores committed
Branches        sim_num_branches      1,215,336,39C    1,912,743,848    total number of branches
                bpred_nottaken.miss   805,725,37c      1,310,894,457    total number of misses
Inst. Lifetime  sim_IPC               0.395            0.428            instructions per cycle
                sim_IPB               6.980            5.132            instructions per branch
                sim_CPI               2.531            2.335            cycles per instruction
Inst. Cache     il1.accesses          9,297,604,879    11,150,009,572   total number of accesses
                il1.hits              9,287,940,108    11,124,291,835   total number of hits
                il1.misses            9,664,771        25,717,737       total number of misses
                il1.miss_rate         0.001            0.002            miss rate (i.e., misses/ref)
Data Cache      dl1.accesses          3,259,788,53'    2,110,058,029    total number of accesses
                dl1.hits              3,245,149,14c    2,096,871,699    total number of hits
                dl1.misses            14,639,387       13,186,330       total number of misses
                dl1.miss_rate         0.004            0.006            miss rate (i.e., misses/ref)
Misc.           ld_text_size          393,87C          363,088          program text (code) size in bytes

Table 2 ARM and PISA MPEG 4 Benchmark Results
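The derived rows of Table 2 (IPC, CPI, and IPB) can be recomputed from the raw counters. The Python sketch below is an illustration rather than part of the SimpleScalar tool set; the function name is hypothetical, and the inputs are the PISA counters reported in Table 2.

```python
def derived_stats(cycles: int, instructions: int, branches: int) -> dict:
    """Recompute the summary ratios from raw simulator counters."""
    return {
        "IPC": instructions / cycles,    # instructions per cycle
        "CPI": cycles / instructions,    # cycles per instruction
        "IPB": instructions / branches,  # instructions per branch
    }


# PISA counters from Table 2: sim_cycle, sim_num_insn, sim_num_branches.
pisa = derived_stats(
    cycles=22_921_957_175,
    instructions=9_815_840_093,
    branches=1_912_743_848,
)

if __name__ == "__main__":
    for name, value in pisa.items():
        print(f"{name}: {value:.3f}")
```

Rounded to three decimal places, these ratios agree with the PISA column of Table 2.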

It should be noted that ARM has 32 General Purpose Registers (GPRs), but only 16 of them are accessible in user mode [9]. Of those, R15 is the Program Counter (PC), R14 is the Link Register (LR), and R13 is the Stack Pointer (SP). PISA, however, has 32 accessible GPRs, with R0 fixed at zero and R31 serving as the LR. This means that PISA has 2.3 times (30/13) as many GPRs available. As stated earlier, fewer registers can mean an increase in memory traffic in order to store data already in the register file. Table 2 shows that ARM requires 2.1 times as many stores as PISA, which closely tracks the difference in register file size. Even though ARM spent more time loading and storing data due to its smaller register file, it still showed a better overall simulation time in cycles.
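The two ratios in this paragraph can be checked directly. The short Python sketch below uses only figures stated in the text and in Table 2; the variable names are illustrative.

```python
# Registers available for general use, per the text:
# ARM user mode exposes 16 registers, of which PC, LR, and SP are reserved.
arm_free_gprs = 16 - 3
# PISA exposes 32 registers, of which R0 (zero) and R31 (LR) are reserved.
pisa_free_gprs = 32 - 2

register_ratio = pisa_free_gprs / arm_free_gprs

# Committed stores (sim_num_stores) from Table 2.
arm_stores = 1_125_208_027
pisa_stores = 545_334_201

store_ratio = arm_stores / pisa_stores

if __name__ == "__main__":
    print(f"register ratio: {register_ratio:.2f}")
    print(f"store ratio:    {store_ratio:.2f}")
```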

Three architectural differences help ARM to improve performance: conditional branching, embedded shifting, and the load/store-multiple instruction. Tables 3 and 4 illustrate summarized data of the instruction profile in Appendix C.

In order to reduce the number of branches, ARM allows conditional execution of any instruction. Conditional execution is based upon four bits of the Status Register: Negative, Zero, Carry, and Overflow. A four-bit condition field is embedded in each instruction (bits [31:28]). Upon execution, the processor evaluates this condition against the present state of the Status Register and determines if the instruction should be executed. If the instruction is not executed, then it is converted to a NOP (no operation), and a bubble/slip occurs in the pipeline.

Four instructions are used to modify the condition bits: Compare Negative (CMN), Compare (CMP), Test Equivalence (TEQ), and Test (TST). These instructions compare two data elements and update the condition bits in the Status Register so subsequent instructions can be conditionally executed.
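As a rough illustration of conditional execution, the Python sketch below evaluates the four-bit ARM condition field (instruction bits [31:28]) against the N, Z, C, and V flags using the standard ARM condition codes. The function names are hypothetical, and treating the reserved code 1111 as "never execute" is an assumption of this sketch.

```python
def cond_field(instruction: int) -> int:
    """Extract the condition field from bits [31:28] of a 32-bit instruction."""
    return (instruction >> 28) & 0xF


def condition_passes(cond: int, n: bool, z: bool, c: bool, v: bool) -> bool:
    """Return True if an instruction with this condition code would execute."""
    checks = {
        0b0000: z,                    # EQ: equal
        0b0001: not z,                # NE: not equal
        0b0010: c,                    # CS: carry set
        0b0011: not c,                # CC: carry clear
        0b0100: n,                    # MI: negative
        0b0101: not n,                # PL: positive or zero
        0b0110: v,                    # VS: overflow
        0b0111: not v,                # VC: no overflow
        0b1000: c and not z,          # HI: unsigned higher
        0b1001: (not c) or z,         # LS: unsigned lower or same
        0b1010: n == v,               # GE: signed greater or equal
        0b1011: n != v,               # LT: signed less than
        0b1100: (not z) and n == v,   # GT: signed greater than
        0b1101: z or n != v,          # LE: signed less or equal
        0b1110: True,                 # AL: always
    }
    # Assumption: the reserved code 0b1111 never executes in this sketch.
    return checks.get(cond, False)


if __name__ == "__main__":
    # 0x0A000000 is a branch with the EQ condition; it executes only when Z is set.
    word = 0x0A000000
    print(condition_passes(cond_field(word), n=False, z=True, c=False, v=False))
```

An instruction whose condition fails corresponds to the NOP and pipeline bubble described above.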

Because of the condition bits, ARM needs only two typical control instructions: Branch and Branch-and-Link. Table 3 demonstrates that ARM branched a very low 14.32% of the time and achieved 6.98 instructions per branch (IPB) (Table 2). PISA branched 19.49% of the time (Table 4) and as a result had only 5.13 IPB (Table 2), nearly two instructions fewer per branch.

For every conditional instruction that is not executed there is a bubble/slip in the pipeline. Take, for instance, an instruction that modifies the status register and 10 subsequent instructions that depend upon that condition. If each subsequent instruction does not meet the condition then they would each go through the pipeline without ever actually executing. The result is 10 slips/bubbles in the pipeline.

Instruction Type     Count            % Utilized
Load                 1,741,378,189    20.53
Store                589,048,095      6.94
Control              1,215,336,160    14.32
Integer              4,933,354,454    58.15
FP                   3,818,200        0.05
Misc.                11,710           0.00
Total Instructions   8,482,946,808

Table 3 - ARM Statistics Profile

The second key point is that ARM does not have a shift instruction. This is because all register instructions can have a shift embedded into them. In RR-type instructions, the second source register can be optionally shifted with a 5-bit immediate value. This allows arithmetic shift right, logical shift left, logical shift right, and rotate right. It is interesting to note that PISA spends about 10.6% of execution using all six of its available shift instructions (see Appendix C).

Instruction Type     Count            % Utilized
Load                 1,564,723,828    15.93
Store                545,334,198      5.55
Control              1,912,743,848    19.49
Integer              5,787,281,813    58.95
FP                   3,824,850        0.04
Misc.                23,120           0.00
Total Instructions   9,813,931,657

Table 4 - PISA Statistics Profile

Lastly, we previously stated that ARM had twice as many stores as PISA. However, Tables 3 and 4 reveal that the numbers of store instructions for PISA and ARM are almost the same. It is important to notice the subtle difference between a memory access and a memory instruction. ARM executes nearly the same number of memory instructions, but many of these instructions are Load/Store Multiple (LDM/STM). This allows efficient control of loading and storing several words of data with only one instruction. For LDM the processor simply loads several registers from contiguous memory addresses in the data cache. STM works much the same way for stores. The statistics show that ARM benefits from fewer I-cache accesses (16.5%) and about one third the number of misses. This is probably due to instructions like LDM/STM.

4. Definition of the X32V ISA

Our study of ARM and PISA gave us insight into how ISA design decisions can impact performance. However, we were unable to quantify the impact of instruction length on performance since both ARM and PISA have 32-bit instruction lengths. As was mentioned earlier, instruction length is an important characteristic of an embedded device. Care should be taken to balance the need for code density with that of a flexible instruction format.

After considering these costs, we chose to create our new X32V ISA with multiple modes of operation. The idea behind modes of operation is to compress instructions whenever possible to improve code density. Outlined below are the three proposed modes.

Default Mode

This is a 32-bit-only format that allows for a large variety of instructions, large immediate values, and room for expansion of the instruction set (i.e., co-processor instructions). However, even very simple instructions take 32 bits. Example: Jump Register requires only an op-code and a source register. If the op-code is 6 bits, and we have a 16-entry register file, then this instruction could be encoded in 10 bits. However, Default Mode requires the fixed 32-bit format.

Light Mode

This is a mixed 32-bit/16-bit format. The compiler analyzes each instruction to determine if it can fit into a compressed 16-bit format. Example: Jump simply requires an op-code and an immediate value. If the destination of the jump is within ±127, then the immediate value can fit in a 16-bit instruction format.

Ultra Light Mode

This is a 32-bit/24-bit/16-bit format. The compiler does exactly the same as in Light Mode except there is an additional format. If an instruction requires too many bits for a 16-bit format, the compiler can try a 24-bit format. Example: Jumping further than ±127 (an 8-bit immediate value) but less than ±8,388,607 (a 24-bit immediate value). In this case a 16-bit immediate value would be ideal.
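The format choice described for the three modes can be sketched as a search for the narrowest encoding whose immediate field holds the jump displacement. This Python sketch is an assumption-laden illustration, not the X32V compiler: the names are hypothetical, the immediate widths (8, 16, and 24 bits) are taken from the Jump/Call formats, and the symmetric ± ranges follow the examples in the text.

```python
# Immediate-field width (bits) of the Jump/Call format at each instruction width.
JUMP_IMMEDIATE_BITS = {16: 8, 24: 16, 32: 24}

# Instruction widths each mode may use, narrowest first.
MODE_FORMATS = {
    "default": (32,),
    "light": (16, 32),
    "ultra_light": (16, 24, 32),
}


def fits_signed(value: int, bits: int) -> bool:
    """True if value lies within the symmetric range +/-(2^(bits-1) - 1)."""
    limit = (1 << (bits - 1)) - 1
    return -limit <= value <= limit


def smallest_jump_format(displacement: int, mode: str) -> int:
    """Return the narrowest instruction width (in bits) able to encode the jump."""
    for width in MODE_FORMATS[mode]:
        if fits_signed(displacement, JUMP_IMMEDIATE_BITS[width]):
            return width
    raise ValueError("displacement does not fit in any format of this mode")


if __name__ == "__main__":
    for disp in (100, 200, 40_000):
        print(disp, [smallest_jump_format(disp, m) for m in MODE_FORMATS])
```

A displacement of 200, for example, needs the 32-bit format in Light Mode but only the 24-bit format in Ultra Light Mode.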

Figures 1-3 show all possible instruction formats for 32-bit, 24-bit, and 16-bit instructions. The instruction formats are subdivided into five instruction categories: Load/Store, Immediate, Branch, Register, and Jump/Call.

Figure 1 32-bit Instruction Format (field widths in bits: 4-4-4-4-16 for LD/ST, Immediate, and Branch; eight 4-bit fields for Register: 0011, op1, rd, rs0, rs1, op2, op3, op4; 4-4-24 for Jump/Call: 0100, op1, Label/Immediate)

Figure 2 24-bit Instruction Format (field widths in bits: 4-4-4-4-8 for LD/ST: 0101, op1, rd, rs0, Displacement; Immediate: 0110, op1, rs0, Immediate; Branch: 0111, op1, rd, rs0, Label; 4-4-4-4-4-4 for Register: 1000, op1, rd, rs1, op2; 4-4-16 for Jump/Call: 1001, op1, Label/Immediate)

Figure 3 16-bit Instruction Format (LD/ST, Immediate, and Branch: none; field widths in bits: 4-4-4-4 for Register Format 1: 1010, op1, rd, rs0, and Register Format 2: 1011, op1, rd, rs0; 4-4-8 for Jump/Call: 1100, op1, Label/Imm.)

Ideally the three modes would have no impact on the performance of the machine. However, while Default Mode will operate with no overhead, Light and Ultra Light Modes may incur a penalty when branching or jumping.

In a 32-bit architecture all instruction accesses are word aligned, but this is not guaranteed if instructions of varying length are mixed. It only becomes a problem when the target address of a branch/jump is an instruction that is situated across a word boundary. In this case a second fetch from instruction memory is required, which causes a one-cycle delay. Figure 4 shows that when all instructions are word-aligned, branches to any instruction are possible with a normal branch penalty.

0x00  32-bit
0x04  32-bit
0x08  32-bit
0x0C  32-bit

Figure 4 Default Mode

Figure 5, Light Mode, illustrates an instruction crossing a word boundary. If the instruction at address 0x0A (shown in gray) is the destination of a branch or jump, then the word starting at address 0x08 is fetched. This means only half of the instruction at 0x0A is obtained in the first cycle. An additional fetch is required to load the word at address 0x0C that contains the second half of the instruction.

0x00  32-bit
0x04  16-bit | 16-bit
0x08  16-bit | 32-bit
0x0C  32-bit | 16-bit

Figure 5 Light Mode
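The word-boundary condition can be expressed as a small check on an instruction's start address and length. The Python sketch below is illustrative: the function names are hypothetical, a 4-byte word is assumed, and the instruction lengths in the usage example are one reading of the Light Mode layout in Figure 5.

```python
WORD_BYTES = 4  # assumed fetch width: one 32-bit word per access


def crosses_word_boundary(address: int, length_bytes: int) -> bool:
    """True if the instruction's first and last bytes lie in different words."""
    first_word = address // WORD_BYTES
    last_word = (address + length_bytes - 1) // WORD_BYTES
    return first_word != last_word


def boundary_crossers(lengths_bytes, base: int = 0):
    """Lay instructions out back to back and list the addresses that straddle a word."""
    crossers = []
    address = base
    for length in lengths_bytes:
        if crosses_word_boundary(address, length):
            crossers.append(address)
        address += length
    return crossers


# One reading of Figure 5: 32, 16, 16, 16, 32, 16 bits starting at 0x00.
LIGHT_LAYOUT = [4, 2, 2, 2, 4, 2]

if __name__ == "__main__":
    print([hex(a) for a in boundary_crossers(LIGHT_LAYOUT)])
```

Only instructions flagged by this check would incur the extra fetch cycle, and only when they are the target of a branch or jump.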

Likewise, Figure 6 illustrates four instructions (shown in gray) in Ultra Light Mode that cause a one-cycle penalty if they are the destination of a branch or jump.

0x00  16-bit | 24-bit
0x04  24-bit | 16-bit | 32-bit
0x08  32-bit | 16-bit
0x0C  16-bit | 16-bit | 32-bit

Figure 6 Ultra Light Mode

Word Boundary Study

In order to get a feel for how the three different modes might impact performance, we set up a case study using SimpleScalar PISA. Our goal was to convert PISA code into X32V code using the three new formats, in order to estimate code size reduction and performance impacts. We analyzed each PISA instruction and looked for a one-to-one mapping to an X32V instruction. In the event a direct mapping could not be found, we approximated which formats the instruction could be encoded in. For example, X32V has no Logical NOR instruction, but PISA does. X32V does have a Logical AND, so we safely assumed similar format constraints.

Next we compiled, for PISA, several benchmarks from the MediaBench suite [3]. We used a PERL script to cycle through the code (text) portion of the program to determine which X32V format an existing SimpleScalar (SS) instruction could be mapped to. The result was X32V code in Default, Light, or Ultra Light mode. The size was calculated by adding the code size (instructions) to the data size (initialized data). Figure 7 shows the code sizes for the three X32V modes.

Figure 7 Code Size for each X32V Mode (bar chart of code size per benchmark for the Default, Light, and Ultra-Light modes; vertical axis from 0 to 250,000)

We can see that in all cases there was a benefit from both Light and Ultra Light modes. Figure 8 shows the percentage of reduction in code size. Light mode gives at least a 7% reduction in all the benchmarks, and Ultra Light gives about a 25% reduction.

Estimating overhead began by running each benchmark in the SS simulator and collecting data. Instruction memory, simulated instruction stream, and general simulation statistics were all gathered. We then used another PERL script to analyze the output.

The script stepped through the PISA code (text) portion again and recorded every branch/jump, and the destination address of that branch/jump. It then cycled through the instruction stream and counted how many times each branch/jump was taken. Lastly, it merged the new X32V code with these results.

The effect was a hash containing every X32V instruction, the old PISA address, the new X32V address, and the number of times (if any) it was branched to. We then looked at each instruction that was on a word boundary to see how many times it was branched/jumped to.

Figure 8 Code Size Reduction (percentage reduction per benchmark for the Light and Ultra-Light modes; vertical axis from 0.00% to 30.00%)

To calculate the overhead we separated the total executed instructions into two groups: normal instructions and instructions on a boundary that were the target of a branch/jump. The normal instructions were first multiplied by our simulated cycles-per-instruction (CPI). Then the word boundary instructions that were targets of branches/jumps were multiplied by (CPI + 1) to account for the one-cycle overhead of fetching the second half of the instruction. The two figures were then added together to represent the estimated number of cycles for Light or Ultra Light.
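The overhead estimate described above reduces to a short calculation. The Python sketch below restates it under the same model (one extra cycle per branch/jump that lands on a boundary-crossing instruction); the function names and the numbers in the usage example are hypothetical and are not measured results.

```python
def estimated_cycles(total_instructions: int, boundary_targets: int, cpi: float) -> float:
    """Cycles for Light or Ultra Light under the one-extra-cycle model."""
    normal = total_instructions - boundary_targets
    # Normal instructions cost CPI; boundary-crossing targets cost CPI + 1.
    return normal * cpi + boundary_targets * (cpi + 1)


def overhead_percent(total_instructions: int, boundary_targets: int, cpi: float) -> float:
    """Percentage increase in cycles relative to Default Mode."""
    baseline = total_instructions * cpi
    extra = estimated_cycles(total_instructions, boundary_targets, cpi) - baseline
    return 100.0 * extra / baseline


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 executed instructions, 50 of which were
    # branch/jump targets straddling a word boundary, at a CPI of 2.0.
    print(estimated_cycles(1000, 50, 2.0))
    print(overhead_percent(1000, 50, 2.0))
```

Because the extra cost is one cycle per boundary-crossing target, the overhead in this model depends only on how often such targets are executed and on the baseline CPI.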

Figure 9 shows the percentage overhead of Light Mode and Ultra Light Mode when compared to Default Mode. We can see that regardless of the mode we incur about a 3% cycle overhead, on average, compared to Default mode.

Figure 9 Cycle Overhead (percentage cycle overhead per benchmark for the Light and Ultra Light modes; vertical axis from 0.00% to 8.00%)

Ultra Light shows strong prospects for reducing code size (25%) with little performance loss (3%). Furthermore, it still allows for 32-bit instructions, which gives the user a great deal of flexibility when user-customizable instructions are added.

Most of the architecturally unique characteristics of X32V either center on instruction formats or, in the case of customizable instructions, have yet to be developed. As such, the details of the architecture are left for the reader to investigate in Appendix A: The X32V ISA Manual. This manual outlines addressing modes, special purpose registers, instruction formats, instruction descriptions, etc.

5. Creation of the X32V Cycle Accurate Simulator

The X32V simulator was developed in order to accurately model the architecture and verify its functionality. It was during this development that simulator goals were drafted and low-level architecture decisions were finalized. Pipeline implementation, cache hierarchy, forwarding mechanisms, system calls, and statistics gathering were the main goals of this initial version of the simulator.

The simulator is based on a typical 5-stage pipeline with the following stages: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB).

Instruction Fetch

The IF stage is perhaps the most complicated stage. Unlike a typical RISC machine where the instruction indicated by the program counter (PC) is simply fetched, the X32V must handle the issue of multiple modes of operation. Three possible scenarios exist for fetching instructions: one instruction fetched, more than one instruction fetched, or only part of an instruction fetched.

If only one instruction is fetched, the target instruction is placed in the instruction decode register, the PC is incremented to the next location in memory, and operation continues as normal. If multiple instructions are fetched then the target instruction has to be masked off and placed in the decode register, and the second instruction must be stored. For example, Figure 10 shows a 16-bit instruction at address 0x00. When this instruction is fetched, the first two bytes of the following 24-bit instruction are also fetched. This partial instruction is stored in the instruction stream buffer until the last byte is fetched on the next clock cycle, and the two portions are merged together and placed in the instruction decode register.

0x00  16-bit | 24-bit
0x04  24-bit | 16-bit | 32-bit
0x08  32-bit | 16-bit
0x0C  16-bit | 16-bit | 32-bit

Figure 10 Instruction Fetch Example

The last scenario is when only part of an instruction is fetched. As described earlier, this will only happen during a branch/jump. In this event, the bytes fetched are stored in the instruction stream buffer until the next clock cycle, when the remaining bytes are fetched.
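The three fetch scenarios can be told apart by comparing where the target instruction ends with where the fetched word ends. The Python sketch below is a simplified illustration rather than the simulator's implementation: the function name and return labels are hypothetical, it assumes a 4-byte aligned fetch, and it considers a fetch that starts at the PC with an empty instruction stream buffer.

```python
WORD_BYTES = 4  # assumed: one aligned 32-bit word is fetched per cycle


def classify_fetch(pc: int, length_bytes: int) -> str:
    """Classify the fetch of the instruction starting at pc.

    Returns "single" if the instruction ends exactly at the end of the
    fetched word, "multiple" if bytes of a following instruction are
    fetched along with it, and "partial" if the instruction continues
    into the next word and needs a second fetch.
    """
    word_end = (pc // WORD_BYTES + 1) * WORD_BYTES
    instruction_end = pc + length_bytes
    if instruction_end > word_end:
        return "partial"    # remaining bytes arrive on the next cycle
    if instruction_end < word_end:
        return "multiple"   # leftover bytes go to the stream buffer
    return "single"


if __name__ == "__main__":
    print(classify_fetch(0x00, 2))  # the 16-bit instruction at 0x00 in Figure 10
    print(classify_fetch(0x00, 4))  # an aligned 32-bit instruction
    print(classify_fetch(0x0A, 4))  # a branch target that straddles a word
```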

Instruction Decode

The Instruction Decode stage is responsible for determining the type of instruction, calculating immediate values, accessing source registers, and calculating target addresses. Because of the performance loss incurred by calculating branch addresses in the EX stage, X32V has access to an adder and a comparator in the ID stage. This means that the branch outcome and address are determined by the end of the ID stage. Thus, for instructions like B_EQ (Branch if Equal) there is only a one-cycle penalty if the branch is not taken.

Execute, Memory Access, Write Back

These three stages all follow typical RISC operation. All ALU operations occur in the EX stage. All memory accesses take place during the MEM stage, and data is written back to the register file during the WB stage.

Forwarding

There are two forwarding paths in X32V: one from the ALU output in the EX stage to the ALU input, and one from the Memory Data Register in the MEM stage to the ALU input. These two forwarding paths check to see if source registers in the ID stage are the destination of an instruction in either the EX or MEM stages. In this event, data is forwarded. The forwarding paths are in place for either source register 0 or source register 1.

Furthermore, the register file is written to on the first half of a clock cycle and read from on the second half. This removes any hazards for instructions in the WB stage.
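The forwarding check can be sketched as a comparison of a source register number against the destination registers held in the EX and MEM stages. This Python sketch is illustrative only: the class and function names are hypothetical, and giving the EX stage priority over the MEM stage (so that the most recent result wins) is an assumption that the text does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageResult:
    """Result held in a pipeline stage: destination register (if any) and its value."""
    dest: Optional[int]
    value: int


def forward_operand(src_reg: int, regfile_value: int,
                    ex: StageResult, mem: StageResult) -> int:
    """Select the value for one source operand of the instruction in ID."""
    # Assumption: the EX stage holds the younger result, so it takes priority.
    if ex.dest is not None and ex.dest == src_reg:
        return ex.value
    if mem.dest is not None and mem.dest == src_reg:
        return mem.value
    # No hazard: use the value read from the register file.
    return regfile_value


if __name__ == "__main__":
    ex = StageResult(dest=3, value=42)
    mem = StageResult(dest=5, value=7)
    print(forward_operand(3, 0, ex, mem))  # forwarded from EX
    print(forward_operand(5, 0, ex, mem))  # forwarded from MEM
    print(forward_operand(1, 9, ex, mem))  # read from the register file
```

The same check would be applied once for source register 0 and once for source register 1.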

Memory

The cache/main memory system is borrowed from SimpleScalar. At present it is not completely ported to the X32V simulator, but once completed it will provide a customizable hierarchy of L1 cache, L2 cache, and main memory.

System Calls

The processor uses system calls to make requests to the native operating system that the simulator is running on. This allows the program under simulation to read and write to files, print information to the screen, etc. Like the memory structure, system calls are also being ported from the SimpleScalar tool set. Presently the majority of system calls are in working order, and debugging of the remaining few is work in progress.

Statistics

Statistics are the means for the simulator to provide information about how the architecture performed. Instruction count, branches, taken branches, memory traffic, pipeline bubbles, etc., are all statistics that must be gathered. Most memory related statistics are already implemented in the SimpleScalar memory module. The remaining statistics are being implemented as development of the simulator is completed.

[Figure 11 diagram. Labeled blocks: Operating System, System Calls, Compiler, C Code, X32V Code, Loader, Pipeline (stages ending in EX, MEM, WB), I-Cache, D-Cache, Main Memory, X32V Simulator.]

Figure 11 - The X32V Simulator and Compiler

Figure 11 shows the major components of the X32V simulator, and how they interact with the operating system and the compiler. C code is initially passed to the compiler, which in turn generates X32V assembly code. The assembled and linked code is then passed to the loader during the initialization of the simulator. The loader is responsible for initializing the memory elements and placing instruction code and data in the memory hierarchy. Lastly, the pipeline begins execution. References to instruction cache, data cache, or the native OS are addressed during simulation.

6. Future Work

The description in the last two sections is only the first step of a much larger project. The core ISA and cycle accurate simulator serve as a key foundation for larger goals. However, the compiler is still in the early stages of development. In order to verify the X32V architecture we must have the flexibility to run virtually any benchmark on it. Furthermore, a key component of the compiler is the generation of new instructions. Users will provide information about added hardware to the compiler, and it must generate instructions for that hardware.

Likewise, the simulator will have to be modularized to handle new hardware components. Each component must be incorporated into the existing five-stage pipeline, and decoding of additional instructions will have to be added.

7. Conclusion

This paper proposes a framework for a software environment. It includes a detailed description for a new embedded RISC ISA (X32V) and a simulation environment. Several studies were done to characterize embedded processors in preparation for developing the X32V. Care was taken to provide an ISA that balanced instruction code size, performance, and the option for user configuration. Simulation results were provided that demonstrated how multiple instruction formats can be used to increase code density with minimal loss to performance. Lastly, future plans were outlined for how to bring the software environment to a completed state.

Bibliography

1. Hennessy, J.L., & Patterson, D.A. (2003). A Survey of RISC Architectures for Desktop, Server, and Embedded Computers. In Computer Architecture: A Quantitative Approach (Appendix C). San Francisco: Morgan Kaufmann Publishers.

2. Unsal, O.S., Koren, I., Krishna, C.M., & Moritz, C.A. (February 2002). The Minimax Cache: An Energy-Efficient Framework for Media Processors. Eighth International Symposium on High-Performance Computer Architecture (HPCA'02).

3. Lee, C., Potkonjak, M., & Mangione-Smith, W.H. (December 1997). MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. Proceedings of the 30th Annual International Symposium on Microarchitecture.

4. Advanced Digital Chips. (January 2000). EISC 32bit Microprocessor: Extendable Instruction Set Computer. 0.5.4. Advanced Digital Chips. Seoul, Korea: Advanced Digital Chips.

5. Tensilica, Inc. (September 2002). Xtensa Architecture and Performance (white paper). Santa Clara, CA: Tensilica, Inc.

6. Lee, H., Beckett, P., & Appelbe, B. (January 2001). High Performance Extendable Instruction Set Computing. 6th Australian Computer Systems Architecture Conference (AustCSAC'01), Queensland, Australia, pp. 81.

7. Burger, D., & Austin, T.M. The SimpleScalar Tool Set, 2.0. [Online], http://www.simplescalar.com.

8. Austin, T.M., et al. SimpleScalar Tutorial, 4.0. [Online], http://www.simplescalar.com.

9. Seal, D. (2000). Introduction to the ARM Architecture. In ARM Architecture Reference Manual (pp. Al-i-Al-b). Harlow, England: Addison-Wesley.

APPENDICES

Appendix A: X32V ISA Manual

Architecture

Introduction

The purpose of this manual is to give the reader an understanding of the X32V architecture and instruction set. X32V is a RISC based microprocessor. It is built around a 32-bit data path and a five-stage pipeline. The core ISA is based on 32-bit instructions; however, later sections describe the multiple instruction format modes (32, 32/16, 32/24/16) that X32V operates in.

Addressing Modes

Keeping in line with a strict RISC architecture, X32V only allows loads and stores to access data memory. As such, only immediate and displacement addressing modes are supported. Immediate allows a 16-bit immediate value as an operand. Displacement addressing uses a GPR in conjunction with a signed immediate value to provide the effective address.

Indirect addressing is possible if the immediate value is set to zero.

Likewise, absolute addressing is accomplished by specifying a 16-bit immediate value and a GPR that is set to zero.

Byte, half-word, and word addressing are all available. However, word addressing is word aligned, half-word addressing is half-word aligned, and byte addressing is byte aligned.

X32V is a Big Endian architecture; thus, the byte whose address is xxxx00 resides at the most significant position in the word (the big end). Words are ordered so the most significant bit (MSB) is bit 31, and the least significant bit (LSB) is bit 0.

Figure A1 - Big Endian Word

    Bits     31-24    23-16    15-8     7-0
    Field    byte 0   byte 1   byte 2   byte 3
    Width    8        8        8        8
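As a small illustration of the byte ordering in Figure A1, the sketch below splits a 32-bit word into its four bytes, with byte 0 taken from bits 31-24. The helper names are hypothetical and are not part of the X32V tool set.

```python
def split_big_endian(word):
    """Split a 32-bit word into [byte 0, byte 1, byte 2, byte 3]."""
    word &= 0xFFFFFFFF
    # byteorder="big" places the most significant byte first (byte 0).
    return list(word.to_bytes(4, byteorder="big"))


def byte_at(word, address):
    """Return the byte of the word selected by the two low address bits."""
    return split_big_endian(word)[address & 0x3]
```

For the word 0x11223344, an address ending in binary 00 selects 0x11 (the big end) and an address ending in 11 selects 0x44.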

Registers

Registers are broken up into three different categories: General Purpose, Floating Point, and Reserved Usage.

General Purpose Registers (GPRs)

X32V has 16 32-bit GPRs (R0-R15). None of these GPRs have reserved usage. Registers can be loaded with byte (8-bit), half-word (16-bit), or word (32-bit) values.

Byte values are stored in the least significant 8 bits of the register (byte 3), and half-word values are stored in the least significant 16 bits (bytes 2 and 3). Word values use all four bytes.

If a value is a signed integer, the MSB is sign extended throughout the rest of the word. Example: see L_H (Load Half-word Signed). If the half-word loaded is 0x82, the value would be sign extended to 0xFF82 and placed in the destination register.

Instructions such as Move Upper Immediate (MOV_Ui) load a 16-bit immediate value into the upper half-word (byte 0 and byte 1). The 'SET' instructions (SET_LT, SET_LTi, etc.) place the result of their operation in the LSB of the register (bit 0).
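The difference between sign extension and zero padding when a narrow value is placed in a 32-bit register can be sketched as below. The function names are hypothetical, and the example values are chosen for illustration only.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Copy the top bit of a `bits`-wide value through bit 31."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    # XOR then subtract turns a set sign bit into a negative offset;
    # the final mask gives the 32-bit two's complement pattern.
    return ((value ^ sign_bit) - sign_bit) & MASK32


def zero_extend(value, bits):
    """Keep the low `bits` bits and pad the upper bits with zeros."""
    return value & ((1 << bits) - 1)
```

A half-word with its top bit set, such as 0x8002, becomes 0xFFFF8002 under sign extension and stays 0x00008002 under zero padding.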

Floating Point Registers (FPRs)

X32V ISA also has 16 32-bit FPRs. Floating-point values are stored in the following format.

Figure A2 - Floating Point Register (FPR)

    Bits     31    30-23       22-0
    Field    S     Exponent    Fraction
    Width    1     8           23

FPRs can also be used to store double precision numbers (64-bit) by allotting two consecutive registers for each FP number. Registers are then accessed on an even basis (FPR0, FPR2, FPR4, etc.). For more information on double precision operations, see the double precision floating point instructions.
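The field layout of Figure A2 can be illustrated by extracting the three fields from a 32-bit register value. This is a sketch with a hypothetical function name; it only separates the bit fields and makes no claim about how X32V interprets them.

```python
def split_fpr(word):
    """Return (sign, exponent, fraction) from a 32-bit FPR value."""
    word &= 0xFFFFFFFF
    sign = word >> 31                # bit 31
    exponent = (word >> 23) & 0xFF   # bits 30-23
    fraction = word & 0x7FFFFF       # bits 22-0
    return sign, exponent, fraction
```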

Reserved Usage Registers (RURs)

Like many other RISC machines, X32V has several registers reserved for special purposes. The operation of the Program Counter, Instruction Stream Buffer, Link Register, Stack Pointer, and Status Register are outlined below. All RURs are 32-bit registers.

Program Counter: The PC contains the address of the current instruction being fetched from memory. The PC is incremented according to the size of instruction retrieved from memory. Because X32V allows for different modes of operation, the PC can be incremented by 2, 3, or 4 bytes. If the current instruction is a jump or a taken branch, the relative address is calculated assuming the PC has already been incremented.
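The PC update rule can be sketched as follows. The function names are hypothetical, and the example treats the relative offset as an already sign-extended byte count, which is an assumption; the text only states that the offset is applied to the incremented PC.

```python
def next_pc(pc, instr_bytes):
    """Advance the PC past an instruction of 2, 3, or 4 bytes."""
    if instr_bytes not in (2, 3, 4):
        raise ValueError("X32V instructions are 2, 3, or 4 bytes long")
    return (pc + instr_bytes) & 0xFFFFFFFF


def relative_target(pc, instr_bytes, offset):
    """Target of a jump or taken branch, relative to the incremented PC."""
    return (next_pc(pc, instr_bytes) + offset) & 0xFFFFFFFF
```

With these definitions, a 16-bit branch at address 0x100 with an offset of -6 lands at 0xFC, because the offset is applied to 0x102 rather than 0x100.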

Instruction Stream Buffer: This register holds partially fetched instructions for one clock cycle until the remaining portion is fetched.

Link Register: The LR is used to temporarily hold the return address during a J_P or J_PR instruction. Upon return from the procedure call, the return address is loaded into the PC from the link register with the J_LR instruction (Jump Link Register).

Stack Pointer Register: The SPR holds the memory address of the stack.

Status Register: The SR holds key information about the state of the processor. The SR is a 32-bit register. Presently only the first 8 bits are defined.

- Bits 1 and 0 are fixed at zero.

- Bits 3 and 2 contain the processor mode. These bits are checked on a regular basis to determine how instruction fetching and immediate extension is done. Ultra Light is the initial mode. The modes are defined as follows:

    00 = Ultra Light
    01 = Light
    10 = Default
    11 =

- Bit 4, the zero bit, is set when the ALU result is all zeros.

- Bit 5, the sign bit, is always the most significant bit of the ALU result.

- Bit 6 is the overflow bit. This bit is set during signed arithmetic operations that result in an invalid value. See instruction descriptions for more details.

- Bit 7 is the carry bit. This bit is modified by the ALU upon execution of an instruction. The carry bit is set during a carry out or borrow in. If an instruction modifies the carry bit, the condition is outlined in the instruction description.

Figure A3 - Byte 0 of Status Register

    Bit      7       6          5      4      3-2     1    0
    Field    Carry   Overflow   Sign   Zero   Mode    0    0
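The bit assignments of the status register can be illustrated with a small decoder. The function and dictionary names are hypothetical, and because the manual leaves mode value 11 unspecified, the sketch reports it as "undefined".

```python
# Mode encodings taken from the list above; 0b11 is not specified.
MODES = {0b00: "Ultra Light", 0b01: "Light", 0b10: "Default"}


def decode_status(sr):
    """Decode byte 0 of the status register into named fields."""
    return {
        "mode": MODES.get((sr >> 2) & 0b11, "undefined"),
        "zero": bool(sr & (1 << 4)),
        "sign": bool(sr & (1 << 5)),
        "overflow": bool(sr & (1 << 6)),
        "carry": bool(sr & (1 << 7)),
    }
```

For example, the value 0b10011000 decodes to Default mode with the zero and carry bits set.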

Instruction Set

Introduction

This section describes the different modes of operation for X32V, and the instruction formats for each of those modes.

Modes of Operation

X32V supports three different modes of operation.

1. Default Mode: 32-bit instruction length

2. Light Mode: 32-bit/16-bit instruction length

3. Ultra Light Mode: 32-bit/24-bit/16-bit instruction length

All core instructions are available in the default 32-bit mode. However, where possible, 32-bit instructions can be condensed into their 24-bit or 16-bit equivalents in modes 2 and 3. This allows minimal code size with little loss in performance. Each instruction has the different formats listed in its detailed description below.

Instruction Formats

Instruction formats are subdivided into five instruction categories: Load/Store, Immediate, Branch, Register, and Jump/Call. Figures A4 - A6 show all possible instruction formats for 32-bit, 24-bit, and 16-bit instructions. Each row gives the 4-bit type code followed by the remaining fields and their widths in bits.

    Category     Type    Remaining fields (width)
    LD/ST        0000    op1 (4), rd (4), rs0 (4), Displacement (16)
    Immediate    0001    op1 (4), rd (4), rs0 (4), Immediate (16)
    Branch       0010    op1 (4), rd (4), rs0 (4), Label (16)
    Register     0011    op1 (4), rd (4), rs0 (4), rs1 (4), op2 (4), op3 (4), op4 (4)
    Jump/Call    0100    op1 (4), Label/Immediate (24)

Figure A4 - 32-bit Instruction Format

    Category     Type    Remaining fields (width)
    LD/ST        0101    op1 (4), rd (4), rs0 (4), Displacement (8)
    Immediate    0110    op1 (4), rd (4), rs0 (4), Immediate (8)
    Branch       0111    op1 (4), rd (4), rs0 (4), Label (8)
    Register     1000    op1 (4), rd (4), rs0 (4), rs1 (4), op2 (4)
    Jump/Call    1001    op1 (4), Label/Immediate (16)

Figure A5 - 24-bit Instruction Format

    Category                 Type    Remaining fields (width)
    LD/ST, Imm., Branch      None
    Register - Format 1      1010    op1 (4), rd (4), rs0 (4)
    Register - Format 2      1011    op1 (4), rd (4), rs0 (4)
    Jump/Call                1100    op1 (4), Label/Imm. (8)

Figure A6 - 16-bit Instruction Format
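Because the type codes in Figures A4 - A6 do not overlap, the type code alone identifies the instruction length. The sketch below shows one way a fetch unit could use this. It assumes that the type code occupies the upper four bits of the first byte fetched, which the figures suggest but do not state outright, and the function names are hypothetical.

```python
def instruction_length(type_code):
    """Return the instruction length in bytes for a 4-bit type code."""
    if 0b0000 <= type_code <= 0b0100:
        return 4  # 32-bit formats (Figure A4)
    if 0b0101 <= type_code <= 0b1001:
        return 3  # 24-bit formats (Figure A5)
    if 0b1010 <= type_code <= 0b1100:
        return 2  # 16-bit formats (Figure A6)
    raise ValueError("type code not defined in Figures A4 - A6")


def split_stream(code):
    """Split a byte string into individual instructions."""
    instructions = []
    pc = 0
    while pc < len(code):
        # Assumption: the type code is the top nibble of the first byte.
        length = instruction_length(code[pc] >> 4)
        instructions.append(code[pc:pc + length])
        pc += length
    return instructions
```

Applied to a stream that mixes all three sizes, the function yields one slice per instruction, which is also how the PC increment of 2, 3, or 4 bytes would be chosen.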

Load / Store (L/S)

Load and Store instructions operate on two registers and an immediate value. The registers are labeled source (rs0) and destination (rd). The immediate value is labeled as a displacement. All L/S instructions undergo the same operation:

1. The displacement (disp) value is sign extended and added to the contents of the source register.

2. The result is used as an effective address to either retrieve from memory (load) or write to memory (store). In the case of a load, data is written to rd; in the case of a store, data is read from rs1.

In the 32-bit format, the immediate value is 16 bits. In the 24-bit format, the immediate value is 8 bits. The 16-bit format does not support load/store instructions. Table A1 shows the load and store instructions available in 32-bit and 24-bit instruction format.

Table A1 - Load/Store Instructions

    L_W     Load Word
    L_B     Load Byte
    L_BU    Load Unsigned Byte
    L_H     Load Half-word
    L_HU    Load Unsigned Half-word
    S_W     Store Word
    S_B     Store Byte
    S_H     Store Half-word
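The two-step load/store operation described above can be sketched as an effective address calculation. The function names are hypothetical. The optional shift parameter reflects the detailed instruction pages, where word and half-word accesses shift the displacement before the addition; it defaults to zero for byte accesses.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def effective_address(base, disp, disp_bits, shift=0):
    """Sign extend disp (16 or 8 bits wide), scale it, and add it to base."""
    offset = to_signed(disp, disp_bits) << shift
    return (base + offset) & 0xFFFFFFFF
```

For a base of 0x1000, a 16-bit displacement of 0xFFFC gives 0x0FFC, and an 8-bit displacement of 0xFF shifted by two bits gives the same address.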

Immediate (Imm)

Immediate instructions operate on a source register (rs0), a destination register (rd), and an immediate value. Depending upon the instruction, the immediate value is either sign extended (arithmetic operations) or zero padded (logical operations). The immediate value is then combined with the source register contents to produce a result, which is placed in the destination register. Like the load and store instructions, immediate instructions only come in 32-bit and 24-bit formats.

Table A2 - Immediate Instructions

    LG_ANDi     Logical AND-Immediate
    LG_ORi      Logical OR-Immediate
    LG_XORi     Logical Exclusive OR (XOR)-Immediate
    ADDi        Signed Integer Addition-Immediate
    ADD_Ui      Unsigned Integer Addition-Immediate
    SUBi        Signed Integer Subtraction-Immediate
    SUB_Ui      Unsigned Integer Subtraction-Immediate
    SFT_LLi     Shift Left Logical-Immediate
    SFT_RLi     Shift Right Logical-Immediate
    SFT_RAi     Shift Right Arithmetic-Immediate
    SET_LTi     Set on Less Than-Immediate
    SET_LTUi    Unsigned Set on Less Than-Immediate
    MOV_UPi     Move Upper-Immediate

Branch (B)

Branch instructions compare either two source registers (rs0, rs1) or a source register (rs0) and zero. The immediate value is sign extended and added to the pre-incremented PC. If the comparison is true, the branch is taken and the PC is updated with the new value. If the comparison is false, the branch is not taken, and the next instruction is executed. Branches, like Immediate and Load/Store, only come in 32-bit and 24-bit formats.

Table A3 - Branch Instructions

    B_EQ      Branch if Equal
    B_NE      Branch if Not Equal
    B_EQZ     Branch if Equal to Zero
    B_NEZ     Branch if Not Equal to Zero
    B_LTZ     Branch if Less Than Zero
    B_GTZ     Branch if Greater Than Zero
    B_LTEZ    Branch if Less Than or Equal to Zero
    B_GTEZ    Branch if Greater Than or Equal to Zero
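The comparisons in Table A3 can be summarized in a small lookup table. This is a sketch with hypothetical names; it assumes that the compare-with-zero branches interpret the register as a two's complement value, which the table implies but does not state.

```python
def to_signed32(word):
    """Interpret a 32-bit register value as a two's complement integer."""
    word &= 0xFFFFFFFF
    return word - (1 << 32) if word & 0x80000000 else word


# Each entry takes the signed values of rs0 and rs1; the zero-compare
# branches ignore the second operand.
BRANCH_TESTS = {
    "B_EQ": lambda a, b: a == b,
    "B_NE": lambda a, b: a != b,
    "B_EQZ": lambda a, b: a == 0,
    "B_NEZ": lambda a, b: a != 0,
    "B_LTZ": lambda a, b: a < 0,
    "B_GTZ": lambda a, b: a > 0,
    "B_LTEZ": lambda a, b: a <= 0,
    "B_GTEZ": lambda a, b: a >= 0,
}


def branch_taken(mnemonic, rs0, rs1=0):
    """Return True if the named branch would be taken."""
    return BRANCH_TESTS[mnemonic](to_signed32(rs0), to_signed32(rs1))
```

A register holding 0xFFFFFFFF is treated as -1, so B_LTZ is taken for it while B_GTZ is not.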

Register (R)

Register instructions operate on two source registers (rs0, rs1) and place their result in a destination register (rd). Instructions J_AR, J_PR, and J_LR are considered register instructions (as opposed to jump instructions) because of their similarity in instruction format to a register instruction. Table A4 shows the register instructions available in all three formats: 32-bit, 24-bit, and 16-bit.

Jump / Call (J/C)

The Jump and Call instructions are grouped together because of their similarity in format, though they are dissimilar in operation. Table A5 shows these instructions. All of the various Jumps and Calls behave somewhat uniquely. As such, one can gain an understanding by reading the detailed instruction descriptions.

Table A4 - Register Instructions

    LG_AND     Logical AND
    LG_OR      Logical OR
    LG_XOR     Logical Exclusive OR (XOR)
    ADD        Signed Integer Addition
    ADD_U      Unsigned Integer Addition
    SUB        Signed Integer Subtraction
    SUB_U      Unsigned Integer Subtraction
    SFT_LL     Shift Left Logical
    SFT_RL     Shift Right Logical
    SFT_RA     Shift Right Arithmetic
    SET_LT     Set on Less Than
    SET_LTU    Unsigned Set on Less Than
    MUL        Signed Integer Multiplication
    MUL_U      Unsigned Integer Multiplication
    DIV        Signed Integer Division
    DIV_U      Unsigned Integer Division
    J_AR       Absolute Jump Register
    J_PR       Procedural Jump Register
    J_LR       Jump Link Register
    MOV        Move
    MOV_FH     Move from High Register
    MOV_FL     Move from Low Register

Table A5 - Jump/Call Instructions

    J_A        Absolute Jump
    J_P        Procedural Jump
    NOP        No Operation
    SYSCALL    System Call

Instructions

Introduction

The following pages outline each instruction in the X32V ISA. Detailed information about instruction type, format, usage, and encoding are given. Table A6 shows the symbol definitions.

Table A6 - Symbol Definition

    =             Substitute left side of operator with right side
    +             Addition
    -             Subtraction
    *             Multiplication
    /             Division
    ==            Test equality
    !=            Test inequality
    >             Greater Than
    <             Less Than
    &             Bit wise Logical AND
    |             Bit wise Logical OR
    ^             Bit wise Logical XOR
    ||            Join or Concatenate
    <<            Bit wise Shift Left
    >>            Bit wise Shift Right
    rs0           Source register one
    rs1           Source register two
    rd            Destination register
    MEM(0x2a)     Value at main memory address 0x2a
    '0'⁸          Zero extended 8 places
    'imm15'¹⁶     15th bit of immediate value, sign extended 16 places

LG_AND    Logical AND

Description: A bit wise logical AND is performed on the contents of rs0 and rs1. The result of the operation is placed in rd. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   LG_AND rd, rs0, rs1
    16-bit:            LG_AND rd, rs0

Operation:
    32-bit / 24-bit:   rd = rs0 & rs1
    16-bit:            rd = rd & rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      0000   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      0000   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    0000   rd     rs0

LG_OR    Logical OR

Description: A bit wise logical OR is performed on the contents of rs0 and rs1. The result of the operation is placed in rd. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   LG_OR rd, rs0, rs1
    16-bit:            LG_OR rd, rs0

Operation:
    32-bit / 24-bit:   rd = rs0 | rs1
    16-bit:            rd = rd | rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      0001   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      0001   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    0001   rd     rs0

LG_XOR    Logical Exclusive OR (XOR)

Description: A bit wise logical XOR is performed on the contents of rs0 and rs1. The result of the operation is placed in rd. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   LG_XOR rd, rs0, rs1
    16-bit:            LG_XOR rd, rs0

Operation:
    32-bit / 24-bit:   rd = rs0 ^ rs1
    16-bit:            rd = rd ^ rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      0010   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      0010   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    0010   rd     rs0

ADD    Signed Integer Addition

Description: The contents of rs0 and rs1 are added. The result of the operation (in two's complement format) is placed in rd. An overflow exception occurs if the result of the operation is greater than 2³¹ - 1. The carry bit is set on a carry out. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   ADD rd, rs0, rs1
    16-bit:            ADD rd, rs0

Operation:
    32-bit / 24-bit:   rd = rs0 + rs1
    16-bit:            rd = rd + rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      0011   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      0011   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    0011   rd     rs0
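The carry and overflow conditions for a 32-bit addition can be sketched as follows. The function name is hypothetical. The overflow test used here is the usual two's complement rule (both operands share a sign that the result does not), so it also flags results below the most negative value, which goes slightly beyond the wording of the description above.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b):
    """Return (result, carry, overflow) for a 32-bit addition."""
    a &= MASK32
    b &= MASK32
    total = a + b
    result = total & MASK32
    # Carry out: the unsigned sum did not fit in 32 bits.
    carry = total > MASK32
    # Signed overflow: operands agree in sign, result disagrees.
    overflow = bool(~(a ^ b) & (a ^ result) & 0x80000000)
    return result, carry, overflow
```

Adding 1 to 0x7FFFFFFF sets overflow without a carry, while adding 1 to 0xFFFFFFFF produces a carry without overflow. An unsigned add such as ADD_U would use the same result and carry but ignore the overflow flag.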

ADD_U    Unsigned Integer Addition

Description: The contents of rs0 and rs1 are added. The result of the operation is placed in rd. No overflow exception will occur with this instruction. (See the ADD instruction for signed addition.) The carry bit is set on a carry out. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   ADD_U rd, rs0, rs1
    16-bit:            ADD_U rd, rs0

Operation:
    32-bit / 24-bit:   rd = rs0 + rs1
    16-bit:            rd = rd + rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      0100   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      0100   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    0100   rd     rs0

SUB    Signed Integer Subtraction

Description: The contents of rs0 and rs1 are arithmetically subtracted. The result of the operation (in two's complement format) is placed in rd. An overflow exception occurs if the result of the operation is less than -2³¹. The carry bit is also set on a borrow in. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   SUB rd, rs0, rs1
    16-bit:            SUB rd, rs0

Operation:
    32-bit / 24-bit:   rd = rs0 - rs1
    16-bit:            rd = rd - rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      0101   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      0101   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    0101   rd     rs0

SUB_U    Unsigned Integer Subtraction

Description: The contents of rs0 and rs1 are arithmetically subtracted. The result of the operation is placed in rd. No overflow exception will occur with this instruction. (See the SUB instruction for signed subtraction.) The carry bit is set on a borrow in. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   SUB_U rd, rs0, rs1
    16-bit:            SUB_U rd, rs0

Operation:
    32-bit / 24-bit:   rd = rs0 - rs1
    16-bit:            rd = rd - rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      0110   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      0110   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    0110   rd     rs0

SFT_LL    Shift Left Logical

Description: The contents of rs0 are shifted left a variable amount corresponding to the least significant five bits in rs1. Zeros are inserted into the shifted locations, and the result is stored in rd. The last bit shifted out is stored as the carry bit. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   SFT_LL rd, rs0, rs1
    16-bit:            SFT_LL rd, rs0

Operation:
    32-bit / 24-bit:   rd = rs0 << rs1
    16-bit:            rd = rd << rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      0111   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      0111   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    0111   rd     rs0

SFT_RL    Shift Right Logical

Description: The contents of rs0 are shifted right a variable amount corresponding to the least significant five bits in rs1. Zeros are inserted into the shifted locations, and the result is stored in rd. The last bit shifted out is stored as the carry bit. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   SFT_RL rd, rs0, rs1
    16-bit:            SFT_RL rd, rs0

Operation:
    32-bit / 24-bit:   rd = rs0 >> rs1
    16-bit:            rd = rd >> rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      1000   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      1000   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    1000   rd     rs0
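The logical shifts and their carry behavior can be sketched as below. The function names are hypothetical. The manual does not say what happens to the carry bit when the shift amount is zero, so the sketch returns None for the carry in that case to signal that no bit was shifted out; this is an assumption.

```python
MASK32 = 0xFFFFFFFF


def sft_ll(value, amount):
    """Shift left logical; returns (result, carry)."""
    value &= MASK32
    n = amount & 0x1F  # only the least significant five bits are used
    if n == 0:
        return value, None
    carry = (value >> (32 - n)) & 1  # last bit shifted out of bit 31
    return (value << n) & MASK32, carry


def sft_rl(value, amount):
    """Shift right logical; returns (result, carry)."""
    value &= MASK32
    n = amount & 0x1F
    if n == 0:
        return value, None
    carry = (value >> (n - 1)) & 1   # last bit shifted out of bit 0
    return value >> n, carry
```

Because only five bits of the shift amount are used, a requested shift of 33 behaves like a shift of 1.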

SFT_RA    Shift Right Arithmetic

Description: The contents of rs0 are shifted right a variable amount corresponding to the least significant five bits in rs1. The most significant bits are sign extended, and the result is placed in rd. The last bit shifted out is stored as the carry bit. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   SFT_RA rd, rs0, rs1
    16-bit:            SFT_RA rd, rs0

Operation:
    32-bit / 24-bit:   rd = rs0 >> rs1
    16-bit:            rd = rd >> rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      1001   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      1001   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    1001   rd     rs0

SET_LT    Set on Less Than

Description: The contents of rs0 and rs1 are compared as two's complement integers. If rs0 is less than rs1, then rd is set to '1'. Otherwise, rd is set to '0'. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   SET_LT rd, rs0, rs1
    16-bit:            SET_LT rd, rs0

Operation:
    32-bit / 24-bit:   if (rs0 < rs1) rd = 1 else rd = 0
    16-bit:            if (rd < rs0) rd = 1 else rd = 0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      1010   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      1010   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    1010   rd     rs0

SET_LTU    Unsigned Set on Less Than

Description: The contents of rs0 and rs1 are compared as positive integers. If rs0 is less than rs1, then rd is set to '1'. Otherwise, rd is set to '0'. When using a 16-bit format, the contents of rd are used as the first source register.

Type: Register

Format:
    32-bit / 24-bit:   SET_LTU rd, rs0, rs1
    16-bit:            SET_LTU rd, rs0

Operation:
    32-bit / 24-bit:   if (rs0 < rs1) rd = 1 else rd = 0
    16-bit:            if (rd < rs0) rd = 1 else rd = 0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      1011   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      1011   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    1011   rd     rs0

MUL    Signed Integer Multiplication

Description: The contents of rs0 are multiplied by the contents of rs1 using two's complement format. The least significant word of the result is placed in register LOW, and the most significant word is placed in register HIGH.

Type: Register

Format:
    32-bit / 24-bit / 16-bit:   MUL rs0, rs1

Operation:
    32-bit / 24-bit / 16-bit:   HIGH = (rs0 * rs1)[63..32]
                                LOW  = (rs0 * rs1)[31..0]

Encoding:
    32-bit   31-28  27-24  23-20   19-16  15-12  11-8    7-4     3-0
             R      1100   unused  rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12   11-8   7-4    3-0
             R      1100   unused  rs0    rs1    0000

    16-bit   15-12  11-8   7-4     3-0
             R-1    1100   rs1     rs0
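The split of a signed product between the HIGH and LOW registers can be sketched as follows. The function names are hypothetical, and the sketch assumes that the full product of two 32-bit operands is 64 bits wide, with the most significant word going to HIGH as the description states.

```python
def to_signed32(word):
    """Interpret a 32-bit register value as a two's complement integer."""
    word &= 0xFFFFFFFF
    return word - (1 << 32) if word & 0x80000000 else word


def mul_signed(rs0, rs1):
    """Return (HIGH, LOW) for a signed 32 x 32 bit multiplication."""
    product = to_signed32(rs0) * to_signed32(rs1)
    product &= 0xFFFFFFFFFFFFFFFF  # 64-bit two's complement pattern
    high = product >> 32           # most significant word
    low = product & 0xFFFFFFFF     # least significant word
    return high, low
```

Multiplying -1 (0xFFFFFFFF) by 2 gives -2, so both HIGH and LOW carry the sign: HIGH is 0xFFFFFFFF and LOW is 0xFFFFFFFE. MOV_FH and MOV_FL would then copy these values into general purpose registers.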

MUL_U    Unsigned Integer Multiplication

Description: The contents of rs0 are multiplied by the contents of rs1 using positive values. The least significant word of the result is placed in register LOW, and the most significant word is placed in register HIGH.

Type: Register

Format:
    32-bit / 24-bit / 16-bit:   MUL_U rs0, rs1

Operation:
    32-bit / 24-bit / 16-bit:   HIGH = (rs0 * rs1)[63..32]
                                LOW  = (rs0 * rs1)[31..0]

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12  11-8    7-4     3-0
             R      1101   rd     rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4    3-0
             R      1101   rd     rs0    rs1    0000

    16-bit   15-12  11-8   7-4    3-0
             R-1    1101   rs1    rs0

DIV    Signed Integer Division

Description: The contents of rs0 are divided by the contents of rs1 using two's complement format. The quotient is placed in register HIGH, while the remainder is placed in register LOW.

Type: Register

Format:
    32-bit / 24-bit / 16-bit:   DIV rs0, rs1

Operation:
    32-bit / 24-bit / 16-bit:   HIGH = (rs0 / rs1) quotient
                                LOW  = (rs0 / rs1) remainder

Encoding:
    32-bit   31-28  27-24  23-20   19-16  15-12  11-8    7-4     3-0
             R      1110   unused  rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12   11-8   7-4    3-0
             R      1110   unused  rs0    rs1    0000

    16-bit   15-12  11-8   7-4     3-0
             R-1    1110   rs1     rs0

DIV_U    Unsigned Integer Division

Description: The contents of rs0 are divided by the contents of rs1 using an unsigned format. The quotient is placed in register HIGH, while the remainder is placed in register LOW.

Type: Register

Format:
    32-bit / 24-bit / 16-bit:   DIV_U rs0, rs1

Operation:
    32-bit / 24-bit / 16-bit:   HIGH = (rs0 / rs1) quotient
                                LOW  = (rs0 / rs1) remainder

Encoding:
    32-bit   31-28  27-24  23-20   19-16  15-12  11-8    7-4     3-0
             R      1111   unused  rs0    rs1    0000    unused  unused

    24-bit   23-20  19-16  15-12   11-8   7-4    3-0
             R      1111   unused  rs0    rs1    0000

    16-bit   15-12  11-8   7-4     3-0
             R-1    1111   rs1     rs0

MOV    Move

Description: The contents of rs0 are copied into rd. The contents of rs0 do not change.

Type: Register

Format:
    32-bit / 24-bit / 16-bit:   MOV rd, rs0

Operation:
    32-bit / 24-bit / 16-bit:   rd = rs0

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-12   11-8    7-4     3-0
             R      0011   rd     rs0    unused  0001    unused  unused

    24-bit   23-20  19-16  15-12  11-8   7-4     3-0
             R      0011   rd     rs0    unused  0001

    16-bit   15-12  11-8   7-4    3-0
             R-2    0011   rd     rs0

MOV_FH    Move From HIGH Register

Description: The contents of the HIGH register are copied into rd. The contents of the HIGH register do not change.

Type: Register

Format:
    32-bit / 24-bit / 16-bit:   MOV_FH rd

Operation:
    32-bit / 24-bit / 16-bit:   rd = HIGH

Encoding:
    32-bit   31-28  27-24  23-20  19-16   15-12   11-8    7-4     3-0
             R      0011   rd     unused  unused  0001    unused  unused

    24-bit   23-20  19-16  15-12  11-8    7-4     3-0
             R      0011   rd     unused  unused  0001

    16-bit   15-12  11-8   7-4    3-0
             R-2    0011   rd     unused

MOV_FL    Move From LOW Register

Description: The contents of the LOW register are copied into rd. The contents of the LOW register do not change.

Type: Register

Format:
    32-bit / 24-bit / 16-bit:   MOV_FL rd

Operation:
    32-bit / 24-bit / 16-bit:   rd = LOW

Encoding:
    32-bit   31-28  27-24  23-20  19-16   15-12   11-8    7-4     3-0
             R      0011   rd     unused  unused  0001    unused  unused

    24-bit   23-20  19-16  15-12  11-8    7-4     3-0
             R      0011   rd     unused  unused  0001

    16-bit   15-12  11-8   7-4    3-0
             R-2    0011   rd     unused

L_W    Load Word

Description: The displacement value is sign extended and shifted two bits to the left. This is then added to the contents of rs0 to generate a 32-bit unsigned effective address. The word in memory at this address is copied into rd. This address in memory must be word aligned.

Type: Load / Store

Format:
    32-bit / 24-bit:   L_W rd, disp(rs0)

Operation:
    32-bit:   rd = MEMword(rs0 + (('disp15'¹⁶ || disp) << 2))
    24-bit:   rd = MEMword(rs0 + (('disp7'²⁴ || disp) << 2))

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-0
             L/S    0000   rd     rs0    displacement

    24-bit   23-20  19-16  15-12  11-8   7-0
             L/S    0000   rd     rs0    displacement

L_B    Load Byte

Description: The contents of rs0 are added to the sign extended immediate displacement value to generate a 32-bit unsigned effective address. The byte in memory at this address is sign extended and copied into rd. Notice that no shifting occurs for this address calculation because it accesses memory on a byte level.

Type: Load / Store

Format:
    32-bit / 24-bit:   L_B rd, disp(rs0)

Operation:
    32-bit:   rd = MEMbyte(rs0 + ('disp15'¹⁶ || disp))
    24-bit:   rd = MEMbyte(rs0 + ('disp7'²⁴ || disp))

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-0
             L/S    0001   rd     rs0    displacement

    24-bit   23-20  19-16  15-12  11-8   7-0
             L/S    0001   rd     rs0    displacement

L_BU    Load Unsigned Byte

Description: The contents of rs0 are added to the sign extended immediate displacement value to generate a 32-bit unsigned effective address. The byte in memory at this address is copied into the least significant byte of rd. Bytes 0-2 of rd are padded with zeros. Notice that no shifting occurs for this address calculation because it accesses memory on a byte level.

Type: Load / Store

Format:
    32-bit / 24-bit:   L_BU rd, disp(rs0)

Operation:
    32-bit:   rd = MEMbyte(rs0 + ('disp15'¹⁶ || disp))
    24-bit:   rd = MEMbyte(rs0 + ('disp7'²⁴ || disp))

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-0
             L/S    0010   rd     rs0    displacement

    24-bit   23-20  19-16  15-12  11-8   7-0
             L/S    0010   rd     rs0    displacement

L_H    Load Half-word

Description: The immediate value is sign extended and shifted left 1 bit. The result is then added to the contents of rs0 to generate a 32-bit unsigned effective address. The half-word in memory at this address is sign extended and copied into rd. The immediate value is shifted because the instruction accesses memory in half-word increments.

Type: Load / Store

Format:
    32-bit / 24-bit:   L_H rd, disp(rs0)

Operation:
    32-bit:   rd = MEMhalf(rs0 + (('disp15'¹⁶ || disp) << 1))
    24-bit:   rd = MEMhalf(rs0 + (('disp7'²⁴ || disp) << 1))

Encoding:
    32-bit   31-28  27-24  23-20  19-16  15-0
             L/S    0011   rd     rs0    displacement

    24-bit   23-20  19-16  15-12  11-8   7-0
             L/S    0011   rd     rs0    displacement

L_HU    Load Unsigned Half-word

Description: The immediate value is sign extended and shifted left 1 bit. The result is then added to the contents of rs0 to generate a 32-bit unsigned effective address. The half-word in memory at this address is copied into rd with the upper two bytes padded with zeros. The immediate value is shifted because the instruction accesses memory in half-word increments.

Type: Load / Store

Format: 32-bit / 24-bit

L_HU rd, disp(rs0)

Operation:

32-bit

rd = MEMhalf(rs0 + (('disp15'^16 || disp) << 1))

24-bit

rd = MEMhalf(rs0 + (('disp7'^24 || disp) << 1))

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0100  rd  rs0  displacement  L/S
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0100  rd  rs0  displacement  L/S

4 4 4 4 8

S_W    Store Word

Description: The immediate value is sign extended and shifted 2 bits to the left. The result is added to the contents of rs0 to generate a 32-bit unsigned effective address. The word in rs1 is then copied to the memory address.

Type: Load / Store

Format: 32-bit / 24-bit

S_W disp(rs0), rs1

Operation:

32-bit

MEMword(rs0 + (('disp15'^16 || disp) << 2)) = rs1

24-bit

MEMword(rs0 + (('disp7'^24 || disp) << 2)) = rs1

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0101  rs1  rs0  displacement  L/S
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0101  rs1  rs0  displacement  L/S

4 4 4 4 8

S_B    Store Byte

Description: The contents of rs0 are added to the sign extended immediate displacement value to generate a 32-bit unsigned effective address. The lowest byte (byte 3) in rd is then copied to the memory address.

Type: Load / Store

Format: 32-bit / 24-bit

S_B disp(rs0), rd

Operation:

32-bit

MEMbyte(rs0 + ('disp15'^16 || disp)) = rd

24-bit

MEMbyte(rs0 + ('disp7'^24 || disp)) = rd

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0110  rd  rs0  displacement  L/S
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0110  rd  rs0  displacement  L/S
4 4 4 4 8
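A byte-addressed memory that supports the byte, half-word, and word accesses in this appendix could be modeled as below. This is only a sketch: the class name `Memory` and its methods are hypothetical, unwritten locations are assumed to read as zero, and big-endian byte order is an assumption (the manual numbers byte 3 as the lowest byte of a register but does not state the memory byte order in this section).

```python
MASK32 = 0xFFFFFFFF

class Memory:
    """Sparse byte-addressed memory (hypothetical model)."""

    def __init__(self):
        self.cells = {}

    def store(self, addr, value, size):
        """Store the low `size` bytes of `value`, most significant byte first."""
        for i in range(size):
            shift = 8 * (size - 1 - i)  # big-endian order is an assumption
            self.cells[(addr + i) & MASK32] = (value >> shift) & 0xFF

    def load(self, addr, size, signed=False):
        """Load `size` bytes and zero- or sign-extend the result to 32 bits."""
        value = 0
        for i in range(size):
            value = (value << 8) | self.cells.get((addr + i) & MASK32, 0)
        if signed and value & (1 << (8 * size - 1)):
            value -= 1 << (8 * size)
        return value & MASK32

mem = Memory()
mem.store(0x100, 0x12345680, 1)      # store-byte behavior: only 0x80 is written
mem.store(0x200, 0xABCD8001, 2)      # store-half behavior: only 0x8001 is written
unsigned_byte = mem.load(0x100, 1)               # zero-padded, as in L_BU
signed_half = mem.load(0x200, 2, signed=True)    # sign-extended, as in L_H
unsigned_half = mem.load(0x200, 2)               # zero-padded, as in L_HU
```

The only difference between the signed and unsigned loads is how the upper bytes of the destination register are filled.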

S_H    Store Half-word

Description: The immediate value is sign extended and shifted 1 bit to the left. The result is added to the contents of rs0 to generate a 32-bit unsigned effective address. The lower half-word (byte 2 - byte 3) in rd is then copied to the memory address.

Type: Load / Store

Format: 32-bit / 24-bit

S_H disp(rs0), rd

Operation:

32-bit

MEMhalf(rs0 + (('disp15'^16 || disp) << 1)) = rd

24-bit

MEMhalf(rs0 + (('disp7'^24 || disp) << 1)) = rd

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0111  rd  rs0  displacement  L/S
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0111  rd  rs0  displacement  L/S

4 4 4 4 8

LG_ANDi    Logical AND - Immediate

Description: The immediate value is zero extended and a bitwise logical AND is performed on the contents of rs0 and the extended immediate. The result of the operation is placed in rd.

Type: Immediate

Format: 32-bit / 24-bit

LG_ANDi rd, rs0, imm

Operation:

32-bit

rd = rs0 & ('0'^16 || imm)

24-bit

rd = rs0 & ('0'^24 || imm)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0000  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0000  rd  rs0  immediate  Imm

4 4 4 4 8

LG_ORi    Logical OR - Immediate

Description: The immediate value is zero extended and a bitwise logical OR is performed on the contents of rs0 and the extended immediate. The result of the operation is placed in rd.

Type: Immediate

Format: 32-bit / 24-bit

LG_ORi rd, rs0, imm

Operation:

32-bit

rd = rs0 | ('0'^16 || imm)

24-bit

rd = rs0 | ('0'^24 || imm)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0001  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0001  rd  rs0  immediate  Imm
4 4 4 4 8
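The effect of zero extension on the logical immediate instructions can be illustrated with a short sketch. The function names mirror the mnemonics but are hypothetical Python helpers, not part of the simulator; `imm_bits` is 16 for the 32-bit format and 8 for the 24-bit format.

```python
MASK32 = 0xFFFFFFFF

def zero_extend(imm, imm_bits):
    """Keep the low `imm_bits` bits; all upper bits become zero."""
    return imm & ((1 << imm_bits) - 1)

def lg_andi(rs0, imm, imm_bits):
    return rs0 & zero_extend(imm, imm_bits) & MASK32

def lg_ori(rs0, imm, imm_bits):
    return (rs0 | zero_extend(imm, imm_bits)) & MASK32

# Because the immediate is zero extended, AND always clears the upper
# register bits, while OR always leaves them untouched.
and_result = lg_andi(0xFFFF00F0, 0xFF, 8)
or_result = lg_ori(0xFFFF00F0, 0xFF, 8)
print(hex(and_result), hex(or_result))  # 0xf0 0xffff00ff
```

One consequence is that an AND with an immediate cannot preserve the upper bits of rs0; a mask covering them has to be built in a register first.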

LG_XORi    Logical Exclusive OR - Immediate

Description: The immediate value is zero extended and a bitwise logical XOR is performed on the contents of rs0 and the extended immediate. The result of the operation is placed in rd.

Type: Immediate

Format: 32-bit / 24-bit

LG_XORi rd, rs0, imm

Operation:

32-bit

rd = rs0 XOR ('0'^16 || imm)

24-bit

rd = rs0 XOR ('0'^24 || imm)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0010  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0010  rd  rs0  immediate  Imm

4 4 4 4 8

ADDi    Signed Integer Addition - Immediate

Description: The immediate value is sign extended and added to rs0. The result of the operation (in two's complement format) is placed in rd. An overflow exception occurs if the result of the operation is greater than 2^32 - 1.

Type: Immediate

Format: 32-bit / 24-bit

ADDi rd, rs0, imm

Operation:

32-bit

rd = rs0 + ('imm15'^16 || imm)

24-bit

rd = rs0 + ('imm7'^24 || imm)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0011  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0011  rd  rs0  immediate  Imm
4 4 4 4 8

ADD_Ui    Unsigned Integer Addition - Immediate

Description: The immediate value is padded with zeros and added to rs0. The result of the operation is placed in rd. No overflow exception will occur with this instruction (see ADDi).

Type: Immediate

Format: 32-bit / 24-bit

ADD_Ui rd, rs0, imm

Operation:

32-bit

rd = rs0 + ('0'^16 || imm)

24-bit

rd = rs0 + ('0'^24 || imm)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0100  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0100  rd  rs0  immediate  Imm
4 4 4 4 8
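The difference between the signed and unsigned immediate additions can be sketched as follows. The names `addi`, `add_ui`, and `OverflowException` are hypothetical Python stand-ins. The sketch assumes the conventional two's complement rule, in which overflow means the true sum does not fit in a signed 32-bit value; the manual's own wording of the condition should take precedence if it differs.

```python
MASK32 = 0xFFFFFFFF

class OverflowException(Exception):
    """Hypothetical stand-in for the processor's overflow exception."""

def sign_extend(value, bits):
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def addi(rs0, imm, imm_bits):
    """Signed add: sign-extend the immediate and trap on signed overflow (assumed rule)."""
    total = sign_extend(rs0, 32) + sign_extend(imm, imm_bits)
    if not -(1 << 31) <= total <= (1 << 31) - 1:
        raise OverflowException("signed result does not fit in 32 bits")
    return total & MASK32

def add_ui(rs0, imm, imm_bits):
    """Unsigned add: zero-extend the immediate, wrap silently, never trap."""
    return (rs0 + (imm & ((1 << imm_bits) - 1))) & MASK32

print(addi(5, 0xFF, 8))      # 0xFF is -1 in the 24-bit format -> 4
print(add_ui(5, 0xFF, 8))    # 0xFF is 255 when zero extended -> 260
```

The same 8-bit pattern therefore yields different results depending on which instruction carries it.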

SUBi    Signed Integer Subtraction - Immediate

Description: The immediate value is sign extended and subtracted from rs0. The result of the operation (in two's complement format) is placed in rd. An overflow exception occurs if the result of the operation is greater than 2^32 - 1.

Type: Immediate

Format: 32-bit / 24-bit

SUBi rd, rs0, imm

Operation:

32-bit

rd = rs0 - ('imm15'^16 || imm)

24-bit

rd = rs0 - ('imm7'^24 || imm)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0101  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0101  rd  rs0  immediate  Imm
4 4 4 4 8

SUB_Ui    Unsigned Integer Subtraction - Immediate

Description: The immediate value is padded with zeros and subtracted from rs0. The result of the operation is placed in rd. No overflow exception will occur with this instruction (see SUBi).

Type: Immediate

Format: 32-bit / 24-bit

SUB_Ui rd, rs0, imm

Operation:

32-bit

rd = rs0 - ('0'^16 || imm)

24-bit

rd = rs0 - ('0'^24 || imm)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0110  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0110  rd  rs0  immediate  Imm
4 4 4 4 8

SFT_LLi    Shift Left Logical - Immediate

Description: The contents of rs0 are shifted left by the amount given in the least significant five bits of the immediate value. Zeros are inserted into the vacated locations (least significant bits), and the result is stored in rd. Furthermore, the last bit shifted out is stored as the carry bit.

Type: Immediate

Format: 32-bit / 24-bit

SFT_LLi rd, rs0, imm

Operation:

32-bit / 24-bit

rd = rs0 << imm[4:0]

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0111  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0111  rd  rs0  immediate  Imm
4 4 4 4 8

SFT_RLi    Shift Right Logical - Immediate

Description: The contents of rs0 are shifted right by the amount given in the least significant five bits of the immediate value. Zeros are inserted into the vacated locations (most significant bits), and the result is stored in rd. Furthermore, the last bit shifted out is stored as the carry bit.

Type: Immediate

Format: 32-bit / 24-bit

SFT_RLi rd, rs0, imm

Operation:

32-bit / 24-bit

rd = rs0 >> imm[4:0]

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  1000  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  1000  rd  rs0  immediate  Imm

4 4 4 4 8

SFT_RAi    Shift Right Arithmetic - Immediate

Description: The contents of rs0 are shifted right by the amount given in the least significant five bits of the immediate value. The vacated most significant bits are filled with copies of the sign bit of rs0 (rather than padded with zeros, see SFT_RLi), and the result is placed in rd. Furthermore, the last bit shifted out is stored as the carry bit.

Type: Immediate

Format: 32-bit / 24-bit

SFT_RAi rd, rs0, imm

Operation:

32-bit / 24-bit

rd = rs0 >> imm[4:0]

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  1001  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  1001  rd  rs0  immediate  Imm
4 4 4 4 8
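The three immediate shifts differ only in which bits fill the vacated positions; each reports the last bit shifted out as the carry. The sketch below uses hypothetical Python functions named after the mnemonics. Returning `None` as the carry for a shift amount of zero is an assumption, since the manual does not say what happens to the carry bit in that case.

```python
MASK32 = 0xFFFFFFFF

def sft_lli(rs0, imm):
    """Shift left logical; returns (result, carry)."""
    n = imm & 0x1F                      # only the low five bits are used
    if n == 0:
        return rs0 & MASK32, None       # carry behavior for n == 0 is assumed
    carry = (rs0 >> (32 - n)) & 1       # last bit shifted out of the top
    return (rs0 << n) & MASK32, carry

def sft_rli(rs0, imm):
    """Shift right logical; zeros enter from the top."""
    n = imm & 0x1F
    if n == 0:
        return rs0 & MASK32, None
    carry = (rs0 >> (n - 1)) & 1        # last bit shifted out of the bottom
    return (rs0 & MASK32) >> n, carry

def sft_rai(rs0, imm):
    """Shift right arithmetic; copies of the sign bit enter from the top."""
    n = imm & 0x1F
    if n == 0:
        return rs0 & MASK32, None
    carry = (rs0 >> (n - 1)) & 1
    signed = rs0 - (1 << 32) if rs0 & 0x80000000 else rs0
    return (signed >> n) & MASK32, carry

logical = sft_rli(0x80000003, 1)
arithmetic = sft_rai(0x80000003, 1)
print(hex(logical[0]), hex(arithmetic[0]))  # 0x40000001 0xc0000001
```

For a negative operand the logical shift produces a large positive value, while the arithmetic shift keeps the sign, which is what signed division by a power of two needs.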

SET_LTi    Set on Less Than - Immediate

Description: The contents of rs0 are compared to the sign extended immediate value as two's complement integers. If rs0 is less than the immediate value, then rd is set to '1'. Otherwise, rd is set to '0'.

Type: Immediate

Format: 32-bit / 24-bit

SET_LTi rd, rs0, imm

Operation:

32-bit

if (rs0 < ('imm15'^16 || imm))
    rd = 1
else
    rd = 0

24-bit

if (rs0 < ('imm7'^24 || imm))
    rd = 1
else
    rd = 0

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  1010  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  1010  rd  rs0  immediate  Imm
4 4 4 4 8

SET_LTUi    Unsigned Set on Less Than - Immediate

Description: The contents of rs0 are compared to the immediate value (MSBs padded with zeros) as positive integers. If rs0 is less than the immediate value, then rd is set to '1'. Otherwise, rd is set to '0'.

Type: Immediate

Format: 32-bit / 24-bit

SET_LTUi rd, rs0, imm

Operation:

32-bit

if (rs0 < ('0'^16 || imm))
    rd = 1
else
    rd = 0

24-bit

if (rs0 < ('0'^24 || imm))
    rd = 1
else
    rd = 0

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  1011  rd  rs0  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  1011  rd  rs0  immediate  Imm
4 4 4 4 8
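The signed and unsigned set-on-less-than instructions can give opposite answers for the same bit patterns. A minimal sketch, using hypothetical Python functions named after the mnemonics:

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def set_lti(rs0, imm, imm_bits):
    """Signed compare: both operands are read as two's complement integers."""
    return 1 if sign_extend(rs0, 32) < sign_extend(imm, imm_bits) else 0

def set_ltui(rs0, imm, imm_bits):
    """Unsigned compare: the immediate is zero padded, rs0 is read as positive."""
    return 1 if (rs0 & MASK32) < (imm & ((1 << imm_bits) - 1)) else 0

# 0xFFFFFFFF is -1 when signed but the largest value when unsigned.
print(set_lti(0xFFFFFFFF, 1, 8), set_ltui(0xFFFFFFFF, 1, 8))  # 1 0
```

The 8-bit immediate 0xFF shows the same effect from the other side: it compares as -1 under SET_LTi and as 255 under SET_LTUi.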

MOV_UPi    Move Upper - Immediate

Description: The immediate value is stored in rd with the least significant 16 bits padded with zeros. If a 24-bit format is used, the least significant 16 bits are padded with zeros and the most significant 8 bits are sign extended.

Type: Immediate

Format: 32-bit / 24-bit

MOV_UPi rd, imm

Operation:

32-bit

rd = imm || '0'^16

24-bit

rd = 'imm7'^8 || imm || '0'^16

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  1100  rd  unused  immediate  Imm
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  1100  rd  unused  immediate  Imm

4 4 4 4 8

B_EQ    Branch if Equal

Description: The contents of rs0 and rs1 are compared. If they are equal, the sign extended immediate value is added to the PC, and the next instruction is fetched from the newly calculated address. If they are not equal, the next instruction is fetched.

Type: Branch

Format: 32-bit / 24-bit

B_EQ rs1, rs0, label

Operation:

32-bit

if (rs0 == rs1)
    PC = PC + ('label15'^16 || label)

24-bit

if (rs0 == rs1)
    PC = PC + ('label7'^24 || label)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0000  rs1  rs0  label  B
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0000  rs1  rs0  label  B
4 4 4 4 8

B_NE    Branch if Not Equal

Description: The contents of rs0 and rs1 are compared. If they are not equal, the sign extended immediate value is added to the PC, and the next instruction is fetched from the newly calculated address. If they are equal, the next instruction is fetched.

Type: Branch

Format: 32-bit / 24-bit

B_NE rs1, rs0, label

Operation:

32-bit

if (rs0 != rs1)
    PC = PC + ('label15'^16 || label)

24-bit

if (rs0 != rs1)
    PC = PC + ('label7'^24 || label)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0001  rs1  rs0  label  B
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0001  rs1  rs0  label  B
4 4 4 4 8

B_EQZ    Branch if Equal to Zero

Description: The contents of rs0 are compared to zero. If they are equal, the sign extended immediate value is added to the PC, and the next instruction is fetched from the newly calculated address. If they are not equal, the next instruction is fetched.

Type: Branch

Format: 32-bit / 24-bit

B_EQZ rs0, label

Operation:

32-bit

if (rs0 == '0'^32)
    PC = PC + ('label15'^16 || label)

24-bit

if (rs0 == '0'^32)
    PC = PC + ('label7'^24 || label)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0010  unused  rs0  label  B
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0010  unused  rs0  label  B
4 4 4 4 8

B_NEZ    Branch if Not Equal to Zero

Description: The contents of rs0 are compared to zero. If they are not equal, the sign extended immediate value is added to the PC, and the next instruction is fetched from the newly calculated address. If they are equal, the next instruction is fetched.

Type: Branch

Format: 32-bit / 24-bit

B_NEZ rs0, label

Operation:

32-bit

if (rs0 != '0'^32)
    PC = PC + ('label15'^16 || label)

24-bit

if (rs0 != '0'^32)
    PC = PC + ('label7'^24 || label)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0011  unused  rs0  label  B
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0011  unused  rs0  label  B
4 4 4 4 8

B_LTZ    Branch if Less Than Zero

Description: The contents of rs0 are compared to zero. If rs0 is less than zero, the sign extended immediate value is added to the PC, and the next instruction is fetched from the newly calculated address. If rs0 is greater than or equal to zero, the next instruction is fetched.

Type: Branch

Format: 32-bit / 24-bit

B_LTZ rs0, label

Operation:

32-bit

if (rs0 < '0'^32)
    PC = PC + ('label15'^16 || label)

24-bit

if (rs0 < '0'^32)
    PC = PC + ('label7'^24 || label)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0100  unused  rs0  label  B
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0100  unused  rs0  label  B
4 4 4 4 8

B_GTZ    Branch if Greater Than Zero

Description: The contents of rs0 are compared to zero. If rs0 is greater than zero, the sign extended immediate value is added to the PC, and the next instruction is fetched from the newly calculated address. If rs0 is less than or equal to zero, the next instruction is fetched.

Type: Branch

Format: 32-bit / 24-bit

B_GTZ rs0, label

Operation:

32-bit

if (rs0 > '0'^32)
    PC = PC + ('label15'^16 || label)

24-bit

if (rs0 > '0'^32)
    PC = PC + ('label7'^24 || label)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0101  unused  rs0  label  B
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0101  unused  rs0  label  B
4 4 4 4 8

B_LTEZ    Branch if Less Than or Equal to Zero

Description: The contents of rs0 are compared to zero. If rs0 is less than or equal to zero, the sign extended immediate value is added to the PC, and the next instruction is fetched from the newly calculated address. If rs0 is greater than zero, the next instruction is fetched.

Type: Branch

Format: 32-bit / 24-bit

B_LTEZ rs0, label

Operation:

32-bit

if (rs0 <= '0'^32)
    PC = PC + 4 + ('label15'^16 || label)

24-bit

if (rs0 <= '0'^32)
    PC = PC + ('label7'^24 || label)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0110  unused  rs0  label  B
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0110  unused  rs0  label  B
4 4 4 4 8

B_GTEZ    Branch if Greater Than or Equal to Zero

Description: The contents of rs0 are compared to zero. If rs0 is greater than or equal to zero, the sign extended immediate value is added to the PC, and the next instruction is fetched from the newly calculated address. If rs0 is less than zero, the next instruction is fetched.

Type: Branch

Format: 32-bit / 24-bit

B_GTEZ rs0, label

Operation:

32-bit

if (rs0 >= '0'^32)
    PC = PC + ('label15'^16 || label)

24-bit

if (rs0 >= '0'^32)
    PC = PC + ('label7'^24 || label)

Encoding:

31 28 27 24 23 20 19 16 15 0
32-bit  0111  unused  rs0  label  B
4 4 4 4 16

23 20 19 16 15 12 11 8 7 0
24-bit  0111  unused  rs0  label  B
4 4 4 4 8
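The compare-with-zero branches share one pattern: test rs0 as a signed value, then either add the sign extended label to the PC or fall through. The sketch below is illustrative only. The function `branch_zero` and its table are hypothetical, the PC is assumed to be a byte address, and the fall-through address is assumed to be the PC plus the instruction length in bytes (4 for the 32-bit format, 3 for the 24-bit format); none of these details are stated in the instruction descriptions.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

# Condition applied to rs0 interpreted as a signed 32-bit integer.
CONDITIONS = {
    "B_EQZ": lambda v: v == 0,
    "B_NEZ": lambda v: v != 0,
    "B_LTZ": lambda v: v < 0,
    "B_GTZ": lambda v: v > 0,
    "B_LTEZ": lambda v: v <= 0,
    "B_GTEZ": lambda v: v >= 0,
}

def branch_zero(op, pc, rs0, label, label_bits, length_bytes):
    """Return the next PC for a compare-with-zero branch."""
    if CONDITIONS[op](sign_extend(rs0, 32)):
        return (pc + sign_extend(label, label_bits)) & MASK32   # taken
    return (pc + length_bytes) & MASK32                         # assumed fall-through

taken = branch_zero("B_LTZ", 0x100, 0xFFFFFFFF, 0xF0, 8, 3)   # rs0 = -1, label = -16
not_taken = branch_zero("B_GTZ", 0x100, 0, 0x10, 16, 4)
print(hex(taken), hex(not_taken))  # 0xf0 0x104
```

With an 8-bit label the 24-bit format can only reach nearby targets, so longer branches need the 32-bit format.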

J_A    Absolute Jump

Description: The immediate value is sign extended and added to the PC. This target address is then unconditionally placed in the PC.

Type: Jump / Call

Format: 32-bit / 24-bit

J_A label

Operation:

32-bit

PC = PC + ('label23'^8 || label)

24-bit

PC = PC + ('label15'^16 || label)

Encoding:

31 28 27 24 23 0
32-bit  0000  label  J/C
4 4 24

23 20 19 16 15 0
24-bit  0000  label  J/C
4 4 16

15 12 11 8 7 0
16-bit  0000  label  J/C
4 4 8

J_P    Procedural Jump

Description: The immediate value is sign extended and added to the pre-incremented PC. This target address is then unconditionally placed in the PC. The address of the next instruction is placed in the link register (LR).

Type: Jump / Call

Format: 32-bit / 24-bit

J_P label

Operation:

32-bit

PC = PC + ('label23'^8 || label)
LR = PC

24-bit

PC = PC + ('label15'^16 || label)
LR = PC

Encoding:

31 28 27 24 23 0
32-bit  0001  label  J/C
4 4 24

23 20 19 16 15 0
24-bit  0001  label  J/C
4 4 16

15 12 11 8 7 0
16-bit  0001  label  J/C
4 4 8

J_AR    Absolute Jump Register

Description: The value in rs0 is unconditionally placed in the PC.

Type: Register

Format: 32-bit / 24-bit / 16-bit

J_AR rs0

Operation:

32-bit / 24-bit / 16-bit

PC = rs0

Encoding:

31 28 27 24 23 20 19 16 15 12 11 8 7 4 3 0
32-bit  0000  unused  rs0  unused  0001  unused  unused  R
4 4 4 4 4 4 4 4

23 20 19 16 15 12 11 8 7 4 3 0
24-bit  0000  unused  rs0  unused  0001  R
4 4 4 4 4 4

15 12 11 8 7 4 3 0
16-bit  0000  unused  rs0  R-2

4 4 4 4

J_PR    Procedural Jump Register

Description: The value in rs0 is unconditionally placed in the PC. The address of the next instruction is placed in the link register (LR).

Type: Register

Format: 32-bit / 24-bit / 16-bit

J_PR rs0

Operation:

32-bit / 24-bit / 16-bit

PC = rs0
LR = PC

Encoding:

31 28 27 24 23 20 19 16 15 12 11 8 7 4 3 0
32-bit  0001  unused  rs0  unused  0001  unused  unused  R
4 4 4 4 4 4 4 4

23 20 19 16 15 12 11 8 7 4 3 0
24-bit  0001  unused  rs0  unused  0001  R
4 4 4 4 4 4

15 12 11 8 7 4 3 0
16-bit  0001  unused  rs0  R-2
4 4 4 4

J_LR    Jump Link Register

Description: The value in the link register (LR) is unconditionally placed in the PC.

Type: Register

Format: 32-bit / 24-bit / 16-bit

J_LR

Operation:

32-bit / 24-bit / 16-bit

PC = LR

Encoding:

31 28 27 24 23 20 19 16 15 12 11 8 7 4 3 0
32-bit  0010  unused  unused  unused  0001  unused  unused  R
4 4 4 4 4 4 4 4

23 20 19 16 15 12 11 8 7 4 3 0
24-bit  0010  unused  unused  unused  0001  R
4 4 4 4 4 4

15 12 11 8 7 4 3 0
16-bit  0010  unused  unused  R-2
4 4 4 4
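The procedural jumps and J_LR together form a call and return mechanism through the link register. The sketch below shows one possible reading and is not the simulator's implementation: the class `Core` is hypothetical, the PC is assumed to be a byte address, and the "pre-incremented PC" is interpreted as the address of the instruction following the jump, which is also the value saved in LR.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

class Core:
    """Holds only the PC and the link register (hypothetical model)."""

    def __init__(self, pc=0):
        self.pc = pc
        self.lr = 0

    def j_p(self, label, label_bits, length_bytes):
        """Relative call: save the return address, then jump."""
        next_pc = (self.pc + length_bytes) & MASK32   # assumed pre-incremented PC
        self.lr = next_pc
        self.pc = (next_pc + sign_extend(label, label_bits)) & MASK32

    def j_pr(self, rs0, length_bytes):
        """Register call: save the return address, then jump to rs0."""
        self.lr = (self.pc + length_bytes) & MASK32
        self.pc = rs0 & MASK32

    def j_lr(self):
        """Return: jump to the address held in the link register."""
        self.pc = self.lr

core = Core(pc=0x100)
core.j_p(0x000040, 24, 4)        # 32-bit J_P with a 24-bit label of +0x40
call_target, return_address = core.pc, core.lr
core.j_lr()                      # return to the instruction after the call
```

Because there is a single link register, a nested call would overwrite LR, so a callee that makes further calls has to save LR first.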

Appendix B: ARM SS and PISA SS Configuration Table

option   ARM   PISA   Description

-fetch:ifqsize   1   1   4   instruction fetch queue size (in insts)
-fetch:mplat   3   3   3   extra branch mis-prediction latency
-fetch:speed   1   1   1   speed of front-end of machine relative to execution core
-bpred   nottaken   nottaken   bimod   {nottaken|taken|perfect|bimod|2lev|comb}
-bpred:bimod   2048   bimodal predictor config (<table size>)
-bpred:2lev   1 1024 8 0   2-level predictor (<l1size> <l2size> <hist_size> <xor>)
-bpred:comb   1024   combining predictor config (<meta_table_size>)
-bpred:ras   0   0   8   return address stack size (0 for no return stack)
-bpred:btb   0 0   0 0   512 4   BTB config (<num_sets> <associativity>)
-bpred:spec   speculative predictors update in {ID|WB} (default non-spec)
-decode:width   1   1   4   instruction decode B/W (insts/cycle)
-issue:width   1   1   4   instruction issue B/W (insts/cycle)
-issue:inorder   TRUE   TRUE   false   run pipeline with in-order issue
-issue:wrongpath   FALSE   FALSE   true   issue instructions down wrong execution paths
-commit:width   1   1   4   instruction commit B/W (insts/cycle)
-ruu:size   2   2   16   register update unit (RUU) size
-lsq:size   2   2   8   load/store queue (LSQ) size
-lsq:perfect   n/a   false   perfect memory disambiguation
-cache:dl1   256:16:4:l   256:16:4:l   dl1:128:32:4:l   l1 data cache config, i.e., {<config>|none}
-cache:dl1lat   1   1   1   l1 data cache hit latency (in cycles)
-cache:dl2   none   none   ul2:1024:64:4:l   l2 data cache config, i.e., {<config>|none}
-cache:dl2lat   6   l2 data cache hit latency (in cycles)
-cache:il1   256:16:4:l   256:16:4:l   il1:512:32:1:l   l1 inst cache config, i.e., {<config>|dl1|dl2|none}
-cache:il1lat   1   1   1   l1 instruction cache hit latency (in cycles)
-cache:il2   none   none   dl2   l2 instruction cache config, i.e., {<config>|dl2|none}
-cache:il2lat   6   l2 instruction cache hit latency (in cycles)
-cache:flush   FALSE   FALSE   false   flush caches on system calls
-cache:icompress   FALSE   FALSE   false   convert 64-bit inst addresses to 32-bit inst equivalents
-mem:lat   18 1   18 1   18 2   memory access latency (<first_chunk> <inter_chunk>)
-mem:width   8   8   8   memory access bus width (in bytes)
-mem:pipelined   n/a   true   memory accesses are fully pipelined
-tlb:itlb   none   none   itlb:16:4096:4:l   instruction TLB config, i.e., {<config>|none}
-tlb:dtlb   none   none   dtlb:32:4096:4:l   data TLB config, i.e., {<config>|none}
-tlb:lat   30   inst/data TLB miss latency (in cycles)
-res:ialu   1   1   4   total number of integer ALU's available
-res:imult   1   1   1   total number of integer multiplier/dividers available
-res:memport   1   1   2   total number of memory system ports available (to CPU)
-res:fpalu   1   1   4   total number of floating point ALU's available
-res:fpmult   1   1   1   total number of floating point multiplier/dividers available

Appendix C: ARM SS and PISA SS MPEG 4 Benchmark Results

ARM

sim_num_insn        8,483,057,826   Number of instructions executed
sim_num_refs        2,330,433,025   Number of loads and stores executed
sim_elapsed_time    4,082           Simulation time in seconds
sim_inst_rate       2,078,162       Simulation speed in instructions/second

Overall Statistics

Load       1,741,378,189   20.53
Store        589,048,095    6.96
Control    1,215,336,160   14.32
Integer    4,933,354,454   58.15
FP             3,818,200    0.04
Misc              11,710    0
           8,482,946,808
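The instruction mix can be recomputed from the raw ARM counts as a consistency check. This is a small sketch using only the counts tabulated above; the variable names are illustrative. Percentages computed directly from the counts may differ from the tabulated ones in the second decimal place, possibly because the tabulated totals sum individually rounded rows.

```python
# Category counts and total instruction count for the ARM run, copied from the table.
arm_counts = {
    "Load": 1_741_378_189,
    "Store": 589_048_095,
    "Control": 1_215_336_160,
    "Integer": 4_933_354_454,
    "FP": 3_818_200,
    "Misc": 11_710,
}
sim_num_insn = 8_483_057_826

category_total = sum(arm_counts.values())
mix = {name: 100.0 * count / sim_num_insn for name, count in arm_counts.items()}

for name, percent in mix.items():
    print(f"{name:8s} {percent:6.2f} %")
print("sum of categories:", f"{category_total:,}")
```

The category counts sum to the total printed under the Overall Statistics rows, which is slightly smaller than sim_num_insn.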

Control

b%c %j                  1,141,184,923   13.45   Branch
bl%c %j                    74,151,237    0.87   Branch and Link
                        1,215,336,160   14.32   Total %

Integer

and%c %d,%n,%m                179,939    0      AND
and%cs %d,%n,%m                     0    0
and%c %d,%n,#%i            48,984,130    0.58
and%cs %d,%n,#%i              189,236    0
orr%c %d,%n,%m             39,946,144    0.47   OR
orr%cs %d,%n,%m             7,581,874    0.09
orr%c %d,%n,#%i                59,753    0
orr%cs %d,%n,#%i                    0    0
eor%c %d,%n,%m              5,481,850    0.06   Exclusive OR
eor%cs %d,%n,%m                     0    0
eor%c %d,%n,#%i                12,331    0
eor%cs %d,%n,#%i                    0    0
cmp%c %n,%m               236,635,336    2.79   Compare
cmp%c %n,#%i              910,896,229   10.74
cmn%c %n,%m                 1,092,728    0.01   Compare Negative
cmn%c %n,#%i                3,439,197    0.04
tst%c %n,%m                         0    0      Test-AND
tst%cs %n,%m                  467,610    0.01
tst%c %n,#%i                        0    0
tst%cs %n,#%i             185,453,430    2.19
teq%c %n,%m                         0    0      Test-XOR
teq%cs %n,%m                        0    0
teq%c %n,#%i                        0    0
teq%cs %n,#%i                       0    0
bic%c %d,%n,%m                  4,001    0      Bit Clear
bic%cs %d,%n,%m                   473    0
bic%c %d,%n,#%i               227,591    0
bic%cs %d,%n,#%i           10,276,621    0.12
mov%c %d,%m             1,005,649,444   11.85   Move
mov%cs %d,%m               24,340,333    0.29
mov%c %d,#%i              283,260,509    3.34
mov%cs %d,#%i                       0    0
mvn%c %d,%m                   218,824    0      Move negative
mvn%cs %d,%m                        0    0
mvn%c %d,#%i               17,838,615    0.21
mvn%cs %d,#%i                       0    0
add%c %d,%n,%m            810,454,634    9.55   Addition

add%cs %d,%n,%m            85,011,646    1
add%c %d,%n,#%i           558,297,500    6.58
add%cs %d,%n,#%i                2,142    0
adc%c %d,%n,#%i                     0    0      Addition w/ Carry
adc%cs %d,%n,#%i                    0    0
adc%c %d,%n,%m                      0    0
adc%cs %d,%n,%m                     0    0
sub%c %d,%n,%m             52,391,274    0.62   Subtract
sub%cs %d,%n,%m                37,745    0
sub%c %d,%n,#%i           240,661,762    2.84
sub%cs %d,%n,#%i           69,998,752    0.83
sbc%c %d,%n,%m                    981    0      Subtract w/ Carry
sbc%cs %d,%n,%m                     0    0
sbc%c %d,%n,#%i                     0    0
sbc%cs %d,%n,#%i                    0    0
rsb%c %d,%n,%m            238,464,604    2.81   Reverse Subtract
rsb%cs %d,%n,%m                     0    0
rsb%c %d,%n,#%i            88,679,185    1.05
rsb%cs %d,%n,#%i                  153    0
rsc%c %d,%n,%m                      0    0      R-Subtract w/ Carry
rsc%cs %d,%n,%m                     0    0
rsc%c %d,%n,#%i                     0    0
rsc%cs %d,%n,#%i                    0    0
mul%c %n,%w,%s              7,082,760    0.08   Multiply
mul%cs %n,%w,%s                35,118    0
                        4,933,354,454   58.15   Total %

Load

ldm%c%a %n!,%R                      0    0      Load Multiple
ldm%c%a %n!,%R                 26,529    0
ldm%c%a %n!,%R                      0    0
ldm%c%a %n!,%R                      0    0
ldm%c%a %n,%R              17,488,328    0.21
ldm%c%a %n,%R                  90,477    0
ldm%c%a %n,%R             104,885,565    1.24
ldm%c%a %n,%R                   1,403    0
ldr%c %d,[%n,#%o]         425,446,638    5.02   Load Word
ldr%c %d,[%n,-#%o]        474,198,284    5.59
ldr%c %d,[%n,#%o]!              2,917    0
ldr%c %d,[%n,-#%o]!                 0    0
ldr%c %d,[%n,-%m]                   0    0
ldr%c %d,[%n,-%m]         329,477,488    3.88
ldr%c %d,[%n,-%m]!                  0    0
ldr%c %d,[%n,-%m]!                  0    0
ldr%c %d,[%n],#%o          10,287,760    0.12
ldr%c %d,[%n],-#%o             17,229    0
ldr%c %d,[%n],%m!                   0    0
ldr%c %d,[%n],-%m!                  0    0
ldr%cb %d,[%n,#%o]        129,200,103    1.52
ldr%cb %d,[%n,-#%o]             2,055    0
ldr%cb %d,[%n,#%o]!               173    0
ldr%cb %d,[%n,-#%o]!            2,378    0
ldr%cb %d,[%n,-%m]                  0    0
ldr%cb %d,[%n,-%m]        202,141,990    2.38
ldr%cb %d,[%n,-%m]!                 0    0
ldr%cb %d,[%n,-%m]!                 0    0
ldr%cb %d,[%n],#%o         48,108,872    0.57
ldr%cb %d,[%n],-#%o                 0    0
ldr%cb %d,[%n],%m!                  0    0
ldr%cb %d,[%n],-%m!                 0    0
ldr%ch %d,[%n,#%h]                  0    0
ldr%ch %d,[%n,-#%h]                 0    0
ldr%ch %d,[%n,#%h]!                 0    0
ldr%ch %d,[%n,-#%h]!                0    0
ldr%ch %d,[%n,%w]                   0    0
ldr%ch %d,[%n,%w]                   0    0
ldr%ch %d,[%n,%w]!                  0    0
ldr%ch %d,[%n,%w]!                  0    0
ldr%ch %d,[%n],#%h                  0    0
ldr%ch %d,[%n],-#%h                 0    0
ldr%ch %d,[%n],%w                   0    0

[The remaining load rows, the Load total, and the first store rows (store multiple and store word forms) are illegible in the source scan.]

str%c %d,[%n],-#%o                  0    0
str%c %d,[%n],%m!                   0    0
str%c %d,[%n],-%m!                  0    0
str%cb %d,[%n,#%o]             66,274    0
str%cb %d,[%n,-#%o]               154    0
str%cb %d,[%n,#%o]!            70,200    0
str%cb %d,[%n,-#%o]!       12,446,985    0.15
str%cb %d,[%n,-%m]                  0    0
str%cb %d,[%n,-%m]         13,337,925    0.16
str%cb %d,[%n,-%m]!                 0    0
str%cb %d,[%n,-%m]!                 0    0
str%cb %d,[%n],#%o         50,108,823    0.59
str%cb %d,[%n],-#%o                 0    0
str%cb %d,[%n],%m!                  0    0
str%cb %d,[%n],-%m!                 0    0
str%ch %d,[%n,#%h]                  0    0
str%ch %d,[%n,-#%h]                 0    0
str%ch %d,[%n,#%h]!                 0    0
str%ch %d,[%n,-#%h]!                0    0
str%ch %d,[%n,%w]                   0    0
str%ch %d,[%n,%w]                   0    0
str%ch %d,[%n,%w]!                  0    0
str%ch %d,[%n,%w]!                  0    0
str%ch %d,[%n],#%h                  0    0
str%ch %d,[%n],-#%h                 0    0
str%ch %d,[%n],%w                   0    0
str%ch %d,[%n],-%w                  0    0
                          589,048,095    6.96   Total %

Floating Point

abs%c%t%r %D,%M                     0    0
abs%c%t%r %D,#%I                    0    0
adf%c%t%r %D,%N,%M          1,902,450    0.02
adf%c%t%r %D,%N,#%I             2,664    0
mvf%c%t%r %D,%M                     0    0
mvf%c%t%r %D,#%I                  150    0
muf%c%t%r %D,%N,%M              2,444    0
muf%c%t%r %D,%N,#%I               150    0
mnf%c%t%r %D,%M                     0    0
mnf%c%t%r %D,#%I                    0    0
suf%c%t%r %D,%N,%M                494    0
suf%c%t%r %D,%N,#%I               150    0
rsf%c%t%r %D,%N,%M                  0    0
rsf%c%t%r %D,%N,#%I                 0    0
rnd%c%t%r %D,%M                     0    0
rnd%c%t%r %D,#%I                    0    0
dvf%c%t%r %D,%N,%M              3,117    0
dvf%c%t%r %D,%N,#%I                 0    0
sqt%c%t%r %D,%M                     0    0
sqt%c%t%r %D,#%I                    0    0
flt%c%t%r %N,%d             1,906,581    0.02
rdf%c%t%r %D,%N,%M                  0    0
rdf%c%t%r %D,%N,#%I                 0    0
log%c%t%r %D,%N,%M                  0    0
log%c%t%r %D,%N,#%I                 0    0
pow%c%t%r %D,%N,%M                  0    0
pow%c%t%r %D,%N,#%I                 0    0
lgn%c%t%r %D,%N,%M                  0    0
lgn%c%t%r %D,%N,#%I                 0    0
rpw%c%t%r %D,%N,%M                  0    0
rpw%c%t%r %D,%N,#%I                 0    0
exp%c%t%r %D,%N,%M                  0    0
exp%c%t%r %D,%N,#%I                 0    0
rmf%c%t%r %D,%N,%M                  0    0
rmf%c%t%r %D,%N,#%I                 0    0
sin%c%t%r %D,%N,%M                  0    0
sin%c%t%r %D,%N,#%I                 0    0
fml%c%t%r %D,%N,%M                  0    0
fml%c%t%r %D,%N,#%I                 0    0
cos%c%t%r %D,%N,%M                  0    0
cos%c%t%r %D,%N,#%I                 0    0
fdv%c%t%r %D,%N,%M                  0    0
fdv%c%t%r %D,%N,#%I                 0    0
tan%c%t%r %D,%N,%M                  0    0
tan%c%t%r %D,%N,#%I                 0    0
frd%c%t%r %D,%N,%M                  0    0
frd%c%t%r %D,%N,#%I                 0    0
asn%c%t%r %D,%N,%M                  0    0
asn%c%t%r %D,%N,#%I                 0    0
pol%c%t%r %D,%N,%M                  0    0
pol%c%t%r %D,%N,#%I                 0    0
acs%c%t%r %D,%N,%M                  0    0
acs%c%t%r %D,%N,#%I                 0    0
atn%c%t%r %D,%N,%M                  0    0
atn%c%t%r %D,%N,#%I                 0    0
urd%c%t%r %D,%M                     0    0
urd%c%t%r %D,#%I                    0    0
nrm%c%t%r %D,%M                     0    0
nrm%c%t%r %D,#%I                    0    0
                            3,818,200    0.04   Total %

Misc

swi%c %S                       11,710    0      Software Interrupt
                               11,710    0      Total %

PISA

sim_num_insn        9,815,840,093   Number of instructions executed
sim_num_refs        2,110,058,029   Number of loads and stores executed
sim_elapsed_time    8,337           Simulation time in seconds
sim_inst_rate       1,177,383       Simulation speed in instructions/second

Overall Statistics

Load       1,564,723,828   15.93
Store        545,334,198    5.55
Control    1,912,743,848   19.49
Integer    5,787,281,813   58.95
FP             3,824,850    0.04
Misc              23,120    0
           9,813,931,657
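The headline figures of the two runs can be compared directly from the reported counters. The sketch below uses only the sim_num_insn, sim_num_refs, and sim_elapsed_time values listed for ARM and PISA; the dictionary layout and derived metric names are illustrative.

```python
# Counters copied from the ARM and PISA summaries of the MPEG-4 benchmark.
runs = {
    "ARM": {"insn": 8_483_057_826, "refs": 2_330_433_025, "seconds": 4_082},
    "PISA": {"insn": 9_815_840_093, "refs": 2_110_058_029, "seconds": 8_337},
}

# Dynamic instruction count of PISA relative to ARM for the same benchmark.
insn_ratio = runs["PISA"]["insn"] / runs["ARM"]["insn"]

# Fraction of executed instructions that are loads or stores.
ref_fraction = {name: r["refs"] / r["insn"] for name, r in runs.items()}

# Simulation speed, which should reproduce the reported sim_inst_rate.
inst_rate = {name: round(r["insn"] / r["seconds"]) for name, r in runs.items()}

print(f"PISA/ARM instruction ratio: {insn_ratio:.3f}")
for name in runs:
    print(f"{name}: {100 * ref_fraction[name]:.1f} % memory references, "
          f"{inst_rate[name]:,} instructions/second")
```

On these counters PISA executes roughly 16 percent more instructions than ARM while issuing fewer memory references in absolute terms.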

Control

j J                131,204,892    1.34   Jump
jal J               78,405,310    0.8    Jump and Link
jalr d,s                22,681    0      Jump and Link Register
jr s                89,985,299    0.92   Jump Register
bc1f j                     300    0      Branch FCC False
bc1t j                     459    0      Branch FCC True
beq s,t,j          612,694,553    6.24   Branch Equal
bgez s,j           153,241,607    1.56   Branch >= Zero
bgtz s,j            26,290,333    0.27   Branch > Zero
blez s,j            19,087,675    0.19   Branch <= Zero
bltz s,j            11,678,094    0.12   Branch < Zero
bne s,t,j          790,132,645    8.05   Branch Not Equal

Integer

and d,s,t           12,016,699    0.12   AND
andi t,s,u         276,340,565    2.82   AND Immediate
or d,s,t            77,524,859    0.79   OR
ori t,s,u           36,378,132    0.37   OR Immediate
nor d,s,t            4,889,166    0.05   NOR
xor d,s,t            4,591,241    0.05   XOR
xori t,s,u          26,167,712    0.27   XOR Immediate
sll d,t,H          620,641,707    6.32   Shift Left Logical
sllv d,t,s          90,271,681    0.92   Variable
sra d,t,H           86,245,034    0.88   Shift Right Arithmetic
srav d,t,s         143,463,961    1.46   Variable
srl d,t,H           47,803,335    0.49   Shift Right Logical
srlv d,t,s          46,630,162    0.48   Variable
slt d,s,t           92,749,321    0.94   Set Less Than
slti t,s,i         319,817,053    3.26   Set Less Than Immediate
sltu d,s,t         158,508,789    1.61   Set Less Than Unsigned
sltiu t,s,i         38,720,989    0.39   Set Less Than Unsigned Imm
mfhi d              37,291,777    0.38   Move From High
mflo d              86,862,705    0.88   Move From Low
mthi s                       0    0      Move To High
mtlo s                       0    0      Move To Low
lui t,U             64,096,217    0.65   Load Upper Immediate
add d,s,t                    0    0      Add
addi t,s,i                   0    0      Add Immediate
addu d,s,t       2,140,451,267   21.81   Add Unsigned
addiu t,s,i      1,057,549,854   10.77   Add Unsigned Immediate
sub d,s,t                    0    0      Subtract
subu d,s,t         231,406,539    2.36   Subtract Unsigned
mult s,t            44,419,726    0.45   Multiply
multu s,t              118,227    0      Multiply Unsigned
div s,t              5,158,662    0.05   Divide
divu s,t            37,166,433    0.38   Divide Unsigned
                 5,787,281,813   58.95   Total %

Load

lb t,o(b)          112,354,745    1.14   Load Byte-8
lb t,(b+d)                   0    0
lbu t,o(b)         367,052,319    3.74   Load Byte Unsigned-8
lbu t,(b+d)                  0    0
lh t,o(b)                   11    0      Load Half Word-16
lh t,(b+d)                   0    0
lhu t,o(b)          22,990,852    0.23   Load Half Word Unsigned-16
lhu t,(b+d)                  0    0
lw t,o(b)        1,053,029,088   10.73   Load Word-32
lw t,(b+d)                   0    0
lwl t,o(b)                   0    0
lwr t,o(b)                   0    0
l.d T,o(b)               3,167    0      Load Double Precision FP
l.d T,(b+d)                  0    0
l.s T,o(b)              10,791    0      Load Single Precision FP
l.s T,(b+d)                  0    0
dlw t,o(b)           9,282,855    0.09   Load Double Word
dlw t,(b+d)                  0    0
                 1,564,723,828   15.93   Total %

Store

sb t,o(b)          117,846,825    1.2    Store Byte-8
sb t,(b+d)                   0    0
sh t,o(b)                    0    0      Store Half Word-16
sh t,(b+d)                   0    0
sw t,o(b)          418,198,642    4.26   Store Word-32
sw t,(b+d)                   0    0
dsw t,o(b)           9,282,855    0.09   Store Double Word
dsw t,(b+d)                  0    0
s.d T,o(b)                 200    0      Store Double Precision FP
s.d T,(b+d)                  0    0
s.s T,o(b)               5,676    0      Store Single Precision FP
s.s T,(b+d)                  0    0
                   545,334,198    5.55   Total %

Floating Point

abs.d D,S                    0    0      Absolute Value
abs.s D,S                    0    0
add.d D,S,T          1,906,014    0.02   Add Double Precision FP
add.s D,S,T                  0    0      Add Single Precision FP
sub.d D,S,T                750    0      Sub Double Precision FP
sub.s D,S,T                  0    0      Sub Single Precision FP
c.eq.d S,T                 303    0      Compare eq
c.eq.s S,T                   0    0
c.le.d S,T                 150    0      Compare less than or equal
c.le.s S,T                   0    0
c.lt.d S,T                 306    0      Compare less than
c.lt.s S,T                   0    0
cvt.d.s D,S                  0    0      Conversions
cvt.d.w D,S          1,906,731    0.02
cvt.s.d D,S                  0    0
cvt.s.w D,S                  0    0
cvt.w.d D,S              2,964    0
cvt.w.s D,S                  0    0
div.d D,S,T              3,117    0      Divide Double
div.s D,S,T                  0    0
mul.d D,S,T              2,553    0      Multiply Double
mul.s D,S,T                  0    0
neg.d D,S                    0    0
neg.s D,S                    0    0
mov.d D,S                1,959    0
mov.s D,S                    0    0
sqrt.d D,S                   0    0
sqrt.s D,S                   0    0
dsz o(b)                     3    0
dsz (b+d)                    0    0
                     3,824,850    0.04   Total %

Misc

syscall                 11,572    0      System Call
break B                      0    0
nop                     11,548    0
                        23,120    0      Total %