
The ARM Architecture

Seng Lin Shee

Thursday, 20 May 2004

Contents

Abstract
1 Introduction
  1.1 History
  1.2 ARM Limited
  1.3 Licensing Issues
2 The ARM Architecture
  2.1 Architecture definition
  2.2 Architecture Variants
  2.3 General ISA features
    2.3.1 Programmer’s Model
    2.3.2 The ARM coprocessor interface
    2.3.3 Conditional Execution
    2.3.4 Multiple Register Transfer Operation
    2.3.5 Fold shifts/ rotates into ALU operation
  2.4 Operating system support
    2.4.1 Coprocessor Number 15
    2.4.2 MMU architecture
    2.4.3 Synchronization
    2.4.4 Context switching
  2.5 AMBA Interface
  2.6 ARMv4
    2.6.1 Thumb Instruction Set
  2.7 ARMv5
    2.7.1 ARM DSP Extensions
    2.7.2 Jazelle
  2.8 ARMv6
    2.8.1 Media Instructions
    2.8.2 Thumb2
  2.9 Summary
3 ARM Implementations
  3.1 ARM7TDMI
  3.2 ARM9TDMI
  3.3 ARM11
  3.4 XScale
  3.5 AMULET
    3.5.1 AMULET3
4 Conclusion
5 References

Abstract

The ARM is a Reduced Instruction Set Computer (RISC), a type of microprocessor that recognizes a relatively limited number of instructions. One advantage of RISC processors is that they can execute instructions very quickly because the instructions are so simple. Another important advantage is that RISC chips require fewer transistors, which makes them cheaper to design and produce. The ARM architecture differs from most other major architectures, such as SPARC, in that it is mainly designed for embedded systems. ARM processors are, in general, processors that implement the ARM architecture specification. This report discusses the ARM architecture, its interesting features, its architectural extensions, its implementations and some historical notes.

1 Introduction

1.1 History

The ARM processor originated at a small company named Acorn Limited in England between 1983 and 1985. At that time, Acorn developed computers for the BBC (British Broadcasting Corporation). Owing to the popularity of the BBC, the computer, which was built around the 8-bit 6502 microprocessor, became the dominant machine in British schools. It also flourished in the hobbyist market and was used in research laboratories and higher education establishments.

In order to create a successor to the BBC computer, Acorn needed a new microprocessor that was better than the 6502. At the time, most 16-bit CISC microprocessors were slower than standard memory parts. The BBC was reluctant to adopt such slow parts because the 6502 actually had better interrupt response. When Acorn was refused access to the 80286 microprocessor, it decided to create its own processor from scratch.

By developing its own microprocessor, Acorn had to develop the whole platform and produce it as a complete product. Acorn thus developed the microprocessor, the system board and the operating system to run on it. With only just over 400 employees in total, Acorn could not invest in the development of a complex microprocessor, and the company did not have the relevant experience to make a commercial microprocessor.

Thus, it was decided that Acorn had to produce a better design with a fraction of the design effort. Fortunately, Acorn came across some papers published by a group of students who had developed the Berkeley RISC 1 processor. The design was simple, with no complex instructions to ruin the interrupt latency. It was suggested that this design was the way to the future.

An engineering team at Acorn, led by Roger Wilson and Steve Furber, started development of the ARM, which stood for Acorn RISC Machine. This microprocessor later became known as the ARM1, a prototype that was never sold commercially.

1.2 ARM Limited

The ARM1 processor was completed by 1985, and the first "real" production system was launched as the ARM2 the following year. The ARM2 featured a 32-bit data bus and a 26-bit address bus, with 16 registers. The ARM2 was the simplest useful processor in the world, with only 30,000 transistors (compare the four-year-older 68000, with 68,000). Much of this simplicity came from not having microcode (which represents about 1/4 to 1/3 of the size of the 68000) and from not including any cache. This simplicity led to its excellent low-power characteristics, and yet it performed better than the 286.

Soon after that, Acorn started working with Apple Computer on newer versions of the ARM core. The work was so important that Acorn spun off the design team in 1990 into a separate company named Advanced RISC Machines (still maintaining the ARM acronym).

The first CPU produced by Advanced RISC Machines was the ARM6. This time, the design was a true 32-bit CPU with 32-bit addressing, while otherwise remaining similar to earlier models. The first models were released in 1991, and Apple used the ARM6-based ARM610 as the basis for its Apple Newton PDA.

Today, the company is simply named ARM Limited and continues to be the leader in embedded RISC microprocessors. In 2002, ARM Limited was the leading provider of 32-bit embedded RISC microprocessors, with 75% of the market.

ARM’s success was due to:
• Common architecture
• High performance
• Low power consumption
• Low system cost

ARM provided solutions for:
• Embedded real-time systems for mass storage, automotive, industrial and networking applications
• Secure applications: smartcards and SIMs
• Open platforms running complex operating systems

1.3 Licensing Issues

ARM’s main business model is driven by licensing its technology to other companies. The company does not manufacture or fabricate its own microprocessors. Rather, it licenses the technology to companies like Intel and Motorola, who use the ARM architecture in their microprocessors and market them to consumers.

The following are the different types of licenses provided by ARM:
• Implementation License
  This license scheme is the most popular and has been purchased by over a hundred companies around the world. It gives the licensee complete information to design and manufacture integrated circuits containing an ARM core. ARM provides hard or soft cores (macrocells). Hard cores are technology-dependent implementations, whereas soft cores are delivered as HDL (Hardware Description Language) code. Soft cores can be used in various processes but are not optimized for any of them.
• Foundry License
  This license is targeted at fab-less semiconductor vendors who develop and sell ARM core-based products manufactured by licensed companies. The license provides all the key elements and views needed to design an ARM-Powered system-on-chip.
• Architecture License
  The architecture license allows the licensee to develop its own CPU implementations compliant with ARM’s Instruction Set Architecture. The architecture licensee must have extensive design resources and the highest level of implementation expertise. Intel is a very good example of a company developing its own CPU based on ARM’s ISA: the Intel XScale is based on the ARM ISA and provides additional instructions.
• Academic License
  Basic building blocks of the core are provided to allow simulation and the design of prototype parts for academic research. A core simulation environment can thus be created to support further academic research on the ARM architecture.

2 The ARM Architecture

The ARM architecture has been developed and improved over the years to supply the additional computing power required by today’s embedded systems while maintaining low power consumption and high code density. This section discusses the features of the ARM architecture and the features that were added during its major revisions.

2.1 Architecture definition

The architecture describes the rules for how the microprocessor will behave, without constraining or specifying how it will be built. The definition of the architecture specifies the interface with the outside world, thus enabling operating system, application and development support to be planned and implemented. In detailed terms, the microprocessor architecture defines:
• the processor’s instruction set
  The instruction set definition provides the list of instructions available to the programmer. In the ARM, further instruction sets complement the base ARM ISA. These include the Thumb instructions and the SIMD media instructions.
• the programmer’s model
  The programmer’s model defines the registers and processor state that are available to the programmer at any one time. It defines how the program is represented and how context affects the register sets. In the ARM architecture, a link register and a stack pointer are used for branching and context switching.
• how the processor interfaces with its closest memory resources
  The processor cannot be effective working on its own. To expand the capability of the processor core, support is provided for working with additional devices. The architecture specifies how the core interacts with additional devices such as coprocessors. It also defines the bus interfaces through which the core interacts with them.

2.2 Architecture Variants

ARMv1
ARMv1 was the first processor architecture developed at Acorn Limited and was implemented in the ARM1 processor. This first architecture was very simple, with only 26-bit addressing, so it was not a fully 32-bit processor. It had no multiply instructions and no coprocessor support.

ARMv2
The first commercial chip marketed by Acorn Limited was the ARM2. The ARM2 featured a 32-bit data bus and a 26-bit address bus, with 16 registers. With only 30,000 transistors, it was one of the simplest useful processors in the world at that time. In contrast with most CISC processors, the ARM2 does not use microcode. The ARMv2 architecture added multiply instructions with a 32-bit result and coprocessor support.

ARMv2a
Acorn introduced the ARM3 chip with an on-chip cache. The features added include the atomic load and store (SWP) instruction and the use of coprocessor 15 as the system control coprocessor to manage the cache.

ARMv3
The ARM6 was the first processor developed and marketed after the team responsible for the ARM was spun off into Advanced RISC Machines in 1990. The ARM6 was sold as a macrocell, as a stand-alone processor and as an integrated CPU with an on-chip cache, MMU and write buffer. The ARM6 was used in the Apple Newton PDA. The ARMv3 architecture had 32-bit addressing and separate CPSR and SPSRs, and added the undefined and abort modes to allow coprocessor emulation and virtual memory support in supervisor mode.

The architectures discussed above have been replaced by newer architectures, which are supported today by most embedded systems. The chart below shows the timeline of the introduction of the new architectures and of their implementations.

Figure 2-1 Architectural Timeline

ARMv4
ARMv4 adds the signed and unsigned half-word and signed byte load and store instructions and reserves some of the SWI space for architecturally defined operations. The system mode is introduced, and several unused corners of the instruction space are defined as undefined instructions for future use.

ARMv4 introduces the Thumb instruction set, which consists of 16-bit instructions. These instructions result in higher code density.

ARMv5
Version 5 of the ARM architecture improves ARM and Thumb interworking, adds the count leading zeros (CLZ) instruction and introduces more architecture variants:
• E - enhanced DSP instructions, including saturated arithmetic operations and 16-bit multiply operations
• J - Java support, offering hardware and optimized software acceleration of byte code execution

ARMv6
ARMv6 was introduced in October 2001. It includes all the ‘TEJ’ enhancements, namely the Thumb instruction set, the DSP enhancements and Jazelle technology (Java). ARMv6 also improves memory management support and adds new media instructions. This results in a new programmer’s model for the media instructions, which are SIMD instructions.

2.3 General ISA features

The Berkeley RISC 1 architecture became the basis for the ARM processor. However, only certain elements were incorporated into the ARM architecture. The ARM architecture can be described as RISC with some CISC features. As a result, the ARM achieves power efficiency and a small core size while obtaining better code density than a pure RISC processor.

The following features of the Berkeley RISC design are incorporated in the ARM architecture:
• A load-store architecture
  Memory is accessed only by load and store instructions. All other instructions operate only on registers.

• Fixed-length 32-bit instructions

• 3-address instructions A 32-bit instruction usually contains the address of 2 source registers and 1 destination register.
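The 3-address format can be made concrete by pulling the register fields out of a 32-bit instruction word. The sketch below is illustrative only: the struct and function names are hypothetical, and the bit positions are those of the standard ARM data-processing encoding with a register operand, which this report does not spell out.

```c
#include <stdint.h>

/* Fields of an ARM data-processing instruction with a register operand.
 * The struct and its member names are illustrative, not from the report. */
typedef struct {
    unsigned cond;    /* bits 31..28: condition code          */
    unsigned opcode;  /* bits 24..21: ALU operation           */
    unsigned rn;      /* bits 19..16: first source register   */
    unsigned rd;      /* bits 15..12: destination register    */
    unsigned rm;      /* bits  3..0 : second source register  */
} dp_fields;

/* Extract the fields from a 32-bit instruction word. */
static dp_fields decode_dp(uint32_t insn)
{
    dp_fields f;
    f.cond   = (insn >> 28) & 0xFu;
    f.opcode = (insn >> 21) & 0xFu;
    f.rn     = (insn >> 16) & 0xFu;
    f.rd     = (insn >> 12) & 0xFu;
    f.rm     = insn & 0xFu;
    return f;
}
```

For instance, the word 0xE0810002 (ADD r0, r1, r2) decodes to two source registers, r1 and r2, and one destination register, r0.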

The following features differ from the Berkeley RISC design:
• Register windows
  A register window is the range of registers that is visible at any one time. In the Berkeley RISC processors, 32 registers are visible at a time. On procedure entry and exit, the window moves to give access to a new set of registers, so the values of the previous registers need not be saved to memory. This saves the time that the memory accesses would take.

  However, this incurs the cost of implementing a large number of registers in hardware, which increases chip area and costs speed and power. To reduce those costs, the ARM architecture has no register windows and allows only 17 registers to be visible at any one time.

• Delayed branches
  A branch normally causes a ‘bubble’, an extra cycle in which the processor decides whether the branch is taken. With a delayed branch, this extra cycle is used to execute the instruction that follows the branch, which is usually chosen so that it is not affected by the branch decision.

  The ARM does not use delayed branches because they remove the atomicity of individual instructions. They would also interact badly with multi-issue implementations of the ARM microprocessor. Leaving them out avoids more complex exception handling.

• Single-cycle execution of all instructions
  Single-cycle execution of all instructions is possible only if instructions and data are held in separate memories, so that both can be accessed at the same time: one to fetch an instruction and the other to read or write data.

  Earlier versions of the ARM used a single memory for both data and instructions. Instructions that access data memory therefore take multiple cycles.

  When more than one cycle is needed, the extra cycles are used for other useful work, such as support for auto-indexing addressing modes. This reduces the total number of ARM instructions needed to perform a sequence of operations, thus improving performance and code density.

2.3.1 Programmer’s Model

Figure 2-2 ARM’s visible registers

The ARM has a 32-bit RISC processor core that receives 32-bit instructions from the instruction memory. In total, the ARM has 37 32-bit integer registers. When writing user-level programs, only the 15 general-purpose 32-bit registers r0 to r14, the program counter (r15) and the current program status register (CPSR) need to be considered.

The ARM supports 8-bit, 16-bit and 32-bit data types. User-level programs execute in User mode, whereas system-level code can execute in five other modes (fiq, svc, abort, irq and undefined).

The ARM itself is pipelined, so it uses ILP (instruction-level parallelism) to execute certain instructions in parallel. Older ARM implementations used the von Neumann architecture (unified instruction and data memory). To increase memory bandwidth and lower memory latency, the Harvard architecture was later chosen.

2.3.2 The ARM coprocessor interface

The ARM supports a general-purpose extension of its instruction set through the addition of hardware coprocessors. The following are the important features of the coprocessor architecture:
• Support for up to 16 logical coprocessors
• Each coprocessor can have up to 16 private registers of any reasonable size; they are not limited to 32 bits.
• Coprocessors use a load-store architecture, with instructions to perform internal operations on registers, instructions to load and save registers from and to memory, and instructions to move data to or from an ARM register.

Coprocessors communicate with the ARM core through handshaking protocols. Instructions are passed to the coprocessor in this way before they can be performed.

2.3.3 Conditional Execution

In most RISC and CISC processors, conditional branches are used to skip instructions, thus avoiding unnecessary execution. The ARM instruction set, however, has an unusual feature: every instruction (with the exception of Thumb instructions) is conditionally executed. The greatest common divisor routine below illustrates this. In C:

    int gcd(int i, int j)
    {
        while (i != j) {
            if (i > j)
                i -= j;
            else
                j -= i;
        }
        return i;
    }

The loop in ARM assembly, using conditional execution instead of branches around the subtractions:

            b     test
    loop    subgt Ri,Ri,Rj
            suble Rj,Rj,Ri
    test    cmp   Ri,Rj
            bne   loop

Conditional execution involves a trade-off:
• It cuts down significantly on the encoding space available for displacements in memory access instructions.
• It avoids branch instructions when generating code for small if statements.

Figure 2-3 The ARM condition code field

Each instruction has a 4-bit condition code field, so every instruction can be made conditional. An instruction mnemonic is made conditional by appending one of the defined two-letter condition codes (EQ, NE, GE, LT, GT, etc.).
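How a 4-bit condition code is tested against the status flags can be sketched in C as follows. This is a minimal model, not taken from the report: the type and function names are hypothetical, the flag tests follow the standard ARM condition table (N, Z, C, V flags), and code 0xF is simply treated here as "never executes", which is an assumption of this sketch.

```c
#include <stdbool.h>

/* The four condition flags held in the CPSR (illustrative struct). */
typedef struct { bool n, z, c, v; } flags;

/* Return true if an instruction with condition field `cond` should execute. */
static bool cond_passed(unsigned cond, flags f)
{
    switch (cond & 0xFu) {
    case 0x0: return f.z;                   /* EQ: equal                   */
    case 0x1: return !f.z;                  /* NE: not equal               */
    case 0x2: return f.c;                   /* CS: carry set               */
    case 0x3: return !f.c;                  /* CC: carry clear             */
    case 0x4: return f.n;                   /* MI: negative                */
    case 0x5: return !f.n;                  /* PL: positive or zero        */
    case 0x6: return f.v;                   /* VS: overflow                */
    case 0x7: return !f.v;                  /* VC: no overflow             */
    case 0x8: return f.c && !f.z;           /* HI: unsigned higher         */
    case 0x9: return !f.c || f.z;           /* LS: unsigned lower or same  */
    case 0xA: return f.n == f.v;            /* GE: signed greater or equal */
    case 0xB: return f.n != f.v;            /* LT: signed less than        */
    case 0xC: return !f.z && f.n == f.v;    /* GT: signed greater than     */
    case 0xD: return f.z || f.n != f.v;     /* LE: signed less or equal    */
    case 0xE: return true;                  /* AL: always                  */
    default:  return false;                 /* 0xF: assumed "never" here   */
    }
}
```

In the gcd loop, the flags set by cmp decide which of subgt and suble takes effect; the instruction whose condition fails behaves as a no-op.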

2.3.4 Multiple Register Transfer Operation

The ARM multiple register transfer instructions allow any subset of the 16 registers visible in the current operating mode to be loaded from or stored to memory. These instructions are normally used on procedure entry and return to save and restore workspace registers. They are also useful for high-bandwidth memory block copy routines.

Figure 2-4 Multiple register data transfer instruction binary encoding
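The effect of a multiple register transfer can be modelled in C with a 16-bit register list, one bit per register. The sketch below is a simplified model with hypothetical function names; it assumes registers are transferred in ascending order to ascending word addresses and ignores the addressing-mode variants and base register write-back.

```c
#include <stddef.h>
#include <stdint.h>

/* Store every register whose bit is set in reg_list to consecutive words
 * at mem, lowest-numbered register first. Returns the number of words. */
static size_t store_multiple(const uint32_t regs[16], uint16_t reg_list,
                             uint32_t *mem)
{
    size_t n = 0;
    for (unsigned r = 0; r < 16; r++) {
        if (reg_list & (1u << r))
            mem[n++] = regs[r];
    }
    return n;
}

/* Load the registers selected by reg_list back from consecutive words. */
static size_t load_multiple(uint32_t regs[16], uint16_t reg_list,
                            const uint32_t *mem)
{
    size_t n = 0;
    for (unsigned r = 0; r < 16; r++) {
        if (reg_list & (1u << r))
            regs[r] = mem[n++];
    }
    return n;
}
```

A procedure entry would call store_multiple with the workspace registers and the link register selected, and the matching return would call load_multiple with the same list.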

2.3.5 Fold shifts/ rotates into ALU operation

Another unique feature of the ARM ISA is the ability to fold shift and rotate operations into a normal data processing instruction. Shifts and rotates can thus be performed together with arithmetic, logical and register-register move instructions. The operand is shifted before being processed, and the result is stored in a destination register. For example, a+=(j<<2) can be rendered as a single instruction on the ARM.

This results in an ARM program being denser than what would normally be expected from a normal RISC processor. Fewer instructions need to be fetched from memory, thus reducing bandwidth consumption on the memory bus.
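The folded shift can be pictured as a shifter sitting in front of the ALU's second operand. The C sketch below models that idea under simplifying assumptions: the names are hypothetical, only immediate shift amounts of 0 to 31 are handled, and the carry-out and special encodings of the real shifter are ignored.

```c
#include <stdint.h>

/* The four shift kinds that can be applied to the second operand. */
typedef enum { SHIFT_LSL, SHIFT_LSR, SHIFT_ASR, SHIFT_ROR } shift_kind;

/* Shift or rotate a 32-bit value by amount (0..31). */
static uint32_t shift_operand(uint32_t value, shift_kind kind, unsigned amount)
{
    amount &= 31u;
    if (amount == 0)
        return value;
    switch (kind) {
    case SHIFT_LSL:
        return (uint32_t)(value << amount);
    case SHIFT_LSR:
        return value >> amount;
    case SHIFT_ASR: {
        /* Replicate the sign bit into the vacated high bits. */
        uint32_t fill = (value & UINT32_C(0x80000000))
                      ? (uint32_t)~(UINT32_C(0xFFFFFFFF) >> amount)
                      : 0u;
        return (value >> amount) | fill;
    }
    case SHIFT_ROR:
        return (value >> amount) | (uint32_t)(value << (32u - amount));
    }
    return value;
}

/* One "instruction": add a shifted second operand to the first operand. */
static uint32_t add_shifted(uint32_t a, uint32_t j, shift_kind kind,
                            unsigned amount)
{
    return a + shift_operand(j, kind, amount);
}
```

The report's example a+=(j<<2) corresponds to add_shifted(a, j, SHIFT_LSL, 2), which on the ARM would be a single ADD with an LSL #2 applied to the second operand.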

2.4 Operating system support

2.4.1 Coprocessor Number 15

Coprocessor number 15 (CP15) is needed on ARM CPUs used in embedded systems that require full memory management with address translation capabilities. It expands the capability of the ARM core. CP15 is an on-chip coprocessor that controls the operation of the on-chip cache or caches, the memory management or protection unit, the write buffer, the prefetch buffer, the branch target cache and the system configuration signals.

2.4.2 MMU architecture

In general-purpose applications where the range and number of application programs is unknown, the ARM CPU requires a memory management unit with address translation. The MMU translates virtual addresses into physical addresses. It also controls memory access permissions and aborts illegal accesses.

The MMU architecture in a typical ARM processor would use a 2-level page table with table- walking hardware. A TLB is used to store recently used page translation for fast lookup.

The MMU is accessed and controlled through the CP15 registers. In the ARM MMU architecture, memory mapping is performed at several different granularities:
• Sections: 1 MB blocks of memory
• Large pages: 64 KB blocks of memory. Access control is applied to individual 16 KB subpages.
• Small pages: 4 KB blocks of memory. Access control is applied to individual 1 KB subpages.
• Tiny pages: 1 KB blocks of memory
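The block sizes above determine how a 32-bit virtual address splits into a table index and an offset. The sketch below shows the arithmetic for sections and small pages. It is a sketch under assumptions: the names are hypothetical, and the two-level split (a first-level index per 1 MB region, then a second-level index per 4 KB page) is assumed from the 2-level page table described above rather than taken from the report.

```c
#include <stdint.h>

/* Pieces of a virtual address mapped by a small (4 KB) page. */
typedef struct {
    uint32_t l1_index;  /* which 1 MB region: bits 31..20          */
    uint32_t l2_index;  /* which 4 KB page in region: bits 19..12  */
    uint32_t offset;    /* byte within the page: bits 11..0        */
} small_page_va;

/* Section mapping: 1 MB = 2^20 bytes, so the low 20 bits are the offset. */
static uint32_t section_index(uint32_t va)  { return va >> 20; }
static uint32_t section_offset(uint32_t va) { return va & 0xFFFFFu; }

/* Small-page mapping: 4 KB = 2^12 bytes, so the low 12 bits are the offset. */
static small_page_va split_small_page(uint32_t va)
{
    small_page_va s;
    s.l1_index = va >> 20;
    s.l2_index = (va >> 12) & 0xFFu;
    s.offset   = va & 0xFFFu;
    return s;
}
```

For the address 0x12345678, a section mapping uses index 0x123 and offset 0x45678, while a small-page mapping uses first-level index 0x123, second-level index 0x45 and offset 0x678.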

The ARM MMU architecture introduces domains, which are groups of sections and/or pages which have particular access permissions. This enables a number of different processes to run with the same translation tables while retaining protection from each other. Thus, each process need not have its own translation tables.

2.4.3 Synchronization

When a system runs multiple processes that share data structures, there must be a control mechanism to ensure correct behaviour when two or more processes want to read or write the data at the same time.

For example, if process A wants to increment X, it has to read the value of X and then write the incremented value back to X. However, if the operating system interrupts A in between and allows process B to read X and write a new value to it, the variable X ends up with an incorrect value, because neither A nor B sees the other’s update. What should happen is that only one process can access the variable at any one time. The other process must wait until no other process is accessing the data. Mutually exclusive access is required: some sort of lock is needed to prevent another process from accessing the data before the first has finished its operation.

The ARM architecture supports synchronization by providing a swap (SWP) instruction. The instruction is similar to an atomic ‘test and set’ instruction: its read and write of memory cannot be interrupted. A register is set to the ‘busy’ value and is then swapped with the memory location containing the lock. If the loaded value is ‘free’, the process can continue. If it is ‘busy’, the process spins on the lock until it loads the ‘free’ value.

One drawback of this approach is that the swap instruction accesses memory every time it executes. If a process spins on a lock, it takes up a lot of memory bandwidth and thus impedes performance.
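The lock protocol described above can be sketched in portable C, with the C11 atomic_exchange function standing in for the ARM swap instruction. This is an analogy, not ARM code: the type and function names are hypothetical, and the compiler decides which machine instructions implement the exchange.

```c
#include <stdatomic.h>

/* Lock values, following the report's 'free' and 'busy' terminology. */
enum { LOCK_IS_FREE = 0, LOCK_IS_BUSY = 1 };

typedef atomic_int spinlock;

/* Swap 'busy' into the lock once; succeed if the old value was 'free'. */
static int lock_try_acquire(spinlock *lock)
{
    return atomic_exchange(lock, LOCK_IS_BUSY) == LOCK_IS_FREE;
}

/* Spin until the swap returns 'free'. Every iteration accesses memory,
 * which is the bandwidth cost noted above. */
static void lock_acquire(spinlock *lock)
{
    while (!lock_try_acquire(lock))
        ;  /* busy-wait */
}

/* Release the lock by storing 'free'. */
static void lock_release(spinlock *lock)
{
    atomic_store(lock, LOCK_IS_FREE);
}
```

A process would call lock_acquire before reading and incrementing X and lock_release afterwards, so that process B cannot interleave its own update.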

2.4.4 Context switching

A process runs in a context, which consists of all the state (variables, state registers) that the process needs to run properly. This state includes the values of all the processor’s registers, including the program counter and the stack pointer. When a process switch takes place, the context of the old process must be saved and that of the new process restored.

The ‘architectural support’ for register saving and restoring offered on the ARM recognizes the difficulty of saving and restoring user registers from a privileged mode and provides special instructions to assist in this task. Thus, code running in a non-user mode can save and restore the user registers from an area of memory addressed by a non-user mode register.

2.5 AMBA Interface

AMBA is an on-chip bus specification that details a strategy for the interconnection and management of the functional blocks that make up a System-on-Chip (SoC). It is an open standard, so the specification is available to the public. It allows the developer to achieve a first-time-right result when combining and integrating one or more CPUs or signal processors with multiple peripherals. IP (Intellectual Property) developers can thus develop their own products without having to worry about connectivity.

AMBA promotes a reusable design methodology by defining a common backbone for SoC modules.

2.6 ARMv4

2.6.1 Thumb Instruction Set

The Thumb instruction set addresses the issue of code density. It is a compressed form of a subset of the ARM instruction set. Thumb instructions map onto ARM instructions, and the Thumb programmer’s model maps onto the ARM programmer’s model.

Thumb instructions are dynamically decompressed into ARM instructions and then execute as normal ARM instructions. This does not affect performance because the expansion is done by dedicated hardware within the chip.


Figure 2-5 ARM and Thumb visible registers

In the Thumb programmer’s model, there are only 8 general-purpose visible registers, which map to registers r0 to r7 in the ARM programmer’s model. In the ARM programmer’s model, the use of register R13 as the stack pointer is purely a software convention; in Thumb, however, there is no choice because this use is hardwired.

A bit in the CPSR determines the mode of operation (Thumb or ARM). The mode is switched by executing a Branch and Exchange (BX) instruction.
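The state switch performed by BX can be modelled in a few lines of C. The sketch is an assumption-laden model rather than a description from this report: it relies on the standard ARM convention that bit 0 of the branch target selects Thumb state, uses hypothetical names, and ignores the pipeline flush and alignment checks of real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal processor state for the model (illustrative). */
typedef struct {
    uint32_t pc;     /* address of the next instruction            */
    bool     thumb;  /* models the CPSR bit that selects Thumb mode */
} cpu_state;

/* Branch and exchange: bit 0 of the target selects the instruction set,
 * and the remaining bits give the branch address. */
static void branch_exchange(cpu_state *cpu, uint32_t target)
{
    cpu->thumb = (target & 1u) != 0;
    cpu->pc    = target & ~UINT32_C(1);
}
```

Calling branch_exchange with 0x8001 would enter Thumb code at address 0x8000, and a later call with an even address would return to ARM code.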

ARM similarities:
• Load-store architecture
• Support for 8-bit byte, 16-bit half-word and 32-bit word data types
• 32-bit unsegmented memory

ARM differences:
• Most Thumb instructions are executed unconditionally
• Data processing instructions use a 2-address format instead of the 3-address format of the ARM
• Thumb instruction formats are less regular than ARM instruction formats, as a result of the dense encoding

Results (typical case):
• Requires 70% of the space of ARM code
• 40% higher instruction count
• 30% less external memory power than ARM code

The downside of using Thumb is the overhead of switching between 32-bit ARM and 16-bit Thumb execution. When a Branch and Exchange (BX) instruction is executed, the whole pipeline is flushed.

2.7 ARMv5

2.7.1 ARM DSP Extensions

Conventional DSP processors and coprocessors usually consume too much power and require additional chip area. The ARM processor has been found to be suitable even for DSP calculations. The DSP Extensions were introduced to broaden the suitability of the ARM CPU family to applications that require intensive signal processing, while the processor core retains the power efficiency of a high-performance RISC.

The ARM DSP Extensions feature:
• Single-cycle 16x16 and 32x16 MAC implementations
• Zero-overhead saturation extension support
• New instructions to load and store pairs of registers, with enhanced addressing modes
• A new CLZ instruction that improves normalization in arithmetic operations and improves divide performance
• Full support in the ARMv5TE and ARMv6 architectures
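Saturated arithmetic, one of the features listed above, clamps a result to the largest or smallest representable value instead of letting it wrap around. The C sketch below models a 32-bit signed saturating add. The function name and the sticky `saturated` flag are illustrative assumptions, not details given in this report.

```c
#include <stdint.h>

/* Add two signed 32-bit values, clamping the result to the int32_t range.
 * *saturated is set to 1 when clamping occurs and is never cleared here,
 * so it behaves as a sticky flag across a sequence of operations. */
static int32_t saturating_add(int32_t a, int32_t b, int *saturated)
{
    int64_t sum = (int64_t)a + (int64_t)b;  /* cannot overflow in 64 bits */
    if (sum > INT32_MAX) {
        *saturated = 1;
        return INT32_MAX;
    }
    if (sum < INT32_MIN) {
        *saturated = 1;
        return INT32_MIN;
    }
    return (int32_t)sum;
}
```

In audio code this keeps an overloaded sample pinned at full scale rather than wrapping to the opposite sign, which is why saturation support matters for the applications listed in this section.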

Applications:
• Audio encode/decode (MP3, AAC, WMA)
• MPEG4 decode
• Voice and handwriting recognition
• Embedded control
• Bit-exact algorithms (GSM-AMR)

2.7.2 Jazelle

Existing approaches to Java acceleration each have drawbacks.

Java software acceleration
• Optimized Java virtual machines typically offer sufficient memory efficiency. However, they cannot provide adequate performance for high-end applications unless a high-performance processor is used, so cost and power constraints cannot be met.
• JIT (just-in-time) compilers bypass the Java Virtual Machine (JVM) for much of the byte code interpretation. Typical JIT compilers are more than 100 KB in size, taking up a large amount of memory. A JIT compiler is also slow to start, resulting in pauses and disrupted user input.

Java hardware acceleration
• Dedicated Java processors represent a significant overhead and add integration and development complexity. These processors are dedicated to Java execution and must work alongside other processors to support existing applications.
• Java co-processors translate Java byte code into the existing core’s instructions. Co-processors require extra space for their gates and extra power to operate. They tend to run slowly because they are loosely coupled with the core processor.

The ARMv5 architecture added a third instruction set, Java byte code, to the ISA. The Jazelle extension supports Java acceleration technology, which is particularly suited to designs with a small memory footprint. Along with this new instruction set comes additional instruction set support for entering and exiting Java applications, real-time interrupt handling, and debug support for mixed Java/ARM applications.

The Jazelle technology reuses all existing processor resources without the need to redesign the existing architecture or add cost, power or memory resources.
• The J-bit is set in the CPSR to mark the mode of operation
• All processor state related to Java execution is stored in the normal ARM register set
• Any interrupt routine that saves registers on entry and restores them on exit is compatible with Jazelle

In most systems, the JVM is implemented in software and thus runs dramatically slower than a hardware implementation. The Jazelle technology implements the JVM in hardware. In order to reduce die size and improve performance, Jazelle is implemented in the ARM pipeline as an FSM (finite state machine) rather than as a traditional microcoded engine. Surprisingly, the hardware logic amounts to only 12,000 gates.

2.8 ARMv6

In summary, ARMv6 provides the following improvements over previous ARM architectures:
• Media processing extensions
  o 2x faster MPEG4 encode/decode
  o 2x faster audio DSP
• Improved cache architecture
  o Physically addressed caches
  o Reduction in cache flush/refill
  o Reduced overhead in context switches
• Improved exception and interrupt handling
  o Important for improving performance in real-time tasks
• Unaligned and mixed-endian data support
  o Simpler data sharing and application porting, and memory savings

The architecture includes all Thumb, DSP Extensions and Jazelle enhancements. The following are the new extensions which were added to the ARMv6 architecture.

2.8.1 Media Instructions

The media instructions enable more efficient software implementations of high-performance media applications. Over 60 SIMD instructions have been added to the architecture. The SIMD instructions provide performance improvements of between 2x and 4x, depending on the application.

The SIMD instructions will support four 8-bit and two 16-bit operations, parallel add and subtract, selection, packing and unpacking. The new ISA also supports dual 16-bit multiply add/subtract operations. As now one 32-bit register can contain up to four 8-bit values, new status bits have to be defined for each of those values.

Six new status bits have been added to the programmer’s model:
• GE[3:0] bits
  o SIMD status bits: greater than or equal to, for each 8/16-bit slice
• E-bit
  o Indicates the current load/store endian setting of the core; it can be set and cleared with the SETEND instruction
• A-bit
  o Indicates whether imprecise data abort exceptions are masked
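To show what four 8-bit operations in one 32-bit register look like, the sketch below models a parallel unsigned add in plain C, together with one status bit per byte lane. It is a model under assumptions: the function name is hypothetical, and the rule used for the status bits (set when a lane's unsigned sum carries out of 8 bits) is this sketch's reading of the per-slice GE bits, not a definition given in the report.

```c
#include <stdint.h>

/* Add four packed unsigned 8-bit lanes independently. Carries do not
 * propagate between lanes. On return, bit i of *ge is set if lane i's
 * sum did not fit in 8 bits. */
static uint32_t add_u8x4(uint32_t a, uint32_t b, unsigned *ge)
{
    uint32_t result = 0;
    *ge = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = lane * 8u;
        uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        if (sum > 0xFFu)
            *ge |= 1u << lane;              /* lane overflowed 8 bits */
        result |= (sum & 0xFFu) << shift;   /* keep only the low 8 bits */
    }
    return result;
}
```

A SIMD instruction performs all four lane additions at once, where this loop needs four iterations; the per-lane status bits can then drive a byte-wise selection operation.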

ARMv5TE: 5 cycles in a single-cycle implementation

    SMULTT  Real,Ra,Rb       ;Real = Ra.real*Rb.real
    SMULBB  Temp,Ra,Rb       ;Temp = Ra.imag*Rb.imag
    SUB     Real,Real,Temp   ;Real = Ra.real*Rb.real - Ra.imag*Rb.imag
    SMULTB  Imag,Ra,Rb       ;Imag = Ra.real*Rb.imag
    SMLABT  Imag,Ra,Rb,Imag  ;Imag = Ra.real*Rb.imag + Ra.imag*Rb.real

ARMv6: 2 cycles in a single-cycle implementation

    SMUSD   Real,Ra,Rb       ;Real = Ra.real*Rb.real - Ra.imag*Rb.imag
    SMUADX  Imag,Ra,Rb       ;Imag = Ra.real*Rb.imag + Ra.imag*Rb.real

Figure 2-6 16-bit Complex Multiply

The example above shows how code density can be improved on the ARMv6 architecture while reducing the number of clock cycles needed to perform the same operation.

    Architecture    Cycles / 4 pixels
    ARMv5TE         18 cycles
    ARMv6           3 cycles

Figure 2-7 Implementing Sum of Absolute Differences

The table above shows that the Sum of Absolute Differences operation over four pixels takes just 3 clock cycles on the ARMv6 architecture, compared with 18 clock cycles on ARMv5TE.
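For reference, the operation being counted is simple to state in C. The sketch below computes the sum of absolute differences over four 8-bit pixels one at a time; the function name is hypothetical, and the code says nothing about how either architecture schedules the work.

```c
#include <stdint.h>

/* Sum of absolute differences between two groups of four 8-bit pixels. */
static unsigned sad4(const uint8_t a[4], const uint8_t b[4])
{
    unsigned sum = 0;
    for (int i = 0; i < 4; i++) {
        /* Subtract the smaller from the larger to avoid a negative value. */
        sum += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                             : (unsigned)(b[i] - a[i]);
    }
    return sum;
}
```

Motion estimation in video encoders evaluates this sum for many candidate blocks, which is why reducing its cycle count has a large effect on encode speed.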

2.8.2 Thumb2

The Thumb instruction set is an extension to the 32-bit ARM architecture that enables very high code density. This density typically comes at the expense of performance, because code must switch from Thumb back to ARM mode for operations that cannot be performed in Thumb mode. Although a single Thumb instruction is equivalent to a single ARM instruction, more 16-bit Thumb instructions are needed to accomplish the same overall function.

The ARM Thumb-2 core technology:
• Introduces new 16-bit Thumb instructions for improved program flow
• Provides new 32-bit Thumb instructions derived from their ARM instruction equivalents. These instructions provide coprocessor access, privileged instructions and other special functions (SIMD)
• Is accompanied by improvements to the ARM 32-bit ISA

Thus, Thumb instructions are no longer limited to 16-bit instructions but include 32-bit instructions too.

The performance of Thumb-2 technology:
• Performance similar to code based on the ARM ISA
• 5 percent smaller than Thumb high-density code
• 2-3 percent faster than Thumb high-density code

2.9 Summary

Architecture   Thumb®   DSP   Jazelle   Media   TrustZone   Thumb-2
v4T            x
v5TE           x        x
v5TEJ          x        x     x
v6             x        x     x         x
v6Z            x        x     x         x       x
v6T2           x        x     x         x                   x

x: feature supported   T: Thumb   J: Jazelle   E: DSP Instructions   T2: Thumb2

3 ARM Implementations

The principal current ARM processor core products offer a choice of cost, complexity and performance points from which the most effective solution can be selected. Each core is chosen as embedded in a CPU design or a unit.

Core                                         Architecture
ARM1                                         v1
ARM2                                         v2
ARM2aS, ARM3                                 v2a
ARM6, ARM600, ARM610, AMULET1, AMULET2       v3
ARM7, ARM700, ARM710                         v3
ARM7TDMI, ARM710T, ARM720T, ARM740T          v4T
StrongARM, ARM8, ARM810                      v4
ARM9TDMI, ARM920T, ARM940T, AMULET3          v4T
ARM9E-S                                      v5TE
ARM10TDMI, ARM1020E, XScale                  v5TE
ARM11                                        v6

Figure 3-1 ARM Architecture Implementations

3.1 ARM7TDMI

The ARM7 family is now the lowest-end ARM core family and is used for personal audio players, entry-level wireless handsets and two-way pagers. The ARM7TDMI evolved from the ARM6, which was the first core to implement the 32-bit address space programming model.

The name ARM7TDMI stands for:
• ARM7: a 3 volt 32-bit integer core
• T: the Thumb 16-bit instruction set
• D: on-chip Debug support to halt the processor in response to a debug request
• M: an enhanced Multiplier, yielding a 64-bit result instead of a 32-bit result
• I: EmbeddedICE hardware to provide breakpoint and watchpoint support

The ARM7TDMI uses a 3-stage pipeline with a multicycle execution stage.
• Fetch – the instruction is fetched from memory and placed in the instruction pipeline
• Decode – the instruction is decoded and the control signals are prepared
• Execute – the register bank is read, an operand is shifted, and the ALU result is generated and written back to a destination register
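The effect of a multicycle Execute stage on a 3-stage pipeline can be illustrated with a small timing model. The sketch below is an idealized assumption rather than a description of the real core: Fetch and Decode each take one cycle, memory never stalls, and only the Execute stage may occupy several cycles. The function name and the example cycle counts are hypothetical.

```python
def total_cycles(execute_cycles):
    """Cycles needed to run a sequence through a Fetch/Decode/Execute pipeline.

    execute_cycles[i] is the number of cycles instruction i spends in Execute.
    """
    execute_free_at = 0  # first cycle in which the Execute stage is free
    for index, cycles in enumerate(execute_cycles):
        # Instruction `index` is fetched in cycle `index` and decoded in the
        # next cycle at the earliest, so it cannot execute before index + 2.
        earliest = index + 2
        start = max(earliest, execute_free_at)  # wait if Execute is still busy
        execute_free_at = start + cycles
    return execute_free_at
```

With four single-cycle instructions the model gives 3 + 4 - 1 = 6 cycles (`total_cycles([1, 1, 1, 1])`), while one 3-cycle instruction in the middle of three (`total_cycles([1, 3, 1])`) delays everything behind it.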

The ARM7TDMI register bank has two read ports and one write port. One additional read port and one additional write port are provided to give special access to r15, the program counter. Thus, incrementing the program counter does not limit the number of registers which can be read or written.

The following interfaces are supported:
• Memory interface
• MMU interface
• Coprocessor interface
• Debug interface
• JTAG interface

Process        0.35 µm      Transistors   74,209     MIPS     60
Metal layers   3            Core area     2.1 mm²    Power    87 mW
Vdd            3.3 V        Clock         0-66 MHz   MIPS/W   690

Figure 3-2 ARM7TDMI characteristics

3.2 ARM9TDMI

The ARM9TDMI represents a major improvement over the ARM7TDMI. The improvement is achieved by adopting a 5-stage pipeline, instead of the 3-stage pipeline of the ARM7TDMI, to increase the maximum clock rate. Separate instruction and data memory ports are used to allow an improved CPI (cycles per instruction). The ARM9TDMI thus adopts a Harvard-style architecture.

The improvements in the ARM9TDMI owe a lot to the StrongARM pipeline, which is somewhat similar. However, there are two major differences from the StrongARM pipeline:
• The StrongARM has a dedicated branch adder which operates in parallel with the register read stage. The ARM9TDMI uses the main ALU for branch target calculations. This results in an additional clock cycle penalty for a taken branch but achieves a smaller and simpler core.
• The StrongARM was designed for a particular process technology where the timing path could be carefully managed. The ARM9TDMI is more flexible and readily portable to new processes.

Thumb instruction decoding differs from the ARM7TDMI: the hardware now decodes both ARM and Thumb instructions directly, and Thumb instructions are no longer converted to ARM instructions.

The ARM9TDMI employs a static branch prediction scheme.

Although the core has a Harvard style architecture, a single unified memory can still be used. However, doing so requires a complex high-speed memory subsystem, particularly caching. This makes implementation more complicated and draws more power from caching. Thus, most embedded systems provide separate instruction and data memory in their systems.

The ARM9E-S is a synthesizable version of the ARM9TDMI core and is 30% larger than the ARM9TDMI on the same process. It occupies 2.7 mm² on a 0.25 µm CMOS process.

Figure 3-3 ARM7TDMI and ARM9TDMI pipeline comparison

The figure above shows how the pipeline in the ARM7 is reorganized into the 5-stage pipeline of the ARM9. The longer pipeline allows the clock frequency to be doubled on the same process technology. Separate stages are now allocated for data memory access and register write-back. Thus, the next instruction can be executed while the current instruction reads from or writes to memory. Instructions which do not require memory access still take 5 clock cycles to complete, although nothing is done in the Memory stage.

Process        0.25 µm      Transistors   111,000     MIPS     220
Metal layers   3            Core area     2.1 mm²     Power    150 mW
Vdd            2.5 V        Clock         0-200 MHz   MIPS/W   1500

Figure 3-4 ARM9TDMI characteristics

3.3 ARM11

The ARM11 is the first in the new family of ARM11 cores. It is also the first to implement the ARMv6 instruction set architecture. The objective of developing the ARM11 microarchitecture is to meet the needs of the next-generation wireless and portable consumer products while delivering it at low power and low cost.

The ARM11 currently supports cache sizes of 4-64 KB. With the first cores in the family running at 350-500 MHz, the microarchitecture represents a major step in system performance. Future versions will achieve over 1 GHz. The ARM11 allows developers to trade off performance against power to match the particular application.

The ARM11 is available in both synthesizable and semi-custom hard macrocell implementations. Developers can take advantage of their semiconductor processes by using synthesis. However, the hard macro implementations are targeted at the highest performance, speed-sortable applications. These hard macrocells are optimized only for a particular process.

The ARM11 cores have a synthesis-friendly pipeline structure: the HDL implementation of the ARM11 is designed to work with commercially available synthesis tools. Additionally, the microarchitecture features:
• Thumb / Thumb2 – for code compression
• Enhanced DSP – for DSP processing
• Jazelle – for Java acceleration

Figure 3-5 ARM11 pipeline organization

The new microarchitecture has an 8-stage pipeline, resulting in a 40% higher throughput compared to previous cores. A longer pipeline can impair efficiency by introducing excessive delays or latency in the system, so extensive use is made of forwarding in the pipeline. Delays in the pipeline are also avoided by using branch prediction schemes to predict the flow of instructions. The result of these optimizations is the same effective latency as the 5-stage pipeline found in the ARM9 family cores.

There are two prediction schemes used in the ARM11:
• Dynamic – a 64-entry, 4-state branch target address cache is maintained and is used to hold the majority of the most recent branches. If the branch has been encountered before, a prediction is made based on the previous outcome.
• Static – used when the dynamic branch predictor cannot find a record of the branch instruction. If the branch is going backwards, the predictor assumes it is a loop and predicts the branch as taken. If the branch is a forward branch, it is predicted as not taken.
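The interplay of the two schemes can be sketched in Python as follows. Several details are assumptions made for illustration and are not taken from the ARM11 documentation: the "4 states" are modelled as a 2-bit saturating counter, the 64-entry cache is treated as direct-mapped on the word address, and new entries start in a weak state. The class and method names are hypothetical.

```python
class BranchPredictor:
    """Dynamic predictor with a static fallback (illustrative model only)."""

    ENTRIES = 64  # cache size quoted in the text

    def __init__(self):
        # index -> (branch address, counter 0..3); direct mapping is an assumption
        self.table = {}

    def _index(self, address):
        return (address >> 2) % self.ENTRIES  # ARM instructions are word aligned

    def predict(self, address, target):
        """Return True if the branch at `address` is predicted taken."""
        entry = self.table.get(self._index(address))
        if entry is not None and entry[0] == address:
            return entry[1] >= 2          # dynamic: states 2 and 3 mean "taken"
        return target < address           # static: backward taken, forward not

    def update(self, address, taken):
        """Record the actual outcome of the branch at `address`."""
        index = self._index(address)
        entry = self.table.get(index)
        if entry is not None and entry[0] == address:
            counter = min(entry[1] + 1, 3) if taken else max(entry[1] - 1, 0)
        else:
            counter = 2 if taken else 1   # assumed weak initial state
        self.table[index] = (address, counter)
```

A branch seen for the first time is predicted purely from its direction; once `update` has recorded an outcome, the stored counter takes over.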

As seen from Figure 3-5, the ARM11 deploys separate pipelines for the ALU, multiply-accumulate (MAC) and Load/Store (LS) instructions. This enables out-of-order execution, allowing slow operations to continue processing while independent instructions proceed in parallel. Thus, if there is a stall in the load/store pipeline, data processing can still continue in the ALU and MAC datapaths.

The ARM11 has improved memory access. The core supports non-blocking and hit-under-miss operations in the memory system. In a simple pipeline, a cache 'miss' occurs when data is not available at the time an instruction requests it, and the pipeline stalls. On the ARM11, however, a miss is a non-blocking operation: the pipeline is stalled only when three successive misses are encountered.

Implementation of a 64-bit processor is still considered excessive in terms of power and area for the embedded systems market. However, the ARM11 uses 64-bit data paths without the need for a fully 64-bit processor implementation. A 64-bit data bus connects the processor integer unit to the instruction and data caches, and the coprocessors to the integer unit. Thus, two 32-bit instructions can be fetched in a single clock cycle. Additionally, load- and store-multiple instructions can transfer 64 bits (two ARM registers) every cycle. The conclusion is that the ARM11 achieves effectively 64-bit performance at only a 32-bit cost.

Process     0.13 µm     Core area   2.7 mm²      Power    150 mW
Vdd         1.2 V       Clock       0-500 MHz    mW/MHz   0.4

Figure 3-6 ARM11 characteristics

3.4 XScale

XScale is a microprocessor core developed by Intel. The initiative started when Intel took over Digital Semiconductor in 1998, thus inheriting the StrongARM CPU as well. The StrongARM CPU was developed by Digital Equipment Corporation in collaboration with ARM Limited with the objective of creating a high-end processor with much higher performance.

Figure 3-7 Intel XScale Core Architecture Features

The XScale fully implements the integer instruction set architecture of the ARMv5TE. Thus, the XScale supports the Thumb ISA as well as DSP-enhanced operations. This core features a 7-stage integer pipeline, in contrast to the 5-stage pipeline featured in the StrongARM processor. The processor has 32 KB, 32-way set-associative instruction and data caches.

A 128-entry Branch Target Buffer (BTB) is used to predict the outcome of branch type instructions. The buffer provides the storage for the target address of branch type instructions and predicts the next address to present to the instruction cache when the current instruction address is that of a branch.

Similar to the ARM11, a 'hit-under-miss' feature allows execution to continue even while a cache miss is being processed. Improving on the StrongARM, a debug unit for use with Multi-ICE is implemented to support breakpoints and traces.

The XScale core provides a few extensions to the existing ARMv5 architecture to meet the needs of the increasingly demanding embedded systems market.

• A DSP coprocessor (CP0) is added to increase the performance and precision of audio processing algorithms. It contains a 40-bit accumulator and 8 new instructions.
• The existing page table descriptors have been extended with one extra bit: the C & B bits are supplemented by an additional X bit. A P bit is also added in the first-level descriptors to allow an ASSP to identify a new memory attribute.
• Additional functionality has been added to coprocessor 15. CP15 configures the MMU, caches, buffers and other system attributes. Coprocessor 14 is new; CP14 contains the performance monitor registers and the trace buffer registers.
• Other enhancements were also made to the Event Architecture, which include instruction cache and data cache parity error exceptions, breakpoint events and imprecise external data aborts.


Figure 3-8 Intel XScale Pipeline Organization

The longer pipeline of the XScale architecture has several disadvantages:
• Longer branch misprediction penalty. If the prediction is incorrect, a penalty of 4 cycles is imposed, in contrast with only 1 cycle for the StrongARM.
• Large load-use delay. When a value is loaded from memory, the next instruction cannot immediately obtain the value read from memory. There will be some 'bubbles' which need to be filled in, so that time is not wasted on stalling the whole pipeline.
• Certain instructions (LDM, STM) incur a few extra cycles of delay on the Intel XScale core compared to StrongARM processors.
• Decode and lookups are spread out over 2 cycles in the Intel XScale core, instead of 1 cycle in its predecessors.

The pipeline above, which is similar to that of the ARM11, is able to execute memory, MAC and data processing instructions in parallel. Although instructions are issued in order, the main execution, memory and MAC pipelines have different execution times, so instructions may finish out of order.

Register scoreboarding is used in the MAC pipeline. A register dependency occurs when a previous MAC or load instruction is about to modify a register value that has not yet been returned to the register file. Only the destinations of MAC operations and memory loads are scoreboarded.
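The idea of scoreboarding can be sketched in Python as below. This is a generic illustration under stated assumptions, not the XScale implementation: the latencies passed in are hypothetical, registers are identified by simple names, and an instruction is assumed to wait until both its sources and its destination are free of pending writes.

```python
class Scoreboard:
    """Tracks registers whose new value has not reached the register file yet."""

    def __init__(self):
        self.pending = {}  # register name -> cycle at which its result arrives

    def issue(self, cycle, dest, sources, latency, scoreboarded):
        """Issue an instruction no earlier than `cycle`; return the issue cycle.

        `scoreboarded` is True for MAC operations and loads, whose destination
        is marked as pending for `latency` cycles (a hypothetical value).
        """
        registers = list(sources) + [dest]
        # Stall until every register involved has received its pending result.
        ready = max([cycle] + [self.pending.get(r, 0) for r in registers])
        if scoreboarded:
            self.pending[dest] = ready + latency
        return ready
```

For instance, if a load into r1 is issued at cycle 0 with an assumed latency of 3, an instruction reading r1 cannot issue before cycle 3, while an instruction that touches only unrelated registers issues immediately.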

The table below summarizes the features and comparisons among the ARM implementations.

3.5 AMULET

Synchronous design has been bogged down by many problems. As the size of a design increases, it becomes very hard to synchronize the various components on silicon. Because a synchronous design depends on and operates from an externally supplied clock, it is very important for each component to get the timing right. Delays due to large distances on silicon can compromise the validity of data on the wires.

The practical problems incurred by clocked designs:
• Clock skew occurs when different components receive the clock signal at different times due to differences in distance from the clock source. This leads to circuit malfunction if clock frequencies keep increasing.
• Higher clock rates lead to excessive power consumption.
• Electromagnetic interference is caused by the global synchrony of circuits. This can impede the performance and functionality of the circuit which the chip is designed to control.

For these reasons, asynchronous techniques have been used to address the above problems. The AMULET processor cores are fully asynchronous implementations of the ARM architecture. They are self-timed and operate without any externally supplied clock. The AMULET was developed at the University of Manchester as a research project on asynchronous design. The benefits of the AMULET are:
• Clock skew is non-existent because a global clock is not used.
• Transitions only occur in the circuit in response to a request to carry out useful work. The continuous drain by the clock is avoided too, so power savings can be achieved.
• The circuit within the asynchronous design emits less electromagnetic radiation. This is due to the less coherent internal activity within the chip.
• A synchronous chip is designed to perform in worst-case scenarios. With an asynchronous design, there is potential to achieve typical-case performance.

3.5.1 AMULET3

Figure 3-9 AMULET3

AMULET3 is the third-generation asynchronous ARM processor from the AMULET family. It is a fully functional microprocessor with support for interrupts and memory faults. The chip supports ARM architecture version 4T, including the 16-bit Thumb instruction set. One of the objectives of the project is to achieve compatibility with the ARM9TDMI but with an asynchronous design.

Features include:
• 15% fewer cycles/instruction than AMULET2
• Low latency load/store with asynchronous out-of-order completion
• Unrestricted register forwarding
• Branch target prediction and branch fetch suppression
• Very low power "sleep" mode
• Dual ("Harvard") bus interface
• 0.35 µm, 3 layer metal process

Figure 3-10 AMULET3 Pipeline

The 6 pipeline stages of the AMULET3 are as follows:
• Prefetch: instruction prefetch unit, which also includes a branch target buffer (BTB)
• Decode & Register read: instruction decode (ARM and Thumb), register read and forwarding stage
• Execute: execute stage, which includes the shifter, multiplier and ALU
• Data Interface: data memory interface
• Reorder Buffer: the reorder buffer
• Register Write: register result write-back stage
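The role of a reorder buffer between out-of-order completion and in-order register write-back can be sketched generically in Python. This is not the AMULET3 circuit: the tagging scheme, the class and the method names are all hypothetical, chosen only to show how results that arrive in any order are written to the register file in program order.

```python
class ReorderBuffer:
    """Holds results that finish out of order and retires them in program order."""

    def __init__(self):
        self.next_tag = 0      # tag handed to the next instruction at issue
        self.next_retire = 0   # oldest instruction not yet written back
        self.results = {}      # tag -> (register, value), filled in any order

    def allocate(self):
        """Reserve a slot for an instruction as it is issued; returns its tag."""
        tag = self.next_tag
        self.next_tag += 1
        return tag

    def complete(self, tag, register, value):
        """Record a finished result; completion order is unrestricted."""
        self.results[tag] = (register, value)

    def retire(self, register_file):
        """Write back every finished result that is next in program order."""
        retired = []
        while self.next_retire in self.results:
            register, value = self.results.pop(self.next_retire)
            register_file[register] = value
            retired.append(register)
            self.next_retire += 1
        return retired
```

If a slow load is allocated before a fast ALU operation, the ALU result waits in the buffer until the load has completed, and both are then written back in their original order.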

All of the above components operate autonomously. The performance of the microarchitecture is provided below:

Process        0.35 µm      Transistors   113,000    MIPS     120
Metal layers   3            Core area     3 mm²      Power    154 mW
Vdd            3.3 V        Clock         none       MIPS/W   780
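The MIPS/W entries in the characteristics tables can be cross-checked from the MIPS and power figures. The short Python sketch below assumes that MIPS/W is simply MIPS divided by power in watts; the function name is hypothetical, and the input numbers are those quoted in Figure 3-2, Figure 3-4 and the AMULET3 table above.

```python
def mips_per_watt(mips, power_mw):
    """Power efficiency in MIPS/W, given MIPS and power in milliwatts."""
    return mips / (power_mw / 1000.0)


# (MIPS, power in mW) as quoted in the characteristics tables
cores = {
    "ARM7TDMI": (60, 87),
    "ARM9TDMI": (220, 150),
    "AMULET3": (120, 154),
}

for name, (mips, power_mw) in cores.items():
    print(f"{name}: {mips_per_watt(mips, power_mw):.0f} MIPS/W")
```

The computed values agree with the tabulated ones to within rounding, which suggests that the published MIPS/W figures for the ARM9TDMI and AMULET3 were rounded.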


4 Conclusion

This report has provided an insight into the nature of the ARM processor and how its implementations work, specifically for embedded applications. The ARM architecture addresses the needs of embedded systems and is at a completely different level compared to desktop microprocessors (e.g. Itanium, SPARC).

ARM Limited licenses its cores and architectures to other companies which will either integrate the cores into their own products or manufacture a new chip based on the licensed architecture. ARM does not fabricate its own chips.

The different architectures have been discussed, and support for various software issues has also been analysed.

Finally, the various implementations of the architecture were examined, particularly the ARM7TDMI, ARM9TDMI, ARM11 and the Intel XScale. Comparisons were also made among the implementations. An overview of AMULET, the research version of the ARM implementation, was given, providing the motivation for why asynchronous design of microprocessors will be the way of the future.

The ARM processor has combined the benefits of RISC architectures while implementing some non-trivial components. The proper combination of architectures and the simplicity of the ARM core have made ARM cores among the most used processor cores in the world. The ARM has evolved with the needs of the market, particularly the media market (with the advent of the ARMv6).

However, there are still more challenges ahead and we will observe what other variants of the ARM will be revealed in the future.

