COVER FEATURE

Parallelism and the ARM Instruction Set Architecture

Leveraging parallelism on several levels, ARM’s new designs could change how people access technology. With sales growing rapidly and more than 1.5 billion ARM processors already sold each year, software writers now have a huge range of markets in which their ARM code can be used.

John Goodacre and Andrew N. Sloss, ARM

Over the past 15 years, the ARM reduced-instruction-set computer (RISC) processor has evolved to offer a family of chips that range up to a full-blown multiprocessor. Embedded applications' demand for increasing levels of performance and the added efficiency of key new technologies have driven the ARM architecture's evolution.

Throughout this evolutionary path, the ARM team has used a full range of techniques known to computer architecture for exploiting parallelism. The performance and efficiency methods that ARM uses include variable execution time, subword parallelism, DSP-like operations, thread-level parallelism and exception handling, and multiprocessing.

The developmental history of the ARM architecture shows how processors have used different types of parallelism over time. This development has culminated in the new ARM11 MPCore multiprocessor.

RISC FOR EMBEDDED APPLICATIONS

Early RISC designs such as MIPS focused purely on high performance. Architects achieved this with a relatively large register set, a reduced number of instruction classes, a load-store architecture, and a simple pipeline. All of these now fairly common concepts can be found in many of today's processors.

The ARM version of RISC differed in many ways, partly because the ARM processor became an embedded processor designed to be located within a system-on-chip device.1 Although this kept the main design goal focused on performance, developers still gave priority to high code density, low power, and small die size.

To achieve this design, the ARM team changed the RISC rules to include variable-cycle execution for certain instructions, an inline barrel shifter to preprocess one of the input registers, conditional execution, a compressed 16-bit Thumb instruction set, and some enhanced DSP instructions. A short sketch following this list illustrates several of these features.

• Variable cycle execution. Because it is a load-store architecture, the ARM processor must first load data into one of the general-purpose registers before processing it. Given the single-cycle constraint the original RISC design imposed, loading and storing each register individually would be inefficient. Thus, the ARM ISA includes instructions that load and store multiple registers. These instructions take a variable number of cycles to execute, depending on the number of registers the processor is transferring, which is particularly useful for saving and restoring context in a procedure's prologue and epilogue. This directly improves code density, reduces instruction fetches, and reduces overall power consumption.

• Inline barrel shifter. To make each data processing instruction more flexible, either a shift or a rotation can preprocess one of the source registers before the main operation executes.

• Conditional execution. An ARM instruction executes only when it satisfies a particular condition. The condition is placed at the end of the instruction mnemonic and, by default, is set to always execute. This, for example, generates a savings of 12 bytes (42 percent) when the greatest common divisor algorithm is implemented with conditional execution rather than without it.

• 16-bit Thumb instruction set. The condensed 16-bit version of the ARM instruction set allows higher code density at a slight performance cost. Because the Thumb 16-bit ISA is designed as a compiler target, it does not include the orthogonal register access of the ARM 32-bit ISA. Using the Thumb ISA can achieve a significant reduction in program size. In 2003, ARM announced its Thumb-2 technology, which offers a further extension to code density by mixing both 32- and 16-bit instructions in the same instruction stream. To achieve this, the developers incorporated unaligned address accesses into the processor design.

• Enhanced DSP instructions. Adding these instructions to the standard ISA supports flexible and fast 16 × 16 multiply and arithmetic saturation, which lets DSP-specific routines migrate to ARM. A single ARM processor could execute applications such as voice-over-IP without requiring a separate DSP. One example of these instructions, SMLAxy, multiplies the top or bottom 16 bits of a 32-bit register. For example, the processor could multiply the top 16 bits of register r1 by the bottom 16 bits of register r2 and add the result to register r3.

Figure 1 shows how saturation can affect the result of an ADD instruction.2

Figure 1. Nonsaturated and saturated addition. Saturation is particularly useful for digital signal processing because nonsaturated arithmetic wraps around when the integer value overflows, giving a negative result.

  Nonsaturated (ISA v4T)          Saturated (ISA v5TE)
  PRECONDITION                    PRECONDITION
    r0 = 0x00000000                 r0 = 0x00000000
    r1 = 0x70000000                 r1 = 0x70000000
    r2 = 0x7fffffff                 r2 = 0x7fffffff

    ADDS r0, r1, r2                 QADD r0, r1, r2

  POSTCONDITION                   POSTCONDITION
    result is negative              result is positive
    r0 = 0xefffffff                 r0 = 0x7fffffff
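The following sketch is ours, not code from the article; it assumes GCC inline assembly, ARM (A32, not Thumb) state, an ARMv5TE-or-later core for QADD, and invented function names. It illustrates the inline barrel shifter folding a shift into an ADD, the conditionally executed greatest common divisor loop cited in the first bullet above, and the saturating QADD from Figure 1.

    /* Inline barrel shifter: a + (b << 2) is a single
     * ADD Rd, Rn, Rm, LSL #2; the shift costs no extra instruction. */
    static inline unsigned add_scaled(unsigned a, unsigned b)
    {
        unsigned r;
        __asm__ ("add %0, %1, %2, lsl #2" : "=r" (r) : "r" (a), "r" (b));
        return r;
    }

    /* Conditional execution: the classic branch-light GCD body.
     * Only the final BNE branches; both subtractions execute
     * conditionally on the flags set by CMP. Requires a, b > 0. */
    static inline unsigned gcd(unsigned a, unsigned b)
    {
        __asm__ (
            "1:  cmp   %0, %1      \n\t"
            "    subgt %0, %0, %1  \n\t"   /* if (a > b) a -= b; */
            "    sublt %1, %1, %0  \n\t"   /* if (a < b) b -= a; */
            "    bne   1b"
            : "+r" (a), "+r" (b) : : "cc");
        return a;
    }

    /* Enhanced DSP instructions: QADD saturates instead of wrapping,
     * matching the v5TE postcondition shown in Figure 1. */
    static inline int qadd(int a, int b)
    {
        int r;
        __asm__ ("qadd %0, %1, %2" : "=r" (r) : "r" (a), "r" (b));
        return r;
    }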

A Short History of ARM

Formed in 1990 as a joint venture between Acorn Computers, Apple Computer, and VLSI Technology (which later became part of Philips Semiconductors), ARM started with only 12 employees and adopted a unique licensing business model for its processor designs. By licensing rather than manufacturing and selling its chip technology, ARM established a new business model that has redefined the way industry designs, produces, and sells microprocessors. Figure A shows how the ARM product family has evolved.

The first ARM-powered products were the Acorn Archimedes desktop computer and the Apple Newton PDA. The ARM processor was designed originally as a replacement for the MOS Technologies 6502 processor that Acorn used in a range of desktops designed for the British Broadcasting Corporation. When Acorn set out to develop this new replacement processor, the academic community had already begun considering the RISC architecture, and Acorn decided to adopt RISC for the ARM processor. ARM's developers originally tailored the ARM instruction set architecture to efficiently execute Acorn's BASIC interpreter, which was at the time very popular in the European education market.

Acorn Computers developed the ARM1, ARM2, and ARM3 processors. In 1985, Acorn received production quantities of the first ARM1 processor, which it incorporated into a product called the ARM Second Processor, which attached to the BBC Microcomputer through a parallel communication port called "the Tube." These devices were sold mainly as a research tool. ARM1 lacked the common multiply and divide instructions, which users had to synthesize using a combination of data processing instructions. The condition flags and the program counter were combined, limiting the effective addressing range to 26 bits. ARM ISA version 4 removed this limitation.

Figure A. Technology evolution of the ARM product family (1990-2005): from the ARM6 core through the ARM7 family with Thumb technology and the StrongARM processor, the ARM9, ARM10, ARM11, and SecurCore families, TrustZone technology, OptimoDE data engines, MPCore and NEON technology, to Thumb-2 technology, Physical IP, and the Cortex family.

Figure 2. SIMD versus non-SIMD power consumption. The lightweight ARM implementation of SIMD reduces gate count, hence significantly reducing die size, power, and complexity. (The chart plots improvement per MHz, roughly 1.0 to 2.0, for the ARM11 versus the ARM9E and the ARM7 across MPEG4 SP encode/decode, H.263 and H.264 baseline encode/decode, JPEG and MJPEG encode/decode at 1024x768 10:1, MP3 decode, AAC decode, and MPEG4 AAC/AAC LTP workloads.)

Saturation is particularly useful for digital signal processing because nonsaturated arithmetic wraps around when the integer value overflows, giving a negative result. A saturated QADD instruction instead returns the maximum value without wrapping around.

DATA-LEVEL PARALLELISM

Following the success of the enhanced DSP instructions introduced in the v5TE ISA, ARM introduced the ARMv6 ISA in 2001. In addition to improving both data- and thread-level parallelism, other goals for this design included enhanced mathematical operations, exception handling, and endianness handling.

An important factor influencing the ARMv6 ISA design involved increasing DSP-like functionality for overall video handling and 2D and 3D graphics. The design had to achieve this improved functionality while still maintaining very low power consumption. ARM identified the single-instruction, multiple-data (SIMD) architecture as the means for accomplishing this.

SIMD is a popular technique for providing data-level parallelism without compromising code density and power. A SIMD implementation requires relatively few instructions to perform complex calculations with minimum memory accesses.

Due to a careful balancing of computational efficiency and low power, ARM's SIMD implementation involved splitting the standard 32-bit data path into four 8-bit or two 16-bit slices. This differs from many other implementations, which require additional specialized data paths for SIMD operations.

Figure 2 shows the improvement per MHz that various codecs achieve when using the ARMv6 SIMD instructions introduced in the ARM11 processor. The lightweight ARM implementation of SIMD reduces gate count, hence significantly reducing die size, power, and complexity. Further, all the SIMD instructions execute conditionally.

To improve handling of video compression systems such as MPEG and H.263, ARM also introduced the sum of absolute differences (SAD) concept, another form of DLP instruction. Motion estimation compares two blocks of pixels, R(i,j) and C(i,j), by computing

    SAD = Σ |R(i,j) - C(i,j)|

Smaller SAD values imply more similar blocks. Because motion estimation performs many SAD tests with different relative positions of the R and C blocks, video compression systems require very fast and energy-efficient implementations of the sum-of-absolute-differences operation.

The instructions USAD8 and USADA8 compute the absolute differences between packed 8-bit values, which is particularly useful for motion-video-compression and motion-estimation algorithms.
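The following sketch is our illustration, not the article's: the 8x8 block size, the function name, and a little-endian ARMv6 target capable of unaligned word loads are assumptions. It shows how the accumulating USADA8 instruction maps onto the SAD inner loop described above.

    #include <string.h>

    /* Sum of absolute differences over an 8x8 block of 8-bit pixels.
     * Each USADA8 consumes four reference and four candidate pixels and
     * accumulates their byte-wise absolute differences into acc. */
    unsigned sad_8x8(const unsigned char *ref, const unsigned char *cur, int stride)
    {
        unsigned acc = 0;

        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x += 4) {
                unsigned r, c;
                memcpy(&r, ref + y * stride + x, 4);   /* four packed pixels */
                memcpy(&c, cur + y * stride + x, 4);
                /* USADA8 Rd, Rn, Rm, Ra : Rd = Ra + sum(|Rn.byte[i] - Rm.byte[i]|) */
                __asm__ ("usada8 %0, %1, %2, %0" : "+r" (acc) : "r" (r), "r" (c));
            }
        }
        return acc;
    }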

THREAD-LEVEL PARALLELISM

We can view threads as processes, each with its own program counter and register set, that have the advantage of sharing a common memory space. For thread-level parallelism, ARM needed to improve exception handling to prepare for the increased complexity of handling multithreading on multiple processors. These requirements added inherent complexity to the interrupt handler, the scheduler, and the context switch.

One optimization extended the exception handling instructions to save precious cycles during a time-critical context switch. ARM achieved this by adding three new instructions to the instruction set, as Table 1 shows.

Table 1. Exception handling instructions in the ARMv6 architecture.

  Instruction   Description              Action
  CPS           Change processor state   CPS<effect> <iflags>{, #<mode>} or CPS #<mode>
                                         (CPSID and CPSIE variants)
  RFE           Return from exception    RFE<addressing_mode> Rn{!}
  SRS           Save return state        SRS<addressing_mode> #<mode>{!}

Programmers can use the change processor state (CPS) instruction to alter the processor state by setting the current program status register, for example switching to supervisor mode while changing the fast-interrupt mask, as the code in Figure 3 shows. Whereas the ARMv4T ISA required four instructions to accomplish this task, the ARMv6 ISA requires only two.

Figure 3. The ARMv6 ISA change processor state instruction compared with the ARMv4T architecture.

  ARMv4T ISA:
      MRS  r3, CPSR             ; copy the CPSR
      BIC  r3, r3, #MASK|FIQ    ; mask out the mode and FIQ bits
      ORR  r3, r3, #SVC|nFIQ    ; select supervisor mode and the new FIQ setting
      MSR  CPSR_c, r3           ; update the CPSR

  ARMv6 ISA:
      CPSIE f, #SVC             ; change processor state and modify the select bits

Programmers can use the save return state (SRS) instruction to modify the saved program status register (SPSR) of a specific mode. Updating the SPSR of a particular mode in the ARMv4T ISA involved many more instructions than in the ARMv6 ISA. This new instruction is useful for handling context switches or preparing to return from an exception handler.

MULTIPROCESSOR ATOMIC INSTRUCTIONS

Earlier ARM architectures implemented semaphores with the swap instruction, which held the external bus until completion. Obviously, this was unacceptable for thread-level parallelism because one processor could hold the entire bus until completion, locking out all other processors. ARMv6 introduced two new instructions, load-exclusive (LDREX) and store-exclusive (STREX), which take advantage of an exclusive monitor on memory:

• LDREX loads a value from memory and sets the exclusive monitor to watch that location, and
• STREX checks the exclusive monitor and, if no other write has taken place to that location, performs the store to memory and returns a value indicating whether the data was written.

Thus, the architecture can implement semaphores that do not lock the system bus and so leave other processors and threads free to access the memory system.
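As an illustration of how these exclusives are used, the following read-modify-write loop retries until the STREX succeeds. This is our sketch, written in the GCC inline-assembly style of the Linux example shown later in Figure 4; the function name is illustrative, not a claim about any particular kernel's code.

    /* Atomically add i to *counter using an ARMv6 LDREX/STREX retry loop. */
    static inline void atomic_add(int i, volatile int *counter)
    {
        int newval;
        unsigned long failed;

        __asm__ volatile (
            "1:  ldrex   %0, [%2]      \n\t"  /* load and mark the address exclusive */
            "    add     %0, %0, %3    \n\t"  /* compute the new value               */
            "    strex   %1, %0, [%2]  \n\t"  /* try to store; %1 = 0 on success     */
            "    teq     %1, #0        \n\t"
            "    bne     1b"                   /* another write intervened: retry     */
            : "=&r" (newval), "=&r" (failed)
            : "r" (counter), "Ir" (i)
            : "cc", "memory");
    }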
INSTRUCTION-LEVEL PARALLELISM

In ILP, the processor can execute multiple instructions from a single sequence of instructions concurrently. This form of parallelism has significant value in that it provides additional overall performance without affecting the software programming model.

Obviously, ILP puts more emphasis on the compiler, which extracts the parallelism from the source code and schedules the instructions across the superscalar core. Although potentially simplifying otherwise overly complex hardware, an excessive drive to extract ILP and achieve high performance through increased MHz has increased hardware complexity and cost.

The ARM11 was the first hardware implementation of the ARMv6 ISA. It has an eight-stage pipeline with separate parallel pipelines for the load/store and multiply/accumulate operations. With the parallel load/store unit, the ARM1136J-S processor can continue executing without waiting for slower memory, a main gating factor for processor performance.

In addition, the ARM1136J-S processor has physically tagged caches to help with thread-level parallelism (as opposed to the virtually tagged caches of previous ARM processors), which considerably benefits context switches, especially when running large operating systems. A virtually tagged cache must be flushed every time a context switch takes place because the cache contains old virtual-to-physical translations. In the ARM11, the translation logic resides between the level 1 cache and the processor core. The reduction in cache flushing has the additional benefit of decreasing overall power consumption by reducing the external memory accesses that occur with a virtually tagged cache. The physically tagged cache increases overall performance by about 20 percent.

ARM has remained the processor at the edge of the network for several years. This area has always seen the most rapid advancements in technology, with continuous migration away from larger computer systems and toward smaller ones. For example, technology developed for mainframes a decade or two ago is found in desktop computers today. Likewise, technologies developed for the desktop five years ago have begun appearing in consumer and network products. For example, symmetric multiprocessing (SMP) is appearing today in both desktop and embedded computers.

PERFORMANCE VERSUS POWER REQUIREMENTS

For several years, the embedded processor market inherited technology matured in desktop computing as consumers demanded similar functionality in their embedded devices. The continued demand for performance at low power has, however, driven slightly different requirements and led to the overall power budget being minimized by adding multiple processors and accelerators within an embedded design. Today, the demand for high levels of general-purpose computing drives the use of SMP as the application processor in both embedded and desktop systems.

In 2004, both the embedded and desktop markets hit the cost-performance-through-MHz wall. In response, developers began embracing potential solutions that require SMP processing to avoid the following pitfalls:

• High MHz costs energy. Increasing a processor's clock frequency has a quadratic effect on power consumption. Not only does doubling the MHz double the dynamic power required to switch the logic, it also requires a higher operating voltage, which increases at the square of the frequency. Higher frequencies also add to design complexity, greatly increasing the amount of logic the processor requires.

• Extracting ILP is complex and costly. Using hardware to extract ILP significantly raises the cost in silicon area and design complexity, further increasing power consumption.

• Programming multiple independent processors is nonportable and inefficient. As developers use more processors, often with different architectures, the software complexity escalates, eliminating any portability between designs.

In mid-2004, PC manufacturers and chip makers made several announcements heralding the end of the MHz race in desktop processors and championing the use of multicore SMP processors in the server realm, primarily through the introduction of hyperthreading in the Intel Pentium processor. At that time, ARM announced its ARM11 MPCore multiprocessor core as a key solution to help address the demand for performance scalability.

Introduced alongside the ARM11 MPCore, a set of enhancements to the ARMv6 architecture provides further support for advanced SMP operating systems. In its move to support richer SMP-capable operating systems, ARM applied these enhancements, known as ARMv6K or AOS (for Advanced OS Support), across all ARMv6-architecture-based application processors to provide a firm foundation for SMP software.

The ARM11 multiprocessor also addressed the SMP system design's two main bottlenecks:

• interprocessor communication, with the integration of the new ARM Generic Interrupt Controller (GIC), and
• cache coherence, with the integration of the Snoop Control Unit (SCU), an intelligent memory-communication system.

These logic blocks deliver an efficient, hardware-coherent, single-chip SMP processor that manufacturers can build cost-effectively.

PREPARATIONS FOR ARM MULTIPROCESSING

To fully realize the advantages of a multiprocessor hardware platform in general-purpose computing, ARM needed to provide a cache-coherent, symmetric software platform with a rich instruction set. ARM found that a few key enhancements to the current ARMv6 architecture could offer the significant performance boost it sought.

Enhanced atomic instructions

Developers can use the ARMv6 load and store exclusives to implement both swap-based and compare-and-exchange-based semaphores to control access to critical data. In the traditional computing world of SMP there has, however, been significant software investment in optimizing SMP code using lock-free synchronization. This work has been dominated by the x86 architecture and its atomic instructions, which developers can use to compare and exchange data.

Many favored using the cmpxchg8b instruction in these lock-free routines because it can exchange and compare 8 bytes of data atomically. Typically, this involved 4 bytes for payload and 4 bytes to distinguish between payload versions that could otherwise have the same value (the so-called A-B-A problem).

The ARM exclusives provide atomicity using the data address rather than the data value, so the routines can atomically exchange data without experiencing the A-B-A problem. Exploiting this would, however, require rewriting much of the existing two-word exclusive code. Consequently, ARM added instructions for performing load and store exclusives using various payload sizes, including 8 bytes, thus ensuring the direct portability of existing multithreaded code.
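The following compare-and-swap sketch is our illustration (the function name and the 32-bit payload are assumptions). It shows the LDREX/STREX pattern such ports build on, with the exclusive monitor, rather than a value comparison alone, providing the atomicity; the ARMv6K doubleword exclusives (LDREXD/STREXD) follow the same pattern for 8-byte payloads.

    /* Compare-and-swap built from ARMv6 exclusives.
     * Returns nonzero if *addr was changed from expected to desired. */
    static inline int cas32(volatile int *addr, int expected, int desired)
    {
        int observed;
        unsigned long failed;

        do {
            __asm__ volatile (
                "ldrex   %0, [%2]       \n\t"   /* load and mark exclusive         */
                "mov     %1, #1         \n\t"   /* assume we will not store        */
                "teq     %0, %3         \n\t"
                "strexeq %1, %4, [%2]"           /* store only if the value matched */
                : "=&r" (observed), "=&r" (failed)
                : "r" (addr), "r" (expected), "r" (desired)
                : "cc", "memory");
            /* Retry only when the value matched but another write stole the
             * reservation; a plain mismatch is reported back to the caller. */
        } while (failed && observed == expected);

        return failed == 0;
    }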

Improved access to localized data

When an OS encounters the increasing number of threaded applications typical of SMP platforms, it must consider the performance overheads of associating thread-specific state with the currently executing thread. This can involve, for example, knowing which CPU a thread is executing on, accessing kernel structures specific to a thread, and enabling thread access to local storage. The AOS enhancements add registers that help with these SMP performance aspects.

CPU number. Using the standard ARM system coprocessor interface, software on a processor can execute a simple, non-memory-accessing instruction to identify the processor on which it executes. Developers use this as an index into kernel structures.

Context registers. SMP operating systems handle two key demands from the kernel when providing access to thread-specific data. The ARMv6K architecture extensions define three additional system coprocessor registers that the OS can manage for whatever purpose it sees fit. Each register has a different access level:

• user and privileged read/write accessible;
• read-only in user mode, read/write privileged accessible; and
• privileged-only read/write accessible.

The exact use of these registers is OS-specific. In the Linux kernel and GNU toolchain, the ARM application binary interface has assigned these registers to enable thread-local storage (TLS). A thread can use TLS to rapidly access thread-specific memory without losing any of the general-purpose registers.

To support TLS in C and C++, the new keyword __thread has been defined for use in defining and declaring variables. Although not an official extension of the language, the keyword has gained support from many compiler writers. Variables defined and declared this way are automatically allocated locally to each thread:

    __thread int i;
    __thread struct state s;
    extern __thread char *p;

Supporting TLS is a key requirement for the new Native POSIX Thread Library (NPTL) released as part of the Linux 2.6 kernel. This POSIX thread library provides significant performance improvements over the old Linux pthread library.
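A minimal sketch of the two non-memory-accessing reads described above. This is our code: the wrapper names are invented and the coprocessor encodings assume an ARMv6K/ARM11 MPCore target.

    /* Read the CPU number from the CP15 CPU ID/affinity register
     * (MRC p15, 0, Rd, c0, c0, 5). On the ARM11 MPCore the low bits
     * identify which of the (up to four) CPUs is executing. */
    static inline unsigned current_cpu(void)
    {
        unsigned id;
        __asm__ ("mrc p15, 0, %0, c0, c0, 5" : "=r" (id));
        return id & 0x3;
    }

    /* Read the user read-only thread ID register (c13, c0, 3), which the
     * kernel loads with the current thread's TLS pointer on context switch. */
    static inline void *current_tls_base(void)
    {
        void *tp;
        __asm__ ("mrc p15, 0, %0, c13, c0, 3" : "=r" (tp));
        return tp;
    }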
Power-conscious spin-locks

Another SMP system cost involves the synchronization overhead required when processors must access shared data. At the lowest abstraction level in most SMP synchronization mechanisms, a spin-lock software technique uses a value in memory as a lock. If the memory location contains some predefined value, the OS considers the shared resource locked; otherwise it considers the resource unlocked. Before any software can access the shared resource, it must acquire the lock, and an atomic operation must acquire it. When the software finishes accessing the resource, it must release the lock.

In an SMP OS, processors often must wait while another processor holds a lock. The spin-lock received its name because it accomplishes this waiting by causing the processor to spin around a tight loop while continually attempting to acquire the lock. A later refinement to reduce bus contention added a back-off loop during which the processor does not attempt to access the lock. In either case, in a power-conscious design these unproductive cycles obviously waste energy.

The AOS extensions include a new instruction pair that lets a processor sleep while waiting for a lock to be freed and that, as a result, consumes less energy. The ARM11 multiprocessor implements these instructions in a way that provides next-cycle notification to the waiting processor when the lock is freed, without requiring a back-off loop. This results in both energy savings and a more efficient spin-lock implementation. Figure 4 shows a sample implementation of the spin-lock and unlock code used in the ARM Linux 2.6 kernel.

Figure 4. Power-conscious spin-lock. This sample implementation shows the spin-lock and unlock code used in the ARM Linux 2.6 kernel.

    static inline void _raw_spin_lock(spinlock_t *lock)
    {
        unsigned long tmp;

        __asm__ __volatile__(
    "1:  ldrex   %0, [%1]\n"        /* exclusive read of the lock               */
    "    teq     %0, #0\n"          /* check whether it is free                 */
    "    wfene\n"                   /* if not, wait for an event (saves power)  */
    "    strexeq %0, %2, [%1]\n"    /* attempt to claim the lock                */
    "    teqeq   %0, #0\n"          /* were we successful?                      */
    "    bne     1b"                /* no, try again                            */
        : "=&r" (tmp)
        : "r" (&lock->lock), "r" (1), "r" (0)
        : "cc", "memory");

        rmb();   /* read memory barrier stops speculative reading of the payload;
                    this is a NOP on MPCore since dependent reads are synchronized */
    }

    static inline void _raw_spin_unlock(spinlock_t *lock)
    {
        wmb();   /* data write memory barrier: ensures the payload writes are
                    visible before the lock is released; it orders the data but
                    does not necessarily stall the processor */

        __asm__ __volatile__(
    "    str     %1, [%0]\n"                /* release the spinlock          */
    "    mcr     p15, 0, %1, c7, c10, 4\n"  /* Drain Store Buffer (DSB)      */
    "    sev"                               /* signal any CPU waiting on WFE */
        :
        : "r" (&lock->lock), "r" (0)
        : "cc", "memory");
    }
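A hypothetical usage sketch of our own: the lock name, the payload, and the direct calls to the _raw_* routines are illustrative only; kernel code would normally go through the higher-level spin_lock()/spin_unlock() wrappers.

    static spinlock_t stats_lock = SPIN_LOCK_UNLOCKED;   /* 2.6-era initializer        */
    static unsigned long packets_seen;                   /* payload the lock protects  */

    void count_packet(void)
    {
        _raw_spin_lock(&stats_lock);     /* may WFE-sleep until the lock frees        */
        packets_seen++;                  /* payload update, ordered by rmb()/wmb()    */
        _raw_spin_unlock(&stats_lock);   /* DSB plus SEV wakes any waiting CPU        */
    }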

Weakly ordered memory consistency

The ARMv6 architecture defined various memory consistency models for the different definable memory regions. In the ARM11 multiprocessor, spin-lock code uses coherently cached memory to store the lock value. As a multiprocessor, the ARM11 MPCore is the first ARM processor to fully expose weakly ordered memory to the programmer. The multiprocessor uses three instructions to control weakly ordered memory's side effects:

• wmb(). This Linux macro creates a write-memory barrier that the multiprocessor can use to place a marker in the sequencing of any writes around the barrier instruction. The spin-lock, for example, executes this instruction prior to unlocking to ensure that any writes to the payload data complete before the write that releases the spin-lock, and hence before any other processor can acquire the lock. To ensure higher performance, the barrier does not necessarily stall the processor by flushing data. Rather, it informs the load-store unit and lets execution continue in most situations.

• rmb(). Again from the Linux kernel, this macro places a read-memory barrier that prevents speculative reads of the payload from occurring before the read that acquires the lock. Although legal in the ARMv6 architecture, this level of weak ordering can make it difficult to ensure software correctness. Thus, the ARM11 multiprocessor implements only nonspeculative read-ahead. When the possibility exists that a read will not be required, as in the spin-lock case, where a branch instruction sits between the teqeq instruction and any payload read, the read-ahead does not take place. So, for the ARM11 MPCore multiprocessor, this macro can be defined as empty.

• DSB (Drain Store Buffer). The ARM architecture includes a store buffer visible only to the processor. When running uniprocessor software, the processor allows subsequent reads to scan the data in this buffer. In a multiprocessor, however, this buffer is invisible to reads from other processors. The DSB drains this buffer into the L1 cache. In the ARM11 multiprocessor, which has a coherent L1 cache between the processors, the flush only needs to proceed as far as the L1 memory system before another processor can read the data. In the spin-lock unlock code, the processors issue the DSB immediately prior to the SEV (Set Event) instruction so that any processor can read the correct value for the lock upon awakening.
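To make the ordering requirement concrete, here is a small publish/consume sketch of our own. The message type, the flag, and the use of GCC's __sync_synchronize() as a conservative stand-in for the kernel's wmb()/rmb() macros are all assumptions for illustration.

    /* Conservative stand-ins for the Linux barrier macros discussed above;
     * a full barrier is stronger than strictly required here. */
    #define wmb()  __sync_synchronize()
    #define rmb()  __sync_synchronize()

    struct message { int words[4]; };

    static struct message payload;          /* the data being handed over   */
    static volatile int   payload_ready;    /* the flag guarding the data   */

    void publish(const struct message *m)
    {
        payload = *m;
        wmb();               /* payload writes must be visible before the flag */
        payload_ready = 1;
    }

    int consume(struct message *out)
    {
        if (!payload_ready)
            return 0;
        rmb();               /* do not read the payload ahead of the flag read */
        *out = payload;
        return 1;
    }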

Figure 5. ARM11 MPCore. This multiprocessor integrates the new ARM GIC inside the core to make the interrupt system's access and effects much closer and more efficient. (The block diagram shows a configurable number of hardware interrupt lines and private FIQ lines feeding an interrupt distributor; per-CPU aliased peripherals, comprising a timer, watchdog, IRQ, and interrupt/CPU interface for each of the one to four symmetric CPU/VFP cores with their L1 memories; and a snoop control unit (SCU) and private peripheral bus connecting the MPCore to a primary 64-bit AXI read/write bus and an optional second 64-bit AXI read/write bus.)

ARM11 MPCORE MULTIPROCESSOR

The ISA's suitability is not the only factor affecting the multiprocessor's ability to actually deliver the scalability promises of SMP. If they are poorly implemented, two aspects of an SMP design can significantly limit peak performance and increase the energy costs associated with providing SMP services:

• Cache coherence. Developers typically provide the single-image SMP OS with coherent caches so it can maintain performance by placing its data in cached memory. In the ARM11 multiprocessor, each CPU has its own L1 instruction and L1 data cache. Existing coherency schemes often extend the system bus with additional signals to control and inspect other CPUs' caches. In an embedded system, the system bus often clocks slower than the CPU. Thus, besides placing a bottleneck between the processor and its cache, this scheme significantly increases the bus traffic and hence the energy the bus consumes. The ARM11 MPCore addresses these problems by implementing an intelligent SCU between the processors. Operating at CPU frequency, this configuration also provides a very rapid path for data to move directly between the CPUs' caches.

• Interprocessor communication. An SMP OS requires communication between CPUs, and the system must often regulate that communication using a spin-lock that synchronizes access to a protected resource. Other SMP OS communication between CPUs is best accomplished without accessing memory at all. Systems frequently must also synchronize asynchronously. One such mechanism uses the device's interrupt system to cause activity on a remote processor. These software-initiated interprocessor interrupts (IPIs) typically use an interrupt system designed to interface with I/O peripherals rather than with another CPU.

Figure 5 shows how the ARM11 MPCore integrates the new ARM GIC inside the core to make the interrupt system's access and effects closer and more efficient. ARM designed the GIC to optimize the cost of the key forms of IPI used in an SMP OS.

Interrupt subsystem

A key example of IPI use in SMP involves a multithreaded application that affects some state within the processor that is not hardware-coherent with the other processors on which the application has threads running. This can occur when, for example, the application allocates some additional memory. To maintain consistency, the OS

July 2005 49 also must apply these memory translations to all state. This optimization also causes the processor to other processors. In this example, the OS would transfer the cache line directly to the other processor typically apply the translation to its processor and without intervening external memory operations. then use the low-contention private peripheral bus This ability to move shared data directly between to write to an interrupt control register in the GIC processors provides a key feature that programmers that causes an interrupt to all other processors. The can use to optimize their software. When defining other processors could then use this interrupt’s ID data structures that processors will share, pro- to determine that they need to update their mem- grammers should ensure appropriate alignment and ory translation tables. packing of the structure so that line migration can The GIC also uses various software-defined pat- occur. Also, if the programmers use a queue to dis- terns to route interrupts to specific processors tribute work items across processors, they should through the interrupt distributor. In addition to ensure that the queue is an appropriate length and their dynamic load balancing of applications, SMP width so that when the worker processor picks up OSs often also dynamically balance the interrupt the work item, it will transfer it again through this handler load. The OS can use the per-processor cache-to-cache transfer mechanism. To aid with this aliased control registers in the local private periph- level of optimization, the MPCore includes hard- eral bus to rapidly change the destination CPU for ware instrumentation for many operations within any particular interrupt. both the traditional L1 cache and the SCU. Another popular approach to interrupt distribu- tion sends an interrupt to a defined group of proces- sors. The MPCore views the first processor to he ARMv6K ISA can be considered a key mul- accept the interrupt, typically the least loaded, as tiprocessor-aware instruction set. With its being best positioned to handle the interrupt. This T foundation in low-power design, the archi- flexible approach makes the GIC technology suit- tecture and its implementation in the ARM11 able across the range of ARM processors. This stan- MPCore can bring low power to high-performance dardization, in turn, further simplifies how designs. These new designs show the potential to software interacts with an interrupt controller. truly change how people access technology. With more than 1.5 billion ARM processors being sold Snoop control unit each year, there is a huge range of markets in which The MPCore’s SCU is an intelligent control ARM developers can use their software code. I block used primarily to control data cache coher- ence between each attached processor. To limit the power consumption and performance impact from References snooping into and manipulating each processor’s 1. D. Seal, ARM Architecture Reference Manual, 2nd cache on each memory update, the SCU keeps a ed., Addison-Wesley Professional, 2000. duplicate copy of the physical address tag (pTag) 2. A. Sloss et al., ARM System Developer’s Guide, Mor- for each cache line. Having this data available gan Kauffman, 2004. locally lets the SCU limit cache manipulations to processors that have cache lines in common. John Goodacre is a program manager at ARM with The processor maintains cache coherence with responsibility for multiprocessing. 
The ARMv6K ISA can be considered a key multiprocessor-aware instruction set. With its foundation in low-power design, the architecture and its implementation in the ARM11 MPCore can bring low power to high-performance designs. These new designs show the potential to truly change how people access technology. With more than 1.5 billion ARM processors being sold each year, there is a huge range of markets in which ARM developers can use their software code.

References
1. D. Seal, ARM Architecture Reference Manual, 2nd ed., Addison-Wesley Professional, 2000.
2. A. Sloss et al., ARM System Developer's Guide, Morgan Kaufmann, 2004.

John Goodacre is a program manager at ARM with responsibility for multiprocessing. His interests include all aspects of both hardware and software in embedded multiprocessor designs. Goodacre received a BSc in computer science from the University of York. Contact him at [email protected].

Andrew N. Sloss is a principal engineer at ARM. His research interests include exception handling methods, embedded systems, and computer architecture. Sloss received a BSc in computer science from the University of Hertfordshire. He is also a Chartered Engineer and a Fellow of the British Computer Society. Contact him at [email protected].
