
Module Introduction

Purpose • The intent of this module is to provide you with an overview of the i.MX31 CPU complex.

Objectives • Describe the ARM1136 core platform. • Identify features of the ARM1136JF-S. • Describe the two levels of caches in the CPU complex. • Identify the purpose of the Smart Speed™ switch.

Content • 15 pages • 3 questions

Learning Time • 25 minutes

The intent of this module is to provide you with an overview of the CPU complex of the i.MX31 processor. You will learn about the ARM1136JF-S™ processor, the write strategy, and the Level 2 (L2) cache system. You will also learn about the Smart Speed™ switch and the Vector Floating-Point (VFP) co-processor. Unless specifically mentioned, all information in this module applies to both the i.MX31 and the i.MX31L.

ARM1136 Core Platform

[Figure: ARM1136 core platform block diagram. Labels: 1136 Core, FPU, ETM, ETB (4 Kbytes), I Cache (16 Kbytes), D Cache (16 Kbytes), L2 Cache Cntrl, L2 Cache (128 Kbytes), Interrupt, Smart Speed™ Switch (Multi AHB Crossbar), Primary AHB, Alternate AHB 1,2,3 Bus Masters, Patch, Mem Ctl, Peripheral I/F 1, Peripheral I/F 2.]

Let’s start by looking at the Freescale ARM®1136 core platform. The CPU complex of the i.MX31 consists of the ARM1136JF-S processor, an L2 cache system, the Smart Speed switch, and an ARM11™ Vector Interrupt Controller (AVIC).

The multilevel cache system consists of a powerful L2 Cache Controller (L2CC) that has been optimized by ARM® to Freescale specifications with 128 Kbytes of unified L2 cache memory and an integrated L2 cache monitor. The L1 cache provides 16 Kbytes for instruction and 16 Kbytes for data.

The Smart Speed switch, otherwise known as the 6 × 5 Multi-Layer AHB Crossbar switch (MAX), allows up to five transactions to occur in parallel, giving aggregate performance equivalent to a bus running at up to 665 MHz.
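The 665 MHz figure is consistent with five parallel transfers on a 133 MHz bus. The following is a minimal sketch of that arithmetic. The 133 MHz bus clock is an assumption made for illustration (this module does not state the AHB clock), and the function names are hypothetical.

```python
# Sketch: where an "up to 665 MHz" equivalent bus figure could come from.
# ASSUMPTION: every crossbar path is clocked at 133 MHz (not stated in this module).
AHB_CLOCK_MHZ = 133
MAX_PARALLEL_TRANSFERS = 5   # from the module: up to five transactions in parallel
PORT_WIDTH_BYTES = 4         # from the module: 32-bit transfers


def equivalent_bus_mhz(clock_mhz: int, parallel: int) -> int:
    """Clock a single shared bus would need to match the crossbar's peak rate."""
    return clock_mhz * parallel


def peak_mbytes_per_s(clock_mhz: int, parallel: int, width_bytes: int) -> int:
    """Idealized peak: one transfer per clock on every active path."""
    return clock_mhz * parallel * width_bytes


print(equivalent_bus_mhz(AHB_CLOCK_MHZ, MAX_PARALLEL_TRANSFERS))  # 665
print(peak_mbytes_per_s(AHB_CLOCK_MHZ, MAX_PARALLEL_TRANSFERS, PORT_WIDTH_BYTES))  # 2660
```

The peak is only reached when five masters target five different slaves at the same time; two masters addressing the same slave must be arbitrated.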

The VFP11 Floating Point Unit (FPU) is an ARM-enhanced IEEE 754 numeric co-processor that can be used to support and enhance 3D graphics, gaming, high resolution audio, Java™ and other general-purpose applications.

ARM1136 Core Features

• High performance core platform

• ARM1136 core with:
– 8 stage pipeline
– 16 Kbyte instruction and 16 Kbyte data caches
– 64-bit data paths to memory offers increased bandwidth
– Jazelle hardware for Java acceleration

• Vector Floating Point Unit (VFP)

• Trace module with 4 Kbyte buffer for SW debug

• Freescale Smart Speed switch:
– 5 simultaneous 32-bit transfers increases performance
– Programmable priorities optimize system performance

• 128 Kbytes L2 cache for up to 30 percent increased system performance
– Freescale was the lead partner with ARM

• Enhanced hardware-assisted interrupts for faster response

• Flexible techniques

• Dynamic voltage frequency scaling modes:
– High speed: 532 MHz @ 1.45V
– Medium speed: 400 or 266 MHz @ 1.1V
– Idle speed: 133 MHz @ 1.1V
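To give a feel for why the dynamic voltage frequency scaling modes matter, the sketch below compares the listed operating points using the common CMOS approximation that dynamic power scales with f × V². This is an illustration under that assumption only: it ignores leakage and is not a measured i.MX31 power figure. The dictionary keys and function names are hypothetical.

```python
# Operating points from the feature list above: (frequency in MHz, core voltage).
OPERATING_POINTS = {
    "high": (532, 1.45),
    "medium_400": (400, 1.1),
    "medium_266": (266, 1.1),
    "idle": (133, 1.1),
}


def relative_dynamic_power(freq_mhz: float, volts: float) -> float:
    """Dynamic power up to a constant factor (ASSUMPTION: P ~ f * V^2)."""
    return freq_mhz * volts ** 2


def power_ratio(point_a: str, point_b: str) -> float:
    """How many times more dynamic power point_a draws than point_b."""
    fa, va = OPERATING_POINTS[point_a]
    fb, vb = OPERATING_POINTS[point_b]
    return relative_dynamic_power(fa, va) / relative_dynamic_power(fb, vb)


for name in OPERATING_POINTS:
    print(f"{name}: {power_ratio(name, 'idle'):.2f}x the idle-point dynamic power")
```

Under this approximation, the high-speed point runs four times faster than the idle point but draws roughly seven times the dynamic power, because the voltage term is squared.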

Reference material for previous page

ARM1136JF-S Processor

Let’s look at the core of the ARM11 platform, which is the ARM1136JF-S processor. In this module, it is referred to as simply the “ARM11.”

The ARM11 incorporates an integer unit that implements the ARM V6 architecture. It supports the ARM and Thumb™ instruction sets, Jazelle™ technology to enable direct execution of Java byte codes, and a range of SIMD DSP instructions that operate on 16-bit or 8-bit data values in 32-bit registers.

ARM1136JF-S Features

• Synthesizable

• ARM V6 architecture:
– ARM, THUMB, Jazelle
– Mixed endian support
– Unaligned data support
– Physically addressed caches
– Media extensions

• High performance core:
– 8-stage pipeline
– Branch prediction
– Return stack

• VFP co-processor

• Fast Interrupt mode

• 16 Kbyte I- and D- caches

ARM1136JF-S: Key Benefits

• ARM V6 architecture
– ARM and Thumb instruction sets
– Jazelle technology enabling direct execution of Java byte codes
– A range of SIMD DSP instructions which operate on 16-bit or 8-bit data values in 32-bit registers

• Power and area efficient

• Synthesizable design

• Complete set of supporting system IP

• Backwards compatible with previous ARM processors

• Provides full capabilities

• tagging for caches and Application Space Identifiers (ASIDs)
– Reduces overhead on context switches
– Reduces cache invalidation and refill
– Saves cycles and power
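The SIMD DSP instructions mentioned above treat one 32-bit register as four 8-bit lanes or two 16-bit lanes. The sketch below emulates a lane-wise unsigned add in that style, similar in spirit to the ARMv6 UADD8 and UADD16 instructions. It is an illustration only: status flags are not modeled, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def lane_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 32-bit words lane by lane; a carry never crosses a lane boundary."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= (lane_sum & lane_mask) << shift  # wrap within the lane
    return result & MASK32


def add_4x8(a: int, b: int) -> int:
    """Four independent 8-bit additions in one 32-bit word."""
    return lane_add(a, b, 8)


def add_2x16(a: int, b: int) -> int:
    """Two independent 16-bit additions in one 32-bit word."""
    return lane_add(a, b, 16)


print(f"{add_4x8(0x01FF7F10, 0x01010101):#010x}")   # 0x02008011
print(f"{add_2x16(0xFFFF0001, 0x00010001):#010x}")  # 0x00000002
```

In the first example the 0xFF lane wraps to 0x00 without disturbing its neighbor, which is the difference from an ordinary 32-bit add.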

ARM V6 Benefits

Improved:

• CPU efficiency and performance

• Multimedia performance

• Real-time performance

• Data sharing with non-ARM execution units

• Application portability from non-ARM processors

• Unaligned and mixed endian support

The ARM1136 is the first processor implementation of the ARM V6 architecture. Let’s look at how this architecture improves some CPU and multimedia functionalities.

The ARM V6 architecture improves CPU efficiency and performance as well as multimedia performance, with media processing extensions, two times faster MPEG-4 encode/decode, and faster audio DSP than the ARM926. Another improvement is real-time performance, which includes faster exception and interrupt handling, vectored interrupt support, reduced mode, and new stack and mode change instructions that give a three times faster interrupt entry. The ARM V6 architecture also improves data sharing with non-ARM execution units, application portability from non-ARM processors, and unaligned and mixed endian support.

ARM1136 Processor Block Diagram

[Figure: ARM1136 processor block diagram. Labels: Debug, JTAG, ETM, External Co-processor Interface, VIC Interface, VFP, Prefetch Unit, ARM 11 Core, LSU, I Cache, D Cache, Main TLB, System Metrics, L1 Instruction Side Cache Controller, L1 Data Side Cache Controller, Instruction Fetch, DRead, DWrite, Peripherals.]

Let’s continue to examine the ARM11 processor by looking at a functional block diagram. The features include an integer unit, an eight-stage pipeline, branch prediction with a return stack, low interrupt latency, an external co-processor interface and co-processors 14 and 15, instruction and data MMUs (managed using micro TLB structures backed by a unified main TLB), and instruction and data caches (including a non-blocking D cache with Hit-Under-Miss). Note that the caches are virtually indexed and physically addressed, and there is a 64-bit interface to both caches.

Other features include a write buffer that can be bypassed, a high-speed Advanced Microcontroller Bus Architecture (AMBA) L2 interface supporting prioritized multiprocessor implementations, an AMBA bus interface (AHB-lite protocol), a Floating Point co-processor, trace support, JTAG-based debug, and a Load Store Unit (LSU).

The ARM11 processor features vectored interrupt support to quickly determine the interrupt source and branch to the interrupt service routine. The ARM11 solution contains an interrupt vector port and a Vector Interrupt Controller (VIC).

Question

The ARM1136 is the first processor implementation of the ARM V6 architecture. What are some of the ARM V6 architecture improvements? Select all that apply and then click Done.

CPU efficiency and performance

Vectored interrupt support

Data sharing with non-ARM execution units

5-stage pipeline


Consider this question concerning the ARM1136 processor.

Correct.

The ARM V6 architecture improvements include CPU efficiency and performance; real-time performance, which includes vectored interrupt support; and data sharing with non-ARM execution units. The ARM11 processor contains an eight-stage pipeline, not a five-stage pipeline.

ARM V6 Memory Model

[Figure: ARM V6 memory model. Labels: ARM Core (R0 to R15, Instruction Prefetch, Load, Store), CP15 Configuration/Control, Virtual Address, Address Translation, Physical Address, Level 1 Caches, Level 2 Cache, SRAM and ROM (ARM Platform), EMI, DRAM, SRAM, Flash, ROM (SOC), Additional Processor(s).]

• Level 1 cache memory fully defined in ARM V6

• Hierarchy and memory order support for Level 2 cache

Now let’s look at the V6 memory model to explore cache in greater detail. There are two levels of caches in the CPU complex. Level 1 (L1) consists of separate instruction and data caches, a write buffer, two micro TLBs backed by a main TLB, Application Space Identifiers (ASIDs), and memory system attributes. The Level 1 cache memory subsystem is fully defined in ARM V6, and ARM V6 also has hierarchy and memory order support for the Level 2 cache. The Level 2 cache is unified, and will therefore hold both instruction and data elements.

The cache is virtually indexed and physically addressed. Line length is fixed at eight words. The ARM1136 cache is four-way set-associative. A particular address may be stored in one of four locations within the cache. To check for a cache hit for a non-sequential access, address comparisons must be performed with four different tag values. To prevent this comparison from reducing the maximum core clock frequency, there is a minimum one cycle latency between the comparison matching and the writing of data to that cache line. This requires a small Write buffer to be implemented in the cache, to allow written words to be held until they can be written.

Cache-related Definitions

• Line: Smallest loadable unit of a cache that is always a block of contiguous words in memory.

• Tag: The portion of a memory address that is stored within the cache to identify the particular physical address located there.

• Set: The set of cache lines that can hold data from a particular memory location.

• Way: The number of cache lines in each set is the number of “ways” in the cache.

• Index: The portion of the memory address that determines the set in which the cache line may be stored.
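The definitions above can be made concrete by splitting an address into tag, index, and line offset. The sketch below does this for the L1 geometry given in this module (16 Kbytes, four-way, eight-word lines). It assumes 32-bit addresses and 4-byte words, and it ignores the virtual/physical distinction. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self) -> int:
        # Each set holds one line per way.
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def offset_bits(self) -> int:
        return (self.line_bytes - 1).bit_length()

    @property
    def index_bits(self) -> int:
        return (self.sets - 1).bit_length()


def split_address(geom: CacheGeometry, addr: int) -> tuple[int, int, int]:
    """Return (tag, index, offset) for a 32-bit address."""
    offset = addr & (geom.line_bytes - 1)
    index = (addr >> geom.offset_bits) & (geom.sets - 1)
    tag = addr >> (geom.offset_bits + geom.index_bits)
    return tag, index, offset


# L1 cache from this module: 16 Kbytes, four-way, 8 words (32 bytes) per line.
L1 = CacheGeometry(size_bytes=16 * 1024, ways=4, line_bytes=32)

tag, index, offset = split_address(L1, 0x80001234)
print(L1.sets, L1.offset_bits, L1.index_bits)  # 128 5 7
print(hex(tag), index, offset)                 # 0x80001 17 20
```

The index selects one of 128 sets, and the tag is then compared against the four lines in that set, which is the four-way comparison described earlier.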

Write Strategy

[Figure: write strategy diagram. Labels: CPU, WB, WT, Write Buffer, L2 Memory System, External Memory.]

Write Through: If the location is within the cache, the cache is updated. The write is also sent to memory via the Write Buffer.

Write Back: If the location is within the cache, only the cache is updated.

C  B  Access Mode
0  0  Non cacheable, non bufferable
0  1  Non cacheable, bufferable
1  0  WT, Write Through
1  1  WB, Write Back
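The C and B bit encoding above can be expressed as a small lookup. This is a minimal sketch. The write buffer rule follows this module's statement that the buffer is used if the region is bufferable or cacheable, and the function names are hypothetical.

```python
# (C bit, B bit) -> access mode, as listed in the table above.
ACCESS_MODES = {
    (0, 0): "Non cacheable, non bufferable",
    (0, 1): "Non cacheable, bufferable",
    (1, 0): "WT, Write Through",
    (1, 1): "WB, Write Back",
}


def access_mode(c_bit: int, b_bit: int) -> str:
    """Decode the cacheable (C) and bufferable (B) bits of a memory region."""
    return ACCESS_MODES[(c_bit, b_bit)]


def uses_write_buffer(c_bit: int, b_bit: int) -> bool:
    """Per this module: the write buffer is used if the region is bufferable or cacheable."""
    return bool(c_bit or b_bit)


for bits in ACCESS_MODES:
    print(bits, access_mode(*bits), uses_write_buffer(*bits))
```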

Now, let’s look at cache write strategies. The write buffer is used to decouple memory writes. Data is placed in the buffer at core speed and is written to memory at bus speed in parallel. A FIFO holds a set of addresses and a set of data words and size information. A sequence of data words in the write buffer requires only the first address. The address of a new access may be compared against write buffer addresses. A separate FIFO is maintained for cache Write Back operations. This avoids complications associated with performing an external write while handling a write-through store operation.

With write-through, if the location is in the cache, the memory update is stored in the cache and in the write buffer, which performs the write so that the main processor does not have to slow down to main memory speed.

With Write Back, if the location is in the cache, only the cache is updated and the “dirty” bit is set to show that the cache line must be written back to main memory before the line is reused.
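The difference between the two strategies can be sketched with a toy model. This is a simplified illustration, not the ARM1136 implementation: it tracks single words instead of cache lines, sends write misses through the write buffer (which assumes a bufferable region), and writes dirty data straight to memory on eviction instead of using a separate write-back FIFO. All names are hypothetical.

```python
from collections import deque


class ToyCache:
    def __init__(self, policy: str):
        assert policy in ("WT", "WB")
        self.policy = policy
        self.lines = {}              # address -> cached value
        self.dirty = set()           # addresses modified only in the cache
        self.write_buffer = deque()  # FIFO of (address, value) pending writes
        self.memory = {}             # stand-in for main memory

    def load(self, addr: int) -> int:
        """Read a word, allocating it in the cache."""
        self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr: int, value: int) -> None:
        if addr in self.lines:                      # cache hit
            self.lines[addr] = value
            if self.policy == "WT":
                self.write_buffer.append((addr, value))  # memory updated too
            else:
                self.dirty.add(addr)                # memory updated later
        else:                                       # cache miss: no allocation here
            self.write_buffer.append((addr, value))

    def drain(self) -> None:
        """Empty the write buffer into memory, oldest entry first."""
        while self.write_buffer:
            addr, value = self.write_buffer.popleft()
            self.memory[addr] = value

    def evict(self, addr: int) -> None:
        """Remove a word from the cache, writing it back first if dirty."""
        value = self.lines.pop(addr)
        if addr in self.dirty:
            self.dirty.discard(addr)
            self.memory[addr] = value


for policy in ("WT", "WB"):
    cache = ToyCache(policy)
    cache.memory[0x100] = 1
    cache.load(0x100)
    cache.write(0x100, 7)
    cache.drain()
    print(policy, "memory after write:", cache.memory[0x100])
    cache.evict(0x100)
    print(policy, "memory after evict:", cache.memory[0x100])
```

With Write Through, memory holds the new value as soon as the buffer drains. With Write Back, memory keeps the old value until the dirty word is evicted.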

Please note that if the data location is not contained within the cache, the data will be written directly to memory. The write buffer will be used if the region is bufferable or cacheable.

ARM L2 Cache

• Improves the performance of computer systems when significant memory traffic is generated by the CPU

• Fastest memory access is via the L1 cache, followed closely by the L210; access is significantly slower to the main memory (L3)

• Is 128 Kbytes on the ARM1136 core platform

• Has a fixed line length of 32 bytes, 8 words

• Supports lockdown format C

• Has eight-way associativity, which can be reduced to direct mapped
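The figures in the list above determine the rest of the L2 cache geometry. The following is a minimal sketch of that arithmetic, assuming 4-byte words. The function name and dictionary keys are hypothetical.

```python
L2_SIZE_BYTES = 128 * 1024  # from the module
L2_WAYS = 8                 # from the module
L2_LINE_BYTES = 32          # from the module
WORD_BYTES = 4              # ASSUMPTION: 32-bit words


def l2_geometry(size_bytes: int, ways: int, line_bytes: int) -> dict:
    """Derive way size, set count, and line counts from the basic parameters."""
    way_bytes = size_bytes // ways
    sets = way_bytes // line_bytes
    return {
        "way_bytes": way_bytes,
        "sets": sets,
        "words_per_line": line_bytes // WORD_BYTES,
        "total_lines": sets * ways,
    }


print(l2_geometry(L2_SIZE_BYTES, L2_WAYS, L2_LINE_BYTES))
# {'way_bytes': 16384, 'sets': 512, 'words_per_line': 8, 'total_lines': 4096}
```

Each way is therefore 16 Kbytes, so a cache restricted to a single allocatable way behaves like a 16 Kbyte direct-mapped cache for new allocations.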

Now, moving on to ARM L2 cache, this cache improves the performance of computer systems when significant memory traffic is generated by the CPU.

Memory access is fastest to the L1 cache, followed closely by the ARM L210™. Memory access is significantly slower to the main memory (L3).

The L2 cache on the ARM1136 core platform is 128 Kbytes.

The L2 cache has a fixed line length of 32 bytes, or 8 words.

The L2 cache supports lockdown format C with separate way locking mechanisms for data and instructions.

The L2 cache has eight-way associativity, which can be reduced to direct mapped, depending on the use of lockdown registers.

ARM L2 Cache

• Data RAM is byte-writeable.

• L2 cache has support for: – Write Through, read allocate – Write Back, read allocate – Write Back, read and write allocate

• Write allocate override option allows for allocation on write misses in the ARM L210.

• L2 cache performs critical word first refilling, with the option of refilling starting with word 0.

• A pseudo-random victim selection policy can be made deterministic with use of lockdown registers.

• L2 cache increases performance by 25 to 75 percent, extends battery life, and reduces memory cost.

Continuing with the features of the L2 cache, data RAM is byte-writeable.

The L2 cache has support for the following cache modes: Write Through, read allocate; Write Back, read allocate; and Write Back, read and write allocate.

The write allocate override option allows for always having allocation on write misses in the ARM L210.

The L2 cache performs critical word first refilling, with the option of refilling starting with word 0.

The L2 cache has a pseudo-random victim selection policy, which can be made deterministic with the use of lockdown registers.
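One way to picture how lockdown makes victim selection deterministic is sketched below. It assumes a one-bit-per-way lockdown mask and uses Python's `random` module in place of the hardware's pseudo-random source. Neither reflects the actual L210 register layout or generator, and the function names are hypothetical.

```python
import random

NUM_WAYS = 8  # from the module: eight-way associativity


def unlocked_ways(lockdown_mask: int, ways: int = NUM_WAYS) -> list[int]:
    """Ways whose bit is clear in the (hypothetical) lockdown mask."""
    return [w for w in range(ways) if not (lockdown_mask >> w) & 1]


def pick_victim(lockdown_mask: int, rng: random.Random) -> int:
    """Pseudo-randomly choose a way to replace among the unlocked ways."""
    candidates = unlocked_ways(lockdown_mask)
    if not candidates:
        raise ValueError("all ways are locked; nothing can be allocated")
    return rng.choice(candidates)


rng = random.Random(0)

# No ways locked: any of the eight ways may be chosen.
print(sorted({pick_victim(0b00000000, rng) for _ in range(200)}))

# Seven ways locked: the choice is always way 0, so replacement is deterministic.
print({pick_victim(0b11111110, rng) for _ in range(200)})  # {0}
```

With only one way left unlocked, each set has exactly one line available for allocation, which is also the sense in which the cache can behave as direct mapped.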

The ARM L210 L2CC and the accompanying 128 Kbytes of memory, combined with the ARM1136JF-S processor, can increase performance by 25 to 75 percent and extend battery life while reducing memory cost. By bringing more data on-chip and closer to the CPU, the ARM L210 L2CC helps remove the performance-limiting bandwidth constraints associated with off-chip memory.

VFP Co-processor

High-performance, short-vector operations

• Registers can be addressed as short vectors

Long pipeline for floating point MAC operation

• Decode, Issue, Execute (E1), E2, E3, E4, E5, E6, E7, E8, Write Back

Separate divide/square root pipeline

• Supports load/store and arithmetic operations in parallel with divide/square root operations

• Reduces latency impact of these operations

Separate load/store pipeline

• Load/store operations done in parallel with data processing operations

Single cycle execution

• Loads are bandwidth balanced to sustain FMAC operations

Calculation functions supported in hardware

Multiply, add, multiply-add, subtract, multiply-subtract, negate, negate multiply, negate multiply-add, negate multiply-subtract, absolute value, compare, convert, divide, and square root

Now let’s move on to the VFP co-processor. The VFP co-processor is an ARM-enhanced IEEE 754 numeric co-processor that supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications.

The VFP co-processor supports high-performance, short-vector operations, with registers that can be addressed as short vectors. The VFP co-processor also features a long pipeline for floating-point MAC operations, with the stages decode, issue, execute (E1 through E8), and Write Back. Also featured is a separate divide and square root pipeline that supports load, store, and arithmetic operations in parallel with a divide or square root operation. This reduces the latency impact of these operations.

The VFP includes a separate load/store pipeline that enables load and store operations to be done in parallel with data processing operations.

For VFP instruction throughput, most single precision and double precision data processing operations have single-cycle execution. Loads are bandwidth balanced to sustain FMAC operations. Two single precision values or one double precision value can be transferred each cycle.
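A long pipeline and single-cycle execution are compatible because latency and throughput are different measures. The sketch below uses the stage list from this module and an idealized model: independent operations, one issued per cycle, no stalls. Real instruction timing depends on data dependencies and is not modeled. The function names are hypothetical.

```python
# Stage list as given in this module for floating-point MAC operations.
FMAC_STAGES = ["Decode", "Issue"] + [f"E{n}" for n in range(1, 9)] + ["Write Back"]


def pipelined_cycles(num_ops: int, depth: int = len(FMAC_STAGES)) -> int:
    """Cycles to finish num_ops independent operations, one issued per cycle."""
    if num_ops <= 0:
        return 0
    return depth + (num_ops - 1)


def unpipelined_cycles(num_ops: int, depth: int = len(FMAC_STAGES)) -> int:
    """Cycles if each operation had to finish before the next one started."""
    return max(num_ops, 0) * depth


print(len(FMAC_STAGES))        # 11
print(pipelined_cycles(1))     # 11
print(pipelined_cycles(8))     # 18
print(unpipelined_cycles(8))   # 88
```

Once the pipeline is full, one result completes per cycle even though each individual operation passes through all eleven stages.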

Many calculation functions are supported in hardware, including multiplication, absolute value, and square root.

Smart Speed Switch

[Figure: i.MX31 block diagram centered on the Smart Speed switch. Labels: ARM1136 Core Complex (ARM1136JF, AVIC, L2 Cache, ROMC / 32K ROM, RAMC / 16K RAM), AIPI #1, AIPI #2, I2C x 3, Watchdog, IOMUXC, RTC, CSPI, RNGA, One Wire, GPIO x 3, AudioMux, PWM, SSI, GPT, Keypad, SIM, EPIT x 2, UART x 4, SD/MMC x 2, FIRI, ATA, IIM, CCM/CGM/PLL, ECT, SJC, SCC, MPEG4 Enc, RTIC, Mem Stick x 2, IPU, eDMA, EMI, USB OTG / Hosts, GPU, with numbered switch ports and 32-bit and 64-bit connections.]

The purpose of the Smart Speed switch is to concurrently support up to five simultaneous connections between master devices 0 to 5 and slave devices 0 to 4. It supports 32-bit address bus width and 32-bit data bus width at all master and slave ports. The ARM11 platform implements a six master by five slave configuration. The Smart Speed switch supports two arbitration schemes that are independently programmable for each slave device: the simple fixed-priority algorithm and simple round-robin fairness algorithm.
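The two arbitration schemes can be sketched as follows. This is an illustration of the general fixed-priority and round-robin algorithms, not the actual switch logic: the priority ordering, the rotation direction, and all function names are assumptions.

```python
NUM_MASTERS = 6  # from the module: six master ports (0 to 5)


def fixed_priority(requests: set[int], priority_order: list[int]):
    """Grant the requesting master that appears earliest in priority_order."""
    for master in priority_order:
        if master in requests:
            return master
    return None


def round_robin(requests: set[int], last_granted: int, num_masters: int = NUM_MASTERS):
    """Grant the next requesting master after last_granted, wrapping around."""
    for step in range(1, num_masters + 1):
        candidate = (last_granted + step) % num_masters
        if candidate in requests:
            return candidate
    return None


# Each slave port arbitrates independently, so grants can happen in parallel.
requests_by_slave = {0: {0, 3}, 1: {1}, 2: {2, 5}}
grants = {
    0: fixed_priority(requests_by_slave[0], priority_order=[0, 1, 2, 3, 4, 5]),
    1: fixed_priority(requests_by_slave[1], priority_order=[0, 1, 2, 3, 4, 5]),
    2: round_robin(requests_by_slave[2], last_granted=2),
}
print(grants)  # {0: 0, 1: 1, 2: 5}
```

In this example three transfers proceed in the same cycle, while masters 3 and 2 wait for the slave ports they share with another master.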

The Smart Speed switch allows for concurrent transactions to occur from any master device to any slave device. It is possible for five master devices and all slave devices to be in use at the same time due to independent requests. The Smart Speed switch can gain control of the slave devices and prevent any masters from making any accesses to the slave devices. This is useful if the user wishes to turn off all the clocks and ensure that no bus activity will be interrupted. The Smart Speed switch can put each slave port in low power park mode so the slave will not dissipate any power when not being accessed by a master port.

Question

Which cache write strategy is illustrated in this graphic? Select the response that applies and click Done.

[Figure: write strategy diagram. Labels: CPU, Write Buffer, L2 Memory System, External Memory.]

a. Write Back Cache

b. Rewrite

c. Write Through

d. Read/Write


Let’s see if you can remember the cache write strategies.

Correct.

Cache write strategies consist of Write Through and Write Back. The cache write strategy shown here is Write Through. For Write Through, if the location is within the cache, the cache is updated and the write is also sent to memory via the Write Buffer. For Write Back, if the location is within the cache, only the cache is updated.

Question

Which of the following statements about the CPU complex of the i.MX31 are correct? Select all that apply and then click Done.

The Smart Speed switch concurrently supports up to 5 simultaneous connections between master devices and slave devices.

The L2CC and the accompanying 128 Kbytes of memory, combined with the ARM1136JF-S processor, do not increase performance.

The VFP co-processor supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications.

The L1 cache improves the performance of computer systems when significant memory traffic is generated by the CPU.


Please select all the statements that accurately describe aspects of the i.MX31 CPU complex.

Correct.

The purpose of the Smart Speed switch is to concurrently support up to 5 simultaneous connections between master devices and slave devices. The ARM L210 L2CC and the accompanying 128 Kbytes of memory, combined with the ARM1136JF-S processor, can increase performance by 25 to 75 percent and extend battery life. The VFP co-processor supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications. The L2 cache improves the performance of computer systems when significant memory traffic is generated by the CPU.

Module Summary

• ARM1136 core platform

• ARM1136JF-S processor

• ARM V6 architecture

• Caches in the CPU complex

• Cache write strategies

• ARM Level 2 cache

• VFP Co-processor

• Smart Speed switch

In this module, you learned about the various components of the i.MX31 CPU complex. First, you learned about the ARM1136 core platform, the ARM1136JF-S processor, and the benefits of the ARM V6 architecture, which include increased CPU efficiency and performance and improved multimedia performance. Next, you learned about the two levels of caches in the CPU complex: L1 and L2. Specifically, you learned about cache write strategies and the ARM Level 2 cache. Finally, you learned about the VFP co-processor, which supports and enhances 3D graphics, gaming, high resolution audio, Java and other general-purpose applications, and the Smart Speed switch, which can concurrently support up to five simultaneous connections between master devices and slave devices.