ARM L2 Cache

Module Introduction

Purpose
• The intent of this module is to provide you with an overview of the i.MX31 CPU complex.

Objectives
• Describe the ARM1136 core platform.
• Identify features of the ARM1136JF-S processor.
• Describe the two levels of caches in the CPU complex.
• Identify the purpose of the Smart Speed™ switch.

Content
• 15 pages
• 3 questions

Learning Time
• 25 minutes

The intent of this module is to provide you with an overview of the CPU complex of the i.MX31 processor. You will learn about the ARM1136JF-S™ processor, the cache write strategy, and the Level 2 (L2) cache system. You will also learn about the Smart Speed™ switch and the Vector Floating-Point (VFP) co-processor. Unless specifically mentioned, all information in this module applies to both the i.MX31 and the i.MX31L.

ARM1136 Core Platform

[Figure: ARM1136 core platform block diagram. Labels: ETB (4 Kbytes), FPU, 1136 Core, ETM, 16 Kbyte D Cache, 16 Kbyte I Cache, L2 Cache Cntrl, 128 Kbyte L2 Cache, Interrupt, Smart Speed™ Switch (Multi AHB Crossbar), Primary AHB, Alternate AHB 1, 2, 3 Bus Masters, Patch, Peripheral I/F 1, Mem Ctl, Peripheral I/F 2]

Let’s start by looking at the Freescale ARM®1136 core platform. The CPU complex of the i.MX31 consists of the ARM1136JF-S processor, an L2 cache system, the Smart Speed switch, and an ARM11™ Vector Interrupt Controller (AVIC).

The multilevel cache system consists of a powerful L2 Cache Controller (L2CC), optimized by ARM® to Freescale specifications, with 128 Kbytes of unified L2 cache memory and an integrated L2 cache monitor. The L1 cache provides 16 Kbytes for instructions and 16 Kbytes for data.

The Smart Speed switch, otherwise known as the 6 × 5 Multi-Layer AHB Crossbar switch (MAX), allows up to five transactions to occur in parallel, giving the performance of up to a 665 MHz bus.
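The 665 MHz figure can be checked with simple arithmetic. The sketch below assumes a 133 MHz AHB clock and one transfer per clock on each layer; neither assumption is stated in this module, and the function names are illustrative only. The transfer count and the 32-bit width come from the module.

```python
# Assumption (not stated in this module): each AHB layer is clocked at 133 MHz
# and completes at most one transfer per clock.
AHB_CLOCK_MHZ = 133
PARALLEL_TRANSFERS = 5     # from the module: up to five simultaneous transactions
TRANSFER_WIDTH_BITS = 32   # from the module: 32-bit transfers


def equivalent_bus_mhz(clock_mhz, transfers):
    """Clock a single bus would need to match `transfers` parallel layers."""
    return clock_mhz * transfers


def peak_bandwidth_mbytes_per_s(clock_mhz, transfers, width_bits):
    """Upper bound on aggregate throughput, in millions of bytes per second."""
    return clock_mhz * transfers * width_bits // 8


print(equivalent_bus_mhz(AHB_CLOCK_MHZ, PARALLEL_TRANSFERS))
print(peak_bandwidth_mbytes_per_s(AHB_CLOCK_MHZ, PARALLEL_TRANSFERS,
                                  TRANSFER_WIDTH_BITS))
```

Under these assumptions the equivalent clock is 5 × 133 = 665 MHz. This is a best case that is reached only when all five transfers target different slave ports.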
The VFP11 Floating Point Unit (FPU) is an ARM-enhanced IEEE 754 numeric co-processor that can be used to support and enhance 3D graphics, gaming, high resolution audio, Java™ and other general-purpose applications.

ARM1136 Core Features
• High performance core platform
• ARM1136 core with:
  – 8-stage pipeline
  – 16 Kbyte instruction and 16 Kbyte data caches
  – 64-bit data paths to memory offer increased bandwidth
  – Jazelle hardware for Java acceleration
• Vector Floating Point Unit (VFP)
• Trace module with 4 Kbyte buffer for SW debug
• Freescale Smart Speed switch:
  – 5 simultaneous 32-bit transfers increase performance
  – Programmable priorities optimize system performance
• 128 Kbytes L2 cache for up to 30 percent increased system performance
  – Freescale was the lead partner with ARM
• Enhanced hardware-assisted interrupts for faster response
• Flexible power management techniques
• Dynamic voltage frequency scaling modes:
  – High speed: 532 MHz @ 1.45 V
  – Medium speed: 400 or 266 MHz @ 1.1 V
  – Idle speed: 133 MHz @ 1.1 V
Reference material for previous page

ARM1136JF-S Processor

Let’s look at the core of the ARM11 platform, which is the ARM1136JF-S processor. In this module, it is referred to as simply the “ARM11.” The ARM11 incorporates an integer unit that implements the ARM V6 architecture. It supports the ARM and Thumb™ instruction sets, Jazelle™ technology to enable direct execution of Java byte codes, and a range of SIMD DSP instructions that operate on 16-bit or 8-bit data values in 32-bit registers.
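To illustrate what "operating on 16-bit or 8-bit data values in 32-bit registers" means, the following Python sketch models a lane-wise add on a packed 32-bit word. It is a simplified illustration in the spirit of ARM V6 instructions such as UADD8 and UADD16, not a full model of them: status flags are omitted, and the helper name `packed_add` is hypothetical.

```python
def packed_add(a, b, lane_bits):
    """Add two 32-bit words lane by lane; each lane wraps modulo 2**lane_bits.

    Carries never cross a lane boundary, which is what distinguishes a
    packed (SIMD) add from an ordinary 32-bit add.
    """
    if lane_bits not in (8, 16):
        raise ValueError("lane_bits must be 8 or 16")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_a = (a >> shift) & mask
        lane_b = (b >> shift) & mask
        result |= ((lane_a + lane_b) & mask) << shift
    return result


a = 0x0001FFFF
b = 0x00010001
print(f"{packed_add(a, b, 8):#010x}")    # four independent 8-bit adds
print(f"{packed_add(a, b, 16):#010x}")   # two independent 16-bit adds
print(f"{(a + b) & 0xFFFFFFFF:#010x}")   # ordinary 32-bit add, for comparison
```

The three results differ because the ordinary add lets the carry out of the low lanes propagate upward, while the packed adds discard it at each lane boundary.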
ARM1136JF-S Features
• Synthesizable
• ARM V6 architecture:
  – ARM, Thumb, Jazelle
  – Mixed endian support
  – Unaligned data support
  – Physically addressed caches
  – Media extensions
• High performance core:
  – 8-stage pipeline
  – Branch prediction
  – Return stack
• VFP co-processor
• Fast Interrupt mode
• 16 Kbyte I- and D-caches
Reference material for previous page

ARM1136JF-S: Key Benefits
• ARM V6 architecture
  – ARM and Thumb instruction sets
  – Jazelle technology enabling direct execution of Java byte codes
  – A range of SIMD DSP instructions which operate on 16-bit or 8-bit data values in 32-bit registers
• Power and area efficient
• Synthesizable design
• Complete set of supporting system IP
• Backwards compatible with previous ARM processors
• Provides full virtual memory capabilities
• Physical address tagging for caches and Application Space Identifiers (ASIDs)
  – Reduces overhead on context switches
  – Reduces cache invalidation and refill
  – Saves cycles and power
Reference material for previous page

ARM V6 Benefits
Improved:
• CPU efficiency and performance
• Multimedia performance
• Real-time performance
• Data sharing with non-ARM execution units
• Application portability from non-ARM processors
• Unaligned and mixed endian support

The ARM1136 is the first processor implementation of the ARM V6 architecture. Let’s look at how this architecture improves some CPU and multimedia functionalities.

The ARM V6 architecture improves CPU efficiency and performance. It also improves multimedia performance, with media processing extensions, two times faster MPEG-4 encode/decode, and faster audio DSP than the ARM926.

Another improvement is real-time performance, which includes faster exception and interrupt handling, vectored interrupt support, a reduced latency mode, and new stack and mode change instructions that give three times faster interrupt entry.
The ARM V6 architecture also improves data sharing with non-ARM execution units, application portability from non-ARM processors, and unaligned and mixed endian support.

ARM1136 Processor Block Diagram

[Figure: ARM1136 processor block diagram. Labels: Debug, JTAG, ETM, VIC Interface, External Co-processor, VFP, Prefetch Unit, LSU, ARM11 Core, I Cache, D Cache, Main TLB, System Metrics, L1 Instruction Side Cache Controller, L1 Data Side Cache Controller, Instruction Fetch, DRead, DWrite, Peripherals]

Let’s continue to examine the ARM11 processor by looking at a functional block diagram. The features include an integer unit, an eight-stage pipeline, branch prediction with a return stack, low interrupt latency, an external co-processor interface and co-processors 14 and 15, instruction and data MMUs (managed using micro TLB structures backed by a unified main TLB), and instruction and data caches (including a non-blocking D cache with Hit-Under-Miss). Note that the caches are virtually indexed and physically addressed, and there is a 64-bit interface to both caches.

Other features include a write buffer that can be bypassed, a high-speed Advanced Microcontroller Bus Architecture (AMBA) L2 interface supporting prioritized multiprocessor implementations, an AMBA bus interface (AHB-lite protocol), a floating-point co-processor, trace support, JTAG-based debug, and a Load Store Unit (LSU).

The ARM11 processor features an interrupt service that quickly determines the interrupt source and branches to the interrupt service routine. The ARM11 solution contains an interrupt vector port and a Vector Interrupt Controller (VIC).

Question
The ARM1136 is the first processor implementation of the ARM V6 architecture. What are some of the ARM V6 architecture improvements? Select all that apply.
• CPU efficiency and performance
• Vectored interrupt support
• Data sharing with non-ARM execution units
• 5-stage pipeline

Consider this question concerning the ARM1136 processor.

Correct.
The V6 architecture improvements include CPU efficiency and performance; real-time performance, which includes vectored interrupt support; and data sharing with non-ARM execution units. The ARM11 processor has an eight-stage pipeline, not a 5-stage pipeline.

ARM V6 Memory Model

[Figure: ARM V6 memory model. Labels: ARM Core (R0 to R15), Instruction Prefetch, Load, Store, CP15 Configuration/Control, Virtual Address, Address Translation, Physical Address, Level 1 Caches, Level 2 Cache, SRAM, ROM, EMI, DRAM, Flash, Additional Processor(s), ARM Platform, SOC]
• Level 1 cache memory fully defined in ARM V6
• Hierarchy and memory order support for Level 2 cache

Now let’s look at the V6 memory model to explore cache in greater detail. There are two levels of caches in the CPU complex. Level 1 (L1) consists of separate instruction and data caches, a write buffer, two micro TLBs backed by a main TLB, Application Space Identifiers (ASIDs), and memory system attributes. The Level 1 cache memory subsystem is fully defined in ARM V6, and ARM V6 also has hierarchy and memory order support for the Level 2 cache. The Level 2 cache is unified, and will therefore hold both instruction and data elements.

The ARM1136 (L1) caches are virtually indexed and physically addressed. Line length is fixed at eight words. The caches are four-way set-associative, so a particular address may be stored in one of four locations within the cache. To check for a cache hit on a non-sequential access, address comparisons must be performed with four different tag values. To prevent this comparison from reducing the maximum core clock frequency, there is a minimum latency of one cycle between the comparison matching and the writing of data to that cache line. This requires a small write buffer in the cache to hold written words until they can be written to the cache line.

Cache-related Definitions
• Line: Smallest loadable unit of a cache, which is always a block of contiguous words in memory.
• Tag: The portion of a memory address that is stored within the cache to identify the particular physical address located there.
• Set: The set of cache lines that can hold data from a particular memory location.
• Way: The number of lines in each set is the number of “ways” in the cache.
• Index: The portion of the memory address that determines the set in which the cache line may be stored.
Reference material for previous page

Cache Write Strategy

Write Through: If the location is within the cache, the cache is updated. The write is also sent to memory via the Write Buffer.
Write Back: If the location is within the cache, only the cache is updated.

[Figure: cache write paths. Labels: CPU, Write Buffer, L2 Memory, External Memory System, WB, WT]

C  B  Access Mode
0  0  Non cacheable, non bufferable
0  1  Non cacheable, bufferable
1  0  WT, Write Through
1  1  WB, Write Back

Now, let’s look at cache write strategies.
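The C/B table and the two write-hit rules above can be sketched as a small Python model. This is a toy illustration, not the ARM1136 implementation: the `ToyCache` class, its method names, and the dirty-line tracking are hypothetical additions, and only writes that hit a cacheable location are modelled.

```python
# C and B bit values mapped to access modes, as listed in the table above.
ACCESS_MODES = {
    (0, 0): "Non cacheable, non bufferable",
    (0, 1): "Non cacheable, bufferable",
    (1, 0): "WT, Write Through",
    (1, 1): "WB, Write Back",
}


def access_mode(c, b):
    """Decode the C and B bits into the access mode named in the table."""
    return ACCESS_MODES[(c, b)]


class ToyCache:
    """Hypothetical model that tracks one value per cached address."""

    def __init__(self, lines):
        self.lines = dict(lines)   # address -> cached value
        self.write_buffer = []     # writes queued for memory (write through)
        self.dirty = set()         # assumption: WB lines are flagged for later write-out

    def write_hit(self, addr, value, c, b):
        if addr not in self.lines:
            raise KeyError("this sketch only models writes that hit the cache")
        if c != 1:
            raise ValueError("this sketch only models the cacheable modes")
        self.lines[addr] = value                     # the cache is always updated
        if b == 0:
            self.write_buffer.append((addr, value))  # WT: also sent to memory
        else:
            self.dirty.add(addr)                     # WB: only the cache is updated


cache = ToyCache({0x100: 0, 0x200: 0})
cache.write_hit(0x100, 11, c=1, b=0)   # write-through hit
cache.write_hit(0x200, 22, c=1, b=1)   # write-back hit
print(access_mode(1, 0), cache.write_buffer)
print(access_mode(1, 1), sorted(cache.dirty))
```

After the two writes, both cached values are updated, but only the write-through store has been queued in the write buffer for memory.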
