PowerPC 400 Series Caches: Applications Programming and Coherency Issues

IBM Microelectronics, Research Triangle Park, NC
ppcsupp@us..com
Version: 1.0, January 22, 1999

Abstract – The PowerPC™ instruction set provides opcodes that allow programmers to explicitly move items in or out of the processor’s instruction and data caches. This application note examines these cache control instructions in terms of initializing the caches at power-on and for enforcing coherency in systems with devices capable of directly reading and writing cacheable system memory.

High performance processors, such as the IBM family of Embedded PowerPC Processors, require access to instructions and data at the clock rate of the processor. In most circumstances, external memory cannot provide this level of data throughput and on-chip memory cannot fulfill a system’s memory requirements. An effective solution is to exploit the locality of instruction and data accesses present in most programs and retain frequently accessed items in an on-chip cache memory.

Typically, caches greatly increase performance while minimally affecting system and program design. However, there are methods of accessing memory that can cause on-chip caches and external memory to become incoherent, resulting in data errors and possible system failure. Key to preventing this type of problem is an understanding of what may cause these problems and how to write software to avoid them.

Cache Structure

The PPC40x series of Embedded Processors and Cores employ separate instruction and data caches. Separate caches provide a performance advantage by allowing simultaneous instruction and data accesses. Equivalent performance is possible through a dual-ported unified cache with the same total number of storage bits, but the unified cache has the disadvantage of occupying a larger amount of silicon.

Caches attempt to retain frequently used instruction and data items so that they may be accessed quickly. Performance usually improves greatly because most software exhibits locality in terms of code and data references. Therefore, once an item has been transferred from external memory and placed in the cache it is immediately available for subsequent accesses. When the processor requests either an instruction or piece of data and the item is present in the cache a cache hit occurs. The structure and size of the cache along with a program’s flow and its pattern of memory accesses affects the cache hit ratio for a particular application. The hit ratio is a metric that quantifies the fraction of processor accesses satisfied by the cache compared to the total number of accesses. Usually the hit ratio is determined individually for the instruction and data caches.

Cache Mapping

Conceptually, a cache is an on-chip memory array tightly coupled to the processor core that retains the contents of the most recently accessed memory locations. For performance and design reasons, most caches operate on data blocks rather than individual memory elements. In the case of the PPC401GF and PPC403GA/GB/GC/GCX embedded processors and the PPC401 core the block size is four 32-bit words, while for the PPC405 core it is eight 32-bit words. Therefore, when the cache cannot satisfy a data or instruction request, the block or line containing the requested address is loaded into the cache.

The physical implementation of a cache often places a limitation on how many lines with partially identical addresses may coexist in the cache. As illustrated below, an address may be divided into several bit fields: tag, index and offset.

32-bit Address:   |    Tag    |   Index   |  Offset  |

The offset selects an individual data item within a given cache line. Bits to the left of the offset comprise the tag and index. For performance and design reasons a cache is often implemented as a collection of sets. Shown below is the 2-way set-associative structure used in the PPC40x embedded processors and cores.

            Set A                  Set B
Index   Tag   Line Data        Tag   Line Data
  0
  1
  2
  3

For each index, or congruence class, at most two data blocks may coexist in the cache. Other possible structures include direct mapped, n-way set-associative and fully-associative. A direct mapped cache has only a single set thus permitting only a single data block for each index. An n-way cache allows n lines with common indices, while a data block can be placed anywhere within a fully-associative cache. Higher levels of associativity generally yield better software performance, but may raise other concerns relating to device cost and execution rate.

The key point to understand is that the address of an instruction or data item, combined with the associativity of the cache, determines where it can be placed in the cache. With an n-way cache, attempting to access more than n items with common indices causes bus operations to read and/or write cache data. If possible, this should be avoided to provide the highest level of system performance.

Cache Control Instructions

The PowerPC Architecture and PowerPC Embedded Environment provide a number of instructions for managing both instruction and data caches and potentially improving code performance. Most applications only use these cache control instructions during power-on initialization and when necessary to flush data cache contents to system memory. However, in time-critical code segments these instructions can often improve throughput via preloading of required cache contents and by reducing unnecessary transfers between external memory and the data cache.

Following are brief descriptions and the typical usage of common PowerPC cache control instructions as implemented by the PPC40x processors and cores. Since these instructions may vary by processor, more complete descriptions, syntax, architectural notes and exceptions for these and additional instructions are included in the User's Manual for the appropriate processor or core.

dcbi Data Cache Block Invalidate: If the data block corresponding to the effective address is in the data cache, the block is marked as invalid. If modified data existed in the data block, that data is lost.

icbi Instruction Cache Block Invalidate: If the instruction block corresponding to the effective address is in the instruction cache, the block is marked as invalid.

dccci Data Cache Congruence Class Invalidate: Both cache lines associated with the congruence class specified by the effective address are invalidated. If modified data existed in the cache congruence class prior to the operation of this instruction, that data is lost.

iccci Instruction Cache Congruence Class Invalidate: Invalidates both instruction cache lines associated with the congruence class specified by the effective address.

The above instructions remove a specified cache line or congruence class from the instruction or data cache. The congruence class invalidate instructions are useful when the address of the line destined for removal from the cache is unknown. Usually dccci and iccci are only used to clear the caches. Because of this, the PPC405 core implementation of the iccci instruction ignores the parameter and invalidates the entire instruction cache.

dcbst Data Cache Block Store: If the data block at the effective address is in the data cache and marked as modified, the data block is written back to main storage and marked as unmodified in the cache.

dcbf Data Cache Block Flush: If the data block corresponding to the effective address is in the data cache and marked as modified, it is written back to main storage and then marked as invalid in the cache.

Executing dcbst or dcbf writes the cache line at the specified address to system memory if it contains modified data. Additionally, dcbf then invalidates the line in the data cache. These instructions allow an application to explicitly update system memory with the cache contents. This operation is usually necessary to enforce coherency when the processor and an external device access a common memory region.

dcbz Data Cache Block Set to Zero: If the data block at the effective address is in the cache, the data in the block is set to zero. Otherwise, dcbz establishes a cache block at the effective address and sets it to zero. dcbz provides a means of establishing a line in the data cache at a given address without reading system memory. Doing so greatly improves the performance of algorithms that completely overwrite a target data area with new values.

dcbt Data Cache Block Touch: If the data block corresponding to the effective address is not in the data cache and the effective address is marked as cacheable, the block is read from main storage into the data cache.

icbt Instruction Cache Block Touch: If the instruction block containing the effective address is not in the instruction cache and the effective address is marked as cacheable, the block is read from main storage into the instruction cache.

As implemented in the PPC40x products, an instruction touch or data touch operation causes the associated cache to read the specified cache block from system memory into the cache. By explicitly loading a cache block prior to the use of its contents, software may reduce the number of pipeline stalls incurred.

Cache Initialization

Operating as local memory, caches provide the processor with access to recently referenced instructions and data. A reset or power-on sequence disables the instruction and data caches, leaving them with random data, tags and line status bits. Prior to enabling either cache, software must explicitly invalidate it. A generic PowerPC assembly routine to invalidate the data cache is:

        # NUM_CLASSES and LINE_SIZE are processor-dependent placeholders
        li    r0,NUM_CLASSES    # number of congruence classes in cache
        mtctr r0                # place in counter
        li    r1,0              # clear r1
loop:   dccci 0,r1              # invalidate class containing (r1)
        addi  r1,r1,LINE_SIZE   # point to next class (line size in bytes)
        bdnz  loop              # repeat until done

Clearing the instruction cache is identical except that the dccci instruction is replaced with iccci and the number of congruence classes may be different. Because the PPC405 implementation of iccci invalidates the entire instruction cache, it is not necessary to step through each of the congruence classes. After invalidating the caches, caching may be enabled for the desired memory regions. This is accomplished through either the DCCR and ICCR real mode registers or the memory management unit (MMU) on those parts with an MMU. The DCCR and ICCR registers are 32-bit registers where each bit controls the cacheability of a 128 MB memory region, whereas the MMU supports much finer granularity through variable page sizes. For additional details, please refer to the appropriate Embedded Controller or Core User's Manual.

After invalidating and enabling the caches many systems operate without requiring additional cache control instructions. However, several circumstances may necessitate the use of cache control operations. These include self-modifying code, relocating object code from one memory area to another and enforcing coherency between the processor’s caches and other bus masters.

Coherency

A memory system is coherent when the value read from a memory address is always the value last written to that address. Systems that perform all reads and writes directly to a common memory and those with caches that utilize specialized hardware techniques to enforce coherency are always coherent. Since the PPC40x processors and cores feature caches without dedicated hardware for enforcing coherency, situations may occur where software must explicitly accomplish this task. Prior to examining techniques for enforcing coherency, several situations that can cause a loss of coherency will be examined.

How is Coherency Lost?

When an area of memory exists in a cache, any access to that region by another device or cache has the potential to result in a loss of coherency. One of the most common examples is an external controller, such as a DMA channel, directly accessing a cacheable area of system memory. Figure 1 depicts such a system. Because the external controller reads and writes system memory directly, it neither obtains modified data from, nor updates data present in, the cache.

Directly reading locations in system memory also present in a cache frequently yields incorrect data. This occurs because modified cache lines in a write-back cache do not update external memory until normal cache activity or an explicit cache control instruction displaces them from the cache. In contrast, when the processor alters data in a cache that supports write-through operations, the change is also written to external memory at the same time. While write-through operations are supported via an exception handler on the PPC403GC/GCX and in hardware on the PPC401GF processor and the PPC401 and PPC405 cores, they do not solve all aspects of the coherency problem. In particular, when an external device changes a memory location also present in the cache, the cache is not updated. Additionally, write-through operations may cause significantly more traffic to the memory system, thus decreasing overall system performance.

[Figure 1 shows the PowerPC execution unit with its separate instruction and data caches, the memory interface, and an external master (DMA controller), all sharing access to system memory.]

Figure 1. Example of a system with an external bus master.

External Device Accessing a Cached Memory Area

The following example illustrates the relationship between system memory and a write-back data cache. To begin, consider system memory and a direct mapped data cache in the following state:

System Memory              Cache

Address   Data             Address    V  D   Data
00001000  A9 2A 3A EB      XXXXXXXX   N  N   XX XX XX XX
00001010  0C 93 EE A1      XXXXXXXX   N  N   XX XX XX XX
00001020  EF 39 EB A6      XXXXXXXX   N  N   XX XX XX XX
00001030  3D 5F 8F 34      XXXXXXXX   N  N   XX XX XX XX

In the cache table, the V column indicates whether the associated cache line contains valid data, and the D column indicates whether the processor has modified some portion of the line, thereby making it dirty. Now, assume caching is enabled for addresses 0x1000-0x103F and the following program executes:

        li    r1,0x1004-4   # start at address 0x1004
        li    r2,12         # fill 12 words
        mtctr r2            # place in counter
        li    r3,0          # start at zero
loop:   stwu  r3,4(r1)      # r1 = r1 + 4; write r3 to address in r1
        addi  r3,r3,1       # r3 = r3 + 1
        bdnz  loop          # repeat until done

As shown below, cache addresses 0x1004 to 0x1033 now contain the values 0 through 0x0B, respectively. More importantly though, system memory does not reflect these changes. Therefore, read and write operations to these addresses not fulfilled by the data cache result in a loss of coherency.

Address   Data             Address    V  D   Data
00001000  A9 2A 3A EB      00001000   Y  Y   A9 00 01 02
00001010  0C 93 EE A1      00001010   Y  Y   03 04 05 06
00001020  EF 39 EB A6      00001020   Y  Y   07 08 09 0A
00001030  3D 5F 8F 34      00001030   Y  Y   0B 5F 8F 34

In this example, the modified address range in system memory contains all old data. Unfortunately, loss of coherency usually results in a shared memory region with a mixture of correct and incorrect data. Any or all of the modified lines in the cache can be written back to system memory before, in between, or after another device accesses those addresses. The result is that each cache block of data in system memory can become a combination of correct and incorrect data. To illustrate this, let normal cache line re-use activity displace the first two lines from the cache and write them back to system memory:

Address   Data             Address    V  D   Data
00001000  A9 00 01 02      ????????   Y  ?   ?? ?? ?? ??
00001010  03 04 05 06      ????????   Y  ?   ?? ?? ?? ??
00001020  EF 39 EB A6      00001020   Y  Y   07 08 09 0A
00001030  3D 5F 8F 34      00001030   Y  Y   0B 5F 8F 34

Now, consider the effect of the cache reloading one of these lines during the process of an external device writing data into system memory locations 0x100C-0x1027:

Address   Data             Address    V  D   Data
00001000  A9 00 01 FF      ????????   Y  ?   ?? ?? ?? ??
00001010  FE FD FC FB      00001010   Y  N   FE FD 05 06
00001020  FA F9 EB A6      00001020   Y  N   07 08 09 0A
00001030  3D 5F 8F 34      00001030   Y  Y   0B 5F 8F 34

In this state, neither the cache nor system memory contains the expected data. Rather, each holds a combination of old and new values likely to confound the programmer responsible for explaining why the system is failing. Fortunately, several methods exist to ensure that system memory and the caches remain coherent.

An application with devices capable of directly accessing system memory or with multiple caches may require specific programming to enforce coherency. As illustrated in the previous section, incoherencies occur when an access to a cached memory location is fulfilled by any data source other than the cache. The solution is to guarantee that memory locations accessed by an external device are not present in any cache. This is most easily accomplished by configuring the processor to not cache shared areas. Although simple to implement, non-cached operations degrade processor throughput and should be avoided as much as possible.

Dual-Mapping

As discussed previously, the PPC40x products provide the DCCR and ICCR real mode cache control registers and, in some cases, an MMU to control cacheability. In addition, the PPC403 standard products and some implementations based on the PPC401 and PPC405 cores feature memory controllers that dual-map two address ranges to the same physical memory. By configuring the processor such that one of these regions is cacheable and the other is not, a program can access a data item either cached or non-cached. To illustrate, assume the memory interface ignores the uppermost address bit, A0, and the processor is configured such that only addresses with A0=0 are cached. Then, reading address 0x00000000 or 0x80000000 references the same memory location. The difference is that the operation to 0x00000000 is fulfilled via the cache while the read from 0x80000000 always returns the data stored in physical memory. As previously illustrated, accessing the same memory location cached and non-cached can cause a loss of coherency. Because of this, dual mapping should only be used to provide cache line granularity in selectively caching or not caching address ranges.

Enforcing Coherency via Software

For the processor to cache a shared area it must control when other devices or caches access the area. Specifically, the processor must ensure that the shared memory area is not present in any of its caches before granting another device access to the area. Furthermore, the processor must not cacheably access the area until after the other device finishes with the area.

Sharing access to an area of memory between a cache and another device requires the ability to control ownership to the area. Although not specifically required, the processor’s cache line size is the recommended granularity for aligning and sizing a shared memory area. By configuring all shared areas to start at the beginning of a cache line boundary and occupy an integral number of lines, no cache lines will contain a mixture of shared and non-shared items. Locating and sizing shared areas in this manner prevents cached accesses to non-shared memory locations from inadvertently caching adjacent shared regions.

PPC405-based products have a cache line size of 32 bytes, while the PPC401 and PPC403 implement 16-byte lines. Therefore, if a C program executing on a PPC405 requires 150 bytes for a shared buffer, it should allocate the region as follows:

#define LINE_LENGTH 32      /* cache line length (bytes) */
#define BIT_MASK    0x1F    /* address bits that select a byte within the line */

char *buffer;               /* buffer allocated by malloc */
char *aligned;              /* cache-line-aligned buffer */

buffer = (char *)malloc(150 + 2*LINE_LENGTH - 2);    /* obtain buffer */
if (((unsigned long)buffer & BIT_MASK) != 0)         /* if not at the start of a line */
    aligned = buffer + LINE_LENGTH -
              ((unsigned long)buffer & BIT_MASK);    /* point to start of next line */
else
    aligned = buffer;                                /* otherwise use as is */

Since malloc will not necessarily allocate a block starting on a cache line boundary, the code adjusts the requested size to allow for alignment. Then a second pointer, aligned, receives the value in buffer which, if required, is rounded up to the next starting cache line address. At this point, no other data objects exist in any cache block from aligned to aligned+149, thus preventing these blocks from being unintentionally loaded into the cache. Failure to allocate storage using this technique or through compiler directives that align and pad variables in a similar manner will likely cause coherency problems.

Cache Flushing

Before another device may access a shared memory area, the area must not exist in the processor’s data cache. In addition, if the processor moves executable code with the data cache, software must force the data cache to update those locations in system memory and also invalidate the same locations in the instruction cache. Removing the destination address range from the instruction cache serves to keep it coherent with the relocated instructions in system memory.

The size and placement of a shared memory area, along with the size of the data cache, determine the best method for removing the region from the data cache. Flushing the region by address is most efficient when the shared area is smaller than the data cache. The assembly to flush a specific region is as follows:

        # r1 = start of shared region
        # r2 = end of region
        # LINE_SIZE = processor-dependent cache line size in bytes
loop:   dcbf  0,r1              # flush line at address r1
        addi  r1,r1,LINE_SIZE   # point to next line
        cmpw  r1,r2             # finished?
        ble   loop              # if not, continue until done

While the algorithm executes a dcbf for each line in the shared region, only modified lines result in writes to system memory. Because non-dirty lines still match external memory, the cache controller simply invalidates them. If the shared area is larger than the data cache, flushing the entire cache often yields better performance than issuing dcbf instructions for the entire address range.

PPC40x products do not provide a direct method of flushing the entire data cache, regardless of address. Instead, software must load new data into the data cache, thereby forcing the cache to write any modified lines back to system memory. The algorithm below uses the dcbz opcode to establish a line in the cache at an unused, and possibly non-existent, address without causing a load from external memory. By executing two dcbz instructions to different addresses in the same congruence class, the code flushes both lines in the set. The subsequent dccci then invalidates both of these new lines.

        # CACHE_BASE:  start of an unused, cacheable address range
        # WAY_OFFSET:  address offset mapping to the same congruence
        #              class in the other way (processor dependent)
        # NUM_CLASSES: number of congruence classes (processor dependent)
        li    r1,CACHE_BASE
        li    r2,WAY_OFFSET
        li    r3,NUM_CLASSES
        mtctr r3
loop:   dcbz  0,r1              # establish a line in one way of the set
        dcbz  r2,r1             # establish a line in the other way
        dccci 0,r1              # invalidate both lines in the set
        addi  r1,r1,LINE_SIZE   # point to next congruence class
        bdnz  loop              # continue until done
        sync                    # ensure data has been written

This code must execute with interrupts disabled to prevent an interrupt from occurring during the dcbz/dcbz/dccci sequence. With interrupts enabled, the possibility exists that the cache might write out one of the dcbz-created lines, thereby corrupting system memory or causing an exception.

Self-Modifying Code and Relocation of Executable Object Code

Both self-modifying code and moving executable object code from one memory location to another with caching enabled require cache control instructions. The following assembly maintains coherency between the instruction cache, data cache and system memory:

        # r1 = source address (word aligned)
        # r2 = target address (word aligned)
        # r3 = number of words to move
        addi  r1,r1,-4      # allow use of lwzu and stwu
        addi  r2,r2,-4
        mtctr r3
loop:   lwzu  r4,4(r1)      # read source
        stwu  r4,4(r2)      # write target
        dcbf  0,r2          # remove target from data cache
        icbi  0,r2          # and from instruction cache
        bdnz  loop          # repeat until done
        sync
        isync

This example moves one word at a time, flushing the corresponding address from the data cache and invalidating it in the instruction cache. By using this sequence of operations, the instruction cache remains coherent during the relocation process. However, if the application does not attempt to execute code in the target range until after moving the entire block, improved performance results from flushing the data cache and invalidating the instruction cache outside of the loop.

Conclusion

The PPC40x on-chip instruction and data caches greatly improve system performance. Although most programs do not directly manipulate the caches, cache control instructions provide programmers with the means to control the content of the PPC40x on-chip caches. This level of control is required in applications with devices that share memory with the processor or when a program relocates executable object code from one memory area to another. By utilizing cache control instructions, these memory areas become cacheable and overall system performance increases.

© International Business Machines Corporation, 1999

All Rights Reserved

* Indicates a trademark or registered trademark of the International Business Machines Corporation.

** All other products and company names are trademarks or registered trademarks of their respective holders.

IBM and IBM logo are registered trademarks of the International Business Machines Corporation.

IBM will continue to enhance products and services as new technologies emerge. Therefore, IBM reserves the right to make changes to its products, other product information, and this publication without prior notice. Please contact your local IBM Microelectronics representative on specific standard configurations and options.

IBM assumes no responsibility or liability for any use of the information contained herein. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. NO WARRANTIES OF ANY KIND, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE ARE OFFERED IN THIS DOCUMENT.
