PowerPC 400 Series Caches: Programming and Coherency Issues
Microcontroller Applications
IBM Microelectronics
Research Triangle Park, NC
[email protected]

Version: 1.0
January 22, 1999

Abstract – The PowerPC™ instruction set provides opcodes that allow programmers to explicitly move items into or out of the processor's instruction and data caches. This application note examines these cache control instructions in terms of initializing the caches at power-on and enforcing coherency in systems with devices capable of directly reading and writing cacheable system memory.

High-performance processors, such as the IBM family of Embedded PowerPC Processors, require access to instructions and data at the clock rate of the processor. In most circumstances, external memory cannot provide this level of throughput, and on-chip memory cannot fulfill a system's memory requirements. An effective solution is to exploit the locality of instruction and data accesses present in most programs and retain frequently accessed items in an on-chip cache memory. Typically, caches greatly increase performance while minimally affecting system and program design. However, there are methods of accessing memory that can cause on-chip caches and external memory to become incoherent, resulting in data errors and possible system failure. Key to preventing this type of problem is an understanding of what causes incoherency and how to write software that avoids it.

Cache Structure

The PPC40x series of Embedded Processors and Cores employ separate instruction and data caches. Separate caches provide a performance advantage by allowing simultaneous instruction and data accesses. Equivalent performance is possible with a dual-ported unified cache containing the same total number of storage bits, but the unified cache has the disadvantage of occupying a larger amount of silicon. Caches attempt to retain frequently used instruction and data items so that they may be accessed quickly.
Performance usually improves greatly because most software exhibits locality in its code and data references. Therefore, once an item has been transferred from external memory and placed in the cache, it is immediately available for subsequent accesses. When the processor requests either an instruction or a piece of data and the item is present in the cache, a cache hit occurs. The structure and size of the cache, along with a program's flow and its pattern of memory accesses, determine the cache hit ratio for a particular application. The hit ratio quantifies the fraction of processor accesses satisfied by the cache. Usually the hit ratio is determined separately for the instruction and data caches.

Cache Mapping

Conceptually, a cache is an on-chip memory array tightly coupled to the processor core that retains the contents of the most recently accessed memory locations. For performance and design reasons, most caches operate on blocks of data rather than individual memory elements. In the case of the PPC401GF and PPC403GA/GB/GC/GCX embedded processors and the PPC401 core, the block size is four 32-bit words, while for the PPC405 core it is eight 32-bit words. Therefore, when the cache cannot satisfy a data or instruction request, the block, or line, containing the requested address is loaded into the cache.

The physical implementation of a cache often limits how many lines with partially identical addresses may coexist in the cache. As illustrated below, an address may be divided into several bit fields: tag, index and offset.

                    32-bit Address
    +---------------+------------+----------+
    |      Tag      |   Index    |  Offset  |
    +---------------+------------+----------+

The offset selects an individual data item within a given cache line. Bits to the left of the offset comprise the tag and index. For performance and design reasons, a cache is often implemented as a collection of sets. Shown below is the 2-way set-associative structure used in the PPC40x embedded processors and cores.
                 Set A                  Set B
    Index    Tag    Line Data       Tag    Line Data
      0
      1
      2
      3

For each index, or congruence class, at most two data blocks may coexist in the cache. Other possible structures include direct-mapped, n-way set-associative and fully-associative. A direct-mapped cache has only a single set, permitting only a single data block for each index. An n-way cache allows n lines with common indices, while a data block can be placed anywhere within a fully-associative cache. Higher levels of associativity generally yield better software performance, but may raise other concerns relating to device cost and execution rate. The key point is that the address of an instruction or data item, combined with the associativity of the cache, determines where it can be placed in the cache. With an n-way cache, attempting to access more than n items with common indices causes bus operations to read and/or write cache data. Where possible, this should be avoided to obtain the highest level of system performance.

Cache Control Instructions

The PowerPC Architecture and the PowerPC Embedded Environment provide a number of instructions for managing the instruction and data caches and potentially improving code performance. Most applications use these cache control instructions only during power-on initialization and when necessary to flush data cache contents to system memory. However, in time-critical code segments these instructions can often improve throughput by preloading required cache contents and by eliminating unnecessary transfers between external memory and the data cache.

Following are brief descriptions and the typical usage of common PowerPC cache control instructions as implemented by the PPC40x processors and cores. Since these instructions may vary by processor, more complete descriptions, syntax, architectural notes and exceptions for these and additional instructions are included in the User's Manual for the appropriate processor or core.
dcbi <EA>
    Data Cache Block Invalidate: If the data block corresponding to the effective address is in the data cache, the block is marked as invalid. If modified data existed in the data block, that data is lost.

icbi <EA>
    Instruction Cache Block Invalidate: If the instruction block corresponding to the effective address is in the instruction cache, the block is marked as invalid.

dccci <EA>
    Data Cache Congruence Class Invalidate: Both cache lines associated with the congruence class specified by the effective address are invalidated. If modified data existed in the congruence class prior to this instruction, that data is lost.

iccci <EA>
    Instruction Cache Congruence Class Invalidate: Invalidates both instruction cache lines associated with the congruence class specified by the effective address.

The above instructions remove a specified cache line or congruence class from the instruction or data cache. The congruence class invalidate instructions are useful when the address of the line destined for removal is unknown. Usually dccci and iccci are used only to clear the caches. For this reason, the PPC405 core implementation of iccci ignores the <EA> parameter and invalidates the entire instruction cache.

dcbst <EA>
    Data Cache Block Store: If the data block at the effective address is in the data cache and marked as modified, the block is written back to main storage and marked as unmodified in the cache.

dcbf <EA>
    Data Cache Block Flush: If the data block corresponding to the effective address is in the data cache, any modified data is written back to main storage and the block is marked as invalid.

Executing dcbst or dcbf writes the cache line at the specified address to system memory if it contains modified data. Additionally, dcbf invalidates the line in the data cache. These instructions allow an application to explicitly update system memory with the cache contents.
This operation is usually necessary to enforce coherency when the processor and an external device access a common memory region.

dcbz <EA>
    Data Cache Block Set to Zero: If the data block at the effective address is in the cache, the data in the block is set to zero. Otherwise, dcbz establishes a cache block at the effective address and sets it to zero.

dcbz provides a means of establishing a line in the data cache at a given address without reading system memory. Doing so greatly improves the performance of algorithms that completely overwrite a target data area with new values.

dcbt <EA>
    Data Cache Block Touch: If the data block corresponding to the effective address is not in the data cache and the effective address is marked as cacheable, the block is read from main storage into the data cache.

icbt <EA>
    Instruction Cache Block Touch: If the instruction block containing the effective address is not in the instruction cache and the effective address is marked as cacheable, the block is read from main storage into the instruction cache.

As implemented in the PPC40x products, an instruction or data touch operation causes the associated cache to read the specified cache block from system memory. By explicitly loading a cache block before its contents are needed, software can reduce the number of pipeline stalls incurred.

Cache Initialization

Operating as local memory, caches provide the processor with fast access to recently referenced instructions and data. A reset or power-on sequence disables the instruction and data caches, which then contain random data, tags and line status bits. Before either cache is enabled, it must be explicitly invalidated by software.
A generic PowerPC assembly routine to invalidate the data cache is:

        lwz    r0,<classes>            # number of congruence classes in cache
        mtctr  r0                      # place in counter
        li     r1,0                    # clear r1
loop:   dccci  0,r1                    # invalidate class containing (r1)
        addi   r1,r1,<bytes per line>  # point to next class
        bdnz   loop                    # repeat until done

Clearing the instruction cache is identical except that the dccci instruction is replaced with iccci and the number of congruence classes may differ.