Computer Science and Artificial Intelligence Laboratory (CSAIL)

Massachusetts Institute of Technology

Application-Specific Memory Management in Embedded Systems Using Software-Controlled Caches

Derek Chiou, Prabhat Jain, Larry Rudolph, Srinivas Devadas

In the Proceedings of the 37th Design Automation Conference, June 2000

Computation Structures Group Memo 448

The Stata Center, 32 Vassar Street, Cambridge, Massachusetts 02139

Application-Specific Memory Management for Embedded Systems Using Software-Controlled Caches

Derek Chiou, Prabhat Jain, Larry Rudolph, and Srinivas Devadas
Department of EECS, Massachusetts Institute of Technology
Cambridge, MA 02139

ABSTRACT

We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of cache can be made to act like scratchpad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning in software, by mapping data regions to specified sets of cache "columns," or "ways." When a region of memory is exclusively mapped to an equivalent-sized partition of cache, column caching provides the same functionality and predictability as a dedicated scratchpad memory for time-critical parts of a real-time application. The ratio between scratchpad size and cache size can be easily and quickly varied for each application, or for each task within an application. Thus, software has much finer control of on-chip memory, providing the ability to dynamically trade off performance for on-chip memory.

1. INTRODUCTION

As time-to-market requirements of electronic systems demand ever faster design cycles, an ever increasing number of systems are built around a programmable embedded processor that implements an ever increasing amount of functionality in firmware running on the embedded processor. The advantages of doing so are twofold: software is simpler to implement than a dedicated hardware solution, and software can be easily changed to address design errors, late design changes, and product evolution. Only the most time-critical tasks need to be implemented in hardware.

On-chip memory in the form of cache, scratchpad SRAM, and, more recently, embedded DRAM, or some combination of the three, is ubiquitous in programmable embedded systems to support software and to provide an interface between hardware and software. Most systems have both cache and scratchpad memory on-chip, since each addresses a different need. Caches are transparent to software, since they are accessed through the same address space as the larger backing storage. They often improve overall software performance, but are unpredictable: although the cache replacement hardware is known, predicting its performance depends on accurately predicting past and future reference patterns. Scratchpad memory is addressed via an independent address space and thus must be managed explicitly by software, oftentimes a complex and cumbersome problem, but it provides absolutely predictable performance. Thus, even though a pure cache system may perform better overall, scratchpad memories are necessary to guarantee that critical performance metrics are always met.

Of course, both caches and scratchpad memories should be available to embedded systems so that the appropriate memory structure can be used in each instance. A static division, however, is guaranteed to be suboptimal, as different applications have different requirements. Previous research has shown that, even within a single application, dynamically varying the partitioning between cache and scratchpad memory can significantly improve performance.

We propose a way to dynamically allocate cache and scratchpad memories from a common pool of memory. In particular, we propose column caching, a simple modification that enables software to dynamically partition a cache into several distinct caches and scratchpad memories at a column granularity. In our reference implementation, each "way" of an n-way set-associative cache is a column. By exclusively allocating a region of address space to an equal-sized region of cache, column caching can emulate scratchpad memory. Column caching only restricts data placement within the cache during replacement; all other operations are unmodified.

Careful mapping can potentially reduce or eliminate replacement errors, resulting in improved performance. Column caching not only enables a cache to emulate scratchpad memory; it can also implement separate spatial/temporal caches, a separate prefetch buffer, separate write buffers, and other traditionally statically-partitioned structures within the general cache. The rest of this paper describes column caching and how it might be used.

[Figure 1 appears here: a block diagram of a four-column set-associative cache. The operation and virtual address feed the TLB; the TLB, hit/miss logic, replacement unit, and bus interface unit (BIU) connect to Columns 0 through 3.]

Figure 1: Basic column caching. Three modifications to a set-associative cache, denoted by dotted lines in the figure, are necessary: (i) an augmented TLB to hold mapping information, (ii) a modified replacement unit that uses mapping information, and (iii) a path between the TLB and the replacement unit that carries that information.

2. COLUMN CACHING

The simplest implementation of column caching is derived from a set-associative cache, where lower-order bits are used to select a set of cachelines which are then associatively searched for the desired data. If the data is not found (a cache miss), the replacement algorithm selects a cacheline from the selected set.

During lookup, a column cache behaves exactly as a standard set-associative cache and thus incurs no performance penalty on a cache hit. Rather than allowing the replacement algorithm to always select from any cacheline in the set, however, column caching provides the ability to restrict the replacement algorithm to certain columns. Each column is one "way," or bank, of the n-way set-associative cache (Figure 1). Embedded processors such as the ARM are highly set-associative to reduce power consumption, providing a large number of columns. A bit vector specifying the permissible set of columns is generated and passed to the replacement unit.
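To make the restriction concrete, here is a minimal C sketch of victim selection constrained by a permitted-column bit vector. It models in software what the modified replacement unit does in hardware; the column count, LRU bookkeeping, and all names are our illustrative assumptions, not the paper's specification.

```c
#include <stdint.h>
#include <assert.h>

#define NUM_COLUMNS 4            /* ways in the set-associative cache (assumed) */

typedef struct {
    uint32_t tag;
    int      valid;
    uint32_t lru_age;            /* larger = older; illustrative LRU state */
} CacheLine;

/* Select a victim within one set, considering only the columns whose
 * bits are set in `permitted`. On a miss, the (non-critical-path)
 * replacement unit would do the hardware equivalent of this loop. */
static int select_victim(const CacheLine set[NUM_COLUMNS], uint8_t permitted)
{
    int victim = -1;
    uint32_t oldest = 0;

    for (int col = 0; col < NUM_COLUMNS; col++) {
        if (!(permitted & (1u << col)))
            continue;            /* column not allotted to this mapping */
        if (!set[col].valid)
            return col;          /* prefer an empty permitted line */
        if (victim < 0 || set[col].lru_age > oldest) {
            victim = col;
            oldest = set[col].lru_age;
        }
    }
    assert(victim >= 0);         /* bit vector must name at least one column */
    return victim;
}
```

Note that the mask matters only when a victim must be chosen; on a hit it is never consulted, which is why lookup is unchanged.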

A modification to the bit vector repartitions the cache. Since every cacheline in the set is searched during every access, repartitioning is graceful and fast: if data is moved from one column to another, but always within the same set, the associative search will still find the data in its new location. A memory location can be cached in one column during one cycle, then remapped to another column on the next cycle. The cached data will not move to the new column instantaneously, but will remain in the old column until it is replaced. Once removed from the cache, it will be cached in a column to which it is mapped the next time it is accessed.

Column caching is implemented via three small modifications to a set-associative cache (Figure 1). The TLB must be modified to store the mapping information. The replacement unit must be modified to respect TLB-generated restrictions on replacement cacheline selection. A path to carry the mapping information from the TLB to the replacement unit must be provided. Similar control over the cache already exists for uncached data, since the cached/uncached bit resides in the TLB.

2.1 Partitioning and Repartitioning

Implementation is greatly simplified if the minimum mapping granularity is a page, since existing virtual memory translation mechanisms, including the ubiquitous translation lookaside buffers (TLBs), can be used to store mapping information that will be passed to the replacement unit. TLBs, accessed on every memory reference, are designed to be fast in order to minimize physical cache access time. Partitioning is supported by simply adding column caching mapping entries to the TLB data structures and providing a data path from those entries to the modified replacement unit. Therefore, in order to remap pages to columns, access to the page table entries is required.

Mapping a page to a cache partition (represented by a bit vector) is a two-phase process. Pages are mapped to a tint rather than to a bit vector directly. A tint is a virtual grouping of address spaces. For example, an entire streaming data structure could be mapped to a single tint, or all streaming data structures could be mapped to a single tint, or just the first page of several data structures could be mapped to a single tint. Tints are independently mapped to a set of columns represented by a bit vector; such mappings can be changed quickly. Thus, tints rather than bit vectors are stored in page table entries.

The tint level of indirection is introduced (i) to isolate the user from machine-specific information, such as the number of columns or the number of levels of the memory hierarchy, and (ii) to make remapping easier.
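A minimal sketch of this two-phase mapping, assuming a small software-managed tint table; the table size and field names are hypothetical, chosen only to illustrate the indirection.

```c
#include <stdint.h>

#define NUM_TINTS 16             /* number of distinct tints (assumed) */

/* Phase 2 state: each tint names a set of permitted columns. */
static uint8_t tint_to_columns[NUM_TINTS];

/* Phase 1: a page-table entry stores a small tint id, not a bit
 * vector; the TLB caches it alongside the translation. */
typedef struct {
    uint32_t phys_page;
    uint8_t  tint;
} PageTableEntry;

/* Repartitioning: one store changes the column set of every page
 * carrying this tint; no page-table walk is needed. */
static inline void set_tint_columns(uint8_t tint, uint8_t column_mask)
{
    tint_to_columns[tint] = column_mask;
}

/* The bit vector handed to the replacement unit on a miss. */
static inline uint8_t columns_for(const PageTableEntry *pte)
{
    return tint_to_columns[pte->tint];
}
```

Because page-table entries hold only a tint id, regrouping pages is cheap, and retargeting a whole group of pages to new columns is a single write into the tint table.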

2.2 Using Columns As Scratchpad Memory

Column caching can emulate scratchpad memory within the cache by dedicating a region of cache equal in size to a region of memory. No other memory regions are mapped to the same region of cache. Since there is a one-to-one mapping, once the data is brought into the cache, it will remain there. In order to guarantee performance, software can perform a load on all cachelines of data when remapping, as is required with a dedicated SRAM. That memory region then behaves like a scratchpad memory. If and when the scratchpad memory is remapped to a different use, the data is automatically copied back if backing RAM is provided.
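The software side of this emulation can be as simple as exclusively mapping the region and then touching each of its cachelines once. A hedged sketch, in which the mapping call, line size, and function names are assumptions for illustration, not an API from the paper:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 32       /* line size of the target cache (assumed) */

/* Hypothetical OS hook that gives `region` a tint mapped one-to-one
 * onto a dedicated set of columns. */
extern void map_region_to_exclusive_columns(void *region, size_t len);

/* Touch every cacheline so the whole region is resident. With a
 * one-to-one region/column mapping nothing else can evict it, so all
 * later accesses hit, as with a dedicated scratchpad SRAM. */
void pin_as_scratchpad(void *region, size_t len)
{
    map_region_to_exclusive_columns(region, len);
    volatile uint8_t *p = (volatile uint8_t *)region;
    for (size_t off = 0; off < len; off += CACHELINE_BYTES)
        (void)p[off];            /* load brings the line into its column */
}
```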

2.3 Impact of Column Caching on Clock Cycle

The modifications required for column caching are limited to the cache replacement unit, which is not on the critical path. In realistic systems, data requested from the L1 cache to main memory takes at least three cycles, but generally more, to return. The exact replacement cacheline does not need to be decided until the data returns, giving the replacement algorithm at least three cycles to make a decision, which should easily be sufficient for the minor additions to the replacement path.

3. PREDICTABILITY IN MULTITASKING

Column caching enables predictable performance within a multitasking environment where multiple jobs are executing. Consider three compression (gzip) jobs simultaneously executing on one processor, each having access to the cache. To understand what is happening, we only present the performance measurement of one gzip process in this mixture, referred to as job A in the rest of the discussion. We present the results in terms of clocks per instruction (CPI), which is inversely correlated with performance; a lower CPI means higher performance. To compute the CPI, we assume a fixed cycle latency to memory and that a fixed fraction of the instructions are memory operations. Figure 2 shows the variation in job A's CPI when the time quantum is varied. Results for both a standard cache and a mapped column cache are presented. The two sets of plots correspond to different sized (16K and 128K) caches.
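As a reading aid, CPI can be roughly modeled as a base cost plus the miss penalty weighted by the memory-operation fraction and the miss rate. The parameter values below are placeholders for illustration, not the paper's experimental constants.

```c
#include <stdio.h>

/* Rough CPI model: every instruction costs a base number of cycles,
 * and each memory operation that misses pays the memory latency.
 * All numeric values here are illustrative placeholders. */
static double cpi(double base_cpi, double mem_op_frac,
                  double miss_rate, double mem_latency)
{
    return base_cpi + mem_op_frac * miss_rate * mem_latency;
}

int main(void)
{
    /* A protected partition keeps job A's miss rate low across context
     * switches; an unprotected cache lets jobs B and C raise it. */
    printf("column cache:   CPI = %.2f\n", cpi(1.0, 0.3, 0.02, 30.0));
    printf("standard cache: CPI = %.2f\n", cpi(1.0, 0.3, 0.10, 30.0));
    return 0;
}
```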

[Figure 2 appears here: plots of clocks per instruction versus context-switch time quantum for gzip under 16K and 128K caches, each with and without column caching.]

Figure 2: Column caching provides predictable and superior performance to a standard cache over a wide range of scheduling time quanta. The clocks per instruction is a measure of performance: the smaller the number, the better. Except for very long time quantum periods, column caching provides superior performance. Also note that the performance of column caching is much less sensitive to time quantum times, as seen from the nearly horizontal curves for column caching.

Each point in this plot was generated by choosing a time quantum and performing a round-robin schedule of the three gzip jobs A, B, and C. There are two cases: (i) each job gets to use the entire cache while it is running (standard cache), and (ii) each job uses only its assigned columns (column cache). For the column cache, the critical job is permitted to use the entire cache while the other two jobs are restricted to using only a quarter of the cache. In the standard cache case, there is a significant difference in the CPI for job A as the time quantum varies. This variation is caused mainly by the cache hit rate for job A being affected by intervening cache accesses due to jobs B and C; the number of such accesses is dependent on the time quantum. Once column caching is introduced and most of job A's data is protected from replacement by job B's and job C's data, the CPI of job A is significantly less sensitive to the time quantum. Job A was considered critical and was exclusively assigned a large fraction of the cache; hence the hit rate for job A is higher, and therefore the CPI is significantly smaller for small time quanta. Of course, when the time quantum is really large, we effectively have batch processing and the CPIs are virtually the same for the top two plots. Overall system throughput may actually decrease due to an over-allocation of resources to a set of critical jobs, but the performance of those critical jobs is generally higher and has much less variation.

One may argue that the time quantum could be fixed for predictability, but in reality, due to interrupts and exceptions, the effective time quantum can vary significantly during the time that a job is running simultaneously with other jobs. Thus, column caching can improve the performance of a critical job as well as significantly reduce its performance variation, even in the presence of interrupts or varying time quanta.

4. RELATED WORK

4.1 Cache Mechanisms

The idea of statically-partitioned caches is not new. The most common example is separate instruction and data caches. Some existing and proposed architectures support a pair of caches, one for spatial locality and one for temporal locality. These designs statically divide the two caches. Other processors support locking of data into the cache, but do not include a way to tell if the desired data is in the cache. Sun Microsystems Corporation patented a mechanism [10] very similar to column caching that allows partitioning of a cache between processes at cache column granularity by providing a bit mask associated with the running process, limiting it to partitioning the cache between processes.

A subset of column caching abilities is achievable without hardware support by page coloring, achieved by intelligently mapping virtual pages to physical pages. Column caching, however, is much faster at repartitioning (page coloring requires a memory copy), uses set-associative caches better, and enables such abilities as mapping a large contiguous region of address space to a small region in the cache, useful for memory-mapped devices.
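For contrast with the hardware mechanism, a rough sketch of the arithmetic behind software-only page coloring, assuming a physically-indexed cache; the sizes and names are illustrative only.

```c
#include <stdint.h>

#define PAGE_SIZE   4096u
#define CACHE_SIZE  (128u * 1024u)   /* assumed cache capacity */
#define NUM_WAYS    4u

/* A physical page's "color" is the slice of cache sets it maps to;
 * the OS controls placement by choosing physical pages of the right
 * color when backing a virtual region. */
#define NUM_COLORS  (CACHE_SIZE / NUM_WAYS / PAGE_SIZE)

static inline unsigned page_color(uintptr_t phys_addr)
{
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}
```

Changing a region's color means allocating differently-colored physical pages and copying the data over, which is why repartitioning by page coloring is slow compared with rewriting a tint's bit vector.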

4.2 Memory Exploration in Embedded Systems

Cache memory issues have been studied in the context of embedded systems. McFarling presents techniques of code placement in main memory to maximize the instruction cache hit ratio [8]. A model for partitioning an instruction cache among multiple processes has been presented. Panda, Dutt, and Nicolau present techniques for partitioning on-chip memory into scratchpad memory and cache [11]. The presented algorithm assumes a fixed amount of scratchpad memory and a fixed-size cache, identifies critical variables, and assigns them to scratchpad memory. The algorithm can be run repeatedly to find the optimum performance point.

5. CONCLUSIONS

The work described in this paper represents a confluence of two observations. The first observation is that, given heterogeneous applications with data streams that have significant variation in their locality properties, it is worthwhile to provide finer software control of the cache so that the cache can be used more efficiently. To this end, we have proposed a column caching mechanism that enables cache partitioning, so that data with different locality characteristics can be isolated for improved performance. The second observation is that columns can emulate scratchpad memory, which is used extensively to improve predictability in embedded systems. A significant benefit of column caching is that, through the execution of a program, the data stored in columns can be explicitly managed as in a scratchpad, or implicitly managed as in a cache, and that the management can change dynamically and at very small intervals.

Acknowledgements. This paper describes research done at the Laboratory for Computer Science of the Massachusetts Institute of Technology. Funding for this work is provided in part by the Advanced Research Projects Agency of the Department of Defense under Air Force Research Laboratory contract F.

6. REFERENCES

[1] K. Asanovic. Vector Microprocessors. PhD thesis, University of California at Berkeley, May 1998.

[2] D. T. Chiou. Extending the Reach of Microprocessors: Column and Curious Caching. PhD thesis, Department of EECS, MIT, Cambridge, MA, September 1999.

[3] Cyrix. Cyrix 6x86MX, May.

[4] G. Faanes. A CMOS Vector Processor with a Custom Streaming Cache. In Hot Chips, August 1998.

[5] Intel. Intel StrongARM SA-1100, April.

[6] Y. Li and W. Wolf. A Task-Level Hierarchical Memory Model for System Synthesis of Multiprocessors. In Proceedings of the 34th Design Automation Conference, June 1997.

[7] B. Lynch and G. Lauterbach. UltraSPARC-III: A 600 MHz 64-bit Superscalar Processor for 1000-Way Scalable Systems. In Hot Chips.

[8] S. McFarling. Program Optimization for Instruction Caches. In Proceedings of the 3rd Intl. Conference on Architectural Support for Programming Languages and Operating Systems, April 1989.

[9] Motorola. MPC Integrated Processor User's Manual, July.

[10] B. Nayfeh and Y. A. Khalidi. Apparatus and Method to Preserve Data in a Set Associative Memory Device. U.S. patent, December 1996.

[11] P. R. Panda, N. Dutt, and A. Nicolau. Memory Issues in Embedded Systems-on-Chip: Optimizations and Exploration. Kluwer Academic Publishers, 1999.

[12] P. G. Paulin, C. Liem, T. C. May, and S. Sutarwala. DSP Design Tool Requirements for Embedded Systems: A Telecommunications Industrial Perspective. Journal of VLSI Signal Processing, January 1995.

[13] J. V. Praet, G. Goossens, D. Lanneer, and H. D. Man. Instruction Set Definition and Instruction Selection for ASIPs. In Proceedings of the 7th IEEE/ACM International Symposium on High-Level Synthesis, May 1994.

[14] F. Sanchez, A. Gonzalez, and M. Valero. Software Management of Selective and Dual Data Caches. In IEEE Computer Society Technical Committee on Computer Architecture: Special Issue on Distributed Shared Memory and Related Issues, March.

[15] M. Tomasko, S. Hadjiyiannis, and W. Najjar. Experimental Evaluation of Array Caches. In IEEE Computer Society Technical Committee on Computer Architecture: Special Issue on Distributed Shared Memory and Related Issues, March.

[16] H. Tomiyama and H. Yasuura. Code Placement Techniques for Cache Miss Rate Reduction. ACM Transactions on Design Automation of Electronic Systems, October 1997.