Computer Science and Artificial Intelligence Laboratory (CSAIL)

Massachusetts Institute of Technology

Application-Specific Memory Management in Embedded Systems Using Software-Controlled Caches

Derek Chiou, Prabhat Jain, Larry Rudolph, Srinivas Devadas

In the Proceedings of the 37th Design Automation Conference, June 2000

Computation Structures Group Memo 448

The Stata Center, 32 Vassar Street, Cambridge, Massachusetts 02139

Application-Specific Memory Management for Embedded Systems Using Software-Controlled Caches

Derek Chiou, Prabhat Jain, Larry Rudolph, and Srinivas Devadas
Department of EECS, Massachusetts Institute of Technology
Cambridge, MA 02139

ABSTRACT

We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of cache can be made to act like scratchpad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning in software, by mapping data regions to specified sets of cache "columns," or "ways." When a region of memory is exclusively mapped to an equivalent-sized partition of cache, column caching provides the same functionality and predictability as a dedicated scratchpad memory for time-critical parts of a real-time application. The ratio between scratchpad size and cache size can be easily and quickly varied for each application, or for each task within an application. Thus, software has much finer control of on-chip memory, providing the ability to dynamically trade off performance for on-chip memory.

1. INTRODUCTION

As time-to-market requirements of electronic systems demand ever faster design cycles, an ever increasing number of systems are built around a programmable embedded processor that implements an ever increasing amount of functionality in firmware running on the embedded processor. The advantages of doing so are twofold: software is simpler to implement than a dedicated hardware solution, and software can be easily changed to address design errors, late design changes, and product evolution. Only the most time-critical tasks need to be implemented in hardware.

On-chip memory in the form of cache, scratchpad SRAM, and, more recently, embedded DRAM, or some combination of the three, is ubiquitous in programmable embedded systems to support software and to provide an interface between hardware and software. Most systems have both cache and scratchpad memory on-chip, since each addresses a different need. Caches are transparent to software, since they are accessed through the same address space as the larger backing storage. They often improve overall software performance, but are unpredictable: although the cache replacement hardware is known, predicting its performance depends on accurately predicting past and future reference patterns. Scratchpad memory is addressed via an independent address space and thus must be managed explicitly by software, oftentimes a complex and cumbersome problem, but it provides absolutely predictable performance. Thus, even though a pure cache system may perform better overall, scratchpad memories are necessary to guarantee that critical performance metrics are always met.

Of course, both caches and scratchpad memories should be available to embedded systems so that the appropriate memory structure can be used in each instance. A static division, however, is guaranteed to be suboptimal, as different applications have different requirements. Previous research has shown that, even within a single application, dynamically varying the partitioning between cache and scratchpad memory can significantly improve performance.

We propose a way to dynamically allocate cache and scratchpad memories from a common pool of memory. In particular, we propose column caching, a simple modification that enables software to dynamically partition a cache into several distinct caches and scratchpad memories at a column granularity. In our reference implementation, each "way" of an n-way set-associative cache is a column. By exclusively allocating a region of address space to an equal-sized region of cache, column caching can emulate scratchpad memory. Column caching only restricts data placement within the cache during replacement; all other operations are unmodified.

Careful mapping can potentially reduce or eliminate replacement errors, resulting in improved performance. Column caching not only enables a cache to emulate scratchpad memory; it can also implement separate spatial/temporal caches, a separate prefetch buffer, separate write buffers, and other traditionally statically-partitioned structures within the general cache. The rest of this paper describes column caching and how it might be used.

[Figure 1 appears here: a block diagram of a four-column set-associative cache. The operation and virtual address feed the TLB; the TLB, hit/miss logic, replacement unit, and bus interface unit (BIU) connect to Columns 0 through 3.]

Figure 1: Basic column caching. Three modifications to a set-associative cache, denoted by dotted lines in the figure, are necessary: (i) an augmented TLB to hold mapping information, (ii) a modified replacement unit that uses mapping information, and (iii) a path between the TLB and the replacement unit that carries that information.

2. COLUMN CACHING

The simplest implementation of column caching is derived from a set-associative cache, where lower-order bits are used to select a set of cachelines which are then associatively searched for the desired data. If the data is not found (a cache miss), the replacement algorithm selects a cacheline from the selected set.

During lookup, a column cache behaves exactly as a standard set-associative cache and thus incurs no performance penalty on a cache hit. Rather than allowing the replacement algorithm to always select from any cacheline in the set, however, column caching provides the ability to restrict the replacement algorithm to certain columns. Each column is one "way," or bank, of the n-way set-associative cache (Figure 1). Embedded processors such as the ARM are highly set-associative to reduce power consumption, providing a large number of columns. A bit vector specifying the permissible set of columns is generated and passed to the replacement unit.
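To make the restriction concrete, here is a minimal C sketch of victim selection constrained by a permitted-column bit vector. It models in software what the modified replacement unit does in hardware; the column count, LRU bookkeeping, and all names are our illustrative assumptions, not the paper's specification.

```c
#include <stdint.h>
#include <assert.h>

#define NUM_COLUMNS 4            /* ways in the set-associative cache (assumed) */

typedef struct {
    uint32_t tag;
    int      valid;
    uint32_t lru_age;            /* larger = older; illustrative LRU state */
} CacheLine;

/* Select a victim within one set, considering only the columns whose
 * bits are set in `permitted`. On a miss, the (non-critical-path)
 * replacement unit would do the hardware equivalent of this loop. */
static int select_victim(const CacheLine set[NUM_COLUMNS], uint8_t permitted)
{
    int victim = -1;
    uint32_t oldest = 0;

    for (int col = 0; col < NUM_COLUMNS; col++) {
        if (!(permitted & (1u << col)))
            continue;            /* column not allotted to this mapping */
        if (!set[col].valid)
            return col;          /* prefer an empty permitted line */
        if (victim < 0 || set[col].lru_age > oldest) {
            victim = col;
            oldest = set[col].lru_age;
        }
    }
    assert(victim >= 0);         /* bit vector must name at least one column */
    return victim;
}
```

Note that the mask matters only when a victim must be chosen; on a hit it is never consulted, which is why lookup is unchanged.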

A modification to the bit vector repartitions the cache. Since every cacheline in the set is searched during every access, repartitioning is graceful and fast: if data is moved from one column to another, but always within the same set, the associative search will still find the data in its new location. A memory location can be cached in one column during one cycle, then remapped to another column on the next cycle. The cached data will not move to the new column instantaneously, but will remain in the old column until it is replaced. Once removed from the cache, it will be cached in a column to which it is mapped the next time it is accessed.

Column caching is implemented via three small modifications to a set-associative cache (Figure 1). The TLB must be modified to store the mapping information. The replacement unit must be modified to respect TLB-generated restrictions on replacement cacheline selection. A path to carry the mapping information from the TLB to the replacement unit must be provided. Similar control over the cache already exists for uncached data, since the cached/uncached bit resides in the TLB.

2.1 Partitioning and Repartitioning

Implementation is greatly simplified if the minimum mapping granularity is a page, since existing virtual memory translation mechanisms, including the ubiquitous translation lookaside buffers (TLBs), can be used to store mapping information that will be passed to the replacement unit. TLBs, accessed on every memory reference, are designed to be fast in order to minimize physical cache access time. Partitioning is supported by simply adding column caching mapping entries to the TLB data structures and providing a data path from those entries to the modified replacement unit. Therefore, in order to remap pages to columns, access to the page table entries is required.

Mapping a page to a cache partition (represented by a bit vector) is a two-phase process. Pages are mapped to a tint rather than to a bit vector directly. A tint is a virtual grouping of address spaces. For example, an entire streaming data structure could be mapped to a single tint, or all streaming data structures could be mapped to a single tint, or just the first page of several data structures could be mapped to a single tint. Tints are independently mapped to a set of columns represented by a bit vector; such mappings can be changed quickly. Thus, tints rather than bit vectors are stored in page table entries.

The tint level of indirection is introduced (i) to isolate the user from machine-specific information, such as the number of columns or the number of levels of the memory hierarchy, and (ii) to make remapping easier.
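A minimal sketch of this two-phase mapping, assuming a small software-managed tint table; the table size and field names are hypothetical, chosen only to illustrate the indirection.

```c
#include <stdint.h>

#define NUM_TINTS 16             /* number of distinct tints (assumed) */

/* Phase 2 state: each tint names a set of permitted columns. */
static uint8_t tint_to_columns[NUM_TINTS];

/* Phase 1: a page-table entry stores a small tint id, not a bit
 * vector; the TLB caches it alongside the translation. */
typedef struct {
    uint32_t phys_page;
    uint8_t  tint;
} PageTableEntry;

/* Repartitioning: one store changes the column set of every page
 * carrying this tint; no page-table walk is needed. */
static inline void set_tint_columns(uint8_t tint, uint8_t column_mask)
{
    tint_to_columns[tint] = column_mask;
}

/* The bit vector handed to the replacement unit on a miss. */
static inline uint8_t columns_for(const PageTableEntry *pte)
{
    return tint_to_columns[pte->tint];
}
```

Because page-table entries hold only a tint id, regrouping pages is cheap, and retargeting a whole group of pages to new columns is a single write into the tint table.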

2.2 Using Columns As Scratchpad Memory

Column caching can emulate scratchpad memory within the cache by dedicating a region of cache equal in size to a region of memory. No other memory regions are mapped to the same region of cache. Since there is a one-to-one mapping, once the data is brought into the cache, it will remain there. In order to guarantee performance, software can perform a load on all cachelines of data when remapping, as is required with a dedicated SRAM. That memory region then behaves like a scratchpad memory. If and when the scratchpad memory is remapped to a different use, the data is automatically copied back if backing RAM is provided.
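The software side of this emulation can be as simple as exclusively mapping the region and then touching each of its cachelines once. A hedged sketch, in which the mapping call, line size, and function names are assumptions for illustration, not an API from the paper:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 32       /* line size of the target cache (assumed) */

/* Hypothetical OS hook that gives `region` a tint mapped one-to-one
 * onto a dedicated set of columns. */
extern void map_region_to_exclusive_columns(void *region, size_t len);

/* Touch every cacheline so the whole region is resident. With a
 * one-to-one region/column mapping nothing else can evict it, so all
 * later accesses hit, as with a dedicated scratchpad SRAM. */
void pin_as_scratchpad(void *region, size_t len)
{
    map_region_to_exclusive_columns(region, len);
    volatile uint8_t *p = (volatile uint8_t *)region;
    for (size_t off = 0; off < len; off += CACHELINE_BYTES)
        (void)p[off];            /* load brings the line into its column */
}
```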

2.3 Impact of Column Caching on Clock Cycle

The modifications required for column caching are limited to the cache replacement unit, which is not on the critical path. In realistic systems, data requested from the L1 cache to main memory takes at least three cycles, but generally more, to return. The exact replacement cacheline does not need to be decided until the data returns, giving the replacement algorithm at least three cycles to make a decision, which should easily be sufficient for the minor additions to the replacement path.

3. PREDICTABILITY IN MULTITASKING

Column caching enables predictable performance within a multitasking environment where multiple jobs are executing. Consider three compression (gzip) jobs simultaneously executing on one processor, each having access to the cache. To understand what is happening, we only present the performance measurement of one gzip process in this mixture, referred to as job A in the rest of the discussion. We present the results in terms of clocks per instruction (CPI), which is inversely correlated with performance; a lower CPI means higher performance. To compute the CPI, we assume a fixed cycle latency to memory and that a fixed fraction of the instructions are memory operations. Figure 2 shows the variation in job A's CPI when the time quantum is varied. Results for both a standard cache and a mapped column cache are presented. The two sets of plots correspond to different sized (16K and 128K) caches.
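As a reading aid, CPI can be roughly modeled as a base cost plus the miss penalty weighted by the memory-operation fraction and the miss rate. The parameter values below are placeholders for illustration, not the paper's experimental constants.

```c
#include <stdio.h>

/* Rough CPI model: every instruction costs a base number of cycles,
 * and each memory operation that misses pays the memory latency.
 * All numeric values here are illustrative placeholders. */
static double cpi(double base_cpi, double mem_op_frac,
                  double miss_rate, double mem_latency)
{
    return base_cpi + mem_op_frac * miss_rate * mem_latency;
}

int main(void)
{
    /* A protected partition keeps job A's miss rate low across context
     * switches; an unprotected cache lets jobs B and C raise it. */
    printf("column cache:   CPI = %.2f\n", cpi(1.0, 0.3, 0.02, 30.0));
    printf("standard cache: CPI = %.2f\n", cpi(1.0, 0.3, 0.10, 30.0));
    return 0;
}
```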

[Figure 2 appears here: plots of clocks per instruction versus context-switch time quantum for gzip under 16K and 128K caches, each with and without column caching.]

Figure 2: Column caching provides predictable and superior performance to a standard cache over a wide range of scheduling time quanta. The clocks per instruction is a measure of performance: the smaller the number, the better. Except for very long time quantum periods, column caching provides superior performance. Also note that the performance of column caching is much less sensitive to time quantum times, as seen from the nearly horizontal curves for column caching.

Each point in this plot was generated by choosing a time quantum and performing a round-robin schedule of the three gzip jobs A, B, and C. There are two cases: (i) each job gets to use the entire cache while it is running (standard cache), and (ii) each job uses only its assigned columns (column cache). For the column cache, the critical job is permitted to use the entire cache while the other two jobs are restricted to using only a quarter of the cache. In the standard cache case, there is a significant difference in the CPI for job A as the time quantum varies. This variation is caused mainly by the cache hit rate for job A being affected by intervening cache accesses due to jobs B and C; the number of such accesses is dependent on the time quantum. Once column caching is introduced and most of job A's data is protected from replacement by job B's and job C's data, the CPI of job A is significantly less sensitive to the time quantum. Job A was considered critical and was exclusively assigned a large fraction of the cache; hence the hit rate for job A is higher, and therefore the CPI is significantly smaller for small time quanta. Of course, when the time quantum is really large, we effectively have batch processing and the CPIs are virtually the same for the top two plots. Overall system throughput may actually decrease due to an over-allocation of resources to a set of critical jobs, but the performance of those critical jobs is generally higher and has much less variation.

One may argue that the time quantum could be fixed for predictability, but in reality, due to interrupts and exceptions, the effective time quantum can vary significantly during the time that a job is running simultaneously with other jobs. Thus, column caching can improve the performance of a critical job as well as significantly reduce its performance variation, even in the presence of interrupts or varying time quanta.

4. RELATED WORK

4.1 Cache Mechanisms

The idea of statically-partitioned caches is not new. The most common example is separate instruction and data caches. Some existing and proposed architectures support a pair of caches, one for spatial locality and one for temporal locality. These designs statically divide the two caches. Other processors support locking of data into the cache, but do not include a way to tell if the desired data is in the cache. Sun Microsystems Corporation patented a mechanism [10] very similar to column caching that allows partitioning of a cache between processes at cache column granularity by providing a bit mask associated with the running process, limiting it to partitioning the cache between processes.

A subset of column caching abilities is achievable without hardware support by page coloring, achieved by intelligently mapping virtual pages to physical pages. Column caching, however, is much faster at repartitioning (page coloring requires a memory copy), uses set-associative caches better, and enables such abilities as mapping a large contiguous region of address space to a small region in the cache, useful for memory-mapped devices.
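For contrast with the hardware mechanism, a rough sketch of the arithmetic behind software-only page coloring, assuming a physically-indexed cache; the sizes and names are illustrative only.

```c
#include <stdint.h>

#define PAGE_SIZE   4096u
#define CACHE_SIZE  (128u * 1024u)   /* assumed cache capacity */
#define NUM_WAYS    4u

/* A physical page's "color" is the slice of cache sets it maps to;
 * the OS controls placement by choosing physical pages of the right
 * color when backing a virtual region. */
#define NUM_COLORS  (CACHE_SIZE / NUM_WAYS / PAGE_SIZE)

static inline unsigned page_color(uintptr_t phys_addr)
{
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}
```

Changing a region's color means allocating differently-colored physical pages and copying the data over, which is why repartitioning by page coloring is slow compared with rewriting a tint's bit vector.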

4.2 Memory Exploration in Embedded Systems

Cache memory issues have been studied in the context of embedded systems. McFarling presents techniques of code placement in main memory to maximize the instruction cache hit ratio [8]. A model for partitioning an instruction cache among multiple processes has been presented. Panda, Dutt, and Nicolau present techniques for partitioning on-chip memory into scratchpad memory and cache [11]. The presented algorithm assumes a fixed amount of scratchpad memory and a fixed-size cache, identifies critical variables, and assigns them to scratchpad memory. The algorithm can be run repeatedly to find the optimum performance point.

5. CONCLUSIONS

The work described in this paper represents a confluence of two observations. The first observation is that, given heterogeneous applications with data streams that have significant variation in their locality properties, it is worthwhile to provide finer software control of the cache so that the cache can be used more efficiently. To this end, we have proposed a column caching mechanism that enables cache partitioning, so that data with different locality characteristics can be isolated for improved performance. The second observation is that columns can emulate scratchpad memory, which is used extensively to improve predictability in embedded systems. A significant benefit of column caching is that, through the execution of a program, the data stored in columns can be explicitly managed as in a scratchpad, or implicitly managed as in a cache, and that the management can change dynamically and at very small intervals.

Acknowledgements. This paper describes research done at the Laboratory for Computer Science of the Massachusetts Institute of Technology. Funding for this work is provided in part by the Advanced Research Projects Agency of the Department of Defense under Air Force Research Laboratory contract F.

6. REFERENCES

[1] K. Asanovic. Vector Microprocessors. PhD thesis, University of California at Berkeley, May 1998.

[2] D. T. Chiou. Extending the Reach of Microprocessors: Column and Curious Caching. PhD thesis, Department of EECS, MIT, Cambridge, MA, September 1999.

[3] Cyrix. Cyrix 6x86MX, May.

[4] G. Faanes. A CMOS Vector Processor with a Custom Streaming Cache. In Hot Chips, August 1998.

[5] Intel. Intel StrongARM SA-1100, April.

[6] Y. Li and W. Wolf. A Task-Level Hierarchical Memory Model for System Synthesis of Multiprocessors. In Proceedings of the 34th Design Automation Conference, June 1997.

[7] B. Lynch and G. Lauterbach. UltraSPARC-III: A 600 MHz 64-bit Superscalar Processor for 1000-Way Scalable Systems. In Hot Chips.

[8] S. McFarling. Program Optimization for Instruction Caches. In Proceedings of the 3rd Intl. Conference on Architectural Support for Programming Languages and Operating Systems, April 1989.

[9] Motorola. MPC Integrated Processor User's Manual, July.

[10] B. Nayfeh and Y. A. Khalidi. Apparatus and Method to Preserve Data in a Set Associative Memory Device. U.S. patent, December 1996.

[11] P. R. Panda, N. Dutt, and A. Nicolau. Memory Issues in Embedded Systems-on-Chip: Optimizations and Exploration. Kluwer Academic Publishers, 1999.

[12] P. G. Paulin, C. Liem, T. C. May, and S. Sutarwala. DSP Design Tool Requirements for Embedded Systems: A Telecommunications Industrial Perspective. Journal of VLSI Signal Processing, January 1995.

[13] J. V. Praet, G. Goossens, D. Lanneer, and H. D. Man. Instruction Set Definition and Instruction Selection for ASIPs. In Proceedings of the 7th IEEE/ACM International Symposium on High-Level Synthesis, May 1994.

[14] F. Sanchez, A. Gonzalez, and M. Valero. Software Management of Selective and Dual Data Caches. In IEEE Computer Society Technical Committee on Computer Architecture: Special Issue on Distributed Shared Memory and Related Issues, March.

[15] M. Tomasko, S. Hadjiyiannis, and W. Najjar. Experimental Evaluation of Array Caches. In IEEE Computer Society Technical Committee on Computer Architecture: Special Issue on Distributed Shared Memory and Related Issues, March.

[16] H. Tomiyama and H. Yasuura. Code Placement Techniques for Cache Miss Rate Reduction. ACM Transactions on Design Automation of Electronic Systems, October 1997.