
ActiveOS: Virtualizing Intelligent Memory

Mark Oskin, Frederic T. Chong, and Timothy Sherwood*

Department of Computer Science

University of California at Davis

Abstract

Current trends in DRAM memory chip fabrication have led many researchers to propose "intelligent memory" architectures that integrate microprocessors or logic with memory. Such architectures offer a potential solution to the growing communication bottleneck between conventional microprocessors and memory. Previous studies, however, have focused upon single-chip systems and have largely neglected off-chip communication in larger systems.

We introduce ActiveOS, an operating system which demonstrates multi-process execution on Active Pages [OCS98], a page-based intelligent memory system. We present results from multiprogrammed workloads running on a prototype operating system implemented on top of the SimpleScalar processor simulator. Our results indicate that paging and inter-chip communication can be scheduled to achieve high performance for applications that use Active Pages, with minimal adverse effects to applications that only use conventional pages. Overall, ActiveOS allows Active Pages to accelerate individual applications by up to 1000 percent and to accelerate total workloads from 20 to 60 percent as long as physical memory can contain the working set of each individual application.

1 Introduction

Microprocessor performance continues to follow phenomenal growth curves which drive the computing industry. Unfortunately, memory systems are falling behind when "feeding" data to these processors. Processor-centric optimizations to bridge this processor-memory gap [WM95][Wil95] include prefetching, speculation, out-of-order execution, and multithreading. Unfortunately, many of these approaches can lead to memory-bandwidth problems [BGK96].

DRAM memory technology, however, is growing significantly denser. The Semiconductor Industry Association (SIA) roadmap [Sem94] projects mass production of 1-gigabit DRAM chips by the year 2001. If we devote half of the area of such a chip to logic, we expect the DRAM process to support approximately 32 million transistors. This density has led many researchers to propose intelligent memory systems that integrate processors or logic with DRAM. The increased bandwidth and lower latency of on-chip communication promise to address many aspects of the processor-memory gap [P+97][KADP97].

These improvements, however, are limited to single-chip systems with limited memory requirements such as PDAs. We expect that the memory demands of most systems will scale with DRAM density, and multiple memory chips will be required. We introduce Active Pages, a page-based architecture for intelligent memory systems (see Figure 1). Active Pages consist of a superpage of data and a collection of functions which operate on that data. Implementations of Active Pages can execute functions for hundreds of pages simultaneously, providing substantial parallelism.

To use Active Pages, computation for an application must be divided, or partitioned, between processor and memory. For example, we use Active-Page functions to gather operands for a sparse-matrix multiply and pass those operands on to the processor for multiplication. To perform such a computation, the matrix data and gathering functions must first be loaded into a memory system that supports Active Pages. Then the processor, through a series of memory-mapped writes, starts the gather functions in the memory system. As the operands are gathered, the processor reads them from user-defined output areas in each page, multiplies them, and writes the results back to the array data structures in memory.
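To make the division of labor concrete, the sketch below shows one possible shape of the processor-side loop for this gather-and-multiply pattern. It is only an illustration of the partitioning described above, not the ActiveOS interface itself; the ActivePageHandle structure and its fields (start_gather, output_ready, output_area, result_area) are hypothetical stand-ins for the memory-mapped writes, synchronization variables, and output areas discussed in Section 2.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical handles standing in for the memory-mapped interface of
    // Section 2; they are not part of ActiveOS itself.
    struct ActivePageHandle {
        volatile uint32_t* start_gather;  // memory-mapped write starts the gather function
        volatile uint32_t* output_ready;  // synchronization variable set by the page
        const double*      output_area;   // operand pairs gathered by the page
        double*            result_area;   // results written back into the array data
        std::size_t        max_pairs;
    };

    // Processor side of the partitioned sparse-matrix multiply: the memory
    // system gathers operand pairs, the processor multiplies them.
    void multiply_partitioned(ActivePageHandle* pages, std::size_t num_pages) {
        // 1. Start the gather functions on every page with memory-mapped writes.
        for (std::size_t p = 0; p < num_pages; ++p)
            *pages[p].start_gather = 1;

        // 2. As pages finish, read operands, multiply, and write results back.
        for (std::size_t p = 0; p < num_pages; ++p) {
            while (*pages[p].output_ready == 0) { /* spin on the sync variable */ }
            for (std::size_t i = 0; i < pages[p].max_pairs; ++i) {
                double a = pages[p].output_area[2 * i];
                double b = pages[p].output_area[2 * i + 1];
                pages[p].result_area[i] = a * b;
            }
        }
    }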

Previous work [OCS98] has shown that Active Pages are a flexible architecture that can lead to dramatic performance improvements (up to 1000X) in a single process over conventional memory systems. This paper evaluates Active Pages in a multi-process environment. Active Pages fundamentally change memory systems from a uniform storage resource to a non-uniform collection of storage and computational elements that require both management and scheduling.

We introduce ActiveOS, a prototype operating system which illustrates the interaction between an OS and Active Pages. Our goal is to demonstrate that Active Pages can perform correctly and efficiently in a multi-process environment. In such an environment, our results show that while paging Active Pages to and from disk can be expensive, opportunities to overlap memory and processor computation increase dramatically. On mixed workloads of conventional and Active-Page applications, Active-Page single-process performance translates well to a multi-process environment as long as memory pressure is moderate. Note that our current cycle-level simulation environment limits workload sizes in our experiments. Our applications are highly scalable, however, and we expect our results to generalize to larger workload sizes.

The remainder of this paper is organized as follows: Section 2 describes the three main services provided by ActiveOS to support Active Pages. Section 3 describes our experimental methodology and Section 4 its limitations. Section 5 describes the applications in our workload. Section 6 presents our results. Section 7 discusses related work. Finally, Sections 8 and 9 present future work and conclusions.

Acknowledgments: Thanks to Andre Dehon, Matt Farrens, Lance Halstead, Diana Keen, and our anonymous referees. This work is supported in part by an NSF CAREER award to Fred Chong, by NSF grant CCR-9812415, by grants from Altera, and by grants from the UC Davis Academic Senate. More info at http://arch.cs.ucdavis.edu/AP

*Now at University of California, San Diego

Figure 1: The Active Page architecture (a processor and disk attached to a memory system whose pages hold both data (D) and functions (F)).

2 ActiveOS

ActiveOS is a prototype operating system whose primary goal is to demonstrate mechanisms which correctly manage Active Pages in a multi-process environment. These mechanisms fall into three key categories: process services, interpage communication, and virtual memory.

2.1 Process Services

The first step in developing operating system support for Active Pages involves providing basic services for user processes to allocate pages and bind functions to them. Furthermore, a process must be able to efficiently call these functions.

The Active Page interface is designed to resemble the interface of a conventional memory system. Processes communicate with Active Pages through memory reads and writes. Specifically, the interface includes:

- Standard memory interface functions: write(address, data) and read(address)

- A set of functions available for computation on a particular Active Page: AP_functions

- An allocation function: AP_alloc(group_id), which allocates an Active Page in group group_id and returns the virtual address of the start of the superpage allocated. Pages operating on the same data will often belong to a page group, named by a group_id, in order to coordinate operations.

- A binding function: AP_bind(group_id, AP_functions), which associates a set of functions AP_functions with group group_id of Active Pages.

- A set of synchronization variables which the AP_functions and the processor use to coordinate activities between the processor and Active Pages within the page group.

There are two important implications of this interface. First, the synchronization variables are essentially memory-mapped registers that allow a process to efficiently invoke Active-Page functions without a system call. Second, the allocation of Active Pages, which are significantly larger than conventional pages, requires management of mixed superpages and pages. This memory management will become a significant issue as we discuss paging and our performance results.
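A minimal sketch of how a process might drive this interface is shown below. The calls follow the functions listed above (AP_alloc, AP_bind), but the surrounding declarations, the count_matches function set, and the treatment of synchronization variables as plain volatile words at the start of the superpage are illustrative assumptions, not the actual ActiveOS headers.

    #include <cstdint>
    #include <cstdlib>

    // Illustrative stand-ins for the interface described above; in a real
    // system these would be provided by the OS and the memory hardware.
    using group_id_t = int;
    struct AP_functions {};                    // functions to run at the memory
    static AP_functions count_matches_funcs;   // hypothetical bound function set

    static void* AP_alloc(group_id_t /*group*/) {
        // stub: an ordinary buffer stands in for the 512 Kbyte superpage
        return std::calloc(512 * 1024, 1);
    }
    static void AP_bind(group_id_t /*group*/, AP_functions* /*funcs*/) {
        // stub: would associate the function set with the page group
    }

    int main() {
        const group_id_t group = 1;

        void* page = AP_alloc(group);           // allocate an Active Page
        AP_bind(group, &count_matches_funcs);   // bind functions to its group

        // By convention (an assumption here), the first words of the superpage
        // are synchronization variables: plain loads and stores, no system call.
        volatile uint32_t* sync = static_cast<volatile uint32_t*>(page);
        sync[0] = 1;                            // memory-mapped write starts the function
        // while (sync[1] == 0) { }             // real code would wait for completion
        uint32_t result = sync[2];              // read the result from the output area

        std::free(page);
        return static_cast<int>(result);
    }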

2.2 Interpage Communication

An application will generally use more data than will fit on a single Active Page (each one is 512 Kbytes in our current implementation). Such data will often reside in page groups, collections of Active Pages which work together to perform functions on the data. For example, inserting an element into the middle of an array that spans multiple pages would require elements to move between pages.

Such data movement is accomplished entirely via explicit interpage communication requests. Active-Page functions can reference any virtual address available to their allocating process. For performance, applications should keep references from functions to addresses on other pages to a minimum.

Our implementation uses a processor-mediated approach to satisfying inter-page memory references. Each Active Page explicitly distinguishes between local and non-local references by checking whether a virtual address lies between its starting address, kept in a base register, and the starting address plus the Active-Page size. When an Active Page reaches a non-local reference, it raises an interrupt and blocks. The interrupt causes the processor to obtain the memory request through a memory-mapped read and execute an OS handler. The handler consults a per-process table to determine whether the Active Page containing the reference is in physical memory. If so, the page is located and the request is satisfied. If not, a page fault occurs and the page is brought in from disk. In practice, a large number of non-local references from different pages can be satisfied upon a single interrupt, substantially reducing overhead. We assume that a single interrupt takes 3.5 ms.

Our processor-mediated approach makes inter-page communication expensive, but it greatly simplifies page faults and reduces the complexity of Active Page chip implementations. We assume that our memory chips can trigger processor interrupts, but processor polling is also a possibility.
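The locality test itself is simple; a sketch is given below. The check against a base register and the fixed 512 Kbyte Active-Page size follows the description above, while the handler loop that drains several pending requests per interrupt is a plausible reading of the batching behavior, not a specification of the ActiveOS handler.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    constexpr uintptr_t kActivePageSize = 512 * 1024;   // 512 Kbytes per Active Page

    // Check performed at each Active Page: addresses inside
    // [base, base + page size) are local; anything else raises an interrupt.
    inline bool is_local(uintptr_t base_register, uintptr_t vaddr) {
        return vaddr >= base_register && vaddr < base_register + kActivePageSize;
    }

    struct PendingRequest { uintptr_t vaddr; };   // non-local reference queued at a page

    // Sketch of the processor-side handler: a single interrupt drains every
    // pending non-local request, faulting pages in from disk when necessary.
    void interpage_handler(std::vector<PendingRequest>& pending,
                           std::unordered_map<uintptr_t, bool>& resident) {
        for (const PendingRequest& req : pending) {
            uintptr_t page = req.vaddr / kActivePageSize;   // per-process table lookup
            if (!resident[page]) {
                // page fault: bring the Active Page in from disk (omitted)
                resident[page] = true;
            }
            // locate the page and satisfy the request (omitted)
        }
        pending.clear();
    }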

2.3 Virtual Memory

When insufficient physical memory is available to execute all processes in the system, memory must be paged to swap space. We define the maximum required swap space to execute a set of processes as the memory pressure. As memory pressure increases, Active Pages will need to be paged out to disk. This implies that the data, the functions, and any active state must be written to disk. The choice of which pages to swap in and out is critical to both the correctness and performance of Active Pages. If the processor or an Active Page in memory references a page on disk, a page fault occurs. If no reference occurs, however, a page on disk may still need to be swapped in to complete a computation for a page group. Active Pages can only make computational progress if they are in physical memory. Managing Active Pages is therefore more than traditional virtual memory management [Den70]; it is also scheduling of computation.

To help the OS schedule Active Pages, we identify three states in which an Active Page exists: initial (I), dormant (D), and active (A). When a page is allocated, prior to function binding, the state is initial (I). When functions are bound to a page, but no function is currently executing at the page, we identify the page as dormant (D). Finally, when a function is executing at the page, we identify the page as active (A). It is expected that an operating system will provide mechanisms to control the allocation and binding of functions to pages, but that user processes will transition a page from the dormant to active states.

For performance reasons, we restrict the means by which an Active Page can transition from the dormant (D) to active (A) states. As shown in Figure 2, an operating system must periodically schedule Active Pages residing in swap space back into physical memory in order to ensure program correctness. This arises because an Active Page, X, may be waiting upon a result that another page on disk, Y, is about to write to the memory of X. Since there is no reference to Y, the page will not be rescheduled without some external mechanism.

Figure 2: State transition diagram for Active Page memory. A process uses the operating system to allocate an Active Page (I) and to bind functions to it (D); light-weight invocations or inter-page communication activate a page (A); a process can explicitly terminate an AP function, or the function can terminate itself; the operating system schedules pages to and from swap space, and pages in the initial and dormant states can safely be sent to swap.

If we restrict the mechanism by which an Active Page can transition from dormant to active states, then an operating system can periodically schedule only active Active Pages back from swap, thus avoiding having to schedule dormant pages. The restriction is that an Active Page never transitions from the dormant to active states without processor or inter-Active-Page communication intervention. That is, an Active Page never self-activates.
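The three states and the restricted activation rule can be summarized as a small transition function, sketched below; the enum and helper are illustrative only and do not reproduce ActiveOS data structures.

    // Sketch of the Active Page states and the allowed transitions described
    // above; illustrative only, not the ActiveOS implementation.
    enum class PageState { Initial, Dormant, Active };

    enum class Event {
        BindFunctions,        // OS binds AP_functions to the page    (I -> D)
        ProcessorInvocation,  // light-weight invocation by a process (D -> A)
        InterPageMessage,     // inter-page communication on receive  (D -> A)
        FunctionTerminates    // function finishes or is terminated   (A -> D)
    };

    // A page never self-activates: the only D -> A transitions are driven by
    // the processor or by inter-Active-Page communication.
    bool transition(PageState& s, Event e) {
        switch (s) {
        case PageState::Initial:
            if (e == Event::BindFunctions) { s = PageState::Dormant; return true; }
            return false;
        case PageState::Dormant:
            if (e == Event::ProcessorInvocation || e == Event::InterPageMessage) {
                s = PageState::Active; return true;
            }
            return false;
        case PageState::Active:
            if (e == Event::FunctionTerminates) { s = PageState::Dormant; return true; }
            return false;
        }
        return false;
    }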

Our results indicate that performance is only marginally affected by the scheduling quantum for periodically swapping pages back into physical memory, suggesting that as small a quantum as possible should be used, provided performance is not seriously degraded, in order to improve user response time.

cycle-level pro cessor-memory simulator. While our results

ActiveOS currently implements a round-robin memory

are extremely accurate, simulation sp eed limits the size of

page scheduler for ActivePages residing in swap space. For

our workloads. Consequently, our results primarly demon-

page replacement, we adopted a Least Recently Used (LRU)

strate correctness of ActiveOS mechanisms. They also reveal

algorithm which was implemented by tracking a counter

some interesting qualitative issues involving ActivePages on

value for each physical page available in the system. The

disk and overlapping memory computations with multiple

op erating system maintains a counter that is incremented

pro cesses. We note that our applications are highly-scalable

each time a memory page is referenced by the memory man-

streaming applications, hence we exp ect our results to scale

agement subsystem. The memory management subsystem

to larger workload sizes. Since our pro cessor-mediated com-

references pages when memory is allo cated, up on direct ker-

munication metho d has equal cost b etween on- and o -chip

nel access to pro cess memory, and when a TLB miss excep-

inter-page communication, communication costs are not de-

tion o ccurs. Sp ecial pro cessing is p erformed on this counter

p endentuponnumb er or size of memory chips. Future work

to ensure it do es not over ow and that the relative page ac-

will increase simulation capacity through the use of a higher-

cess times are maintained. When memory is to b e swapp ed

level emulation system.

to disk the LRU algorithm searches the entire page space

and returns a reference to the page with the lowest access

5 Multiprogrammed Workload

counter value.
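A compact sketch of this counter-based LRU victim selection is shown below. The global reference counter, per-frame stamps, and linear search over all frames follow the description above; the data layout and names are assumptions.

    #include <cstdint>
    #include <vector>

    // Counter-based LRU, as described above: every reference by the memory
    // management subsystem stamps the frame with a global counter, and the
    // victim is the frame with the lowest stamp.  Layout is illustrative.
    struct FrameTable {
        uint64_t global_counter = 0;
        std::vector<uint64_t> last_access;   // one entry per physical frame

        explicit FrameTable(std::size_t frames) : last_access(frames, 0) {}

        // Called on allocation, direct kernel access, and TLB-miss exceptions.
        void touch(std::size_t frame) {
            last_access[frame] = ++global_counter;
            // overflow handling that preserves relative order is omitted
        }

        // Linear search over the whole frame table for the LRU victim.
        std::size_t select_victim() const {
            std::size_t victim = 0;
            for (std::size_t f = 1; f < last_access.size(); ++f)
                if (last_access[f] < last_access[victim]) victim = f;
            return victim;
        }
    };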

3 Methodology

In order to demonstrate multi-process operation, a mixture of Active-Page-enhanced and conventional applications was chosen. For the Active-Page applications, conventional-only counterparts were implemented. The applications were chosen to represent a mixed workload, as one might encounter in a multi-user system. A brief description of each application is presented in Section 5. Both the conventional and conventional/Active-Page mixed application workloads were executed under various page replacement policies and, in the case of the mixed Active-Page/conventional application suite, with various memory resource scheduling quanta. The results of these measurements are presented in Section 6.

These measurements were taken on a prototype operating system, ActiveOS, running on a processor and memory system simulator. ActiveOS runs on the SimpleScalar v2.0 [BA97] toolkit. This tool set provides the mechanisms to compile, debug, and simulate applications compiled to a RISC architecture. The SimpleScalar RISC architecture is loosely based upon the MIPS instruction set architecture. The SimpleScalar environment was extended by replacing the simulated conventional memory system with an Active Page memory system. The new simulated memory hierarchy provides mechanisms which simulate a RADram memory system, an FPGA-based implementation of Active Pages described in [OCS98]. Finally, the toolset was enhanced by extensive modifications required to support the simulation of an operating system and associated multi-program workloads. The simulated machine characteristics are shown in Table 1.

Parameter            Reference
CPU Clock            500 MHz
L1 I-Cache           32K
L1 D-Cache           32K
L2 Cache             256K
Reconf Logic         50 MHz
Cache Miss           114 ns
Disk Access Time     10 ms
Disk Transfer Rate   10 mb/sec

Table 1: Summary of simulator parameters

4 Limitations

A significant limitation of our methodology is the use of a cycle-level processor-memory simulator. While our results are extremely accurate, simulation speed limits the size of our workloads. Consequently, our results primarily demonstrate correctness of ActiveOS mechanisms. They also reveal some interesting qualitative issues involving Active Pages on disk and overlapping memory computations with multiple processes. We note that our applications are highly scalable streaming applications, hence we expect our results to scale to larger workload sizes. Since our processor-mediated communication method has equal cost between on- and off-chip inter-page communication, communication costs are not dependent upon the number or size of memory chips. Future work will increase simulation capacity through the use of a higher-level emulation system.

5 Multiprogrammed Workload

Our workload consists of three applications which use Active Pages and three conventional applications which do not. The Active-Page applications consist of: database inserts and deletes using a C++ array class library, matrix multiply of a large finite-element sparse matrix, and protein sequence matching using dynamic programming. The conventional applications consist of: the gcc compiler running on input from a source file generated by the yacc parser generator, gzip compression of a 510 kilobyte executable file, and perl executing a prime-number finder. The conventional applications are largely taken from the SPEC 95 suite, with the exception of gzip, which we felt was more complete and up-to-date than compress. Each of the Active-Page applications was also implemented for a conventional memory system for comparison.

Database is a synthetic benchmark which consists of a sorted database of records and requests on those records. The requests arrive sporadically from users and consist of inserts and deletes of individual records. The database was stored in an STL-style C++ array class object which provides O(1) lookup. Active-Page functionality is used to shift data in parallel at the memory system.
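The memory-side work for an insert is essentially a shift of every record at or above the insertion point, which each Active Page can perform locally and in parallel. The sketch below illustrates that per-page shift; the Record layout and function signature are hypothetical, and the actual array class library is not reproduced here.

    #include <algorithm>
    #include <cstddef>

    struct Record { char key[32]; char payload[96]; };   // illustrative record layout

    // Work performed inside one Active Page for an insert: shift the page's
    // records up by one slot so the new record can drop into insert_slot.
    // Every page of the group can run this simultaneously, which is where
    // the parallel speedup over a processor-side shift comes from.
    // Returns the record pushed out of the top of this page, which is handed
    // to the next page in the group via inter-page communication.
    Record shift_up_within_page(Record* page_records, std::size_t records_per_page,
                                std::size_t insert_slot, const Record& carried_in) {
        Record spilled = page_records[records_per_page - 1];
        std::move_backward(page_records + insert_slot,
                           page_records + records_per_page - 1,
                           page_records + records_per_page);
        page_records[insert_slot] = carried_in;   // record handed in from the previous page
        return spilled;
    }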

rates for Active-Page computations than conventional com-

Matrix is a matrix multiply application whichmultiplies

putations. Finally, although not discussed, p erformance was

a matrix by itself. The input matrices were chosen from

found to be relatively insensitive to the memory-resource

the Harwell-Bo eing b enchmark suite [DGL92 ]. The normal

scheduling quantum.

and Active-Page versions of this application op erate using

sparse-matrices. Active-Page functionality is used to pre-

6.1 Application Characteristics

pro cess the input matrix against the columns of the matrix

to b e multiplied. The prepro cessing consists of computing

Table 2 lists the memory requirements, and access char-

the intersecting elements and reordering the memory such

acteristics of each application in our workload. The data

that the valuesoftheintersecting elements are compressed

was acquired by op erating the workloads in a simulated 256

into individual cache lines.

Mbyte physical memory size environment to ensure that no

DNA is a protein sequence matching application which

was required. Note that the DNA applica-

takes in anynumb er of input protein sequences of any length

tion constitutes nearly half the working set of the workload.

and computes all pair-wise sequence alignments [Gus97 ] .

Also note that the Active-Page version matrix trades extra

The result of the pair-wise alignments are values whichin-

storage for parallelism, but Database actually saves space.

dicate the relativelikeness of each protein sequence to every

Total memory consumption is one and a half megabytes

other protein sequence in the input data-set. Active-Pages

larger for the mixed workload than for the conventional

are used for computation of the largest common subsequence

workload.

result matrices. This application makes use of inter-Active-

Throughout our exp eriments, it was observed that the

Page communication facilities.

TLB misses remained relatively constantasphysical mem-

gcc: The cc1 compiler is a standard comp onent of the

ory size varied. Note that the total TLB misses for the

SPEC 95 b enchmark suite [SPE95 ]. The input le was a

ActivePage mixed workload is lower than for the conven-

1859 line size prepro cessed C le generated from the yacc

tional workload. Furthermore, a conventional application,

parser generator.

gcc,makes up the dominant factor in b oth the conventional

: The characteristics of the perl interpreter are dis-

and mixed workloads for TLB misses. Overall, ActivePage

+

cussed in [RLV 96] and [SPE95 ]. The input le was chosen

applications require fewer TLB misses, thus lowering the

to execute in roughly the same p erio d of time as it to ok the

TLB cache pressure.

gcc compiler to execute.

Table 3 lists required swap space versus physical mem-

gzip:We include the gzip le compression utilitytorepre-

ory size for our conventional and mixed workloads under

sent an up dated version of the standard SPEC 95 compress

various paging p olicies. The mixed workload as a whole re-

utility. The input le was chosen to be a challenging 510

quires roughly four megabytes more physical memory than

Application Conventional Mixed (AP/Conv) Conventional TLB Mixed (AP/Conv) TLB

Database 2512k 1720k 11121 1766

DNA 8032k 8384k 31230 623

Matrix 664k 2680k 14048 8620

GCC 2820k 2820k 202867 204827

Perl 1276k 1276k 16155 16758

GZIP 628k 628k 10914 12318

Total: 15932k 17508k 286335 244912

Table 2: Memory requirements and number of TLB misses of conventional and Active-Page applications

Memory Size 15360k 13312k 11264k 9216k 7168k

Conventional/LRU 0k 2052k 4100k 6148k 8196k

Mixed/LRU 5768k 6640k 8400k 10392k 12472k

Table 3: Required swap space for conventional versus mixed workloads

Referring back to Table 2, one and a half megabytes of this difference comes from algorithmic restructuring in database and matrix. The remaining two and a half megabytes, however, come from fragmentation and superpage effects during swaps. The current ActiveOS implementation does not compact partially used conventional pages to free up more superpages. Furthermore, Active Pages must be swapped out as entire superpages even when a single conventional page of storage is all that is needed. Future versions of ActiveOS will be optimized to reduce these effects and allow for better performance at higher memory pressures.

Our simulation results show that ActiveOS can run our mixed workload with high performance. Figure 3 plots application time versus physical memory size for each application. We note that Active Page applications take significant advantage of the Active Page memory system to enhance their performance. Since each application is only charged process time for when its process is scheduled on the processor, each Active Page application gets a potential six-fold performance advantage if its Active Pages execute while other processes are scheduled. Referring to the wall-clock times in Figure 3, we can see that conventional applications in our mixed workload benefit from the reduced time that Active Page applications spend in the system.

Figure 3: Conventional vs. mixed workload times for each application for LRU page replacement. Process time is given on the left and wall-clock time on the right (bars are for 3, 5, 7, 9, 11, 13, and 15 Mbytes of physical memory; times are in cycles).

The performance of the Active Page DNA application is most affected by moderate reductions in physical memory size. Also note a pathological case in which the interaction of LRU with the access patterns of DNA causes 13 Mbytes of memory to perform better than 15 Mbytes.

Further note that the Active-Page version of matrix performs extremely poorly when physical memory is very low. At 3 Mbytes, matrix performs an order of magnitude worse than at 7 Mbytes. This occurs because of frequent and repeated access to each Active Page. Although each page is touched repeatedly, the quantity of data used from it is very small. Under extreme memory pressure, intense swapping occurs with very little access to data by the processor.

Overall, we see that ActiveOS mechanisms can efficiently virtualize Active Pages for light and moderate memory pressures. Although our simulation infrastructure limits workloads to sizes smaller than those in future systems, the data-intensive nature of our applications will scale well to large sizes and we expect our results to be qualitatively similar for larger workloads.

7 Related Work

Active Disks, as proposed by Riedel and Gibson [RG97], provide a mechanism for doing application-specific computation at the disks. That work is in a similar vein to ours, in that it offloads some of the work to the disk, with similar consequences. A small embedded processor is inserted into the disk controller which then executes portions of the user code. Riedel and Gibson argue that by moving this computation to the other side of the disk interconnect, more efficient use of disk and host resources can be achieved. It is then further argued that Filters, Real-Time, Batching, and High-Level Support are some of the applications which should benefit from Active Disks. Two applications, a database select and a parallel sort, are then looked at in greater depth.

The idea of using multiple page sizes is well established. Many commodity processors support the use of multiple page sizes, such as the DEC Alpha, SPARC, Intel, and HP PA-RISC [TH94]. A discussion of the use of multiple page sizes is given by Talluri et al. [TKHP92]. Further work is done by Talluri and Hill to expand placement algorithms to support superpages [TH94]. Active Pages, however, involve both superpage management and computation scheduling.

There has been a fair amount of work in the design of active memory systems, most notably the SWIM project [ACK94] and the PAM project [VBR+96]. Asthana et al. proposed and built an active memory system for network applications known as SWIM. SWIM is a memory system built out of SRAM with small processing elements on each chip. The memory system has a back end which is interfaced to communication lines and disk subsystems. The Programmable Active Memories project (PAM) at Digital's Paris Research Laboratory focused on the performance of a memory-mapped FPGA coprocessor on compute-intensive applications. The FPGA coprocessor was linked to a high-bandwidth SRAM bank in order to keep up with the increased demand for data in the FPGA. Other groups have also proposed computation in memory [SS95], but none, to the authors' knowledge, have examined the operating system ramifications of doing so.

8 Future Work

Although ActiveOS has shown promising results in virtualizing Active Pages for a multi-process environment, many issues remain.

A major effort is underway to automate the application partitioning process. The number of applications in the multi-program workload needs to be increased, and the working-set size of each application also needs to be expanded. Furthermore, virtual memory paging and memory resource scheduling techniques need to be investigated in more detail. Operating system support for Active Page memories with hardware communication facilities will need to be addressed. Finally, support for multiple sizes of Active Pages may prove useful.

For Active Pages to become a successful commodity architecture, the application partitioning process must be automated. Current work uses hand-coded libraries which can be called from conventional code. Ideally, a compiler would take high-level source code and divide the computation into processor code and Active-Page functions, optimizing for memory bandwidth, synchronization, and parallelism to reduce execution time. This partitioning problem is very similar to that encountered in hardware-software co-design systems [VG95], which must divide code into pieces which run on general-purpose processors and pieces which are implemented by ASICs (Application-Specific Integrated Circuits). These systems estimate the performance of each line of code on alternative technologies, account for communication between components, and use integer programming or simulated annealing to minimize execution time and cost. Active Pages could use a similar approach, but would also need to borrow from parallelizing compiler technology [HAA+96] to produce data layouts and schedule computation within the memory system.

The current application workload operates within a confined memory range. The memory core required for the normal application mix is 16 megabytes, not counting the operating system. Extending the memory size of this workload naturally implies extending the length of the execution time of each application. Future work will study applications with larger memory sizes. Furthermore, more Active-Page applications will be implemented, providing an opportunity for a better understanding of Active-Page application and OS interaction.

With more applications available, a better understanding of their paging characteristics can be achieved. Future work will focus on characterizing Active-Page program behavior in more detail, and then designing efficient paging policies to support both Active-Page and conventional applications. Furthermore, work will address active Active-Page priorities and memory-resource priority scheduling. All current paging policies rely upon access information generated by the processor. Active Pages, however, degrade the quality of this information because activity can occur solely inside the memory system; such activity is not tracked by processor access statistics.

Future work will expand on this notion of an information gap and provide solutions for it. Currently, we suggest a solution of Active-Page-addressable per-page priorities. This extends the notion of the three states for an Active Page: a page exists in the initial (I), dormant (D), or active (A) state, with some priority (K) attached. The priority-based mechanism would establish a contract between the operating system and a user process: if a process properly varies the priorities of its associated Active Pages, then the operating system can more efficiently execute the application under low-memory conditions. This new information about Active-Page priorities can then be used to offset the lack of information the operating system currently has about Active-Page activity. Fairness between Active Page and conventional processes, however, will be an important issue.
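As a rough illustration of what such a contract might look like, the sketch below attaches a priority to each dormant or active page and has the operating system pick eviction victims among the lowest-priority pages first. This is speculative future-work material: the types, the priority encoding, and the selection rule are assumptions, not an existing ActiveOS mechanism.

    #include <cstdint>
    #include <limits>
    #include <vector>

    // Speculative sketch of per-page priorities: a page is I, D, or A, and
    // D/A pages carry a process-assigned priority K.  None of this exists in
    // the current ActiveOS; it only illustrates the contract discussed above.
    enum class State : uint8_t { Initial, Dormant, Active };

    struct PageInfo {
        State   state    = State::Initial;
        uint8_t priority = 0;        // K, raised or lowered by the owning process
    };

    // The OS honors the contract by evicting the lowest-priority resident
    // page, preferring dormant pages over active ones at equal priority.
    int pick_eviction_victim(const std::vector<PageInfo>& pages) {
        int best = -1;
        int best_score = std::numeric_limits<int>::max();
        for (int i = 0; i < static_cast<int>(pages.size()); ++i) {
            if (pages[i].state == State::Initial) continue;        // nothing bound yet
            int score = pages[i].priority * 2 +
                        (pages[i].state == State::Active ? 1 : 0); // active costs more to evict
            if (score < best_score) { best_score = score; best = i; }
        }
        return best;   // -1 if no dormant or active page is resident
    }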

Nothing restricts an Active-Page memory from containing a hardware-based communication network. However, without hardware mechanisms in place to restrict inter-page communication, potential security and safety holes are created. For this reason, hardware-assisted communication in Active-Page memory devices must have mechanisms to identify Active Pages as belonging to a particular process. The user-level function must be restricted from changing this process identification without sufficient operating system intervention. Furthermore, all communications must be confined to within the same process or to processes defined in a restricted fashion. Inter-process communication occurring within a hardware-assisted communication network must be restrictable by the operating system. Finally, errant and malicious application functions bound to an Active Page must be restricted from affecting other superpages in the memory system.

Current work has focused on the RADram [OCS98] Active Page memory system. It is anticipated that future Active Page memories will include hardware support for inter-Active-Page communication. The techniques developed in this paper to provide virtual memory support for Active Page memories are still applicable. However, hardware communication networks for inter-Active-Page memory accesses will pose additional operating system challenges in order to minimize communication delay, ensure process protection, and maximize overall application performance.

Finally, an investigation into multi-grain Active-Page implementations and associated operating system support is required. Research into application partitioning suggests that one method of tuning an application partition to a particular problem size is to vary the size of the Active Page. Rather than choose a fixed overall page size, it is proposed that by providing two to four sizes of Active Pages, the maximum possible speedup can be extended and maximized across a broader range of problem sizes. Multi-grain Active-Page support extends the Active-Page memory model but poses new hardware and software design challenges.

9 Conclusion

ActiveOS demonstrates that intelligent memory systems can be virtualized to achieve high multiprogrammed performance. Active Pages provide a page-based intelligent memory architecture that makes this virtualization possible. In fact, multiprogramming increases the opportunities for overlapping processor and memory computation. The fact that memory performs computation, however, means that some reasonable amount of physical memory must be available. Consequently, our results show high performance from zero to moderate memory pressures, but performance degrades significantly as available memory falls below the working-set size of any one Active-Page application.

References

[ACK94] Abhaya Asthana, Mark Cravatts, and Paul Krzyzanowski. Design of an active memory system for network applications. In International Workshop on Memory Technology, Design and Testing, pages 58-63. IEEE Computer Society Press, 1994.

[BA97] D. Burger and T. Austin. The SimpleScalar tool set, version 2.0. Computer Architecture News, 25(3), June 1997.

[BGK96] D. Burger, J. Goodman, and A. Kagi. Quantifying memory bandwidth limitations in future microprocessors. In International Symposium on Computer Architecture, Philadelphia, Pennsylvania, May 1996. ACM.

[Den70] Peter J. Denning. Virtual memory. Computing Surveys, 2(3):153-189, September 1970.

[DGL92] Ian S. Duff, Roger G. Grimes, and John G. Lewis. User's guide for the Harwell-Boeing sparse matrix collection. Technical Report TR/PA/92/86, CERFACS, 42 Ave G. Coriolis, 31057 Toulouse Cedex, France, October 1992.

[Gus97] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.

[HAA+96] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. Computer, December 1996.

[KADP97] Kimberly Keeton, Remzi Arpaci-Dusseau, and David A. Patterson. IRAM and SmartSIMM: Overcoming the I/O bus bottleneck. In Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, Denver, Colorado, June 1997.

[OCS98] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. Active Pages: A computation model for intelligent memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA '98), Barcelona, Spain, 1998. To appear.

[P+97] D. Patterson et al. The case for intelligent RAM: IRAM. IEEE Micro, April 1997.

[RG97] Erik Riedel and Garth Gibson. Active disks - remote execution for network-attached storage. Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3890, December 1997.

[RLV+96] Theodore H. Romer, Dennis Lee, Geoffrey M. Voelker, Alec Wolman, Wayne A. Wong, Jean-Loup Baer, Brian N. Bershad, and Henry M. Levy. The structure and performance of interpreters. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pages 150-159. ACM Press, October 1996.

[Sem94] Semiconductor Industry Association. The national technology roadmap for semiconductors. http://www.sematech.org/public/roadmap/, 1994.

[SPE95] SPEC. SPEC benchmark specifications. SPEC95 Benchmark Release, 1995.

[SS95] Steven Van Singel and Nandit Soparkar. Logic-enhanced memories for data-intensive processing (extended abstract). In International Workshop on Memory Technology, Design and Testing, pages 88-94. IEEE Computer Society Press, 1995.

[TH94] Madhusudhan Talluri and Mark D. Hill. Surpassing the TLB performance of superpages with less operating system support. In Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), pages 171-182. ACM Press, 1994.

[TKHP92] Madhusudhan Talluri, Shing Kong, Mark D. Hill, and David A. Patterson. Tradeoffs in supporting two page sizes. In 19th Annual International Symposium on Computer Architecture, pages 415-424. ACM Press, 1992.

[VBR+96] J. E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati, and P. Boucard. Programmable active memories: Reconfigurable systems come of age. IEEE Transactions on VLSI Systems, 4(1), March 1996.

[VG95] F. Vahid and D. Gajski. SLIF: A specification-level intermediate format for system design. In Proceedings of the European Design and Test Conference, pages 185-189, Washington, March 6-9, 1995. IEEE Computer Society Press.

[Wil95] M. Wilkes. The memory wall and the CMOS end-point. Computer Architecture News, 23(4), September 1995.

[WM95] W. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23(1), March 1995.