ActiveOS: Virtualizing Intelligent Memory
Mark Oskin, Frederic T. Chong, and Timothy Sherwood*
Department of Computer Science
University of California at Davis
Abstract

Current trends in DRAM memory chip fabrication have led many researchers to propose "intelligent memory" architectures that integrate microprocessors or logic with memory. Such architectures offer a potential solution to the growing communication bottleneck between conventional microprocessors and memory. Previous studies, however, have focused upon single-chip systems and have largely neglected off-chip communication in larger systems.

We introduce ActiveOS, an operating system which demonstrates multi-process execution on Active Pages [OCS98], a page-based intelligent memory architecture. We present results from multiprogrammed workloads running on a prototype operating system implemented on top of the SimpleScalar processor simulator. Our results indicate that paging and inter-chip communication can be scheduled to achieve high performance for applications that use Active Pages, with minimal adverse effects on applications that use only conventional pages. Overall, ActiveOS allows Active Pages to accelerate individual applications by up to 1000 percent and to accelerate total workloads by 20 to 60 percent, as long as physical memory can contain the working set of each individual application.

Acknowledgments: Thanks to Andre DeHon, Matt Farrens, Lance Halstead, Diana Keen, and our anonymous referees. This work is supported in part by an NSF CAREER award to Fred Chong, by NSF grant CCR-9812415, by grants from Altera, and by grants from the UC Davis Academic Senate. More info at http://arch.cs.ucdavis.edu/AP

*Now at University of California, San Diego

1 Introduction

Microprocessor performance continues to follow phenomenal growth curves which drive the computing industry. Unfortunately, memory systems are falling behind when "feeding" data to these processors. Processor-centric optimizations to bridge this processor-memory gap [WM95] [Wil95] include prefetching, speculation, out-of-order execution, and multithreading. Unfortunately, many of these approaches can lead to memory-bandwidth problems [BGK96].

DRAM memory technology, however, is growing significantly denser. The Semiconductor Industry Association (SIA) roadmap [Sem94] projects mass production of 1-gigabit DRAM chips by the year 2001. If we devote half of the area of such a chip to logic, we expect the DRAM process to support approximately 32 million transistors. This density has led many researchers to propose intelligent memory systems that integrate processors or logic with DRAM. The increased bandwidth and lower latency of on-chip communication promises to address many aspects of the processor-memory gap [P+97] [KADP97].

These improvements, however, are limited to single-chip systems with limited memory requirements such as PDAs. We expect that the memory demands of most systems will scale with DRAM density, and multiple memory chips will be required. We introduce Active Pages, a page-based architecture for intelligent memory systems (see Figure 1). Active Pages consist of a superpage of data and a collection of functions which operate on that data. Implementations of Active Pages can execute functions for hundreds of pages simultaneously, providing substantial parallelism.

To use Active Pages, computation for an application must be divided, or partitioned, between processor and memory. For example, we use Active-Page functions to gather operands for a sparse-matrix multiply and pass those operands on to the processor for multiplication. To perform such a computation, the matrix data and gathering functions must first be loaded into a memory system that supports Active Pages. Then the processor, through a series of memory-mapped writes, starts the gather functions in the memory system. As the operands are gathered, the processor reads them from user-defined output areas in each page, multiplies them, and writes the results back to the array data structures in memory.

Previous work [OCS98] has shown that Active Pages are a flexible architecture that can lead to dramatic performance improvements (up to 1000X) in a single process over conventional memory systems. This paper evaluates Active Pages in a multi-process environment. Active Pages fundamentally change memory systems from a uniform storage resource to a non-uniform collection of storage and computational elements that require both management and scheduling.

We introduce ActiveOS, a prototype operating system which illustrates the interaction between an OS and Active Pages. Our goal is to demonstrate that Active Pages can perform correctly and efficiently in a multi-process environment. In such an environment, our results show that while paging Active Pages to and from disk can be expensive, opportunities to overlap memory and processor computation increase dramatically. On mixed workloads of conventional and Active-Page applications, Active-Page single-process performance translates well to a multi-process environment as long as memory pressure is moderate. Note that our current cycle-level simulation environment limits workload sizes in our experiments. Our applications are highly scalable, however, and we expect our results to generalize to larger workload sizes.

The remainder of this paper is organized as follows: Section 2 describes the three main services provided by ActiveOS to support Active Pages. Section 3 describes our experimental methodology. Section 4 discusses its limitations. Section 5 describes the applications in our workload. Section 6 presents our results. Section 7 discusses related work. Finally, Sections 8 and 9 present future work and conclusions.
2 ActiveOS

ActiveOS is a prototype operating system whose primary goal is to demonstrate mechanisms which correctly manage Active Pages in a multi-process environment. These mechanisms fall into three key categories: process services, interpage communication, and virtual memory.

2.1 Process Services

The first step in developing operating system support for Active Pages involves providing basic services for user processes to allocate pages and bind functions to them. Furthermore, a process must be able to efficiently call these functions.

The Active Page interface is designed to resemble the interface of a conventional memory system. Processes communicate with Active Pages through memory reads and writes. Specifically, the interface includes:

- Standard memory interface functions: write(address, data) and read(address)

- A set of functions available for computation on a particular Active Page: AP_functions

- An allocation function, AP_alloc(group_id), which allocates an Active Page in group group_id and returns the virtual address of the start of the superpage allocated. Pages operating on the same data will often belong to a page group, named by a group_id, in order to coordinate operations.

- A binding function, AP_bind(group_id, AP_functions), which associates a set of functions AP_functions with group group_id of Active Pages.

- A set of synchronization variables which the AP_functions and the processor use to coordinate activities between the processor and Active Pages within the page group.

Figure 1: The Active Page architecture

There are two important implications of this interface. First, the synchronization variables are essentially memory-mapped registers that allow a process to efficiently invoke Active-Page functions without a system call. Second, the allocation of Active Pages, which are significantly larger than conventional pages, requires memory management of mixed superpages and pages. This memory management will become a significant issue as we discuss paging and our performance results.

2.2 Interpage Communication

An application will generally use more data than will fit on a single Active Page (each one is 512 Kbytes in our current implementation). Such data will often reside in page groups, a collection of Active Pages which work together to perform functions on the data. For example, inserting an element into the middle of an array that spans multiple pages would require elements to move between pages.

Such data movement is accomplished entirely via explicit interpage communication requests. Active-Page functions can reference any virtual address available to their allocating process. For performance, applications should keep references from functions to addresses on other pages to a minimum.

Our implementation uses a processor-mediated approach to satisfying inter-page memory references. Each Active Page explicitly distinguishes between local and non-local references by checking if a virtual address is between its starting address, kept in a base register, and the starting address plus the Active-Page size. When an Active Page reaches a non-local reference, it raises an interrupt and blocks. The interrupt causes the processor to obtain the memory request through a memory-mapped read and execute an OS handler. The handler consults a per-process table to determine if the Active Page containing the reference is in physical memory. If so, the page is located and the request is satisfied. If not, a page fault occurs and the page is brought in from disk. In practice, a large number of non-local references from different pages can be satisfied upon a single interrupt, substantially reducing overhead. We assume that a single interrupt takes 3.5 ms.

Our processor-mediated approach makes inter-page communication expensive, but greatly simplifies page faults and reduces the complexity of Active Page chip implementations. We assume that our memory chips can trigger processor interrupts, but processor polling is also a possibility.

2.3 Virtual Memory

When insufficient physical memory is available to execute all processes in the system, memory must be paged to swap space. We define the maximum required swap space to execute a set of processes as the memory pressure. As memory pressure increases, Active Pages will need to be paged out to disk. This implies that the data, the functions, and any active state must be written to disk. The choice of which pages to swap in and out is critical to both the correctness and performance of Active Pages. If the processor or an Active Page in memory references a page on disk, a page fault occurs. If no reference occurs, however, a page on disk may still need to be swapped in to complete a computation for a page group. Active Pages can only make computational progress if they are in physical memory. Managing Active Pages is therefore more than traditional virtual memory management [Den70]; it is also scheduling of computation.

To help the OS schedule Active Pages, we identify three states in which an Active Page exists: initial (I), dormant (D), and active (A). When a page is allocated, prior to function binding, the state is initial (I). When functions are bound to a page, but no function is currently executing at the page, we identify the page as dormant (D). Finally, when a function is executing at the page, we identify the page as active (A). It is expected that an operating system will provide mechanisms to control the allocation and binding of functions to pages, but that user processes will transition a page from the dormant to active states.

For performance reasons, we restrict the means by which an Active Page can transition from the dormant (D) to active (A) states. As shown in Figure 2, an operating system must periodically schedule Active Pages residing in swap space back into physical memory in order to ensure program correctness. This arises because an Active Page, X, may be waiting upon a result that another page on disk, Y, is about to write to the memory of X. Since there is no reference to Y, that page will not be rescheduled without some external mechanism.

If we restrict the mechanism by which an Active Page can transition from dormant to active states, then an operating system can periodically schedule only active Active Pages back from swap, thus avoiding having to schedule dormant pages. The restriction is that an Active Page never transitions from the dormant to active states without processor or inter-Active-Page communication intervention. That is, an Active Page never self-activates.

Figure 2: State transition diagram for Active Page memory. A process uses the operating system to allocate an Active Page (I) and to bind functions to it (I to D). Light-weight process invocations are used to invoke AP functions, or inter-page communication can invoke an AP function on receive (D to A). Processes can explicitly terminate an AP function, or an AP function can terminate itself (A to D). Dormant and active pages can safely be sent to swap, and the operating system schedules pages to and from swap space.

Our results indicate that performance is only marginally affected by the scheduling quantum for periodically swapping pages back into physical memory, suggesting that the smallest quantum that does not seriously degrade performance should be used, in order to improve user response time.

ActiveOS currently implements a round-robin memory page scheduler for Active Pages residing in swap space. For page replacement, we adopted a Least Recently Used (LRU) algorithm, implemented by tracking a counter value for each physical page available in the system. The operating system maintains a counter that is incremented each time a memory page is referenced by the memory management subsystem. The memory management subsystem references pages when memory is allocated, upon direct kernel access to process memory, and when a TLB miss exception occurs. Special processing is performed on this counter to ensure it does not overflow and that the relative page access times are maintained. When memory is to be swapped to disk, the LRU algorithm searches the entire page space and returns a reference to the page with the lowest access counter value.

3 Methodology

In order to demonstrate multi-process operation, a mixture of Active-Page enhanced and conventional applications was chosen. For the Active-Page applications, conventional-only counterparts were implemented. The applications were chosen to represent a mixed workload, as one might encounter in a multi-user system. A brief description of each application is presented in Section 5. Both the conventional and conventional/Active-Page mixed application workloads were executed under various page replacement policies and, in the case of the mixed Active-Page/conventional application suite, with various memory resource scheduling quanta. The results of these measurements are presented in Section 6.

These measurements were taken on a prototype operating system, ActiveOS, running on a processor and memory system simulator. ActiveOS runs on the SimpleScalar v2.0 [BA97] toolkit. This tool set provides the mechanisms to compile, debug, and simulate applications compiled to a RISC architecture. The SimpleScalar RISC architecture is loosely based upon the MIPS instruction set architecture. The SimpleScalar environment was extended by replacing the simulated conventional memory hierarchy with an Active Page memory system. The new simulated memory hierarchy provides mechanisms which simulate a RADram memory system, an FPGA-based implementation of Active Pages described in [OCS98]. Finally, the toolset was enhanced by extensive modifications required to support the simulation of an operating system and associated multi-program workloads. The simulated machine characteristics are shown in Table 1.

Parameter             Value
CPU Clock             500 MHz
L1 I-Cache            32K
L1 D-Cache            32K
L2 Cache              256K
Reconf. Logic         50 MHz
Cache Miss            114 ns
Disk Access Time      10 ms
Disk Transfer Rate    10 MB/sec

Table 1: Summary of simulator parameters

4 Limitations

A significant limitation of our methodology is the use of a cycle-level processor-memory simulator. While our results are extremely accurate, simulation speed limits the size of our workloads. Consequently, our results primarily demonstrate correctness of ActiveOS mechanisms. They also reveal some interesting qualitative issues involving Active Pages on disk and overlapping memory computations with multiple processes. We note that our applications are highly scalable streaming applications; hence we expect our results to scale to larger workload sizes. Since our processor-mediated communication method has equal cost between on- and off-chip inter-page communication, communication costs are not dependent upon the number or size of memory chips. Future work will increase simulation capacity through the use of a higher-level emulation system.

5 Multiprogrammed Workload

Our workload consists of three applications which use Active Pages and three conventional applications which do not. The Active-Page applications consist of: database inserts and deletes using a C++ array class library, matrix multiply of a large finite-element sparse matrix, and protein sequence matching using dynamic programming. The conventional applications consist of: the gcc compiler running on input from a source file generated by the yacc parser generator, gzip compression of a 510 kilobyte executable file, and perl executing a prime-number finder. The conventional applications are largely taken from the SPEC 95 suite, with the exception of gzip, which we felt was more complete and up-to-date than compress. Each of the Active-Page applications was also implemented for a conventional memory system for comparison.

Database is a synthetic benchmark which consists of a sorted database of records, and requests on those records. The requests arrive sporadically from users and consist of inserts and deletes of individual records. The database was stored in an STL-style C++ array class object which provides O(1) lookup. Active-Page functionality is used to shift data in parallel at the memory system.

Matrix is a matrix multiply application which multiplies a matrix by itself. The input matrices were chosen from the Harwell-Boeing benchmark suite [DGL92]. The normal and Active-Page versions of this application operate using sparse matrices. Active-Page functionality is used to preprocess the input matrix against the columns of the matrix to be multiplied. The preprocessing consists of computing the intersecting elements and reordering the memory such that the values of the intersecting elements are compressed into individual cache lines.

DNA is a protein sequence matching application which takes in any number of input protein sequences of any length and computes all pair-wise sequence alignments [Gus97]. The results of the pair-wise alignments are values which indicate the relative likeness of each protein sequence to every other protein sequence in the input data-set. Active Pages are used for computation of the largest common subsequence result matrices. This application makes use of inter-Active-Page communication facilities.

gcc: The cc1 compiler is a standard component of the SPEC 95 benchmark suite [SPE95]. The input file was an 1859-line preprocessed C file generated from the yacc parser generator.

perl: The characteristics of the perl interpreter are discussed in [RLV+96] and [SPE95]. The input file was chosen to execute in roughly the same period of time as it took the gcc compiler to execute.

gzip: We include the gzip file compression utility to represent an updated version of the standard SPEC 95 compress utility. The input file was chosen to be a challenging 510 kilobyte executable binary that compresses down to 133 kilobytes.

6 Results

In this section, we present our results from executing our multiprogrammed workload using several physical memory sizes. Our results indicate that Active-Page computations tolerate moderate memory pressure well, but performance degrades as physical memory size approaches the working-set size. This degradation tends to happen at slightly higher rates for Active-Page computations than conventional computations. Finally, although not discussed, performance was found to be relatively insensitive to the memory-resource scheduling quantum.

6.1 Application Characteristics

Table 2 lists the memory requirements and access characteristics of each application in our workload. The data was acquired by operating the workloads in a simulated 256 Mbyte physical memory environment to ensure that no memory paging was required. Note that the DNA application constitutes nearly half the working set of the workload. Also note that the Active-Page version of matrix trades extra storage for parallelism, but Database actually saves space. Total memory consumption is one and a half megabytes larger for the mixed workload than for the conventional workload.

Throughout our experiments, it was observed that TLB misses remained relatively constant as physical memory size varied. Note that the total TLB misses for the Active-Page mixed workload are lower than for the conventional workload. Furthermore, a conventional application, gcc, makes up the dominant factor in both the conventional and mixed workloads for TLB misses. Overall, Active-Page applications incur fewer TLB misses, thus lowering the TLB cache pressure.

Application   Conventional Memory   Mixed (AP/Conv) Memory   Conventional TLB Misses   Mixed (AP/Conv) TLB Misses
Database      2512k                 1720k                    11121                     1766
DNA           8032k                 8384k                    31230                     623
Matrix        664k                  2680k                    14048                     8620
GCC           2820k                 2820k                    202867                    204827
Perl          1276k                 1276k                    16155                     16758
GZIP          628k                  628k                     10914                     12318
Total:        15932k                17508k                   286335                    244912

Table 2: Memory requirements and number of TLB misses of conventional and Active-Page applications

Table 3 lists required swap space versus physical memory size for our conventional and mixed workloads under various paging policies. The mixed workload as a whole requires roughly four megabytes more physical memory than

Memory Size        15360k   13312k   11264k   9216k    7168k
Conventional/LRU   0k       2052k    4100k    6148k    8196k
Mixed/LRU          5768k    6640k    8400k    10392k   12472k

Table 3: Required swap space for conventional versus mixed workloads
the conventional workload. Referring back to Table 2, one and a half megabytes of this difference comes from algorithmic restructuring in database and matrix. The remaining two and a half megabytes, however, come from fragmentation and superpage effects during swaps. The current ActiveOS implementation does not compact partially used conventional pages to free up more superpages. Furthermore, Active Pages must be swapped out as entire superpages even when a single conventional page of storage is all that is needed. Future versions of ActiveOS will be optimized to reduce these effects and allow for better performance at higher memory pressures.

Our simulation results show that ActiveOS can run our mixed workload with high performance. Figure 3 plots application time versus physical memory size for each application. We note that Active-Page applications take significant advantage of the Active Page memory system to enhance their performance. Since each application is only charged process time for when its process is scheduled in the processor, each Active-Page application gets a potential six-fold performance advantage if its Active Pages execute while other processes are scheduled. Referring to the wall-clock times in Figure 3, we can see that conventional applications in our mixed workload benefit from the reduced time that Active-Page applications spend in the system.

The performance of the Active-Page DNA application is most affected by moderate reductions in physical memory size. Also note a pathological case in which the interaction of LRU with the access patterns of DNA causes 13 Mbytes to perform better than 15 Mbytes of memory.

Further note that the Active-Page version of matrix performs extremely poorly when physical memory is very low. At 3 Mbytes, matrix performs an order of magnitude worse than at 7 Mbytes. This occurs because of frequent and repeated access to each Active Page. Although each page is touched repeatedly, the quantity of data used from them is very small. Under extreme memory pressure, intense swapping occurs with very little access to data by the processor.

Overall, we see that ActiveOS mechanisms can efficiently virtualize Active Pages for light and moderate memory pressures. Although our simulation infrastructure limits workloads to sizes smaller than those in future systems, the data-intensive nature of our applications will scale well to large sizes and we expect our results to be qualitatively similar for larger workloads.

Figure 3: Conventional vs. mixed workload times for each application for LRU page replacement. Process time is given on the left and wall-clock time on the right (bars are for 3, 5, 7, 9, 11, 13, and 15 Mbytes of physical memory).

7 Related Work

Active Disks, as proposed by Riedel and Gibson [RG97], provide a mechanism for doing application-specific computation at the disks. This work is in a similar vein, in that it is offloading some of the work to the disk, with similar consequences. A small embedded processor is inserted into the disk controller, which then executes portions of the user code. Riedel and Gibson argue that by moving this computation to the other side of the disk interconnect, more efficient use of disk and host resources can be achieved. It is then further argued that filters, real-time, batching, and high-level support are some of the applications which should benefit from Active Disks. Two applications, a database select and a parallel sort, are then looked at in greater depth.

The idea of using multiple page sizes is well established. Many commodity processors support the use of multiple page sizes, such as the DEC Alpha, SPARC, Intel, and HP PA-RISC [TH94]. A discussion of the use of multiple page sizes is given by Talluri et al. [TKHP92]. Further work is done by Talluri and Hill to expand placement algorithms to support superpages [TH94]. Active Pages, however, involve both superpage management and computation scheduling.

There has been a fair amount of work in the design of active memory systems; most notable are the SWIM project [ACK94] and the PAM project [VBR+96]. Asthana et al. proposed and built an active memory system for network applications known as SWIM. SWIM is a memory system built out of SRAM with small processing elements on each chip. The memory system then has a back end which is interfaced to communication lines and disk subsystems. The Programmable Active Memories project (PAM) at Digital's Paris Research Laboratory focused on the performance of a memory-mapped FPGA coprocessor on compute-intensive applications. The FPGA coprocessor was linked to a high-bandwidth SRAM bank in order to keep up with the increased demand for data in the FPGA. Other groups have also proposed computation in memory [SS95], but none, to the authors' knowledge, have examined the operating system ramifications of doing so.

8 Future Work

Although ActiveOS has shown promising results in virtualizing Active Pages for a multi-process environment, many issues remain. A major effort is underway to automate the application partitioning process. The number of applications in the multi-program workload needs to be increased, and the working set size of each application also needs to be expanded. Furthermore, virtual memory paging and memory resource scheduling techniques need to be investigated in more detail. Operating system support for Active-Page memories with hardware communication facilities will need to be addressed. Finally, support for multiple sizes of Active Pages may prove useful.

For Active Pages to become a successful commodity architecture, the application partitioning process must be automated. Current work uses hand-coded libraries which can be called from conventional code. Ideally, a compiler would take high-level source code and divide the computation into processor code and Active-Page functions, optimizing for memory bandwidth, synchronization, and parallelism to reduce execution time. This partitioning problem is very similar to that encountered in hardware-software co-design systems [VG95], which must divide code into pieces which run on general-purpose processors and pieces which are implemented by ASICs (Application-Specific Integrated Circuits). These systems estimate the performance of each line of code on alternative technologies, account for communication between components, and use integer programming or simulated annealing to minimize execution time and cost. Active Pages could use a similar approach, but would also need to borrow from parallelizing compiler technology [HAA+96] to produce data layouts and schedule computation within the memory system.

The current application workload operates within a confined memory range. The required memory core size for the normal application mix is 16 megabytes of memory, not counting the operating system. Extending the memory size of this workload naturally implies extending the length of the execution time for each application. Future work will study applications with larger memory sizes. Furthermore, more Active-Page applications will be implemented, providing an opportunity for a better understanding of Active-Page application and OS interaction.

With more applications available, a better understanding of their paging characteristics can be achieved. Future work will focus on characterizing Active-Page program behavior in more detail, and then designing efficient paging policies to support both Active-Page and conventional applications. Furthermore, work will address active Active-Page priorities and memory resource priority scheduling. All current paging policies rely upon access information generated by the processor. However, Active Pages degrade the quality of this information because activity can be occurring solely inside the memory system. This activity is not tracked by the current processor access statistics.

Future work will expand on this notion of an information gap and provide solutions for it. Currently, we suggest a solution of Active-Page-addressable per-page priorities. This extends the notion of the tri-states for an Active Page. Thus an Active Page exists in either the initial (I), dormant (D), or active (A) with some priority (K) states. The priority-based mechanisms would ensure a contract between the operating system and a user process. The contract is that if a process properly varies the priorities of its associated Active Pages, then the operating system can more efficiently execute the application under low-memory conditions. This new information of Active-Page priorities can then be used to offset the lack of information the operating system currently has about Active-Page activity. Fairness between Active-Page and conventional processes, however, will be an important issue.

Nothing restricts an Active-Page memory from containing a hardware-based communication network. However, without hardware mechanisms in place to restrict inter-page communication, potential security and safety holes are created. For this reason, hardware-assisted communication in Active-Page memory devices must have mechanisms to identify Active Pages as belonging to a particular process. The user-level function must be restricted from changing this process identification without sufficient operating system intervention. Furthermore, all communications must be confined to within the same process or to processes defined in a restricted fashion. Inter-process communication occurring within a hardware-assisted communication network must be restrictable by the operating system. Finally, errant and malicious application functions bound to an Active Page must be restricted from affecting other superpages in the memory system.

Current work has focused on the RADram [OCS98] Active Page memory system. It is anticipated that future Active Page memories will include hardware support for inter-Active-Page communication. The techniques developed in this paper to provide virtual memory support for Active Page memories are still applicable. However, hardware communication networks for inter-Active-Page memory accesses will pose additional operating system challenges in order to minimize communication delay, ensure process protection, and maximize overall application performance.

Finally, an investigation into multi-grain Active-Page implementations and associated operating system support is required. Research into application partitioning suggests that one method of tuning an application partition to a particular

References

[P+97] D. Patterson et al. The case for intelligent RAM: IRAM. IEEE Micro, April 1997.

[RG97] Erik Riedel and Garth Gibson. Active disks - remote execution for network-attached storage. Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3890, December 1997.

[RLV+96] Theodore H. Romer, Dennis Lee, Geoffrey M. Voelker, Alec Wolman, Wayne A. Wong, Jean-Loup Baer, Brian N. Bershad, and Henry M. Levy. The structure and performance of interpreters. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pages 150-159. ACM Press, October 1996.
[Sem94] Semiconductor Industry Asso ciation. The national tech-
problem size is to vary the size of the ActivePage. Rather
nology roadmap for semiconductors.
than cho ose a xed overall page size, it is prop osed that by
http://www.sematech.org/public/roadmap/, 1994.
providing two to four sizes of ActivePages, the maximum
[SPE95] SPEC. Sp ec b enchmark sp eci cations. SPEC95 Bench-
p ossible sp eedup can b e extended and maximized across a
mark Release, 1995.
broader range of problem sizes. Multi-grain Active-Page
[SS95] Steven Van Singel and Nandit Soparkar. Logic-enhanced
supp ort extends the Active-Page memory mo del but p oses
memories for data-intensive pro cessing (extended ab-
stract). In International Workshop on Memory Technol- new hardware and software design challenges.
ogy, Design and Testing, pages 88{94. IEEE Computer
So ciety Press, 1995.
9 Conclusion
[TH94] Madhusudhan Talluri and Mark D. Hill. Surpassing the
TLB p erfomance of sup erpages with less op erating system
ActiveOS demonstrates that intelligent memory systems can
supp ort. In Architectural Support for Programming Lan-
guages and Operating Systems V, pages 171{182. ACM
b e virtualized to achieve high multiprogrammed p erformance.
Press, 1994.
ActivePages provide a page-based intelligent memory archi-
[TKHP92] Madhusudhan Talluri, Shing Kong, Mark D. Hill, and
tecture that makes this virtualization p ossible. In fact, mul-
David A. Patterson. Tradeo s in supp orting two page
tiprogramming increases the opp ortunities for overlapping
sizes. In 19th annual International Symposium on Com-
pro cessor and memory computation. The fact that memory
puter Architecture, pages 415{424. ACM Press, 1992.
p erforms computation, however, means that some reason-
+
[VBR 96] J.E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati,
able amount of physical memory must be available. Con-
and P. Boucard. Programmable active memories : Recon-
gurable systems come of age. IEEE Transactions on sequently, our results show high p erformance from zero to
VLSI systems, 4(1), March 1996.
mo derate memory pressures, but p erformance degrades sig-
[VG95] F. Vahid and D. Ga jski. SLIF: A sp eci cation-level in- ni cantly as available memory falls b elow the working set
termediate format for system design. In Proceedings of
sizeofany one Active-Page application.
the European Design and Test Conference, pages 185{
189, Washington, March 6{9 1995. IEEE Computer So ci-
ety Press. References
[Wil95] M. Wilkes. The memory wall and the CMOS end-p oint.
[ACK94] Abhaya Asthana, Mark Cravatts, and Paul Krzyzanowski.
Computer ArchitectureNews, 23(4), Septemb er 1995.
Design of an active memory system for network applica-
tions. In International Workshop on Memory Technol-
[WM95] W. Wulf and S. McKee. Hitting the memory wall: Im-
ogy, Design and Testing, pages 58{63. IEEE Computer
plications of the obvious. Computer Architecture News,
So ciety Press, 1994.
23(1), March1995.
[BA97] D. Burger and T. Austin. The SimpleScalar to ol set, ver-
sion 2.0. Computer ArchitectureNews, 25(3), June 1997.
[BGK96] D. Burger, J. Go o dman, and A. Kagi. Quantifying
memory bandwidth limitations in future micropro cessors.
In International Symposium on Computer Architecture,
Philadelphia, Pennsylvania, May 1996. ACM.
[Den70] Peter J. Denning. Virtual memory. Computing Surveys,
2(3):153{189, Septemb er 1970.
[DGL92] Ian S. Du , Roger G. Grimes, and John G. Lewis. User's
guide for the Harwell-Bo eing sparse matrix collection.
Technical Rep ort TR/PA/92/86, CERFACS, 42 AveG.
Coriolis, 31057 Toulouse Cedex, France, Octob er 1992.
[Gus97] D. Gus eld. Algorithms on Strings, Trees, and Se-
quences.Cambridge University Press, UniversityofCali-
fornia, Davis, 1997.
+
[HAA 96] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R.
Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maxi-
mizing multipro cessor p erformance with the suif compiler.
Computer,Decemb er 1996.
[KADP97] Kimb erly Keeton, Remzi Arpaci-Dusseau, and David A.
Patterson. IRAM and SmartSIMM: Overcoming the I/O
bus b ottleneck,. In Workshop on Mixing Logic and
DRAM: Chips that Compute and Remember, Denver,
Colorado, June 1997.
[OCS98] Mark Oskin, Frederic T. Chong, and Timothy Sher-
wood. Active pages: A computation mo del for intelli-
gentmemory.InProceedings of the 25th Annual Interna-
tional Symposium on Computer Architecture (ISCA'98),
Barcelona, Spain, 1998. To App ear.