Appears in the International Symposium on Computer Architecture, Barcelona.

Active Pages: A Computation Model for Intelligent Memory

Mark Oskin, Frederic T. Chong, and Timothy Sherwood

Department of Computer Science
University of California at Davis

Abstract

Microprocessors and memory systems suffer from a growing gap in performance. We introduce Active Pages, a computation model which addresses this gap by shifting data-intensive computations to the memory system. An Active Page consists of a page of data and a set of associated functions which can operate upon that data. We describe an implementation of Active Pages on RADram (Reconfigurable Architecture DRAM), a memory system based upon the integration of DRAM and reconfigurable logic. Results from the SimpleScalar simulator [BA] demonstrate up to 1000X speedups on several applications using the RADram system versus conventional memory systems. We also explore the sensitivity of our results to implementations in other memory technologies.

1 Introduction

Microprocessor performance continues to follow phenomenal growth curves which drive the industry. Unfortunately, memory-system performance is falling behind. Processor-centric optimizations to bridge this memory gap include prefetching, speculation, out-of-order execution, and multithreading [WM]. Several of these approaches can lead to memory-bandwidth problems [BGK]. We introduce Active Pages, a model of computation which partitions applications between a processor and an intelligent memory system. Our goal is to keep processors running at peak speeds by off-loading data manipulation to logic placed in the memory system.

Active Pages consist of a page of data and a set of associated functions that operate on that data. For example, an Active Page may contain an array data structure and a set of insert, delete, and find functions that operate on that array. A memory system that implements Active Pages is responsible for both the storage of the data and the computation of the associated functions.

Rapid advances in fabrication technology promise to make the integration of logic and memory practical. Although Active Pages can be implemented in a variety of architectures and technologies, we focus upon the integration of reconfigurable logic and DRAM. We introduce the RADram (Reconfigurable Architecture DRAM) system. On many applications, our simulations show substantial performance gains for a uniprocessor workstation using a RADram system versus a conventional memory system. RADram can also serve as a conventional memory system with negligible performance degradation. As we shall see in Section 3, RADram is likely to have superior yield, higher parallelism, and better integration with commodity microprocessors when compared to architectures such as IRAM [Pat]. Since memory technologies are a moving target, we measure the sensitivity of our results to the speed of Active Page implementations. This allows us to generalize to currently available technologies such as DRAM macro cells in ASIC (Application-Specific Integrated Circuit) technologies.

This paper starts with a description of Active Pages in Section 2 and continues with our RADram implementation in Section 3. We then describe our experimental methodology in Section 4 and our applications in Section 5. We continue with the reconfigurable logic designs for each application in Section 6. We present our results in Section 7 and generalize these results to other technologies in Section 8. Finally, we conclude with a discussion of related work in Section 9, future work in Section 10, and conclusions in Section 11.

Acknowledgements: Thanks to Andre DeHon, Matt Farrens, Lance Halstead, Tom Simon, Deborah Wallach, and our anonymous referees. This work is supported in part by an NSF CAREER award to Fred Chong, by Altera, and by grants from the UC Davis Academic Senate. More info at http://arch.cs.ucdavis.edu/RAD.

2 Active Pages

Active Pages introduce new programming, system, and fabrication issues. In this section, we shall discuss the programming issues which arise from the Active Page computational model. These issues are: partitioning, coordination, computational scaling, and data manipulation. We will discuss system and fabrication issues in Section 3, where we introduce the RADram Active Page implementation.

To use Active Pages, computation for an application must be divided, or partitioned, between processor and memory. For example, we use Active Page functions to gather operands for a sparse-matrix multiply and pass those operands on to the processor for multiplication. To perform such a computation, the matrix data and gathering functions must first be loaded into a memory system that supports Active Pages. The processor then, through a series of memory-mapped writes, starts the gather functions in the memory system. As the operands are gathered, the processor reads them from user-defined output areas in each page, multiplies them, and writes the results back to the array data structures in memory.

Interface: To simplify integration with commodity microprocessors and systems, the interface to Active Pages is designed to resemble a conventional virtual memory interface. Specifically, the Active Page interface includes the following:

- Standard memory interface functions: write(vaddr, data) and read(vaddr).

- A set of functions available for computation on a particular Active Page: AP_functions.

- An allocation function, AP_alloc(group_id, vaddr), which allocates an Active Page in group group_id at virtual address vaddr. Pages operating on the same data will often belong to a page group, named by a group_id, in order to coordinate operations.

- A function-binding procedure, AP_bind(group_id, AP_functions), which binds a set of functions AP_functions to a group group_id of Active Pages. This set of functions may be redefined through repeated calls to AP_bind. Since implementations may limit the number or complexity of functions associated with each page, rebinding may be necessary to make room for new functions by eliminating old ones.

- Additionally, applications will commonly use several variables in each Active Page as synchronization variables to coordinate between AP_functions and a processor. These variables require no additional support beyond reads and writes. Memory accesses by AP_functions and a processor are atomic.

Figure 1: Expected computation scaling of Active Pages. (Speedup of an Active Page system over a conventional one, and processor/memory non-overlap, versus problem size, through the sub-page, scalable, and saturated regions.)
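The interface above can be made concrete with a small C++ sketch. This is illustrative only: the operation names come from the list above, while the types and exact signatures (GroupId, VAddr, APFunctionSet) are hypothetical details the model itself does not specify.

```cpp
#include <cstdint>

// Hypothetical C++ rendering of the Active Page interface described above.
// GroupId and APFunctionSet are illustrative types; the model specifies the
// operations, not their exact signatures.
using GroupId = int;
using VAddr   = uintptr_t;

struct APFunctionSet;  // set of functions synthesized for a page group

// Standard memory interface functions.
void write(VAddr vaddr, uint64_t data);
uint64_t read(VAddr vaddr);

// Allocates an Active Page in group `group` at virtual address `vaddr`.
// Pages operating on the same data share a group so they can coordinate.
void AP_alloc(GroupId group, VAddr vaddr);

// Binds (or rebinds) a set of functions to every page in `group`.
// Implementations may bound the number/complexity of bound functions,
// so rebinding may be needed to make room for new ones.
void AP_bind(GroupId group, const APFunctionSet& functions);
```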

Active Page functions use virtual addresses and can reference any virtual address available to the allocating process. In our sparse-matrix example, the code begins by calling AP_alloc to allocate a group of pages to store the matrices to be multiplied. Then AP_functions are defined to include a function for index comparison and data gathering. Next, AP_bind is called to associate this function with the pages. To start the page computations, the processor activates the pages with an ordinary memory write to an application-defined location; the AP_functions poll such synchronization variables as soon as AP_bind is called. Once the functions have computed their results and gathered the matrix data to be multiplied, they write to another set of synchronization variables to indicate that the data is ready. The processor polls these variables and begins reading and multiplying the data once it is ready.

Activation Time: Intuitively, a processor working with a memory system that implements Active Pages is similar to a control processor working with a small data-parallel machine. Typically, an application is partitioned by first dispatching a request for a computation to occur on the data within an Active Page. A well-structured application will have to move little, if any, additional data into the page in order for that function to complete. Thus, the majority of time in dispatching a work request is spent communicating to the Active Page the function to invoke and any additional required parameters. We refer to the time it takes to dispatch this request as activation time. Activation time is generally constant for each page for a given function; measurements for each application will be given in Table 4.
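A minimal host-side sketch of this walkthrough, reusing the declarations from the interface sketch above; the flag locations, the START/READY encoding, and the helpers gather_functions() and multiply_and_write_back() are hypothetical stand-ins for application-defined conventions.

```cpp
#include <cstdint>

const APFunctionSet& gather_functions();     // hypothetical bound circuit set
void multiply_and_write_back(VAddr out);     // processor-side FP multiply

enum : uint64_t { START = 1, READY = 1 };

void sparse_matrix_multiply(GroupId grp, const VAddr matrix_pages[], int npages,
                            const VAddr start_flag[], const VAddr ready_flag[],
                            const VAddr out_area[]) {
  for (int i = 0; i < npages; ++i)
    AP_alloc(grp, matrix_pages[i]);       // place matrix data in a page group
  AP_bind(grp, gather_functions());       // bind index-compare/gather logic

  for (int i = 0; i < npages; ++i)
    write(start_flag[i], START);          // activation: ordinary memory write

  for (int i = 0; i < npages; ++i) {
    while (read(ready_flag[i]) != READY) {
      // spin on the synchronization variable until operands are gathered
    }
    multiply_and_write_back(out_area[i]); // FP multiply stays on the processor
  }
}
```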

Coordination: Partitioning computations implies that Active Pages must coordinate with the processor and with each other. Processor-page coordination is accomplished via predefined synchronization variables. Inter-page coordination is accomplished with inter-page memory references.

Synchronization variables are used to coordinate activities between the Active Page functions and the processor. The structure and layout of these variables are implementation- and application-specific. The variables may serve as locks to indicate when inputs or outputs for an Active Page operation are valid. This interface is similar to the memory-mapped registers used for network interfaces.

Our global virtual address space implies that some Active Page functions may reference data in other pages. Such references are meant to be used sparingly, and the implementation of inter-page memory references will be discussed in Section 3. Active Page implementations are intended to function in any system that uses a conventional memory system. For example, pages may coordinate with multiple processors in a Symmetric Multiprocessor using Active Page synchronization variables to enforce atomicity.

The Active Page model of computation does not define an explicit means for inter-page communication. Support for communication between pages can be accomplished in a variety of fashions. Abstractly, all forms of communication are viewed as non-local memory references issued by an Active Page. For performance reasons, an Active Page memory system may choose to combine several references into a contiguous inter-page memory copy. Our RADram implementation (Section 3) simulates such an approach.

Partitioning: In our sparse-matrix example, the application was partitioned between work done at the memory system and work done at the processor. Such partitioning varies in emphasis between efficient use of processor computation and efficient use of Active Page computation. We refer to these two extremes as processor-centric and memory-centric partitioning. Processor-centric partitioning is appropriate for algorithms with complex computations such as floating point. Memory-centric partitioning is appropriate for data manipulation and integer arithmetic.

Sparse-matrix computations require substantial floating-point computation and suggest a processor-centric partitioning: Active Pages compute which operands must be multiplied, with the goal of providing the processor with enough operands to keep it running at peak speed. Our image-processing application, on the other hand, uses integer arithmetic and can be performed almost entirely in Active Pages. Consequently, the goal there is to exploit parallelism and use as many Active Pages as possible.

Computation Scaling: The computational power of Active Pages scales in an unusual way as application problem sizes grow. In this section, we develop some intuition about this scaling; we will verify these intuitions in Section 7. Traditional multiprocessors generally operate with a fixed number of processing engines which must be applied to a variable problem size. With Active Pages, the number of processing engines is coupled to physical memory size. Since many systems are designed to scale memory size to contain the data of their intended applications, more Active Pages will be available for the computation.

Figure 1 shows how we expect Active Page performance to scale as problem size grows. Speedup refers to the performance of a system using Active Pages relative to that of a system using a conventional memory system. Non-overlap time is the time the processor spends waiting for Active Page computation which is not overlapped with processor computation; this is indicative of the quality of partitioning. As illustrated in Figure 1, we expect three regions of speedup as problem sizes scale:

The sub-page region: For very small problem sizes, applications use a small number of Active Pages, and utilization of those pages is poor. Activation time dominates the computation, and speedups do not scale until the Active Page function offloads sufficient work from the processor.

The scalable region: Once the problem is larger, the number of Active Pages involved increases linearly. The corresponding increase in computational power results in linear speedups.

The saturated region: Although the number of Active Pages grows with data size, the number of processors in a system does not. Consequently, we expect speedups to eventually level off as the processor component of the application saturates constant processor resources. This leveling off can also produce a degradation in performance, as an increased number of Active Pages can increase the synchronization and communication overhead.

Ideally, we want speedups which are in the rightmost portion of the scalable region. Fortunately, partitions can be tuned to shift this scalable region towards specific problem sizes.

Data Manipulation: In addition to providing scalable computation, Active Pages allow programmers to optimize for density and indexing rather than data manipulation. Currently, programmers have a wealth of data structures they can choose to use for any given problem. However, these data structures each have advantages and disadvantages. For instance, doubly-linked lists provide fast insertion and deletion of elements, but poor random access. Arrays, on the other hand, provide fast random access but poor performance on insertions and deletions.

To some extent, Active Pages remove the burden of compromise when choosing a data structure. For example, our implementation of the STL array class uses dense arrays but exploits Active Page functions to provide fast insertion and deletion.
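A sketch of what such a class might look like, reusing the earlier interface declarations: a dense array with ordinary random access whose insert and erase are dispatched to bound Active Page functions. The class, its command encoding, and the kCommandOffset slot are hypothetical; the paper describes its library only at the level of the text above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical Active-Page-backed dense array in the spirit of the STL
// array class described above. Random access stays a simple load; insert
// and erase are dispatched to functions bound on the pages holding the
// data, which shift elements inside the memory system.
template <typename T>
class ap_array {
 public:
  ap_array(GroupId grp, VAddr base, std::size_t capacity)
      : grp_(grp), base_(base), size_(0), capacity_(capacity) {}

  T operator[](std::size_t i) const {  // fast random access (dense array)
    return static_cast<T>(read(base_ + i * sizeof(T)));
  }

  void insert(std::size_t i, T value) {
    dispatch(kInsert, i);              // pages shift [i, size) up by one
    write(base_ + i * sizeof(T), static_cast<uint64_t>(value));
    ++size_;
  }

  void erase(std::size_t i) {
    dispatch(kDelete, i);              // pages shift (i, size) down by one
    --size_;
  }

 private:
  enum Op : uint64_t { kInsert = 1, kDelete = 2 };
  void dispatch(Op op, std::size_t i) {
    // Activation is an ordinary memory write to an application-defined
    // location; the encoding of (op, i) here is illustrative only.
    write(base_ + kCommandOffset, (static_cast<uint64_t>(op) << 56) | i);
  }
  static constexpr std::size_t kCommandOffset = 0;  // hypothetical slot
  GroupId grp_;
  VAddr base_;
  std::size_t size_, capacity_;
};
```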

3 The RADram Implementation

In this section we describe the Reconfigurable Architecture DRAM (RADram) system, shown in Figure 2. RADram is an architecture based upon the integration of the next generation of FPGA (Field-Programmable Gate Array) and DRAM technology. To minimize latency and reduce power consumption, large DRAMs are divided into subarrays, each with its own subset of address bits and decoders [I]. RADram exploits this structure by associating a block of reconfigurable logic with each subarray.

Figure 2: The RADram system. (Each DRAM bit-array, with its row-select and column-select logic, is paired with a block of reconfigurable logic.)

RADram Architecture: For gigabit DRAMs, a good subarray size to minimize power and latency is 512 Kbytes. The RADram system associates 256 LEs (Logic Elements, a standard block of logic in FPGAs which is based upon a 4-element Look-Up Table, or LUT) with each of these subarrays. This allows efficient support for Active Page sizes which are multiples of 512 Kbytes.

Each LE requires about 1K transistors of area on a logic chip. The Semiconductor Industry Association (SIA) roadmap [Sem] projects mass production of gigabit DRAM chips by the year 2001. If we devote half of the area of such a chip to logic, we expect the DRAM process to support approximately 32M transistors, which is enough to provide 256 LEs to each 512-Kbyte subarray of the remaining half gigabit of memory on the chip. DeHon [DeHb] gives several estimates of FPGA area.

We adopt a processor-mediated approach to inter-page communication which assumes infrequent communication. When an Active Page function reaches a memory reference that cannot be satisfied by its local page, it blocks and raises a processor interrupt. The processor satisfies the request by reading and writing to the appropriate pages. Once an interrupt is raised, the processor generally satisfies many requests from different pages in the system. Future work will evaluate hardware mechanisms for in-chip communication, increasing the number of outstanding references per page, and processor polling for requests. The processor-mediated methodology, however, functions well for our applications and will greatly simplify future work in paging and virtual memory.

Table 1 lists the parameters of our reference RADram implementation. Several parameters were also individually varied in our experiments with respect to the reference implementation; the range of variation for these parameters is also given in Table 1. Additionally, a memory bus that transfers a fixed-width block of data between memory and cache every few nanoseconds is assumed.

Table 1: Summary of RADram parameters.

  Parameter      Reference   Variation
  CPU Clock      1 GHz       --
  L1 I-Cache     64K         --
  L1 D-Cache     64K         8K-256K
  L2 Cache       1M          256K-8M
  Reconf. Logic  100 MHz     10-100 MHz
  Cache Miss     50 ns       50-600 ns
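A minimal sketch of the processor-mediated scheme just described, assuming a hypothetical ap_pending_requests()/ap_unblock() interface for draining blocked references on an interrupt; the model only requires that the processor perform the reads and writes on behalf of the blocked pages. The read/write calls reuse the interface declarations above.

```cpp
#include <vector>

// Hypothetical record for a non-local reference raised by a blocked page.
struct InterPageRequest {
  VAddr src;      // address the Active Page function tried to reference
  VAddr dst;      // user-visible buffer in the faulting page
  bool  is_read;  // read from another page vs. write to another page
};

std::vector<InterPageRequest> ap_pending_requests();  // drained on interrupt
void ap_unblock(VAddr page);                          // resume the function

// Interrupt handler: once one page blocks, service every pending request
// in the system, since many pages tend to block around the same time.
void on_interpage_interrupt() {
  for (const InterPageRequest& r : ap_pending_requests()) {
    if (r.is_read)
      write(r.dst, read(r.src));   // copy remote word into the local page
    else
      write(r.src, read(r.dst));   // propagate local word to the remote page
    ap_unblock(r.dst);
  }
}
```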

Why Reconfigurable Logic: The potential of gigabit densities in DRAM has prompted research and development in a variety of implementation options for intelligent memory. IRAM [Pat], an integration of processor core and DRAM, is a well-known option studied at Berkeley. RADram, however, is likely to have better yield, higher parallelism, and better integration with commodity processors than IRAM.

The primary advantage of RADram memory devices is that they will be inexpensive to fabricate. Processor chips cost ten times as much as memory chips because their complexity makes their yield, or percentage of working chips, much lower [Prz]. DRAMs are fabricated with redundant memory cells that can replace defective cells through laser modification after chip production. The uniform nature of reconfigurable logic allows for similar measures in RADram chips. In contrast, IRAM chip designers will have to work hard to avoid yields similar to processor chips. If IRAM chips are fabricated at processor costs, systems will be limited to a few IRAM chips and to applications with smaller data. RADram is intended to fabricate at DRAM costs, which allows dozens of chips per system and much larger application data.

Our results will show that RADram can exploit extremely high parallelism by supporting simple application-specific operations in memory. A multi-gigabit RADram can have more than 1000 Active Pages, each of which can execute simultaneously. Processor-in-DRAM solutions cannot support such high parallelism. The variety of custom operations used in our applications also suggests that fixed logic would severely limit the functionality of Active Page applications.

Finally, RADram is specifically designed to support commodity microprocessors. The RADram interface is compatible with standard memory busses. A primary goal of RADram is to supply microprocessors with enough data to keep them running at peak speeds. IRAM technology, however, is intended to compete with commodity processors. This competition may eventually be favorable for IRAM as the importance of single-chip systems increases, but ever-growing applications may always demand larger memories and multiple chips.

Fabrication: Interest in the fabrication of Merged DRAM-Logic (MDL) devices has grown dramatically in the past few years. Major manufacturers currently have the capability to fabricate DRAM cells (macro cells) in logic chips. Processors have also been fabricated in DRAM chips. Current DRAM in logic chips has poor density; logic in DRAM chips has poor speed and density. Merged DRAM-logic processes which can fabricate both kinds of structures well are becoming available [Prz]. Our study, however, is conservative and assumes a DRAM process with associated penalties in logic speed and density.

Power: Power consumption is a major concern for DRAM chips because increased chip temperatures result in higher charge leakage from storage cells. This leakage increases the need for more frequent DRAM refresh. Fortunately, this higher refresh can be bundled into the logic added to each DRAM subarray.

Although a detailed study of power is beyond the scope of this paper, we have been conservative in our use of power in RADram. Our applications use only a small number of bits of bandwidth between data and logic in RADram pages. This could easily be increased to wider datapaths, but would result in higher power consumption. Increasing bandwidth would also require more reconfigurable logic, which is beyond our area constraints for some applications. Application performance, however, is high despite conservative bandwidth.

4 Methodology

To evaluate Active Pages, we conducted a detailed application study. The reference Active Page platform used for this study was described in Section 3. This platform was studied using a three-step approach. First, a simulator was implemented which modeled the RADram Active Page system. Second, a set of applications was chosen which represented various algorithmic domains. Finally, these applications were written and optimized for both the RADram and conventional memory system architectures.

As a base for a simulation environment, we started with the SimpleScalar v2.0 tool set [BA]. This tool set provides the mechanisms to compile, debug, and simulate applications compiled to a RISC architecture. The SimpleScalar RISC architecture is loosely based upon the MIPS instruction set architecture. The SimpleScalar environment was extended by replacing the simulated conventional memory hierarchy with an Active Page memory system. The new simulated memory hierarchy provides mechanisms which simulate RADram application-specific circuits executing within the DRAM memory system. Further, the SimpleScalar instruction set was extended with Intel MMX multimedia instruction opcodes. Finally, the toolset was enhanced by updating the included GNU CC compiler to the latest compiler suite. All applications in this study were compiled with the -O optimization option.

After implementation of this simulation environment, a set of applications was chosen for architectural evaluation. Each application is briefly described in Section 5. Here, we explore the methodology used in choosing, partitioning, and evaluating these applications.

Applications were chosen with three motives in mind. First, the algorithms to be implemented in the application were representative of a broad class of algorithms used in a range of applications. Second, the algorithm or application illustrated a certain kind of partitioning, as described in Section 2. Finally, an MMX-instruction-set-compatible application was chosen to explore Active Page implementations other than RADram. For instance, future work may investigate the possibility of identifying a small key set of data manipulation primitives which should be implemented in fixed logic in the Active Page model.

The first step in studying each application or algorithm described in Section 5 is to implement and optimize it on a conventional memory system. The application is then hand-partitioned for an Active Page memory system. Next, Active Page functions are coded in VHDL and synthesized to FPGA logic; the results of this are discussed in Section 6. The state-transition characteristics of these synthesized circuits are used to simulate the functions with our SimpleScalar simulator.

5 Applications

In order to demonstrate effective partitioning of applications between processor and Active Pages, we chose a range of applications representing both memory- and processor-centric partitioning. Table 2 summarizes the attributes of these applications. This section describes each application and divides those descriptions into each partitioning class.

Table 2: Summary of partitioning of applications between processor and Active Pages.

Memory-Centric Applications
  Name           Application                      Processor Computation        Active Page Computation
  Array          C++ standard template library    C++ code using array class;  Array insert, delete,
                 array class                      cross-page moves             and find
  Database       Address database                 Initiates queries;           Searches unindexed data
                                                  summarizes results
  Median         Median filter for images         Image I/O                    Median of neighboring pixels
  Dynamic Prog.  Protein sequence matching        Backtracking                 Computes MINs and fills table

Processor-Centric Applications
  Name           Application                      Processor Computation        Active Page Computation
  Matrix         Matrix multiply for Simplex      Floating-point multiplies    Index comparison and
                 and finite element                                            gather/scatter of data
  MPEG-MMX       MPEG decoder using               MMX dispatch;                MMX instructions
                 MMX instructions                 discrete cosine transform

Memory-Centric Partitioning

As discussed in Section 2, Active Pages can exploit the parallelism in applications through memory-centric partitioning. Our array, database, median-filtering, and dynamic programming applications are good examples of such partitioning.

STL Array Template: The STL array template is a general-purpose C++ template which permits the storage, access, and retrieval of objects based upon a linear integer index. The template class supports the usual array access operators as well as insert, delete, and binary find/count operations. All of the applications implemented hide the layout of data and the partitioning of algorithmic operations from the application via a simple C++ interface, but the STL array best demonstrates this principle. Library calls derived from a common subclass allow single source files to work with either the Active Page or conventional-system implementation of the array template.

The implementation uses reconfigurable logic to speed up the following operations: array insert, delete, and count. The insert and delete operations involve moving portions of the array in parallel to accommodate the change in array size. The count operation is implemented by a binary comparison circuit.

These three operations are indicative of a broad range of array operations which the RADram system can effectively compute. Further examples from the STL library include: accumulate, partial-sum, random-shuffle, rotate, and adjacent-difference.
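The parallel data movement behind insert and delete can be modeled in software as each page shifting its slice of the array in the same step. In the sketch below, each outer-loop iteration stands in for one page's circuit working concurrently; the page capacity and boundary handoff are illustrative assumptions, not the synthesized design.

```cpp
#include <cstddef>
#include <vector>

// Software model of an Active-Page-style parallel insert: the array is
// divided into fixed-size page slices; conceptually every page shifts its
// slice by one in the same step, handing its top element to the next page.
constexpr std::size_t kElemsPerPage = 4096;  // illustrative page capacity

void parallel_insert(std::vector<int>& a, std::size_t pos, int value) {
  a.push_back(0);  // grow by one element
  // Each iteration of this outer loop corresponds to work done
  // concurrently by one page's logic in a real RADram system.
  for (std::size_t page_end = a.size() - 1; page_end > pos;) {
    std::size_t page_begin = page_end - (page_end % kElemsPerPage);
    if (page_begin <= pos) page_begin = pos;
    // Shift this page's slice up by one (in-memory data movement).
    for (std::size_t i = page_end; i > page_begin; --i) a[i] = a[i - 1];
    page_end = page_begin;  // boundary element handed off to this page
  }
  a[pos] = value;
}
```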

Database Query: Several methods [SKS] exist to speed up database searches if the searches involve indexed fields. Indexing produces a second table within the database which permits the database engine to quickly locate fields in logarithmic or constant time. However, indexing is often not practical for highly-varied queries or under tight storage constraints. Unindexed queries can take time linearly proportional to the number of records. Our database benchmark uses a synthetically generated address book. Custom Active Page functions were written to search for exact matches on any of the string fields contained in the address records.

The RADram-system time complexity of the unindexed database query is O(1); however, the constant bounding it is quite large. The performance gained by the RADram system comes from the parallelism available in the database search. In theory, all records can be searched simultaneously. In practice, the records are grouped into blocks which are roughly the size of a RADram memory page. These blocks are then distributed among the pages in the RADram memory system. Each page is then custom-programmed with the search engine's application-specific circuit. To demonstrate the performance of the RADram system on this application, a count of exact matches for the last name of an individual in the address book is performed. The count is run on the same database on both the RADram system and a conventional implementation.

Image Processing: Image processing and signal processing have been traditional strengths of FPGAs and custom processor technologies [R, AA, K]. We implemented an image median-filtering [RW] application on RADram. Median filtering is a non-linear method which reduces the noise contained in an image without blurring the high-frequency components of the image signal. The RADram implementation divides the image by row blocks among various Active Pages. Each row block contains two additional rows, one above the current row block and one below it, in order to perform the median-filtering kernel computation. The Active Pages are then programmed with a custom circuit designed to find the median of nine short integer values. For comparison, the conventional system uses a hand-coded algorithm which takes a minimal number of comparisons to find the median of nine values.

Because the computational work involved is small in terms of circuit area, the bulk of the median-filtering application runs inside the RADram memory system. Not surprisingly, this application allows RADram to exploit high parallelism and memory bandwidth. RADram uses a custom circuit designed for sorting nine short integer values; the conventional implementation requires several conditional instructions as well as memory I/O operations in order to find the median value.
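For concreteness, one standard way to find the median of nine values with a fixed sequence of compare-exchanges, in the spirit of the minimal-comparison algorithm mentioned above; this particular 19-exchange network is a well-known construction for 3x3 median filtering, not the paper's circuit, and maps naturally onto comparator logic because it has no data-dependent control flow.

```cpp
#include <algorithm>
#include <cstdint>

// Compare-exchange: after the call, a <= b.
static inline void cswap(int16_t& a, int16_t& b) {
  if (a > b) std::swap(a, b);
}

// Median of nine values via a fixed 19-exchange network.
int16_t median9(int16_t p0, int16_t p1, int16_t p2,
                int16_t p3, int16_t p4, int16_t p5,
                int16_t p6, int16_t p7, int16_t p8) {
  cswap(p1, p2); cswap(p4, p5); cswap(p7, p8);
  cswap(p0, p1); cswap(p3, p4); cswap(p6, p7);
  cswap(p1, p2); cswap(p4, p5); cswap(p7, p8);
  cswap(p0, p3); cswap(p5, p8); cswap(p4, p7);
  cswap(p3, p6); cswap(p1, p4); cswap(p2, p5);
  cswap(p4, p7); cswap(p4, p2); cswap(p6, p4);
  cswap(p4, p2);
  return p4;  // p4 now holds the median
}
```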

Largest Common Subsequence: This algorithm is representative of a broad class of string algorithms which form the basis for modern biological research. At the heart of the computer algorithms used to reconstruct DNA sequences are string algorithms such as largest common subsequence, global alignment, and local alignment [Gus]. The largest common subsequence (LCS) computation is typically done using a dynamic-programming construction. This construction runs in O(n^2) time and space for sequences of length n. One can view the construction as a set of computations over a plane. For the LCS algorithm, the computation can proceed in parallel as a wavefront, starting at the upper left corner and ending in the lower right corner of this plane. This wavefront computation runs in O(n log n) time on the RADram system.

The RADram system implements the LCS computation by dividing the algorithm into two steps. The first step is the computation of the LCS result matrix itself. The second step is the backtracking [CLR] required to find the largest common subsequence. The RADram system executes the first step entirely within the reconfigurable logic inside the memory system; backtracking executes entirely within the processor.
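A compact software model of this two-step split, assuming the usual LCS recurrence: the anti-diagonal loop stands in for the wavefront that the memory system evaluates in parallel, and the backtracking loop is the processor-side step.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// LCS split as described above: fill the DP table along anti-diagonals
// (the wavefront a RADram system evaluates in parallel), then backtrack
// on the processor to recover the subsequence itself.
std::string lcs(const std::string& a, const std::string& b) {
  const std::size_t n = a.size(), m = b.size();
  std::vector<std::vector<int>> T(n + 1, std::vector<int>(m + 1, 0));

  // Wavefront fill: all cells on one anti-diagonal are independent.
  for (std::size_t d = 2; d <= n + m; ++d) {
    std::size_t lo = (d > m) ? d - m : 1;
    for (std::size_t i = lo; i <= std::min(n, d - 1); ++i) {
      std::size_t j = d - i;
      T[i][j] = (a[i - 1] == b[j - 1]) ? T[i - 1][j - 1] + 1
                                       : std::max(T[i - 1][j], T[i][j - 1]);
    }
  }

  // Backtracking (processor side): walk from (n, m) to recover the LCS.
  std::string out;
  for (std::size_t i = n, j = m; i > 0 && j > 0;) {
    if (a[i - 1] == b[j - 1]) { out.push_back(a[i - 1]); --i; --j; }
    else if (T[i - 1][j] >= T[i][j - 1]) --i;
    else --j;
  }
  std::reverse(out.begin(), out.end());
  return out;
}
```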

Processor-Centric Partitioning

Active Pages are intended for simple application-specific operations, leaving more complex computations to general-purpose microprocessors. Our MMX and matrix applications are good examples of processor-centric partitioning.

MMX Primitives: The MMX multimedia instruction primitives were chosen for implementation within the RADram system for two reasons. First, they represent a well-known commodity set of architecture primitives. Second, they are simple primitive operations designed for parallel execution.

The simulator was extended to support SimpleScalar MMX instructions and RADram MMX-instruction equivalents. The MMX instructions themselves are highly parallel, simple, and generally complete in a single processor cycle. To improve upon the base SimpleScalar MMX instructions, the RADram equivalents operate on larger data widths. While an MMX instruction in SimpleScalar is restricted to producing only 64 bits of data per instruction, a RADram MMX instruction can produce kilobytes of data per instruction.

While implementation of the complete MMX instruction set is still underway, enough is implemented to carry out key portions of the MPEG encoding and decoding processes. While future work will explore more MPEG routines, current work has focused upon application of the correction matrices within the P and B frames [M]. Future implementations of the MPEG algorithm will partition additional components between the processor and the RADram memory system: the processor will be responsible for the Discrete Cosine Transform (DCT), while the RADram system will handle motion detection, application of motion-correction matrices, run-length encoding and decoding (RLE), and Huffman encoding and decoding.
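To make the widening concrete, a software model of a packed 16-bit add: the first routine is the 64-bit MMX-style primitive, and the second applies the same lane-wise operation across a page-sized buffer, the kind of equivalent a RADram page could execute per activation. Both are illustrative models, not simulator code.

```cpp
#include <cstddef>
#include <cstdint>

// 64-bit MMX-style packed add: four independent 16-bit lanes (wrapping).
uint64_t paddw(uint64_t a, uint64_t b) {
  uint64_t r = 0;
  for (int lane = 0; lane < 4; ++lane) {
    uint16_t x = static_cast<uint16_t>(a >> (16 * lane));
    uint16_t y = static_cast<uint16_t>(b >> (16 * lane));
    r |= static_cast<uint64_t>(static_cast<uint16_t>(x + y)) << (16 * lane);
  }
  return r;
}

// RADram-style equivalent: the same lane-wise add applied across a whole
// page-sized buffer in one activation (here modeled as a loop).
void paddw_page(const uint64_t* a, const uint64_t* b, uint64_t* out,
                std::size_t words) {
  for (std::size_t i = 0; i < words; ++i) out[i] = paddw(a[i], b[i]);
}
```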

Sparse-Matrix Multiply: A wide range of real-world problems can be represented as sparse matrices. We examine both a common scientific benchmark and a more challenging compiler-optimization problem. Our scientific benchmark involves the multiplication of matrices representing finite-element computations taken from the Harwell-Boeing benchmark suite [D]. Our compiler-optimization problem involves using the Simplex method [NM] to perform optimal register allocation [GW].

A key computation in both these applications is the sparse vector-vector dot-product. Conventional implementations of this operation are severely limited by processor-memory bandwidth: sparse-vector FLOPS on a conventional system are often an order of magnitude lower than those for dense vectors. The processor must fetch the indices of each non-zero element, determine which indices in both vectors of the dot product match, fetch the data corresponding to those indices, multiply the data, and write the data back to its appropriate location.

In contrast, the RADram system implements a compare-gather-compute approach. Active Page functions fetch and compare vector indices, fetch the data values for the indices that match, and gather the data into cache-line-size blocks. Vectors are co-located on pages. The processor then reads the packed data, computes the multiplies, and writes back cache-line-size blocks of results. Note that only useful data travels between the processor and memory, greatly conserving bandwidth. With large matrices, the RADram system has enough Active Pages executing to keep the processor computing at peak floating-point speeds.
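A software model of the compare-gather-compute split for a single dot-product, assuming the usual sorted index/value representation for sparse vectors: gather_matches() plays the role of the Active Page function, and the caller's loop is the processor's multiply phase over packed operands.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sorted sparse vector: indices[i] is the coordinate of values[i].
struct SparseVec {
  std::vector<int>    indices;
  std::vector<double> values;
};

// Memory-side step (the Active Page function's role): intersect the two
// index lists and emit packed operand pairs, so only useful data crosses
// the memory bus.
std::vector<std::pair<double, double>>
gather_matches(const SparseVec& a, const SparseVec& b) {
  std::vector<std::pair<double, double>> packed;
  for (std::size_t i = 0, j = 0;
       i < a.indices.size() && j < b.indices.size();) {
    if (a.indices[i] == b.indices[j]) {
      packed.emplace_back(a.values[i], b.values[j]);
      ++i; ++j;
    } else if (a.indices[i] < b.indices[j]) ++i;
    else ++j;
  }
  return packed;
}

// Processor-side step: multiply-accumulate over the packed operands.
double sparse_dot(const SparseVec& a, const SparseVec& b) {
  double sum = 0.0;
  for (const auto& [x, y] : gather_matches(a, b)) sum += x * y;
  return sum;
}
```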

6 Synthesized Logic

In order to estimate the performance and area of RADram logic configurations, each function of an application's Active Pages was hand-coded in a high-level circuit-description language (VHDL [Ash]), and the circuits were synthesized to completely routed designs in contemporary FPGA technology. This provided a means to verify the timing of the simulated circuit implementations, as well as information on circuit area which helped guide the RADram design.

The results of our implementations of the application-specific circuits for the simulated applications are summarized in Table 3. These results were obtained by implementing the circuit designs in behavioral VHDL and synthesizing them with the Synopsys FPGA design tools. After synthesis to a technology-independent logic description, the designs were placed and routed to an Altera FLEX10K part. This allowed us to study the post-routed designs on real FPGA technology. The count of logic-block usage reported in Table 3 includes both completely used and partially used LEs. The speed and code size were directly reported by the Synopsys tools.

Table 3: Active Page functions synthesized for RADram. (Columns: Application, LEs, Speed in ns, Code size in KB; rows: Array-delete, Array-insert, Array-find, Database, Dynamic Prog., Matrix, MPEG-MMX.)

The results obtained from implementation of the application-specific circuits indicate that the RADram Active Page system can execute the application kernels' circuits. The RADram implementation can support designs with approximately 256 LEs per Active Page, and all of our designs are below this amount. Our designs can also be further optimized by implementing common memory interfaces in fixed logic. Our system simulation assumes a 100 MHz clock for our circuits; given modest advances in FPGA technology, this should be achievable for our circuits. Finally, the code size is an indication of the potential code bloat which will happen when transitioning an application to the RADram system. Code size is also indicative of the page-replacement cost for Active Pages, which we anticipate to be several times larger than for conventional pages due to reconfiguration time. However, pages which do not use Active Page functions do not incur this cost, and future reconfigurable technologies may significantly reduce it (see Section 8).

7 Results

In this section we compare our RADram simulation results for each application kernel described in Section 5 to our expectations from the Active Page application characteristics discussed in Section 2. First, we discuss the performance of RADram versus a conventional memory system executing optimized versions of the same applications. Then we explore the memory hierarchy of both memory systems by studying the effects of cache parameters. Finally, we develop an analytical model to describe partitioned application performance and then compute the correlation between this model and our experimental results.

Figure 3: RADram speedup as problem size varies. (Speedup versus problem size in 512K pages for array-delete, array-find, array-insert, database, dyn-prog, matrix-boeing, matrix-simplex, median-kernel, median-total, and mmx.)

Figure 4: Percent cycles the processor is stalled on RADram as problem size varies. (Processor/memory non-overlap, in percent, versus problem size in 512K pages.)

Performance

To evaluate the performance of the RADram Active Page memory system, each application described in Section 5 was executed on a range of problem sizes using a fixed set of machine characteristics, listed in Table 1. The speedups of our applications running on a RADram memory system compared to a conventional memory system are shown in Figure 3. Each application was run on a range of problem sizes, given in terms of the number of Active Pages (512-Kbyte superpages). We make two primary observations about this graph.

First, application kernels execute significantly faster on a RADram memory system than on a conventional memory system. The one exception from our application mix is the array-delete primitive in the sub-page region. The SimpleScalar processor instruction set actually favors array-delete over array-insert. To take advantage of this fast delete, the RADram version of array-delete uses an adaptive algorithm that uses the processor more for arrays that are smaller than one Active Page.

Second, our performance results qualitatively scale as we expected in Figure 1. We observe that most applications show little growth in speedup as data size grows within the sub-page region (below one page for most applications). In this region, RADram applications have little parallelism to offset activation costs. As we leave this region, we enter the scalable region and see that performance on all of our applications grows nicely as data size increases. Four applications (database, matrix-simplex, matrix-boeing, and median-filtering) also reach the saturated region. Here, RADram performance is limited by the progress of the processor. This limitation may be due either to too much work for a given-speed processor or to too much data traveling between the processor and RADram across the memory bus. Performance can actually decrease as coordination costs dominate performance. Given a large enough problem size, all our applications would eventually reach the saturated region.

Processor-Memory Non-overlap

The saturated region of Active Page performance emphasizes the importance of partitioning applications to efficiently use the processor in a system. For processor-centric applications, this dependence is obvious: the goal is to keep the processor computing by providing a steady stream of useful data from the memory system. For memory-centric partitions, however, the processor is still a vital resource. Active Pages cannot compute without activation and inter-page communication, both provided by the processor.

As data size grows in an Active Page application, so does the load upon the processor. We measure the remaining capacity of a processor to handle this load with a metric we call processor-memory non-overlap time. Non-overlap is the time the processor spends waiting for the memory system, and can be used to estimate the boundary between the scalable and saturated regions of application performance.

The relative percentage of time the processor is stalled waiting for memory-system computation is shown in Figure 4. As described earlier, the applications which reached the saturated region of speedup were database, matrix-simplex, matrix-boeing, and median-filtering. As is shown in Figure 4, these applications also reach a point of complete processor-memory overlap; the effect of this is described below.

We also observe that for the array primitives and the dynamic programming application, the non-overlap percentage remains relatively high. These applications are largely memory-centric, with very little processor activity. In fact, the array primitives operate asynchronously with respect to the rest of the application and are artificially forced into synchronous operation for this study. This means that an application can use the insert and delete array primitives with only the cost of RADram function invocation. Modulo dependencies on the array, the time spent by the memory system shifting data can be overlapped with operations outside of the STL array class. This overlap occurs in a natural way, with no additional effort required by the programmer who uses the RADram STL array class. Opportunities for overlapping execution of data-structure operations with data-structure usage are intriguing and are being investigated further.

The dynamic programming example maintains a very high processor-memory non-overlap; however, preliminary results indicate that the processor-mediated communication required by the RADram memory system eventually dominates performance. This occurs for extremely large problems that are well beyond the range of problem sizes presented in this study.

Figure 5: Conventional (left) and RADram (right) execution time versus L1 data cache size. (Cycle count versus L1 D-cache size in Kbytes; the RADram reference is 64 Kbytes.)

Cache Effects

The simulated processor used for this study has a default split instruction/data level-one cache. Each level-one cache is 64 kilobytes and set-associative. The processor also has a combined, set-associative level-two cache of one megabyte. For this study, the level-one data cache size was varied over sizes up to 256 kilobytes, and the level-two cache size was varied from hundreds of kilobytes to several megabytes.

Figure 5 (left) plots total conventional application-kernel execution time versus the size of the level-one data cache. As illustrated, within the range of cache sizes explored, most conventional applications were unaffected. However, at the left edge of Figure 5 (left), we note that some conventional applications are affected by the size of the level-one cache when it falls to the smallest sizes explored.

Figure 5 (right) plots total RADram application-kernel time versus level-one data cache size. As illustrated, all but one application was unaffected by the size of the level-one cache. The median-total application shows various stride effects. The application consists of two phases. The first reads data into an array and transforms it into a special data layout required by the Active Page memory system; the size of the level-one cache plays a role in enhancing the performance of this operation. The second phase simply dispatches the request for median filtering to the Active Page memory system and waits for the result. As evident from the performance of median-kernel, the second phase is unaffected by the size of the level-one cache.

All applications were also executed with a range of level-two cache sizes. Throughout this range, no significant performance differences occurred. This, combined with the level-one cache results, indicates that our applications are sensitive to extremely small cache sizes, but small-to-reasonable-size caches achieve all of the performance of large caches. Active Page applications tend to work with large datasets. Although their primary working set may fit in a small cache, secondary working sets will not fit in realistic cache sizes. Consequently, without migrating to a cache-only architecture, our application performance is bounded by other architectural characteristics such as DRAM memory latency and bandwidth.

Analysis

To achieve a deeper understanding of the performance of application partitions, we introduce an analytic model. This model is based upon an abstract application. From this abstract application, a formula is developed which models performance under various problem sizes. Additionally, total application performance is bounded by Amdahl's Law. We present this model by first developing an intuitive understanding of a partitioned application. Then we characterize processor performance with an Active Page memory system. Finally, we compute the correlation of this analytical model with the results obtained from our RADram simulator.

Model

Section 2 described partitioning and the role it plays in application performance on an Active Page memory system. To investigate partitioning in more detail, an abstract application is depicted in Figure 6. As illustrated in Figure 6, a partitioned algorithm undergoes two phases from the perspective of the processor: activation and post-processing. The activation phase is characterized by increasing Active Page activity. The post-processing phase is characterized by decreasing Active Page activity, but potential processor-memory non-overlap stalls mixed with processor computation.

The abstract application depicted in Figure 6 uses K pages of Active Page memory. The processor spends T_A(i) time activating Active Page i. Initially, the processor activates all K pages in sequence, thus requiring \sum_{i=1}^{K} T_A(i) time to activate all pages. Immediately after activation, an Active Page begins to execute. The time required to complete execution for Active Page i is T_C(i). After dispatching the activation requests to all K pages, the application returns to the first page to perform any follow-up processor computation. Before the processor can perform this computation, however, the processor may be forced to stall and wait for the Active Page in the first memory location to finish execution. At this point in Figure 6, the processor is stalled, waiting in non-overlap time. We account for this as NO(1): non-overlap time spent waiting for Active Page 1. The processor, after waiting for NO(1) time for that Active Page to complete execution, can then perform the follow-up computation T_P(1).

Figure 6: Abstract view of processor and Active Page memory activity. (The processor issues activations T_A(1)...T_A(K), stalls for NO(1), then performs post-computations T_P(1)...T_P(K), while Active Pages 1 through K execute their computations T_C(1)...T_C(K) over time.)

Figure 7 gives the simplified performance model:

NO(i) = \max\left(0,\; T_C(i) - \sum_{n=i+1}^{K} T_A(n) - \sum_{n=1}^{i-1} T_P(n) - \sum_{n=1}^{i-1} NO(n)\right)

Speedup_{partitioned} = \frac{T_{conv}(K)}{\sum_{i=1}^{K} \left( T_A(i) + T_P(i) + NO(i) \right)}

Speedup_{overall} = \frac{1}{\left(1 - Fraction_{partitioned}\right) + \frac{Fraction_{partitioned}}{Speedup_{partitioned}}}

Figure 7: Simplified performance model for Active Pages.
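As a sanity check, the model can be evaluated directly in a few lines; the sketch below takes per-page times as inputs and applies the Figure 7 formulas verbatim (no measured RADram values are assumed).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Direct evaluation of the Figure 7 model for K >= 1 pages. ta/tp are
// per-page activation and post-computation times, tc is per-page Active
// Page computation time, tconv is the conventional execution time.
double predicted_speedup(const std::vector<double>& ta,
                         const std::vector<double>& tp,
                         const std::vector<double>& tc,
                         double tconv, double fraction_partitioned) {
  const std::size_t K = ta.size();
  double later_act = 0.0, earlier_post = 0.0, earlier_no = 0.0;
  for (std::size_t n = 1; n < K; ++n) later_act += ta[n];  // sum T_A(i+1..K)

  double partitioned_time = 0.0;
  for (std::size_t i = 0; i < K; ++i) {
    double no = std::max(0.0, tc[i] - later_act - earlier_post - earlier_no);
    partitioned_time += ta[i] + tp[i] + no;
    if (i + 1 < K) later_act -= ta[i + 1];  // shrink the activation window
    earlier_post += tp[i];
    earlier_no += no;
  }
  const double sp = tconv / partitioned_time;    // Speedup_partitioned
  return 1.0 / ((1.0 - fraction_partitioned) +   // Amdahl's Law bound
                fraction_partitioned / sp);
}
```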

The abstract application shows constant per-page activation time T_A, constant per-page post-computation time T_P, and T_P > T_A. This means that no other stalls, or processor-memory non-overlaps, occur. In the general case, however, an application transitions between post-computation on page i, T_P(i), and non-overlap time, NO(i+1), for the next page. This occurs for all pages within the computation.

Using this abstract application, we observe that all processor time for a single partitioned algorithm is accounted for in three distinct sets of variables: T_A(i), T_P(i), and NO(i). Thus, total kernel execution time for a partitioned application is the summation \sum_{i=1}^{K} \left( T_A(i) + T_P(i) + NO(i) \right).

Figure 7 formalizes this model. Note that an application need not have constant per-page activation and post-activated computation times. Furthermore, an application need not have constant per-Active-Page computation time. From the processor's perspective, each application executes three general phases: dispatch, wait for result, and post-compute.

Figure 7 models conventional application performance in terms of T_conv(K), that is, the time spent by a conventional application working with a particular data set of size K (T_conv is time per item).

We note that the non-overlap time the processor spends before post-processing of page i is a maximum of zero or the computation time of the Active Page minus the time spent by the processor between finishing activation of page i and the current time. This intermediate time is spent either activating subsequent pages, stalled, or post-computing on previous pages.

Table 4: Activation time T_A, computation time T_C, post-activated processor time T_P, and minimum problem size for complete overlap. (Columns: Application, T_A in microseconds, T_P in microseconds, T_C in milliseconds, pages for overlap, and speedup correlation; rows: Array-insert, Array-delete, Array-find, Database, Matrix-simplex, Matrix-boeing, Median-kernel, MPEG-MMX.)

Correlation

In general, an average activation time T_A and an average post-page computation time T_P can be measured using a small-to-medium dataset. Furthermore, an average Active Page computation time T_C can be measured from this small dataset. Using these averages and the model in Figure 7, a rough estimate of the non-overlap time for a particular problem size can be found. Using this estimate, it is possible to predict the performance of a partitioned application for a range of problem sizes. This prediction provides insight into the particular characteristics of a partitioned application. By modeling performance as activation, post-page computation, per-page Active Page computation, and processor-memory non-overlap time, it is possible to gauge performance at a variety of problem sizes and adjust the balance of work between the memory system and processor according to the expected workload of the application.

To illustrate, Table 4 lists the activation time, post-page processor time, and per-page Active Page computation time for a number of application kernels in our workload. Using a simplified version of the formulas in Figure 7, which assumes constant values for these metrics, the number of pages for complete overlap is computed. Furthermore, for each application and for each data point used to construct Figure 3, a predicted speedup is computed using these constant activation and computation times and a measured non-overlap time taken from Figure 4. The correlation between the predicted speedup from the analytical model and the actual speedup observed is shown in the rightmost column of Table 4. Most applications are well-correlated to the analytical model. A notable exception is the matrix-boeing application, which violates the assumption of constant activation and computation times per Active Page. The times are inherently data-specific for this application, and using constant values proved to be less useful than for the other applications studied.

8 Sensitivity to Technology

Our results for the RADram system demonstrate that Active Pages can be implemented with substantial success on a variety of applications. RADram technology, however, is a long-term goal which is several years in the future. Shorter-term and alternative long-term technologies can also be used to implement Active Pages. This section describes such technologies and analyzes the sensitivity of our results to some of the key parameters in the RADram system.

Current technologies exist to implement Active Pages at significantly higher cost than RADram. Such costs would limit the amount of memory available to support Active Pages and, consequently, the problem sizes of the applications. These technologies include small merged FPGA-DRAM or -SRAM chips, DRAM/SRAM macro cells in ASICs, and small processor-in-DRAM or -SRAM chips. In general, logic speeds in these technologies are either equal to or better than RADram assumptions. Chip cost, however, will limit most near-term technologies to substantially smaller problem sizes. SRAM or multi-chip solutions will also have an effect on memory latencies.

We vary two technological parameters in our RADram simulations: memory latency and logic speed. First, Figure 8 plots the sensitivity of RADram speedups to memory latency in terms of cache-miss penalty. In general, the performance advantage of RADram comes from in-DRAM computation, which is unaffected by cache-miss penalty. Cache effects, however, account for slight changes in both RADram and conventional system performance. These changes can result in either increases or decreases in speedup as cache-miss penalties increase. The sign of the slope depends upon the relative ratio of instruction cycles to memory-stall cycles for the conventional versus the partitioned application. If one splits the total application runtime into two components, processor time and memory-stall time, then computes the ratio of these two values for both the conventional and partitioned applications, the slope of application speedup versus memory latency depicted in Figure 8 will depend upon the relative ratio of these two ratios.

Figure 8: RADram speedup as cache-to-memory latency varies. (Speedup versus cache-to-DRAM latency in cycles; the RADram reference is 50.)

Second, Figure 9 plots speedup versus the speed of the application-specific circuit. The speed of application-specific circuits in the simulated RADram system is measured in relative clock divisions of the processor clock; in Figure 9, a higher logic divisor corresponds to a slower reconfigurable-logic clock. To generalize across applications: those operating on problems in the scalable region of their partitioning domain are sensitive to the speed of the Active Page computation, whereas those operating on problems in the saturated region of their partitioning domain are generally insensitive to it.

Figure 9: RADram speedup as logic speed varies. (Speedup versus logic divisor; the RADram reference divisor is 10.)

9 Related Work

The IRAM philosophy goes to the extreme by shifting all computation to the memory system through integration of a processor onto a DRAM chip. This results in dramatically improved DRAM bandwidth and latency to the processor core, but conventional processors are not designed to exploit these improvements [B a]. An interesting alternative is to integrate specialized logic into DRAM to perform operations such as Read-Modify-Write [B b]. This alternative is promising, but we have seen that different applications can exploit significantly different computations in the memory system. Our results have shown that integrating reconfigurable logic is highly effective.

Reconfigurable computing has shown considerable success at special-purpose applications [B], but has had difficulty competing with microprocessors on more general-purpose tasks such as floating-point arithmetic. Some groups focus upon building reconfigurable processors [HW, WH, RS, WC], but face an even more difficult competition with commodity microprocessors. Our approach avoids these difficulties by exploiting the strengths of both microprocessors and reconfigurable logic: we focus upon data manipulation to make the memory system perform better for the processor. DeHon described limited integration of reconfigurable logic and DRAM in an early memo [DeH], but did not evaluate it further.

Our philosophy is reminiscent of the scatter-gather engines from a long line of supercomputers [HT, SH, CG, Bat, EJ, HS, L]. Hockney and Jesshope [HJ] give a good history of such machines. Our approach, however, supports a much wider variety of data manipulations and computations than these machines. Additionally, our emphasis on commodity technologies results in a focus on different applications and design tradeoffs.

Future Work

Active Pages and our RADram implementation have shown great potential in our study. Unlocking this potential involves many interesting issues, including compiler support for automatic application partitioning, operating-system integration, multithreaded application support, complete application runtimes, application-specific circuits vs. data primitives, hierarchical computation structures, and interpage and interchip communication. In addition, a detailed power, yield, and hardware implementation study of RADram is required.

For Active Pages to become a successful commodity architecture, the application partitioning process must be automated. Current work uses hand-coded libraries which can be called from conventional code. Ideally, a compiler would take high-level source code and divide the computation into processor code and Active Page functions, optimizing for memory bandwidth, synchronization, and parallelism to reduce execution time. This partitioning problem is very similar to that encountered in hardware-software codesign systems [GVNG], which must divide code into pieces that run on general-purpose processors and pieces that are implemented by ASICs (Application-Specific Integrated Circuits). These systems estimate the performance of each line of code on alternative technologies, account for communication between components, and use integer programming or simulated annealing to minimize execution time and cost. Active Pages could use a similar approach, but would also need to borrow from parallelizing compiler technology [H] to produce data layouts and schedule computation within the memory system.
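As an illustration of the hand-coded partitioning style described above, the sketch below shows conventional processor code invoking a gather function that in-page logic could serve, in the spirit of the sparse-matrix example; the ap_gather interface and its software stand-in are assumptions for illustration, not the actual library API.

    #include <stddef.h>

    /* Software stand-in for a hypothetical Active-Page gather function.
     * In a RADram system, in-page logic would collect the scattered
     * operands; here the processor does it, preserving the semantics. */
    static void ap_gather(const double *x, const int *col_idx,
                          size_t nnz, double *gathered) {
        for (size_t i = 0; i < nnz; i++)
            gathered[i] = x[col_idx[i]];
    }

    /* Processor-side code: multiply-accumulate over gathered operands. */
    double sparse_row_dot(const double *row_vals, const int *col_idx,
                          size_t nnz, const double *x) {
        double gathered[1024];            /* assumes nnz <= 1024 */
        ap_gather(x, col_idx, nnz, gathered);

        double sum = 0.0;
        for (size_t i = 0; i < nnz; i++)
            sum += row_vals[i] * gathered[i];
        return sum;
    }

An automatic partitioner would have to discover splits like this one, weighing the gather's memory-side cost against the processor's floating-point work.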

Integration of Active Pages with a real operating system poses new challenges. Active Pages are similar to both memory pages and parallel processors. Several open operating-system issues exist, such as allocation policies, paging mechanisms, scheduling, and security. Of particular concern is the high cost of swapping Active Pages to and from disk: current FPGA technologies take tens of milliseconds to reconfigure. New technologies, however, promise to reduce these times by several orders of magnitude [DeHa]. Our future work will address these issues both formally and practically, by clarifying the policy of interaction between an operating system and the Active Page memory system and by simulating a modified operating-system kernel such as Linux [Bee]. In addition to operating-system studies, multithreaded application support will be investigated.
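The swap-cost concern above can be quantified with simple arithmetic: swapping one Active Page pays for the disk transfer of its data plus the reconfiguration of its associated logic. The constants below (page size, disk bandwidth, reconfiguration time) are assumed for illustration only.

    #include <stdio.h>

    /* Back-of-the-envelope swap cost (all constants hypothetical). */
    int main(void) {
        const double page_bytes  = 512.0 * 1024;  /* assumed page size      */
        const double disk_bw     = 10.0e6;        /* bytes/s, assumed       */
        const double reconfig_ms = 20.0;          /* assumed FPGA reconfig  */

        double transfer_ms = page_bytes / disk_bw * 1000.0;
        printf("swap cost: %.1f ms transfer + %.1f ms reconfigure = %.1f ms\n",
               transfer_ms, reconfig_ms, transfer_ms + reconfig_ms);
        return 0;
    }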

Future work shall also address interpage and interchip communication issues. Before mechanisms are formalized for interpage communication, a detailed evaluation of interpage communication requirements is needed. This evaluation must study whether interpage communication is required by a broad class of application domains and, if so, whether it should be simulated via processor intervention or implemented with dedicated hardware support. Along with interpage and interchip communication, a study of interpage synchronization primitives is required. Such primitives, if implemented in hardware, pose additional challenges.
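Processor-mediated interpage communication, the simpler of the two options above, might look like the loop below: the CPU drains one page's output region into another page's input region, so every word crosses the memory bus twice. The buffer layout is hypothetical.

    #include <stddef.h>

    /* Hypothetical processor-mediated interpage transfer. Dedicated
     * hardware support would move this copy off the processor entirely. */
    void interpage_copy(volatile int *src_page_out,
                        volatile int *dst_page_in, size_t words) {
        for (size_t i = 0; i < words; i++)
            dst_page_in[i] = src_page_out[i];  /* two bus crossings per word */
    }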

Finally, further evaluation of application kernels is required. Instruction sets such as MMX codify a set of data-manipulation primitives for a certain application domain. Further study of data-manipulation primitives could distill a common base set of primitives for a broad set of application domains. If such primitives exist, hybrids of the RADram implementation and design tradeoffs should be investigated.
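To make the notion of a distilled base set concrete, the sketch below names a few candidate primitives and a software dispatch stub; the particular set chosen here is an assumption for illustration, not the outcome of the study proposed above.

    #include <stddef.h>

    /* Hypothetical base set of in-memory data-manipulation primitives. */
    typedef enum { AP_GATHER, AP_SCATTER, AP_FIND, AP_COUNT } ap_prim_t;

    /* Software stand-in for dispatching a primitive to in-page logic. */
    size_t ap_dispatch(ap_prim_t op, const int *data, size_t n, int key) {
        size_t hits = 0;
        switch (op) {
        case AP_FIND:                      /* index of first match, or n */
            for (size_t i = 0; i < n; i++)
                if (data[i] == key) return i;
            return n;
        case AP_COUNT:                     /* occurrences of key */
            for (size_t i = 0; i < n; i++)
                if (data[i] == key) hits++;
            return hits;
        default:                           /* gather/scatter need buffers */
            return 0;
        }
    }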

Conclusion

Active Pages provide a general model of computation to exploit the coming wave of technologies for intelligent memory. Active Pages are designed to leverage existing memory interfaces and integrate well with commodity microprocessors. In fact, a primary goal of Active Pages is to provide microprocessors with enough useful data to run at peak speeds.

Our RADram implementation of Active Pages achieves substantial speedups when compared to conventional memory systems. RADram provides a large number of simple reconfigurable computational elements which can achieve speedups of up to 1000X over conventional systems. This high performance, coupled with low cost through high chip yield, makes RADram a highly promising architecture for future memory systems.

References

[A] R. Amerson et al. Teramac: configurable custom computing. In Symp. on FPGAs for Custom Computing Machines, Napa Valley, CA, April.

[AA] P. M. Athanas and A. L. Abbott. Real-time image processing on a custom computing platform. IEEE Computer, February.

[Ash] Peter J. Ashenden. The VHDL Cookbook, 1st ed. Dept. of CS, U. of Adelaide, S. Australia, July.

[B] D. Buell et al. Splash 2: FPGAs in a Custom Computing Machine. IEEE Computer Society.

[Ba] N. Bowman et al. Evaluation of existing architectures in IRAM systems. In Workshop on Mixing Logic and DRAM, Denver, CO, June.

[Bb] A. Brown et al. Using MML to simulate multiple dual-ported SRAMs: Parallel routing lookups in an ATM switch controller. In Workshop on Mixing Logic and DRAM, Denver, CO, June.

[BA] D. Burger and T. Austin. The SimpleScalar tool set, version 2.0. Comp. Arch. News, June.

[Bat] K. E. Batcher. STARAN parallel processor system hardware. AFIPS Conf. Proceedings.

[Bee] Nelson H. F. Beebe. A bibliography of publications about the Linux operating system. Technical report, Ctr. for Scientific Comp., Dept. of Math., U. of Utah, Salt Lake City, UT, May.

[BGK] D. Burger, J. Goodman, and A. Kagi. Quantifying memory bandwidth limitations in future microprocessors. In ISCA, Philadelphia, PA, May.

[CG] A. Charlesworth and J. Gustafson. Introducing replicated VLSI to supercomputing: The FPS-164/MAX scientific computer. IEEE Computer, March.

[CLR] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA.

[D] I. Duff et al. Users' guide for the Harwell-Boeing sparse matrix collection. Technical Report TR/PA, CERFACS, Ave. G. Coriolis, Toulouse Cedex, France, October.

[DeH] A. DeHon. Notes on integrating reconfigurable logic with DRAM arrays. Transit Note, MIT AI Lab, Tech. Sq., Cambridge, MA, March.

[DeHa] Andre DeHon. DPGA utilization and application. In Proc. of the Int. Symp. on Field Programmable Gate Arrays, ACM/SIGDA, February.

[DeHb] Andre DeHon. Reconfigurable Architectures for General-Purpose Computing. PhD thesis, MIT.

[EJ] A. Evensen and J. Troy. Introduction to the architecture of a 288-element PEPE. In Proc. Sagamore Conf. on Par. Processing.

[Gus] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press.

[GVNG] D. Gajski, F. Vahid, S. Narayan, and J. Gong. Specification and Design of Embedded Systems. Prentice Hall, Inc., Englewood Cliffs, New Jersey.

[GW] D. Goodwin and K. Wilken. Optimal and near-optimal global register allocation using 0-1 integer programming. Software: Practice and Experience.

[H] M. Hall et al. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, December.

[HJ] R. W. Hockney and C. R. Jesshope. Parallel Computers: Architecture, Programming and Algorithms. Adam Hilger Ltd., Bristol, UK, second edition.

[HS] W. D. Hillis and G. L. Steele. The Connection Machine. MIT Press.

[HT] R. Hintz and D. Tate. Control Data STAR-100 processor design. In COMPCON.

[HW] J. Hauser and J. Wawrzynek. Garp: a MIPS processor with a reconfigurable coprocessor. In Symp. on FPGAs for Custom Computing Machines, Napa Valley, CA, April.

[I] K. Itoh et al. Limitations and challenges of multigigabit DRAM chip design. IEEE Journal of Solid-State Circuits.

[K] W. King et al. Using MORPH in an industrial machine vision system. In K. L. Pocek and J. Arnold, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, Napa, CA, April.

[L] Charles E. Leiserson et al. The network architecture of the Connection Machine CM-5. In Symposium on Parallel Architectures and Algorithms, San Diego, California, June. ACM.

[M] J. Mitchell et al. MPEG Video Compression Standard. Chapman & Hall, New York.

[NM] J. Nelder and R. Mead. A simplex method for function minimization. Computer Journal.

[Pat] David Patterson. Microprocessors in 2020. Scientific American, September.

[Prz] Steven Przybylski. Embedded DRAMs: Today and toward system-level integration. Technical report, Verdande Group, Inc., San Jose, CA, September.

[R] D. Ross et al. An FPGA-based hardware accelerator for image processing. In W. Moore and W. Luk, editors, More FPGAs: Proc. of the Int. Workshop on Field-Programmable Logic and Applications, Oxford, England.

[RS] R. Razdan and M. Smith. A high-performance microarchitecture with hardware-programmable functional units. In Int. Symp. on Microarchitecture, San Jose, CA, November.

[RW] G. Rafael and R. Woods. Digital Image Processing. Addison-Wesley.

[Sem] Semiconductor Industry Association. The national technology roadmap for semiconductors. http://www.sematech.org/public/roadmap.

[SH] N. Sammur and M. Hagan. Mapping signal processing algorithms on parallel architectures. J. of Par. and Distr. Comp., February.

[SKS] A. Silberschatz, H. Korth, and S. Sudarshan. Database System Concepts. McGraw-Hill.

[WC] R. Wittig and P. Chow. OneChip: An FPGA processor with reconfigurable logic. In Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, April.

[WH] M. Wirthlin and B. Hutchings. A dynamic instruction set computer. In Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, April.

[WM] W. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, March.