
ActiveOS: Virtualizing Intelligent Memory

Mark Oskin, Frederic T. Chong, and Timothy Sherwood*

Department of Computer Science

University of California at Davis

Abstract

Current trends in DRAM memory chip fabrication have led many researchers to propose "intelligent memory" architectures that integrate microprocessors or logic with memory. Such architectures offer a potential solution to the growing communication bottleneck between conventional microprocessors and memory. Previous studies, however, have focused upon single-chip systems and have largely neglected off-chip communication in larger systems.

We introduce ActiveOS, an operating system which demonstrates multi-process execution on Active Pages [OCS98], a page-based intelligent memory system. We present results from multiprogrammed workloads running on a prototype operating system implemented on top of the SimpleScalar processor simulator. Our results indicate that paging and inter-chip communication can be scheduled to achieve high performance for applications that use Active Pages, with minimal adverse effects to applications that only use conventional pages. Overall, ActiveOS allows Active Pages to accelerate individual applications by up to 1000 percent and to accelerate total workloads from 20 to 60 percent as long as physical memory can contain the working set of each individual application.

1 Introduction

Microprocessor performance continues to follow phenomenal growth curves which drive the computing industry. Unfortunately, memory systems are falling behind when "feeding" data to these processors. Processor-centric optimizations to bridge this processor-memory gap [WM95][Wil95] include prefetching, speculation, out-of-order execution, and multithreading. Unfortunately, many of these approaches can lead to memory-bandwidth problems [BGK96].

DRAM memory technology, however, is growing significantly denser. The Semiconductor Industry Association (SIA) roadmap [Sem94] projects mass production of 1-gigabit DRAM chips by the year 2001. If we devote half of the area of such a chip to logic, we expect the DRAM process to support approximately 32 million transistors. This density has led many researchers to propose intelligent memory systems that integrate processors or logic with DRAM. The increased bandwidth and lower latency of on-chip communication promise to address many aspects of the processor-memory gap [P+97][KADP97].

These improvements, however, are limited to single-chip systems with limited memory requirements such as PDAs. We expect that the memory demands of most systems will scale with DRAM density, and multiple memory chips will be required. We introduce Active Pages, a page-based architecture for intelligent memory systems (see Figure 1). Active Pages consist of a superpage of data and a collection of functions which operate on that data. Implementations of Active Pages can execute functions for hundreds of pages simultaneously, providing substantial parallelism.

To use Active Pages, computation for an application must be divided, or partitioned, between processor and memory. For example, we use Active-Page functions to gather operands for a sparse-matrix multiply and pass those operands on to the processor for multiplication. To perform such a computation, the matrix data and gathering functions must first be loaded into a memory system that supports Active Pages. Then the processor, through a series of memory-mapped writes, starts the gather functions in the memory system. As the operands are gathered, the processor reads them from user-defined output areas in each page, multiplies them, and writes the results back to the array data structures in memory.
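To make the division of labor concrete, the sketch below shows one possible shape of the processor-side loop for this gather-and-multiply pattern. It is only an illustration of the partitioning described above, not the ActiveOS interface itself; the ActivePageHandle structure and its fields (start_gather, output_ready, output_area, result_area) are hypothetical stand-ins for the memory-mapped writes, synchronization variables, and output areas discussed in Section 2.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical handles standing in for the memory-mapped interface of
    // Section 2; they are not part of ActiveOS itself.
    struct ActivePageHandle {
        volatile uint32_t* start_gather;  // memory-mapped write starts the gather function
        volatile uint32_t* output_ready;  // synchronization variable set by the page
        const double*      output_area;   // operand pairs gathered by the page
        double*            result_area;   // results written back into the array data
        std::size_t        max_pairs;
    };

    // Processor side of the partitioned sparse-matrix multiply: the memory
    // system gathers operand pairs, the processor multiplies them.
    void multiply_partitioned(ActivePageHandle* pages, std::size_t num_pages) {
        // 1. Start the gather functions on every page with memory-mapped writes.
        for (std::size_t p = 0; p < num_pages; ++p)
            *pages[p].start_gather = 1;

        // 2. As pages finish, read operands, multiply, and write results back.
        for (std::size_t p = 0; p < num_pages; ++p) {
            while (*pages[p].output_ready == 0) { /* spin on the sync variable */ }
            for (std::size_t i = 0; i < pages[p].max_pairs; ++i) {
                double a = pages[p].output_area[2 * i];
                double b = pages[p].output_area[2 * i + 1];
                pages[p].result_area[i] = a * b;
            }
        }
    }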

Previous work [OCS98] has shown that Active Pages are a flexible architecture that can lead to dramatic performance improvements (up to 1000X) in a single process over conventional memory systems. This paper evaluates Active Pages in a multi-process environment. Active Pages fundamentally change memory systems from a uniform storage resource to a non-uniform collection of storage and computational elements that require both management and scheduling.

We introduce ActiveOS, a prototype operating system which illustrates the interaction between an OS and Active Pages. Our goal is to demonstrate that Active Pages can perform correctly and efficiently in a multi-process environment. In such an environment, our results show that while paging Active Pages to and from disk can be expensive, opportunities to overlap memory and processor computation increase dramatically. On mixed workloads of conventional and Active-Page applications, Active-Page single-process performance translates well to a multi-process environment as long as memory pressure is moderate. Note that our current cycle-level simulation environment limits workload sizes in our experiments. Our applications are highly scalable, however, and we expect our results to generalize to larger workload sizes.

The remainder of this paper is organized as follows: Section 2 describes the three main services provided by ActiveOS to support Active Pages. Section 3 describes our experimental methodology and Section 4 its limitations. Section 5 describes the applications in our workload. Section 6 presents our results. Section 7 discusses related work. Finally, Sections 8 and 9 present future work and conclusions.

Acknowledgments: Thanks to Andre Dehon, Matt Farrens, Lance Halstead, Diana Keen, and our anonymous referees. This work is supported in part by an NSF CAREER award to Fred Chong, by NSF grant CCR-9812415, by grants from Altera, and by grants from the UC Davis Academic Senate. More info at http://arch.cs.ucdavis.edu/AP

*Now at University of California, San Diego

Figure 1: The Active Page architecture (a processor and disk attached to a memory system whose pages hold both data (D) and functions (F)).

2 ActiveOS

ActiveOS is a prototype operating system whose primary goal is to demonstrate mechanisms which correctly manage Active Pages in a multi-process environment. These mechanisms fall into three key categories: process services, interpage communication, and virtual memory.

2.1 Process Services

The first step in developing operating system support for Active Pages involves providing basic services for user processes to allocate pages and bind functions to them. Furthermore, a process must be able to efficiently call these functions.

The Active Page interface is designed to resemble the interface of a conventional memory system. Processes communicate with Active Pages through memory reads and writes. Specifically, the interface includes:

- Standard memory interface functions: write(address, data) and read(address)

- A set of functions available for computation on a particular Active Page: AP_functions

- An allocation function: AP_alloc(group_id), which allocates an Active Page in group group_id and returns the virtual address of the start of the superpage allocated. Pages operating on the same data will often belong to a page group, named by a group_id, in order to coordinate operations.

- A binding function: AP_bind(group_id, AP_functions), which associates a set of functions AP_functions with group group_id of Active Pages.

- A set of synchronization variables which the AP_functions and the processor use to coordinate activities between the processor and Active Pages within the page group.

There are two important implications of this interface. First, the synchronization variables are essentially memory-mapped registers that allow a process to efficiently invoke Active-Page functions without a system call. Second, the allocation of Active Pages, which are significantly larger than conventional pages, requires management of mixed superpages and pages. This memory management will become a significant issue as we discuss paging and our performance results.
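A minimal sketch of how a process might drive this interface is shown below. The calls follow the functions listed above (AP_alloc, AP_bind), but the surrounding declarations, the count_matches function set, and the treatment of synchronization variables as plain volatile words at the start of the superpage are illustrative assumptions, not the actual ActiveOS headers.

    #include <cstdint>
    #include <cstdlib>

    // Illustrative stand-ins for the interface described above; in a real
    // system these would be provided by the OS and the memory hardware.
    using group_id_t = int;
    struct AP_functions {};                    // functions to run at the memory
    static AP_functions count_matches_funcs;   // hypothetical bound function set

    static void* AP_alloc(group_id_t /*group*/) {
        // stub: an ordinary buffer stands in for the 512 Kbyte superpage
        return std::calloc(512 * 1024, 1);
    }
    static void AP_bind(group_id_t /*group*/, AP_functions* /*funcs*/) {
        // stub: would associate the function set with the page group
    }

    int main() {
        const group_id_t group = 1;

        void* page = AP_alloc(group);           // allocate an Active Page
        AP_bind(group, &count_matches_funcs);   // bind functions to its group

        // By convention (an assumption here), the first words of the superpage
        // are synchronization variables: plain loads and stores, no system call.
        volatile uint32_t* sync = static_cast<volatile uint32_t*>(page);
        sync[0] = 1;                            // memory-mapped write starts the function
        // while (sync[1] == 0) { }             // real code would wait for completion
        uint32_t result = sync[2];              // read the result from the output area

        std::free(page);
        return static_cast<int>(result);
    }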

2.2 Interpage Communication

An application will generally use more data than will fit on a single Active Page (each one is 512 Kbytes in our current implementation). Such data will often reside in page groups, collections of Active Pages which work together to perform functions on the data. For example, inserting an element into the middle of an array that spans multiple pages would require elements to move between pages.

Such data movement is accomplished entirely via explicit interpage communication requests. Active-Page functions can reference any virtual address available to their allocating process. For performance, applications should keep references from functions to addresses on other pages to a minimum.

Our implementation uses a processor-mediated approach to satisfying inter-page memory references. Each Active Page explicitly distinguishes between local and non-local references by checking whether a virtual address lies between its starting address, kept in a base register, and the starting address plus the Active-Page size. When an Active Page reaches a non-local reference, it raises an interrupt and blocks. The interrupt causes the processor to obtain the memory request through a memory-mapped read and execute an OS handler. The handler consults a per-process table to determine whether the Active Page containing the reference is in physical memory. If so, the page is located and the request is satisfied. If not, a page fault occurs and the page is brought in from disk. In practice, a large number of non-local references from different pages can be satisfied upon a single interrupt, substantially reducing overhead. We assume that a single interrupt takes 3.5 ms.

Our processor-mediated approach makes inter-page communication expensive, but it greatly simplifies page faults and reduces the complexity of Active Page chip implementations. We assume that our memory chips can trigger processor interrupts, but processor polling is also a possibility.
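The locality test itself is simple; a sketch is given below. The check against a base register and the fixed 512 Kbyte Active-Page size follows the description above, while the handler loop that drains several pending requests per interrupt is a plausible reading of the batching behavior, not a specification of the ActiveOS handler.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    constexpr uintptr_t kActivePageSize = 512 * 1024;   // 512 Kbytes per Active Page

    // Check performed at each Active Page: addresses inside
    // [base, base + page size) are local; anything else raises an interrupt.
    inline bool is_local(uintptr_t base_register, uintptr_t vaddr) {
        return vaddr >= base_register && vaddr < base_register + kActivePageSize;
    }

    struct PendingRequest { uintptr_t vaddr; };   // non-local reference queued at a page

    // Sketch of the processor-side handler: a single interrupt drains every
    // pending non-local request, faulting pages in from disk when necessary.
    void interpage_handler(std::vector<PendingRequest>& pending,
                           std::unordered_map<uintptr_t, bool>& resident) {
        for (const PendingRequest& req : pending) {
            uintptr_t page = req.vaddr / kActivePageSize;   // per-process table lookup
            if (!resident[page]) {
                // page fault: bring the Active Page in from disk (omitted)
                resident[page] = true;
            }
            // locate the page and satisfy the request (omitted)
        }
        pending.clear();
    }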

2.3 Virtual Memory

When insufficient physical memory is available to execute all processes in the system, memory must be paged to swap space. We define the maximum required swap space to execute a set of processes as the memory pressure. As memory pressure increases, Active Pages will need to be paged out to disk. This implies that the data, the functions, and any active state must be written to disk. The choice of which pages to swap in and out is critical to both the correctness and performance of Active Pages. If the processor or an Active Page in memory references a page on disk, a page fault occurs. If no reference occurs, however, a page on disk may still need to be swapped in to complete a computation for a page group. Active Pages can only make computational progress if they are in physical memory. Managing Active Pages is therefore more than traditional virtual memory management [Den70]; it is also scheduling of computation.

To help the OS schedule Active Pages, we identify three states in which an Active Page exists: initial (I), dormant (D), and active (A). When a page is allocated, prior to function binding, the state is initial (I). When functions are bound to a page, but no function is currently executing at the page, we identify the page as dormant (D). Finally, when a function is executing at the page, we identify the page as active (A). It is expected that an operating system will provide mechanisms to control the allocation and binding of functions to pages, but that user processes will transition a page from the dormant to active states.

For performance reasons, we restrict the means by which an Active Page can transition from the dormant (D) to active (A) states. As shown in Figure 2, an operating system must periodically schedule Active Pages residing in swap space back into physical memory in order to ensure program correctness. This arises because an Active Page, X, may be waiting upon a result that another page on disk, Y, is about to write to the memory of X. Since there is no reference to Y, the page will not be rescheduled without some external mechanism.

Figure 2: State transition diagram for Active Page memory. A process uses the operating system to allocate an Active Page (I) and to bind functions to it (D); light-weight invocations or inter-page communication activate a page (A); a process can explicitly terminate an AP function, or the function can terminate itself; the operating system schedules pages to and from swap space, and pages in the initial and dormant states can safely be sent to swap.

If we restrict the mechanism by which an Active Page can transition from dormant to active states, then an operating system can periodically schedule only active Active Pages back from swap, thus avoiding having to schedule dormant pages. The restriction is that an Active Page never transitions from the dormant to active states without processor or inter-Active-Page communication intervention. That is, an Active Page never self-activates.
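The three states and the restricted activation rule can be summarized as a small transition function, sketched below; the enum and helper are illustrative only and do not reproduce ActiveOS data structures.

    // Sketch of the Active Page states and the allowed transitions described
    // above; illustrative only, not the ActiveOS implementation.
    enum class PageState { Initial, Dormant, Active };

    enum class Event {
        BindFunctions,        // OS binds AP_functions to the page    (I -> D)
        ProcessorInvocation,  // light-weight invocation by a process (D -> A)
        InterPageMessage,     // inter-page communication on receive  (D -> A)
        FunctionTerminates    // function finishes or is terminated   (A -> D)
    };

    // A page never self-activates: the only D -> A transitions are driven by
    // the processor or by inter-Active-Page communication.
    bool transition(PageState& s, Event e) {
        switch (s) {
        case PageState::Initial:
            if (e == Event::BindFunctions) { s = PageState::Dormant; return true; }
            return false;
        case PageState::Dormant:
            if (e == Event::ProcessorInvocation || e == Event::InterPageMessage) {
                s = PageState::Active; return true;
            }
            return false;
        case PageState::Active:
            if (e == Event::FunctionTerminates) { s = PageState::Dormant; return true; }
            return false;
        }
        return false;
    }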

Our results indicate that performance is only marginally affected by the scheduling quantum for periodically swapping pages back into physical memory, suggesting that as small a quantum as possible should be used, provided performance is not seriously degraded, in order to improve user response time.

cycle-level pro cessor-memory simulator. While our results

ActiveOS currently implements a round-robin memory

are extremely accurate, simulation sp eed limits the size of

page scheduler for ActivePages residing in swap space. For

our workloads. Consequently, our results primarly demon-

page replacement, we adopted a Least Recently Used (LRU)

strate correctness of ActiveOS mechanisms. They also reveal

algorithm which was implemented by tracking a counter

some interesting qualitative issues involving ActivePages on

value for each physical page available in the system. The

disk and overlapping memory computations with multiple

op erating system maintains a counter that is incremented

pro cesses. We note that our applications are highly-scalable

each time a memory page is referenced by the memory man-

streaming applications, hence we exp ect our results to scale

agement subsystem. The memory management subsystem

to larger workload sizes. Since our pro cessor-mediated com-

references pages when memory is allo cated, up on direct ker-

munication metho d has equal cost b etween on- and o -chip

nel access to pro cess memory, and when a TLB miss excep-

inter-page communication, communication costs are not de-

tion o ccurs. Sp ecial pro cessing is p erformed on this counter

p endentuponnumb er or size of memory chips. Future work

to ensure it do es not over ow and that the relative page ac-

will increase simulation capacity through the use of a higher-

cess times are maintained. When memory is to b e swapp ed

level emulation system.

to disk the LRU algorithm searches the entire page space

and returns a reference to the page with the lowest access

5 Multiprogrammed Workload

counter value.
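A compact sketch of this counter-based LRU victim selection is shown below. The global reference counter, per-frame stamps, and linear search over all frames follow the description above; the data layout and names are assumptions.

    #include <cstdint>
    #include <vector>

    // Counter-based LRU, as described above: every reference by the memory
    // management subsystem stamps the frame with a global counter, and the
    // victim is the frame with the lowest stamp.  Layout is illustrative.
    struct FrameTable {
        uint64_t global_counter = 0;
        std::vector<uint64_t> last_access;   // one entry per physical frame

        explicit FrameTable(std::size_t frames) : last_access(frames, 0) {}

        // Called on allocation, direct kernel access, and TLB-miss exceptions.
        void touch(std::size_t frame) {
            last_access[frame] = ++global_counter;
            // overflow handling that preserves relative order is omitted
        }

        // Linear search over the whole frame table for the LRU victim.
        std::size_t select_victim() const {
            std::size_t victim = 0;
            for (std::size_t f = 1; f < last_access.size(); ++f)
                if (last_access[f] < last_access[victim]) victim = f;
            return victim;
        }
    };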

3 Methodology

In order to demonstrate multi-process operation, a mixture of Active-Page-enhanced and conventional applications was chosen. For the Active-Page applications, conventional-only counterparts were implemented. The applications were chosen to represent a mixed workload, as one might encounter in a multi-user system. A brief description of each application is presented in Section 5. Both the conventional and conventional/Active-Page mixed application workloads were executed under various page replacement policies and, in the case of the mixed Active-Page/conventional application suite, with various memory resource scheduling quanta. The results of these measurements are presented in Section 6.

These measurements were taken on a prototype operating system, ActiveOS, running on a processor and memory system simulator. ActiveOS runs on the SimpleScalar v2.0 [BA97] toolkit. This tool set provides the mechanisms to compile, debug, and simulate applications compiled to a RISC architecture. The SimpleScalar RISC architecture is loosely based upon the MIPS instruction set architecture. The SimpleScalar environment was extended by replacing the simulated conventional memory system with an Active Page memory system. The new simulated memory hierarchy provides mechanisms which simulate a RADram memory system, an FPGA-based implementation of Active Pages described in [OCS98]. Finally, the toolset was enhanced by extensive modifications required to support the simulation of an operating system and associated multi-program workloads. The simulated machine characteristics are shown in Table 1.

Parameter            Reference
CPU Clock            500 MHz
L1 I-Cache           32K
L1 D-Cache           32K
L2 Cache             256K
Reconf Logic         50 MHz
Cache Miss           114 ns
Disk Access Time     10 ms
Disk Transfer Rate   10 mb/sec

Table 1: Summary of simulator parameters

4 Limitations

A significant limitation of our methodology is the use of a cycle-level processor-memory simulator. While our results are extremely accurate, simulation speed limits the size of our workloads. Consequently, our results primarily demonstrate correctness of ActiveOS mechanisms. They also reveal some interesting qualitative issues involving Active Pages on disk and overlapping memory computations with multiple processes. We note that our applications are highly scalable streaming applications, hence we expect our results to scale to larger workload sizes. Since our processor-mediated communication method has equal cost between on- and off-chip inter-page communication, communication costs are not dependent upon the number or size of memory chips. Future work will increase simulation capacity through the use of a higher-level emulation system.

5 Multiprogrammed Workload

Our workload consists of three applications which use Active Pages and three conventional applications which do not. The Active-Page applications consist of: database inserts and deletes using a C++ array class library, matrix multiply of a large finite-element sparse matrix, and protein sequence matching using dynamic programming. The conventional applications consist of: the gcc compiler running on input from a source file generated by the yacc parser generator, gzip compression of a 510 kilobyte executable file, and perl executing a prime-number finder. The conventional applications are largely taken from the SPEC 95 suite, with the exception of gzip, which we felt was more complete and up-to-date than compress. Each of the Active-Page applications was also implemented for a conventional memory system for comparison.

Database is a synthetic benchmark which consists of a sorted database of records and requests on those records. The requests arrive sporadically from users and consist of inserts and deletes of individual records. The database was stored in an STL-style C++ array class object which provides O(1) lookup. Active-Page functionality is used to shift data in parallel at the memory system.
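The memory-side work for an insert is essentially a shift of every record at or above the insertion point, which each Active Page can perform locally and in parallel. The sketch below illustrates that per-page shift; the Record layout and function signature are hypothetical, and the actual array class library is not reproduced here.

    #include <algorithm>
    #include <cstddef>

    struct Record { char key[32]; char payload[96]; };   // illustrative record layout

    // Work performed inside one Active Page for an insert: shift the page's
    // records up by one slot so the new record can drop into insert_slot.
    // Every page of the group can run this simultaneously, which is where
    // the parallel speedup over a processor-side shift comes from.
    // Returns the record pushed out of the top of this page, which is handed
    // to the next page in the group via inter-page communication.
    Record shift_up_within_page(Record* page_records, std::size_t records_per_page,
                                std::size_t insert_slot, const Record& carried_in) {
        Record spilled = page_records[records_per_page - 1];
        std::move_backward(page_records + insert_slot,
                           page_records + records_per_page - 1,
                           page_records + records_per_page);
        page_records[insert_slot] = carried_in;   // record handed in from the previous page
        return spilled;
    }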

rates for Active-Page computations than conventional com-

Matrix is a matrix multiply application whichmultiplies

putations. Finally, although not discussed, p erformance was

a matrix by itself. The input matrices were chosen from

found to be relatively insensitive to the memory-resource

the Harwell-Bo eing b enchmark suite [DGL92 ]. The normal

scheduling quantum.

and Active-Page versions of this application op erate using

sparse-matrices. Active-Page functionality is used to pre-

6.1 Application Characteristics

pro cess the input matrix against the columns of the matrix

to b e multiplied. The prepro cessing consists of computing

Table 2 lists the memory requirements, and access char-

the intersecting elements and reordering the memory such

acteristics of each application in our workload. The data

that the valuesoftheintersecting elements are compressed

was acquired by op erating the workloads in a simulated 256

into individual cache lines.

Mbyte physical memory size environment to ensure that no

DNA is a protein sequence matching application which

was required. Note that the DNA applica-

takes in anynumb er of input protein sequences of any length

tion constitutes nearly half the working set of the workload.

and computes all pair-wise sequence alignments [Gus97 ] .

Also note that the Active-Page version matrix trades extra

The result of the pair-wise alignments are values whichin-

storage for parallelism, but Database actually saves space.

dicate the relativelikeness of each protein sequence to every

Total memory consumption is one and a half megabytes

other protein sequence in the input data-set. Active-Pages

larger for the mixed workload than for the conventional

are used for computation of the largest common subsequence

workload.

result matrices. This application makes use of inter-Active-

Throughout our exp eriments, it was observed that the

Page communication facilities.

TLB misses remained relatively constantasphysical mem-

gcc: The cc1 compiler is a standard comp onent of the

ory size varied. Note that the total TLB misses for the

SPEC 95 b enchmark suite [SPE95 ]. The input le was a

ActivePage mixed workload is lower than for the conven-

1859 line size prepro cessed C le generated from the yacc

tional workload. Furthermore, a conventional application,

parser generator.

gcc,makes up the dominant factor in b oth the conventional

: The characteristics of the perl interpreter are dis-

and mixed workloads for TLB misses. Overall, ActivePage

+

cussed in [RLV 96] and [SPE95 ]. The input le was chosen

applications require fewer TLB misses, thus lowering the

to execute in roughly the same p erio d of time as it to ok the

TLB cache pressure.

gcc compiler to execute.

Table 3 lists required swap space versus physical mem-

gzip:We include the gzip le compression utilitytorepre-

ory size for our conventional and mixed workloads under

sent an up dated version of the standard SPEC 95 compress

various paging p olicies. The mixed workload as a whole re-

utility. The input le was chosen to be a challenging 510

quires roughly four megabytes more physical memory than

Application Conventional Mixed (AP/Conv) Conventional TLB Mixed (AP/Conv) TLB

Database 2512k 1720k 11121 1766

DNA 8032k 8384k 31230 623

Matrix 664k 2680k 14048 8620

GCC 2820k 2820k 202867 204827

Perl 1276k 1276k 16155 16758

GZIP 628k 628k 10914 12318

Total: 15932k 17508k 286335 244912

Table 2: Memory requirements and number of TLB misses of conventional and Active-Page applications

Memory Size 15360k 13312k 11264k 9216k 7168k

Conventional/LRU 0k 2052k 4100k 6148k 8196k

Mixed/LRU 5768k 6640k 8400k 10392k 12472k

Table 3: Required swap space for conventional versus mixed workloads

Referring back to Table 2, one and a half megabytes of this difference comes from algorithmic restructuring in database and matrix. The remaining two and a half megabytes, however, come from fragmentation and superpage effects during swaps. The current ActiveOS implementation does not compact partially used conventional pages to free up more superpages. Furthermore, Active Pages must be swapped out as entire superpages even when a single conventional page of storage is all that is needed. Future versions of ActiveOS will be optimized to reduce these effects and allow for better performance at higher memory pressures.

Our simulation results show that ActiveOS can run our mixed workload with high performance. Figure 3 plots application time versus physical memory size for each application. We note that Active Page applications take significant advantage of the Active Page memory system to enhance their performance. Since each application is only charged process time for when its process is scheduled on the processor, each Active Page application gets a potential six-fold performance advantage if its Active Pages execute while other processes are scheduled. Referring to the wall-clock times in Figure 3, we can see that conventional applications in our mixed workload benefit from the reduced time that Active Page applications spend in the system.

Figure 3: Conventional vs. mixed workload times for each application for LRU page replacement. Process time is given on the left and wall-clock time on the right (bars are for 3, 5, 7, 9, 11, 13, and 15 Mbytes of physical memory; times are in cycles).

The performance of the Active Page DNA application is most affected by moderate reductions in physical memory size. Also note a pathological case in which the interaction of LRU with the access patterns of DNA causes 13 Mbytes of memory to perform better than 15 Mbytes.

Further note that the Active-Page version of matrix performs extremely poorly when physical memory is very low. At 3 Mbytes, matrix performs an order of magnitude worse than at 7 Mbytes. This occurs because of frequent and repeated access to each Active Page. Although each page is touched repeatedly, the quantity of data used from it is very small. Under extreme memory pressure, intense swapping occurs with very little access to data by the processor.

Overall, we see that ActiveOS mechanisms can efficiently virtualize Active Pages for light and moderate memory pressures. Although our simulation infrastructure limits workloads to sizes smaller than those in future systems, the data-intensive nature of our applications will scale well to large sizes and we expect our results to be qualitatively similar for larger workloads.

7 Related Work

Active Disks, as proposed by Riedel and Gibson [RG97], provide a mechanism for doing application-specific computation at the disks. That work is in a similar vein to ours, in that it offloads some of the work to the disk, with similar consequences. A small embedded processor is inserted into the disk controller which then executes portions of the user code. Riedel and Gibson argue that by moving this computation to the other side of the disk interconnect, more efficient use of disk and host resources can be achieved. It is then further argued that Filters, Real-Time, Batching, and High-Level Support are some of the applications which should benefit from Active Disks. Two applications, a database select and a parallel sort, are then looked at in greater depth.

The idea of using multiple page sizes is well established. Many commodity processors support the use of multiple page sizes, such as the DEC Alpha, SPARC, Intel, and HP PA-RISC [TH94]. A discussion of the use of multiple page sizes is given by Talluri et al. [TKHP92]. Further work is done by Talluri and Hill to expand placement algorithms to support superpages [TH94]. Active Pages, however, involve both superpage management and computation scheduling.

There has been a fair amount of work in the design of active memory systems, most notably the SWIM project [ACK94] and the PAM project [VBR+96]. Asthana et al. proposed and built an active memory system for network applications known as SWIM. SWIM is a memory system built out of SRAM with small processing elements on each chip. The memory system has a back end which is interfaced to communication lines and disk subsystems. The Programmable Active Memories project (PAM) at Digital's Paris Research Laboratory focused on the performance of a memory-mapped FPGA coprocessor on compute-intensive applications. The FPGA coprocessor was linked to a high-bandwidth SRAM bank in order to keep up with the increased demand for data in the FPGA. Other groups have also proposed computation in memory [SS95], but none, to the authors' knowledge, have examined the operating system ramifications of doing so.

8 Future Work

Although ActiveOS has shown promising results in virtualizing Active Pages for a multi-process environment, many issues remain.

A major effort is underway to automate the application partitioning process. The number of applications in the multi-program workload needs to be increased, and the working-set size of each application also needs to be expanded. Furthermore, virtual memory paging and memory resource scheduling techniques need to be investigated in more detail. Operating system support for Active Page memories with hardware communication facilities will need to be addressed. Finally, support for multiple sizes of Active Pages may prove useful.

For Active Pages to become a successful commodity architecture, the application partitioning process must be automated. Current work uses hand-coded libraries which can be called from conventional code. Ideally, a compiler would take high-level source code and divide the computation into processor code and Active-Page functions, optimizing for memory bandwidth, synchronization, and parallelism to reduce execution time. This partitioning problem is very similar to that encountered in hardware-software co-design systems [VG95], which must divide code into pieces which run on general-purpose processors and pieces which are implemented by ASICs (Application-Specific Integrated Circuits). These systems estimate the performance of each line of code on alternative technologies, account for communication between components, and use integer programming or simulated annealing to minimize execution time and cost. Active Pages could use a similar approach, but would also need to borrow from parallelizing compiler technology [HAA+96] to produce data layouts and schedule computation within the memory system.

The current application workload operates within a confined memory range. The memory core required for the normal application mix is 16 megabytes, not counting the operating system. Extending the memory size of this workload naturally implies extending the length of the execution time of each application. Future work will study applications with larger memory sizes. Furthermore, more Active-Page applications will be implemented, providing an opportunity for a better understanding of Active-Page application and OS interaction.

With more applications available, a better understanding of their paging characteristics can be achieved. Future work will focus on characterizing Active-Page program behavior in more detail, and then designing efficient paging policies to support both Active-Page and conventional applications. Furthermore, work will address active Active-Page priorities and memory-resource priority scheduling. All current paging policies rely upon access information generated by the processor. Active Pages, however, degrade the quality of this information because activity can occur solely inside the memory system; such activity is not tracked by processor access statistics.

Future work will expand on this notion of an information gap and provide solutions for it. Currently, we suggest a solution of Active-Page-addressable per-page priorities. This extends the notion of the three states for an Active Page: a page exists in the initial (I), dormant (D), or active (A) state, with some priority (K) attached. The priority-based mechanism would establish a contract between the operating system and a user process: if a process properly varies the priorities of its associated Active Pages, then the operating system can more efficiently execute the application under low-memory conditions. This new information about Active-Page priorities can then be used to offset the lack of information the operating system currently has about Active-Page activity. Fairness between Active Page and conventional processes, however, will be an important issue.
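As a rough illustration of what such a contract might look like, the sketch below attaches a priority to each dormant or active page and has the operating system pick eviction victims among the lowest-priority pages first. This is speculative future-work material: the types, the priority encoding, and the selection rule are assumptions, not an existing ActiveOS mechanism.

    #include <cstdint>
    #include <limits>
    #include <vector>

    // Speculative sketch of per-page priorities: a page is I, D, or A, and
    // D/A pages carry a process-assigned priority K.  None of this exists in
    // the current ActiveOS; it only illustrates the contract discussed above.
    enum class State : uint8_t { Initial, Dormant, Active };

    struct PageInfo {
        State   state    = State::Initial;
        uint8_t priority = 0;        // K, raised or lowered by the owning process
    };

    // The OS honors the contract by evicting the lowest-priority resident
    // page, preferring dormant pages over active ones at equal priority.
    int pick_eviction_victim(const std::vector<PageInfo>& pages) {
        int best = -1;
        int best_score = std::numeric_limits<int>::max();
        for (int i = 0; i < static_cast<int>(pages.size()); ++i) {
            if (pages[i].state == State::Initial) continue;        // nothing bound yet
            int score = pages[i].priority * 2 +
                        (pages[i].state == State::Active ? 1 : 0); // active costs more to evict
            if (score < best_score) { best_score = score; best = i; }
        }
        return best;   // -1 if no dormant or active page is resident
    }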

Nothing restricts an Active-Page memory from containing a hardware-based communication network. However, without hardware mechanisms in place to restrict inter-page communication, potential security and safety holes are created. For this reason, hardware-assisted communication in Active-Page memory devices must have mechanisms to identify Active Pages as belonging to a particular process. The user-level function must be restricted from changing this process identification without sufficient operating system intervention. Furthermore, all communications must be confined to within the same process or to processes defined in a restricted fashion. Inter-process communication occurring within a hardware-assisted communication network must be restrictable by the operating system. Finally, errant and malicious application functions bound to an Active Page must be restricted from affecting other superpages in the memory system.

Current work has focused on the RADram [OCS98] Active Page memory system. It is anticipated that future Active Page memories will include hardware support for inter-Active-Page communication. The techniques developed in this paper to provide virtual memory support for Active Page memories are still applicable. However, hardware communication networks for inter-Active-Page memory accesses will pose additional operating system challenges in order to minimize communication delay, ensure process protection, and maximize overall application performance.

Finally, an investigation into multi-grain Active-Page implementations and associated operating system support is required. Research into application partitioning suggests that one method of tuning an application partition to a particular problem size is to vary the size of the Active Page. Rather than choose a fixed overall page size, it is proposed that by providing two to four sizes of Active Pages, the maximum possible speedup can be extended and maximized across a broader range of problem sizes. Multi-grain Active-Page support extends the Active-Page memory model but poses new hardware and software design challenges.

9 Conclusion

ActiveOS demonstrates that intelligent memory systems can be virtualized to achieve high multiprogrammed performance. Active Pages provide a page-based intelligent memory architecture that makes this virtualization possible. In fact, multiprogramming increases the opportunities for overlapping processor and memory computation. The fact that memory performs computation, however, means that some reasonable amount of physical memory must be available. Consequently, our results show high performance from zero to moderate memory pressures, but performance degrades significantly as available memory falls below the working-set size of any one Active-Page application.

References

[ACK94] Abhaya Asthana, Mark Cravatts, and Paul Krzyzanowski. Design of an active memory system for network applications. In International Workshop on Memory Technology, Design and Testing, pages 58-63. IEEE Computer Society Press, 1994.

[BA97] D. Burger and T. Austin. The SimpleScalar tool set, version 2.0. Computer Architecture News, 25(3), June 1997.

[BGK96] D. Burger, J. Goodman, and A. Kagi. Quantifying memory bandwidth limitations in future microprocessors. In International Symposium on Computer Architecture, Philadelphia, Pennsylvania, May 1996. ACM.

[Den70] Peter J. Denning. Virtual memory. Computing Surveys, 2(3):153-189, September 1970.

[DGL92] Ian S. Duff, Roger G. Grimes, and John G. Lewis. User's guide for the Harwell-Boeing sparse matrix collection. Technical Report TR/PA/92/86, CERFACS, 42 Ave G. Coriolis, 31057 Toulouse Cedex, France, October 1992.

[Gus97] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.

[HAA+96] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. Computer, December 1996.

[KADP97] Kimberly Keeton, Remzi Arpaci-Dusseau, and David A. Patterson. IRAM and SmartSIMM: Overcoming the I/O bus bottleneck. In Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, Denver, Colorado, June 1997.

[OCS98] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. Active Pages: A computation model for intelligent memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA '98), Barcelona, Spain, 1998. To appear.

[P+97] D. Patterson et al. The case for intelligent RAM: IRAM. IEEE Micro, April 1997.

[RG97] Erik Riedel and Garth Gibson. Active disks - remote execution for network-attached storage. Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3890, December 1997.

[RLV+96] Theodore H. Romer, Dennis Lee, Geoffrey M. Voelker, Alec Wolman, Wayne A. Wong, Jean-Loup Baer, Brian N. Bershad, and Henry M. Levy. The structure and performance of interpreters. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pages 150-159. ACM Press, October 1996.

[Sem94] Semiconductor Industry Association. The national technology roadmap for semiconductors. http://www.sematech.org/public/roadmap/, 1994.

[SPE95] SPEC. SPEC benchmark specifications. SPEC95 Benchmark Release, 1995.

[SS95] Steven Van Singel and Nandit Soparkar. Logic-enhanced memories for data-intensive processing (extended abstract). In International Workshop on Memory Technology, Design and Testing, pages 88-94. IEEE Computer Society Press, 1995.

[TH94] Madhusudhan Talluri and Mark D. Hill. Surpassing the TLB performance of superpages with less operating system support. In Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), pages 171-182. ACM Press, 1994.

[TKHP92] Madhusudhan Talluri, Shing Kong, Mark D. Hill, and David A. Patterson. Tradeoffs in supporting two page sizes. In 19th Annual International Symposium on Computer Architecture, pages 415-424. ACM Press, 1992.

[VBR+96] J. E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati, and P. Boucard. Programmable active memories: Reconfigurable systems come of age. IEEE Transactions on VLSI Systems, 4(1), March 1996.

[VG95] F. Vahid and D. Gajski. SLIF: A specification-level intermediate format for system design. In Proceedings of the European Design and Test Conference, pages 185-189, Washington, March 6-9, 1995. IEEE Computer Society Press.

[Wil95] M. Wilkes. The memory wall and the CMOS end-point. Computer Architecture News, 23(4), September 1995.

[WM95] W. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23(1), March 1995.