Evaluation of Existing Architectures in IRAM Systems

Evaluation of Existing Architectures in IRAM Systems Ngeci Bowman, Neal Cardwell, Christoforos E. Kozyrakis, Cynthia Romer and Helen Wang Computer Science Division University of California±Berkeley g fbowman,neal,kozyraki,cromer,helenjw @cs.berkeley.edu ct Abstra chitectures and the software that runs on them. Perhaps most important is the consideration of binary compatibility for existing software: there already exists a large body of system Computer memory systems are increasingly a bottleneck lim- software and applications that could be used ªout of the boxº iting application performance. IRAM architectures, which in- if an existing architecture is adopted. Higher application per- tegrate a CPU with DRAM main memory on a single chip, formance would only require tuning programs and compilers promise to remove this limitation by providing tremendous to the speci®c characteristics of the new memory hierarchy. main memory bandwidth and signi®cant reductions in memory latency. To determine whether existing microarchitectures can For this work, we evaluated the performance implications of tap the potential performance advantages of IRAM systems, this evolutionary approach of combining an existing microar- we examined both execution time analyses of existing micro- chitecture with an IRAM memory hierarchy. Our investigation processors and system simulation of hypothetical processors. had two complementary aspects. First we measured and ana- Our results indicate that, for current benchmarks, existing ar- lyzed the performance of applications on two existing micro- chitectures, whether simple, superscalar or out-of-order, are processors ± one simple superscalar processor and one complex unable to exploit IRAM's increased memory bandwidth and out-of-order processor ± and used these results to predict the decreased memory latency to achieve signi®cant performance performance of a hypothetical, otherwise identical system with bene®ts. an IRAM memory hierarchy. Subsequently we used complete system simulations to obtain a detailed performance evaluation of simple IRAM and conventional processors. Intro duction 1 The remainder of this paper is organized as follows: Section 2 presents the main implementation and architectural consid- erations for IRAM systems and section 3 describes the bench- One proposed solution to the growing gap between micropro- marks used in this study. Section 4 discusses our analytic cessor performance and main memory latency is to integrate evaluation of IRAM implementations of two existing architec- a processor and DRAM on the same die, an organization we tures. Section 5 describes the results of simulations of simple refer to as Intelligent RAM (IRAM) [10]. Because all mem- conventional and IRAM systems. Finally, section 6 presents ory accesses remain on-chip and the memory bus width is no our conclusions from these studies and suggests directions for longer in¯uenced by pin constraints, IRAM should improve future IRAM research. main memory bandwidth by two orders of magnitude and main memory latency by one order of magnitude. 2 I mplementationand Arc hitec turalConside rations It is not clear what processor microarchitecture will best be able to turn these advantages into signi®cant application performance bene®ts. However, there are several reasons to prefer There are several signi®cant ways in which IRAM systems will existing general-purpose microarchitectures, such as wide su- differ from today's microprocessors: the integration of DRAM perscalar, dynamic (out-of-order) execution, or even simple in- on-chip, the resulting high-bandwidth main memory bus, the order RISC CPUs, for IRAM systems. Current organizations elimination of L2 caches, and a potential slow-down incurred already achieve impressive performance across many classes by logic in today's DRAM processes. of applications, including personal productivity applications, graphics, databases, scienti®c computation, and software de- The primary difference between IRAM systems and conven- velopment. Furthermore, the performance trade-offs for such tional systems will, of course, be the integration of large architectures are well-understood, and we already have the amounts of DRAM on-chip. For example the DEC Alpha tools and know-how to design, debug, and tune both the ar- 21164 processor, with 16KBytes of ®rst level caches and 2 : m This work was supported by DARPA (DABT63-0056), the California State 96KBytes of L2 cache, occupies 299mm in a 0 5 CMOS m MICRO Program and research grants and fellowships from Intel, Sun Microsys- process. In a 256Mbit DRAM 0:25 CMOS process [12], tems and the National Science Foundation. 2 Benchmark Description this would take up approximately 75mm , or one fourth of the die area. This allows up to 24MBytes of on-chip DRAM mem- tomcatv Mesh Generation: generates a 2-D mesh ory in the remaining area. While this may not be suf®cient by su2cor Quantum Physics: computes masses of itself for high-end workstations, it is enough for low-end PCs elementary particles using a Monte Carlo and portable computers. The memory access time for such a method and the Quark-Gluon theory wave5 Electromagnetics: solves Maxwell's system could can be as low as 21ns, since off-chip communica- equationsonaCartesianmeshusinga tion over high-capacity busses has been eliminated [10]. This variety of boundary conditions is up to ten times faster than the main memory access times of gcc Compiler: uses the GNU C compiler to current conventional systems. convert a set of preprocessed source ®les into Sparc assembly language Second, since these access times for the on-chip main memory compress Compression: compresses large ®les [8] can be comparable to that of SRAM L2 caches today, using using adaptive Lempel-Ziv coding die area for an L2 cache provides little performance improve- li Lisp Interpreter: uses a Lisp interpreter ment. This area can instead be used for on-chip DRAM, which to interpret a set of programs is more than 10 times as dense [5]. Consequently, the IRAM ijpeg Imaging: performs JPEG image compression systems we consider in this study have no L2 caches. perl Perl Interpreter: Uses Perl to perform text and data manipulation go Arti®cial Intelligence: Plays the game GO Third, because main memory will be integrated on-chip, IRAM against itself systems can be designed with memory busses as wide as de- vortex A single user O-O database transaction sired. Given that the memory bus will connect the main mem- benchmark. Builds and manipulated 3 ory to the L1 cache, the bus width should probably be equal to interrelated databases the L1 cache block size. m88ksim Simulator: Simulates the Motorola 88100 processor Finally, initial IRAM implementations may suffer from logic mpeg encode Video Encoding: encodes 48 720x480 color speed degradation. Existing DRAM technology has been op- frames into a MPEG video ®le timized for density and yield, so logic transistors and gates in linpack1000 Equation Solver: 1000x1000 sparse linear DRAM processes are slower than those in corresponding logic equation solver (double precision) processes. This can translate to a processor clock frequency sort Sorting Program: disk-to-disk sort of up to 1.5 times slower than that of similar architectures imple- 100-byte records of a 21MByte database mented with conventional logic processes [5] [10]. Fortunately, high-speed logic in DRAM processes has already been demon- Table 1: The benchmarks and applications used in this study. strated in prototype systems, so it is expected that within a few years logic in DRAM chips will be as fast as microprocessor logic. 4 Evaluating IRA M through Measurement and Ex- trap olation 3 Benchm arks and Applications Our initial approach to evaluating current microarchitectures implemented asIRAM systemswasanalytical. The goal was to Table 1 describes the benchmarks and applications used for analyze the execution behavior of applications on two existing this evaluation. organizations and estimate the effect of an IRAM implementation on the total execution time and its individual components. SPEC 95 [1] is the current industry-accepted standard for This allowed us to compare the potential performance of the uniprocessor performance evaluation. We used three of the IRAM implementations to that of the originals. ¯oating point programs of the suite (tomcatv, su2cor, wave5) and all eight integer benchmarks. All SPEC 95 programs were The two architectures examined in this study were the DEC Al- compiled with base settings and run on the complete reference pha 21064 [4] and the Intel Pentium Pro [2]. The Alpha 21064 inputs. uses a simple dual-issue, in-order execution organization with direct-mapped, blocking caches. By contrast, the Pentium Pro Unfortunately, SPEC 95 benchmarks do not exercise all levels employs an aggressive triple-issue architecture, with out-of- of memory hierarchy and are not considered representative of order and speculative execution, a deeper pipeline, and 4-way possible workloads for current and future systems. Tostudy the set-associative, non-blocking caches. Table 2 summarizes the behavior of a broader range of applications, we employed three main characteristics of the two processors. These two orga- additional benchmarks for this study. Mpeg encode,which nizations represent contrasting approaches to microprocessor encodes static frames into a MPEG video ®le, is a represen- architecture. Alpha's simplicity leads to implementations with tative multimedia application. Linpack1000, a double preci- extremely high

Evaluation of Existing Architectures in IRAM Systems

Digital Semiconductor Alpha 21064 and Alpha 21064A Microprocessors Hardware Reference Manual

Alpha 21064A Microprocessors Data Sheet

Computer Architectures an Overview

Virtual Memory CS740 October 13, 1998

Address Translation & Caches Outline

Database Integration

Thesis May Never Have Been Completed

Virtual Memory

Iilihfflf WWETWIU HI Ull ISTITUTO NAZIONALE DI FISICA NUCLEARE - ISTITUTO NAZIONALE DI FISICA NUCLEARE - \ST, ^3 Laboratori Nazionali Di Frascati

Advancements in Microprocessor Architecture for Ubiquitous AI—An Overview on History, Evolution, and Upcoming Challenges in AI Implementation

A 160Mhz 32B 0.5W CMOS ARM Processor, Sribalan

Phd Musoll.Pdf