Using the Cell Synergistic Processor As a Garbage Collection Coprocessor

Cell GC: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor Chen-Yong Cher Michael Gschwind IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 [email protected] [email protected] Abstract 1. Introduction In recent years, scaling of single-core superscalar processor perfor- Traditional computer systems use caches to reduce average memory mance has slowed due to complexity and power considerations. To latency by providing non-architected temporary high-speed storage improve program performance, designs are increasingly adopting close to microprocessors. Alas, while this design choice reduces chip multiprocessing with homogeneous or heterogeneous CMPs. average latency and presents the programmer with a “flat” memory By trading off features from a modern aggressive superscalar core, hierarchy model, the cost of maintaining this illusion is significant. CMPs often offer better scaling characteristics in terms of aggregate This cost includes area to store cache directories, perform tag match, performance, complexity and power, but often require additional and implement cache miss recovery; impact of cache miss logic software investment to rewrite, retune or recompile programs to take on cycle time or complex speculative logic; and runtime costs like advantage of the new designs. The Cell Broadband Engine is a mod- coherence traffic which is increasing as the number of processors are ern example of a heterogeneous CMP with coprocessors (acceler- scaled up in multiprocessor systems. ators) which can be found in supercomputers (Roadrunner), blade To counter this trend and offer low latency storage with guaran- servers (IBM QS20/21), and video game consoles (SCEI PS3). A teed access time, an increasing number of designs are offering fast, Cell BE processor has a host Power RISC processor (PPE) and eight architected on-chip storage. Among these designs is the Cell Broad- Synergistic Processor Elements (SPE), each consisting of a Syner- band Engine [16, 11, 5, 12], which offers eight high-performance gistic Processor Unit (SPU) and Memory Flow Controller (MFC). RISC-based processor cores with dedicated architected low latency In this work, we explore the idea of offloading Automatic Dy- storage in the form of the SPE local store. namic Garbage Collection (GC) from the host processor onto accel- Architected local memories are most commonly used to store a erator processors using the coprocessor paradigm. Offloading part set of processor-local data, or contain working copies of large data or all of GC to a coprocessor offers potential performance benefits, sets. While local storage offers significant potential for processing because while the coprocessor is running GC, the host processor can with data-intensive applications [12, 20, 9], little exploration has continue running other independent, more general computations. been performed into the use of local memories for managed runtime- We implement BDW garbage collection on a Cell system and of- environments, and automatic memory management with garbage fload the mark phase to the SPE co-processor. We show mark phase collection, which are increasingly adopted to simplify program de- execution on the SPE accelerator to be competitive with execution velopment and maintenance. on a full fledged PPE processor. We also explore object-based and Noll et al. [18] explore the use of Cell SPE cores for executing block-based caching strategies for explicitly managed memory hier- Java applications. Alas, many system functions, such as type resolu- archies, and explore to effectiveness of several prefetching schemes tion, and garbage collection are performed on the central PPE which in the context of garbage collection. Finally, we implement Capitula- may threaten to become a bottleneck if significant speedups are ac- tive Loads using the DMA by extending software caches and quan- complished with SPE-based execution. tify its performance impact on the coprocessor. In this work, we explore the idea of a Garbage Collector Copro- cessor by running Java applications with the Boehm-Demers-Weiser Categories and Subject Descriptors C.3 [Computer Systems Or- (BDW) mark-sweep garbage collector on a live Cell Processor. In ganization]: SPECIAL-PURPOSE AND APPLICATION-BASED this work, we concentrate on optimizing GC performance for a pro- SYSTEMS; D.2.11 [Software]: SOFTWARE ENGINEERING— cessor with an explicitly managed local store by managing the local Software Architectures; D.3.3 [Software]: PROGRAMMING LAN- store using caching and prefetching techniques with the DMA copy GUAGES—Language Constructs and Features engine. In porting the garbage collector to the Cell BE, our emphasis was General Terms Algorithms, Performance the exploitation of local stores with copy-in/copy-out semantics. Ex- Keywords Cell, SPU, SPE, BDW, garbage collection, mark-sweep, ploiting local memories with explicit copy semantics and block copy coprocessor, accelerator, local store, explicitly managed memory facilities is applicable to a growing class of embedded processor de- hierarchies signs. To explore the techniques and performance attributes of the explicitly managed memory hierarchy, we focus on this attribute of the Cell BE architecture exclusively. The Cell BE architecture offers additional capabilities such as multicore thread level parallelism Permission to make digital or hard copies of all or part of this work for personal or and integrated SIMD capabilities which are not explored within the classroom use is granted without fee provided that copies are not made or distributed scope of this work. for profit or commercial advantage and that copies bear this notice and the full citation Results presented here were measured on an experimental Cell on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. BE blade with 2 Cell BE chips for a Cell BE system configuration VEE’08 March 5–7, 2008, Seattle, Washington, USA. with 2 PPEs and 16 SPEs executing Cell BE Linux 2.6.20-CBE. The Copyright c 2008 ACM 978-1-59593-796-4/08/03. $5.00 operating frequency of the system was 3.2GHz, and the system was 141 populated with 1GB of main memory. The host processor runs the looks like a heap reference) is treated as a reference and the object Java application and offloads its marking work to an SPU as needed. to which it refers is considered to be live. The upshot of ambigu- Choosing the Cell SPE to explore the use of garbage collection on ity is that ambiguously-referenced objects cannot move, since their a core with a local memory based memory hierarchy offered the ambiguous roots cannot be overwritten with the new address of the advantage of using the open Linux-based Cell ecosystem [10]. object; if the ambiguous value is not really a reference then it should Our contributions are as follows: 1) we identify the use of the not be modified. The BDW collector treats registers, static areas, and local store and necessary synchronizations for offloading the mark- thread activation stacks ambiguously. If object layout information is phase of the BDW collector to the coprocessor 2) we identify the available (from the application programmer or compiler), then the necessary steps for managing coherency of reference liveness infor- BDW collector can make use of it, but otherwise values contained in mation between the host processor and the SPU, 3) we quantify the objects are also treated ambiguously. effects of using an MFC-based software cache to improve GC per- The advantage of ambiguous roots collectors is their indepen- formance on the coprocessor giving a speedup of an average of 200% dence of the application programming language and compiler. The over accelerator baseline code without the software cache, 4) we BDW collector supports garbage collection for applications coded in demonstrate hybrid caching schemes adapted to the behavior of dif- C and C++, which preclude accurate garbage collection because they ferent data types and their reference behavior and demonstrate them are not type-safe. BDW is also often used with type-safe languages to deliver a speedup of an average of 400% over the baseline, 5) we whose compilers do not provide the precise information necessary quantify the effects of previously known GC Prefetching strategies to support accurate GC. The minimal requirement is that source pro- on the coprocessor that uses DMA for memory accesses 6) by ex- grams not hide references from GC, and that compilers not perform tending the software cache, we implement Capitulative Loads using transformations that hide references from GC. Thus, BDW is used in the MFC’s DMA controller and quantify its performance impact on more diverse settings than perhaps any other collector. As a result, the coprocessor. the BDW collector has been heavily tuned, both for basic perfor- This paper is structured as follows: we describe workload char- mance, and to minimize the negative impact of ambiguous roots [2]. acteristics of mark-and-sweep garbage collectors in section 2, and The basic structure of mark-and-sweep garbage collection is we describe the Cell SPE’s local store and memory flow controller- depth-first search of all reachable pointers on the heap. For this pur- based architecture in section 3. Section 4 gives an overview of the pose, an initial set of root pointers (from the application’s register challenges involved in porting a garbage collector to a processor us- file, application stack, and known roots in the data segment) are used ing a local memory as its primary operand storage, and we describe to find references into the application’s heap. This is accomplished the use of software caches in section 5. We compare GC marking by initializing a mark stack with these known roots. performance between a SPE and a PPE in section 6. We explore The mark phase removes heap addresses from the mark stack, the use of prefetching techniques for garbage collectors executing in and uses the reference in conjunction with information about the a local memory in section 7.

Using the Cell Synergistic Processor As a Garbage Collection Coprocessor

Data Storage the CPU-Memory

Cache-Aware Roofline Model: Upgrading the Loft

ARM-Architecture Simulator Archulator

Multi-Tier Caching Technology™

CUDA 11 and A100 - WHAT’S NEW? Markus Hrywniak, 23Rd June 2020 TOPICS for TODAY

A Characterization of Processor Performance in the VAX-1 L/780

Instruction Set Innovations for Convey's HC-1 Computer

Fundamental CUDA Optimization

Solid-State Drive Caching with Differentiated Storage Services

Cache Generally Uses This Form of Ram

Cacheout: Leaking Data on Intel Cpus Via Cache Evictions

CUDA Memory Architecture