Understanding Object-Level Memory Access Patterns Across the Spectrum

Xu Ji, Chao Wang, Nosayba El-Sayed∗
Tsinghua University; Oak Ridge National Laboratory; CSAIL, MIT; Qatar Computing Research Institute, HBKU

Xiaosong Ma, Youngjae Kim, Sudharshan S. Vazhkudai
Qatar Computing Research Institute, HBKU; Sogang University; Oak Ridge National Laboratory

Wei Xue, Daniel Sanchez
Tsinghua University; CSAIL, MIT

ABSTRACT

Memory accesses limit the performance and scalability of countless applications. Many design and optimization efforts will benefit from an in-depth understanding of memory access behavior, which is not offered by extant access tracing and profiling methods.

In this paper, we adopt a holistic memory access profiling approach to enable a better understanding of program-system memory interactions. We have developed a two-pass tool adopting fast online and slow offline profiling, with which we have profiled, at the variable/object level, a collection of 38 representative applications spanning major domains (HPC, personal computing, data analytics, AI, graph processing, and datacenter workloads), at varying problem sizes. We have performed detailed result analysis and code examination. Our findings provide new insights into application memory behavior, including insights on per-object access patterns, adoption of data structures, and memory-access changes at different problem sizes. We find that scientific computation applications exhibit distinct behaviors compared to datacenter workloads, motivating separate memory system design/optimizations.

KEYWORDS

Memory profiling, object access patterns, workload characterization, tracing, data types and structures

ACM Reference format:
Xu Ji, Chao Wang, Nosayba El-Sayed, Xiaosong Ma, Youngjae Kim, Sudharshan S. Vazhkudai, Wei Xue, and Daniel Sanchez. 2017. Understanding Object-level Memory Access Patterns Across the Spectrum. In Proceedings of SC17, Denver, CO, USA, November 12–17, 2017, 12 pages.
DOI: 10.1145/3126908.3126917

∗ Hosted partially by QCRI, HBKU through a CSAIL-QCRI joint postdoctoral program.

1 INTRODUCTION

The memory subsystem is crucial to computational efficiency and scalability. Today’s computer systems offer both high DRAM capacity/performance efficiency and a deep memory cache layer that mixes HBM, DRAM, and one or more NVMs. This trend has enabled in-memory processing in multiple application domains [24, 51, 63]. On the other hand, in recent years there has been a dramatic increase in both the number of cores and the application (VM) concurrency on a physical node, as seen in supercomputers, datacenters, and public/private clouds. As a result, the memory-bandwidth-to-FLOP ratio has been steadily declining, e.g., from 0.85 for the No. 1 supercomputer on Top500 [54] in 1997 to 0.01 for the upcoming projected Exaflop machines [43]. For both commercial and scientific computing, it remains important to optimize programs and systems for efficient memory access.

Such optimizations have to build upon an understanding of the memory access behavior of the program itself, which is highly challenging. Unlike I/O or network requests, which are commonly traced and analyzed, memory access requires temporal and spatial analysis, is a fixed high-speed transaction, and is expensive to analyze online. To this end, many optimization techniques adopt offline memory access profiling [16, 48, 59, 62] that collects information during separate profiling runs to guide decision making in future “production runs”.

Although much more affordable, existing offline memory access profiling techniques mostly collect high-level statistics (such as total access volume and memory references per thousand instructions) [18, 19, 21, 38] or full access sequences [27, 35] (including complete/sampled traces and derived information such as reuse distance distributions). These properties and data, while useful, are based on logical addresses and are disconnected from program semantics.
Also, the same application cannot be expected to have the same memory access pattern for different input problem sizes or input data. Moreover, HPC (scientific computing) applications in particular have received much less attention from existing memory allocation/access characterization studies.

In this work, we address these problems by providing more intuitive profiling results that bridge runtime low-level memory references and their high-level semantics (e.g., data structures that reflect programmers’ view of execution). This means that addresses need to be mapped back to “objects” allocated at runtime and further mapped to “variables” in the source code. To do this, we design and implement a two-pass profiling tool for variable-level access pattern analysis. Our tool performs the above address-to-object and object-to-variable mapping, which facilitates subsequent online memory trace processing to report object-level access behavior. This two-pass framework provides the option of collecting object allocation/access information at different levels of detail and overhead, with the first (fast) pass available for independent deployment.

Using this tool, we have profiled, at the variable/object level, a collection of 38 representative applications spanning major domains (personal computing, HPC, AI, data analytics, graph processing, and datacenter servers), each at three different problem sizes. For each application, we have further identified the major program variables, which are most dominant in memory consumption, and have collected detailed access behavior profiles such as object size and lifetime distribution, spatial density in accesses, read-write ratio, sequentiality, and temporal/spatial locality. For such major variables, we have performed considerable manual inspections to identify the type and purpose of the data structures in the source code, investigating which major data structures are adopted and how they are accessed.

We have performed detailed profiling result analysis, especially focusing on comparing the behavior of scientific vs. commercial/personal computing applications. To the best of our knowledge, our study is the first to perform such a thorough, comparative analysis between these application classes. We organize our findings into 7 key observations, and discuss their practical implications, illustrating how they shed light on the complex heap memory allocation and access activities of modern workloads. Our results reveal that scientific applications possess multiple qualities enabling efficient combination of offline and online profiling (such as fewer, larger, and more long-lived major data structures), and more uniform scaling behavior across variables when computing problems of different sizes. However, they still have dynamic and complex memory access patterns that cannot be properly measured by sampling billion-instruction or shorter episodes even during their stable phases.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SC17, Denver, CO, USA
© 2017 ACM. 978-1-4503-5114-0/17/11...$15.00
DOI: 10.1145/3126908.3126917

2 PROFILING METHODOLOGY

2.1 Objects and Variables

We first define the building blocks that form the basis of our profiling framework, namely objects and variables. Programs statically or dynamically allocate space for data from memory. Conventionally, such a piece of allocated memory is called an “object”, instantiating a “variable” for reference in the program. Due [...] then define the entity grouping all the objects allocated within the same call-stack as a variable. Here, the call-stack encompasses a series of function names and return addresses, from the main function all the way to the final heap memory allocation call. Therefore, two malloc() calls within the same function are considered to allocate different variables.

    void eo_fermion_force(double eps, int nflavors, field_offset x_off)
    {
        ...
        /* Allocate temporary vectors */
        for(mu = 0; mu < 8; mu++)
            tempvec[mu] = (su3_vector *)
                calloc(sites_on_node, sizeof(su3_vector));
        /* Copy x_off to a temporary vector */
        temp_x = (su3_vector *)
            calloc(sites_on_node, sizeof(su3_vector));
        ...
    }

Figure 1: Sample code from SPEC milc

To illustrate these definitions, Figure 1 shows sample code from SPEC milc [53]. Here, eight different objects (tempvec[0], tempvec[1], ..., tempvec[7]) are created by the first calloc() in the for loop (each through a separate call to calloc()). However, since they share exactly the same call-stack, these objects are considered members of the same variable. The object temp_x, on the other hand, is created through another call-stack, and hence belongs to another variable. For static/global variables residing in the data segment, such object-variable mapping is easy and one-to-one.

2.2 Two-Pass Variable/Object-Level Profiling

We now present the design of our variable/object memory behavior profiling framework, shown in Figure 2. The main idea is to perform two-pass profiling to avoid the prohibitive time and space overheads of capturing individual references to all objects.

Figure 2: Two-pass profiling workflow
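The object and variable definitions of Section 2.1, together with the address-to-object and object-to-variable mapping mentioned in the introduction, can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the authors' tool: the names object_t, assign_variables, find_object, milc_example, MAX_DEPTH, and MAX_VARS are hypothetical, all addresses are made up, and the call-stack is simplified to a list of return addresses (the paper's definition also includes function names).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEPTH 16   /* hypothetical cap on recorded call-stack depth */
#define MAX_VARS  64   /* hypothetical cap on distinct variables */

/* One heap object: a single allocation observed at runtime. */
typedef struct {
    uintptr_t base;               /* first byte of the allocation */
    size_t    size;               /* allocation size in bytes */
    uintptr_t stack[MAX_DEPTH];   /* return addresses, main() first */
    int       depth;              /* number of valid entries in stack[] */
    int       var_id;             /* filled in by assign_variables() */
} object_t;

/* Two objects belong to the same variable iff their call-stacks match. */
static int same_stack(const object_t *a, const object_t *b)
{
    return a->depth == b->depth &&
           memcmp(a->stack, b->stack,
                  (size_t)a->depth * sizeof a->stack[0]) == 0;
}

/* Object-to-variable mapping: each object receives the id of the first
 * object seen with an identical call-stack. Returns the variable count. */
int assign_variables(object_t *objs, int n)
{
    int reps[MAX_VARS];   /* index of one representative object per variable */
    int nvars = 0;
    for (int i = 0; i < n; i++) {
        int v = 0;
        while (v < nvars && !same_stack(&objs[i], &objs[reps[v]]))
            v++;
        if (v == nvars) {             /* unseen call-stack: new variable */
            assert(nvars < MAX_VARS);
            reps[nvars++] = i;
        }
        objs[i].var_id = v;
    }
    return nvars;
}

/* Address-to-object mapping: binary search over objects that are sorted
 * by base address and do not overlap. Returns the index, or -1 if the
 * address falls outside every object. */
int find_object(const object_t *objs, int n, uintptr_t addr)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (addr < objs[mid].base)
            hi = mid - 1;
        else if (addr - objs[mid].base >= objs[mid].size)
            lo = mid + 1;
        else
            return mid;
    }
    return -1;
}

/* Rebuild the Figure 1 scenario with made-up addresses: eight tempvec[mu]
 * objects from one calloc() call site, one temp_x object from another. */
int milc_example(object_t objs[9])
{
    for (int i = 0; i < 9; i++) {
        objs[i].base  = 0x10000u + (uintptr_t)i * 0x1000u;
        objs[i].size  = 0x800;
        objs[i].depth = 3;
        objs[i].stack[0] = 0x400100;  /* return address in main() */
        objs[i].stack[1] = 0x400200;  /* return address in the caller */
        objs[i].stack[2] = (i < 8) ? 0x400310 : 0x400350; /* calloc() site */
        objs[i].var_id = -1;
    }
    return assign_variables(objs, 9);
}
```

Calling milc_example() on a nine-element array groups the eight tempvec objects under one variable id and temp_x under another, matching the discussion of Figure 1. A memory reference from a trace can then be attributed by calling find_object() on its address and reading the var_id of the returned object. The linear scan in assign_variables() is chosen for brevity; a hash of the call-stack would be a natural replacement when many variables exist. How a real profiler captures the call-stack at allocation time is not specified in this excerpt; on glibc systems, backtrace() from execinfo.h is one available option.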
