Architectural Exploration of Heterogeneous Memory Systems Marcos Horro, Gabriel Rodr´Iguez, Juan Tourino,˜ Mahmut T
Total Page:16
File Type:pdf, Size:1020Kb
1 Architectural exploration of heterogeneous memory systems Marcos Horro, Gabriel Rodr´ıguez, Juan Tourino,˜ Mahmut T. Kandemir Abstract—Heterogeneous systems appear as a viable design whether an access presents reuse that is easily exploitable alternative for the dark silicon era. In this paradigm, a processor by a regular cache, or whether the LRU algorithm and cache chip includes several different technological alternatives for im- conflicts make it advisable to manually allocate the variable plementing a certain logical block (e.g., core, on-chip memories) which cannot be used at the same time due to power constraints. to a scratchpad to better take advantage of the available The programmer and compiler are then responsible for select- locality. Taking into account the hardware characteristics of ing which of the alternatives should be used for maximizing the different available memory modules, access and transfer performance and/or energy efficiency for a given application. costs for each possible allocation are calculated, and a final This paper presents an initial approach for the exploration of allocation is decided considering also memory sizes and power different technological alternatives for the implementation of on-chip memories. It hinges on a linear programming-based budgets. model for theoretically comparing the performance offered by the This paper is organized as follows. Section II introduces available alternatives, namely SRAM and STT-RAM scratchpads scratchpad memories. Section III analyzes the gem5 frame- or caches. Experimental results using a cycle-accurate simulation work and discusses the simulation of scratchpad memories tool confirm that this is a viable model for implementation into in the system. The mathematical model for data allocation is production compilers. presented in Section IV. Experimental results using SRAM Index Terms—scratchpad memories, cache memories, hetero- caches and STT-RAM scratchpad memories demonstrate the geneous architectures, dark silicon, power wall, memory wall, feasibility of the proposed approach, as shown in Section V. gem5. Finally, Section VI concludes the paper. I. INTRODUCTION II. SCRATCHPAD MEMORIES HE end of Dennard scaling has brought processor per- T formance to a near standstill in recent years, pushing The importance and complexity of the on-chip memory architects towards designing increasingly parallel computers. hierarchy has increased with the advances in processor perfor- Even with this paradigm shift, the power wall is expected to mance [2], in an attempt to bridge the gap between memory soon limit multicore scaling. A dark silicon future [1] has and processor speeds. Nowadays it is possible to find several been proposed, in which chips will incorporate a high number different types of memories integrated in a single hierar- of very specialized hardware such as vector execution units, chy, e.g., private vs. shared ones, or exclusive vs. inclusive. tunable memory hierarchies and even different heterogeneous Scratchpad memories (SPMs) are one type of fast, random- core technologies. In this paradigm, users, compilers and access memories which are sometimes used as an alternative runtime cooperate to choose which parts of the core to use, to cache memories. The main feature of scratchpads is its pro- under a certain power budget, for executing a given application grammability, i.e., the possibility of handling the data allocated under given QoS restrictions. using special-purpose instructions placed by the compiler or programmer. In contrast, the contents of cache memories arXiv:1810.12573v1 [cs.AR] 30 Oct 2018 This paper focuses on the analysis and simulation of het- erogeneous memory hierarchies using the gem5 simulator, a are controlled automatically by the hardware, for instance widely used tool developed by key players in the industry such using the LRU algorithm. The use of scratchpad memories is as ARM, Intel, or Google. Its biggest potential is the ability widely extended in embedded systems, e.g. real time systems. to create new components or modify existing ones in order SPM guarantees a fixed access latency whereas an access to perform architectural exploration. Extensions to simulate to the cache may result in a miss thereby incurring longer scratchpad memories in this framework are proposed, and used latency due to off-chip access [3] [4]. This unpredictability of to assess the power-performance trade-off of different memory caches is undesirable to meet hard timing constraints. Besides, configurations and technologies. Profiling data is fed to an scratchpad memories, having no tag array, are potentially more operational research framework that decides which memory efficient energy- and performance-wise. Nevertheless, cache units to enable from an available pool. This mathematical memories are largely used in general purpose processors, framework takes variable placement decisions by checking mainly because efficiently employing a scratchpad memory implies recompiling the code for different SPM features (e.g., M. Horro, G. Rodr´ıguez and J. Tourino˜ are with the size), while the LRU algorithm employed by caches is capable Department of Electronics and Systems, Universidade da Coruna,˜ of automatically exploiting locality in a reasonable way. The fmarcos.horro,grodriguez,[email protected]. M. T. Kandemir is with the Department of Computer Science and Engi- use of one or the other flavor of on-chip memory becomes then neering, Pennsylvania State University, [email protected]. a performance/effort choice. To exploit this trade-off, several 2 specific purpose architectures have implemented scratchpad TABLE I memories: DIFFERENCES BETWEEN SIMULATORS ACCORDING TO THE FOLLOWING CRITERIA: WHETHER THE PLATFORM IS OR NOT IN CURRENT • Cell IBM [5]: this architecture was used in the nodes DEVELOPMENT,INSTRUCTION SET ARCHITECTURE (ISA) SUPPORTED of the MareNostrum supercomputer at the Barcelona AND ACCURACY.DATA EXTRACTED FROM [12] [13] Supercomputing Center [6] and in the main core of PlayStation 3, as well as in premium Toshiba TVs [7], Sim. Dev. ISA(s) Accuracy amongst others. Simics Yes Alpha, ARM, M68k, Functional • PlayStation 2 [8]: this console included small scratchpad MIPS, PowerPC, memories managed by the CPU and GPU. SPARC, x86 • Digital Signal Processor (DSP) [9]: SODA’s architecture includes a private scratchpad memory for each processing SimFlex No SPARC, x86 (requires Cycle unit and a global scratchpad memory that can be accessed Simics) for any unit. GEMS No SPARC (requires Sim- Timing • Knights Landing’s architecture Intel Xeon Phi [10]: this ics) architecture uses MCDRAM memories. The processor has access to an addressable memory with high band- m5 No Alpha, MIPS, SPARC Cycle width, and therefore the concept is similar to that of MARSS No x86 (requires QEMU) Cycle scratchpad memories. However it is also remarkable that, in contrast with the previous examples, Intel Xeon Phi is OVPsim Yes ARM, MIPS, x86 Functional a manycore architecture. PTLsim No x86 (requires Xen and Cycle There has also been at least an implementation of scratch- KVM/QEMU) pad memories in a general purpose architecture, as Cyrix Simple No Alpha, ARM, Pow- Cycle 6X86MX [11] already implemented a “scratchpad RAM”. Scalar erPC, x86 This 32-bit x86-compatible microprocessor, released in 1996, implemented a large Primary Cache of 64kB. This cache could gem5 Yes Alpha, ARM, MIPS, Cycle be turned into a scratchpad RAM memory. The cache area set PowerPC, SPARC, aside as scratchpad memory acted as a private memory for the x86 CPU and did not participate in cache operations. consumption [14] [15], the main reason being the need to III. INTEGRATION OF SCRATCHPAD MEMORIES IN GEM5 activate both the tag and data regions to obtain a certain Simulators are crucial tools in architectural development and line, while a random-access memory only needs a single array exploration, since they allow reasonably accurate estimations access. Besides, cache misses also imply penalties in energy with almost negligible cost in comparison with lithographic or consumption. FPGA solutions. Thus, there have appeared many approaches Moreover, the concept of scratchpad memories does not dif- with different granularity focused on concrete subsystems fer from conventional memories, i.e., RAM memory, as their and other with holistic nature. Table I compares different contents can be allocated and programmed by the application simulators with both private licenses, as Simics, and open code. Thus, in order to implement scratchpad memories (SPM) source licenses, which are the rest. in gem5, the starting point has been the memory modules al- gem5 is a framework for performing cycle-accurate com- ready implemented (SimpleMemory class). We have adapted puter architecture simulations. The main reasons for choosing this implementation in order to add new parameters, such as gem5 over the rest in this work have been: (i) its current latencies. development and implication of different organizations such as Google, Intel or ARM, as well as its large community of users; (ii) its open source license; and (iii) the possibility of simulating different ISAs. In addition, the modularity of the platform allows an easy modification of the system and the integration of new components. It provides two modes of operation: full system and system call emulation mode. The latter is interesting for executing benchmarks without loading an OS image. Regarding memory hierarchy simulation, there are two modes: classic and Ruby. An advantage