Performance Evaluation and Feasibility Study of Near-Data Processing on DRAM Modules (DIMM-NDP) for Scientific Applications Matthias Gries, Pau Cabré, Julio Gago
Total Page:16
File Type:pdf, Size:1020Kb
Performance Evaluation and Feasibility Study of Near-data Processing on DRAM Modules (DIMM-NDP) for Scientific Applications Matthias Gries, Pau Cabré, Julio Gago To cite this version: Matthias Gries, Pau Cabré, Julio Gago. Performance Evaluation and Feasibility Study of Near-data Processing on DRAM Modules (DIMM-NDP) for Scientific Applications. [Technical Report] Huawei Technologies Duesseldorf GmbH, Munich Research Center (MRC). 2019. hal-02100477 HAL Id: hal-02100477 https://hal.archives-ouvertes.fr/hal-02100477 Submitted on 15 Apr 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. TECHNICAL REPORT MRC-2019-04-15-R1, HUAWEI TECHNOLOGIES, MUNICH RESEARCH CENTER, GERMANY, APRIL 2019 1 Performance Evaluation and Feasibility Study of Near-data Processing on DRAM Modules (DIMM-NDP) for Scientific Applications Matthias Gries , Pau Cabre,´ Julio Gago Abstract—As the performance of DRAM devices falls more and more behind computing capabilities, the limitations of the memory and power walls are imminent. We propose a practical Near-Data Processing (NDP) architecture DIMM-NDP for mitigating the effects of the memory wall in the nearer-term targeting server applications for scientific computing. DIMM-NDP exploits existing but unused DRAM bandwidth on memory modules (DIMMs) and takes advantage of a subset of the forthcoming JEDEC NVDIMM-P protocol in order to integrate application-specific, programmable functionality near memory. DIMM-NDP works on shared memory with the host CPU by definition, takes advantage of abundant memory capacity in the main memory subsystem and remains economic by relying on standard, unmodified DRAM devices. We evaluate DIMM-NDP with a range of bandwidth, latency and compute-bound workloads from the domains of predictive data analytics and machine learning that depend on dense and sparse linear algebra. Simulation results show up to 6.3x better performance for bandwidth-limited applications, representing 79% of the theoretical peak, and up to 3x improved energy efficiency. We complement the evaluation with feasibility checks for DIMM-like form factors to offer 32GB to 128GB capacity per DIMM, hardware overhead costs (below 20%), and power envelopes for standard (13W) and custom DIMMs (40W). A sensitivity analysis of interface properties for comparison with traditional accelerator coupling over PCIe, as well as a case study on porting software kernels, showing in the order of one month programming effort per application, outline reasonable operating points for DIMM-NDP. Index Terms—Processing In-Memory, near-memory accelerator, NUMA, software porting, performance engineering, case study. ✦ CONTENTS 5 Hardware Setups for DIMM-NDP 5 5.1 Power & Area Analysis of Blocks . 6 1 Introduction 2 5.2 Chip Packages and Form Factors . 6 1.1 LimitedAdaptationofNDPsofar . 2 5.3 ResultingNDP-enabledMemModules . 6 1.2 MotivationforDIMM-NDP . 2 5.4 Cost Analysis for Memory Module . 7 2 Related Work 2 6 Evaluation Setup and Flow 7 2.1 Synchronization between Host and NDP 3 6.1 SelectedWorkloads . 7 2.2 Integration of NDP Technology . 3 6.2 SystemConfiguration. 8 2.3 Variants of NDP Functionality . 3 6.3 EvaluationFlow.............. 9 2.4 FormFactors................ 3 2.5 PositioningDIMM-NDP . 3 7 Performance Evaluation of DIMM-NDP 10 7.1 Performance and Efficiency Results . 10 3 DIMM-NDP Architecture 3 7.2 Near-datavs.LooselyCoupledAcc . 11 3.1 ArchitectureOverview . 3 7.3 NVDIMM-P Asynchronous Accesses . 13 3.2 NDPUnitSetup.............. 4 7.4 Software Porting Effort . 14 3.3 LateralTransfersbetweenUnits . 4 7.5 Discussion of Further NDP Variants . 14 3.4 ECCHandling............... 4 8 Concluding Remarks 15 4 Software View of DIMM-NDP 4 4.1 Workload Partitioning and Placement . 4 References 15 4.2 Protection/Virtual Address Spaces . 5 4.3 Synchronization. ... ... ... .... 5 4.4 SoftwareImplementationFlow . 5 • Matthias Gries is with Huawei Technologies D¨usseldorf GmbH at the All product or company names that may be mentioned in this Munich Research Center (MRC), Riesstr. 25, 80992 M¨unchen, Germany report are tradenames, trademarks or registered trademarks of their E-mail: [email protected] • Pau Cabr´eand Julio Gago are with Metempsy, Pamplona 88, Barcelona respective owners. 08018, Spain E-mail: {pau.cabre|julio.gago}@gmail.com Research Report released on April 15th, 2019 TECHNICAL REPORT MRC-2019-04-15-R1, HUAWEI TECHNOLOGIES, MUNICH RESEARCH CENTER, GERMANY, APRIL 2019 2 1 INTRODUCTION limited capacity, may depend on 2.5D integration with The implications of the memory and power walls on mem- additional interposers and are subject to further thermal ory subsystems are imminent [1], [2]. Operations have to stress and testing challenges. • move closer to data in the memory hierarchy in order Lacking standards in the past for interface protocols to reduce redundant data movement as one of the major and programming, such that NDP has been perceived sources of wastage [3], [4], [5]. As a result of decades of as an application-specific and proprietary solution. scaling DRAM devices for memory capacity, the memory subsystem has been adapted to keep pace in terms of 1.2 Motivation for DIMM-NDP bandwidth, as shown in Fig. 1 for the stream [6] benchmark. The main features of DIMM-NDP are: “Brute force” effects, such as adding more memory chan- • Building on standard IP: NDP units enhance the Media nels, are nearly exploited and challenged by signal integrity, Controller (MedC) on a memory module. The MedC is a pin limitations and process technology scaling, as the design discrete buffer chip positioned side-by-side the DRAM of analog frontends for IO interfaces becomes more com- devices on the module and needed for forthcoming plex [7]. Recent approaches to mitigate the memory wall interface standards like JEDEC NVDIMM-P [14], [15], Gen-Z [16] and CCIX [17]. 1000000 • DIMM-NDP employs unmodified DRAM chips and Dual socket, 3 channels exploits unused rank-level bandwidth on DIMM, such integrated MC, DDR3-1333 100000 that we follow the economy of scale of manufacturing Dual channel DDR2-400 standard DRAM. 10000 Dual socket, 4 channels • Dual channel integrated MC, DDR4-2133 Memory capacity is not wasted. We focus on NDP in the RDRAM shared memory subsystem of the host, where both sides 1000 (host and NDP) have concurrent access to memory. ooo core The memory module appears as normal Load-Reduced 100 DIMM if NDP is switched off. 64b SDRAM, single channel, in-order core • streamcopy bandwidth[MB/s] Versatility: We use standard cores with special func- 10 tional units as programmable NDP units that different 1990 1995 2000 2005 2010 2015 2020 application domains can take advantage of, as expected year for a server setup. • We take advantage of recent standard protocols to Fig. 1. Reported Stream [6] bandwidth on system-level, annotated with ease the implementation, namely forthcoming JEDEC architecture modifications to sustain memory performance. NVDIMM-P [14], [15] that introduces asynchronous transfers as a subset, as well as NUMA and math follow either in-memory (on the same technology process, libraries like BLAS [18] for programming. such as [8] and references therein) or near-memory (e.g., 3D stacked, as in [9], [10], [11], [12]) ideas. These represent By relying on standard DRAM, modular DIMMs and com- either longer-term research, since the economic manufac- mon programming abstractions, we see DIMM-NDP as turing at high density still has to be shown, or relatively an enabler for gradually introducing NDP into general- costly solutions with stacked DRAM, like High-Bandwidth purpose servers for use with many applications that benefit Memory (HBM) and the Hybrid Memory Cube (HMC), that from CPU-centric programming and high memory capacity. are limited by memory capacity and more costly to integrate We lower the bar for adapting Near-Data Processing (NDP) than standard DRAM devices. in the main memory subsystem, since DIMM-NDP does not However, the effects of dark silicon [13] lead designers waste DRAM capacity. Application programmers can thus to more heterogeneous architectures. Programmers become gradually migrate memory-bound code to DIMM-NDP by more experienced with using accelerators and non-uniform working on shared memory. memory in return. We can rely on this trend to introduce an The contributions of this paper are a) the proposal of acceptable solution by building on the concept of NUMA the DIMM-NDP architecture and host-CPU centric pro- (non-uniform memory accesses) to associate memory loca- gramming view amenable to scientific computing b) the tions with near-data accelerators. performance evaluation of DIMM-NDP, c) a case study of the usability looking at the software porting effort, and d) revealing design tradeoffs in terms of form factor, perfor- 1.1 Limited Adaptation of Near-Data Processing so far mance, power and costs with feasibility checks. The lack of adapting Near-Data Processing (NDP) in- or After surveying related work in the next section, sec- near-memory may be attributed to the following challenges: tions 3 and 4 introduce the architecture and software stack for DIMM-NDP. Section 5 shows feasibility checks for form • The integration of processing has been tried on the factors, power and overhead costs. We describe the eval- same technology process with the memory, either Non- uation flow, performance results and the software porting Volatile Memory (NVM) or DRAM. In case of DRAM, efforts in sections 6 and 7, followed by concluding remarks. this leads to either not scalable and slow, or not very complex designs, whereas the area of NVM integration and manufacturing is an open field of research.