
Case Study on Integrated Architecture for In-Memory and In-Storage Computing

Manho Kim 1, Sung-Ho Kim 1, Hyuk-Jae Lee 1 and Chae-Eun Rhee 2,*

1 Department of Electrical Engineering, Seoul National University, Seoul 08826, Korea; [email protected] (M.K.); [email protected] (S.-H.K.); [email protected] (H.-J.L.)
2 Department of Information and Communication Engineering, Inha University, Incheon 22212, Korea
* Correspondence: [email protected]; Tel.: +82-32-860-7429

Abstract: Since the advent of computers, computing performance has been steadily increasing. Moreover, recent technologies are mostly based on massive data, and the development of artificial intelligence is accelerating it. Accordingly, various studies are being conducted to increase the performance of computing and data access, while reducing energy consumption. In-memory computing (IMC) and in-storage computing (ISC) are currently the most actively studied architectures to deal with the challenges of recent technologies. Since IMC performs operations in memory, there is a chance to overcome the memory bandwidth limit. ISC can reduce energy by using a low power processor inside storage without an expensive IO interface. To integrate the host CPU, IMC and ISC harmoniously, appropriate workload allocation that reflects the characteristics of the target application is required. In this paper, the energy and processing speed are evaluated according to the workload allocation and system conditions. A proof-of-concept prototyping system is implemented for the integrated architecture. The simulation results show that IMC improves the performance by 4.4 times and reduces total energy by 4.6 times over the baseline host CPU. ISC is confirmed to significantly contribute to energy reduction.

Keywords: near data processing; processing in memory; in-storage computing

1. Introduction

Recently, there has been growing interest and demand for artificial intelligence and immersive media. These applications require high performance devices to process massive data. However, it is difficult to compute and transfer such large amounts of data in the traditional von Neumann architecture because it consumes significant power and dissipates heat energy as Moore's Law and Dennard scaling reach physical limits. It is known that data access and movement are the main bottleneck of processing speed, and they consume more energy than computing itself [1]. Thus, many studies about near data processing are ongoing to overcome these challenges [2,3].

Near data processing is applicable not only in traditional memory, such as SRAM [4,5] and DRAM [2,6–10], but also in emerging memory, such as PCM [11], STT-MRAM [12], and ReRAM [13]. There are also various attempts to reduce the data movement overhead by computation offloading to storage devices [3,14–16]. However, real-world applications are composed of complex and diverse tasks, and thus, there is a limit to solving the problems with an individual device-level approach. In this paper, the requirements of tasks for real application scenarios are analyzed in terms of task partition and the data movement overhead. By providing various options for near data processing in consideration of the characteristics of each task that composes an application, opportunities to reduce the data movement overhead are explored at the architectural level.


This paper makes the following contributions:
• The requirements of the task are analyzed from the memory and storage points of view.
• An integrated architecture for in-memory and in-storage computing is proposed, and its implementation is verified on a prototyping system.
• Possible performance improvement is shown in terms of execution time and energy.

2. Background and Related Work

2.1. In-Memory Computing

In-memory computing (IMC), also known as processing in-memory (PIM), is a paradigm that places processing elements near or inside the memory in order to obtain fast memory access and high bandwidth. Among the various types of memory, this paper focuses on in-memory computing for DRAM, which is most widely used and has high density. There are two categories of in-DRAM computing depending on the package and module assembly of the DRAM chip: computation in two-dimensional (2D) planar DRAM, and computation in three-dimensional (3D) stacked DRAM. In 2D planar DRAM, processing elements reside in the DRAM die as shown in Figure 1a. Since data travel through on-chip interconnection on the DRAM die, it is free from off-chip IO constraints, and also it can achieve high parallelism by exploiting the bit-line of the subarray, row buffer, and internal data line [2,6]. However, there are some difficulties in computing in 2D planar DRAM. Because the DRAM manufacturing process is optimized for low cost and low leakage, only simple logic can be placed, due to the bulky and slow transistors. The standard logic process is optimized to deal with complex logic functions, but requires frequent refresh operations, due to the current leaky characteristic. This leads to high power consumption and poor performance.

In order to overcome these drawbacks, there have been attempts to add separate chips for processing elements over the 2D planar DRAM chip in the dual-inline memory module (DIMM) [7,8]. However, an additional assembly process is required, and cost is consequently increased. On the contrary, 3D stacked DRAM, such as high bandwidth memory (HBM) and hybrid memory cube (HMC), can use the DRAM process and the standard logic process together because they pile up more than two dice. Therefore, more complex processing elements can be located on the base die with the standard logic process as shown in Figure 1b [9,10]. Even though the physical interface (PHY), through-silicon via (TSV), and IEEE1500 area for test pins and circuits occupy the base die, there is still extra space that can be used for other purposes.


Figure 1. In-memory computing structure examples: (a) 2D planar DRAM, such as DIMM type, and (b) 3D stacked DRAM, such as HBM type.

2.2. In-Storage Computing

For in-storage computing (ISC), some computations are offloaded to the storage device. There was a study in which a hard disk drive (HDD) used a processor to compute offloaded tasks for ISC [14]. However, HDDs were quickly replaced by solid-state drives (SSDs) after SSDs were introduced. Conventional SSDs do not process data, even though they have an embedded processor. The internal structure of the SSD is shown in Figure 2. The SSD consists of non-volatile NAND flash chips that do not lose data regardless of power-off, and an embedded processor, such as an ARM processor, to manage the firmware, such as the flash translation layer (FTL). Even if the conventional SSD is not designed for in-storage computing, it is capable of computation because it already has an embedded processor. The main concept of SSD-based ISC is that the SSD is regarded as a small computer with an embedded processor to run programs within the storage device. ISC increases the system performance because the storage device can handle tasks in parallel with the host CPU. It also decreases data movements if the data are selectively transferred to the host CPU after processing within the storage device. This lowers the IO latency and improves the energy efficiency.

Figure 2. Block diagram of SSD for in-storage computing.

If tasks such as query processing [17], list intersection [18], scan and join [19], Hadoop MapReduce [20], and search engines [18] are offloaded to the embedded processor in the SSD, the processing speed and energy efficiency improve. Beyond employing the embedded processor, other studies also introduced additional implementations of an FPGA [21] or a GPU [22] in the SSD. However, these approaches have significant disadvantages, such as modification of the SSD structure, resulting in additional costs.

2.3. Integrated Solution of In-Memory and In-Storage Computing

The integration of in-memory and in-storage computing is an architecture where IMC and ISC are applied simultaneously in order to improve the processing speed and energy efficiency of applications. However, the structure with ReRAM-based IMC inside the SSD performing ISC [23] cannot avoid IO bottlenecks because the data have to be delivered into the SSD for offloaded task operations. Another hybrid architecture, which consists of the memory channel network (MCN) DIMM [24] for IMC and a Smart SSD for ISC, needs more space in the buffer die for the MCN processor. The manufacturing cost of this architecture increases, as explained in Section 2.1, because it also requires an additional chip implementation in the 2D planar DRAM. Other hybrid architectures, such as the near-memory and in-storage FPGA acceleration architecture [25], require 3D stacked memory in the SSD to implement FPGAs rather than the conventional 2D planar DRAMs. Thus, additional costs are required, as in the example of the SSD with a GPU in Section 2.2.

3. A Case Study on Integrated Architecture for In-Memory and In-Storage Computing

3.1. Prototype Platform

Figure 3 shows the prototype platform used to test the proposed hybrid architecture. The host CPU is on the host PC, and it is connected with the FPGA that acts as the IMC via a USB interface. The storage server for modeling ISC is connected through a network. The purpose of this platform is to evaluate the performance of an application in a hybrid architecture quickly, focusing on data movement. Because there is no specification for a commercial integrated solution of in-memory and in-storage computing, the prototype was implemented using configurable commodity devices in order to deal with various requirements. To support the hybrid architecture effectively, an FPGA is adopted for the IMC implementation to easily modify the logic circuits of the base die. The Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit is used, where the DDR memory corresponds to the DRAM die and the programmable area corresponds to the logic die. The accelerator of IMC is developed with Xilinx Vivado, whereas the firmware is programmed with the Xilinx Vivado SDK. To model the interposer connected between the host CPU and memory, a high-speed USB is used for the host PC and FPGA connection. In terms of the ISC operation, it can be prototyped using a low-performance computer with a large storage space. To mimic the IO constraints of SSDs, the host PC and ISC are connected through the network, and the storage server is mounted with the SSH file system (SSHFS) to access large data files. Commands and data are transferred to the storage server with socket programming.
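As an illustration of this link, the following is a minimal sketch of how the host PC might issue an offload command to the storage-server model over a socket. The port number, the length-prefixed JSON message format, and the task name are illustrative assumptions; the paper only states that commands and data are exchanged with the storage server via socket programming.

```python
# Minimal sketch of the host-side socket client used to model the ISC link.
# The address, port, and command format are hypothetical.
import json
import socket

STORAGE_SERVER = ("storage-server.local", 9000)  # hypothetical address and port

def offload_to_isc(task, args):
    """Send an offload command to the storage-server model and wait for the result."""
    with socket.create_connection(STORAGE_SERVER) as sock:
        request = json.dumps({"task": task, "args": args}).encode()
        sock.sendall(len(request).to_bytes(4, "big") + request)   # length-prefixed message
        resp_len = int.from_bytes(sock.recv(4), "big")
        response = b""
        while len(response) < resp_len:
            response += sock.recv(resp_len - len(response))
        return json.loads(response)

# Example: ask the ISC model to crop all members from one group image (off-line phase).
# result = offload_to_isc("crop_members", {"group_image": "/mnt/sshfs/groups/0001.jpg"})
```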

Figure 3. Integrated solution of in-memory and in-storage computing architecture proof of concept.

3.2. Workload Analysis from the Perspective of Data Accesses

In this paper, a focus fancam is used as a case study application. In this scenario, the series of tasks is divided into two phases: off-line and on-line. The first off-line phase prepares data prior to the user request. Here, data are managed in the background, and there are no urgent deadlines. Therefore, it does not require high-performance computing, but the amount of data handled in file format is very large. On the other hand, in the second on-line phase, the user request is processed in real time. Therefore, it needs powerful computing capability and a high memory bandwidth. Focus fancams are re-edited videos that focus on a favorite celebrity. The original video is reproduced in hundreds of different versions, securing a wide user base. Using auto highlight technology, artificial intelligence (AI)-based algorithms generate videos or images focused on specific members in an idol group. As a case study of this paper, labeled-image-based focus picture editing is used. Input data are images of a K-pop group, labels, and position information, whereas output data are cropped images of each member in the group. The group image is a JPEG file in which all members appear together. The label is text data, which is the unique name of each member, and it is used as a key to the index table. The position is represented as a coordinate that indicates where a particular member is in the group image. The member image is a cropped picture, according to the member position information from the group image. Users can view their favorite member images originating from group images, and sometimes edit and share them with other users. Among the tasks of the focus fancam application, a super resolution (SR) technique plays an important role. It enhances the resolution of small cropped images by means of a deep learning algorithm.

Figure 4 shows a focus fancam application scenario. White and gray boxes indicate tasks and data, respectively. Numbers with arrows indicate the data flow of the off-line phase. At first, it reads the group images (1) and obtains labels and position information (2) of each member in the group images. The JPEG files are decoded and then regions of interest, such as the face, are cropped. The cropped images are encoded back to JPEG. For database (DB) indexing, it updates data (4) in the index table with each member's name as a key and the file path of the cropped image as a value. Data movement of group images takes place in proportion to the number of files stored. Since the number of labels, positions, and member image files increases in proportion to the number of people in the group images, there is a lot of data movement. The cropping and update tasks are performed in the background whenever new group images come into storage. They do not have a strong latency constraint.

Letters with arrows in Figure 4 indicate the data flow of the on-line phase. It reads a label (a), which is a member's name. It retrieves the index table with the label to obtain the corresponding file path (b). It reads the member's image (c) from the found file path. Here, the image is divided into Y and UV color components. Upscaling is executed with different algorithms according to the color components to reduce the overall execution time. For the visually sensitive Y component, convolutional neural network (CNN)-based SR is performed [26] (d), which gives good performance but is computationally expensive. For the UV components, the nearest neighbor interpolation algorithm is applied (e). Since there is no dependence between the Y and UV color channels, it is possible to execute those algorithms in parallel. Then, it brings the data, which are the upscaled Y (f) and upscaled UV (g). It rearranges the separately upscaled Y and UV together, converts the color space from YUV into RGB, and finally displays the favorite member's upscaled image. Data movement in the on-line phase is less than that in the off-line phase because it only needs to handle the user's request. However, the YUV conversion, Y-SR, and UV-interpolation tasks strongly require low-latency computations because they immediately produce the results upon user request.

Figure 4. Focus fancam application scenarios.
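The on-line data flow of Figure 4 can be summarized in the following sketch, in which the Y and UV paths run in parallel after the index table lookup. The helper names and the use of OpenCV are assumptions made for illustration; in particular, run_cnn_sr_on_imc stands in for the call that offloads CNN-based SR to the IMC accelerator and is stubbed with bicubic upscaling so that the sketch runs end to end.

```python
# Sketch of the on-line phase of Figure 4: (a) label -> (b) file path -> (c) member
# image, then parallel Y super resolution (d)->(f) and UV interpolation (e)->(g),
# followed by the merge and color-space conversion on the host CPU.
from concurrent.futures import ThreadPoolExecutor

import cv2          # assumed available for JPEG decode, color conversion, resizing
import numpy as np

def run_cnn_sr_on_imc(y_plane, scale):
    """Placeholder for offloading CNN-based SR of the Y plane to the IMC accelerator.
    Stubbed here with bicubic upscaling so the sketch is self-contained."""
    h, w = y_plane.shape
    return cv2.resize(y_plane, (w * scale, h * scale), interpolation=cv2.INTER_CUBIC)

def nearest_interpolation(uv_planes, scale):
    """Nearest-neighbor upscaling of the chroma planes on the host CPU."""
    u, v = uv_planes
    block = np.ones((scale, scale), dtype=u.dtype)
    return np.kron(u, block), np.kron(v, block)

def serve_request(label, index_table, scale=2):
    path = index_table[label]                        # index table retrieval
    bgr = cv2.imread(path)                           # read the cropped member image
    yuv = cv2.cvtColor(bgr, cv2.COLOR_BGR2YUV)       # YUV conversion on the host CPU
    y, u, v = cv2.split(yuv)
    with ThreadPoolExecutor(max_workers=2) as pool:  # Y and UV paths are independent
        y_future = pool.submit(run_cnn_sr_on_imc, y, scale)
        uv_future = pool.submit(nearest_interpolation, (u, v), scale)
        y_up, (u_up, v_up) = y_future.result(), uv_future.result()
    merged = cv2.merge([y_up, u_up, v_up])           # reassemble the upscaled planes
    return cv2.cvtColor(merged, cv2.COLOR_YUV2BGR)   # back to RGB (BGR order in OpenCV)

# Example usage with a hypothetical index table entry:
# image = serve_request("member_A", {"member_A": "/mnt/sshfs/members/member_A_0001.jpg"})
```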

Table 1 summarizes the requirements for data movement in Figure 4. The second column represents the data flow of the main tasks, whereas the third and fourth columns represent the data unit size and latency requirement of each dataflow, respectively. The cropping task of the focus fancam has data movement of (1), (2), and (3). The data unit size of a group image (1) is as large as 1 K or 2 K resolution and is assumed to be 150 KB on average. The data unit size of the label, which comprises the unique key and the position information, i.e., the x–y coordinates in the picture (2), is just 12 bytes. The member image (3) also has a small data unit size of around 20 KB. The index table update needs the label and the member image file path data (4), and its data unit size is 64 bytes of string format. The index table retrieval task has data movement of (a) and (b), where the label (a) needs 4 bytes of a unique key, and the file path (b) needs 64 bytes of string. The YUV conversion task has data movement of (c), (d), and (e). Here, the unit size of the member image file (c) is assumed to be 20 KB. The luminance Y (d) and the UV components (e) need 256 KB and 128 KB, respectively, because a 256 × 256 pixel image is used with 4:2:0 chroma subsampling. The Y-SR task and the UV-interpolation task have data movements of (d), (f) and (e), (g), respectively. The unit size of the upscaled Y (f) is 1 MB, due to a scale factor of 2, whereas the upscaled UV (g) requires 512 KB because of subsampling.

Table 1. The requirements of each dataflow in the focus fancam scenario.

Scenario: Focus fancam

Data Flow                        Data Size                  Latency Requirement
Cropping (1, 2, 3)               150 KB, 12 bytes, 20 KB    Slow
Index table update (4)           64 bytes                   Slow
Index table retrieval (a, b)     4 bytes, 64 bytes          Fast
YUV conversion (c, d, e)         20 KB, 256 KB, 128 KB      Fast
Y super resolution (d, f)        256 KB, 1 MB               Fast
UV interpolation (e, g)          128 KB, 512 KB             Fast

3.3. Application to Integrated In-Memory and In-Storage Computing

In this paper, an integrated framework for an in-memory and in-storage computing architecture is proposed. IMC and ISC are incorporated into a single system for the target application, i.e., the focus fancam, as shown in Figure 5. In this architecture, there are two candidate devices for task offloading: IMC and ISC. Therefore, the device to offload to should be appropriately selected, according to the task characteristics. Gray boxes indicate data, whereas white boxes indicate the computing resources of each device. The host CPU is responsible for offloading tasks to IMC and ISC and obtaining the results back from them. This central orchestration is a relatively simple operation. Thus, the host CPU can afford to provide additional computing resources for the remaining tasks that are not offloaded. IMC processes the data stored in the DRAM, taking advantage of being close to the data. The distance of the data movement is short, due to the TSV connection, and thus, the latency penalty caused by going through the memory controller can be reduced. Since the host CPU and IMC are connected with a through-silicon interposer (TSI), energy consumption can be reduced because the capacitive loading is less than that of the off-chip interconnection used in general DDR DRAM. The accelerator, implemented as a computing unit in the logic die, can improve the overall performance by utilizing parallelism. In the case of ISC, the embedded processor with a buffer memory computes the data, which are stored in the NAND flash chips, instead of passing them all to the host CPU. After the processing of ISC, only the processed result moves to the host CPU, which significantly reduces IO operations as well as the bottleneck delay caused by IO requests. In addition, it is possible to perform the tasks energy-efficiently, due to the low power characteristic of the embedded processor of ISC.

Figure 5. Mapping of tasks and devices for the focus fancam.

As shown in Table 1, the tasks of the target application have different characteristics, and appropriate offloading strategies are required. Among the tasks of the focus fancam application, cropping does not need low latency, but it has large data movement. Therefore, the task is offloaded to ISC. The update data for the index table are stored in the storage device to keep the persistence characteristics of the database. Thus, it is appropriate for this task to be offloaded to ISC as well. On the other hand, the YUV conversion task requires less data movement but needs low-latency computation. SR of the Y component is also computation intensive and suitable for IMC. In particular, for SR, it is appropriate to use the accelerator instead of general processors. In the YUV color space, each color component is independent, and thus, operations on Y and UV can easily be performed in parallel. Moreover, interpolation of the UV components is less computation intensive. Therefore, it can be performed simply on the host CPU. Finally, the host CPU puts together the independently upscaled Y and UV components from each device. In terms of the data movement of Figure 4, the data flows (1), (2), (3), and (4) in the off-line phase and the data flows (a), (b), and (c) in the on-line phase need a storage IO interface. In the mapping of Figure 5, (1), (2), (3), (4), and (b) are performed inside the ISC device. Data flows (a) and (c) need an off-device interface between the host CPU and storage. Data flows (d), (e), (f), and (g) do not require storage IO.
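A minimal sketch of this task-to-device mapping is shown below. The Task fields and the decision rule are illustrative assumptions distilled from Table 1 and Figure 5, not the authors' stated allocation criteria.

```python
# Sketch of the offloading decision: tasks whose data live in storage go to ISC,
# latency-critical and compute-intensive tasks go to the IMC accelerator, and the
# remaining light, low-latency tasks stay on the host CPU.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    latency_critical: bool    # "Fast" vs. "Slow" in Table 1
    compute_intensive: bool   # e.g., CNN-based super resolution
    data_in_storage: bool     # operates on files or DB entries inside the SSD

def assign_device(task: Task) -> str:
    if task.data_in_storage:
        return "ISC"          # cropping, index table update and retrieval
    if task.latency_critical and task.compute_intensive:
        return "IMC"          # Y super resolution on the accelerator
    return "Host CPU"         # YUV conversion, UV interpolation

tasks = [
    Task("cropping", False, False, True),
    Task("index table update", False, False, True),
    Task("index table retrieval", True, False, True),
    Task("YUV conversion", True, False, False),
    Task("Y super resolution", True, True, False),
    Task("UV interpolation", True, False, False),
]
for t in tasks:
    print(f"{t.name:22s} -> {assign_device(t)}")
```

This reproduces the mapping of Figure 5: the storage-resident tasks land on ISC, Y super resolution lands on the IMC accelerator, and YUV conversion and UV interpolation remain on the host CPU.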

4. Experimental Results

4.1. Experimental Environment

In this paper, an i7-7700K is used as the host CPU. The host-CPU-only system acts as the baseline system. Figure 6 shows the IMC architecture used in this paper. The IMC adopts systolic arrays to execute parallel multiply and accumulate (MAC) operations on the logic die, like other CNN accelerators such as Eyeriss [27], but the IMC processes data in memory through the TSVs of the HBM2 structure. On the base logic die, there are eight accelerators, and each accelerator consists of 168 processing elements (PEs) and 108 KB of a global buffer. PEs communicate with each other through a network-on-chip (NoC). nn_dataflow [28] is used to evaluate the performance of the accelerators. The area and energy affected by the SRAM (global buffer) size are obtained through CACTI [29]. It is known that CACTI has an error of less than 12% on average, compared to SPICE. For remote accesses to other channels, the NoC power is estimated using ORION 2.0 [30]. The DRAMPower tool [31] is used to compare run time and energy according to data movement. For the ISC, an ARM Cortex R5 is assumed. The operation of ISC, where the off-line phase is mainly performed, is measured through gem5 [32]. While performing a cropping task, the decoding (from JPEG to BMP) of large size images requires 16,055 M operations, whereas the encoding (from BMP to JPEG) of cropped small size images requires 40 M operations. For the comparison of DRAM energy in the gem5 [32] environment, the i7-7700K uses the IDD value of DDR4 16 GB 2400 MHz × 64 (8 devices/rank, 2 ranks/channel, MT40A1G8 data sheet), and the Cortex R5 uses the IDD value of LPDDR4 8 Gb 4266 × 32 (2 devices/rank, MT53E256M32D2 data sheet). The SSD energy is calculated using the read/write energy of the Samsung SSD 980 PRO (PCIe Gen4) data sheet.

Figure 6. The block diagram of accelerator and HBM2 assumed in this experiment.

4.2. Processing Time and Energy Comparison for On-Line Phase

In the on-line phase of the focus fancam scenario, the execution time is measured on the proof of concept with the help of the nn_dataflow simulator. A total of 100 images of the members are cropped in advance during the off-line phase, according to the users' favorite member requests. Table 2 shows the runtime and energy comparison for the four main operations of the on-line phase. The baseline denotes running the focus fancam application on the CPU-only system, whereas the proposed IMC denotes the host CPU-IMC hybrid architecture for the on-line phase. The host CPU uses DDR4, whereas the IMC uses HBM2. The energy and time measurement conditions are 100 MB read from memory and 25 MB written. A row page hit ratio of 80% is applied. The HBM2 IDD data sheet has not been released. Therefore, the IDD value of HBM2 is estimated based on the IDD data sheet of DDR4. Since HBM2 has a wider data width compared to DDR4, it can be predicted that the data circuit will also increase. Considering that, we estimate the standby current of HBM2 to be twice that of DDR4. The toggle energy of the TSV is added to IDD4W/R. The IO interposer capacitance is used instead of that of the PCB. HBM2's TSV capacitance is 47.5 fF per stack [33], and the TSV output loading is 100 fF [34]. The IO interposer capacitance is 855 fF/5 mm [35]. The second and third columns represent the execution time, whereas the fourth and fifth columns represent the energy. Among the tasks of the on-line phase, DB retrieval is performed in ISC. However, it takes very little time, 12 µs on average, and thus, it is not shown in Table 2. YUV conversion is performed on the host CPU in both the baseline and the proposed IMC systems. SR of the Y component and the interpolation of the UV components are performed in parallel. UV interpolation runs on the host CPU in both cases. For SR using the context-preserving filter reorganization very deep convolutional network for image super-resolution (CPFR-VDSR) network, the runtime of the baseline is 2307.6 ms and its power is 113.1 W. Thus, it consumes 260,691 mJ of energy. On the other hand, the SR runtime of the IMC is 72.8 ms, and its power is just 4.62 W. Therefore, it consumes 336 mJ of energy. Both runtime and energy are significantly reduced. In terms of data movement, in the proposed IMC, the runtime of the data movement is reduced by 88%, whereas the energy is reduced by 21% because an interposer is used instead of a PCB. Due to the improved runtime and energy in SR and data movement, in the proposed IMC, the total runtime and energy are reduced by 77.5% and 78.4%, respectively.

Table 2. Runtime and energy comparison for on-line phase.

Tasks              Runtime (ms)                 Energy (mJ)
                   Baseline    Proposed IMC     Baseline    Proposed IMC
YUV Conversion     566.7       566.7            63,958      63,958
SR                 2307.6      72.8             260,691     336
Interpolation      11.1        11.1             1254        1254
Data Movement      8.5         1.0              8640        6810
Total              2893.9      651.6            334,543     72,358
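As a sanity check of Table 2, the reported energies are consistent with energy = power × runtime, and the totals reproduce the roughly 4.4× speed-up and 4.6× energy reduction over the baseline quoted in the abstract and conclusions. The short sketch below uses only the figures stated in the text.

```python
# Energy in mJ from power (W) times runtime (ms), applied to the SR row of Table 2,
# plus the overall speed-up and energy-reduction ratios from the Total row.
def energy_mj(power_w, runtime_ms):
    return power_w * runtime_ms                     # W * ms = mJ

print(energy_mj(113.1, 2307.6))                     # ~260,990 mJ, close to the reported 260,691 mJ
print(energy_mj(4.62, 72.8))                        # ~336 mJ, matching the reported IMC value

baseline_ms, proposed_ms = 2893.9, 651.6
baseline_mj, proposed_mj = 334_543, 72_358
print(f"speed-up:         {baseline_ms / proposed_ms:.1f}x")   # ~4.4x
print(f"energy reduction: {baseline_mj / proposed_mj:.1f}x")   # ~4.6x
```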

For performance comparison, instead of the baseline of Table 2, which is a combination of DDR4 and the CPU, the efficiency under different system conditions where an accelerator is used with DDR4 is measured. In Table 3, four system conditions are summarized. D1 uses one DDR4 chip with an accelerator, whereas D8 uses 8 DDR4 chips to increase the number of channels to a level similar to HBM2. H1 denotes an accelerator with a 1-channel HBM2, whereas H8 denotes an 8-channel HBM2. The DRAM devices have different numbers of IO pins: DDR4 and HBM2 have 8 and 128 IO pins per channel, respectively.

Figure 7 shows the efficiency and throughput when SR is performed. For the throughput, GOPS (giga operations per second) is used, whereas GOPS/W, i.e., the GOPS that can be processed per watt, is used to measure the efficiency. For H8, the throughput and efficiency are 1012 GOPS and 219 GOPS/W, respectively. H8 shows a clear performance advantage, compared to D8. The CPFR-VDSR network used for SR is a small network, and its data reuse ratio is quite low. Therefore, it is obvious that reducing the access latency with 3D stacked memory has a huge impact on performance improvement. Figure 8 shows the energy efficiency results (GOPS/W) for different kinds of popular neural networks. The IMC-based accelerator improves the average energy efficiency by 1.78× compared to an accelerator that needs off-chip memory access, such as Eyeriss [27]. The average energy efficiency of Eyeriss is 118 GOPS/W, and that of the proposed architecture is 210 GOPS/W.

Table 3. A comparison of the cases.

Case   DRAM Type (Density)   Bandwidth (GB/s)   Total I/O Pins   Channel   Accelerator   Global Buffer (KB)
D1     DDR4 (512 MB)         2.4                8                1         1             108
H1     HBM2 (512 MB)         32.0               128              1         1             64
D8     DDR4 (4 GB)           19.2               64               8         8             864
H8     HBM2 (4 GB)           256.0              1024             8         8             512
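The bandwidth column of Table 3 follows from the pin counts, i.e., bandwidth = I/O pins × per-pin data rate / 8 bits, with the 8-channel cases scaling linearly. The sketch below assumes a 2400 MT/s DDR4 device and an HBM2 channel at 2 Gb/s per pin; these per-pin rates are assumptions consistent with the table, not values given in the paper.

```python
# Per-channel bandwidth from I/O pin count and assumed per-pin data rate.
def bandwidth_gbs(io_pins, gbit_per_pin):
    return io_pins * gbit_per_pin / 8               # GB/s

print(bandwidth_gbs(8, 2.4))                        # D1: 2.4 GB/s
print(bandwidth_gbs(128, 2.0))                      # H1: 32.0 GB/s
print(8 * bandwidth_gbs(8, 2.4))                    # D8: 19.2 GB/s
print(8 * bandwidth_gbs(128, 2.0))                  # H8: 256.0 GB/s
```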

Figure 7. Efficiency (GOPS/W) and throughput (GOPS) comparison.

Figure 8. Comparison of efficiency (GOPS/W) between the proposed architecture and Eyeriss.

4.3. Processing Time and Energy Comparison for Off-Line Phase

The off-line phase of the focus fancam consists of a couple of tasks. The images are decoded from JPEG to BMP format and cropped to a 256 × 256 image. The cropped images are encoded back to JPEG. With a new key–value pair, the cropped JPEG files are stored in the DB of the SSD. The off-line phase does not demand a tight latency requirement. Therefore, energy minimization is the main concern. In the example scenario, it is assumed that there are 1000 group images per group and that there are 20 groups in total. On average, the group images are 1 K or 2 K resolution with a size of 150 KB. Each group image has five members. The size of a cropped member image is about 20 KB. Table 4 shows the specification of the processors in the host CPU and ISC. The i7-7700K in the host CPU is the baseline system; it is composed of 4 cores and runs at 4.4 GHz. It has a performance of 6.05 CoreMark/MHz and a power efficiency of 1.24 GOPS/W. When operating at 140 GOPS, it consumes 113 W of power. Meanwhile, the Cortex R5 in the proposed ISC is composed of 2 cores and runs at 1.4 GHz. It has a performance of 3.47 CoreMark/MHz and a power efficiency of 5.61 GOPS/W. At 12.8 GOPS, it consumes 2.3 W of power.

Table 4. The specification comparison of processors in baseline and proposed ISC.

CPU                        Cores   CoreMark/MHz   GOPS    GOPS/W   Power (W)
i7-7700K (Baseline)        4       6.05           140.2   1.24     113.1
Cortex R5 (Proposed ISC)   2       3.47           12.8    5.61     2.3

When cropping a small image from each group image, the number of operations and the DRAM access energy are compared. This off-line task is compared between the case of the host CPU and the case of the ARM processor in the SSD. In Table 5, for one image encoding and conversion, the i7-7700K of the baseline system consumes 15.4 J, whereas the Cortex R5 in the proposed ISC consumes 3.6 J. Thus, the energy is reduced by 76.6%. Here, the energy for processing and DRAM is reduced, but the SSD energy is increased. On the other hand, the runtime is increased by 11 times, from 115 ms to 1258 ms, but this is acceptable in the off-line phase, where energy reduction is more important than real-time processing.

Table 5. Focus fancam off-line run time and energy.

CPU                        Time (ms)   Processing (J)   DRAM (J)   SSD (J)   Total (J)
i7-7700K (baseline)        115         12.980           2.459      0.007     15.446
Cortex R5 (proposed ISC)   1258        2.868            0.718      0.046     3.632
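As a check of Table 5, the per-component energies sum to the reported totals, and the totals give an energy reduction of about 76.6% (roughly 4.25×, the figure cited in the conclusions) together with the roughly 11× runtime increase. The sketch below uses only the table values.

```python
# Recompute the totals and ratios of Table 5 for the off-line cropping task.
baseline = {"time_ms": 115,  "processing": 12.980, "dram": 2.459, "ssd": 0.007}
proposed = {"time_ms": 1258, "processing": 2.868,  "dram": 0.718, "ssd": 0.046}

def total_j(row):
    return row["processing"] + row["dram"] + row["ssd"]

e_base, e_isc = total_j(baseline), total_j(proposed)
print(e_base, e_isc)                                            # 15.446 J vs. 3.632 J
print(f"energy reduction: {1 - e_isc / e_base:.1%}")            # ~76.5%, i.e., about 76.6%
print(f"energy ratio:     {e_base / e_isc:.2f}x")               # ~4.25x
print(f"runtime ratio:    {proposed['time_ms'] / baseline['time_ms']:.1f}x")  # ~10.9x
```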

5. Conclusions

We propose an integrated solution of an in-memory and in-storage computing architecture. Depending on the data movement and computing characteristics of the tasks, near data processing needs to be considered with various options. A proof-of-concept environment is prepared, using a host CPU, an FPGA, and a storage server, for estimation. With this prototype, we were able to quickly evaluate the potential of the application and the impact of offloading. IMC improves the performance by 4.4× and reduces total energy by 4.6× over the baseline host CPU; ISC reduces the energy by 4.25×. The proposed architecture improves the speed-up by 14.7 times and the file read/write access by 607 times. As we allocate tasks to IMC and ISC on a rather intuitive basis, additional studies are necessary to set up a simulation environment for the hybrid architecture to explore the allocation criteria more systematically.

Author Contributions: M.K., S.-H.K., H.-J.L. and C.-E.R. were involved in the full process of producing this paper, including conceptualization, methodology, modeling, validation, visualization, and preparing the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the R&D program of MOTIE/KEIT and the NLR program of MOST/KOSEF (No. 10077609, Developing Processor-Memory-Storage Integrated Architecture for Low Power, High Performance Servers) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (NRF-2021R1A2C2008946).

Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Balasubramonian, R.; Chang, J.; Manning, T.; Moreno, J.H.; Murphy, R.; Nair, R.; Swanson, S. Near-data processing: Insights from a micro-46 workshop. IEEE Micro 2014, 34, 36–42. [CrossRef]
2. Li, S.; Niu, D.; Malladi, K.T.; Zheng, H.; Brennan, B.; Xie, Y. Drisa: A dram-based reconfigurable in-situ accelerator. In Proceedings of the 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Cambridge, MA, USA, 14–18 October 2017; pp. 288–301.
3. Gu, B.; Yoon, A.S.; Bae, D.H.; Jo, I.; Lee, J.; Yoon, J.; Kang, J.U.; Kwon, M.; Yoon, C.; Cho, S.; et al. Biscuit: A framework for near-data processing of big data workloads. ACM SIGARCH Comput. Archit. News 2016, 44, 153–165. [CrossRef]
4. Eckert, C.; Wang, X.; Wang, J.; Subramaniyan, A.; Iyer, R.; Sylvester, D.; Blaaauw, D.; Das, R. Neural cache: Bit-serial in-cache acceleration of deep neural networks. In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, USA, 1–6 June 2018; pp. 383–396.
5. Jiang, Z.; Yin, S.; Seo, J.S.; Seok, M. C3SRAM: In-Memory-Computing SRAM macro based on capacitive-coupling computing. IEEE Solid-State Circuits Lett. 2019, 2, 131–134. [CrossRef]
6. Seshadri, V.; Lee, D.; Mullins, T.; Hassan, H.; Boroumand, A.; Kim, J.; Kozuch, M.A.; Mutlu, O.; Gibbons, P.B.; Mowry, T.C. Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology. In Proceedings of the 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Cambridge, MA, USA, 14–18 October 2017; pp. 273–287.
7. Asghari-Moghaddam, H.; Son, Y.H.; Ahn, J.H.; Kim, N.S. Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 15–19 October 2016; pp. 1–13.
8. Farmahini-Farahani, A.; Ahn, J.H.; Morrow, K.; Kim, N.S. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. In Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Burlingame, CA, USA, 7–11 February 2015; pp. 283–295.
9. Hsieh, K.; Ebrahimi, E.; Kim, G.; Chatterjee, N.; O'Connor, M.; Vijaykumar, N.; Mutlu, O.; Keckler, S.W. Transparent offloading and mapping (TOM): Enabling programmer-transparent near-data processing in GPU systems. ACM SIGARCH Comput. Archit. News 2016, 44, 204–216. [CrossRef]
10. Zhang, D.; Jayasena, N.; Lyashevsky, A.; Greathouse, J.L.; Xu, L.; Ignatowski, M. TOP-PIM: Throughput-oriented programmable processing in memory. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, Vancouver, BC, Canada, 23–27 June 2014; pp. 85–98.
11. Li, S.; Xu, C.; Zou, Q.; Zhao, J.; Lu, Y.; Xie, Y. Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories. In Proceedings of the 53rd Annual Design Automation Conference, Austin, TX, USA, 5–9 June 2016; pp. 1–6.
12. Kang, W.; Wang, H.; Wang, Z.; Zhang, Y.; Zhao, W. In-memory processing paradigm for bitwise logic operations in STT-MRAM. IEEE Trans. Magn. 2017, 53, 1–4.
13. Chi, P.; Li, S.; Xu, C.; Zhang, T.; Zhao, J.; Liu, Y.; Wang, Y.; Xie, Y. Prime: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. ACM SIGARCH Comput. Archit. News 2016, 44, 27–39. [CrossRef]
14. Acharya, A.; Uysal, M.; Saltz, J. Active disks: Programming model, algorithms and evaluation. ACM SIGOPS Oper. Syst. Rev. 1998, 32, 81–91. [CrossRef]
15. Seshadri, S.; Gahagan, M.; Bhaskaran, S.; Bunker, T.; De, A.; Jin, Y.; Liu, Y.; Swanson, S. Willow: A user-programmable SSD. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Broomfield, CO, USA, 4–6 November 2014; pp. 67–80.
16. Zhang, J.; Kwon, M.; Kim, H.; Kim, H.; Jung, M. FlashGPU: Placing new flash next to GPU cores. In Proceedings of the 2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, 2 June 2019; pp. 1–6.
17. Do, J.; Kee, Y.S.; Patel, J.M.; Park, C.; Park, K.; DeWitt, D.J. Query processing on smart SSDs: Opportunities and challenges. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 1221–1230.
18. Wang, J.; Park, D.; Kee, Y.S.; Papakonstantinou, Y.; Swanson, S. SSD in-storage computing for list intersection. In Proceedings of the 12th International Workshop on Data Management on New Hardware, San Francisco, CA, USA, 27 June 2016; pp. 1–7.
19. Kim, S.; Oh, H.; Park, C.; Cho, S.; Lee, S.W.; Moon, B. In-storage processing of database scans and joins. Inf. Sci. 2016, 327, 183–200. [CrossRef]
20. Park, D.; Wang, J.; Kee, Y.S. In-storage computing for Hadoop MapReduce framework: Challenges and possibilities. IEEE Trans. Comput. 2016. [CrossRef]
21. Lee, J.; Kim, H.; Yoo, S.; Choi, K.; Hofstee, H.P.; Nam, G.J.; Nutter, M.R.; Jamsek, D. ExtraV: Boosting graph processing near storage with a coherent accelerator. Proc. VLDB Endow. 2017, 10, 1706–1717. [CrossRef]
22. Cho, B.Y.; Jeong, W.S.; Oh, D.; Ro, W.W. XSD: Accelerating MapReduce by harnessing the GPU inside an SSD. In Proceedings of the 1st Workshop on Near-Data Processing, Davis, CA, USA, 8 December 2013.
23. Kaplan, R.; Yavits, L.; Ginosar, R. Prins: Processing-in-storage acceleration of machine learning. IEEE Trans. Nanotechnol. 2018, 17, 889–896. [CrossRef]
24. Alian, M.; Min, S.W.; Asgharimoghaddam, H.; Dhar, A.; Wang, D.K.; Roewer, T.; McPadden, A.; O'Halloran, O.; Chen, D.; Xiong, J.; et al. Application-transparent near-memory processing architecture with memory channel network. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Fukuoka, Japan, 20–24 October 2018; pp. 802–814.
25. Dhar, A.; Huang, S.; Xiong, J.; Jamsek, D.; Mesnet, B.; Huang, J.; Kim, N.S.; Hwu, W.M.; Chen, D. Near-memory and in-storage FPGA acceleration for emerging cognitive computing workloads. In Proceedings of the 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Miami, FL, USA, 15–17 July 2019; pp. 68–75.
26. Lee, J.; Kim, S.; Kim, S.; Park, J.; Sohn, K. Context-aware emotion recognition networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–3 November 2019; pp. 10143–10152.
27. Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 2016, 52, 127–138. [CrossRef]
28. Gao, M.; Pu, J.; Yang, X.; Horowitz, M.; Kozyrakis, C. Tetris: Scalable and efficient neural network acceleration with 3D memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, Xi'an, China, 8–12 April 2017; pp. 751–764.
29. Muralimanohar, N.; Balasubramonian, R.; Jouppi, N.P. CACTI 6.0: A tool to model large caches. HP Lab. 2009, 1, 1–24.
30. Kahng, A.B.; Li, B.; Peh, L.S.; Samadi, K. ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. In Proceedings of the 2009 Design, Automation & Test in Europe Conference & Exhibition (DATE), Nice, France, 20–24 April 2009; pp. 423–428.
31. Chandrasekar, K.; Weis, C.; Akesson, B.; Wehn, N.; Goossens, K. DRAMPower: Open-Source DRAM Power & Energy Estimation Tool. Available online: http://www.drampower.info (accessed on 17 July 2021).
32. Binkert, N.; Beckmann, B.; Black, G.; Reinhardt, S.K.; Saidi, A.; Basu, A.; Hestness, J.; Hower, D.R.; Krishna, T.; Sardashti, S.; et al. The gem5 simulator. ACM SIGARCH Comput. Archit. News 2011, 39, 1–7. [CrossRef]
33. Kim, J.S.; Oh, C.S.; Lee, H.; Lee, D.; Hwang, H.R.; Hwang, S.; Na, B.; Moon, J.; Kim, J.G.; Park, H.; et al. A 1.2 V 12.8 GB/s 2 Gb Mobile Wide-I/O DRAM with 4 × 128 I/Os using TSV-based stacking. IEEE J. Solid-State Circuits 2011, 47, 107–116. [CrossRef]
34. Chandrasekar, K.; Weis, C.; Akesson, B.; Wehn, N.; Goossens, K. System and circuit level power modeling of energy-efficient 3D-stacked wide I/O DRAMs. In Proceedings of the 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 18–22 March 2013; pp. 236–241.
35. Jangam, S.; Bajwa, A.A.; Thankkappan, K.K.; Kittur, P.; Iyer, S.S. Electrical characterization of high performance fine pitch interconnects in silicon-interconnect fabric. In Proceedings of the 2018 IEEE 68th Electronic Components and Technology Conference (ECTC), San Diego, CA, USA, 29 May–1 June 2018; pp. 1283–1288.