Sorting in Memristive Memory

Mohsen Riahi Alam, Student Member, IEEE, M. Hassan Najafi, Member, IEEE, and Nima TaheriNejad, Member, IEEE

Mohsen Riahi Alam and M. Hassan Najafi are with the School of Computing and Informatics, University of Louisiana at Lafayette, LA, 70504, USA. Nima TaheriNejad is with the Institute of Computer Technology, Technische Universität Wien (TU Wien), Vienna, Austria (email: mohsen.riahi-[email protected], najafi@louisiana.edu, [email protected]).

arXiv:2012.09918v2 [cs.ET] 23 Jan 2021

Abstract: Sorting is needed in many application domains. The data is read from memory and sent to a general purpose processor or application specific hardware for sorting. The sorted data is then written back to the memory. Reading/writing data from/to memory and transferring data between memory and processing unit incur a large latency and energy overhead. In this work, we develop, to the best of our knowledge, the first architectures for in-memory sorting of data. We propose two architectures. The first architecture is applicable to the conventional format of representing data, weighted binary radix. The second architecture is proposed for the developing unary processing systems where data is encoded as uniform unary bit-streams. The two architectures have different advantages and disadvantages, making one or the other more suitable for a specific application. However, the common property of both is a significant reduction in the processing time compared to prior sorting designs. Our evaluations show on average 37× and 138× energy reduction for binary and unary designs, respectively, as compared to conventional CMOS off-memory sorting systems.

Index Terms: In-memory computation, sorting networks, unary processing, stochastic computing, memristors, memristor-aided logic (MAGIC), median filtering, ReRAM.

I. INTRODUCTION

SORTING is a fundamental operation in computer science used in databases [1], [2], scientific computing [3], scheduling [4], artificial intelligence and robotics [5], image [6], video [7], and signal processing [8], among others. Much research has focused on harnessing the computational power of many-core Central Processing Unit (CPU)- and Graphics Processing Unit (GPU)-based processing systems for efficient sorting [9]–[11]. For high-performance applications, sorting is implemented in hardware using either Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) [12]–[14]. The parallel nature of hardware-based solutions allows them to outperform software-based solutions executed on CPU/GPU.

The usual approach for hardware-based sorting is to wire up a network of Compare-and-Swap (CAS) units in a configuration called a Batcher (or bitonic) network [15]. Batcher networks provide the lowest known latency for hardware-based sorting [16], [17]. Each CAS block compares two input values and swaps the values at the output, if required. Fig. 1(a) shows the schematic symbol of a CAS block. Fig. 1(b) shows the CAS network for an 8-input bitonic sorting network made up of 24 CAS blocks. Batcher sorting networks are fundamentally different from software algorithms for sorting such as quick sort, merge sort, and bubble sort since the order of comparisons is fixed in advance; the order is not data dependent as is the case with software algorithms [18]. The implementation cost of these networks is a direct function of the number of CAS blocks and the cost of each block. A CAS block is conventionally designed based on the weighted binary radix representation of data. The CAS design consists of an n-bit magnitude comparator and two n-bit multiplexers, where n is the precision (i.e., data-width) of the input data. Fig. 2(a) shows the conventional design of a CAS unit. Increasing the data-width increases the complexity of this design.

Fig. 1. (a) Schematic symbols of a CAS block (b) CAS network for an 8-input bitonic sorting.

Fig. 2. Logic design of a CAS block: (a) conventional binary design processing data (b) parallel unary design processing unary bit-streams.

Unary processing [19] has been recently exploited for simple and low-cost implementation of sorting network circuits [18]. In unary processing, unlike weighted binary radix, all digits are weighted equally. The paradigm borrows the concept of averaging from stochastic computing [20], [21], but is deterministic and accurate. Numbers are encoded uniformly by a sequence of one value (say 1) followed by a sequence of the other value (say 0) in a stream of 1’s and 0’s, called a unary bit-stream. The value of a unary bit-stream is determined by the frequency of the appearance of 1’s in the bit-stream. For example, 11000000 is a unary bit-stream representing 2/8. Simplicity of hardware design is the main advantage of unary processing. Minimum and maximum functions, the main operations in a CAS block, can be implemented using simple standard AND and OR gates in the unary domain [18]. These gates showed a similar functionality when fed with correlated stochastic bit-streams [22]. In a serial manner, one AND and one OR gate implement a CAS block by processing one bit of the two bit-streams at each cycle. Hence, a total of 2^n processing cycles is needed to process two 2^n-bit bit-streams (equivalent to two n-bit binary data). Alternatively, the bit-streams can be processed in one cycle by replicating the logic gates and performing the logical operations in parallel. Fig. 2(b) shows the parallel unary design of a CAS block. 2^n pairs of AND and OR gates sort two 2^n-bit bit-streams.

All these prior sorting designs were developed based on the Von-Neumann architecture, separating the memory unit where the data is stored and the processing unit where the data is processed (i.e., sorted). A significant portion of the total processing time and total energy consumption is wasted on 1) reading the data from memory, 2) transferring the data between memory and processing unit, and 3) writing the result back into the memory. In-Memory Computation (IMC) or Processing in Memory (PIM) is an emerging computational approach that offers the ability to both store and process data within memory cells [23]–[28]. This technique eliminates the high overhead of transferring data between memory and processing unit, improving the performance and reducing the energy consumption by processing data in memory. For data intensive applications, developing efficient IMC methods is an active area of research [29]–[38].

One of the promising technologies for IMC is memristive-based IMC using logics such as Memristor-Aided Logic (MAGIC) [39]. MAGIC is fully compatible with the usual crossbar design. It supports NOR operation, which can be used to implement any Boolean logic. MAGIC considers two states of memristors: Low Resistance State (LRS) as logical ‘1’ and High Resistance State (HRS) as logical ‘0’. Fig. 3(a) shows how NOR and NOT logical operations can be implemented in MAGIC. The memristors connected to the ground are output memristors. To execute the operation, the output memristors are first initialized to LRS. By applying a specific voltage (V0) to the negative terminal of the input memristors, the output memristors may experience a state change from LRS to HRS depending on the states of the inputs. The truth tables embedded in Fig. 3(a) show all possible cases of the input memristors’ states and switching of the output memristors. Fig. 3(b) shows how these operations are realized in crossbar memory.

Fig. 3. (a) NOT and NOR logical operations in MAGIC and their truth tables. LRS and HRS represent logical ‘1’ and logical ‘0’, respectively. (b) Crossbar implementation of NOT and NOR logical operations.

[...] then elaborate the design of complete sorting networks made up of the proposed in-memory CAS units.

This paper is structured as follows. Sections II and III present the proposed in-memory Binary and Unary Sorting designs, respectively. Section IV compares the performance of the proposed designs with the conventional off-memory CMOS-based designs, and applies the proposed architectures to an important application of sorting, i.e., median filtering. Finally, conclusions are drawn in Section VI.

II. PROPOSED IN-MEMORY BINARY SORTING

In this section, we present our proposed method for in-memory sorting of binary radix data. First, we discuss the implementation of a basic sorting unit and then generalize the architecture to complete sort systems.

A. Basic Binary Sorting Unit

A CAS block in conventional binary radix requires one n-bit comparator and two n-bit multiplexers. While there are several prior designs for in-memory equality comparator [40], [41], to the best of our knowledge, Angizi et al. [42] developed the only prior near/in-memory magnitude comparator. Their design uses in-memory XOR operations to perform bit-wise [...]
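The ideas introduced so far (unary bit-streams, AND/OR as minimum/maximum, NOR as the only primitive offered by MAGIC, and a fixed bitonic CAS schedule) can be illustrated with a small software model. The sketch below is not the in-memory mapping proposed in the paper; it is a plain Python simulation under stated assumptions: all function names are hypothetical, NOT is modeled as a NOR with both inputs tied together, and the network schedule is the textbook iterative bitonic sort. It only checks that a unary CAS built from NOR gates sorts correctly and that an 8-input network uses 24 CAS blocks.

```python
from typing import List, Tuple


def to_unary(value: int, length: int) -> List[int]:
    """Encode `value` as that many 1s followed by 0s (2 of 8 -> 11000000)."""
    if not 0 <= value <= length:
        raise ValueError("value must fit in the bit-stream length")
    return [1] * value + [0] * (length - value)


def from_unary(stream: List[int]) -> int:
    """Decode a unary bit-stream by counting its 1s."""
    return sum(stream)


def nor(a: int, b: int) -> int:
    """Logical NOR, the primitive assumed here for the memory array."""
    return 0 if (a or b) else 1


def not_(a: int) -> int:
    # Assumption of this model: NOT is a NOR with both inputs tied.
    return nor(a, a)


def or_(a: int, b: int) -> int:
    return not_(nor(a, b))


def and_(a: int, b: int) -> int:
    return nor(not_(a), not_(b))


def unary_cas(a: List[int], b: List[int]) -> Tuple[List[int], List[int]]:
    """Compare-and-swap on two equal-length unary streams.

    Bit-wise AND gives the minimum and bit-wise OR gives the maximum.
    """
    lo = [and_(x, y) for x, y in zip(a, b)]
    hi = [or_(x, y) for x, y in zip(a, b)]
    return lo, hi


def bitonic_sort(streams: List[List[int]]) -> Tuple[List[List[int]], int]:
    """Sort unary streams with a bitonic network; return (sorted, CAS count).

    The input count must be a power of two. The comparison order is fixed
    in advance and does not depend on the data.
    """
    data = list(streams)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("number of inputs must be a power of two")
    cas_count = 0
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    lo, hi = unary_cas(data[i], data[partner])
                    if i & k == 0:  # ascending region
                        data[i], data[partner] = lo, hi
                    else:           # descending region
                        data[i], data[partner] = hi, lo
                    cas_count += 1
            j //= 2
        k *= 2
    return data, cas_count


if __name__ == "__main__":
    values = [6, 7, 2, 3, 5, 0, 1, 4]           # arbitrary 3-bit inputs
    streams = [to_unary(v, 8) for v in values]  # 2^3 = 8-bit unary streams
    result, count = bitonic_sort(streams)
    print([from_unary(s) for s in result], count)
```

For eight inputs the schedule has six stages of four comparisons each, which matches the 24 CAS blocks quoted for Fig. 1(b). Each AND costs three NOR evaluations and each OR costs two in this model; these counts describe the sketch only and say nothing about the cycle counts of the authors' design.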
