
Sorting in Memristive Memory

Mohsen Riahi Alam, Student Member, IEEE, M. Hassan Najafi, Member, IEEE, and Nima TaheriNejad, Member, IEEE

Abstract—Sorting is needed in many application domains. The data is read from memory and sent to a general-purpose or application-specific hardware unit for sorting. The sorted data is then written back to the memory. Reading/writing data from/to memory and transferring data between memory and processing unit incur a large latency and energy overhead. In this work, we develop, to the best of our knowledge, the first architectures for in-memory sorting of data. We propose two architectures. The first architecture is applicable to the conventional format of representing data, weighted binary radix. The second architecture is proposed for the developing unary processing systems where data is encoded as uniform unary bit-streams. The two architectures have different advantages and disadvantages, making one or the other more suitable for a specific application. However, the common property of both is a significant reduction in the processing time compared to prior sorting designs. Our evaluations show on average 37× and 138× energy reduction for binary and unary designs, respectively, as compared to conventional CMOS off-memory sorting systems.

Index Terms—In-memory computation, sorting networks, unary processing, stochastic computing, memristors, memristor-aided logic (MAGIC), median filtering, ReRAM.

Fig. 1. (a) Schematic symbols of a CAS block. (b) CAS network for an 8-input bitonic sorting.

Fig. 2. Logic design of a CAS block: (a) conventional binary design processing data; (b) parallel unary design processing unary bit-streams.

Mohsen Riahi Alam and M. Hassan Najafi are with the School of Computing and Informatics, University of Louisiana at Lafayette, LA, 70504, USA. Nima TaheriNejad is with the Institute of Computer Technology, Technische Universität Wien (TU Wien), Vienna, Austria (email: mohsen.riahi-[email protected], najafi@louisiana.edu, [email protected]).

I. INTRODUCTION

Sorting is a fundamental operation in computer science used in databases [1], [2], scientific computing [3], scheduling [4], artificial intelligence and robotics [5], image [6], video [7], and signal processing [8], among others. Much research has focused on harnessing the computational power of many-core Central Processing Unit (CPU)- and Graphics Processing Unit (GPU)-based processing systems for efficient sorting [9]–[11]. For high-performance applications, sorting is implemented in hardware using either Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) [12]–[14]. The parallel nature of hardware-based solutions allows them to outperform software-based solutions executed on CPU/GPU.

The usual approach for hardware-based sorting is to set up a network of Compare-and-Swap (CAS) units in a configuration called a Batcher (or bitonic) network [15]. Batcher networks provide the lowest known latency for hardware-based sorting [16], [17]. Each CAS block compares two input values and swaps the values at the output, if required. Fig. 1(a) shows the schematic symbol of a CAS block. Fig. 1(b) shows the CAS network for an 8-input bitonic sorting network made up of 24 CAS blocks. Batcher sorting networks are fundamentally different from software algorithms for sorting such as quick sort, merge sort, and bubble sort since the order of comparisons is fixed in advance; the order is not data dependent as is the case with software algorithms [18]. The implementation cost of these networks is a direct function of the number of CAS blocks and the cost of each block. A CAS block is conventionally designed based on the weighted binary radix representation of data. The CAS design consists of an n-bit magnitude comparator and two n-bit multiplexers, where n is the precision (i.e., data-width) of the input data. Fig. 2(a) shows the conventional design of a CAS unit. Increasing the data-width increases the complexity of this design.

Unary processing [19] has recently been exploited for simple and low-cost implementation of sorting network circuits [18]. In unary processing, unlike weighted binary radix, all digits are weighted equally. The paradigm borrows the concept of averaging from stochastic computing [20], [21], but is deterministic and accurate. Numbers are encoded uniformly by a sequence of one value (say 1) followed by a sequence of the other value (say 0) in a stream of 1's and 0's, called a unary bit-stream. The value of a unary bit-stream is determined by the frequency of the appearance of 1's in the bit-stream. For example, 11000000 is a unary bit-stream representing 2/8. Simplicity of hardware design is the main advantage of unary processing. Minimum and maximum functions, the main operations in a CAS block, can be implemented using simple standard AND and OR gates in the unary domain [18]. These gates show a similar functionality when fed with correlated stochastic bit-streams [22]. In a serial manner, one AND and one OR gate implement a CAS block by processing one bit of the two bit-streams at each cycle. Hence, a total of 2^n processing cycles is needed to process two 2^n-bit bit-streams (equivalent to two n-bit binary data). Alternatively, the bit-streams can be processed in one cycle by replicating the logic gates and performing the logical operations in parallel. Fig. 2(b) shows the parallel unary design of a CAS block: 2^n pairs of AND and OR gates sort two 2^n-bit bit-streams.
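To make the encoding concrete, the following Python sketch (illustrative helper functions of our own, not part of the proposed hardware) encodes two 4-bit values as 2^4 = 16-bit unary bit-streams and sorts them with the bit-wise AND/OR rule described above:

```python
def to_unary(value, length):
    """Encode an integer as a left-aligned unary bit-stream: 'value' ones followed by zeros."""
    return [1] * value + [0] * (length - value)

def unary_value(stream):
    """The value of a unary bit-stream is the number of 1s it contains."""
    return sum(stream)

def cas_unary(a, b):
    """Compare-and-swap on two same-length unary bit-streams:
    bit-wise AND yields min(a, b), bit-wise OR yields max(a, b)."""
    minimum = [x & y for x, y in zip(a, b)]
    maximum = [x | y for x, y in zip(a, b)]
    return minimum, maximum

# Two 4-bit values (n = 4) become 2^n = 16-bit unary bit-streams.
a, b = to_unary(6, 16), to_unary(3, 16)
lo, hi = cas_unary(a, b)
assert unary_value(lo) == 3 and unary_value(hi) == 6
```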
All these prior sorting designs were developed based on the Von Neumann architecture, separating the memory unit where the data is stored and the processing unit where the data is processed (i.e., sorted). A significant portion of the total processing time and total energy consumption is wasted on 1) reading the data from memory, 2) transferring the data between memory and processing unit, and 3) writing the result back into the memory. In-Memory Computation (IMC) or Processing in Memory (PIM) is an emerging computational approach that offers the ability to both store and process data within memory cells [23]–[28]. This technique eliminates the high overhead of transferring data between memory and processing unit, improving the performance and reducing the energy consumption by processing data in memory. For data-intensive applications, developing efficient IMC methods is an active area of research [29]–[38].

Fig. 3. (a) NOT and NOR logical operations in MAGIC and their truth tables. LRS and HRS represent logical '1' and logical '0', respectively. (b) Crossbar implementation of NOT and NOR logical operations.

One of the promising technologies for IMC is memristive-based IMC using logics such as Memristor-Aided Logic (MAGIC) [39]. MAGIC is fully compatible with the usual crossbar design. It supports the NOR operation, which can be used to implement any Boolean logic. MAGIC considers two states of memristors: Low Resistance State (LRS) as logical '1' and High Resistance State (HRS) as logical '0'. Fig. 3(a) shows how NOR and NOT logical operations can be implemented in MAGIC. The memristors connected to the ground are output memristors. To execute the operation, the output memristors are first initialized to LRS. By applying a specific voltage (V0) to the negative terminal of the input memristors, the output memristors may experience a state change from LRS to HRS depending on the states of the inputs. The truth tables embedded in Fig. 3(a) show all possible cases of the input memristors' states and the switching of the output memristors. Fig. 3(b) shows how these operations are realized in crossbar memory.
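The MAGIC execution semantics just described can be summarized by a tiny behavioral model; the Python sketch below (our own abstraction, not a device-level model) treats LRS as 1 and HRS as 0 and reproduces the truth tables of Fig. 3(a):

```python
def magic_nor(*inputs):
    """Behavioral model of one MAGIC NOR step.
    The output memristor is first initialized to LRS (logical 1); applying V0 to the
    input memristors switches it to HRS (logical 0) unless every input is 0."""
    out = 1                      # initialization cycle: output memristor set to LRS
    if any(inputs):              # execution cycle: any '1' input flips the output to HRS
        out = 0
    return out

def magic_not(a):
    """NOT is the single-input special case of MAGIC NOR."""
    return magic_nor(a)

# Truth tables match Fig. 3(a): LRS = 1, HRS = 0.
assert [magic_nor(a, b) for a in (0, 1) for b in (0, 1)] == [1, 0, 0, 0]
assert [magic_not(a) for a in (0, 1)] == [1, 0]
```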
In MAGIC, NOR and NOT logical operations can be natively executed within memory with a high degree of parallelism. Inherently parallel architectures such as sorting networks can thus benefit greatly from IMC. In this work, to the best of our knowledge, we propose the first IMC architectures for high-performance and energy-efficient sorting of data in memory. We propose two different architectures. The first architecture, "Binary Sorting", is based on the conventional weighted binary representation and is applicable to the conventional systems that store the data in memory in binary format. The second architecture, "Unary Sorting", is based on the non-weighted unary representation and is useful for unary processing systems. For each one of these designs we first discuss the basic operation of sorting two n-bit precision data (i.e., a CAS block). We then elaborate the design of complete sorting networks made up of the proposed in-memory CAS units.

This paper is structured as follows. Sections II and III present the proposed in-memory Binary and Unary Sorting designs, respectively. Section IV compares the performance of the proposed designs with the conventional off-memory CMOS-based designs, and applies the proposed architectures to an important application of sorting, i.e., median filtering. Finally, conclusions are drawn in Section VI.

II. PROPOSED IN-MEMORY BINARY SORTING

In this section, we present our proposed method for in-memory sorting of binary radix data. First, we discuss the implementation of a basic sorting unit and then generalize the architecture to complete sort systems.

A. Basic Binary Sorting Unit

A CAS block in conventional binary radix requires one n-bit comparator and two n-bit multiplexers. While there are several prior designs for in-memory equality comparators [40], [41], to the best of our knowledge, Angizi et al. [42] develop the only prior near/in-memory magnitude comparator. Their design uses in-memory XOR operations to perform bit-wise comparison between corresponding bits of two data, beginning from the most significant bit towards the least significant bit. However, the comparison process involves reading the output of the XOR operations and the data from memory by the control unit (a near-memory operation). Therefore, its latency (i.e., number of processing cycles) is non-deterministic and depends on the data being compared. In this section, we propose an in-memory magnitude comparator with deterministic latency and no memory read operation (hence, a fully in-memory magnitude comparator).

In general, implementing an n-bit comparator requires roughly (11n−2) NOR and (7n−2) NOT logic gates. Figs. 4(a) and 5(a) show the generic logic and the NOR-based logic design of a 4-bit binary magnitude comparator. Fig. 5(b) shows our proposed MAGIC-based in-memory implementation of this comparator.

TABLE I
THE REQUIRED RESOURCES AND NUMBER OF PROCESSING CYCLES FOR THE PROPOSED BASIC BINARY SORTING UNIT

Data-Width | Dimension | Required Memristors | # of Initialized/Reused Memristors | # of Logical Operation Cycles | # of Copies [Copy Cycles] | Reused Init Cycles | Total # of Cycles | Energy (pJ)
4  | 4×14       | 40    | 44  | 21       | 6 (12)          | 6   | 39    | 199.4
8  | 8×22       | 88    | 88  | 25       | 14 (28)         | 10  | 63    | 417
16 | 16×38      | 184   | 176 | 33       | 30 (60)         | 18  | 111   | 845
32 | 32×70      | 376   | 352 | 49       | 62 (124)        | 34  | 207   | 1728
n  | n×(8+2n−2) | 12n−8 | 11n | 18+(n−1) | 2n−2 [2(2n−2)]  | n+2 | 6n+15 | –
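As a quick consistency check (our own, using the closed-form expressions exactly as they appear in the last row of Table I), the following Python snippet reproduces the numeric rows of the table:

```python
def basic_binary_sort_costs(n):
    """Closed-form resource/latency expressions from the last row of Table I."""
    return {
        "dimension": (n, 8 + 2 * n - 2),      # crossbar rows x columns
        "required_memristors": 12 * n - 8,
        "initialized_memristors": 11 * n,
        "logic_op_cycles": 18 + (n - 1),
        "copies": 2 * n - 2,                  # each copy takes two cycles
        "copy_cycles": 2 * (2 * n - 2),
        "reused_init_cycles": n + 2,
        "total_cycles": 6 * n + 15,
    }

# Matches the numeric rows of Table I, e.g. data-width 8: 8x22 crossbar, 88 memristors, 63 cycles.
c8 = basic_binary_sort_costs(8)
assert c8["dimension"] == (8, 22) and c8["required_memristors"] == 88 and c8["total_cycles"] == 63
```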

For the two multiplexers, we re-use the memory cells of the comparison step. To this end, we initialize the columns to LRS in one clock cycle. Fig. 6(b) shows our MAGIC-based in-memory design for the two 4-bit multiplexers the sorting circuit requires to select the maximum and minimum data.



Fig. 5. (a) NOR-based logic design of a 4-bit binary magnitude comparator. (b) MAGIC-based 4-bit binary in-memory comparator. The Gi memristor holds the output of the i-th gate. The Ci memristor copies the state of the Gi memristor. The second number shown on each memristor (e.g., 2 in G5,2) determines the processing cycle in which the memristor operates.


Fig. 6. (a) NOR-based logic design of a multi-bit binary 2-to-1 multiplexer and (b) in-memory MAGIC-based 4-bit binary multiplexer for Max/Min selection. The second number shown on each memristor (e.g., 3 in P0,3) determines the processing cycle in which the memristor operates.

B. Complete Binary Sort System

Fig. 7 shows the high-level flow of implementing an 8-input bitonic sorting network in memory. The memory is split into four partitions, namely partitions A, B, C, and D. The number of partitions is taken from the number of CAS units that can run in parallel (i.e., N/2). Each partition includes two out of the eight unsorted input data. The sorting process is split into six steps, equal to the number of CAS groups (stages). In the first step, the two inputs in each partition are sorted using the basic sorting operation proposed in Section II-A.

TABLE II
NUMBER OF PROCESSING CYCLES, SIZE OF CROSSBAR MEMORY, AND ENERGY CONSUMPTION (nJ) TO IMPLEMENT DIFFERENT BITONIC SORTING NETWORKS (DW = DATA-WIDTH, BL = BIT-STREAM LENGTH)

Binary Sorting
Network Size | DW = 4: Cycles, Size, Energy | DW = 8: Cycles, Size, Energy | DW = 16: Cycles, Size, Energy | DW = 32: Cycles, Size, Energy
4  | 128, 4×28, 1.2   | 200, 8×44, 2.5   | 344, 16×76, 5.1    | 632, 32×140, 10
8  | 280, 4×56, 4.7   | 424, 8×88, 10    | 712, 16×152, 20    | 1288, 32×280, 41
16 | 544, 4×112, 15   | 784, 8×176, 33   | 1264, 16×304, 68   | 2224, 32×560, 138
32 | 1048, 4×224, 47  | 1408, 8×352, 100 | 2128, 16×608, 205  | 3568, 32×1120, 415

Unary Sorting
Network Size | BL = 16: Cycles, Size, Energy | BL = 64: Cycles, Size, Energy | BL = 256: Cycles, Size, Energy | BL = 1024: Cycles, Size, Energy
4   | 26, 16×10, 1.37   | 26, 64×10, 5.4    | 26, 256×10, 21.88   | 26, 1024×10, 87
8   | 76, 16×20, 5.4    | 76, 64×20, 21     | 76, 256×20, 87      | 76, 1024×20, 350
16  | 194, 16×40, 18    | 194, 64×40, 72    | 194, 256×40, 291    | 194, 1024×40, 1168
32  | 538, 16×80, 54    | 538, 64×80, 218   | 538, 256×80, 875    | 538, 1024×80, 3503
64  | 1406, 16×160, 153 | 1406, 64×160, 613 | 1406, 256×160, 2452 | 1406, 1024×160, 9809
128 | 3624, 16×320, 408 | 3624, 64×320, 1635| 3624, 256×320, 6540 | 3624, 1024×320, 26159
256 | 9176, 16×640, 1051| 9176, 64×640, 4204| 9176, 256×640, 16817| 9176, 1024×640, 67268

In the second step, each maximum number (i.e., the larger number between the two in the partition) found by the sorting operations of the first step is copied to another partition where it is needed. The bitonic network determines the destination partition. For instance, the maximum found by executing the sorting operation in partition A (i.e., 7 in the example of Fig. 7) will be copied into partition B to be compared with the minimum number between the two initial data in partition B. Similarly, in each one of the next steps (i.e., steps 3 to 6), one output data from each partition is copied to another partition and a sorting operation is executed.

In each step, the sortings in different partitions are executed in parallel. After six steps and the execution of a total of 24 (= 4 × 6) basic sorting operations, the sorted data is ready in memory. Each basic sorting operation is implemented based on the in-memory basic binary sorting proposed in Section II-A. Table II shows the total number of processing cycles, the required size of crossbar memory, and the energy consumption of different sizes of in-memory bitonic networks. The number of processing cycles is found by [Number of Steps × (1 + Number of Processing Cycles to Execute One Basic Sorting Operation) + Number of Copy Operations]. The required size of crossbar memory is also found by [Data-Width × N/2 × Crossbar Size of One Basic Sorting Unit].
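The cycle-count expression can be made concrete with a short script. The sketch below is our own interpretation of the formula (in particular, we assume one operand copy per partition in every step after the first, at two cycles per copy, which reproduces the "Binary Sorting" entries of Table II); it is illustrative rather than a definitive model:

```python
import math

def bitonic_stages(num_inputs):
    """Number of CAS stages (steps) in an N-input bitonic network: log2(N)*(log2(N)+1)/2."""
    k = int(math.log2(num_inputs))
    return k * (k + 1) // 2

def in_memory_bitonic_cycles(num_inputs, basic_sort_cycles):
    """Cycle-count formula from Section II-B:
    steps * (1 + cycles of one basic sorting operation) + copy cycles,
    assuming each of the N/2 partitions copies one operand (2 cycles per copy)
    in every step after the first."""
    steps = bitonic_stages(num_inputs)
    copy_cycles = 2 * (num_inputs // 2) * (steps - 1)
    return steps * (1 + basic_sort_cycles) + copy_cycles

# Reproduces the 'Binary Sorting' cycle counts of Table II
# (one basic sorting unit takes 6n+15 cycles per Table I):
assert in_memory_bitonic_cycles(8, 6 * 8 + 15) == 424    # 8 inputs, data-width 8
assert in_memory_bitonic_cycles(32, 6 * 4 + 15) == 1048  # 32 inputs, data-width 4
```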

Fig. 7. High-level flow of 8-input bitonic sorting in memory.

III. PROPOSED IN-MEMORY UNARY SORTING

Unary (or burst) processing [19], [45], an unconventional computing paradigm first introduced in the 1980s, is re-emerging as an alternative paradigm to conventional binary due to advantages such as simplicity of design and immunity to noise [18], [46]–[53]. The paradigm has some characteristics common to stochastic computing [20], [21] but is deterministic and produces exact results. The unary designs are highly robust against noise, as multiple bit flips in a long unary bit-stream produce small and uniform deviations from the nominal value [18]. An extremely simple method for sorting of data based on unary processing was proposed recently [18], [54]. The method sorts unary data off-memory using simple standard CMOS logic gates. Given a unary system with the data stored in memory in unary format, a large portion of the latency and energy is spent on reading/writing from/to memory and transferring the bit-streams between the memory and the processing circuit. Considering the fact that the length of unary bit-streams increases exponentially with n or data-width, the high overhead of transferring data to the processing


Fig. 8. Example of performing maximum and minimum operations on unary bit-streams.

circuit becomes a major concern and a limiting factor for the efficiency of the system. In this section, we propose a novel method for sorting of unary data in memory to avoid the overheads of off-memory processing in the unary system. We first discuss the basic operation of sorting two unary bit-streams in memory and then elaborate the design of a complete unary sorting network.

A. Basic Unary Sorting Unit

The maximum and minimum functions are the essential operations in a basic sorting unit. Performing bit-wise logical AND on two same-length unary bit-streams gives the minimum of the two bit-streams. Bit-wise logical OR, on the other hand, gives the maximum of the two same-length unary bit-streams. Fig. 8 shows an example of the maximum and the minimum operation on two unary bit-streams. The example presents these operations in a serial manner by processing one bit of the input bit-streams at any cycle. While the serial approach is extremely simple to implement with only one pair of AND and OR gates, it incurs a long latency proportional to the length of the bit-streams. An n-bit precision data in the binary domain corresponds to a 2^n-bit bit-stream in the unary domain. This means a latency of 2^n cycles with a serial unit. Parallel sorting of two n-bit precision data represented using two 2^n-bit bit-streams requires performing 2^n logical AND operations (to produce the minimum bit-stream) and 2^n logical OR operations (to produce the maximum bit-stream) in parallel, as shown in Fig. 2(b). The compatibility of the memristive crossbar for running parallel logical operations in-memory makes it a perfect place for low-latency parallel sorting of unary bit-streams.

Fig. 9. Proposed in-memory unary sorting: (a) in-memory minimum operation and (b) in-memory maximum operation on two unary bit-streams.

Fig. 9 shows our proposed design for MAGIC-based in-memory execution of minimum and maximum operations on two unary bit-streams. As shown, implementing this sorting unit using MAGIC NOR and NOT operations requires a memristor crossbar (proportional to the length of the bit-streams) and a simple circuitry to generate control pulses. The unsorted unary data (i.e., the A and B bit-streams) is stored in two different columns. Both inputs have the same length of 2^n. As shown in Fig. 9(a), the AND operation (minimum function) is realized by first inverting the bit-streams by executing MAGIC NOT on the bit-streams, and then performing bit-wise MAGIC NOR on the inverted bit-streams. This effectively implements the AND operation as A ∧ B = ¬(¬A ∨ ¬B). The first and the second bit-stream are inverted in the first and the second cycle, respectively. The NOR operation is executed in the third cycle. As shown in Fig. 9(b), the OR operation (maximum function) is achieved by first performing MAGIC NOR on the input bit-streams and then MAGIC NOT on the outputs of the NOR operations. Hence, the execution of the OR operation takes two cycles. The columns that we use during the execution of the AND operation to store the inverted version of the bit-streams (e.g., the third and fourth columns in Fig. 9(a)) are re-used in the execution of the OR operation to avoid using additional memristors. In contrast to the proposed in-memory binary sorting of Section II-A, whose latency varies by changing the bit-width, the processing latency of the proposed unary sorting is fixed at five cycles and does not change with the data-width. Table III shows the required resources, number of cycles, and energy consumption of the proposed basic sorting unit for different bit-stream lengths. The number of memristors is directly proportional to the length of the bit-streams. In a fully parallel design approach, the size of the memory, particularly the number of rows, defines an upper limit on the maximum data-width for the to-be-sorted unary data. In such a system, bit-streams with a length longer than the number of rows can be supported by splitting each into multiple shorter sub-bit-streams, storing each sub-bit-stream in a different column,

TABLE III
THE REQUIRED RESOURCES, NUMBER OF PROCESSING CYCLES, AND ENERGY CONSUMPTION OF THE PROPOSED BASIC UNARY SORTING

Bit-Stream Length | Required Dim. | # of Initialized Memristors | # of Reused Memristors | # of NOR Operations | # of NOT Operations | Initial Cycles | Operation Cycles | Energy (pJ)
2^4  | 16×5   | 64   | 32   | 32   | 48   | 1 | 5 | 227
2^6  | 64×5   | 256  | 128  | 128  | 192  | 1 | 5 | 910
2^8  | 256×5  | 1024 | 512  | 512  | 768  | 1 | 5 | 3640
2^10 | 1024×5 | 4096 | 2048 | 2048 | 3072 | 1 | 5 | 14558
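The five-cycle schedule of Fig. 9 can be summarized in a few lines; the Python sketch below is a behavioral illustration (LRS modeled as 1, HRS as 0; the initialization of output columns is not modeled), not a circuit-level description:

```python
def magic_nor_column(*columns):
    """Bit-wise MAGIC NOR across whole columns (one parallel in-memory cycle)."""
    return [0 if any(bits) else 1 for bits in zip(*columns)]

def in_memory_unary_cas(a, b):
    """Five-cycle NOR/NOT schedule of Fig. 9:
    min = AND(a, b) = NOR(NOT a, NOT b); max = OR(a, b) = NOT(NOR(a, b))."""
    not_a = magic_nor_column(a)               # cycle 1: invert bit-stream A
    not_b = magic_nor_column(b)               # cycle 2: invert bit-stream B
    minimum = magic_nor_column(not_a, not_b)  # cycle 3: NOR of inverted streams -> minimum
    nor_ab = magic_nor_column(a, b)           # cycle 4: NOR of the original streams
    maximum = magic_nor_column(nor_ab)        # cycle 5: NOT of the NOR result -> maximum
    return minimum, maximum

a = [1, 1, 1, 1, 1, 0, 0, 0]   # unary 5/8
b = [1, 1, 1, 0, 0, 0, 0, 0]   # unary 3/8
lo, hi = in_memory_unary_cas(a, b)
assert sum(lo) == 3 and sum(hi) == 5
```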

TABLE IV
ENERGY CONSUMPTION (nJ) AND LATENCY (µs) OF THE IMPLEMENTED IN-MEMORY AND OFF-MEMORY BITONIC SORTING DESIGNS WITH DATA-WIDTH = 8 (E: ENERGY, L: LATENCY)

Design Method (Network Size)              | 8: E, L     | 16: E, L     | 32: E, L      | 64: E, L       | 128: E, L       | 256: E, L
Off-Memory Binary Sorting (+ Binary R/W)  | 850, 6.5    | 1,701, 13    | 3,403, 26     | 6,806, 52      | 13,613, 104     | 27,227, 209
Proposed In-Memory Binary Sorting         | 10, 0.55    | 33, 1.02     | 100, 1.8      | 281, 3.4       | 794, 6.8        | 1,927, 14
Off-Memory Unary Sorting (+ Binary R/W)   | 851, 6.5    | 1,703, 13    | 3,406, 26     | 6,811, 52      | 13,622, 104     | 27,244, 209
Off-Memory Unary Sorting (+ Unary R/W)    | 27,226, 210 | 54,452, 419  | 108,904, 839  | 217,809, 1,679 | 435,618, 3,358  | 871,236, 6,717
Proposed In-Memory Unary Sorting          | 87, 0.10    | 291, 0.25    | 875, 0.7      | 2,452, 1.8     | 6,540, 4.7      | 16,817, 12
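The average improvement factors quoted in Section IV-B (and in the abstract) can be recomputed directly from Table IV; the short script below (our own post-processing of the table entries, using a plain arithmetic mean of the per-size ratios) is one way to reproduce them approximately:

```python
# Energy (nJ) and latency (us) entries of Table IV for network sizes 8 to 256.
off_bin_E = [850, 1701, 3403, 6806, 13613, 27227]
in_bin_E  = [10, 33, 100, 281, 794, 1927]
off_bin_L = [6.5, 13, 26, 52, 104, 209]
in_bin_L  = [0.55, 1.02, 1.8, 3.4, 6.8, 14]
off_una_E = [27226, 54452, 108904, 217809, 435618, 871236]   # off-memory unary with unary R/W
in_una_E  = [87, 291, 875, 2452, 6540, 16817]
off_una_L = [210, 419, 839, 1679, 3358, 6717]
in_una_L  = [0.10, 0.25, 0.7, 1.8, 4.7, 12]

def mean_ratio(off, on):
    """Arithmetic mean of the per-network-size improvement factors."""
    return sum(o / i for o, i in zip(off, on)) / len(off)

print(f"binary energy reduction ~{mean_ratio(off_bin_E, in_bin_E):.0f}x")   # ~38x (quoted as 37x)
print(f"binary latency reduction ~{mean_ratio(off_bin_L, in_bin_L):.0f}x")  # ~14x
print(f"unary energy reduction ~{mean_ratio(off_una_E, in_una_E):.0f}x")    # ~138x
print(f"unary latency reduction ~{mean_ratio(off_una_L, in_una_L):.0f}x")   # ~1200x
```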

and executing the CAS operations in parallel. The sub-results will finally be merged to produce the complete minimum and maximum bit-streams. This design approach sorts the data with reduced latency as the primary objective. A different approach for sorting long bit-streams is to perform CAS operations on the sub-bit-streams in a serial manner by re-using the CAS unit(s). This approach reduces the area (number of used memristors) at the cost of additional latency. In this case, after sorting each pair of sub-bit-streams, the result is saved, and a new pair of sub-bit-streams is loaded for sorting. Assuming that each input bit-stream is split into N sub-bit-streams, the number of processing cycles to sort each pair of input data increases by a factor of N. Some additional processing cycles are also needed for saving each sub-output and copying each pair of sub-inputs. Combining the parallel and the serial approach is also possible for a further trade-off between area and delay. These approaches increase the range of supported data-widths but incur a more complicated implementation and partition management when implementing complete sorting networks.

B. Complete Unary Sort System

Implementing a bitonic sorting network in the unary domain follows the same approach as we discussed in Section II-B for the binary implementation of sorting networks. The number of sorting steps and the required number of basic sorting operations are exactly the same as those of the binary sorting network design. The essential difference, however, is that in the unary sorting system the data is in the unary format. Therefore, the basic 2-input sorting operation should be implemented based on the unary sorting unit proposed in Section III-A. Table II shows the number of processing cycles and the required size of memory for the implementation of unary bitonic networks of different sizes. We report the latency, area, and energy of these networks as well.

IV. COMPARISON AND APPLICATION

A. Circuit-Level Simulations

For circuit-level evaluation of the proposed designs we implemented a 16×16 crossbar and the necessary control signals in Cadence Virtuoso. For memristor simulations, we used the Voltage Controlled ThrEshold Adaptive Memristor (VTEAM) model [55]. The parameters used for the VTEAM model can be seen in Table V. We evaluated the designs in an analog mixed-signal environment by using the Spectre simulation platform with a 0.1 ns transient step. For MAGIC operations, we applied VSET = 2.08 V with a 1 ns pulse-width to initialize the output memristors to LRS. For the simplicity of the controller design, we consider the same clock cycle period of ∼1.25 ns and a V0 pulse-width of 1 ns for all operations. The V0 voltage for NOT, 2-input NOR, 3-input NOR, and 4-input NOR is 1.1 V, 950 mV, 1.05 V, and 1.15 V, respectively. We perform the copy operations by using two consecutive NOT operations.

To estimate the total energy of in-memory computations, we first find the energy consumption of each operation. The energy number measured for each operation depends on the states of the input memristors (i.e., LRS, HRS). We consider all possible cases when measuring the energy of each operation. For example, the 3-input NOR has eight possible combinations of input states. We consider the average energy of these eight cases as the energy of the 3-input NOR. The reported energy for the proposed in-memory sorting designs is the sum of the energy consumed by all operations. Our synthesis results show that the overhead energy consumption of the MAGIC peripheral circuits used to control the execution of operations is negligible compared to that of the in-memory operations.

B. Comparison of In- and Off-Memory Sorting

We compare the latency and energy consumption of the proposed in-memory binary and unary sorting designs with the conventional off-memory CMOS-based designs for the case of implementing bitonic networks with a data-width of eight.

TABLE V
PARAMETER VALUES USED IN THE VTEAM MODEL [55].

Parameter | Value        | Parameter | Value
Ron       | 1 kΩ         | xoff      | 3 nm
Roff      | 300 kΩ       | kon       | -216.2 m/sec
VTon      | -1.5 V       | koff      | 0.091 m/sec
VToff     | 300 mV       | αon       | 4
xon       | 0 nm         | αoff      | 4

TABLE VI
THE AVERAGE MEASURED ENERGY CONSUMPTION OF EACH OPERATION BASED ON THE VTEAM MODEL.

Operation                | Average Energy
memristor initialization | 2350 fJ
memristor copy           | 40.08 fJ
NOT                      | 20.04 fJ
2-input NOR              | 9.01 fJ
3-input NOR              | 37.24 fJ
4-input NOR              | 54.51 fJ

For a fair comparison, we assume that the to-be-sorted data is already stored in memristive memory when the sorting process begins and hence do not consider the delay for initial storage. We do not consider this latency because it is the same for both the proposed in-memory designs and their off-memory counterparts. For the case of off-memory binary designs, we assume 8-bit precision data is read from and written to a memristive memory. For the case of the off-memory unary design, we evaluate two approaches: 1) unary data (i.e., 256-bit bit-streams) is read from and written to memory, and 2) 8-bit binary data is read from and written to memory. For the second approach, the overhead of conversion (i.e., binary to/from unary bit-stream) is also considered. This conversion is performed off-memory using combinational CMOS logic [18]. The conventional CMOS-based off-memory sorting systems read the raw data from memory, sort the data with CMOS logic, and write the sorted data into memory. These read and write operations take the largest portion of the latency and energy consumption. We use the per-bit read and write latency and the per-bit energy consumption reported in [56] to calculate the total latency and energy of reading from and writing into memory. We multiply the per-bit latency and energy by the number of data and the bit-width of the data to find these numbers. For the proposed in-memory designs, the entire processing step is performed in memory, and so there are no read and write operations from and to the memory. For the off-memory cases we do not incorporate the transfer overhead between the memory and the processing unit as it depends on the media and distance. We implemented the off-memory processing units using Verilog HDL and synthesized them using the Synopsys Design Compiler v2018.06-SP2 with the 45nm NCSU-FreePDK gate library.

Table IV compares the results. As reported, the proposed in-memory designs provide a significant latency and energy reduction, on average 14× and 37×, respectively, for the binary design, and 1200× and 138× for the unary design, compared to the conventional off-memory designs. For the unary systems with the data stored in memory in binary format, the proposed in-memory design can reduce the latency and energy by a factor of up to 65× and 9.7×, respectively. For a realistic and more accurate energy consumption comparison, however, the overhead of transferring data on the interconnect between the memory and the processing unit must be added for the off-memory cases. We note that these numbers are highly dependent on the architecture of the overall system and the interconnects used. Therefore, different system architectures may substantially change these numbers; however, they do not change the fact that our proposed method is more advantageous. In fact, they only change the extent of this improvement (and further increase it). Hence, by eliminating them, we present the minimum improvement obtained by our method and leave the further improvement to the final implementation details of designers.

C. Application to Median Filtering

Median filtering has been widely used in different applications, from image and video to speech and signal processing. In these applications, the digital data is often affected by noise resulting from sensors or transmission of data. A median filter, which replaces each input data with the median of all the data in a local neighborhood (e.g., a 3×3 local window), is used to filter out these impulse noises and smooth the data [57]. A variety of methods for the implementation of median filters have been proposed. Sorting network-based architectures made of CAS blocks are one of the most common approaches [18]. The incoming data is sorted as it passes the network. The middle element of the sorted data is the median. We develop an in-memory architecture for 3 × 3 median filtering based on our proposed in-memory binary and unary sorting designs.

Fig. 10 depicts a high-level flow of memory partitioning for our in-memory 3 × 3 median filter design. Similar to our approach in implementing the complete sort system, the memory is split into multiple partitions, here partitions A, B, C, D, and E. Each sorting unit is assigned to one partition. Compared to a complete sorting network, a smaller number of sorting units is required as only the median value is targeted. All partitions except E include two out of the nine input data. The process is split into eight steps, each step executing some basic sorting operations in parallel. We implement these basic sorting operations using both the proposed binary and unary bit-stream-based in-memory architectures. Table VII reports the latency, required memristors, and energy consumption of the developed designs for (i) a single 3×3 median filter and (ii) a 3×3 median filter image processing system that processes images of 64×64 size. The corresponding latency and energy consumption of the off-memory CMOS-based binary and unary designs are also reported in Table VII. As can be seen, the proposed in-memory binary and unary designs reduce the energy by a factor of 13× and 6.6×, respectively, for the image processing system. Note that for the off-memory cases, we did not incorporate the overhead latency and energy of transferring data on the bus or other interconnects, which is a large portion of the energy consumption in transferring data between memory and processing unit [58]. Considering this overhead, our approach would have a significantly larger advantage over others in a complete system.
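For reference, the following Python sketch is a simple functional golden model of the 3 × 3 median filter described above (our own illustration; border handling is simplified and the CAS-level decomposition used by the in-memory designs is not modeled):

```python
def median3x3(image):
    """Reference 3x3 median filter: each interior pixel is replaced by the median of its
    3x3 neighborhood (borders are left unchanged here for brevity).
    The proposed in-memory designs compute the same result with CAS operations in eight steps."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            window = [image[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            out[i][j] = sorted(window)[4]   # middle element of the 9 sorted values
    return out

noisy = [[10, 10, 10, 10],
         [10, 255, 10, 10],    # impulse noise at (1, 1)
         [10, 10, 10, 10],
         [10, 10, 10, 10]]
assert median3x3(noisy)[1][1] == 10   # the impulse is filtered out
```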

TABLE VII
THE REQUIRED RESOURCES, LATENCY (L), AND ENERGY CONSUMPTION (E) OF THE IMPLEMENTED MEDIAN FILTER DESIGNS

Design          | Median Filter 3×3: Cycles, Area, E (nJ), L (ns) | 64×64 Image Processor: Cycles, Area, E (µJ), L (µs)
Proposed Binary | 544, 8×110, 8.5, 680                            | 4896, 208×1980, 35, 6.1
Proposed Unary  | 72, 256×25, 69, 90                              | 684, 2048×1425, 283, 0.81

Design            | Median Filter 3×3: E (nJ), L (µs) | 64×64 Image Processor: E (mJ), L (ms)
Off-Memory Binary | 121, 0.94                         | 0.49, 3.2
Off-Memory Unary  | 3,882, 30                         | 1.59, 104

Fig. 10. High-level flow of the 3×3 median filtering design.

V. DISCUSSIONS

Time delay and energy consumption are essential metrics in evaluating the efficiency of in-memory solutions. However, memristive technology is an emerging technology still in evolution, with many competing implementation methods in the process of maturation [59]. Properties such as time delay and especially power and energy consumption depend heavily on the technology used and change drastically from one to another. In this article, the experimental results are provided by simulation tools using the VTEAM model. Using different memristive technologies and models, LRS and HRS values, as well as programming or reading pulses with different amplitudes or widths, can substantially affect the actual delay and energy numbers reported here. However, we would like to point out that this is a secondary concern, since our main contributions are at the architectural level and the proposed in-memory sorting designs. The actual memristive implementation is a showcase of the feasibility and meaningfulness of such an in-memory sorting design. That is, there exists an in-memory implementation (namely, using the memristive technology we have used here) that is significantly beneficial. Therefore, we consider the properties and comparison of other ways of implementing our proposed architecture (using CMOS IMC, other memristor technologies, or other emerging memory technologies) as an exciting future work but outside the current article's scope. Nonetheless, we would like to point out that we have provided the number of memristors (memory cells) and the number of operation cycles, which are technology independent. Therefore, others can independently evaluate and compare their own implementations with ours in a technology-agnostic fashion, using the number of memory cells and the number of cycles their implementation needs (regardless of the LRS and HRS values or pulse amplitude and width).

The nonideality of memristors can influence the performance of the proposed system too (and, by extension, possibly the accuracy of sorting). However, as mentioned above, the main novelty and focus of this work is at the architectural level, and the details of implementation are meant to show the feasibility of implementation. The details of the circuit-level implementation and the memristive devices used can substantially change while maintaining the novelty and feasibility of the proposed architecture. A comprehensive study on the impact of the nonidealities of memristors is outside the scope of the current work since it requires its own venue where all the many details of such a study can be published. We plan to conduct such comprehensive studies in the future. However, one of the most important concerns in memristive crossbars, which we have already considered in our simulations, is the sneak paths problem. To eliminate the sneak paths effect, and following many prominent implementations in the literature [60]–[62], we use the 1T1R design in modeling each crossbar memristor cell. The extra transistor helps us to enable only the intended memristors while processing in memory and to avoid the sneak paths effect.

VI. CONCLUSION

Binary radix- and unary bit-stream-based designs of sorting systems have been previously developed. These designs are all developed based on the conventional approach of processing off-memory, incurring the high overhead of reading/writing from/to memory and transferring data between memory and processing unit. In this paper, for the first time to the best of our knowledge, we developed two methods for in-memory sorting of data: a binary and a unary sorting design. We compared the area, latency, and energy consumption of the basic and the complete sorting systems for different data-widths and network sizes. The latency and energy are significantly reduced compared to prior off-memory CMOS-based sorting designs. We further developed in-memory binary and unary designs for an important application of sorting, median filtering. In future works, we plan to extend the proposed architectures to other applications of sorting, for instance, efficient in-memory implementation of weighted and adaptive median filters. We will also explore applications of in-memory sorting in communications and coding domains. Evaluating the robustness of the proposed in-memory designs, particularly the proposed in-memory unary design, to process variations and nonidealities of the memristive memory is another important part of our planned future works.

REFERENCES

[1] G. Graefe, "Implementing sorting in database systems," ACM Comput. Surv., vol. 38, no. 3, Sep. 2006.

[2] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha, "GPUTeraSort: High performance graphics co-processor sorting for large database management," in SIGMOD, 2006, pp. 325–336.
[3] A. Colavita, E. Mumolo, and G. Capello, "A novel sorting algorithm and its application to a gamma-ray telescope asynchronous data acquisition system," Nuclear Inst. and Methods in Physics Research Sec. A, vol. 394, no. 3, pp. 374–380, 1997.
[4] D. C. Stephens, J. C. R. Bennett, and H. Zhang, "Implementing scheduling algorithms in high-speed networks," IEEE JSAC, vol. 17, no. 6, pp. 1145–1158, Jun 1999.
[5] V. Brajovic and T. Kanade, "A VLSI sorting image sensor: global massively parallel intensity-to-time processing for low-latency adaptive vision," IEEE Trans. on Robotics and Automation, vol. 15, no. 1, Feb 1999.
[6] P. Li, D. Lilja, W. Qian, K. Bazargan, and M. Riedel, "Computation on stochastic bit streams digital image processing case studies," IEEE TVLSI, vol. 22, no. 3, pp. 449–462, 2014.
[7] C. Chakrabarti and L.-Y. Wang, "Novel sorting network-based architectures for rank order filters," IEEE TVLSI, vol. 2, no. 4, pp. 502–507, Dec 1994.
[8] D. S. K. Pok, C. I. H. Chen, J. J. Schamus, C. T. Montgomery, and J. B. Y. Tsui, "Chip design for monobit receiver," IEEE Trans. on Microwave Theory and Techniques, vol. 45, no. 12, pp. 2283–2295, Dec 1997.
[9] D. Cederman and P. Tsigas, "GPU-Quicksort: A practical quicksort algorithm for graphics processors," ACM JEA, vol. 14, pp. 4:1.4–4:1.24, 2010.
[10] N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," in IPDPS, May 2009, pp. 1–10.
[11] G. Capannini, F. Silvestri, and R. Baraglia, "Sorting on GPUs for large scale datasets: A thorough comparison," Information Processing & Management, vol. 48, no. 5, pp. 903–917, 2012.
[12] R. Chen et al., "Computer generation of high throughput and memory efficient sorting designs on FPGA," IEEE Trans. on Paral. and Dist. Sys., vol. 28, no. 11, pp. 3100–3113, Nov 2017.
[13] S. Olarlu, M. C. Pinotti, and Si Qing Zheng, "An optimal hardware-algorithm for sorting using a fixed-size parallel sorting device," IEEE TC, vol. 49, no. 12, pp. 1310–1324, 2000.
[14] D. Koch and J. Torresen, "FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting," in FPGA, 2011.
[15] K. E. Batcher, "Sorting networks and their applications," in AFIPS. ACM, 1968.
[16] A. Farmahini-Farahani, H. J. D. III, M. J. Schulte, and K. Compton, "Modular design of high-throughput, low-latency sorting units," IEEE TC, vol. 62, no. 7, pp. 1389–1402, July 2013.
[17] S. W. Al-Haj Baddar and B. A. Mahafzah, "Bitonic sort on a chained-cubic tree interconnection network," JPDC, vol. 74, no. 1, pp. 1744–1761, 2014.
[18] M. H. Najafi, D. Lilja, M. D. Riedel, and K. Bazargan, "Low-cost sorting network circuits using unary processing," IEEE Trans. on VLSI Systems, vol. 26, no. 8, pp. 1471–1480, Aug 2018.
[19] W. Poppelbaum, A. Dollas, J. Glickman, and C. O'Toole, "Unary processing," in Advances in Computers, 1987, vol. 26, pp. 47–92.
[20] B. Gaines, "Stochastic computing systems," in Advances in Information Systems Science. Springer US, 1969, pp. 37–172.
[21] A. Alaghi, W. Qian, and J. P. Hayes, "The promise and challenge of stochastic computing," IEEE Trans. on Computer-Aided Design of Integ. Circ. and Sys., vol. 37, no. 8, pp. 1515–1531, Aug 2018.
[22] A. Alaghi and J. Hayes, "Exploiting correlation in stochastic circuit design," in Computer Design (ICCD), 2013 IEEE 31st International Conference on, Oct 2013, pp. 39–46.
[23] M. A. Zidan, J. P. Strachan, and W. D. Lu, "The future of electronics based on memristive systems," Nature Electronics, vol. 1, pp. 22–29, 2018.
[24] N. Taherinejad, P. D. S. Manoj, and A. Jantsch, "Memristors' potential for multi-bit storage and pattern learning," in 2015 IEEE European Modelling Symposium (EMS), Oct 2015, pp. 450–455.
[25] D. Radakovits and N. TaheriNejad, "Implementation and characterization of a memristive memory system," in 2019 IEEE 32nd Canadian Conference on Electrical and Computer Engineering (CCECE), May 2019, pp. 1–5.
[26] S. Hamdioui, S. Kvatinsky, G. Cauwenberghs, L. Xie, N. Wald, S. Joshi, H. M. Elsayed, H. Corporaal, and K. Bertels, "Memristor for computing: Myth or reality?" ser. DATE '17, 2017.
[27] L. Xie, H. A. Du Nguyen, J. Yu, A. Kaichouhi, M. Taouil, M. Al-Failakawi, and S. Hamdioui, "Scouting logic: A novel memristor-based logic design for resistive computing," in ISVLSI'17, 2017, pp. 176–181.
[28] N. Taherinejad, S. M. P. D., M. Rathmair, and A. Jantsch, "Fully digital write-in scheme for multi-bit memristive storage," in 2016 13th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Sept 2016, pp. 1–6.
[29] M. Imani, S. Gupta, and T. Rosing, "Ultra-efficient processing in-memory for data intensive applications," in DAC, 2017.
[30] S. Ganjeheizadeh Rohani, N. Taherinejad, and D. Radakovits, "A semiparallel full-adder in IMPLY logic," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 1, pp. 297–301, Jan 2020.
[31] S. G. Rohani and N. TaheriNejad, "An improved algorithm for IMPLY logic based memristive full-adder," in 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), April 2017, pp. 1–4.
[32] N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, "Logic design within memristive memories using memristor-aided logic (MAGIC)," IEEE TNANO, vol. 15, no. 4, pp. 635–650, 2016.
[33] A. Haj-Ali, R. Ben-Hur, N. Wald, and S. Kvatinsky, "Efficient algorithms for in-memory fixed point multiplication using MAGIC," in ISCAS, May 2018.
[34] N. TaheriNejad, T. Delaroche, D. Radakovits, and S. Mirabbasi, "A semi-serial topology for compact and fast IMPLY-based memristive full adders," in 2019 IEEE New Circuits and Systems Symposium (NewCAS), 2019, pp. 1–5.
[35] D. Radakovits, N. Taherinejad, M. Cai, T. Delaroche, and S. Mirabbasi, "A memristive multiplier using semi-serial IMPLY-based adder," IEEE Transactions on Circuits and Systems I: Regular Papers, pp. 1–12, 2020.
[36] M. Riahi Alam, M. H. Najafi, and N. TaheriNejad, "Exact in-memory multiplication based on deterministic stochastic computing," in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020, pp. 1–5.
[37] A. Siemon, S. Menzel, R. Waser, and E. Linn, "A complementary resistive switch-based crossbar array adder," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 5, no. 1, pp. 64–74, 2015.
[38] M. Teimoory, A. Amirsoleimani, J. Shamsi, A. Ahmadi, S. Alirezaee, and M. Ahmadi, "Optimized implementation of memristor-based full adder by material implication logic," in 2014 21st IEEE International Conference on Electronics, Circuits and Systems (ICECS), Dec 2014, pp. 562–565.
[39] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, "MAGIC—memristor-aided logic," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, no. 11, pp. 895–899, Nov 2014.
[40] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, "Compute caches," in 2017 IEEE Intern. Symp. on High Performance Computer Architecture (HPCA), Feb 2017, pp. 481–492.
[41] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, "A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory," IEEE Journal of Solid-State Circuits, vol. 51, no. 4, Apr. 2016.
[42] S. Angizi, Z. He, A. S. Rakin, and D. Fan, "CMP-PIM: An energy-efficient comparator-based processing-in-memory NN accelerator," in DAC, June 2018.
[43] B. Gedik, R. R. Bordawekar, and P. S. Yu, "CellSort: High performance sorting on the cell processor," in VLDB, 2007, pp. 1286–1297.
[44] S. Gupta, M. Imani, and T. Rosing, "FELIX: Fast and energy-efficient logic in memory," in ICCAD, Nov 2018, pp. 1–7.
[45] W. Poppelbaum, "Burst processing: A deterministic counterpart to stochastic computing," in Proceedings of the 1st Intern. Symp. on Stochastic Computing and its Apps., 1978.
[46] J. E. Smith, "Space-time algebra: A model for neocortical computation," in ISCA'18, 2018, pp. 289–300.
[47] A. H. Jalilvand, M. H. Najafi, and M. Fazeli, "Fuzzy-logic using unary bit-stream processing," in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), May 2020, pp. 1–5.
[48] S. Mohajer, Z. Wang, and K. Bazargan, "Routing magic: Performing computations using routing networks and voting logic on unary encoded data," in FPGA'18, 2018, pp. 77–86.
[49] S. R. Faraji and K. Bazargan, "Hybrid binary-unary hardware accelerator," IEEE Transactions on Computers, vol. 69, no. 9, pp. 1308–1319, 2020.
[50] A. Madhavan, T. Sherwood, and D. Strukov, "Race logic: A hardware acceleration for dynamic programming algorithms," in 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014, pp. 517–528.

[51] G. Tzimpragos, D. Vasudevan, N. Tsiskaridze, G. Michelogiannakis, A. Madhavan, J. Volk, J. Shalf, and T. Sherwood, "A computational temporal logic for superconducting accelerators," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 435–448. [Online]. Available: https://doi.org/10.1145/3373376.3378517
[52] M. H. Najafi, S. R. Faraji, K. Bazargan, and D. Lilja, "Energy-efficient pulse-based for near-sensor processing," in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020, pp. 1–5.
[53] D. Wu, J. Li, R. Yin, H. Hsiao, Y. Kim, and J. S. Miguel, "UGEMM: Unary computing architecture for GEMM applications," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 377–390.
[54] M. H. Najafi, D. J. Lilja, M. Riedel, and K. Bazargan, "Power and area efficient sorting networks using unary processing," in 2017 IEEE International Conference on Computer Design (ICCD), 2017, pp. 125–128.
[55] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, "VTEAM: A general model for voltage-controlled memristors," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 8, pp. 786–790, Aug 2015.
[56] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, "Design implications of memristor-based RRAM cross-point structures," in 2011 Design, Automation & Test in Europe, 2011, pp. 1–6.
[57] E. Nikahd, P. Behnam, and R. Sameni, "High-speed hardware implementation of fixed and runtime variable window length 1-D median filters," IEEE Tran. on Circ. & Sys. II: Express Briefs, 2016.
[58] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, "GPUs and the future of parallel computing," IEEE Micro, vol. 31, no. 5, pp. 7–17, 2011.
[59] N. TaheriNejad and D. Radakovits, "From behavioral design of memristive circuits and systems to physical implementations," IEEE Circuits and Systems (CAS) Magazine, pp. 1–11, 2019.
[60] H. J. Wan, P. Zhou, L. Ye, Y. Y. Lin, T. A. Tang, H. M. Wu, and M. H. Chi, "In situ observation of compliance-current overshoot and its effect on resistive switching," IEEE Electron Device Letters, vol. 31, no. 3, pp. 246–248, 2010.
[61] C. Li, D. Belkin, Y. Li, P. Yan, M. Hu, N. Ge, H. Jiang, E. Montgomery, P. Lin, Z. Wang, J. P. Strachan, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, "In-memory computing with memristor arrays," in 2018 IEEE International Memory Workshop (IMW), 2018, pp. 1–4.
[62] C. Li, M. Hu, Y. Li, H. Jiang, N. Ge, E. Montgomery, J. Zhang, W. Song, N. Dávila, C. E. Graves, Z. Li, J. P. Strachan, P. Lin, Z. Wang, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, "Analogue signal and image processing with large memristor crossbars," Nature Electronics, vol. 1, no. 1, pp. 52–59, Jan 2018. [Online]. Available: https://doi.org/10.1038/s41928-017-0002-z

M. Hassan Najafi received the B.Sc. degree in Computer Engineering from the University of Isfahan, Iran, the M.Sc. degree in Computer Architecture from the University of Tehran, Iran, and the Ph.D. degree in Electrical Engineering from the University of Minnesota, Twin Cities, USA, in 2011, 2014, and 2018, respectively. He is currently an Assistant Professor with the School of Computing and Informatics, University of Louisiana at Lafayette, Louisiana, USA. His research interests include stochastic and approximate computing, unary processing, in-memory computing, computer-aided design, and machine learning. In recognition of his research, he received the 2018 EDAA Outstanding Dissertation Award, the Doctoral Dissertation Fellowship from the University of Minnesota, and the Best Paper Award at the 2017 35th IEEE International Conference on Computer Design (ICCD).

Nima TaheriNejad (S'08-M'15) received his Ph.D. degree in electrical and computer engineering from The University of British Columbia (UBC), Vancouver, Canada, in 2015. He is currently a "Universitätsassistent" at the TU Wien (formerly known as Vienna University of Technology as well), Vienna, Austria, where his areas of work include computational self-awareness, resource-constrained cyber-physical embedded systems, systems on chip, in-memory computing, memristor-based circuits and systems, health-care, and robotics. He has published two books and more than 55 peer-reviewed articles. He has also served as a reviewer, an editor, an organizer, and the chair for various journals, conferences, and workshops. Dr. Taherinejad has received several awards and scholarships from universities and conferences he has attended. In the field of memristive circuits and systems, his focus has been on physical implementations, reliability, memory, logic, and in-memory computing architecture.

Mohsen Riahi Alam received his B.Sc. degree in Computer Engineering from Shiraz University, Shiraz, Iran, in 2011, and his M.Sc. degree in Com- puter Architecture from the University of Tehran, Tehran, Iran, in 2014. He was a Research Assistant with the Silicon Intelligence and Very Large-Scale Integration (VLSI) Signal Processing Laboratory, the University of Tehran from 2014 to 2016. He is currently a Research Scholar at the School of Com- puting and Informatics, the University of Louisiana at Lafayette, Louisiana, USA. His current research interests include in-memory computation, stochastic computing, low power and energy-efficient VLSI design, and embedded systems.