
Nonconventional Arithmetic Circuits, Systems and Applications

Leonel Sousa, Senior Member, IEEE

Abstract—Arithmetic plays a major role in a computer's performance and efficiency. Building new computing platforms supported by the traditional binary arithmetic and silicon-based technologies to meet the requirements of today's applications is becoming increasingly more challenging, regardless of whether we consider embedded devices or high-performance computing. As a result, a significant amount of research effort has been devoted to the study of nonconventional number systems to investigate more efficient arithmetic circuits and improved computer technologies to facilitate the development of computational units that can meet the requirements of applications in emergent domains. This paper presents an overview of the state of the art in nonconventional computer arithmetic. Several different alternative computing models and emerging technologies are analyzed, such as nanotechnologies, superconductor devices, and biological- and quantum-based computing, and their applications to multiple domains are discussed. A comprehensive approach is followed in a survey of the logarithmic and residue number systems, the hyperdimensional and stochastic computation models, the arithmetic for quantum- and DNA-based computing systems, and techniques for approximate computing. Technologies, processors and systems addressing these nonconventional computer arithmetic systems are also reviewed, taking into consideration some of the most prominent applications, such as deep learning or postquantum cryptography. In the end, some conclusions are drawn, and directions for future research on nonconventional computer arithmetic are discussed.

Leonel Sousa is with the Department of Electrical and Computer Engineering, INESC-ID, Instituto Superior Técnico, Universidade de Lisboa; Address: Rua Alves Redol, 9, 1000-029 Lisboa, PORTUGAL; e-mail: [email protected].

I. INTRODUCTION

The demise of Moore's Law and the waning of Dennard scaling, which respectively stated that the number of transistors on silicon chips would double every two years and that this increase in transistor density would be achieved at constant power consumption [1], mark the end of an era in which the computational capacity growth was mainly based on the downscaling of silicon-based technology. At the same time, the demand for data processing is higher than ever before as a result of the more than 2.5 quintillion bytes that are created on a daily basis [2]. This number continues to grow exponentially, and therefore, ingenious solutions must be developed to address the limitations of traditional computational systems. These innovative solutions must include developments not only at the technological level but also at the arithmetic, architectural and algorithmic levels of computing.

Although binary arithmetic has been successfully used to design silicon-based computing systems, the positional nature of this representation imposes the processing of carry chains, which precludes the exploitation of parallelism at the most basic levels and leads to high power consumption. Hence, the research on unconventional number systems is of the utmost interest to explore parallelism and take advantage of the characteristics of emerging technologies to improve both the performance and the energy efficiency of computational systems. Moreover, by avoiding the dependencies of binary systems, nonconventional number systems can also support the design of reliable computing systems using the newest available technologies, such as nanotechnologies.

A. Motivation

The Complementary Metal-Oxide Semiconductor (CMOS) transistor was invented over fifty years ago and has played a key role in the development of modern electronic devices and all that they have enabled. The CMOS transistor has evolved into nanodevices, with characteristic dimensions of less than 100 nm. The downscaling of the gate length has become one of the biggest challenges hindering progress in each new generation of CMOS transistors and integrated circuits. New device architectures and materials have been proposed to address this challenge, namely, the Fin Field-Effect Transistor (FinFET) multigate devices [5]. This type of nonplanar transistor became the dominant gate design from the 14 nm/10 nm generations onwards, used in a broad range of applications, ranging from consumer applications to embedded systems and high-performance computing [6].

Fig. 1: Emerging technologies and CMOS speed, size, cost, and switching energy (scales are logarithmic) [7]

Fig. 1 plots the cost, speed, size, and energy-per-operation relationship for the CMOS and other emerging nanotechnologies; all the scales are logarithmic, covering many orders of magnitude. Examples of these technologies include superconducting electronics, molecular electronics, resonant tunneling devices, quantum cellular automata, and optical switches.

Superconducting digital logic circuits use Single-Flux Quantum (SFQ) pulses, also known as magnetic flux quanta, to encode and process data. A Rapid Single Flux Quantum (RSFQ) device is a superconducting ring with a Josephson junction that opens and closes to admit or expel a flux quantum. The Adiabatic Quantum-Flux-Parametron (AQFP), another SFQ-based device, not represented in the figure, with characteristics that are analyzed later in this survey, drastically reduces the energy dissipation by using Alternating Current (AC) bias/excitation currents for both the clock signal and the power supply. This is a high-speed nanotechnology that offers low power dissipation and scalability. However, its main drawbacks include the need for expensive cooling technologies and for improved techniques in manufacturing the elements. Molecular electronics use a single molecule as the switching element. The configuration of the molecule determines its state, thereby "switching" on or off the current flow through the molecule. While conceptually appealing, the molecular process exhibits poor self-assembly techniques and low reproducibility. In Quantum Cellular Automata (QCA), cells are switched on and off by moving the position of the electrons from one cell to another. Classical physics does not apply to these technologies, which support, for example, quantum computing. While it is challenging to build general purpose computing engines based on these technologies, they are much faster on certain specific tasks. A last example is the integrated optical technology, which is based on having small optical cavities that change their properties to either store or emit photons – a zero or a one. It is a challenging approach, both from the fan-out and the cost points of view. As will be seen, computing-in-the-network circuits can be designed with integrated photonics switches.

Fig. 2: Emerging research information processing devices, highlighting the non-conventional data representation (adapted from [9])

These emergent technologies will become more and more relevant as the end of the CMOS era approaches [8]. Nevertheless, for the time being, there is no technology better than CMOS, as it provides a good balance between energy consumption, cost, speed and scalability, as shown in Fig. 1. Yet, better solutions can be identified in Fig. 1 along every individual axis. However, to design and implement processing systems based on these new technologies and devices, different solutions have to be found at various levels: from the circuits to the data representation and architectures, as depicted in Fig. 2. For the conventional scaled CMOS technology, the layers have been quite well established, and their characteristics were successfully explored from the material layer, silicon, up to the topmost layer, multicore architectures. Moreover, the conventional binary system has been dominant at the data representation layer. However, when considering emergent technologies or new architectures for designing these systems, data representation must also be reconsidered to properly adapt each system to the features of both the bottom and top layers. Data representation and the related computer arithmetic are the motivation behind this survey, which looks beyond the CMOS technology and conventional systems.

B. Structure of the survey

This survey analyzes the confluence of emergent technologies, nonconventional computer arithmetic, innovative computing paradigms and new applications. Similar to a jigsaw puzzle, although these pieces exist by themselves, it is only by connecting them in the right way that the whole picture can be derived and, in this particular case, the way to design efficient systems can be determined. Fig. 3 depicts the tight interconnections and dependencies of all these aspects in a matrix representation. In addition, this figure shows how these very important topics are addressed in this paper, highlighting the fact that nonconventional arithmetic exposes the characteristics of the underlying technology to assist in the development of efficient and reliable computing systems useful for different applications.

Fig. 3: Structure of the survey: sections and topics – Section II: Computer Arithmetic and Architectures [10]–[77] (RNS, LNS, Stochastic and Hyperdimensional computing; high performance, low power, reliability and noise robustness; residual arithmetic, HDC, nanophotonic circuits, the European Logarithmic Microprocessor, StoRM, PIM); Section III: Nanotechnologies [78]–[119] (quantum computing, approximate computing, DNA computing, hybrid memories, DNA/molecular biology, superconductors (AQFP), integrated photonics, CMOS invertible circuits); Section IV: Applications [120]–[158] (quantum-resistant cryptography, machine learning)

Section II introduces the nonconventional number systems, their main characteristics and the associated computer arithmetic. The Logarithmic Number System (LNS) and the Residue Number System (RNS) shorten the carry chain underlying binary arithmetic. While LNS compresses the number representation significantly, mitigating the problem of carry chains, it also lowers the complexity of the multiplication, division, and square-root arithmetic operations [11]. RNS [4] is a type of nonpositional number system in which integers are represented as a set of smaller residues that can be processed independently and in parallel. Stochastic Computing (SC) and Hyper-Dimensional Computing (HDC) introduce robustness through randomness in number representation. SC represents the probability of a bit taking the value '1' or '0' in a bitstream, regardless of its position [56], processing data bitwise with simple circuits. HDC operates on very large binary vectors, assuming that information is represented in a highly structured way using hyperdimensional vectors, as in the human brain, where details are not very important for data processing [70].

Section III is devoted to the analysis of new technologies and computing paradigms. Nanotechnologies, such as spin transistors, superconducting electronics, molecular electronics, and resonant tunneling devices, have emerged as alternative technologies to CMOS [1], while Deoxyribonucleic Acid (DNA)-based computing [163] and Quantum Computing (QC) [155] are new computing paradigms. It is shown that nonconventional arithmetic is fundamental not only in the design of efficient computer systems based on these new paradigms and technologies but also in the mitigation of some of their intrinsic disadvantages. For example, it is shown that the RNS is useful in designing energy-efficient integrated photonics-based computational units and in overcoming the negative effects caused by the instability of the biochemical reactions and the error hybridizations in DNA computing [113], as highlighted in Fig. 3. Furthermore, it is discussed how SC can be applied to mitigate the deep pipelining nature of the AQFP logic devices, while HDC may support computing in nanosystems through the heterogeneous integration of multiple emerging nanotechnologies [106].

In Section IV, two classes of applications are considered as examples of use cases that highly benefit from nonconventional arithmetic: machine learning and postquantum cryptography. While quantum-resistant cryptography makes use of RNS and SC, as shown in Fig. 3, machine learning applications may take advantage of all the classes of nonconventional arithmetic considered in this paper. For example, the Deep Learning (DL) framework that adopts AQFP to achieve high energy efficiency with superconductivity is an example of applying SC to machine learning on emerging technologies [79].

The goal of this paper is to provide a survey on nonconventional computer arithmetic circuits and systems, describing their main characteristics, mathematical tools and algorithms, and to discuss architectures and technologies for their implementation. We follow a tutorial approach on how to use nonconventional arithmetic to compute emergent applications and systems, and in the end, we draw some conclusions and ideas for future research work. We find that this work can be useful for engineers and researchers in computer arithmetic, in particular as a source of inspiration for doctoral students. We also provide more than one hundred and seventy references to the most important works related to nonconventional computer arithmetic, distributed according to the main topics and sections presented in Fig. 3.

II. NONCONVENTIONAL NUMBER SYSTEMS: ARITHMETIC AND ARCHITECTURES

This section introduces the LNS, RNS, SC and HDC nonconventional representations and the associated arithmetic properties. These properties are then leveraged in the design of the arithmetic units found in many processors.

A. Logarithmic number systems

The LNS has been proposed for both the fixed-point [10] and floating-point [11] formats. In the standardized IEEE 754 single-precision Floating Point (FP) format, a number P may assume the normalized absolute values in Table I. For a 24-bit mantissa, if the hidden bit is counted, the effective resolution is between one part in 2^23 and one part in 2^24, approximately 10^−7. Note that the boundary values of the exponent (0 and 255) are reserved for special cases.

TABLE I: IEEE 754 single-precision FP and LNS, 32-bit formats

FP:   S_FP (sign bit)  | E_FP (exponent): 8 bits  | F_FP (mantissa): 23 bits
LNS:  S_LNS (sign bit) | IT_LNS (integer): 8 bits | F_LNS (fractional): 23 bits

P = (−1)^S_FP × 1.F_FP × 2^(E_FP − 127) ;  p = (−1)^S_LNS × 2^(IT_LNS.F_LNS)

Absolute values:   minimum                      maximum
P (FP):   2^−126 ≈ 1.2 × 10^−38     2^+128 ≈ 3.4 × 10^+38
p (LNS):  2^−128 ≈ 2.9 × 10^−39     2^+128 ≈ 3.4 × 10^+38

FP is a "semi-logarithmic" representation with a fixed-point component – the significand or mantissa – that is scaled by a factor – the exponent – that is represented by its logarithmic value in base 2. In comparison, the same number P is represented in LNS by the 3-tuple (b, s, p), where b is the adopted logarithm base (without loss of generality, we consider the typical case b = 2), the s bit denotes the sign, and p = log_2 |P|. The exponent can be computed from the integer and fractional parts. This fixed-point format provides a range of numbers similar to that of the FP: the range is defined by the IT bits, while the width of F controls the precision. The logarithmic representation may be beneficial to applications with data and/or parameters that follow nonuniform distributions, such as the Convolutional Neural Networks (CNNs), as described in Section IV-A.

With LNS, operations with higher complexity, such as multiplication, division, squaring and square root, are easily performed by applying fixed-point addition, subtraction, and left/right bit-shifts, respectively. Considering unsigned operands P and Q, represented in LNS as p = log_2 P and q = log_2 Q, and BitShift(x, z), regarded as the left-shifted value of x by an integer z in fixed-point arithmetic, these operations are formulated in (1).

log_2(P × Q) = p + q ;
log_2(P / Q) = p − q ;
log_2(P^2) = 2 × p = BitShift(p, 1) ;
log_2(√P) = (1/2) × p = BitShift(p, −1) .   (1)

This simple logarithmic arithmetic comes at the cost of more complex additions and subtractions, which map into nonlinear operations. For two unsigned numbers P and Q (the sign can be handled in parallel), addition and subtraction can be calculated using Lemma 1, stated and proved below.
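To make the operations in (1) concrete, the following minimal sketch models them in software; it is illustrative only, using Python floats for the log-domain values p = log_2(P), whereas a hardware AU would hold these in fixed point, where the 2p and p/2 scalings reduce to single bit-shifts. All function names are hypothetical.

```python
# A software model of the LNS operations in (1); illustrative sketch only.
import math

def lns_mul(p, q):   return p + q        # log2(P*Q) = p + q
def lns_div(p, q):   return p - q        # log2(P/Q) = p - q
def lns_square(p):   return 2 * p        # log2(P^2) = 2p  (left shift in fixed point)
def lns_sqrt(p):     return p / 2        # log2(sqrt(P)) = p/2 (right shift)

P, Q = 12.5, 3.25
p, q = math.log2(P), math.log2(Q)
assert math.isclose(2 ** lns_mul(p, q), P * Q)
assert math.isclose(2 ** lns_div(p, q), P / Q)
assert math.isclose(2 ** lns_square(p), P ** 2)
assert math.isclose(2 ** lns_sqrt(p), math.sqrt(P))
```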

Lemma 1. For (P, Q) ∈ ℕ, and with Max as the function that returns the maximum value of a tuple of numbers, the following formulation can be adopted to compute addition and subtraction in the LNS domain (log_2(P ± Q)):

log_2(|P ± Q|) = Max(p, q) + log_2(1 ± 2^(−|p−q|))   (2)

Proof. For P ≥ Q, addition and subtraction can be calculated using (3):

log_2(P ± Q) = log_2(P × (1 ± Q/P)) = p + log_2(1 ± 2^(q−p))   (3)

while for P < Q:

log_2(Q ± P) = log_2(Q × (1 ± P/Q)) = q + log_2(1 ± 2^(p−q))   (4)

By combining (3) and (4):

log_2(|P ± Q|) = Max(p, q) + log_2(1 ± 2^(−|p−q|))   (5)

and Lemma 1 is proved. □

Note that the addition and subtraction operations employ the Gaussian logarithms represented in (6) and plotted in Fig. 4.

G = log_2(1 ± 2^λ) ,  λ = −|q − p|   (6)

Fig. 4: Plot of the functions log_2(1 + 2^λ) and log_2(1 − 2^λ) (the latter has a singularity when λ → 0)

For a modest number of bits, typically up to 20, the functions in Fig. 4 can be implemented through lookup tables [13]. However, given that the table size increases exponentially with the word length, table-lookup-and-addition methods can be adopted instead to reduce the burden of storing large tables in memory, such as bipartite tables and multipartite methods [14]. For a large number of bits, for example, 32 bits (Table I) or even 64 bits, a piecewise polynomial approximation [15] or the digit-serial methods [17], [18] can be used to implement LNS addition/subtraction. Digit-serial methods, also known as iterative methods, calculate the result digit by digit. Similarly to the COordinate Rotation DIgital Computer (CORDIC) method [16], an iterative algorithm can approximate the exponential and logarithmic computation with sequences of digit-serial computations [17]. Signed-digit arithmetic was proposed to simplify the exponential computation and reduce the number of pipeline stages required for LNS addition/subtraction [18]. Although the digit-serial methods require a low circuit area, they exhibit high latency.

The alternative piecewise polynomial approximation methods provide a trade-off between performance and cost. Linear Taylor interpolation has been used to implement 20-bit and 32-bit LNS Arithmetic Units (AUs) by applying a table-based error correction scheme to achieve an accuracy similar to that of their FP counterparts [19]. Although a first-order interpolation is used, the error correction step requires a multiplier, making the complexity similar to that of a second-order interpolator (as shown in Fig. 5 for an LNS AU, to be analyzed later in this section). Both Lagrange interpolation [164] and Chebyshev polynomial approximation [20] (second-order approximations) have been used to compute the Gaussian logarithms. Second-order minimax polynomial approximation has also been applied to compute the addition/subtraction in LNS [21]. Higher accuracy than what can be found in standard FP arithmetic is achievable [20].

The approaches referred to above for subtraction in the LNS domain work well, except in the critical region (λ ∈ [−1, 0]), due to the singularity of log_2(1 − 2^λ) when λ → 0 (see Fig. 4 between λ = −1 and λ = 0). Mathematical transformations, usually known as cotransformations, are used in this region. Using alternative functions defined in adjustable subdomains is one of these transformations. An example of a cotransformation is the replacement of the computation of log_2(1 − 2^λ) with two successive subtractions [22], as expressed in (7). In this equation, 2^k1 is individually chosen for each value, such that the index λ[1] for the inner subtraction falls on the nearest modulo-∆[1] boundary beneath λ; ∆[1] is fixed at a large value, with ∆ representing the interval at which table entries are stored. The value of the inner subtraction for λ[1] can therefore be obtained directly from a lookup table that covers the range −1 < λ < −∆[1] at modulo-∆[1] intervals. Since −∆[1] ≤ k1 < 0, 2^k1 ≈ 1, and with the magnitude of the index for the outer subtraction k2 < −1, the computation is shifted away from the critical region [22].

2^p − 2^q = (2^p − 2^(q+k1)) − 2^(q+k2) ,
2^k1 + 2^k2 = 1 ⇒ k2 = log_2(1 − 2^k1)   (7)

A different type of cotransformation decomposes the computation of G = log_2(1 ± 2^λ) according to (8). The implementation in [23] approximates the cotransformation function log_2((1 − 2^λ)/(1 + 2^λ)) in a limited range [1, 2).

log_2(1 − 2^λ) = log_2((1 − 2^λ)/(1 + 2^λ)) + log_2(1 + 2^λ)   (8)

LNS AUs have been used in the design of configurable and general purpose processors [21], [23]–[26].

The European Logarithmic Microprocessor (ELM) is a 32-bit scalar microprocessor with a fixed-point LNS-based AU (b = 2), with numbers represented by a sign bit and a fractional component of 23 bits [26]. This processor implements addition and subtraction by splitting the range of λ in (6) into segments for every increasing power of 2. In turn, these segments are split into smaller intervals, and the function F = log_2(1 ± 2^λ) and its derivative (D) are stored in memory for each of those intervals, as depicted in Fig. 5, with the values of intermediate points estimated with a Taylor interpolation. To calculate the error for each interval, the maximum error E is tabulated alongside F and D – this error corresponds to the case when a point falls near the next stored value tuple.

A separate table of proportions (P) stores the normalized shape of the common error curve. Each proportion value will be 0 – when the required value is at the beginning of a segment – or up to 1 – when it is estimated by the interpolation. The error is calculated by adding to the result of the interpolation the value of E × P. In Fig. 5, a range shifter is inserted immediately after the selection of p for the calculation of λ. If λ falls close to zero in the subtraction operation, p and λ will be transformed into new values, and λ will be moved to the linear region by applying the cotransformation (7) [15].

Fig. 5: LNS addition and subtraction unit

The ELM adopts a register-memory addressing-mode Instruction Set Architecture (ISA), with a single-level cache integrated into the pipeline. The architecture includes 16 general-purpose registers, an 8 kbyte L1 data cache, two adders/subtractors operating in three clock cycles, and four combined multiplier/divider/integer units operating in one clock cycle. Vector operations are implemented in the ELM by using four identical functional units in parallel, requiring a data cache organization in which four consecutive words are read out simultaneously.

The ELM is designed and fabricated in 0.18 µm technology [26]. Its performance and accuracy, running at 125 MHz, are evaluated against the contemporary TMS320C6711 Digital Signal Processor (DSP) [27], available in the same technology and running at 150 MHz. The TMS320C6711 is a Very Long Instruction Word (VLIW) processor that is able to issue eight independent instructions every cycle, with two pipelined FP adders and two other pipelined FP multipliers with a latency of four cycles, and two integer units with a latency of one cycle. It has been shown that the ELM is marginally more accurate than the FP processor for additive operators, while multiplicative operators return exact results. For a typical kernel, making equal use of each one of the operators, LNS will incur roughly half the total error of FP arithmetic. Moreover, notwithstanding the ELM running at 5/6 of the clock speed of the FP processor, the additive times were marginally better, the multiplications were 3.4 times faster, and the divisions and square roots were many times faster in the former than in the latter processor.
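As a software illustration of the table-based scheme described above, the sketch below tabulates F = log_2(1 + 2^λ) and its derivative D at fixed intervals and applies a first-order Taylor interpolation to compute an LNS addition. The interval width, the table range and the omission of the E × P error-correction term are simplifying assumptions of this sketch, not the ELM's actual segmentation.

```python
# First-order table interpolation of the Gaussian logarithm F(λ) = log2(1 + 2^λ).
import math

STEP = 1.0 / 64                       # interval width (illustrative)
LO = -32.0                            # below this, 2^λ is negligible

def F(lam): return math.log2(1 + 2 ** lam)
def D(lam): return (2 ** lam) / (1 + 2 ** lam)   # dF/dλ = 2^λ / (1 + 2^λ)

# Precompute the tables once, as a memory would store them.
N = int(-LO / STEP) + 1
F_tab = [F(LO + i * STEP) for i in range(N)]
D_tab = [D(LO + i * STEP) for i in range(N)]

def lns_add_lut(p, q):
    lam = -abs(p - q)                 # λ = −|p − q| ≤ 0
    if lam < LO:
        return max(p, q)              # addend negligible
    i = int((lam - LO) / STEP)        # table index
    dx = lam - (LO + i * STEP)        # offset inside the interval
    return max(p, q) + F_tab[i] + D_tab[i] * dx   # first-order Taylor

p, q = math.log2(12.5), math.log2(3.25)
print(2 ** lns_add_lut(p, q))         # ≈ 15.75 = 12.5 + 3.25
```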

Although LNS enables the design of faster and, typically, more accurate processors than those of FP arithmetic (the LNS implementation tends to have smaller worst-case relative errors than those of FP [28]), general purpose processors currently still adopt the standardized FP, which also easily interfaces with integer binary arithmetic.

Another LNS-based AU that adopts the hardware-efficient cotransformation (8) for subtraction in the critical region, computing the second term of this equation with a second-order polynomial, was proposed in [23]. The architecture of this LNS-based AU consists of four main blocks: i) the pre- and postprocessing blocks, including the preparation steps of the operations and the combination and/or selection of the results produced by the two main interpolation blocks; ii) the main interpolator block that implements the approximations for computing log_2(1 + 2^λ) and exponentiation on the complete range (−25, 0], and log_2(1 − 2^λ) outside the critical region, i.e., on (−25, −0.25], using second-order piecewise polynomial approximations; and iii) the critical region block used to calculate the logarithm of the cotransformation function (8) for the critical region (−0.25, 0] by approximating the cotransformation function (1 − 2^λ)/(1 + 2^λ) using a first-order polynomial and computing the required log_2 function based on a range reduction technique that normalizes the function argument to the range [1, 2). Since only one instruction is evaluated at a given time, the interpolator data paths within the two parallel blocks ii) and iii) can be shared to obtain a more compact design.

The LNS-based AU proposed in [23] was integrated into the datapath of a custom 32-bit OpenRISC core [29] using a CMOS 65 nm process. For comparison purposes, an identical system was designed featuring standard IEEE 754 single-precision FP AUs with hardware support for additions, subtractions and multiplications. The error analysis shows maximum and average values similar to the ones achieved for addition and subtraction with the ELM. Also executing at a clock frequency of 125 MHz, the adder and subtractor have a latency of four clock cycles, while the multiplier, divider and square-root units exhibit a latency of one clock cycle. The LNS-based adder/subtractor of this processor significantly reduces the circuit area required by the individual LNS units, mainly due to circuit sharing. Comparing the energy efficiency of the LNS-based AUs (per LOGarithmic Operation (LOGP)) and the FP-based AUs (per FLoating-Point Operation (FLOP)) for a CMOS 65 nm process, addition and subtraction in FP are approximately 3 times more efficient than in LNS (40 pJ/FLOP vs 130 pJ/LOGP), while multiplication in LNS is approximately 1.5 times more efficient (48 pJ/FLOP vs 30 pJ/LOGP), division is 17 times more efficient (525 pJ/FLOP vs 30 pJ/LOGP), and square root is 38 times more efficient than FP (609 pJ/FLOP vs 16 pJ/LOGP).

A combination of the LNS and FP representations has been explored as an approach to designing hybrid AUs [30]. It enables, for example, the multiply and divide operations to be computed using the LNS format, while addition and subtraction are efficiently performed in FP. The hardware for performing the FP-to-LNS and LNS-to-FP conversions is embedded within the hybrid AUs.

B. Residue number systems

The Chinese Remainder Theorem (CRT) was originally proposed in the 3rd century AD [31]. In the 1950s, it became the support for the proposal of computer arithmetic based on the RNS [32]. The CRT states that given a moduli set {m_1, m_2, ..., m_n} of pairwise co-prime numbers, i.e., with the Greatest Common Divisor GCD(m_i, m_j) = 1 for i ≠ j (⟨X⟩_m represents the residue of X for the modulus m):

{⟨X⟩_mi | 1 ≤ i ≤ n} is unique for any 0 ≤ X < M ,  M = ∏_{i=1}^{n} m_i   (9)

The CRT asserts that there is a bijection, formally described in Lemma 2.

Lemma 2. Let m_1, ..., m_n be n pairwise co-prime integers and M = ∏_{i=1}^{n} m_i. The CRT asserts that there is a bijection between Z_M and Z_m1 × ··· × Z_mn:

Z_M ↔ Z_m1 × ··· × Z_mn .   (10)

Proof. Consider n = 2 and two co-prime numbers m_1 and m_2. The modular inverses ⟨m_1^−1⟩_m2 and ⟨m_2^−1⟩_m1 exist since m_1 and m_2 are co-prime numbers [165]. Thus, the value Y satisfies the equation for X:

Y = ⟨x_1 m_2 ⟨m_2^−1⟩_m1 + x_2 m_1 ⟨m_1^−1⟩_m2⟩_{M = m_1 × m_2}

since ⟨Y⟩_m1 = x_1 and ⟨Y⟩_m2 = x_2. It remains to be shown that no other solutions exist modulo M, which means X ≡ Y (mod M). By contradiction, it is assumed that (X ≠ Y) < M and ⟨X⟩_mi = ⟨Y⟩_mi, which means ⟨X − Y⟩_mi = 0, i = 1, 2. Hence, (X − Y) is a multiple of the Least Common Multiple (LCM) of the m_i. However, since the m_i are pairwise co-prime, the LCM is M, which contradicts (X − Y) < M. We can therefore represent an element of Z_M by one element of Z_m1 and one element of Z_m2, and vice versa. The CRT can now be proven for an arbitrary number n of co-prime integers m_i, 1 ≤ i ≤ n, by induction. Defining M_i = M/m_i and the inverses ⟨M_i^−1⟩_mi, the previous equation for Y becomes:

X = ⟨∑_{i=1}^{n} x_i × M_i × ⟨M_i^−1⟩_mi⟩_M

and X is the unique solution in the ring Z_M. □

Moreover, for m_1, ..., m_n pairwise co-prime integers and M = m_1 × ··· × m_n, the CRT also asserts the following ring isomorphism [33]:

Z_M → Z_m1 × ··· × Z_mn
X → f(X) = (x_1, ..., x_n) = (⟨X⟩_m1, ..., ⟨X⟩_mn) ,   (11)

which allows carrying out the ring operations on the residues for each modulo independently.

Thus, RNS is a nonpositional number system based on smaller residues, the quantity and range of which are controlled by the chosen pairwise co-prime moduli set [34]. Each residue corresponds to a "digit", and "digits" can be processed in parallel. The direct application of the RNS properties to addition and multiplication expressed in (12) is widely disseminated in the design of efficient arithmetic, in particular for linear processing [35], [36].

⟨X ∘ Y⟩_M → {⟨x_1 ∘ y_1⟩_m1, ..., ⟨x_n ∘ y_n⟩_mn} ,  ∘ ∈ {+, −, ∗}   (12)

RNS has been proposed mainly for fixed-point arithmetic, also accommodating signed numbers in the range [−M/2 : M/2). The main limitation of RNS-based arithmetic is related to operations such as division, comparison, and sign and overflow detection, which are difficult to implement with non-weighted number representations. These require a combination of the residues to reach some type of positional representation, i.e., the partial or complete application of the RNS-to-binary (reverse) conversion, supported by the CRT or the Mixed Radix Conversion (MRC) [37], [38]. Although both approaches use modular operations, the CRT carries out the computation in a single step over a large modular sum (M_i = M/m_i, ⟨M_i^−1⟩_mi is the multiplicative inverse of M_i, and α is an integer) (13):

X = ⟨∑_{i=1}^{n} M_i ⟨M_i^−1 x_i⟩_mi⟩_M = ∑_{i=1}^{n} M_i ⟨M_i^−1 x_i⟩_mi − αM   (13)

On the other hand, the MRC uses shorter modulo arithmetic with an iterative procedure:

X = y_1 + m_1 × (y_2 + m_2 × (y_3 + m_3 × (···)))   (14)

y_1 = x_1
y_2 = ⟨(x_2 − y_1) × ⟨m_1^−1⟩_m2⟩_m2
y_3 = ⟨((x_3 − y_1) × ⟨m_1^−1⟩_m3 − y_2) × ⟨m_2^−1⟩_m3⟩_m3
···
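The short sketch below exercises the forward conversion, the digit-parallel channel operations of (12), and the reverse conversion by both the CRT (13) and the MRC (14). The moduli set {3, 5, 7} and the operand values are illustrative choices.

```python
# RNS demo: forward conversion, channel-parallel ops, CRT and MRC reverse conversion.
from math import prod

m = [3, 5, 7]
M = prod(m)

def to_rns(x):                        # forward conversion
    return [x % mi for mi in m]

def rns_op(xs, ys, op):               # "digit"-parallel channel operations, eq. (12)
    return [op(a, b) % mi for a, b, mi in zip(xs, ys, m)]

def crt(xs):                          # single large modular sum, eq. (13)
    Mi = [M // mi for mi in m]
    inv = [pow(Mi_i, -1, mi) for Mi_i, mi in zip(Mi, m)]
    return sum(Mi_i * ((inv_i * xi) % mi)
               for Mi_i, inv_i, xi, mi in zip(Mi, inv, xs, m)) % M

def mrc(xs):                          # iterative mixed-radix conversion, eq. (14)
    y1 = xs[0]
    y2 = ((xs[1] - y1) * pow(m[0], -1, m[1])) % m[1]
    y3 = (((xs[2] - y1) * pow(m[0], -1, m[2]) - y2) * pow(m[1], -1, m[2])) % m[2]
    return y1 + m[0] * (y2 + m[1] * y3)

X, Y = 23, 41
s = rns_op(to_rns(X), to_rns(Y), lambda a, b: a + b)
p = rns_op(to_rns(X), to_rns(Y), lambda a, b: a * b)
assert crt(s) == (X + Y) % M and mrc(s) == (X + Y) % M
assert crt(p) == (X * Y) % M
```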

Algorithms for RNS division [39], comparison [40], [41], sign detection [42], scaling [43], and reduction [44] have been proposed for specific moduli sets. Basis Extension (BE) is often used in those algorithms to extend the representation of an RNS moduli set (often called a basis) S_1 to another S_2. Consider the two bases S_1 = {m_1,1, ..., m_1,n1} and S_2 = {m_2,1, ..., m_2,n2} with dynamic ranges M_1 = ∏_{i=1}^{n1} m_1,i and M_2 = ∏_{i=1}^{n2} m_2,i, respectively. If an error of αM_1 is tolerated, the residues associated with the second basis can be computed as in [44]:

x'_2,i = ⟨X'⟩_m2,i = ⟨∑_{i=1}^{n1} M_i ⟨M_i^−1 x_i⟩_m1,i⟩_m2,i   (15)

For applications that do not tolerate this error, a method to perform the exact extension requires an extra modulus (m_e), with the corresponding residue x_e = ⟨X⟩_me, to enable the computation of the parameter α, for X < δM_2, δ ∈ ℕ, and m_e ≥ 2(δ + n_2 + 1) [166]:

α = ⟨(X' − x_e) ⟨M_1^−1⟩_me⟩_me   (16)

Once α is known, the basis extension is finally computed with (17).

x_2,i = ⟨∑_{i=1}^{n1} M_i ⟨M_i^−1 x_i⟩_m1,i − αM_1⟩_m2,i   (17)

Performing efficient RNS arithmetic requires the selection of the most suitable moduli set for the problem at hand. The size of the set and the width of the moduli are mainly defined by the dynamic range and the desired parallelism. Features usually considered to select the moduli are i) the efficiency of the modular arithmetic for each channel (the term "channel" is usually adopted to designate a residue and the modular arithmetic units associated with it), which typically leads to values near a power of two; ii) the balancing of the moduli set, since the slowest channel defines the overall performance; and iii) the cost of the reverse converters, with some moduli sets allowing efficient arithmetic channels at the cost of more complex reverse converters and others tailored to result in simple reverse conversion circuits due to the mathematical relations between the moduli [45]. Altogether, the most used moduli sets are composed of moduli 2^v ± k, with v ∈ {n, 2n} and k ∈ {−1, 0, +1} [35].

Alternative binary codings can be explored for the modular arithmetic in each RNS channel. Thermometer Coding (TC) and One-Hot Coding (OHC) are among some of the most used ones [46]. The value of a number in TC is expressed by the number of 1's in the string of bits [47], which means that it is a nonpositional redundant number representation. Typically, for simplicity, a run-length of 1's is placed at one end of the string. For the OHC, the only valid combinations in a string of bits are those with only a single bit '1' and all the others '0'; the position of the '1' bit directly indicates the value of the number. k and k + 1 bits are required to represent integers between 0 and k in TC and OHC, respectively. For example, for k = 7, the number 4 can be represented in TC by the stream "0001111", while in the OHC, it can be represented by the combination "00010000". The TC is often used in Digital-to-Analog Converters (DACs), as it leads to less glitching than binary-weighted codings. The OHC is used in Finite State Machines (FSMs), as it avoids the use of decoders: a single '1' bit directly identifies the state. The TC and OHC codings are most advantageous for RNS moduli that result in small and medium dynamic ranges [46]–[48].

Several RNS-based processors have been developed, for example, a fully programmable RISC Digital Signal Processor (RDSP) [167] and hardware/software RNS-based units [49]. The RDSP is a 5-stage pipeline processor with the RNS-based AU partially located in the third and fourth pipeline stages, as depicted in Fig. 6. The most popular 3-moduli set {2^n − 1, 2^n, 2^n + 1} is balanced by extending the binary channel to 2^2n; the reverse converter is distributed across the two aforementioned pipeline stages, while, for the multiply-accumulate (MAC) unit, the multiplier is located in the third stage, and the accumulator operates in parallel to the load/store units in the fourth pipeline stage. The experimental results obtained for the RDSP in the CMOS 250 nm technology show that not only are the circuit area and the power consumption significantly lower but also a speedup is obtained in comparison to a similar DSP using a data path based on binary arithmetic.

An AU based on a moduli set that comprises a larger modulus 2^k and all the other moduli of type 2^n − 1 assumes that the selected moduli fit the typical word size of a Central Processing Unit (CPU), for example, 64 bits [49]. Adopting a hardware-software approach, RNS adders, subtractors and multipliers are implemented in hardware, while the reverse conversion and nonlinear operations are implemented in software. A minimum set of instructions includes not only forward conversion (residue), modular addition (add_m) and modular multiplication (mult_m) but also instructions that operate on positional number representations, such as add, sub, AND and SHR (logical shift right). It supports changing the dynamic range at runtime in a simple manner, which in turn can result in a reduction in both power consumption and execution time.

Although the RNS was proposed mainly for fixed-point arithmetic, it has also been used for floating-point units [50], [51]. To operate on FP numbers, the mantissa and the exponent are individually converted and processed in the RNS domain. For example, to multiply two FP numbers, both mantissas are multiplied and the exponents are added in the RNS domain.

A variant of RNS, the Redundant Residue Number System (RRNS), addresses different purposes [52] but is mainly used for error detection and correction. Since the residues are independent of each other, by introducing redundant moduli in the set of original (information) moduli, the representation can be extended from the original range, usually called the legitimate range, to the illegitimate range to detect residue errors and, in some cases, correct them. If a result falls into the illegitimate range, it can be concluded that i) one or more residue digit errors exist, as long as the number of residue digit errors is not more than twice the number of redundant moduli, and ii) errors can be corrected by subtracting the error digits from the residue digits, assuming that the number of residue errors does not exceed the number of redundant moduli [35]. For example, applying the RRNS 3-moduli set {2^n − 1, 2^n + 1, 2^(n+1)} and considering the legitimate range [0, 2^2n − 2] defined by the moduli subset {2^n − 1, 2^n + 1}, a one-digit error can be detected in the illegitimate range [2^2n − 1, 2^(3n+1) − 2^(n+1) − 1]. It is worth noting that the independence of the residue digits in RRNS prevents a residue error in one channel from propagating to another.
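A minimal sketch of RRNS single-error detection is shown below for n = 4, i.e., the moduli set {15, 17, 32} (an illustrative instance of {2^n − 1, 2^n + 1, 2^(n+1)}): a corrupted residue digit makes the CRT-reconstructed value fall into the illegitimate range, where it is flagged.

```python
# RRNS single-error detection sketch for the illustrative moduli set {15, 17, 32}.
from math import prod

m = [15, 17, 32]                      # information moduli {15, 17} + redundant modulus
LEGIT = 15 * 17                       # legitimate range is [0, 254]

def crt(xs, mods):
    Mtot = prod(mods)
    acc = 0
    for xi, mi in zip(xs, mods):
        Mi = Mtot // mi
        acc += Mi * ((pow(Mi, -1, mi) * xi) % mi)
    return acc % Mtot

X = 200
residues = [X % mi for mi in m]
assert crt(residues, m) == X          # consistent: reconstruction is legitimate

residues[1] = (residues[1] + 3) % 17  # inject a single residue-digit error
corrupted = crt(residues, m)
print(corrupted, corrupted >= LEGIT)  # 2600 True: illegitimate range -> error detected
```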

Fig. 6: RDSP: RNS-based Reduced Instruction Set Computer (RISC) [167] – Address Generation Unit (AGU); Logic Unit (LU)

A computationally redundant processing core with high energy efficiency has been proposed [53]. The computationally redundant energy-efficient processing for y'all (CREEPY) core comprises microarchitecture-, ISA- and RRNS-centered algorithms and techniques to efficiently perform computational error correction. This work has shown significant improvements over a non-error-correcting binary processing core. Although there has been research on the design of cores based on the RRNS, the proposed microarchitectures abstract away the memory hierarchy and do not consider the power-performance impact of RNS-based memory addressing. Compared with a non-error-correcting core that addresses memory in binary, direct RNS-based memory addressing schemes slow down in-order/out-of-order cores. Novel schemes for RNS-based memory access, extending low-power and energy-efficient RRNS-based architectures, are proposed in [54].

C. Stochastic number representation

A stochastic signal is the result of a continuous-time stochastic process with two possible values, represented by the symbols 0 and 1 [55]. Numbers are represented by probabilities associated with bitstreams: p_i is the value of a stochastic bitstream I_i, defined as the ratio of 1 bits to the total number of bits. Considering the Unipolar Representation (UR), the stochastic bitstream values are bound to [0 : 1], while for the Bipolar Representation (BR), the values are bound to [−1 : 1], a result of applying the scale factor 2 after a negative bias term (18). For example, a stream I containing 60% 1s and 40% 0s, as depicted in Fig. 7, represents the numbers 0.6 and 0.2 in the UR and BR, respectively.

UR = p(i) ;  BR = 2 × (p(i) − 0.5) .   (18)

Fig. 7: Stochastic bitstream representing 0.6 in UR and 0.2 in BR

Neither the length nor the structure of I needs to be fixed. The number represented by a particular stream is not defined by the individual value of a bit in a specific position; thus, SC relies on a nonpositional number representation. The randomization of the representation, associated with the mapping from space to time, leads to SC arithmetic that can be implemented with simple logic circuits [56], [57].

For example, a single two-input logic gate can be used to compute the product of the probabilities p × q (Fig. 8). It is easy to show that for the UR, it is an AND gate, while for the BR, it is an eXclusive-NOR (XNOR) gate. A 2:1 multiplexer can be used to add two numbers in UR, as depicted in Fig. 8b. The addition is scaled by 1/2, as shown in (19).

p(O) = p(I_1) p(I_3) + (1 − p(I_3)) p(I_2)
p(I_3) = 1/2 ⇒ p(O) = (1/2) × (p(I_1) + p(I_2))   (19)

Fig. 8: Multiplication and addition on SC – (a) Multiplication on UR; (b) Scaled adder on UR; (c) Multiplication on BR
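The gate-level SC operators of Fig. 8 can be checked with a short behavioural simulation, as in the sketch below; the stream length, the seed and the operand values are illustrative choices.

```python
# Behavioural simulation of the SC primitives of Fig. 8.
import random

N = 100_000
rng = random.Random(42)

def ur_stream(p):                     # binary-to-stochastic, unipolar
    return [1 if rng.random() < p else 0 for _ in range(N)]

def value_ur(s):                      # stochastic-to-binary: count the 1s
    return sum(s) / len(s)

a, b = ur_stream(0.6), ur_stream(0.5)
mul = [x & y for x, y in zip(a, b)]                   # AND gate: p(O) = p(a)·p(b)
sel = ur_stream(0.5)                                  # select input with p = 1/2
add = [x if s else y for x, y, s in zip(a, b, sel)]   # MUX: (p(a) + p(b))/2
print(value_ur(mul))                  # ≈ 0.30
print(value_ur(add))                  # ≈ 0.55

# Bipolar multiply: encode x in [−1, 1] as p = (x + 1)/2 and use XNOR.
x, y = 0.2, -0.5
u, v = ur_stream((x + 1) / 2), ur_stream((y + 1) / 2)
xnor = [1 - (p ^ q) for p, q in zip(u, v)]
print(2 * value_ur(xnor) - 1)         # ≈ x·y = −0.10
```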

The correctness of the final result is impacted by the correlation, or lack of randomness, between stochastic bitstreams. For example, in the extreme case where the same bitstream feeds the two inputs of the AND gate in Fig. 8a, regardless of the bit values in the bitstream, for UR, p(O) = p(I_1) = p(I_2) instead of p(O) = p(I_1)^2 = p(I_2)^2. For the same bitstream feeding the two entries of the XNOR gate in Fig. 8c, the value of the product is p(O) = 1 for the BR. Hence, it is important to use "good" random generators.

However, even in systems where stochastic bitstreams are generated with high-quality random generators, after a few operations, the bitstreams tend to become correlated due to clock synchronicity. One way to overcome this limitation is to regenerate the bitstream [59]; however, the resource penalty impacts the benefit introduced by the compact arithmetic operators. A less costly alternative is to replace the global clock with individual ring oscillators for each synchronous component, which can be sensitive to process, temperature and voltage variation [58].

Fig. 9: Binary-to-SC and SC-to-binary converters – (a) Binary-to-stochastic; (b) Stochastic-to-binary

To use SC in typical computing systems, binary-to-stochastic and stochastic-to-binary converters are required to leverage SC in practical applications. These converters are presented in Fig. 9. The binary-to-stochastic conversion in Fig. 9a comprises the generation of an m-bit random binary number in each clock cycle and the comparison of the m-bit input binary number with the output of the (pseudo)random number generator. For UR, assuming that the random numbers are uniformly distributed over the interval [0, 1], the probability of a 1 being generated by the comparator for a given clock cycle is equal to the fractional number at the input of the converter. For the stochastic-to-binary conversion in Fig. 9b, the value p is carried by the number of 1s in its bitstream; thus, the conversion is performed by counting the number of 1s to obtain p and normalizing the result by the number of bits in the stream.

Arithmetic functions, even nonpolynomial functions using a MacLaurin expansion [60], can be implemented with SC with simple combinatorial elements. However, highly nonlinear functions such as the tanh and max functions, widely used in artificial neural networks as discussed in Section IV-A, require FSM-based SC elements [61].

Fig. 10: SC linear FSM with N = 2^k distinct states

In the state machine in Fig. 10, states are organized in sequence, as in a saturating counter [62]. The number of states is given by N = 2^n; thus, the set of states can be implemented by n flip-flops. X is the input of the state machine, and Y is the output (not shown in this figure), with the latter depending only on the state. Assuming that the input X is a stochastic bitstream, typically a Bernoulli sequence, the probability p(X) of a bit in the sequence being '1' follows a binomial distribution. With p(Y), the probability of a bit in the output stream Y being '1', and p_i, the probability of the state S_i given the input probability p(X), (22) can be derived. Equation (20) states that the probability of transitioning from S_{i−1} to S_i must equal the probability of transitioning from S_i to S_{i−1}, and (21) states that the sum of p_i over all N states should be unitary. In (22), s_i takes the value 1 if the output is Y = 1 in the current state S_i and 0 otherwise.

p_i (1 − p(X)) = p_{i−1} p(X)   (20)

∑_{i=0}^{N−1} p_i = 1   (21)

p(Y) = ∑_{i=0}^{N−1} s_i p_i   (22)

Lemma 3 enunciates the value of s_i such that tanh can be computed with SC.

Lemma 3. By inserting s_i as in (23) into (22):

s_i = { 0, 0 ≤ i < N/2 ;  1, N/2 ≤ i < N } ,   (23)

tanh is computed based on SC with (24):

z = tanh((N/2) x)   (24)

Proof. Starting from (20), assuming 0 < p_i < 1 and p_0 = 1, it is easy to derive (25):

p_i = (p(X) / (1 − p(X)))^i   (25)

and by applying (21) to normalize (25), we obtain (26):

p_i = (p(X)/(1 − p(X)))^i / ∑_{j=0}^{N−1} (p(X)/(1 − p(X)))^j   (26)

Based on the configuration in (23), (22) can be rewritten as (27):

p(Z) = ∑_{i=N/2}^{N−1} s_i p_i   (27)

and by substituting (26) in (27), and writing r = p(X)/(1 − p(X)), we obtain (28):

p(Z) = ∑_{i=N/2}^{N−1} r^i / ∑_{j=0}^{N−1} r^j = (r^(N/2) − r^N) / (1 − r^N) = r^(N/2) / (1 + r^(N/2))   (28)

Adopting the bipolar coding format, which means that the ranges of the real numbers x and z are extended to −1 ≤ α ≡ x, z ≤ 1, and assuming that the probability of a bit in the stream being one is P(α = 1) = (α + 1)/2, (28) leads to (29):

z = (((1 + x)/(1 − x))^(N/2) − 1) / (((1 + x)/(1 − x))^(N/2) + 1)   (29)

By using the Taylor expansion shown in (30), (29) is rewritten to compute tanh based on SC (31):

1 + x ≈ e^x ;  1 − x ≈ e^−x   (30)

z = (e^((N/2)x) − e^(−(N/2)x)) / (e^((N/2)x) + e^(−(N/2)x)) = tanh((N/2) x)   (31)

and Lemma 3 is proved. □

The stochastic implementation of other functions, such as the absolute value, the exponentiation, and the max/min functions, can be carried out using linear FSMs [61], [63]. These functions support, for example, the development of deep learning applications, as described in Section IV-A.
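A simulation of the saturating-counter FSM of Fig. 10 under the output assignment (23) illustrates Lemma 3: the decoded bipolar output approaches tanh((N/2)x). The state count, stream length and seed below are illustrative parameters, and the result is approximate by nature.

```python
# Simulation of the linear SC FSM computing z ≈ tanh((N/2)·x) in bipolar coding.
import math, random

def sc_tanh(x, n_states=8, length=200_000, seed=1):
    rng = random.Random(seed)
    p_x = (x + 1) / 2                 # bipolar encoding of the input
    state, ones = n_states // 2, 0
    for _ in range(length):
        bit = 1 if rng.random() < p_x else 0
        # saturating counter: move up on a 1, down on a 0
        state = min(state + 1, n_states - 1) if bit else max(state - 1, 0)
        # output s_i: 1 in the upper half of the states, 0 otherwise, eq. (23)
        ones += 1 if state >= n_states // 2 else 0
    return 2 * ones / length - 1      # decode the bipolar output

x = 0.3
print(sc_tanh(x), math.tanh(8 / 2 * x))   # both approximately 0.83
```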
perdimensional vectors, with a dimensionality on the order The characteristics of SC make it suitable, in particular, for of thousands of bits. It is an alternative to Support Vector designing embedded systems. It offers a lower circuit area Machines (SVMs) and CNN-based approaches for supervised and power consumption, as well as high error resilience, in classification. With Associative Memory (AM), a pattern X comparison to their binary system counterparts [64]. can be stored using another pattern A as the address, and X Since operations at the logic level are performed on ran- can be retrieved later from the memory address A. However, domized values, an attractive feature of SC is the degree of X can also be retrieved by addressing the memory with a tolerance to errors [65]. A stochastic bitstream of length m pattern A0 similar to A. represents numbers with a precision of 1/m. The error follows The high number of bits in HDC are not used to improve a normal distribution with zero mean and a variance that is in- the precision of the information to represent, as it would be versely proportional to the bitstream length N (σ √1 ) [56]. difficult to give an exact meaning to a 10, 000-bit vector in ≈ N For example, insufficient bitstream length or randomness can the same way a traditional computing model does. Instead, be leveraged in approximate computing [66], [67]. they ensure robustness and randomness by i) being tolerant to Several processors, more or less dedicated, have been devel- errors and component failure, which come from redundancy oped to apply SC to different applications, such as Artificial in the representation (many patterns mean the same thing and Neural Networks (ANNs), control systems, and the decoding thus are considered equivalent) and ii) having highly structured of modern error-correcting codes [67]. Stochastic Recognition information, as in the brain, to deal with the arbitrariness of and Mining processor (StoRM) uses SC to efficiently materi- the neural code. alize computational kernels to perform classification, search, HDC starts with vectors drawn randomly from hyperspace and inference on real data [68]. The StoRM architecture is a and uses AM to store and access them. A space of n = 2D array of stochastic Processing Elementss (PEs) (typically 10, 000-dimensional vectors and independent and identically 15 15) with a streaming memory hierarchy, mitigating the distributed (i.i.d.) components drawn from a normal distri- overhead× of binary-to-stochastic conversion by sharing the bution with mean µ = 0 can be considered as a typical 11 example. Points in the space can be viewed as corners of a These are useful properties of hyperdimensional arithmetic. 10, 000-dimensional unit hypercube, for which the distance By keeping the relation between the vectors, multiplication d(X,Y ) between two vectors X,Y is expressed using the preserves the relation between entities. Hamming metric. It corresponds to the length of the shortest Lemma 4. Multiplication is distributive over addition and path between the corner points along the edges (the distance implements a mapping that preserves distance, i.e., mapping is often normalized to the number of dimensions so that the two vectors (X,Y ) with the same vector M in the hyperdi- maximum distance, when the values of all bits differ, is 1). 
mensional space yields: Assuming that the distance (k) from any point of the space to a randomly drawn point follows a binomial distribution, with d(XM ,YM ) = d(X,Y ) (34) p(0) = 0.5 and p(1) = 0.5: 10, 000 10, 000! f(k) = 0.5k0.510,000−k = 0.510,000 Proof. If the bipolar ( 1, 1) representation is adopted, the k k!(10, 000 k)! − − (32) XOR operator is replaced by regular multiplication. Since it distributes over ordinary addition, it also does so in this bipolar with mean µ = 10, 000 0.5 = 5, 000 and variance V AR = case. The symbol in (35) denotes normalization. 10, 000 0.5 0.5 = 2,×500 (the standard deviation σ = 50). × × | | This means that if vectors are randomly selected, they differ M X + Y = M X + M Y (35) by approximately 5, 000 bits, with a normalized distance of ∗ | | | ∗ ∗ | 0.5. These vectors are therefore considered “unrelated” [70]. By mapping two vectors (X,Y ) with the same vector M, the Although it is intuitive that half the space has a normalized relation between the vectors is maintained as shown in (36): distance from a point of less than 0.5, this statement is XM YM = (M X) (M Y ) = X Y not true if we observe that it is only less than a thousand- ∗ ∗ ∗ ∗ ∗ XM YM d(XM ,YM ) = d(X,Y ) (36) millionth closer than 0.47 and another thousand-millionth k ∗ k ⇒ farther than 0.53. This distribution provides robustness to the and Lemma 4 is proved.  hyperdimensional space, given that “nearly” all of the space Permutation changes the order of vector components. is concentrated around the 5, 000 mean distance (0.5). Hence, The permutation can be described as a list of integers a 10, 000-bit vector representing an entity may see a large (1, 2, 3, , 10000) in the permuted order or represented as number of bits, e.g., a third, changing their values, by errors multiplication··· by a special kind of matrix. This permutation or noise, and the resulting vector still identifies the correct one matrix will have all ‘0’s except exactly one ‘1’ in each because it is far from any “unrelated” vector. row and each column. Similar to vector multiplication, when The vector representation of patterns enables the use of the a permutation is applied to different vectors, the distance body of knowledge on vector and matrix algebra to implement between them is maintained. As with sets, elements of a hyperdimensional arithmetic. For example, the componentwise sequence can be represented by a single hypervector in a addition of a set of vectors results in a vector with the same process called flattening. However, if sequences are flattened dimensionality that may represent that set. The sum-vector by the sum, the order of the elements will be lost. Thus, values can be normalized to yield a mean vector. Moreover, the elements must be “labeled” according to their position in a binary vector can be obtained by applying a threshold. If the sequence so that the value of X appears under different there are no duplicated elements in a set, the sum-vector is a variants depending on the position in the sequence, which possible representative of the set that makes up the sum [70]. can be done with permutations [70]. Sequences in time are Another important arithmetic operation in HDC is vector important to introduce memory in the systems, which can also multiplication, which, as for SC with BR, can be implemented be represented with pointer chains or linked lists [70]. with the bitwise logic XNOR operator (see Fig. 8c). Reverting The HDC has several known applications. 
For example, the order of the BR from (1, -1) to (0, 1), the ordinary sequence prediction based on sparse hyperdimensional coding multiplication of binary vectors X and Y can be implemented is used for online learning and prediction [71], which can by the bitwise eXclusive-OR (XOR). It is easy to show that be useful for predicting mobile phone use patterns, includ- XOR is commutative and that each vector is its own inverse ing the prediction of the next launched application and the (X X = 0), where 0 is the unit vector (X 0 = X). next GPS location of a user. Another example of an HDC Multiplication∗ can be considered as a mapping of∗ points in application is the learning and classification of biosignals, the space: multiplying vector X by M maps the vector into such as electromyography (EMG), electroencephalography a new vector X as far from X as the number of 1s in M (EEG), and electrocorticography (ECoG) [72]. A full set of M (represented by M ) (33). If M is a random vector, Hyper-Dimensional (HD) network templates comprehensively with half of the bitsk taking,k on average, a value 1, X M encodes body potentials and brain neural activity recorded is “unrelated” to X, and it can be said that multiplication from different electrodes within a single HD, processed as randomizes it. a single entity for learning and robust classification purposes. The diagram of a programmable processor based on HDC d(XM ,X) = XM X = M X X = M (33) k ∗ k k ∗ ∗ k k k for supervised classification [73] is presented in Fig. 11. This Lemma 4 shows that multiplication is distributive over HDC processor encompasses three main components, each addition, and it implements a mapping that preserves distance. corresponding to processing steps: the Item Memory (IM), 12
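The toy experiment below illustrates the HDC arithmetic just described: randomly drawn 10,000-bit vectors are nearly orthogonal (normalized distance ≈ 0.5), and both XOR binding and permutation preserve the Hamming distance, as in Lemma 4. The dimensionality, seed and the cyclic-shift permutation are illustrative choices.

```python
# Toy HDC demo: random hypervectors, XOR binding, permutation, Hamming distance.
import random

D = 10_000
rng = random.Random(7)

def rand_hv():
    return [rng.randint(0, 1) for _ in range(D)]

def dist(x, y):                       # normalized Hamming distance
    return sum(a ^ b for a, b in zip(x, y)) / D

def bind(x, y):                       # componentwise XOR multiplication
    return [a ^ b for a, b in zip(x, y)]

def permute(x, shift=1):              # a cyclic shift is one simple permutation
    return x[-shift:] + x[:-shift]

X, Y, M = rand_hv(), rand_hv(), rand_hv()
print(dist(X, Y))                     # ≈ 0.5: "unrelated" vectors
print(dist(bind(M, X), bind(M, Y)))   # = dist(X, Y): distance preserved, eq. (34)
print(dist(permute(X), permute(Y)))   # = dist(X, Y) as well
print(dist(X, bind(M, X)))            # ≈ 0.5: binding randomizes
```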

up the execution of applications, such as in cryptography, on traditional binary-based programmable systems (e.g., the mas- sively parallel Graphic Processing Units (GPUs)). However, RNS is not suitable for designing comparison units, dividers and other nonlinear operations. HDC is inspired by the attributes of neuronal circuits, in- cluding hyperdimensionality and (pseudo)randomness. When employed on tasks such as learning and classification, HDC involves the manipulation and comparison of large patterns (typically 10, 000-bit patterns) within memory. It allows sys- Fig. 11: The Generic HDC-based Processor: major compo- tems to retain memory, instead of computing everything they nents and dataflow [73]– Data Processing Unit (DPU) sense, pointing more to a kind of Processing In-Memory (PIM). The representation of the patterns, and the associated arithmetic, does not assign any weight to the “digits” according the Encoder and the AM. The IM stores a sufficiently large to their position in the pattern; thus, it is also a nonpositional collection of random hypervectors called items. This memory representation. A key attribute of hyperdimensional computing maps symbols to items in the inference phase, as it was learned is its robustness to the imperfections associated with the in the training phase. The Data Processing Units (DPUs) of the technology on which it is implemented. HDC improves with Encoder combines the input hypervector sequence according the advances in memory technologies, replacing computation to the application-specific algorithm to compose a single with the manipulation of patterns in memory. hypervector for each class. Finally, the AM stores the trained SC has shown promising results for low-power area-efficient class of hypervectors and delivers the best prediction according hardware implementations, even though existing stochastic to the Hamming distance (dh). The inputs of the processor algorithms require long streams that negatively impact the comprise the assignment of symbols to an item in the IM, and latency. Being another nonpositional representation, SC im- the outputs comprise the assignment of class labels to the AM proves the robustness of circuits subject to random faults address, all of which is enabled by the HD mapper. or component variability. Since with SC, multiplications and An Application Specific (ASIC) for the additions can be calculated using AND gates and multiplexers, HDC-based programmable processor, with 2048-bit vectors, significant reductions in power and the hardware footprint was synthesized and implemented using a 28 nm process can be achieved in comparison to the conventional binary with a high-k dielectric coupled with metal gate electrodes arithmetic implementations. The significant savings in power (HK/MG) [73]. The chip showcases high energy efficiency, consumption and hardware resources allow the design of less than 1.5 J/prediction for a comprehensive benchmark at autonomous systems, such as embedded devices or for com- clock cycle of approximately 2.5 ns. It rivals other state-of- putation on the edge. the-art processors targeting supervised learning, in terms of SC is a good example of a representation that easily allows both accuracy and energy efficiency. trading off the computation quality with the performance and energy consumption by adjusting the bitstream length. The idea of trading off accuracy with performance and energy effi- E. 
E. Discussion and approximate arithmetic techniques

The LNS swaps the cost of addition/subtraction with the cost of the multiplication/division operations, as well as powers and roots, but it still leads to a large cost difference between these two categories of operations. Although LNSs provide a similar range and precision to those of FP, beyond the simplification of multiplication and division to fixed-point addition and subtraction, the FP number systems have become a standard, while the LNS is mainly applied in niches. This occurs, for example, in signal processing and, most recently, in machine learning, as shown in the next sections.

RNS belongs to the class of nonpositional representations, exposing parallelism at a very fundamental computing level to achieve high performance, and at the same time it supports the design of reliable systems (fault tolerance). Although the usage of RNS allows the design of more energy-efficient systems, RNS does not guarantee a decrement in cost or power consumption, since the reduction operation has to be applied and the sum of the residue bit-widths is typically the bit-width of the equivalent fixed-point binary number. It can be applied not only to designing hardware devices but also to speeding up the execution of applications, such as in cryptography, on traditional binary-based programmable systems (e.g., the massively parallel Graphics Processing Units (GPUs)). However, RNS is not suitable for designing comparison units, dividers and other nonlinear operations.

HDC is inspired by the attributes of neuronal circuits, including hyperdimensionality and (pseudo)randomness. When employed on tasks such as learning and classification, HDC involves the manipulation and comparison of large patterns (typically 10,000-bit patterns) within memory. It allows systems to retain memory, instead of computing everything they sense, pointing more to a kind of Processing In-Memory (PIM). The representation of the patterns, and the associated arithmetic, does not assign any weight to the "digits" according to their position in the pattern; thus, it is also a nonpositional representation. A key attribute of hyperdimensional computing is its robustness to the imperfections associated with the technology on which it is implemented. HDC improves with the advances in memory technologies, replacing computation with the manipulation of patterns in memory.

SC has shown promising results for low-power, area-efficient hardware implementations, even though the existing stochastic algorithms require long streams that negatively impact the latency. Being another nonpositional representation, SC improves the robustness of circuits subject to random faults or component variability. Since, with SC, multiplications and additions can be calculated using AND gates and multiplexers, significant reductions in power and in the hardware footprint can be achieved in comparison to the conventional binary arithmetic implementations. The significant savings in power consumption and hardware resources allow the design of autonomous systems, such as embedded devices, or computation on the edge.

SC is a good example of a representation that easily allows trading off the computation quality with the performance and energy consumption by adjusting the bitstream length. The idea of trading off accuracy with performance and energy efficiency led to the approximate computing paradigm for designing circuits and systems [74]. This paradigm has been explored at several levels, from basic circuits to components, processing units, and programs. In particular, approximate arithmetic circuits have been proposed, including adders, multipliers and dividers [75]. Both the error characteristics, such as the "error rate" and the "error distance", and the circuit measurements, such as the delay, cost and power dissipation, should be considered for approximate circuits [75]. Approximate computing has been considered for several applications with error resilience, such as image processing and machine learning [76], [77]. Although most of the research on approximate computing up to now has been focused on the binary representation, the paradigm can be applied to the different representations discussed in this section.

III. ARITHMETIC: NEW TECHNOLOGIES AND ALTERNATIVE COMPUTING PARADIGMS

As referred to in the previous sections, although arithmetic units and processors have been designed in CMOS-based ASICs and Field Programmable Gate Arrays (FPGAs),

nonconventional number representations are also fundamental for developing computing systems based on new technologies and circuits, such as superconductor devices [79], integrated nanophotonics [94], nanoscale memristors and hybrid memories [98], [105], and invertible logic devices [78]. Moreover, new paradigms for designing computers with emergent technologies also cross nonconventional arithmetics to cover a wide range of applications. This section introduces not only those emergent technologies but also two of the most representative new computing paradigms, QC and DNA-based computing [80], [81]. This survey considers both the technologies and the new computing paradigms from the perspective of computer arithmetic.

The lowest limit for the energy required to process one bit of information is E_b = k_B × T × ln 2, where k_B is the Boltzmann constant and T the temperature in Kelvin (approximately 3 × 10⁻²¹ Joule for a temperature of 25 °C) [82]. However, it has been shown that this energy can be reduced, achieving even near-zero energy dissipation, by using reversible gates [82]. Reversible logic requires not only the reverse mode of operation but also a one-to-one mapping between inputs and outputs.

Fig. 12: Examples of reversible gates: (a) FG (CNOT); (b) TG; (c) HNG

Reversible gates have been implemented in nanotechnologies. Some of the most representative reversible gates are presented in Fig. 12 [81]. The Feynman gate (FG), also known as the Controlled-NOT (CNOT) gate, is a 2×2 reversible gate described by the equations P = A and Q = A ⊕ B, where A and B operate as the control and data bits, respectively. The Toffoli gate (TG) is a 3×3 reversible logic gate, with P = A, Q = B and R = C ⊕ (A ∧ B). A 1-bit Full Adder (FA) can be built with a single Haghparast and Navi gate (HNG) by considering the inputs A, B, Ci and D = 0. The carry-out (Co) bit that results from the addition of A, B and Ci is easily computed by (37), since the XOR of three terms is equal to their logic OR whenever the number of terms that assume a value of one is different from two, a situation that never occurs in (37).

Co = (A ⊕ B) ∧ Ci ⊕ A ∧ B = A ∧ Ci ⊕ B ∧ Ci ⊕ A ∧ B = A ∧ Ci + B ∧ Ci + A ∧ B    (37)
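The behavior of these gates is easy to check in software. The sketch below encodes the FG and TG from the equations above; since the full set of HNG output equations is not spelled out in the text, the common definition from the literature is assumed here. It verifies both the single-HNG full adder and the identity in (37):

```python
def FG(a, b):            # Feynman / CNOT gate: P = A, Q = A xor B
    return a, a ^ b

def TG(a, b, c):         # Toffoli gate: P = A, Q = B, R = C xor (A and B)
    return a, b, c ^ (a & b)

def HNG(a, b, c, d):     # Haghparast-Navi gate (assumed common definition)
    return a, b, a ^ b ^ c, ((a ^ b) & c) ^ (a & b) ^ d

# A 1-bit FA from a single HNG with D = 0: R is the sum, S the carry out.
for A in (0, 1):
    for B in (0, 1):
        for Ci in (0, 1):
            _, _, s, co = HNG(A, B, Ci, 0)
            assert s == (A ^ B ^ Ci)
            # Eq. (37): (A xor B).Ci xor A.B equals A.Ci + B.Ci + A.B
            assert co == ((A & Ci) | (B & Ci) | (A & B))

# Reversibility: the Toffoli gate is its own inverse (one-to-one mapping).
for x in range(8):
    a, b, c = (x >> 2) & 1, (x >> 1) & 1, x & 1
    assert TG(*TG(a, b, c)) == (a, b, c)
```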
Efficient reversible circuits have been implemented on QCA-based technologies [80]. Binary information is represented by means of Quantum Dots (QDs), semiconductor particles a few nanometers in size having optical and electronic properties that are central to nanotechnology [84]. When illuminated by ultraviolet light, an electron in the QD can be excited to a state of higher energy; for a semiconducting quantum dot, an electron can transit from the valence band to the conduction band. Reversible QCA-based gates have been applied to design arithmetic units [85], which can be implemented, for example, on DNA molecular structures [86], [87] or on quantum computing devices [88]. Moreover, reversible gates and QCA have been used to implement arithmetic units based on RNS [81], [89].

A. Superconductor logic devices

The AQFP superconductor logic family was proposed as a breakthrough technology towards building energy-efficient computing systems [90]. The dynamic energy dissipation of AQFP logic can be drastically reduced due to the adiabatic switching operations, by using AC bias/excitation currents for both the clock signal and the power supply.

Fig. 13: Circuit of an AQFP buffer gate – Josephson junctions (J1 and J2)

The circuit in Fig. 13, from [91], represents a typical simple AQFP gate that acts as a buffer. The Quantum-Flux-Parametron (QFP) is a two-junction Superconducting Quantum Interference Device (SQUID). It is based on superconducting loops containing two Josephson junctions (J1 and J2) and two inductors (L1 and L2), which are shunted by an inductor Lq in the center node. An excitation current Ix is applied, which changes the potential energy of the QFP from a single well to a double well. Consequently, the final state of the QFP, depending only on the direction of the input current Iin, is one of two states: the "0" state, with an SFQ in the left loop, or the "1" state, with an SFQ in the right loop (see Fig. 13). A large output current, in the direction defined by the final state of the QFP, is generated in Lq.

One major advantage of AQFP is its energy efficiency potential. For example, experimental results show that the energy dissipation per junction of an 8-bit AQFP-based Carry Look-ahead Adder (CLA) is only 24 k_B T [92] (approximately 100 × 10⁻²¹ Joule for a temperature of 25 °C), with correct operation demonstrated at a 1-GHz clock frequency.

The robustness of the AQFP technology has been demonstrated against circuit parameter variations, as well as the potential towards implementing very large-scale integrated circuits using AQFP devices. A more systematic and automatic synthesis framework, as well as synthesis results on a larger number of benchmark circuits, is presented in [90]. The automatic synthesis of 18 benchmark circuits, including 11 circuits from the ISCAS-85 benchmark suite and a 32-bit RISC-V Arithmetic Logic Unit (ALU), is performed using a standard cell library of the AQFP technology, with 4-phase clock signals.
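The energy figures quoted above can be verified with a few lines of arithmetic. The snippet below computes the Landauer bound E_b = k_B T ln 2 and the reported 24 k_B T per junction at 25 °C:

```python
import math

kB = 1.380649e-23  # Boltzmann constant, J/K
T = 298.15         # 25 degrees Celsius, in Kelvin

# Landauer limit per bit: E_b = kB * T * ln 2 (about 3e-21 J at 25 C)
Eb = kB * T * math.log(2)
print(f"Landauer limit: {Eb:.2e} J")   # ~2.85e-21 J

# Reported AQFP dissipation per junction for the 8-bit CLA: 24 kB*T [92]
E_aqfp = 24 * kB * T
print(f"24 kBT: {E_aqfp:.2e} J")       # ~9.9e-20 J, i.e., ~100e-21 J
```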


Fig. 14: Diagram of the states of a 2×2 HPP switch: (a) bar state; (b) cross state

Fig. 15: On calculating ⟨4 + 3⟩_5 = 2 with modulo-5 adders: (a) RNS shifting adder; (b) RNS All-to-all Sparse Directional (ASD) adder

The results compare an AQFP 10 kA/cm² standard cell library with a TSMC 12 nm FinFET one. The results show a consistent energy benefit of the AQFP technology, which is able to achieve an up to 10-fold improvement in energy consumption per clock cycle in comparison to the 12 nm TSMC technology [90].

Recent research has also highlighted the interest in nonconventional arithmetic for the AQFP technology. In addition to the ultra-high energy efficiency, it is pointed out that this technology has two characteristics that make it especially suitable for implementing SC [79]. One is its deep pipelining nature: since each AQFP logic gate is connected to an AC clock signal and thus requires a clock phase, it is difficult to avoid Read after Write (RAW) hazards in conventional binary computing. The other is the opportunity to obtain true Random Number Generation (RNG) using single AQFP buffers.

B. Integrated nanophotonics

Integrated nanophotonics is another emergent technology that takes advantage of nonconventional arithmetic, in particular RNS [93], [94]. The inherent parallelism of RNS and the energy efficiency of integrated photonics can be explored to build high-speed RNS-based arithmetic units supported on integrated photonics switches. RNS arithmetic is performed through the spatial routing of optical signals in 2×2 Hybrid Photonic-Plasmonic (HPP) switches, following a computing-in-the-network paradigm. The block diagram of these 2×2 HPP switches is represented in Fig. 14. They are fabricated using indium tin oxide (ITO) as the active index modulation material [95]. In the bar state, the light propagates straight into the bar arm (Fig. 14a), while in the cross state, the light is routed to the opposite (cross) arm output (Fig. 14b). A voltage signal is applied to the switch to control the guidance of the input light to the desired output port. Theoretically, an HPP switch could be operated at 400 GHz. However, the speed of the whole system is limited by the speed of the modulators and photodetectors, as well as by the electronic circuits used to control the routing [94].

As discussed before, RNS arithmetic is applied digitwise; thus, independently for each digit, a number of those switches are interconnected to perform the computation. This is necessary to make the systems scalable, since the number of switches grows with the square of the number of bits. For example, for modular addition, the shifting RNS adder explores the shifting property: a modulo m adder requires a set of m − 1 diagonally aligned HPP switches per each of the (m − 1) levels, with (m − 1)² switches in total. For example, Fig. 15a illustrates an adder modulo m = 5, requiring 4² switches organized in 4 levels. When computing x + y, x can be represented by the input light source, while y specifies electrically the number of levels in which the switches are in the cross state (in the example, levels 1 to 3 for adding 3). For this shift-level selection, TC is used to enable level switching in the RNS channels, as presented in Section II-B.

The ASD RNS adder instead uses a routing network as the computing device [93]. To compute x + y, x is represented by the input light source, and y, by the switch states defined by binary electrical signals. By storing the precalculated states that satisfy the all-to-all nonblocking routing network in a Look-up Table (LUT), the input light is routed to the resulting output port. Fig. 15b depicts a design for m = 5 and the light path used to calculate, as in the previous example, ⟨4 + 3⟩_5 = 2, now through an ASD network. A modulo-m ASD RNS adder requires (m − 2)²/2 + 2 switches (for odd m). Although the number of switches required by the ASD RNS adder is lower than that for the shifting RNS adder, more complex electronic control circuitry, typically including LUTs, is required.

Integrated nanophotonic RNS-based arithmetic processors exhibit a very short computational execution time, achieved by the optical propagation through an integrated nanophotonic router, on the order of dozens of picoseconds. In comparison to the All-Optical-Switch (AOS) technology, also used for RNS-based arithmetic [96], integrated nanophotonics requires a relatively small area due to the compact transistor size. The energy per operation, on the order of fJ/bit, has been shown to be orders of magnitude lower [93].
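A toy software model helps clarify the shifting-adder behavior: y selects how many cross-state levels cyclically shift the input port x. This is only a functional sketch of the routing described above, not a model of the optical hardware:

```python
def shifting_adder(x, y, m):
    # Each of the (m - 1) levels, when its switches are in the cross
    # state, shifts the light by one output port; y levels are active.
    port = x
    for _ in range(y % m):
        port = (port + 1) % m   # one cyclic shift per active level
    return port

m = 5
# The example from Fig. 15a: <4 + 3>_5 = 2
assert shifting_adder(4, 3, m) == (4 + 3) % m == 2

# Switch count for the shifting adder: (m - 1)^2 = 16 for m = 5.
print((m - 1) ** 2)
```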
C. Nanoscale memristors and hybrid memories

Several alternatives to the existing semiconductor memories, supported by advances in technologies and devices, have recently been proposed [97]. This has led, for example, to the re-emergence of PIM architectures based on Non-Volatile Memory (NVM) and 3D monolithic technologies to meet the demands of extensive data processing for current and next-generation computing. This subsection discusses how nonconventional arithmetic plays an important role in memory technologies, namely, to perform PIM.

Hybrid memories combine CMOS with non-CMOS devices in a single chip as one alternative to the existing semiconductor memories [98]. These memories use nanowires, carbon nanotubes, molecules and other technologies to design the memory arrays, while CMOS devices are used to form the peripheral circuitry. One of the first articles to investigate the application of RRNS to improve the reliability of hybrid memories was published in 2011 [98]. A new RRNS code was proposed for a 6-moduli set, where two moduli are nonredundant and four are redundant, and maximum likelihood decoding is applied to resolve the ambiguity that occurs for less than 10% of the decoded data.

The CREEPY core, referred to in Section II-B, also adopts the RRNS to achieve computational error correction for new devices with signal energies approaching the k_B T noise floor [53], such as the Ferroelectric Transistors (FEFETs) [99]. The 4T nonvolatile differential memory based on FEFETs is a good energy-efficient candidate for PIM [100].

Memristors, a contraction of the terms 'memory' and 'resistor', are devices that behave similarly to a nonlinear resistor with memory [101]. Memristors can be used to combine data storage and computation in memory, thus enabling non-von Neumann PIM architectures [102]. CNNs have been fully hardware-implemented using memristor crossbars, which are cross-point arrays with a memristor device at each intersection [103]. The processing-in-memory-and-storage (PIMS) project also adopts the RRNS for memristor-based technologies [104]. Moreover, it has been shown that SC is adequate to perform PIM on nanoscale memristor crossbars, allowing the extension of the flow-based crossbar computing approach to approximate stochastic computing [105]. Applications of in-memory stochastic computing to a gradient descent solver and a k-means clustering processor can be found in [57].

An end-to-end brain-inspired HDC nanosystem using the heterogeneous integration of multiple emerging nanotechnologies was proposed in [106]. It uses the monolithic 3D integration of Carbon Nanotube Field-Effect Transistors (CNFETs) and Resistive Random-Access Memory (RRAM). Due to their low fabrication temperature, the (1,952) CNFETs and (224) RRAM cells are integrated with fine-grained and dense vertical connections between the computation and storage layers. Integrating RRAM and CNFETs allows creating area- and energy-efficient circuits for HDC.

The Multi-Level Cell (MLC) is used to store multiple bits in a single RRAM cell to reduce the hardware overhead. However, the accuracy of ANNs deteriorates due to the resistance variations, which is quite relevant in binary coding. The error-tolerance feature of SC can also be explored to mitigate the reliability problem of the RRAM MLC technology [107].

D. Invertible logic circuits

Invertible logic provides a reverse mode of operation, for which the output is fixed and the inputs take on values consistent with that output [78]. It is less restrictive than reversible logic. For example, the two-input OR logic gate is invertible if, when the output is kept at c = 1, the inputs (a, b) alternate equally, with a probability of 33%, between the values (1, 1), (0, 1), and (1, 0); however, it is not reversible.

Boltzmann machine structures [83] have been proposed as the structures underlying invertible logic circuits [78]. These machines can be represented as networks or graphs of simple interconnected processing elements (e.g., Fig. 16a for a two-input AND gate Y = A ∧ B), with signals taking the binary values +1 or −1. Every node in the undirected graph is fully connected to all the others through bidirectional links, and an individual bias value is assigned to each node. The binary state and the i-th node output (m_i) are computed by (38): after calculating the weighted sum of all the input connections (matrix J) and adding the bias terms (h), the nonlinear activation function (tanh) is applied. A noise source rnd(−1, +1) (a uniformly distributed random real number between −1 and +1) is introduced to avoid getting trapped in local minima, and, finally, the sign function (sgn) provides the binary output, +1 or −1. I_0 is a scaling factor (an inverse pseudo-temperature). States are categorized as valid and invalid: if the nodes are unconstrained, the system fluctuates among all the possible valid states. A logic value of '0' is assigned to m_i(t) = −1, and '1' is assigned to m_i(t) = +1.
m_i(t + Δt) = sgn( rnd(−1, +1) + tanh(I_i(t + Δt)) );  I_i(t + Δt) = I_0 ( h_i + Σ_j J_ij m_j(t) )    (38)

Fig. 16: Boltzmann machine structure for the AND gate [78]: (a) three nodes, where A and B denote the binary inputs and Y the output; (b) the corresponding parameters of (38), h = [+1 +1 −2] and J = [[0 −1 +2]; [−1 0 +2]; [+2 +2 0]]

The values of h and J for the AND gate are shown in Fig. 16b. The h vector elements correspond to the bias weights close to the nodes in the graph (in the order h_0 for A, h_1 for B, and h_2 for Y) (Fig. 16a). Matrix J, which, as expected, is symmetric, represents the weights of the bidirectional connections, also following the order A, B, Y. The gate is invertible; thus, for example, when Y is clamped to 1, the only valid state is (A = 1, B = 1).
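A small simulation of (38) with the Fig. 16b parameters illustrates the invertible operation. This is a sketch only: the value of I_0 below is an assumed choice, since the text does not fix it, and the update schedule is simplified:

```python
import numpy as np

rng = np.random.default_rng(1)

# AND-gate Boltzmann machine from Fig. 16b (node order: A, B, Y).
h = np.array([+1.0, +1.0, -2.0])
J = np.array([[0.0, -1.0, +2.0],
              [-1.0, 0.0, +2.0],
              [+2.0, +2.0, 0.0]])
I0 = 2.0  # scaling factor (inverse pseudo-temperature), assumed value

def step(m, clamped):
    # Eq. (38): I_i = I0*(h_i + sum_j J_ij m_j); the noise term avoids
    # getting trapped in local minima; sgn gives the binary output.
    I = I0 * (h + J @ m)
    new = np.sign(rng.uniform(-1, 1, 3) + np.tanh(I))
    new[new == 0] = 1
    for i, v in clamped.items():      # clamped nodes keep their value
        new[i] = v
    return new

# Reverse mode: clamp the output Y to +1 (logic '1') and let the inputs
# fluctuate; the dominant state should be A = B = +1.
m = np.array([-1.0, -1.0, +1.0])
counts = {}
for _ in range(2000):
    m = step(m, {2: +1.0})
    key = (int(m[0]), int(m[1]))
    counts[key] = counts.get(key, 0) + 1
print(max(counts, key=counts.get))    # expected: (1, 1)
```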
The organization of the nodes, which operate in parallel, and their type, simple processing elements with binary inputs and outputs, make Boltzmann machines well suited to the SC nonconventional arithmetic discussed in the previous section [78]. In Fig. 17, an FSM state, as shown in Fig. 10, is stored in the accumulator: for the highest state of the FSM (S_{N−1}), a positive input will not cause any increase in the value stored in the accumulator; the inverse also holds true for negative input signals and the lowest state. The scaling values in (38) are included in the weight w_i terms in Fig. 10.

SC can thus be used to implement arithmetic functions based on Boltzmann machines. A graph of the Boltzmann machine for an invertible 1-bit FA is represented in Fig. 18; A, B and Ci are the inputs, and Co and S are the outputs of the 1-bit FA [78]. In this case, the h vector elements in Fig. 18b are all zero. The symmetric matrix J follows the order A, B, Ci, S, Co.

Fig. 18: Boltzmann machine structure for the 1-bit FA [78]: (a) five nodes, where A, B and Ci denote the binary inputs, and S and Co the outputs; (b) the corresponding parameters of (38), h = [0 0 0 0 0] and J = [[0 −1 −1 +1 +2]; [−1 0 −1 +1 +2]; [−1 −1 0 +1 +2]; [+1 +1 +1 0 −2]; [+2 +2 +2 −2 0]]

Invertible multipliers can also be implemented, which can be used for factorization. A 5-by-5-bit optimized invertible multiplier/divider/factorizer occupies 53,818 µm², which is 13 times the area of its conventional binary logic counterparts [78]. Regarding the latency, the number of convergence cycles differs depending on the outputs, but it is orders of magnitude smaller than the alternatives for invertible operations. This observation also demonstrates the ability to design circuits capable of invertible operations with more complex behaviors.

Fig. 17: Simplified processing element using SC [78]: w_i represents the factor associated with the I_i inputs, and w_rnd represents the weight associated with the noise source

E. DNA-based computing

The DNA found in living cells is composed of four bases: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Each base is attached to its neighbor base in the sequence via phosphate bonding. Two DNA sequences bond with each other between complementary base pairs (A-T and C-G), forming the DNA double helix. During the formation of a DNA double strand, two complementary single strands bond with each other in an antiparallel fashion.

The first proposals for representing binary numbers with DNA and performing arithmetic operations emerged in the nineties [108]. However, the first addition algorithms provided the output in a different form than the input, thus preventing the iterative application of operations to implement arithmetic systems. Moreover, the earlier DNA computing models do not support addresses and operations on single strands, thus not allowing arithmetic operations to be performed through parallel processing [109].

More recently, models such as the Sticker-based DNA model have been proposed [110]. The Sticker-based DNA model is used to represent fixed-point and floating-point numbers, overcoming the limitations of the previous models and allowing iterative and parallel processing to be carried out [111].

Fig. 19: Memory strands, stickers and the DNA-based number representation

With the DNA sticker model, a binary number is represented through two groups of single-stranded DNA molecules: i) the memory strand, a long DNA molecule that is subdivided into nonoverlapping segments, and ii) a set of stickers, i.e., short DNA molecules, each with the length of a segment of the memory strand. Each one of the nonoverlapping segments represents a bit, and a sticker is complementary to one and only one of the nonoverlapping segments. As depicted in Fig. 19, a segment on the memory strand annealed to the matching sticker represents a "1"; otherwise, the value of the bit is "0".

Test tubes contain sets of strands representing binary numbers, either the operands or the results of arithmetic operations. The functional formulation of the operations on the Sticker-based DNA model allows the setup of algorithms for performing DNA-based computing [111]. Three main operations are essentially considered to implement logic and arithmetic operations: Combine, Separate and Set.

• Combine(Td, Ts1, ..., Tsn) – pour the contents of the tubes Tsi into Td and then empty the tubes Tsi; formally, using set theory, Td ≡ ∪_i Tsi; Tsi ≡ ∅.
• Separate(Ts, i, B[1], B[0]) – separate the contents of Ts based on the value of the i-th bit: the DNA strands with the i-th bit equal to "1" are inserted into tube B[1], while the other DNA strands go into tube B[0].
• Set(Tsd, i) – set the i-th bit of all the DNA strands in tube Tsd, which serves as both the source and the destination in this operation.

Several logic and arithmetic operations on the Sticker-based DNA model have been proposed [111]; the bitwise AND operation over two n-bit vectors is presented in Algorithm 1, and the n-bit integer ADD, in Algorithm 2.
Algorithm 1 AND(Ts1, Ts2, n: in; Td: out)
Require: Pour a blank strand of n bits (0 ... 0) into Td
Ensure: bit stream in Td = bit stream in Ts1 ∧ bit stream in Ts2
1: Combine(Ta, Ts1, Ts2) {Ta: auxiliary tube}
2: for all bits 0 ≤ i < n do
3:   Separate(Ta, i, B[1], B[0])
4:   if B[0] is empty then
5:     Set(Td, i)
6:   end if
7:   Combine(Ta, B[1], B[0])
8: end for

In Algorithm 1, the output of the AND operation over the operands represented by the strands in the source tubes Ts1 and Ts2 is made available in the destination tube Td. After pouring the contents of tubes Ts1 and Ts2 into the auxiliary tube Ta (line 1), the strands in this tube are separated, according to the value of the bits in each of the n positions, into tubes B[1] and B[0] (line 3); whenever the value of the i-th bit in the two strands is "1", the i-th bit of the DNA strand in tube Td is set (line 5). The strands are re-combined in tube Ta in line 7.
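The tube operations map naturally onto a toy software model, with a tube as a list of strands and a strand as a list of bits. The sketch below is a functional illustration only, not a model of the chemistry; it follows Algorithm 1 step by step:

```python
# Toy model of the Sticker-based DNA operations.
def combine(dst, *srcs):                 # Combine: pour tubes into dst
    for s in srcs:
        dst.extend(s)
        s.clear()

def separate(src, i):                    # Separate on the value of bit i
    b1 = [s for s in src if s[i] == 1]
    b0 = [s for s in src if s[i] == 0]
    src.clear()
    return b1, b0

def set_bit(tube, i):                    # Set: anneal sticker i everywhere
    for s in tube:
        s[i] = 1

def dna_and(x, y, n):                    # Algorithm 1
    Td = [[0] * n]                       # blank strand poured into Td
    Ta = []
    combine(Ta, [x[:]], [y[:]])          # line 1
    for i in range(n):                   # line 2
        b1, b0 = separate(Ta, i)         # line 3
        if not b0:                       # line 4: both operand bits are 1
            set_bit(Td, i)               # line 5
        combine(Ta, b1, b0)              # line 7
    return Td[0]

# LSB-first strands: 6 AND 3 = 2
assert dna_and([0, 1, 1, 0], [1, 1, 0, 0], 4) == [0, 1, 0, 0]
```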

Algorithm 2 ADD(Ts1, Ts2, n: in; Td: out)
Require: Pour a blank strand of n bits (0 ... 0) into Td
Ensure: bit stream in Td = bit stream in Ts1 + bit stream in Ts2
1: Combine(Tnc, Ts1, Ts2) {Tnc: no-carry auxiliary tube}
2: for bits i = 0 to n − 1 do
3:   if Tnc is not empty then
4:     Separate(Tnc, i, B[1], B[0])
5:     if neither B[0] nor B[1] is empty then
6:       Set(Td, i) {bits in position i have different values}
7:       Combine(Tnc, B[1], B[0])
8:     else
9:       if B[0] is empty then
10:        Combine(Tc, B[1], B[0]) {Tc: (with-)carry auxiliary tube}
11:      else
12:        Combine(Tnc, B[1], B[0]) {bits in position i are both 0}
13:      end if
14:    end if
15:  else
16:    Separate(Tc, i, B[1], B[0])
17:    if neither B[0] nor B[1] is empty then
18:      Combine(Tc, B[1], B[0]) {bits in position i have different values; the carry is kept}
19:    else
20:      if B[0] is empty then
21:        Set(Td, i) {bits in position i are both 1}
22:        Combine(Tc, B[1], B[0])
23:      else
24:        Set(Td, i) {bits in position i are both 0}
25:        Combine(Tnc, B[1], B[0])
26:      end if
27:    end if
28:  end if
29: end for

Algorithm 2 applies the ADD operation over the n-bit operands represented by the strands in the source tubes Ts1 and Ts2, and the sum is deposited into the destination tube Td. Although the addition algorithm is supported by binary arithmetic, we present it to illustrate how arithmetic is implemented on biological substrates. The auxiliary tubes Tnc and Tc are used to register the absence or existence of the carry bit, respectively. By pouring the contents of tubes Ts1 and Ts2 into the auxiliary tube Tnc, it is considered that the initial carry-in is equal to "0" (line 1). For the Least Significant Bit (LSB), the strands in the Tnc tube (line 3) are separated, according to the value of this bit, into tubes B[1] and B[0] (line 4). If neither B[1] nor B[0] is empty, this means that the LSBs have different values (line 5). Since no carry-in is considered, the LSB of the DNA strand in tube Td is set (line 6). No carry-out is generated; thus, both strands are placed back in the Tnc tube (line 7). On the other hand, if both LSBs take the value 1 (line 9), the sum bit is "0", and we only have to put both strands in Tc (line 10) to signal that a carry-in exists for the next bit. For the last condition, in line 11, both LSBs are "0", and we only have to return both strands to Tnc (line 12). The second half of Algorithm 2, from line 15, is similar to the first half but covers the case where a carry-in exists, which means that tube Tnc is empty and tube Tc contains the input strands. All this processing is iteratively repeated from the LSB to the Most Significant Bit (MSB) (line 2).
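Algorithm 2 can be exercised with the same toy tube model, reusing the combine, separate and set_bit helpers from the previous sketch. Note that the code adopts the recombination into Tc at line 18, where the carry must be kept:

```python
def dna_add(x, y, n):                    # Algorithm 2 (functional sketch)
    Td = [[0] * n]                       # blank result strand
    Tnc, Tc = [], []                     # no-carry / carry tubes
    combine(Tnc, [x[:]], [y[:]])         # initial carry-in is 0
    for i in range(n):
        if Tnc:                          # no carry into position i
            b1, b0 = separate(Tnc, i)
            if b1 and b0:                # bits differ: sum 1, no carry
                set_bit(Td, i); combine(Tnc, b1, b0)
            elif not b0:                 # both 1: sum 0, generate carry
                combine(Tc, b1, b0)
            else:                        # both 0: sum 0, no carry
                combine(Tnc, b1, b0)
        else:                            # carry into position i
            b1, b0 = separate(Tc, i)
            if b1 and b0:                # bits differ: sum 0, keep carry
                combine(Tc, b1, b0)
            elif not b0:                 # both 1: sum 1, keep carry
                set_bit(Td, i); combine(Tc, b1, b0)
            else:                        # both 0: sum 1, carry absorbed
                set_bit(Td, i); combine(Tnc, b1, b0)
    return Td[0]

# 6 + 3 = 9 with LSB-first strands: [0,1,1,0] + [1,1,0,0] -> [1,0,0,1]
assert dna_add([0, 1, 1, 0], [1, 1, 0, 0], 4) == [1, 0, 0, 1]
```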
Current DNA computational systems are constrained by their poor integration capability, complex device structures or limited computational functions. However, the ability to build efficient DNA logic gates and cascade circuits based on polymerase-mediated strand displacement synthesis has been shown [112]. The pillars are a single-rail AND gate built with three strands and a single-rail OR gate built with two strands. Dual-rail DNA logic gates were built by parallelizing a single-rail AND gate and a single-rail OR gate to construct a logical expression. The NOT gate, fundamental to the construction of general logic systems, is difficult to build, but it can be achieved by reversing the definition of the strands and constructing a single-rail AND gate and a single-rail OR gate [112].

A DNA ALU, with four 1-bit functions (FA, AND, OR and NAND) and the decoding and controlling logic, was constructed [112]. It was built with 16 equivalent logic gates and consists of 27 DNA species and 74 DNA strands. After purifying the strands with PolyAcrylamide Gel Electrophoresis (PAGE), the logic gates and circuits were tested, and the results were visualized with the real-time Polymerase Chain Reaction (PCR); PAGE can also be followed by gel imaging. The results in [112] show that it is possible to integrate components to implement DNA computer systems. Leakage is the main challenge, especially when the size scales up. The limited purity of the commercial chemosynthetic strands and DNA components is the primary source of leakage. Moreover, although any kind of DNA strand can be synthesized using biological methods, the preparative step requires a long time in practice, as does the preparation of the inputs [109].

Although the DNA manipulation required in the models has already been realized in the laboratory and the procedures have been implemented in practice, some defects exist in the procedures, thus hindering practical use. The RRNS has been applied to overcome the negative effects caused by the defects and instability of the biochemical reactions and the errors in hybridizations for DNA computing [113]; this is an example of the application of nonconventional arithmetic to DNA-based computing. Applying the RRNS 3-moduli set {2^n − 1, 2^n + 1, 2^{n+1}} to the DNA model in [114], as discussed in Section II-B, leads to one-digit error detection. The parallel RRNS-based DNA arithmetic improves the reliability of DNA computing while at the same time simplifying the DNA encoding scheme [113].

F. Quantum computing

QC cannot stand alongside DNA computing or any other type of classical computation referred to so far in this paper. Classical computing operates over bits, and even DNA-based computing refers only to the substrate on which the computation over these bits is performed. The number of logical states for an n-bit representation is 2^n, and Boolean operations on the individual bits are sufficient to realize any deterministic transformation. A quantum bit (qubit), by contrast, typically a microscopic unit such as an atom or a nuclear spin, is a superposition of basis states, orthogonal and typically represented by |0⟩ and |1⟩. In Dirac notation, also known as bra-ket notation, a ket such as |x⟩ refers to a vector representing a state of a quantum system. A qubit, represented by the vector |x⟩, corresponds to a linear combination, the superposition, of the basis vectors, with the coefficients α and β defined in a unit complex vector space called the Hilbert space (C²) (39).

|x⟩ = α|0⟩ + β|1⟩ ;  |α|² + |β|² = 1    (39)

Regarding measurement, the superposition α|0⟩ + β|1⟩ corresponds to |0⟩ with probability |α|² and to |1⟩ with probability |β|².

Fig. 20: Single-qubit gates, symbols and unitary matrices: (a) Hadamard gate H = (1/√2)[[1 1]; [1 −1]]; (b) Phase gate S = [[1 0]; [0 j]]; (c) π/8 gate T = [[1 0]; [0 e^{jπ/4}]]; (d) Pauli-X gate X = [[0 1]; [1 0]]; (e) Pauli-Y gate Y = [[0 −j]; [j 0]]; (f) Pauli-Z gate Z = [[1 0]; [0 −1]]

Common simple qubit gates are represented in Fig. 20. Since the operations on a qubit preserve the norm of the vectors, the gates are represented by 2×2 unitary matrices. Some algebraic properties, such as H = (X + Z)/√2 and S = T², are useful to correlate some of these quantum gates.

A two-qubit system can be represented by a vector in the Hilbert space C² ⊗ C², with ⊗ denoting the tensor product, which is isomorphic to C⁴. Thus, the basis of C² ⊗ C² can be written as

|0⟩⊗|0⟩, |0⟩⊗|1⟩, |1⟩⊗|0⟩, |1⟩⊗|1⟩,

and |a⟩ ⊗ |b⟩ is often expressed as |a⟩|b⟩ or |ab⟩. Generally, the state of an n-qubit system is expressed by (40) in C^{2^n}:

|Υ⟩ = Σ_{b ∈ {0,1}^n} c_b |b⟩ ;  Σ_b |c_b|² = 1    (40)

with the state of an n-particle system being represented in a 2^n-dimensional space. The exponentially large dimensionality of this space makes quantum computers much more powerful than classical analogue computers, whose state is described only by a number of parameters proportional to the size of the system. By contrast, 2^n complex numbers are required to keep track of the state of an n-qubit system.

If the qubits are allowed to interact, then the closed system includes both qubits together, meaning that the qubits can become entangled. When two or more qubits are entangled, there exists a special connection between them: the outcome of the measurement of one qubit is correlated to the measurement of the other qubits. The quantum version of the CNOT gate in Fig. 12a is an example of the interaction between 2 qubits that operate on the same basis:

|00⟩ → |00⟩, |01⟩ → |01⟩, |10⟩ → |11⟩, |11⟩ → |10⟩

The CNOT gate flips the state of the target (t) qubit according to the state of the first, control (c), qubit. If the control qubit is set to |1⟩, then the target qubit is flipped; otherwise, nothing is done to the target qubit. This action can be represented as |c⟩|t⟩ → |c⟩|c ⊕ t⟩. Fig. 21 provides the CNOT gate matrix and circuit representations.

Fig. 21: CNOT gate: circuit representation (control qubit on the top, target qubit on the bottom) and matrix representation, CNOT = [[1 0 0 0]; [0 1 0 0]; [0 0 0 1]; [0 0 1 0]]
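The CNOT action and the algebraic identities above can be reproduced with a few lines of linear algebra; the following NumPy sketch is purely illustrative:

```python
import numpy as np

# Basis kets |0>, |1> and tensor products for a 2-qubit system.
ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)
kron = np.kron

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

# CNOT flips the target when the control is |1>: |10> -> |11>.
assert np.allclose(CNOT @ kron(ket1, ket0), kron(ket1, ket1))

# The gates are unitary (norm preserving) and H = (X + Z)/sqrt(2).
assert np.allclose(CNOT.conj().T @ CNOT, np.eye(4))
assert np.allclose(H, (X + Z) / np.sqrt(2))

# Entanglement: H on the control followed by CNOT yields a Bell state.
bell = CNOT @ kron(H @ ket0, ket0)
print(bell)  # (|00> + |11>)/sqrt(2)
```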
A quantum circuit that implements a 1-bit FA, acting on the superposition of states, is presented in Fig. 22. With the application of the equations of the classic reversible logic (Fig. 12) to the circuit in Fig. 22, the output equations of the full adder are obtained: P = A, Q = B, S = A ⊕ B ⊕ Ci, and Co = AB + ACi + BCi. Thus, if, for example, the four qubits of the input are set to the state |ABCiX⟩ = |1010⟩, then the system goes, with a probability equal to 1, to a state that provides the output |PQSCo⟩ = |1001⟩.

Fig. 22: Quantum 1-bit FA: diagram of a circuit using CNOT and TG reversible gates; its behaviour was simulated on the Quantum Inspire (QI) platform [168]

Quantum gates and quantum computers have already been developed; for example, TGs based on quantum mechanics were successfully realized more than ten years ago at the University of Innsbruck [115]. The IBM Q systems are physically supported by the transmon qubit represented in Fig. 23. The transmon qubit circuit corresponds to Josephson junctions, kept at a very low temperature, shunted by an additional large capacitance and matched by another comparably large gate capacitance (Fig. 1 of [116]). Although the transmon qubit is closely related to the Cooper Pair Box (CPB) qubit [117], it is operated at a significantly different ratio of the Josephson energy to the charging energy. It overcomes the main weakness of the CPB by featuring an exponential gain in the insensitivity to charge noise [116]. The IBM Q processors are composed of transmon qubits that are coupled and addressed through microwave resonators, as depicted in Fig. 23. Several quantum computers have been constructed based on that technology, the most powerful and recent one being a 53-qubit system [118].

Fig. 23: The transmon qubit and the quantum chip [169]: a type of superconducting charge qubit; the superconductors are capacitively shunted in order to decrease the sensitivity to charge noise

In the next section, applications of the nonconventional arithmetic, new technologies and systems to emerging applications are discussed. While QC algorithms have been derived to solve linear systems of equations with low complexity [119], we also refer to the capacity of QC to attack the current security measures of cryptographic protocols.


IV. APPLICATIONS IN EMERGING AREAS

Postquantum cryptography and machine learning, in particular DL, have been selected as case studies of applications that use nonconventional arithmetic supported by the technologies and systems presented in the previous sections. The criterion for this selection was based on the importance of these applications in emerging areas, particularly those that can benefit the most from nonconventional arithmetic.

A. Deep learning

A type of DL dominant in a broad spectrum of applications is the CNN [120], a class of the general Deep Neural Network (DNN). The CNN encompasses two main phases of operation: training and inference. The training is an offline process, usually performed in powerful servers, for learning features from (typically massive) databases. The inference applies the trained model to predict and classify according to the features "learned" by the neural network. A CNN architecture consists of layers of convolution, activation, pooling, and fully connected neural networks [120]. The convolution kernels, represented by the weights w in the general equation (41), are applied to identify features, with a scalar bias b also added in (41). For the activation, nonlinear functions, such as the sigmoid or the Rectified Linear Unit (ReLU) in (42), are applied to obtain the values of the feature maps. The pooling layer maps the features into a lower-resolution space, for example, with the functions in (43). The fully connected neural networks compute over the reduced spaces, applying weights to the inputs of artificial neurons that "fire" based on nonlinear functions.

conv → Σ_i w_i y_i + b    (41)
act → h_t = tanh(y);  h_s = 1/(1 + e^{−y});  h_RLU = max(0, y)    (42)
pool → p_max = max(y);  p_L2 = √(Σ_i y_i²)    (43)

The large amount of computation and the high memory requirements are challenges faced in the development of efficient CNNs, especially in devices with autonomy and resource constraints. It will be shown how RNS and LNS, associated with the fixed-point and FP representations, and SC can be used to face these challenges.
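Equations (41)–(43) correspond directly to a few NumPy one-liners; the sketch below uses arbitrary random data just to make the mapping concrete:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_normal(16)   # inputs to one neuron / pooling window
w = rng.standard_normal(16)   # convolution kernel weights
b = 0.5                       # scalar bias

conv = np.dot(w, y) + b                 # eq. (41)
h_t = np.tanh(conv)                     # eq. (42), tanh
h_s = 1.0 / (1.0 + np.exp(-conv))       # eq. (42), sigmoid
h_relu = np.maximum(0.0, conv)          # eq. (42), ReLU
p_max = np.max(y)                       # eq. (43), max pooling
p_l2 = np.sqrt(np.sum(y ** 2))          # eq. (43), L2 pooling
```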
Although several proposals have been made to apply RNS to compute a CNN, only a few can be fully computed in the RNS domain [121]–[123]. These complete solutions mitigate the overhead imposed by the nonlinear functions of the activation and pooling layers by carefully selecting the RNS moduli sets. In [121], the 4-tuple {2^n ± 1, 2^{n+1} ± 1} takes advantage of the fact that M in (9) is odd; thus, [41] can be applied to perform the comparison required to compute the max function in (42) and (43). The fully RNS-based CNN accelerator in [123], i.e., Res-DNN, is supported by the Eyeriss architecture [124] and the RNS arithmetic for the 3-moduli set {2^n − 1, 2^n, 2^{n+1} − 1}, preventing overflow by using the BE algorithm referenced in Section II-B (extending 2^n to the modulo 2^{n+e}). A dynamic range partitioning approach is adopted to implement the comparison operation in the RNS domain [125], while for the sign detection in the ReLU unit, the MRC method is adopted [126].

Apart from the large amount of computing, CNNs also require a high memory bandwidth, due to the large volume of data that must be moved between the memory and the processing units. CNNs fully computed in the RNS domain have also been proposed for PIM (RNSnet) [122]. RNSnet simplifies the fundamental CNN operations and maps them, with the support of RNS, to in-memory addition and data access. The nonlinear activation functions are approximated using a few of the first terms of the Taylor expansion, and the comparison for pooling is implemented using only addition, by taking advantage of the characteristics of the traditional 3-moduli set {2^n − 1, 2^n, 2^n + 1}.

By following a different approach, the fact that the weights and activations in a CNN naturally have nonuniform distributions can be used to reduce the bit-width of the data representation and simplify the arithmetic [12], [127]. Inspired by the Kulisch approach [128], which removes the sources of rounding errors in a dot product through an exact multiplier and a wide fixed-point accumulator, the products and accumulation can be performed in both the logarithmic and the linear domains [127]. This approach works for the FP and fixed-point representations, reducing the data bit-width through nonuniform quantization.

Assuming that translating a number m.f from the logarithmic to the linear domain is easier than computing log2(1 ± 2^x) (6), the Exact Log-linear Multiply-Add (ELMA) [127] performs the multiplications in the logarithmic domain and then approximates the results to the linear-domain floating-point representation for accumulation. Translating m imposes a multiplication by 2^m in the linear domain, a floating-point exponent addition or a fixed-point bit shift. If the number of bits of the fractional part f is small, then LUTs can be used in practice to map from the logarithmic to the linear domain and from the linear domain back to the logarithmic domain.

The tapered floating point, for which the exponent and significand field sizes vary according to a third field, such as in the posit number system [129], has also been adopted to design CNNs. The posit number system is characterized by (N, s): the word length in bits (N) and the exponent scale (s). The results show that the accuracy of the multiply and add operations with the proposed ELMA (N = 8, s = 1, α = 5, β = 5, γ = 7) is approximately the same as that of the posit (8, 1) and leads to very efficient hardware implementations.

Fig. 24 shows the samples and the weights for the convolution represented in the log-domain [12] as x̃_i = Quantize(log2(x_i)) (3-bit width) and w̃_i = Quantize(log2(w_i)) (4-bit width), respectively.

Fig. 24: Computation of the dot product and ReLU in the LNS domain

conv = Σ_i 2^{x̃_i + w̃_i} = Σ_i BitShift(1, x̃_i + w̃_i)    (44)

In (44), the accumulation is still performed in the linear domain, with BitShift(1, z) denoting the function that bit-shifts 1 by an integer z in the fixed-point arithmetic. To reach the method in Fig. 24, the accumulation is also performed in the log-domain, by using the approximation log2(1 + x) ≈ x for 0 ≤ x < 1. By considering the accumulation of only two terms, i.e., p̃_1 = x̃_1 + w̃_1 and p̃_2 = x̃_2 + w̃_2, (44) can be rewritten as (45).

log2(conv) = log2(2^{p̃_1} + 2^{p̃_2}) = log2[2^{p̃_1}(1 + 2^{p̃_2}/2^{p̃_1})] = p̃_1 + log2(1 + 2^{p̃_2 − p̃_1}) ≈ p̃_1 + BitShift(1, p̃_2 − p̃_1)    (45)

(45) can be easily generalized for any number of terms p̃_i, leading to the complete computation in the logarithmic domain of the convolution layers and the remaining layers of the CNN [12]. Moreover, an end-to-end training procedure based on the logarithmic representation is proposed in [12].
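The accuracy of the bit-shift approximation in (44)–(45) can be checked numerically; the exponent values below are arbitrary examples:

```python
def bitshift(one, z):
    # BitShift(1, z): shift 1 by the (possibly negative) integer z,
    # i.e., compute 2**z in fixed-point arithmetic.
    return one * 2.0 ** z

# Quantized log-domain samples and weights (integer exponents assumed).
x_t = [3, 1]                       # x~_i = Quantize(log2(x_i))
w_t = [2, 1]                       # w~_i = Quantize(log2(w_i))
p = [xi + wi for xi, wi in zip(x_t, w_t)]   # products add in the log domain

# Eq. (44): accumulate in the linear domain.
conv_linear = sum(bitshift(1, pi) for pi in p)

# Eq. (45): accumulate in the log domain with log2(1 + x) ~= x.
p1, p2 = max(p), min(p)
log_conv = p1 + bitshift(1, p2 - p1)
print(conv_linear, 2.0 ** log_conv)   # 36 vs ~34.9: close approximation
```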

SC is another example of nonconventional arithmetic applied to design CNNs [130], [131], [133]–[137]. In [130], the pooling layers are implemented with an improved version of the SC max unit based on the FSMs presented in Section II-C [132]. The SC-based ReLU, SReLU, is also implemented through an FSM that mimics the three key operations of the artificial neuron: integration, output generation, and decrement. To improve the signal-to-noise ratio, weight normalization and upscaling are performed, while the overhead of the binary-to-stochastic conversion is reduced by sharing the number generators. In [131], the linear rectifier used for the activation function is realized through an FSM, and pooling is achieved by simply aggregating streams to compute either the average or the max function. Compared with the existing deterministic CNN architectures, the stochastic-based architecture can achieve, at the cost of a slight degradation in computing accuracy, significantly higher energy efficiency at a low cost.

The SC-DCNN is another design and optimization framework for SC-based CNNs, which adopts a bottom-up approach [133]. By connecting different basic function blocks with joint optimization and applying efficient weight storage methods, the area and power (energy) consumption are reduced while maintaining a high network accuracy. Integral SC representations, supported on integer stochastic streams obtained by the summation of two or more binary stochastic streams, have also been proposed for designing efficient stochastic-based DNN architectures [134]. Hybrid stochastic-binary neural networks, particularly for near-sensor computing, have also been proposed [135]. Combining signal acquisition and SC in the first ANN layer, binary arithmetic is applied to the remaining layers to avoid compounding accuracy losses. It is also shown that retraining these remaining ANN layers can compensate for the precision loss introduced by SC. Improved SC-based multipliers and their vector extension have also been proposed for application to DNNs [136]. Moreover, the SC-based convolution can be improved in CNNs by producing the result of the first multiplication and performing the subsequent multiplications with the differences of the successive weights [137].
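The basic SC operations used by these designs are compact enough to sketch: a unipolar stream encodes a value as the probability of a 1, multiplication is a bitwise AND, and scaled addition is a multiplexer. The bitstream length below is an arbitrary choice that trades accuracy for latency:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4096  # bitstream length: longer streams give higher accuracy

def to_stream(p, n):        # unipolar SC encoding: P(bit = 1) = p
    return rng.random(n) < p

a, b = 0.75, 0.40
sa, sb = to_stream(a, N), to_stream(b, N)

# Multiplication of unipolar streams is a bitwise AND gate.
prod = (sa & sb).mean()

# Scaled addition is a multiplexer with a random select stream.
sel = to_stream(0.5, N)
scaled_sum = np.where(sel, sa, sb).mean()   # approximates (a + b) / 2

print(prod, a * b)               # ~0.30
print(scaled_sum, (a + b) / 2)   # ~0.575
```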
The SC-DCNN is another design and optimization m cation by 2 in the linear domain, a floating-point exponent framework of SC-based CNNs, which adopts a bottom-up addition or a fixed-point bit shift. If the number of bits of the approach [133]. Connecting different basic function blocks fractional part f is small, then LUTs can be used in practice with joint optimization and applying efficient weight stor- to map from the logarithmic to the linear domain and from age methods, the area and power (energy) consumption are the linear back to the logarithmic domain. reduced while maintaining high network accuracy. Integral The tapered floating point, for which the exponent and SC representations, supported on integer stochastic streams significand field sizes vary according to a third field, such as obtained by the summation of two or more binary stochas- the posit number system [129], has also been adopted to design tic streams, have also been proposed for designing efficient CNNs. The posit number system is characterized by (N, s), the stochastic-based DNN architectures [134]. Hybrid stochastic- word length in bits (N) and the exponent scale (s). The results binary neural networks, particularly for near-sensor comput- show that the accuracy of the multiply and add operations with ing, have been proposed [135]. Combining signal acquisition the proposed ELMA (N = 8, s = 1, α = 5, β = 5, γ = 7) is and SC in the first ANN layer, binary arithmetic is applied to approximately the same as that of the posit (8, 1) and leads to the remaining layers to avoid compounding accuracy losses. very efficient hardware implementations. It is also shown that retraining these remaining ANN layers Fig. 24 shows the samples and the weights for can compensate for precision loss introduced by SC. Improved convolution represented in the log-domain [12] SC-based multipliers and their vector extension have also been as x˜i = Quantize(log2(xi)) (3-bit width) and proposed for application to DNNs [136]. Moreover, the SC- w˜i = Quantize(log2(wi)) (4-bit width), respectively. based convolution can be improved in CNNs by producing 21 the result of the first multiplication and performing the sub- sequent multiplications with the differences of the successive weights [137]. An SC-based DL framework was implemented with AQFP ~ technology [79], a superconducting logic family that achieves b0 ~c high energy efficiency, as described in Section III-A. Through ~b1 = ~r1 an ultraefficient stochastic number generator and a high- ~c mod (B) P accuracy subsampling (pooling) block, both in AQFP, the ~r0 ~c mod (R) implemented CNN achieves a four-orders-of-magnitude higher (R) P energy efficiency than the CMOS-based implementation while P maintaining a 96% accuracy for the MNIST dataset. (B) P B. Cryptography Difficult mathematical problems are used to ensure the security of public-key cryptographic systems. These systems are designed so that the derivation of the private key from the public key is as difficult as computing the solution to Fig. 25: Two basis ~r0, ~r1 and ~b0,~b1 of the same lattice, along problems such as the factorization of large integers (e.g., with the corresponding parallelepipeds in red and grey, are RSA [138]) or the discrete logarithm (e.g., Elliptic Curve (EC) represented; ~c is reduced modulo the two parallelepipeds [147] cryptography [170]). 
The number of steps required to solve these problems is a function of the length of the input; thus, the dimension of the keys is enlarged to make the systems secure (e.g., private keys larger than 512 and 1024 bits).

Modular reduction (modulo N) is a fundamental arithmetic operation in public-key cryptography for mapping the results back to the set of representatives {0, ..., N − 1}. The Montgomery algorithm for modular multiplication replaces the costly reduction for a general value N with a reduction for L, typically a power of 2 [139]. However, since in RNS the reduction modulo a power of 2 is inefficient, an RNS variation of the Montgomery modular multiplication algorithm relies on the BE method presented in Section II-B, by choosing L = M1 and GCD(M1, M2) = 1 [140], [141]. Similarly, the reduction modulo L corresponds to the reduction modulo the dynamic range M1, which automatically relies on the modulo operation in each one of the RNS channels of S1. Moreover, the required multiplicative inverses, ⟨N^{−1}⟩_{M1} and ⟨M1^{−1}⟩_{M2}, exist and are used for the precomputed constants of the modified Montgomery modular multiplication algorithm [142].

A survey on the usage of RNS to support these classic public-key algorithms can be found in [142]. Multimoduli RNS architectures have also been proposed to obfuscate secure information. The random selection of the moduli is proposed for each key bit operation, which makes it difficult to obtain the secret key, preventing power analysis while still providing all the benefits of RNS [143].

However, if quantum computers become available, this type of public-key cryptography, based on the factorization of large integers or the discrete logarithm, will no longer be secure [144]. Therefore, there is a need to develop arithmetic techniques for alternative systems, usually designated as postquantum cryptography [145]. Quite recently, another survey was published on the application of RNS and SC to this type of emergent cryptography [147].
We present two representative examples of the application of RNS and SC to the Goldreich-Goldwasser-Halevi (GGH) postquantum cryptography and to Fully Homomorphic Encryption (FHE), respectively.

A lattice can be seen as a vector space generated by all the linear combinations with integer coefficients of a set R = {r⃗_0, ..., r⃗_{n−1}}, with r⃗_i ∈ R^m, of linearly independent vectors, as defined in (46); the rank of the lattice is n, and its dimension is m. Two vectors in span(R) are congruent if their difference is in L(R).

L(R) = { Σ_{i=0}^{n−1} z_i r⃗_i : z_i ∈ Z }    (46)

Each basis is associated with the parallelepiped:

P(R) = { Σ_{i=0}^{n−1} w_i r⃗_i : w_i ∈ [−1/2, 1/2) }    (47)

For any point y⃗ = w⃗ + x⃗ ∈ span(R), where w⃗ ∈ L(R) and x⃗ ∈ P(R), the reduction of y⃗ modulo P(R) is defined as x⃗ = y⃗ mod P(R). The modular reduction has a different meaning for each basis, since different bases are associated with different parallelepipeds. An example of this is featured in Fig. 25, where the point c⃗ is reduced modulo both P(R) and P(B), producing two different points, which are represented as triangles, while L(R) = L(B).

Fig. 25: Two bases, (r⃗_0, r⃗_1) and (b⃗_0, b⃗_1), of the same lattice, along with the corresponding parallelepipeds in red and grey; c⃗ is reduced modulo the two parallelepipeds [147]

Lattice-Based Cryptography (LBC) is supported by the Closest Vector Problem (CVP). This problem consists of, given a basis R ∈ R^{n×m} and y⃗ ∈ R^m, finding x⃗ ∈ L(R) such that ‖y⃗ − x⃗‖ = min_{z⃗ ∈ L(R)} ‖y⃗ − z⃗‖. The private basis is produced as a rotated, nearly orthogonal basis, such that Babai's round-off [146] provides accurate solutions to the CVP. Rose's cryptosystem uses bases in an Optimal Hermite Normal Form (OHNF) as the public key, a subclass of the Hermite Normal Forms (HNFs) where all but the first column are trivial. The decryption algorithm is modified for its implementation with the RNS. The operation ⌊c⃗R^{−1}⌉ (⌊·⌉ denotes rounding to the nearest integer) can be replaced by the approximation µ⃗ of ⌊γc⃗R^{−1}⌉ through an RNS Montgomery reduction [139], where c⃗ = (c, 0, ..., 0) and the scaling by γ enables the detection and correction of the errors resulting from the approximate RNS Montgomery reduction. ⌊γc⃗R^{−1}⌉ is rewritten using integer arithmetic as in (48), where R̂ = R^{−1}d is an integer matrix and d = det(R).

⌊γc⃗R^{−1}⌉ = (γc⃗R̂ − ⟨γc⃗R̂⟩_d) / d    (48)

It is shown that the usage of RNS enables parallelizing the decryption in Rose's cryptosystem, significantly speeding up its computation in both CPUs and GPUs [147].
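The integer-only rounding in (48) can be illustrated with a toy 2×2 basis; all the values below are made up for illustration, and ⟨·⟩_d is taken as the centered remainder:

```python
import numpy as np

# Integer-only evaluation of round(gamma * c * R^{-1}), as in eq. (48).
R = np.array([[7, 1],
              [2, 9]])               # a toy lattice basis
d = int(round(np.linalg.det(R)))     # d = det(R) = 61
R_hat = np.array([[9, -1],
                  [-2, 7]])          # R_hat = R^{-1} * d (adjugate), integer
gamma = 2 ** 8                       # scaling enabling error detection

c = np.array([35, 0])                # vector (c, 0, ..., 0)
t = gamma * c @ R_hat                # integer vector gamma * c * R_hat

def crem(a, m):                      # centered remainder <a>_m
    r = a % m
    return np.where(r > m // 2, r - m, r)

rounded = (t - crem(t, d)) // d      # eq. (48): exact integer rounding
assert np.allclose(rounded, np.rint(gamma * c @ np.linalg.inv(R)))
```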
Homomorphic encryption allows performing computations directly on the ciphertexts, generating encrypted results as if the operations had been performed on the plaintext and then encrypted. FHE provides malleable ciphertexts, such that, given two ciphertexts representing the operands, it is possible to produce a ciphertext encrypting their product or sum. Structured lattices underpin classes of cryptosystems supporting FHE [148]. There is noise associated with the ciphertexts that grows as the homomorphic operations are applied; thus, bootstrapping has been proposed, a technique in which the ciphertexts are homomorphically decrypted [148]. Modern FHE systems rely on the Ring Learning with Errors (RLWE) problem, for which techniques have been proposed that limit the need for bootstrapping [147].

Batching [149] improves the performance of FHE based on the CRT, by allowing multiple bits to be encrypted in a single ciphertext, so that one can carry out AND and XOR sequences of bits using a single homomorphic multiplication or addition. For example, in an RLWE cryptosystem, binary polynomials are homomorphically processed in a cyclotomic ring. By noticing that certain cyclotomic polynomials factor modulo two into a set of polynomials with the same degree, one may take advantage of the CRT to associate a plaintext bit with each one of these polynomials. Homomorphic additions and multiplications then add and multiply the bits modulo their respective polynomials, achieving coefficientwise XOR and AND operations. Rotations of these bits may be accomplished with the techniques in [150].

The operations that arise from FHE are evaluated, and efficient algorithm-hiding systems are designed for applications that take advantage of those operations, in [151]. First, numbers can be represented as sequences of bits through stochastic representations. Since most FHE schemes support batching as an acceleration technique for the processing of multiple bits in parallel, one can map the stochastic representations to batching slots to efficiently implement multiplications and scaled additions (19). Unlike traditional Boolean circuits, the number of bits required for stochastic computing is flexible, allowing for a flexible correspondence between the unencrypted and encrypted domains.

The homomorphic systems based on SC in [151] target the approximation of continuous functions. These functions are first approximated with Bernstein polynomials in a black-box manner. Hence, the function development can take place with traditional programming paradigms, providing an automated way to produce FHE circuits. A stochastic sequence of bits is encrypted in a single cryptogram. It was concluded that the stochastic representations are more widely applicable, while a fixed-point representation provides better performance [151].

A quantum computer efficiently factors an arbitrary integer [153], which makes the widely used RSA public-key cryptosystem insecure. It is based on reducing this computation to a special case of a mathematical problem known as the Hidden Subgroup Problem (HSP) [154], [155] (the HSP is a group-theoretic generalization of the problem of periodicity determination). The Abelian quantum hidden subgroup algorithms in [153], [156] are based on the Quantum Fourier Transform (QFT). The QFT, the counterpart of the Discrete Fourier Transform (DFT) in the classical computing model, is the basis of many quantum algorithms [152], [157], [158].

V. CONCLUSIONS AND FUTURE RESEARCH

This survey covers the main types of nonconventional arithmetic from a holistic perspective, highlighting their practical interest. Several different classes of nonconventional arithmetic are reviewed, such as the LNS, RNS, SC and HDC, and their usage in various emergent applications, with different features and based on new technologies, is discussed. We show the importance of nonconventional number representations and arithmetic not only to implement fast, reliable, and efficient systems but also to shape the usage of the technology. For example, SC is useful for FHE, enabling the processing of multiple bits in parallel for batching, but it is also important to perform approximate computing on nanoscale memristor crossbars and to overcome the deep pipelining nature of AQFP logic devices. Similarly, RNS is suitable for machine learning and cryptographic applications, but it is also quite useful to mitigate the instability of the biochemical reactions in DNA-based computing and to reduce the cost of arithmetic systems for the integrated nanophotonics technology. The hyperdimensional representation and arithmetic, inspired by brain-like computing, adopt high-dimensional binary vectors to introduce randomness and achieve reliability in HDC. This type of nonconventional number representation offers a strong basis for in-memory computation, being a promising candidate for taking advantage of the physical attributes of nanoscale memristive devices to perform online training, classification and prediction.

One of the main conclusions that can be drawn from this survey is that the investigation of nonconventional arithmetic must take into account all the dimensions of the systems, which include not only the computer arithmetic theory but also the technology advances and the demands of the emergent applications. For example, the results of further investigations of the randomness characteristics of SC and HDC may also impact applications that might benefit from approximate computations [66]. At the technological level, these classes of arithmetic may benefit from the ease of nanotechnologies in generating the required random numbers [79] while improving the robustness to noise [171]. The efficiency offered by SC and HDC results in important savings both in terms of the integration area and the hardware cost, which make these number representations suitable for the development of autonomous cyberphysical [161] and Internet of Things (IoT) processors [73].

Further research on LNS and RNS should target difficult [5] S. H. Park, Y. Liu, N. Kharche, M. Salmani Jelodar, G. Klimeck, M. S. operations in the respective domains, namely, addition and Lundstrom, and M. Luisier, “Performance comparisons of III-V and Strained-Si in planar FETs and nonplanar at ultrashort gate subtraction in the logarithmic domain and division and com- length (12 nm),” IEEE Transactions on Electron Devices, vol. 59, no. 8, parison in the RNS domain. The accuracy and the imple- pp. 2107–2114, 2012. mentation cost under different technologies are key aspects [6] TSMC, “7nm technology,” https://www.tsmc.com/english/ dedicated- Foundry/technology/7nm.htm, 2019. for the practical usage of this nonconventional arithmetic. In [7] I. SEMATECH, “International technology roadmap for semiconductor: that respect, the development of tools such as the Computing Emerging research devices,” Tech. Rep., 2003. with the Residue Number System (CRNS), which automates [8] M. LaPedus, “Big trouble at 3nm,” https://semiengineering.com/big- trouble-at-3nm/, 2018. the design of systems based on RNS for exploring the in- [9] S. I. Association, “International technology roadmap for semiconductor creasing parallelism available in computing devices, is also 2.0: Beyond CMOS,” Tech. Rep., 2015. important [162]. The redundancy of the RRNS and the ran- [10] N. G. Kingsbury and P. J. Rayner, “Digital filtering using logarithmic arithmetic,” Electronics Letters, vol. 7, p. 5658, 1971. domness of SC and HDC make them suitable for improving [11] E. Swartzlander and A. Alexopoulos, “The sign/logarithm number reliability [113] and supporting the heterogeneous integration system,” IEEE Transactions on Computers, vol. C-24, no. 12, p. of multiple emerging nanotechnologies [106]. Moreover, it 12381242, 1975. [12] D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional neu- will be very useful to investigate efficient converters between ral networks using logarithmic data representation,” CoRR, vol. nonconventional systems that allow them to be used together abs/1603.01025, 2016. in different components of heterogeneous systems and for [13] F. J. Taylor, R. Gill, J. Joseph, and J. Radke, “A 20 bit logarithmic number system processor,” IEEE Transactions on Computers, vol. 37, different applications. no. 2, pp. 190–200, Feb 1988. New paradigms of computation will also become the focus [14] F. de Dinechin and A. Tisserand, “Multipartite table methods,” IEEE of research on nonconventional computer arithmetic. The Transactions on Computers, vol. 54, no. 3, pp. 319–330, March 2005. [15] J. Coleman and E. Chester, “A 32-bit logarithmic arithmetic unit and less disruptive, more straightforward path of research is the its performance compared to floating-point,” in IEEE Symposium on combination of nonconventional arithmetic with approximate Computer Arithmetic, 1999, pp. 142–151. computing. While for SC, this combination is natural, for the [16] D. Lewis, “Complex logarithmic number system arithmetic using high- radix redundant CORDIC algorithms,” in 14th IEEE Symposium on other nonpositional number representations, a research effort Computer Arithmetic, 1999, pp. 194–203. focused on this purpose is required. For the other types of [17] Chichyang Chen, Rui-Lin Chen, and Chih-Huan Yang, “Pipelined new paradigms of computation, such as QC, the quantum computation of very large word-length LNS addition/subtraction with polynomial hardware cost,” IEEE Transactions on Computers, vol. 
New paradigms of computation will also become the focus of research on nonconventional computer arithmetic. The less disruptive, more straightforward path of research is the combination of nonconventional arithmetic with approximate computing. While for SC this combination is natural, for the other nonpositional number representations a research effort focused on this purpose is required. For the other types of new paradigms of computation, such as QC, the publicly available quantum computer platforms, i.e., IBM Q [169] and QI from QuTech [168], provide both simulators and real quantum computers at the back end. They are quite useful for performing an investigation of QC and for introducing this topic with a practical component in university curricula. They are valuable instruments for simulating new quantum algorithms and circuits for several applications, extending their usage far beyond the domain of quantum mechanics.
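As a minimal illustration of that entry point, the sketch below (our own example, assuming a Qiskit installation of the period) prepares a two-qubit Bell state and executes it on the simulator back end that ships with the framework; retargeting it to a real IBM Q device only changes the back-end selection.

    from qiskit import QuantumCircuit, BasicAer, execute

    # Bell state: Hadamard on qubit 0 followed by a CNOT from qubit 0 to 1.
    qc = QuantumCircuit(2, 2)
    qc.h(0)
    qc.cx(0, 1)
    qc.measure([0, 1], [0, 1])

    # Local simulator back end; a real-device back end could be used instead.
    backend = BasicAer.get_backend("qasm_simulator")
    counts = execute(qc, backend, shots=1024).result().get_counts()
    print(counts)  # roughly half '00' and half '11'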
ACKNOWLEDGMENTS
This work was partially supported by the European Union Horizon 2020 research and innovation program under grant agreement No. 779391 (FutureTPM) and by national funds through Fundação para a Ciência e a Tecnologia (FCT) under project number UIDB/50021/2020. I would also like to acknowledge the contributions of all the MSc and PhD students that I have supervised over the years. They have been a source of inspiration and challenges in the adventure that research on computer arithmetic has been for me.
REFERENCES
[1] D. Bishop, "Nanotechnology and the end of Moore's law?" Bell Labs Technical Journal, vol. 10, no. 3, pp. 23–28, 2005.
[2] SocialMediaToday, "How much data is generated every minute?" 2018. [Online]. Available: https://www.socialmediatoday.com/news/how-much-data-is-generated-every-minute-infographic-1/525692/
[3] P. Martins, J. Eynard, J. Bajard, and L. Sousa, "Arithmetical improvement of the round-off for cryptosystems in high-dimensional lattices," IEEE Transactions on Computers, vol. 66, no. 12, pp. 2005–2018, 2017.
[4] T. Tay and C. Chang, "A non-iterative multiple residue digit error detection and correction algorithm in RRNS," IEEE Transactions on Computers, vol. 65, no. 2, pp. 396–408, 2016.
[5] S. H. Park, Y. Liu, N. Kharche, M. Salmani Jelodar, G. Klimeck, M. S. Lundstrom, and M. Luisier, "Performance comparisons of III-V and strained-Si in planar FETs and nonplanar FinFETs at ultrashort gate length (12 nm)," IEEE Transactions on Electron Devices, vol. 59, no. 8, pp. 2107–2114, 2012.
[6] TSMC, "7nm technology," https://www.tsmc.com/english/dedicatedFoundry/technology/7nm.htm, 2019.
[7] International SEMATECH, "International technology roadmap for semiconductors: Emerging research devices," Tech. Rep., 2003.
[8] M. LaPedus, "Big trouble at 3nm," https://semiengineering.com/big-trouble-at-3nm/, 2018.
[9] Semiconductor Industry Association, "International technology roadmap for semiconductors 2.0: Beyond CMOS," Tech. Rep., 2015.
[10] N. G. Kingsbury and P. J. Rayner, "Digital filtering using logarithmic arithmetic," Electronics Letters, vol. 7, pp. 56–58, 1971.
[11] E. Swartzlander and A. Alexopoulos, "The sign/logarithm number system," IEEE Transactions on Computers, vol. C-24, no. 12, pp. 1238–1242, 1975.
[12] D. Miyashita, E. H. Lee, and B. Murmann, "Convolutional neural networks using logarithmic data representation," CoRR, vol. abs/1603.01025, 2016.
[13] F. J. Taylor, R. Gill, J. Joseph, and J. Radke, "A 20 bit logarithmic number system processor," IEEE Transactions on Computers, vol. 37, no. 2, pp. 190–200, Feb 1988.
[14] F. de Dinechin and A. Tisserand, "Multipartite table methods," IEEE Transactions on Computers, vol. 54, no. 3, pp. 319–330, March 2005.
[15] J. Coleman and E. Chester, "A 32-bit logarithmic arithmetic unit and its performance compared to floating-point," in IEEE Symposium on Computer Arithmetic (ARITH), 1999, pp. 142–151.
[16] D. Lewis, "Complex logarithmic number system arithmetic using high-radix redundant CORDIC algorithms," in 14th IEEE Symposium on Computer Arithmetic (ARITH), 1999, pp. 194–203.
[17] Chichyang Chen, Rui-Lin Chen, and Chih-Huan Yang, "Pipelined computation of very large word-length LNS addition/subtraction with polynomial hardware cost," IEEE Transactions on Computers, vol. 49, no. 7, pp. 716–726, July 2000.
[18] Chichyang Chen and Rui-Lin Chen, "Performance-improved computation of very large word-length LNS addition/subtraction using signed-digit arithmetic," in IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP), June 2003, pp. 337–347.
[19] J. Coleman, E. Chester, C. Softley, and J. Kadlec, "Arithmetic on the European logarithmic microprocessor," IEEE Transactions on Computers, vol. 49, no. 7, pp. 702–715, July 2000.
[20] B. Lee and N. Burgess, "A parallel look-up logarithmic number system addition/subtraction scheme for FPGA," in IEEE International Conference on Field-Programmable Technology (FPT), Dec 2003, pp. 76–83.
[21] H. Fu, O. Mencer, and W. Luk, "FPGA designs with optimized logarithmic arithmetic," IEEE Transactions on Computers, vol. 59, no. 7, pp. 1000–1006, July 2010.
[22] J. N. Coleman and R. Che Ismail, "LNS with co-transformation competes with floating-point," IEEE Transactions on Computers, vol. 65, no. 1, pp. 136–146, Jan 2016.
[23] Y. Popoff, F. Scheidegger, M. Schaffner, M. Gautschi, F. K. Gürkaynak, and L. Benini, "High-efficiency logarithmic number unit design based on an improved cotransformation scheme," in Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2016, pp. 1387–1392.
[24] D. Yu and D. M. Lewis, "A 30-b integrated logarithmic number system processor," IEEE Journal of Solid-State Circuits, vol. 26, pp. 1433–1440, 1991.
[25] M. Arnold, "A VLIW architecture for logarithmic arithmetic," in Euromicro Symposium on Digital System Design, 2003, pp. 294–302.
[26] J. Coleman, C. Softley, J. Kadlec, R. Matousek, M. Tichy, Z. Pohl, A. Hermanek, and N. Benschop, "The European logarithmic microprocessor," IEEE Transactions on Computers, vol. 57, no. 4, pp. 532–546, 2008.
[27] R. Chassaing, DSP Applications Using C and the TMS320C6x DSK. John Wiley & Sons, Inc., 2002.
[28] M. G. Arnold and C. Walter, "Unrestricted faithful rounding is good enough for some LNS applications," in 15th IEEE Symposium on Computer Arithmetic (ARITH), 2001, pp. 237–246.
[29] M. Gautschi, A. Traber, A. Pullini, L. Benini, M. Scandale, A. Di Federico, M. Beretta, and G. Agosta, "Tailoring instruction-set extensions for an ultra-low power tightly-coupled cluster of OpenRISC cores," in IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), 2015, pp. 25–30.

[30] R. C. Ismail, M. K. Zakaria, and S. A. Z. Murad, "Hybrid logarithmic number system arithmetic unit: A review," in IEEE International Conference on Circuits and Systems (ICCAS), Sep. 2013, pp. 55–58.
[31] A. Omondi and B. Premkumar, Residue Number Systems: Theory and Implementations. Imperial College Press, 2007.
[32] H. L. Garner, "The residue number system," IRE Transactions on Electronic Computers, vol. EC-8, no. 2, pp. 140–147, 1959.
[33] S. Lang, Algebra, 3rd ed., ser. Graduate Texts in Mathematics. Springer, 2002, vol. 211.
[34] A. Molahosseini and L. Sousa, "Introduction to residue number system: Structure and teaching methodology," in Embedded Systems Design with Special Arithmetic and Number Systems. Springer, 2017, pp. 3–17.
[35] C. Chang, A. Molahosseini, A. Zarandi, and T. Tay, "Residue number systems: A new paradigm to datapath optimization for low-power and high-performance digital signal processing applications," IEEE Circuits and Systems Magazine, vol. 15, no. 4, pp. 26–44, 2015.
[36] A. Molahosseini, L. Sousa, and C.-H. Chang, Eds., Embedded Systems Design with Special Arithmetic and Number Systems. Springer, 2017.
[37] F. J. Taylor, "Residue arithmetic: a tutorial with examples," Computer, vol. 17, no. 5, pp. 50–62, 1984.
[38] P. V. Mohan, Arithmetic Circuits for DSP Applications. John Wiley & Sons, Ltd, 2017.
[39] A. Hiasat and H. Abdel-Aty-Zohdy, "Design and implementation of an RNS division algorithm," in 13th IEEE Symposium on Computer Arithmetic (ARITH), 1997, pp. 240–249.
[40] S. Bi and W. Gross, "The mixed-radix Chinese remainder theorem and its applications to residue comparison," IEEE Transactions on Computers, vol. 57, no. 12, 2008.
[41] L. Sousa, "Efficient method for magnitude comparison in RNS based on two pairs of conjugate moduli," in IEEE Symposium on Computer Arithmetic (ARITH), 2007, pp. 240–250.
[42] S. Kumar and C. Chang, "A new fast and area-efficient adder-based sign detector for RNS {2^n − 1, 2^n, 2^n + 1}," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 7, pp. 2608–2612, 2016.
[43] L. Sousa, "2^n RNS scalers for extended 4-moduli sets," IEEE Transactions on Computers, vol. 64, no. 12, pp. 3322–3334, 2015.
[44] J. Bajard and L. Imbert, "A full RNS implementation of RSA," IEEE Transactions on Computers, vol. 53, no. 6, pp. 769–774, June 2004.
[45] A. Zarandi, A. Molahosseini, L. Sousa, and M. Hosseinzadeh, "An efficient component for designing signed reverse converters for a class of RNS moduli sets of composite form {2^k, 2^P − 1}," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 1, pp. 48–59, 2017.
[46] F. Jafarzadehpour, A. Molahosseini, A. Zarandi, and L. Sousa, "Efficient modular adder designs based on thermometer and one-hot coding," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 9, pp. 2142–2155, 2019.
[47] C. H. Vun and B. Premkumar, "Thermometer code based modular arithmetic," in Spring Congress on Engineering and Technology, 2012.
[48] C. Vun, A. Premkumar, and W. Zhang, "Sum of products: Computation using modular thermometer codes," in International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), 2014, pp. 141–146.
[49] P. Patronik and S. J. Piestrak, "Hardware/software approach to designing low-power RNS-enhanced arithmetic units," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 64, no. 5, pp. 1031–1039, 2017.
[50] A. Ghosh, S. Singha, and A. Sinha, "Floating point RNS: A new concept for designing the MAC unit of digital signal processor," SIGARCH Computer Architecture News, vol. 40, no. 2, pp. 39–43, May 2012.
[51] R. Dhanabal, V. Barathi, S. K. Sahoo, N. R. Samhitha, N. A. Cherian, and P. M. Jacob, "Implementation of floating point MAC using residue number system," in International Conference on Reliability Optimization and Information Technology (ICROIT), 2014, pp. 461–465.
[52] B. Parhami, "RNS representations with redundant residues," in Conference Record of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, Nov 2001, pp. 1651–1655.
[53] B. Deng, S. Srikanth, E. Hein, T. Conte, E. Debenedictis, J. Cook, and M. Frank, "Extending Moore's law via computationally error-tolerant computing," ACM Transactions on Architecture and Code Optimization, vol. 15, no. 1, Mar. 2018.
[54] S. Srikanth, P. G. Rabbat, E. Hein, B. Deng, T. Conte, E. DeBenedictis, J. Cook, and M. Frank, "Memory system design for ultra low power, computationally error resilient processor microarchitectures," in IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018, pp. 696–709.
[55] B. R. Gaines, "Stochastic computing," in AFIPS Spring Joint Computer Conference, vol. 30, 1967, pp. 149–156.
[56] A. Alaghi and J. P. Hayes, "Survey of stochastic computing," ACM Transactions on Embedded Computing Systems (TECS), vol. 12, no. 2s, pp. 1–19, May 2013.
[57] W. Gross and V. Gaudet, Eds., Stochastic Computing: Techniques and Applications. Springer, 2019.
[58] R. Duarte, M. Vestias, and H. Neto, "Enhancing stochastic computations via process variation," in 25th International Conference on Field Programmable Logic and Applications (FPL), 2015.
[59] T. Chen and J. P. Hayes, "Analyzing and controlling accuracy in stochastic circuits," in IEEE 32nd International Conference on Computer Design (ICCD), 2014, pp. 367–373.
[60] W. Qian and M. D. Riedel, "The synthesis of robust polynomial arithmetic with stochastic logic," in 45th ACM/IEEE Design Automation Conference, 2008, pp. 648–653.
[61] M. Lunglmayr, D. Wiesinger, and W. Haselmayr, "Design and analysis of efficient maximum/minimum circuits for stochastic computing," IEEE Transactions on Computers, vol. 69, no. 3, pp. 402–409, 2020.
[62] B. Brown and H. Card, "Stochastic neural computation I: Computational elements," IEEE Transactions on Computers, vol. 50, no. 9, pp. 891–905, 2001.
[63] P. Li, D. J. Lilja, W. Qian, M. D. Riedel, and K. Bazargan, "Logical computation on stochastic bit streams with linear finite-state machines," IEEE Transactions on Computers, vol. 63, no. 6, pp. 1474–1486, 2014.
[64] V. T. Lee, A. Alaghi, R. Pamula, V. S. Sathe, L. Ceze, and M. Oskin, "Architecture considerations for stochastic computing accelerators," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2277–2289, Nov 2018.
[65] W. Qian, X. Li, M. D. Riedel, K. Bazargan, and D. J. Lilja, "An architecture for fault-tolerant computation with stochastic logic," IEEE Transactions on Computers, vol. 60, no. 1, pp. 93–105, 2011.
[66] S. Gupta and K. Gopalakrishnan, "Revisiting stochastic computation: Approximate estimation of machine learning kernels," position paper, Workshop on Approximate Computing Across the System Stack (WACAS), 2014.
[67] A. Alaghi, W. Qian, and J. P. Hayes, "The promise and challenge of stochastic computing," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 8, pp. 1515–1531, 2018.
[68] V. K. Chippa, S. Venkataramani, K. Roy, and A. Raghunathan, "StoRM: A stochastic recognition and mining processor," in IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2014, pp. 39–44.
[69] The Human Memory, "Brain neurons & synapses," https://human-memory.net/brain-neurons-synapses/, 2020.
[70] P. Kanerva, "Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors," Cognitive Computation, vol. 1, pp. 139–159, 2009.
[71] O. J. Rasanen and J. P. Saarinen, "Sequence prediction with sparse distributed hyperdimensional coding applied to the analysis of mobile phone use patterns," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 9, pp. 1878–1889, Sep. 2016.
[72] A. Rahimi, P. Kanerva, L. Benini, and J. M. Rabaey, "Efficient biosignal processing using hyperdimensional computing: Network templates for combined learning and classification of ExG signals," Proceedings of the IEEE, vol. 107, no. 1, pp. 123–143, 2019.
[73] S. Datta, R. A. Antonio, A. R. Ison, and J. M. Rabaey, "A programmable hyper-dimensional processor architecture for human-centric IoT," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 3, pp. 439–452, Sep. 2019.
[74] S. Mittal, "A survey of techniques for approximate computing," ACM Computing Surveys, vol. 48, no. 4, Mar. 2016.
[75] H. Jiang, F. J. H. Santiago, H. Mo, L. Liu, and J. Han, "Approximate arithmetic circuits: A survey, characterization, and recent applications," Proceedings of the IEEE, 2020.
[76] F. S. Snigdha, D. Sengupta, J. Hu, and S. S. Sapatnekar, "Optimal design of JPEG hardware under the approximate computing paradigm," in 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), 2016.
[77] Y. Wang, H. Li, and X. Li, "Real-time meets approximate computing: An elastic CNN inference accelerator with adaptive trade-off between QoS and QoR," in ACM/EDAC/IEEE Design Automation Conference (DAC), 2017.

[78] S. Smithson, N. Onizawa, B. Meyer, W. Gross, and T. Hanyu, "Efficient CMOS invertible logic using stochastic computing," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 6, pp. 2263–2274, 2019.
[79] R. Cai, A. Ren, O. Chen, N. Liu, C. Ding, X. Qian, J. Han, W. Luo, N. Yoshikawa, and Y. Wang, "A stochastic-computing based deep learning framework using adiabatic quantum-flux-parametron superconducting technology," in ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 567–578.
[80] A. Chabi, A. Roohi, H. Khademolhosseini, S. Sheikhfaal, S. Angizi, K. Navi, and R. DeMara, "Towards ultra-efficient QCA reversible circuits," Microprocessors and Microsystems, vol. 49, pp. 127–138, 2017.
[81] A. Molahosseini, A. Asadpoor, A. Zarandi, and L. Sousa, "Towards efficient modular adders based on reversible circuits," in IEEE International Symposium on Circuits and Systems (ISCAS), 2018.
[82] C. H. Bennett, "Logical reversibility of computation," IBM Journal of Research and Development, vol. 17, no. 6, pp. 525–532, Nov 1973.
[83] G. E. Hinton, T. J. Sejnowski, and D. H. Ackley, "Boltzmann machines: Constraint satisfaction networks that learn," Dept. Computer Science, Carnegie-Mellon University, Pittsburgh, PA, USA, Tech. Rep. CMU-CS-84-119, 1984.
[84] S. Sheikhfaal, S. Angizi, S. Sarmadi, M. Moaiyeri, and S. Sayedsalehi, "Designing efficient QCA logical circuits with power dissipation analysis," Microelectronics Journal, vol. 46, no. 6, pp. 462–471, 2015.
[85] S. Oskouei and A. Ghaffari, "Designing a new reversible ALU by QCA for reducing occupation area," The Journal of Supercomputing, vol. 75, no. 8, pp. 5118–5144, 2019.
[86] M. Sarkar, P. Ghosal, and S. Mohanty, "Minimal reversible circuit synthesis on a DNA computer," Natural Computing, vol. 16, pp. 463–472, 2017.
[87] B. Tsukerblat, A. Palii, J. Clemente-Juan, N. Suaud, and E. Coronado, "Quantum cellular automata: a short overview of molecular problem," Acta Physica Polonica A, 2018.
[88] C. Bennett and D. DiVincenzo, "Quantum information and computation," Nature, vol. 404, pp. 247–255, March 2000.
[89] N. Kazemifard and K. Navi, "Implementing RNS arithmetic unit through single electron quantum-dot cellular automata," International Journal of Computer Applications, vol. 163, pp. 20–27, 2017.
[90] O. Chen, R. Cai, Y. Wang, F. Ke, T. Yamae, R. Saito, N. Takeuchi, and N. Yoshikawa, "Adiabatic quantum-flux-parametron: Towards building extremely energy-efficient circuits and systems," Scientific Reports, vol. 9, no. 10514, 2019.
[91] N. Takeuchi, D. Ozawa, Y. Yamanashi, and N. Yoshikawa, "An adiabatic quantum flux parametron as an ultra-low-power logic device," Superconductor Science and Technology, vol. 26, no. 3, p. 035010, Jan 2013.
[92] N. Takeuchi, T. Yamae, C. Ayala, H. Suzuki, and N. Yoshikawa, "An adiabatic superconductor 8-bit adder with 24kBT energy dissipation per junction," Applied Physics Letters, vol. 114, no. 4, p. 042602, 2019.
[93] J. Peng, S. Sun, V. K. Narayana, V. J. Sorger, and T. El-Ghazawi, "Residue number system arithmetic based on integrated nanophotonics," Optics Letters, vol. 43, no. 9, pp. 2026–2029, May 2018.
[94] J. Peng, S. Sun, V. Narayana, V. Sorger, and T. El-Ghazawi, "Integrated photonics architectures for residue number system computations," in IEEE International Conference on Rebooting Computing (ICRC), 2019.
[95] S. Sun, V. K. Narayana, I. Sarpkaya, J. Crandall, R. A. Soref, H. Dalir, T. El-Ghazawi, and V. J. Sorger, "Hybrid photonic-plasmonic non-blocking broadband 5×5 router for optical networks," IEEE Photonics Journal, vol. 10, no. 2, pp. 1–12, 2018.
[96] L. Bakhtiar, E. Yaghoubi, S. Hamidi, and M. Hosseinzadeh, "Optical RNS adder and multiplier," International Journal of Computer Applications in Technology, vol. 52, no. 1, pp. 71–76, 2015.
[97] Y. Xie and J. Zhao, "Emerging memory technologies," IEEE Micro, vol. 39, no. 1, pp. 6–7, 2019.
[98] N. Haron and S. Hamdioui, "Redundant residue number system code for fault-tolerant hybrid memories," ACM Journal on Emerging Technologies in Computing Systems, vol. 7, no. 1, Jan. 2011.
[99] S. Thirumala and S. Gupta, "Reconfigurable ferroelectric transistor – Part I: Device design and operation," IEEE Transactions on Electron Devices, vol. 66, no. 6, pp. 2771–2779, 2019.
[100] Integrated Circuits and Devices Laboratory (ICDL), "Ferroelectric transistor technologies," https://engineering.purdue.edu/ICDL/research/ferroelectrics, 2019.
[101] L. Chua, "Memristor – the missing circuit element," IEEE Transactions on Circuit Theory, vol. 18, no. 5, pp. 507–519, 1971.
[102] S. Kvatinsky, "Real processing-in-memory with memristive memory processing unit (mMPU)," in IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2019, pp. 142–148.
[103] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and H. Qian, "Fully hardware-implemented memristor convolutional neural network," Nature, vol. 577, pp. 641–646, 2020.
[104] J. Cook, "PIMS: Memristor-based processing-in-memory-and-storage," Sandia National Laboratories, Albuquerque, New Mexico, Tech. Rep. SAND2018-2095, Feb. 2018.
[105] S. Raj, D. Chakraborty, and S. K. Jha, "In-memory flow-based stochastic computing on memristor crossbars using bit-vector stochastic streams," in IEEE 17th International Conference on Nanotechnology (NANO), 2017, pp. 855–860.
[106] T. Wu, H. Li, P. Huang, A. Rahimi, J. Rabaey, H. Wong, M. Shulaker, and S. Mitra, "Brain-inspired computing exploiting carbon nanotube FETs and resistive RAM: Hyperdimensional computing case study," in IEEE International Solid-State Circuits Conference (ISSCC), 2018, pp. 492–494.
[107] C. Ma, Y. Sun, W. Qian, Z. Meng, R. Yang, and L. Jiang, "Go unary: A novel synapse coding and mapping scheme for reliable ReRAM-based neuromorphic computing," in Conference on Design, Automation & Test in Europe (DATE), 2020, pp. 1432–1437.
[108] F. Guarnieri, M. Fliss, and C. Bancroft, "Making DNA add," Science, vol. 273, no. 5272, pp. 220–223, 1996.
[109] A. Fujiwara, K. Matsumoto, and W. Chen, "Procedures for logic and arithmetic operations with DNA molecules," International Journal of Foundations of Computer Science, vol. 15, no. 03, pp. 461–474, 2004.
[110] Z. Ignatova, I. Martínez-Pérez, and K.-H. Zimmermann, DNA Computing. Springer, 2008.
[111] M. Sarkar, P. Ghosal, and S. P. Mohanty, "Exploring the feasibility of a DNA computer: Design of an ALU using sticker-based DNA model," IEEE Transactions on NanoBioscience, vol. 16, no. 6, pp. 383–399, 2017.
[112] H. Su, J. Xu, Q. Wang, F. Wang, and X. Zhou, "High-efficiency and integrable DNA arithmetic and logic system based on strand displacement synthesis," Nature Communications, vol. 10, no. 5390, 2019.
[113] X. Zheng, B. Wang, C. Zhou, X. Wei, and Q. Zhang, "Parallel DNA arithmetic operation with one error detection based on 3-moduli set," IEEE Transactions on NanoBioscience, vol. 15, no. 5, pp. 499–507, 2016.
[114] W. Chang, "Fast parallel DNA-based algorithms for molecular computation: Quadratic congruence and factoring integers," IEEE Transactions on NanoBioscience, vol. 11, no. 1, pp. 62–69, 2012.
[115] T. Monz, K. Kim, W. Hänsel, M. Riebe, A. Villar, P. Schindler, M. Chwalla, M. Hennrich, and R. Blatt, "Realization of the quantum Toffoli gate with trapped ions," Physical Review Letters, vol. 102, p. 040501, Jan 2009.
[116] J. Koch, T. Yu, J. Gambetta, A. Houck, D. Schuster, J. Majer, A. Blais, M. Devoret, S. Girvin, and R. Schoelkopf, "Charge-insensitive qubit design derived from the Cooper pair box," Physical Review A, vol. 76, no. 4, 2007.
[117] V. Bouchiat, D. Vion, P. Joyez, D. Esteve, and M. H. Devoret, "Quantum coherence with a single Cooper pair," Physica Scripta, vol. T76, no. 1, p. 165, 1998.
[118] IBM, "IBM will soon launch a 53-qubit quantum computer," 2019. [Online]. Available: https://techcrunch.com/2019/09/18/ibm-will-soon-launch-a-53-qubit-quantum-computer/
[119] A. Harrow, A. Hassidim, and S. Lloyd, "Quantum algorithm for linear systems of equations," Physical Review Letters, vol. 103, no. 15, Oct 2009.
[120] R. C. Gonzalez, "Deep convolutional neural networks [lecture notes]," IEEE Signal Processing Magazine, vol. 35, no. 6, pp. 79–87, Nov 2018.
[121] M. Abdelhamid and S. Koppula, "Applying the residue number system to network inference," arXiv, eprint 1712.04614, 2017.
[122] S. Salamat, M. Imani, S. Gupta, and T. Rosing, "RNSnet: In-memory neural network acceleration using residue number system," in IEEE International Conference on Rebooting Computing (ICRC), Nov 2018, pp. 1–12.
[123] N. Samimi, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Res-DNN: A residue number system-based DNN accelerator unit," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 2, pp. 658–671, 2020.
[124] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan 2017.

[125] Z. Torabi and G. Jaberipur, "Low-power/cost RNS comparison via partitioning the dynamic range," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 5, pp. 1849–1857, May 2016.
[126] M. Xu, Z. Bian, and R. Yao, "Fast sign detection algorithm for the RNS moduli set {2^(n+1) − 1, 2^n − 1, 2^n}," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 2, pp. 379–383, Feb 2015.
[127] J. Johnson, "Rethinking floating point for deep learning," CoRR, vol. abs/1811.01721, 2018. [Online]. Available: http://arxiv.org/abs/1811.01721
[128] U. Kulisch, Computer Arithmetic and Validity. De Gruyter, 2012.
[129] J. Gustafson and I. Yonemoto, "Beating floating point at its own game: Posit arithmetic," Supercomputing Frontiers and Innovations: An International Journal, vol. 4, no. 2, pp. 71–86, Jun. 2017.
[130] J. Yu, K. Kim, J. Lee, and K. Choi, "Accurate and efficient stochastic computing hardware for convolutional neural networks," in IEEE International Conference on Computer Design (ICCD), 2017, pp. 105–112.
[131] M. Alawad and M. Lin, "Stochastic-based deep convolutional networks with reconfigurable logic fabric," IEEE Transactions on Multi-Scale Computing Systems, vol. 2, no. 4, pp. 242–256, 2016.
[132] P. Li and D. J. Lilja, "Using stochastic computing to implement digital image processing algorithms," in IEEE 29th International Conference on Computer Design (ICCD), 2011, pp. 154–161.
[133] A. Ren, Z. Li, C. Ding, Q. Qiu, Y. Wang, J. Li, X. Qian, and B. Yuan, "SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing," in 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017, pp. 405–418.
[134] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross, "VLSI implementation of deep neural network using integral stochastic computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2688–2699, 2017.
[135] V. Lee, A. Alaghi, J. Hayes, V. Sathe, and L. Ceze, "Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing," in Conference on Design, Automation & Test in Europe (DATE), 2017, pp. 13–18.
[136] H. Sim and J. Lee, "A new stochastic computing multiplier with application to deep convolutional neural networks," in 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.
[137] R. Hojabr, K. Givaki, S. R. Tayaranian, P. Esfahanian, A. Khonsari, D. Rahmati, and M. H. Najafi, "SkippyNN: An embedded stochastic-computing accelerator for convolutional neural networks," in 56th ACM/IEEE Design Automation Conference (DAC), 2019, pp. 1–6.
[138] R. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120–126, 1978.
[139] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.
[140] J. Bajard, L. Didier, and P. Kornerup, "Modular multiplication and base extensions in residue number systems," in 15th IEEE Symposium on Computer Arithmetic (ARITH), June 2001, pp. 59–65.
[141] S. Antao, J. Bajard, and L. Sousa, "Elliptic curve point multiplication on GPUs," in 21st IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2010, pp. 192–199.
[142] L. Sousa, S. Antao, and P. Martins, "Combining residue arithmetic to design efficient cryptographic circuits and systems," IEEE Circuits and Systems Magazine, vol. 16, no. 4, pp. 6–32, 2016.
[143] J. Ambrose, H. Pettenghi, and L. Sousa, "DARNS: A randomized multi-modulo RNS architecture for double-and-add in ECC to prevent power analysis side channel attacks," in Asia and South Pacific Design Automation Conference (ASP-DAC), 2013, pp. 620–625.
[144] P. W. Shor, "Algorithms for quantum computation: discrete logarithms and factoring," in 35th Annual Symposium on Foundations of Computer Science, Nov 1994, pp. 124–134.
[145] NIST, "Post-quantum cryptography: Call for proposals," 2016, p. 25. [Online]. Available: https://csrc.nist.gov/CSRC/media/Projects/Post-Quantum-Cryptography/documents/call-for-proposals-final-dec-2016.pdf
[146] L. Babai, "On Lovász' lattice reduction and the nearest lattice point problem (shortened version)," in Proceedings of the 2nd Symposium on Theoretical Aspects of Computer Science (STACS), 1985, pp. 13–20.
[147] P. Martins and L. Sousa, "The role of non-positional arithmetic on efficient emerging cryptographic algorithms," IEEE Access, vol. 8, pp. 59533–59549, 2020.
[148] C. Gentry, "Fully homomorphic encryption using ideal lattices," in Proceedings of the 41st Annual ACM Symposium on Theory of Computing, 2009, pp. 169–178.
[149] N. Smart and F. Vercauteren, "Fully homomorphic SIMD operations," Cryptology ePrint Archive, Report 2011/133, 2011.
[150] C. Gentry, S. Halevi, and N. Smart, Fully Homomorphic Encryption with Polylog Overhead. Springer Berlin Heidelberg, 2012, pp. 465–482.
[151] P. Martins and L. Sousa, "A methodical FHE-based cloud computing model," Future Generation Computer Systems, vol. 95, pp. 639–648, 2019.
[152] S. Jordan, "Quantum algorithm zoo," 2011. [Online]. Available: https://quantumalgorithmzoo.org/
[153] P. Shor, "Algorithms for quantum computation: discrete logarithms and factoring," in 35th Annual Symposium on Foundations of Computer Science, 1994, pp. 124–134.
[154] R. Jozsa, "Quantum factoring, discrete logarithms, and the hidden subgroup problem," Computing in Science & Engineering, vol. 3, pp. 34–43, 2001.
[155] P. Kaye, R. Laflamme, and M. Mosca, An Introduction to Quantum Computing. Oxford University Press, Inc., 2007.
[156] D. R. Simon, "On the power of quantum computation," in Proceedings 35th Annual Symposium on Foundations of Computer Science, 1994, pp. 116–123.
[157] R. Jozsa, "Quantum algorithms and the Fourier transform," Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 454, no. 1969, pp. 323–337, Jan 1998.
[158] M. Nielsen and I. Chuang, Quantum Computation and Quantum Information: 10th Anniversary Edition. Cambridge University Press, 2011.
[159] R. Cleve, A. Ekert, C. Macchiavello, and M. Mosca, "Quantum algorithms revisited," Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 454, no. 1969, pp. 339–354, Jan 1998. [Online]. Available: http://dx.doi.org/10.1098/rspa.1998.0164
[160] A. Asfaw, L. Bello, Y. Ben-Haim, S. Bravyi, L. Capelluto, A. Vazquez, J. Ceroni, R. Chen, A. Frisch, J. Gambetta, S. Garion, L. Gil, S. Gonzalez, F. Harkins, T. Imamichi, D. McKay, A. Mezzacapo, Z. Minev, R. Movassagh, G. Nannicini, P. Nation, A. Phan, M. Pistoia, A. Rattew, J. Schaefer, J. Shabani, J. Smolin, K. Temme, M. Tod, S. Wood, and J. Wootton, "Learn quantum computation using Qiskit," 2020. [Online]. Available: http://community.qiskit.org/textbook
[161] R. Duarte, H. Neto, and M. Vestias, "Xtokaxtikox: A stochastic computing-based autonomous cyber-physical system," in IEEE International Conference on Rebooting Computing (ICRC), 2016, pp. 1–7.
[162] S. Antao and L. Sousa, "The CRNS framework and its application to programmable and reconfigurable cryptography," ACM Transactions on Architecture and Code Optimization, vol. 9, no. 4, Jan. 2013.
[163] D. Boneh, C. Dunworth, R. Lipton, and J. Sgall, "On the computational power of DNA," Discrete Applied Mathematics, vol. 71, no. 1, pp. 79–94, 1996.
[164] D. Lewis, "An accurate LNS arithmetic unit using interleaved memory function interpolator," in IEEE Symposium on Computer Arithmetic (ARITH), 1993.
[165] Wolfram MathWorld, "Modular inverse," https://mathworld.wolfram.com/ModularInverse.html.
[166] J. Bajard, J. Eynard, A. Hasan, and V. Zucca, "A full RNS variant of FV like somewhat homomorphic encryption schemes," Cryptology ePrint Archive, Report 2016/510, 2016, https://eprint.iacr.org/2016/510.
[167] R. Chaves and L. Sousa, "RDSP: A RISC DSP based on residue number system," in Euromicro Symposium on Digital System Design, 2003, pp. 128–135.
[168] QuTech, "Quantum Inspire (QI)," https://www.quantum-inspire.com/.
[169] IBM, "Scalable quantum systems," https://www.ibm.com/quantum-computing/learn/what-is-ibm-q/.
[170] N. Koblitz, "Elliptic curve cryptosystems," Mathematics of Computation, vol. 48, no. 177, pp. 203–209, 1987.
[171] R. Akhtar and F. A. Khanday, "Stochastic computing: Systems, applications, challenges and solutions," in 3rd International Conference on Communication and Electronics Systems (ICCES), 2018, pp. 722–727.

Leonel Sousa received a Ph.D. degree in Electrical and Computer Engineering from the Instituto Superior Técnico (IST), Universidade de Lisboa (UL), Lisbon, Portugal, in 1996, where he is currently a Full Professor. He is also a Senior Researcher with the R&D Instituto de Engenharia de Sistemas e Computadores (INESC-ID). His research interests include computer arithmetic, VLSI architectures, parallel computing and signal processing. He has contributed more than 200 papers to journals and international conferences, for which he has received several awards, including the DASIP'13 Best Paper Award, the SAMOS'11 Stamatis Vassiliadis Best Paper Award, the DASIP'10 Best Poster Award, and the Honorable Mention Award UTL/Santander Totta for the quality of his publications in 2009 and 2016. He has contributed to the organization of several international conferences, typically as program chair and as a general and topic chair, and has given keynotes in some of them. He has edited four special issues of international journals; he is currently an Associate Editor of the IEEE Transactions on Computers, IEEE Access and Springer JRTIP, and a Senior Editor of the IEEE Journal on Emerging and Selected Topics in Circuits and Systems. He has co-edited the books Circuits and Systems for Security and Privacy (CRC, 2018) and Embedded Systems Design with Special Arithmetic and Number Systems (Springer, 2017). He is a Fellow of the IET, a Distinguished Scientist of the ACM, and a Senior Member of the IEEE.