2019 International Siberian Conference on Control and Communications (SIBCON) The Pipelined QR Decomposition Hardware Architecture Based On CORDIC Algorithm

A.V. Sokolovskiy V.N. Tyapkin E.A. Veisov Scientific and Educational Scientific and Educational Scientific and Educational Laboratory "Systems of Navigation, Laboratory "Systems of Navigation, Laboratory "Systems of Navigation, Control and Communication" Control and Communication" Control and Communication" Siberian Federal University Siberian Federal University Siberian Federal University Krasnoyarsk, Russia Krasnoyarsk, Russia Krasnoyarsk, Russia [email protected] [email protected] [email protected]

Yu.L. Fateev Scientific and Educational Laboratory "Systems of Navigation, Control and Communication" Siberian Federal University Krasnoyarsk, Russia [email protected]

Abstract— QR decomposition plays a huge role in the compute a several complex functions by applying the adaptive filtering, control systems and a computation modeling appropriate pre-processing to input parameters, configuring of the physical processes. This paper concerns the issue of a compute mode and post-processing to results. Pre-processing QR decomposition hardware implementation features based on and post-processing to a components also Givens rotation technique. For speed-up of the computation implemented based on CORDIC algorithm. This allows to purposes used a pipelined architecture and CORDIC simplify the design process. algorithm. The hardware costs and speed of a computation is evaluated for a different adders architectures in the CORDIC algorithm. The Givens rotation technique is used because it II. QR DECOMPOSITION BY GIVENS ROTATION TECHNIQUE have a simple architecture and optimized hardware Any non-singular matrix A, sized of MxN is allowing to implementation. decompose to a multiplication of the Q to Keywords— hardware algorithm, QR decomposition, the upper- R. Given a matrix A (1). CORDIC, pipelined architecture, Givens rotation, FPGA, adders architecture.

QQrr11 1kn 11 1 I. INTRODUCTION ==. (1) AQR  QR decomposition is the linear equation solving QQ0 r technique and it's having a huge role in a such issues as a mmkkn1 adaptive signal filtering, control systems, robotics and computation modeling of a physical processes. The Givens In this case, matrix Q is allowing equation (2) rotation technique is used for hardware implementation of the QR decomposition because it's having an effective design QTQ = I. (2) based on CORDIC algorithm. The Givens rotation technique based on upper-triangular Thus, taking into account the expression (1) and matrix computation by a sequential rotating an adjacent expression (2) the matrix R is calculating by the equation (3). components in a rows of the input matrix.

The CORDIC algorithm use an simple add/subtract T qqaa operations with a shift register to rotate input vector. It's 11 1kn 11 1 ==T  . (3) widely used in a digital signal processing systems that RQA  constrained by a low available dimensions for embedding,  qqaammkkkn11 especially in a multiple-input and multiple-output (MIMO) systems [7]. In other hand, the pipelined architecture gives an In the issues [1, 2] described a technique to calculate the opportunity to use this algorithm in a high-speed digital upper-triangular matrix based on Givens Q communication systems. (4, 5). Nevertheless, that by default the CORDIC algorithm was proposed to compute a trigonometric functions besides Q = Q Q …Q = 1, (4) implementing a multiplier/divider operator it's allowing to 1 2 k

978-1-5386-5142-1/19/$31.00 © 2019 IEEE 2019 International Siberian Conference on Control and Communications (SIBCON) Mode of the CORDIC algorithm is determined by the μ csmn,, k mn k (i)  0 parameter and by a selecting of e function accordingly to an −sc mn,, k mn k equation (8). Q =  , (5) k  01 ()i −−1 i  μ ==1,e tan 2 ( Circular Rotation )  ()  μ ==0,ei 2−i ( Linear Rotation ) (8) () where μ =−1,ei = tanh−−1 2i ( Hyperbolic Rotation )

Since the CORDIC algorithm actually implements a a a + ==mn (1)mn . pseudo-rotation instead of a really rotation according to the csmn, mn 22 22 chosen function, then a result is stretched by the factor aa++++ aa mn(1) m n mn (1) m n K ≈ 1,646760258121. Thus a pre-processing and a post- processing must be applied to the input and output data After than the equation (3) is calculated, the matrix R can respectively. be expressed by the equation (6). The Givens matrix calculation for an every matrix A member is implements according to the scheme in the Fig. 1. 22+ aa11 21 X On the scheme 1 depicted that CORDIC algorithms is  X used in two modes sequentially [3]. In a step one the input 0 aa22+ 22 32 data pre-processing is implements in the rotation mode with R =  , (6) a configuration (9).  00  a mn (i) (i) –i  μ = 0, di = sign(z ), e = 2 . (9)   For align an matrix rows amplitude ratios in the upper- where X – is a member of the upper-triangular matrix set triangular matrix the pre-processing is used for scaling a at above a main diagonal. matrix A members by the using a scaling diagonal matrix S (10). The calculation costs of an member values iterative calculation of the upper-triangular matrix by taking into account the expression (5) primary relates to a multiply K −M 0 operator and a square root operator calculations. The QR  0 decompose design throughput based on an iterative −M 0 K calculation technique is a matrix size dependent. This −−() K M 1 0 approach is not acceptable in the multi-input and multi- S =  , (10) output (MIMO) high-speed systems. 0  −1 In this paper proposed the calculation blocks pipelined 0  K architecture based on the CORDIC algorithm.   III. HARDWARE ARCHITECTURE BASED ON THE CORDIC This operation excludes an iterated rotate calculation ALGORITHM results delay for each matrix A rows. The CORDIC algorithm is widely used thanks to its In a step two calculation of the matrix R (11) is efficiency and a simple implementation capabilities for a implements in vectoring mode (12). trigonometric, a hyperbolic and other complex functions calculation based on add/subtract operation and shift registers. It is a valuable reason for those digital signal processing systems that's a multi-channel and a high-speed of amm CORDIC CORDIC CORDIC (Rotation (Vectoring (Rotation rmm processing with a limited hardware and dimension resources K-(M-m) mode) mode) mode) is a crucial. a CORDIC The generalized CORDIC algorithm is described by an (m+1)m (M-m+1) (Rotation K equation (7). K-(M-m) mode) m = n

(1)ii+−()ii () =− amn CORDIC CORDIC CORDIC xxdyμ i 2, (Rotation (Vectoring (Rotation rmn -(M-m) (1)ii+−=+()ii () , (7) K mode) mode) mode) yydxi 2, (1)i+ ()ii () =− CORDIC zzdei . a(m+1)n (M-m+1) (Rotation dmm K K-(M-m) mode) m ≠ n

Fig. 1. The Givens rotation structural scheme. 2019 International Siberian Conference on Control and Communications (SIBCON)

−−() KaaXM 1 22+ 11 21  X 0  =  , (11) R 

0  amm   Fig. 2. The Givens rotation compute pipeline. Another one of the most effective architecture is based on (i) (i) (i) –1 –i a sum of terms pre-compute for two different input carry bits. μ = 1, di = –sign(x y ), e = tan 2 . (12) The true result of the sum computation is selected depending The calculations is ending by the post-processing in the on a result computed carry bit. The main difference of this rotation mode with a configuration (9). The post-processing architecture from a carry look-ahead architecture is a group also used for scaling a R (11) matrix members by using a of bits sum computation is implemented base on a sequential diagonal matrix S' (13), then matrix (6) is a result of this sum computation of an each bit terms. The computation operation. delay relates with a composition of bit terms. The maximum computation delay relates with a length of terms and with a carry look-ahead block length. ()− K M 1 0 Another kind of an adder architectures relates with a  0 ()M −2 probability of a bit-group occurrence with a generated or 0 K propagated bit group. The computation for this group may be ′ =  , (13) S  dropped. This architecture includes capabilities of a carry look-ahead architecture and a carry-select architecture. The 01 delay for this architecture relates with a length and a composition of bit sequences.  For minimizing a CORDIC algorithm computation delay where K – the stretching coefficient, m – the row number, is used a computing pipeline. As a result the computation that a diagonal matrix R member is located. delay of an algorithms becomes as computation delay of a The Givens rotation compute pipeline (Fig. 2) for each a one adder component with selected architecture. matrix A row consist of a delay lines (Delay), a pre- The characteristics of the above mentioned architectures processing block (PreProc), a Givens rotation processing is presented in a table 1. block (Proc) and a post-processing block (PostProc). The necessary hardware resources for an implementation for a matrix with size 4×4 pipelined QR decomposition is IV. COMPARISON OF THE QR DECOMPOSITION HARDWARE presented in a table 2. COSTS WITH DIFFERENT ADDERS ARCHITECTURE There are many hardware architectures of the adders, that TABLE I. THE ADDITION COMPUTATION DELAY FOR 14-BIT LENGTH are different from each other by a compute speed and a TERMS. hardware costs. Min. delay, clock Max. delay, clock Architecture The simplest in implementation is a ripple-carry adder cycles cycles architecture, with a sequential bit summing of a binary Ripple-Carry adder 3 15 numbers [3]. The base compute cell of this architecture is a Carry-Lookahead 2 4 full adder (FA). The full adder consist of input signals – a adder carry in signal, a two terms signals and a output signals – a Carry-Select adder 4 5 carry-out signal, a sum signals. The delay of a sum calculation for this architecture is relates with a composition Carry-Skip adder 4 15 of bit sequences. The maximum computation delay is relates with a length of terms, because carry-out signal must be TABLE II. NECESSARY HARDWARE RESOURCES PROCESSING evaluated on the each compute iteration. BLOCKS The most compute speed effective architecture is a carry Based on the following Flip-flops Processes look-ahead architecture. In this architecture the carry-out bit architecture propagation and generation for each bit position is evaluated Ripple-Carry adder 1420335 232875 in the special compute unit. Based on values of propagation Carry-Lookahead and generation bits the pre-computing may be done for 1666170 333855 several group of bits, that is proportional reduces the sum adder computing time. The sum computing delay also relates with Carry-Select adder 1843695 403650 a composition of bit sequences. But in difference with a Carry-Skip adder 1352025 302805 ripple-carry sum architecture for that architecture the maximum computation speed is relates with a length of terms and with length of a carry look-ahead computation The main advantage of the hardware implementation of a block. pipelined QR decomposition architecture may be noted a high compute speed, that allow to use QR decomposition in a 2019 International Siberian Conference on Control and Communications (SIBCON) high speed telecommunication MIMO systems. The REFERENCES disadvantage of the above mentioned implementation in a [1] V.I. Zhigan, “Adaptive signal filtering: theory and algorithms,” fixed-point terms is a decreasing computation accuracy after Moscow: Technosphere, 2013. a scaling pre-processing step, that is proportional relates with [2] Semih Aslan, Sufeng Niu, Jafar Saniie, “FPGA implementation of stretching coefficient K of the CORDIC algorithm. fast QR decomposition based on Givens rotation,” IEEE 55th International Midwest Symposium on Circuits and Systems Choosing an more effective solution for a (MWSCAS), 2012, pp. 470-473. doi: implementation algorithm taking into account the hardware 10.1109/MWSCAS.2012.6292059 features of the implementation device is allowing to increase [3] Behrooz Parhami, “Computer arithmetic: algorithms and hardware bandwidth of the digital signal processing device based on designs,” NY, Oxford: Oxford University Press, 2000. QR decomposition. The fastest QR decomposition may be [4] A.V. Sokolovskiy, A.B. Gladyshev, D.D. Dmitriev, and V.N. implemented based on the carry look-ahead adders. At the Ratushniak, “Hardware diagram computing devices navigation equipment consumers SRNS,” 2017 Dynamics of Systems, same time the using of a hardware DSP-slices allow to Mechanisms and Machines (Dynamics), 2017, pp 1-4. doi: valuable increase speed of QR decomposition and to 10.1109/Dynamics.2017.8239510 optimize compute architecture, taking into account a [5] 1076 IEEE Standard VHDL Langeage Reference Manual, 2002, p. hardware capabilities. 300. [6] Xilinx UG1046, UlatraFast embedded Design Methodology Guide, 2015, p 231. ACKNOWLEDGMENT [7] Hyukyeon Lee, Kyungmook Oh, Minjeong Cho, Yunseok Jang, and This work was supported by the Ministry of Education Jaeseok Kim, “Efficient low-latency implementation of CORDIC- and Science of the Russian Federation in the framework of based sorted QR decomposition for multi-Gbps MIMO systems,” the Federal target program «Research and development on IEEE transactions on circuits and systems – II: express briefs, vol. 65, priority directions of development of the scientific- no. 10, 2018, pp. 1375-1379. doi: 10.1109/TCSII.2018.2853099 technological complex of Russia for 2014-2020» (agreement № 14.578.21.0247, unique ID project RFMEFI57817X0247).