electronics

Article Design and VLSI Implementation of a Reduced-Complexity Sorted QR Decomposition for High-Speed MIMO Systems

Lu Sun 1,2 , Bin Wu 1,* and Tianchun Ye 1

1 The Intelligent Manufacturing Electronics RD Center, the Institute of Microelectronics, Chinese Academy of Science, Beijing 100029, China; [email protected] (L.S.); [email protected] (T.Y.) 2 School of Electronics, Electrical and Communication Engineering, the University of Chinese Academy of Sciences, Beijing 100049, China * Correspondence: [email protected]; Tel.: +86-010-8299-5566

 Received: 24 September 2020; Accepted: 9 October 2020; Published: 12 October 2020 

Abstract: In this article, a low-complexity and high-throughput sorted QR decomposition (SQRD) for multiple-input multiple-output (MIMO) detectors is presented. To reduce the heavy hardware overhead of SQRD, we propose an efficient SQRD algorithm based on a novel modified real-value decomposition (RVD). Compared to the latest study, the proposed SQRD algorithm can save the computational complexity by more than 44.7% with similar bit error rate (BER) performance. Furthermore, a corresponding deeply pipelined hardware architecture implemented with the coordinate rotation digital computer (CORDIC)-based (GR) is designed. In the design, we propose a time-sharing Givens rotation structure utilizing CORDIC modules in idle state to share the concurrent GR operations of other CORDIC modules, which can further reduce hardware complexity and improve hardware efficiency. The proposed SQRD processor is implemented in SMIC 55-nm CMOS technology, which processes 62.5 M SQRD per second at a 250-MHz operating frequency with only 176.5 kilo-gates. Compared to related studies, the proposed design has the best normalized hardware efficiency and achieves a 6-Gbps MIMO data rate which can support current high-speed wireless communication systems such as IEEE 802.11ax.

Keywords: sorted QR decomposition; real-value decomposition; Givens rotation; multiple-input multiple-output detector

1. Introduction Multiple-input multiple-output (MIMO) is widely employed in current wireless communication systems, such as IEEE 802.11ax [1], to achieve high data throughput. A MIMO detector is used to recover the original signal from the mixed multi-dimensional data streams, which has a great impact on system performance. In the MIMO detector, QR decomposition (QRD) serves as a preprocessor dealing with a channel to facilitate subsequent MIMO detection. Thus, QRD needs massive arithmetic operations and takes up a considerable part of the hardware complexity of the MIMO detector [2,3]. The sorted QRD (SQRD) can improve the bit error rate (BER) performance of detection through inserting sorting procedures into the original QRD [4]. Due to the addition of sorting modules, however, the hardware overhead further increases. In view of this, it is necessary to design an SQRD processor with high throughput and reduced complexity. There are three well-known algorithms used for QR decomposition: Householder transformation (HT) [5], modified Gram–Schmidt (MGS) [6], and Givens rotation (GR) [7]. HT is rarely used in the QR decomposition because of its huge computational complexity. Compared with GMS, GR can be

Electronics 2020, 9, 1657; doi:10.3390/electronics9101657 www.mdpi.com/journal/electronics Electronics 2020, 9, 1657 2 of 15 realized with a coordinate rotation digital computer (CORDIC), which can simplify the computational complexity. In addition, GR is performed row-wise and the processing of different rows can be done at the same time. Thus, GR can achieve high parallelism to improve the throughput. Given this, many studies adopt CORDIC-based GR to implement SQRD [8,9]. However, all the published studies are suffering from high computational complexity of SQRD and cost a lot of hardware overhead in hardware design. On the other hand, in the hardware design of SQRD, the iterative sorting destroys the original tight and consecutive pipelined GR process and creates a large number of “bubbles”, which are the idle states of the CORDIC-based GR structure. Therefore, the adopted CORDIC-based GR structure is underused, which makes the SQRD hardware inefficient. In this article, we propose an efficient SQRD algorithm with a novel modified real-value decomposition (RVD), which can greatly reduce the number of CORDIC operations and simplify the computational complexity compared to previous studies. Furthermore, the corresponding deeply pipelined hardware architecture with low latency and low hardware overhead is designed and implemented. In the hardware design, we adopt a time-sharing GR structure utilizing certain CORDIC modules in idle state to perform the concurrent GR operations of other CORDIC modules, which can save hardware cost and improve hardware efficiency. The comparisons of the implementation results show that the proposed SQRD processor overmatches the other related designs in normalized hardware efficiency and achieves up to 6 Gbps MIMO data throughput. The contributions of this study are as follows:

We propose an efficient reduced-complexity SQRD algorithm based on a novel modified RVD. • Compared to the latest related study, the computational complexity of the proposed SQRD algorithm is greatly reduced by more than 44.7%. In addition, the proposed SQRD algorithm has a competitive BER performance and is implementation-friendly; We design a deeply pipelined SQRD hardware architecture with a time-sharing GR structure for • 4 4 MIMO systems. The proposed time-sharing GR structure cleverly utilizes the CORDIC × modules in idle state to process certain rotation operations that should have been handled by additional CORDIC module. Therefore, additional hardware is saved and the hardware efficiency of the proposed SQRD design is improved.

Notations: ( ) and ( ) denote the real and imaginary parts of the argument, respectively. A < · = · i,j denotes the element in the ith row and jth column of a matrix A. Ai,: and A:,j denote the ith row and jth column of a matrix A, respectively. v denotes the ith element of a vector v. ( )T denotes the transpose i · of the argument. ( )H denotes the Hermitian transpose of the argument. aI and aR denote the real and · imaginary parts of a complex number a, respectively. The rest of this article is organized as follows. Section2 presents the background of this study by reviewing the MIMO detection model and related studies about SQRD. Section3 describes the proposed SQRD algorithm based on a novel modified RVD. The proposed SQRD hardware architecture with time-sharing GR structure for 4 4 MIMO detectors is introduced in Section4. In Section5, × the implementation results of the proposed SQRD processor and other related studies are discussed and compared. Finally, conclusive remarks are presented in Section6.

2. Background

2.1. MIMO Detection Model The MIMO system under consideration consists of N transmit antennas and N receive antennas. The complex-valued matrix H of size N N represents the MIMO channel; the complex-valued N 1 T × T × vector s = [s1, s2, , sN] and the complex-valued N 1 vector y = [y1, y2, , yN] denote the ··· × ··· T transmit signal and the receive signal, respectively; the N 1 vector n = [n , n , , nN] describes × 1 2 ··· Electronics 2020, 9, 1657 3 of 15

  the zero-mean i.i.d. complex additive Gaussian noise, where n 0, δ2 and i = 1, 2, , N. i ∼ N ··· The complex-valued MIMO system model is given by

y = Hs + n. (1)

The purpose of MIMO detection is to recover the transmit signal s from the receive signal y. According to reference [10,11], this can be realized by the maximum likelihood (ML) principle. Therefore, the optimum solution of MIMO detection can be obtained by

^ 2 sML = argmin y Hs , (2) s Ω k − k ∈ where Ω is the solution space. The derivation process of Equation (2) is given in AppendixA. To solve the least square problem like Equation (2), QR decomposition [12] and Cholescky factorization [13] can be performed on the matrix H to simplify the solution procedure. However, the detection procedure based on Cholesky factorization consists of successive forward substitution and backward substitution [14], which is incompatible with column iterative sorting. Whereas, when performed with QR decomposition, the detection procedure can be executed with only backward substitution, which is suitable for column iterative sorting. On the other hand, performing QR decomposition on H can obtain a Q and an upper R, i.e., H = QR and pre-multiplying y Hs − by a unitary matrix QH(QH = QT) does not change its norm and thereby has no impact on the MIMO detection. Thus, Equation (2) can be reformulated as

^ H 2 sML = argmin Q y Rs . (3) s Ω k − k ∈ Therefore, QR decomposition is preferred in MIMO detection. With SQRD performed on the channel matrix, the columns of R are rearranged by iterative sorting to try to ensure that the detection of the signals in different layers of s is conducted in the order of signal-to-noise ratio (SNR) from large to small. In this way, the error propagation in the detection process is alleviated, which could improve the detection performance.

2.2. Related Studies In recent years, there were many studies focusing on the implementation of SQRD. In reference [15–21], the SQRD is performed directly on the complex-valued matrix H for the complex-valued MIMO detection model. However, the subsequent MIMO detection in a complex-valued model is more complicated than the one in a real-valued model [22]. Thus, many studies [3,8] develop SQRD based on conventional RVD [23], with which the complex-valued model in Equation (1) is converted into a real-valued model as follows " # " # " # " # (y) (H) (H) (s) (n) < = < −= < + < . (4) (y) (H) (H) · (s) (n) = = < = = Before performing iterative sorting and GR operations, the N N complex channel matrix is firstly × converted into its real counterpart of size 2N 2N with conventional RVD [23]. As the size of the matrix × is enlarged by four times, the number of sorting procedures and operations of GR both dramatically increase, which leads to huge hardware overhead and lengthy processing latency. The design in reference [9] adopts a SQRD algorithm based on a modified RVD and utilizes the symmetry of adjacent columns of the RVD matrix to reduce the number of CORDIC operations as well as sorting procedures. However, the computational complexity of this scheme still stays at a relatively high level and thereby brings about considerable hardware cost. To alleviate this problem, we propose an efficient SQRD algorithm, which can significantly reduce the computational complexity. Electronics 2020, 9, 1657 4 of 15

3. Proposed SQRD Algorithm with a Novel Modified RVD

3.1. Proposed Modified RVD With the proposed modified RVD, the complex-valued system model in Equation (1) can be reformulated as

! ! ! !  (y )   h h h h h h h h  (s )   (n )   1   1,1 1,2 1,1 1,2 1,N 1 1,N 1,N 1 1,N  1   1   <   − −  <   <   (y2)   < h2,1 h2,2 −= h2,1 h2,2 · · · < h2,N 1 h2,N −= h2,N 1 h2,N  (s2)   (n2)   <   ! ! − ! − !  <   <   (y )   h h h h h h h h  (s )   (n )   1   1,1 1,2 1,1 1,2 1,N 1 1,N 1,N 1 1,N  1   1   =   − −  =   =   (y2)   = h2,1 h2,2 < h2,1 h2,2 · · · = h2,N 1 h2,N < h2,N 1 h2,N  (s2)   (n2)   =   − −  =   =   .   . . . . .  .   .   .  =  ......  .  +  . . (5)           ! ! ! !      (yN 1)   hN 1,1 hN 1,2 hN 1,1 hN 1,2 hN 1,N 1 hN 1,N hN 1,N 1 hN 1,N  (sN 1)   (nN 1)   < −   − − − − − − − − − −  < −   < −   (yN)   < h hN −= h hN · · · < h hN N −= h hN N  (sN)   (nN)     N,1 ,2 N,1 ,2 N,N 1 , N,N 1 ,      <   ! ! − ! − !  <   <   (yN 1)   hN 1,1 hN 1,2 hN 1,1 hN 1,2 hN 1,N 1 hN 1,N hN 1,N 1 hN 1,N  (sN 1)   (nN 1)   = −   − − − − − − − − − −  = −   = −  (yN) = hN,1 hN,2 < hN,1 hN,2 · · · = hN,N 1 hN,N < hN,N 1 hN,N (sN) (nN) = − − = = It can be suggested that the proposed modified RVD is a permuted version of conventional RVD and it takes 2 2 sub-matrices and 2 1 sub-vectors of original complex matrix and vectors, × × respectively, as basic units to perform RVD. Thus, N is restricted to an even number.

3.2. The Proposed SQRD Algorithm Based on the modified RVD introduced above, we propose a reduced-complexity SQRD algorithm with CORDIC-based GR, which is shown in Algorithm 1. The proposed SQRD algorithm basically comprises three steps. In step 1, sorted complex Givens rotation (SCGR), which contains iterative sorting, and complex Givens rotation is applied to complex channel matrix H and receive signal y. Before the elimination process of every column, first, the sorting procedure finds the column with the smallest norm value and swaps it with the first column. Then, the reordered matrix is processed with complex Givens rotation to zero the elements below the diagonal in the first column. Repeat the two procedures until all the elements below the diagonal of H are eliminated and then the complex upper triangular matrix Rc is obtained. In the SCGR, the P records the column order of the iterative sorting for subsequent MIMO detection. In step 2, Rc is converted into its real counterpart S with the proposed modified RVD. Figure1 shows the diagram of the proposed modified RVD performed on Rc for N N (N is even) MIMO systems, where rR and rI denote the real and × imaginary part of r, respectively. Firstly, Rc are partitioned into 2 2 sub-matrices q2 2 and 2 1 × i,×j × sub-vectors q2 1, where i, j = 1, 2, , n and n = N/2. Then, RVD is performed on all the sub-matrices i × ··· q2 2 and sub-vectors q2 1 to obtain their corresponding extended versions p4 4 of size 4 4 and p4 1 i,×j i × i,×j × i × of size 4 1 by Equations (6) and (7), respectively. Due to the feature of complex Givens rotation [24], × the diagonal elements of the upper triangular matrix Rc are all real numbers and all the elements above its diagonal are complex numbers. After the modified RVD is performed, below the diagonal of every 4 4 sub-matrix qi,×j (in red dotted box) of S, there is only one non-zero element. Consequently, the number of non-zero elements that need to be eliminated in S is N/2 in total, which is quite small and does not need many elimination operations in the following real Givens rotation (RGR). It is clear that all the non-zero elements to be nullified in S are located close to the diagonal but in different rows far from each other. This will facilitate the eliminating operations of the following RGR in hardware design. Finally, in step 3, these non-zero elements below the diagonal of S are eliminated with RGR to get the desired upper triangular matrix R and QHy.        q2 2 q2 2   ij× ij×  p4 4 =  <  −=  , (6) ij×    q2 2 q2 2  = ij× < ij×

4 1 h  2 1  2 1i p × = q × ; q × . (7) i < i = i Electronics 2020, 9, 1657 5 of 15 Electronics 2020, 9, x FOR PEER REVIEW 6 of 16

44 44 S  p1, j p1,n rrRIRRIIRRIIR0  rrrrr   rrrry   c 1,1 1,2 1,2 1,2j 1 1,2 j 1,2 j  1 1,2 j 1, N  1 1, N 1, N  1 1, N 1 R  22 22 21 RRIIRRIIR q q q 0r2,2 0 0 r 2,21j r 2,2 j r 2,21 j   r 2,2 j r 2,1 N  r 2, N  r 2,1 N   r 2, N y 2 41 1, j 1,n 1 p 0 rIRIIRRI r r r r r r r r IRRIr r y 1 r1,1 r 1,2 r 1,2j 1 r 1,2 j r 1, N 1 r 1, N y 1  1,2 1,1 1,2 1,2j 1 1,2 j 1,2 j  1 1,2 j 1, N  1 1,NNN1, 1 1, 1 IIRRIIRRI   000r2,2 r 2,21j r 2,2 j r 2,21 j  r 2,2 j r 2,1 N  r 2, N r 2,1 N  r 2, N y 2 0 r22 r 2,2j 1 r 2,2 j r 2, N 1 r 2, N y 2   44 22 22 22 p1,1  q 21  RIRRIIR 1,1 q q q r r0  r r r  r  r y  ij, in, i  21,21ijij 21,2 21,2 ij  21,1 iNiNiN  21, 21,1  21, iNi 21 00 r r r r y RRIIR  21,21i j  21,2 i  j 21,1 i  N  21, i  N 21 i   44 0r2i ,2 j 0 0 44 r 2 i , N 1 r 2 i , N  r 2i, N 1ry 2 i , N 2 i p p 41 04×4 ij, IRIIRRIin, pi  0 0 0 r r r y  0 r21,2ijij r 21,21  r 21,2 ij r 21,1 iN  r 21, iNiN r 21,1  r 21, iN y 21 i  2i ,2 j 2 i , N 1 2 i , N 2 i    IIRRI 22 21 0 0 0 r2,2ij r 2,1 iN r 2, iN r 2,1 iN r 2, iN y 2 i  qnn, qn      0 0 0 0 rNN1, 1 rNNN1, y 1 RIR   rNN1, 1 rNNNNN1,0  r  1, y  1 0 0 0 0 0 ry R  NNN,  44 0ryNNN, 0 0 41 p p 04×4 04×4 nn, 0 r IRIr r y n NN1, NN1, 1 NNN1, 1 0 0 0 ryI NNN, FigureFigure 1. Schematic 1. Schematic diagram diagram of the expansion of the expansion of Rc with proposed of Rc with modified proposed real-value modified decomposition real-value (RVD). decomposition (RVD). Algorithm 1 Proposed SQRD algorithm

3.3.INPUT: PerformanceHN N, y ENvaluation1, P = IN of the Proposed SQRD Algorithm × × H OUTPUT: R2N 2N, Q y2N 1,P For SQRD×, which is ×implemented with CORDIC, the number of required CORDIC operations canStep be 1: regarded Sorted complex as a measurement Givens rotation of (SCGR) the computational complexity of the SQRD algorithm. Within c the1: R proposed= [H y ] SQRD algorithm for NN ( N is even) MIMO systems, the CORDIC operations = needed2: for j in SCGR1 : N stage are given by = c 2 3: αj R:,j k k N 4: end 3 N k 1 N  k  1  N  k 1  2( N  k ) = N . (8) 5: for i = 1 : N k 1 6: ki = arg min αl The ldetailed=i,...,N derivation of Equation (8) is shown in Appendix B, whereas in the following RGR, c c 7: Swap R , P:,i, and αi with R , P:,k , and αk , respectively the CORDIC:,i operations consumed:,ki i are as followsi : c 8: Perform complex Givens rotation in vectoring mode on the elements of R:,i in pairs N /2 9: for m = i + 1 : N + 1 2 (4k 1)= N N  2 . c (9) 10: Perform complex Givens rotation in rotationk 1 mode on the elements of R:,m in pairs 11: end 12: forTablej = 1i :listsN the number of CORDIC operations required in the proposed SQRD algorithm as well as a latest related 2 study. For a clear comparison, the number of CORDIC operations needed and 13: α = α Rc j j − i,j the14: complexityend reduction of the proposed compared to the one in reference [9] for different matrix size15:send are presented in Figure 2a. It is clear that the proposed SQRD algorithm has a huge advantage overStep the 2: Proposed one in reference modified RVD[9] in terms of computational complexity. For 4 × 4 MIMO systems, the number16: for iof= CORDIC1 : N/2 operations of the proposed algorithm is greatly reduced by 44.7% compared to the17: onefor inj = reference1 : N/2 [9]. In addition, as the size of the channel matrix increases, the reduction will be   Rc Rc further2 expanded2  2i 1,2 andj 1 gradually2i 1:2j  approach 50%. 18: q =  − − −  ij×  c c  R2i,2j 1 R2i,2j −      Table 1. Complexity2 2 of different2 2 Givens rotation (GR)-based QR decomposition schemes.  q × q ×   < ij −= ij  19: S4i 3:4i,4j 3:4j =       − − A lgorithm2 2 Number2 2  of CORDIC Operations If N  4  qij× qij×  = < 32 20: end [9] 2NNN 1 2  1 2 134 2 1 h c c i 21: q × = R + ; R + i 2i 1,Nh 1  2i,N 1  i 32 − 2 1 2 1 NNN+ 1 2 1 2 22: S4i 3:4i,2N+1 =The qproposed× ; q ×     74 − < i = i 23: end StepTo 3: Realevaluate Givens the rotation BER performance of the proposed SQRD algorithm in MIMO detection, it is 24: for i = 1 : N/2 simulated with a K-best detector and a maximum likelihood (ML) detector in an uncoded 4 × 4 64- 25: Perform real Givens rotation in vectoring mode on S4i 2,4i 2, S4i 1,4i 2 QAM MIMO system along with the SQRD algorithm− in− reference− − [9]. Figure 2b shows that the 26: for j = 4i 1 : 2N + 1 proposed SQRD− algorithm achieves similar BER performance with the one in reference [9]. 27: Perform real Givens rotation in rotation mode on S4i 2,j, S4i 1,j − − 28: end 29: end

Electronics 2020, 9, 1657 6 of 15

3.3. Performance Evaluation of the Proposed SQRD Algorithm For SQRD, which is implemented with CORDIC, the number of required CORDIC operations can be regarded as a measurement of the computational complexity of the SQRD algorithm. Within the proposed SQRD algorithm for N N (N is even) MIMO systems, the CORDIC operations needed in × SCGR stage are given by

XN ((N k + 1)(N k + 1) + (N k)(1 + 2(N k))) = N3. (8) − − − − k=1

The detailed derivation of Equation (8) is shown in AppendixB, whereas in the following RGR, the CORDIC operations consumed are as follows:

NX/2   (4k 1) = N2 + N /2. (9) − k=1

Table1 lists the number of CORDIC operations required in the proposed SQRD algorithm as well as a latest related study. For a clear comparison, the number of CORDIC operations needed and the complexity reduction of the proposed compared to the one in reference [9] for different matrix sizes are presented in Figure2a. It is clear that the proposed SQRD algorithm has a huge advantage over the one in reference [9] in terms of computational complexity. For 4 4 MIMO systems, the number × of CORDIC operations of the proposed algorithm is greatly reduced by 44.7% compared to the one in reference [9]. In addition, as the size of the channel matrix increases, the reduction will be further expanded and gradually approach 50%.

Table 1. Complexity of different Givens rotation (GR)-based QR decomposition schemes.

Algorithm Number of CORDIC Operations If N=4 [9] 2N3 + (1/2)N2 (1/2)N 134 − Electronics 2020, 9, x FORThe PEER proposed REVIEW N3 + (1/2)N2 + (1/2)N 74 7 of 16

(a) (b)

FigureFigure 2. 2.Performance Performance comparisonscomparisons withwith thethedesign designin in reference reference [ 9[9]:]: ( a(a)) complexity complexity performance performance forfor didifferentfferent matrix matrix sizes; sizes; ( b(b)) uncoded uncoded bit bit error error rate rate (BER) (BER) performance performance in in 4 4 ×4 4 64-QAM 64-QAM multiple-input multiple-input × multiple-outputmultiple-output (MIMO) (MIMO) system. system.

4. ProposedTo evaluate SQRD the VLSI BER performanceArchitecture of the proposed SQRD algorithm in MIMO detection, it is simulated with a K-best detector and a maximum likelihood (ML) detector in an uncoded 4 4 64-QAM × MIMO4.1. Overview system of along the Proposed with the SQRD SQRD Hardware algorithm Architecture in reference [9]. Figure2b shows that the proposed SQRD algorithm achieves similar BER performance with the one in reference [9]. According to Algorithm 1, we designed the corresponding high-speed and low-complexity SQRD hardware architecture for a 4 × 4 MIMO system. To match the high throughput of current wireless communication systems, the proposed SQRD architecture is deeply pipelined and highly parallel, which can decompose one 4 × 4 channel matrix in four clock cycles and process one QyH per clock cycle. In other words, the processing cycles of SQRD (clock cycles needed for decomposing one H ) and (clock cycles needed for processing one ) are four clock cycles and one clock cycle, respectively, which are both decreased by 20% compared to the design of reference [9]. As throughput=clock frequency processing cyc les , both the SQRD and throughput of the proposed SQRD will be improved by 25% compared to the design of reference [9]. In addition, the time-sharing GR structure is designed to take advantages of the idle state of CORDIC modules caused by iterative sorting, which can further reduce the hardware overhead and improve hardware efficiency. The proposed SQRD hardware architecture is basically comprised of one norm calculator, three sorting modules, and four processing engines. Norm calculator and sorting modules are used for iterative sorting; processing engines with CORDIC-based Givens rotation structure are used for decomposing the channel matrix and processing . Figure 3 presents the dataflow of the proposed SQRD design for the 4 × 4 MIMO system. Sorted complex Givens rotation (SCGR) is performed on the 4 × 4 complex channel matrix and 4 × 1 complex receive signal vector y with all sorting modules and processing engines. After the iterative sorting procedure, every processing engine zeros the elements below the diagonal in the first column of the current input matrix in vectoring mode (introduced in Section 4.2) and rotates the elements of subsequent columns in rotation mode (introduced in Section 4.2). Thus, the SQRD processing cycles of SCGR are four clock cycles. After the process of all the four columns is done, the complex upper triangular matrix Rc is obtained. According to Algorithm 1, next, the 4 × 4 intermediate is converted into its 8 × 8 real version S for the following RGR, which is given by

Electronics 2020, 9, 1657 7 of 15

4. Proposed SQRD VLSI Architecture

4.1. Overview of the Proposed SQRD Hardware Architecture According to Algorithm 1, we designed the corresponding high-speed and low-complexity SQRD hardware architecture for a 4 4 MIMO system. To match the high throughput of current × wireless communication systems, the proposed SQRD architecture is deeply pipelined and highly parallel, which can decompose one 4 4 channel matrix in four clock cycles and process one × QHy per clock cycle. In other words, the processing cycles of SQRD (clock cycles needed for decomposing one H) and QHy (clock cycles needed for processing one QHy) are four clock cycles and one clock cycle, respectively, which are both decreased by 20% compared to the design of reference [9]. As throughput =clock frequency/processing cycles, both the SQRD and QHy throughput of the proposed SQRD will be improved by 25% compared to the design of reference [9]. In addition, the time-sharing GR structure is designed to take advantages of the idle state of CORDIC modules caused by iterative sorting, which can further reduce the hardware overhead and improve hardware efficiency. The proposed SQRD hardware architecture is basically comprised of one norm calculator, three sorting modules, and four processing engines. Norm calculator and sorting modules are used for iterative sorting; processing engines with CORDIC-based Givens rotation structure are used for decomposing the channel matrix and processing QHy. Figure3 presents the dataflow of the proposed SQRD design for the 4 4 MIMO system. Sorted complex Givens rotation (SCGR) is performed on the × 4 4 complex channel matrix H and 4 1 complex receive signal vector y with all sorting modules and × × processing engines. After the iterative sorting procedure, every processing engine zeros the elements below the diagonal in the first column of the current input matrix in vectoring mode (introduced in Section 4.2) and rotates the elements of subsequent columns in rotation mode (introduced in Section 4.2). Thus, the SQRD processing cycles of SCGR are four clock cycles. After the process of all the four columns is done, the complex upper triangular matrix Rc is obtained. According to Algorithm 1, next, the 4 4 intermediate Rc is converted into its 8 8 real version S for the following RGR, which is × × givenElectronics by 2020, 9, x FOR PEER REVIEW 8 of 16

 R I R R I I R  r1,1 r RIRRIIR0 r r r r r y  r1,1 r1,2 1,20  r 1,21,2 r 1,31,3 r 1,41,4  r 1,3 1,3 r 1,4 y 11,4 1   − R R − I − I R   00rr2,2 00 0 0 rRRIIRr rr  rr r yr y   2,2 2,32,3 2,42,4 2,3− 2,3 2,4− 22,4 2     I IRIIRRR I I R RI I  r r rr rr ry r y  00 rr rr1,1 rr rr rr rr rr y y   1,1 1,2 1,11,3 1,21,4 1,31 1,4 1  1,21,2 1,1 1,21,2 1,31,3 1,41,4 1,31,3 1,41,4 1 1     IIRRII I R R I   0 r2,2 0r2,3rr2,4 ry2 r Modified y Modified 0000 0 0 rr2,2 rr rr rr rr y y   2,2 2,3 2,4 2  2,2 2,32,3 2,42,4 2,32,3 2,42,4 2 2 . (10)    RIRR I . R (10)  0 0 00r3,3 r3,4ry3 r yRVDRVD  00 0 0 0 0 0 rr3,3 rr 0 0 r yr y   3,3 3,4 3  3,3 3,43,4 3,4− 33,4 3  r y  R R  0 00 0 04,4 0 4ry→  0 0 0 0 0 r4,4 0 0 y  4,4 4  0 0 0 0 0ry4,4 0 0 4 4   I R I   0 0 0 0 0 IRIr r3,3 r y   0 0 0 0 0 r3,43,4 r 3,3 r 3,4 y3,4 3 3   I  0 0 0 0 0 0 0 r4,4I y 0 0 0 0 0 0 0 ry4,4 4 4

S H H 1:4,: R1:4,: Q y 1:4 S5:8,: R5:8,: Q y 5:8 Zero Real number

Being processed Complex number RGR

Norm cal. & Processing engine#1 sorting#2 Processing engine#2 sorting#3 Processing engine#3 Processing engine#4 sorting#1

Hy Rc

SCGR FigureFigure 3. 3.Dataflow Dataflow diagramdiagram of the proposed proposed sorted sorted QR QR decomposition decomposition (SQRD (SQRD)) hardware hardware architecture architecture forfor 4 4 ×4 4 multiple-inputmultiple-input multiple-outputmultiple-output ( (MIMO)MIMO) systems. systems. × Then, RGR performed with processing engine#3–4 eliminates the two non-zero elements below the diagonal of S , as shown in Figure 4. It can be suggested that the elements of the upper two rows of Rc are obtained after the processing in sorting#3 because the elimination of the first two columns of Hy have been completed. As the upper four rows of are derived from the upper two rows of according to Equation (10), they can be obtained immediately after the process of sorting#3 is I done. Therefore, the elimination processes of the non-zero element S3,2 (i.e., r1,2 ) in RGR and the third column of in SCGR are performed with processing engine#3 at the same time when the relevant elements are output from sorting#3. This will improve the parallelism of the SQRD hardware I architecture and shorten the processing latency. As for the non-zero element S7,6 (i.e., r3,4 ), it will

be zeroed with processing engine#4 when r4,4 of is obtained.

RIRRIIR RIRRIIR r1,1 r 1,20  r 1,2 r 1,3 r 1,4  r 1,3  r 1,4 y 1 r1,1 r 1,20  r 1,2 r 1,3 r 1,4  r 1,3  r 1,4 y 1 RRIIR  0r2,2 0 0 r 2,3 r 2,4 r 2,3 r 2,4 y 2 0 a2,2 a 2,3 a 2,4 a 2,5 a 2,6 a 2,7 a 2,8 b 2 0 rIRIIRRI r r r r r r y 00a a a a a a b 1,2 1,1 1,2 1,3 1,4 1,3 1,4 1 3,3 3,4 3,5 3,6 3,7 3,8 3 IIRRI RGR IIIII 000r2,2 r 2,3 r 2,4 r 2,3 r 2,4 y 2 0 0 0 r2,2 r 2,3 r 2,4 r 2,3 r 2,4 y 2 0 0 0 0r rRIR 0  r y 0 0 0 0r rRIR 0  r y 3,3 3,4 3,4 3 3,3 3,4 3,4 3 R 0 0 0 0 0ry4,4 0 0 4 0 0 0 0 0 a6,6 a 6,7 a 6,8 b 6 IRI 0 0 0 0 0 r r r y 0 0 0 0 0 0 a a b 3,4 3,3 3,4 3 7,7 7,8 7 0 0 0 0 0 0 0 ryI I 4,4 4 0 0 0 0 0 0 0 ry4,4 4 : vectoring mode : rotation mode

Figure 4. The processing diagram of real Givens rotation (RGR) for 4 × 4 MIMO system.

4.2. Processing Engines Figures 5–7 present the block diagram of the proposed processing engines (PEs). All these PEs comprise three kinds of CORDIC-based Givens rotation structures: processing unit a (PUa),

Electronics 2020, 9, x FOR PEER REVIEW 8 of 16

RIRRIIR r1,1 r 1,20  r 1,2 r 1,3 r 1,4  r 1,3  r 1,4 y 1 RRIIR 0r2,2 0 0 r 2,3 r 2,4 r 2,3 r 2,4 y 2 IRIIRR I r1,1 r 1,2 r 1,3 r 1,4 y 1 0 r1,2 r 1,1 r 1,2 r 1,3 r 1,4 r 1,3 r 1,4 y 1 IIRRI 0 r r r y Modified 000r r r r r y 2,2 2,3 2,4 2 2,2 2,3 2,4 2,3 2,4 2 . (10) 00r r y RVD 0 0 0 0r rRIR 0  r y 3,3 3,4 3 3,3 3,4 3,4 3 0 0 0 ry R 4,4 4 0 0 0 0 0ry4,4 0 0 4 0 0 0 0 0 rIRI r r y 3,4 3,3 3,4 3 I 0 0 0 0 0 0 0 ry4,4 4

S H H 1:4,: R1:4,: Q y 1:4 S5:8,: R5:8,: Q y 5:8 Zero Real number

Being processed Complex number RGR

Norm cal. & Processing engine#1 sorting#2 Processing engine#2 sorting#3 Processing engine#3 Processing engine#4 sorting#1

Hy Rc

SCGR Figure 3. Dataflow diagram of the proposed sorted QR decomposition (SQRD) hardware architecture for 4 × 4 multiple-input multiple-output (MIMO) systems. Electronics 2020, 9, 1657 8 of 15 Then, RGR performed with processing engine#3–4 eliminates the two non-zero elements below the diagonalThen, RGR of performedS , as shown with in Figure processing 4. It can engine#3–4 be suggested eliminates that the the elements two non-zero of the upper elements two below rows c theof R diagonal are obtained of S, as shownafter the in processing Figure4. It in can sorting#3 be suggested because that the the elimination elements of of the the upper first two two columns rows of Rofc areHy obtained have after been the completed. processing As inthe sorting#3 upper four because rows of the elimination are derived of from the first the upper two columns two rows of   c of[H y] haveaccording been completed.to Equation As (10), the they upper can four be obtained rows of Simmediatelyare derived after from the the process upper two of sorting#3 rows of R is according to Equation (10), they can be obtained immediately after the process ofI sorting#3 is done. done. Therefore, the elimination processes of the non-zero element S3,2 (i.e., r1,2 ) in RGR and the Therefore, the elimination processes of the non-zero element S (i.e., rI ) in RGR and the third column third column of in SCGR are performed with processing3,2 engine#31,2 at the same time when the of Rc in SCGR are performed with processing engine#3 at the same time when the relevant elements relevant elements are output from sorting#3. This will improve the parallelism of the SQRD hardware are output from sorting#3. This will improve the parallelism of the SQRD hardware architectureI architecture and shorten the processing latency. As for the non-zero elementI S7,6 (i.e., r3,4 ), it will and shorten the processing latency. As for the non-zero element S7,6 (i.e., r3,4), it will be zeroed with be zeroed with processing engine#4c when r of is obtained. processing engine#4 when r4,4 of R is obtained.4,4

RIRRIIR RIRRIIR r1,1 r 1,20  r 1,2 r 1,3 r 1,4  r 1,3  r 1,4 y 1 r1,1 r 1,20  r 1,2 r 1,3 r 1,4  r 1,3  r 1,4 y 1 RRIIR  0r2,2 0 0 r 2,3 r 2,4 r 2,3 r 2,4 y 2 0 a2,2 a 2,3 a 2,4 a 2,5 a 2,6 a 2,7 a 2,8 b 2 0 rIRIIRRI r r r r r r y 00a a a a a a b 1,2 1,1 1,2 1,3 1,4 1,3 1,4 1 3,3 3,4 3,5 3,6 3,7 3,8 3 IIRRI RGR IIIII 000r2,2 r 2,3 r 2,4 r 2,3 r 2,4 y 2 0 0 0 r2,2 r 2,3 r 2,4 r 2,3 r 2,4 y 2 0 0 0 0r rRIR 0  r y 0 0 0 0r rRIR 0  r y 3,3 3,4 3,4 3 3,3 3,4 3,4 3 R 0 0 0 0 0ry4,4 0 0 4 0 0 0 0 0 a6,6 a 6,7 a 6,8 b 6 IRI 0 0 0 0 0 r r r y 0 0 0 0 0 0 a a b 3,4 3,3 3,4 3 7,7 7,8 7 0 0 0 0 0 0 0 ryI I 4,4 4 0 0 0 0 0 0 0 ry4,4 4 : vectoring mode : rotation mode

Figure 4. The processing diagram of real Givens rotation (RGR) for 4 4 MIMO system. Figure 4. The processing diagram of real Givens rotation (RGR) for 4 ×× 4 MIMO system. 4.2. Processing Engines 4.2. Processing Engines Figures5–7 present the block diagram of the proposed processing engines (PEs). All these PEs Figures 5–7 present the block diagram of the proposed processing engines (PEs). All these PEs comprise three kinds of CORDIC-based Givens rotation structures: processing unit a (PUa), processing comprise three kinds of CORDIC-based Givens rotation structures: processing unit a (PUa), unit b (PUb), and processing unit c (PUc). Figure8 shows their architecture based on CORDIC module, all of which can work in vectoring mode (VM) and rotation mode (RM). In VM, elimination operation is applied to the leading column elements of the current input matrix; in RM, the rotation directions generated in the vectoring mode are retrieved for rotation operations of subsequent elements. PUa processes paired complex rows, zeroing the lower one and turning the upper one into a real number [25]. In vectoring mode, the upper-left three CORDIC modules are used, whereas in rotation mode, all of the four CORDIC modules are occupied. PUb is used for the paired rows of which the leading elements are real numbers and the rest are complex numbers. In vectoring mode, the upper CORDIC module processes the leading paired real elements and zeros the lower one; in rotation mode, the two CORDIC modules are used for the rotation operations of the following complex elements. PEc with only one CORDIC module processes paired real inputs. As these processing units have different processing delays (PUa has two CORDIC stages, whereas PUb and PUc have one), delay elements (DEs) are used to align the data sequences for the subsequent processing unit or sorting module. Among all these Givens rotation structures, all the processing units in PE#1 and PE#2, PUa#4 in PE#3, and PUc#2 in PE#4 are used for SCGR, whereas PUc#3 in PE#3 and PUc#4 in PE#4 are dedicated to RGR. The processing units with multiplexers (MUXs) at input and output ports, including PUb#2 and PUc#2–4, form the time-sharing Givens rotation structure, which will be introduced in the next subsection. Electronics 2020, 9, x FOR PEER REVIEW 9 of 16 ElectronicsElectronics 20 202020, 9, ,9 x, xFOR FOR PEER PEER REVIEW REVIEW 9 9of of 16 16 processing unit b (PUb), and processing unit c (PUc). Figure 8 shows their architecture based on processingprocessing unit unit b b ( PUb(PUb),) ,and and processing processing unit unit c c( PUc(PUc).) .Figure Figure 8 8 shows shows their their architecture architecture based based on on CORDIC module, all of which can work in vectoring mode (VM) and rotation mode (RM). In VM, CORDICCORDIC module module, ,all all of of which which can can work work in in vectoring vectoring mode mode (VM) (VM) and and rotation rotation mode mode (RM) (RM). .In In VM VM, , elimination operation is applied to the leading column elements of the current input matrix; in RM, eliminationelimination operation operation is is applied applied to to the the leading leading column column element elements sof of the the current current input input m maatrixtrix; ;in in RM RM, , the rotation directions generated in the vectoring mode are retrieved for rotation operations of thethe rotation rotation directions directions generated generated in in the the vectoring vectoring mode mode are are retrieved retrieved for for rotation rotation operations operations of of subsequent elements. PUa processes paired complex rows, zeroing the lower one and turning the subsequentsubsequent element elements.s. P PUUaa processes processes paired paired complex complex rows rows, ,zeroing zeroing the the lower lower one one and and turning turning the the upper one into a real number [25]. In vectoring mode, the upper-left three CORDIC modules are used, upperupper one one into into a a real real number number [ 2[255].]. In In vectoring vectoring mode, mode, the the upper upper-left-left three three CORDIC CORDIC modules modules are are used used, , whereas in rotation mode, all of the four CORDIC modules are occupied. PUb is used for the paired whereaswhereas in in rotation rotation mode mode, ,all all of of the the four four CORDIC CORDIC modules modules are are occupied. occupied. P PUUbb is is used used fo for rthe the paired paired rows of which the leading elements are real numbers and the rest are complex numbers. In vectoring rowsrows of of which which the the leading leading element elements sare are real real numbers numbers and and the the rest rest are are complex complex numbers. numbers. In In vectoring vectoring mode, the upper CORDIC module processes the leading paired real elements and zeros the lower mode,mode, the the upper upper CORDIC CORDIC module module processes processes the the leading leading paired paired real real elements elements and and zeros zeros the the lower lower one; in rotation mode, the two CORDIC modules are used for the rotation operations of the following oneone; in; in rotation rotation mode, mode, the the two two CORDIC CORDIC modules modules are are used used for for the the rotation rotation operations operations of of the the following following complex elements. PEc with only one CORDIC module processes paired real inputs. As these complexcomplex eleleementsments. . PEc PEc with with only only one one CORDIC CORDIC module module processes processes paired paired real real inputs. inputs. AsAs thesethese processing units have different processing delays (PUa has two CORDIC stages, whereas PUb and processingprocessing units units have have different different processing processing delays delays ( PUa(PUa has has two two CORDIC CORDIC stages stages, ,whereas whereas P PUbUb and and PUc have one), delay elements (DEs) are used to align the data sequences for the subsequent PUcPUc have have one), one), delay delay el elementements s (DE(DEs)s ) areare usedused to to align align the the datadata sequencessequences for for the the subsequentsubsequent processing unit or sorting module. Among all these Givens rotation structures, all the processing processingprocessing unit unit or or sorting sorting module. module. Among Among all all these these Givens Givens rotation rotation structures, structures, all all the the processing processing units in PE#1 and PE#2, PUa#4 in PE#3, and PUc#2 in PE#4 are used for SCGR, whereas PUc#3 in unitsunits in in PE#1 PE#1 and and PE#2 PE#2, ,PUa# PUa#44 in in PE#3 PE#3, ,and and PUc#2 PUc#2 in in PE#4 PE#4 are are used used for for SCGR SCGR, ,whereas whereas PUc#3 PUc#3 in in PE#3 and PUc#4 in PE#4 are dedicated to RGR. The processing units with multiplexers (MUXs) at PE#3PE#3 and and PUc#4 PUc#4 in in PE#4 PE#4 are are dedicated dedicated to to RGR. RGR. T Thehe processing processing units units with with multiplexers multiplexers (MUXs) (MUXs) at at input and output ports, including PUb#2 and PUc#2–4, form the time-sharing Givens rotation inputElectronics and output9 ports, including PUb#2 and PUc#2–4, form the time-sharing Givens rotation structure,2020 which, , 1657 will be introduced in the next subsection. 9 of 15 structure,structure, which which will will be be introduced introduced in in the the next next subsection. subsection.

H H H1,: H1,: S6:7,8 R6:7,8 H 1,: H 1,: SS6:7,8 RR6:7,8

1,: PUa#1 DE#1 1,: 6:7,8 PUc#3 6:7,8 MUX

PUa#1 DE#1 MUX PUc#3 MUX

H S MUX R MUX H2,: 2,: S 2:3,: MUX 2:3,: H2,: H2,: S 2:3,: R2:3,: H2,: H2,: 2:3,: R2:3,:

c c H3,: H3,: H 3,: Rc 3,: H3,: HH3,: H 3,: RR3,: H3,: PUa#2 PUb#1 3,: 3,: PUa#4 3,: PUa#2PUa#2 PUb#1PUb#1 PUa#4PUa#4 H4,: H4,: H H H4,: H4,: H 4,: H 4,: H4,: H4,: H 4,: H 4,: 4,: 4,: (a) (b) (a(a) ) (b(b) ) Figure 5. Block diagram of processing engines (PEs): (a) processing engine#1; (b) processing engine#3. FigureFigure 5. 5. Block BlockBlock diagram diagram diagram of of of processing processing processing engines engines (PEs) (PEs) (PEs):: (: a( ()aa )processing) processing processing engine#1; engine#1; engine#1; ( b( (b) )processing) processing processing engine#3. engine#3.

DE#3 H2,: H2,: PUc#1 DE#3 H2,: H2,: PUc#1 DE#3 H2,: H2,: PUc#1

DE#2 H3,: DE#2 H3,: H3,: DE#2 H3,: H3,: H3,: H4,: PUa#3 H4,: PUa#3 PUb#2 H4,:

H PUa#3 PUb#2PUb#2 MUX

H 4,: MUX R MUX

H 4,: MUX R 2:3,6 MUX 4,: MUX R 2:3,6 S 2:3,6 S 2:3,6 R2:3,8 S 2:3,6 R2:3,8 2:3,6 R2:3,8

Figure 6. Block diagram of processing engine#2. FigureFigure 6 6..6 Block. Block diagram diagram of of processing processing engine#2. engine#2.

S S 2:3,7 R2:3,7 S2:3,72:3,7 RR2:3,7

PUc#2 2:3,7 MUX

MUX PUc#2

MUX MUX MUX S6:7,: MUX R SS6:7,: RR6:7,:

6:7,: PUc#4 6:7,: MUX

MUX PUc#4 c MUX

MUX c MUX H 4,: DE#4 MUX Rc 4,: H4,:4,: DE#4 R4,:4,:

Electronics 2020, 9, x FOR PEER REVIEWFigure 7.7. Block diagram of processing engine#4. 10 of 16 FigureFigure 7 .7 Block. Block diagram diagram of of processing processing engine#4. engine#4.

CORDIC CORDIC CORDIC VM/RM VM/RM VM/RM CORDIC VM/RM CORDIC CORDIC CORDIC VM/RM RM RM

(a) (b) (c) Figure 8. ArchitectureArchitecture of of coordinate coordinate rotation rotation digital digital computer computer (CORDIC (CORDIC)-based)-based processing processing units: units: (a) (PUa;a) PUa; (b) (PbU) PUb;b; (c) (PUcc) PUc.. 4.3. Time-Sharing Givens Rotation Structure 4.3. Time-Sharing Givens Rotation Structure During the SCGR procedure for 4 4 MIMO system described above, the current input matrix During the SCGR procedure for 4 ×× 4 MIMO system described above, the current input matrix of PE#2 is of the size 3 3 after the elimination process of the first column of H with PE#1. With the of PE#2 is of the size 3 ×× 3 after the elimination process of the first column of H with PE#1. With the sorting procedures inserted in the column elimination processes, the Givens rotation structures of PE#2 sorting procedures inserted in the column elimination processes, the Givens rotation structures of process only three columns and stay in an idle state for one clock cycle within every processing cycle PE#2 process only three columns and stay in an idle state for one clock cycle within every processing (four clock cycles). Similarly, the periods of idle state of PE#3 and PE#4 in every processing cycles cycle (four clock cycles). Similarly, the periods of idle state of PE#3 and PE#4 in every processing are two and three clock cycles, respectively. As a result, quite a few processing units are operating cycles are two and three clock cycles, respectively. As a result, quite a few processing units are in an unsaturated state, which causes inefficiency of the SQRD hardware architecture. On the other operating in an unsaturated state, which causes inefficiency of the SQRD hardware architecture. On hand, the elimination process of non-zero elements below the diagonal of S in RGR needs many the other hand, the elimination process of non-zero elements below the diagonal of S in RGR needs Givens rotation operations which are independent of the process of SCGR. Given this, we design a many Givens rotation operations which are independent of the process of SCGR. Given this, we time-sharing (TS) Givens rotation structure to take advantage of these idle states mentioned above to design a time-sharing (TS) Givens rotation structure to take advantage of these idle states mentioned share parts of the Givens rotation operations in RGR, which can save hardware cost as well as improve above to share parts of the Givens rotation operations in RGR, which can save hardware cost as well hardware efficiency. as improve hardware efficiency. Figure 9 shows the deeply pipelined processing flow of the proposed SQRD processor with time- sharing Givens rotation structure. The index ij,  in the box denotes the element being processed in the i th row and j th column of the first during SCGR procedure or the first S during RGR procedure. In the SCGR procedure, after the process of sorting#2 is done, PUc#1 and PUa#3 in PE#2 stay idle at the first clock cycle of the processing cycles and process the three columns of the current input 3 × 3 matrix in the other three clock cycles; PUa#4 in PE#3 is in an idle state for the first half of every processing cycle and processes the two columns of the current input 2 × 2 matrix in the second

half; then, PUc#2 in PE#4 processes the remaining H4,4 at the last clock cycle of the processing cycles

and stays idle for the other three clock cycles. In the procedure of RGR, the elimination of S3,2 in the first performed by PUc#3 in PE#3 in vectoring mode starts at the 41st clock cycle when ,

I i.e., r1,2 , and S2,2 , i.e., r2,2 , are obtained after the processing in sorting#3; PUc#4 in PE#4 will

eliminate S7,6 in vectoring mode at the 55th clock cycle after S6,6 , i.e., r4,4 , is obtained from the process of PUc#2. With a time-sharing Givens rotation structure, part of the rotation operations of PUc#3 and PUc#4 performed on the subsequent elements shown in Figure 4 are reasonably assigned to the suitable processing units in an idle state. As illustrated in Figure 9, the rotation operations of RI IR rr2,4, 1,4  , i.e., SS2,6, 3,6  , and rr2,4, 1,4  , i.e., SS2,8, 3,8  , are separately assigned to the two

CORDIC modules in PEb#2 because PEb#2 is just idle when r1,4 and r2,4 are generated. For a IR similar reason, the rotation operations of rr2,3, 1,3  , i.e., SS2,7, 3,7  , 0, r1,1  , i.e., SS2,3, 3,3  , and R 0, r3,4  , i.e., SS6,8, 7,8  , are assigned to PUc#2, PUc#4, and PUc#3, respectively. If no time-sharing Givens rotation is adopted, as shown in Figure 10, an additional PUc#5 must be used for helping PUc#3 with the rotation operations to match the processing rate of SCGR because it will take seven

clock cycles for PUc#3 alone to finish the Givens rotation of the seven paired elements of S2:3,: , which exceeds the processing cycles of SCGR. Furthermore, it will cost more delay buffers (DBs) for the aligning of the relevant elements for modified RVD, because the process is extended with only three processing units (PUc#3, PUc#4, and PUc#5). Therefore, with time-sharing Givens rotation structure,

Electronics 2020, 9, 1657 10 of 15

Figure9 shows the deeply pipelined processing flow of the proposed SQRD processor with time-sharing Givens rotation structure. The index (i, j) in the box denotes the element being processed in the ith row and jth column of the first H during SCGR procedure or the first S during RGR procedure. In the SCGR procedure, after the process of sorting#2 is done, PUc#1 and PUa#3 in PE#2 stay idle at the first clock cycle of the processing cycles and process the three columns of the current input 3 3 matrix in the other three clock cycles; PUa#4 in PE#3 is in an idle state for the first half of every × processing cycle and processes the two columns of the current input 2 2 matrix in the second half; × then, PUc#2 in PE#4 processes the remaining H4,4 at the last clock cycle of the processing cycles and stays idle for the other three clock cycles. In the procedure of RGR, the elimination of S3,2 in the first I S performed by PUc#3 in PE#3 in vectoring mode starts at the 41st clock cycle when S3,2, i.e., r1,2, and S2,2, i.e., r2,2, are obtained after the processing in sorting#3; PUc#4 in PE#4 will eliminate S7,6 in vectoring mode at the 55th clock cycle after S6,6, i.e., r4,4, is obtained from the process of PUc#2. With a time-sharing Givens rotation structure, part of the rotation operations of PUc#3 and PUc#4 performed on the subsequent elements shown in Figure4 are reasonably assigned to the suitable processing ( R I ) ( ) units in an idle state. As illustrated in Figure9, the rotation operations of r2,4, r1,4 , i.e., S2,6, S3,6 , and ( rI , rR ), i.e., (S , S ), are separately assigned to the two CORDIC modules in PEb#2 because − 2,4 1,4 2,8 3,8 PEb#2 is just idle when r1,4 and r2,4 are generated. For a similar reason, the rotation operations of ( rI , rR ), i.e., (S , S ), (0, r ), i.e., (S , S ), and (0, rR ), i.e., (S , S ), are assigned to − 2,3 1,3 2,7 3,7 1,1 2,3 3,3 3,4 6,8 7,8 PUc#2, PUc#4, and PUc#3, respectively. If no time-sharing Givens rotation is adopted, as shown in Figure 10, an additional PUc#5 must be used for helping PUc#3 with the rotation operations to match the processing rate of SCGR because it will take seven clock cycles for PUc#3 alone to finish the Givens rotation of the seven paired elements of S2:3,:, which exceeds the processing cycles of SCGR. Furthermore, it will cost more delay buffers (DBs) for the aligning of the relevant elements for modified Electronics 2020, 9, x FOR PEER REVIEW 11 of 16 RVD,Electronics because 2020 the, 9, x processFOR PEER is REVIEW extended with only three processing units (PUc#3, PUc#4, and11PUc#5). of 16 Therefore, with time-sharing Givens rotation structure, one CORDIC module and eight DBs (a DB one CORDIC module and eight DBs (a DB contains Nbw bit registers; is the adopted bit-width) containsoneare CORDICsavedNbw bitand module registers; the hardware andN eightbw efficiencyis theDBs adopted (a DB of thecontains bit-width)entire SQRDNbw arebit is registers savedimproved. and; the is hardware the adopted effi bitciency-width) of the entireare SQRD saved isand improved. the hardware efficiency of the entire SQRD is improved. Processing cycles of first H or S TS RM of first H or S Idle state VM of SCGR RM of SCGR VM of RGR RM of RGR TS RM

Processing1,1 1,2 1,3 cycles1,4 of first H or S TS RM of first H or S Idle state VM of SCGR RM of SCGR VM of RGR RM of RGR TS RM PUa#1 2,1 2,2 2,3 2,4 3,1 3,2 3,3 3,4 PUa#2 1,14,1 1,24,2 1,34,3 1,44,4 PUa#1 2,1 2,2 2,3 2,4 1,1 1,2 1,3 1,4 PUb#1 3,1 3,2 3,3 3,4 3,1 3,2 3,3 3,4 PUa#2 4,1 4,2 4,3 4,4 PUc#1 2,2 2,3 2,4 2,6 PUb#1 1,1 1,2 1,3 1,4 3,6 PUa#3 3,1 3,2 3,3 3,4 3,2 3,3 3,4 2,8 PUc#1 2,24,2 2,34,3 2,44,4 2,63,8 PUb#2 2,2 2,3 2,4 3,6 PUa#3 3,2 3,3 3,4 3,2 3,3 3,4 2,8 4,2 4,3 4,4 3,83,3 3,4 PUa#4 2,2 2,3 2,4 4,3 4,4 PUb#2 3,2 3,3 3,4 2,3 4,4 PUc#2 3,33,3 3,4 PUa#4 2,2 4,32,5 4,42,4 6,8 PUc#3 2,3 3,2 3,5 3,4 4,4 7,8 PUc#2 3,32,7 6,6 6,7 PUc#4 2,2 2,53,7 2,4 7,6 6,87,7 PUc#3 3,2 3,5 3,4 7,8 2,7 6,6 6,7 PUc#4 3,7 7,6 7,7 0 6 7 8 9 10 11 12 13 14 15 16 17 23 24 25 26 27 28 29 30 31 32 33 34 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 6 clock cycles for Sorting#1 5 clock cycles for Sorting#2 5 clock cycles for Sorting#3 T/clock cycle 0 6 7 8 9 10 11 12 13 14 15 16 17 23 24 25 26 27 28 29 30 31 32 33 34 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 6 clock cycles for Sorting#1 5 clock cycles for Sorting#2 5 clock cycles for Sorting#3 T/clock cycle FigureFigure 9. Deeply 9. Deeply pipelined pipelined processing processing flow flow of the of proposed the proposed SQRD SQRD processor processor with withtime-sharing time-sharing Givens rotationFigureGivens structure. 9.rotation Deeply structure pipelined. processing flow of the proposed SQRD processor with time-sharing Givens rotation structure. Time/clock cycle Time/clock cycle Time/clock cycle Time/clock cycle T0 T0 1 T0  2 T0  3 T0  4 T1 T1 1 T1  2 T1  3 T0 T0 1 T0  2 T0  3 T TDB1 T DB 2 T  3 T  4 T TDB1 TDB 2 T  3 T TDB1 T DB 2 T  3 DB DB r011 0 0 0 0 r331 1 1 1 r110 0 0 0 r33 DB DBDB DB DBDB DB DB DBDB DB DB DBDB r11 r12 r33 r34 r11 r12 r33 r34 DB DB DB DB r12 DBr13 DBRB rr3444 r12 DBr13 DBr14 rr3444 DB DB flow Data

r13 DBRBr14 DB r44 r22 rr1323 rr1424 r44 Data flow Data Data flow Data DB DB r22 r23 rDB14 DB r22 r23 r24

Data flow Data DB DB r r r I R I R I 22 23 DB24 DB r r r r r PUc#4 r r DB PUc#3 22 , 12 23, 13 , 12 r44 , 34 , 33 r I R I R I I R R24 I DBR I I R PUc#2r r r , r R r I PUc#4 r ,r , r R PUc#3 r22 , r12 0 , r12 r23, r13 r24 , r14 PUc#4 r44 ,r34 0 , r33 0 ,r34 PUc#3 22 , 12 23 , r1311 r24, ,12r14 44 34 ,r3334 PUb#2 R PUc#5I R RI IR R I IR I R PUc#2 I R R I IR Givens Givens rotation PUc#3 PUc#4 r r r r22 , r12 00,,rr1211 r-23r23, ,rr1313r-24r24, ,rr14 r44 ,r34 0 , r33 0 ,r34 PUc#4 -r23, ,r1113r-24r24, ,r1414 PUc#3 , 34 14 rotation Givens PUb#2 PUc#5 I R I R I R I R Givens Givens rotation 0 , r11 -r23,r13 -r24,r PUc#4 -r23,r13 -r24,r14 PUc#3 Without time-sharing14 structure rotation Givens With time-sharing structure Without time-sharing structure With time-sharing structure FigureFigure 10. 10.Processing Processing flow flow of of time-sharing time-sharing Givens rotation rotation structure structure.. Figure 10. Processing flow of time-sharing Givens rotation structure. 4.4. CORDIC Architecture 4.4. CORDIC Architecture Figures 11 and 12 illustrate the detailed CORDIC architecture adopted in our design. The CORDICFigure modules 11 and has 12 eight illustrate micro - therotation detailed stages CORDIC and four architecture pipelined stages. adopted In the in ournormal design. CORDIC The

CORDICmodule shown module in has Figure eight 1 1microa, the- rotationrotation stagesdirection and calculated four pipelined by y istages.0: 1:1 In isthe used normal and CORDICstored in y 0: 1:1 moduleVM; and shown in RM, in the Figure rotation 11a, direction the rotation is retrieved. direction In calculated the CORDIC by modulei used is forused the and time stored-sharing in VM;Givens and rotation in RM, thestructure rotation shown direction in Figure is retrieved. 11b, a timeIn the-sharing CORDIC mode module (TSM) used is added for the to time process-sharing the Givensshared rotation structureoperation shown using thein Figure rotation 11b direction, a time- sharingfrom the mode corresponding (TSM) is added processing to process unit. Thethe 7 sharedscale factor rotation of CORDICoperation withusing eight the rotation micro-rotation directions is from the1 corresponding 1 22i 0.607259 processing. In our design, unit. The we i0   7 scale factor of CORDIC with eight micro-rotations is 1 1 22i 0.607259 . In our design, we 1  3  6  9 i0   approximate this number to 2 2  2  2  0.607421875 and the corresponding hardware 1  3  6  9 approximatearchitecture isthis presented number in Figure to 2 12.2  2  2  0.607421875 and the corresponding hardware architecture is presented in Figure 12.

xi yi xi yi xi yi xi yi yi>0:-1:1 yi>0:-1:1 >> i >> i >> i >> i yi>0:-1:1 yi>0:-1:1 >> i >> i >> i >> i TS rotation

direction MUX MUX TS rotation

+/- +/- +/- +/- direction MUX VM/RM VM/RM/TSMMUX +/- +/- +/- +/- xi+1 y REG xi+1 y REG i+1 VM/RM i+1 VM/RM/TSM x x i+1 (a)y i+1 REG i+1 (by) i+1 REG (a) (b) Figure 11. Architecture of micro-rotations of adopted CORDIC module: (a) micro-rotation for normal FigureCORDIC 11. module;Architecture (b) micro of micro-rotation-rotations for time of adopted-sharing CORDIC CORDIC module: module. ( a) micro-rotation for normal CORDIC module; (b) micro-rotation for time-sharing CORDIC module.

Electronics 2020, 9, x FOR PEER REVIEW 11 of 16

one CORDIC module and eight DBs (a DB contains Nbw bit registers; is the adopted bit-width) are saved and the hardware efficiency of the entire SQRD is improved.

Processing cycles of first H or S TS RM of first H or S Idle state VM of SCGR RM of SCGR VM of RGR RM of RGR TS RM

1,1 1,2 1,3 1,4 PUa#1 2,1 2,2 2,3 2,4 3,1 3,2 3,3 3,4 PUa#2 4,1 4,2 4,3 4,4 1,1 1,2 1,3 1,4 PUb#1 3,1 3,2 3,3 3,4 PUc#1 2,2 2,3 2,4 2,6 3,6 PUa#3 3,2 3,3 3,4 2,8 4,2 4,3 4,4 3,8 2,2 2,3 2,4 PUb#2 3,2 3,3 3,4 3,3 3,4 PUa#4 4,3 4,4 2,3 4,4 PUc#2 3,3 2,2 2,5 2,4 6,8 PUc#3 3,2 3,5 3,4 7,8 2,7 6,6 6,7 PUc#4 3,7 7,6 7,7

0 6 7 8 9 10 11 12 13 14 15 16 17 23 24 25 26 27 28 29 30 31 32 33 34 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 6 clock cycles for Sorting#1 5 clock cycles for Sorting#2 5 clock cycles for Sorting#3 T/clock cycle

Figure 9. Deeply pipelined processing flow of the proposed SQRD processor with time-sharing Givens rotation structure.

Time/clock cycle Time/clock cycle

T0 T0 1 T0  2 T0  3 T0  4 T1 T1 1 T1  2 T1  3 T0 T0 1 T0  2 T0  3 DB DB DB DB DB DB DB DB r11 r33 r11 r33 DB DB DB r12 DB r34 r12 DB DB r34 DB r13 DBRB r44 r13 r14 r44 DB flow Data r14 DB r22 r23 r24

Data flow Data DB r22 r23 DB DB r24 DB I R I R I PUc#3 r22 ,r12 r23, r13 , r12 PUc#4 r44 ,r34 , r33 I R R I R I I R R I R PUc#3 r , r 0 , r r , r r , r PUc#4 r ,r 0 , r 0 ,r PUc#2 , r r , r ,r 22 12 12 23 13 24 14 44 34 33 34 11 24 14 PUb#2 34 PUc#5 I R I R I R I R Givens Givens rotation 0 , r11 -r23,r13 -r24,r PUc#4 -r23,r13 -r24,r14 PUc#3

14 rotation Givens Without time-sharing structure With time-sharing structure Figure 10. Processing flow of time-sharing Givens rotation structure.

Electronics 2020, 9, 1657 11 of 15 4.4. CORDIC Architecture

Figure4.4. 11 and Architecture 12 illustrate the detailed CORDIC architecture adopted in our design. The CORDIC module has eight micro-rotation stages and four pipelined stages. In the normal CORDIC Figures 11 and 12 illustrate the detailed CORDIC architecture adopted in our design. The CORDIC module shown in Figure 11a, the rotation direction calculated by yi 0: 1:1 is used and stored in module has eight micro-rotation stages and four pipelined stages. In the normal CORDIC module VM; andshown in inRM, Figure the 11rotationa, the rotation direction direction is retrieved. calculated In bythey CORDIC> 0 : 1 : 1moduleis used andused stored for the in VM; time and-sharing in i − GivensRM, rotation the rotation structure direction shown is retrieved. in Figure In the 11b CORDIC, a time module-sharing used mode for the (TSM) time-sharing is added Givens to process rotation the sharedstructure rotation shown operation in Figure using 11 b,the a time-sharingrotation direction mode (TSM)from the is added corresponding to process theprocessing shared rotation unit. The 7 scale operation factor of using CORDIC the rotation with eight direction micro from-rotation the correspondings is 1 processing 1 22i 0.607259 unit. The. In scale our factor design, of we Q   i0   CORDIC with eight micro-rotations is 7 1/ √1 + 2 2i = 0.607259. In our design, we approximate 1  3i= 0 6  9 − approximatethis number this to 2 number1 + 2 3 to2 6 229 2= 0.607421875 2  2  0.607421875and the corresponding and the hardwarecorresponding architecture hardware is − − − − − − architecturepresented is p inresented Figure 12 in. Figure 12.

xi yi xi yi

yi>0:-1:1 yi>0:-1:1 >> i >> i >> i >> i

TS rotation

direction MUX +/- +/- +/- +/- MUX VM/RM VM/RM/TSM x x i+1 yi+1 REG i+1 yi+1 REG (a) (b)

ElectronicsFigure 2020Figure 11., 9, xArchitecture FOR 11. Architecture PEER REVIEW of micro of micro-rotations -rotations of of adopted adopted CORDIC CORDIC module:module: ( a(a)) micro-rotation micro-rotation for for normal normal 12 of 16 CORDICCORDIC module; module; (b) micro (b) micro-rotation-rotation for for time time-sharing-sharing CORDIC CORDIC module.module.

x8 y8

>>3 >>6 >>9 >>1 >>3 >>6 >>9 >>1

Figure 12. Architecture of scaling of adopted CORDIC module. Figure 12. Architecture of scaling of adopted CORDIC module. 4.5. Sorting 4.5. SortingThe iterative sorting procedure of the proposed SQRD processor is performed with norm calculator and sorting#1–3. Figure 13 shows the architectures of these modules. The complex elements of the 4 4 The iterative sorting procedure of the proposed SQRD processor is performed with× norm channel matrix H are delivered column by column as the input of norm calculator. First, the squares of calculatorthereal and part sorting#1 and imaginary–3. Figure part 13 of shows the four the elements architectures in one column of these are modules. calculated The with complex SQ modules elements of the 4and × 4 thenchannel these matrix squares H are are added delivered up in pairs column to obtain by column the norm as value the input of every of complexnorm calculator. element; First, the squareultimately,s of the all real the part norm and values imaginary are added part together of the with four the elements SUM module in one to column get the normare calculated value of with SQ modulethe currents and inputthen these column. squares In sorting#1, are added the norm up in update pairs to module obtain is the bypassed norm andvalue the of CS every module complex element;compares ultimately these, successiveall the norm norm values values are until add it findsed together the minimum with one. the Finally, SUM module the column to withget the norm value ofsmallest the current norm is input swapped column. with the In first sorting#1, column ofthe the n fourorm stored update in themodule column is bu bypassedffer. In a similar and the CS way, sorting#2 and sorting#3 perform the rest of the sorting procedures. However, what is different module compares these successive norm values until it finds the minimum one. Finally, the column from sorting#1 is that, before the CS module, they will first update the norm values from the former with thesorting smallest module norm according is swapped to line 13with in Algorithmthe first column 1. of the four stored in the column buffer. In a similar way, sorting#2 and sorting#3 perform the rest of the sorting procedures. However, what is different from sorting#1 is that, before the CS module, they will first update the norm values from the former sorting module according to line 13 in Algorithm 1.

norm/upd_norm norm/ SQ upd_norm SQ Norm SQ CS REG hh4,4 4,1 SQ norm hh update hh3,4 3,1 SUM min 2,4 2,1 SQ 2 hh1,4 1,1 2 SQ |h 1, i| 2 SQ |h 1, i| SQ |h 1,2 i| |h | Column sorted

1,i columninput buffer

MUX columns

(a) (b)

Figure 13. Architecture of the modules of the iterative sorting procedure: (a) norm calculator; (b) sorting#1–3.

5. Implementation Results and Comparisons We designed the RTL models of the proposed SQRD processor with Verilog HDL and synthesized it by Synopsys Design Compiler with SMIC 55-nm COMS technology. The bit-width of the data pass of our design was set to 16 bits, including 5 bits for the integer part and 11 bits for the fractional part. The fixed-point design of the proposed SQRD was simulated in the same platform introduced in Section 2.1. The result shows that the BER performance has a negligible degradation compared to the floating-point one. Table 2 presents a summary of the implementation results and performance comparisons with related studies. The gate count of the proposed SQRD processor is 176.5 K which is greatly reduced compared to the other designs. This is due to the adopted low- complexity SQRD algorithm and time-sharing Givens rotation structure. Our design achieves a high throughput of 62.5 M SQRD/s and 250 M QyH /s, with an operating frequency of 250 MHz, which is better than most of the other studies. The design in reference [3] has the highest SQRD throughput with the best fmax , albeit at the expense of exorbitant hardware cost and long processing latency.

Electronics 2020, 9, x FOR PEER REVIEW 12 of 16

x8 y8

>>3 >>6 >>9 >>1 >>3 >>6 >>9 >>1

Figure 12. Architecture of scaling of adopted CORDIC module.

4.5. Sorting The iterative sorting procedure of the proposed SQRD processor is performed with norm calculator and sorting#1–3. Figure 13 shows the architectures of these modules. The complex elements of the 4 × 4 channel matrix H are delivered column by column as the input of norm calculator. First, the squares of the real part and imaginary part of the four elements in one column are calculated with SQ modules and then these squares are added up in pairs to obtain the norm value of every complex element; ultimately, all the norm values are added together with the SUM module to get the norm value of the current input column. In sorting#1, the norm update module is bypassed and the CS module compares these successive norm values until it finds the minimum one. Finally, the column with the smallest norm is swapped with the first column of the four stored in the column buffer. In a similar way, sorting#2 and sorting#3 perform the rest of the sorting procedures. However, what is different from sorting#1 is that, before the CS module, they will first update the norm values from Electronics 2020, 9, 1657 12 of 15 the former sorting module according to line 13 in Algorithm 1.

norm/upd_norm norm/ SQ upd_norm SQ Norm SQ CS REG hh4,4 4,1 SQ norm hh update hh3,4 3,1 SUM min 2,4 2,1 SQ 2 hh1,4 1,1 2 SQ |h 1, i| 2 SQ |h 1, i| SQ |h 1,2 i| |h | Column sorted

1,i columninput buffer

MUX columns

(a) (b)

FigureFigure 13. Architecture13. Architecture of the of modules the modules of the iterativeof the iterative sorting procedure:sorting procedure: (a) norm ( calculator;a) norm calculator; (b) sorting#1–3. (b) sorting#1–3. 5. Implementation Results and Comparisons 5. ImplementationWe designed the Results RTL models and C ofomparison the proposeds SQRD processor with Verilog HDL and synthesized it byWe Synopsys design Designed the CompilerRTL models with SMIC of the 55-nm proposed COMS SQRD technology. processor The bit-width with Verilog of the data HDL pass and of our design was set to 16 bits, including 5 bits for the integer part and 11 bits for the fractional synthesized it by Synopsys Design Compiler with SMIC 55-nm COMS technology. The bit-width of part. The fixed-point design of the proposed SQRD was simulated in the same platform introduced the data pass of our design was set to 16 bits, including 5 bits for the integer part and 11 bits for the in Section 2.1. The result shows that the BER performance has a negligible degradation compared to fractional part. The fixed-point design of the proposed SQRD was simulated in the same platform the floating-point one. Table2 presents a summary of the implementation results and performance introduced in Section 2.1. The result shows that the BER performance has a negligible degradation comparisons with related studies. The gate count of the proposed SQRD processor is 176.5 K which compared to the floating-point one. Table 2 presents a summary of the implementation results and is greatly reduced compared to the other designs. This is due to the adopted low-complexity SQRD performance comparisons with related studies. The gate count of the proposed SQRD processor is algorithm and time-sharing Givens rotation structure. Our design achieves a high throughput of 176.5 K which is greatly reduced compared to the other designs. This is due to the adopted low- 62.5 M SQRD/s and 250 M QHy/s, with an operating frequency of 250 MHz, which is better than most complexity SQRD algorithm and time-sharing Givens rotation structure. Our design achieves a high of the other studies. The design in reference [3] has the highest SQRD throughput with the best fmax, QyH throughputalbeit at the of expense 62.5 M S ofQRD/s exorbitant and 250 hardware M cost/s, with and longan operating processing frequency latency. of Therefore, 250 MHz, we which take is betterhardware than complexitymost of the and other implementation studies. The design technology in reference into consideration [3] has the and highest introduce SQRD normalized throughput withhardware the best effi ciencyfmax , albeit (NHE) at for the a fairexpense comparison of exorbitant of throughput. hardware Table cost2 clearly and long shows pro thatcessing our design latency. is superior to all the other SQRD processors in terms of normalized hardware efficiency. In addition, we evaluate these designs in a 4 4 64QAM MIMO system and our design achieves the highest data × throughput of 6 Gbps. For IEEE 802.11ax [1], the maximum uncoded data throughput in the same scenarios with a bandwidth of 160 MHz is 2.88 Gbps. Therefore, the proposed SQRD design is able to support the IEEE 802.11ax.

Table 2. Implementation results and performance comparisons.

Items This Study [9][20][3][8][19] Antennas 4 4 4 4 4 4 4 4 4 4 4 4 × × × × × × Algorithm GR GR GR GR GR GR Technology 55 nm 65 nm 0.18 µm 65 nm 0.13 µm 90 nm fmax (Hz) 250 M 243.9 M 116.3 M 550 M 200 M 220 M Gate count 176.5 K 278 K 437.5 K 468 K 299 K 375.1 K Processing latency (ns) 236 266.5 - 625.5 685-775 654.5 SQRD Processing cycles 4 5 4 8 8 5 SQRD throughput (SQRD/s) 62.5 M 48.8 M 29 M 68.75 M 25 M 44 M SQRD NHE 1 0.354 0.207 0.217 0.177 0.198 0.192 QHy Processing cycles 1 1.25 4 - - 5 QHy throughput (QHy/s) 250 M 195 M 29 M - - 44 M QHy NHE 1.416 0.829 0.217 - - 0.192 MIMO data throughput 2 6 Gbps 4.7 Gbps 696 Mbps - - 1 Gbps 1 NHE = throughput (M) Technology/(55nm Gate cout (K)); 2 evaluated for a 4 4 64QAM MIMO system [9]. · · × Electronics 2020, 9, 1657 13 of 15

6. Conclusions In this article, we designed an SQRD processor with reduced complexity and high throughput for MIMO detectors. An efficient SQRD algorithm based on a novel modified RVD was proposed, which could significantly reduce the computational complexity compared to the latest studies. According to the proposed algorithm, we designed the corresponding SQRD hardware architecture with CORDIC-based Givens rotation. In the hardware design, a time-sharing Givens rotation structure was adopted to take advantage of the CORDIC processor in an idle state as far as possible. In this way, hardware complexity was further decreased and hardware efficiency was improved. We also implemented the SQRD processor with SMIC 55-nm COMS technology. The implementation results show that our design surpasses other related studies in normalized hardware efficiency and achieves a MIMO data throughput of 6 Gbps, which can support current high-speed wireless MIMO systems.

Author Contributions: Conceptualization, L.S. and B.W.; methodology, L.S.; software, L.S.; validation, L.S., B.W., and T.Y.; writing—original draft preparation, L.S.; writing—review and editing, L.S. and B.W.; funding acquisition, T.Y. All authors have read and agreed to the published version of the manuscript. Funding: This research is supported by the National Science and Technology Major Project of China under Grant 2014ZX03001011-002. Conflicts of Interest: The authors declare no conflict of interest.

Appendix A

 2  Since y is normally distributed which is y Hs, δ INN [14], the likelihood to estimate s can be ∼ N given by   1   QN − 2 2 p(y; s) = √2πσ exp (y Hi,:s) /2σ i=1 − i − !    1 N QN √ − P 2 2 = i=1 2πσ exp (yi Hi,:s) /2σ (A1) · −i=1 − Q   1   = N √2πσ − exp y Hs 2/2σ2 i=1 · −k − k Note that to maximize likelihood, p(y; s) is equivalent to minimize y Hs 2. Therefore, the ML k − k ^ 2 solution of s is given by sML = argmin y Hs . s Ω k − k ∈ Appendix B The number of CORDIC operations needed in SCGR is the summation of the CORDIC operations needed in every column elimination process. The elimination process of the kth column zeros the complex-valued elements below the kth row and turns the element in the kth row into a real number, which contains two steps. Step 1, zero the imaginary parts of all the N k + 1 complex-valued elements to be processed of − the kth column. For the elimination process of every complex-valued element to be processed, one CORDIC operation for vectoring mode and N k CORDIC operations (rotation operations for the − subsequent N k elements in the corresponding row) for rotation mode are needed. Thus, the number − of CORDIC operations cost in step 1 is (N k + 1)(1 + N k). − − Step 2, the first row of the kth column is selected as pivot row to nullify all the N k real-valued − elements in the subsequent rows of the kth column. For the elimination operation of every real-valued element in the subsequent rows, one CORDIC operation for vectoring mode and 2(N k) CORDIC − operations (rotation operations for the subsequent N k paired complex-valued elements in the − corresponding rows) for rotation mode are needed. Thus, the number of CORDIC operations cost in step 2 is (N k)(1 + 2(N k)). − − For all the elimination processes of N columns in H, the number of CORDIC operations is totally PN ((N k + 1)(N k + 1) + (N k)(1 + 2(N k))) = N3. k=1 − − − − Electronics 2020, 9, 1657 14 of 15

References

1. IEEE Draft Standard for Information Technology—Telecommunications and Information Exchange between Systems Local and Metropolitan Area Networks—Specific Requirements Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications Amendment Enhancements for High Efficiency WLAN. Available online: https://ieeexplore.ieee.org/document/8424259 (accessed on 31 July 2018). 2. Tsai, P.Y.; Lo, P.C.; Shih, F.J.; Jau, W.J.; Huang, M.Y.; Huang, Z.Y. A 4 4 MIMO-OFDM Baseband Receiver × with 160 MHz Bandwidth for Indoor Gigabit Wireless Communications. IEEE Trans. Circuits Syst. I Regul. Pap. 2015, 62, 2929–2939. [CrossRef] 3. Yan, Z.T.; He, G.H.; Ren, Y.F.; He, W.F.; Jiang, J.F.; Mao, Z.G. Design and Implementation of Flexible Dual-Mode Soft-Output MIMO Detector with Channel Preprocessing. IEEE Trans. Circuits Syst. I Regul. Pap. 2015, 62, 2706–2717. [CrossRef] 4. Wubben, D.; Bohnke, R.; Rinas, J.; Kuhn, V.; Kammeyer, K.D. Efficient algorithm for decoding layered space-time codes. Electron. Lett. 2001, 37, 1348–1350. [CrossRef] 5. Rakesh, G.; Ove, E.; Liu, L. An Adaptive QR Decomposition Processor for Carrier-Aggregated LTE-A in 28-nm FD-SOI. IEEE Trans. Circuits Syst. I Regul. Pap. 2017, 64, 1914–1926. [CrossRef] 6. Dongyeob, S.; Jongsun, P. A Low-Latency and Area-Efficient Gram–Schmidt-Based QRD Architecture for MIMO Receiver. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 65, 2606–2616. [CrossRef] 7. Patel, D.; Shabany, M.; Gulak, P.G. A low-complexity high-speed QR decomposition implementation for MIMO receivers. In Proceedings of the IEEE International Symposium Circuits and Systems, Taipei, Taiwan, 24–27 May 2009; pp. 33–36. [CrossRef] 8. Ren, Y.F.; He, G.H.; Ma, J. High-throughput sorted MMSE QR decomposition for MIMO detection. In Proceedings of the IEEE International Symposium on Circuits and Systems, Seoul, Korea, 20–23 May 2012; pp. 2845–2848. [CrossRef] 9. Lee, H.; Oh, K.; Jang, M.C.Y. Efficient Low-Latency Implementation of CORDIC-Based Sorted QR Decomposition for Multi-Gbps MIMO Systems. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 1375–1379. [CrossRef] 10. Wieringen, W.N. Lecture Notes on Ridge Regression. Available online: https://arxiv.org/abs/1509.09169 (accessed on 30 September 2015). 11. Martino, L.; Read, J. Joint Introduction to Gaussian Processes and Relevance VectorMachines with Connections to Kalman Filtering and other Kernel Smoothers. Available online: https://arxiv.org/abs/2009.09217 (accessed on 19 September 2020). 12. Burg, A. VLSI Circuits for MIMO Communication Systems. Ph.D. Thesis, ETH, Zürich, Switzerland, 2006. 13. Krishnamoorthy, A.; Menon, D. Matrix inversion using . In Proceedings of the 2013 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland, 26–28 September 2013; pp. 70–72. 14. Chen, Y.J.; Halbauer, H.; Jeschke, M.; Richter, R. An efficient Cholesky Decomposition based multiuser MIMO detection algorithm. In Proceedings of the 21st Annual IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, Poznan, Poland, 26–30 September 2010; pp. 499–503. [CrossRef] 15. Peter, L.; Andreas, B.; Haene, S.; Perels, D.; Felber, N.; Fichtner, W. VLSI Implementation of a High-Speed Iterative Sorted MMSE QR Decomposition. In Proceedings of the 2007 IEEE International Symposium on Circuits and Systems, New Orleans, LA, USA, 27–30 May 2007; pp. 1421–1424. [CrossRef] 16. Peter, L.; Studer, C.; Duetsch, S.; Zgraggen, E.; Kaeslin, H.; Felber, N. Gram-Schmidt-based QR decomposition for MIMO detection: VLSI implementation and comparison. In Proceedings of the 2008 IEEE Asia Pacific Conference on Circuits and Systems, Macao, China, 30 November–3 December 2008; pp. 830–833. [CrossRef] 17. Miyaoka, Y.; Nagao, Y.; Kurosaki, M.; Ochi, H. Sorted QR decomposition for high-speed MMSE MIMO detection based wireless communication systems. In Proceedings of the 2012 IEEE International Symposium on Circuits and Systems, Seoul, Korea, 20–23 May 2012; pp. 2857–2860. [CrossRef] 18. Liao, C.; Wang, J.; Huang, Y. A 3.1 Gb/s 8 8 Sorting Reduced K-Best Detector with Lattice Reduction and QR × Decomposition. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2014, 22, 2675–2688. [CrossRef] 19. Zhang, C.; Prabhu, H.; Liu, Y.; Liu, L.; Edfors, O.; Öwall, V. Energy Efficient Group-Sort QRD Processor With On-Line Update for MIMO Channel Pre-Processing. IEEE Trans. Circuits Syst. I Regul. Pap. 2015, 62, 1220–1229. [CrossRef] Electronics 2020, 9, 1657 15 of 15

20. Tseng, T.; Shen, C. Design and implementation of a high-throughput configurable pre-processor for MIMO detections. Microelectron. J. 2018, 72, 14–23. [CrossRef] 21. Chen, W.; Guenther, D.; Shen, C.; Ascheid, G. Design and implementation of a low-latency, high-throughput sorted QR decomposition circuit for MIMO communications. In Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems, Jeju, Korea, 25–28 October 2016; pp. 277–280. [CrossRef] 22. Lin, J.S.; Hwang, Y.T.; Fang, S.H.; Chu, P.H.; Shieh, M.D. Low-Complexity High-Throughput QR Decomposition Design for MIMO Systems. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 23, 2342–2346. [CrossRef] 23. Guo, Z.; Nilson, P.A. 53.3 Mb/s 4 4 16-QAM MIMO decoder in 0.35µm CMOS. In Proceedings of the IEEE × International Symposium Circuits and Systems, Kobe, Japan, 23–26 May 2005; pp. 4947–4950. [CrossRef] 24. Huang, Z.Y.; Tsai, P.Y. Efficient Implementation of QR Decomposition for Gigabit MIMO-OFDM Systems. IEEE Trans. Circuits Syst. I Regul. Pap. 2011, 58, 2531–2542. [CrossRef] 25. Alexander, M.; Vladimir, P.; Roman, M.; Alexey, K. Triangular systolic array with reduced latency for QR-decomposition of complex matrices. In Proceedings of the IEEE International Symposium on Circuits and Systems, Island of Kos, Greece, 21–24 May 2006; pp. 385–388. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).