An Area Efficient Real- and Complex-Valued Multiply-Accumulate SIMD Unit for Digital Signal Processors

Lukas Gerlach, Guillermo Payá-Vayá, and Holger Blume
Cluster of Excellence Hearing4all, Institute of Microelectronic Systems
Leibniz Universität Hannover, Appelstr. 4, 30167 Hannover, Germany
Email: {gerlach, guipava, blume}@ims.uni-hannover.de

Abstract: This paper explores a real- and complex-valued multiply-accumulate (MAC) functional unit for digital signal processors. MAC units with single-instruction-multiple-data (SIMD) support are often used to increase the processing performance of modern signal processors. Compared to real-valued SIMD-MAC units, the proposed unit uses the same multipliers to also support complex-valued SIMD-MAC and butterfly operations. The area overhead for the complex mode is small. Complex-valued operations speed up signal processing algorithms and make the execution more efficient in terms of power consumption. As a case study, a fast Fourier transform (FFT) is implemented for a VLIW processor with a complex-valued SIMD butterfly extension. The proposed functional unit is quantitatively evaluated in terms of performance, silicon area, and power consumption.

I. INTRODUCTION

Nowadays, the tremendous progress in the field of portable embedded applications (e.g., audio/video processing for multimedia content or wireless communication data) is pushing the computation load requirements of embedded target devices. Digital signal processors (DSPs) are commonly used within these devices because of their programmability, which allows their reuse after future algorithm modifications or for other applications. Solutions based on dedicated hardware architectures would provide higher processing performance and lower power consumption, but they lack this highly desirable flexibility. How to extend the functionality of DSP architectures to meet these new computation load requirements while also decreasing the power consumption is a classical research field, which is still a hot topic [1], [2].

The most frequently used signal processing operations are correlations, filtering, and transformations, which are basically implemented with multiplication and addition/subtraction operations. Therefore, it is appropriate to specialize the target signal processor (e.g., a DSP) for this kind of computation. A commonly used arithmetic unit in DSPs is the multiply-accumulate (MAC) unit. This combination of multiplication and addition makes the computation more efficient because it reduces the number of execution cycles and the register file utilization. The MAC unit can be extended to operate on multiple data values simultaneously, implementing the so-called single-instruction multiple-data (SIMD) mechanism. This mechanism is a key feature of modern DSPs to meet the performance demands and power restrictions by exploiting this concurrency.

In the signal processing field, the fast Fourier transform (FFT) is one of the most frequently used transformations, and it greatly pushes the performance requirements. The data parallelism inherent in FFT processing allows many independent MAC operations to be executed simultaneously. Therefore, a performance increase can be achieved by MAC units with SIMD mechanisms, but many instructions are still needed to process the real and imaginary parts of complex numbers separately. The use of single DSP instructions that operate directly on complex numbers can lead to a significant performance gain in many signal processing algorithms.

A SIMD-MAC unit that can handle both complex and real numbers without a large overhead in area and power consumption would be desirable. Using the same hardware multipliers and adders for real and complex arithmetic operations reduces hardware cost and decreases power consumption. A new area efficient real- and complex-valued SIMD-MAC unit for this purpose is proposed.

This paper is organized as follows. In Section II, the state of the art of complex MAC units is presented, emphasizing the contribution of this work. Then, the proposed SIMD-MAC unit is explained in detail in Section III. In Section IV, a case study based on the FFT is presented. Finally, the paper is concluded in Section V.

II. RELATED WORK

Existing MAC units for real-valued and complex-valued operations can basically be classified as follows. A single MAC unit computes one real-valued multiplication and accumulation, for which only one instruction is required. Consequently, several instructions that make use of this MAC unit are required to perform a complex-valued MAC operation. Several MAC units can be implemented in VLIW-DSP architectures to accelerate a complex-valued MAC operation [3]–[6], drastically increasing the performance. SIMD-MAC units can process several real-valued MAC operations by executing a single instruction, increasing the performance by taking advantage of data parallelism. However, complex-valued MAC operations still require several sequential instructions [7]–[11]. Finally, specialized hardware for complex-valued MAC units can support either only complex-valued or both real- and complex-valued operations [2], [12]–[15]. In this case, only one instruction is required to process a complex-valued MAC operation. Some of these architectures are also enhanced with a butterfly operation to speed up the computation of FFT algorithms [12]–[15].
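To make the instruction-count argument above concrete, the following minimal Python sketch emulates one complex-valued MAC with a real-valued MAC primitive. The function names (`real_mac`, `complex_mac_from_real`, `cmac`) are hypothetical and chosen only for illustration; they do not correspond to instructions of any of the cited processors.

```python
def real_mac(acc, b, c):
    """One real-valued MAC step: acc + b * c."""
    return acc + b * c


def complex_mac_from_real(acc_re, acc_im, b_re, b_im, c_re, c_im):
    """Complex MAC emulated with four sequential real-valued MAC steps."""
    acc_re = real_mac(acc_re, b_re, c_re)    # + Re(b) * Re(c)
    acc_re = real_mac(acc_re, -b_im, c_im)   # - Im(b) * Im(c)
    acc_im = real_mac(acc_im, b_re, c_im)    # + Re(b) * Im(c)
    acc_im = real_mac(acc_im, b_im, c_re)    # + Im(b) * Re(c)
    return acc_re, acc_im


def cmac(acc, b, c):
    """The same result expressed as a single complex-valued MAC operation."""
    return acc + b * c
```

With a real-valued unit, the four steps have to be issued one after another (or spread over several MAC units), whereas a unit with complex-valued support needs a single operation for the same result.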

978-1-4673-9604-2/15/.00 ©2015 IEEE

A. Single MAC Units

In [3], the authors present a DSP consisting of an array of identical datapaths. Each of these datapaths contains one independently controlled MAC unit. The bit width of the operands is 16-bit. Single double-precision MAC operations need more than one cycle.

TABLE I. REAL- AND COMPLEX-VALUED OPERATIONS: MULTIPLY, MULTIPLY-ACCUMULATE AND BUTTERFLY

| Arithmetic Operation | Real-valued | Complex-valued |
|---|---|---|
| Multiply | MUL | CMUL |
| Multiply-accumulate | MAC | CMAC |
| Multiply-accumulate-zero | MACZ | CMACZ |
| Butterfly | - | Butterfly |

B. SIMD-MAC Units

In [7], a DSP core is equipped with a dual MAC architecture. This MAC architecture is optimized for the computation of digital filters. One input port of one MAC is the delayed input of the other MAC unit. Both MAC units support multiple subwords. This architecture saves data access requests for algorithms that use the same operands for many MAC operations. In [8] and [16], DSPs with two MAC units are presented. In both cases, the MACs can reuse previously latched output values. These architectures are specifically optimized for FIR filter computation. In [10] and [17], SIMD-MAC units are used in a two-way superscalar RISC architecture and a low-power DSP. In both cases, butterflies of an FFT are processed in parallel to increase performance.

C. Specialized Complex-Valued MAC Units

In [2], a DSP is equipped with a specialized complex-valued multiplier. This functional unit cannot be used for real multiplications and does not support SIMD operations.

The multiply-accumulate unit presented in [13] supports real- and complex-valued SIMD operations for single- and full-precision 16-bit MAC operations. Although FIR filters can be implemented efficiently with this MAC architecture, there are drawbacks in other use cases. Parts of the computed results, which are either real or complex, are stored in different accumulator registers, which are part of the MAC architecture. Not all of these registers are directly accessible in each cycle. The output multiplexer of the MAC also restricts the number of words transferred to the register file. The complex-valued multiplication results of this MAC unit are used by the butterfly processor architecture, which constitutes the MAC unit. One butterfly per cycle can be computed.

The authors of [12] propose a data processing unit for DSPs. This unit is composed of two multipliers with three pipeline stages and five adders. New instructions are introduced to perform different combinations of additions and multiplications. This unit can perform two butterfly operations in 3 cycles. The benefit of this unit is the flexibility to switch between real and complex operations. The computation of the FFT algorithm is about 40% faster than on DSPs using SIMD-MAC units. The drawback of this design is that two data paths are used without SIMD support. Each data path contains one multiplier and multiple adders. Both fixed-width data paths, including these multipliers and adders, are only used simultaneously in parallel algorithms, like the FFT.

The butterfly unit of the FFT processor presented in [14] is composed of four multipliers and four adders. It computes one butterfly operation in one cycle. This butterfly unit has two complex inputs and generates two complex outputs. The twiddle factors are stored in LUTs (lookup tables) or calculated on-the-fly by a CORDIC (COordinate Rotation DIgital Computer).

A fully custom function unit (CFU) for butterfly computation is proposed in [15]. Each of these custom function units is composed of four multipliers and eight adders. These units have two input ports and two output ports. Each port is 32-bit wide and holds 16-bit real and complex values. With twelve of these units, twelve butterfly operations can be performed concurrently.

D. Contribution of This Work

No MAC architecture that supports real- and complex-valued MAC operations as well as butterfly operations was found in the literature. This paper proposes a new MAC unit architecture for either multiply or multiply-accumulate real- or complex-valued operations. Moreover, this unit supports SIMD operations for both real- and complex-valued MAC operations. The MAC architecture is also extended to compute butterfly operations for the FFT algorithm. Therefore, the aim of this work is to maintain the functionality of current real-valued SIMD-MAC units while extending their architecture with complex-valued operations, so that real- and complex-valued operations are performed within one clock cycle.

III. PROPOSED REAL- AND COMPLEX-VALUED SIMD-CMAC UNIT

Table I shows the arithmetic operations implemented in the proposed SIMD-CMAC unit. These operations include real- and complex-valued multiply and multiply-accumulate operations as well as a butterfly operation for the FFT. Since SIMD mechanisms are used, all operations can process multiple subwords simultaneously.

A. Real-Valued MAC Operation

The generic real-valued multiply (MUL), multiply-accumulate (MAC), and multiply-accumulate-zero (MACZ) operations are described by Eq. 1. The number of SIMD subwords is hereinafter denoted by the index variable s.

$$
a_s = a_s + b_s c_s: \quad
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_s \end{pmatrix}
=
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_s \end{pmatrix}
+
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_s \end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_s \end{pmatrix}
=
\begin{pmatrix} a_1 + b_1 \cdot c_1 \\ a_2 + b_2 \cdot c_2 \\ \vdots \\ a_s + b_s \cdot c_s \end{pmatrix}
\qquad (1)
$$

The real-valued multiplication factors $b_s$ and $c_s$ as well as the accumulator $a_s$ are vectors containing s subwords. For the multiply operation, all subwords in $b_s$ and $c_s$ are multiplied element-wise while the accumulator $a_s$ is not used (zero). The resulting subwords are stored in $a_s$. The multiply-accumulate-zero operation performs an element-wise multiplication and an addition with an accumulator $a_s$ that is set to zero. In case of the multiply-accumulate operation, the accumulator $a_s$ is added to the multiplication result. To perform a sequence of multiply-accumulate operations, the result vector $a_s$ is the same as the accumulator vector $a_s$.

B. Complex-Valued MAC Operation

The same multiply and multiply-accumulate operation scheme can also be defined for complex-valued numbers. In this case, the vectors $a_s$, $b_s$ and $c_s$ consist of complex numbers with a real and an imaginary part. The complex-valued operations CMUL, CMAC and CMACZ are then given by Eq. 2.

$$
a_s = a_s + b_s c_s: \quad
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_s \end{pmatrix}
=
\begin{pmatrix} a_1 + b_1 \cdot c_1 \\ a_2 + b_2 \cdot c_2 \\ \vdots \\ a_s + b_s \cdot c_s \end{pmatrix}
=
\begin{pmatrix}
\Re(a_1) + \Re(b_1)\Re(c_1) - \Im(b_1)\Im(c_1) \\
\Re(a_2) + \Re(b_2)\Re(c_2) - \Im(b_2)\Im(c_2) \\
\vdots \\
\Re(a_s) + \Re(b_s)\Re(c_s) - \Im(b_s)\Im(c_s)
\end{pmatrix}
+ i \cdot
\begin{pmatrix}
\Im(a_1) + \Re(b_1)\Im(c_1) + \Im(b_1)\Re(c_1) \\
\Im(a_2) + \Re(b_2)\Im(c_2) + \Im(b_2)\Re(c_2) \\
\vdots \\
\Im(a_s) + \Re(b_s)\Im(c_s) + \Im(b_s)\Re(c_s)
\end{pmatrix}
\qquad (2)
$$

Complex-valued MAC operations have been proposed in [2], [7], [12], [13], [18], but these architectures either do not support SIMD operations or restrict the access to single subwords.

C. Butterfly Operation

To speed up the computation of a radix-2 fast Fourier transform, a butterfly operation can be used. The butterfly operation consists of one complex-valued multiplication, one complex-valued addition and one subtraction. The equations for computing a SIMD butterfly operation are given in Eq. 3 and Eq. 4,

$$
a_s = a_s + b_s c_s: \quad
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_s \end{pmatrix}
=
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_s \end{pmatrix}
+
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_s \end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_s \end{pmatrix}
=
\begin{pmatrix} a_1 + b_1 \cdot c_1 \\ a_2 + b_2 \cdot c_2 \\ \vdots \\ a_s + b_s \cdot c_s \end{pmatrix}
\qquad (3)
$$

$$
b_s = a_s - b_s c_s: \quad
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_s \end{pmatrix}
=
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_s \end{pmatrix}
-
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_s \end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_s \end{pmatrix}
=
\begin{pmatrix} a_1 - b_1 \cdot c_1 \\ a_2 - b_2 \cdot c_2 \\ \vdots \\ a_s - b_s \cdot c_s \end{pmatrix}
\qquad (4)
$$

where a and b represent the complex-valued signal subwords while the vector c contains the complex-valued twiddle factors.

Several functional units designed for performing a butterfly operation have been presented in related works, but none of them supports SIMD mechanisms [12], [13], [15].

D. SIMD-CMAC Operation Support

The proposed SIMD-CMAC unit supports four subword modes. Each input port is 64-bit wide. These 64 bits are interpreted as subwords (s0, s1, ..., sn) of equal length, as depicted in Fig. 1. Subwords can be 64-bit, 32-bit, 16-bit, or 8-bit wide. Every subword represents an unsigned or signed data type, storing integer or fixed-point numbers. For the complex-valued operations, two consecutive subwords are interpreted as a complex number. The first subwords (s1, s3, ..., sn) represent the real part and the second subwords (s0, s2, ..., sn−1) represent the imaginary part of a complex-valued number. The imaginary parts are highlighted in gray in Fig. 1.

[Figure 1: layout of a 64-bit word in the four subword modes: one 64-bit subword (s0), two 32-bit subwords (s1, s0), four 16-bit subwords (s3 to s0), and eight 8-bit subwords (s7 to s0).]
Fig. 1. SIMD data format. Each word of 64-bit is composed of 64-bit, 32-bit, 16-bit, or 8-bit subwords. To compute real- or complex-valued operations, the subwords have to represent real or imaginary values. Real-valued subwords are colored white and subwords, which represent real or imaginary values, are colored gray.

Fig. 2 illustrates the architecture of the proposed SIMD-CMAC unit. To compute the real- and complex-valued multiply, multiply-accumulate and butterfly operations, three input ports and one double-width output port are utilized. Unlike other MAC architectures [1], [8], [13], [18], the proposed architecture does not contain any registers. Instead, the register file of the processor is used to replace the additional accumulator register. This design decision offers more flexibility to implement the butterfly operation and reduces the number of registers used, for an area efficient implementation. For this reason, the accumulator output port (a0 and a1) is connected to the same registers of the processor as the accumulator input port (a0 and a1). The accumulator is 128-bit wide. The input ports (b and c) represent the product factors. These ports have 64 bits each.

[Figure 2: block diagram. The inputs b and c feed the partial product matrix, which delivers the real-valued product as well as the real part and the imaginary part of the complex-valued product. Multiplexers select between these outputs, and the accumulator stage, including a subtraction for the butterfly operation, produces the outputs a0 and a1.]
Fig. 2. Real- and complex-valued MAC/CMAC/butterfly SIMD architecture

The number of multipliers needed for the implementation depends on the operation itself and the SIMD mode.

The partial product architectures presented in [19]–[22] have been extended in this work to perform both real- and complex-valued multiplications with almost the same hardware requirements. A real-valued multiplication based on a one-stage partial product matrix is depicted in Fig. 3. The multiplier b and the multiplicand c are subdivided into subwords of equal size. The corresponding subwords are multiplied and then added to form the final product result. This product scheme can be applied to different subword modes and word sizes. The advantage of this product scheme is that the same multiplier stages can be used for different subword modes, reducing the hardware cost in contrast to multiple parallel high-precision multipliers. This scheme has been extended to complex-valued multiplications in this work. Fig. 4 shows that the same multipliers can perform all products for a complex-valued multiplication. In this case, the subwords b1 and c1 represent the real part and b0 and c0 represent the imaginary part of the complex-valued words b and c.

[Figure 3: one-stage partial product matrix with the products c0·b0, c0·b1, c1·b0 and c1·b1, summed as c·b = c1·b1·2^n + (c1·b0 + c0·b1)·2^(n/2) + c0·b0.]
Fig. 3. Real-valued partial products. The product b·c is formed by multiplying and adding the subwords b0, b1 and c0, c1 according to their bit significance. b and c are of word size n.

[Figure 4: the same partial product matrix used for the complex multiplication b·c = (b1·c1 − b0·c0) + i(b1·c0 + b0·c1).]
Fig. 4. Complex-valued partial products. The complex-valued product b · c is formed by multiplying real subwords b1 and c1 and the imaginary subwords b0 and c0.

This multiplier scheme is depicted in Fig. 5 for subwords of 8-bit, 16-bit or 32-bit width. In the first stage of the product matrix, all 8-bit products are formed. The white cells, which are outlined in bold on the diagonal axis, contain either the real-valued products or parts of the complex-valued results, or they are used for partial products. The gray shaded cells contain either parts of the complex-valued products or partial products. All other cells are not used for partial product generation or as results. Products that are computed from partial products are indicated with a plus sign. All other products are computed directly by dedicated multipliers. Those are marked with a multiplication sign. This implementation scheme reduces hardware costs.

Before feeding the products to the accumulator stage, the complex-valued multiplication results have to be computed. Eq. 5 indicates which products of the product matrix shown in Fig. 4 have to be added and subtracted to compute the complex-valued multiplication.

$$
b \cdot c = (b_1 \cdot c_1 - b_0 \cdot c_0) + i \cdot (b_1 \cdot c_0 + b_0 \cdot c_1) \qquad (5)
$$

By using one additional SIMD adder and subtractor, the complex-valued multiplication can be computed with the same multipliers used for real-valued multiplications. This is shown in Fig. 2. A ripple carry adder architecture as shown in Fig. 6 is used for the additional adder and subtractor. Eight 8-bit adders connected by a carry chain can perform eight 8-bit additions, four 16-bit additions, two 32-bit additions, or one 64-bit addition by controlling the propagation of the carry bit. The adder can also be configured as a subtractor.

[Figure 6: chain of adders, each with a mode input that controls whether the carry is passed on to the next adder, producing the sum.]
Fig. 6. Ripple carry SIMD adder architecture.

The final accumulator stage in Fig. 2 adds the outputs from the product matrix or the SIMD adder/subtractor to the accumulator inputs a0 and a1. The outputs of the partial product matrix can be either real- or complex-valued and are multiplexed for every operation listed in Table I. The same ripple carry SIMD adder architecture shown in Fig. 6 is used in the accumulator stage. One of the adders can be configured as a subtractor to compute the butterfly operation according to Eq. 3 and Eq. 4.

IV. CASE STUDY: FFT

The fast Fourier transform algorithm is used as a benchmark to assess the performance of the proposed real- and complex-valued SIMD-CMAC unit and of other MAC architectures presented in related work.

The fast Fourier transform is used to compute the discrete Fourier transform defined by Eq. 6.

$$
X_k = \sum_{n=0}^{N-1} x_n \cdot e^{-2 i \pi \frac{k n}{N}}, \qquad k = 0, \ldots, N-1 \qquad (6)
$$

The transform is computationally complex. The Cooley-Tukey FFT algorithm [23] is commonly used to decrease the number of complex multiplications and additions. In the literature, different complex-valued radix-2/4 Cooley-Tukey FFT algorithms have been implemented on a variety of DSPs with different MAC architectures [3]–[6], [9]–[12], [14], [15], [17]. In order to compare the proposed SIMD-CMAC architecture with these architectures, a case study is presented in which the proposed SIMD-CMAC is used as a functional unit to compute the butterflies of a radix-2 decimation-in-time FFT in a VLIW-SIMD processor called Kavuaka [24]. The FFT performance for differently sized FFTs is listed in Table II. Those architectures that are marked as programmable are flexible. Their MAC unit performs real- and complex-valued operations and, therefore, can be used for algorithms other than the FFT.
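As a point of reference for Eq. 6 and the radix-2 decimation-in-time decomposition mentioned above, the following Python sketch computes the DFT directly and with a textbook recursive Cooley-Tukey FFT built from the butterfly of Eq. 3 and Eq. 4. It is a generic floating-point illustration, not the Kavuaka implementation; the function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct evaluation of Eq. 6 (quadratic number of complex multiplications)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def butterfly(a, b, w):
    """Radix-2 butterfly of Eq. 3 and Eq. 4: (a + b * w, a - b * w)."""
    t = b * w
    return a + t, a - t


def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2_dit(x[0::2])   # transform of even-indexed samples
    odd = fft_radix2_dit(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out


def butterfly_count(n):
    """Number of radix-2 butterflies for an n-point FFT: (n / 2) * log2(n)."""
    return (n // 2) * (n.bit_length() - 1)
```

Each stage of the recursion consists only of independent butterflies, which is the data parallelism a SIMD butterfly operation can exploit. A 1024-point transform, for instance, contains `butterfly_count(1024)` = 5120 butterflies.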

[Figure 5: partial product matrix for a 64-bit word with operands b and c, shown for eight 8-bit subwords (b7 to b0, c7 to c0), four 16-bit subwords (b3 to b0, c3 to c0) and two 32-bit subwords (b1, b0, c1, c0).]
Fig. 5. Partial product matrix. The product matrix consists of three stages. Products are generated using the products of the previous stage. A product mark labels results generated by direct multiplications while a plus sign marks products generated by addition of partial products. The results are used for real- (white cells) and complex-valued (white or gray cells) multiplications.
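The staged product generation described for Fig. 3 to Fig. 5 can be mimicked arithmetically. The sketch below is a simplified illustration under stated assumptions: operands are unsigned, every wider product is composed from the four half-width products of the previous stage, and the function names are hypothetical. It does not model which cells of the matrix are served by dedicated multipliers.

```python
def half_products(b, c, n):
    """Split n-bit unsigned b and c into halves and form the four partial products."""
    h = n // 2
    mask = (1 << h) - 1
    b1, b0 = b >> h, b & mask
    c1, c0 = c >> h, c & mask
    return b1 * c1, b1 * c0, b0 * c1, b0 * c0


def real_product(b, c, n):
    """Fig. 3: b*c = b1*c1*2^n + (b1*c0 + b0*c1)*2^(n/2) + b0*c0."""
    p11, p10, p01, p00 = half_products(b, c, n)
    return (p11 << n) + ((p10 + p01) << (n // 2)) + p00


def complex_product(b, c, n):
    """Fig. 4 / Eq. 5: upper halves are real parts, lower halves imaginary parts.

    Returns (real, imaginary) computed from the same four partial products.
    """
    p11, p10, p01, p00 = half_products(b, c, n)
    return p11 - p00, p10 + p01


def staged_product(b, c, n, base=8):
    """Build an n-bit product from base-width products, one stage per doubling."""
    if n <= base:
        return b * c                      # directly multiplied cell
    h = n // 2
    mask = (1 << h) - 1
    b1, b0 = b >> h, b & mask
    c1, c0 = c >> h, c & mask
    p11 = staged_product(b1, c1, h, base)
    p10 = staged_product(b1, c0, h, base)
    p01 = staged_product(b0, c1, h, base)
    p00 = staged_product(b0, c0, h, base)
    return (p11 << n) + ((p10 + p01) << h) + p00   # sum of previous-stage products
```

The point of the sketch is that `real_product` and `complex_product` consume exactly the same four partial products and differ only in the final additions and subtractions.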

TABLE II. RADIX 2/4 COMPLEX FFT PERFORMANCE OF PROPOSED AND RELATED ARCHITECTURES

Entries are cycle counts per FFT size; the percentages in parentheses are relative to this work (100%).

| Type | MAC Arch. | Name | Clock frequency [MHz] | Number of MACs | Word/Subword [bit] | 32 points | 64 points | 128 points | 256 points | 512 points | 1024 points |
|---|---|---|---|---|---|---|---|---|---|---|---|
| programmable | Single MAC | CFX [4] | 300 | 2 | 24/24 | 997 (697%) | 1657 (496%) | 4129 (503%) | 8393 (438%) | 18654 (424%) | 41385 (416%) |
| programmable | Single MAC | C55x [5] | 200 | 2 | 16/16 | - | - | - | 4786 (250%) | - | - |
| programmable | Single MAC | Hinrichs [3] | 40 | 4 | 16/16 | - | - | - | - | - | 14440 (145%) |
| programmable | Single MAC | ADSP-21161N [6] | 100 | 2 | 32fl | - | 1156 (346%) | 2158 (263%) | 4316 (225%) | 8770 (199%) | 18288 (184%) |
| programmable | SIMD MAC | Arm [17] | 50 | 2 | 32/32 | - | - | 2700 (329%) | - | - | - |
| programmable | SIMD MAC | C674x [9] | 456 | 2 | 32/32 | 258 (191%) | 545 (163%) | 953 (116%) | 2216 (116%) | 4664 (106%) | 10055 (101%) |
| programmable | SIMD MAC | Nadehara [10] | 200 | 1 | 64/16 | - | 839 (251%) | - | 4093 (214%) | - | 19257 (193%) |
| programmable | SIMD MAC | SC3850 [11] | 1000 | 4 | 64/16 | 212 (157%) | 525 (157%) | 1273 (155%) | 2587 (135%) | 5854 (133%) | 11898 (119%) |
| programmable | SIMD CMAC | KAVUAKA (this work) | 50 | 1 | 64/32 | 135 (100%) | 334 (100%) | 821 (100%) | 1915 (100%) | 4397 (100%) | 9959 (100%) |
| dedicated | Specialized CMAC | Al [14] | - | 1 | 16/16 | - | - | - | 1024 (53%) | - | - |
| dedicated | Specialized CMAC | Lee [12] | 144 | 1 | -/- | - | - | - | 1536 (80%) | - | 7680 (77%) |
| dedicated | Specialized CMAC | Liu [15] | 320 | 2 | 32/16 | - | - | 284 (35%) | 568 (30%) | 1188 (27%) | 2496 (25%) |

The proposed SIMD-CMAC architecture needs fewer cycles to compute the FFT than the other programmable MAC architectures. The reason for this is the previously described full SIMD support for butterfly operations. In addition to the parallel execution of butterflies, no permutation and alignment operations are needed, as opposed to a real-valued MAC implementation. The performance of the FFT algorithm also depends on the local memory bandwidth. In case of the Kavuaka processor, the speedup using the SIMD-CMAC instead of a standard SIMD-MAC is about 3.7 for the same local memory interface. Those architectures that are marked as dedicated are equipped with dedicated hardware, such as dedicated butterfly units and local memory banks for twiddle and sample data. These hardware mechanisms offer higher performance compared to the more flexible programmable architectures.

The cell area and power consumption overhead for implementing a complex-valued SIMD-CMAC unit is shown in Table III. Three different variations of the proposed SIMD-CMAC unit are synthesized using a 40 nm low-power technology library from TSMC [25]. No pipelining is used, the maximum clock frequency is set to 50 MHz and the operating voltage is 1 V. This configuration is based on a low-power design flow. The total power consumption is an estimate based on a switching factor of 0.5. The implementation Imp1 is synthesized without complex-valued operations and represents a standard SIMD-MAC unit. This implementation is used as a reference. If complex-valued multiply and multiply-accumulate operations are additionally synthesized (Imp2), the area overhead is about 70%. This overhead is 30% smaller compared to duplicating MAC units, as presented in [26]. The SIMD butterfly operation, implemented in variation Imp3, requires only 4% additional cell area (174% instead of 170% of the Imp1 area).

TABLE III. CELL COUNT, CELL AREA AND TOTAL POWER OF DIFFERENT IMPLEMENTATIONS OF THE PROPOSED SIMD-CMAC ARCHITECTURE

| Implementation | Cell count | Cell Area [μm²] | Est. Total Power [mW] |
|---|---|---|---|
| Imp1 | 7417 (100%) | 12515 (100%) | 0.21 (100%) |
| Imp2 | 12434 (168%) | 21263 (170%) | 0.38 (181%) |
| Imp3 | 12838 (173%) | 21730 (174%) | 0.39 (186%) |

Imp1 - Real-valued MUL and MAC operations implemented.
Imp2 - Real-valued and complex-valued MUL and MAC operations implemented.
Imp3 - Real-valued and complex-valued MUL and MAC operations and butterfly operation implemented.

V. CONCLUSION

This paper presents a real-/complex-valued SIMD multiply-accumulate unit for low-power digital signal processors. A commonly used partial products architecture for SIMD multiplications is extended to also support complex-valued multiply-accumulate and butterfly operations. Since the same multiplier structure can be used, the overhead in silicon area is small compared to duplicating MAC units. The FFT benchmarks demonstrate that the proposed SIMD-CMAC unit outperforms the MAC architectures of related digital signal processors by integrating a SIMD butterfly operation into the CMAC architecture.

REFERENCES

[1] L. Huang, N. Xiao, Z. Wang, Y. Wang, and M. Lai, “Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP,” vol. 39, no. 10, pp. 586–602, 2013.
[2] L. Codrescu, W. Anderson, S. Venkumanhanti, M. Zeng, E. Plondke, C. Koob, A. Ingle, C. Tabony, and R. Maule, “Hexagon DSP: An architecture optimized for mobile multimedia and communications,” Micro, IEEE, vol. 34, no. 2, pp. 34–43, 2014.
[3] W. Hinrichs, J. P. Wittenburg, H. Lieske, H. Kloos, M. Ohmacht, and P. Pirsch, “A 1.3-GOPS parallel DSP for high-performance image-processing applications,” Solid-State Circuits, IEEE Journal of, vol. 35, no. 7, pp. 946–952, 2000.
[4] P. D. S. Labs, CoolFlux DSP, NXP, 2004, Available: www.coolfluxdsp.com.
[5] Texas Instruments Inc., TMS320C5517 Fixed-Point Digital Signal Processor, 2014, Available: www.ti.com.
[6] Analog Devices Inc., ADSP-21161N, 2013, Available: www.analog.com/.
[7] Y.-L. Tsao, W.-H. Chen, M. H. Tan, M.-C. Lin, and S.-J. Jou, “Low-power embedded DSP core for communication systems,” EURASIP Journal on Applied Signal Processing, vol. 2003, pp. 1355–1370, 2003.
[8] B.-W. Kim, J.-H. Yang, C.-S. Hwang, Y.-S. Kwon, K.-M. Lee, I.-H. Kim, Y.-H. Lee, and C.-M. Kyung, “MDSP-II: A 16-bit DSP with mobile communication accelerator,” Solid-State Circuits, IEEE Journal of, vol. 34, no. 3, pp. 397–404, 1999.
[9] Texas Instruments Inc., TMS320C6748 Fixed- and Floating-Point DSP, 2014, Available: www.ti.com.
[10] K. Nadehara, T. Miyazaki, and I. Kuroda, “Radix-4 FFT implementation using multimedia instructions,” in Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, vol. 4. IEEE, 1999, pp. 2131–2134.
[11] Freescale Semiconductor Inc., Six-Core, 2013, Available: www.freescale.com/.
[12] J. S. Lee and M. H. Sunwoo, “Design of new DSP instructions and their hardware architecture for high-speed FFT,” Journal of VLSI signal processing systems for signal, image and video technology, vol. 33, no. 3, pp. 247–254, 2003.
[13] Y.-H. Huang, H.-P. Ma, M.-L. Liou, and T.-D. Chiueh, “A 1.1 G MAC/s sub-word-parallel digital signal processor for wireless communication applications,” Solid-State Circuits, IEEE Journal of, vol. 39, no. 1, pp. 169–183, 2004.
[14] A. A. Al Sallab, H. Fahmy, and M. Rashwan, “Optimized hardware implementation of FFT processor,” in Design and Test Workshop (IDT), 2009 4th International. IEEE, 2009, pp. 1–5.
[15] L. Liu, Z. Yang, S. Li, and M. Yan, “Implementation of high-throughput FFT processing on an application-specific reconfigurable processor,” in Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on. IEEE, 2012, pp. 1284–1288.
[16] M. Basiri and N. M. Sk, “An efficient hardware based MAC design in digital filters with complex numbers,” in Signal Processing and Integrated Networks (SPIN), 2014 International Conference on. IEEE, 2014, pp. 475–480.
[17] C. Arm, S. Gyger, J.-M. Masgonty, M. Morgan, J.-L. Nagel, C. Piguet, F. Rampogna, and P. Volet, “Low-power 32-bit dual-MAC 120 μW/MHz 1.0 V icyflex1 DSP/MCU core,” Solid-State Circuits, IEEE Journal of, vol. 44, no. 7, pp. 2055–2064, 2009.
[18] C.-K. Chen, P.-C. Tseng, Y.-C. Chang, and L.-G. Chen, “A digital signal processor with programmable correlator array architecture for third generation wireless communication system,” Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on, vol. 48, no. 12, pp. 1110–1120, 2001.
[19] S. Krithivasan and M. J. Schulte, “Multiplier architectures for media processing,” in Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, vol. 2. IEEE, 2003, pp. 2193–2197.
[20] A. Danysh and D. Tan, “Architecture and implementation of a vector/SIMD multiply-accumulate unit,” Computers, IEEE Transactions on, vol. 54, no. 3, pp. 284–293, 2005.
[21] L.-D. Van and J.-H. Tu, “Power-efficient pipelined reconfigurable fixed-width Baugh-Wooley multipliers,” Computers, IEEE Transactions on, vol. 58, no. 10, pp. 1346–1355, 2009.
[22] S. S. Kidambi, F. El-Guibaly, and A. Antoniou, “Area-efficient multipliers for digital signal processing applications,” Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on, vol. 43, no. 2, pp. 90–95, 1996.
[23] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of computation, vol. 19, no. 90, pp. 297–301, 1965.
[24] J. Hartig, L. Gerlach, G. Paya-Vaya, and H. Blume, “Customizing a VLIW-SIMD application-specific instruction-set processor for hearing aid devices,” in Signal Processing Systems (SiPS), 2014 IEEE Workshop on. IEEE, 2014, pp. 1–6.
[25] Taiwan Semiconductor Manufacturing Company Limited. [Online]. Available: www.tsmc.com
[26] Y. Luo, Z. Zhang, X. Huang, J. Wu, and X. Chen, “Architecture and implementation of a vector MAC unit for complex number,” in Communications and Networking in China (CHINACOM), 2014 9th International Conference on. IEEE, 2014, pp. 589–594.