An Area Efficient Real- and Complex-Valued Multiply-Accumulate SIMD Unit for Digital Signal Processors

Lukas Gerlach, Guillermo Payá-Vayá, and Holger Blume
Cluster of Excellence Hearing4all, Institute of Microelectronic Systems
Leibniz Universität Hannover, Appelstr. 4, 30167 Hannover, Germany
Email: {gerlach, guipava, blume}@ims.uni-hannover.de

978-1-4673-9604-2/15/.00 ©2015 IEEE

Abstract—This paper explores a real- and complex-valued multiply-accumulate (MAC) functional unit for digital signal processors. MAC units with single-instruction-multiple-data (SIMD) support are often used to increase the processing performance of modern signal processors. Compared to real-valued SIMD-MAC units, the proposed unit uses the same multipliers to also support complex-valued SIMD-MAC and butterfly operations. The area overhead for the complex mode is small. Complex-valued operations speed up signal processing algorithms and make the execution more efficient in terms of power consumption. As a case study, a fast Fourier transform (FFT) is implemented for a VLIW processor with a complex-valued SIMD butterfly extension. The proposed functional unit is quantitatively evaluated in terms of performance, silicon area, and power consumption.

I. INTRODUCTION

Nowadays, the tremendous progress in the field of portable embedded applications (e.g., audio/video processing for multimedia content or wireless communication data) is pushing the computation load requirements of embedded target devices. Digital Signal Processors (DSPs) are commonly used within these devices due to their programmability, which allows their reuse in future algorithm modifications or for other applications. Solutions based on dedicated hardware architectures would provide higher processing performance and lower power consumption, but they lack flexibility, which is extremely desirable. How to extend the functionality of DSP architectures to meet these new computation load requirements while also decreasing the power consumption is a classical research field, which is still a hot topic [1], [2].

The most frequently used signal processing operations are correlations, filtering, and transformations, which are basically implemented by using multiplication and addition/subtraction operations. Therefore, it is appropriate to specialize the target signal processor (e.g., DSP) for this kind of computation. A commonly used arithmetic unit in DSPs is a multiply-accumulate (MAC) unit. This combination of multiplication and addition makes the computation more efficient due to the reduction of execution cycles and register file utilization. The MAC unit can be extended to operate on multiple data values simultaneously, implementing the so-called single-instruction multiple-data (SIMD) mechanism. This mechanism is one key feature in modern DSPs to meet the performance demands and power restrictions by exploiting this concurrency.

In the signal processing field, the fast Fourier transform (FFT) is one of the most widely used transformations, which greatly pushes the performance requirements. The data parallelism inherent in FFT processing allows operating with many independent MAC operations simultaneously. Therefore, a performance increase can be achieved by MAC units with SIMD mechanisms, but many instructions are still needed to operate on the real and imaginary parts of complex numbers separately. The use of single instructions in DSPs that execute operations on complex numbers can lead to a significant performance gain in many signal processing algorithms.

A SIMD-MAC unit that can handle both complex and real numbers without a large overhead in area and power consumption would be desirable. Using the same hardware multipliers and adders for real and complex arithmetic operations reduces hardware cost and decreases power consumption. A new area efficient real- and complex-valued SIMD-MAC unit for this purpose is proposed. This paper is organized as follows. In Section II, the state of the art of complex MAC units is presented, emphasizing the contribution of this work. Then, the proposed SIMD-MAC unit is explained in detail in Section III. In Section IV, a case study based on the FFT is presented. Finally, the paper is concluded in Section V.

II. RELATED WORK

Existing MAC units for real-valued and complex-valued operations can be basically classified as follows. A single MAC unit is used to compute a real-valued multiplication and accumulation. For that, only one instruction is required. Therefore, several instructions that make use of this MAC unit are required to perform a complex-valued MAC operation. Several MAC units can be implemented in VLIW-DSP architectures to accelerate a complex-valued MAC operation [3]–[6], drastically increasing the performance. SIMD-MAC units can be used to process several real-valued MAC operations by executing a single instruction, increasing the performance by taking advantage of data parallelism. However, complex-valued MAC operations still require the use of several sequential instructions [7]–[11]. Finally, specialized hardware for complex-valued MAC units can either support only complex-valued or both real- and complex-valued operations [2], [12]–[15]. In this case, only one instruction is required to process a complex-valued MAC operation. Some of these architectures are also enhanced for a butterfly operation to speed up the computation of FFT algorithms [12]–[15].

A. Single MAC Units

In [3], the authors present a DSP consisting of an array of identical datapaths. Each of these datapaths contains one independently controlled MAC unit. The bit width of the operands is 16-bit. Single double-precision MAC operations need more than one cycle.

B. SIMD-MAC Units

In [7], a DSP core is equipped with a dual MAC architecture. This MAC architecture is optimized for the computation of digital filters. One input port of one MAC is the delayed input of the other MAC unit. Both MAC units support multiple subwords. This architecture saves data access requests for algorithms that use the same operands for many MAC operations. In [8] and [16], DSPs with two MAC units are presented. In both cases, the MACs can reuse previously latched output values. These architectures are specifically optimized for FIR filter computation. In [10] and [17], SIMD-MAC units are used in a two-way superscalar RISC architecture and a low power DSP. In both cases, butterflies of an FFT are processed in parallel to increase performance.

C. Specialized Complex-Valued MAC Units

In [2], a DSP is equipped with a specialized complex-valued multiplier. This functional unit cannot be used for real multiplications and does not support SIMD operations.

The multiply-accumulate unit presented in [13] supports real- and complex-valued SIMD operations for single- and full-precision 16-bit MAC operations. Besides the efficient implementation of FIR filters with these MAC architectures, there are drawbacks for use in other cases. Parts of the computed results, which are either real or complex, are stored in different accumulator registers, which are part of the MAC architecture. Not all of these registers are directly accessible in each cycle. The output multiplexer of the MAC also restricts the number of transferred words to the register file. The complex-valued multiplication results of this MAC unit are used by the butterfly processor architecture, which constitutes the MAC unit. One butterfly per cycle can be computed.

The authors of [12] propose a data processing unit for DSPs. This unit is composed of two multipliers with three pipeline stages and five adders. New instructions are introduced [...]

A fully custom function unit (CFU) for butterfly computation is proposed in [15]. Each of these custom function units is composed of four multipliers and eight adders. These units have two input ports and two output ports. Each port is 32-bit wide and holds 16-bit real and complex values. With twelve of these units, twelve butterfly operations can be performed concurrently.

D. Contribution of This Work

No MAC architecture supporting real- and complex-valued MAC operations, and possibly butterfly operations, has been found in the literature. This paper proposes a new MAC unit architecture for either multiply or multiply-accumulate real- or complex-valued operations. Moreover, this unit supports SIMD operations for both real- and complex-valued MAC operations. The MAC architecture is also extended to compute butterfly operations for the FFT algorithm. Therefore, the aim of this work is to maintain the functionality of current real-valued SIMD-MAC units while extending their architecture with complex-valued operations to perform real- and complex-valued operations within one clock cycle.

III. PROPOSED REAL- AND COMPLEX-VALUED SIMD-CMAC UNIT

Table I shows the arithmetic operations implemented in the proposed SIMD-CMAC unit. These operations include real- and complex-valued multiply and multiply-accumulate operations as well as a butterfly operation for the FFT. Since SIMD mechanisms are used, all operations can process multiple subwords simultaneously.

TABLE I. REAL- AND COMPLEX-VALUED OPERATIONS: MULTIPLY, MULTIPLY-ACCUMULATE AND BUTTERFLY

Arithmetic Operation       Real-valued   Complex-valued
Multiply                   MUL           CMUL
Multiply-accumulate        MAC           CMAC
Multiply-accumulate-zero   MACZ          CMACZ
Butterfly                  -             Butterfly

A. Real-Valued MAC Operation

A generic real-valued multiply (MUL), multiply-accumulate (MAC), and multiply-accumulate-zero (MACZ) operation is described by Eq. 1. The number of SIMD subwords is hereinafter referred with
