INTERNATIONAL JOURNAL OF COMPUTERS Issue 1, Volume 6, 2012

High Performance Hardware Operators for Data Level Parallelism Exploration

Libo Huang, Zhiying Wang, Nong Xiao

Abstract—Many microprocessor vendors have incorporated high-performance operators in a single instruction multiple data (SIMD) fashion into their processors to meet the performance demands of growing multimedia workloads. This paper presents recent work on the hardware implementation of these operators for data-level parallelism (DLP) exploration. Two general architectural techniques for designing operators with SIMD support are first described: the low-precision based scheme and the high-precision based scheme. New designs for integer as well as floating-point operators are then provided to achieve the best trade-off between cost and performance. To verify the correctness and effectiveness of these methods, a multimedia coprocessor augmented with SIMD operators is designed. The implemented chip demonstrates that the proposed operators achieve a good trade-off between cost and performance.

Keywords—Operator, SIMD, high performance, data level parallelism

Manuscript received October 31, 2011. This work was supported in part by the National Natural Science Foundation of China under Grant No. 61103016, No. 60803041, and No. 60773024. Libo Huang is with the School of Computer, National University of Defense Technology, Changsha 410073, China (corresponding author; phone: 86-731-84573640; e-mail: [email protected]). Zhiying Wang and Nong Xiao are with the School of Computer, National University of Defense Technology, Changsha 410073, China (e-mail: {zywang,nongxiao}@nudt.edu.cn).

I. INTRODUCTION

Arithmetic unit design is a research area of great importance in the development of processors [7]. Arithmetic units are key components of a processor, consuming much of its power and contributing much of its latency. As such, a great deal of research has been devoted to the study of high-efficiency arithmetic units [7, 19], focused primarily on implementing the various basic units, such as integer adders, multipliers, and floating-point operators, with smaller area, lower power, and higher speed. Indeed, novel circuit techniques and innovations in algorithms and structures have resulted in rapid improvements in arithmetic unit performance; as a standard operator, the basic arithmetic unit design is by now mature.

To further boost the performance of arithmetic units, design methodology has turned toward global optimization for program execution rather than optimization of a single arithmetic unit. Adding application-specific instruction-set processors (ASIPs) or single instruction multiple data (SIMD) units to a general-purpose processor has received enormous attention. The ASIP approach identifies critical operations and implements them with dedicated arithmetic units based on the performance characterization of a specific application [13]. While ASIPs are only suitable for a limited set of applications, SIMD units benefit more general data-intensive applications such as multimedia computing [5, 22]. Because multimedia applications contain a great deal of inherent parallelism that can easily be exploited by a SIMD unit, these augmented function units can significantly accelerate multimedia applications. To take advantage of this fact, several multimedia extensions were introduced into microprocessor architectures; examples are Intel's MMX, SSE, SSE2 and SSE3, AMD's 3DNow!, Sun's VIS, HP's MAX, MIPS's MDMX, and Motorola's AltiVec [22].

The main feature of these extensions is the exploitation of the data-level parallelism (DLP) available in multimedia applications by partitioning the processor's execution units into multiple lower-precision segments called subwords, and performing operations on these subwords simultaneously in a single instruction multiple data (SIMD) fashion [22]. This is called subword parallelism (also referred to as microSIMD or packed parallelism). For example, in MMX technology [3], the execution units can perform one 64-bit, two 32-bit, four 16-bit, or eight 8-bit additions simultaneously.

Integer SIMD computation is very popular in multimedia processing, where data are typically 8 or 16 bits wide: this allows the use of subword precision in existing 32-bit (or wider) arithmetic functional units. In this way, a 32-bit ALU can perform a given arithmetic operation on two 16-bit operands or four 8-bit operands, boosting performance without instantiating additional arithmetic units. Floating-point operators with SIMD features, however, appeared relatively late, with the SSE instructions (in the Pentium III) and the SSE2 and SSE3 instructions (in the Pentium 4). One reason is that the area and power consumption of floating-point operators are considerably larger than those of integer units; another is that, until a few years ago, applications usually did not require extensive floating-point data-parallel computation. Things have since changed considerably, forcing developers to address the continued need for SIMD floating-point performance in mainstream scientific and engineering numerical applications, visual processing, recognition, data mining/synthesis, gaming, physics, cryptography, and other application areas. Recently, Intel announced an enhanced set of vector instructions called AVX, with availability planned for 2010, which provides 256-bit SIMD computation [12]. It extends the existing 128-bit floating-point arithmetic instructions with 256-bit capabilities for floating-point processing.
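As a software illustration of the subword idea described above (a sketch of the concept only, not taken from the paper), four independent 8-bit additions can be carried out inside one ordinary 32-bit addition by preventing carries from crossing byte boundaries:

```python
# Software analogue of four 8-bit SIMD additions packed in one 32-bit word.
# The low 7 bits of each byte are added first, then the byte MSBs are folded
# in with XOR, so no carry ever crosses a byte boundary.
def packed_add8(a, b):
    """Add four 8-bit subwords packed in 32-bit words a and b (each wraps mod 256)."""
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)   # byte-wise add without MSBs
    return low ^ ((a ^ b) & 0x80808080)         # restore each byte's MSB sum
```

For example, `packed_add8(0x01FF0102, 0x01010101)` yields `0x02000203`: the second byte wraps from 0xFF to 0x00 without disturbing its neighbors.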


To implement multimedia extension instructions effectively, the data format used in SIMD operations and the corresponding hardware support must be provided. SIMD operators not only need to perform the basic arithmetic operations, but also the computation, data conversion, and data unpacking of subwords of different precisions in parallel. This gives the SIMD operator strong data-handling capability, but the hardware implementing it becomes more complicated, and its power and delay grow accordingly. Designing SIMD operators with low power, small area, and short delay has therefore become a new design goal.

To reduce processor area, it is often desirable to have SIMD operations share circuitry with non-SIMD operations. Many optimized hardware structures supporting SIMD operations for fixed-point and floating-point units have thus been introduced, and this paper attempts to survey these existing subword-parallel algorithms from a system point of view. New methods for the multiplier and the floating-point multiply-add fused (MAF) unit are then proposed to achieve better power, cycle delay, and silicon area. To test these methods, a multimedia coprocessor with SIMD fixed-point and floating-point units integrated into the LEON-3 host processor is designed. The main contribution of this paper is proposing and evaluating methods and architectures for SIMD operator design.

The remainder of this paper is organized as follows. Section 2 presents the general structure of the proposed SIMD units. Section 3 provides a detailed description of the proposed SIMD units. Section 4 then describes the evaluation results in the context of a multimedia coprocessor. Finally, Section 5 concludes the work.

II. GENERALIZED SIMD TECHNIQUES

All modern microprocessors have added multimedia instructions to their base instruction set architectures (ISAs). These various ISA extensions are in fact similar, with differences mainly in the types of operands, the latencies of individual instructions, the width of the data path, and the memory management capabilities. They all need to support concurrent operations on top of the original hardware. The simplest realization is to add separate computing hardware for each subword mode and then select the correct result according to the mode; but because this consumes considerable hardware resources and power, it is rarely used in actual designs. To reduce processor area, it is preferable to have SIMD operations share circuitry with non-SIMD operations. The following text discusses the general SIMD implementation methods, covering integer and floating-point operators for DLP exploration. As memory and special multimedia instructions are tied to specific system architectures, they are not discussed in this paper.

There are generally two schemes for implementing a SIMD arithmetic unit: the low-precision based scheme and the high-precision based scheme. The low-precision scheme uses several narrower units to support a wider arithmetic operation, while the high-precision scheme uses a wider arithmetic unit to support multiple narrower parallel operations. The detailed implementations are explained as follows.

A. Low-precision based scheme

The low-precision based scheme builds wider SIMD operations out of several narrower SIMD elements and then combines the multiple results. This can be achieved by iteration or by combination. The iteration method performs a high-precision operation by repeatedly cycling the data through the same low-precision unit over more than one cycle, while the combination method performs it by "unrolling the loop" and combining the results of several units [1]. For example, the adder implemented in the 2000-MOPS embedded RISC processor uses eight 8-bit adders to build a subword-parallel adder [18].

Fig. 1 Subword adder using the low-precision based scheme: (a) one m-bit adder supporting n-bit addition (iteration); (b) k m-bit adders supporting n-bit addition (combination)

Figure 1 shows the structures of SIMD adders using the low-precision based scheme. In Figure 1(a), an m-bit adder is used to perform an n-bit addition in n/m cycles using the iteration method, where n > m and m divides n. In each cycle, m bits of the source operands A and B are selected into the m-bit adder, and the computed m-bit result is stored into the corresponding location of the result operand R. To propagate the carry chain to the higher part, the carry-out bit of the adder is stored in a register and fed back as the carry-in bit in the next cycle. The iterative method is useful when SIMD arithmetic units must be provided with a minimal amount of hardware, but it has the shortcoming of requiring pipeline stalls and extra cycles. Figure 1(b) illustrates the combination implementation of an n-bit addition using n/m m-bit adders. The carry-out bit of each lower-part adder is sent to the next higher-part adder to combine the results. Although the combined adder requires less delay than the iterative adder, it needs more area, and it has more delay than a scalar adder of the same width. This is because the carry chain of a scalar adder can be optimized globally for performance, whereas the combined adder is optimized locally and the carry must ripple through each m-bit adder.
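The iteration scheme of Figure 1(a) can be modelled in software as follows. This is a behavioural sketch with invented names, not the RTL: one m-bit adder is reused over n/m cycles, with the carry-out latched and fed back as the next cycle's carry-in.

```python
def m_bit_adder(a, b, cin, m):
    """One m-bit adder slice: returns (m-bit sum, carry-out)."""
    total = a + b + cin
    return total & ((1 << m) - 1), total >> m

def iterative_add(a, b, n, m):
    """n-bit addition on a single m-bit adder over n/m cycles (Fig. 1(a) model)."""
    assert n % m == 0, "n must be a multiple of m"
    mask = (1 << m) - 1
    result, carry = 0, 0
    for i in range(n // m):              # one adder pass per cycle
        seg_a = (a >> (i * m)) & mask    # select m bits of operand A
        seg_b = (b >> (i * m)) & mask    # select m bits of operand B
        seg_sum, carry = m_bit_adder(seg_a, seg_b, carry, m)
        result |= seg_sum << (i * m)     # store into result register R
    return result                        # n-bit result; final carry-out dropped
```

The combination method of Figure 1(b) is the same computation with the loop unrolled into n/m adder instances, the latched carry register replaced by direct carry-out to carry-in wiring.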


Fig. 2 Subword multiplier using the low-precision based scheme: (a) one m-bit multiplier supporting n-bit multiplication using k^2 iterations; (b) k^2 m-bit multipliers supporting n-bit multiplication (combination)

Figure 2 shows the structures of SIMD multipliers using the low-precision based scheme, with the iteration and combination methods respectively. Let the n-bit operands be A = a_{k-1}a_{k-2}...a_1a_0 and B = b_{k-1}b_{k-2}...b_1b_0, where each a_x and b_x has m bits and k = n/m. Then

A = a_{k-1}×2^((k-1)m) + a_{k-2}×2^((k-2)m) + ... + a_1×2^m + a_0
B = b_{k-1}×2^((k-1)m) + b_{k-2}×2^((k-2)m) + ... + b_1×2^m + b_0

so A×B can be expressed as:

A×B = a_{k-1}b_{k-1}×2^((2k-2)m) + ... + a_{k-1}b_0×2^((k-1)m)
    + a_{k-2}b_{k-1}×2^((2k-3)m) + ... + a_{k-2}b_0×2^((k-2)m)
    + ... + a_0b_{k-1}×2^((k-1)m) + ... + a_0b_0

From this equation we can see that an n×n multiplier can be built out of k^2 m×m multipliers by generating k^2 2m-bit partial products, which are then added to form the 2n-bit product. Using the iteration method, not all k^2 m×m multipliers are required. As shown in Figure 2(a), an n-bit multiplier is built using only one m-bit multiplier with slight hardware modification. The partial product generator (PPG) does not need to change, while the partial product reduction tree (PPRT) gains two additional rows. To reduce delay, the multiplication result computed in each cycle is kept in carry-save representation; in the next cycle, the carry and sum vectors are fed back into the multiplier and handled by the two additional rows. The final carry-propagate adder (CPA) can likewise use the iteration method to support the 2n-bit addition. For the combination method shown in Figure 2(b), k^2 m×m multipliers are used to support the n×n multiplication. As with the combined adder, the combination method has no obvious advantage over the iterative or scalar implementations, but the iteration and combination methods can be combined to enable various area-versus-latency trade-offs.

B. High-precision based scheme

The high-precision based scheme uses an existing high-precision scalar structure that is segmented according to the size of the smallest SIMD element, so this method is also called the segmentation scheme. It is the main approach for implementing SIMD operators because it retains existing scalar hardware structures while supporting SIMD operations with little area and delay overhead, which was the initial design goal of SIMD extensions. The key feature of this method is that the final result comes from the same hardware, as opposed to selecting results from different hardware with a multiplexer according to the SIMD mode.

Fig. 3 High-precision based scheme: (a) n-bit adder supporting (n/m) m-bit additions in parallel; (b) n-bit multiplier supporting (n/m) m-bit multiplications in parallel

Figure 3(a) shows an n-bit segmented adder capable of supporting k parallel m-bit addition operations. Each m-bit segment conditionally interacts with its neighbors or remains independent according to the SIMD precision mode. The crux of this method is applying a mechanism to control the influence of the carry-in signal on higher bit positions: when performing narrower SIMD additions, the carry signals from lower elements must be ignored by higher ones to keep each addition within its element boundary. The simplest way is to partition the SIMD element adders by inserting logic gates on the carry chain. However, this carry-truncation mechanism dramatically increases the gate delay, since the inserted truncation logic sits on the critical path. In contrast, the carry-elimination mechanism allows the carry signals to propagate to higher bit positions and then cancels their effect by applying control logic at those positions. Because the inserted logic is not all on the critical path, the added gate delay of the carry-elimination mechanism is less than that of carry truncation [10].

The carry-elimination mechanism employs a one-bit carry select signal SC to perform element partitioning. In the carry look-ahead adder (CLA), carry select adder (CSA), and parallel prefix adder (PPA), the p_i and s_i computation equations for the least


significant position of each element boundary are modified to incorporate the SC signal.

Figure 3(b) shows the n-bit segmented multiplier capable of supporting k parallel m-bit multiplication operations. To prevent carry interaction between SIMD element operations, the simdization of the PPG, the PPRT, and the final CPA must be carefully designed. The PPG is aligned so that, for a given vector mode, the partial products associated with each SIMD element do not overlap with those of the other SIMD elements. This allows the same partial product reduction tree employed in a regular scalar multiplier to be used. Usually, Booth encoding is used to generate the partial products. Although Booth encoding reduces the number of partial products that must be added, it adds complexity in handling carries across SIMD element boundaries in the PPRT and CPA [6]. Another simdization method, described in [15], does not use Booth encoding and therefore does not require detection and suppression of carries across SIMD element boundaries.

III. SIMD OPERATORS DESIGN

In this section we discuss our operator designs for multimedia ISAs, including the adder, multiplier, permutation unit, and floating-point unit (FPU). These operators form the basis for implementing various multimedia ISAs. Area-performance trade-offs are made to obtain a practical solution for hardware operator implementation.

A. Adder FU

The design of subword-parallel adders is of great significance for microprocessors with multimedia extensions. A subword-parallel adder provides a very low-cost form of small-scale SIMD parallelism in a word-oriented scalar adder. Section II discussed the general SIMD techniques for the adder operator. Figure 4 shows the proposed structure of the subword-parallel adder based on the carry-elimination mechanism. The 32-bit adder in the figure can be of any type, such as CLA, CSA, or PPA; the CLA structure is used in our implementation. Since equation (1) is applied only at the least significant bit position of each subword, and all other bit positions are still computed by the conventional logic equations, the added gate delay for supporting subword parallelism is small.

Fig. 4 Subword adder using the carry-elimination mechanism (a 32-bit adder partitioned into two 16-bit or four 8-bit subwords by carry select signals SC0-SC2)

Some enhanced parallel multimedia processing instructions can also be derived from the subword adder. For example, parallel average ((A+B)/2) is obtained by adding and then dividing the sum by two, which is equivalent to a one-bit right shift after the addition. Other derivable instructions include parallel maximum, parallel compare, and parallel shift-and-add, as well as conditional instructions such as (A=B) and (A>B). To better support multimedia applications, many multimedia ISAs also provide saturation arithmetic instructions. Saturation clamps the result of an operation to a minimum and maximum value; additional hardware must be added in which the overflow flags are examined to determine which subwords should be saturated.

B. Multiplier FU

The subword multiplier has also received extensive research attention, for example the multiplier architecture introduced in [15] and the multiple-precision fixed-point vector multiply-accumulator proposed in [6]. We classified these designs in the previous section: the high-precision based scheme is fast but consumes considerable hardware resources, while the low-precision scheme reduces cost but cannot satisfy the demand for multiple subword patterns. To design a fast and area-efficient subword multiplier, we combine the two schemes, namely using low-precision subword multipliers in place of ordinary multipliers to realize the higher-precision multiplier. In this way, hardware cost is reduced and multiple subword modes are supported, which is a good trade-off between cost and performance. Figure 5 shows the structure of the proposed 32-bit combined subword multiplier. It consists of two 16-bit subword multipliers that also support the 8-bit subword mode. A 2-bit signal SM is introduced to select the subword mode.

Fig. 5 Subword multiplier using the combined scheme

The multiplication operands are input through registers A and B. Both signed and unsigned multiplication are supported. Multiplexers M1 to M4 steer the corresponding high or low 16-bit operands to the two 16-bit subword multipliers under the control of the SM signal. Multiplexer M5 sends the value temporarily stored in the Temp register to the CSA for accumulation in the third cycle when the multiplier is in 32-bit subword mode.

The 32-bit multiplication is realized as illustrated in Figure 6. We denote by A and B two 32-bit signed numbers, with A1, B1 the high and A0, B0 the low 16 bits of operands A and B respectively. In the first cycle, A1×B0 (signed×unsigned) and A0×B1 (unsigned×signed) are computed by the two


multipliers and stored in registers R1 and R2. In the second cycle, A0×B0 (unsigned×unsigned) and A1×B1 (signed×signed) are computed and stored in R1 and R2, while at the same time the first-cycle results from R1 and R2 are accumulated into register Temp through the CPA (carry-propagate adder). In the third cycle, the values in registers R1, R2, and Temp are accumulated by the CSA and CPA to obtain the final 63-bit multiplication result.

Fig. 6 32-bit signed multiplication (partial products A0×B0 (u×u), A1×B0 (s×u), A0×B1 (u×s), and A1×B1 (s×s), aligned at bit positions 0, 16, 16, and 32 respectively)

The implementation of a 32×16 multiplication is similar to that of 32×32, but requires only two cycles: the first cycle computes A0×B0 and A1×B0 and stores them in R1 and R2, and the second cycle obtains the final result through the CPA. The implementations of 16-bit and 8-bit subword multiplication are straightforward, because the results can be computed directly by the two 16-bit subword multipliers.

Based on the subword multiplier designs, we obtain the implementation structure of the multiply-accumulate unit (MAC). The accumulator is fed into the Wallace tree, and the carry chain is likewise blocked at the subword boundaries.

C. Permutation FU

When subword-parallel instructions are used, many permutation instructions are needed to facilitate the SIMD operations. These permutation instructions reorder the subwords or short vectors from one or two source registers. A good design for permutation instructions is essential for exploiting the full subword and superword parallelism.

Fig. 7 Block diagram of the permutation unit (pattern generation and pattern control logic driving a permutation network over source registers R1 and R2)

Figure 7 illustrates the block diagram of the proposed permutation unit. It consists of three components: the permutation network, the pattern generation logic, and the pattern control logic. The permutation network, a crossbar implementation, is the core component of the unit. The pattern generation logic decodes the operation codes and generates the internal control signals for the various permutation patterns. The pattern control logic drives the permutation network to perform the specific permutations. This hardware implementation can support various permutation instructions. It is mainly used to realize the instruction vperm(R1, R2, P), which supplies extremely flexible data permutation functionality: it selects an arbitrary set of bytes from the two input registers, R1 and R2, as indicated by the permutation pattern P, which is also stored in a register. To reduce the number of mask bits required, there are also special data permutation instructions with fixed patterns, such as the pack and mix instructions [3].

D. Floating-point unit

In this paper, floating-point operators are implemented for the IEEE floating-point format as specified in [4]. Compared with fixed-point arithmetic units, sharing hardware between different floating-point precisions is not as easy. Most recent processors implement SIMD floating-point instructions by replicating floating-point units, such as the vector FPU in a synergistic processor element of the CELL processor [17] and the SIMD FPU of BlueGene/L [16]. The work in [14] extends a double-precision FPU to support single-precision interval addition/subtraction or multiplication, proposing a method by which double-precision units can be split to compute the two bounds of a single-precision interval in parallel. In [8], a dual-mode multiplier is introduced which uses a half-sized (i.e., 27×53) multiplication array to perform either one double-precision or two single-precision multiplications, the double-precision operation taking two cycles. The multiplier proposed in [2] uses two double-precision multipliers plus additional hardware to perform one quadruple-precision multiplication or two parallel double-precision multiplications. The multiple-precision iterative multiplier presented in [23] can perform two SP multiplies every cycle with a latency of two cycles, one DP multiply every two cycles with a latency of four cycles, or one EP multiply every three cycles with a latency of five cycles; it also supports division, square root, and transcendental functions. The multi-functional QP MAF unit presented in [24] supports one QP MAF, two DP multiplications, four SP multiplications, or an SP/DP dot-product operation; this unit performs particularly well when executing dot products of two-element vectors.

To obtain good performance in both single and double precision while saving cost by sharing hardware between the two precisions, this paper proposes a double-precision FPU based on a MAF unit that supports two single-precision operations. Analyzing the structure of an FPU in detail, we find that the logic that can be shared to reduce area comes mostly from the adder, the multiplier, the LOP (leading-one predictor), and so on. Using the techniques of the subword adder [10] and multiplier [6], these units need only about 5% more area. Since the MAF includes all of these units, resulting in a reasonably small area increase, and since more and more FPUs are implemented with MAF units [16, 17], it is a sound choice to implement a SIMD FPU based on a SIMD MAF unit.

To use one double-precision unit to support two parallel single-precision operations, each module needs modification. The double-precision mantissa units are more than twice as wide as the single-precision mantissa units, so it is


possible to share most of the hardware through careful design. Figure 8 shows the hardware architecture of the proposed MAF unit. It is pipelined with a latency of three cycles, conforming to the traditional three steps. For simplicity, the exponent computation block, the sign production block, and the exception handling block are not illustrated in the figure.

The first cycle of the MAF computes the multiplication of A and B to produce an intermediate product A×B, and performs the alignment of C. The mantissa multiplier is extended to support either one 53-bit multiplication or two 24-bit multiplications using the techniques proposed by D. Tan et al. (2003). The alignment module shifts the mantissa of operand C according to the exponent difference. Because using a 161-bit shifter to perform two parallel 75-bit shifts introduces considerable delay, the alignment module uses two shifters: the high-position single-precision shift shares the original double-precision shifter, and a low-position single-precision shifter is added.

During the second cycle, the aligned C is added to the carry-save representation of A×B, and the shift amount for normalization is determined by means of the LOP. The 106-bit addition needs no additional control logic, because the two single-precision numbers are separated by one bit that can absorb the addition carry; thus the two single-precision additions do not affect each other.

In the third cycle, the normalization shifting and rounding are performed. Two normalization shifters are again needed for speed, and the Round, Guard, and Sticky bits are sent to the unification logic to produce the round flag deciding whether to add one. After rounding, post-normalization is needed to obtain the final mantissa result.

Fig. 8 Structure of the SIMD MAF unit (operand selectors; 53-bit subword multiplier supporting one 53×53 or two 24×24 multiplications; alignment shifter with negation; 3-2 CSA; 106-bit adder with 55-bit incrementer; leading-one prediction; 161-bit normalization shifter; rounder and post-normalization producing the 53-bit significand)

IV. EVALUATION

This section gives a detailed evaluation of the proposed operator structures based on a practical data-parallel coprocessor design. First the system architecture of the coprocessor is described, then its specific operators are introduced, and finally the hardware cost and runtime performance for multimedia benchmarks are presented.

A. System overview

To test the hardware operator designs for DLP exploration, we designed a multimedia coprocessor based on the TTA (Transport Triggered Architecture) [21], integrated with the LEON-3 host processor. The block diagram of the architecture is shown in Figure 9(a). The coprocessor is used to accelerate the core processing of multimedia applications such as data compression and graphics functions.

Fig. 9 System overview: (a) system architecture (LEON-3 CPU with ICache/DCache, functional unit cluster coprocessor, AMBA bus); (b) functional unit cluster (FUs and RFs connected to the transport network through sockets)

TTA can be viewed as a superset of the traditional VLIW architecture. The major difference between TTA and a traditional operation-triggered architecture is the way an operation is executed: instead of operations triggering data transports, TTA operations occur as a side effect of data transports. The TTA-based architecture lets the coprocessor integrate abundant arithmetic functional units in a very simple way, and gives the compiler flexible, fine-grained exploitation of subword parallelism.

The structure of the functional unit cluster is organized as shown in Figure 9(b). An abundance of functional units (FUs) and register files (RFs) are connected to the interconnection network by sockets. Every FU or RF has one or more operand registers and result registers, but only one trigger register. Data transferred to the trigger register triggers the functional unit to operate. All operations, including load/store and branch/jump/link, occur solely through moves. FUs operate concurrently and may incorporate pipelined execution. Thus, compared with the traditional VLIW architecture, TTA makes parallelism more fine-grained and flexible, which suits more efficient scheduling at the instruction level as well as the data level.
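The trigger-register behaviour described above can be sketched with a toy software model (class and method names are illustrative, not from the paper): a move to an operand register only latches data, while a move to the trigger register starts the operation as a side effect of the transport.

```python
# Toy model of a TTA-style functional unit. Moves to the operand (O-type)
# register just store data; a move to the trigger (T-type) register fires
# the FU's operation implicitly, placing the result in the R-type register.
class AddFU:
    def __init__(self):
        self.operand = 0          # O-type register
        self.result = 0           # R-type register

    def move_operand(self, value):
        self.operand = value      # plain data transport, no computation

    def move_trigger(self, value):
        # Transport targeting the trigger register: the addition happens
        # as a side effect of this move.
        self.result = (self.operand + value) & 0xFFFFFFFF

fu = AddFU()
fu.move_operand(40)   # O <- 40
fu.move_trigger(2)    # T <- 2, which triggers O + T; result register now holds 42
```

In a real TTA program these two moves would be the entire "add" instruction sequence; the compiler schedules such moves across many FUs in parallel.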


In each cycle, the specified data transfers in the transport network can trigger the operations of these functional units. This allows for extremely high utilization of the silicon area and suits different signal processing algorithms, ranging from mobile communication systems to streaming media applications. The computation is done by transferring values to the operand registers and starting an operation implicitly via a move targeting the trigger register associated with the control command.

Tab. 1 Arithmetic unit configuration

FU    Numbers  Description                  Latency
ALU   4        Subword arithmetic, logic    1
MAC   4        Subword MAC                  3/2
CMP   1        Integer, FP compare          1
FPU   1        2-way single, 1-way double   5

Table 1 shows the arithmetic unit configuration of the coprocessor, in which the latency of the MAC is three cycles in 32-bit subword mode and two cycles in the other modes. To save instruction encoding bits, the ALU and MAC operations are implemented in one functional unit, as described in the following subsection. The CMP unit is a combined comparator which can perform double-/single-precision floating-point comparison or integer comparison; it does not support SIMD operations.

B. Operator Implementation

1) Typical operator

Every functional unit has its own unique internal structure, but the external interfaces are unified: one or more O-type registers and R-type registers, and only one T-type register. If an operator supports two or more operations, an opcode must also be specified.

The FUs are implemented using a so-called virtual time lock (VTL) pipeline technique, in which the intermediate data in the FUs advance to the next stage on every clock cycle. This means that if the results of the last operation are not read in time, they will be overwritten by the new operation; whether the results of an operation are still available therefore depends on whether the FU has been triggered again. Figure 10 shows a typical operator. Its pipeline has a latency of 3 clock cycles, meaning the result register can be read 3 cycles after writing to the trigger register. The bus_select signal selects the right bus data and stores it into the O-type or T-type register according to the load signal. Then the actual operation is executed by the FU. Three cycles later, the result of the operation is stored in the R-type register. When this result is needed, the control signal sends its value to the corresponding data bus and sets all other buses to zero.

Fig. 10 System Overview (typical operator: bus_select, o1load/t1load, opcode register, O-/T-/R-type registers, MUX)

2) ALU Example

Figure 11 shows the structure of the proposed ALU design. It consists of five main components: internal registers, control logic, the adder-related unit, the multiplier-related unit, and the permutation unit. The internal registers include the configuration register CONFIG, the operation encoding register OPCODE, the operand registers X1, X2, and X3, the result register RESULT, and the status register STATUS. The program can specify the computation mode and the exception handling by setting the CONFIG register. The input register OPCODE specifies the different operations performed by the ALU. The result register RESULT stores the computation result, and the status register STATUS indicates whether addition/subtraction operations have raised an overflow exception. The control logic CTRL decodes the operations according to OPCODE and determines how exceptions are handled when they happen. The adder-related unit is responsible for the various subword arithmetic and logic executions as specified in previous sections. The multiplier-related unit fulfils the subword multiplication and MAC operations. The permutation unit performs the subword permutation instructions such as pack/unpack.

Fig. 11 Structure of proposed ALU (input bus; CONFIG, OPCODE, X1-X3; CTRL; subword adder, subword multiplier, pack/unpack logic, shifter; STATUS, RESULT)
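The typical-operator behaviour described in this section (trigger-register semantics plus the VTL pipeline) can be sketched in software. The class below is a toy model, with illustrative names and structure rather than the paper's RTL: a move to the trigger register starts the operation, results ripple through the pipeline one stage per cycle, and an unread result is silently overwritten.

```python
class VTLOperator:
    """Toy model of a transport-triggered FU with a virtual time lock
    (VTL) pipeline: data advance one stage per clock, and a result
    sitting in the R-type register is overwritten by later operations."""

    def __init__(self, latency=3, op=lambda a, b: a + b):
        self.latency = latency
        self.op = op
        self.o_reg = 0                   # O-type (operand) register
        self.stages = [None] * latency   # in-flight intermediate results
        self.r_reg = None                # R-type (result) register

    def write_operand(self, value):
        """Move to the O-type register: just latches the operand."""
        self.o_reg = value

    def write_trigger(self, value):
        """Move to the T-type register: implicitly starts the operation."""
        self.stages[0] = self.op(self.o_reg, value)

    def clock(self):
        """One cycle: the pipeline advances whether or not r_reg was read."""
        done = self.stages[self.latency - 1]
        if done is not None:
            self.r_reg = done            # overwrites any unread result
        self.stages = [None] + self.stages[:-1]

fu = VTLOperator()
fu.write_operand(2)
fu.write_trigger(3)        # starts 2 + 3
for _ in range(3):
    fu.clock()             # latency of 3 cycles, as in Figure 10
print(fu.r_reg)            # -> 5
```

Reading `r_reg` earlier than three cycles after the trigger move, or after a later operation has completed, yields stale or overwritten data; scheduling around this is exactly what the VTL discipline delegates to the compiler.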


Fig. 12 Optimization techniques for ALU operator design: (a) resource sharing to reduce area; (b) parallel structure to reduce latency; (c) pre-selection to reduce power

To get a highly optimized ALU targeting high performance, low power, and small area, we use some simple but effective methods to optimize the ALU design. Figure 12 shows the optimization schemes used in the proposed ALU design. The resource sharing method decreases the hardware cost without performance overhead. Based on the principle of logical equivalence, it modifies the data processing mechanism to increase the utilization of complex circuits. Figure 12(a) illustrates how the AB datapath and the A'B' datapath can share the adder under the control signal select.

To reduce latency, the conventional serial structure of the ALU can be changed to a parallel structure, as shown in Figure 12(b). The serial structure would increase the execution delay of every instruction, and its control logic is relatively hard to describe. In the proposed ALU, the parallel structure not only improves the ALU speed but also simplifies the overall design.

The third technique reduces the power consumption of the ALU, based on the observation that unnecessary circuit activity consumes much power. The proposed ALU consists of many sub-modules, and each time only one sub-module is used, so power is wasted if the other, unnecessary sub-modules switch. We can perform data selection before entering those sub-modules to gate off their datapaths, so the states of the unused sub-modules remain unchanged. This removes unnecessary signal switching, resulting in low power. Figure 12(c) illustrates this data pre-selection technique.

C. Results

Utilizing the subword implementation techniques described above, we realized all the proposed functional units and integrated them into the multimedia co-processor. They are modeled in structural-level Verilog HDL. The correctness of the various implementations is verified with extensive simulations, and they are then synthesized using Synopsys Design Compiler and a standard cell library. The chip was fabricated in a 0.18 µm 6-metal standard CMOS logic process. It occupies 4.8 mm², and its die photograph is shown in Figure 13. Correct operation has been observed up to 300 MHz, and at this frequency the dynamic power consumption is only 560 mW.

Fig. 13 Die Photograph

Tab. 2 Experimental Results of Arithmetic Units

Parameter              ALU     MAC     CMP    FPU     Total
delay (ns)  Non-SIMD   2.12    2.60    1.76   2.55    2.60
            SIMD       2.55    2.42    -      2.79    2.79
area (µm²)  Non-SIMD   63212   140316  21023  530669  1365804
            SIMD       74464   113305  -      615576  1387675

Table 2 gives the hardware experiment results of the arithmetic units. For comparison, the traditional arithmetic units were also designed. As can be seen from the table, the complete set of SIMD arithmetic units has only roughly 2% more area, and a worst-case delay that is 7% longer, than the traditional arithmetic units. This is because the area of the proposed subword multiplier is even smaller than that of the traditional multiplier, and the SIMD FPU does not add much delay.

For this study, we use a set of common DSP and multimedia algorithms [20], including digital filters, fast Fourier transform (FFT), inverse discrete cosine transform (IDCT), and matrix arithmetic, listed in Table 3. Matrix arithmetic is extremely important in 3D graphics because matrices are used for the various point transformations in the 3D world. FIR digital filters are used for general filtering in speech and audio processing, and for linear predictive coding. Applications of IIR digital filters include audio equalization, speech compression, linear predictive coding, and general filtering. FFT is used in many applications, including MPEG audio compression, radar processing, sonar processing, ADSL modems, and spectral analysis. DCT-based image coding is the basis for all the image and video compression standards, including JPEG image compression, MPEG video compression, and H.263.
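The subword implementation idea that underlies these units -- one wide datapath performing several narrow, carry-isolated operations at once -- can be illustrated in software. The sketch below is not the paper's adder circuit; it simply masks off carries at lane boundaries, which is the behaviour a partitioned adder implements in hardware with carry-kill logic.

```python
def subword_add(a, b, width, subword):
    """Emulate a partitioned (SIMD) adder: a `width`-bit datapath
    performing independent `subword`-bit additions, with carries
    blocked at every subword boundary (each lane wraps modulo 2^subword)."""
    mask = (1 << width) - 1
    out = 0
    for shift in range(0, width, subword):
        lane_mask = ((1 << subword) - 1) << shift
        # Add only this lane's bits, then discard any carry-out of the lane.
        lane_sum = ((a & lane_mask) + (b & lane_mask)) & lane_mask
        out |= lane_sum
    return out & mask

# Four independent 8-bit additions on one 32-bit word:
# 0xFF + 0x01 wraps to 0x00 in its lane instead of carrying into the next.
print(hex(subword_add(0x01FF0102, 0x010101FE, 32, 8)))  # -> 0x2000200
```

Calling the same function with `subword=16` or `subword=32` mirrors the mode selection of the proposed units: the datapath is reused unchanged, and only the carry-blocking granularity differs.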


Tab. 3 Benchmark Description

Name     Description
fft      128-point radix-4 FFT, 16-bit
mat_mul  4x4 matrix multiplication, 16-bit
maxidx   index of the max value in a vector with 64 elements, 16-bit
idct     512x512 IDCT, 8-bit
fir      complex FIR filter, 40 coefficients, 16 output samples, 16-bit
iir      IIR filter with 500 output samples, 16-bit

Using the data parallel coprocessor, we test the proposed operators with both SIMD-instruction and non-SIMD-instruction versions separately. This shows the performance effectiveness of using the data parallel operators. Figure 14 presents the instruction count comparison of the SIMD and non-SIMD versions. The instruction code size reduction ranges from 19.6% to 35.8%. This is very beneficial for our coprocessor, as code size is always a big problem for a VLIW-like ISA.

Fig. 14 Instruction count comparison (normalized instruction count, SIMD vs. non-SIMD, for MATMUL, MAXIDX, FIR, IDCT, IIR, FFT)

Figure 15 shows the performance comparison result obtained in this study. The speedup of SIMD instructions over scalar instructions ranges from 1.02 for IIR to 1.56 for MATMUL. The IIR digital filter is the most difficult to vectorize because it involves a feedback path from the previously computed output samples; hence it has the least amount of data parallelism, resulting in the smallest speedup of 1.02. Matrix arithmetic, on the other hand, is straightforward to vectorize; therefore we are able to obtain reasonable data parallelism, resulting in a reasonable dynamic instruction count reduction and overall speedup. The experimental speedup for all the algorithms is less than the theoretical speedup. This confirms the negative effect of the overhead of data reorganization for packed data types.

Fig. 15 Performance comparison in terms of execution cycle count (normalized speedup, SIMD vs. non-SIMD, for MATMUL, MAXIDX, FIR, IDCT, IIR, FFT)

V. CONCLUSION

This paper presents a systematic solution for designing high performance hardware operators for DLP exploration. The SIMD techniques are classified into two categories: the low-precision-based scheme and the high-precision-based scheme. To accommodate the best trade-off between hardware cost and performance, combined techniques are used to design the integer and floating-point operators. Integrating all these operators into a system, a multimedia co-processor was designed and tested in a 0.18 µm standard CMOS logic process. It shows that the SIMD arithmetic unit cluster can speed up multimedia applications by 2%~56% at the cost of a small increase in cycle delay and silicon area.

REFERENCES

[1] A. A. Farooqui and V. G. Oklobdzija, "General Data-Path Organization of a MAC Unit for VLSI Implementation of DSP Processors", Proc. IEEE Int'l Symp. Circuits and Systems, pp. 260-263, 1998.
[2] A. Akkas and M. J. Schulte, "A Quadruple Precision and Dual Double Precision Floating-Point Multiplier", Proc. 2003 Euromicro Symposium on Digital System Design, pp. 76-81, 2003.
[3] A. Peleg and U. Weiser, "MMX Technology Extension to the Intel Architecture", IEEE Micro, 16(4):42-50, 1996.
[4] ANSI/IEEE Standard 754-1985, IEEE Standard for Binary Floating-Point Arithmetic, 1985.
[5] J. Corbal et al., "DLP+TLP processors for the next generation of media workloads", Proc. 7th Intl. Symp. on HPCA, 2001.
[6] D. Tan, A. Danysh, and M. Liebelt, "Multiple-Precision Fixed-Point Vector Multiply-Accumulator using Shared Segmentation", Proc. 16th IEEE Symp. Computer Arithmetic, pp. 12-19, 2003.
[7] K. Hwang, "Computer Arithmetic: Principles, Architecture, and Design", Wiley, New York, 1979.
[8] G. Even, S. Mueller, and P.-M. Seidel, "A Dual Mode IEEE Multiplier", Proc. 2nd IEEE Int. Conf. on Innovative Systems in Silicon, pp. 282-289, 1997.
[9] J. Bruguera and T. Lang, "Leading-One Prediction with Concurrent Position Correction", IEEE Trans. Computers, vol. 48, no. 10, pp. 298-305, 1999.
[10] M. Senthilvelan and M. J. Schulte, "A Flexible Arithmetic and Logic Unit for Multimedia Processing", Advanced Signal Processing Algorithms, Architectures, and Implementations XIII, 2003.
[11] N. Burgess, "PAPA - Packed Arithmetic on a Prefix Adder for Multimedia Applications", Proc. IEEE Int. Conf. on Application-Specific Systems, Architectures, and Processors (ASAP'02), pp. 197-207, 2002.
[12] N. Firasta et al., "Intel AVX: New Frontiers in Performance Improvements and Energy Efficiency", Intel White Paper, 2008.
[13] R. E. Gonzalez, "Xtensa: A configurable and extensible processor", IEEE Micro, 20(2), 2000.
[14] R. Kolla et al., "The IAX Architecture: Interval Arithmetic Extension", Technical Report 225, Universitat Wurzburg, 1999.
[15] S. Krithivasan and M. J. Schulte, "Multiplier Architectures for Media Processing", Proc. 37th Asilomar Conf. Signals, Systems, and Computers, pp. 2193-2197, 2003.
[16] S. Chatterjee and L. R. Bachega, "Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L", IBM Journal of Research and Development, 49(2/3):377-392, 2005.
[17] S. M. Mueller, C. Jacobi, H. Oh, et al., "The Vector Floating-Point Unit in a Synergistic Processor Element of a CELL Processor", Proc. 17th IEEE Symposium on Computer Arithmetic, 2005.
[18] K. Suzuki et al., "A 2000-MOPS embedded RISC processor with a Rambus DRAM controller", IEEE J. Solid-State Circuits, vol. 34, pp. 1010-1021, 1999.
[19] E. Swartzlander, "Computer Arithmetic", IEEE Computer Society Press, Los Alamitos, Calif., 1990.
[20] TMS320C64x DSP Library Programmer's Reference, Texas Instruments Inc., 2002.


[21] V. A. Zivkovic, R. J. W. T. Tangelder, and H. G. Kerkhof, "Design and Test Space Exploration of Transport-Triggered Architectures", pp. 146-152, 2000.
[22] V. Lappalainen, P. Liuha, et al., "Current Research Efforts in Media ISA Development", Technical Report, Nokia, 2000.
[23] D. Tan, C. E. Lemonds, and M. J. Schulte, "Low-Power Multiple-Precision Iterative Floating-Point Multiplier with SIMD Support", IEEE Transactions on Computers, vol. 58, no. 2, pp. 175-187, Feb. 2009.
[24] M. Gok and M. M. Ozbilen, "Multi-Functional Floating-Point MAF Designs with Dot Product Support", Microelectronics Journal, vol. 39, pp. 30-43, January 2008.
