
2012 Symposium on Application Accelerators in High Performance Computing
CFP1225P-ART/12 $26.00 © 2012 IEEE, DOI 10.1109/SAAHPC.2012.29

ALU Architecture with Dynamic Precision Support

Getao Liang, JunKyu Lee, Gregory D. Peterson
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville, TN, USA
[gliang, jlee57, gdp]@utk.edu

Abstract—Exploiting computational precision can improve performance significantly without losing accuracy in many applications. To enable this, we propose an innovative arithmetic logic unit (ALU) architecture that supports true dynamic precision operations on the fly. The proposed architecture targets both fixed-point and floating-point ALUs, but in this paper we focus mainly on the precision-controlling mechanism and the corresponding implementations for fixed-point adders and multipliers. We implemented the architecture on Xilinx Virtex-5 XC5VLX110T FPGAs, and the results show that the area and latency overheads are 1% ~ 24%, depending on the structure and configuration. This implies that the overhead can be minimized if the ALU structure and configuration are chosen carefully for specific applications. As a case study, we apply this architecture to binary cascade iterative refinement (BCIR), where a 4X speedup is observed.

Keywords-dynamic precision, ALUs, FPGAs, high-performance computing, iterative refinement

I. INTRODUCTION

In computational science and engineering, users continually seek to solve ever more challenging problems: faster computers with bigger memory capacity enable the analysis of larger systems, finer resolution, and/or the inclusion of additional physics. For the past several decades, standardized floating-point representations have enabled more predictable numeric behavior and portability, but at the expense of making it impractical for users to exploit customized precision with good performance. The introduction of reconfigurable computing platforms provides scientists with an affordable accelerating solution that not only has the computational power of dedicated hardware processors (ASICs and DSPs), but also the flexibility of software due to the fabric and circuit configurability [13]. Serving as either standalone platforms or co-processors, field-programmable gate arrays (FPGAs) have shown the potential of significant performance improvement over microprocessors for certain applications [14]. With the development of FPGA hardware and CAD tools, floating-point arithmetic functions are no longer impractical for FPGA designs [15]. Indeed, FPGA accelerators can outperform general processors or GPUs in very-high-precision floating-point operations (higher than double precision) due to the native hardware support. Other advantages FPGAs have over other computing solutions include their fine-grained parallelism, fault-tolerant designs, and flexible precision configurations.

Exploiting precision can yield performance improvement in many applications [1-5]. Lower precision renders Arithmetic Logic Units (ALUs) smaller, implying higher performance and better energy efficiency [4]. The exploitable parallelism also increases, since more of the smaller ALUs fit in a fixed area, and smaller ALUs use fewer transistors, implying lower power consumption. Most current computing platforms with statically defined arithmetic units, such as multi-cores and GPUs, face limitations on exploiting precision since they provide only single- and double-precision ALUs in hardware. FPGAs, however, are able to support arbitrary-precision ALUs as long as sufficient hardware resources are available.

Previous work shows the advantages of arbitrary-precision ALUs on FPGAs in some applications [4-6], but the performance benefit of arbitrary precision can be maximized only if the ALUs on the FPGA can be switched to the desired precision on the fly. Exploiting multiple precisions in some applications is explored in [1, 7]. To employ multiple-precision ALUs, it would be necessary to download bit-stream files to the FPGA whenever the precision requirement changes, degrading performance. Partial reconfiguration (PR) can be applied to reduce the re-programming time with smaller partial bit-streams [33]. However, the performance improvement might be limited for designs with relatively small and frequently switching PR modules such as ALUs: the small size of the ALU makes the cost of an embedded/external processor and interface for PR control too expensive; the frequent reconfiguration makes it harder to control and to guarantee the accuracy of operation; and the device-specific nature of PR makes it difficult to port the design to other vendors, or impossible to port to other technologies.

It is therefore desirable to remove the requirement for updating the configuration bit-stream. Hence, we propose a new ALU architecture that performs arbitrary-precision arithmetic operations according to the user's preference. This new paradigm for ALUs may enable users to avoid the unnecessary loss of performance and energy that comes from applying unnecessarily high precision in computation (e.g., programmers often employ double precision for scientific computations even when it is not needed). In small-scale applications, people often do not care much about the performance loss from overly high precision. However, as scientific applications become larger, the potential impact of this new ALU architecture on achievable accuracy and performance becomes greater.

What distinguishes this dynamic precision ALU research from previous multiple-precision work is the dynamic support of arbitrary precisions with arbitrary position arrangement. Although the proposed architecture is designed for floating-point ALUs, the main focus of this paper is the fixed-point datapath, which is the fundamental arithmetic engine of a floating-point ALU.

Previous work on ALU structures and on applications requiring multiple precisions is discussed in section II. Section III is dedicated to the approach and implementation for the proposed architecture. The implementation results are analyzed and a case study is conducted to demonstrate the potential of the proposed architecture in section IV, followed by conclusions in section V.

II. PREVIOUS WORK

A. ALU Designs

Computer engineers have never stopped trying to improve system performance by optimizing arithmetic units. In 1951, Booth proposed a signed binary recoding scheme [16] for multipliers to reduce the number of partial products, and this scheme was later improved by Wallace in [17]. Besides integer operations, floating-point arithmetic is also a hot topic for many researchers [15, 18-22].

With the emergence of reconfigurable computing, engineers started to look for practical solutions for multiple-precision computations. Constantinides and his colleagues broadly explored bit-width assignment and optimization for static multiple-precision applications [23-25]. Wang and Leeser spent years developing and refining a complete statically defined, variable-precision fixed- and floating-point ALU library for reconfigurable hardware [26]. Since re-synthesizing, re-downloading, and re-configuring are required for static multiple-precision ALUs whenever the precision is changed, this solution is not practical for applications that require frequently changing precision. Thus, multi-mode ALUs become more attractive. In [27], Tan proposed a 64-bit multiply-accumulator (MAC) that can compute one 64x64, two 32x32, four 16x16, or eight 8x8 unsigned/signed multiply-accumulations using shared segmentation. Akkas presented architectures for dual-mode floating-point adders and multipliers [28, 29], and Isseven presented a dual-mode floating-point divider [30] that supports two parallel double-precision divisions or one quadruple-precision division. In [31], Huang presents a three-mode (32-, 64-, and 128-bit mode) floating-point fused multiply-add (FMA) unit with SIMD support. All of the above multi-mode multiple-precision structures support only a few pre-defined precisions. To the best of our knowledge, our proposed architecture is the first true dynamic precision architecture targeting both fixed-point and floating-point ALUs.

B. Applications

Dynamic precision computations were investigated two to three decades ago, when computational scientists required numeric solutions far more accurate than the computing technology of the time readily provided [1, 7-9]. At that time, computations with high-precision arithmetic generally used software routines.

In [1], Muller's method is implemented using dynamic precision computations. The implementation monitors the magnitude of certain function values in order to recognize when the iterate is approaching the exact solution: when the function value becomes small, the precision is switched from low to high. The authors reported performance gains of 1.4 to 4 compared to static precision computation, utilizing two types of precision on the MasPar MP-1 and Cray Y-MP.

Binary Cascade Iterative Refinement (BCIR) was proposed in 1981 [9]. BCIR seeks optimized performance for solving linear systems by utilizing multiple precisions. BCIR faces limitations for practical use since the condition number of the matrix must be known before computation to decide an initial precision (i.e., the lowest precision among the multiple precisions). Another paper proposes BCIR employing doubled precision arithmetic per iteration
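To make the partial-product reduction of Booth recoding (section II.A) concrete, the following is a minimal software sketch of radix-4 Booth recoding. It is not taken from any of the cited hardware designs; the function names and the 16-bit default width are illustrative assumptions.

```python
def booth_radix4(y, bits=16):
    """Recode a signed multiplier y into radix-4 Booth digits in {-2,-1,0,1,2}.

    An n-bit multiplier yields only ceil(n/2) digits, i.e. ceil(n/2)
    partial products instead of n -- the reduction Booth recoding buys.
    """
    assert -(1 << (bits - 1)) <= y < (1 << (bits - 1))
    # Two's-complement bit pattern with an implicit 0 appended at bit -1.
    ext = (y & ((1 << bits) - 1)) << 1
    digits = []
    for i in range(0, bits, 2):
        # Overlapping triplet of original bits (y[i+1], y[i], y[i-1]).
        b_hi = (ext >> (i + 2)) & 1
        b_mid = (ext >> (i + 1)) & 1
        b_lo = (ext >> i) & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits


def booth_multiply(x, y, bits=16):
    """Multiply by summing the shifted partial products x * d_k * 4**k."""
    return sum(d * (x << (2 * k)) for k, d in enumerate(booth_radix4(y, bits)))
```

For a 16-bit multiplier this produces 8 digits, so a hardware multiplier built on this recoding sums half as many partial products as a naive shift-and-add array.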
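The precision-switching strategy of [1] and the refinement idea behind BCIR can both be sketched with the two IEEE precisions available in commodity software. The sketch below is illustrative only: it uses fixed float32/float64 in place of the arbitrary, dynamically switched precisions the proposed ALU would provide (BCIR itself roughly doubles the working precision each iteration), and all names are our own.

```python
import numpy as np


def mixed_precision_solve(A, b, tol=1e-12, max_iters=10):
    """Solve Ax = b via low-precision solves plus high-precision refinement.

    The expensive solves run in float32 (the "low-precision ALU"); only
    the residual b - Ax is evaluated in float64. Precision effectively
    increases as the iterate approaches the solution, as in [1] and BCIR.
    """
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b - A @ x                      # residual in high precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break                          # converged: stop refining
        d = np.linalg.solve(A32, r.astype(np.float32))  # cheap correction
        x += d.astype(np.float64)
    return x


# Usage on a well-conditioned system, where float32 alone reaches ~1e-6
# relative accuracy but a few refinement steps approach float64 accuracy.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100)) + 100.0 * np.eye(100)
x_true = rng.standard_normal(100)
b = A @ x_true
x = mixed_precision_solve(A, b)
```

The design point this illustrates is the one the paper exploits in its BCIR case study: the bulk of the arithmetic can run on small, fast, low-precision ALUs, with high precision needed only for a cheap residual evaluation.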