Automatic Generation of Lookup Table Optimizations

Automatic Generation of Lookup Table Optimizations

Mesa: Automatic Generation of Lookup Table Optimizations Chris Wilcox Michelle Mills Strout James M. Bieman Computer Science Dept. Computer Science Dept. Computer Science Dept. Colorado State University Colorado State University Colorado State University Fort Collins, Colorado 80523 Fort Collins, Colorado 80523 Fort Collins, Colorado 80523 [email protected] [email protected] [email protected] ABSTRACT Keywords Scientific programmers strive constantly to meet performance Lookup table, error analysis, performance optimization demands. Tuning is often done manually, despite the signif- icant development time and effort required. One example is lookup table (LUT) optimization, a technique that is gener- 1. INTRODUCTION ally applied by hand due to a lack of methodology and tools. The computational needs of scientific applications are al- LUT methods reduce execution time by replacing computa- ways growing, driven by increasingly complex models in the tions with memory accesses to precomputed tables of results. physical and biological sciences. Scientific programs often LUT optimizations improve performance when the memory require extensive tuning to perform well on multicore sys- access is faster than the original computation, and the level tems. Performance optimization can consume a major share of reuse is sufficient to amortize LUT initialization. Current of development time and effort [14], and software engineering practice requires programmers to inspect program source to practices are sometimes ignored in the rush to attain perfor- identify candidate expressions, then develop specific LUT mance [12]. Manual tuning, including parallelization, is inef- code for each optimization. Measurement of LUT accuracy ficient and can obfuscate application code, making it harder is usually ad hoc, and the interaction with multicore paral- to maintain and adapt [10]. One solution is automated per- lelization has not been explored. formance tuning, which improves on manual methods by re- In this paper we present Mesa, a standalone tool that im- ducing the programming effort and simplifying the code [7]. plements error analysis and code generation to improve the This paper examines a serial optimization that uses pre- process of LUT optimization. We evaluate Mesa on a multi- computed lookup tables (LUTs) to avoid computation. We core system using a molecular biology application and other study LUT optimizations with Mesa, a standalone tool that scientific expressions. Our LUT optimizations realize a per- we have developed. Mesa supports the application of LUT formance improvement of 5X for the application and up to optimizations to existing code, without the loss of abstrac- 45X for the expressions, while tightly controlling error. We tion caused by manual tuning. The primary goal of Mesa is also show that the serial optimization is just as effective on to decrease the cost of LUT optimization while giving the a parallel version of the application. Our research provides a scientific programmer control of the tradeoff between accu- methodology and tool for incorporating LUT optimizations racy and performance. Our secondary goal is to show that into existing scientific code. LUT methods can improve the performance of scientific ap- plications on multicore systems. This work was motivated by a research collaboration with Categories and Subject Descriptors the small angle X-ray scattering (SAXS) research project at D.3.3 [Language Constructs and Features]: Patterns; Colorado State University [1]. The SAXS software simulates D.3.4 [Processors]: Code Generation, Compilers, Opti- X-ray scattering of proteins using Debye's formula, shown in mization; G.1.2 [Approximation]: Special function ap- Equation (1). proximations N−1 N I(θ) = 2Σi=1 Σj=i+1Fi(θ)Fj (θ)sin(4πrθ)=(4πrθ) (1) For performance evaluation of the SAXS code, we use en- General Terms zymes from the Protein Data Bank (PDB), including the Performance, Languages 1xib molecule, which has 3052 atoms. We defer a detailed discussion of computational complexity to Section 4, but we have measured ≈ 4:7 × 109 evaluations of the formula to scatter the 1xib molecule. The SAXS scattering code there- Permission to make digital or hard copies of all or part of this work for fore has a significant amount of computation in which we personal or classroom use is granted without fee provided that copies are have observed a large amount fuzzy reuse [3]. The initial not made or distributed for profit or commercial advantage and that copies version of scattering code required more than 1.5 hours to bear this notice and the full citation on the first page. To copy otherwise, to scatter the 1xib molecule. Removal of redundant code and republish, to post on servers or to redistribute to lists, requires prior specific precomputation of scattering constants resulted in a 6X im- permission and/or a fee. ICSE ’11, May 21–28, 2011, Waikiki, Honolulu, HI, USA provement on a 32-bit system. Porting the code to a 64-bit Copyright 2011 ACM 978-1-4503-0577-8/11/05 ...$10.00. system yielded another 3X improvement. Table 1: SAXS scattering code improvements. Run Cumulative Delta Version Time Factor Factor Description 5365s 1X 1X Original code 872s 6X 6X Removed redundancy 279s 19X 3X 64-bit system and compilers 41s 131X 7X Manual table optimization 23s 233X 1.8X Parallel version, dual-core 13s 413X 1.8X Parallel version, quad-core allel version, that is the single core and multicore optimiza- tions are independent and complementary for this applica- tion. Table 1 shows the progression of performance improve- ments for the scattering code. Current trends in computing platforms include the broad (a) 1xib molecule, which contains 3052 atoms. availability of multicore hardware and a decrease in memory access performance relative to processor performance. As a 109 result, the focus in scientific computing has shifted to paral- Original Code Optimized Code lel execution and transformations that reduce the number of 8 memory accesses on multicore systems. However, single core 10 performance still remains important enough to justify con- tinued study of optimizations that improve performance by 107 eliminating redundant operations. Such optimizations must be evaluated to ensure that they remain effective when the 6 code is parallelized, as we have done in this paper. 10 Mesa makes the process of applying LUT optimizations to existing code less costly and more repeatable. Our contri- 105 butions are as follows: (1) we show that that LUT optimiza- Scattering Intensity (Absolute) tions can benefit scientific codes, (2) we present a tool for the partial automation of LUT optimization, including error 104 analysis, and (3) we demonstrate that LUT optimization is 0 0.05 0.1 0.15 0.2 applicable in a multicore environment. Scattering Angle (Radians) The remainder of this paper is organized as follows: Sec- (b) Original and optimized scattering curves. tion 2 provides background information on LUT optimiza- tion, Section 3 introduces the Mesa tool, Section 4 describes Figure 1: 1xib molecule and scattering curves. case studies performed with Mesa, Section 5 gives details on error analysis, Section 6 shows related work, and Section 7 presents our conclusions. The project goal was to scatter thousands of rotations of various molecules. This raised the performance require- ment, so we decided to investigate LUT methods. Table 2. LUT APPROXIMATION optimizations date back to the early days of computing [4], LUT optimizations represent continuous functions as sets and are commonly used to optimize function evaluation in of discrete values, with each LUT access returning a stored hardware. Programmers often use LUT methods, but the approximation of the original function. Increasing the num- technical literature is very limited with respect to a software ber of LUT entries improves accuracy, but reduces the level methodology. Our initial LUT optimization was manual, so of reuse that occurs when different inputs share the same we had to empirically determine the LUT size to meet the LUT entry. Decreasing reuse causes an increase in the penal- required level of accuracy and performance. After extensive ties associated with memory usage, notably cache misses. experimentation we were able to gain an additional 7X im- Thus LUT optimizations have an inherent tradeoff between provement. We achieved these numbers despite a modest accuracy and performance. In this section we define how to table size of 3.2MB and average error of < 0:0014 percent. characterize the error introduced by LUT optimization. Figure 1 shows the 1xib molecule and intensity curve. LUT approximations introduce error because memory lim- Both the original and optimized curves are plotted, but the itations make it impractical to match the floating-point ac- difference between them is negligible and impossible to see curacy of the processor. IEEE floating-point standards pro- visually. To improve the process we developed the Mesa tool vide accuracy of ±1:19 × 10−07 for single-precision (SP) and to perform error analysis and automatic generation of LUT ±2:22×10−16 for double-precision (DP). To match this pre- optimization code for a broad range of scientific expressions. cision for an input variable with a domain [0:0; 1:0] would Mesa, described in Section 3, allows us to characterize the require a 32MB table for a float, and a 16PB table for a dou- performance and accuracy of LUT optimizations. ble. The latter is clearly out of reach for modern computers, In addition, we parallelized the SAXS scattering loop with even for limited domains. OpenMP, showing a further 1:8X improvement on a dual- We define a LUT approximation as a function l(x) with core, and 3:6X on a quad-core system. We conclude that identical parameters to the original function f(x), but less the benefit of LUT optimization applies equally to the par- accuracy. The l(x) function performs a table lookup by di- viding the input value by LUT granularity and rounding it 5 Emax to an integer LUT index.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    8 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us