Efficient Function Evaluations with Lookup Tables for Structured Matrix Operations

Kanwaldeep Sobti, Lanping Deng, Chaitali Chakrabarti, Nikos Pitsianis and Xiaobai Sun

Abstract— A hardware-efficient approach is introduced for extensive function evaluations in certain structured matrix computations. It is a comprehensive approach that utilizes lookup tables for their relatively low utilization of hardware resources, employs interpolations with adders and multipliers for their adaptivity to non-tabulated values and, more distinctively, exploits the function properties and the matrix structures to claim better control over numerical dynamic ranges. This approach brings forth more opportunities and room for improvement in hardware efficiency without compromising latency and numerical accuracy. We demonstrate the effectiveness of the approach with simulation results on evaluating, in particular, the cosine function, the exponential function and the zero-order Bessel function of the first kind.

Keywords: computer arithmetic, elementary function evaluation, lookup-table-based methods, structured matrices, geometric tiling

I. INTRODUCTION

This paper introduces efficient techniques for the hardware implementation of accurate function evaluation via the integrated design of lookup tables and lookup schemes. The functions of interest include mathematical elementary and special functions used often in real-time digital signal processing and in large-scale simulation in scientific and engineering studies. In particular, we have used the techniques for efficiently evaluating trigonometric functions, square-root extraction, the logarithmic and exponential functions, and more complex functions such as the Bessel functions. Conventionally, most of these functions are evaluated in software, with no lookup table or with lookup tables of different sizes based on the available memory space. The overhead in both memory and latency associated with software implementation is often too high to meet high speed and high throughput requirements. In hardware implementation, there are also additional requirements on power and area consumption. The use of lookup tables is an important component in both the algorithmic and architectural domains. It is an effective, and sometimes necessary, scheme to utilize pre-computed tabulated data for function evaluation, especially at non-tabulated locations, with fewer operations and within a required accuracy. In hardware implementation, a lookup-table technique can use small area since memory is much denser than logic modules such as adders and multipliers [2].

The ultimate goal is to have accurate function evaluations with fast lookup schemes and small lookup tables. There is, however, a tangled relationship among latency, accuracy and memory size. For example, in a straightforward or direct table lookup scheme, a higher-accuracy evaluation requires a finer sampling for table entries and hence a larger table. While such a direct lookup scheme is generally believed to be fast, the growth in table size may be exponential in the number of required accurate digits, not only consuming large memory space but also encumbering the lookup mechanism and speed.

The techniques we introduce in this paper are developed to explore the latency-accuracy-resource design space for combined reduction or minimization in memory and latency subject to an accuracy constraint. Furthermore, to narrow the conventional 'tradeoff' among the three issues, we find 'payoff' by exploring and exploiting the structures in the application environments where the function evaluations are requested. Specifically, we consider basic matrix operations with a class of structured matrices that represent certain transforms in discrete form and appear commonly in signal and image processing, in data or information processing, and in many scientific and engineering simulations. We explore and exploit the geometric and analytic structures of the transformation kernel functions and the associated discrete data sets.

The rest of the paper is organized as follows. In Section II we provide preliminary approaches for exploring the design space and describe the structured matrices we are interested in. In Section III we describe our approach to exploiting the geometric structures of the kernel functions in matrix evaluations, which are to be followed by the discrete transforms, i.e., matrix-vector multiplications in the case of linear transforms. In Section IV we introduce a new hardware-efficient interpolation scheme to be used together with fast table lookup for function evaluation to higher accuracy adaptively. We then demonstrate in Section V the effectiveness of the introduced approaches. We conclude the paper in Section VI with a discussion of other related issues.

Manuscript received ??, 2006; revised ??, 2006. K. Sobti, L. Deng and C. Chakrabarti are with the Department of Electrical Engineering, Arizona State University. N. Pitsianis is with the Department of Electrical Engineering, Duke University, and X. Sun is with the Department of Computer Science, Duke University.

II. PRELIMINARIES

We first introduce certain pre-processing, interlacing calculation and post-processing approaches which exploit the mathematical properties of the functions to be evaluated. This helps to reduce the size of the lookup table. We then describe the structured matrices of our interest, for which efficient and accurate function evaluations over large data sets are required.

A. Reduction in lookup table size

In preprocessing, one may reduce the dynamic range of the input by variable folding or/and scaling. In folding, when applicable, the entire dynamic range of the input is folded onto a much smaller range. For trigonometric functions, for example, the folding can be based on trigonometric identities such as

\sin(2n\pi + u) = \sin(u), \quad \cos(2n\pi + u) = \cos(u); \quad \sin(2u) = 2\sin(u)\cos(u), \quad \cos(2u) = 2\cos^2(u) - 1. \qquad (1)

For the exponential function, we may use \alpha^x = \alpha^{x/2} \cdot \alpha^{x/2} to reduce the initial range by a factor of 2, at the expense of a shift in the input and a multiplication at the output. For some other functions such as power functions, one may use a multiplicative change in the input variable, i.e., variable scaling, to reduce the dynamic range,

x^\beta = c^\beta\,(x/c)^\beta. \qquad (2)

The power functions include in particular the division (\beta = -1) and the square root (\beta = 1/2). The folding and scaling can be performed in either hardware or software. Equation (2) already includes the scaling recovery in the post-processing, where c^\beta is pre-selected and pre-evaluated. Folding, scaling, their reverse operations, and the associated overhead shall all be taken into consideration in the design and development of table construction and lookup.
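For illustration, the folding and scaling rules above can be modeled in a few lines. The following Python sketch is a behavioral reference for the identities in (1) and (2), with helper names of our own choosing; it is not the hardware realization, where the fold and the recovery multiplication would be dedicated logic.

```python
import math

def fold_cosine(x):
    """Fold the argument of cos onto [0, 2*pi) via the periodicity
    identity cos(2*n*pi + u) = cos(u) from Equation (1); x >= 0 assumed."""
    return math.fmod(x, 2.0 * math.pi)

def exp_halved_range(alpha, x):
    """Halve the input range of alpha**x at the cost of one shift of the
    input (x/2) and one multiplication at the output, since
    alpha**x = alpha**(x/2) * alpha**(x/2)."""
    a = alpha ** (x / 2.0)   # in hardware, a would come from a table on the halved range
    return a * a

def power_scaled(x, beta, c, c_beta):
    """Variable scaling for a power function, Equation (2):
    x**beta = c**beta * (x/c)**beta, with c_beta = c**beta pre-evaluated."""
    return c_beta * (x / c) ** beta

# The folded argument indexes a cosine table a fraction of the original size.
u = fold_cosine(123.456)
assert abs(math.cos(u) - math.cos(123.456)) < 1e-9
```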

Between the preprocessing and postprocessing, the table lookups may be tightly coupled with interpolations to adaptively meet accuracy requirements without increasing the table size. A simple example is the approximation of a function f at a non-entry value x by a linear interpolation of two tabulated neighbor values f(x_1) and f(x_2) such that x_1 < x < x_2,

f(x) \approx f(x_1) + \frac{f(x_2) - f(x_1)}{x_2 - x_1}\,(x - x_1).

If higher accuracy is required, one may use a higher-order interpolation scheme.

Two particular interpolation schemes are considered in this paper. One requires pre-computed derivatives for every table entry, based on Taylor series expansion and truncation. The other uses finite divided differences, based on Newton's interpolation approach [4]. Mathematically, both schemes render a polynomial approximation of a specified degree. In error analysis, both schemes have similar error estimates. But they do differ in interpolation locality, in the requirement for pre-computed quantities, and in lookup table construction. They also represent a tradeoff between memory space and lookup latency in certain situations.

We give a brief review of the two interpolation schemes. According to Taylor's theorem [3], a function f at x can be approximated by a Taylor polynomial of degree n,

T_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}\,(x - x_0)^k, \qquad (3)

where x_0 indicates a table entry and the function is assumed smooth enough at x_0 to warrant the existence of the derivatives at x_0. The approximation error is described as follows,

R_n(x) = \frac{f^{(n+1)}(\xi(x))}{(n+1)!}\,(x - x_0)^{n+1}, \qquad (4)

where \xi(x) is between x and x_0. One determines from the error estimates the minimal approximation degree n, taking into consideration the sample spacing between table entries and the behavior of the higher-order derivatives. The formation of the Taylor polynomial in (3) requires tabulating not only f(x_0) but also the derivatives at x_0 up to the n-th order.

In Newton's interpolation scheme, only the tabulated function values, not the derivatives, are explicitly required. But more tabulated function values are used for the approximation of f at x, and the number of such reference points increases with the accuracy requirement. Assume that the function f(x) is to be evaluated at a point x and that the function has already been sampled at the points s_0, s_1, \ldots, s_k. Then f(x) is approximated by the following polynomial,

N_n(x) = f[s_0] + f[s_0, s_1](x - s_0) + f[s_0, s_1, s_2]\prod_{j=0}^{1}(x - s_j) + \cdots + f[s_0, s_1, \ldots, s_n]\prod_{j=0}^{n-1}(x - s_j). \qquad (5)

Similarly to the Taylor polynomial, the i-th term in the Newton polynomial is a polynomial of degree i. In place of the i-th derivative divided by i!, here we have the i-th order divided difference, which can be computed recursively from those of lower orders and all the way from the function values,

f[s_0] = f(s_0),
f[s_0, s_1] = \frac{f[s_1] - f[s_0]}{s_1 - s_0},
f[s_0, s_1, s_2] = \frac{f[s_1, s_2] - f[s_0, s_1]}{s_2 - s_0},
\ldots
f[s_0, \ldots, s_k] = \frac{f[s_1, \ldots, s_k] - f[s_0, \ldots, s_{k-1}]}{s_k - s_0}.

The approximation error in N_n may be described as follows,

E_n(x) = f[s_0, \ldots, s_n, x]\prod_{j=0}^{n}(x - s_j) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\prod_{j=0}^{n}(x - s_j), \qquad (6)

where \xi is between \min_{j=0:n} s_j and \max_{j=0:n} s_j.

The above interpolation schemes can be found in textbooks on numerical analysis, see [4] for instance. They represent the extreme cases in reference locality for any specified polynomial degree. In between are interpolation algorithms that use both derivatives and multiple reference points, such as Hermite interpolation.
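The divided-difference recursion and the nested evaluation of N_n(x) in (5) translate directly into code. The following Python sketch is a software reference model of the arithmetic, not a hardware description; the function names are ours.

```python
def divided_differences(s, fs):
    """Divided-difference coefficients f[s0], f[s0,s1], ..., f[s0,...,sn],
    computed in place from the tabulated values fs by the recursion above."""
    coef = list(fs)
    n = len(s)
    for order in range(1, n):
        # After this pass, coef[i] holds f[s_{i-order}, ..., s_i].
        for i in range(n - 1, order - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (s[i] - s[i - order])
    return coef   # coef[i] == f[s0, ..., si]

def newton_eval(s, coef, x):
    """Evaluate the Newton polynomial N_n(x) of Equation (5) in nested
    (Horner-like) form: n multiplications and 2n additions/subtractions
    for degree n, which fixes the adder/multiplier count in hardware."""
    result = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        result = result * (x - s[i]) + coef[i]
    return result

# Degree-3 Newton approximation of cos at a non-tabulated point.
import math
s = [0.0, 0.1, 0.2, 0.3]
coef = divided_differences(s, [math.cos(v) for v in s])
print(newton_eval(s, coef, 0.15), math.cos(0.15))
```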

B. Structured matrices: geometric structures

In many matrix operations, matrix element evaluations comprise the bulk of the computation. For example, the product of an n x n dense matrix with a vector takes 2n^2 arithmetic operations, assuming the matrix is numerically provided. However, the numerical evaluation of the matrix elements may be much more expensive. We are concerned with a particular class of matrices that result from the discretization of certain transforms fundamental to scientific and engineering studies. We describe in this section the matrix structures, and we discuss in later sections how to exploit these structures for efficient function evaluation in the implementation of certain discrete transforms.

A matrix in the class of our interest is typically a linear or linearized convolutional transform on two discrete, finitely bounded data sets. We may describe the matrices in more specific terms. Let \kappa(t, s) be the kernel function of a convolutional transform. Let T and S denote, respectively, the target set and source set of sample points at which the kernel function is sampled and evaluated, T, S \subset \mathbb{R}^3. The discrete transform becomes

g(t_i) = \sum_{s_j \in S} \kappa(t_i - s_j) \cdot f(s_j), \qquad t_i \in T. \qquad (7)

We give a few examples of transform kernel functions. For the gravitational interaction, the kernel function is essentially 1/||t - s||, i.e., the interaction between a particle located at t and a particle at s is reciprocally proportional to their Euclidean distance. The square-root extraction is therefore needed. This kernel function is used to describe Newtonian particle interactions from molecular dynamics to galaxy dynamics, as well as electrostatic interactions [6]. For the second example, the kernel function e^{ik||t-s||}/||t - s|| appears in 3D image processing based on computational electromagnetics. It requires the evaluation of the cosine and sine functions, in addition to the square-root extraction. This function may be transformed into the zero-order Bessel function of the first kind for certain 2D image processing problems [1]. These kernel functions are translation invariant and vanish at infinity, and they are smooth except that e^{ik||t-s||}/||t - s|| has a singular point at t = s.
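For concreteness, the three kernels can be written down directly. The Python sketch below assumes NumPy and SciPy purely for illustration, and the form shown for the 2D Bessel case, J_0 of k times the distance, is our reading of the reduction in [1], not a formula stated in this paper.

```python
import numpy as np
from scipy.special import j0   # zero-order Bessel function of the first kind

def kernel_gravity(t, s):
    """Gravitational / electrostatic kernel 1/||t - s||."""
    return 1.0 / np.linalg.norm(np.subtract(t, s))

def kernel_helmholtz(t, s, k):
    """Kernel exp(i*k*||t - s||)/||t - s|| from computational
    electromagnetics; needs cosine, sine and a square root."""
    r = np.linalg.norm(np.subtract(t, s))
    return complex(np.cos(k * r), np.sin(k * r)) / r

def kernel_bessel2d(t, s, k):
    """Illustrative 2D reduction using J0 (assumed form)."""
    return j0(k * np.linalg.norm(np.subtract(t, s)))
```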

The equations in (7) can be represented in an aggregated form, i.e., in matrix form,

g(T) = K(T, S) \cdot f(S),

where the vector f(S) is a given function defined on the source set S, the matrix K(T, S) is the aggregation of the point-point interactions over the target set and the source set, representing the cluster-cluster interaction, and the vector g(T) is the transformed function over the target set T. Notice that, at this point, the matrix rows and columns are indexed with the corresponding target and source locations t_i and s_j, respectively.

The number of function evaluations for a matrix formation depends on the sample distribution. Assume T contains m target points and S contains n source points. Then the matrix K(T, S) has m \cdot n elements. When both the target and source domains are in the form of Kronecker cubes and the sample points are on an equally spaced grid, the matrix has nested Toeplitz structures. In this case, at most (m + n) log(m + n) function evaluations are necessary for all the matrix elements. Many practical cases are not so ideal, i.e., the target or source domain or both are far from being Kronecker cubes and the samples are not equally spaced. Thus, the matrix formation may need up to m x n function evaluations. If the two data sets coincide, the corresponding matrix is symmetric, in which case the computation for function evaluations can be reduced by at most half. In this paper we introduce an effective and efficient approach to the function evaluations by exploiting the geometric structures of the kernel function and the sample distributions, which were conveniently ignored in both matrix structure abstraction and matrix evaluation.

III. INTEGRATED DESIGN

We introduce in this section and the next our approach to the integrated design of lookup tables and table lookup schemes for evaluating structured matrices. We describe in this section a new matrix partition or tiling scheme and its advantages.

A. Geometric tiling

The conventional naive tiling scheme is mainly for cache performance; no numerical concern is taken into consideration. Geometric tiling has the same effect in tuning cache performance. In addition, it also aims at reducing the dynamic range of numerical values per tile and across tiles, thereby reducing the bit-width without compromising numerical accuracy and saving power and computational resource consumption.

In geometric tiling, we partition a matrix according to the geometric locations of the sample data associated with the matrix. First we partition and cluster the samples based on a prescribed clustering rule. Imagine that all the particles, the sources and targets, are bounded in a box B. We may assume that one of the box's corners is at the origin, see Figure 1. Specifically, one may subdivide the box B into eight non-overlapping smaller boxes of equal size, B0, B1, B2, up to B7. Each and every particle falls into one of the smaller boxes. Denote by T_i and S_i the respective clusters of the target particles and source particles in box B_i. This particle clustering scheme induces a virtual partition of the interaction matrix into 8 x 8 tiles (sub-matrices). The (i, j) tile is K(T_i, S_j), the interaction matrix between clusters T_i and S_j. This tile is empty if one of the clusters is empty.

Fig. 1. Two sample clusters located in two sub-divided regions
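The clustering step just described can be modeled in a few lines. The sketch below, in plain Python with names of our own choosing, subdivides the bounding box into the eight octants B0, ..., B7 and induces the 8 x 8 virtual tile partition; it is a behavioral model of the prescribed clustering rule, not the clustering hardware.

```python
def octant_index(p, center):
    """Index (0..7) of the sub-box containing point p, obtained by halving
    the bounding box along each of the three dimensions at its center."""
    i = 0
    if p[0] >= center[0]: i += 1
    if p[1] >= center[1]: i += 2
    if p[2] >= center[2]: i += 4
    return i

def geometric_tiles(targets, sources, box_lo, box_hi):
    """Cluster targets and sources into the octants B0..B7 and return the
    non-empty (i, j) tiles; tile (i, j) stands for K(T_i, S_j)."""
    center = [(lo + hi) / 2.0 for lo, hi in zip(box_lo, box_hi)]
    T = {i: [] for i in range(8)}
    S = {i: [] for i in range(8)}
    for t in targets:
        T[octant_index(t, center)].append(t)
    for s in sources:
        S[octant_index(s, center)].append(s)
    return {(i, j): (T[i], S[j])
            for i in range(8) for j in range(8) if T[i] and S[j]}
```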


For the case illustrated in Figure 1, the 8 x 8 virtual partition renders a 2 x 2 partition of the interaction matrix, with the zero tiles omitted. Figure 2 shows two images of the same interaction matrix associated with the two-cluster datum set in Figure 1: the top image results from a plain tiling, oblivious to the geometric information, and the bottom image from a geometric tiling. The numerical ranges in the partitioned tiles are indicated by colors. Any geometric tile may be further partitioned with either geometric tiling or plain tiling, depending on the circumstances of the integrated design.

Fig. 2. Images of the interaction matrix associated with the two-cluster datum set in a plain tiling (top) and in a geometric tiling (bottom), in log scale

Geometric clustering incurs an overhead. Fortunately, this overhead is linearly proportional to the total number of particles, which is smaller than the total number of interaction entries by an order of magnitude. Moreover, this overhead is subsumed by the many advantages of geometric tiling over plain tiling, as described in the next section.

B. Potential advantages

The impact of geometric tiling on efficient table lookups may be summarized as follows.

1. Reduction of dynamic range at the input of function evaluation. Computing the pairwise difference (t - s) and distance ||t - s|| is common to the class of convolutional kernel functions we are interested in, as described in Section II-B. Within each geometric tile, the difference t - s between any source-target pair (t, s) in the same sub-box is less than, or at most equal to, half of the maximum particle-to-particle difference in the big box, in each and every component in magnitude as well as in the Euclidean length. This reduction in the input range can be applied to all the translation-invariant kernel functions, in addition to folding or scaling whenever applicable. By this fact, the three coordinates of the difference t - s and the distance ||t - s|| for each and every source-target pair in a geometric tile can be represented in hardware with one bit less, without compromising the numerical accuracy. Furthermore, one can refine the tiling to finer granularity, depending on the sample distribution and the kernel function, as well as the architectural implementation complexity.

2. Reduction of dynamic range at the output of function evaluation. The interaction functions considered here, such as the examples given in Section II-B, are smooth. Based on this property, a sufficiently refined geometric tiling can make the numerical range of the source-target interaction over a tile smaller than, or at most equal to, half of the range over the whole matrix. For the gravitational interaction with \kappa(t, s) = 1/||t - s||, in particular, a partition of the box B in the middle along each dimension is sufficient for this reduction in the numerical range of the output over each and every geometric tile.

3. Increase of locality in data range. A segment of a lookup table may be loaded and unloaded for each tile and reloaded for a tile of the same range. The segment may be loaded once for many tiles of the same range, based on the translation-invariant property of the interaction functions. Specifically, we organize geometric tiles at different elevation levels, according to their numerical ranges, see Figure 3. We traverse the tiles and accumulate submatrix-vector products by the elevation levels, in ascending order, so as to reduce the oscillation in numerical range across tiles.

Fig. 3. The numerical landscape of a larger matrix in geometric tiles

In the hardware aspect, the increased locality in dynamic range enables us to configure specialized structures for the tiles in a designated range. For example, we can vary the number of lookup table entries required for evaluating the cosine function based on the expected dynamic range of the inputs, without suffering from the high overhead of dynamic reconfiguration.
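Advantages 1 and 2 can be checked numerically tile by tile. The Python sketch below (a helper of our own, with the gravitational kernel assumed as the example) reports the input and output numerical ranges of one geometric tile; a tile range at most half of the global one corresponds to roughly one bit less of representation width. Ordering tiles by the reported output ranges also yields the elevation-level traversal of Figure 3.

```python
import math

def tile_ranges(tile_targets, tile_sources, kernel=lambda r: 1.0 / r):
    """Min/max pairwise distance ||t - s|| over one tile, and the min/max
    kernel output over the same pairs; used to size the tile's bit-width
    and to select the matching lookup-table segment."""
    dists = [math.dist(t, s) for t in tile_targets for s in tile_sources]
    vals = [kernel(r) for r in dists]
    return (min(dists), max(dists)), (min(vals), max(vals))
```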

C. Cost and constraints in LUT hardware

We now discuss how the geometric tiling scheme can be utilized to reduce hardware cost or meet certain hardware constraints.

Consider the situation where the memory for a LUT is very limited. In this case, the numerical range of the input parameters in successive cycles should be kept small. This can be translated into an architectural constraint on geometric tiling at the algorithmic level. Integrating with the geometric tiling scheme, we can segment the entire range of the input parameter into small ones that meet the spatial constraint as well as the numerical accuracy requirement.

In Table I we illustrate, with two simple segmentation options, the evaluation of the zero-order Bessel function J0 over a range of the input parameter ||t - s|| associated with the samples in the target set T and the source set S. The dynamic range of the input in each segment of Design A is twice that of Design B. In other words, the table size for Design A is twice that for Design B.

TABLE I
RANGE SEGMENTATION FOR FUNCTION J0

Design A: 5 segments
  Input ||t-s||    Output J0(||t-s||)
  [0, 2]           [ 0.22,  1.00]
  [2, 4]           [-0.40,  0.22]
  [4, 6]           [-0.40,  0.15]
  [6, 8]           [ 0.15,  0.30]
  [8, 10]          [-0.25,  0.17]

Design B: 10 segments
  Input ||t-s||    Output J0(||t-s||)
  [0, 1]           [ 0.77,  1.00]
  [1, 2]           [ 0.22,  0.77]
  [2, 3]           [-0.26,  0.22]
  [3, 4]           [-0.40, -0.26]
  [4, 5]           [-0.40, -0.18]
  [5, 6]           [-0.18,  0.15]
  [6, 7]           [ 0.15,  0.30]
  [7, 8]           [ 0.17,  0.30]
  [8, 9]           [-0.09,  0.17]
  [9, 10]          [-0.25, -0.09]

IV. HARDWARE-EFFICIENT INTERPOLATIONS

In this section, we introduce an approach to exploiting the data locality provided by geometric tiling for hardware-efficient interpolations in combination with table lookups. We illustrate the approach in particular with the Newton interpolation approach as described in Section II-A.

The use of the Newton interpolation method may be pictorially described with Figure 4. The sample or nodal points, denoted by s_p, are assumed equally spaced, although they need not be. The evaluation points at non-nodal locations are denoted by x_q. For a particular function and a specified evaluation accuracy, we determine, based on the error estimate in (6), the number of nodal points for interpolation at any non-nodal point x within the input range, which may be reduced as we have discussed. Suppose five nodal points give sufficient accuracy. Then the function evaluation at x1 needs the table entries associated with the nearby sample points s0, s1, s2, s3, s4, see the top of Figure 4. The evaluation at x2 needs the table entries at s2, s3, s4, s5, s6.

By utilizing the geometric tiling scheme, the input values for function evaluations are clustered. In other words, the entire dynamic range of the input is divided into smaller sub-regions, which may or may not overlap. Some of the sub-regions are shown by the bounding intervals D0, D1, ..., Dk at the bottom of Figure 4. For all the evaluation points within the same sub-interval, the same segment of table entries can be used. For example, to compute f(x) for all x in D1, it is sufficient to use only the table entries at s5, s6, s7, s8, s9.

The higher-order differences f[s0, s1, ..., sk], where k <= 4 by the above assumption, can be treated with two different options. They can be pre-computed to minimize the interpolation latency. Or, they can be re-computed whenever needed, without increasing the table size quadratically relative to the total number of sample points. There is thus a memory-latency tradeoff. We point out that this frequent re-computation and increased latency can be substantially reduced by function evaluation on clustered input data. Specifically, the higher-order differences are computed only once for every input cluster, then used and re-used for the function evaluation at all the input values in the cluster, as the sketch following Figure 4 illustrates.

Fig. 4. Top: interpolation using table entries at neighboring sample points; Bottom: clustering of input data and segmentation of the input range
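To make the per-cluster reuse concrete: for all evaluation points falling in one sub-interval D, the same five table entries and the same divided differences serve every point. The Python sketch below builds on the divided_differences/newton_eval reference model given in Section II-A; the segment-selection rule is our simplification of the scheme pictured in Figure 4, not the paper's hardware address logic.

```python
def eval_cluster(table_s, table_f, cluster_xs, degree=4):
    """Evaluate f at every point of one input cluster using a single
    segment of degree+1 consecutive table entries: the divided
    differences are computed once per cluster and re-used for all points."""
    n_nodes = degree + 1
    # Pick the segment of nodal points roughly centered on the cluster.
    lo = min(cluster_xs)
    start = max(0, sum(1 for v in table_s if v <= lo) - 1 - degree // 2)
    start = min(start, len(table_s) - n_nodes)
    s = table_s[start:start + n_nodes]
    coef = divided_differences(s, table_f[start:start + n_nodes])
    return [newton_eval(s, coef, x) for x in cluster_xs]
```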

V. SIMULATION RESULTS

We present in this section preliminary simulation results to demonstrate the reduction in table size and other computational resources achieved by integrating geometric tiling with the design and use of lookup tables for extensive function evaluations in certain structured matrix computations, as described in the previous sections. The reduction is shown by comparing the integrated scheme to the conventional scheme without geometric clustering. To have fair comparisons, we let both schemes meet the same prescribed accuracy requirement with minimal cost within each scheme.

We describe first the setup for the simulations. Lookup tables and interpolations are simulated in MATLAB with single-precision floating-point arithmetic. We experiment with two elementary functions, cos(x) and e^{-x}, and one special function, J0(x), the zero-order Bessel function of the first kind. The input folding and scaling schemes are applicable to the cosine function and the exponential function, respectively, but not to the Bessel function. The initial range of the input is specified as (0, 100) for the cosine and Bessel functions, and (0, 10) for the exponential function. The sampling points for the table entries are uniformly distributed.

We carry out the interpolation experiments with the two methods described in Section II-A. We alter the accuracy by varying the degree of the interpolation polynomial, see Equations (3) and (5). The errors in the interpolation are estimated analytically by (4) and (6), respectively, and further tested numerically. In the numerical testing, the accuracy of function evaluations is described by the worst deviation from the value computed by MATLAB in double-precision floating-point arithmetic, which we refer to as the maximum absolute error in the following tables.

Next we present and explain the simulation results in Tables II, III and IV. We use GT and PT to denote geometric tiling and plain tiling, respectively. Each individual configuration, labeled with either GT or PT with a subscript, is specified by the interpolation degree in a specified interpolation scheme, the size of the lookup table, the number of adders and the number of multipliers. The table size refers to the total number of coefficients in the terms in Equations (3) and (5). In hardware implementations such as field programmable gate arrays, these coefficients can be pre-calculated and stored in designated memory. Hardware resources also include the adders and multipliers used for function evaluation at non-tabulated points via interpolations. With unfolded hardware architectures, the numbers of adders and multipliers are proportional to the respective numbers of algebraic multiplications and additions in an interpolation scheme.

TABLE II
LOOKUP-TABLE IMPLEMENTATION OF J0(x)

  Function J0(x)   Geometric   Plain
  Configuration    GT1         PT1        PT2        PT3
  Taylor degree    3           3          9          9
  Table size       282         6792       864        288
  Multipliers      3           3          27         27
  Adders           9           9          36         36
  Max. abs. err.   2.10e-06    2.11e-06   2.54e-06   9.68e-04

  Function J0(x)   Geometric   Plain
  Configuration    GT2         PT4        PT5
  Newton degree    3           3          9
  Table size       400         3204       1010
  Multipliers      4           4          10
  Adders           11          11         28
  Max. abs. err.   3.57e-06    3.36e-06   5.13e-05

In Table II we show lookup-table implementations of the J0(x) function with the Taylor interpolation (top) and the Newton interpolation (bottom). By geometric tiling, the numerical range is partitioned into 50 segments of equal size. With the Taylor interpolation, the table size for configuration PT1 is 24 times that of GT1, with the same accuracy, as shown by the maximum absolute errors, and the same latency, which is characterized here by the same logic resource utilization in adders and multipliers. PT2 is an alternative configuration reaching the same accuracy, using a smaller table and higher-degree interpolation, at the cost of more adders and multipliers. PT3 uses the same table size as GT1, but it fails to reach the same accuracy even with more adders and multipliers. The results with the Newton interpolation, shown in the bottom table, tell a similar story. In the GT2 configuration, the entire numerical range was divided by geometric tiling into 16 equal-sized segments.

TABLE III
LOOKUP-TABLE IMPLEMENTATION OF cos(x)

  Function cos(x)  Geometric   Plain
  Configuration    GT3         PT6        PT7        PT8       PT9
  Taylor degree    3           3          3          9         9
  Input folding    yes         yes        no         no        no
  Table size       34          102        1602       36        68
  Multipliers      2           4          2          14        14
  Adders           9           9          7          15        15
  Max. abs. err.   4.06e-5     4.26e-5    4.26e-5    0.0517    1.04e-4

In Table III we present the simulation results for the evaluation of the cosine function. The Taylor interpolation is used only because of the simple relationship between the function and its derivatives. By geometric tiling, the numerical range is partitioned into 4 equal-sized segments. The comparison between GT3 and PT6 shows again the advantage of the lookup-table scheme integrated with geometric tiling. In addition, the scheme for input folding to [0, 2π] by Equation (1) is applicable to both tiling schemes. The folding effect can be observed in the comparison between PT6 and PT7, for instance. With PT8 and PT9 we show that the accuracy can easily be compromised even with higher-degree interpolation and more adders and multipliers.

TABLE IV
LOOKUP-TABLE IMPLEMENTATION OF e^{-x}

  Function e^{-x}  Geometric   Plain
  Configuration    GT4         PT10       PT11       PT12      PT13
  Taylor degree    3           3          3          9         3
  Input scaling    yes         yes        no         no        no
  Table size       9           9          81         9         9
  Multipliers      3           3          2          14        2
  Adders           7           8          6          14        6
  Max. abs. err.   4.00e-5     4.00e-5    4.00e-5    1.47e-5   0.0350

Finally, we present in Table IV the simulation results for evaluating the exponential function. In this case, PT10 does as well as GT4 because the input scaling has effectively put them into the same numerical range. However, the scaling coefficients have to be computed on a point-by-point basis for PT10, but only on a tile-by-tile basis for GT4.

The scaling effect is evident from the comparisons among PT10, PT11, PT12 and PT13.
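The numerical testing described above can be modeled as follows. This is a minimal Python sketch assuming NumPy, with MATLAB's double-precision reference stood in by Python floats; the helper name is ours.

```python
import numpy as np

def max_abs_error(f_ref, f_lut, xs):
    """Worst deviation of the single-precision lookup-table evaluation
    f_lut from the double-precision reference f_ref over test points xs:
    the 'maximum absolute error' reported in Tables II-IV."""
    return max(abs(float(f_lut(np.float32(x))) - f_ref(x)) for x in xs)
```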

VI. CONCLUSION

The hardware-efficient techniques introduced in this paper for function evaluations associated with certain structured matrix computations are based on the principal concept of breaking down or shifting the traditional boundaries and tradeoffs in algorithmic module abstraction, in architectural module abstraction, and between algorithms and architectures. Specifically, the design and use of lookup tables as an efficient function evaluation component are combined with and enhanced by algorithmic techniques such as input folding, scaling and especially geometric tiling, which is based on a new matrix abstraction that preserves the geometric information that is commonly available but was previously neglected. A broader realization of this concept is the algorithm-architecture codesign of special-purpose systems [5], which may include one or more functions to be evaluated.

The experiments presented in Section V are intended to demonstrate the promise and impact of these integration techniques. There are many more architectural choices than presented, such as hardware configurations with folded structures using fewer adders and multipliers. These may in turn change the pipeline structure and the latency. This variety in design options shows a great opportunity for performance enhancement and optimization in the architectural design space. As we have illustrated in this paper, it is not necessarily hard to integrate architectural design and algorithmic design, and the payoff from such integration is substantial.

REFERENCES

[1] W. C. Chew. Waves and Fields in Inhomogeneous Media. Van Nostrand Reinhold, New York, 1990.
[2] B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, 2000.
[3] J. W. Hauser and C. N. Purdy. Approximating functions for embedded and ASIC applications. In Proceedings of the 44th IEEE Midwest Symposium on Circuits and Systems, volume 1, pages 478-481, Aug 2001.
[4] F. B. Hildebrand. Introduction to Numerical Analysis. McGraw-Hill Book Company, 1956.
[5] J. Kim, P. Mangalagiri, M. Kandemir, V. Narayanan, L. Deng, K. Sobti, C. Chakrabarti, N. Pitsianis, and X. Sun. Algorithm-architecture codesign for structured matrix operations on reconfigurable systems. Technical Report CS-2006-11, Department of Computer Science, Duke University, 2006.
[6] B. Thidé. Electromagnetic Field Theory. Upsilon Books, 1997.