Efficient Function Evaluations with Lookup Tables for Structured Matrix Operations

Kanwaldeep Sobti, Lanping Deng, Chaitali Chakrabarti, Nikos Pitsianis and Xiaobai Sun

Abstract— A hardware-efficient approach is introduced for extensive function evaluations in certain structured matrix computations. It is a comprehensive approach that utilizes lookup tables for their relatively low utilization of hardware resources, employs interpolations with adders and multipliers for their adaptivity to non-tabulated values and, more distinctively, exploits the function properties and the matrix structures to claim better control over numerical dynamic ranges. This approach brings forth more opportunities and room for improvement in hardware efficiency without compromising latency and numerical accuracy. We demonstrate the effectiveness of the approach with simulation results on evaluating, in particular, the cosine function, the exponential function and the zero-order Bessel function of the first kind.

Keywords: computer arithmetic, elementary function evaluation, lookup-table-based methods, structured matrices, geometric tiling

I. INTRODUCTION

This paper introduces efficient techniques for the hardware implementation of accurate function evaluation via the integrated design of lookup tables and lookup schemes. The functions of interest include mathematical elementary and special functions used often in real-time digital signal processing and in large-scale simulation in scientific and engineering studies. In particular, we have used the techniques for efficiently evaluating trigonometric functions, square-root extraction, the logarithmic and exponential functions, and more complex functions such as the Bessel functions. Conventionally, most of these functions are evaluated in software, with no lookup table or with lookup tables of different sizes based on the available memory space. The overhead in both memory and latency associated with software implementation is often too high to meet high speed and high throughput requirements. In hardware implementation, there are also additional requirements on power and area consumption. The use of lookup tables is an important component in both the algorithmic and architectural domains. It is an effective, and sometimes necessary, scheme to utilize pre-computed tabulated data for function evaluation, especially at non-tabulated locations, with fewer operations and within a required accuracy. In hardware implementation, a lookup-table technique can use small area since memory is much denser than logic modules such as adders and multipliers [2].

The ultimate goal is to have accurate function evaluations with fast lookup schemes and small lookup tables. There is, however, a tangled relationship among latency, accuracy and memory size. For example, in a straightforward or direct table lookup scheme, a higher-accuracy evaluation requires a finer sampling for table entries and hence a larger table. While such a direct lookup scheme is generally believed to be fast, the growth in table size may be exponential in the number of required accurate digits, not only consuming large memory space but also encumbering the lookup mechanism and speed.

The techniques we introduce in this paper are developed to explore the latency-accuracy-resource design space for combined reduction or minimization in memory and latency subject to an accuracy constraint. Furthermore, to narrow the conventional 'tradeoff' among the three issues, we find 'payoff' by exploring and exploiting the structures in the application environments where the function evaluations are requested. Specifically, we consider basic matrix operations with a class of structured matrices that represent certain transforms in discrete form and appear commonly in signal and image processing, in data or information processing, and in many scientific and engineering simulations. We explore and exploit the geometric and analytic structures of the transformation kernel functions and the associated discrete data sets.

The rest of the paper is organized as follows. In Section II we provide preliminary approaches for exploring the design space and describe the structured matrices we are interested in. In Section III we describe our approach to exploiting the geometric structures of the kernel functions in matrix evaluations, which are to be followed by the discrete transforms, i.e., matrix-vector multiplications in the case of linear transforms. In Section IV we introduce a new hardware-efficient interpolation scheme to be used together with fast table lookup for function evaluation to higher accuracy adaptively. We then demonstrate in Section V the effectiveness of the introduced approaches. We conclude the paper in Section VI with a discussion of other related issues.

Manuscript received ??, 2006; revised ??, 2006. K. Sobti, L. Deng and C. Chakrabarti are with the Department of Electrical Engineering, Arizona State University. N. Pitsianis is with the Department of Electrical Engineering, Duke University, and X. Sun is with the Department of Computer Science, Duke University.

II. PRELIMINARIES

We first introduce certain pre-processing, interlacing calculation and post-processing approaches which exploit the mathematical properties of the functions to be evaluated. This helps to reduce the size of the lookup table. We then describe the structured matrices of our interest, for which efficient and accurate function evaluations over large data sets are required.

A. Reduction in lookup table size

In preprocessing, one may reduce the dynamic range of the input by variable folding or/and scaling. In folding, when applicable, the entire dynamic range of the input is folded onto a much smaller range. For trigonometric functions, for example, the folding can be based on trigonometric identities such as

\sin(2n\pi + u) = \sin(u), \quad \cos(2n\pi + u) = \cos(u); \quad \sin(2u) = 2\sin(u)\cos(u), \quad \cos(2u) = 2\cos^2(u) - 1. \qquad (1)

For the exponential function, we may use \alpha^x = \alpha^{x/2} \cdot \alpha^{x/2} to reduce the initial range by a factor of 2, at the expense of a shift in the input and a multiplication at the output. For some other functions such as power functions, one may use a multiplicative change in the input variable, i.e., variable scaling, to reduce the dynamic range,

x^\beta = c^\beta\,(x/c)^\beta. \qquad (2)

The power functions include in particular the division (\beta = -1) and the square root (\beta = 1/2). The folding and scaling can be performed in either hardware or software. Equation (2) already includes the scaling recovery in the post-processing, where c^\beta is pre-selected and pre-evaluated. Folding, scaling, their reverse operations, and the associated overhead shall all be taken into consideration in the design and development of table construction and lookup.
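For illustration, the folding and scaling rules above can be modeled in a few lines. The following Python sketch is a behavioral reference for the identities in (1) and (2), with helper names of our own choosing; it is not the hardware realization, where the fold and the recovery multiplication would be dedicated logic.

```python
import math

def fold_cosine(x):
    """Fold the argument of cos onto [0, 2*pi) via the periodicity
    identity cos(2*n*pi + u) = cos(u) from Equation (1); x >= 0 assumed."""
    return math.fmod(x, 2.0 * math.pi)

def exp_halved_range(alpha, x):
    """Halve the input range of alpha**x at the cost of one shift of the
    input (x/2) and one multiplication at the output, since
    alpha**x = alpha**(x/2) * alpha**(x/2)."""
    a = alpha ** (x / 2.0)   # in hardware, a would come from a table on the halved range
    return a * a

def power_scaled(x, beta, c, c_beta):
    """Variable scaling for a power function, Equation (2):
    x**beta = c**beta * (x/c)**beta, with c_beta = c**beta pre-evaluated."""
    return c_beta * (x / c) ** beta

# The folded argument indexes a cosine table a fraction of the original size.
u = fold_cosine(123.456)
assert abs(math.cos(u) - math.cos(123.456)) < 1e-9
```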

Between the preprocessing and postprocessing, the table lookups may be tightly coupled with interpolations to adaptively meet accuracy requirements without increasing the table size. A simple example is the approximation of a function f at a non-entry value x by a linear interpolation of two tabulated neighbor values f(x_1) and f(x_2) such that x_1 < x < x_2,

f(x) \approx f(x_1) + \frac{f(x_2) - f(x_1)}{x_2 - x_1}\,(x - x_1).

If higher accuracy is required, one may use a higher-order interpolation scheme.

Two particular interpolation schemes are considered in this paper. One requires pre-computed derivatives for every table entry, based on Taylor series expansion and truncation. The other uses finite divided differences, based on Newton's interpolation approach [4]. Mathematically, both schemes render a polynomial approximation of a specified degree. In error analysis, both schemes have similar error estimates. But they do differ in interpolation locality, in the requirement for pre-computed quantities, and in lookup table construction. They also represent a tradeoff between memory space and lookup latency in certain situations.

We give a brief review of the two interpolation schemes. According to Taylor's theorem [3], a function f at x can be approximated by a Taylor polynomial of degree n,

T_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}\,(x - x_0)^k, \qquad (3)

where x_0 indicates a table entry and the function is assumed smooth enough at x_0 to warrant the existence of the derivatives at x_0. The approximation error is described as follows,

R_n(x) = \frac{f^{(n+1)}(\xi(x))}{(n+1)!}\,(x - x_0)^{n+1}, \qquad (4)

where \xi(x) is between x and x_0. One determines from the error estimates the minimal approximation degree n, taking into consideration the sample spacing between table entries and the behavior of the higher-order derivatives. The formation of the Taylor polynomial in (3) requires tabulating not only f(x_0) but also the derivatives at x_0 up to the n-th order.

In Newton's interpolation scheme, only the tabulated function values, not the derivatives, are explicitly required. But more tabulated function values are used for the approximation of f at x, and the number of such reference points increases with the accuracy requirement. Assume that the function f(x) is to be evaluated at a point x and that the function has already been sampled at the points s_0, s_1, \ldots, s_k. Then f(x) is approximated by the following polynomial,

N_n(x) = f[s_0] + f[s_0, s_1](x - s_0) + f[s_0, s_1, s_2]\prod_{j=0}^{1}(x - s_j) + \cdots + f[s_0, s_1, \ldots, s_n]\prod_{j=0}^{n-1}(x - s_j). \qquad (5)

Similarly to the Taylor polynomial, the i-th term in the Newton polynomial is a polynomial of degree i. In place of the i-th derivative divided by i!, here we have the i-th order divided difference, which can be computed recursively from those of lower orders and all the way from the function values,

f[s_0] = f(s_0),
f[s_0, s_1] = \frac{f[s_1] - f[s_0]}{s_1 - s_0},
f[s_0, s_1, s_2] = \frac{f[s_1, s_2] - f[s_0, s_1]}{s_2 - s_0},
\ldots
f[s_0, \ldots, s_k] = \frac{f[s_1, \ldots, s_k] - f[s_0, \ldots, s_{k-1}]}{s_k - s_0}.

The approximation error in N_n may be described as follows,

E_n(x) = f[s_0, \ldots, s_n, x]\prod_{j=0}^{n}(x - s_j) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\prod_{j=0}^{n}(x - s_j), \qquad (6)

where \xi is between \min_{j=0:n} s_j and \max_{j=0:n} s_j.

The above interpolation schemes can be found in textbooks on numerical analysis, see [4] for instance. They represent the extreme cases in reference locality for any specified polynomial degree. In between are interpolation algorithms that use both derivatives and multiple reference points, such as Hermite interpolation.
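The divided-difference recursion and the nested evaluation of N_n(x) in (5) translate directly into code. The following Python sketch is a software reference model of the arithmetic, not a hardware description; the function names are ours.

```python
def divided_differences(s, fs):
    """Divided-difference coefficients f[s0], f[s0,s1], ..., f[s0,...,sn],
    computed in place from the tabulated values fs by the recursion above."""
    coef = list(fs)
    n = len(s)
    for order in range(1, n):
        # After this pass, coef[i] holds f[s_{i-order}, ..., s_i].
        for i in range(n - 1, order - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (s[i] - s[i - order])
    return coef   # coef[i] == f[s0, ..., si]

def newton_eval(s, coef, x):
    """Evaluate the Newton polynomial N_n(x) of Equation (5) in nested
    (Horner-like) form: n multiplications and 2n additions/subtractions
    for degree n, which fixes the adder/multiplier count in hardware."""
    result = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        result = result * (x - s[i]) + coef[i]
    return result

# Degree-3 Newton approximation of cos at a non-tabulated point.
import math
s = [0.0, 0.1, 0.2, 0.3]
coef = divided_differences(s, [math.cos(v) for v in s])
print(newton_eval(s, coef, 0.15), math.cos(0.15))
```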

B. Structured matrices: geometric structures

In many matrix operations, matrix element evaluations comprise the bulk of the computation. For example, the product of an n x n dense matrix with a vector takes 2n^2 arithmetic operations, assuming the matrix is numerically provided. However, the numerical evaluation of the matrix elements may be much more expensive. We are concerned with a particular class of matrices that result from the discretization of certain transforms fundamental to scientific and engineering studies. We describe in this section the matrix structures, and we discuss in later sections how to exploit these structures for efficient function evaluation in the implementation of certain discrete transforms.

A matrix in the class of our interest is typically a linear or linearized convolutional transform on two discrete, finitely bounded data sets. We may describe the matrices in more specific terms. Let \kappa(t, s) be the kernel function of a convolutional transform. Let T and S denote, respectively, the target set and source set of sample points at which the kernel function is sampled and evaluated, T, S \subset \mathbb{R}^3. The discrete transform becomes

g(t_i) = \sum_{s_j \in S} \kappa(t_i - s_j) \cdot f(s_j), \qquad t_i \in T. \qquad (7)

We give a few examples of transform kernel functions. For the gravitational interaction, the kernel function is essentially 1/||t - s||, i.e., the interaction between a particle located at t and a particle at s is reciprocally proportional to their Euclidean distance. The square-root extraction is therefore needed. This kernel function is used to describe Newtonian particle interactions from molecular dynamics to galaxy dynamics, as well as electrostatic interactions [6]. For the second example, the kernel function e^{ik||t-s||}/||t - s|| appears in 3D image processing based on computational electromagnetics. It requires the evaluation of the cosine and sine functions, in addition to the square-root extraction. This function may be transformed into the zero-order Bessel function of the first kind for certain 2D image processing problems [1]. These kernel functions are translation invariant and vanish at infinity, and they are smooth except that e^{ik||t-s||}/||t - s|| has a singular point at t = s.
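For concreteness, the three kernels can be written down directly. The Python sketch below assumes NumPy and SciPy purely for illustration, and the form shown for the 2D Bessel case, J_0 of k times the distance, is our reading of the reduction in [1], not a formula stated in this paper.

```python
import numpy as np
from scipy.special import j0   # zero-order Bessel function of the first kind

def kernel_gravity(t, s):
    """Gravitational / electrostatic kernel 1/||t - s||."""
    return 1.0 / np.linalg.norm(np.subtract(t, s))

def kernel_helmholtz(t, s, k):
    """Kernel exp(i*k*||t - s||)/||t - s|| from computational
    electromagnetics; needs cosine, sine and a square root."""
    r = np.linalg.norm(np.subtract(t, s))
    return complex(np.cos(k * r), np.sin(k * r)) / r

def kernel_bessel2d(t, s, k):
    """Illustrative 2D reduction using J0 (assumed form)."""
    return j0(k * np.linalg.norm(np.subtract(t, s)))
```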

The equations in (7) can be represented in an aggregated form, i.e., in matrix form,

g(T) = K(T, S) \cdot f(S),

where the vector f(S) is a given function defined on the source set S, the matrix K(T, S) is the aggregation of the point-point interactions over the target set and the source set, representing the cluster-cluster interaction, and the vector g(T) is the transformed function over the target set T. Notice that, at this point, the matrix rows and columns are indexed with the corresponding target and source locations t_i and s_j, respectively.

The number of function evaluations for a matrix formation depends on the sample distribution. Assume T contains m target points and S contains n source points. Then the matrix K(T, S) has m \cdot n elements. When both the target and source domains are in the form of Kronecker cubes and the sample points are on an equally spaced grid, the matrix has nested Toeplitz structures. In this case, at most (m + n) log(m + n) function evaluations are necessary for all the matrix elements. Many practical cases are not so ideal, i.e., the target or source domain or both are far from being Kronecker cubes and the samples are not equally spaced. Thus, the matrix formation may need up to m x n function evaluations. If the two data sets coincide, the corresponding matrix is symmetric, in which case the computation for function evaluations can be reduced by at most half. In this paper we introduce an effective and efficient approach to the function evaluations by exploiting the geometric structures of the kernel function and the sample distributions, which were conveniently ignored in both matrix structure abstraction and matrix evaluation.

III. INTEGRATED DESIGN

We introduce in this section and the next our approach to the integrated design of lookup tables and table lookup schemes for evaluating structured matrices. We describe in this section a new matrix partition or tiling scheme and its advantages.

A. Geometric tiling

The conventional naive tiling scheme is mainly for cache performance; no numerical concern is taken into consideration. Geometric tiling has the same effect in tuning cache performance. In addition, it also aims at reducing the dynamic range of numerical values per tile and across tiles, thereby reducing the bit-width without compromising numerical accuracy and saving power and computational resource consumption.

In geometric tiling, we partition a matrix according to the geometric locations of the sample data associated with the matrix. First we partition and cluster the samples based on a prescribed clustering rule. Imagine that all the particles, the sources and targets, are bounded in a box B. We may assume that one of the box's corners is at the origin, see Figure 1. Specifically, one may subdivide the box B into eight non-overlapping smaller boxes of equal size, B0, B1, B2, up to B7. Each and every particle falls into one of the smaller boxes. Denote by T_i and S_i the respective clusters of the target particles and source particles in box B_i. This particle clustering scheme induces a virtual partition of the interaction matrix into 8 x 8 tiles (sub-matrices). The (i, j) tile is K(T_i, S_j), the interaction matrix between clusters T_i and S_j. This tile is empty if one of the clusters is empty.

Fig. 1. Two sample clusters located in two sub-divided regions
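The clustering step just described can be modeled in a few lines. The sketch below, in plain Python with names of our own choosing, subdivides the bounding box into the eight octants B0, ..., B7 and induces the 8 x 8 virtual tile partition; it is a behavioral model of the prescribed clustering rule, not the clustering hardware.

```python
def octant_index(p, center):
    """Index (0..7) of the sub-box containing point p, obtained by halving
    the bounding box along each of the three dimensions at its center."""
    i = 0
    if p[0] >= center[0]: i += 1
    if p[1] >= center[1]: i += 2
    if p[2] >= center[2]: i += 4
    return i

def geometric_tiles(targets, sources, box_lo, box_hi):
    """Cluster targets and sources into the octants B0..B7 and return the
    non-empty (i, j) tiles; tile (i, j) stands for K(T_i, S_j)."""
    center = [(lo + hi) / 2.0 for lo, hi in zip(box_lo, box_hi)]
    T = {i: [] for i in range(8)}
    S = {i: [] for i in range(8)}
    for t in targets:
        T[octant_index(t, center)].append(t)
    for s in sources:
        S[octant_index(s, center)].append(s)
    return {(i, j): (T[i], S[j])
            for i in range(8) for j in range(8) if T[i] and S[j]}
```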


For the case illustrated in Figure 1, the 8 x 8 virtual partition renders a 2 x 2 partition of the interaction matrix, with the zero tiles omitted. Figure 2 shows two images of the same interaction matrix associated with the two-cluster datum set in Figure 1: the top image results from a plain tiling, oblivious to the geometric information, and the bottom image from a geometric tiling. The numerical ranges in the partitioned tiles are indicated by colors. Any geometric tile may be further partitioned with either geometric tiling or plain tiling, depending on the circumstances of the integrated design.

Fig. 2. Images of the interaction matrix associated with the two-cluster datum set in a plain tiling (top) and in a geometric tiling (bottom), in log scale

Geometric clustering incurs an overhead. Fortunately, this overhead is linearly proportional to the total number of particles, which is smaller than the total number of interaction entries by an order of magnitude. Moreover, this overhead is subsumed by the many advantages of geometric tiling over plain tiling, as described in the next section.

B. Potential advantages

The impact of geometric tiling on efficient table lookups may be summarized as follows.

1. Reduction of dynamic range at the input of function evaluation. Computing the pairwise difference (t - s) and distance ||t - s|| is common to the class of convolutional kernel functions we are interested in, as described in Section II-B. Within each geometric tile, the difference t - s between any source-target pair (t, s) in the same sub-box is less than, or at most equal to, half of the maximum particle-to-particle difference in the big box, in each and every component in magnitude as well as in the Euclidean length. This reduction in the input range can be applied to all the translation-invariant kernel functions, in addition to folding or scaling whenever applicable. By this fact, the three coordinates of the difference t - s and the distance ||t - s|| for each and every source-target pair in a geometric tile can be represented in hardware with one bit less, without compromising the numerical accuracy. Furthermore, one can refine the tiling to finer granularity, depending on the sample distribution and the kernel function, as well as the architectural implementation complexity.

2. Reduction of dynamic range at the output of function evaluation. The interaction functions considered here, such as the examples given in Section II-B, are smooth. Based on this property, a sufficiently refined geometric tiling can make the numerical range of the source-target interaction over a tile smaller than, or at most equal to, half of the range over the whole matrix. For the gravitational interaction with \kappa(t, s) = 1/||t - s||, in particular, a partition of the box B in the middle along each dimension is sufficient for this reduction in the numerical range of the output over each and every geometric tile.

3. Increase of locality in data range. A segment of a lookup table may be loaded and unloaded for each tile and reloaded for a tile of the same range. The segment may be loaded once for many tiles of the same range, based on the translation-invariant property of the interaction functions. Specifically, we organize geometric tiles at different elevation levels, according to their numerical ranges, see Figure 3. We traverse the tiles and accumulate submatrix-vector products by the elevation levels, in ascending order, so as to reduce the oscillation in numerical range across tiles.

Fig. 3. The numerical landscape of a larger matrix in geometric tiles

In the hardware aspect, the increased locality in dynamic range enables us to configure specialized structures for the tiles in a designated range. For example, we can vary the number of lookup table entries required for evaluating the cosine function based on the expected dynamic range of the inputs, without suffering from the high overhead of dynamic reconfiguration.
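Advantages 1 and 2 can be checked numerically tile by tile. The Python sketch below (a helper of our own, with the gravitational kernel assumed as the example) reports the input and output numerical ranges of one geometric tile; a tile range at most half of the global one corresponds to roughly one bit less of representation width. Ordering tiles by the reported output ranges also yields the elevation-level traversal of Figure 3.

```python
import math

def tile_ranges(tile_targets, tile_sources, kernel=lambda r: 1.0 / r):
    """Min/max pairwise distance ||t - s|| over one tile, and the min/max
    kernel output over the same pairs; used to size the tile's bit-width
    and to select the matching lookup-table segment."""
    dists = [math.dist(t, s) for t in tile_targets for s in tile_sources]
    vals = [kernel(r) for r in dists]
    return (min(dists), max(dists)), (min(vals), max(vals))
```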

C. Cost and constraints in LUT hardware

We now discuss how the geometric tiling scheme can be utilized to reduce hardware cost or meet certain hardware constraints.

Consider the situation where the memory for a LUT is very limited. In this case, the numerical range of the input parameters in successive cycles should be kept small. This can be translated into an architectural constraint on geometric tiling at the algorithmic level. Integrating with the geometric tiling scheme, we can segment the entire range of the input parameter into small ones that meet the spatial constraint as well as the numerical accuracy requirement.

In Table I we illustrate, with two simple segmentation options, the evaluation of the zero-order Bessel function J0 over a range of the input parameter ||t - s|| associated with the samples in the target set T and the source set S. The dynamic range of the input in each segment of Design A is twice that of Design B. In other words, the table size for Design A is twice that for Design B.

TABLE I
RANGE SEGMENTATION FOR FUNCTION J0

Design A: 5 segments
  Input ||t-s||    Output J0(||t-s||)
  [0, 2]           [ 0.22,  1.00]
  [2, 4]           [-0.40,  0.22]
  [4, 6]           [-0.40,  0.15]
  [6, 8]           [ 0.15,  0.30]
  [8, 10]          [-0.25,  0.17]

Design B: 10 segments
  Input ||t-s||    Output J0(||t-s||)
  [0, 1]           [ 0.77,  1.00]
  [1, 2]           [ 0.22,  0.77]
  [2, 3]           [-0.26,  0.22]
  [3, 4]           [-0.40, -0.26]
  [4, 5]           [-0.40, -0.18]
  [5, 6]           [-0.18,  0.15]
  [6, 7]           [ 0.15,  0.30]
  [7, 8]           [ 0.17,  0.30]
  [8, 9]           [-0.09,  0.17]
  [9, 10]          [-0.25, -0.09]

IV. HARDWARE-EFFICIENT INTERPOLATIONS

In this section, we introduce an approach to exploiting the data locality provided by geometric tiling for hardware-efficient interpolations in combination with table lookups. We illustrate the approach in particular with the Newton interpolation approach as described in Section II-A.

The use of the Newton interpolation method may be pictorially described with Figure 4. The sample or nodal points, denoted by s_p, are assumed equally spaced, although they need not be. The evaluation points at non-nodal locations are denoted by x_q. For a particular function and a specified evaluation accuracy, we determine, based on the error estimate in (6), the number of nodal points for interpolation at any non-nodal point x within the input range, which may be reduced as we have discussed. Suppose five nodal points give sufficient accuracy. Then the function evaluation at x1 needs the table entries associated with the nearby sample points s0, s1, s2, s3, s4, see the top of Figure 4. The evaluation at x2 needs the table entries at s2, s3, s4, s5, s6.

By utilizing the geometric tiling scheme, the input values for function evaluations are clustered. In other words, the entire dynamic range of the input is divided into smaller sub-regions, which may or may not overlap. Some of the sub-regions are shown by the bounding intervals D0, D1, ..., Dk at the bottom of Figure 4. For all the evaluation points within the same sub-interval, the same segment of table entries can be used. For example, to compute f(x) for all x in D1, it is sufficient to use only the table entries at s5, s6, s7, s8, s9.

The higher-order differences f[s0, s1, ..., sk], where k <= 4 by the above assumption, can be treated with two different options. They can be pre-computed to minimize the interpolation latency. Or, they can be re-computed whenever needed, without increasing the table size quadratically relative to the total number of sample points. There is thus a memory-latency tradeoff. We point out that this frequent re-computation and increased latency can be substantially reduced by function evaluation on clustered input data. Specifically, the higher-order differences are computed only once for every input cluster, then used and re-used for the function evaluation at all the input values in the cluster, as the sketch following Figure 4 illustrates.

Fig. 4. Top: interpolation using table entries at neighboring sample points; Bottom: clustering of input data and segmentation of the input range
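To make the per-cluster reuse concrete: for all evaluation points falling in one sub-interval D, the same five table entries and the same divided differences serve every point. The Python sketch below builds on the divided_differences/newton_eval reference model given in Section II-A; the segment-selection rule is our simplification of the scheme pictured in Figure 4, not the paper's hardware address logic.

```python
def eval_cluster(table_s, table_f, cluster_xs, degree=4):
    """Evaluate f at every point of one input cluster using a single
    segment of degree+1 consecutive table entries: the divided
    differences are computed once per cluster and re-used for all points."""
    n_nodes = degree + 1
    # Pick the segment of nodal points roughly centered on the cluster.
    lo = min(cluster_xs)
    start = max(0, sum(1 for v in table_s if v <= lo) - 1 - degree // 2)
    start = min(start, len(table_s) - n_nodes)
    s = table_s[start:start + n_nodes]
    coef = divided_differences(s, table_f[start:start + n_nodes])
    return [newton_eval(s, coef, x) for x in cluster_xs]
```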

V. SIMULATION RESULTS

We present in this section preliminary simulation results to demonstrate the reduction in table size and other computational resources achieved by integrating geometric tiling with the design and use of lookup tables for extensive function evaluations in certain structured matrix computations, as described in the previous sections. The reduction is shown by comparing the integrated scheme to the conventional scheme without geometric clustering. To have fair comparisons, we let both schemes meet the same prescribed accuracy requirement with minimal cost within each scheme.

We describe first the setup for the simulations. Lookup tables and interpolations are simulated in MATLAB with single-precision floating-point arithmetic. We experiment with two elementary functions, cos(x) and e^{-x}, and one special function, J0(x), the zero-order Bessel function of the first kind. The input folding and scaling schemes are applicable to the cosine function and the exponential function, respectively, but not to the Bessel function. The initial range of the input is specified as (0, 100) for the cosine and Bessel functions, and (0, 10) for the exponential function. The sampling points for the table entries are uniformly distributed.

We carry out the interpolation experiments with the two methods described in Section II-A. We alter the accuracy by varying the degree of the interpolation polynomial, see Equations (3) and (5). The errors in the interpolation are estimated analytically by (4) and (6), respectively, and further tested numerically. In the numerical testing, the accuracy of function evaluations is described by the worst deviation from the value computed by MATLAB in double-precision floating-point arithmetic, which we refer to as the maximum absolute error in the following tables.

Next we present and explain the simulation results in Tables II, III and IV. We use GT and PT to denote geometric tiling and plain tiling, respectively. Each individual configuration, labeled with either GT or PT with a subscript, is specified by the interpolation degree in a specified interpolation scheme, the size of the lookup table, the number of adders and the number of multipliers. The table size refers to the total number of coefficients in the terms in Equations (3) and (5). In hardware implementations such as field programmable gate arrays, these coefficients can be pre-calculated and stored in designated memory. Hardware resources also include the adders and multipliers used for function evaluation at non-tabulated points via interpolations. With unfolded hardware architectures, the numbers of adders and multipliers are proportional to the respective numbers of algebraic multiplications and additions in an interpolation scheme.

TABLE II
LOOKUP-TABLE IMPLEMENTATION OF J0(x)

  Function J0(x)   Geometric   Plain
  Configuration    GT1         PT1        PT2        PT3
  Taylor degree    3           3          9          9
  Table size       282         6792       864        288
  Multipliers      3           3          27         27
  Adders           9           9          36         36
  Max. abs. err.   2.10e-06    2.11e-06   2.54e-06   9.68e-04

  Function J0(x)   Geometric   Plain
  Configuration    GT2         PT4        PT5
  Newton degree    3           3          9
  Table size       400         3204       1010
  Multipliers      4           4          10
  Adders           11          11         28
  Max. abs. err.   3.57e-06    3.36e-06   5.13e-05

In Table II we show lookup-table implementations of the J0(x) function with the Taylor interpolation (top) and the Newton interpolation (bottom). By geometric tiling, the numerical range is partitioned into 50 segments of equal size. With the Taylor interpolation, the table size for configuration PT1 is 24 times that of GT1, with the same accuracy, as shown by the maximum absolute errors, and the same latency, which is characterized here by the same logic resource utilization in adders and multipliers. PT2 is an alternative configuration reaching the same accuracy, using a smaller table and higher-degree interpolation, at the cost of more adders and multipliers. PT3 uses the same table size as GT1, but it fails to reach the same accuracy even with more adders and multipliers. The results with the Newton interpolation, shown in the bottom table, tell a similar story. In the GT2 configuration, the entire numerical range was divided by geometric tiling into 16 equal-sized segments.

TABLE III
LOOKUP-TABLE IMPLEMENTATION OF cos(x)

  Function cos(x)  Geometric   Plain
  Configuration    GT3         PT6        PT7        PT8       PT9
  Taylor degree    3           3          3          9         9
  Input folding    yes         yes        no         no        no
  Table size       34          102        1602       36        68
  Multipliers      2           4          2          14        14
  Adders           9           9          7          15        15
  Max. abs. err.   4.06e-5     4.26e-5    4.26e-5    0.0517    1.04e-4

In Table III we present the simulation results for the evaluation of the cosine function. The Taylor interpolation is used only because of the simple relationship between the function and its derivatives. By geometric tiling, the numerical range is partitioned into 4 equal-sized segments. The comparison between GT3 and PT6 shows again the advantage of the lookup-table scheme integrated with geometric tiling. In addition, the scheme for input folding to [0, 2π] by Equation (1) is applicable to both tiling schemes. The folding effect can be observed in the comparison between PT6 and PT7, for instance. With PT8 and PT9 we show that the accuracy can easily be compromised even with higher-degree interpolation and more adders and multipliers.

TABLE IV
LOOKUP-TABLE IMPLEMENTATION OF e^{-x}

  Function e^{-x}  Geometric   Plain
  Configuration    GT4         PT10       PT11       PT12      PT13
  Taylor degree    3           3          3          9         3
  Input scaling    yes         yes        no         no        no
  Table size       9           9          81         9         9
  Multipliers      3           3          2          14        2
  Adders           7           8          6          14        6
  Max. abs. err.   4.00e-5     4.00e-5    4.00e-5    1.47e-5   0.0350

Finally, we present in Table IV the simulation results for evaluating the exponential function. In this case, PT10 does as well as GT4 because the input scaling has effectively put them into the same numerical range. However, the scaling coefficients have to be computed on a point-by-point basis for PT10, but only on a tile-by-tile basis for GT4.

The scaling effect is evident from the comparisons among PT10, PT11, PT12 and PT13.
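The numerical testing described above can be modeled as follows. This is a minimal Python sketch assuming NumPy, with MATLAB's double-precision reference stood in by Python floats; the helper name is ours.

```python
import numpy as np

def max_abs_error(f_ref, f_lut, xs):
    """Worst deviation of the single-precision lookup-table evaluation
    f_lut from the double-precision reference f_ref over test points xs:
    the 'maximum absolute error' reported in Tables II-IV."""
    return max(abs(float(f_lut(np.float32(x))) - f_ref(x)) for x in xs)
```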

VI. CONCLUSION

The hardware-efficient techniques introduced in this paper for function evaluations associated with certain structured matrix computations are based on the principal concept of breaking down or shifting the traditional boundaries and tradeoffs in algorithmic module abstraction, in architectural module abstraction, and between algorithms and architectures. Specifically, the design and use of lookup tables as an efficient function evaluation component are combined with and enhanced by algorithmic techniques such as input folding, scaling and especially geometric tiling, which is based on a new matrix abstraction that preserves the geometric information that is commonly available but was previously neglected. A broader realization of this concept is the algorithm-architecture codesign of special-purpose systems [5], which may include one or more functions to be evaluated.

The experiments presented in Section V are intended to demonstrate the promise and impact of these integration techniques. There are many more architectural choices than presented, such as hardware configurations with folded structures using fewer adders and multipliers. These may in turn change the pipeline structure and the latency. This variety in design options shows a great opportunity for performance enhancement and optimization in the architectural design space. As we have illustrated in this paper, it is not necessarily hard to integrate architectural design and algorithmic design, and the payoff from such integration is substantial.

REFERENCES

[1] W. C. Chew. Waves and Fields in Inhomogeneous Media. Van Nostrand Reinhold, New York, 1990.
[2] B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, 2000.
[3] J. W. Hauser and C. N. Purdy. Approximating functions for embedded and ASIC applications. In Proceedings of the 44th IEEE Midwest Symposium on Circuits and Systems, volume 1, pages 478-481, Aug 2001.
[4] F. B. Hildebrand. Introduction to Numerical Analysis. McGraw-Hill Book Company, 1956.
[5] J. Kim, P. Mangalagiri, M. Kandemir, V. Narayanan, L. Deng, K. Sobti, C. Chakrabarti, N. Pitsianis, and X. Sun. Algorithm-architecture codesign for structured matrix operations on reconfigurable systems. Technical Report CS-2006-11, Department of Computer Science, Duke University, 2006.
[6] B. Thidé. Electromagnetic Field Theory. Upsilon Books, 1997.