
IV.21 Numerical Analysis
Lloyd N. Trefethen

1 The Need for Numerical Computation

Everyone knows that when scientists and engineers need numerical answers to mathematical problems, they turn to computers. Nevertheless, there is a widespread misconception about this process.

The power of numbers has been extraordinary. It is often noted that the scientific revolution was set in motion when Galileo and others made it a principle that everything must be measured. Numerical measurements led to physical laws expressed mathematically, and, in the remarkable cycle whose fruits are all around us, finer measurements led to refined laws, which in turn led to better technology and still finer measurements. The day has long since passed when an advance in the physical sciences could be achieved, or a significant engineering product developed, without numerical mathematics.

Computers certainly play a part in this story, yet there is a misunderstanding about what their role is. Many people imagine that scientists and mathematicians generate formulas, and then, by inserting numbers into these formulas, computers grind out the necessary results. The reality is nothing like this. What really goes on is a far more interesting process of execution of algorithms. In most cases the job could not be done even in principle by formulas, for most mathematical problems cannot be solved by a finite sequence of elementary operations. What happens instead is that fast algorithms quickly converge to “approximate” answers that are accurate to three or ten digits of precision, or a hundred. For a scientific or engineering application, such an answer may be as good as exact.

We can illustrate the complexities of exact versus approximate solutions by an elementary example. Suppose we have one polynomial of degree 4,

p(z) = c0 + c1 z + c2 z^2 + c3 z^3 + c4 z^4,

and another of degree 5,

q(z) = d0 + d1 z + d2 z^2 + d3 z^3 + d4 z^4 + d5 z^5.

It is well-known that there is an explicit formula that expresses the roots of p in terms of radicals (discovered by Ferrari around 1540), but no such formula for the roots of q (as shown by Ruffini and abel [VI.33] more than 250 years later; see the insolubility of the quintic [V.21] for more details). Thus, in a certain philosophical sense the root-finding problems for p and q are utterly different. Yet in practice they hardly differ at all. If a scientist or a mathematician wants to know the roots of one of these polynomials, he or she will turn to a computer and get an answer to sixteen digits of precision in less than a millisecond. Did the computer use an explicit formula? In the case of q, the answer is certainly no, but what about p? Maybe, maybe not. Most of the time, the user neither knows nor cares, and probably not one mathematician in a hundred could write down formulas for the roots of p from memory.

Here are three more examples of problems that can be solved in principle by a finite sequence of elementary operations, like finding the roots of p.

(i) Linear equations: solve a system of n linear equations in n unknowns.
(ii) Linear programming: minimize a linear function of n variables subject to m linear constraints.
(iii) Traveling salesman problem: find the shortest tour between n cities.

And here are five that, like finding the roots of q, cannot generally be solved in this manner.

(iv) Find an eigenvalue [I.3 §4.3] of an n × n matrix.
(v) Minimize a function of several variables.


(vi) Evaluate an integral.
(vii) Solve an ordinary differential equation (ODE).
(viii) Solve a partial differential equation (PDE).

Can we conclude that (i)–(iii) will be easier than (iv)–(viii) in practice? Absolutely not. Problem (iii) is usually very hard indeed if n is, say, in the hundreds or thousands. Problems (vi) and (vii) are usually rather easy, at least if the integral is in one dimension. Problems (i) and (iv) are of almost exactly the same difficulty: easy when n is small, like 100, and often very hard when n is large, like 1 000 000. In fact, in these matters philosophy is such a poor guide to practice that, for each of the three problems (i)–(iii), when n and m are large one often ignores the exact solution and uses approximate (but fast!) methods instead.

Numerical analysis is the study of algorithms for solving the problems of continuous mathematics, by which we mean problems involving real or complex variables. (This definition includes problems like linear programming and the traveling salesman problem posed over the real numbers, but not their discrete analogues.) In the remainder of this article we shall review some of its main branches, past accomplishments, and possible future trends.

2 A Brief History

Throughout history, leading mathematicians have been involved with scientific applications, and in many cases this has led to the discovery of numerical algorithms still in use today. gauss [VI.26], as usual, is an outstanding example. Among many other contributions, he made crucial advances in least-squares data fitting (1795), systems of linear equations (1809), and numerical quadrature (1814), as well as inventing the fast fourier transform [III.26] (1805), though the last did not become widely known until its rediscovery by Cooley and Tukey in 1965.

Around 1900, the numerical side of mathematics started to become less conspicuous in the activities of research mathematicians. This was a consequence of the growth of mathematics generally and of great advances in fields in which, for technical reasons, mathematical rigor had to be the heart of the matter. For example, many advances of the early twentieth century sprang from mathematicians’ new ability to reason rigorously about infinity, a subject relatively far from numerical calculation.

A generation passed, and in the 1940s the computer was invented. From this moment numerical mathematics began to explode, but now mainly in the hands of specialists. New journals were founded such as Mathematics of Computation (1943) and Numerische Mathematik (1959). The revolution was sparked by hardware, but it included mathematical and algorithmic developments that had nothing to do with hardware. In the half-century from the 1950s, machines sped up by a factor of around 10^9, but so did the best algorithms known for some problems, generating a combined increase in speed of almost incomprehensible scale.

Half a century on, numerical analysis has grown into one of the largest branches of mathematics, the specialty of thousands of researchers who publish in dozens of mathematical journals as well as applications journals across the sciences and engineering. Thanks to the efforts of these people going back many decades, and thanks to ever more powerful computers, we have reached a point where most of the classical mathematical problems of the physical sciences can be solved numerically to high accuracy. Most of the algorithms that make this possible were invented since 1950.

Numerical analysis is built on a strong foundation: the mathematical subject of approximation theory. This field encompasses classical questions of interpolation, series expansions, and harmonic analysis [IV.11] associated with newton [VI.14], fourier [VI.25], Gauss, and others; semiclassical problems of polynomial and rational minimax approximation associated with names such as chebyshev [VI.45] and Bernstein; and major newer topics, including splines, radial basis functions, and wavelets [VII.3]. We shall not have space to address these subjects, but in almost every area of numerical analysis it is a fact that, sooner or later, the discussion comes down to approximation theory.

3 Machine Arithmetic and Rounding Errors

It is well-known that computers cannot represent real or complex numbers exactly. A quotient like 1/7 evaluated on a computer, for example, will normally yield an inexact result. (It would be different if we designed machines to work in base 7!) Computers approximate real numbers by a system of floating-point arithmetic, in which each number is represented in a digital equivalent of scientific notation, so that the scale does not matter unless the number is so huge or tiny as to cause overflow or underflow. Floating-point arithmetic was invented by Konrad Zuse in Berlin in the 1930s, and by the end of the 1950s it was standard across the computer industry.
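For illustration, here is a minimal Python sketch (not part of the original article; it assumes the IEEE double-precision arithmetic that Python floats use) showing the kind of inexactness just described.

    # Hypothetical illustration: IEEE double precision cannot represent
    # 1/7, 0.1, 0.2, or 0.3 exactly.
    x = 1.0 / 7.0
    print(f"{x:.20f}")        # 0.14285714285714284921..., not the exact decimal expansion

    print(0.1 + 0.2 == 0.3)   # False: each constant is rounded to the nearest double,
                              # and the rounded sum differs from the rounded 0.3

    # The relative rounding error of a single operation is at most about 2**-53.
    print(2.0 ** -53)         # 1.1102230246251565e-16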

Until the 1980s, different computers had widely different arithmetic properties. Then, in 1985, after years of discussion, the IEEE (Institute of Electrical and Electronics Engineers) standard for binary floating-point arithmetic was adopted, or IEEE arithmetic for short. This standard has subsequently become nearly universal on processors of many kinds. An IEEE (double precision) number consists of a 64-bit word divided into 53 bits for a signed fraction in base 2 and 11 bits for a signed exponent. Since 2^-53 ≈ 1.1 × 10^-16, IEEE numbers represent the numbers of the real line to a relative accuracy of about 16 digits. Since 2^(±2^10) ≈ 10^(±308), this system works for numbers up to about 10^308 and down to about 10^-308.

Computers do not merely represent numbers, of course; they perform operations on them such as addition, subtraction, multiplication, and division, and more complicated results are obtained from sequences of these elementary operations. In floating-point arithmetic, the computed result of each elementary operation is almost exactly correct in the following sense: if “∗” is one of these four operations in its ideal form and “⊛” is the same operation as realized on the computer, then for any floating-point numbers x and y, assuming that there is no underflow or overflow,

x ⊛ y = (x ∗ y)(1 + ε).

Here ε is a very small quantity, no greater in absolute value than a number known as machine epsilon, denoted by ε_mach, that measures the accuracy of the computer. In the IEEE system, ε_mach = 2^-53 ≈ 1.1 × 10^-16.

Thus, on a computer, the interval [1, 2], for example, is approximated by about 10^16 numbers. It is interesting to compare the fineness of this discretization with that of the discretizations of physics. In a handful of solid or liquid or a balloonful of gas, the number of atoms or molecules in a line from one point to another is on the order of 10^8 (the cube root of Avogadro’s number). Such a system behaves enough like a continuum to justify our definitions of physical quantities such as density, pressure, stress, strain, and temperature. Computer arithmetic, however, is more than a million times finer than this. Another comparison with physics concerns the precision to which fundamental constants are known, such as (roughly) 4 digits for the gravitational constant G, 7 digits for Planck’s constant h and the elementary charge e, and 12 digits for the ratio µe/µB of the magnetic moment of the electron to the Bohr magneton. At present, almost nothing in physics is known to more than 12 or 13 digits of accuracy. Thus IEEE numbers are orders of magnitude more precise than any number in science. (Of course, purely mathematical quantities like π are another matter.)

In two senses, then, floating-point arithmetic is far closer to its ideal than is physics. It is a curious phenomenon that, nevertheless, it is floating-point arithmetic rather than the laws of physics that is widely regarded as an ugly and dangerous compromise. Numerical analysts themselves are partly to blame for this perception. In the 1950s and 1960s, the founding fathers of the field discovered that inexact arithmetic can be a source of danger, causing errors in results that “ought” to be right. The source of such problems is numerical instability: that is, the amplification of rounding errors from microscopic to macroscopic scale by certain modes of computation. These men, including von neumann [VI.91], Wilkinson, Forsythe, and Henrici, took great pains to publicize the risks of careless reliance on machine arithmetic. These risks are very real, but the message was communicated all too successfully, leading to the current widespread impression that the main business of numerical analysis is coping with rounding errors. In fact, the main business of numerical analysis is designing algorithms that converge quickly; rounding-error analysis, while often a part of the discussion, is rarely the central issue. If rounding errors vanished, 90% of numerical analysis would remain.

4 Numerical Linear Algebra

Linear algebra became a standard topic in undergraduate mathematics curriculums in the 1950s and 1960s, and has remained there ever since. There are several reasons for this, but I think one is at the bottom of it: the importance of linear algebra has exploded since the arrival of computers.

The starting point of this subject is Gaussian elimination, a procedure that can solve n linear equations in n unknowns using on the order of n^3 arithmetic operations. Equivalently, it solves equations of the form Ax = b, where A is an n × n matrix and x and b are column vectors of size n. Gaussian elimination is invoked on computers around the world almost every time a system of linear equations is solved. Even if n is as large as 1000, the time required is well under a second on a typical 2008 desktop machine.
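As a concrete illustration (not from the original article; NumPy is assumed), here is how such a system is typically solved in practice, with the O(n^3) elimination hidden inside a library call.

    # Hypothetical sketch: solve an n = 1000 linear system Ax = b.
    # np.linalg.solve calls LAPACK, i.e., Gaussian elimination with partial pivoting.
    import numpy as np

    n = 1000
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    x = np.linalg.solve(A, b)             # roughly (2/3) n^3 floating-point operations
    print(np.linalg.norm(A @ x - b))      # residual near machine precision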

The idea of elimination was first discovered by Chinese scholars about 2000 years ago, and more recent contributors include lagrange [VI.22], Gauss, and jacobi [VI.35]. The modern way of describing such algorithms, however, was apparently introduced as late as the 1930s. Suppose that, say, α times the first row of A is subtracted from the second row. This operation can be interpreted as the multiplication of A on the left by the lower-triangular matrix M1 consisting of the identity with the additional nonzero entry m21 = −α. Further analogous row operations correspond to further multiplications on the left by lower-triangular matrices Mj. If k steps convert A to an upper-triangular matrix U, then we have MA = U with M = Mk ··· M2 M1, or, upon setting L = M^-1,

A = LU.

Here L is unit lower-triangular, that is, lower-triangular with all its diagonal entries equal to 1. Since U represents the target structure and L encodes the operations carried out to get there, we can say that Gaussian elimination is a process of lower-triangular upper-triangularization.

Many other algorithms of numerical linear algebra are also based on writing a matrix as a product of matrices that have special properties. To borrow a phrase from biology, we may say that this field has a central dogma:

algorithms ←→ matrix factorizations.

In this framework we can quickly describe the next development that needs to be considered. Not every matrix has an LU factorization; a 2 × 2 counterexample is the matrix

A = [0 1; 1 0].

Soon after computers came into use it was observed that even for matrices that do have LU factorizations, the pure form of Gaussian elimination is unstable, amplifying rounding errors by potentially large amounts. Stability can be achieved by interchanging rows during the elimination in order to bring maximal entries to the diagonal, a process known as pivoting. Since pivoting acts on rows, it again corresponds to a multiplication of A by other matrices on the left. The matrix factorization corresponding to Gaussian elimination with pivoting is

PA = LU,

where U is upper-triangular, L is unit lower-triangular, and P is a permutation matrix, i.e., the identity matrix with permuted rows. If the permutations are chosen to bring the largest entry below the diagonal in column k to the (k, k) position before the kth elimination step, then L has the additional property |ℓ_ij| ≤ 1 for all i and j.

The discovery of pivoting came quickly, but its theoretical analysis has proved astonishingly hard. In practice, pivoting makes Gaussian elimination almost perfectly stable, and it is routinely done by almost all computer programs that need to solve linear systems of equations. Yet it was realized in around 1960 by Wilkinson and others that for certain exceptional matrices, Gaussian elimination is still unstable, even with pivoting. The lack of an explanation of this discrepancy represents an embarrassing gap at the heart of numerical analysis. Experiments suggest that the fraction of matrices (for example, among random matrices with independent normally distributed entries) for which Gaussian elimination amplifies rounding errors by a factor greater than ρ n^(1/2) is in a certain sense exponentially small as a function of ρ as ρ → ∞, where n is the dimension, but a theorem to this effect has never been proved.

Meanwhile, beginning in the late 1950s, the field of numerical linear algebra expanded in another direction: the use of algorithms based on orthogonal [III.50 §3] or unitary [III.50 §3] matrices, that is, real matrices with Q^-1 = Q^T or complex ones with Q^-1 = Q*, where Q* denotes the conjugate transpose. The starting point of such developments is the idea of QR factorization. If A is an m × n matrix with m ≥ n, a QR factorization of A is a product

A = QR,

where Q has orthonormal columns and R is upper-triangular. One can interpret this formula as a matrix expression of the familiar idea of Gram–Schmidt orthogonalization, in which the columns q1, q2, ... of Q are determined one after another. These column operations correspond to multiplication of A on the right by elementary upper-triangular matrices. One could say that the Gram–Schmidt algorithm aims for Q and gets R as a by-product, and is thus a process of triangular orthogonalization. A big event was when Householder showed in 1958 that a dual strategy of orthogonal triangularization is more effective for many purposes. In this approach, by applying a succession of elementary matrix operations each of which reflects R^m across a hyperplane, one reduces A to upper-triangular form via orthogonal operations: one aims at R and gets Q as a by-product. The Householder method turns out to be more stable numerically, because orthogonal operations preserve norms and thus do not amplify the rounding errors introduced at each step.
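The two factorizations just described are available in standard libraries. Here is a small sketch (not from the original article; NumPy and SciPy are assumed) verifying PA = LU with pivoting and A = QR by Householder reflections.

    # Hypothetical sketch: the factorizations PA = LU and A = QR via SciPy,
    # which in turn calls LAPACK's pivoted elimination and Householder routines.
    import numpy as np
    from scipy.linalg import lu, qr

    rng = np.random.default_rng(1)

    # Gaussian elimination with partial pivoting.  SciPy returns A = P @ L @ U,
    # so P.T @ A = L @ U is the "PA = LU" form used in the text.
    A = rng.standard_normal((5, 5))
    P, L, U = lu(A)
    print(np.allclose(P.T @ A, L @ U))        # True
    print(np.max(np.abs(L)) <= 1.0)           # |l_ij| <= 1, thanks to pivoting

    # Householder QR: Q has orthonormal columns, R is upper-triangular.
    B = rng.standard_normal((6, 4))
    Q, R = qr(B, mode="economic")
    print(np.allclose(Q @ R, B), np.allclose(Q.T @ Q, np.eye(4)))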

From the QR factorization sprang a rich collection of linear algebra algorithms in the 1960s. The QR factorization can be used by itself to solve least-squares problems and construct orthonormal bases. More remarkable is its use as a step in other algorithms. In particular, one of the central problems of numerical linear algebra is the determination of the eigenvalues and eigenvectors of a square matrix A. If A has a complete set of eigenvectors, then by forming a matrix X whose columns are these eigenvectors and a diagonal matrix D whose diagonal entries are the corresponding eigenvalues, we obtain

AX = XD,

and hence, since X is nonsingular,

A = XDX^-1,

the eigenvalue decomposition. In the special case in which A is hermitian [III.50 §3], a complete set of orthonormal eigenvectors always exists, giving

A = QDQ*,

where Q is unitary. The standard algorithm for computing these factorizations was developed in the early 1960s by Francis, Kublanovskaya, and Wilkinson: the QR algorithm. Because polynomials of degree 5 or more cannot be solved by a formula, we know that eigenvalues cannot generally be computed in closed form. The QR algorithm is therefore necessarily an iterative one, involving a sequence of QR factorizations that is in principle infinite. Nevertheless, its convergence is extraordinarily rapid. In the symmetric case, for a typical matrix A, the QR algorithm converges cubically, in the sense that at each step the number of correct digits in one of the eigenvalue–eigenvector pairs approximately triples.

The QR algorithm is one of the great triumphs of numerical analysis, and its impact through widely used products has been enormous. Algorithms and analysis based on it led in the 1960s to computer codes in Algol and Fortran and later to the software libraries EISPACK (“Eigensystem Package”) and its descendant LAPACK. The same methods have also been incorporated in general-purpose numerical libraries such as the NAG, IMSL, and Numerical Recipes collections, and in problem-solving environments such as MATLAB, Maple, and Mathematica. These developments have been so successful that the computation of matrix eigenvalues long ago became a “black box” operation for virtually every scientist, with nobody but a few specialists knowing the details of how it is done. A curious related story is that EISPACK’s relative LINPACK for solving linear systems of equations took on an unexpected function: it became the original basis for the benchmarks that all computer manufacturers run to test the speed of their computers. If a supercomputer is lucky enough to make the TOP500 list, updated twice a year since 1993, it is because of its prowess in solving certain matrix problems Ax = b of dimensions ranging from 100 into the millions.

The eigenvalue decomposition is familiar to all mathematicians, but the development of numerical linear algebra has also brought its younger cousin onto the scene: the singular value decomposition (SVD). The SVD was discovered by Beltrami, jordan [VI.52], and sylvester [VI.42] in the late nineteenth century, and made famous by Golub and other numerical analysts beginning in around 1965. If A is an m × n matrix with m ≥ n, an SVD of A is a factorization

A = UΣV*,

where U is m × n with orthonormal columns, V is n × n and unitary, and Σ is diagonal with diagonal entries σ1 ≥ σ2 ≥ ··· ≥ σn ≥ 0. One could compute the SVD by relating it to the eigenvalue problems for AA* and A*A, but this proves numerically unstable; a better approach is to use a variant of the QR algorithm that does not square A. Computing the SVD is the standard route to determining the norm [III.62] ‖A‖ = σ1 (here ‖·‖ is the hilbert space [III.37] or “2” norm), the norm of the inverse ‖A^-1‖ = 1/σn in the case where A is square and nonsingular, or their product, known as the condition number,

κ(A) = ‖A‖ ‖A^-1‖ = σ1/σn.

It is also a step in an extraordinary variety of further computational problems including rank-deficient least-squares, computation of ranges and nullspaces, determination of ranks, “total least-squares,” low-rank approximation, and determination of angles between subspaces.
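A brief sketch (not from the original article; NumPy is assumed) of the quantities the SVD delivers:

    # Hypothetical sketch: the SVD and the norms and condition number it yields.
    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((8, 8))

    U, s, Vh = np.linalg.svd(A)     # A = U @ diag(s) @ Vh, with s[0] >= ... >= s[-1] >= 0
    norm_A = s[0]                   # ||A|| = sigma_1
    norm_Ainv = 1.0 / s[-1]         # ||A^-1|| = 1/sigma_n (A square and nonsingular)
    kappa = norm_A * norm_Ainv      # condition number sigma_1/sigma_n

    print(np.isclose(kappa, np.linalg.cond(A, 2)))   # True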

All the discussion above concerns “classical” numerical linear algebra, born in the period 1950–75. The ensuing quarter-century brought in a whole new set of tools: methods for large-scale problems based on Krylov subspace iterations. The idea of these iterations is as follows. Suppose a linear algebra problem is given that involves a matrix of large dimension, say n ≥ 1000. The solution may be characterized as the vector x ∈ R^n that satisfies a certain variational property such as minimizing (1/2) x^T A x − x^T b (for solving Ax = b if A is symmetric positive definite) or being a stationary point of (x^T A x)/(x^T x) (for solving Ax = λx if A is symmetric). Now if K_k is a k-dimensional subspace of R^n with k ≪ n, then it may be possible to solve the same variational problem much more quickly in that subspace. The magical choice of K_k is a Krylov subspace

K_k(A, q) = span(q, Aq, ..., A^(k−1) q)

for an initial vector q. For reasons that have fascinating connections with approximation theory, solutions in these subspaces often converge very rapidly to the exact solution in R^n as k increases, if the eigenvalues of A are favorably distributed. For example, it is often possible to solve a matrix problem involving 10^5 unknowns to ten-digit precision in just a few hundred iterations. The speedup compared with the classical algorithms may be a factor of thousands.

Krylov subspace iterations originated with the conjugate gradient and Lanczos iterations published in 1952, but in those years computers were not powerful enough to solve problems of a large enough scale for the methods to be competitive. They took off in the 1970s with the work of Reid and Paige and especially van der Vorst and Meijerink, who made famous the idea of preconditioning. In preconditioning a system Ax = b, one replaces it by a mathematically equivalent system such as

MAx = Mb

for some nonsingular matrix M. If M is well chosen, the new problem involving MA may have favorably distributed eigenvalues and a Krylov subspace iteration may solve it quickly.

Since the 1970s, preconditioned matrix iterations have emerged as an indispensable tool of computational science. As one indication of their prominence we may note that in 2001, Thomson ISI announced that the most heavily cited article in all of mathematics in the 1990s was the 1989 paper by van der Vorst introducing Bi-CGStab, a generalization of conjugate gradients for nonsymmetric matrices.

Finally, we must mention the biggest unsolved problem in numerical analysis. Can an arbitrary n × n matrix A be inverted in O(n^α) operations for every α > 2? (The problems of solving a system Ax = b or computing a matrix product AB are equivalent.) Gaussian elimination has α = 3, and the exponent shrinks as far as 2.376 for certain recursive (though impractical) algorithms published by Coppersmith and Winograd in 1990. Is there a “fast matrix inverse” in store for us?

5 Numerical Solution of Differential Equations

Long before much attention was paid to linear algebra, mathematicians developed numerical methods to solve problems of analysis. The problem of numerical integration or quadrature goes back to Gauss and newton [VI.14], and even to archimedes [VI.3]. The classic quadrature formulas are derived from the idea of interpolating data at n + 1 points by a polynomial of degree n, then integrating the polynomial exactly. Equally spaced points give the Newton–Cotes formulas, which are useful for small degrees but diverge at a rate as high as 2^n as n → ∞: the Runge phenomenon. If the points are chosen optimally, then the result is Gauss quadrature, which converges rapidly and is numerically stable. It turns out that these optimal points are roots of Legendre polynomials, which are clustered near the endpoints. (A proof is sketched in orthogonal polynomials [III.85].) Equally good for most purposes is Clenshaw–Curtis quadrature, where the interpolation points become cos(jπ/n), 0 ≤ j ≤ n. This quadrature method is also stable and rapidly convergent, and unlike Gauss quadrature can be executed in O(n log n) operations by the fast Fourier transform. The explanation of why clustered points are necessary for effective quadrature rules is related to the subject of potential theory.

Around 1850 another problem of analysis began to get attention: the solution of ODEs. The Adams formulas are based on polynomial interpolation in equally spaced points, which in practice typically number fewer than ten. These were the first of what are now called multistep methods for the numerical solution of ODEs. The idea here is that for an initial value problem u' = f(t, u) with independent variable t > 0, we pick a small time step ∆t > 0 and consider a finite set of time values

t_n = n∆t, n ≥ 0.

We then replace the ODE by an algebraic approximation that enables us to calculate a succession of approximate values

v^n ≈ u(t_n), n ≥ 0.

(The superscript here is just a superscript, not a power.) The simplest such approximate formula, going back to euler [VI.19], is

v^(n+1) = v^n + ∆t f(t_n, v^n),

or, using the abbreviation f^n = f(t_n, v^n),

v^(n+1) = v^n + ∆t f^n.
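As a small illustration (not part of the original article), Euler’s formula is only a few lines of code; here it is applied to the model problem u' = −u, whose exact solution is e^(−t).

    # Hypothetical sketch of Euler's method v^{n+1} = v^n + dt * f(t_n, v^n).
    import math

    def euler(f, u0, dt, nsteps):
        t, v = 0.0, u0
        for _ in range(nsteps):
            v = v + dt * f(t, v)
            t = t + dt
        return v

    approx = euler(lambda t, u: -u, 1.0, 0.01, 100)   # integrate u' = -u to t = 1
    print(approx, math.exp(-1.0))                     # first order: error is O(dt)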


Both the ODE itself and its numerical approximation may involve one equation or many, in which case u(t) and v^n become vectors of an appropriate dimension. The Adams formulas are higher-order generalizations of Euler’s formula that are much more efficient at generating accurate solutions. For example, the fourth-order Adams–Bashforth formula is

v^(n+1) = v^n + (1/24) ∆t (55f^n − 59f^(n−1) + 37f^(n−2) − 9f^(n−3)).

The term “fourth-order” reflects a new element in the numerical treatment of problems of analysis: the appearance of questions of convergence as ∆t → 0. The formula above is of fourth order in the sense that it will normally converge at the rate O((∆t)^4). The orders employed in practice are most often in the range 3–6, enabling excellent accuracy for all kinds of computations, typically in the range of 3–10 digits, and higher-order formulas are occasionally used when still more accuracy is needed. Most unfortunately, the habit in the numerical analysis literature is to speak not of the convergence of these magnificently efficient methods, but of their error, or more precisely their discretization or truncation error as distinct from rounding error. This ubiquitous language of error analysis is dismal in tone, but seems ineradicable.

At the turn of the twentieth century, the second great class of ODE algorithms, known as Runge–Kutta or one-step methods, was developed by Runge, Heun, and Kutta. For example, here are the formulas of the famous fourth-order Runge–Kutta method, which advance a numerical solution (again scalar or system) from time step t_n to t_(n+1) with the aid of four evaluations of the function f:

a = ∆t f(t_n, v^n),
b = ∆t f(t_n + ∆t/2, v^n + a/2),
c = ∆t f(t_n + ∆t/2, v^n + b/2),
d = ∆t f(t_n + ∆t, v^n + c),
v^(n+1) = v^n + (1/6)(a + 2b + 2c + d).

Runge–Kutta methods tend to be easier to implement but sometimes harder to analyze than multistep formulas. For example, for any s, it is a trivial matter to derive the coefficients of the s-step Adams–Bashforth formula, which has order of accuracy p = s. For Runge–Kutta methods, by contrast, there is no simple relationship between the number of “stages” (i.e., function evaluations per step) and the attainable order of accuracy. The classical methods with s = 1, 2, 3, 4 were known to Kutta in 1901 and have order p = s, but it was not until 1963 that it was proved that s = 6 stages are required to achieve order p = 5. The analysis of such problems involves beautiful mathematics from graph theory and other areas, and a key figure in this area since the 1960s has been John Butcher. For orders p = 6, 7, 8 the minimal numbers of stages are s = 7, 9, 11, while for p > 8 exact minima are not known. Fortunately, these higher orders are rarely needed for practical purposes.

When computers began to be used to solve differential equations after World War II, a phenomenon of the greatest practical importance appeared: once again, numerical instability. As before, this phrase refers to the unbounded amplification of local errors by a computational process, but now the dominant local errors are usually those of discretization rather than rounding. Instability typically manifests itself as an oscillatory error in the computed solution that blows up exponentially as more numerical steps are taken. One mathematician concerned with this effect was Germund Dahlquist. Dahlquist saw that the phenomenon could be analyzed with great power and generality, and some people regard the appearance of his 1956 paper as one of the events marking the birth of modern numerical analysis. This landmark paper introduced what might be called the fundamental theorem of numerical analysis:

consistency + stability = convergence.

The theory is based on precise definitions of these three notions along the following lines. Consistency is the property that the discrete formula has locally positive order of accuracy and thus models the right ODE. Stability is the property that errors introduced at one time step cannot grow unboundedly at later times. Convergence is the property that as ∆t → 0, in the absence of rounding errors, the numerical solution converges to the correct result. Before Dahlquist’s paper, the idea of an equivalence of stability and convergence was perhaps in the air in the sense that practitioners realized that if a numerical scheme was not unstable, then it would probably give a good approximation to the right answer. His theory gave rigorous form to that idea for a wide class of numerical methods.

As computer methods for ODEs were being developed, the same was happening for the much bigger subject of PDEs.

Discrete numerical methods for solving PDEs had been invented around 1910 by Richardson for applications in stress analysis and meteorology, and further developed by Southwell; in 1928 there was also a theoretical paper on finite-difference methods by courant [VI.83], Friedrichs, and Lewy. But although the Courant–Friedrichs–Lewy work later became famous, the impact of these ideas before computers came along was limited. After that point the subject developed quickly. Particularly influential in the early years was the group of researchers around von Neumann at the Los Alamos laboratory, including the young Peter Lax.

Just as for ODEs, von Neumann and his colleagues discovered that some numerical methods for PDEs were subject to catastrophic instabilities. For example, to solve the linear wave equation u_t = u_x numerically we may pick space and time steps ∆x and ∆t for a regular grid,

x_j = j∆x, t_n = n∆t, j, n ≥ 0,

and replace the PDE by algebraic formulas that compute a succession of approximate values:

v_j^n ≈ u(t_n, x_j), j, n ≥ 0.

A well-known discretization for this purpose is the Lax–Wendroff formula:

v_j^(n+1) = v_j^n + (1/2) λ (v_(j+1)^n − v_(j−1)^n) + (1/2) λ^2 (v_(j+1)^n − 2v_j^n + v_(j−1)^n),

where λ = ∆t/∆x, which can be generalized to nonlinear systems of hyperbolic conservation laws in one dimension. For u_t = u_x, if λ is held fixed at a value less than or equal to 1, the method will converge to the correct solution as ∆x, ∆t → 0 (ignoring rounding errors). If λ is greater than 1, on the other hand, it will explode. Von Neumann and others realized that the presence or absence of such instabilities could be tested, at least for linear constant-coefficient problems, by discrete fourier analysis [III.27] in x: “von Neumann analysis.” Experience indicated that, as a practical matter, a method would succeed if it was not unstable. A theory soon appeared that gave rigor to this observation: the Lax equivalence theorem, published by Lax and Richtmyer in 1956, the same year as Dahlquist’s paper. Many details were different—this theory was restricted to linear equations whereas Dahlquist’s theory for ODEs also applied to nonlinear ones—but broadly speaking the new result followed the same pattern of equating convergence to consistency plus stability. Mathematically, the key point was the uniform boundedness principle.

In the half-century since von Neumann died, the Lax–Wendroff formula and its relatives have grown into a breathtakingly powerful subject known as computational fluid dynamics. Early treatments of linear and nonlinear equations in one space dimension soon moved to two dimensions and eventually to three. It is now a routine matter to solve problems involving millions of variables on computational grids with hundreds of points in each of three directions. The equations are linear or nonlinear; the grids are uniform or nonuniform, often adaptively refined to give special attention to boundary layers and other fast-changing features; the applications are everywhere. Numerical methods were used first to model airfoils, then whole wings, then whole aircraft. Engineers still use wind tunnels, but they rely more on computations.

Many of these successes have been facilitated by another numerical technology for solving PDEs that emerged in the 1960s from diverse roots in engineering and mathematics: finite elements. Instead of approximating a differential operator by a difference quotient, finite-element methods approximate the solution itself by functions f that can be broken up into simple pieces. For instance, one might partition the domain of f into elementary sets such as triangles or tetrahedra and insist that the restriction of f to each piece is a polynomial of small degree. The solution is obtained by solving a variational form of the PDE within the corresponding finite-dimensional subspace, and there is often a guarantee that the computed solution is optimal within that subspace. Finite-element methods have taken advantage of tools of functional analysis to develop to a very mature state. These methods are known for their flexibility in handling complicated geometries, and they are dominant in applications in structural mechanics and civil engineering. The number of books and articles that have been published about finite-element methods is in excess of 10 000.

In the vast and mature field of numerical solution of PDEs, what aspect of the current state of the art would most surprise Richardson or Courant, Friedrichs, and Lewy? I think it is the universal dependence on exotic algorithms of linear algebra. The solution of a large-scale PDE problem in three dimensions may require a system of a million equations to be solved at each time step. This may be achieved by a GMRES matrix iteration that utilizes a finite-difference Jacobian implemented by a Bi-CGStab iteration relying on another multigrid preconditioner. Such stacking of tools was surely not imagined by the early computer pioneers.

The need for it ultimately traces to numerical instability, for as Crank and Nicolson first noted in 1947, the crucial tool for combating instability is the use of implicit formulas, which couple together unknowns at the new time step t_(n+1), and it is in implementing this coupling that solutions of systems of equations are required.

Here are some examples that illustrate the successful reliance of today’s science and engineering on the numerical solution of PDEs: chemistry (the schrödinger equation [III.83]); structural mechanics (the equations of elasticity); weather prediction (the geostrophic equations); turbine design (the navier–stokes equations [III.23]); acoustics (the Helmholtz equation); telecommunications (maxwell’s equations [IV.13 §1.1]); cosmology (the Einstein equations); oil discovery (the migration equations); groundwater remediation (Darcy’s law); integrated circuit design (the drift diffusion equations); tsunami modeling (the shallow-water equations); optical fibers (the nonlinear wave equations [III.49]); image enhancement (the Perona–Malik equation); metallurgy (the Cahn–Hilliard equation); pricing financial options (the black–scholes equation [VII.9 §2]).

6 Numerical Optimization

The third great branch of numerical analysis is optimization, that is, the minimization of functions of several variables and the closely related problem of solution of nonlinear systems of equations. The development of optimization has been somewhat independent of that of the rest of numerical analysis, carried forward in part by a community of scholars with close links to operations research and economics.

Calculus students learn that a smooth function may achieve an extremum at a point of zero derivative, or at a boundary. The same two possibilities characterize the two big strands of the field of optimization. At one end there are problems of finding interior zeros and minima of unconstrained nonlinear functions by methods related to multivariate calculus. At the other are problems of linear programming, where the function to be minimized is linear and therefore easy to understand, and all the challenge is in the boundary constraints.

Unconstrained nonlinear optimization is an old subject. Newton introduced the idea of approximating functions by the first few terms of what we now call their Taylor series; indeed, Arnol’d has argued that Taylor series were Newton’s “main mathematical discovery.” To find a zero x∗ of a function F of a real variable x, everyone knows the idea of Newton’s method: at the kth step, given an estimate x^(k) ≈ x∗, use the derivative F'(x^(k)) to define a linear approximation from which to derive a better estimate x^(k+1):

x^(k+1) = x^(k) − F(x^(k))/F'(x^(k)).

Newton (1669) and Raphson (1690) applied this idea to polynomials, and Simpson (1740) generalized it to other functions F and to systems of two equations. In today’s language, for a system of n equations in n unknowns, we regard F as an n-vector whose derivative at a point x^(k) ∈ R^n is the n × n Jacobian matrix with entries

J_ij(x^(k)) = ∂F_i/∂x_j (x^(k)), 1 ≤ i, j ≤ n.

This matrix defines a linear approximation to F(x) that is accurate for x ≈ x^(k). Newton’s method then takes the matrix form

x^(k+1) = x^(k) − (J(x^(k)))^-1 F(x^(k)),

which in practice means that to get x^(k+1) from x^(k), we solve a linear system of equations:

J(x^(k)) (x^(k+1) − x^(k)) = −F(x^(k)).

As long as J is Lipschitz continuous and nonsingular at x∗ and the initial guess is good enough, the convergence of this iteration is quadratic:

‖x^(k+1) − x∗‖ = O(‖x^(k) − x∗‖^2).    (1)

Students often think it might be a good idea to develop formulas to enhance the exponent in this estimate to 3 or 4. However, this is an illusion. Taking two steps at a time of a quadratically convergent algorithm yields a quartically convergent one, so the difference in efficiency between quadratic and quartic is at best a constant factor. The same goes if the exponent 2, 3, or 4 is replaced by any other number greater than 1. The true distinction is between all of these algorithms that converge superlinearly, of which Newton’s method is the prototype, and those that converge linearly or geometrically, where the exponent is just 1.

From the point of view of multivariate calculus, it is a small step from solving a system of equations to minimizing a scalar function f of a variable x ∈ R^n: to find a (local) minimum, we seek a zero of the gradient g(x) = ∇f(x), an n-vector. The derivative of g is the Jacobian matrix known as the Hessian of f, with entries

H_ij(x^(k)) = ∂^2 f/∂x_i ∂x_j (x^(k)), 1 ≤ i, j ≤ n,

and one may utilize it just as before in a Newton iteration to find a zero of g(x), the new feature being that a Hessian is always symmetric.
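Here is a brief sketch (not from the original article; NumPy is assumed) of Newton’s method for a 2 × 2 system, with the linear solve written exactly as in the formula above; the particular functions F and J are hypothetical examples.

    # Hypothetical sketch: Newton's method for F(x) = 0 in two unknowns,
    # solving J(x^(k)) (x^(k+1) - x^(k)) = -F(x^(k)) at each step.
    import numpy as np

    def F(x):
        return np.array([x[0]**2 + x[1]**2 - 1.0,     # on the unit circle
                         x[1] - x[0]**3])             # on the curve y = x^3

    def J(x):
        return np.array([[2.0 * x[0], 2.0 * x[1]],
                         [-3.0 * x[0]**2, 1.0]])

    x = np.array([1.0, 1.0])                          # initial guess
    for k in range(6):
        x = x + np.linalg.solve(J(x), -F(x))
        print(k, np.linalg.norm(F(x)))                # residual roughly squares each step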

Though the Newton formulas for minimization and finding zeros were already established, the arrival of computers created a new field of numerical optimization. One of the obstacles quickly encountered was that Newton’s method often fails if the initial guess is not good. This problem has been comprehensively addressed both practically and theoretically by the algorithmic technologies known as line searches and trust regions.

For problems with more than a few variables, it also quickly became clear that the cost of evaluating Jacobians or Hessians at every step could be exorbitant. Faster methods were needed that might make use of inexact Jacobians or Hessians and/or inexact solutions of the associated linear equations, while still achieving superlinear convergence. An early breakthrough of this kind was the discovery of quasi-Newton methods in the 1960s by Broyden, Davidon, Fletcher, and Powell, in which partial information is used to generate steadily improving estimates of the true Jacobian or Hessian or its matrix factors. An illustration of the urgency of this subject at the time is the fact that in 1970 the optimal rank-two symmetric positive-definite quasi-Newton updating formula was published independently by no fewer than four different authors, namely Broyden, Fletcher, Goldfarb, and Shanno; their discovery has been known ever since as the BFGS formula. In subsequent years, as the scale of tractable problems has increased exponentially, new ideas have also become important, including automatic differentiation, a technology that enables derivatives of computed functions to be determined automatically: the computer program itself is “differentiated,” so that as well as producing numerical outputs it also produces their derivatives. The idea of automatic differentiation is an old one, but for various reasons, partly related to advances in sparse linear algebra and to the development of “reverse mode” formulations, it did not become fully practical until the work of Bischof, Carle, and Griewank in the 1990s.

Unconstrained optimization problems are relatively easy, but they are not typical; the true depth of this field is revealed by the methods that have been developed for dealing with constraints. Suppose a function f: R^n → R is to be minimized subject to certain equality constraints c_j(x) = 0 and inequality constraints d_j(x) ≥ 0, where {c_j} and {d_j} are also functions from R^n to R. Even the problem of stating local optimality conditions for solutions to such problems is nontrivial, a matter involving lagrange multipliers [III.64] and a distinction between active and inactive constraints. This problem was solved by what are now known as the KKT conditions, introduced by Kuhn and Tucker in 1951 and also twelve years earlier, it was subsequently realized, by Karush. Development of algorithms for constrained nonlinear optimization continues to be an active research topic today.

The problem of constraints brings us to the other strand of numerical optimization, linear programming. This subject was born in the 1930s and 1940s with Kantorovich in the Soviet Union and Dantzig in the United States. As an outgrowth of his work for the U.S. Air Force in the war, Dantzig invented in 1947 the famous simplex algorithm [III.84] for solving linear programs. A linear program is nothing more than a problem of minimizing a linear function of n variables subject to m linear equality and/or inequality constraints. How can this be a challenge? One answer is that m and n may be large. Large-scale problems may arise through discretization of continuous problems and also in their own right. A famous early example was Leontiev’s theory of input–output models in economics, which won him the Nobel Prize in 1973. Even in the 1970s the Soviet Union used an input–output computer model involving thousands of variables as a tool for planning the economy.

The simplex algorithm made medium- and large-scale linear programming problems tractable. Such a problem is defined by its objective function, the function f(x) to be minimized, and its feasible region, the set of vectors x ∈ R^n that satisfy all the constraints. For a linear program the feasible region is a polyhedron, a closed domain bounded by hyperplanes, and the optimal value of f is guaranteed to be attained at one of the vertex points. (A point is called a vertex if it is the unique solution of some subset of the equations that define the constraints.) The simplex algorithm proceeds by moving systematically downhill from one vertex to another until an optimal point is reached. All of the iterates lie on the boundary of the feasible region.

In 1984, an upheaval occurred in this field, triggered by Narendra Karmarkar at AT&T Bell Laboratories. Karmarkar showed that one could sometimes do much better than the simplex algorithm by working in the interior of the feasible region instead. Once a connection was shown between Karmarkar’s method and the logarithmic barrier methods popularized by Fiacco and McCormick in the 1960s, new interior methods for linear programming were devised by applying techniques previously viewed as suitable only for nonlinear problems. The crucial idea of working in tandem with a pair of primal and dual problems led to today’s powerful primal–dual methods, which can solve continuous optimization problems with millions of variables and constraints. Starting with Karmarkar’s work, not only has the field of linear programming changed completely, but the linear and nonlinear sides of optimization are seen today as closely related rather than essentially different.
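For illustration (not part of the original article; SciPy is assumed, and the problem data are hypothetical), here is a tiny linear program solved by a modern library routine, which uses exactly the simplex and interior-point machinery described above.

    # Hypothetical sketch: minimize -x0 - 2*x1 subject to
    #   x0 + x1 <= 4,  x0 + 3*x1 <= 6,  x0, x1 >= 0.
    from scipy.optimize import linprog

    c = [-1.0, -2.0]
    A_ub = [[1.0, 1.0],
            [1.0, 3.0]]
    b_ub = [4.0, 6.0]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None), (0, None)], method="highs")
    print(res.x, res.fun)     # optimum x = (3, 1), f = -5, attained at a vertex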


7 The Future

Numerical analysis sprang from mathematics; then it spawned the field of computer science. When universities began to found computer science departments in the 1960s, numerical analysts were often in the lead. Now, two generations later, most of them are to be found in mathematics departments. What happened? A part of the answer is that numerical analysts deal with continuous mathematical problems, whereas computer scientists prefer discrete ones, and it is remarkable how wide a gap this can be.

Nevertheless, the computer science side of numerical analysis is of crucial importance, and I would like to end with a prediction that emphasizes this aspect of the subject. Traditionally one might think of a numerical algorithm as a cut-and-dried procedure, a loop of some kind to be executed until a well-defined termination criterion is satisfied. For some computations this picture is accurate. On the other hand, beginning with the work of de Boor, Lyness, Rice and others in the 1960s, a less deterministic kind of numerical computing began to appear: adaptive algorithms. In an adaptive quadrature program of the simplest kind, two estimates of the integral are calculated on each portion of a certain mesh and then compared to produce an estimate of the local error. Based on this estimate, the mesh may then be refined locally to improve the accuracy. This process is carried out iteratively until a final answer is obtained that aims to be accurate to a tolerance specified in advance by the user. Most such computations come with no guarantee of accuracy, but an exciting ongoing development is the advance of more sophisticated techniques of a posteriori error control that sometimes do provide guarantees. When these are combined with techniques of interval arithmetic, there is even the prospect of accuracy guaranteed with respect to rounding as well as discretization error.
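As a small sketch (not from the original article) of adaptive quadrature of the simplest kind, the following recursive routine compares two estimates on each subinterval and refines only where they disagree.

    # Hypothetical sketch of adaptive Simpson quadrature with a user-specified tolerance.
    import math

    def adaptive_simpson(f, a, b, tol):
        def simpson(a, b):
            c = 0.5 * (a + b)
            return (b - a) * (f(a) + 4 * f(c) + f(b)) / 6.0
        def recurse(a, b, whole, tol):
            c = 0.5 * (a + b)
            left, right = simpson(a, c), simpson(c, b)
            if abs(left + right - whole) < 15 * tol:      # local error estimate
                return left + right
            return recurse(a, c, left, tol / 2) + recurse(c, b, right, tol / 2)
        return recurse(a, b, simpson(a, b), tol)

    print(adaptive_simpson(math.sin, 0.0, math.pi, 1e-10))  # close to the exact value 2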
First, computer programs for quadrature became adaptive; then programs for ODEs did as well. For PDEs, the move to adaptive programs is happening on a longer timescale. More recently there have been related developments in the computation of Fourier transforms, optimization, and large-scale numerical linear algebra, and some of the new algorithms adapt to the computer architecture as well as the mathematical problem. In a world where several algorithms are known for solving every problem, we increasingly find that the most robust computer program will be one that has diverse capabilities at its disposal and deploys them adaptively on the fly. In other words, numerical computation is increasingly embedded in intelligent control loops. I believe this process will continue, just as has happened in so many other areas of technology, removing scientists further from the details of their computations but offering steadily growing power in exchange. I expect that most of the numerical computer programs of 2050 will be 99% intelligent “wrapper” and just 1% actual “algorithm,” if such a distinction makes sense. Hardly anyone will know how they work, but they will be extraordinarily powerful and reliable, and will often deliver results of guaranteed accuracy.

This story will have a mathematical corollary. One of the fundamental distinctions in mathematics is between linear problems, which can be solved in one step, and nonlinear ones, which usually require iteration. A related distinction is between forward problems (one step) and inverse problems (iteration). As numerical algorithms are increasingly embedded in intelligent control loops, almost every problem will be handled by iteration, regardless of its philosophical status. Problems of algebra will be solved by methods of analysis; and between linear and nonlinear, or forward and inverse, the distinctions will fade.

8 Appendix: Some Major Numerical Algorithms

The list in table 1 attempts to identify some of the most significant algorithmic (as opposed to theoretical) developments in the history of numerical analysis. In each case some of the key early figures are cited, more or less chronologically, and a key early date is given. Of course, any brief sketch of history like this must be an oversimplification. Distressing omissions of names occur throughout the list, including many early contributors in fields such as finite elements, preconditioning, and automatic differentiation, as well as more than half of the authors of the EISPACK, LINPACK, and LAPACK libraries. Even the dates can be questioned; the fast Fourier transform is listed as 1965, for example, since that is the year of the paper that brought it to the world’s attention, though Gauss made the same discovery 160 years earlier. Nor should one imagine that the years from 1991 to the present have been a blank! No doubt in the future we shall identify developments from this period that deserve a place in the table.


Table 1 Some algorithmic developments in the history of numerical analysis.

Year Development Key early figures

263    Gaussian elimination                        Liu, Lagrange, Gauss, Jacobi
1671   Newton's method                             Newton, Raphson, Simpson
1795   Least-squares fitting                       Gauss, Legendre
1814   Gauss quadrature                            Gauss, Jacobi, Christoffel, Stieltjes
1855   Adams ODE formulas                          Euler, Adams, Bashforth
1895   Runge–Kutta ODE formulas                    Runge, Heun, Kutta
1910   Finite differences for PDE                  Richardson, Southwell, Courant, von Neumann, Lax
1936   Floating-point arithmetic                   Torres y Quevedo, Zuse, Turing
1943   Finite elements for PDE                     Courant, Feng, Argyris, Clough
1946   Splines                                     Schoenberg, de Casteljau, Bezier, de Boor
1947   Monte Carlo simulation                      Ulam, von Neumann, Metropolis
1947   Simplex algorithm                           Kantorovich, Dantzig
1952   Lanczos and conjugate gradient iterations   Lanczos, Hestenes, Stiefel
1952   Stiff ODE solvers                           Curtiss, Hirschfelder, Dahlquist, Gear
1954   Fortran                                     Backus
1958   Orthogonal linear algebra                   Aitken, Givens, Householder, Wilkinson, Golub
1959   Quasi-Newton iterations                     Davidon, Fletcher, Powell, Broyden
1961   QR algorithm for eigenvalues                Rutishauser, Kublanovskaya, Francis, Wilkinson
1965   Fast Fourier transform                      Gauss, Cooley, Tukey, Sande
1971   Spectral methods for PDE                    Chebyshev, Lanczos, Clenshaw, Orszag, Gottlieb
1971   Radial basis functions                      Hardy, Askey, Duchon, Micchelli
1973   Multigrid iterations                        Fedorenko, Bakhvalov, Brandt, Hackbusch
1976   EISPACK, LINPACK, LAPACK                    Moler, Stewart, Smith, Dongarra, Demmel, Bai
1976   Nonsymmetric Krylov iterations              Vinsome, Saad, van der Vorst, Sorensen
1977   Preconditioned matrix iterations            van der Vorst, Meijerink
1977   MATLAB                                      Moler
1977   IEEE arithmetic                             Kahan
1982   Wavelets                                    Morlet, Grossmann, Meyer, Daubechies
1984   Interior methods in optimization            Fiacco, McCormick, Karmarkar, Megiddo
1987   Fast multipole method                       Rokhlin, Greengard
1991   Automatic differentiation                   Iri, Bischof, Carle, Griewank

Further Reading IV.22 Set Theory

Ciarlet, P. G. 1978. The Finite Element Method for Elliptic Problems. Amsterdam: North-Holland.
Golub, G. H., and C. F. Van Loan. 1996. Matrix Computations, 3rd edn. Baltimore, MD: Johns Hopkins University Press.
Hairer, E., S. P. Nørsett (for volume I), and G. Wanner. 1993, 1996. Solving Ordinary Differential Equations, volumes I and II. New York: Springer.
Iserles, A., ed. 1992–. Acta Numerica (annual volumes). Cambridge: Cambridge University Press.
Nocedal, J., and S. J. Wright. 1999. Numerical Optimization. New York: Springer.
Powell, M. J. D. 1981. Approximation Theory and Methods. Cambridge: Cambridge University Press.
Richtmyer, R. D., and K. W. Morton. 1967. Difference Methods for Initial-Value Problems. New York: Wiley Interscience.

IV.22 Set Theory
Joan Bagaria

1 Introduction

Among all mathematical disciplines, set theory occupies a special place because it plays two very different roles at the same time: on the one hand, it is an area of