
Thomas Finley, [email protected]

Fundamentals
A subspace is a set S ⊆ R^n such that 0 ∈ S and for all x, y ∈ S and α, β ∈ R, αx + βy ∈ S.
The span of {v_1, ..., v_k} is the set of all vectors in R^n that are linear combinations of v_1, ..., v_k.
A basis B of a subspace S, B = {v_1, ..., v_k} ⊂ S, has Span(B) = S with all v_i linearly independent.
The dimension of S is |B| for any basis B of S.
For subspaces S, T with S ⊆ T, dim(S) ≤ dim(T); if further dim(S) = dim(T), then S = T.
A linear transformation T : R^n → R^m satisfies T(αx + βy) = αT(x) + βT(y) for all x, y ∈ R^n and α, β ∈ R. There exists A ∈ R^{m×n} such that T(x) ≡ Ax for all x.
For two linear transformations T : R^n → R^m and S : R^m → R^p, the composition S ∘ T ≡ S(T(x)) is a linear transformation: (T(x) ≡ Ax) ∧ (S(y) ≡ By) ⇒ (S ∘ T)(x) ≡ BAx.
A matrix's row space is the span of its rows, its column space (or range) is the span of its columns, and its rank is the dimension of either of these spaces.
For A ∈ R^{m×n}, rank(A) ≤ min(m, n). A has full row (or column) rank if rank(A) = m (or n).
A diagonal matrix D ∈ R^{n×n} has d_{j,k} = 0 for j ≠ k. The identity matrix I has i_{j,j} = 1.
The upper (or lower) bandwidth of A is max |i − j| among i, j with i ≤ j (or i ≥ j) such that A_{i,j} ≠ 0. A matrix with lower bandwidth 1 is upper Hessenberg.
For A, B ∈ R^{n×n}, B is A's inverse if AB = BA = I. If such a B exists, A is invertible or nonsingular, and B = A^{-1} = [x_1, ..., x_n] where Ax_i = e_i.
For A ∈ R^{n×n} the following are equivalent: A is nonsingular; rank(A) = n; Ax = b has a solution x for any b; Ax = 0 implies x = 0.
The nullspace of A ∈ R^{m×n} is {x ∈ R^n : Ax = 0}.
For A ∈ R^{m×n}, Range(A) and Nullspace(A^T) are orthogonal complements: x ∈ Range(A), y ∈ Nullspace(A^T) ⇒ x^T y = 0, and every p ∈ R^m is p = x + y for unique such x and y.
For a permutation matrix P ∈ R^{n×n}, PA permutes the rows of A and AP the columns of A. P^{-1} = P^T.

Basic Linear Algebra Subroutines
0. Scalar ops, like x + y: O(1) flops, O(1) data.
1. Vector ops, like y = ax + y: O(n) flops, O(n) data.
2. Matrix-vector ops, like the rank-one update A = A + xy^T: O(n^2) flops, O(n^2) data.
3. Matrix-matrix ops, like C = C + AB: O(n^2) data, O(n^3) flops.
Use the highest BLAS level possible. The operations are architecture tuned, e.g., data is processed in cache-sized bites.
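As a rough illustration of the three nontrivial BLAS levels (a minimal numpy sketch; the sizes and arrays are arbitrary, and numpy dispatches to tuned BLAS underneath):

    import numpy as np

    n = 1000
    a = 2.0
    x, y = np.random.rand(n), np.random.rand(n)
    A, B, C = (np.random.rand(n, n) for _ in range(3))

    y = a * x + y            # level 1: O(n) flops on O(n) data (axpy)
    A = A + np.outer(x, y)   # level 2: O(n^2) flops on O(n^2) data (rank-one update)
    C = C + A @ B            # level 3: O(n^3) flops on O(n^2) data (matrix-matrix)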
Norms
A vector norm ||·|| : R^n → R satisfies:
1. ||x|| ≥ 0, and ||x|| = 0 ⇔ x = 0.
2. ||γx|| = |γ| · ||x|| for all γ ∈ R and all x ∈ R^n.
3. ||x + y|| ≤ ||x|| + ||y|| for all x, y ∈ R^n.
Common norms include:
1. ||x||_1 = |x_1| + |x_2| + ··· + |x_n|
2. ||x||_2 = sqrt(x_1^2 + x_2^2 + ··· + x_n^2)
3. ||x||_∞ = lim_{p→∞} (|x_1|^p + ··· + |x_n|^p)^{1/p} = max_{i=1..n} |x_i|
An induced matrix norm is ||A|| = sup_{x≠0} ||Ax|| / ||x||. It satisfies the three properties of norms.
For all x ∈ R^n and A ∈ R^{m×n}, ||Ax|| ≤ ||A|| ||x||. Further, ||AB|| ≤ ||A|| ||B||, called submultiplicativity, and a^T b ≤ ||a||_2 ||b||_2, the Cauchy-Schwarz inequality.
1. ||A||_∞ = max_{i=1..m} Σ_{j=1..n} |a_{i,j}| (max row sum).
2. ||A||_1 = max_{j=1..n} Σ_{i=1..m} |a_{i,j}| (max column sum).
3. ||A||_2 is hard: it takes O(n^3), not O(n^2), operations.
4. ||A||_F = sqrt(Σ_{i=1..m} Σ_{j=1..n} a_{i,j}^2). ||·||_F often replaces ||·||_2.
For A ∈ R^{m×n}, if rank(A) = n, then A^T A is SPD.

Determinant
The determinant det : R^{n×n} → R satisfies:
1. det(AB) = det(A) det(B).
2. det(A) = 0 iff A is singular.
3. det(L) = l_{1,1} l_{2,2} ··· l_{n,n} for triangular L.
4. det(A) = det(A^T).
To compute det(A), factor A = PLU. det(P) = (−1)^s where s is the number of row swaps, and det(L) = 1. When computing det(U) watch out for overflow!
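A minimal sketch of the PLU route to det(A), assuming scipy is available; working with log|u_ii| sidesteps the overflow the note warns about:

    import numpy as np
    from scipy.linalg import lu_factor

    A = np.random.rand(6, 6)                    # any square matrix
    lu, piv = lu_factor(A)                      # A = P L U, L unit lower triangular
    d = np.diag(lu)                             # diagonal of U
    s = np.count_nonzero(piv != np.arange(6))   # number of row swaps, det(P) = (-1)**s
    sign = (-1.0) ** s * np.prod(np.sign(d))
    logdet = np.sum(np.log(np.abs(d)))          # log|det(A)|: the product never overflows
    print(sign, logdet)
    print(np.linalg.slogdet(A))                 # numpy's equivalent overflow-safe routine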
Orthogonal Matrices
For Q ∈ R^{n×n}, these statements are equivalent:
1. Q^T Q = QQ^T = I (i.e., Q is orthogonal).
2. The ||·||_2 of each row and column of Q is 1, and the inner product of any row (or column) with any other is 0.
3. For all x ∈ R^n, ||Qx||_2 = ||x||_2.
A matrix Q ∈ R^{m×n} with m > n has orthonormal columns if its columns are orthonormal, i.e., Q^T Q = I.
The product of orthogonal matrices is orthogonal. For orthogonal Q, ||QA||_2 = ||A||_2 and ||AQ||_2 = ||A||_2.

QR-Factorization
For any A ∈ R^{m×n} with m ≥ n, we can factor A = QR, where Q ∈ R^{m×m} is orthogonal and R = [ R_1 ; 0 ] ∈ R^{m×n} with R_1 upper triangular. rank(A) = n iff R_1 is invertible.
Q's first n (or last m − n) columns form an orthonormal basis for range(A) (or nullspace(A^T)).
A Householder reflection is H = I − 2vv^T / v^T v. H is symmetric and orthogonal. Explicit Householder QR-factorization is:
1: for k = 1 : n do
2:   v = A(k:m, k) ± ||A(k:m, k)||_2 e_1
3:   A(k:m, k:n) = (I − 2vv^T / v^T v) A(k:m, k:n)
4: end for
We get H_n H_{n−1} ··· H_1 A = R, so Q = H_1 H_2 ··· H_n. This takes 2mn^2 − (2/3)n^3 + O(mn) flops.
Givens rotations require 50% more flops, but are preferable for sparse A.
Gram-Schmidt produces a skinny/reduced QR-factorization A = Q_1 R_1, where Q_1 ∈ R^{m×n} has orthonormal columns.
Left looking:
1: for k = 1 : n do
2:   q_k = a_k
3:   for j = 1 : k − 1 do
4:     R(j, k) = q_j^T a_k
5:     q_k = q_k − R(j, k) q_j
6:   end for
7:   R(k, k) = ||q_k||_2
8:   q_k = q_k / R(k, k)
9: end for
Right looking:
1: Q = A
2: for k = 1 : n do
3:   R(k, k) = ||q_k||_2
4:   q_k = q_k / R(k, k)
5:   for j = k + 1 : n do
6:     R(k, j) = q_k^T q_j
7:     q_j = q_j − R(k, j) q_k
8:   end for
9: end for
In left looking, replace line 4 with R(j, k) = q_j^T q_k (project against the partially updated q_k) to get modified Gram-Schmidt, which is backwards stable.
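A compact numpy version of the explicit Householder sweep above (a sketch; accumulating Q as a dense matrix is wasteful but mirrors the pseudocode):

    import numpy as np

    def householder_qr(A):
        """Return Q (m x m) and R (m x n) with A = Q R via Householder reflections."""
        R = A.astype(float)
        m, n = R.shape
        Q = np.eye(m)
        for k in range(n):
            x = R[k:, k]
            v = x.copy()
            sgn = 1.0 if x[0] >= 0 else -1.0
            v[0] += sgn * np.linalg.norm(x)      # sign chosen to avoid cancellation
            nv = np.linalg.norm(v)
            if nv == 0.0:
                continue
            v /= nv
            R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])   # apply H_k on the left
            Q[:, k:]  -= 2.0 * np.outer(Q[:, k:] @ v, v)    # accumulate Q = H_1 H_2 ... H_n
        return Q, R

    A = np.random.rand(7, 4)
    Q, R = householder_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(7)))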
Linear Least Squares
Suppose we have points (u_1, v_1), ..., (u_5, v_5) through which we want to fit a quadratic curve au^2 + bu + c. We want to solve

    [ u_1^2  u_1  1 ] [ a ]   [ v_1 ]
    [  ...   ...  . ] [ b ] = [ ... ]
    [ u_5^2  u_5  1 ] [ c ]   [ v_5 ]

This is overdetermined, so an exact solution is out. Instead, find the least squares solution x that minimizes ||Ax − b||_2.
For the method of normal equations, solve A^T Ax = A^T b for x using Cholesky factorization. This takes mn^2 + n^3/3 + O(mn) flops. It is conditionally but not backwards stable: forming A^T A squares the condition number.
Alternatively, factor A = QR. Let c = [ c_1 ; c_2 ] = Q^T b. The least squares solution is x = R_1^{-1} c_1.
If rank(A) = r with r < n (rank deficient), factor A = UΣV^T, let y = V^T x and c = U^T b. Then min ||Ax − b||_2^2 = min Σ_{i=1..r} (σ_i y_i − c_i)^2 + Σ_{i=r+1..m} c_i^2, so y_i = c_i / σ_i. For i = r + 1 : n, y_i is arbitrary.
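A small numpy sketch of both solution routes for the quadratic fit above (the data points are made up):

    import numpy as np

    u = np.array([0.0, 0.5, 1.0, 1.5, 2.0])        # hypothetical data
    v = np.array([1.0, 1.8, 3.1, 5.2, 7.9])
    A = np.column_stack([u**2, u, np.ones_like(u)])

    # normal equations: A^T A x = A^T b, solved via Cholesky
    G = np.linalg.cholesky(A.T @ A)
    x_ne = np.linalg.solve(G.T, np.linalg.solve(G, A.T @ v))

    # QR: x = R_1^{-1} Q_1^T b
    Q1, R1 = np.linalg.qr(A)                        # reduced factorization
    x_qr = np.linalg.solve(R1, Q1.T @ v)

    print(x_ne, x_qr)                               # agree to about machine precision here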
Singular Value Decomposition
For any A ∈ R^{m×n}, we can express A = UΣV^T such that U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal and Σ = diag(σ_1, ..., σ_p) ∈ R^{m×n}, where p = min(m, n) and σ_1 ≥ σ_2 ≥ ··· ≥ σ_p ≥ 0. The σ_i are the singular values.
1. Matrix 2-norm: ||A||_2 = σ_1.
2. The condition number κ_2(A) = ||A||_2 ||A^{-1}||_2 = σ_1 / σ_n, or the rectangular condition number κ_2(A) = σ_1 / σ_{min(m,n)}. Note that κ_2(A^T A) = κ_2(A)^2.
3. For a rank-k approximation to A, let Σ_k = diag(σ_1, ..., σ_k, 0, ..., 0). Then A_k = UΣ_k V^T. rank(A_k) ≤ k, and rank(A_k) = k iff σ_k > 0. Among matrices of rank k or lower, A_k minimizes ||A − A_k||_2 = σ_{k+1}.
4. Rank determination: rank(A) = r equals the number of nonzero σ, or in machine arithmetic, perhaps the number of σ ≥ ε_mach σ_1.
Writing A = UΣV^T = [ U_1 U_2 ] [ Σ(1:r, 1:r) 0 ; 0 0 ] [ V_1^T ; V_2^T ], see that range(U_1) = range(A). The SVD gives an orthonormal basis for the range and nullspace of A and A^T.
Compute the SVD by using shifted QR on A^T A.

Information Retrieval & LSI
In the bag of words model, w_d ∈ R^m, where w_d(i) is the (perhaps weighted) frequency of term i in document d. The corpus matrix is A = [w_1, ..., w_n] ∈ R^{m×n}. For a query q ∈ R^m, rank documents according to the score q^T w_d / ||w_d||_2.
In latent semantic indexing, you do the same, but in a k-dimensional subspace. Factor A = UΣV^T, then define A* = Σ_{1:k,1:k} V_{:,1:k}^T ∈ R^{k×n}. Each w*_d = A*_{:,d} = U_{:,1:k}^T w_d, and q* = U_{:,1:k}^T q.
In the Ando-Lee analysis, for a corpus with k topics, for t ∈ 1:k and d ∈ 1:n, let R_{t,d} ≥ 0 be document d's relevance to topic t, with ||R_{:,d}||_2 = 1. True document similarity is R^T R, where entry (i, j) is the relevance of i to j. Using LSI, if A contains information about R^T R, then (A*)^T A* will approximate R^T R well. LSI depends on an even distribution of topics, where the distribution is ρ = max_t ||R_{t,:}||_2 / min_t ||R_{t,:}||_2. Great when ρ is near 1, but if ρ ≫ 1, LSI does worse.
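A quick numpy check of the rank-k approximation property above (a sketch on an arbitrary random matrix):

    import numpy as np

    A = np.random.rand(8, 6)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

    # ||A - A_k||_2 equals sigma_{k+1} (0-indexed: s[k])
    print(np.linalg.norm(A - Ak, 2), s[k])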
Floating Point
Six sources of error in scientific computing: modeling errors, measurement or data errors, blunders, discretization or truncation errors, convergence tolerance, and rounding errors.
A floating point number is ±d_1.d_2 d_3 ··· d_t × β^e, with sign, mantissa, base, and exponent. For single precision t = 24, e ∈ {−126, ..., 127}; for double precision t = 53, e ∈ {−1022, ..., 1023}.
The relative error in x̂ approximating x is |x̂ − x| / |x|.
Unit roundoff or machine epsilon is ε_mach = β^{−t+1}. Basic operations have relative error bounded by ε_mach.
E.g., consider z = x − y with inputs x, y. This program has three roundoff errors: ẑ = ((1 + δ_1)x − (1 + δ_2)y)(1 + δ_3), where δ_1, δ_2, δ_3 ∈ [−ε_mach, ε_mach]. Then
|z − ẑ| / |z| = |(δ_1 + δ_3)x − (δ_2 + δ_3)y + O(ε_mach^2)| / |x − y|
The bad case is δ_1 = ε_mach, δ_2 = −ε_mach, δ_3 = 0, giving |z − ẑ| / |z| = ε_mach |x + y| / |x − y|.
The inaccuracy when |x + y| ≫ |x − y| is called catastrophic cancellation.

Conditioning & Backwards Stability
A problem instance is ill conditioned if the solution is sensitive to perturbations of the data. For example, sin 1 is well conditioned, but sin 12392193 is ill conditioned.
Suppose we perturb Ax = b to (A + E)x̂ = b + e where ||E||/||A|| ≤ δ and ||e||/||b|| ≤ δ. Then ||x̂ − x|| / ||x|| ≤ 2δκ(A) + O(δ^2), where κ(A) = ||A|| ||A^{-1}|| is the condition number of A.
1. For all A ∈ R^{n×n}, κ(A) ≥ 1.
2. κ(I) = 1.
3. For γ ≠ 0, κ(γA) = κ(A).
4. For diagonal D and all p, ||D||_p = max_{i=1..n} |d_{ii}|. So κ(D) = max_{i=1..n} |d_{ii}| / min_{i=1..n} |d_{ii}|.
If κ(A) ≥ 1/ε_mach, A may as well be singular.
An algorithm is backwards stable if in the presence of roundoff error it returns the exact solution to a nearby problem instance.
GEPP solves Ax = b by returning x̂ where (A + E)x̂ = b. It is backwards stable if ||E||_∞ / ||A||_∞ ≤ O(ε_mach). With GEPP, ||E||_∞ / ||A||_∞ ≤ c_n ε_mach + O(ε_mach^2), where c_n is worst case exponential in n, but in practice almost always a low order polynomial.
Combining the stability and conditioning analyses yields ||x̂ − x|| / ||x|| ≤ c_n κ(A) ε_mach + O(ε_mach^2).
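A two-line illustration of catastrophic cancellation (a sketch; 1 − cos x for tiny x is the classic textbook case, not an example from this sheet):

    import numpy as np

    x = 1e-8
    naive  = 1.0 - np.cos(x)         # cos(x) rounds to 1.0, so every digit cancels: 0.0
    stable = 2.0 * np.sin(x / 2)**2  # algebraically equal, ~5e-17, accurate to full precision
    print(naive, stable, np.finfo(float).eps)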
Solving Ax = b
GE produces a factorization A = LU; GEPP produces PA = LU.
Plain GE:
1: for k = 1 to n − 1 do
2:   if a_kk = 0 then stop
3:   l_{k+1:n,k} = a_{k+1:n,k} / a_kk
4:   a_{k+1:n,k:n} = a_{k+1:n,k:n} − l_{k+1:n,k} a_{k,k:n}
5: end for
GEPP:
1: for k = 1 to n − 1 do
2:   γ = argmax_{i ∈ {k,...,n}} |a_{ik}|
3:   a_{[γ,k],k:n} = a_{[k,γ],k:n}           (swap rows k and γ of A)
4:   l_{[γ,k],1:k−1} = l_{[k,γ],1:k−1}       (swap the same rows of L)
5:   p_k = γ
6:   l_{k+1:n,k} = a_{k+1:n,k} / a_kk
7:   a_{k+1:n,k:n} = a_{k+1:n,k:n} − l_{k+1:n,k} a_{k,k:n}
8: end for
To solve Ax = b, factor A = LU (or PA = LU), solve Lw = b (or Lw = b̂ where b̂ = Pb) for w using forward substitution, then solve Ux = w for x using backward substitution. The complexity of GE and GEPP is (2/3)n^3 + O(n^2) flops.
Backward Substitution:
1: x = zeros(n, 1)
2: for j = n to 1 do
3:   x_j = (w_j − u_{j,j+1:n} x_{j+1:n}) / u_{j,j}
4: end for
GEPP encounters an exact 0 pivot iff A is singular.
For banded A, L + U has the same bandwidths as A.

Positive Definite, A = LDL^T
A ∈ R^{n×n} is positive definite (PD) (or semidefinite (PSD)) if x^T Ax > 0 (or x^T Ax ≥ 0) for all x ≠ 0.
When LU-factorizing symmetric A, the result is A = LDL^T; L is unit lower triangular, D is diagonal. A is SPD iff D has all positive entries. The Cholesky factorization is A = LDL^T = LD^{1/2} D^{1/2} L^T = GG^T. It can be computed directly in n^3/3 + O(n^2) flops. If G has an all positive diagonal, A is SPD.
To solve Ax = b for SPD A, factor A = GG^T, solve Gw = b by forward substitution, then solve G^T x = w by backward substitution, which takes n^3/3 + O(n^2) flops in total.
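A numpy sketch of the SPD solve: Cholesky plus the two triangular substitutions written out by hand (the small matrix and right-hand side are made up):

    import numpy as np

    def forward_sub(L, b):
        """Solve L w = b for lower triangular L."""
        n = len(b)
        w = np.zeros(n)
        for i in range(n):
            w[i] = (b[i] - L[i, :i] @ w[:i]) / L[i, i]
        return w

    def backward_sub(U, w):
        """Solve U x = w for upper triangular U."""
        n = len(w)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (w[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
        return x

    A = np.array([[4.0, 2.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]])  # SPD
    b = np.array([2.0, 1.0, 4.0])
    G = np.linalg.cholesky(A)                   # lower triangular, A = G G^T
    x = backward_sub(G.T, forward_sub(G, b))
    print(np.allclose(A @ x, b))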

Complex Numbers
Complex numbers are written z = x + iy ∈ C for i = sqrt(−1). The real part is x = Re(z); the imaginary part is y = Im(z). The conjugate of z is z̄ = x − iy. The absolute value of z is |z| = sqrt(x^2 + y^2). Conjugation commutes with products: conj(A)conj(x) = conj(Ax) and conj(A)conj(B) = conj(AB). The conjugate transpose of A is A^H = conj(A)^T. A ∈ C^{n×n} is Hermitian or self-adjoint if A = A^H. If Q^H Q = I, Q is unitary.

Eigenvalues & Eigenvectors
For A ∈ C^{n×n}, if Ax = λx where x ≠ 0, then x is an eigenvector of A and λ is the corresponding eigenvalue.
Remember, A − λI is singular iff det(A − λI) = 0. With λ as a variable, det(A − λI) is A's characteristic polynomial.
For nonsingular T ∈ C^{n×n}, T^{-1}AT (a similarity transformation) is similar to A. Similar matrices have the same characteristic polynomial and hence the same eigenvalues (though probably different eigenvectors). This relationship is reflexive, transitive, and symmetric.
A is diagonalizable if A is similar to a diagonal matrix D = T^{-1}AT. A's eigenvalues are D's diagonals, and the eigenvectors are the columns of T since AT_{:,i} = D_{i,i} T_{:,i}. A is diagonalizable iff it has n linearly independent eigenvectors.
For symmetric A ∈ R^{n×n}, A is diagonalizable, has all real eigenvalues, and the eigenvectors may be chosen as the columns of an orthogonal matrix Q. A = QDQ^T is the eigendecomposition of A. Further, for symmetric A:
1. The singular values are the absolute values of the eigenvalues.
2. A is SPD (or SPSD) iff its eigenvalues are > 0 (or ≥ 0).
3. For SPD A, the singular values equal the eigenvalues.
4. For B ∈ R^{m×n}, m ≥ n, the singular values of B are the square roots of B^T B's eigenvalues.
For any A ∈ C^{n×n}, the Schur form of A is A = QTQ^H with unitary Q ∈ C^{n×n} and upper triangular T ∈ C^{n×n}.
In this sheet I denote λ_|max| = max_{λ ∈ {λ_1,...,λ_n}} |λ|.
For B ∈ C^{n×n}, lim_{k→∞} B^k = 0 if λ_|max|(B) < 1.

Power Methods for Eigenvalues
x^(k+1) = Ax^(k) converges to the eigenvector belonging to λ_|max|(A).
Once you find an eigenvector x, find the associated eigenvalue λ through the Rayleigh quotient λ = x^(k)T A x^(k) / x^(k)T x^(k).
The inverse shifted power method is x^(k+1) = (A − σI)^{-1} x^(k). If A has eigenpairs (λ_1, u_1), ..., (λ_n, u_n), then (A − σI)^{-1} has eigenpairs (1/(λ_1 − σ), u_1), ..., (1/(λ_n − σ), u_n).
Factor A = QHQ^T where H is upper Hessenberg. To do so, find successive Householder reflections H_1, H_2, ... that zero out rows 3 and lower of column 1, rows 4 and lower of column 2, etc. Then Q = H_1 ··· H_{n−2}.
Shifted QR iteration (each A^(k) is similar to A by an orthogonal transformation):
1: A^(0) = A
2: for k = 0, 1, 2, ... do
3:   Set A^(k) − σ_k I = Q^(k) R^(k)
4:   A^(k+1) = R^(k) Q^(k) + σ_k I
5: end for
Perhaps choose σ_k as eigenvalues of submatrices of A^(k).
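A minimal power-iteration sketch in numpy, with the Rayleigh quotient recovering the eigenvalue (the symmetric test matrix is chosen arbitrarily):

    import numpy as np

    A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]])
    x = np.random.rand(3)
    for _ in range(200):
        x = A @ x
        x /= np.linalg.norm(x)          # normalize to avoid overflow
    lam = x @ A @ x / (x @ x)           # Rayleigh quotient
    print(lam, np.max(np.abs(np.linalg.eigvalsh(A))))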

Arnoldi and Lanczos
Given A ∈ R^{n×n} and a unit length q_1 ∈ R^n, output Q, H such that A = QHQ^T. Use Lanczos for symmetric A.
Arnoldi:
1: for k = 1 : n − 1 do
2:   q̃_{k+1} = A q_k
3:   for l = 1 : k do
4:     H(l, k) = q_l^T q̃_{k+1}
5:     q̃_{k+1} = q̃_{k+1} − H(l, k) q_l
6:   end for
7:   H(k+1, k) = ||q̃_{k+1}||_2
8:   q_{k+1} = q̃_{k+1} / H(k+1, k)
9: end for
Lanczos:
1: β_0 = ||w_0||_2
2: for k = 1, 2, ... do
3:   q_k = w_{k−1} / β_{k−1}
4:   u_k = A q_k
5:   v_k = u_k − β_{k−1} q_{k−1}
6:   α_k = q_k^T v_k
7:   w_k = v_k − α_k q_k
8:   β_k = ||w_k||_2
9: end for
For Lanczos, the α_k and β_k are the diagonal and subdiagonal entries of the Hermitian tridiagonal T_k; in Arnoldi we get H instead. After very few iterations of either method, the eigenvalues of T_k and H will be excellent approximations to the "extreme" eigenvalues of A.
For k iterations, Arnoldi is O(nk^2) time and O(nk) space; Lanczos is O(nk) + k·M time (M is the time for a matrix-vector multiplication) and O(nk) space, or O(n + k) space if old q_k's are discarded.

Iterative Methods for Ax = b
Useful for sparse A where GE would cause fill-in.
In the splitting method, write A = M − N where Mv = c is easily solvable. Then iterate x^(k+1) = M^{-1}(Nx^(k) + b). If it converges, the limit point x* is a solution to Ax = b.
The error is e^(k) = (M^{-1}N)^k e^(0), so splitting methods converge if λ_|max|(M^{-1}N) < 1.
In the Jacobi method, take M to be the diagonal of A. This will fail if A has any zero diagonal entries.

Conjugate Gradient
Conjugate gradient iteratively solves Ax = b for SPD A. It is derived from Lanczos and takes advantage of the fact that if A is SPD then T is SPD. It produces the exact solution after n iterations. Time per iteration is O(n) + M.
1: x^(0) = arbitrary (0 is okay)
2: r_0 = b − Ax^(0)
3: p_0 = r_0
4: for k = 0, 1, 2, ... do
5:   α_k = (r_k^T r_k) / (p_k^T A p_k)
6:   x^(k+1) = x^(k) + α_k p_k
7:   r_{k+1} = r_k − α_k A p_k
8:   β_{k+1} = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
9:   p_{k+1} = r_{k+1} + β_{k+1} p_k
10: end for
The error is reduced by a factor of (sqrt(κ(A)) − 1) / (sqrt(κ(A)) + 1) per iteration; thus for κ(A) = 1, CG converges after one iteration. To speed up CG, use a preconditioner M such that κ(MA) ≪ κ(A) and solve MAx = Mb instead.
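The CG loop above translates almost line for line into numpy (a sketch: no preconditioner, a fixed iteration count, and an arbitrary SPD test matrix):

    import numpy as np

    def cg(A, b, iters=50):
        """Plain conjugate gradient for SPD A, following the pseudocode above."""
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rs = r @ r
        for _ in range(iters):
            Ap = A @ p
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    M = np.random.rand(6, 6)
    A = M @ M.T + 6 * np.eye(6)          # an SPD test matrix
    b = np.random.rand(6)
    print(np.linalg.norm(A @ cg(A, b) - b))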

Multivariate Calculus
Provided f : R^n → R, the gradient is ∇f = (∂f/∂x_1, ..., ∂f/∂x_n)^T and the Hessian ∇^2 f is the n×n matrix with (i, j) entry ∂^2 f / (∂x_i ∂x_j). If f is C^2 (all second partials are continuous), ∇^2 f is symmetric.
The Taylor expansion for f is f(x + h) = f(x) + h^T ∇f(x) + (1/2) h^T ∇^2 f(x) h + O(||h||^3).
Provided f : R^n → R^m, the Jacobian is the m×n matrix ∇f with (i, j) entry ∂f_i/∂x_j. f's Taylor expansion is f(x + h) = f(x) + ∇f(x) h + O(||h||^2).

Nonlinear Solving
Given f : R^n → R^m, we want x such that f(x) = 0.
In fixed point iteration, we choose g : R^n → R^n and iterate x^(k+1) = g(x^(k)). If it converges to x*, then g(x*) − x* = 0. By g(x^(k)) = g(x*) + ∇g(x*)(x^(k) − x*) + O(||x^(k) − x*||^2), for small e^(k) = x^(k) − x* we ignore the last term. If ∇g(x*) has λ_|max| < 1, then x^(k) → x* with ||e^(k)|| ≤ c^k ||e^(0)|| for large k, where c = λ_|max| + ε and ε is the influence of the ignored last term. This indicates a linear rate of convergence.
Suppose ∇g(x*) = QTQ^H with T non-normal, i.e., T's superdiagonal portion is large relative to the diagonal. Then the iteration may not converge, as (∇g(x*))^k initially grows!
In Newton's method, x^(k+1) = x^(k) − (∇f(x^(k)))^{-1} f(x^(k)). This converges quadratically, i.e., ||e^(k+1)|| ≤ c ||e^(k)||^2.
Automatic differentiation takes advantage of the notion that a computer program is nothing but arithmetic operations, and one can apply the chain rule to get the derivative. This may be used to compute Jacobians and Hessians.

Optimization
In continuous optimization, we solve
    min f(x)  s.t.  g(x) = 0,  h(x) ≥ 0
where f : R^n → R is the objective function, g : R^n → R^m holds the equality constraints, and h : R^n → R^p holds the inequality constraints. We did unconstrained optimization, min f(x), in the course.
A ball is a set B(x, r) = {y ∈ R^n : ||x − y|| < r}.
A local minimizer x* is the best in a region, i.e., there exists r > 0 such that f(x*) ≤ f(x) for all x ∈ B(x*, r). A global minimizer is the best local minimizer.
Assume f is C^2. If x* is a local minimizer, then ∇f(x*) = 0 and ∇^2 f(x*) is PSD. Semi-conversely, if ∇f(x*) = 0 and ∇^2 f(x*) is PD, then x* is a local minimizer.

Steepest Descent
Go where the function (locally) decreases most rapidly via x^(k+1) = x^(k) − α_k ∇f(x^(k)); α_k comes from a line search, explained below. SD is stateless: it depends only on the current point. Too slow.

Newton's Method for Unconstrained Min.
Iterate by x^(k+1) = x^(k) − (∇^2 f(x^(k)))^{-1} ∇f(x^(k)), derived by solving for where ∇f(x*) = 0. If ∇^2 f(x^(k)) is PD and ∇f(x^(k)) ≠ 0, the step is a descent direction.
What if the Hessian isn't PD? Use (a) a secant method, (b) a direction of negative curvature h where h^T ∇^2 f(x^(k)) h < 0, stepping along h or −h (doesn't work well in practice), (c) the trust region idea, h = −(∇^2 f(x^(k)) + tI)^{-1} ∇f(x^(k)) (an interpolation of NMUM and SD), or (d) factor ∇^2 f(x^(k)) by Cholesky; when the check for PD detects a nonpositive pivot, modify that diagonal of ∇^2 f(x^(k)) and keep going (unjustified by theory, but works in practice).

Line Search
Line search, given x^(k) and a step h (perhaps derived from SD or NMUM), finds an α > 0 for x^(k+1) = x^(k) + αh.
In exact line search, optimize min_α f(x^(k) + αh). Frowned upon because it's computationally expensive.
In Armijo or backtracking line search, initialize α; while f(x^(k) + αh) > f(x^(k)) + 0.1 α ∇f(x^(k))^T h, halve α.
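A sketch combining Newton's method for unconstrained minimization with the Armijo rule above, on a made-up smooth objective (the 0.1 and halving constants follow the text):

    import numpy as np

    def f(x):                     # hypothetical objective: quadratic plus a quartic bump
        return 0.5 * x @ x + 0.25 * (x[0] - 1.0)**4

    def grad(x):
        g = x.copy()
        g[0] += (x[0] - 1.0)**3
        return g

    def hess(x):
        H = np.eye(2)
        H[0, 0] += 3.0 * (x[0] - 1.0)**2
        return H

    x = np.array([3.0, -2.0])
    for _ in range(20):
        g = grad(x)
        h = -np.linalg.solve(hess(x), g)                    # Newton step
        alpha = 1.0
        while f(x + alpha * h) > f(x) + 0.1 * alpha * g @ h:
            alpha *= 0.5                                    # Armijo backtracking
        x = x + alpha * h
    print(x, np.linalg.norm(grad(x)))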

Secant or quasi-Newton methods use an approximate, always PD, ∇^2 f. In Broyden-Fletcher-Goldfarb-Shanno (BFGS):
1: B_0 = initial approximate Hessian {OK to use I.}
2: for k = 0, 1, 2, ... do
3:   s_k = −B_k^{-1} ∇f(x^(k))
4:   x^(k+1) = x^(k) + α_k s_k   {Use a special line search for α_k!}
5:   y_k = ∇f(x^(k+1)) − ∇f(x^(k))
6:   B_{k+1} = B_k + (y_k y_k^T) / (α_k y_k^T s_k) − (B_k s_k s_k^T B_k) / (s_k^T B_k s_k)
7: end for
By maintaining B_k in factored form, one can iterate in O(n^2) flops. B_k stays SPD provided s_k^T y_k > 0 (use the line search to increase α_k if needed). The secant condition α_k B_{k+1} s_k = y_k holds. If BFGS converges, it converges superlinearly.

Non-linear Least Squares
For g : R^n → R^m, m ≥ n, we want the x minimizing ||g(x)||_2.
In the Gauss-Newton method, x^(k+1) = x^(k) − h where h = (∇g(x)^T ∇g(x))^{-1} ∇g(x)^T g(x). Note that h is the solution to the linear least squares problem min ||∇g(x^(k)) h − g(x^(k))||_2. GN is derived by applying NMUM to g(x)^T g(x) and dropping a resulting tensor term (the derivative of the Jacobian). You keep the quadratic convergence when g(x*) = 0, since the tensor term → 0 as k → ∞.

Ordinary Differential Equations
An ODE (or PDE) has one (or multiple) independent variables. In initial value problems, given dy/dt = f(y, t), y(t) ∈ R^n, and y(0) = y_0, we want y(t) for t > 0. Examples include:
1. Exponential growth/decay, dy/dt = ay, with closed form y(t) = y_0 e^{at}. Growth if a > 0, decay if a < 0.
2. Ecological models, dy_i/dt = f_i(y_1, ..., y_n, t) for species i = 1, ..., n; y_i is a population size, and f_i encodes the species relationships.
3. Mechanics, e.g. wall-spring-block models with F = ma (a = d^2x/dt^2) and F = −kx, so d^2x/dt^2 = −kx/m. This yields d[x, v]/dt = [v, −kx/m] with y_0 the initial position and velocity.
For stability of an ODE, let dy/dt = Ay for A ∈ C^{n×n}. The stable, neutrally stable, or unstable case is where max_i Re(λ_i(A)) < 0, = 0, or > 0 respectively.
In finite difference methods, approximate y(t) by discrete points y_0 (given), y_1, y_2, ... so that y_k ≈ y(t_k) for increasing t_k.
For many IVPs and FDMs, if the local truncation error (the error at each step) is O(h^{p+1}), the global truncation error (the error overall) is O(h^p). Call p the order of accuracy. To find p, substitute the exact solution into the FDM formula, insert a remainder term +R on the RHS, use a Taylor series expansion, solve for R, and keep only the leading term.
In Euler's method, approximate y' by the finite difference (y_{k+1} − y_k)/h_k to get y_{k+1} = y_k + f(y_k, t_k) h_k, where h_k = t_{k+1} − t_k is the step size. p = 1, very low. Explicit!
A stiff problem has widely ranging time scales in the solution, e.g., a transient initial velocity that in the true solution disappears immediately, chemical reaction rates that vary over temperature, or transients in electrical circuits. An explicit method requires h_k to be on the smallest scale!
Backward Euler has y_{k+1} = y_k + h f(y_{k+1}, t_{k+1}). BE is implicit (y_{k+1} appears on the RHS). If the original problem is stable, any h will work!

Miscellaneous
Σ_{k=1..n} k^p = n^{p+1}/(p+1) + O(n^p).
For ax^2 + bx + c = 0, the roots are r_1, r_2 = (−b ± sqrt(b^2 − 4ac)) / (2a), and r_1 r_2 = c/a.
Exact arithmetic is slow and futile for inexact observations; numerical analysis relies on approximate arithmetic.
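Returning to the ODE section above, a tiny sketch contrasting forward and backward Euler on the stiff scalar decay problem dy/dt = ay with a = −1000 (numbers made up; for this linear problem the backward Euler update has a closed form):

    import numpy as np

    a, h, steps = -1000.0, 0.01, 10     # step far above the problem's fast time scale
    y_fe, y_be = 1.0, 1.0
    for _ in range(steps):
        y_fe = y_fe + h * a * y_fe      # forward Euler: factor 1 + h*a = -9, blows up
        y_be = y_be / (1.0 - h * a)     # backward Euler: solve y_new = y + h*a*y_new
    print(y_fe, y_be)                   # ~3.5e9 vs ~4e-11; true value e^{-100} is ~0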