
1 Useful Background Information

In this section of the notes, various definitions and results from calculus, linear algebra, and least-squares regression will be summarized. I will refer to these items at various times during the semester.

1.1 Taylor Series

1. Let η^(k)(x) denote the kth derivative of function η(x). For function η and x0 in some interval I, define
$$P_n(x, x_0) = \eta(x_0) + \eta^{(1)}(x_0)(x - x_0) + \eta^{(2)}(x_0)\frac{(x - x_0)^2}{2!} + \cdots + \eta^{(n)}(x_0)\frac{(x - x_0)^n}{n!}$$
$$R_n(x, c) = \eta^{(n+1)}(c)\,\frac{(x - c)^{n+1}}{(n+1)!}.$$

Then, there exists some number z between x and x0 such that

η(x) = Pn(x, x0) + Rn(x, z)

2. Taylor Series for functions of one variable: If η is a function that has derivatives of all orders throughout an interval I containing x0 and if lim_{n→∞} Rn(x, x0) = 0 for every x0 in I, then η(x) can be represented by the Taylor series about x0 for any x0 in I. That is,

$$\eta(x) = \eta(x_0) + \sum_{k=1}^{\infty} \eta^{(k)}(x_0)\,\frac{(x - x_0)^k}{k!}$$

3. Note that Pn(x, x0) is a polynomial of degree n. Thus, Pn(x, x0) is an nth-order Taylor series approximation of η(x) because Rn(x, x0) vanishes as n increases.

4. Practically, this means that even if the true form of η(x) is unknown, we can use a polynomial f(x) = Pn(x, x0) to approximate it, with the approximation improving as n increases.

5. In statistics, we may fit a linear model

$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n.$$

What we are actually doing is fitting

$$f(x) = P_n(x, 0) = \eta(0) + \eta^{(1)}(0)\,x + \frac{\eta^{(2)}(0)}{2!}\,x^2 + \cdots + \frac{\eta^{(n)}(0)}{n!}\,x^n$$

where β0 = η(0) and βi = η^(i)(0)/i! for i = 1, 2, . . . , n, and we assume the remainder Rn(x, 0) is negligible.
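A quick numerical sketch of this idea (my own example, not part of the original notes): approximate η(x) = e^x by its Taylor polynomial about x0 = 0, whose coefficients are η^(i)(0)/i! = 1/i!, and watch the remainder shrink as the order n grows.

```python
# Sketch: Taylor polynomial approximation of eta(x) = exp(x) about x0 = 0.
# Coefficients are eta^(i)(0)/i! = 1/i!, so P_n(x, 0) = sum_{i=0}^{n} x^i / i!.
import math

def taylor_poly_exp(x, n):
    """n-th order Taylor polynomial of exp(x) about 0."""
    return sum(x**i / math.factorial(i) for i in range(n + 1))

x = 1.5
for n in (1, 2, 4, 8):
    approx = taylor_poly_exp(x, n)
    print(f"n = {n}: P_n = {approx:.6f}, remainder = {math.exp(x) - approx:.2e}")
# The remainder R_n(x, 0) shrinks toward 0 as n increases, as item 3 claims.
```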

6. Taylor series can be generalized to higher dimensions. I will only review the 2-dimensional case.

7. For function η(x, y), let ∂^n η/(∂x^k ∂y^{n−k}) be the nth-order partial derivative with differentiation taken k times with respect to x and (n − k) times with respect to y.

8. If η is a function of (x, y) that has partial derivatives of all orders inside a ball B containing p0 and if lim_{n→∞} Rn(p, p0) = 0 for every p0 in B, then η(p) can be represented by the 2-variable Taylor series about p0 for any p0 in B.

9. For function η(x, y) and p0 = (x0, y0) in some open ball B containing p0, define p = (x, y) and

$$\begin{aligned}
P_n(p, p_0) = \eta(p_0) &+ \frac{(x - x_0)}{1!}\left.\frac{\partial \eta}{\partial x}\right|_{p_0} + \frac{(y - y_0)}{1!}\left.\frac{\partial \eta}{\partial y}\right|_{p_0} \\
&+ \frac{(x - x_0)^2}{2!}\left.\frac{\partial^2 \eta}{\partial x^2}\right|_{p_0} + \frac{(x - x_0)(y - y_0)}{1!\,1!}\left.\frac{\partial^2 \eta}{\partial x\,\partial y}\right|_{p_0} + \frac{(y - y_0)^2}{2!}\left.\frac{\partial^2 \eta}{\partial y^2}\right|_{p_0} \\
&+ \cdots \\
&+ \sum_{k=0}^{n-1} \frac{(x - x_0)^k (y - y_0)^{n-1-k}}{k!\,(n-1-k)!}\left.\frac{\partial^{n-1} \eta}{\partial x^k\,\partial y^{n-1-k}}\right|_{p_0}
+ \sum_{k=0}^{n} \frac{(x - x_0)^k (y - y_0)^{n-k}}{k!\,(n-k)!}\left.\frac{\partial^{n} \eta}{\partial x^k\,\partial y^{n-k}}\right|_{p_0}
\end{aligned}$$

$$R_n(p, p^*) = \sum_{k=0}^{n+1} \frac{(x - x_0)^k (y - y_0)^{n+1-k}}{k!\,(n+1-k)!}\left.\frac{\partial^{n+1} \eta}{\partial x^k\,\partial y^{n+1-k}}\right|_{p^*}$$

where p* is a point on the line segment joining p and p0.

10. Taylor Series for functions of two variables: There exists some point pz on the line segment joining p and p0 such that

η(p) = Pn(p, p0) + Rn(p, pz)

11. Note that Pn(p, p0) is a polynomial of degree n in variables x and y. Thus, Pn(p, p0) is an nth-order Taylor series approximation of η(p) because Rn(p, p0) vanishes as n increases.

12. Practically, this means that even if the true form of η(p) is unknown, we can use a polynomial f(p) = Pn(p, p0) to approximate it, with the approximation improving as n increases.

13. In statistics, we may fit a linear model

$$f(x, y) = \sum_{i=0}^{n}\sum_{j=0}^{n-i} \beta_{i,j}\, x^i y^j$$

What we are actually doing is fitting f(x, y) = Pn(p, (0, 0)) where β0,0 = η(0, 0) and
$$\beta_{i,j} = \frac{1}{i!\,j!}\left.\frac{\partial^{i+j} \eta}{\partial x^i\,\partial y^j}\right|_{(0,0)} \quad \text{for } i + j = 1, 2, \ldots, n,$$
and we assume the remainder Rn(p, (0, 0)) is negligible.

14. On the following page: f12 = ∂²f/∂x∂y, f11 = ∂²f/∂x², f22 = ∂²f/∂y².

Thus, ∆ = (∂²f/∂x∂y)² − (∂²f/∂x²)(∂²f/∂y²).

1.2 Matrix Theory Terminology and Useful Results

15. If
$$X = \begin{pmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1k} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2k} \\ x_{31} & x_{32} & x_{33} & \cdots & x_{3k} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nk} \end{pmatrix}$$
then the matrix X′X can be written as

$$X'X = \begin{pmatrix} \sum_{p=1}^{n} x_{p1}^2 & \sum_{p=1}^{n} x_{p1}x_{p2} & \sum_{p=1}^{n} x_{p1}x_{p3} & \cdots & \sum_{p=1}^{n} x_{p1}x_{pk} \\ & \sum_{p=1}^{n} x_{p2}^2 & \sum_{p=1}^{n} x_{p2}x_{p3} & \cdots & \sum_{p=1}^{n} x_{p2}x_{pk} \\ & & \sum_{p=1}^{n} x_{p3}^2 & \cdots & \sum_{p=1}^{n} x_{p3}x_{pk} \\ & \text{symmetric} & & \ddots & \vdots \\ & & & & \sum_{p=1}^{n} x_{pk}^2 \end{pmatrix}$$

16. Transpose of a product of two matrices: (AB)′ = B′A′.

17. Transpose of a product of k matrices: If B = A1A2···Ak−1Ak, then B′ = Ak′Ak−1′···A2′A1′.

18. The trace of a square matrix A, denoted tr(A), is the sum of the diagonal elements of A.

19. For two k-square matrices A and B, tr(A ± B) = tr(A) ± tr(B).

20. Given an m × n matrix A and an n × m matrix B, then tr(AB) = tr(BA).
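These two trace facts are easy to verify numerically. The following is a minimal sketch (it assumes numpy is available; the random matrices and dimensions are my own choices) checking tr(A ± B) = tr(A) ± tr(B) and tr(AB) = tr(BA).

```python
# Sketch: numerical check of the trace identities in items 19 and 20.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))          # two k-square matrices (k = 4)
C = rng.normal(size=(4, 6))          # an m x n matrix
D = rng.normal(size=(6, 4))          # an n x m matrix

print(np.isclose(np.trace(A + B), np.trace(A) + np.trace(B)))  # True
print(np.isclose(np.trace(A - B), np.trace(A) - np.trace(B)))  # True
print(np.isclose(np.trace(C @ D), np.trace(D @ C)))            # True
```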

21. The rank of a matrix A, denoted rank(A), is the number of linearly independent rows (or columns) of A.

22. If the determinant is nonzero for at least one matrix formed from r rows and r columns of matrix A but no matrix formed from r + 1 rows and r + 1 columns of A has nonzero determinant, then the rank of A is r.

23. Consider a k-square matrix A with rank(A) = k. The k-square matrix A⁻¹ where AA⁻¹ = A⁻¹A = Ik is called the inverse matrix of A.

24. A k-square matrix A is singular if A is not invertible. This is equivalent to saying |A| = 0 or rank(A) < k.

25. Any nonsingular square matrix (i.e., its determinant ≠ 0) will have a unique inverse.

26. In the use of least squares as an estimation procedure, it is often required to invert matrices which are symmetric. The inverse matrix is also important as a means of solving sets of simultaneous independent linear equations. If the set of equations is not independent, there is no unique solution.

27. The set of k linearly independent equations

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1k}x_k &= g_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2k}x_k &= g_2 \\
&\;\;\vdots \\
a_{k1}x_1 + a_{k2}x_2 + \cdots + a_{kk}x_k &= g_k
\end{aligned}$$

can be written in matrix form as Ax = g. Thus, the solution is x = A⁻¹g.
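In practice one rarely forms A⁻¹ explicitly; a linear solver is used instead. A minimal sketch (numpy, with example values of my own choosing):

```python
# Sketch: solving a set of k linearly independent equations Ax = g.
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])      # nonsingular k-square matrix (k = 3)
g = np.array([3.0, 5.0, 3.0])

x = np.linalg.solve(A, g)            # preferred: avoids forming A^{-1}
x_via_inverse = np.linalg.inv(A) @ g # the textbook formula x = A^{-1} g

print(x, np.allclose(x, x_via_inverse))  # same solution either way
```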

28. If A = diag(a1, a2, . . . , ak) is a diagonal matrix with nonzero diagonal elements a1, a2, . . . , ak, then A⁻¹ = diag(1/a1, 1/a2, . . . , 1/ak) is a diagonal matrix with diagonal elements 1/a1, 1/a2, . . . , 1/ak.

29. If S is a nonsingular symmetric matrix, then (S⁻¹)′ = S⁻¹. Thus, the inverse of a nonsingular symmetric matrix is itself symmetric.

30. A square matrix A is idempotent if A² = A.

31. A nonsingular k-square matrix P is orthogonal if P′ = P⁻¹, or equivalently, PP′ = Ik.

32. Suppose P is a k-square orthogonal matrix, x is a k × 1 vector, and y = Px is a k × 1 vector. The transformation y = Px is called an orthogonal transformation.

33. If y = Px is an orthogonal transformation then y′y = x′P′Px = x′x.
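A short numerical sketch of item 33 (my own construction): build an orthogonal P from a QR factorization and check that y = Px leaves x′x unchanged.

```python
# Sketch: an orthogonal transformation preserves x'x (item 33).
import numpy as np

rng = np.random.default_rng(1)
P, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # Q factor of a QR factorization is orthogonal
x = rng.normal(size=4)
y = P @ x

print(np.allclose(P.T @ P, np.eye(4)))  # P'P = I_k
print(np.isclose(y @ y, x @ x))         # y'y = x'x
```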

1.3 Eigenvalues, Eigenvectors, and Quadratic Forms

34. If A is a k-square matrix and λ is a scalar variable, then A − λIk is called the characteristic matrix of A.

35. The determinant |A − λIk| = h(λ) is called the characteristic function of A.

36. The roots of the equation h(λ) = 0 are called the characteristic roots or eigenvalues of A.

37. Suppose λ* is an eigenvalue of a k-square matrix A; then an eigenvector associated with λ* is defined as a column vector x which is a solution to Ax = λ*x or (A − λ*Ik)x = 0.

38. An important use of eigenvalues and eigenvectors in response surface methodology is in the application to problems of finding optimum experimental conditions.

39. The quadratic form in k variables x1, x2, . . . , xk is
$$Q = \sum_{i=1}^{k} b_{ii}x_i^2 + 2\mathop{\sum\sum}_{i<j} b_{ij}x_i x_j \qquad (1)$$

where we assume the elements bij (i = 1, . . . , k; j = 1, . . . , k) are real-valued.

40. In matrix notation: Q = x′Bx where
$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_k \end{pmatrix} \qquad B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1k} \\ & b_{22} & \cdots & b_{2k} \\ & \text{symmetric} & \ddots & \vdots \\ & & & b_{kk} \end{pmatrix}$$

41. B and |B| are, respectively, called the matrix and determinant of the quadratic form Q.

42. If λ1, λ2, . . . , λk are the eigenvalues of the symmetric matrix B, then there exists an orthogonal transformation x = Pw with w = (w1, w2, . . . , wk)′ such that the quadratic form Q = x′Bx is transformed to the canonical form
$$Q = \lambda_1 w_1^2 + \lambda_2 w_2^2 + \cdots + \lambda_k w_k^2 = w'\Lambda w \qquad (2)$$

where Λ = diag(λ1, λ2, . . . , λk). That is, the quadratic form Q can be transformed to one whose matrix Λ is diagonal with diagonal elements equal to the eigenvalues of B. A manipulation of this type is extremely useful in describing the nature of a response surface and locating regions of optimum conditions.
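The canonical form in (2) can be obtained numerically from an eigendecomposition. A minimal sketch (the symmetric B and the point w are my own choices): the columns of P are eigenvectors of B, and x = Pw turns x′Bx into w′Λw.

```python
# Sketch: reducing a quadratic form Q = x'Bx to its canonical form w' Lambda w.
import numpy as np

B = np.array([[4.0, 1.0],
              [1.0, 3.0]])                  # symmetric matrix of the quadratic form
lam, P = np.linalg.eigh(B)                  # eigenvalues and orthogonal eigenvector matrix

w = np.array([0.7, -1.2])                   # any point in the transformed coordinates
x = P @ w                                   # orthogonal transformation x = P w

Q_original  = x @ B @ x                     # x'Bx
Q_canonical = np.sum(lam * w**2)            # lambda_1 w_1^2 + lambda_2 w_2^2

print(np.isclose(Q_original, Q_canonical))  # True
```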

43. The rank of a quadratic form Q = x′Bx is defined to be rank(B) = the number of nonzero eigenvalues of B.

44. An indefinite quadratic form Q = x′Bx is one whose canonical form given in (2) contains both positive and negative coefficients, or equivalently, B has both positive and negative eigenvalues.

45. Suppose B is full rank (rank(B) = k).

(a) If all eigenvalues are positive, then the quadratic form Q is positive definite.

(b) If all eigenvalues are negative, then the quadratic form Q is negative definite.

46. Suppose B is less than full rank (rank(B) < k). That is, suppose at least one eigenvalue is zero.

(a) If all nonzero eigenvalues are positive, then the quadratic form Q is positive semidefinite.

(b) If all nonzero eigenvalues are negative, then the quadratic form Q is negative semidefinite.

47. The sign of a quadratic form Q = x′Bx and the quadratic form type (i.e., positive definite, negative definite, etc.) are linked in the following way:

(a) An indefinite quadratic form is positive for some (x1, x2, . . . , xk), and negative for others.

(b) A positive definite quadratic form is positive for all (x1, x2, . . . , xk) ≠ (0, . . . , 0).

(c) A negative definite quadratic form is negative for all (x1, x2, . . . , xk) ≠ (0, . . . , 0).

(d) A positive semidefinite quadratic form is nonnegative (≥ 0) for all real values of x1, x2, . . . , xk.

(e) A negative semidefinite quadratic form is nonpositive (≤ 0) for all real values of x1, x2, . . . , xk.

48. All of these definitions also apply to the symmetric matrix B in the quadratic form Q.

49. Theorem 1: If X is an n × p matrix (p < n) with rank(X) = p (i.e., full column rank), then the p × p matrix X′X is positive definite and the n × n matrix XX′ is positive semidefinite.
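Theorem 1 is easy to check numerically: for a random full-column-rank X, all eigenvalues of X′X are positive, while XX′ has exactly p nonzero eigenvalues. A sketch (random X with dimensions of my own choosing):

```python
# Sketch: numerical check of Theorem 1 with a random full-column-rank X.
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3
X = rng.normal(size=(n, p))                 # rank(X) = p with probability 1

eig_XtX = np.linalg.eigvalsh(X.T @ X)       # p eigenvalues, all positive
eig_XXt = np.linalg.eigvalsh(X @ X.T)       # n eigenvalues, n - p of them (numerically) zero

print(np.all(eig_XtX > 0))                  # True: X'X is positive definite
print(np.sum(eig_XXt > 1e-10))              # p nonzero eigenvalues: XX' is positive semidefinite
```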

1.4 Matrix Differentiation

50. The column vector of partial derivatives of f(z) with respect to z is
$$\partial f/\partial z = \begin{pmatrix} \partial f/\partial z_1 \\ \partial f/\partial z_2 \\ \vdots \\ \partial f/\partial z_k \end{pmatrix} \quad \text{where} \quad z = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_k \end{pmatrix}.$$

51. ∂f/∂z′ is the row vector of partial derivatives.

52. Rule 1: If a = (a1, a2, . . . , ak)′ is a column vector of k constants, and if f(z) = a′z = Σ aizi, then

∂(a′z)/∂z = a

53. Rule 2: If f(z) = z′z = Σ zi², then

∂(z′z)/∂z = 2z

54. Rule 3: If f(z) = z′Bz for a k-square matrix B, then

∂(z′Bz)/∂z = (B + B′)z

55. Rule 4: If B is a symmetric k-square matrix, then by Rule 3:

∂(z′Bz)/∂z = 2Bz
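The four differentiation rules can be checked against numerical gradients. A small sketch (the finite-difference helper and the random a, B, z are my own, not from the notes):

```python
# Sketch: finite-difference check of Rules 1-4 for matrix differentiation.
import numpy as np

def numerical_gradient(f, z, h=1e-6):
    """Central-difference approximation to the column vector df/dz."""
    g = np.zeros_like(z)
    for i in range(len(z)):
        e = np.zeros_like(z)
        e[i] = h
        g[i] = (f(z + e) - f(z - e)) / (2 * h)
    return g

rng = np.random.default_rng(3)
k = 4
a = rng.normal(size=k)
B = rng.normal(size=(k, k))                 # not necessarily symmetric
S = B + B.T                                 # a symmetric matrix for Rule 4
z = rng.normal(size=k)

print(np.allclose(numerical_gradient(lambda z: a @ z, z), a))                 # Rule 1
print(np.allclose(numerical_gradient(lambda z: z @ z, z), 2 * z))             # Rule 2
print(np.allclose(numerical_gradient(lambda z: z @ B @ z, z), (B + B.T) @ z)) # Rule 3
print(np.allclose(numerical_gradient(lambda z: z @ S @ z, z), 2 * S @ z))     # Rule 4
```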

1.5 Means, Variances, and Covariances

56. Let y1, y2, . . . , yk be k random variables whose means are given by E(yi) = µi for i = 1, 2, . . . , k. In vector form we write
$$E(y) = E\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{pmatrix} = \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix}$$

That is, the expectation of a vector is the vector of expectations.

57. The same applies more generally to matrices. The expectation of a matrix of random variables is a matrix containing the expected values of the individual random variables.

58. We can use matrix notation to describe the variances and covariances of the elements of a vector of random variables. Suppose the variances of the yi are given by

var(yi) = E[(yi − µi)²] = σi²   for i = 1, 2, . . . , k

and the covariances are given by

cov(yi, yj) = E[(yi − µi)(yj − µj)] = σij   for i, j = 1, 2, . . . , k and i ≠ j.

59. The variance-covariance matrix Σ is the symmetric matrix which contains the variances (σi²) on the main diagonal and the covariances (σij) as the off-diagonal elements. That is,

 2  σ1 σ12 ··· σ1k 2 0  σ ··· σ2k  Σ = E[(y − µ)(y − µ) ] =  2   sym− · · · · · ·  2 metric σk Σ is also referred to as cov(y) or var(y).

60. If the vector of random variables y = (y1, y2, . . . , yk)′ is jointly normally distributed with mean vector µ and variance-covariance matrix Σ, we write y ∼ N(µ, Σ).

61. For the special case where the random variables are uncorrelated (σij ≡ 0) and have equal variances (σi² ≡ σ²), we write y ∼ N(µ, σ²Ik).

62. Rule E1: If y is a vector of k random variables with E(y) = µ, then E(Ay) = Aµ where A is any n × k matrix of constants.

63. Rule E2: Rule E1 can be generalized to the case of a k × p matrix X of random variables xij. That is, if E(X) = M, then E(AX) = AE(X) = AM where A is any n × k matrix of constants.

64. Rule E3: Let y be a vector of random variables with E(y) = 0 and variance-covariance matrix Σ = σ²Ik. Then, for a real symmetric matrix B,
$$E(y'By) = \sigma^2\,\mathrm{trace}(B).$$

65. Rule E4: Let y be a vector of random variables with E(y) = µ and cov(y) = Σ. If A is an n × k matrix of constants and z = Ay, then
$$\mathrm{cov}(z) = \mathrm{cov}(Ay) = A\Sigma A'$$

66. Rule E5: A special case of Rule E4 is the situation of finding var(a′y) = var(Σ aiyi), the variance of a linear combination of random variables. The variance is given by the quadratic form
$$\mathrm{var}(a'y) = a'\Sigma a = \sum_{i=1}^{k} a_i^2\sigma_i^2 + 2\mathop{\sum\sum}_{i<j} a_i a_j \sigma_{ij}.$$
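Rule E5 can be illustrated by simulation. A minimal sketch (numpy multivariate normal draws; the Σ, µ, and a below are my own example values):

```python
# Sketch: Monte Carlo check of Rule E5, var(a'y) = a' Sigma a.
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])                     # variance-covariance matrix
a = np.array([1.0, -2.0, 0.5])

y = rng.multivariate_normal(mu, Sigma, size=200_000)    # 200,000 draws of the vector y
sample_var = np.var(y @ a)                              # empirical var(a'y)
theory_var = a @ Sigma @ a                              # quadratic form a' Sigma a

print(sample_var, theory_var)   # the two agree up to Monte Carlo error
```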

1.6 Least Squares

67. Assume the response of interest y can be approximated by a low-order polynomial f(x1, x2, . . . , xk) where x1, x2, . . . , xk are k independent variables. Suppose that n experimental runs are taken for various combinations of the x's which were determined by the experimenter. The data are written in the form
$$\begin{array}{cccccc}
y_1 & x_{11} & x_{21} & x_{31} & \cdots & x_{k1} \\
y_2 & x_{12} & x_{22} & x_{32} & \cdots & x_{k2} \\
y_3 & x_{13} & x_{23} & x_{33} & \cdots & x_{k3} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
y_n & x_{1n} & x_{2n} & x_{3n} & \cdots & x_{kn}
\end{array}$$
where n > k. The plan of experimental levels of the x's is called the experimental design.

68. The approximating model assumed by the experimenter can be written as

yi = β0 + β1x1i + β2x2i + ··· + βkxki + εi   (i = 1, 2, . . . , n)

where εi is a random variable. It is assumed that the εi are independent from run to run and εi ∼ (0, σ²).

69. In matrix form we can write y = Xβ + ε where
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} \qquad X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ 1 & x_{13} & x_{23} & \cdots & x_{k3} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix} \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

and E(ε) = 0 and cov(ε) = σ²In. This model is referred to as the general linear model.

70. The general linear model can be applied to polynomial models of degree higher than one. For example, suppose the assumed model is quadratic in two variables x1 and x2. That is, the response for the ith run involving x1i and x2i is given by

yi = β0 + β1x1i + β2x2i + β11x1i² + β22x2i² + β12x1ix2i + εi

where i = 1, 2, . . . , n with n ≥ 6. For this example
$$\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_{11} \\ \beta_{22} \\ \beta_{12} \end{pmatrix} \quad \text{and} \quad X = \begin{pmatrix} 1 & x_{11} & x_{21} & x_{11}^2 & x_{21}^2 & x_{11}x_{21} \\ 1 & x_{12} & x_{22} & x_{12}^2 & x_{22}^2 & x_{12}x_{22} \\ 1 & x_{13} & x_{23} & x_{13}^2 & x_{23}^2 & x_{13}x_{23} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} & x_{1n}^2 & x_{2n}^2 & x_{1n}x_{2n} \end{pmatrix}$$

71. Given the design matrix X and a vector y of responses, the method of least squares yields an estimate b of β which minimizes L, the sum of squares of the errors (or deviations) of the observed responses from the estimated values:

$$L = \sum_{i=1}^{n} e_i^2 = e'e \qquad \text{where } e_i = y_i - x_i'b$$
with xi′ the ith row of X, or, equivalently,

L = (y − Xb)′(y − Xb) = y′y − 2b′X′y + b′X′Xb.

72. To find the b which minimizes L, we first note that X′X is symmetric and use the differentiation rules (Rule 1 and Rule 4 in Section 1.4):

∂L/∂b = −2X′y + 2(X′X)b.

Setting the partial derivatives to 0 and solving for b yields (X′X)b = X′y. These equations are called the normal equations.

73. Assuming X′X is nonsingular, we have the least squares estimator

b = (X′X)⁻¹X′y.

74. E(b) = β. That is, the least squares estimator b = (X′X)⁻¹X′y is unbiased, or equivalently, each element in b is unbiased for the parameter it is estimating.

75. In the development of experimental designs for response surface methodology, it is important to investigate the effect of the design on the variance-covariance matrix of b:

cov(b) = E[(b − β)(b − β)′] = σ²(X′X)⁻¹.

This implies that the variances of the estimators in b are given by the main diagonal elements of (X′X)⁻¹ multiplied by σ², and the covariances between elements of b are the off-diagonal elements of (X′X)⁻¹ multiplied by σ².
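A small simulation sketch of items 73 through 75 (the model, sample size, and σ are my own choices): fit b by least squares repeatedly and compare its empirical mean and covariance with β and σ²(X′X)⁻¹.

```python
# Sketch: least squares estimation, unbiasedness, and cov(b) = sigma^2 (X'X)^{-1} by simulation.
import numpy as np

rng = np.random.default_rng(5)
n = 30
x1 = rng.uniform(-1, 1, size=n)
x2 = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x1, x2])   # design matrix with an intercept column
beta = np.array([2.0, 1.0, -0.5])
sigma = 0.4

XtX_inv = np.linalg.inv(X.T @ X)
bs = []
for _ in range(5000):                       # repeat the experiment with new errors
    y = X @ beta + rng.normal(0, sigma, size=n)
    b = XtX_inv @ X.T @ y                   # b = (X'X)^{-1} X'y
    bs.append(b)
bs = np.array(bs)

print(bs.mean(axis=0))                      # close to beta (unbiasedness, item 74)
print(np.cov(bs, rowvar=False))             # close to sigma^2 (X'X)^{-1} (item 75)
print(sigma**2 * XtX_inv)
```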

1.7 Hypothesis Testing

76. If the additional assumption is made that εi is normally distributed, that is, ε ∼ N(0, σ²In), then the yi's are also normally distributed as y ∼ N(Xβ, σ²In).

77. This also implies that b ∼ N(β, σ²(X′X)⁻¹).

78. Sums of squares in the regression: Let Jn be an n × 1 vector of ones.

(a) Total Sum of Squares:

$$S_{yy} = SS_T = y'y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^{\!2} = y'\left[I_n - \frac{1}{n}J_nJ_n'\right]y$$

(b) Regression Sum of Squares:

$$SS_R = b'X'y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^{\!2} = y'\left[X(X'X)^{-1}X' - \frac{1}{n}J_nJ_n'\right]y$$

(c) Error Sum of Squares:

$$SS_E = y'y - b'X'y = y'\left[I_n - X(X'X)^{-1}X'\right]y$$

Note: SST = SSR + SSE.

79. ANOVA Table for significance of the regression:

Source of        Sum of     Degrees of    Mean
Variation        Squares    Freedom       Square    F0
Regression       SSR        k             MSR       MSR/MSE
Error            SSE        n − k − 1     MSE
Total            Syy        n − 1

where k = the number of parameters in the model excluding the intercept β0.
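The quantities in the table can be computed directly from the matrix formulas in item 78. A sketch with simulated data (scipy is used only for the F p-value; the data-generating setup is my own):

```python
# Sketch: building the ANOVA table quantities from the matrix formulas in item 78.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, k = 25, 2
x1 = rng.uniform(-1, 1, size=n)
x2 = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x1, x2])            # k = 2 regressors plus an intercept
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(0, 0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
SS_T = y @ y - y.sum()**2 / n
SS_R = b @ X.T @ y - y.sum()**2 / n
SS_E = y @ y - b @ X.T @ y

MS_R = SS_R / k
MS_E = SS_E / (n - k - 1)
F0 = MS_R / MS_E
p_value = stats.f.sf(F0, k, n - k - 1)               # reference distribution F(k, n - k - 1)

print(np.isclose(SS_T, SS_R + SS_E))                 # SS_T = SS_R + SS_E
print(F0, p_value)
```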

80. To find the expectation E(SSE) note that

$$SS_E = y'\left[I_n - X(X'X)^{-1}X'\right]y = y'Py$$

where P = In − X(X′X)⁻¹X′. Thus, SSE is a quadratic form in the y's. Then

$$E(SS_E) = E(y'Py) = E(\epsilon'P\epsilon) = \sigma^2\,\mathrm{trace}(P) = \sigma^2(n - k - 1)$$

where the second equality uses PX = 0 (so that y′Py = ε′Pε) and the third uses Rule E3. Thus, MSE = SSE/(n − k − 1) is an unbiased estimator of σ². Notationally, we write σ̂² = MSE.

81. Test for significance of the regression: We can write β′ = [β0 | β*′] where β* is the parameter vector excluding β0. To test for the significance of the regression, that is, to test

H0: β* = 0   against   Ha: at least one parameter in β* does not equal 0,

determine the p-value by comparing F0 to the F(k, n − k − 1) distribution, its reference distribution under H0.

82. Let βj (j ≠ 0) be a parameter in β. From item 77 we know that b ∼ N(β, σ²(X′X)⁻¹). Thus, if βj = 0 then bj ∼ N(0, σ²Cjj) where Cjj is the diagonal element of (X′X)⁻¹ corresponding to βj.

83. Tests on individual regression coefficients: To test for the significance of βj, that is, to test

H0: βj = 0   against   Ha: βj ≠ 0,

first note that
$$t_0 = \frac{b_j}{se(b_j)} = \frac{b_j}{\sqrt{\hat{\sigma}^2\, C_{jj}}}$$
follows a t(n − k − 1) distribution if H0: βj = 0 is true. Therefore, the p-value of this test is determined by comparing t0 to the t(n − k − 1) distribution, its reference distribution under H0.
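The same kind of simulated fit can be used to test individual coefficients. A sketch (scipy supplies the t p-value; the data setup mirrors the earlier sketch and β2 is set to 0 on purpose):

```python
# Sketch: t tests on individual regression coefficients (item 83).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, k))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(0, 0.5, size=n)   # beta_2 is truly 0

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b
sigma2_hat = resid @ resid / (n - k - 1)             # MSE, the unbiased estimator of sigma^2

for j in range(1, k + 1):                            # skip the intercept beta_0
    se = np.sqrt(sigma2_hat * XtX_inv[j, j])         # se(b_j) uses C_jj
    t0 = b[j] / se
    p_value = 2 * stats.t.sf(abs(t0), n - k - 1)     # two-sided p-value
    print(f"b_{j} = {b[j]:.3f}, t0 = {t0:.2f}, p = {p_value:.3f}")
```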
