
scalar.mcd -- Fundamental concepts in real Euclidean space: scalar product, magnitude, orthogonality, Gram-Schmidt orthogonalization, projection. Instructor: Nam Sun Wang

In a linear vector space (LVS), there are rules on 1) addition of two vectors and 2) multiplication by a scalar. The resulting vector must also lie in the same LVS. We now impose another rule: multiplication of two vectors to yield a scalar. Real Euclidean Space is a subspace of an LVS where there is a real-valued function (called the scalar product) defined on pairs of vectors x and y with the following four properties.

1. Commutative:        (x, y) = (y, x)
2. Associative:        (λ·x, y) = λ·(x, y)
3. Distributive:       (x + y, z) = (x, z) + (y, z)
4. Positive magnitude: (x, x) > 0 unless x = 0;   in math jargon, (x, x) ≥ 0, and (x, x) = 0 iff x = 0
Note that it is up to us to come up with a definition of the scalar product. Any definition that satisfies the above four properties is a valid one. Not only do we define the scalar product accordingly for different types of vectors, we conveniently adopt different definitions of the scalar product for different applications, even for one type of vector. A scalar product is also known as an inner product or a dot product. We have vectors in a linear vector space if we come up with rules of addition and multiplication by a scalar; furthermore, these vectors are in a real Euclidean space if we also come up with a rule for the multiplication of two vectors.
There are two metrics of a vector: magnitude and direction. They are defined in terms of the scalar product.
Magnitude (also called length or Euclidean norm) of x:   ‖x‖ = sqrt((x, x))
Direction, or the angle θ between two vectors x and y:   cos(θ) = (x, y) / (‖x‖·‖y‖)
Orthogonal means (is defined as) θ = π/2 = 90°, or (x, y) = 0.
Schwarz Inequality.   |(x, y)| ≤ ‖x‖·‖y‖
The Schwarz inequality results from the four properties of the scalar product. It does not depend on what type of vectors is involved or on how the scalar product is defined.
Special case. For linearly dependent vectors, |(x, y)| = ‖x‖·‖y‖.
Triangular Inequality.   ‖x + y‖ ≤ ‖x‖ + ‖y‖
Proof.   ‖x + y‖² = (x + y, x + y) = (x, x) + 2·(x, y) + (y, y)

= ‖x‖² + 2·(x, y) + ‖y‖²
≤ ‖x‖² + 2·‖x‖·‖y‖ + ‖y‖²        (by the Schwarz inequality)
= (‖x‖ + ‖y‖)²
Thus, ‖x + y‖ ≤ ‖x‖ + ‖y‖.
Special case. For orthogonal vectors x and y, the scalar product is 0, i.e., (x, y) = 0. Thus,
‖x + y‖² = ‖x‖² + ‖y‖²    ... Pythagorean theorem (true not just for conventional geometric vectors, but for all vectors in general).

Example -- A column of n real numbers.
Define  (x, y) = Σ_{i=1}^{n} x_i·y_i
Show that this definition is valid by demonstrating that it satisfies the four properties.
Property #1.  (x, y) = (y, x):   (x, y) = Σ_{i=1}^{n} x_i·y_i = Σ_{i=1}^{n} y_i·x_i = (y, x)   ... ok
Property #2.  (λ·x, y) = λ·(x, y):   (λ·x, y) = Σ_{i=1}^{n} λ·x_i·y_i = λ·Σ_{i=1}^{n} x_i·y_i = λ·(x, y)   ... ok
Property #3.  (x + y, z) = (x, z) + (y, z):   (x + y, z) = Σ_{i=1}^{n} (x_i + y_i)·z_i = Σ_{i=1}^{n} x_i·z_i + Σ_{i=1}^{n} y_i·z_i = (x, z) + (y, z)   ... ok
Property #4.  (x, x) > 0 unless x = 0:   (x, x) = Σ_{i=1}^{n} x_i·x_i > 0 unless all elements are 0   ... ok
Thus, the definition is a valid one. Being vectors in real Euclidean space, they automatically inherit all sorts of properties, including the Schwarz inequality, the triangular inequality, the Pythagorean rule, etc.
Schwarz inequality:   |Σ_{i=1}^{n} x_i·y_i| ≤ sqrt(Σ_{i=1}^{n} x_i·x_i) · sqrt(Σ_{i=1}^{n} y_i·y_i)
Is the following definition of a scalar product a valid one?   (x, y) = Σ_{i=1}^{n} w_i·x_i·y_i
What about the following?   (x, y) = x_1·y_1 + x_2·y_2
What about the following?   (x, y) = x_1·y_2 + x_2·y_1
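These property checks can also be spot-checked numerically. The following is a minimal Python/NumPy sketch (not part of the original Mathcad worksheet); the test vectors and the scalar λ are arbitrary values chosen only for illustration.

```python
import numpy as np

def scalar(x, y):
    """Scalar product for columns of n real numbers: (x, y) = sum_i x_i * y_i."""
    return float(np.sum(x * y))

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))   # three random 5-element vectors
lam = 2.7                           # an arbitrary scalar (lambda)

# Property 1: commutative
assert np.isclose(scalar(x, y), scalar(y, x))
# Property 2: associative with a scalar factor
assert np.isclose(scalar(lam * x, y), lam * scalar(x, y))
# Property 3: distributive over vector addition
assert np.isclose(scalar(x + y, z), scalar(x, z) + scalar(y, z))
# Property 4: positive magnitude for a nonzero vector
assert scalar(x, x) > 0

norm = lambda v: np.sqrt(scalar(v, v))
# Schwarz inequality: |(x, y)| <= |x| |y|
assert abs(scalar(x, y)) <= norm(x) * norm(y)
# Triangular inequality: |x + y| <= |x| + |y|
assert norm(x + y) <= norm(x) + norm(y)
```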

Example -- Continuous functions f(t) on 0 ≤ t ≤ 1.
Define  (f, g) = ∫_0^1 f(t)·g(t) dt
Show that this definition is valid by demonstrating that it satisfies the four properties.
Property #1.  (f, g) = (g, f):   (f, g) = ∫_0^1 f(t)·g(t) dt = ∫_0^1 g(t)·f(t) dt = (g, f)   ... ok
Property #2.  (λ·f, g) = λ·(f, g):   (λ·f, g) = ∫_0^1 (λ·f(t))·g(t) dt = λ·∫_0^1 f(t)·g(t) dt = λ·(f, g)   ... ok
Property #3.  (f + g, h) = (f, h) + (g, h):   (f + g, h) = ∫_0^1 (f(t) + g(t))·h(t) dt = ∫_0^1 (f(t)·h(t) + g(t)·h(t)) dt = ∫_0^1 f(t)·h(t) dt + ∫_0^1 g(t)·h(t) dt = (f, h) + (g, h)   ... ok
Property #4.  (f, f) > 0 unless f = 0:   (f, f) = ∫_0^1 f(t)·f(t) dt > 0   ... ok
Thus, the definition is a valid one.
Schwarz inequality:   |∫_0^1 f(t)·g(t) dt| ≤ sqrt(∫_0^1 f(t)·f(t) dt) · sqrt(∫_0^1 g(t)·g(t) dt)
Is the following definition of a scalar product a valid one?   (f, g) = ∫_0^1 w(t)·f(t)·g(t) dt
What about the following, where α and β are within [0, 1], say [0.5, 0.6]?   (f, g) = ∫_α^β f(t)·g(t) dt
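The integral scalar product lends itself to the same kind of numerical spot check. Below is a sketch (not from the worksheet) that evaluates (f, g) = ∫_0^1 f(t)·g(t) dt with scipy.integrate.quad and verifies the Schwarz inequality for two arbitrarily chosen functions.

```python
import numpy as np
from scipy.integrate import quad

def scalar(f, g, a=0.0, b=1.0):
    """Scalar product of two functions: (f, g) = integral_a^b f(t) g(t) dt."""
    value, _ = quad(lambda t: f(t) * g(t), a, b)
    return value

f = lambda t: np.exp(t)        # example functions on [0, 1]
g = lambda t: np.sin(3 * t)

lhs = abs(scalar(f, g))
rhs = np.sqrt(scalar(f, f)) * np.sqrt(scalar(g, g))
print(f"|(f,g)| = {lhs:.6f} <= |f|*|g| = {rhs:.6f}")   # Schwarz inequality holds
```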

Gram-Schmidt Process. Given a set of n linearly independent vectors in real Euclidean space, f_1, f_2, ..., f_n, we can construct a set of n vectors g_1, g_2, ..., g_n that are pair-wise orthogonal.

(g_i, g_j) = 0   for i ≠ j

First, we choose any one of the given vectors f_i to be the first vector. Here we simply choose f_1.
g_1 = f_1
To construct a second vector that is orthogonal to the first vector, we take any one of the remaining given vectors (typically f_2) and subtract from it the component aligned with the first vector g_1. Whatever is left behind has nothing in common with the first vector and, therefore, is orthogonal to the first vector.
g_2 = f_2 − [(f_2, g_1)/(g_1, g_1)]·g_1
To construct a third vector that is orthogonal to the first two vectors already constructed, we take any one of the remaining given vectors (typically f_3) and subtract from it both the component aligned with the first vector g_1 and that aligned with the second vector g_2. Whatever is left behind has nothing in common with the first two vectors and, therefore, is orthogonal to the first two vectors.
g_3 = f_3 − [(f_3, g_1)/(g_1, g_1)]·g_1 − [(f_3, g_2)/(g_2, g_2)]·g_2
g_4 = f_4 − [(f_4, g_1)/(g_1, g_1)]·g_1 − [(f_4, g_2)/(g_2, g_2)]·g_2 − [(f_4, g_3)/(g_3, g_3)]·g_3
We repeat the process:
g_i = f_i − [(f_i, g_1)/(g_1, g_1)]·g_1 − [(f_i, g_2)/(g_2, g_2)]·g_2 − ... − [(f_i, g_{i−1})/(g_{i−1}, g_{i−1})]·g_{i−1} = f_i − Σ_{j=1}^{i−1} [(f_i, g_j)/(g_j, g_j)]·g_j
Finally, the last orthogonal vector is
g_n = f_n − [(f_n, g_1)/(g_1, g_1)]·g_1 − [(f_n, g_2)/(g_2, g_2)]·g_2 − ... − [(f_n, g_{n−1})/(g_{n−1}, g_{n−1})]·g_{n−1} = f_n − Σ_{j=1}^{n−1} [(f_n, g_j)/(g_j, g_j)]·g_j
In general, we generate the orthogonal vectors by the Gram-Schmidt orthogonalization process:
g_i = f_i − Σ_{j=1}^{i−1} [(g_j, f_i)/(g_j, g_j)]·g_j    for i = 1, 2, ..., n
We have an orthonormal basis if the basis vectors are normalized to have a length of unity.

g_i ← g_i / sqrt((g_i, g_i))    so that    (g_i, g_j) = 0 for i ≠ j    and    (g_i, g_i) = 1 for i = j
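A compact numerical sketch of the whole process (not part of the worksheet), written for column vectors with the ordinary dot product; swapping out the scalar() function adapts the same loop to any valid scalar product.

```python
import numpy as np

def gram_schmidt(F, scalar=lambda x, y: float(x @ y), normalize=True):
    """Orthogonalize the columns of F: g_i = f_i - sum_j (f_i, g_j)/(g_j, g_j) * g_j."""
    G = []
    for i in range(F.shape[1]):
        g = F[:, i].copy()
        for gj in G:
            g = g - (scalar(F[:, i], gj) / scalar(gj, gj)) * gj
        if normalize:
            g = g / np.sqrt(scalar(g, g))   # unit length -> orthonormal basis
        G.append(g)
    return np.column_stack(G)

F = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])   # an arbitrary linearly independent basis
G = gram_schmidt(F)
print(np.round(G.T @ G, 10))      # identity matrix: pair-wise orthonormal
```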

Example. All polynomials of degree n or less on −1 ≤ t ≤ 1.
Define the scalar product  (p, q) = ∫_{−1}^{1} p(t)·q(t) dt
Given  f_1(t) = 1,  f_2(t) = t,  f_3(t) = t²,  ...,  f_{n+1}(t) = t^n
The result of the Gram-Schmidt orthogonalization process is a set of polynomials known as the Legendre polynomials, L_1, L_2, ..., L_{n+1}, which form an orthogonal basis for the set of all polynomials of degree n or less. Thus, we have the choice of expressing a continuous function on −1 ≤ t ≤ 1 as a linear combination of the power series (which are not orthogonal) or as a linear combination of the Legendre polynomials (which are orthogonal). Because the basis is defined only for finite dimensions, we can only approximate, not represent exactly, a general continuous function on −1 ≤ t ≤ 1.
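As an illustration (again a sketch, not from the worksheet), running the same Gram-Schmidt loop symbolically with SymPy and the scalar product (p, q) = ∫_{−1}^{1} p(t)·q(t) dt on the power series 1, t, t², t³ reproduces the first few orthogonal polynomials listed next.

```python
import sympy as sp

t = sp.symbols('t')
scalar = lambda p, q: sp.integrate(p * q, (t, -1, 1))   # (p, q) on [-1, 1]

f = [sp.Integer(1), t, t**2, t**3]   # power-series basis 1, t, t^2, t^3
g = []
for fi in f:
    gi = fi - sum(scalar(fi, gj) / scalar(gj, gj) * gj for gj in g)
    g.append(sp.expand(gi))

print(g)   # [1, t, t**2 - 1/3, t**3 - 3*t/5]
```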

g_1 = 1
g_2 = t
g_3 = t² − 1/3
g_4 = t³ − (3/5)·t
⋮
g_{i+1}(t) ∝ [1/(2^i · i!)] · d^i/dt^i (t² − 1)^i    (Rodrigues' formula, up to normalization)

Example. All functions f(t) of the following form on 0 ≤ t ≤ π:
f(t) = Σ_{j=0}^{n} α_j·cos(j·t) + Σ_{j=1}^{n} β_j·sin(j·t)
The following set of 2n+1 vectors provides a basis (not orthogonal on [0, π]).

f_1(t) = 1
f_2(t) = sin(t)        f_3(t) = cos(t)
f_4(t) = sin(2·t)      f_5(t) = cos(2·t)
⋮
f_2n(t) = sin(n·t)     f_{2n+1}(t) = cos(n·t)

Orthogonalization process

g_1 = f_1 = 1
g_2 = f_2 − [(f_2, g_1)/(g_1, g_1)]·g_1 = sin(t) − [∫_0^π 1·sin(t) dt / ∫_0^π 1·1 dt]·1 = sin(t) − 2/π
g_3 = f_3 − [(f_3, g_1)/(g_1, g_1)]·g_1 − [(f_3, g_2)/(g_2, g_2)]·g_2
    = cos(t) − [∫_0^π 1·cos(t) dt / ∫_0^π 1·1 dt]·1 − [∫_0^π (sin(t) − 2/π)·cos(t) dt / ∫_0^π (sin(t) − 2/π)² dt]·(sin(t) − 2/π)
    = cos(t)    (both numerators vanish)

g_4 = f_4 − [(f_4, g_1)/(g_1, g_1)]·g_1 − [(f_4, g_2)/(g_2, g_2)]·g_2 − [(f_4, g_3)/(g_3, g_3)]·g_3
    = sin(2·t) − [∫_0^π 1·sin(2·t) dt / ∫_0^π 1·1 dt]·1 − [∫_0^π (sin(t) − 2/π)·sin(2·t) dt / ∫_0^π (sin(t) − 2/π)² dt]·(sin(t) − 2/π) − [∫_0^π cos(t)·sin(2·t) dt / ∫_0^π cos(t)² dt]·cos(t)
    = sin(2·t) − [8/(3·π)]·cos(t)
The fact that we can apply the Gram-Schmidt orthogonalization process means that we have proven the existence (but not the uniqueness) of an orthogonal or orthonormal basis. In fact, the set of orthogonal basis vectors is not unique. Had we started the orthogonalization process with a different function (say, sin(t)), we would have obtained a different set of orthogonal basis vectors.
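A quick numerical confirmation of these results (a sketch, not part of the worksheet): define g_1 through g_4 as derived above and check that all pairs are orthogonal under (f, g) = ∫_0^π f(t)·g(t) dt.

```python
import numpy as np
from scipy.integrate import quad

scalar = lambda f, g: quad(lambda t: f(t) * g(t), 0, np.pi)[0]

g = [
    lambda t: 1.0,
    lambda t: np.sin(t) - 2 / np.pi,
    lambda t: np.cos(t),
    lambda t: np.sin(2 * t) - 8 / (3 * np.pi) * np.cos(t),
]
for i in range(len(g)):
    for j in range(i + 1, len(g)):
        print(f"(g{i+1}, g{j+1}) = {scalar(g[i], g[j]):+.2e}")   # all ~ 0
```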

Non-Example. Vector = a column of three numbers. Given three basis vectors x<0>, x<1>, x<2>:
x<0> = (1, 0, 0)ᵀ      x<1> = (1, 1, 0)ᵀ      x<2> = (1, 1, 1)ᵀ
Define the scalar product  prod(x, y) = x_0·y_0 + x_1·y_1    (the third elements never enter)
Gram-Schmidt orthogonalization in the order x<0>, x<1>, x<2>

g_0 = x<0> = (1, 0, 0)ᵀ
g_1 = x<1> − [prod(x<1>, g_0)/prod(g_0, g_0)]·g_0 = (1, 1, 0)ᵀ − (1/1)·(1, 0, 0)ᵀ = (0, 1, 0)ᵀ
g_2 = x<2> − [prod(x<2>, g_0)/prod(g_0, g_0)]·g_0 − [prod(x<2>, g_1)/prod(g_1, g_1)]·g_1

    = (1, 1, 1)ᵀ − (1/1)·(1, 0, 0)ᵀ − (1/1)·(0, 1, 0)ᵀ = (0, 0, 1)ᵀ
Check orthogonality                       Check magnitude

prod(g_0, g_1) = 0        prod(g_0, g_0) = 1
prod(g_0, g_2) = 0        prod(g_1, g_1) = 1
prod(g_1, g_2) = 0        prod(g_2, g_2) = 0    ← something is wrong!
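The failure is easy to reproduce; the sketch below (not from the worksheet) runs the same Gram-Schmidt loop with the defective two-term product and shows a nonzero vector whose "magnitude" is zero.

```python
import numpy as np

# Defective "scalar product": ignores the third element (weight w2 = 0),
# which violates property #4: (x, x) = 0 does not imply x = 0.
prod = lambda x, y: x[0] * y[0] + x[1] * y[1]

X = [np.array([1.0, 0.0, 0.0]),
     np.array([1.0, 1.0, 0.0]),
     np.array([1.0, 1.0, 1.0])]

G = []
for x in X:
    g = x.copy()
    for gj in G:
        g = g - prod(x, gj) / prod(gj, gj) * gj
    G.append(g)

print(G[2], "has prod(g2, g2) =", prod(G[2], G[2]))   # [0. 0. 1.] but "magnitude" 0
```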

Gram-Schmidt orthogonalization in the order x<0>, x<2>, x<1>

g_0 = x<0> = (1, 0, 0)ᵀ
g_1 = x<2> − [prod(x<2>, g_0)/prod(g_0, g_0)]·g_0 = (1, 1, 1)ᵀ − (1/1)·(1, 0, 0)ᵀ = (0, 1, 1)ᵀ
g_2 = x<1> − [prod(x<1>, g_0)/prod(g_0, g_0)]·g_0 − [prod(x<1>, g_1)/prod(g_1, g_1)]·g_1

    = (1, 1, 0)ᵀ − (1/1)·(1, 0, 0)ᵀ − (1/1)·(0, 1, 1)ᵀ = (0, 0, −1)ᵀ
Check orthogonality                       Check magnitude > 0

prod(g_0, g_1) = 0        prod(g_0, g_0) = 1
prod(g_0, g_2) = 0        prod(g_1, g_1) = 1
prod(g_1, g_2) = 0        prod(g_2, g_2) = 0    ← something went wrong, although the above process seems ok!

Non-Example of projection (come back here after projection is introduced later in this worksheet).
Task: express y as a linear combination of the basis vectors x<0>, x<1>, x<2>.
y = (1, 1, 1)ᵀ      y = a_0·x<0> + a_1·x<1> + a_2·x<2>:    (1, 1, 1)ᵀ = a_0·(1, 0, 0)ᵀ + a_1·(1, 1, 0)ᵀ + a_2·(1, 1, 1)ᵀ
A quick common-sense inspection of the above equation indicates y = x<2>:  a_0 = 0, a_1 = 0, a_2 = 1.
n := 2    i := 0..n    j := 0..n
XX_{i,j} := prod(x<i>, x<j>)      Xy_i := prod(y, x<i>)      XX·a = Xy
XX = [ 1  1  1 ]      Xy = [ 1 ]      cond2(XX) = 2.368×10^34   ... very large; theoretically infinite.
     [ 1  2  2 ]           [ 2 ]
     [ 1  2  2 ]           [ 2 ]
An inspection of the matrix XX and vector Xy above yields (in a naive way) the following two cases:
Case 1:  a_0 = 0, a_1 = 1, a_2 = 0    but this makes no sense!
or Case 2:  a_0 = 0, a_1 = 0, a_2 = 1
However, mathematically, we have a singular XX: a = XX⁻¹·Xy, and the coefficient vector a cannot be determined

because of the singularity. (Mathcad often fails to raise this red flag.)

Cause of the above nonsense: an invalid definition of the scalar product, which violates Rule #4 by having w_2 = 0. As shown below, the contradiction disappears when a positive weight (albeit very small) is included.

w_2 := 10⁻¹⁰      prod(x, y) := x_0·y_0 + x_1·y_1 + w_2·x_2·y_2    ... include a small weight
XX_{i,j} := prod(x<i>, x<j>)      Xy_i := prod(y, x<i>)      a := XX⁻¹·Xy      a = (0, 0, 1)ᵀ      Now, things make sense!
cond2(XX) = 9.123×10^10   ... very large, but theoretically finite.
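The same experiment can be repeated in NumPy (a sketch, not from the worksheet): with the two-term product the normal matrix XX is exactly singular, and including a tiny third weight restores a unique, sensible answer.

```python
import numpy as np

X = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])   # columns are x<0>, x<1>, x<2>
y = np.array([1.0, 1.0, 1.0])

def normal_solve(w):
    W = np.diag(w)                      # weighted product: prod(u, v) = u^T W v
    XX = X.T @ W @ X
    Xy = X.T @ W @ y
    print("cond(XX) =", np.linalg.cond(XX))
    return np.linalg.solve(XX, Xy)

# With w2 = 0, XX is singular and np.linalg.solve raises LinAlgError:
# normal_solve([1, 1, 0])
print(normal_solve([1, 1, 1e-10]))      # -> approximately [0, 0, 1]
```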

Non-Example. Vector = a continuous function on x = [0, 3]. Given three basis functions f_0, f_1, f_2:
f_0(x) := 1      f_1(x) := 1 + x      f_2(x) := 1 + x + x²      or, in compact notation,  f(x) := (1, 1 + x, 1 + x + x²)ᵀ
Define the scalar product  prod(f, g) := ∫_0^1 x·f(x)·g(x) dx + ∫_2^3 x·f(x)·g(x) dx    i.e., we skip x = [1, 2]
Non-Example of projection (come back here after projection is introduced later in this worksheet).

Task: express y(x) as a linear combination of the basis vectors f_0(x), f_1(x), f_2(x).
y(x) := 1      y = a_0·f_0(x) + a_1·f_1(x) + a_2·f_2(x):    1 = a_0·(1) + a_1·(1 + x) + a_2·(1 + x + x²)
A quick common-sense inspection of the above equation indicates y(x) = f_0(x):  a_0 = 1, a_1 = 0, a_2 = 0.
n := 2    i := 0..n    j := 0..n
P_{i,j} := ∫_0^1 x·f(x)_i·f(x)_j dx + ∫_2^3 x·f(x)_i·f(x)_j dx      F_i := ∫_0^1 x·y(x)·f(x)_i dx + ∫_2^3 x·y(x)·f(x)_i dx
With a casual inspection, the numbers look reasonable:
P = [   3      9.667   26.167 ]      F = [  3     ]      cond2(P) = 2.18×10^4
    [  9.667  32.833   91.733 ]          [  9.667 ]
    [ 26.167  91.733  261.633 ]          [ 26.167 ]
The answer based on the projection idea seems to agree with common sense:  a := P⁻¹·F,   a = (1, 0, 0)ᵀ.
The definition is equivalent to considering the interval x = [0, 3] with w(x) = 0 for x = [1, 2] and w(x) = x elsewhere.
Define the scalar product:  w := 0      prod(f, g) := ∫_0^1 x·f(x)·g(x) dx + ∫_1^2 w·x·f(x)·g(x) dx + ∫_2^3 x·f(x)·g(x) dx

P_{i,j} := ∫_0^1 x·f(x)_i·f(x)_j dx + ∫_1^2 w·x·f(x)_i·f(x)_j dx + ∫_2^3 x·f(x)_i·f(x)_j dx
F_i := ∫_0^1 x·y(x)·f(x)_i dx + ∫_1^2 w·x·y(x)·f(x)_i dx + ∫_2^3 x·y(x)·f(x)_i dx      a := P⁻¹·F      a = (1, 0, 0)ᵀ
The above example of y(x) = 1 seems to be working, but that is misleading. Strictly speaking, the weight w(x) needs to be positive (not zero). Otherwise, we violate Rule #4, that (y, y) = 0 if and only if y = 0. Consider, for example, a function that is nonzero only inside the skipped interval:
y(x) := (1 < x)·(x − 1.5)·(x − 1)·(1.5 > x)
[Plot: y(x) versus x on 0 ≤ x ≤ 3; y(x) is zero everywhere except inside the skipped interval.]
prod(y, y) = 0    but y ≠ 0!

Find the coefficients a based on projection:
P_{i,j} := ∫_0^1 x·f(x)_i·f(x)_j dx + ∫_2^3 x·f(x)_i·f(x)_j dx      F_i := ∫_0^1 x·y(x)·f(x)_i dx + ∫_2^3 x·y(x)·f(x)_i dx
With a casual inspection, the numbers look reasonable:
P = [   3      9.667   26.167 ]      F = [ 0 ]      cond2(P) = 2.18×10^4
    [  9.667  32.833   91.733 ]          [ 0 ]
    [ 26.167  91.733  261.633 ]          [ 0 ]
But the answer based on the projection idea defies common sense:  a := P⁻¹·F,   a = (0, 0, 0)ᵀ.

Non-Example. Here is another example where an invalid definition makes a big difference.
y(x) := exp(−(x − 1.5)²/0.1)      f(x) := (1  x  x²  x³  x⁴  x⁵  x⁶  x⁷  x⁸  x⁹  x¹⁰)ᵀ      n := 10    i := 0..n    j := 0..n
Invalid definition (skips x = [1, 2]):
P_{i,j} := ∫_0^1 x·f(x)_i·f(x)_j dx + ∫_2^3 x·f(x)_i·f(x)_j dx      F_i := ∫_0^1 x·y(x)·f(x)_i dx + ∫_2^3 x·y(x)·f(x)_i dx      a_invalid := P⁻¹·F

a_invalidᵀ = ( 2.084  45.103  318.255  1.07×10³  2.006×10³  2.262×10³  1.591×10³  702.199  188.876  28.293  1.809 )
Valid definition (whole interval x = [0, 3]):
P_{i,j} := ∫_0^3 x·f(x)_i·f(x)_j dx      F_i := ∫_0^3 x·y(x)·f(x)_i dx      a_valid := P⁻¹·F

a_validᵀ = ( 0.215  5.492  36.822  101.793  138.004  98.519  37.299  6.745  0.316  0.035  0 )

[Plot on 0 ≤ x ≤ 3: the given function y(x), the expansion f(x)·a_invalid, and the expansion f(x)·a_valid. Legend: Given function / Invalid Scalar Product / Valid Scalar Product.]

Non-example (because w(x) = x < 0 when x is negative):
prod(f, g) := ∫_{−1}^{1} x·f(x)·g(x) dx
The following orthogonality seems to imply that J0(1·x), J0(2·x), and J0(3·x) are mutually orthogonal:
∫_{−1}^{1} x·J0(1·x)·J0(2·x) dx = 0      ∫_{−1}^{1} x·J0(1·x)·J0(3·x) dx = 0      ∫_{−1}^{1} x·J0(2·x)·J0(3·x) dx = 0
However, the magnitudes are 0, even though the functions are not 0!
∫_{−1}^{1} x·J0(1·x)·J0(1·x) dx = 0      ∫_{−1}^{1} x·J0(2·x)·J0(2·x) dx = 0      ∫_{−1}^{1} x·J0(3·x)·J0(3·x) dx = 0
Thus, the results on projection and orthogonality are wrong! The vector projection idea is not wrong, nor is the orthogonality idea wrong, but the definition of the scalar product is not valid. Basically, if the scalar product definition is invalid, everything we do that depends on the scalar product (such as projection, orthogonality, Gram-Schmidt, magnitude, angle, ...) becomes garbage-in, garbage-out. That is why we have to make sure that we start off correctly.

Gram-Schmidt Process in matrix-vector notation & Cholesky decomposition. Given a set of n linearly independent vectors in real Euclidean space, f_1, f_2, ..., f_n, we construct a set of n vectors g_1, g_2, ..., g_n that are pair-wise orthonormal.
Start with f_1, then normalize:
g_1 = a_11·f_1    where a_11 is the inverse of the magnitude of f_1 that makes g_1 normal.
Add f_2 to the bag to construct g_2 (and normalize):
g_2 = f_2 − [(f_2, g_1)/(g_1, g_1)]·g_1    ⇒    g_2 = a_12·f_1 + a_22·f_2
Add f_3 to the bag to construct g_3:
g_3 = f_3 − [(f_3, g_1)/(g_1, g_1)]·g_1 − [(f_3, g_2)/(g_2, g_2)]·g_2    ⇒    g_3 = a_13·f_1 + a_23·f_2 + a_33·f_3
g_4 = f_4 − [(f_4, g_1)/(g_1, g_1)]·g_1 − [(f_4, g_2)/(g_2, g_2)]·g_2 − [(f_4, g_3)/(g_3, g_3)]·g_3    ⇒    g_4 = a_14·f_1 + a_24·f_2 + a_34·f_3 + a_44·f_4
We repeat the process:
g_n = f_n − Σ_{j=1}^{n−1} [(f_n, g_j)/(g_j, g_j)]·g_j    ⇒    g_n = Σ_{i=1}^{n} a_in·f_i
In compact "matrix-vector" notation,

(g_1  g_2  g_3  g_4  ...  g_n) = (f_1  f_2  f_3  f_4  ...  f_n) · [ a_11  a_12  a_13  a_14  ...  a_1n ]
                                                                   [  0    a_22  a_23  a_24  ...  a_2n ]
                                                                   [  0     0    a_33  a_34  ...  a_3n ]
                                                                   [  0     0     0    a_44  ...  a_4n ]
                                                                   [               ...                 ]
                                                                   [  0     0     0     0    ...  a_nn ]
g = f·A,   where A is an upper triangular matrix.
The g are all mutually orthonormal; thus, the matrix G that has as elements all the scalar products of the g is a symmetrical positive definite matrix (in fact, the identity matrix I). Likewise, the matrix F with all the scalar products of the f is also a symmetrical positive definite matrix.
G = [ (g_1,g_1) (g_1,g_2) ... (g_1,g_n) ] = [ 1 0 ... 0 ] = I        F = [ (f_1,f_1) (f_1,f_2) ... (f_1,f_n) ]
    [ (g_2,g_1) (g_2,g_2) ... (g_2,g_n) ]   [ 0 1 ... 0 ]                [ (f_2,f_1) (f_2,f_2) ... (f_2,f_n) ]
    [              ...                  ]   [    ...    ]                [              ...                  ]
    [ (g_n,g_1) (g_n,g_2) ... (g_n,g_n) ]   [ 0 0 ... 1 ]                [ (f_n,f_1) (f_n,f_2) ... (f_n,f_n) ]
G = (gᵀ, g) = ((f·A)ᵀ, (f·A)) = (Aᵀ·fᵀ, f·A) = Aᵀ·(fᵀ, f)·A = Aᵀ·F·A

[ (g_1,g_1) (g_1,g_2) ... (g_1,g_n) ]   [ a_11   0   ...   0  ]   [ (f_1,f_1) (f_1,f_2) ... (f_1,f_n) ]   [ a_11 a_12 ... a_1n ]
[ (g_2,g_1) (g_2,g_2) ... (g_2,g_n) ] = [ a_12  a_22 ...   0  ] · [ (f_2,f_1) (f_2,f_2) ... (f_2,f_n) ] · [  0   a_22 ... a_2n ] = I
[              ...                  ]   [         ...         ]   [              ...                  ]   [        ...         ]
[ (g_n,g_1) (g_n,g_2) ... (g_n,g_n) ]   [ a_1n  a_2n ... a_nn ]   [ (f_n,f_1) (f_n,f_2) ... (f_n,f_n) ]   [  0    0  ... a_nn ]

G = Aᵀ·F·A = I
Pre-multiply the above expression by (Aᵀ)⁻¹ and post-multiply by A⁻¹. Note that U = A⁻¹ is upper triangular, because A is upper triangular. Likewise, Uᵀ is lower triangular.
F = (Aᵀ)⁻¹·I·A⁻¹
F = (A⁻¹)ᵀ·A⁻¹ = (A·Aᵀ)⁻¹    ⇒    F = L·Lᵀ = L·U,   where U = A⁻¹ and L = Uᵀ = (A⁻¹)ᵀ
The above is a special case of LU decomposition called Cholesky decomposition, for a symmetrical positive definite matrix F.
Summary:
Step 0. Given n linearly independent basis vectors f.
Step 1. Construct the matrix F, where F_ij = (f_i, f_j).
Step 2. L = cholesky(F)  (L·Lᵀ = F in Mathcad); or U = chol(F) (in Matlab).
Step 3. A = (L⁻¹)ᵀ = (Lᵀ)⁻¹ in Mathcad; or A = U⁻¹ in Matlab.

Step 4. Construct g=fA, where g is a row of vectors g1, g2,.., gn; likewise for f. g f.A a 11 a 12 ... a 1n 0 a ... a . 22 2n g 1 g 2 ... g n f 1 f 2 ... f n ......

0 0 ... a nn Step 4 (alternate notation -- not recommended). Construct g=Af, where g is a column of vectors g1, g2,.., gn; likewise for f. g A.f

g 1 f 1 g 1 a 11 0 ... 0 f 1

g 2 f 2 g 2 a 12 a 22 ... 0 f 2 g f ......

g n f n g n a 1n a 2n ... a nn f n Caution: Negative diagonals in A get converted to positive diagonals upon cholesky decomposition. 1 2 3 1 2 3.5 1 2 3.5 1 A 0 1 5 F AT .A 1 L cholesky(F) Check: F = 2 5 9.5 L.LT = 2 5 9.5 0 0 2 1 2 3 3.5 9.5 18.75 3.5 9.5 18.75 T L 1 = 0 1 5 ... not A!. However, this is not a problem, because the whole column A <1> is 0 0 2 sign-flipped, meaning cholesky decomposition gives a sign-flipped basis vector g 1. 14 scalar.mcd
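The four-step recipe is easy to test numerically. The sketch below (not from the worksheet) assumes NumPy's cholesky, which returns the lower-triangular L, orthonormalizes the columns of a random basis via A = (Lᵀ)⁻¹, and checks that G = Aᵀ·F·A is the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=(6, 4))              # Step 0: columns of f are a linearly independent basis
F = f.T @ f                              # Step 1: F_ij = (f_i, f_j), symmetric positive definite
L = np.linalg.cholesky(F)                # Step 2: F = L L^T, L lower triangular
A = np.linalg.inv(L.T)                   # Step 3: A = (L^T)^{-1}, upper triangular
g = f @ A                                # Step 4: g = f A, columns are orthonormal

print(np.allclose(g.T @ g, np.eye(4)))   # G = A^T F A = I  ->  True
```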

Express x in terms of other vectors. We can express any vector x in real Euclidean space as a linear combination of linearly independent basis vectors:
x = Σ_{i=1}^{n} β_i·g_i = β_1·g_1 + β_2·g_2 + ... + β_n·g_n
With the following equation, we find the coefficients β. The above one (vector) equation contains n unknown parameters β_i, i = 1..n. To create the n equations needed to solve for the n unknown parameters, we scalar-product (multiply) the above vector equation with each of the n basis vectors g_i, i = 1..n:
for i = 1, 2, ..., n:   (x, g_i) = (Σ_{j=1}^{n} β_j·g_j, g_i) = Σ_{j=1}^{n} β_j·(g_j, g_i) = β_i·(g_i, g_i) = β_i    ⇒    β_i = (x, g_i) for orthonormal g
There are 4 equal signs in the above equation. We re-examine each step, because the intermediate results are applicable/useful for different cases.
1st = sign ... Substitute x, which is expressed as a linear combination of the basis; apply the scalar product to both sides of the equation.
2nd = sign ... Apply the distributive property of the scalar product (i.e., the scalar product of a sum is the sum of the scalar products). Also apply the associative property of the scalar product to take β_j out of the scalar product.
(g_i, x) = Σ_{j=1}^{n} β_j·(g_j, g_i)   ... for any linearly independent basis vectors g (non-orthogonal as well as orthogonal)
The above equation expanded for each i = 1, 2, ..., n:
(g_i, x) = Σ_{j=1}^{n} (g_i, g_j)·β_j    for i = 1, 2, ..., n

i = 1:   (g_1, x) = (x, g_1) = (g_1, g_1)·β_1 + (g_1, g_2)·β_2 + ... + (g_1, g_n)·β_n
i = 2:   (g_2, x) = (x, g_2) = (g_2, g_1)·β_1 + (g_2, g_2)·β_2 + ... + (g_2, g_n)·β_n

⋮
i = n:   (g_n, x) = (x, g_n) = (g_n, g_1)·β_1 + (g_n, g_2)·β_2 + ... + (g_n, g_n)·β_n
The same equation in a compact notation (each element in the following matrix-vector equation is a scalar):

[ (g_1, x) ]   [ (g_1, g_1)  (g_1, g_2)  ...  (g_1, g_n) ] [ β_1 ]
[ (g_2, x) ] = [ (g_2, g_1)  (g_2, g_2)  ...  (g_2, g_n) ] [ β_2 ]
[    ...   ]   [                  ...                    ] [ ... ]
[ (g_n, x) ]   [ (g_n, g_1)  (g_n, g_2)  ...  (g_n, g_n) ] [ β_n ]
i.e.,   xg = G·β    ⇒    β = G⁻¹·xg
Note that the above "normal" equation appears again and again in applied mathematics. It applies to many fields, including function approximation (Fourier series, Bessel series, ...), data compression, modeling, and of course linear regression, where we express a dependent vector x as a linear combination of n independent vectors g_i, i = 1, 2, ..., n.
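A numerical sketch of this normal equation (not from the worksheet): for a non-orthogonal basis, build G_ij = (g_i, g_j) and xg_i = (x, g_i), solve G·β = xg, and confirm that the expansion reproduces x.

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.normal(size=(5, 5))            # columns g_1..g_n: a non-orthogonal basis of R^5
x = rng.normal(size=5)

G  = g.T @ g                           # G_ij = (g_i, g_j) with the dot-product scalar product
xg = g.T @ x                           # xg_i = (x, g_i)
beta = np.linalg.solve(G, xg)          # solve G beta = xg  (the "normal" equation)

print(np.allclose(g @ beta, x))        # x = sum_i beta_i g_i  ->  True
```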

The same expansion, written as a linear regression problem:
x = Σ_{i=1}^{n} β_i·g_i = β_1·g_1 + β_2·g_2 + ... + β_n·g_n = (g_1  g_2  ...  g_n)·(β_1, β_2, ..., β_n)ᵀ = g·β
where  g = (g_1  g_2  ...  g_n)   and   β = (β_1, β_2, ..., β_n)ᵀ

The above compact notation shows
( [g_1; g_2; ...; g_n], x ) = ( [g_1; g_2; ...; g_n], (g_1  g_2  ...  g_n) ) · [β_1; β_2; ...; β_n]
i.e.,   (gᵀ, x) = (gᵀ, g)·β

If we supply a scalar product definition (f, g) = fᵀ·g, then the two scalar products in the above equation become
gᵀ·x = (gᵀ·g)·β    ⇒    β = (gᵀ·g)⁻¹·gᵀ·x
If we supply a scalar product definition (f, g) = fᵀ·W·g, then the two scalar products in the above equation become
gᵀ·W·x = (gᵀ·W·g)·β    ⇒    β = (gᵀ·W·g)⁻¹·gᵀ·W·x
The above solution is the formula in function interpolation at a finite number of nodes. It is also the normal equation in linear regression. Note that we arrived at it (not even worthy of the words "derived it") simply by applying a scalar product. Note the simplicity. And the equation is independent of the type of vectors or of how the scalar product is defined. The numerical value of β, though, does depend on the definition of the scalar product. Note that we did not explicitly minimize the sum of squares (we did not do least-squares). This is equivalent to the "hand-waving derivation" that does not go through minimization of the sum of squared errors.
Start with  x = g·β,  where g is a collection of basis vectors, g = (g_1  g_2  ...  g_n).
We want to modify "g" into something that we can invert. This is done by taking a scalar product (or, equivalently, multiplying by gᵀ in the linear regression matrix-vector notation):

g T, x g T, g . 1 With a flip of the matrix inverse, we are done!  g T, g . g T, x 16 scalar.mcd

3rd = sign ... For orthogonal basis vectors g, (g_j, g_i) = 0 for i ≠ j and (g_i, g_i) ≠ 0. In other words, the scalar product between any two different basis vectors is zero (the meaning of being "orthogonal"), and the scalar product of a basis vector with itself is non-zero (the meaning of vector magnitude, which is positive for g ≠ 0). Thus, out of the summation of n terms, only one scalar-product term remains. This simplifies the evaluation of β, because each β_i is decoupled from the other β's and can be evaluated independently.
(x, g_i) = β_i·(g_i, g_i)    ⇒    β_i = (x, g_i) / (g_i, g_i)   ... for orthogonal basis vectors g
In the compact matrix-vector notation (actually not so compact anymore, because the above equation is even more compact), orthogonal basis vectors make the off-diagonal elements of the G matrix all 0 and simplify the inverse of G.

[ (x, g_1) ]   [ (g_1, g_1)      0       ...      0      ] [ β_1 ]
[ (x, g_2) ] = [      0      (g_2, g_2)  ...      0      ] [ β_2 ]
[    ...   ]   [                   ...                   ] [ ... ]
[ (x, g_n) ]   [      0          0       ...  (g_n, g_n) ] [ β_n ]
xg = G·β    ⇒    β = G⁻¹·xg    ⇒    β_i = (x, g_i) / (g_i, g_i)    (a scalar divided by a scalar)
The above bold-faced words "evaluated independently" mean that each β_i can be evaluated one at a time in a sequential manner (no matrix inverse, only a scalar inverse). It also means that it matters not which β_i is evaluated first and which is evaluated next; we can randomly pick out a basis vector g_i and evaluate the coefficient β_i corresponding to that basis vector. It also means that we can work successively on the residual quantity rather than on the original x. It also means that, if we so choose, we can start off the problem by extracting first the most important g_i (the most relevant to the problem at hand, the largest in magnitude as measured by β_i, ...).

Start with an initial residual x_residual.0 = x:
x_residual.0 = x = β_1·g_1 + β_2·g_2 + ... + β_n·g_n

We consider one of the basis vectors, say g_1. Any random basis vector will do! After we take out this term, the residual becomes
x_residual.1 = x_residual.0 − β_1·g_1 = x − β_1·g_1 = β_2·g_2 + ... + β_n·g_n

We consider one of the remaining basis vectors, say g_2. Any random remaining basis vector will do! After we take out this term, the residual becomes
x_residual.2 = x_residual.1 − β_2·g_2 = x − β_1·g_1 − β_2·g_2 = β_3·g_3 + ... + β_n·g_n
Repeat until we have extracted all (or a sufficient number of) vectors.
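A sketch of this sequential extraction (not from the worksheet): with an orthonormal basis, each coefficient is peeled off the running residual one at a time, in any order.

```python
import numpy as np

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.normal(size=(6, 4)))   # 4 orthonormal basis vectors g_1..g_4 (columns)
beta_true = np.array([3.0, -1.0, 0.5, 2.0])
x = Q @ beta_true

residual = x.copy()
beta = np.zeros(4)
for i in rng.permutation(4):                   # any order works: the coefficients are decoupled
    beta[i] = residual @ Q[:, i]               # beta_i = (residual, g_i)/(g_i, g_i); here (g_i, g_i) = 1
    residual = residual - beta[i] * Q[:, i]    # peel that component off the residual

print(np.round(beta, 10), np.linalg.norm(residual))   # recovers beta_true; final residual ~ 0
```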

4th = sign ... For basis vectors of unit length, (g_i, g_i) = 1.
(x, g_i) = β_i·(g_i, g_i) = β_i·1    ⇒    β_i = (x, g_i)   ... for orthonormal basis vectors g

In the compact matrix-vector notation, orthonormal basis vectors make the diagonal elements of the G matrix all 1, and the G matrix is now an identity matrix, G = I.
[ (x, g_1) ]   [ 1 0 ... 0 ] [ β_1 ]
[ (x, g_2) ] = [ 0 1 ... 0 ] [ β_2 ]
[    ...   ]   [    ...    ] [ ... ]
[ (x, g_n) ]   [ 0 0 ... 1 ] [ β_n ]
xg = G·β    ⇒    β = G⁻¹·xg = xg    ⇒    β_i = (x, g_i)

Thus, for an orthonormal set of basis vectors, the coefficient corresponding to the ith basis vector is simply the scalar product of the given vector x with that ith basis vector. With these coefficients, the given vector x is expressed as
x = Σ_{i=1}^{n} β_i·g_i    ⇒    x = Σ_{i=1}^{n} (x, g_i)·g_i = (x, g_1)·g_1 + (x, g_2)·g_2 + ... + (x, g_n)·g_n

Projection. A vector x in real Euclidean space (not necessarily of finite dimension) is represented as a linear combination of m orthonormal basis vectors g_1, g_2, ..., g_m in a real Euclidean subspace.
x = Σ_{i=1}^{n} β_i·g_i = Σ_{i=1}^{m} β_i·g_i + Σ_{i=m+1}^{n} β_i·g_i = x_proj + residual_or_error_to_be_ignored
x_proj = Σ_{i=1}^{m} β_i·g_i

[Sketch: black = the n-dimensional vector space; x = solid red arrow; x_proj = dotted red arrow; residual error = dotted blue arrow. Note: the dotted blue arrow ⊥ the dotted red arrow.]

For a general set of m basis vectors g_1, g_2, ..., g_m that are not necessarily orthogonal, the projection formula is (with coupled coefficients β)
(x, g_i) = Σ_{j=1}^{m} β_j·(g_j, g_i)    i.e.,   xg = G·β    ⇒    β = G⁻¹·xg    for linearly independent g
Solve for β, where G is a matrix of scalar products, G_ij = (g_i, g_j), and xg is a column of scalar products, xg_i = (x, g_i).

If the basis vectors g_1, g_2, ..., g_m are merely orthogonal, but not orthonormal, the projection formula is (G is diagonal)
x_proj = Σ_{i=1}^{m} β_i·g_i = Σ_{i=1}^{m} [(x, g_i)/(g_i, g_i)]·g_i    ⇒    β_i = (x, g_i)/(g_i, g_i)    for orthogonal g
For orthonormal basis vectors (G = I), the projection of x into the m-dimensional subspace is
x_proj = Σ_{i=1}^{m} β_i·g_i = Σ_{i=1}^{m} (x, g_i)·g_i    ⇒    β_i = (x, g_i)    for orthonormal g
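A numerical sketch of projection (not from the worksheet): project a vector of R^n onto the span of m orthonormal columns and check that the residual is orthogonal to every basis vector of the subspace.

```python
import numpy as np

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.normal(size=(8, 3)))   # g_1..g_m: orthonormal basis of a 3-D subspace of R^8
x = rng.normal(size=8)

beta   = Q.T @ x                # beta_i = (x, g_i) for orthonormal g
x_proj = Q @ beta               # x_proj = sum_i beta_i g_i
e      = x - x_proj             # residual / error vector

print(np.round(Q.T @ e, 12))    # (e, g_i) = 0 for every basis vector: e is orthogonal to the subspace
```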

The error e = x − x_proj is ⊥ to any vector y in the m-dimensional subspace spanned by g_1, g_2, ..., g_m. Depending on the application, the error e is referred to as the error vector or the residual vector.
Proof.
x = Σ_{i=1}^{m} β_i·g_i + Σ_{i=m+1}^{n} β_i·g_i = x_proj + e
x_proj = Σ_{i=1}^{m} β_i·g_i = Σ_{i=1}^{m} [(x, g_i)/(g_i, g_i)]·g_i,  so (x_proj, g_i) = (x, g_i) for i = 1..m;  any y in the subspace is y = Σ_{i=1}^{m} η_i·g_i
(e, y) = (x − x_proj, y) = (x, y) − (x_proj, y) = (x, Σ_{i=1}^{m} η_i·g_i) − (x_proj, Σ_{i=1}^{m} η_i·g_i)
       = Σ_{i=1}^{m} η_i·(x, g_i) − Σ_{i=1}^{m} η_i·(x, g_i) = 0    ⇒    e ⊥ y for every y in the subspace.
y = x_proj minimizes ‖e‖ = ‖x − y‖ among all vectors y in the subspace.

Proof.   ‖x − y‖² = ‖(x − x_proj) + (x_proj − y)‖²
= ((x − x_proj) + (x_proj − y), (x − x_proj) + (x_proj − y))
= (x − x_proj, x − x_proj) + 2·(x − x_proj, x_proj − y) + (x_proj − y, x_proj − y)
= (x − x_proj, x − x_proj) + (x_proj − y, x_proj − y)       because (x − x_proj, x_proj − y) = 0  (x_proj − y lies in the subspace)
= ‖x − x_proj‖² + ‖x_proj − y‖²  ≥  ‖x − x_proj‖²    ⇒    y = x_proj minimizes ‖e‖

In other words, projection yields the best approximation of x in a reduced m-dimensional subspace. This idea is useful in many fields of applied mathematics: linear regression, function approximation, solution of ODEs, orthogonal collocation, methods of weighted residuals, calculus of variations, noise filtering, digital signal processing, Laplace transforms, Fourier transforms, etc.

Example. Linear regression.
General linear formula:   y = X·a + e,    y = (X<1>  X<2>  ...  X<m>)·(a_1, a_2, ..., a_m)ᵀ + e
The error e = y − x_proj = y − X·a is ⊥ to any vector in the m-dimensional subspace spanned by X<1>, X<2>, ..., X<m>:
(e, X) = (y − X·a, X) = (y − X·a)ᵀ·X = yᵀ·X − aᵀ·Xᵀ·X = 0    ⇒    Xᵀ·X·a = Xᵀ·y    ⇒    a = (Xᵀ·X)⁻¹·Xᵀ·y
Note: no calculus, no d(sse)/da = 0.
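A final sketch (not from the worksheet): the same orthogonality argument applied to data fitting with a polynomial design matrix, where a = (Xᵀ·X)⁻¹·Xᵀ·y and the residual is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(t), t, t**2])        # design matrix: columns span the model subspace
y = 1.0 + 2.0 * t - 3.0 * t**2 + 0.05 * rng.normal(size=t.size)

a = np.linalg.solve(X.T @ X, X.T @ y)                  # normal equation: X^T X a = X^T y
e = y - X @ a                                          # residual

print(np.round(a, 3))                                  # close to [1, 2, -3]
print(np.round(X.T @ e, 10))                           # residual is orthogonal to the columns of X
```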