
Part I

Linear Algebra

Chapter 1

Vectors

1.1 Vector Basics

1.1.1 Definition

Vectors

• A vector is a collection of n real numbers x_1, x_2, ..., x_n, arranged in a column or a row and treated as a single object in n-dimensional space. It can be thought of as a point in space, or as providing a direction.

• Each number x_i is called a component or element of the vector.

• The inner product defines the length of a vector, and generalizes the notion of angle between two vectors.

• Via the inner product, we can view a vector as a linear function. We can also compute the projection of a vector onto a line defined by another vector.

• We usually write vectors in column format:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

• Geometry. A vector represents both a direction from the origin and a point in the multi-dimensional space R^n, where each component corresponds to a coordinate of the point.

• Transpose. If x is a column vector, x^T (its transpose) denotes the corresponding row vector, and vice versa: x^T = [x_1, ..., x_n].

1.1.2 Independence

• A collection of m vectors x_1, ..., x_m in R^n is said to be linearly independent if no vector in the set can be expressed as a linear combination of the others. This means that the condition
$$\lambda \in \mathbb{R}^m : \sum_{i=1}^m \lambda_i x_i = 0$$
implies λ = 0.

• If two vectors are linearly independent, then neither can be a scaled version of the other.

• Example. The vectors x_1 = [1, 2, 3] and x_2 = [3, 6, 9] are not independent: since 3x_1 − x_2 = 0, x_2 is a scaled version of x_1.
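A quick numerical check (a minimal sketch in numpy; the vectors are those of the example): linear independence can be tested by comparing the rank of the stacked vectors with their count.

```python
import numpy as np

# Stack the example vectors as rows of a matrix.
x1 = np.array([1, 2, 3])
x2 = np.array([3, 6, 9])
A = np.vstack([x1, x2])

# The vectors are linearly independent iff the matrix has full row rank.
print(np.linalg.matrix_rank(A) == A.shape[0])  # False: x2 = 3*x1
```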

1.1.3 Subspace, Span, Affine Sets

Subspace and Span

• A nonempty subspace V of R^n is a subset that is closed under addition and scalar multiplication. That is, for any scalars α, β:
$$x, y \in V \implies \alpha x + \beta y \in V$$

• A subspace always contains the zero vector.

• Geometrically, subspaces are "flat" (like a line or a plane in 3D) and pass through the origin.

• A subspace S can always be represented as the span of a set of vectors x_i in R^n, that is, a set of the form
$$S = \mathrm{span}(x_1, \ldots, x_m) := \left\{ \sum_{i=1}^m \lambda_i x_i : \lambda \in \mathbb{R}^m \right\}$$

• The set of all possible linear combinations of the vectors in S = {x^(1), ..., x^(m)} forms a subspace, which is called the subspace generated by S, or the span of S, denoted by span(S).

Direct Sum

• Given two subspaces X, Y ⊆ R^n, the direct sum of X and Y, denoted X ⊕ Y, is the set of vectors of the form x + y, with x ∈ X, y ∈ Y.

• X ⊕ Y is itself a subspace.

Affine Sets

• An affine set is a translation of a subspace. It is "flat" but does not necessarily pass through 0, as a subspace would. (Like a line or a plane that does not go through the origin.)

• An affine set A can always be represented as the translation (by a constant term) of the subspace spanned by some vectors:
$$A = \left\{ x_0 + \sum_{i=1}^m \lambda_i x_i : \lambda \in \mathbb{R}^m \right\} = x_0 + S$$

where x0 is a given point and S is a given subspace. Affine is linear plus a constant term.

• Subspaces (sometimes called linear subspaces) are just affine sets containing the origin.

• Line. When S is the span of a single non-zero vector u (a one-dimensional subspace), the set A is called a line passing through the point x_0; u is the direction of the line, t is the magnitude, and x_0 is a point through which it passes:
$$A = \{ x_0 + tu : t \in \mathbb{R} \}$$

1.1.4 Basis and Dimension

Basis

• A basis of Rn is a set of n independent (irreducible) vectors.

• If the vectors u_1, ..., u_n form a basis, we can express any vector x as a linear combination $x = \sum_{i=1}^n \lambda_i u_i$ for appropriate scalars λ_1, ..., λ_n.

• Standard basis. The standard (natural) basis in R^n consists of the vectors e_i, where the i-th element is 1 and the rest are 0. In R^3:
$$e_1 = \begin{bmatrix}1\\0\\0\end{bmatrix}, \quad e_2 = \begin{bmatrix}0\\1\\0\end{bmatrix}, \quad e_3 = \begin{bmatrix}0\\0\\1\end{bmatrix}$$

Basis of a Subspace

• A basis of a given subspace S ⊆ R^n is any linearly independent set of vectors whose span is S.

• If the vectors (u_1, ..., u_r) form a basis of S, we can express any vector in the subspace S as a linear combination $x = \sum_{i=1}^r \lambda_i u_i$.

• Dimension. The number of vectors in a basis is independent of the choice of basis: we will always find the same minimum number of independent (irreducible) vectors for the subspace S. This minimum number is called the dimension of S.

• Example. In R^3, you need 2 independent vectors to describe a plane containing the origin (dimension 2). The dimension of a line is 1, since a line is x_0 + span(x_1) for some non-zero x_1.

Dimension of an Affine Subspace

• The set L in R^3 defined by

x1 − 13x2 + 4x3 = 2

3x2 − x3 = 9

is an affine subspace of dimension 1. The corresponding linear subspace can be obtained by setting the constant terms to 0:

x1 − 13x2 + 4x3 = 0

3x2 − x3 = 0

• Solving, we get x_1 = x_2 and x_3 = 3x_2. The linear subspace is the set of x ∈ R^3 of the form
$$x = \begin{bmatrix}1\\1\\3\end{bmatrix} t, \quad \text{for scalar } t = x_2$$

• The linear subspace is the span of u = (1, 1, 3), of dimension 1. We can find a particular solution x_0 = (38, 0, −9) (e.g., by setting x_2 = 0 in the original equations), and the affine subspace L is thus the line x_0 + span(u).
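This example is easy to verify numerically (a minimal sketch in numpy, reusing x_0 and u from the text):

```python
import numpy as np

# L is the solution set of A x = c (the two equations above).
A = np.array([[1.0, -13.0, 4.0],
              [0.0,   3.0, -1.0]])
c = np.array([2.0, 9.0])

x0 = np.array([38.0, 0.0, -9.0])  # particular solution from the text
u = np.array([1.0, 1.0, 3.0])     # direction spanning the linear subspace

print(np.allclose(A @ x0, c))     # True: x0 lies on L
print(np.allclose(A @ u, 0.0))    # True: u solves the homogeneous system
print(np.allclose(A @ (x0 + 2.5 * u), c))  # True: the line x0 + t*u lies on L
```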

1.2 Orthogonality and Orthogonal Complements

1.2.1 Orthogonal Vectors

• Orthogonal. Two vectors x, y in an inner product space X are orthogonal, denoted x ⊥ y, if ⟨x, y⟩ = 0.

• Mutually orthogonal. Nonzero vectors x^(1), x^(2), ..., x^(d) are said to be mutually orthogonal if ⟨x^(i), x^(j)⟩ = 0 whenever i ≠ j. In other words, each vector is orthogonal to all other vectors in the collection.

• Mutually orthogonal vectors are linearly independent, but linearly independent vectors are not necessarily mutually orthogonal.

1.2.2 Orthogonal Complement

• Orthogonal complement. A vector x ∈ X is orthogonal to a subset S of an inner product space X if x ⊥ s, ∀s ∈ S. The set of vectors in X that are orthogonal to S is called the orthogonal complement of S, denoted as S⊥.

• Direct sum and orthogonal decomposition. If S is a subspace of an inner product space X, then any vector x ∈ X can be written in a unique way as the sum of one element of S and one element of the orthogonal complement S⊥:

$$X = S \oplus S^\perp, \quad \text{for any subspace } S \subseteq X$$
$$x = y + z, \quad x \in X,\ y \in S,\ z \in S^\perp$$

• Fundamental properties of inner product spaces. Let x, z be any two elements of an inner product space X, let $\|x\| = \sqrt{\langle x, x\rangle}$, and let α be a scalar. Then:

– $|\langle x, z\rangle| \le \|x\|\,\|z\|$, with equality iff x = αz or z = 0 (Cauchy-Schwarz)
– $\|x + z\|^2 + \|x - z\|^2 = 2\|x\|^2 + 2\|z\|^2$ (parallelogram law)
– if x ⊥ z, then $\|x + z\|^2 = \|x\|^2 + \|z\|^2$ (Pythagoras theorem)
– for any subspace S ⊆ X it holds that X = S ⊕ S⊥
– for any subspace S ⊆ X it holds that dim X = dim S + dim S⊥
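These properties are easy to sanity-check numerically (a minimal sketch with the standard inner product on R^5 and random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)

# Cauchy-Schwarz: |<x, z>| <= ||x|| ||z||
assert abs(x @ z) <= np.linalg.norm(x) * np.linalg.norm(z)

# Parallelogram law: ||x+z||^2 + ||x-z||^2 = 2||x||^2 + 2||z||^2
lhs = np.linalg.norm(x + z) ** 2 + np.linalg.norm(x - z) ** 2
rhs = 2 * np.linalg.norm(x) ** 2 + 2 * np.linalg.norm(z) ** 2
assert np.isclose(lhs, rhs)

# Pythagoras: make z orthogonal to x, then ||x+z||^2 = ||x||^2 + ||z||^2
z_perp = z - (x @ z) / (x @ x) * x  # remove the component of z along x
assert np.isclose(np.linalg.norm(x + z_perp) ** 2,
                  np.linalg.norm(x) ** 2 + np.linalg.norm(z_perp) ** 2)
print("all identities hold")
```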

Figure 1.1: Left: a two-dimensional subspace S in R^3 and its orthogonal complement S⊥. Right: any vector can be written as the sum of an element x in a subspace S and an element y in its orthogonal complement S⊥.

1.3 Inner Product, Norms and Angles

1.3.1 Inner Product

• The inner product. The inner product (also called scalar product or dot product) on a (real) vector space X is a real-valued function which maps any pair of elements x, y ∈ X into a scalar denoted by ⟨x, y⟩.

• Axioms. The inner product satisfies the following axioms: for any x, y, z ∈ X and scalar α,

– ⟨x, x⟩ ≥ 0
– ⟨x, x⟩ = 0 if and only if x = 0
– ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩
– ⟨αx, y⟩ = α⟨x, y⟩
– ⟨x, y⟩ = ⟨y, x⟩

• The standard inner product defined on R^n is the "row-column" product of two vectors:
$$\langle x, y\rangle = x^T y = \sum_{i=1}^n x_i y_i$$

• Orthogonality. Two vectors x, y ∈ R^n are orthogonal if x^T y = 0.

1.3.2 Norms

• When we try to define the notion of size, or length, of a vector in high dimensions (not just a scalar), we are faced with many choices. These choices are called norms.

• The norm of a vector x, denoted by ||x||, is a real-valued function that maps any element x ∈ X to a real number ||x|| satisfying a set of rules that the notion of size should obey.

• Definition of norm. A function ||·|| from X to R is a norm if

1. ||x|| ≥ 0 for all x ∈ X, and ||x|| = 0 if and only if x = 0
2. ||x + y|| ≤ ||x|| + ||y||, for any x, y ∈ X (triangle inequality)
3. ||αx|| = |α| ||x||, for any scalar α and any x ∈ X

• The Euclidean norm (l2-norm). The Euclidean norm corresponds to the usual notion of distance in two or three dimensions. The set of points with equal l2-norm is a circle (in 2D), a sphere (in 3D), or a hyper-sphere in higher dimensions.
$$\|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2} = \sqrt{x^T x} = \sqrt{\langle x, x\rangle}$$

• The l1-norm. The l1-norm, $\|x\|_1 = \sum_{i=1}^n |x_i|$, corresponds to the distance travelled on a rectangular grid to go from one point to another.

• The l∞-norm. The l∞-norm takes the largest component in the vector, $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$. It is useful in measuring peak values.

• Cardinality (l0-norm). The cardinality of a vector x is defined as the number of nonzero elements in x:
$$\mathrm{card}(x) = \sum_{k=1}^n \mathbb{1}(x_k \ne 0), \quad \text{where } \mathbb{1}(x_k \ne 0) = \begin{cases} 1 & \text{if } x_k \ne 0 \\ 0 & \text{otherwise} \end{cases}$$

It is often called the l0-norm, ||x||_0, although it is not a norm in the proper sense, since it does not satisfy the third property (homogeneity).
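All four quantities are one-liners in numpy (a minimal sketch; the example vector is made up):

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])

print(np.linalg.norm(x, 2))       # l2-norm: sqrt(9 + 16) = 5.0
print(np.linalg.norm(x, 1))       # l1-norm: |3| + |0| + |-4| = 7.0
print(np.linalg.norm(x, np.inf))  # l-infinity norm: max|x_i| = 4.0
print(np.count_nonzero(x))        # cardinality ("l0-norm"): 2
```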

1.3.3 Angles Between Vectors

• The angle θ between nonzero vectors x, y is given by
$$\cos\theta = \frac{x^T y}{\|x\|_2 \|y\|_2}$$

• The notion above generalizes the usual notion of angle in two dimensions to higher dimensions.

• It is useful in measuring the similarity (closeness) between two vectors.

• When the two vectors are orthogonal, x^T y = 0, the angle between them is θ = 90°.

• When the angle is 0° or 180°, x is aligned (parallel) with y: y = αx. In this situation, |x^T y| achieves its maximum value |α| ||x||_2^2.
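The angle formula translates directly into code (a minimal sketch; the vectors are illustrative):

```python
import numpy as np

def cos_angle(x, y):
    """Cosine of the angle between two nonzero vectors."""
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.0])
print(cos_angle(x, np.array([0.0, 2.0])))   #  0.0 -> theta = 90 degrees
print(cos_angle(x, np.array([3.0, 0.0])))   #  1.0 -> aligned (theta = 0)
print(cos_angle(x, np.array([-2.0, 0.0])))  # -1.0 -> theta = 180 degrees
```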

1.3.4 Cauchy-Schwarz Inequality

• Since |cos θ| ≤ 1, it follows that for any two vectors x, y ∈ R^n we have
$$\cos\theta = \frac{x^T y}{\|x\|_2\|y\|_2} \le 1 \quad\Longrightarrow\quad x^T y \le \|x\|_2 \cdot \|y\|_2$$

• When x, y are collinear (lie on a single straight line), the above inequality holds with equality.

1.4 Projection on a Line

1.4.1 Definition

• A line in R^n passing through x_0 ∈ R^n with direction u ∈ R^n is
$$\{ x_0 + tu : t \in \mathbb{R} \}$$

• The projection of a given point x on the line is a vector z located on the line that is closest to x (in Euclidean norm).

• This corresponds to the optimization problem (least-squares):

$$\min_t \; \|x - (x_0 + tu)\|_2$$

• Example. Projection of the vector x = (1.6, 2.28) on a line passing through the origin (x_0 = 0) with normalized direction u = (0.89, 0.45). At optimality the residual vector x − z is orthogonal to the line, hence z = tu, with magnitude t = x^T u = 2.0035 and direction u. The scalar t = x^T u = u^T x (the scalar product between x and u) is the component of x along the normalized direction u. Any other point on the line is farther away from the point x than its projection z = tu = (x^T u)u = (u^T x)u.

Figure 1.2: Projection of a point x on a line.

1.4.2 Closed-form Expression

• Assuming that u is normalized (||u||_2 = 1), the objective function of the projection problem, after squaring, is
$$\|(x - x_0) - tu\|_2^2 = \|x - x_0\|_2^2 - 2t\,u^T(x - x_0) + t^2\|u\|_2^2 \qquad (\|u\|_2 = 1)$$
$$= \|x - x_0\|_2^2 - 2t\,u^T(x - x_0) + t^2$$
$$= \left(t - u^T(x - x_0)\right)^2 + \text{constant}$$

Thus, the optimal solution to the projection problem is

$$t^* = u^T(x - x_0)$$

and the projected vector is

$$z^* = x_0 + t^* u = x_0 + \left(u^T(x - x_0)\right)u$$

• The scalar product u^T(x − x_0) is the component of x − x_0 along the direction u.

• If u is not normalized, we replace u with its scaled version u/||u||_2 (u is a vector and ||u||_2 is a scalar):
$$z^* = x_0 + \left(\frac{u}{\|u\|_2}\right)^T (x - x_0)\, \frac{u}{\|u\|_2} = x_0 + \frac{u^T(x - x_0)}{\|u\|_2^2}\, u = x_0 + \frac{u^T(x - x_0)}{u^T u}\, u$$
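The closed-form solution is a few lines of numpy (a minimal sketch; `project_on_line` is a hypothetical helper name, and u need not be normalized):

```python
import numpy as np

def project_on_line(x, x0, u):
    """Project x onto the line {x0 + t*u : t in R}."""
    t = u @ (x - x0) / (u @ u)  # optimal t* = u^T (x - x0) / (u^T u)
    return x0 + t * u

x = np.array([1.6, 2.28])
u = np.array([0.89, 0.45])
z = project_on_line(x, np.zeros(2), u)

# At optimality the residual x - z is orthogonal to the line's direction.
print(np.isclose((x - z) @ u, 0.0))  # True
```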

1.4.3 Interpreting the Scalar Product

• We can interpret the scalar product (inner product) between two non-zero vectors x, u as the projection of x on the line of direction u passing through the origin.

• If u is normalized (||u||_2 = 1), then the projection of x on L is z* = (u^T x)u. Its length is ||z*||_2 = |u^T x| (corresponding to t in Figure 1.2).

• In general, the scalar product u^T x is simply the component of x along the normalized direction u/||u||_2 defined by u.

1.4.4 Euclidean Projection on a Set

• A Euclidean projection of a point x_0 ∈ R^n on a set S ⊆ R^n is a point that achieves the smallest Euclidean distance from x_0 to the set.

• This corresponds to the solution of the optimization problem:

$$\min_x \; \|x - x_0\|_2 \;:\; x \in S$$

• When the set S is closed and convex, there is a unique solution to the above problem. In particular, the projection on an affine subspace is unique.

• Example. Let S be the hyperplane S = {x ∈ R^3 : 2x_1 + x_2 − x_3 = 1}. The projection of x_0 = 0 on S is aligned with the coefficient vector a = (2, 1, −1). Setting x = ta (i.e., x is a point along the direction a with magnitude t) defines the line through the origin that is perpendicular to S. We can solve for the scalar t and obtain t = 1/(a^T a) = 1/6:

$$x = ta = (2t, t, -t) \implies 2x_1 + x_2 - x_3 = 4t + t + t = 6t = 1$$

So the projection is x* = a/(a^T a) = (1/3, 1/6, −1/6).
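In code, the projection of any point onto a hyperplane {x : a^T x = b} follows the same recipe (a minimal sketch; `project_on_hyperplane` is a hypothetical helper, shown on the example above):

```python
import numpy as np

def project_on_hyperplane(x0, a, b):
    """Euclidean projection of x0 onto the hyperplane {x : a^T x = b}."""
    return x0 + (b - a @ x0) / (a @ a) * a

a = np.array([2.0, 1.0, -1.0])
x_star = project_on_hyperplane(np.zeros(3), a, 1.0)
print(x_star)                       # [ 0.333...  0.167... -0.167...]
print(np.isclose(a @ x_star, 1.0))  # True: the projection lies on S
```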

1.5 Orthogonalization: The Gram-Schmidt Procedure

1.5.1 Orthogonalization

• A basis (u_i)_{i=1}^n is orthogonal if u_i^T u_j = 0 for i ≠ j. If in addition ||u_i||_2 = 1 for all i, we say that the basis is orthonormal.

• Orthogonalization refers to the procedure that finds an orthonormal basis of the span of given vectors.

• Given vectors a_1, ..., a_m ∈ R^n, an orthogonalization procedure finds an orthonormal basis for the span of the vectors a_1, ..., a_m.

• The orthogonalization procedure computes vectors q_1, ..., q_r ∈ R^n such that
$$S = \mathrm{span}\{a_1, \ldots, a_m\} = \mathrm{span}\{q_1, \ldots, q_r\}$$

where r is the dimension of S, and

$$q_i^T q_j = 0, \quad i \ne j \quad \text{(mutually orthogonal)}$$
$$q_i^T q_i = 1, \quad 1 \le i \le r \quad \text{(normalized)}$$

The vectors {q_1, ..., q_r} form an orthonormal basis for the span of the vectors {a_1, ..., a_m}.

1.5.2 Basic Step: Projection on a Line

• Consider the line L(q) = {tq : t ∈ R} passing through zero, where q ∈ R^n is given and normalized (||q||_2 = 1). The projection of a given point a ∈ R^n on the line is the vector a_proj located on the line that is closest to a, which corresponds to the problem
$$\min_t \; \|a - tq\|_2$$

• The projection of a on the line L(q) is the vector a_proj = t*q, where t* is the optimal value. The solution has a closed-form expression: a_proj = (q^T a)q.

• The vector a can be written as the sum of its projection a_proj and the vector that is orthogonal to the projection, a − a_proj, where a_proj = (q^T a)q and a − a_proj = a − (q^T a)q:
$$a = (a - a_{\mathrm{proj}}) + a_{\mathrm{proj}} = \left(a - (q^T a)q\right) + (q^T a)q$$

The vector a − a_proj can be interpreted as the result of removing the component of a along q.

1.5.3 Gram-Schmidt Procedure

The Gram-Schmidt procedure is a particular orthogonalization algorithm. The basic idea is to orthogonalize each vector with respect to the previous ones, then normalize the result to have unit norm.

When the vectors are independent

Assume that the vectors a_1, ..., a_m are linearly independent. Because the a_i are linearly independent, at each step q̃_i ≠ 0.

The GS Procedure

1. Set q̃_1 = a_1.
2. Normalize: set q_1 = q̃_1 / ||q̃_1||_2.
3. Remove the component of q_1 in a_2: set q̃_2 = a_2 − (a_2^T q_1)q_1.
4. Normalize: set q_2 = q̃_2 / ||q̃_2||_2.
5. Remove the components of q_1, q_2 in a_3: set q̃_3 = a_3 − (a_3^T q_1)q_1 − (a_3^T q_2)q_2.
6. Normalize: set q_3 = q̃_3 / ||q̃_3||_2.
7. Etc.

Example. The GS procedure for two vectors in two dimensions. We first set q_1 to be a normalized version of the first vector a_1. We then remove the component of a_2 along the direction of q_1: q̃_2 = a_2 − (q_1^T a_2)q_1. The difference is the unnormalized direction q̃_2, which becomes q_2 after normalization. At the end, the vectors q_1, q_2 both have unit length and are orthogonal to each other.

Figure 1.3: The Gram-Schmidt procedure.

General case: when the vectors are dependent

If at step i we find q̃_i = 0 (a_i is linearly dependent on the previous vectors), we skip directly to the next step. At the end, r is the dimension of the span of the vectors a_1, ..., a_m.

1. Set r = 0.
2. For i = 1, ..., m:

   (a) Set q̃ = a_i − Σ_{j=1}^{r} (q_j^T a_i) q_j.

   (b) If q̃ ≠ 0: set r = r + 1 and q_r = q̃ / ||q̃||_2.

Figure 1.4: The Gram-Schmidt procedure, general case.
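A direct implementation of the general procedure (a minimal sketch; the numerical tolerance for declaring q̃ = 0 is an assumption, since in floating point the residual is never exactly zero):

```python
import numpy as np

def gram_schmidt(A, tol=1e-10):
    """Return Q whose r columns are an orthonormal basis of span(columns of A)."""
    q_list = []
    for a in A.T:                        # process input vectors one by one
        q_tilde = a.astype(float)
        for q in q_list:                 # remove components along previous q_j
            q_tilde = q_tilde - (q @ a) * q
        norm = np.linalg.norm(q_tilde)
        if norm > tol:                   # skip dependent vectors (q_tilde ~ 0)
            q_list.append(q_tilde / norm)
    return np.column_stack(q_list)

A = np.array([[1.0, 3.0, 1.0],
              [2.0, 6.0, 0.0],
              [3.0, 9.0, 1.0]])          # second column is 3x the first
Q = gram_schmidt(A)
print(Q.shape[1])                        # r = 2: the span has dimension 2
print(np.allclose(Q.T @ Q, np.eye(2)))   # True: columns are orthonormal
```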

1.6 Linear Functions and Maps

1.6.1 Functions and Maps

• Function. A function takes a vector argument in R^n and returns a unique value in R:
$$f : \mathbb{R}^n \to \mathbb{R}$$

• Domain. The domain of a function f, dom f, is defined as the set of points where the function is finite. Two functions can differ not by their formal expression, but by their domains.

• Map. The term map refers to vector-valued functions, i.e., functions that return a vector of values:
$$f : \mathbb{R}^n \to \mathbb{R}^m$$

• Components. The components of the map f are the (scalar-valued) functions f_i, i = 1, ..., m.

1.6.2 Sets Related to Functions

• Graph. The graph of f : R^n → R is a subset of R^{n+1}. It is the set of input/output pairs that f can attain:
$$\mathrm{graph}\, f = \left\{ (x, f(x)) \in \mathbb{R}^{n+1} : x \in \mathbb{R}^n \right\}$$

• Epigraph. The epigraph of f : R^n → R, denoted by epi f, is also a subset of R^{n+1}. It describes the set of input/output pairs that f can achieve, as well as "anything above":
$$\mathrm{epi}\, f = \left\{ (x, t) \in \mathbb{R}^{n+1} : x \in \mathbb{R}^n,\ t \ge f(x) \right\}$$

Figure 1.5: The graph of the function is shown as a solid line. The epigraph corresponds to points on and above the graph, in green.

• Level set. Level and sublevel sets correspond to the notion of contours of the function f. Both depend on some scalar value t, and are subsets of R^n. A level set (or contour line) is the set of points that achieve exactly some value of the function f. The t-level set of the function f is
$$C_f(t) = \left\{ x \in \mathbb{R}^n : f(x) = t \right\}$$

• Sublevel set. The t-sublevel set of f is the set of points that achieve at most a certain value for f:
$$L_f(t) = \left\{ x \in \mathbb{R}^n : f(x) \le t \right\}$$

Figure 1.6: Level and sublevel sets, with domain R^2.

1.6.3 Linear and Affine Functions

• Linear functions. A function f : R^n → R is linear if and only if f preserves scaling and addition of its arguments:
$$f(\alpha x) = \alpha f(x) \quad \text{for every } x \in \mathbb{R}^n,\ \alpha \in \mathbb{R}$$
$$f(x_1 + x_2) = f(x_1) + f(x_2) \quad \text{for every } x_1, x_2 \in \mathbb{R}^n$$

• Affine functions. A function f is affine if and only if the function f̃ : R^n → R with values f̃(x) = f(x) − f(0) is linear. In other words, an affine function is a linear function plus a constant term (the constant being f(0)).

• Example. Consider the functions f_1, f_2, f_3 : R^2 → R with values

f_1(x) = 3.2x_1 + 2x_2

f_2(x) = 3.2x_1 + 2x_2 + 0.15
f_3(x) = 0.001x_2^2 + 2.3x_1 + 0.3x_2

f1 is linear, f2 is affine, f3 is neither.

1.6.4 Connection with Vectors via the Inner Product

Theorem 1 (Representation of affine function via the inner product).

• A function f : R^n → R is affine if and only if it can be expressed via a scalar product:
$$f(x) = a^T x + b$$
for some unique pair (a, b), a ∈ R^n, b ∈ R, given by a_i = f(e_i) − f(0), with e_i the i-th unit vector in R^n, and b = f(0).

• The function is linear if and only if b = 0.

• The theorem shows that a vector can be seen as a (linear) function from the input space R^n to the output space R.

• Both points of view (vectors as simple collections of numbers, or as linear functions) are useful.
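Theorem 1 is constructive: evaluating f at 0 and at the unit vectors recovers (a, b). A minimal sketch (using f_2 from the earlier example; `affine_representation` is a hypothetical helper):

```python
import numpy as np

def affine_representation(f, n):
    """Recover (a, b) such that f(x) = a^T x + b, for an affine f on R^n."""
    b = f(np.zeros(n))
    a = np.array([f(np.eye(n)[i]) - b for i in range(n)])  # a_i = f(e_i) - f(0)
    return a, b

f2 = lambda x: 3.2 * x[0] + 2 * x[1] + 0.15
a, b = affine_representation(f2, 2)
print(a, b)  # [3.2 2. ] 0.15
```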

1.6.5 Gradient of a Linear Function

• Gradient. Consider the function f : R^2 → R with values f(x) = x_1 + 2x_2. Its gradient is constant, with values
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\[4pt] \frac{\partial f}{\partial x_2}(x) \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$

• For a given t ∈ R, the t-level set is the set of points such that f(x) = t:
$$L_t(f) = \{ (x_1, x_2) : x_1 + 2x_2 = t \}$$

The level sets L_t(f) are lines, and are orthogonal to the gradient.

• More generally, the gradient of the linear function f(x) = a^T x is ∇f(x) = a.

Interpretation of a and b

• The scalar b = f(0) is the constant term. It is sometimes referred to as the bias or intercept, as it is the point where f intercepts the vertical axis if we were to plot the graph of the function.

• The terms aj, j = 1, ..., n, which correspond to the gradient of f, give the coefficients of influence of xj on f.

1.6.6 First-order Approximation of Non-linear Functions

A common engineering practice is to approximate a given non-linear function with a linear (or affine) function, by taking derivatives.

One-dimensional Case

We can approximate the value of a function f at a point x near a point x_0 as follows:
$$f(x) \approx l(x) = f(x_0) + f'(x_0)(x - x_0)$$
where f'(x_0) denotes the derivative of f at x_0.

Multi-dimensional Case

We approximate a differentiable function f : R^n → R by an affine function l, so that f and l coincide up to and including the first derivatives. The first-order approximation to f at x_0 must be of the form
$$l(x) = a^T x + b$$
where a ∈ R^n, b ∈ R. Our condition that l coincides with f up to and including the first derivatives shows that we must have
$$\nabla l(x) = a = \nabla f(x_0), \qquad a^T x_0 + b = f(x_0)$$
where ∇f(x_0) is the gradient of f at x_0.

Theorem 2 (First-order Expansion of a Function). Solving for a, b, the first-order approximation of a differentiable function f at a point x_0 is of the form

$$f(x) \approx l(x) = f(x_0) + \nabla f(x_0)^T (x - x_0)$$

where ∇f(x_0) ∈ R^n is the gradient of f at x_0.
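A numerical illustration of the expansion (a minimal sketch; the test function f and the finite-difference gradient are assumptions made for the example):

```python
import numpy as np

def numerical_gradient(f, x0, h=1e-6):
    """Central-difference approximation of the gradient of f at x0."""
    g = np.zeros_like(x0)
    for i in range(len(x0)):
        e = np.zeros_like(x0)
        e[i] = h
        g[i] = (f(x0 + e) - f(x0 - e)) / (2 * h)
    return g

f = lambda x: np.log(1 + x[0] ** 2 + x[1] ** 2)  # a smooth non-linear example
x0 = np.array([1.0, 2.0])
grad = numerical_gradient(f, x0)

x = x0 + np.array([0.05, -0.03])  # a point near x0
l = f(x0) + grad @ (x - x0)       # first-order approximation at x0
print(f(x), l)                    # the two values are close near x0
```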

1.6.7 Other Sources of Linear Models

Linearity can arise from a simple change of variables. For example, taking logarithms of the power law $y = \alpha\, x_1^{a_1} \cdots x_n^{a_n}$ (with α > 0, x_j > 0) gives
$$\tilde{y} = \log\alpha + \sum_{j=1}^n a_j \log x_j = a^T \tilde{x} + b$$
where ỹ = log y, x̃_j = log x_j, and b = log α.
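A quick check of the change of variables (a minimal sketch; the coefficient α and the exponents a are made-up values):

```python
import numpy as np

alpha = 3.0                  # made-up positive coefficient
a = np.array([2.0, -0.5])    # made-up exponents

x = np.array([1.5, 4.0])
y = alpha * np.prod(x ** a)  # power law: y = alpha * x1^a1 * x2^a2

# After the change of variables the relationship is affine:
y_tilde = a @ np.log(x) + np.log(alpha)
print(np.isclose(np.log(y), y_tilde))  # True
```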

1.7 Hyperplanes and Half-Spaces

1.7.1 Hyperplanes

• A hyperplane in R^n is a set described by a single scalar-product equality. It is of the form
$$H = \{ x : a^T x = b \}$$
where a ∈ R^n, a ≠ 0, and b ∈ R are given. We can think of hyperplanes as the level sets of linear functions.

• When b = 0, the hyperplane is simply the set of points that are orthogonal to a (i.e., H is an (n − 1)-dimensional subspace): it is the orthogonal complement of span(a).

Figure 1.7: Translation of a hyperplane.

• When b ≠ 0, the hyperplane is a translation of that set along direction a. (Move the hyperplane originally passing through the origin up or down; after the translation, the hyperplane no longer contains the origin.)

• If x_0 ∈ H, then for any other element x ∈ H, we have b = a^T x_0 = a^T x. (A convenient choice of x_0 is the point of H closest to the origin; see below.)

• The hyperplane can be characterized as the set of vectors x such that x − x_0 is orthogonal to a:
$$H = \{ x : a^T (x - x_0) = 0 \}$$

• Hyperplanes are affine sets of dimension n − 1. A hyperplane separates the whole space into two regions (half-spaces).

• Equivalent representation of hyperplanes. Any affine set of dimension n − 1 is a hyperplane of the form
$$H = \{ x : a^T x = b \}$$
for some a ∈ R^n and b ∈ R. The following two representations of a hyperplane are equivalent:
$$H = \left\{ x \in \mathbb{R}^n : a^T x = b \right\}, \quad a \in \mathbb{R}^n,\ b \in \mathbb{R}$$
$$\phantom{H} = x_0 + \mathrm{span}(u_1, \ldots, u_{n-1})$$
for some linearly independent vectors u_1, ..., u_{n−1} ∈ R^n and some vector x_0 ∈ R^n.

1.7.2 Projection on a Hyperplane

• Consider the hyperplane H = {x : a^T x = b} and assume that a is normalized. We can represent H via a point x_0 such that a^T x_0 = b. One such vector is x_proj = ba (see Figure 1.7).

• By construction, x_proj is the projection of 0 on H. It is the point on H closest to the origin, as it solves the projection problem $\min_x \|x\|_2 : x \in H$.

• Using the Cauchy-Schwarz inequality, for any x ∈ H

$$\|x_{\mathrm{proj}}\|_2 = |b| = |a^T x| \le \|a\|_2 \cdot \|x\|_2 = \|x\|_2 \qquad (\|a\|_2 = 1)$$

and the minimum length |b| is attained with x_proj = ba.

1.7.3 Geometry of Hyperplanes

• Geometrically, a hyperplane H = {x : a^T x = b}, with ||a||_2 = 1, is a translation (shift) of the set of vectors orthogonal to a. The (normal) direction of the translation is determined by a, and the amount by b.

• |b| is the distance from the origin to the closest point x_0 on H, and the sign of b determines whether H lies away from the origin along the direction a or −a. If we increase the magnitude of b, the hyperplane shifts further away along ±a, depending on the sign of b.

Figure 1.8: Hyperplane. The scalar b is positive, as x_0 and a point in the same direction.

1.7.4 Half-Spaces

• A half-space in R^n is a set of the form
$$H = \{ x : a^T x \ge b \}$$
where a ∈ R^n, a ≠ 0, and b ∈ R are given. Geometrically, the half-space is the set of points satisfying a single affine inequality, a^T(x − x_0) ≥ 0.

• The angle between (x − x_0) and a is acute (in [−90°, +90°]).

• Here x_0 is the point closest to the origin on the hyperplane defined by the equality a^T x = b. (When a is normalized, x_0 = ba: b is the magnitude and a is the unit direction.)

• The half-space H = {x : a^T x ≥ b} is the set of points such that x − x_0 forms an acute angle with a, where x_0 is the projection of the origin on the boundary of the half-space.

• Hyperplanes correspond to level sets of linear functions.

• Half-spaces represent sublevel or superlevel sets of linear functions: H = {x : a^T x ≥ b} is the set where the linear function x ↦ a^T x achieves the value b or more. A quick way to check which half of the space a half-space describes is to look at where the origin is: x = 0 belongs to H = {x : a^T x ≥ b} if and only if b ≤ 0.
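Checking which side of the hyperplane a point falls on is then a one-line test (a minimal sketch; `in_half_space` is a hypothetical helper, with a, b from the earlier hyperplane example):

```python
import numpy as np

def in_half_space(x, a, b):
    """True if x lies in the half-space {x : a^T x >= b}."""
    return a @ x >= b

a, b = np.array([2.0, 1.0, -1.0]), 1.0
print(in_half_space(np.zeros(3), a, b))                # False: b > 0, so 0 is not in H
print(in_half_space(np.array([1.0, 0.0, 0.0]), a, b))  # True: 2 >= 1
```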

Figure 1.9: Projection on a half-space and acute angle.
