Topic 5: Principal Component Analysis


5.1 Covariance matrices

Suppose we are interested in a population whose members are represented by vectors in $\mathbb{R}^d$. We model the population as a probability distribution $P$ over $\mathbb{R}^d$, and let $X$ be a random vector with distribution $P$. The mean of $X$ is the "center of mass" of $P$. The covariance of $X$ is also a kind of "center of mass", but it turns out to reveal quite a lot of other information.

Note: if we have a finite collection of data points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$, then it is common to arrange these vectors as rows of a matrix $A \in \mathbb{R}^{n \times d}$. In this case, we can think of $P$ as the uniform distribution over the $n$ points $x_1, x_2, \ldots, x_n$. The mean of $X \sim P$ can be written as
$$\mathbb{E}(X) = \frac{1}{n} A^\top \mathbf{1},$$
and the covariance of $X$ is
$$\operatorname{cov}(X) = \frac{1}{n} A^\top A - \Big(\frac{1}{n} A^\top \mathbf{1}\Big)\Big(\frac{1}{n} A^\top \mathbf{1}\Big)^\top = \frac{1}{n} \tilde{A}^\top \tilde{A},$$
where $\tilde{A} = A - (1/n)\mathbf{1}\mathbf{1}^\top A$. We often call these the empirical mean and empirical covariance of the data $x_1, x_2, \ldots, x_n$.

Covariance matrices are always symmetric by definition. Moreover, they are always positive semidefinite, since for any non-zero $z \in \mathbb{R}^d$,
$$z^\top \operatorname{cov}(X)\, z = z^\top \mathbb{E}\big[(X - \mathbb{E}(X))(X - \mathbb{E}(X))^\top\big] z = \mathbb{E}\big[\langle z, X - \mathbb{E}(X)\rangle^2\big] \ge 0.$$
This also shows that for any unit vector $u$, the variance of $X$ in direction $u$ is
$$\operatorname{var}(\langle u, X\rangle) = \mathbb{E}\big[\langle u, X - \mathbb{E}(X)\rangle^2\big] = u^\top \operatorname{cov}(X)\, u.$$

Consider the following question: in what direction does $X$ have the highest variance? It turns out this is given by an eigenvector corresponding to the largest eigenvalue of $\operatorname{cov}(X)$. This follows from the following variational characterization of the eigenvalues of symmetric matrices.

Theorem 5.1. Let $M \in \mathbb{R}^{d \times d}$ be a symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and corresponding orthonormal eigenvectors $v_1, v_2, \ldots, v_d$. Then
$$\max_{u \ne 0} \frac{u^\top M u}{u^\top u} = \lambda_1, \qquad \min_{u \ne 0} \frac{u^\top M u}{u^\top u} = \lambda_d.$$
These are achieved by $v_1$ and $v_d$, respectively. (The ratio $u^\top M u / u^\top u$ is called the Rayleigh quotient associated with $M$ in direction $u$.)

Proof. Following Theorem 4.1, write the eigendecomposition of $M$ as $M = V \Lambda V^\top$, where $V = [v_1 \mid v_2 \mid \cdots \mid v_d]$ is orthogonal and $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$ is diagonal. For any $u \ne 0$,
$$\begin{aligned}
\frac{u^\top M u}{u^\top u}
&= \frac{u^\top V \Lambda V^\top u}{u^\top V V^\top u} && (\text{since } V V^\top = I) \\
&= \frac{w^\top \Lambda w}{w^\top w} && (\text{using } w := V^\top u) \\
&= \frac{w_1^2 \lambda_1 + w_2^2 \lambda_2 + \cdots + w_d^2 \lambda_d}{w_1^2 + w_2^2 + \cdots + w_d^2}.
\end{aligned}$$
This final ratio is a convex combination of the scalars $\lambda_1, \lambda_2, \ldots, \lambda_d$. Its largest value is $\lambda_1$, achieved by $w = e_1$ (and hence $u = V e_1 = v_1$), and its smallest value is $\lambda_d$, achieved by $w = e_d$ (and hence $u = V e_d = v_d$).

Corollary 5.1. Let $v_1$ be a unit-length eigenvector of $\operatorname{cov}(X)$ corresponding to the largest eigenvalue of $\operatorname{cov}(X)$. Then
$$\operatorname{var}(\langle v_1, X\rangle) = \max_{u \in S^{d-1}} \operatorname{var}(\langle u, X\rangle).$$

Now suppose we are interested in the $k$-dimensional subspace of $\mathbb{R}^d$ that captures the "most" variance of $X$. Recall that a $k$-dimensional subspace $W \subseteq \mathbb{R}^d$ can always be specified by a collection of $k$ orthonormal vectors $u_1, u_2, \ldots, u_k \in W$. By the orthogonal projection to $W$, we mean the linear map
$$x \mapsto U^\top x, \qquad \text{where } U = [u_1 \mid u_2 \mid \cdots \mid u_k] \in \mathbb{R}^{d \times k}.$$
The covariance of $U^\top X$, a $k \times k$ covariance matrix, is simply given by
$$\operatorname{cov}(U^\top X) = U^\top \operatorname{cov}(X)\, U.$$
The "total" variance in this subspace is often measured by the trace of the covariance, $\operatorname{tr}(\operatorname{cov}(U^\top X))$. Recall that the trace of a square matrix is the sum of its diagonal entries, and it is a linear function.

Fact 5.1. For any $U \in \mathbb{R}^{d \times k}$, $\operatorname{tr}(\operatorname{cov}(U^\top X)) = \mathbb{E}\,\|U^\top (X - \mathbb{E}(X))\|_2^2$. Furthermore, if $U^\top U = I$, then $\operatorname{tr}(\operatorname{cov}(U^\top X)) = \mathbb{E}\,\|U U^\top (X - \mathbb{E}(X))\|_2^2$.
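To make these formulas concrete, here is a minimal numerical sketch (not from the original notes; it assumes NumPy and a synthetic data matrix) that computes the empirical mean and covariance via $\tilde{A}$, checks that the Rayleigh quotient at the top eigenvector equals $\lambda_1$ (Theorem 5.1), and verifies the identity in Fact 5.1 for a random orthonormal $U$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): n points in R^d arranged as rows of A.
n, d, k = 500, 4, 2
A = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))

# Empirical mean and covariance, following the formulas above.
ones = np.ones(n)
mean = A.T @ ones / n                       # (1/n) A^T 1
A_tilde = A - np.outer(ones, mean)          # A - (1/n) 1 1^T A
cov = A_tilde.T @ A_tilde / n               # (1/n) A~^T A~

# Top eigenvector of cov (np.linalg.eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(cov)
lam1, v1 = eigvals[-1], eigvecs[:, -1]

# Rayleigh quotient: equals lambda_1 at v1, and is never larger elsewhere.
rayleigh = lambda u: (u @ cov @ u) / (u @ u)
u = rng.normal(size=d)
print(np.isclose(rayleigh(v1), lam1), rayleigh(u) <= lam1 + 1e-12)

# Fact 5.1: tr(cov(U^T X)) equals the average of ||U^T (x_i - mean)||^2.
U, _ = np.linalg.qr(rng.normal(size=(d, k)))    # random U with U^T U = I
lhs = np.trace(U.T @ cov @ U)
rhs = ((A_tilde @ U) ** 2).sum(axis=1).mean()
print(np.isclose(lhs, rhs))
```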
Theorem 5.2. Let $M \in \mathbb{R}^{d \times d}$ be a symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and corresponding orthonormal eigenvectors $v_1, v_2, \ldots, v_d$. Then for any $k \in [d]$,
$$\max_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \operatorname{tr}(U^\top M U) = \lambda_1 + \lambda_2 + \cdots + \lambda_k,$$
$$\min_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \operatorname{tr}(U^\top M U) = \lambda_{d-k+1} + \lambda_{d-k+2} + \cdots + \lambda_d.$$
The max is achieved by an orthogonal projection to the span of $v_1, v_2, \ldots, v_k$, and the min is achieved by an orthogonal projection to the span of $v_{d-k+1}, v_{d-k+2}, \ldots, v_d$.

Proof. Let $u_1, u_2, \ldots, u_k$ denote the columns of $U$. Then, writing $M = \sum_{j=1}^d \lambda_j v_j v_j^\top$ (Theorem 4.1),
$$\operatorname{tr}(U^\top M U) = \sum_{i=1}^k u_i^\top M u_i = \sum_{i=1}^k u_i^\top \Big(\sum_{j=1}^d \lambda_j v_j v_j^\top\Big) u_i = \sum_{j=1}^d \lambda_j \sum_{i=1}^k \langle v_j, u_i\rangle^2 = \sum_{j=1}^d c_j \lambda_j,$$
where $c_j := \sum_{i=1}^k \langle v_j, u_i\rangle^2$ for each $j \in [d]$. We'll show that each $c_j \in [0, 1]$ and $\sum_{j=1}^d c_j = k$. First, it is clear that $c_j \ge 0$ for each $j \in [d]$. Next, extending $u_1, u_2, \ldots, u_k$ to an orthonormal basis $u_1, u_2, \ldots, u_d$ for $\mathbb{R}^d$, we have for each $j \in [d]$,
$$c_j = \sum_{i=1}^k \langle v_j, u_i\rangle^2 \le \sum_{i=1}^d \langle v_j, u_i\rangle^2 = 1.$$
Finally, since $v_1, v_2, \ldots, v_d$ is an orthonormal basis for $\mathbb{R}^d$,
$$\sum_{j=1}^d c_j = \sum_{j=1}^d \sum_{i=1}^k \langle v_j, u_i\rangle^2 = \sum_{i=1}^k \sum_{j=1}^d \langle v_j, u_i\rangle^2 = \sum_{i=1}^k \|u_i\|_2^2 = k.$$
The maximum value of $\sum_{j=1}^d c_j \lambda_j$ over all choices of $c_1, c_2, \ldots, c_d \in [0, 1]$ with $\sum_{j=1}^d c_j = k$ is $\lambda_1 + \lambda_2 + \cdots + \lambda_k$. This is achieved when $c_1 = c_2 = \cdots = c_k = 1$ and $c_{k+1} = \cdots = c_d = 0$, i.e., when $\operatorname{span}(v_1, v_2, \ldots, v_k) = \operatorname{span}(u_1, u_2, \ldots, u_k)$. The minimum value of $\sum_{j=1}^d c_j \lambda_j$ over all choices of $c_1, c_2, \ldots, c_d \in [0, 1]$ with $\sum_{j=1}^d c_j = k$ is $\lambda_{d-k+1} + \lambda_{d-k+2} + \cdots + \lambda_d$. This is achieved when $c_1 = \cdots = c_{d-k} = 0$ and $c_{d-k+1} = c_{d-k+2} = \cdots = c_d = 1$, i.e., when $\operatorname{span}(v_{d-k+1}, v_{d-k+2}, \ldots, v_d) = \operatorname{span}(u_1, u_2, \ldots, u_k)$.

We'll refer to the $k$ largest eigenvalues of a symmetric matrix $M$ as the top-$k$ eigenvalues of $M$, and the $k$ smallest eigenvalues as the bottom-$k$ eigenvalues of $M$. We analogously use the term top-$k$ (resp., bottom-$k$) eigenvectors to refer to orthonormal eigenvectors corresponding to the top-$k$ (resp., bottom-$k$) eigenvalues. Note that the choice of top-$k$ (or bottom-$k$) eigenvectors is not necessarily unique.

Corollary 5.2. Let $v_1, v_2, \ldots, v_k$ be top-$k$ eigenvectors of $\operatorname{cov}(X)$, and let $V_k := [v_1 \mid v_2 \mid \cdots \mid v_k]$. Then
$$\operatorname{tr}(\operatorname{cov}(V_k^\top X)) = \max_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \operatorname{tr}(\operatorname{cov}(U^\top X)).$$

An orthogonal projection given by top-$k$ eigenvectors of $\operatorname{cov}(X)$ is called a (rank-$k$) principal component analysis (PCA) projection. Corollary 5.2 reveals an important property of a PCA projection: it maximizes the variance captured by the subspace.
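As an illustration of Corollary 5.2 (not part of the original notes; it assumes NumPy and synthetic data), the sketch below forms the rank-$k$ PCA projection from the top-$k$ eigenvectors of the empirical covariance, checks that the captured variance equals $\lambda_1 + \cdots + \lambda_k$, and confirms that a random orthonormal $U$ captures no more.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 1000, 6, 2

# Synthetic data (hypothetical): rows are samples of X; empirical covariance as in 5.1.
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n

# Top-k eigenvectors of cov(X): columns of the rank-k PCA projection V_k.
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
V_k = eigvecs[:, -k:]

# Captured variance tr(cov(V_k^T X)) equals the sum of the top-k eigenvalues (Theorem 5.2) ...
captured = np.trace(V_k.T @ cov @ V_k)
print(np.isclose(captured, eigvals[-k:].sum()))

# ... and no other U with U^T U = I captures more (Corollary 5.2), e.g. a random one:
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
print(np.trace(U.T @ cov @ U) <= captured + 1e-12)
```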
5.2 Best affine and linear subspaces

PCA has another important property: it gives an affine subspace $A \subseteq \mathbb{R}^d$ that minimizes the expected squared distance between $X$ and $A$.

Recall that a $k$-dimensional affine subspace $A$ is specified by a $k$-dimensional (linear) subspace $W \subseteq \mathbb{R}^d$ (say, with orthonormal basis $u_1, u_2, \ldots, u_k$) and a displacement vector $u_0 \in \mathbb{R}^d$:
$$A = \{u_0 + \alpha_1 u_1 + \alpha_2 u_2 + \cdots + \alpha_k u_k : \alpha_1, \alpha_2, \ldots, \alpha_k \in \mathbb{R}\}.$$
Let $U := [u_1 \mid u_2 \mid \cdots \mid u_k]$. Then, for any $x \in \mathbb{R}^d$, the point in $A$ closest to $x$ is given by $u_0 + U U^\top (x - u_0)$, and hence the squared distance from $x$ to $A$ is $\|(I - U U^\top)(x - u_0)\|_2^2$.

Theorem 5.3. Let $v_1, v_2, \ldots, v_k$ be top-$k$ eigenvectors of $\operatorname{cov}(X)$, let $V_k := [v_1 \mid v_2 \mid \cdots \mid v_k]$, and let $v_0 := \mathbb{E}(X)$. Then
$$\mathbb{E}\,\|(I - V_k V_k^\top)(X - v_0)\|_2^2 = \min_{U \in \mathbb{R}^{d \times k},\, u_0 \in \mathbb{R}^d:\, U^\top U = I} \mathbb{E}\,\|(I - U U^\top)(X - u_0)\|_2^2.$$

Proof. For any $d \times d$ matrix $M$, the function $u_0 \mapsto \mathbb{E}\,\|M(X - u_0)\|_2^2$ is minimized when $M u_0 = M\,\mathbb{E}(X)$ (Fact 5.2). Therefore, we can plug in $\mathbb{E}(X)$ for $u_0$ in the minimization problem, whereupon it reduces to
$$\min_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \mathbb{E}\,\|(I - U U^\top)(X - \mathbb{E}(X))\|_2^2.$$
The objective function is equivalent to
$$\mathbb{E}\,\|(I - U U^\top)(X - \mathbb{E}(X))\|_2^2 = \mathbb{E}\,\|X - \mathbb{E}(X)\|_2^2 - \mathbb{E}\,\|U U^\top (X - \mathbb{E}(X))\|_2^2 = \mathbb{E}\,\|X - \mathbb{E}(X)\|_2^2 - \operatorname{tr}(\operatorname{cov}(U^\top X)),$$
where the second equality comes from Fact 5.1. Therefore, minimizing the objective is equivalent to maximizing $\operatorname{tr}(\operatorname{cov}(U^\top X))$, which is achieved by PCA (Corollary 5.2).

The proof of Theorem 5.3 depends on the following simple but useful fact.

Fact 5.2 (Bias-variance decomposition). Let $Y$ be a random vector in $\mathbb{R}^d$, and let $b \in \mathbb{R}^d$ be any fixed vector. Then
$$\mathbb{E}\,\|Y - b\|_2^2 = \mathbb{E}\,\|Y - \mathbb{E}(Y)\|_2^2 + \|\mathbb{E}(Y) - b\|_2^2$$
(which, as a function of $b$, is minimized when $b = \mathbb{E}(Y)$).
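To see Theorem 5.3 numerically, here is a small sketch (not from the original notes; it assumes NumPy and synthetic data) that compares the mean squared distance to the PCA affine subspace, centered at the empirical mean, against a randomly chosen subspace and offset. By the argument in the proof, the PCA residual should also equal the sum of the bottom $d - k$ eigenvalues of the empirical covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 2000, 5, 2

# Synthetic data (hypothetical): rows are samples of X, with a nonzero mean.
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d)) + rng.normal(size=d)
v0 = X.mean(axis=0)                           # empirical mean, playing the role of E(X)
cov = (X - v0).T @ (X - v0) / n

eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
V_k = eigvecs[:, -k:]                         # top-k eigenvectors

def mean_sq_dist(U, u0):
    """Average squared distance from the rows of X to the affine subspace u0 + range(U)."""
    R = (X - u0) - (X - u0) @ U @ U.T         # residuals (I - U U^T)(x - u0)
    return (R ** 2).sum(axis=1).mean()

pca_err = mean_sq_dist(V_k, v0)

# PCA (centered at the mean) does at least as well as a random subspace and offset ...
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
u0 = rng.normal(size=d)
print(pca_err <= mean_sq_dist(U, u0) + 1e-9)

# ... and its residual equals the sum of the bottom d - k eigenvalues of cov(X).
print(np.isclose(pca_err, eigvals[:-k].sum()))
```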