ML2R Theory Nuggets: Centering Data- and Kernel Matrices

Christian Bauckhage∗ (Machine Learning Rhine-Ruhr, Fraunhofer IAIS, St. Augustin, Germany)
Pascal Welke† (Machine Learning Rhine-Ruhr, University of Bonn, Bonn, Germany)

∗ 0000-0001-6615-2128
† 0000-0002-2123-3781

ABSTRACT

We discuss the notion of centered data matrices and show how to compute them using centering matrices. As centering matrices have many applications in data science and machine learning, we have a look at one such application and discuss how they allow for centering kernel matrices.

1 CENTERING DATA MATRICES

In data mining, machine learning, or pattern recognition, we often have to analyze or process a sample of n data points x_i ∈ R^m. This, in turn, frequently requires us to consider zero mean or centered data.

A simple canonical example of such a setting is the computation of the m × m (biased) sample covariance matrix

    C = (1/n) Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)^⊤        (1)

where the vector

    x̄ = (1/n) Σ_{i=1}^{n} x_i        (2)

denotes the sample mean.

While this is a specific example, we note its following general aspect: The operation of subtracting the sample mean from each of the given data points,

    z_i = x_i − x̄        (3)

causes the resulting transformed data z_i to have the following mean

    z̄ = (1/n) Σ_{i=1}^{n} z_i
      = (1/n) Σ_{i=1}^{n} (x_i − x̄)
      = (1/n) Σ_{i=1}^{n} x_i − (1/n) Σ_{i=1}^{n} x̄
      = x̄ − (1/n) · n · x̄
      = 0        (4)

In other words, the transformation in (3) causes the transformed data to be "of zero mean" or to be "centered at 0".

With respect to the sample covariance matrix in our introductory example, this means that we could also compute it as

    C = (1/n) Σ_{i=1}^{n} z_i z_i^⊤        (5)

What does any of this have to do with data matrices? Well, we may gather the n given data points x_i ∈ R^m in an m × n data matrix

    X = [x_1, x_2, ..., x_n]        (6)

and apply linear algebra to facilitate procedures such as the one we just went through. If we did this, we should hope that there is a linear algebraic mechanism to produce a centered data matrix

    Z = [z_1, z_2, ..., z_n]        (7)

Indeed, linear algebra provides such a mechanism and we next discuss how it works.

In what follows, we let 1 ∈ R^n denote the n-dimensional vector of all ones. If n data points have been gathered in a data matrix X, it is an easy exercise to verify that their sample mean in (2) can also be computed like this

    x̄ = (1/n) X 1        (8)

When working with the vector of all ones, we can moreover consider the outer product x̄ 1^⊤ to obtain an m × n matrix

    x̄ 1^⊤ = [x̄, x̄, ..., x̄]        (9)

all of whose columns are copies of x̄.

This matrix now allows us to perform data centering, i.e. to subtract the sample mean from each of the given data points, by means of just a single matrix subtraction, namely

    Z = X − x̄ 1^⊤        (10)

Given what we saw in (4), it is clear that the column mean of the resulting matrix Z will be 0. In other words, Z is a centered version of our original data matrix X. Regarding our introductory example of the sample covariance matrix, this means that we can now compute it as

    C = (1/n) Z Z^⊤        (11)

But let us keep going and develop even deeper insights into data centering! If we plug (8) into (10), we obtain the following crucial result

    Z = X − (1/n) X 1 1^⊤ = X (I − (1/n) 1 1^⊤) ≡ X P        (12)

Here, I denotes the n × n identity matrix and the matrix

    P = I − (1/n) 1 1^⊤        (13)

is called a centering matrix because it centers the columns of matrix X at 0.
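To make the matrix formulation concrete, here is a minimal NumPy sketch (not part of the original note) that builds the centering matrix from (13), centers a small data matrix as in (10) and (12), and recomputes the sample covariance matrix from the centered data; the variable names X, P, Z, and C simply mirror the symbols above, and the random toy data is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                       # data dimension m, number of points n
X = rng.normal(size=(m, n))       # data matrix whose columns are the x_i

# centering matrix P = I - (1/n) 1 1^T, cf. equation (13)
ones = np.ones((n, 1))
P = np.eye(n) - (ones @ ones.T) / n

# centered data matrix Z = X P, cf. equation (12)
Z = X @ P

# same result as subtracting the sample mean from every column, cf. (8)-(10)
x_bar = X.mean(axis=1, keepdims=True)
assert np.allclose(Z, X - x_bar)

# the column mean of Z is (numerically) zero, cf. equation (4)
assert np.allclose(Z.mean(axis=1), 0.0)

# biased sample covariance computed from centered data, cf. (1), (5), (11)
C = (Z @ Z.T) / n
C_direct = sum(np.outer(x - x_bar[:, 0], x - x_bar[:, 0]) for x in X.T) / n
assert np.allclose(C, C_direct)
```

Note that forming P explicitly requires O(n²) memory; in practice one would usually subtract the column mean directly, which, as the assertions above confirm, yields the same centered matrix.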
Centering matrices have a couple of important or convenient characteristics. For instance, it is easy to show that

(1) P is square, i.e. P ∈ R^{n×n}
(2) P is symmetric, i.e. P^⊤ = P
(3) P is idempotent, i.e. P^k = P for all k ∈ N with k ≥ 1
(4) P is positive semi-definite, i.e. y^⊤ P y ≥ 0 for all y ∈ R^n
(5) P is singular, i.e. not invertible, i.e. P^{−1} does not exist
(6) because of (4), all eigenvalues of P are non-negative
(7) because of (3), all eigenvalues of P are either 0 or 1
(8) because of (2) and (3), P is an orthogonal projection matrix
(9) the trace of P amounts to tr P = n · (1 − 1/n) = n − 1

Regarding our introductory example of computing the sample covariance matrix, we can make use of some of the above properties to obtain

    C = (1/n) Z Z^⊤                  (14)
      = (1/n) (X P)(X P)^⊤           (15)
      = (1/n) X P P^⊤ X^⊤            (16)
      = (1/n) X P P X^⊤              (17)
      = (1/n) X P X^⊤                (18)

Since this result may look curious at first sight, we will briefly verify it from a different point of view. If we expand the outer products in the summation in (1), we obtain the following fairly well known decomposition of the sample covariance matrix

    C = (1/n) Σ_{i=1}^{n} x_i x_i^⊤ − (2/n) Σ_{i=1}^{n} x_i x̄^⊤ + (1/n) Σ_{i=1}^{n} x̄ x̄^⊤        (19)
      = (1/n) X X^⊤ − 2 x̄ x̄^⊤ + x̄ x̄^⊤        (20)
      = (1/n) X X^⊤ − x̄ x̄^⊤        (21)

If we next plug (8) into this expression, we realize that the sample covariance matrix can also be written as

    C = (1/n) X X^⊤ − (1/n²) (X 1)(X 1)^⊤        (22)
      = (1/n) X X^⊤ − (1/n²) X 1 1^⊤ X^⊤        (23)
      = (1/n) X (X^⊤ − (1/n) 1 1^⊤ X^⊤)        (24)
      = (1/n) X (I − (1/n) 1 1^⊤) X^⊤        (25)
      = (1/n) X P X^⊤        (26)

which does indeed corroborate the result in (18).

2 CENTERING KERNEL MATRICES

Many algorithms for data mining, machine learning, and pattern recognition work with Gram matrices

    G = X^⊤ X        (27)

which are matrices whose elements are given by [G]_{ij} = x_i^⊤ x_j and thus only depend on inner products between given data points. This allows for invoking the kernel trick where we replace inner products by evaluations of a Mercer kernel

    k(x_i, x_j) = ⟨φ_i | φ_j⟩        (28)

Here, φ_i ≡ φ(x_i), where φ : R^m → H is some generally unknown transformation from the data space into a potentially infinite dimensional feature space. The function ⟨· | ·⟩ : H × H → R denotes the inner product in that feature space.

To stress the whole point of our following discussion, we remark that it sounds formidable to work with unknown transformations into potentially infinite dimensional spaces. Recall, however, that we only do this implicitly when using explicitly given kernel functions. In other words, while it may be impossible to practically compute the inner product on the right of equation (28), it is usually possible to evaluate the kernel function on its left hand side.

With respect to Gram matrices, this is to say that invoking the kernel trick is to replace the n × n matrix G by an n × n kernel matrix K where

    [K]_{ij} = k(x_i, x_j)        (29)

Conceptually, or from the point of view of "pen-and-paper math", this is the same as introducing a feature space data matrix

    Φ = [φ_1, φ_2, ..., φ_n]        (30)

and then replacing G by an n × n matrix Φ^⊤ Φ where

    [Φ^⊤ Φ]_{ij} = ⟨φ_i | φ_j⟩        (31)
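As a concrete illustration of the kernel matrix in (29), the following sketch (again not part of the original note) computes K for a small sample using the Gaussian kernel k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)); the Gaussian kernel is merely one convenient example of a Mercer kernel, and the function name and bandwidth sigma are illustrative choices.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Kernel matrix [K]_{ij} = k(x_i, x_j) for the Gaussian kernel,
    where the columns of X are the data points x_i (shape m x n)."""
    # squared Euclidean distances between all pairs of columns of X
    sq_norms = np.sum(X**2, axis=0)                               # shape (n,)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    return np.exp(-sq_dists / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))            # m = 3, n = 5
K = gaussian_kernel_matrix(X, sigma=2.0)

# K is a symmetric n x n matrix of kernel evaluations, cf. (29)
assert K.shape == (5, 5)
assert np.allclose(K, K.T)
```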
Now, if we kernelize an algorithm that requires us to work with centered data z_i = x_i − x̄ ∈ R^m, we must make sure that their feature space representations ψ_i = φ_i − φ̄ ∈ H, where φ̄ denotes the feature space sample mean, are centered, too. Again, this sounds formidable, because we just emphasized that it may be practically impossible to compute the underlying transformations. How are we then supposed to compute the feature space mean required for feature space centering? But let us do some "pen-and-paper math" to see what we are actually dealing with.

On a conceptual or symbolic level, we can easily work with the feature space data matrix Φ in (30). Our insights from the previous section then tell us that

    Ψ = Φ P = Φ − (1/n) Φ 1 1^⊤        (32)

will be a centered matrix. However, if we cannot even write computer code to compute Φ, we certainly cannot write computer code to center it. Yet, what we are actually interested in when kernelizing an algorithm is to compute the Gramian Ψ^⊤ Ψ, and we will see next that this is practically feasible.

In fact, this is easy to see: Using the definition in (32) together with general properties of the centering matrix P, we have

    Ψ^⊤ Ψ = (Φ P)^⊤ (Φ P)        (33)
          = P^⊤ Φ^⊤ Φ P        (34)
          = P Φ^⊤ Φ P        (35)

At this point, we note that it is actually possible to compute the entries of the feature space Gramian Φ^⊤ Φ, because

    [Φ^⊤ Φ]_{ij} = ⟨φ_i | φ_j⟩ = k(x_i, x_j) = [K]_{ij}        (36)

In other words, we may not be able to practically compute the matrix Φ, but we are able to compute Φ^⊤ Φ because its entries can be obtained by evaluating a computable kernel function. All in all, we therefore have the crucial result that

    Ψ^⊤ Ψ = P K P        (37)

At this point, we have successfully shown that it is feasible to practically compute (and therefore work with) the matrix Ψ^⊤ Ψ. But let us once again keep going and develop deeper insights into what it means to center a kernel matrix. The entries of the centered kernel matrix are inner products of centered feature space representations, namely [Ψ^⊤ Ψ]_{ij} = ⟨φ_i − φ̄ | φ_j − φ̄⟩.

Using the definition of the feature space sample mean in (44) and the bilinearity of the inner product, this can also be written as

    ⟨φ_i − φ̄ | φ_j − φ̄⟩ = ⟨φ_i | φ_j⟩
                          − (1/n) Σ_{l=1}^{n} ⟨φ_i | φ_l⟩
                          − (1/n) Σ_{l=1}^{n} ⟨φ_l | φ_j⟩
                          + (1/n²) Σ_{l=1}^{n} Σ_{k=1}^{n} ⟨φ_l | φ_k⟩        (47)

Given this expansion, we can now resort to (28) and write all the inner products on the right hand side of (47) in terms of kernel functions

    ⟨φ_i − φ̄ | φ_j − φ̄⟩ = k(x_i, x_j)
                          − (1/n) Σ_{l=1}^{n} k(x_i, x_l)
                          − (1/n) Σ_{l=1}^{n} k(x_l, x_j)
                          + (1/n²) Σ_{l=1}^{n} Σ_{k=1}^{n} k(x_l, x_k)
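Finally, here is a short sketch (not from the original note) of the result in (37): it centers a kernel matrix as P K P and, for the special case of the linear kernel k(x_i, x_j) = x_i^⊤ x_j, where the feature map is explicitly known and Φ = X, checks the outcome both against the Gram matrix of the explicitly centered data Z = X P and against the entry-wise expansion derived from (47); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 6
X = rng.normal(size=(m, n))

# centering matrix P = I - (1/n) 1 1^T, cf. equation (13)
P = np.eye(n) - np.ones((n, n)) / n

# linear kernel: the feature map is the identity, so K = X^T X, cf. (27)-(29)
K = X.T @ X

# centered kernel matrix, cf. equation (37)
K_centered = P @ K @ P

# explicit check: for the linear kernel, Psi = Phi P = X P,
# so Psi^T Psi must equal P K P
Z = X @ P
assert np.allclose(K_centered, Z.T @ Z)

# entry-wise check of the expansion in terms of kernel values, cf. (47)
i, j = 0, 1
expected = K[i, j] - K[i, :].mean() - K[:, j].mean() + K.mean()
assert np.isclose(K_centered[i, j], expected)
```

The entry-wise check also makes the practical recipe explicit: centering a kernel matrix amounts to subtracting the row and column means of K and adding back its overall mean.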