NEW INITIALIZATION STRATEGY FOR NONNEGATIVE FACTORIZATION

A Thesis Presented

By

Yueyang Wang

to

The Department of Electrical and Computer Engineering

In partial fulfillment of requirements for the degree of

Master of Science

in the field of

Electrical & Computer Engineering

Northeastern University Boston, Massachusetts

May 2018


ABSTRACT

Nonnegative matrix factorization (NMF) has proved to be a powerful data representation method and has shown success in applications such as image representation and document clustering. In this thesis, we propose a new initialization strategy for NMF, entitled square factorization (SQR-NMF). In this method, we first transform the non-square nonnegative matrix into a square one; several strategies are proposed for this SQR step. We then take the positive part of the eigenvalues and eigenvectors for initialization. Simulation results show that SQR-NMF has a faster convergence rate and provides an approximation with a lower error rate than SVD-NMF and random initialization. The choice of complementary elements added to the data matrix also affects the results: the experiments show that the complementary elements should be 0 for small data sets and the mean values of each row or column of the original nonnegative matrix for large data sets.

Key words: Nonnegative matrix factorization, complementary elements


ACKNOWLEDGEMENT

I would like to express my sincere gratitude to my thesis advisor, Professor Shafai, for his patience, motivation, and immense knowledge. His guidance helped me greatly throughout the research and the writing of this thesis; I could not have imagined having a better advisor for my master's study. Besides my advisor, I would like to thank the rest of my thesis committee, Professor Nian Sun and Professor Vinay Ingle, for their insightful comments and encouragement. Finally, I would like to express my profound gratitude to my parents for their support and continuous encouragement throughout the writing of this thesis and my life in general.


TABLE OF CONTENTS

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

Chapter 1 Introduction

Chapter 2 Nonnegative Matrix
2.1. Nonnegative Matrices and Perron-Frobenius Theorem
2.2. Reducible matrix
2.3. Irreducible Matrix
2.4. Primitive Matrix
2.5. Stochastic matrix
2.6. M-Matrix
2.7. Metzler Matrix
2.8. Symmetric matrix
2.9. Non-square Nonnegative Matrix

Chapter 3 SQR-NMF for the initialization of NMF
3.1. Nonnegative Matrix Factorization
3.2. SVD based initialization: SVD-NMF
3.3. Eigenvalue decomposition based initialization: SQR-NMF
3.3.1 SQR-NMF initialize with eigenvalue and eigenvector
3.3.2 SQR-NMF initialize with symmetric matrix

Chapter 4 Simulation Results and Comparison
4.1. Simulation results for MU algorithm
4.1.1 Comparison w.r.t fixed factorization rank and increasing iteration times
4.1.2 Comparison w.r.t fixed iteration time and increasing factorization rank
4.1.3 Reconstruction images
4.2. Simulation results for divergence-reducing algorithm
4.2.1 Comparison w.r.t fixed factorization rank and increasing iteration times
4.2.2 Comparison w.r.t fixed iteration time and increasing factorization rank
4.3. Numerical results for different complementary elements
4.3.1 Comparison w.r.t fixed factorization rank and increasing iteration times
4.3.2 Reconstruction images

Chapter 5 Conclusion

REFERENCES

APPENDIX


LIST OF TABLES

Table 1 Factorization rank of processing 1 image with different extraction ratio
Table 2 Factorization rank of processing 5 images with different extraction ratio


LIST OF FIGURES

Figure 1 Errors of reconstruction using MU algorithm
Figure 2 Errors of reconstruction using MU algorithm
Figure 3 Errors of reconstruction using MU algorithm with fixed iteration
Figure 4 Reconstruction images
Figure 5 Errors of reconstruction using divergence-reducing algorithm
Figure 6 Errors using divergence-reducing algorithm with fixed iteration
Figure 7 Errors of adding different complementary elements
Figure 8 Reconstruction images

Chapter 1 Introduction

Nonnegative matrix factorization (NMF) has become a widely used method for analyzing large datasets, since it extracts features from a large set of data vectors. It is a kind of factorization that constrains the elements of both the basis components and the expansion coefficients to be nonnegative. Nonnegative matrix factorization is a useful method for reducing the dimension of large datasets. In the paper "Learning the parts of objects by non-negative matrix factorization", Lee and Seung [1] showed how NMF could learn parts of objects in facial images; it is motivated by the idea of combining the parts to form the whole. Negative values are admissible in a general factorization, but they lose physical meaning in practice. The factorization of matrices representing complex multidimensional datasets is the basis of several commonly applied techniques for pattern recognition and unsupervised clustering. Similarly to principal component analysis or independent component analysis, the objective of nonnegative matrix factorization (NMF) is to explain the original data with a limited number of basis components and coefficients, which when combined together approximate the original data as accurately as possible. NMF is special in that it constrains both the matrix representing the basis components and the matrix of mixture coefficients to have nonnegative entries, and in that no orthogonality or independence constraints are imposed on the basis components. This leads to a simple and intuitive interpretation of the factors in NMF, and allows the basis components to overlap. Because of this definition [1], the NMF method has been successfully applied in several fields including image recognition and pattern recognition, signal processing and text mining [2]. NMF has also been applied in the biological sciences. It is used to obtain new insights into cancer type discovery based on gene expression microarrays [3], for the functional characterization of genes [4], to predict cis-regulating elements from positional word count matrices [5] and for phenotype prediction using cross-platform microarray data [6]. The popularity of the NMF approach derives essentially from three properties that distinguish it from standard decomposition techniques.

Firstly, the matrix factors are nonnegative by definition, which allows their intuitive interpretation as real underlying components within the context defined by the original data. The basis components can be directly interpreted as parts or basis samples, present in different proportions in each observed sample. In the context of gene expression microarrays, Brunet et al. [3] interpreted them as metagenes that capture gene expression patterns specific to different groups of samples. When decomposing positional word count matrices of k-mers in DNA sequences, Hutchins et al. [5] interpreted the basis samples as specific sequence patterns or putative regulatory motifs. Secondly, NMF generally produces sparse results, which means that the basis and/or mixture coefficients have only a few non-zero entries. This provides a more compact and local representation, emphasizing even more the parts-based decomposition of the data [1]. NMF based representations perform very well in the identification of clusters of samples and their characterization with a small set of marker features [2]. For example, Carmona-Saez et al. [7] used this feature to define a bi-clustering approach for gene expression microarrays. Finally, unlike other decomposition methods such as SVD or ICA, NMF does not aim at finding components that are orthogonal or independent, but allows them to overlap instead. This unique feature is particularly useful in the context of gene expression microarrays, where overlapping metagenes could identify genes that belong to multiple pathways or processes [3]. In this thesis, we first introduce the basic idea of NMF. Secondly, we show some properties of nonnegative matrices. Thirdly, we propose a new initialization strategy for NMF, the Jordan form based method. Finally, numerical experiments show the effectiveness of the new method.


Chapter 2 Nonnegative Matrix

In this chapter, we consider square nonnegative matrices and some of their important properties. A more general treatment of such matrices can be found in [18].

2.1. Nonnegative Matrices and Perron-Frobenius Theorem

Definition 2.1. A square matrix A satisfying A ≥ 0, i.e. a square matrix all of whose elements are nonnegative, is called a square nonnegative matrix.

Theorem 2.1. If A is a nonnegative square matrix, then (a) ρ(A), the spectral radius of A, is an eigenvalue of A, (b) A has a nonnegative eigenvector corresponding to ρ(A), and (c) A^T has a nonnegative eigenvector corresponding to ρ(A).

The Perron-Frobenius theorem asserts that a real square matrix with positive entries has a unique largest real eigenvalue and that the corresponding eigenvector can be chosen to have strictly positive components; it also asserts a similar statement for certain classes of nonnegative matrices [8]. It has been applied in many fields, not only in branches of mathematics such as the theory of Markov chains and of compact operators, but also in various fields of science, such as economics.

Theorem 2.2. (a) If 퐴 is positive, then 휌(퐴) is a simple eigenvalue, greater than the magnitude of any other eigenvalue. (b) If 퐴 ≥ 0 is irreducible, then 휌(퐴) is a simple eigenvalue, any eigenvalue of 퐴 of the same modulus is also simple, 퐴 has a positive eigenvector 푥 corresponding to 휌(퐴), and any nonnegative eigenvector of 퐴 is a multiple of 푥.

Theorem 2.2 is the first part of the classical Perron-Frobenius theorem. Perron [9] proved it for positive matrices, and Frobenius [9] extended it to irreducible matrices.
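As a small numerical illustration of Theorem 2.2, the following MATLAB fragment checks that the spectral radius of a positive matrix is attained by a real eigenvalue with a positive eigenvector; the 3 x 3 matrix is only an example chosen for this sketch.

% Numerical check of the Perron-Frobenius property for a positive matrix
A = [2 1 1; 1 3 2; 1 1 4];           % example positive matrix
[Vec, D] = eig(A);                   % eigenvectors and eigenvalues
[rho, k] = max(abs(diag(D)));        % spectral radius and the index of the Perron root
x = Vec(:, k);
x = x / sign(x(1));                  % fix the sign so the Perron vector is positive
fprintf('spectral radius = %.4f, Perron eigenvalue = %.4f\n', rho, real(D(k, k)));
disp('Perron eigenvector (all entries positive):');
disp(x);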

The following is the second part of the Perron-Frobenius theorem.

Theorem 2.3. (a) If an irreducible A has h eigenvalues

\lambda_0 = r e^{i\theta_0}, \; \lambda_1 = r e^{i\theta_1}, \; \dots, \; \lambda_{h-1} = r e^{i\theta_{h-1}}

of modulus ρ(A) = r, where 0 = θ_0 < θ_1 < ⋯ < θ_{h−1} < 2π, then these numbers are the distinct roots of λ^h − r^h = 0.

(b) More generally, the whole spectrum S = {λ_0, λ_1, ⋯, λ_{n−1}} of A goes over into itself under a rotation of the complex plane by 2π/h.

(c) If h > 1, then A can be transformed to

P A P^T = \begin{bmatrix} 0 & A_{12} & 0 & \cdots & 0 \\ 0 & 0 & A_{23} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & A_{h-1,h} \\ A_{h1} & 0 & 0 & \cdots & 0 \end{bmatrix}

by a cogredient (permutation) transformation, where the zero blocks along the diagonal are square.

2.2. Reducible matrix

An n × n matrix A is cogredient to a matrix E if P A P^T = E for some permutation matrix P. A is reducible if it is cogredient to a matrix of the form

E = \begin{bmatrix} B & 0 \\ C & D \end{bmatrix},

where B and D are square matrices, or if n = 1 and A = 0. Otherwise, A is irreducible.

Theorem 2.4. To the spectral radius 푟 of 퐴 ≥ 0 there corresponds a positive eigenvector if and only if the final classes of 퐴 are exactly its basic ones.

Theorem 2.5. To the spectral radius of A ≥ 0 there corresponds a positive eigenvector of A and a positive eigenvector of A^T if and only if all the classes of A are basic and final. (In other words, the triangular block form of A is a direct sum of matrices having r as spectral radius.)

Theorem 2.6.

Let A ≥ 0 have spectral radius r and m basic classes α_1, α_2, ⋯, α_m. Then the algebraic eigenspace of A contains nonnegative vectors x^{(1)}, ⋯, x^{(m)} such that x_i^{(j)} > 0 if and only if i has access to α_j, and any such collection is a basis of the algebraic eigenspace of A.

2.3. Irreducible Matrix

Each of the following conditions characterizes the irreducibility of a nonnegative matrix A of order n (n > 1):
(a) No nonnegative eigenvector of A has a zero coordinate.
(b) A has exactly one (up to scalar multiplication) nonnegative eigenvector, and this eigenvector is positive.
(c) αx ≥ Ax, x > 0 implies x ≫ 0.
(d) (I + A)^{n−1} ≫ 0.
(e) A^T is irreducible.

Corollary 2.1. An irreducible matrix is primitive if its trace is positive.
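Condition (d) gives a simple computational test for irreducibility. The following MATLAB fragment is a minimal sketch of that test; the 3 x 3 cyclic matrix is only an example chosen for illustration.

% Irreducibility test based on condition (d): (I + A)^(n-1) >> 0
A = [0 1 0; 0 0 1; 1 0 0];                 % example: a cyclic (irreducible) matrix
n = size(A, 1);
isIrreducible = all(all((eye(n) + A)^(n-1) > 0));
fprintf('irreducible: %d\n', isIrreducible);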

2.4. Primitive Matrix

Theorem 2.7. The following conditions on a nonnegative matrix A are equivalent:
(a) A is irreducible and ρ(A) is greater in magnitude than any other eigenvalue.
(b) There exists a natural number m such that A^m is positive.

Matrices satisfying the conditions in Theorem 2.7 are called primitive matrices. The index of primitivity, γ(A), of a primitive matrix A is the smallest positive integer k such that A^k ≫ 0.
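The index of primitivity can be found by direct computation, raising A to successive powers until all entries become positive. A minimal sketch follows; the 2 x 2 matrix is an assumed example, and the loop terminates only if A is actually primitive.

% Smallest k with A^k >> 0 (index of primitivity) for an example matrix
A = [0 1; 1 1];                 % primitive: irreducible with a positive diagonal entry
k = 1;
Ak = A;
while ~all(Ak(:) > 0)
    k = k + 1;
    Ak = Ak * A;
end
fprintf('index of primitivity gamma(A) = %d\n', k);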

2.5. Stochastic matrix

Consider n possible states, s_1, s_2, ⋯, s_n, of a certain process. Suppose that the probability of the process moving from state s_i to state s_j is time independent, and denote this probability by t_{ij}. Such a process is called a finite homogeneous Markov chain. These processes clearly satisfy

t_{ij} \ge 0, \qquad \sum_{j=1}^{n} t_{ij} = 1, \qquad i, j = 1, \dots, n.

A square matrix T = (t_{ij}) of order n satisfying these conditions is called (row) stochastic, and doubly stochastic if, in addition,

\sum_{i=1}^{n} t_{ij} = 1, \qquad j = 1, \dots, n.

Theorem 2.8. The maximal eigenvalue of a stochastic matrix is one. A nonnegative matrix T is stochastic if and only if e = (1, 1, \dots, 1)^T is an eigenvector of T corresponding to the eigenvalue one.

Theorem 2.9. If A ≥ 0, ρ(A) > 0, z ≫ 0, and Az = ρ(A)z, then A/ρ(A) is similar to a stochastic matrix.
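Theorem 2.8 is easy to check numerically. In the sketch below, the 3 x 3 transition matrix is only an example; its rows sum to one, Te equals e, and no eigenvalue exceeds one in modulus.

% Verify that a row-stochastic matrix has maximal eigenvalue 1 with eigenvector e
T = [0.5 0.3 0.2; 0.1 0.6 0.3; 0.2 0.2 0.6];   % example transition matrix
e = ones(3, 1);
disp(norm(T * e - e));                          % Te = e, so this is (numerically) zero
disp(max(abs(eig(T))));                         % spectral radius equals 1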

2.6. M-Matrix

Many problems in the biological, physical and social sciences can be treated as problems involving matrices with certain sign constraints. One of the most common situations is where the matrix A in question has nonpositive off-diagonal and nonnegative diagonal entries, that is, A is a finite matrix of the type

A = \begin{bmatrix} a_{11} & -a_{12} & -a_{13} & \cdots \\ -a_{21} & a_{22} & -a_{23} & \cdots \\ -a_{31} & -a_{32} & a_{33} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix},

where the a_{ij} are nonnegative. Since A can be expressed in the form

A = sI - B, \qquad s > 0, \; B \ge 0,

we see that the theory of nonnegative matrices plays a dominant role in the study of these matrices. Matrices of this form are called M-matrices.

Nonsingular M-matrices. If A is a nonsingular M-matrix, the following conditions are equivalent:
(a) All of the principal minors of A are positive.
(b) Every real eigenvalue of each principal submatrix of A is positive.
(c) A + αI is nonsingular for each α ≥ 0.
(d) Every real eigenvalue of A is positive.
(e) All the leading principal minors of A are positive.
(f) There exist lower and upper triangular matrices L and U, respectively, with positive diagonals, such that A = LU.
(g) A is inverse-positive; that is, A^{-1} exists and A^{-1} ≥ 0.
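As an illustration, the following sketch builds an example matrix of the form A = sI − B with s greater than ρ(B) and checks conditions (d) and (g) numerically; the matrix B and the choice of s are assumptions made only for this example.

% Check two of the nonsingular M-matrix conditions for A = s*I - B with s > rho(B)
B = [0 1 2; 1 0 1; 2 1 0];              % example nonnegative matrix
s = max(abs(eig(B))) + 1;               % choose s larger than the spectral radius of B
A = s * eye(3) - B;
disp(min(real(eig(A))));                % condition (d): every real eigenvalue is positive
disp(min(min(inv(A))));                 % condition (g): A^{-1} >= 0, i.e. inverse-positive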

2.7. Metzler Matrix

A Metzler matrix is a matrix in which all the off-diagonal components are nonnegative. It can be expressed as

A = \begin{bmatrix} -a_{11} & a_{12} & a_{13} & \cdots \\ a_{21} & -a_{22} & a_{23} & \cdots \\ a_{31} & a_{32} & -a_{33} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix},

where the a_{ij} are nonnegative. Since A can also be expressed in the form

A = B - sI, \qquad s > 0, \; B \ge 0,

the study of Metzler matrices is also based on the study of nonnegative matrices. The exponential of a Metzler matrix is a nonnegative matrix, because e^A = e^{-s} e^B and the exponential of a nonnegative matrix is nonnegative. This is natural once one observes that the generator matrices of continuous-time finite-state Markov processes are always Metzler matrices, and that probability distributions are always nonnegative. A Metzler matrix has an eigenvector in the nonnegative orthant because of the corresponding property for nonnegative matrices [10].
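This property is easy to verify numerically; in the sketch below the Metzler matrix is an example chosen only for illustration.

% The exponential of a Metzler matrix is (entrywise) nonnegative
A = [-2 1 1; 0.5 -1 0.5; 1 2 -3];       % example Metzler matrix (generator-like)
E = expm(A);                            % matrix exponential
disp(min(E(:)));                        % smallest entry is nonnegative, as stated above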

2.8. Symmetric matrix

Theorem 2.10. (a) The sum and difference of two symmetric matrices are symmetric. (b) Given symmetric matrices A and B, the product AB is symmetric if and only if A and B commute. (c) For any positive integer n, A^n is symmetric if A is symmetric. (d) If A^{-1} exists, it is symmetric if and only if A is symmetric.

2.9. Non-square Nonnegative Matrix

A matrix A is called a nonnegative non-square matrix if all its entries a_{ij} ≥ 0, for i = 1, 2, ⋯, m and j = 1, 2, ⋯, n, with m ≠ n. However, we can transform such non-square nonnegative matrices into square nonnegative matrices as defined at the beginning of this chapter: |n − m| rows or columns of complementary elements are added to make A a square nonnegative matrix. The complementary elements can be 0, 1, or the mean values of each row or column. Different complementary elements generate different eigenvalues and therefore affect the factorization procedure. The factorization results of complementing different elements are shown in Chapter 4.
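A minimal sketch of this square transformation is given below. It pads an m × n nonnegative matrix on its short dimension with a chosen complementary value (0, 1, or the row or column means); the function name makeSquare is used only for this illustration, and the getP routine in the Appendix applies the same idea.

% Pad a non-square nonnegative matrix to a square one with complementary elements
function Vsq = makeSquare(V, mode)      % mode: 'zero', 'one', or 'mean'
[m, n] = size(V);
p = max(m, n);
Vsq = zeros(p, p);
Vsq(1:m, 1:n) = V;
if m == n
    return;                             % already square, nothing to pad
end
switch mode
    case 'zero'
        % padding is already zero
    case 'one'
        if m > n
            Vsq(:, n+1:p) = 1;
        else
            Vsq(m+1:p, :) = 1;
        end
    case 'mean'
        if m > n                        % pad the extra columns with each row's mean
            Vsq(1:m, n+1:p) = repmat(mean(V, 2), 1, p - n);
        else                            % pad the extra rows with each column's mean
            Vsq(m+1:p, 1:n) = repmat(mean(V, 1), p - m, 1);
        end
end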


Chapter 3 SQR-NMF for the initialization of NMF

In this chapter, we present two initialization strategies for nonnegative matrix factorization. One is called SVD-NMF, which is based on the singular value decomposition. The other is called SQR-NMF; its first step is the square transformation, and its initialization step is based on the eigenvalue decomposition.

3.1. Nonnegative Matrix Factorization

NMF can be described as follows: for a nonnegative matrix V_{m×n} of size m × n, each column of which is a sample vector, we can find W ∈ R_+^{m×r} and H ∈ R_+^{r×n}, with respective sizes m × r and r × n, satisfying

V_{m \times n} \approx W_{m \times r} H_{r \times n},

where W is called the basis matrix, H is called the coefficient matrix, and r is the rank of the factorization [11]. r should satisfy the rule

(m + n) r < mn.

It is important to determine the size of the low rank matrices. For the accuracy of the approximation, the larger r is, the higher the accuracy will be; on the other hand, to reduce the size of the data, we need r to be small. Researchers usually set r randomly at the beginning of other algorithms; in this thesis, we set r to different values by the choosing rule introduced with SVD-NMF. As Boutsidis and Gallopoulos noted [12], a good initialization strategy should satisfy the following conditions: (1) one that leads to rapid error reduction and faster convergence; (2) one that leads to lower overall error at convergence [11]. We only evaluate the Jordan form based NMF against the first condition, since the second one is very difficult to satisfy for most algorithms. There are normally two ways to define the cost function, which represents the quality of the approximation in the NMF reconstruction. One useful measure is simply the square of the Euclidean distance between two matrices [13]:

C_1 = \|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2

Another useful measure is the divergence between two matrices[13]:

C_2 = D(A \| B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)

We now consider the update rules of NMF as the following problems:

Problem 1: Minimize \|V - WH\|^2 with respect to W and H, subject to the constraints W, H ≥ 0:

\min f(W, H) = \sum_{i=1}^{m} \sum_{j=1}^{n} (V_{ij} - (WH)_{ij})^2

subject to W_{ia} ≥ 0, H_{bj} ≥ 0, ∀ i, a, b, j.

Problem 2: Minimize D(V \| WH) with respect to W and H, subject to the constraints W, H ≥ 0:

D(V \| WH) = \sum_{ij} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)

subject to W_{ia} ≥ 0, H_{bj} ≥ 0, ∀ i, a, b, j.

Lee and Seung proposed two update algorithms for solving Problems 1 and 2.

Algorithm 1: The Euclidean distance \|V - WH\| is non-increasing under the multiplicative update (MU) rules [13]:

H ← H .* ((W^T V) ./ (W^T W H))
W ← W .* ((V H^T) ./ (W H H^T))

Algorithm 2: The divergence D(V \| WH) is non-increasing under the divergence-reducing update rules [13]:

H_{a\mu} ← H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}

W_{ia} ← W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}

Currently, several works in the literature propose different methods to improve the initialization of the NMF algorithm. S. Wild [14] proposed to produce a structured initialization by spherical k-means. This method is very effective, but it increases the computational complexity. A. Janecek and Y. Tan [15] applied population based algorithms to initialize NMF; they used five population based algorithms to find optimal starting points for single rows of W and single columns of H. This method also increases the computational complexity.

3.2. SVD based initialization: SVD-NMF

Let V ∈ R^{m×n} have the singular value decomposition

V = P \hat{\Sigma} Q^T, \qquad \hat{\Sigma} = diag(\sigma_1, \sigma_2, \dots, \sigma_p) \in R^{m \times n}, \qquad p = \min(m, n),

where σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_p > 0 are the singular values of V, and P ∈ R^{m×m} and Q ∈ R^{n×n} are orthogonal matrices. There can be negative elements in P and Q, so we cannot use these two matrices directly for initialization. Thus, we replace the negative entries by their absolute values, obtaining |P| and |Q^T|. We get the initialization formulas for W and H:

W_0 = |P|, \qquad H_0 = |\hat{\Sigma} Q^T|,

where |·| indicates that all entries are replaced by their absolute values. We set r by the following choosing rule.

Factorization rank r choosing rule. The singular values obtained by the SVD are sorted in descending order; the sum of the first few singular values then takes a large portion of the sum of all singular values, which means they contain enough information for reconstruction.

We define the sum of all singular values as sum_p and the sum of the first r of them as sum_r, thus we have

sum_p = \sigma_1 + \sigma_2 + \cdots + \sigma_p,
sum_r = \sigma_1 + \sigma_2 + \cdots + \sigma_r.

Then we define the choosing rule when the extraction ratio is 90% in the following form:

sum_r / sum_p \le 90\%, \qquad sum_{r+1} / sum_p \ge 90\%.
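Putting the initialization formulas and the 90% choosing rule together gives the short sketch below. It mirrors the ChoosingR and nnmf_svd routines listed in the Appendix and assumes only that a nonnegative data matrix V is already in the workspace.

% SVD-NMF initialization with the 90% extraction-ratio choosing rule
[P, S, Q] = svd(V);                         % V = P*S*Q'
sv = diag(S);                               % singular values in descending order
r = find(cumsum(sv) / sum(sv) >= 0.9, 1);   % smallest r reaching 90% of the singular-value sum
W0 = abs(P(:, 1:r));                        % nonnegative starting basis matrix
H0 = abs(S(1:r, :) * Q');                   % nonnegative starting coefficient matrix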


Extraction ratio     Factorization rank
80%                  10   12    8    7    9
90%                  23   26   20   20   21
100%                 92   92   92   92   92

Table 1 Factorization rank of processing 1 image with different extraction ratio

Extraction ratio     Factorization rank
80%                  17   26   18   25   16
90%                  37   47   37   47   37
100%                112  112  112  112  112

Table 2 Factorization rank of processing 5 images with different extraction ratio

Table 1 gives the factorization rank r for different facial images obtained with the choosing rule. These images are derived from the ORL database [16]. The ORL database contains 40 different persons, each with 10 different facial images, and the size of each image is 112 × 92. From Section 3.1, r should satisfy r < mn/(m + n) ≈ 51 when processing 1 image and r < mn/(m + n) ≈ 90 when processing 5 images from the ORL database, where m and n are the numbers of rows and columns of the image matrix. Table 1 and Table 2 show that all of the factorization ranks satisfy this rule when the extraction ratio is 80% or 90%.

3.3. Eigenvalue decomposition based initialization: SQR-NMF

In this method, we first transform a non-square nonnegative matrix into a square one. We add complementary elements to each row or column to make it a square matrix; the complementary elements can be 0, 1, or the mean values of each row or column. Then we apply the eigenvalue decomposition to the square nonnegative matrix and set the factorization rank by the choosing rule introduced in SVD-NMF. Since there is no randomization in the initialization, it always converges to the same solution, which is critical for iterative algorithms. The initialization is based on square matrices with eigenvalue decomposition, therefore we title this method SQR-NMF. Let V ∈ R^{m×n} be padded to a square matrix with the eigenvalue decomposition

V_{m \times n} \to V_{p \times p} = Q \hat{V} Q^{-1}, \qquad \hat{V} = diag(\sigma_1, \sigma_2, \dots, \sigma_p), \qquad p = \max(m, n),

where σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_p are the eigenvalues of V and Q is the matrix whose columns are the eigenvectors of V.

3.3.1 SQR-NMF initialize with eigenvalue and eigenvector

In this method, we do the initialization with the eigenvalues and eigenvectors. Since the entries of the eigenvectors can be negative, we take the positive part for initialization. We get the initialization formulas for W and H:

W_0 = |Q|, \qquad H_0 = |\hat{V} Q^{-1}|,

where |·| indicates that all entries are replaced by their absolute values.

3.3.2 SQR-NMF initialize with symmetric matrix

Using the Jordan normal form, one can prove that any square real matrix can be written as the product of two real symmetric matrices [17].

V_{m \times n} \to V_{p \times p} = S_1 S_2, \qquad p = \max(m, n),

where S_1 and S_2 are symmetric matrices of size p × p. We generate a symmetric matrix S in the following form:

S = Q Q^T.

Then the square nonnegative matrix V_{p×p} can be factorized as

V_{p \times p} = S_1 S_2 = S V_{p \times p}^T S^{-1}, \qquad S_1 = S V_{p \times p}^T, \qquad S_2 = S^{-1}.

We initialize W and H with the positive section of S_1 and S_2, respectively:

W_0 = |S_1|, \qquad H_0 = |S_2|,

where |·| indicates that all entries are replaced by their absolute values.

In Chapter 4, we use numerical results to show the errors with increasing iterations and factorization rank for SQR-NMF, SVD-NMF and random NMF. We define the error in the following form to evaluate the approximation:

error = \frac{\|V - WH\|_F}{\|V\|_F},

where \|\cdot\|_F is the Frobenius norm of the matrix.
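Collecting the steps of this subsection, a minimal sketch of the symmetric-matrix initialization is given below. It assumes that the padded square matrix (here called Vsq) and its eigenvector matrix Q from Section 3.3 are already available; the truncation of W_0 and H_0 to the factorization rank r is done as in the Appendix code.

% Symmetric-matrix based initialization (Section 3.3.2); Vsq and Q are assumed given
S  = Q * Q';        % symmetric matrix built from the eigenvectors of Vsq
S1 = S * Vsq';      % first factor, S1 = S * Vsq^T
S2 = inv(S);        % second factor, S2 = S^{-1}
W0 = abs(S1);       % nonnegative starting factors; truncate to rank r before iterating
H0 = abs(S2);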


Chapter 4 Simulation Results and Comparison

In this chapter, we carry out experiments to evaluate the different initialization methods. First, we show the errors defined in Chapter 3 to evaluate them; then we show the reconstruction images obtained with the different initialization methods. The input data is the ORL face dataset [16], which contains 40 persons and 400 face images in total; each person corresponds to 10 grayscale face images with a resolution of 112 × 92, and the grayscale range is [0, 255]. In the tests, we represent the data by matrices; for image data, each pixel value is a single entry of the data matrix.
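A sketch of how one run of these experiments could be set up with the routines listed in the Appendix is shown below. The image file name is only a placeholder for one ORL face image, and the iteration count is an assumed example value.

% Example driver for one experiment (file name and iteration count are placeholders)
V = double(imread('orl_face_1.pgm'));            % one 112 x 92 grayscale face image
[~, ~, ~, r] = ChoosingR(V);                     % factorization rank from the 90% rule
it = 400;                                        % assumed number of MU iterations
[err_svd, W_svd, H_svd] = nnmf_svd(V, r, it);    % SVD-NMF initialization + MU updates
[err_sqr, W_sqr, H_sqr] = nnmf_eigen(V, r, it);  % SQR-NMF initialization + MU updates
plot(1:it, err_svd, 1:it, err_sqr);              % compare the error curves
legend('svd', 'sqr'); xlabel('iterations'); ylabel('error');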

4.1. Simulation results for MU algorithm

4.1.1 Comparison w.r.t fixed factorization rank and increasing iteration times

Figure 1 Errors of reconstruction using the MU algorithm: (a) processing 5 images (r = 25 and r = 45); (b) processing 1 image (r = 10 and r = 25). Each panel plots the reconstruction error versus the number of iterations for SVD-NMF (svd), SQR-NMF (sqr) and random initialization (ran).

The error of processing with fixed factorization rank is shown in Figure 1; r in the panel titles is the rank of the factorization. In these cases, the larger the factorization rank is, the lower the error rate will be. We observe that SQR-NMF has a lower error rate at the beginning, and it converges faster than SVD-NMF and random NMF when processing 5 images. However, SQR-NMF has a higher error rate when processing 1 image. Thus we do the initialization with symmetric matrices instead of eigenvalues and eigenvectors in SQR-NMF. Figure 2 shows the errors of reconstruction using symmetric matrices. The performance of this new strategy is convincing, since it has a lower error rate as the iterations go on.

Figure 2 Errors of reconstruction using the MU algorithm with symmetric-matrix initialization, processing 1 image (r = 10 and r = 25). Each panel plots the reconstruction error versus the number of iterations for SVD-NMF (svd), SQR-NMF (sqr) and random initialization (ran).

From Figure 1 and Figure 2, we conclude that SQR-NMF performs better than the other two methods, and it has a significant advantage when the factorization rank is small.

4.1.2 Comparison w.r.t fixed iteration time and increasing factorization rank

Figure 3 Errors of reconstruction using the MU algorithm with a fixed number of iterations on the ORL data set (it = 18 and it = 100). Each panel plots the reconstruction error versus the factorization rank for SVD-NMF (svd), SQR-NMF (sqr) and random initialization (ran).

Figure 3 gives the reconstruction error with a fixed number of iterations. We conclude that on the ORL database, SQR-NMF has a more rapid error reduction and a smaller error rate than SVD-NMF and random NMF at the same number of iterations.

4.1.3 Reconstruction images

Figure 4 Reconstruction images: (a) p = 45; (b) p = 25. The first row uses SQR-NMF, the second row SVD-NMF, and the last row random NMF, all with the MU algorithm.

Figure 4 gives the reconstruction images using SQR-NMF, SVD-NMF and random NMF with different extraction portions, 80% and 90%. We conclude that when the extraction ratio is 80%, the factorization rank is small (r = 25) and the error rate is higher, which means the factorizations are not good; we can only find the outline of the faces in the reconstruction images. When the extraction portion is 90%, the reconstruction images have more details, such as clearer noses and eyes, and more distinct expressions.

4.2. Simulation results for divergence-reducing algorithm

4.2.1 Comparison w.r.t fixed factorization rank and increasing iteration times

Figure 5 Errors of reconstruction using the divergence-reducing algorithm: (a) processing 5 images (r = 25 and r = 45); (b) processing 1 image (r = 10 and r = 25). Each panel plots the reconstruction error versus the number of iterations for SVD-NMF (svd), SQR-NMF (sqr) and random initialization (ran).

4.2.2 Comparison w.r.t fixed iteration time and increasing factorization rank

Figure 6 Errors using the divergence-reducing algorithm with a fixed number of iterations: (a) processing 1 image (it = 18 and it = 100); (b) processing 5 images (it = 100 and it = 200). Each panel plots the reconstruction error versus the factorization rank for SVD-NMF (svd), SQR-NMF (sqr) and random initialization (ran).

Figure 5 and Figure 6 give the simulation results obtained with the divergence-reducing algorithm. SQR-NMF performs better than SVD-NMF and random NMF when processing larger data sets with a lower factorization rank. We can also see that the MU algorithm has a lower error rate; thus we apply the MU algorithm as the iterative algorithm in the following simulations.

4.3. Numerical results for different complementary elements

4.3.1 Comparison w.r.t fixed factorization rank and increasing iteration times

The previous simulation results show that SQR-NMF performs better than SVD-NMF and random NMF. We now carry out experiments to show the impact of different complementary elements on the factorization. In this section, we add different complementary elements to make the original matrix a square matrix, and we calculate the errors when the complementary elements are 0, 1, and the mean values of each row or column.

Figure 7 Errors of adding different complementary elements: (a) processing 1 image (r = 10 and r = 25); (b) processing 5 images (r = 25 and r = 45). Each panel plots the reconstruction error versus the number of iterations when the complementary elements are 1, the row/column means (mean), and 0.

Figure 7 gives the simulation results for the different complementary elements. When the size of the original matrix is small and the factorization rank is large, complementing with 0 is better. However, the method of complementing with the mean values of each row or column has a lower error rate and converges faster than adding 0 or 1. We conclude that complementing with the mean values of each row or column performs better when the factorization rank is small, because the complementary elements are more reliable; thus, we obtain a closer approximation.

4.3.2 Reconstruction images

Figure 8 shows, for comparison, the reconstruction images using random NMF with the divergence-reducing algorithm and SQR-NMF with the MU algorithm. The extraction ratio is 80% with factorization rank r = 25. In SQR-NMF, the complementary elements are the mean values of each row. The reconstruction images using SQR-NMF show more details of the faces.

Figure 8 Reconstruction images. The upper row uses random NMF, and the lower row uses SQR-NMF.

As we mentioned in Chapter 1, NMF can be applied in many fields. A good initialization brings a good factorization, so NMF can find more and more applications. Eigenvalue decomposition based initialization, proposed in this thesis as a new strategy for NMF, has several advantages: (1) it can easily be combined with other NMF algorithms; (2) it reaches convergence faster, especially when processing large amounts of data; (3) it is simple and can be applied easily.


Chapter 5 Conclusion

In this thesis, we have proposed a new initialization method for NMF in image representation. The simulation results on facial images show that the eigenvalue decomposition based initialization method, SQR-NMF, achieves better performance in image representation than SVD-NMF and random NMF. In SQR-NMF we use eigenvalues and eigenvectors for the initialization of large data sets and symmetric matrices for small data sets. It is also critical to choose suitable complementary elements for different data sets: we choose 0 for small data sets and the mean values of each row or column for large ones.


REFERENCES

1. Lee, D.D. and H.S. Seung, Learning the parts of objects by non-negative matrix factorization. Nature, 1999. 401(6755): p. 788.
2. Devarajan, K., Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLoS Computational Biology, 2008. 4(7): p. e1000029.
3. Brunet, J.-P., et al., Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 2004. 101(12): p. 4164-4169.
4. Pehkonen, P., G. Wong, and P. Törönen, Theme discovery from gene lists for identification and viewing of multiple functional groups. BMC Bioinformatics, 2005. 6(1): p. 162.
5. Hutchins, L.N., et al., Position-dependent motif characterization using non-negative matrix factorization. Bioinformatics, 2008. 24(23): p. 2684-2690.
6. Xu, M., et al., Automated multidimensional phenotypic profiling using large public microarray repositories. Proceedings of the National Academy of Sciences, 2009. 106(30): p. 12323-12328.
7. Carmona-Saez, P., et al., Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics, 2006. 7(1): p. 78.
8. Perron-Frobenius theorem. 2018.
9. Chang, K.-C., K. Pearson, and T. Zhang, Perron-Frobenius theorem for nonnegative tensors. Communications in Mathematical Sciences, 2008. 6(2): p. 507-520.
10. WIKIPEDIA. Metzler matrix. 16 November 2016; Available from: https://en.wikipedia.org/wiki/Metzler_matrix.
11. Qiao, H., New SVD based initialization strategy for non-negative matrix factorization. Pattern Recognition Letters, 2015. 63: p. 71-77.
12. Boutsidis, C. and E. Gallopoulos, SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 2008. 41(4): p. 1350-1362.
13. Lee, D.D. and H.S. Seung. Algorithms for non-negative matrix factorization. in Advances in Neural Information Processing Systems. 2001.
14. Wild, S., et al., Seeding non-negative matrix factorizations with the spherical k-means clustering. 2003, University of Colorado.

15. Janecek, A. and Y. Tan. Using population based algorithms for initializing nonnegative matrix factorization. in International Conference in Swarm Intelligence. 2011. Springer.
16. Cambridge University Computer Laboratory. The Database of Faces. Available from: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
17. WIKIPEDIA. Symmetric matrix. 21 April 2018; Available from: https://en.wikipedia.org/wiki/Symmetric_matrix#Decomposition.
18. Berman, A. and R. Plemmons, Nonnegative Matrices in the Mathematical Sciences. Academic Press, New York, 1979.

APPENDIX

Matlab code:

1. SQR step

function [u, s, v] = getP(V)
% Pad V to a square matrix, then return its sorted eigen-decomposition.
[m, n] = size(V);
% complementing mean values
if m > n
    for i = 1:m
        V(i, n+1:m) = sum(V(i, 1:n)) / n;
    end
else
    for i = 1:n
        V(m+1:n, i) = sum(V(1:m, i)) / m;
    end
end
% complementing zeros (alternative to the block above)
% if m > n
%     V(:, n+1:m) = 0;
% else
%     V(m+1:n, :) = 0;
% end
% complementing ones (alternative to the block above)
% if m > n
%     V(:, n+1:m) = 1;
% else
%     V(m+1:n, :) = 1;
% end

[A, B] = eig(V);
% descend by abs: sort the eigenvalues by decreasing absolute value
d = diag(B);
[~, index] = sort(abs(d), 'descend');
u = A(:, index);        % eigenvectors, reordered accordingly
v = inv(u);
s = diag(d(index));     % sorted eigenvalues on the diagonal

2. Factorization rank r choosing rule (extraction ratio is 90%)

function [u, s, v, p] = ChoosingR(V)
[u, s, v] = svd(V);
sum1 = sum(s);
sum2 = sum(sum1);
dsum = 0;
p = 0;
while (dsum / sum2 < 0.9)
    p = p + 1;
    dsum = dsum + s(p, p);
end

3. Update rules for NMF

% MU update algorithm
W = abs(u(1:m, 1:p));
H = abs(s(1:p, 1:p) * v(1:p, 1:n));
for i = 1:it
    H = H .* (W' * V) ./ ((W' * W) * H);
    W = W .* (V * H') ./ (W * (H * H'));
    E = V - W * H;
    error = norm(E, 'fro') / norm(V, 'fro');
end

% divergence-reducing NMF iterations
W = abs(u(1:m, 1:p));
H = abs(s(1:p, 1:p) * v(1:p, 1:n));
for i = 1:it
    x1 = repmat(sum(W, 1)', 1, n);
    H = H .* (W' * (V ./ (W * H))) ./ x1;
    x2 = repmat(sum(H, 2)', m, 1);
    W = W .* ((V ./ (W * H)) * H') ./ x2;
    E = V - W * H;
    error = norm(E, 'fro') / norm(V, 'fro');
end

4. SVD-NMF initialization strategy

function [error_svd, W_svd, H_svd] = nnmf_svd(V, p, it)
[m, n] = size(V);
[u, s, v] = svd(V);
W = abs(u(:, 1:p));
H = abs(s(1:p, :) * v');
for i = 1:it
    H = H .* (W' * V) ./ ((W' * W) * H);
    W = W .* (V * H') ./ (W * (H * H'));
    E = V - W * H;
    error_svd(i) = norm(E, 'fro') / norm(V, 'fro');
end
W_svd = W;
H_svd = H;

5. SQR-NMF initialization strategy

% SQR-NMF initialize with eigenvalues and eigenvectors
function [error_sqr, W_sqr, H_sqr] = nnmf_eigen(V, p, it)
[m, n] = size(V);
[u, s, v] = getP(V);
W = abs(u(1:m, 1:p));
H = abs(s(1:p, 1:p) * v(1:p, 1:n));
for i = 1:it
    H = H .* (W' * V) ./ ((W' * W) * H);
    W = W .* (V * H') ./ (W * (H * H'));
    E = V - W * H;
    error_sqr(i) = norm(E, 'fro') / norm(V, 'fro');
end
W_sqr = W;
H_sqr = H;

% SQR-NMF initialize with symmetric matrices
[m, n] = size(V);
[A, u, v] = getP(V);    % A holds the eigenvectors of the padded square matrix
S = A * A';             % symmetric matrix S = Q*Q'
B = V';
C = inv(S);
W = abs(S(1:m, 1:p) * B(1:p, 1:p));
H = abs(C(1:p, 1:n));
for i = 1:it
    H = H .* (W' * V) ./ ((W' * W) * H);
    W = W .* (V * H') ./ (W * (H * H'));
    E = V - W * H;
    error_sym(i) = norm(E, 'fro') / norm(V, 'fro');
end
W_sym = W;
H_sym = H;