Doctoral Thesis

Multiview pattern recognition methods for data visualization, embedding and clustering

Ph.D. Student: Samir Kanaan Izquierdo Ph.D. Advisor: Alexandre Perera Lluna

Thesis Presented in Partial Fulfillment of the Requirements for the Degree of Ph.D.

Biomedical Engineering Ph.D. Program

Automatic Control Department

Universitat Politècnica de Catalunya

Barcelona, 2017


Abstract

Multiview data is defined as data for whose samples there exist several different data views, i.e. different data matrices obtained through different experiments, methods or situations. Multiview dimensionality reduction methods transform a high-dimensional, multiview dataset into a single, low-dimensional space or projection. Their goal is to provide a more manageable representation of the original data, either for data visualization or to simplify the following analysis stages. Multiview clustering methods receive a multiview dataset and propose a single clustering assignment of the data samples in the dataset, considering the information from all the input data views.

The main hypothesis defended in this work is that using multiview data along with methods able to exploit their information richness produces better dimensionality reduction and clustering results than simply using single views or concatenating all views into a single matrix.

Consequently, the objectives of this thesis are to develop and test multiview pattern recognition methods based on well-known single-view dimensionality reduction and clustering methods. Three multiview pattern recognition methods are presented: multiview t-distributed stochastic neighbour embedding (MV-tSNE), multiview multidimensional scaling (MV-MDS) and a novel formulation of multiview spectral clustering (MVSC-CEV). These methods can be applied both to dimensionality reduction tasks and to clustering tasks.

The MV-tSNE method computes a matrix of probabilities based on distances between samples for each input view. Then it merges the different probability matrices using results from expert opinion pooling theory to get a common matrix of probabilities, which is then used as reference to build a low-dimensional projection of the data whose probabilities are similar.

The MV-MDS method computes the common eigenvectors of all the normalized distance matrices in order to obtain a single low-dimensional space that embeds the essential information from all the input spaces, avoiding the inclusion of redundant information.

The MVSC-CEV method computes the symmetric Laplacian matrices of the similarity matrices of all data views. Then it generates a single, low-dimensional representation of the input data by computing the common eigenvectors of the Laplacian matrices, obtaining a projection of the data that embeds the most relevant information of the input data views, also avoiding the addition of redundant information.

A thorough set of experiments has been designed and run in order to compare the proposed methods with their single-view counterparts. Also, the proposed methods have been compared with all the available results of equivalent methods in the state of the art. Finally, a comparison between the three proposed methods is presented in order to provide guidelines on which method to use for a given task.

MVSC-CEV consistently produces better clustering results than other multiview methods in the state of the art. MV-MDS produces overall better results than the reference methods in dimensionality reduction

experiments. MV-tSNE does not excel on any of these tasks. As a consequence, for multiview clustering tasks it is recommended to use MVSC-CEV, and MV-MDS for multiview dimensionality reduction tasks.

Although several multiview dimensionality reduction or clustering methods have been proposed in the state of the art, there is no software implementation available. In order to compensate for this fact and to provide the community with a potentially useful set of multiview pattern recognition methods, an R software package containing the proposed methods has been developed and released to the public.

Unesco codes: 120304, 120804, 120110

Resumen

Los datos multivista se definen como aquellos datos para cuyas muestras existen varias vistas de datos distintas, es decir, diferentes matrices de datos obtenidas mediante diferentes experimentos, métodos o situaciones. Los métodos multivista de reducción de la dimensionalidad transforman un conjunto de datos multivista y de alta dimensionalidad en un único espacio o proyección de baja dimensionalidad. Su objetivo es producir una representación más manejable de los datos originales, bien para su visualización o para simplificar las etapas de análisis subsiguientes. Los métodos de agrupamiento multivista reciben un conjunto de datos multivista y proponen una única asignación de grupos para sus muestras, considerando la información de todas las vistas de datos de entrada.

La principal hipótesis defendida en este trabajo es que el uso de datos multivista junto con métodos capaces de aprovechar su riqueza informativa produce mejores resultados en reducción de la dimensionalidad y agrupamiento frente al uso de vistas únicas o la concatenación de varias vistas en una única matriz.

Por lo tanto, los objetivos de esta tesis son desarrollar y probar métodos multivista de reconocimiento de patrones basados en métodos univista reconocidos. Se presentan tres métodos multivista de reconocimiento de patrones: proyección estocástica de vecinos multivista (MV-tSNE), escalado multidimensional multivista (MV-MDS) y una nueva formulación de agrupamiento espectral multivista (MVSC-CEV). Estos métodos pueden aplicarse tanto a tareas de reducción de la dimensionalidad como de agrupamiento.

MV-tSNE calcula una matriz de probabilidades basada en distancias entre muestras para cada vista de datos. A continuación combina las matrices de probabilidad usando resultados de la teoría de combinación de expertos para obtener una matriz común de probabilidades, que se usa como referencia para construir una proyección de baja dimensionalidad de los datos.

MV-MDS calcula los vectores propios comunes de todas las matrices normalizadas de distancia para obtener un único espacio de baja dimensionalidad que integre la información esencial de todos los espacios de entrada, evitando información redundante.

MVSC-CEV calcula las matrices Laplacianas de las matrices de similitud de los datos. A continuación genera una única representación de baja dimensionalidad calculando los vectores propios comunes de las Laplacianas. Así obtiene una proyección de los datos que integra la información más relevante y evita añadir información redundante.

Se ha diseñado y ejecutado una batería de experimentos completa para comparar los métodos propuestos con sus equivalentes univista. Además, los métodos propuestos se han comparado con los resultados disponibles en la literatura. Finalmente, se presenta una comparación entre los tres métodos para proporcionar orientaciones sobre el método más adecuado para cada tarea.

MVSC-CEV produce mejores agrupamientos que los métodos equivalentes en la literatura. MV-MDS produce en general mejores resultados que los métodos de referencia en experimentos de reducción de la dimensionalidad. MV-tSNE no destaca en ninguna de esas tareas. Consecuentemente, para agrupamiento multivista se recomienda usar MVSC-CEV, y para reducción de la dimensionalidad multivista MV-MDS.

Aunque se han propuesto varios métodos multivista en la literatura, no existen programas disponibles públicamente. Para remediar este hecho y para dotar a la comunidad de un conjunto de métodos potencialmente útil, se ha desarrollado un paquete de programas en R y se ha puesto a disposición del público.

Códigos Unesco: 120304, 120804, 120110

To J&B, with love.


Acknowledgements

First I would like to express my gratitude to my advisor Dr. Alexandre Perera for his support in the creation of this thesis. As my advisor, he has shown me the way to go when I was disoriented, he has encouraged me when I was stranded in my research, and he has also taught me to realize that, eventually, things get working. As a researcher I have learnt a lot from him as well: he is careful with his work, always willing to learn new things, and has a keen eye for spotting flaws! Finally, he is a kind person and it has been a pleasure to work with him all these years.

In second place I want to thank my colleagues at my workplace, the EEBE (former EUETIB), for their support, advice and encouragement during all these years. Very specially I want to thank Dr. Gerard Escudero for his generous help in taking on so many common tasks to relieve me and let me have more spare time to devote to my thesis.

I also want to thank Drs. Pedro Gomis and Montserrat Vallverdú for introducing me to my advisor. Seemingly they had a perfect notion of what I could and wanted to do, and I think they made a perfect match. Raimon Jané and Noemí Zapata also deserve my gratitude for their help with all the academic requirements and the paperwork. At times I have caused them extra work and I am very thankful for this.

Of course I am not forgetting my fellow mates, with whom I have shared all these years of work, exchange of ideas and hopes. And some mountain hiking too! Thank you Andrey, Francesc, Helena, Maria, Gio, Sergi and many others along the way.

Finally I want to thank my family and friends for supporting (and standing) me during all these long years. Hopefully they (and I) will finally be relieved from the old question "How's your thesis going?". It is about time to move on to new questions.

Contents

Contents

List of Tables

List of Figures

1 Multiview unsupervised pattern recognition methods
   1.1 Introduction
      1.1.1 Multiview data
      1.1.2 Unsupervised pattern recognition methods
   1.2 State of the art: multiview dimensionality reduction
      1.2.1 Multiview dimensionality reduction methods
   1.3 State of the art: multiview clustering
      1.3.1 View merging methods
      1.3.2 Intermediate merging methods
      1.3.3 Ensemble clustering methods
      1.3.4 Other multiview clustering methods
   1.4 Open issues in multiview unsupervised pattern recognition methods
   1.5 Thesis objectives
   1.6 Structure of this thesis

2 Experimental setup
   2.1 Motivation
      2.1.1 Description of the experiments
      2.1.2 Structure of this chapter
   2.2 Dataset description
      2.2.1 Text datasets
      2.2.2 Image datasets
      2.2.3 Biological dataset
   2.3 Evaluation of the experiments
      2.3.1 Dimensionality reduction metrics
      2.3.2 Clustering quality metrics
   2.4 Design of the baseline experiments
   2.5 Design of the state of the art experiments

3 Multiview t-distributed stochastic neighbour embedding
   3.1 Motivation
   3.2 Related work
      3.2.1 Stochastic neighbour embedding
      3.2.2 t-distributed stochastic neighbour embedding
      3.2.3 Optimization of multiple objectives
      3.2.4 Expert opinion pooling
   3.3 Multiview tSNE
      3.3.1 MV-tSNE as a multiobjective optimization problem
      3.3.2 MV-tSNE as an expert opinion pooling problem
   3.4 Results
      3.4.1 MV-tSNE with respect to SC baseline
      3.4.2 MV-tSNE with respect to the state of the art
   3.5 Discussion

4 Multiview multidimensional scaling
   4.1 Motivation
   4.2 Related work
      4.2.1 Multidimensional Scaling
      4.2.2 Stepwise common principal components
   4.3 MV-MDS description
   4.4 Results
      4.4.1 MV-MDS with respect to SC baseline
      4.4.2 MV-MDS with respect to the state of the art
   4.5 Discussion

5 Multiview spectral clustering and Laplacian eigenmaps
   5.1 Motivation
   5.2 Related work
      5.2.1 Spectral clustering
      5.2.2 Laplacian Eigenmaps
   5.3 MVSC-CEV description
      5.3.1 Description of the algorithm
      5.3.2 Ideal clustering case
      5.3.3 Deviations from the ideal case
      5.3.4 Multiview Laplacian Eigenmaps
   5.4 Results
      5.4.1 MVSC-CEV with respect to SC baseline
      5.4.2 MVSC-CEV with respect to the state of the art
   5.5 Discussion

6 Method comparison
   6.1 Motivation
   6.2 Multiview dimensionality reduction
   6.3 Multiview clustering
   6.4 Discussion

7 Multiview software package
   7.1 Motivation
   7.2 Package "multiview"

8 Conclusions and main developments

A Results of MV-tSNE experiments

B Results of MV-MDS experiments

C Results of MVSC-CEV experiments

Bibliography

List of Tables

2.1 Summary of the text multiview datasets ...... 31 2.2 Summary of the image datasets ...... 32 2.3 Summary of the biological dataset ...... 33 2.4 Factors in the design of the baseline experiments ...... 39

3.1 Clustering purity wrt. the state of the art...... 63 3.2 Clustering NMI wrt. the state of the art...... 64

4.1 Clustering purity wrt. the state of the art...... 80 4.2 Clustering NMI wrt. the state of the art...... 80

5.1 Clustering purity wrt. the state of the art...... 100 5.2 Clustering NMI wrt. the state of the art...... 101

A.1 One-vs-one SVM classification accuracy on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 118 A.2 Cophenetic correlation on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 119

A.3 Area under the curve of the RNX index on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 120 A.4 Clustering purity on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 121 A.5 Clustering normalized mutual information on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 122 A.6 Davies-Boulding index on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 123


A.7 One-vs-one SVM classification accuracy on the Reuters multilin- gual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection. . . 124 A.8 Cophenetic correlation on the Reuters multilingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 125

A.9 Area under the curve of the RNX index on the Reuters multilingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 126 A.10 Clustering purity on the Reuters multilingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 127 A.11 Clustering normalized mutual information on the Reuters multi- lingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection. . . 128 A.12 Davies-Boulding index on the Reuters multilingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 129 A.13 One-vs-one SVM classification accuracy on the BBC segmented news dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 130 A.14 Cophenetic correlation on the BBC segmented news dataset of MV- tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 131

A.15 Area under the curve of the RNX index on the BBC segmented news dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 132 A.16 Clustering purity on the BBC segmented news dataset of MV- tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 133 A.17 Clustering normalized mutual information on the BBC segmented news dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 134 A.18 Davies-Boulding index on the BBC segmented news dataset of MV- tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 135 A.19 One-vs-one SVM classification accuracy on the animal with at- tributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection. 136 A.20 Cophenetic correlation on the animal with attributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 137 List of Tables xv

A.21 Area under the curve of the RNX index on the animal with at- tributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection. 138 A.22 Clustering purity on the animal with attributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 139 A.23 Clustering normalized mutual information on the animal with at- tributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection. 140 A.24 Davies-Boulding index on the animal with attributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 141 A.25 One-vs-one SVM classification accuracy on the Berkeley protein dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 142 A.26 Cophenetic correlation on the Berkeley protein dataset of MV- tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 143

A.27 Area under the curve of the RNX index on the Berkeley protein dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 144 A.28 Clustering purity on the Berkeley protein dataset of MV-tSNE compared with single view and stacked views tSNE. K is the di- mensionality of the projection...... 145 A.29 Clustering normalized mutual information on the Berkeley protein dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 146 A.30 Davies-Boulding index on the Berkeley protein dataset of MV- tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 147 A.31 One-vs-one SVM classification accuracy on the Cora dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 148 A.32 Clustering purity on the Cora dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 149 A.33 Clustering normalized mutual information on the Cora dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection...... 150

B.1 One-vs-one SVM classification accuracy on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 152 xvi List of Tables

B.2 Cophenetic correlation on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 153

B.3 Area under the curve of the RNX index on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 154 B.4 Clustering purity on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 155 B.5 Clustering normalized mutual information on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 156 B.6 Davies-Boulding index on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 157 B.7 One-vs-one SVM classification accuracy on the Reuters multilin- gual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection. . . 158 B.8 Cophenetic correlation on the Reuters multilingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 159

B.9 Area under the curve of the RNX index on the Reuters multilingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 160 B.10 Clustering purity on the Reuters multilingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 161 B.11 Clustering normalized mutual information on the Reuters multi- lingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection. . . 162 B.12 Davies-Boulding index on the Reuters multilingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 163 B.13 One-vs-one SVM classification accuracy on the BBC segmented news dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 164 B.14 Cophenetic correlation on the BBC segmented news dataset of MV- MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 165

B.15 Area under the curve of the RNX index on the BBC segmented news dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 166 List of Tables xvii

B.16 Clustering purity on the BBC segmented news dataset of MV- MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 167 B.17 Clustering normalized mutual information on the BBC segmented news dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 168 B.18 Davies-Boulding index on the BBC segmented news dataset of MV- MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 169 B.19 One-vs-one SVM classification accuracy on the animal with at- tributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection. 170 B.20 Cophenetic correlation on the animal with attributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 171

B.21 Area under the curve of the RNX index on the animal with at- tributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection. 172 B.22 Clustering purity on the animal with attributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 173 B.23 Clustering normalized mutual information on the animal with at- tributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection. 174 B.24 Davies-Boulding index on the animal with attributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 175 B.25 One-vs-one SVM classification accuracy on the Berkeley protein dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 176 B.26 Cophenetic correlation on the Berkeley protein dataset of MV- MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 177

B.27 Area under the curve of the RNX index on the Berkeley protein dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 178 B.28 Clustering purity on the Berkeley protein dataset of MV-MDS com- pared with single view and stacked views MDS. K is the dimen- sionality of the projection...... 179 B.29 Clustering normalized mutual information on the Berkeley protein dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 180 xviii List of Tables

B.30 Davies-Boulding index on the Berkeley protein dataset of MV-MDS compared with single view and stacked views MDS. K is the di- mensionality of the projection...... 181 B.31 One-vs-one SVM classification accuracy on the Cora dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 182 B.32 Clustering purity on the Cora dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 183 B.33 Clustering normalized mutual information on the Cora dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection...... 184

C.1 One-vs-one SVM classification accuracy on the digits dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 186 C.2 Cophenetic correlation on the digits dataset of MVSC-CEV com- pared with single view and stacked views SC. K is the dimension- ality of the projection...... 187

C.3 Area under the curve of the RNX index on the digits dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 188 C.4 Clustering purity on the digits dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 189 C.5 Clustering normalized mutual information on the digits dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 190 C.6 Davies-Boulding index on the digits dataset of MVSC-CEV com- pared with single view and stacked views SC. K is the dimension- ality of the projection...... 191 C.7 One-vs-one SVM classification accuracy on the Reuters multilin- gual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection. . . . 192 C.8 Cophenetic correlation on the Reuters multilingual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 193

C.9 Area under the curve of the RNX index on the Reuters multilin- gual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection. . . . 194 C.10 Clustering purity on the Reuters multilingual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 195 List of Tables xix

C.11 Clustering normalized mutual information on the Reuters multi- lingual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection. . 196 C.12 Davies-Boulding index on the Reuters multilingual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 197 C.13 One-vs-one SVM classification accuracy on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 198 C.14 Cophenetic correlation on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 199

C.15 Area under the curve of the RNX index on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 200 C.16 Clustering purity on the BBC segmented news dataset of MVSC- CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 201 C.17 Clustering normalized mutual information on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 202 C.18 Davies-Boulding index on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 203 C.19 One-vs-one SVM classification accuracy on the animal with at- tributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection. . 204 C.20 Cophenetic correlation on the animal with attributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 205

C.21 Area under the curve of the RNX index on the animal with at- tributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection. . 206 C.22 Clustering purity on the animal with attributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 207 C.23 Clustering normalized mutual information on the animal with at- tributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection. . 208 C.24 Davies-Boulding index on the animal with attributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 209 C.25 One-vs-one SVM classification accuracy on the Berkeley protein dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 210 C.26 Cophenetic correlation on the Berkeley protein dataset of MVSC- CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 211 C.27 Area under the curve of the RNX index on the Berkeley protein dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 212 C.28 Clustering purity on the Berkeley protein dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimen- sionality of the projection...... 213 C.29 Clustering normalized mutual information on the Berkeley pro- tein dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 214 C.30 Davies-Boulding index on the Berkeley protein dataset of MVSC- CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 215 C.31 One-vs-one SVM classification accuracy on the Cora dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 216 C.32 Clustering purity on the Cora dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 217 C.33 Clustering normalized mutual information on the Cora dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection...... 218

List of Figures

1.1 Example handwritten digits from the ”Multiple features” dataset. 3 1.2 Example images from the ”Animal with attributes” dataset. Taken from [65]...... 4 1.3 Example fragments from the Hydice dataset. Taken from [69]. . . 5 1.4 Example images from the COIL-20 dataset...... 6 1.5 Example images from one subject in the extended Yale face database B...... 6


1.6 Excerpt from the phenotypic data in the ALL dataset...... 7 1.7 Partial heatmap of the gene expression data in the ALL dataset. . 8 1.8 Expression profiles of the ribosomal genes. Rows correspond to the ribosomal genes, and columns to the microarray experiments. Taken from [66] ...... 9 1.9 Comparison of single-view versus multiview classification. The first row shows the ROC classification score. The second row shows the percentage of true positives at one percent false positives. The third row shows the relative weights of the kernel matrices for the linear combination used in the experiments. Taken from [66]. . . . 10 1.10 Example figures and tags from NUS-WIDE dataset. Taken from [17]...... 11 1.11 Reconstructed image comparison between different dimensionality reduction methods. Taken from [107]...... 16

3.1 MV-tSNE projection of two example datasets...... 55 3.2 MV-tSNE dimensionality reduction evaluation with SVM classifi- cation...... 56 3.3 MV-tSNE dimensionality reduction evaluation with cophenetic cor- relation (average on all input views)...... 57 3.4 MV-tSNE dimensionality reduction evaluation with area under the RNX curve (average on all input views)...... 59 3.5 MV-tSNE clustering evaluation with clustering purity...... 60 3.6 MV-tSNE clustering evaluation with clustering normalized mutual information...... 61 3.7 MV-tSNE clustering evaluation with the Davies-Bouldin index (av- erage on all input views). Less is better...... 62

4.1 MV-MDS projection of two example datasets...... 71 4.2 MV-MDS dimensionality reduction evaluation with SVM classifi- cation...... 72 4.3 MV-MDS dimensionality reduction evaluation with cophenetic cor- relation (average on all input views)...... 74 4.4 MV-MDS dimensionality reduction evaluation with area under the RNX curve (average on all input views)...... 75 4.5 MV-MDS clustering evaluation with clustering purity...... 76 4.6 MV-MDS clustering evaluation with clustering normalized mutual information...... 77 4.7 MV-MDS clustering evaluation with the Davies-Bouldin index (av- erage on all input views). Less is better...... 79

5.1 MVSC-CEV projection of two example datasets...... 91 5.2 MVSC-CEV dimensionality reduction evaluation with SVM classification...... 93

5.3 MVSC-CEV dimensionality reduction evaluation with cophenetic correlation (average on all input views)...... 94 5.4 MVSC-CEV dimensionality reduction evaluation with area under the RNX curve (average on all input views)...... 95 5.5 MVSC-CEV clustering evaluation with clustering purity...... 97 5.6 MVSC-CEV clustering evaluation with clustering normalized mu- tual information...... 98 5.7 MVSC-CEV clustering evaluation with the Davies-Bouldin index (average on all input views). Less is better...... 99

6.1 Dimensionality reduction evaluation with SVM classification. . . . 104 6.2 Dimensionality reduction evaluation with cophenetic correlation. . 105 6.3 Dimensionality reduction evaluation with AUC-RNX...... 106 6.4 Clustering purity...... 108 6.5 Clustering normalized mutual information...... 109 6.6 Clustering evaluation with Davies-Bouldin index (less is better). . 110

Chapter 1

Multiview unsupervised pattern recognition methods

1.1 Introduction

1.1.1 Multiview data

The development of information and communication technologies has led to ever-increasing data production in most areas of human activity. The difference is not only quantitative but also qualitative, as today it is relatively easy to capture different aspects or features of a given entity or experiment. New pattern recognition methods should be designed to deal not only with large amounts of information, in the sense of a high number of data samples, but also with information of heterogeneous nature even within a single dataset.

In the context of this thesis, focused on multiview datasets and methods, a view of an entity is defined as a set of variables acquired by means of an instrument applied to the entity. In this sense, a view can be a picture, an audio or video recording, the results of a poll, survey or interview, physical variables of any kind, clinical variables, etc. Therefore, a multiview dataset is a dataset that comprises data matrices (data views) from different instruments or experiments on the same entities.

There are some closely related concepts that require a precise specification. While a view is the most general term in this area, it actually designates the data directly acquired from an observation instrument (camera, microphone, x-ray machine, interviewer, electronic survey, etc.). This is also known as sensory information. On the other hand, when different characteristics are computed from a given sensory input (for example, different image descriptors from a picture), these data are usually defined as feature sets, and consequently the dataset is qualified as a multifeature dataset. It is possible to have hybrid datasets, with several sensory inputs (views) and several feature sets extracted from them.

Finally, in the context of multimedia information management and retrieval, the term multimodal dataset is often used to refer to datasets that combine data from different media, like video, audio, still image, etc. The methods presented in this thesis are generic and do not depend on a specific design or origin of the dataset, as they are conceived to process any kind of multiview, multifeature, or hybrid dataset. Throughout this thesis, both the methods and the datasets used to test them will be qualified as multiview.

The relevance of multiview datasets is increasing due to the high availability of data acquisition instruments. However, most pattern recognition methods are designed to process a single data view. A naïve option is to discard all data views but one; a second option is to concatenate all the input views into a single data matrix; the third option is to use a proper multiview pattern recognition method. Many experiments have shown that the latter option renders better results [66, 62, 106, 11, 117, 112, 70]. In other words, using multiview data is a challenge, as it requires new processing methods, but it is also an opportunity, as the potential results are better. As a consequence, multiview processing methods have become an important tool for data processing tasks, and developing such methods is the goal of this work.

The multiview or multifeature quality is intrinsically heterogeneous, as the different aspects of the entities or subjects under study can also be heterogeneous. Next, examples of multiview datasets are given in order to highlight this heterogeneity and to illustrate the assortment of multiview dataset designs and the varied nature of the data views.
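Before turning to those examples, the following R sketch (not part of the original analyses) illustrates the three options just described, representing a multiview dataset as a list of per-view matrices that share the same rows; the view names and dimensions are hypothetical.

```r
# Two synthetic views (hypothetical data) over the same 100 samples.
set.seed(1)
n <- 100
view_pixels  <- matrix(rnorm(n * 240), nrow = n)   # e.g. raw pixel features
view_fourier <- matrix(rnorm(n * 76),  nrow = n)   # e.g. Fourier coefficients

# Option 1: discard all views but one
single <- view_pixels

# Option 2: concatenate all views into a single data matrix
stacked <- cbind(view_pixels, view_fourier)

# Option 3: keep the views separate, as expected by multiview methods
multiview_dataset <- list(pixels = view_pixels, fourier = view_fourier)
stopifnot(length(unique(sapply(multiview_dataset, nrow))) == 1)  # same samples in every view
```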

1.1.1.1 Image datasets

The University of California at Irvine (UCI) multiple features digits dataset [9], available at the UCI machine learning repository,1 is a multifeature dataset created from a set of handwritten numerals (from '0' to '9'), scanned as 15 × 16 grayscale pixel images. There are 200 samples of each numeral, resulting in a total of 2,000 samples. The multifeature aspect of this dataset lies in the different image descriptors that have been applied to the images. More specifically, this dataset has six feature sets: (1) the pixel averages in 2 × 3 windows, (2) 76 Fourier coefficients of the character shapes, (3) 216 profile correlations, (4) 64 Karhunen-Loève coefficients [99], (5) 47 Zernike moments [71], and (6) 6 morphological features (not specified). As each image descriptor captures a different aspect of the images, having multiple views or feature sets of the data is assumed to provide more information about the data samples. Strictly speaking, this is a multifeature dataset, as several feature sets have been extracted or computed from the same sensory input (the digit images). Figure 1.1 shows an example of the original digits of this dataset.

1https://archive.ics.uci.edu/ml/datasets/Multiple+Features

Figure 1.1: Example handwritten digits from the "Multiple features" dataset.

A similar multifeature image dataset is the Animal with attributes (AWA) dataset [65]2. In this dataset the original data are 30,475 photographs of animals, divided into 50 animal classes. As the input images have higher resolution than in the case of the digits dataset and they are in color, the image descriptors extracted from the pictures are different. However, the overall design of the dataset is the same: compute a series of image descriptors from the input images. These descriptors are: (1) 2,688 color histogram features, (2) 2,000 self-similarity features, (3) 252 pyramid histogram of oriented gradients (PHOG) features [25], (4) 2,000 scale-invariant feature transform (SIFT) values [72], (5) 2,000 colour SIFT values, and (6) 2,000 speeded-up robust features (SURF) [6]. Figure 1.2 shows an example of the original images from which the dataset has been generated.

Another source of multiview datasets are hyperspectral images, where the same location is photographed using cameras that can capture light wavelengths other than those of visible light. A well-known example of this kind of dataset is the Hydice dataset3, a 191-band hyperspectral image of a mall in Washington DC. In this case, each band of the image can be considered a different view of the same entities, and therefore multiview methods can be useful to process it. Figure 1.3 shows some fragments of this image.

A qualitatively different multiview image dataset is the Columbia object image libraries (COIL-20 and COIL-100) [83]4, which respectively are collections of pictures of 20 or 100 objects. The multiview aspect in these datasets lies in the fact that there are 72 pictures of each object in the collection, taken from different angles. Figure 1.4 shows an example of some of these pictures. No image descriptors are provided in this dataset, only the images of the objects with the backgrounds removed.

2http://attributes.kyb.tuebingen.mpg.de/ 3https://engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html 4http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php

Figure 1.2: Example images from the "Animal with attributes" dataset. Taken from [65].

A similar dataset is the extended Yale face database B [42]5, which contains gray-level images of 28 human subjects, each with 9 poses and 64 different lighting conditions. The goal of this dataset is to train face recognition systems robust to varying situations. Figure 1.5 shows some example images of one of the subjects in the dataset.

1.1.1.2 Text datasets

There are multiple ways in which a text can be analyzed to extract features from it. As a consequence, there are several approaches to building multiview text datasets. Some of the best known among them are described next.

The BBC News multiview text collection [44, 43]6 comprises 2,225 news articles from the BBC news website (years 2004-2005), each labelled with one of five possible topics (business, entertainment, politics, sport or tech). There are several subsets, but the overall dataset design is to split each news article into segments (ranging from 2 to 4) and use each of the segments as a different data view. The features of a segment are the non-stop words it contains. The

5http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html 6http://mlg.ucd.ie/datasets/bbc.html

Figure 1.3: Example fragments from the Hydice dataset. Taken from [69].

complete feature matrix of each view is a matrix of documents × words.

A different approach to build multiview datasets is exemplified by the Reuters multilingual corpus [3]7, a set of 18,758 news articles labelled in six categories. In this case the multiview aspect lies in the fact that these articles are available in five different languages (English, French, German, Italian and Spanish).

7https://archive.ics.uci.edu/ml/datasets/Reuters+RCV1+RCV2+Multilingual+Multiview+Text+Categorization+Test+collection

Figure 1.4: Example images from the COIL-20 dataset.

Figure 1.5: Example images from one subject in the extended Yale face database B.

Figure 1.6: Excerpt from the phenotypic data in the ALL dataset.

The Citeseer dataset [73]8 consists of 3,312 scientific publications from the Citeseer database [10], each labelled with one of six classes according to its subject area. It features two data views: (1) a binary dictionary of 3,703 words, where each publication has a 1 if it contains the word and a 0 otherwise, and (2) a symmetric citation network, represented as a matrix of 3,312 × 3,312 elements whose value cij = 1 if document i cites document j or vice versa, and cij = 0 if there are no references between these documents. In this case, the dictionary view is defined in feature space while the citation view is defined in graph space.

There exist several datasets with the same structure, for example the Cora dataset [79], which contains 2,708 scientific publications classified into one of seven classes. This dataset also has two views, a bag of words with 1,433 words and a reference graph that represents 5,429 links between documents. WebKB [24] has a similar structure, but the source data are the words in web pages from four universities and their hyperlinks.
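As an illustration of how such a citation view can be represented, the following R sketch builds a small symmetric adjacency matrix from a made-up edge list; the edge list and its size are hypothetical, not the real Citeseer data.

```r
# Building the symmetric citation view: c_ij = 1 if document i cites j or vice versa.
n_docs <- 6
cites <- matrix(c(1, 2,
                  2, 5,
                  4, 1,
                  6, 3), ncol = 2, byrow = TRUE)  # each row: citing doc, cited doc

C <- matrix(0, nrow = n_docs, ncol = n_docs)
C[cites] <- 1          # mark "i cites j"
C <- pmax(C, t(C))     # symmetrize: a reference in either direction counts
diag(C) <- 0
```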

1.1.1.3 Biological datasets

Biology, medicine and related areas are also a natural source of multiview data, as the subjects themselves are of a complex nature and there are numerous kinds of tests and instruments that produce different data. Some examples of these datasets are presented next.

The acute lymphocytic leukemia dataset (ALL) [18]9 compiles information from 128 subjects. One of the views includes 21 phenotypic features (such as age and gender, along with biological markers and clinical diagnosis). The other view has the expression levels of 12,625 genes. Figures 1.6 and 1.7 illustrate the two views of this dataset.

8http://www.cs.umd.edu/~sen/lbc-proj/LBC.html 9http://bioconductor.org/packages/release/data/experiment/html/ALL.html

Figure 1.7: Partial heatmap of the gene expression data in the ALL dataset.

The Berkeley protein dataset for genomic data fusion [66]10 is a multiview dataset whose samples are 1,040 yeast proteins. The proteins are labeled according to their location, as either membrane proteins, ribosomal proteins, or other. This dataset comprises 8 data views or feature sets, intended to provide knowledge on different aspects of the proteins so as to improve the predictive power of machine learning methods. The 8 views of this dataset are listed below (a small sketch of two of these kernel types follows the list):

• KSW: Smith-Waterman[93] distance kernel on protein sequences.

• KB: BLAST[2] distance kernel on protein sequences.

• KPfam: Pfam database [95] hidden Markov model kernel on protein sequences.

• KFFT: fast Fourier transform of the hydrophobicity profile [64] extracted from the protein sequences. Useful to recognize membrane proteins.

• KLI: linear kernel on protein interactions.

• KD: diffusion kernel on protein interactions.

• KE: radial basis kernel on gene expression (microarray gene expression on 441 genes).

• KRND: linear kernel on a matrix of random numbers, used as baseline to compare the other kernels.
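The sketch announced above shows, on synthetic data, two of the kernel types in the list: a linear kernel (as used for KLI) and a radial basis function kernel (as used for KE). The data, the kernel width sigma and the matrix sizes are made up for illustration; this is not a reproduction of the kernels distributed with the dataset.

```r
set.seed(1)
X <- matrix(rnorm(40 * 10), nrow = 40)   # 40 proteins x 10 features (synthetic)

# Linear kernel: inner products between samples
K_linear <- X %*% t(X)

# RBF kernel: exp(-||xi - xj||^2 / (2 * sigma^2))
sigma <- 1
D2 <- as.matrix(dist(X))^2
K_rbf <- exp(-D2 / (2 * sigma^2))
```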

10http://noble.gs.washington.edu/proj/sdp-svm/

Figure 1.8: Expression profiles of the ribosomal genes. Rows correspond to the ribosomal genes, and columns to the microarray experiments. Taken from [66].

An important aspect of these data views is that some are defined in feature space (for example the expression levels) while others are relationships between proteins and therefore are defined in graph space. Also, as the focus of [66] is to present different kernel methods, some of the feature sets are different kernel matrices of the same data (protein sequences or interactions). Strictly speaking, this is a multiview dataset, as three different views of the proteins are used (sequences, interactions, expression), and also a multifeature dataset, as several feature sets are extracted from the original views (the different kernels applied). Figure 1.8 shows the expression profiles of the ribosomal genes. Figure 1.9 shows the improvement in classification accuracy obtained by using all the data views relative to using only one view.

1.1.1.4 Multimodal datasets

Multimodal datasets combine information from different media sources, such as image and audio, in order to increase the amount of information available to the learning methods.

Figure 1.9: Comparison of single-view versus multiview classification. The first row shows the ROC classification score. The second row shows the percentage of true positives at one percent false positives. The third row shows the relative weights of the kernel matrices for the linear combination used in the experiments. Taken from [66].

The Wikipedia articles cross-modal dataset [22]11 contains both images and texts from 2,866 Wikipedia articles, classified into one of ten categories. This dataset is designed to test cross-modal multimedia retrieval methods, where information from a single view is deemed insufficient for the task and the combination of the different views is expected to increase the quality of the results.

There exist several datasets composed of images and their associated annotation tags. Among the most popular in the literature are: Corel5k [34], which contains 5,000 images from a Corel image CD that were manually annotated with different tags; ESP Game [105], which includes 20,000 images extracted from the profiles of the users of an online game, along with user-defined textual information; and NUS-WIDE [19]12, which contains 269,648 images extracted from the social network Flickr, along with six image feature sets and the ground truth for 81 concepts or tags, where a given image can have more than one concept. Figure 1.10 shows some example images and their tags from the NUS-WIDE dataset.

11http://www.cs.umd.edu/~sen/lbc-proj/LBC.html 12http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm

Figure 1.10: Example figures and tags from NUS-WIDE dataset. Taken from [17].

1.1.2 Unsupervised pattern recognition methods

”Development of unsupervised pattern recognition methods is the most challenging and promising area of research today.” Andrew Ng, NIPS Conference, Dec. 2016

Pattern recognition methods can be classified into two categories: supervised and unsupervised. Supervised methods require the existence of some kind of data sample labeling that indicates the class or other relevant property of each sample. Building labeled datasets is usually expensive, as in most cases a human expert (or a team of them) is required to generate the labeling. This limits the number of samples that can be labeled by the availability of human experts, usually to the thousands of samples. Moreover, labeling a dataset can be intrinsically difficult or laborious, as in image segmentation tasks where each object in the image has to be isolated, for example by drawing its contour, or in medical diagnosis.

It is important to note that the nature of the labels depends on the task that has to be solved. For example, in image processing there are numerous tasks to be performed: image segmentation, identification of a specific object type, identification of letters or digits, of faces, car models, etc. In each case, the labeling will be different, even if the image set is the same.

Nowadays it is possible to acquire millions or even billions of data samples more easily than ever before. But in most cases, these data are not labeled. In general, manually labeling millions of data samples is unaffordable. Although hybrid methods exist that can use partially labeled data, known as semi-supervised methods, they are not always suitable. Unsupervised methods do not require any kind of human annotation on the data in order to perform their pattern recognition tasks. Two of the most relevant unsupervised pattern recognition tasks are dimensionality reduction and data clustering.

1.1.2.1 Dimensionality reduction

Dimensionality reduction methods, also known as data embedding or data projection methods, have as their goal to transform an input dataset with high dimensionality (i.e. a high number of variables) into a lower-dimensional space, i.e. one with fewer variables. This process must keep as much information as possible from the original dataset. The application of dimensionality reduction methods is often one of the first steps in the data analysis workflow. Dimensionality reduction offers several advantages: it gives researchers a better understanding of the data, it sometimes improves the usefulness of the data (removing noise or irrelevant information), and it also reduces the computational complexity of the subsequent data processing steps.

Some dimensionality reduction methods are specifically designed to generate a graphical representation of high-dimensional data. Reducing the data to a 2- or 3-dimensional space allows the data to be displayed graphically, giving researchers an insight into the structure of the data that may be difficult to attain otherwise.

Among the most relevant dimensionality reduction methods are: principal component analysis (PCA) [53], multidimensional scaling (MDS) [60, 61, 23], t-distributed stochastic neighbour embedding (t-SNE) [102, 74, 75, 76], and canonical correlation analysis (CCA) [46, 47].
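As a minimal single-view illustration of two of the methods named above, the following R sketch projects a hypothetical data matrix to two dimensions with PCA (prcomp) and classical MDS (cmdscale); the data are random and only meant to show the calls involved.

```r
set.seed(1)
X <- matrix(rnorm(200 * 50), nrow = 200)   # 200 samples x 50 variables (synthetic)

# PCA: keep the first two principal components
Y_pca <- prcomp(X, center = TRUE, scale. = TRUE)$x[, 1:2]

# Classical MDS: embed the pairwise distance matrix in 2 dimensions
Y_mds <- cmdscale(dist(X), k = 2)

# A 2-dimensional projection can be displayed directly
plot(Y_pca, xlab = "PC1", ylab = "PC2", main = "PCA projection")
```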

1.1.2.2 Clustering algorithms

The goal of clustering algorithms is to find groups or clusters of samples in a given dataset for which no previous information about groups or classes is assumed. This is one of the first steps when analyzing new data, as it suggests a structure for initially unstructured data. Most clustering algorithms have a parameter that controls the granularity of the clustering, be it the number of clusters to be obtained, the minimum distance to group two points together, or any other equivalent granularity-controlling parameter. There is no single clustering solution for a dataset; rather, clustering can be done at different levels of detail or granularity.

There is a wide range of clustering algorithms, of which some representative examples are: K-means [48, 81], hierarchical clustering [52, 82], partition around medoids (PAM) [59, 31], DBSCAN [35], and spectral clustering [87, 84].
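The following R sketch illustrates, on synthetic two-cluster data, how the granularity parameter appears in two of the algorithms named above: the number of centres in K-means and the cut level of a hierarchical clustering dendrogram. The data and the choice k = 2 are assumptions made for the example.

```r
set.seed(1)
X <- rbind(matrix(rnorm(50 * 2), ncol = 2),
           matrix(rnorm(50 * 2, mean = 4), ncol = 2))   # two well-separated groups

# K-means: the number of clusters k is given directly
km <- kmeans(X, centers = 2)

# Agglomerative hierarchical clustering: the dendrogram is cut into k groups
hc <- hclust(dist(X), method = "ward.D2")
groups <- cutree(hc, k = 2)

table(km$cluster, groups)   # compare the two assignments
```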

1.2 State of the art: multiview dimensionality reduction

The goal of multiview dimensionality reduction methods is to reduce a dataset with multiple, high-dimensional data views $\{X_1, X_2, \ldots, X_V\}$ into a single, lower-dimensional space or data projection, while keeping the most relevant properties of the original data.

In general, dimensionality reduction methods are expected to preserve the relative distances between the data samples. This is a relatively difficult problem to solve with a single data view, and it obviously becomes harder to solve when there are several data views, as the distances between data samples may vary from view to view and a consensus solution must be found.

When analyzing single-view dimensionality reduction methods, the quality of the resulting data space with respect to the original data is analyzed from two points of view. First, the local data structure is evaluated: do the points in the low-dimensional space have the same neighbours as in the original, high-dimensional space? A method whose projections satisfy this condition is said to keep the local data structure. However, this is not the only desirable property of a dimensionality reduction method, as there is a second question to be assessed: if points a and b are far from each other in the input space, are they also far from each other in the output space? This is also an important feature, and the methods that satisfy it are said to preserve the global data structure.

These two properties (preservation of local and global data structure) are also central to multiview dimensionality reduction methods, with the added difficulty of greater computational complexity and potential conflicts between views. A naïve solution to this problem is to concatenate the input views into a single feature matrix, but this approach does not account for the intrinsic structure of each data view and therefore does not fully exploit the richness of the multiview data.
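A minimal sketch of this naïve baseline, assuming two hypothetical views of the same samples: the views are simply concatenated column-wise and a single-view method (classical MDS here) is applied to the stacked matrix, ignoring the per-view structure.

```r
set.seed(1)
n <- 150
views <- list(v1 = matrix(rnorm(n * 30), nrow = n),
              v2 = matrix(rnorm(n * 80), nrow = n))

stacked <- do.call(cbind, views)           # concatenation ignores the per-view structure
Y_naive <- cmdscale(dist(stacked), k = 2)  # single-view MDS on the stacked data
```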

1.2.1 Multiview dimensionality reduction methods

1.2.1.1 Low-rank approximation methods

Laplacian eigenmaps [7], also known as spectral embedding, is a well-known single-view dimensionality reduction method. Multiview spectral embedding (MSE) [108] is an extension of spectral embedding to multiview datasets. Its goal is to find a low-dimensional and smooth embedding of the high-dimensional, multiple input views. More specifically, if the multiview dataset $\mathcal{X}$ is composed of $V$ data matrices such that $\mathcal{X} = \{X^{(v)} \in \mathbb{R}^{n \times m_v}\}, \ \forall \, 1 \le v \le V$, then the resulting low-dimensional representation of $\mathcal{X}$ is $Y \in \mathbb{R}^{n \times d}$, where $d$ is a user-defined parameter such that $d < m_v \ \forall \, 1 \le v \le V$. In other words, the dimensionality of the low-dimensional representation has to be lower than the dimensionality of any of the input views. MSE comprises three steps: part optimization, global coordinate alignment, and alternating optimization.

The first stage, part optimization, is based on the patch alignment framework described in [116]. The goal of this stage is to independently obtain a low-dimensional representation of each view that preserves the locality of each sample, i.e. the relationship with its closest neighbours. In order to find the patch of the i-th sample in the v-th view, denoted as $X_i^{(v)}$, the following objective function has to be minimized

$$\min_{Y^{(v)} = \{Y_i^{(v)}\},\, \alpha} \; \sum_{v=1}^{V} \alpha_v \, \mathrm{tr}\!\left( Y_i^{(v)} L_i^{(v)} Y_i^{(v)T} \right) \tag{1.1}$$

where $Y_i^{(v)}$ is the low-dimensional representation of $X_i^{(v)}$, $L_i^{(v)}$ is the Laplacian of $X_i^{(v)}$, and $\alpha_v$ is a weight value associated to each view that increases with the relative relevance of view $v$. In the second stage, the different low-dimensional representations $Y^{(v)}$ are aligned by assuming that they can be obtained from the global coordinate matrix $Y$ by means of a selection matrix that encodes the spatial relationship with the samples in the high-dimensional space, such that $Y_j^{(v)} = Y S_j^{(v)}$. In other words, the low-dimensional representations of each view are made globally consistent with each other. The solution to these transformations is given by the following objective function

$$\begin{aligned}
\min_{Y,\,\alpha} \; & \sum_{v=1}^{V} \alpha_v^{\,r} \, \mathrm{tr}\!\left( Y L^{(v)} Y^{T} \right) \\
\text{s.t.} \; & Y Y^{T} = I; \quad \sum_{v=1}^{V} \alpha_v = 1; \quad \alpha_v \ge 0
\end{aligned} \tag{1.2}$$

The third and last stage of MSE is an alternating optimization algorithm that minimizes the previous objective function in a computationally efficient way. Matrix Y is the resulting global low-dimensional projection.

The method described in [88] aims at generating a common, low-dimensional projection of several data views. The goal is to apply this projection to cross-media document retrieval, looking for the nearest neighbours in the projection space and thereby saving time and memory. Its premise is to make two samples appear close to each other in the projection if and only if they are close in all the input views. In order to obtain the common projection, a projection function is defined for each data view such that (in the two-view case described in the paper; it can be extended to more views):

$$g_1 : \mathbb{R}^{d_1} \rightarrow \mathbb{R}^{D} \quad \text{and} \quad g_2 : \mathbb{R}^{d_2} \rightarrow \mathbb{R}^{D} \tag{1.3}$$

where $d_i$ is the dimensionality of view $i$ and $D$ is the dimensionality of the desired common projection; in general it is assumed that $D \ll \max(d_1, d_2)$. A linear parametrization of the above functions is assumed, such that $g_1^{w} := \langle w_1, \phi(x_i) \rangle$ and $g_2^{w} := \langle w_2, \psi(y_i) \rangle$. Functions $g_1$ and $g_2$ are determined by optimizing the following objective function

$$\mathcal{L}(w_1, w_2, \mathcal{X}, \mathcal{Y}, S) := \sum_{i,j=1}^{m} L^{i,j}(w_1, w_2, x_i, y_j, S_{x_i}) + \eta\,\Omega(w_1) + \gamma\,\Omega(w_2) \tag{1.4}$$

where the loss function $L$ is defined as follows

$$L^{i,j}(w_1, w_2, x_i, y_j, S_{x_i}) = \mathbb{I}_{y_j \in S_{x_i}} \times L_1^{i,j} + \left(1 - \mathbb{I}_{y_j \in S_{x_i}}\right) \times L_2^{i,j} \tag{1.5}$$

with

$$L_1^{i,j} = \left\| g_1^{w_1}(x_i) - g_2^{w_2}(y_j) \right\|_F^2 \tag{1.6}$$

$$L_2^{i,j}(\beta_d) = \begin{cases} \dfrac{a\lambda^2}{2} - \dfrac{\beta_d^2}{2}, & \text{if } 0 \le |\beta_d| < \lambda \\[2mm] \dfrac{|\beta_d|^2 - 2a\lambda|\beta_d| + a^2\lambda^2}{2(a-1)}, & \text{if } \lambda \le |\beta_d| \le a\lambda \\[2mm] 0, & \text{if } |\beta_d| \ge a\lambda \end{cases} \tag{1.7}$$

where $a$ and $\lambda$ are heuristically chosen constants. This optimization problem is further decomposed into two smaller subproblems and solved iteratively. The resulting projection functions $g_1$ and $g_2$ can then be used to generate the desired common projection.

The convex multi-view subspace learning method (MSL) [107] aims at learning a subspace representation of a multiview dataset while assuming and enforcing conditional independence between the different views. A convex regularizer that finds the subspace is proposed. This method is also designed to achieve the best possible reconstruction of the original data from the low-dimensional representation. In summary, the method applies an optimization algorithm to the objective function

$$\min_{K \in \mathrm{conv}\,\mathcal{G}} f(K), \qquad f(K) = \left\| \hat{Z} - K \right\|_F^2 \tag{1.8}$$

where $K$ is the desired low-dimensional representation, $\mathrm{conv}\,\mathcal{G}$ is the convex hull of the set of possible subspaces for the input views, and $\hat{Z} = CH$ is the product of the concatenated feature matrix $C$ and its common $(2,1)$-norm factor $H$. Figure 1.11 compares the quality of an image reconstructed using MSL with the same image reconstructed using two alternative methods: local multiview subspace learning (LSL) [88] and single-view subspace learning (SSL) [13].

Figure 1.11: Reconstructed image comparison between different dimensionality reduction methods. Taken from [107].

The ensemble manifold regularized sparse low-rank approximation (EMR-SLRA) algorithm [114] uses the framework of least-squares component analysis [27] to obtain a low-rank approximation of the concatenated multiview feature matrix. This algorithm comprises three steps: first it computes the low-rank approximation of the multiview matrix, second it regularizes the ensemble manifold, and finally it determines and applies a group sparsity constraint. Let $X = \{X_1, X_2, \ldots, X_V\}$ be the concatenated feature matrix. The low-rank approximation $Y$ of $X$ is given by the following expression

$$\begin{aligned} \min_{U,\,Y} \; & \| X - UY \|^2 \\ \text{s.t.} \; & U^T U = I \end{aligned} \tag{1.9}$$

In the second step, the low-rank approximation $Y$ obtained has to be regularized since, by its definition, all the features in $X$ have been processed uniformly and the intrinsic structure of each of the input views has not been considered. EMR-SLRA uses ensemble manifold regularization [41] and heat kernels [7] to find a vector of view coefficients $\beta = \{\beta_1, \beta_2, \ldots, \beta_V\}$, so that each input view $X_i$ is multiplied by its coefficient in order to account for the different relevance of the views. The expression to find $\beta$ is

$$\begin{aligned} \min_{Y,\,\beta} \; & \sum_{v=1}^{V} (\beta_v)^{r} \, \mathrm{tr}\!\left( Y L^{(v)} Y^{T} \right) \\ \text{s.t.} \; & \sum_{v=1}^{V} \beta_v = 1, \quad \beta_v > 0 \end{aligned} \tag{1.10}$$

where $L^{(v)}$ is the Laplacian matrix of input view $X_v$ and $r$ is a user-defined parameter. Finally, a group sparsity constraint is introduced in order to minimize the potential noise introduced in $Y$ from $X$. To minimize the effects of that noise, an ideal multiview feature matrix $\hat{X}$ is obtained by applying the $\ell_{2,1}$-norm fitting constraint [85], as it is robust against noise in the data [109]. The expression to compute $\hat{X}$ is

$$\min_{\hat{X}} \; \| \hat{X} - X \|_{2,1} \tag{1.11}$$

The $\ell_{2,1}$-norm is defined as:

$$\| X \|_{2,1} = \sum_i \sqrt{\sum_j X_{ij}^2} = \sum_i \| x_{i,:} \|_2 \tag{1.12}$$

Replacing $X$ by $\hat{X}$ in Equation 1.9 along a number of iterations produces a low-rank projection matrix $Y$ that is more robust to the noise in the input views.

Multitask multiview feature embedding (MMFE) [115] generates a low-rank approximation matrix of the multiview data. The overall structure of the method is similar to that of EMR-SLRA [114], although the integration of the multiview data is now based on multitask learning methods [14, 36, 86, 33]. More specifically, each input data view is considered a different learning task. The objective function used to find the integrated low-rank representation $Y$ is

$$\begin{aligned} \min_{U,\,U^{(v)},\,Y,\,Y^{(v)}} \; & \| X - UY \|^2 + \sum_{v=1}^{V} \| X^{(v)} - U^{(v)} Y^{(v)} \|^2 + \alpha \sum_{v=1}^{V} \| Y - Y^{(v)} \|^2 \\ \text{s.t.} \; & U^T U = I, \quad U^{(v)T} U^{(v)} = I \end{aligned} \tag{1.13}$$

where $\{X^{(1)}, X^{(2)}, \ldots, X^{(V)}\}$ are the input views, $X$ is the concatenated feature matrix, $Y^{(v)}$ is the low-rank representation of $X^{(v)}$, $Y$ is the low-rank representation of $X$, $U^{(v)}$ and $U$ are the projections of the data points into the corresponding low-rank spaces, and $\alpha$ is a user-defined regularization parameter. The resulting low-rank matrix $Y$ is the desired low-dimensional representation of the original data.

1.2.1.2 Probabilistic methods

The multiview stochastic neighbor embedding method (m-SNE) [110] is a multiview method based on stochastic neighbor embedding (SNE) [49] and t-distributed SNE (t-SNE) [102, 74]. t-SNE computes a symmetric joint probability distribution from the pairwise distances of the data samples. The conditional probability between points $x_i$ and $x_j$ in the input high-dimensional space is

$$p_{j|i} = \frac{\exp\!\left( -\| x_i - x_j \|^2 / 2\sigma^2 \right)}{\sum_{k \ne l} \exp\!\left( -\| x_k - x_l \|^2 / 2\sigma^2 \right)} \tag{1.14}$$

where $\sigma$ is a hyperparameter of the method. The symmetric joint probability distribution, designed to avoid the effect of outlier points, is $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$, where $n$ is the number of data samples. m-SNE generates one symmetric joint probability matrix for each input view and combines them using the following formula

$$p_{ij} = \sum_{v=1}^{V} \alpha^{(v)} p_{ij}^{(v)} \tag{1.15}$$

where $\alpha^{(v)}$ is a coefficient associated with each view and $p_{ij}^{(v)}$ is the joint probability matrix of view $v$. This produces a single joint probability matrix that is used as input for the standard t-SNE method.
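As an illustration of Equation 1.15 only (not of the full m-SNE algorithm), the following R sketch combines per-view joint probability matrices into a single matrix; the placeholder matrices P1 and P2 and the uniform weights stand in for matrices computed beforehand with Equation 1.14.

# Assumed inputs: a list of per-view symmetric joint probability matrices,
# each of dimension n x n, rows and columns indexed by the same samples.
combine_probabilities <- function(P_list, alpha = rep(1 / length(P_list), length(P_list))) {
  stopifnot(length(P_list) == length(alpha))
  P <- Reduce(`+`, Map(function(P_v, a_v) a_v * P_v, P_list, alpha))
  P / sum(P)   # renormalize so that the combined matrix sums to one
}

# Example with two random placeholder probability matrices
n <- 5
P1 <- matrix(runif(n * n), n); P1 <- P1 + t(P1); diag(P1) <- 0; P1 <- P1 / sum(P1)
P2 <- matrix(runif(n * n), n); P2 <- P2 + t(P2); diag(P2) <- 0; P2 <- P2 / sum(P2)
P  <- combine_probabilities(list(P1, P2))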

1.2.1.3 Kernel-based methods

The kernelized multiview projection (KMP) [91] is a kernel method to reduce a multiview dataset of human actions, where each frame of a video capture of a subject performing an action is a view of the dataset. The approach of this method is to use a kernel that reduces the multiple, high-dimensional input views (video frames) to a single, low-dimensional space. Another requirement of this method is that it should be computationally efficient. The method comprises three steps. First, it applies an image filter called incremental naïve Bayes filter (INBF) to remove the noise in the frames and try to keep only the information relevant to the task. This step is specific to the human action recognition problem. Second, it computes a kernel matrix for each input view using dynamic time warping (DTW) [8], using the following kernel function between videos $v_p$ and $v_q$ in view $i$:

$$k_i(v_p, v_q) = \exp\!\left( -\frac{\mathrm{DTW}(X_p^i, X_q^i)^2}{2\sigma^2} \right) \tag{1.16}$$

This produces a set of kernel matrices $K_1, K_2, \ldots, K_M$, where $M$ is the number of views. After computing the Laplacians of these kernel matrices, $L_1, L_2, \ldots, L_M$, an iterative procedure is applied to compute the fused kernel matrix $K = \sum_{i=1}^{M} \alpha_i K_i$ and the fused Laplacian matrix $L = \sum_{i=1}^{M} \alpha_i L_i$. The coefficients are initially set to $\alpha_i = 1/M, \; 1 \le i \le M$, but are iteratively adjusted by solving the generalized eigenvalue problem

$$K L K \, p = \lambda \, K D K \, p \tag{1.17}$$

where $D$ is the diagonal degree matrix of $K$ and $p$ is the projection vector being solved for. This process is repeated until $p$ and the coefficients $\alpha_i$ become stable; the final common projection $P$ is derived from the stable value of $p$.

1.2.1.4 Co-training methods

The method presented in [111] is specifically designed to improve image tagging and classification. It assumes a two-view dataset, with images as one view and annotation tags as the other. From the image view, several feature sets can be extracted using different image descriptors. The main strategy of this method is to have the information in each view guide the information extraction from the other view, in order to obtain an improved set of image tags. In the end, an improved common subspace $Z$ is obtained, on which the image tags are predicted. First, the geometric structure of the different views is modeled by a corresponding set of $k$-nearest neighbour graphs $\{W^v\}_{v=1}^{V}$, where $V$ is the number of image feature views. $W^S = T T^T$ is used to model the semantic structure of the images, where $T$ is the matrix of image tags. The common subspace $Z$ is obtained by optimization of the following expression

$$\min_{Z} f(Z), \quad \text{s.t.} \; Z^T Z = I \tag{1.18}$$

where $f(Z)$ is defined as

$$f(Z) = \frac{1}{\gamma} \log \left\{ \sum_{v=1}^{V} \exp\!\left[ \gamma \left\| Z - (W^v \odot W^S) \right\|_F^2 \right] \right\} + \eta \left\| T - Z Z^T T \right\|_F^2 \tag{1.19}$$

where $\odot$ is the Hadamard product [39] and $\|A\|_F$ is the Frobenius norm of matrix $A$. The common subspace $Z$ is then used to train a classifier to decide on the presence or absence of each possible image tag in the dataset.

1.3 State of the art: multiview clustering

In general terms, clustering a dataset with multiple data views {V1,V2,...Vc} involves the following steps:

1. Obtain a similarity matrix Si for each view Vi.

2. Compute a projection Pi of each Si into a space suitable for clustering.

3. Produce a clustering assignment.

The main structural difference between the multiview clustering methods proposed in the literature lies in the step where the information from the multiple views is collapsed into a single view in order to produce the final clustering assignment. The first category of multiview clustering methods merges the similarity matrices to obtain a combined similarity matrix $S'$ that minimizes the differences between the input similarity matrices $S_i$, i.e. the views are merged in Step 1. Afterwards, a standard clustering algorithm is applied to $S'$ in order to obtain the final clustering. The second category of multiview clustering methods merges the input views during Step 2 to generate a projection $P'$ compatible with all views. Afterwards, a standard clustering method is applied to the merged projection $P'$. Ensemble clustering methods are designed to overcome the randomness of clustering methods such as K-means by combining clustering assignments from several runs in order to find a stable assignment. Although not strictly considered multiview clustering methods, they can be used for multiview clustering if they are applied to the clustering assignments of the different views; they would then produce a clustering assignment compatible with all views. These methods merge the information from the different views after Step 3. The multiview clustering methods in the literature are described next, classified according to the step at which they merge the information in order to produce the final clustering assignment.

1.3.1 View merging methods

The first category of multiview clustering methods merges all the input views into a single similarity matrix $S'$ and then applies a standard clustering algorithm to it.
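The following R sketch illustrates this general strategy only, not any specific method from the literature: it averages per-view Gaussian similarity matrices into a single matrix and then applies a basic spectral clustering step (Laplacian eigenvectors followed by K-means). The placeholder views, the kernel width sigma and the number of clusters k are arbitrary choices.

# Two placeholder views of the same 60 samples
set.seed(2)
X1 <- matrix(rnorm(60 * 4), ncol = 4)
X2 <- matrix(rnorm(60 * 6), ncol = 6)

gaussian_similarity <- function(X, sigma = 1) {
  D <- as.matrix(dist(X))
  exp(-D^2 / (2 * sigma^2))
}

# Step 1: merge the per-view similarity matrices (here, a simple average)
S <- (gaussian_similarity(X1) + gaussian_similarity(X2)) / 2

# Step 2: project using the eigenvectors of the symmetric normalized Laplacian
d <- rowSums(S)
L <- diag(length(d)) - diag(1 / sqrt(d)) %*% S %*% diag(1 / sqrt(d))
k <- 3
U <- eigen(L, symmetric = TRUE)$vectors[, (length(d) - k + 1):length(d)]  # k smallest eigenvalues

# Step 3: cluster the projected samples
assignment <- kmeans(U, centers = k)$cluster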

The method described in [29] is designed to merge exactly two input views by minimizing the disagreement between them, applying the minimizing-disagreement algorithm [28]. This method generates a weighted graph where each data sample is a node and the edge between nodes $i$ and $j$ is weighted using a Gaussian function. The input value to the Gaussian function, i.e. the similarity value, is high if nodes $i$ and $j$ are relatively close in at least one of the input views. More specifically, the similarity between data points $i$ and $j$ in the weighted graph is given by the expression

$$w_{ij} = \sum_{k} \exp\!\left( -\frac{\| x_i^{(1)} - x_k^{(1)} \|^2}{2\sigma_1^2} \right) \exp\!\left( -\frac{\| x_j^{(2)} - x_k^{(2)} \|^2}{2\sigma_2^2} \right) \tag{1.20}$$

where $x_k^{(v)}$ is the vector of coordinates of data sample $k$ in view $v$, and $\sigma_1$ and $\sigma_2$ are two user-defined parameters that control the radius of the Gaussian on each of the input views. This results in a combined adjacency matrix $A_{12}$. Then, a variant of spectral clustering [84] is applied to $A_{12}$.

A co-training approach for multiview spectral clustering is proposed in [62]. Co-training is a technique developed for semi-supervised methods, where a model is first trained on labeled data and then run on unlabeled data in order to suggest possible labels. In co-training for multiview spectral clustering, the training starts from the clustering assignment of one of the input data views. The assumption of the method is that if points $a$ and $b$ belong to the same cluster in a view $V_i$, then they should belong to the same cluster in the remaining views; complementarily, if $a$ and $b$ do not belong to the same cluster in a view $V_j$, then they should not belong to the same cluster in the other views. This method is designed to work on two data views $V_1$ and $V_2$. First, standard spectral clustering is run on $V_1$. The clustering assignment obtained is used to adjust the geometry of the adjacency graph of $V_2$. In parallel, the same process is performed with the roles of the views exchanged. This process is repeated for a user-defined number of iterations, with the goal of obtaining two adjacency matrices that are as similar as possible. Let $S_1$ be the adjacency matrix of input view $V_1$; the update of $S_1$ on iteration $i$, denoted by $S_1^i$, is given by

$$S_1^{i} = \mathrm{sym}\!\left( U_2^{i-1} \, U_2^{(i-1)T} \, S_1^{i-1} \right) \tag{1.21}$$

where the symmetrization operator is $\mathrm{sym}(S) = (S + S^T)/2$, and $U_2^{i-1}$ are the eigenvectors of $S_2$ computed on iteration $i-1$. The expression to compute $S_2^i$ is equivalent. This way, both adjacency matrices tend to converge to a common adjacency matrix $S'$, on which standard spectral clustering is applied in order to obtain the final clustering assignment.
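A minimal sketch of the update rule of Equation 1.21, assuming the per-view adjacency matrices S1 and S2 have already been built; the use of the k leading eigenvectors, the number of iterations and the final averaging into a single matrix are illustrative simplifications, not prescriptions from the original paper.

sym <- function(S) (S + t(S)) / 2

cotrain_adjacency <- function(S1, S2, k = 5, iterations = 10) {
  for (it in seq_len(iterations)) {
    U1 <- eigen(S1, symmetric = TRUE)$vectors[, 1:k]
    U2 <- eigen(S2, symmetric = TRUE)$vectors[, 1:k]
    S1_new <- sym(U2 %*% t(U2) %*% S1)   # Equation 1.21
    S2_new <- sym(U1 %*% t(U1) %*% S2)   # symmetric update for the other view
    S1 <- S1_new; S2 <- S2_new
  }
  (S1 + S2) / 2   # a common adjacency matrix for standard spectral clustering
}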

The method proposed in [106] is both a multiview clustering method and a multiview feature learning method. Its goal is to assign a weight to each feature of each input view in order to induce a structured sparsity on the data samples that reflects the underlying clustering structure. This method uses the group $\ell_1$-norm regularization [113] to learn the effect of each data view on each cluster, and the $\ell_{2,1}$-norm regularization to learn the effect of each data view on multiple clusters. This way, the task becomes an optimization problem whose objective function is

$$\min_{W,\, F^T F = I} \; \left\| X^T W + \mathbf{1}_n b^T - F \right\|_F^2 + \gamma_1 \| W \|_{G_1} + \gamma_2 \| W \|_{2,1} \tag{1.22}$$

where $W$ contains the feature weights, $X$ is a data matrix with all the features of all the input data matrices, $\mathbf{1}_n$ is a vector of $n$ ones, $b$ is an intercept vector and $\gamma_1$ and $\gamma_2$ are user-defined parameters. The result of this optimization procedure is a set of weights to be applied to the input features. Afterwards, a standard clustering method can be applied to the weighted feature matrix.

The method proposed in [112] comprises two steps. First, it uses the prior information in all data views to generate a sparse representation of the data samples for each view. More specifically, two constraints labelled "must-link" and "cannot-link" are defined to control the way each sample is related to the others. The objective function designed to compute these sparse representations is

$$\begin{aligned} \min_{Z_i^v} \; & \left\| x_i^v - X_{-i}^v Z_i^v \right\|^2 + \alpha \left\| Z_i^v \right\|_1 + \beta \sum_{w \ne v} \left\| Z_i^w - Z_i^v \right\|_1 \\ \text{s.t.} \; & \mathrm{diag}(Z^v) = 0 \end{aligned} \tag{1.23}$$

where $X^v$ is the $v$-th data view, a negative subindex denotes all rows in the matrix except the one in the subindex, and $\alpha$ and $\beta$ are parameters of the method. The result of this optimization is the sparse representation $Z^v$ of each input view $X^v$, with $Z^v \in \mathbb{R}^{n \times n}$, where $n$ is the number of samples in the dataset. Then, a unique affinity matrix is constructed using any of the $Z^v$ matrices or their average (sic), according to the expression $A = \frac{1}{2}\left( |Z^v| + |Z^v|^T \right)$. Standard spectral clustering is applied to $A$ in order to find the clustering of the input data.

1.3.2 Intermediate merging methods

The second category of multiview clustering methods merges the views at the projection stage. An implementation of this approach using canonical correlation analysis, in order to maximize the correlation of samples across the projected views, can be found in [16].

Co-regularized multiview spectral clustering [63] is a method that also combines the adjacency matrices of the input views in order to produce a single adjacency matrix on which to apply standard spectral clustering. This proposal is based on the semi-supervised technique of co-regularization [80], where the clustering assignments from the different views are expected to be equal. This produces an optimization problem whose objective function is

$$\max_{U^{(1)}, U^{(2)}, \ldots, U^{(M)}} \; \sum_{v=1}^{M} \mathrm{tr}\!\left( U^{(v)T} L^{(v)} U^{(v)} \right) + \lambda \sum_{1 \le v, w \le M,\, v \ne w} \mathrm{tr}\!\left( U^{(v)} U^{(v)T} U^{(w)} U^{(w)T} \right) \tag{1.24}$$

where $M$ is the number of input views, $U^{(i)}$ is the matrix of eigenvectors of the input view $V_i$, and $L^{(i)}$ is the Laplacian matrix of the input view $V_i$. Standard spectral clustering is applied to the co-regularized matrix of eigenvectors. This method can be applied to any number of input views.

A partial multiview clustering method is described in [70], where partial means that it is able to process datasets with incomplete views. The method is analyzed for two-view datasets, but it can also be applied to a larger number of input views. In the two-view case, the partial multiview dataset is split into three subsets, $X = \{\hat{X}^{(1,2)}, \hat{X}^{(1)}, \hat{X}^{(2)}\}$, where $\hat{X}^{(1,2)}$ contains the samples that appear in both input views, $\hat{X}^{(1)}$ contains the samples that only appear in the first input view, and $\hat{X}^{(2)}$ contains the samples that appear only in the second input view. The main idea behind this method is to learn a common latent subspace for the two views. Applying non-negative matrix factorization (NMF) [67], a basis matrix for each view's latent space can be learnt, denoted as $U^{(1)}$ and $U^{(2)}$. The representations of the data samples in these latent subspaces are two matrices, $P^{(1)}$ and $P^{(2)}$ respectively. In order to determine $U^{(1)}$, $U^{(2)}$, $P^{(1)}$ and $P^{(2)}$, an iterative algorithm is proposed, whose goal is to find a common data sample representation $P_c$. The first step of the algorithm is to find an estimate of $U^{(1)}$ and $U^{(2)}$ applying NMF, as described by the following objective function

$$\min_{U^{(1)},\, U^{(2)},\, P_c} \; \left\| \hat{X}^{(1)} - P_c U^{(1)} \right\|_F^2 + \left\| \hat{X}^{(2)} - P_c U^{(2)} \right\|_F^2 + \lambda \| P_c \|_1 \tag{1.25}$$

where $\lambda$ is a user-defined parameter. This minimization stage produces the initial values for $U^{(1)}$, $U^{(2)}$ and $P_c$. Afterwards, an iterative adjustment of either $\{U^{(1)}, U^{(2)}\}$ or $\{P^{(1)}, P^{(2)}, P_c\}$ is performed until all three sample representations match, i.e. $P^{(1)} = P^{(2)} = P_c$. On each iteration, the first step is to adjust $P^{(1)}$, $P^{(2)}$ and $P_c$ by fixing $U^{(1)}$ and $U^{(2)}$ and optimizing the following objective functions

$$\min_{P^{(1)} \ge 0} \; \left\| \hat{X}^{(1)} - P^{(1)} U^{(1)} \right\|_F^2 + \lambda \| P^{(1)} \|_1 \tag{1.26}$$

$$\min_{P^{(2)} \ge 0} \; \left\| \hat{X}^{(2)} - P^{(2)} U^{(2)} \right\|_F^2 + \lambda \| P^{(2)} \|_1 \tag{1.27}$$

$$\min_{P_c \ge 0} \; \left\| \hat{X}^{(1)} - P_c U^{(1)} \right\|_F^2 + \left\| \hat{X}^{(2)} - P_c U^{(2)} \right\|_F^2 + \lambda \| P_c \|_1 \tag{1.28}$$

Afterwards, $P^{(1)}$, $P^{(2)}$ and $P_c$ are fixed in order to adjust $U^{(1)}$ and $U^{(2)}$ by optimizing the following objective functions

$$\min_{U^{(1)} \ge 0} \; \left\| \hat{X}^{(1)} - P^{(1)} U^{(1)} \right\|_F^2 \tag{1.29}$$

$$\min_{U^{(2)} \ge 0} \; \left\| \hat{X}^{(2)} - P^{(2)} U^{(2)} \right\|_F^2 \tag{1.30}$$

This iterative process is repeated until the aforementioned condition $P^{(1)} = P^{(2)} = P_c$ is met. The common and complete representation of the data samples $P_c$ is then passed to a standard clustering method in order to obtain the clustering assignment.

Large-scale multiview spectral clustering via bipartite graph [70] is a method focused on processing large datasets and allowing out-of-sample operation, which standard spectral clustering does not allow. This method works as follows. First, a set of $m$ salient points $U$ is selected from the dataset using K-means on the concatenated features of all the input views. These points are expected to capture the structure of the manifold while requiring much less data to represent it. Then, a bipartite graph is generated between the complete set of data points and the points in $U$. This graph is constructed as a k-NN graph between the original data points $X$ and the salient points $U$. The weight of each edge of this graph is given by the expression

$$w_{ij} = \frac{K(x_i, u_j)}{\sum_{k \in \Phi_i} K(x_i, u_k)} \tag{1.31}$$

where $x_i \in X$ are the original data points, $u_j \in U$ are the salient points, $K$ is a kernel function (the Gaussian kernel in the experiments described in the paper), and $\Phi_i$ contains the indices of the $s$ nearest neighbours in $U$ of point $x_i$, with $s$ a user-defined parameter. The result is a weight matrix $W$. Then, the first $n$ eigenvectors of the Laplacian matrix of $W$ are computed and K-means is applied to the matrix of eigenvectors in order to obtain the final clustering assignment with $n$ clusters.
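A minimal sketch of the bipartite graph weights of Equation 1.31, assuming the salient points have already been obtained with K-means on the concatenated views; the Gaussian kernel width sigma and the neighbourhood size s are placeholder values.

bipartite_weights <- function(X, U, s = 5, sigma = 1) {
  n <- nrow(X); m <- nrow(U)
  W <- matrix(0, n, m)
  for (i in seq_len(n)) {
    d2 <- colSums((t(U) - X[i, ])^2)       # squared distances to all salient points
    k  <- exp(-d2 / (2 * sigma^2))         # Gaussian kernel values
    nn <- order(d2)[seq_len(s)]            # indices of the s nearest salient points
    W[i, nn] <- k[nn] / sum(k[nn])         # Equation 1.31, restricted to the s-NN
  }
  W
}

# Usage: salient points from K-means on the concatenated views
# X_all <- cbind(X1, X2); U <- kmeans(X_all, centers = 50)$centers
# W <- bipartite_weights(X_all, U)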

1.3.3 Ensemble clustering methods

Ensemble clustering methods independently run a standard, single-view clustering algorithm on each of the input data views, and then merge the resulting clustering assignments. In other words, they merge the multiview information in the last step of the clustering process. A survey of ensemble clustering methods can be found in [104].

A multiview K-means algorithm is proposed in [11], where the $\ell_2$-norm is replaced by the $\ell_{2,1}$-norm in order to reduce the influence of outliers on the performance of the K-means algorithm. The K-means problem is reformulated into the following expression

$$\min_{F^{(v)},\, G,\, \alpha^{(v)}} \; \sum_{v=1}^{M} (\alpha^{(v)})^{\gamma} \left\| X^{(v)T} - G F^{(v)T} \right\|_{2,1} \tag{1.32}$$

where $M$ is the number of input feature matrices, $X^{(v)}$ are the input feature matrices, $\alpha^{(v)}$ is a weight coefficient associated with the $v$-th matrix, $F^{(v)}$ is the matrix of centroids of the $v$-th matrix and $G$ is the clustering assignment indicator. Therefore, this algorithm requires solving an optimization problem that finds the optimal values of $\alpha^{(v)}$, $G$ and $F^{(v)}$. Moreover, the parameter $\gamma$ has to be chosen by the user in order to tune the behaviour of the algorithm.

1.3.4 Other multiview clustering methods

The method presented in [117] does not fall into the categories presented above, as it iterates over the different multiview clustering phases (views, projection, clustering) in order to achieve its goal. This method can only process two-view datasets. The co-training framework for multiview clustering (CoKmLDA) algorithm is detailed in Algorithm 1.

The multi-feature spectral clustering with minimax optimization method [106] tries to find a common feature embedding that unifies the different input features while minimizing the disagreement among them. This is performed in four steps. First, the normalized Laplacian matrix is computed and standard spectral clustering is run for each input view, obtaining a possibly different clustering assignment per view, $V \in \mathbb{R}^{N \times K}$, where $N$ is the number of samples and $K$ is the number of input views. Second, the regularized data-cluster similarity is computed as $P_V(U) = U U^T V$, where $U$ contains the $K$ first eigenvectors of the normalized Laplacian matrix of each input view as computed in the first step. In the third step, an agreement among the different $P_V$ matrices is sought. Deriving from the previous definitions, the normalized Laplacian matrix of a specific view $i$ with respect to view $j$ is

$$L_{ij} = I - \mathrm{sym}\!\left( U_i U_i^T U_j U_j^T \right) \tag{1.33}$$

where $\mathrm{sym}(A) = (A + A^T)/2$.

Algorithm 1. Co-training framework for multiview clustering (CoKmLDA)
Input: dataset X = {X^(1), X^(2)}; expected number of clusters K
Output: clustering assignment vectors (one for each input view)
function CoKmLDA(X, K)
    Perform K-means on each view to obtain cluster assignments H^(v)
    For each view v, identify the data samples closest to each of the K cluster centroids, obtaining S^(v) = {s_1^(v), s_2^(v), ..., s_K^(v)}
    for t <- 1 to numIters do
        for v <- 1 to 2 do
            Use X^(v) and H^(3-v) to train LDA; project the samples onto the LDA space
            Use S^(v) to perform K-means on the projected samples to estimate new cluster assignments H^(v)
            Update S^(v)
        end for
    end for
end function

Then the pairwise disagreement cost between views $i$ and $j$ is defined as

$$Q_{ij} = \mathrm{tr}\!\left( V^T L_{ij} V \right) \tag{1.34}$$

The objective function to find the optimal Qij is

$$\begin{aligned}
\min_{\{U_m\}_{m=1}^{M}} \; \max_{\{\alpha_{ij}\}_{j \ge i}} \; & \sum_{i=1}^{M} \sum_{j \ge i} \alpha_{ij}^{\gamma} \, Q_{ij} \\
\text{subject to} \quad & \alpha_{ij} \in \mathbb{R}^{+}, \quad \sum_{i=1}^{M} \sum_{j \ge i} \alpha_{ij} = 1, \\
& U_m \in \mathbb{R}^{N \times K}, \quad U_m^T U_m = I, \\
& V \in \mathbb{R}^{N \times K}, \quad V^T V = I
\end{aligned} \tag{1.35}$$

where $\gamma$ is a user-defined parameter. The fourth and final step of the method is an iterative algorithm that optimizes the previous objective function and finds an optimal low-dimensional representation $V$ of all views. Finally, K-means is executed on $V$ to obtain the final clustering assignment.

1.4 Open issues in multiview unsupervised pattern recognition methods

There exist a series of open issues regarding the application of unsupervised pattern recognition methods to multiview data. These issues condition the methods designed in this thesis, as well as the experiments performed therein. The most relevant of these open issues are:

• There is no formal framework that demonstrates the advantages of using multiview datasets and methods over single-view datasets and methods. The experimental comparisons in the state of the art are not complete and general enough.

• There are no evaluation metrics specifically designed to evaluate the performance of multiview pattern recognition methods. Moreover, there is not even a consensus on which evaluation metric to use to evaluate dimensionality reduction experiments.

• There are no clear criteria to decide, given a multiview dataset, which views to use and which to exclude in order to achieve better results. This would be an equivalent step to variable selection techniques, but applied on a per-view basis.

1.5 Thesis objectives

The general objective of the present thesis is to develop a set of unsupervised pattern recognition methods that can be applied to any multiview dataset, and to study whether these methods yield a meaningful improvement over existing single-view methods. More specifically, the objectives are:

• To develop a multiview extension to the t-SNE dimensionality reduction method so that it can be applied to multiview datasets.

• To develop a multiview extension to the MDS dimensionality reduction method so that it can be applied to multiview datasets.

• To develop a multiview extension to the spectral clustering and Lapla- cian eigenmaps methods (which are very closely related) so that they can be applied to multiview datasets.

• To design and execute a set of experiments to properly test the proposed multiview methods, including the selection of the proper datasets, evaluation metrics, baseline reference methods and comparable methods in the state of the art.

• To analyze the performance of the proposed multiview methods with respect to equivalent single-view methods, in order to determine the potential advantages of the multiview approaches.

• To analyze the performance of the proposed multiview methods with respect to equivalent multiview methods in the state of the art, in order to assess their usefulness for the community.

• To compare the performance of the three proposed methods, in order to provide some guidelines for deciding which method is more suitable for a given pattern recognition problem.

• To generate a software package that makes the proposed methods readily available to the community using a widespread software environment.

1.6 Structure of this thesis

This thesis is organized into eight chapters plus three appendices. Although three novel methods are proposed in this thesis, the experimental setup is the same for all of them. Therefore, Chapter 2 explains the design and setup of the experiments that are applied to all the methods. Chapter 3 describes the first proposed method, a multiview extension of the t-SNE dimensionality reduction method, along with its results on the experiments and a discussion of these results. Chapter 4 describes the second proposed method, a multiview extension of the MDS dimensionality reduction method, along with its results on the experiments and a discussion of these results. Chapter 5 describes the third proposed method, a multiview extension of both spectral clustering and Laplacian eigenmaps, along with its results on the experiments and a discussion of these results. After describing and analyzing the three proposed methods individually, Chapter 6 presents a comparison between all three methods and a discussion of the corresponding results. The software package created to release the three proposed methods to the community is described in Chapter 7. The overall conclusions of the present thesis are presented in Chapter 8. Finally, the three appendices contain the tables with the detailed results of the different experiments performed throughout the thesis. As the number of tables is considerable, they are presented in the appendices to keep the corresponding chapters manageable in size, clear and straightforward.

Chapter 2

Experimental setup

2.1 Motivation

In the present thesis three methods are presented. All three methods perform dimensionality reduction of multiview data into a single, low-dimensional space, therefore condensing a considerable amount of information into a much lower number of dimensions. The fact that all three methods perform the same tasks makes it convenient to run the same experiments on all of them. Moreover, this allows a final comparison among all three methods, as presented in Chapter 6. These are the reasons why the design of the experiments is presented first.

2.1.1 Description of the experiments

The methods presented in this thesis are, in essence, dimensionality reduction methods for multiview data. Therefore, some of the experiments presented are dimensionality reduction experiments on multiview datasets. However, the single low-dimensional data representation that these methods produce is also convenient for clustering the original data samples. As a consequence, clustering experiments have also been run in order to assess the validity of the methods on the multiview clustering problem. The experiments presented in this thesis are divided in two categories. The first category are the baseline experiments, where the proposed multiview methods are compared with the respective single-view versions of the algorithms. More specifically, the single-view algorithms are applied to each input view of each dataset independently, and also to the concatenation of all the input view matrices (referred to as the stacked configuration). The goal of these baseline experiments is to analyze whether the proposed multiview methods are an improvement over their single-view counterparts, therefore justifying the use of multiview data and methods. The design of these baseline experiments is thoroughly described in Section 2.4.


The second category of experiments are the state of the art experiments, where the proposed multiview methods are compared with equivalent multiview methods in the state of the art. The goal of these experiments is to assess whether the proposed multiview methods are an improvement over other multiview methods in the state of the art. The design of these state of the art experiments is described in Section 2.5.

2.1.2 Structure of this chapter

The remainder of this chapter is organized as follows. First, the multiview datasets used in the experiments are described. Second, the evaluation metrics used to assess the quality of the different results are presented. Third, the detailed design of the baseline experiments and their goals is given. Finally, the design of the state of the art experiments is presented.

2.2 Dataset description

The multiview datasets used in the experiments belong to three categories: text documents, images and biological data. Each of the datasets in each category has different features that make it convenient to test different aspects of the performance of the algorithms. The criteria used to select these datasets have been the following. First, they are multiview datasets, i.e. they are published with multiple views or feature matrices, thus allowing anyone interested in reproducing the present experiments to use exactly the same data. Second, they are provided to the community by recognized institutions and research teams and are backed by peer-reviewed publications. Finally, they are used by other methods in the state of the art, so the results can be compared.

2.2.1 Text datasets

Multiple views can be obtained from text documents in several ways, and each of the text datasets used in the experiments reflects a different multiview approach. Their quantitative details are given in Table 2.1. The first text dataset is the BBC News multiview text collection [44, 43].¹ It comprises 2,225 news articles labelled with one of five possible topics (business, entertainment, politics, sport or tech). The input texts are split into several segments, and the term frequencies on each segment become the different input views. There are several subsets in the original dataset; the two-segment subset has been chosen to allow direct comparison with the results in the literature. The number of terms in each view, i.e. the number of attributes, is 6,838 and 6,790 respectively, although only the 500 most frequent terms on each segment

¹ http://mlg.ucd.ie/datasets/bbc.html

                      BBC                   Reuters                 Cora
View 1                Seg. A (500/6,838)    English (500/21,531)    Bag of words (1,433)
View 2                Seg. B (500/6,790)    French (500/24,892)     References (2,708)
View 3                —                     German (500/34,251)     —
View 4                —                     Italian (500/15,506)    —
View 5                —                     Spanish (500/11,547)    —
No. of samples        2,112                 18,758                  2,708
No. of samples used   2,112                 6,000                   2,708
No. of classes        5                     6                       7

Feature name (used variables/number of variables in the feature matrix). A single number means that all available variables have been used.

Table 2.1: Summary of the text multiview datasets

are used, as the less frequent terms do not contribute to the quality of text classification [51]. The tf.idf (term frequency / inverse document frequency) [78] is computed on each of the input segments, and the cosine similarity is used instead of the Euclidean distance because of the high sparsity of the feature matrices. The second text dataset is the Reuters multilingual corpus [3],² a set of 18,758 news articles available in five different languages (English, French, German, Italian and Spanish). The subset of original English news articles has been used; the term matrices of the remaining languages come from machine-translated texts. The texts belong to one of six news categories. For each input view (language), a matrix with term frequencies is given. As with the BBC News dataset, only the 500 most frequent terms of each language have been used. Their tf.idf values have been computed and, finally, the cosine similarity has been employed to find the similarity matrices. The third text dataset used in the experiments is the Cora dataset [79],³ which contains 2,708 scientific publications classified into one of seven classes. This dataset has two views. The first one is a bag of words with 1,433 words. The second view is a reference graph that represents 5,429 links between the documents.
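As an illustration of the preprocessing just described (not the exact code used in the thesis), the following R sketch computes tf.idf weights and a cosine similarity matrix from a raw term-frequency matrix; the input matrix is a placeholder.

# tf: a documents x terms matrix of raw term counts (placeholder example)
tf <- matrix(rpois(6 * 8, lambda = 1), nrow = 6)

tfidf_cosine <- function(tf) {
  idf   <- log(nrow(tf) / pmax(colSums(tf > 0), 1))   # inverse document frequency
  tfidf <- sweep(tf, 2, idf, `*`)                     # tf.idf weighting
  norms <- sqrt(rowSums(tfidf^2))
  norms[norms == 0] <- 1                              # guard against empty documents
  X <- tfidf / norms                                  # row-normalize the documents
  X %*% t(X)                                          # cosine similarity between documents
}

S <- tfidf_cosine(tf)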

2.2.2 Image datasets

Although the number of image datasets used in the literature is huge, few of them are specifically multiview or multifeature in the sense of providing different sets of features for each image; often these datasets simply contain raw images.

² https://archive.ics.uci.edu/ml/datasets/Reuters+RCV1+RCV2+Multilingual+Multiview+Text+Categorization+Test+collection
³ https://linqs.soe.ucsc.edu/

Table 2.2: Summary of the image datasets

                      Digits                    AWA
View 1                Pixels (240)              CQ (2,688)
View 2                Fourier coeffs. (76)      LSS (2,000)
View 3                Profile correl. (216)     PHOG (252)
View 4                Zernike coeffs. (47)      SIFT (2,000)
View 5                Karhunen moments (64)     RGSIFT (2,000)
View 6                Morph. feats. (6)         SURF (2,000)
No. of samples        2,000                     30,475
No. of samples used   2,112                     4,000
No. of classes        10                        50

Feature name (number of variables in the feature matrix).

The two multiview image datasets selected, on the contrary, provide different image features. The main difference between these two datasets stems from the original images: the first dataset (Digits) derives from handwritten numerals in grayscale, while the second dataset (Animals with Attributes, or AWA) contains features extracted from real-world colour photographs. As a consequence, the specific feature types extracted and their values differ greatly from one dataset to the other. The details of these datasets are given in Table 2.2.

The University of California at Irvine (UCI) multiple features digits dataset [9], available at the UCI machine learning repository,⁴ is created from a set of handwritten numerals (from '0' to '9') scanned as 15 × 16 pixel grayscale images. There are 200 samples of each numeral, resulting in a total of 2,000 samples. The dataset provides six different views or feature sets of the original image data: (1) the pixel averages in 2 × 3 windows, (2) 76 Fourier coefficients of the character shapes, (3) 216 profile correlations, (4) 64 Karhunen-Loève coefficients [99], (5) 47 Zernike moments [71], and (6) 6 morphological features (not specified).

The other image dataset used in the experiments is the Animals with Attributes dataset (AWA) [65],⁵ a multiple feature dataset with six standard image features extracted from animal photographs. This dataset includes photographs of 50 different animal species, which become the classes of the data samples. Due to the high number of classes, it is particularly hard to achieve high evaluation scores with this dataset in its original configuration.

⁴ https://archive.ics.uci.edu/ml/datasets/Multiple+Features
⁵ http://attributes.kyb.tuebingen.mpg.de/

Table 2.3: Summary of the biological dataset

                   Protein
View 1             Hydrophobicity FFT (fft) (1,040)
View 2             Gene expression (expr) (1,040)
View 3             Pfam hidden Markov model (pfam) (1,040)
No. of samples     1,040
No. of classes     3

Feature name (number of variables in the feature matrix).

2.2.3 Biological dataset

The Berkeley protein dataset [66]⁶ is a multiview dataset of 1,040 yeast proteins. These proteins are labeled by their location as membrane proteins, ribosomal proteins or other. This dataset comprises 8 data views or feature sets. In order to reproduce the experiments in the literature, only three feature sets have been used: the hydrophobicity fast Fourier transform (fft), the gene expression (expr) and the expectation values derived from hidden Markov models in the Pfam database [95] (pfam). All these feature sets are expressed as relationships between proteins, therefore they are similarity matrices of 1,040 × 1,040 values. Table 2.3 presents the details of the views used in the experiments.

2.3 Evaluation of the experiments

As stated above, the methods presented in this thesis are used both as multiview dimensionality reduction methods and as multiview clustering methods. This implies that the evaluation of the experiments has to use quality metrics suited to each of these two tasks. In order to increase the coverage of the evaluation, both supervised and unsupervised metrics have been used. The set of evaluation metrics used in the experiments is described next.

2.3.1 Dimensionality reduction metrics

The quality of the low-dimensional projection of a dataset generated by a dimensionality reduction method can be assessed in several ways. If the data samples are labeled, then a supervised evaluation method can be used: a classifier is trained on the projection and the classification accuracy is measured in order to produce an estimate of the quality of the projection with respect to the labeling. The advantage of this method is that it can be equally applied to multiview data and methods, as only the class labels are considered.

⁶ http://noble.gs.washington.edu/proj/sdp-svm/

Another, unsupervised, approach measures some kind of similarity of the low-dimensional projection to the original, high-dimensional data. Although this is a straightforward procedure for single-view data, for multiview data the similarity measure can be computed with respect to each of the different input views. The solution adopted is to compute the similarity with respect to all the input views and give the average similarity as the overall quality measure. The three methods used in the experiments to evaluate the quality of dimensionality reduction results are described next.

2.3.1.1 One-vs-one classifier

The first method presented in this section is one of the standard procedures in the literature to evaluate the quality of a low-dimensional projection of a dataset. It comprises two steps. First, a one-vs-one support vector machine (SVM) [50] is used to classify all the points in the dataset. One-vs-one SVMs train one classifier for each possible pair of classes in the dataset, i.e. they train c(c − 1)/2 classifiers, where c is the number of classes in the dataset. The second step is to measure the classification accuracy. A point is well classified if the majority of the trained classifiers assign it to its true class. The overall classification accuracy is the final evaluation metric for the low-dimensional projection of the data. The value of the accuracy ranges from 0 to 1, with 1 being a perfect classification. This method is referred to as SVM in the results. This evaluation method is used, among others, in [114, 115].
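A minimal sketch of this evaluation, assuming the low-dimensional projection Y and the class labels are available; it uses the svm function of the e1071 package, which handles multiclass problems with a one-vs-one scheme, and an arbitrary train/test split.

library(e1071)

# Placeholder projection and labels; in the experiments Y is the output of a
# dimensionality reduction method and 'labels' are the dataset classes.
Y      <- matrix(rnorm(300 * 2), ncol = 2)
labels <- factor(sample(1:3, 300, replace = TRUE))

svm_accuracy <- function(Y, labels, train_fraction = 0.7) {
  idx   <- sample(nrow(Y), round(train_fraction * nrow(Y)))
  model <- svm(x = Y[idx, , drop = FALSE], y = labels[idx])
  pred  <- predict(model, Y[-idx, , drop = FALSE])
  mean(pred == labels[-idx])   # classification accuracy on the held-out points
}

svm_accuracy(Y, labels)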

2.3.1.2 Cophenetic correlation

As the second evaluation method, the cophenetic correlation between the distances in the high-dimensional input space and the distances in the low-dimensional projection space is computed. The cophenetic correlation [94] measures how faithfully the distances between a set of points are kept in an alternate representation. This method was originally developed for distances in dendrograms, in the field of biostatistics. Given two data points i, j, if the distance between them in the input high-dimensional space is δij and the distance between them in the low-dimensional projection space is dij, then the cophenetic correlation is

$$c = \frac{\sum_{i<j} (\delta_{ij} - \bar{\delta})(d_{ij} - \bar{d})}{\sqrt{\sum_{i<j} (\delta_{ij} - \bar{\delta})^2 \; \sum_{i<j} (d_{ij} - \bar{d})^2}} \tag{2.1}$$

Given that the datasets used in the experiments have multiple views, the average cophenetic correlation of all the input data views with respect to the output projection is computed and referred to as coph.
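A minimal sketch of this computation for one view, assuming the input view X and the projection Y are row-aligned matrices; the multiview score simply averages this value over all input views.

cophenetic_correlation <- function(X, Y) {
  delta <- as.vector(dist(X))   # pairwise distances in the high-dimensional view
  d     <- as.vector(dist(Y))   # pairwise distances in the low-dimensional projection
  cor(delta, d)                 # Pearson correlation over all pairs i < j
}

# Multiview usage: mean(sapply(list(X1, X2, X3), cophenetic_correlation, Y = Y))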

2.3.1.3 Area under the local neighbourhood agreement curve

The third evaluation method is proposed in [68], and the implementation in the R package dimRed⁷ is used. Its goal is to measure the conservation of the local K-ary neighbourhoods of the original data points in the low-dimensional projection. The rank of data point $p_j$ with respect to point $p_i$ in the original high-dimensional space is defined as

$$\rho_{ij} = \left| \{ k : \delta_{ik} < \delta_{ij} \text{ or } (\delta_{ik} = \delta_{ij} \text{ and } 1 \le k < j \le N) \} \right| \tag{2.2}$$

where $\delta_{ij}$ is the distance between points $p_i$ and $p_j$ in the high-dimensional space and $N$ is the number of points in the dataset. The rank $r_{ij}$ between points $p_j$ and $p_i$ in the low-dimensional space is similarly defined, using the corresponding distances in that space. The K-ary neighbourhood of point $p_i$ in the high-dimensional space is the set of points $\nu_i^K = \{ j : 1 \le \rho_{ij} \le K \}$. The K-ary neighbourhood $n_i^K$ of $p_i$ in the low-dimensional space is similarly defined using $r_{ij}$ instead of $\rho_{ij}$. The performance index $Q_{NX}$ is defined as

$$Q_{NX}(K) = \sum_{i=1}^{N} \frac{\left| \nu_i^K \cap n_i^K \right|}{KN} \tag{2.3}$$

which gives a normalized average agreement between the neighbourhoods in the high- and low-dimensional spaces, ranging from 0 (no agreement at all) to 1 (perfect agreement). Another index derived from $Q_{NX}$, which accounts for the random neighbour matches that may exist, is given next

$$R_{NX}(K) = \frac{(N-1)\, Q_{NX}(K) - K}{N - 1 - K} \tag{2.4}$$

In this case, a random neighbourhood assignment (i.e. a random projection) would produce an $R_{NX}$ value of 0. Iterating over different values of $K$ produces a curve. The area under this curve (AUC) for $1 \le K \le \kappa$, where $\kappa$ is the number of dimensions of the original data space, is proposed as a measure of the quality and consistency of the dimensionality reduction method. As this index depends on the points in the input space, a different index is obtained for each input view in a multiview setting. The results given, referred to as AUC-RNX, are the average area under the curve of the $R_{NX}$ index of the low-dimensional space with respect to each of the input high-dimensional spaces.
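The following R sketch computes Q_NX, R_NX and the area under the R_NX curve directly from Equations 2.2-2.4 for a single input view; it is an illustrative from-scratch version, not the dimRed implementation used in the experiments, and the range of K and the averaging used to approximate the AUC are assumptions.

auc_rnx <- function(X, Y, K_max = 20) {
  N <- nrow(X)
  # Neighbour rankings in each space (column i holds the neighbours of point i, nearest first)
  rank_high <- apply(as.matrix(dist(X)), 2, order)[-1, , drop = FALSE]  # drop the point itself
  rank_low  <- apply(as.matrix(dist(Y)), 2, order)[-1, , drop = FALSE]
  r_nx <- sapply(1:K_max, function(K) {
    q_nx <- mean(sapply(1:N, function(i)
      length(intersect(rank_high[1:K, i], rank_low[1:K, i])) / K))      # Equation 2.3
    ((N - 1) * q_nx - K) / (N - 1 - K)                                  # Equation 2.4
  })
  mean(r_nx)   # area under the R_NX curve, approximated as the mean over K
}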

2.3.2 Clustering quality metrics

In order to evaluate the quality of a clustering assignment, two approaches are possible. If the data is labeled, then an external evaluation is feasible, where the clustering assignment produced is compared with the original class labels of the data.

⁷ https://cran.r-project.org/web/packages/dimRed/index.html

On the other hand, internal clustering evaluation methods do not rely on class labels but rather on the intrinsic properties of the data and the clustering assignment. Both kinds of evaluation methods are used in the present experiments. As there are multiple input matrices in each dataset but a single class labeling, external methods can be applied as they are, while internal methods have to be applied to each input data view. In the latter case, the average of the indices obtained is given as the final result.

2.3.2.1 Clustering purity

Clustering purity [78] is an external measure of clustering quality that quantifies the agreement of the clustering assignment with the original classes. In order to compute the clustering purity, each cluster $w_i$ is assigned to the class most frequent in the cluster ($c_j$). The accuracy of this assignment is measured by counting the number of correctly assigned data samples and dividing by the total number of data samples $N$:

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{i} \max_{j} | w_i \cap c_j | \tag{2.5}$$

where $\Omega = \{w_1, w_2, \ldots, w_k\}$ is the cluster assignment and $C = \{c_1, c_2, \ldots, c_k\}$ are the original class labels. The clustering purity is referred to in the results as purity.
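A minimal sketch of Equation 2.5, assuming the clustering assignment and the class labels are given as two vectors of the same length.

clustering_purity <- function(clusters, classes) {
  counts <- table(clusters, classes)         # contingency table cluster x class
  sum(apply(counts, 1, max)) / length(clusters)
}

# Example
clustering_purity(clusters = c(1, 1, 2, 2, 2), classes = c("a", "a", "b", "b", "a"))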

2.3.2.2 Normalized mutual information

The normalized mutual information (NMI) [97] is another external clustering evaluation metric. It is defined as

$$\mathrm{NMI}(\Omega, C) = \frac{I(\Omega, C)}{[H(\Omega) + H(C)]/2} \tag{2.6}$$

where $I$ is the mutual information, $H$ is Shannon's entropy, $\Omega = \{w_1, w_2, \ldots, w_k\}$ is the cluster assignment and $C = \{c_1, c_2, \ldots, c_k\}$ are the original class labels. The mutual information in the numerator measures how informative the clustering assignment is about the real classes. The problem is that larger numbers of clusters tend to produce better values. For this reason, the normalization in the denominator penalizes a high number of clusters (as the entropy tends to increase with the number of clusters). This way, an NMI of 0 implies a random clustering assignment, while an NMI of 1 implies a perfect clustering assignment.
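A minimal from-scratch sketch of Equation 2.6, again assuming two label vectors of the same length; natural logarithms are used, which is an arbitrary but consistent choice.

clustering_nmi <- function(clusters, classes) {
  n  <- length(clusters)
  p  <- table(clusters, classes) / n                      # joint distribution
  pw <- rowSums(p); pc <- colSums(p)                      # marginals
  # Mutual information, summing only over non-empty cells
  mi <- sum(ifelse(p > 0, p * log(p / outer(pw, pc)), 0))
  h  <- function(q) -sum(ifelse(q > 0, q * log(q), 0))    # Shannon entropy
  mi / ((h(pw) + h(pc)) / 2)
}

clustering_nmi(clusters = c(1, 1, 2, 2, 2), classes = c("a", "a", "b", "b", "a"))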

2.3.2.3 Davies-Bouldin index

The Davies-Bouldin index (DBI) [26] is an internal clustering quality measure that assesses the compactness of each cluster with respect to the other clusters in the assignment. It is defined as

$$\mathrm{DBI} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \ne i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right) \tag{2.7}$$

where $K$ is the number of clusters, $c_n$ is the centroid of cluster $n$, $\sigma_n$ is the average distance of the points in cluster $n$ to its centroid, and $d(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$. Better clustering assignments yield lower DBI values. However, the reliability of the DBI is limited by the meaningfulness of the distance metric (Euclidean by default) on the input data.
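A minimal sketch of Equation 2.7 with Euclidean distances, assuming a data matrix X and a vector of cluster labels; it is written for clarity rather than efficiency.

davies_bouldin <- function(X, clusters) {
  ids       <- sort(unique(clusters))
  centroids <- t(sapply(ids, function(k) colMeans(X[clusters == k, , drop = FALSE])))
  sigma     <- sapply(seq_along(ids), function(k) {
    pts <- X[clusters == ids[k], , drop = FALSE]
    mean(sqrt(rowSums((pts - matrix(centroids[k, ], nrow(pts), ncol(X), byrow = TRUE))^2)))
  })
  d <- as.matrix(dist(centroids))      # distances between centroids
  K <- length(ids)
  mean(sapply(1:K, function(i) max((sigma[i] + sigma[-i]) / d[i, -i])))
}

# Example: DBI of a K-means assignment on random data
X <- matrix(rnorm(200), ncol = 2)
davies_bouldin(X, kmeans(X, centers = 3)$cluster)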

2.4 Design of the baseline experiments

The experiments required to compare the methods presented in this work with the baseline single-view methods must account for several factors. In general terms, the experiments presented in this thesis are designed with the following objectives or principles in mind:

• Demonstrate the performance of the methods on an adequate number of datasets. These datasets must have heterogeneous characteristics so as to account for a wide range of cases.

• Thoroughly test the three methods presented in this work.

• Test the methods as both dimensionality reduction and clustering meth- ods.

• Use several evaluation metrics to obtain more solid results.

• Test the response of the method with different dimensionalities of the projections generated.

• Carefully design the experiments to account for possible sources of ran- domness in the methods or the evaluation metrics.

• Properly compare the proposed methods with single-view counterparts, either on individual views or on matrices of stacked or concatenated features.

• Properly compare the proposed methods with other multiview methods in the literature.

Consequently, in order to fulfill these objectives the experiments presented in this work are designed as follows. Table 2.4 summarizes the factors involved in the design of the experiments.

Datasets Six datasets are used in the experiments, and their subjects are images, text and biology. Moreover, even though the image and text categories have more than one dataset, these datasets are not redundant, as the design of the multiview aspect is different in each of them, as explained in Section 2.2.

Methods The three methods presented in this thesis are used in all the experiments, in order to perform a thorough experimental analysis and com- parison of all three. Moreover, the methods are used in both dimensionality reduction and clustering tasks to assess their performance on each of these tasks.

Evaluation Six evaluation metrics, presented in Section 2.3, are used in all the experiments in order to properly capture as many aspects as possible of the quality of the output of each method. More specifically, three metrics are used to evaluate the dimensionality reduction task, and another three metrics are used to evaluate the clustering task. In each category, both supervised and unsupervised metrics are used to increase the scope of the evaluation. As a side note, the unsupervised evaluation methods cannot be applied to the Cora dataset because they are not compatible with graph-space input views (Cora's references between articles).

Dimensionality of the projection Both in dimensionality reduction and clustering tasks, the dimensionality of the projection generated is a decisive factor in the performance of the methods and the quality indicators obtained. As a consequence, all the experiments are run on 20 different dimensionality values, ranging from 2 to 100.

Account for randomness Some of the methods presented or analyzed in this work have an intrinsically random behaviour, specifically SNE [49] and its derivations. This causes them to output possibly different results on each execution. On the other hand, some evaluation metrics also have a random component, such as the one-vs-one SVM (Section 2.3), because it requires partitioning the data into train and test subsets. This partition is performed at random, and consequently the results of different executions may vary. To account for the randomness of methods and evaluation metrics, the affected experiments are repeated ten times and the mean and standard deviation are reported, in order to minimize the impact of the random factor on the final results.

Factor                Magnitude
Datasets              6
Methods proposed      ×(2 + 10 × 1)
Baseline methods      ×[(2 + 1) + (5 + 1) + (2 + 1) + (6 + 1) + (6 + 1) + (3 + 1)] = ×28
Tasks                 ×2
Evaluation metrics    ×(5 + 10 × 1)/2
Dimensions            ×20
Total experiments     604,800

Table 2.4: Factors in the design of the baseline experiments

Exhaustive comparisons The multiview methods presented in this work are carefully compared with other methods in order to assess their performance and potential usefulness. First, they are compared with their single-view counterparts, either using a single view or using a matrix with the stacked features of all the input views. As a consequence, the number of experiments is equal to the number of views in the dataset (each single view) plus one (the stacked feature matrix). The goal of this comparison is to estimate whether the multiview approach is better than the standard single-view methodology, and also whether processing each view separately is better than concatenating all the views and processing them as a single input view. On the other hand, the proposed methods are compared with equivalent multiview methods in the literature in order to measure their performance with respect to existing multiview methods and to decide whether they are a significant contribution. This experiment design subgoal is addressed in the state of the art experiments, described in Section 2.5.

The experiments performed for the present work result from the combina- tion of the above factors. The results obtained by each method are presented in the respective chapter where each method is described. Moreover, Chapter 6 includes a comparison between the three methods presented in this thesis.

2.5 Design of the state of the art experiments

In order to allow the comparison of the methods presented in this thesis with equivalent methods in the literature, a second set of experiments has been designed. These experiments reproduce as faithfully as possible the experimental setup reported in the papers in the state of the art. The most frequently cited datasets in the literature are used for these comparison experiments: the BBC News text collection, the Reuters multilingual corpus, the multiple features digits dataset and the Animals with Attributes (AWA) dataset. They are described in Section 2.2. However, the AWA experiments in the state of the art differ from the general experiments with AWA presented in this thesis, as they follow the preprocessing proposed in [106]: (1) use only the 10 most populated classes, and (2) use all available samples, giving a total of 18,450 samples. These changes explain the higher results yielded by these experiments, especially regarding supervised evaluation metrics. The multiview clustering papers in the literature follow the evaluation methodology described in [78] and use the clustering purity and the clustering normalized mutual information as clustering evaluation metrics. These supervised quality metrics are described in Section 2.3. The proposed methods are compared with the most relevant multiview clustering methods in the state of the art, namely: co-regularized spectral clustering [62] (CoregSC), multi-modal spectral clustering [12] (MMSC), multiview clustering via structured sparsity [106] (MVC-SS), multiview K-means clustering [11] (MV-KMeans), subspace co-training for multiview clustering [117] (CoKmLDA), multi-feature spectral clustering with minimax optimization [106] (MFSC-MO), multiview clustering via pairwise sparse subspace representation [112] (MVC-PSS), and large-scale multiview spectral clustering via bipartite graph [70] (MVSC-BG).

No reproducible results have been found in the state of the art for multiview dimensionality reduction methods, and as a consequence no state of the art comparison experiments on dimensionality reduction have been performed.

Chapter 3

Multiview t-distributed stochastic neighbour embedding

3.1 Motivation

t-distributed stochastic neighbour embedding (t-SNE) [101, 102, 74] is one of the most popular dimensionality reduction methods in the community. It is specifically designed to achieve good two- or three-dimensional projections of complex data, intended for data visualization. Its goals are to preserve both the local structure (the neighbourhood of each sample) and the global structure (distant samples remain apart in the low-dimensional representation).

t-SNE is itself an evolution of the stochastic neighbour embedding method (SNE) [49] that solves some of its limitations by changing the way the output, low-dimensional space is computed.

Given the popularity and usefulness of t-SNE, an extension of this method to multiview datasets may be useful to the community. Its objective would be to take several input data spaces on the same samples and generate a single, low-dimensional output space while preserving both the local and global structure of all the input views as much as possible.

In this chapter, two multiview extensions to t-SNE are proposed, one using multi-objective optimization techniques, the other using expert opinion pooling methods. Although both proposals solve the given problem, the first algorithm is too expensive in computational terms and consequently only the second algorithm is considered in the experimental phase.

3.2 Related work

3.2.1 Stochastic neighbour embedding

Stochastic neighbour embedding (SNE) [49] is a non-linear dimensionality reduction method. The main intuitions behind SNE are (a) to view the distances between data points as probabilities of being neighbours, and (b) to try to find a low-dimensional space Y whose neighbouring probabilities are as close as possible to the probabilities of the original, high-dimensional space X. The formal description of the method follows.

First step. SNE begins by converting the Euclidean distances between the points of the original space X into conditional probabilities. Given two points x_i, x_j ∈ X, the conditional probability p_{j|i} is computed using a Gaussian probability density centered at x_i, with a variance σ_i whose determination is described later in this section. p_{j|i} can be interpreted as the probability that x_i would have x_j as its closest neighbour. Close points have a high probability, while points far apart have an almost zero probability. The mathematical expression of p_{j|i} is:

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}    (3.1)

As a result, for each x_i ∈ X a vector of probabilities P_i of length |X| is obtained, where p_{i|i} is set to 0; otherwise the high similarity of each point with itself would dominate over the other neighbouring probabilities. Stacking the vectors of probabilities produces a matrix P of |X| × |X| probabilities whose diagonal is zero.

As stated above, σ_i has to be defined for each data point x_i. This value is not defined directly, but through the perplexity, a user-defined global parameter. Intuitively, the perplexity is the average number of neighbours per point, with typical values between 5 and 50. A different σ_i is chosen for each point to account for the different density of points in the different areas of the input space. The perplexity is defined as follows:

Perp(P_i) = 2^{H(P_i)}    (3.2)

where H(P_i) is the Shannon entropy [90] of P_i measured in bits, i.e.

H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}    (3.3)

SNE performs a binary search on each σ_i so that the perplexity condition is satisfied. Points in sparse regions will naturally have a higher σ_i than points in denser regions.
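As an illustration of this step, the following minimal sketch (numpy; the function and variable names are my own, not taken from the thesis or its software package) performs the binary search of σ_i for one point so that the perplexity of Equation 3.2 matches the user-defined value.

import numpy as np

def sigma_for_point(sq_dists_i, target_perplexity, n_iter=50, tol=1e-5):
    """Binary-search sigma_i so that 2**H(P_i) matches the target perplexity.

    sq_dists_i: squared Euclidean distances from point i to every other point
                (the distance of i to itself excluded).
    """
    lo, hi = 1e-20, 1e20
    sigma = 1.0
    for _ in range(n_iter):
        # Conditional probabilities p_{j|i} for the current sigma (Equation 3.1).
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        # Shannon entropy in bits (Equation 3.3) and resulting perplexity (Equation 3.2).
        h = -np.sum(p * np.log2(p + 1e-12))
        perp = 2.0 ** h
        if abs(perp - target_perplexity) < tol:
            break
        # Larger sigma -> flatter distribution -> higher perplexity.
        if perp > target_perplexity:
            hi = sigma
            sigma = (lo + sigma) / 2.0
        else:
            lo = sigma
            sigma = (sigma + hi) / 2.0 if hi < 1e20 else sigma * 2.0
    return sigma, p

In the actual algorithm this search is run once per data point, producing the row P_i of the conditional probability matrix.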

Second step. On the low-dimensional space to be produced, Y, a similar conditional probability is computed for all the pairs of projected points y_i and y_j, denoted by q_{j|i}. The main difference with respect to p_{j|i} is that SNE computes q_{j|i} using a constant σ = 1/√2 for all the points in the dataset. Therefore the formulation of this conditional probability is:

q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}    (3.4)

As in the previous case, q_{i|i} = 0. The vector of conditional probabilities associated with each projected point y_i is referred to as Q_i, and the whole matrix of conditional probabilities is Q, with the elements in its diagonal equal to zero.

Third step. The hypothesis of SNE is that if the conditional probabilities p_{j|i} and q_{j|i} are similar then the low-dimensional space Y should be a faithful representation of the original, high-dimensional space X. A suitable measure to compare probability distributions is the Kullback-Leibler divergence (KL) [103] between each pair of probability vectors P_i and Q_i. SNE has to find a Q that minimizes the sum of such KL divergences, i.e. the following cost function:

C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}    (3.5)

SNE uses gradient descent optimization to minimize C, i.e. it has to find the projection Y of the original data points whose conditional probability matrix Q minimizes C. The gradient of the cost function defined in Equation 3.5 is:

\frac{\delta C}{\delta y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)    (3.6)

The gradient descent algorithm is initialized by using a random Gaussian distribution of points for y_i, which progressively gets modified until either a minimum is reached or a predefined number of iterations is executed.

Analysis of the algorithm. As a consequence of the asymmetry of the KL divergence, not all differences between the pairwise conditional probabilities in P and Q are equally relevant. More specifically, having two close points x_i and x_j projected to distant positions y_i and y_j has a higher cost than the inverse situation, i.e. representing two separate points in relative proximity in the projection space Y. In other words, SNE is designed to keep the local structure of the data (close points remain close to each other) but not as much the global structure (far points may be represented close to each other).

The SNE cost function is non-convex, which leads to the necessity of running the algorithm several times in order to avoid local minima and possibly obtain better data projections.

3.2.2 t-distributed stochastic neighbour embedding

An inherent problem when reducing the dimensionality from high-dimensional spaces to low-dimensional spaces (often two or three dimensions if the objective is to plot the points) is the problem of point crowding. In high-dimensional spaces there is much more room for points to be moderately apart from each other. However, representing such intermediate distances in a low-dimensional space implies using relatively large distances to properly separate the points and avoid projecting them in nearby positions. In turn, neighbouring points in the original space would have much smaller distances in the projected space, leading them to be almost overlapped, with an unrealistically high concentration of points in the center of the plot.

Several solutions have been proposed to address this issue. However, none of them focuses on the cause of the problem: the neighbouring probabilities in the low-dimensional space do not follow the same probability density function as the probabilities in the high-dimensional space. This stems from the fact that distant points are required to have a proportionally larger distance in the low-dimensional space to avoid the crowding effect. Therefore, [102, 74] proposed to use a different probability density function to model the low-dimensional distances. This improvement on SNE, called t-distributed stochastic neighbour embedding (tSNE), uses a t-distribution with one degree of freedom to model the probabilities in the low-dimensional space. The t-distribution has heavier tails than the Gaussian, therefore it allows modelling moderate distances in the original space as larger distances in the low-dimensional space. In turn, this helps alleviate the crowding problem mentioned before, as in practice it provides more usable space in the projection.

From a theoretical point of view, t-SNE also justifies the use of the t-distribution in the low-dimensional space because it is an infinite mixture of Gaussian distributions. Moreover, its computational cost is lower as it eliminates the exponential from the expression. The pairwise neighbouring probabilities in the low-dimensional space, modelled with the t-distribution, are given by the following expression:

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}    (3.7)

And the gradient of the new cost function, i.e. the Kullback-Leibler divergence between P and the t-distribution based Q, is:

\frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}    (3.8)

In order to improve the efficiency of the gradient descent optimization, a momentum term is added to the gradient. Thus, an exponentially decaying sum of the previous gradients is added to the current gradient, so that improvement along consistent directions is enhanced. The update expression, which incorporates the gradient together with a momentum term built from the last two solutions, is given by

Y^{(t)} \leftarrow Y^{(t-1)} + \eta \frac{\delta C}{\delta Y} + \alpha(t)\,(Y^{(t-1)} - Y^{(t-2)})    (3.9)

where Y^{(t)} is the low-dimensional embedding of the data at iteration t, η is the learning rate of the gradient descent algorithm, and α(t) is a coefficient that regulates the desired amount of momentum at iteration t. The algorithmic specification of tSNE is given in Algorithm 2. The main source of computational cost in tSNE is found inside the main optimization loop, more specifically in the computation of the Kullback-Leibler divergence between P and Q, which is O(n²), where n is the number of points in the dataset. Therefore the overall computational cost of the algorithm is O(n²T).
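To make the update loop concrete, here is a minimal numpy sketch of the t-SNE iteration (Equations 3.7 to 3.9); it assumes the symmetrized probability matrix P has already been computed (for instance with the per-point sigma search sketched above) and omits the usual refinements such as early exaggeration. The formal specification follows in Algorithm 2.

import numpy as np

def tsne_loop(P, n_samples, out_dims=2, T=1000, eta=100.0, alpha=0.8, seed=0):
    """Gradient descent with momentum on the t-SNE cost (a simplified sketch)."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(n_samples, out_dims))
    Y_prev = Y.copy()
    for t in range(T):
        # Low-dimensional affinities q_ij with a t-distribution kernel (Equation 3.7).
        diff = Y[:, None, :] - Y[None, :, :]
        inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()
        # Gradient of the KL divergence (Equation 3.8).
        PQ = (P - Q) * inv
        grad = 4.0 * np.einsum('ij,ijk->ik', PQ, diff)
        # Move against the gradient plus momentum; Equation 3.9 writes the same
        # update with the step sign folded into the learning rate.
        Y_new = Y - eta * grad + alpha * (Y - Y_prev)
        Y_prev, Y = Y, Y_new
    return Y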

Algorithm 2. t-Distributed Stochastic Neighbour Embedding
Input: dataset X = {x_1, x_2, ..., x_n}; cost function parameters: perplexity Perp; optimization parameters: number of iterations T, learning rate η, momentum α(t)
Output: low-dimensional data representation Y^(T) = {y_1, y_2, ..., y_n}
function tSNE(X, Perp, T, η, α(t))

    compute pairwise affinities p_{j|i} with perplexity Perp (using Equation 3.1)
    p_{ij} ← (p_{j|i} + p_{i|j}) / 2n
    sample initial solution Y^(0) = {y_1, y_2, ..., y_n} from N(0, 10^{-4})
    for t ← 1 to T do

        compute low-dimensional affinities q_{ij} (using Equation 3.7)
        compute gradient δC/δY (using Equation 3.8)
        Y^(t) ← Y^(t−1) + η δC/δY + α(t)(Y^(t−1) − Y^(t−2))
    end for
end function

3.2.3 Optimization of multiple objectives

The dimensionality reduction algorithm tSNE, described in Section 3.2.2, uses gradient descent optimization to find a reasonably good projection of the points from their original, high-dimensional space into a low-dimensional space. Gradient descent optimization is well defined for single-objective problems where the cost function is analytically differentiable. However, when the problem at hand is multi-objective, i.e. there are several cost functions to minimize that may contradict each other, there is no single obvious way to apply gradient descent optimization.

A fundamental criterion in multi-objective optimization is Pareto optimality [98]. Given a problem defined on a domain Υ ⊆ R^N, with multiple objective functions to be minimized J_i(Y), Y ∈ Υ, 1 ≤ i ≤ n, a solution Y^1 dominates Y^2 in efficiency, Y^1 ≻ Y^2, if and only if

J_i(Y^1) \leq J_i(Y^2) \quad \forall i = 1, \ldots, n    (3.10)

and at least one inequality is strict, i.e. all the objective functions on Y^1 are less than or equal to those on Y^2 and at least one of them is strictly smaller. The solution associated with point Y^1 is said to be Pareto efficient with respect to the solution associated with point Y^2. From the previous definition, a point Y^o ∈ Υ is said to be Pareto optimal if there is no other point that dominates it. The set of Pareto-optimal points is referred to as the Pareto set. The Pareto front or frontier is the manifold or shape defined by the points in the Pareto set. Most multi-objective gradient descent methods are designed with the objective of finding the Pareto front, which is the equivalent of the set of minimal points in a single-objective problem.

Several multi-objective gradient descent methods have been proposed in the literature, as described in [32], [77], and [38]. There are also alternative approaches, like the non-dominated sorting genetic algorithm (NSGA) [30], which uses genetic algorithms to optimize a multi-objective problem.

The Multiple Gradient Descent Algorithm (MGDA) [32] is based on the following definitions. Let Y^0 ∈ Υ be the current point in a multi-objective gradient descent optimization problem. As it is a multi-objective problem, there exists one gradient matrix for each objective function at point Y^0: ∇J_i(Y^0). Each such gradient implies an optimization direction or vector u_i. Let U be the convex hull defined by the vectors u_i:

U = \left\{ u \in R^N \;\middle/\; u = \sum_{i=1}^{n} \alpha_i u_i;\ \alpha_i \geq 0\ (\forall i);\ \sum_{i=1}^{n} \alpha_i = 1 \right\}    (3.11)

then there exists a unique vector ω ∈ U with minimum norm:

\forall u \in U : (u, \omega) \geq \|\omega\|^2    (3.12)

[32] demonstrates that using ω as the descent vector at Y^0 guarantees the Pareto efficiency of the new point Y^1 = Y^0 + ω. In this way, MGDA can optimize a multi-objective problem using gradient descent steps that satisfy the Pareto efficiency criterion. At each iteration of MGDA, given that the current solution point of the algorithm is Y^t, determining ω implies finding the vector characterized by the following expression:

\min \left\| \sum_{i=1}^{n} \alpha_i \nabla J_i(Y^t) \right\|^2    (3.13)

where α_i ≥ 0 ∀i and Σ_{i=1}^{n} α_i = 1. In the case of a two-objective problem (n = 2), there exists an analytical expression to determine ω. However, in the more general case of n > 2, determining the ω that satisfies Equation 3.13 implies solving an optimization problem in itself.
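For the two-objective case, the minimum-norm element of the convex hull of the two gradient directions has a closed form; the following sketch (numpy; my own illustration, not the thesis implementation) computes ω for n = 2 and clips the coefficient onto the simplex.

import numpy as np

def min_norm_two(u1, u2):
    """Minimum-norm vector in the convex hull of two gradient directions.

    omega = a*u1 + (1-a)*u2 with a in [0, 1] chosen to minimize ||omega||^2,
    i.e. the two-objective case of Equation 3.13.
    """
    diff = u1 - u2
    denom = np.dot(diff, diff)
    if denom == 0.0:
        return u1.copy()                  # both gradients coincide
    a = np.dot(u2, u2 - u1) / denom       # unconstrained minimizer along the segment
    a = np.clip(a, 0.0, 1.0)              # project back onto the simplex
    return a * u1 + (1.0 - a) * u2

# Example: two conflicting gradient directions.
u1 = np.array([1.0, 0.0])
u2 = np.array([0.0, 1.0])
omega = min_norm_two(u1, u2)              # [0.5, 0.5], the common descent direction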

3.2.4 Expert opinion pooling

Let us imagine we have a decision problem where a certain entity θ has to be classified into one among k classes. The opinion of an expert can be requested, so that he or she provides a vector of k probability values p_j, j = 1, ..., k such that p_j is the probability that θ belongs to class j, with Σ_j p_j = 1. However, it may be the case that the opinion of several experts is requested, so each expert would provide a possibly different vector of probabilities. Then a problem arises: what is the best way of combining the different opinions, i.e. the different vectors of probabilities, in order to make the best use of all the opinions available? This is the subject of the expert opinion pooling or expert opinion aggregation problem, whose origins are in the areas of economics and market studies.

The approaches to solve this problem are classified into two families according to [20]: axiomatic and Bayesian. Bayesian methods compute a Bayesian estimate from a set of expert opinions and usually yield good results, but require a high number of experts to be considered. This makes them inappropriate for the problems treated here. On the other hand, the axiomatic approaches assume a prior criterion and apply a combination formula that does not take into consideration the actual probability values. One of the most used axiomatic approaches is the linear opinion pool [4]:

p(\theta) = \sum_{i=1}^{n} \omega_i\, p_i(\theta)    (3.14)

where n is the number of expert opinions considered, p_i(θ) is the vector of probabilities given by expert i, and ω_i is a weight or confidence value assigned to expert i. In order to simplify the following operations, Σ_{i=1}^{n} ω_i = 1.

n Y ωi p(θ) = r pi(θ) (3.15) i=1 where r is a normalizing constant in order to make p(θ) a probability distribu- tion, and the remaining values have the same meaning as in the linear opinion pool. However an open question is the assignment and interpretation of experts’ weights (wi). A straightforward solution proposed by [21] is to assign the same weight to all the experts. [5] propose to choose the weight combination that minimizes the variance of the resulting probability distribution p(θ). [45] propose to use the Kullback-Leibler divergence (KL) ([103]) to assign weights to the experts, making them inversely proportional to the maximum KL with respect to the other experts’ opinions. This way the more dissimilar an ex- pert is from any other expert, the less weight its opinion has. [40] review other expert weight assignment strategies. [15] propose a method to choose the experts’ weights designed to be used with the log-linear opinion pooling method. Their method (1) maximizes the entropy of the resulting probability distribution p(θ) and (2) minimizes the KL divergence between p(θ) and the individual expert probability distributions pi(θ). Finally, [1] analyzes the criteria for choosing between the linear and the log-linear opinion pooling method. Summarizing the results of his work, the following expression has to be computed:

n X H(p(θ)) − ωiH(pi(θ)) − log(r) (3.16) i=1 where H is Shannon’s entropy function. If the value of this expression is pos- itive then the log-linear pooling should be used; otherwise, it is recommended to use the linear pooling. However, as it can be observed, the result of this expression depends on the weights used, which in turn affect the entropy of p(θ). Combining this criterium with the weight selection method proposed in [15], it is safe to use the log-linear as default pooling method, as Equation 3.16 will be positive in most cases.

3.3 Multiview tSNE

There exist several possible approaches to design an extension of the t-SNE algorithm that can process multiview data. Two approaches are presented in this thesis. The first is to use a multi-objective gradient descent method to find a projection of the original data points that minimizes the divergence between the projection and the different input data views. The second uses expert opinion pooling to aggregate the conditional probability distributions of all the input views, thereby transforming the multiview problem into a standard tSNE problem applied to the pooled probability matrix. Both approaches are described next.

3.3.1 MV-tSNE as a multiobjective optimization problem

The tSNE dimensionality reduction method finds a reasonably good projection of a set of points in a high-dimensional space to a low-dimensional space by minimizing the Kullback-Leibler divergence (KL) between two conditional probability distributions, as explained in Section 3.2.2. It uses gradient descent optimization in order to find the most convenient arrangement of points in the low-dimensional space. On each iteration of the gradient descent, it computes the gradient of the KL divergence between the conditional probability distribution matrices with respect to the positions of the points in the low-dimensional space.

The multiview tSNE extension presented in this section (MV-tSNE1) differs from the standard tSNE algorithm in the following aspects.

Conditional probability distributions. SNE and tSNE (see Sections 3.2.1 and 3.2.2) convert the Euclidean distances between the points of the input space X into a matrix of conditional probabilities according to a Gaussian distribution of the distances. In a multiview setting, there exist v input spaces {X^1, X^2, ..., X^v}. MV-tSNE1 computes the conditional probability matrix of each X^k using Equation 3.1, thus obtaining v probability matrices P^k, k = 1, ..., v. However, there is a single conditional probability distribution matrix Q for the low-dimensional space Y, as the goal of MV-tSNE1 is to produce a unique data projection common to all the input views X^k. Q is computed as in the tSNE algorithm, using Equation 3.7.

Cost function. The cost function of MV-tSNE1 is the sum of the KL divergences of all the input conditional probability matrices P^k with respect to the low-dimensional conditional probability matrix Q:

C = \sum_{k=1}^{v} \sum_{i=1}^{n} KL(P_i^k \| Q_i) = \sum_{k=1}^{v} \sum_{i=1}^{n} \sum_{j=1}^{n} p_{j|i}^k \log \frac{p_{j|i}^k}{q_{j|i}}    (3.17)

However, the gradient of the combined objectives is not required, as the multi-objective optimization algorithm used works with the gradients of each objective (each input view). As a consequence, Equation 3.8 applied to each matrix P^k still holds in this algorithm.

Multi-objective gradient descent optimization. In order to minimize the KL divergence of the low-dimensional points with respect to the high-dimensional input views, this multiview dimensionality reduction problem requires a multi-objective gradient descent method, more specifically the Multiple Gradient Descent Algorithm (MGDA) described in Section 3.2.3. On each iteration of the optimization algorithm, the gradients of the different input views are computed and combined in a way such that the Pareto efficiency criterion holds. In other words, the change on each iteration never worsens the partial cost value of a specific input view (problem objective).

The use of a momentum vector to improve the performance of the gradient descent algorithm is not defined in MGDA; in fact, applying a momentum would often collide with the Pareto-compliant direction of change ω. Therefore that improvement from tSNE is removed in MV-tSNE1. The specification of MV-tSNE1 is presented in Algorithm 3.

Algorithm 3. Multiview t-Distributed Stochastic Neighbour Embedding 1
Input: v data views of the same n entities X^k = {x_1^k, x_2^k, ..., x_n^k}, where k = 1, 2, ..., v; cost function parameters: perplexity Perp; optimization parameters: number of iterations T, learning rate η
Output: low-dimensional data representation Y^(T) = {y_1, y_2, ..., y_n}
function MV-tSNE1(X^1, X^2, ..., X^v, Perp, T, η)
    compute pairwise affinities p^k_{j|i} with perplexity Perp (using Equation 3.1) for each X^k, k = 1, ..., v
    p^k_{ij} ← (p^k_{j|i} + p^k_{i|j}) / 2n
    sample initial solution Y^(0) = {y_1, y_2, ..., y_n} from N(0, 10^{-4})
    for t ← 1 to T do

        compute low-dimensional affinities q_{ij} (using Equation 3.7)
        compute gradients δC^k/δY (using Equation 3.8 on the partial cost C^k associated with each p^k_{ij})
        compute the vector of change ω as the minimum-norm vector in the convex hull defined by the gradients δC^k/δY (using the MGDA algorithm)
        Y^(t) ← Y^(t−1) + ηω
    end for
end function

Limitations of MV-tSNE1. MV-tSNE1 presents several practical limitations that make it unusable on real datasets. The most relevant limitations concern computational cost, caused by the following factors:

• On each iteration the KL divergence between each input view and the projection has to be computed. This multiplies the computational cost of each iteration by v, as the KL computation is its most expensive operation, as discussed in Section 3.2.2.

• The execution of the MGDA algorithm also adds a significant computational cost per iteration. This is especially the case in problems with more than two objectives (i.e. data views), where an optimization problem has to be solved on each iteration to find the descent vector characterized by Equation 3.13.

• The fact that MGDA does not support the use of momentum in the optimization loop makes it necessary to run the main loop in Algorithm 3 for more iterations (an order of magnitude more, as seen in experimental trials).

• The strong Pareto condition makes MGDA halt very often at clearly suboptimal points, as it cannot find an ω vector that satisfies the Pareto efficiency criterion and therefore stalls the optimization process. In other words, the algorithm stops in a local minimum of the multi-objective problem. As a consequence, the whole MV-tSNE1 algorithm has to be executed several times in order to hopefully find better solutions.

All these factors combined make MV-tSNE1 extremely expensive in computational terms, which only allows it to be run on "toy" datasets to test its behaviour. Therefore it has been excluded from the main experiments presented in this work.

3.3.2 MV-tSNE as an expert opinion pooling problem

As seen in Sections 3.2.1 and 3.2.2, SNE and tSNE model the input, high-dimensional space as a matrix of conditional distance probabilities according to a Gaussian distribution. The expression for this matrix P is given in Equation 3.1.

In a multiview scenario there are v input high-dimensional spaces instead of only one. A viable strategy for extending tSNE to multiview datasets is as follows. First, a different matrix of conditional distance probabilities P^k is computed for each input space X^k, k = 1, ..., v, using Equation 3.1. These v probability matrices can be seen as the probability opinions of v different experts on the distribution of distances of the input samples. Therefore, the second step of the algorithm is to compute a pooled opinion probability matrix P̂ using the log-linear method described in Section 3.2.4. From that point, the problem is reduced to finding a low-dimensional space that minimizes the KL divergence between P̂ and the conditional probability matrix Q of the low-dimensional space, using the same approach as the tSNE algorithm. Therefore the cost function of the newly defined optimization problem is:

C = \sum_i KL(\hat{P}_i \| Q_i) = \sum_i \sum_j \hat{p}_{j|i} \log \frac{\hat{p}_{j|i}}{q_{j|i}}    (3.18)

And its gradient with respect to the low-dimensional projection space Y is:

\frac{\delta C}{\delta y_i} = 4 \sum_j (\hat{p}_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}    (3.19)

This multiview dimensionality reduction algorithm will be referred to as MV-tSNE2, and it is specified in Algorithm 4.

Analysis of the algorithm. An important feature of MV-tSNE2 is that the pooled opinion matrix only needs to be computed once; from that step on, its complexity equals that of the single-view tSNE algorithm, as all the information from the different input spaces is condensed into a single probability matrix P̂. This makes MV-tSNE2 computationally efficient. Also, MV-tSNE2 is compatible with the use of momentum in the gradient descent optimization stage, leading to better performance relative to the number of iterations. For these reasons, the method used in the experiments is MV-tSNE2, and for simplicity it will be referred to simply as MV-tSNE.
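The core of MV-tSNE2 is the pooling step. The sketch below (numpy; my own illustration of the pipeline rather than the released implementation, using equal expert weights instead of the weighting scheme of [15], and reusing the sigma_for_point and tsne_loop sketches introduced earlier) builds the per-view probability matrices, pools them log-linearly and hands the result to a standard t-SNE loop.

import numpy as np

def view_probabilities(X, perplexity):
    """Symmetrized probability matrix P^k of one view (Equation 3.1 plus symmetrization).

    Uses the per-point sigma search sketched earlier (sigma_for_point).
    """
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    P = np.zeros((n, n))
    for i in range(n):
        others = np.arange(n) != i
        _, p = sigma_for_point(sq[i, others], perplexity)
        P[i, others] = p
    return (P + P.T) / (2.0 * n)

def pooled_probabilities(views, perplexity):
    """Log-linear pooling of the per-view matrices into a single matrix P-hat."""
    Ps = [view_probabilities(X, perplexity) for X in views]
    weights = np.full(len(Ps), 1.0 / len(Ps))     # equal weights for illustration
    log_pool = sum(w * np.log(P + 1e-12) for w, P in zip(weights, Ps))
    P_hat = np.exp(log_pool)
    np.fill_diagonal(P_hat, 0.0)
    return P_hat / P_hat.sum()                    # normalize so it sums to one

# P_hat can then be fed to the standard t-SNE loop, e.g. tsne_loop(P_hat, n_samples).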

3.4 Results

Although MV-tSNE is a multiview dimensionality reduction method, its application to multiview clustering tasks is straightforward. It simply requires using the single, low-dimensional projection it generates as input to a standard, single-view clustering algorithm such as K-means. The multiview dimensionality reduction (1) reduces all input views into a single output view, and (2) summarizes the most relevant information of all input views into a low-dimensional projection. As a consequence, applying a clustering algorithm to such a projection is equivalent to finding a clustering assignment given the common structure of the multiple input views.
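As a minimal illustration of this clustering use (scikit-learn; Y here stands for the MV-tSNE projection computed as sketched above, and the number of clusters is an assumed example value):

from sklearn.cluster import KMeans

# Y: (n_samples, k) projection produced by MV-tSNE.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Y)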

Algorithm 4. Multiview t-Distributed Stochastic Neighbour Embedding 2
Input: v data views of the same n entities X^k = {x_1^k, x_2^k, ..., x_n^k}, where k = 1, 2, ..., v; cost function parameters: perplexity Perp; optimization parameters: number of iterations T, learning rate η, momentum α(t)
Output: low-dimensional data representation Y^(T) = {y_1, y_2, ..., y_n}
function MV-tSNE2(X^1, X^2, ..., X^v, Perp, T, η, α(t))
    for k ← 1 to v do
        compute pairwise affinities p^k_{j|i} of input matrix X^k with perplexity Perp using Equation 3.1
        P^k ← (p^k_{j|i} + p^k_{i|j}) / 2n
    end for
    compute the weight ω_k of each affinity matrix P^k using the method described in [15]
    P̂ ← ∏_{k=1}^{v} (P^k)^{ω_k}
    P̂ ← P̂ / sum(P̂)
    sample initial solution Y^(0) = {y_1, y_2, ..., y_n} from N(0, 10^{-4})
    for t ← 1 to T do

        compute low-dimensional affinities q_{ij} (using Equation 3.7)
        compute gradient δC/δY (using Equation 3.19)
        Y^(t) ← Y^(t−1) + η δC/δY + α(t)(Y^(t−1) − Y^(t−2))
    end for
end function

3.4.1 MV-tSNE with respect to SC baseline

The first block of experiments compares the proposed MV-tSNE method with the t-SNE baseline method, either applied to each input view independently, or applied to all the input views stacked into a single data matrix. The goal of these experiments is to assess the advantages of the proposed multiview method with respect to single-view approaches.

In order to synthesize the numerous results (different metrics, datasets and embedding dimensionalities), the results of each evaluation metric are summarized in a set of graph plots, one for each dataset in the experiments. Therefore, six graphs per evaluation metric are produced (five on the unsupervised metrics, as the Cora dataset has a graph space view that is not compatible with these metrics).

However, the detailed numerical results are given in Appendix A, with one table for each combination of evaluation metric and dataset, for a total of 33 tables to evaluate the MV-tSNE method.

3.4.1.1 Dimensionality reduction evaluation

There are three evaluation metrics for the dimensionality reduction task: SVM classification, cophenetic correlation (averaged over all input views) and the area under the RNX curve (averaged over all input views). Figure 3.1 shows the two-dimensional projections of two example datasets using MV-tSNE.

SVM classification (Figure 3.2). The results on the animal with attributes (AWA) dataset are low and irregular in general, given the specific difficulty of the task (there are 50 classes in this dataset). There is an initial peak and a subsequent descent around K=10, probably induced by added information that is not related to the classification task and misguides the algorithm. The best single view performs slightly better than the others, although all configurations largely overlap.

On the BBC segmented news dataset, both the stacked and the MV-tSNE configurations give better results than the pure single-view setups. This dataset has two views that are actually two text segments of the same document, and as a consequence both views are broadly equivalent. This is reflected in the SVM results for BBC, where the MV-tSNE and stacked results overlap for most values of K, although stacked t-SNE shows higher peaks for some dimensionalities.

On the handwritten digits dataset, there is a clear influence of the best single view, which overlaps almost exactly with both the stacked view and MV-tSNE. The stability of the results is also remarkable. The Berkeley protein dataset shows very similar results. On the Cora dataset there is also an almost exact overlap of the best single view with MV-tSNE, although the results are not as stable. Finally, on the Reuters dataset the highest score is achieved by the best single view, also at low dimensionalities.

Cophenetic correlation (Figure 3.3). The cophenetic correlation measures the similarity of the distances in the new space with respect to the original input spaces. The average cophenetic correlation over all input views is reported. On the AWA and protein datasets, MV-tSNE matches the best single view, while on the BBC, digits and Reuters datasets MV-tSNE tends towards the results of the stacked views. In general MV-tSNE does not produce the best results except on the BBC dataset.
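As a rough illustration of how such a distance-preservation score can be computed (my own sketch based on the description above, not necessarily the exact definition used in Section 2.3): correlate the pairwise distances of the embedding with those of each input view and average over views.

import numpy as np
from scipy.spatial.distance import pdist

def distance_correlation_score(views, Y):
    """Average Pearson correlation between pairwise distances in the embedding Y
    and pairwise distances in each input view (a cophenetic-style score sketch)."""
    d_y = pdist(Y)                      # condensed vector of embedding distances
    scores = []
    for X in views:
        d_x = pdist(X)
        scores.append(np.corrcoef(d_x, d_y)[0, 1])
    return float(np.mean(scores))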

Figure 3.1: MV-tSNE projection of two example datasets. [Two scatter-plot panels: BBC dataset and Digits dataset; plot contents not reproducible in text.]

Figure 3.2: MV-tSNE dimensionality reduction evaluation with SVM classification. [Six panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; x axis: K (dimensions); curves: MV-tSNE, Stacked, Best single, Average single, Worst single.]

Figure 3.3: MV-tSNE dimensionality reduction evaluation with cophenetic correlation (average on all input views). [Five panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; x axis: K (dimensions); curves: MV-tSNE, Stacked, Best single, Average single, Worst single.]

Area under the RNX curve (Figure 3.4). The area under the RNX curve (AUC-RNX) also measures the similarity of the sample neighbourhoods in the projected space with respect to the original input views. Here, the average AUC-RNX over all input views is given. The AUC-RNX results are mostly comparable to the cophenetic correlation results, although here the highest value on the BBC dataset is achieved by the stacked views.

3.4.1.2 Clustering evaluation

There are three evaluation metrics for the clustering task: the clustering purity, the clustering normalized mutual information (NMI), and the Davies-Bouldin index (DBI, averaged over all input views).
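For reference, these metrics can be computed along the following lines (numpy and scikit-learn; this is a sketch of common formulations, with the exact definitions used in the thesis being those of Section 2.3).

import numpy as np
from sklearn.metrics import normalized_mutual_info_score, davies_bouldin_score

def clustering_purity(labels_true, labels_pred):
    """Fraction of samples assigned to the majority true class of their cluster.

    Class labels are assumed to be encoded as non-negative integers.
    """
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()   # size of the dominant class in cluster c
    return total / len(labels_true)

# purity = clustering_purity(y_true, y_pred)
# nmi = normalized_mutual_info_score(y_true, y_pred)
# dbi = np.mean([davies_bouldin_score(X_view, y_pred) for X_view in views])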

Clustering purity (Figure 3.5). On the AWA and BBC datasets MV-tSNE initially shows the best results, although in general all results drop with the dimensionality. On the digits dataset it is the stacked views configuration that achieves the maximum values at low values of K.

On the protein dataset, the best single view is clearly above the stacked views and MV-tSNE. The difference between single views is noticeable, although the vertical scale is small. On the Reuters dataset, the best single view produces the highest result. Finally, on the Cora dataset MV-tSNE overlaps with the best single view, sharing the best results.

Clustering normalized mutual information (Figure 3.6). The clustering NMI results are quite similar to the clustering purity results, with MV-tSNE showing the best results on the animal and BBC datasets. Again, the stacked view performs better on the digits dataset, and also on the protein dataset in this case. On the Reuters dataset the best single view achieves the highest NMI. On the Cora dataset, MV-tSNE matches the best single view in the top positions.

Davies-Bouldin index (Figure 3.7). The Davies-Bouldin index (DBI) measures the internal properties of the clusters with respect to the other clusters, considering the distances in the input spaces. Here, the average Davies-Bouldin index over all input views is given. For the DBI, lower is better.

On the animal, digits, protein and Reuters datasets, the best single view is the best configuration. On the BBC dataset both MV-tSNE and the stacked views produce the best results, with a remarkable difference from the single-view configurations.

Figure 3.4: MV-tSNE dimensionality reduction evaluation with area under the RNX curve (average on all input views). [Five panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; x axis: K (dimensions); curves: MV-tSNE, Stacked, Best single, Average single, Worst single.]

Figure 3.5: MV-tSNE clustering evaluation with clustering purity. [Six panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; x axis: K (dimensions); curves: MV-tSNE, Stacked, Best single, Average single, Worst single.]

Figure 3.6: MV-tSNE clustering evaluation with clustering normalized mutual information. [Six panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; x axis: K (dimensions); curves: MV-tSNE, Stacked, Best single, Average single, Worst single.]

Figure 3.7: MV-tSNE clustering evaluation with the Davies-Bouldin index (average on all input views). Less is better. [Five panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; x axis: K (dimensions); curves: MV-tSNE, Stacked, Best single, Average single, Worst single.]

3.4.2 MV-tSNE with respect to the state of the art

Table 3.1 shows the most relevant clustering purity results in the state of the art and compares them to the results of MV-tSNE on the same datasets and configuration. The clustering purity measures the faithfulness of the produced clustering assignments to the reference class assignments in the dataset. In other words, higher values mean that the clustering assignment is more similar to the class assignment, with a value of 1 meaning that both are identical. There is no single method that clearly performs better than the others according to the clustering purity on these datasets. Although in most cases MV-tSNE is relatively close to the highest scores, it is never ranked in the first position.

Method        Digits   BBC      Reuters   AWA
CoregSC       0.822    0.887    0.552     0.580
MMSC          0.758    NA       0.390     0.585
MVC-SS        NA       NA       0.531     0.629
MV-KMeans     0.825    NA       NA        0.114
CoKmLDA       0.819    0.914    NA        NA
MFSC-MO       0.800    NA       NA        NA
MVC-PSS       0.862    NA       NA        0.325
MVSC-BG       0.844    NA       0.577     NA
MV-tSNE       0.789    0.902    0.295     0.607

"NA" means there are no available results of the method on the dataset.

Table 3.1: Clustering purity wrt. the state of the art.

Table 3.2 shows the normalized mutual information (NMI) between the clustering assignments produced by the multiview clustering methods in the state of the art and the reference class labels in the dataset. A higher value implies a more similar assignment, with NMI=1 meaning a perfect match. However, NMI measures the coincidence of the cluster assignments differently from purity, as it uses the mutual information between both assignments. As in the case of the clustering purity, there is no method clearly superior to the others. Again, MV-tSNE scores near the best results on some datasets (digits and BBC), although its score is lower on the others, and it does not produce the best NMI on any dataset.

Method        Digits   BBC      Reuters   AWA
CoregSC       0.836    0.769    0.326     0.695
MMSC          0.792    NA       0.134     0.698
MVC-SS        NA       NA       NA        0.751
MV-KMeans     0.807    NA       NA        0.117
CoKmLDA       0.818    0.796    NA        NA
MFSC-MO       0.785    NA       NA        NA
MVC-PSS       0.833    NA       NA        0.213
MVSC-BG       0.832    NA       0.357     NA
MV-tSNE       0.797    0.756    0.024     0.621

"NA" means there are no available results of the method on the dataset.

Table 3.2: Clustering NMI wrt. the state of the art.

3.5 Discussion

When compared to the baseline methods on both the dimensionality reduction and the clustering tasks, MV-tSNE shows an average performance. In many cases its results overlap with the results of the stacked views configurations. In some cases, it overlaps with the best single view. Otherwise, its performance tends to be around the average single view, and it rarely dominates the results in any of the experiments.

Compared with other multiview clustering methods in the state of the art, the method described in this chapter (MV-tSNE) shows an average performance, although it never ranks in the first positions. According to these results, MV-tSNE is not an improvement over other methods in the state of the art.

Chapter 4

Multiview multidimensional scaling

4.1 Motivation

The multidimensional scaling algorithm (MDS) [23] is one of the standard tools for dimensionality reduction and data visualization, where a high-dimensional data matrix is transformed into a low-dimensional projection matrix, usually in two or three dimensions in order to produce a graphical display of the data.

Reducing the dimensionality of multiview datasets poses new challenges, as the task becomes twofold: (1) combining the information in the different data views in an appropriate way, and (2) reducing the dimensionality of the common information found. In turn, efficient multiview dimensionality reduction methods may prove useful tools that let their users synthesize extremely complex data into a single representation while keeping the essential properties of the data; for instance, they make it possible to visualize in a 2D plot the structure of data as complex as some of the multiview datasets described in Chapter 1.

For this reason, the development of a multiview equivalent of the MDS algorithm has been deemed a potentially useful addition to the arsenal of dimensionality reduction methods. Multiview MDS (MV-MDS) produces a single, low-dimensional representation of the multiview, high-dimensional input data while keeping the essential information. This allows the low-dimensional matrix to be used as input to a standard clustering method, thus creating a potentially useful multiview clustering method.

This chapter explains some prior work related to the method, then describes MV-MDS itself, and finally presents and analyzes the results of several experiments on different multiview datasets.


4.2 Related work

4.2.1 Multidimensional Scaling

Multidimensional Scaling (MDS), proposed in [23], is one of the most widely used methods both for data dimensionality reduction and for data visualization. Although there are several revisions of MDS, such as Metric MDS [60] and Non-metric MDS [61], the original version of MDS, often referred to as Classical MDS, is presented here.

Given a high-dimensional input space X ⊆ R^p × R^n, the main idea behind MDS is to obtain a low-dimensional representation of the p points in X, Y ⊆ R^p × R^m, with m < n, so that each point x_i ∈ X has a representation in a new point y_i ∈ Y and the distances between points are preserved as much as possible. The correspondence between the distances between points in X and in Y is named the stress, and it is defined as follows:

Stress(x_1, x_2, \ldots, x_p) = \sqrt{ \frac{\sum_{ij} (d_{i,j} - \hat{d}_{i,j})^2}{\sum_{ij} \hat{d}_{i,j}^2} }    (4.1)

∀i, j = 1, 2, ..., p, where d_{ij} is the Euclidean distance between points x_i and x_j, i.e. the original or input points, and d̂_{ij} is the Euclidean distance between points y_i and y_j, i.e. the projections of points x_i and x_j in the low-dimensional space Y. The dimensionality reduction problem can be viewed as an optimization problem where the stress function stated above is the cost of a given solution. The problem with this specific cost function can be solved by finding the eigenvectors of matrix B = XX^T, as described in Algorithm 5.
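For reference, a compact numpy sketch of the classical MDS procedure of Algorithm 5 follows (double centering of the squared distance matrix followed by an eigendecomposition); variable names are my own, and the final scaling by the square root of the eigenvalues follows the usual classical MDS convention, while Algorithm 5 as stated returns the eigenvectors directly.

import numpy as np

def classical_mds(X, m):
    """Classical MDS: embed the rows of X into m dimensions (a sketch of Algorithm 5)."""
    p = X.shape[0]
    # Squared Euclidean distance matrix D^(2).
    sq_norms = np.sum(X ** 2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Double centering: B = -1/2 * J D^(2) J with J = I - (1/p) * ones.
    J = np.eye(p) - np.ones((p, p)) / p
    B = -0.5 * J @ D2 @ J
    # m largest eigenvalues/eigenvectors of the symmetric matrix B.
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:m]
    L, E = eigvals[idx], eigvecs[:, idx]
    # Coordinates scaled by the square root of the eigenvalues (usual convention).
    return E * np.sqrt(np.maximum(L, 0.0))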

4.2.2 Stepwise common principal components

Common principal components (CPC) analysis, first proposed by [37], is a statistical method for simultaneously diagonalizing a set of positive-definite symmetric matrices. This method, also known as joint diagonalization, attempts to diagonalize the input matrices under the hypothesis of common components H, which states that there exists an orthogonal matrix W such that the C input matrices have the same diagonal form, as formulated in Equation 4.2.

H : L_c' = W^T L_c W, \quad c = 1, 2, \ldots, C    (4.2)

where L_c is the positive-definite symmetric matrix of input matrix c, and L_c' is its diagonalized form, obtained from the linear transformation defined by matrix W. Note that the resulting eigenvectors (columns of W) are common to all the input matrices, while the eigenvalues are specific to each input matrix. In other words, the input matrices are projected onto the same subspace defined by the eigenvectors, with the relative weight of each subspace axis for each input matrix given by the associated eigenvalue.

Algorithm 5. Classical Multidimensional Scaling
Input: dataset X = {x_1, x_2, ..., x_p}, with x_i ∈ R^n; desired number of dimensions in the output space: m

Output: low-dimensional data representation Y = {y_1, y_2, ..., y_p}, with y_i ∈ R^m
function MDS(X, m)
    Compute the squared distance matrix: D^(2) ← [d²(x_i, x_j)], with i, j = 1, 2, ..., p, where d is the Euclidean distance function
    Compute the centering matrix: J ← I − (1/p) O, where I is the identity matrix and O is a p × p matrix of ones
    Apply double centering to the squared distance matrix: B ← −(1/2) J D^(2) J
    Compute the m largest eigenvalues of B and their associated eigenvectors e_1, e_2, ..., e_m

    Y ← {e_1, e_2, ..., e_m}^T, with Y ∈ R^p × R^m
end function

The stepwise common principal components algorithm (referred to here as S-CPC), described in [100], finds an approximate solution to this problem in an incremental manner. More specifically, it first computes the common components (common eigenvectors) with the highest eigenvalues. Therefore it can stop after computing a given number of common principal components. This has a dramatic impact on the performance of the method presented in this work, as explained in Section 4.3. S-CPC can be applied to any number C of input matrices.

4.3 MV-MDS description

The method presented here, called Multiview Multidimensional Scaling (MV-MDS), is an extension of classical MDS designed to work with multiview datasets. Therefore, instead of a single input space X, the p input examples are sampled from v different input spaces, so the net input to the method is a set of data matrices X^(1), X^(2), ..., X^(v). However, MV-MDS is expected to produce as its output a single projection space Y. The goal of MV-MDS is to find such a space Y so that the distances between points in it are as similar as possible to the distances between points in all v input spaces. Obviously this may not always be possible, for the simple reason that the distances in one

input space X^(i) may be in contradiction with the distances in another input space X^(j). However, MV-MDS is designed to try to produce the projection space that most faithfully preserves the distance relations common to all the input spaces. Note that the input spaces may have different dimensionalities.

Given the fact that there exist v input spaces, the stress or cost function for MV-MDS, adapted from the stress definition in Equation 4.1, becomes the stress of the low-dimensional space Y with respect to all the input, high-dimensional spaces X^(1), X^(2), ..., X^(v), and it is characterized by the following equation:

\mathrm{Stress}_{MV}\left(x_1^{(k)}, x_2^{(k)}, \ldots, x_p^{(k)}\right) = \frac{1}{v} \sum_{k=1}^{v} \sqrt{\frac{\sum_{ij} \left(\hat{d}_{i,j} - d_{i,j}^{(k)}\right)^2}{\sum_{ij} \hat{d}_{i,j}^2}} \qquad (4.3)

∀i, j = 1, 2, ..., p, ∀k = 1, 2, ..., v, where x_i^(k) is the vector with the coordinates of point i in input space k, d_ij^(k) is the Euclidean distance between points x_i^(k) and x_j^(k), i.e. the distance in input space k, and d̂_ij is the Euclidean distance between points y_i and y_j, i.e. the projection of points x_i and x_j in the low-dimensional space Y.
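For reference, the multiview stress of Equation 4.3 can be evaluated directly; the following R snippet is a plain transcription of the formula (the function name is illustrative and not part of the released package):

# Direct transcription of the multiview stress (Equation 4.3): views is a
# list of data matrices (one per input space) and Y is the embedding.
mv_stress <- function(views, Y) {
  Dhat <- as.matrix(dist(Y))                 # distances in the projection space
  per_view <- sapply(views, function(X) {
    Dk <- as.matrix(dist(X))                 # distances in input space k
    sqrt(sum((Dhat - Dk)^2) / sum(Dhat^2))
  })
  mean(per_view)                             # average over the v views
}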

Given that there are v input matrices X_1, X_2, ..., X_v, the dimensionality reduction problem characterized by Equation 4.3 now requires finding the eigenvectors of the matrices B^(k) = X^(k) X^(k)T, k = 1, 2, ..., v. More specifically, if the desired projection space Y should be m-dimensional, then the m largest eigenvalues and their corresponding eigenvectors need to be determined. Solving this problem directly would produce a separate set of m eigenvectors for each input matrix, invalidating the whole procedure. However, if the stepwise common principal components method (S-CPC, see Section 4.2.2) is applied on B^(k) = X^(k) X^(k)T, k = 1, 2, ..., v, then the m largest common eigenvalues and their eigenvectors can be computed.

For each input matrix, S-CPC computes an associated eigenvalue. However, the eigenvectors are unique, common to all input matrices. S-CPC guarantees that the eigenvectors produced are ordered by the highest sum of the corresponding eigenvalues (i.e. the eigenvalues associated to all the input matrices). This satisfies the condition of MDS, where the highest eigenvalues must be computed and their eigenvectors returned as the resulting projection space. The resulting MV-MDS algorithm is presented in Algorithm 6.

Algorithm 6. Multiview Multidimensional Scaling

Input: v datasets X^(1), X^(2), ..., X^(v), with X^(k) ⊆ R^p × R^(n_k); desired number of dimensions in the output space: m
Output: low-dimensional data representation Y ⊆ R^p × R^m.

function MV-MDS(X^(1), X^(2), ..., X^(v), m)
    Compute the centering matrix: J ← I − (1/p) O, where I is the identity matrix and O is a p × p matrix of ones
    for k ← 1 to v do
        Compute the squared distance matrix: D^(k)(2) ← [d^2(x_i^(k), x_j^(k))], with i, j = 1, 2, ..., p, where d is the Euclidean distance function
        Apply double centering to the squared distance matrix: B^(k) ← −(1/2) J D^(k)(2) J
    end for
    Compute the m largest common eigenvalues of {B^(1), B^(2), ..., B^(v)} and their associated eigenvectors e_1, e_2, ..., e_m using S-CPC
    Y ← {e_1, e_2, ..., e_m}^T, with Y ∈ R^(p×m)
end function
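To make Algorithm 6 concrete, here is a minimal R sketch. Since implementing S-CPC is beyond the scope of this snippet, the common eigenvectors are approximated by eigendecomposing the average of the double-centered matrices; the actual method uses S-CPC at that step, and all names are illustrative:

# Minimal sketch of Algorithm 6 (MV-MDS). The S-CPC step is replaced here by
# a simple surrogate: eigendecomposition of the averaged double-centered
# matrices. The released method uses S-CPC instead.
mv_mds_sketch <- function(views, m) {
  p <- nrow(views[[1]])
  J <- diag(p) - matrix(1 / p, p, p)             # centering matrix
  B <- lapply(views, function(X) {
    D2 <- as.matrix(dist(X))^2                   # squared distances per view
    -0.5 * J %*% D2 %*% J                        # double centering
  })
  B_avg <- Reduce(`+`, B) / length(B)            # surrogate for the S-CPC step
  eigen(B_avg, symmetric = TRUE)$vectors[, 1:m]  # common low-dimensional space
}

# Example usage on two synthetic views of the same 100 samples
set.seed(1)
views <- list(matrix(rnorm(100 * 5), 100, 5), matrix(rnorm(100 * 8), 100, 8))
Y <- mv_mds_sketch(views, 2)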

4.4 Results

4.4.1 MV-MDS with respect to the MDS baseline

The first block of experiments compares the proposed MV-MDS method with the single-view MDS baseline method, either applied to each input view independently, or applied to all the input views stacked into a single data matrix. The goal of these experiments is to assess the advantages of the proposed multiview method with respect to the original, single view approach.

In order to synthesize the numerous results (different metrics, datasets and embedding dimensionalities), the results of each evaluation metric are summarized in a set of plots, one for each dataset in the experiments. Therefore, six graphs per evaluation metric are produced (five for the unsupervised metrics, as the Cora dataset has a graph space view that is not compatible with these metrics). The dimensionality of the results (x axis on the plots) will be referred to as K for simplicity.

Due to their considerable extent, the detailed numerical results are given in Appendix B, with one table for each combination of evaluation metric and dataset, in a total of 33 tables to evaluate the MV-MDS method.

4.4.1.1 Dimensionality reduction evaluation

There are three evaluation metrics for the dimensionality reduction task: SVM classification, cophenetic correlation (average on all the input views) and area under the curve of the RNX value (average on all input views). Figure 4.1 shows the two-dimensional projections of two example datasets using MV-MDS.

SVM classification (Figure 4.2). The results on the animal with attributes (AWA) dataset are low in general, given the specific difficulty of the task (there are 50 classes in this dataset). There is an initial peak and a posterior descent around K=10, probably induced by added information that is not related to the classification task and misguides the algorithms. There are considerable differences among single views. MV-MDS produces the absolute maximum at K=40, and is consistently better than the stacked views configuration.

The BBC segmented news dataset has two views that actually are two text segments of the same document, and as a consequence both views are largely equivalent. This is reflected in the SVM results for BBC, where MV-MDS and stacked-MDS results overlap, as there is no practical difference between both approaches in this dataset. MV-MDS (and stacked-MDS) perform consistently better than single-view MDS, showing the usefulness of adding more text content to the method.

On the handwritten digits dataset, MV-MDS performs slightly better than stacked-MDS on most dimensionalities. In turn, they perform slightly better than the best single view, and clearly better than the average of single views.

The performance of the different configurations with the Berkeley protein dataset shows high variation along the values of K, with some single view configurations dropping while others improve with K. MV-MDS shows the highest SVM accuracy, and is consistently higher than the stacked view for all values of K.

On the Reuters dataset there is a single view that is clearly better than the other configurations. This means that one of the input languages (Reuters has news documents in five languages) is better suited for document classification according to the topic given in the dataset. MV-MDS has a better than average performance.

Finally, on the Cora dataset MV-MDS is clearly superior to the baseline options, achieving the best results with a relatively low dimensionality (around 25). Probably the information added with further dimensions is noise from the classification point of view. It is interesting to observe how, in this case, stacked-MDS is attracted by the worst single view, while MV-MDS seems more robust and generates better results.

[Figure 4.1: MV-MDS projection of two example datasets (BBC dataset and Digits dataset); scatter plots of the two-dimensional projections.]

[Figure 4.2: MV-MDS dimensionality reduction evaluation with SVM classification. Panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; x axis: K (dimensions); y axis: SVM classification accuracy; curves: MV-MDS, Stacked, Best single, Average single, Worst single.]

Cophenetic correlation (Figure 4.3). The cophenetic correlation measures the similarity of the distances in the new space with respect to the original input spaces. The average cophenetic correlation over all input views is provided. On the AWA dataset, MV-MDS shows clearly higher values. Note the considerable difference between the worst and best single views.

On the BBC dataset there is an initial peak and a posterior descent. Although MV-MDS performs well with low dimensions, its cophenetic correlation drops with more dimensions. Apparently it produces a different distance structure than those present in the original data views. A similar behaviour appears with the digits dataset.

On the digits, protein and Reuters datasets, MV-MDS shows a lower cophenetic correlation than the other configurations, generally starting with average results and then dropping to the lowest values.

Area under the RNX curve (Figure 4.4). The area under the RNX curve (AUC-RNX) also measures the similarity of the sample neighbourhoods in the projected space with respect to the original input view. Here, the average AUC-RNX over all input views is given. On the animal and BBC datasets MV-MDS shows AUC-RNX values above the average. However, on the digits, protein and Reuters datasets MV-MDS starts at average values but then drops to the lowest values.

4.4.1.2 Clustering evaluation

There are three evaluation metrics for the clustering task: the clustering purity, the clustering normalized mutual information, and the Davies-Bouldin index (average on all input views).

Clustering purity (Figure 4.5). On the AWA dataset the results are low in general, due to the high number of classes (50). However, MV-MDS performs better than the baseline MDS configurations. Both on AWA and digits, MV-MDS performance is intertwined with the best single and stacked configurations.

On the BBC, protein and Cora datasets MV-MDS shows an irregular performance, although in all these cases it achieves the best absolute results. On the Reuters dataset, both MV-MDS and stacked perform worse than single views in general.

Clustering normalized mutual information (Figure 4.6). On the digits dataset, the clustering normalized mutual information (NMI) of MV-MDS is clearly superior to the other methods. On the AWA, BBC and protein datasets it also achieves the best overall performance at specific values of K. On the Reuters and Cora datasets, MV-MDS performs around or below average.

[Figure 4.3: MV-MDS dimensionality reduction evaluation with cophenetic correlation (average on all input views). Panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; x axis: K (dimensions); y axis: cophenetic correlation; curves: MV-MDS, Stacked, Best single, Average single, Worst single.]

[Figure 4.4: MV-MDS dimensionality reduction evaluation with area under the RNX curve (average on all input views). Panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; x axis: K (dimensions); y axis: area under the RNX curve.]

[Figure 4.5: MV-MDS clustering evaluation with clustering purity. Panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; x axis: K (dimensions); y axis: clustering purity.]

[Figure 4.6: MV-MDS clustering evaluation with clustering normalized mutual information. Panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; x axis: K (dimensions); y axis: normalized mutual information.]

Davies-Bouldin index (Figure 4.7). The Davies-Bouldin index (DBI) measures the internal properties of the clusters with respect to other clusters, considering the distances in the input spaces. Here, the average Davies-Bouldin index over all input views is given. For DBI, lower values are better.

On the AWA, BBC and Reuters datasets, MV-MDS produces the lowest DBI value, although its behaviour is quite irregular. On the digits dataset, best single, stacked and MV-MDS show an equivalent performance, alternating with K. Finally, on the protein dataset, stacked and MV-MDS have higher (worse) values than the average single view configurations.

4.4.2 MV-MDS with respect to the state of the art

Table 4.1 shows the most relevant clustering purity results in the state of the art and compares them to the results of MV-MDS on the same datasets and configuration. The clustering purity measures the faithfulness of the produced clustering assignments to the reference class assignments in the dataset. In other words, higher values mean that the clustering assignment is more similar to the class assignment, with a value of 1 meaning that both are identical.

The clustering of the handwritten digits dataset shows high clustering purity (> 0.8) for most methods, with MV-MDS producing a higher value than the other methods. The BBC segmented news dataset is only used in two papers in the state of the art, with quite high values in all cases. Here, MV-MDS also produces a better clustering than the other methods. On the Reuters multilingual dataset, MVSC-BG shows the highest clustering purity value. Finally, the animal with attributes dataset shows a high variability in the results in the state of the art, with MV-KMeans probably using a different preprocessing (it is not specified in the paper). Again, MV-MDS produces the best clustering assignment according to the purity result.

Table 4.2 shows the normalized mutual information (NMI) between the clustering assignments produced by the multiview clustering methods in the state of the art and the reference class labels in the dataset. A higher value implies a more similar assignment, with NMI=1 meaning a perfect match. However, NMI measures the coincidence of the cluster assignments differently from purity, as it uses the mutual information between both assignments.

The clustering of the handwritten digits dataset has, in general, high NMI values. MVC-PSS and MVSC-BG show the highest NMI result on this dataset, with MV-MDS slightly below them. All NMI results on the BBC segmented news dataset are relatively similar, with CoKmLDA slightly higher. On the Reuters multilingual dataset, MVSC-BG shows a higher NMI value than the other methods. Finally, on the animal with attributes dataset the results are mostly around 0.7 and above, with MV-MDS having NMI=0.843, clearly above the other methods.

[Figure 4.7: MV-MDS clustering evaluation with the Davies-Bouldin index (average on all input views). Lower is better. Panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; x axis: K (dimensions); y axis: Davies-Bouldin index.]

Method      Digits   BBC     Reuters   AWA
CoregSC     0.822    0.887   0.552     0.580
MMSC        0.758    NA      0.390     0.585
MVC-SS      NA       NA      0.531     0.629
MV-KMeans   0.825    NA      NA        0.114
CoKmLDA     0.819    0.914   NA        NA
MFSC-MO     0.800    NA      NA        NA
MVC-PSS     0.862    NA      NA        0.325
MVSC-BG     0.844    NA      0.577     NA
MV-MDS      0.887    0.921   0.534     0.775

"NA" means there are no available results of the method on the dataset.
Table 4.1: Clustering purity with respect to the state of the art.

Method      Digits   BBC     Reuters   AWA
CoregSC     0.836    0.769   0.326     0.695
MMSC        0.792    NA      0.134     0.698
MVC-SS      NA       NA      NA        0.751
MV-KMeans   0.807    NA      NA        0.117
CoKmLDA     0.818    0.796   NA        NA
MFSC-MO     0.785    NA      NA        NA
MVC-PSS     0.833    NA      NA        0.213
MVSC-BG     0.832    NA      0.357     NA
MV-MDS      0.821    0.788   0.280     0.843

"NA" means there are no available results of the method on the dataset.
Table 4.2: Clustering NMI with respect to the state of the art.

4.5 Discussion

First, MV-MDS is compared with the baseline methods (single view and stacked views MDS). In the dimensionality reduction task, MV-MDS shows an overall better performance than the baseline methods, reflected in the highest absolute SVM score on five datasets as well as a superior performance for the majority of dimensionalities and datasets. In general it performs better than the average single-view and the stacked-view MDS configurations, which are more realistic than the best single-view, as in a real task it probably will not be known which is the single best view in a multiview dataset.

The unsupervised metrics for dimensionality reduction evaluation (cophenetic correlation and AUC-RNX) give an average or lower than average score to MV-MDS. These metrics measure the similarity of the low-dimensional space generated by the method with the original input space; in the case of multiview datasets, the average measure over all the input views has been computed. It is expected that the projections using the original views (one by one or stacked) resemble the original data very closely. The fact that MV-MDS scores in middle or lower positions in these metrics suggests that the projection it generates has an internal structure that differs from the structure of the original input views. This is reasonable, as MV-MDS has to include information from all the views and as a consequence it has to produce a distinct data structure.

Regarding the clustering performance of MV-MDS, its performance is at the top on some of the datasets (3 or 4, depending on the metric considered). However, its average performance is not clearly superior to the average single-view or stacked views configurations. This is to be expected, as neither MDS nor MV-MDS are specifically designed as clustering methods, but as dimensionality reduction methods.

Compared with other multiview clustering methods in the state of the art, the method proposed in this thesis (MV-MDS) shows better overall clustering results when measured with clustering purity on three datasets, while its results with NMI are only slightly below the best ones on two datasets and clearly above the other methods on the AWA dataset. This makes MV-MDS a reasonable alternative for multiview clustering, comparable to or slightly better than other methods in the state of the art. This result is especially remarkable given the fact that neither MDS nor MV-MDS were initially conceived as clustering methods, but as dimensionality reduction methods instead.

A possible explanation for these results is that the dimensionality reduction produced by MV-MDS condenses the most relevant information of the multiview datasets, while discarding noise or contradictory data. As a consequence, the posterior clustering performed on the low-dimensional projection generated by MV-MDS is of high quality. MV-MDS computes the common eigenvectors of the normalized distance matrices of the input data views. The distance matrices of the different views of a multiview dataset are expected to contain a considerable amount of shared information, as the distances between samples in different views are likely to be coherent in general. The probable cause of the good results of MV-MDS as a multiview clustering method may be the fact that, when computing the common eigenvectors, the redundant information from the distance matrices is (1) retrieved first, as it has a stronger signal, and (2) only retrieved once, leaving room for other useful information in the common subspace matrix generated. This behaviour may make MV-MDS better at synthesizing the information contained in the multiple views, therefore feeding richer information into the clustering stage.

On the other hand, the qualitative advantages of MV-MDS over other multiview clustering methods in the state of the art are the following.

• Some methods, like CoregSC [62] and MMSC [12], can only work with two input data views. MV-MDS can work with any number of input views.

• CoregSC [62], MV-KMeans [11] and MVC-SS [106] can only work with feature space input data. MV-MDS can work with either feature or graph space input data, or with any combination of both.

• CoKmLDA [117] requires all data views to have the same dimensionality, while MV-MDS can work with data matrices of different dimensionality.

Chapter 5

Multiview spectral clustering and Laplacian eigenmaps

5.1 Motivation

Spectral clustering (SC) [92] is a well-known clustering algorithm that is based on spectral graph theory [96]. One of its distinctive features is that it clusters samples by connectivity, not merely by distance, and therefore it allows the clustering of data sets with concave or nested clusters. It is one of the most frequently used clustering methods when such complex data has to be processed and clustered.

Laplacian eigenmaps [7] is a dimensionality reduction algorithm that is very closely related to spectral clustering; in fact, in some formulations the Laplacian eigenmaps algorithm is included in the SC algorithm. As a consequence, Laplacian eigenmaps shows most of the properties of SC, such as the fact that it groups points by connectivity, being able to identify sets of contiguous samples.

This Chapter presents a novel multiview spectral clustering algorithm, MVSC-CEV (multiview spectral clustering by common eigenvectors), which extends the features of spectral clustering to multiview datasets. The experiments on standard multiview data sets show that the MVSC-CEV algorithm has a better overall clustering performance than existing multiview clustering algorithms.

The development of MVSC-CEV also allows the development of a multiview version of Laplacian eigenmaps. Thus, a new multiview dimensionality reduction method is proposed, multiview Laplacian eigenmaps. However, for the sake of simplicity both algorithms will be referred to simply as MVSC-CEV, whether applied to clustering or to dimensionality reduction tasks.

5.2 Related work

The MVSC-CEV algorithm is mainly based on two well-known algorithms: the Ng, Jordan and Weiss (NJW) spectral clustering algorithm [84] and the stepwise common principal components method [100]. In this section, both algorithms are outlined in order to provide a theoretical background for the MVSC-CEV algorithm described in Section 5.3.

5.2.1 Spectral clustering

Spectral graph theory gives the conditions under which a graph can be partitioned into non-connected subgraphs. The spectral clustering algorithm [92] is an application of spectral graph theory to the task of graph clustering, which can be further applied to any data expressed as a matrix of similarities between samples. There are several variants of the spectral clustering algorithm. The method presented in this chapter is based on the variant described in [84], as it is a well-accepted and proven variant and it allows a reduction in the computational cost of the MVSC-CEV algorithm, as will be explained in Section 5.3. As a reference for the definition of MVSC-CEV in Section 5.3, the NJW spectral clustering algorithm is shown in Algorithm 7.

Algorithm 7. Spectral Clustering - NJW variant

Input: a set of data samples P = {p_1, ..., p_n} and the number k of desired clusters.

1. Build a similarity matrix S ∈ R^(n×n) from P using the Gaussian similarity function G_ij = exp(−||p_i − p_j||^2 / 2σ^2).
2. Construct the normalized symmetrical Laplacian matrix L = D^(−1/2) S D^(−1/2), where D is the diagonal matrix whose (i, i) element is the sum of the i-th row of S.
3. Create a matrix X ∈ R^(n×k) with the k largest eigenvectors of L disposed in columns.
4. Normalize X so that each row has unit length, obtaining Y ∈ R^(n×k).
5. Apply K-means or another clustering algorithm to Y.

Output: a clustering assignment in k clusters of the n samples in P.

The only parameter besides the desired number of clusters k is σ, which controls the influence of the distance between two points p_i and p_j on their similarity s_ij. By default, the heuristic proposed in [87] is used to choose σ. However, the similarity matrix S built in Step 1 can be obtained using a different similarity function. Note that the algorithm is formulated to receive data in feature space (observations/features matrix) as input. However, it can also be applied to data in graph space (adjacency or similarity matrix).

An important feature of spectral clustering is that it groups points by connectivity, not merely by distance, i.e. if point a is connected to b and b to c, then a and c will be assigned to the same cluster, even if the distance between a and c is relatively large. This allows clusterings to be found on data sets with concave groups of points, identifying clusters similarly to how humans identify connected shapes.
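As an illustration, the following minimal R sketch follows Algorithm 7 with a fixed σ instead of the heuristic of [87]; the function name is illustrative and this is not the released implementation:

# Minimal sketch of Algorithm 7 (NJW spectral clustering); sigma is fixed
# here instead of being chosen with the heuristic of [87].
njw_spectral_clustering <- function(P, k, sigma = 1) {
  S <- exp(-as.matrix(dist(P))^2 / (2 * sigma^2))   # Gaussian similarity matrix
  d <- 1 / sqrt(rowSums(S))
  L <- S * outer(d, d)                              # D^(-1/2) S D^(-1/2)
  X <- eigen(L, symmetric = TRUE)$vectors[, 1:k]    # k largest eigenvectors
  Y <- X / sqrt(rowSums(X^2))                       # rows normalized to unit length
  kmeans(Y, centers = k, nstart = 10)$cluster       # final cluster assignment
}

clusters <- njw_spectral_clustering(iris[, 1:4], 3) # example usage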

5.2.2 Laplacian Eigenmaps

Laplacian Eigenmaps is a dimensionality reduction method described in [7] that tries to find a low-dimensional manifold supposedly embedded in a high-dimensional space. The Laplacian Eigenmaps algorithm comprises three main stages: (1) computing an adjacency matrix between the input points, (2) computing the Laplacian matrix of the adjacency matrix, and (3) computing the eigenvectors of the Laplacian matrix. The detailed Laplacian Eigenmaps method is presented in Algorithm 8. Although the original formulation in [7] proposes the use of a heat kernel to compute the adjacency matrix, other adjacency functions have been proposed for this method, for example Gaussian radial basis functions.

Algorithm 8. Laplacian Eigenmaps

Input: dataset X = {x_1, x_2, ..., x_n}, number of dimensions k
Output: low-dimensional data representation Y = {y_1, y_2, ..., y_n}, y_i ∈ R^k

function laplacianEigenmaps(X)
    Compute the adjacency matrix A using the Gaussian similarity function, so that A_ij = exp(−||x_i − x_j||^2 / 2σ^2)
    Compute the normalized Laplacian matrix L = D^(−1/2) A D^(−1/2), where D is the diagonal matrix whose (i, i) element is the sum of the i-th row of A
    Let Y be the first k eigenvectors of L
end function
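A minimal R sketch of Algorithm 8 as formulated above, with a fixed σ for the Gaussian adjacency function (names are illustrative):

# Minimal sketch of Algorithm 8 (Laplacian Eigenmaps) as formulated above,
# with a fixed sigma for the Gaussian adjacency function.
laplacian_eigenmaps <- function(X, k, sigma = 1) {
  A <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))  # Gaussian adjacency matrix
  d <- 1 / sqrt(rowSums(A))
  L <- A * outer(d, d)                             # D^(-1/2) A D^(-1/2)
  eigen(L, symmetric = TRUE)$vectors[, 1:k]        # first k eigenvectors as Y
}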

Laplacian Eigenmaps is very closely related to Spectral Clustering. According to its authors, the application of a standard clustering algorithm (such as K-means) to the space generated by Laplacian Eigenmaps would produce the same clustering as the Spectral Clustering algorithm. This is especially clear if a Gaussian similarity function is used to compute the adjacencies in the Laplacian Eigenmaps algorithm.

5.3 MVSC-CEV description

5.3.1 Description of the algorithm

The multiview spectral clustering algorithm by common eigenvectors presented in this chapter (MVSC-CEV), detailed in Algorithm 9, has a structure that resembles that of the NJW spectral clustering algorithm (described in Section 5.2.1). The main difference lies in the fact that MVSC-CEV replaces the single input data matrix with a set of C input data views V̄ = {V_1, V_2, ..., V_C}. In turn, for each input matrix in V̄, a similarity matrix and its corresponding Laplacian matrix are computed. Then the eigenvectors common to all C Laplacian matrices are computed and used to obtain the clustering.

Algorithm 9. Multiview spectral clustering by common eigenvectors (MVSC-CEV)

Input: C view matrices V̄ = {V_1, V_2, ..., V_C} of the data (with n samples each) and the number k of desired clusters.

1. For each data view V_c ∈ V̄, compute a similarity matrix S_c ∈ R^(n×n) using the Gaussian similarity function G_ij = exp(−||p_i − p_j||^2 / 2σ^2). The result is a set of similarity matrices S̄ = {S_1, S_2, ..., S_C}.
2. For each similarity matrix S_c ∈ S̄, construct the normalized symmetrical Laplacian matrix L_c = D^(−1/2) S_c D^(−1/2), where D is the diagonal matrix whose (i, i) element is the sum of the i-th row of S_c. The result is a set of Laplacian matrices L̄ = {L_1, L_2, ..., L_C}.
3. Create a matrix X ∈ R^(n×k) with the k largest common eigenvectors of the matrices in L̄, computed using the S-CPC algorithm (Section 4.2.2).
4. Normalize X so that each row has unit length, obtaining Y ∈ R^(n×k).
5. Apply K-means or another clustering algorithm to Y.

Output: a single clustering assignment of the n input data samples in k clusters.
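For illustration, here is a minimal R sketch of Algorithm 9. A full S-CPC implementation is out of scope here, so the common eigenvectors of Step 3 are approximated by eigendecomposing the sum of the per-view Laplacian matrices; the actual method uses S-CPC, and all names are illustrative:

# Minimal sketch of Algorithm 9 (MVSC-CEV). Step 3 is approximated here by
# eigendecomposing the sum of the per-view Laplacians, a simple stand-in
# for the S-CPC computation of common eigenvectors.
mvsc_cev_sketch <- function(views, k, sigma = 1) {
  laplacian <- function(V) {
    S <- exp(-as.matrix(dist(V))^2 / (2 * sigma^2))  # Gaussian similarities
    d <- 1 / sqrt(rowSums(S))
    S * outer(d, d)                                  # D^(-1/2) S D^(-1/2)
  }
  L_sum <- Reduce(`+`, lapply(views, laplacian))     # surrogate for S-CPC
  X <- eigen(L_sum, symmetric = TRUE)$vectors[, 1:k] # k largest common eigenvectors
  Y <- X / sqrt(rowSums(X^2))                        # unit-length rows
  kmeans(Y, centers = k, nstart = 10)$cluster        # single clustering assignment
}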

A consequence of using the NJW spectral clustering formulation is that only the k largest eigenvectors have to be computed. The S-CPC algorithm makes this possible in an efficient way, thus reducing the computational complexity of Step 3 of Algorithm 9 from O(Cn^2) to O(Ckn), with k ≪ n in the vast majority of cases. MVSC-CEV can operate on both feature space and graph space input views, and on any combination of both. Input matrices in feature space can have any dimension, possibly differing across the different matrices.

5.3.2 Ideal clustering case

To show why Algorithm 9 works as expected, let us first consider an ideal case, where there are k = 3 perfectly separated clusters, i.e. the points in different clusters are infinitely far apart from each other. Moreover, all C data views have the same clustering structure. In order to simplify the discussion, let us also assume that the points in the data views are ordered according to the cluster they belong to, so points belonging to cluster 1 appear first, then points of cluster 2 and finally the points of cluster 3. Consequently, the similarity matrices of this example will be block-diagonal: ∀S_c ∈ {S_1, S_2, ..., S_C}, S_ij = 0 if data points i and j do not belong to the same cluster, and greater than zero otherwise. Representing each non-zero subblock of the similarity matrices with a parenthesized superscript:

S_c = \begin{pmatrix} S^{(1)} & 0 & 0 \\ 0 & S^{(2)} & 0 \\ 0 & 0 & S^{(3)} \end{pmatrix} \quad \forall S_c \in \bar{S} \qquad (5.1)

On the next step of the algorithm, for each similarity matrix Sc a Laplacian matrix Lc is computed, whose block-diagonal structure will be the same:

L_c = \begin{pmatrix} L^{(1)} & 0 & 0 \\ 0 & L^{(2)} & 0 \\ 0 & 0 & L^{(3)} \end{pmatrix} \quad \forall L_c \in \bar{L} \qquad (5.2)

In step 3 of Algorithm 9, the set of Laplacian matrices L̄ defined in (5.2) is passed to the S-CPC algorithm along with k in order to compute their common eigenvectors. According to [100], S-CPC finds the k eigenvectors whose sum of eigenvalues is highest. Each such eigenvector is located on the sphere in R^n, where n is the number of samples in the input data. The common eigenvectors are mutually orthogonal. Given the common structure of the Laplacian matrices L_c ∈ L̄, they have the same eigenvectors, which therefore are the common eigenvectors computed by S-CPC. Following [84], the k largest eigenvectors of the Laplacian matrices L_c ∈ L̄ are the first eigenvectors (i.e. those with the largest eigenvalue) x_1^(i) of each submatrix L^(i), properly padded with zeros to complete the missing elements:

 (1) ~ ~  x1 0 0 X =  ~ (2) ~  ∈ n×k (5.3)  0 x1 0  R ~ ~ (3) 0 0 x1

Finally, the normalization of the rows of X to make them of unit length results in a matrix Y of the form:

Y = \begin{pmatrix} y^{(1)} & \vec{0} & \vec{0} \\ \vec{0} & y^{(2)} & \vec{0} \\ \vec{0} & \vec{0} & y^{(3)} \end{pmatrix} \in \mathbb{R}^{n \times k} \qquad (5.4)

where y^(i) is a vector of ones with as many values as the number of elements in cluster i. Applying K-means to Y produces the clustering assignment of the input data samples common to the C input views.

5.3.3 Deviations from the ideal case

In multiview clustering problems there are two possible sources of deviation from the ideal case discussed in Section 5.3.2. The first possible situation occurs when the off-diagonal blocks in the similarity matrices S_c ∈ S̄ are not zero, i.e. the clusters are not perfectly separated from each other. This case is discussed in [84] for a single similarity matrix, but its extension to several similarity matrices with the same structure is straightforward. The second possible deviation from the ideal case stems from structural discrepancies across data views, where not all similarity matrices share the same structure. Obviously this second deviation implies the former one, as at least some of the views cannot exhibit a perfect block-diagonal structure if there are differences between them.

In this case, the eigenvalues associated to each of the different Laplacian matrices in L̄ may not decrease simultaneously; but even in such a case, S-CPC guarantees that the sum of the eigenvalues associated to each successive common eigenvector is decreasing. Let λ_c^(i) be the eigenvalue associated to Laplacian matrix L_c obtained on iteration i of S-CPC, i.e. associated with the i-th eigenvector. Therefore, the following relation holds:

\sum_{c=1}^{C} \lambda_c^{(i)} \geq \sum_{c=1}^{C} \lambda_c^{(i+1)} \quad \forall i = 1, 2, \ldots, k \qquad (5.5)

In other words, the eigengaps (differences between consecutive eigenvalues) are conserved:

\delta^{(i)} = \sum_{c=1}^{C} \lambda_c^{(i)} - \sum_{c=1}^{C} \lambda_c^{(i+1)} \geq 0 \quad \forall i = 1, 2, \ldots, k \qquad (5.6)

This satisfies the matrix perturbation theory condition [96], which guarantees the stability of the subspace defined by the first k eigenvectors of a matrix as long as the eigengaps are conserved.

5.3.4 Multiview Laplacian Eigenmaps

Given the close correspondence between Spectral Clustering and Laplacian Eigenmaps described in Section 5.2.2, an equivalent correspondence can be established between the Multiview Spectral Clustering algorithm presented here (Section 5.3) and a multiview extension of Laplacian Eigenmaps. The proposed Multiview Laplacian Eigenmaps algorithm (MV-LE) is a variation of Algorithm 9, where the resulting low-dimensional projection is the normalized common eigenvector matrix Y. MV-LE is detailed in Algorithm 10.

Algorithm 10. Multiview Laplacian Eigenmaps

Input: C view matrices V̄ = {V_1, V_2, ..., V_C} of the data (with n samples each) and the number k of desired dimensions of the projection.

1. For each data view V_c ∈ V̄, compute a similarity matrix S_c ∈ R^(n×n) using the Gaussian similarity function G_ij = exp(−||p_i − p_j||^2 / 2σ^2). The result is a set of similarity matrices S̄ = {S_1, S_2, ..., S_C}.
2. For each similarity matrix S_c ∈ S̄, construct the normalized symmetrical Laplacian matrix L_c = D^(−1/2) S_c D^(−1/2), where D is the diagonal matrix whose (i, i) element is the sum of the i-th row of S_c. The result is a set of Laplacian matrices L̄ = {L_1, L_2, ..., L_C}.
3. Create a matrix X ∈ R^(n×k) with the k largest common eigenvectors of the matrices in L̄, computed using the S-CPC algorithm (Section 4.2.2).
4. Normalize X so that each row has unit length, obtaining Y ∈ R^(n×k).

Output: a single k-dimensional projection (Y) of the multiview input samples.
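Following the same simplification as before, a minimal R sketch of Algorithm 10; it is identical to the MVSC-CEV sketch above except that the normalized common-eigenvector matrix is returned as the embedding instead of being passed to K-means (the sum of per-view Laplacians again stands in for S-CPC, and names are illustrative):

# Minimal sketch of Algorithm 10 (MV-LE), with the same surrogate for the
# S-CPC step as before; the embedding Y is returned instead of a clustering.
mv_le_sketch <- function(views, k, sigma = 1) {
  laplacian <- function(V) {
    S <- exp(-as.matrix(dist(V))^2 / (2 * sigma^2))  # Gaussian similarities
    d <- 1 / sqrt(rowSums(S))
    S * outer(d, d)                                  # D^(-1/2) S D^(-1/2)
  }
  L_sum <- Reduce(`+`, lapply(views, laplacian))     # stand-in for S-CPC
  X <- eigen(L_sum, symmetric = TRUE)$vectors[, 1:k]
  X / sqrt(rowSums(X^2))                             # normalized projection Y
}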

5.4 Results

Strictly speaking, MVSC-CEV is a multiview clustering algorithm and MV-LE is a multiview dimensionality reduction algorithm. However, given that they are so closely related (their code is almost identical) and for the sake of simplicity, the results from MV-LE when applied to dimensionality reduction tasks will simply be referred to as results for MVSC-CEV.

5.4.1 MVSC-CEV with respect to the SC baseline

The first block of experiments compares the proposed MVSC-CEV method with the spectral clustering baseline, either applied to each input view independently, or applied to all the input views stacked into a single data matrix. The goal of these experiments is to assess the advantages of the proposed multiview method with respect to single view approaches.

In order to synthesize the numerous results (different metrics, datasets and embedding dimensionalities), the results of each evaluation metric are summarized in a set of plots, one for each dataset in the experiments. Therefore, six graphs per evaluation metric are produced (five for the unsupervised metrics, as the Cora dataset has a graph space view that is not compatible with these metrics). However, the detailed numerical results are given in Appendix C, with one table for each combination of evaluation metric and dataset, in a total of 33 tables to evaluate the MVSC-CEV method.

5.4.1.1 Dimensionality reduction evaluation

There are three evaluation metrics for the dimensionality reduction task: SVM classification, cophenetic correlation (average on all the input views) and area under the curve of the RNX value (average on all input views). Figure 5.1 shows the two-dimensional projections of two example datasets using MVSC-CEV.

SVM classification (Figure 5.2). The results on the animal with attributes (AWA) dataset are low and irregular in general, given the specific difficulty of the task (there are 50 classes in this dataset). There is an initial peak and a posterior descent around K=10, probably induced by added information that is not related to the classification task and misguides the algorithm. Then the results rise again around K=50, which happens to be the number of classes in the dataset and as such a recommended dimensionality for this dataset. Apparently one of the single views shows a better performance than the other configurations.

The BBC segmented news dataset has two views that actually are two text segments of the same document, and as a consequence both views are largely equivalent. This is reflected in the SVM results for BBC, where MVSC-CEV and stacked-SC results overlap, as there is no practical difference between both approaches in this dataset. MVSC-CEV (and stacked-SC) perform consistently better than single-view SC, showing the usefulness of adding more text segments to the dataset.

On the handwritten digits dataset, MVSC-CEV performs slightly better than stacked-SC on most dimensionalities. Here there is a considerable difference in performance between single views, which suggests that some views are not so useful for the classification task.

On the Berkeley protein dataset, MVSC-CEV clearly performs above the baseline methods for all dimensionalities ≤ 30. This is especially interesting as it is the range where most information compression is achieved.

Interestingly, the worst single-view configuration ends up performing best at very high dimensionalities (≥ 70).

On the Reuters dataset, the best single view SC performs better than the other methods. MVSC-CEV, however, performs clearly better than the stacked and average single view configurations for most values of K.

Finally, on the Cora dataset the stacked-SC performs better at low dimensionalities, achieving the absolute maximum. The best single-SC performs better at higher dimensionalities. In this case, the worst single view seems to attract both MVSC-CEV and stacked-SC, as they closely follow its curve of results.

[Figure 5.1: MVSC-CEV projection of two example datasets (BBC dataset and Digits dataset); scatter plots of the two-dimensional projections.]

Cophenetic correlation (Figure 5.3). The cophenetic correlation measures the similarity of the distances in the new space with respect to the original input spaces. The average cophenetic correlation over all input views is provided. On the AWA and digits datasets, MVSC-CEV shows a performance around the average single view SC.

On the BBC dataset, its cophenetic correlation is consistently above single view SC, although overlapped with stacked-SC due to the design of this dataset. On the Berkeley protein dataset, MVSC-CEV shows the highest correlation at low dimensionalities, although it then drops for dimensionalities ≥ 20. Finally, on the Reuters dataset all methods start with a high value and then drop as K increases. MVSC-CEV and stacked rank below the average single views.

Area under the RNX curve (Figure 5.4). The area under the RNX curve (AUC-RNX) also measures the similarity of the sample neighbourhoods in the projected space with respect to the original input view. Here, the average AUC-RNX over all input views is given. On the AWA and BBC datasets, MVSC-CEV shows the highest AUC-RNX values. On the other hand, it shows values around the average on the digits and protein datasets. On the Reuters dataset, MVSC-CEV and stacked show intertwined results, with MVSC-CEV in general slightly above and showing the absolute maximum AUC-RNX values.

5.4.1.2 Clustering evaluation

There are three evaluation metrics for the clustering task: the clustering purity, the clustering normalized mutual information, and the Davies-Bouldin index (average on all input views).

Clustering purity (Figure 5.5). MVSC-CEV consistently shows the highest clustering purity on the animal dataset, and also on most dimensionalities on the Cora dataset. On specific dimensionalities, MVSC-CEV produces the highest purity values on the BBC, digits and protein datasets. The first and last of these datasets show quite irregular results, with a tendency on the protein dataset for the purity values to be attracted towards 0.70.

Finally, on the Reuters dataset the best single view shows the best performance. MVSC-CEV is clearly better than the stacked and average single views.

[Figure 5.2: MVSC-CEV dimensionality reduction evaluation with SVM classification. Panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; x axis: K (dimensions); y axis: SVM classification accuracy; curves: MVSC-CEV, Stacked, Best single, Average single, Worst single.]

[Figure 5.3 plot panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; y-axis: Cophenetic correlation; x-axis: K (dimensions); curves: MVSC-CEV, Stacked, Best single, Average single, Worst single.]

Figure 5.3: MVSC-CEV dimensionality reduction evaluation with cophenetic correlation (average on all input views).

[Figure 5.4 plot panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; y-axis: Area under the RNX curve; x-axis: K (dimensions); curves: MVSC-CEV, Stacked, Best single, Average single, Worst single.]

Figure 5.4: MVSC-CEV dimensionality reduction evaluation with area under the RNX curve (average on all input views).

On specific dimensionalities, MVSC-CEV produces the highest purity values on the BBC, digits and protein datasets. The first and last of these datasets show quite irregular results, with the purity values on the protein dataset tending towards 0.70. Finally, on the Reuters dataset the best single view shows the best performance. MVSC-CEV is clearly better than stacked and average single views.

Clustering normalized mutual information (Figure 5.6). The clustering NMI results are quite similar to the clustering purity results, with MVSC-CEV showing the best results on the animal, BBC, digits, protein and Cora datasets. On the Reuters dataset, the best single view configuration performs better than the other methods. MVSC-CEV, however, performs better than stacked and average single views.

Davies-Bouldin index (Figure 5.7). The Davies-Bouldin index (DBI) measures the internal properties of the clusters with respect to the other clusters, considering the distances in the input spaces. Here, the average Davies-Bouldin index over all input views is given. For the DBI, lower values are better. On the animal and protein datasets, there is a wide difference between the best single view SC and the other configurations. Both MVSC-CEV and stacked show a DBI around that of the worst single view. On the BBC dataset, the MVSC-CEV and stacked DBI results overlap on most K's and show the best results. On the digits dataset, the best single view overlaps with and finally improves over MVSC-CEV and the stacked views. Finally, on the Reuters dataset all results are mixed and irregular, with the best single view showing the absolute minimum DBI value.
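For reference, a compact sketch of the Davies-Bouldin computation for one view is shown below; Euclidean distances and centroid-based scatter are assumed, and the reported figure is the average of this index over all input views using the same cluster assignment.

```r
# Minimal sketch (assumptions noted above) of the Davies-Bouldin index.
davies_bouldin <- function(X, clusters) {
  ids <- sort(unique(clusters))
  centroids <- t(sapply(ids, function(c) colMeans(X[clusters == c, , drop = FALSE])))
  scatter <- sapply(seq_along(ids), function(j) {
    pts <- X[clusters == ids[j], , drop = FALSE]
    mean(sqrt(colSums((t(pts) - centroids[j, ])^2)))  # mean distance to the centroid
  })
  k <- length(ids)
  r <- sapply(seq_len(k), function(i) {
    max(sapply(setdiff(seq_len(k), i), function(j) {
      (scatter[i] + scatter[j]) / sqrt(sum((centroids[i, ] - centroids[j, ])^2))
    }))
  })
  mean(r)   # lower values indicate compact, well separated clusters
}
```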

5.4.2 MVSC-CEV with respect to the state of the art

Table 5.1 shows the most relevant clustering purity results in the state of the art and compares them to the results of MVSC-CEV on the same datasets and configuration. The clustering purity measures the faithfulness of the produced clustering assignments to the reference class assignments in the dataset. In other words, higher values mean that the clustering assignment is more similar to the class assignment, with a value of 1 meaning that both are identical. The clustering of the handwritten digits dataset shows high clustering purity (> 0.8) for most methods, with MVSC-CEV producing a clearly higher value than the other methods. The BBC segmented news dataset is only used in two papers in the state of the art, with quite high values in all cases. Here, MVSC-CEV also produces a better clustering than the other methods. The Reuters multilingual dataset renders lower purity values in general, as it appears to be a harder task. In any case, MVSC-CEV still holds the best purity result. Finally, the animal with attributes dataset shows a high variability in the results reported in the state of the art, with MV-KMeans probably using a different preprocessing (it is not specified in the paper). Again, MVSC-CEV produces the best clustering assignment according to the purity result.

[Figure 5.5 plot panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; y-axis: Clustering purity; x-axis: K (dimensions); curves: MVSC-CEV, Stacked, Best single, Average single, Worst single.]

Figure 5.5: MVSC-CEV clustering evaluation with clustering purity.

[Figure 5.6 plot panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; y-axis: Normalized mutual information; x-axis: K (dimensions); curves: MVSC-CEV, Stacked, Best single, Average single, Worst single.]

Figure 5.6: MVSC-CEV clustering evaluation with clustering normalized mutual information.

[Figure 5.7 plot panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; y-axis: Davies-Bouldin index; x-axis: K (dimensions); curves: MVSC-CEV, Stacked, Best single, Average single, Worst single.]

Figure 5.7: MVSC-CEV clustering evaluation with the Davies-Bouldin index (average on all input views). Less is better.

Method      Digits   BBC     Reuters  AWA
CoregSC     0.822    0.887   0.552    0.580
MMSC        0.758    NA      0.390    0.585
MVC-SS      NA       NA      0.531    0.629
MV-KMeans   0.825    NA      NA       0.114
CoKmLDA     0.819    0.914   NA       NA
MFSC-MO     0.800    NA      NA       NA
MVC-PSS     0.862    NA      NA       0.325
MVSC-BG     0.844    NA      0.577    NA
MVSC-CEV    0.946    0.940   0.619    0.795

"NA" means there are no available results of the method on the dataset.

Table 5.1: Clustering purity with respect to the state of the art.
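To make the purity values above concrete, a minimal sketch of the standard majority-class purity computation is shown below (an assumption about the exact formula, but it matches the description given earlier: a value of 1 means the clustering and class assignments coincide).

```r
# Minimal sketch: each cluster is labelled with its majority class; purity is
# the fraction of samples whose class matches their cluster's majority class.
clustering_purity <- function(clusters, classes) {
  tab <- table(clusters, classes)
  sum(apply(tab, 1, max)) / length(clusters)
}

# e.g. clustering_purity(c(1, 1, 2, 2), c("a", "a", "a", "b")) returns 0.75
```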

Table 5.2 shows the normalized mutual information (NMI) between the clustering assignments produced by the multiview clustering methods in the state of the art and the reference class labels in the dataset. A higher value implies a more similar assignment, with NMI = 1 meaning a perfect match. However, NMI measures the coincidence of the cluster assignments differently from purity, as it uses the mutual information between both assignments. The clustering of the handwritten digits dataset has, in general, high NMI values. MVSC-CEV shows the highest NMI result on this dataset. The two results on the BBC segmented news dataset are relatively similar, with the NMI result of MVSC-CEV being slightly higher. MVSC-BG [70] shows the highest result on the Reuters multilingual dataset, although the values are in general lower than on the other datasets. Finally, on the animal with attributes dataset the results are mostly around 0.7 and above, with MVSC-CEV having NMI = 0.833, clearly above the other methods.
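A compact sketch of one common NMI definition is given below; the normalization by the geometric mean of the entropies is an assumption, as several NMI variants exist and the exact formula is not restated here.

```r
# Minimal sketch (assumed normalization: geometric mean of the entropies).
clustering_nmi <- function(clusters, classes) {
  n <- length(clusters)
  p_joint <- table(clusters, classes) / n       # joint distribution
  p_c <- rowSums(p_joint)                       # cluster marginals
  p_k <- colSums(p_joint)                       # class marginals
  nz <- p_joint > 0
  mi  <- sum(p_joint[nz] * log(p_joint[nz] / outer(p_c, p_k)[nz]))
  h_c <- -sum(p_c[p_c > 0] * log(p_c[p_c > 0]))
  h_k <- -sum(p_k[p_k > 0] * log(p_k[p_k > 0]))
  mi / sqrt(h_c * h_k)                          # 1 means a perfect match
}
```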

5.5 Discussion

When compared to the baseline methods on the dimensionality reduction task, MVSC-CEV shows the best performance (SVM classification accuracy) on four out of six datasets. On the Reuters dataset, it performs better than the stacked or average single views, which are a more realistic reference than the best single view, which is probably not known beforehand. Finally, on the animal dataset the overall results of all methods are low and irregular, so they may not be of much relevance to evaluate this task.

Method      Digits   BBC     Reuters  AWA
CoregSC     0.836    0.769   0.326    0.695
MMSC        0.792    NA      0.134    0.698
MVC-SS      NA       NA      NA       0.751
MV-KMeans   0.807    NA      NA       0.117
CoKmLDA     0.818    0.796   NA       NA
MFSC-MO     0.785    NA      NA       NA
MVC-PSS     0.833    NA      NA       0.213
MVSC-BG     0.832    NA      0.357    NA
MVSC-CEV    0.892    0.826   0.341    0.833

"NA" means there are no available results of the method on the dataset.

Table 5.2: Clustering NMI with respect to the state of the art.

On the unsupervised dimensionality reduction evaluation metrics, MVSC-CEV shows the best performance on half of the datasets and an average performance on the other three. This is likely caused by the fact that these metrics measure the similarity of the low-dimensional projection to the original input data, and MVSC-CEV probably has to perform deeper transformations of the data in order to embed the different input views into a single projection space. Each single view SC only has to embed one input space, and the stacked view configuration simply concatenates the different input spaces into one large data matrix, without a profound change in their structure.

Regarding the clustering task, MVSC-CEV achieves the best performance on five out of six datasets, both on clustering purity and on NMI. This evidences its usefulness as a multiview clustering method with respect to using single view or stacked view solutions on standard spectral clustering. On the remaining dataset (Reuters), it shows better performance than both the stacked views and the average single view, which, as has been argued before, is a realistic assessment of its practical advantages when processing new multiview datasets, where it is unknown whether one of the views is better than the others and, as a consequence, all views or a concatenation of all views have to be used instead.
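For concreteness, the stacked baseline discussed above can be expressed in one line of R, assuming all views share the same samples in the same row order; the variable names are illustrative only.

```r
# views: a list of per-view data matrices with identical row (sample) ordering
stacked <- do.call(cbind, views)   # column-wise concatenation of all views
```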

Compared with other multiview clustering methods in the state of the art, the method proposed in this thesis (MVSC-CEV) shows better overall clustering results. This has been tested with four well known multiview datasets and two supervised clustering evaluation metrics, where MVSC-CEV produces better clusterings in all cases except one. This suggests that MVSC-CEV may be a better multiview clustering method than the other methods in the state of the art.

MVSC-CEV computes the common eigenvectors of the Laplacian matrices, computed in turn from the similarity matrices of the input data views. The similarity matrices of the different views of a multiview dataset are expected to contain a considerable amount of shared information, as the similarities between samples in different views are likely to be coherent in general. The probable cause of the advantage of MVSC-CEV over other multiview clustering methods is that, when computing the common eigenvectors, this redundant information is (1) retrieved first, as it has a stronger signal, and (2) retrieved only once, leaving room for other useful information in the common subspace matrix generated. This behaviour may make MVSC-CEV better at synthesizing the information contained in the multiple views, therefore feeding richer information into the clustering stage.

On the other hand, the qualitative advantages of MVSC-CEV over other multiview clustering methods in the state of the art are the following (a brief sketch of the per-view Laplacian computation is given after the list).

• Some methods, like CoregSC [62] and MMSC [12], can only work with two input data views. MVSC-CEV can work with any number of input views.

• CoregSC [62], MV-KMeans [11] and MVC-SS [106] can only work with feature space input data. MVSC-CEV can work with either feature or graph space input data, or with any combination of both.

• CoKmLDA [117] requires all input data to have the same dimensionality, while MVSC-CEV can work with input matrices of different dimensionality.
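As a point of reference for the discussion above, the sketch below computes the symmetric normalized Laplacian of one view's similarity matrix; the common-eigenvector step that merges the per-view Laplacians, which is the core contribution of MVSC-CEV, is not reproduced here.

```r
# Minimal sketch: symmetric normalized Laplacian L = I - D^(-1/2) W D^(-1/2)
# for one similarity matrix W (symmetric, non-negative).
normalized_laplacian <- function(W) {
  d <- rowSums(W)                                  # node degrees
  D_inv_sqrt <- diag(1 / sqrt(pmax(d, .Machine$double.eps)))
  diag(nrow(W)) - D_inv_sqrt %*% W %*% D_inv_sqrt
}

# One Laplacian per view (illustrative names):
# laplacians <- lapply(similarity_views, normalized_laplacian)
```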

Chapter 6

Method comparison

6.1 Motivation

The previous chapters included the description of the three methods proposed in this thesis, as well as the exposition of the results and the analysis of the experiments comparing them with their single-view counterparts and with equivalent methods in the state of the art. However, no direct comparison between the three proposed methods has been done yet. The goal of this chapter is to directly compare the three proposed methods, analyze their respective results and, if possible, suggest which one to use for a given task.

6.2 Multiview dimensionality reduction

SVM classification (Figure 6.1). On the AWA, protein and Cora datasets, MV-MDS shows better results than the other methods. On the BBC, digits and Reuters datasets, MV-MDS and MVSC-CEV are fairly balanced, although on the Reuters dataset MVSC-CEV seems to dominate. MV-tSNE remains below on all the datasets, although its behaviour is more stable in some datasets.

Cophenetic correlation (Figure 6.2). MV-MDS shows better results on the AWA, digits, protein and Reuters datasets, at all or most dimensionalities. On the BBC dataset MV-tSNE begins with the best cophenetic correlation, although at higher K's MVSC-CEV obtains better results.

Area under the RNX curve (Figure 6.3). Again MV-MDS shows better AUC-RNX values on the AWA, digits, protein and Reuters datasets. As in the case of the cophenetic correlation, on the BBC dataset MV-tSNE has the best results at low K’s but afterwards MVSC-CEV increases its AUC-RNX and surpasses it.


[Figure 6.1 plot panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; y-axis: SVM classification accuracy; x-axis: K (dimensions); curves: MV-tSNE, MV-MDS, MVSC-CEV.]

Figure 6.1: Dimensionality reduction evaluation with SVM classification.

[Figure 6.2 plot panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; y-axis: Cophenetic correlation; x-axis: K (dimensions); curves: MV-tSNE, MV-MDS, MVSC-CEV.]

Figure 6.2: Dimensionality reduction evaluation with cophenetic correlation.

[Figure 6.3 plot panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; y-axis: Area under the RNX curve; x-axis: K (dimensions); curves: MV-tSNE, MV-MDS, MVSC-CEV.]

Figure 6.3: Dimensionality reduction evaluation with AUC-RNX.

6.3 Multiview clustering

Clustering purity (Figure 6.4). MVSC-CEV shows the highest purity values on AWA, BBC, digits and Reuters, although in some cases only at specific dimensionalities. On the protein dataset MV-MDS shows the best purity result. Finally, on the Cora dataset, MV-tSNE has the highest purity.

Normalized mutual information (Figure 6.5). MV-MDS and MVSC-CEV are quite balanced on the AWA and digits datasets. MVSC-CEV has a slightly higher value on the BBC dataset, and clearly dominates on the Reuters dataset. The protein dataset maximum is achieved by MV-MDS, while the best NMI results on Cora are obtained by MV-tSNE.

Davies-Bouldin index (Figure 6.6). MV-tSNE gives the lowest (best) DBI values on AWA, BBC, Digits and Reuters datasets. On the protein dataset, the minimum is achieved by MV-MDS.

6.4 Discussion

According to the evaluation metrics used in the experiments, MV-MDS seems the best option for dimensionality reduction tasks, as it produces the best results on most metrics and datasets. However, it is important to analyze the behaviour of the different methods along the number of dimensions of the output space, as the specific application may require a concrete dimensionality. For example, for data visualization the dimensionality of the output space should be 2 or 3. In general, both MV-MDS and MVSC-CEV seem to produce the best results with a number of dimensions between 10 and 30; adding more dimensions, in general, makes the quality indicators drop. This is probably caused by the fact that the intrinsic dimensionality of the manifold that is being captured is of that order, and adding more dimensions simply adds noise to the representation.

Regarding the multiview clustering task, in general MVSC-CEV shows better results than the other methods, although in some cases MV-MDS performs better. It is also relevant to recall the results of the proposed methods with respect to the multiview clustering methods in the state of the art, where MVSC-CEV showed the best results on most comparisons.

As a guideline for the potential users of the methods proposed in this thesis, MV-MDS seems the best option for dimensionality reduction tasks, while MVSC-CEV would be the first option for multiview clustering tasks.

[Figure 6.4 plot panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; y-axis: Clustering purity; x-axis: K (dimensions); curves: MV-tSNE, MV-MDS, MVSC-CEV.]

Figure 6.4: Clustering purity.

[Figure 6.5 plot panels: Animal (AWA), BBC, Digits, Protein, Reuters and Cora datasets; y-axis: Clustering normalized mutual information (NMI); x-axis: K (dimensions); curves: MV-tSNE, MV-MDS, MVSC-CEV.]

Figure 6.5: Clustering normalized mutual information.

[Figure 6.6 plot panels: Animal (AWA), BBC, Digits, Protein and Reuters datasets; y-axis: Davies-Bouldin index; x-axis: K (dimensions); curves: MV-tSNE, MV-MDS, MVSC-CEV.]

Figure 6.6: Clustering evaluation with Davies-Bouldin index (less is better).

Chapter 7

Multiview software package

7.1 Motivation

Currently there are no multiview dimensionality reduction or clustering methods freely available to the community. This fact has conditioned the experiments presented in this thesis, as the comparisons have only been possible with respect to the published results in the state of the art. The reasons for creating a software package with the methods presented in this thesis are:

• Given that the proposed methods have been found to be better alternatives than their single view counterparts and than other methods in the state of the art, providing these methods to the community seems a desirable contribution.

• To let fellow researchers reproduce the experiments presented here or run their own experiments.

• To provide sample data and documentation so that users new to multiview methods can learn to use them and take advantage of them when possible.

• To make multiview data and methods known to the community, so that their use spreads and the area evolves accordingly.

7.2 Package “multiview”

A software package named “multiview” [57] has been created to satisfy the objectives listed above. The language of choice is the R language for statistical computing [89], given its widespread use in the scientific community, and especially in the bioengineering and bioinformatics areas. Package “multiview” includes the following features:


• An implementation of MV-tSNE, which accepts multiple data views and produces a single, low-dimensional projection of the original data.

• An implementation of MV-MDS, which accepts multiple data views and produces a single, low-dimensional projection of the original data.

• An implementation of MVSC-CEV, which accepts multiple data views and produces both a single, low-dimensional projection of the original data and a single clustering assignment for the input data samples.

• Synthetic multiview datasets that allow the users to get familiar with the different methods in the package.

• A user’s manual describing the different methods provided, their parameters, and their return values, along with examples with the datasets provided.
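An illustrative usage sketch follows; the function names and signatures shown are hypothetical placeholders chosen for this example, so the package's user manual should be consulted for the actual interface.

```r
# Hypothetical usage sketch: function and argument names are assumptions,
# not the package's documented API.
library(multiview)

# views: a list of data matrices, one per view, sharing the same samples (rows)
# proj  <- mvmds(views, k = 2)   # multiview dimensionality reduction to 2D
# clust <- mvsc(views, k = 5)    # multiview spectral clustering into 5 groups
```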

Chapter 8

Conclusions and main developments

A multiview extension of the t-distributed stochastic neighbour embedding algorithm (t-SNE) has been developed (MV-tSNE). This algorithm is based on the generation of a common matrix of distance probabilities using expert consensus theory. Afterwards, the embedding optimization of t-SNE can be applied to the common matrix, therefore taking advantage of the different optimization strategies included in t-SNE. MV-tSNE shows an average performance both on dimensionality reduction tasks and on clustering tasks.

A multiview extension of the multidimensional scaling algorithm (MDS) has been developed (MV-MDS). This algorithm computes the common eigenvectors of the normalized distance matrices of the input views, therefore prioritizing the extraction of information common to all input views. In turn, this is expected to produce a single high-quality, low-dimensional representation of the multiview, high-dimensional input data. MV-MDS shows better performance than the baseline methods (single view MDS) on dimensionality reduction tasks. MV-MDS also shows above-average performance on clustering tasks, with better overall results than other methods in the state of the art.

A multiview extension of the spectral clustering algorithm has been developed (MVSC-CEV). This algorithm computes the common eigenvectors of the normalized Laplacian matrices of the input similarity matrices. This is expected to condense the most relevant bonds among samples over the different input spaces into a single representation that is useful both as a reduced-dimensionality projection and as a space on which to apply clustering. In fact, this algorithm also includes a multiview extension of the dimensionality reduction method Laplacian eigenmaps. MVSC-CEV produces the best clustering assignments on the vast majority of the experiments run, both with respect to baseline single-view spectral clustering configurations and to other multiview clustering methods in the state of the art. MVSC-CEV also shows the

best performance on most datasets on dimensionality reduction tasks.

A thorough set of experiments has been designed and run on all three methods proposed in this thesis. For each of the tasks analyzed (dimensionality reduction and clustering), three evaluation metrics have been carefully chosen, ensuring that both supervised and unsupervised metrics are used for each task. From the extensive variety of multiview datasets, six have been selected according to their relevance in the state of the art and their different multiview designs, as well as their different areas of interest.

The proposed methods have been compared with their single-view counterparts in different configurations: either processing each view independently, or concatenating all views into a single matrix and using it as a single input view. The goal of these experiments has been to assess the usefulness of multiview methods compared to single view methods. MV-tSNE does not show any improvements over the baseline methods. MVSC-CEV has proved to be a clearly better clustering method than the single view baseline methods. MV-MDS has proved to be a better dimensionality reduction method than the single view baseline methods.

Also, the proposed methods have been compared with all the available results in the state of the art. There is no software available that implements either the multiview dimensionality reduction methods or the multiview clustering methods in the state of the art. For this reason, the comparison with the methods in the state of the art has been based on the published results. The datasets and experimental setups have been carefully chosen in order to reproduce as faithfully as possible those described in the literature. MVSC-CEV has consistently shown better multiview clustering performance than the methods in the state of the art.

The comparison among the three proposed methods confirms that MVSC-CEV is the most suited method for multiview clustering tasks, while MV-MDS is the most suited method for multiview dimensionality reduction tasks.

Finally, a software package that includes all three proposed methods, along with some test data and proper documentation, has been created and released to the public. Given the good results obtained by the methods presented, making them publicly available may help the research community, especially given the lack of any public multiview methods. The language of choice is the R language, as it is one of the most relevant languages in the biomedical engineering and bioinformatics fields.

As a final conclusion, the methods and results presented in this work show that using multiview data along with multiview methods in general improves the results in dimensionality reduction and data clustering tasks. As a consequence, using the methods proposed here can be useful when dealing with such tasks on multiview data.

Articles and conferences

Samir Kanaan-Izquierdo, Andrey Ziyatdinov, Raimon Massanet, and Alexandre Perera. Multiview approach to spectral clustering. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS, pages 1254–1257, 2012

Samir Kanaan-Izquierdo, Andrey Ziyatdinov, and Alexandre Perera-Lluna. Multiview and multifeature spectral clustering using common eigenvectors. Pattern Recognition Letters, 2017 (Under review)

Samir Kanaan-Izquierdo, Andrey Ziyatdinov, and Alexandre Perera-Lluna. Multiview: an R package for multiview pattern recognition. Technical report, B2SLAB - CREB - UPC, 2017

Samir Kanaan-Izquierdo and Alexandre Perera-Lluna. Multiview t-distributed stochastic neighbour embedding. Technical report, B2SLAB - CREB - UPC, 2017

Samir Kanaan-Izquierdo and Alexandre Perera-Lluna. Multiview multidimensional scaling. Technical report, B2SLAB - CREB - UPC, 2017

Appendix A

Results of MV-tSNE experiments

In this appendix, the detailed results of the experiments with the MV-tSNE method are presented. There is a results table for each combination of dataset and evaluation method, yielding a total of 36 tables. The methods are compared with the counterpart single view method, either applied to the single views individually or to all views stacked in a single matrix. For single views, the worst, average and best views (on average) are given.


Single view Stacked K MV-tSNE Worst Average Best views 2 0.488 ± 0.002 0.835 ± 0.409 0.972 ± 0.002 0.978 ± 0.002 0.967 ± 0.002 3 0.490 ± 0.001 0.838 ± 0.410 0.973 ± 0.001 0.978 ± 0.001 0.970 ± 0.001 4 0.492 ± 0.001 0.838 ± 0.406 0.971 ± 0.001 0.976 ± 0.001 0.972 ± 0.001 5 0.490 ± 0.001 0.839 ± 0.409 0.971 ± 0.001 0.980 ± 0.001 0.973 ± 0.001 6 0.490 ± 0.001 0.838 ± 0.409 0.971 ± 0.001 0.978 ± 0.001 0.973 ± 0.001 7 0.492 ± 0.001 0.839 ± 0.408 0.971 ± 0.001 0.979 ± 0.001 0.973 ± 0.001 8 0.493 ± 0.001 0.839 ± 0.406 0.972 ± 0.001 0.978 ± 0.001 0.973 ± 0.001 9 0.492 ± 0.001 0.839 ± 0.408 0.971 ± 0.001 0.978 ± 0.001 0.972 ± 0.001 10 0.492 ± 0.001 0.839 ± 0.409 0.972 ± 0.001 0.981 ± 0.001 0.973 ± 0.001 15 0.491 ± 0.000 0.840 ± 0.410 0.972 ± 0.000 0.982 ± 0.000 0.974 ± 0.000 20 0.493 ± 0.001 0.840 ± 0.407 0.973 ± 0.001 0.980 ± 0.001 0.973 ± 0.001 25 0.491 ± 0.001 0.840 ± 0.410 0.973 ± 0.001 0.982 ± 0.001 0.974 ± 0.001 30 0.493 ± 0.000 0.840 ± 0.407 0.972 ± 0.000 0.981 ± 0.000 0.973 ± 0.000 40 0.491 ± 0.001 0.840 ± 0.409 0.972 ± 0.001 0.979 ± 0.001 0.972 ± 0.001 50 0.492 ± 0.001 0.840 ± 0.409 0.973 ± 0.001 0.980 ± 0.001 0.973 ± 0.001 60 0.492 ± 0.000 0.839 ± 0.409 0.973 ± 0.000 0.978 ± 0.000 0.973 ± 0.000 70 0.492 ± 0.000 0.839 ± 0.408 0.973 ± 0.000 0.979 ± 0.000 0.973 ± 0.000 80 0.492 ± 0.000 0.839 ± 0.408 0.972 ± 0.000 0.980 ± 0.000 0.972 ± 0.000 90 0.490 ± 0.000 0.838 ± 0.409 0.972 ± 0.000 0.978 ± 0.000 0.972 ± 0.000 100 0.489 ± 0.000 0.839 ± 0.411 0.973 ± 0.000 0.981 ± 0.000 0.972 ± 0.000

Table A.1: One-vs-one SVM classification accuracy on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.206 ± 0.109 0.236 ± 0.248 0.321 ± 0.136 0.223 ± 0.075 0.223 ± 0.075 3 0.260 ± 0.115 0.304 ± 0.268 0.383 ± 0.148 0.296 ± 0.084 0.296 ± 0.084 4 0.280 ± 0.123 0.329 ± 0.284 0.412 ± 0.148 0.321 ± 0.093 0.321 ± 0.093 5 0.282 ± 0.121 0.347 ± 0.303 0.424 ± 0.155 0.344 ± 0.104 0.344 ± 0.104 6 0.283 ± 0.121 0.356 ± 0.307 0.426 ± 0.154 0.357 ± 0.107 0.357 ± 0.107 7 0.291 ± 0.123 0.364 ± 0.310 0.427 ± 0.154 0.367 ± 0.109 0.367 ± 0.109 8 0.293 ± 0.123 0.367 ± 0.312 0.427 ± 0.153 0.370 ± 0.112 0.370 ± 0.112 9 0.291 ± 0.124 0.365 ± 0.314 0.429 ± 0.156 0.368 ± 0.111 0.368 ± 0.111 10 0.292 ± 0.122 0.367 ± 0.309 0.429 ± 0.155 0.370 ± 0.109 0.370 ± 0.109 15 0.292 ± 0.122 0.371 ± 0.320 0.429 ± 0.156 0.376 ± 0.116 0.377 ± 0.116 20 0.295 ± 0.122 0.374 ± 0.316 0.428 ± 0.156 0.380 ± 0.113 0.380 ± 0.113 25 0.293 ± 0.122 0.374 ± 0.321 0.429 ± 0.156 0.381 ± 0.116 0.381 ± 0.116 30 0.293 ± 0.123 0.376 ± 0.320 0.429 ± 0.156 0.382 ± 0.116 0.383 ± 0.115 40 0.293 ± 0.122 0.376 ± 0.318 0.429 ± 0.155 0.382 ± 0.115 0.383 ± 0.115 50 0.293 ± 0.123 0.376 ± 0.319 0.429 ± 0.155 0.384 ± 0.115 0.384 ± 0.115 60 0.294 ± 0.123 0.376 ± 0.320 0.429 ± 0.156 0.383 ± 0.115 0.382 ± 0.115 70 0.293 ± 0.123 0.375 ± 0.320 0.429 ± 0.156 0.381 ± 0.115 0.381 ± 0.115 80 0.293 ± 0.122 0.375 ± 0.319 0.429 ± 0.156 0.382 ± 0.115 0.382 ± 0.115 90 0.293 ± 0.123 0.375 ± 0.319 0.429 ± 0.156 0.382 ± 0.115 0.382 ± 0.115 100 0.294 ± 0.123 0.376 ± 0.319 0.429 ± 0.156 0.383 ± 0.115 0.383 ± 0.115

Table A.2: Cophenetic correlation on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.202 ± 0.088 0.237 ± 0.183 0.270 ± 0.118 0.238 ± 0.049 0.238 ± 0.048 3 0.213 ± 0.095 0.251 ± 0.194 0.283 ± 0.125 0.254 ± 0.051 0.253 ± 0.051 4 0.216 ± 0.097 0.255 ± 0.198 0.289 ± 0.127 0.257 ± 0.052 0.256 ± 0.052 5 0.215 ± 0.097 0.258 ± 0.203 0.292 ± 0.130 0.260 ± 0.055 0.260 ± 0.055 6 0.214 ± 0.097 0.258 ± 0.204 0.292 ± 0.130 0.260 ± 0.055 0.261 ± 0.055 7 0.216 ± 0.097 0.260 ± 0.205 0.292 ± 0.130 0.263 ± 0.056 0.263 ± 0.056 8 0.216 ± 0.097 0.260 ± 0.206 0.292 ± 0.130 0.263 ± 0.057 0.263 ± 0.057 9 0.216 ± 0.097 0.259 ± 0.206 0.293 ± 0.131 0.261 ± 0.056 0.261 ± 0.056 10 0.216 ± 0.097 0.258 ± 0.204 0.292 ± 0.131 0.261 ± 0.055 0.260 ± 0.055 15 0.216 ± 0.097 0.260 ± 0.207 0.292 ± 0.131 0.263 ± 0.057 0.263 ± 0.057 20 0.217 ± 0.097 0.260 ± 0.207 0.292 ± 0.131 0.263 ± 0.058 0.263 ± 0.058 25 0.216 ± 0.097 0.260 ± 0.207 0.293 ± 0.131 0.263 ± 0.058 0.262 ± 0.058 30 0.216 ± 0.097 0.260 ± 0.208 0.293 ± 0.131 0.263 ± 0.058 0.263 ± 0.058 40 0.216 ± 0.097 0.260 ± 0.207 0.293 ± 0.131 0.263 ± 0.058 0.263 ± 0.058 50 0.216 ± 0.097 0.260 ± 0.207 0.293 ± 0.131 0.263 ± 0.058 0.263 ± 0.058 60 0.216 ± 0.097 0.260 ± 0.208 0.293 ± 0.131 0.263 ± 0.058 0.263 ± 0.058 70 0.216 ± 0.097 0.260 ± 0.208 0.293 ± 0.131 0.263 ± 0.058 0.263 ± 0.058 80 0.216 ± 0.097 0.260 ± 0.208 0.293 ± 0.131 0.263 ± 0.058 0.263 ± 0.058 90 0.216 ± 0.097 0.260 ± 0.207 0.293 ± 0.131 0.263 ± 0.058 0.263 ± 0.058 100 0.216 ± 0.097 0.260 ± 0.208 0.293 ± 0.131 0.263 ± 0.058 0.263 ± 0.058

Table A.3: Area under the curve of the RNX index on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.442 ± 0.035 0.766 ± 0.373 0.825 ± 0.037 0.929 ± 0.035 0.810 ± 0.035 3 0.458 ± 0.030 0.754 ± 0.334 0.852 ± 0.030 0.923 ± 0.030 0.800 ± 0.030 4 0.472 ± 0.014 0.762 ± 0.324 0.872 ± 0.003 0.926 ± 0.014 0.804 ± 0.014 5 0.453 ± 0.019 0.755 ± 0.338 0.871 ± 0.004 0.917 ± 0.019 0.797 ± 0.019 6 0.451 ± 0.018 0.752 ± 0.340 0.874 ± 0.001 0.912 ± 0.018 0.793 ± 0.018 7 0.452 ± 0.018 0.754 ± 0.340 0.872 ± 0.004 0.915 ± 0.018 0.796 ± 0.018 8 0.449 ± 0.020 0.751 ± 0.340 0.874 ± 0.000 0.908 ± 0.020 0.790 ± 0.020 9 0.453 ± 0.018 0.754 ± 0.339 0.872 ± 0.003 0.916 ± 0.018 0.797 ± 0.018 10 0.449 ± 0.018 0.750 ± 0.340 0.873 ± 0.003 0.907 ± 0.018 0.789 ± 0.018 15 0.456 ± 0.015 0.759 ± 0.338 0.873 ± 0.003 0.926 ± 0.015 0.804 ± 0.015 20 0.457 ± 0.015 0.759 ± 0.337 0.874 ± 0.000 0.925 ± 0.015 0.804 ± 0.015 25 0.458 ± 0.015 0.760 ± 0.337 0.874 ± 0.000 0.927 ± 0.015 0.805 ± 0.015 30 0.450 ± 0.034 0.752 ± 0.345 0.874 ± 0.000 0.911 ± 0.034 0.792 ± 0.034 40 0.412 ± 0.011 0.712 ± 0.356 0.874 ± 0.000 0.725 ± 0.011 0.725 ± 0.011 50 0.412 ± 0.014 0.712 ± 0.357 0.874 ± 0.000 0.725 ± 0.014 0.725 ± 0.014 60 0.416 ± 0.010 0.716 ± 0.354 0.874 ± 0.000 0.732 ± 0.010 0.732 ± 0.010 70 0.414 ± 0.010 0.714 ± 0.355 0.874 ± 0.000 0.727 ± 0.010 0.728 ± 0.010 80 0.411 ± 0.008 0.711 ± 0.357 0.874 ± 0.000 0.724 ± 0.008 0.724 ± 0.008 90 0.416 ± 0.010 0.716 ± 0.354 0.874 ± 0.000 0.731 ± 0.010 0.731 ± 0.010 100 0.412 ± 0.024 0.712 ± 0.358 0.874 ± 0.000 0.726 ± 0.024 0.726 ± 0.024

Table A.4: Clustering purity on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.423 ± 0.017 0.722 ± 0.330 0.771 ± 0.021 0.911 ± 0.017 0.789 ± 0.017 3 0.454 ± 0.021 0.735 ± 0.311 0.789 ± 0.017 0.914 ± 0.021 0.793 ± 0.021 4 0.467 ± 0.011 0.745 ± 0.305 0.805 ± 0.003 0.925 ± 0.011 0.802 ± 0.011 5 0.453 ± 0.015 0.742 ± 0.319 0.804 ± 0.004 0.926 ± 0.015 0.802 ± 0.015 6 0.451 ± 0.013 0.741 ± 0.319 0.807 ± 0.002 0.921 ± 0.013 0.798 ± 0.013 7 0.453 ± 0.011 0.742 ± 0.318 0.805 ± 0.004 0.925 ± 0.011 0.801 ± 0.011 8 0.451 ± 0.016 0.739 ± 0.317 0.806 ± 0.000 0.918 ± 0.016 0.796 ± 0.016 9 0.452 ± 0.014 0.742 ± 0.319 0.805 ± 0.003 0.925 ± 0.014 0.801 ± 0.014 10 0.447 ± 0.014 0.739 ± 0.321 0.805 ± 0.003 0.920 ± 0.014 0.797 ± 0.014 15 0.455 ± 0.014 0.746 ± 0.320 0.805 ± 0.003 0.931 ± 0.013 0.807 ± 0.014 20 0.456 ± 0.011 0.746 ± 0.319 0.806 ± 0.000 0.932 ± 0.011 0.808 ± 0.011 25 0.457 ± 0.011 0.748 ± 0.319 0.806 ± 0.000 0.935 ± 0.011 0.810 ± 0.011 30 0.450 ± 0.026 0.741 ± 0.323 0.806 ± 0.000 0.921 ± 0.026 0.799 ± 0.026 40 0.422 ± 0.013 0.710 ± 0.322 0.806 ± 0.000 0.746 ± 0.013 0.746 ± 0.013 50 0.420 ± 0.017 0.709 ± 0.325 0.806 ± 0.000 0.745 ± 0.017 0.746 ± 0.017 60 0.426 ± 0.010 0.715 ± 0.321 0.806 ± 0.000 0.755 ± 0.010 0.754 ± 0.010 70 0.423 ± 0.010 0.712 ± 0.322 0.806 ± 0.000 0.750 ± 0.010 0.750 ± 0.010 80 0.421 ± 0.008 0.710 ± 0.322 0.806 ± 0.000 0.747 ± 0.008 0.746 ± 0.008 90 0.425 ± 0.010 0.714 ± 0.321 0.806 ± 0.000 0.753 ± 0.010 0.753 ± 0.010 100 0.424 ± 0.016 0.712 ± 0.322 0.806 ± 0.000 0.750 ± 0.016 0.751 ± 0.016

Table A.5: Clustering normalized mutual information on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 1.882 ± 0.209 1.739 ± 0.493 1.681 ± 0.209 1.673 ± 0.209 1.662 ± 0.209 3 1.876 ± 0.252 1.731 ± 0.569 1.678 ± 0.252 1.668 ± 0.252 1.643 ± 0.252 4 1.876 ± 0.273 1.727 ± 0.610 1.677 ± 0.273 1.672 ± 0.273 1.635 ± 0.273 5 1.865 ± 0.269 1.723 ± 0.600 1.678 ± 0.268 1.658 ± 0.268 1.633 ± 0.269 6 1.858 ± 0.290 1.718 ± 0.639 1.677 ± 0.290 1.650 ± 0.290 1.627 ± 0.290 7 1.851 ± 0.295 1.716 ± 0.646 1.678 ± 0.295 1.645 ± 0.295 1.621 ± 0.295 8 1.855 ± 0.289 1.719 ± 0.636 1.678 ± 0.289 1.652 ± 0.288 1.628 ± 0.289 9 1.857 ± 0.289 1.717 ± 0.635 1.677 ± 0.289 1.647 ± 0.288 1.626 ± 0.288 10 1.853 ± 0.289 1.718 ± 0.636 1.677 ± 0.289 1.653 ± 0.289 1.628 ± 0.290 15 1.856 ± 0.281 1.719 ± 0.622 1.678 ± 0.281 1.651 ± 0.282 1.627 ± 0.281 20 1.858 ± 0.279 1.719 ± 0.619 1.678 ± 0.279 1.651 ± 0.280 1.628 ± 0.280 25 1.856 ± 0.280 1.717 ± 0.621 1.678 ± 0.281 1.648 ± 0.281 1.625 ± 0.280 30 1.867 ± 0.267 1.724 ± 0.598 1.678 ± 0.267 1.664 ± 0.267 1.638 ± 0.267 40 1.912 ± 0.234 1.633 ± 0.876 1.007 ± 0.233 1.702 ± 0.234 1.677 ± 0.234 50 1.913 ± 0.234 1.742 ± 0.551 1.670 ± 0.234 1.701 ± 0.234 1.676 ± 0.234 60 1.909 ± 0.231 1.741 ± 0.546 1.667 ± 0.231 1.696 ± 0.231 1.674 ± 0.231 70 1.905 ± 0.236 1.741 ± 0.552 1.672 ± 0.236 1.699 ± 0.236 1.673 ± 0.236 80 1.909 ± 0.240 1.742 ± 0.561 1.666 ± 0.240 1.699 ± 0.240 1.673 ± 0.240 90 1.907 ± 0.231 1.742 ± 0.545 1.670 ± 0.231 1.697 ± 0.231 1.674 ± 0.231 100 1.911 ± 0.229 1.745 ± 0.541 1.676 ± 0.229 1.705 ± 0.228 1.679 ± 0.229

Table A.6: Davies-Bouldin index on the digits dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.165 ± 0.162 0.176 ± 0.467 0.196 ± 0.228 0.180 ± 0.213 0.182 ± 0.223 3 0.162 ± 0.187 0.180 ± 0.501 0.203 ± 0.260 0.172 ± 0.227 0.188 ± 0.231 4 0.147 ± 0.179 0.177 ± 0.495 0.196 ± 0.253 0.190 ± 0.202 0.182 ± 0.223 5 0.179 ± 0.143 0.234 ± 0.395 0.287 ± 0.195 0.263 ± 0.167 0.243 ± 0.180 6 0.214 ± 0.122 0.256 ± 0.331 0.301 ± 0.184 0.284 ± 0.148 0.280 ± 0.155 7 0.242 ± 0.101 0.288 ± 0.292 0.335 ± 0.153 0.304 ± 0.128 0.308 ± 0.136 8 0.311 ± 0.067 0.361 ± 0.208 0.403 ± 0.097 0.381 ± 0.077 0.382 ± 0.084 9 0.399 ± 0.034 0.435 ± 0.116 0.489 ± 0.047 0.453 ± 0.038 0.443 ± 0.042 10 0.455 ± 0.008 0.469 ± 0.085 0.533 ± 0.011 0.472 ± 0.010 0.490 ± 0.010 15 0.355 ± 0.005 0.432 ± 0.113 0.479 ± 0.008 0.494 ± 0.006 0.474 ± 0.007 20 0.166 ± 0.185 0.182 ± 0.527 0.211 ± 0.264 0.199 ± 0.248 0.194 ± 0.237 25 0.148 ± 0.194 0.187 ± 0.507 0.203 ± 0.278 0.191 ± 0.225 0.195 ± 0.239 30 0.165 ± 0.196 0.187 ± 0.505 0.210 ± 0.265 0.188 ± 0.244 0.194 ± 0.238 40 0.166 ± 0.192 0.195 ± 0.511 0.221 ± 0.259 0.178 ± 0.234 0.198 ± 0.245 50 0.140 ± 0.175 0.173 ± 0.519 0.203 ± 0.277 0.188 ± 0.243 0.189 ± 0.239 60 0.142 ± 0.178 0.175 ± 0.465 0.190 ± 0.240 0.184 ± 0.213 0.193 ± 0.220 70 0.147 ± 0.159 0.175 ± 0.460 0.201 ± 0.244 0.190 ± 0.215 0.186 ± 0.218 80 0.163 ± 0.168 0.180 ± 0.464 0.198 ± 0.244 0.190 ± 0.209 0.185 ± 0.226 90 0.139 ± 0.210 0.171 ± 0.511 0.185 ± 0.256 0.193 ± 0.253 0.182 ± 0.229 100 0.130 ± 0.192 0.169 ± 0.512 0.204 ± 0.261 0.186 ± 0.210 0.179 ± 0.232

Table A.7: One-vs-one SVM classification accuracy on the Reuters multilingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 -0.075 ± 0.086 -0.066 ± 0.172 -0.055 ± 0.066 -0.064 ± 0.082 -0.067 ± 0.084 3 -0.069 ± 0.089 -0.062 ± 0.175 -0.054 ± 0.065 -0.058 ± 0.079 -0.064 ± 0.080 4 -0.057 ± 0.073 -0.049 ± 0.142 -0.039 ± 0.054 -0.056 ± 0.073 -0.054 ± 0.068 5 -0.061 ± 0.061 -0.054 ± 0.119 -0.047 ± 0.044 -0.059 ± 0.059 -0.054 ± 0.058 6 -0.062 ± 0.064 -0.052 ± 0.113 -0.042 ± 0.038 -0.053 ± 0.050 -0.054 ± 0.052 7 -0.063 ± 0.050 -0.051 ± 0.096 -0.045 ± 0.036 -0.055 ± 0.047 -0.054 ± 0.047 8 -0.059 ± 0.039 -0.052 ± 0.079 -0.046 ± 0.029 -0.053 ± 0.033 -0.054 ± 0.035 9 -0.063 ± 0.027 -0.053 ± 0.058 -0.044 ± 0.022 -0.055 ± 0.023 -0.054 ± 0.025 10 -0.064 ± 0.019 -0.052 ± 0.041 -0.047 ± 0.013 -0.052 ± 0.017 -0.054 ± 0.017 15 -0.064 ± 0.020 -0.054 ± 0.044 -0.040 ± 0.014 -0.059 ± 0.018 -0.056 ± 0.018 20 -0.027 ± 0.035 -0.021 ± 0.066 -0.017 ± 0.025 -0.024 ± 0.033 -0.023 ± 0.031 25 -0.028 ± 0.041 -0.026 ± 0.081 -0.023 ± 0.033 -0.030 ± 0.037 -0.028 ± 0.036 30 -0.032 ± 0.040 -0.026 ± 0.077 -0.022 ± 0.026 -0.026 ± 0.037 -0.028 ± 0.036 40 -0.029 ± 0.038 -0.026 ± 0.079 -0.020 ± 0.032 -0.027 ± 0.036 -0.027 ± 0.035 50 -0.031 ± 0.038 -0.027 ± 0.074 -0.024 ± 0.027 -0.028 ± 0.033 -0.028 ± 0.035 60 -0.032 ± 0.039 -0.028 ± 0.077 -0.022 ± 0.031 -0.027 ± 0.039 -0.028 ± 0.036 70 -0.029 ± 0.042 -0.025 ± 0.083 -0.020 ± 0.028 -0.026 ± 0.041 -0.026 ± 0.038 80 -0.032 ± 0.043 -0.026 ± 0.081 -0.023 ± 0.030 -0.026 ± 0.037 -0.027 ± 0.037 90 -0.031 ± 0.040 -0.025 ± 0.076 -0.020 ± 0.029 -0.025 ± 0.033 -0.027 ± 0.036 100 -0.027 ± 0.035 -0.024 ± 0.069 -0.021 ± 0.028 -0.026 ± 0.031 -0.026 ± 0.031

Table A.8: Cophenetic correlation on the Reuters multilingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.022 ± 0.027 0.026 ± 0.073 0.030 ± 0.036 0.028 ± 0.037 0.027 ± 0.033 3 0.021 ± 0.026 0.024 ± 0.069 0.027 ± 0.036 0.024 ± 0.030 0.026 ± 0.032 4 0.023 ± 0.030 0.027 ± 0.073 0.031 ± 0.038 0.026 ± 0.034 0.027 ± 0.033 5 0.030 ± 0.023 0.035 ± 0.058 0.039 ± 0.030 0.037 ± 0.028 0.036 ± 0.027 6 0.036 ± 0.018 0.042 ± 0.051 0.050 ± 0.026 0.042 ± 0.023 0.042 ± 0.023 7 0.036 ± 0.017 0.044 ± 0.046 0.048 ± 0.022 0.049 ± 0.021 0.046 ± 0.021 8 0.050 ± 0.010 0.054 ± 0.032 0.064 ± 0.017 0.054 ± 0.013 0.057 ± 0.013 9 0.053 ± 0.007 0.062 ± 0.025 0.073 ± 0.009 0.064 ± 0.007 0.066 ± 0.008 10 0.061 ± 0.002 0.069 ± 0.018 0.076 ± 0.003 0.071 ± 0.003 0.073 ± 0.003 15 0.056 ± 0.003 0.067 ± 0.019 0.079 ± 0.004 0.070 ± 0.003 0.071 ± 0.003 20 0.023 ± 0.030 0.027 ± 0.077 0.029 ± 0.037 0.027 ± 0.037 0.029 ± 0.035 25 0.021 ± 0.026 0.025 ± 0.068 0.028 ± 0.034 0.028 ± 0.033 0.027 ± 0.033 30 0.021 ± 0.026 0.025 ± 0.072 0.028 ± 0.035 0.028 ± 0.033 0.027 ± 0.033 40 0.023 ± 0.029 0.026 ± 0.072 0.030 ± 0.035 0.030 ± 0.034 0.028 ± 0.034 50 0.024 ± 0.025 0.026 ± 0.071 0.030 ± 0.036 0.027 ± 0.036 0.027 ± 0.034 60 0.022 ± 0.030 0.026 ± 0.073 0.032 ± 0.038 0.029 ± 0.035 0.028 ± 0.036 70 0.024 ± 0.027 0.026 ± 0.072 0.031 ± 0.036 0.026 ± 0.034 0.027 ± 0.034 80 0.024 ± 0.027 0.028 ± 0.072 0.034 ± 0.036 0.028 ± 0.034 0.030 ± 0.032 90 0.022 ± 0.023 0.028 ± 0.066 0.035 ± 0.034 0.030 ± 0.028 0.029 ± 0.030 100 0.022 ± 0.023 0.026 ± 0.059 0.031 ± 0.031 0.026 ± 0.025 0.028 ± 0.028

Table A.9: Area under the curve of the RNX index on the Reuters multilingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.102 ± 0.133 0.123 ± 0.345 0.135 ± 0.173 0.138 ± 0.168 0.133 ± 0.163 3 0.105 ± 0.126 0.125 ± 0.347 0.141 ± 0.171 0.124 ± 0.165 0.131 ± 0.160 4 0.103 ± 0.127 0.126 ± 0.343 0.144 ± 0.175 0.129 ± 0.140 0.128 ± 0.156 5 0.135 ± 0.112 0.156 ± 0.282 0.169 ± 0.141 0.155 ± 0.118 0.167 ± 0.129 6 0.160 ± 0.086 0.176 ± 0.236 0.213 ± 0.111 0.190 ± 0.111 0.191 ± 0.112 7 0.163 ± 0.073 0.195 ± 0.217 0.240 ± 0.111 0.225 ± 0.111 0.209 ± 0.099 8 0.213 ± 0.056 0.240 ± 0.150 0.279 ± 0.069 0.268 ± 0.067 0.256 ± 0.066 9 0.255 ± 0.033 0.284 ± 0.111 0.324 ± 0.042 0.281 ± 0.037 0.295 ± 0.038 10 0.242 ± 0.012 0.298 ± 0.111 0.384 ± 0.020 0.305 ± 0.017 0.325 ± 0.017 15 0.268 ± 0.005 0.306 ± 0.066 0.326 ± 0.007 0.340 ± 0.007 0.321 ± 0.007 20 0.109 ± 0.151 0.129 ± 0.384 0.150 ± 0.201 0.140 ± 0.177 0.136 ± 0.166 25 0.097 ± 0.126 0.119 ± 0.343 0.132 ± 0.173 0.142 ± 0.162 0.129 ± 0.158 30 0.109 ± 0.130 0.127 ± 0.365 0.155 ± 0.190 0.124 ± 0.155 0.137 ± 0.167 40 0.114 ± 0.153 0.135 ± 0.408 0.155 ± 0.208 0.129 ± 0.203 0.133 ± 0.190 50 0.116 ± 0.148 0.118 ± 0.416 0.121 ± 0.225 0.130 ± 0.183 0.126 ± 0.199 60 0.099 ± 0.158 0.119 ± 0.437 0.138 ± 0.202 0.135 ± 0.218 0.126 ± 0.207 70 0.095 ± 0.180 0.120 ± 0.469 0.137 ± 0.239 0.126 ± 0.229 0.127 ± 0.216 80 0.099 ± 0.156 0.116 ± 0.453 0.138 ± 0.240 0.118 ± 0.214 0.126 ± 0.214 90 0.091 ± 0.165 0.119 ± 0.472 0.138 ± 0.231 0.112 ± 0.226 0.124 ± 0.212 100 0.115 ± 0.165 0.129 ± 0.454 0.154 ± 0.233 0.140 ± 0.219 0.134 ± 0.211

Table A.10: Clustering purity on the Reuters multilingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.010 ± 0.011 0.012 ± 0.031 0.014 ± 0.016 0.013 ± 0.014 0.012 ± 0.015 3 0.007 ± 0.008 0.008 ± 0.021 0.009 ± 0.010 0.007 ± 0.010 0.008 ± 0.010 4 0.005 ± 0.007 0.006 ± 0.017 0.006 ± 0.008 0.006 ± 0.007 0.006 ± 0.008 5 0.008 ± 0.006 0.009 ± 0.015 0.011 ± 0.007 0.009 ± 0.007 0.010 ± 0.007 6 0.010 ± 0.005 0.012 ± 0.015 0.012 ± 0.008 0.012 ± 0.007 0.012 ± 0.007 7 0.009 ± 0.005 0.012 ± 0.015 0.015 ± 0.008 0.013 ± 0.007 0.013 ± 0.007 8 0.012 ± 0.005 0.016 ± 0.014 0.020 ± 0.007 0.018 ± 0.006 0.018 ± 0.006 9 0.017 ± 0.004 0.020 ± 0.013 0.023 ± 0.006 0.022 ± 0.006 0.021 ± 0.006 10 0.020 ± 0.004 0.022 ± 0.012 0.025 ± 0.006 0.024 ± 0.005 0.024 ± 0.005 15 0.014 ± 0.004 0.019 ± 0.012 0.022 ± 0.006 0.023 ± 0.005 0.020 ± 0.005 20 0.010 ± 0.013 0.012 ± 0.034 0.014 ± 0.016 0.014 ± 0.016 0.013 ± 0.016 25 0.006 ± 0.008 0.007 ± 0.020 0.008 ± 0.010 0.007 ± 0.010 0.007 ± 0.009 30 0.010 ± 0.011 0.012 ± 0.034 0.015 ± 0.018 0.013 ± 0.016 0.013 ± 0.016 40 0.011 ± 0.012 0.013 ± 0.033 0.014 ± 0.017 0.013 ± 0.017 0.014 ± 0.016 50 0.011 ± 0.013 0.014 ± 0.033 0.016 ± 0.017 0.015 ± 0.015 0.014 ± 0.015 60 0.011 ± 0.014 0.013 ± 0.035 0.016 ± 0.017 0.015 ± 0.016 0.015 ± 0.016 70 0.011 ± 0.011 0.014 ± 0.035 0.016 ± 0.018 0.014 ± 0.016 0.014 ± 0.016 80 0.010 ± 0.013 0.013 ± 0.036 0.015 ± 0.018 0.014 ± 0.016 0.014 ± 0.017 90 0.011 ± 0.014 0.013 ± 0.039 0.016 ± 0.020 0.014 ± 0.019 0.014 ± 0.018 100 0.011 ± 0.013 0.013 ± 0.035 0.015 ± 0.019 0.014 ± 0.019 0.014 ± 0.017

Table A.11: Clustering normalized mutual information on the Reuters multilingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.851 ± 0.745 0.742 ± 2.056 0.620 ± 0.996 0.829 ± 0.973 0.793 ± 0.971 3 0.879 ± 0.757 0.764 ± 2.095 0.684 ± 1.113 0.725 ± 1.002 0.796 ± 0.975 4 0.821 ± 0.771 0.741 ± 2.045 0.666 ± 1.070 0.775 ± 0.994 0.797 ± 0.976 5 1.226 ± 0.622 0.977 ± 1.737 0.801 ± 0.853 1.007 ± 0.804 1.034 ± 0.781 6 1.315 ± 0.542 1.127 ± 1.475 0.928 ± 0.755 1.273 ± 0.650 1.176 ± 0.665 7 1.373 ± 0.491 1.235 ± 1.345 1.057 ± 0.659 1.263 ± 0.597 1.283 ± 0.578 8 1.694 ± 0.263 1.492 ± 0.781 1.328 ± 0.388 1.522 ± 0.334 1.568 ± 0.345 9 1.912 ± 0.121 1.760 ± 0.691 1.380 ± 0.171 1.819 ± 0.163 1.805 ± 0.151 10 2.235 ± 0.004 1.857 ± 0.566 1.595 ± 0.006 2.038 ± 0.005 1.983 ± 0.005 15 2.298 ± 0.004 1.872 ± 0.684 1.526 ± 0.005 1.880 ± 0.005 1.980 ± 0.005 20 0.908 ± 0.739 0.753 ± 2.089 0.642 ± 1.094 0.830 ± 0.946 0.789 ± 0.966 25 0.898 ± 0.846 0.745 ± 2.057 0.588 ± 1.078 0.831 ± 0.968 0.793 ± 0.971 30 0.850 ± 0.780 0.753 ± 2.039 0.715 ± 1.022 0.765 ± 0.956 0.773 ± 0.947 40 0.893 ± 0.774 0.761 ± 2.202 0.604 ± 1.178 0.837 ± 1.025 0.839 ± 1.007 50 0.896 ± 0.741 0.837 ± 2.062 0.750 ± 1.099 0.964 ± 0.985 0.874 ± 0.968 60 0.802 ± 0.738 0.703 ± 2.024 0.547 ± 1.149 0.784 ± 0.959 0.767 ± 0.956 70 0.785 ± 0.674 0.671 ± 1.986 0.578 ± 1.045 0.721 ± 0.894 0.703 ± 0.950 80 0.711 ± 0.859 0.621 ± 2.075 0.486 ± 0.976 0.713 ± 0.953 0.674 ± 0.967 90 0.798 ± 0.854 0.708 ± 2.088 0.604 ± 1.047 0.746 ± 0.992 0.737 ± 0.963 100 0.872 ± 0.708 0.741 ± 1.962 0.584 ± 0.976 0.781 ± 0.878 0.791 ± 0.918

Table A.12: Davies-Bouldin index on the Reuters multilingual corpus dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.875 ± 0.005 0.890 ± 0.021 0.904 ± 0.004 0.859 ± 0.014 0.857 ± 0.014 3 0.871 ± 0.005 0.887 ± 0.023 0.903 ± 0.004 0.905 ± 0.007 0.907 ± 0.007 4 0.872 ± 0.002 0.889 ± 0.024 0.905 ± 0.002 0.920 ± 0.004 0.917 ± 0.004 5 0.871 ± 0.002 0.889 ± 0.026 0.907 ± 0.001 0.929 ± 0.002 0.922 ± 0.002 6 0.869 ± 0.003 0.888 ± 0.027 0.907 ± 0.001 0.921 ± 0.002 0.923 ± 0.002 7 0.869 ± 0.004 0.887 ± 0.027 0.906 ± 0.002 0.918 ± 0.002 0.922 ± 0.002 8 0.868 ± 0.003 0.887 ± 0.027 0.906 ± 0.001 0.931 ± 0.002 0.924 ± 0.002 9 0.869 ± 0.002 0.888 ± 0.026 0.906 ± 0.001 0.925 ± 0.002 0.923 ± 0.002 10 0.868 ± 0.002 0.887 ± 0.028 0.907 ± 0.001 0.925 ± 0.002 0.923 ± 0.002 15 0.873 ± 0.002 0.890 ± 0.025 0.907 ± 0.001 0.924 ± 0.002 0.922 ± 0.002 20 0.874 ± 0.001 0.890 ± 0.023 0.906 ± 0.002 0.924 ± 0.001 0.922 ± 0.001 25 0.874 ± 0.003 0.891 ± 0.023 0.907 ± 0.001 0.828 ± 0.277 0.830 ± 0.277 30 0.874 ± 0.001 0.890 ± 0.023 0.906 ± 0.001 0.837 ± 0.001 0.842 ± 0.001 40 0.875 ± 0.002 0.890 ± 0.022 0.905 ± 0.001 0.848 ± 0.001 0.848 ± 0.001 50 0.875 ± 0.001 0.889 ± 0.021 0.904 ± 0.001 0.848 ± 0.001 0.847 ± 0.001 60 0.874 ± 0.001 0.888 ± 0.021 0.903 ± 0.001 0.831 ± 0.001 0.830 ± 0.001 70 0.874 ± 0.001 0.888 ± 0.021 0.903 ± 0.001 0.830 ± 0.001 0.829 ± 0.001 80 0.875 ± 0.001 0.889 ± 0.020 0.902 ± 0.001 0.848 ± 0.001 0.847 ± 0.001 90 0.875 ± 0.001 0.888 ± 0.019 0.901 ± 0.001 0.854 ± 0.001 0.855 ± 0.001 100 0.876 ± 0.001 0.888 ± 0.018 0.901 ± 0.001 0.833 ± 0.001 0.839 ± 0.001

Table A.13: One-vs-one SVM classification accuracy on the BBC segmented news dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.288 ± 0.010 0.290 ± 0.012 0.292 ± 0.006 0.218 ± 0.019 0.218 ± 0.019 3 0.327 ± 0.013 0.331 ± 0.015 0.336 ± 0.003 0.293 ± 0.012 0.294 ± 0.012 4 0.356 ± 0.016 0.364 ± 0.020 0.372 ± 0.001 0.342 ± 0.010 0.342 ± 0.010 5 0.376 ± 0.019 0.386 ± 0.024 0.397 ± 0.002 0.384 ± 0.004 0.382 ± 0.004 6 0.390 ± 0.020 0.402 ± 0.026 0.413 ± 0.002 0.409 ± 0.005 0.408 ± 0.005 7 0.399 ± 0.021 0.410 ± 0.026 0.422 ± 0.001 0.428 ± 0.004 0.425 ± 0.004 8 0.406 ± 0.022 0.417 ± 0.027 0.428 ± 0.002 0.435 ± 0.005 0.434 ± 0.005 9 0.409 ± 0.022 0.421 ± 0.027 0.432 ± 0.001 0.445 ± 0.004 0.447 ± 0.004 10 0.412 ± 0.022 0.423 ± 0.027 0.434 ± 0.001 0.454 ± 0.003 0.455 ± 0.003 15 0.421 ± 0.023 0.431 ± 0.027 0.441 ± 0.001 0.470 ± 0.003 0.474 ± 0.003 20 0.425 ± 0.023 0.434 ± 0.027 0.444 ± 0.001 0.479 ± 0.003 0.481 ± 0.003 25 0.426 ± 0.023 0.435 ± 0.027 0.445 ± 0.001 0.436 ± 0.147 0.439 ± 0.146 30 0.427 ± 0.023 0.437 ± 0.027 0.446 ± 0.001 0.415 ± 0.012 0.414 ± 0.012 40 0.429 ± 0.023 0.438 ± 0.027 0.447 ± 0.001 0.420 ± 0.012 0.419 ± 0.012 50 0.430 ± 0.024 0.439 ± 0.027 0.448 ± 0.001 0.419 ± 0.011 0.416 ± 0.011 60 0.430 ± 0.024 0.439 ± 0.027 0.448 ± 0.001 0.417 ± 0.012 0.419 ± 0.012 70 0.431 ± 0.024 0.439 ± 0.027 0.448 ± 0.001 0.416 ± 0.012 0.417 ± 0.012 80 0.431 ± 0.024 0.440 ± 0.027 0.448 ± 0.001 0.425 ± 0.012 0.422 ± 0.012 90 0.431 ± 0.024 0.440 ± 0.027 0.449 ± 0.001 0.422 ± 0.012 0.421 ± 0.012 100 0.432 ± 0.024 0.440 ± 0.027 0.449 ± 0.001 0.415 ± 0.012 0.415 ± 0.012

Table A.14: Cophenetic correlation on the BBC segmented news dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.309 ± 0.085 0.315 ± 0.085 0.322 ± 0.003 0.283 ± 0.006 0.282 ± 0.006 3 0.323 ± 0.090 0.330 ± 0.091 0.337 ± 0.002 0.317 ± 0.005 0.317 ± 0.005 4 0.330 ± 0.092 0.337 ± 0.093 0.344 ± 0.001 0.334 ± 0.004 0.335 ± 0.004 5 0.333 ± 0.093 0.341 ± 0.094 0.348 ± 0.001 0.347 ± 0.002 0.347 ± 0.002 6 0.335 ± 0.094 0.343 ± 0.095 0.351 ± 0.001 0.353 ± 0.002 0.353 ± 0.002 7 0.337 ± 0.094 0.344 ± 0.095 0.352 ± 0.001 0.357 ± 0.001 0.357 ± 0.001 8 0.338 ± 0.094 0.345 ± 0.095 0.353 ± 0.001 0.361 ± 0.002 0.359 ± 0.002 9 0.338 ± 0.095 0.346 ± 0.095 0.354 ± 0.001 0.365 ± 0.001 0.362 ± 0.001 10 0.339 ± 0.095 0.346 ± 0.095 0.354 ± 0.001 0.361 ± 0.001 0.364 ± 0.001 15 0.340 ± 0.095 0.348 ± 0.095 0.355 ± 0.001 0.367 ± 0.001 0.368 ± 0.001 20 0.341 ± 0.095 0.348 ± 0.095 0.356 ± 0.001 0.372 ± 0.001 0.369 ± 0.001 25 0.341 ± 0.095 0.348 ± 0.096 0.356 ± 0.001 0.334 ± 0.112 0.333 ± 0.111 30 0.341 ± 0.095 0.348 ± 0.096 0.356 ± 0.001 0.329 ± 0.046 0.331 ± 0.046 40 0.341 ± 0.095 0.349 ± 0.096 0.356 ± 0.001 0.332 ± 0.045 0.330 ± 0.045 50 0.341 ± 0.095 0.349 ± 0.096 0.356 ± 0.001 0.330 ± 0.046 0.330 ± 0.046 60 0.342 ± 0.095 0.349 ± 0.096 0.356 ± 0.001 0.331 ± 0.046 0.329 ± 0.046 70 0.342 ± 0.095 0.349 ± 0.095 0.356 ± 0.001 0.331 ± 0.046 0.331 ± 0.046 80 0.342 ± 0.095 0.349 ± 0.096 0.356 ± 0.001 0.332 ± 0.045 0.332 ± 0.046 90 0.342 ± 0.095 0.349 ± 0.096 0.356 ± 0.001 0.334 ± 0.045 0.332 ± 0.046 100 0.342 ± 0.095 0.349 ± 0.096 0.356 ± 0.001 0.329 ± 0.046 0.334 ± 0.046

Table A.15: Area under the curve of the RNX index on the BBC segmented news dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.812 ± 0.021 0.810 ± 0.041 0.808 ± 0.035 0.616 ± 0.078 0.619 ± 0.078 3 0.818 ± 0.063 0.820 ± 0.102 0.822 ± 0.081 0.733 ± 0.058 0.730 ± 0.058 4 0.834 ± 0.029 0.847 ± 0.085 0.859 ± 0.078 0.825 ± 0.058 0.824 ± 0.058 5 0.845 ± 0.004 0.866 ± 0.030 0.887 ± 0.006 0.886 ± 0.010 0.887 ± 0.010 6 0.844 ± 0.007 0.867 ± 0.033 0.889 ± 0.004 0.900 ± 0.012 0.902 ± 0.012 7 0.832 ± 0.043 0.854 ± 0.070 0.875 ± 0.047 0.901 ± 0.007 0.902 ± 0.007 8 0.847 ± 0.004 0.869 ± 0.031 0.891 ± 0.002 0.914 ± 0.005 0.911 ± 0.005 9 0.846 ± 0.002 0.869 ± 0.032 0.891 ± 0.004 0.911 ± 0.004 0.910 ± 0.004 10 0.846 ± 0.002 0.869 ± 0.032 0.892 ± 0.001 0.912 ± 0.003 0.913 ± 0.003 15 0.847 ± 0.003 0.847 ± 0.070 0.846 ± 0.070 0.914 ± 0.002 0.916 ± 0.002 20 0.847 ± 0.002 0.853 ± 0.064 0.859 ± 0.063 0.919 ± 0.002 0.917 ± 0.002 25 0.847 ± 0.002 0.854 ± 0.062 0.860 ± 0.062 0.832 ± 0.276 0.825 ± 0.275 30 0.846 ± 0.002 0.860 ± 0.051 0.875 ± 0.047 0.817 ± 0.023 0.816 ± 0.023 40 0.846 ± 0.001 0.868 ± 0.032 0.891 ± 0.001 0.812 ± 0.001 0.823 ± 0.001 50 0.847 ± 0.001 0.861 ± 0.050 0.875 ± 0.046 0.813 ± 0.022 0.818 ± 0.022 60 0.847 ± 0.001 0.853 ± 0.064 0.858 ± 0.064 0.805 ± 0.031 0.805 ± 0.031 70 0.846 ± 0.001 0.844 ± 0.072 0.842 ± 0.072 0.805 ± 0.035 0.808 ± 0.035 80 0.848 ± 0.001 0.869 ± 0.030 0.890 ± 0.002 0.826 ± 0.001 0.822 ± 0.001 90 0.846 ± 0.001 0.845 ± 0.072 0.843 ± 0.072 0.798 ± 0.035 0.797 ± 0.034 100 0.846 ± 0.002 0.845 ± 0.070 0.844 ± 0.070 0.804 ± 0.034 0.800 ± 0.034

Table A.16: Clustering purity on the BBC segmented news dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.580 ± 0.028 0.588 ± 0.056 0.596 ± 0.047 0.350 ± 0.067 0.352 ± 0.068 3 0.591 ± 0.057 0.613 ± 0.085 0.635 ± 0.055 0.511 ± 0.060 0.512 ± 0.060 4 0.607 ± 0.028 0.645 ± 0.087 0.683 ± 0.062 0.624 ± 0.068 0.623 ± 0.068 5 0.620 ± 0.007 0.663 ± 0.062 0.706 ± 0.012 0.711 ± 0.018 0.709 ± 0.018 6 0.617 ± 0.013 0.664 ± 0.068 0.711 ± 0.007 0.744 ± 0.017 0.741 ± 0.018 7 0.609 ± 0.036 0.656 ± 0.083 0.703 ± 0.033 0.743 ± 0.011 0.739 ± 0.011 8 0.622 ± 0.007 0.667 ± 0.065 0.712 ± 0.005 0.757 ± 0.011 0.756 ± 0.011 9 0.622 ± 0.004 0.668 ± 0.067 0.715 ± 0.008 0.759 ± 0.007 0.753 ± 0.007 10 0.621 ± 0.004 0.668 ± 0.067 0.715 ± 0.003 0.762 ± 0.007 0.761 ± 0.007 15 0.622 ± 0.006 0.655 ± 0.063 0.688 ± 0.042 0.770 ± 0.005 0.766 ± 0.005 20 0.623 ± 0.004 0.657 ± 0.065 0.692 ± 0.042 0.767 ± 0.006 0.769 ± 0.006 25 0.623 ± 0.004 0.659 ± 0.064 0.694 ± 0.040 0.693 ± 0.230 0.691 ± 0.230 30 0.621 ± 0.003 0.661 ± 0.066 0.701 ± 0.033 0.630 ± 0.017 0.628 ± 0.017 40 0.620 ± 0.002 0.666 ± 0.066 0.713 ± 0.002 0.634 ± 0.002 0.630 ± 0.002 50 0.622 ± 0.003 0.662 ± 0.065 0.702 ± 0.031 0.630 ± 0.016 0.630 ± 0.016 60 0.623 ± 0.002 0.657 ± 0.064 0.691 ± 0.042 0.615 ± 0.021 0.622 ± 0.021 70 0.621 ± 0.002 0.649 ± 0.064 0.678 ± 0.049 0.615 ± 0.025 0.617 ± 0.025 80 0.623 ± 0.002 0.667 ± 0.063 0.712 ± 0.003 0.631 ± 0.002 0.631 ± 0.002 90 0.621 ± 0.002 0.650 ± 0.065 0.678 ± 0.051 0.624 ± 0.025 0.623 ± 0.026 100 0.621 ± 0.003 0.651 ± 0.063 0.682 ± 0.046 0.621 ± 0.023 0.620 ± 0.023

Table A.17: Clustering normalized mutual information on the BBC segmented news dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 1.961 ± 0.004 1.962 ± 0.005 1.962 ± 0.002 1.984 ± 0.004 1.976 ± 0.004 3 1.960 ± 0.005 1.960 ± 0.007 1.960 ± 0.005 1.959 ± 0.005 1.970 ± 0.005 4 1.959 ± 0.005 1.958 ± 0.006 1.957 ± 0.003 1.972 ± 0.005 1.961 ± 0.005 5 1.958 ± 0.001 1.957 ± 0.002 1.956 ± 0.001 1.962 ± 0.001 1.956 ± 0.001 6 1.958 ± 0.000 1.957 ± 0.002 1.956 ± 0.001 1.973 ± 0.001 1.954 ± 0.001 7 1.960 ± 0.005 1.958 ± 0.007 1.957 ± 0.004 1.949 ± 0.001 1.954 ± 0.001 8 1.958 ± 0.000 1.957 ± 0.002 1.955 ± 0.001 1.955 ± 0.001 1.954 ± 0.001 9 1.958 ± 0.001 1.957 ± 0.002 1.955 ± 0.001 1.966 ± 0.001 1.954 ± 0.001 10 1.958 ± 0.000 1.957 ± 0.002 1.955 ± 0.000 1.936 ± 0.001 1.953 ± 0.001 15 1.958 ± 0.005 1.959 ± 0.005 1.959 ± 0.001 1.953 ± 0.001 1.953 ± 0.001 20 1.958 ± 0.003 1.957 ± 0.003 1.956 ± 0.001 1.950 ± 0.001 1.953 ± 0.001 25 1.958 ± 0.004 1.958 ± 0.004 1.958 ± 0.001 1.761 ± 0.584 1.758 ± 0.586 30 1.959 ± 0.003 1.958 ± 0.003 1.957 ± 0.001 1.868 ± 0.002 1.874 ± 0.002 40 1.959 ± 0.000 1.957 ± 0.002 1.956 ± 0.001 1.866 ± 0.000 1.860 ± 0.000 50 1.959 ± 0.003 1.958 ± 0.003 1.957 ± 0.001 1.855 ± 0.002 1.853 ± 0.002 60 1.958 ± 0.003 1.957 ± 0.003 1.956 ± 0.001 1.857 ± 0.002 1.872 ± 0.002 70 1.959 ± 0.004 1.958 ± 0.004 1.958 ± 0.001 1.857 ± 0.002 1.850 ± 0.002 80 1.959 ± 0.000 1.957 ± 0.002 1.956 ± 0.001 1.855 ± 0.000 1.868 ± 0.000 90 1.959 ± 0.004 1.958 ± 0.004 1.958 ± 0.001 1.841 ± 0.002 1.851 ± 0.002 100 1.959 ± 0.004 1.958 ± 0.004 1.958 ± 0.001 1.874 ± 0.002 1.866 ± 0.002

Table A.18: Davies-Bouldin index on the BBC segmented news dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.062 ± 0.003 0.062 ± 0.011 0.062 ± 0.005 0.061 ± 0.001 0.063 ± 0.003 3 0.068 ± 0.001 0.068 ± 0.004 0.069 ± 0.002 0.067 ± 0.001 0.069 ± 0.001 4 0.073 ± 0.001 0.075 ± 0.004 0.076 ± 0.001 0.070 ± 0.001 0.074 ± 0.001 5 0.076 ± 0.001 0.078 ± 0.003 0.079 ± 0.001 0.072 ± 0.000 0.077 ± 0.001 6 0.080 ± 0.001 0.082 ± 0.004 0.082 ± 0.001 0.078 ± 0.001 0.082 ± 0.001 7 0.082 ± 0.001 0.084 ± 0.004 0.085 ± 0.001 0.080 ± 0.001 0.084 ± 0.001 8 0.083 ± 0.001 0.084 ± 0.003 0.085 ± 0.001 0.080 ± 0.001 0.084 ± 0.001 9 0.083 ± 0.001 0.085 ± 0.003 0.085 ± 0.001 0.081 ± 0.001 0.085 ± 0.001 10 0.081 ± 0.001 0.084 ± 0.005 0.086 ± 0.001 0.076 ± 0.001 0.082 ± 0.001 15 0.078 ± 0.001 0.083 ± 0.006 0.085 ± 0.001 0.071 ± 0.001 0.079 ± 0.001 20 0.074 ± 0.001 0.077 ± 0.006 0.079 ± 0.002 0.068 ± 0.001 0.075 ± 0.001 25 0.069 ± 0.001 0.071 ± 0.002 0.071 ± 0.001 0.067 ± 0.001 0.070 ± 0.001 30 0.067 ± 0.000 0.069 ± 0.002 0.070 ± 0.000 0.064 ± 0.000 0.068 ± 0.000 40 0.038 ± 0.001 0.021 ± 0.022 0.013 ± 0.003 0.063 ± 0.000 0.040 ± 0.001 50 0.025 ± 0.002 0.017 ± 0.011 0.013 ± 0.002 0.036 ± 0.003 0.026 ± 0.003 60 0.017 ± 0.001 0.014 ± 0.011 0.013 ± 0.001 0.021 ± 0.002 0.017 ± 0.001 70 0.012 ± 0.007 0.012 ± 0.013 0.013 ± 0.001 0.012 ± 0.001 0.013 ± 0.008 80 0.010 ± 0.005 0.011 ± 0.012 0.012 ± 0.008 0.007 ± 0.001 0.010 ± 0.005 90 0.008 ± 0.004 0.011 ± 0.013 0.012 ± 0.006 0.004 ± 0.007 0.008 ± 0.004 100 0.007 ± 0.003 0.010 ± 0.010 0.012 ± 0.004 0.002 ± 0.005 0.007 ± 0.003

Table A.19: One-vs-one SVM classification accuracy on the Animals with Attributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.145 ± 0.115 0.145 ± 0.278 0.145 ± 0.110 0.146 ± 0.105 0.148 ± 0.112 3 0.165 ± 0.135 0.166 ± 0.327 0.168 ± 0.130 0.172 ± 0.125 0.172 ± 0.132 4 0.179 ± 0.149 0.180 ± 0.360 0.182 ± 0.143 0.184 ± 0.138 0.186 ± 0.146 5 0.184 ± 0.154 0.186 ± 0.374 0.190 ± 0.149 0.195 ± 0.144 0.193 ± 0.152 6 0.187 ± 0.157 0.188 ± 0.380 0.190 ± 0.151 0.194 ± 0.145 0.195 ± 0.154 7 0.187 ± 0.158 0.188 ± 0.382 0.191 ± 0.152 0.194 ± 0.146 0.195 ± 0.155 8 0.188 ± 0.159 0.190 ± 0.384 0.192 ± 0.153 0.195 ± 0.147 0.196 ± 0.156 9 0.188 ± 0.159 0.189 ± 0.385 0.192 ± 0.153 0.195 ± 0.147 0.195 ± 0.156 10 0.188 ± 0.159 0.191 ± 0.388 0.196 ± 0.156 0.204 ± 0.153 0.200 ± 0.159 15 0.188 ± 0.159 0.190 ± 0.387 0.196 ± 0.156 0.204 ± 0.154 0.200 ± 0.159 20 0.188 ± 0.159 0.191 ± 0.388 0.196 ± 0.157 0.204 ± 0.154 0.200 ± 0.159 25 0.188 ± 0.159 0.191 ± 0.388 0.196 ± 0.157 0.204 ± 0.154 0.200 ± 0.160 30 0.188 ± 0.159 0.191 ± 0.388 0.196 ± 0.156 0.204 ± 0.154 0.201 ± 0.160 40 0.038 ± 0.104 0.064 ± 0.283 0.121 ± 0.129 0.204 ± 0.154 0.126 ± 0.132 50 0.038 ± 0.104 0.051 ± 0.277 0.080 ± 0.130 0.123 ± 0.156 0.083 ± 0.132 60 0.038 ± 0.104 0.043 ± 0.277 0.056 ± 0.131 0.074 ± 0.158 0.058 ± 0.135 70 0.038 ± 0.104 0.039 ± 0.277 0.041 ± 0.132 0.044 ± 0.160 0.042 ± 0.135 80 0.037 ± 0.104 0.036 ± 0.278 0.032 ± 0.133 0.027 ± 0.162 0.033 ± 0.137 90 0.037 ± 0.104 0.034 ± 0.279 0.027 ± 0.134 0.016 ± 0.164 0.027 ± 0.137 100 0.037 ± 0.104 0.033 ± 0.279 0.023 ± 0.134 0.010 ± 0.166 0.024 ± 0.138

Table A.20: Cophenetic correlation on the Animals with Attributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.054 ± 0.067 0.056 ± 0.185 0.057 ± 0.079 0.051 ± 0.055 0.055 ± 0.068 3 0.060 ± 0.076 0.062 ± 0.210 0.063 ± 0.089 0.057 ± 0.063 0.062 ± 0.078 4 0.065 ± 0.081 0.067 ± 0.224 0.067 ± 0.095 0.061 ± 0.067 0.066 ± 0.083 5 0.067 ± 0.083 0.069 ± 0.230 0.069 ± 0.098 0.065 ± 0.068 0.068 ± 0.085 6 0.067 ± 0.084 0.069 ± 0.233 0.070 ± 0.099 0.064 ± 0.069 0.068 ± 0.085 7 0.068 ± 0.085 0.070 ± 0.234 0.071 ± 0.099 0.065 ± 0.070 0.069 ± 0.086 8 0.068 ± 0.085 0.070 ± 0.235 0.071 ± 0.100 0.065 ± 0.070 0.069 ± 0.087 9 0.068 ± 0.085 0.070 ± 0.236 0.071 ± 0.100 0.065 ± 0.070 0.069 ± 0.087 10 0.070 ± 0.087 0.071 ± 0.236 0.071 ± 0.100 0.068 ± 0.072 0.071 ± 0.088 15 0.070 ± 0.087 0.071 ± 0.237 0.071 ± 0.100 0.068 ± 0.072 0.071 ± 0.088 20 0.070 ± 0.087 0.071 ± 0.237 0.072 ± 0.100 0.068 ± 0.072 0.071 ± 0.088 25 0.070 ± 0.087 0.071 ± 0.237 0.072 ± 0.100 0.068 ± 0.072 0.071 ± 0.088 30 0.070 ± 0.087 0.071 ± 0.237 0.071 ± 0.101 0.068 ± 0.072 0.071 ± 0.088 40 0.041 ± 0.063 0.023 ± 0.141 0.016 ± 0.054 0.068 ± 0.072 0.043 ± 0.064 50 0.028 ± 0.059 0.019 ± 0.136 0.015 ± 0.054 0.041 ± 0.065 0.029 ± 0.061 60 0.020 ± 0.056 0.016 ± 0.134 0.015 ± 0.054 0.025 ± 0.059 0.020 ± 0.057 70 0.015 ± 0.053 0.014 ± 0.132 0.014 ± 0.054 0.015 ± 0.053 0.015 ± 0.055 80 0.012 ± 0.051 0.013 ± 0.129 0.014 ± 0.054 0.009 ± 0.048 0.012 ± 0.052 90 0.010 ± 0.049 0.013 ± 0.127 0.014 ± 0.053 0.005 ± 0.043 0.010 ± 0.049 100 0.009 ± 0.046 0.013 ± 0.126 0.014 ± 0.053 0.003 ± 0.039 0.009 ± 0.047

Table A.21: Area under the curve of the RNX index on the Animals with Attributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.106 ± 0.002 0.105 ± 0.005 0.105 ± 0.002 0.104 ± 0.002 0.107 ± 0.002 3 0.105 ± 0.002 0.105 ± 0.005 0.106 ± 0.002 0.106 ± 0.002 0.108 ± 0.002 4 0.106 ± 0.001 0.106 ± 0.004 0.106 ± 0.002 0.106 ± 0.003 0.109 ± 0.002 5 0.106 ± 0.001 0.107 ± 0.002 0.107 ± 0.001 0.108 ± 0.001 0.109 ± 0.001 6 0.105 ± 0.002 0.105 ± 0.005 0.106 ± 0.002 0.106 ± 0.002 0.107 ± 0.002 7 0.105 ± 0.001 0.106 ± 0.002 0.106 ± 0.001 0.106 ± 0.001 0.108 ± 0.001 8 0.106 ± 0.001 0.106 ± 0.003 0.106 ± 0.001 0.106 ± 0.002 0.108 ± 0.001 9 0.106 ± 0.001 0.106 ± 0.003 0.106 ± 0.001 0.106 ± 0.001 0.108 ± 0.001 10 0.106 ± 0.001 0.106 ± 0.002 0.107 ± 0.001 0.108 ± 0.001 0.109 ± 0.001 15 0.106 ± 0.001 0.106 ± 0.003 0.107 ± 0.001 0.107 ± 0.001 0.109 ± 0.001 20 0.106 ± 0.001 0.106 ± 0.002 0.107 ± 0.001 0.107 ± 0.001 0.109 ± 0.001 25 0.106 ± 0.001 0.106 ± 0.002 0.107 ± 0.001 0.107 ± 0.001 0.109 ± 0.001 30 0.106 ± 0.001 0.107 ± 0.001 0.107 ± 0.001 0.107 ± 0.001 0.109 ± 0.001 40 0.021 ± 0.004 0.035 ± 0.037 0.064 ± 0.002 0.107 ± 0.001 0.067 ± 0.002 50 0.021 ± 0.003 0.028 ± 0.020 0.043 ± 0.004 0.065 ± 0.005 0.045 ± 0.004 60 0.021 ± 0.002 0.024 ± 0.015 0.030 ± 0.001 0.039 ± 0.002 0.031 ± 0.001 70 0.021 ± 0.002 0.022 ± 0.011 0.023 ± 0.006 0.024 ± 0.001 0.023 ± 0.006 80 0.022 ± 0.001 0.020 ± 0.009 0.018 ± 0.003 0.014 ± 0.005 0.018 ± 0.003 90 0.022 ± 0.009 0.020 ± 0.015 0.015 ± 0.001 0.009 ± 0.003 0.015 ± 0.001 100 0.022 ± 0.006 0.019 ± 0.014 0.013 ± 0.006 0.005 ± 0.001 0.013 ± 0.006

Table A.22: Clustering purity on the Animals with Attributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.137 ± 0.001 0.137 ± 0.004 0.138 ± 0.001 0.139 ± 0.001 0.141 ± 0.001 3 0.137 ± 0.001 0.137 ± 0.004 0.138 ± 0.002 0.139 ± 0.003 0.140 ± 0.002 4 0.137 ± 0.001 0.137 ± 0.003 0.138 ± 0.001 0.139 ± 0.001 0.140 ± 0.001 5 0.137 ± 0.001 0.138 ± 0.004 0.138 ± 0.002 0.140 ± 0.002 0.142 ± 0.002 6 0.135 ± 0.001 0.136 ± 0.003 0.136 ± 0.001 0.137 ± 0.002 0.139 ± 0.001 7 0.136 ± 0.001 0.137 ± 0.003 0.137 ± 0.001 0.138 ± 0.002 0.140 ± 0.001 8 0.135 ± 0.002 0.135 ± 0.005 0.136 ± 0.002 0.137 ± 0.003 0.139 ± 0.002 9 0.135 ± 0.001 0.136 ± 0.004 0.136 ± 0.002 0.137 ± 0.002 0.139 ± 0.002 10 0.135 ± 0.002 0.136 ± 0.004 0.137 ± 0.001 0.140 ± 0.001 0.140 ± 0.001 15 0.135 ± 0.001 0.136 ± 0.002 0.137 ± 0.001 0.139 ± 0.001 0.140 ± 0.001 20 0.135 ± 0.000 0.136 ± 0.002 0.137 ± 0.001 0.139 ± 0.001 0.140 ± 0.001 25 0.135 ± 0.001 0.136 ± 0.002 0.137 ± 0.001 0.139 ± 0.001 0.140 ± 0.001 30 0.135 ± 0.001 0.136 ± 0.002 0.137 ± 0.001 0.138 ± 0.000 0.140 ± 0.001 40 0.027 ± 0.005 0.045 ± 0.048 0.083 ± 0.003 0.139 ± 0.001 0.086 ± 0.003 50 0.027 ± 0.004 0.036 ± 0.027 0.055 ± 0.006 0.084 ± 0.007 0.057 ± 0.006 60 0.027 ± 0.004 0.031 ± 0.013 0.039 ± 0.005 0.050 ± 0.009 0.040 ± 0.005 70 0.027 ± 0.003 0.028 ± 0.011 0.029 ± 0.006 0.030 ± 0.001 0.029 ± 0.006 80 0.027 ± 0.003 0.026 ± 0.015 0.023 ± 0.008 0.018 ± 0.002 0.023 ± 0.008 90 0.027 ± 0.002 0.025 ± 0.011 0.019 ± 0.001 0.011 ± 0.002 0.019 ± 0.001 100 0.028 ± 0.002 0.024 ± 0.015 0.017 ± 0.001 0.007 ± 0.003 0.017 ± 0.001

Table A.23: Clustering normalized mutual information on the Animals with Attributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 1.981 ± 0.003 1.980 ± 0.009 1.982 ± 0.003 1.983 ± 0.002 2.022 ± 0.003 3 1.981 ± 0.003 1.980 ± 0.009 1.981 ± 0.003 1.981 ± 0.003 2.019 ± 0.003 4 1.976 ± 0.004 1.976 ± 0.012 1.980 ± 0.003 1.980 ± 0.003 2.015 ± 0.003 5 1.978 ± 0.004 1.977 ± 0.011 1.979 ± 0.003 1.978 ± 0.003 2.022 ± 0.003 6 1.979 ± 0.004 1.978 ± 0.011 1.979 ± 0.004 1.979 ± 0.003 2.027 ± 0.004 7 1.980 ± 0.004 1.977 ± 0.011 1.979 ± 0.004 1.978 ± 0.003 2.022 ± 0.004 8 1.978 ± 0.004 1.978 ± 0.010 1.978 ± 0.004 1.978 ± 0.003 2.015 ± 0.004 9 1.978 ± 0.004 1.977 ± 0.011 1.979 ± 0.004 1.981 ± 0.003 2.021 ± 0.004 10 1.979 ± 0.004 1.980 ± 0.010 1.979 ± 0.004 1.980 ± 0.003 2.019 ± 0.004 15 1.981 ± 0.004 1.980 ± 0.011 1.979 ± 0.004 1.983 ± 0.003 2.021 ± 0.004 20 1.985 ± 0.004 1.980 ± 0.011 1.979 ± 0.004 1.982 ± 0.003 2.032 ± 0.004 25 1.979 ± 0.004 1.978 ± 0.011 1.979 ± 0.004 1.982 ± 0.003 2.021 ± 0.004 30 1.978 ± 0.004 1.979 ± 0.013 1.979 ± 0.004 1.983 ± 0.003 2.026 ± 0.004 40 1.191 ± 0.008 0.653 ± 0.668 0.396 ± 0.004 1.983 ± 0.003 1.235 ± 0.004 50 0.792 ± 0.008 0.524 ± 0.334 0.396 ± 0.009 1.189 ± 0.010 0.819 ± 0.009 60 0.554 ± 0.008 0.447 ± 0.133 0.396 ± 0.001 0.713 ± 0.003 0.572 ± 0.002 70 0.411 ± 0.008 0.401 ± 0.017 0.396 ± 0.004 0.427 ± 0.008 0.421 ± 0.004 80 0.326 ± 0.008 0.373 ± 0.061 0.396 ± 0.001 0.256 ± 0.002 0.330 ± 0.001 90 0.275 ± 0.008 0.356 ± 0.102 0.396 ± 0.004 0.153 ± 0.007 0.276 ± 0.004 100 0.244 ± 0.008 0.347 ± 0.129 0.396 ± 0.001 0.092 ± 0.002 0.244 ± 0.001

Table A.24: Davies-Bouldin index on the Animals with Attributes (AWA) dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.730 ± 0.005 0.776 ± 0.101 0.858 ± 0.002 0.861 ± 0.002 0.852 ± 0.002 3 0.737 ± 0.005 0.786 ± 0.087 0.855 ± 0.003 0.862 ± 0.002 0.844 ± 0.003 4 0.738 ± 0.005 0.783 ± 0.090 0.855 ± 0.002 0.864 ± 0.002 0.855 ± 0.003 5 0.733 ± 0.007 0.784 ± 0.093 0.857 ± 0.002 0.864 ± 0.002 0.859 ± 0.002 6 0.733 ± 0.006 0.786 ± 0.091 0.857 ± 0.002 0.862 ± 0.002 0.861 ± 0.003 7 0.736 ± 0.006 0.787 ± 0.089 0.857 ± 0.002 0.863 ± 0.002 0.859 ± 0.002 8 0.733 ± 0.003 0.788 ± 0.090 0.857 ± 0.001 0.862 ± 0.002 0.860 ± 0.003 9 0.736 ± 0.005 0.788 ± 0.088 0.857 ± 0.002 0.862 ± 0.003 0.858 ± 0.002 10 0.735 ± 0.004 0.788 ± 0.087 0.855 ± 0.002 0.862 ± 0.001 0.857 ± 0.002 15 0.739 ± 0.005 0.788 ± 0.086 0.856 ± 0.001 0.862 ± 0.002 0.857 ± 0.003 20 0.736 ± 0.006 0.788 ± 0.087 0.856 ± 0.002 0.862 ± 0.001 0.859 ± 0.004 25 0.741 ± 0.004 0.788 ± 0.087 0.857 ± 0.001 0.863 ± 0.002 0.862 ± 0.002 30 0.742 ± 0.005 0.789 ± 0.086 0.857 ± 0.001 0.864 ± 0.001 0.862 ± 0.001 40 0.745 ± 0.005 0.791 ± 0.084 0.858 ± 0.001 0.862 ± 0.002 0.860 ± 0.001 50 0.747 ± 0.005 0.792 ± 0.082 0.857 ± 0.001 0.860 ± 0.002 0.859 ± 0.002 60 0.742 ± 0.006 0.790 ± 0.085 0.857 ± 0.001 0.859 ± 0.001 0.857 ± 0.002 70 0.743 ± 0.003 0.790 ± 0.084 0.857 ± 0.001 0.860 ± 0.001 0.854 ± 0.001 80 0.734 ± 0.006 0.785 ± 0.090 0.857 ± 0.001 0.860 ± 0.001 0.856 ± 0.001 90 0.729 ± 0.003 0.784 ± 0.094 0.857 ± 0.001 0.860 ± 0.001 0.855 ± 0.002 100 0.731 ± 0.002 0.783 ± 0.092 0.856 ± 0.001 0.860 ± 0.001 0.855 ± 0.001

Table A.25: One-vs-one SVM classification accuracy on the Berkeley protein dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 -0.068 ± 0.040 0.164 ± 0.510 0.312 ± 0.312 0.325 ± 0.214 0.269 ± 0.227 3 -0.062 ± 0.048 0.174 ± 0.532 0.321 ± 0.324 0.343 ± 0.239 0.296 ± 0.252 4 -0.062 ± 0.058 0.179 ± 0.545 0.329 ± 0.332 0.347 ± 0.248 0.310 ± 0.262 5 -0.059 ± 0.067 0.182 ± 0.551 0.333 ± 0.338 0.349 ± 0.252 0.315 ± 0.268 6 -0.062 ± 0.069 0.181 ± 0.556 0.335 ± 0.341 0.350 ± 0.255 0.318 ± 0.270 7 -0.062 ± 0.071 0.182 ± 0.558 0.336 ± 0.343 0.351 ± 0.257 0.319 ± 0.270 8 -0.058 ± 0.079 0.183 ± 0.558 0.337 ± 0.344 0.352 ± 0.257 0.320 ± 0.270 9 -0.064 ± 0.074 0.182 ± 0.561 0.338 ± 0.345 0.352 ± 0.257 0.320 ± 0.271 10 -0.059 ± 0.082 0.184 ± 0.561 0.338 ± 0.345 0.353 ± 0.257 0.320 ± 0.271 15 -0.063 ± 0.082 0.183 ± 0.563 0.338 ± 0.346 0.353 ± 0.257 0.321 ± 0.271 20 -0.066 ± 0.084 0.182 ± 0.565 0.339 ± 0.346 0.353 ± 0.257 0.321 ± 0.271 25 -0.069 ± 0.085 0.181 ± 0.566 0.339 ± 0.346 0.353 ± 0.257 0.321 ± 0.271 30 -0.069 ± 0.085 0.181 ± 0.566 0.339 ± 0.346 0.353 ± 0.257 0.321 ± 0.271 40 -0.070 ± 0.086 0.180 ± 0.567 0.339 ± 0.346 0.353 ± 0.257 0.321 ± 0.271 50 -0.072 ± 0.085 0.180 ± 0.568 0.339 ± 0.346 0.353 ± 0.257 0.321 ± 0.271 60 -0.073 ± 0.084 0.179 ± 0.568 0.339 ± 0.346 0.353 ± 0.257 0.321 ± 0.271 70 -0.074 ± 0.083 0.179 ± 0.569 0.339 ± 0.346 0.353 ± 0.257 0.321 ± 0.271 80 -0.073 ± 0.084 0.179 ± 0.568 0.339 ± 0.346 0.353 ± 0.257 0.321 ± 0.271 90 -0.074 ± 0.084 0.179 ± 0.569 0.339 ± 0.346 0.353 ± 0.257 0.321 ± 0.271 100 -0.074 ± 0.083 0.179 ± 0.569 0.339 ± 0.346 0.353 ± 0.257 0.321 ± 0.271

Table A.26: Cophenetic correlation on the Berkeley protein dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.045 ± 0.028 0.102 ± 0.167 0.148 ± 0.101 0.158 ± 0.025 0.144 ± 0.026 3 0.045 ± 0.027 0.105 ± 0.173 0.151 ± 0.103 0.166 ± 0.029 0.150 ± 0.031 4 0.044 ± 0.026 0.107 ± 0.177 0.155 ± 0.106 0.169 ± 0.030 0.156 ± 0.033 5 0.045 ± 0.025 0.108 ± 0.179 0.156 ± 0.107 0.170 ± 0.031 0.158 ± 0.035 6 0.044 ± 0.024 0.108 ± 0.182 0.158 ± 0.108 0.170 ± 0.031 0.160 ± 0.035 7 0.043 ± 0.024 0.108 ± 0.182 0.159 ± 0.109 0.171 ± 0.031 0.160 ± 0.035 8 0.045 ± 0.024 0.109 ± 0.182 0.159 ± 0.109 0.171 ± 0.031 0.160 ± 0.035 9 0.043 ± 0.023 0.108 ± 0.183 0.159 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 10 0.044 ± 0.024 0.109 ± 0.183 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 15 0.042 ± 0.022 0.108 ± 0.183 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 20 0.040 ± 0.021 0.108 ± 0.184 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 25 0.038 ± 0.020 0.107 ± 0.185 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 30 0.038 ± 0.020 0.107 ± 0.185 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 40 0.037 ± 0.019 0.107 ± 0.185 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 50 0.036 ± 0.019 0.106 ± 0.185 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 60 0.035 ± 0.019 0.106 ± 0.186 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 70 0.035 ± 0.019 0.106 ± 0.186 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 80 0.035 ± 0.019 0.106 ± 0.186 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 90 0.034 ± 0.019 0.106 ± 0.186 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035 100 0.035 ± 0.019 0.106 ± 0.186 0.160 ± 0.110 0.171 ± 0.031 0.161 ± 0.035

Table A.27: Area under the curve of the RNX index on the Berkeley protein dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.700 ± 0.000 0.712 ± 0.030 0.736 ± 0.005 0.700 ± 0.000 0.700 ± 0.000 3 0.700 ± 0.000 0.718 ± 0.044 0.753 ± 0.005 0.744 ± 0.006 0.732 ± 0.002 4 0.700 ± 0.000 0.719 ± 0.046 0.756 ± 0.004 0.755 ± 0.002 0.747 ± 0.003 5 0.700 ± 0.000 0.719 ± 0.046 0.756 ± 0.005 0.756 ± 0.001 0.748 ± 0.002 6 0.700 ± 0.000 0.719 ± 0.048 0.758 ± 0.003 0.756 ± 0.001 0.750 ± 0.001 7 0.700 ± 0.000 0.720 ± 0.049 0.760 ± 0.002 0.756 ± 0.001 0.749 ± 0.001 8 0.700 ± 0.000 0.720 ± 0.049 0.760 ± 0.001 0.756 ± 0.000 0.749 ± 0.000 9 0.700 ± 0.000 0.720 ± 0.049 0.760 ± 0.001 0.756 ± 0.000 0.749 ± 0.001 10 0.700 ± 0.000 0.720 ± 0.049 0.760 ± 0.001 0.756 ± 0.000 0.749 ± 0.000 15 0.700 ± 0.000 0.720 ± 0.049 0.761 ± 0.000 0.756 ± 0.000 0.749 ± 0.000 20 0.700 ± 0.000 0.720 ± 0.049 0.761 ± 0.000 0.756 ± 0.000 0.749 ± 0.001 25 0.700 ± 0.000 0.720 ± 0.049 0.760 ± 0.000 0.756 ± 0.000 0.749 ± 0.000 30 0.700 ± 0.000 0.720 ± 0.049 0.761 ± 0.000 0.756 ± 0.000 0.749 ± 0.000 40 0.700 ± 0.000 0.720 ± 0.049 0.761 ± 0.000 0.756 ± 0.000 0.749 ± 0.000 50 0.700 ± 0.000 0.720 ± 0.049 0.761 ± 0.000 0.756 ± 0.000 0.749 ± 0.000 60 0.700 ± 0.000 0.720 ± 0.049 0.761 ± 0.000 0.756 ± 0.000 0.749 ± 0.000 70 0.700 ± 0.000 0.720 ± 0.049 0.761 ± 0.000 0.756 ± 0.000 0.749 ± 0.000 80 0.700 ± 0.000 0.720 ± 0.049 0.761 ± 0.000 0.756 ± 0.000 0.749 ± 0.000 90 0.700 ± 0.000 0.720 ± 0.049 0.761 ± 0.000 0.756 ± 0.000 0.749 ± 0.000 100 0.700 ± 0.000 0.720 ± 0.049 0.761 ± 0.000 0.756 ± 0.000 0.749 ± 0.000

Table A.28: Clustering purity on the Berkeley protein dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.015 ± 0.012 0.120 ± 0.164 0.243 ± 0.007 0.199 ± 0.004 0.187 ± 0.004 3 0.016 ± 0.007 0.127 ± 0.180 0.265 ± 0.006 0.277 ± 0.007 0.247 ± 0.007 4 0.013 ± 0.013 0.134 ± 0.184 0.270 ± 0.003 0.290 ± 0.004 0.269 ± 0.003 5 0.024 ± 0.013 0.145 ± 0.175 0.270 ± 0.005 0.291 ± 0.002 0.270 ± 0.003 6 0.036 ± 0.015 0.151 ± 0.168 0.272 ± 0.004 0.292 ± 0.001 0.272 ± 0.001 7 0.035 ± 0.012 0.151 ± 0.171 0.275 ± 0.002 0.292 ± 0.001 0.272 ± 0.001 8 0.047 ± 0.011 0.154 ± 0.162 0.274 ± 0.002 0.292 ± 0.001 0.272 ± 0.000 9 0.049 ± 0.009 0.154 ± 0.160 0.273 ± 0.001 0.292 ± 0.000 0.271 ± 0.001 10 0.053 ± 0.009 0.157 ± 0.158 0.275 ± 0.002 0.292 ± 0.000 0.271 ± 0.000 15 0.065 ± 0.009 0.160 ± 0.150 0.274 ± 0.000 0.292 ± 0.000 0.271 ± 0.000 20 0.058 ± 0.008 0.158 ± 0.154 0.274 ± 0.000 0.291 ± 0.000 0.272 ± 0.001 25 0.066 ± 0.013 0.160 ± 0.150 0.274 ± 0.000 0.291 ± 0.000 0.271 ± 0.000 30 0.064 ± 0.006 0.161 ± 0.150 0.274 ± 0.000 0.291 ± 0.000 0.271 ± 0.000 40 0.061 ± 0.006 0.160 ± 0.152 0.274 ± 0.000 0.292 ± 0.000 0.271 ± 0.000 50 0.059 ± 0.006 0.158 ± 0.154 0.274 ± 0.000 0.291 ± 0.000 0.271 ± 0.000 60 0.056 ± 0.006 0.156 ± 0.156 0.274 ± 0.000 0.291 ± 0.000 0.271 ± 0.000 70 0.059 ± 0.006 0.158 ± 0.153 0.274 ± 0.000 0.291 ± 0.000 0.271 ± 0.000 80 0.052 ± 0.011 0.156 ± 0.159 0.274 ± 0.000 0.292 ± 0.000 0.271 ± 0.000 90 0.057 ± 0.009 0.158 ± 0.155 0.274 ± 0.000 0.291 ± 0.000 0.271 ± 0.000 100 0.059 ± 0.009 0.159 ± 0.153 0.274 ± 0.000 0.291 ± 0.000 0.271 ± 0.000

Table A.29: Clustering normalized mutual information on the Berkeley protein dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 1.976 ± 0.173 1.826 ± 0.315 1.750 ± 0.026 1.721 ± 0.068 1.735 ± 0.059 3 1.966 ± 0.235 1.801 ± 0.369 1.695 ± 0.033 1.677 ± 0.067 1.689 ± 0.064 4 1.965 ± 0.309 1.763 ± 0.459 1.584 ± 0.039 1.664 ± 0.070 1.677 ± 0.070 5 1.943 ± 0.293 1.748 ± 0.450 1.563 ± 0.048 1.664 ± 0.070 1.676 ± 0.071 6 1.920 ± 0.314 1.732 ± 0.466 1.537 ± 0.057 1.663 ± 0.070 1.674 ± 0.072 7 1.904 ± 0.304 1.731 ± 0.451 1.548 ± 0.068 1.664 ± 0.070 1.674 ± 0.072 8 1.875 ± 0.303 1.722 ± 0.441 1.552 ± 0.081 1.663 ± 0.070 1.674 ± 0.072 9 1.880 ± 0.309 1.721 ± 0.449 1.544 ± 0.076 1.663 ± 0.070 1.674 ± 0.072 10 1.855 ± 0.298 1.717 ± 0.432 1.556 ± 0.091 1.663 ± 0.070 1.674 ± 0.072 15 1.821 ± 0.307 1.702 ± 0.436 1.545 ± 0.107 1.664 ± 0.070 1.674 ± 0.072 20 1.803 ± 0.299 1.699 ± 0.430 1.555 ± 0.135 1.664 ± 0.070 1.674 ± 0.072 25 1.789 ± 0.304 1.693 ± 0.435 1.550 ± 0.145 1.664 ± 0.070 1.674 ± 0.072 30 1.764 ± 0.297 1.687 ± 0.425 1.557 ± 0.152 1.664 ± 0.070 1.674 ± 0.072 40 1.745 ± 0.302 1.679 ± 0.435 1.551 ± 0.171 1.664 ± 0.070 1.674 ± 0.072 50 1.764 ± 0.305 1.684 ± 0.434 1.550 ± 0.154 1.664 ± 0.070 1.674 ± 0.072 60 1.765 ± 0.304 1.685 ± 0.432 1.550 ± 0.151 1.664 ± 0.070 1.674 ± 0.072 70 1.758 ± 0.300 1.684 ± 0.428 1.554 ± 0.155 1.664 ± 0.070 1.674 ± 0.072 80 1.762 ± 0.298 1.686 ± 0.427 1.556 ± 0.155 1.664 ± 0.070 1.674 ± 0.072 90 1.772 ± 0.303 1.687 ± 0.432 1.551 ± 0.147 1.664 ± 0.070 1.674 ± 0.072 100 1.769 ± 0.300 1.688 ± 0.426 1.555 ± 0.145 1.664 ± 0.070 1.674 ± 0.072

Table A.30: Davies-Bouldin index on the Berkeley protein dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.432 ± 0.007 0.482 ± 0.073 0.533 ± 0.009 0.484 ± 0.008 0.538 ± 0.009 3 0.438 ± 0.007 0.491 ± 0.076 0.544 ± 0.009 0.494 ± 0.008 0.550 ± 0.009 4 0.457 ± 0.005 0.508 ± 0.074 0.560 ± 0.006 0.509 ± 0.006 0.565 ± 0.006 5 0.464 ± 0.004 0.518 ± 0.077 0.572 ± 0.005 0.519 ± 0.004 0.577 ± 0.005 6 0.468 ± 0.002 0.525 ± 0.080 0.582 ± 0.003 0.530 ± 0.003 0.587 ± 0.003 7 0.476 ± 0.005 0.531 ± 0.078 0.586 ± 0.006 0.534 ± 0.005 0.593 ± 0.006 8 0.474 ± 0.004 0.532 ± 0.083 0.591 ± 0.005 0.540 ± 0.004 0.597 ± 0.005 9 0.485 ± 0.002 0.545 ± 0.085 0.606 ± 0.002 0.543 ± 0.002 0.605 ± 0.002 10 0.546 ± 0.006 0.615 ± 0.098 0.684 ± 0.008 0.616 ± 0.007 0.684 ± 0.008 15 0.572 ± 0.006 0.645 ± 0.104 0.718 ± 0.008 0.645 ± 0.007 0.718 ± 0.008 20 0.582 ± 0.003 0.652 ± 0.099 0.722 ± 0.003 0.648 ± 0.003 0.722 ± 0.003 25 0.566 ± 0.003 0.638 ± 0.102 0.710 ± 0.003 0.637 ± 0.003 0.710 ± 0.003 30 0.558 ± 0.003 0.627 ± 0.097 0.695 ± 0.003 0.626 ± 0.003 0.696 ± 0.003 40 0.328 ± 0.003 0.367 ± 0.055 0.406 ± 0.003 0.365 ± 0.003 0.406 ± 0.003 50 0.107 ± 0.002 0.120 ± 0.019 0.133 ± 0.003 0.120 ± 0.002 0.133 ± 0.003 60 0.063 ± 0.002 0.070 ± 0.011 0.078 ± 0.003 0.070 ± 0.002 0.078 ± 0.003 70 0.036 ± 0.002 0.041 ± 0.007 0.045 ± 0.003 0.041 ± 0.003 0.045 ± 0.003 80 0.021 ± 0.002 0.024 ± 0.005 0.026 ± 0.003 0.024 ± 0.003 0.026 ± 0.003 90 0.012 ± 0.002 0.014 ± 0.005 0.015 ± 0.003 0.014 ± 0.003 0.015 ± 0.003 100 0.007 ± 0.003 0.008 ± 0.004 0.009 ± 0.003 0.008 ± 0.003 0.009 ± 0.003

Table A.31: One-vs-one SVM classification accuracy on the Cora dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.330 ± 0.001 0.370 ± 0.058 0.411 ± 0.001 0.371 ± 0.001 0.411 ± 0.001 3 0.349 ± 0.001 0.391 ± 0.060 0.434 ± 0.002 0.390 ± 0.002 0.434 ± 0.002 4 0.341 ± 0.006 0.382 ± 0.060 0.424 ± 0.007 0.382 ± 0.007 0.424 ± 0.007 5 0.334 ± 0.007 0.377 ± 0.062 0.420 ± 0.008 0.378 ± 0.007 0.421 ± 0.008 6 0.332 ± 0.004 0.375 ± 0.061 0.418 ± 0.005 0.375 ± 0.005 0.419 ± 0.005 7 0.337 ± 0.006 0.381 ± 0.062 0.424 ± 0.007 0.380 ± 0.007 0.423 ± 0.007 8 0.339 ± 0.004 0.380 ± 0.059 0.421 ± 0.005 0.379 ± 0.005 0.421 ± 0.005 9 0.329 ± 0.001 0.372 ± 0.061 0.415 ± 0.001 0.374 ± 0.001 0.416 ± 0.001 10 0.374 ± 0.001 0.422 ± 0.068 0.469 ± 0.002 0.423 ± 0.002 0.470 ± 0.002 15 0.402 ± 0.002 0.451 ± 0.070 0.501 ± 0.002 0.450 ± 0.002 0.501 ± 0.002 20 0.397 ± 0.010 0.444 ± 0.068 0.492 ± 0.001 0.442 ± 0.001 0.492 ± 0.001 25 0.383 ± 0.002 0.432 ± 0.069 0.480 ± 0.002 0.432 ± 0.002 0.480 ± 0.002 30 0.382 ± 0.002 0.430 ± 0.068 0.478 ± 0.002 0.429 ± 0.002 0.477 ± 0.002 40 0.231 ± 0.002 0.261 ± 0.042 0.290 ± 0.002 0.261 ± 0.002 0.290 ± 0.002 50 0.074 ± 0.001 0.083 ± 0.013 0.092 ± 0.002 0.082 ± 0.002 0.092 ± 0.002 60 0.045 ± 0.002 0.050 ± 0.008 0.056 ± 0.002 0.050 ± 0.002 0.056 ± 0.002 70 0.027 ± 0.002 0.031 ± 0.006 0.034 ± 0.003 0.030 ± 0.002 0.034 ± 0.003 80 0.016 ± 0.002 0.018 ± 0.005 0.021 ± 0.003 0.018 ± 0.003 0.021 ± 0.003 90 0.010 ± 0.003 0.011 ± 0.005 0.012 ± 0.003 0.011 ± 0.003 0.012 ± 0.003 100 0.006 ± 0.003 0.007 ± 0.005 0.008 ± 0.004 0.007 ± 0.004 0.008 ± 0.004

Table A.32: Clustering purity on the Cora dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Single view Stacked K MV-tSNE Worst Average Best views 2 0.110 ± 0.007 0.125 ± 0.023 0.139 ± 0.009 0.125 ± 0.008 0.139 ± 0.009 3 0.129 ± 0.008 0.145 ± 0.026 0.162 ± 0.010 0.145 ± 0.009 0.161 ± 0.010 4 0.128 ± 0.004 0.144 ± 0.024 0.160 ± 0.005 0.144 ± 0.005 0.160 ± 0.005 5 0.124 ± 0.003 0.140 ± 0.023 0.157 ± 0.004 0.141 ± 0.004 0.157 ± 0.004 6 0.126 ± 0.006 0.141 ± 0.024 0.157 ± 0.007 0.142 ± 0.006 0.157 ± 0.007 7 0.129 ± 0.006 0.146 ± 0.025 0.162 ± 0.007 0.146 ± 0.007 0.162 ± 0.007 8 0.128 ± 0.007 0.144 ± 0.025 0.159 ± 0.008 0.144 ± 0.007 0.159 ± 0.008 9 0.148 ± 0.005 0.168 ± 0.029 0.187 ± 0.006 0.168 ± 0.006 0.187 ± 0.006 10 0.176 ± 0.001 0.197 ± 0.031 0.219 ± 0.002 0.197 ± 0.002 0.219 ± 0.002 15 0.201 ± 0.002 0.226 ± 0.035 0.251 ± 0.002 0.226 ± 0.002 0.251 ± 0.002 20 0.206 ± 0.008 0.231 ± 0.038 0.256 ± 0.010 0.230 ± 0.009 0.256 ± 0.010 25 0.201 ± 0.001 0.225 ± 0.034 0.250 ± 0.002 0.225 ± 0.001 0.249 ± 0.002 30 0.202 ± 0.001 0.227 ± 0.036 0.252 ± 0.001 0.226 ± 0.001 0.252 ± 0.002 40 0.121 ± 0.010 0.137 ± 0.024 0.152 ± 0.001 0.138 ± 0.001 0.153 ± 0.001 50 0.037 ± 0.008 0.042 ± 0.014 0.047 ± 0.009 0.042 ± 0.008 0.047 ± 0.009 60 0.023 ± 0.006 0.026 ± 0.011 0.029 ± 0.008 0.026 ± 0.007 0.029 ± 0.008 70 0.014 ± 0.005 0.016 ± 0.009 0.017 ± 0.007 0.015 ± 0.006 0.017 ± 0.007 80 0.008 ± 0.004 0.009 ± 0.007 0.010 ± 0.005 0.009 ± 0.005 0.010 ± 0.005 90 0.005 ± 0.004 0.006 ± 0.006 0.006 ± 0.005 0.006 ± 0.004 0.006 ± 0.005 100 0.003 ± 0.003 0.003 ± 0.005 0.004 ± 0.004 0.003 ± 0.003 0.004 ± 0.004

Table A.33: Clustering normalized mutual information on the Cora dataset of MV-tSNE compared with single view and stacked views tSNE. K is the dimensionality of the projection.

Appendix B

Results of MV-MDS experiments

In this appendix, the detailed results of the experiments with the MV-MDS method are presented. There is a results table for each combination of dataset and evaluation method, yielding a total of 36 tables. The method is compared with its single view counterpart, applied either to each view individually or to all views stacked into a single matrix. For the single views, the worst, average and best views (on average) are reported.
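Most of the clustering tables in these appendices report purity and normalized mutual information between a clustering of the projected data and the ground-truth class labels. As a reference only, the following minimal R sketch shows one common way to compute both measures from a vector of cluster assignments and a vector of labels; the function names are illustrative (they are not part of the released package), and the NMI variant shown normalizes by the geometric mean of the two entropies, which is only one of the usual conventions.

# Minimal sketch (assumptions noted above): clustering purity and
# normalized mutual information from cluster assignments and labels.
clustering_purity <- function(clusters, labels) {
  tab <- table(clusters, labels)              # contingency table
  sum(apply(tab, 1, max)) / length(clusters)  # majority class per cluster
}

clustering_nmi <- function(clusters, labels) {
  tab <- table(clusters, labels)
  pxy <- tab / sum(tab)                       # joint distribution
  px  <- rowSums(pxy)                         # cluster marginals
  py  <- colSums(pxy)                         # class marginals
  nz  <- pxy > 0                              # avoid log(0) terms
  mi  <- sum(pxy[nz] * log(pxy[nz] / outer(px, py)[nz]))
  hx  <- -sum(px[px > 0] * log(px[px > 0]))   # entropy of the clustering
  hy  <- -sum(py[py > 0] * log(py[py > 0]))   # entropy of the labels
  mi / sqrt(hx * hy)                          # geometric-mean normalization
}

# Example usage: k-means on a low-dimensional projection (random data here)
set.seed(1)
projection <- matrix(rnorm(100 * 10), nrow = 100)
labels     <- sample(1:5, 100, replace = TRUE)
clusters   <- kmeans(projection, centers = 5)$cluster
clustering_purity(clusters, labels)
clustering_nmi(clusters, labels)

Both measures lie in [0, 1], with higher values indicating a better agreement between the clustering and the reference labels.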


Single view Stacked K MV-MDS Worst Average Best views 2 0.516 ± 0.016 0.571 ± 0.144 0.488 ± 0.014 0.650 ± 0.016 0.741 ± 0.009 3 0.609 ± 0.011 0.719 ± 0.143 0.692 ± 0.015 0.811 ± 0.015 0.880 ± 0.004 4 0.680 ± 0.009 0.777 ± 0.150 0.787 ± 0.013 0.880 ± 0.010 0.941 ± 0.008 5 0.706 ± 0.012 0.816 ± 0.179 0.864 ± 0.011 0.921 ± 0.009 0.951 ± 0.005 6 0.749 ± 0.011 0.834 ± 0.184 0.875 ± 0.007 0.939 ± 0.006 0.968 ± 0.004 7 0.769 ± 0.015 0.861 ± 0.190 0.898 ± 0.008 0.957 ± 0.004 0.971 ± 0.006 8 0.798 ± 0.020 0.873 ± 0.188 0.915 ± 0.009 0.968 ± 0.005 0.977 ± 0.005 9 0.810 ± 0.012 0.880 ± 0.185 0.928 ± 0.008 0.970 ± 0.004 0.979 ± 0.007 10 0.824 ± 0.015 0.891 ± 0.175 0.942 ± 0.006 0.973 ± 0.005 0.975 ± 0.007 15 0.830 ± 0.017 0.902 ± 0.177 0.966 ± 0.010 0.977 ± 0.007 0.980 ± 0.004 20 0.831 ± 0.017 0.910 ± 0.163 0.970 ± 0.007 0.980 ± 0.007 0.975 ± 0.006 25 0.822 ± 0.016 0.911 ± 0.168 0.968 ± 0.009 0.977 ± 0.005 0.972 ± 0.005 30 0.800 ± 0.019 0.900 ± 0.178 0.961 ± 0.009 0.974 ± 0.004 0.963 ± 0.005 40 0.722 ± 0.024 0.839 ± 0.189 0.922 ± 0.016 0.928 ± 0.010 0.932 ± 0.007 50 0.701 ± 0.018 0.748 ± 0.161 0.832 ± 0.025 0.835 ± 0.016 0.844 ± 0.025 60 0.639 ± 0.030 0.589 ± 0.231 0.688 ± 0.054 0.663 ± 0.045 0.700 ± 0.056 70 0.520 ± 0.064 0.426 ± 0.338 0.496 ± 0.081 0.421 ± 0.075 0.510 ± 0.081 80 0.343 ± 0.105 0.300 ± 0.319 0.339 ± 0.079 0.247 ± 0.082 0.342 ± 0.085 90 0.200 ± 0.091 0.215 ± 0.291 0.233 ± 0.074 0.156 ± 0.055 0.238 ± 0.081 100 0.116 ± 0.039 0.158 ± 0.232 0.166 ± 0.058 0.112 ± 0.022 0.164 ± 0.057

Table B.1: One-vs-one SVM classification accuracy on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.214 ± 0.269 0.293 ± 0.531 0.359 ± 0.238 0.392 ± 0.212 0.394 ± 0.082 3 0.214 ± 0.269 0.328 ± 0.611 0.389 ± 0.278 0.441 ± 0.269 0.437 ± 0.112 4 0.214 ± 0.269 0.347 ± 0.632 0.430 ± 0.291 0.483 ± 0.278 0.461 ± 0.121 5 0.214 ± 0.269 0.363 ± 0.636 0.453 ± 0.288 0.516 ± 0.259 0.462 ± 0.109 6 0.214 ± 0.269 0.374 ± 0.649 0.475 ± 0.290 0.525 ± 0.268 0.476 ± 0.107 7 0.214 ± 0.269 0.382 ± 0.654 0.487 ± 0.294 0.542 ± 0.272 0.482 ± 0.119 8 0.214 ± 0.269 0.387 ± 0.659 0.497 ± 0.293 0.552 ± 0.268 0.486 ± 0.121 9 0.214 ± 0.269 0.391 ± 0.666 0.504 ± 0.297 0.559 ± 0.273 0.495 ± 0.118 10 0.214 ± 0.269 0.394 ± 0.672 0.510 ± 0.301 0.562 ± 0.275 0.489 ± 0.117 15 0.214 ± 0.269 0.402 ± 0.686 0.524 ± 0.309 0.577 ± 0.283 0.464 ± 0.117 20 0.214 ± 0.269 0.407 ± 0.694 0.533 ± 0.314 0.585 ± 0.284 0.441 ± 0.120 25 0.214 ± 0.269 0.409 ± 0.699 0.537 ± 0.316 0.590 ± 0.285 0.408 ± 0.122 30 0.214 ± 0.269 0.411 ± 0.701 0.540 ± 0.316 0.593 ± 0.285 0.367 ± 0.115 40 0.214 ± 0.269 0.412 ± 0.704 0.542 ± 0.317 0.598 ± 0.285 0.345 ± 0.116 50 0.214 ± 0.269 0.413 ± 0.706 0.543 ± 0.318 0.599 ± 0.285 0.324 ± 0.114 60 0.214 ± 0.269 0.413 ± 0.707 0.543 ± 0.318 0.601 ± 0.285 0.311 ± 0.111 70 0.214 ± 0.269 0.413 ± 0.708 0.543 ± 0.318 0.602 ± 0.285 0.297 ± 0.105 80 0.214 ± 0.269 0.413 ± 0.708 0.543 ± 0.318 0.603 ± 0.285 0.290 ± 0.104 90 0.214 ± 0.269 0.413 ± 0.708 0.543 ± 0.318 0.603 ± 0.285 0.282 ± 0.105 100 0.214 ± 0.269 0.413 ± 0.708 0.543 ± 0.318 0.604 ± 0.284 0.275 ± 0.105

Table B.2: Cophenetic correlation on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.116 ± 0.071 0.130 ± 0.144 0.148 ± 0.068 0.159 ± 0.065 0.161 ± 0.031 3 0.116 ± 0.071 0.171 ± 0.212 0.207 ± 0.103 0.226 ± 0.106 0.209 ± 0.049 4 0.116 ± 0.071 0.197 ± 0.262 0.247 ± 0.127 0.268 ± 0.134 0.244 ± 0.064 5 0.116 ± 0.071 0.215 ± 0.300 0.274 ± 0.146 0.298 ± 0.148 0.258 ± 0.069 6 0.116 ± 0.071 0.229 ± 0.331 0.294 ± 0.160 0.321 ± 0.162 0.273 ± 0.075 7 0.116 ± 0.071 0.241 ± 0.354 0.316 ± 0.172 0.340 ± 0.176 0.285 ± 0.083 8 0.116 ± 0.071 0.251 ± 0.372 0.334 ± 0.181 0.352 ± 0.181 0.290 ± 0.090 9 0.116 ± 0.071 0.256 ± 0.386 0.341 ± 0.188 0.362 ± 0.185 0.298 ± 0.093 10 0.116 ± 0.071 0.262 ± 0.398 0.350 ± 0.195 0.371 ± 0.190 0.302 ± 0.095 15 0.116 ± 0.071 0.277 ± 0.437 0.375 ± 0.216 0.397 ± 0.211 0.306 ± 0.097 20 0.116 ± 0.071 0.285 ± 0.460 0.392 ± 0.232 0.410 ± 0.220 0.310 ± 0.102 25 0.116 ± 0.071 0.289 ± 0.474 0.401 ± 0.240 0.416 ± 0.223 0.312 ± 0.108 30 0.116 ± 0.071 0.292 ± 0.482 0.405 ± 0.245 0.420 ± 0.225 0.308 ± 0.107 40 0.116 ± 0.071 0.295 ± 0.493 0.411 ± 0.251 0.424 ± 0.226 0.297 ± 0.105 50 0.116 ± 0.071 0.296 ± 0.500 0.414 ± 0.255 0.425 ± 0.224 0.286 ± 0.101 60 0.116 ± 0.071 0.297 ± 0.504 0.415 ± 0.257 0.426 ± 0.223 0.276 ± 0.097 70 0.116 ± 0.071 0.298 ± 0.507 0.416 ± 0.258 0.427 ± 0.223 0.268 ± 0.096 80 0.116 ± 0.071 0.298 ± 0.508 0.417 ± 0.259 0.427 ± 0.222 0.260 ± 0.094 90 0.116 ± 0.071 0.298 ± 0.508 0.417 ± 0.260 0.428 ± 0.221 0.253 ± 0.092 100 0.116 ± 0.071 0.298 ± 0.508 0.417 ± 0.261 0.428 ± 0.221 0.247 ± 0.091

Table B.3: Area under the curve of the RNX index on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.445 0.510 ± 0.048 0.538 0.556 0.616 3 0.445 0.550 ± 0.072 0.583 0.697 0.743 4 0.445 0.602 ± 0.088 0.684 0.767 0.758 5 0.445 0.597 ± 0.087 0.659 0.798 0.792 6 0.445 0.600 ± 0.091 0.694 0.832 0.844 7 0.445 0.634 ± 0.116 0.775 0.854 0.801 8 0.445 0.653 ± 0.126 0.787 0.798 0.886 9 0.445 0.671 ± 0.144 0.837 0.880 0.873 10 0.445 0.671 ± 0.144 0.812 0.885 0.887 15 0.445 0.696 ± 0.163 0.871 0.914 0.849 20 0.445 0.695 ± 0.161 0.861 0.918 0.824 25 0.445 0.697 ± 0.165 0.876 0.921 0.850 30 0.445 0.697 ± 0.165 0.875 0.837 0.855 40 0.445 0.697 ± 0.165 0.877 0.838 0.873 50 0.445 0.695 ± 0.163 0.876 0.837 0.848 60 0.445 0.695 ± 0.163 0.877 0.838 0.849 70 0.445 0.695 ± 0.163 0.877 0.839 0.831 80 0.445 0.695 ± 0.163 0.877 0.839 0.873 90 0.445 0.695 ± 0.163 0.877 0.839 0.871 100 0.445 0.695 ± 0.163 0.877 0.839 0.872

Table B.4: Clustering purity on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.412 0.471 ± 0.052 0.489 0.530 0.628 3 0.416 0.524 ± 0.064 0.551 0.631 0.742 4 0.461 0.553 ± 0.065 0.606 0.661 0.737 5 0.441 0.564 ± 0.076 0.616 0.717 0.756 6 0.453 0.580 ± 0.085 0.665 0.741 0.802 7 0.457 0.599 ± 0.096 0.706 0.767 0.769 8 0.476 0.614 ± 0.103 0.719 0.767 0.820 9 0.463 0.621 ± 0.114 0.745 0.804 0.810 10 0.465 0.623 ± 0.114 0.734 0.810 0.821 15 0.474 0.642 ± 0.130 0.788 0.842 0.836 20 0.472 0.641 ± 0.128 0.771 0.847 0.824 25 0.474 0.646 ± 0.133 0.799 0.851 0.845 30 0.474 0.646 ± 0.134 0.797 0.806 0.864 40 0.474 0.646 ± 0.134 0.800 0.807 0.888 50 0.474 0.643 ± 0.131 0.798 0.806 0.863 60 0.474 0.644 ± 0.131 0.798 0.807 0.872 70 0.474 0.644 ± 0.131 0.798 0.809 0.853 80 0.474 0.644 ± 0.131 0.798 0.808 0.891 90 0.474 0.644 ± 0.131 0.798 0.809 0.891 100 0.474 0.644 ± 0.131 0.798 0.809 0.892

Table B.5: Clustering normalized mutual information on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 1.904 ± 0.053 1.878 ± 0.179 1.870 ± 0.094 1.870 ± 0.054 1.874 ± 0.045 3 1.904 ± 0.076 1.850 ± 0.205 1.829 ± 0.094 1.803 ± 0.085 1.788 ± 0.092 4 1.904 ± 0.084 1.827 ± 0.238 1.789 ± 0.094 1.762 ± 0.095 1.778 ± 0.106 5 1.904 ± 0.092 1.825 ± 0.239 1.782 ± 0.094 1.742 ± 0.108 1.758 ± 0.128 6 1.904 ± 0.089 1.822 ± 0.246 1.784 ± 0.094 1.724 ± 0.116 1.738 ± 0.144 7 1.904 ± 0.109 1.813 ± 0.266 1.738 ± 0.094 1.717 ± 0.121 1.737 ± 0.114 8 1.904 ± 0.112 1.796 ± 0.289 1.732 ± 0.094 1.720 ± 0.126 1.709 ± 0.134 9 1.904 ± 0.112 1.793 ± 0.290 1.731 ± 0.094 1.704 ± 0.128 1.708 ± 0.129 10 1.904 ± 0.116 1.790 ± 0.297 1.727 ± 0.094 1.702 ± 0.130 1.703 ± 0.132 15 1.904 ± 0.114 1.791 ± 0.298 1.719 ± 0.094 1.703 ± 0.131 1.718 ± 0.130 20 1.904 ± 0.118 1.791 ± 0.300 1.721 ± 0.094 1.699 ± 0.133 1.720 ± 0.107 25 1.904 ± 0.115 1.790 ± 0.300 1.716 ± 0.094 1.700 ± 0.132 1.715 ± 0.143 30 1.904 ± 0.114 1.790 ± 0.300 1.717 ± 0.094 1.709 ± 0.122 1.714 ± 0.106 40 1.904 ± 0.115 1.790 ± 0.302 1.716 ± 0.094 1.709 ± 0.122 1.724 ± 0.114 50 1.904 ± 0.116 1.791 ± 0.300 1.716 ± 0.094 1.709 ± 0.122 1.726 ± 0.121 60 1.904 ± 0.115 1.791 ± 0.300 1.716 ± 0.094 1.709 ± 0.122 1.708 ± 0.136 70 1.904 ± 0.115 1.791 ± 0.299 1.716 ± 0.094 1.709 ± 0.122 1.720 ± 0.136 80 1.904 ± 0.115 1.791 ± 0.299 1.716 ± 0.094 1.709 ± 0.122 1.698 ± 0.133 90 1.904 ± 0.115 1.791 ± 0.299 1.716 ± 0.094 1.708 ± 0.122 1.701 ± 0.130 100 1.904 ± 0.115 1.791 ± 0.299 1.716 ± 0.094 1.708 ± 0.122 1.712 ± 0.135

Table B.6: Davies-Bouldin index on the digits dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.532 ± 0.015 0.555 ± 0.058 0.591 ± 0.009 0.607 ± 0.021 0.586 ± 0.019 3 0.570 ± 0.028 0.592 ± 0.089 0.660 ± 0.009 0.610 ± 0.011 0.611 ± 0.025 4 0.614 ± 0.022 0.636 ± 0.077 0.693 ± 0.009 0.612 ± 0.025 0.637 ± 0.019 5 0.640 ± 0.018 0.661 ± 0.097 0.742 ± 0.006 0.657 ± 0.023 0.672 ± 0.020 6 0.659 ± 0.016 0.683 ± 0.094 0.761 ± 0.005 0.670 ± 0.019 0.683 ± 0.019 7 0.685 ± 0.017 0.696 ± 0.091 0.770 ± 0.006 0.705 ± 0.013 0.713 ± 0.015 8 0.691 ± 0.021 0.719 ± 0.085 0.787 ± 0.007 0.713 ± 0.009 0.724 ± 0.016 9 0.702 ± 0.017 0.726 ± 0.082 0.790 ± 0.005 0.732 ± 0.010 0.745 ± 0.019 10 0.695 ± 0.017 0.728 ± 0.085 0.793 ± 0.005 0.731 ± 0.018 0.745 ± 0.017 15 0.704 ± 0.015 0.743 ± 0.083 0.806 ± 0.006 0.774 ± 0.012 0.776 ± 0.011 20 0.719 ± 0.014 0.753 ± 0.087 0.818 ± 0.008 0.779 ± 0.014 0.785 ± 0.017 25 0.714 ± 0.023 0.749 ± 0.098 0.827 ± 0.006 0.783 ± 0.016 0.787 ± 0.012 30 0.722 ± 0.010 0.742 ± 0.102 0.828 ± 0.008 0.773 ± 0.013 0.777 ± 0.014 40 0.672 ± 0.009 0.703 ± 0.133 0.814 ± 0.008 0.711 ± 0.016 0.710 ± 0.020 50 0.619 ± 0.018 0.643 ± 0.134 0.756 ± 0.012 0.620 ± 0.014 0.626 ± 0.015 60 0.534 ± 0.008 0.571 ± 0.155 0.687 ± 0.016 0.535 ± 0.025 0.543 ± 0.029 70 0.464 ± 0.066 0.464 ± 0.155 0.506 ± 0.037 0.455 ± 0.030 0.456 ± 0.031 80 0.351 ± 0.036 0.368 ± 0.105 0.420 ± 0.014 0.382 ± 0.038 0.380 ± 0.039 90 0.334 ± 0.063 0.342 ± 0.108 0.401 ± 0.013 0.328 ± 0.029 0.331 ± 0.030 100 0.306 ± 0.017 0.327 ± 0.082 0.388 ± 0.013 0.306 ± 0.019 0.307 ± 0.019

Table B.7: One-vs-one SVM classification accuracy on the Reuters multilingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.766 ± 0.067 0.759 ± 0.250 0.790 ± 0.082 0.846 ± 0.077 0.784 ± 0.024 3 0.738 ± 0.079 0.780 ± 0.206 0.821 ± 0.060 0.882 ± 0.053 0.828 ± 0.028 4 0.735 ± 0.076 0.787 ± 0.197 0.833 ± 0.054 0.899 ± 0.041 0.837 ± 0.049 5 0.751 ± 0.084 0.795 ± 0.185 0.831 ± 0.048 0.905 ± 0.039 0.813 ± 0.040 6 0.740 ± 0.087 0.796 ± 0.184 0.833 ± 0.046 0.913 ± 0.035 0.810 ± 0.030 7 0.742 ± 0.082 0.798 ± 0.177 0.840 ± 0.040 0.915 ± 0.033 0.802 ± 0.021 8 0.743 ± 0.081 0.794 ± 0.175 0.840 ± 0.039 0.917 ± 0.032 0.793 ± 0.022 9 0.740 ± 0.080 0.789 ± 0.171 0.834 ± 0.039 0.918 ± 0.031 0.770 ± 0.023 10 0.740 ± 0.078 0.787 ± 0.165 0.829 ± 0.038 0.920 ± 0.030 0.713 ± 0.026 15 0.713 ± 0.072 0.770 ± 0.155 0.812 ± 0.037 0.923 ± 0.028 0.630 ± 0.032 20 0.696 ± 0.069 0.754 ± 0.150 0.797 ± 0.036 0.924 ± 0.027 0.604 ± 0.032 25 0.685 ± 0.067 0.741 ± 0.145 0.786 ± 0.035 0.925 ± 0.027 0.544 ± 0.029 30 0.669 ± 0.066 0.728 ± 0.144 0.775 ± 0.035 0.926 ± 0.027 0.505 ± 0.030 40 0.645 ± 0.063 0.708 ± 0.138 0.750 ± 0.034 0.926 ± 0.027 0.454 ± 0.027 50 0.624 ± 0.060 0.690 ± 0.135 0.732 ± 0.034 0.927 ± 0.027 0.416 ± 0.022 60 0.604 ± 0.058 0.674 ± 0.136 0.718 ± 0.034 0.927 ± 0.026 0.380 ± 0.019 70 0.589 ± 0.057 0.660 ± 0.135 0.705 ± 0.034 0.927 ± 0.026 0.341 ± 0.015 80 0.573 ± 0.056 0.647 ± 0.135 0.693 ± 0.034 0.927 ± 0.026 0.299 ± 0.015 90 0.561 ± 0.054 0.633 ± 0.134 0.680 ± 0.034 0.928 ± 0.026 0.273 ± 0.016 100 0.549 ± 0.054 0.622 ± 0.133 0.668 ± 0.034 0.928 ± 0.026 0.242 ± 0.018

Table B.8: Cophenetic correlation on the Reuters multilingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.144 ± 0.028 0.179 ± 0.125 0.202 ± 0.050 0.247 ± 0.060 0.206 ± 0.026 3 0.167 ± 0.036 0.214 ± 0.152 0.232 ± 0.057 0.275 ± 0.055 0.257 ± 0.035 4 0.195 ± 0.047 0.242 ± 0.170 0.268 ± 0.067 0.306 ± 0.056 0.296 ± 0.042 5 0.218 ± 0.056 0.259 ± 0.182 0.276 ± 0.068 0.331 ± 0.059 0.314 ± 0.041 6 0.228 ± 0.061 0.270 ± 0.197 0.292 ± 0.077 0.345 ± 0.058 0.321 ± 0.035 7 0.236 ± 0.065 0.282 ± 0.208 0.315 ± 0.080 0.354 ± 0.057 0.322 ± 0.031 8 0.242 ± 0.069 0.288 ± 0.215 0.324 ± 0.083 0.360 ± 0.056 0.321 ± 0.028 9 0.251 ± 0.074 0.293 ± 0.220 0.327 ± 0.084 0.370 ± 0.058 0.322 ± 0.026 10 0.253 ± 0.076 0.295 ± 0.225 0.329 ± 0.086 0.375 ± 0.057 0.323 ± 0.024 15 0.271 ± 0.088 0.304 ± 0.238 0.335 ± 0.090 0.394 ± 0.057 0.310 ± 0.018 20 0.279 ± 0.095 0.307 ± 0.245 0.335 ± 0.093 0.407 ± 0.058 0.299 ± 0.012 25 0.284 ± 0.099 0.310 ± 0.249 0.337 ± 0.095 0.417 ± 0.060 0.289 ± 0.014 30 0.287 ± 0.103 0.311 ± 0.249 0.335 ± 0.095 0.426 ± 0.062 0.284 ± 0.015 40 0.292 ± 0.106 0.310 ± 0.246 0.328 ± 0.092 0.435 ± 0.063 0.267 ± 0.013 50 0.293 ± 0.109 0.307 ± 0.242 0.323 ± 0.091 0.442 ± 0.064 0.257 ± 0.013 60 0.293 ± 0.110 0.305 ± 0.237 0.320 ± 0.090 0.447 ± 0.065 0.245 ± 0.012 70 0.292 ± 0.111 0.302 ± 0.233 0.317 ± 0.088 0.450 ± 0.066 0.237 ± 0.011 80 0.291 ± 0.111 0.299 ± 0.229 0.313 ± 0.087 0.454 ± 0.066 0.229 ± 0.012 90 0.289 ± 0.110 0.297 ± 0.224 0.311 ± 0.086 0.456 ± 0.066 0.224 ± 0.013 100 0.287 ± 0.109 0.295 ± 0.221 0.309 ± 0.085 0.458 ± 0.066 0.216 ± 0.014

Table B.9: Area under the curve of the RNX index on the Reuters multilingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.505 0.534 ± 0.015 0.545 0.535 0.523 3 0.519 0.530 ± 0.020 0.558 0.492 0.534 4 0.499 0.525 ± 0.037 0.599 0.454 0.483 5 0.503 0.522 ± 0.034 0.583 0.450 0.485 6 0.499 0.530 ± 0.038 0.600 0.449 0.507 7 0.496 0.533 ± 0.041 0.611 0.449 0.517 8 0.509 0.538 ± 0.041 0.618 0.449 0.485 9 0.496 0.520 ± 0.051 0.613 0.449 0.483 10 0.499 0.532 ± 0.043 0.616 0.449 0.523 15 0.511 0.529 ± 0.024 0.576 0.449 0.487 20 0.492 0.532 ± 0.043 0.617 0.448 0.532 25 0.496 0.533 ± 0.042 0.616 0.448 0.493 30 0.508 0.533 ± 0.036 0.604 0.448 0.451 40 0.502 0.532 ± 0.038 0.607 0.448 0.392 50 0.536 0.539 ± 0.034 0.605 0.448 0.404 60 0.538 0.541 ± 0.033 0.605 0.448 0.418 70 0.533 0.544 ± 0.044 0.631 0.448 0.424 80 0.533 0.544 ± 0.044 0.630 0.448 0.443 90 0.533 0.544 ± 0.044 0.630 0.448 0.352 100 0.533 0.545 ± 0.043 0.631 0.448 0.359

Table B.10: Clustering purity on the Reuters multilingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.191 0.234 ± 0.031 0.284 0.230 0.214 3 0.219 0.253 ± 0.045 0.339 0.200 0.248 4 0.212 0.255 ± 0.045 0.340 0.149 0.215 5 0.216 0.259 ± 0.053 0.362 0.150 0.223 6 0.231 0.271 ± 0.057 0.377 0.153 0.250 7 0.199 0.272 ± 0.063 0.387 0.153 0.280 8 0.210 0.281 ± 0.065 0.404 0.153 0.217 9 0.218 0.262 ± 0.076 0.395 0.153 0.238 10 0.218 0.277 ± 0.065 0.401 0.153 0.275 15 0.215 0.266 ± 0.044 0.346 0.153 0.219 20 0.196 0.265 ± 0.055 0.365 0.154 0.265 25 0.224 0.271 ± 0.049 0.364 0.153 0.254 30 0.212 0.263 ± 0.040 0.333 0.153 0.222 40 0.216 0.264 ± 0.040 0.337 0.152 0.138 50 0.251 0.270 ± 0.033 0.335 0.152 0.144 60 0.254 0.266 ± 0.037 0.335 0.153 0.195 70 0.250 0.275 ± 0.045 0.364 0.153 0.185 80 0.249 0.276 ± 0.043 0.361 0.153 0.174 90 0.249 0.277 ± 0.044 0.364 0.153 0.094 100 0.248 0.274 ± 0.046 0.365 0.153 0.101

Table B.11: Clustering normalized mutual information on the Reuters multilingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 1.782 ± 0.047 1.748 ± 0.155 1.832 ± 0.074 1.683 ± 0.049 1.699 ± 0.028 3 1.687 ± 0.054 1.691 ± 0.114 1.662 ± 0.052 1.657 ± 0.065 1.662 ± 0.041 4 1.661 ± 0.087 1.654 ± 0.135 1.638 ± 0.057 1.647 ± 0.073 1.592 ± 0.063 5 1.674 ± 0.092 1.646 ± 0.147 1.594 ± 0.052 1.638 ± 0.073 1.577 ± 0.070 6 1.672 ± 0.065 1.638 ± 0.152 1.617 ± 0.048 1.636 ± 0.073 1.581 ± 0.065 7 1.670 ± 0.066 1.655 ± 0.133 1.613 ± 0.048 1.637 ± 0.073 1.587 ± 0.054 8 1.666 ± 0.068 1.649 ± 0.123 1.610 ± 0.049 1.637 ± 0.073 1.667 ± 0.042 9 1.671 ± 0.068 1.648 ± 0.127 1.610 ± 0.055 1.636 ± 0.072 1.647 ± 0.053 10 1.652 ± 0.068 1.644 ± 0.131 1.609 ± 0.067 1.636 ± 0.072 1.682 ± 0.041 15 1.677 ± 0.069 1.654 ± 0.141 1.604 ± 0.071 1.636 ± 0.072 1.634 ± 0.048 20 1.674 ± 0.069 1.644 ± 0.141 1.604 ± 0.071 1.635 ± 0.073 1.652 ± 0.046 25 1.674 ± 0.069 1.643 ± 0.138 1.604 ± 0.071 1.635 ± 0.073 1.622 ± 0.058 30 1.673 ± 0.069 1.650 ± 0.139 1.604 ± 0.071 1.635 ± 0.073 1.722 ± 0.009 40 1.672 ± 0.069 1.648 ± 0.136 1.603 ± 0.071 1.635 ± 0.072 1.744 ± 0.006 50 1.673 ± 0.069 1.644 ± 0.137 1.603 ± 0.070 1.635 ± 0.072 1.854 ± 0.006 60 1.672 ± 0.069 1.646 ± 0.143 1.603 ± 0.070 1.635 ± 0.072 1.738 ± 0.007 70 1.673 ± 0.069 1.649 ± 0.140 1.603 ± 0.070 1.635 ± 0.072 1.870 ± 0.004 80 1.672 ± 0.069 1.648 ± 0.137 1.603 ± 0.070 1.635 ± 0.072 1.881 ± 0.012 90 1.672 ± 0.069 1.648 ± 0.136 1.603 ± 0.070 1.635 ± 0.072 1.937 ± 0.010 100 1.672 ± 0.069 1.649 ± 0.140 1.603 ± 0.070 1.635 ± 0.072 1.943 ± 0.008

Table B.12: Davies-Bouldin index on the Reuters multilingual corpus dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.827 ± 0.011 0.828 ± 0.015 0.830 ± 0.010 0.871 ± 0.006 0.871 ± 0.006 3 0.926 ± 0.010 0.920 ± 0.017 0.914 ± 0.011 0.949 ± 0.007 0.949 ± 0.007 4 0.945 ± 0.009 0.941 ± 0.012 0.937 ± 0.006 0.955 ± 0.007 0.955 ± 0.007 5 0.942 ± 0.011 0.942 ± 0.015 0.941 ± 0.010 0.960 ± 0.011 0.960 ± 0.011 6 0.950 ± 0.009 0.946 ± 0.013 0.942 ± 0.007 0.962 ± 0.008 0.962 ± 0.007 7 0.951 ± 0.009 0.951 ± 0.011 0.951 ± 0.007 0.966 ± 0.006 0.967 ± 0.006 8 0.953 ± 0.008 0.954 ± 0.011 0.954 ± 0.007 0.969 ± 0.006 0.968 ± 0.006 9 0.953 ± 0.009 0.955 ± 0.011 0.957 ± 0.006 0.968 ± 0.006 0.969 ± 0.006 10 0.953 ± 0.009 0.955 ± 0.011 0.956 ± 0.006 0.965 ± 0.007 0.965 ± 0.007 15 0.955 ± 0.006 0.954 ± 0.010 0.952 ± 0.008 0.968 ± 0.006 0.969 ± 0.006 20 0.955 ± 0.008 0.953 ± 0.012 0.950 ± 0.008 0.967 ± 0.005 0.967 ± 0.005 25 0.951 ± 0.007 0.949 ± 0.013 0.947 ± 0.010 0.962 ± 0.004 0.962 ± 0.004 30 0.943 ± 0.007 0.946 ± 0.012 0.948 ± 0.009 0.959 ± 0.004 0.959 ± 0.004 40 0.931 ± 0.011 0.935 ± 0.014 0.939 ± 0.006 0.951 ± 0.007 0.951 ± 0.007 50 0.914 ± 0.012 0.921 ± 0.018 0.928 ± 0.008 0.942 ± 0.008 0.942 ± 0.007 60 0.859 ± 0.014 0.871 ± 0.030 0.882 ± 0.020 0.923 ± 0.013 0.923 ± 0.013 70 0.737 ± 0.031 0.754 ± 0.052 0.771 ± 0.035 0.864 ± 0.025 0.863 ± 0.025 80 0.565 ± 0.044 0.587 ± 0.071 0.609 ± 0.046 0.758 ± 0.037 0.758 ± 0.036 90 0.421 ± 0.040 0.439 ± 0.064 0.456 ± 0.043 0.615 ± 0.044 0.615 ± 0.045 100 0.340 ± 0.025 0.352 ± 0.043 0.364 ± 0.031 0.484 ± 0.038 0.485 ± 0.038

Table B.13: One-vs-one SVM classification accuracy on the BBC segmented news dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.194 ± 0.006 0.193 ± 0.007 0.191 ± 0.003 0.212 ± 0.001 0.213 ± 0.001 3 0.211 ± 0.005 0.213 ± 0.006 0.214 ± 0.004 0.243 ± 0.000 0.246 ± 0.000 4 0.225 ± 0.006 0.228 ± 0.009 0.231 ± 0.005 0.253 ± 0.001 0.253 ± 0.000 5 0.226 ± 0.008 0.227 ± 0.010 0.228 ± 0.005 0.253 ± 0.001 0.248 ± 0.001 6 0.225 ± 0.008 0.226 ± 0.010 0.227 ± 0.006 0.249 ± 0.001 0.235 ± 0.001 7 0.209 ± 0.007 0.216 ± 0.014 0.222 ± 0.008 0.246 ± 0.000 0.225 ± 0.000 8 0.207 ± 0.007 0.211 ± 0.013 0.216 ± 0.008 0.235 ± 0.000 0.208 ± 0.001 9 0.209 ± 0.008 0.209 ± 0.012 0.208 ± 0.009 0.227 ± 0.000 0.196 ± 0.001 10 0.200 ± 0.008 0.202 ± 0.013 0.203 ± 0.010 0.228 ± 0.001 0.196 ± 0.002 15 0.192 ± 0.013 0.192 ± 0.019 0.192 ± 0.014 0.220 ± 0.001 0.185 ± 0.001 20 0.185 ± 0.017 0.184 ± 0.024 0.184 ± 0.017 0.209 ± 0.001 0.170 ± 0.001 25 0.182 ± 0.019 0.183 ± 0.027 0.184 ± 0.019 0.211 ± 0.000 0.172 ± 0.000 30 0.182 ± 0.022 0.182 ± 0.030 0.181 ± 0.020 0.208 ± 0.000 0.167 ± 0.000 40 0.182 ± 0.025 0.181 ± 0.034 0.179 ± 0.023 0.205 ± 0.000 0.160 ± 0.001 50 0.180 ± 0.027 0.180 ± 0.038 0.179 ± 0.027 0.202 ± 0.001 0.156 ± 0.001 60 0.181 ± 0.029 0.180 ± 0.042 0.180 ± 0.031 0.203 ± 0.001 0.158 ± 0.001 70 0.183 ± 0.032 0.183 ± 0.047 0.183 ± 0.034 0.207 ± 0.001 0.162 ± 0.002 80 0.188 ± 0.034 0.187 ± 0.051 0.187 ± 0.037 0.211 ± 0.002 0.167 ± 0.002 90 0.192 ± 0.037 0.191 ± 0.055 0.190 ± 0.040 0.216 ± 0.002 0.172 ± 0.003 100 0.197 ± 0.039 0.196 ± 0.059 0.196 ± 0.044 0.220 ± 0.003 0.176 ± 0.004

Table B.14: Cophenetic correlation on the BBC segmented news dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.
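Cophenetic correlation measures how well the distances implied by a hierarchical clustering of the projection agree with the pairwise distances of the data. The sketch below shows one possible computation with SciPy; pairing a dendrogram built on the embedding with the distances of the original space, and the choice of average linkage, are assumptions rather than the exact procedure used in the thesis.

    from scipy.cluster.hierarchy import cophenet, linkage
    from scipy.spatial.distance import pdist

    def cophenetic_correlation(original_data, embedding, method="average"):
        """Correlate cophenetic distances of a dendrogram on the embedding with
        the pairwise distances of the original data (closer to 1 is better)."""
        dendrogram = linkage(embedding, method=method)
        corr, _ = cophenet(dendrogram, pdist(original_data))
        return corr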

Single view Stacked K MV-MDS Worst Average Best views 2 0.116 ± 0.008 0.115 ± 0.011 0.114 ± 0.007 0.134 ± 0.000 0.133 ± 0.000 3 0.136 ± 0.010 0.135 ± 0.015 0.135 ± 0.010 0.161 ± 0.001 0.160 ± 0.001 4 0.146 ± 0.012 0.146 ± 0.016 0.146 ± 0.012 0.169 ± 0.001 0.169 ± 0.001 5 0.166 ± 0.015 0.165 ± 0.021 0.164 ± 0.015 0.190 ± 0.001 0.189 ± 0.000 6 0.173 ± 0.017 0.174 ± 0.024 0.176 ± 0.017 0.199 ± 0.000 0.197 ± 0.000 7 0.176 ± 0.018 0.180 ± 0.027 0.184 ± 0.019 0.207 ± 0.000 0.204 ± 0.000 8 0.182 ± 0.019 0.185 ± 0.029 0.188 ± 0.021 0.211 ± 0.000 0.208 ± 0.000 9 0.187 ± 0.020 0.188 ± 0.030 0.190 ± 0.022 0.213 ± 0.000 0.209 ± 0.001 10 0.189 ± 0.022 0.193 ± 0.032 0.196 ± 0.023 0.219 ± 0.000 0.214 ± 0.000 15 0.207 ± 0.029 0.209 ± 0.042 0.212 ± 0.030 0.240 ± 0.000 0.230 ± 0.000 20 0.217 ± 0.035 0.218 ± 0.049 0.219 ± 0.034 0.247 ± 0.001 0.234 ± 0.001 25 0.224 ± 0.040 0.225 ± 0.055 0.226 ± 0.038 0.254 ± 0.000 0.241 ± 0.001 30 0.226 ± 0.042 0.228 ± 0.059 0.230 ± 0.041 0.259 ± 0.000 0.244 ± 0.000 40 0.235 ± 0.048 0.235 ± 0.067 0.235 ± 0.046 0.264 ± 0.000 0.247 ± 0.001 50 0.239 ± 0.052 0.240 ± 0.074 0.241 ± 0.052 0.268 ± 0.001 0.250 ± 0.001 60 0.243 ± 0.056 0.245 ± 0.080 0.247 ± 0.058 0.273 ± 0.000 0.253 ± 0.000 70 0.247 ± 0.060 0.250 ± 0.087 0.252 ± 0.062 0.277 ± 0.001 0.256 ± 0.001 80 0.253 ± 0.064 0.255 ± 0.092 0.256 ± 0.067 0.282 ± 0.000 0.259 ± 0.000 90 0.257 ± 0.068 0.258 ± 0.098 0.259 ± 0.071 0.286 ± 0.000 0.264 ± 0.000 100 0.260 ± 0.071 0.262 ± 0.104 0.264 ± 0.075 0.290 ± 0.000 0.268 ± 0.000

Table B.15: Area under the curve of the RNX index on the BBC segmented news dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.
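The R_NX(k) curve measures, for every neighbourhood size k, how many of the k nearest neighbours in the original space are preserved in the projection, rescaled so that a random embedding scores 0. A common summary is the area under this curve with a 1/k weighting (logarithmic scale in k). The sketch below follows that common definition and is only an illustration; the exact formulation used in the thesis may differ.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def rnx_auc(high_dim, low_dim):
        """Log-weighted area under the R_NX(k) neighbourhood-preservation curve."""
        n = high_dim.shape[0]
        # neighbour orderings in both spaces; column 0 (the point itself) is dropped
        nn_hd = np.argsort(squareform(pdist(high_dim)), axis=1)[:, 1:]
        nn_ld = np.argsort(squareform(pdist(low_dim)), axis=1)[:, 1:]
        ks = np.arange(1, n - 1)
        qnx = np.empty(len(ks))
        for j, k in enumerate(ks):
            shared = sum(np.intersect1d(nn_hd[i, :k], nn_ld[i, :k]).size for i in range(n))
            qnx[j] = shared / (n * k)
        rnx = ((n - 1) * qnx - ks) / (n - 1 - ks)
        weights = 1.0 / ks
        return float(np.sum(rnx * weights) / np.sum(weights))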

Single view Stacked K MV-MDS Worst Average Best views 2 0.529 0.553 ± 0.024 0.577 0.697 0.691 3 0.617 0.652 ± 0.034 0.686 0.857 0.862 4 0.856 0.869 ± 0.013 0.882 0.914 0.921 5 0.817 0.820 ± 0.002 0.822 0.875 0.689 6 0.699 0.708 ± 0.009 0.716 0.736 0.715 7 0.706 0.695 ± 0.011 0.683 0.751 0.567 8 0.581 0.653 ± 0.073 0.726 0.752 0.552 9 0.577 0.652 ± 0.075 0.727 0.746 0.565 10 0.581 0.654 ± 0.073 0.727 0.744 0.553 15 0.584 0.659 ± 0.074 0.733 0.746 0.443 20 0.735 0.728 ± 0.007 0.720 0.747 0.526 25 0.732 0.726 ± 0.006 0.720 0.746 0.432 30 0.737 0.729 ± 0.007 0.722 0.746 0.427 40 0.736 0.737 ± 0.001 0.738 0.745 0.384 50 0.733 0.735 ± 0.001 0.736 0.749 0.379 60 0.734 0.735 ± 0.001 0.737 0.747 0.414 70 0.739 0.738 ± 0.000 0.738 0.746 0.338 80 0.736 0.729 ± 0.007 0.722 0.746 0.384 90 0.736 0.729 ± 0.006 0.723 0.745 0.323 100 0.734 0.728 ± 0.006 0.722 0.745 0.415

Table B.16: Clustering purity on the BBC segmented news dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.
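Clustering purity is the fraction of samples that fall in the majority true class of their assigned cluster. A minimal computation from the contingency matrix is sketched below; scikit-learn is assumed, and the thesis implementation may differ.

    from sklearn.metrics.cluster import contingency_matrix

    def clustering_purity(true_labels, cluster_labels):
        """Sum of per-cluster majority-class counts divided by the number of samples."""
        cm = contingency_matrix(true_labels, cluster_labels)
        return cm.max(axis=0).sum() / cm.sum()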

Single view Stacked K MV-MDS Worst Average Best views 2 0.484 0.476 ± 0.008 0.468 0.572 0.565 3 0.540 0.555 ± 0.015 0.570 0.716 0.722 4 0.696 0.705 ± 0.009 0.715 0.778 0.788 5 0.651 0.652 ± 0.001 0.653 0.732 0.562 6 0.606 0.614 ± 0.008 0.622 0.672 0.637 7 0.637 0.619 ± 0.018 0.601 0.704 0.483 8 0.487 0.572 ± 0.084 0.656 0.715 0.472 9 0.488 0.574 ± 0.087 0.661 0.703 0.490 10 0.489 0.577 ± 0.088 0.665 0.699 0.463 15 0.496 0.590 ± 0.094 0.683 0.703 0.314 20 0.686 0.667 ± 0.019 0.647 0.707 0.422 25 0.678 0.664 ± 0.014 0.650 0.706 0.290 30 0.689 0.672 ± 0.018 0.654 0.706 0.251 40 0.687 0.690 ± 0.002 0.692 0.705 0.213 50 0.681 0.685 ± 0.004 0.690 0.714 0.201 60 0.683 0.687 ± 0.004 0.692 0.710 0.232 70 0.693 0.694 ± 0.001 0.695 0.708 0.109 80 0.686 0.671 ± 0.015 0.656 0.707 0.169 90 0.686 0.672 ± 0.014 0.658 0.706 0.147 100 0.685 0.671 ± 0.014 0.657 0.706 0.260

Table B.17: Clustering normalized mutual information on the BBC segmented news dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.
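Normalized mutual information between the cluster assignment and the ground-truth labels can be obtained directly from scikit-learn, as sketched below; this is only an illustration of the metric, not of the thesis code.

    from sklearn.metrics import normalized_mutual_info_score

    def clustering_nmi(true_labels, cluster_labels):
        """NMI between predicted clusters and ground-truth classes (1 = perfect agreement)."""
        return normalized_mutual_info_score(true_labels, cluster_labels)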

Single view Stacked K MV-MDS Worst Average Best views 2 1.983 ± 0.002 1.983 ± 0.003 1.983 ± 0.002 1.979 ± 0.000 1.979 ± 0.000 3 1.979 ± 0.002 1.979 ± 0.003 1.979 ± 0.002 1.972 ± 0.000 1.972 ± 0.000 4 1.971 ± 0.001 1.972 ± 0.002 1.972 ± 0.001 1.970 ± 0.000 1.970 ± 0.000 5 1.972 ± 0.002 1.972 ± 0.002 1.972 ± 0.001 1.971 ± 0.000 1.976 ± 0.000 6 1.971 ± 0.002 1.971 ± 0.002 1.970 ± 0.001 1.970 ± 0.000 1.969 ± 0.000 7 1.971 ± 0.004 1.969 ± 0.005 1.967 ± 0.001 1.970 ± 0.000 1.967 ± 0.001 8 1.969 ± 0.004 1.965 ± 0.007 1.962 ± 0.001 1.969 ± 0.001 1.960 ± 0.001 9 1.967 ± 0.004 1.964 ± 0.006 1.961 ± 0.002 1.967 ± 0.000 1.957 ± 0.002 10 1.967 ± 0.004 1.965 ± 0.005 1.963 ± 0.002 1.967 ± 0.000 1.958 ± 0.002 15 1.964 ± 0.005 1.962 ± 0.006 1.960 ± 0.003 1.965 ± 0.001 1.946 ± 0.003 20 1.965 ± 0.003 1.965 ± 0.005 1.966 ± 0.003 1.963 ± 0.000 1.957 ± 0.001 25 1.964 ± 0.003 1.965 ± 0.005 1.965 ± 0.003 1.962 ± 0.000 1.950 ± 0.001 30 1.964 ± 0.003 1.965 ± 0.005 1.965 ± 0.003 1.962 ± 0.000 1.956 ± 0.001 40 1.964 ± 0.003 1.965 ± 0.005 1.965 ± 0.003 1.962 ± 0.000 1.961 ± 0.001 50 1.964 ± 0.003 1.965 ± 0.005 1.965 ± 0.003 1.962 ± 0.000 1.959 ± 0.001 60 1.964 ± 0.003 1.965 ± 0.005 1.965 ± 0.003 1.962 ± 0.000 1.959 ± 0.001 70 1.963 ± 0.003 1.964 ± 0.005 1.965 ± 0.003 1.962 ± 0.000 1.964 ± 0.001 80 1.962 ± 0.003 1.964 ± 0.005 1.965 ± 0.003 1.962 ± 0.000 1.961 ± 0.002 90 1.962 ± 0.003 1.964 ± 0.005 1.965 ± 0.003 1.962 ± 0.000 1.957 ± 0.002 100 1.962 ± 0.003 1.964 ± 0.005 1.965 ± 0.003 1.962 ± 0.000 1.955 ± 0.002

Table B.18: Davies-Bouldin index on the BBC segmented news dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.063 ± 0.007 0.056 ± 0.025 0.065 ± 0.007 0.063 ± 0.007 0.062 ± 0.007 3 0.069 ± 0.006 0.064 ± 0.023 0.077 ± 0.006 0.069 ± 0.007 0.066 ± 0.006 4 0.077 ± 0.014 0.072 ± 0.031 0.089 ± 0.006 0.077 ± 0.007 0.080 ± 0.007 5 0.047 ± 0.012 0.070 ± 0.041 0.094 ± 0.006 0.060 ± 0.025 0.088 ± 0.008 6 0.038 ± 0.005 0.067 ± 0.052 0.097 ± 0.010 0.044 ± 0.017 0.065 ± 0.027 7 0.037 ± 0.004 0.065 ± 0.060 0.096 ± 0.022 0.038 ± 0.005 0.055 ± 0.031 8 0.037 ± 0.004 0.051 ± 0.049 0.062 ± 0.029 0.038 ± 0.004 0.043 ± 0.023 9 0.037 ± 0.004 0.044 ± 0.034 0.041 ± 0.004 0.037 ± 0.004 0.043 ± 0.025 10 0.036 ± 0.004 0.045 ± 0.047 0.056 ± 0.037 0.037 ± 0.004 0.043 ± 0.023 15 0.038 ± 0.008 0.052 ± 0.086 0.121 ± 0.039 0.040 ± 0.007 0.045 ± 0.013 20 0.038 ± 0.008 0.051 ± 0.080 0.096 ± 0.057 0.050 ± 0.011 0.060 ± 0.018 25 0.037 ± 0.005 0.054 ± 0.082 0.098 ± 0.055 0.048 ± 0.007 0.070 ± 0.030 30 0.042 ± 0.008 0.053 ± 0.080 0.094 ± 0.055 0.059 ± 0.036 0.104 ± 0.055 40 0.044 ± 0.016 0.059 ± 0.090 0.104 ± 0.047 0.083 ± 0.027 0.145 ± 0.038 50 0.040 ± 0.009 0.061 ± 0.082 0.102 ± 0.027 0.074 ± 0.017 0.125 ± 0.014 60 0.047 ± 0.012 0.071 ± 0.069 0.095 ± 0.009 0.063 ± 0.013 0.101 ± 0.014 70 0.044 ± 0.010 0.065 ± 0.049 0.076 ± 0.012 0.056 ± 0.010 0.088 ± 0.010 80 0.040 ± 0.007 0.059 ± 0.039 0.064 ± 0.009 0.053 ± 0.006 0.072 ± 0.010 90 0.039 ± 0.006 0.053 ± 0.032 0.054 ± 0.010 0.049 ± 0.007 0.060 ± 0.008 100 0.042 ± 0.008 0.049 ± 0.025 0.045 ± 0.005 0.047 ± 0.006 0.053 ± 0.008

Table B.19: One-vs-one SVM classification accuracy on the animal with attributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.161 ± 0.292 0.207 ± 0.596 0.282 ± 0.277 0.283 ± 0.143 0.391 ± 0.169 3 0.177 ± 0.308 0.242 ± 0.625 0.319 ± 0.272 0.289 ± 0.159 0.404 ± 0.160 4 0.175 ± 0.320 0.261 ± 0.653 0.340 ± 0.285 0.306 ± 0.176 0.410 ± 0.175 5 0.178 ± 0.325 0.268 ± 0.664 0.350 ± 0.286 0.317 ± 0.186 0.405 ± 0.188 6 0.182 ± 0.326 0.272 ± 0.674 0.353 ± 0.293 0.318 ± 0.192 0.410 ± 0.184 7 0.184 ± 0.329 0.278 ± 0.683 0.357 ± 0.296 0.333 ± 0.189 0.422 ± 0.184 8 0.190 ± 0.332 0.285 ± 0.691 0.361 ± 0.296 0.346 ± 0.192 0.430 ± 0.183 9 0.192 ± 0.335 0.291 ± 0.698 0.366 ± 0.296 0.348 ± 0.199 0.439 ± 0.183 10 0.195 ± 0.336 0.294 ± 0.704 0.368 ± 0.298 0.353 ± 0.197 0.441 ± 0.180 15 0.200 ± 0.341 0.310 ± 0.722 0.375 ± 0.297 0.375 ± 0.215 0.461 ± 0.172 20 0.204 ± 0.343 0.318 ± 0.733 0.379 ± 0.296 0.391 ± 0.229 0.471 ± 0.169 25 0.207 ± 0.345 0.324 ± 0.742 0.381 ± 0.296 0.400 ± 0.239 0.474 ± 0.164 30 0.209 ± 0.346 0.327 ± 0.748 0.382 ± 0.296 0.408 ± 0.245 0.483 ± 0.161 40 0.214 ± 0.347 0.333 ± 0.757 0.383 ± 0.295 0.420 ± 0.257 0.494 ± 0.154 50 0.217 ± 0.348 0.338 ± 0.763 0.385 ± 0.294 0.427 ± 0.264 0.501 ± 0.143 60 0.218 ± 0.349 0.340 ± 0.766 0.386 ± 0.294 0.434 ± 0.272 0.509 ± 0.132 70 0.220 ± 0.349 0.342 ± 0.769 0.386 ± 0.294 0.439 ± 0.278 0.516 ± 0.120 80 0.221 ± 0.350 0.344 ± 0.771 0.387 ± 0.293 0.443 ± 0.282 0.519 ± 0.113 90 0.222 ± 0.350 0.345 ± 0.774 0.387 ± 0.293 0.446 ± 0.285 0.520 ± 0.107 100 0.222 ± 0.351 0.346 ± 0.775 0.387 ± 0.293 0.448 ± 0.288 0.520 ± 0.101

Table B.20: Cophenetic correlation on the animal with attributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.029 ± 0.028 0.042 ± 0.083 0.054 ± 0.038 0.044 ± 0.017 0.060 ± 0.030 3 0.036 ± 0.039 0.053 ± 0.110 0.064 ± 0.053 0.051 ± 0.024 0.065 ± 0.031 4 0.047 ± 0.049 0.062 ± 0.137 0.084 ± 0.076 0.058 ± 0.031 0.072 ± 0.030 5 0.050 ± 0.054 0.067 ± 0.153 0.090 ± 0.083 0.064 ± 0.037 0.079 ± 0.030 6 0.054 ± 0.059 0.071 ± 0.165 0.092 ± 0.085 0.068 ± 0.042 0.084 ± 0.031 7 0.058 ± 0.067 0.077 ± 0.184 0.102 ± 0.097 0.074 ± 0.043 0.090 ± 0.030 8 0.062 ± 0.073 0.084 ± 0.204 0.114 ± 0.110 0.080 ± 0.048 0.095 ± 0.029 9 0.064 ± 0.076 0.090 ± 0.221 0.124 ± 0.122 0.083 ± 0.052 0.099 ± 0.029 10 0.067 ± 0.081 0.092 ± 0.230 0.125 ± 0.124 0.086 ± 0.054 0.102 ± 0.030 15 0.080 ± 0.100 0.110 ± 0.294 0.160 ± 0.172 0.099 ± 0.065 0.117 ± 0.028 20 0.088 ± 0.113 0.121 ± 0.339 0.182 ± 0.203 0.109 ± 0.076 0.124 ± 0.026 25 0.094 ± 0.123 0.128 ± 0.368 0.191 ± 0.216 0.116 ± 0.085 0.130 ± 0.024 30 0.100 ± 0.134 0.134 ± 0.392 0.197 ± 0.225 0.123 ± 0.092 0.135 ± 0.025 40 0.109 ± 0.150 0.145 ± 0.432 0.207 ± 0.240 0.132 ± 0.104 0.142 ± 0.025 50 0.116 ± 0.162 0.152 ± 0.463 0.215 ± 0.251 0.139 ± 0.114 0.147 ± 0.026 60 0.123 ± 0.174 0.158 ± 0.488 0.220 ± 0.258 0.145 ± 0.123 0.152 ± 0.026 70 0.128 ± 0.183 0.163 ± 0.508 0.224 ± 0.264 0.149 ± 0.129 0.155 ± 0.027 80 0.133 ± 0.192 0.168 ± 0.527 0.227 ± 0.268 0.153 ± 0.134 0.157 ± 0.027 90 0.137 ± 0.199 0.171 ± 0.542 0.229 ± 0.272 0.156 ± 0.138 0.158 ± 0.027 100 0.141 ± 0.206 0.174 ± 0.556 0.231 ± 0.274 0.158 ± 0.142 0.159 ± 0.027

Table B.21: Area under the curve of the RNX index on the animal with attributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.086 0.090 ± 0.004 0.099 0.095 0.098 3 0.090 0.096 ± 0.005 0.103 0.098 0.098 4 0.095 0.097 ± 0.004 0.103 0.100 0.106 5 0.094 0.096 ± 0.005 0.106 0.104 0.105 6 0.090 0.098 ± 0.006 0.110 0.101 0.113 7 0.097 0.102 ± 0.006 0.115 0.106 0.111 8 0.096 0.102 ± 0.006 0.115 0.107 0.113 9 0.096 0.102 ± 0.006 0.116 0.109 0.115 10 0.095 0.103 ± 0.006 0.116 0.111 0.116 15 0.095 0.103 ± 0.008 0.120 0.109 0.115 20 0.097 0.105 ± 0.007 0.120 0.110 0.112 25 0.098 0.106 ± 0.006 0.118 0.110 0.108 30 0.095 0.106 ± 0.008 0.120 0.112 0.113 40 0.097 0.106 ± 0.007 0.119 0.112 0.120 50 0.100 0.106 ± 0.007 0.120 0.110 0.118 60 0.096 0.105 ± 0.006 0.114 0.110 0.114 70 0.098 0.106 ± 0.007 0.119 0.111 0.114 80 0.096 0.105 ± 0.008 0.122 0.108 0.114 90 0.097 0.106 ± 0.008 0.122 0.111 0.117 100 0.096 0.104 ± 0.007 0.117 0.111 0.112

Table B.22: Clustering purity on the animal with attributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.113 0.123 ± 0.009 0.142 0.129 0.136 3 0.121 0.129 ± 0.011 0.151 0.134 0.138 4 0.131 0.134 ± 0.008 0.151 0.140 0.149 5 0.135 0.136 ± 0.008 0.154 0.145 0.157 6 0.134 0.136 ± 0.007 0.152 0.144 0.159 7 0.137 0.141 ± 0.007 0.156 0.146 0.158 8 0.140 0.143 ± 0.007 0.159 0.149 0.164 9 0.134 0.143 ± 0.009 0.162 0.149 0.162 10 0.138 0.143 ± 0.007 0.157 0.150 0.166 15 0.138 0.144 ± 0.007 0.159 0.151 0.166 20 0.136 0.145 ± 0.007 0.160 0.152 0.166 25 0.134 0.144 ± 0.008 0.159 0.152 0.158 30 0.136 0.145 ± 0.008 0.161 0.150 0.157 40 0.136 0.145 ± 0.008 0.160 0.150 0.161 50 0.138 0.145 ± 0.008 0.161 0.150 0.157 60 0.133 0.143 ± 0.009 0.161 0.151 0.156 70 0.135 0.145 ± 0.008 0.159 0.150 0.151 80 0.135 0.143 ± 0.008 0.159 0.148 0.148 90 0.136 0.144 ± 0.008 0.161 0.148 0.151 100 0.137 0.144 ± 0.007 0.157 0.147 0.155

Table B.23: Clustering normalized mutual information on the animal with attributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 1.993 ± 0.060 2.000 ± 0.079 1.994 ± 0.008 1.991 ± 0.006 1.984 ± 0.014 3 1.986 ± 0.061 1.987 ± 0.099 1.993 ± 0.017 1.988 ± 0.009 1.975 ± 0.017 4 1.982 ± 0.062 1.980 ± 0.109 1.978 ± 0.023 1.986 ± 0.011 1.968 ± 0.019 5 1.979 ± 0.060 1.979 ± 0.113 1.983 ± 0.026 1.986 ± 0.011 1.983 ± 0.033 6 1.979 ± 0.083 1.974 ± 0.140 1.947 ± 0.028 1.984 ± 0.011 1.981 ± 0.030 7 1.976 ± 0.100 1.966 ± 0.162 1.921 ± 0.030 1.983 ± 0.012 1.976 ± 0.037 8 1.976 ± 0.109 1.967 ± 0.171 1.912 ± 0.032 1.979 ± 0.015 1.956 ± 0.025 9 1.975 ± 0.119 1.961 ± 0.182 1.906 ± 0.034 1.978 ± 0.020 1.952 ± 0.031 10 1.974 ± 0.126 1.962 ± 0.188 1.906 ± 0.034 1.978 ± 0.018 1.949 ± 0.033 15 1.972 ± 0.140 1.959 ± 0.211 1.906 ± 0.037 1.975 ± 0.019 1.947 ± 0.038 20 1.972 ± 0.140 1.954 ± 0.218 1.905 ± 0.039 1.973 ± 0.021 1.949 ± 0.033 25 1.972 ± 0.143 1.955 ± 0.228 1.904 ± 0.040 1.971 ± 0.029 1.923 ± 0.044 30 1.972 ± 0.140 1.951 ± 0.224 1.905 ± 0.040 1.968 ± 0.031 1.919 ± 0.047 40 1.970 ± 0.141 1.954 ± 0.231 1.904 ± 0.042 1.968 ± 0.038 1.873 ± 0.052 50 1.969 ± 0.142 1.945 ± 0.231 1.903 ± 0.044 1.967 ± 0.041 1.895 ± 0.049 60 1.971 ± 0.141 1.946 ± 0.234 1.905 ± 0.045 1.966 ± 0.039 1.918 ± 0.055 70 1.971 ± 0.140 1.946 ± 0.231 1.905 ± 0.046 1.963 ± 0.042 1.905 ± 0.053 80 1.972 ± 0.157 1.948 ± 0.241 1.908 ± 0.046 1.963 ± 0.045 1.877 ± 0.054 90 1.972 ± 0.147 1.943 ± 0.239 1.906 ± 0.046 1.964 ± 0.054 1.881 ± 0.074 100 1.972 ± 0.141 1.947 ± 0.231 1.904 ± 0.045 1.963 ± 0.047 1.902 ± 0.064

Table B.24: Davies-Bouldin index on the animal with attributes (AWA) dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 1 0.713 ± 0.023 0.740 ± 0.077 0.708 ± 0.016 0.703 ± 0.012 0.793 ± 0.011 2 0.718 ± 0.024 0.746 ± 0.086 0.708 ± 0.014 0.703 ± 0.013 0.808 ± 0.014 3 0.721 ± 0.026 0.755 ± 0.087 0.723 ± 0.012 0.701 ± 0.011 0.816 ± 0.015 4 0.000 ± 0.021 0.239 ± 0.587 0.718 ± 0.015 0.704 ± 0.012 0.821 ± 0.015 5 0.000 ± 0.023 0.251 ± 0.617 0.754 ± 0.012 0.712 ± 0.010 0.838 ± 0.012 6 0.000 ± 0.020 0.248 ± 0.607 0.743 ± 0.013 0.710 ± 0.009 0.834 ± 0.012 7 0.000 ± 0.021 0.244 ± 0.599 0.733 ± 0.015 0.710 ± 0.009 0.841 ± 0.007 8 0.000 ± 0.017 0.245 ± 0.602 0.736 ± 0.018 0.709 ± 0.011 0.838 ± 0.006 9 0.000 ± 0.023 0.247 ± 0.607 0.742 ± 0.016 0.709 ± 0.012 0.840 ± 0.008 10 0.000 ± 0.019 0.250 ± 0.613 0.750 ± 0.013 0.714 ± 0.009 0.856 ± 0.012 11 0.000 ± 0.016 0.250 ± 0.614 0.751 ± 0.015 0.727 ± 0.013 0.855 ± 0.014 12 0.000 ± 0.017 0.252 ± 0.618 0.756 ± 0.022 0.738 ± 0.010 0.832 ± 0.012 13 0.000 ± 0.016 0.253 ± 0.621 0.760 ± 0.023 0.745 ± 0.016 0.795 ± 0.012 14 0.000 ± 0.014 0.256 ± 0.627 0.768 ± 0.019 0.770 ± 0.020 0.739 ± 0.020 15 0.000 ± 0.016 0.257 ± 0.630 0.771 ± 0.021 0.790 ± 0.017 0.709 ± 0.018 16 0.000 ± 0.014 0.256 ± 0.628 0.768 ± 0.012 0.795 ± 0.015 0.705 ± 0.016 17 0.000 ± 0.014 0.256 ± 0.627 0.767 ± 0.012 0.816 ± 0.015 0.705 ± 0.018 18 0.000 ± 0.015 0.255 ± 0.624 0.764 ± 0.018 0.812 ± 0.015 0.703 ± 0.017 19 0.000 ± 0.015 0.255 ± 0.625 0.764 ± 0.022 0.808 ± 0.015 0.703 ± 0.017 20 0.000 ± 0.015 0.255 ± 0.624 0.764 ± 0.019 0.801 ± 0.015 0.703 ± 0.017

Table B.25: One-vs-one SVM classification accuracy on the Berkeley protein dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.215 ± 0.293 0.308 ± 0.655 0.382 ± 0.413 0.438 ± 0.304 0.408 ± 0.281 3 0.288 ± 0.353 0.326 ± 0.665 0.370 ± 0.408 0.468 ± 0.311 0.406 ± 0.273 4 0.292 ± 0.362 0.324 ± 0.669 0.367 ± 0.409 0.472 ± 0.307 0.407 ± 0.267 5 0.279 ± 0.357 0.323 ± 0.664 0.365 ± 0.409 0.466 ± 0.304 0.354 ± 0.236 6 0.303 ± 0.363 0.329 ± 0.664 0.364 ± 0.410 0.463 ± 0.306 0.307 ± 0.215 7 0.299 ± 0.373 0.329 ± 0.666 0.363 ± 0.409 0.460 ± 0.307 0.279 ± 0.220 8 0.297 ± 0.372 0.327 ± 0.664 0.361 ± 0.408 0.459 ± 0.309 0.247 ± 0.210 9 0.300 ± 0.369 0.327 ± 0.661 0.359 ± 0.408 0.456 ± 0.309 0.228 ± 0.211 10 0.300 ± 0.369 0.327 ± 0.661 0.358 ± 0.407 0.455 ± 0.309 0.228 ± 0.215 15 0.307 ± 0.403 0.325 ± 0.675 0.348 ± 0.404 0.448 ± 0.313 0.165 ± 0.220 20 0.284 ± 0.392 0.314 ± 0.666 0.339 ± 0.400 0.442 ± 0.315 0.130 ± 0.213 25 0.271 ± 0.385 0.306 ± 0.659 0.331 ± 0.397 0.436 ± 0.318 0.108 ± 0.210 30 0.267 ± 0.370 0.301 ± 0.648 0.325 ± 0.394 0.432 ± 0.319 0.091 ± 0.206 40 0.244 ± 0.354 0.289 ± 0.633 0.314 ± 0.389 0.423 ± 0.320 0.065 ± 0.202 50 0.242 ± 0.326 0.286 ± 0.615 0.308 ± 0.385 0.416 ± 0.319 0.050 ± 0.203 60 0.238 ± 0.312 0.282 ± 0.605 0.302 ± 0.382 0.415 ± 0.311 0.040 ± 0.203 70 0.233 ± 0.294 0.279 ± 0.593 0.297 ± 0.378 0.410 ± 0.308 0.040 ± 0.206 80 0.234 ± 0.284 0.278 ± 0.586 0.292 ± 0.374 0.407 ± 0.304 0.041 ± 0.207 90 0.243 ± 0.271 0.280 ± 0.578 0.287 ± 0.370 0.403 ± 0.302 0.043 ± 0.208 100 0.250 ± 0.262 0.282 ± 0.572 0.284 ± 0.369 0.399 ± 0.301 0.046 ± 0.210

Table B.26: Cophenetic correlation on the Berkeley protein dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.143 ± 0.180 0.171 ± 0.293 0.223 ± 0.204 0.164 ± 0.053 0.150 ± 0.040 3 0.157 ± 0.190 0.188 ± 0.320 0.242 ± 0.226 0.185 ± 0.060 0.157 ± 0.043 4 0.158 ± 0.193 0.189 ± 0.316 0.240 ± 0.219 0.195 ± 0.053 0.179 ± 0.052 5 0.170 ± 0.202 0.194 ± 0.319 0.240 ± 0.216 0.203 ± 0.054 0.174 ± 0.047 6 0.173 ± 0.203 0.196 ± 0.318 0.237 ± 0.211 0.205 ± 0.054 0.165 ± 0.043 7 0.179 ± 0.202 0.198 ± 0.315 0.235 ± 0.208 0.206 ± 0.057 0.159 ± 0.043 8 0.182 ± 0.205 0.199 ± 0.318 0.235 ± 0.208 0.206 ± 0.058 0.153 ± 0.041 9 0.180 ± 0.202 0.200 ± 0.316 0.235 ± 0.206 0.208 ± 0.060 0.150 ± 0.042 10 0.183 ± 0.204 0.201 ± 0.317 0.232 ± 0.203 0.209 ± 0.062 0.150 ± 0.042 15 0.180 ± 0.196 0.204 ± 0.313 0.227 ± 0.196 0.212 ± 0.062 0.136 ± 0.044 20 0.176 ± 0.191 0.204 ± 0.312 0.222 ± 0.187 0.213 ± 0.061 0.125 ± 0.041 25 0.173 ± 0.186 0.203 ± 0.312 0.218 ± 0.182 0.214 ± 0.058 0.115 ± 0.039 30 0.171 ± 0.184 0.201 ± 0.309 0.215 ± 0.178 0.214 ± 0.057 0.105 ± 0.037 40 0.167 ± 0.178 0.199 ± 0.310 0.206 ± 0.168 0.212 ± 0.054 0.090 ± 0.032 50 0.163 ± 0.173 0.196 ± 0.303 0.202 ± 0.161 0.211 ± 0.053 0.080 ± 0.030 60 0.161 ± 0.169 0.190 ± 0.293 0.197 ± 0.156 0.210 ± 0.051 0.072 ± 0.027 70 0.160 ± 0.166 0.187 ± 0.288 0.194 ± 0.152 0.209 ± 0.049 0.066 ± 0.025 80 0.159 ± 0.164 0.179 ± 0.276 0.189 ± 0.147 0.208 ± 0.047 0.063 ± 0.024 90 0.157 ± 0.162 0.175 ± 0.269 0.186 ± 0.143 0.207 ± 0.045 0.060 ± 0.021 100 0.156 ± 0.160 0.170 ± 0.262 0.183 ± 0.140 0.205 ± 0.044 0.058 ± 0.020

Table B.27: Area under the curve of the RNX index on the Berkeley protein dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.700 0.720 ± 0.028 0.759 0.781 0.771 3 0.700 0.720 ± 0.028 0.759 0.784 0.770 4 0.700 0.720 ± 0.028 0.760 0.785 0.765 5 0.700 0.720 ± 0.028 0.759 0.785 0.700 6 0.700 0.720 ± 0.029 0.761 0.785 0.700 7 0.700 0.721 ± 0.029 0.762 0.787 0.700 8 0.700 0.721 ± 0.029 0.762 0.788 0.700 9 0.700 0.721 ± 0.029 0.762 0.788 0.700 10 0.700 0.721 ± 0.029 0.762 0.788 0.700 15 0.700 0.721 ± 0.030 0.764 0.788 0.700 20 0.700 0.721 ± 0.030 0.764 0.788 0.700 25 0.700 0.720 ± 0.029 0.761 0.788 0.700 30 0.700 0.720 ± 0.029 0.761 0.788 0.700 40 0.700 0.720 ± 0.029 0.761 0.788 0.700 50 0.700 0.720 ± 0.029 0.761 0.788 0.700 60 0.700 0.720 ± 0.029 0.761 0.788 0.812 70 0.700 0.720 ± 0.029 0.761 0.788 0.700 80 0.700 0.720 ± 0.029 0.761 0.788 0.700 90 0.700 0.720 ± 0.029 0.761 0.788 0.700 100 0.700 0.720 ± 0.029 0.761 0.788 0.807

Table B.28: Clustering purity on the Berkeley protein dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.091 0.163 ± 0.076 0.268 0.311 0.306 3 0.021 0.140 ± 0.102 0.270 0.315 0.305 4 0.019 0.140 ± 0.103 0.271 0.319 0.292 5 0.017 0.138 ± 0.103 0.268 0.320 0.116 6 0.017 0.139 ± 0.104 0.270 0.314 0.117 7 0.017 0.139 ± 0.104 0.271 0.319 0.109 8 0.017 0.139 ± 0.104 0.271 0.320 0.109 9 0.017 0.139 ± 0.104 0.271 0.320 0.116 10 0.017 0.139 ± 0.104 0.271 0.320 0.125 15 0.017 0.140 ± 0.105 0.273 0.320 0.094 20 0.017 0.140 ± 0.105 0.273 0.320 0.106 25 0.017 0.138 ± 0.104 0.270 0.320 0.085 30 0.017 0.139 ± 0.104 0.270 0.320 0.156 40 0.017 0.139 ± 0.104 0.270 0.320 0.121 50 0.017 0.139 ± 0.103 0.270 0.320 0.118 60 0.017 0.138 ± 0.103 0.269 0.320 0.392 70 0.017 0.139 ± 0.104 0.270 0.320 0.045 80 0.017 0.139 ± 0.104 0.270 0.320 0.024 90 0.017 0.139 ± 0.103 0.270 0.320 0.040 100 0.017 0.139 ± 0.103 0.269 0.320 0.371

Table B.29: Clustering normalized mutual information on the Berkeley protein dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 1.747 ± 0.212 1.648 ± 0.599 1.636 ± 0.258 1.693 ± 0.161 1.713 ± 0.146 3 1.750 ± 0.309 1.614 ± 0.645 1.531 ± 0.251 1.693 ± 0.159 1.713 ± 0.145 4 1.751 ± 0.489 1.564 ± 0.774 1.382 ± 0.250 1.695 ± 0.158 1.715 ± 0.148 5 1.751 ± 0.645 1.512 ± 0.921 1.225 ± 0.245 1.695 ± 0.160 1.637 ± 0.385 6 1.752 ± 0.645 1.512 ± 0.921 1.225 ± 0.246 1.694 ± 0.161 1.623 ± 0.406 7 1.753 ± 0.645 1.513 ± 0.921 1.225 ± 0.245 1.690 ± 0.163 1.628 ± 0.400 8 1.753 ± 0.645 1.512 ± 0.922 1.225 ± 0.246 1.689 ± 0.163 1.633 ± 0.393 9 1.754 ± 0.645 1.513 ± 0.922 1.225 ± 0.245 1.689 ± 0.163 1.632 ± 0.394 10 1.754 ± 0.645 1.512 ± 0.921 1.225 ± 0.245 1.689 ± 0.163 1.649 ± 0.372 15 1.754 ± 0.645 1.513 ± 0.922 1.225 ± 0.246 1.690 ± 0.163 1.637 ± 0.366 20 1.754 ± 0.645 1.513 ± 0.922 1.225 ± 0.247 1.690 ± 0.163 1.725 ± 0.229 25 1.751 ± 0.645 1.512 ± 0.921 1.225 ± 0.248 1.690 ± 0.163 1.631 ± 0.379 30 1.752 ± 0.645 1.512 ± 0.920 1.225 ± 0.246 1.690 ± 0.163 1.719 ± 0.165 40 1.751 ± 0.645 1.512 ± 0.920 1.225 ± 0.246 1.690 ± 0.163 1.753 ± 0.189 50 1.752 ± 0.645 1.512 ± 0.920 1.225 ± 0.246 1.690 ± 0.163 1.740 ± 0.182 60 1.752 ± 0.645 1.512 ± 0.921 1.225 ± 0.245 1.689 ± 0.163 1.710 ± 0.131 70 1.751 ± 0.645 1.512 ± 0.921 1.225 ± 0.246 1.690 ± 0.163 1.765 ± 0.213 80 1.751 ± 0.645 1.512 ± 0.921 1.225 ± 0.246 1.690 ± 0.163 1.818 ± 0.153 90 1.751 ± 0.645 1.511 ± 0.920 1.225 ± 0.246 1.690 ± 0.163 1.817 ± 0.159 100 1.752 ± 0.645 1.512 ± 0.920 1.225 ± 0.245 1.690 ± 0.163 1.702 ± 0.139

Table B.30: Davies-Bouldin index on the Berkeley protein dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.431 ± 0.006 0.446 ± 0.025 0.460 ± 0.013 0.455 ± 0.006 0.487 ± 0.010 3 0.453 ± 0.022 0.457 ± 0.029 0.462 ± 0.018 0.490 ± 0.013 0.574 ± 0.017 4 0.488 ± 0.012 0.508 ± 0.033 0.528 ± 0.012 0.516 ± 0.014 0.621 ± 0.014 5 0.528 ± 0.013 0.531 ± 0.023 0.534 ± 0.018 0.559 ± 0.019 0.653 ± 0.014 6 0.559 ± 0.012 0.546 ± 0.030 0.533 ± 0.020 0.590 ± 0.014 0.664 ± 0.017 7 0.564 ± 0.022 0.552 ± 0.028 0.541 ± 0.007 0.601 ± 0.007 0.692 ± 0.010 8 0.570 ± 0.007 0.561 ± 0.022 0.551 ± 0.017 0.600 ± 0.012 0.699 ± 0.018 9 0.584 ± 0.009 0.575 ± 0.018 0.567 ± 0.011 0.614 ± 0.010 0.700 ± 0.014 10 0.590 ± 0.012 0.577 ± 0.026 0.565 ± 0.014 0.629 ± 0.017 0.710 ± 0.016 15 0.653 ± 0.015 0.631 ± 0.037 0.609 ± 0.012 0.683 ± 0.013 0.738 ± 0.012 20 0.656 ± 0.017 0.637 ± 0.034 0.619 ± 0.014 0.691 ± 0.015 0.734 ± 0.020 25 0.627 ± 0.010 0.630 ± 0.018 0.633 ± 0.014 0.654 ± 0.012 0.745 ± 0.018 30 0.582 ± 0.006 0.608 ± 0.041 0.635 ± 0.016 0.614 ± 0.011 0.734 ± 0.012 40 0.471 ± 0.013 0.563 ± 0.132 0.655 ± 0.012 0.500 ± 0.012 0.702 ± 0.014 50 0.391 ± 0.009 0.520 ± 0.182 0.648 ± 0.009 0.406 ± 0.013 0.628 ± 0.023 60 0.356 ± 0.010 0.505 ± 0.210 0.653 ± 0.011 0.363 ± 0.012 0.525 ± 0.019 70 0.341 ± 0.011 0.489 ± 0.210 0.637 ± 0.015 0.345 ± 0.011 0.454 ± 0.020 80 0.335 ± 0.011 0.480 ± 0.206 0.626 ± 0.011 0.340 ± 0.010 0.393 ± 0.014 90 0.326 ± 0.016 0.477 ± 0.215 0.628 ± 0.012 0.330 ± 0.018 0.366 ± 0.011 100 0.326 ± 0.015 0.472 ± 0.208 0.618 ± 0.022 0.324 ± 0.017 0.346 ± 0.010

Table B.31: One-vs-one SVM classification accuracy on the Cora dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.429 0.397 ± 0.032 0.366 0.372 0.464 3 0.444 0.407 ± 0.037 0.370 0.384 0.460 4 0.385 0.387 ± 0.002 0.389 0.405 0.448 5 0.408 0.406 ± 0.002 0.404 0.409 0.431 6 0.383 0.389 ± 0.006 0.395 0.404 0.440 7 0.383 0.390 ± 0.007 0.397 0.414 0.481 8 0.383 0.389 ± 0.006 0.395 0.403 0.467 9 0.386 0.389 ± 0.004 0.393 0.399 0.448 10 0.388 0.390 ± 0.003 0.393 0.398 0.442 15 0.399 0.394 ± 0.005 0.389 0.396 0.389 20 0.395 0.392 ± 0.003 0.389 0.395 0.350 25 0.395 0.391 ± 0.004 0.387 0.434 0.333 30 0.391 0.375 ± 0.016 0.359 0.442 0.339 40 0.383 0.372 ± 0.012 0.360 0.407 0.353 50 0.329 0.345 ± 0.016 0.360 0.416 0.311 60 0.333 0.348 ± 0.015 0.363 0.359 0.307 70 0.308 0.334 ± 0.026 0.360 0.412 0.336 80 0.308 0.319 ± 0.011 0.330 0.350 0.373 90 0.333 0.336 ± 0.004 0.340 0.314 0.374 100 0.322 0.352 ± 0.030 0.381 0.317 0.377

Table B.32: Clustering purity on the Cora dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Single view Stacked K MV-MDS Worst Average Best views 2 0.119 0.177 ± 0.058 0.236 0.125 0.239 3 0.125 0.193 ± 0.068 0.260 0.138 0.236 4 0.137 0.169 ± 0.032 0.201 0.151 0.223 5 0.151 0.190 ± 0.038 0.228 0.158 0.195 6 0.144 0.171 ± 0.026 0.197 0.151 0.196 7 0.145 0.158 ± 0.013 0.170 0.157 0.225 8 0.144 0.169 ± 0.026 0.195 0.150 0.216 9 0.141 0.167 ± 0.026 0.192 0.147 0.202 10 0.140 0.167 ± 0.027 0.193 0.144 0.209 15 0.143 0.173 ± 0.030 0.203 0.148 0.157 20 0.143 0.177 ± 0.034 0.211 0.149 0.118 25 0.141 0.173 ± 0.032 0.205 0.211 0.079 30 0.099 0.146 ± 0.047 0.193 0.209 0.058 40 0.097 0.140 ± 0.044 0.184 0.165 0.125 50 0.099 0.106 ± 0.007 0.114 0.194 0.074 60 0.114 0.112 ± 0.002 0.110 0.100 0.017 70 0.093 0.101 ± 0.009 0.110 0.191 0.093 80 0.081 0.095 ± 0.014 0.109 0.088 0.098 90 0.109 0.111 ± 0.002 0.113 0.071 0.107 100 0.121 0.114 ± 0.007 0.107 0.074 0.113

Table B.33: Clustering normalized mutual information on the Cora dataset of MV-MDS compared with single view and stacked views MDS. K is the dimensionality of the projection.

Appendix C

Results of MVSC-CEV experiments

In this appendix, the detailed results of the experiments with the MVSC-CEV method are presented. There is a results table for each combination of dataset and evaluation method, yielding a total of 36 tables. The method is compared with its single-view counterpart, spectral clustering, applied either to each view individually or to all views stacked into a single matrix. For the single views, the worst, average and best views (on average) are reported.
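To make this comparison protocol concrete, the sketch below shows one way the two baselines can be produced for a single dataset: an embedding computed on each view separately (from which the worst, average and best single-view scores are later derived) and an embedding of all views stacked into one matrix. It is a minimal illustration using scikit-learn's SpectralEmbedding, not the implementation used in the thesis; the function names and parameters are assumptions.

    import numpy as np
    from sklearn.manifold import SpectralEmbedding

    def single_and_stacked_baselines(views, k):
        """Embed each view separately and all views stacked column-wise.

        views: list of (n_samples, d_v) arrays sharing the same sample ordering.
        k: dimensionality K of the projection.
        """
        per_view = [SpectralEmbedding(n_components=k).fit_transform(v) for v in views]
        stacked = SpectralEmbedding(n_components=k).fit_transform(np.hstack(views))
        return per_view, stacked

    def worst_average_best(view_scores, higher_is_better=True):
        """Summarise per-view scores as in the tables: worst view, average, best view."""
        s = np.asarray(view_scores, dtype=float)
        worst, best = (s.min(), s.max()) if higher_is_better else (s.max(), s.min())
        return worst, s.mean(), best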


Single view Stacked K MVSC-CEV Worst Average Best views 2 0.515 ± 0.026 0.450 ± 0.184 0.565 ± 0.013 0.508 ± 0.018 0.551 ± 0.023 3 0.500 ± 0.024 0.576 ± 0.179 0.685 ± 0.014 0.706 ± 0.019 0.776 ± 0.009 4 0.511 ± 0.014 0.661 ± 0.305 0.840 ± 0.010 0.850 ± 0.013 0.905 ± 0.007 5 0.514 ± 0.015 0.713 ± 0.329 0.877 ± 0.010 0.901 ± 0.007 0.946 ± 0.008 6 0.534 ± 0.024 0.746 ± 0.351 0.910 ± 0.010 0.926 ± 0.006 0.957 ± 0.005 7 0.547 ± 0.023 0.766 ± 0.358 0.926 ± 0.007 0.947 ± 0.008 0.965 ± 0.007 8 0.547 ± 0.023 0.782 ± 0.364 0.926 ± 0.009 0.961 ± 0.005 0.968 ± 0.007 9 0.547 ± 0.023 0.798 ± 0.354 0.928 ± 0.009 0.964 ± 0.004 0.971 ± 0.005 10 0.597 ± 0.036 0.813 ± 0.333 0.932 ± 0.009 0.966 ± 0.005 0.974 ± 0.005 15 0.655 ± 0.020 0.841 ± 0.299 0.950 ± 0.005 0.975 ± 0.005 0.977 ± 0.004 20 0.663 ± 0.024 0.848 ± 0.295 0.952 ± 0.005 0.976 ± 0.005 0.974 ± 0.005 25 0.690 ± 0.019 0.858 ± 0.274 0.955 ± 0.006 0.976 ± 0.005 0.974 ± 0.004 30 0.689 ± 0.016 0.859 ± 0.269 0.955 ± 0.006 0.980 ± 0.004 0.974 ± 0.005 40 0.685 ± 0.021 0.849 ± 0.272 0.953 ± 0.006 0.978 ± 0.003 0.974 ± 0.004 50 0.682 ± 0.027 0.823 ± 0.268 0.956 ± 0.007 0.974 ± 0.005 0.970 ± 0.006 60 0.685 ± 0.017 0.763 ± 0.348 0.949 ± 0.009 0.952 ± 0.011 0.943 ± 0.009 70 0.688 ± 0.016 0.683 ± 0.458 0.940 ± 0.009 0.873 ± 0.027 0.889 ± 0.027 80 0.674 ± 0.018 0.590 ± 0.519 0.901 ± 0.018 0.692 ± 0.063 0.786 ± 0.042 90 0.677 ± 0.018 0.506 ± 0.523 0.814 ± 0.023 0.495 ± 0.071 0.654 ± 0.047 100 0.669 ± 0.014 0.429 ± 0.515 0.705 ± 0.041 0.350 ± 0.072 0.530 ± 0.066

Table C.1: One-vs-one SVM classification accuracy on the digits dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.048 ± 0.141 0.175 ± 0.392 0.242 ± 0.195 0.257 ± 0.153 0.166 ± 0.131 3 0.047 ± 0.145 0.209 ± 0.430 0.310 ± 0.182 0.310 ± 0.137 0.235 ± 0.090 4 0.047 ± 0.147 0.231 ± 0.448 0.323 ± 0.215 0.335 ± 0.193 0.245 ± 0.084 5 0.051 ± 0.147 0.241 ± 0.451 0.345 ± 0.213 0.359 ± 0.188 0.269 ± 0.087 6 0.106 ± 0.110 0.244 ± 0.400 0.349 ± 0.192 0.362 ± 0.136 0.282 ± 0.086 7 0.114 ± 0.114 0.236 ± 0.376 0.335 ± 0.172 0.345 ± 0.145 0.280 ± 0.084 8 0.114 ± 0.113 0.239 ± 0.367 0.338 ± 0.160 0.344 ± 0.134 0.243 ± 0.096 9 0.115 ± 0.113 0.244 ± 0.366 0.344 ± 0.160 0.359 ± 0.140 0.245 ± 0.090 10 0.115 ± 0.113 0.237 ± 0.356 0.326 ± 0.142 0.348 ± 0.142 0.248 ± 0.090 15 0.116 ± 0.111 0.225 ± 0.327 0.320 ± 0.136 0.344 ± 0.123 0.263 ± 0.058 20 0.112 ± 0.113 0.218 ± 0.308 0.305 ± 0.115 0.331 ± 0.115 0.265 ± 0.065 25 0.139 ± 0.105 0.215 ± 0.275 0.296 ± 0.103 0.306 ± 0.109 0.254 ± 0.067 30 0.147 ± 0.096 0.210 ± 0.262 0.285 ± 0.104 0.295 ± 0.113 0.241 ± 0.055 40 0.137 ± 0.064 0.198 ± 0.237 0.274 ± 0.113 0.281 ± 0.114 0.228 ± 0.057 50 0.141 ± 0.061 0.190 ± 0.218 0.257 ± 0.102 0.264 ± 0.095 0.233 ± 0.050 60 0.132 ± 0.047 0.190 ± 0.213 0.256 ± 0.101 0.263 ± 0.090 0.240 ± 0.053 70 0.127 ± 0.053 0.194 ± 0.218 0.253 ± 0.103 0.259 ± 0.080 0.254 ± 0.066 80 0.129 ± 0.060 0.206 ± 0.235 0.259 ± 0.110 0.264 ± 0.086 0.272 ± 0.069 90 0.124 ± 0.054 0.216 ± 0.253 0.264 ± 0.110 0.278 ± 0.090 0.293 ± 0.070 100 0.153 ± 0.037 0.232 ± 0.266 0.272 ± 0.111 0.293 ± 0.091 0.308 ± 0.071

Table C.2: Cophenetic correlation on the digits dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.096 ± 0.063 0.087 ± 0.114 0.098 ± 0.048 0.103 ± 0.039 0.104 ± 0.019 3 0.095 ± 0.063 0.120 ± 0.155 0.152 ± 0.066 0.159 ± 0.058 0.152 ± 0.029 4 0.098 ± 0.065 0.145 ± 0.194 0.199 ± 0.093 0.211 ± 0.092 0.186 ± 0.037 5 0.098 ± 0.065 0.163 ± 0.223 0.224 ± 0.106 0.239 ± 0.109 0.207 ± 0.040 6 0.099 ± 0.058 0.172 ± 0.246 0.243 ± 0.121 0.256 ± 0.109 0.219 ± 0.042 7 0.100 ± 0.059 0.178 ± 0.260 0.255 ± 0.124 0.271 ± 0.117 0.226 ± 0.048 8 0.100 ± 0.059 0.185 ± 0.273 0.270 ± 0.130 0.279 ± 0.120 0.225 ± 0.050 9 0.100 ± 0.059 0.191 ± 0.284 0.276 ± 0.134 0.287 ± 0.127 0.228 ± 0.050 10 0.100 ± 0.059 0.195 ± 0.294 0.281 ± 0.134 0.289 ± 0.130 0.230 ± 0.053 15 0.101 ± 0.061 0.205 ± 0.319 0.291 ± 0.145 0.306 ± 0.138 0.245 ± 0.061 20 0.103 ± 0.063 0.211 ± 0.336 0.296 ± 0.150 0.313 ± 0.146 0.257 ± 0.072 25 0.125 ± 0.088 0.219 ± 0.345 0.301 ± 0.154 0.312 ± 0.147 0.258 ± 0.074 30 0.125 ± 0.090 0.220 ± 0.353 0.300 ± 0.156 0.310 ± 0.149 0.260 ± 0.073 40 0.129 ± 0.102 0.222 ± 0.368 0.299 ± 0.161 0.308 ± 0.151 0.262 ± 0.077 50 0.127 ± 0.105 0.221 ± 0.375 0.294 ± 0.160 0.304 ± 0.146 0.265 ± 0.076 60 0.125 ± 0.105 0.223 ± 0.382 0.294 ± 0.163 0.302 ± 0.144 0.267 ± 0.077 70 0.122 ± 0.104 0.225 ± 0.392 0.294 ± 0.164 0.301 ± 0.142 0.271 ± 0.079 80 0.121 ± 0.105 0.228 ± 0.403 0.296 ± 0.167 0.302 ± 0.141 0.276 ± 0.079 90 0.134 ± 0.129 0.233 ± 0.415 0.297 ± 0.169 0.303 ± 0.139 0.282 ± 0.079 100 0.147 ± 0.128 0.238 ± 0.421 0.299 ± 0.170 0.305 ± 0.137 0.286 ± 0.079

Table C.3: Area under the curve of the RNX index on the digits dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.329 0.364 ± 0.047 0.417 0.421 0.466 3 0.317 0.444 ± 0.072 0.556 0.570 0.596 4 0.315 0.495 ± 0.097 0.621 0.715 0.762 5 0.289 0.555 ± 0.148 0.676 0.708 0.875 6 0.429 0.559 ± 0.103 0.622 0.677 0.802 7 0.407 0.576 ± 0.110 0.679 0.781 0.795 8 0.407 0.588 ± 0.121 0.734 0.875 0.771 9 0.407 0.582 ± 0.111 0.682 0.885 0.817 10 0.407 0.602 ± 0.135 0.751 0.787 0.944 15 0.408 0.653 ± 0.171 0.805 0.902 0.809 20 0.409 0.667 ± 0.178 0.895 0.907 0.791 25 0.516 0.675 ± 0.121 0.806 0.898 0.810 30 0.472 0.668 ± 0.126 0.738 0.799 0.791 40 0.409 0.637 ± 0.137 0.731 0.800 0.822 50 0.379 0.625 ± 0.138 0.765 0.806 0.799 60 0.367 0.592 ± 0.127 0.754 0.753 0.786 70 0.342 0.577 ± 0.121 0.695 0.751 0.736 80 0.332 0.592 ± 0.138 0.716 0.751 0.726 90 0.390 0.603 ± 0.143 0.736 0.755 0.719 100 0.355 0.579 ± 0.125 0.755 0.784 0.752

Table C.4: Clustering purity on the digits dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.439 0.360 ± 0.077 0.415 0.429 0.468 3 0.430 0.415 ± 0.055 0.491 0.553 0.602 4 0.416 0.473 ± 0.070 0.583 0.634 0.730 5 0.391 0.516 ± 0.092 0.622 0.650 0.829 6 0.402 0.514 ± 0.093 0.590 0.681 0.821 7 0.398 0.529 ± 0.105 0.636 0.739 0.824 8 0.398 0.537 ± 0.109 0.660 0.791 0.805 9 0.398 0.536 ± 0.101 0.640 0.807 0.840 10 0.398 0.543 ± 0.115 0.684 0.750 0.893 15 0.400 0.595 ± 0.151 0.758 0.835 0.815 20 0.400 0.601 ± 0.155 0.823 0.847 0.797 25 0.478 0.615 ± 0.115 0.762 0.841 0.808 30 0.439 0.615 ± 0.117 0.734 0.793 0.792 40 0.400 0.597 ± 0.123 0.729 0.803 0.840 50 0.383 0.584 ± 0.118 0.720 0.811 0.814 60 0.390 0.570 ± 0.112 0.719 0.753 0.782 70 0.348 0.557 ± 0.111 0.658 0.713 0.728 80 0.346 0.562 ± 0.132 0.676 0.718 0.738 90 0.379 0.568 ± 0.129 0.705 0.719 0.735 100 0.417 0.561 ± 0.092 0.693 0.744 0.743

Table C.5: Clustering normalized mutual information on the digits dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 1.977 ± 0.028 1.940 ± 0.121 1.922 ± 0.013 1.906 ± 0.036 1.895 ± 0.061 3 1.897 ± 0.051 1.900 ± 0.172 1.870 ± 0.076 1.848 ± 0.072 1.868 ± 0.066 4 1.873 ± 0.078 1.877 ± 0.205 1.813 ± 0.081 1.788 ± 0.097 1.766 ± 0.143 5 1.857 ± 0.091 1.819 ± 0.236 1.779 ± 0.080 1.777 ± 0.091 1.746 ± 0.123 6 1.863 ± 0.088 1.834 ± 0.245 1.784 ± 0.081 1.802 ± 0.081 1.746 ± 0.140 7 1.864 ± 0.090 1.817 ± 0.276 1.770 ± 0.078 1.718 ± 0.117 1.741 ± 0.145 8 1.892 ± 0.100 1.819 ± 0.288 1.754 ± 0.067 1.701 ± 0.130 1.739 ± 0.134 9 1.869 ± 0.106 1.808 ± 0.301 1.750 ± 0.085 1.709 ± 0.128 1.728 ± 0.150 10 1.854 ± 0.112 1.805 ± 0.298 1.736 ± 0.094 1.737 ± 0.102 1.696 ± 0.141 15 1.843 ± 0.116 1.790 ± 0.314 1.719 ± 0.103 1.699 ± 0.124 1.705 ± 0.181 20 1.857 ± 0.119 1.785 ± 0.316 1.705 ± 0.072 1.700 ± 0.123 1.695 ± 0.180 25 1.856 ± 0.125 1.768 ± 0.384 1.719 ± 0.069 1.699 ± 0.122 1.716 ± 0.173 30 1.848 ± 0.117 1.762 ± 0.380 1.729 ± 0.067 1.724 ± 0.107 1.705 ± 0.184 40 1.857 ± 0.126 1.759 ± 0.398 1.712 ± 0.067 1.724 ± 0.111 1.717 ± 0.145 50 1.836 ± 0.128 1.757 ± 0.390 1.696 ± 0.071 1.723 ± 0.114 1.718 ± 0.133 60 1.814 ± 0.137 1.752 ± 0.386 1.687 ± 0.080 1.726 ± 0.108 1.723 ± 0.134 70 1.838 ± 0.132 1.763 ± 0.389 1.686 ± 0.078 1.679 ± 0.146 1.733 ± 0.144 80 1.843 ± 0.113 1.753 ± 0.390 1.708 ± 0.075 1.682 ± 0.145 1.712 ± 0.160 90 1.881 ± 0.134 1.769 ± 0.379 1.714 ± 0.054 1.686 ± 0.138 1.705 ± 0.180 100 1.841 ± 0.131 1.784 ± 0.314 1.704 ± 0.080 1.724 ± 0.121 1.732 ± 0.150

Table C.6: Davies-Bouldin index on the digits dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.505 ± 0.018 0.476 ± 0.053 0.497 ± 0.007 0.482 ± 0.016 0.523 ± 0.015 3 0.536 ± 0.021 0.522 ± 0.058 0.517 ± 0.008 0.574 ± 0.021 0.601 ± 0.019 4 0.525 ± 0.018 0.528 ± 0.055 0.535 ± 0.005 0.585 ± 0.016 0.592 ± 0.014 5 0.546 ± 0.023 0.568 ± 0.058 0.552 ± 0.011 0.604 ± 0.028 0.650 ± 0.021 6 0.560 ± 0.006 0.589 ± 0.054 0.624 ± 0.009 0.611 ± 0.029 0.650 ± 0.014 7 0.583 ± 0.020 0.609 ± 0.079 0.666 ± 0.006 0.612 ± 0.029 0.664 ± 0.016 8 0.594 ± 0.027 0.618 ± 0.087 0.684 ± 0.004 0.624 ± 0.027 0.674 ± 0.022 9 0.622 ± 0.026 0.638 ± 0.081 0.699 ± 0.007 0.669 ± 0.026 0.703 ± 0.021 10 0.626 ± 0.023 0.644 ± 0.091 0.716 ± 0.009 0.673 ± 0.027 0.717 ± 0.024 15 0.665 ± 0.028 0.682 ± 0.097 0.760 ± 0.006 0.717 ± 0.017 0.783 ± 0.010 20 0.672 ± 0.021 0.697 ± 0.113 0.793 ± 0.006 0.733 ± 0.012 0.784 ± 0.016 25 0.685 ± 0.017 0.712 ± 0.108 0.799 ± 0.006 0.741 ± 0.013 0.796 ± 0.012 30 0.683 ± 0.022 0.717 ± 0.118 0.815 ± 0.003 0.742 ± 0.007 0.799 ± 0.013 40 0.681 ± 0.021 0.726 ± 0.120 0.821 ± 0.004 0.757 ± 0.013 0.801 ± 0.015 50 0.672 ± 0.007 0.720 ± 0.119 0.815 ± 0.008 0.745 ± 0.015 0.792 ± 0.011 60 0.640 ± 0.016 0.704 ± 0.134 0.811 ± 0.007 0.719 ± 0.025 0.766 ± 0.017 70 0.567 ± 0.026 0.657 ± 0.161 0.783 ± 0.008 0.656 ± 0.017 0.698 ± 0.016 80 0.497 ± 0.032 0.595 ± 0.175 0.725 ± 0.012 0.562 ± 0.036 0.591 ± 0.032 90 0.436 ± 0.034 0.524 ± 0.160 0.629 ± 0.031 0.465 ± 0.029 0.504 ± 0.036 100 0.391 ± 0.027 0.455 ± 0.126 0.518 ± 0.035 0.412 ± 0.032 0.453 ± 0.029

Table C.7: One-vs-one SVM classification accuracy on the Reuters multilingual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.720 ± 0.064 0.699 ± 0.165 0.723 ± 0.066 0.723 ± 0.057 0.720 ± 0.058 3 0.719 ± 0.064 0.659 ± 0.295 0.726 ± 0.064 0.725 ± 0.058 0.722 ± 0.059 4 0.574 ± 0.074 0.613 ± 0.299 0.732 ± 0.065 0.514 ± 0.151 0.698 ± 0.058 5 0.545 ± 0.089 0.612 ± 0.245 0.623 ± 0.044 0.567 ± 0.120 0.528 ± 0.027 6 0.410 ± 0.085 0.554 ± 0.254 0.561 ± 0.073 0.501 ± 0.088 0.472 ± 0.062 7 0.478 ± 0.081 0.567 ± 0.228 0.512 ± 0.071 0.571 ± 0.076 0.541 ± 0.052 8 0.466 ± 0.085 0.563 ± 0.201 0.591 ± 0.057 0.480 ± 0.062 0.483 ± 0.055 9 0.443 ± 0.084 0.541 ± 0.197 0.604 ± 0.054 0.493 ± 0.057 0.494 ± 0.050 10 0.378 ± 0.086 0.516 ± 0.226 0.564 ± 0.049 0.537 ± 0.049 0.528 ± 0.045 15 0.467 ± 0.073 0.508 ± 0.151 0.551 ± 0.051 0.510 ± 0.058 0.512 ± 0.046 20 0.435 ± 0.064 0.516 ± 0.152 0.571 ± 0.034 0.414 ± 0.052 0.378 ± 0.046 25 0.411 ± 0.057 0.492 ± 0.140 0.529 ± 0.036 0.376 ± 0.045 0.372 ± 0.038 30 0.433 ± 0.055 0.478 ± 0.140 0.562 ± 0.029 0.421 ± 0.035 0.369 ± 0.036 40 0.379 ± 0.044 0.441 ± 0.121 0.501 ± 0.029 0.379 ± 0.033 0.400 ± 0.027 50 0.336 ± 0.046 0.390 ± 0.188 0.523 ± 0.025 0.374 ± 0.032 0.388 ± 0.023 60 0.325 ± 0.041 0.366 ± 0.097 0.402 ± 0.021 0.299 ± 0.030 0.359 ± 0.024 70 0.288 ± 0.039 0.338 ± 0.104 0.387 ± 0.020 0.293 ± 0.023 0.304 ± 0.021 80 0.258 ± 0.039 0.301 ± 0.093 0.321 ± 0.019 0.273 ± 0.023 0.256 ± 0.022 90 0.237 ± 0.039 0.268 ± 0.103 0.338 ± 0.017 0.258 ± 0.025 0.231 ± 0.024 100 0.184 ± 0.044 0.200 ± 0.083 0.216 ± 0.025 0.228 ± 0.029 0.233 ± 0.029

Table C.8: Cophenetic correlation on the Reuters multilingual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.101 ± 0.010 0.095 ± 0.026 0.101 ± 0.009 0.106 ± 0.005 0.103 ± 0.005 3 0.117 ± 0.015 0.112 ± 0.074 0.110 ± 0.011 0.115 ± 0.005 0.111 ± 0.005 4 0.134 ± 0.039 0.139 ± 0.092 0.134 ± 0.016 0.172 ± 0.051 0.152 ± 0.010 5 0.169 ± 0.056 0.170 ± 0.120 0.184 ± 0.042 0.216 ± 0.052 0.196 ± 0.023 6 0.168 ± 0.060 0.188 ± 0.132 0.217 ± 0.055 0.232 ± 0.042 0.228 ± 0.031 7 0.175 ± 0.063 0.197 ± 0.136 0.228 ± 0.058 0.236 ± 0.042 0.234 ± 0.031 8 0.180 ± 0.070 0.202 ± 0.140 0.231 ± 0.059 0.242 ± 0.036 0.243 ± 0.032 9 0.187 ± 0.076 0.209 ± 0.147 0.240 ± 0.063 0.251 ± 0.035 0.248 ± 0.031 10 0.187 ± 0.081 0.213 ± 0.153 0.241 ± 0.063 0.255 ± 0.035 0.256 ± 0.032 15 0.213 ± 0.090 0.227 ± 0.161 0.248 ± 0.067 0.274 ± 0.038 0.277 ± 0.030 20 0.211 ± 0.091 0.234 ± 0.169 0.253 ± 0.070 0.276 ± 0.030 0.280 ± 0.028 25 0.212 ± 0.098 0.239 ± 0.183 0.263 ± 0.077 0.278 ± 0.031 0.281 ± 0.028 30 0.216 ± 0.102 0.240 ± 0.190 0.267 ± 0.079 0.283 ± 0.030 0.282 ± 0.027 40 0.212 ± 0.106 0.243 ± 0.205 0.273 ± 0.084 0.281 ± 0.029 0.286 ± 0.026 50 0.211 ± 0.111 0.239 ± 0.216 0.272 ± 0.085 0.284 ± 0.028 0.285 ± 0.024 60 0.212 ± 0.112 0.239 ± 0.222 0.265 ± 0.088 0.280 ± 0.026 0.287 ± 0.025 70 0.214 ± 0.115 0.240 ± 0.229 0.263 ± 0.090 0.283 ± 0.025 0.287 ± 0.023 80 0.218 ± 0.119 0.240 ± 0.234 0.261 ± 0.092 0.286 ± 0.027 0.290 ± 0.024 90 0.222 ± 0.121 0.239 ± 0.236 0.261 ± 0.094 0.288 ± 0.027 0.291 ± 0.024 100 0.226 ± 0.122 0.240 ± 0.241 0.257 ± 0.096 0.288 ± 0.026 0.293 ± 0.024

Table C.9: Area under the curve of the RNX index on the Reuters multilingual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.463 0.458 ± 0.017 0.427 0.469 0.509 3 0.458 0.457 ± 0.025 0.426 0.456 0.494 4 0.475 0.476 ± 0.035 0.520 0.529 0.562 5 0.522 0.491 ± 0.025 0.507 0.530 0.566 6 0.495 0.475 ± 0.020 0.439 0.494 0.567 7 0.504 0.487 ± 0.025 0.442 0.497 0.567 8 0.505 0.509 ± 0.011 0.529 0.487 0.573 9 0.508 0.508 ± 0.007 0.518 0.486 0.579 10 0.478 0.489 ± 0.021 0.463 0.488 0.581 15 0.491 0.517 ± 0.032 0.570 0.537 0.551 20 0.475 0.537 ± 0.045 0.613 0.537 0.599 25 0.478 0.534 ± 0.061 0.648 0.566 0.607 30 0.383 0.512 ± 0.096 0.672 0.525 0.579 40 0.386 0.504 ± 0.086 0.640 0.529 0.555 50 0.491 0.517 ± 0.061 0.634 0.493 0.541 60 0.513 0.498 ± 0.020 0.526 0.511 0.583 70 0.510 0.505 ± 0.040 0.579 0.505 0.544 80 0.492 0.495 ± 0.019 0.531 0.482 0.568 90 0.490 0.503 ± 0.036 0.574 0.482 0.606 100 0.512 0.503 ± 0.014 0.525 0.497 0.593

Table C.10: Clustering purity on the Reuters multilingual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.163 0.156 ± 0.012 0.136 0.168 0.270 3 0.183 0.159 ± 0.016 0.135 0.170 0.270 4 0.169 0.175 ± 0.033 0.225 0.204 0.321 5 0.202 0.194 ± 0.024 0.224 0.252 0.320 6 0.163 0.189 ± 0.023 0.162 0.196 0.313 7 0.163 0.182 ± 0.019 0.164 0.198 0.311 8 0.180 0.203 ± 0.030 0.263 0.233 0.339 9 0.187 0.214 ± 0.029 0.268 0.233 0.341 10 0.190 0.207 ± 0.025 0.230 0.207 0.338 15 0.217 0.238 ± 0.048 0.329 0.224 0.338 20 0.241 0.236 ± 0.049 0.321 0.256 0.350 25 0.190 0.237 ± 0.074 0.377 0.260 0.361 30 0.176 0.221 ± 0.090 0.381 0.259 0.339 40 0.183 0.215 ± 0.076 0.350 0.219 0.330 50 0.186 0.229 ± 0.058 0.340 0.198 0.315 60 0.189 0.220 ± 0.042 0.300 0.207 0.319 70 0.190 0.212 ± 0.036 0.276 0.208 0.307 80 0.174 0.213 ± 0.039 0.288 0.192 0.311 90 0.185 0.210 ± 0.033 0.274 0.180 0.350 100 0.156 0.204 ± 0.039 0.274 0.192 0.330

Table C.11: Clustering normalized mutual information on the Reuters multilingual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 1.902 ± 0.013 1.834 ± 0.082 1.817 ± 0.016 1.811 ± 0.012 1.810 ± 0.013 3 1.623 ± 0.018 1.675 ± 0.154 1.798 ± 0.020 1.622 ± 0.019 1.619 ± 0.017 4 1.532 ± 0.048 1.633 ± 0.205 1.623 ± 0.028 1.600 ± 0.038 1.530 ± 0.018 5 1.660 ± 0.076 1.608 ± 0.209 1.507 ± 0.022 1.708 ± 0.054 1.753 ± 0.014 6 1.661 ± 0.074 1.686 ± 0.142 1.701 ± 0.026 1.654 ± 0.063 1.656 ± 0.059 7 1.663 ± 0.073 1.700 ± 0.133 1.701 ± 0.026 1.652 ± 0.063 1.655 ± 0.058 8 1.663 ± 0.079 1.702 ± 0.142 1.687 ± 0.014 1.646 ± 0.065 1.688 ± 0.056 9 1.628 ± 0.076 1.670 ± 0.159 1.659 ± 0.016 1.647 ± 0.064 1.685 ± 0.051 10 1.622 ± 0.075 1.654 ± 0.149 1.660 ± 0.010 1.592 ± 0.059 1.652 ± 0.061 15 1.680 ± 0.066 1.691 ± 0.173 1.674 ± 0.017 1.760 ± 0.067 1.759 ± 0.080 20 1.727 ± 0.062 1.702 ± 0.122 1.671 ± 0.015 1.708 ± 0.050 1.780 ± 0.066 25 1.726 ± 0.071 1.752 ± 0.183 1.848 ± 0.016 1.819 ± 0.060 1.807 ± 0.069 30 1.880 ± 0.059 1.759 ± 0.219 1.693 ± 0.016 1.830 ± 0.074 1.846 ± 0.070 40 1.880 ± 0.057 1.766 ± 0.190 1.693 ± 0.009 1.830 ± 0.062 1.833 ± 0.079 50 1.878 ± 0.056 1.746 ± 0.189 1.692 ± 0.011 1.837 ± 0.056 1.717 ± 0.047 60 1.721 ± 0.054 1.706 ± 0.106 1.695 ± 0.018 1.670 ± 0.047 1.725 ± 0.038 70 1.746 ± 0.054 1.708 ± 0.118 1.697 ± 0.016 1.668 ± 0.046 1.670 ± 0.044 80 1.716 ± 0.055 1.692 ± 0.111 1.694 ± 0.018 1.702 ± 0.054 1.725 ± 0.043 90 1.745 ± 0.057 1.740 ± 0.166 1.716 ± 0.018 1.851 ± 0.058 1.716 ± 0.042 100 1.884 ± 0.038 1.889 ± 0.092 1.910 ± 0.008 1.870 ± 0.046 1.875 ± 0.053

Table C.12: Davies-Bouldin index on the Reuters multilingual corpus dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.584 ± 0.018 0.588 ± 0.024 0.591 ± 0.015 0.619 ± 0.012 0.620 ± 0.015 3 0.816 ± 0.012 0.818 ± 0.016 0.819 ± 0.010 0.860 ± 0.013 0.862 ± 0.014 4 0.920 ± 0.009 0.913 ± 0.017 0.906 ± 0.011 0.943 ± 0.009 0.943 ± 0.007 5 0.939 ± 0.012 0.937 ± 0.014 0.936 ± 0.006 0.958 ± 0.006 0.958 ± 0.005 6 0.938 ± 0.012 0.940 ± 0.014 0.941 ± 0.007 0.958 ± 0.006 0.958 ± 0.006 7 0.948 ± 0.010 0.944 ± 0.014 0.940 ± 0.009 0.959 ± 0.007 0.960 ± 0.008 8 0.950 ± 0.010 0.949 ± 0.013 0.949 ± 0.009 0.967 ± 0.007 0.967 ± 0.007 9 0.950 ± 0.008 0.950 ± 0.012 0.951 ± 0.009 0.966 ± 0.006 0.966 ± 0.006 10 0.951 ± 0.008 0.951 ± 0.011 0.951 ± 0.008 0.964 ± 0.008 0.964 ± 0.008 15 0.956 ± 0.008 0.952 ± 0.012 0.949 ± 0.008 0.968 ± 0.007 0.969 ± 0.006 20 0.953 ± 0.009 0.950 ± 0.013 0.947 ± 0.009 0.967 ± 0.005 0.967 ± 0.005 25 0.949 ± 0.006 0.947 ± 0.010 0.946 ± 0.008 0.963 ± 0.004 0.963 ± 0.004 30 0.944 ± 0.007 0.945 ± 0.011 0.946 ± 0.009 0.959 ± 0.007 0.959 ± 0.006 40 0.934 ± 0.007 0.936 ± 0.013 0.937 ± 0.010 0.952 ± 0.005 0.952 ± 0.005 50 0.907 ± 0.010 0.913 ± 0.018 0.918 ± 0.012 0.939 ± 0.010 0.939 ± 0.010 60 0.809 ± 0.021 0.827 ± 0.040 0.846 ± 0.022 0.889 ± 0.018 0.889 ± 0.018 70 0.638 ± 0.034 0.653 ± 0.050 0.669 ± 0.030 0.767 ± 0.025 0.768 ± 0.026 80 0.523 ± 0.030 0.528 ± 0.035 0.533 ± 0.018 0.641 ± 0.028 0.644 ± 0.029 90 0.448 ± 0.032 0.456 ± 0.037 0.464 ± 0.015 0.553 ± 0.027 0.555 ± 0.026 100 0.401 ± 0.023 0.414 ± 0.034 0.428 ± 0.016 0.498 ± 0.021 0.498 ± 0.022

Table C.13: One-vs-one SVM classification accuracy on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.139 ± 0.007 0.127 ± 0.018 0.115 ± 0.003 0.133 ± 0.001 0.133 ± 0.001 3 0.195 ± 0.008 0.193 ± 0.009 0.191 ± 0.003 0.201 ± 0.002 0.201 ± 0.002 4 0.207 ± 0.006 0.208 ± 0.007 0.209 ± 0.003 0.230 ± 0.001 0.229 ± 0.000 5 0.221 ± 0.008 0.225 ± 0.011 0.229 ± 0.004 0.237 ± 0.001 0.237 ± 0.001 6 0.221 ± 0.010 0.224 ± 0.012 0.227 ± 0.005 0.235 ± 0.002 0.235 ± 0.002 7 0.217 ± 0.010 0.219 ± 0.012 0.220 ± 0.006 0.226 ± 0.002 0.226 ± 0.002 8 0.195 ± 0.009 0.202 ± 0.016 0.210 ± 0.007 0.214 ± 0.001 0.213 ± 0.001 9 0.192 ± 0.010 0.197 ± 0.014 0.202 ± 0.008 0.207 ± 0.001 0.207 ± 0.000 10 0.182 ± 0.010 0.188 ± 0.016 0.195 ± 0.009 0.199 ± 0.000 0.199 ± 0.000 15 0.182 ± 0.016 0.182 ± 0.022 0.182 ± 0.015 0.211 ± 0.000 0.211 ± 0.000 20 0.180 ± 0.021 0.181 ± 0.028 0.182 ± 0.018 0.198 ± 0.001 0.198 ± 0.000 25 0.189 ± 0.028 0.192 ± 0.036 0.194 ± 0.023 0.211 ± 0.002 0.211 ± 0.002 30 0.197 ± 0.031 0.199 ± 0.040 0.201 ± 0.025 0.221 ± 0.002 0.221 ± 0.002 40 0.221 ± 0.040 0.218 ± 0.052 0.215 ± 0.032 0.238 ± 0.002 0.238 ± 0.002 50 0.244 ± 0.049 0.241 ± 0.065 0.239 ± 0.043 0.257 ± 0.002 0.257 ± 0.002 60 0.270 ± 0.059 0.270 ± 0.081 0.270 ± 0.056 0.289 ± 0.003 0.289 ± 0.003 70 0.314 ± 0.074 0.319 ± 0.105 0.325 ± 0.073 0.347 ± 0.003 0.347 ± 0.003 80 0.385 ± 0.098 0.388 ± 0.136 0.390 ± 0.095 0.429 ± 0.002 0.429 ± 0.002 90 0.467 ± 0.124 0.465 ± 0.172 0.462 ± 0.120 0.513 ± 0.001 0.513 ± 0.000 100 0.535 ± 0.144 0.534 ± 0.205 0.534 ± 0.146 0.599 ± 0.000 0.599 ± 0.001

Table C.14: Cophenetic correlation on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.081 ± 0.006 0.080 ± 0.008 0.078 ± 0.005 0.089 ± 0.000 0.089 ± 0.000 3 0.121 ± 0.010 0.120 ± 0.013 0.119 ± 0.008 0.140 ± 0.000 0.140 ± 0.000 4 0.143 ± 0.012 0.142 ± 0.017 0.141 ± 0.012 0.170 ± 0.000 0.170 ± 0.000 5 0.155 ± 0.014 0.155 ± 0.019 0.155 ± 0.014 0.177 ± 0.001 0.177 ± 0.000 6 0.173 ± 0.018 0.173 ± 0.025 0.172 ± 0.017 0.196 ± 0.000 0.196 ± 0.000 7 0.180 ± 0.021 0.182 ± 0.028 0.183 ± 0.019 0.206 ± 0.001 0.206 ± 0.001 8 0.183 ± 0.022 0.187 ± 0.032 0.192 ± 0.022 0.213 ± 0.001 0.213 ± 0.001 9 0.192 ± 0.025 0.194 ± 0.035 0.197 ± 0.024 0.217 ± 0.001 0.217 ± 0.001 10 0.193 ± 0.026 0.196 ± 0.037 0.198 ± 0.026 0.220 ± 0.002 0.220 ± 0.001 15 0.212 ± 0.034 0.213 ± 0.048 0.214 ± 0.034 0.243 ± 0.001 0.243 ± 0.001 20 0.223 ± 0.042 0.222 ± 0.058 0.222 ± 0.039 0.246 ± 0.002 0.247 ± 0.002 25 0.233 ± 0.050 0.232 ± 0.067 0.231 ± 0.045 0.257 ± 0.002 0.257 ± 0.002 30 0.236 ± 0.053 0.237 ± 0.073 0.239 ± 0.051 0.264 ± 0.003 0.265 ± 0.003 40 0.249 ± 0.063 0.248 ± 0.086 0.248 ± 0.058 0.271 ± 0.003 0.271 ± 0.003 50 0.258 ± 0.071 0.258 ± 0.098 0.258 ± 0.068 0.279 ± 0.001 0.279 ± 0.001 60 0.269 ± 0.079 0.271 ± 0.112 0.272 ± 0.080 0.290 ± 0.002 0.289 ± 0.001 70 0.286 ± 0.091 0.289 ± 0.130 0.292 ± 0.092 0.306 ± 0.002 0.306 ± 0.001 80 0.308 ± 0.105 0.309 ± 0.149 0.310 ± 0.105 0.329 ± 0.000 0.329 ± 0.000 90 0.332 ± 0.121 0.331 ± 0.170 0.330 ± 0.120 0.352 ± 0.001 0.352 ± 0.000 100 0.353 ± 0.136 0.353 ± 0.193 0.353 ± 0.137 0.380 ± 0.001 0.380 ± 0.000

Table C.15: Area under the curve of the RNX index on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.535 0.542 ± 0.007 0.550 0.547 0.567 3 0.552 0.575 ± 0.023 0.597 0.594 0.616 4 0.630 0.663 ± 0.033 0.696 0.837 0.860 5 0.861 0.862 ± 0.001 0.863 0.885 0.911 6 0.635 0.717 ± 0.082 0.800 0.919 0.940 7 0.684 0.746 ± 0.061 0.807 0.713 0.733 8 0.591 0.628 ± 0.037 0.665 0.613 0.637 9 0.551 0.568 ± 0.017 0.585 0.598 0.624 10 0.527 0.551 ± 0.025 0.576 0.575 0.591 15 0.443 0.518 ± 0.075 0.593 0.456 0.478 20 0.429 0.446 ± 0.017 0.463 0.435 0.457 25 0.505 0.497 ± 0.009 0.488 0.428 0.452 30 0.388 0.494 ± 0.106 0.600 0.421 0.443 40 0.532 0.530 ± 0.002 0.528 0.414 0.469 50 0.484 0.439 ± 0.045 0.393 0.503 0.525 60 0.498 0.460 ± 0.038 0.423 0.394 0.415 70 0.380 0.443 ± 0.063 0.505 0.397 0.409 80 0.341 0.374 ± 0.033 0.407 0.413 0.510 90 0.383 0.383 ± 0.000 0.383 0.435 0.509 100 0.582 0.524 ± 0.058 0.466 0.478 0.523

Table C.16: Clustering purity on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.367 0.377 ± 0.010 0.387 0.399 0.433 3 0.498 0.497 ± 0.001 0.496 0.510 0.547 4 0.548 0.563 ± 0.015 0.578 0.710 0.747 5 0.694 0.700 ± 0.006 0.706 0.757 0.798 6 0.493 0.560 ± 0.067 0.627 0.790 0.826 7 0.584 0.602 ± 0.019 0.621 0.630 0.666 8 0.493 0.530 ± 0.037 0.567 0.515 0.556 9 0.458 0.464 ± 0.005 0.469 0.522 0.562 10 0.353 0.408 ± 0.054 0.462 0.490 0.522 15 0.266 0.390 ± 0.124 0.514 0.315 0.352 20 0.247 0.295 ± 0.049 0.344 0.297 0.334 25 0.332 0.361 ± 0.028 0.389 0.288 0.328 30 0.151 0.349 ± 0.198 0.547 0.306 0.343 40 0.479 0.445 ± 0.034 0.412 0.299 0.385 50 0.298 0.237 ± 0.062 0.175 0.357 0.394 60 0.403 0.318 ± 0.085 0.233 0.237 0.274 70 0.222 0.294 ± 0.072 0.366 0.240 0.257 80 0.132 0.165 ± 0.033 0.198 0.215 0.416 90 0.172 0.183 ± 0.012 0.195 0.267 0.327 100 0.364 0.313 ± 0.050 0.263 0.331 0.342

Table C.17: Clustering normalized mutual information on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 1.984 ± 0.001 1.983 ± 0.003 1.983 ± 0.002 1.983 ± 0.000 1.983 ± 0.000 3 1.982 ± 0.002 1.982 ± 0.003 1.982 ± 0.002 1.982 ± 0.000 1.982 ± 0.001 4 1.978 ± 0.002 1.978 ± 0.003 1.978 ± 0.002 1.972 ± 0.000 1.972 ± 0.000 5 1.972 ± 0.001 1.972 ± 0.002 1.972 ± 0.001 1.971 ± 0.000 1.971 ± 0.000 6 1.973 ± 0.002 1.975 ± 0.003 1.976 ± 0.001 1.970 ± 0.000 1.970 ± 0.000 7 1.973 ± 0.002 1.971 ± 0.003 1.970 ± 0.001 1.970 ± 0.000 1.970 ± 0.000 8 1.972 ± 0.005 1.968 ± 0.007 1.964 ± 0.001 1.967 ± 0.000 1.967 ± 0.000 9 1.968 ± 0.004 1.964 ± 0.007 1.960 ± 0.002 1.966 ± 0.001 1.967 ± 0.001 10 1.961 ± 0.005 1.963 ± 0.006 1.965 ± 0.003 1.962 ± 0.001 1.961 ± 0.001 15 1.964 ± 0.005 1.964 ± 0.006 1.965 ± 0.003 1.952 ± 0.003 1.952 ± 0.003 20 1.955 ± 0.005 1.959 ± 0.008 1.964 ± 0.003 1.948 ± 0.002 1.948 ± 0.002 25 1.959 ± 0.006 1.960 ± 0.006 1.961 ± 0.003 1.946 ± 0.002 1.946 ± 0.002 30 1.962 ± 0.006 1.963 ± 0.007 1.965 ± 0.003 1.959 ± 0.001 1.959 ± 0.001 40 1.960 ± 0.005 1.958 ± 0.007 1.956 ± 0.004 1.956 ± 0.001 1.955 ± 0.001 50 1.965 ± 0.005 1.965 ± 0.007 1.965 ± 0.004 1.956 ± 0.001 1.956 ± 0.001 60 1.967 ± 0.005 1.964 ± 0.007 1.961 ± 0.004 1.959 ± 0.001 1.959 ± 0.001 70 1.959 ± 0.005 1.963 ± 0.009 1.967 ± 0.004 1.961 ± 0.001 1.961 ± 0.001 80 1.970 ± 0.005 1.970 ± 0.007 1.970 ± 0.005 1.961 ± 0.001 1.955 ± 0.001 90 1.974 ± 0.005 1.972 ± 0.006 1.970 ± 0.004 1.971 ± 0.000 1.971 ± 0.000 100 1.983 ± 0.002 1.984 ± 0.003 1.985 ± 0.002 1.984 ± 0.000 1.986 ± 0.000

Table C.18: Davies-Bouldin index on the BBC segmented news dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

Single view Stacked K MVSC-CEV Worst Average Best views 2 0.044 ± 0.003 0.054 ± 0.024 0.053 ± 0.006 0.071 ± 0.008 0.069 ± 0.008 3 0.044 ± 0.003 0.057 ± 0.029 0.053 ± 0.006 0.076 ± 0.007 0.076 ± 0.008 4 0.044 ± 0.004 0.058 ± 0.033 0.052 ± 0.006 0.078 ± 0.008 0.077 ± 0.006 5 0.044 ± 0.004 0.051 ± 0.029 0.052 ± 0.006 0.047 ± 0.013 0.079 ± 0.004 6 0.044 ± 0.003 0.048 ± 0.019 0.053 ± 0.006 0.042 ± 0.005 0.067 ± 0.020 7 0.044 ± 0.003 0.048 ± 0.021 0.056 ± 0.006 0.041 ± 0.005 0.052 ± 0.024 8 0.044 ± 0.003 0.048 ± 0.021 0.059 ± 0.005 0.039 ± 0.004 0.037 ± 0.003 9 0.044 ± 0.003 0.048 ± 0.022 0.060 ± 0.005 0.039 ± 0.004 0.043 ± 0.014 10 0.043 ± 0.003 0.048 ± 0.025 0.063 ± 0.008 0.041 ± 0.010 0.043 ± 0.015 15 0.044 ± 0.006 0.051 ± 0.031 0.073 ± 0.008 0.041 ± 0.011 0.039 ± 0.007 20 0.047 ± 0.005 0.046 ± 0.019 0.048 ± 0.005 0.043 ± 0.011 0.042 ± 0.007 25 0.053 ± 0.008 0.046 ± 0.024 0.038 ± 0.004 0.060 ± 0.037 0.044 ± 0.007 30 0.055 ± 0.007 0.050 ± 0.033 0.046 ± 0.013 0.047 ± 0.027 0.045 ± 0.006 40 0.053 ± 0.007 0.054 ± 0.052 0.068 ± 0.039 0.087 ± 0.019 0.040 ± 0.005 50 0.047 ± 0.005 0.056 ± 0.079 0.114 ± 0.042 0.061 ± 0.008 0.072 ± 0.025 60 0.043 ± 0.004 0.055 ± 0.070 0.117 ± 0.008 0.050 ± 0.006 0.068 ± 0.009 70 0.039 ± 0.004 0.050 ± 0.044 0.085 ± 0.011 0.042 ± 0.004 0.061 ± 0.006 80 0.038 ± 0.004 0.046 ± 0.030 0.064 ± 0.007 0.038 ± 0.005 0.050 ± 0.004 90 0.037 ± 0.004 0.046 ± 0.031 0.052 ± 0.005 0.038 ± 0.004 0.044 ± 0.004 100 0.036 ± 0.004 0.046 ± 0.037 0.042 ± 0.005 0.038 ± 0.005 0.040 ± 0.004

Table C.19: One-vs-one SVM classification accuracy on the animal with attributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 -0.025 ± 0.085 0.142 ± 0.417 0.272 ± 0.127 0.013 ± 0.055 0.146 ± 0.064
3 0.014 ± 0.093 0.132 ± 0.408 0.272 ± 0.129 0.089 ± 0.078 0.146 ± 0.068
4 -0.004 ± 0.095 0.129 ± 0.426 0.274 ± 0.132 0.028 ± 0.083 0.103 ± 0.054
5 -0.007 ± 0.098 0.122 ± 0.444 0.275 ± 0.134 0.030 ± 0.091 0.111 ± 0.058
6 -0.038 ± 0.111 0.118 ± 0.465 0.275 ± 0.135 -0.001 ± 0.098 0.123 ± 0.055
7 -0.031 ± 0.113 0.121 ± 0.466 0.276 ± 0.136 -0.020 ± 0.096 0.125 ± 0.050
8 -0.031 ± 0.109 0.123 ± 0.469 0.277 ± 0.137 -0.014 ± 0.093 0.126 ± 0.039
9 -0.029 ± 0.105 0.124 ± 0.473 0.277 ± 0.137 -0.018 ± 0.090 0.125 ± 0.030
10 -0.015 ± 0.101 0.126 ± 0.476 0.280 ± 0.141 -0.020 ± 0.087 0.129 ± 0.030
15 -0.024 ± 0.101 0.130 ± 0.502 0.286 ± 0.148 -0.045 ± 0.087 0.094 ± 0.027
20 -0.028 ± 0.100 0.107 ± 0.491 0.291 ± 0.152 -0.040 ± 0.098 0.089 ± 0.023
25 -0.032 ± 0.100 0.104 ± 0.503 0.293 ± 0.156 -0.032 ± 0.093 0.081 ± 0.021
30 -0.037 ± 0.101 0.109 ± 0.502 0.295 ± 0.160 -0.025 ± 0.091 0.074 ± 0.017
40 -0.059 ± 0.095 0.086 ± 0.462 0.300 ± 0.165 -0.028 ± 0.076 0.050 ± 0.015
50 -0.043 ± 0.079 0.069 ± 0.393 0.307 ± 0.173 -0.034 ± 0.069 0.062 ± 0.020
60 -0.043 ± 0.065 0.077 ± 0.391 0.310 ± 0.178 -0.037 ± 0.059 0.079 ± 0.029
70 -0.030 ± 0.043 0.086 ± 0.372 0.314 ± 0.181 -0.026 ± 0.044 0.086 ± 0.032
80 0.004 ± 0.023 0.099 ± 0.353 0.320 ± 0.170 -0.008 ± 0.024 0.099 ± 0.033
90 0.042 ± 0.042 0.114 ± 0.339 0.320 ± 0.172 0.041 ± 0.037 0.119 ± 0.048
100 0.088 ± 0.087 0.116 ± 0.253 0.231 ± 0.076 0.087 ± 0.069 0.144 ± 0.064

Table C.20: Cophenetic correlation on the Animals with Attributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.
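
The cophenetic correlation measures how faithfully a dendrogram built on the low-dimensional projection preserves the pairwise distances of the original data. A minimal R sketch is shown below; the use of average linkage and of plain Euclidean distances in both spaces are assumptions made for illustration.

# Minimal sketch of the cophenetic correlation between the input-space distances
# and the dendrogram built on the projected data.
cophenetic_corr <- function(original, projection) {
  d_orig <- dist(original)                       # pairwise distances in the input space
  hc     <- hclust(dist(projection), method = "average")  # dendrogram on the projection
  cor(as.vector(d_orig), as.vector(cophenetic(hc)))        # agreement between both structures
}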

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 0.010 ± 0.016 0.018 ± 0.040 0.031 ± 0.022 0.017 ± 0.011 0.027 ± 0.013
3 0.010 ± 0.016 0.017 ± 0.043 0.031 ± 0.022 0.024 ± 0.012 0.032 ± 0.015
4 0.011 ± 0.016 0.017 ± 0.042 0.031 ± 0.022 0.022 ± 0.013 0.030 ± 0.013
5 0.011 ± 0.016 0.017 ± 0.044 0.031 ± 0.022 0.024 ± 0.015 0.034 ± 0.014
6 0.011 ± 0.016 0.017 ± 0.046 0.031 ± 0.022 0.021 ± 0.018 0.037 ± 0.016
7 0.011 ± 0.016 0.017 ± 0.046 0.031 ± 0.022 0.020 ± 0.020 0.038 ± 0.016
8 0.011 ± 0.016 0.018 ± 0.046 0.031 ± 0.022 0.023 ± 0.018 0.039 ± 0.015
9 0.011 ± 0.016 0.018 ± 0.048 0.031 ± 0.022 0.023 ± 0.019 0.041 ± 0.015
10 0.011 ± 0.016 0.018 ± 0.050 0.031 ± 0.022 0.023 ± 0.022 0.043 ± 0.015
15 0.011 ± 0.017 0.019 ± 0.055 0.031 ± 0.022 0.024 ± 0.018 0.044 ± 0.013
20 0.011 ± 0.017 0.023 ± 0.067 0.032 ± 0.023 0.024 ± 0.022 0.046 ± 0.013
25 0.012 ± 0.018 0.023 ± 0.072 0.032 ± 0.023 0.026 ± 0.025 0.047 ± 0.013
30 0.012 ± 0.018 0.025 ± 0.077 0.032 ± 0.023 0.027 ± 0.028 0.047 ± 0.013
40 0.015 ± 0.021 0.028 ± 0.090 0.032 ± 0.024 0.029 ± 0.029 0.047 ± 0.013
50 0.020 ± 0.023 0.033 ± 0.105 0.033 ± 0.024 0.032 ± 0.033 0.050 ± 0.015
60 0.026 ± 0.029 0.038 ± 0.116 0.033 ± 0.025 0.034 ± 0.036 0.055 ± 0.018
70 0.045 ± 0.044 0.045 ± 0.131 0.037 ± 0.028 0.040 ± 0.042 0.058 ± 0.019
80 0.048 ± 0.048 0.051 ± 0.144 0.044 ± 0.035 0.046 ± 0.047 0.061 ± 0.020
90 0.052 ± 0.054 0.056 ± 0.157 0.044 ± 0.035 0.056 ± 0.054 0.067 ± 0.023
100 0.065 ± 0.068 0.066 ± 0.178 0.058 ± 0.048 0.064 ± 0.060 0.075 ± 0.027

Table C.21: Area under the curve of the RNX index on the Animals with Attributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.
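
The RNX index evaluates neighbourhood preservation from the co-ranking matrix: QNX(K) is the average fraction of the K nearest neighbours of each sample that are preserved by the projection, and RNX(K) rescales it so that a random embedding scores zero; the reported value is the area under the RNX curve with 1/K weights (log scale). The following base R sketch follows this usual formulation and is an illustration only, under the assumption of Euclidean distances.

rnx_auc <- function(original, projection) {
  # Neighbour ranks in each space; the distance of a sample to itself is 0,
  # so each sample receives rank 1 in its own row.
  rank_rows <- function(X) t(apply(as.matrix(dist(X)), 1, rank, ties.method = "first"))
  Rh <- rank_rows(original)
  Rl <- rank_rows(projection)
  n  <- nrow(Rh)
  ks <- 1:(n - 2)
  # QNX(K): neighbours shared by the K-neighbourhoods of both spaces, averaged over samples
  qnx <- sapply(ks, function(K) sum(Rh > 1 & Rh <= K + 1 & Rl <= K + 1) / (K * n))
  rnx <- ((n - 1) * qnx - ks) / (n - 1 - ks)   # rescale so that a random embedding scores 0
  sum(rnx / ks) / sum(1 / ks)                  # area under the curve on a logarithmic K scale
}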

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 0.079 0.085 ± 0.007 0.081 0.096 0.117
3 0.079 0.086 ± 0.009 0.081 0.102 0.121
4 0.079 0.084 ± 0.008 0.080 0.099 0.121
5 0.080 0.086 ± 0.009 0.081 0.102 0.121
6 0.079 0.085 ± 0.008 0.082 0.100 0.122
7 0.079 0.086 ± 0.009 0.081 0.101 0.122
8 0.079 0.087 ± 0.010 0.081 0.105 0.122
9 0.079 0.085 ± 0.010 0.079 0.110 0.119
10 0.079 0.087 ± 0.011 0.079 0.110 0.120
15 0.079 0.088 ± 0.014 0.080 0.115 0.128
20 0.079 0.093 ± 0.016 0.113 0.113 0.130
25 0.080 0.093 ± 0.015 0.115 0.108 0.131
30 0.080 0.094 ± 0.017 0.121 0.110 0.131
40 0.080 0.097 ± 0.017 0.128 0.112 0.131
50 0.079 0.101 ± 0.016 0.130 0.112 0.129
60 0.077 0.100 ± 0.016 0.130 0.108 0.134
70 0.080 0.102 ± 0.014 0.128 0.102 0.134
80 0.078 0.102 ± 0.014 0.124 0.104 0.135
90 0.079 0.102 ± 0.013 0.123 0.104 0.137
100 0.087 0.105 ± 0.011 0.118 0.104 0.139

Table C.22: Clustering purity on the Animals with Attributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.
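
Clustering purity assigns each cluster to its most frequent reference class and reports the overall fraction of samples covered by those majority classes. A minimal base R sketch is shown below for illustration.

purity <- function(labels_true, clusters) {
  ct <- table(clusters, labels_true)   # rows: clusters, columns: reference classes
  sum(apply(ct, 1, max)) / sum(ct)     # majority class count in each cluster
}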

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 0.101 0.108 ± 0.013 0.117 0.139 0.127
3 0.100 0.110 ± 0.017 0.127 0.144 0.135
4 0.101 0.109 ± 0.018 0.126 0.137 0.135
5 0.100 0.110 ± 0.018 0.132 0.142 0.140
6 0.099 0.108 ± 0.017 0.131 0.146 0.141
7 0.098 0.107 ± 0.019 0.132 0.143 0.142
8 0.098 0.106 ± 0.018 0.124 0.145 0.140
9 0.098 0.107 ± 0.021 0.132 0.145 0.143
10 0.098 0.108 ± 0.022 0.134 0.147 0.144
15 0.099 0.108 ± 0.023 0.142 0.155 0.151
20 0.099 0.119 ± 0.029 0.143 0.156 0.154
25 0.100 0.119 ± 0.028 0.141 0.154 0.158
30 0.095 0.117 ± 0.031 0.141 0.149 0.153
40 0.096 0.125 ± 0.029 0.142 0.150 0.152
50 0.097 0.132 ± 0.024 0.142 0.147 0.154
60 0.094 0.134 ± 0.024 0.144 0.143 0.156
70 0.093 0.136 ± 0.022 0.140 0.139 0.159
80 0.092 0.137 ± 0.023 0.143 0.142 0.163
90 0.090 0.136 ± 0.024 0.143 0.140 0.165
100 0.115 0.140 ± 0.016 0.140 0.146 0.164

Table C.23: Clustering normalized mutual information on the Animals with Attributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 2.001 ± 0.014 1.985 ± 0.045 1.957 ± 0.006 1.995 ± 0.003 1.967 ± 0.005
3 1.993 ± 0.019 1.968 ± 0.061 1.935 ± 0.010 1.992 ± 0.004 1.988 ± 0.007
4 1.992 ± 0.014 1.965 ± 0.062 1.957 ± 0.013 1.990 ± 0.005 1.937 ± 0.016
5 1.989 ± 0.019 1.953 ± 0.078 1.936 ± 0.018 1.990 ± 0.006 1.991 ± 0.005
6 1.988 ± 0.024 1.936 ± 0.105 1.916 ± 0.019 1.988 ± 0.010 1.961 ± 0.009
7 1.989 ± 0.029 1.921 ± 0.127 1.893 ± 0.018 1.988 ± 0.011 1.986 ± 0.008
8 1.959 ± 0.028 1.907 ± 0.132 1.894 ± 0.026 1.982 ± 0.012 1.985 ± 0.007
9 1.976 ± 0.026 1.903 ± 0.156 1.896 ± 0.030 1.981 ± 0.013 1.986 ± 0.007
10 1.974 ± 0.031 1.907 ± 0.161 1.907 ± 0.031 1.980 ± 0.015 1.985 ± 0.007
15 1.969 ± 0.017 1.868 ± 0.250 1.916 ± 0.037 1.977 ± 0.015 1.983 ± 0.006
20 1.973 ± 0.024 1.912 ± 0.216 1.930 ± 0.036 1.975 ± 0.019 1.980 ± 0.008
25 1.976 ± 0.025 1.932 ± 0.162 1.913 ± 0.039 1.971 ± 0.024 1.981 ± 0.007
30 1.979 ± 0.035 1.884 ± 0.272 1.858 ± 0.039 1.971 ± 0.030 1.980 ± 0.008
40 1.977 ± 0.040 1.925 ± 0.173 1.854 ± 0.045 1.969 ± 0.033 1.979 ± 0.009
50 1.975 ± 0.037 1.955 ± 0.112 1.887 ± 0.047 1.972 ± 0.036 1.977 ± 0.009
60 1.974 ± 0.027 1.962 ± 0.121 1.881 ± 0.049 1.972 ± 0.040 1.974 ± 0.011
70 1.974 ± 0.044 1.964 ± 0.111 1.909 ± 0.048 1.972 ± 0.042 1.973 ± 0.012
80 1.975 ± 0.053 1.960 ± 0.132 1.877 ± 0.048 1.972 ± 0.043 1.970 ± 0.012
90 1.975 ± 0.059 1.969 ± 0.107 1.938 ± 0.047 1.972 ± 0.042 1.970 ± 0.011
100 1.976 ± 0.041 1.974 ± 0.094 1.977 ± 0.046 1.973 ± 0.043 1.966 ± 0.013

Table C.24: Davies-Bouldin index on the Animals with Attributes (AWA) dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
1 0.786 ± 0.012 1.196 ± 0.986 2.000 ± 0.008 0.708 ± 0.013 0.772 ± 0.011
2 0.784 ± 0.016 1.528 ± 1.804 3.000 ± 0.015 0.708 ± 0.007 0.771 ± 0.014
3 0.787 ± 0.012 1.864 ± 2.616 4.000 ± 0.012 0.708 ± 0.006 0.773 ± 0.020
4 0.782 ± 0.015 2.200 ± 3.430 5.000 ± 0.010 0.708 ± 0.011 0.764 ± 0.013
5 0.799 ± 0.014 2.539 ± 4.239 6.000 ± 0.010 0.708 ± 0.013 0.763 ± 0.012
6 0.798 ± 0.019 2.878 ± 5.048 7.000 ± 0.012 0.708 ± 0.012 0.769 ± 0.016
7 0.798 ± 0.017 3.211 ± 5.865 8.000 ± 0.012 0.708 ± 0.011 0.766 ± 0.012
8 0.795 ± 0.010 3.541 ± 6.686 9.000 ± 0.010 0.708 ± 0.013 0.779 ± 0.014
9 0.800 ± 0.014 3.877 ± 7.499 10.000 ± 0.011 0.708 ± 0.012 0.777 ± 0.011
10 0.814 ± 0.011 5.554 ± 11.569 15.000 ± 0.011 0.708 ± 0.011 0.784 ± 0.016
11 0.821 ± 0.011 7.227 ± 15.644 20.000 ± 0.005 0.708 ± 0.006 0.782 ± 0.012
12 0.824 ± 0.014 8.894 ± 19.726 25.000 ± 0.004 0.708 ± 0.005 0.786 ± 0.011
13 0.825 ± 0.012 10.553 ± 23.818 30.000 ± 0.007 0.709 ± 0.009 0.764 ± 0.012
14 0.793 ± 0.018 13.854 ± 32.022 40.000 ± 0.022 0.707 ± 0.014 0.769 ± 0.019
15 0.786 ± 0.015 17.176 ± 40.201 50.000 ± 0.026 0.698 ± 0.014 0.749 ± 0.018
16 0.740 ± 0.016 20.491 ± 48.389 60.000 ± 0.013 0.722 ± 0.019 0.750 ± 0.018
17 0.723 ± 0.016 23.816 ± 56.563 70.000 ± 0.015 0.737 ± 0.017 0.750 ± 0.019
18 0.718 ± 0.015 27.145 ± 64.734 80.000 ± 0.016 0.745 ± 0.017 0.748 ± 0.021
19 0.712 ± 0.015 30.475 ± 72.903 90.000 ± 0.017 0.749 ± 0.016 0.745 ± 0.019
20 0.708 ± 0.015 33.807 ± 81.070 100.000 ± 0.018 0.763 ± 0.015 0.750 ± 0.018

Table C.25: One-vs-one SVM classification accuracy on the Berkeley protein dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 0.282 ± 0.349 0.257 ± 0.555 0.207 ± 0.276 0.313 ± 0.310 0.374 ± 0.257
3 0.250 ± 0.315 0.253 ± 0.556 0.203 ± 0.275 0.260 ± 0.286 0.375 ± 0.189
4 0.213 ± 0.270 0.224 ± 0.489 0.193 ± 0.244 0.201 ± 0.268 0.366 ± 0.150
5 0.189 ± 0.243 0.212 ± 0.464 0.191 ± 0.241 0.195 ± 0.270 0.346 ± 0.115
6 0.172 ± 0.216 0.208 ± 0.456 0.198 ± 0.254 0.224 ± 0.244 0.285 ± 0.076
7 0.161 ± 0.200 0.203 ± 0.452 0.216 ± 0.285 0.178 ± 0.211 0.274 ± 0.065
8 0.157 ± 0.188 0.205 ± 0.453 0.223 ± 0.297 0.178 ± 0.210 0.245 ± 0.053
9 0.143 ± 0.181 0.196 ± 0.451 0.229 ± 0.310 0.172 ± 0.187 0.236 ± 0.048
10 0.139 ± 0.174 0.196 ± 0.458 0.240 ± 0.326 0.162 ± 0.168 0.215 ± 0.043
15 0.099 ± 0.160 0.161 ± 0.397 0.220 ± 0.287 0.119 ± 0.146 0.166 ± 0.041
20 0.076 ± 0.157 0.152 ± 0.412 0.236 ± 0.310 0.072 ± 0.127 0.141 ± 0.044
25 0.067 ± 0.158 0.146 ± 0.411 0.239 ± 0.323 0.042 ± 0.089 0.136 ± 0.053
30 0.059 ± 0.153 0.141 ± 0.398 0.235 ± 0.325 0.036 ± 0.090 0.128 ± 0.061
40 0.063 ± 0.144 0.142 ± 0.397 0.227 ± 0.325 -0.002 ± 0.078 0.115 ± 0.064
50 0.078 ± 0.131 0.142 ± 0.381 0.228 ± 0.324 -0.004 ± 0.089 0.092 ± 0.064
60 0.084 ± 0.120 0.175 ± 0.348 0.343 ± 0.237 -0.011 ± 0.090 0.091 ± 0.064
70 0.104 ± 0.115 0.177 ± 0.333 0.331 ± 0.228 0.004 ± 0.103 0.088 ± 0.068
80 0.117 ± 0.113 0.179 ± 0.303 0.304 ± 0.210 0.021 ± 0.100 0.097 ± 0.078
90 0.119 ± 0.116 0.176 ± 0.301 0.304 ± 0.208 0.074 ± 0.093 0.106 ± 0.081
100 0.093 ± 0.127 0.168 ± 0.308 0.308 ± 0.203 0.136 ± 0.084 0.108 ± 0.091

Table C.26: Cophenetic correlation on the Berkeley protein dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 0.078 ± 0.058 0.099 ± 0.164 0.143 ± 0.112 0.097 ± 0.038 0.118 ± 0.037
3 0.078 ± 0.058 0.116 ± 0.206 0.166 ± 0.139 0.101 ± 0.036 0.149 ± 0.038
4 0.078 ± 0.057 0.120 ± 0.222 0.173 ± 0.150 0.099 ± 0.038 0.153 ± 0.034
5 0.078 ± 0.057 0.129 ± 0.247 0.172 ± 0.152 0.102 ± 0.039 0.170 ± 0.039
6 0.078 ± 0.058 0.130 ± 0.252 0.172 ± 0.154 0.127 ± 0.038 0.163 ± 0.035
7 0.078 ± 0.058 0.129 ± 0.250 0.170 ± 0.154 0.122 ± 0.038 0.162 ± 0.034
8 0.079 ± 0.059 0.129 ± 0.252 0.169 ± 0.154 0.125 ± 0.037 0.157 ± 0.034
9 0.079 ± 0.059 0.136 ± 0.271 0.188 ± 0.178 0.132 ± 0.035 0.161 ± 0.036
10 0.079 ± 0.059 0.137 ± 0.274 0.187 ± 0.179 0.131 ± 0.033 0.158 ± 0.034
15 0.079 ± 0.058 0.142 ± 0.291 0.192 ± 0.191 0.124 ± 0.036 0.149 ± 0.036
20 0.079 ± 0.059 0.145 ± 0.303 0.191 ± 0.193 0.128 ± 0.042 0.145 ± 0.037
25 0.080 ± 0.060 0.146 ± 0.305 0.187 ± 0.190 0.122 ± 0.040 0.142 ± 0.037
30 0.081 ± 0.061 0.151 ± 0.315 0.187 ± 0.190 0.123 ± 0.041 0.137 ± 0.038
40 0.084 ± 0.065 0.155 ± 0.322 0.189 ± 0.192 0.119 ± 0.045 0.129 ± 0.034
50 0.100 ± 0.078 0.162 ± 0.325 0.196 ± 0.197 0.123 ± 0.046 0.123 ± 0.034
60 0.169 ± 0.108 0.186 ± 0.328 0.200 ± 0.201 0.123 ± 0.046 0.120 ± 0.033
70 0.175 ± 0.114 0.190 ± 0.333 0.207 ± 0.206 0.127 ± 0.045 0.117 ± 0.031
80 0.180 ± 0.124 0.195 ± 0.337 0.213 ± 0.209 0.131 ± 0.045 0.115 ± 0.030
90 0.182 ± 0.128 0.196 ± 0.340 0.217 ± 0.211 0.137 ± 0.041 0.117 ± 0.030
100 0.194 ± 0.140 0.199 ± 0.345 0.215 ± 0.214 0.142 ± 0.036 0.115 ± 0.029

Table C.27: Area under the curve of the RNX index on the Berkeley protein dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 0.700 0.729 ± 0.042 0.788 0.700 0.700
3 0.700 0.728 ± 0.040 0.785 0.796 0.807
4 0.700 0.719 ± 0.027 0.757 0.700 0.774
5 0.700 0.700 ± 0.000 0.700 0.700 0.753
6 0.700 0.706 ± 0.009 0.718 0.700 0.700
7 0.700 0.700 ± 0.000 0.700 0.700 0.700
8 0.700 0.732 ± 0.045 0.795 0.700 0.700
9 0.700 0.700 ± 0.000 0.700 0.700 0.700
10 0.700 0.700 ± 0.000 0.700 0.700 0.700
15 0.700 0.732 ± 0.046 0.797 0.700 0.700
20 0.700 0.734 ± 0.049 0.803 0.700 0.700
25 0.700 0.735 ± 0.049 0.804 0.700 0.700
30 0.700 0.700 ± 0.000 0.700 0.700 0.700
40 0.700 0.700 ± 0.000 0.700 0.700 0.700
50 0.700 0.700 ± 0.000 0.700 0.700 0.700
60 0.700 0.700 ± 0.000 0.700 0.700 0.700
70 0.700 0.700 ± 0.000 0.700 0.700 0.700
80 0.700 0.700 ± 0.000 0.700 0.700 0.700
90 0.700 0.700 ± 0.000 0.700 0.700 0.700
100 0.700 0.700 ± 0.000 0.700 0.700 0.700

Table C.28: Clustering purity on the Berkeley protein dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 0.027 0.137 ± 0.116 0.297 0.152 0.245
3 0.027 0.144 ± 0.120 0.309 0.295 0.346
4 0.032 0.128 ± 0.083 0.234 0.110 0.286
5 0.032 0.089 ± 0.069 0.186 0.104 0.255
6 0.032 0.092 ± 0.076 0.199 0.101 0.165
7 0.032 0.079 ± 0.057 0.159 0.085 0.154
8 0.032 0.126 ± 0.120 0.296 0.077 0.162
9 0.032 0.083 ± 0.059 0.166 0.077 0.161
10 0.032 0.084 ± 0.062 0.171 0.075 0.160
15 0.033 0.128 ± 0.129 0.311 0.082 0.160
20 0.033 0.146 ± 0.135 0.335 0.079 0.159
25 0.033 0.145 ± 0.139 0.341 0.078 0.154
30 0.033 0.057 ± 0.019 0.077 0.084 0.154
40 0.033 0.055 ± 0.016 0.070 0.078 0.159
50 0.033 0.055 ± 0.018 0.075 0.061 0.150
60 0.120 0.083 ± 0.026 0.072 0.073 0.137
70 0.120 0.080 ± 0.028 0.066 0.069 0.166
80 0.107 0.078 ± 0.022 0.071 0.064 0.148
90 0.107 0.081 ± 0.018 0.067 0.061 0.155
100 0.109 0.078 ± 0.023 0.055 0.071 0.137

Table C.29: Clustering normalized mutual information on the Berkeley protein dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 1.729 ± 0.225 1.657 ± 0.528 1.608 ± 0.251 1.718 ± 0.252 1.627 ± 0.365
3 1.670 ± 0.225 1.615 ± 0.635 1.608 ± 0.356 1.632 ± 0.243 1.716 ± 0.154
4 1.744 ± 0.228 1.648 ± 0.606 1.603 ± 0.288 1.671 ± 0.284 1.742 ± 0.139
5 1.759 ± 0.229 1.650 ± 0.600 1.603 ± 0.300 1.687 ± 0.266 1.750 ± 0.143
6 1.761 ± 0.229 1.650 ± 0.590 1.603 ± 0.275 1.683 ± 0.266 1.705 ± 0.292
7 1.773 ± 0.229 1.653 ± 0.587 1.603 ± 0.256 1.679 ± 0.282 1.688 ± 0.344
8 1.667 ± 0.288 1.593 ± 0.623 1.533 ± 0.302 1.657 ± 0.292 1.754 ± 0.204
9 1.785 ± 0.288 1.633 ± 0.623 1.533 ± 0.254 1.650 ± 0.297 1.756 ± 0.203
10 1.781 ± 0.228 1.658 ± 0.586 1.604 ± 0.254 1.663 ± 0.264 1.765 ± 0.188
15 1.701 ± 0.288 1.611 ± 0.594 1.532 ± 0.237 1.697 ± 0.242 1.769 ± 0.183
20 1.717 ± 0.288 1.640 ± 0.563 1.532 ± 0.213 1.655 ± 0.258 1.767 ± 0.181
25 1.691 ± 0.288 1.637 ± 0.554 1.532 ± 0.229 1.708 ± 0.210 1.743 ± 0.220
30 1.824 ± 0.288 1.681 ± 0.564 1.532 ± 0.193 1.798 ± 0.123 1.771 ± 0.187
40 1.821 ± 0.288 1.681 ± 0.564 1.532 ± 0.200 1.695 ± 0.298 1.716 ± 0.248
50 1.824 ± 0.288 1.680 ± 0.574 1.532 ± 0.216 1.653 ± 0.279 1.693 ± 0.316
60 1.842 ± 0.354 1.648 ± 0.643 1.418 ± 0.193 1.793 ± 0.113 1.714 ± 0.279
70 1.850 ± 0.354 1.651 ± 0.643 1.418 ± 0.190 1.724 ± 0.230 1.680 ± 0.341
80 1.844 ± 0.148 1.763 ± 0.475 1.757 ± 0.189 1.703 ± 0.223 1.767 ± 0.184
90 1.864 ± 0.147 1.804 ± 0.332 1.757 ± 0.159 1.739 ± 0.174 1.699 ± 0.311
100 1.867 ± 0.148 1.805 ± 0.328 1.756 ± 0.149 1.790 ± 0.153 1.735 ± 0.181

Table C.30: Davies-Bouldin index on the Berkeley protein dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 0.333 ± 0.016 0.331 ± 0.021 0.328 ± 0.013 0.347 ± 0.005 0.392 ± 0.010
3 0.420 ± 0.011 0.373 ± 0.068 0.327 ± 0.010 0.413 ± 0.028 0.437 ± 0.018
4 0.436 ± 0.012 0.382 ± 0.078 0.328 ± 0.013 0.482 ± 0.031 0.466 ± 0.018
5 0.484 ± 0.012 0.407 ± 0.111 0.329 ± 0.011 0.501 ± 0.017 0.491 ± 0.019
6 0.518 ± 0.006 0.428 ± 0.130 0.338 ± 0.024 0.523 ± 0.025 0.521 ± 0.014
7 0.557 ± 0.010 0.453 ± 0.148 0.350 ± 0.021 0.546 ± 0.019 0.569 ± 0.018
8 0.550 ± 0.018 0.451 ± 0.142 0.353 ± 0.020 0.590 ± 0.011 0.593 ± 0.018
9 0.553 ± 0.011 0.459 ± 0.136 0.365 ± 0.026 0.587 ± 0.015 0.609 ± 0.017
10 0.578 ± 0.009 0.479 ± 0.140 0.381 ± 0.013 0.601 ± 0.016 0.613 ± 0.010
15 0.653 ± 0.009 0.536 ± 0.166 0.419 ± 0.008 0.672 ± 0.013 0.662 ± 0.006
20 0.655 ± 0.012 0.583 ± 0.105 0.510 ± 0.014 0.687 ± 0.012 0.705 ± 0.016
25 0.641 ± 0.009 0.605 ± 0.054 0.569 ± 0.012 0.676 ± 0.015 0.680 ± 0.018
30 0.630 ± 0.022 0.612 ± 0.036 0.594 ± 0.014 0.653 ± 0.010 0.682 ± 0.012
40 0.512 ± 0.020 0.576 ± 0.093 0.639 ± 0.013 0.559 ± 0.024 0.612 ± 0.012
50 0.382 ± 0.018 0.518 ± 0.194 0.655 ± 0.010 0.403 ± 0.017 0.481 ± 0.013
60 0.339 ± 0.010 0.501 ± 0.231 0.664 ± 0.020 0.344 ± 0.015 0.393 ± 0.013
70 0.331 ± 0.014 0.499 ± 0.239 0.667 ± 0.016 0.336 ± 0.013 0.374 ± 0.011
80 0.317 ± 0.013 0.493 ± 0.250 0.669 ± 0.016 0.322 ± 0.009 0.365 ± 0.013
90 0.319 ± 0.015 0.487 ± 0.239 0.656 ± 0.014 0.322 ± 0.011 0.363 ± 0.015
100 0.319 ± 0.011 0.480 ± 0.229 0.642 ± 0.011 0.320 ± 0.011 0.361 ± 0.014

Table C.31: One-vs-one SVM classification accuracy on the Cora dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 0.339 0.337 ± 0.002 0.335 0.302 0.302
3 0.355 0.345 ± 0.010 0.335 0.344 0.310
4 0.349 0.342 ± 0.007 0.335 0.363 0.308
5 0.384 0.359 ± 0.025 0.335 0.350 0.384
6 0.404 0.369 ± 0.035 0.335 0.405 0.362
7 0.397 0.366 ± 0.031 0.335 0.417 0.367
8 0.397 0.366 ± 0.031 0.335 0.404 0.351
9 0.399 0.367 ± 0.032 0.335 0.390 0.347
10 0.381 0.358 ± 0.023 0.335 0.411 0.378
15 0.390 0.352 ± 0.038 0.314 0.402 0.353
20 0.302 0.330 ± 0.028 0.358 0.338 0.359
25 0.302 0.331 ± 0.029 0.360 0.387 0.423
30 0.302 0.330 ± 0.028 0.358 0.302 0.425
40 0.303 0.326 ± 0.024 0.350 0.302 0.421
50 0.308 0.335 ± 0.027 0.362 0.302 0.460
60 0.302 0.332 ± 0.030 0.362 0.309 0.391
70 0.302 0.336 ± 0.034 0.370 0.302 0.390
80 0.302 0.336 ± 0.034 0.371 0.302 0.390
90 0.302 0.341 ± 0.038 0.379 0.305 0.448
100 0.309 0.343 ± 0.035 0.378 0.303 0.366

Table C.32: Clustering purity on the Cora dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.

K    MVSC-CEV         Single view (worst)    Single view (average)    Single view (best)    Stacked views
2 0.027 0.051 ± 0.024 0.075 0.009 0.029
3 0.027 0.069 ± 0.042 0.111 0.086 0.038
4 0.027 0.071 ± 0.044 0.115 0.118 0.086
5 0.027 0.081 ± 0.054 0.135 0.116 0.112
6 0.027 0.091 ± 0.063 0.154 0.147 0.110
7 0.027 0.087 ± 0.060 0.147 0.166 0.128
8 0.027 0.087 ± 0.060 0.147 0.160 0.122
9 0.027 0.088 ± 0.060 0.148 0.147 0.109
10 0.027 0.081 ± 0.054 0.135 0.186 0.118
15 0.022 0.073 ± 0.051 0.124 0.159 0.113
20 0.068 0.060 ± 0.008 0.052 0.058 0.115
25 0.091 0.058 ± 0.033 0.025 0.125 0.189
30 0.090 0.053 ± 0.037 0.016 0.023 0.190
40 0.108 0.066 ± 0.042 0.024 0.020 0.195
50 0.122 0.074 ± 0.048 0.026 0.016 0.230
60 0.117 0.067 ± 0.051 0.016 0.023 0.153
70 0.111 0.064 ± 0.046 0.018 0.028 0.137
80 0.112 0.063 ± 0.049 0.014 0.025 0.138
90 0.094 0.060 ± 0.035 0.025 0.021 0.222
100 0.125 0.072 ± 0.053 0.020 0.015 0.078

Table C.33: Clustering normalized mutual information on the Cora dataset of MVSC-CEV compared with single view and stacked views SC. K is the dimensionality of the projection.
