Data Fusion by Matrix Factorization
Note: Please refer to the extended version of this paper published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (10.1109/TPAMI.2014.2343973).

arXiv:1307.0803v2 [cs.LG] 6 Feb 2015

Marinka Žitnik ([email protected])
Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia

Blaž Zupan ([email protected])
Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX-77030, USA

Abstract

For most problems in science and engineering we can obtain data that describe the system from various perspectives and record the behaviour of its individual components. Heterogeneous data sources can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with data on the context or additional constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data sets that can be expressed in a matrix, including those from attribute-based representations, ontologies, associations and networks. We demonstrate its utility on a gene function prediction problem in a case study with eleven different data sources. Our fusion algorithm compares favourably to state-of-the-art multiple kernel learning and achieves higher accuracy than can be obtained from any single data source alone.

1. Introduction

Data abounds in all areas of human endeavour. Big data (Cuzzocrea et al., 2011) is not only large in volume and may include thousands of features, but it is also heterogeneous. We may gather various data sets that are directly related to the problem, or data sets that are loosely related to our study but could be useful when combined with other data sets. Consider, for example, the exposome (Rappaport & Smith, 2010), which encompasses the totality of human endeavour in the study of disease. Let us say that we examine susceptibility to a particular disease and have access to the patients' clinical data together with data on their demographics, habits, living environments, friends, relatives, and movie-watching habits and genre ontology. Mining such a diverse data collection may reveal interesting patterns that would remain hidden if we considered only the directly related, clinical data. What if the disease was less common in living areas with more open spaces, or in environments where people need to walk instead of drive to the nearest grocery? Is the disease less common among those who watch comedies and ignore politics and news?

Methods for data fusion can collectively treat data sets and combine diverse data sources even when they differ in their conceptual, contextual and typographical representation (Aerts et al., 2006; Boström et al., 2007). Individual data sets may be incomplete, yet because of their diversity and complementarity, fusion improves the robustness and predictive performance of the resulting models.

Data fusion approaches can be classified into three main categories according to the modeling stage at which fusion takes place (Pavlidis et al., 2002; Schölkopf et al., 2004; Maragos et al., 2008; Greene & Cunningham, 2009). Early (or full) integration transforms all data sources into a single, feature-based table and treats this as a single data set. This is theoretically the most powerful scheme of multimodal fusion, because the inferred model can contain any type of relationship between the features from within and between the data sources. Early integration relies on procedures for feature construction. For our exposome example, patient-specific data would need to include both clinical data and information from the movie genre ontologies. The former may be trivial, as this data is already related to each specific patient, while the latter requires more complex feature engineering. Early integration also neglects the modular structure of the data.

In late (decision) integration, each data source gives rise to a separate model. Predictions of these models are fused by model weighting. Again, prior to model inference, it is necessary to transform each data set to encode relations to the target concept. For our example, information on the movie preferences of friends and relatives would need to be mapped to disease associations. Such transformations may not be trivial and would need to be crafted independently for each data source. Once the data are transformed, the fusion can utilize any of the existing ensemble methods.

The youngest branch of data fusion algorithms is intermediate (partial) integration. It relies on algorithms that explicitly address the multiplicity of data and fuse them through inference of a joint model. Intermediate integration does not fuse the input data, nor does it develop separate models for each data source. Instead, it retains the structure of the data sources and merges them at the level of a predictive model. This approach is often preferred because of its superior predictive accuracy (Pavlidis et al., 2002; Lanckriet et al., 2004b; Gevaert et al., 2006; Tang et al., 2009; van Vliet et al., 2012), but for a given model type it requires the development of a new inference algorithm.

In this paper we report on the development of a new method for intermediate data fusion based on constrained matrix factorization. Our aim was to construct an algorithm that requires only minimal transformation (if any at all) of the input data and can fuse attribute-based representations, ontologies, associations and networks. We begin with a description of related work and the background of matrix factorization. We then present our data fusion algorithm and finally demonstrate its utility in a study comparing it to state-of-the-art intermediate integration with multiple kernel learning and to early integration with random forests.

2. Background and Related Work

Approximate matrix factorization estimates a data matrix R as a product of low-rank matrix factors that are found by solving an optimization problem. In two-factor decomposition, R ∈ ℝ^{n×m} is decomposed into a product WH, where W ∈ ℝ^{n×k}, H ∈ ℝ^{k×m} and k ≪ min(n, m). A large class of matrix factorization algorithms minimizes the discrepancy between the observed matrix and its low-rank approximation, such that R ≈ WH. For instance, SVD, non-negative matrix factorization and exponential family PCA all minimize a Bregman divergence (Singh & Gordon, 2008). The cost function can be further extended with various constraints (Zhang et al., 2011; Wang et al., 2012). Wang et al. (2008) devised a penalized matrix tri-factorization that includes a set of must-link and cannot-link constraints. We exploit this approach to include relations between objects of the same type. Constraints would, for instance, allow us to include information from movie genre ontologies or social network friendships.

Although matrix factorization is often used in data analysis for dimensionality reduction, clustering or low-rank approximation, there have been few applications of it to data fusion. Lange & Buhmann (2005) proposed an early integration by non-negative matrix factorization of a target matrix, which was a convex combination of similarity matrices obtained from multiple information sources. Their work is similar to that of Wang et al. (2012), who applied non-negative matrix tri-factorization with input matrix completion.

Zhang et al. (2012) proposed a joint matrix factorization that decomposes a number of data matrices R_i into a common basis matrix W and different coefficient matrices H_i, such that R_i ≈ WH_i, by minimizing Σ_i ||R_i − WH_i||. This is an intermediate integration approach with different data sources that describe objects of the same type. A similar approach, but with added network-regularized constraints, has also been proposed (Zhang et al., 2011). Our work extends these two approaches by simultaneously dealing with heterogeneous data sets and with objects of different types.

In the paper we use a variant of three-factor matrix factorization that decomposes R into G ∈ ℝ^{n×k_1}, F ∈ ℝ^{m×k_2} and S ∈ ℝ^{k_1×k_2}, such that R ≈ GSFᵀ (Wang et al., 2008). The approximation can be rewritten such that entry R(p, q) is approximated by an inner product of the p-th row of matrix G and a linear combination of the columns of matrix S, weighted by the q-th row of F. The matrix S, which has relatively few vectors compared to R (k_1 ≪ n, k_2 ≪ m), is used to represent many data vectors, and a good approximation can be achieved only in the presence of latent structure in the original data.

We are currently witnessing increasing interest in the joint treatment of heterogeneous data sets and the emergence of approaches specifically designed for data fusion. These include canonical correlation analysis (Chaudhuri et al., 2009), combining many interaction networks into a composite network (Mostafavi & Morris, 2012), multiple graph clustering with linked
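The two-factor model R ≈ WH discussed in Section 2 can be sketched numerically. The snippet below is a minimal illustration, not the algorithm of any of the cited works: it fits a non-negative factorization with standard Lee–Seung multiplicative updates on a small synthetic matrix whose sizes and data are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small non-negative data matrix R (n x m) with exact rank-2 structure,
# so a rank-2 factorization can approximate it closely.
n, m, k = 8, 6, 2
R = rng.random((n, k)) @ rng.random((k, m))

# Random strictly positive initialization of W (n x k) and H (k x m).
W = rng.random((n, k)) + 0.1
H = rng.random((k, m)) + 0.1

# Lee-Seung multiplicative updates decrease ||R - WH||_F while
# keeping both factors non-negative.
for _ in range(500):
    H *= (W.T @ R) / (W.T @ W @ H + 1e-12)
    W *= (R @ H.T) / (W @ H @ H.T + 1e-12)

err = np.linalg.norm(R - W @ H) / np.linalg.norm(R)
print(f"relative reconstruction error: {err:.4f}")
```

Because the synthetic R has exactly the rank used by the factorization, the relative error shrinks to near zero; on real data with only approximate latent structure the residual stays larger.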
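The entry-wise reading of the tri-factorization R ≈ GSFᵀ can be verified directly: entry (p, q) is the inner product of the p-th row of G with the combination of the columns of S weighted by the q-th row of F. A small numpy sketch with random matrices and illustrative (hypothetical) dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Shapes as in the text: G is n x k1, S is k1 x k2, F is m x k2,
# and R is approximated by G S F^T.
n, m, k1, k2 = 7, 5, 3, 2
G = rng.random((n, k1))
S = rng.random((k1, k2))
F = rng.random((m, k2))

R_hat = G @ S @ F.T

# Entry (p, q): the p-th row of G dotted with S's columns combined
# using the weights in the q-th row of F.
p, q = 4, 3
entry = G[p, :] @ (S @ F[q, :])

print("entry-wise:", entry, " full product:", R_hat[p, q])
```

The two numbers agree to machine precision, confirming that the compact factor S mediates every entry of the reconstruction.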
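The joint model of Zhang et al. (2012), R_i ≈ WH_i with one basis W shared across sources, can also be sketched. The alternating multiplicative updates below are one possible solver chosen for illustration, not necessarily the procedure used in the cited work; the two synthetic sources are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two data sources describing the same n objects; both are generated
# from a common basis, so a shared W can fit them jointly.
n, k = 10, 3
ms = [4, 6]
W_true = rng.random((n, k))
Rs = [W_true @ rng.random((k, m)) for m in ms]

W = rng.random((n, k)) + 0.1
Hs = [rng.random((k, m)) + 0.1 for m in ms]

# Minimize sum_i ||R_i - W H_i||_F: each H_i sees only its own source,
# while the update for W aggregates evidence from all sources.
for _ in range(500):
    for i, R in enumerate(Rs):
        Hs[i] *= (W.T @ R) / (W.T @ W @ Hs[i] + 1e-12)
    num = sum(R @ H.T for R, H in zip(Rs, Hs))
    den = sum(W @ H @ H.T for H in Hs) + 1e-12
    W *= num / den

err = sum(np.linalg.norm(R - W @ H) for R, H in zip(Rs, Hs))
total = sum(np.linalg.norm(R) for R in Rs)
print(f"joint relative error: {err / total:.4f}")
```

Coupling the sources through W is what makes this an intermediate integration: neither data set is transformed into the other's representation, yet both shape the shared latent factors.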