Note: Please refer to the extended version of this paper published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (10.1109/TPAMI.2014.2343973).

Data Fusion by Matrix Factorization

Marinka Žitnik [email protected]
Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia

Blaž Zupan [email protected]
Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX-77030, USA

Abstract

For most problems in science and engineering we can obtain data sets that describe the system from various perspectives and record the behaviour of its individual components. Heterogeneous data sources can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with data on the context or additional constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data sets that can be expressed in a matrix, including those from attribute-based representations, ontologies, associations and networks. We demonstrate its utility on a gene function prediction problem in a case study with eleven different data sources. Our fusion algorithm compares favourably to state-of-the-art multiple kernel learning and achieves higher accuracy than can be obtained from any single data source alone.

arXiv:1307.0803v2 [cs.LG] 6 Feb 2015

1. Introduction

Data abounds in all areas of human endeavour. Big data (Cuzzocrea et al., 2011) is not only large in volume and may include thousands of features, but it is also heterogeneous. We may gather various data sets that are directly related to the problem, or data sets that are loosely related to our study but could be useful when combined with other data sets. Consider, for example, the exposome (Rappaport & Smith, 2010) that encompasses the totality of human endeavour in the study of disease. Let us say that we examine susceptibility to a particular disease and have access to the patients' clinical data together with data on their demographics, habits, living environments, friends, relatives, and movie-watching habits and genre ontology. Mining such a diverse data collection may reveal interesting patterns that would remain hidden if we considered only the directly related, clinical data. What if the disease was less common in living areas with more open spaces or in environments where people need to walk instead of drive to the nearest grocery? Is the disease less common among those that watch comedies and ignore politics and news?

Methods for data fusion can collectively treat data sets and combine diverse data sources even when they differ in their conceptual, contextual and typographical representation (Aerts et al., 2006; Boström et al., 2007). Individual data sets may be incomplete, yet because of their diversity and complementarity, fusion improves the robustness and predictive performance of the resulting models.

Data fusion approaches can be classified into three main categories according to the modeling stage at which fusion takes place (Pavlidis et al., 2002; Schölkopf et al., 2004; Maragos et al., 2008; Greene & Cunningham, 2009). Early (or full) integration transforms all data sources into a single, feature-based table and treats this as a single data set. This is theoretically the most powerful scheme of multimodal fusion because the inferred model can contain any type of relationship between the features from within and between the data sources. Early integration relies on procedures for feature construction. For our exposome example, patient-specific data would need to include both clinical data and information from the movie genre ontologies. The former may be trivial as this data is already related to each specific patient, while the latter requires more complex feature engineering.

Early integration neglects the modular structure of the data.

In late (decision) integration, each data source gives rise to a separate model. Predictions of these models are fused by model weighting. Again, prior to model inference, it is necessary to transform each data set to encode relations to the target concept. For our example, information on the movie preferences of friends and relatives would need to be mapped to disease associations. Such transformations may not be trivial and would need to be crafted independently for each data source. Once the data are transformed, the fusion can utilize any of the existing ensemble methods.

The youngest branch of data fusion algorithms is intermediate (partial) integration. It relies on algorithms that explicitly address the multiplicity of data and fuse them through inference of a joint model. Intermediate integration does not fuse the input data, nor does it develop separate models for each data source. It instead retains the structure of the data sources and merges them at the level of a predictive model. This particular approach is often preferred because of its superior predictive accuracy (Pavlidis et al., 2002; Lanckriet et al., 2004b; Gevaert et al., 2006; Tang et al., 2009; van Vliet et al., 2012), but for a given model type, it requires the development of a new inference algorithm.

In this paper we report on the development of a new method for intermediate data fusion based on constrained matrix factorization. Our aim was to construct an algorithm that requires only minimal transformation (if any at all) of input data and can fuse attribute-based representations, ontologies, associations and networks. We begin with a description of related work and the background of matrix factorization. We then present our data fusion algorithm and finally demonstrate its utility in a study comparing it to state-of-the-art intermediate integration with multiple kernel learning and early integration with random forests.

2. Background and Related Work

Approximate matrix factorization estimates a data matrix R as a product of low-rank matrix factors that are found by solving an optimization problem. In two-factor decomposition, R \in \mathbb{R}^{n \times m} is decomposed to a product WH, where W \in \mathbb{R}^{n \times k}, H \in \mathbb{R}^{k \times m} and k \ll \min(n, m). A large class of matrix factorization algorithms minimize the discrepancy between the observed matrix and its low-rank approximation, such that R \approx WH. For instance, SVD, non-negative matrix factorization and exponential family PCA all minimize a Bregman divergence (Singh & Gordon, 2008). The cost function can be further extended with various constraints (Zhang et al., 2011; Wang et al., 2012). Wang et al. (2008) devised a penalized matrix tri-factorization to include a set of must-link and cannot-link constraints. We exploit this approach to include relations between objects of the same type. Constraints would for instance allow us to include information from movie genre ontologies or social network friendships.

Although often used for dimensionality reduction, clustering or low-rank approximation, there have been few applications of matrix factorization for data fusion. Lange & Buhmann (2005) proposed an early integration by non-negative matrix factorization of a target matrix, which was a convex combination of similarity matrices obtained from multiple information sources. Their work is similar to that of Wang et al. (2012), who applied non-negative matrix tri-factorization with input matrix completion.

Zhang et al. (2012) proposed a joint matrix factorization to decompose a number of data matrices R_i into a common basis matrix W and different coefficient matrices H_i, such that R_i \approx W H_i, by minimizing \sum_i \|R_i - W H_i\|. This is an intermediate integration approach with different data sources that describe objects of the same type. A similar approach, but with added network-regularized constraints, has also been proposed (Zhang et al., 2011). Our work extends these two approaches by simultaneously dealing with heterogeneous data sets and objects of different types.

In the paper we use a variant of three-factor matrix factorization that decomposes R into G \in \mathbb{R}^{n \times k_1}, F \in \mathbb{R}^{m \times k_2} and S \in \mathbb{R}^{k_1 \times k_2} such that R \approx G S F^T (Wang et al., 2008). The approximation can be rewritten such that entry R(p, q) is approximated by an inner product of the p-th row of matrix G and a linear combination of the columns of matrix S, weighted by the q-th column of F. The matrix S, which has relatively few entries compared to R (k_1 \ll n, k_2 \ll m), is used to represent many data vectors, and a good approximation can only be achieved in the presence of the latent structure in the original data.

We are currently witnessing increasing interest in the joint treatment of heterogeneous data sets and the emergence of approaches specifically designed for data fusion. These include canonical correlation analysis (Chaudhuri et al., 2009), combining many interaction networks into a composite network (Mostafavi & Morris, 2012), multiple graph clustering with linked matrix factorization (Tang et al., 2009), a mixture of Markov chains associated with different graphs (Zhou & Burges, 2007), dependency-seeking clustering algorithms with variational Bayes (Klami & Kaski, 2008), latent factor analysis (Lopes et al., 2011; Luttinen & Ilin, 2009), nonparametric Bayes ensemble learning (Xing & Dunson, 2011), approaches based on Bayesian theory (Zhang & Ji, 2006; Alexeyenko & Sonnhammer, 2009; Huttenhower et al., 2009), neural networks (Carpenter et al., 2005) and module-guided random forests (Chen & Zhang, 2013). These approaches either fuse input data (early integration) or predictions (late integration) and do not directly combine heterogeneous representations of objects of different types.

A state-of-the-art approach to intermediate data integration is kernel-based learning. Multiple kernel learning (MKL) was pioneered by Lanckriet et al. (2004a) and is an additive extension of single-kernel SVM to incorporate multiple kernels in classification, regression and clustering. The MKL assumes that E_1, \ldots, E_r are r different representations of the same set of n objects. Extension from single to multiple data sources is achieved by an additive combination of kernel matrices, given by

    \Omega = \left\{ \sum_{i=1}^{r} \theta_i K_i : \theta_i \geq 0 \ \forall i, \ \sum_{i=1}^{r} \theta_i^{\delta} = 1, \ K_i \succeq 0 \right\},

where \theta_i are weights of the kernel matrices, \delta is a parameter determining the norm of the constraint posed on the coefficients (for L2- and Lp-norm MKL, see Kloft et al. (2009); Yu et al. (2010; 2012)) and K_i are normalized kernel matrices centered in the Hilbert space. Among other improvements, Yu et al. (2010) extended the MKL framework of Lanckriet et al. (2004a) by optimizing various norms in the dual problem of SVMs, which allows non-sparse optimal kernel coefficients \theta_i^*. The heterogeneity of data sources in the MKL is resolved by transforming different object types and data structures (e.g., strings, vectors, graphs) into kernel matrices. These transformations depend on the choice of the kernels, which in turn affects the method's performance (Debnath & Takahashi, 2004).

Figure 1. Conceptual fusion configuration of four object types, E_1, E_2, E_3 and E_4. Data sources relate pairs of object types (matrices with shades of gray). For example, data matrix R_42 relates object types E_4 and E_2. Some relations are missing; there is no data source relating E_3 and E_1. Constraint matrices (in blue) relate objects of the same type. In our example, constraints are provided for object types E_2 (one constraint matrix) and E_4 (two constraint matrices).

3. Data Fusion Algorithm

Our data fusion algorithm considers r object types E_1, \ldots, E_r and a collection of data sources, each relating a pair of object types (E_i, E_j). In our introductory example of the exposome, object types could be a patient, a disease or a living environment, among others. If there are n_i objects of type E_i (o_p^i is the p-th object of type E_i) and n_j objects of type E_j, we represent the observations from the data source that relates (E_i, E_j) for i \neq j in a sparse matrix R_{ij} \in \mathbb{R}^{n_i \times n_j}. An example of such a matrix would be one that relates patients and drugs by reporting on their current prescriptions. Notice that in general matrices R_{ij} and R_{ji} need not be symmetric. A data source that provides relations between objects of the same type E_i is represented by a constraint matrix \Theta_i \in \mathbb{R}^{n_i \times n_i}. Examples of such constraints are social networks and drug interactions. In real-world scenarios we might not have access to relations between all pairs of object types. Our data fusion algorithm still integrates all available data if the underlying graph of relations between object types is connected. Fig. 1 shows an example of a block configuration of the fusion system with four object types.

To retain the block structure of our system, we propose the simultaneous factorization of all relation matrices R_{ij} constrained by \Theta_i. The resulting system contains factors that are specific to each data source and factors that are specific to each object type. Through factor sharing we fuse the data but also identify source-specific patterns.

The proposed fusion approach is different from treating an entire system (e.g., from Fig. 1) as a large single matrix. Factorization of such a matrix would yield factors that are not object type-specific and would thus disregard the structure of the system. We also show (Sec. 5.5) that such an approach is inferior in terms of predictive performance.

We apply data fusion to infer relations between two target object types, E_i and E_j (Sec. 3.4 and Sec. 3.5). This relation, encoded in a target matrix R_{ij}, will be observed in the context of all other data sources (Sec. 3.1). We assume that R_{ij} is a [0, 1]-matrix that is only partially observed. Its entries indicate a degree of relation, 0 denoting no relation and 1 denoting the strongest relation. We aim to predict unobserved entries in R_{ij} by reconstructing them through matrix factorization. Such treatment in general applies to multi-class or multi-label classification tasks, which are conveniently addressed by multiple kernel fusion (Yu et al., 2010), with which we compare our performance in this paper.

3.1. Factorization

An input to data fusion is a relation block matrix R that conceptually represents all relation matrices:

    R = \begin{bmatrix} 0 & R_{12} & \cdots & R_{1r} \\ R_{21} & 0 & \cdots & R_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ R_{r1} & R_{r2} & \cdots & 0 \end{bmatrix}.    (1)

A block in the i-th row and j-th column (R_{ij}) of matrix R represents the relationship between object types E_i and E_j. The p-th object of type E_i (i.e. o_p^i) and the q-th object of type E_j (i.e. o_q^j) are related by R_{ij}(p, q).

We additionally consider constraints relating objects of the same type. Several data sources may be available for each object type. For instance, personal relations may be observed from a social network or a family tree. Assume there are t_i \geq 0 data sources for object type E_i, represented by a set of constraint matrices \Theta_i^{(t)} for t \in \{1, 2, \ldots, t_i\}. Constraints are collectively encoded in a set of constraint block diagonal matrices \Theta^{(t)} for t \in \{1, 2, \ldots, \max_i t_i\}:

    \Theta^{(t)} = \begin{bmatrix} \Theta_1^{(t)} & 0 & \cdots & 0 \\ 0 & \Theta_2^{(t)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Theta_r^{(t)} \end{bmatrix}.    (2)

The i-th block along the main diagonal of \Theta^{(t)} is zero if t > t_i. Entries in constraint matrices are positive for objects that are not similar and negative for objects that are similar. The former are known as cannot-link constraints because they impose penalties on the current approximation of the matrix factors, and the latter are must-link constraints, which are rewards that reduce the value of the cost function during optimization.

The block matrix R is tri-factorized into block matrix factors G and S:

    G = \begin{bmatrix} G_1^{n_1 \times k_1} & 0 & \cdots & 0 \\ 0 & G_2^{n_2 \times k_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & G_r^{n_r \times k_r} \end{bmatrix}, \qquad
    S = \begin{bmatrix} 0 & S_{12}^{k_1 \times k_2} & \cdots & S_{1r}^{k_1 \times k_r} \\ S_{21}^{k_2 \times k_1} & 0 & \cdots & S_{2r}^{k_2 \times k_r} \\ \vdots & \vdots & \ddots & \vdots \\ S_{r1}^{k_r \times k_1} & S_{r2}^{k_r \times k_2} & \cdots & 0 \end{bmatrix}.    (3)

A factorization rank k_i is assigned to E_i during inference of the factorized system. Factors S_{ij} define the relation between object types E_i and E_j, while factors G_i are specific to objects of type E_i and are used in the reconstruction of every relation with this object type. In this way, each relation matrix R_{ij} obtains its own factorization G_i S_{ij} G_j^T with factor G_i (G_j) that is shared across relations which involve object types E_i (E_j). This can also be observed from the block structure of the reconstructed system G S G^T:

    \begin{bmatrix} 0 & G_1 S_{12} G_2^T & \cdots & G_1 S_{1r} G_r^T \\ G_2 S_{21} G_1^T & 0 & \cdots & G_2 S_{2r} G_r^T \\ \vdots & \vdots & \ddots & \vdots \\ G_r S_{r1} G_1^T & G_r S_{r2} G_2^T & \cdots & 0 \end{bmatrix}.    (4)

The objective function minimized by penalized matrix tri-factorization ensures good approximation of the input data and adherence to must-link and cannot-link constraints. We extend it to include multiple constraint matrices for each object type:

    \min_{G \geq 0} \|R - G S G^T\| + \sum_{t=1}^{\max_i t_i} \mathrm{tr}(G^T \Theta^{(t)} G),    (5)

where \|\cdot\| and \mathrm{tr}(\cdot) denote the Frobenius norm and trace, respectively.

For decomposition of relation matrices we derive updating rules based on the penalized matrix tri-factorization of Wang et al. (2008). The algorithm for solving the optimization problem in Eq. (5) initializes the matrix factors (see Sec. 3.6) and improves them iteratively. Successive updates of G_i and S_{ij} converge to a local minimum of the optimization problem.
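To make the block construction of Sec. 3.1 concrete, the following NumPy sketch (not from the paper; the sizes, random data and the symmetric choices R_21 = R_12^T and S_21 = S_12^T are illustrative assumptions) assembles R, G, S and a single constraint matrix \Theta for two object types and evaluates the penalized objective of Eq. (5):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: two object types with n1, n2 objects and ranks k1, k2.
n1, n2, k1, k2 = 6, 5, 3, 2
R12 = rng.random((n1, n2))               # observed relation between types 1 and 2
Theta1 = rng.standard_normal((n1, n1))   # signed constraint matrix for type 1

# Block relation matrix R (Eq. 1) with zero diagonal blocks.
# In general R21 need not equal R12.T; we assume it here for brevity.
R = np.block([[np.zeros((n1, n1)), R12],
              [R12.T, np.zeros((n2, n2))]])

# Block-diagonal factor G and zero-diagonal factor S (Eq. 3).
G1, G2 = rng.random((n1, k1)), rng.random((n2, k2))
S12 = rng.random((k1, k2))
G = np.block([[G1, np.zeros((n1, k2))],
              [np.zeros((n2, k1)), G2]])
S = np.block([[np.zeros((k1, k1)), S12],
              [S12.T, np.zeros((k2, k2))]])

# Block-diagonal constraint matrix Theta (Eq. 2); type 2 has no constraints.
Theta = np.block([[Theta1, np.zeros((n1, n2))],
                  [np.zeros((n2, n1)), np.zeros((n2, n2))]])

# Objective of Eq. (5): Frobenius approximation error plus constraint term.
objective = np.linalg.norm(R - G @ S @ G.T, 'fro') + np.trace(G.T @ Theta @ G)
print(float(objective))
```

The reconstruction `G @ S @ G.T` exhibits exactly the block structure of Eq. (4): its off-diagonal blocks are G_1 S_12 G_2^T and G_2 S_21 G_1^T.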


Multiplicative updating rules are derived from Wang et al. (2008) by fixing one matrix factor (i.e. G) and considering the roots of the partial derivative of the Lagrangian function with respect to the other matrix factor (i.e. S, and vice-versa). The Lagrangian is constructed from the objective function given in Eq. (5). The update rule alternates between first fixing G and updating S, and then fixing S and updating G, until convergence. The update rule for S is:

    S \leftarrow (G^T G)^{-1} G^T R G (G^T G)^{-1}.    (6)

The rule for factor G incorporates the data on constraints. We update each element of G at position (p, q) by multiplying it with:

    \sqrt{ \frac{ (RGS)^+_{(p,q)} + [G(S G^T G S)^-]_{(p,q)} + \sum_t ((\Theta^{(t)})^- G)_{(p,q)} }{ (RGS)^-_{(p,q)} + [G(S G^T G S)^+]_{(p,q)} + \sum_t ((\Theta^{(t)})^+ G)_{(p,q)} } }.    (7)

Here, X^+_{(p,q)} is defined as X(p, q) if X(p, q) \geq 0 and 0 otherwise. Similarly, X^-_{(p,q)} is -X(p, q) if X(p, q) \leq 0 and 0 otherwise. Therefore, both X^+ and X^- are non-negative matrices. This definition is applied to the matrices in Eq. (7), i.e. RGS, S G^T G S, and \Theta^{(t)} for t \in \{1, 2, \ldots, \max_i t_i\}.

3.2. Stopping Criteria

In this paper we apply data fusion to infer relations between two target object types, E_i and E_j. We hence define a stopping criterion that observes convergence in approximating only the target matrix R_{ij}. Our convergence criterion is:

    \|R_{ij} - G_i S_{ij} G_j^T\| < \epsilon,    (8)

where \epsilon is a user-defined parameter, possibly refined through observing log entries of several runs of the factorization algorithm. In our experiments \epsilon was set to 10^{-5}. To reduce the computational load, the convergence criterion was assessed only every fifth iteration.

3.3. Parameter Estimation

To fuse data sources on r object types, it is necessary to estimate r factorization ranks, k_1, k_2, \ldots, k_r. These are chosen from a predefined interval of possible values for each rank by estimating the model quality. To reduce the number of needed factorization runs we mimic the bisection method by first testing rank values at the midpoint and borders of the specified ranges and then, for each rank, selecting the subinterval for which better model quality was achieved. We evaluate the models through the explained variance, the residual sum of squares (RSS) and a measure based on the cophenetic correlation coefficient \rho. We compute these measures from the target relation matrix R_{ij}. The RSS is computed from the observed entries in R_{ij} as RSS(R_{ij}) = \sum_{(o_p^i, o_q^j) \in A(E_i, E_j)} [(R_{ij} - G_i S_{ij} G_j^T)(p, q)]^2, where A(E_i, E_j) is the set of known associations between objects of E_i and E_j. The explained variance for R_{ij} is r^2(R_{ij}) = 1 - RSS(R_{ij}) / \sum_{p,q} [R_{ij}(p, q)]^2. The cophenetic correlation score was implemented as described in Brunet et al. (2004).

We assess the three quality scores through internal cross-validation and observe how r^2(R_{ij}), RSS(R_{ij}) and \rho(R_{ij}) vary as the factorization ranks change. We select ranks k_1, k_2, \ldots, k_r where the cophenetic coefficient begins to fall, the explained variance is high and the RSS curve shows an inflection point (Hutchins et al., 2008).

3.4. Prediction from Matrix Factors

The approximate relation matrix \hat{R}_{ij} for the target pair of object types E_i and E_j is reconstructed as:

    \hat{R}_{ij} = G_i S_{ij} G_j^T.    (9)

When the model is requested to propose relations for a new object o^i_{n_i+1} of type E_i that was not included in the training data, we need to estimate its factorized representation and use the resulting factors for prediction. We formulate a non-negative linear least-squares (NNLS) problem and solve it with an efficient interior point Newton-like method (Van Benthem & Keenan, 2004) for \min_{x \geq 0} \|(S G^T)^T x - o^i_{n_i+1}\|_2, where o^i_{n_i+1} \in \mathbb{R}^{\sum_i n_i} is the original description of the object across all available relation matrices and x \in \mathbb{R}^{\sum_i k_i} is its factorized representation. The solution vector x^T is added to G and a new \hat{R}_{ij} \in \mathbb{R}^{(n_i+1) \times n_j} is computed using Eq. (9).

We would like to identify object pairs (o_p^i, o_q^j) for which the predicted degree of relation \hat{R}_{ij}(p, q) is unusually high. We are interested in candidate pairs for which the estimated association score \hat{R}_{ij}(p, q) is greater than the mean estimated score of all known relations of o_p^i:

    \hat{R}_{ij}(p, q) > \frac{1}{|A(o_p^i, E_j)|} \sum_{o_m^j \in A(o_p^i, E_j)} \hat{R}_{ij}(p, m),    (10)

where A(o_p^i, E_j) is the set of all objects of E_j related to o_p^i. Notice that this rule is row-centric; that is, given an object of type E_i, it searches for objects of the other type (E_j) that it could be related to. We can modify the rule to become column-centric, or even combine the two rules.

For example, let us consider that we are studying disease predispositions for a set of patients. Let the patients be objects of type E_i and the diseases objects of type E_j. A patient-centric rule would consider a patient and his medical history and, through Eq. (10), propose a set of new disease associations. A disease-centric rule would instead consider all patients already associated with a specific disease and identify other patients with a sufficiently high association score.

In our experiments we combine the row-centric and column-centric approaches. We first apply the row-centric approach to identify candidates of type E_i and then estimate the strength of association to a specific object o_q^j by reporting the percentile of the association score in the distribution of scores for all true associations of o_q^j, that is, by considering the scores in the q-th column of \hat{R}_{ij}.

3.5. An Ensemble Approach to Prediction

Instead of a single model, we construct an ensemble of factorization models. The resulting matrix factors in each model differ due to the initial random conditions or small random perturbations of the selected factorization ranks. We use each factorization system for inference of associations (Sec. 3.4) and then select the candidate pairs through a majority vote. That is, the rule from Eq. (10) must apply in more than one half of the factorized systems of the ensemble. Ensembles improved the predictive accuracy and stability of the factorized system and the robustness of the results. In our experiments the ensembles combined 15 factorization models.

3.6. Matrix Factor Initialization

The inference of the factorized system in Sec. 3.1 is sensitive to the initialization of factor G. Proper initialization sidesteps the issue of local convergence and reduces the number of iterations needed to obtain matrix factors of equal quality. We initialize G by separately initializing each G_i, using algorithms for single-matrix factorization. Factors S are computed from G (Eq. (6)) and do not require initialization.

Wang et al. (2008) and several other authors (Lee & Seung, 2000) use simple random initialization. Other, more informed initialization algorithms include random C (Albright et al., 2006), random Acol (Albright et al., 2006), non-negative double SVD and its variants (Boutsidis & Gallopoulos, 2008), and k-means clustering or relaxed SVD-centroid initialization (Albright et al., 2006). We show that the latter approaches are indeed better than a random initialization (Sec. 5.4). We use random Acol in our case study. Random Acol computes each column of G_i as an element-wise average of a random subset of columns in R_{ij}.

4. Experiments

We considered a gene function prediction problem from molecular biology, where recent technological advancements have allowed researchers to collect large and diverse experimental data sets. Gene function prediction is an important problem and the subject of much recent research (Yu et al., 2010; Vaske et al., 2010; Moreau & Tranchevent, 2012; Mostafavi & Morris, 2012). It is expected that various data sources are complementary and that high accuracy of predictions can be achieved through data fusion (Parikh & Polikar, 2007; Pandey et al., 2010; Savage et al., 2010; Xing & Dunson, 2011). We studied the fusion of eleven different data sources to predict gene function in the social amoeba Dictyostelium discoideum and report on the cross-validated accuracy for 148 gene annotation terms (classes).

We compare our data fusion algorithm to an early integration by a random forest (Boulesteix et al., 2008) and an intermediate integration by multiple kernel learning (MKL) (Yu et al., 2010). Kernel-based fusion used a multi-class L2-norm MKL with Vapnik's SVM (Ye et al., 2008). The MKL was formulated as a second order cone program (SOCP) and its dual problem was solved by the conic optimization solver SeDuMi (http://sedumi.ie.lehigh.edu). Random forests from the Orange data mining suite (http://orange.biolab.si) were used with default parameters.

4.1. Data

The social amoeba D. discoideum, a popular model organism in biomedical research, has about 12,000 genes, some of which are functionally annotated with terms from the Gene Ontology (GO, http://www.geneontology.org). The annotation is rather sparse and only ~1,400 genes have experimentally derived annotations. The Dictyostelium community can thus gain from computational models with accurate function predictions.

We observed six object types: genes (type 1), ontology terms (type 2), experimental conditions (type 3), publications from the PubMed database (PMID) (type 4), Medical Subject Headings (MeSH) descriptors (type 5), and KEGG pathways (type 6).


5), and KEGG4 pathways (type 6). Object types Θ5 and their observed relations are shown in Fig. 2.The data included gene expression measured during di↵er- ent time-points of a 24-hour development cycle (Parikh R45 MeSH PMID et al., 2010)(R13, 14 experimental conditions), gene Descriptor annotations with experimental evidence code to 148 generic slim terms from the GO (R12), PMIDs and R14 R42 their associated D. discoideum genes from Dictybase5 Θ1 (R14), genes participating in KEGG pathways (R16), Experimental assignments of MeSH descriptors to publications from R Condition 13 Θ2 PubMed (R45), references to published work on asso- ciations between a specific GO term and gene product Gene (R42), and associations of enzymes involved in KEGG R12 GO Term pathways and related to GO terms (R62). R16 To balance R , our target relation matrix, we added 12 R62 an equal number of non-associations for which there Θ6 is no evidence of any type in the GO. We constrain our system by considering gene interaction scores from 6 KEGG STRING v9.0 (⇥1) and slim term similarity scores hops Pathway (⇥2) computed as 0.8 , where hops is the length of the shortest path between two terms in the GO graph. Similarly, MeSH descriptors are constrained with the average number of hops in the MeSH hierarchy be- Figure 2. The fusion configuration. Nodes represent object types used in our study. Edges correspond to relation and tween each pair of descriptors (⇥5). Constraints be- tween KEGG pathways correspond to the number of constraint matrices. The arc that represents the target matrix R12 and its object types are highlighted. common ortholog groups (⇥6). The slim subset of GO terms was used to limit the optimization complexity of the MKL and the number of variables in the SOCP, and to ease the computational burden of early inte- 4.2. Scoring gration by random forests, which inferred a separate We estimated the quality of inferred models by ten- model for each of the terms. fold cross-validation. 
In each iteration, we split the To study the e↵ects of data sparseness, we conducted gene set to a train and test set. The data on genes an experiment in which we selected either the 100 or from the test set was entirely omitted from the train- 1000 most GO-annotated genes (see second column of ing data. We developed prediction models from the Table 1 for sparsity). training data and tested them on the genes from the test set. The performance was evaluated using an F1 In a separate experiment we examined predictions of score, a harmonic mean of precision and recall, which gene association with any of nine GO terms that are was averaged across cross-validation runs. of specific relevance to the current research in the Dic- tyostelium community (upon consultations with Gad 4.3. Preprocessing for Kernel-Based Fusion Shaulsky, Baylor College of Medicine, Houston, TX; see Table 2). Instead of using a generic slim subset of We generated an RBF kernel for gene expression mea- terms, we examined the predictions in the context of a surements from R13 with width =0.5 and the RBF 2 2 complete set of GO terms. This resulted in a data set function (xi, xj)=exp( xi xj /2 ), and a || || with 2.000 terms, each term having on average 9.64 linear kernel for [0, 1]-protein-interaction matrix from ⇠ direct gene annotations. ⇥1. Kernels were applied to data matrices. We used a linear kernel to generate a kernel matrix from D. 4 http://www.kegg.jp discoideum specific genes that participate in pathways 5http://dictybase.org/Downloads (R ), and a kernel matrix from PMIDs and their as- 6http://string-db.org 16 sociated genes (R14). Several data sources describe relations between object types other than genes. For kernel-based fusion we had to transform them to ex- plicitly relate to genes. 
For instance, to relate genes and MeSH descriptors, we counted the number of publications that were associated with a specific gene (R14) and were assigned a specific MeSH descriptor (R45, see also Fig. 2). A linear kernel was applied to the resulting matrix. Kernel matrices that incorporated relations between KEGG pathways and GO terms (R62), and publications and GO terms were obtained in similar fashion.

To represent the hierarchical structure of MeSH descriptors (Θ5), the semantic structure of the GO graph (Θ2) and ortholog groups that correspond to KEGG pathways (Θ6), we considered the genes as nodes in three distinct large weighted graphs. In the graph for Θ5, the link between two genes was weighted by the similarity of their associated sets of MeSH descriptors using information from R14 and R45. We considered the MeSH hierarchy to measure these similarities. Similarly, for the graph for Θ2 we considered the GO semantic structure in computing similarities of sets of GO terms associated with genes. In the graph for Θ6, the gene edges were weighted by the number of common KEGG ortholog groups. Kernel matrices were constructed with a diffusion kernel (Kondor & Lafferty, 2002).

The resulting kernel matrices were centered and normalized. In cross-validation, only the training part of the matrices was preprocessed for learning, while for prediction, centering and normalization were performed on the entire data set. The prediction task was defined through the classification matrix of genes and their associated GO slim terms from R12.

4.4. Preprocessing for Early Integration

The gene-related data matrices prepared for kernel-based fusion were also used for early integration and were concatenated into a single table. Each row in the table represented a gene profile obtained from all available data sources. For our case study, each gene was characterized by a fixed 9,362-dimensional feature vector. For each GO slim term, we then separately developed a classifier with a random forest of classification trees and reported cross-validated results.

5. Results and Discussion

5.1. Predictive Performance

Table 1 presents the cross-validated F1 scores for the data set of slim terms. The accuracy of our matrix factorization approach is at least comparable to MKL and substantially higher than that of early integration by random forests. The performance of all three fusion approaches improved when more genes and hence more data were included in the study. Adding genes with sparser profiles also increased the overall data sparsity, to which the factorization approach was least sensitive.

Table 1. Cross-validated F1 scores for fusion by matrix factorization (MF), a kernel-based method (MKL) and random forests (RF).

D. discoideum task   MF     MKL    RF
100 genes            0.799  0.781  0.761
1000 genes           0.826  0.787  0.767
Whole genome         0.831  0.800  0.782

The accuracy for nine selected GO terms is given in Table 2. Our factorization approach yields consistently higher F1 scores than the other two approaches. Again, the early integration by random forests is inferior to both intermediate integration methods. Notice that, with one or two exceptions, F1 scores are very high. This is important, as all nine gene processes and functions observed are relevant for current research of D. discoideum, where the methods for data fusion can yield new candidate genes for focused experimental studies.

Our fusion approach is faster than multiple kernel learning. Factorization required 18 minutes of runtime on a standard desktop computer compared to 77 minutes for MKL to finish one iteration of cross-validation on a whole-genome data set.

5.2. Sensitivity to Inclusion of Data Sources

Inclusion of additional data sources improves the accuracy of prediction models. We illustrate this in Fig. 3(a), where we started with only the target data source R12 and then added either R13 or Θ1 or both. Similar effects were observed when we studied other combinations of data sources (not shown here for brevity).

5.3. Sensitivity to Inclusion of Constraints

We varied the sparseness of gene constraint matrix Θ1 by holding out a random subset of protein-protein interactions. We set the entries of Θ1 that correspond to held-out constraints to zero so that they did not affect the cost function during optimization. Fig. 3(b) shows that including additional information on genes in the form of constraints improves the predictive performance of the factorization model.

5.4. Matrix Factor Initialization Study

We studied the effect of initialization by observing the error of the resulting factorization after one and after twenty iterations of factorization matrix updates,


Table 2. Gene ontology term-specific cross-validated F1 scores for fusion by matrix factorization (MF), a kernel-based method (MKL) and random forests (RF).

GO term name                              Term identifier  Namespace  Size  MF     MKL    RF
Activation of adenylate cyclase activity  0007190          BP         11    0.834  0.770  0.758
Chemotaxis                                0006935          BP         58    0.981  0.794  0.538
Chemotaxis to cAMP                        0043327          BP         21    0.922  0.835  0.798
Phagocytosis                              0006909          BP         33    0.956  0.892  0.789
Response to bacterium                     0009617          BP         51    0.899  0.788  0.785
Cell-cell adhesion                        0016337          BP         14    0.883  0.867  0.728
Actin binding                             0003779          MF         43    0.676  0.664  0.642
Lysozyme activity                         0003796          MF         4     0.782  0.774  0.754
Sequence-specific DNA binding TFA         0003700          MF         79    0.956  0.894  0.732
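The F1 scores reported in Tables 1 and 2 are harmonic means of precision and recall; a minimal self-contained computation, with made-up label vectors for one GO term:

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up gene/term association labels, purely for illustration.
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f1_score(y_true, y_pred))  # precision 2/3, recall 2/3 -> F1 = 2/3
```

In the experiments above this quantity is computed per cross-validation fold and then averaged across the ten runs.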

[Figure 3: two panels plotting averaged F1 score; (a) for the data-source combinations R12; R12, Θ1; R12, R13; and R12, R13, Θ1 (y-axis 0.0–1.0); (b) against the proportion of Θ1 included in training, from 0.0 to 1.0 (y-axis 0.810–0.835).]

Figure 3. Adding new data sources (a) or incorporating more object-type-specific constraints in Θ1 (b) both increase the accuracy of the matrix factorization-based models.

the latter being about one fourth of the iterations required for factorization to converge. We estimate the error relative to the optimal (k1, k2, ..., k6)-rank approximation given by the SVD. For iteration v and matrix Rij the error is computed by:

$$\mathrm{Err}_{ij}(v) = \frac{\left\lVert R_{ij} - G_i^{(v)} S_{ij}^{(v)} \bigl(G_j^{(v)}\bigr)^{T} \right\rVert - d_F\bigl(R_{ij}, [R_{ij}]_k\bigr)}{d_F\bigl(R_{ij}, [R_{ij}]_k\bigr)}, \qquad (11)$$

where G_i^(v), G_j^(v) and S_ij^(v) are matrix factors obtained after executing v iterations of the factorization algorithm. In Eq. (11), d_F(R_ij, [R_ij]_k) = ‖R_ij − U_k Σ_k V_k^T‖ denotes the Frobenius distance between R_ij and its k-rank approximation given by the SVD, where k = max(k_i, k_j) is the approximation rank. Err_ij(v) is a pessimistic measure of quantitative accuracy because of the choice of k. This error measure is similar to the error of the two-factor non-negative matrix factorization from (Albright et al., 2006).

Table 3 shows the results for the experiment with the 1000 most GO-annotated D. discoideum genes and selected factorization ranks k1 = 65, k2 = 35, k3 = 13, k4 = 35, k5 = 30 and k6 = 10. The informed initialization algorithms surpass the random initialization. Of these, the random Acol algorithm performs best in terms of accuracy and is also one of the simplest.

Table 3. Effect of initialization algorithm on predictive power of factorization model.

Method      Time G(0)  Storage G(0)  Err12(1)  Err12(20)
Rand.       0.0011 s   618K          5.11      3.61
Rand. C     0.1027 s   553K          2.97      1.67
Rand. Acol  0.0654 s   505K          1.59      1.30
K-means     0.4029 s   562K          2.47      2.20
NNDSVDa     0.1193 s   562K          3.50      2.01

5.5. Early Integration by Matrix Factorization

Our data fusion approach simultaneously factorizes individual blocks of data in R. Alternatively, we could also disregard the data structure and treat R as a single data matrix. Such data treatment would transform our data fusion approach to that of early integration and lose the benefits of structured system- and source-specific factorization. To prove this experimentally, we considered the 1,000 most GO-annotated D. discoideum genes. The resulting cross-validated F1 score for factorization-based early integration was 0.576, compared to 0.826 obtained with our proposed data fusion algorithm. This result is not surprising, as neglecting the structure of the system also causes the loss of the structure in matrix factors and the loss of zero blocks in factors S and G from Eq. (3). Clearly, data structure carries substantial information and should be retained in the model.

6. Conclusion

We have proposed a new data fusion algorithm. The approach is flexible and, in contrast to state-of-the-art kernel-based methods, requires minimal, if any, preprocessing of data. This latter feature, together with its excellent accuracy and time response, are the principal advantages of our new algorithm.

Our approach can model any collection of data sets, each of which can be expressed in a matrix. The gene function prediction task considered in the paper, which has traditionally been regarded as a classification problem (Larranaga et al., 2006), is just one example of the types of problems that can be addressed with our method. We anticipate the utility of factorization-based data fusion in multiple-target learning, association mining, clustering, link prediction or structured output prediction.

Acknowledgments

We thank Janez Demšar and Gad Shaulsky for their comments on the early version of this manuscript. We acknowledge the support for our work from the Slovenian Research Agency (P2-0209, J2-9699, L2-1112), National Institute of Health (P01-HD39691) and European Commission (Health-F5-2010-242038).

References

Aerts, Stein, Lambrechts, Diether, Maity, Sunit, Van Loo, Peter, Coessens, Bert, De Smet, Frederik, Tranchevent, Leon-Charles, De Moor, Bart, Marynen, Peter, Hassan, Bassem, Carmeliet, Peter, and Moreau, Yves. Gene prioritization through genomic data fusion. Nature Biotechnology, 24(5):537–544, 2006. doi: 10.1038/nbt1203.

Albright, R., Cox, J., Duling, D., Langville, A. N., and Meyer, C. D. Algorithms, initializations, and convergence for the nonnegative matrix factorization. Technical report, Department of Mathematics, North Carolina State University, 2006.

Alexeyenko, Andrey and Sonnhammer, Erik L. L. Global networks of functional coupling in eukaryotes from comprehensive data integration. Genome Research, 19(6):1107–1116, 2009. doi: 10.1101/gr.087528.108.

Boström, Henrik, Andler, Sten F., Brohede, Marcus, Johansson, Ronnie, Karlsson, Alexander, van Laere, Joeri, Niklasson, Lars, Nilsson, Maria, Persson, Anne, and Ziemke, Tom. On the definition of information fusion as a field of research. Technical report, University of Skövde, School of Humanities and Informatics, Skövde, Sweden, 2007.

Boulesteix, Anne-Laure, Porzelius, Christine, and Daumer, Martin. Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value. Bioinformatics, 24(15):1698–1706, 2008.

Boutsidis, Christos and Gallopoulos, Efstratios. SVD based initialization: A head start for nonnegative


matrix factorization. Pattern Recogn., 41(4):1350–1362, 2008. doi: 10.1016/j.patcog.2007.09.010.

Brunet, Jean-Philippe, Tamayo, Pablo, Golub, Todd R., and Mesirov, Jill P. Metagenes and molecular pattern discovery using matrix factorization. PNAS, 101(12):4164–4169, 2004.

Carpenter, Gail A., Martens, Siegfried, and Ogas, Ogi J. Self-organizing information fusion and hierarchical knowledge discovery: a new framework using ARTMAP neural networks. Neural Netw., 18(3):287–295, 2005. doi: 10.1016/j.neunet.2004.12.003.

Chaudhuri, Kamalika, Kakade, Sham M., Livescu, Karen, and Sridharan, Karthik. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 129–136, New York, NY, USA, 2009. ACM. doi: 10.1145/1553374.1553391.

Chen, Zheng and Zhang, Weixiong. Integrative analysis using module-guided random forests reveals correlated genetic factors related to mouse weight. PLoS Computational Biology, 9(3):e1002956, 2013. doi: 10.1371/journal.pcbi.1002956.

Cuzzocrea, Alfredo, Song, Il-Yeol, and Davis, Karen C. Analytics over large-scale multidimensional data: the big data revolution! In Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP, DOLAP '11, pp. 101–104, New York, NY, USA, 2011. ACM. doi: 10.1145/2064676.2064695.

Hutchins, Lucie N., Murphy, Sean M., Singh, Priyam, and Graber, Joel H. Position-dependent motif characterization using non-negative matrix factorization. Bioinformatics, 24(23):2684–2690, 2008. doi: 10.1093/bioinformatics/btn526.

Huttenhower, Curtis, Mutungu, K. Tsheko, Indik, Natasha, Yang, Woongcheol, Schroeder, Mark, Forman, Joshua J., Troyanskaya, Olga G., and Coller, Hilary A. Detailing regulatory networks through large scale data integration. Bioinformatics, 25(24):3267–3274, 2009. doi: 10.1093/bioinformatics/btp588.

Klami, Arto and Kaski, Samuel. Probabilistic approach to detecting dependencies between data sets. Neurocomput., 72(1-3):39–46, 2008. doi: 10.1016/j.neucom.2007.12.044.

Kloft, Marius, Brefeld, Ulf, Sonnenburg, Sören, Laskov, Pavel, Müller, Klaus-Robert, and Zien, Alexander. Efficient and accurate Lp-norm multiple kernel learning. Advances in Neural Information Processing Systems, 21:997–1005, 2009.

Kondor, Risi Imre and Lafferty, John D. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML '02, pp. 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.

Lanckriet, Gert R. G., Cristianini, Nello, Bartlett, Peter, Ghaoui, Laurent El, and Jordan, Michael I. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27–72, 2004a.

Debnath, Rameswar and Takahashi, Haruhisa. Kernel selection for the support vector machine. IEICE Transactions, 87-D(12):2903–2904, 2004.

Gevaert, Olivier, De Smet, Frank, Timmerman, Dirk, Moreau, Yves, and De Moor, Bart. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics, 22(14):e184–e190, 2006. doi: 10.1093/bioinformatics/btl230.

Greene, Derek and Cunningham, Pádraig. A matrix factorization approach for integrating multiple data views. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I, ECML PKDD '09, pp. 423–438, Berlin, Heidelberg, 2009. Springer-Verlag. doi: 10.1007/978-3-642-04180-8_45.

Lanckriet, Gert R. G., De Bie, Tijl, Cristianini, Nello, Jordan, Michael I., and Noble, William Stafford. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626–2635, 2004b. doi: 10.1093/bioinformatics/bth294.

Lange, Tilman and Buhmann, Joachim M. Fusion of similarity data in clustering. In NIPS, 2005.

Larranaga, P., Calvo, B., Santana, R., Bielza, C., Galdiano, J., Inza, I., Lozano, J. A., Armananzas, R., Santafe, G., Perez, A., and Robles, V. Machine learning in bioinformatics. Brief. Bioinformatics, 7(1):86–112, 2006.

Lee, Daniel D. and Seung, H. Sebastian. Algorithms for non-negative matrix factorization. In Leen, Todd K., Dietterich, Thomas G., and Tresp, Volker (eds.), NIPS, pp. 556–562. MIT Press, 2000.


Lopes, Hedibert Freitas, Gamerman, Dani, and Salazar, Esther. Generalized spatial dynamic factor models. Computational Statistics & Data Analysis, 55(3):1319–1330, 2011.

Luttinen, Jaakko and Ilin, Alexander. Variational Gaussian-process factor analysis for modeling spatio-temporal data. In Bengio, Yoshua, Schuurmans, Dale, Lafferty, John D., Williams, Christopher K. I., and Culotta, Aron (eds.), NIPS, pp. 1177–1185. Curran Associates, Inc., 2009.

Maragos, Petros, Gros, Patrick, Katsamanis, Athanassios, and Papandreou, George. Cross-modal integration for performance improving in multimedia: A review. In Maragos, Petros, Potamianos, Alexandros, and Gros, Patrick (eds.), Multimodal Processing and Interaction, volume 33 of Multimedia Systems and Applications, pp. 1–46. Springer US, 2008. doi: 10.1007/978-0-387-76316-3_1.

Moreau, Yves and Tranchevent, Léon-Charles. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nature Reviews Genetics, 13(8):523–536, 2012. doi: 10.1038/nrg3253.

Mostafavi, Sara and Morris, Quaid. Combining many interaction networks to predict gene function and analyze gene lists. Proteomics, 12(10):1687–1696, 2012.

Pandey, Gaurav, Zhang, Bin, Chang, Aaron N., Myers, Chad L., Zhu, Jun, Kumar, Vipin, and Schadt, Eric E. An integrative multi-network and multi-classifier approach to predict genetic interactions. PLoS Computational Biology, 6(9), 2010. doi: 10.1371/journal.pcbi.1000928.

Parikh, Anup, Miranda, Edward Roshan, Katoh-Kurasawa, Mariko, Fuller, Danny, Rot, Gregor, Zagar, Lan, Curk, Tomaz, Sucgang, Richard, Chen, Rui, Zupan, Blaz, Loomis, William F., Kuspa, Adam, and Shaulsky, Gad. Conserved developmental transcriptomes in evolutionarily divergent species. Genome Biology, 11(3):R35, 2010. doi: 10.1186/gb-2010-11-3-r35.

Parikh, Devi and Polikar, Robi. An ensemble-based incremental learning approach to data fusion. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 37(2):437–450, 2007.

Pavlidis, Paul, Cai, Jinsong, Weston, Jason, and Noble, William Stafford. Learning gene functional classifications from multiple data types. Journal of Computational Biology, 9:401–411, 2002.

Rappaport, S. M. and Smith, M. T. Environment and disease risks. Science, 330(6003):460–461, 2010.

Savage, Richard S., Ghahramani, Zoubin, Griffin, Jim E., de la Cruz, Bernard J., and Wild, David L. Discovering transcriptional modules by Bayesian data integration. Bioinformatics, 26(12):i158–i167, 2010. doi: 10.1093/bioinformatics/btq210.

Schölkopf, B., Tsuda, K., and Vert, J.-P. Kernel Methods in Computational Biology. Computational Molecular Biology. MIT Press, Cambridge, MA, USA, 2004.

Singh, Ajit P. and Gordon, Geoffrey J. A unified view of matrix factorization models. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II, ECML PKDD '08, pp. 358–373, Berlin, Heidelberg, 2008. Springer-Verlag. doi: 10.1007/978-3-540-87481-2_24.

Tang, Wei, Lu, Zhengdong, and Dhillon, Inderjit S. Clustering with multiple graphs. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM '09, pp. 1016–1021, Washington, DC, USA, 2009. IEEE Computer Society. doi: 10.1109/ICDM.2009.125.

Van Benthem, Mark H. and Keenan, Michael R. Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems. Journal of Chemometrics, 18(10):441–450, 2004. doi: 10.1002/cem.889.

van Vliet, Martin H., Horlings, Hugo M., van de Vijver, Marc J., Reinders, Marcel J. T., and Wessels, Lodewyk F. A. Integration of clinical and gene expression data has a synergetic effect on predicting breast cancer outcome. PLoS One, 7(7):e40358, 2012. doi: 10.1371/journal.pone.0040358.

Vaske, Charles J., Benz, Stephen C., Sanborn, J. Zachary, Earl, Dent, Szeto, Christopher, Zhu, Jingchun, Haussler, David, and Stuart, Joshua M. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics [ISMB], 26(12):237–245, 2010.

Wang, Fei, Li, Tao, and Zhang, Changshui. Semi-supervised clustering via matrix factorization. In SDM, pp. 1–12. SIAM, 2008.


Wang, Hua, Huang, Heng, Ding, Chris H. Q., and Nie, Feiping. Predicting protein-protein interactions from multimodal biological data sources via nonnegative matrix tri-factorization. In Chor, Benny (ed.), RECOMB, volume 7262 of Lecture Notes in Computer Science, pp. 314–325. Springer, 2012.

Xing, Chuanhua and Dunson, David B. Bayesian inference for genomic data integration reduces misclassification rate in predicting protein-protein interactions. PLoS Computational Biology, 7(7):e1002110, 2011. doi: 10.1371/journal.pcbi.1002110.

Ye, Jieping, Ji, Shuiwang, and Chen, Jianhui. Multi-class discriminant kernel learning via convex programming. J. Mach. Learn. Res., 9:719–758, 2008.

Yu, Shi, Falck, Tillmann, Daemen, Anneleen, Tranchevent, Leon-Charles, Suykens, Johan A. K., De Moor, Bart, and Moreau, Yves. L2-norm multiple kernel learning and its application to biomedical data fusion. BMC Bioinformatics, 11:309, 2010. doi: 10.1186/1471-2105-11-309.

Yu, Shi, Tranchevent, Leon, Liu, Xinhai, Glanzel, Wolfgang, Suykens, Johan A. K., De Moor, Bart, and Moreau, Yves. Optimized data fusion for kernel k-means clustering. IEEE Trans. Pattern Anal. Mach. Intell., 34(5):1031–1039, 2012. doi: 10.1109/TPAMI.2011.255.

Zhang, Shi-Hua, Li, Qingjiao, Liu, Juan, and Zhou, Xianghong Jasmine. A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics, 27(13):401–409, 2011.

Zhang, Shihua, Liu, Chun-Chi, Li, Wenyuan, Shen, Hui, Laird, Peter W., and Zhou, Xianghong Jasmine. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Research, 40(19):9379–9391, 2012. doi: 10.1093/nar/gks725.

Zhang, Yongmian and Ji, Qiang. Active and dynamic information fusion for multisensor systems with dynamic Bayesian networks. Trans. Sys. Man Cyber. Part B, 36(2):467–472, 2006. doi: 10.1109/TSMCB.2005.859081.

Zhou, Dengyong and Burges, Christopher J. C. Spectral clustering and transductive learning with multiple views. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pp. 1159–1166, New York, NY, USA, 2007. ACM. doi: 10.1145/1273496.1273642.