IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 37, NO. 1, JANUARY 2015

Data Fusion by Matrix Factorization

Marinka Zitnik and Blaz Zupan

Abstract—For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system's constraints. In this paper we describe a data fusion approach with penalized matrix tri-factorization (DFMF) that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF for the task of gene function prediction with eleven different data sources and for the prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative approaches and achieves higher accuracy than can be obtained from any single data source alone.

Index Terms—Data fusion, intermediate data integration, matrix factorization

- M. Zitnik is with the Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia. E-mail: [email protected].
- B. Zupan is with the Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia, and the Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030. E-mail: [email protected].

Manuscript received 23 Oct. 2013; revised 13 Mar. 2014; accepted 23 July 2014. Date of publication 29 July 2014; date of current version 5 Dec. 2014. Recommended for acceptance by G. Lanckriet. Digital Object Identifier no. 10.1109/TPAMI.2014.2343973

1 INTRODUCTION

Data abound in all areas of human endeavour. We may gather various data sets that are directly related to the problem, or data sets that are loosely related to our study but could be useful when combined with other data sets. Consider, for example, the exposome [1], which encompasses the totality of human endeavour in the study of disease. Let us say that we examine susceptibility to a particular disease and have access to the patients' clinical data together with data on their demographics, habits, living environments, friends, relatives, movie-watching habits, and a movie genre ontology. Mining such a diverse collection of data sets may reveal interesting patterns that would remain hidden if we analyzed only the directly related clinical data. What if the disease was less common in living areas with more open spaces, or in environments where people need to walk instead of drive to the nearest grocery? Is the disease less common among those who watch comedies and ignore politics and news?

Methods for data fusion can collectively treat data sets and combine diverse data sources even when they differ in their conceptual, contextual and typographical representation [2], [3]. Individual data sets may be incomplete, yet because of their diversity and complementarity, fusion can improve the robustness and predictive performance of the resulting models [4], [5].

According to Pavlidis et al. (2002) [6], data fusion approaches can be classified into three main categories depending on the modeling stage at which fusion takes place. Early (or full) integration transforms all data sources into a single feature-based table and treats this as a single data set that can be explored by any of the well-established feature-based machine learning algorithms. The inferred models can in principle include any type of relationship between the features from within and between the data sources. Early integration relies on procedures for feature construction. For our exposome example, patient-specific data would need to include both clinical data and information from the movie genre ontologies. The former may be trivial, as this data is already related to each specific patient, while the latter requires more complex feature engineering. Early integration also neglects the modular structure of the data.

In late (decision) integration, each data source gives rise to a separate model. Predictions of these models are fused by model weighting. Again, prior to model inference, it is necessary to transform each data set to encode relations to the target concept. For our example, information on the movie preferences of friends and relatives would need to be mapped to disease associations. Such transformations may not be trivial and would need to be crafted independently for every data source.

The youngest branch of data fusion algorithms is intermediate (partial) integration. Algorithms in this category explicitly address the multiplicity of data and fuse them through inference of a single joint model. Intermediate integration does not merge the input data, nor does it develop separate models for each data source. It instead retains the structure of the data sources by incorporating it within the structure of the predictive model. This particular approach is often preferred because of its superior predictive accuracy [5], [6], [7], [8], [9], but for a given model type, it requires the development of a new inference algorithm.

We here report on the development of a new method for intermediate data fusion based on constrained matrix factorization. Our aim was to construct an algorithm that requires no or only minimal transformation of input data and can fuse feature-based representations, ontologies, associations and networks. We focus on the challenge of

dealing with collections of heterogeneous data sources; and while we show that our method can be used on sizable problems from current research, scaling is not the focus of the present paper. We first present our data fusion algorithm, henceforth DFMF (Section 2), and then place it within the related work on relational learning approaches (Section 3). We also refer to related data integration approaches, specifically to methods of kernel-based data fusion (Section 3). We then examine the utility of DFMF and experimentally compare it with intermediate integration by multiple kernel learning (MKL), early integration with random forests, and tri-SPMF [10], a previously proposed matrix tri-factorization approach (Section 4).

0162-8828 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

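As a toy sketch of the distinction drawn above, early integration reduces to concatenating per-source feature tables into a single table before any model is inferred. All table names and sizes below are hypothetical, chosen only to mirror the exposome example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients = 5

# Hypothetical per-source feature tables, one row per patient.
clinical = rng.random((n_patients, 4))      # e.g., lab measurements
demographics = rng.random((n_patients, 2))  # e.g., age, household size
# Ontology-derived features require feature engineering, e.g., one column
# per movie genre holding the fraction of watched movies in that genre.
genres = rng.random((n_patients, 3))

# Early (full) integration: concatenate everything into one table and
# hand it to any feature-based learner.
fused = np.hstack([clinical, demographics, genres])
```

Note that the concatenation flattens away which source each column came from, which is exactly the loss of modular structure criticized above.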
2 DATA FUSION ALGORITHM

The DFMF considers r object types E_1, ..., E_r and a collection of data sources, each relating a pair of object types (E_i, E_j). In our introductory example of the exposome, object types could be a patient, a disease or a living environment, among others. If there are n_i objects of type E_i (o_p^i is the pth object of type E_i) and n_j objects of type E_j, we represent the observations from the data source that relates (E_i, E_j) for i ≠ j in a sparse matrix R_ij ∈ R^{n_i × n_j}. An example of such a matrix would relate patients and drugs by reporting on patients' current drug prescriptions. Notice that matrices R_ij and R_ji are in general asymmetric. A data source that provides relations between objects of the same type E_i is represented by a constraint matrix Q_i ∈ R^{n_i × n_i}. Examples of such constraints are social networks and drug interactions.

In real-world scenarios we might not have access to relations between all pairs of object types. Our data fusion algorithm still integrates all available data if the underlying graph of relations between object types is connected. In that case, low-dimensional representations of objects of a certain type borrow information from related objects of a different type. Fig. 1 shows an example of an underlying graph of relations and a block configuration of the fusion system with four object types.

Fig. 1. Conceptual fusion configuration for four object types, E_1, E_2, E_3 and E_4, equivalently represented by the graph of relations between object types (a) and the block-based matrix structure (b). Each data source relates a pair of object types, denoted by arcs in the graph (a) or matrices with shades of gray in the block matrix (b). For example, data matrix R_23 relates object types E_2 and E_3. Some relations are entirely missing. For instance, there is no data source relating objects from E_3 and E_1, as there is no arc linking nodes E_3 and E_1 in (a); equivalently, matrix R_31 is missing in (b). Relations can be asymmetric, such that R_23 ≠ R_32^T. Constraints, denoted by loops in (a) or matrices with blue entries in (b), relate objects of the same type. In our example configuration, constraints are provided for object types E_2 (one constraint matrix) and E_4 (three constraint matrices).

To retain the block structure of our fusion system, and hence model distinct relations between object types, we propose the simultaneous factorization of all relation matrices R_ij constrained by Q_i. The resulting system contains factors that are specific to each data source and factors that are specific to each object type. Through factor sharing we fuse the data but also identify source-specific patterns.

We have developed a variant of three-factor penalized matrix factorization that simultaneously decomposes all available relation matrices R_ij into G_i ∈ R_+^{n_i × k_i}, G_j ∈ R_+^{n_j × k_j} and S_ij ∈ R^{k_i × k_j}, and regularizes their approximation through constraint matrices Q_i and Q_j such that R_ij ≈ G_i S_ij G_j^T. The approximation can be rewritten such that entry R_ij(p, q) is approximated by an inner product of the pth row of matrix G_i and a linear combination of the columns of matrix S_ij, weighted by the qth column of G_j. The matrix S_ij, which has relatively few vectors compared to R_ij (k_i ≪ n_i, k_j ≪ n_j), is used to represent many data vectors, and a good approximation can only be achieved in the presence of latent structure in the original data.

The proposed fusion approach is different from treating an entire system (e.g., from Fig. 1) as a large single matrix. Factorization of such a matrix would yield factors that are not object type-specific and would thus disregard the structure of the system. We also show (Section 5.5) that such an approach is inferior in terms of predictive performance.

In comparison with existing multi-type relational data factorization approaches (see Section 3), the following characterizes our DFMF data fusion method:
i) DFMF can model multiple relations between multiple object types.
ii) Relations between some object types can be completely missing (see Fig. 1).
iii) Every object type can be associated with multiple constraint matrices.
iv) The algorithm makes no assumptions about structural properties of relations (e.g., symmetry of relations).

To be applicable to general real-world fusion problems, a data fusion algorithm would need to jointly address all of these characteristics. Besides the DFMF proposed in this manuscript, we are not aware of any other approach that

would do so. Most real-world data integration problems would usually consider a larger number of object types, but with a growing number of object types it becomes likely that data relating some pairs of object types are either not available or not meaningful. On the other side, there may be various data sources available on interactions between objects of the same type that also require appropriate treatment. For an example of this type of data, consider the abundance of databases on drug or disease interactions.

In the case study presented in this paper we apply data fusion to infer relations between two target object types, E_i and E_j (Sections 2.6 and 2.7). This relation, encoded in a target matrix R_ij, will be observed in the context of all other data sources (Section 2.1). We assume that our target R_ij is a [0, 1]-matrix that is only partially observed. Its entries indicate a degree of relation, 0 denoting no relation and 1 denoting the strongest relation. We aim to predict unobserved entries in R_ij by reconstructing them through matrix factorization. Such treatment in general applies to multi-class or multi-label classification tasks, which are conveniently addressed by multiple kernel fusion [11], with which we compare our performance in this paper.

In the following, we present the factorization model and the objective function, derive the updating rules for optimization, and describe the procedure for prediction of relations from matrix factors. In the optimization part, we closely follow [10] in notation, mathematical derivation and proof technique.

2.1 Factorization Model for Multi-Relational and Multi-Object Type Data

An input to DFMF is a relation block matrix R that conceptually represents all relation matrices:

R = \begin{bmatrix} * & R_{12} & \cdots & R_{1r} \\ R_{21} & * & \cdots & R_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ R_{r1} & R_{r2} & \cdots & * \end{bmatrix}.    (1)

Here, an asterisk ("*") denotes a relation between objects of the same type, which DFMF does not model through R. Notice that our method does not require the presence of all relation matrices in Eq. (1). Depending on a particular data setup, any subset of relation matrices might be missing and thus unmodeled. A block in the ith row and jth column (R_ij) of matrix R represents the relationship between object types E_i and E_j. The pth object of type E_i (i.e., o_p^i) and the qth object of type E_j (i.e., o_q^j) are related by R_ij(p, q). An important aspect of Eq. (1) for data fusion, and what distinguishes DFMF from other conceptually related matrix factorization models such as S-NMTF [12] or even tri-SPMF [10], is that it is designed for multi-object type and multi-relational data where the relations can be asymmetric, R_ji ≠ R_ij^T, and some can be completely missing (unknown R_ij) (Section 2.3).

We additionally consider constraints relating objects of the same type. Several data sources of this kind may be available for each object type. For instance, personal relations may be observed from a social network or a family tree. Assume there are t_i ≥ 0 data sources for object type E_i represented by a set of constraint matrices Q_i^(t) for t ∈ {1, 2, ..., t_i}. Constraints are collectively encoded in a set of constraint block diagonal matrices Q^(t) for t ∈ {1, 2, ..., max_i t_i}:

Q^{(t)} = \mathrm{Diag}\big( Q_1^{(t)}, Q_2^{(t)}, \ldots, Q_r^{(t)} \big).    (2)

The ith block along the main diagonal of Q^(t) is zero if t > t_i. Entries in constraint matrices are positive for objects that are not similar and negative for objects that are similar. The former are known as cannot-link constraints, because they impose penalties on the current approximation of the matrix factors; the latter are must-link constraints, rewards that reduce the value of the cost function during optimization. A must-link constraint expresses the notion that a pair of objects of the same type should be close in their latent component space. An example of must-link constraints are drug-drug interactions; an example of cannot-link constraints is a matrix of adversaries. Typically, data sources with must-link constraints are more abundant.

The block matrix R is tri-factorized into block matrix factors G and S:

G = \mathrm{Diag}\big( G_1^{n_1 \times k_1}, G_2^{n_2 \times k_2}, \ldots, G_r^{n_r \times k_r} \big),

S = \begin{bmatrix} * & S_{12}^{k_1 \times k_2} & \cdots & S_{1r}^{k_1 \times k_r} \\ S_{21}^{k_2 \times k_1} & * & \cdots & S_{2r}^{k_2 \times k_r} \\ \vdots & \vdots & \ddots & \vdots \\ S_{r1}^{k_r \times k_1} & S_{r2}^{k_r \times k_2} & \cdots & * \end{bmatrix}.    (3)

Matrix S in Eq. (3) has the same block structure as R in Eq. (1). It is in general asymmetric (i.e., S_ji ≠ S_ij^T), and if a relation matrix is missing in R then its corresponding matrix factor in S is missing as well. These two properties of S stem from our decision to model relation matrices without assuming their structural properties or their availability for every possible combination of object types.

A factorization rank k_i is assigned to E_i during inference of the factorized system. Factor S_ij defines the latent relation between object types E_i and E_j, while factor G_i is specific to objects of type E_i and is used in the reconstruction of every relation with this object type. In this way, each relation matrix R_ij obtains its own factorization G_i S_ij G_j^T with factor G_i (G_j) that is shared across all relations which involve object type E_i (E_j). This can also be observed from the block structure of the reconstructed system G S G^T:

\begin{bmatrix} * & G_1 S_{12} G_2^T & \cdots & G_1 S_{1r} G_r^T \\ G_2 S_{21} G_1^T & * & \cdots & G_2 S_{2r} G_r^T \\ \vdots & \vdots & \ddots & \vdots \\ G_r S_{r1} G_1^T & G_r S_{r2} G_2^T & \cdots & * \end{bmatrix}.    (4)

Here, the pth row in factor G_i holds the latent component representation of object o_p^i. By holding G_j and S_ij fixed, it is clear that the latent component representation of o_p^i depends on G_j as well as on the existence of relation R_ij. Consequently, all direct and indirect relations have a determining influence

on the calculation of the latent representation of o_p^i. Just as the objects of type E_i are represented by G_i, each relation is represented by factor S_ij, which models how the latent components interact in the respective relation. The asymmetry of S_ij takes into account whether a latent component occurs as a subject or an object of the corresponding relation R_ij.

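As an illustration of the factorization model above, the following NumPy sketch builds shared type-specific factors G_i and relation-specific factors S_ij for a small system and reconstructs each observed relation block. All sizes are hypothetical and the factors are random (untrained); the point is only the block structure of Eqs. (3) and (4):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three object types with n_i objects and rank k_i each.
n = {1: 30, 2: 20, 3: 25}
k = {1: 4, 2: 3, 3: 5}

# Observed relations: R_12 and R_23 are available; R_13 is missing
# and simply stays out of the model.
R = {(1, 2): rng.random((n[1], n[2])),
     (2, 3): rng.random((n[2], n[3]))}

# One nonnegative factor G_i per object type, shared by every relation
# that involves that type; one factor S_ij per observed relation.
G = {i: rng.random((n[i], k[i])) for i in n}
S = {ij: rng.random((k[ij[0]], k[ij[1]])) for ij in R}

# Block-wise reconstruction R_ij ~ G_i S_ij G_j^T, cf. Eq. (4).
R_hat = {(i, j): G[i] @ S[(i, j)] @ G[j].T for (i, j) in R}

# Entry R_ij(p, q) is the inner product of the p-th row of G_i with a
# combination of the columns of S_ij weighted by the q-th row of G_j.
p, q = 0, 0
entry = G[1][p] @ S[(1, 2)] @ G[2][q]
assert np.isclose(entry, R_hat[(1, 2)][p, q])

# Approximation part of the objective function, cf. Eq. (5).
loss = sum(np.linalg.norm(R[ij] - R_hat[ij], "fro") ** 2 for ij in R)
```

Note how G[2] appears in the reconstruction of both R_12 and R_23; this sharing is what lets information flow between relations during optimization.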
2.2 Objective Function

The objective function minimized by DFMF aims at a good approximation of the input data and adherence to the must-link and cannot-link constraints:

\min_{G \ge 0} J(G, S) = \sum_{R_{ij} \in \mathcal{R}} \big\| R_{ij} - G_i S_{ij} G_j^T \big\|^2 + \sum_{t=1}^{\max_i t_i} \operatorname{tr}\big( G^T Q^{(t)} G \big).    (5)

Here, ||·|| and tr(·) denote the Frobenius norm and trace, respectively, and \mathcal{R} is the set of all relations included in our model. Our objective function explicitly allows that relations between some object types are entirely missing.

Notice that in Eq. (5) we do not approximate the input data by ||R - G S G^T||^2, as was proposed in the related approaches of S-NMTF [12] and tri-SPMF [10]. To model a data system such as that from Fig. 1, one could be tempted to replace the missing relation matrices with zero matrices. This would enable the optimization to further reduce the value of the objective function, but it would also introduce relations into the factorized system that were intentionally not present in the input data. Their inclusion in the model would distort the inferred relations between other object types (see Section 5.1).

2.3 Computing the Factorization

The DFMF algorithm for solving the minimization problem specified in Eq. (5) is shown in Algorithm 1. The algorithm first initializes the matrix factors (Section 2.8) and then iteratively refines them by alternating between fixing G and updating S, and fixing S and updating G, until convergence. Successive updates of G_i and S_ij converge to a local minimum of the optimization problem.

We derive multiplicative updating rules for the regularized decomposition of relation matrices by fixing one matrix factor (e.g., G) and considering the roots of the partial derivative of the Lagrangian function with respect to the other matrix factor (e.g., S, and vice versa). The Lagrangian is constructed from the objective function (Eq. (5)):

J(G, S) = \sum_{R_{ij} \in \mathcal{R}} \operatorname{tr}\big( R_{ij} R_{ij}^T - 2 G_j^T R_{ij}^T G_i S_{ij} + G_i^T G_i S_{ij} G_j^T G_j S_{ij}^T \big) + \sum_{t=1}^{\max_i t_i} \sum_{i=1}^{r} \operatorname{tr}\big( G_i^T Q_i^{(t)} G_i \big).    (6)

Algorithm 1. Factorization algorithm of the proposed data fusion approach (DFMF).

Regarding the correctness and convergence of the algorithm in Algorithm 1 we have the following two theorems.

Theorem 1 (Correctness of DFMF algorithm). If the update rules for matrix factors G and S from Algorithm 1 converge, then the final solution satisfies the KKT conditions of optimality.

Proof. We introduce the Lagrangian multipliers \lambda_1, \lambda_2, \ldots, \lambda_r and construct the Lagrange function:

L = J(G, S) - \sum_{i=1}^{r} \operatorname{tr}\big( \lambda_i G_i^T \big).    (7)

Then, for i, j such that R_ij ∈ \mathcal{R}:

\frac{\partial L}{\partial S_{ij}} = -2 G_i^T R_{ij} G_j + 2 G_i^T G_i S_{ij} G_j^T G_j,

and for i = 1, 2, ..., r:

\frac{\partial L}{\partial G_i} = \sum_{j: R_{ij} \in \mathcal{R}} \big( -2 R_{ij} G_j S_{ij}^T + 2 G_i S_{ij} G_j^T G_j S_{ij}^T \big) + \sum_{j: R_{ji} \in \mathcal{R}} \big( -2 R_{ji}^T G_j S_{ji} + 2 G_i S_{ji}^T G_j^T G_j S_{ji} \big) + \sum_{t=1}^{\max_i t_i} 2 Q_i^{(t)} G_i - \lambda_i.    (8)

Fixing G_1, G_2, ..., G_r and letting \partial L / \partial S_{ij} = 0 for all i, j = 1, 2, ..., r, we obtain:

S = (G^T G)^{-1} G^T R G (G^T G)^{-1}.

We then fix S and let \partial L / \partial G_i = 0 for i = 1, 2, ..., r. We get an expression for the KKT multiplier \lambda_i from Eq. (8). Then the KKT complementary condition for the nonnegativity of G_i is

0 = \lambda_i \odot G_i = \Big[ \sum_{j: R_{ij} \in \mathcal{R}} \big( -2 R_{ij} G_j S_{ij}^T + 2 G_i S_{ij} G_j^T G_j S_{ij}^T \big) + \sum_{j: R_{ji} \in \mathcal{R}} \big( -2 R_{ji}^T G_j S_{ji} + 2 G_i S_{ji}^T G_j^T G_j S_{ji} \big) + \sum_{t=1}^{\max_i t_i} 2 Q_i^{(t)} G_i \Big] \odot G_i,    (9)

where \odot denotes the element-wise product. Eq. (9) is a fixed point equation, and the solution must satisfy it at convergence. We split the matrices appearing in it into differences of their nonnegative parts:

Q_i^{(t)} = (Q_i^{(t)})^+ - (Q_i^{(t)})^-,
R_{ij} G_j S_{ij}^T = (R_{ij} G_j S_{ij}^T)^+ - (R_{ij} G_j S_{ij}^T)^-,
S_{ij} G_j^T G_j S_{ij}^T = (S_{ij} G_j^T G_j S_{ij}^T)^+ - (S_{ij} G_j^T G_j S_{ij}^T)^-,
R_{ji}^T G_j S_{ji} = (R_{ji}^T G_j S_{ji})^+ - (R_{ji}^T G_j S_{ji})^-,
S_{ji}^T G_j^T G_j S_{ji} = (S_{ji}^T G_j^T G_j S_{ji})^+ - (S_{ji}^T G_j^T G_j S_{ji})^-,

where all matrices on the right-hand sides are nonnegative. Then, given an initial guess of G_i, the successive updates of G_i using Eqs. (10), (11), and (12) converge to a local minimum of the problem in Eq. (5). It can easily be seen that, using such a rule, G_i satisfies \lambda_i \odot G_i = 0 at convergence, which is equivalent to Eq. (9) due to the nonnegativity of G_i. □

Theorem 2 (Convergence of DFMF algorithm). The objective function J(G, S) given by Eq. (5) is nonincreasing under the updating rules for matrix factors G and S in Algorithm 1.

Please see the Appendix for a detailed proof of the above theorem. Our proof essentially follows the idea of auxiliary functions, often used in the convergence proofs of approximate matrix factorization algorithms [13].

2.4 Stopping Criterion

In this paper we apply data fusion to infer relations between two target object types, E_i and E_j. We hence define a stopping criterion that observes convergence in the approximation of only the target matrix R_ij. Our convergence criterion is the absolute difference of the target matrix reconstruction error, ||R_ij - G_i S_ij G_j^T||^2, computed in two consecutive iterations of the DFMF algorithm. In our experiments the stopping threshold was set to 10^-5. To reduce the computational load, the convergence criterion was assessed only every fifth iteration.

2.5 Parameter Estimation

The parameters of the DFMF algorithm are the factorization ranks k_1, k_2, ..., k_r. These are chosen from a predefined interval of possible rank values such that their choice maximizes the estimated quality of the model. To reduce the number of required factorization runs we mimic the bisection method by first testing rank values at the midpoint and borders of the specified ranges and then, for each rank value, selecting the subinterval for which the resulting model was of higher quality. We evaluate the models through the explained variance, the residual sum of squares (RSS) and a measure based on the cophenetic correlation coefficient ρ [14]. We compute these measures for the target relation matrix. The RSS is computed over observed associations (o_p^i, o_q^j) in R_ij as

RSS(R_{ij}) = \sum_{(p,q)} \big[ \big( R_{ij} - G_i S_{ij} G_j^T \big)(p, q) \big]^2.

Similarly, the explained variance is R^2(R_{ij}) = 1 - RSS(R_{ij}) / \sum_{(p,q)} [R_{ij}(p, q)]^2. We assess the three quality scores through internal cross-validation and observe how R^2(R_ij), RSS(R_ij) and ρ(R_ij) vary with changes of the factorization ranks. We select the ranks k_1, k_2, ..., k_r where the cophenetic coefficient begins to fall, the explained variance is high and the RSS curve shows an inflection point [15].

2.6 Prediction from Matrix Factors

The approximate relation matrix R̂_ij for the target pair of object types E_i and E_j is reconstructed as R̂_ij = G_i S_ij G_j^T. When the model is requested to propose relations for a new object o_{n_i+1}^i of type E_i that was not included in the training data, we need to estimate its factorized representation and use the resulting factors for prediction. We formulate a nonnegative linear least-squares problem and solve it with an efficient interior point Newton-like method [16]:

\min_{x_l \ge 0} \big\| \big( G_l S_{li} + G_l S_{il}^T \big) x_l - o_{n_i+1}^{i,l} \big\|_2^2,

where o_{n_i+1}^{i,l} ∈ R^{n_l} is the original description of object o_{n_i+1}^i relative to objects of type E_l (if available) and x_l ∈ R^{k_i} is its factorized representation (for l = 1, 2, ..., r and l ≠ i). A solution vector obtained by averaging the x_l is added to G_i, and a new R̂_ij ∈ R^{(n_i+1) × n_j} is computed.

We would like to identify object pairs (o_p^i, o_q^j) for which the predicted degree of relation R̂_ij(p, q) is unusually high. We are interested in candidate pairs (o_p^i, o_q^j) for which the estimated association score R̂_ij(p, q) is greater than the mean estimated score of all known relations of o_p^i:

\hat{R}_{ij}(p, q) > \frac{1}{|A(o_p^i, E_j)|} \sum_{o_m^j \in A(o_p^i, E_j)} \hat{R}_{ij}(p, m),    (13)

where A(o_p^i, E_j) is the set of all objects of E_j related to o_p^i. Notice that this rule is row-centric; that is, given an object of type E_i, it searches for objects of the other type (E_j) that it could be related to. We can modify the rule to become column-centric, or even combine the two rules.

For example, let us consider that we are studying disease predispositions for a set of patients. Let the patients be objects of type E_i and diseases objects of type E_j. A patient-centric rule would consider a patient and his medical history and, through Eq. (13), propose a set of new disease associations. A disease-centric rule would instead consider all patients already associated with a specific disease and identify other patients with a sufficiently high association score.

We can also combine the row-centric and column-centric approaches. For example, we can first apply a row-centric approach to identify candidates of type E_i and then estimate the strength of association to a specific object o_q^j by reporting an inverse percentile of the association score in the distribution of scores for all true associations of o_q^j, that is, by considering the scores in the qth column of R̂_ij. In our gene function prediction study, we use the row-centric approach for candidate identification and the column-centric approach for association scoring, and in the experiment from cheminformatics we apply the row-centric approach to both tasks.

2.7 An Ensemble Approach to Prediction

Different initializations of G_i may in practice give rise to different factorizations of the fusion system. To leverage this effect we construct an ensemble of factorization models. The resulting matrix factors in each model may also differ due to small random perturbations of the selected factorization ranks. We use each factorization system for inference of associations (Section 2.6) and then select the candidate pairs through a majority vote; that is, the rule from Eq. (13) must apply in more than one half of the factorized systems of the ensemble. Ensembles improved the predictive accuracy and stability of the factorized system and the robustness of the results. In our experiments the ensembles combined 15 factorization models.

2.8 Matrix Factor Initialization

The inference of the factorized system in Section 2.1 is sensitive to the initialization of factor G. Proper initialization sidesteps the issue of local convergence and reduces the number of iterations needed to obtain matrix factors of equal quality. We initialize G by separately initializing each G_i, using algorithms for single-matrix factorization. Factors S are computed from G (Algorithm 1) and do not require initialization.

Wang et al. [10] and several other authors [13] use simple random initialization. Other, more informed initialization algorithms include random C [17], random Acol [17], non-negative double SVD and its variants [18], and k-means clustering or relaxed SVD-centroid initialization [17]. We show that the latter approaches are indeed better than random initialization (Section 5.4). We use random Acol in our case study. Random Acol computes each column of G_i as an element-wise average of a random subset of columns in R_ij.

3 RELATED WORK

Approximate matrix factorization estimates a data matrix R as a product of low-rank matrix factors that are found by solving an optimization problem. In two-factor decomposition, R ∈ R^{n×m} is decomposed into a product W H, where W ∈ R^{n×k}, H ∈ R^{k×m} and k ≪ min(n, m). A large class of matrix factorization algorithms minimizes the discrepancy between the observed matrix and its low-rank approximation, such that R ≈ W H. For instance, SVD, non-negative matrix factorization and exponential family PCA all minimize a Bregman divergence [19]. Although often used for dimensionality reduction, clustering or low-rank approximation, there have been only a few applications of matrix factorization in data fusion, and such approaches cannot be used to model relations between more than two object types, which is a major distinction with the algorithm proposed in this paper.

Zhang et al. [22] proposed a joint matrix factorization that decomposes a number of data matrices R_i into a common basis matrix W and different coefficient matrices H_i, such that R_i ≈ W H_i, by minimizing \sum_i ||R_i - W H_i||^2. This is an intermediate integration approach with different data sources, but it can describe only relations whose objects (i.e., rows in R_i) are fixed across relation matrices. Similar approaches with various regularization types were also proposed, such as network- or relation-regularized constraints [23], [24] and hierarchical priors [25], [26]. Our work generalizes these approaches by simultaneously dealing with objects of different types, where we can vary object types along both dimensions of the relation matrices R_ij and can constrain objects of every type.

There is an abundance of work on matrix factorization models that consider a single dyadic relation matrix or multiple relation matrices between the same two types of objects [10], [21], [24], [26], [27], [28]; these are subsumed by our approach. For instance, Nickel [29] proposed a tri-matrix factorization model for multiple dyadic relations that factorized every R_i as R_i ≈ A S_i A^T. Although their model is appropriate for certain tasks of collective learning, all R_i describe relations between the same two sets of objects, whereas our approach models multi-relational and multi-object type data.

Rettinger et al. [30] proposed CARTD, a context-aware tensor decomposition for relation prediction in social networks. They decompose a tensor into additive factorized matrices using two-factor decomposition. They assume that the input data is provided together with contextual information that describes one specific relation, the recommendation. The drawback of their approach and of similar approaches [27], [31], [32] for r-ary tensors is that in higher dimensions (r > 3) the tensors become increasingly sparse and the computational requirements become infeasible. Notice that r here corresponds to the number of different object types in DFMF. In comparison, the approach proposed in this paper can handle tens of different object types.

Wang et al. [10] and Wang et al. [12] proposed tri-SPMF and S-NMTF, respectively, for simultaneous clustering of multi-type relational data via symmetric nonnegative matrix tri-factorization. These two methods are conceptually similar to our approach and use both inter-type and intra-type relations, but they require a full set of symmetric relation matrices, R_ij = R_ji^T. These assumptions of tri-SPMF and S-NMTF are rarely met in real-world fusion scenarios (see, for example, the fusion configuration from Fig. 2, which is not a six-clique), where we do not have access to relation matrices between all possible pairs of object types (i.e., R_ij for 1 ≤ i

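The kernel preprocessing used in the experimental setup (Section 4.1.2) — an RBF kernel on expression profiles, double centering, and normalization to a unit diagonal — can be sketched as follows. This is a minimal NumPy version under the formulas stated in that section; the data and variable names are illustrative:

```python
import numpy as np

def rbf_kernel(X, sigma):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def center_kernel(K):
    # K^c(i,j) = K(i,j) - row mean - column mean + grand mean
    return (K - K.mean(axis=0, keepdims=True)
              - K.mean(axis=1, keepdims=True) + K.mean())

def normalize_kernel(Kc):
    # K^n(i,j) = K^c(i,j) / sqrt(K^c(i,i) K^c(j,j)); unit diagonal
    d = np.sqrt(np.diag(Kc))
    return Kc / np.outer(d, d)

# Hypothetical gene expression profiles, one row per gene.
X = np.random.default_rng(0).random((10, 3))
K = rbf_kernel(X, sigma=1.0)
Kc = center_kernel(K)
Kn = normalize_kernel(Kc)
```

After centering, the kernel entries sum to zero (the feature-space mean is removed), and after normalization every diagonal entry equals one; the width sigma would in practice be chosen by internal cross-validation, as described in the text.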
and slim term similarity scores (Q2), computed as 0.2^hops, where hops was the length of the shortest path between two terms in the GO graph. Similarly, MeSH descriptors were constrained with the average number of hops in the MeSH hierarchy between each pair of descriptors (Q5). Constraints between KEGG pathways corresponded to the number of common ortholog groups (Q6). The slim subset of GO terms was used to limit the optimization complexity of the MKL and the number of variables in the SOCP, and to ease the computational burden of early integration by random forests, which inferred a separate model for each of the terms.

We conducted three experiments in which we considered either the 100 or 1,000 most GO-annotated genes or the whole D. discoideum genome (~12,000 genes). We also examined the predictions of gene associations with any of nine GO terms that are of specific relevance to current research in the Dictyostelium community (upon consultations with Gad Shaulsky, Baylor College of Medicine, Houston, TX; see Table 2). Instead of using a generic slim subset of terms, we examined these predictions in the context of a complete set of GO terms. This resulted in a data set with ~2,000 terms, each term having at least 10 direct gene annotations. The prediction task was defined through the classification matrix of genes and their associated GO slim terms from R12.

4.1.2 Preprocessing for Kernel-Based Fusion

We generated an RBF kernel for gene expression measurements from R13 with the RBF function k(x_i, x_j) = exp(−||x_i − x_j||² / 2σ²), and a linear kernel for the [0, 1]-protein-interaction matrix from Q1. This particular choice of kernels was motivated by the experimental study and kernel comparison in [5]. Kernels were applied to data matrices. We used a linear kernel to generate a kernel matrix from D. discoideum-specific genes that participate in pathways (R16), and a kernel matrix from PMIDs and their associated genes (R14).

Several data sources describe relations between object types other than genes. For kernel-based fusion we had to transform them to explicitly relate to genes. For instance, to relate genes and MeSH descriptors, we counted the number of publications that were associated with a specific gene (R14) and were assigned a specific MeSH descriptor (R45; see also Fig. 2). A linear kernel was applied to the resulting matrix. Kernel matrices that incorporated relations between KEGG pathways and GO terms (R62), and between publications and GO terms, were obtained in a similar fashion.

To represent the hierarchical structure of MeSH descriptors (Q5), the semantic structure of the GO graph (Q2) and the ortholog groups that correspond to KEGG pathways (Q6), we considered the genes as nodes in three distinct large weighted graphs. In the graph for Q5, the link between two genes was weighted by the similarity of their associated sets of MeSH descriptors, using information from R14 and R45; we consulted the MeSH hierarchy to measure these similarities. Similarly, for the graph for Q2 we considered the GO semantic structure in computing similarities of the sets of GO terms associated with genes. In the graph for Q6, the gene edges were weighted by the number of common KEGG ortholog groups. Kernel matrices were constructed with a diffusion kernel [62].

The resulting kernel matrices K ∈ R^{n×n} were centered as K^c(i, j) = K(i, j) − (1/n) Σ_i K(i, j) − (1/n) Σ_j K(i, j) + (1/n²) Σ_{i,j} K(i, j), and normalized as K^c(i, j) ← K^c(i, j) / sqrt(K^c(i, i) K^c(j, j)). The parameters for all kernels were selected through internal cross-validation. In cross-validation, only the training part of the matrices was optimized for learning, while centering and normalization were performed on the entire data set.

4.1.3 Preprocessing for Early Integration

The gene-related data matrices prepared for kernel-based fusion were also used for early integration and were concatenated into a single data table. Each row in the table represented a gene profile obtained from all available data sources. For our case study, each gene was characterized by a fixed 9,362-dimensional feature vector. For each GO slim term, we then separately developed a classifier with a random forest of classification trees and reported cross-validated results.

4.1.4 Preprocessing for tri-SPMF Learning

Relation and constraint matrices prepared for DFMF were also used for the tri-SPMF factorization algorithm. Tri-SPMF requires a full set of relation matrices between all pairs of object types. Thus, we used zero matrices for the relations that do not exist in Fig. 2. For instance, R63 and Q4 were represented by zero matrices of proper dimensions. Because tri-SPMF requires that relations are symmetric, we set Rji = Rij^T for all available relation matrices.

4.2 Setup for Pharmacologic Action Prediction Task

Identification of the mechanisms of action of chemical compounds is a crucial task in drug discovery [63], [64]. Here, our aim was to computationally predict pharmacologic actions of chemical compounds as defined in the PubChem database [65].

Fig. 3. The fusion configuration for the prediction of pharmacologic actions of chemicals, with object types denoted with nodes and relations between them with edges. The edge representing the target relation and its corresponding data matrix R12 is highlighted.

4.2.1 Data

We considered six object types (Fig. 3): chemicals (type 1), PubChem's [65] pharmacologic actions (type 2), publications from the PubMed database (type 3), depositors of

TABLE 1 Cross-Validated F1 and AUC Accuracy Scores for Fusion by Matrix Factorization (DFMF), Kernel-Based Method (MKL), Random Forests (RF) and Relational Learning-Based Matrix Factorization (tri-SPMF)

Prediction task                 DFMF          MKL           RF            tri-SPMF
                                F1     AUC    F1     AUC    F1     AUC    F1     AUC
100 D. discoideum genes         0.799  0.801  0.781  0.788  0.761  0.785  0.731  0.724
1000 D. discoideum genes        0.826  0.823  0.787  0.798  0.767  0.788  0.756  0.741
Whole D. discoideum genome      0.831  0.849  0.800  0.821  0.782  0.801  0.778  0.787
Pharmacologic actions           0.663  0.834  0.639  0.811  0.643  0.819  0.641  0.810

chemical substances (type 4) and their categorization (type 6), and PubChem substructure fingerprints (type 5).

The data included 1,260 chemicals extracted from the complete DrugBank [66] database (accessed in Feb. 2014) that were identified with at least one pharmacologic action in the PubChem Compound database. In that way, every chemical (drug) was assigned one or more MeSH headings that described its pharmacologic actions and corresponded to the D27.505 tree of the 2014 MeSH Tree Structure (target relation R12). For example, established pharmacologic actions for Aspirin include "Anti-Inflammatory Agents, Non-Steroidal," "Fibrinolytic Agents" and "Antipyretics." To increase the number of chemicals assigned to a particular pharmacologic action, the actions of a chemical also included those of its direct parents in the D27.505 tree.

Other data considered were publications from the PubMed database (R13), data on the depositors who submitted substances of the chemicals present in PubChem Compound records (R14), categories of data depositors (R46) and PubChem substructure fingerprints (R15). These fingerprints consist of a series of 881 binary indicators, each denoting the presence or absence of a particular substructure in a molecule. Collectively, these binary keys provide a "fingerprint" of a particular chemical structure form. Chemicals are constrained by a matrix of substructure-based Tanimoto 2D similarity (Q1) obtained through the PubChem Score Matrix Service.

4.2.2 Preprocessing for Alternative Learning Methods

For kernel-based fusion, we generated the kernel matrices for chemicals from R13, R14, R15 and Q1 (Fig. 3) using the polynomial kernel of degree 2. We included data on depositors (R46) by applying a polynomial kernel to R14 R46. The resulting kernel matrices were centered and normalized, and the kernel parameters were selected in internal cross-validation (see Section 4.1.2 for details). Preprocessing for early integration by random forests and for tri-SPMF learning followed the same procedures as described in Sections 4.1.3 and 4.1.4, respectively. The prediction task was defined by the associations of chemicals to pharmacologic actions given by R12 (Fig. 3).

4.3 Scoring

We estimated the quality of the inferred models by ten-fold cross-validation. In each iteration, we split the set of genes (chemicals) into a train and a test set. The corresponding data on the genes (chemicals) from the test set was entirely omitted from the training data. We developed prediction models from the training data and tested them on the genes (chemicals) from the test set. The performance was evaluated with the F1 score, the harmonic mean of precision and recall, and the area under the ROC curve (AUC). Both scores were averaged across cross-validation runs.

5 RESULTS AND DISCUSSION

5.1 Predictive Performance

Table 1 presents the cross-validated F1 and AUC scores for both gene function prediction (data set of slim GO terms) and prediction of pharmacologic actions. The accuracy of DFMF is at least comparable to MKL and substantially higher than that of early integration by random forests and relational learning by tri-SPMF. When more genes, and hence more data, were considered for gene function prediction, the performance of all four fusion approaches improved.

The poorer performance of tri-SPMF was most probably due to the required introduction of relations into the factorized system that were not present in the input data. Consequently, the ability of tri-SPMF to infer relations of interest between other object types deteriorated considerably. Notice also that tri-SPMF could not be applied if the fusion schemes in Figs. 2 or 3 contained asymmetric or one-way relations, such as those from the analysis of signed networks [67] and computational biology [68], among others. We also observed numerical instability with tri-SPMF, exhibited as an increase in the value of the objective function between successive iterations. In contrast, DFMF exhibited numerical stability in all experiments (results not shown).

The accuracy for the nine GO terms selected by the domain expert is given in Table 2. DFMF performs consistently better than the other three approaches. Again, early integration by random forests is inferior to all three intermediate integration methods. Notice that, with only a few exceptions, both the F1 and AUC scores of DFMF are high. This is important, as all nine gene processes and functions observed are relevant to current research on D. discoideum, where methods for data fusion can yield new candidate genes for focused experimental studies.

Our fusion approach is also faster than multiple kernel learning. DFMF required 18 minutes of runtime on a standard desktop computer, compared to 77 minutes for MKL to finish one iteration of cross-validation of the whole-genome variant of the gene function prediction task. The factorization algorithm of DFMF also took less time to execute than tri-SPMF, due to the redundant representation of the fusion system required by tri-SPMF.
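The scoring protocol of Section 4.3 can be sketched in Python. The toy data, the nearest-centroid classifier and all function names below are illustrative stand-ins, not the paper's pipeline; F1 and AUC are computed from their definitions (AUC via the Mann-Whitney rank identity):

```python
import numpy as np

def f1_score(y, yhat):
    """F1: harmonic mean of precision and recall."""
    tp = np.sum((yhat == 1) & (y == 1))
    fp = np.sum((yhat == 1) & (y == 0))
    fn = np.sum((yhat == 0) & (y == 1))
    prec = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    return 2 * prec * rec / max(prec + rec, 1e-12)

def auc_score(y, s):
    """Area under the ROC curve via the rank (Mann-Whitney U) identity."""
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n1, n0 = np.sum(y == 1), np.sum(y == 0)
    return (np.sum(ranks[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Toy data: 200 "genes", one binary annotation driven by the first feature.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = (X[:, 0] > 0.5).astype(int)

f1s, aucs = [], []
for fold in np.array_split(rng.permutation(200), 10):   # ten folds
    train = np.setdiff1d(np.arange(200), fold)
    # Nearest-centroid stand-in for the fused models compared in Table 1;
    # test rows are never seen during "training".
    c1 = X[train][y[train] == 1].mean(axis=0)
    c0 = X[train][y[train] == 0].mean(axis=0)
    s = np.linalg.norm(X[fold] - c0, axis=1) - np.linalg.norm(X[fold] - c1, axis=1)
    f1s.append(f1_score(y[fold], (s > 0).astype(int)))
    aucs.append(auc_score(y[fold], s))

print(round(float(np.mean(f1s)), 3), round(float(np.mean(aucs)), 3))
```

As in the paper's protocol, both scores are averaged across the cross-validation runs, and the held-out rows contribute nothing to the fitted model.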

TABLE 2 Gene Ontology Term-Specific Cross-Validated F1 and AUC Accuracy Scores for Fusion by Matrix Factorization (DFMF), Kernel-Based Method (MKL), Random Forests (RF) and Relational Learning-Based Matrix Factorization (tri-SPMF)

GO term name                    Term identifier  Namespace  Size  DFMF          MKL           RF            tri-SPMF
                                                                  F1     AUC    F1     AUC    F1     AUC    F1     AUC
Activation of adeny. cyc. act.  0007190          BP         11    0.834  0.844  0.770  0.781  0.758  0.601  0.729  0.731
Chemotaxis                      0006935          BP         58    0.981  0.980  0.794  0.786  0.538  0.724  0.804  0.810
Chemotaxis to cAMP              0043327          BP         21    0.922  0.910  0.835  0.862  0.798  0.767  0.838  0.815
Phagocytosis                    0006909          BP         33    0.956  0.932  0.892  0.901  0.789  0.619  0.836  0.810
Response to bacterium           0009617          BP         51    0.899  0.870  0.788  0.761  0.785  0.761  0.817  0.831
Cell-cell adhesion              0016337          BP         14    0.883  0.861  0.867  0.856  0.728  0.725  0.799  0.834
Actin binding                   0003779          MF         43    0.676  0.781  0.664  0.658  0.642  0.737  0.671  0.682
Lysozyme activity               0003796          MF         4     0.782  0.750  0.774  0.750  0.754  0.625  0.747  0.625
Seq.-spec. DNA bind. t. f. a.   0003700          MF         79    0.956  0.948  0.894  0.901  0.732  0.759  0.892  0.852

Terms in Gene Ontology belong to one of three namespaces, biological process (BP), molecular function (MF) or cellular component.

5.2 Sensitivity to Inclusion of Data Sources

Inclusion of additional data sources improves the accuracy of the prediction models. We illustrate this for gene function prediction in Fig. 4a, where we started with only the target data source R12 and then added either R13 or Q1 or both. Similar effects were observed when we studied other combinations of data sources (not shown here for brevity). Notice also that, due to ensembling, the cross-validated variance of F1 is small.

5.3 Sensitivity to Inclusion of Constraints

We varied the sparseness of the gene constraint matrix Q1 by holding out a random subset of protein-protein interactions. We set the entries of Q1 that corresponded to the held-out constraints to zero, so that they did not affect the cost function during optimization. Fig. 4b shows that including additional information on genes in the form of constraints improves the predictive performance of DFMF for gene function prediction.

5.4 Matrix Factor Initialization Study

We studied the effect of matrix factor initialization on DFMF by observing the reconstruction error after one and after twenty iterations of the optimization procedure, the latter being about one fourth of the iterations required for the optimization algorithm to converge when predicting gene functions. We estimated the error relative to the optimal (k1, k2, ..., k6)-rank approximation given by the SVD. For iteration v and matrix Rij, the error was computed as:

Err_ij(v) = ( ||Rij − Gi^(v) Sij^(v) (Gj^(v))^T||² − dF(Rij, [Rij]_k) ) / dF(Rij, [Rij]_k),   (14)

where Gi^(v), Gj^(v) and Sij^(v) are the matrix factors obtained after v iterations of the factorization algorithm. In Eq. (14), dF(Rij, [Rij]_k) = ||Rij − U_k S_k V_k^T||² denotes the Frobenius distance between Rij and its k-rank approximation given by the SVD, where k = max(ki, kj) is the approximation rank. Err_ij(v) is a pessimistic measure of quantitative accuracy because of the choice of k. This error measure is similar to the error of the two-factor non-negative matrix factorization from [17].

Table 3 shows the results for the experiment with the 1,000 most GO-annotated D. discoideum genes and selected factorization ranks ki < 65, i ∈ [6]. The informed initialization algorithms surpass random initialization. Of these, the random Acol algorithm performs best in terms of accuracy and is also one of the simplest.

5.5 Early Integration by Matrix Factorization

Our data fusion approach simultaneously factorizes the individual blocks of data in R. Alternatively, we could disregard the data structure and treat R as a single data matrix. Such treatment of the data would transform our data fusion approach into early integration and lose the benefits of a structured system and source-specific factorization. To prove this experimentally, we considered the 1,000 most GO-annotated D. discoideum genes. The resulting cross-validated F1 score for factorization-based

TABLE 3 Effect of Initialization Algorithm on Reconstruction Error of DFMF’s Factorization Model

Method      Time G(0)   Storage G(0)   Err12(1)   Err12(20)
Rand.       0.0011 s    618 K          5.11       3.61
Rand. C     0.1027 s    553 K          2.97       1.67
Rand. Acol  0.0654 s    505 K          1.59       1.30
K-means     0.4029 s    562 K          2.47       2.20
NNDSVDa     0.1193 s    562 K          3.50       2.01

Fig. 4. Adding new data sources (a) or incorporating more object-type-specific constraints in Q1 (b) both increase the accuracy of matrix factorization-based models for the gene function prediction task.

early integration was 0.576, compared to 0.826 obtained with our proposed data fusion algorithm. This result is not surprising, as neglecting the structure of the system also causes the loss of structure in the matrix factors and the loss of the zero blocks in the factors S and G from Eq. (3). Clearly, the data structure carries substantial information and should be retained in the model.

6 CONCLUSION

We have proposed a new matrix factorization-based data fusion algorithm called DFMF. The approach is flexible and, in contrast to state-of-the-art kernel-based methods, requires minimal, if any, preprocessing of the input data. This latter feature, the ability to model multi-relational and multi-object type data, and DFMF's excellent accuracy and time response are the principal advantages of our new algorithm.

DFMF can model any collection of data sets, each of which can be expressed as a matrix. The tasks from bioinformatics and cheminformatics considered here, which were traditionally regarded as classification problems, exemplify just one type of data mining problem that can be addressed with our method. We anticipate the utility of factorization-based data fusion in multi-task learning, association mining, clustering, link prediction or structured output prediction.

APPENDIX
PROOF OF CONVERGENCE (THEOREM 2)

Our proof follows the concept of auxiliary functions often used in convergence proofs of approximate matrix factorization algorithms [13]. The proof is performed by introducing an appropriate function F(G, G'), which is an auxiliary function of the objective J(G; S) that satisfies:

F(G', G') = J(G'; S),   F(G, G') ≥ J(G; S).

If such an auxiliary function F can be found and if G is updated in the (m+1)-th iteration as

G^(m+1) = arg min_G F(G, G^(m)),   (15)

then the following holds:

J(G^(m+1); S) ≤ F(G^(m+1), G^(m)) ≤ F(G^(m), G^(m)) = J(G^(m); S).   (16)

That is, if F is an auxiliary function of J(G; S), then J(G; S) is nonincreasing under the update Eq. (15). In the proof we show that the update step for G in Eq. (12) is exactly the update in Eq. (15) with a proper auxiliary function. For that we make use of an auxiliary function specified by Wang et al. (Appendix 2 in [10]). Wang et al. constructed a function F_Wang(A, A', B, C, D) and showed that it satisfies the conditions of an auxiliary function for functions of the form J(A, B, C, D) = tr(−2 A^T B + A D A^T) + tr(A^T C A), where C and D are symmetric and A is nonnegative. To prove the convergence of our algorithm, we show that the objective function from Eq. (5) is a special case of J(A, B, C, D).

Proof of Theorem 2. First, we view J(G; S) in Eq. (6) as a function of G1 and construct the auxiliary function F_Wang(A, A', B, C, D) such that:

A = G1,
B = Σ_{j: R1j ∈ R} R1j Gj S1j^T + Σ_{i: Ri1 ∈ R} Ri1^T Gi Si1,
C = Σ_{t=1}^{max_i t_i} Q1^(t),   (17)
D = Σ_{j: R1j ∈ R} S1j Gj^T Gj S1j^T + Σ_{i: Ri1 ∈ R} Si1^T Gi^T Gi Si1.

With these values for A, B, C and D, the auxiliary function F_Wang is convex in G1. Notice that each of the two summation terms in the right-hand side expression for D represents a sum of symmetric matrices of the form (Gj S1j^T)^T (Gj S1j^T) and (Gi Si1)^T (Gi Si1), respectively. Thus, D is symmetric. The global minimum (Eq. (15)) of F_Wang(A, A', B, C, D) is exactly the update rule for G1 in Eqs. (10), (11), and (12).

We repeat this process by constructing the remaining r − 1 auxiliary functions, separately considering J(G; S) as a function of the matrix factors G2, ..., Gr. From the theory of auxiliary functions it then follows that J is nonincreasing under the update rules for each of G1, G2, ..., Gr. Letting J(G1, G2, ..., Gr; S) = J(G; S), we have:

J(G1^0, G2^0, ..., Gr^0; S) ≥ J(G1^1, G2^0, ..., Gr^0; S) ≥ ... ≥ J(G1^1, G2^1, ..., Gr^1; S).

Since J(G; S) is bounded from below by zero, we proved the theorem. □

ACKNOWLEDGMENTS

The authors thank Janez Demsar and Gad Shaulsky for their comments on an early version of this manuscript. They also acknowledge the support for their work from the ARRS (P2-0209, J2-5480), NIH (P01-HD39691), and EU (Health-F5-2010-242038).

REFERENCES

[1] S. M. Rappaport and M. T. Smith, "Environment and disease risks," Science, vol. 330, no. 6003, pp. 460–461, 2010.
[2] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L.-C. Tranchevent, B. De Moor, P. Marynen, B. Hassan, P. Carmeliet, and Y. Moreau, "Gene prioritization through genomic data fusion," Nature Biotechnol., vol. 24, no. 5, pp. 537–544, 2006.
[3] H. Bostrom, S. F. Andler, M. Brohede, R. Johansson, A. Karlsson, J. van Laere, L. Niklasson, M. Nilsson, A. Persson, and T. Ziemke, "On the definition of information fusion as a field of research," Univ. Skovde, School of Humanities and Informatics, Skovde, Sweden, Tech. Rep. HS-IKI-TR-07-006, 2007.
[4] D. Greene and P. Cunningham, "A matrix factorization approach for integrating multiple data views," in Proc. Eur. Conf. Mach. Learn. Knowl. Discovery Databases, 2009, pp. 423–438.
[5] G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble, "A statistical framework for genomic data fusion," Bioinformat., vol. 20, no. 16, pp. 2626–2635, 2004.
[6] P. Pavlidis, J. Cai, J. Weston, and W. S. Noble, "Learning gene functional classifications from multiple data types," J. Comput. Biol., vol. 9, pp. 401–411, 2002.

[7] O. Gevaert, F. De Smet, D. Timmerman, Y. Moreau, and B. De Moor, "Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks," Bioinformat., vol. 22, no. 14, pp. e184–e190, 2006.
[8] W. Tang, Z. Lu, and I. S. Dhillon, "Clustering with multiple graphs," in Proc. IEEE 9th Int. Conf. Data Mining, 2009, pp. 1016–1021.
[9] M. H. van Vliet, H. M. Horlings, M. J. van de Vijver, M. J. T. Reinders, and L. F. A. Wessels, "Integration of clinical and gene expression data has a synergetic effect on predicting breast cancer outcome," PLoS One, vol. 7, no. 7, p. e40358, 2012.
[10] F. Wang, T. Li, and C. Zhang, "Semi-supervised clustering via matrix factorization," in Proc. SIAM Int. Conf. Data Mining, 2008, pp. 1–12.
[11] S. Yu, T. Falck, A. Daemen, L.-C. Tranchevent, J. A. Suykens, B. De Moor, and Y. Moreau, "L2-norm multiple kernel learning and its application to biomedical data fusion," BMC Bioinformat., vol. 11, p. 309, 2010.
[12] H. Wang, H. Huang, and C. H. Q. Ding, "Simultaneous clustering of multi-type relational data via symmetric nonnegative matrix tri-factorization," in Proc. 20th ACM CIKM Int. Conf. Inf. Knowl. Manage., 2011, pp. 279–284.
[13] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds. Cambridge, MA, USA: MIT Press, 2000, pp. 556–562.
[14] J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, "Metagenes and molecular pattern discovery using matrix factorization," Proc. Nat. Acad. Sci. USA, vol. 101, no. 12, pp. 4164–4169, 2004.
[15] L. N. Hutchins, S. M. Murphy, P. Singh, and J. H. Graber, "Position-dependent motif characterization using non-negative matrix factorization," Bioinformat., vol. 24, no. 23, pp. 2684–2690, 2008.
[16] M. H. Van Benthem and M. R. Keenan, "Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems," J. Chemometrics, vol. 18, no. 10, pp. 441–450, 2004.
[17] R. Albright, J. Cox, D. Duling, A. N. Langville, and C. D. Meyer, "Algorithms, initializations, and convergence for the nonnegative matrix factorization," Dept. Math., North Carolina State University, Raleigh, NC, USA, Tech. Rep. 81706, 2006.
[18] C. Boutsidis and E. Gallopoulos, "SVD based initialization: A head start for nonnegative matrix factorization," Pattern Recognit., vol. 41, no. 4, pp. 1350–1362, 2008.
[19] A. P. Singh and G. J. Gordon, "A unified view of matrix factorization models," in Proc. Eur. Conf. Mach. Learn. Knowl. Discovery Databases, 2008, pp. 358–373.
[20] T. Lange and J. M. Buhmann, "Fusion of similarity data in clustering," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 723–730.
[21] H. Wang, H. Huang, C. H. Q. Ding, and F. Nie, "Predicting protein-protein interactions from multimodal biological data sources via nonnegative matrix tri-factorization," in Res. Comput. Molecular Biol., vol. 7262, pp. 314–325, 2012.
[22] S. Zhang, C.-C. Liu, W. Li, H. Shen, P. W. Laird, and X. J. Zhou, "Discovery of multi-dimensional modules by integrative analysis of cancer genomic data," Nucleic Acids Res., vol. 40, no. 19, pp. 9379–9391, 2012.
[23] W.-j. Li and D.-y. Yeung, "Relation regularized matrix factorization," in Proc. 20th Int. Joint Conf. Artif. Intell., 2007, pp. 1126–1131.
[24] S.-H. Zhang, Q. Li, J. Liu, and X. J. Zhou, "A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules," Bioinformat., vol. 27, no. 13, pp. 401–409, 2011.
[25] A. P. Singh and G. J. Gordon, "Relational learning via collective matrix factorization," in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2008, pp. 650–658.
[26] A. P. Singh and G. J. Gordon, "A Bayesian matrix factorization model for relational data," in Proc. 26th Conf. Uncertainty Artif. Intell., 2010, pp. 556–563.
[27] I. Sutskever, "Modelling relational data using Bayesian clustered tensor factorization," in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 1821–1828.
[28] T. Li, Y. Zhang, and V. Sindhwani, "A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge," in Proc. Joint Conf. 47th Annu. Meet. ACL 4th Int. Joint Conf. Natural Language Process. AFNLP, 2009, pp. 244–252.
[29] M. Nickel, "A three-way model for collective learning on multi-relational data," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 809–816.
[30] A. Rettinger, H. Wermser, Y. Huang, and V. Tresp, "Context-aware tensor decomposition for relation prediction in social networks," Social Netw. Anal. Mining, vol. 2, no. 4, pp. 373–385, 2012.
[31] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Rev., vol. 51, no. 3, pp. 455–500, 2009.
[32] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme, "Fast context-aware recommendations with factorization machines," in Proc. 34th ACM SIGIR Int. Conf. Res. Develop. Inf. Retrieval, 2011, pp. 635–644.
[33] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan, "Multi-view clustering via canonical correlation analysis," in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 129–136.
[34] S. Mostafavi and Q. Morris, "Combining many interaction networks to predict gene function and analyze gene lists," Proteomics, vol. 12, no. 10, pp. 1687–1696, 2012.
[35] D. Zhou and C. J. C. Burges, "Spectral clustering and transductive learning with multiple views," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 1159–1166.
[36] A. Klami and S. Kaski, "Probabilistic approach to detecting dependencies between data sets," Neurocomput., vol. 72, no. 1–3, pp. 39–46, 2008.
[37] H. F. Lopes, D. Gamerman, and E. Salazar, "Generalized spatial dynamic factor models," Comput. Statist. Data Anal., vol. 55, no. 3, pp. 1319–1330, 2011.
[38] J. Luttinen and A. Ilin, "Variational Gaussian-process factor analysis for modeling spatio-temporal data," in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 1177–1185.
[39] C. Xing and D. B. Dunson, "Bayesian inference for genomic data integration reduces misclassification rate in predicting protein-protein interactions," PLoS Comput. Biol., vol. 7, no. 7, p. e1002110, 2011.
[40] Y. Zhang and Q. Ji, "Active and dynamic information fusion for multisensor systems with dynamic Bayesian networks," IEEE Trans. Syst., Man, Cybern., Part B, vol. 36, no. 2, pp. 467–472, 2006.
[41] A. Alexeyenko and E. L. L. Sonnhammer, "Global networks of functional coupling in eukaryotes from comprehensive data integration," Genome Res., vol. 19, no. 6, pp. 1107–1116, 2009.
[42] C. Huttenhower, K. T. Mutungu, N. Indik, W. Yang, M. Schroeder, J. J. Forman, O. G. Troyanskaya, and H. A. Coller, "Detailing regulatory networks through large scale data integration," Bioinformat., vol. 25, no. 24, pp. 3267–3274, 2009.
[43] G. A. Carpenter, S. Martens, and O. J. Ogas, "Self-organizing information fusion and hierarchical knowledge discovery: A new framework using ARTMAP neural networks," Neural Netw., vol. 18, no. 3, pp. 287–295, 2005.
[44] Z. Chen and W. Zhang, "Integrative analysis using module-guided random forests reveals correlated genetic factors related to mouse weight," PLoS Comput. Biol., vol. 9, no. 3, p. e1002956, 2013.
[45] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan, "Learning the kernel matrix with semidefinite programming," J. Mach. Learn. Res., vol. 5, pp. 27–72, 2004.
[46] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, "Multiple kernel learning, conic duality, and the SMO algorithm," in Proc. 21st Int. Conf. Mach. Learn., 2004, pp. 6–14.
[47] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Muller, and A. Zien, "Efficient and accurate Lp-norm multiple kernel learning," in Proc. Adv. Neural Inf. Process. Syst., vol. 21, 2009, pp. 997–1005.
[48] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, "Lp-norm multiple kernel learning," J. Mach. Learn. Res., vol. 12, pp. 953–997, 2011.
[49] S. Yu, L. Tranchevent, X. Liu, W. Glanzel, J. A. K. Suykens, B. De Moor, and Y. Moreau, "Optimized data fusion for kernel k-means clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 5, pp. 1031–1039, May 2012.
[50] M. Gonen and E. Alpaydin, "Multiple kernel learning algorithms," J. Mach. Learn. Res., vol. 12, pp. 2211–2268, 2011.
[51] R. Debnath and H. Takahashi, "Kernel selection for the support vector machine," IEICE Trans., vol. 87-D, no. 12, pp. 2903–2904, 2004.
[52] D. Parikh and R. Polikar, "An ensemble-based incremental learning approach to data fusion," IEEE Trans. Syst., Man, Cybern., Part B, vol. 37, no. 2, pp. 437–450, 2007.

[53] G. Pandey, B. Zhang, A. N. Chang, C. L. Myers, J. Zhu, V. Kumar, and E. E. Schadt, "An integrative multi-network and multi-classifier approach to predict genetic interactions," PLoS Comput. Biol., vol. 6, no. 9, p. e1000928, 2010.
[54] R. S. Savage, Z. Ghahramani, J. E. Griffin, B. J. de la Cruz, and D. L. Wild, "Discovering transcriptional modules by Bayesian data integration," Bioinformat., vol. 26, no. 12, pp. i158–i167, 2010.
[55] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[56] A.-L. Boulesteix, C. Porzelius, and M. Daumer, "Microarray-based classification and clinical predictors: On combined classifiers and additional predictive value," Bioinformat., vol. 24, no. 15, pp. 1698–1706, 2008.
[57] J. Ye, S. Ji, and J. Chen, "Multi-class discriminant kernel learning via convex programming," J. Mach. Learn. Res., vol. 9, pp. 719–758, 2008.
[58] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, "Gene Ontology: Tool for the unification of biology," Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.
[59] P. Radivojac, W. T. Clark, T. R. Oron, A. M. Schnoes, T. Wittkop, A. Sokolov, K. Graim, C. Funk, K. Verspoor, and A. Ben-Hur, "A large-scale evaluation of computational protein function prediction," Nature Methods, vol. 10, pp. 221–227, 2013.
[60] M. Kanehisa, S. Goto, Y. Sato, M. Kawashima, M. Furumichi, and M. Tanabe, "Data, information, knowledge and principle: Back to metabolism in KEGG," Nucleic Acids Res., vol. 42, no. D1, pp. D199–D205, 2014.
[61] A. Parikh, E. R. Miranda, M. Katoh-Kurasawa, D. Fuller, G. Rot, L. Zagar, T. Curk, R. Sucgang, R. Chen, B. Zupan, W. F. Loomis, A. Kuspa, and G. Shaulsky, "Conserved developmental transcriptomes in evolutionarily divergent species," Genome Biol., vol. 11, no. 3, p. R35, 2010.
[62] R. I. Kondor and J. D. Lafferty, "Diffusion kernels on graphs and other discrete input spaces," in Proc. 19th Int. Conf. Mach. Learn., 2002, pp. 315–322.
[63] G. V. Paolini, R. H. Shapland, W. P. van Hoorn, J. S. Mason, and A. L. Hopkins, "Global mapping of pharmacological space," Nature Biotechnol., vol. 24, no. 7, pp. 805–815, 2006.
[64] F. Iorio, R. Bosotti, E. Scacheri, V. Belcastro, P. Mithbaokar, R. Ferriero, L. Murino, R. Tagliaferri, N. Brunetti-Pierri, A. Isacchi, et al., "Discovery of drug mode of action and drug repositioning from transcriptional responses," Proc. Nat. Acad. Sci. USA, vol. 107, no. 33, pp. 14621–14626, 2010.
[65] Y. Wang, J. Xiao, T. O. Suzek, J. Zhang, J. Wang, and S. H. Bryant, "PubChem: A public information system for analyzing bioactivities of small molecules," Nucleic Acids Res., vol. 37, suppl. 2, pp. W623–W633, 2009.
[66] V. Law, C. Knox, Y. Djoumbou, T. Jewison, A. C. Guo, Y. Liu, A. Maciejewski, D. Arndt, M. Wilson, V. Neveu, A. Tang, G. Gabriel, C. Ly, S. Adamjee, Z. T. Dame, B. Han, Y. Zhou, and D. S. Wishart, "DrugBank 4.0: Shedding new light on drug metabolism," Nucleic Acids Res., vol. 42, no. D1, pp. D1091–D1097, 2014.
[67] J. Leskovec, D. Huttenlocher, and J. Kleinberg, "Signed networks in social media," in Proc. SIGCHI Conf. Human Factors Comput. Syst., 2010, pp. 1361–1370.
[68] R. A. Notebaart, P. R. Kensche, M. A. Huynen, and B. E. Dutilh, "Asymmetric relationships between proteins shape genome evolution," Genome Biol., vol. 10, no. 2, p. R19, 2009.

Marinka Zitnik received the BS degree in computer science and mathematics from the University of Ljubljana, Slovenia, in 2012, where she is currently working toward the PhD degree in computer science. Her research interests include machine learning, optimization, mathematical modelling, matrix analysis, and probabilistic numerics.

Blaz Zupan studied computer science at the University of Ljubljana, Slovenia, and the University of Houston, Texas. He is a professor at the University of Ljubljana, and a visiting professor at the Baylor College of Medicine in Houston. His research focuses on methods for data mining and applications in bioinformatics and systems biology. He is a coauthor of Orange, a Python-based and visual programming data mining suite, and several bioinformatics web applications, such as dictyExpress for gene expression analytics and GenePath for epistasis analysis.