IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 37, NO. 1, JANUARY 2015

Data Fusion by Matrix Factorization
Marinka Žitnik and Blaž Zupan
Abstract—For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system's constraints. In this paper we describe a data fusion approach with penalized matrix tri-factorization (DFMF) that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF on a gene function prediction task with eleven different data sources and on the prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
Index Terms—Data fusion, intermediate data integration, matrix factorization, data mining, bioinformatics, cheminformatics
1 INTRODUCTION

Data abound in all areas of human endeavour. We may gather various data sets that are directly related to the problem, or data sets that are loosely related to our study but could be useful when combined with other data sets. Consider, for example, the exposome [1], which encompasses the totality of human endeavour in the study of disease. Let us say that we examine susceptibility to a particular disease and have access to the patients' clinical data together with data on their demographics, habits, living environments, friends, relatives, movie-watching habits, and a movie genre ontology. Mining such a diverse data collection may reveal interesting patterns that would remain hidden if we analyzed only the directly related, clinical data. What if the disease was less common in living areas with more open spaces, or in environments where people need to walk instead of drive to the nearest grocery? Is the disease less common among those who watch comedies and ignore politics and news?

Methods for data fusion can collectively treat data sets and combine diverse data sources even when they differ in their conceptual, contextual and typographical representation [2], [3]. Individual data sets may be incomplete, yet because of their diversity and complementarity, fusion can improve the robustness and predictive performance of the resulting models [4], [5].

According to Pavlidis et al. (2002) [6], data fusion approaches can be classified into three main categories depending on the modeling stage at which fusion takes place. Early (or full) integration transforms all data sources into a single feature-based table and treats this as a single data set that can be explored by any of the well-established feature-based machine learning algorithms. The inferred models can in principle include any type of relationship between the features from within and between the data sources. Early integration relies on procedures for feature construction. For our exposome example, patient-specific data would need to include both clinical data and information from the movie genre ontologies. The former may be trivial, as this data is already related to each specific patient, while the latter requires more complex feature engineering. Early integration also neglects the modular structure of the data.

In late (decision) integration, each data source gives rise to a separate model. Predictions of these models are fused by model weighting. Again, prior to model inference, it is necessary to transform each data set to encode relations to the target concept. For our example, information on the movie preferences of friends and relatives would need to be mapped to disease associations. Such transformations may not be trivial and would need to be crafted independently for every data source.

The youngest branch of data fusion algorithms is intermediate (partial) integration. Algorithms in this category explicitly address the multiplicity of data and fuse them through inference of a single joint model. Intermediate integration does not merge the input data, nor does it develop separate models for each data source. It instead retains the structure of the data sources by incorporating it within the structure of the predictive model. This particular approach is often preferred because of its superior predictive accuracy [5], [6], [7], [8], [9], but for a given model type, it requires the development of a new inference algorithm.

M. Žitnik is with the Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia. E-mail: [email protected].
B. Zupan is with the Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia, and the Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030. E-mail: [email protected].
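The difference between early integration and the block-structured view adopted later in this paper can be made concrete in a few lines. The following is a toy illustration only, with invented patient counts, feature values and source names that do not come from the paper:

```python
import numpy as np

# Hypothetical per-source feature tables describing the same four patients
# (feature counts and values are invented for illustration).
clinical = np.array([[1.0, 0.2], [0.3, 0.9], [0.5, 0.5], [0.7, 0.1]])  # 4 patients x 2 clinical features
habits = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]])        # 4 patients x 3 habit indicators

# Early (full) integration: concatenate everything into one flat table
# that any standard feature-based learner can consume. The modular
# structure of the sources is lost in the process.
early = np.hstack([clinical, habits])
print(early.shape)  # (4, 5)
```

Intermediate integration, in contrast, would keep `clinical` and `habits` as separate blocks and encode their relationship inside the joint model itself.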
We here report on the development of a new method for intermediate data fusion based on constrained matrix factorization. Our aim was to construct an algorithm that requires no or only minimal transformation of the input data and can fuse feature-based representations, ontologies, associations and networks. We focus on the challenge of dealing with collections of heterogeneous data sources, and while we show that our method can be used on sizable problems from current research, scaling is not the focus of the present paper. We first present our data fusion algorithm, henceforth DFMF (Section 2), and then place it within the related work on relational learning approaches (Section 3). We also refer to related data integration approaches, specifically to methods of kernel-based data fusion (Section 3). We then examine the utility of DFMF and experimentally compare it with intermediate integration by multiple kernel learning (MKL), early integration with random forests, and tri-SPMF [10], a previously proposed matrix tri-factorization approach (Section 4).

Manuscript received 23 Oct. 2013; revised 13 Mar. 2014; accepted 23 July 2014. Date of publication 29 July 2014; date of current version 5 Dec. 2014. Recommended for acceptance by G. Lanckriet. Digital Object Identifier no. 10.1109/TPAMI.2014.2343973.
2 DATA FUSION ALGORITHM
DFMF considers r object types E_1, ..., E_r and a collection of data sources, each relating a pair of object types (E_i, E_j). In our introductory example of the exposome, object types could be a patient, a disease or a living environment, among others. If there are n_i objects of type E_i (o_p^i is the pth object of type E_i) and n_j objects of type E_j, we represent the observations from the data source that relates (E_i, E_j) for i ≠ j in a sparse matrix R_ij ∈ ℝ^{n_i × n_j}. An example of such a matrix would relate patients and drugs by reporting on patients' current drug prescriptions. Notice that matrices R_ij and R_ji are in general asymmetric. A data source that provides relations between objects of the same type E_i is represented by a constraint matrix Q_i ∈ ℝ^{n_i × n_i}. Examples of such constraints are social networks and drug interactions.

In real-world scenarios we might not have access to relations between all pairs of object types. Our data fusion algorithm still integrates all available data if the underlying graph of relations between object types is connected. In that case, low-dimensional representations of objects of a certain type borrow information from related objects of a different type. Fig. 1 shows an example of an underlying graph of relations and a block configuration of the fusion system with four object types.

Fig. 1. Conceptual fusion configuration for four object types, E_1, E_2, E_3 and E_4, equivalently represented by the graph of relations between object types (a) and the block-based matrix structure (b). Each data source relates a pair of object types, denoted by arcs in the graph (a) or matrices with shades of gray in the block matrix (b). For example, data matrix R_23 relates object types E_2 and E_3. Some relations are entirely missing. For instance, there is no data source relating objects from E_3 and E_1, as there is no arc linking nodes E_3 and E_1 in (a); equivalently, matrix R_31 is missing in (b). Relations can be asymmetric, such that R_23 ≠ R_32^T. Constraints, denoted by loops in (a) or matrices with blue entries in (b), relate objects of the same type. In our example configuration, constraints are provided for object types E_2 (one constraint matrix) and E_4 (three constraint matrices).

To retain the block structure of our fusion system, and hence model distinct relations between object types, we propose the simultaneous factorization of all relation matrices R_ij constrained by Q_i. The resulting system contains factors that are specific to each data source and factors that are specific to each object type. Through factor sharing we fuse the data but also identify source-specific patterns.

We have developed a variant of three-factor penalized matrix factorization that simultaneously decomposes all available relation matrices R_ij into G_i ∈ ℝ^{n_i × k_i}, G_j ∈ ℝ^{n_j × k_j} and S_ij ∈ ℝ^{k_i × k_j}, and regularizes their approximation through constraint matrices Q_i and Q_j, such that R_ij ≈ G_i S_ij G_j^T. The approximation can be rewritten such that an entry R_ij(p, q) is approximated by the inner product of the pth row of matrix G_i and a linear combination of the columns of matrix S_ij, weighted by the qth column of G_j. The matrix S_ij, which has relatively few vectors compared to R_ij (k_i ≪ n_i, k_j ≪ n_j), is used to represent many data vectors, and a good approximation can only be achieved in the presence of latent structure in the original data.

The proposed fusion approach is different from treating an entire system (e.g., that from Fig. 1) as a single large matrix. Factorization of such a matrix would yield factors that are not object type-specific and would thus disregard the structure of the system. We also show (Section 5.5) that such an approach is inferior in terms of predictive performance.

In comparison with existing multi-type relational data factorization approaches (see Section 3), the following characterizes our DFMF data fusion method:

i) DFMF can model multiple relations between multiple object types.
ii) Relations between some object types can be completely missing (see Fig. 1).
iii) Every object type can be associated with multiple constraint matrices.
iv) The algorithm makes no assumptions about structural properties of relations (e.g., symmetry of relations).

To be applicable to general real-world fusion problems, a data fusion algorithm needs to jointly address all of these characteristics. Besides the DFMF proposed in this manuscript, we are not aware of any other approach that would do so. Most real-world data integration problems would consider a larger number of object types, but with a growing number of object types it is likely that data relating some pair of object types is either not available or not meaningful. On the other side, there may be various data sources available on interactions between objects of the same type that also require appropriate treatment. For an example of this type of data, consider the abundance of databases on drug or disease interactions.

In the case study presented in this paper we apply data fusion to infer relations between two target object types, E_i and E_j (Sections 2.6 and 2.7). This relation, encoded in a target matrix R_ij, will be observed in the context of all other data sources (Section 2.1). We assume that our target R_ij is a [0, 1]-matrix that is only partially observed. Its entries indicate a degree of relation, 0 denoting no relation and 1 denoting the strongest relation. We aim to predict unobserved entries in R_ij by reconstructing them through matrix factorization. Such treatment in general applies to multi-class or multi-label classification tasks, which are conveniently addressed by multiple kernel fusion [11], with which we compare our performance in this paper.

In the following, we present the factorization model and the objective function, derive the updating rules for optimization, and describe the procedure for prediction of relations from matrix factors. In the optimization part, we closely follow [10] in notation, mathematical derivation and proof technique.

2.1 Factorization Model for Multi-Relational and Multi-Object Type Data

An input to DFMF is a relation block matrix R that conceptually represents all relation matrices:

R = \begin{bmatrix} \ast & R_{12} & \cdots & R_{1r} \\ R_{21} & \ast & \cdots & R_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ R_{r1} & R_{r2} & \cdots & \ast \end{bmatrix}.   (1)

Here, an asterisk ("*") denotes the relation between objects of the same type, which DFMF does not model through relation matrices. Notice that our method does not require the presence of all relation matrices in Eq. (1). Depending on a particular data setup, any subset of relation matrices might be missing and thus unmodeled. A block in the ith row and jth column (R_ij) of matrix R represents the relationship between object types E_i and E_j. The pth object of type E_i (i.e., o_p^i) and the qth object of type E_j (i.e., o_q^j) are related by R_ij(p, q). An important aspect of Eq. (1) for data fusion, and what distinguishes DFMF from other conceptually related matrix factorization models such as S-NMTF [12] or even tri-SPMF [10], is that it is designed for multi-object type and multi-relational data where the relations can be asymmetric, R_ji ≠ R_ij^T, and some can be completely missing (unknown R_ij) (Section 2.3).

We additionally consider constraints relating objects of the same type. Several data sources of this kind may be available for each object type. For instance, personal relations may be observed from a social network or a family tree. Assume there are t_i ≥ 0 such data sources for object type E_i, represented by a set of constraint matrices Q_i^(t) for t ∈ {1, 2, ..., t_i}. Constraints are collectively encoded in a set of constraint block diagonal matrices Q^(t) for t ∈ {1, 2, ..., max_i t_i}:

Q^{(t)} = \mathrm{Diag}\big(Q_1^{(t)}, Q_2^{(t)}, \ldots, Q_r^{(t)}\big).   (2)

The ith block along the main diagonal of Q^(t) is zero if t > t_i. Entries in constraint matrices are positive for objects that are not similar and negative for objects that are similar. The former are known as cannot-link constraints, because they impose penalties on the current approximation of the matrix factors; the latter are must-link constraints, rewards that reduce the value of the cost function during optimization. A must-link constraint expresses the notion that a pair of objects of the same type should be close in their latent component space. An example of must-link constraints are, for instance, drug-drug interactions, and an example of cannot-link constraints is a matrix of adversaries. Typically, data sources with must-link constraints are more abundant.

The block matrix R is tri-factorized into block matrix factors G and S:

G = \mathrm{Diag}\big(G_1, G_2, \ldots, G_r\big), \quad G_i \in \mathbb{R}^{n_i \times k_i},

S = \begin{bmatrix} \ast & S_{12} & \cdots & S_{1r} \\ S_{21} & \ast & \cdots & S_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ S_{r1} & S_{r2} & \cdots & \ast \end{bmatrix}, \quad S_{ij} \in \mathbb{R}^{k_i \times k_j}.   (3)

Matrix S in Eq. (3) has the same block structure as R in Eq. (1). It is in general asymmetric (i.e., S_ji ≠ S_ij^T), and if a relation matrix is missing in R then its corresponding matrix factor in S is missing as well. These two properties of S stem from our decision to model relation matrices without assuming their structural properties or their availability for every possible combination of object types.

A factorization rank k_i is assigned to E_i during inference of the factorized system. Factor S_ij defines the latent relation between object types E_i and E_j, while factor G_i is specific to objects of type E_i and is used in the reconstruction of every relation involving this object type. In this way, each relation matrix R_ij obtains its own factorization G_i S_ij G_j^T, with factor G_i (G_j) shared across all relations that involve object type E_i (E_j). This can also be observed from the block structure of the reconstructed system G S G^T:

G S G^T = \begin{bmatrix} \ast & G_1 S_{12} G_2^T & \cdots & G_1 S_{1r} G_r^T \\ G_2 S_{21} G_1^T & \ast & \cdots & G_2 S_{2r} G_r^T \\ \vdots & \vdots & \ddots & \vdots \\ G_r S_{r1} G_1^T & G_r S_{r2} G_2^T & \cdots & \ast \end{bmatrix}.   (4)

Here, the pth row of factor G_i holds the latent component representation of object o_p^i. Holding G_j and S_ij fixed, it is clear that the latent component representation of o_p^i depends on G_j as well as on the existence of relation R_ij. Consequently, all direct and indirect relations have a determining influence on the calculation of the latent representation of o_p^i. Just as the objects of type E_i are represented by G_i, each relation is represented by factor S_ij, which models how the latent components interact in the respective relation. The asymmetry of S_ij takes into account whether a latent component occurs as the subject or the object of the corresponding relation R_ij.
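The block-wise reconstruction of Eq. (4) can be sketched in a few lines of NumPy. This is a minimal illustration with invented dimensions and random placeholder factors, not fitted DFMF output; note how each G_i is reused by every relation involving E_i, and how no factor is ever formed for an absent relation such as R_31:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: three object types with n_i objects and ranks k_i
# (all sizes invented for illustration).
n = {1: 6, 2: 5, 3: 4}
k = {1: 3, 2: 2, 3: 2}

# One nonnegative factor G_i per object type, shared across all
# relations that involve E_i.
G = {i: rng.random((n[i], k[i])) for i in n}

# One factor S_ij per *observed* relation; here only R_12 and R_23
# are available, mirroring a fusion system with missing relations.
S = {(1, 2): rng.random((k[1], k[2])), (2, 3): rng.random((k[2], k[3]))}

# Block-wise reconstruction R_ij ~ G_i S_ij G_j^T.
R_hat = {(i, j): G[i] @ S[(i, j)] @ G[j].T for (i, j) in S}
print(R_hat[(1, 2)].shape, R_hat[(2, 3)].shape)  # (6, 5) (5, 4)
```

Because G_2 appears in both reconstructed blocks, information flows between the two data sources through the shared factor, which is the mechanism behind factor sharing described above.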
2.2 Objective Function

The objective function minimized by DFMF aims at a good approximation of the input data and adherence to the must-link and cannot-link constraints:

\min_{G \geq 0} J(G, S) = \sum_{R_{ij} \in \mathcal{R}} \big\| R_{ij} - G_i S_{ij} G_j^T \big\|^2 + \sum_{t=1}^{\max_i t_i} \mathrm{tr}\big(G^T Q^{(t)} G\big).   (5)

Here, ||·|| and tr(·) denote the Frobenius norm and the trace, respectively, and 𝓡 is the set of all relations included in our model. Our objective function explicitly allows that relations between some object types are entirely missing.

Notice that in Eq. (5) we do not approximate the input data by ||R − G S G^T||², as was proposed in the related approaches S-NMTF [12] and tri-SPMF [10]. To model a data system such as that from Fig. 1, one could be tempted to replace the missing relation matrices with zero matrices. This would enable the optimization to further reduce the value of the objective function, but it would also introduce relations into the factorized system that were intentionally not present in the input data. Their inclusion in the model would distort the inferred relations between other object types (see Section 5.1).

2.3 Computing the Factorization

The DFMF algorithm for solving the minimization problem specified in Eq. (5) is shown in Algorithm 1. The algorithm first initializes the matrix factors (Section 2.8) and then iteratively refines them by alternating between fixing G and updating S, and fixing S and updating G, until convergence. Successive updates of G_i and S_ij converge to a local minimum of the optimization problem.

Algorithm 1. Factorization algorithm of the proposed data fusion approach (DFMF).

We derive multiplicative updating rules for the regularized decomposition of relation matrices by fixing one matrix factor (e.g., G) and considering the roots of the partial derivative of the Lagrangian function with respect to the other matrix factor (e.g., S, and vice versa). The Lagrangian is constructed from the objective function (Eq. (5)):

J(G, S) = \sum_{R_{ij} \in \mathcal{R}} \mathrm{tr}\big( R_{ij} R_{ij}^T - 2 G_j^T R_{ij}^T G_i S_{ij} + G_i^T G_i S_{ij} G_j^T G_j S_{ij}^T \big) + \sum_{t=1}^{\max_i t_i} \mathrm{tr}\big(G^T Q^{(t)} G\big).   (6)

Regarding the correctness and convergence of Algorithm 1 we have the following two theorems.

Theorem 1 (Correctness of the DFMF algorithm). If the update rules for matrix factors G and S from Algorithm 1 converge, then the final solution satisfies the KKT conditions of optimality.

Proof. We introduce the Lagrangian multipliers Λ_1, Λ_2, ..., Λ_r and construct the Lagrange function:

L = J(G, S) - \sum_{i=1}^{r} \mathrm{tr}\big(\Lambda_i G_i^T\big).   (7)

Then, for i, j such that R_ij ∈ 𝓡:

\frac{\partial L}{\partial S_{ij}} = -2 G_i^T R_{ij} G_j + 2 G_i^T G_i S_{ij} G_j^T G_j,

and for i = 1, 2, ..., r:

\frac{\partial L}{\partial G_i} = \sum_{j: R_{ij} \in \mathcal{R}} \big( -2 R_{ij} G_j S_{ij}^T + 2 G_i S_{ij} G_j^T G_j S_{ij}^T \big) + \sum_{j: R_{ji} \in \mathcal{R}} \big( -2 R_{ji}^T G_j S_{ji} + 2 G_i S_{ji}^T G_j^T G_j S_{ji} \big) + 2 \sum_{t=1}^{\max_i t_i} Q_i^{(t)} G_i - \Lambda_i.   (8)

Fixing G_1, G_2, ..., G_r and letting ∂L/∂S_ij = 0 for all i, j = 1, 2, ..., r, we obtain:

S = (G^T G)^{-1} G^T R\, G\, (G^T G)^{-1}.

We then fix S and let ∂L/∂G_i = 0 for i = 1, 2, ..., r, which gives an expression for the KKT multiplier Λ_i from Eq. (8). The KKT complementary condition for the nonnegativity of G_i is then

0 = \Lambda_i \circ G_i = \Big[ \sum_{j: R_{ij} \in \mathcal{R}} \big( -2 R_{ij} G_j S_{ij}^T + 2 G_i S_{ij} G_j^T G_j S_{ij}^T \big) + \sum_{j: R_{ji} \in \mathcal{R}} \big( -2 R_{ji}^T G_j S_{ji} + 2 G_i S_{ji}^T G_j^T G_j S_{ji} \big) + 2 \sum_{t=1}^{\max_i t_i} Q_i^{(t)} G_i \Big] \circ G_i,   (9)

where ∘ denotes the element-wise (Hadamard) product. Eq. (9) is a fixed point equation and the solution must satisfy it at convergence. We let:

Q_i^{(t)} = \big(Q_i^{(t)}\big)^+ - \big(Q_i^{(t)}\big)^-,
R_{ij} G_j S_{ij}^T = \big(R_{ij} G_j S_{ij}^T\big)^+ - \big(R_{ij} G_j S_{ij}^T\big)^-,
S_{ij} G_j^T G_j S_{ij}^T = \big(S_{ij} G_j^T G_j S_{ij}^T\big)^+ - \big(S_{ij} G_j^T G_j S_{ij}^T\big)^-,
R_{ji}^T G_j S_{ji} = \big(R_{ji}^T G_j S_{ji}\big)^+ - \big(R_{ji}^T G_j S_{ji}\big)^-,
S_{ji}^T G_j^T G_j S_{ji} = \big(S_{ji}^T G_j^T G_j S_{ji}\big)^+ - \big(S_{ji}^T G_j^T G_j S_{ji}\big)^-,

where all matrices on the right-hand sides are nonnegative. Then, given an initial guess of G_i, the successive updates of G_i using Eqs. (10), (11), and (12) converge to a local minimum of the problem in Eq. (5). It can easily be seen that with such a rule, at convergence, G_i satisfies Λ_i ∘ G_i ∘ G_i = 0, which is equivalent to Λ_i ∘ G_i = 0 (Eq. (9)) due to the nonnegativity of G_i. □

Theorem 2 (Convergence of the DFMF algorithm). The objective function J(G, S) given by Eq. (5) is nonincreasing under the updating rules for matrix factors G and S in Algorithm 1.

Please see the Appendix for a detailed proof of the above theorem. Our proof essentially follows the idea of auxiliary functions often used in the convergence proofs of approximate matrix factorization algorithms [13].

2.4 Stopping Criterion

In this paper we apply data fusion to infer relations between two target object types, E_i and E_j. We hence define a stopping criterion that observes convergence in the approximation of only the target matrix R_ij. Our convergence criterion is the absolute difference of the target matrix reconstruction error, ||R_ij − G_i S_ij G_j^T||², computed in two consecutive iterations of the DFMF algorithm. In our experiments the stopping threshold was set to 10^{-5}. To reduce the computational load, the convergence criterion was assessed only every fifth iteration.

2.5 Parameter Estimation

The parameters of the DFMF algorithm are the factorization ranks, k_1, k_2, ..., k_r. These are chosen from a predefined interval of possible rank values such that their choice maximizes the estimated quality of the model. To reduce the number of required factorization runs we mimic the bisection method by first testing rank values at the midpoint and at the borders of the specified ranges and then, for each rank, selecting the subinterval for which the resulting model was of higher quality. We evaluate the models through the explained variance, the residual sum of squares (RSS) and a measure based on the cophenetic correlation coefficient ρ [14]. We compute these measures for the target relation matrix. The RSS is computed over the observed associations (o_p^i, o_q^j) in R_ij as

\mathrm{RSS}(R_{ij}) = \sum_{(p, q)} \big[ \big(R_{ij} - G_i S_{ij} G_j^T\big)(p, q) \big]^2.

Similarly, the explained variance is R²(R_ij) = 1 − RSS(R_ij) / Σ_{(p,q)} [R_ij(p, q)]². We assess the three quality scores through internal cross-validation and observe how R²(R_ij), RSS(R_ij) and ρ(R_ij) vary with changes of the factorization ranks. We select the ranks k_1, k_2, ..., k_r where the cophenetic coefficient begins to fall, the explained variance is high and the RSS curve shows an inflection point [15].

2.6 Prediction from Matrix Factors

The approximate relation matrix R̂_ij for the target pair of object types E_i and E_j is reconstructed as R̂_ij = G_i S_ij G_j^T. When the model is requested to propose relations for a new object o_{n_i+1}^i of type E_i that was not included in the training data, we need to estimate its factorized representation and use the resulting factors for prediction. We formulate a nonnegative linear least-squares problem and solve it with an efficient interior point Newton-like method [16], minimizing \min_{x_l \geq 0} \|(G_l S_{li} + G_l S_{il}^T) x_l - o_{n_i+1}^{i,l}\|_2^2, where o_{n_i+1}^{i,l} ∈ ℝ^{n_l} is the original description of object o_{n_i+1}^i (if available) and x_l ∈ ℝ^{k_i} is its factorized representation (for l = 1, 2, ..., r and l ≠ i). A solution vector given by Σ_l x_l is added to G_i and a new R̂_ij ∈ ℝ^{(n_i+1) × n_j} is computed.

We would like to identify object pairs (o_p^i, o_q^j) for which the predicted degree of relation R̂_ij(p, q) is unusually high. We are interested in candidate pairs (o_p^i, o_q^j) for which the estimated association score R̂_ij(p, q) is greater than the mean estimated score of all known relations of o_p^i:

\hat{R}_{ij}(p, q) > \frac{1}{|A(o_p^i, E_j)|} \sum_{o_m^j \in A(o_p^i, E_j)} \hat{R}_{ij}(p, m),   (13)

where A(o_p^i, E_j) is the set of all objects of E_j related to o_p^i. Notice that this rule is row-centric; that is, given an object of type E_i, it searches for objects of the other type (E_j) that it could be related to. We can modify the rule to become column-centric, or even combine the two rules.

For example, let us consider that we are studying disease predispositions for a set of patients. Let the patients be objects of type E_i and the diseases objects of type E_j. A patient-centric rule would consider a patient and his medical history and, through Eq. (13), propose a set of new disease associations. A disease-centric rule would instead consider all patients already associated with a specific disease and identify other patients with a sufficiently high association score.

We can also combine the row-centric and column-centric approaches. For example, we can first apply the row-centric approach to identify candidates of type E_i and then estimate the strength of association to a specific object o_q^j by reporting the inverse percentile of the association score in the distribution of scores for all true associations of o_q^j, that is, by considering the scores in the qth column of R̂_ij. In our gene function prediction study, we use the row-centric approach for candidate identification and the column-centric approach for association scoring, and in the experiment from cheminformatics we apply the row-centric approach to both tasks.
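The row-centric rule of Eq. (13) reduces to thresholding reconstructed scores against the mean score of an object's known associations. The following is a minimal sketch with an invented score matrix and invented known associations, not data from the paper:

```python
import numpy as np

# Hypothetical reconstructed target matrix R_hat for 3 objects of
# type E_i and 4 objects of type E_j (all scores invented).
R_hat = np.array([[0.9, 0.1, 0.6, 0.8],
                  [0.2, 0.5, 0.3, 0.9],
                  [0.4, 0.4, 0.9, 0.1]])

# Known associations A(o_p^i, E_j): column indices of E_j objects
# already related to each object o_p^i.
known = {0: [0, 2], 1: [1], 2: [2]}

def row_centric_candidates(p):
    """Return unobserved columns q whose predicted score exceeds the
    mean predicted score of o_p^i's known associations (Eq. (13))."""
    threshold = np.mean([R_hat[p, m] for m in known[p]])
    return [q for q in range(R_hat.shape[1])
            if q not in known[p] and R_hat[p, q] > threshold]

print(row_centric_candidates(1))  # [3]
```

A column-centric variant would apply the same comparison down a column of `R_hat`, over the known associations of an object of type E_j.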
2.7 An Ensemble Approach to Prediction

Different initializations of G_i may in practice give rise to different factorizations of the fusion system. To leverage this effect we construct an ensemble of factorization models. The resulting matrix factors in each model may also differ due to small random perturbations of the selected factorization ranks. We use each factorization system for the inference of associations (Section 2.6) and then select a candidate pair through a majority vote; that is, the rule from Eq. (13) must apply in more than one half of the factorized systems in the ensemble. Ensembles improved the predictive accuracy and stability of the factorized system and the robustness of the results. In our experiments the ensembles combined 15 factorization models.

2.8 Matrix Factor Initialization

The inference of the factorized system from Section 2.1 is sensitive to the initialization of factor G. Proper initialization sidesteps the issue of local convergence and reduces the number of iterations needed to obtain matrix factors of equal quality. We initialize G by separately initializing each G_i, using algorithms for single-matrix factorization. Factors S are computed from G (Algorithm 1) and do not require initialization.

Wang et al. [10] and several other authors [13] use simple random initialization. Other, more informed initialization algorithms include random C [17], random Acol [17], non-negative double SVD and its variants [18], and k-means clustering or relaxed SVD-centroid initialization [17]. We show that the latter approaches are indeed better than random initialization (Section 5.4). We use random Acol in our case study. Random Acol computes each column of G_i as the element-wise average of a random subset of columns in R_ij.

3 RELATED WORK

Approximate matrix factorization estimates a data matrix R as a product of low-rank matrix factors that are found by solving an optimization problem. In two-factor decomposition, R ∈ ℝ^{n × m} is decomposed into a product W H, where W ∈ ℝ^{n × k}, H ∈ ℝ^{k × m} and k ≪ min(n, m). A large class of matrix factorization algorithms minimizes the discrepancy between the observed matrix and its low-rank approximation, such that R ≈ W H. For instance, SVD, non-negative matrix factorization and exponential family PCA all minimize a Bregman divergence [19].

Although often used in data analysis for dimensionality reduction, clustering or low-rank approximation, there have been only a few applications of matrix factorization in data fusion. Such approaches, however, cannot be used to model relations between more than two object types, which is a major distinction with the algorithm proposed in this paper.

Zhang et al. [22] proposed a joint matrix factorization that decomposes a number of data matrices R_i into a common basis matrix W and different coefficient matrices H_i, such that R_i ≈ W H_i, by minimizing Σ_i ||R_i − W H_i||². This is an intermediate integration approach with different data sources, but it can describe only relations whose objects (i.e., rows in R_i) are fixed across the relation matrices. Similar approaches with various types of regularization were also proposed, such as network- or relation-regularized constraints [23], [24] and hierarchical priors [25], [26]. Our work generalizes these approaches by simultaneously dealing with objects of different types, where we can vary the object types along both dimensions of the relation matrices R_ij and can constrain objects of every type.

There is an abundance of work on matrix factorization models that consider a single dyadic relation matrix or multiple relation matrices between the same two types of objects [10], [21], [24], [26], [27], [28]; these are subsumed by our approach. For instance, Nickel [29] proposed a tri-matrix factorization model for multiple dyadic relations that factorizes every R_i as R_i ≈ A S_i A^T. Although their model is appropriate for certain tasks of collective learning, all R_i describe relations between the same two sets of objects, whereas our approach models multi-relational and multi-object type data.

Rettinger et al. [30] proposed a context-aware tensor decomposition for relation prediction in social networks, CARTD. They decompose a tensor into additive factorized matrices using two-factor decomposition. They assume that the input data is provided together with contextual information that describes one specific relation, the recommendation. The drawback of their approach and of similar approaches [27], [31], [32] for r-ary tensors is that in higher dimensions (r > 3) the tensors become increasingly sparse and the computational requirements become infeasible. Notice that r here corresponds to the number of different object types in DFMF. In comparison, the approach proposed in this paper can handle tens of different object types.

Wang et al. [10] and Wang et al. [12] proposed tri-SPMF and S-NMTF, respectively, a simultaneous clustering of multi-type relational data via symmetric nonnegative matrix tri-factorization. These two methods are conceptually similar to our approach and use both inter-type and intra-type relations, but they require a full set of symmetric relation matrices, R_ij = R_ji^T. These assumptions of tri-SPMF and S-NMTF are rarely met in real-world fusion scenarios (see, for example, the fusion configuration from Fig. 2, which is not a six-clique), where we do not have access to relation matrices between all possible pairs of object types (i.e., R_ij for 1 ≤ i < j ≤ r).
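Among the related models discussed above, the shared-basis factorization of Zhang et al. [22] is compact enough to sketch. The following toy illustration uses random placeholder matrices with invented sizes and only evaluates the joint objective Σ_i ||R_i − W H_i||²; no optimization is performed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical sources sharing the same 8 row objects but with
# different column dimensions (all sizes invented for illustration).
R = [rng.random((8, 5)), rng.random((8, 7))]

k = 3
W = rng.random((8, k))                          # common basis shared by all sources
H = [rng.random((k, Ri.shape[1])) for Ri in R]  # per-source coefficient matrices

# Joint objective sum_i ||R_i - W H_i||_F^2 over all sources; the
# factors here are random placeholders, not a fitted decomposition.
objective = sum(np.linalg.norm(Ri - W @ Hi, "fro") ** 2 for Ri, Hi in zip(R, H))
print(objective >= 0.0)  # True
```

The contrast with DFMF is visible in the shapes alone: every R_i must share the same row objects (and hence the single basis W), whereas DFMF lets the object types, and thus the factors, vary along both dimensions of each relation matrix.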