
Figure (opening comic): Jane looks for help. NAR has just published 176 new bio databases*, and Jane's data look like a messy hairball of gene-gene, gene-pathway and gene-annotation links with many different edge types (e.g., homologous recombination and DNA repair genes, GO annotations, yeast phenotypes). Stitching them into a single data table turns out to be a nightmare: it could be worked out by hand for one data table, but not for every data source out there.

* Fernandez-Suarez & Galperin, Nucleic Acids Research, 2013.

Large-scale data fusion by collective matrix factorization
Tutorial at the Basel Computational Biology Conference, Basel, Switzerland, 2015

Welcome to the hands-on Data Fusion Tutorial! This tutorial is designed for researchers and biologists with an interest in data analysis and large-scale integrative studies. We will explore latent factor models, a popular class of approaches that have in recent years seen many successful applications in integrative data analysis. We will describe the intuition behind matrix factorization and explain why factorization approaches are suitable when collectively analyzing many heterogeneous data sets. To practice data fusion, we will construct visual data fusion workflows using Orange and its Data Fusion Add-on. If you haven't already installed Orange, please follow the installation guide at http://biolab.github.io/datafusion-installation-guide.

These notes include an introduction to integrative data analysis, examples from collaborative filtering and systems biology, and Orange workflows that we will construct during the tutorial. Tutorial instructors: Marinka Zitnik and Blaz Zupan, with help from members of the Bioinformatics Lab, Ljubljana.

* See http://helikoid.si/recomb14/zitnik-zupan-recomb14.png for our full award-winning poster on data fusion.


Lesson 1: Everything is a Matrix

In many data mining applications there are plenty of potentially beneficial data available. However, these data naturally come in various formats and at different levels of granularity, can be represented in totally different input data spaces, and typically describe distinct data types.

For joint predictive modeling of heterogeneous data we need a generic way to encode data that might be fundamentally different from each other, both in type and in structure. An effective way to organize a data compendium is to view each data set as a matrix. Matrices describe dyadic relationships, that is, relationships between two groups of objects. A matrix relates objects in the rows to objects in the columns. Examples of data matrices commonly used in the analysis of biological data include protein-protein interaction scores from the STRING database, represented in a gene-to-gene matrix:

A gene interaction network can easily be converted to a matrix: each weighted edge in the network corresponds to a matrix entry. (Figure: a small interaction network among genes such as gacT, gemA, rdiA, racN, racJ, racI, xacA and racM, shown next to its gene-to-gene matrix.)
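To make the idea concrete, here is a minimal Python/NumPy sketch (not part of the original tutorial workflows) that converts a weighted edge list into such a gene-to-gene matrix; the gene names and weights below are illustrative.

# A minimal sketch: turning a weighted gene interaction network into a
# gene-to-gene data matrix with NumPy. Gene names and weights are made up.
import numpy as np

edges = [("gacT", "gemA", 0.9), ("gemA", "rdiA", 0.4), ("rdiA", "racN", 0.7)]

genes = sorted({g for edge in edges for g in edge[:2]})
index = {gene: i for i, gene in enumerate(genes)}

R = np.zeros((len(genes), len(genes)))
for g1, g2, weight in edges:
    R[index[g1], index[g2]] = weight
    R[index[g2], index[g1]] = weight  # undirected network: symmetric matrix

print(genes)
print(R)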

Binary matrices can be used to associate Gene Ontology terms with cellular pathways:

Binary relations between two object types can be represented with a binary matrix. (Figure: genes from parts of the N-Glycan biosynthesis and fructose and mannose metabolism pathways, e.g. alg13, alg7, alg1, alg14, dpm1, dpm2, dpm3, associated with the Gene Ontology term protein N-linked glycosylation (GO:0006487); a small table further relates KEGG Orthology entries, e.g. dolichol kinase (K00902), alpha-mannosidase II (K01231) and the oligosaccharyltransferase complex (K12668), to the ontology terms GO:0004168, GO:0004572 and GO:0008250.)


They can also associate research articles with Medical Subject Headings (MeSH):

Papers cited in PubMed are tagged with MeSH terms. We can use one large binary matrix to encode the relations between research articles and MeSH terms. (Figure: literature articles in rows and MeSH terms such as cell separation, cytoplasmic vesicles/metabolism, ethidium/metabolism, innate immunity, mutation, phagocytes/cytology, phagocytes/immunology and phagocytosis in columns.)

or the membership of genes in pathways, with one column for each pathway:

Just like the relations between MeSH terms and scientific papers, we can encode pathway memberships of genes in one large matrix that has genes in rows and pathways in columns. (Figure: genes such as alg13, alg7, alg1, alg2, alg14, alg11, alg3, alg12, alg9, dpm1, dpm2 and dpm3 assigned to the N-Glycan biosynthesis, fructose and mannose metabolism, and GPI-anchor biosynthesis pathways.)

The structure of Gene Ontology can be represented with a real-valued matrix whose elements represent distance or semantic similarity between the corresponding ontological terms:

Any ontology can be represented with a square matrix. We use the ontology to measure distances between its entities and encode these distances in a distance matrix. (Figure: a part of the Gene Ontology graph, with terms such as response to stress, response to external stimulus, response to biotic stimulus, response to other organisms, defense response and defense response to bacterium, shown next to the corresponding term-to-term distance matrix.)
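As a rough illustration (not from the tutorial), one simple way to obtain such a matrix is to use shortest-path distances in the ontology graph. The toy graph below is a simplified excerpt mimicking the terms in the figure above.

# Build a term-to-term distance matrix from a toy ontology graph using
# shortest-path (BFS) distances. The edge list is a simplified, made-up
# excerpt of the Gene Ontology subgraph shown above.
from collections import deque
import numpy as np

edges = [
    ("response to stress", "defense response"),
    ("response to external stimulus", "response to other organisms"),
    ("response to biotic stimulus", "response to other organisms"),
    ("response to other organisms", "defense response to bacterium"),
    ("defense response", "defense response to bacterium"),
]

terms = sorted({t for e in edges for t in e})
neighbors = {t: set() for t in terms}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

def bfs_distances(source):
    """Return shortest-path distances from source to every reachable term."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

D = np.zeros((len(terms), len(terms)))
for i, t in enumerate(terms):
    dist = bfs_distances(t)
    for j, u in enumerate(terms):
        D[i, j] = dist.get(u, np.inf)

print(terms)
print(D)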

Lesson 2: The Challenge

Suppose we would like to identify genes whose mutants exhibit a certain phenotype, e.g., genes that are sensitive to Gram-negative bacteria. In addition to current knowledge about phenotypic annotations, i.e. data encoded in a gene-to-phenotype matrix, which might be incomplete and contain some erroneous information, there exists a variety of circumstantial evidence, such as gene expression data, literature data, annotations of research articles, etc.

An obvious question is how to link these seemingly disparate data sets. In many applications there exists some correspondence between different input dimensions. For example, genes can be linked to MeSH terms via gene-to-publication and publication-to-MeSH-term data matrices. This is an important observation, which we exploit to define a relational structure of the entire data system.

The major challenge for such problems is how to jointly model multiple types of data heterogeneity in a mutually beneficial way. For example, in the scheme below, can information about the relatedness of MeSH terms and similarity between phenotypes from the Phenotype Ontology help us to improve the accuracy of recognizing Gram negative defective genes?

The data excerpt on the right comes from a gene prioritization problem where our goal was to find candidates for bacterial response genes in the social amoeba Dictyostelium. Other than for a few seed genes, there was not any data from which we could directly infer the bacterial phenotype of mutants. Hence, we considered circumstantial data sets and hoped that their fusion would uncover interesting new bacterial response genes. (Figure: a fusion layout relating genes such as spc3, swp1, kif9, alyL, nagB1, gpi, shkA and nip7 to mutant phenotypes, e.g. Gram neg. defective, Gram pos. defective, aberrant spore color, decreased chemotaxis, and to publications, MeSH terms and expression timepoints, together with the Phenotype Ontology, the MeSH Ontology, PubMed data, MeSH annotations, expression data and phenotype data.)


Lesson 3: Recommender Systems

Sparse matrices and matrix completion have been thoroughly addressed in the area of machine learning called recommender systems. Several methods from this field form the foundation for matrix-based data fusion. Hence, we diverge here from fusion to recommender systems, and for a while, from biology to movies.

How would you decide which movie to recommend to a friend? Obviously, a useful source of information might be ratings of the movies your friend had seen in the past, i.e. one star up to five stars. Movie recommender systems primarily use user ratings information from which they estimate correlations between different movies and similarities between users and infer a prediction model which can be used to make recommendations about which movie a user should see next.

For example, in the figure below we see a movie ratings data matrix containing information for four users and four movies. Notice that in a real setting such matrices can contain information for millions of users and hundreds of thousands of movies. However, each individual user typically sees only a small proportion of all the movies and rates even fewer of them. Hence, data matrices in recommender systems are typically extremely sparse, e.g., it is common that up to ~99% of matrix elements are unknown. This sparsity, together with the strong relational structure of the data (i.e. "you might enjoy movies that users similar to you are enthusiastic about" and "you might like movies that are similar to the movies you have already seen and rated favorably"), is what recommendation models exploit.

Is there an analogy between recommender systems and challenges in systems biology? We will answer this question in the next lessons.

John, Kate, Alex and Mike rated a selection from four movies: Passengers, War of the Worlds, Bride Wars and The Matrix Reloaded. John, for example, has seen Passengers and Bride Wars, and did not like them so much. Which of the two other movies, if any, should he see? The movie rating matrix has users in rows and movies in columns. We made this explicit in a simple graph: object types are represented as nodes (User, Movie) and the edge between them is labeled with the matrix that relates them. (Figure: the partially observed ratings matrix, with two ratings per user, e.g. John 2 and 3, Kate 5 and 4, Alex 4 and 5, Mike 4 and 5, next to the User-Movie fusion graph.)


Lesson 4: Matrix Factorization and Completion

Taking our four-by-four movie ratings matrix, we can try to factorize it into a product of two much smaller latent matrices called latent factors. One latent matrix describes latent profiles of the users and the other contains the latent representation of the movies.

For example, each of our four users is described by a latent profile of length two and similarly, each movie is explained via two latent components, i.e. L1 and L2. The dimensionality of a latent model is typically called factorization rank.

Two-factorization of the user-movie rating matrix from the previous page; a factorization rank of 2 was used. Should the factorization rank be the same for both latent matrices?

User latent matrix (users x 2):
John 0.2 0
Kate 0 0.5
Alex 0.5 0
Mike 0 0.5

Movie latent matrix (2 x movies; columns are Passengers, War of the Worlds, Bride Wars, The Matrix Reloaded):
L1 6.3 0 1.1 8
L2 3.9 10.7 0 3.3

The challenge of matrix factorization stems from the difficulty of estimating the latent matrices in a way that their matrix product minimizes some measure of discrepancy between the input data matrix and its reconstruction obtained by factorization. Importantly, the reconstructed matrix is complete, i.e. all of its elements are defined, which we exploit for making predictions.

The latent model, that is, the two latent matrices, is complete, hence their product is also a complete matrix. This product (the matrix on the right) is an estimate of the original matrix (the matrix on the left). How good is our reconstruction? Which of the two movies should be recommended to Mike?

Observed ratings (left) and the reconstructed, complete matrix (right; columns are Passengers, War of the Worlds, Bride Wars, The Matrix Reloaded):
John 2 3 ~ 1.3 0 0.2 1.6
Kate 5 4 ~ 2 5.4 0 1.7
Alex 4 5 ~ 3.2 0 0.6 4
Mike 4 5 ~ 2 5.4 0 1.7
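The reconstruction is nothing more than the matrix product of the two latent factors. A short NumPy sketch (illustrative, not part of the tutorial workflows) reproduces the estimate above from the latent matrices shown on the previous figure.

# Reconstruct the complete ratings matrix by multiplying the two latent
# factors from the two-factorization figure above.
import numpy as np

users = np.array([[0.2, 0.0],    # John
                  [0.0, 0.5],    # Kate
                  [0.5, 0.0],    # Alex
                  [0.0, 0.5]])   # Mike

movies = np.array([[6.3, 0.0, 1.1, 8.0],     # latent component L1
                   [3.9, 10.7, 0.0, 3.3]])   # latent component L2

reconstruction = users @ movies  # complete 4x4 estimate of the ratings matrix
print(reconstruction)
# Rows: John, Kate, Alex, Mike; columns: Passengers, War of the Worlds,
# Bride Wars, The Matrix Reloaded. The values match the rounded entries in
# the figure above; Mike's row suggests War of the Worlds as his next movie.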


Lesson 5: Matrix Tri-Factorization

So far, we found a decomposition of the movie ratings matrix into two latent matrices. An alternative approach is to factorize it into three latent matrices: one latent matrix that expresses the degrees of user membership to each of the latent components, i.e. the user recipe matrix; another latent matrix with memberships of movies to movie-specific latent components, i.e. the movie recipe matrix; and a third matrix, i.e. a backbone matrix, which captures the interactions between latent components specific to the users, i.e. U1, U2, and components specific to the movies, i.e. M1, M2.

The backbone matrix (a 2x2 matrix in the middle) can be seen as a compressed version of the original user-movie rating matrix. It has "meta" users in rows and "meta" movies in columns. We can use the two recipe matrices (the left and right matrices) to transform the backbone matrix back to the original user-movie space.

User recipe matrix (users x U1, U2):
John 0.2 0.3
Kate 0.8 0.2
Alex 0.7 1.2
Mike 0.8 0.1

Backbone matrix (U1, U2 x M1, M2):
U1 -4.4 9.1
U2 6.7 -5.8

Movie recipe matrix (M1, M2 x movies; columns are Passengers, War of the Worlds, Bride Wars, The Matrix Reloaded):
M1 0.9 0.2 0.2 1
M2 0.6 0.8 0.1 0.7

Similarly to the previous lesson, the goal of matrix tri-factorization is to estimate three latent matrices that provide a quality approximation of the observed entries in the input data matrix. By selecting a sufficiently small factorization rank we compress the data, which ensures generalization and, consequently, prediction of how a given user would enjoy a particular movie he has not seen before.

Just like for two-factorization, in tri-factorization the latent matrices are complete, and so is their product (the matrix on the right). This product is also an estimate of the original matrix. How good is our estimate? Which movie should be recommended to Mike?

Observed ratings (left) and the tri-factorization estimate (right; columns are Passengers, War of the Worlds, Bride Wars, The Matrix Reloaded):
John 2 3 ~ 1.1 0.3 0.2 1.2
Kate 5 4 ~ 1.7 4.5 0.2 2.1
Alex 4 5 ~ 4.1 0.5 0.9 4.5
Mike 4 5 ~ 1.5 4.8 0.1 1.9
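Again, the estimate is simply the product of the three latent matrices. An illustrative NumPy sketch (not part of the tutorial workflows) reproduces the tri-factorization estimate above.

# Reconstruct the ratings estimate from the three latent matrices of the
# tri-factorization figures above.
import numpy as np

user_recipe = np.array([[0.2, 0.3],   # John
                        [0.8, 0.2],   # Kate
                        [0.7, 1.2],   # Alex
                        [0.8, 0.1]])  # Mike

backbone = np.array([[-4.4, 9.1],    # U1 x (M1, M2)
                     [6.7, -5.8]])   # U2 x (M1, M2)

movie_recipe = np.array([[0.9, 0.2, 0.2, 1.0],   # M1
                         [0.6, 0.8, 0.1, 0.7]])  # M2

estimate = user_recipe @ backbone @ movie_recipe
print(estimate)
# Rows: John, Kate, Alex, Mike; columns: Passengers, War of the Worlds,
# Bride Wars, The Matrix Reloaded. The values match the rounded entries in
# the figure above; Mike's highest estimate is again War of the Worlds.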


Lesson 6: Tri-Factorization in Orange

Let's try matrix tri-factorization in practice. We construct a visual workflow in Orange, a data mining suite. The workflow loads movie ratings, represents them with a data matrix, tri-factorizes it and explores the latent factors.

In Orange workflows, components (widgets) load or process data and pass the information to other widgets. A widget's inputs are on its left and its outputs on its right. Try adding a Data Table widget to display an input data set and any of the latent factors!

In this tutorial we organize the data sets using a structure that we call a data fusion graph. It shows the relational structure of the entire data compendium. Each distinct type of objects, e.g., users and movies, is represented with a node, and each data set corresponds to an edge that relates two types of objects; e.g., the movie ratings data relate users with movies.

In the Latent Factors widget one can select any of the latent matrices and explore them further, say, through hierarchical clustering.


Lesson 7: Collective Matrix Factorization and Sharing of Latent Factors

In the previous lesson we analyzed a single data set. Ultimately, we would like to collectively tri-factorize many heterogeneous data sets across different input spaces. Suppose we have collected information about movie genres. This is a relation that relates movies to genres, hence our data fusion graph gets an additional node, i.e. genres, and an edge linking movies with genres.

The Orange workflow on this page adds another data source: movie genres. How does that affect the results of the movie clustering?

To fuse heterogeneous data at large scales we need to define the kind of knowledge that can be transferred between related data matrices, types of objects and prediction tasks. Data fusion algorithms typically rely on one of the following three assumptions:

Relation transfer: We build the relational map called a data fusion graph of all the relations considered in data fusion and relax the assumptions about independently and identically distributed relations.

Object type transfer: We assume that there exists a common feature space shared by the input spaces, which can be used as a bridge to transfer knowledge.

Parameter transfer: We make use of latent model parameterization and assume that heterogeneous input spaces have shared latent parameters and hyperparameters.

In collective matrix factorization we achieve data fusion by sharing latent matrices across related data sets.

In our running example we reuse the movie recipe matrix in both decompositions of user-to-movie as well as movie-to-genre matrices. Importantly, collective matrix factorization estimates the latent matrices for all data sets in a compendium simultaneously, which ensures transfer of knowledge between data, i.e. data fusion, and presents many unique opportunities from the application perspective and challenges in algorithmic design.
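To make the idea of sharing latent factors concrete, below is a small, self-contained NumPy sketch. It is an illustration under simplifying assumptions, not the DFMF algorithm behind the Orange add-on: it fits two tri-factorizations, of a user-movie and a movie-genre matrix, by plain gradient descent while forcing them to share the movie recipe matrix. All matrices, ranks and the learning rate are made up.

# Simplified collective matrix tri-factorization with a shared movie recipe
# matrix, fitted by gradient descent on the squared reconstruction error.
import numpy as np

rng = np.random.default_rng(0)

R_um = rng.random((30, 20))   # users x movies (e.g., ratings)
R_mg = rng.random((20, 5))    # movies x genres (e.g., genre memberships)

k_user, k_movie, k_genre = 4, 3, 2

G_user = rng.random((30, k_user))
G_movie = rng.random((20, k_movie))   # shared between both decompositions
G_genre = rng.random((5, k_genre))
S_um = rng.random((k_user, k_movie))
S_mg = rng.random((k_movie, k_genre))

lr = 0.0005
for step in range(3000):
    E_um = R_um - G_user @ S_um @ G_movie.T   # residual of the users-movies fit
    E_mg = R_mg - G_movie @ S_mg @ G_genre.T  # residual of the movies-genres fit

    # Gradients of the squared reconstruction error. The shared movie recipe
    # matrix receives contributions from both data sets: this is where the
    # transfer of knowledge (data fusion) happens.
    grad_G_user = -2 * E_um @ G_movie @ S_um.T
    grad_G_genre = -2 * E_mg.T @ G_movie @ S_mg
    grad_G_movie = -2 * E_um.T @ G_user @ S_um - 2 * E_mg @ G_genre @ S_mg.T
    grad_S_um = -2 * G_user.T @ E_um @ G_movie
    grad_S_mg = -2 * G_movie.T @ E_mg @ G_genre

    G_user -= lr * grad_G_user
    G_movie -= lr * grad_G_movie
    G_genre -= lr * grad_G_genre
    S_um -= lr * grad_S_um
    S_mg -= lr * grad_S_mg

    if step % 1000 == 0:
        loss = (E_um ** 2).sum() + (E_mg ** 2).sum()
        print(f"step {step}: loss {loss:.3f}")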


Lesson 8: More Complex Fusion Schemes, Data Sampling and Completion Scoring

So far we fused at most two data sets. Let’s proceed by constructing a larger data compendium. There are many other sources of information that might be informative for movie recommendation, for example, user demographics profiles, movie casting, information about movie directors and screenplays, scenery, etc.

We construct an Orange workflow that considers four data sources, i.e. movie ratings, movie casting, genres and relationships between actors, and fuse them via collective matrix factorization.

This data fusion configuration is already a complex one. We are using four different data sources. Try having a Fusion Graph widget window open, so that you can see the data fusion schema as it shapes up when adding the data sets.

A simple way to assess the benefits of integrative data analysis over the analysis of a single homogeneous data set is to measure the quality of predictions made by data fusion versus the quality of a prediction model inferred from only a part of the data collection.

The assessment is fair if we evaluate predictions for data that are hidden from the algorithm during prediction model inference. There are four different ways of partitioning a data matrix into a training and a test set:


In predictive modeling tasks, such as movie recommendation, where we regress against the target variable, i.e. the movie rating, we can evaluate model quality by reporting a variety of measures, including the root mean squared error (RMSE). A lower RMSE value indicates a better model. Alternatively, if our goal were to rank the movies from what the model believes are the most enjoyable to the least enjoyable for a given user, we would use the area under the curve (AUC).
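Below is a minimal sketch of entry-wise hold-out evaluation with made-up data (not the evaluation code used by the Orange widgets): hide a random subset of matrix entries, fit a low-rank model on the remaining ones, and report RMSE on the hidden entries.

# Entry-wise hold-out evaluation with RMSE on hidden entries. The ratings
# matrix and the rank-2 "model" (a truncated SVD with held-out cells filled
# by column means) are made up; the point is the split and the RMSE.
import numpy as np

rng = np.random.default_rng(1)
R = rng.integers(1, 6, size=(50, 40)).astype(float)  # toy ratings 1..5

test_mask = rng.random(R.shape) < 0.2   # hide ~20% of the entries
R_train = R.copy()
R_train[test_mask] = np.nan

# Fill hidden entries with column means so that we can run a plain SVD.
col_means = np.nanmean(R_train, axis=0)
R_filled = np.where(np.isnan(R_train), col_means, R_train)

U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
rank = 2
R_hat = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]   # completed matrix

rmse = np.sqrt(np.mean((R[test_mask] - R_hat[test_mask]) ** 2))
print(f"RMSE on held-out entries: {rmse:.3f}")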

How does the quality of reconstruction change when adding or removing data sets from the fusion schema? Try it out! Should RMSE always decrease with new data sources being added?


Lesson 9: “Meta Genes” - Latent Profiling

Until now we have focused on non-biological data. We now apply a latent factor model to gene expression data. The microarray data for this example is from an influential paper by DeRisi, Iyer, and Brown (Science 1997), who explored the metabolic and genetic control of gene expression on a genomic scale. The authors used DNA microarrays to study temporal gene expression of almost all genes in baker's yeast Saccharomyces cerevisiae during the metabolic shift from fermentation to respiration. Expression levels were measured at seven time points during the diauxic shift, denoted T1 to T7.

(Figure: the gene expression data matrix, with genes in rows and the seven time points T1 to T7 of the diauxic shift experiment in columns; a few gene expression profiles, V1 to V5, are plotted over the time points.)

As we will see in this and in the next lessons, collective matrix factorization is a generic and flexible tool for integrative data analysis in different domains, e.g., recommender systems and functional genomics.

We construct an Orange workflow that reads the expression data into Orange using the Table to Relation widget, tri-factorizes the data, and explores the estimated latent data representation using various Orange widgets, such as Linear Projection, Scatter Plot and Multi-dimensional Scaling (MDS).

What is similar between matrix-based movie recommendation and data fusion in molecular biology? Everything! We'll use the same set of Orange widgets for bio data fusion. All tricks that we have learned so far apply.


By factorizing gene expression data we obtained three latent data matrices: a gene recipe matrix, an experiment recipe matrix, and a backbone matrix that relates both recipe matrices in the latent space.

It is common in matrix factorization algorithms to interpret the experiment recipe matrix as a matrix that reports on expression of “meta genes,” i.e. “genes,” whose profiles are obtained from the original gene expression profiles by a linear (or, non- linear, depending on a latent factor model) transformation. Similarly, one can see the gene recipe matrix as a matrix that reports on expression of genes in “meta experiments,” i.e. “experiments,” which cannot be interpreted in an intuitive manner but which can improve the quality of prediction models applied to them, e.g., clustering of genes based on their recipe matrix and enrichment analysis of detected clusters.
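As suggested above, one simple way to use the gene recipe matrix is to cluster genes on their latent profiles and then run enrichment analysis on the clusters. A minimal SciPy sketch with a made-up recipe matrix (not the Orange workflow itself) could look like this:

# Hierarchically cluster genes on their latent profiles (rows of the gene
# recipe matrix). The recipe matrix here is random and stands in for the
# factor estimated from the expression data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
gene_recipe = rng.random((100, 5))          # 100 genes x 5 latent components

Z = linkage(gene_recipe, method="average", metric="euclidean")
clusters = fcluster(Z, t=4, criterion="maxclust")   # cut into 4 clusters

for c in np.unique(clusters):
    members = np.where(clusters == c)[0]
    print(f"cluster {c}: {len(members)} genes")
# Each cluster of genes could then be passed to GO enrichment analysis.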

Combination of hierarchical clustering and GO enrichment analysis is a cool way to explore the results of data fusion. Genes in the data set we are exploring are also function-labeled. Any other ideas how to use the latent matrices? Classification, perhaps? And then estimation of AUC in cross-validation?

Lesson 10: The Yeast Case Study

Next, we collectively analyze eight data sets from the molecular biology of yeast S. cerevisiae (load the data sets from http://bit.ly/1Gb8SJ7). We organize them in a data fusion graph with six object types and eight edges, one for each data set.

This schema looks boring. But it offers so much for the patient one! Try adding matrix sampling and RMSE- based evaluation! Or clustering with gene set enrichment. Or data projection based on any of the latent matrices.


Lesson 11: Latent Matrix Chaining

The concept of chaining latent matrices is important because it allows us to profile objects in the latent space of any other object type based on the connectivity in the data fusion graph.

In the simplest scenario, where object types are adjacent in the fusion graph, e.g., "Genes" and "Experiments" from Lesson 9, chaining constructs data profiles of one object type, e.g., genes, in the latent space of another object type, e.g., experiments, by multiplying the recipe matrix of the first object type by the backbone matrix of the data set. The resulting profile matrix has objects of the first type, e.g., genes, in rows and the latent components of the second type, e.g., experiments, in columns.

However, the power of chaining becomes apparent when we would like to profile objects whose types are not direct neighbors in the fusion graph, such as "Genes" and "Literature Topics," i.e. MeSH terms, in the fusion graph from Lesson 10. To profile genes in the latent space of literature topics, chaining starts with the recipe matrix of genes and multiplies it by the backbone matrices of the gene-to-literature and literature-to-literature-topic data sets on the path from "Genes" to "Literature Topics" in the fusion graph. This procedure yields profiles of genes in the latent space of literature topics.

Latent matrix chaining constructs dense profiles that include the most informative features obtained by collectively compressing data via matrix factorization. Intuitively, chaining is able to establish links between genes and literature topics even though relationships between these object types are not available in input data.
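In matrix terms, the chained profile is simply the product of the recipe matrix with the backbone matrices along the path. A small NumPy sketch with made-up factor shapes:

# Chaining of latent matrices along the path
# Genes -> Literature -> Literature Topics (all factors are made up).
import numpy as np

rng = np.random.default_rng(3)

G_gene = rng.random((200, 6))        # gene recipe matrix: genes x k_gene
S_gene_lit = rng.random((6, 4))      # backbone of the gene-to-literature data
S_lit_topic = rng.random((4, 3))     # backbone of the literature-to-topic data

# Profile of genes in the latent space of literature topics.
gene_topic_profile = G_gene @ S_gene_lit @ S_lit_topic
print(gene_topic_profile.shape)      # (200, 3): genes x latent topic components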

A conceptual presentation of profiling genes in the latent space of MeSH terms. The MeSH-based gene profiles are constructed by multiplying the latent factors on the path from one object type to the other. (Figure: the path from "Gene" through "Literature" to "Literature Topic" in the fusion graph of yeast biology, and the corresponding chain of latent matrices whose product is the gene profile matrix.)

In Orange we chain the latent matrices of a data system using the Chaining widget.

The Chaining widget allows us to select a start object type and a target object type (highlighted in orange below) in the data fusion graph. It then computes the chains associated with the selected nodes of the fusion graph. The profile matrices obtained in this way can be used for further data analysis.

Chaining and the construction of relations between objects that were originally not related in any input data set comes as an extra benefit of matrix-based data fusion. Try exploring the chaining results by feeding them into a data table first, and then push them through an unsupervised or supervised analysis pipeline.


Lesson 12: Case Studies in Data Fusion

Identification of the mechanisms of action of chemical compounds is a crucial task in drug discovery. We have integrated 6 data sets to improve the prediction of pharmacologic actions of chemical compounds (IEEE TPAMI 2015). (Figure: a data fusion graph relating chemicals to pharmacologic actions, PMIDs, depositors, depositor categories and substructure fingerprints.)

We have fused 11 systems-level molecular data sets to predict disease-disease associations (Sci Reports 2013). (Figure: a hierarchy of disease classes discovered by data fusion, from a root layer through three levels, with classes ranging from single diseases, e.g. abetalipoproteinemia, to the largest class of 146 diseases; examples include gastric lymphoma, Hodgkin's lymphoma, Cushing's syndrome, cancers, inherited metabolic disorders, nervous, respiratory and cardiovascular system diseases, immune system diseases, cognitive disorders, pulpitis and periodontitis.)


Data fusion of 11 data sets substantially raised the accuracy of gene function predictions, also when compared to a kernel-based data integration approach (IEEE TPAMI 2015). (Figure: the fusion graph relating genes to GO terms, KEGG pathways, experimental conditions, PMIDs and MeSH descriptors, shown next to a page from the IEEE TPAMI paper with cross-validated F1 and AUC scores for fusion by matrix factorization (DFMF), a kernel-based method (MKL), random forests (RF) and relational learning-based matrix factorization (tri-SPMF).)

Prioritization of genes in a quest to identify the most promising candidates for bacterial response in Dictyostelium fused 13 input data sets. Out of 9 top-rated candidates, 8 predictions were confirmed in the wet lab (submitted, 2015). (Figure: the four steps of the prioritization pipeline: compressive data fusion by collective matrix factorization, object profiling by chaining of latent matrices, similarity estimation against seed genes, and gene ranking by aggregating similarity scores.)

In drug toxicity prediction the task was to distinguish between compounds that represent little or no health concern and those with the greatest likelihood to cause adverse effects in humans (CAMDA 2013). High-throughput and toxicogenomic screening coupled with a plethora of circumstantial evidence provide a challenge for improved toxicity prediction and require appropriate computational methods that integrate various biological, chemical and toxicological data. Fusion of 29 data sets allowed us to improve prediction accuracy well above that achieved by standard supervised approaches (Sys Biomed 2014). (Figure: the fusion graph of the CAMDA 2013 challenge, relating drugs, genes and samples from rat in vivo single-dose, rat in vivo repeated-dose, rat in vitro and human in vitro studies to drug type, sample metadata, hematology, biochemistry and liver weight measurements, GO terms and DILI potential.)


Lesson 13: Related Work on Data Fusion

Lanckriet, Gert R.G., et al. A statistical framework for genomic data fusion. Bioinformatics 20.16 (2004): 2626-2635. The first study to propose kernel-based integration as a way of intermediate data integration.

Schadt, Eric E., et al. An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics 37.7 (2005): 710-717. This study integrated DNA variation and gene expression data to identify drivers of complex traits.

Aerts, Stein, et al. Gene prioritization through genomic data fusion. Nature Biotechnology 24.5 (2006): 537-544. The paper describes Endeavour, a tool to prioritize candidate genes underlying biological processes or diseases, based on their similarity to known genes involved in these phenomena.

Mostafavi, Sara, et al. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biology 9.Suppl 1 (2008): S4. GeneMANIA is a tool that integrates multiple functional association networks and predicts gene functions using label propagation.

Zitnik, Marinka, et al. Discovering disease-disease associations by fusing systems- level molecular data. Scientific Reports 3 (2013). A study of relationships between diseases based on evidence from fusing available molecular interaction and ontology data.

Wang, Bo, et al. Similarity network fusion for aggregating data types on a genomic scale. Nature Methods 11.3 (2014): 333-337. Fusion of cancer patient similarity networks by combining mRNA expression, DNA methylation and microRNA expression data.

Zitnik, Marinka, and Zupan, Blaz. Matrix factorization-based data fusion for drug- induced liver injury prediction. Systems Biomedicine 2.1 (2014): 16-22. An application of a data fusion approach for prediction of drug toxicity in humans using 29 data sets provided by the CAMDA 2013 Challenge.

Ritchie, Marylyn D., et al. Methods of integrating data to uncover genotype- phenotype interactions. Nature Reviews Genetics 16.2 (2015): 85-97. This review explores emerging approaches for data integration including multi-staged, meta- dimensional and factor analysis.

Zitnik, Marinka, and Zupan, Blaz. Data fusion by matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 37.1 (2015): 41-53. An introduction and formalization of collective matrix factorization as presented in this tutorial. The paper also provides a mathematical derivation of the optimization approach.


Lesson 14: Related Tools for Data Fusion


Lesson 15: Data Fusion in Python

We have developed a scripting library in Python, which implements collective matrix factorization and completion, and is suitable for fusion of large data compendia.

The official source code repository is at http://github.com/marinkaz/scikit-fusion.
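A sketch of how the library can be used, based on the project README around the time of this tutorial; names and signatures may differ in current versions, so consult the repository for the up-to-date API. The data matrix below is random and stands in for a real relation between two object types.

# scikit-fusion usage sketch (API as in the README at the time of writing).
import numpy as np
from skfusion import fusion

R12 = np.random.rand(50, 30)                      # e.g., genes x experiments

t1 = fusion.ObjectType("Type 1", 10)              # object type and its rank
t2 = fusion.ObjectType("Type 2", 10)
relation = fusion.Relation(R12, t1, t2)

fusion_graph = fusion.FusionGraph()
fusion_graph.add_relation(relation)

fuser = fusion.Dfmf()                             # collective factorization
fuser.fuse(fusion_graph)

print(fuser.factor(t1).shape)                     # recipe matrix of type 1
print(fuser.backbone(relation).shape)             # backbone of the relation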
