Large-Scale Data Fusion by Collective Matrix Factorization Tutorial at the Basel Computational Biology Conference, Basel, Switzerland, 2015
Data Fusion Tutorial [BC]2 Basel, June 9, 2015

[Cover cartoon: "Jane looks for help!" Jane shows "Jane's personal hairball", a network of DNA repair and recombination genes and pathways (for example BRCA1, RAD51, ATM, TP53; Homologous recombination, Mismatch repair, Fanconi anemia pathway). Recoverable dialogue: "Hi Jane. NAR just published 176 new bio databases*!" "Messy. What's wrong?" "Ohhh! All different edge types! I have no idea how to make anything useful." "How about stitching them in a single data table?" "Tried it. A nightmare!" "Think of GO annotations in the data table of yeast phenotypes!" "I could work this out, but not for every different data source out there." "Told you!"]

* Fernandez-Suarez & Galperin, Nucleic Acids Research, 2013.

Welcome to the hands-on Data Fusion Tutorial! This tutorial is designed for data mining researchers and biologists with interest in data analysis and large-scale data integration. We will explore latent factor models, a popular class of approaches that have in recent years seen many successful applications in integrative data analysis. We will describe the intuition behind matrix factorization and explain why factorization approaches are suitable when collectively analyzing many heterogeneous data sets. To practice data fusion, we will construct visual data fusion workflows using Orange and its Data Fusion Add-on.

If you haven’t already installed Orange, please follow the installation guide at http://biolab.github.io/datafusion-installation-guide.

These notes include an introduction to integrative data analysis with examples from collaborative filtering and systems biology, and Orange workflows that we will construct during the tutorial.

Tutorial instructors: Marinka Zitnik and Blaz Zupan, with the help from members of Bioinformatics Lab, Ljubljana.

* See http://helikoid.si/recomb14/zitnik-zupan-recomb14.png for our full award-winning poster on data fusion.

Lesson 1: Everything is a Matrix

In many data mining applications, plenty of potentially useful data are available. However, these data naturally come in various formats and at different levels of granularity, can be represented in totally different input data spaces and typically describe distinct data types. For joint predictive modeling of heterogeneous data we need a generic way to encode data sets that might be fundamentally different from each other, both in type and in structure.

An effective way to organize a data compendium is to view each data set as a matrix. Matrices describe dyadic relationships, which are relationships between two groups of objects. A matrix relates objects in the rows to objects in the columns.

Examples of data matrices commonly used in the analysis of biological data include degrees of protein-protein interactions from the STRING database that are represented in a gene-to-gene matrix:

[Figure: a gene interaction network and the corresponding gene-to-gene matrix, with genes gacT, gemA, rdiA, racN, racJ, racI, xacA and racM in both rows and columns.]

A gene interaction network can easily be converted to a matrix.
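As a minimal sketch of this conversion, the snippet below fills a symmetric gene-to-gene matrix from a list of weighted edges. The gene names are taken from the figure, but the edges, the weights and the helper name `network_to_matrix` are hypothetical and serve only as an illustration; they are not STRING data.

```python
# Gene names appear in the figure; the edges and weights are made up for illustration.
genes = ["gacT", "gemA", "rdiA", "racN"]
edges = [("gacT", "gemA", 0.9), ("gemA", "rdiA", 0.4), ("rdiA", "racN", 0.7)]


def network_to_matrix(nodes, weighted_edges):
    """Return a symmetric node-by-node matrix (list of rows) for an undirected network."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for a, b, weight in weighted_edges:
        i, j = index[a], index[b]
        matrix[i][j] = weight
        matrix[j][i] = weight  # undirected edge, so mirror the entry
    return matrix


interaction_matrix = network_to_matrix(genes, edges)
for name, row in zip(genes, interaction_matrix):
    print(name, row)
```

Pairs of genes without an edge keep the value 0.0 here. Whether a missing edge should mean "no interaction" or "unknown" is a modeling choice that this sketch does not settle.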
Each weighted edge in a network corresponds to a matrix entry.

Binary matrices can be used to associate Gene Ontology terms with cellular pathways:

[Figure: part of the N-Glycan biosynthesis pathway (genes alg1, alg7, alg13, alg14, dpm1, dpm2, dpm3; neighboring pathway Fructose and mannose metabolism) next to a binary matrix of Ontology terms by Pathways. Labelled examples: Protein N-linked glycosylation (GO:0006487); Dolichol kinase (K00902), GO:0004168; Alpha-mannosidase II (K01231), GO:0004572; Oligosaccharyltransferase complex (K12668), GO:0008250.]

Binary relations between two object types can be represented with a binary matrix.

research articles with Medical Subject Headings (MeSH):

[Figure: binary matrix of Literature by MeSH terms. Example MeSH terms: Cell separation; Cytoplasmic vesicles/metabolism; Ethidium/metabolism; Immunity/innate; Mutation; Phagocytes/cytology; Phagocytes/immunology*; Phagocytosis*.]

Papers cited in PubMed are tagged with MeSH terms. We can use one large binary matrix to encode relations between research articles and MeSH terms.

or membership of genes in pathways, one column for each pathway:

[Figure: binary matrix of Genes by Pathways for part of the N-Glycan biosynthesis pathway (genes alg1, alg2, alg3, alg7, alg9, alg11, alg12, alg13, alg14, dpm1, dpm2, dpm3; neighboring pathways Fructose and mannose metabolism and GPI-anchor biosynthesis).]

Just like the relations between MeSH terms and scientific papers, we can encode pathway memberships of genes in one large matrix that has genes in rows and pathways in columns.

The structure of Gene Ontology can be represented with a real-valued matrix whose elements represent distance or semantic similarity between the corresponding ontological terms:

[Figure: part of the Gene Ontology graph and the corresponding matrix with Gene Ontology terms in both rows and columns.]

Any ontology can be represented with a square matrix.
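One simple way to obtain such a term-by-term matrix, sketched here under stated assumptions, is to count the number of edges on the shortest path between every pair of terms. The term names come from the figure, but the edges listed below are hypothetical and may not match the real Gene Ontology; path length is also only one of several possible distance or similarity measures, and the notes do not say which one the authors used.

```python
from collections import deque

# Term names come from the figure; the edges are hypothetical, for illustration only.
terms = [
    "Response to stress",
    "Defense response",
    "Response to other organisms",
    "Response to bacterium",
    "Defense response to bacterium",
]
edges = [
    ("Response to stress", "Defense response"),
    ("Response to other organisms", "Response to bacterium"),
    ("Defense response", "Defense response to bacterium"),
    ("Response to bacterium", "Defense response to bacterium"),
]


def distance_matrix(nodes, links):
    """Shortest-path length (in edges) between all node pairs, ignoring edge direction."""
    neighbours = {node: set() for node in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    matrix = []
    for source in nodes:
        # Breadth-first search from each term gives its distance to every other term.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        # Unreachable terms get an infinite distance.
        matrix.append([dist.get(target, float("inf")) for target in nodes])
    return matrix


go_distances = distance_matrix(terms, edges)
for term, row in zip(terms, go_distances):
    print(f"{term:32s}", row)
```

The result is square and symmetric with zeros on the diagonal, which is the shape of matrix the lesson describes for ontologies.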
We use the ontology to measure distances between its entities, and encode these distances in a distance matrix.

[Figure labels: Gene Ontology terms such as Response to stress, Response to other organisms, Defense response, Response to bacterium, Defense response to other organism, and Defense response to bacterium.]

Lesson 2: The Challenge

Suppose we would like to identify genes whose mutants exhibit a certain phenotype, e.g., genes that are sensitive to Gram-negative bacteria. In addition to current knowledge about phenotypical annotations, i.e., data encoded in a gene-to-phenotype matrix, which might be incomplete and contain some erroneous information, there exists a variety of circumstantial evidence, such as gene expression data, literature data, and annotations of research articles.

An obvious question is how to link these seemingly disparate data sets. In many applications there exists some correspondence between different input dimensions. For example, genes can be linked to MeSH terms via gene-to-publication and publication-to-MeSH-term data matrices. This is an important observation, which we exploit to define a relational structure of the entire data system.

The major challenge for such problems is how to jointly model multiple types of data heterogeneity in a mutually beneficial way. For example, in the scheme below, can information about the relatedness of MeSH terms and similarity between phenotypes from the Phenotype Ontology help us to improve the accuracy of recognizing Gram negative defective genes?

[Figure labels: Phenotype Ontology with phenotypes Gram neg. defective, Aberrant spore color, Decreased chemotaxis, and Gram pos. defective.]

The data excerpt on the right comes from a gene prioritization problem where our goal was to find candidates for bacterial response genes in the social amoeba Dictyostelium.
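The link from genes to MeSH terms through publications, mentioned in this lesson, can be illustrated with a plain matrix product. In the sketch below the gene and MeSH names appear elsewhere in these notes, while the publication identifiers and every matrix entry are hypothetical. This only shows how a shared object type (publications) connects two matrices; it is not the collective factorization method itself.

```python
# Hypothetical toy data: names are from the notes, the 0/1 entries are made up.
genes = ["spc3", "swp1", "kif9"]
publications = ["paper_1", "paper_2"]
mesh_terms = ["Phagocytosis", "Mutation", "Cell separation"]

# Rows: genes, columns: publications (1 = the gene is mentioned in the paper).
gene_to_publication = [
    [1, 0],
    [1, 1],
    [0, 1],
]

# Rows: publications, columns: MeSH terms (1 = the paper is tagged with the term).
publication_to_mesh = [
    [1, 1, 0],
    [0, 1, 1],
]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Entry (i, j) counts the publications that connect gene i to MeSH term j.
gene_to_mesh = matmul(gene_to_publication, publication_to_mesh)
for gene, row in zip(genes, gene_to_mesh):
    print(gene, dict(zip(mesh_terms, row)))
```

The product is possible only because the columns of the first matrix and the rows of the second describe the same objects in the same order, which is the kind of correspondence between input dimensions that the lesson refers to.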
Other than for a few seed genes, there were no data from which we could directly infer the bacterial phenotype of mutants. Hence, we considered circumstantial data sets and hoped that their fusion would uncover interesting new bacterial response genes.

[Figure: data fusion scheme for the gene prioritization problem. Labels: Genes (spc3, swp1, kif9, alyL, nagB1, gpi, shkA, nip7), Mutant Phenotypes, Timepoints, Publications, MeSH terms, Phenotype data, Expression data, Pubmed data, MeSH annotations, MeSH Ontology.]

Lesson 3: Recommender Systems

Sparse matrices and matrix completion have been thoroughly addressed in the area of machine learning called recommender systems. Several methods from this field form the foundation for matrix-based data fusion. Hence, we diverge here from fusion to recommender systems, and for a while, from biology to movies.

How would you decide which movie to recommend to a friend? Obviously, a useful source of information might be the ratings of the movies your friend has seen in the past, e.g., from one star up to five stars. Movie recommender systems primarily use user ratings, from which they estimate correlations between different movies and similarities between users, and infer a prediction model that can be used to recommend which movie a user should see next.

For example, in the figure below we see a movie ratings data matrix containing information for four users and four movies. Notice that in a real setting such matrices can contain information for millions of users and hundreds of thousands of movies. However, each individual user typically sees only a small proportion of all the movies and rates even fewer of them. Hence, data matrices in recommender systems are typically extremely sparse, e.g., it is common that up to ~99% of matrix elements are unknown. This characteristic together