Computational Approaches for Protein Function Prediction: a Survey

Total Page:16

File Type:pdf, Size:1020Kb

Computational Approaches for Protein Function Prediction: a Survey Computational Approaches for Protein Function Prediction: A Survey Gaurav Pandey, Vipin Kumar and Michael Steinbach Department of Computer Science and Engineering, University of Minnesota {gaurav,kumar,steinbac}@cs.umn.edu Proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is a cru- cial link in the development of new drugs, better crops, and even the development of synthetic biochemicals such as biofuels. Experimental procedures for protein function prediction are inherently low throughput and are thus unable to annotate a non-trivial fraction of proteins that are becoming available due to rapid advances in genome sequencing technology. This has motivated the development of computational techniques that utilize a variety of high-throughput experimental data for protein function prediction, such as protein and genome sequences, gene expression data, protein interaction networks and phylogenetic profiles. Indeed, in a short period of a decade, several hundred articles have been published on this topic. This survey aims to discuss this wide spectrum of approaches by categorizing them in terms of the data type they use for predicting function, and thus identify the trends and needs of this very important field. The survey is expected to be useful for computational biologists and bioinformaticians aiming to get an overview of the field of computational function prediction, and identify areas that can benefit from further research. Key Words and Phrases: Protein function prediction, bioinformatics, Gene Ontology, multiple biological data types, high-throughput experimental data, data mining, non-homology based methods Contents 1 Introduction 3 2 What is Protein Function? 6 2.1 FunctionalClassificationSchemes . 7 2.2 GOistheWaytoGo!............................ 10 2.3 Discussion.................................. 13 3 Protein Sequences 13 3.1 Introduction................................. 13 3.2 Annotation transfer from homologues: How good is it for function predic- tion?..................................... 15 3.3 Existing Approaches Beyond Simple Homology-based Annotation Transfer 16 3.3.1 Homology-basedapproaches. 17 3.3.2 Subsequence-basedapproaches . 19 Authors’ Address: Department of Computer Science and Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, MN 55414, USA Support: This material is based upon work supported by the National Science Foundation under Grant Nos. IIS-0308264 and ITR-0325949. Access to computing facilities was provided by the Minnesota Supercomputing Institute. 2 · Pandey et al. 3.3.3 Feature-basedapproaches . 24 3.4 Discussion.................................. 27 4 Protein Structure 28 4.1 Introduction................................. 28 4.2 IsStructureTiedtoFunction? . 30 4.3 ExistingApproaches ............................ 32 4.3.1 StructuralSimilarity-basedApproaches . .... 33 4.3.2 Three-dimensionalMotif-basedApproaches. .... 35 4.3.3 Surface-basedApproaches . 36 4.3.4 Learning-basedApproaches . 38 4.4 Discussion.................................. 39 5 Genomic Sequences 39 5.1 Introduction................................. 39 5.2 ExistingApproaches ............................ 40 5.2.1 Genome-wide homology-basedannotation transfer . ..... 40 5.2.2 Approachesexploitinggeneneighborhood . .. 41 5.2.3 Approachesexploitinggenefusion. 43 5.3 ComparisonandAssimilationoftheApproaches . ..... 44 6 Phylogenetic Data 46 6.1 Introduction................................. 46 6.2 ExistingApproaches ............................ 48 6.2.1 ApproachesUsingPhylogeneticProfiles . .. 48 6.2.2 ApproachesUsingPhylogeneticTrees . 51 6.2.3 HybridApproaches.. .... .... ... .... .... .... 54 6.3 Discussion.................................. 54 7 Gene Expression Data 55 7.1 Introduction................................. 55 7.2 ExistingApproaches ............................ 57 7.2.1 Clustering-basedapproaches . 57 7.2.2 Classification-basedapproaches . 62 7.2.3 Temporalanalysis-basedapproaches. .. 65 7.3 Discussion.................................. 66 8 Protein Interaction Networks 67 8.1 Introduction................................. 67 8.2 ThePromiseofProteinInteractionNetworks . ..... 70 8.3 Existingapproaches............................. 71 8.3.1 Neighborhood-basedapproaches. 72 8.3.2 Globaloptimization-basedapproaches . ... 75 8.3.3 Clustering-basedapproaches . 78 8.3.4 AssociationAnalysis-basedApproaches . ... 80 8.4 Discussion.................................. 81 Computational Techniques for Protein Function Prediction: A Survey · 3 9 Literature and Text 82 9.1 Introduction................................. 82 9.2 ExistingApproaches ............................ 82 9.2.1 IR-basedapproaches . 84 9.2.2 Textmining-basedapproaches . 85 9.2.3 NLP-basedapproaches . 86 9.2.4 Keywordsearch........................... 88 9.3 StandardizationInitiatives . ... 90 9.3.1 BioCreAtIvE ............................ 90 9.3.2 TREC2003GenomicsTrack. 92 9.4 Discussion.................................. 93 10 Mutiple Data Types 93 10.1Introduction................................. 93 10.2 ExistingApproaches ............................ 94 10.2.1 ApproachesUsingaCommonDataFormat . 94 10.2.2 ApproachesUsingIndependentDataFormats . ... 97 10.3Discussion.................................. 104 11 Conclusions 105 1. INTRODUCTION Proteins are macromolecules that serve as building blocks and functional components of a cell, and account for the second largest fraction of the cellular weight after water. Proteins are responsible for some of the most important functions in an organism, such as constitu- tion of the organs (structural proteins), the catalysis of biochemical reactions necessary for metabolism (enzymes), and the maintenance of the cellular environment (transmembrane proteins). Thus, proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is a crucial link in the development of new drugs, better crops, and even the development of synthetic biochemicals such as biofuels. The early approaches to predicting protein function were experimental and usually fo- cused on a specific target gene or protein, or a small set of proteins forming natural groups such as protein complexes. These approaches included gene knockout, targeted mutations and the inhibition of gene expression [Weaver 2002]. However, irrespective of the details, these approaches are low-throughput because of the huge experimental and human effort required in analyzing a single gene or protein. As a result, even large-scale experimen- tal annotation initiatives, such as the EUROFAN project [Oliver 1996], are inadequate for annotating a non-trivial fraction of the proteins that are becoming available due to rapid advances in genome sequencing technology. This has resulted in a continually expanding sequence-function gap for the discovered proteins [Roberts 2004]. In an attempt to close this gap, numerous high-throughputexperimental procedures have been invented to investigate the mechanisms leading to the accomplishment of a protein’s function. These procedures have generated a wide variety of useful data that ranges from simple protein sequences to complex high-throughput data, such as gene expression data sets and protein interaction networks. These data offer different types of insights into a protein’s function and related concepts. For instance, protein interaction data shows which proteins come together to perform a particular function, while the three-dimensional 4 · Pandey et al. structure of a protein determines the precise sites to which the interacting protein binds itself. Furthermore, recent years have seen the recording of this data in very standardized and professionally maintained databases such as SWISS-PROT [Boeckmann et al. 2003], MIPS [Mewes et al. 2002], DIP [Xenarios et al. 2002] and PDB [Berman et al. 2000]. The huge amount of data that has accumulated over the years has made biological dis- covery via manual analysis tedious and cumbersome. This has, in turn, necessitated the use of techniques from the field of bioinformatics, an approach that is crucial in today’s age of rapid generation and warehousing of biological data. Bioinformatics focuses on the uti- lization of techniques from computer science and the development of novel computational approaches for addressing problems in molecular biology and associated disciplines. In- deed, a more recently advocated path for biological research is the creation of hypotheses by generating results from an appropriate bioinformatics algorithm in order to narrow the search space, and the subsequent validation of these hypotheses to reach the final conclu- sion [Rastan and Beeley 1997; Roberts 2004]. Standard sequence comparison tools such as BLAST [Altschul et al. 1990; Altschul et al. 1997], and databases such as PROSITE [Hulo et al. 2006], Pfam [Sonnhammer et al. 1997] and PRINTS [Attwood et al. 2003] serve as testimonials to the benefits that bioinformatics can provide to molecular biology. Following the success of computational approaches in solving important problems such as sequence alignment and comparison [Altschul et al. 1997], and genome fragment as- sembly [Shendure et al. 2004], and given the importance of protein function, numerous computational techniques have also been proposed for predicting protein function. Early approaches used sequence similarity tools such as BLAST [Altschul et al. 1990] to trans- fer functional annotation from the most similar proteins. Subsequently,
Recommended publications
  • Gene Prediction and Genome Annotation
    A Crash Course in Gene and Genome Annotation Lieven Sterck, Bioinformatics & Systems Biology VIB-UGent [email protected] ProCoGen Dissemination Workshop, Riga, 5 nov 2013 “Conifer sequencing: basic concepts in conifer genomics” “This Project is financially supported by the European Commission under the 7th Framework Programme” Genome annotation: finding the biological relevant features on a raw genomic sequence (in a high throughput manner) ProCoGen Dissemination Workshop, Riga, 5 nov 2013 Thx to: BSB - annotation team • Lieven Sterck (Ectocarpus, higher plants, conifers, … ) • Yao-cheng Lin (Fungi, conifers, …) • Stephane Rombauts (green alga, mites, …) • Bram Verhelst (green algae) • Pierre Rouzé • Yves Van de Peer ProCoGen Dissemination Workshop, Riga, 5 nov 2013 Annotation experience • Plant genomes : A.thaliana & relatives (e.g. A.lyrata), Poplar, Physcomitrella patens, Medicago, Tomato, Vitis, Apple, Eucalyptus, Zostera, Spruce, Oak, Orchids … • Fungal genomes: Laccaria bicolor, Melampsora laricis- populina, Heterobasidion, other basidiomycetes, Glomus intraradices, Pichia pastoris, Geotrichum Candidum, Candida ... • Algal genomes: Ostreococcus spp, Micromonas, Bathycoccus, Phaeodactylum (and other diatoms), E.hux, Ectocarpus, Amoebophrya … • Animal genomes: Tetranychus urticae, Brevipalpus spp (mites), ... ProCoGen Dissemination Workshop, Riga, 5 nov 2013 Why genome annotation? • Raw sequence data is not useful for most biologists • To be meaningful to them it has to be converted into biological significant knowledge
    [Show full text]
  • Gene Structure Prediction
    Gene finding Lorenzo Cerutti Swiss Institute of Bioinformatics EMBNet course, September 2002 Gene finding EMBNet 2002 Introduction Gene finding is about detecting coding regions and infer gene structure Gene finding is difficult DNA sequence signals have low information content (degenerated and highly unspe- • cific) It is difficult to discriminate real signals • Sequencing errors • Prokaryotes High gene density and simple gene structure • Short genes have little information • Overlapping genes • Eukaryotes Low gene density and complex gene structure • Alternative splicing • Pseudo-genes • 1 Gene finding EMBNet 2002 Gene finding strategies Homology method Gene structure can be deduced by homology • Requires a not too distant homologous sequence • Ab initio method Requires two types of information • . compositional information . signal information 2 Gene finding EMBNet 2002 Gene finding: Homology method 3 Gene finding EMBNet 2002 Homology method Principles of the homology method. Coding regions evolve slower than non-coding regions, i.e. local sequence similarity • can be used as a gene finder. Homologous sequences reflect a common evolutionary origin and possibly a common • gene structure, i.e. gene structure can be solved by homology (mRNAs, ESTs, proteins, domains). Standard homology search methods can be used (BLAST, Smith-Waterman, ...). • Include ”gene syntax” information (start/stop codons, ...). • Homology methods are also useful to confirm predictions inferred by other methods 4 Gene finding EMBNet 2002 Homology method: a simple view Gene of unknown structure Homology with a gene of known structure Exon 1 Exon 2 Exon 3 Find DNA signals ATG GT {TAA,TGA,TAG} AG 5 Gene finding EMBNet 2002 Procrustes Procrustes is a software to predict gene structure from homology found in pro- teins (Gelfand et al., 1996) Principle of the algorithm • .
    [Show full text]
  • Molecular Basis for the Distinct Cellular Functions of the Lsm1-7 and Lsm2-8 Complexes
    bioRxiv preprint doi: https://doi.org/10.1101/2020.04.22.055376; this version posted April 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Molecular basis for the distinct cellular functions of the Lsm1-7 and Lsm2-8 complexes Eric J. Montemayor1,2, Johanna M. Virta1, Samuel M. Hayes1, Yuichiro Nomura1, David A. Brow2, Samuel E. Butcher1 1Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA. 2Department of Biomolecular Chemistry, University of Wisconsin School of Medicine and Public Health, Madison, WI, USA. Correspondence should be addressed to E.J.M. ([email protected]) and S.E.B. ([email protected]). Abstract Eukaryotes possess eight highly conserved Lsm (like Sm) proteins that assemble into circular, heteroheptameric complexes, bind RNA, and direct a diverse range of biological processes. Among the many essential functions of Lsm proteins, the cytoplasmic Lsm1-7 complex initiates mRNA decay, while the nuclear Lsm2-8 complex acts as a chaperone for U6 spliceosomal RNA. It has been unclear how these complexes perform their distinct functions while differing by only one out of seven subunits. Here, we elucidate the molecular basis for Lsm-RNA recognition and present four high-resolution structures of Lsm complexes bound to RNAs. The structures of Lsm2-8 bound to RNA identify the unique 2′,3′ cyclic phosphate end of U6 as a prime determinant of specificity. In contrast, the Lsm1-7 complex strongly discriminates against cyclic phosphates and tightly binds to oligouridylate tracts with terminal purines.
    [Show full text]
  • Posters A.Pdf
    INVESTIGATING THE COUPLING MECHANISM IN THE E. COLI MULTIDRUG TRANSPORTER, MdfA, BY FLUORESCENCE SPECTROSCOPY N. Fluman, D. Cohen-Karni, E. Bibi Department of Biological Chemistry, Weizmann Institute of Science, Rehovot, Israel In bacteria, multidrug transporters couple the energetically favored import of protons to export of chemically-dissimilar drugs (substrates) from the cell. By this function, they render bacteria resistant against multiple drugs. In this work, fluorescence spectroscopy of purified protein is used to unravel the mechanism of coupling between protons and substrates in MdfA, an E. coli multidrug transporter. Intrinsic fluorescence of MdfA revealed that binding of an MdfA substrate, tetraphenylphosphonium (TPP), induced a conformational change in this transporter. The measured affinity of MdfA-TPP was increased in basic pH, raising a possibility that TPP might bind tighter to the deprotonated state of MdfA. Similar increases in affinity of TPP also occurred (1) in the presence of the substrate chloramphenicol, or (2) when MdfA is covalently labeled by the fluorophore monobromobimane at a putative chloramphenicol interacting site. We favor a mechanism by which basic pH, chloramphenicol binding, or labeling with monobromobimane, all induce a conformational change in MdfA, which results in deprotonation of the transporter and increase in the affinity of TPP. PHENOTYPE CHARACTERIZATION OF AZOSPIRILLUM BRASILENSE Sp7 ABC TRANSPORTER (wzm) MUTANT A. Lerner1,2, S. Burdman1, Y. Okon1,2 1Department of Plant Pathology and Microbiology, Faculty of Agricultural, Food and Environmental Quality Sciences, Hebrew University of Jerusalem, Rehovot, Israel, 2The Otto Warburg Center for Agricultural Biotechnology, Faculty of Agricultural, Food and Environmental Quality Sciences, Hebrew University of Jerusalem, Rehovot, Israel Azospirillum, a free-living nitrogen fixer, belongs to the plant growth promoting rhizobacteria (PGPR), living in close association with plant roots.
    [Show full text]
  • Sox2-RNA Mechanisms of Chromosome Topological Control in Developing Forebrain
    bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.307215; this version posted September 22, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Title: Sox2-RNA mechanisms of chromosome topological control in developing forebrain Ivelisse Cajigas1, Abhijit Chakraborty2, Madison Lynam1, Kelsey R Swyter1, Monique Bastidas1, Linden Collens1, Hao Luo1, Ferhat Ay2,3, Jhumku D. Kohtz1,4 1Department of Pediatrics, Northwestern University, Feinberg School of Medicine, Department of Human Molecular Genetics, Stanley Manne Children's Research Institute 2430 N Halsted, Chicago, IL 60614 2Centers for Autoimmunity and Cancer Immunotherapy, La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA 3School of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA 4To whom correspondence should be addressed: [email protected] 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.307215; this version posted September 22, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Summary Precise regulation of gene expression networks requires the selective targeting of DNA enhancers. The Evf2 long non-coding RNA regulates Dlx5/6 ultraconserved enhancer(UCE) interactions with long-range target genes, controlling gene expression over a 27Mb region in mouse developing forebrain. Here, we show that Evf2 long range gene repression occurs through multi-step mechanisms involving the transcription factor Sox2, a component of the Evf2 ribonucleoprotein complex (RNP).
    [Show full text]
  • Proteomics Provides Insights Into the Inhibition of Chinese Hamster V79
    www.nature.com/scientificreports OPEN Proteomics provides insights into the inhibition of Chinese hamster V79 cell proliferation in the deep underground environment Jifeng Liu1,2, Tengfei Ma1,2, Mingzhong Gao3, Yilin Liu4, Jun Liu1, Shichao Wang2, Yike Xie2, Ling Wang2, Juan Cheng2, Shixi Liu1*, Jian Zou1,2*, Jiang Wu2, Weimin Li2 & Heping Xie2,3,5 As resources in the shallow depths of the earth exhausted, people will spend extended periods of time in the deep underground space. However, little is known about the deep underground environment afecting the health of organisms. Hence, we established both deep underground laboratory (DUGL) and above ground laboratory (AGL) to investigate the efect of environmental factors on organisms. Six environmental parameters were monitored in the DUGL and AGL. Growth curves were recorded and tandem mass tag (TMT) proteomics analysis were performed to explore the proliferative ability and diferentially abundant proteins (DAPs) in V79 cells (a cell line widely used in biological study in DUGLs) cultured in the DUGL and AGL. Parallel Reaction Monitoring was conducted to verify the TMT results. γ ray dose rate showed the most detectable diference between the two laboratories, whereby γ ray dose rate was signifcantly lower in the DUGL compared to the AGL. V79 cell proliferation was slower in the DUGL. Quantitative proteomics detected 980 DAPs (absolute fold change ≥ 1.2, p < 0.05) between V79 cells cultured in the DUGL and AGL. Of these, 576 proteins were up-regulated and 404 proteins were down-regulated in V79 cells cultured in the DUGL. KEGG pathway analysis revealed that seven pathways (e.g.
    [Show full text]
  • A Curated Benchmark of Enhancer-Gene Interactions for Evaluating Enhancer-Target Gene Prediction Methods
    University of Massachusetts Medical School eScholarship@UMMS Open Access Articles Open Access Publications by UMMS Authors 2020-01-22 A curated benchmark of enhancer-gene interactions for evaluating enhancer-target gene prediction methods Jill E. Moore University of Massachusetts Medical School Et al. Let us know how access to this document benefits ou.y Follow this and additional works at: https://escholarship.umassmed.edu/oapubs Part of the Bioinformatics Commons, Computational Biology Commons, Genetic Phenomena Commons, and the Genomics Commons Repository Citation Moore JE, Pratt HE, Purcaro MJ, Weng Z. (2020). A curated benchmark of enhancer-gene interactions for evaluating enhancer-target gene prediction methods. Open Access Articles. https://doi.org/10.1186/ s13059-019-1924-8. Retrieved from https://escholarship.umassmed.edu/oapubs/4118 Creative Commons License This work is licensed under a Creative Commons Attribution 4.0 License. This material is brought to you by eScholarship@UMMS. It has been accepted for inclusion in Open Access Articles by an authorized administrator of eScholarship@UMMS. For more information, please contact [email protected]. Moore et al. Genome Biology (2020) 21:17 https://doi.org/10.1186/s13059-019-1924-8 RESEARCH Open Access A curated benchmark of enhancer-gene interactions for evaluating enhancer-target gene prediction methods Jill E. Moore, Henry E. Pratt, Michael J. Purcaro and Zhiping Weng* Abstract Background: Many genome-wide collections of candidate cis-regulatory elements (cCREs) have been defined using genomic and epigenomic data, but it remains a major challenge to connect these elements to their target genes. Results: To facilitate the development of computational methods for predicting target genes, we develop a Benchmark of candidate Enhancer-Gene Interactions (BENGI) by integrating the recently developed Registry of cCREs with experimentally derived genomic interactions.
    [Show full text]
  • There Is a Lot of Research on Gene Prediction Methods
    LBNL #58992 Fungal Genomic Annotation Igor V. Grigoriev1, Diego A. Martinez2 and Asaf A. Salamov1 1US Department of Energy Joint Genome Institute, Walnut Creek, CA 94598 ([email protected], [email protected]); 2Los Alamos National Laboratory Joint Genome Institute, P.O. Box 1663 Los Alamos, NM 87545 ([email protected]). Sequencing technology in the last decade has advanced at an incredible pace. Currently there are hundreds of microbial genomes available with more still to come. Automated genome annotation aims to analyze this amount of sequence data in a high-throughput fashion and help researches to understand the biology of these organisms. Manual curation of automatically annotated genomes validates the predictions and set up 'gold' standards for improving the methodologies used. Here we review the methods and tools used for annotation of fungal genomes in different genome sequencing centers. 1. INTRODUCTION In recent years the power of DNA sequencing has dramatically increased, with dedicated centers running 24 hours a day 7 days a week able to produce as much as 2 gigabases of raw sequence or more a month. The researchers who work on a variety of fungi are fortunate, as most fungal genomes are under 50 megabases and produce high- quality draft assembly almost as easily as bacteria. This feature of fungal genomes is a key reason that the first sequenced eukaryotic genome was of the ascomycete Saccharomyces cerevisiae (Goffeau et al. 1996). As of the submission of this chapter, one can obtain draft sequences of more than 100 fungal genomes (Table 1) and the list is growing. While some are species of the same genus (e.g., Aspergillus has three members and more coming), there still remains a height of data that could confuse and bury a researcher for many years.
    [Show full text]
  • Gene Prediction Using Deep Learning
    FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO Gene prediction using Deep Learning Pedro Vieira Lamares Martins Mestrado Integrado em Engenharia Informática e Computação Supervisor: Rui Camacho (FEUP) Second Supervisor: Nuno Fonseca (EBI-Cambridge, UK) July 22, 2018 Gene prediction using Deep Learning Pedro Vieira Lamares Martins Mestrado Integrado em Engenharia Informática e Computação Approved in oral examination by the committee: Chair: Doctor Jorge Alves da Silva External Examiner: Doctor Carlos Manuel Abreu Gomes Ferreira Supervisor: Doctor Rui Carlos Camacho de Sousa Ferreira da Silva July 22, 2018 Abstract Every living being has in their cells complex molecules called Deoxyribonucleic Acid (or DNA) which are responsible for all their biological features. This DNA molecule is condensed into larger structures called chromosomes, which all together compose the individual’s genome. Genes are size varying DNA sequences which contain a code that are often used to synthesize proteins. Proteins are very large molecules which have a multitude of purposes within the individual’s body. Only a very small portion of the DNA has gene sequences. There is no accurate number on the total number of genes that exist in the human genome, but current estimations place that number between 20000 and 25000. Ever since the entire human genome has been sequenced, there has been an effort to consistently try to identify the gene sequences. The number was initially thought to be much higher, but it has since been furthered down following improvements in gene finding techniques. Computational prediction of genes is among these improvements, and is nowadays an area of deep interest in bioinformatics as new tools focused on the theme are developed.
    [Show full text]
  • Prediction of Protein-Protein Interactions and Essential Genes Through Data Integration
    Prediction of Protein-Protein Interactions and Essential Genes Through Data Integration by Max Kotlyar A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Department of Medical Biophysics University of Toronto Copyright °c 2011 by Max Kotlyar Abstract Prediction of Protein-Protein Interactions and Essential Genes Through Data Integration Max Kotlyar Doctor of Philosophy Graduate Department of Department of Medical Biophysics University of Toronto 2011 The currently known network of human protein-protein interactions (PPIs) is pro- viding new insights into diseases and helping to identify potential therapies. However, according to several estimates, the known interaction network may represent only 10% of the entire interactome – indicating that more comprehensive knowledge of the inter- actome could have a major impact on understanding and treating diseases. The primary aim of this thesis was to develop computational methods to provide increased coverage of the interactome. A secondary aim was to gain a better understanding of the link between networks and phenotype, by analyzing essential mouse genes. Two algorithms were developed to predict PPIs and provide increased coverage of the interactome: F pClass and mixed co-expression. F pClass differs from previous PPI prediction methods in two key ways: it integrates both positive and negative evidence for protein interactions, and it identifies synergies between predictive features. Through these approaches F pClass provides interaction networks with significantly improved reli- ability and interactome coverage. Compared to previous predicted human PPI networks, FpClass provides a network with over 10 times more interactions, about 2 times more pro- teins and a lower false discovery rate.
    [Show full text]
  • Supplementary Materials
    Supplementary Materials COMPARATIVE ANALYSIS OF THE TRANSCRIPTOME, PROTEOME AND miRNA PROFILE OF KUPFFER CELLS AND MONOCYTES Andrey Elchaninov1,3*, Anastasiya Lokhonina1,3, Maria Nikitina2, Polina Vishnyakova1,3, Andrey Makarov1, Irina Arutyunyan1, Anastasiya Poltavets1, Evgeniya Kananykhina2, Sergey Kovalchuk4, Evgeny Karpulevich5,6, Galina Bolshakova2, Gennady Sukhikh1, Timur Fatkhudinov2,3 1 Laboratory of Regenerative Medicine, National Medical Research Center for Obstetrics, Gynecology and Perinatology Named after Academician V.I. Kulakov of Ministry of Healthcare of Russian Federation, Moscow, Russia 2 Laboratory of Growth and Development, Scientific Research Institute of Human Morphology, Moscow, Russia 3 Histology Department, Medical Institute, Peoples' Friendship University of Russia, Moscow, Russia 4 Laboratory of Bioinformatic methods for Combinatorial Chemistry and Biology, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow, Russia 5 Information Systems Department, Ivannikov Institute for System Programming of the Russian Academy of Sciences, Moscow, Russia 6 Genome Engineering Laboratory, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia Figure S1. Flow cytometry analysis of unsorted blood sample. Representative forward, side scattering and histogram are shown. The proportions of negative cells were determined in relation to the isotype controls. The percentages of positive cells are indicated. The blue curve corresponds to the isotype control. Figure S2. Flow cytometry analysis of unsorted liver stromal cells. Representative forward, side scattering and histogram are shown. The proportions of negative cells were determined in relation to the isotype controls. The percentages of positive cells are indicated. The blue curve corresponds to the isotype control. Figure S3. MiRNAs expression analysis in monocytes and Kupffer cells. Full-length of heatmaps are presented.
    [Show full text]
  • "An Overview of Gene Identification: Approaches, Strategies, and Considerations"
    An Overview of Gene Identification: UNIT 4.1 Approaches, Strategies, and Considerations Modern biology has officially ushered in a new era with the completion of the sequencing of the human genome in April 2003. While often erroneously called the “post-genome” era, this milestone truly marks the beginning of the “genome era,” a time in which the availability of sequence data for many genomes will have a significant effect on how science is performed in the 21st century. While complete human sequence data is now available at an overall accuracy of 99.99%, the mere availability of all of these As, Cs, Ts, and Gs still does not answer some of the basic questions regarding the human genome—how many genes actually comprise the genome, how many of these genes code for multiple gene products, and where those genes actually lie along the complement of human chromosomes. Current estimates, based on preliminary analyses of the draft sequence, place the number of human genes at ∼30,000 (International Human Genome Sequencing Consortium, 2001). This number is in stark contrast to previously-suggested estimates, which had ranged as high as 140,000. A number that is in the 30,000 range brings into question the one-gene, one-protein hypothesis, underscoring the importance of processes such as alternative splicing in the generation of multiple gene products from a single gene. Finding all of the genes and the positions of those genes within the human genome sequence—and in other model organism genome sequences as well—requires the devel- opment and application of robust computational methods, some of which are listed in Table 4.1.1.
    [Show full text]