Human Protein Function Prediction: Application of Machine Learning for Integration of Heterogeneous Data Sources

Total Page:16

File Type:pdf, Size:1020Kb

Human Protein Function Prediction: Application of Machine Learning for Integration of Heterogeneous Data Sources Human Protein Function Prediction: application of machine learning for integration of heterogeneous data sources Anna Lobley A dissertation submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy of University College London. Department of Biochemistry University College London July 2009 1 I, Anna Lobley, confirm that the work presented in this thesis is my own. Where infor- mation has been derived from other sources, I confirm that this has been indicated in the thesis. 2 Abstract Experimental characterisation of protein cellular function can be prohibitively expensive and take years to complete. To address this problem, this thesis focuses on the development of com- putational approaches to predict function from sequence. For sequences with well characterised close relatives, annotation is trivial, orphans or distant homologues present a greater challenge. The use of a feature based method employing ensemble support vector machines to predict indi- vidual Gene Ontology classes is investigated. It is found that different combinations of feature inputs are required to recognise different functions. Although the approach is applicable to any human protein sequence, it is restricted to broadly descriptive functions. The method is well suited to prioritisation of candidate functions for novel proteins rather than to make highly accu- rate class assignments. Signatures of common function can be derived from different biological characteristics; inter- actions and binding events as well as expression behaviour. To investigate the hypothesis that common function can be derived from expression information, public domain human microar- ray datasets are assembled. The questions of how best to integrate these datasets and derive features that are useful in function prediction are addressed. Both co-expression and abundance information is represented between and within experiments and investigated for correlation with function. It is found that features derived from expression data serve as a weak but significant signal for recognising functions. This signal is stronger for biological processes than molecular function categories and independent of homology information. The protein domain has historically been coined as a modular evolutionary unit of protein func- tion. The occurrence of domains that can be linked by ancestral fusion events serves as a signal for domain-domain interactions. To exploit this information for function prediction, novel do- main architecture and fused architecture scores are developed. Architecture scores rather than single domain scores correlate more strongly with function, and both architecture and fusion scores correlate more strongly with molecular functions than biological processes. 3 The final study details the development of a novel heterogeneous function prediction approach designed to target the annotation of both homologous and non-homologous proteins. Support vector regression is used to combine pair-wise sequence features with expression scores and domain architecture scores to rank protein pairs in terms of their functional similarities. The target of the regression models represents the continuum of protein function space empirically derived from the Gene Ontology molecular function and biological process graphs. The merit and performance of the approach is demonstrated using homologous and non-homologous test datasets and significantly improves upon classical nearest neighbour annotation transfer by se- quence methods. The final model represents a method that achieves a compromise between high specificity and sensitivity for all human proteins regardless of their homology status. It is expected that this strategy will allow for more comprehensive and accurate annotations of the human proteome. 4 Acknowledgements First and foremost I would like to acknowledge my supervisors Prof. D T Jones and Prof. C A Orengo for their encouragement and excellent guidance throughout my project. I would also like to thank members of the Jones and Orengo groups especially Drs S. Lise and O. Redfern for helpful discussions. I acknowledge both UCL and CS department computing services for allowing me to gobble many CPU hours, abuse high memory machines and regularly destroy their exquisite computing infrastructure in order to complete my work. Lastly and most importantly, I acknowledge close friends, family (Dad especially for proof read- ing) and Dr. Richard Myers for his support and patience. Contents 1 Introduction 16 1.1 The importance of protein function ........................ 16 1.2 The need for automated methods ......................... 17 1.3 Function annotation schemes .......................... 18 1.3.1 Multifun, FunCat and KEGG hierarchical schemes . 19 1.3.2 The Gene Ontology ........................... 19 1.3.3 Modelling function similarity ....................... 27 1.4 Automated function prediction methods ...................... 30 1.4.1 Homology based approaches ....................... 31 1.4.2 Non-homology based approaches ..................... 41 1.4.3 Integrated function prediction approaches . 55 1.5 Thesis aims .................................. 61 2 Characterising the system: the Gene Ontology and Human proteome an- notations 63 2.1 Chapter aims ................................. 63 2.2 Shape and structure of the Ontology Graphs .................... 64 2.3 Human proteome annotations .......................... 65 2.4 Annotation specificity and completeness ..................... 69 2.5 Growth rates for human sequence annotations ................... 70 2.6 Datasets for benchmarking ........................... 73 2.7 Chapter Summary ............................... 75 3 Quantifying homology based annotation transfer 79 3.1 Chapter introduction and aims .......................... 79 3.2 Conducting homology searches ......................... 82 3.2.1 Datasets ................................ 82 3.2.2 Scoring sequence similarity ....................... 82 3.2.3 Scoring annotation transfers ....................... 85 3.3 Global scoring for annotation transfer ...................... 87 3.3.1 Comparing Molecular Function and Biological Process annotation transfer . 89 3.3.2 Performance of BLAST and PSI-BLAST algorithms . 92 3.3.3 Within and between species transfer .................... 93 3.4 Annotation specificities ............................. 93 3.5 Sources of errors ................................ 96 3.6 Local scoring and functional heterogeneity . 101 3.7 Discussion .................................. 106 4 Feature based function prediction 111 4.1 Chapter aims ................................. 111 4.2 Designing features encoding disorder . 113 4.2.1 Functional analysis of disordered human sequences . 113 4.2.2 Encoding strategy for disorder features . 115 4.2.3 Disorder features in context with other features . 118 4.3 Support Vector Classification of Function . 127 4.3.1 Training and testing datasets . 127 4.3.2 Kernel choice and parameter optimisation . 128 4.3.3 Function Category Classification Results . 129 4.3.4 Assessing the importance of different features in function prediction . 133 4.4 Benchmarking against ProtFun method . 137 4.4.1 FFPred Server ............................. 140 4.5 Chapter Summary ............................... 148 5 Designing pairwise features for function prediction 150 5.1 Introduction and aims .............................. 150 5.2 Defining a Function Similarity Measure . 151 5.2.1 Selection of a semantic similarity measure . 152 5.2.2 Function similarity measures . 155 5.3 Feature design for heterogeneous data . 155 5.3.1 Sequence Similarity . 157 5.3.2 Protein-protein Interactions . 159 5.3.3 Topology ............................... 160 5.3.4 Cellular Localisation . 165 5.3.5 Domain content and domain fusions . 167 5.3.6 Microarray expression information . 184 5.3.7 Characterising feature relationships . 200 5.4 Chapter summary ............................... 202 6 Combining features for function prediction: Performance evaluation and benchmark 205 6.1 Introduction and aims .............................. 205 6.2 Methods ................................... 207 6.2.1 Vector space integration . 207 6.2.2 Data source integration method . 210 6.2.3 Support Vector Regression training . 213 6.2.4 Integrating complex features . 216 6.3 Results and model application . 222 6.3.1 Technical performance of the different model integration approaches . 223 6.3.2 Practical assessment of model quality in annotation transfer . 224 6.3.3 Establishing the value of different data sources . 227 6.3.4 Annotation of uncharacterised human sequences . 232 6.4 Chapter Discussion ............................... 236 7 Discussion 240 Appendix I 248 Appendix II 263 Glossary 268 Bibliography 272 List of Figures 1.1 Example of Molecular Function graph taken from (Ashburner et al. 2000) . 22 1.2 Example of Biological Process graph taken from (Ashburner et al. 2000) . 23 1.3 Example of Cellular Component graph taken from (Ashburner et al. 2000) . 24 1.4 Semantic similarity example ........................... 28 1.5 Schematic of phylogenomics approach. ...................... 36 1.6 Phylogenetic profiling analysis .......................... 38 1.7 Diagrammatic representation of domain fusion and fissions . 40 1.8 Domain content vector representation ....................... 42 1.9 Schematic of the ProtFun method ........................ 44 1.10 Unsupervised clustering for microarray data .................... 47 1.11 Similarity measures for co-expression .....................
Recommended publications
  • "Phylogenetic Profiling"
    Introductory Review Phylogenetic p rofiling Matteo Pellegrini, Todd O. Yeates and David Eisenberg Howard Hughes Medical Institute, University of California at Los Angeles, Los Angeles, CA, USA Sorel T. Fitz-Gibbon IGPP Center for Astrobiology, University of California at Los Angeles, Los Angeles, CA, USA 1. Introduction Biology has been profoundly changed by the development of techniques to se- quence DNA. The advent of rapid sequencing in conjunction with the capability to assemble sequence fragments into complete genome sequences enables researchers to read and analyze entire genomes of organisms. Parallel progress has been made in algorithms to study the evolutionary history of proteins. The techniques rely on the ability to measure the similarity of protein sequences in order to determine the likelihood that different proteins are descended from a common ancestor. It is therefore possible to reconstruct families of proteins that share a common ancestor. Combining these two capabilities, we can now not only determine which proteins are coded within an organism’s genome but we can also discover the evolutionary relationships between the proteins of multiple organisms. Phylogenetic profiling is the study of which protein types are found in which organisms. In order to perform phylogenetic profiling, one must first establish a classification of proteins into families. An example of such a classification scheme across a broad range of fully sequenced organisms is the Clusters of Orthologous Groups (Tatusov, 1997), where an attempt is made to group together proteins that perform a similar function. Next, each organism is described in terms of which protein families are coded or not coded in its genome.
    [Show full text]
  • Pellegrini M. Using Phylogenetic Profiles To
    Chapter 9 Using Phylogenetic Profiles to Predict Functional Relationships Matteo Pellegrini Abstract Phylogenetic profiling involves the comparison of phylogenetic data across gene families. It is possible to construct phylogenetic trees, or related data structures, for specific gene families using a wide variety of tools and approaches. Phylogenetic profiling involves the comparison of this data to determine which families have correlated or coupled evolution. The underlying assumption is that in certain cases these couplings may allow us to infer that the two families are functionally related: that is their function in the cell is coupled. Although this technique can be applied to noncoding genes, it is more commonly used to assess the function of protein coding genes. Examples of proteins that are functionally related include subunits of protein complexes, or enzymes that perform consecutive steps along biochemical pathways. We hypothesize the deletion of one of the families from a genome would then indirectly affect the function of the other. Dozens of different implementations of the phylogenetic profiling technique have been developed over the past decade. These range from the first simple approaches that describe phylogenetic profiles as binary vectors to the most complex ones that attempt to model to the coevolution of protein families on a phylogenetic tree. We discuss a set of these implementations and present the software and databases that are available to perform phylogenetic profiling. Key words: Phylogenetic profiles, Coevolution, Functional associations, Comparative genomics, Coevolving proteins 1. Introduction The remarkable improvements in sequencing technology that have occurred over the past few decades have made the sequenc- ing of genomes an ever more routine task.
    [Show full text]
  • Integrating Genomic Data to Predict Transcription Factor Binding
    Integrating Genomic Data to Predict Transcription Factor Binding Dustin T. Holloway 1 Mark Kon 2 Charles DeLisi 3 [email protected] [email protected] [email protected] 1 Molecular Biology Cell Biology and Biochemistry, 3 Boston University, Boston, MA 02215, U.S.A. Bioinformatics and Systems Biology, 2 Boston University, Boston, MA 02215, U.S.A. Department of Mathematics and Statistics, Boston University, Boston, MA 02215 , U.S.A. Abstract Transcription factor binding sites (TFBS) in gene promoter regions are often predicted by using position specific scoring matrices (PSSMs), which summarize sequence patterns of experimentally determined TF binding sites. Although PSSMs are more reliable than simple consensus string matching in predicting a true binding site, they generally result in high numbers of false positive hits. This study attempts to reduce the number of false positive matches and generate new predictions by integrating various types of genomic data by two methods: a Bayesian allocation procedure, and support vector machine classification. Several methods will be explored to strengthen the prediction of a true TFBS in the Saccharomyces cerevisiae genome: binding site degeneracy, binding site conservation, phylogenetic profiling, TF binding site clustering, gene expression profiles, GO functional annotation, and k-mer counts in promoter regions. Binding site degeneracy (or redundancy) refers to the number of times a particular transcription factor’s binding motif is discovered in the upstream region of a gene. Phylogenetic conservation takes into account the number of orthologous upstream regions in other genomes that contain a particular binding site. Phylogenetic profiling refers to the presence or absence of a gene across a large set of genomes.
    [Show full text]
  • Scalable Phylogenetic Profiling Using Minhash Uncovers Likely Eukaryotic Sexual Reproduction Genes
    bioRxiv preprint doi: https://doi.org/10.1101/852491; this version posted November 22, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 1 Scalable Phylogenetic Profiling using MinHash Uncovers Likely Eukaryotic Sexual Reproduction Genes 1,2,3,* 1,2,3 4,5 1,2,3,6,7,* David Moi ​ ,​ Laurent Kilchoer ,​ Pablo S. Aguilar ​ and Christophe Dessimoz ​ ​ ​ ​ 1 2 Department​ of Computational Biology, University of Lausanne, Switzerland; Center​ for Integrative Genomics, ​ 3 4 University of Lausanne, Switzerland; SIB​ Swiss Institute of Bioinformatics, Lausanne, Switzerland; Instituto​ de ​ ​ 5 Investigaciones Biotecnologicas (IIBIO), Universidad Nacional de San Martín Buenos Aires, Argentina; Instituto​ de ​ 6 Fisiología, Biología Molecular y Neurociencias (IFIBYNE-CONICET), Department​ of Genetics, Evolution, and ​ 7 Environment, University College London, UK; Department​ of Computer Science, University College London, UK. ​ *Corresponding authors: [email protected] and [email protected] Abstract Phylogenetic profiling is a computational method to predict genes involved in the same biological process by identifying protein families which tend to be jointly lost or retained across the tree of life. Phylogenetic profiling has customarily been more widely used with prokaryotes than eukaryotes, because the method is thought to require many diverse genomes. There are now many eukaryotic genomes available, but these are considerably larger, and typical phylogenetic profiling methods require quadratic time or worse in the number of genes. We introduce a fast, scalable phylogenetic profiling approach entitled HogProf, which leverages hierarchical orthologous groups for the construction of large profiles and locality-sensitive hashing for efficient retrieval of similar profiles.
    [Show full text]
  • Review Article Protein-Protein Interaction Detection: Methods and Analysis
    Hindawi Publishing Corporation International Journal of Proteomics Volume 2014, Article ID 147648, 12 pages http://dx.doi.org/10.1155/2014/147648 Review Article Protein-Protein Interaction Detection: Methods and Analysis V. Srinivasa Rao,1 K. Srinivas,1 G. N. Sujini,2 and G. N. Sunand Kumar1 1 Department of CSE, VR Siddhartha Engineering College, Vijayawada 520007, India 2 Department of CSE, Mahatma Gandhi Institute of Technology, Hyderabad 500075, India Correspondence should be addressed to V. Srinivasa Rao; [email protected] Received 26 October 2013; Revised 5 December 2013; Accepted 20 December 2013; Published 17 February 2014 Academic Editor: Yaoqi Zhou Copyright © 2014 V. Srinivasa Rao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Protein-protein interaction plays key role in predicting the protein function of target protein and drug ability of molecules. The majority of genes and proteins realize resulting phenotype functions as a set of interactions. The in vitro and in vivo methods like affinity purification, Y2H (yeast 2 hybrid), TAP (tandem affinity purification), and so forth have their own limitations like cost, time, and so forth, and the resultant data sets are noisy and have more false positives to annotate the function of drug molecules. Thus, in silico methods which include sequence-based approaches, structure-based approaches, chromosome proximity, gene fusion, in silico 2 hybrid, phylogenetic tree, phylogenetic profile, and gene expression-based approaches were developed. Elucidation of protein interaction networks also contributes greatly to the analysis of signal transduction pathways.
    [Show full text]
  • Mouse Mutants As Models for Congenital Retinal Disorders
    Experimental Eye Research 81 (2005) 503–512 www.elsevier.com/locate/yexer Review Mouse mutants as models for congenital retinal disorders Claudia Dalke*, Jochen Graw GSF-National Research Center for Environment and Health, Institute of Developmental Genetics, D-85764 Neuherberg, Germany Received 1 February 2005; accepted in revised form 1 June 2005 Available online 18 July 2005 Abstract Animal models provide a valuable tool for investigating the genetic basis and the pathophysiology of human diseases, and to evaluate therapeutic treatments. To study congenital retinal disorders, mouse mutants have become the most important model organism. Here we review some mouse models, which are related to hereditary disorders (mostly congenital) including retinitis pigmentosa, Leber’s congenital amaurosis, macular disorders and optic atrophy. q 2005 Elsevier Ltd. All rights reserved. Keywords: animal model; retina; mouse; gene mutation; retinal degeneration 1. Introduction Although mouse models are a good tool to investigate retinal disorders, one should keep in mind that the mouse Mice suffering from hereditary eye defects (and in retina is somehow different from a human retina, particular from retinal degenerations) have been collected particularly with respect to the number and distribution of since decades (Keeler, 1924). They allow the study of the photoreceptor cells. The mouse as a nocturnal animal molecular and histological development of retinal degener- has a retina dominated by rods; in contrast, cones are small ations and to characterize the genetic basis underlying in size and represent only 3–5% of the photoreceptors. Mice retinal dysfunction and degeneration. The recent progress of do not form cone-rich areas like the human fovea.
    [Show full text]
  • Mapping Global and Local Co-Evolution Across 600 Species to Identify Novel Homologous Recombination Repair Genes
    Downloaded from genome.cshlp.org on October 9, 2021 - Published by Cold Spring Harbor Laboratory Press Mapping global and local co-evolution across 600 species to identify novel homologous recombination repair genes Dana Sherill-Rofe1*, Dolev Rahat1,2*, Steven Findlay3,4,#, Anna Mellul1,#, Irene Guberman1, Maya Braun1, Idit Bloch1, Alon Lalezari1, Arash Samiei 3,4, Ruslan Sadreyev5,6,7, Michal 8 3,4,9,10 2,† 1,†, Goldberg , Alexandre Orthwein , Aviad Zick and Yuval Tabach 1Department of Developmental Biology and Cancer Research, Institute for Medical Research- Israel-Canada, Hebrew University of Jerusalem, Jerusalem, Israel. 2Sharett Institute of Oncology, Hadassah Medical Center, Ein-Kerem, Jerusalem, Israel. 3Lady Davis Institute for Medical Research, Segal Cancer Centre, Jewish General Hospital, Montreal, Quebec, Canada. 4Division of Experimental Medicine, McGill University, Montreal, Quebec, Canada. 5Department of Molecular Biology, Massachusetts General Hospital, Boston, MA USA. 6Department of Genetics, Harvard Medical School, Boston, MA USA. 7Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA USA. 8Department of Genetics, Alexander Silberman Institute of Life Sciences, Hebrew University of Jerusalem, Jerusalem, Israel. 9Department of Microbiology and Immunology, McGill University, Montreal, Quebec, Canada. 10Gerald Bronfman Department of Oncology, McGill University, Montreal, Quebec, Canada. *These authors contributed equally to this work. # These authors contributed equally to this work.
    [Show full text]
  • B-Mediated Induction of BAG3 Upon Inhibition of Constitutive Protein Degradation Pathways
    Citation: Cell Death and Disease (2015) 6, e1692; doi:10.1038/cddis.2014.584 OPEN & 2015 Macmillan Publishers Limited All rights reserved 2041-4889/15 www.nature.com/cddis NIK is required for NF-κB-mediated induction of BAG3 upon inhibition of constitutive protein degradation pathways F Rapino1,5, BA Abhari1,5, M Jung2,3,4 and S Fulda*,1,3,4 Recently, we reported that induction of the co-chaperone Bcl-2-associated athanogene 3 (BAG3) is critical for recovery of rhabdomyosarcoma (RMS) cells after proteotoxic stress upon inhibition of the two constitutive protein degradation pathways, that is, the ubiquitin-proteasome system by Bortezomib and the aggresome-autophagy system by histone deacetylase 6 (HDAC6) inhibitor ST80. In the present study, we investigated the molecular mechanisms mediating BAG3 induction under these conditions. Here, we identify nuclear factor-kappa B (NF-κB)-inducing kinase (NIK) as a key mediator of ST80/Bortezomib-stimulated NF-κB activation and transcriptional upregulation of BAG3. ST80/Bortezomib cotreatment upregulates mRNA and protein expression of NIK, which is accompanied by an initial increase in histone H3 acetylation. Importantly, NIK silencing by siRNA abolishes NF-κB activation and BAG3 induction by ST80/Bortezomib. Furthermore, ST80/Bortezomib cotreatment stimulates NF-κB transcriptional activity and upregulates NF-κB target genes. Genetic inhibition of NF-κB by overexpression of dominant- negative IκBα superrepressor (IκBα-SR) or by knockdown of p65 blocks the ST80/Bortezomib-stimulated upregulation of BAG3 mRNA and protein expression. Interestingly, inhibition of lysosomal activity by Bafilomycin A1 inhibits ST80/Bortezomib- stimulated IκBα degradation, NF-κB activation and BAG3 upregulation, indicating that IκBα is degraded via the lysosome in the presence of Bortezomib.
    [Show full text]
  • Absence of Sac2/INPP5F Enhances the Phenotype of a Parkinson's
    Absence of Sac2/INPP5F enhances the phenotype of a Parkinson’s disease mutation of synaptojanin 1 Mian Caoa,b,c,d,1,2, Daehun Parka,b,c,d,1, Yumei Wua,b,c,d, and Pietro De Camillia,b,c,d,e,3 aDepartment of Neuroscience, Yale University School of Medicine, New Haven, CT 06510; bDepartment of Cell Biology, Yale University School of Medicine, New Haven, CT 06510; cHoward Hughes Medical Institute, Yale University School of Medicine, New Haven, CT 06510; dProgram in Cellular Neuroscience, Neurodegeneration, and Repair, Yale University School of Medicine, New Haven, CT 06510; and eKavli Institute for Neuroscience, Yale University School of Medicine, New Haven, CT 06510 Contributed by Pietro De Camilli, April 9, 2020 (sent for review March 9, 2020; reviewed by Volker Haucke and David Sulzer) Numerous genes whose mutations cause, or increase the risk of, analysis of cultured neurons and brain tissue of these mice Parkinson’s disease (PD) have been identified. An inactivating mu- demonstrated defects in synaptic vesicle recycling and dystrophic tation (R258Q) in the Sac inositol phosphatase domain of synapto- changes selectively in nerve terminals of dopaminergic neurons janin 1 (SJ1/PARK20), a phosphoinositide phosphatase implicated in the striatum (17). Levels of auxilin (PARK19) and other in synaptic vesicle recycling, results in PD. The gene encoding Sac2/ endocytic factors were abnormally increased in the brains of such INPP5F, another Sac domain-containing protein, is located within a mice (17). In addition, levels of the PD gene parkin (PARK2) PD risk locus identified by genome-wide association studies. were strikingly elevated (17), strengthening the evidence for Knock-In mice carrying the SJ1 patient mutation (SJ1RQKI) exhibit a functional partnership among endophilin, SJ1, and parkin PD features, while Sac2 knockout mice (Sac2KO) do not have ob- reported by previous studies (18–20).
    [Show full text]
  • Entrez ID 1 Symbol 1 Entrez ID 2 Symbol 2 Data Source (R
    Supporting Information Table 4. List of human protein-protein interactons. Entrez ID 1 Symbol 1 Entrez ID 2 Symbol 2 Data Source (R: Rual et al; S: Stelzl et al; L: Literature curation) 1 A1BG 10321 CRISP3 L 2 A2M 259 AMBP L 2 A2M 348 APOE L 2 A2M 351 APP L 2 A2M 354 KLK3 L 2 A2M 567 B2M L 2 A2M 1508 CTSB L 2 A2M 1990 ELA1 L 2 A2M 3309 HSPA5 L 2 A2M 3553 IL1B L 2 A2M 3586 IL10 L 2 A2M 3931 LCAT L 2 A2M 3952 LEP L 2 A2M 4035 LRP1 L 2 A2M 4803 NGFB L 2 A2M 5047 PAEP L 2 A2M 7045 TGFBI L 2 A2M 8728 ADAM19 L 2 A2M 9510 ADAMTS1 L 2 A2M 10944 SMAP S 2 A2M 55729 ATF7IP L 9 NAT1 8260 ARD1A L 12 SERPINA3 351 APP L 12 SERPINA3 354 KLK3 L 12 SERPINA3 1215 CMA1 L 12 SERPINA3 1504 CTRB1 L 12 SERPINA3 1506 CTRL L 12 SERPINA3 1511 CTSG L 12 SERPINA3 1990 ELA1 L 12 SERPINA3 1991 ELA2 L 12 SERPINA3 2064 ERBB2 L 12 SERPINA3 2153 F5 L 12 SERPINA3 3817 KLK2 L 12 SERPINA3 4035 LRP1 L 12 SERPINA3 4485 MST1 L 12 SERPINA3 5422 POLA L 12 SERPINA3 64215 DNAJC1 L 14 AAMP 51497 TH1L S 15 AANAT 7534 YWHAZ L 18 ABAT 7915 ALDH5A1 L 19 ABCA1 335 APOA1 L 19 ABCA1 6645 SNTB2 L 19 ABCA1 8772 FADD L 20 ABCA2 55755 CDK5RAP2 L 22 ABCB7 2235 FECH L 23 ABCF1 3692 ITGB4BP S 24 ABCA4 1258 CNGB1 L 25 ABL1 27 ABL2 L 25 ABL1 472 ATM L 25 ABL1 613 BCR L 25 ABL1 718 C3 L 25 ABL1 867 CBL L 25 ABL1 1501 CTNND2 L 25 ABL1 2048 EPHB2 L 25 ABL1 2547 XRCC6 L 25 ABL1 2876 GPX1 L 25 ABL1 2885 GRB2 L 25 ABL1 3055 HCK L 25 ABL1 3636 INPPL1 L 25 ABL1 3716 JAK1 L 25 ABL1 4193 MDM2 L 25 ABL1 4690 NCK1 L 25 ABL1 4914 NTRK1 L 25 ABL1 5062 PAK2 L 25 ABL1 5295 PIK3R1 L 25 ABL1 5335 PLCG1 L 25 ABL1 5591
    [Show full text]
  • Phylogenetic Profiling of the Arabidopsis Thaliana Proteome
    Open Access Research2004Gutiérrezet al.Volume 5, Issue 8, Article R53 Phylogenetic profiling of the Arabidopsis thaliana proteome: what comment proteins distinguish plants from other organisms? Rodrigo A Gutiérrez*†§, Pamela J Green‡, Kenneth Keegstra*† and John B Ohlrogge† Addresses: *Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, MI 48824-1312, USA. †Department of Plant Biology, Michigan State University, East Lansing, MI 48824-1312, USA. ‡Delaware Biotechnology Institute, University of Delaware, 15 § Innovation Way, Newark, DE 19711, USA. Current address: Department of Biology, New York University, 100 Washington Square East, New reviews York, NY 10003, USA. Correspondence: Rodrigo A Gutiérrez. E-mail: [email protected] Published: 15 July 2004 Received: 16 March 2004 Revised: 10 May 2004 Genome Biology 2004, 5:R53 Accepted: 7 June 2004 The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2004/5/8/R53 reports © 2004 Gutiérrez et al.; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL Phylogenetic<p>Theopportunitycodingof life</p> genes availability to profilingof decipher <it>Arabidopsis of theof the the complete genetic Arabidopsis thaliana genomefactors </it>onthaliana that sequence define the proteome: ofbasis plant <it>Arab of form whattheir idopsisand proteinspattern function. thaliana of distingu sequence To </it>together beginish similarityplants this task, from wi to thwe other organismsthose have organisms? of classified other across organisms the the nuc threlear proe domains protein-vides an deposited research Abstract Background: The availability of the complete genome sequence of Arabidopsis thaliana together with those of other organisms provides an opportunity to decipher the genetic factors that define plant form and function.
    [Show full text]
  • Phylogenetic Profiling of the Arabidopsis Genome
    Figure 1 Phylogenetic Profiling of the Arabidopsis Genome Aren Ewing and Gernot Presting Department of Molecular Biosciences and Bioengineering, University of Hawaii, Honolulu, HI 96822, USA Conclusion: Introduction: Results: The goal of this project is to assign genes to Member proteins of several biosynthetic pathways pathways based on co-evolution. Arabidopsis Clustered pathway proteins were found in the hierarchical tree using the greedy outlier removal clustered based on their phylogenetic profiles using several thaliana is the first fully sequenced plant, and program. Tryptophan and isoprenoid biosynthesis pathways were examined in detail. Most of different hierarchical trees and datasets. Within these many (60%) of its genes are functionally the proteins in the pathways do cluster, those that don’t either catalyze the first enzymatic step compact clusters we found new metabolic pathway annotated. This makes it a good candidate for or are misannotated / mispredicted (figure 3). New pathway members were found near known members that were not annotated in Aracyc. This study phlyogenetic studies. Utilizing existing pathway members in the hierarchical tree (figure 4). demonstrates the utility of a novel tool for assigning new information about pathways of Arabidopsis Figure 1 (left and poster border): All of the Arabidopsis genes with at least one match other members to pathways through the use of a high throughput (AraCyc), a tool was created to find novel than Arabidopsis (19324 rows) clustered in a binary tree. Clusters of unique profiles are pathway/gene list search. genes that may be associated with known Figure 1 scattered throughout the tree. Red color signifies presence of that Arabidopsis gene in target pathways.
    [Show full text]