What's That Gene
Total Page:16
File Type:pdf, Size:1020Kb
What’s that gene (or protein)? Online resources for exploring functions of genes, transcripts, and proteins James Hutchins To cite this version: James Hutchins. What’s that gene (or protein)? Online resources for exploring functions of genes, transcripts, and proteins. Molecular Biology of the Cell, American Society for Cell Biology, 2014, 25 (8), pp.1187-1201. 10.1091/mbc.E13-10-0602. hal-02354657 HAL Id: hal-02354657 https://hal.archives-ouvertes.fr/hal-02354657 Submitted on 7 Nov 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. M BoC | TECHNICAL PERSPECTIVE What’s that gene (or protein)? Online resources for exploring functions of genes, transcripts, and proteins James R. A. Hutchins Institute of Human Genetics, Centre National de la Recherche Scientifique (CNRS), 34396 Montpellier, France ABSTRACT The genomic era has enabled research projects that use approaches including Monitoring Editor genome-scale screens, microarray analysis, next-generation sequencing, and mass spectrom- Doug Kellogg etry–based proteomics to discover genes and proteins involved in biological processes. Such University of California, Santa Cruz methods generate data sets of gene, transcript, or protein hits that researchers wish to ex- plore to understand their properties and functions and thus their possible roles in biological Received: Oct 21, 2013 systems of interest. Recent years have seen a profusion of Internet-based resources to aid this Revised: Feb 13, 2014 process. This review takes the viewpoint of the curious biologist wishing to explore the prop- Accepted: Feb 14, 2014 erties of protein-coding genes and their products, identified using genome-based technolo- gies. Ten key questions are asked about each hit, addressing functions, phenotypes, expres- sion, evolutionary conservation, disease association, protein structure, interactors, posttranslational modifications, and inhibitors. Answers are provided by presenting the latest publicly available resources, together with methods for hit-specific and data set–wide infor- mation retrieval, suited to any genome-based analytical technique and experimental species. The utility of these resources is demonstrated for 20 factors regulating cell proliferation. Re- sults obtained using some of these are discussed in more depth using the p53 tumor suppres- sor as an example. This flexible and universally applicable approach for characterizing ex- perimental hits helps researchers to maximize the potential of their projects for biological discovery. DOI: 10.1091/mbc.E13-10-0602 INTRODUCTION Address correspondence to: James R. A. Hutchins ([email protected]). The past decade has witnessed huge advances in the power and Periodic updates to the Supplemental Materials for this article will be made avail- able on the author’s website: http://www.jrahutchins.net. scope of analytical technologies based on genomic data. These The author declares that no conflict of interest exists. methods, which include the functional identification of genes using Abbreviations used: CDD, Conserved Domain Database; CNRS, Centre National traditional genetic and RNA interference (RNAi)-based knockdown de la Recherche Scientifique; DDBJ, DNA Data Bank of Japan; EBI, European screens (Forsburg, 2001; Boutros and Ahringer, 2008), the identifica- Bioinformatics Institute; ELM, eukaryotic linear motif; ENA, European Nucleotide Archive; GDSC, Genomics of Drug Sensitivity in Cancer; GEO, Gene Expression tion of DNA and RNA populations by microarray analysis or next- Omnibus; GI, GenInfo Identifier; GO, Gene Ontology; GSV, genomic structural generation sequencing (Capaldi, 2010; Niedringhaus et al., 2011; variation; ID, identifier code; IMEx, International Molecular Exchange; IPI, Interna- Ozsolak and Milos, 2011), and the identification of proteins, com- tional Protein Index; KEGG, Kyoto Encyclopedia of Genes and Genomes; NCBI, National Center for Biotechnology Information; nr, nonredundant; OMIM, Online plexes, and their modifications using mass spectrometry–based Mendelian Inheritance in Man; PDB, Protein Data Bank; PTM, posttranslational proteomics (Walther and Mann, 2010), have transformed biological modification; RCSB, Research Collaboratory for Structural Bioinformatics; Ro5, Rule of Five; SLiM, short linear motif; SNP, single-nucleotide polymorphism. research. Many researchers can now turn to these techniques to ad- © 2014 Hutchins. This article is distributed by The American Society for Cell Biol- dress specific biological questions or, by performing larger-scale or ogy under license from the author(s). Two months after publication it is available high-throughput experiments, discover genes, transcripts, and pro- to the public under an Attribution–Noncommercial–Share Alike 3.0 Unported Creative Commons License (http://creativecommons.org/licenses/by-nc-sa/3.0). teins involved in their system of interest. “ASCB®,” “The American Society for Cell Biology®,” and “Molecular Biology of Although these technologies differ greatly in their principles the Cell®” are registered trademarks of The American Society of Cell Biology. and mechanistics, their general approaches share a similar form. A Supplemental Material can be found at: Volume 25 April 15, 2014 http://www.molbiolcell.org/content/suppl/2014/04/03/25.8.1187.DC1.html 1187 tionally, resulting in the identification of Biological multiple gene, transcript, or protein hits, material that is, entries from public sequence data- Isolation of DNA, RNA or proteins. bases, each described using a unique iden- tifier code (ID; in some cases known as an accession code), plus other data pertinent Sample quality control Sample for analysis to the experiment in question, such as a (yield, purity) Sample processing confidence score or intensity measurement. The resulting hits table may typically un- Analysis of processed sample: data acquisition dergo further bioinformatic analyses, includ- (e.g. sequencing, microarray analysis, or mass spectrometry). ing statistical validation, ranking, and, in Raw data file(s) some cases, identification and removal of known contaminant entries. Public databases Unfortunately, this hits table is often nucleic acid or Data analysis: identification of “hits” (genes, protein sequences transcripts, or proteins), with confidence scores. where platform-driven analysis stops, leav- ing the research biologist with a list of often Bioinformatic analysis: statistical validation, ranking, unfamiliar gene, transcript, or protein contaminant filtering. Sequence & Functions, pathways names, abbreviations, and IDs, of which he genomic origins Results table & systems or she has the task of making sense. Re- Hits (genes, transcripts or proteins), searchers faced with such a list will naturally each with a unique identifier code be curious to find out more about each of (ID) and analysis-specific data. the hits, to determine whether they are in- Question 1 Question 2 Supplemental Tables S1, S2 & S3 Supplemental Tables S4, S5, S6 & S7 teresting and worthy of investing time and resources for follow-up studies. They may wish to know whether the hit was previously Homologues Gene expression & evolution reported to have an involvement in their bi- ological system of interest or whether it is novel and what is known about its functions, structure, interactions, and so on. Question 3 Question 4 Fortunately, help is at hand. The past de- Supplemental Table S8 Supplemental Table S9 cade has also seen the emergence of a plethora of high-quality database resources Protein structure Protein-protein Post-translational providing information about the functions of interactions modifications genes, transcripts, and proteins for many or- ganisms. These provide multiple gateways for the biologist, allowing access to informa- Question 5 Question 6 Question 7 tion relating to nucleotide or amino acid se- Supplemental Tables S10 & S11 Supplemental Table S12 Supplemental Table S13 quences, genomic origins, evolutionary conservation, expression in cells and tissues, and association with disease processes. Fur- Genetic variants & Drugs & inhibitors Summary & overview ther protein-related information centers on disease association enzymatic or other functions, biological pro- cesses in which they are involved, domain and three-dimensional structures, interac- Question 8 Question 9 Question 10 tion partners, posttranslational modifica- Supplemental Tables S14 & S15 Supplemental Table S16 Supplemental Tables S17 & S18 tions, and the possibility of modulating their activities using small-molecule inhibitors. FIGURE 1: Generalized workflow for the analysis of DNA, RNA, or protein samples and Database resources providing such in- questions about the hits identified. Nucleic acid or protein samples isolated from the biological formation are necessarily based on gene, material of interest are processed, then analyzed by various methods. Raw analytical data are (less commonly) transcript, or protein IDs. then matched to entries in public databases, generating a results table listing the genes, Analytical