The Pfam Protein Families Database Marco Punta1,*, Penny C

Total Page:16

File Type:pdf, Size:1020Kb

The Pfam Protein Families Database Marco Punta1,*, Penny C D290–D301 Nucleic Acids Research, 2012, Vol. 40, Database issue Published online 29 November 2011 doi:10.1093/nar/gkr1065 The Pfam protein families database Marco Punta1,*, Penny C. Coggill1, Ruth Y. Eberhardt1, Jaina Mistry1, John Tate1, Chris Boursnell1, Ningze Pang1, Kristoffer Forslund2, Goran Ceric3, Jody Clements3, Andreas Heger4, Liisa Holm5, Erik L. L. Sonnhammer2, Sean R. Eddy3, Alex Bateman1 and Robert D. Finn3 1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK, 2Stockholm Bioinformatics Center, Swedish eScience Research Center, Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, SE-17121 Solna, Sweden, 3HHMI Janelia Farm Research Campus, 19700 Helix Drive, Ashburn, VA 20147, USA, 4Department of Physiology, Anatomy and Genetics, MRC Functional Genomics Unit, University of Oxford, Oxford, OX1 3QX, UK and 5Institute of Biotechnology and Department of Biological and Environmental Sciences, University of Helsinki, PO Box 56 Downloaded from (Viikinkaari 5), 00014 Helsinki, Finland Received October 7, 2011; Revised October 26, 2011; Accepted October 27, 2011 http://nar.oxfordjournals.org/ ABSTRACT INTRODUCTION Pfam is a widely used database of protein families, Pfam is a database of protein families, where families are currently containing more than 13 000 manually sets of protein regions that share a significant degree of curated protein families as of release 26.0. Pfam is sequence similarity, thereby suggesting homology. available via servers in the UK (http://pfam.sanger Similarity is detected using the HMMER3 (http:// hmmer.janelia.org/) suite of programs. .ac.uk/), the USA (http://pfam.janelia.org/) and Pfam contains two types of families: high quality, Sweden (http://pfam.sbc.su.se/). Here, we report manually curated Pfam-A families and automatically at Helsinki University Library on May 19, 2016 on changes that have occurred since our 2010 generated Pfam-B families. The latter are derived from NAR paper (release 24.0). Over the last 2 years, we clusters produced by the ADDA algorithm (1), followed have generated 1840 new families and by the subtraction of overlapping Pfam-A regions at each increased coverage of the UniProt Knowledgebase release. Pfam-A families are built following what is, in (UniProtKB) to nearly 80%. Notably, we have essence, a four-step process: taken the step of opening up the annotation of our (i) building of a high-quality multiple sequence align- families to the Wikipedia community, by linking ment (the so-called seed alignment); Pfam families to relevant Wikipedia pages and (ii) constructing a profile hidden Markov model (HMM) encouraging the Pfam and Wikipedia communities from the seed alignment (using HMMER3); to improve and expand those pages. We continue (iii) searching the profile HMM against the UniProtKB to improve the Pfam website and add new visualiza- sequence database (2) and tions, such as the ‘sunburst’ representation of (iv) choosing family-specific sequence and domain taxonomic distribution of families. In this work we gathering thresholds (GAs); all sequence regions additionally address two topics that will be of that score above the GAs are included in the full particular interest to the Pfam community. First, alignment for the family (GAs are described in we explain the definition and use of family-specific, detail in a later section of this paper). manually curated gathering thresholds. Second, we In addition to providing matches to UniProtKB, Pfam discuss some of the features of domains of also provides matches for the NCBI non-redundant unknown function (also known as DUFs), which con- database, as well as a collection of metagenomic stitute a rapidly growing class of families within samples. We generate a variety of data downstream, Pfam. including, among others, a family sequence-conservation *To whom correspondence should be addressed. Tel: +44 1223 497399; Fax: +44 1223 494919; Email: [email protected] ß The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2012, Vol. 40, Database issue D291 logo based on the HMM, a description of domain archi- give users an overview of the function(s) of the family or tectures, where all co-occurrences with other domains are domain. However, rather than have the Pfam curators reported, and a species tree summarizing the taxonomic describe our families, we would strongly prefer to have range in the family. annotations written by those who know the proteins and The quality of the seed alignment is the crucial factor in families best, namely the biologists and informaticians determining the quality of the Pfam resource, influencing who work with them on a day-to-day basis. Harnessing not only all data generated within the database but also the knowledge of these experts remains a significant the outcome of external searches that use our profile challenge. HMMs, e.g. to assign domains to proteins which are One recent approach has been to use Wikipedia as a part of newly sequenced genomes. For this reason, a con- source of scientific information (7,8). Wikipedia is the siderable curatorial effort goes into seed alignment world’s largest online encyclopedia, with over 3.7 million generation. English language articles, and is widely acknowledged to Members of the same Pfam family are expected to share be the most popular general reference work on the a common evolutionary history and thus at least some internet. A cornerstone of Wikipedia is that anyone can functional aspect. Ideally, our families should represent edit the content. functional units, which, when combined in different The Rfam database moved all of its family annota- Downloaded from ways, can generate proteins with unique functions. The tion into articles in Wikipedia in 2009, thereby ultimate goal of Pfam is to create a collection of function- allowing anyone to freely edit and improve their ally annotated families that is as representative as possible content. This experiment has proved successful, of protein sequence-space, such that our families can be engaging the wider scientific community to provide expert annotations and improving the overall quality of used effectively for both genome-annotation and http://nar.oxfordjournals.org/ small-scale protein studies. It must be stressed, however, the Rfam annotation (7). In light of this positive experi- that homology is no guarantee of functional similarity and ence, we decided to adopt the same approach and use transfer of functional annotation based solely on family Wikipedia as the primary source of Pfam annotation membership should always be undertaken with caution. (Figure 1A). On the other hand, additional data that are available The Rfam database included around 600 families when from Pfam, such as conservation of family signature the switch to Wikipedia was made. This made it feasible to residues or conservation of common domain architec- assign existing articles where possible and to generate new tures, can increase confidence in a given functional ‘stub’ articles in Wikipedia for any family that still lacked hypothesis. For more background on how to any relevant article. These stubs have since been gradually at Helsinki University Library on May 19, 2016 query and use our web interface please refer to Coggill expanded and improved by the Rfam and Wikipedia et al. (3). communities. At the time when Pfam began using In this paper, we report on the most recent Pfam release Wikipedia for annotations, release 25.0, there were (26.0) as well as on important changes that have been 12 273 Pfam-A families. We initially identified existing introduced over the last 2 years, since our 2010 NAR articles that described protein domains or families and database issue paper (where we presented release 24.0) which provided useful information about the Pfam (4). Arguably, the change carrying the most significant family. These articles are now assigned as the primary philosophical implications has been the decision to annotation for the appropriate families. Given the follow the lead of the Rfam database (5) and out-source number of families that remain without an article, functional annotation of Pfam families to Wikipedia. We however, it is simply not feasible to manually gener- will discuss the background to this decision and give ate articles for all of them. For these families we details of the progress towards Wikipedia coverage of continue to show the original annotation comments, Pfam families. Another important development has been which were written by the curator of the family, while the adoption of the iterative sequence-search program encouraging our users to tell us about appropriate jackhmmer (6) as our principal tool for generating new Wikipedia articles or to create them on our behalf. families. In addition, we have extended our mechanism Furthermore, it is likely that there will be some families for family curation, which now allows trained and that are not sufficiently notable for inclusion in Wikipedia trusted external collaborators to create and add their and we anticipate that many of these will remain without own families to Pfam. Finally, we will take this opportun- Wikipedia annotations for the long term, perhaps ity to address and present fresh analysis on two topics that indefinitely. we consider of particular importance: family-specific GAs As of release 26.0, there are 4909 Pfam families that link and Domains of Unknown Function (DUFs). to 1016 Wikipedia articles. We invite readers and users of the Pfam website to edit and improve these articles in Wikipedia. Mapping of Pfam-A families to Wikipedia WHAT’S NEW articles is available in JSON format, from: http://pfamsrv.sanger.ac.uk/cgi-bin/mapping.cgi?db Community annotation =pfam.
Recommended publications
  • The ELIXIR Core Data Resources: ​Fundamental Infrastructure for The
    Supplementary Data: The ELIXIR Core Data Resources: fundamental infrastructure ​ for the life sciences The “Supporting Material” referred to within this Supplementary Data can be found in the Supporting.Material.CDR.infrastructure file, DOI: 10.5281/zenodo.2625247 (https://zenodo.org/record/2625247). ​ ​ Figure 1. Scale of the Core Data Resources Table S1. Data from which Figure 1 is derived: Year 2013 2014 2015 2016 2017 Data entries 765881651 997794559 1726529931 1853429002 2715599247 Monthly user/IP addresses 1700660 2109586 2413724 2502617 2867265 FTEs 270 292.65 295.65 289.7 311.2 Figure 1 includes data from the following Core Data Resources: ArrayExpress, BRENDA, CATH, ChEBI, ChEMBL, EGA, ENA, Ensembl, Ensembl Genomes, EuropePMC, HPA, IntAct /MINT , InterPro, PDBe, PRIDE, SILVA, STRING, UniProt ● Note that Ensembl’s compute infrastructure physically relocated in 2016, so “Users/IP address” data are not available for that year. In this case, the 2015 numbers were rolled forward to 2016. ● Note that STRING makes only minor releases in 2014 and 2016, in that the interactions are re-computed, but the number of “Data entries” remains unchanged. The major releases that change the number of “Data entries” happened in 2013 and 2015. So, for “Data entries” , the number for 2013 was rolled forward to 2014, and the number for 2015 was rolled forward to 2016. The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences ​ 1 Figure 2: Usage of Core Data Resources in research The following steps were taken: 1. API calls were run on open access full text articles in Europe PMC to identify articles that ​ ​ mention Core Data Resource by name or include specific data record accession numbers.
    [Show full text]
  • Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes
    ARTICLES https://doi.org/10.1038/s41564-019-0510-x Corrected: Author Correction Cryptic inoviruses revealed as pervasive in bacteria and archaea across Earth’s biomes Simon Roux 1*, Mart Krupovic 2, Rebecca A. Daly3, Adair L. Borges4, Stephen Nayfach1, Frederik Schulz 1, Allison Sharrar5, Paula B. Matheus Carnevali 5, Jan-Fang Cheng1, Natalia N. Ivanova 1, Joseph Bondy-Denomy4,6, Kelly C. Wrighton3, Tanja Woyke 1, Axel Visel 1, Nikos C. Kyrpides1 and Emiley A. Eloe-Fadrosh 1* Bacteriophages from the Inoviridae family (inoviruses) are characterized by their unique morphology, genome content and infection cycle. One of the most striking features of inoviruses is their ability to establish a chronic infection whereby the viral genome resides within the cell in either an exclusively episomal state or integrated into the host chromosome and virions are continuously released without killing the host. To date, a relatively small number of inovirus isolates have been extensively studied, either for biotechnological applications, such as phage display, or because of their effect on the toxicity of known bacterial pathogens including Vibrio cholerae and Neisseria meningitidis. Here, we show that the current 56 members of the Inoviridae family represent a minute fraction of a highly diverse group of inoviruses. Using a machine learning approach lever- aging a combination of marker gene and genome features, we identified 10,295 inovirus-like sequences from microbial genomes and metagenomes. Collectively, our results call for reclassification of the current Inoviridae family into a viral order including six distinct proposed families associated with nearly all bacterial phyla across virtually every ecosystem.
    [Show full text]
  • Learning Protein Constitutive Motifs from Sequence Data Je´ Roˆ Me Tubiana, Simona Cocco, Re´ Mi Monasson*
    TOOLS AND RESOURCES Learning protein constitutive motifs from sequence data Je´ roˆ me Tubiana, Simona Cocco, Re´ mi Monasson* Laboratory of Physics of the Ecole Normale Supe´rieure, CNRS UMR 8023 & PSL Research, Paris, France Abstract Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (a-helixes and b-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and ’turning up’ or ’turning down’ the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families. DOI: https://doi.org/10.7554/eLife.39397.001 Introduction In recent years, the sequencing of many organisms’ genomes has led to the collection of a huge number of protein sequences, which are catalogued in databases such as UniProt or PFAM Finn et al., 2014). Sequences that share a common ancestral origin, defining a family (Figure 1A), *For correspondence: are likely to code for proteins with similar functions and structures, providing a unique window into [email protected] the relationship between genotype (sequence content) and phenotype (biological features).
    [Show full text]
  • DECIPHER: Harnessing Local Sequence Context to Improve Protein Multiple Sequence Alignment Erik S
    Wright BMC Bioinformatics (2015) 16:322 DOI 10.1186/s12859-015-0749-z RESEARCH ARTICLE Open Access DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment Erik S. Wright1,2 Abstract Background: Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. Results: Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on largesequencesets. Conclusions: Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins.
    [Show full text]
  • Methods in and Applications of the Sequencing of Short Non-Coding Rnas" (2013)
    University of Pennsylvania ScholarlyCommons Publicly Accessible Penn Dissertations 2013 Methods in and Applications of the Sequencing of Short Non- Coding RNAs Paul Ryvkin University of Pennsylvania, [email protected] Follow this and additional works at: https://repository.upenn.edu/edissertations Part of the Bioinformatics Commons, Genetics Commons, and the Molecular Biology Commons Recommended Citation Ryvkin, Paul, "Methods in and Applications of the Sequencing of Short Non-Coding RNAs" (2013). Publicly Accessible Penn Dissertations. 922. https://repository.upenn.edu/edissertations/922 This paper is posted at ScholarlyCommons. https://repository.upenn.edu/edissertations/922 For more information, please contact [email protected]. Methods in and Applications of the Sequencing of Short Non-Coding RNAs Abstract Short non-coding RNAs are important for all domains of life. With the advent of modern molecular biology their applicability to medicine has become apparent in settings ranging from diagonistic biomarkers to therapeutics and fields angingr from oncology to neurology. In addition, a critical, recent technological development is high-throughput sequencing of nucleic acids. The convergence of modern biotechnology with developments in RNA biology presents opportunities in both basic research and medical settings. Here I present two novel methods for leveraging high-throughput sequencing in the study of short non- coding RNAs, as well as a study in which they are applied to Alzheimer's Disease (AD). The computational methods presented here include High-throughput Annotation of Modified Ribonucleotides (HAMR), which enables researchers to detect post-transcriptional covalent modifications ot RNAs in a high-throughput manner. In addition, I describe Classification of RNAs by Analysis of Length (CoRAL), a computational method that allows researchers to characterize the pathways responsible for short non-coding RNA biogenesis.
    [Show full text]
  • Comparing Tools for Non-Coding RNA Multiple Sequence Alignment Based On
    Downloaded from rnajournal.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press ES Wright 1 1 TITLE 2 RNAconTest: Comparing tools for non-coding RNA multiple sequence alignment based on 3 structural consistency 4 Running title: RNAconTest: benchmarking comparative RNA programs 5 Author: Erik S. Wright1,* 6 1 Department of Biomedical Informatics, University of Pittsburgh (Pittsburgh, PA) 7 * Corresponding author: Erik S. Wright ([email protected]) 8 Keywords: Multiple sequence alignment, Secondary structure prediction, Benchmark, non- 9 coding RNA, Consensus secondary structure 10 Downloaded from rnajournal.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press ES Wright 2 11 ABSTRACT 12 The importance of non-coding RNA sequences has become increasingly clear over the past 13 decade. New RNA families are often detected and analyzed using comparative methods based on 14 multiple sequence alignments. Accordingly, a number of programs have been developed for 15 aligning and deriving secondary structures from sets of RNA sequences. Yet, the best tools for 16 these tasks remain unclear because existing benchmarks contain too few sequences belonging to 17 only a small number of RNA families. RNAconTest (RNA consistency test) is a new 18 benchmarking approach relying on the observation that secondary structure is often conserved 19 across highly divergent RNA sequences from the same family. RNAconTest scores multiple 20 sequence alignments based on the level of consistency among known secondary structures 21 belonging to reference sequences in their output alignment. Similarly, consensus secondary 22 structure predictions are scored according to their agreement with one or more known structures 23 in a family.
    [Show full text]
  • 1 Codon-Level Information Improves Predictions of Inter-Residue Contacts in Proteins 2 by Correlated Mutation Analysis 3
    1 Codon-level information improves predictions of inter-residue contacts in proteins 2 by correlated mutation analysis 3 4 5 6 7 8 Etai Jacob1,2, Ron Unger1,* and Amnon Horovitz2,* 9 10 11 1The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat- 12 Gan, 52900, 2Department of Structural Biology 13 Weizmann Institute of Science, Rehovot 7610001, Israel 14 15 16 *To whom correspondence should be addressed: 17 Amnon Horovitz ([email protected]) 18 Ron Unger ([email protected]) 19 1 20 Abstract 21 Methods for analysing correlated mutations in proteins are becoming an increasingly 22 powerful tool for predicting contacts within and between proteins. Nevertheless, 23 limitations remain due to the requirement for large multiple sequence alignments (MSA) 24 and the fact that, in general, only the relatively small number of top-ranking predictions 25 are reliable. To date, methods for analysing correlated mutations have relied exclusively 26 on amino acid MSAs as inputs. Here, we describe a new approach for analysing 27 correlated mutations that is based on combined analysis of amino acid and codon MSAs. 28 We show that a direct contact is more likely to be present when the correlation between 29 the positions is strong at the amino acid level but weak at the codon level. The 30 performance of different methods for analysing correlated mutations in predicting 31 contacts is shown to be enhanced significantly when amino acid and codon data are 32 combined. 33 2 34 The effects of mutations that disrupt protein structure and/or function at one site are often 35 suppressed by mutations that occur at other sites either in the same protein or in other 36 proteins.
    [Show full text]
  • Annual Scientific Report 2013 on the Cover Structure 3Fof in the Protein Data Bank, Determined by Laponogov, I
    EMBL-European Bioinformatics Institute Annual Scientific Report 2013 On the cover Structure 3fof in the Protein Data Bank, determined by Laponogov, I. et al. (2009) Structural insight into the quinolone-DNA cleavage complex of type IIA topoisomerases. Nature Structural & Molecular Biology 16, 667-669. © 2014 European Molecular Biology Laboratory This publication was produced by the External Relations team at the European Bioinformatics Institute (EMBL-EBI) A digital version of the brochure can be found at www.ebi.ac.uk/about/brochures For more information about EMBL-EBI please contact: [email protected] Contents Introduction & overview 3 Services 8 Genes, genomes and variation 8 Molecular atlas 12 Proteins and protein families 14 Molecular and cellular structures 18 Chemical biology 20 Molecular systems 22 Cross-domain tools and resources 24 Research 26 Support 32 ELIXIR 36 Facts and figures 38 Funding & resource allocation 38 Growth of core resources 40 Collaborations 42 Our staff in 2013 44 Scientific advisory committees 46 Major database collaborations 50 Publications 52 Organisation of EMBL-EBI leadership 61 2013 EMBL-EBI Annual Scientific Report 1 Foreword Welcome to EMBL-EBI’s 2013 Annual Scientific Report. Here we look back on our major achievements during the year, reflecting on the delivery of our world-class services, research, training, industry collaboration and European coordination of life-science data. The past year has been one full of exciting changes, both scientifically and organisationally. We unveiled a new website that helps users explore our resources more seamlessly, saw the publication of ground-breaking work in data storage and synthetic biology, joined the global alliance for global health, built important new relationships with our partners in industry and celebrated the launch of ELIXIR.
    [Show full text]
  • EMBL-EBI Now and in the Future
    SureChEMBL: Open Patent Data Chemaxon UGM, Budapest 21/05/2014 Mark Davies ChEMBL Group, EMBL-EBI EMBL-EBI Resources Genes, genomes & variation European Nucleotide Ensembl European Genome-phenome Archive Archive Ensembl Genomes Metagenomics portal 1000 Genomes Gene, protein & metabolite expression ArrayExpress Metabolights Expression Atlas PRIDE Literature & Protein sequences, families & motifs ontologies InterPro Pfam UniProt Europe PubMed Central Gene Ontology Experimental Factor Molecular structures Ontology Protein Data Bank in Europe Electron Microscopy Data Bank Chemical biology ChEMBL ChEBI Reactions, interactions & pathways Systems BioModels BioSamples IntAct Reactome MetaboLights Enzyme Portal ChEMBL – Data for Drug Discovery 1. Scientific facts 3. Insight, tools and resources for translational drug discovery >Thrombin MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE Compound RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG Ki = 4.5nM PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE Bioactivity data Assay/Target APTT = 11 min. 2. Organization, integration,
    [Show full text]
  • A Unicellular Relative of Animals Generates a Layer of Polarized Cells
    RESEARCH ARTICLE A unicellular relative of animals generates a layer of polarized cells by actomyosin- dependent cellularization Omaya Dudin1†*, Andrej Ondracka1†, Xavier Grau-Bove´ 1,2, Arthur AB Haraldsen3, Atsushi Toyoda4, Hiroshi Suga5, Jon Bra˚ te3, In˜ aki Ruiz-Trillo1,6,7* 1Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra), Barcelona, Spain; 2Department of Vector Biology, Liverpool School of Tropical Medicine, Liverpool, United Kingdom; 3Section for Genetics and Evolutionary Biology (EVOGENE), Department of Biosciences, University of Oslo, Oslo, Norway; 4Department of Genomics and Evolutionary Biology, National Institute of Genetics, Mishima, Japan; 5Faculty of Life and Environmental Sciences, Prefectural University of Hiroshima, Hiroshima, Japan; 6Departament de Gene`tica, Microbiologia i Estadı´stica, Universitat de Barcelona, Barcelona, Spain; 7ICREA, Barcelona, Spain Abstract In animals, cellularization of a coenocyte is a specialized form of cytokinesis that results in the formation of a polarized epithelium during early embryonic development. It is characterized by coordinated assembly of an actomyosin network, which drives inward membrane invaginations. However, whether coordinated cellularization driven by membrane invagination exists outside animals is not known. To that end, we investigate cellularization in the ichthyosporean Sphaeroforma arctica, a close unicellular relative of animals. We show that the process of cellularization involves coordinated inward plasma membrane invaginations dependent on an *For correspondence: actomyosin network and reveal the temporal order of its assembly. This leads to the formation of a [email protected] (OD); polarized layer of cells resembling an epithelium. We show that this stage is associated with tightly [email protected] (IR-T) regulated transcriptional activation of genes involved in cell adhesion.
    [Show full text]
  • Origin of a Folded Repeat Protein from an Intrinsically Disordered Ancestor
    RESEARCH ARTICLE Origin of a folded repeat protein from an intrinsically disordered ancestor Hongbo Zhu, Edgardo Sepulveda, Marcus D Hartmann, Manjunatha Kogenaru†, Astrid Ursinus, Eva Sulz, Reinhard Albrecht, Murray Coles, Jo¨ rg Martin, Andrei N Lupas* Department of Protein Evolution, Max Planck Institute for Developmental Biology, Tu¨ bingen, Germany Abstract Repetitive proteins are thought to have arisen through the amplification of subdomain- sized peptides. Many of these originated in a non-repetitive context as cofactors of RNA-based replication and catalysis, and required the RNA to assume their active conformation. In search of the origins of one of the most widespread repeat protein families, the tetratricopeptide repeat (TPR), we identified several potential homologs of its repeated helical hairpin in non-repetitive proteins, including the putatively ancient ribosomal protein S20 (RPS20), which only becomes structured in the context of the ribosome. We evaluated the ability of the RPS20 hairpin to form a TPR fold by amplification and obtained structures identical to natural TPRs for variants with 2–5 point mutations per repeat. The mutations were neutral in the parent organism, suggesting that they could have been sampled in the course of evolution. TPRs could thus have plausibly arisen by amplification from an ancestral helical hairpin. DOI: 10.7554/eLife.16761.001 *For correspondence: andrei. [email protected] Introduction † Present address: Department Most present-day proteins arose through the combinatorial shuffling and differentiation of a set of of Life Sciences, Imperial College domain prototypes. In many cases, these prototypes can be traced back to the root of cellular life London, London, United and have since acted as the primary unit of protein evolution (Anantharaman et al., 2001; Kingdom Apic et al., 2001; Koonin, 2003; Kyrpides et al., 1999; Orengo and Thornton, 2005; Ponting and Competing interests: The Russell, 2002; Ranea et al., 2006).
    [Show full text]
  • Strategic Plan 2011-2016
    Strategic Plan 2011-2016 Wellcome Trust Sanger Institute Strategic Plan 2011-2016 Mission The Wellcome Trust Sanger Institute uses genome sequences to advance understanding of the biology of humans and pathogens in order to improve human health. -i- Wellcome Trust Sanger Institute Strategic Plan 2011-2016 - ii - Wellcome Trust Sanger Institute Strategic Plan 2011-2016 CONTENTS Foreword ....................................................................................................................................1 Overview .....................................................................................................................................2 1. History and philosophy ............................................................................................................ 5 2. Organisation of the science ..................................................................................................... 5 3. Developments in the scientific portfolio ................................................................................... 7 4. Summary of the Scientific Programmes 2011 – 2016 .............................................................. 8 4.1 Cancer Genetics and Genomics ................................................................................ 8 4.2 Human Genetics ...................................................................................................... 10 4.3 Pathogen Variation .................................................................................................. 13 4.4 Malaria
    [Show full text]