Polymorphism Analysis Reveals Reduced Negative Selection and Elevated Rate of Insertions and Deletions in Intrinsically Disordered Protein Regions
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Ensembl Genomes: Extending Ensembl Across the Taxonomic Space P
Published online 1 November 2009 Nucleic Acids Research, 2010, Vol. 38, Database issue D563–D569 doi:10.1093/nar/gkp871 Ensembl Genomes: Extending Ensembl across the taxonomic space P. J. Kersey*, D. Lawson, E. Birney, P. S. Derwent, M. Haimel, J. Herrero, S. Keenan, A. Kerhornou, G. Koscielny, A. Ka¨ ha¨ ri, R. J. Kinsella, E. Kulesha, U. Maheswari, K. Megy, M. Nuhn, G. Proctor, D. Staines, F. Valentin, A. J. Vilella and A. Yates EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK Received August 14, 2009; Revised September 28, 2009; Accepted September 29, 2009 ABSTRACT nucleotide archives; numerous other genomes exist in states of partial assembly and annotation; thousands of Ensembl Genomes (http://www.ensemblgenomes viral genomes sequences have also been generated. .org) is a new portal offering integrated access to Moreover, the increasing use of high-throughput genome-scale data from non-vertebrate species sequencing technologies is rapidly reducing the cost of of scientific interest, developed using the Ensembl genome sequencing, leading to an accelerating rate of genome annotation and visualisation platform. data production. This not only makes it likely that in Ensembl Genomes consists of five sub-portals (for the near future, the genomes of all species of scientific bacteria, protists, fungi, plants and invertebrate interest will be sequenced; but also the genomes of many metazoa) designed to complement the availability individuals, with the possibility of providing accurate and of vertebrate genomes in Ensembl. Many of the sophisticated annotation through the similarly low-cost databases supporting the portal have been built in application of functional assays. -
Rare Variant Contribution to Human Disease in 281,104 UK Biobank Exomes W 1,19 1,19 2,19 2 2 Quanli Wang , Ryan S
https://doi.org/10.1038/s41586-021-03855-y Accelerated Article Preview Rare variant contribution to human disease W in 281,104 UK Biobank exomes E VI Received: 3 November 2020 Quanli Wang, Ryan S. Dhindsa, Keren Carss, Andrew R. Harper, Abhishek N ag, I oa nn a Tachmazidou, Dimitrios Vitsios, Sri V. V. Deevi, Alex Mackay, EDaniel Muthas, Accepted: 28 July 2021 Michael Hühn, Sue Monkley, Henric O ls so n , S eb astian Wasilewski, Katherine R. Smith, Accelerated Article Preview Published Ruth March, Adam Platt, Carolina Haefliger & Slavé PetrovskiR online 10 August 2021 P Cite this article as: Wang, Q. et al. Rare variant This is a PDF fle of a peer-reviewed paper that has been accepted for publication. contribution to human disease in 281,104 UK Biobank exomes. Nature https:// Although unedited, the content has been subjectedE to preliminary formatting. Nature doi.org/10.1038/s41586-021-03855-y (2021). is providing this early version of the typeset paper as a service to our authors and Open access readers. The text and fgures will undergoL copyediting and a proof review before the paper is published in its fnal form. Please note that during the production process errors may be discovered which Ccould afect the content, and all legal disclaimers apply. TI R A D E T A R E L E C C A Nature | www.nature.com Article Rare variant contribution to human disease in 281,104 UK Biobank exomes W 1,19 1,19 2,19 2 2 https://doi.org/10.1038/s41586-021-03855-y Quanli Wang , Ryan S. -
Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes
ARTICLES https://doi.org/10.1038/s41564-019-0510-x Corrected: Author Correction Cryptic inoviruses revealed as pervasive in bacteria and archaea across Earth’s biomes Simon Roux 1*, Mart Krupovic 2, Rebecca A. Daly3, Adair L. Borges4, Stephen Nayfach1, Frederik Schulz 1, Allison Sharrar5, Paula B. Matheus Carnevali 5, Jan-Fang Cheng1, Natalia N. Ivanova 1, Joseph Bondy-Denomy4,6, Kelly C. Wrighton3, Tanja Woyke 1, Axel Visel 1, Nikos C. Kyrpides1 and Emiley A. Eloe-Fadrosh 1* Bacteriophages from the Inoviridae family (inoviruses) are characterized by their unique morphology, genome content and infection cycle. One of the most striking features of inoviruses is their ability to establish a chronic infection whereby the viral genome resides within the cell in either an exclusively episomal state or integrated into the host chromosome and virions are continuously released without killing the host. To date, a relatively small number of inovirus isolates have been extensively studied, either for biotechnological applications, such as phage display, or because of their effect on the toxicity of known bacterial pathogens including Vibrio cholerae and Neisseria meningitidis. Here, we show that the current 56 members of the Inoviridae family represent a minute fraction of a highly diverse group of inoviruses. Using a machine learning approach lever- aging a combination of marker gene and genome features, we identified 10,295 inovirus-like sequences from microbial genomes and metagenomes. Collectively, our results call for reclassification of the current Inoviridae family into a viral order including six distinct proposed families associated with nearly all bacterial phyla across virtually every ecosystem. -
Learning Protein Constitutive Motifs from Sequence Data Je´ Roˆ Me Tubiana, Simona Cocco, Re´ Mi Monasson*
TOOLS AND RESOURCES Learning protein constitutive motifs from sequence data Je´ roˆ me Tubiana, Simona Cocco, Re´ mi Monasson* Laboratory of Physics of the Ecole Normale Supe´rieure, CNRS UMR 8023 & PSL Research, Paris, France Abstract Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (a-helixes and b-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and ’turning up’ or ’turning down’ the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families. DOI: https://doi.org/10.7554/eLife.39397.001 Introduction In recent years, the sequencing of many organisms’ genomes has led to the collection of a huge number of protein sequences, which are catalogued in databases such as UniProt or PFAM Finn et al., 2014). Sequences that share a common ancestral origin, defining a family (Figure 1A), *For correspondence: are likely to code for proteins with similar functions and structures, providing a unique window into [email protected] the relationship between genotype (sequence content) and phenotype (biological features). -
DECIPHER: Harnessing Local Sequence Context to Improve Protein Multiple Sequence Alignment Erik S
Wright BMC Bioinformatics (2015) 16:322 DOI 10.1186/s12859-015-0749-z RESEARCH ARTICLE Open Access DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment Erik S. Wright1,2 Abstract Background: Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. Results: Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on largesequencesets. Conclusions: Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. -
(DDD) Project: What a Genomic Approach Can Achieve
The Deciphering Development Disorders (DDD) project: What a genomic approach can achieve RCP ADVANCED MEDICINE, LONDON FEB 5TH 2018 HELEN FIRTH DM FRCP DCH, SANGER INSTITUTE 3,000,000,000 bases in each human genome Disease & developmental Health & development disorders Fascinating facts about your genome! –~20,000 protein-coding genes –~30% of genes have a known role in disease or developmental disorders –~10,000 protein altering variants –~100 protein truncating variants –~70 de novo mutations (~1-2 coding ie. In exons of genes) Rare Disease affects 1 in 17 people •Prior to DDD, diagnostic success in patients with rare paediatric disease was poor •Not possible to diagnose many patients with current methodology in routine use– maximum benefit in this group •DDD recruited patients with severe/extreme clinical features present from early childhood with high expectation of genetic basis •Recruitment was primarily of trios (ie The Doctor Sir Luke Fildes (1887) child and both parents) ~ 90% Making a genomic diagnosis of a rare disease improves care •Accurate diagnosis is the cornerstone of good medical practice – informing management, treatment, prognosis and prevention •Enables risk to other family members to be determined enabling predictive testing with potential for surveillance and therapy in some disorders February 28th 2018 •Reduces sense of isolation, enabling better access to support and information •Curtails the diagnostic odyssey •Not just a descriptive label; identifies the fundamental cause of disease A genomic diagnosis can be a gateway to better treatment •Not just a descriptive label; identifies the fundamental cause of disease •Biallelic mutations in the CFTR gene cause Cystic Fibrosis • CFTR protein is an epithelial ion channel regulating absorption/ secretion of salt and water in the lung, sweat glands, pancreas & GI tract. -
Different Evolutionary Patterns of Snps Between Domains and Unassigned Regions in Human Protein‑Coding Sequences
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Springer - Publisher Connector Mol Genet Genomics (2016) 291:1127–1136 DOI 10.1007/s00438-016-1170-7 ORIGINAL ARTICLE Different evolutionary patterns of SNPs between domains and unassigned regions in human protein‑coding sequences Erli Pang1 · Xiaomei Wu2 · Kui Lin1 Received: 14 September 2015 / Accepted: 18 January 2016 / Published online: 30 January 2016 © The Author(s) 2016. This article is published with open access at Springerlink.com Abstract Protein evolution plays an important role in Furthermore, the selective strength on domains is signifi- the evolution of each genome. Because of their functional cantly greater than that on unassigned regions. In addition, nature, in general, most of their parts or sites are differently among all of the human protein sequences, there are 117 constrained selectively, particularly by purifying selection. PfamA domains in which no SNPs are found. Our results Most previous studies on protein evolution considered indi- highlight an important aspect of protein domains and may vidual proteins in their entirety or compared protein-coding contribute to our understanding of protein evolution. sequences with non-coding sequences. Less attention has been paid to the evolution of different parts within each pro- Keywords Human genome · Protein-coding sequence · tein of a given genome. To this end, based on PfamA anno- Protein domain · SNPs · Natural selection tation of all human proteins, each protein sequence can be split into two parts: domains or unassigned regions. Using this rationale, single nucleotide polymorphisms (SNPs) in Introduction protein-coding sequences from the 1000 Genomes Project were mapped according to two classifications: SNPs occur- Studying protein evolution is crucial for understanding ring within protein domains and those within unassigned the evolution of speciation and adaptation, senescence and regions. -
1 Codon-Level Information Improves Predictions of Inter-Residue Contacts in Proteins 2 by Correlated Mutation Analysis 3
1 Codon-level information improves predictions of inter-residue contacts in proteins 2 by correlated mutation analysis 3 4 5 6 7 8 Etai Jacob1,2, Ron Unger1,* and Amnon Horovitz2,* 9 10 11 1The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat- 12 Gan, 52900, 2Department of Structural Biology 13 Weizmann Institute of Science, Rehovot 7610001, Israel 14 15 16 *To whom correspondence should be addressed: 17 Amnon Horovitz ([email protected]) 18 Ron Unger ([email protected]) 19 1 20 Abstract 21 Methods for analysing correlated mutations in proteins are becoming an increasingly 22 powerful tool for predicting contacts within and between proteins. Nevertheless, 23 limitations remain due to the requirement for large multiple sequence alignments (MSA) 24 and the fact that, in general, only the relatively small number of top-ranking predictions 25 are reliable. To date, methods for analysing correlated mutations have relied exclusively 26 on amino acid MSAs as inputs. Here, we describe a new approach for analysing 27 correlated mutations that is based on combined analysis of amino acid and codon MSAs. 28 We show that a direct contact is more likely to be present when the correlation between 29 the positions is strong at the amino acid level but weak at the codon level. The 30 performance of different methods for analysing correlated mutations in predicting 31 contacts is shown to be enhanced significantly when amino acid and codon data are 32 combined. 33 2 34 The effects of mutations that disrupt protein structure and/or function at one site are often 35 suppressed by mutations that occur at other sites either in the same protein or in other 36 proteins. -
Annual Scientific Report 2013 on the Cover Structure 3Fof in the Protein Data Bank, Determined by Laponogov, I
EMBL-European Bioinformatics Institute Annual Scientific Report 2013 On the cover Structure 3fof in the Protein Data Bank, determined by Laponogov, I. et al. (2009) Structural insight into the quinolone-DNA cleavage complex of type IIA topoisomerases. Nature Structural & Molecular Biology 16, 667-669. © 2014 European Molecular Biology Laboratory This publication was produced by the External Relations team at the European Bioinformatics Institute (EMBL-EBI) A digital version of the brochure can be found at www.ebi.ac.uk/about/brochures For more information about EMBL-EBI please contact: [email protected] Contents Introduction & overview 3 Services 8 Genes, genomes and variation 8 Molecular atlas 12 Proteins and protein families 14 Molecular and cellular structures 18 Chemical biology 20 Molecular systems 22 Cross-domain tools and resources 24 Research 26 Support 32 ELIXIR 36 Facts and figures 38 Funding & resource allocation 38 Growth of core resources 40 Collaborations 42 Our staff in 2013 44 Scientific advisory committees 46 Major database collaborations 50 Publications 52 Organisation of EMBL-EBI leadership 61 2013 EMBL-EBI Annual Scientific Report 1 Foreword Welcome to EMBL-EBI’s 2013 Annual Scientific Report. Here we look back on our major achievements during the year, reflecting on the delivery of our world-class services, research, training, industry collaboration and European coordination of life-science data. The past year has been one full of exciting changes, both scientifically and organisationally. We unveiled a new website that helps users explore our resources more seamlessly, saw the publication of ground-breaking work in data storage and synthetic biology, joined the global alliance for global health, built important new relationships with our partners in industry and celebrated the launch of ELIXIR. -
1 Constructing the Scientific Population in the Human Genome Diversity and 1000 Genome Projects Joseph Vitti I. Introduction: P
Constructing the Scientific Population in the Human Genome Diversity and 1000 Genome Projects Joseph Vitti I. Introduction: Populations Coming into Focus In November 2012, some eleven years after the publication of the first draft sequence of a human genome, an article published in Nature reported a new ‘map’ of the human genome – created from not one, but 1,092 individuals. For many researchers, however, what was compelling was not the number of individuals sequenced, but rather the fourteen worldwide populations they represented. Comparisons that could be made within and among these populations represented new possibilities for the scientific study of human genetic variation. The paper – which has been cited over 400 times in the subsequent year – was the output of the first phase of the 1000 Genomes Project, one of several international research consortia launched with the intent of identifying and cataloguing such variation. With the project’s phase three data release, anticipated in early spring 2014, the sample size will rise to over 2500 individuals representing twenty-six populations. Each individual’s full sequence data is made publicly available online, and is also preserved through the establishment of immortal cell lines, from which DNA can be extracted and distributed. With these developments, population-based science has been made genomic, and scientific conceptions of human populations have begun to crystallize (see appendix). Such extensive biobanking and databasing of human populations is remarkable for a number of reasons, not least among them the socially charged terrain that such an enterprise inevitably must navigate. While the 1000 Genomes Project (1000G) has been relatively uncontroversial in its reception, predecessors such as the Human Genome Diversity Project (HGDP), first conceived in 1991, faced greater difficulty. -
EMBL-EBI Now and in the Future
SureChEMBL: Open Patent Data Chemaxon UGM, Budapest 21/05/2014 Mark Davies ChEMBL Group, EMBL-EBI EMBL-EBI Resources Genes, genomes & variation European Nucleotide Ensembl European Genome-phenome Archive Archive Ensembl Genomes Metagenomics portal 1000 Genomes Gene, protein & metabolite expression ArrayExpress Metabolights Expression Atlas PRIDE Literature & Protein sequences, families & motifs ontologies InterPro Pfam UniProt Europe PubMed Central Gene Ontology Experimental Factor Molecular structures Ontology Protein Data Bank in Europe Electron Microscopy Data Bank Chemical biology ChEMBL ChEBI Reactions, interactions & pathways Systems BioModels BioSamples IntAct Reactome MetaboLights Enzyme Portal ChEMBL – Data for Drug Discovery 1. Scientific facts 3. Insight, tools and resources for translational drug discovery >Thrombin MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE Compound RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG Ki = 4.5nM PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE Bioactivity data Assay/Target APTT = 11 min. 2. Organization, integration, -
NIH-GDS: Genomic Data Sharing
NIH-GDS: Genomic Data Sharing National Institutes of Health Data type Explain whether the research being considered for funding involves human data, non- human data, or both. Information to be included in this section: • Type of data being collected: human, non-human, or both human & non-human. • Type of genomic data to be shared: sequence, transcriptomic, epigenomic, and/or gene expression. • Level of the genomic data to be shared: Individual-level, aggregate-level, or both. • Relevant associated data to be shared: phenotype or exposure. • Information needed to interpret the data: study protocols, survey tools, data collection instruments, data dictionary, software (including version), codebook, pipeline metadata, etc. This information should be provided with unrestricted access for all data levels. Data repository Identify the data repositories to which the data will be submitted, and for human data, whether the data will be available through unrestricted or controlled-access. For human genomic data, investigators are expected to register all studies in the database of Genotypes and Phenotypes (dbGaP) by the time data cleaning and quality control measures begin in addition to submitting the data to the relevant NIH-designated data repository (e.g., dbGaP, Gene Expression Omnibus (GEO), Sequence Read Archive (SRA), the Cancer Genomics Hub) after registration. Non-human data may be made available through any widely used data repository, whether NIH- funded or not, such as GEO, SRA, Trace Archive, Array Express, Mouse Genome Informatics, WormBase, the Zebrafish Model Organism Database, GenBank, European Nucleotide Archive, or DNA Data Bank of Japan. Data in unrestricted-access repositories (e.g., The 1000 Genomes Project) are publicly available to anyone.