How UniProtKB Maps Genomes and Variants and Provides This Information

Hermann ZELLNER, Benoît BELY, Andrew NIGHTINGALE, Alan WILTER SOUSA DA SILVA and Maria JESUS MARTIN
EMBL - European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10 1SD Hinxton, Cambridge, United Kingdom

Complete Proteomes for Complete Genomes

The sources of the UniProtKB complete proteomes are genomes from INSDC, Ensembl and Ensembl Genomes.

For INSDC, all annotated proteins are imported into UniProtKB (UniProtKB/TrEMBL), but only proteins coming from complete, annotated genomes and from WGS genomes detected as complete will be tagged with the keyword Complete Proteome.

For Ensembl, all predicted protein sequences are mapped to UniProtKB under stringent conditions: 100% identity over 100% of the length of the two sequences. Any sequence found to be absent from UniProtKB is imported. All UniProtKB entries that map to an Ensembl peptide are used to build the proteome; they are tagged and a cross-reference is added.

Figure 1. Proteome pipeline. How complete and reference proteomes are made.

Reference Proteomes coming out of the line

UniProt has defined a set of Reference Proteomes which are 'landmarks' in proteome space. Reference proteomes are selected to provide broad coverage of the tree of life, and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB. They include the proteomes of well-studied model organisms [including those in the now defunct IPI sets] and other proteomes of interest for biomedical and biotechnological research. These are the proteomes which are preferentially selected for manual curation when resources permit. Species of particular importance may be represented by numerous reference proteomes for specific ecotypes or strains of interest.

New space, new concept, new interface

A new interface will be introduced to seamlessly integrate UniProt taxonomy with the associated proteomes. The expanded view for individual taxa will allow users to choose between alternate complete proteomes available for many species. In addition, Reference proteomes will aid users in making an informed choice.

Figure 2. New proteome interface. The proposed new interface will allow users a choice of proteomes where available, with one or more designated 'Reference' proteomes.

Human Proteome, The exception proves the rule

Mapping the UniProt Human Reference Proteome to the Reference Genome and Variation Data
Andrew Nightingale [1], Jie Luo [1], Maria Martin [1] and the UniProt Consortium [1, 2, 3]
[1] EMBL-European Bioinformatics Institute, Cambridge, UK
[2] SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
[3] Protein Information Resource, Georgetown University, Washington DC & University of Delaware, USA

Introduction

● UniProt has annotated the complete Homo sapiens proteome, and approximately 20,000 protein coding genes are represented by a canonical protein sequence in UniProtKB/Swiss-Prot.
● Most of these protein sequences are now mapped to the reference genome assembly produced by the international Genome Reference Consortium (GRC)(1).
● As described in a separate poster, UniProt manually annotates variant sequences with known functional consequences from the literature.
● Studies on sequence variation are increasing; projects like 1000 Genomes(2) and the Cancer Genome Project(3) are generating a vast amount of variant information that is, or will be, stored by Ensembl variation(4) in their databases.
● This exponential growth of variation information means a new automated strategy is required for the selection and importing of biomedically and clinically relevant variants from Ensembl variation into UniProtKB.

Mapping the Complete Human Proteome to the Reference Genome

● Mapping was made possible through a process of aligning all human protein sequences in the UniProt Knowledgebase (UniProtKB) to the protein translations in Ensembl, based on 100% amino acid identity over the entire sequence.
● Figure 3 uses an example to demonstrate the mapping pipeline in more detail. The transcription and translation products of the human gene GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by amino acid identity. The first transcript is mapped to isoform-1. Two separate gene transcripts with identical amino acid sequence are mapped to isoform-2. A fourth gene product is mapped to an existing TrEMBL entry, A6NED0, whilst a final unknown product is not found in UniProtKB and is therefore added to UniProtKB as a new TrEMBL record.

Figure 3. Mapping of the Ensembl Human GSTZ1 gene to the protein sequences of UniProtKB's MAAI_HUMAN Entry O43708.

Mapping Variants to the UniProt Human Reference Proteome

Now that UniProt has the human reference proteome mapped to the human reference genome, UniProt has developed a pipeline to import high-quality 1000 Genomes and COSMIC non-synonymous single amino acid variants from Ensembl variation (Figure 4). 389,935 single amino acid variants have been identified for import into the UniProt human reference proteome.

Figure 4. Examples of the new information provided by the new variant import pipeline.

Further figure captions from this poster (original numbering):
Figure 1: Mapping of the Ensembl Human BRCA1 gene to the protein sequences of UniProtKB's BRCA1_HUMAN Entry P38398.
Figure 3: The Good and the Bad biochemical and biomedical consequences of missense variants.

Quest for Orthologs (QfO), a gene centric view

Since 2009 the UniProt team has been working in collaboration with the QfO consortium to provide a gene-centric benchmark data set for reference proteomes. This data set is released once a year; the 2013 release covers 147 species and is based on UniProtKB release 2013_04. The benchmark provides the orthology community with a standard data set to effectively compare their methods. For each of the proteomes we provide: one FASTA file containing the non-redundant set of canonical sequences, a gene-to-protein mapping file, and an idmapping file containing all cross-references linked to those proteins.

See http://www.ebi.ac.uk/reference_proteomes/.

Reference: UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt).
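The mapping criterion described in this poster (100% amino acid identity over 100% of the length of both sequences) amounts to an exact string match between an Ensembl peptide and a UniProtKB sequence. The following is a minimal sketch of that idea only, not the UniProt pipeline itself: the function name, the dictionary inputs and the toy identifiers and sequences are all hypothetical, chosen for illustration.

```python
from collections import defaultdict


def map_by_exact_identity(ensembl_peptides, uniprot_sequences):
    """Map peptide IDs to accessions whose sequence is exactly identical.

    Both arguments are plain dicts of identifier -> amino acid sequence.
    These inputs are an assumption; the poster does not describe the
    data model used by the real pipeline.
    """
    # Index UniProtKB sequences by the full sequence string, so that a
    # lookup succeeds only on 100% identity over the entire length.
    by_sequence = defaultdict(list)
    for accession, seq in uniprot_sequences.items():
        by_sequence[seq.upper()].append(accession)

    mapped, unmapped = {}, []
    for peptide_id, seq in ensembl_peptides.items():
        hits = by_sequence.get(seq.upper())
        if hits:
            mapped[peptide_id] = sorted(hits)
        else:
            # No identical sequence found: a candidate for import
            # as a new record, as the poster describes.
            unmapped.append(peptide_id)
    return mapped, unmapped


# Toy data only: made-up sequences and placeholder identifiers.
toy_uniprot = {"ACC_ISO1": "MQAGK", "ACC_ISO2": "MQAGKPIL"}
toy_ensembl = {
    "pep1": "MQAGK",      # identical to ACC_ISO1
    "pep2": "MQAGKPIL",   # identical to ACC_ISO2
    "pep3": "MQAGKPIL",   # second transcript with the same translation
    "pep4": "MQAG",       # a prefix only, so not an exact match
}
toy_mapped, toy_unmapped = map_by_exact_identity(toy_ensembl, toy_uniprot)
print(toy_mapped)
print(toy_unmapped)
```

As in the GSTZ1 example, two transcripts with the same translation map to the same target, and a product with no identical sequence is reported separately. A truncated sequence is left unmapped because the match must cover the full length of both sequences.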
