
Hermann ZELLNER, Benoît BELY, Andrew NIGHTINGALE, Alan WILTER SOUSA DA SILVA and Maria JESUS MARTIN
EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

How UniProtKB Maps And Variants and Provides This Information.

Complete Proteomes for Complete Genomes

The sources of the UniProtKB complete proteomes are genomes from INSDC and Ensembl. For INSDC, all annotated coding sequences are imported into UniProtKB (UniProtKB/TrEMBL), but only proteins coming from complete, annotated genomes and from WGS genomes detected as complete are tagged with the keyword Complete Proteome. For Ensembl, all predicted sequences are mapped to UniProtKB under stringent conditions: 100% identity over 100% of the length of the two sequences. Any sequence found to be absent from UniProtKB is imported. All UniProtKB entries that map to an Ensembl peptide are used to build the proteome; they are tagged and a cross-reference to Ensembl is added.

Reference Proteomes coming out of the line

UniProt has defined a set of Reference Proteomes which are 'landmarks' in proteome space. Reference proteomes are selected to provide broad coverage of the tree of life, and constitute a representative cross-section of the taxonomic diversity to be found within UniProtKB. They include the proteomes of well-studied model organisms (including those in the now defunct IPI sets) and other proteomes of interest for biomedical and biotechnological research. These are the proteomes which are preferentially selected for manual curation when resources permit. Species of particular importance may be represented by numerous reference proteomes for specific ecotypes or strains of interest.
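The Ensembl mapping rule for complete proteomes (100% identity over 100% of the length of both sequences) amounts to an exact string comparison. The following Python sketch illustrates that idea only; it is not the UniProt pipeline, and the function name, identifiers and sequences are hypothetical.

```python
from collections import defaultdict


def map_peptides(ensembl_peptides, uniprot_sequences):
    """Map Ensembl peptides to UniProtKB sequences by exact identity.

    Both arguments are dicts of identifier -> amino acid sequence.
    Returns (mapped, unmapped): `mapped` is peptide id -> sorted list of
    UniProtKB ids with an identical sequence; `unmapped` lists peptide
    ids with no identical sequence, i.e. candidates for import.
    """
    # Index UniProtKB by sequence: an exact string match is equivalent to
    # 100% identity over 100% of the length of both sequences.
    by_sequence = defaultdict(list)
    for uniprot_id, seq in uniprot_sequences.items():
        by_sequence[seq.upper()].append(uniprot_id)

    mapped, unmapped = {}, []
    for peptide_id, seq in ensembl_peptides.items():
        hits = by_sequence.get(seq.upper())
        if hits:
            mapped[peptide_id] = sorted(hits)
        else:
            unmapped.append(peptide_id)
    return mapped, sorted(unmapped)


# Toy data: all identifiers and sequences are invented for illustration.
uniprot = {"ISOFORM_1": "MQAGKPILYS", "ISOFORM_2": "MQAGKPIL"}
ensembl = {
    "PEP_A": "MQAGKPILYS",  # identical to ISOFORM_1
    "PEP_B": "MQAGKPIL",    # identical to ISOFORM_2
    "PEP_C": "MQAGKPIL",    # second peptide with the same sequence
    "PEP_D": "MQAGKPIV",    # no identical UniProtKB sequence
}
mapped, unmapped = map_peptides(ensembl, uniprot)
print(mapped)
print(unmapped)
```

Indexing by sequence keeps each lookup to a single dictionary access, and several peptides with the same sequence naturally resolve to the same entry.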
New space, new concept, new interface

A new interface will be introduced to seamlessly integrate UniProt with the associated proteomes. The expanded view for individual taxa will allow users to choose between the alternate complete proteomes available for many species. In addition, Reference proteomes will aid users in making an informed choice.

Figure 1. Proteome pipeline. How complete and reference proteomes are made.

Human Proteome, The exception proves the rule

UniProt has annotated the complete Homo sapiens proteome, and approximately 20,000 protein coding genes are represented by a canonical protein sequence in UniProtKB/Swiss-Prot. Studies on sequence variation are increasing; projects like 1000 Genomes and the Cancer Genome Project are generating a vast amount of variant information that is, or will be, stored by Ensembl in their variation databases. This exponential growth of variation information means a new automated strategy is required for selecting and importing biomedically and clinically relevant variants from Ensembl variation into UniProtKB.

Mapping the Complete Human Proteome to the Reference Genome

Figure 3 demonstrates the mapping pipeline in more detail. The transcription and translation products of the human gene GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by amino acid identity. The first transcript is mapped to isoform-1. Two separate gene transcripts with identical amino acid sequences are mapped to isoform-2. A fourth gene product is mapped to an existing TrEMBL entry, A6NED0, whilst a final unknown product is not found in UniProtKB and is therefore added to UniProtKB as a new TrEMBL record.

Figure 2. New proteome interface. The proposed new interface will allow users a choice of proteomes where available, with one or more designated 'Reference' proteomes.

Figure 3. Mapping of the Ensembl Human GSTZ1 gene to the protein sequences of UniProtKB's MAAI_HUMAN Entry O43708.

Mapping Variants to the UniProt Human Reference Proteome

Now that UniProt has the human reference proteome mapped to the human reference genome, UniProt has developed a pipeline to import high-quality 1000 Genomes and COSMIC non-synonymous single amino acid variants from Ensembl variation, Figure 4. 389,935 single amino acid variants have been identified for import into the UniProt human reference proteome.

Figure 4. Examples of the new information provided by the new variant import pipeline.

Quest for Orthologs (QfO), a gene centric view

Since 2009 the UniProt team has been working in collaboration with the QfO consortium to provide a gene-centric benchmark data set for reference proteomes. This data set is released once a year; the 2013 release covers 147 species and is based on UniProtKB release 2013_04. The benchmark gives the orthology community a standard data set with which to compare their methods effectively. For each proteome we provide: one FASTA file containing the non-redundant set of canonical sequences, a gene-to-protein mapping file, and an idmapping file containing all cross-references linked to those proteins.

See http://www.ebi.ac.uk/reference_proteomes/.
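As a rough illustration of a gene-centric data set, the Python sketch below keeps one canonical sequence per gene and emits a FASTA text together with a gene-to-protein mapping. It assumes that "non-redundant" means one canonical sequence per gene; the input tuple layout, the `GN=` header format and all identifiers are hypothetical and do not describe the actual QfO file formats.

```python
def build_gene_centric_set(entries, width=60):
    """Build a FASTA text and a gene -> accession mapping.

    `entries` is an iterable of (gene, accession, sequence, is_canonical)
    tuples (a hypothetical layout). Only the first canonical sequence
    seen for each gene is kept.
    """
    fasta_lines = []
    gene_to_protein = {}
    for gene, accession, sequence, is_canonical in entries:
        # Skip isoforms and any gene that already has a representative.
        if not is_canonical or gene in gene_to_protein:
            continue
        gene_to_protein[gene] = accession
        fasta_lines.append(f">{accession} GN={gene}")
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(sequence), width):
            fasta_lines.append(sequence[start:start + width])
    return "\n".join(fasta_lines) + "\n", gene_to_protein


# Toy data: all identifiers and sequences are invented for illustration.
entries = [
    ("GENE_A", "ACC_1", "MKT", True),
    ("GENE_A", "ACC_2", "MK", False),   # alternative isoform, skipped
    ("GENE_B", "ACC_3", "MLL", True),
    ("GENE_B", "ACC_4", "MLL", True),   # second entry for the gene, skipped
]
fasta_text, gene_map = build_gene_centric_set(entries)
print(fasta_text, end="")
print(gene_map)
```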

References

UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res., 40:D71-5, 2012.
T. Gabaldon, C. Dessimoz, J. Huxley-Jones, A. Vilella, E. Sonnhammer and S. Lewis. Joining forces in the quest for orthologs. Genome Biology, 9:403, 2009.
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467 (7319): 1061–1073, 2010.
S.A. Forbes, G. Bhamra, S. Bamford, E. Dawson, C. Kok, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr Protoc Hum Genet, Chapter 10: Unit 10-11, 2008.

UniProt is mainly supported by the National Institutes of Health (NIH) grant 1U41HG006104-01. Additional support for the EMBL-EBI's involvement in UniProt comes from EMBL and the NIH GO grant 2P41HG02273-07. UniProt activities at the SIB are additionally supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation SERI, and by the EC grants GEN2PHEN (200754) and MICROME (222886-2). PIR's UniProt activities are also supported by the NIH grants 5R01GM080646-07, 3R01GM080646-07S1, 5G08LM010720-03, and 8P20GM103446-12, and the National Science Foundation (NSF) grant DBI-1062520.

EMBL-EBI, Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Tel. +44 (0) 1223 494 444
[email protected]
www.ebi.ac.uk

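Mapping the UniProt Human Reference Proteome to the Reference Genome and Variation Data

Andrew Nightingale1, Jie Luo1, Maria Martin1 and the UniProt Consortium1,2,3
1 EMBL-European Bioinformatics Institute, Cambridge, UK
2 SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
3 Protein Information Resource, Georgetown University, Washington DC & University of Delaware, USA

Introduction

● UniProt has annotated the complete Homo sapiens proteome, and approximately 20,000 protein coding genes are represented by a canonical protein sequence in UniProtKB/Swiss-Prot.
● Most of these protein sequences are now mapped to the reference genome assembly produced by the international Genome Reference Consortium (GRC)(1).
● As described in a separate poster, UniProt manually annotates variant sequences with known functional consequences from the literature.
● Studies on sequence variation are increasing; projects like 1000 Genomes(2) and the Cancer Genome Project(3) are generating a vast amount of variant information that is, or will be, stored by Ensembl variation(4) in their databases.
● This exponential growth of variation information means a new automated strategy is required for selecting and importing biomedically and clinically relevant variants from Ensembl variation into UniProtKB.

Mapping the Complete Human Proteome to the Reference Genome

● Mapping was made possible through a process of aligning all human protein sequences in the UniProt Knowledgebase (UniProtKB) to the protein translations in Ensembl, based on 100% amino acid identity over the entire sequence.
● Figure 1 demonstrates the mapping pipeline in more detail. The transcription and translation products of the human gene GSTZ1 are mapped to the two existing isoforms of MAAI_HUMAN by amino acid identity. The first transcript is mapped to isoform-1. Two separate gene transcripts with identical amino acid sequences are mapped to isoform-2. A fourth gene product is mapped to an existing TrEMBL entry, A6NED0, whilst a final unknown product is not found in UniProtKB and is therefore added to UniProtKB as a new TrEMBL record.

Figure 1: Mapping of the Ensembl Human BRCA1 gene to the protein sequences of UniProtKB's BRCA1_HUMAN Entry P38398.

Mapping Variants to the UniProt Human Reference Proteome

● Now that UniProt has the human reference proteome mapped to the human reference genome, UniProt has developed a pipeline to import high-quality 1000 Genomes(2) and COSMIC(5) non-synonymous single amino acid variants from Ensembl variation(4), Figure 2.
● 389,935 single amino acid variants have been identified for import into the UniProt human reference proteome.

Figure 2: Examples of the new information provided by the new variant import pipeline.

Invaluable Information Provided by Variants

● The imported variant information provides amino acid mutations, observed within the sampled population, for specific protein sequence isoforms, Figure 2.
● Biochemical and biomedical consequences of germline and somatic variants are defined through well characterised phenotypes observed in sample populations, Figure 3.
● Prediction of the potential consequence of a variant on an individual when a phenotype for an observed variant in a population has not been isolated.

Figure 3: The Good and the Bad biochemical and biomedical consequences of missense variants.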
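The selection of non-synonymous single amino acid variants can be pictured as a simple filter. The Python sketch below is an assumption-laden illustration, not the UniProt import pipeline: the `Variant` record, the accepted source labels, and the check that the reference residue matches the isoform sequence are all hypothetical, and the poster's "high-quality" criteria are not modelled because they are not described.

```python
from dataclasses import dataclass

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
ACCEPTED_SOURCES = {"1000 Genomes", "COSMIC"}  # hypothetical labels


@dataclass(frozen=True)
class Variant:
    source: str    # project that reported the variant
    isoform: str   # protein isoform identifier (hypothetical)
    position: int  # 1-based residue position in the isoform
    ref: str       # reference residue(s)
    alt: str       # variant residue(s)


def is_importable(variant, isoform_sequences):
    """Return True for a non-synonymous single amino acid variant."""
    if variant.source not in ACCEPTED_SOURCES:
        return False
    # Exactly one standard residue on each side; this rejects stop
    # codons, insertions and deletions.
    if variant.ref not in AMINO_ACIDS or variant.alt not in AMINO_ACIDS:
        return False
    if variant.ref == variant.alt:  # synonymous at the protein level
        return False
    # Assumed sanity check: the reference residue must match the isoform.
    seq = isoform_sequences.get(variant.isoform, "")
    in_range = 1 <= variant.position <= len(seq)
    return in_range and seq[variant.position - 1] == variant.ref


# Toy data: identifiers, sequence and variants are invented.
isoforms = {"ISO_1": "MQAGKPILYS"}
candidates = [
    Variant("1000 Genomes", "ISO_1", 3, "A", "T"),  # kept
    Variant("COSMIC", "ISO_1", 5, "K", "K"),        # synonymous
    Variant("COSMIC", "ISO_1", 4, "G", "*"),        # stop gained
    Variant("Other", "ISO_1", 3, "A", "V"),         # source not accepted
    Variant("COSMIC", "ISO_1", 2, "A", "V"),        # reference mismatch
]
selected = [v for v in candidates if is_importable(v, isoforms)]
print(selected)
```

Keeping the isoform identifier on each record reflects the point made in the list above: a variant is reported against a specific protein sequence isoform.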

Future Developments

● UniProt intends to extend the variant import pipeline to include other species with a complete proteome.

References

1) The Genome Reference Consortium, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml
2) Durbin, R. et al., A map of human genome variation from population scale sequencing, Nature 467(7319):1061 (2010)
3) Cancer Genome Project: http://www.sanger.ac.uk/research/projects/cancergenome/
4) Chen Y. et al., Ensembl Variation Resources, BMC Genomics 11(1):293 (2010)
5) Simon A. Forbes et al., COSMIC: http://cancer.sanger.ac.uk

Funding

UniProt is funded by the European Molecular Biology Laboratory, National Institutes of Health, European Union, Swiss Federal Government, British Heart Foundation and National Science Foundation.

Email: [email protected]
URL: www.uniprot.org