Integration of Proteomics Data into UniProtKB

Benoit Bely1, Emanuele Alpi1, Guoying Qi1, Alan da Silva1, Jie Luo1, Maria Martin1 and the UniProt Consortium1, 2, 3

1 EMBL-European Bioinformatics Institute, Cambridge, UK
2 Swiss Institute of Bioinformatics, Geneva, Switzerland
3 Protein Information Resource, Washington DC, USA

The Universal Protein Resource

EBI is an outstation of the European Molecular Biology Laboratory

Background

- The amount of publicly available data in mass spectrometry (MS) proteomics repositories such as PRIDE, PeptideAtlas, MaxQB, EPD, CTDP, ProteomicsDB, GPMDB, HPM and others, as well as data from third-party global PRIDE reprocessing, is growing rapidly.
- The Universal Protein Resource (UniProt, http://www.uniprot.org/) is integrating proteomics data from all available public sources to provide comprehensive protein data sets backed by high-quality experimental evidence.

UniProt protein existence (PE) values (http://www.uniprot.org/manual/protein_existence)

>sp|Q8NI35|INADL_HUMAN InaD-like protein OS=Homo sapiens GN=INADL PE=1 SV=3

UniProt release 2016_02

                                  All species (canonical only)   Human (reference proteome)   C. elegans (reference proteome)
PE value                          Nr of sequences        %       Nr of sequences       %      Nr of sequences       %
1. Evidence at protein level               216550     0.35                 69831   75.77                11414   41.14
2. Evidence at transcript level           1064707     1.73                  7782    8.44                 1091    3.93
3. Inferred from homology                13651703    22.19                  1251    1.36                 2825   10.18
4. Predicted                             46587131    75.72                 12655   13.73                12408   44.72
5. Uncertain                                 1950     0.00                   646    0.70                    7    0.03
Sum                                      61522041   100.00                 92165  100.00                27745  100.00

Pipeline to update PE values

Inputs:
- UniProt reference proteome: species-specific sequence collection (isoforms always included, species-specific variants optional)
- MS-proteomics repositories: filtered lists of peptides

Steps:
- In-silico digestion of the sequence collection (any cleaving agent, unspecific cleavage optional): loose/strict trypsin, loose/strict Lys-C, Glu-C (aka V8-DE), chymotrypsin, Asp-N and Lys-N; two missed cleavages; with and without initiator methionine cleavage
- Peptide filtering: removal of peptides of 6 AA or shorter, and of peptides containing X, B or Z
- Gene-centric unicity evaluation of the peptides, distinguishing unique from non-unique peptides
- Exact match between the repository peptides and the in-silico peptides

UniProtKB/TrEMBL accessions with unique peptides are assigned a keyword that triggers an update of the PE value to 1:

ID   Proteomics identification.
AC   KW-1267
DE   Protein whose amino acid sequence has been partially or completely
DE   confirmed using publicly available mass spectrometry based data. The
DE   mapping from the mass spectrometry data to UniProtKB sequences may be
DE   subject to quality metrics from both UniProt and the data providers.
HI   Technical term: Proteomics identification.
CA   Technical term.

Unique/non-unique peptides and their source(s) are provided on the ftp for both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.

Peptide unicity

Gene-centric unicity: a peptide is considered unique if it belongs to only one specific gene group. Each gene group consists of one or more UniProtKB sequences. Gene groups are an underlying feature of the UniProtKB reference proteomes. Representative symbols for each gene group are provided in the ftp output files.
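The digestion, filtering, unicity and exact-match steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the UniProt pipeline itself: the function names, the toy sequences and the gene group labels (GENE_A, GENE_B) are hypothetical, only one cleaving agent is shown, and "strict trypsin" is assumed here to mean cleavage after K or R except when the next residue is P.

```python
import re
from collections import defaultdict

# Assumption: "strict" trypsin cleaves C-terminal to K or R, except before P.
STRICT_TRYPSIN = re.compile(r"(?<=[KR])(?!P)")


def digest(sequence, max_missed=2):
    """In-silico digestion allowing up to `max_missed` missed cleavages."""
    inner = [m.start() for m in STRICT_TRYPSIN.finditer(sequence)
             if 0 < m.start() < len(sequence)]
    cuts = [0] + inner + [len(sequence)]
    peptides = set()
    for i in range(len(cuts) - 1):
        # Joining fragment i with up to `max_missed` following fragments.
        for j in range(i + 1, min(i + 2 + max_missed, len(cuts))):
            peptides.add(sequence[cuts[i]:cuts[j]])
    return peptides


def digest_protein(sequence, max_missed=2):
    """Digest with and without initiator methionine cleavage."""
    peptides = digest(sequence, max_missed)
    if sequence.startswith("M"):
        peptides |= digest(sequence[1:], max_missed)
    return peptides


def keep(peptide):
    """Drop peptides of 6 AA or shorter, or containing X, B or Z."""
    return len(peptide) > 6 and not set(peptide) & set("XBZ")


def build_index(gene_groups, max_missed=2):
    """Map each retained in-silico peptide to the gene groups producing it."""
    index = defaultdict(set)
    for group, sequences in gene_groups.items():
        for sequence in sequences:
            for peptide in filter(keep, digest_protein(sequence, max_missed)):
                index[peptide].add(group)
    return index


def classify(observed, index):
    """Exact-match observed peptides; label each hit unique or non-unique."""
    result = {}
    for peptide in observed:
        groups = index.get(peptide)
        if groups:
            label = "unique" if len(groups) == 1 else "non-unique"
            result[peptide] = (label, sorted(groups))
    return result


# Hypothetical toy data: GENE_A has two isoforms, GENE_B has one sequence.
TOY_GROUPS = {
    "GENE_A": ["MSTLLVNDEEKAGGHPLLTTR", "MSTLLVNDEEKWWYYFFHHLLR"],
    "GENE_B": ["MQQWWYYFFKAGGHPLLTTR"],
}

if __name__ == "__main__":
    toy_index = build_index(TOY_GROUPS)
    print(classify(["STLLVNDEEK", "AGGHPLLTTR"], toy_index))
```

Because unicity is evaluated per gene group rather than per sequence, a peptide shared by two isoforms of the same gene still counts as unique, whereas a peptide shared across two gene groups does not.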
Species (UniProt release 2016_02)                      UPID         TaxID   Sequences  SeqWithUniqueEvidence  % SeqWithUnEv  MatchedUniquePeps  MatchedNonUniquePeps
Caenorhabditis elegans                                 UP000001940    6239      27778                  12779          46.00              78342                  2242
Drosophila melanogaster                                UP000000803    7227      23332                  10033          43.00              34392                   570
Danio rerio                                            UP000000437    7955      43309                  13411          30.97              58766                  5404
Homo sapiens                                           UP000005640    9606      91923                  69913          76.06             545841                 28251
Sus scrofa                                             UP000008227    9823      26220                   7534          28.73              49931                  6504
Bos taurus                                             UP000009136    9913      24479                   1172           4.79               2532                   362
Mus musculus                                           UP000000589   10090      58239                  39933          68.57             297067                 12595
Rattus norvegicus                                      UP000002494   10116      31456                  12442          39.55             135828                 10604
Schizosaccharomyces pombe (strain 972 / ATCC 24843)    UP000002485  284812       5128                   4311          84.07              60345                   989
Saccharomyces cerevisiae (strain ATCC 204508 / S288c)  UP000002311  559292       6743                   4807          71.29              83292                  2963

Data availability

Proteomics mapping files have been publicly available since UniProt release 2015_03 (www.uniprot.org/help/2015/03/04/release) from a dedicated section on the Downloads page of the UniProt website (www.uniprot.org/downloads). That section points to the UniProt ftp (ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomics_mapping/), where statistics (ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomics_mapping/relnotes.txt) and a readme file (ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomics_mapping/README) are also provided. Mappings are currently available for ten species from three MS proteomics repositories (three since UniProt release 2016_03). More species will be added as soon as additional MS proteomics repositories are integrated.

Next developments

- All the proteomics data incorporated so far are bottom-up data; the Consortium for Top Down Proteomics (CTDP) has started providing top-down data. Cross-references to CTDP have been in place since UniProt release 2016_03.
- Provide a visual representation of the mapped peptides (unique/non-unique) via the UniProt feature viewer (www.uniprot.org/help/2016/02/17/release), available since UniProt release 2016_02.

UniProt feature viewer visualization (under development)

Left: Expanded proteomics track showing unique and non-unique regions corresponding to the identified peptides of a specific UniProtKB sequence.
Right: Same as on the left, but with a tooltip displaying the details of the evidence for a specific unique peptide and pointing to the UniProt ftp readme file for details on the analysis procedure.

Contact: [email protected]
UniProt posters available at: http://www.ebi.ac.uk/uniprot/posters

UniProt is mainly supported by the National Institutes of Health (NIH) grant U41HG007822. Additional support for the EMBL-EBI's involvement in UniProt comes from the European Molecular Biology Laboratory (EMBL), the British Heart Foundation (BHF) (RG/13/5/30112), the Parkinson's Disease United Kingdom (PDUK) GO grant G-1307, and the NIH GO grant U41HG02273. UniProt activities at the SIB are additionally supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation SERI. PIR's UniProt activities are also supported by the NIH grants R01GM080646, G08LM010720, and P20GM103446, and the National Science Foundation (NSF) grant DBI-1062520.

EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Tel. +44 (0) 1223 494 444
www.ebi.ac.uk
