The UniProt knowledgebase: a hub of integrated protein data
www.uniprot.org
Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics

[Figure: Science cover, February 2011. Labels: data, knowledge, protein sequence, functional information]

UniProt consortium
- EBI: European Bioinformatics Institute (UK)
- SIB: Swiss Institute of Bioinformatics (CH)
- PIR: Protein Information Resource (US)

UniProt databases
- UniProtKB: protein sequence knowledgebase with 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~15 million entries)
- UniParc: protein sequence archive (the ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated (query, Blast, download) (~25 million entries)
- UniRef: 3 sets of protein sequence clusters at 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)
- UniMES: protein sequences derived from metagenomic projects, mostly Global Ocean Sampling (GOS) (download) (8 million entries, included in UniParc)

UniProt databases: the central piece
UniProtKB, an encyclopedia on proteins composed of 2 sections, released every 4 weeks:
- UniProtKB/TrEMBL: unreviewed, automatically annotated
- UniProtKB/Swiss-Prot: reviewed, manually annotated

UniProtKB: origin of protein sequences
UniProtKB protein sequences are mainly derived from:
- INSDC (translated submitted coding sequences, CDS): 85 %
- Other sources, 15 % together:
  - Ensembl (gene prediction) and RefSeq sequences
  - Sequences of PDB structures
  - Direct submissions or sequences scanned from the literature
Notes:
- UniProt does not do any gene prediction.
- Most non-germline immunoglobulins, T-cell receptors, most patent sequences, highly over-represented data (e.g. viral antigens) and pseudogene sequences are excluded from UniProtKB but stored in UniParc.
- Data from the PIR database have been integrated in UniProtKB since 2003.

[Diagram: EMBL -> TrEMBL -> Swiss-Prot]
- TrEMBL: automated extraction of the protein sequence (translated CDS), gene name and references; automated annotation
- Swiss-Prot: manual annotation of the sequence and associated biological information

UniProtKB/TrEMBL: unreviewed, automatically annotated, released every 4 weeks

[Entry schematic: one protein sequence, one species]
- Protein and gene names
- Taxonomic information
- References
- Automated annotation: function, subcellular location, catalytic activity, sequence similarities…
- Automated annotation: transmembrane domains, signal peptide…
- Cross-references to over 125 databases
- Keywords and Gene Ontology

UniProtKB/TrEMBL: automatic annotation
Protein sequence
- The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl).
- 100% identical sequences (same length, same organism) are merged automatically.
Biological information: sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation: automatically generated annotation rules (e.g. SAAS) and/or manually generated annotation rules (e.g. UniRule)

UniProtKB/TrEMBL: example of fully automatic annotation, SAAS
• Rules are derived from the UniProtKB/Swiss-Prot manual annotation.
• Rule generation is fully automated and based on the C4.5 decision tree algorithm.
• One annotation, one rule.
• High stringency: an estimated precision of 99% or greater is required to generate an annotation (tested on UniProtKB/Swiss-Prot).
• Rules are produced, updated and validated at each release.
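The stringency criterion in the SAAS list above can be illustrated with a short Python sketch. It is not the SAAS implementation: the `CandidateRule` class, its fields and the `select_rules` function are hypothetical names, and precision is assumed here to be the simple ratio of correct matches to all matches among reviewed entries, which may differ from how SAAS estimates it.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Threshold quoted on the slide: 99% or greater estimated precision.
MIN_PRECISION = 0.99


@dataclass(frozen=True)
class CandidateRule:
    """Hypothetical candidate rule that predicts exactly one annotation."""
    annotation: str
    true_positives: int   # matched reviewed entries that carry the annotation
    false_positives: int  # matched reviewed entries that do not carry it

    @property
    def precision(self) -> float:
        matched = self.true_positives + self.false_positives
        # A rule that matches nothing cannot be evaluated, so score it 0.
        return self.true_positives / matched if matched else 0.0


def select_rules(candidates: Iterable[CandidateRule],
                 min_precision: float = MIN_PRECISION) -> List[CandidateRule]:
    """Keep only the rules whose precision reaches the threshold."""
    return [rule for rule in candidates if rule.precision >= min_precision]


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    candidates = [
        CandidateRule("keyword A", true_positives=995, false_positives=5),
        CandidateRule("keyword B", true_positives=980, false_positives=20),
    ]
    for rule in select_rules(candidates):
        print(rule.annotation, round(rule.precision, 3))
```

Because each rule carries a single annotation ("one annotation, one rule"), a rule can be accepted or rejected on its own precision without affecting the others.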
UniProtKB/Swiss-Prot: reviewed, manually annotated, released every 4 weeks

[Entry schematic: one protein sequence, one gene, one species]
- Protein and gene names
- Taxonomic information
- References
- Manual annotation: function, subcellular location, catalytic activity, disease, tissue specificity, pathway…
- Manual annotation: post-translational modifications, variants, transmembrane domains, signal peptide…
- Manual annotation: alternative products (protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation…)
- Cross-references to over 125 databases
- Keywords and Gene Ontology
- Example sequence shown in the schematic:
  MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR
  AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK
  NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL
  NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
  GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG
  TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR
  AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD
  EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
  VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG

UniProtKB/Swiss-Prot: manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extraction of literature information, ortholog data propagation…)

UniProtKB/Swiss-Prot: 1- Protein sequence curation
UniProtKB/Swiss-Prot: a gene-centric view of the protein space
1 entry <-> 1 gene (1 species)
The displayed protein sequence: …canonical, representative, consensus…
+ alternative sequences (described within the entry)

What is the current status?
• At least 20% of Swiss-Prot entries required a minimal amount of curation effort to obtain the “correct” sequence.
• Typical problems:
  - unsolved conflicts
  - uncorrected initiation sites
  - frameshifts
  - wrong gene prediction
  - other ‘problems’

[Figure: UCSC genome browser examples of CDS annotation submitted to INSDC…]

UniProtKB/Swiss-Prot: 2- Biological data curation
Extraction of literature information and protein sequence analysis, with maximum usage of controlled vocabulary.
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/PubMed)
- prediction programs (Prosite, TMHMM, …)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the source of each annotation.

Protein and gene names

General annotation (Comments)
…enable researchers to obtain a summary of what is known about a protein…

Human protein manual annotation: some statistics (June 2011)

Sequence annotation (Features)
…enable researchers to obtain a summary of what is known about a protein…

Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two.

Type of evidence                                  Qualifier
Strong experimental evidence                      None or Ref.X
Light experimental evidence                       Probable
Inferred by similarity with homologous protein    By similarity
Inferred by prediction                            Potential

Example: find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven).

‘Protein existence’ tag
• The ‘Protein existence’ tag indicates the evidence for the existence of a given protein.
• Different qualifiers:
  1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location),…)
  2. Evidence at transcript level (~19%)
  3. Inferred from homology (~58 %)
  4. Predicted (~5%)
  5. Uncertain (mainly in TrEMBL)
http://www.uniprot.org/docs/pe_criteria

UniProtKB cross-references
Additional information can be found in the cross-references (to more than 140 databases).
UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!
- Sequence: EMBL, IPI, PIR, RefSeq, UniGene
- 3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR
- PPI: DIP, IntAct, MINT, STRING
- PTM: GlycoSuiteDB, PhosphoSite, PhosSite
- Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
- Polymorphism: dbSNP
- 2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE
- Proteomic: PeptideAtlas, PRIDE, ProMEX
- Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
- Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN
- Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB
- Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
- Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
- Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs
- Ontologies: GO
- Other: BindingDB, DrugBank, NextBio, PMAP-CutDB

The UniProt web site
www.uniprot.org
• Powerful search engine, Google-like and easy to use, which also supports very directed field searches
• Scoring mechanism that presents relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries are bookmarkable, which supports programmatic access
• Tools: Search, Blast, Align, Retrieve, ID mapping

Search
A very powerful text search tool, with autocompletion and refinement options, for finding UniProt entries and documentation by biological information.
Example: find all human proteins located in the nucleus
The search interface guides users with helpful suggestions and hints.

Advanced Search
A very powerful search tool, to be used when you know in which entry section the information is stored.
Example: find all the proteins localized in the cytoplasm
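
A field-directed search of this kind amounts to matching a value inside a named entry section, optionally restricted to annotations that carry no non-experimental qualifier. The Python sketch below shows the idea on toy in-memory records. It is not the UniProt search engine or its query syntax: the `Annotation` class, the `advanced_search` function, the section names and the `TOY…` accessions are all hypothetical, and "experimentally proven" is assumed to mean "no qualifier attached".

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    section: str                     # entry section, e.g. "subcellular location"
    value: str                       # annotated value, e.g. "Cytoplasm"
    qualifier: Optional[str] = None  # "Probable", "By similarity", "Potential" or None


def is_experimentally_proven(annotation: Annotation) -> bool:
    # Assumption: only annotations without any qualifier count as proven.
    return annotation.qualifier is None


def entry_matches(annotations: Iterable[Annotation], section: str,
                  value: str, proven_only: bool) -> bool:
    """True if one annotation has this section and value (and is proven, if asked)."""
    return any(
        a.section == section
        and a.value == value
        and (not proven_only or is_experimentally_proven(a))
        for a in annotations
    )


def advanced_search(entries: Dict[str, List[Annotation]],
                    criteria: List[Tuple[str, str]],
                    proven_only: bool = False) -> List[str]:
    """Return the sorted accessions of entries that satisfy every criterion."""
    return sorted(
        accession
        for accession, annotations in entries.items()
        if all(entry_matches(annotations, s, v, proven_only) for s, v in criteria)
    )


if __name__ == "__main__":
    # Toy records with made-up accessions, for illustration only.
    toy_entries = {
        "TOY001": [Annotation("subcellular location", "Cytoplasm"),
                   Annotation("modified residue", "Phosphoserine")],
        "TOY002": [Annotation("subcellular location", "Cytoplasm", "By similarity")],
    }
    print(advanced_search(toy_entries, [("subcellular location", "Cytoplasm")]))
    print(advanced_search(toy_entries,
                          [("subcellular location", "Cytoplasm"),
                           ("modified residue", "Phosphoserine")],
                          proven_only=True))
```

Keeping the qualifier as a separate field is what allows the same query to be run with or without the "experimentally proven" restriction.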