Inside protein databases: bridging sequences and knowledge
WELCOME !
Geneva, 2017
SIB Swiss Institute of Bioinformatics
[email protected] & [email protected] Swiss-Prot, SIB
[email protected] CALIPHO, SIB
SIB Swiss Institute of Bioinformatics
• 70 groups • 800 collaborators • biologists, biochemists, computer scientists, physicists, physicians, chemists, mathematicians, pharmacists, …
Common point: bioinformatics
Inside protein databases: bridging sequences and knowledge
All the material is available here: http://education.expasy.org/cours/InsideProteinDatabases2017/
Content
• a description of the major protein sequence databases and their sequence annotation pipelines, focusing on UniProtKB/Swiss-Prot
• an introduction to Gene Ontology (GO)
• practical sessions on how to query protein sequence databases, how to perform enrichment analysis on datasets and how to interpret the results of such analyses
Objectives
• know the differences between the major protein sequence databases
• understand the major sequence annotation pipelines and the GO annotation pipelines
• estimate the protein sequence accuracy and the annotation quality
08h30 Protein sequence databases: theory
10h30 COFFEE BREAK
11h00 Controlled vocabularies and standardization resources: theory
12h15 PAUSE
13h30 Protein sequence databases and Gene Ontology: practicals
15h00 COFFEE BREAK
15h30 Analysis tools using ontologies: theory
16h00 Protein sequence databases and Gene Ontology: practicals
17h00 Evaluation / Exam
18h00 END
Menu
Introduction
Nucleic acid sequence databases EMBL, GenBank, DDBJ
Protein sequence databases UniProt databases (UniProtKB)
NCBI protein databases (NCBInr, RefSeq)
Practicals…
[Figure: a UniProtKB/Swiss-Prot entry in text format (BIOLOGICAL KNOWLEDGE attached to a protein sequence), surrounded by its uses: protein databases, text search, training datasets, statistics, genome annotation (features), system biology, annotation, proteomics, BLAST, phylogeny, domains]

ID   HIDH_SOYBN              Reviewed;         319 AA.
AC   Q5NUF3;
...
DE   RecName: Full=2-hydroxyisoflavanone dehydratase;
DE            EC=3.1.1.1;
DE            EC=4.2.1.105;
DE   AltName: Full=Carboxylesterase HIDH;
GN   Name=HIDH; OrderedLocusNames=GLYMA01G45020;
OS   Glycine max (Soybean) (Glycine hispida).
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
OC   rosids; fabids; Fabales; Fabaceae; Papilionoideae; Phaseoleae;
OC   Glycine.
OX   NCBI_TaxID=3847;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA], FUNCTION, CATALYTIC ACTIVITY, MUTAGENESIS OF GLY-78;
RP   GLY-79; THR-164; ASP-263 AND HIS-295, AND BIOPHYSICOCHEMICAL PROPERTIES.
RC   TISSUE=Seedling;
RX   PubMed=15734910; DOI=10.1104/pp.104.056747;
RA   Akashi T., Aoki T., Ayabe S.;
RT   "Molecular and biochemical characterization of 2-hydroxyisoflavanone dehydratase.
RT   Involvement of carboxylesterase-like proteins in leguminous isoflavone biosynthesis.";
RL   Plant Physiol. 137:882-891(2005).
...
CC   -!- FUNCTION: Dehydratase that mediates the biosynthesis of
CC       isoflavonoids. Can use both 4'-hydroxylated and 4'-methoxylated 2-
CC       hydroxyisoflavanones as substrates. Has also a slight
CC       carboxylesterase activity toward p-nitrophenyl butyrate.
CC   -!- CATALYTIC ACTIVITY: 2,7,4'-trihydroxyisoflavanone = daidzein +
CC       H(2)O.
...
CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Kinetic parameters:
CC         KM=29 uM for 2,7-dihydroxy-4'-methoxyisoflavanone (at pH 7.5 and
CC         30 degrees Celsius);
...
CC   -!- PATHWAY: Secondary metabolite biosynthesis; flavonoid
CC       biosynthesis.
CC   -!- SIMILARITY: Belongs to the 'GDXG' lipolytic enzyme family.
DR   EMBL; AB154415; BAD80840.1; -; mRNA.
DR   EMBL; BT097440; ACU22699.1; -; mRNA.
DR   EMBL; CM000834; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   RefSeq; NP_001237228.1; NM_001250299.2.
DR   UniGene; Gma.19376; -.
DR   ProteinModelPortal; Q5NUF3; -.
...
DR   Pfam; PF07859; Abhydrolase_3; 1.
DR   PROSITE; PS01173; LIPASE_GDXG_HIS; 1.
PE   1: Evidence at protein level;
KW   Complete proteome; Flavonoid biosynthesis; Hydrolase; Lyase;
KW   Reference proteome.
FT   CHAIN         1    319       2-hydroxyisoflavanone dehydratase.
FT                                /FTId=PRO_0000424101.
...
FT   ACT_SITE     77     77       Potential.
FT   ACT_SITE    164    164
FT   ACT_SITE    263    263
FT   ACT_SITE    295    295
…
SQ   SEQUENCE   319 AA;  35138 MW;  E8333CF425FBA4A3 CRC64;
     MAKEIVKELL PLIRVYKDGS VERLLSSENV AASPEDPQTG VSSKDIVIAD NPYVSARIFL
     PKSHHTNNKL PIFLYFHGGA FCVESAFSFF VHRYLNILAS EANIIAISVD FRLLPHHPIP
     AAYEDGWTTL KWIASHANNT NTTNPEPWLL NHADFTKVYV GGETSGANIA HNLLLRAGNE
     SLPGDLKILG GLLCCPFFWG SKPIGSEAVE GHEQSLAMKV WNFACPDAPG GIDNPWINPC
     VPGAPSLATL ACSKLLVTIT GKDEFRDRDI LYHHTVEQSG WQGELQLFDA GDEEHAFQLF
     KPETHLAKAM IKRLASFLV
//
…
Annotation: where does it come from?
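In the text format, each kind of annotation sits on lines that start with a two-letter line code (ID, AC, DE, CC, DR, FT, SQ, ...). The sketch below is a minimal, hedged illustration of how such an entry can be split by line code; the function name `group_by_line_code` is hypothetical, the sample keeps only a few lines of the HIDH_SOYBN entry, and continuation handling is deliberately ignored.

```python
from collections import defaultdict

# A few lines taken from the HIDH_SOYBN entry shown on the slide (text format).
ENTRY = """\
ID   HIDH_SOYBN              Reviewed;         319 AA.
AC   Q5NUF3;
DE   RecName: Full=2-hydroxyisoflavanone dehydratase;
GN   Name=HIDH; OrderedLocusNames=GLYMA01G45020;
OS   Glycine max (Soybean) (Glycine hispida).
OX   NCBI_TaxID=3847;
DR   Pfam; PF07859; Abhydrolase_3; 1.
DR   PROSITE; PS01173; LIPASE_GDXG_HIS; 1.
PE   1: Evidence at protein level;
//
"""


def group_by_line_code(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return dict(fields)


entry = group_by_line_code(ENTRY)
accession = entry["AC"][0].rstrip(";").split(";")[0]     # primary accession
cross_refs = [dr.split(";")[0] for dr in entry["DR"]]    # cross-referenced databases
print(accession, cross_refs)
```

For real work a maintained parser is a safer choice than this sketch, since actual entries contain wrapped lines and many more line codes.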
Annotation is the process of assigning biological information to DNA or protein sequences
Information (e.g. protein function and subcellular location) may come from:
• publications (experimental data)
• sequence similarity (…quest for orthologs)
• protein domain
• computational analysis (prediction)
Computational analysis can be manually checked (by the ‘biocurators’) or not
Examples: UniProtKB and Gene Ontology annotation
Protein sequence: where does it come from?
Protein sequence origins
• > 180 billion ‘different’ proteins on earth (∑ N species x M genes)
• ~ 74 million ‘known and public’ protein sequences in 2017
• About 98% of the protein sequences are derived from the translation of nucleotide sequences (mRNA or DNA/genome)
• About 1 % come from direct protein sequencing (Edman, MS/MS…)
The ideal life of a sequence …
http://www.ncbi.nlm.nih.gov/genbank/submit RNA, genes, genomes, …
Nucleic acid sequence databases
Protein sequence databases EMBL/GenBank/DDBJ
http://www.insdc.org/
RefSeq
• The Reference Sequence (RefSeq) collection provides a non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
RefSeq: NP_000790.2; NM_000799.2.
• Contains protein sequences derived from gene prediction, not submitted to EMBL-Bank/GenBank/DDBJ
Ensembl
• Creates, integrates and distributes reference datasets and analysis tools that enable genomics. • Joint project between EMBL-EBI and the Sanger Centre
Ensembl: ENST00000252723; ENSP00000252723; ENSG00000130427.
• Contains protein sequences derived from gene prediction, not submitted to EMBL-Bank/GenBank/DDBJ
• EMBL (ENA, European Nucleotide Archive): EBI (UK)
• GenBank: NCBI (US)
• DDBJ: Japan
Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.
DNA sequence of the human EPO gene
Server EBI; Database EMBL/ENA; text format
accession number
taxonomy
References - the submitters
Cross-references
CDS: CoDing Sequence (proposed by submitters), 5 exons
DNA sequence
EMBL/GenBank/DDBJ
• Archive: nothing goes out -> highly redundant !
• Most annotations are done by the submitters: heterogeneity of the quality and of the completion
• Archive: all submitted information remains there; not updated (exception: Third Party Annotation (TPA))
• Many errors: in sequences, in annotations, in CDS attribution, no consistency of annotations
EMBL/GenBank/DDBJ and annotation
“Beyond limited editorial control and some internal integrity checks (for example, proper use of INSD formats and translation of coding regions specified in CDS entries are verified), the quality and accuracy of the record are the responsibility of the submitting author, not of the database. The databases will work with submitters and users of the database to achieve the best quality resource possible.”
http://www.insdc.org/policy
EMBL/GenBank/DDBJ and annotation
• many scientists assume that GenBank annotation is kept up to date, and they are surprised to hear that it is not
• the annotation has remained static: a gene labeled 'hypothetical protein' a few years ago might now have a known function.
• erroneous and inconsistent naming of genes.
• a name is transferred from one gene to another on the basis of sequence similarity (usually from a BLAST search). As more genomes are annotated, and more BLAST searches are run, the original source of the name quickly becomes lost.
• scientists should fix errors that they find. But this would quickly destroy the archival function of GenBank, as original entries would be erased over time.
(PMID: 17274839)
information provided by the submitter of a nucleotide entry…
DR   EMBL; DQ339047; ABC68418.1; -; mRNA.

FT   source          1..1397
FT                   /organism="Rattus norvegicus"
FT                   /strain="Sprague-Dawley"
FT                   /mol_type="mRNA"
FT                   /sex="female"
FT                   /tissue_type="ovary"
FT                   /db_xref="taxon:10116"
FT   CDS             70..1329
FT                   /codon_start=1
FT                   /product="testis derived transcript"
FT                   /note="TES"
FT                   /db_xref="GOA:Q2LAP6"

EMBL/GenBank/DDBJ
DNA sequence & CoDing Sequences (CDS)
Coding sequence (CDS) annotation
CDS: CoDing Sequence (provided by submitters)
(Slide: J. McDowall)
CDS annotation provided by the submitters
The first Met !
CDS translation provided by EMBL/GenBank
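The protein sequence is nothing more than the conceptual translation of the span that the submitter declared as CDS (for example `CDS 70..1329` with `/codon_start=1` in a feature table). A minimal sketch of that step, assuming the standard genetic code and a simple, unspliced `start..end` location; the function name `translate_cds` and the toy mRNA are hypothetical and not taken from any real entry.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna, start, end):
    """Translate the CDS located at 1-based, inclusive positions start..end."""
    cds = dna[start - 1:end].upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    protein = "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))
    return protein.rstrip("*")  # drop the terminal stop codon, if present


# Hypothetical toy mRNA: 6 nt of 5' UTR, an 18 nt CDS, 4 nt of 3' UTR.
toy_mrna = "ggcacc" + "ATGAAAGAAATCGTTTAA" + "ttgc"
print(translate_cds(toy_mrna, 7, 24))
```

If the submitter picks the wrong start position (not the first Met) or the wrong frame, the same procedure silently yields a different protein, which is why the CDS annotation matters so much for sequence accuracy.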
This protein sequence is integrated in the protein sequence databases
human EPO gene:
Cross-references from UniProtKB to EMBL/GenBank/DDBJ
UCSC genome browser: human EPO mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)
contig
5’ 3’
CDS & Protein sequence accuracy
Coding sequence (CDS) annotation
(Slide: J. McDowall)
UCSC genome browser: another gene…
mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)
How to deal with these sequences: see later….
The hectic life of a protein sequence …
Data not submitted to public databases, delayed or cancelled… cDNAs, ESTs, genomes, …
Nucleic acid databases: EMBL, GenBank, DDBJ
…if the submitters provide an annotated Coding Sequence (CDS)
no CDS
RefSeq, Ensembl and other
Gene prediction: Ensembl, RefSeq
Protein sequence databases
The hectic life of a sequence …
Data not submitted to public databases, delayed or cancelled… cDNAs, ESTs, genomes, …
Scientific publications EMBL, GenBank, DDBJ derived sequences
[Diagram: protein sequence databases are fed by CoDing Sequences provided by submitters, and by CoDing Sequences provided by submitters and gene prediction. Databases shown: TrEMBL, GenPept, RefSeq, PRF, UniProtKB, Ensembl, TPA, CCDS, Swiss-Prot, NeXtProt, (IPI), UniParc, (PIR), PDB, + all ‘species’ specific databases (EcoGene, TAIR, …)]
Major ‘general’ protein sequence database ‘sources’
Integrated resources (‘cross-references’: TPA, PIR, PDB, PRF)
UniProtKB: Swiss-Prot + TrEMBL
Resources are kept separated
NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA
not complete !!! (only entries created before 2007 ?)
• UniProtKB/Swiss-Prot: manually annotated protein sequences (13’000 species)
• UniProtKB/TrEMBL: submitted CDS (EMBL) + Ensembl prediction + automated annotation; non redundant with Swiss-Prot (~710’000 species)
• GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (~700’000 species)
• PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
• PDB: Protein Data Bank: 3D data and associated sequences
• PRF: Protein Research Foundation, journal scan of ‘published’ peptide sequences
• RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (66’000 species)
• TPA: Third Party Annotation
UniProt consortium
• EBI: European Bioinformatics Institute (UK)
• SIB: Swiss Institute of Bioinformatics (CH)
• PIR: Protein Information Resource (USA)
www.uniprot.org
~74 million protein entries derived from ~710’000 different species
- 10 million connections/year
- 500’000 unique visitors/year
Resources available
UniProt databases
Content of a UniProtKB record
(1) Protein sequence(s)
canonical & isoforms (see later)
Origin of protein sequences
UniProtKB protein sequences are mainly derived from:
- INSDC (EMBL/ENA, GenBank, DDBJ): translated submitted coding sequences (CDS) (95.1 %)
- Ensembl (gene prediction; 3.2 %)
- RefSeq sequences (0.3 %)
- Sequences of PDB
- Direct submission or sequences scanned from literature (includes direct protein sequencing)
UniProt does not perform any gene prediction; predicted sequences are integrated from Ensembl
(2) Annotation: biological knowledge, different sections
(3) Sequence annotation (features (FT))
(4) Annotation score & evidence for protein existence
Status: Reviewed / Unreviewed
Annotation score: http://insideuniprot.blogspot.ch/2014/10/introducing-annotation-scores-in-uniprot.html
Evidence for protein existence
(5) Help sections
• Context sensitive
• Feedback (update)
• Blog: http://insideuniprot.blogspot.ch/
• YouTube: https://www.youtube.com/user/uniprotvideos
• Query Help (FAQs, user manual, …)
The 2 sections of UniProtKB
UniProtKB is composed of 2 sections:
• UniProtKB/Swiss-Prot: reviewed, manually annotated. Records with information extracted from literature and curator-evaluated computational analysis.
• UniProtKB/TrEMBL: unreviewed, automatically annotated. Records that await full manual annotation.
released every 4 weeks
Source of annotation & Evidence statements
Experimental data (publication)
Computational analysis (curator-evaluated or not)
UniProtKB - P51787 (KCNQ1_HUMAN)
Source of annotation/Evidence statements
UniProtKB/Swiss-Prot: manual insertion, shown in yellow
UniProtKB/TrEMBL: automated insertion, shown in blue
Publication: {ECO:0000269|PubMed:10476968} (text format)
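In the text format, an evidence statement is a tag in curly braces that pairs an ECO code with its source. A minimal sketch for extracting these tags, assuming one evidence per pair of braces as in the example above (real entries may list several evidences inside one pair of braces, which this pattern does not handle); `parse_evidence` is a hypothetical helper name.

```python
import re

# Matches tags such as {ECO:0000269|PubMed:10476968}; the source part is optional.
EVIDENCE_TAG = re.compile(r"\{(ECO:\d{7})(?:\|([^:}]+):([^}]+))?\}")


def parse_evidence(text):
    """Return a list of (eco_code, source_database, source_id) tuples."""
    return [match.groups() for match in EVIDENCE_TAG.finditer(text)]


line = "Publication: {ECO:0000269|PubMed:10476968}"
print(parse_evidence(line))
```

The ECO code says what kind of evidence supports the statement; the source part says where it comes from (here a PubMed identifier).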
ECO code (ontology)
UniProtKB/Swiss-Prot
2 % of UniProtKB protein sequences
UniProtKB/Swiss-Prot Manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)
UniProtKB/ Swiss-Prot
Entry vs Protein sequence(s)
One entry – one gene – one species
One or several protein sequences (isoforms) per entry
canonical & isoform
Used to construct the UniProtKB canonical sequence
Automatically mapped to the UniProtKB record
http://www.uniprot.org/uniprot/P54710#cross_references
UCSC genome browser: another gene
mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)
Manually checked (comparison with orthologs, etc.): incorrect / correct
The ‘incorrect’ sequence is found at NCBI nr (GenPept) (without warning)
UniProtKB/Swiss-Prot
Entry vs Protein sequence(s)
One entry – one gene – one species
One or several protein sequences (isoforms) per entry
canonical & isoform
…. 553’474 ‘canonical’
+ 39’441 ‘isoforms’ (7 %)
http://web.expasy.org/docs/relnotes/relstat.html
Beware
The isoform sequences are not included in all datasets.
Examples:
- Complete proteome -> download Fasta (canonical & isoform)
- Blast @ NCBI (NCBInr)
UniProtKB/Swiss-Prot Manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)
Source of annotation/Evidence statements
• Selected Publication (experimental)
• Another UniProtKB entry (orthologs): by similarity
• An entry from another database: imported
• Curator-evaluated computational analysis* UniRule
• Combined sources
* more details later…
comprehensive and computer friendly representation of biological knowledge
PubMed=16595657
UniProtKB Q9FYL3
Controlled vocabulary
www.uniprot.org
Sequence annotation -> Feature viewer
http://insideuniprot.blogspot.ch/2016/02/introducing-uniprot-feature-viewer.html
Source of annotation/Evidence statements
• Selected Publication (experimental)
• Another UniProtKB entry (orthologs): by similarity
• An entry from another database: imported
• Curator-evaluated computational analysis* UniRule
• Combined sources
* more details later…
Protein sequence analysis: in-house resource ‘Anabelle’
Annotation: Controlled vocabulary (CV)
• Keywords
• Cellular component
• Disease
• Taxonomy
• …
• Gene Ontology
Annotation (CV): Keywords
http://www.uniprot.org/help/keywords
Annotation (CV): Cellular component
Annotation (CV): Gene Ontology
www.uniprot.org/uniprot/P01588
http://www.uniprot.org/help/keywords_vs_go
Keyword to GO
UniProtKB/TrEMBL
98 % of UniProtKB protein sequences
UniProtKB/TrEMBL Automated annotation
Protein sequence
- The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl).
- 100% identical sequences (same length, same organism) are merged automatically.
Biological information: sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation: SAAS, UniRule, SAM, InterPro
UniProtKB/TrEMBL
Entry vs Protein sequence(s) One sequence – one gene – one species
One protein sequence per entry
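The rule that 100% identical sequences from the same organism are merged into a single TrEMBL entry can be pictured as grouping records on the pair (organism, sequence). The sketch below only illustrates that idea and is not the UniProt production pipeline; the function name, the accessions and the sequences are all made up.

```python
from collections import defaultdict


def merge_identical(records):
    """Group (accession, taxon_id, sequence) records sharing organism and sequence."""
    merged = defaultdict(list)
    for accession, taxon_id, sequence in records:
        merged[(taxon_id, sequence.upper())].append(accession)
    return merged


# Hypothetical submissions (accessions and sequences are invented).
records = [
    ("CDS_A", 10116, "MKEIVLLP"),
    ("CDS_B", 10116, "MKEIVLLP"),   # same organism, same sequence: merged with CDS_A
    ("CDS_C", 10090, "MKEIVLLP"),   # same sequence, other organism: kept apart
    ("CDS_D", 10116, "MKEIVLLPA"),  # one residue longer: kept apart
]
for (taxon_id, sequence), accessions in merge_identical(records).items():
    print(taxon_id, len(sequence), accessions)
```

Sequences that differ by a single residue or come from another species stay in separate entries, which is one reason why TrEMBL is far larger than the number of genes it describes.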
Source of annotation: Computational analysis
http://biofunctionprediction.org/cafa/
Automated annotation
• Automatically generated rules (SAAS): generates a set of decision trees using data mining (new set every UniProtKB release)
• Manually generated rules (UniRule): maintains a set of manual annotation rules; UniRule = PIR* + HAMAP + Rulebase
• Sequence analysis methods (SAM): signal, transmembrane, coils prediction
• InterPro: domains & GO terms
SAAS
SAAS learns from the properties present in the reviewed UniProtKB (Swiss-Prot) entries and uses the following attribute types to define the learning entries: InterPro protein family, taxonomy and sequence length. This combination allows SAAS to generate rules to annotate protein properties such as function, catalytic activity, pathway membership, subcellular location, protein names and feature predictions.
http://insideuniprot.blogspot.ch/2016/10/automatic-learning-based-annotation-in.html
SAAS
Query: source:SAAS00629841, Feb 2017
Beware: the SAAS number may change if the rules change…
UniRule
Query: source:RU361160, Feb 2017
Additional information on UniRule: http://insideuniprot.blogspot.ch/2015/11/unirule-automatic-annotation-system-in.html
SAM
No GO (cellular component) annotation in this case…
http://www.uniprot.org/help/sam
SAM: Transmembrane
annotation:(type:transmem) AND reviewed:no
SAM: Signal peptide
Automated annotation
Some important remarks
UniRule
[Chart: total number of records in UniProtKB vs records with automated annotation (w/o InterPro)]
~ 51 % of TrEMBL records contain automated annotation
Automated annotation in UniProtKB/TrEMBL
Differences between TrEMBL and Swiss-Prot

 | TrEMBL | Swiss-Prot
Annotation | automatic | manual
Annotation = complete ? | Partial annotation (~50 % of the entries) | As complete and systematic as possible
Set of sequences = complete ? | As complete as possible; does not contain Swiss-Prot sequences ! | Complete sets only for a few organisms
Number of entries | 73’000’000 | 550’000
Number of species | 590’000 | 13’000
When you compare biological information across protein datasets, beware of the ratio of TrEMBL vs Swiss-Prot entries in your dataset: the results might not be only ‘biological’!
Set of mouse proteins with N-glycosylation - March 2017
21.7 % 0.05 %
7.2 %
Queries:
annotation:(type:carbohyd "n linked glcnac ellipsis") AND organism:"Mus musculus (Mouse) [10090]"
organism:"Mus musculus (Mouse) [10090]"
The UniProt web site
www.uniprot.org
• Powerful search engine, google-like and easy-to-use, but also supports very directed field searches
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access
• Search, Blast, Align, Retrieve/ID mapping
Search
A very powerful text search tool, with autocompletion and refinement options, to look for UniProt entries and documentation by biological information
Find all human proteins located in the nucleus
The search interface guides users with helpful suggestions and hints
http://www.uniprot.org/uniprot/?query=human+nucleus&sort=score
http://www.uniprot.org/uniprot/?query=organism%3Ahuman+AND+location%3Anucleus&sort=score
Advanced Search
A very powerful search tool
To be used when you know in which entry section the information is stored
Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)
Result pages: highly customizable
Add columns (also available for BLAST)
From: http://www.uniprot.org/uniprot/?query=gene:epo%20organism:%22Homo%20sapiens%20%28Human%29%20[9606]%22&fil=&sort=score
To: http://www.uniprot.org/uniprot/?query=gene:epo organism:"Homo sapiens (Human) [9606]"&sort=score&columns=id,entry name,reviewed,protein names,genes,organism,length
http://insideuniprot.blogspot.ch/2015/03/customise-and-share-your-search-results.html
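Because the URL of a result page reflects the query, the same searches can be assembled programmatically. The sketch below only builds URL strings and makes no network request; it assumes the URL scheme shown on these 2017 slides (`query`, `sort` and `columns` parameters), and the current UniProt website may use a different scheme. `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# URL scheme as shown on the 2017 slides; treat it as an assumption today.
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_query_url(query, columns=None, sort="score"):
    """Build a bookmarkable search URL from a query string and optional columns."""
    params = {"query": query, "sort": sort}
    if columns:
        params["columns"] = ",".join(columns)
    return BASE_URL + "?" + urlencode(params)


# Free-text search, then the same question as a fielded query.
print(build_query_url("human nucleus"))
print(build_query_url("organism:human AND location:nucleus"))
# Customised result table for the human EPO gene.
print(build_query_url("gene:epo", columns=["id", "entry name", "reviewed", "length"]))
```

The fielded form restricts each term to one section of the entry, which is the idea behind the advanced search.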
Download
Different formats (fasta, txt, Excel, RDF, etc.)
Highlight sequence annotation in alignment (BLAST or multiple alignment)
Formats of a UniProtKB entry
Format ‘Web site’ UniProt Format Fasta
Format text
Format RDF http://sparql.uniprot.org/
http://insideuniprot.blogspot.ch/2014/08/have-you-tried-uniprot-rdf.html
Contact us !
More during practicals…
UniProt: other databases
UniProt databases
UniParc: protein sequence archive (EMBL equivalent at the protein level)
Each entry contains a protein sequence, taxonomic information, cross-links to other databases where you find the sequence (active or not)
No annotation
All the public patented sequences are stored in UniParc (EPO, USPTO, JPO)
You can: query, Blast, download
~110,000,000 entries
Example: http://www.uniprot.org/uniparc/UPI0000033477 (views 1 and 2)
Remarks:
- UniProt does not perform any gene prediction; predicted sequences are integrated from Ensembl
- Most non-germline immunoglobulins and T-cell receptors, most patent sequences, highly over-represented data (e.g. viral antigens) and pseudogene sequences are excluded from UniProtKB, but stored in UniParc.
UniProt databases
UniRef
3 sets of clusters of protein sequences (100, 90 and 50 % identity); useful to speed up sequence similarity searches (BLAST)
You can: query, Blast, download
UniProt databases
Proteomes
• A proteome is the entire set of proteins expressed by a specific organism whose genome has been completely sequenced.
• It normally includes sequences that derive from extra-chromosomal elements such as plasmids or organellar genomes.
• Some proteomes may also include protein sequences based on high quality cDNAs that cannot be mapped to the current genome assembly due to sequencing errors or gaps.
• UniProt proteomes may include both manually reviewed (UniProtKB/Swiss-Prot) and unreviewed (UniProtKB/TrEMBL) entries.
Proteomes
• a proteome is formed from all UniProtKB/Swiss-Prot entries (irrespective of whether they map to Ensembl or Ensembl Genomes) plus those UniProtKB/TrEMBL entries mapping to Ensembl or Ensembl Genomes for that proteome.
• there may be several proteomes per taxonomic identifier -> Reference proteome
…for users that prefer to use a single best-annotated proteome from a particular taxonomic group….
To include isoform sequences
Around UniProtKB
A ‘one stop shop’ for human proteins
o To visualize all the integrated information
o To extract/export relevant annotations
o To perform complex and precise queries
Data quality assessment
neXtProt applies a three-tiered data grading system for data quality:
• Gold: very high confidence level (< 1% error)
• Silver: greater than 95% confidence
• Bronze: any data with less than 95% confidence is assigned "bronze" quality and not integrated into neXtProt
Each protein has a page with multiple aspects (views)
Sequence view
Medical view
Phenotype view: Phenotypes
Phenotype view: Variants associated with phenotypes
Examples of complex queries
• Proteins which are located in mitochondrion and have at least one HPA antibody and exist in at least one proteome identification set
• Proteins that interact with viral proteins
• Proteins which are targets of drugs for cardiac therapy
• Proteins located on chromosome 2 and having at least one variant on a phosphorylated tyrosine
SPARQL search
• Syntax not the most user friendly
• Examples available to help users design queries
Advanced search: results
Around UniProt (2)
• Viralzone: http://viralzone.expasy.org/
Around UniProt (3)
• VenomZone: http://venomzone.expasy.org/
[email protected]
http://insideuniprot.blogspot.ch/2016/09/how-can-you-increase-impact-of-your.html
NCBI protein databases (Entrez protein, NCBI nr)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Major ‘general’ protein sequence database ‘sources’
Integrated resources (‘cross-references’: TPA, PIR, PDB, PRF)
UniProtKB: Swiss-Prot + TrEMBL
Resources are kept separated
NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA
not complete !!! (only entries created before 2007 ?)
• UniProtKB/Swiss-Prot: manually annotated protein sequences (13’000 species)
• UniProtKB/TrEMBL: submitted CDS (EMBL) + Ensembl prediction + automated annotation; non redundant with Swiss-Prot (~590’000 species)
• GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (~450’000 species)
• PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
• PDB: Protein Data Bank: 3D data and associated sequences
• PRF: Protein Research Foundation, journal scan of ‘published’ peptide sequences
• RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (54’000 species)
• TPA: Third Party Annotation
NCBI nr - Entrez ‘protein’
http://www.ncbi.nlm.nih.gov/protein/
NCBI-nr
• GenPept (source: GenBank; translated CDS)
• RefSeq
• TPA (Third Party Annotation)
• Swiss-Prot (does not include isoform sequences)
• PIR (not updated since 2003)
• PRF (journal scan of ‘published’ peptides)
• PDB (Protein Data Bank, 3D structure)
• TrEMBL (some entries….)
GenPept
Translation from annotated CDS in GenBank
Contains all translated CDS annotated in GenBank/EMBL/DDBJ sequences
- equivalent to UniProtKB/TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR….)
GenPept: ‘translations from all annotated coding regions (CDS) in GenBank’
Annotation according to the submitter
No GO term !
RefSeq
Produced by NCBI and NLM
http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf
FAQ: http://www.ncbi.nlm.nih.gov/books/NBK50679/
http://www.ncbi.nlm.nih.gov/refseq/
RefSeq
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
One protein sequence -> one entry
RefSeq entry
AC prefixes: Protein: NP_, mRNA: NM_, DNA: NC_
The corresponding mRNA
Taxonomy
References
Status and GenBank source
Annotation: automated, derived from Swiss-Prot, in-house; no GO annotation
Cross-references
Sequence
RefSeq
RefSeq status | Manual annotation
GENOME ANNOTATION | No
INFERRED | No
MODEL | No
PREDICTED | No
PROVISIONAL | No
REVIEWED | Yes (sequence + functional information and features)
VALIDATED | Yes (initial sequence)
Whole Genome Sequencing (WGS) | No
http://www.ncbi.nlm.nih.gov/RefSeq/
UniProtKB/Swiss-Prot: One gene -> one entry (9 isoforms)
RefSeq: One protein sequence -> one entry
Different datasets
Example: human proteome ~ 20’200 genes
Query for organism:’homo sapiens’ (Nov 2013)
• UniProtKB: 134’679 entries + alt sequences (19’425) = 154’104
• UniProtKB/Swiss-Prot: 20’278 entries + alt sequences (19’425) = 39’703
• UniProtKB/TrEMBL: 114’401 entries
• UniParc: 1’082’249 entries
• RefSeq: 32’898 sequences
• Ensembl: 104’488 peptide sequences
Query for ‘homo sapiens’ + Complete proteome (KW-181)
• UniProtKB: 56’392 + alt sequences (15’435) = 71’827
• UniProtKB/Swiss-Prot: 20’272 + alt sequences (19’425) = 39’697
• UniProtKB/TrEMBL: 48’774
• NeXtProt: 20’140 / all isoforms = 39’565 (20’140 + 19’425)
Query: organism:"Homo sapiens (Human) [9606]"
NeXtProt
RefSeq
NCBI nr query & BLAST
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Look for human FOXP2 @ NCBI nr
BLAST @ NCBI
Does not include isoforms: UniProtKB/Swiss-Prot alternative isoform sequences are not included !
NCBI nr: example of ‘cluster’
NCBI-nr clusters: identical proteins (100%) derived from the same organism
UniRef provides clustered sets of sequences at several resolutions (100%, 90% and 50%) for all the organisms.
UniProtKB entries at NCBI…
A UniProtKB/Swiss-Prot entry with the NCBI look
Important remarks concerning the datasets
Different servers…
UniProtKB/TrEMBL entries are not available at NCBI nr (with some exceptions): the same protein sequence might be present, but not with the UniProtKB/TrEMBL AC. This is not the case for UniProtKB/Swiss-Prot entries.
ID/AC mapping
Accession number (AC) mapping
These identifiers are all pointing to a TP53 (p53) protein sequence !
P04637, NP_000537, NP_001119584.1, NP_001119585.1, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, etc.
http://www.uniprot.org/uploadlists/
http://www.ebi.ac.uk/Tools/picr/
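Before sending a mixed list of identifiers to a mapping service, it helps to know which database each one comes from. The sketch below guesses the source from the shape of the identifier alone. The UniProtKB accession pattern follows the published accession number format; the other patterns are simplified assumptions derived from the examples on this slide, and `classify_identifier` is a hypothetical helper name.

```python
import re

# (database label, pattern) pairs, tried in order.
PATTERNS = [
    ("UniProtKB AC", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("RefSeq protein", re.compile(r"^[NXY]P_\d+(\.\d+)?$")),   # assumption: NP_/XP_/YP_
    ("Ensembl gene", re.compile(r"^ENSG\d{11}$")),             # assumption: human gene IDs
    ("CCDS", re.compile(r"^CCDS\d+(\.\d+)?$")),
    ("UniParc", re.compile(r"^UPI[0-9A-F]{10}$")),
    ("IPI", re.compile(r"^IPI\d{8}$")),
]


def classify_identifier(identifier):
    """Return the name of the first database whose pattern matches, else 'unknown'."""
    for database, pattern in PATTERNS:
        if pattern.match(identifier):
            return database
    return "unknown"


tp53_ids = ["P04637", "NP_000537", "NP_001119584.1", "ENSG00000141510",
            "CCDS11118", "UPI000002ED67", "IPI00025087"]
for identifier in tp53_ids:
    print(identifier, "->", classify_identifier(identifier))
```

A pattern match only says what an identifier looks like; whether it still exists, and which sequence it points to, has to be checked with the mapping tools listed above.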
Understanding protein function is critical to research in many areas of science such as biology, medicine and biotechnology.
Keeping up with all of this information is a daunting task for most researchers. UniProt helps with this in the following ways:
• it provides an up-to-date, comprehensive body of protein information at a single site; • it aids scientific discovery by collecting, interpreting and organising this information so that it is easy to access and use; • it saves researchers countless hours of work in monitoring and collecting this information themselves; • it provides tools to help with protein sequence analysis; • it provides links to related information in more than 150 other biological databases to help you access additional information in more specialised collections.
• https://www.ebi.ac.uk/training/online/course/uniprot-exploring-protein-sequence-and-functional/why-do-we-need-uniprot
Thank you !
Thanks to Emmanuel Boutet and Ivo Pedruzzi for some of the slides !
Thanks to Diana Marek, Grégoire Rossier and Patricia Palagi for the organisation
All documents (including practicals) are online: http://education.expasy.org/cours/InsideProteinDatabases2017/
Inside Protein databases
SIB Swiss Institute of Bioinformatics
CMU, new building (Building A), 4th floor, room A04 2711
Building A Entrance: Av. de Champel
Building B
Main Entrance: 1 Michel Servet