
Inside protein databases: bridging sequences and knowledge

WELCOME !

Geneva, 2017

SIB Swiss Institute of Bioinformatics

[email protected] & [email protected] Swiss-Prot, SIB

[email protected] CALIPHO, SIB

SIB Swiss Institute of Bioinformatics

• 70 groups
• 800 collaborators
• biochemists, computer scientists, physicists, physicians, chemists, mathematicians, pharmacists, …

Common point: bioinformatics

Inside protein databases: bridging sequences and knowledge

All the material is available here: http://education.expasy.org/cours/InsideProteinDatabases2017/

Content
• a description of the major protein sequence databases and their sequence annotation pipelines, focusing on UniProtKB/Swiss-Prot
• an introduction to the Gene Ontology (GO)
• practical sessions on how to query protein sequence databases, how to perform enrichment analysis on datasets and how to interpret the results of such analyses

Objectives

• know the differences between the major protein sequence databases
• understand the major sequence annotation pipelines and the GO annotation pipelines
• estimate the protein sequence accuracy and the annotation quality

08h30 Protein sequence databases: theory
10h30 COFFEE BREAK
11h00 Controlled vocabularies and standardization resources: theory
12h15 PAUSE
13h30 Protein sequence databases and Gene Ontology: practicals
15h00 COFFEE BREAK
15h30 Analysis tools using ontologies: theory
16h00 Protein sequence databases and Gene Ontology: practicals
17h00 Evaluation / Exam
18h00 END

Menu

Introduction

Nucleic acid sequence databases EMBL, GenBank, DDBJ

Protein sequence databases UniProt databases (UniProtKB)

NCBI protein databases (NCBInr, RefSeq)

Practicals…

Example of a UniProtKB/Swiss-Prot entry (text format):

ID   HIDH_SOYBN              Reviewed;         319 AA.
AC   Q5NUF3;
...
DE   RecName: Full=2-hydroxyisoflavanone dehydratase;
DE            EC=3.1.1.1;
DE            EC=4.2.1.105;
DE   AltName: Full=Carboxylesterase HIDH;
GN   Name=HIDH; OrderedLocusNames=GLYMA01G45020;
OS   Glycine max (Soybean) (Glycine hispida).
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
OC   rosids; fabids; Fabales; Fabaceae; Papilionoideae; Phaseoleae;
OC   Glycine.
OX   NCBI_TaxID=3847;
RN   [1]
RP   SEQUENCE [MRNA], FUNCTION, CATALYTIC ACTIVITY, MUTAGENESIS OF GLY-78;
RP   GLY-79; THR-164; ASP-263 AND HIS-295, AND BIOPHYSICOCHEMICAL PROPERTIES.
RC   TISSUE=Seedling;
RX   PubMed=15734910; DOI=10.1104/pp.104.056747;
RA   Akashi T., Aoki T., Ayabe S.;
RT   "Molecular and biochemical characterization of 2-hydroxyisoflavanone
RT   dehydratase. Involvement of carboxylesterase-like proteins in
RT   leguminous isoflavone biosynthesis.";
RL   Plant Physiol. 137:882-891(2005).
...
CC   -!- FUNCTION: Dehydratase that mediates the biosynthesis of
CC       isoflavonoids. Can use both 4'-hydroxylated and 4'-methoxylated 2-
CC       hydroxyisoflavanones as substrates. Has also a slight
CC       carboxylesterase activity toward p-nitrophenyl butyrate.
CC   -!- CATALYTIC ACTIVITY: 2,7,4'-trihydroxyisoflavanone = daidzein +
CC       H(2)O.
...
CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Kinetic parameters:
CC         KM=29 uM for 2,7-dihydroxy-4'-methoxyisoflavanone (at pH 7.5 and
CC         30 degrees Celsius);
...
CC   -!- PATHWAY: Secondary metabolite biosynthesis; flavonoid
CC       biosynthesis.
CC   -!- SIMILARITY: Belongs to the 'GDXG' lipolytic family.
DR   EMBL; AB154415; BAD80840.1; -; mRNA.
DR   EMBL; BT097440; ACU22699.1; -; mRNA.
DR   EMBL; CM000834; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   RefSeq; NP_001237228.1; NM_001250299.2.
DR   UniGene; Gma.19376; -.
DR   ProteinModelPortal; Q5NUF3; -.
...
DR   Pfam; PF07859; Abhydrolase_3; 1.
DR   PROSITE; PS01173; LIPASE_GDXG_HIS; 1.
PE   1: Evidence at protein level;
KW   Complete proteome; Flavonoid biosynthesis; Hydrolase; Lyase;
KW   Reference proteome.
FT   CHAIN         1    319       2-hydroxyisoflavanone dehydratase.
FT                                /FTId=PRO_0000424101.
...
FT   ACT_SITE     77     77       Potential.
FT   ACT_SITE    164    164
FT   ACT_SITE    263    263
FT   ACT_SITE    295    295
…
SQ   SEQUENCE   319 AA;  35138 MW;  E8333CF425FBA4A3 CRC64;
     MAKEIVKELL PLIRVYKDGS VERLLSSENV AASPEDPQTG VSSKDIVIAD NPYVSARIFL
     PKSHHTNNKL PIFLYFHGGA FCVESAFSFF VHRYLNILAS EANIIAISVD FRLLPHHPIP
     AAYEDGWTTL KWIASHANNT NTTNPEPWLL NHADFTKVYV GGETSGANIA HNLLLRAGNE
     SLPGDLKILG GLLCCPFFWG SKPIGSEAVE GHEQSLAMKV WNFACPDAPG GIDNPWINPC
     VPGAPSLATL ACSKLLVTIT GKDEFRDRDI LYHHTVEQSG WQGELQLFDA GDEEHAFQLF
     KPETHLAKAM IKRLASFLV
//

[Slide labels overlaid on the entry: Protein databases; BIOLOGICAL KNOWLEDGE; annotation; protein sequence; Text search; Training datasets; Statistics; annotation (Features); System biology; BLAST; Phylogeny; Domains; …]

Annotation: where does it come from?

• Annotation is the process of assigning biological information to DNA or protein sequences

• Information (e.g. protein function and subcellular location) may come from
  - publications (experimental data)
  - sequence similarity (…quest for orthologs)
  - computational analysis (prediction)

• Computational analysis can be manually checked (by the ‘biocurators’) or not

• Examples: UniProtKB and Gene Ontology annotation

Protein sequence: where does it come from?

• > 180 billion ‘different’ proteins on earth (∑ N x M )

• ~ 74 million ‘known and public’ protein sequences in 2017

• About 98% of the protein sequences are derived from the translation of nucleotide sequences (mRNA or DNA/genome)

• About 1% come from direct protein sequencing (Edman, MS/MS…)
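Since almost all protein sequences are obtained by translating a coding sequence, a minimal sketch of that translation step may help. It assumes the standard genetic code (NCBI translation table 1) and a complete CDS that starts at the first codon; the function name and the toy DNA string are illustrative and do not come from any database.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in the order
# TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # 'X' for ambiguous codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

The quality of the resulting protein sequence depends entirely on the CDS coordinates and reading frame given as input, which is the point made in the following slides.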

The ideal life of a sequence …

[Diagram: RNA, genes, … are submitted (http://www.ncbi.nlm.nih.gov/genbank/submit) to the nucleic acid sequence databases EMBL/GenBank/DDBJ (http://www.insdc.org/), from which the protein sequence databases are derived.]

RefSeq

• The Reference Sequence (RefSeq) collection provides a non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

RefSeq: NP_000790.2, NM_000799.2

• Contains protein sequences derived from gene prediction, not submitted to EMBL-Bank/GenBank/DDBJ

Ensembl

• Creates, integrates and distributes reference datasets and analysis tools that enable genomics.
• Joint project between EMBL-EBI and the Sanger Centre

Ensembl: ENST00000252723; ENSP00000252723; ENSG00000130427

• Contains protein sequences derived from gene prediction, not submitted to EMBL-Bank/GenBank/DDBJ

• EMBL (ENA, European Nucleotide Archive): EBI (UK)
• GenBank: NCBI (US)
• DDBJ: Japan

Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.

DNA sequence of the human EPO gene

[Figure: the entry in EMBL/ENA text format (EBI server), with callouts for the accession number, the taxonomy, the references (the submitters), the cross-references, the CDS (CoDing Sequence, proposed by the submitters) and the DNA sequence.]

EMBL/GenBank/DDBJ

• Archive: nothing goes out -> highly redundant!

• Most annotations are done by the submitters: quality and completeness are heterogeneous

• Archive: all submitted information remains there and is not updated (exception: Third Party Annotation (TPA))

• Many errors: in sequences, in annotations, in CDS attribution; no consistency of annotations

EMBL/GenBank/DDBJ and annotation

“Beyond limited editorial control and some internal integrity checks (for example, proper use of INSD formats and translation of coding regions specified in CDS entries are verified), the quality and accuracy of the record are the responsibility of the submitting author, not of the database. The databases will work with submitters and users of the database to achieve the best quality resource possible.”

http://www.insdc.org/policy

EMBL/GenBank/DDBJ and annotation

• Many scientists assume that GenBank annotation is kept up to date, and they are surprised to hear that it is not.
• The annotation has remained static: a gene labeled 'hypothetical protein' a few years ago might now have a known function.
• Gene naming is erroneous and inconsistent.
• A name is transferred from one gene to another on the basis of sequence similarity (usually from a BLAST search). As more genomes are annotated, and more BLAST searches are run, the original source of the name quickly becomes lost.
• Scientists should fix errors that they find. But this would quickly destroy the archival function of GenBank, as original entries would be erased over time.

(PMID: 17274839)

Information provided by the submitter of a nucleotide entry…

DR EMBL; DQ339047; ABC68418.1; -; mRNA.

FT   source          1..1397
FT                   /organism="Rattus norvegicus"
FT                   /strain="Sprague-Dawley"
FT                   /mol_type="mRNA"
FT                   /sex="female"
FT                   /tissue_type="ovary"
FT                   /db_xref="taxon:10116"
FT   CDS             70..1329
FT                   /codon_start=1
FT                   /product="testis derived transcript"
FT                   /note="TES"
FT                   /db_xref="GOA:Q2LAP6"

EMBL/GenBank/DDBJ: DNA sequence & CoDing Sequences (CDS)

Coding sequence (CDS) annotation

CDS: CoDing Sequence (provided by submitters)

(Slide: J. McDowall)

CDS annotation provided by the submitters

The first Met !

CDS translation provided by EMBL/GenBank
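As a worked example of how a CDS feature determines the translated protein, the sketch below computes the expected protein length from a simple feature location such as the 70..1329 CDS of the rat mRNA entry shown earlier. It assumes a complete CDS with a plain start..end location (no join(), no partial markers) whose last codon is the stop codon; the function name is hypothetical.

```python
def cds_protein_length(location: str, codon_start: int = 1) -> int:
    """Expected number of amino acids encoded by a simple 'start..end' CDS location."""
    start, end = (int(position) for position in location.split(".."))
    # Positions are 1-based and inclusive; codon_start > 1 skips leading bases.
    coding_nucleotides = (end - start + 1) - (codon_start - 1)
    codons = coding_nucleotides // 3
    return codons - 1  # assumption: the last codon is the stop codon


print(cds_protein_length("70..1329"))  # (1329 - 70 + 1) / 3 = 420 codons -> 419
```

Choosing a different start position (for example another in-frame Met) changes this length, which is why the choice of "the first Met" by the submitter matters.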

This protein sequence is integrated in the protein sequence databases

Human EPO gene: cross-references from UniProtKB to EMBL/GenBank/DDBJ

UCSC genome browser: human EPO mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)

[Figure: genome browser view of the contig, 5’ to 3’.]

CDS & protein sequence accuracy

Coding sequence (CDS) annotation (slide: J. McDowall)

UCSC genome browser: another gene…

mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)

How to deal with these sequences: see later…

The hectic life of a protein sequence …

[Diagram, built up over two slides:]

• cDNAs, ESTs, genomes, …: some data are not submitted to public databases, or their submission is delayed or cancelled.
• Nucleic acid databases (EMBL, GenBank, DDBJ): a protein sequence is derived directly only if the submitters provide an annotated Coding Sequence (CDS). With no CDS, the protein sequence comes from gene prediction (Ensembl, RefSeq and others).
• Scientific publications are a further source, next to EMBL/GenBank/DDBJ-derived sequences.
• Protein sequence databases, built from CDS provided by submitters, or from CDS provided by submitters and gene prediction: TrEMBL, GenPept, RefSeq, PRF, UniProtKB, Ensembl, TPA, CCDS, Swiss-Prot, neXtProt, (IPI), UniParc, (PIR), PDB, plus all ‘species’-specific databases (EcoGene, TAIR, …).

Major ‘general’ protein ‘sources’

UniProtKB: Swiss-Prot + TrEMBL
(integrated resources; TPA, PIR, PDB and PRF as ‘cross-references’)

NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA
(resources are kept separated)

not complete !!! (only entries created before 2007 ?)

• UniProtKB/Swiss-Prot: manually annotated protein sequences (13’000 species)
• UniProtKB/TrEMBL: submitted CDS (EMBL) + Ensembl prediction + automated annotation; non-redundant with Swiss-Prot (~710’000 species)
• GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (~700’000 species)
• PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
• PDB: Protein Data Bank; 3D data and associated sequences
• PRF: Protein Research Foundation; journal scan of ‘published’ peptide sequences
• RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (66’000 species)
• TPA: Third Party Annotation

UniProt consortium

• EBI: European Bioinformatics Institute (UK)
• SIB: Swiss Institute of Bioinformatics (CH)
• PIR: Protein Information Resource (USA)

www.uniprot.org

~74 million protein entries derived from ~710’000 different species

- 10 million connections/year
- 500’000 unique visitors/year

Resources available: UniProt databases

Content of a UniProtKB record

(1) Protein sequence(s)

canonical & isoforms (see later)

Origin of protein sequences

UniProtKB protein sequences are mainly derived from:

- INSDC (EMBL/ENA, GenBank, DDBJ): translated submitted coding sequences (CDS) (95.1 %)
- Ensembl (gene prediction; 3.2 %)
- RefSeq sequences (0.3 %)
- Sequences of PDB
- Direct submission or sequences scanned from literature (includes direct protein sequencing)

UniProt does not perform any gene prediction; predicted sequences are integrated from Ensembl.

(2) Annotation: biological knowledge, in different sections

(3) Sequence annotation (features (FT))

(4) Annotation score & evidence for protein existence

Status:
• Reviewed / Unreviewed
• Annotation score: http://insideuniprot.blogspot.ch/2014/10/introducing-annotation-scores-in-uniprot.html
• Evidence for protein existence

(5) Help sections

• Context sensitive
• Feedback (update)

• Blog: http://insideuniprot.blogspot.ch/
• YouTube: https://www.youtube.com/user/uniprotvideos
• Query Help (FAQs, user manual, …)

UniProtKB is composed of 2 sections

UniProtKB/Swiss-Prot Reviewed - Manually annotated Records with information extracted from literature and curator-evaluated computational analysis.

UniProtKB/TrEMBL Unreviewed – Automatically annotated Records that await full manual annotation.

released every 4 weeks

Source of annotation/Evidence statements

Experimental data (publication)

Computational analysis (curator-evaluated or not)

Example: UniProtKB P51787 (KCNQ1_HUMAN)

UniProtKB/Swiss-Prot: manual insertion, shown in yellow

UniProtKB/TrEMBL: automated insertion, shown in blue

Publication: {ECO:0000269|PubMed:10476968} (text format)
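In the text format, such an evidence statement is a brace-delimited tag that combines an ECO code with an optional source. The sketch below is one possible way to split these tags into their parts; the function name and the returned dictionary keys are illustrative, not part of any UniProt tool.

```python
import re

# An evidence tag looks like {ECO:0000269|PubMed:10476968}; several
# comma-separated items may share one pair of braces.
EVIDENCE_TAG = re.compile(r"\{([^{}]+)\}")


def parse_evidence(text: str) -> list:
    """Return one dict per evidence item found in a line of annotation text."""
    items = []
    for block in EVIDENCE_TAG.findall(text):
        for item in block.split(","):
            eco_code, _, source = item.strip().partition("|")
            database, _, identifier = source.partition(":")
            items.append({
                "eco": eco_code,               # e.g. ECO:0000269
                "database": database or None,  # e.g. PubMed (None if no source given)
                "id": identifier or None,      # e.g. 10476968
            })
    return items


print(parse_evidence("{ECO:0000269|PubMed:10476968}"))
```

ECO:0000269 denotes experimental evidence used in a manual assertion, so filtering on the `eco` field is a simple way to separate experimental from inferred annotation.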

ECO code (ontology)

UniProtKB/Swiss-Prot

2 % of UniProtKB protein sequences

UniProtKB/Swiss-Prot Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)

UniProtKB/Swiss-Prot

Entry vs Protein sequence(s)

One entry – one gene – one species

One or several protein sequences (isoforms) per entry

canonical & isoform

[Figure: cross-references section of a UniProtKB entry (http://www.uniprot.org/uniprot/P54710#cross_references). Some nucleotide entries are used to construct the UniProtKB canonical sequence; others are automatically mapped to the UniProtKB record.]

UCSC genome browser: another gene

mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ), manually checked (comparison with orthologs, etc.) and labelled ‘incorrect’ or ‘correct’

The ‘incorrect’ sequence is found at NCBI nr (GenPept) (without warning)

UniProtKB/Swiss-Prot: canonical & isoform sequences

553’474 ‘canonical’

+ 39’441 ‘isoforms’ (7 %)

http://web.expasy.org/docs/relnotes/relstat.html

Beware

The isoform sequences are not included in all datasets.

Examples:

- Complete proteome -> download FASTA (canonical & isoform)
- BLAST @ NCBI (NCBI nr)
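Because isoform sequences are present in some downloads and absent from others, it can be worth checking a FASTA file before using it. The sketch below counts canonical and isoform records, assuming UniProtKB-style headers (`>sp|ACCESSION|ENTRY_NAME ...`) in which isoform accessions carry a numeric suffix such as `P04637-2`; the function name is illustrative.

```python
import re

# UniProtKB FASTA headers start with >sp| (Swiss-Prot) or >tr| (TrEMBL),
# followed by the accession and the entry name.
HEADER = re.compile(r"^>(?:sp|tr)\|([^|]+)\|")
ISOFORM_SUFFIX = re.compile(r"-\d+$")


def count_canonical_and_isoforms(fasta_text: str) -> dict:
    """Count canonical and isoform records in UniProtKB-style FASTA text."""
    counts = {"canonical": 0, "isoform": 0}
    for line in fasta_text.splitlines():
        match = HEADER.match(line)
        if match is None:
            continue  # sequence line, or a header in another format
        kind = "isoform" if ISOFORM_SUFFIX.search(match.group(1)) else "canonical"
        counts[kind] += 1
    return counts
```

A result with zero isoforms for an organism known to have alternative products suggests that the dataset contains canonical sequences only.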

Source of annotation/Evidence statements

• Selected Publication (experimental)

• Another UniProtKB entry (orthologs): by similarity

• An entry from another database: imported

• Curator-evaluated computational analysis* UniRule

• Combined sources

* more details later…

Comprehensive and computer-friendly representation of biological knowledge

Example: UniProtKB Q9FYL3, PubMed=16595657 (www.uniprot.org)

Controlled vocabulary

Sequence annotation -> Feature viewer

http://insideuniprot.blogspot.ch/2016/02/introducing-uniprot-feature-viewer.html

Protein sequence analysis: in-house resource ‘Anabelle’

Annotation: controlled vocabulary (CV)

• Keywords
• Cellular component
• Disease
• Taxonomy
• …
• Gene Ontology

Annotation (CV): Keywords
http://www.uniprot.org/help/keywords

Annotation (CV): Cellular component

Annotation (CV): Gene Ontology
www.uniprot.org/uniprot/P01588
http://www.uniprot.org/help/keywords_vs_go

Keyword to GO

UniProtKB/TrEMBL

98 % of UniProtKB protein sequences

UniProtKB/TrEMBL: automated annotation

Protein sequence
- The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl).
- 100% identical sequences (same length, same organism) are merged automatically.

Biological information: sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation: SAAS, UniRule, SAM, InterPro

UniProtKB/TrEMBL: entry vs protein sequence(s)

One sequence – one gene – one species

One protein sequence per entry

Source of annotation: computational analysis

http://biofunctionprediction.org/cafa/

Automated annotation

• Automatically generated rules (SAAS): generates a set of decision trees using data mining (new set every UniProtKB release)
• Manually generated rules (UniRule): maintains a set of manual annotation rules; UniRule = PIR* + HAMAP + Rulebase
• Sequence analysis methods (SAM): signal, transmembrane, coils prediction
• InterPro: domains & GO terms

SAAS

SAAS learns from the properties present in the reviewed UniProtKB (Swiss-Prot) entries and uses the following attribute types to define the learning entries: InterPro, taxonomy and sequence length. This combination allows SAAS to generate rules to annotate protein properties such as function, catalytic activity, pathway membership, subcellular location, protein names and feature predictions.
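To make the idea of such a rule concrete, here is a toy sketch of how one already-learned rule could be applied to an unreviewed entry using the three attribute types named above. It does not reproduce SAAS itself (the data mining that generates the decision trees is not shown), and every identifier, threshold and annotation value in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationRule:
    """One condition set -> one annotation (all values here are hypothetical)."""
    required_signatures: frozenset  # InterPro-style identifiers that must all match
    taxon: str                      # must appear in the entry's lineage
    min_length: int
    max_length: int
    annotation: str


def apply_rules(entry: dict, rules: list) -> list:
    """Return the annotations of every rule whose conditions the entry satisfies."""
    annotations = []
    for rule in rules:
        if (rule.required_signatures <= entry["signatures"]
                and rule.taxon in entry["lineage"]
                and rule.min_length <= entry["length"] <= rule.max_length):
            annotations.append(rule.annotation)
    return annotations


# Hypothetical rule and entry, for illustration only.
EXAMPLE_RULE = AnnotationRule(
    frozenset({"IPR_EXAMPLE_1"}), "Bacteria", 200, 400,
    "SUBCELLULAR LOCATION: Cytoplasm",
)
EXAMPLE_ENTRY = {
    "signatures": {"IPR_EXAMPLE_1", "IPR_EXAMPLE_2"},
    "lineage": ["Bacteria", "Proteobacteria"],
    "length": 312,
}
print(apply_rules(EXAMPLE_ENTRY, [EXAMPLE_RULE]))
```

Because the conditions are learned from reviewed entries, an annotation propagated this way is only as reliable as the Swiss-Prot entries it was trained on.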

http://insideuniprot.blogspot.ch/2016/10/automatic-learning-based-annotation-in.html

Query: source:SAAS00629841 (Feb 2017)

Beware: the SAAS number may change if the rules change…

UniRule

Query: source:RU361160 (Feb 2017)

Additional information on UniRule: http://insideuniprot.blogspot.ch/2015/11/unirule-automatic-annotation-system-in.html

SAM

No GO (cellular component) annotation in this case…

http://www.uniprot.org/help/sam

SAM: Transmembrane

Query: annotation:(type:transmem) AND reviewed:no

SAM: Signal peptide

Automated annotation

Some important remarks

[Figure: total number of records in UniProtKB compared with the records carrying automated annotation (without InterPro), e.g. from UniRule.]

~ 51 % of TrEMBL records contain automated annotation

Differences between TrEMBL and Swiss-Prot

                             TrEMBL                              Swiss-Prot
Annotation                   automatic                           manual
Annotation = complete?       Partial annotation                  As complete and systematic
                             (~50 % of the entries)              as possible
Set of sequences = complete? As complete as possible; does not   Complete sets only for
                             contain Swiss-Prot sequences!       a few organisms
Number of entries            73’000’000                          550’000
Number of species            590’000                             13’000

When you compare the biological information of given datasets of proteins, beware of the ratio of TrEMBL vs Swiss-Prot entries in your dataset: the results might not be only ‘biological’!

Set of mouse proteins with N-glycosylation - March 2017

[Figure: percentages shown on the slide: 21.7 %, 0.05 % and 7.2 %.]

Queries:
annotation:(type:carbohyd "n linked glcnac ellipsis") AND organism:"Mus musculus (Mouse) [10090]"
organism:"Mus musculus (Mouse) [10090]"

The UniProt web site

www.uniprot.org

• Powerful search engine, Google-like and easy to use, which also supports very targeted field searches

• Scoring mechanism presenting relevant matches first

• Entry views, search result views and downloads are customizable

• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access

• Search, Blast, Align, Retrieve/ID mapping

Search

A very powerful text search tool with autocompletion and refinement options, for finding UniProt entries and documentation by biological information

Example: find all human proteins located in the nucleus

The search interface guides users with helpful suggestions and hints

http://www.uniprot.org/uniprot/?query=human+nucleus&sort=score

http://www.uniprot.org/uniprot/?query=organism%3Ahuman+AND+location%3Anucleus&sort=score

Advanced Search

A very powerful search tool

To be used when you know in which entry section the information is stored

Example: find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)

Result pages: highly customizable

Add columns (also available for BLAST)

From: http://www.uniprot.org/uniprot/?query=gene:epo%20organism:%22Homo%20sapiens%20%28Human%29%20[9606]%22&fil=&sort=score

To: http://www.uniprot.org/uniprot/?query=gene:epo organism:"Homo sapiens (Human) [9606]"&sort=score&columns=id,entry name,reviewed,protein names,genes,organism,length

http://insideuniprot.blogspot.ch/2015/03/customise-and-share-your-search-results.html
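Since the URL of a result page reflects the query, such URLs can also be assembled in a script. The sketch below only builds the URL string, using the base address and the `query`, `sort` and `columns` parameters seen in the slide URLs; it performs no network request, the function name is illustrative, and the current UniProt web site may use a different URL scheme than the one shown in these 2017 slides.

```python
from urllib.parse import urlencode

# Base address as it appears in the slide URLs (2017); an assumption for today.
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_search_url(query: str, sort: str = "score", columns=None) -> str:
    """Assemble a search URL from a query string and optional result columns."""
    params = {"query": query, "sort": sort}
    if columns:
        params["columns"] = ",".join(columns)
    # urlencode percent-encodes ':' and ',' and turns spaces into '+'.
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("organism:human AND location:nucleus"))
print(build_search_url(
    'gene:epo organism:"Homo sapiens (Human) [9606]"',
    columns=["id", "entry name", "reviewed", "length"],
))
```

The first call reproduces the field-search URL for human nuclear proteins shown on the search slide.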

Download

Different formats (FASTA, txt, Excel, RDF, etc.)

Highlight sequence annotation in alignment (BLAST or multiple alignment)

UniProtKB entry formats

• ‘Web site’ format
• FASTA format
• Text format
• RDF format: http://sparql.uniprot.org/

http://insideuniprot.blogspot.ch/2014/08/have-you-tried-uniprot-rdf.html

Contact us !

More during practicals…

UniProt: other databases

UniParc: protein sequence archive (EMBL equivalent at the protein level)

Each entry contains a protein sequence, taxonomic information, cross-links to other databases where you find the sequence (active or not)

No annotation

All the public patented sequences are stored in UniParc (EPO, USPTO, JPO)

You can: query, BLAST, download

~110,000,000 entries

Example: http://www.uniprot.org/uniparc/UPI0000033477

Remarks:

- UniProt does not perform any gene prediction; predicted sequences are integrated from Ensembl.

- Most non-germline immunoglobulins and T-cell receptors, most patent sequences, highly over-represented data (e.g. viral antigens) and pseudogene sequences are excluded from UniProtKB, but stored in UniParc.

UniProt databases: UniRef

Three sets of clusters of protein sequences, at 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST)

You can: query, BLAST, download

UniProt databases: Proteomes

• A proteome is the entire set of proteins expressed by a specific organism whose genome has been completely sequenced.
• It normally includes sequences that derive from extra-chromosomal elements such as plasmids or organellar genomes.
• Some proteomes may also include protein sequences based on high quality cDNAs that cannot be mapped to the current genome assembly due to sequencing errors or gaps.
• UniProt proteomes may include both manually reviewed (UniProtKB/Swiss-Prot) and unreviewed (UniProtKB/TrEMBL) entries.

• A proteome is formed from all UniProtKB/Swiss-Prot entries (irrespective of whether they map to Ensembl or Ensembl Genomes) plus those UniProtKB/TrEMBL entries mapping to Ensembl or Ensembl Genomes for that proteome.

• There may be several proteomes per taxonomic identifier -> Reference proteome

…for users that prefer to use a single best-annotated proteome from a particular taxonomic group…

To include isoform sequences

Around UniProtKB

neXtProt: a ‘one stop shop’ for human proteins

o To visualize all the integrated information

o To extract/export relevant annotations

o To perform complex and precise queries

Data quality assessment

neXtProt applies a three-tiered data grading system for data quality:

• Gold: very high confidence level (< 1% error)
• Silver: greater than 95% confidence
• Bronze: any data with less than 95% confidence is assigned "bronze" quality and not integrated into neXtProt
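The three tiers can be read as simple thresholds on a confidence value. The sketch below encodes them that way, purely as an illustration: how neXtProt actually estimates confidence is not described here, the function name is hypothetical, and the treatment of a value of exactly 95% is an assumption.

```python
def quality_grade(confidence: float) -> str:
    """Map a confidence value between 0 and 1 to the three-tier grading."""
    if confidence > 0.99:   # gold: less than 1% error
        return "gold"
    if confidence > 0.95:   # silver: greater than 95% confidence
        return "silver"
    return "bronze"         # assumption: exactly 95% is treated as bronze


def is_integrated(confidence: float) -> bool:
    """Bronze data are not integrated into neXtProt."""
    return quality_grade(confidence) != "bronze"
```

For example, `quality_grade(0.97)` gives "silver", and `is_integrated(0.90)` is False.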

Each protein has a page with multiple aspects (views)

Sequence view

Medical view

Phenotype view: Phenotypes

Phenotype view: Variants associated with phenotypes

Examples of complex queries

• Proteins which are located in mitochondrion and have at least one HPA antibody and exist in at least one proteome identification set

• Proteins that interact with viral proteins

• Proteins which are targets of drugs for cardiac therapy

• Proteins located on and having at least one variant on a phosphorylated tyrosine

SPARQL search
• Syntax not the most user friendly
• Examples available to help users design queries

Advanced search: results

Around UniProt (2)
• ViralZone: http://viralzone.expasy.org/

Around UniProt (3)
• VenomZone: http://venomzone.expasy.org/

[email protected]

http://insideuniprot.blogspot.ch/2016/09/how-can-you-increase-impact-of-your.html

NCBI protein databases (Entrez protein, NCBI nr)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Major ‘general’ protein sequence database ‘sources’

UniProtKB: Swiss-Prot + TrEMBL
(integrated resources; TPA, PIR, PDB and PRF as ‘cross-references’)

NCBI-nr: Swiss-Prot + TrEMBL + GenPept + PIR + PDB + PRF + RefSeq + TPA
(resources are kept separated)

not complete !!! (only entries created before 2007 ?)

• UniProtKB/Swiss-Prot: manually annotated protein sequences (13’000 species)
• UniProtKB/TrEMBL: submitted CDS (EMBL) + Ensembl prediction + automated annotation; non-redundant with Swiss-Prot (~590’000 species)
• GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (~450’000 species)
• PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
• PDB: Protein Data Bank; 3D data and associated sequences
• PRF: Protein Research Foundation; journal scan of ‘published’ peptide sequences
• RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (54’000 species)
• TPA: Third Party Annotation

NCBI nr - Entrez ‘protein’

http://www.ncbi.nlm.nih.gov/protein/

NCBI-nr
• GenPept (source: GenBank; translated CDS)
• RefSeq
• TPA (Third Party Annotation)

• Swiss-Prot (does not include isoform sequences)
• PIR (not updated since 2003)
• PRF (journal scan of ‘published’ peptide sequences)
• PDB (3D structure)
• TrEMBL (some entries…)

GenPept

Translation from annotated CDS in GenBank Contains all translated CDS annotated in GenBank/EMBL/DDBJ sequences

- equivalent to UniProtKB/TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR….)

GenPept: ‘translations from all annotated coding regions (CDS) in GenBank’

Annotation according to the submitter

No GO term!

RefSeq

Produced by NCBI and NLM

http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf

FAQ: http://www.ncbi.nlm.nih.gov/books/NBK50679/

http://www.ncbi.nlm.nih.gov/refseq/

RefSeq: The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

One protein sequence -> one entry

Accession (AC) prefixes: protein: NP_; mRNA: NM_; DNA: NC_

[Figure: a RefSeq protein entry, with callouts for the corresponding mRNA, the taxonomy, the references, the status and GenBank source, the annotation (automated, derived from Swiss-Prot, in-house; no GO annotation), the cross-references and the sequence.]

RefSeq status and manual annotation

Status                            Manual annotation
GENOME ANNOTATION                 No
INFERRED                          No
MODEL                             No
PREDICTED                         No
PROVISIONAL                       No
REVIEWED                          Yes (sequence + functional information and features)
VALIDATED                         Yes (initial sequence)
Whole Genome Sequencing (WGS)     No

http://www.ncbi.nlm.nih.gov/RefSeq/

UniProtKB/Swiss-Prot: one gene -> one entry (9 isoforms)
RefSeq: one protein sequence -> one entry

Different datasets

Example: human proteome, ~20’200 genes

Query for organism:’homo sapiens’ (Nov 2013)
• UniProtKB: 134’679 entries + alt sequences (19’425) = 154’104
• UniProtKB/Swiss-Prot: 20’278 entries + alt sequences (19’425) = 39’703
• UniProtKB/TrEMBL: 114’401 entries
• UniParc: 1’082’249 entries
• RefSeq: 32’898 sequences
• Ensembl: 104’488 peptide sequences

Query for ‘homo sapiens’ + Complete proteome (KW-181)
• UniProtKB: 56’392 + alt sequences (15’435) = 71’827
• UniProtKB/Swiss-Prot: 20’272 + alt sequences (19’425) = 39’697
• UniProtKB/TrEMBL: 48’774

• neXtProt: 20’140 entries; with all isoforms, 20’140 + 19’425 = 39’565

Query: organism:"Homo sapiens (Human) [9606]"

[Figure: the same query at neXtProt and RefSeq.]

NCBI nr query & BLAST

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Example: look for human FOXP2 @ NCBI nr

BLAST @ NCBI: UniProtKB/Swiss-Prot alternative isoform sequences are not included!

NCBI nr: example of ‘cluster’

NCBI-nr clusters: identical proteins (100%) derived from the same organism

UniRefs provide clustered sets of sequences at several resolutions (100%, 90% and 50%) for all the organisms.
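The difference between the two kinds of 100% cluster can be sketched as a choice of grouping key: sequence plus organism for the NCBI-nr style described above, sequence alone for a cross-organism cluster. This is only an illustration of exact-identity grouping on toy records; the function name and data are hypothetical, and clustering at 90% or 50% identity requires sequence comparison that is not shown here.

```python
from collections import defaultdict


def cluster_identical(records, per_organism: bool):
    """Group (accession, organism, sequence) records that share an identical sequence.

    per_organism=True  -> identical sequences from the same organism only
    per_organism=False -> identical sequences regardless of organism
    """
    clusters = defaultdict(list)
    for accession, organism, sequence in records:
        key = (sequence, organism) if per_organism else sequence
        clusters[key].append(accession)
    return sorted(sorted(members) for members in clusters.values())


# Toy records (hypothetical accessions and sequences).
TOY_RECORDS = [
    ("A1", "human", "MKTAYIAK"),
    ("B2", "human", "MKTAYIAK"),
    ("C3", "mouse", "MKTAYIAK"),
    ("D4", "mouse", "MKTAYLAK"),
]
print(cluster_identical(TOY_RECORDS, per_organism=True))   # [['A1', 'B2'], ['C3'], ['D4']]
print(cluster_identical(TOY_RECORDS, per_organism=False))  # [['A1', 'B2', 'C3'], ['D4']]
```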

UniProtKB entries at NCBI…

A UniProtKB/Swiss-Prot entry with the NCBI look

Important remarks concerning the datasets

Different servers…

UniProtKB/TrEMBL entries are not available at NCBI nr, with some exceptions. The same protein sequence might be present, but not under its UniProtKB/TrEMBL AC. (This is not the case for UniProtKB/Swiss-Prot entries.)

ID/AC mapping: accession number (AC) mapping

These identifiers all point to a TP53 (p53) protein sequence!

P04637, NP_000537, NP_001119584.1, NP_001119585.1, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, etc.

http://www.uniprot.org/uploadlists/ http://www.ebi.ac.uk/Tools/picr/
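Before submitting a mixed list to a mapping tool, it helps to know which database each identifier comes from. The sketch below guesses the source from the identifier's shape. The UniProtKB pattern follows the published accession number format; the other patterns are simplified assumptions that cover the examples above but not every identifier these resources issue.

```python
import re

# (label, pattern) pairs, tried in order; only the UniProtKB pattern follows a
# documented format, the others are simplified for illustration.
ID_PATTERNS = [
    ("UniProtKB accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("RefSeq protein", re.compile(r"[A-Z]P_\d+(?:\.\d+)?")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+(?:\.\d+)?")),
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("IPI", re.compile(r"IPI\d{8}")),
]


def classify_identifier(identifier: str) -> str:
    """Return the label of the first pattern matching the whole identifier."""
    for label, pattern in ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


def group_identifiers(identifiers) -> dict:
    """Remove duplicates (keeping order) and group identifiers by guessed source."""
    groups = {}
    for identifier in dict.fromkeys(identifiers):
        groups.setdefault(classify_identifier(identifier), []).append(identifier)
    return groups


TP53_IDS = ["P04637", "NP_000537", "NP_001119584.1", "NP_001119585.1",
            "NP_001119584.1", "ENSG00000141510", "CCDS11118",
            "UPI000002ED67", "IPI00025087"]
print(group_identifiers(TP53_IDS))
```

Grouping first makes it easier to choose the right "from" database in a mapping tool, which usually expects one identifier type per job.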

Understanding protein function is critical to research in many areas of science, such as biology and medicine.

Keeping up with all of this information is a daunting task for most researchers. UniProt helps with this in the following ways:

• it provides an up-to-date, comprehensive body of protein information at a single site;
• it aids scientific discovery by collecting, interpreting and organising this information so that it is easy to access and use;
• it saves researchers countless hours of work in monitoring and collecting this information themselves;
• it provides tools to help with protein sequence analysis;
• it provides links to related information in more than 150 other biological databases to help you access additional information in more specialised collections.

Source: https://www.ebi.ac.uk/training/online/course/uniprot-exploring-protein-sequence-and-functional/why-do-we-need-uniprot

Thank you !

Thanks to Emmanuel Boutet and Ivo Pedruzzi for some of the slides !

Thanks to Diana Marek, Grégoire Rossier and Patricia Palagi for the organisation

All documents (including practicals) are online: http://education.expasy.org/cours/InsideProteinDatabases2017/

Inside Protein databases, SIB Swiss Institute of Bioinformatics

CMU, new building (Building A), 4th floor, room A04 2711

Building A Entrance: Av. de Champel

Building B

Main Entrance: 1 Michel Servet