Protein Sequence
Inside protein databases: bridging sequences and knowledge
WELCOME!
Geneva, 2017
SIB Swiss Institute of Bioinformatics
[email protected] & [email protected] (Swiss-Prot, SIB)
[email protected] (CALIPHO, SIB)

SIB Swiss Institute of Bioinformatics
• 70 groups
• 800 collaborators
• biologists, biochemists, computer scientists, physicists, physicians, chemists, mathematicians, pharmacists, …
Common point: bioinformatics

All the material is available here: http://education.expasy.org/cours/InsideProteinDatabases2017/

Content
• a description of the major protein sequence databases and their sequence annotation pipelines, focusing on UniProtKB/Swiss-Prot
• an introduction to Gene Ontology (GO)
• practical sessions on how to query protein sequence databases, how to perform enrichment analysis on datasets and how to interpret the results of such analyses

Objectives
• know the differences between the major protein sequence databases
• understand the major sequence annotation pipelines and the GO annotation pipelines
• estimate the protein sequence accuracy and the annotation quality

08h30 Protein sequence databases: theory
10h30 COFFEE BREAK
11h00 Controlled vocabularies and standardization resources: theory
12h15 PAUSE
13h30 Protein sequence databases and Gene Ontology: practicals
15h00 COFFEE BREAK
15h30 Analysis tools using ontologies: theory
16h00 Protein sequence databases and Gene Ontology: practicals
17h00 Evaluation / Exam
18h00 END

Menu
• Introduction
• Nucleic acid sequence databases: EMBL, GenBank, DDBJ
• Protein sequence databases: UniProt databases (UniProtKB), NCBI protein databases (NCBInr, RefSeq)
• Practicals…

ID   HIDH_SOYBN              Reviewed;         319 AA.
AC   Q5NUF3;
...
DE   RecName: Full=2-hydroxyisoflavanone dehydratase;
DE            EC=3.1.1.1;
DE            EC=4.2.1.105;
DE   AltName: Full=Carboxylesterase HIDH;
GN   Name=HIDH; OrderedLocusNames=GLYMA01G45020;
OS   Glycine max (Soybean) (Glycine hispida).
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
OC   rosids; fabids; Fabales; Fabaceae; Papilionoideae; Phaseoleae;
OC   Glycine.
OX   NCBI_TaxID=3847;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA], FUNCTION, CATALYTIC ACTIVITY, MUTAGENESIS OF GLY-78;
RP   GLY-79; THR-164; ASP-263 AND HIS-295, AND BIOPHYSICOCHEMICAL PROPERTIES.
RC   TISSUE=Seedling;
RX   PubMed=15734910; DOI=10.1104/pp.104.056747;
RA   Akashi T., Aoki T., Ayabe S.;
RT   "Molecular and biochemical characterization of 2-hydroxyisoflavanone dehydratase.
RT   Involvement of carboxylesterase-like proteins in leguminous isoflavone biosynthesis.";
RL   Plant Physiol. 137:882-891(2005).
...
CC   -!- FUNCTION: Dehydratase that mediates the biosynthesis of
CC       isoflavonoids. Can use both 4'-hydroxylated and 4'-methoxylated 2-
CC       hydroxyisoflavanones as substrates. Has also a slight
CC       carboxylesterase activity toward p-nitrophenyl butyrate.
CC   -!- CATALYTIC ACTIVITY: 2,7,4'-trihydroxyisoflavanone = daidzein +
CC       H(2)O.
...
CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Kinetic parameters:
CC         KM=29 uM for 2,7-dihydroxy-4'-methoxyisoflavanone (at pH 7.5 and
CC         30 degrees Celsius);
...
CC   -!- PATHWAY: Secondary metabolite biosynthesis; flavonoid
CC       biosynthesis.
CC   -!- SIMILARITY: Belongs to the 'GDXG' lipolytic enzyme family.
DR   EMBL; AB154415; BAD80840.1; -; mRNA.
DR   EMBL; BT097440; ACU22699.1; -; mRNA.
DR   EMBL; CM000834; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   RefSeq; NP_001237228.1; NM_001250299.2.
DR   UniGene; Gma.19376; -.
DR   ProteinModelPortal; Q5NUF3; -.
...
DR   Pfam; PF07859; Abhydrolase_3; 1.
DR   PROSITE; PS01173; LIPASE_GDXG_HIS; 1.
PE   1: Evidence at protein level;
KW   Complete proteome; Flavonoid biosynthesis; Hydrolase; Lyase; Reference proteome.
FT   CHAIN         1    319       2-hydroxyisoflavanone dehydratase.
FT                                /FTId=PRO_0000424101.
...
FT   ACT_SITE     77     77       Potential.
FT   ACT_SITE    164    164
FT   ACT_SITE    263    263
FT   ACT_SITE    295    295
…
SQ   SEQUENCE   319 AA;  35138 MW;  E8333CF425FBA4A3 CRC64;
     MAKEIVKELL PLIRVYKDGS VERLLSSENV AASPEDPQTG VSSKDIVIAD NPYVSARIFL
     PKSHHTNNKL PIFLYFHGGA FCVESAFSFF VHRYLNILAS EANIIAISVD FRLLPHHPIP
     AAYEDGWTTL KWIASHANNT NTTNPEPWLL NHADFTKVYV GGETSGANIA HNLLLRAGNE
     SLPGDLKILG GLLCCPFFWG SKPIGSEAVE GHEQSLAMKV WNFACPDAPG GIDNPWINPC
     VPGAPSLATL ACSKLLVTIT GKDEFRDRDI LYHHTVEQSG WQGELQLFDA GDEEHAFQLF
     KPETHLAKAM IKRLASFLV
//

[Slide labels overlaid on the entry: protein sequence, annotation, BIOLOGICAL KNOWLEDGE; Protein databases, Text search, Training datasets, Statistics, Genome annotation (Features), System biology, Proteomics, BLAST, Phylogeny, Domains, …]

Annotation: where does it come from?
Annotation is the process of assigning biological information to DNA or protein sequences.
Information (e.g. protein function and subcellular location) may come from:
• publications (experimental data)
• sequence similarity (…quest for orthologs)
• protein domain
• computational analysis (prediction)
Computational analysis can be manually checked (by the ‘biocurators’) or not.
Examples: UniProtKB and Gene Ontology annotation

Protein sequence: where does it come from?
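One place to look for the origin of a UniProtKB sequence is the entry itself: the DR EMBL lines of the HIDH_SOYBN entry point back to the nucleotide records behind the protein sequence. The sketch below reads those lines in Python. The function name, the field names and the assumed field order (nucleotide accession; protein identifier; status; molecule type) are illustrative guesses based only on the three DR EMBL lines quoted in that entry, not an official parser.

```python
from typing import List, NamedTuple


class EmblXref(NamedTuple):
    """One 'DR   EMBL; ...' cross-reference (field names are illustrative)."""
    nucleotide_acc: str
    protein_id: str
    status: str
    molecule_type: str


def parse_embl_xrefs(entry_text: str) -> List[EmblXref]:
    """Collect the DR EMBL lines of a UniProtKB flat-file entry.

    Assumes five ';'-separated fields ending with a '.', as in the
    HIDH_SOYBN lines shown in this document.
    """
    xrefs = []
    for line in entry_text.splitlines():
        parts = line.strip().split(None, 1)  # line code, then the rest
        if len(parts) != 2 or parts[0] != "DR":
            continue
        fields = [f.strip() for f in parts[1].rstrip(".").split(";")]
        if fields[0] != "EMBL" or len(fields) != 5:
            continue  # other databases (RefSeq, Pfam, ...) are skipped
        xrefs.append(EmblXref(*fields[1:]))
    return xrefs


# Lines copied from the HIDH_SOYBN entry.
example = """\
DR   EMBL; AB154415; BAD80840.1; -; mRNA.
DR   EMBL; BT097440; ACU22699.1; -; mRNA.
DR   EMBL; CM000834; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   RefSeq; NP_001237228.1; NM_001250299.2.
"""

for xref in parse_embl_xrefs(example):
    print(xref.nucleotide_acc, xref.protein_id, xref.molecule_type)
```

In this example two cross-references point to mRNA records with a protein identifier, and one points to genomic DNA where the CDS is flagged NOT_ANNOTATED_CDS, which fits the idea that most protein sequences are translations of nucleotide records.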
Protein sequences origins
• > 180 billion ‘different’ proteins on earth (∑ N species x M genes)
• ~ 74 million ‘known and public’ protein sequences in 2017
• About 98% of the protein sequences are derived from the translation of nucleotide sequences (mRNA or DNA/genome)
• About 1% come from direct protein sequencing (Edman, MS/MS…)

The ideal life of a sequence …
http://www.ncbi.nlm.nih.gov/genbank/submit
[Figure labels: RNA, genes, genomes, … ; Nucleic acid sequence databases; Protein sequence databases]

EMBL/GenBank/DDBJ
http://www.insdc.org/

RefSeq
• The Reference Sequence (RefSeq) collection: provides a non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
  RefSeq: NP_000790.2. NM_000799.2.
• Contains protein sequences derived from gene prediction, not submitted to EMBL-Bank/GenBank/DDBJ

Ensembl
• Creates, integrates and distributes reference datasets and analysis tools that enable genomics.
• Joint project between EMBL-EBI and the Sanger Centre
  Ensembl: ENST00000252723; ENSP00000252723; ENSG00000130427.
• Contains protein sequences derived from gene prediction, not submitted to EMBL-Bank/GenBank/DDBJ

Nucleic acid sequence databases: EMBL, GenBank, DDBJ
• EMBL (ENA), European Nucleotide Archive: EBI (UK)
• GenBank: NCBI (US)
• DDBJ: Japan
Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.

DNA sequence of the human EPO gene
Server EBI; Database EMBL/ENA; text format
[Figure labels: accession number; taxonomy; References - the submitters; Cross-references; CDS CoDing Sequence (proposed by submitters), 5 exons; DNA sequence]

EMBL/GenBank/DDBJ
• Archive: nothing goes out -> highly redundant!
• Most annotations are done by the submitters: heterogeneous quality and completeness
• Archive: all submitted information remains there; not updated (exception: Third Party Annotation (TPA))
• Many errors: in sequences, in annotations, in CDS attribution; no consistency of annotations

EMBL/GenBank/DDBJ and annotation
“Beyond limited editorial control and some internal integrity checks (for example, proper use of INSD formats and translation of coding regions specified in CDS entries are verified), the quality and accuracy of the record are the responsibility of the submitting author, not of the database. The databases will work with submitters and users of the database to achieve the best quality resource possible.”
http://www.insdc.org/policy

EMBL/GenBank/DDBJ and annotation
• Many scientists assume that GenBank annotation is kept up to date, and they are surprised to hear that it is not.
• The annotation has remained static: a gene labeled 'hypothetical protein' a few years ago might now have a known function.
• Gene naming is erroneous and inconsistent.
• A name is transferred from one gene to another on the basis of sequence similarity (usually from a BLAST search). As more genomes are annotated, and more BLAST searches are run, the original source of the name quickly becomes lost.
• Scientists should fix the errors they find, but this would quickly destroy the archival function of GenBank, as original entries would be erased over time.
(PMID: 17274839)

Information provided by the submitter of a nucleotide entry…
DR   EMBL; DQ339047; ABC68418.1; -; mRNA.
FT   source          1..1397
FT                   /organism="Rattus norvegicus"
FT                   /strain="Sprague-Dawley"
FT                   /mol_type="mRNA"
FT                   /sex="female"
FT                   /tissue_type="ovary"
FT                   /db_xref="taxon:10116"
FT   CDS             70..1329
FT                   /codon_start=1
FT                   /product="testis derived transcript"
FT                   /note="TES"
FT                   /db_xref="GOA:Q2LAP6"

EMBL/GenBank/DDBJ
DNA sequence & CoDing Sequences (CDS)

Coding sequence (CDS) annotation
CDS: CoDing Sequence (provided by submitters)
Slide J. McDowall

CDS annotation provided by the submitters
The first Met!

CDS translation provided by EMBL/GenBank
This