Aim and Outline of the courseDb data mining
Db tools can be used to retrieve informa on for a gene or protein
The most important concept is the similarity
Aim and Outline of the courseDb data mining
BLAST
Interpro
Blast2GO
Sequence Alignments
BLAST (Basic Local Alignment Search Tool
• One of the tools of the NCBI - The U.S. National Center for Biotechnology Information.
• Uses word matching like FASTA
• Similarity matching of words (3 AA’s, 11 bases/nucleotides) – does not require identical words.
• If no words are similar, then there is no alignment – won’t find matches for very short sequences 13 Sequence Alignments BLAST word matching
MEAAVKEEISVEDEAVDKNI
MEA EAA AAV Break query AVK VKE into words: KEE EEI EIS ISV ... Break database sequences into words:
14 Sequence Alignments Compare word lists
Query Word List: Database Sequence Words Lists
MEA ? RTT AAQ EAA SDG KSS AAV SRW LLN AVK QEL RWY VKL VKI GKG KEE DKI NIS EEI LFC WDV EIS AAV KVR ISV PFR DEI
Compare word lists by Hashing & allow near matches!
15 Sequence Alignments
Find locations of matching words in all sequences
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT MEA EAA TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY AAV AVK IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH KLV KEE EEI EIS ISV
16 Sequence Alignments Extend hits one base at a time
• Then BLAST extends the matches in both directions, starting at the seed. The un-gapped alignment process extends the initial seed match of length W in each direction in an order to boost the alignment score. Indels are not considered during this stage.
• In the last stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user. 17 Sequence Alignments Extend hits one base at a time
• Then BLAST extends the matches in both directions, starting at the seed. The un-gapped alignment process extends the initial seed match of length W in each direction in an order to boost the alignment score. Indels are not considered during this stage.
Why statistics??
• In the last stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user. 17 Sequence Alignments BLAST : example of result
• Job Title: P14867|GBRA1_HUMAN Gamma-aminobutyric acid... Show Conserved Domains Putative conserved domains have been detected, click on the image below for detailed results. * BLASTP 2.2.18 (Mar-02-2008) protein-protein BLAST
Database: Non-redundant SwissProt sequences 309,621 sequences; 115,465,120 total letters
Query= P14867|GBRA1_HUMAN Gamma-aminobutyric acid receptor subunit alpha-1 - Homo sapiens (Human). Length=456
Sequences producing significant alignments: (Bits) Value
sp|P14867.3|GBRA1_HUMAN Gamma-aminobutyric acid receptor subu... 948 0.0 Gene info sp|Q5R6B2.1|GBRA1_PONPY Gamma-aminobutyric acid receptor subu... 944 0.0 sp|Q4R534.1|GBRA1_MACFA Gamma-aminobutyric acid receptor subu... 944 0.0 sp|P08219.1|GBRA1_BOVIN Gamma-aminobutyric acid receptor subu... 939 0.0 Gene info sp|P62813.1|GBRA1_RAT Gamma-aminobutyric acid receptor subuni... 908 0.0 Gene info sp|P19150.1|GBRA1_CHICK Gamma-aminobutyric acid receptor subu... 882 0.0 Gene info sp|P47869.2|GBRA2_HUMAN Gamma-aminobutyric acid receptor subu... 670 0.0 Gene info sp|P26048.1|GBRA2_MOUSE Gamma-aminobutyric acid receptor subu... 669 0.0 Gene info sp|P23576.1|GBRA2_RAT Gamma-aminobutyric acid receptor subuni... 669 0.0 Gene info sp|P10063.1|GBRA2_BOVIN Gamma-aminobutyric acid receptor subu... 667 0.0 Gene info • sp|Q08E50.1|GBRA5_BOVIN Gamma-aminobutyric acid receptor subu... 641 0.0 Gene info sp|Q8BHJ7.1|GBRA5_MOUSE Gamma-aminobutyric acid receptor subu... 640 0.0 Gene info sp|P31644.1|GBRA5_HUMAN Gamma-aminobutyric acid receptor subu... 638 0.0 Gene info sp|P19969.1|GBRA5_RAT Gamma-aminobutyric acid receptor subuni... 636 0.0 Gene info sp|P34903.1|GBRA3_HUMAN Gamma-aminobutyric acid receptor subu... 632 0.0 Gene info sp|P26049.1|GBRA3_MOUSE Gamma-aminobutyric acid receptor subu... 630 6e-180 Gene info sp|P10064.1|GBRA3_BOVIN Gamma-aminobutyric acid receptor subu... 628 2e-179 Gene info sp|P20236.1|GBRA3_RAT Gamma-aminobutyric acid receptor subuni... 627 3e-179 Gene info sp|P30191.1|GBRA6_RAT Gamma-aminobutyric acid receptor subuni... 520 6e-147 Gene info sp|P16305.2|GBRA6_MOUSE Gamma-aminobutyric acid receptor subu... 518 2e-146 Gene info sp|Q90845.1|GBRA6_CHICK Gamma-aminobutyric acid receptor subu... 518 3e-146 Gene info
18 Sequence Alignments BLAST is approximate but fast
• BLAST makes similarity searches very quickly, but also makes errors – misses some important similarities – makes many incorrect matches
• The NCBI BLAST web server lets you compare your query sequence to various sequences stored in the GenBank;
• This is a VERY fast and powerful computer.
• The speed and relatively good accuracy of BLAST are the key why the tool is the most popular bioinformatics search tool.
19
InterPro Database Protein Functional Analysis
Jennifer McDowall, Ph.D. Senior InterPro Curator EBI Sequence Databases
nucleotide protein sequence sequence
translate EMBL UniProtKB >7M TrEMBL
(GenBank, manual DDBJ) annotation
UniProtKB CGCGCCTGTACGCT >400,000 GAACGCTCGTGAC GTGTAGTGCGCG Swiss-Prot CGCGCCTGTACGCT GAACGCTCGTGAC GTGTAGTGCGCG EBI Sequence Databases nucleotide protein protein sequence sequence annotation
translate EMBL UniProtKB TrEMBL InterPro (GenBank, manual DDBJ) annotation Protein signatures UniProtKB groups of CGCGCCTGTACGCT related proteins GAACGCTCGTGAC GTGTAGTGCGCG Swiss-Prot CGCGCCTGTACGCT (same family or GAACGCTCGTGAC GTGTAGTGCGCG share domains) InterPro ~80% Protein Coverage
UniProt/ UniProt/ UniMESS SwissProt TrEMBL Metagenomic proteins proteins proteins
UniProtKB ~400,000 >7M >6M
Signature matches
Available InterPro ~370,000 >5.3M 2009 What are protein signatures?
Multiple sequence alignment
• A signature describes the pa ern of a set of conserved residues in a group of proteins Ø Define a protein family Ø Define a protein feature (domain or conserved site) What value are signatures?
• More sensi ve homology searches Ø Find more distant homologues than BLAST What value are signatures?
• More sensi ve homology searches
• Classifica on of proteins Ø Associate proteins that share: Func on Domains Sequence Structure What value are signatures?
• More sensi ve homology searches
• Classifica on of proteins
• Annota on of protein sequences Ø Define conserved regions of a protein - e.g. loca on and type of domains key structural or func onal sites What value are signatures?
• More sensi ve homology searches
• Classifica on of proteins
• Annota on of protein sequences
• Transfer addi onal (automa c) annota on Ø Associate TrEMBL proteins with well-annotated SwissProt proteins
Transfer annota on Signature methods
• Pattern
• Fingerprint
• Sequence clustering
• HMM
• SAM Patterns
Pattern/motif in sequence à regular expression Can define important sites
EXAMPLE: Insulin Enzyme catalytic site B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxxProsthetic group attachment
A chain xxxxxCCxxxCxxxxxxxxCxMetal ion binding site | | Cysteines for disulphide bonds Protein or molecule binding Patterns
Pattern/motif in sequence à regular expression Can define important sites
EXAMPLE: PS00262 Insulin family signature B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx | |
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVE ALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGA GSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN Patterns
Pattern/motif in sequence à regular expression Can define important sites
EXAMPLE: PS00262 Insulin family signature B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx | |
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVE ALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGA GSLQPLALEGSLQKRGIVEQ CCTSICSLYQLENYC N Patterns
Pattern/motif in sequence à regular expression Can define important sites
EXAMPLE: PS00262 Insulin family signature B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx | |
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVE ALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGA GSLQPLALEGSLQKRGIVEQ CCTSICSLYQLENYC N Regular expression C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C Patterns – understanding a regular expression
Strictly conserved site; only x denotes any amino acid one amino acid is accepted can occur at a single at this posi on posi on
C - C - {P} - x(2) - C - [STDNEKPI] - x(3) - [LIVMFS] - x(3) - C
There are dashes between Curly brackets denote each posi on amino acids that cannot occur at a single posi on Patterns – understanding a regular expression
X(2) – therefore any amino acid can occur at the next two posi on
C - C - {P} - x(2) - C - [STDNEKPI] - x(3) - [LIVMFS] - x(3) - C
Square brackets denote range of amino acids that occur at a single posi on Patterns
Sequence alignment
Insulin family Define pattern motif
xxxxxx xxxxxx Extract pattern sequences xxxxxx xxxxxx
Build regular C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C expression
Pattern signature
PS00000 Fingerprints
Several motifs à characterise family
Identify small conserved regions in divergent proteins
Different combinations of motifs describe subfamilies
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKS IMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE Fingerprints
Several motifs à characterise family
Identify small conserved regions in divergent proteins
Different combinations of motifs describe subfamilies
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAET GIHARPATLLVQTASKF NSDINLEYKGKSVNLK SIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
His phosphorylation site Fingerprints
Several motifs à characterise family
Identify small conserved regions in divergent proteins
Different combinations of motifs describe subfamilies
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAET GIHARPATLLVQTASKF NSDINLEY KGKSVNLK SIMGVMSL GVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
His phosphorylation site
Ser phosphorylation site Fingerprints
Several motifs à characterise family
Identify small conserved regions in divergent proteins
Different combinations of motifs describe subfamilies
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAET GIHARPATLLVQTASK FNSDINLEY KGKSVNLK SIMGVMSL GVGQGSDVTITVDGADE AEGMAAIVETLQKEGLAE
His phosphorylation site
Conserved site Ser phosphorylation site Fingerprints
Several motifs à characterise family
Identify small conserved regions in divergent proteins
Different combinations of motifs describe subfamilies
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAET GIHARPATLLVQTASK FNSDINLEY KGKSVNLK SIMGVMSL GVGQGSDVTITVDGADE AEGMAAIVETLQKEGLAE
1) GIHARPATLLVQTASKF 3-motif fingerprint 2) KGKSVNLKSIMGVMSL 3) LGVGQGSDVTITVDGADE Fingerprints
Sequence alignment
His Ser Conserved Define motifs phosphorylation phosphorylation site site site
xxxxxx xxxxxx xxxxxx Extract motif xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx sequences xxxxxx xxxxxx xxxxxx
Correct order Fingerprint signature 1 2 3 Correct spacing
PR00000 Sequence clustering
Automatic clustering of homologous domains
**Rarely covers entire domain (conserved core)
**Signature size can change with release
Known domain families PSI-BLAST Recruit homologous domains MKDOM2 Automatic clustering ProDomAlign Align domain families Hidden Markov Models (HMM)
Can characterise protein over entire length
Models conserved and divergent regions (position-specific scoring)
Models insertions and deletions
Ø Outperform in sensitivity and specificity
Ø More flexible (can use partial alignments) Hidden Markov Models (HMM)
Sequence 1: Sequence Sequence 2: Sequence 3: alignment Sequence 4: Sequence 5: Sequence 6: Sequence 7:
Scoring matrix
(residue frequency at each position in Bayesian statistics alignment) probability scoring
Profile Hidden Markov Models (HMM)
Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7:
M1
M = match state Hidden Markov Models (HMM)
Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7:
M1 M2
M = match state Hidden Markov Models (HMM)
Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7:
M1 M2 M3
M = match state Hidden Markov Models (HMM)
Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7:
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
M = match state Hidden Markov Models (HMM)
I1 I2 I3 I4 I5 I6 I7 I8 I9
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
D2 D3 D4 D5 D6 D7 D8 D9
I = insert state D = delete state Hidden Markov Models (HMM)
HMM databases:
• PIR SUPERFAMILY • PANTHER Families conserved in sequence • TIGRFAM
• PFAM Domains conserved • SMART in sequence • SUPERFAMILY Domains conserved in • GENE3D structure SAM Profile HMMs
Homologous structural superfamilies
Start with single seed sequence
Create 1 model for every protein in Few proteins in family superfamily à combine results have PDB structures
Proteins in superfamily may have low sequence identity SAM Profile models
T99 script: Single seed sequence GIHARPATLLVQTASKF WU-BLASTP
Close homologues Low identity
GIHARPATLLVQTASKF GIHARPATLLVQTASKF matches GIHARPATLLVQTASKF search Initial model New larger alignment GIHARPATLLVQTASKF GIHARPATLLVQTASKF GIHARPATLLVQTASKF GIHARPATLLVQTASKF GIHARPATLLVQTASKF Final model Signatures Methods Describe protein features: ac ve sites, binding sites… • Pattern Describe families and • Fingerprint sibling subfamilies Predicts conserved • Sequence clustering domains
• HMM
• SAM Signature Methods
• Pattern Func onal classifica on of families • Fingerprint Func onal domain • Sequence clustering annota on
• HMM
• SAM Structural domain annota on Comprehensive annotation InterPro removes redundancy
SWIB/MDM2 domain
RanBP2-type zinc finger RING-type Domain annota on zinc finger Comprehensive annotation
Conserved site within zinc finger
Annotate features Comprehensive annotation
Mdm2/Mdm4 family Parent
Child Mdm4 subfamily Family classifica on Domain Boundaries
Gene3D (and SSF) determines domain structural boundaries Pfam trims domains to regions of good sequence conservation
ProDom displays shortest conserved sequence Fragmented Signatures
1) Signature method 2) Duplicated domains 3) Repeated elements 4) Non-contiguous domains Fragmented Signatures
1) Signature method • e.g. PRINTS – discrete motifs
2) Duplicated domains 3) Repeated elements 4) Non-contiguous domains Fragmented Signatures
1) Signature method 2) Duplicated domains • e.g. SSF - duplication consisting of 2 domains with same fold
3) Repeated elements 4) Non-contiguous domains Fragmented Signatures
1) Signature method 2) Duplicated domains 3) Repeated elements • e.g. Kringle, WD40
4) Non-contiguous domains Fragmented Signatures
1) Signature method 2) Duplicated domains 3) Repeats 4) Non-contiguous domains • Structural domains can consist of non- contiguous sequence Fragmented Signatures
1) Signature method 2) Duplicated domains 3) Repeats 4) Non-contiguous domains Complementary Annotation
7-blade beta-propeller Beta-propeller repeat
Ø Structural-based signature (SSF) shows boundaries of structural domain
Ø Sequence-based signature (Pfam) shows that the domain is made up of repeating sequence elements Complementary Annotation
PFAM shows domain is composed of two types of repeated sequence mo fs
SUPERFAMILY shows the poten al domain boundaries Complementary Annotation
PFAM/SMART show 2 domains from dis nct sequence families
GENE3D shows that these domains share homologous structure Func onal annota on with Blast2GO Index • Why Blast2GO? Concepts on func onal annota ons – Func on assignment – Vocabularies – GO and GOA – The GO Direct Acyclic Graph – The problem of func on transfer • Blast2GO Java applica on + Prac cals • Blast2GO @ Babelomics + Prac cals
Why Blast2GO? Experiment Workflow analysis Data-Analysis Gene-List
MNAT1 CTNNBL1 ENOX2 GTPBP1 RALY TAGLN2 RAB3A PPP2R5A MAPRE1 Functional ..... Annotation Functional ... interpretation +
Functional Profiling What does Blast2GO do?
Generates annotations Visualization of funcional annotations Concepts of func onal annota on • Gene/Protein func on • Referes to the molecular func on of a gene or a protein: Tyrosine kinase • Func onal annota on • More general, referes to the characteriza on of func onal aspect of the protein. Stress-related, cytoplasm, ABC transporter • Also referes to the process of assingment of a func on label • Habitually, standard vocabularies are used to assign func on
Func onal Vocabularies
Molecular Function Biological Process Cellular Component
Metabolic pathways
KEGG orthologues
Functional motifs The Gene Ontology
• Project developed by the Gene Ontology Consor um • Provides a controlled vocabulary to describe gene and gene product a ributes in any organism • Includes both the development of the Ontology and the maintenance of a Database of annota ons The Ontology ü Annotations are given to te most specific (low) level. ü True path rule: annotation at a term implies annotation to all its parent terms More general ü Annotation is given with an Evidence Code: o IDA: inferred by direct assay o TAS: traceable author statement o ISS: infered by sequence similarity o IEA: electronic annotation o …. More specific The GO has a DAG structure The Gene Ontology Database (GOA) http://www.geneontology.org/GO.current.annotations.shtml
• There is a collabora ng ins tu on per organism to provide annota ons • Most of the GOA annota ons come from UniProt • Most of the annota ons are electronic annota ons InterPro
http://www.ebi.ac.uk/interpro/databases.html
• Collec on of databases with func onal annota on of protein mo fs • Func onal vocabulary at UniProt • There is an equivalence table between GO and InterPro Func onal assignment Annota on
Empirical Transference Literature reference
Filogeny
Molecular Biochemical Sequence interac ons Structure assay Comparison analysis
Gene/protein Sequence expression Iden fica on homology of folds Mo f iden fica on Automa c annota on • GO annota ons can be created by comparision to annotated sequences • To achieve enough coverage, high- throughput, automa c annota on is required • The most effec ve (also error prone) automa c annota on method is transfer from sequence similarity Concerns in func onal transfer by
GO1, GO2, GO3, GO4 similarity HIT QUERY GO1, GO2, GO3, GO4
• Level of homology (~ from 40-60% is possible) • The overlap query and hit sequences (not much a problem) • The domain or structure func on associa on • The paralog problem: genes with similar sequences might have different func onal specifica ons • The evidence for the original annota on • Balance between quality and quan ty: depends on the use Blast2GO • Suite for func onal annota on and data mining on func onal data – Considera ons for annota on • Simlarity • Length of the overlap • Percentage of hit sequence spanned by the overlap • Evidence original annota on • Blast hits and mo f hits • Refinement by addi onal methods – Visualiza on: • Annota on charts • Knowledge discovery on the DAG • Desktop Java applica on • web interface @ Babelomics: Babelomics for non-model Blast2GO Annota on strategy
Hit1 Hit1 go1,go2, go3 go1,go2, go3 Sq1 Sq1 Hit2 Sq1 Hit2 go1,go3, go4 Sq1 go1,go3, go4 Hit3 Hit3 go3,go5, go6,go8 go3,go5, go6,go8 Hit4 Hit4 go1,go4 go1,go4 Hit1 Hit1 go6,go9, go8 go6,go9, go8 Sq2 Sq2 Hit2 Sq2 Hit2 go1,go8 Sq2 go1,go8 Blast Hit3 Mapping Hit3 go4,go1, go8,go9 Annotation go4,go1, go8,go9 Hit4 Hit4 Hit1 Hit1 go2 go2 Sq3 Sq3 Hit2 Sq3 Hit2 go2,go4, go4 Sq3 go2,go4, go4 Hit3 Hit3 go2,go5, go6 go2,go5, go6 Hit4 Hit4 go2,go4 go2,go4
Hit1 Hit1 Sq4 Sq4 Sq4 Sq4 Hit2 Hit2 Blast2GO Annota on Strategy
go1,go2, go3 Sq1 go1,go3, go4 Sq1 go1,go2, go3, go3,go5, go6,go8 GO11 go1,go4
go6,go9, go8 Refinement Sq2 go1,go8 Sq2 go8, go4,go1, go8,go9 GO12, GO13
go2 InterPro Annex Sq3 go2,go4, go4 Sq3 go2,go4 go2,go5, go6 GOSlim go2,go4 Manual
Sq4 Sq4 GO15 B2G Highligh ng on the DAG: The B2G Score • Coloring strategy to highlight regions in the DAG where the most interes ng informa on is concentrated • The confluence score (B2G score) keeps a balance between the number of annotated sequences at one node and the distance to the origin of annota on