Bioinformatics Tools at the Supercomputing Institute

Sequence Motifs Analysis

Presenter: Wayne Xu, PhD
Computational Genomics Consultant
Supercomputing Institute
Email: [email protected]
Phone: (612) 624-1447
Help: [email protected], (612) 626-0802

Outline
• Introduction
• Motif Representations
• Motif Databases
• Search for Known Motifs
• Discover New Motifs
• Discussion

Introduction
• Huge amounts of sequence data are available
– Growing exponentially, driven by advances in sequencing techniques
– The current focus is on human sequences, then other species; from land organisms to ocean organisms, even to "outer space" organisms

[Figure: EMBL sequence database growth (records, millions), 1982 to 2004; breakdown by nucleotides: Homo sapiens, Canis familiaris, Pan troglodytes, Rattus norvegicus, Mus musculus, others]

• Implications:
– Huge usable resources
– We may encounter many sequences of unknown function

Researchers' Perspective
• To obtain sequences
– Search and retrieve from GenBank
– Experimental cloning and sequencing
• What do we need to know about the sequences?

What is this DNA sequence?
– Non-coding sequence or gene?
– Is it a completely new sequence or similar to known sequences?
– What features does it have?
– Structural features?
– Does it belong to a group or family?
– Evolutionary relationships?
– Motif features?
– Functions of the protein this sequence encodes?

How?
• Bioinformatics approaches
– Similarity search (NCBI BLAST, WU-BLAST, FASTA)
– Multiple alignment (ClustalW, T-Coffee, SAM)
– Structural homology (Insight II)
– Evolutionary relationships (PAUP, PHYLIP, ...)
– Motif features (HMMER, MEME/MAST, ...)

Terminology
SUPERFAMILY / FAMILY / DOMAIN / MOTIF / SITE / RESIDUE

Terminology
Motif
– Conserved region of sequences
– Protein, DNA, or RNA; mostly protein sequences
– Used to predict function, structure, or family membership

Terminology
Fingerprint
• A group of conserved motifs used to characterize a protein family
Domain
• A discrete unit of a protein
• Assumed to fold and function independently of the rest of the protein

Terminology
Family
• Set of sequences that are functionally/structurally related
Superfamily
• Members have the same overall domain architecture (domain number, order)
• E.g. transmembrane superfamily

Terminology
Pattern
• Qualitative method of describing a motif
• A regular expression-like syntax
– [AG]-x-V-x(2)-x-{YW}
Profile
• Quantitative method of describing a motif
– Matrix table

Why Motif Analysis
• Four possible results of a BLAST search
– 100% identical to known genes (within the same species): the same gene
– High similarity to known genes (among different species): orthologs
– Moderate homology (20-40%)
– Very low homology (novel genes)
• Cases to be considered:
– Known sequences do not necessarily have known functions (e.g. ESTs)
– Novel genes
– Moderate homology (20-40%)
– Structural motifs that share very low sequence homology

Why Motif Analysis
• Determination of function or structure
– A motif may indicate a function or structure of a new sequence even when we know nothing about the sequence overall
– Together, all the motifs on a sequence determine its functional and structural features

Why Motif Analysis
• A recognized method of classifying new protein families
– Family members share common structure and/or function
– Reduces the sequence "noise" that accompanies other local alignment algorithms, like BLAST
– Allows weakly homologous proteins that share the same function to be grouped together
– True members of a protein family will contain all the motifs described in the fingerprint

Two Goals
1. Are there known motifs in your sequences?
– Search your sequences against motif databases
2. Discover new motifs in your sequences
– Start from a single sequence
– Start from multiple sequences (e.g. a microarray experiment)

Motif Representations
• Sequence logos
• Regular expression
• Profile
• Blocks
• Profile HMM

Sequence logo form
Scaled position-specific amino acid distribution. Example: Cys-Cys-His-His profile.

Regular Expression
Example: [AG]-x-V-x(2)-x-{YW}
– [] lists the amino acids allowed at a position
– x is any amino acid
– x(2) is any amino acid at each of the next 2 positions
– {} lists the amino acids not allowed at a position

Regular Expression Pattern
Example: Zinc finger C2H2

PROFILES
• Table or matrix containing comparison information for aligned sequences
• Contains the same number of rows as positions in the sequences
• Each row contains the scores for aligning that position with each residue

Example of a Profile

BLOCKS

Hidden Markov Model (HMM)
• Large-scale profile with gaps, insertions and deletions allowed in the alignments
• Built from aligned sequences (hmmbuild)
• Built around probabilities for each position

MOTIF DATABASES
– PROSITE
– ProDom
– BLOCKS
– SMART
– PRINTS
– CDD
– Pfam
– InterPro

MOTIF DATABASES
• All are valuable tools in the characterization and categorization of novel protein families
• The annotated databases (PRINTS, Pfam and PROSITE) provide a more concise information resource for each entry than automatically generated databases like BLOCKS

PROSITE
• A collaboration
– The Swiss Institute of Bioinformatics and the EMBL Outstation, the European Bioinformatics Institute
• Patterns
– Started in 1988
– Some protein families and functional or structural domains are too divergent in sequence to be detected by patterns
• Profiles
– Introduced in 1994
– Currently most new PROSITE entries are centered around profiles

Pattern entry (regular expression)

Profile Entry

BLOCKS
– Multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins
– Made automatically by looking for the most highly conserved regions in groups of proteins documented in PROSITE, InterPro and PRINTS

BLOCKS Entry

PRINTS
– Protein fingerprints database
– Motifs do not overlap, though they may be contiguous in 3D space
– Fingerprints encode protein folds and functionalities more flexibly and powerfully than single motifs can

PRINTS Entry
Several fields:
– General information
– Bibliographic references
– Text description
– Summary
– Index
– Lists of matches
– The aligned motifs

Pfam
• A large collection of
– Protein multiple sequence alignments
– Profile hidden Markov models
• The latest version (6.6) contains 3071 families
• Matches 69% of proteins in SWISS-PROT 39 and TrEMBL 14

Pfam
Pfam-A: based on curated multiple alignments
Pfam-B: automatic clustering of the rest of SWISS-PROT and TrEMBL, derived from the ProDom database. Useful when no Pfam-A families are found

ProDom
– Protein domain database
– Contains all protein domain families automatically generated from SWISS-PROT and TrEMBL
– ProDom-CG results from a similar domain analysis applied to completed genomes
– Addition of Pfam-A entries
• New ProDom built using PSI-BLAST

ProDom Entry

SMART Database
– Simple Modular Architecture Research Tool (SMART) database
– Relational database system (PostgreSQL)
– More than 500 domain families found in signaling, extracellular and chromatin-associated proteins
– Annotated with phylogenetic distributions, functional class, tertiary structures and functionally important residues
– User interfaces allow searches for proteins containing specific combinations of domains

Conserved Domain Database (CDD)

CDD Database
Compiled from three major sources:
– SMART
– Pfam
– COG

InterPro
– An integrated documentation database for protein families, domains and sites
– Combines a number of databases that use different methodologies and a varying degree of biological information
– Capitalizes on their individual strengths, producing a powerful integrated diagnostic tool
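Pattern syntax as code

The pattern syntax shown under Regular Expression above, [AG]-x-V-x(2)-x-{YW}, maps closely onto ordinary regular expressions. The sketch below is one way to make that mapping concrete in Python. It is an illustration only, not PROSITE's own scanning code: the function name prosite_to_regex and the two test sequences are made up, and only the elements listed on that slide are handled (plus ranges such as x(2,4)); anchors and other PROSITE features are ignored.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; handles only [..], {..}, x,
    single residues, and repeat counts such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Separate the element from an optional repeat count in parentheses.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core in ("x", "X"):
            token = "."                       # any amino acid
        elif core.startswith("[") and core.endswith("]"):
            token = core                      # any one of the listed residues
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            token = re.escape(core)           # one fixed residue
        if lo is not None:
            token += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(token)
    return "".join(parts)


regex = prosite_to_regex("[AG]-x-V-x(2)-x-{YW}")
print(regex)  # [AG].V.{2}.[^YW]

# Made-up sequences: the first contains the motif, the second ends it with Y.
hit = re.search(regex, "MKAGVLLKAD")
miss = re.search(regex, "MKAGVLLKYD")
print(hit.start(), hit.group())  # 2 AGVLLKA
print(miss)  # None
```

A pattern either matches or it does not, which is why the slides call it a qualitative description; a profile instead assigns a score to every residue at every position.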
Compare Motif Databases
• Similarities
– In size
– Common interest in protein sequence classification
• Differences
– In content
– Different areas of optimum application
– Different underlying analysis methods

Differences of Motif Databases
• Pfam:
– Focus on divergent domains
• PROSITE:
– Focus on functional sites
• PRINTS:
– Focus on hierarchical definitions from superfamily down to subfamily levels
• ProDom:
– Uses PSI-BLAST to find homologous domains

Motifs Search

Search Against PROSITE
http://us.expasy.org/prosite/
PROSITE Search Results

Search Against BLOCKS
http://blocks.fhcrc.org/blocks/blocks_search.html
BLOCKS Search Result

Search Against PRINTS
www.bioinf.man.ac.uk/dbbrowser/PRINTS/
PRINTS Search Results

Pfam
– Look at multiple alignments
– View protein domain architectures
– Examine species distribution
– Follow links to other databases
– View known protein structures

Search Against Pfam
www.sanger.ac.uk/Software/Pfam/index.shtml
Pfam Search Result

Search Against ProDom
prodes.toulouse.inra.fr/prodom/current/html/form.php?typeform=SPTR
ProDom Search Result

Search Against SMART
http://smart.embl-heidelberg.de/
SMART Search Result

Search Against CDD
www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
CDD Search Result

Search Against InterPro
www.ebi.ac.uk/interpro/scan.html
InterPro Search Result

Local Server InterProScan
• For large sets of protein sequences (FASTA format)
• Searches the input protein sequences against the InterProScan databases

% module load proteinmotif
% InterProScan.pl demo.fa -ipr
% cd /usr/local/interProScan/iprscan/tmp/yours_data_number
% sh -c "/usr/local/interProScan/iprscan/bin/SunOS/gmake htm -j10 -k 2> ERROR 1> OUT"
% mv /usr/local/interProScan/iprscan/tmp/yours_data_number .

View the merged file cnk_1/???.htm

Discover New Motifs (by HMMER)

HMMER
• Free; developed by Sean Eddy's lab at Washington University
• Uses hidden Markov models
• Profile-based protein sequence analysis

HMMER Programs
• hmmbuild
• hmmpfam
– Search the Pfam HMM database
• hmmsearch
– Search GenBank using an HMM
• hmmemit
– Generate sequences from a profile HMM
• hmmalign
– Use an HMM as a template
• ...

Markov Model
• Generate sequences according
