Bioinformatics Tools at the Supercomputing Institute

Sequence Motifs Analysis

Presenter: Wayne Xu, PhD
Computational Genomics Consultant, Supercomputing Institute
Email: [email protected]
Phone: (612) 624-1447
Help: [email protected], (612) 626-0802

Outline
• Introduction
• Motif Representations
• Motif Databases
• Search for Known Motifs
• Discover New Motifs
• Discussion

Introduction
• Huge amounts of sequence data are available
  – Growing exponentially, thanks to advances in sequencing techniques
  – The current focus is on human sequences, then other species; from land organisms to ocean organisms, even to “outer space life”

[Figure: EMBL sequence database growth by nucleotides (records, millions), years 82 to 04; series: Homo sapiens, Mus musculus, Rattus norvegicus, Pan troglodytes, Canis familiaris, others]

• Implications:
  – Huge usable resources
  – We may encounter many sequences of unknown function

Researchers’ Perspective
• To obtain sequences
  – Search and retrieve from GenBank
  – Clone and sequence experimentally
• What do we need to know about the sequences?

What is this DNA sequence?
  – Non-coding sequence or gene?
  – Is it a completely new sequence or similar to known sequences?
  – What features does it have?
  – Structural features?
  – Does it belong to a group or family?
  – Evolutionary relationships?
  – Motif features?
  – Functions of the protein this sequence encodes?

How?
• Bioinformatics approaches
  – Similarity search (NCBI BLAST, WU-BLAST, FASTA)
  – Multiple alignment (ClustalW, T-Coffee, SAM)
  – Structural homology (Insight II)
  – Evolutionary relationships (PAUP, PHYLIP, …)
  – Motif features (HMMER, MEME/MAST, …)

Terminology
SUPERFAMILY > FAMILY > DOMAIN > MOTIF > SITE > RESIDUE

Terminology
Motif
  – Conserved region of sequences
  – Protein, DNA, RNA.
    Mostly protein sequences
  – Used to predict function, structure, or family membership

Terminology
Fingerprint
• A group of conserved motifs used to characterize a protein family
Domain
• A discrete unit of a protein
• Assumed to fold and function independently of the rest of the protein

Terminology
Family
• Set of sequences that are functionally or structurally related
Superfamily
• Members have the same overall domain architecture (domain number, order)
• E.g. transmembrane superfamily

Terminology
Pattern
• Qualitative method of describing a motif
• A regular expression-like syntax
  – [AG]-x-V-x(2)-x-{YW}
Profile
• Quantitative method of describing a motif
  – Matrix table

Why Motif Analysis
• Four possible results of a BLAST search
  – 100% identical to known genes (within the same species): the same gene
  – High similarity to known genes (among different species): orthologs
  – Moderate homology (20-40%)
  – Very low homology (novel genes)
• Cases to be considered:
  – A known sequence does not necessarily have a known function (e.g. ESTs)
  – Novel genes
  – Moderate homology (20-40%)
  – Structural motifs that share very low sequence homology

Why Motif Analysis
• Determination of function or structure
  – A motif may let us infer the function or structure of a new sequence even when we know little about the sequence overall
  – Together, the motifs in a sequence determine its functional and structural features

Why Motif Analysis
• A recognized method of classifying new protein families
  – Family members share common structure and/or function
  – Reduces the sequence "noise" that accompanies local alignment algorithms such as BLAST
  – Allows weakly homologous proteins that share the same function to be grouped together
  – True members of a protein family will contain all the motifs described in its fingerprint

Two Goals
1. Find known motifs in your sequences
  – Search your sequence against motif databases
2. Discover new motifs in your sequences
  – Start from a single sequence
  – Start from multiple sequences (e.g.
    microarray experiments)

Motif Representations
• Sequence logos
• Regular expression
• Profile
• Blocks
• Profile HMM

Sequence logo form
Scaled position-specific amino acid distribution. Cys-Cys-His-His profile:

Regular Expression
Example: [AG]-x-V-x(2)-x-{YW}
  – [ ]: any one of the listed amino acids
  – x: any amino acid
  – x(2): any amino acid at each of the next 2 positions
  – { }: any amino acid except those listed

Regular Expression Pattern
Example: Zinc finger C2H2

PROFILES
• Table or matrix containing comparison information for aligned sequences
• Contains the same number of rows as there are positions in the sequences
• Each row contains the scores for aligning that position with each residue

Example of a Profile

BLOCKS

Hidden Markov Model (HMM)
• Large-scale profile with gaps, insertions and deletions allowed in the alignments
• Built from aligned sequences with hmmbuild
• Built around probabilities for each position

MOTIF DATABASES
  – PROSITE
  – ProDom
  – BLOCKS
  – SMART
  – PRINTS
  – CDD
  – Pfam
  – InterPro

MOTIF DATABASES
• All are valuable tools for characterizing and categorizing novel protein families
• The annotated databases (PRINTS, Pfam and PROSITE) provide a more concise information resource for each entry than automatically generated databases like BLOCKS

Prosite
• A collaboration
  – The Swiss Institute of Bioinformatics and the EMBL Outstation, the European Bioinformatics Institute
• Patterns
  – Started in 1988
  – Some protein families and functional or structural domains are too divergent in sequence to be detected by patterns
• Profiles
  – Introduced in 1994
  – Currently most new PROSITE entries are centered around profiles

Pattern entry (regular expression)

Profile Entry

BLOCKS
  – Multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins
  – Made automatically by looking for the most highly conserved regions in groups of proteins documented in PROSITE, InterPro and PRINTS

BLOCKS Entry

PRINTS
  – Protein fingerprints database
  – Motifs do not overlap, though they may be contiguous in
    3D-space
  – Fingerprints encode protein folds and functionalities more flexibly and powerfully than single motifs can

PRINTS Entry
Several fields:
  – General information
  – Bibliographic references
  – Text description
  – Summary
  – Index
  – Lists of matches
  – The aligned motifs

Pfam
• A large collection of
  – Protein multiple sequence alignments
  – Profile hidden Markov models
• The latest version (6.6) contains 3071 families
• Matches 69% of proteins in SWISS-PROT 39 and TrEMBL 14

Pfam
Pfam-A: based on curated multiple alignments
Pfam-B: automatic clustering of the rest of SWISS-PROT and TrEMBL, derived from the ProDom database. Useful when no Pfam-A families are found

ProDom
  – Protein domain database
  – Contains all protein domain families automatically generated from SWISS-PROT and TrEMBL
  – ProDom-CG results from a similar domain analysis applied to completed genomes
  – Addition of Pfam-A entries
• New ProDom built using PSI-BLAST

ProDom Entry

SMART Database
  – The Simple Modular Architecture Research Tool (SMART) database
  – Relational database system (PostgreSQL)
  – More than 500 domain families found in signaling, extracellular and chromatin-associated proteins
  – Annotated with phylogenetic distributions, functional class, tertiary structures and functionally important residues
  – User interfaces allow searches for proteins containing specific combinations of domains

Conserved Domain Database (CDD)

CDD Database
Compiled from three major sources:
  – SMART
  – Pfam
  – COG

InterPro
  – An integrated documentation database for protein families, domains and sites
  – Combines a number of databases that use different methodologies and varying degrees of biological information
  – Capitalizes on their individual strengths, producing a powerful integrated diagnostic tool
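
As a rough illustration of the pattern syntax from the Regular Expression slide, the sketch below translates a PROSITE-style pattern such as [AG]-x-V-x(2)-x-{YW} into a Python regular expression and scans a protein sequence with it. The function names prosite_to_regex and find_motif and the test sequence are made up for this example; they are not part of PROSITE or of any tool named in this talk. Only the elements shown on the slide, plus a (min,max) repeat range, are handled; other parts of the real PROSITE syntax, such as the < and > anchors, are left out.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat
        # count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"empty pattern element in {pattern!r}")
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core in ("x", "X"):
            rx = "."                        # x: any amino acid
        elif core.startswith("[") and core.endswith("]"):
            rx = core                       # [AG]: A or G
        elif core.startswith("{") and core.endswith("}"):
            rx = "[^" + core[1:-1] + "]"    # {YW}: anything except Y or W
        elif len(core) == 1 and core.isalpha():
            rx = core.upper()               # a single fixed residue
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
        if lo is not None:
            rx += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(rx)
    return "".join(parts)


def find_motif(pattern: str, sequence: str):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical sequence, chosen only to exercise the pattern.
    print(prosite_to_regex("[AG]-x-V-x(2)-x-{YW}"))
    print(find_motif("[AG]-x-V-x(2)-x-{YW}", "MKAQVLLTRAGSVNNPY"))
```

Because re.finditer reports non-overlapping matches only, overlapping occurrences of a motif would need a lookahead-based scan; a qualitative pattern like this either matches or does not, which is the limitation that the quantitative profile representation addresses.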

Compare Motif Databases
• Similarities
  – In size
  – Common interest in protein sequence classification
• Differences
  – In content
  – Different areas of optimum application
  – Different underlying analysis methods

Differences of Motif Databases
• Pfam:
  – Focus on divergent domains
• Prosite:
  – Focus on functional sites
• PRINTS:
  – Focus on hierarchical definitions from superfamily down to subfamily levels
• ProDom:
  – Uses PSI-BLAST to find homologous domains

Motifs Search

Search Against Prosite
http://us.expasy.org/prosite/

Prosite Search Results

Search Against BLOCKS
http://blocks.fhcrc.org/blocks/blocks_search.html

BLOCKS Search Result

Search Against PRINTS
www.bioinf.man.ac.uk/dbbrowser/PRINTS/

PRINTS Search Results

Pfam
  – Look at multiple alignments
  – View protein domain architectures
  – Examine species distribution
  – Follow links to other databases
  – View known protein structures

Search Against Pfam
www.sanger.ac.uk/Software/Pfam/index.shtml

Pfam Search Result

Search Against ProDom
prodes.toulouse.inra.fr/prodom/current/html/form.php?typeform=SPTR

ProDom Search Result

Search Against SMART
http://smart.embl-heidelberg.de/

SMART Search Result

Search Against CDD
www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

CDD Search Result

Search Against InterPro
www.ebi.ac.uk/interpro/scan.html

InterPro Search Result

Local Server InterProScan
• For large sets of protein sequences (FASTA format)
• Searches the input protein sequences against the InterProScan databases

  % module load proteinmotif
  % InterProScan.pl demo.fa -ipr
  % cd /usr/local/interProScan/iprscan/tmp/yours_data_number
  % sh -c "/usr/local/interProScan/iprscan/bin/SunOS/gmake htm -j10 -k 2> ERROR 1> OUT"
  % mv /usr/local/interProScan/iprscan/tmp/yours_data_number .
View the merged file cnk_1/???.htm

Discover New Motifs -- by HMMER

HMMER
• Free, developed by Sean Eddy’s lab at Washington University
• Uses hidden Markov models
• Profile-based protein sequence analysis

HMMER Programs
• hmmbuild
• hmmpfam
  – Search the Pfam HMM database
• hmmsearch
  – Search GenBank using an HMM
• hmmemit
  – Generate sequences from a profile HMM
• hmmalign
  – Align sequences using an HMM as a template
• …

Markov Model
• Generate sequences according
