Sequence Motifs Analysis

Presenter: Wayne Xu, PhD
Computational Genomics Consultant, Supercomputing Institute
Email: [email protected]
Phone: (612) 624-1447
Help: [email protected], (612) 626-0802

Outline

• Introduction
• Motif Representations
• Motif Databases
• Search for Known Motifs
• Discover New Motifs
• Discussion

Introduction

• Huge amounts of sequence data are available
– Growing exponentially, driven by advances in sequencing techniques
– The focus is now on human sequences, then on other species; from land life to ocean life, even to "outer space life"

[Figure: EMBL sequence database growth (records, in millions)]
[Figure: EMBL database composition by nucleotides: Homo sapiens, Canis familiaris, Pan troglodytes, Rattus norvegicus, Mus musculus, others]

• Implications:
– Huge usable resources
– We may encounter many sequences of unknown function

Researchers’ Perspective

• To obtain sequences
– Search and retrieve from GenBank
– Clone and sequence experimentally
• What do we need to know about the sequences? What is this DNA sequence?
– Non-coding sequence or gene?
– Is it a completely new sequence or similar to known sequences?
– What features does it have?
– Structural features?
– Does it belong to a group or family?
– Evolutionary relationships?
– Motif features?
– What are the functions of the protein this sequence encodes, and how does it carry them out?
• Approaches
– Similarity search (NCBI BLAST, WU-BLAST, FASTA)
– Multiple alignment (ClustalW, T-Coffee, SAM)
– Structural homology (Insight II)
– Evolutionary relationships (PAUP, PHYLIP, ...)
– Motif features (MEME/MAST, ...)

Terminology

SUPERFAMILY

FAMILY

DOMAIN

MOTIF

SITE

RESIDUE

Motif
– Conserved region of sequences
– Found in protein, DNA and RNA; mostly studied in protein sequences
– Used to predict function, structure, or family membership

Fingerprint
• A group of conserved motifs used to characterize a protein family

Domain
• A discrete unit of a protein
• Assumed to fold and function independently of the rest of the protein

Family
• Set of sequences that are functionally/structurally related

Superfamily
• Members have the same overall domain architecture (domain number, order)
• E.g. the transmembrane superfamily

Pattern
• Qualitative method of describing a motif
• A regular expression-like syntax
– [AG]-x-V-x(2)-x-{YW}

Profile
• Quantitative method of describing a motif
– Matrix table

Why Motif Analysis

• Four possible results of BLAST
– 100% identical to known genes (within the same species): the same gene
– High similarity to known genes (among different species): orthologs
– Moderate homology (20-40%)
– Very low homology (novel genes)
• Cases to be considered:
– Known sequences do not necessarily have known functions (e.g. ESTs)
– Novel genes
– Moderate homology (20-40%)
– Structural motifs that share very low sequence homology

• Determination of function or structure
– A motif may reveal a function or structure of a new sequence even when we know nothing else about the sequence
– Taken together, the motifs on a sequence determine its functional and structural features

• A recognized method of classifying new protein families
– Family members share common structure and/or function
– Reduces the sequence "noise" that accompanies local alignment algorithms such as BLAST
– Allows weakly homologous proteins that share the same function to be grouped together
– True members of a protein family will contain all the motifs described in its fingerprint

Two Goals

1. Are there known motifs in your sequences?
– Search your sequences against motif databases

2. Discover new motifs in your sequences
– Start from a single sequence
– Start from multiple sequences (e.g. from a microarray experiment)

Motif Representations

• Sequence logos
• Regular expression
• Profile
• Blocks
• Profile HMM

[Figure: scaled position-specific amino acid distribution; Cys-Cys-His-His profile]

Regular Expression Example:

[AG]-x-V-x(2)-x-{YW}
– [ ] lists the residues allowed at a position (here, either A or G)
– x is any amino acid
– x(2) is any amino acid at each of the next 2 positions
– { } lists the residues excluded at a position (here, any amino acid except Y or W)

[Figure: Regular Expression Pattern Example: Zinc Finger C2H2]

PROFILES
• Table or matrix containing comparison information for aligned sequences
• Contains the same number of rows as there are positions in the sequences
• Each row contains the scores for aligning that position with each residue

[Figure: Example of a Profile]

[Figure: BLOCKS]

Hidden Markov Model (HMM)

• Large-scale profile with gaps, insertions and deletions allowed in the alignments
• Built from aligned sequences with hmmbuild
• Built around probabilities for each position
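Of the representations above, the regular-expression pattern is the easiest to try by hand. The following is a minimal sketch, assuming only the pattern syntax shown on the slide (residue sets in [ ], x for any residue, repeat counts in ( ), exclusions in { }). The function names `prosite_to_regex` and `find_matches` and the test sequence are hypothetical, chosen for illustration, and other PROSITE features such as terminal anchors are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only: single residues, [..] sets, {..} exclusions,
    x (any residue) and repeat counts such as (2) or (2,4).
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Separate the element from an optional trailing repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core in ("x", "X"):
            token = "."                      # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"  # any amino acid except these
        else:
            token = core                     # single residue or [..] set
        if repeat:
            token += "{" + repeat + "}"
        parts.append(token)
    return "".join(parts)


def find_matches(pattern: str, sequence: str):
    """Return (0-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    pattern = "[AG]-x-V-x(2)-x-{YW}"
    print(prosite_to_regex(pattern))            # [AG].V.{2}.[^YW]
    # Made-up test sequence, not taken from any database.
    print(find_matches(pattern, "MMAKVLLKDQ"))  # [(2, 'AKVLLKD')]
```

Because `re.finditer` reports non-overlapping matches only, a real scan for overlapping sites would need a different search loop; this sketch is meant to make the syntax concrete, not to replace a PROSITE scanning tool.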

MOTIF DATABASES

– PROSITE
– ProDom
– BLOCKS
– SMART
– PRINTS
– CDD
– Pfam
– InterPro

• All are valuable tools in the characterization and categorization of novel protein families
• The annotated databases (PRINTS, Pfam and PROSITE) provide a more concise information resource for each entry than automatically generated databases like BLOCKS

Prosite
• A collaboration
– The Swiss Institute of Bioinformatics and the EMBL Outstation, the European Bioinformatics Institute
• Patterns
– Started in 1988
– Some protein families and functional or structural domains show extreme sequence divergence and cannot be detected by patterns
• Profiles
– Introduced in 1994
– Currently, most new PROSITE entries are centered around profiles

[Figure: Pattern entry (regular expression)]
[Figure: Profile entry]

BLOCKS

– Multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins
– Made automatically by looking for the most highly conserved regions in groups of proteins documented in PROSITE, InterPro and PRINTS

[Figure: BLOCKS entry]

PRINTS

– Protein fingerprints database
– Motifs do not overlap, though they may be contiguous in 3D space
– Fingerprints encode protein folds and functionalities more flexibly and powerfully than single motifs can

PRINTS Entry
Several fields:
– General information
– Bibliographic references
– Text description
– Summary
– Index
– Lists of matches
– The aligned motifs

Pfam
• A large collection
– Protein multiple sequence alignments
– Profile hidden Markov models
• The latest version (6.6) contains 3071 families
• Matches 69% of proteins in SWISS-PROT 39 and TrEMBL 14

Pfam-A: based on curated multiple alignments
Pfam-B: automatic clustering of the rest of SWISS-PROT and TrEMBL, derived from the ProDom database. Useful when no Pfam-A families are found

ProDom

– Protein domain database
– Contains all protein domain families automatically generated from SWISS-PROT and TrEMBL
– ProDom-CG results from a similar domain analysis applied to completed genomes
– Addition of Pfam-A entries
• New ProDom built using PSI-BLAST

[Figure: ProDom entry]

SMART Database
– Simple Modular Architecture Research Tool (SMART) database
– Relational database system (PostgreSQL)
– More than 500 domain families found in signaling, extracellular and chromatin-associated proteins
– Annotated with phyletic distribution, functional class, tertiary structures and functionally important residues
– User interfaces allow searches for proteins containing specific combinations of domains

Conserved Domain Database (CDD)
Compiled collection from three major sources:
– SMART
– Pfam
– COG

InterPro

– An integrated documentation database for protein families, domains and sites
– Combines a number of databases that use different methodologies and varying degrees of biological information
– Capitalizes on their individual strengths, producing a powerful integrated diagnostic tool

Compare Motif Databases
• Similarities
– Similar in size
– Common interest in protein sequence classification
• Differences
– Differ in content
– Different areas of optimum application
– Different underlying analysis methods

Differences of Motif Databases
• Pfam:
– Focus on divergent domains
• Prosite:
– Focus on functional sites
• Prints:
– Focus on hierarchical definitions from superfamily down to subfamily levels
• ProDom:
– Uses PSI-BLAST to find homologous domains

Motifs Search

Search Against Prosite
http://us.expasy.org/prosite/
[Figure: Prosite search results]

Search Against BLOCKS
http://blocks.fhcrc.org/blocks/blocks_search.html
[Figure: BLOCKS search result]

Search Against PRINTS
www.bioinf.man.ac.uk/dbbrowser/PRINTS/
[Figure: PRINTS search results]

Pfam

– Look at multiple alignments
– View protein domain architectures
– Examine species distribution
– Follow links to other databases
– View known protein structures

Search Against Pfam
www.sanger.ac.uk/Software/Pfam/index.shtml
[Figure: Pfam search result]

Search Against ProDom
prodes.toulouse.inra.fr/prodom/current/html/form.php?typeform=SPTR
[Figure: ProDom search result]

Search Against Smart
http://smart.embl-heidelberg.de/
[Figure: Smart search result]

Search Against CDD
www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
[Figure: CDD search result]

Search Against InterPro
www.ebi.ac.uk/interpro/scan.html
[Figure: InterPro search result]

Local Server InterProScan

• For large sets of protein sequences (FASTA format)
• Searches the input protein sequences against the InterProScan databases

% module load proteinmotif
% InterProScan.pl demo.fa -ipr
% cd /usr/local/interProScan/iprscan/tmp/yours_data_number
% sh -c "/usr/local/interProScan/iprscan/bin/SunOS/gmake htm -j10 -k 2> ERROR 1> OUT"
% mv /usr/local/interProScan/iprscan/tmp/yours_data_number .

View the merged file cnk_1/???.htm

Discover New Motifs -- by HMMER

HMMER

• Free, developed by Sean Eddy’s lab at Washington University
• Uses hidden Markov models
• Profile-based protein sequence analysis

HMMER Programs
• hmmbuild
– Build a profile HMM from a multiple alignment
• hmmpfam
– Search a sequence against an HMM database such as Pfam
• hmmsearch
– Search a sequence database (e.g. GenBank) using an HMM
• hmmemit
– Generate sequences from a profile HMM
• hmmalign
– Align sequences using an HMM as a template
• ...

Markov Model
• Generates sequences according to the probabilities
• Computes the probability of a sequence
– State (output) probability: 20 possible amino acids, each with a different probability
– Transition probability
• An HMM represents a protein domain family by generating sequences from that family with very high probability

[Figure: HMM diagram labeled with transition probabilities and output probabilities]

Scoring a Sequence with an HMM:
The probability of ACCY along this path is
.4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 * .01 * 1 = 1.76 x 10^-6

[Figure: HMMER workflow with the steps: Sequences, ClustalW, Multiple alignment, hmmbuild, Profile HMM models, hmmemit, FASTA sequences, Search against InterProScan, Motifs, hmmsearch, Search against NCBI nr & SwissProt, Supporter sequences, Comparison, New motif?]

Discover New Motifs -- by MEME

MEME

• Discovers motifs from a group of unaligned DNA or protein sequences
• Time-consuming process; use the local server (MEME is installed on Cgls1)

[Figure: On-line MEME Service]

Local Server MEME

If no motif is found by searching the motif databases

Discover motif:
% module load bioinformatics
% meme demomeme.fa -mod tcm -w 10 -maxsites 30 -nmotifs 3 > memeout.html

Search the nr database to find supporters:
% mast memeout.html -d /usr/local/db/blast/current/nr

Discussion

• Search for known motifs

– Small number of sequences: try all the web services
– Large number of sequences: use the local Unix server

• Discover new motifs:
• Alignment
– The sequence alignment is the input
– If BLAST shows high homology, use ClustalW
– If the group of sequences has low homology, try other multiple alignment methods (e.g. iterative)
• Start from a single sequence
– PSI-BLAST to obtain more sequences
– Multiple alignment
– Build model
– Search motif databases to see whether the motif already exists
• Start from multiple sequences (e.g. from a microarray experiment)
– Multiple alignment
– Build model
– Search motif databases to see whether the motif already exists
– Search a protein database (e.g. Swiss-Prot/TrEMBL) to get more supporter sequences
• Choose algorithms:
– HMMER: does not build the alignment itself and depends on pre-aligned sequences; different alignments give different HMMs; can detect gapped motifs
– MEME: takes a group of sequences without pre-alignment; finds only ungapped motifs
• Literature, experimental proof
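The "multiple alignment, build model" step above can be illustrated with the simplest kind of model, an ungapped profile: one row per alignment position, holding a score for each residue. The sketch below is a toy under stated assumptions (a uniform background distribution, a pseudocount of 1, log-odds scores, and made-up DNA sites). The function names are hypothetical, and this is not how HMMER or MEME build their models internally.

```python
import math
from collections import Counter


def build_profile(aligned, alphabet, pseudocount=1.0):
    """Build a log-odds profile from equal-length, ungapped aligned segments.

    Returns one dict per position mapping residue -> log2(freq / background).
    Assumes a uniform background over the alphabet (an illustrative choice).
    """
    length = len(aligned[0])
    if any(len(segment) != length for segment in aligned):
        raise ValueError("all aligned segments must have the same length")
    background = 1.0 / len(alphabet)
    total = len(aligned) + pseudocount * len(alphabet)
    profile = []
    for pos in range(length):
        counts = Counter(segment[pos] for segment in aligned)
        row = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / total
            row[residue] = math.log2(freq / background)
        profile.append(row)
    return profile


def score(profile, segment):
    """Sum the per-position scores; residues outside the alphabet raise KeyError."""
    return sum(row[residue] for row, residue in zip(profile, segment))


def best_window(profile, sequence):
    """Return (0-based start, score) of the best-scoring window in sequence."""
    width = len(profile)
    starts = range(len(sequence) - width + 1)
    best = max(starts, key=lambda i: score(profile, sequence[i:i + width]))
    return best, score(profile, sequence[best:best + width])


if __name__ == "__main__":
    # Made-up aligned sites, for illustration only.
    sites = ["ACGT", "ACGT", "ACGA", "ACGT"]
    profile = build_profile(sites, "ACGT")
    print(best_window(profile, "GGACGTGG"))
```

Scanning a sequence with such a profile finds only ungapped matches, which is the limitation noted for MEME above; allowing insertions and deletions is what the profile HMM adds.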