The universe of biological sequence analysis

Word/pattern recognition - identification of restriction enzyme cleavage sites

AGGCGGAACGGGTACTGGACGTACTAGGCGAGGCGATCTAGCGAGGGCATGTTGATGGCG
CGGTTAGCGAGCTACTATCGGGGGGCGAGCTTATTGGGCGGGGCGGACTATGGGCTGGCG
ATGAAAAAGAAAACAACACTTAGCGAGGAGGACCAGGCTCTGTTTCGCCAGTTGATGGCG
GGGACTCGCAAGATTAAGCAGGACACGATTGTCCACCGACCGCAGCGTAAAAAAATCAGC
GAAGTGCCGGTGAAACGCTTGATCCAGGAGCAGGCTGATGCCAGCCATTATTTCTCCGAT
GAGTTTCAGCCGTTATTAAATACCGAAGGTCCGGTGAAATATGTTCGCCCGGATGTCAGC
CATTTTGAGGCGAAGAAACTGCGCCGTGGCGATTATTCGCCGGAGTTGTTTTTGGATTTA
CACGGTCTGACGCAGCTGCAGGCCAAGCAGGAACTGGGGGCGTTGATTGCCGCCTGCCGC   <- PstI (CTGCAG)
GAGTTGCCCTGATAAGGGTACTATTACGGACGAGTCATCTTATGCGGAGCGATTAGGGCG
GGAGCGGTTTTTAGGGCGTTTTTGGCGGCCCCCTATCTATGCAGCACGAGCGACTATGCC

Sequence alignment methods
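Identifying a restriction enzyme cleavage site, as in the PstI example above, is an exact pattern search. A minimal sketch follows, assuming the PstI recognition sequence CTGCAG and using the one sequence line from the slide that carries the PstI label. The function name `find_sites` is made up for illustration. Because CTGCAG is its own reverse complement, scanning one strand is enough for this enzyme.

```python
def find_sites(sequence, site):
    """Return 0-based start positions of every occurrence of `site` (overlaps included)."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


PSTI_SITE = "CTGCAG"  # PstI recognition sequence (palindromic)
seq = "CACGGTCTGACGCAGCTGCAGGCCAAGCAGGAACTGGGGGCGTTGATTGCCGCCTGCCGC"

# Report 1-based positions, as is usual for sequence coordinates.
for pos in find_sites(seq, PSTI_SITE):
    print(f"PstI site at position {pos + 1}")
```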

The universe of biological sequence analysis - prediction of exon structure

CGCCGAGGATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTTGGGGGCCCTGG
                 MetAlaProArgThrLeuLeuLeuLeuLeuLeuGlyAlaLeuAla    Exon 1
CCCTGACCCAGACCTGGGCGGGTGAGTGCGGGGTCGTGGGGAAACCGCCTCTGCGGGGAG
  LeuThrGlnThrTrpAlaGly
AAGCAAGGGGCCCGCCCGGCGGGGACGCAGGACCCGGGTAGCCGCGCCGGGAGGAGGGTC
GGGTGGGTCTCAGCCACTCCTCGCCCCCAGGCTCCCACTCCATGAGGTATTTCACCACAT
                                SerHisSerMetArgTyrPheThrThrSer
CCGTGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACA
  ValSerArgProGlyArgGlyGluProArgPheIleAlaValGlyTyrValAspAspThr    Exon 2
CGCAGTTCGTGCGGTTTGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCACCGT
  GlnPheValArgPheAspSerAspAlaAlaSerGlnArgMetGluProArgAlaProTrp
GGATAGAGCAGGAGGGGCCGGAGTATTGGGACCTGCAGACACGGAATGTGAAGGCCCAGT
  IleGluGlnGluGlyProGluTyrTrpAspLeuGlnThrArgAsnValLysAlaGlnSer
CACAGACTGACCGAGCGAACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCG
  GlnThrAspArgAlaAsnLeuGlyThrLeuArgGlyTyrTyrAsnGlnSerGluAla
GTGAGTGACCCCGGCCCGGGGCGCAGGTCACGACCTCTCATCCCCCACGGACGGGCCGGG

Pairwise alignment

C G A T A G C A T G A T G T C T
C G A C A G C A T - A T G T C T
* * *   * * * * *   * * * * * *

Why sequence alignments?

• Prediction of function
• Protein family analysis
• Comparative genomics
• Phylogeny / Evolutionary history
• Genome sequencing:
  • Assembly
  • Alignment to reference genome

Prediction of function

[Figure: the sequence to be investigated is compared against a database of sequences that contains a seq. with known function]

We have a ‘new’ sequence. Is it similar to a previously known sequence? We can test by alignment whether it is similar to a sequence with known function. If it is, we can assign a possible function to our new sequence.

Protein family analysis

Comparative genomics - reveals biologically significant regions of the genome

Pairwise alignment

[Figure: dotplot with CGATAGCATGATGTCT on the horizontal axis and CGACAGCATATGTCT on the vertical axis]

CGATAGCATGATGTCT
*** ***** ******
CGACAGCAT-ATGTCT
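A dotplot like the one on this slide can be sketched in a few lines: one sequence runs along each axis and a mark is placed wherever the residues match. The function name `dotplot` and the `window` parameter are assumptions made for illustration. With `window=1` every single matching base is marked, while larger windows mark only runs of consecutive matches and so suppress background noise.

```python
def dotplot(seq_x, seq_y, window=1):
    """Return text rows; row i has '*' at column j where the windows starting
    at seq_y[i] and seq_x[j] are identical."""
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = ""
        for j in range(len(seq_x) - window + 1):
            same = seq_y[i:i + window] == seq_x[j:j + window]
            row += "*" if same else " "
        rows.append(row)
    return rows


seq_x = "CGATAGCATGATGTCT"   # horizontal axis
seq_y = "CGACAGCATATGTCT"    # vertical axis

print("  " + seq_x)
for base, row in zip(seq_y, dotplot(seq_x, seq_y)):
    print(base + " " + row)
```

Similar regions show up as diagonal runs of marks, and a shift from one diagonal to a neighbouring one corresponds to a gap in the alignment.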

Scoring: match +2, mismatch -1, gap -2

CGATAGCATGATGTCT
*** ***** ******
CGACAGCAT-ATGTCT
Score = 25

CG-----ATAGCATGATGTCT
**     ***      *****
CGACAGCATA------TGTCT
Score = -2
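The two totals on this slide, 25 and -2, can be reproduced by summing a score per alignment column. The sketch below assumes match +2, mismatch -1 and gap -2, values that are consistent with both totals. The function name `score_alignment` is made up for illustration.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap=-2):
    """Sum column scores of two aligned rows of equal length ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap          # gap in either row
        elif x == y:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


print(score_alignment("CGATAGCATGATGTCT", "CGACAGCAT-ATGTCT"))            # 25
print(score_alignment("CG-----ATAGCATGATGTCT", "CGACAGCATA------TGTCT"))  # -2
```

The first alignment has 14 matches, 1 mismatch and 1 gap (28 - 1 - 2 = 25). The second has 10 matches and 11 gap positions (20 - 22 = -2), so the score clearly favours the first.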

More sophisticated scoring of protein sequence alignments

Each amino acid change has a characteristic probability

A G L C E
| | | | |
A A L C D

substitution matrix: 4 + 0 + 4 + 9 + 2 = 19
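The sum 4 + 0 + 4 + 9 + 2 = 19 can be reproduced by looking up each aligned pair in a substitution matrix. The slide does not name the matrix; the five values happen to agree with BLOSUM62, so the sketch below assumes that matrix and includes only the entries needed here. The names `BLOSUM62_EXCERPT`, `pair_score` and `score_ungapped` are made up for illustration.

```python
# Excerpt only: the five residue pairs used in the slide example.
# Values assumed to be from BLOSUM62; a real matrix has an entry for every pair.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,
    ("A", "G"): 0,
    ("L", "L"): 4,
    ("C", "C"): 9,
    ("D", "E"): 2,
}


def pair_score(x, y, matrix):
    """Look up a symmetric substitution score, whichever order the pair is stored in."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def score_ungapped(seq_a, seq_b, matrix):
    """Score two equal-length sequences column by column, without gaps."""
    return sum(pair_score(x, y, matrix) for x, y in zip(seq_a, seq_b))


print(score_ungapped("AGLCE", "AALCD", BLOSUM62_EXCERPT))  # 19
```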

Local and global alignments

[Figure: local alignment, where matching bars join only a subregion of sequences A and B; global alignment, where bars join A and B along their full length]

Frequently used methods in sequence analysis that are based on sequence alignment

BLAST - searches in databases for sequence similarity
ClustalW - multiple alignment of sequences

BLAST - Basic Local Alignment Search Tool

A query sequence (DNA or protein) is tested against all sequences in a database (DNA or protein), i.e. the query is aligned to all the database sequences. The final output is a list of the best matching database sequences.

Searching databases for sequence similarity - traditional alignment method too slow

FASTA, 1988 - William Pearson
BLAST, 1990 - David Lipman

BLAST output

BLASTP 2.2.9 [May-01-2004]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= lcl|SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein (SRP54)
       (504 letters)

Database: swissprot
          197,228 sequences; 71,501,181 total letters

Searching...... done

                                                                     Score    E
Sequences producing significant alignments:                          (bits)  Value

SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein ...    959   0.0
SRP54_PONPY (Q5R4R6) Signal recognition particle 54 kDa protein ...    958   0.0
SRP54_MACFA (Q4R965) Signal recognition particle 54 kDa protein ...    958   0.0
SRP54_HUMAN (P61011) Signal recognition particle 54 kDa protein ...    958   0.0
SRP54_CANFA (P61010) Signal recognition particle 54 kDa protein ...    958   0.0
SRP54_RAT (Q6AYB5) Signal recognition particle 54 kDa protein (S...    957   0.0
SRP54_GEOCY (Q8MZJ6) Signal recognition particle 54 kDa protein ...    794   0.0
SR542_LYCES (P49972) Signal recognition particle 54 kDa protein ...    565   e-161
SR543_ARATH (P49967) Signal recognition particle 54 kDa protein ...    560   e-159
SR542_HORVU (P49969) Signal recognition particle 54 kDa protein ...    558   e-158
...
SRPR_MOUSE (Q9DBG7) Signal recognition particle receptor alpha s...     99   3e-20
SRPR_HUMAN (P08240) Signal recognition particle receptor alpha s...     99   3e-20
SRPR_YEAST (P32916) Signal recognition particle receptor alpha s...     98   7e-20

Searching databases for sequence similarity - shortcuts of BLAST

Improvement of speed as compared to local alignment algorithm:

Initial search is for word hits.
Word hits are then extended in either direction.

[Figure: dot matrix of two short protein sequences with a "word hit" marked on the diagonal]
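The two-step shortcut (find word hits, then extend them in either direction) can be sketched as follows. This is a deliberately simplified illustration, not BLAST itself: it seeds only on exact words, scores identities with +1/-1 instead of a substitution matrix, and stops extending once the running score falls more than `x_drop` below the best score seen. The function names, the parameters and the two toy sequences are all assumptions made for the example.

```python
def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact shared word of length k."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps in both directions.

    Returns (query_start, subject_start, length, score) of the extended segment.
    """
    def walk(qi, sj, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length   # remember the best extension
            elif best - score > x_drop:
                break                            # score dropped too far: stop
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    length = left_len + k + right_len
    score = left_score + k * match + right_score
    return i - left_len, j - left_len, length, score


query, subject = "MKTAYIAKQR", "GGTAYIAKGG"   # toy sequences, not from the slides
hits = word_hits(query, subject)
print(hits)
print(extend_hit(query, subject, *hits[0]))
```

Indexing the query words once and scanning each database sequence for them is what makes the seeding step much cheaper than filling a full alignment matrix for every database entry.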

BLAST output, cont.

sp|Q9I3P8.1|FLHF_PSEAE RecName: Full=Flagellar biosynthesis prot...    57   3e-07
sp|Q44758.1|FLHF_BORBU RecName: Full=Flagellar biosynthesis prot...    55   2e-06
sp|Q01960.1|FLHF_BACSU RecName: Full=Flagellar biosynthesis prot...    53   4e-06
sp|O28980.1|Y1289_ARCFU RecName: Full=Uncharacterized protein AF...    39   0.064
sp|B9LKC1.1|CYSC_CHLSY RecName: Full=Adenylyl-sulfate kinase; Al...    38   0.21
sp|Q12U80.1|RADB_METBU RecName: Full=DNA repair and recombinatio...    37   0.29
sp|A5D014.1|ACCD_PELTS RecName: Full=Acetyl-coenzyme A carboxyla...    35   0.93
sp|Q03T56.1|RSMA_LACBA RecName: Full=Ribosomal RNA small subunit...    35   1.2
sp|Q1I2K4.1|CYSC_PSEE4 RecName: Full=Adenylyl-sulfate kinase; Al...    35   1.6
sp|Q38V22.1|RSMA_LACSS RecName: Full=Ribosomal RNA small subunit...    34   1.8
sp|A1U3X8.1|CYSC_MARAV RecName: Full=Adenylyl-sulfate kinase; Al...    34   2.3
sp|A6TD42.1|CYSC_KLEP7 RecName: Full=Adenylyl-sulfate kinase; Al...    34   2.9
sp|P63890.2|CYSC_SALTI RecName: Full=Adenylyl-sulfate kinase; Al...    34   2.9
...

Expect value (E)

Parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0", the more "significant" the match is.
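The dependence of E on score and database size can be made concrete with the standard relation E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below applies it to the numbers from the SRP54 search shown earlier (504-letter query, 71,501,181 database letters, 99 bits). The function name `expect_value` is made up for illustration. BLAST itself uses corrected "effective" lengths, so the result only approximates the reported 3e-20.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value: E = m * n * 2**(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same order of magnitude as the 3e-20 reported for the 99-bit hits.
print(expect_value(99, 504, 71_501_181))

# Each additional bit of score halves E; a database twice as large doubles it.
print(expect_value(100, 504, 71_501_181))
print(expect_value(99, 504, 2 * 71_501_181))
```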

High Scoring Pair (HSP)

Query: 1   MVLADLGRKITSALRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAI 60
           MVLADLGRKITSALRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAI
Sbjct: 1   MVLADLGRKITSALRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAI 60

Query: 61  DLEEMASGLNKRKMIQHAVFKELVKLVDPGVKAWTPTKGKQNVIMFVGLQGSGKTTTCSK 120
           DLEEMASGLNKRKMIQHAVFKELVKLVDPGVKAWTPTKGKQNVIMFVGLQGSGKTTTCSK
Sbjct: 61  DLEEMASGLNKRKMIQHAVFKELVKLVDPGVKAWTPTKGKQNVIMFVGLQGSGKTTTCSK 120

Query: 121 LAYYYQRKGWKTCLICADTFRAGAFDQLKQNATKARIPFYGSYTEMDPVIIASEGVEKFK 180
           LAYYYQRKGWKTCLICADTFRAGAFDQLKQNATKARIPFYGSYTEMDPVIIASEGVEKFK
Sbjct: 121 LAYYYQRKGWKTCLICADTFRAGAFDQLKQNATKARIPFYGSYTEMDPVIIASEGVEKFK 180

Query: 181 NENFEIIIVDTSGRHKQEDSLFEEMLQVSNAIQPDNIVYVMDASIGQACEAQAKAFKDKV 240
           NENFEIIIVDTSGRHKQEDSLFEEMLQV+NAIQPDNIVYVMDASIGQACEAQAKAFKDKV
Sbjct: 181 NENFEIIIVDTSGRHKQEDSLFEEMLQVANAIQPDNIVYVMDASIGQACEAQAKAFKDKV 240

Query: 241 DVASVIVTKLDGHAKGGGALSAVAATKSPIIFIGTGEHIDDFEPFKTQPFISKLLGMGDI 300
           DVASVIVTKLDGHAKGGGALSAVAATKSPIIFIGTGEHIDDFEPFKTQPFISKLLGMGDI
Sbjct: 241 DVASVIVTKLDGHAKGGGALSAVAATKSPIIFIGTGEHIDDFEPFKTQPFISKLLGMGDI 300

>SRPR_MOUSE (Q9DBG7) Signal recognition particle receptor alpha subunit (SR-alpha) (Docking protein alpha) (DP-alpha)
Length = 636

Score = 99.0 bits (245), Expect = 3e-20
Identities = 68/313 (21%), Positives = 143/313 (45%), Gaps = 31/313 (9%)

Query: 14  LRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAIDLEEMASGLNKRK 73
           L+ L + ++ E + ++L ++ L+ +V + QL E+V + ++ + M +
Sbjct: 322 LKGLVGSKSLSREDMESVLDKMRDHLIAKNVAADIAVQLCESVANKLEGKVMGTFSTVTS 381

Query: 74  MIQHAVFKELVKLVDPGVKAW------TPTKGKQNVIMFVGLQGSGKTTTCSKLAYYYQ 126
           ++ A+ + LV+++ P + + + V+ F G+ G GK+T +K++++
Sbjct: 382 TVKQALQESLVQILQPQRRVDMLRDIMDAQRRQRPYVVTFCGVNGVGKSTNLAKISFWLL 441

Query: 127 RKGWKTCLICADTFRAGAFDQLK------QNATKARIPFYGSYTEMDPVIIAS 173
           G+ + DTFRAGA +QL+ ++ + + + D IA
Sbjct: 442 ENGFSVLIAACDTFRAGAVEQLRTHTRRLTALHPPEKHGGRTMVQLFEKGYGKDAAGIAM 501

BLAST output – revealing orthologs and paralogs

Query= lcl|SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein (SRP54)

                                                                     Score    E
Sequences producing significant alignments:                          (bits)  Value

orthologs
SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein ...    959   0.0
SRP54_PONPY (Q5R4R6) Signal recognition particle 54 kDa protein ...    958   0.0
SRP54_MACFA (Q4R965) Signal recognition particle 54 kDa protein ...    958   0.0
SRP54_HUMAN (P61011) Signal recognition particle 54 kDa protein ...    958   0.0
SRP54_CANFA (P61010) Signal recognition particle 54 kDa protein ...    958   0.0
SRP54_RAT (Q6AYB5) Signal recognition particle 54 kDa protein (S...    957   0.0
SRP54_GEOCY (Q8MZJ6) Signal recognition particle 54 kDa protein ...    794   0.0
SR542_LYCES (P49972) Signal recognition particle 54 kDa protein ...    565   e-161
SR543_ARATH (P49967) Signal recognition particle 54 kDa protein ...    560   e-159
SR542_HORVU (P49969) Signal recognition particle 54 kDa protein ...    558   e-158
...
paralogs
SRPR_MOUSE (Q9DBG7) Signal recognition particle receptor alpha s...     99   3e-20
SRPR_HUMAN (P08240) Signal recognition particle receptor alpha s...     99   3e-20
SRPR_YEAST (P32916) Signal recognition particle receptor alpha s...     98   7e-20

The two kinds of protein evolutionary relationship

Genes or proteins are homologous if they are related by divergence from a common ancestor.

Orthology: Sequences that diverged after a speciation event. Orthologous genes often have the same function in different species.

Paralogy: Sequences that diverged after a gene duplication event. Paralogous genes perform different but related functions within one organism.

[Figure: two panels, "Orthologs" and "Paralogs". Labels: ancestral organism carrying gene X; speciation; gene duplication; Organism A; Organism B; X1, X2; Xa, Xb]

Example of orthology / paralogy relationships

Mouse trypsin       -- orthologs --   Human trypsin
      |                                     |
   paralogs                              paralogs
      |                                     |
Mouse chymotrypsin  -- orthologs --   Human chymotrypsin

The different variants of BLAST

            Query      Database
blastp      Protein    Protein
blastn      DNA        DNA
tblastn     Protein    DNA
blastx      DNA        Protein
tblastx     DNA        DNA

Cited 31998 times since 1990!

When BLAST is too slow:

BLAT

Alignment software specialized for next-generation sequencing technology

BWA, Bowtie, SOAP2

Align reads to a reference genome

[Figure: reads aligned to a reference genome]

Further improvement of computational efficiency - BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start)

Frequently used methods in sequence analysis that are based on sequence alignment

BLAST - searches in databases for sequence similarity
ClustalW - multiple alignment of sequences

Cited 34,646 times!

ClustalW

• Construction of tree based on pairwise alignments
• Progressive alignment guided by tree.

[Figure: guide tree with leaves A, B, C, D, E]

Introduction to the practical "Examining HIV genes and proteins"

[Figure: HIV]


EMBOSS programs in this practical

sixpack
plotorf
dottup - dotplot analysis
water - Smith-Waterman local alignment
needle - Needleman-Wunsch global alignment
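The difference between the global (Needleman-Wunsch) and local (Smith-Waterman) methods behind needle and water is small at the level of the dynamic-programming recurrence. The sketch below computes only the optimal score, and it assumes a simple scheme of match +2, mismatch -1 and a linear gap penalty of -2. The EMBOSS programs use substitution matrices with separate gap-open and gap-extension penalties, so their numbers will differ. The function name `alignment_score` is made up for illustration.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: global alignment, end gaps are penalised (Needleman-Wunsch).
    local=True:  local alignment, cells are floored at 0 and the best cell
                 anywhere in the matrix is returned (Smith-Waterman).
    """
    cols = len(b) + 1
    # First row: cumulative gap penalties (global) or zeros (local).
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


print(alignment_score("CGATAGCATGATGTCT", "CGACAGCATATGTCT"))   # global
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))      # local
```

Changing `gap` shows the effect of gap parameters directly: a harsher penalty discourages gaps and favours mismatches, while a milder one lets the alignment open gaps to pick up extra matches.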

Translation of a nucleotide sequence using ‘sixpack’

          M  A  K  R  K  L  K  K  N  L  K  T  F  V  A  F  S  A  I  T      F1
           W  Q  R  E  S  *  K  R  T  *  K  L  L  L  H  L  V  L  L  L     F2
            G  K  E  K  V  K  K  E  L  K  N  F  C  C  I  *  C  Y  Y  C    F3
        1 ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT 60
          ----:----|----:----|----:----|----:----|----:----|----:----|
        1 TACCGTTTCTCTTTCAATTTTTTCTTGAATTTTTGAAAACAACGTAAATCACGATAATGA 60
           X  A  F  L  F  N  F  F  F  K  F  V  K  T  A  N  L  A  I  V     F6
          X  P  L  S  F  T  L  F  S  S  L  F  K  Q  Q  M  *  H  *  *      F5
            H  C  L  S  L  *  F  L  V  *  F  S  K  N  C  K  T  S  N  S    F4

          A  L  L  L  T  N  G  I  P  I  S  A  L  T  Q  S  S  N  T  T      F1
           L  Y  C  *  L  M  V  F  Q  L  V  L  *  L  S  L  P  I  Q  L     F2
            F  I  V  N  *  W  Y  S  N  *  C  F  N  S  V  F  Q  Y  N  *    F3
       61 GCTTTATTGTTAACTAATGGTATTCCAATTAGTGCTTTAACTCAGTCTTCCAATACAACT 120
          ----:----|----:----|----:----|----:----|----:----|----:----|
       61 CGAAATAACAATTGATTACCATAAGGTTAATCACGAAATTGAGTCAGAAGGTTATGTTGA 120
           A  K  N  N  V  L  P  I  G  I  L  A  K  V  *  D  E  L  V  V     F6
          Q  K  I  T  L  *  H  Y  E  L  *  H  K  L  E  T  K  W  Y  L      F5
            S  *  Q  *  S  I  T  N  W  N  T  S  *  S  L  R  G  I  C  S    F4

          E  I  T  S  Q  A  T  T  G  L  R  N  V  M  Y  Y  G  D  W  S      F1
           R  L  L  H  K  L  L  Q  G  Y  V  M  *  C  I  M  V  T  G  L     F2
            D  Y  F  T  S  Y  Y  R  V  T  *  C  N  V  L  W  *  L  V  Y    F3
      121 GAGATTACTTCACAAGCTACTACAGGGTTACGTAATGTAATGTATTATGGTGACTGGTCT 180
          ----:----|----:----|----:----|----:----|----:----|----:----|
      121 CTCTAATGAAGTGTTCGATGATGTCCCAATGCATTACATTACATAATACCACTGACCAGA 180
           S  I  V  E  C  A  V  V  P  N  R  L  T  I  Y  *  P  S  Q  D     F6
          Q  S  *  K  V  L  *  *  L  T  V  Y  H  L  T  N  H  H  S  T      F5
            L  N  S  *  L  S  S  C  P  *  T  I  Y  H  I  I  T  V  P  R    F4
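What sixpack does can be sketched as translating the sequence in three forward frames and three frames of the reverse complement. The code below assumes the standard genetic code and uses its own frame labels (F1 to F3 forward, R1 to R3 reverse); it does not try to reproduce sixpack's layout or its F4 to F6 numbering. All function names are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return a dict mapping frame label to protein sequence."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"F{offset + 1}"] = translate(dna[offset:])
        frames[f"R{offset + 1}"] = translate(rc[offset:])
    return frames


# First 60 bases of the sequence shown in the sixpack output.
seq = "ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT"
for label, protein in six_frames(seq).items():
    print(label, protein)
```

Frame F1 of this fragment gives MAKRKLKKNLKTFVAFSAIT, matching the first line of the sixpack output; stop codons appear as `*`.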

Plotorf to show open reading frames (in this case an ORF is defined as starting with an AUG codon)

[Figure: plotorf output with the following ORFs annotated]

Unnamed protein 416-1522
Ribosomal protein S16 1771-2019
tRNA methyltransferase 2617-3384
Ribosomal protein L19 3426-3773
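The ORF definition used here (start at AUG, written ATG in the DNA sequence, and read to the next in-frame stop codon) can be sketched as a scan over the three forward frames. This illustrates the definition only, not plotorf's implementation; the function name `find_orfs`, the `min_codons` parameter and the toy sequence are assumptions. Reverse-strand ORFs could be found by running the same scan on the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Find ORFs in the three forward frames.

    Returns (start, end, frame) tuples with 1-based inclusive coordinates;
    `end` is the last base of the stop codon. Only the first ATG after a
    stop is used as a start, and ORFs without a stop codon are ignored.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                          # open a new ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None                       # close it at the stop
    return orfs


toy = "CCATGAAATAGCC"   # toy sequence: ATG AAA TAG in frame 3
print(find_orfs(toy))
```

Raising `min_codons` filters out the many short ORFs that occur by chance, which is why long ORFs such as those annotated in the plotorf figure stand out.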


[Figure: Gag; Gag-Pol fusion (5%)]

Global alignment of mRNA sequence to genomic DNA sequence

Effect of gap parameters

[Figure: genomic DNA aligned with mature, spliced mRNA]

Dot plot analysis (dottup) reveals repeats

Introduction to the "Exercises with biological sequences - examining HIV genes and proteins" - biological questions addressed with BLAST and ClustalX.

BLAST - search databases for sequence similarity

* Identifying homologous proteins.
* Non-viral homologues to any HIV proteins?
* Are we able to identify a relationship between human HIV and the monkey SIV?

ClustalX - multiple sequence alignment

* Identifying amino acids involved in drug resistance.
* What is the relationship between HIV and monkey SIV?
* Using a multiple alignment to compute a phylogenetic tree.
