Introduction to Bioinformatics Outline

Introduction to Bioinformatics Outline • Introduction to sequence alignment • pair wise sequence alignment Pairwise Sequence Alignment – The Dot Matrix – Scoring Matrices Prof. Dr. Nizamettin AYDIN – Gap Penalties – Dynamic Programming [email protected] 1 2 Introduction to sequence alignment Sequence Alignment • In molecular biology, a common question is to • Sequence Alignment ask whether or not two sequences are related. – the identification of residue-residue • The most common way to tell whether or not correspondences. they are related is to compare them to one • It is the basic tool of bioinformatics. another to see if they are similar. • Example: • Question: – pear and tear – Are two sequences related? • Similar words, different meanings • Compare the two sequences, – see if they are similar 3 4 Biological Sequences Relation of sequences • Similar biological sequences tend to be related • Homologs: – similar sequences in 2 different organisms derived from a common • Information: ancestor sequence. – Functional • Orthologs: – Structural – Similar sequences in 2 different organisms that have arisen due to a – Evolutionary speciation event. Functionality Retained. • Paralogs: • Common mistake: – Similar sequences within a single organism that have arisen due to a gene duplication event. – sequence similarity is not homology! • Xenologs: • Homologous sequences: – similar sequences that have arisen out of horizontal transfer events – derived from a common ancestor (symbiosis, viruses, etc) 5 6 Copyright 2000 N. AYDIN. All rights reserved. 1 Relation of sequences Use Protein Sequences for Similarity Searches • DNA sequences tend to be less informative than protein sequences • DNA bases vs. 20 amino acids - less chance similarity • Similarity of AAs can be scored – # of mutations, chemical similarity, PAM matrix • Protein databanks are much smaller than DNA databanks – less random matches. • Similarity is determined by pairwise alignment of Image Source: different sequences http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html 7 8 Pairwise Alignment Sequence Alignment The concept • The alignment of two sequences (DNA or protein) is a relatively straightforward • An alignment is a mutual arrangement of two computational problem. sequences. • There are lots of possible alignments. – Pairwise sequence alignment • Two sequences can always be aligned. • It exhibits where the two sequences are similar, and where they differ. • Sequence alignments have to be scored. • An optimal alignment is one that exhibits the • Often there is more than one solution with the most correspondences, and the least same score. differences. • Sequences that are similar probably have the same function 9 10 Sequence Alignment Sequence Alignment Terms of sequence comparison Things to consider: • Sequence identity • to find the best alignment one needs to examine – exactly the same Amino Acid or Nucleotide in the all possible alignments same position • to reflect the quality of the possible alignments • Sequence similarity one needs to score them – substitutions with similar chemical properties • there can be different alignments with the same • Sequence homology highest score – general term that indicates evolutionary relatedness • variations in the scoring scheme may change the among sequences ranking of alignments – sequences are homologous if they are derived from a common ancestral sequence 11 12 Copyright 2000 N. AYDIN. All rights reserved. 2 Sequence Alignment Sequence Alignment Evolution: A protein sequence alignment Ancestral sequence: ABCD MSTGAVLIY--TSILIKECHAMPAGNE----- ---GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** *** ACCD (B C) ABD (C ø) mutation deletion A DNA sequence alignment attcgttggcaaatcgcccctatccggccttaa ACCD or ACCD Pairwise Alignment att---tggcggatcg-cctctacgggcc---- AB─D A─BD *** **** **** ** ****** true alignment 13 14 Hamming or edit distance Hamming Distance • Simplest method in determining sequence • Minimum number of letters by which two similarity is to determine the edit distance words differ between two sequences • If we take the example of pear and tear, how • Calculated by summing number of mismatches similar are these two words? • An alignment of these two is as follows: • Hamming Distance between PEAR and TEAR is 1 P E A R | | | T E A R 15 16 Gapped Alignments Possible Residue Alignments • With biological sequences, it is often necessary to • An alignment can produce one of the align two sequences that are of – different lengths, following: – that have regions that have been inserted or deleted over – a match between two characters time. • Thus, the notion of gaps needs to be introduced. – a mismatch between two characters – gaps denoted by ‘-’ • also called a substitution or mutation • Consider the words alignment and ligament. – a gap in the first sequence – One alignment of these two words is as follows: • which can be thought of as the deletion of a character in A L I G N M E N T the first sequence | | | | | | | – a gap in the second sequence - L I G A M E N T • which can be thought of as the insertion of a character in the first sequence 17 18 Copyright 2000 N. AYDIN. All rights reserved. 3 Alignments Alignment Scoring Scheme • Consider the following two nucleic acid sequences: • One way to judge this is to assign – ACGGACT and ATCGGATCT. – a + score for each match, • The followings are two valid alignments: – a - score for each mismatch, – a - score for each insertion/deletion (indels). A – C – G G – A C T • Possible scoring scheme: | | | | | match: +2 mismatch: -1 indel: –2 A T C G G A T _ C T – Alignment 1: • 5 * 2 – 1(1) – 4(2) = 10 – 1 – 8 = 1 – Alignment 2: A T C G G A T C T • 6 * 2 – 1(1) – 2 (2) = 12 – 1 – 4 = 7 | | | | | | • Using the above scoring scheme, the 2nd alignment is a better alignment, A – C G G – A C T – since it produces a higher alignment score. • Which alignment is the better alignment? 19 20 Alignment Methods Visual Alignments (Dot Plots) • Visual • One of basic techniques for determining the alignment between two sequences is by using a visual alignment • Brute Force known as dot plots. • Dynamic Programming • Matrix – Rows: • Word-Based (k tuple) • Characters in one sequence – Columns: • Characters in second sequence • Filling – Loop through each row; • if character in row-column match, fill in the cell – Continue until all cells have been examined 21 22 The Dot Matrix Example Dot Plot • established in 1970 by A.J. Gibbs and G.A.McIntyre • method for comparing two amino acid or nucleotide sequences A G C T A G G A • each sequence builds one axis of G the grid A • one puts a dot, at the intersection of same letters appearing in both C sequences T • scan the graph for a series of dots A • reveals similarity • or a string of same G characters G • longer sequences can also be compared on a single page, by C using smaller dots 23 24 Copyright 2000 N. AYDIN. All rights reserved. 4 An entire software module of a telecommunications switch; about two million lines of C Information within Dot Plots • Darker areas indicate regions with a lot of • Dot plots are useful as a first-level filter for determining an matches alignment between two sequences. – a high degree of similarity • Lighter areas indicate – It reveals the presence of insertions or deletions regions with few matches • Comparing a single sequence to itself can reveal the presence – a low degree of similarity • Dark areas along the main of a repeat of a subsequence diagonal indicate sub- – Inverted repeats = reverse complement modules. • Used to determine folding of RNA molecules • Dark areas off the main diagonal indicate a degree • Self comparison can reveal several features: of similarity between sub- – similarity between chromosomes modules. – tandem genes • The largest dark squares are formed by redundancies – repeated domains in a protein sequence in initializations of signal- – regions of low sequence complexity (same characters are often tables and finite-state repeated) machines. 25 26 Insertions/Deletions Insertions/Deletions Two similar, but not identical, sequences An indel (insertion or deletion): • Regions containing insertions/deletions can be readily determined. • One potential application is to determine the number of coding regions (exons) contained within a processed mRNA. 27 28 Duplication Repeats/Inverted Repeats A tandem duplication: Self-dotplot of a tandem duplication: 29 30 Copyright 2000 N. AYDIN. All rights reserved. 5 Repeats/Inverted Repeats The Dot Matrix An inversion: Self dot plot with repeats: Joining sequences: 31 32 Comparing Genome Assemblies • Dot plots can also be used in order to compare two different assemblies of the same sequence. • Below are three dotplots of various chromosomes. • The 1st shows two separate assemblies of human chromosome 5 compared against each other. • The 2nd shows one assembly of chromosome 5 compared against itself, indicating the presence of repetitive regions. • The 3rd shows chromosome Y compared against itself, indicating the presence of inverted repeats. 33 34 Noise in Dot Plots Noise in Dot Plots The very stringent, self-dotplot: The non-stringent self-dotplot: • Nucleic Acids (DNA, RNA) – 1 out of 4 bases matches at random • To filter out random matches, – sliding windows are used – Percentage of bases matching in the window is set as threshold • A dot is printed only if a minimal number of matches occur • Rule of thumb: – larger windows for DNAs (only 4 bases, more random • Stringency is the quality or state of being stringent. matches) • stringent:

Load more