Introduction to Bioinformatics Outline

Pairwise Sequence Alignment
Prof. Dr. Nizamettin AYDIN
[email protected]
Copyright 2000 N. AYDIN. All rights reserved.

Outline
• Introduction to sequence alignment
• Pairwise sequence alignment
  – The Dot Matrix
  – Scoring Matrices
  – Gap Penalties
  – Dynamic Programming

Introduction to sequence alignment
• In molecular biology, a common question is to ask whether or not two sequences are related.
• The most common way to tell whether or not they are related is to compare them to one another to see if they are similar.
• Question:
  – Are two sequences related?
• Compare the two sequences,
  – see if they are similar

Sequence Alignment
• Sequence alignment
  – the identification of residue-residue correspondences.
• It is the basic tool of bioinformatics.
• Example:
  – pear and tear
    • Similar words, different meanings

Biological Sequences
• Similar biological sequences tend to be related
• Information:
  – Functional
  – Structural
  – Evolutionary
• Common mistake:
  – sequence similarity is not homology!
• Homologous sequences:
  – derived from a common ancestor

Relation of sequences
• Homologs:
  – similar sequences in 2 different organisms derived from a common ancestor sequence.
• Orthologs:
  – similar sequences in 2 different organisms that have arisen due to a speciation event. Functionality retained.
• Paralogs:
  – similar sequences within a single organism that have arisen due to a gene duplication event.
• Xenologs:
  – similar sequences that have arisen out of horizontal transfer events (symbiosis, viruses, etc.)

Use Protein Sequences for Similarity Searches
• DNA sequences tend to be less informative than protein sequences
• 4 DNA bases vs. 20 amino acids: less chance similarity
• Similarity of amino acids can be scored
  – number of mutations, chemical similarity, PAM matrix
• Protein databanks are much smaller than DNA databanks
  – fewer random matches.
• Similarity is determined by pairwise alignment of different sequences

Image source (Relation of sequences figure): http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Pairwise Alignment
• The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
• There are lots of possible alignments.
• Two sequences can always be aligned.
• Sequence alignments have to be scored.
• Often there is more than one solution with the same score.

Sequence Alignment
The concept
• An alignment is a mutual arrangement of two sequences.
  – Pairwise sequence alignment
• It exhibits where the two sequences are similar, and where they differ.
• An optimal alignment is one that exhibits the most correspondences and the fewest differences.
• Sequences that are similar probably have the same function

Sequence Alignment
Terms of sequence comparison
• Sequence identity
  – exactly the same amino acid or nucleotide in the same position
• Sequence similarity
  – substitutions with similar chemical properties
• Sequence homology
  – general term that indicates evolutionary relatedness among sequences
  – sequences are homologous if they are derived from a common ancestral sequence

Sequence Alignment
Things to consider:
• to find the best alignment one needs to examine all possible alignments
• to reflect the quality of the possible alignments one needs to score them
• there can be different alignments with the same highest score
• variations in the scoring scheme may change the ranking of alignments
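The point that variations in the scoring scheme may change the ranking of alignments can be illustrated with a short Python sketch. It assumes the simplest column-wise scheme, with one fixed score each for a match, a mismatch, and an indel. The function name score_alignment, the toy sequences ACGT and CGTA, and the harsher gap penalty of -10 are assumptions made for illustration only.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, indel=-2):
    """Sum per-column scores of two equal-length aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            total += indel        # gap in either row
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # substitution
    return total


# Two candidate alignments of the toy sequences ACGT and CGTA (hypothetical example).
ungapped = ("ACGT", "CGTA")     # 4 mismatches, no gaps
gapped = ("ACGT-", "-CGTA")     # 3 matches, 2 indels

for indel in (-2, -10):         # mild vs. harsh gap penalty (both assumed values)
    s_ungapped = score_alignment(*ungapped, indel=indel)
    s_gapped = score_alignment(*gapped, indel=indel)
    better = "gapped" if s_gapped > s_ungapped else "ungapped"
    print(f"indel={indel}: ungapped={s_ungapped}, gapped={s_gapped}, better={better}")
```

With the mild penalty the gapped alignment ranks first; with the harsh penalty the ranking flips, although the two alignments themselves are unchanged.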
Sequence Alignment
Evolution:
• Ancestral sequence: ABCD
  – ACCD (B → C): mutation
  – ABD (C → ø): deletion
• Pairwise alignment:
    ACCD        ACCD
    AB-D   or   A-BD
    true alignment

Sequence Alignment
• A protein sequence alignment
    MSTGAVLIY--TSILIKECHAMPAGNE-----
    ---GGILLFHRTHELIKESHAMANDEGGSNNS
       *  *    *  **** ***
• A DNA sequence alignment
    attcgttggcaaatcgcccctatccggccttaa
    att---tggcggatcg-cctctacgggcc----
    ***   ****  **** **    * ****

Hamming or edit distance
• The simplest method of determining sequence similarity is to determine the edit distance between two sequences
• If we take the example of pear and tear, how similar are these two words?
• An alignment of these two is as follows:
    P E A R
      | | |
    T E A R

Hamming Distance
• Minimum number of letters by which two words differ
• Calculated by summing the number of mismatches between two words of equal length
• Hamming distance between PEAR and TEAR is 1

Gapped Alignments
• With biological sequences, it is often necessary to align two sequences that
  – are of different lengths,
  – have regions that have been inserted or deleted over time.
• Thus, the notion of gaps needs to be introduced.
  – gaps denoted by '-'
• Consider the words alignment and ligament.
  – One alignment of these two words is as follows:
    A L I G N M E N T
      | | |   | | | |
    - L I G A M E N T

Possible Residue Alignments
• An alignment can produce one of the following:
  – a match between two characters
  – a mismatch between two characters
    • also called a substitution or mutation
  – a gap in the first sequence
    • which can be thought of as the deletion of a character in the first sequence
  – a gap in the second sequence
    • which can be thought of as the insertion of a character in the first sequence

Alignments
• Consider the following two nucleic acid sequences:
  – ACGGACT and ATCGGATCT.

Alignment Scoring Scheme
• One way to judge an alignment is to assign
  – a + score for each match,
  – a - score for each mismatch,
  – a - score for each insertion/deletion (indel).
• The following are two valid alignments:
  Alignment 1:
    A - C - G G - A C T
    |   |   |       | |
    A T C G G A T - C T
  Alignment 2:
    A T C G G A T C T
    |   | | |     | |
    A - C G G - A C T
• Which alignment is the better alignment?
• Possible scoring scheme:
    match: +2   mismatch: -1   indel: -2
  – Alignment 1:
    • 5 * 2 - 1(1) - 4(2) = 10 - 1 - 8 = 1
  – Alignment 2:
    • 6 * 2 - 1(1) - 2(2) = 12 - 1 - 4 = 7
• Using the above scoring scheme, the 2nd alignment is the better alignment,
  – since it produces a higher alignment score.

Alignment Methods
• Visual
• Brute Force
• Dynamic Programming
• Word-Based (k tuple)

Visual Alignments (Dot Plots)
• One of the basic techniques for determining the alignment between two sequences is a visual alignment known as a dot plot.
• Matrix
  – Rows:
    • Characters in one sequence
  – Columns:
    • Characters in second sequence
• Filling
  – Loop through each row;
    • if the characters in the row and column match, fill in the cell
  – Continue until all cells have been examined

The Dot Matrix
• established in 1970 by A.J. Gibbs and G.A. McIntyre
• method for comparing two amino acid or nucleotide sequences
• each sequence builds one axis of the grid
• one puts a dot at the intersection of the same letters appearing in both sequences
• scan the graph for a series of dots
  – reveals similarity
  – or a string of same characters
• longer sequences can also be compared on a single page, by using smaller dots

Example Dot Plot
[Dot plot grid: columns A G C T A G G A, rows G A C T A G G C]

Information within Dot Plots
• Dot plots are useful as a first-level filter for determining an alignment between two sequences.

An entire software module of a telecommunications switch; about two million lines of C
• Darker areas indicate regions with a lot of matches
  – a high degree of similarity
• Lighter areas indicate regions with few matches
  – a low degree of similarity
• Dark areas along the main diagonal indicate sub-modules.
• Dark areas off the main diagonal indicate a degree of similarity between sub-modules.
• The largest dark squares are formed by redundancies in initializations of signal-tables and finite-state machines.

• A dot plot reveals the presence of insertions or deletions
• Comparing a single sequence to itself can reveal the presence of a repeat of a subsequence
  – Inverted repeats = reverse complement
    • Used to determine folding of RNA molecules
• Self comparison can reveal several features:
  – similarity between chromosomes
  – tandem genes
  – repeated domains in a protein sequence
  – regions of low sequence complexity (same characters are often repeated)

Insertions/Deletions
[Figures: two similar, but not identical, sequences; an indel (insertion or deletion)]
• Regions containing insertions/deletions can be readily determined.
• One potential application is to determine the number of coding regions (exons) contained within a processed mRNA.

Duplication; Repeats/Inverted Repeats
[Figures: a tandem duplication; self-dotplot of a tandem duplication]

Repeats/Inverted Repeats; The Dot Matrix
[Figures: an inversion; self dot plot with repeats; joining sequences]

Comparing Genome Assemblies
• Dot plots can also be used to compare two different assemblies of the same sequence.
• Below are three dot plots of various chromosomes.
• The 1st shows two separate assemblies of human chromosome 5 compared against each other.
• The 2nd shows one assembly of chromosome 5 compared against itself, indicating the presence of repetitive regions.
• The 3rd shows chromosome Y compared against itself, indicating the presence of inverted repeats.
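The filling rule from the Visual Alignments (Dot Plots) slide can be sketched in a few lines of Python. This is a minimal sketch, not the original Gibbs and McIntyre program: the function names dot_matrix and render are hypothetical, and the two sequences are the axis labels of the Example Dot Plot slide (rows GACTAGGC, columns AGCTAGGA).

```python
def dot_matrix(row_seq, col_seq):
    """Return a grid of booleans: cell [i][j] is True when row_seq[i] == col_seq[j]."""
    return [[r == c for c in col_seq] for r in row_seq]


def render(row_seq, col_seq):
    """Format the dot matrix as text, one line per row character ('*' = dot)."""
    grid = dot_matrix(row_seq, col_seq)
    lines = ["  " + " ".join(col_seq)]            # column header
    for r, cells in zip(row_seq, grid):
        lines.append(r + " " + " ".join("*" if hit else "." for hit in cells))
    return "\n".join(lines)


# Sequences taken from the axis labels of the Example Dot Plot slide.
print(render("GACTAGGC", "AGCTAGGA"))
print()
# Comparing a sequence with itself always fills the main diagonal.
print(render("AGCTAGGA", "AGCTAGGA"))
```

In the first plot the shared substring CTAGG appears as an unbroken run of dots along the main diagonal, which is the kind of series of dots the slides say to scan for.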
Noise in Dot Plots
[Figures: the very stringent self-dotplot; the non-stringent self-dotplot]
• Stringency is the quality or state of being stringent.
• stringent:

• Nucleic Acids (DNA, RNA)
  – 1 out of 4 bases matches at random
• To filter out random matches,
  – sliding windows are used
  – Percentage of bases matching in the window is set as threshold
• A dot is printed only if a minimal number of matches occur
• Rule of thumb:
  – larger windows for DNAs (only 4 bases, more random matches)
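The sliding-window filter can be sketched as follows. This is a minimal Python sketch under stated assumptions: the window is laid along the diagonal starting at each cell, the threshold is expressed as a count of matching positions rather than a percentage, and the names filtered_dot_matrix, window, and stringency as well as the example values are illustrative choices, not parameters given on the slides.

```python
def filtered_dot_matrix(row_seq, col_seq, window=3, stringency=2):
    """Dot matrix with a sliding-window filter.

    Cell [i][j] is True when at least `stringency` of the `window`
    diagonal pairs (row_seq[i+k], col_seq[j+k]), k = 0..window-1, match.
    Cells whose window would run off either sequence are left False.
    """
    if not 1 <= stringency <= window:
        raise ValueError("need 1 <= stringency <= window")
    n_rows, n_cols = len(row_seq), len(col_seq)
    grid = [[False] * n_cols for _ in range(n_rows)]
    for i in range(n_rows - window + 1):
        for j in range(n_cols - window + 1):
            matches = sum(row_seq[i + k] == col_seq[j + k] for k in range(window))
            grid[i][j] = matches >= stringency
    return grid


def count_dots(grid):
    """Number of cells that would be printed as dots."""
    return sum(cell for row in grid for cell in row)


seq = "AGCTAGGA"   # column sequence of the Example Dot Plot slide, compared with itself
plain = filtered_dot_matrix(seq, seq, window=1, stringency=1)    # no filtering
strict = filtered_dot_matrix(seq, seq, window=3, stringency=3)   # 3 of 3 must match
print(count_dots(plain), count_dots(strict))
```

With window=1 and stringency=1 the function reduces to the plain fill rule, so the difference between the two counts is the number of isolated matches that the stricter setting suppresses.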
