Sequence Analysis

• Some algorithms analyze biological sequences for patterns
– RNA splice sites
– Open reading frames (ORFs): stretches of codons
– Propensities in a sequence
– Conserved regions in
• AA sequences (possible active site)
• DNA/RNA (possible protein binding site)

• Others make predictions based on sequence – Protein/RNA secondary structure folding

1 Bioinformatics Sequence Driven Problems
• Genomics
– Fragment assembly of the DNA sequence
• Not possible to read the entire sequence at once.
• The sequence is cut up into small fragments using restriction enzymes.
• Then fragment assembly is needed: fragments are matched by their overlapping similarities.
• An NP-complete problem.
– Finding genes
• Identify open reading frames
– Introns are spliced out.
– Junk in between genes

• Proteomics
– Identification of functional domains in a protein's sequence
• Determining the functional pieces of the sequence
– Protein folding
• 1D sequence → 3D structure
• What drives this process?

2 Genome is Sequenced, What's Next?
• Tracing phylogeny
– Finding family relationships by tracking similarities between species.
• Gene annotation (cooperative genomics)
– Comparison of similar species.
• Determining regulatory networks
– The variables that determine how the body reacts to certain stimuli.
• Proteomics
– From DNA sequence to a folded protein.

3 Modeling
• Modeling biological processes tells us whether we understand a given process.
• Because of the large number of variables in biological problems, powerful computers are needed to analyze certain questions.
• Protein modeling:
– Quantum chemistry imaging algorithms of active sites allow us to view possible bonding and reaction mechanisms.
– Homologous protein modeling is a comparative proteomic approach to determining an unknown protein's tertiary structure.
• Regulatory network modeling:
– Microarray experiments allow us to compare differences in expression between two different states.
– Algorithms for clustering groups of gene expression help point out possible regulatory networks (e.g. WGCNA).
– Other algorithms perform statistical analysis to improve signal-to-noise contrast.
• Systems biology modeling:
– Predictions of whole-cell interactions (organelle processes, expression modeling).
– Currently feasible for specific processes (e.g. Flux Balance Analysis).

4 Back to the problem

5 Pairwise Sequence Alignment Methods

• Visual
• Brute Force
• Dynamic Programming
• Word-Based (k-tuple)

6 Visual Alignments (Dot Plots)

• Matrix – Rows: Characters in one sequence – Columns: Characters in second sequence

• Filling
– Loop through each row; if the characters of the row and column match, fill in the cell
– Continue until all cells have been examined

7 The Dot Matrix

• Established in 1970 by A.J. Gibbs and G.A. McIntyre
• A method for comparing two amino acid or nucleotide sequences

[Dot plot grid: the sequence AGCTAGGA along the horizontal axis and GACTAGGC down the vertical axis]

• each sequence builds one axis of the grid
• one puts a dot at the intersection of the same letters appearing in both sequences
• scan the graph for a series of dots: it reveals similarity, or a string of the same characters
• longer sequences can also be compared on a single page, by using smaller dots

8 Example Dot Plot

9 An entire software module of a telecommunications switch; about two million lines of C

Darker areas indicate regions with many matches (a high degree of similarity). Lighter areas indicate regions with few matches (a low degree of similarity). Dark areas along the main diagonal indicate sub-modules. Dark areas off the main diagonal indicate a degree of similarity between sub-modules. The largest dark squares are formed by redundancies in initializations of signal tables and finite-state machines.

10 The Dot Matrix

The very stringent, self-dotplot: The non-stringent self-dotplot:

11 Noise in Dot Plots
• Nucleic acids (DNA, RNA)
– 1 out of 4 bases matches at random
• Stringency (the condition under which a DNA sequence can bind to related or non-specific sequences; for example, high temperature and lower salt increase stringency, so that non-specific binding, or binding with a low melting temperature, dissociates)
– A window size is chosen
– The percentage of bases matching within the window is set as a threshold
• To filter out random matches, one uses sliding windows
• A dot is printed only if a minimal number of matches occurs
• Rule of thumb:
– larger windows for nucleic acids (only 4 bases, so more random matches)
– a typical window size is 15 with a stringency of 10

12 Reduction of Dot Plot Noise

Self alignment of ACCTGAGCTCACCTGAGTTA

13 The Dot Matrix Two similar, but not identical, sequences An indel (insertion or deletion):

14 The Dot Matrix Self-dotplot of a tandem duplication: A tandem duplication:

15 An inversion: The Dot Matrix Joining sequences:

16 The Dot Matrix

Self dotplot with repeats:

17 The Dot Matrix • the dot matrix method reveals the presence of insertions or deletions • comparing a single sequence to itself can reveal the presence of a repeat of a subsequence – Inverted repeats = reverse complement • Used to determine folding of RNA molecules

• self comparison can reveal several features: – similarity between chromosomes – tandem genes – repeated domains in a protein sequence – regions of low sequence complexity (same characters are often repeated)

18 Insertions/Deletions

19 Repeats/Inverted Repeats

20 Comparing Genome Assemblies

21 Chromosome Y self comparison

22 Available Dot Plot Programs

• Vector NTI software package (under AlignX)
• GCG software package:
– Compare: http://www.hku.hk/bruhk/gcgdoc/compare.html
– DotPlot+: http://www.hku.hk/bruhk/gcgdoc/dotplot.html
• Dotlet (Java applet): http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• Dottup: http://bioweb.pasteur.fr/cgi-bin/seqanal/dottup.pl
• Dotter: http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
• EMBOSS: DotMatcher, DotPath, DotUp

27 The Dot Matrix

• When to use the dot matrix method?
– As a first exploratory comparison, unless the sequences are already known to be very much alike
• Limits of the dot matrix
– it does not readily resolve similarity that is interrupted by insertions or deletions
– it is difficult to find the best possible (optimal) alignment
– most computer programs do not show an actual alignment

28 Dot Plot References

Gibbs, A. J. & McIntyre, G. A. (1970). The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1-11.

Staden, R. (1982). An interactive graphics program for comparing and aligning nucleic-acid and amino-acid sequences. Nucleic Acids Res. 10(9), 2951-2961.

29 Next step
We must define quantitative measures of sequence similarity and difference!
• Hamming distance:
– # of positions with mismatching characters

A G T C
C G T A        Hamming distance = 2

• Levenshtein (or edit) distance:
– # of operations required to change one string into the other (deletion, insertion, substitution)

A G - T C C
C G C T C A    Levenshtein distance = 3
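Both measures are easy to compute directly; a minimal R sketch (the function name hamming is mine, not from the slides; adist() is base R's edit-distance helper):

  hamming <- function(x, y) {
    # Hamming distance: count positions that differ between equal-length strings
    stopifnot(nchar(x) == nchar(y))
    sum(strsplit(x, "")[[1]] != strsplit(y, "")[[1]])
  }
  hamming("AGTC", "CGTA")   # 2
  adist("AGTCC", "CGCTCA")  # 3 (Levenshtein: substitutions, insertions, deletions)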

30 Scoring

• +1 for a match, -1 for a mismatch?
• Should gaps be allowed?
– If yes, how should they be scored?
• What is the best algorithm for finding the optimal alignment of two sequences?
• Is the produced alignment significant?

31 Scoring Matrices

• match/mismatch score – Not bad for similar sequences – Does not show distantly related sequences

• Likelihood matrix – Scores residues dependent upon likelihood substitution is found in nature – More applicable for amino acid sequences

32 Parameters of Sequence Alignment

Scoring Systems: • Each symbol pairing is assigned a numerical value, based on a symbol comparison table.

Gap Penalties: • Opening: The cost to introduce a gap • Extension: The cost to elongate a gap

33 DNA Scoring Systems -very simple

Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact

    A  G  C  T
A   1  0  0  0        Match: 1
G   0  1  0  0        Mismatch: 0
C   0  0  1  0        Score = 5
T   0  0  0  1

34 Protein Scoring Systems

Sequence 1: PTHPLASKTQILPEDLASEDLTI        T:G = -2
Sequence 2: PTHPLAGERAIGLARLAEEDFGM        T:T = 5
                                           Score = 48

A scoring matrix is a table of values that describe the probability of a residue pair occurring in an alignment.

Scoring matrix:
    C   S   T   P   A   G   N   D  ..
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   5
D  -3   0  -1  -1  -2  -1   1   6

35 Amino acid exchange matrices

Amino acids are not equal:
1. Some are easily substituted because they have similar:
• physico-chemical properties
• structure
2. Some mutations between amino acids occur more often due to similar codons
The two observations above give us ways to define substitution matrices

36 Properties of Amino Acids • Substitutions with similar chemical properties • Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.

37 Scoring Matrices

• table of values that describe the probability of a residue pair occurring in an alignment

• the values are logarithms of ratios of two probabilities

1. probability of random occurrence of an amino acid (diagonal) 2. probability of meaningful occurrence of a pair of residues

38 Protein Scoring Systems

• Scoring matrices reflect:
– the # of mutations required to convert one residue to another
– chemical similarity
– observed mutation frequencies
• Log-odds matrices:
– the values are logarithms of the ratio of the probability of an aligned pair to the probability of that pair under a random alignment
• Widely used scoring matrices:
– PAM
– BLOSUM

39 Scoring Matrices
Widely used matrices
• PAM (Percent Accepted Mutation) / MDM (Mutation Data Matrix) / Dayhoff
– Derived from global alignments of closely related sequences.
– Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
– The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
– PAM-1 corresponds to about 1 million years of evolution.
– For distant (global) alignments, use BLOSUM50, Gonnet, or (still) PAM250.

• BLOSUM (Blocks Substitution Matrix)
– Derived from local, ungapped alignments of distantly related sequences.
– All matrices are directly calculated; no extrapolations are used.
– The number after the matrix (BLOSUM62) refers to the percent identity at which the blocks' sequences were clustered when building the matrix; greater numbers are lesser distances.
– The BLOSUM series of matrices generally performs better than PAM matrices for local similarity searches.
– For local alignment, BLOSUM62 is often superior.

• Structure-based matrices

• Specialized Matrices

40 Scoring Matrices: the relationship between BLOSUM and PAM substitution matrices

• BLOSUM matrices with higher numbers and PAM matrices with low numbers are designed for comparisons of closely related sequences. • BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html

41 Percent Accepted Mutation (PAM or Dayhoff) Matrices
• Studied by Margaret Dayhoff (Dayhoff et al., 1978)
• Amino acid substitutions
– Alignment of common protein sequences
– 1,572 amino acid substitutions
– 71 groups of proteins, 85% similar
• Derived from global alignments of protein families; family members share at least 85% identity
• "Accepted" mutations: those that do not negatively affect a protein's fitness
• Similar sequences organized into phylogenetic trees
• Number of amino acid changes counted
• Relative mutabilities evaluated
• 20 × 20 amino acid substitution matrix calculated

42 Percent Accepted Mutation (PAM or Dayhoff) Matrices

• PAM 1: 1 accepted mutation event per 100 amino acids; PAM 250: 250 mutation events per 100 amino acids

• PAM 1 matrix can be multiplied by itself N times to give transition matrices for sequences that have undergone N mutations

• PAM 250: 20% similar; PAM 120: 40%; PAM 80: 50%; PAM 60: 60%

43 PAM1 matrix: normalized probabilities multiplied by 10,000

    Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
      A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A  9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R     1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N     4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D     6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C     1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q     3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E    10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G    21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H     1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I     2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L     3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
K     2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
M     1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
F     1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
P    13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
S    28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
T    22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
W     0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Y     1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
V    13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

44 Log Odds Matrices

• PAM matrices converted to log-odds matrix – Calculate odds ratio for each substitution • Taking scores in previous matrix • Divide by frequency of amino acid – Convert ratio to log10 and multiply by 10 – Take average of log odds ratio for converting A to B and converting B to A – Result: Symmetric matrix

45 PAM 250

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A    2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R   -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N    0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D    0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C   -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q    0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E    0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G    1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H   -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I   -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L   -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K   -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M   -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F   -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P    1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S    1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T    1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W   -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y   -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V    0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B    2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z    1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6

A value of 0 indicates the frequency of alignment is random

46 PAM250 log-odds matrix: log(freq(observed)/freq(expected))

47 Blocks Amino Acid Substitution Matrices (BLOSUM)

• Larger set of sequences considered • Sequences organized into signature blocks • Consensus sequence formed – 60% identical: BLOSUM 60 – 80% identical: BLOSUM 80

48 BLOSUM (Blocks Substitution Matrix)

• Derived from alignments of domains of distantly related proteins (Henikoff & Henikoff,1992).

Example block column (five sequences): A, A, C, E, C

• Occurrences of each amino acid pair in each column of each block alignment are counted:
A - C = 4
A - E = 2
C - E = 2
A - A = 1
C - C = 1
• The numbers derived from all blocks were used to compute the BLOSUM matrices.

49 BLOSUM (Blocks Substitution Matrix)

• Sequences within blocks are clustered according to their level of identity.

• Clusters are counted as a single sequence.

• Different BLOSUM matrices differ in the percentage of sequence identity used in clustering.

• The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix.

• Greater numbers mean smaller evolutionary distance.

50 AMINO ACID FREQUENCY

• Genetic information contained in mRNA is in the form of codons, which are translated into amino acids, which then combine to form proteins.
• At certain sites in a protein's structure, amino acid composition is not critical.
• Yet certain amino acids occur at such sites up to six times more often than other amino acids.
• Are the frequencies of particular amino acids simply a consequence of random permutations of the bases, or instead a product of natural selection?

51 AMINO ACID FREQUENCY

• If a particular amino acid is in some way adaptive, then it should occur more frequently than expected by chance. • This can easily be tested by calculating the expected frequencies of amino acids and comparing to observed. • The codons and observed frequencies of particular amino acids are given in the next slide.

52 The codons and observed frequencies of amino acids

53 The codons and observed frequencies of amino acids

54 AMINO ACID FREQUENCY

• The frequencies of the (mRNA) bases in nature are:
– 22.0% uracil
– 30.3% adenine
– 21.7% cytosine
– 26.1% guanine
• The expected frequency of a particular codon can then be calculated by multiplying the frequencies of each base comprising the codon.

55 AMINO ACID FREQUENCY

• The expected frequency of the amino acid can then be calculated by adding the frequencies of each codon that codes for that amino acid. • As an example, – the RNA codons for tyrosine are UAU and UAC, so the random expectation for its frequency is • (0.220)(0.303)(0.220) + (0.220)(0.303)(0.217) = 0.0292. – Since 3 of the 64 codons are nonsense or stop codons, this frequency for each amino acid is multiplied by a correction factor of 1.057.
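A quick sketch of this calculation in R, using the base frequencies from the previous slide (the vector name f is mine):

  f <- c(U = 0.220, A = 0.303, C = 0.217, G = 0.261)  # mRNA base frequencies
  # Tyrosine is encoded by UAU and UAC:
  tyr <- f["U"] * f["A"] * f["U"] + f["U"] * f["A"] * f["C"]  # ~0.0292
  tyr * 1.057  # correcting for the 3 stop codons, ~0.0308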

56 TIPS on choosing a scoring matrix

• Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).

• When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices,

• For distantly related proteins higher PAM or lower BLOSUM matrices.

• For database searching the commonly used matrix is BLOSUM62.

57 Use Different PAM’s for Different Evolutionary Distances

58 (Adapted from D. Brutlag, Stanford)
Scoring Scheme
• Transition mutations (more common)
– purine ↔ purine: A ↔ G
– pyrimidine ↔ pyrimidine: T ↔ C

• Transversion mutations
– purine ↔ pyrimidine: A, G ↔ T, C

    A   G   T   C
A  20  10   5   5
G  10  20   5   5
T   5   5  20  10
C   5   5  10  20

59 DNA Mutations
In addition to using a match/mismatch scoring scheme for DNA sequences, nucleotide mutation matrices can be constructed as well. These matrices are based upon two different models of nucleotide evolution. The first, the Jukes-Cantor model, assumes uniform mutation rates among nucleotides; the second, the Kimura model, assumes two separate mutation rates: one for transitions (where the purine/pyrimidine character stays the same) and one for transversions. Generally, the rate of transitions is thought to be higher than the rate of transversions.

[Diagram: purines A and G, pyrimidines C and T, with arrows marking transitions and transversions]

Transitions: A↔G; C↔T
Transversions: A↔C, A↔T, C↔G, G↔T

60 Nucleic Acid Scoring Matrices
• Two mutation models:
– Jukes-Cantor model of evolution: α = common rate of base substitution
– Kimura model of evolution: α = rate of transitions; β = rate of transversions

61 Nucleotide substitution matrices with the equivalent distance of 1 PAM

A. Model of uniform mutation rates among nucleotides.
      A        G        T        C
A   0.99
G   0.00333  0.99
T   0.00333  0.00333  0.99
C   0.00333  0.00333  0.00333  0.99

B. Model of 3-fold higher transitions than transversions.
      A      G      T      C
A   0.99
G   0.006  0.99
T   0.002  0.002  0.99
C   0.002  0.002  0.006  0.99

62 Nucleotide substitution matrices with the equivalent distance of 1 PAM

E.g. take ~ln (logarithm base e) of the entries of the matrix in the previous slide (for non-diagonal entries):

A. Model of uniform mutation rates among nucleotides.
     A   G   T   C
A    2
G   -6   2
T   -6  -6   2
C   -6  -6  -6   2

B. Model of 3-fold higher transitions than transversions.
     A   G   T   C
A    2
G   -5   2
T   -7  -7   2
C   -7  -7  -5   2

63 Determining Optimal Alignment

• Two sequences: X and Y
– |X| = m; |Y| = n
– Allowing gaps, an alignment of X and Y can be up to m+n columns long

• Brute Force • Dynamic Programming

64 Brute Force

• Determine all possible subsequences for X and Y
– 2^(m+n) subsequences for X, 2^(m+n) for Y!

• Alignment comparisons
– 2^(m+n) × 2^(m+n) = 2^(2(m+n)) = 4^(m+n) comparisons

• Quickly becomes impractical

65 Dynamic Programming

• Used in Computer Science

• Solve optimization problems by dividing the problem into independent subproblems

• Sequence alignment has optimal substructure property – Subproblem: alignment of prefixes of two sequences

– Each subproblem is computed once and stored in a matrix

66 Dynamic Programming

• Optimal score: built upon optimal alignment computed to that point

• Aligns two sequences beginning at ends, attempting to align all possible pairs of characters

67 Dynamic Programming

• Scoring scheme for matches, mismatches, gaps • Highest set of scores defines optimal alignment between sequences • Match score: DNA – exact match; Amino Acids – mutation probabilities

• Guaranteed to provide optimal alignment given: – Two sequences – Scoring scheme

68 Steps in Dynamic Programming • Initialization • Matrix Fill (scoring) • Traceback (alignment)

DP Example: Sequence #1: GAATTCAGTTA; M = 11 Sequence #2: GGATCGA; N = 7

• s(ai, bj) = +5 if ai = bj (match score)
• s(ai, bj) = -3 if ai ≠ bj (mismatch score)
• w = -4 (gap penalty)

69 View of the DP Matrix • M+1 rows, N+1 columns

70 Global Alignment (Needleman-Wunsch)

• Attempts to align all residues of two sequences

• INITIALIZATION: First row and first column set

• Si,0 = w * i

• S0,j = w * j

71 Initialized Matrix(Needleman-Wunsch)

72 Matrix Fill (Global Alignment)

Si,j = MAXIMUM[

Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2) ]

73 Matrix Fill (Global Alignment)

• S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 - 4] = MAX[5, -8, -8] = 5

74 Matrix Fill (Global Alignment)

• S1,2 = MAX[S0,1 - 3, S1,1 - 4, S0,2 - 4] = MAX[-4 - 3, 5 - 4, -8 - 4] = MAX[-7, 1, -12] = 1

75 Matrix Fill (Global Alignment)

76 Filled Matrix (Global Alignment)

77 Trace Back (Global Alignment)

• maximum global alignment score = 11 (value in the lower right hand cell).

• Traceback begins in position SM,N; i.e. the position where both sequences are globally aligned.

• At each cell, we look to see where we move next according to the pointers.

78 Trace Back (Global Alignment)

79 Global Trace Back

G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A

80 Checking Alignment Score

G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A

+ - + - + + - + - - +
5 3 5 4 5 5 4 5 4 4 5

5 – 3 + 5 – 4 + 5 + 5 – 4 + 5 – 4 – 4 + 5 = 11!
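The whole global procedure fits in a few lines of R; a minimal sketch of the Needleman-Wunsch fill step under the slide's scoring scheme (score only, no traceback; the function name is mine), which reproduces the score of 11:

  nw_score <- function(s1, s2, match = 5, mismatch = -3, gap = -4) {
    a <- strsplit(s1, "")[[1]]; b <- strsplit(s2, "")[[1]]
    S <- matrix(0, length(a) + 1, length(b) + 1)
    S[, 1] <- gap * (0:length(a))          # initialization: first column
    S[1, ] <- gap * (0:length(b))          # initialization: first row
    for (i in seq_along(a) + 1) for (j in seq_along(b) + 1) {
      s <- if (a[i - 1] == b[j - 1]) match else mismatch
      S[i, j] <- max(S[i - 1, j - 1] + s,  # match/mismatch on the diagonal
                     S[i, j - 1] + gap,    # gap in sequence #1
                     S[i - 1, j] + gap)    # gap in sequence #2
    }
    S[length(a) + 1, length(b) + 1]        # global score in the corner cell
  }
  nw_score("GAATTCAGTTA", "GGATCGA")  # 11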

81 Local Alignment

82 Local Alignment

• Smith-Waterman: obtain highest scoring local match between two sequences

• Requires a modification:
– When a value in the score matrix becomes negative, reset it to zero (the beginning of a new alignment)

83 Local Alignment Initialization

• Values in row 0 and column 0 set to 0.

84 Matrix Fill (Local Alignment)

Si,j = MAXIMUM [

Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2), 0 ]

85 Matrix Fill (Local Alignment)

S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 – 4,0] = MAX[5, -4, -4, 0] = 5

86 Matrix Fill (Local Alignment)

S1,2 = MAX[S0,1 -3, S1,1 - 4, S0,2 – 4, 0] = MAX[0 - 3, 5 – 4, 0 – 4, 0] = MAX[-3, 1, - 4, 0] = 1

87 Matrix Fill (Local Alignment)

S1,3 = MAX[S0,2 -3, S1,2 - 4, S0,3 – 4, 0] = MAX[0 - 3, 1 – 4, 0 – 4, 0] =

MAX[-3, -3, -4, 0] = 0

88 Filled Matrix (Local Alignment)

89 Trace Back (Local Alignment)

• maximum local alignment score for the two sequences is 14

• found by locating the highest values in the score matrix

• 14 is found in two separate cells, indicating multiple alignments producing the maximal alignment score

90 Trace Back (Local Alignment)

• Traceback begins in the position with the highest value.

• At each cell, we look to see where we move next according to the pointers

• When a cell is reached where there is not a pointer to a previous cell, we have reached the beginning of the alignment

91 Trace Back (Local Alignment)

92 Trace Back (Local Alignment)

93 Trace Back (Local Alignment)

94 Maximum Local Alignment

G A A T T C - A        G A A T T C - A
|   | |   |   |        |   |   | |   |
G G A T - C G A        G G A - T C G A

+ - + + - + - +        + - + - + + - +
5 3 5 5 4 5 4 5        5 3 5 4 5 5 4 5
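Changing two lines turns the global fill into Smith-Waterman; a sketch (score only, function name mine), which reproduces the maximum local score of 14:

  sw_score <- function(s1, s2, match = 5, mismatch = -3, gap = -4) {
    a <- strsplit(s1, "")[[1]]; b <- strsplit(s2, "")[[1]]
    S <- matrix(0, length(a) + 1, length(b) + 1)  # row 0 / column 0 stay 0
    for (i in seq_along(a) + 1) for (j in seq_along(b) + 1) {
      s <- if (a[i - 1] == b[j - 1]) match else mismatch
      S[i, j] <- max(S[i - 1, j - 1] + s, S[i, j - 1] + gap,
                     S[i - 1, j] + gap, 0)        # reset negatives to zero
    }
    max(S)  # the best local score can be anywhere in the matrix
  }
  sw_score("GAATTCAGTTA", "GGATCGA")  # 14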

95 Overlap Alignment

96 Overlap Alignment
Consider the following problem: find the most significant overlap between two sequences.
Possible overlap relations:
[Figure: possible overlap relations a and b]

Difference from local alignment: here we require the alignment to reach the endpoints of the two sequences.

Overlap Alignment

Initialization: Si,0 = 0 , S0,j = 0 Recurrence: as in global alignment

Si,j = MAXIMUM [

Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2) ]

Score: the maximum value in the bottom row or the rightmost column

[Figure: matrix paths for global, local, and overlap alignment]

Overlap Alignment - example

P A W H E A E
H E A G A W G H E E

Scoring scheme: Match: +4; Mismatch: -1; Gap penalty: -5

Overlap Alignment
The best overlap is:
P A W H E A E - - - - - -
- - - H E A G A W G H E E

Pay attention! A different scoring scheme could yield a different result, such as:

Scoring scheme: Match: +4; Mismatch: -1; Gap penalty: -2

- - - P A W - H E A E
H E A G A W G H E E -

Sequence Alignment Variants
• Global alignment (the Needleman-Wunsch algorithm)

– Initialization: Si,0 = i*w, S0,j = j*w
– Score: Si,j = MAX[Si-1,j-1 + s(ai,bj), Si,j-1 + w, Si-1,j + w]
• Local alignment (the Smith-Waterman algorithm)
– Initialization: Si,0 = 0, S0,j = 0
– Score: Si,j = MAX[Si-1,j-1 + s(ai,bj), Si,j-1 + w, Si-1,j + w, 0]
• Overlap alignment
– Initialization: Si,0 = 0, S0,j = 0
– Score: Si,j = MAX[Si-1,j-1 + s(ai,bj), Si,j-1 + w, Si-1,j + w]

104 Linear vs. Affine Gaps
• The scoring matrices used to this point assume a linear gap penalty, where each gap position is given the same penalty score.
• However, over evolutionary time it is more likely that a contiguous block of residues has become inserted or deleted in a region (for example, it is more likely to have 1 gap of length k than k gaps of length 1).
• Therefore, a better scoring scheme uses a higher initial penalty for opening a gap and a smaller penalty for extending it.

105 Linear vs. Affine Gaps

• Gaps have been modeled as linear

• More likely contiguous block of residues inserted or deleted – 1 gap of length k rather than k gaps of length 1

• Scoring scheme should penalize new gaps more

106 Affine Gap Penalty

• wx = g + r(x-1)

• wx: total gap penalty; g: gap-open penalty; r: gap-extension penalty; x: gap length

• The gap penalty is chosen relative to the scoring matrix
– gaps are not excluded
– gaps are not over-included
– Typical values: g = -12; r = -4
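For example, with g = -12 and r = -4, a single gap of length 3 costs w3 = -12 + (-4)(3 - 1) = -20, whereas three separate gaps of length 1 would cost 3 × (-12) = -36; the affine scheme therefore favors one contiguous indel.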

107 Affine Gap Penalty and Dynamic Programming

Mi,j = max { Di-1,j-1 + subst(Ai, Bj), Mi-1,j-1 + subst(Ai, Bj), Ii-1,j-1 + subst(Ai, Bj) }

Di,j = max { Di,j-1 - extend, Mi,j-1 - open }

Ii,j = max { Mi-1,j - open, Ii-1,j - extend }

where M is the match matrix, D is the delete matrix, and I is the insert matrix

108 Drawbacks to DP Approaches

• Dynamic programming approaches are guaranteed to give the optimal alignment between two sequences given a scoring scheme.
• However, the two main drawbacks of DP approaches are that they are compute- and memory-intensive, in the cases discussed to this point taking at least O(n^2) space and between O(n^2) and O(n^3) time.
• Linear-space algorithms have been used to deal with one of these drawbacks. The basic idea is to concentrate only on those areas of the matrix most likely to contain the maximum alignment.
• The most well-known of these linear-space algorithms is the Myers-Miller algorithm.

109 Alternative DP approaches

• Linear space algorithms Myers-Miller

• Bounded Dynamic Programming

• Ewan Birney’s Dynamite Package – Automatic generation of DP code

110 Assessing Significance of Alignment

• When two sequences of length m and n are not obviously similar but show an alignment, it becomes necessary to assess the significance of the alignment.

• The alignment scores of random sequences have been shown to follow a Gumbel extreme value distribution.

111 Gumbel Extreme Value Distribution

• http://roso.epfl.ch/mbi/papers/discretechoice/node11.html • http://mathworld.wolfram.com/GumbelDistribution.html • http://en.wikipedia.org/wiki/Generalized_extreme_value_distribution

112 Probability of Alignment Score

• Using a Gumbel extreme value distribution, the expected number of alignments (E-value) with a score at least S is: E = K·m·n·e^(-λS)

– m, n: lengths of the sequences
– K, λ: statistical parameters dependent upon the scoring system and background residue frequencies

113 • Recall that the log-odds scoring schemes examined to this point normally use an S = 10·log10(x) scoring system.

• We can normalize the raw scores obtained using these non-gapped scoring systems to obtain the amount of bits of information contained in a score (or the amount of nats of information contained within a score).

114 Converting to Bit Scores

A raw score S can be normalized to a bit score using the formula (the standard Karlin-Altschul form): S' = (λS - ln K) / ln 2

• The E-value corresponding to a given bit score can then be calculated as: E = m·n·2^(-S')

115 • Converting to nats is similar.

• We just substitute e for 2 in the above equations.

• Converting scores to either bits or nats gives a standardized unit by which the scores can be compared.

116 P-Value

• P values can be calculated as the probability of obtaining a given score at random.

• P-values can be estimated as: P = 1 - e^(-E)

117 A quick determination of significance
• If a scoring matrix has been scaled to bit scores, then it can quickly be determined whether or not an alignment is significant.
• For a typical amino acid scoring matrix, K = 0.1, and λ depends on the values of the scoring matrix.
• If a PAM or BLOSUM matrix is used, then λ is precomputed.
• For instance, if the log-odds matrix is in units of bits, then λ = ln 2, and the significance cutoff can be calculated as log2(mn).

118 Significance of Ungapped Alignments

• PAM matrices are 10 * log10x • Converting to log2x gives bits of information

• Converting to logex gives nats of information

Quick Calculation: • If bit scoring system is used, significance cutoff is:

log2(mn)

119 Example

• Suppose we have two sequences, each approximately 250 amino acids long that are aligned using a Smith-Waterman approach. • Significance cutoff is:

log2(250 * 250) = 16 bits

120 Example

• Using PAM250, the following alignment is found:

• F W L E V E G N S M T A P T G • F W L D V Q G D S M T A P A G

121 PAM250 matrix

122 Example

• Using PAM250, the score is calculated:

• F W L E V E G N S M T A P T G • F W L D V Q G D S M T A P A G

• S = 9 + 17 + 6 + 3 + 4 + 2 + 5 + 2 + 2 + 6 + 3 + 2 + 6 + 1 + 5 = 73

123 Significance Example

• S is in 10 * log10x, so this should be converted to a bit score:

• S = 10 log10(x)
• S/10 = log10(x)
• log2(x) = log10(x) × log2(10)
• log2(x) = (S/10) × log2(10) ≈ (S/10) × 3.32
• So S' = log2(x) ≈ (1/3)S

• S' ~ (1/3)S

124 Significance Example

• S’ = 1/3S = 1/3 * 73 = 24.333 bits

• The significance cutoff is: log2(mn) = log2(250 * 250) = 16 bits

• Since the alignment score is above the significance cutoff, this is a significant local alignment.

125 Estimation of E and P
• When a PAM250 scoring matrix is being used, K is estimated to be 0.09, while λ is estimated to be 0.229.
• For PAM250: K = 0.09; λ = 0.229

• We can convert the raw score to a normalized score (in nats) as follows:
• S' = λS - ln(Kmn)
• S' = 0.229 × 73 - ln(0.09 × 250 × 250)
• S' = 16.72 - 8.63 = 8.09
• P(S' ≥ x) = 1 - exp(-e^(-x))
• P(S' ≥ 8.09) = 1 - exp(-e^(-8.09)) = 3.1 × 10^(-4)

• Therefore, we see that the probability of observing an alignment with a normalized score greater than 8.09 is about 3 in 10,000.
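These numbers are easy to check in R (a sketch of the slide's arithmetic):

  K <- 0.09; lambda <- 0.229; m <- 250; n <- 250; S <- 73
  E <- K * m * n * exp(-lambda * S)      # expected number of chance alignments
  P <- 1 - exp(-E)                       # P-value; both come out ~3.1e-04
  Sprime <- lambda * S - log(K * m * n)  # normalized score, ~8.09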

126 Significance of Gapped Alignments

• Gapped alignments make use of the same statistics as ungapped alignments in determining the statistical significance.

• However, in gapped alignments, the values for λ and K cannot be easily estimated.

• Empirical estimates for gapped scoring systems have been determined by examining the alignments of randomized sequences.

127 Biological Network Analysis and Gene-Gene Interaction Networks

Content
• Rationale of "biological network analysis"

• Inferring gene-gene interaction networks

• Co-regulation (co-expression) networks

• Weighted correlation network analysis of genes (WGCNA)

• Bayesian networks

• Real life examples of gene networks usage

129 Hypothesis

130 What can gene networks provide us? To study common human diseases such as NAFLD and CAD..

131 132 Weighted correlation network analysis (WGCNA)*

*Citation for WGCNA summary: SIB course, Nov. 15-17, 2016, "Introduction to Biological Network Analysis" by Leonore Wigger, with Frédéric Burdet and Mark Ibberson

133 Overview of WGCNA

• Theory 1: Weighted correlation network, split into modules • Theory 2: Identify modules and genes of interest • Input data: – Gene expression data (microarray or RNA-Seq) Recommendation: at least 20 individuals – Clinical/phenotypical traits from the same individuals (optional) e.g. weight, insulin level, glucose level

134 Aims of WGCNA: Inferring a gene-gene similarity network

135 Aims: Correlate phenotypic traits to gene modules* (optional aim)

* “Modules” found in WGCNA: Groups or clusters of co-expressed genes with similar expression profiles over a large group of individuals 136 Aims: Identify “key driver” genes in modules

137 Rationale

• Genes with similar expression patterns are of interest because they may be – tightly co-regulated – functionally related – members of the same pathway

• WGCNA encourages hypotheses about genes based on their close network neighbors.

138 Workflow

139 But first: preprocess the gene expression data

- Remove outlier samples e.g. by creating dendrogram for samples (take transpose of the expression matrix for this) and identifying the outliers

- Remove the lowly expressed genes by eliminating the genes that do not have expression levels > 0 at least for 50% of all samples

- Normalize gene expression by log2(expr+1). We need to transform expression values with logarithm base 2, since the correlation measures (e.g. Pearson coefficient) assume that the expression values of each gene are normally distributed (approximately) across the samples.

- Extract expression table with the most variable probes/genes: e.g. find the top ~15,000 highly variable genes based on the “coefficient of variation” (cv) measure.

140 But first: preprocess the gene expression data
• Remove the lowly expressed genes:
  Min_Exp_level <- 0
  Gene_present <- logical()
  for (i in 1:nrow(Expr_Mat))
    Gene_present[i] <- (sum(Expr_Mat[i, ] > Min_Exp_level) / ncol(Expr_Mat)) >= 0.5
  Expr_Mat <- Expr_Mat[Gene_present, ]
• Transform expression values with logarithm base 2 (normalization):
  Expr_Mat <- log2(Expr_Mat + 1)
• Extract the expression table with the most variable probes/genes:
  cv <- function(x) sd(x) / mean(x)  # coefficient of variation (assumed definition; not base R)
  cvs <- NULL
  a <- Expr_Mat
  for (m in 1:nrow(a)) { cv_individual <- cv(a[m, ]); cvs <- rbind(cvs, cv_individual) }
  a_sort <- data.frame(a)[order(-cvs), ]
  # Keep the top e.g. 90th percentile, which leaves us with ~15,000 genes:
  perc <- 0.90
  a_keep <- a_sort[1:(perc * nrow(a_sort)), ]
  Expr_Mat <- a_keep

141 Construct weighted correlation network

• Correlation: A statistical measure for the extent to which two variables fluctuate together. • Positive correlation: variables increase/decrease together • Negative correlation: variables increase/decrease in opposing direction • Caution: There is no causality here!!

142 Correlation examples (Pearson correlation, R^2)

143 http://www.dummies.com/how-to/content/how-to-interpret-a-correlation-coefficient-r.html Correlation measures implemented in WGCNA

• Pearson • Spearman • Kendall's tau • Biweight midcorrelation (bicor)

144 Choosing a correlation method

• Fastest, but sensitive to outliers: Pearson correlation, cor(x), the "standard" measure of linear correlation
• Less sensitive to outliers but much slower: biweight midcorrelation, bicor(x), robust, recommended by the authors for most situations [needs modification for correlations involving binary/categorical variables]
• Spearman correlation, cor(x, method="spearman"), rank-based, works even if the relationship is not linear (but the relationship is expected to be monotonic), less sensitive to gene expression differences [can be used as-is for correlations involving binary/categorical variables]

• Default correlation method in WGCNA: cor(Pearson). • Caveat: use it only if there are no outliers, or for exercises/tutorials.

145 Adjacency matrix calculation
• Compute a correlation raised to a power β between every pair of genes (i, j): a(i,j) = |cor(x_i, x_j)|^β
• Effect of raising the correlation to a power:
– Amplifies the disparity between strong and weak correlations
– Example: power term β = 4

146 Adjacency matrix calculation – cont’d

147 Adjacency matrix calculation – cont’d

148 Connectivity (degree) in a weighted network

• Connectivity of a gene: Sum of the weights of all edges connecting to this gene • Example: Connectivity of gene 1 is: 0.55 + 0.39 + 0.09 = 1.03
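In the WGCNA R package these two steps are one-liners; a minimal sketch, assuming an expression matrix datExpr (samples in rows, genes in columns):

  library(WGCNA)
  adj <- adjacency(datExpr, power = 4, type = "unsigned")  # a(i,j) = |cor(x_i, x_j)|^4
  k <- rowSums(adj) - 1  # connectivity: sum of edge weights, minus the self-adjacency of 1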

149 Picking a power term

• Selection criterion: pick the lowest possible β that leads to an approximately scale-free network topology:
– few nodes with many connections ("hubs")
– many nodes with few connections

• Degree distribution follows a power law:
– the probability for a node to have k connections is proportional to k^(-γ)

150 Source: Carlos Castillo: Effective Web Crawling, PhD Thesis, University of Chile, 2004

151 Why scale-free network topology?

• Barabási et al.* found many types of networks in many domains to be approximately scale-free, including metabolic and protein interaction networks

• So, aim in WGCNA is: Building a biologically “realistic” network.

* Barabási, Albert-László; Bonabeau, Eric (May 2003). "Scale-Free Networks"(PDF). Scientific American. 288(5): 50–9.

152 Pick a power term: Visual Aid in WGCNA

• Left plot: choose power 6, the lowest power term for which the topology approximately fits a scale-free network (on or above the red horizontal line).
• Right plot: mean connectivity drops as the power goes up; it must not drop too low.
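The plots on this slide come from the package's soft-threshold helper; a sketch, again assuming datExpr exists:

  sft <- pickSoftThreshold(datExpr, powerVector = c(1:10, seq(12, 20, 2)),
                           networkType = "unsigned")
  sft$powerEstimate  # lowest power reaching the scale-free fit threshold (here, 6)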

153 Next Step: Detect modules of co-expressed genes

154 4 steps to get from network to modules 1. Compute dissimilarity between genes: “topological overlap measure dissimilarity”

2. Perform hierarchical clustering of genes: obtain tree structure

3. Divide clustered genes into modules: cut tree branches

4. Optional: Merge very similar modules: use module “eigengenes”

155 Step 1: Compute dissimilarity between genes
• Why do we use the Topological Overlap Measure (TOM)?
– TOM is a pairwise similarity measure between network nodes (genes)
– TOM(i,j) is high if genes i and j have many shared neighbors, i.e. the overlap of their network neighborhoods is large
• So a high TOM(i,j) implies that the genes have similar expression patterns

• How to calculate TOM similarity between two nodes:
1. Count the number of shared neighbors ("agreement" of the sets of neighboring nodes)
2. Normalize to [0,1]:
– TOM(i,j) = 0 means no overlap of network neighbors
– TOM(i,j) = 1 means identical sets of network neighbors

• NOTE: TOM was generalized to the case of weighted networks in Zhang and Horvath (2005), the first WGCNA paper
– In a weighted network, all nodes are neighbors, so counting them is not informative.
– Instead, compute the agreement of the sets of neighboring nodes based on edge strengths.

156 • But we need a dissimilarity measure for clustering!!

• TOM as a similarity measure can be transformed into a dissimilarity measure: distTOM = 1 - TOM.
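In R this step is, as a sketch (using adj from the earlier adjacency sketch):

  TOM <- TOMsimilarity(adj)  # shared-neighbor agreement of the adjacency matrix, in [0,1]
  dissTOM <- 1 - TOM         # the dissimilarity used for clustering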

157 Step 2: Perform hierarchical clustering of genes • Compute gene dendrogram:

159 Step 3: Divide clustered genes into modules
• Gene dendrogram and detected modules

Dynamic tree cut algorithm groups genes into modules (corFnc = "pearson"; power = 6; min.modulesize = 30)

160 Step 4 (Optional): Merge very similar modules
• A module eigengene is a 1-dimensional data vector summarizing the expression data of the genes that form a module
• How it is computed: the 1st principal component of the module's expression data
• What it is used for: to represent the module in mathematical operations:
– modules can be correlated with one another
– modules can be clustered together (so we can combine them)
– modules can be correlated with external traits

161 Clustering of module eigengenes

[Figure: dendrogram of module eigengenes; dissimilarity measure: 1 - cor(MEigengenes); a horizontal line marks the cutoff for merging]

Merge modules whose dissimilarity is below the merging cutoff.

162 Gene dendrogram and detected modules, before and after merging
[Figure: modules before merging vs. modules after merging; corFnc = "pearson", power = 6, min.modulesize = 30]
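Steps 2-4 as a minimal R sketch (parameter values follow the slides; a merging cutoff of 0.25 corresponds to an eigengene correlation of 0.75):

  geneTree <- hclust(as.dist(dissTOM), method = "average")   # step 2: gene dendrogram
  library(dynamicTreeCut)
  modules <- cutreeDynamic(dendro = geneTree, distM = dissTOM,
                           minClusterSize = 30)              # step 3: cut tree branches
  moduleColors <- labels2colors(modules)                     # WGCNA labels modules by color
  merged <- mergeCloseModules(datExpr, moduleColors,
                              cutHeight = 0.25)              # step 4: merge similar modules
  mergedColors <- merged$colors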

Dynamic Tree cut procedure • Minimal module size: typically 20 or 30 • CutHeight: try different heights such as 0.95 or 0.9995, etc.

Module Merging procedure • Cutoff for module eigengene dendrogram: typically between 0.15 and 0.25 – check if clusters look ok on dendrogram • Merge once or several times? Usually once, but merge step can be repeated: – if some modules are very similar – if we want larger modules

165 Correlate modules to external traits
Examples: trait variables obtained from some samples/individuals (preferably the same samples from which we generated the expression data set):
• weight_g • FFA • Aneurysm
• length_cm • Glucose • Aortic_cal_M
• ab_fat • LDL_plus_VLDL • Aortic_cal_L
• other_fat • MCP_1_phys • CoronaryArtery_Cal
• total_fat • Insulin_ug_l • Myocardial_cal
• Trigly • Glucose_Insulin • BMD_all_limbs
• Total_Chol • Leptin_pg_ml • BMD_femurs_only
• HDL_Chol • Adiponectin • UC
• Aortic lesions

How to compute correlations: each module eigengene to each trait variable: cor(MEs, traitDat) 166 Example
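A sketch of this computation, assuming a trait table traitDat with one row per sample:

  MEs <- moduleEigengenes(datExpr, moduleColors)$eigengenes  # 1st PC of each module
  moduleTraitCor <- cor(MEs, traitDat, use = "p")            # eigengene x trait correlations
  moduleTraitP <- corPvalueStudent(moduleTraitCor, nSamples = nrow(datExpr))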

167 Module preservation analysis in WGCNA

Is my network module preserved and reproducible?*
*Langfelder et al. PLoS Comput Biol 7(1): e1001057.

168 Network-based module preservation statistics

" Input: module assignment in reference data.

ref test " Adjacency matrices in reference A and test data A

" Network preservation statistics assess preservation of

- 1. network density: Does the module remain densely connected in the test network?

- 2. connectivity: Is hub gene status preserved between reference and test networks?

- 3. separability of modules: Does the module remain distinct in the test data?

169 Several connectivity preservation statistics
For general networks, i.e. the inputs are adjacency matrices:
– cor.kIM = cor(kIM^ref, kIM^test): correlation of intramodular connectivity across module nodes
– cor.ADJ = cor(A^ref, A^test): correlation of adjacency across module nodes
For correlation networks, i.e. the inputs are sets of variable measurements:
– cor.Cor = cor(cor^ref, cor^test)
– cor.kME = cor(kME^ref, kME^test)
One can derive relationships among these statistics in the case of weighted correlation networks.

170 Choosing thresholds for preservation statistics based on permutation test

" For correlation networks, we study 4 density and 3 connectivity preservation statistics that take on values <= 1

" Challenge: Thresholds could depend on many factors (number of genes, number of samples, biology, expression platform, etc.)

" Solution: Permutation test. Repeatedly permute the gene labels in the test network to estimate the mean and standard deviation under the null hypothesis of no preservation.

" Next we calculate a Z statistic: Z=(observed-mean)/sd

" We have had 4 density and 3 connectivity preservation statistics:

171 Permutation test for estimating Z scores (gene modules in adipose)

" For each preservation measure we report the observed value and the permutation Z score to measure significance ( Z=(observed-mean)/sd).

" Each Z score provides answer to “Is the module significantly better than a random sample of genes?”

" Summarize the individual Z scores into a composite measure called Z.summary

" Zsummary < 2 indicates no preservation,

" 2

" Zsummary>10 strong evidence

172 Summary preservation • Standard cross-tabulation based statistics are intuitive – Disadvantages: i) only applicable for modules defined via a module detection procedure, ii) ill suited for ruling out module preservation • Network based preservation statistics measure different aspects of module preservation – Density-, connectivity-, separability preservation • Two types of composite statistics: Zsummary and medianRank. • Composite statistic Zsummary based on a permutation test – Advantages: thresholds can be defined, R function also calculates corresponding permutation test p-values – Example: Zsummary<2 indicates that the module is *not* preserved – Disadvantages: i) Zsummary is computationally intensive since it is based on a permutation test, ii) often depends on module size • Composite statistic medianRank – Advantages: i) fast computation (no need for permutations), ii) no dependence on module size. – Disadvantage: only applicable for ranking modules (i.e. relative preservation)

173 Application: studying the preservation of human brain co-expression modules in chimpanzee brain expression data.

Modules defined as clusters (branches of a cluster tree)

Data from Oldham et al. 2006
Preservation of modules between human and chimpanzee brain networks

175 2 composite preservation statistics

• Zsummary is above the threshold of 10 (green dashed line), i.e. all modules are preserved. • Zsummary often shows a dependence on module size which may or may not be attractive • In contrast, the median rank statistic is not dependent on module size. • It indicates that the yellow module is the most preserved one

176 Implementation and R software tutorials, WGCNA R library

• General information on weighted correlation networks • Google search – “WGCNA” – “weighted gene co-expression network”

" R function modulePreservation is part of WGCNA package

" Tutorials: preservation between human and chimp brains www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/ModulePreservatio n

177 Pseudocode for modulePreservation()

setLabels = c("Control", "Disease")
multiExpr = list(Control = list(data = dataExp_Control), Disease = list(data = dataExp_Disease))
multiLabel = list(Control = moduleLabelsControl, Disease = moduleLabelsDisease)
mp = modulePreservation(multiExpr, multiLabel, referenceNetworks = c(1:2),
                        nPermutations = 100, randomSeed = 1, quickCor = 0, verbose = 3)
save(multiExpr, multiLabel, mp, file = "modulePreservationLabel_Alldata.RData")
ref = 1; test = 2
Zsummary1 = mp$preservation$Z[[ref]][[test]][, 2]
names(Zsummary1) = rownames(mp$preservation$Z[[ref]][[test]])
low.preserved1 = Zsummary1[which(Zsummary1 < 2)]
ref = 2; test = 1
Zsummary2 = mp$preservation$Z[[ref]][[test]][, 2]
names(Zsummary2) = rownames(mp$preservation$Z[[ref]][[test]])

low.preserved2 = Zsummary2[which(Zsummary2 < 2)]

178 Introduction to Bayesian Networks

Content

• Definition of Bayesian networks (BN)

• An example, Rimbanet…

• Mandatory and optional input files

• Comparison of BN tools

180 Gene-gene interaction networks

Co-expression networks

181 What is a DAG

• A directed acyclic graph (DAG) is a finite directed graph with no directed cycles.

• There is no way to start at any vertex v and follow a consistently-directed sequence of edges that eventually loops back to v again.

182 Bayes’ theorem

• Bayes’ theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

• Bayes’ theorem is stated as:

P(A | B) = P(B | A) · P(A) / P(B)

183 Bayes’ theorem - an example

• Suppose a test to detect a disease in humans is 99% sensitive and 95% specific. This means:
– For patients, the test gives a positive result 99% of the time (TP) and a negative result 1% of the time (FN): P(+ | C) = 0.99 and P(- | C) = 0.01
– For healthy samples, the test gives a negative result 95% of the time (TN) and a positive result 5% of the time (FP): P(- | H) = 0.95 and P(+ | H) = 0.05

• Q: If a randomly selected individual’s test result is positive, what is the probability that he/she carries the disease? Hence, P(C | +) = ?
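A sketch of the answer: Bayes' theorem also needs the prevalence of the disease, which the slide leaves unspecified; assuming (hypothetically) P(C) = 0.001 and P(H) = 0.999:

P(C | +) = P(+ | C) P(C) / [ P(+ | C) P(C) + P(+ | H) P(H) ]
         = (0.99)(0.001) / [ (0.99)(0.001) + (0.05)(0.999) ] ≈ 0.019

So even after a positive result, the probability of disease is only about 2% under this assumed prevalence, because false positives from the much larger healthy population dominate.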

184 Bayesian networks (BNs)

• BNs are directed acyclic graphs (DAGs) in which the edges of the graph are defined by conditional probabilities that characterize the distribution of the states of each node given the states of its parents. The model is scored by P(M | D), where M: network model, D: observation data.

• We can define a partitioned joint probability distribution over all nodes: P(X) = ∏_i P(X_i | Pa(X_i)), where Pa(X_i) is the parent set of X_i.

185 Application of Bayes' theorem in gene regulatory network inference
• The number of possible network structures (M) increases with the number of nodes.

• Search of all possible structures to find the best supported one by the data is not feasible.

• Solution: We can use Markov chain Monte Carlo (MCMC) simulation to identify a particular amount (e.g. 1,000) of plausible networks.

• A consensus network is obtained by combing these plausible networks.

186 Markov chain

• State of the system at time “t+1” is predicted by solely using the state at time “t”, i.e. previous states (t-1, t-2,..) will not be used for predicting t+1.

[Diagram: a three-state Markov chain over {grapes, lettuce, cheese}, with transition probabilities (0.4, 0.5, 0.6, 0.1, 0.4, 0.5, 0.5) labeling the edges]
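A minimal R sketch of simulating such a chain; the transition matrix below is a hypothetical stand-in for the diagram's edge probabilities:

  states <- c("grapes", "lettuce", "cheese")
  P <- matrix(c(0.4, 0.5, 0.1,
                0.4, 0.1, 0.5,
                0.5, 0.0, 0.5),
              nrow = 3, byrow = TRUE, dimnames = list(states, states))  # rows sum to 1
  x <- character(10); x[1] <- "grapes"
  for (t in 2:10)
    x[t] <- sample(states, 1, prob = P[x[t - 1], ])  # state t depends only on state t-1
  x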

187 Markov chain Monte Carlo (MCMC)

• MCMC methods are a class of sampling algorithms.

• It is used for sampling from a probability distribution, based on a Markov chain model that has the desired distribution as its equilibrium distribution.

• The state of the chain after a number of steps is used as a sample of the desired distribution.

• The quality of the sample ↑ as the number of steps ↑.

• MCMC methods are mainly used for numerical approximation of integrals

188 MCMC in the BN problem
• MCMC algorithm to identify 1,000 networks:
– Start with a null network (for each of the 1,000 runs) including prior edges.
– Make random changes on each network, such as flipping, adding, and deleting individual edges.
– Accept the random changes that lead to an improvement in the fit of the network to the data: P(M | D) ≈ P(D | M) × P(M)
– The fit is assessed by the Bayesian information criterion (BIC), which also penalizes the network as the number of parameters (complexity) increases.
– Note: BIC = k·ln(n) - 2·ln(L), where L is the model likelihood, n is the number of data points in D (sample count), and k is the number of parameters to be estimated (the parent count of node i).
• Create a consensus network by combining the above 1,000 networks, and check that the final network retains the DAG property.

189 Comparison of BN tools

190 Software packages

Tool      Platform  Scoring method(s)  Data type              Multiprocessing?  Prior info support?
RIMBANet  Perl+C    BIC                Continuous + Discrete  No                Yes
Bnfinder  Python    BIC, BDE, MIT      Continuous + Discrete  Yes               Yes
sparsebn  R         BIC+CV             Continuous + Discrete  No                No
bnlearn   R         BIC, AIC, BDE      Continuous + Discrete  Yes               Yes
catnet    R         BIC, AIC           Continuous + Discrete  Yes               Yes

• BIC: Bayesian information criterion
• AIC: Akaike information criterion
• BDE: Bayesian Dirichlet equivalence
• CV: cross validation
• MIT: χ2-distance-based mutual information test

191 Example: 500 genes, 100 samples from an aorta tissue expression dataset

Tool      Scoring method  Runtime                                Edge #
RIMBANet  BIC             5,586 sec (~1.30 hour) on 1 node       502
Bnfinder  BIC             12,710 sec (~3.30 hours) on 8 nodes    479
                          (could not finish in 24h on 1 node)
sparsebn  BIC             2,409 sec on 1 node                    5,019
                          (prior info integration is not available*)
bnlearn   BIC             could not finish in 24h (on 8 nodes)   --
catnet    BIC             could not finish in 24h (on 8 nodes)   --

193 Summary

• Gene networks allow us to:
– Identify biological mechanisms and molecular subnetworks underlying common human diseases.
– Integrate diverse types of multi-dimensional biological datasets.
– Predict the key driver genes in disease-related subnets.

• After presenting our data-driven findings, experimentalists can conduct in vivo and/or in vitro experiments to test our candidate genes on a given tissue, for certain hallmarks of a disease/disorder.

194