Introduction to Sequence Analysis

Utah State University – Fall 2019
Statistical Bioinformatics (Biomedical Big Data)
Notes 11

Slide 2. References
- Chapters 2-7 of Biological Sequence Analysis (Durbin et al., 2001)
- Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14:755-763.
- Bodenhofer et al. (2015). Bioinformatics, 31(24):3997-3999.

Slide 3. Review
Genes are:
- sequences of DNA that “do” something
- can be expressed as a string of nucleic acids: A, C, G, T (4-letter alphabet)
Central Dogma of Molecular Biology:
  DNA → mRNA → protein → bio. action
Proteins can be expressed as a string of amino acids (20-letter alphabet; sometimes 24 due to “similarities”)

Slide 4. Why look at protein sequence?
Levels of protein structure:
- Primary structure: order of amino acids
- Secondary structure: repeating structures (beta-sheets and alpha-helices) in “backbone”
- Tertiary structure: full three-dimensional folded structure
- Quaternary structure: interaction of multiple “backbones”
Sequence → shape → function
Similar sequence → similar function?

Slide 5. Consider simple pairwise alignment
  Sequence 1: HEAGAWGHEE
  Sequence 2: PAWHEAE
How similar are these two sequences?
- Match up exactly?
- Subsequences similar?
- Which positions could possibly be matched without severe penalty?
Think of gaps in alignment as: mutational insertion or deletion

Slide 6. Possible alignments
  Sequence 1: HEAGAWGHEE
  Sequence 2: PAWHEAE
  Alignment 1:  HEAGAWGHEE
                PAWHEAE
  Alignment 2:  HEAGAWGHEE
                PAW-HE-AE
  Alignment 3:  HEA-GAWGHEE
                PAWHEAE
  Alignment 4:  HEAGAWGHE-E
                PAW-HEAE
To find the “best” alignment, need some way to rate alignments.

Slide 7. Basic idea of scoring potential alignments
- + score: identities and “conservative” substitutions
- - score: non-“conservative” changes (not expected in “real” alignments)
Add score at each position
Equivalent to assuming mutations are independent
- Reasonable assumption for DNA and proteins but not structural RNAs

Slide 8. Some Notation
- $q_a$ = freq. of letter a in sequence
- $P_{ab}$ = P{a, b from common ancestor}
Let x be sequence 1, and y be sequence 2.
Random Model: $P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$
- assume independence of sequences
Matched Model: $P(x, y \mid M) = \prod_i P_{x_i y_i}$
- assume residues a & b are aligned as a pair with prob. $P_{ab}$

Slide 9. Compare these two models
Odds Ratio: $\frac{P(x, y \mid M)}{P(x, y \mid R)} = \prod_i \frac{P_{x_i y_i}}{q_{x_i} q_{y_i}}$
Log Odds Ratio: $S = \sum_i s(x_i, y_i)$, where $s(a, b) = \log \frac{P_{ab}}{q_a q_b}$
- $s(a, b)$ is the log likelihood ratio of pair (a, b) occurring as an aligned pair, as opposed to an unaligned pair
Need: $P_{ab}$

Slide 10. Score Matrix – or “substitution matrix”
          A    R    N    D   ...   Y    V
    A |   5   -2   -1   -2        -2    0
    R |  -2    7   -1   -2        -1   -3
    N |  -1   -1    7  ...
    D |  -2   -2  ...
  ... |
    Y |  -2   -1  ...
    V |   0   -3  ...
These are scaled and rounded log-odds values $s(a, b)$ (for computational efficiency).
This is a portion of the BLOSUM50 substitution matrix; others exist.

Slide 11. How to get these substitution values?
Basic idea:
- Look at existing, “known” alignments
- Compare sequences of aligned proteins and look at substitution frequencies
This is a chicken-or-the-egg problem: the alignment depends on the scoring scheme, and the scoring scheme depends on the alignment.
Maybe better to base alignment on: tertiary structures (or some other alignment)

Slide 12. Some substitution matrix types
BLOSUM (Henikoff)
- BLOcks SUbstitution Matrix
- derived from BLOCKS database – set of aligned ungapped protein families, clustered according to threshold percentage (L) of identical residues – compare residue frequencies between clusters
- L=50 → BLOSUM50
PAM (Dayhoff)
- percentage of accepted point mutations per 10^8 years
- derived from a general model for protein evolution, based on number L of PAMs (evolutionary distance)
- PAM1 from comparing sequences with <1% divergence
- L=250 → PAM250 = PAM1^250

Slide 13. Which substitution matrix to use?
No universal “best” way
In general:
- low PAM → find short alignments of similar seq.
- high PAM → find longer, weaker local alignments
BLOSUM standards:
- BLOSUM50 for alignment with gaps
- BLOSUM62 for ungapped alignments
higher PAM, lower BLOSUM → more divergent (looking for more distantly related proteins)
A reasonable strategy: BLOSUM62 complemented with PAM250

Slide 14. Which matrix for aligning DNA sequences?
The BLOSUM and PAM matrices are based on similarities between amino acids
- no such similarity assumed for nucleic acids; residues either match or they don’t
Unitary matrix: identity matrix
- +1 for identical match (or +3 or …)
- 0 for non-match (or -2 or …)

Slide 15. How to score gaps?
One way: affine gap penalty
- linear transformation followed by a translation
  $\gamma(g) = -d - (g - 1)e$
  where g = gap length, d = gap opening penalty, e = gap extension penalty (e < d)
Think of gaps in alignment as: mutational insertion or deletion

Slide 16. Tabular representation of alignment
Start with 0 in the upper left corner:
          H   E   A   G   A   W   G   H   E   E
      0
  P |
  A |
  W |
  H |
  E |
  A |
  E |
- begin (or continue) gap: -d (or -e)
- match letters (residues): + s(a,b)
Fill in table to give max. of possible values at each successive element – keep track of which direction generated max. – then use the “path” that gives highest final score (lower right corner)

Slide 17. Alignment algorithms
Global: Needleman-Wunsch
- find optimal alignment for entire sequences (prev. slide)
Local: Smith-Waterman
- find optimal alignment for subsequences
Repeated matches
- allow for starting over sequences (find motifs in long sequences)
Overlap matches
- allow for one sequence to contain or overlap the other (for comparing fragments)
Heuristic: BLAST, FASTA
- for comparing a single sequence against a large database of sequences

Slide 18. Compare global and local alignments
  Sequence 1: HEAGAWGHEE
  Sequence 2: PAWHEAE

  Global Pairwise Alignment (1 of 1)
  pattern: [1] HEAGAWGHE-E
  subject: [1] P---AW-HEAE
  score: 23

  Local Pairwise Alignment (1 of 1)
  pattern: [5] AWGHE-E
  subject: [2] AW-HEAE
  score: 32

Slide 19. Simple pairwise alignment in R
  library(Biostrings)
  # Define sequences
  seq1 <- "HEAGAWGHEE"
  seq2 <- "PAWHEAE"

  # perform global alignment
  g.align <- pairwiseAlignment(seq1, seq2,
      substitutionMatrix='BLOSUM50', gapOpening=-4,
      gapExtension=-1, type='global')
  g.align

  # perform local alignment
  l.align <- pairwiseAlignment(seq1, seq2,
      substitutionMatrix='BLOSUM50', gapOpening=-4,
      gapExtension=-1, type='local')
  l.align

Slide 20. Look at a “bigger” example
The pairseqsim package (now archived by Bioconductor) has a companion file (ex.fasta) with sequence data for 67 protein sequences in “FASTA” format:
  >At1g01010 NAC domain protein, putative
  MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRD
  ...
  VISWIILVG
  >At1g01020 unknown protein
  MAASEHRCVGCGFRVKSLFIQYSPGNIRLMKCGNCKEVADEYIECERMIIFIDLILHRPK
  VYRHVLYNAINPATVNIQHLLWKLVFAYLLLDCYRSLLLRKSDEESSFSDSPVLLSIKVR
  SFLFNGLN
  >At1g01030 DNA-binding protein, putative
  MDLSLAPTTTTSSDQEQDRDQELTSNIGASSSSGPSGNNNNLPMMMIPPPEKEHMFDKVV
  ...
  EESWLVPRGEIGASSSSSSALRLNLSTDHDDDNDDGDDGDDDQFAKKGKSSLSLNFNP
  >At1g01040 CAF protein
  MVMEDEPREATIKPSYWLDACEDISCDLIDDLVSEFDPSSVAVNESTDENGVINDFFGGI
  ...
  DKDRKRARVCSYQSERSNLSGRGHVNNSREGDRFMNRKRTRNWDEAGNNKKKRECNNYRR
  ...
http://www.stat.usu.edu/jrstevens/bioinf/ex.fasta

Slide 21. “Bigger” example:
  # read in data in FASTA format
  f1 <- "C://folder//ex.fasta" # saved from website (slide 20)
  ff <- readAAStringSet(f1, "fasta")

  # compare first sequence (subject) with the others (pattern)
  sub <- ff[1]
  names(sub) # "At1g01010 NAC domain protein, putative"
  pat <- ff[2:length(ff)]

  # get scores of all global alignments
  s <- pairwiseAlignment(pat, sub, substitutionMatrix='PAM250',
      gapOpening=-4, gapExtension=-1, type='global',
      scoreOnly=TRUE)
  hist(s, main=c('global alignment scores with',names(sub)))

  # look at best alignment
  k <- which.max(s)
  names(pat[k]) # "At1g01190 cytochrome P450, putative"
  pairwiseAlignment(pat[k], sub, substitutionMatrix='PAM250',
      gapOpening=-4, gapExtension=-1, type='global')

Slide 22.
For a given sequence (subject),
  "At1g01010 NAC domain protein, putative"
find the most similar sequence in a list (pattern):
  "At1g01190 cytochrome P450, putative"

  Global Pairwise Alignment (1 of 1)
  pattern: [1] MRTEIESLWVF-----ALASKFNIYMQQHFASLL---VAIAITWFTITIMRTEIESLWVF----- ...
  subject: [1] MEDQVG--FGFRPNDEELVGH---YLRNKIEGNTSRDVEVAIS--EVNICMEDQVG ...
  score: 313
(names refer to gene name or locus)

Slide 23. Phylogenetic trees – intro & motivation
Phylogeny: relationship among species
Phylogenetic tree: visualization of phylogeny (usually a dendrogram)
How can we do this here?
Consider multiple sequences (maybe from different species)
“Similar” sequences are called homologues
- descended from common ancestor sequence?
- similar function?
Want to visualize these relationships

Slide 24. Quick review of agglomerative clustering
(Figure: clusters p, q, and i)
- define distance between points
- each “point” (sequence here) starts as its own cluster
- find closest clusters and merge them
- Linkage: how to define distance between new cluster and existing clusters

Slide 25. Recall linkage methods (a few)
(Figure: clusters p, q, and i)
Let p, q, i be clusters, $d_{pq}$ be the distance between p and q, $d_i$ be the distance between i and the new p,q cluster, and $n_p$ be the number of points in cluster p.
- Single (nearest neighbor): $d_i = \min(d_{pi}, d_{qi})$
- Average: $d_i = (d_{pi} + d_{qi}) / 2$
- Ward: $d_i = \frac{(n_p + n_i) d_{pi} + (n_q + n_i) d_{qi} - n_i d_{pq}}{n_p + n_q + n_i}$
- UPGMA: $d_i = \frac{n_p d_{pi} + n_q d_{qi}}{n_p + n_q}$

Slide 26. Defining “distance” between sequences i & j
Why not Euclidean, Pearson, etc.?
- sequences are not points in space
Could use (after pairwise alignment):
- 1 – normalized score {score (or 0) divided by smaller selfscore}
- 1 – %identity based on length of shorter sequence
- 1 – %similarity
Making use of models for residue substitution (for DNA):
- Let f = fraction of sites in pairwise alignment where residues differ = 1 - %identity
- Jukes-Cantor distance: $d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4f}{3}\right)$

Slide 27.
  # Function to get phylogenetic distance matrix for multiple sequences
  # -- don't worry about syntax here; just see next slide for usage

Slide 28. Visualize relationships among 11 sequences
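
As a rough illustration of the Jukes-Cantor distance defined under "Defining “distance” between sequences", here is a minimal base-R sketch. It is not the distance-matrix function from these notes: the helper names `fraction_differing` and `jukes_cantor` are hypothetical, the three DNA strings are made up, and the sketch assumes the sequences are already aligned to equal length. It also assumes that positions where either sequence has a gap are simply skipped.

```r
# Fraction of aligned sites where residues differ (f = 1 - %identity).
# Assumption of this sketch: inputs are aligned strings of equal length,
# and positions where either string has a gap ("-") are ignored.
fraction_differing <- function(aln1, aln2) {
  a <- strsplit(aln1, "")[[1]]
  b <- strsplit(aln2, "")[[1]]
  stopifnot(length(a) == length(b))
  keep <- a != "-" & b != "-"
  mean(a[keep] != b[keep])
}

# Jukes-Cantor distance d = -(3/4) * log(1 - 4f/3); only defined for f < 3/4.
jukes_cantor <- function(f) {
  stopifnot(all(f >= 0 & f < 0.75))
  -0.75 * log(1 - 4 * f / 3)
}

# Made-up, already-aligned DNA sequences (illustration only).
seqs <- c(s1 = "ACGTACGTAC", s2 = "ACGTACGTAA", s3 = "ACGAACTTAA")

# Pairwise distance matrix.
n <- length(seqs)
D <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    D[i, j] <- jukes_cantor(fraction_differing(seqs[i], seqs[j]))
  }
}

# Agglomerative clustering; method = "average" in hclust is UPGMA.
tree <- hclust(as.dist(D), method = "average")
```

Calling `plot(tree)` would then draw the dendrogram. With real data, the fraction f would come from the pairwise alignments rather than from pre-aligned strings.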

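
The table-filling idea from "Tabular representation of alignment" can also be sketched in a few lines of base R. This is a simplified, score-only version of Needleman-Wunsch under stated assumptions, not the algorithm used by `pairwiseAlignment`. It uses a linear gap penalty (every gap position costs the same, i.e. d = e) rather than the affine penalty, and a unitary match/non-match score rather than BLOSUM or PAM. The function name `nw_score` and its default parameter values are hypothetical.

```r
# Global alignment score by dynamic programming (Needleman-Wunsch).
# Simplifying assumptions: linear gap penalty `gap` per gap position,
# and a unitary score (`match` for identical residues, `mismatch` otherwise).
nw_score <- function(x, y, match = 1, mismatch = 0, gap = -2) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  n <- length(a)
  m <- length(b)
  # tab[i + 1, j + 1] = best score for aligning a[1..i] with b[1..j]
  tab <- matrix(0, n + 1, m + 1)
  tab[, 1] <- gap * (0:n)  # prefix of x aligned entirely against gaps
  tab[1, ] <- gap * (0:m)  # prefix of y aligned entirely against gaps
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (a[i] == b[j]) match else mismatch
      tab[i + 1, j + 1] <- max(
        tab[i, j] + s,        # align a[i] with b[j]
        tab[i, j + 1] + gap,  # a[i] against a gap in y
        tab[i + 1, j] + gap   # b[j] against a gap in x
      )
    }
  }
  tab[n + 1, m + 1]  # final score sits in the lower right corner
}

nw_score("HEAGAWGHEE", "PAWHEAE")
```

Recovering the alignment itself would additionally require recording which of the three moves produced each maximum and tracing that path back from the lower right corner.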