
Introduction to Sequence Analysis


Utah State University, Fall 2019
Statistical Bioinformatics (Biomedical Big Data), Notes 11

References

Chapters 2-7 of Biological Sequence Analysis (Durbin et al., 2001)
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14:755-763.
Bodenhofer et al. (2015). Bioinformatics, 31(24):3997-3999.


Review

Genes are:
- sequences of DNA that “do” something
- can be expressed as a string of nucleic acids: A, C, G, T (4-letter alphabet)

Central Dogma of Molecular Biology:
DNA -> mRNA -> protein -> bio. action

protein: can be expressed as a string of amino acids (20-letter alphabet; sometimes 24 due to “similarities”)

Why look at sequence?

Levels of protein structure
- Primary structure: order of amino acids
- Secondary structure: repeating structures (beta-sheets and alpha-helices) in “backbone”
- Tertiary structure: full three-dimensional folded structure
- Quaternary structure: interaction of multiple “backbones”

Sequence -> shape -> function
Similar sequence -> similar function?

Consider simple pairwise alignment

Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE

How similar are these two sequences?
- Match up exactly?
- Subsequences similar?
- Which positions could possibly be matched without severe penalty?

To find the “best” alignment, need some way to rate alignments.

Possible alignments

Alignment 1: HEAGAWGHEE / PAWHEAE
Alignment 2: HEAGAWGHEE / PAW-HE-AE
Alignment 3: HEA-GAWGHEE / PAWHEAE
Alignment 4: HEAGAWGHE-E / PAW-HEAE

Think of gaps in alignment as mutational insertion or deletion.


Basic idea of scoring potential alignments

+ score: identities and “conservative” substitutions
- score: non-“conservative” changes (not expected in “real” alignments)

Add score at each position
- Equivalent to assuming mutations at different positions are independent
- Reasonable assumption for DNA and proteins but not structural RNAs

Some Notation

$q_a$ = freq. of letter $a$ in sequence
$P_{ab}$ = P{a, b from common ancestor}
Let $x$ be sequence 1, and $y$ be sequence 2.

Random Model (assume independence of sequences):
$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$

Matched Model (assume residues a & b are aligned as a pair with prob. $P_{ab}$):
$$P(x, y \mid M) = \prod_i P_{x_i y_i}$$

Compare these two models

Odds Ratio:
$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \prod_i \frac{P_{x_i y_i}}{q_{x_i} q_{y_i}}$$

Log Odds Ratio:
$$S = \sum_i s(x_i, y_i), \quad \text{where } s(a,b) = \log \frac{P_{ab}}{q_a q_b}$$

$s(a,b)$ is the log likelihood ratio of pair (a,b) occurring as aligned pair, as opposed to unaligned pair.

Need: $P_{ab}$

Score Matrix (or “substitution matrix”)

       A    R    N    D   ...   Y    V
A  |   5   -2   -1   -2        -2    0
R  |  -2    7   -1   -2        -1   -3
N  |  -1   -1    7   ...
D  |  -2   -2   ...
...|
Y  |  -2   -1   ...
V  |   0   -3

Entries are s(a,b). These are scaled and rounded log-odds values (for computational efficiency).
This is a portion of the BLOSUM50 substitution matrix; others exist.
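As a minimal sketch of how the log-odds score is computed, the R code below uses a toy two-letter alphabet. The frequencies in `q` and `P` are made-up values chosen only so the arithmetic is easy to follow; they are not taken from BLOSUM50 or any real matrix.

```r
# Hypothetical background frequencies q_a for a toy two-letter alphabet
q <- c(A = 0.6, B = 0.4)

# Hypothetical joint probabilities P_ab of seeing a and b aligned
# (symmetric; entries sum to 1 and the margins agree with q)
P <- matrix(c(0.50, 0.10,
              0.10, 0.30),
            nrow = 2, byrow = TRUE,
            dimnames = list(names(q), names(q)))

# Log-odds score s(a,b) = log( P_ab / (q_a * q_b) )
s <- log(P / outer(q, q))

# Score of an ungapped alignment: sum of s(x_i, y_i) over positions
x <- c("A", "A", "B", "B")
y <- c("A", "B", "B", "A")
S <- sum(s[cbind(x, y)])
```

Identities get positive scores here and the mismatch gets a negative score, which is the pattern the matrix above shows after scaling and rounding.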


How to get these substitution values?

Basic idea:
- Look at existing, “known” alignments
- Compare sequences of aligned proteins and look at substitution frequencies

This is a chicken-or-the-egg problem: alignment <-> scoring scheme (each depends on the other)

Maybe better to base alignment on tertiary structures (or some other alignment)

Some substitution matrix types

BLOSUM (Henikoff)
- BLOCK substitution matrix
- derived from BLOCKS database: set of aligned ungapped protein families, clustered according to threshold percentage (L) of identical residues; compare residue frequencies between clusters
- L=50 -> BLOSUM50

PAM (Dayhoff)
- percentage of acceptable point mutations per 10^8 years
- derived from a general model for protein evolution, based on number L of PAMs (evolutionary distance)
- PAM1 from comparing sequences with <1% divergence
- L=250 -> PAM250 = PAM1^250
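The relation PAM250 = PAM1^250 is a matrix power: the one-step substitution probabilities are applied 250 times. A rough sketch in base R, assuming a made-up 3-letter alphabet and one-step matrix `M1` (the helper `mat.pow` is also just illustrative):

```r
# Hypothetical one-step substitution matrix for a toy 3-letter alphabet:
# entry [a, b] is P(a -> b) after one unit of evolutionary distance
M1 <- matrix(c(0.98, 0.01, 0.01,
               0.02, 0.97, 0.01,
               0.01, 0.02, 0.97),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("X", "Y", "Z"), c("X", "Y", "Z")))

# Matrix power by repeated multiplication
mat.pow <- function(M, k) {
  out <- diag(nrow(M))
  for (step in seq_len(k)) out <- out %*% M
  out
}

M250 <- mat.pow(M1, 250)
rowSums(M250)   # each row is still a probability distribution
diag(M250)      # chance of seeing the same letter is much lower than in M1
```

The actual PAM score matrices are then log-odds values derived from such probabilities; that step is not shown here.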

Which substitution matrix to use?

No universal “best” way

In general:
- low PAM: find short alignments of similar seq.
- high PAM: find longer, weaker local alignments

BLOSUM standards:
- BLOSUM50 for alignment with gaps
- BLOSUM62 for ungapped alignments

higher PAM, lower BLOSUM: more divergent (looking for more distantly related proteins)

A reasonable strategy: BLOSUM62 complemented with PAM250

Which matrix for aligning DNA sequences?

The BLOSUM and PAM matrices are based on similarities between amino acids
- no such similarity assumed for nucleic acids; residues either match or they don’t

Unitary matrix: identity matrix
- +1 for identical match (or +3 or …)
- 0 for non-match (or -2 or …)
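A match/mismatch matrix for DNA is easy to build by hand. The sketch below is one way to do it in base R; the function name `dna.score.matrix` and its defaults are illustrative choices, not part of any package.

```r
# Build a simple match/mismatch scoring matrix for DNA
dna.score.matrix <- function(match = 1, mismatch = 0) {
  bases <- c("A", "C", "G", "T")
  m <- matrix(mismatch, nrow = 4, ncol = 4, dimnames = list(bases, bases))
  diag(m) <- match
  m
}

dna.score.matrix()                          # unitary (identity) matrix
dna.score.matrix(match = 3, mismatch = -2)  # the alternative values mentioned above
```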


How to score gaps?

One way: affine gap penalty (linear transformation followed by a translation)

$$\gamma(g) = -d - (g - 1)e$$

- g: gap length
- d: gap opening penalty
- e: gap extension penalty (e < d)

Think of gaps in alignment as mutational insertion or deletion.

Tabular representation of alignment

        H  E  A  G  A  W  G  H  E  E
     0
P |
A |
W |
H |
E |
A |
E |

- start with 0
- begin (or continue) gap: -d (or -e)
- match letters (residues): + s(a,b)

Fill in table to give max. of possible values at each successive element; keep track of which direction generated max.; then use the “path” that gives highest final score (lower right corner)
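To make the affine penalty concrete, here is a small sketch that scores one given gapped alignment of the two example sequences. The names `gap.penalty`, `blosum50.sub`, and `score.alignment` are illustrative. The matrix holds only the BLOSUM50 entries for the six residues that occur in the example. The choice d = 5, e = 1 is an assumption about how the Biostrings settings gapOpening=-4, gapExtension=-1 used in these notes translate into this notation (a gap of length g then costs 4 + g).

```r
# Affine gap penalty: gamma(g) = -d - (g - 1) * e for a gap of length g
gap.penalty <- function(g, d = 5, e = 1) -d - (g - 1) * e

# BLOSUM50 scores, restricted to the residues in the running example
res <- c("A", "E", "G", "H", "P", "W")
blosum50.sub <- matrix(c( 5, -1,  0, -2, -1, -3,
                         -1,  6, -3,  0, -1, -3,
                          0, -3,  8, -2, -2, -3,
                         -2,  0, -2, 10, -2, -3,
                         -1, -1, -2, -2, 10, -4,
                         -3, -3, -3, -3, -4, 15),
                       nrow = 6, byrow = TRUE, dimnames = list(res, res))

# Score a given gapped alignment (two equal-length strings, "-" marks a gap)
score.alignment <- function(a, b, s = blosum50.sub, d = 5, e = 1) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  stopifnot(length(a) == length(b))
  gap <- a == "-" | b == "-"
  # substitution scores over the aligned residue pairs
  total <- sum(s[cbind(a[!gap], b[!gap])])
  # one affine penalty per maximal run of gap characters in each row
  for (sq in list(a, b)) {
    runs <- rle(sq == "-")
    total <- total + sum(gap.penalty(runs$lengths[runs$values], d, e))
  }
  total
}

score.alignment("HEAGAWGHE-E", "P---AW-HEAE")   # 23
```

The aligned pairs contribute 40 and the three gap runs (lengths 3, 1, 1) contribute -17, giving 23, which agrees with the global alignment score that Biostrings reports for this pair in these notes.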

Alignment algorithms

Global: Needleman-Wunsch
- find optimal alignment for entire sequences (the tabular method above)

Local: Smith-Waterman
- find optimal alignment for subsequences

Repeated matches
- allow for starting over sequences (find motifs in long sequences)

Overlap matches
- allow for one sequence to contain or overlap the other (for comparing fragments)

Heuristic: BLAST, FASTA
- for comparing a single sequence against a large database of sequences

Compare global and local alignments

Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE

Global Pairwise Alignment (1 of 1)
pattern: [1] HEAGAWGHE-E
subject: [1] P---AW-HEAE
score: 23

Local Pairwise Alignment (1 of 1)
pattern: [5] AWGHE-E
subject: [2] AW-HEAE
score: 32
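The table-filling idea can be sketched directly in base R. The function below is a score-only dynamic program with affine gaps (three tables, in the style usually attributed to Gotoh); it is a teaching sketch, not how Biostrings implements `pairwiseAlignment`. The names `s50` and `align.score` are illustrative, the matrix holds only the BLOSUM50 entries needed for the example, and d = 5, e = 1 is an assumed translation of gapOpening=-4, gapExtension=-1.

```r
# BLOSUM50 scores, restricted to the residues in the running example
res <- c("A", "E", "G", "H", "P", "W")
s50 <- matrix(c( 5, -1,  0, -2, -1, -3,
                -1,  6, -3,  0, -1, -3,
                 0, -3,  8, -2, -2, -3,
                -2,  0, -2, 10, -2, -3,
                -1, -1, -2, -2, 10, -4,
                -3, -3, -3, -3, -4, 15),
              nrow = 6, byrow = TRUE, dimnames = list(res, res))

# Best alignment score with affine gaps (first gap position costs d, each further one e).
# M: x_i aligned to y_j;  X: x_i aligned to a gap;  Y: y_j aligned to a gap.
# Index [i + 1, j + 1] refers to the prefixes x[1..i] and y[1..j].
align.score <- function(x, y, s, d = 5, e = 1, type = c("global", "local")) {
  type <- match.arg(type)
  x <- strsplit(x, "")[[1]]
  y <- strsplit(y, "")[[1]]
  n <- length(x)
  m <- length(y)
  M <- X <- Y <- matrix(-Inf, n + 1, m + 1)
  M[1, 1] <- 0
  if (type == "global") {
    X[-1, 1] <- -d - (seq_len(n) - 1) * e   # x starts opposite a run of gaps
    Y[1, -1] <- -d - (seq_len(m) - 1) * e   # y starts opposite a run of gaps
  } else {
    M[, 1] <- 0                             # a local alignment may start anywhere
    M[1, ] <- 0
  }
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      diag.best <- max(M[i, j], X[i, j], Y[i, j]) + s[x[i], y[j]]
      # local alignments are allowed to restart at 0
      M[i + 1, j + 1] <- if (type == "local") max(diag.best, 0) else diag.best
      X[i + 1, j + 1] <- max(M[i, j + 1] - d, X[i, j + 1] - e, Y[i, j + 1] - d)
      Y[i + 1, j + 1] <- max(M[i + 1, j] - d, Y[i + 1, j] - e, X[i + 1, j] - d)
    }
  }
  if (type == "global") {
    max(M[n + 1, m + 1], X[n + 1, m + 1], Y[n + 1, m + 1])  # lower right corner
  } else {
    max(M)                                                   # best cell anywhere
  }
}

align.score("HEAGAWGHEE", "PAWHEAE", s50, type = "global")  # 23
align.score("HEAGAWGHEE", "PAWHEAE", s50, type = "local")   # 32
```

Both values match the Biostrings scores shown above. Recovering the alignment itself would additionally require storing which option produced each maximum and tracing back, which this sketch leaves out.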


Simple pairwise alignment in R

library(Biostrings)
# (recent Bioconductor releases provide pairwiseAlignment() in the pwalign package)

# Define sequences
seq1 <- "HEAGAWGHEE"
seq2 <- "PAWHEAE"

# perform global alignment
g.align <- pairwiseAlignment(seq1, seq2,
    substitutionMatrix='BLOSUM50', gapOpening=-4,
    gapExtension=-1, type='global')
g.align

# perform local alignment
l.align <- pairwiseAlignment(seq1, seq2,
    substitutionMatrix='BLOSUM50', gapOpening=-4,
    gapExtension=-1, type='local')
l.align

Look at a “bigger” example

The pairseqsim package (now archived by Bioconductor) has a companion file (ex.fasta) with sequence data for 67 protein sequences in “FASTA” format:

>At1g01010 NAC domain protein, putative
MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRD
...
VISWIILVG
>At1g01020 unknown protein
MAASEHRCVGCGFRVKSLFIQYSPGNIRLMKCGNCKEVADEYIECERMIIFIDLILHRPK
VYRHVLYNAINPATVNIQHLLWKLVFAYLLLDCYRSLLLRKSDEESSFSDSPVLLSIKVR
SFLFNGLN
>At1g01030 DNA-binding protein, putative
MDLSLAPTTTTSSDQEQDRDQELTSNIGASSSSGPSGNNNNLPMMMIPPPEKEHMFDKVV
...
EESWLVPRGEIGASSSSSSALRLNLSTDHDDDNDDGDDGDDDQFAKKGKSSLSLNFNP
>At1g01040 CAF protein
MVMEDEPREATIKPSYWLDACEDISCDLIDDLVSEFDPSSVAVNESTDENGVINDFFGGI
...
DKDRKRARVCSYQSERSNLSGRGHVNNSREGDRFMNRKRTRNWDEAGNNKKKRECNNYRR
...

http://www.stat.usu.edu/jrstevens/bioinf/ex.fasta

“Bigger” example:

For a given sequence (subject),
"At1g01010 NAC domain protein, putative"
find the most similar sequence in a list (pattern):
"At1g01190 cytochrome P450, putative"
(names refer to gene name or locus)

library(Biostrings)

# read in data in FASTA format
f1 <- "C:/folder/ex.fasta" # saved from the URL above
ff <- readAAStringSet(f1, "fasta")

# compare first sequence (subject) with the others (pattern)
sub <- ff[1]
names(sub) # "At1g01010 NAC domain protein, putative"
pat <- ff[2:length(ff)]

# get scores of all global alignments
s <- pairwiseAlignment(pat, sub, substitutionMatrix='PAM250',
    gapOpening=-4, gapExtension=-1, type='global',
    scoreOnly=TRUE)
hist(s, main=c('global alignment scores with',names(sub)))

# look at best alignment
k <- which.max(s)
names(pat[k]) # "At1g01190 cytochrome P450, putative"
pairwiseAlignment(pat[k], sub, substitutionMatrix='PAM250',
    gapOpening=-4, gapExtension=-1, type='global')

Global Pairwise Alignment (1 of 1)
pattern: [1] MRTEIESLWVF-----ALASKFNIYMQQHFASLL---VAIAITWFTITIMRTEIESLWVF----- ...
subject: [1] MEDQVG--FGFRPNDEELVGH---YLRNKIEGNTSRDVEVAIS--EVNICMEDQVG ...
score: 313

Phylogenetic trees: intro & motivation

Phylogeny: relationship among species
Phylogenetic tree: visualization of phylogeny (usually a dendrogram)

How can we do this here?
Consider multiple sequences (maybe from different species)
“Similar” sequences are called homologues
- descended from common ancestor sequence?
- similar function?
Want to visualize these relationships

Quick review of agglomerative clustering

(Figure: clusters p and q, and another cluster i)

- define distance between points
- each “point” (sequence here) starts as its own cluster
- find closest clusters and merge them
- Linkage: how to define distance between new cluster and existing clusters
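The clustering steps listed above can be tried on a tiny example with base R's `hclust`. The four items and their distances below are made up for illustration.

```r
# Hypothetical pairwise distances among four items a, b, c, d
d <- as.dist(matrix(c(0.0, 0.2, 0.7, 0.8,
                      0.2, 0.0, 0.6, 0.9,
                      0.7, 0.6, 0.0, 0.3,
                      0.8, 0.9, 0.3, 0.0),
                    nrow = 4, dimnames = list(letters[1:4], letters[1:4])))

# Each item starts as its own cluster; the closest clusters are merged in turn
hc <- hclust(d, method = "average")
hc$merge            # which clusters were joined at each step
hc$height           # distance at which each merge happened
cutree(hc, k = 2)   # cluster membership when the tree is cut into two groups
```

Here a and b merge first (distance 0.2), then c and d (0.3), and the two pairs merge last at the average of the four between-pair distances, 0.75.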

Recall linkage methods (a few)

Let p, q, i be clusters, $d_{pq}$ be the p-q distance, $d_i$ be the distance between i and the new p,q cluster, and $n_p$ be the number of points in cluster p.

Single (nearest neighbor): $d_i = \min(d_{pi}, d_{qi})$

Average: $d_i = (d_{pi} + d_{qi}) / 2$

Ward: $d_i = \dfrac{(n_p + n_i) d_{pi} + (n_q + n_i) d_{qi} - n_i d_{pq}}{n_p + n_q + n_i}$

UPGMA: $d_i = \dfrac{n_p d_{pi} + n_q d_{qi}}{n_p + n_q}$

Defining “distance” between sequences i & j

Why not Euclidean, Pearson, etc.?
- sequences are not points in space

Could use (after pairwise alignment):
- 1 – normalized score {score (or 0) divided by smaller selfscore}
- 1 – %identity based on length of shorter sequence
- 1 – %similarity

Making use of models for residue substitution (for DNA):
Let f = fraction of sites in pairwise alignment where residues differ = 1 - %identity

Jukes-Cantor distance: $d_{ij} = -\frac{3}{4} \log(1 - 4f/3)$
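A short sketch of the Jukes-Cantor distance in base R. The function name `jc.distance` and the two aligned DNA strings are made up for illustration, and the alignment is assumed to be gap-free.

```r
# Jukes-Cantor distance from the fraction f of differing sites
jc.distance <- function(f) {
  stopifnot(all(f >= 0), all(f < 0.75))   # the log argument must stay positive
  -(3 / 4) * log(1 - 4 * f / 3)
}

# Fraction of differing sites in a gap-free pairwise alignment
a <- strsplit("ACGTACGTAC", "")[[1]]
b <- strsplit("ACGTTCGTAA", "")[[1]]
f <- mean(a != b)   # 2 of 10 sites differ

jc.distance(f)
```

The distance is always at least f, since it corrects for sites that may have changed more than once.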


# Function to get phylogenetic distance matrix for multiple sequences
# -- don't worry about syntax here; just see the usage code that follows
library(Biostrings)
get.phylo.dist <- function(seqs,subM='BLOSUM62',open=-4,ext=-1,type='local')
{
  # Get matrix of pairwise alignment scores (local by default)
  num.seq <- length(seqs)
  s.mat <- matrix(ncol=num.seq, nrow=num.seq)
  for(i in seq_len(num.seq))
  {
    for(j in i:num.seq)
    {
      s.mat[i,j] <- s.mat[j,i] <- pairwiseAlignment(seqs[i], seqs[j],
          substitutionMatrix=subM, gapOpening=open,
          gapExtension=ext, type=type, scoreOnly=TRUE)
    }
  }

  # Convert scores to normalized scores:
  # score (or 0 if negative) divided by the smaller self-score
  norm.mat <- matrix(ncol=num.seq, nrow=num.seq)
  for(i in seq_len(num.seq))
  {
    for(j in i:num.seq)
    {
      min.self <- min(s.mat[i,i],s.mat[j,j])
      norm.mat[i,j] <- norm.mat[j,i] <- max(s.mat[i,j],0)/min.self
    }
    norm.mat[i,i] <- 1   # self-similarity is 1, so self-distance is 0
  }

  # Return distance matrix (1 - normalized score)
  colnames(norm.mat) <- rownames(norm.mat) <- substr(names(seqs),1,9)
  return(as.dist(1-norm.mat))
}

Visualize relationships among 11 sequences from ex.fasta file

R code for phylogenetic trees from pairwise distances

# Choose sequences
seqs <- ff[50:60] # recall ff object read from ex.fasta earlier

# Phylogenetic tree
dmat <- get.phylo.dist(seqs,subM='BLOSUM62',type='local')
plot(hclust(dmat,method="average"),main='Phylogenetic Tree',
    xlab='Normalized Score')

# heatmap representation
library(cluster)
library(RColorBrewer)
hmcol <- colorRampPalette(brewer.pal(10,"PuOr"))(256)
hclust.ave <- function(d){hclust(d,method="average")}
heatmap(as.matrix(dmat),symm=TRUE,col=hmcol,
    cexRow=4,cexCol=1,hclustfun=hclust.ave)

Aside: visualizing sequence content

# count each residue in the first sequence and highlight serine (S)
tab <- table(strsplit(as.character(ff[1]),""))
use.col <- rep('yellow',length(tab))
t <- names(tab)=='S'
use.col[t] <- 'blue'
barplot(tab,col=use.col,main=names(ff[1]))

Probably more useful for: assessing C-G counts in DNA sequences


# get sequence (coding region) of a gene;
# example: ENSG00000160551
library(biomaRt)
use.mart <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
seq <- getSequence(id="ENSG00000160551", type="ensembl_gene_id",
    seqType="coding", mart=use.mart)
seq[,1]
# this returns three sequences (entry 2 is unavailable); compare these:
# 1 looks like a substring of both 3 & 4;
# 3 appears to be mostly a substring of 4

[1] "ATGCCATCAAC … CAAGTTTC
[2] "Sequence unavailable"
[3] "ATGCCATCAAC … CAAGTTTCTAC … GCTTAAAGAGTCTAAAGAACT …
[4] "ATGCCATCAAC … CAAGTTTCTAC … GCTTAAAGAGGAGCTAAATGA …

# bar plot of base counts in the first sequence, highlighting A
tab <- table(strsplit(seq[1,1],""))
use.col <- rep('yellow', length(tab))
t <- names(tab)=='A'
use.col[t] <- 'blue'
barplot(tab,col=use.col,
    main="sequence content of ENSG00000160551")
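For the C-G assessment suggested in the aside, a base-R sketch could look like this. The function name `gc.content` is illustrative, and the input is only the first few bases shown in the output above, not a full gene sequence.

```r
# Fraction of G and C bases in a DNA string
gc.content <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  mean(bases %in% c("G", "C"))
}

gc.content("ATGCCATCAAC")   # 5 of these 11 bases are G or C
```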

What about more than two sequences?

Multiple alignment
- many possible strategies to find and score possible alignments

One common way: ClustalW
- a “progressive alignment” approach
- construct pairwise distances based on evolutionary distance
- essentially follow an agglomerative clustering approach, progressively aligning nodes in order of decreasing similarity
- additional heuristics make final alignment more accurate

Common summary: “pretty-printing”

(See R package msa, published 2015 Bioinformatics)

Follow-up to a sequence alignment

Consider pairwise (or multiple) alignment

What does alignment mean?
- possibly represents common ancestry

Possible questions
- Does alignment describe some “family”?
- How can we describe its internal structure?

Can sometimes characterize these “family” structures as profile Hidden Markov Model

Using HMMs to describe a “family”

Suppose we have an alignment of multiple sequences
- we can model their “relationship” as a family of sequences
- call this the family’s “profile”

PSSM: position-specific score matrix
- estimate this to describe this particular profile (e.g., should ‘A’ count for more at a particular position in the alignment?)

Allow for insertions and deletions, where “cost” could also be position-specific

Use this profile to describe the alignment and look for other similar sequences
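A position-specific score matrix can be sketched from a small ungapped alignment in base R. Everything below is illustrative: the four aligned sequences, the reduced five-letter alphabet, the pseudocount of 1, and the uniform background are assumptions made for this sketch, and insertions and deletions (which a profile HMM also models) are left out.

```r
# Toy multiple alignment (hypothetical sequences, equal length, no gaps)
aln <- c("ACDA", "ACDS", "SCDA", "ACQA")
alphabet <- c("A", "C", "D", "Q", "S")

# Residue counts per alignment column: rows = residues, columns = positions
cols <- do.call(rbind, strsplit(aln, ""))
counts <- sapply(seq_len(ncol(cols)), function(j)
  tabulate(match(cols[, j], alphabet), nbins = length(alphabet)))
rownames(counts) <- alphabet

# Position-specific probabilities, with a pseudocount of 1 per residue
probs <- sweep(counts + 1, 2, colSums(counts + 1), "/")

# Log-odds against a uniform background
bg <- rep(1 / length(alphabet), length(alphabet))
pssm <- log(probs / bg)

# Score a candidate sequence against the profile
score.seq <- function(sq, pssm) {
  r <- strsplit(sq, "")[[1]]
  sum(pssm[cbind(match(r, rownames(pssm)), seq_along(r))])
}

score.seq("ACDA", pssm)   # consensus-like sequence scores high
score.seq("QQQQ", pssm)   # poorly matching sequence scores low
```

In this toy profile, 'A' scores higher at position 1 (seen in 3 of 4 sequences) than at position 2 (never seen), which is the sense in which a residue can count for more at a particular position.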

Profile example (from hmmer / hmmbuild)

HMM     A      C      D    ...    Q      R      S      T    ...
15    2.35   4.27   3.26   ...  3.50   3.44   0.99   2.83   15
16    3.08   4.91   3.57   ...  3.09   0.88   3.16   3.34   16
17    2.66   0.81   4.20   ...  4.12   3.89   2.93   3.13   17
18    2.35   4.27   3.26   ...  3.50   3.44   0.99   2.83   18
19    2.35   4.27   3.26   ...  3.50   3.44   0.99   2.83   19
...

Summary

Look at sequence similarity to find functional similarity (and families)

Pairwise alignment basics
- Scoring matrix: BLOSUM, PAM, etc.
- Alignment algorithm: global, local, etc.

Tools for multiple alignment & pattern (motif) finding

Coming up: searching online databases (BLAST)
