Bioe 144/244 Introduction to Protein Informatics

Multiple sequence alignment (MSA) algorithms, uses and benchmarking

Subset (to avoid redundancy with previous lectures and Pevsner text)

Bioe 190, Fall 2016

GFKLP
GYKLP
GFRVP
GF-LP

Topics Covered

• Uses of multiple sequence alignments (MSAs)
• Global, glocal and local alignments
• Structural vs sequence alignments
• Progressive, iterative, consistency-based and master-slave alignment methods
• MAFFT and SATCHMO-JS
• Benchmark datasets & scoring functions
• MSA formats (aligned FASTA, UCSC a2m)

What is an MSA?

• An MSA is an assertion of homology
  – More than 2 nucleotide or amino acid sequences
  – Characters in each column have descended from a common ancestral character, except for indel characters, which represent insertions and deletions
• An MSA is a matrix
  – M_{i,j} = the character for sequence i at column j
  – Lower-case characters may have different meanings from upper-case characters
  – Dot (.) and dash (-) are indel characters (understand UCSC a2m format)
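The matrix view can be made concrete with a minimal Python sketch. It assumes the MSA is held as a list of equal-length strings (one per sequence) and uses the toy alignment from the slide graphic. The helper names `column`, `is_rectangular` and `ungapped` are illustrative only, indices are 0-based (unlike the slide's matrix notation), and the case distinctions of a2m are not handled here.

```python
# Toy MSA from the slide graphic: rows are sequences, columns are positions.
msa = ["GFKLP", "GYKLP", "GFRVP", "GF-LP"]

INDEL_CHARS = {"-", "."}  # dash and dot are both treated as indel characters


def is_rectangular(msa):
    """An MSA is a matrix, so every row must have the same length."""
    return len({len(row) for row in msa}) == 1


def column(msa, j):
    """Return column j (0-based) as a string, one character per sequence."""
    return "".join(row[j] for row in msa)


def ungapped(row):
    """Recover the raw sequence by dropping indel characters."""
    return "".join(c for c in row if c not in INDEL_CHARS)


print(is_rectangular(msa))  # True
print(column(msa, 1))       # FYFF
print(column(msa, 2))       # KKR-
print(ungapped(msa[3]))     # GFLP
```

Reading the alignment column by column, rather than row by row, is what most downstream uses (conservation scoring, masking, profile construction) operate on.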

Uses of sequence alignment

Example MSA (slide graphic): GFKLP, GYKLP, GFRVP, GF-LP

• Phylogenetic tree
• Active and binding site prediction
• Homology models
• Profiles/HMM construction
• Domain prediction
• Substitution matrices
• Function prediction by phylogenomic analysis
• Secondary structure prediction
• Subfamily identification
• And more…

The relationship of trees and MSAs

• An MSA is typically used as input to estimate a phylogenetic tree
• Some progressive MSA algorithms start by estimating a hierarchical tree based on pairwise similarity (or distance) scores, from which a guide tree can be constructed; the guide tree is then used to determine the order in which sequences (or groups of sequences) are aligned to each other
• Simultaneous sequence alignment and phylogenetic tree estimation methods also exist (e.g., SATCHMO)
  – Note that although SATCHMO is a progressive alignment method, it does not use a guide tree
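How a guide tree fixes the join order can be sketched as follows. This is a minimal illustration, not the procedure of any particular aligner: it assumes pairwise distances are already available in a dictionary, and it uses average-linkage (UPGMA-style) clustering as a stand-in for whatever tree method a given tool uses. The function name `guide_tree_join_order` and the distance values are hypothetical.

```python
from itertools import combinations


def guide_tree_join_order(names, dist):
    """Return the order in which sequences/groups would be joined.

    names: list of sequence names
    dist:  dict mapping frozenset({a, b}) -> pairwise distance (assumed given)
    Uses average-linkage clustering; each join is a pair of clusters (tuples of names).
    """
    clusters = [(n,) for n in names]
    joins = []

    def avg_dist(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        joins.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return joins


# Hypothetical distances: A/B are close, C/D are close, the two pairs are distant.
dist = {
    frozenset(("A", "B")): 0.10, frozenset(("C", "D")): 0.20,
    frozenset(("A", "C")): 0.60, frozenset(("A", "D")): 0.70,
    frozenset(("B", "C")): 0.65, frozenset(("B", "D")): 0.75,
}
for left, right in guide_tree_join_order(["A", "B", "C", "D"], dist):
    print(left, "+", right)
```

In a progressive aligner each join would trigger a sequence-to-sequence, sequence-to-group or group-to-group alignment; here only the ordering is computed.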

Major techniques used in MSA estimation

• Optimizing a scoring function (e.g., sum of pairs)
  – Note sensitivity to the substitution/match/indel scores
  – ClustalW sets indel costs higher in hydrophobic regions
• Dynamic programming (to align sequences/groups to each other)
• Divide-and-conquer techniques
  – Tree-based partitioning (MUSCLE, MAFFT)
  – Similar idea used in SATCHMO-JS
• Guide trees (in some progressive methods)
  – Estimated initially without an MSA (based on pairwise sequence comparison scores)
  – Subsequent iterations use the current MSA as the basis for estimating a tree (for tree-dependent partitioning)

Some alignments are easy

Note: Belvu MSA coloring (based on BLOSUM62 average pairwise score). Light blue: highest conservation. Dark blue: moderate conservation. Grey: few residues conserved. Uncolored: mixed distributions.

SATCHMO simultaneous MSA & tree estimation

Local, glocal and global-global

• Local-local
  Best for boosting remote homolog detection and identification of evolutionary domains. Default protocol of BLAST and PSI-BLAST.
• Global-local (aka glocal)
  Global to the query, potentially local to the hit. Best for gathering homologs to a structural or functional domain.
• Global-global
  Restrict sequences to those appearing to have the same domain architecture. Default protocol of FlowerPower.

As proteins diverge from their common ancestor, structurally superposable positions decrease

Restricted to the “common core”

“The relation between the divergence of sequence and structure in proteins”, Chothia and Lesk, EMBO Journal 1986

Structural alignment is the gold standard for evaluating sequence alignment

• Structural superposition of two PDB structures provides correspondences/equivalences between residues
• Since primary sequence diverges more rapidly than 3D structure, structural alignment is the gold standard against which sequence alignment is assessed
• Not all structural aligners agree on all pairs
• However, clearly superposable pairs (within 2.5 Angstroms) are normally agreed upon by structural aligners
• Example structural aligners include: jFATCAT, CE, DALI, VAST, Structal

RCSB/PDB Structure superposition: helices and strands superpose better than loops


VAST multiple alignment based on structure superposition: Non-equivalent positions are in lower-case

1SN4 Scorpion neurotoxin (blue positions are not superposable)

Basic MSA method classes

• Progressive
  – Key feature: once two sequences are aligned to each other, their respective alignment will not change (errors cannot be fixed)
  – Note: Sometimes (not always) a guide tree (based on pairwise sequence similarity) is used to determine the join order of the sequences (e.g., ClustalW)
  – Pros: output alignments generally good within clusters of closely related sequences
  – Cons: cannot fix errors
  – Examples: ClustalW, SATCHMO
• Iterative
  – Key feature: alignments of sequences may be adjusted based on some objective function
  – Pros: improves accuracy
  – Cons: additional computational complexity over progressive alignment
  – Note that many iterative alignment algorithms include progressive alignment
  – Examples: MUSCLE, MAFFT
• Master-slave
  – One sequence (or profile/HMM) is the master; other sequences are aligned to it
  – Pros: fast
  – Cons: low accuracy (especially in regions where the profile/HMM is noisy, or where the sequences being aligned diverge in sequence)
  – Examples: BLAST, PSI-BLAST, aligning sequences to an HMM
• Consistency-based
  – Key feature: the algorithm attempts to enforce that if, in the set of pairwise alignments to be integrated into the MSA, residue i is aligned to residue j and residue j is aligned to residue k, then in the output MSA residues i, j and k all align to each other
  – Pros: makes biological sense
  – Cons: computationally expensive (slow; a problem on large inputs)
  – Examples: ProbCons, T-Coffee

MAFFT Iterative Alignment (includes an initial progressive step)

“MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform”, Katoh, Misawa, Kuma, and Miyata, Nucleic Acids Res. 2002 Jul 15;30(14):3059-66.

The basic steps of the MAFFT algorithm

1. A guide tree is estimated based on a distance matrix from pairwise sequence comparisons.
2. Input sequences are progressively aligned following the branching order of sequences in the guide tree.
3. The alignment is subjected to further improvement, in which the alignment is divided into two groups and realigned using a technique called tree-dependent restricted partitioning.
4. This process is iterated until no better-scoring alignment is obtained.
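The "divide into two groups" part of step 3 can be illustrated with a short sketch. It is not MAFFT's implementation and it does not perform the realignment; it only enumerates the candidate two-group splits obtained by cutting each branch of a tree once, which is the general idea behind tree-dependent partitioning. The nested-tuple tree encoding and the names `leaves` and `tree_partitions` are assumptions made for this example.

```python
def leaves(tree):
    """Collect leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def tree_partitions(tree):
    """Yield (group, rest) splits produced by cutting each branch of the tree once."""
    all_leaves = frozenset(leaves(tree))
    seen = set()

    def visit(node):
        if isinstance(node, str):
            return
        for child in node:
            group = frozenset(leaves(child))
            rest = all_leaves - group
            key = frozenset((group, rest))
            # Skip empty splits and the duplicate split seen from the other side of the root.
            if rest and key not in seen:
                seen.add(key)
                yield sorted(group), sorted(rest)
            yield from visit(child)

    yield from visit(tree)


# Hypothetical guide tree over four sequences.
example_tree = (("A", "B"), ("C", "D"))
splits = list(tree_partitions(example_tree))
for group, rest in splits:
    print(group, "|", rest)
```

In an iterative refinement loop, each split would be realigned group-to-group and the result kept only if the alignment score improves, repeating until no split yields an improvement.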

http://nar.oxfordjournals.org/content/30/14/3059.full

Tree-dependent partitioning used in the MAFFT iterative step

http://nar.oxfordjournals.org/content/30/14/3059.full

Benchmark datasets

• BAliBASE (Benchmark Alignment Database)
  • One of the oldest benchmark datasets. Numerous small test sets evaluating different types of data (high vs low %ID, uneven or similar lengths, etc.). Limitations: many sequences have no known structure (reference alignments manually refined); most reference/test datasets are small in size (do not evaluate how methods perform on large inputs).
• PREFAB (Protein Reference Alignment Benchmark)
  • Pairwise structural alignments at different levels of divergence
• SABmark (Sequence and structure Alignment Benchmark)
  • Recommended reading: “SABmark—a benchmark for sequence alignment that covers the entire known fold space”, Ivo Van Walle, Ignace Lasters and Lode Wyns, Bioinformatics (2005) 21 (7): 1267-1268.

SATCHMO: Simultaneous Alignment and Tree Construction using Hidden Markov mOdels

Edgar, R., and Sjölander, K., Bioinformatics 2003
Hagopian, R., Davidson, J., Datta, R., Samad, B., Jarvis, G., and Sjölander, K. "SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction," Nucleic Acids Research 2010

Related work

• “A Probabilistic Treatment of Phylogeny and Sequence Alignment”, G.J. Mitchison, Journal of Molecular Evolution, 1998, Volume 49, Number 1, 11-22.
• “POY version 4: phylogenetic analysis using dynamic homologies”, Andrés Varón, Le Sy Vinh, Ward C. Wheeler, Cladistics, Volume 26, Issue 1, pages 72–85, February 2010

SATCHMO algorithm

• Input: unaligned sequences.
• Initialize: a profile HMM is constructed for each sequence using Dirichlet mixture densities; each sequence forms a separate subtree (of a single sequence each)
  – Dirichlet mixture densities avoid the problems of small counts
• While (#subtrees > 1) {
  – Use profile-profile scoring to select closest pair to join
  – Align pair to each other, keeping columns fixed within each subtree
  – Mask columns with many gaps or low scores (use a window).
  – Construct a profile HMM for the new masked MSA
  }
• Output: Tree and MSA

Benchmarking SATCHMO accuracy

Evaluating the phylogenetic tree accuracy is difficult
● Simulation studies are used to evaluate evolutionary tree methods
● These rarely attempt to model the effects of duplication and structural and functional changes
● We don’t know the evolutionary history of multi-gene families, so benchmark datasets of real protein family phylogenies are not available

However, we can directly assess the alignment accuracy by way of 3D structure
● The structural alignment of two proteins is accepted as “ground truth” by the computational structural biology community

We can also assess the functional predictive power of a phylogenetic tree against what is known about the functions of proteins
● This approach is not universally accepted

SATCHMO is more robust to extreme structural divergence than other methods

[Figure: Alignment accuracy as a function of % ID (including homologs, full-length sequences). Y-axis: average CS score (0 to 1). X-axis: percent ID in bins 10-15%, 15-20%, 20-25%, 25-30%, 30-35%, 35-40%. Methods compared: CLUSTALW, MUSCLE, MAFFT, SATCHMO.]

SATCHMO succeeds at alignment of proteins with different overall folds

[Figure panels: MAFFT alignment; SATCHMO alignment]

SATCHMO: A cool idea but some limitations

Computationally very intensive
Poor accuracy at high sequence similarity relative to other methods

SATCHMO-JS

• JS stands for Jump-Start, using a divide-and-conquer approach to reduce computational complexity
• Input: unaligned sequences
  1. Align sequences with MAFFT (5 iterations of refinement)
  2. Build a neighbor-joining (NJ) tree using QuickTree
  3. Cut the tree into subtrees such that no pair of sequences in any subtree has below a specified %ID (default is 35% ID)
  4. Mask subtree MSAs; build an HMM for each subtree MSA
  5. Submit resulting subtree MSAs to SATCHMO for tree and MSA construction from that point to the root of the tree
  6. Use RAxML to refine the tree edge lengths, keeping the tree topology fixed
• Output: MSA and phylogenetic tree

Sequence alignment accuracy degrades as sequence and structural divergence increase

The Q_Developer score measures recall (sensitivity). In other words, it measures the fraction of the reference alignment that is correctly predicted by the sequence alignment

Q_Developer = TP / (TP+FN)

TP = # of correctly aligned residue pairs in the sequence alignment (i.e., pairs that agree with the reference alignment).
FN = # of aligned residue pairs in the reference alignment which are not in the sequence alignment (i.e., they are missed by the sequence alignment).
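A minimal sketch of how the Developer score could be computed for one pair of sequences is given here. It assumes both the test and the reference alignment are supplied as two gapped rows in aligned FASTA style (no lower-case a2m insert characters), and it represents each aligned residue pair as a tuple of 0-based positions in the ungapped sequences. The function names `aligned_pairs` and `q_developer` and the toy alignments are illustrative, not taken from any benchmark code.

```python
def aligned_pairs(row_a, row_b, gap_chars="-."):
    """Return the set of (i, j) residue-index pairs aligned in a pairwise alignment.

    i and j are 0-based positions in the ungapped versions of row_a and row_b.
    """
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        a_is_residue = a not in gap_chars
        b_is_residue = b not in gap_chars
        if a_is_residue and b_is_residue:
            pairs.add((i, j))
        i += a_is_residue  # advance only past real residues
        j += b_is_residue
    return pairs


def q_developer(test_pairs, ref_pairs):
    """Recall: TP / (TP + FN). Assumes the reference has at least one aligned pair."""
    tp = len(test_pairs & ref_pairs)
    fn = len(ref_pairs - test_pairs)
    return tp / (tp + fn)


# Toy example: the reference places the gap opposite K, the test places it opposite L.
ref = aligned_pairs("GFKLP", "GF-LP")
test = aligned_pairs("GFKLP", "GFL-P")
print(sorted(ref))             # [(0, 0), (1, 1), (3, 2), (4, 3)]
print(q_developer(test, ref))  # 0.75
```

Three of the four reference pairs are recovered, so the recall is 0.75; the misplaced gap costs one reference pair.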

Results on PREFAB benchmark of pairwise structural alignments at different levels of sequence divergence. In these experiments, alignment methods were given many homologs, but only evaluated on the structurally aligned pair.

"SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction," Hagopian et al., Nucleic Acids Research 2010

Modeler scores measure precision of a sequence alignment relative to the structural alignment

The Q_Modeler score measures precision (selectivity).

Q_Modeler = TP / (TP + FP)

TP = # of correctly aligned residue pairs in the sequence alignment (i.e., pairs that agree with the reference).
FP = # of incorrectly aligned residue pairs in the sequence alignment (i.e., pairs in the sequence alignment that are not aligned in the reference).
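The effect of over-alignment on precision can be seen in a small worked example. This sketch assumes aligned residue pairs are given directly as sets of (i, j) index tuples; the function name `q_modeler` and the pair sets are hypothetical.

```python
def q_modeler(test_pairs, ref_pairs):
    """Precision: TP / (TP + FP). Assumes the test alignment has at least one aligned pair."""
    tp = len(test_pairs & ref_pairs)
    fp = len(test_pairs - ref_pairs)
    return tp / (tp + fp)


# Hypothetical reference: only two residue pairs are structurally superposable.
ref_pairs = {(0, 0), (1, 1)}

# An over-aligned prediction also pairs residues the reference leaves unaligned.
over_aligned = {(0, 0), (1, 1), (2, 2), (3, 3)}
# A cautious prediction aligns only one pair, and it is correct.
cautious = {(0, 0)}

print(q_modeler(over_aligned, ref_pairs))  # 0.5
print(q_modeler(cautious, ref_pairs))      # 1.0
```

The over-aligned prediction recovers every reference pair yet scores only 0.5 on precision, while the cautious one scores 1.0 despite missing a reference pair. This is why precision and recall are reported together.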

The Cline Shift Score

The Cline Shift score includes a small positive score for being close to the reference alignment, and a small penalty for overalignment.

"SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction," Hagopian et al., Nucleic Acids Research 2010
http://makana.berkeley.edu/satchmo/supplementary/webserver/

Evaluating the significance of score differences using the Wilcoxon paired-score signed rank test

Details: Results are shown on 983 pairs from the PREFAB benchmark dataset, divided into bins based on the percent identity in the reference structural alignment. The Modeler score (Q_Modeler) is a measure of the precision of an alignment, while the Developer score (Q_Developer) is a measure of the recall. For every percent identity bin, either SATCHMO or SATCHMO-JS produces the best overall performance in both Modeler and Developer scores, with SATCHMO-JS generally producing better results than SATCHMO. Over the dataset as a whole, SATCHMO-JS’s improvement relative to other methods tested is statistically significant (P < 0.05 using Wilcoxon paired-score signed rank tests) for all scoring functions (including Qcombined and the Cline Shift score, which balance recall and precision) with a single exception: relative to MAFFT, the difference is significant only for the Developer score (P = 1.138e-05).

BIG DATA: Scalability assessment

Benchmarking MSA methods for time required

The first column gives the number of sequences and average sequence length for each dataset. ProbCons and MAFFT were run with five iterations of refinement; SATCHMO, SATCHMO-JS and T-Coffee used default parameters. The time to run SATCHMO-JS includes the time required for MAFFT, QuickTree and the subtree-selection program. MUSCLE’s run-time on these datasets is slightly longer than that of MAFFT (data not shown). T-Coffee failed to complete on the dataset with 500 sequences.

Aligned FASTA format

Similar to FASTA. Dashes indicate positions where a sequence does not match the others in the MSA.

UCSC a2m (align to HMM model) format

http://compbio.soe.ucsc.edu/a2m-desc.html

The UCSC SAM HMM software uses a specialized format for alignments, to describe how a sequence was emitted by an HMM (or, equivalently, aligns to an HMM).

a2m format MSA columns are of two types:

• Consensus positions: Columns consisting of upper-case characters and dashes correspond to nodes in the HMM representing the consensus structure
  • Uppercase characters are emitted in an HMM match state
  • Dashes are placed to indicate passage through an HMM skip/delete state
  • I.e., a dash indicates a sequence does not have the consensus structure at that position
• Inserted positions: Columns consisting of lower-case characters and dots correspond to residues emitted in HMM insert states, representing inserts between positions in the consensus structure
  • Dots are inserted post-hoc so that all sequences in the MSA have the same number of characters
  • Dots in one sequence indicate that another sequence has inserted characters using an insert state at that position

[Figure: SATCHMO view of UCSC a2m format, colored to highlight columns with different levels of similarity (based on BLOSUM62). NB: Not all sequences in the MSA are shown, which is why the coloring might be a bit confusing.]

Uses of MSA analysis tools (editors/viewers)

• Identification of conserved motifs
• Detection of non-homologs included accidentally in the MSA
• Masking (deleting columns) prior to tree estimation
• Cropping an alignment to a selected region
  – To build an HMM for a domain
• Quick-and-dirty tree estimation (sometimes included in MSA viewers)
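Masking and cropping, two of the operations listed above, are simple column operations on the alignment matrix. The sketch below assumes the MSA is a list of equal-length strings; the 50% gap threshold, the function names `mask_gappy_columns` and `crop`, and the toy alignment are illustrative choices, not the defaults of any particular editor.

```python
GAP_CHARS = "-."


def mask_gappy_columns(msa, max_gap_fraction=0.5):
    """Delete columns whose fraction of gap characters exceeds max_gap_fraction."""
    n_rows = len(msa)
    keep = [
        j for j in range(len(msa[0]))
        if sum(row[j] in GAP_CHARS for row in msa) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[j] for j in keep) for row in msa]


def crop(msa, start, end):
    """Keep columns start..end-1 (0-based, half-open) in every row."""
    return [row[start:end] for row in msa]


# Toy alignment in which the third column is gapped in 3 of 4 sequences.
msa_with_insert = ["GF-KLP", "GY-KLP", "GFARVP", "GF--LP"]

masked = mask_gappy_columns(msa_with_insert)
print(masked)              # ['GFKLP', 'GYKLP', 'GFRVP', 'GF-LP']
print(crop(masked, 1, 4))  # ['FKL', 'YKL', 'FRV', 'F-L']
```

Masking shortens every row by the same columns, so the result is still a valid alignment matrix. The removed column is lost, which is the trade-off when masking before tree estimation.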

The Belvu alignment viewer/editor

Belvu allows:
• Coloring columns according to characteristics
• Changing sequence order (by %ID, tree topology)
• Deleting columns (specified range or characteristics)
• Deleting sequences individually, or according to characteristics (fraction gaps, low %ID)
• And more…

Jalview alignment viewer/editor

Software can be downloaded from http://www.jalview.org/

Summary (part 1)

• Sequence “signal” guides the sequence alignment
  – Results on benchmark datasets show that when the signal is weak (low %ID, especially if there are gaps), sequence-based alignment does not match the structural alignment
• As proteins diverge from a common ancestor, their structures and functions often change
  – Chothia and Lesk analysis of “conserved core” regions shows that the fraction superposable drops as sequences diverge from their common ancestor
  – Not all positions can be superposed, and insertions and deletions relative to the MRCA (most recent common ancestor) accumulate
  – Even structural superposition can be challenging, and not all structural aligners will agree on which positions are equivalent

Summary (part 2)

• MSA methods vary in computational complexity
  – Some are very fast (e.g., MAFFT, MUSCLE, Clustal-Omega)
  – Some are too slow for large datasets (e.g., ProbCons, T-Coffee, FSA, SATCHMO)
  – Some have moderate complexity (SATCHMO-JS)
  – Faster methods are generally preferred, but may not always perform as well as slower methods
• Some methods are optimized for local alignment, while others are optimized for global alignment
  – Most methods assume input sequences are globally alignable
  – Restrict sequences to the homologous regions (NCBI BLAST lets you download the aligned regions)
• Some methods perform worse when many homologs are included!

Summary (part 3)

• Selection of sequences, alignment method and subsequent editing protocols must be guided by the intended use
  – E.g., MSAs are typically masked to remove noisy regions prior to tree estimation (but this may remove important information)
• Inclusion of sequences with different multi-domain architectures in a dataset can cause serious errors in a multiple sequence alignment
  – Master-slave alignment methods such as BLAST, PSI-BLAST and HMM methods enable the detection of homologous regions in database hits; you can extract the alignable region and then use a more sophisticated tool to generate a good MSA
