Phylogenetics of Tandem Repeats with Circular HMMs: A Case Study on Spencer Bliven1,2,* Maria Anisimova1,2 1 Institute for Applied Simulation, Zurich University of Applied Sciences, Wädenswil, Switzerland 2 Swiss Institute of (SIB), Lausanne, Switzerland

Abstract Evolution Tandem repeat proteins are characterized by multiple sequential Determining the evolution of tandem repeat proteins is a copies of repeats with significant structural or sequence TRAL availability challenge. The repeats typically have significant homology (for similarity. Tandem repeats evolve via repeat expansion, http://elkeschaper.github.io/tral/ instance, Asn39 is conserved in 51% of human ArmRP; Leu22 duplication and loss, and many families exhibit very in 67% (Gul, 2017)). This leads to the hypothesis that most diverse repeat counts. Identifying the complex relationships tandem repeats evolved through duplication and fusion events. A between homologous proteins and between individual repeats is number of mechanisms exist for the duplication/fusion of a challenging task. Using tools developed in our group, we repeats, both individually and collectively through gene present a detailed phylogenetic analysis of the repeats in the duplication/fusion. Combining this with gene-level speciation Armadillo Repeat Protein (ArmRP) family. and duplications creating paralogs can explain the complex The ArmRP family is very diverse, appearing throughout the relationships we see among TR families. and having a wide range of functions. They are well Tandem Repeat Annotation characterized structurally, with ~42 amino acid repeats forming a) 1 b) 1 c) 1 three alpha-helices which assemble into a solenoid structure. One issue with identifying tandem repeats is that the first and ArmRP are exciting candidates for protein design, as they have last repeats are often truncated or atypical. With standard 12 12 12 been shown to bind peptides in a modular manner (Reichen sequence alignment methods, this can lead to missing partial 2016). repeats and incorrect phase identification: 1234 A1B1 A2 B2 A1B1 A2 B2 Phylogenetic analysis of tandem repeats has several unique challenges. Identifying homologous regions is complicated by Canonical phase the repetitive nature of the sequence, which can cause register A1B1 A3 B3 A2 B2 A4 B4 A1 A3B1 B3 A2 A4 B2 B4 A1 A3A2 B1 A4 B3B2 B4 shifts when applying standard alignment tools. We surmount this Key: Duplication Speciation problem using the Tandem Repeat Annotation Library (TRAL), Possible evolutionary histories for two 4-repeat proteins. a) TR expansion occurs before speciation, giving identical TR orders in both species. b) Both species undergo independent a tool for accurately identifying repeats using circular profile Detected phase whole-gene duplications, giving similar subtree patterns in both species. c) Both species hidden Markov models (Schaper 2015). After constructing a undergo independent TR duplications showing different patterns. multiple alignment of the repeats of ArmRP representatives, we (Top) Protein with 3.5 repeats with repeat boundaries assigned from literature. (Bottom) KPNA2 1 infer a phylogenetic tree relating the different ArmRP and use it KPNA4 1 Incorrect tandem repeat identification, with non-standard boundaries, a missing partial KPNA5 1 KPNA1 1 to analyze the conservation and diversification patterns through repeat, and overlapping repeats. KPNA6 1 KPNA1 6 KPNA5 6 evolution, based on the information about tandem repeat KPNA6 6 KPNA2 6 KPNA4 6 number, order and their distribution on phylogenies. Additionally, misalignments near the repeat boundaries can cause KPNA1 9 KPNA5 9 overlapping repeats which must be reconciled. KPNA6 9 KPNA4 9 KPNA2 9 KPNA1 10 These issues are resolved in the Tandem Repeat Annotation KPNA5 10 KPNA6 10 KPNA4 10 Library (TRAL) (Schaper 2015). The library makes use of KPNA2 10 KPNA1 8 KPNA5 8 circular profile hidden Markov models (cpHMM) to capture TR KPNA6 8 KPNA4 8 KPNA2 8 profiles. KPNA1 3 KPNA6 3 KPNA5 3 KPNA2 3 KPNA4 3 KPNA1 7 KPNA5 7 Armadillo Repeat Proteins KPNA6 7 KPNA2 7 KPNA4 7 KPNA1 4 KPNA5 4 KPNA6 4 B M1 ... MN E KPNA2 4 KPNA4 4 KPNA1 2 KPNA6 2 KPNA5 2 KPNA2 2 KPNA4 2 KPNA1 5 KPNA5 5 DN KPNA6 5 DN D1 ... D1 KPNA2 5 KPNA4 5

0.7 Phylogenetic tree for human (KPNA) proteins. Each protein has a 10 Arm IN I1 ... IN repeats. Each repeat clusters together, indicating that the order of repeats has not changed since the paralogous expansion. The support for this tree is too weak to draw conclusions about the relationships between repeats. Calculated using IQ-Tree (Nguyen 2015). Circular TR sequence profile HMM (cpHMM). The alignment starts in any position within the TR with equal probability (blue). States representing alignment matches, insertions and API5 3 RQCD1 3 UNC45A 7 CAB39L 1 deletions are included as in the HMMER model (black). Additional transitions are added CAB39 1 CTNNBL1 2 Preliminary tree of human ArmRP proteins CTNNBL1 3 DIAPH1 4 FMNL1 4 from the end of a repeat back to the beginning (red) to align multiple repeats. The alignment FMNL2 4 RQCD1 6 with structural information in the Protein UNC45A 8 CTNNB1 7 JUP 7 UNC45A 1 may end at any position with equal probability (green). UNC45A 6 APC 2 Data Bank. The shown tree does not CTNND2 2 PKP1 2 PKP2 2 Designed ArmRP YIIIM5AII bound to a (KR)5 peptide [5AEI] APC 3 UNC45A 3 explicitly model duplication events. Leaves APC 4 DAAM1 4 RQCD1 1 USO1 8 RQCD1 5 CTNND2 3 are colored by gene. Calculated using PKP1 3 PKP2 3 CTNNBL1 6 HSPBP1 4 CTNNB1 8 JUP 8 IQ-Tree. The cpHMM allows the TR region to start and end at any point APC 7 CTNNB1 10 Armadillo repeat proteins (ArmRP) are composed of ~40 JUP 10 CTNND2 6 DAAM1 1 PKP1 6 PKP2 6 CAB39L 4 in the repeat, making the phase inconsequential. Additionally, it CAB39 4 DAAM1 3 amino acid repeats forming a alpha solenoid structure. Each FMNL1 3 FMNL2 3 USO1 11 CTNNBL1 5 UNC45A 5 HSPBP1 1 prevents overlaps and provides a rigorous statistical model for HSPBP1 2 USO1 3 repeat contains three alpha helices, with a hydrophobic core and USO1 6 USO1 1 USO1 4 USO1 5 CTNNBL1 1 dealing with insertions between and within repeats. CTNNBL1 4 DAAM1 5 DIAPH1 2 conserved binding residues along the third helix. FMNL1 5 FMNL2 5 DIAPH1 5 KPNA1 10 KPNA5 10 KPNA6 10 KPNA4 10 KPNA2 10 1.67 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 TRAL contains functionality for constructing a cpHMM from a CTNNB1 12 JUP 12 APC 5 CTNND2 4 PKP1 4 PKP2 4 CTNNB1 6 JUP 6 set of possibly inconsistent or overlapping seed TR hits. CTNND2 5 PKP1 5 PKP2 5 APC 6 CAB39L 5 CAB39 5 KPNA1 1 KPNA6 1 0.83 CTNNB1 4 JUP 4 CTNNB1 5 L JUP 5 L USO1 10 P I S CTNND2 9 V AG KPNA2 1 G K L I Q V KPNA5 1 A E I GG D WAV KPNA4 1 L T PKP1 9 L QE N N R S L L I Q V A S E E N K V I UNC45A 2 E V E S EA R E L R Q L E E A A PKP2 9 S A P I Q L L E S E V V R Q A D V K S V N R I S Q D S L K I V CTNNB1 2 D N L AY QA RKM I N QKP S PD A E D Q CR T NQQAA E V A K D S L QS T A JUP 2 G EN YM K R R S S R C EQ V P HKR K D K T MF D T I T C G I N F C E USO1 9 P F F T M L S G T K S DM D F MN Q T AM N R N T N F K C F F Q N R N D F C N K T F Q TQ I D F R Y M H F R D N D T CTNNB1 9 Y R M CK Y C H H Q M WC F MC FM I C D Q Y Q 0 H C HAYQHWMWHMHM NHY C CQL F DHH C H I C H MCWWW CWY M I T T H JUP 9 0.25 0.47 0.80 0.89 0.93 0.95 0.99 1.00 1.00 1.00 1.00 1.00 0.95 0.98 0.99 0.99 0.99 1.00 0.98 0.99 1.00 1.00 1.00 1.00 0.99 0.99 0.90 0.96 0.99 1.00 1.00 1.00 0.99 0.99 0.99 1.00 1.00 0.98 0.97 0.96 0.95 0.90 DIAPH1 1 RQCD1 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 PKP2 7 PKP1 7 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 0 CAB39L 3 CAB39 3 PKP1 8 CTNND2 8 PKP2 8 USO1 2 CTNNB1 1 JUP 1 KPNA1 5 KPNA5 5 KPNA6 5 KPNA2 5 KPNA4 5 CAB39L 2 CAB39 2 Consensus Arm repeat. (Top) HMM logo based on a multiple CTNND2 7 CTNNB1 3 JUP 3 DIAPH1 3 HSPBP1 3 CTNND2 1 alignment of human Arm proteins by Gul (2017). Image RQCD1 2 APC 1 PKP1 1 PKP2 1 API5 2 KPNA1 4 generated with Skylign (Wheeler 2014). KPNA5 4 KPNA6 4 KPNA2 4 KPNA4 4 KPNA1 9 (Left) Repeat structure, showing hydrophobic core (grey), KPNA5 9 ArmRP Prevalence KPNA6 9 KPNA2 9 KPNA4 9 USO1 7 KPNA1 6 glycine turns (green), positive (blue), negative (red), and amidic KPNA5 6 KPNA6 6 KPNA2 6 KPNA4 6 KPNA1 8 KPNA5 8 (magenta) conserved residues. Side chains in the binding Prevalence of ArmRP KPNA6 8 KPNA4 8 KPNA2 8 CTNNB1 11 JUP 11 pocket are shown as sticks. KPNA1 3 KPNA6 3 KPNA5 3 members across KPNA2 3 KPNA4 3 KPNA2 7 KPNA4 7 KPNA1 7 KPNA5 7 KPNA6 7 KPNA1 2 metazoan species. KPNA6 2 KPNA5 2 UNC45A 4 KPNA2 2 KPNA4 2

ArmRP proteins were 2.0 ArmRP bind extended peptides, which often are disordered detected using TRAL prior to binding. Two amino acids binding each repeat. The across a 94 metazoan conserved Asn at position 39 of the repeat typically forms a species. Between 14 and hydrogen bond to the peptide backbone. Specificity is conferred 170 proteins were by a shallow binding pocket between helix 3 of adjacent repeats, detected containing at Conclusions & Outlook which binds the peptide side chain. However, many ArmRP least two consecutive We have shown that using cpHMM in TRAL we are able to show weak specificity and engage multiple partners. repeats. improve our identification of ArmRP repeats. It is able to capture the correct phase of the repeats, as well as handle partial terminal repeats. Using the ArmRP alignment we construct phylogenetic trees of the repeats. Work continues to reconstruct detailed evolutionary References histories for individual repeats. This will allow the identification of repeat duplication and loss across this complex family. Gul, I. S., Hulpiau, P., Saeys, Y., & van Roy, F. (2017). Biology, 428(22), 4467–4489. http://doi.org/10.1016/ Metazoan evolution of the armadillo repeat j.jmb.2016.09.012 superfamily. Cellular and Molecular Life Sciences : Schaper, E., Korsunsky, A., Pečerska, J., Messina, A., Work is also ongoing to incorporate duplication and fusion CMLS, 74(3), 525–541. http://doi.org/10.1007/ Murri, R., Stockinger, H., et al. (2015). TRAL: tandem s00018-016-2319-6 repeat annotation library. Bioinformatics, 31(18), events into the model for phylogenetic inference. This would Nguyen, L.-T., Schmidt, H. A., Haeseler, von, A., & Minh, 3051–3053. http://doi.org/10.1093/bioinformatics/ B. Q. (2015). IQ-TREE: A Fast and Effective btv306 allow reconstruction of the evolutionary events leading to Stochastic Algorithm for Estimating Maximum- Wheeler, T. J., Clements, J., & Finn, R. D. (2014). Skylign: Likelihood Phylogenies. Molecular Biology and a tool for creating informative, interactive logos tandem repeats. Evolution, 32(1), 268–274. http://doi.org/10.1093/ representing sequence alignments and profile hidden molbev/msu300 Markov models. BMC Bioinformatics, 15(1), 7. http:// Reichen, C., Hansen, S., Forzani, C., Honegger, A., doi.org/10.1186/1471-2105-15-7 Fleishman, S. J., Zhou, T., et al. (2016). Computationally Designed Armadillo Repeat Proteins Species Icons curtesy of the TogoDB for Modular Peptide Recognition. Journal of Molecular (http://togodb.biosciencedbc.jp)

Poster presented at SIB Days 2018 (Biel, CH) This research was supported by COST Action BM1405 and COST Switzerland SEFRI project IZCNZ0-174836. This work is licensed under a Creative Commons Attribution 3.0 Unported License.