Spencer Bliven1,2,* Maria Anisimova1,2
Total Page:16
File Type:pdf, Size:1020Kb
Phylogenetics of Tandem Repeats with Circular HMMs: A Case Study on Armadillo Repeat Proteins Spencer Bliven1,2,* Maria Anisimova1,2 1 Institute for Applied Simulation, Zurich University of Applied Sciences, Wädenswil, Switzerland 2 Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland Abstract Evolution Tandem repeat proteins are characterized by multiple sequential Determining the evolution of tandem repeat proteins is a copies of repeats with significant structural or sequence TRAL availability challenge. The repeats typically have significant homology (for similarity. Tandem repeats evolve via repeat expansion, http://elkeschaper.github.io/tral/ instance, Asn39 is conserved in 51% of human ArmRP; Leu22 duplication and loss, and many protein families exhibit very in 67% (Gul, 2017)). This leads to the hypothesis that most diverse repeat counts. Identifying the complex relationships tandem repeats evolved through duplication and fusion events. A between homologous proteins and between individual repeats is number of mechanisms exist for the duplication/fusion of a challenging task. Using tools developed in our group, we repeats, both individually and collectively through gene present a detailed phylogenetic analysis of the repeats in the duplication/fusion. Combining this with gene-level speciation Armadillo Repeat Protein (ArmRP) family. and duplications creating paralogs can explain the complex The ArmRP family is very diverse, appearing throughout the relationships we see among TR families. eukaryotes and having a wide range of functions. They are well Tandem Repeat Annotation characterized structurally, with ~42 amino acid repeats forming a) 1 b) 1 c) 1 three alpha-helices which assemble into a solenoid structure. One issue with identifying tandem repeats is that the first and ArmRP are exciting candidates for protein design, as they have last repeats are often truncated or atypical. With standard 12 12 12 been shown to bind peptides in a modular manner (Reichen sequence alignment methods, this can lead to missing partial 2016). repeats and incorrect phase identification: 1234 A1B1 A2 B2 A1B1 A2 B2 Phylogenetic analysis of tandem repeats has several unique challenges. Identifying homologous regions is complicated by Canonical phase the repetitive nature of the sequence, which can cause register A1B1 A3 B3 A2 B2 A4 B4 A1 A3B1 B3 A2 A4 B2 B4 A1 A3A2 B1 A4 B3B2 B4 shifts when applying standard alignment tools. We surmount this Key: Duplication Speciation problem using the Tandem Repeat Annotation Library (TRAL), Possible evolutionary histories for two 4-repeat proteins. a) TR expansion occurs before speciation, giving identical TR orders in both species. b) Both species undergo independent a tool for accurately identifying repeats using circular profile Detected phase whole-gene duplications, giving similar subtree patterns in both species. c) Both species hidden Markov models (Schaper 2015). After constructing a undergo independent TR duplications showing different patterns. multiple alignment of the repeats of ArmRP representatives, we (Top) Protein with 3.5 repeats with repeat boundaries assigned from literature. (Bottom) KPNA2 1 infer a phylogenetic tree relating the different ArmRP and use it KPNA4 1 Incorrect tandem repeat identification, with non-standard boundaries, a missing partial KPNA5 1 KPNA1 1 to analyze the conservation and diversification patterns through repeat, and overlapping repeats. KPNA6 1 KPNA1 6 KPNA5 6 evolution, based on the information about tandem repeat KPNA6 6 KPNA2 6 KPNA4 6 number, order and their distribution on phylogenies. Additionally, misalignments near the repeat boundaries can cause KPNA1 9 KPNA5 9 overlapping repeats which must be reconciled. KPNA6 9 KPNA4 9 KPNA2 9 KPNA1 10 These issues are resolved in the Tandem Repeat Annotation KPNA5 10 KPNA6 10 KPNA4 10 Library (TRAL) (Schaper 2015). The library makes use of KPNA2 10 KPNA1 8 KPNA5 8 circular profile hidden Markov models (cpHMM) to capture TR KPNA6 8 KPNA4 8 KPNA2 8 profiles. KPNA1 3 KPNA6 3 KPNA5 3 KPNA2 3 KPNA4 3 KPNA1 7 KPNA5 7 Armadillo Repeat Proteins KPNA6 7 KPNA2 7 KPNA4 7 KPNA1 4 KPNA5 4 KPNA6 4 B M1 ... MN E KPNA2 4 KPNA4 4 KPNA1 2 KPNA6 2 KPNA5 2 KPNA2 2 KPNA4 2 KPNA1 5 KPNA5 5 DN KPNA6 5 DN D1 ... D1 KPNA2 5 KPNA4 5 0.7 Phylogenetic tree for human karyopherin (KPNA) proteins. Each protein has a 10 Arm IN I1 ... IN repeats. Each repeat clusters together, indicating that the order of repeats has not changed since the paralogous expansion. The support for this tree is too weak to draw conclusions about the relationships between repeats. Calculated using IQ-Tree (Nguyen 2015). Circular TR sequence profile HMM (cpHMM). The alignment starts in any position within the TR with equal probability (blue). States representing alignment matches, insertions and API5 3 RQCD1 3 UNC45A 7 CAB39L 1 deletions are included as in the HMMER model (black). Additional transitions are added CAB39 1 CTNNBL1 2 Preliminary tree of human ArmRP proteins CTNNBL1 3 DIAPH1 4 FMNL1 4 from the end of a repeat back to the beginning (red) to align multiple repeats. The alignment FMNL2 4 RQCD1 6 with structural information in the Protein UNC45A 8 CTNNB1 7 JUP 7 UNC45A 1 may end at any position with equal probability (green). UNC45A 6 APC 2 Data Bank. The shown tree does not CTNND2 2 PKP1 2 PKP2 2 Designed ArmRP YIIIM5AII bound to a (KR)5 peptide [5AEI] APC 3 UNC45A 3 explicitly model duplication events. Leaves APC 4 DAAM1 4 RQCD1 1 USO1 8 RQCD1 5 CTNND2 3 are colored by gene. Calculated using PKP1 3 PKP2 3 CTNNBL1 6 HSPBP1 4 CTNNB1 8 JUP 8 IQ-Tree. The cpHMM allows the TR region to start and end at any point APC 7 CTNNB1 10 Armadillo repeat proteins (ArmRP) are composed of ~40 JUP 10 CTNND2 6 DAAM1 1 PKP1 6 PKP2 6 CAB39L 4 in the repeat, making the phase inconsequential. Additionally, it CAB39 4 DAAM1 3 amino acid repeats forming a alpha solenoid structure. Each FMNL1 3 FMNL2 3 USO1 11 CTNNBL1 5 UNC45A 5 HSPBP1 1 prevents overlaps and provides a rigorous statistical model for HSPBP1 2 USO1 3 repeat contains three alpha helices, with a hydrophobic core and USO1 6 USO1 1 USO1 4 USO1 5 CTNNBL1 1 dealing with insertions between and within repeats. CTNNBL1 4 DAAM1 5 DIAPH1 2 conserved binding residues along the third helix. FMNL1 5 FMNL2 5 DIAPH1 5 KPNA1 10 KPNA5 10 KPNA6 10 KPNA4 10 KPNA2 10 1.67 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 TRAL contains functionality for constructing a cpHMM from a CTNNB1 12 JUP 12 APC 5 CTNND2 4 PKP1 4 PKP2 4 CTNNB1 6 JUP 6 set of possibly inconsistent or overlapping seed TR hits. CTNND2 5 PKP1 5 PKP2 5 APC 6 CAB39L 5 CAB39 5 KPNA1 1 KPNA6 1 0.83 CTNNB1 4 JUP 4 CTNNB1 5 L JUP 5 L USO1 10 P I S CTNND2 9 V AG KPNA2 1 G K L I Q V KPNA5 1 A E I GG D WAV KPNA4 1 L T PKP1 9 L QE N N R S L L I Q V A S E E N K V I UNC45A 2 E V E S EA R E L R Q L E E A A PKP2 9 S A P I Q L L E S E V V R Q A D V K S V N R I S Q D S L K I V CTNNB1 2 D N L AY QA RKM I N QKP S PD A E D Q CR T NQQAA E V A K D S L QS T A JUP 2 G EN YM K R R S S R C EQ V P HKR K D K T MF D T I T C G I N F C E USO1 9 P F F T M L S G T K S DM D F MN Q T AM N R N T N F K C F F Q N R N D F C N K T F Q TQ I D F R Y M H F R D N D T CTNNB1 9 Y R M CK Y C H H Q M WC F MC FM I C D Q Y Q 0 H C HAYQHWMWHMHM NHY C CQL F DHH C H I C H MCWWW CWY M I T T H JUP 9 0.25 0.47 0.80 0.89 0.93 0.95 0.99 1.00 1.00 1.00 1.00 1.00 0.95 0.98 0.99 0.99 0.99 1.00 0.98 0.99 1.00 1.00 1.00 1.00 0.99 0.99 0.90 0.96 0.99 1.00 1.00 1.00 0.99 0.99 0.99 1.00 1.00 0.98 0.97 0.96 0.95 0.90 DIAPH1 1 RQCD1 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 PKP2 7 PKP1 7 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 0 CAB39L 3 CAB39 3 PKP1 8 CTNND2 8 PKP2 8 USO1 2 CTNNB1 1 JUP 1 KPNA1 5 KPNA5 5 KPNA6 5 KPNA2 5 KPNA4 5 CAB39L 2 CAB39 2 Consensus Arm repeat.