<<

Liu et al. Genome Medicine (2017) 9:65 DOI 10.1186/s13073-017-0456-7

METHOD Open Access Interrogating the “unsequenceable” genomic trinucleotide repeat disorders by long-read sequencing Qian Liu1, Peng Zhang2, Depeng Wang2, Weihong Gu3 and Kai Wang1,4*

Abstract expansion, such as trinucleotide repeat expansion (TRE), is known to cause a number of genetic diseases. Sanger sequencing and next-generation short-read sequencing are unable to interrogate TRE reliably. We developed a novel algorithm called RepeatHMM to estimate repeat counts from long-read sequencing data. Evaluation on simulation data, real amplicon sequencing data on two repeat expansion disorders, and whole-genome sequencing data generated by PacBio and Oxford Nanopore technologies showed superior performance over competing approaches. We concluded that long-read sequencing coupled with RepeatHMM can estimate repeat counts on and can interrogate the “unsequenceable” genomic trinucleotide repeat disorders. Keywords: Trinucleotide repeats, Trinucleotide repeat disorders, Microsatellites, RepeatHMM, PacBio, Nanopore, Long-read sequencing

Background including Huntington’s disease, dentatorubropallidoluy- Trinucleotide repeat represents repetitive stretches of sian atrophy, spinal and bulbar muscular atrophy [5], three base-pair motifs in DNA sequences. For example, and six types of spinocerebellar ataxia, where the repeat the DNA sequence “CAGCAGCAGCAGCAG” contains thresholds for pathogenicity vary in these disorders. In five CAG repeats. Trinucleotide repeat can be located in addition, trinucleotide expansion may also cause other coding and non-coding regions of the genome and is a types of disorders, including fragile X syndrome [6], common type of microsatellite repeats. The expansion of Friedreich’s ataxia, myotonic dystrophy, and fragile XE microsatellites, especially trinucleotide repeat expansion mental retardation [2, 7]. All these genetic diseases (TRE), has been implicated in more than 40 neurological caused by excessive expansion of trinucleotide repeats disorders [1, 2]. For instance, the ATXN3 gene usually [5, 6] are collectively referred to as trinucleotide repeat contains 13–41 CAG repeats [3]; more than 55 CAG disorders (TRDs). repeats in the ATXN3 gene are pathogenic and can TRDs and other disorders caused by microsatellite re- cause spinocerebellar ataxia type 3 (SCA3), which is a peats have stimulated a large number of genetic studies. condition characterized by progressive problems with Some studies aimed to find therapeutic approaches to movement [4]. However, individuals with “intermediate regulate expression level of affected genes or to shorten repeat” may or may not develop SCA3. Several CAG re- pathogenic repeats, for example, using zinc finger nucle- peat diseases are also known as polyglutamine diseases, ases [8]. Other studies aimed to understand the molecu- where extensive repeats of the CAG codon result in lar mechanisms contributing to repeat expansion, such multiple consecutive glutamines in the protein sequence. as replication slippage [9–12], double-strand break re- Currently, there are at least nine polyglutamine diseases, pair [13, 14], base excision repair [15], and mismatch re- pair [16–18]. However, how these mechanisms are * Correspondence: [email protected] precisely responsible for repeat expansion is not fully 1Institute for Genomic Medicine, Columbia University, New York, NY 10032, USA elucidated yet [16]. 4Department of Biomedical Informatics, Columbia University, New York, NY To better understand the genotype-phenotype correl- 10032, USA ation of TRDs, it is important to detect repeat sizes Full list of author information is available at the end of the article

© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Liu et al. Genome Medicine (2017) 9:65 Page 2 of 16

accurately on personal genomes. Repeat size is critically of the MinION sequencer to be 38.2% [36]. Several prior associated with the severity of TRDs and the age of on- studies have explored the technical feasibility to use set of TRDs symptoms. Usually, when repeat count is long-read sequencing for the analysis of TRE, such as higher than a certain threshold, the higher the repeat CGG repeats in the fragile X gene [28]; however, long- count, the more severe the disorder and the earlier the read sequencing has not been routinely used in research onset of symptoms. The severity of TRDs may also in- and clinical studies of TRDs, partly due to the lack of ac- crease from an affected ancestor generation to each suc- curate, robust, and reproducible computational tools to cessive offspring generation, demonstrating the property estimate repeat counts. More importantly, typical long- of genetic anticipation [19]. Therefore, precise determin- read alignment algorithms (such as BLASR [37] or ation of repeat counts of trinucleotide repeats will lead BWA-MEM [38] with the “–x pacbio” parameters) may to an improved understanding of TRDs and the molecu- not work for reads containing long stretches of ex- lar mechanisms involved, and is also crucial for diagno- panded repeats. One example was given in Additional sis, risk assessment, and prognosis of TRDs. file 1: Figure S1 for a patient with SCA3. This patient To determine the repeat counts of microsatellites, had a pathogenic with 67 CAG repeats in the polymerase chain reaction (PCR) is typically used to ATXN3 gene and we sequenced the repeat region and amplify genomic regions of interest (ROIs) and then the the immediate flanking region using PacBio SMRT se- repeat counts are determined by various techniques, quencing techniques. After aligning those reads to the such as capillary electrophoresis [20], gel electrophoresis reference genome, we attempted to infer the re- [21], southern blot analysis [22], electrochemical detec- peat counts directly from the BAM file (BAMSelf in tion [23], melting curve analysis [24], mass spectrometry Additional file 1: Figure S1). Clearly, this naïve method [25], or small-molecule biosensors [26]. However, these failed in finding the pathogenic allele, suggesting that techniques have several limitations to analyze microsat- more sophisticated algorithms need to be developed to ellite repeats, in that they are typically labor-intensive address these challenges. and time-consuming [25]. They may be difficult to be To improve the estimation of TRE from long-read se- applied in high-throughput screening studies where hun- quencing data, we developed a novel computational tool dreds or thousands of patients need to be genotyped at called RepeatHMM. RepeatHMM takes a set of reads as the same time. Sanger sequencing usually works for sub- input, uses a split-and-align strategy to improve align- jects with short repeats, but has substantial difficulty to ments, performs error correction, and leverages a hidden infer long repeats in patients from the sequence traces, Markov model (HMM) and a peak calling algorithm based even with careful manual examination. Next-generation on Gaussian mixture model to infer repeat counts. sequencing techniques, such as those from Illumina and RepeatHMM allows users to specify error parameters of Ion Torrent, have difficulty sequencing GC-rich (or GC- the sequencing experiments, thus automatically producing poor) repeat regions [27] and the repeat length in pa- transition and emission matrices for HMM and allowing tients may easily exceed the length of the sequence reads the analysis of both PacBio and Oxford Nanopore data. [28]. Therefore, it is extremely difficult, if not impossible, Below, we describe our results on evaluating RepeatHMM to use these sequencing techniques to resolve longer re- on simulation datasets under different sequencing scenar- peats [29], sometimes referred to as “unsequenceable re- ios and on real datasets generated from amplicon sequen- gions” of the human genome [28, 30]. cing and whole-genome sequencing (WGS). It is worth In contrast to short-read sequencing, the development noting that RepeatHMM is different from several previ- of long-read sequencing technologies, such as PacBio ously published tools, such as RepeatMasker [39], tandem SMRT (single molecule real-time) sequencing [31] and repeat finder (TRF) [40], and TRhist [41], which screen Oxford Nanopore sequencing [32], enables the interro- for simple/intersperse repeats only for a query sequence. gation of more than 10,000 bp of genomic DNA se- RepeatHMM is also different from lobSTR [42], which in- quence, thus offering the theoretical advantage to fers microsatellites from short-read sequencing data, or determine repeat counts in human participants [33]. PacmonsTR [43], which needs detail alignment and repeat However, PacBio reads have higher error rates [34, 35] information of every long read and uses realignment to (~15% on average) with a strong bias towards insertions determine repeat regions in long reads before size estima- [31]; therefore, it is not straightforward to directly use tion. RepeatHMM can be accessed at https://github.com/ long reads to detect repeat counts, even when coupled WGLab/RepeatHMM. with circular consensus sequencing (CCS), that is, se- quencing the same segments many times and perform- Methods ing self-error correction. Similar limitations exist in the Summary of RepeatHMM Oxford Nanopore platform, for example, a report esti- RepeatHMM consists of several steps, as shown in Fig. 1. mated the base calling error rate of an early generation We used trinucleotide repeat as an example below to Liu et al. Genome Medicine (2017) 9:65 Page 3 of 16

Fig. 1 A flowchart of the procedure to infer repeat counts using RepeatHMM illustrate the procedure, but RepeatHMM can be used repeat regions were not successfully detected by the for microsatellites of any size. split-and-align strategy, were directly aligned with a reference genome by BWA-MEM. Those long reads (1)Identify the location of the repeat ROI in a reference were discarded if they could not be aligned to the genome: first, we used a reference genome (GRCh38 reference genome with long flanking sequences. was used in this study) to find the gene of interest (3)Repeat regions in a long read: we next used long and determined the exact start and end location of a reads that covered the repeat region with upstream trinucleotide repeat region. and downstream flanking segments for further (2)The alignment of a long read: then, given a set of analysis. In RepeatHMM, the minimum length of long reads of interests, we used two sub-processes to the upstream and downstream segments can be detect repeat regions in long reads. First, we used TRF specified by users (by default, 18 bp). Furthermore, [40] to detect repeats from a long read and then split we rematched the flanking segments to a reference the long read into several flanking sub-sequences and genome. If the alignment had high identity, we repeat regions. After that, all flanking sub-sequences inserted several Ns between repeat regions and their were aligned to a reference genome using BWA- flanking segments to guarantee that the flanking MEM with specialized parameters, because flanking segments were identified as non-repeat states in sub-sequences still have high error rates but much RepeatHMM. shorter length. Successful alignments of ordered (4)Error correction of a long read: to correct for flanking sub-sequences were detected and used to sequencing errors, we used a template with perfect determine the corresponding repeat regions in long repeats. For example, a long read on a CTG repeat reads. We called this process the split-and-align region was “CATGCTGCTGCTGGCTTCCCGCTG strategy. Second, all the remaining long reads, whose CTGGGTTTTTTTGTTAGTTAATGCTTTTTGCT Liu et al. Genome Medicine (2017) 9:65 Page 4 of 16

TGCATGTCTG,” which contained a lot of insertions s3 = G, s4 = T} and the hidden states H ={h1 = N, h2 = Cr, and deletions. To perform error correction, we h3 = Ar, h4 = Gr, h5 = ICr, h6 = IAr, h7 = IGr, h8 = DCr, h9 = designed a template with perfect CTG repeats that DAr, h10 = DGr} indicating, respectively, non-repeat nucle- is 50% longer than this region and then used otides, the first, second, and third nucleotide of repeats, UnsymSeqAlg to align this read with the template, the insertion after the first, second, and third nucleotide, then corrected for errors based on the alignment. and the deletion of the first, second, and third nucleotide. (5)Detection of trinucleotide repeats: each long read was used as input to a HMM [44] to estimate the Emission matrix repeat counts. The details on the HMM were given Emission matrix specifies the emission probability of a below. For each long read, we estimated hidden state to the four nucleotides and Ns, where each row states given observed sequence and a repeat count represents a hidden state, each column represents a nu- was then estimated from the model. cleotide, and the sum of each row is equal to 1. In an (6)Peak calling of repeat counts from all long reads: we emission matrix, we considered an emission to be ex- next generated a histogram of all estimated repeat pected if hk+1emits the k th nucleotide acid in a micro- counts produced by HMM from all long reads. We satellite or h2*E + k emits the (k + 1) th nucleotide acid designed a peak calling procedure described below and h3*E +1emits the first nucleotide acid. For example, to detect one peak (homozygous) or two peaks for CAG repeats, we considered an emission to be (heterozygous), which represented the estimated expected if h2, h3, and h4 emits C, A, and G, respectively, repeat counts for this participant. and h8, h9, and h10 emits A, G, and C, respectively. Then, assume that a random emission rate would be HMM for repeat detection 0.02 (same as the substitution error rate), then all un- In this study, we used first-order HMM [44] to model expected emission probability is 0.005 (i.e. 0.02 divided the relationship between an observed sequence and a by 4) and the expected emission probability is 0.985 sequence of hidden states, where the probability of a (i.e. 1 – 0.005*3). The emission probability of an inser- state at each position only depends on the states at the tion state is 0.25 and of the non-repeat state is 0.2, previous position and the probability of an observation equally for the four nucleotides or N. An example at each position only depends on the emission probabil- matrix of E{N,M} for trinucleotide repeats is given in ity of the state at that position. Our HMM consists of Additional file 1: Table S1. several components, including a set of N hidden states H ={h1, h2, ⋯, hN}, a set of M observed symbols S ={s1 = A, Transition matrix s2 = C, s3 = G, s4 = T, s5 = N}, an emission matrix E{N,5} Transition matrix specifies the transition probabilities ={eij}{N,5} representing the probability of hi emitting sj,a between different hidden states, where each row repre- transition matrix T{N,N} ={tij}{N,N} indicating the probability sents a state, each column represents the state to be of hi in the previous state transiting to hj in the next state, transited to, and the sum of each row is equal to 1. and the starting probability P ={p1, p2, ⋯, pN}givingthe Transition matrix in RepeatHMM has several specific probability of each state before the first position of a rules based on the error profile of long reads: (1) insertion sequence. Then, the likelihood of an observation sequence probability: the transitions from the state rk/Irk/Dr(k − 1) with L observed symbols O = sk_1,sk_2,sk_3,⋯, sk_L would be to Irk, where 1 ≤ k ≤ E and Dr(k − 1) =Dre if k =1,indi- P(O)=∑HP(O|H, E, T)P(H, E, T). Details on the HMM cate a possible insertion; (2) deletion probability: the components were described below: transitions from h1 to Dr1 and r(k − 1)/Ir(k − 1)/Dr(k − 2) to Drk,where1≤ k ≤ Eandr(k − 1) = re, Ir(k − 1) = Ire, Hidden states and observed symbols Dr(k − 2) =Dr(E − 1) if k = 1, indicate a possible deletion; Given a microsatellite with E nucleotides in each repeat (3) the probability to/from repeat region: h1 to h2 indicates unit, H of our HMM has 3 * E + 1 hidden states, that is, a transition from non-repeat region to repeat region and N =3 * E +1:onehiddenstate h1 = N for those nucleo- re/Ire/Dr(E − 1) to h1 indicates a transition from repeat tides which are not in microsatellites and three types of region to non-repeat region, and both are set to 0.02 by hidden states for those nucleotides in microsatellites, default; (4) the transition probability from h1 to h1 is set to i.e. h2 = r1, h3 = r2, ⋯, hE +1= re, hE +2= Ir1, hE +3= 0.96 by default; (5) all non-expected transitions to rk or to Ir2, ⋯ h2*E +1=Ire, h2*E +2=Dr1, h2*E +3=Dr2, ⋯ insertion states or to deletion states have a probability h3*E +1=Dre indicating, respectively, the k th nucleo- close to 0, where 1 ≤ k ≤ E; (6) each expected transition, tide in repeats, the insertion after the k th nucleotide, i.e. the transition of r(k − 1)/Ir(k − 1)/Dr(k − 2) to rk, and the deletion of the k th nucleotide, where k ranges where 1 ≤ k ≤ E and r(k − 1) = re, Ir(k − 1) = Ire,Dr(k − 2) = from 1 to E. Without loss of generality, take CAG re- Dr(E − 1) if k = 1, has a probability of 1 minus the sum of peat for example, the observed symbols S ={s1 = A, s2 = C, other probabilities in its row. Without loss of generality, Liu et al. Genome Medicine (2017) 9:65 Page 5 of 16

take trinucleotide repeats and PacBio long reads (11% in- patterns has the same length. However, the emissions of sertion rate and 2% deletion rate by default), for example: the third and fifth positions require adjustments in the (1) insertion probability: the transitions from h2/h5/h10 to emission matrix. For example, suppose that the third h5, h3/h6/h8 to h6 and h4/h7/h9 to h7 indicate a possible in- position has 40% probability to be C and 60% probability sertion and their probability is thus i with default value of to be T, then the emission probability of the third pos- 0.11; (2) deletion probability: the transitions from h1/h4/ ition is 0.98 * 0.4 + 0.005 for C, 0.98 * 0.6 + 0.005 for T, h7/h9 to h8, h2/h5/h10 to h9,andh3/h6/h8 to h10 indicate a and 0.005 for both A and G. In such microsatellites, the possible deletion, and their probability is thus d with 0.02 mixed repeat patterns are required to be position- as the default value; (3) the probability to/from repeat re- independent, that is, knowing the symbol in the third gion: h1 to h2 indicates a transition from non-repeat re- position does not affect the emission probability in the gion to the first nucleotide of repeats and h4/h7/h9/h10 to fifth position. RepeatHMM can handle simple mixed h1 indicates a transition from repeat region to non-repeat microsatellite repeats as described here and automatic- region, and both are set to 0.02 by default; (4) the transi- ally readjust the emission matrix for repeat detections tion probability from h1 to h1 is set to n with 0.96 as the according to mixed patterns provided by users. default value; (5) all other non-expected transitions, in- cluding the transitions from h2/h3/h5/h6/h8/h10 to h2, h3/ Hidden state estimation h4/h6/h7/h8/h9 to h3,andh2/h4/h5/h7/h9/h10 to h4,havea With the matrices above, we used HMM with Viterbi al- probability close to 0; (6) each expected transition, i.e., the gorithms [45] to estimate hidden states of each nucleo- transitions of h4/h7/h9 to h2, h2/h5/h10 to h3,andh3/h6/h8 tide which maximize P(O) of a given observed long read. to h4, have a probability of 1 minus the sum of other prob- Based on the most likely hidden states, we estimated the abilities in its row. The example matrix of T{N,N} for trinu- repeat count for the long read. cleotide repeats is given in Additional file 1: Table S2. The matrix is used for all evaluations both on simulation data Unsymmetrical sequence alignment and error correction and real data for trinucleotide repeats in this study. The idea of unsymmetrical sequence alignment algo- rithm (UnsymSeqAlg) is similar to the well-known HMM for different trinucleotide repeats Needleman–Wunsch algorithm or Smith–Waterman al- In RepeatHMM, all trinucleotide repeat patterns have gorithm. The main difference is that UnsymSeqAlg as- the same symbols and hidden states names and the signs different penalties for introducing a gap in the hidden state names do not change with different repeat query and the target sequence. This strategy is reason- patterns (such as CTG or CCG). Different trinucleotide able, because typical sequence alignment algorithms repeats have different emission matrices and P (Additional usually apply the same gap penalty for two aligned se- file 1: Table S3 for CAG repeats), but RepeatHMM can quences and implicitly assume that two aligned se- automatically readjust all matrices based on a given repeat quences have the same error rates. The assumption does pattern. not hold when aligning a long read with high error rate with a template with perfect repeats (or a region in a ref- HMM for different microsatellite repeats erence genome). Thus, the penalty of a gap in long reads Different microsatellites have different repeat patterns should be significantly larger than that in the template. with varying lengths and various combinations of four For example, suppose that the match score is match = nucleotides. In RepeatHMM, all repeat patterns have the 1, mismatch score is mismatch = –1, the gap penalty for same symbol names (A, C, G, T, and N), but microsatel- template is gap_in_perf = –1 and the gap penalty for long lites with more nucleotides in repeat units have more reads is gap_in_read = –10, the correction of a CTG re- hidden states. Given a microsatellite repeat pattern, peat region “CATGCTGCTGCTGGCTTCCCGCTGCT RepeatHMM can automatically readjust hidden states and GGGTTTTTTTGTTAGTTAATGCTTTTTGCTTGCA all matrices correspondingly. Theoretically, RepeatHMM TGTCTG” by UnsymSeqAlg is “CTGCTGCTGCTG can handle repeat patterns with any number of nucleo- CTTCCGCTGCTGGTTTTTTTGTTGTTAATGCTTT tides in repeat units. TGCTGCTGCTG” where several insertions are removed. In contrast, a Smith–Waterman algorithm, with match = HMM for microsatellites with mixed patterns 1, mismatch = –1, gap_in_perf = gap_in_read = –1, would Microsatellite repeats may contain mixed repeat patterns produce a corrected region as: “CATGCTGCTGCTGG of the same length. For example, ATTCT repeats can be CTTCC-C-GCTGCTGG-GTTTTTT-TGTTAGTTAATG mixed with ATCCC or ATCCT or ATTCC repeats, that CTTTTTGCTTGCATG-T-CTG” where more gaps (and is, the third and fifth positions in this microsatellite re- more insertion errors) are introduced. To speed up the peat can be either C or T. For situations like this, the alignment in UnsymSeqAlg, banded alignment is also hidden states are still the same because each of mixed applied. Liu et al. Genome Medicine (2017) 9:65 Page 6 of 16

Peak calling of repeat counts parameters such as the coverage, the number of partici- We used the Python module scikit-learn for peak call- pants to be simulated, and insertion/deletion/substitu- ing from the histogram of all repeat counts estimated tion error rates: from long reads. For microsatellites located on auto- somes of a diploid genome, we assumed that the histo- (1)Manually examined the ATN1 gene in UCSC Genome gram was mixed by two main Gaussian models and Browser and identified the exact location of the CAG several minor models. We then used the following steps repeats. We assumed that the start position of the to obtain the peak(s) from the histogram. First, we re- repeat is start_pos, and the end position is end_pos. moved the repeat counts less than a minimum thresh- (2)Checked the literature to obtain the minimum and old (by default, 5) and those repeat counts with very maximum expansion size of both normal and few supporting reads (threshold specified by users). pathogenic repeats and denoted the expansion limits Second, we used N Gaussian components in Gaussian as min_repeat and max_repeat, respectively. mixture model, where N was in the range of 3–7(by (3)Set updown_size as max_repeat multiplied by 25 and default). For each N, we inferred the mixture model 20 also set updown_size = 1500 bp if updown_size was times since the estimation yielded different results each larger than 1500; then, obtained updown_size bp time.WeusedAkaikeinformationcriterion(AIC)to upstream region of the repeat region and updown_size select the best one and also required that the best mix- bp downstream region. ture model was not the first or the last. The selected (4)Randomly produced two counts, ci and cj, between mixture model usually contains several individual min_repeat and max_repeat. Two counts were Gaussian models. Third, we filtered the models requir- produced, because each gene has two , one ing that a Gaussian model with a smaller mean should from the father and the other from the mother. For have smaller standard deviation and should have a lar- the CAG repeat in ATN1, ci is a random number in ger amount of supporting reads. After the filtering, if the range of 6–35 and cj is a random number in the there was one peak, it suggested that two alleles had range of 49–88. the same repeat counts. If more than one peak was (5)Got a random position of a CAG in repeat region of available, we chose one peak with the largest number of the reference genome for each count. reads and identified another peak using the strategy: its (6)Inserted new CAG repeats at the position to produce supporting reads should be more than 80% (by default) trinucleotide repeats with ci or cj counts. of reads associated with the first peak, if its repeat (7)The lengths of upstream and downstream sequences count was less than the first peak; otherwise, a peak were independently generated from a normal with larger count was chosen. distribution with the mean of L bp and standard deviation of 10. L was set to half of updown_size for the smaller repeat counts and to half of updown_size Metrics for performance evaluation minus half of (ci − cj)×l for the larger repeat counts, To evaluate the performance, we used root mean square where l was the length of the repeat unit. error (RMSE) to assess the difference between estimated (8)Mutated the upstream sequence, the produced repeat counts and true repeat counts. Given a set of L repeat region in (6) and downstream sequence with subjects each with true repeat count RC and an esti- k 11% insertion rate, 2% deletion rate, and 2% mated count PC , k substitution rate. sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi X (9)Concatenated the mutated upstream, repeat region, L ðÞ− 2 ¼ RCk PCk and downstream to generate a random read. RMSE ¼ k 1 L (10)Repeated steps (5) to (9) to generate long reads with cj the coverage of cov.Pleasenotethatci repeats had ciþcj RMSE is a non-negative value; and the smaller the ci Âcov long reads and cj repeats had þ Â cov long RMSE, the closer the estimated repeat counts are to the ci cj reads. That is, longer repeats had less simulated long true repeat counts. reads and smaller repeats have more simulated long reads. Simulation datasets (11)Repeated steps (4) to (10) to generate long reads for To describe the simulation process clearly, we took the different participants. In the study, we simulated 100 ATN1 gene with CAG repeats as an example. Please participants. note that the effects of PCR slippage were not consid- ered in the simulation below. The PCR-based simulation of long reads with fixed The simulation of long reads with random start and start and end site was similar to the random read simu- end sites contained the following steps with user-defined lation above. The only difference was that the size of the Liu et al. Genome Medicine (2017) 9:65 Page 7 of 16

upstream and downstream sequences was determined by long reads were downloaded from ftp://ftp-trace.nc- PCR primers rather than random simulation. bi.nlm.nih.gov/giab/ftp/data/NA12878/ and the BAM file for Nanopore was downloaded from https://github.- Real datasets on patients with SCA3 com/nanopore-wgs-consortium/NA12878. The ground Genome DNA was obtained from peripheral blood of truth of microsatellite repeat counts is not available for 20 unrelated patients with SCA3 and five unaffected NA12878, but this subject is not expected to have participants. Target sequences of the CAG repeats frag- pathogenic alleles. We thus used the prediction from ment (about 1.5 kb in ATXN3) were amplified using theIlluminadataasthegoldstandardandthen two primers (f: GATTCTCGGATTTAGGATGC; r: AT assessed the performance of RepeatHMM on the two AAAGTGTGAAGGTAGCGAAC). Briefly, 50 ng DNA other platforms. template was added to a 25 μL mastermix of 5 μL 5X PrimeSTAR GXL Buffer, 5 mM dNTP Mixture, 7.5 uM Results primers, 0.625 U PrimeSTAR GXL DNA polymerase Overview of RepeatHMM and 15 μL ddH2O. Then, samples were amplified with RepeatHMM aims at the estimation of repeat counts an initial denaturation step of 95 °C for 5 min, followed for a given microsatellite from long-read sequencing by 35 cycles of 98 °C for 10 s, 56 °C for 15 s, 68 °C for data. RepeatHMM can handle trinucleotide repeats as 1min40s,withafinalextensionstepof68°Cfor well as other more complex repeat patterns and can be 10 min, and then held at 4 °C. PCR products with equal used on technologies with different error profiles, such molar ratio were barcoded, pooled, and constructed as as PacBio and Oxford Nanopore. A flowchart on a SMRTbell library following a standard protocol RepeatHMM is shown in Fig. 1 and a general overview (SMRTbell Template Prep Kit 1.0). The annealed is described below. SMRTbell templates were bounded with DNA polymer- First, given a specific microsatellite, RepeatHMM in- ase enzymes using the DNA Polymerase Binding Kit fers the start and end position for the repeat region in a and incubated with 9 nM of polymerase in the presence reference genome and uses a split-and-align strategy of phospholinked nucleotides for 6 h at 30 °C. After facilitated by TRF [40] and BWA-MEM [38] to find the that, the library was stored at 4 °C. Sequencing was per- approximate locations of the repeat region in each read. formed within 36 h of binding. The library was sequenced It will then use a sequence alignment algorithm Unsym- on a PacBio Sequel sequencer using the manufacturer’s SeqAlg to correct base errors in the read, to account for suggested protocols. The true repeat counts for patients the differences between different types of errors, such as were determined using capillary electrophoresis analysis. the much higher insertion errors (~11%) than deletion For control participants, we also sequenced the PCR prod- errors (~2%) in PacBio SMRT sequencing. Next, a HMM ucts by Sanger sequencing. is used to estimate which nucleotides are within repeats, given the transition probabilities between hidden states Real datasets on patients with SCA10 and the emission probabilities from each of these states This dataset [46] contained PacBio long-read sequencing to the four observed nucleotides. The transition/emis- data on three patients with spinocerebellar ataxia type sion probabilities can be directly inferred analytically ra- 10 (SCA10). Participants A, B, and C in this dataset had ther than empirically (i.e. estimated from data) and can about 840, 870, and 530 repeats, respectively, as esti- be automatically produced by RepeatHMM according to mated by gel electrophoresis of cloned expansion frag- user-specified error profiles and repeat patterns. Finally, ment excised from plasmid backbone [46]. The three estimated repeat counts from all reads will be tallied and patients had canonical ATTCT motif mixed with other one or two peaks are inferred from these distributions, repeats and the repeat regions were in the range of to estimate the repeat counts on each of the two hom- 4700–6500 bp. Karen et al. sequenced ATXN10 genes of ologous chromosomes. the three participants using SMRT sequencing tech- niques with C2 chemistry [46]. Estimation of repeat counts on simulation data To evaluate the performance of RepeatHMM, we ran- Real datasets on the NA12878 individual using three domly simulated long reads (CAG repeats together with sequencing techniques neighboring sequences) on the ATN1 gene for 100 par- The subject NA12878 had been sequenced by Illumina ticipants with varying coverages. For each participant, short-read sequencing technique [47], PacBio long-read two alleles were simulated, including one normal allele sequencing technique [48], and Oxford Nanopore long- with a repeat count sampled from 6 to 35 and one read sequencing technique. The coverage for these three pathogenic allele with a repeat count sampled from 49 platforms were ~300X, ~50X, and ~30X, respectively. to 88. The coverage was in the range of 10–100 with a All the BAM files for Illumina short reads and PacBio step of 10, in the range of 100–1000 with a step of 100, Liu et al. Genome Medicine (2017) 9:65 Page 8 of 16

and in the range of 1000–5000 with a step of 1000. The different amplification efficiency for the shorter and longer read simulation used typical error models for PacBio se- alleles as described in the “Methods” section. The number quencing data, with a 15% error rate, including 11% of participants and the categories of coverage levels had insertions, 2% deletions, and 2% substitutions (see the same setting as the simulation data with random start “Methods” for details). For each coverage level, we calcu- and end sites. lated the RMSE of true repeat counts and estimated Despite the use of different simulation settings, the re- counts for these 100 participants. Additionally, we also sults on PCR-based simulation data were largely consist- evaluated whether the alignment itself (BAMself) could ent with the results on the simulation data with random be informative for inferring repeat counts. For BAMself, start and end positions (Fig. 2b and d). For both we produced the BAM files using BWA-MEM with the RepeatHMM and BAMself, when the coverage increased recommended options for PacBio (i.e. − k17 − W40 − from 10 to 50, the RMSE for normal alleles dropped and r10 − A1 − B1 − O1 − E1), and –L1 and –wG where G = then levelled off. However, for pathogenic alleles, the Lm * 4 + hr was band width, Lm was the maximum RMSE for both algorithms continued to drop with in- repeat size, and hr was half of the length of the region of creasing coverage. Compared with the previous simula- reads in a reference genome. Then, we determined the tion dataset, higher coverage of sequencing data was start and end positions of the predicted repeat region in required to reach the same level of RMSE. each alignment for a long read, divided its length by 3, and rounded to the nearest integer, as the estimated Examination of effects when two alleles have similar repeat count for the read. repeat counts We found that both RepeatHMM and BAMself have The RepeatHMM pipeline used a peak-calling procedure improved RMSE for normal alleles when the coverage to identify peaks from a histogram of repeat counts from increased from 10 to 50 (Fig. 2a). When the coverage a collection of long reads, so we next evaluated its per- continued to increase, the RMSE of RepeatHMM lev- formance when the two alleles were very similar to each elled around 0.8, while the RMSE for BAMself levelled other. For example, when one allele had 15 repeats and around 2.5. For pathogenic alleles, the RMSE of both the other allele had 17 repeats, the small difference of RepeatHMM and BAMself substantially decreased when two repeats may not be discernable by the peak-calling the coverage increased from 10 to 200, but RepeatHMM algorithm. To assess this, we performed additional simu- had much larger improvement than BAMself; when the lation, where the count difference of two similar alleles coverage was over 200, RepeatHMM had much smaller were 1, 2, 3, 4, 5 to 6, and 7 to 9. For each count differ- RMSE than BAMself (1.7 versus 7). To further show the ence, the coverage was simulated from 20 (=10 * 21)to distribution of the differences between estimated repeat 5120 (=10 * 29); for each coverage level, 100 random counts and true repeat counts, we categorized the pre- pairs of similar alleles were simulated. The other settings diction errors into several groups, including those with of the simulation experiment were similar to the simula- prediction errors less than –3, equal to –3, –2, –1, 0, 1, tion described above. The RMSE of the prediction, to- 2, and 3, and more than 3 (figure repeat counts and true gether with the heterozygosity status of the prediction, repeat counts. Given a set oc). BAMself usually overesti- were given in Additional file 1: Figure S2. As expected, mated normal alleles by ≥ 2 repeats and overestimated when the two alleles were highly similar, the genotypes pathogenic repeats by > 3 repeats. In comparison, tended to be called homozygous. When the repeat differ- RepeatHMM generated the correct repeat counts for ence increased from 1 to 4, the fraction of heterozygotes most normal alleles with at most one repeat difference, incorrectly called as homozygotes decreased from ~ 50% while it underestimated by one or two repeats on patho- to ~35%, ~20%, and then ~0%, suggesting that genic alleles when the coverage was high enough. These RepeatHMM tend to over-call homozygotes when the results suggested that the alignment itself was not able difference between two alleles is less than 3. Overall, the to estimate repeat counts accurately. RMSE was similar to that shown in Additional file 1: To further evaluate the performance of different Figure S2, suggesting that the presence of similar alleles methods on amplicon sequencing, we also generated did not increase overall error rates but affected the calls simulation datasets under the constraints of predefined on heterozygosity status. It is also clear that the higher PCR primers so that all reads had similar length (not coverage would help improve the prediction of identical because of the simulation of random insertion/ heterozygosity. deletion errors). The primers for ATN1 were designed by Primer3 [49], as CCCACCCACTACTCCCATTT (for- Estimation of repeat counts from real dataset on SCA3 ward) and CCAGAGTTTCCGTGATGCTG (reverse). To evaluate the performance of RepeatHMM on real The product size in the reference genome is 762 bp. To data, we performed amplicon sequencing on the ATXN3 approximate the real-world scenario of PCR, we used gene on 25 participants using the PacBio Sequel Liu et al. Genome Medicine (2017) 9:65 Page 9 of 16

Fig. 2 (See legend on next page.) Liu et al. Genome Medicine (2017) 9:65 Page 10 of 16

(See figure on previous page.) Fig. 2 Analysis on simulation data to infer repeat counts for ATN1. a Performance on simulated long reads with random start and end sites that cover repeats. b Performance on simulated long reads with fixed start and end sites that cover repeats. c, d The distribution of the prediction errors (estimated repeat counts minus simulated counts) on random simulation data and PCR-based simulation data, respectively. RMSE root mean square error between simulated repeat counts and estimated counts for 100 participants sequencer. These participants consisted of 20 patients sam021, sam024, and sam025, with a coverage of affected with Spinocerebellar ataxia type 3 (SCA3) 1086, 569, 504, and 718, respectively. [50, 51], with repeat counts determined by capillary Using the raw reads, the predicted repeat counts and electrophoresis, as well as five controls, with repeat the differences from the gold standards were shown in counts determined by Sanger sequencing (Additional Fig. 3 and Additional file 1: Table S5. For comparison, file 1: Table S5). SCA3 is a rare autosomal dominant we also run TRhist and summarized its results into disease caused by abnormally extensive duplication of repeat counts using a custom script (two repeat units CAG repeats in the ATXN3 gene located on chromosome were merged if their distance was less than 5 bp). 14q [50, 51]. Extensive repeats in exons of ATXN3 would RepeatHMM worked well on the raw reads (Fig. 3a and affect pons and striatum, causing progressive cerebellar b): The difference between repeat counts determined by ataxia and even paralysis. In general, greater number of RepeatHMM and the gold standard was mostly 0 or 1 repeat counts is correlated with more severe phenotypic with only a few exceptions (Additional file 1: Table S5). expression and earlier age of onset. The predictions for ten normal alleles in five unaffected For the 25 participants of interest, 585,646 raw long participants and 17 normal alleles in 20 patients were reads with 939,895,440 bp were generated (Additional identical to the gold standard, and the three normal al- file 1: Table S4). Our sequencing experiments le- leles in 20 patients were underestimated by one repeat. veraged CCS protocol, where a CCS read is a consen- For the pathogenic alleles of 20 patients, five predictions sus sequence generated from a multiple sequence are identical to the gold standard, ten were overesti- alignment on subreads generated on a single template mated by one repeat, four by two repeats, and one by in a circular fashion. This set of raw data was sum- three repeats. Additionally, the estimation error was marized into 38,058 CCS reads with 61,063,678 bp. largely random and was not correlated with true repeat We therefore evaluated RepeatHMM on both raw size. In comparison, BAMself and TRhist produced very data and the CCS data. The coverage of raw reads for poor predictions, especially on the pathogenic alleles most participants was more than 21,000, except (Fig. 3 and Additional file 1: Tables S5 and S6). sam004, sam021, sam024, and sam025, whose cover- To evaluate whether the CCS protocol could improve age levels were 16,988, 7750, 6915, and 10,882, re- the predictive performance, we used RepeatHMM on spectively. In contrast, the coverage of CCS reads was CCS reads, and we referred to this analysis as more than 1300 for most participants, except sam004, RepeatCCS. The detailed results of the RepeatCCS

abc

Fig. 3 Performance of RepeatHMM, RepeatCCS, BAMself, and TRhist on estimating the repeat counts in ATXN3 for 20 patients with SCA3 and five controls. The gold standards (x-axis) were determined by capillary electrophoresis for 20 patients or by Sanger sequencing for five controls. a Scatterplot of estimated repeat counts and true counts. b, c The difference of estimated repeat counts and true counts by RepeatHMM, RepeatCCS, BAMself, and TRhist. RepeatCCS refers to the use of RepeatHMM on error-corrected reads generated by the circular consensus sequencing protocol Liu et al. Genome Medicine (2017) 9:65 Page 11 of 16

analysis were shown in Fig. 3 and Additional file 1: ATXN3 in all three patients. We also noted that the con- Table S5. Although RepeatCCS worked better than sensus sequences produced by the original authors [46] BAMself and TRhist, it had higher error rates than gave larger estimation of repeats counts for both partici- RepeatHMM. Thus, in the RepeatHMM framework, pants A (~30 repeats larger) and B (~64 repeats larger). CCS reads did not confer obvious advantage to raw In contrast, repeat sizes estimated by RepeatHMM were reads in quantifying longer repeat counts. closer to those estimated by gel electrophoresis (Fig. 4 Interestingly, we found that the prediction errors (the and Table 1) for participants A and B. For participant C, estimated repeat counts minus the true repeat counts) there was a larger difference between the estimation by by RepeatCCS had a clear positive correlation with the RepeatHMM and the size inferred by gel electrophoresis, true repeat counts (Fig. 3b). This indicated that the CCS which may be because participant C contained many protocol may be biased when assessing repeat counts interrupted repeats [46]. In summary, this comparative and that the bias was not random. One possible reason analysis demonstrated that RepeatHMM can also work leading to this bias may be due to the multiple sequence on complex repeat regions spanning thousands of base alignment algorithm used to generate CCS reads. When pairs with mixed repeat units. inferring consensus sequences from multiple subreads, the alignment algorithm may not be able to accurately Estimation of repeat counts from WGS data align subreads with many repeats against each other, due We further evaluated RepeatHMM on WGS data on to the error profile of the sequencing data. For example, NA12878 generated on three technical platforms: PacBio given a repeat of 80 CAG triplets (240 bp), the generated SMRT sequencing (~50X coverage), Oxford Nanopore data were on average ~10% longer (due to the much sequencing (~30X coverage), and Illumina short-read higher insertion rate over deletion rate), so the length of sequencing (~300X coverage). We recently generated repeat regions in the CCS reads would be on average PacBio SMRT sequencing data on a Chinese adult male around 88 × 3 = 264 bp. RepeatHMM directly used raw (HX1) with normal karyotype (~100X coverage) [52] and reads and thus was less susceptible to this problem given we included this individual in the analysis as well. Unlike appropriate adjustment of alignment parameters. There- amplicon sequencing experiments, whole-genome long- fore, although CCS reads offered some advantages over read sequencing typically had much lower coverage (for raw reads and were preferably used in many applications example, 100X or lower), and all reads had random start (such as amplicon sequencing and RNA-sequencing), and end sites that were distributed throughout the gen- more cautions should be taken when CCS reads are used ome. We selected 15 trinucleotide repeats that were in estimation of repeat counts. known to cause inherited neurological diseases, as well as 33 additional microsatellites with 2–5 bases as repeat Estimation of repeat counts from real data on SCA10 units. (We determined the repeat patterns and the start/ To further evaluate the performance of RepeatHMM on end positions of microsatellites based on the UCSC gen- regions with more complex repeat patterns than trinu- ome browser.) Since the Illumina data had high coverage cleotide repeats, we also analyzed another dataset on (~300X) and the sizes of repeat regions are expected to SCA10 [46]. The intronic region of the ATXN10 gene be smaller than the read length (150 bp), we used the contains 14 ATTCT repeats in the reference genome. estimation of repeat counts from the Illumina data as However, in this dataset, many hundreds of repeat units the gold standard to evaluate the two long-read sequen- were present in each of the patients. Furthermore, in cing platforms. addition to ATTCT, the repeat region also contained a During the analysis, for each microsatellite, we found small fraction of other repeat units such as ATCCT, that generally more than 80% of the short reads sup- ATTCC, and ATCCC. ported the exact repeat counts (either one or two Three methods were evaluated on the raw reads in the counts) calculated by RepeatHMM, indicating that SCA10 dataset: RepeatHMM, TRhist, and BAMself. The Illumina-based estimation is reliable and could be used results were shown in Table 1 where both BAMself and as gold standard for comparison. The predictions from TRhist failed to accurately detect the pathogenic allele in two long-read sequencing platforms were largely

Table 1 Estimation of repeat counts on the SCA10 dataset. The gel estimated counts were from the previous study on the pathogenic allele [46] BAMself TRhist RepeatHMM Gel estimated Estimation in [46] Participant A 15 15 10 27 17 830 ~840 ~870 Participant B 15 15 5 41 17 825 ~820 ~884 Participant C 14 14 5 27 17 488 ~530 ~514 Liu et al. Genome Medicine (2017) 9:65 Page 12 of 16

a b

c

Fig. 4 The distribution of repeat counts estimated by RepeatHMM for three patients with SCA10. The estimation of the pathogenic alleles by RepeatHMM for the three subjects A, B and C were 830 (a), 825 (b) and 488 (c), and the estimation by gel electrophoresis were ~840, ~820 and ~530, respectively consistent with the gold standard (Fig. 5). Therefore, distinguish the alleles with similar repeat counts, as we RepeatHMM could work on different sequencing plat- had discussed above. In addition, some of the repeat re- forms with different error characteristics by adjusting the gions cannot be confidently called by either PacBio or model parameters. However, we also noted that there were Nanopore data, due to the relatively low coverage. We ac- some prediction errors, especially on predicting - knowledged that this analysis focused on repeat regions zygous repeat counts (Additional file 1: Table S7). This with normal alleles and relatively small number of repeat may be due to the fact that the coverage of long-read se- units, so the results might not be extrapolated to patho- quencing data for NA12878 was not high enough to genic alleles. In summary, our exploratory analysis

Fig. 5 Comparison of the estimation of repeat counts on NA12878 using three sequencing platforms. The sequencing platforms include Illumina short-read sequencing, PacBio long-read sequencing (a), and Nanopore long-read sequencing (b). We examined 40 microsatellites with repeat units in the range of 2–5 bp, which are short enough to be confidently called by the Illumina data Liu et al. Genome Medicine (2017) 9:65 Page 13 of 16

confirmed that RepeatHMM worked on different long- and convenient estimation of repeat counts. Our results read sequencing platforms, with appropriate adjustment suggested that long-read sequencing has the potential to of model parameters in HMM. be routinely used in research studies on microsatellite We also used RepeatHMM on HX1 [52] to analyze 15 repeat disorders and may be extended in clinical diag- distinct types of trinucleotide repeats that were known nostic applications. to cause human diseases (Additional file 1: Table S7). All RepeatHMM has several advantages over traditional ap- the repeat counts were within normal ranges, consistent proaches to determine repeat counts. First, RepeatHMM with the prior knowledge that HX1 did not have a takes long reads as input and utilizes HMM for repeat known neurological disorder. Furthermore, we analyzed region detection: HMM is computationally flexible to de- the CAG repeats in the ATXN3 gene using three differ- tect various types of repeats with different unit lengths ent techniques: on whole-genome long-read sequencing and motifs. Although we demonstrated the usefulness of (Fig. 6a), on PCR-based long-read sequencing (Fig. 6b), RepeatHMM on real dataset for CAG repeats and ATTCT and on Sanger sequencing (Fig. 6c). Since PCR-based repeats, RepeatHMM can be used for other types of tri- long-read data had high coverage, we down-sampled the nucleotide repeats by simply specifying different sets of dataset and produced three subsets of data, each with parameters. Second, different error profiles from various ~100X coverage. We found that WGS, PCR-based sequencers (such as PacBio sequencer and Oxford Nano- amplicon sequencing (three down-sampled subsets), and pore sequencers) can also be incorporated into HMM Sanger sequencing concordantly predicted that HX1 had using different parameters. Third, RepeatHMM is compu- 14 CAG repeats in ATXN3, further suggesting that tationally efficient. Based on the evaluation on the SCA3 RepeatHMM worked on different types of data. dataset, it usually took 2–12 min to analyze raw data on ATXN3 for a participant (~21,000X coverage). The mem- Discussion ory usage is up to 200 Mb in a 64-bit Linux machine with Long stretches of trinucleotide repeats generally cannot Python 2.7; however, please note that TRF requires more be interrogated by Sanger sequencing or next-generation time and memory, especially if the reads contain com- sequencing, and were traditionally regarded as “unse- plicated repeat patterns or if the reads are too long. There- quenceable” genomic regions. In this study, we leveraged fore, RepeatHMM is a flexible, efficient, and powerful tool long-read sequencing techniques and developed a novel to replace traditional approaches for the determination of computational tool, RepeatHMM, to estimate repeat repeat counts. counts for microsatellites. Compared to existing tech- However, there are several limitations of the niques (capillary electrophoresis and southern blot) that RepeatHMM approach. First, some repeat regions have are labor-intensive and cannot be scaled to high- mixed repeats, such as one CTG within many CAG, or throughput applications, the combination of long-read contain interrupted repeats, such as 10 CAG repeats sequencing and RepeatHMM may greatly facilitate rapid plus TTTTTTG followed by another 20 CAG repeats. If

a b

c

Fig. 6 The analysis of ATXN3 in HX1 using three different sequencing techniques. a Whole-genome long-read sequencing with ~100X coverage. b PCR-based long-read sequencing with three randomly down-sampled datasets, each with ~100X coverage. c Sanger sequencing. All methods concordantly predicted that there were 14 CAG repeats in ATXN3 Liu et al. Genome Medicine (2017) 9:65 Page 14 of 16

long reads contain non-canonical repeats, it is not development of some human diseases by affecting straightforward to differentiate the interruption of per- gene expression or the function of encoded proteins. fect repeats from insertion/substitution errors. To Given the success of RepeatHMM on simple micro- address this problem, in the current version of satellite repeats, in the future we will explore the RepeatHMM, the former scenario (several CTG within modification of RepeatHMM to handle more compli- many CAG) can be formulated using simple mixed re- cated microsatellite repeats with mixed patterns of peat patterns. For the latter case (10 CAG repeats plus different lengths, as well as with much TTTTTTG followed by another 20 CAG repeats), longer repeat units (10–60 bp). RepeatHMM would consider it as a combination of mul- tiple single-base insertions and a deletion of a CAG and Conclusions the insertions did not contribute to repeat length estima- In this study, we have developed RepeatHMM to detect tion (however, the results may be post-processed to repeat counts of microsatellites from long-read sequen- identify stretches of continuous insertions). For even cing data. RepeatHMM was evaluated on both simula- more complicated repeat patterns, the models need spe- tion data and real data and our results suggested that cialized adjustment. Second, the flanking downstream/ RepeatHMM was effective and efficient to quantify re- upstream sequences of repeat regions need to be long peat counts. RepeatHMM is flexible to handle repeat enough for RepeatHMM to work reliably. If a long read patterns of any length beyond trinucleotide repeats and has a short flanking sequence such as 20 bp, yet the re- can incorporate different error profiles. With the wider peat region is too long (for example, 200 repeats with application of long-read sequencing techniques in re- more than 600 bp), few alignment software tools could search and clinical settings, RepeatHMM is expected to correctly map the long reads to a reference genome. contribute to the quantification of repeat counts and to Thus, repeat counts in some of the long reads with short facilitate the analysis of genotype-phenotype relation- flanking sequences cannot be calculated. Third, repeat ships for disease-related microsatellites. patterns and their location in a reference genome are assumed to be known a priori, that is, our method did Additional file not aim at de novo discovery of repeat patterns. Fourth, we found that RepeatHMM tended to make mistakes Additional file 1: The matrices for the HMM, the results on estimating when two alleles have similar repeat counts, which con- repeat sizes in ATXN3 on 20 patients with SCA3 and five controls, and supplementary figures. (DOCX 353 kb) fused the peak calling procedure. As shown in Additional file 1: Figure S2, when the size difference between two Abbreviations alleles was more than 2, the error rates decreased sharply. CCS: Circular consensus sequencing; PCR: Polymerase chain reaction; In most cases the pathogenic allele would be substantially RepeatCCS: RepeatHMM on Circular Consensus Sequencing reads; RMSE: Root longer than the benign allele and this problem had limited mean square error; TRD: Trinucleotide repeat disorders; TRE: Trinucleotide repeat expansion; UnsymSeqAlg: Unsymmetrical sequence alignment influence on the accuracy of repeat size estimation for disease analysis in practice. On the other hand, in some Acknowledgements cases where two alleles have similar repeat counts, allele We thank members of the China-Japan Friendship Hospital to prepare DNA samples from patients with SCA3 and determine the repeat numbers drop-out remains a critical issue and needs to be ad- by capillary electrophoresis. We also thank Nextomics for generating the dressed in the future to avoid false negatives for the cor- sequence data on PacBio Sequel sequencer and performing Sanger validation. rect identification of heterozygosity. Finally, our peaking We thank the Wang lab members and three anonymous reviewers for insightful comments; and Abolfazl Doostparast for proofreading and calling procedure always assumed one or two peaks in the editing the manuscript. We also acknowledge the use of whole-genome repeat counts, which did not handle cases where extensive long-read sequencing data on HX1, originally generated by Jinan University mosaicism is available. We may improve the peaking call- and Wuhan Institute of Biotechnology. ing procedure in the future to address this issue. Funding We demonstrated a few successful examples in using This study was supported by the National Institutes of Health (HG006465 RepeatHMM to quantify well-known and canonical to KW). disease-associated repeat expansions, but it is conceiv- Availability of data and materials able that RepeatHMM may also be useful for other All data generated in this study were deposited in NCBI’s Sequence Read microsatellite repeats, including more complicated re- Archive (SRA) under the BioProject of PRJNA379845. The raw reads data can peats that do not always conform to canonical patterns. be accessed: sam001:SRR5363334; sam002:SRR5363452; sam003:SRR5363453; sam004:SRR5363454; sam005:SRR5363455; sam006:SRR5363456; sam007:SRR5 Microsatellites account for about 2% of human genomes 363457; sam008:SRR5363458; sam009:SRR5363459; sam010:SRR5363460; sam0 with a wide distribution throughout the genome [53] 11:SRR5363461; sam012:SRR5363462; sam013:SRR5363463; sam014:SRR53634 and have a higher rate than other regions of 64; sam015:SRR5363465; sam016:SRR5363466; sam017:SRR5363467; sam018:SR R5363468; sam019:SRR5363469; sam020:SRR5363470; sam021:SRR5363471; the genome [54]. Microsatellite repeats contribute to sam022:SRR5363472; sam023:SRR5363473; sam024:SRR5363480; sam025:SRR genetic diversity in human populations and to the 5363632. Liu et al. Genome Medicine (2017) 9:65 Page 15 of 16

Authors’ contributions 13. Richard G-F, Cyncynatus C, Dujon B. Contractions and expansions of CAG/ QL, DW, and KW designed the study. QL implemented the algorithm and CTG trinucleotide repeats occur during ectopic gene conversion in yeast, by performed the computational analysis. PZ and DW performed PacBio a MUS81-independent mechanism. J Mol Biol. 2003;326(3):769–82. sequencing to generate long-read data on ATXN3. WG collected samples and 14. Richard GF, Goellner GM, McMurray CT, Haber JE. Recombination-induced advised on data analysis and interpretation. QL and KW drafted and revised CAG trinucleotide repeat expansions in yeast involve the MRE11-RAD50- the manuscript. All authors read and approved the manuscript. XRS2 complex. EMBO J. 2000;19(10):2381–90. 15. Kovtun IV, Liu Y, Bjoras M, Klungland A, Wilson SH, McMurray CT. OGG1 Ethics approval and consent to participate initiates age-dependent CAG trinucleotide expansion in somatic cells. – The study protocol was approved by the China-Japan Friendship Hospital Nature. 2007;447(7143):447 52. and was conducted in compliance with the Declaration of Helsinki. Written 16. Guo J, Gu L, Leffak M, Li G-M. MutS[beta] promotes trinucleotide repeat informed consent was obtained from all participants before enrollment. expansion by recruiting DNA polymerase [beta] to nascent (CAG)n or (CTG)n hairpins for error-prone DNA synthesis. Cell Res. 2016;26(7):775–86. 17. Schmidt MHM, Pearson CE. Disease-associated repeat instability and Consent for publication mismatch repair. DNA Repair. 2015;38:117–26. All patient information was anonymized at source and unique ID codes were 18. Iyer RR, Pluciennik A, Napierala M, Wells RD. DNA triplet repeat expansion used to identify cases. Publication of de-identified results from all consenting and mismatch repair. Annu Rev Biochem. 2016;84:199–226. participants was approved. 19. Mirkin SM. Expandable DNA repeats and human disease. Nature. 2007; 447(7147):932–40. Competing interests 20. Lyon E, Laver T, Yu P, Jama M, Young K, Zoccoli M, et al. A simple, high- PZ and DW are employees of Nextomics Bioscences and KW is an advisor for throughput assay for Fragile X expanded alleles using triple repeat primed Nextomics Biosciences. All other authors declare that they have no competing PCR and capillary electrophoresis. J Mol Diagn. 2010;12(4):505–11. interests. 21. Haddad LA, Mingroni-Netto RC, Vianna-Morgante AM, Pena SDJ. A PCR- based test suitable for screening for fragile X syndrome among mentally retarded males. Hum Genet. 1996;97(6):808–12. Publisher’sNote 22. Hsiao K-M, Lin H-M, Pan H, Li T-C, Chen S-S, Jou S-B, et al. Application of Springer Nature remains neutral with regard to jurisdictional claims in FTA® sample collection and DNA purification system on the determination published maps and institutional affiliations. of CTG trinucleotide repeat size by PCR-based southern blotting. J Clin Lab Anal. 1999;13(4):188–93. Author details 23. Fojta M, Havran L, Vojtiskova M, Palecek E. Electrochemical detection of 1Institute for Genomic Medicine, Columbia University, New York, NY 10032, DNA triplet repeat expansion. J Am Chem Soc. 2004;126(21):6532–3. USA. 2Nextomics Biosciences, Wuhan, Hubei 430000, China. 3China-Japan 24. Lim GX, Loo YL, Mundhofir FE, Cayami FK, Faradz SM, Rajan-Babu IS, et al. Friendship Hospital, Beijing 100029, China. 4Department of Biomedical Validation of a commercially available screening tool for the rapid Informatics, Columbia University, New York, NY 10032, USA. identification of CGG trinucleotide repeat expansions in FMR1. J Mol Diagn. 2014;17(3):302–14. Received: 7 December 2016 Accepted: 30 June 2017 25. Zhang T, Lin X-C, Tang H, Yu R-Q, Jiang J-H. Mass spectrometry based trinucleotide repeat sequence detection using target fragment assay. Anal Methods. 2016;8(25):5039–44. References 26. Nakatani K, Hagihara S, Goto Y, Kobori A, Hagihara M, Hayashi G, et al. 1. Kovtun IV, McMurray CT. Features of trinucleotide repeat instability in vivo. Small-molecule ligand induces nucleotide flipping in (CAG)n trinucleotide – Cell Res. 2008;18(1):198–213. repeats. Nat Chem Biol. 2005;1(1):39 43. – 2. McMurray CT. Mechanisms of trinucleotide repeat instability during human 27. Ashley EA. Towards precision medicine. Nat Rev Genet. 2016;17(9):507 22. development. Nat Rev Genet. 2010;11(11):786–99. 28. Loomis EW, Eid JS, Peluso P, Yin J, Hickey L, Rank D, et al. Sequencing the 3. Lima M, Costa MC, Montiel R, Ferro A, Santos C, Silva C, et al. Population unsequenceable: Expanded CGG-repeat alleles of the fragile X gene. – of wild-type CAG repeats in the Machado-Joseph Disease gene in Genome Res. 2013;23(1):121 8. Portugal. Hum Hered. 2005;60(3):156–63. 29. Rhoads A, Au KF. PacBio sequencing and its applications. Genomics – 4. Bettencourt C, Lima M. Machado-Joseph Disease: from first descriptions to Proteomics Bioinformatics. 2015;13(5):278 89. new perspectives. Orphanet J Rare Dis. 2011;6(1):1–12. 30. Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: 5. Spada ARL, Wilson EM, Lubahn DB, Harding AE, Fischbeck KH. Androgen computational challenges and solutions. Nat Rev Genet. 2012;13(1):36–46. receptor gene in X-linked spinal and bulbar muscular atrophy. 31. Roberts RJ, Carneiro MO, Schatz MC. The advantages of SMRT sequencing. Nature. 1991;352(6330):77–9. Genome Biol. 2013;14(7):1–4. 6. Verkerk AJMH, Pieretti M, Sutcliffe JS, Fu Y-H, Kuhl DPA, Pizzuti A, et al. 32. Clarke J, Wu H-C, Jayasinghe L, Patel A, Reid S, Bayley H. Continuous base Identification of a gene (FMR-1) containing a CGG repeat coincident with a identification for single-molecule nanopore DNA sequencing. Nat Nano. breakpoint cluster region exhibiting length variation in fragile X syndrome. 2009;4(4):265–70. Cell. 1991;65(5):905–14. 33. Chaisson MJP, Huddleston J, Dennis MY, Sudmant PH, Malig M, Hormozdiari 7. La Spada AR, Taylor JP. Repeat expansion disease: progress and puzzles in F, et al. Resolving the complexity of the human genome using single- disease pathogenesis. Nat Rev Genet. 2010;11(4):247–58. molecule sequencing. Nature. 2015;517(7536):608–11. 8. Mittelman D, Moye C, Morton J, Sykoudis K, Lin Y, Carroll D, et al. Zinc-finger 34. Glenn TC. Field guide to next-generation DNA sequencers. Mol Ecol Resour. directed double-strand breaks within CAG repeat tracts promote repeat 2011;11(5):759–69. instability in human cells. Proc Natl Acad Sci U S A. 2009;106(24):9607–12. 35. Carneiro MO, Russ C, Ross MG, Gabriel SB, Nusbaum C, DePristo MA. Pacific 9. Cleary JD, Tome S, Lopez Castel A, Panigrahi GB, Foiry L, Hagerman KA, et biosciences sequencing technology for genotyping and variation discovery al. Tissue- and age-specific DNA replication patterns at the CTG/CAG- in human data. BMC Genomics. 2012;13(1):1–7. expanded human myotonic dystrophy type 1 locus. Nat Struct Mol Biol. 36. Laver T, Harrison J, O’Neill PA, Moore K, Farbos A, Paszkiewicz K, et al. 2010;17(99):1079–87. Assessing the performance of the Oxford Nanopore Technologies MinION. 10. Freudenreich CH, Kantrow SM, Zakian VA. Expansion and length-dependent Biomol Detect Quantif. 2015;3:1–8. fragility of CTG repeats in yeast. Science. 1998;279(5352):853–6. 37. Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using 11. Kerrest A, Anand RP, Sundararajan R, Bermejo R, Liberi G, Dujon B, et al. basic local alignment with successive refinement (BLASR): application and SRS2 and SGS1 prevent chromosomal breaks and stabilize triplet repeats by theory. BMC Bioinformatics. 2012;13(1):238. restraining recombination. Nat Struct Mol Biol. 2009;16(2):159–67. 38. Li H. Aligning sequence reads, clone sequences and assembly contigs with 12. Kang S, Jaworski A, Ohshima K, Wells RD. Expansion and deletion of CTG BWA-MEM. arXiv. 2013;arXiv:1303.3997 repeats from human disease genes are determined by the direction of 39. RepeatMasker Open-4.0. http://www.repeatmasker.org. Accessed 10 July replication in E. coli. Nat Genet. 1995;10(2):213–8. 2017. Liu et al. Genome Medicine (2017) 9:65 Page 16 of 16

40. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27(2):573–80. 41. Doi K, Monjo T, Hoang PH, Yoshimura J, Yurino H, Mitsui J, et al. Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing. Bioinformatics. 2014;30(6):815–22. 42. Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: A short profiler for personal genomes. Genome Res. 2012;22(6):1154–62. 43. Ummat A, Bashir A. Resolving complex tandem repeats with long reads. Bioinformatics. 2014;30(24):3491–8. 44. Baum LE, Eagon JA. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull Amer Math Soc. 1967;73(3):360–3. 45. Kschischang FR, Frey BJ. Iterative decoding of compound codes by probability propagation in graphical models. IEEE J Sel Areas Commun. 1998;16(2):219–30. 46. McFarland KN, Liu J, Landrian I, Godiska R, Shanker S, Yu F, et al. SMRT sequencing of long tandem nucleotide repeats in SCA10 reveals unique insight of repeat expansion structure. PLoS One. 2015;10(8):e0135906. 47. Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific Data. 2016;3:160025. 48. Pendleton M, Sebra R, Pang AWC, Ummat A, Franzen O, Rausch T, et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat Meth. 2015;12(8):780–6. 49. Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, et al. Primer3—new capabilities and interfaces. Nucleic Acids Res. 2012;40(15):e115. 50. Takiyama Y, Nishizawa M, Tanaka H, Kawashima S, Sakamoto H, Karube Y, et al. The gene for Machado-Joseph disease maps to human chromosome 14q. Nat Genet. 1993;4(3):300–4. 51. Kawaguchi Y, Okamoto T, Taniwaki M, Aizawa M, Inoue M, Katayama S, et al. CAG expansions in a novel gene for Machado-Joseph disease at chromosome 14q32.1. Nat Genet. 1994;8(3):221–8. 52. Shi L, Guo Y, Dong C, Huddleston J, Yang H, Han X, et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat Commun. 2016;7:12065. 53. Borstnik B, Pumpernik D. Tandem repeats in protein coding regions of primate genes. Genome Res. 2002;12(6):909–15. 54. Brinkmann B, Klintschar M, Neuhuber F, Hühne J, Rolf B. Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. Am J Hum Genet. 1998;62(6):1408–15.

Submit your next manuscript to BioMed Central and we will help you at every step:

• We accept pre-submission inquiries • Our selector tool helps you to find the most relevant journal • We provide round the clock customer support • Convenient online submission • Thorough peer review • Inclusion in PubMed and all major indexing services • Maximum visibility for your research

Submit your manuscript at www.biomedcentral.com/submit