Accurate detection and classification of heterozygous indels by direct sequencing / 1550/F4

Jon Sorenson, Anjali Pradhan, Sharada Vijaychander, Bimal Sangari, Sylvia Fang, Theresa Nguyen, Ben Jones, Danwei Guo, Quynh Doan. Applied Biosystems, 850 Lincoln Centre Dr., Foster City, CA

ABSTRACT

A heterozygous insertion/deletion (indel) is defined by the presence of two alleles differing by an insertion or deletion, that is, a length polymorphism. It has been estimated that 20% of the polymorphisms in the human genome are length polymorphisms. However, direct sequencing of heterozygous individuals is complicated by a “phase shift” in the electropherogram trace at the site of the indel mutation. Quality-based sequencing pipelines discard this data as noise or spend significant time and effort trying to call the polymorphism correctly. Starting with SeqScape® Software v2.0, we introduced an algorithm to predict when samples were exhibiting the presence of a heterozygous indel mutation (HIM). This algorithm has been refined in subsequent releases of the software to include basecalling the inserted/deleted sequence, mapping the polymorphism to the reference sequence, displaying the mutation in the HUGO-approved format, and assigning a quality value to the mutation. Recently we have been able to improve the algorithm further by accumulating numerous in-house and customer examples of this trace feature and developing a pipeline for annotation and automated testing. Results from this testing show a significant improvement in the ability of the software to correctly call heterozygous indels compared with earlier analysis pipelines. This ability to accurately detect and characterize heterozygous indels offers a valuable solution to a previously difficult problem in the analysis of resequencing data.

INTRODUCTION

Characterization of heterozygous insertions and deletions by direct sequencing is often challenging due to the noisy nature of the electropherogram trace after the insertion/deletion (indel) mutation (see Figure 3). However, these traces are often clean in the sense that the trace exhibits a reproducible signature, and other sources of random noise can be minimized. In this situation it has proven possible to construct algorithms which correctly identify the location and identity of heterozygous indel mutations.

Examples of heterozygous indels found by direct sequencing

Figure 3a. Deletion of T near exon 2 of MSX1 (LL:4487).

Figure 3b. Insertion of T in exon 9 of MSH2 (LL:4436).

Figure 3c. Deletion of TG in exon 16 of NPHS1 (LL:4868).

The above pictures show forward and reverse traces from loci containing heterozygous insertions or deletions. These mutations are correctly called by the algorithms described here.

Algorithm results

Table I. Testing results for the HIM detection algorithm implemented in SeqScape Software v2.5. Results are grouped by the quality of the underlying data with respect to blobs, n+1 primer peaks, and PCR contamination. Sensitivity is measured with two definitions: 1. How often was the true location and sequence called correctly? 2. How often was the true location called correctly? As can be seen from these results, it is more difficult to call the correct sequence than to identify the HIM location. This can be traced back to the quality of the underlying basecalls. Specificity is measured as how frequently a called HIM reflects a true HIM. Because the data set is biased towards samples containing HIMs, the true specificity rates are much higher than those reported here.

The algorithm described here has been incorporated into SeqScape Software v2.5, a software tool for analyzing resequencing projects. The integrated analysis pipeline of SeqScape Software is shown in Figure 6. Figure 7 shows an example of a heterozygous indel being identified and reported correctly by SeqScape Software.

Figure 6. The integrated resequencing analysis pipeline of SeqScape Software. Heterozygous indel detection is incorporated into steps 1 and 2.

Data Annotation Pipeline

A first step towards constructing any robust algorithm is the accumulation of annotated data for testing and validating the algorithm. High-throughput annotation of HIMs is made difficult by two factors: (1) no previous algorithm has existed to reliably detect this signature, and (2) sequences containing HIMs are relatively rare, considering that the background rate of variation is relatively low (in many organisms) and the percentage of variations which are HIMs is a fraction of this.

Over the last four years we have collected examples of this signature from various in-house and customer sources. The current test set contains 262 examples of forward and reverse sequences from >39 loci where heterozygous indels have been observed. Given this relatively large data set for the problem, we needed to develop a moderate-throughput approach to correctly annotating each pair of forward and reverse sequences.

Algorithm details

The process of HIM detection can be separated into trace-level and specimen-level operations.

For each electropherogram trace the following steps are applied:
• Identify a putative HIM by detecting a sharp change in the per-peak signal-to-noise ratio. The quality of this detection is assessed by statistical measures.
• For high-quality putative HIMs, identify possible deleted or inserted sequences which are consistent with the observed mixed basecalls. The details of this algorithm are proprietary, but it does not require use of a reference sequence.
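Because the trace-level algorithm is proprietary, the following Python sketch is only an illustration of the two steps listed above, not the SeqScape implementation. The function names `find_snr_drop` and `consistent_shifts`, the parameters `min_flank` and `max_len`, the Welch-style score, and all input values are assumptions made for this example. The second function relies on one observation: if one allele is the other shifted by k bases, the mixed call at position j and the mixed call at position j + k must share a base.

```python
from math import sqrt
from statistics import mean, pvariance

# IUPAC codes for single bases and two-base mixtures (three-base codes omitted).
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}


def find_snr_drop(snr, min_flank=3):
    """Return (index, score) of the sharpest high-to-low split in per-peak SNR.

    Hypothetical stand-in for the proprietary detector: each split is scored
    with a Welch-style statistic (difference of means over its standard error).
    """
    best_index, best_score = None, 0.0
    for i in range(min_flank, len(snr) - min_flank + 1):
        left, right = snr[:i], snr[i:]
        stderr = sqrt(pvariance(left) / len(left) + pvariance(right) / len(right))
        score = (mean(left) - mean(right)) / (stderr or 1e-9)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


def consistent_shifts(mixed_calls, max_len=10):
    """Return indel lengths k that the mixed basecalls do not rule out.

    This checks a necessary condition only (calls at j and j + k share a
    base), so several lengths can survive on short windows.
    """
    sets = [IUPAC[c] for c in mixed_calls]
    return [
        k for k in range(1, min(max_len, len(sets) - 1) + 1)
        if all(sets[j] & sets[j + k] for j in range(len(sets) - k))
    ]


# Made-up per-peak SNR values: a clean trace for 8 peaks, then mixed peaks.
snr = [22, 19, 21, 20, 23, 18, 21, 20, 3, 2, 4, 2, 3, 2, 3, 2]
index, score = find_snr_drop(snr)

# Mixed calls built from alleles TGACTGC and GACTGCA (a 1 bp shift).
candidates = consistent_shifts("KRMYKSM", max_len=3)
```

For these made-up inputs the split is found at index 8, and indel lengths 1 and 3 both pass the consistency check. A short window therefore leaves more than one candidate, which is one reason a quality assessment of the putative HIM is still needed before a call is made.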

SeqScape Software assembles forward and reverse traces into a single consensus for each specimen. For each specimen the following HIM-related steps are applied:
• Identify regions of the consensus where the overlapping traces each suggest a HIM event.
• Identify the correct consensus location for the HIM.
• Identify the correct indel sequence using the sequences detected above.
• Identify if the indel sequence is an insertion or deletion versus the reference.
• Assign a quality value to the HIM using proprietary heuristics.

Testing pipeline

To test the validity of this algorithm, we constructed an automated pipeline which analyzes each specimen in our test set and compares the algorithmic answer to the annotated answer.

Figure 4. Algorithm testing workflow for validating the HIM detection algorithm.

Figure 5. Example test report from algorithm testing. The columns to the far right identify if the algorithm call resulted in a true positive (correct position and sequence), true negative (no HIM is present), false positive (a HIM was called but there is no true HIM), or false negative (either the position or sequence was called incorrectly, if at all).

An additional requirement of data annotation for real-world algorithms is the separation of data into high-quality and low-quality buckets. An algorithm should be expected to achieve high accuracy on data from the high-quality bucket and be able to identify the difference between high- and low-quality inputs. Ideally the algorithm would call correct answers through low-quality data as well, but many other factors outside the limitations of the algorithm might make this situation impossible.

Figure 1. Workflow for annotating heterozygous indels. The approach uses a preliminary version of the algorithm to help guide the annotator towards the correct call.

Figure 2. Visualization of a heterozygous indel using an in-house data annotation tool. These traces show a 2bp insertion of AC from exon 2 of VHL.

Figure 7. Reporting of a deletion of G in T1A-2 (lung type-I cell membrane-associated glycoprotein). Screenshot is taken from SeqScape Software v2.5.

CONCLUSION

The analysis of resequencing studies is often limited by the availability of tools which are focused on the issues particular to resequencing. To address this problem, we introduced SeqScape Software several years ago as an analysis tool for enabling researchers to quickly analyze resequencing data.

In subsequent releases of SeqScape Software we have focused on enhancements which treat the specific needs of resequencing analysis. The software package now supports a wide variety of reference annotations, incorporates various options for primer trimming and heterozygote detection, and presents a large number of integrated views and reports detailing analysis results.

The algorithms described here for detecting and characterizing heterozygous indel mutations are a useful addition to this suite of analysis features. Development of a robust solution to this problem is hindered by the lack of a large supply of annotated data, but we have made significant progress towards accumulating such a data set. The proper analysis of HIMs is a difficult problem for most resequencing pipelines, and these algorithms offer a useful solution to it.

RELATED POSTERS

1216/T, 1662/F

ACKNOWLEDGMENTS

We would like to acknowledge the help provided by Jim Labrenz, Curtis Gehman, Carey Gire, Carl Fosler, Craig Forbes, Stephen Glanowski, and Michele Cargill. We would also like to thank various customers who have kindly sent back instances of this data with the hope that we might some day analyze it better.

NOTE

For Research Use Only. Not for use in diagnostic procedures.

Applied Biosystems and SeqScape are registered trademarks and AB (Design), KB and Applera are trademarks of Applera Corporation or its subsidiaries in the US and/or certain other countries.

Notice to Purchaser: License Disclaimer
Purchase of this software product alone does not imply any license under any process, instrument or other apparatus, system, composition, reagent or kit rights under patent claims owned or otherwise controlled by Applera Corporation, either expressly or by estoppel.

© 2004 Applied Biosystems. All rights reserved.