
PNAS PLUS

Human transposon insertion profiling: Analysis, visualization and identification of somatic LINE-1 insertions in ovarian cancer

Zuojian Tang^a,b, Jared P. Steranka^c,d, Sisi Ma^a,2, Mark Grivainis^a,b, Nemanja Rodic^c,3, Cheng Ran Lisa Huang^d,4, Ie-Ming Shih^c,e, Tian-Li Wang^c, Jef D. Boeke^b,1, David Fenyö^a,b,1, and Kathleen H. Burns^c,d,1

^aCenter for Health Informatics and Bioinformatics, NYU Langone Medical Center, New York, NY 10016; ^bInstitute for Systems Genetics, NYU Langone Medical Center, New York, NY 10016; ^cDepartment of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205; ^dMcKusick–Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205; and ^eDepartment of Gynecology and Obstetrics, Johns Hopkins University School of Medicine, Baltimore, MD 21205

Contributed by Jef D. Boeke, December 20, 2016 (sent for review June 18, 2016; reviewed by Prescott Deininger and David J. Witherspoon)

GENETICS

Mammalian genomes are replete with interspersed repeats reflecting the activity of transposable elements. These mobile DNAs are self-propagating, and their continued transposition is a source of both heritable structural variation as well as somatic mutation in human genomes. Tailored approaches to map these sequences are useful to identify insertion alleles. Here, we describe in detail a strategy to amplify and sequence long interspersed element-1 (LINE-1, L1) retrotransposon insertions selectively in the human genome, transposon insertion profiling by next-generation sequencing (TIPseq). We also report the development of a machine-learning–based computational pipeline, TIPseqHunter, to identify insertion sites with high precision and reliability. We demonstrate the utility of this approach to detect somatic retrotransposition events in high-grade ovarian serous carcinoma.

retrotransposon | TIPseq | human | LINE-1 | ovarian cancer

Significance

Much of our genome is repetitive sequence. This property poses challenges for investigators because differences in repetitive sequences are difficult to detect. With hundreds of thousands of similar repeats, it has been difficult to discern how one person’s genome differs from another person’s genome or how tumor DNA differs from normal DNA. To solve this issue, we developed methods to target next-generation sequencing to the insertion sites of the most variable repeats. Computational pipelines to make these studies scalable and more widely accessible were needed, however. Here, we report a pipeline that accomplishes this goal. We use it to demonstrate insertions of the long interspersed element-1 (LINE-1) acquired in ovarian cancer that may contribute to the development of these tumors.

Author contributions: Z.T., J.D.B., D.F., and K.H.B. designed research; Z.T., J.P.S., N.R., C.R.L.H., and T.-L.W. performed research; Z.T., S.M., M.G., C.R.L.H., and I.-M.S. contributed new reagents/analytic tools; Z.T. and C.R.L.H. analyzed data; and Z.T., J.D.B., D.F., and K.H.B. wrote the paper.

Reviewers: P.D., Tulane Cancer Center; and D.J.W., University of Utah.

The authors declare no conflict of interest.

Data deposition: The sequences reported in this paper have been deposited in the NCBI Sequence Read Archive (SRA) database (accession nos. SRP074110 and SRP074316).

1 To whom correspondence may be addressed. Email: [email protected], [email protected], or [email protected].

2 Present address: Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455.

3 Present address: Department of Pathology, Yale University School of Medicine, New Haven, CT 06520.

4 Present address: Atlas Venture, Cambridge, MA 02139.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1619797114/-/DCSupplemental.

Much of our genome consists of interspersed repeats, sequences that evidence the long-standing activities of mobile DNAs (1, 2). One of the most abundant and successful mobile DNAs in human genomes is long interspersed element-1 (LINE-1). LINE-1 sequences are common to many mammals, and sequence comparisons indicate that LINE-1 has accumulated throughout primate lineages as a singular succession of subfamilies (3). There are more than 1 million LINE-1 fragments in our genome today, accounting for nearly 540 million base pairs, or about 18% of our DNA (1). The oldest LINE-1 insertions have themselves been interrupted by younger transposable elements, and their fragments can be merged such that we recognize a smaller number of LINE-1 insertion instances, totaling around 500,000.

The vast majority of these sequences are so-called “fixed present” insertions, meaning they are found in a homozygous state in all humans. These insertions represent ancestral insertions frequently shared with other extant species. A much smaller set of roughly 500 full-length and truncated LINE-1 insertions in the average human genome corresponds to the L1PA1 subfamily or the “Ta” subset of the Homo sapiens-specific insertions of LINE-1 (L1Hs) sequences. These insertions are transcriptionally and transpositionally active (4, 5). Although this particularly interesting “active” subset of LINE-1 includes “fixed present” elements, it also encompasses polymorphic elements. Each constitutes a biallelic structural variant, such that an individual may not have the LINE-1 insertion (i.e., may carry only the preinsertion allele) or may have inherited the LINE-1 insertion (i.e., may be either homozygous or heterozygous for the insertion polymorphism).

Active L1Hs retrotransposons are responsible not only for L1Hs sequence but also drive retrotransposition of other mobile DNAs, namely, Alu short interspersed elements (SINEs) and SVA (SINE/VNTR/Alu) transposons (6, 7) and even processed pseudogenes (8). Collectively, these insertions represent a major source of structural variants in human populations.

LINE-1 mobilization occurs through a mechanism termed target primed reverse transcription (TPRT) (9, 10). There are several hallmarks of a sequence inserted by the TPRT mechanism (Fig. 1A), including the following: (i) target site duplications (TSDs) surrounding the insertion, (ii) a poly(A) (pA) tail at the 3′ end of the sequence, and (iii) a 5′ end truncation and/or inversion (11). Sequences may also demonstrate 3′ (or, less commonly, 5′) transduction events, which result from the formation of RNA intermediates more encompassing than the templating LINE-1. When these RNA intermediates are reverse-transcribed, unique sequences adjacent to the LINE-1 from the templating locus are incorporated into the transposed segment (12–14).

The high copy number of LINE-1 repeats and these sequence features together present substantial challenges to the accurate and sensitive detection of new insertions. These challenges are

exacerbated when insertion alleles are at a low frequency in a sample. Such is the case for somatically acquired insertions in primary tissue samples, which consist of admixtures of cells of different lineages. Because of these barriers, and assumptions that interspersed repeats are not functional, characterizations of these sequences have been incomplete. Recently, however, strategies for mapping these elements have been developed based on selective PCR amplification (15–17), hybridization-based enrichment (18, 19), and read selection from whole-genome sequencing data (20–22). These studies underscore the continued activity of LINE-1 in modern humans, and have demonstrated somatic insertions in several types of human malignancy (15, 19, 22–27).

Here, we describe a strategy for selectively amplifying genomic DNA 3′ of insertions of active subfamilies of human LINE-1 for Illumina sequencing (17, 27, 28). We also describe a machine-learning–based computational pipeline, transposon insertion profiling by next-generation sequencing (TIPseq) Hunter (TIPseqHunter; https://github.com/fenyolab/TIPseqHunter, with supporting files available at openslice.fenyolab.org/data/tipseqhunter/), and a visualization tool, TranspoScope, to identify and display L1Hs insertions in the resulting reads. This combination of tools is useful for identifying inherited retrotransposon insertion polymorphisms, as well as insertions that occur somatically. We show the utility of these tools by applying TIPseq to one of the first surveys of LINE-1 activity in ovarian cancer.

Fig. 1. LINE-1 insertions and vectorette PCR. (A) Full-length LINE-1 (L1) insertion is diagrammed; the LINE-1 spans 6 kb and includes two ORFs (ORF1 and ORF2). The element ends with a 3′ series of adenine nucleobases and the pA tail of its RNA precursor, and it is flanked by TSDs of the preinsertion genomic sequence (red boxes). (B, Top to Bottom) Vectorette PCR work flow. Genomic DNA (parallel lines) is cut with restriction enzymes, leaving sticky ends (downward-facing arrows); vectorette adapters (blue) are ligated to these ends. The annealed vectorette sequences are not perfectly complementary, and no binding site exists for the amplification primer at the outset of the PCR assay. First-strand extensions (sea green) occur from a forward primer specific for L1Hs LINE-1 (black, rightward-facing top arrow), and in subsequent iterations of the PCR, the reverse amplification primer has its complement from these strands (black, leftward-facing bottom arrow). The structure of the resulting amplicons, along with possibilities for corresponding paired end sequencing reads, is shown. Informative reads can be grouped into categories depending on their positions in the TIPseq amplicons. These categories include L1/junction read pairs, L1/genome read pairs, junction/genome read pairs, and genome/genome read pairs. L1/L1 concordant read pairs are not informative.

Results

Vectorette PCR. Vectorette PCR is a method that can be used to amplify DNA fragments in which a portion (i.e., the 5′ end of an amplicon) is a known sequence, but the sequence at the opposing end is unknown (29, 30). Vectorette PCR is a type of ligation-mediated PCR; alternatives include linear amplification strategies and ligation-mediated PCRs, such as inverse PCR. In our hands, vectorette PCR is an effective strategy to amplify LINE-1 insertion sites and insertion sites of similarly complex populations of mobile DNAs in the human genome, including AluYa and Yb subfamilies, as well as human endogenous retrovirus-K (HERV-K) elements (17). Steps of the PCR are diagrammed in Fig. 1B. One advantage of TIPseq over other methods of sequencing transposon-flanking amplicons is that it depends on a single round of PCR before library construction. Most other methods depend on nested PCR, which tends to amplify bias in outcomes greatly when a variety of amplicons of different lengths and sequences are amplified in parallel.

Having high-quality, high-molecular-weight genomic DNA as starting material is important. For the method described, we typically isolate DNA from fresh-frozen tissues using phenol-chloroform extraction and ethanol precipitation. We use 10 μg of genomic DNA and divide this genomic DNA into aliquots that are independently digested with one of a panel of restriction enzymes. One of the first considerations in designing the vectorette PCR is to select restriction enzymes that will maximally represent portions of the genome in amplifiable fragment lengths; it is desirable to ensure that as close as possible to 100% of genomic insertions will be represented as at least one fragment of less than 3 kb. For TIPseq amplifications in the human genome, we use five or six different 5- or 6-bp restriction enzyme cutters to ensure that a large majority of the genome (>95%) is represented in fragments that are 1–3 kb in size in at least one of the parallel digests. Shorter fragments amplify but are less informative, particularly for insertions into genomic locations that are themselves repetitive, or those insertions representing significant 3′ transduction events. In contrast, longer fragments are less well amplified relative to shorter fragments in this highly multiplexed PCR.

The restriction enzymes chosen conform to the following characteristics: (i) they should cut efficiently and independently of genomic methylation; (ii) they should leave overhanging “sticky ends” and demonstrate high efficiencies in serial cut, ligation, and recut experiments; (iii) the restriction enzyme recognition site should occur at the right frequency in the genome (typically, 6-bp cutters); and (iv) we avoid using combinations of enzymes that would impose multiple requirements for CG dinucleotide-containing cut sites because these sites are underrepresented in the human genome. Enzymes that can be heat-inactivated before vectorette oligonucleotide ligation are advantageous. Finally, it is critical that the cut sites corresponding to chosen enzymes not be represented within the transposable element at any position 3′ of the forward primer. This requirement ensures that amplicons will extend from the retroelement insertion into unique flanking DNA. Because of this last point, we use different restriction enzymes for mapping different types of transposable elements. For human LINE-1, we use six enzymes: AseI, BspHI, BstYI, HindIII, NcoI, and PstI (New England Biolabs).

Next, a pair of vectorette oligonucleotides is designed to correspond to each restriction enzyme used. The annealed oligonucleotide pair will form the vectorette adapter. The two strands create a double-stranded end for the adapter that is compatible with the sticky end created by the restriction enzyme. This structure allows for efficient ligation. The annealed

vectorette sequences do not complement one another perfectly throughout their length, however. Vectorette adapters have a central, partly mismatched sequence. This mismatched interval is where one of the primers for the vectorette PCR is positioned (Fig. 1B, leftward facing arrow). This amplification primer is complementary to the first strand synthesized in the PCR assay. Its design forces first strand synthesis from the transposable element sequence; no binding site complementary to the amplification primer exists unless this strand extension occurs. After this extension, in subsequent cycles of the PCR, exponential amplification of sequences flanking the transposable element can proceed. The amplification primer responsible for the first strand extension is designed to be complementary to the transposable element (Fig. 1B, rightward facing arrow). This primer is placed to take advantage of so-called “diagnostic nucleotide” substitutions that define relatively recently active subfamilies of mobile DNAs. This procedure minimizes unwanted amplification of insertion sites from older, exclusively “fixed present” transposons and greatly enriches for amplification of insertions, such as polymorphic insertions or acquired somatic events. In the case of the L1Hs Ta subset, this specificity is possible because of a consecutive trinucleotide signature “ACA” in the 3′ UTR of the element. The vectorette PCR uses the 3′ end of the element, which is advantageous to detect 5′ truncated retroelements. On the other hand, our dependence on the 3′ end can create difficulties owing to sequencing across the pA homopolymer. We have found that the greatest specificity is conferred when ACA nucleotides are located at the 3′ end of the amplification primer (L1 amplification primer sequence, 5′-AGATATACCTAATGCTAGATGACACA-3′). Locked nucleic acids can be placed at these three 3′-most positions to increase binding specificity but are not required.

PCR conditions must strike a balance between imposing this specificity and achieving a high yield. Use of proofreading polymerases is necessary, given the 1- to 3-kb targeted optimal length of the PCR amplicons, and we use ExTaq DNA Polymerase (TaKaRa Clontech). We use touchdown cycling conditions with annealing temperatures lowering progressively from 72 °C to 60 °C. The vectorette PCR produces a complex mixture of amplicons. These amplicons are sheared to 300 bp and prepared for Illumina sequencing. We use a Covaris E210 instrument to shear the DNA, with the following settings: 75 s, 200 cycles per burst, intensity of 4, and 10% duty cycle. The Illumina TruSeq DNA Sample Preparation Kit v2, Illumina TruSeq DNA PCR-Free Library Preparation Kit, or Kapa Biosystems KAPA DNA Library Preparation Kit can be used for library preparation. Indexing allows for sequencing runs to be multiplexed. We have generated high-quality LINE-1 insertion profiles for somatic retrotransposition assays running 10–12 samples in a single Illumina HiSeq v4 sequencing lane for an average of between 20 and 40 million reads per sample. The TIPseqHunter pipeline requires paired-end sequencing reads.

Fig. 2. Schematic of the TIPseqHunter pipeline. There are five steps in the pipeline: (i) low-quality sequences, base pairs, and vectorette sequences are trimmed using Trimmomatic software; (ii) qualified read pairs are aligned to an L1Hs masked reference genome (hg19) and the L1Hs consensus sequence using Bowtie2 software; (iii) candidate insertion sites are identified using the enriched target sites with at least one junction-containing read pair; (iv) a machine-learning model is built using five features (width, depth, variant index, pA tail purity, and number of junction reads); and (v) the trained model is used to predict probabilities of the candidate insertions being the true insertion sites.

Data Analysis. The data analysis pipeline, TIPseqHunter, comprises five major steps: (i) sequence read preparation and quality control, (ii) sequence alignment, (iii) candidate LINE-1 insertion site identification, (iv) LINE-1 insertion site modeling, and (v) LINE-1 insertion site prediction (Fig. 2). Each is described further in this section. Because the efficacy of the PCR assay and the type of background associated with each TIPseq dataset can vary, the machine-learning algorithm is run on each sample:

i) The paired-end reads are trimmed of low-quality base pairs as identified by Illumina Phred or Q score using Trimmomatic (31). Illumina adaptors, vectorette oligonucleotides, and primer sequences are also trimmed.

ii) The processed reads are aligned using Bowtie2 version 2.2.3 (32) to the human reference genome assembly [February 2009 (hg19, GRCh37) release] in which 1,544 RepeatMasker (33) annotated L1Hs fragments have been masked using BEDTools (34). The reason we chose to mask L1Hs insertions incorporated in the reference genome was to be able to use characteristics of alignments at these loci in a machine-learning algorithm to be applied at loci without reference L1Hs. Only by aligning to a masked genome can we be assured that both reference and nonreference L1 insertions will behave similarly. Sequence alignments are also made against an L1Hs consensus reference sequence (35).

iii) Sequence reads indicative of LINE-1 insertions are detected (Fig. 1): (a) genome/genome (G/G) read pairs, where both reads align to genome sequence with an intervening distance consistent with the DNA fragment length distribution; the coverage distribution of these reads forms a wide peak next to a LINE-1 insertion site; (b) L1/genome (L1/G) read pairs, where one read is aligned to the L1Hs sequence and the other to the reference genome sequence, and the coverage distribution of these reads extends about 500 bp from the LINE-1 insertion site; and (c) junction reads (J) that span the insertion [both L1/junction (L1/J) and junction/genome (J/G) pairs]. Reads containing sequences from both the 3′ pA end of L1 and the flanking genomic sequence can be used to pinpoint the precise insertion site. TRLocator, an in-house–developed peak-finding algorithm (36), is modified

to find enriched target regions. Target regions with at least one junction read pair are considered candidate LINE-1 insertion sites.

iv) A model of candidate insertion sites is built for each sample using the logistic regression module of the R package “caret” (Fig. 3). The model is trained and tested on known LINE-1 insertions (positive instances) and candidate insertion sites missing the first 5′-most position of the LINE-1 primer sequence (negative instances). These insertion sites are randomly divided into a training set (70% of insertion sites) and a test set (30% of insertion sites). Ten-fold cross-validation on the training set is used to ensure model generalizability, and model accuracy is evaluated using the test set.

The five parameters used for the training set model are as follows:

a) Width: the width of the enriched target region (log2-based value)
b) Depth: the average coverage of the enriched target region (log2-based value)
c) Variant index: alignment mismatches and indels within the peak interval (sum of the number of mismatches and indels divided by the sum of the number of base pairs and the number of reads)
d) pA tail purity of each consensus sequence at the predicted 3′ end of the insertion site within a segment of specified length (%)
e) Number of J supporting the predicted site

Fig. 3. Training and evaluation of the model. (1) To identify if a candidate insertion site is a true insertion site, a dataset labeled with true and false insertion sites (the labeled set) is constructed for each sequencing sample. Positive instances (true insertion sites) in the labeled set are identified by matching to one of the two annotated LINE-1 lists (fixed present and RepeatMasker). (2) Negative instances (false insertion sites) in the labeled set are defined as candidates missing the first 5′-most position of the L1 amplification primer. (3) Labeled set comprises positive instances and negative instances, with “1” representing a positive instance and “0” representing a negative instance. Five selected features extracted from the sequencing data are obtained for each instance and will be used to construct the predictive model. (4) Labeled set is split into a training set (70%) and test set (30%) randomly. (5) Predictive model is built with logistic regression on the training set to establish the relationship between the characteristics in sequencing data and the instance type (i.e., whether a candidate insertion site is a true insertion site). (6) Resulting predictive model is applied to the test set to predict the instance type. (7) Instance type predicted by the model is compared with the true instance type of the test set to evaluate the performance of the predictive model (measured by accuracy). If the model performance is not satisfactory on the test set, applying the predictive model on the unlabeled dataset for novel insertion set discovery is not recommended. (8) Predictive model is applied to the unlabeled set to predict the probability of a candidate insertion site being a true transposon insertion site.

We previously described a study of somatic retrotransposition of LINE-1 in patients with lethal metastatic pancreatic ductal adenocarcinoma (PDAC) (27), in which we used TIPseq to map LINE-1 insertions in a series of primary and metastatic lesions, as well as matched normal genomic DNA (germ line) from the same individuals. Here, we applied TIPseqHunter to analyze data from 36 of these TIPseq experiments. Each included 6–30 million paired-end 100-bp reads per sample. All sequences were aligned to both the repeat masked human genome sequence and the L1Hs consensus sequence (as described in ii above). Overall alignment rates to hg19 were 98.7–99.4%, and the alignment rates to L1Hs were 32–42% (Table S1).

Depending on the sample, we identified between 10^4 and 10^5 candidate insertion sites (Table S2) defined by an enriched target region and at least one junction read. About one-quarter are spurious, with no restriction enzyme cut sites near the enriched target regions. These candidate insertion sites are excluded before modeling. The remaining enriched target regions lacking the first 5′-most position of the LINE-1 amplification primer are used as negative instances for training the model.

We used two sets of positive instances for training and testing the model: a set of L1PA1 elements present in all humans, so-called “fixed present insertions,” and a set of L1Hs that includes fixed present L1PA1 insertions, polymorphic L1PA1, and fixed L1PA2. Each set of known insertions has strengths and weaknesses for training the model. The fixed present L1PA1 elements are present in all humans, and will be homozygous in a sample when they occur on an autosome (i.e., two copies per genome equivalent). Using this set promotes specificity but lowers sensitivity. Detection of somatically acquired insertions present in a sample at less than one copy per genome equivalent is decreased. In contrast, when older LINE-1 and known insertion variants are included, only a subset is expected to be amplified in any given sample. Using this set promotes sensitivity with some reduction of specificity.

The fixed present insertions used here include 200 L1Hs annotated in the human reference genome build by RepeatMasker and also detected in TIPseq runs on 108 germ-line samples from diverse humans. L1PA2 is excluded from this set; all are L1PA1 or L1(Ta). We observe strong evidence for their presence in all samples, and there is a clear boundary between the positive and negative instances in the training set based on the five features (Fig. 4A and Table S3). Training the model on these 200 fixed present L1Hs insertions yields a small set of high-confidence insertion sites (Fig. 4B).

RepeatMasker annotates 1,544 L1Hs elements in the reference genome; of these annotations, about 600–800 are classified as candidate insertions in each sample. These 600–800 candidate insertion sites contain all of the 200 fixed present L1PA1 elements, but also encompass other, older elements and a number of known LINE-1 insertion variants. There is less discrimination between positive and negative instances (Fig. 4C and Table S4). Training the model on this larger set of positive instances that also contains sites supported by weaker evidence yields a larger set of predicted insertion sites (Fig. 4D).

The accuracy of these models was estimated using the test set and resulted in accuracies of >0.99 and >0.90 for models trained on the fixed present L1PA1 and larger, RepeatMasker L1Hs sets, respectively (Fig. 5A). Training on the smaller and higher quality set of fixed present L1PA1 results in higher accuracy and produces a shorter candidate list (Fig. 5B and TranspoScope web site at openslice.fenyolab.org/transposcope/home.html). In contrast, for a majority of samples, training on a larger set of RepeatMasker annotated insertions results in retrieval of a larger


Fig. 4. Model parameters. The distribution of five model parameters (width and depth in base pairs of the enriched target region; variant index, alignment mismatches and indels; pA tail purity at each predicted LINE-1 3′ end; and the number of junction reads supporting the predicted site). Width, depth, and junction reads are all log2-based values. Width and depth determine the placement of each point on the x and y axes. The variant index is shown as the color of the data point fill. The pA purity is shown as the color of the data point outline. The number of junction reads is depicted as the size of the data point. (A) Negative instances (Left), positive instances (Center), and unlabeled instances (Right) when a fixed present set of L1PA1 is used to train the model. (B) Predicted probabilities that candidate insertions are true LINE-1 insertions in five increments of P = 0.02, and then P < 0.9 (rightmost). (C) Instances when the RepeatMasker set is used to train the model. More insertions are included as positive instances for training compared with A. (D) Predicted probabilities associated with C.
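The five parameters in this caption are the inputs to the per-sample logistic regression described in step iv. The following is a minimal sketch of that idea only, not the TIPseqHunter implementation: the pipeline uses the logistic regression module of the R package "caret", whereas this stand-in is plain Python with batch gradient descent. The feature distributions in `synthetic_site`, the function names, and the learning-rate settings are all hypothetical, and the ten-fold cross-validation step is omitted for brevity.

```python
import math
import random

# Feature order follows the five parameters named in the text.
FEATURES = ["width_log2", "depth_log2", "variant_index",
            "pa_purity_pct", "junction_reads_log2"]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_scaler(rows):
    """Per-feature mean and standard deviation, computed on the training set only."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
           for j in range(k)]
    return means, sds


def scale(row, means, sds):
    """Standardize one feature vector with a previously fitted scaler."""
    return [(x - m) / s for x, m, s in zip(row, means, sds)]


def predict_proba(x, w, b):
    """Probability that a candidate site is a true insertion (label 1)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def synthetic_site(rng, positive):
    """Invented feature values for illustration only; not taken from the paper."""
    if positive:
        return [rng.gauss(9.5, 0.5), rng.gauss(6.0, 1.0), rng.gauss(0.005, 0.002),
                rng.gauss(95.0, 3.0), rng.gauss(3.0, 0.8)]
    return [rng.gauss(7.5, 0.5), rng.gauss(2.0, 1.0), rng.gauss(0.030, 0.010),
            rng.gauss(60.0, 15.0), rng.gauss(0.5, 0.5)]


def train_and_evaluate(seed=0, n_per_class=200):
    """Build a labeled set, split it 70/30, fit on the 70%, score on the 30%."""
    rng = random.Random(seed)
    data = ([(synthetic_site(rng, True), 1) for _ in range(n_per_class)]
            + [(synthetic_site(rng, False), 0) for _ in range(n_per_class)])
    rng.shuffle(data)
    cut = int(0.7 * len(data))
    train, test = data[:cut], data[cut:]
    means, sds = fit_scaler([x for x, _ in train])
    w, b = fit_logistic([scale(x, means, sds) for x, _ in train],
                        [y for _, y in train])
    hits = sum((predict_proba(scale(x, means, sds), w, b) >= 0.5) == (y == 1)
               for x, y in test)
    return hits / len(test), (w, b, means, sds)


if __name__ == "__main__":
    accuracy, _ = train_and_evaluate()
    print(f"test-set accuracy: {accuracy:.3f}")
```

In this sketch the scaler is fitted on the training split alone so that the held-out 30% plays the same role as the test set in Fig. 3; the fitted `predict_proba` would then be applied to unlabeled candidate sites to obtain per-site probabilities.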

number of insertions (Fig. 5B and TranspoScope web site). Even when a LINE-1 insertion occurs in a region with a low proportion of uniquely mapping reads, the percentage of concordant aligned reads is not compromised and the insertion can be reliably detected (Fig. S1).

Identification of Tumor-Specific Insertion Sites and PCR Validation. To test the TIPseqHunter pipeline, we mapped LINE-1 insertions in paired tumor and normal DNA samples from patients with PDAC and patients with ovarian carcinoma (OC). As previously reported (27), PDAC samples were acquired through a rapid autopsy protocol; we had available matched normal, primary tumor, and metastatic tumor samples from 10 individuals, and matched normal and primary tumor samples from three individuals. Using TIPseqHunter, we identified 88 so-called “progenitor L1” insertions, somatically acquired LINE-1 insertions shared by a primary tumor and a metastatic site of disease in a given case, and not found in normal genomic DNA (gDNA) from the same patient. We also identified 127 other (unshared) somatic insertions in either primary or metastatic tumor samples when comparing these samples with normal samples: 63 in primary tumors and 64 in metastatic sites of disease. Information on these somatic insertions is available on the TranspoScope web site. We are able to detect 76 of the 80 previously reported, PCR-validated L1 insertions (27) using TIPseqHunter, giving a sensitivity of 95%. Of 20 additional candidates identified by TIPseqHunter, we tested four high-quality insertions and successfully PCR-validated all four of these insertions. Using fewer sequencing reads (<10 million per sample) compromises our ability to detect these insertions (Fig. S2). We note that in the previous study, an insertion-finding algorithm was used but a significant component of manual postreview was also required; TIPseqHunter dispenses with this need.

For a TIPseq study of OC reported here, we compared paired tumor and normal gDNA from eight individuals with type II OC, seven with high-grade serous carcinoma (HGSC) and one with carcinosarcoma. These cases are representative of one of the most lethal malignant diseases in women, and we have previously reported high levels of LINE-1–encoded protein expression in these malignancies (37). Using TIPseq and TIPseqHunter, we found a total of 36 somatically acquired, tumor-specific insertions in five of the eight individuals. We had sufficient gDNA available for PCR validations for two of the five samples, both HGSC. We successfully validated all 21 insertions attempted in one sample (Fig. 6A and TranspoScope web site) and five of the six insertions in the second sample. Thus, the overall PCR validation rate for this ovarian cancer dataset was 96.3% (26 of 27 insertions), and our overall validation rate was 95.5% (106 of 111 insertions).

Notably, one of the somatically acquired insertions predicted by TIPseqHunter falls within intron 5 of the well-known tumor suppressor gene, breast cancer 1 gene (BRCA1). Inherited mutations in BRCA1 can create strong predispositions to both ovarian and breast cancers, and somatically acquired mutations or loss of BRCA1 expression through copy number alterations or epigenetic silencing occurs in a substantial proportion of ovarian HGSC cases. The 593-bp, 5′ truncated L1Hs insertion is in an antisense orientation with respect to BRCA1, and it has a 12- to

Tang et al. PNAS Early Edition | 5of8 Downloaded by guest on September 28, 2021 A are new technologies, and include approaches developed by Rah- bari and Badge (38), Devine and co-workers (15), Faulkner and co- workers (19, 39, 40), Kazazian and co-workers (16, 41), Deininger and co-workers (42), and Gage and co-workers (43), as well as our own approaches. Collectively, their application has reinforced that LINE-1 is an important source of structural variation in humans and contributed to a growing database of LINE-1 insertion polymorphisms (44). This conclusion is even more true for Alu insertion variants (45, 46). These studies have also demonstrated that several types of epithelial cancers acquire somatic insertions of LINE-1 as theydevelop(15,19,24,26).Recent projects mining whole-genome sequencing data have extended our understanding of the scope B of heritable LINE-1 insertions (20, 21, 47) and somatic retrotransposition (22, 23, 48) greatly. In the coming years, experimentalists will have many reasons to map LINE-1 insertions in individual samples. The ability to develop comprehensive catalogs of insertions in a particular sample is an important prerequisite to recognizing recurrent insertion patterns, associating insertions with phenotypes, manipulating these sequences to test their effects, identifying which LINE-1s are active in a sample (47), and discerning what makes a cell context permissive for LINE-1 activity. At present, we have focused on TIPseq using large quantities of high-molecular-weight genomic DNA as starting Fig. 5. Model performance. (A) Accuracies for a set of matched germ-line, material. In the future, adapting the protocol for single-cell appli- primary pancreatic tumor, and metastatic tumor samples. 
Accuracies are the cations or for use with fragmented gDNA from fixed tissues would highest when the fixed present L1PA1 set is used to define positive instances broaden the scope of questions that can be addressed by targeted for training models (circles). Accuracies for the RepeatMasker trained models LINE-1 insertion site mapping. range from 0.90 to 0.98 (squares). (B) Number of insertions detected (labeled positive plus unlabeled instances) with P > 0.99 as predicted by the models Targeted LINE-1 mapping methods have recently been de- when training on the fixed present and RepeatMasker sets. veloped and have not been extensively shared among independent laboratories or even among multiple users within their resident laboratories. There have been limited publications detailing wet 19-bp TSD (the exact TSD length is uncertain due to T7 bench considerations or providing accompanying informatics suites. microhomology with the pA tail). The intron sequence is rich in There have been limited efforts to compare approaches to establish Alu insertions, and the somatically acquired LINE-1 interrupts best practices, and more challenging applications, such as single-cell an AluSx. The insertion was validated with a series of PCR re- LINE-1 mapping, have generated estimates of LINE-1 activities actions, and Sanger-sequenced (Fig. 6A). The patient is not a that are difficult to reconcile (18, 40, 49). ’ known carrier of a germ-line BRCA1 mutation; we are currently Here, we present a user s guide to a targeted LINE-1 insertion assessing the functional consequences of this intronic insertion. site sequencing strategy: TIPseq. We describe a ligation-mediated PCR method for the selective amplification of LINE-1 insertions Data Visualization: TranspoScope. We have developed a biologist- and provide code for a machine-learning–based algorithm to friendly display for viewing TIPseq results that we call Trans- identify insertions based on the resulting reads. 
The PCR approach poScope. TranspoScope is meant to provide the following: (i)a is a modification of vectorette PCR, which selectively amplifies ′ graphical view of read pairs supporting each insertion; (ii)a fragments of genomic sequence 3 of an interspersed repeat se- display of the quality and quantity of junction reads, which we quence, in this application, the L1PA1 element. The analytical perceive as critical for confidently calling an insertion; (iii)a portion uses information from read alignments to unique sequences downstream, as well as the so-called “junction reads” that span the zoomable format that allows a range from multiple kilobases to ′ individual nucleotides; (iv) a pane reporting on the nearest gene(s) 3 pA end of the LINE-1 and the adjacent DNA. This methodology provides resolution of the 3′ endoftheinsertiontothebasepair, and, if intragenic, the position within the gene; and (v) a link to and gives the orientation of the insertion. Although the design of the UCSC Genome Browser, with a pointer to the exact posi- the PCR allows us to amplify 5′ truncated and 5′ inverted LINE-1 tion of the insertion. In addition, instances of restriction en- insertions, those severely truncated insertions (<100 bp), pA-only zyme sites used in the TIPseq experiment that lie near the insertions, and LINE-1 insertions with extensive 3′ transduction insertion point are indicated as red vertical lines in the screen B–E events are missed or pose problems for TIPseq/TIPseqHunter. shots shown in Fig. 6 , and a supporting movie (https://www. We apply TIPseq and TIPseqHunter here to detect somatic = youtube.com/watch?v exVAnoMRLSM) shows a TranspoScope retrotransposition in two types of malignancies, PDAC and type session. II OC. High proportions of both types of tumors aberrantly ex- Discussion press LINE-1 encoded ORF1p protein (37), an RNA-binding protein critical for LINE-1 retrotransposition. 
Both have pre- Much of our DNA is the result of transposable element activi- viously been shown to permit somatic retrotransposition events ties. Over the past several million years, essentially all of this (22, 25, 27). We demonstrate that TIPseq and TIPseqHunter activity has resided with a subfamily of LINE-1, the H. sapiens- allow for the precise detection of somatic LINE-1 integrations specific L1Hs and other RNA transposons dependent on LINE- when LINE-1 insertion profiles of matched tumor and normal 1–encoded reverse transcriptase for their retrotransposition (2). DNA are compared. Insertions found by this approach have an The highly repetitive nature of these sequences has made excellent validation rate, with well over 90% of insertions being them especially challenging to study. The reference genome as- validated by a traditional gDNA PCR. Our discovery of a so- sembly captures fixed alleles and high-frequency alleles, but does matically acquired insertion at the BRCA1 locus in a case of not encompass many common variants (17). Targeted methods ovarian cancer underscores the types of potentially functional for recovering LINE-1 insertions for next-generation sequencing alterations that can be revealed by this approach.
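As an illustration only, the tumor-versus-normal comparison described in the Results can be sketched as set operations on insertion calls, and the reported validation percentages can be rechecked from the counts given in the text. This is a minimal sketch and not the TIPseqHunter implementation: the function names `classify_somatic` and `validation_rate` are hypothetical, each call is reduced to a (chromosome, position, strand) tuple that must match exactly between samples, and the toy coordinates are invented. A real comparison would need to tolerate small positional differences between calls.

```python
# Hypothetical sketch: classify insertion calls from matched samples and
# recompute the validation percentages quoted in the text.
# Assumption: a call is a (chrom, pos, strand) tuple and identical calls
# in two samples match exactly; real data would need a positional window.

def classify_somatic(normal, primary, metastasis):
    """Split tumor calls into progenitor and unshared somatic insertions."""
    somatic_primary = primary - normal        # absent from normal gDNA
    somatic_metastasis = metastasis - normal  # absent from normal gDNA
    # "Progenitor" insertions are shared by primary tumor and metastasis.
    progenitor = somatic_primary & somatic_metastasis
    return {
        "progenitor": progenitor,
        "primary_only": somatic_primary - progenitor,
        "metastasis_only": somatic_metastasis - progenitor,
    }


def validation_rate(validated, attempted):
    """Percentage of attempted PCR validations that succeeded."""
    return round(100.0 * validated / attempted, 1)


# Toy calls (made-up coordinates, for illustration only).
germline = {("chr1", 100, "+"), ("chr2", 200, "-")}
normal = set(germline)
primary = germline | {("chr3", 300, "+"), ("chr4", 400, "-")}
metastasis = germline | {("chr3", 300, "+"), ("chr5", 500, "+")}

classes = classify_somatic(normal, primary, metastasis)

# Counts taken from the text: 76 of 80 known PDAC insertions detected,
# 4 of 4 new PDAC candidates validated, 21 of 21 and 5 of 6 in two OC samples.
pdac_sensitivity = validation_rate(76, 80)
oc_rate = validation_rate(21 + 5, 21 + 6)
overall_rate = validation_rate(76 + 4 + 21 + 5, 80 + 4 + 21 + 6)

print(classes)
print(pdac_sensitivity, oc_rate, overall_rate)
```

Under these assumptions the counts reproduce the figures in the text: 26 of 27 for the ovarian cancer dataset and 106 of 111 overall, where the denominators count attempted insertions.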

Tang et al. | www.pnas.org/cgi/doi/10.1073/pnas.1619797114


Fig. 6. Somatic LINE-1 insertions in ovarian cancer. (A) Positions of somatic insertions observed in HGSC shown on a chromosomal ideogram (red marks). To the lower right, a schematic shows the structure of the BRCA1 gene at 17q21.31 and the location of a somatically acquired, intronic LINE-1 insertion. The 593-bp LINE-1 is 5′ truncated and includes a portion of the ORF2p ORF, the LINE-1 3′ UTR, and a pA tail (red); it is flanked by TSDs (white boxes). (B–E) TranspoScope view of the evidence for two insertions. (B and C) L1(Ta) at chr6:136,712,694 ± 3 at two different magnifications. (D and E) LINE-1 insertion at chr17:41,250,393 ± 1 in BRCA1. (B and D) Distribution of genome/genome read pairs (gray), genome/L1 read pairs (purple/blue), junction reads (orange), and all reads overlaid. (C and E) Sequence of the junction reads.

ACKNOWLEDGMENTS. This work was supported by the Sol Goldman Pancreatic Cancer Research Center and the Health, Empowerment, Research, and Awareness Women’s Cancer Foundation (N.R.); a Burroughs Wellcome Fund Career Award for Biomedical Scientists Program (K.H.B.); US NIH Awards R01CA161210 (to J.D.B.), R01CA163705 (to K.H.B.), and R01GM103999 (to K.H.B.); as well as National Institute of General Medical Sciences Center for Systems Biology of Retrotransposition Grant P50GM107632 (to K.H.B. and J.D.B.).

1. Lander ES, et al.; International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409(6822):860–921.
2. Burns KH, Boeke JD (2012) Human transposon tectonics. Cell 149(4):740–752.
3. Boissinot S, Chevret P, Furano AV (2000) L1 (LINE-1) retrotransposon evolution and amplification in recent human history. Mol Biol Evol 17(6):915–928.
4. Skowronski J, Fanning TG, Singer MF (1988) Unit-length line-1 transcripts in human teratocarcinoma cells. Mol Cell Biol 8(4):1385–1397.
5. Sheen FM, et al. (2000) Reading between the LINEs: Human genomic variation induced by LINE-1 retrotransposition. Genome Res 10(10):1496–1508.
6. Dewannieux M, Esnault C, Heidmann T (2003) LINE-mediated retrotransposition of marked Alu sequences. Nat Genet 35(1):41–48.
7. Hancks DC, Goodier JL, Mandal PK, Cheung LE, Kazazian HH, Jr (2011) Retrotransposition of marked SVA elements by human L1s in cultured cells. Hum Mol Genet 20(17):3386–3400.
8. Esnault C, Maestre J, Heidmann T (2000) Human LINE retrotransposons generate processed pseudogenes. Nat Genet 24(4):363–367.
9. Luan DD, Korman MH, Jakubczak JL, Eickbush TH (1993) Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: A mechanism for non-LTR retrotransposition. Cell 72(4):595–605.
10. Cost GJ, Feng Q, Jacquier A, Boeke JD (2002) Human L1 element target-primed reverse transcription in vitro. EMBO J 21(21):5899–5910.
11. Ostertag EM, Kazazian HH, Jr (2001) Twin priming: A proposed mechanism for the creation of inversions in L1 retrotransposition. Genome Res 11(12):2059–2065.
12. Goodier JL, Ostertag EM, Kazazian HH, Jr (2000) Transduction of 3′-flanking sequences is common in L1 retrotransposition. Hum Mol Genet 9(4):653–657.
13. Pickeral OK, Makałowski W, Boguski MS, Boeke JD (2000) Frequent human genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res 10(4):411–415.
14. Symer DE, et al. (2002) Human l1 retrotransposition is associated with genetic instability in vivo. Cell 110(3):327–338.
15. Iskow RC, et al. (2010) Natural mutagenesis of human genomes by endogenous retrotransposons. Cell 141(7):1253–1261.
16. Ewing AD, Kazazian HH, Jr (2010) High-throughput sequencing reveals extensive variation in human-specific L1 content in individual human genomes. Genome Res 20(9):1262–1270.
17. Huang CR, et al. (2010) Mobile interspersed repeats are major structural variants in the human genome. Cell 141(7):1171–1182.
18. Evrony GD, Lee E, Park PJ, Walsh CA (2016) Resolving rates of mutation in the brain using single-neuron genomics. eLife 5:e12966.
19. Shukla R, et al. (2013) Endogenous retrotransposition activates oncogenic pathways in hepatocellular carcinoma. Cell 153(1):101–111.
20. Ewing AD, Kazazian HH, Jr (2011) Whole-genome resequencing allows detection of many rare LINE-1 insertion alleles in humans. Genome Res 21(6):985–990.
21. Stewart C, et al.; 1000 Genomes Project (2011) A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet 7(8):e1002236.
22. Lee E, et al.; Cancer Genome Atlas Research Network (2012) Landscape of somatic retrotransposition in human cancers. Science 337(6097):967–971.
23. Tubio JM, et al.; ICGC Breast Cancer Group; ICGC Bone Cancer Group; ICGC Prostate Cancer Group (2014) Mobile DNA in cancer. Extensive transduction of nonrepetitive DNA mediated by L1 retrotransposition in cancer genomes. Science 345(6196):1251343.
24. Solyom S, et al. (2012) Extensive somatic L1 retrotransposition in colorectal tumors. Genome Res 22(12):2328–2338.
25. Ewing AD, et al. (2015) Widespread somatic L1 retrotransposition occurs early during gastrointestinal cancer evolution. Genome Res 25(10):1536–1545.
26. Doucet-O’Hare TT, et al. (2015) LINE-1 expression and retrotransposition in Barrett’s esophagus and esophageal carcinoma. Proc Natl Acad Sci USA 112(35):E4894–E4900.
27. Rodic N, et al. (2015) Retrotransposon insertions in the clonal evolution of pancreatic ductal adenocarcinoma. Nat Med 21(9):1060–1064.
28. Wheelan SJ, Scheifele LZ, Martínez-Murillo F, Irizarry RA, Boeke JD (2006) Transposon insertion site profiling chip (TIP-chip). Proc Natl Acad Sci USA 103(47):17632–17637.
29. Arnold C, Hodgson IJ (1991) Vectorette PCR: A novel approach to genomic walking. PCR Methods Appl 1(1):39–42.
30. Eggert H, Bergemann K, Saumweber H (1998) Molecular screening for P-element insertions in a large genomic region of Drosophila melanogaster using polymerase chain reaction mediated by the vectorette. Genetics 149(3):1427–1434.
31. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120.
32. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25.
33. Smit AFA, Hubley R, Green P (2010) RepeatMasker Open-3.0. Available at www.repeatmasker.org. Accessed January 9, 2017.
34. Quinlan AR, Hall IM (2010) BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842.
35. Jurka J (2000) Repbase update: A database and an electronic journal of repetitive elements. Trends Genet 16(9):418–420.
36. Schweikert C, Brown S, Tang Z, Smith PR, Hsu DF (2012) Combining multiple ChIP-seq peak detection systems using combinatorial fusion. BMC Genomics 13(Suppl 8):S12.
37. Rodic N, et al. (2014) Long interspersed element-1 protein expression is a hallmark of many human cancers. Am J Pathol 184(5):1280–1286.
38. Rahbari R, Badge RM (2016) Combining Amplification Typing of L1 Active Subfamilies (ATLAS) with high-throughput sequencing. Methods Mol Biol 1400:95–106.
39. Sanchez-Luque FJ, Richardson SR, Faulkner GJ (2016) Retrotransposon Capture Sequencing (RC-Seq): A targeted, high-throughput approach to resolve somatic L1 retrotransposition in humans. Methods Mol Biol 1400:47–77.
40. Upton KR, et al. (2015) Ubiquitous L1 mosaicism in hippocampal neurons. Cell 161(2):228–239.
41. Doucet TT, Kazazian HH, Jr (2016) Long interspersed element sequencing (L1-Seq): A method to identify somatic LINE-1 insertions in the human genome. Methods Mol Biol 1400:79–93.
42. Streva VA, et al. (2015) Sequencing, identification and mapping of primed L1 elements (SIMPLE) reveals significant variation in full length L1 elements between individuals. BMC Genomics 16:220.
43. Erwin JA, et al. (2016) L1-associated genomic regions are deleted in somatic cells of the healthy human brain. Nat Neurosci 19(12):1583–1591.
44. Mir AA, Philippe C, Cristofari G (2015) euL1db: The European database of L1HS retrotransposon insertions in humans. Nucleic Acids Res 43(Database issue):D43–D47.
45. Witherspoon DJ, et al. (2010) Mobile element scanning (ME-Scan) by targeted high-throughput sequencing. BMC Genomics 11:410.
46. Witherspoon DJ, et al. (2013) Mobile element scanning (ME-Scan) identifies thousands of novel Alu insertions in diverse human populations. Genome Res 23(7):1170–1181.
47. Sudmant PH, et al.; 1000 Genomes Project Consortium (2015) An integrated map of structural variation in 2,504 human genomes. Nature 526(7571):75–81.
48. Helman E, et al. (2014) Somatic retrotransposition in human cancer revealed by whole-genome and exome sequencing. Genome Res 24(7):1053–1063.
49. Evrony GD, et al. (2012) Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell 151(3):483–496.
