Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
Predicting human genes susceptible to genomic instability associated with Alu/Alu-mediated rearrangements
Key words: genome instability, Alu, genomic rearrangement, MMBIR, bioinformatics, machine- learning
Xiaofei Song1, Christine R. Beck1, Renqian Du1, Ian M. Campbell1, Zeynep Coban-Akdemir1, Shen Gu1,
Amy M. Breman1,2, Pawel Stankiewicz1,2, Grzegorz Ira1, Chad A. Shaw1,2, James R. Lupski1,3,4,5*
1. Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
2. Baylor Genetics, Houston, TX 77021, USA
3. Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
4. Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
5. Texas Children’s Hospital, Houston, TX 77030, USA
*Corresponding author:
James R. Lupski, M.D., Ph.D., D.Sc. (hon), FAAP, FACMG, FANA, FAAAS, FAAS
Cullen Professor, Department of Molecular and Human Genetics and Professor of Pediatrics
Member, Human Genome Sequencing Center (HGSC) Baylor College of Medicine
One Baylor Plaza
Houston, Texas 77030
E-mail: [email protected]
Phone: (713) 798-6530
Fax: (713) 798-5073 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
Abstract
Alu elements, the short interspersed element numbering >1 million copies per human genome, can mediate the formation of copy number variants (CNVs) between substrate pairs. These Alu/Alu-mediated rearrangements (AAMR) can result in pathogenic variants that cause diseases. To investigate the impact of AAMR on gene variation and human health, we first characterized Alus that are involved in mediating
CNVs (CNV-Alus) and observed that these Alus tend to be evolutionarily younger. We then computationally generated, with the assistance of a supercomputer, a test dataset consisting of 78 million
Alu pairs and predicted ~18% of them are potentially susceptible to AAMR. We further determined the relative risk of AAMR in 12,074 OMIM genes using the count of predicted CNV-Alu pairs, and experimentally validated the predictions with 89 samples selected by correlating predicted hotspots with a database of CNVs identified by clinical chromosomal microarrays (CMA) on the genomes of ~54,000 subjects. We fine-mapped 47 duplications, 40 deletions and 2 complex rearrangements and examined a total of 52 breakpoint junctions of simple CNVs. Overall, 94% of the candidate breakpoints were at least partially Alu mediated. We successfully predicted all (100%) of Alu pairs that mediated deletions (n=21) and achieved an 87% positive predictive value overall when including AAMR-generated deletions and duplications. We provided a tool, AluAluCNVpredictor, for assessing AAMR hotspots and their role in human disease. These results demonstrate the utility of our predictive model and provide insights into the genomic features and molecular mechanisms underlying AAMR.
2 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
INTRODUCTION
Alu elements are repetitive sequences originally described by re-association kinetics (Schmid and Jelinek
1982). The term ‘Alu’ derives from these sequences sharing a cut site for the restriction endonuclease
AluI (Houck et al. 1979). Alu repetitive sequences comprise ~11% of the human genome, and number more than one million copies per haploid genome (Lander et al. 2001). They belong to the primate- specific short interspersed element (SINE) family of mobile DNA. Alu elements can be grouped into distinct subfamilies based on sequence divergence. AluJs are the oldest Alu dimeric subfamily; AluS are the most numerous, and are younger than AluJ, while AluYs are the youngest of this family of repetitive elements (Shen et al. 1991; Batzer and Deininger 2002). Monomeric Alus also exist in the human genome, such as FRAM and FLAM elements (Quentin 1992). Full-length Alu elements are ~300 bp in size and consist of two monomeric repeats derived from 7SL RNA, an adenosine-rich connector, and a poly(A) tail. The left monomer contains an internal RNA polymerase III promoter, A Box and B Box and the right monomer has an A′ box (Fig. 1A) (Deininger et al. 2003; Beck et al. 2011). Alu sequences are often found at the endpoints of segmental duplications and the breakpoints of genomic rearrangements, and are associated with genome instability (Bailey et al. 2003; Shaw and Lupski 2005; Vissers et al. 2009). Copy number variants (CNVs) differ from a normal diploid state by deletion or amplification of genomic segments. When a pair of Alus mediates a genomic rearrangement (i.e. Alu/Alu-mediated rearrangement,
AAMR), a chimeric Alu hybrid will form at the junction (Fig. 1B). Microhomologies are the sequences surrounding the breakpoint junctions that are identical between the CNV-Alu elements within a pair. The first observed AAMR event was described 30 years ago in a patient with hypercholesterolemia and a 7.8 kb deletion of LDLR (Lehrman et al. 1987); similar AAMR-mediated exonic events have been elucidated during the decades that followed in association with different diseases including spastic paraplegia 4
(MIM 182601) (Boone et al. 2011; Boone et al. 2014), Fanconi anemia (MIM 227650) (Flynn et al. 2014), and von Hippel-Lindau syndrome (MIM 193300) (Franke et al. 2009). Alu-associated CNVs have been estimated to cause ~0.3% of human genetic diseases (Deininger and Batzer 1999). In spite of this fairly large potential impact of AAMR events on gene variation and human health, to date, <300 independent
3 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
events have been experimentally characterized at nucleotide-level resolution (Supplemental Table S1).
Array comparative genomic hybridization (aCGH) is a robust experimental procedure for CNV detection, including CNVs at the exonic level; however, array techniques have non-nucleotide-level breakpoint junction resolving capability. Although the combination of aCGH and PCR can achieve breakpoint junction sequence resolution, such an approach is not currently scalable. Thus, it is impractical and costly to map the breakpoints for all detected disease associated rare CNVs using this approach. Moreover, many studies utilizing genome-wide variant assays, including whole genome sequencing (WGS) and whole exome sequencing (WES), are limited by sequence coverage and alignment difficulties inherent to the relatively short length of sequencing reads and high degree of Alu sequence identity (Treangen and
Salzberg 2011).
CNVs and other structural variants (SV) can result from distinct molecular mechanisms including
DNA recombination-, DNA repair- and DNA replication-associated processes (Carvalho and Lupski 2016) and lead to human diseases often termed genomic disorders (Lupski 1998). Previously, repeated sequences (e.g. paralogous genes/pseudogenes, low-copy repeats (LCRs), etc.) and repetitive elements
(e.g. SINEs, long interspersed nuclear elements, LINEs) that are involved in the formation of genomic rearrangements have been posited to undergo non-allelic homologous recombination (NAHR). For example, duplications and deletions of the same genomic segment can be flanked by similar human endogenous retroviral sequences (HERVs) (Sun et al. 2000; Campbell et al. 2014), LINEs (Higashimoto et al. 2013; Startek et al. 2015), or LCRs (Lupski 1998; Sharp et al. 2005). The PRDM9 binding motif is a cis-acting sequence motif associated with allelic homologous recombination (AHR) and NAHR hotspots
(Lupski 2004; Lindsay et al. 2006; Myers and McCarroll 2006; Berg et al. 2010; Dittwald et al. 2013).
PRDM9 targeting sites are associated with about 40% of recombination hotspots ascertained through studies of historical recombinants (Myers et al. 2008; Webb et al. 2008). Deletions mediated by LCRs and
HERVs in human genomes are enriched for PRDM9 binding motifs proximal to the junctions, further implicating NAHR as the mechanism for their formation (Repping et al. 2002; Campbell et al. 2014).
Classically, NAHR has been proposed as the mechanism underlying AAMR events (Cordaux and Batzer
4 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
2009); however, the minimal processing segment required for NAHR is generally thought to be longer than an individual Alu sequence (Reiter et al. 1998). It has recently been proposed that Alu repetitive elements may participate in aberrant rearrangement of the genome by mediating template switching (TS) during replication-based repair mechanisms (Boone et al. 2011; Boone et al. 2014; Gu et al. 2015), or by undergoing SSA (single strand annealing) or MMEJ (microhomology-mediated end joining) mechanisms given that they provide multiple regions of microhomology (Elliott et al. 2005; Morales et al. 2015).
RESULTS
To better understand the mechanism(s) of AAMR and potentially identify human genes that are prone to instability due to these events, we conducted a machine-learning based analysis of Alu and the human genome reference; the steps of which are described below and summarized in Fig. 2.
Collection and characteristics of CNV-Alu pairs
We define the Alu pairs involved in AAMR events as CNV-Alus, and all the other non CNV-Alus as Ctrl-
Alus (Fig. 1B). To build a classifier for predicting Alu pairs that may be more likely to mediate genomic rearrangements, we utilized a positive training dataset comprised of 219 CNV-Alu pairs, 218 of which were collected from deletions published in 58 articles and 1 is currently unpublished (Supplemental Table
S1). Each breakpoint of the 219 deletions has been mapped at nucleotide-level resolution within the resultant chimeric Alu in the original studies enabling determination of microhomology at breakpoint junction of AAMR. The deletions vary in size from ~800 bp to ~4 Mb, and 75% of the deletions are shorter than 57 kb (Supplemental Fig. S1A). We have determined each of the Alu elements involved and their genomic coordinates, orientation and the information of subfamily using the RepeatMasker track of the UCSC Genome Browser (Supplemental Table S1) (Kent et al. 2002).
To determine whether CNV-Alu pairs are enriched for a specific subfamily, we first calculated the relative frequency of each subfamily composition in CNV-Alu pairs. Considering the different frequency of Alu subfamilies, we further calculated the expectation of the relative frequency of each
5 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
composition in Ctrl-Alu pairs that were documented in the RepeatMasker database (Smit et al. 2013-
2015). We found that AAMR events are more likely to be mediated by younger Alu elements, such as
AluS-AluY and AluY-AluY (P < 0.001, one-tailed binominal test, Fig. 3A). This could be potentially explained by a higher possibility of aligning better with each other due to a less divergent sequence in the younger families (Batzer and Deininger 2002). Of note, younger subfamilies are more active in retrotransposition assays (Bennett et al. 2008; Konkel et al. 2015). Thus, one potential explanation for younger Alus being involved in AAMR is that open chromatin exists over active Alus during transcription, which could allow them to function as better substrates for DNA repair.
We next analyzed the properties and characteristics of AAMR breakpoint junctions. The majority of AAMR microhomologies are shorter than 25 bp (Fig. 3B), and have a higher GC content than that of whole CNV-Alu element sequences (P < 0.0001, one-tailed t-test, Fig. 3C). The higher GC percentage might facilitate strand annealing by a stronger hydrogen bonding than A-T. We mapped the sequence of each breakpoint microhomology generated during AAMR formation to an Alu consensus sequence (Fig.
3D), and noted enrichment downstream of RNA pol III promoters (A Box, B Box and A′ Box). The location of breakpoint junctions in 18 rearrangements with at least one Alu element involved was previously described by Rudiger et al (Rudiger et al. 1995); a 26 bp core sequence was at or close to the breakpoint junctions (Fig. 3D-H, shaded light blue). With the 219 AAMR events, we further showed that more breakpoint junctions (n = 62) are located at or near the first 26 bp core sequence. There is no significant enrichment of a specific subfamily in these 62 events when compared to all of the 219 junctions (P > 0.05, one-tailed binomial test, Supplemental Fig. S2).
Of note, the microhomology distribution we observed is consistent with Alu/Alu-mediated evolutionary deletions previously identified by comparing human and chimpanzee genomes (Fig. 3E)
(Sen et al. 2006; Han et al. 2007). To infer the underlying mechanism for AAMR, we adapted a yeast template switching assay to examine human Alu pairs that occur more often in AAMR events, where
AluS-AluS and AluS-AluY are most numerous; the elements were chosen from experimentally determined events at the SPAST locus (Boone et al. 2014). We induced a nick downstream of a replication origin to
6 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
generate a single-ended, double-strand DNA break (seDSB), inducing a substrate that is repaired via TS during Break-Induced Replication (BIR). The previously published 74 events displayed an enrichment pattern of breakpoint junctions (Supplemental Fig. S3A, C) (Mayle et al. 2015); this pattern was robust, and was supported by an additional 429 events (Supplemental Fig. S3C-E). To test if this pattern is peculiar to the construct used, we made a strain with the same distal Alu and a different proximal Alu element (AluSx-AluY, 88.4% similar, Fig. 3F). Despite the distinctive pattern of available microhomologies (Supplemental Fig. S3A-B), the experimentally detected junctions from the two yeast assays differ from the observed pattern in human events, and display preferences for similar regions (Fig.
3G-H). The yeast junctions favor blocks of nearly identical sequence that are minimally interrupted by single base pair mismatches, therefore representing the largest sequence homeology (highly similar but not identical sequences) between each pair. These data support the contention that AAMR can be mediated by TS and potentially occurs by MMBIR. The divergent frequency distribution patterns observed between human and yeast experimental data might be due to the distribution of homeology between the two Alu pairs in the template switching assay and species differences.
Predicting CNV-Alu pairs
The skewed distribution of the properties of AAMR breakpoint junctions discussed above indicates that
AAMR events could be generated by Alu pairs enriched for particular features. This finding motivated characterization of a group of factors that can potentially distinguish CNV-Alu pairs from their genomic milieu. We investigated whether features of an individual Alu, and its surrounding genomic region may potentially influence genomic instability and the choice of which elements serve as templates for repair.
There are >1.1 million Alu elements in the human genome. If one also considers interchromosomal recombination events, > 6 10ͥͥ Ctrl-Alu pairs exist in the human genome. As the number of CNV-Alu pairs is grossly unequal to that of Ctrl-Alu pairs, factors were selected before utilization in our machine learning model to abrogate concerns of overfitting the model.
7 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
The distribution across the genome of the 219 events in our positive training set was biased because of the limited number of studies and the focus on several disease genes (Supplemental Fig. S1B-
C). To maximize the knowledge that one could potentially gain from these events, we treated each deletion as an individual locus, and have characterized each locus independently.
Generating a control, i.e. a negative training dataset, is challenging because the absence of evidence for an Alu to be involved in CNV formation does not mean that the element cannot be used in a heretofore-uncharacterized event. To overcome this issue, we constructed distinct control datasets for analyzing individual features and genomic features (Supplemental Fig. S4; see methods). To decrease the possibility of choosing false negative CNV-Alu pairs, 1000 Ctrl-Alu pairs with the same orientation as each other (either plus or minus) in each region were randomly collected as a negative control dataset.
Since the local environmental genomic features can be variable simply because of the relative position within the gene, the negative control datasets for these features were generated analogously to the collected CNV-Alu pairs, which are intragenic exonic CNVs in/near disease-associated genes. Of note, the CNV sizes for these experimentally determined AAMR events tend to be less than 250 kb
(Supplemental Fig. S1A). There are only 27 deletions spanning greater than 250 kb and none were between 250 kb and 500 kb in size. We chose ‘< 250 kb’ as a cutoff to include the majority of the known events and generate a comparable test dataset to the training data. For each of the 192 CNV-Alu pairs that are <250 kb apart, we selected 1000 directly oriented Ctrl-Alu pairs that could delete at least one exon, and have the same distance between them as the CNV-Alu pair.
As it was not clear whether each Alu element within a pair would contribute equally to AAMR events, we calculated the pairwise value for each feature by computing the minimum, mean and maximum values for each pair; and also considered the difference between the two elements for genomic features. We then characterized each pairwise feature for Alu pairs in the training dataset. At each locus, we plotted the Ctrl-Alu pair distribution as a boxplot, and marked the CNV-Alu pair as a red dot. We also calculated a P-value using a Monte Carlo approach (see methods). For example, at most loci, the red dots showing the value of pairwise alignment of CNV-Alu pairs are located above the medians of Ctrl-Alu
8 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
pairs (Fig. 4A). The distribution of P-values across different loci is plotted in Fig. 4B. The geometric mean of P-values for the feature of pairwise alignment is 0.102. Therefore, CNV-Alu pairs in the majority of loci have a higher sequence similarity with each other than the control pairs. We next examined the region of 500 bp up and downstream of each Alu element to identify putative PRDM9 binding motifs that match >85% to a position weight matrix (PWM) (Campbell et al. 2014). There is no significant difference between CNV-Alu pairs and their locus-specific controls with respect to the number of surrounding
PRDM9 binding motifs (Fig. 4C-D, Table 1). We summarized the Monte Carlo based P-values for this and other tested parameters, including both individual Alu sequence features and genomic features surrounding CNV-Alu pairs, in Table 1 and Supplemental Fig. S5. After conducting these analyses and combining results from the plots and P-values, we noted that not all the features distinguish CNV-Alu pairs from control elements and that distinct pairwise values also perform differently. Therefore, we utilized the following parameters to train our model for prediction: minimum Alu element length and Alu density, mean GC percentage, sequence similarity of A Boxes/B Boxes, poly(A) tail length, replication timing, and the pairwise alignment score (marked with asterisks in Table 1).
We applied these features to train a model using quadratic discriminant analysis (QDA) with the
CNV-Alu pairs and 10,000 control pairs. The prior probability was set to 0.3, which provided the highest sensitivity in testing the known 192 CNV-Alu pairs (sensitivity = 0.83, 10-fold cross validation,
Supplemental Fig. S6). Our test dataset, consisting of ~78 million Ctrl-Alu pairs, was built to enrich potential intragenic exonic events but efficiently decrease the computational burden by choosing Alu pairs that have the same orientation, span at least one exon, and are located within 250 kb of each other. We utilized the Blue Gene supercomputer at Rice University to calculate the selected featured for each pair, as calculating pairwise alignment scores of the 78 million pairs was computationally intensive. These analyses predicted 17.6% of the Ctrl-Alu pairs to be more likely to mediate CNVs.
Of note, the features utilized in the QDA model training share some degree of co-dependence.
For example, evolutionarily young Alus tend to have a higher sequence similarity and be more active in retrotransposition (Bennett et al. 2008). The QDA model explicitly accounts for co-dependence
9 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
(covariance) between variables in making its predictions; in this way, the influence of correlation is statistically controlled and accounted for in the model. Any unwanted or erroneous impact of covariance on prediction should be minor. In addition, to better understand the level of feature co-dependence, we measured the impact of feature co-dependence by training a series of models with all selected features as well as removing one feature at a time and evaluating the performance of each model. The model that was trained with all selected features performed best in terms of having the lowest error rate as shown in Fig.
5A; however, the feature of Alu length seems to be redundant based on the same error rate as the overall model. The feature of Alu density contributed tremendously to the prediction as the error rate increased dramatically when this feature was removed; removing the other features could increase the error rate to some degree as well.
Predicting hotspot genes for AAMR events
We developed a risk score to estimate the relative effect of AAMR events at a gene level. We first tested the correlation between the count of predicted CNV-Alu pairs from our QDA analysis and the count of reported CNV-Alu pairs in known AAMR hotspot genes (Supplemental Table S1); they are significantly correlated (linear regression, P < 0.001). We then utilized this model to predict the risk score for 12,074 human genes that are collected in the OMIM database (https://www.omim.org/) with a gene MIM number and have complete information for each involved Alu element (Supplemental Table S2). We focused on a disease-related gene set because we only chose the potentially pathogenic Alu pairs in the test dataset. In addition, we could implement validation experiments using a clinical microarray database, which mainly interrogates disease-related genes. Although the prediction might not be as accurate as the OMIM genes, we also performed the same gene-level prediction on the total 23,637 RefSeq genes with available tested
Alu pairs (Supplemental Table S2).
An AAMR risk score of 1 indicates a prediction of 10 (101) CNVs intersecting this gene were expected. Using a score >1 as a cutoff determined a subset of 329/12,074 OMIM genes, or approximately
3% of the total. Given the left skewed distribution of AAMR scores (Fig. 5B), we regarded the AAMR
10 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
risk score >1 as an extremely strict cut off. In addition, we showed the distribution of the genes that have been affected by AAMR more than once (n=133, Supplemental Table S1, Fig. 5C). These genes have a median score of 0.601 and enrich surrounding 0.6; we suggest a score of 0.6 as a loose ‘cut off’ for at-risk genes. Genes with a score greater than 0.6 might be more susceptible for AAMR.
Furthermore, Alu density alone, a parameter previously suggested to influence AAMR mediated genomic instability (Boone et al. 2014), did not achieve the same performance in prediction compared to the model trained with all the selected features (P < 0.001, chi-square test). To clarify the potential effect of gene size, we performed two analyses: first, we fit a model treating count divided by gene size as a predictor and compared this model to our reported model; a goodness of fit test shows that these two models are not significantly different (chi-squared P = 0.331). This result suggests that accounting for gene size does not alter the predictive power of the model. As a second analysis, we created a multiple regression model that examined predicted count and gene size as independent contributors to the model
(Supplemental Table S3). The results analyzing this model show that although count is a significant variable, there is no residual variation significantly associated with gene size. Taken together, we conclude that the gene size does not significantly contribute to our results. Therefore, we trained this gene-level prediction only using the count of predicted CNV-Alu pairs.
Validating the computational prediction
We tested our pairwise prediction results against a list of 663 human Alu pairs that recombined during evolution as the corresponding positions in the chimpanzee genome only contain one chimeric Alu element at the breakpoint junction (Han et al. 2007). However, only one Alu pair was collected in our test dataset, and was not predicted as a likely CNV-Alu pair. The narrow overlap of the two datasets might be due to the pathogenicity and constraint on AAMR events between predicted pairs in humans that could delete at least one exon.
We then experimentally tested our prediction by cross-referencing the list of 329 predicted hotspot genes (risk score >1, Supplemental Table S2) with an anonymized database of exon-targeted
11 Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press
clinically applied chromosomal microarrays, CMAs, performed in ~54,000 individuals. We selected subject samples that have 1 CNV that intersect with genes in this hotspot gene list. To investigate any enrichment of AAMR versus variants mediated by other repeats/repetitive elements, we chose genes with