Predicting Human Genes Susceptible to Genomic Instability Associated with Alu/Alu-Mediated Rearrangements
Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press

Key words: genome instability, Alu, genomic rearrangement, MMBIR, bioinformatics, machine-learning

Xiaofei Song1, Christine R. Beck1, Renqian Du1, Ian M. Campbell1, Zeynep Coban-Akdemir1, Shen Gu1, Amy M. Breman1,2, Pawel Stankiewicz1,2, Grzegorz Ira1, Chad A. Shaw1,2, James R. Lupski1,3,4,5*

1. Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
2. Baylor Genetics, Houston, TX 77021, USA
3. Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
4. Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
5. Texas Children’s Hospital, Houston, TX 77030, USA

*Corresponding author:
James R. Lupski, M.D., Ph.D., D.Sc. (hon), FAAP, FACMG, FANA, FAAAS, FAAS
Cullen Professor, Department of Molecular and Human Genetics and Professor of Pediatrics
Member, Human Genome Sequencing Center (HGSC)
Baylor College of Medicine
One Baylor Plaza
Houston, Texas 77030
E-mail: [email protected]
Phone: (713) 798-6530
Fax: (713) 798-5073

Abstract

Alu elements, the short interspersed element numbering >1 million copies per human genome, can mediate the formation of copy number variants (CNVs) between substrate pairs. These Alu/Alu-mediated rearrangements (AAMR) can result in pathogenic variants that cause diseases. To investigate the impact of AAMR on gene variation and human health, we first characterized Alus that are involved in mediating CNVs (CNV-Alus) and observed that these Alus tend to be evolutionarily younger.
We then computationally generated, with the assistance of a supercomputer, a test dataset consisting of 78 million Alu pairs and predicted ~18% of them are potentially susceptible to AAMR. We further determined the relative risk of AAMR in 12,074 OMIM genes using the count of predicted CNV-Alu pairs, and experimentally validated the predictions with 89 samples selected by correlating predicted hotspots with a database of CNVs identified by clinical chromosomal microarrays (CMA) on the genomes of ~54,000 subjects. We fine-mapped 47 duplications, 40 deletions and 2 complex rearrangements and examined a total of 52 breakpoint junctions of simple CNVs. Overall, 94% of the candidate breakpoints were at least partially Alu mediated. We successfully predicted all (100%) of Alu pairs that mediated deletions (n=21) and achieved an 87% positive predictive value overall when including AAMR-generated deletions and duplications. We provided a tool, AluAluCNVpredictor, for assessing AAMR hotspots and their role in human disease. These results demonstrate the utility of our predictive model and provide insights into the genomic features and molecular mechanisms underlying AAMR.

INTRODUCTION

Alu elements are repetitive sequences originally described by re-association kinetics (Schmid and Jelinek 1982). The term ‘Alu’ derives from these sequences sharing a cut site for the restriction endonuclease AluI (Houck et al. 1979). Alu repetitive sequences comprise ~11% of the human genome, and number more than one million copies per haploid genome (Lander et al. 2001). They belong to the primate-specific short interspersed element (SINE) family of mobile DNA. Alu elements can be grouped into distinct subfamilies based on sequence divergence.
AluJs are the oldest Alu dimeric subfamily; AluS are the most numerous, and are younger than AluJ, while AluYs are the youngest of this family of repetitive elements (Shen et al. 1991; Batzer and Deininger 2002). Monomeric Alus also exist in the human genome, such as FRAM and FLAM elements (Quentin 1992). Full-length Alu elements are ~300 bp in size and consist of two monomeric repeats derived from 7SL RNA, an adenosine-rich connector, and a poly(A) tail. The left monomer contains an internal RNA polymerase III promoter, A Box and B Box and the right monomer has an A′ box (Fig. 1A) (Deininger et al. 2003; Beck et al. 2011). Alu sequences are often found at the endpoints of segmental duplications and the breakpoints of genomic rearrangements, and are associated with genome instability (Bailey et al. 2003; Shaw and Lupski 2005; Vissers et al. 2009). Copy number variants (CNVs) differ from a normal diploid state by deletion or amplification of genomic segments. When a pair of Alus mediates a genomic rearrangement (i.e. Alu/Alu-mediated rearrangement, AAMR), a chimeric Alu hybrid will form at the junction (Fig. 1B). Microhomologies are the sequences surrounding the breakpoint junctions that are identical between the CNV-Alu elements within a pair. The first observed AAMR event was described 30 years ago in a patient with hypercholesterolemia and a 7.8 kb deletion of LDLR (Lehrman et al. 1987); similar AAMR-mediated exonic events have been elucidated during the decades that followed in association with different diseases including spastic paraplegia 4 (MIM 182601) (Boone et al. 2011; Boone et al. 2014), Fanconi anemia (MIM 227650) (Flynn et al. 2014), and von Hippel-Lindau syndrome (MIM 193300) (Franke et al. 2009). Alu-associated CNVs have been estimated to cause ~0.3% of human genetic diseases (Deininger and Batzer 1999). 
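To make the junction terminology concrete, the following minimal Python sketch treats two Alu elements as pre-aligned strings and reports the microhomology around a candidate breakpoint together with the resulting chimeric Alu. The function names (`microhomology_span`, `chimeric_alu`), the toy sequences, and the assumption that the two elements are already aligned to equal length (with `-` marking gaps) are illustrative only; they are not part of the published analysis or of AluAluCNVpredictor.

```python
from typing import Tuple


def microhomology_span(alu_a: str, alu_b: str, junction: int) -> Tuple[int, int]:
    """Return the half-open interval [start, end) of aligned positions around
    `junction` at which the two elements are identical (gaps break the run)."""
    if len(alu_a) != len(alu_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not 0 <= junction <= len(alu_a):
        raise ValueError("junction lies outside the alignment")

    # Extend leftwards while the aligned bases match.
    start = junction
    while start > 0 and alu_a[start - 1] == alu_b[start - 1] != "-":
        start -= 1

    # Extend rightwards while the aligned bases match.
    end = junction
    while end < len(alu_a) and alu_a[end] == alu_b[end] != "-":
        end += 1

    return start, end


def chimeric_alu(alu_a: str, alu_b: str, junction: int) -> str:
    """Join the 5' part of one element to the 3' part of the other at
    `junction`, dropping alignment gaps (hypothetical hybrid model)."""
    return (alu_a[:junction] + alu_b[junction:]).replace("-", "")


# Toy aligned sequences (not real Alu sequence); they differ at positions 2 and 10.
toy_a = "ACGTTGCAAGGCT"
toy_b = "ACCTTGCAAGTCT"

start, end = microhomology_span(toy_a, toy_b, junction=6)
microhomology = toy_a[start:end]
hybrid = chimeric_alu(toy_a, toy_b, junction=6)
```

Because every position inside the microhomology is identical in both elements, a switch between the two elements anywhere within that interval yields the same chimeric sequence, so a breakpoint can be localized only to the microhomology and not to a single base.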
In spite of this fairly large potential impact of AAMR events on gene variation and human health, to date, <300 independent events have been experimentally characterized at nucleotide-level resolution (Supplemental Table S1). Array comparative genomic hybridization (aCGH) is a robust experimental procedure for CNV detection, including CNVs at the exonic level; however, array techniques cannot resolve breakpoint junctions at the nucleotide level. Although the combination of aCGH and PCR can achieve breakpoint junction sequence resolution, such an approach is not currently scalable. Thus, it is impractical and costly to map the breakpoints for all detected disease-associated rare CNVs using this approach. Moreover, many studies utilizing genome-wide variant assays, including whole genome sequencing (WGS) and whole exome sequencing (WES), are limited by sequence coverage and alignment difficulties inherent to the relatively short length of sequencing reads and high degree of Alu sequence identity (Treangen and Salzberg 2011).

CNVs and other structural variants (SV) can result from distinct molecular mechanisms including DNA recombination-, DNA repair- and DNA replication-associated processes (Carvalho and Lupski 2016) and lead to human diseases often termed genomic disorders (Lupski 1998). Previously, repeated sequences (e.g. paralogous genes/pseudogenes, low-copy repeats (LCRs), etc.) and repetitive elements (e.g. SINEs, long interspersed nuclear elements, LINEs) that are involved in the formation of genomic rearrangements have been posited to undergo non-allelic homologous recombination (NAHR). For example, duplications and deletions of the same genomic segment can be flanked by similar human endogenous retroviral sequences (HERVs) (Sun et al. 2000; Campbell et al. 2014), LINEs (Higashimoto et al. 2013; Startek et al.
2015), or LCRs (Lupski 1998; Sharp et al. 2005). The PRDM9 binding motif is a cis-acting sequence motif associated with allelic homologous recombination (AHR) and NAHR hotspots (Lupski 2004; Lindsay et al. 2006; Myers and McCarroll 2006; Berg et al. 2010; Dittwald et al. 2013). PRDM9 targeting sites are associated with about 40% of recombination hotspots ascertained through studies of historical recombinants (Myers et al. 2008; Webb et al. 2008). Deletions mediated by LCRs and HERVs in human genomes are enriched for PRDM9 binding motifs proximal to the junctions, further implicating NAHR as the mechanism for their formation (Repping et al. 2002; Campbell et al. 2014). Classically, NAHR has been proposed as the mechanism underlying AAMR events (Cordaux and Batzer 2009); however, the minimal processing segment required for NAHR is generally thought to be longer than an individual Alu sequence (Reiter et al. 1998). It has recently been proposed that Alu repetitive elements may participate in aberrant rearrangement of the genome by mediating template switching (TS) during replication-based repair mechanisms (Boone et al. 2011; Boone et al. 2014; Gu et al. 2015), or by undergoing SSA (single strand annealing) or MMEJ (microhomology-mediated end joining) mechanisms given that they provide multiple regions of microhomology (Elliott et al. 2005; Morales et al. 2015).

RESULTS

To better understand the mechanism(s) of AAMR and potentially identify human genes that are prone to instability due to these events, we conducted a machine-learning based analysis of Alu and the human genome reference; the steps of which are described below and summarized in Fig. 2.

Collection and characteristics of CNV-Alu pairs

We define the Alu pairs involved in AAMR events as CNV-Alus, and all the other non CNV-Alus as Ctrl-Alus (Fig. 1B).
To build a classifier for predicting Alu pairs that may be more likely to mediate genomic rearrangements, we utilized a positive training dataset comprised of 219 CNV-Alu