bioRxiv preprint doi: https://doi.org/10.1101/2020.03.31.014589; this version posted April 1, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
SMN1 copy-number and sequence variant analysis from next generation sequencing data
Daniel Lopez-Lopez1,2, Carlos Loucera1,2, Rosario Carmona1, Virginia Aquino1, Josefa Salgado3,4, Sara Pasalodos3,4, María Miranda4, Ángel Alonso3, Joaquín Dopazo1,2,5,6,*
1 Clinical Bioinformatics Area. Fundación Progreso y Salud (FPS). CDCA, Hospital Virgen del Rocio. 41013. Sevilla. Spain;
2 Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Hospital Virgen del Rocio. 41013. Sevilla. Spain;
3 Genomic Medicine Unit. Navarrabiomed. 31008 Pamplona, Navarra. Spain
4 Department of Genetics. Complejo Hospitalario de Navarra. 31008 Pamplona, Navarra. Spain
5 Bioinformatics in Rare Diseases (BiER). Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), FPS, Hospital Virgen del Rocío. 41013. Sevilla, Spain;
6 FPS/ELIXIR-es, Hospital Virgen del Rocío, Sevilla, 42013, Spain.
* Corresponding author
Abstract Spinal Muscular Atrophy (SMA) is a severe neuromuscular autosomal recessive disorder affecting 1/10,000 live births. Most SMA patients present homozygous deletion of SMN1, while the vast majority of SMA carriers present only a single SMN1 copy. The sequence similarity between SMN1 and SMN2, and the complexity of the SMN locus makes the estimation of the SMN1 copy-number by next generation sequencing (NGS) very difficult Here, we present SMAca, the first python tool to detect SMA carriers and estimate the absolute SMN1 copy- number using NGS data. Moreover, SMAca takes advantage of the knowledge of certain variants specific to SMN1 duplication to also identify silent carriers. This tool has been validated with a cohort of 326 samples from the Navarra 1000 Genomes project (NAGEN1000). SMAca was developed with a focus on execution speed and easy installation. This combination makes it especially suitable to be integrated into production NGS pipelines. Source code and documentation are available on Github at: www.github.com/babelomics/SMAca bioRxiv preprint doi: https://doi.org/10.1101/2020.03.31.014589; this version posted April 1, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
Introduction Spinal Muscular Atrophy (SMA; MIM 253300) is an autosomal recessive disorder caused by degeneration of alpha motor neurons in the anterior horn of the spinal cord, leading to hypotonia, muscular atrophy and weakness of proximal muscles, predominantly affecting the lower extremities (Lunn and Wang, 2008). In most populations SMA is caused by homozygous deletions or, less frequently, mutations or exons 1-6 deletions in the survival motor neuron gene (SMN1). The disease severity is determined mainly by a copy gene, SMN2. The more SMN2 copies present, the milder the phenotype usually is. Both genes, located on chromosome 5q13.2, can be distinguished by only five nucleotides (Wirth, et al., 2020). Most individuals have two copies of each SMN1 and SMN2, however due to the complex genomic structure, gene conversion and rearrangements occur quite frequently in SMN locus leading to copy-number variants (MacDonald, et al., 2014). The majority of SMA patients, have a SMN1 deletion or gene conversion of SMN1 into SMN2, which results in a homozygous loss of SMN1 exon 7 or exons 7 and 8. Establishing the SMN2 copy number is of importance for SMA patients due to the inverse correlation between disease severity and SMN2 copy number. SMA carriers are symptom-free and can be identified by the presence of only a single SMN1 exon 7 copy. About 5% of SMA carriers have two SMN1 copies on one chromosome and 0 copies on the other (2+0) being named “silent carriers”. Finally, a variant, referred to as SMN1/2∆7-8, contains one or two extra copies of SMN exons 1-6 of SMN1 or SMN2. This variant is often present in individuals with no, or only one, SMN2. Although it is frequently found (23%) in Spanish carriers and non-carries, its clinical significance is not yet completely clear (Calucho, et al., 2018). Frequency of SMA carriers in the population is around 1.7-2.1 % (Larson, et al., 2015; Su, et al., 2011), presenting most of them only a single SMN1 exon 7 copy.
At present, the gold standard genetic testing for SMA is multiplex ligation-dependent probe amplification (MLPA) of SMN1 and SMN2 although it can neither identify silent carriers (2+0) nor subtle mutations in SMN1 (false-negative rate of ∼5%). However, next-generation sequencing (NGS) technology is rapidly becoming a cost-effective approach for clinical testing (Boycott, et al., 2019). Despite the difficulties that the accurate determination of two genes almost identical inherent to short read technologies, some strategies to process NGS data have already been proposed to detect SMA carriers (Feng, et al., 2017; Larson, et al., 2015). However, to our knowledge, there are not freely available tools that could be derive the mutational status of SMA form primary massive sequencing data. Here, we present a python tool, SMAca, which is the first application that can detect SMA carriers and concomitantly estimate the absolute SMN1 copy-number form NGS data. Moreover, SMAca can exploit the knowledge variants specific to SMN1 duplication to identify the silent carriers, following the recommendations for SMA carrier testing by the American College of Medical Genetics and Genomics (Prior, et al., 2011).
bioRxiv preprint doi: https://doi.org/10.1101/2020.03.31.014589; this version posted April 1, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.
Materials and Methods
NGS data processing Raw FASTQ files were processed following a standard NGS pipeline. Briefly, after filtering out low quality reads with fastp v0.20.0 (Chen, et al., 2018), reads were aligned with BWA-MEM v0.7.16a (Li and Durbin, 2009) against the human reference genome GRCh37/h19. Potential PCR duplicates were marked with Picard v2.17.3 (http://broadinstitute.github.io/picard/). Finally, the whole set of BAM files were analyzed with SMAca in a single batch.
SMN1 copy-number estimation The availability of a batch of samples allows a more accurate estimation of SMN1 copy-number. SMAca first calculates the raw proportion of SMN1 reads over the total number of reads covering SMN1 and SMN2 at three specific gene positions (denoted as a, b and c) for each sample (Table 1).
Table 1. SMN1 and SMN2 different nucleotides. Positions in SMN1 (and the analogous positions in SMN2) used to calculate the raw proportion of SMN1 reads(D1ij) over the total number of reads covering SMN1 and SMN2 (D1ij+D2ij) Position SMN1 SMN2 a chr5:70247724 chr5:69372304 b chr5:70247773 chr5:69372353 c chr5:70247921 chr5:69372501
These positions correspond to single nucleotide differences between SMN1 and SMN2. Raw values are then scaled with respect to 20 control genes (Table 2) previously described to have consistent average coverage relative to SMN1 and SNM2 (Larson, et al., 2015). Additionally, two genetic variants that have been associated with duplication events in SMN1 are also screened and reported (Luo, et al., 2014).
Ȩ Table 2. Control genes. List of genes used to calculate the scale factor ʚ$ʛ and the scaled proportion of SMN1 reads ʚ$%ʛ: ACAD9 FASTKD2 ITGA6 NTRK1 SIL1
ATR FOXN1 IVD PTEN SLC22A5
CYP11B1 HEXB LMNA RAB3GAP1 SLC35D1
EDNRB IQCB1 LRPPRC RAPSN STIM1
In particular, the relative coverage of SMN1 and SMN2 with respect to each control gene is calculated: &$ $ͥ $ͦ / &$, were $ͥand $ͦare the average coverage for the whole genes
SMN1 and SMN2, and &$ is the average coverage for the control gene k in the ith sample. Then, the scale factor $ ∑&Ͱͥ &$/ &