VarAFT: a variant annotation and filtration system for human next generation sequencing data
To cite this version:
Jean-Pierre Desvignes, Marc Bartoli, Valérie Delague, Martin Krahn, Morgane Miltgen, et al. VarAFT: a variant annotation and filtration system for human next generation sequencing data. Nucleic Acids Research, Oxford University Press, 2018, 46 (W1), pp.W545-W553. 10.1093/nar/gky471. hal-01852493

HAL Id: hal-01852493
https://hal-amu.archives-ouvertes.fr/hal-01852493
Submitted on 2 Aug 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Published online 31 May 2018
Nucleic Acids Research, 2018, Vol. 46, Web Server issue W545–W553
doi: 10.1093/nar/gky471

VarAFT: a variant annotation and filtration system for human next generation sequencing data

Jean-Pierre Desvignes1, Marc Bartoli1, Valérie Delague1, Martin Krahn1,2, Morgane Miltgen1, Christophe Béroud1,2,* and David Salgado1,*

1Aix Marseille Univ, INSERM, MMG, 13005, Marseille, France and 2APHM, Hôpital d’Enfants de la Timone, Département de Génétique Médicale et de Biologie Cellulaire, 13385 Marseille, France

Received January 29, 2018; Revised May 11, 2018; Editorial Decision May 14, 2018; Accepted May 16, 2018

*To whom correspondence should be addressed. David Salgado. Tel: +33 491324884; Fax: +33 491804319; Email: [email protected]
Correspondence may also be addressed to Christophe Béroud. Tel: +33 491324488; Fax: +33 491804319; Email: [email protected]

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

ABSTRACT

With the rapidly developing high-throughput sequencing technologies known as next generation sequencing or NGS, our approach to gene hunting and diagnosis has drastically changed. In <10 years, these technologies have moved from gene panel to whole genome sequencing and from an exclusively research context to clinical practice. Today, the limit is not the sequencing of one, many or all genes but rather the data analysis. Consequently, the challenge is to rapidly and efficiently identify disease-causing mutations within millions of variants. To do so, we developed the VarAFT software to annotate and pinpoint human disease-causing mutations through access to multiple layers of information. VarAFT was designed both for research and clinical contexts and is accessible to all scientists, regardless of bioinformatics training. Data from multiple samples may be combined to address all Mendelian inheritance modes, cancers or population genetics. Optimized filtration parameters can be stored and re-applied to large datasets. In addition to classical annotations from dbNSFP, VarAFT contains unique features at the disease (OMIM), phenotypic (HPO), gene (Gene Ontology, pathways) and variation levels (predictions from UMD-Predictor and Human Splicing Finder) that can be combined to optimally select candidate pathogenic mutations. VarAFT is freely available at: http://varaft.eu.

INTRODUCTION

Massively parallel sequencing, also called NGS (next generation sequencing), led to a genetic revolution with the ability to sequence any human genome in a few hours. Nevertheless, despite the thousands of exomes and genomes that have been studied (Genome Aggregation Database (gnomAD) (1)), we still have only a limited vision of the human genome variability especially in the context of rare human genetic disease. Indeed, most disease-causing mutations are private, and the availability of functional tests is limited. Therefore, distinguishing neutral mutations from disease-causing ones is challenging. This is even more challenging for rare diseases, defined in Europe as conditions with a frequency below 1:2000, most of them being very rare. A review from Orphanet (2), revealed that the majority of rare diseases are defined by a handful of published reports describing a few individuals with a previously unidentified genetic syndrome. It is now accepted that the limitation is no longer the sequencing of one, many or all genes but rather the data analysis. In addition, while scientists were previously experts for a limited number of genes, they are now facing the ‘all genes data deluge’. This revolution has therefore resulted in a dependency on bioinformatics tools and methods to gather, store, analyze and mine the data flow. Indeed, NGS technologies typically result in the production of hundreds of millions to billions of reads per exome or genome, respectively.

The analysis of these raw data can be divided into three steps as described by Gargis et al. (3). The primary analysis includes the production of sequence reads and assignment of base quality scores; the secondary analysis includes de-multiplexing, alignment of reads to a reference genome and variant calling; and the tertiary analysis is dedicated to the identification of disease-causing mutations. It involves the annotation and filtration of identified sequence variations. As reported by Salgado et al. (4) and Eilbeck et al. (5), the annotation includes various layers that should be combined in the filtration step to rapidly select a handful of candidate mutations. This filtration step can benefit from the combination of data from multiple samples as reported by Sawyer et al. (6) with Whole Exome Sequencing (WES) success rates ranging from 23% for singletons to 34% for families.

To simplify this tedious process, various systems have been released such as QueryOR (7), VarElect (8), VCF-Miner (9) and BierApp (10). To annotate and prioritize mutations, these systems include multiple annotations either captured through global systems such as ANNOVAR (11) and VEP (12) or individually retrieved. However, they only partially respond to users’ needs and may require a preliminary annotation step performed by bioinformaticians. In addition, for web-based solutions, confidentiality issues may arise depending on national legislation (13). In this context, we designed a new system called VarAFT (Variant Annotation and Filtration Tool), that provides a full graphical interface and includes unique features to improve mutation annotation and prioritization. It combines classical data (phylogenetic, conservation and protein structures) with additional information at variant, gene and phenotype levels. In addition, it is one of the few systems able to combine small (single nucleotide variations, small insertion/deletions) and large rearrangements (copy number variations) to get a comprehensive picture of the individual genome.

With VarAFT, users can easily annotate, filter and perform breadth and depth of coverage analysis from their data without computer programming skills and with limited hardware requirements, to efficiently identify disease-causing mutations as demonstrated in various situations (14–21).

MATERIALS AND METHODS

VarAFT is a freely available application written in Java and can therefore be used on most computers. Various binaries

Once annotated, data from any sample can be combined through the ‘filtration module’. It allows users to combine data from multiple sources and layers. The display mode can be parametered to display a subset of available columns. Interactive filtration features allow the progressive reduction of the list of candidate mutations by combining the various annotations. An in-house mutation database can be generated by VarAFT or provided by users, to exclude frequent mutations reported in a specific population and/or platform-dependent artefacts. Once filtration steps have been defined and validated, they can be saved, reapplied and shared for subsequent analysis to ensure filtration standardization in a clinical diagnosis context or for large research networks. At any filtration step, selected data can be exported for downstream analysis or reporting. Moreover, the quality of each selected mutation can be viewed in its sequencing context using IGV (33) directly from VarAFT.

RESULTS

A highly integrative system to easily pinpoint candidate disease-causing variants

As previously reported, the ability to efficiently filter genetic variation to select candidate disease-causing mutations is improved by combining data at the variant, gene and phenotypic levels (4,5). Although multiple information are available at each level, no system was able to collect and combine all this information (4). VarAFT was therefore de-