VISPA: a Computational Pipeline for the Identification and Analysis Of

Calabria et al. Genome Medicine 2014, 6:67 http://genomemedicine.com/content/6/9/67 SOFTWARE Open Access VISPA: a computational pipeline for the identification and analysis of genomic vector integration sites Andrea Calabria1†, Simone Leo2,3†, Fabrizio Benedicenti1, Daniela Cesana1, Giulio Spinozzi1,4, Massimilano Orsini2, Stefania Merella5, Elia Stupka5, Gianluigi Zanetti2 and Eugenio Montini1* Abstract The analysis of the genomic distribution of viral vector genomic integration sites is a key step in hematopoietic stem cell-based gene therapy applications, allowing to assess both the safety and the efficacy of the treatment and to study the basic aspects of hematopoiesis and stem cell biology. Identifying vector integration sites requires ad-hoc bioinformatics tools with stringent requirements in terms of computational efficiency, flexibility, and usability. We developed VISPA (Vector Integration Site Parallel Analysis), a pipeline for automated integration site identification and annotation based on a distributed environment with a simple Galaxy web interface. VISPA was successfully used for the bioinformatics analysis of the follow-up of two lentiviral vector-based hematopoietic stem-cell gene therapy clinical trials. Our pipeline provides a reliable and efficient tool to assess the safety and efficacy of integrating vectors in clinical settings. Background the proximity of the vector’sintegrationsite(IS),a Viral vectors, due to their ability to permanently integrate phenomenon known as insertional mutagenesis (IM) in a target genome, are used to achieve the stable genetic [1,9-12]. The identification of ISs on leukemic cells from modification of therapeutically relevant cells and their GT patients and preclinical models allowed identifying progeny. In particular, γ-retroviral (γ-RVs) and lentiviral the causes of IM and tracking the evolution of the (LVs) vectors are the preferred choice for hematopoietic malignant clone over time [11-16]. State-of-the-art stem/progenitor cell (HSPC) gene therapy (GT) applica- strategies for the identification of ISs start with the tions, having proved their efficacy in several preclinical amplification, via polymerase chain reaction (PCR), of assays and clinical trials for inherited monogenic disorders the DNA portion that contains part of the proviral gen- [1-4]. Since γ-RVs and LVs integrate in the cellular genome and of the flanking cellular genome. PCR products ome in a semi-random fashion [5-8], in a population of are then sequenced and mapped to the reference host vector marked cells each clone and its progeny harbor an genome to determine the genomic coordinates of the integrated vector in a unique genomic position that can ISs. More recently, next generation sequencing (NGS) be used as a distinctive genetic identifier. approaches have greatly enhanced the power of IS ana- Studies aimed at investigating the genomic distribution lysis, allowing to recognize clonal expansions caused by of integrating vectors in blood cells of GT patients are in vivo selection of gain-of-function insertional mutants fundamental to assess the safety and efficacy of the even before they progress to overt malignancy. More- therapy. Indeed, in some cases HSPC-GT was associated over, IS analysis is useful to address clonal diversity dur- to the potential emergence of severe adverse effects that ing hematopoietic reconstitution and the levels of HSPC involve the perturbation of the expression of genes in marking and activity after transplant, thus providing readouts for efficacy. For these reasons, over the past * Correspondence: [email protected] years there has been a constant increase in the amount †Equal contributors 1San Raffaele Telethon Institute for Gene Therapy (TIGET), San Raffaele of sequencing and mapping of vector/genomic DNA Scientific Institute, 20132 Milano, Italy junctions, as well as an increasing diversification of Full list of author information is available at the end of the article © 2014 Calabria et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Calabria et al. Genome Medicine 2014, 6:67 Page 2 of 12 http://genomemedicine.com/content/6/9/67 tissue sources, cell types and time points during IS mon- file for each barcode); the LTR and LC sequences are sub- itoring. However, despite the advances brought forth by sequently removed from each read to isolate genomic NGS approaches and the higher level of detail provided by fragments; in the next step, reads are mapped to the refer- the additional cell types and time points, there is still a ence genome and several filters are applied to ensure clear lack of computational tools that offer both a level of unambiguous alignment; after that, ISs that fall in the performance capable of dealing with the huge amount of same 3 bp window are merged together; finally, all ISs are data generated by sequencing platforms and sufficient annotated by listing nearby genomic features (for example, usability to make them accessible to investigators with genes). In a subsequent postprocessing step, each IS is varying degrees of technological expertise. associated to the LAM-PCR sample from which it was Here we describe the design and implementation of originally derived, allowing its assignment to a source (for VISPA (Vector Integration Sites Parallel Analysis), a bio- example, peripheral blood, bone marrow and so on), cell informatics pipeline for the identification of ISs built on a type (for example, CD4+ T cell, CD19+ B cell, and so on), scalable infrastructure with a simple graphical user inter- and time point after treatment. face (GUI) based on the popular Galaxy framework [17]. VISPA has been successfully applied in several studies on Format conversion mouse and human genome [3,4,18,19]. In this work we In the data extraction step, standard flowgram format describe its performance on human IS datasets. This ana- (SFF, for the Roche 454) or FASTQ (for the Illumina lysis allowed us to highlight critical points negatively MiSeq) files are converted to the FASTA format with a impacting the efficiency of IS retrieval and mapping and wrapper for the sff_extract program [22]. In the course of provided hints to improve the whole process. VISPA is our experiments, before running the pipeline, we also had available at [20]. to convert paired-end FASTQ files from the Illumina MiSeq to single-strand. In the case of Illumina paired-end Implementation data, before running the pipeline, we converted the reads VISPA has been specifically designed to analyze DNA to single-strand as follows: for each read, we determined fragments generated by linear-amplification (LAM) medi- the LTR’s orientation to identify the starting nucleotide; ated PCR [21], a technique used to retrieve and amplify then, for overlapping reads, we merged the pair, while for DNA fragments containing the junctions between the non-overlapping reads we only kept the one that con- integrated proviral and the cellular genome. Due to its tained the LTR, which allows identifying the IS. high sensitivity and accuracy, LAM-PCR is the current standard for preclinical and clinical GT studies. The DNA Demultiplexing fragments generated with this method range from 100 to To avoid NGS capacity underutilization, several samples 1,000 bp in length, and contain the proviral long terminal are often sequenced at the same time, a technique called repeat (LTR), the flanking genomic DNA and a linker multiplexing. To enable the redistribution of output cassette (LC). LAM-PCR products are then reamplified reads into separate groups (demultiplexing), samples are by PCR with fusion primers containing a specific 6- tagged with individual ‘barcode’ sequences. Our demulti- nucleotide sequence (barcode) that acts as a tag to allow plexing tool, implemented in Python, identifies barcodes sample recognition after multiplexing. Barcoded frag- and uses them to demultiplex sequence data, producing ments are then purified, quantified, grouped into pools a separate FASTA file for each barcode. To demultiplex and sequenced with either Roche 454 or Illumina MiSeq sequencing data, we developed a simple exact string pat- platforms. As a result of this procedure, the sequencing tern matching: the input is a list of barcode sequences reads contain not only the genomic fragment needed for that will be searched for at the beginning of each read, IS identification, but also viral and artificial sequences that while the output consists of a separate FASTA file for must be trimmed out before alignment to the reference each barcode (reads that do not contain any known tag genome. Finally, sequencing reads must be processed by a are discarded). To avoid biases due to the possible mis- bioinformatics pipeline that yields the final list of anno- classification of similar sequences, no mismatches are tated ISs (Figure 1). tolerated in this phase. Bioinformatics pipeline Trimming The bioinformatics pipeline (Figure 1C) consists of several Sequencing reads produced in the context of GT contain sequential steps that lead from raw sequencing reads to both viral fragments and artificial sequences introduced

VISPA: a Computational Pipeline for the Identification and Analysis Of

INTELLIGENT MEDICINE the Wings of Global Health

Developing and Implementing an Institute-Wide Data Sharing Policy Stephanie OM Dyke and Tim JP Hubbard*

Cancer Genome Interpreter Annotates the Biological and Clinical Relevance of Tumor Alterations David Tamborero1,2, Carlota Rubio-Perez1, Jordi Deu-Pons1,2, Michael P

ENVE: a Novel Computational Framework

A 26-Hour System of Highly Sensitive Whole Genome Sequencing for Emergency Management of Genetic Diseases Neil A

Exploration of Haplotype Research Consortium Imputation for Genome-Wide Association Studies in 20,032 Generation Scotland Participants Reka Nagy1, Thibaud S

Omic Personality: Implications of Stable

Downloaded from the CAVA Integration with Data Generated by Pre-NGS Methods Webpage [19]

Characterizing Biobank Organizations in the U.S.: Results from a National

Detecting Protein Variants by Mass Spectrometry: a Comprehensive Study in Cancer Cell-Lines Javier A

Biomed Central Open-Access Research That Covers a Broad Range of Disciplines, and Reaches Influencers and Decision Makers

Systems Medicine and Integrated Care to Combat Chronic