Supplemental Material.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
SUPPLEMENTAL MATERIAL A cloud-compatible bioinformatics pipeline for ultra-rapid pathogen identification from next-generation sequencing of clinical samples Samia N. Naccache1,2, Scot Federman1,2, Narayanan Veerarghavan1,2, Matei Zaharia3, Deanna Lee1,2, Erik Samayoa1,2, Jerome Bouquet1,2, Alexander L. Greninger4, Ka-Cheung Luk5, Barryett Enge6, Debra A. Wadford6, Sharon L. Messenger6, Gillian L. Genrich1, Kristen Pellegrino7, Gilda Grard8, Eric Leroy8, Bradley S. Schneider9, Joseph N. Fair9, Miguel A.Martinez10, Pavel Isa10, John A. Crump11,12,13, Joseph L. DeRisi4, Taylor Sittler1, John Hackett, Jr.5, Steve Miller1,2 and Charles Y. Chiu1,2,14* Affiliations: 1Department of Laboratory Medicine, UCSF, San Francisco, CA, USA 2UCSF-Abbott Viral Diagnostics and Discovery Center, San Francisco, CA, USA 3Department of Computer Science, University of California, Berkeley, CA, USA 4Department of Biochemistry, UCSF, San Francisco, CA, USA 5Abbott Diagnostics, Abbott Park, IL, USA 6Viral and Rickettsial Disease Laboratory, California Department of Public Health, Richmond, CA, USA 7Department of Family and Community Medicine, UCSF, San Francisco, CA, USA 8Viral Emergent Diseases Unit, Centre International de Recherches Médicales de Franceville, Franceville, Gabon 9Metabiota, Inc., San Francisco, CA, USA 10Departamento de Genética del Desarrollo y Fisiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México 11Division of Infectious Diseases and International Health and the Duke Global Health Institute, Duke University Medical Center, Durham, NC 12Kilimanjaro Christian Medical Centre, Moshi, Tanzania 13Centre for International Health, University of Otago, Dunedin, New Zealand 14Department of Medicine, Division of Infectious Diseases, UCSF, San Francisco, CA, USA *Corresponding Author Charles Chiu Department of Laboratory Medicine UCSF, San Francisco, CA [email protected] 1 TABLE OF CONTENTS Supplemental Results............................................................................................................ 3 Supplemental Methods.......................................................................................................... 4 Supplemental Figures……………………………………………………………….………….... 15 Supplemental Figure S1. Effect of read length on accuracy of SNAP and RAPSearch classification……………………………………………...……........ 15 Supplemental Figure S2. Coverage maps automatically generated by SURPI in comprehensive mode....................................................................................... 16 Supplemental Figure S3. Relative percentages of processing time and mapped / filtered reads for clinical metagenomic datasets analyzed using SURPI..................... 21 Supplemental Figure S4. ROC curves generated from varying BWA and BT2 parameter combinations................................................................................... 22 Supplemental Figure S5. SURPI scripts and software / database dependencies...... 23 Supplemental Tables............................................................................................................. 24 Supplemental Table S1. Processing time and cost of SURPI versus PathSeq (v1.2) on a cloud server.............................................................................................. 24 Supplemental Table S2. Composition and number of reads corresponding to in silico-generated NGS datasets..................................................................... 25 Supplemental Table S3. Clinical NGS datasets used for ROC curve analysis.......... 26 Supplemental Table S4. Clinical NGS datasets used for SURPI performance validation.......................................................................................................... 27 Supplemental Table S5. Sensitivity and specificity of viral detection in clinical NGS datasets........................................................................................................... 28 Supplemental Tables S6-S13. Viral species summary tables of SURPI (SNAP) nucleotide alignment to the NCBI nt DB.......................................................... 29 Supplemental Tables S14-S21. Viral species summary tables of SURPI (RAPSearch) translated nucleotide alignment to the viral protein DB................................... 48 Supplemental Table S22. Performance testing of de novo assembly methods......... 71 Supplemental References.................................................................................................... 72 2 SUPPLEMENTAL RESULTS Effect of read length on SNAP and RAPSearch performance The ability of SNAP and RAPSearch to correctly identify viral reads across varying read lengths was evaluated (Supplemental Fig. S1A-B). Since SNAP is a global aligner, the optimal edit distance determined by Youden’s index and the F1 score (harmonic mean) (Akobeng 2007) increases linearly as the read length increases (Supplemental Fig. S1C). For RAPSearch, the optimal E-value cutoff decreases exponentially as the read length increases (Supplemental Fig. S1D). Comparison of processing time and costs for running SURPI versus PathSeq v2.1 on a cloud server. Both SURPI and PathSeq v2.1 can be run in a cloud computing environment using an Amazon Elastic Compute Cloud (EC2) server. We benchmarked SURPI using a hi1.4xlarge instance with 16vCPUs, 60.5 GB RAM, and 2 x 1024 GB SSDs. PathSeq was run using its default “cloud” setup and with default parameters on EC2, using a Hadoop cluster of 20 m1.large instances as described (Kostic et al. 2011). Overall processing times and costs are reported in Supplemental Table S1. Costs were calculated by hour (minutes rounded up) according to established Amazon EC2 prices as of July of 2013. Accuracy of SURPI for viral read identification in clinical NGS datasets SURPI’s ability to correctly identify viral reads originating from viral genomes known to be represented in clinical NGS datasets was evaluated. Clinical sample sets benchmarked and validated by SURPI are shown in Table S4. True viral reads / microbial reads were defined as meeting an E-value cutoff of 1x10-20 after BLASTn alignment to the most closely related viral genome in the reference database. Using this definition, the sensitivity and specificity of SURPI for viral identification were calculated (Supplemental Table S5). For detection of parechovirus in pediatric stool, there is decreased sensitivity by SURPI in fast mode due to sequence divergence (Figure 5B; Supplemental Table S4), as fast mode, unlike comprehensive mode, does not use RAPSearch for translated nucleotide alignment to the viral database. Similarly, for the highly divergent titi monkey adenovirus (TMAdV) and Bas-Congo rhabodovirus (BASV), only 8% and 0% viral reads are detected by nucleotide alignment, respectively, using SNAP, while only 14-16% viral reads are detected using RAPSearch (Figure 5G; Supplemental Table S4). 3 SUPPLEMENTAL METHODS Clinical samples NGS datasets from clinical samples harboring defined pathogens were generated in-house with the exception of datasets corresponding to a prostate cancer cell line (Prensner et al. 2011) and biopsy tissue from a colon cancer study (Castellarin et al. 2012), which were downloaded from the publicly available NCBI Sequence Read Archive (SRA). Details of NGS library preparation have been previously reported for the following datasets: chronic hepatitis (serum) (Naccache et al. 2013), pediatric diarrhea (stool) (Yu et al. 2012), hemorrhagic fever (serum) (Grard et al. 2012), prostate cancer (biopsy tissue) (Lee et al. 2012), simian pneumonia (lung swab / tissue) (Chen et al. 2011), and acute respiratory illness from influenza A(H1N1)pdm09 (nasal swab) (Greninger et al. 2010) (Supplementary Methods). The remaining in-house NGS libraries were derived from the following diseases and clinical samples: transfusion-transmitted viral hepatitis from the transfusion-transmitted viruses study (TTVS) (serum), (Mosley et al. 1990), hantavirus pulmonary syndrome (serum) (Hartline et al. 2013), encephalitis (cerebrospinal fluid), spiked HIV (plasma), acute hemorrhagic fever outbreak in the Democratic Republic of the Congo, Africa (serum), acute febrile illness in Tanzania, Africa (serum), and fever in a returning traveler (serum). Ethics statement All clinical samples were analyzed under protocols approved by the UCSF Institutional Review Board (IRB) and the IRBs of the collaborative donor institutions. Written informed consent to provide clinical samples for microbial analysis was previously obtained for patients in the chronic hepatitis cohort (University of Chicago) (Naccache et al. 2013) and the BIOLINCC Transfusion-Transmitted Viruses Study (Mosley et al. 1990), and for the returning traveler with undiagnosed fever. Other samples did not require consent as they were pre-existing and de- identified or were obtained from public health investigations, thus deemed not to constitute human subjects research NGS library construction The majority of NGS libraries were constructed using a modified TruSeq (Illumina, San Diego, CA) protocol as previously described (Naccache et al. 2013). Briefly, reverse transcription was performed using random nonamers attached to a 18 bp linker adapter (Wang et al. 2003), followed by second-strand synthesis using Sequenase (Affymetrix, Santa Clara, 4 CA) and 10-25 cycles of PCR amplification using high-fidelity KlenTaq LA DNA polymerase (Sigma-Aldrich, Carlsbad, CA). Adapter sequences were then cleaved using the Type IIs restriction