A Pipeline for Identifying Saccharomyces Cerevisiae Mutations

bioRxiv preprint doi: https://doi.org/10.1101/2020.05.16.099259; this version posted May 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 2 3 4 MutantHuntWGS: A Pipeline for Identifying Saccharomyces cerevisiae Mutations 5 6 7 8 Mitchell A. Ellison, Jennifer L. Walker, Patrick J. Ropp, Jacob D. Durrant, Karen M. 9 Arndt 10 11 Department of Biological Sciences, University of Pittsburgh, Pittsburgh PA, 15260 12 13 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.05.16.099259; this version posted May 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 14 RUNNING TITLE: MutantHuntWGS 15 16 KEYWORDS: 17 Mutant hunt, variant calling, Saccharomyces cerevisiae, bulk segregant analysis, lab 18 evolution 19 20 CORRESPONDING AUTHORS: 21 Karen M. Arndt 22 Department of Biological Sciences 23 University of Pittsburgh 24 4249 Fifth Avenue 25 Pittsburgh, PA 15260 26 412-624-6963 27 [email protected] 28 29 Mitchell A. Ellison 30 Department of Biological Sciences 31 University of Pittsburgh 32 4249 Fifth Avenue 33 Pittsburgh, PA 15260 34 412-624-6992 35 [email protected] 36 2 bioRxiv preprint doi: https://doi.org/10.1101/2020.05.16.099259; this version posted May 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 37 ABSTRACT 38 MutantHuntWGS is a user-friendly pipeline for analyzing Saccharomyces cerevisiae 39 whole-genome sequencing data. It uses available open-source programs to: (1) perform 40 sequence alignments for paired and single-end reads, (2) call variants, and (3) predict 41 variant effect and severity. MutantHuntWGS outputs a shortlist of variants while also 42 enabling access to all intermediate files. To demonstrate its utility, we use 43 MutantHuntWGS to assess multiple published datasets; in all cases, it detects the same 44 causal variants reported in the literature. To encourage broad adoption and promote 45 reproducibility, we distribute a containerized version of the MutantHuntWGS pipeline 46 that allows users to install and analyze data with only two commands. The 47 MutantHuntWGS software and documentation can be downloaded free of charge from 48 https://github.com/mae92/MutantHuntWGS. 49 50 3 bioRxiv preprint doi: https://doi.org/10.1101/2020.05.16.099259; this version posted May 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 51 INTRODUCTION 52 Saccharomyces cerevisiae is a powerful model system for understanding the 53 complex processes that direct cellular function and underpin many human diseases 54 (Birkeland et al. 2010; Botstein and Fink 2011; Kachroo et al. 2015; Hamza et al. 2015, 55 2020; Wangler et al. 2017; Strynatka et al. 2018). Mutant hunts (i.e. genetic screens 56 and selections) in yeast have played a vital role in the discovery of many gene functions 57 and interactions (Winston and Koshland 2016). A classical mutant hunt produces a 58 phenotypically distinct colony derived from an individual yeast cell with at most a small 59 number of causative mutations. However, identifying these mutations using traditional 60 genetic methods (Lundblad 1989) can be difficult and time-consuming (Gopalakrishnan 61 and Winston 2019). 62 Whole-genome sequencing (WGS) is a powerful tool for rapidly identifying 63 mutations that underlie mutant phenotypes (Smith and Quinlan 2008; Irvine et al. 2009; 64 Birkeland et al. 2010). As sequencing technologies improve, the method is becoming 65 more popular and cost-effective (Shendure and Ji 2008; Mardis 2013). WGS is 66 particularly powerful when used in conjunction with lab-evolution (Goldgof et al. 2016; 67 Ottilie et al. 2017) or mutant-hunt experiments, both with (Birkeland et al. 2010; Reavey 68 et al. 2015) and without (Gopalakrishnan et al. 2019) bulk segregant analysis. 69 Analysis methods that identify sequence variants from WGS data can be 70 complicated and often require bioinformatics expertise, limiting the number of 71 investigators who can pursue these experiments. There is a need for an easy-to-use, 72 data-transparent tool that allows users with limited bioinformatics training to identify 73 sequence variants relative to a reference genome. To address this need, we created 4 bioRxiv preprint doi: https://doi.org/10.1101/2020.05.16.099259; this version posted May 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 74 MutantHuntWGS, a bioinformatics pipeline that processes data from WGS experiments 75 conducted in S. cerevisiae. MutantHuntWGS first identifies sequence variants in both 76 control and experimental (i.e. mutant) samples, relative to a reference genome. Next, it 77 filters out variants that are found in both the control and experimental samples while 78 applying a variant quality score-cutoff. Finally, the remaining variants are annotated with 79 information such as the affected gene and the predicted impact on gene expression and 80 function. The program also allows the user to inspect all relevant intermediate and 81 output files. 82 To enable quick and easy installation and to ensure reproducibility, we 83 incorporated MutantHuntWGS into a Docker container 84 (https://hub.docker.com/repository/docker/mellison/mutant_hunt_wgs). With a single 85 command, users can download and install the software. A second command runs the 86 analysis, performing all steps described above. MutantHuntWGS allows researchers to 87 leverage WGS for the efficient identification of causal mutations, regardless of 88 bioinformatics experience. 89 90 METHODS 91 Pipeline Overview 92 The MutantHuntWGS pipeline integrates a series of open-source bioinformatics 93 tools and Unix commands that accept raw sequencing reads (compressed FASTQ 94 format or .fastq.gz) and a text file containing ploidy information as input, and produces a 95 list of sequence variants as output. The user must provide input data from at least two 96 strains: a control strain and one or more experimental strains. The pipeline uses (1) 97 Bowtie2 to align the reads in each input sample to the reference genome (Langmead 5 bioRxiv preprint doi: https://doi.org/10.1101/2020.05.16.099259; this version posted May 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 98 and Salzberg 2012), (2) SAMtools to process the data and calculate genotype 99 likelihoods (Li et al. 2009), (3) BCFtools to call variants (Li et al. 2009), (4) VCFtools 100 (Danecek et al. 2011) and custom shell commands to compare variants found in 101 experimental and control strains, and (5) SnpEff (Cingolani et al. 2012) and SIFT (Vaser 102 et al. 2016) to assess where variants are found in relation to annotated genes and the 103 potential impact on the expression and function of the affected gene products (Figure 104 1). A detailed description of the commands used in the pipeline and all code is available 105 on the MutantHuntWGS Git repository (https://github.com/mae92/MutantHuntWGS; see 106 README.md, Supplemental_Methods.docx files). 107 108 Analysis of previously published data 109 To demonstrate utility, we used MutantHuntWGS to analyze published datasets 110 from paired-end sequencing experiments with DNA prepared from bulk segregants or 111 lab-evolved strains (Birkeland et al. 2010; Goldgof et al. 2016; Ottilie et al. 2017). These 112 data were downloaded from the sequence read archive (SRA) database 113 (https://www.ncbi.nlm.nih.gov/sra; project accessions: SRP003355, SRP074482, 114 SRP074623) and decompressed using the SRA toolkit (https://github.com/ncbi/sra- 115 tools/wiki). MutantHuntWGS was run from within the Docker container, and each 116 published mutant (experimental) file was compared to its respective published control. 117 118 Data Availability Statement 119 All code and supplementary information on the methods used herein are available on 120 the MutantHuntWGS Git repository (https://github.com/mae92/MutantHuntWGS). 6 bioRxiv preprint doi: https://doi.org/10.1101/2020.05.16.099259; this version posted May 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 121 RESULTS AND DISCUSSION 122 Installing and running the MutantHuntWGS pipeline 123 To facilitate distribution and maximize reproducibility, we implemented 124 MutantHuntWGS in a Docker container (Boettiger 2015; Di Tommaso et al. 2015). The 125 container houses the pipeline and all of its dependencies in a Unix/Linux environment. 126 To download and install the MutantHuntWGS Docker container, users need only install 127 Docker Desktop (https://docs.docker.com/get-docker/), open a command-line terminal, 128 and execute the following command: 129 $ docker run -it -v 130 /PATH_TO_DESKTOP/Analysis_Directory:/Main/Analysis_Directory 131 mellison/mutant_hunt_wgs:version1 132 After download and installation, the command opens a Unix terminal running in the 133 Docker container so users can begin their analysis.

Load more