bioRxiv preprint doi: https://doi.org/10.1101/2020.05.16.099259; this version posted May 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

MutantHuntWGS: A Pipeline for Identifying Saccharomyces cerevisiae Mutations

Mitchell A. Ellison, Jennifer L. Walker, Patrick J. Ropp, Jacob D. Durrant, Karen M. Arndt

Department of Biological Sciences, University of Pittsburgh, Pittsburgh PA, 15260

RUNNING TITLE: MutantHuntWGS

KEYWORDS:
Mutant hunt, variant calling, Saccharomyces cerevisiae, bulk segregant analysis, lab evolution

CORRESPONDING AUTHORS:
Karen M. Arndt
Department of Biological Sciences
University of Pittsburgh
4249 Fifth Avenue
Pittsburgh, PA 15260
412-624-6963
[email protected]

Mitchell A. Ellison
Department of Biological Sciences
University of Pittsburgh
4249 Fifth Avenue
Pittsburgh, PA 15260
412-624-6992
[email protected]
ABSTRACT
MutantHuntWGS is a user-friendly pipeline for analyzing Saccharomyces cerevisiae whole-genome sequencing data. It uses available open-source programs to: (1) perform sequence alignments for paired and single-end reads, (2) call variants, and (3) predict variant effect and severity. MutantHuntWGS outputs a shortlist of variants while also enabling access to all intermediate files. To demonstrate its utility, we use MutantHuntWGS to assess multiple published datasets; in all cases, it detects the same causal variants reported in the literature. To encourage broad adoption and promote reproducibility, we distribute a containerized version of the MutantHuntWGS pipeline that allows users to install and analyze data with only two commands. The MutantHuntWGS software and documentation can be downloaded free of charge from https://github.com/mae92/MutantHuntWGS.

INTRODUCTION
Saccharomyces cerevisiae is a powerful model system for understanding the complex processes that direct cellular function and underpin many human diseases (Birkeland et al. 2010; Botstein and Fink 2011; Kachroo et al. 2015; Hamza et al. 2015, 2020; Wangler et al. 2017; Strynatka et al. 2018). Mutant hunts (i.e. genetic screens and selections) in yeast have played a vital role in the discovery of many gene functions and interactions (Winston and Koshland 2016). A classical mutant hunt produces a phenotypically distinct colony derived from an individual yeast cell with at most a small number of causative mutations.
However, identifying these mutations using traditional genetic methods (Lundblad 1989) can be difficult and time-consuming (Gopalakrishnan and Winston 2019).

Whole-genome sequencing (WGS) is a powerful tool for rapidly identifying mutations that underlie mutant phenotypes (Smith and Quinlan 2008; Irvine et al. 2009; Birkeland et al. 2010). As sequencing technologies improve, the method is becoming more popular and cost-effective (Shendure and Ji 2008; Mardis 2013). WGS is particularly powerful when used in conjunction with lab-evolution (Goldgof et al. 2016; Ottilie et al. 2017) or mutant-hunt experiments, both with (Birkeland et al. 2010; Reavey et al. 2015) and without (Gopalakrishnan et al. 2019) bulk segregant analysis.

Analysis methods that identify sequence variants from WGS data can be complicated and often require bioinformatics expertise, limiting the number of investigators who can pursue these experiments. There is a need for an easy-to-use, data-transparent tool that allows users with limited bioinformatics training to identify sequence variants relative to a reference genome. To address this need, we created MutantHuntWGS, a bioinformatics pipeline that processes data from WGS experiments conducted in S. cerevisiae. MutantHuntWGS first identifies sequence variants in both control and experimental (i.e. mutant) samples, relative to a reference genome. Next, it filters out variants that are found in both the control and experimental samples while applying a variant quality score-cutoff.
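The filtering step just described amounts to a set difference combined with a quality threshold. The following is a minimal Python sketch of that logic only, not the pipeline's own code. The names `Variant`, `shortlist_variants`, and `min_qual` are hypothetical, as is the choice to identify a variant by chromosome, position, reference allele, and alternate allele, and to apply the cutoff to the experimental calls.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """One variant call (hypothetical record; fields mirror common VCF columns)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float


def shortlist_variants(
    experimental: Iterable[Variant],
    control: Iterable[Variant],
    min_qual: float,
) -> List[Variant]:
    """Keep experimental variants that are absent from the control
    and whose quality score meets the cutoff (an assumed rule)."""
    # Assumption: two calls are the "same" variant when site and alleles match.
    control_keys = {(v.chrom, v.pos, v.ref, v.alt) for v in control}
    return [
        v
        for v in experimental
        if v.qual >= min_qual
        and (v.chrom, v.pos, v.ref, v.alt) not in control_keys
    ]


if __name__ == "__main__":
    control = [Variant("chrI", 100, "A", "G", 50.0)]
    experimental = [
        Variant("chrI", 100, "A", "G", 60.0),    # shared with control
        Variant("chrII", 200, "C", "T", 80.0),   # unique and high quality
        Variant("chrIII", 300, "G", "A", 5.0),   # unique but low quality
    ]
    # The cutoff value 30.0 is arbitrary and chosen only for illustration.
    for variant in shortlist_variants(experimental, control, min_qual=30.0):
        print(variant)
```

Keying on the alleles as well as the site means a different substitution at a position that also varies in the control would still be retained; whether that is desirable depends on the experiment.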
Finally, the remaining variants are annotated with information such as the affected gene and the predicted impact on gene expression and function. The program also allows the user to inspect all relevant intermediate and output files.

To enable quick and easy installation and to ensure reproducibility, we incorporated MutantHuntWGS into a Docker container (https://hub.docker.com/repository/docker/mellison/mutant_hunt_wgs). With a single command, users can download and install the software. A second command runs the analysis, performing all steps described above. MutantHuntWGS allows researchers to leverage WGS for the efficient identification of causal mutations, regardless of bioinformatics experience.

METHODS
Pipeline Overview
The MutantHuntWGS pipeline integrates a series of open-source bioinformatics tools and Unix commands that accept raw sequencing reads (compressed FASTQ format or .fastq.gz) and a text file containing ploidy information as input, and produces a list of sequence variants as output. The user must provide input data from at least two strains: a control strain and one or more experimental strains. The pipeline uses (1) Bowtie2 to align the reads in each input sample to the reference genome (Langmead and Salzberg 2012), (2) SAMtools to process the data and calculate genotype likelihoods (Li et al. 2009), (3) BCFtools to call variants (Li et al. 2009), (4) VCFtools (Danecek et al. 2011) and custom shell commands to compare variants found in experimental and control strains, and (5) SnpEff (Cingolani et al. 2012) and SIFT (Vaser et al.
2016) to assess where variants are found in relation to annotated genes and the potential impact on the expression and function of the affected gene products (Figure 1). A detailed description of the commands used in the pipeline and all code is available on the MutantHuntWGS Git repository (https://github.com/mae92/MutantHuntWGS; see README.md, Supplemental_Methods.docx files).

Analysis of previously published data
To demonstrate utility, we used MutantHuntWGS to analyze published datasets from paired-end sequencing experiments with DNA prepared from bulk segregants or lab-evolved strains (Birkeland et al. 2010; Goldgof et al. 2016; Ottilie et al. 2017). These data were downloaded from the Sequence Read Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra; project accessions: SRP003355, SRP074482, SRP074623) and decompressed using the SRA toolkit (https://github.com/ncbi/sra-tools/wiki). MutantHuntWGS was run from within the Docker container, and each published mutant (experimental) file was compared to its respective published control.

Data Availability Statement
All code and supplementary information on the methods used herein are available on the MutantHuntWGS Git repository (https://github.com/mae92/MutantHuntWGS).

RESULTS AND DISCUSSION
Installing and running the MutantHuntWGS pipeline
To facilitate distribution and maximize reproducibility, we implemented MutantHuntWGS in a Docker container (Boettiger 2015; Di Tommaso et al. 2015). The container houses the pipeline and all of its dependencies in a Unix/Linux environment.
To download and install the MutantHuntWGS Docker container, users need only install Docker Desktop (https://docs.docker.com/get-docker/), open a command-line terminal, and execute the following command:

$ docker run -it -v /PATH_TO_DESKTOP/Analysis_Directory:/Main/Analysis_Directory mellison/mutant_hunt_wgs:version1

After download and installation, the command opens a Unix terminal running in the Docker container so users can begin their analysis.
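In the command above, /PATH_TO_DESKTOP is a placeholder for the absolute path to the user's own Desktop folder, and the text after the colon is the location at which that folder appears inside the container. As a minimal sketch of how the pieces fit together, the Python helper below assembles the same command for a given host directory. The function name `docker_run_command` and the example path are hypothetical, and POSIX-style (macOS or Linux) paths are assumed; the image name, container path, and flags are copied from the command above.

```python
import shlex
from pathlib import PurePosixPath

# Values taken verbatim from the command shown in the text.
CONTAINER_DIR = "/Main/Analysis_Directory"
IMAGE = "mellison/mutant_hunt_wgs:version1"


def docker_run_command(host_analysis_dir: str) -> str:
    """Return the docker command line that mounts host_analysis_dir
    at CONTAINER_DIR and opens an interactive terminal."""
    host = PurePosixPath(host_analysis_dir)
    # A non-absolute source would be treated by Docker as a named volume
    # rather than a host folder, so it is rejected here.
    if not host.is_absolute():
        raise ValueError("host_analysis_dir must be an absolute path")
    args = ["docker", "run", "-it", "-v", f"{host}:{CONTAINER_DIR}", IMAGE]
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return shlex.join(args)


if __name__ == "__main__":
    # Hypothetical example location of the analysis folder.
    print(docker_run_command("/Users/alice/Desktop/Analysis_Directory"))
```

Printing the command instead of executing it lets the user check the mounted path before starting the container.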