The Khmer Software Package: Enabling Efficient Nucleotide Sequence Analysis [Version 1; Peer Review: 2 Approved, 1 Approved with Reservations]
F1000Research 2015, 4:900 Last updated: 23 JUL 2021

SOFTWARE TOOL ARTICLE

The khmer software package: enabling efficient nucleotide sequence analysis [version 1; peer review: 2 approved, 1 approved with reservations]

Michael R. Crusoe1, Hussien F. Alameldin2, Sherine Awad3, Elmar Boucher4, Adam Caldwell5, Reed Cartwright6, Amanda Charbonneau7, Bede Constantinides8, Greg Edvenson9, Scott Fay10, Jacob Fenton11, Thomas Fenzl12, Jordan Fish11, Leonor Garcia-Gutierrez13, Phillip Garland14, Jonathan Gluck15, Iván González16, Sarah Guermond17, Jiarong Guo18, Aditi Gupta1, Joshua R. Herr1, Adina Howe19, Alex Hyer20, Andreas Härpfer21, Luiz Irber11, Rhys Kidd22, David Lin23, Justin Lippi24, Tamer Mansour3,25, Pamela McA'Nulty26, Eric McDonald11, Jessica Mizzi27, Kevin D. Murray28, Joshua R. Nahum29, Kaben Nanlohy30, Alexander Johan Nederbragt31, Humberto Ortiz-Zuazaga32, Jeramia Ory33, Jason Pell11, Charles Pepe-Ranney34, Zachary N. Russ35, Erich Schwarz36, Camille Scott11, Josiah Seaman37, Scott Sievert38, Jared Simpson39,40, Connor T. Skennerton41, James Spencer42, Ramakrishnan Srinivasan43, Daniel Standage44,45, James A. Stapleton46, Susan R. Steinman47, Joe Stein48, Benjamin Taylor11, Will Trimble49, Heather L. Wiencko50, Michael Wright11, Brian Wyss11, Qingpeng Zhang11, en zyme51, C. Titus Brown1,3,11

1 Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, USA
2 Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI, USA
3 Population Health and Reproduction, University of California, Davis, Davis, CA, USA
4 Department of Biomedical Engineering, Oregon Health and Science University, Portland, OR, USA
5 Biology Department, San Jose State University, San Jose, CA, USA
6 School of Life Sciences and The Biodesign Institute, Arizona State University, Tempe, AZ, USA
7 Genetics, Michigan State University, East Lansing, MI, USA
8 Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK
9 Micron Technology, Seattle, WA, USA
10 Invitae, San Francisco, CA, USA
11 Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
12 Independent Researcher, Munich, Germany
13 Mathematics Institute, University of Warwick, Warwick, UK
14 Eastlake Data, Seattle, WA, USA
15 Graduate Program, University of Maryland, College Park, MD, USA
16 Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Charlestown, MA, USA
17 Independent Researcher, Seattle, WA, USA
18 Center for Microbial Ecology, Michigan State University, East Lansing, MI, USA
19 Department of Agricultural and Biosystems Engineering, Iowa State University, Ames, IA, USA
20 Department of Biology, University of Utah, Salt Lake City, UT, USA
21 ConSol* Software GmbH, Munchen, Germany
22 Independent Researcher, Sydney, Australia
23 Verdematics, Fremont, CA, USA
24 Independent Researcher, San Francisco, CA, USA
25 Clinical Pathology, Mansoura University, Mansoura, Egypt
26 Addgene, Cambridge, MA, USA
27 Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
28 ARC Centre of Excellence in Plant Energy Biology, The Australian National University, Canberra, ACT, Australia
29 BEACON Center, Michigan State University, East Lansing, MI, USA
30 Independent Researcher, New Orleans, LA, USA
31 Centre for Ecological and Evolutionary Synthesis, Dept. of Biosciences, University of Oslo, Oslo, Norway
32 Department of Computer Science, Rio Piedras Campus, University of Puerto Rico, San Juan, Puerto Rico
33 Biochemistry, St. Louis College of Pharmacy, St. Louis, MO, USA
34 Crop and Soil Sciences, Cornell University, Ithaca, NY, USA
35 Department of Bioengineering, UC Berkeley, Berkeley, CA, USA
36 Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
37 Data Visualization, Newline Technical Innovations, Windsor, CO, USA
38 Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA
39 Ontario Institute for Cancer Research, Toronto, ON, Canada
40 Computer Science, University of Toronto, Toronto, ON, Canada
41 Division of Geological and Planetary Sciences, California Institute of Technology, Pasadena, CA, USA
42 Dept of Physics and Dept of Materials, Imperial College London, London, UK
43 Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
44 Department of Biology, Indiana University, Bloomington, IN, USA
45 Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA, USA
46 Chemical Engineering & Materials Science, Michigan State University, East Lansing, MI, USA
47 The New York Eye and Ear Infirmary of Mount Sinai, New York, NY, USA
48 Independent Researcher, Providence, RI, USA
49 Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, USA
50 Department of Genetics, Smurfit Institute, Trinity College Dublin, Dublin, Ireland
51 Independent Researcher, Boston, MA, USA

First published: 25 Sep 2015, 4:900 https://doi.org/10.12688/f1000research.6924.1
Latest published: 25 Sep 2015, 4:900 https://doi.org/10.12688/f1000research.6924.1

Open Peer Review
Reviewer Status: Invited Reviewers 1, 2, 3. Version 1 (25 Sep 2015): reports from reviewers 1, 2, and 3.

Abstract
The khmer package is a freely available software library for working efficiently with fixed length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization.
khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/.

Keywords
bioinformatics, dna sequencing analysis, k-mer, kmer, khmer, online, low-memory, streaming

Invited reviewers:
1. Ewan Birney, European Bioinformatics Institute, Hinxton, UK
2. Daniel S. Katz, University of Chicago, Chicago, USA
3. Rob Patro, Stony Brook University, Stony Brook, USA

Any reports and responses or comments on the article can be found at the end of the article.

This article is included in the Iowa State University collection.

Corresponding author: C. Titus Brown ([email protected])

Competing interests: No competing interests were disclosed.

Grant information: khmer development has largely been supported by AFRI Competitive Grant no. 2010-65205-20361 from the USDA NIFA, and is now funded by the National Human Genome Research Institute of the National Institutes of Health under Award Number R01HG007513, as well as by the Gordon and Betty Moore Foundation under Award number GBMF4551, all to CTB. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2015 Crusoe MR et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite this article: Crusoe MR, Alameldin HF, Awad S et al.
The khmer software package: enabling efficient nucleotide sequence analysis [version 1; peer review: 2 approved, 1 approved with reservations] F1000Research 2015, 4:900 https://doi.org/10.12688/f1000research.6924.1

Introduction
DNA words of a fixed length k, or “k-mers”, are a common abstraction in DNA sequence analysis that enables alignment-free sequence analysis and comparison. With the advent of second-generation sequencing and the widespread adoption of De Bruijn graph-based assemblers, k-mers have become even more widely used in recent years. However, the dramatically increased rate of sequence data generation from Illumina sequencers continues to challenge the basic data structures and algorithms for k-mer storage and manipulation. This has led to the development of a wide range of data structures and algorithms that explore possible improvements to k-mer-based approaches.

Here we present version 2.0 of the khmer software package, a high-performance library implementing memory- and time-efficient algorithms for the manipulation and analysis of short-read data sets. khmer contains reference implementations of several approaches, including a probabilistic k-mer counter based on the CountMin Sketch [1], a compressible

Operation
khmer is primarily developed on Linux for Python 2.7 and 64-bit processors, and several core developers use Mac OS X. The project is tested regularly using the Jenkins continuous integration system running on Ubuntu 14.04 LTS and Mac OS X 10.10; the current development branch is also tested under Python 3.3, 3.4, and 3.5. Releases are tested against many Linux distributions, including Red Hat Enterprise Linux, Debian, Fedora, and Ubuntu. khmer should work on most UNIX derivatives with little modification. Windows is explicitly not supported.

Memory requirements for using khmer vary with the complexity of the data and are user configurable. Several core data structures can trade memory for false positives, and we have explored these details in several papers, most notably Pell et al. 2012 [2] and Zhang et al. 2014 [1]. For example, most single organism mRNAseq data sets can be processed in under 16 GB of RAM [3,8], while memory requirements for metagenome data sets may vary from dozens of gigabytes