A Comparison of Three Programming Languages for a Full-Fledged Next-Generation Sequencing Tool

Pascal Costanza*†, Charlotte Herzeel† and Wilfried Verachtert

Costanza et al. BMC Bioinformatics (2019) 20:301
https://doi.org/10.1186/s12859-019-2903-5

METHODOLOGY ARTICLE (Open Access)

Abstract

Background: elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity bottleneck in its original implementation language during recent further development of elPrep. We therefore investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use.

Results: The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is better at managing a large heap of objects than reference counting in our case.

Conclusions: Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.
Keywords: Next-generation sequencing, Sequence analysis, SAM/BAM files, C++, Go, Java, Runtime performance, Memory usage, Garbage collection, Reference counting

*Correspondence: [email protected]
†Pascal Costanza and Charlotte Herzeel contributed equally to this work.
imec, ExaScience Lab, Kapeldreef 75, 3001 Leuven, Belgium

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

The sequence alignment/map format (SAM/BAM) [1] is the de facto standard in the bioinformatics community for storing mapped sequencing data. There exists a large body of work on tools for processing SAM/BAM files for analysis [1–15]. The SAMtools [1], Picard [2], and Genome Analysis Toolkit (GATK) [3] software packages developed by the Broad and Sanger institutes are considered to be reference implementations for many operations on SAM/BAM files, examples of which include sorting reads, marking polymerase chain reaction (PCR) and optical duplicates, recalibrating base quality scores, indel realignment, and various filtering options, which typically precede variant calling. Many alternative software packages [4–10, 12, 14, 15] focus on optimizing the computations of these operations, either by providing alternative algorithms, or by using parallelization, distribution, or other optimization techniques specific to their implementation language, which is often C, C++, or Java.

We have developed elPrep [8, 16], an open-source, multi-threaded framework for processing SAM/BAM files in sequencing pipelines, especially designed for optimizing computational performance. It can be used as a drop-in replacement for many operations implemented by SAMtools, Picard, and GATK, while producing identical results [8, 16]. elPrep allows users to specify arbitrary combinations of SAM/BAM operations as a single pipeline in one command line. elPrep's unique software architecture then ensures that running such a pipeline requires only a single pass through the SAM/BAM file, no matter how many operations are specified. The framework takes care of merging and parallelizing the execution of the operations, which significantly speeds up the overall execution of a pipeline.

In contrast, related work focuses on optimizing individual SAM/BAM operations, but we have shown that our approach of merging operations outperforms this strategy [8]. For example, compared to using GATK4, elPrep executes the 4-step Broad Best Practices pipeline [17] (consisting of sorting, marking PCR and optical duplicates, and base quality score recalibration and application) up to 13x faster on whole-exome data, and up to 7.4x faster on whole-genome data, while utilizing fewer compute resources [8].

All SAM/BAM tools have in common that they need to manipulate large amounts of data, as SAM/BAM files easily take up 10–100 gigabytes (GB) in compressed form. Some tools implement data structures that spill to disk when reaching a certain threshold on random access memory (RAM) use, but elPrep uses a strategy where data is split upfront into chunks that are processed entirely in memory to avoid repeated file input/output [16]. Our benchmarks show that elPrep's representation of SAM/BAM data is more efficient than, for example, GATK version 4 (GATK4), as elPrep uses less memory for loading the same number of reads from a SAM/BAM file in memory [8]. However, since elPrep does not provide data structures that spill to disk, elPrep currently requires a fixed minimum amount of RAM to process a whole-exome or whole-genome file, whereas other tools sometimes allow putting a cap on the RAM use by using disk space instead. Nonetheless, for efficiency, it is recommended to use as much RAM as available, even when spilling to disk [8, 18]. This means that, in general, tools for processing SAM/BAM data need to be able to manipulate large amounts of allocated memory.

In most programming languages, there exist more or less similar ways to explicitly or implicitly allocate memory for heap objects which, unlike stack values, are not bound to the lifetimes of function or method invocations. However, programming languages strongly differ in how memory for heap objects is subsequently deallocated. A detailed discussion can be found in "The Garbage Collection Handbook" by Jones, Hosking, and Moss [19]. There are mainly three approaches:

Manual memory management: Memory has to be explicitly deallocated in the program source code

Garbage collection: Memory is automatically managed by a separate component of the runtime library called the garbage collector. At arbitrary points in time, it traverses the object graph to determine which objects are still directly or indirectly accessible by the running program, and deallocates inaccessible objects. This ensures that object lifetimes do not have to be explicitly modelled, and that pointers can be more freely passed around in a program. Most garbage collector implementations interrupt the running program and only allow it to continue executing after garbage collection (they "stop the world" [19]) and perform object graph traversal using a sequential algorithm. However, advanced implementation techniques, as employed by Java [21] and Go [22], include traversing the object graph concurrently with the running program while limiting its interruption as far as possible; and using a multi-threaded parallel algorithm that significantly speeds up garbage collection on modern multicore processors.

Reference counting: Memory is managed by maintaining a reference count with each heap object. When pointers are assigned to each other, these reference counts are increased or decreased to keep track of how many pointers refer to each object. Whenever a reference count drops to zero, the corresponding object can be deallocated.¹

elPrep was originally, up to version 2.6, implemented in the Common Lisp programming language [23]. Most existing Common Lisp implementations use stop-the-world, sequential garbage collectors. To achieve good performance, it was therefore necessary to explicitly control how often and when the garbage collector would run to avoid needless interruptions of the main program, especially during parallel phases. As a consequence, we also had to avoid unnecessary memory allocations, and reuse already allocated memory as far as possible, to reduce the number of garbage collector runs. However, our more recent attempts to add more functionality to elPrep (like optical duplicate marking, base quality score recalibration, and so on) required allocating additional memory for these new steps, and it became an even more complex task and a serious productivity bottleneck to keep memory allocation and garbage collection in check. We therefore started to look for a different programming language using an alternative memory management approach to continue developing elPrep and still achieve good performance.

Existing literature on comparing programming languages and their implementations for performance typically focus on specific
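To make the difference between the reference counting and garbage collection approaches listed above concrete, the following toy model may help. It is only a sketch under stated assumptions: it is written in Python for brevity rather than in any of the benchmarked languages, it is not taken from elPrep, and every name in it (`Obj`, `assign`, `release`, `sweep`, `demo`) is hypothetical. The tracing part is a plain sequential traversal, unlike the concurrent, parallel collectors of Java and Go.

```python
class Obj:
    """A toy heap object (hypothetical; not an elPrep data structure)."""

    def __init__(self, name):
        self.name = name
        self.refs = []   # outgoing pointers to other heap objects
        self.count = 0   # number of pointers that refer to this object


def assign(holder, target):
    """Reference counting: store a pointer and increase the target's count."""
    holder.refs.append(target)
    target.count += 1


def release(obj, freed):
    """Reference counting: drop one pointer; deallocate when the count is zero."""
    obj.count -= 1
    if obj.count == 0:
        freed.append(obj.name)
        for child in obj.refs:   # a freed object gives up its own pointers
            release(child, freed)
        obj.refs.clear()


def reachable(roots):
    """Tracing: walk the object graph starting from the program's roots."""
    seen, stack = set(), list(roots)
    while stack:
        obj = stack.pop()
        if obj not in seen:
            seen.add(obj)
            stack.extend(obj.refs)
    return seen


def sweep(heap, roots):
    """Tracing: names of heap objects that are no longer reachable."""
    live = reachable(roots)
    return [obj.name for obj in heap if obj not in live]


def demo():
    root, a, b = Obj("root"), Obj("a"), Obj("b")
    heap = [root, a, b]
    root.count = 1               # assumed to be held by a program variable
    assign(root, a)
    assign(a, b)
    garbage_before = sweep(heap, [root])

    root.refs.remove(a)          # the program drops its pointer to a
    freed = []
    release(a, freed)            # counts for a and then b reach zero
    garbage_after = sweep(heap, [root])
    return garbage_before, freed, garbage_after
```

In this model both strategies reclaim the same objects. They differ in when the work happens: reference counting pays on every pointer update, while tracing pays when the collector traverses the graph.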
