bioRxiv preprint doi: https://doi.org/10.1101/558056; this version posted February 22, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

A comparison of three programming languages for a full-fledged next-generation sequencing tool

Costanza, Pascal*, Herzeel, Charlotte*, and Verachtert, Wilfried
imec, ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium
*Equal contributor

February 22, 2019

Abstract

Background: elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity bottleneck in its original implementation language during recent further development of elPrep. We therefore investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use.

Results: The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is better at managing a large heap of objects than reference counting in our case.

Conclusions: Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.

Background

The sequence alignment/map format (SAM/BAM) [1] is the de facto standard in the bioinformatics community for storing mapped sequencing data. There exists a large body of work on tools for processing SAM/BAM files for analysis [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. The SAMtools [1], Picard [2], and GATK [3] software packages developed by the Broad and Sanger institutes are considered to be reference implementations for many operations on SAM/BAM files, examples of which include sorting reads, marking PCR and optical duplicates, recalibrating base quality scores, indel realignment, and various filtering options, which typically precede variant calling.
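To make concrete what such tools keep in memory, the following illustrative Go sketch shows one possible in-memory representation of a SAM alignment record and a parser for its eleven tab-separated mandatory fields. The type and function names (SamRecord, parseSamLine) are hypothetical and do not reflect elPrep's actual data structures; the sketch only indicates the kind of heap objects discussed in the rest of this section.

// Illustrative only: a minimal in-memory representation of a SAM
// alignment line (11 tab-separated mandatory fields); not elPrep's
// actual data structures.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// SamRecord holds the mandatory fields of one SAM alignment line.
type SamRecord struct {
	QName string // query (read) name
	Flag  int    // bitwise flags
	RName string // reference sequence name
	Pos   int    // 1-based leftmost mapping position
	MapQ  int    // mapping quality
	Cigar string // CIGAR string
	RNext string // reference name of the mate
	PNext int    // position of the mate
	TLen  int    // observed template length
	Seq   string // segment sequence
	Qual  string // base quality string
}

// parseSamLine converts one tab-separated alignment line into a record.
func parseSamLine(line string) (*SamRecord, error) {
	f := strings.Split(line, "\t")
	if len(f) < 11 {
		return nil, fmt.Errorf("expected at least 11 fields, got %d", len(f))
	}
	// Numeric conversion errors are ignored for brevity in this sketch.
	flag, _ := strconv.Atoi(f[1])
	pos, _ := strconv.Atoi(f[3])
	mapq, _ := strconv.Atoi(f[4])
	pnext, _ := strconv.Atoi(f[7])
	tlen, _ := strconv.Atoi(f[8])
	return &SamRecord{
		QName: f[0], Flag: flag, RName: f[2], Pos: pos, MapQ: mapq,
		Cigar: f[5], RNext: f[6], PNext: pnext, TLen: tlen,
		Seq: f[9], Qual: f[10],
	}, nil
}

func main() {
	r, err := parseSamLine("read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tFFFFFFFF")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s maps to %s:%d\n", r.QName, r.RName, r.Pos)
}

Every alignment line in a SAM/BAM file corresponds to such a heap-allocated record, so large files translate into very large numbers of live objects.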
Many alternative software packages [4, 5, 6, 7, 8, 9, 10, 12, 14, 15] focus on optimizing the computations of these operations, either by providing alternative algorithms, or by using parallelization, distribution, or other optimization techniques specific to their implementation language, which is often C, C++, or Java.

We have developed elPrep [8, 16], an open-source, multi-threaded framework for processing SAM/BAM files in sequencing pipelines, especially designed for optimizing computational performance. It can be used as a drop-in replacement for many operations implemented by SAMtools, Picard, and GATK, while producing identical results [8, 16]. elPrep allows users to specify arbitrary combinations of SAM/BAM operations as a single pipeline in one command line. elPrep's unique software architecture then ensures that running such a pipeline requires only a single pass through the SAM/BAM file, no matter how many operations are specified. The framework takes care of merging and parallelizing the execution of the operations, which significantly speeds up the overall execution of a pipeline.

In contrast, related work focuses on optimizing individual SAM/BAM operations, but we have shown that our approach of merging operations outperforms this strategy [8]. For example, compared to using GATK4, elPrep executes the 4-step Broad Best Practices pipeline [17] (consisting of sorting, marking PCR and optical duplicates, and base quality score recalibration and application) up to 13x faster on whole-exome data, and up to 7.4x faster on whole-genome data, while utilizing fewer compute resources [8].
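As an illustration of this merging strategy, the following Go sketch fuses several preparation steps into a single function that is applied once per read, so that the number of passes over the data stays constant regardless of how many steps a user requests. The types and names used here (Read, Filter, composeFilters) are invented for illustration and are not elPrep's actual API.

// Hypothetical sketch (not elPrep's actual API): merging several
// preparation steps into one pass over reads held in main memory.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Read is a drastically simplified stand-in for a SAM alignment record.
type Read struct {
	Name string
	Pos  int
	Flag int
}

// Filter transforms or drops a single read; returning false drops it.
type Filter func(*Read) bool

// composeFilters merges several steps into one function so that every
// read is touched only once, no matter how many steps are requested.
func composeFilters(filters ...Filter) Filter {
	return func(r *Read) bool {
		for _, f := range filters {
			if !f(r) {
				return false
			}
		}
		return true
	}
}

func main() {
	reads := []*Read{{"r2", 300, 0}, {"r1", 100, 4}, {"r3", 200, 0}}

	// Two example steps: drop unmapped reads (FLAG bit 0x4) and
	// normalize read names; both run in the same pass.
	pipeline := composeFilters(
		func(r *Read) bool { return r.Flag&0x4 == 0 },
		func(r *Read) bool { r.Name = strings.ToUpper(r.Name); return true },
	)

	kept := reads[:0]
	for _, r := range reads { // the single pass over the data
		if pipeline(r) {
			kept = append(kept, r)
		}
	}

	// Sorting by position is performed afterwards on the in-memory slice.
	sort.Slice(kept, func(i, j int) bool { return kept[i].Pos < kept[j].Pos })
	for _, r := range kept {
		fmt.Println(r.Name, r.Pos)
	}
}

Because the steps are fused into one traversal, adding a further operation to such a pipeline adds work per read, but not another pass over the file.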
All SAM/BAM tools have in common that they need to manipulate large amounts of data, as SAM/BAM files easily take up 10-100 GB in compressed form. Some tools implement data structures that spill to disk when reaching a certain threshold on RAM use, but elPrep uses a strategy where data is split upfront into chunks that are processed entirely in memory to avoid repeated file I/O [16]. Our benchmarks show that elPrep's representation of SAM/BAM data is more efficient than, for example, GATK4, as elPrep uses less memory for loading the same number of reads from a SAM/BAM file in memory [8]. However, since elPrep does not provide data structures that spill to disk, elPrep currently requires a fixed minimum amount of RAM to process a whole-exome or whole-genome file, whereas other tools sometimes allow putting a cap on the RAM use by using disk space instead. Nonetheless, for efficiency, it is recommended to use as much RAM as available [8, 18]. This means that, in general, tools for processing SAM/BAM data need to be able to manipulate large amounts of allocated memory.

In most programming languages, there exist more or less similar ways to explicitly or implicitly allocate memory for heap objects which, unlike stack values, are not bound to the lifetimes of function or method invocations. However, programming languages strongly differ in how memory for heap objects is subsequently deallocated. A detailed discussion can be found in "The Garbage Collection Handbook" by Jones, Hosking, and Moss [19]. There are mainly three approaches:

Manual memory management: Memory has to be explicitly deallocated in the program source code (for example by calling free in C [20]).

Garbage collection: Memory is automatically managed by a separate component of the runtime library called the garbage collector. At arbitrary points in time, it traverses the object graph to determine which objects are still directly or indirectly accessible by the running program, and deallocates inaccessible objects. This ensures that object lifetimes do not have to be explicitly modelled, and that pointers can be more freely passed around in a program. Most garbage collector implementations interrupt the running program and only allow it to continue executing after garbage collection (they "stop the world" [19]) and perform object graph traversal using a sequential algorithm. However, advanced implementation techniques, as employed by Java [21] and Go [22], include traversing the object graph concurrently with the running program while limiting its interruption as far as possible; and using a multi-threaded parallel algorithm that significantly speeds up garbage collection on modern multicore processors.

Reference counting: Memory is managed by maintaining a reference count with each heap object. When pointers are assigned to each other, these reference counts are increased or decreased to keep track of how many pointers refer to each object. Whenever a reference count drops to zero, the corresponding object can be deallocated.¹

elPrep is an open-ended software framework that allows for arbitrary combinations of different functional steps in a pipeline, like duplicate marking, sorting reads, replacing read groups, and so on; additionally, elPrep also accommodates functional steps provided by third-party tool writers. This openness makes it difficult to precisely determine the lifetime of allocated objects during a program run. It is known that manual memory management can contribute to extremely low productivity when developing such software frameworks. See for example the IBM San Francisco project, where a transition from C++ with manual memory management to Java with garbage collection led to an estimated 300% productivity increase [33].

elPrep's performance stems not only from the implementation of individual steps, like parallel sorting or concurrent duplicate marking, but also from the overall software architecture that organizes these steps into a single-pass, multi-threaded pipeline. Since such software-architectural aspects are not covered by the existing literature, it became necessary to perform the study described in this article.

elPrep was originally, up to version 2.6, implemented in the Common Lisp programming language [23]. Most existing Common Lisp implementations use stop-the-world, sequential garbage collectors. To achieve good performance, it was therefore necessary to explicitly control how often and when the garbage collector would run to avoid needless interruptions of the main program, especially during parallel phases. As a consequence, we also had to avoid unnecessary memory allocations, and reuse already allocated memory as far as possible, to reduce the number of garbage collector runs. However, our more recent attempts to add more functionality to elPrep (like optical duplicate marking, base quality