MSc thesis Master’s Programme in Computer Science

Speeding up dynamic vectors

Saska Dönges

April 20, 2021

Faculty of Science, University of Helsinki

Supervisor(s): Assoc. Prof. Simon J. Puglisi

Examiner(s): Assoc. Prof. Simon J. Puglisi, Prof. Veli Mäkinen

Contact information

P. O. Box 68 (Pietari Kalmin katu 5) 00014 University of Helsinki, Finland

Email address: [email protected].fi
URL: http://www.cs.helsinki.fi/

Tiedekunta — Fakultet — Faculty: Faculty of Science

Koulutusohjelma — Utbildningsprogram — Study programme: Master's Programme in Computer Science

Tekijä — Författare — Author: Saska Dönges

Työn nimi — Arbetets titel — Title

Speeding up dynamic bit vectors

Ohjaajat — Handledare — Supervisors

Assoc. Prof. Simon J. Puglisi

Työn laji — Arbetets art — Level: MSc thesis

Aika — Datum — Month and year: April 20, 2021

Sivumäärä — Sidoantal — Number of pages: 39 pages, 12 appendix pages

Tiivistelmä — Referat — Abstract

Bit vectors have many applications within succinct data structures, compression and bioinformatics, among others. Any improvement in bit vector performance translates to improvements in the applications. In this thesis we focus on dynamic bit vector performance. Fully dynamic succinct bit vectors enable other dynamic succinct data structures, for example dynamic compressed strings.

We briefly discuss the theory of bit vectors and the current state of research related to static and dynamic bit vectors.

The main focus of the thesis is on our research into improving the dynamic bit vector implementation in the DYNAMIC C++ library (Prezza, 2017). Our main contribution is the inclusion of buffering to speed up insertions and deletions while not negatively impacting non-modifying operations. In addition, we optimized some of the code in the DYNAMIC library and experimented with vectorizing some of the access operations.

Our code optimizations yield a substantial improvement to insertion and deletion performance. Our buffering implementation speeds up insertions and deletions significantly, with negligible impact on other operations or space efficiency. Our implementation acts as a proof of concept for buffering and suggests that future research into more advanced buffering is likely to increase performance. Finally, our testing indicates that using vectorized instructions in the AVX2 and AVX512 instruction set extensions is beneficial in at least some cases and should be researched further.

Our implementation, available at https://github.com/saskeli/DYNAMIC, should only be considered a proof of concept, as there are known bugs in some of the operations that are not extensively tested.

ACM Computing Classification System (CCS):
Data → Data structures
Theory of computation → Design and analysis of algorithms → Data structures design and analysis

Avainsanat — Nyckelord — Keywords: bit vector, buffering, C++, compression, dynamic, SIMD instructions, vectorization

Säilytyspaikka — Förvaringsställe — Where deposited: Helsinki University Library

Muita tietoja — Övriga uppgifter — Additional information: Algorithms study track

Contents

1 Introduction
2 Definitions and notation
  2.1 Bit vectors
  2.2 O-notation
3 Static bit vectors
  3.1 Simple bit vector implementations
  3.2 Population count vectorization
  3.3 Constant time(ish) Rank and Select
    3.3.1 Rank
    3.3.2 Select
  3.4 Practical mostly static implementation
4 Dynamic bit vectors
  4.1 B-trees
  4.2 Practical fully dynamic implementation
    4.2.1 Access and modification
    4.2.2 Rank and Select
    4.2.3 Insertion and removal
    4.2.4 Space requirement
5 Our contributions
  5.1 Division and modulus bypass
  5.2 Background on write optimization
  5.3 Buffering
    5.3.1 Insertion
    5.3.2 Removal
    5.3.3 Access and modification
    5.3.4 Rank
    5.3.5 Select
  5.4 AVX experimentation
6 Experiments
  6.1 Scaling experiment
  6.2 Mixture test
  6.3 Application testing
  6.4 Vectorization testing
7 Conclusions and future work
  7.1 Optimizing memory allocation
  7.2 Query caching
  7.3 Branchless binary search
  7.4 Bit vector compression
Bibliography
A Experiment results expanded

1 Introduction

Bit vectors are an integral part of many widely used algorithms and data structures. In the simplest case, bit vector implementations, such as std::bitset in C++, can be used as efficient arrays of booleans for use with path finding algorithms or similar. Several compressed and succinct data structures can be built using bit vectors (Navarro, 2016). Any gains in efficiency for the underlying bit vector implementation translate directly to gains for the applications. Further, support for additional operations like Rank and Select can be leveraged in the application. For example, a string that supports random access can be represented in compressed space as a wavelet tree (a specifically built tree with bit vectors as nodes), given bit vectors that support Rank and Select. If modifications, inserts and removals are supported as well, these can also be used in an application to provide, for example, insert and remove operations for compressed strings.

A great deal of effort has been made to optimize static lookup structures for bit vectors (Gog, Beller, et al., 2014). However, dynamic bit vectors can still be optimized to improve the performance of fully dynamic implementations of data structures that rely on them. This thesis is a look at work that has been done so far on dynamic bit vectors, as well as a presentation of some of our own preliminary results, along with some promising avenues for further research.

Our research is based on modifications to the succinct bit vector implementation in the DYNAMIC library presented in Prezza, 2017, where a blocking approach along with a tree structure is used. Similar approaches are presented in at least Zhou et al., 2013, Kärkkäinen et al., 2014, Klitzke and Nicholson, 2016 and Cordova and Navarro, 2016, where the blocking and tree structures are used to enable dynamism, block compression or support structures for Rank and Select. Blocking approaches potentially suffer greatly from memory fragmentation due to the nature of memory allocation for such structures by operating systems.

Our contribution is a modification to the leaves of the succinct bit vector tree structure of the DYNAMIC library. With simple code optimization and the addition of buffering, our leaf implementation significantly speeds up insertion and removal operations for the data structure without significant penalty to non-modifying operations or data structure size.

We have not, as of yet, considered compression of the bit vectors, as the main focus of our research is to speed up the current dynamic implementations with minimal impact on space usage. Compression schemes may be beneficial both in terms of space and run time for some input data. We do not address memory fragmentation either, beyond briefly noting it in the results of some of our experiments. Possible future work related to compression and reducing memory fragmentation is discussed in Section 7.4.

We will start by presenting definitions and the notation used throughout the thesis. After this, some current approaches for static structures are discussed, followed by a presentation of one state of the art implementation for fully dynamic bit vectors. After presenting our research and results in Chapters 5 and 6, conclusions and discussion of possible future directions for research follow in Chapter 7.

2 Definitions and notation

2.1 Bit vectors

In principle, bit vectors are sequences of values (v_0, v_1, ..., v_n) where v_i ∈ {0, 1}. For a given bit vector V ∈ M, where M denotes the set of bit vectors, the operations below are typically defined. The precise implementation of these operations depends on the bit vector implementation.

Access: A function f : M × ℕ → {0, 1} maps f(V, i) = v_i, the value of the bit vector V at position i. This is often denoted V.at(i) or simply V[i] in (pseudo)code. Formally, f(V, i) can be defined to be 0 if v_i is undefined. In practice, querying beyond the length of the bit vector is considered undefined behavior.

Modification: A function f : M × ℕ × {0, 1} → M maps f(V, i, b) = V′ to another bit vector where ∀j ((j = i ⇒ V′[j] = b) ∧ (j ≠ i ⇒ V′[j] = V[j])). In this thesis, the notation V.set(i, v) or V[i] = v will be used in code.

Insertion: A function f : M × ℕ × {0, 1} → M maps f(V, i, b) = V′ to another bit vector where ∀j ((j < i ⇒ V′[j] = V[j]) ∧ (j = i ⇒ V′[j] = b) ∧ (j > i ⇒ V′[j] = V[j − 1])). Typically, the notation V.insert(i, v) is used to denote insertion in code.

Removal: A function f : M × ℕ → M maps f(V, i) = V′ to another bit vector where ∀j ((j < i ⇒ V′[j] = V[j]) ∧ (j ≥ i ⇒ V′[j] = V[j + 1])). Typically denoted V.remove(i) in code.

Rank: A function f : M × ℕ → ℕ where f(V, i) = Σ_{j=0}^{i} v_j, that is, the number of 1-bits in the first i + 1 elements of the bit vector. This is typically denoted V.rank(i).

Select: A function f : M × ℕ⁺ → ℕ where f(V, i) maps to the minimum value n such that Rank(V, n) = i, that is, the index of the i-th 1-bit in the bit vector.
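As a small worked example of the definitions above: for V = (1, 0, 0, 1, 1, 0), V.rank(3) = 2 since positions 0 through 3 contain two 1-bits, V.rank(5) = 3, V.select(2) = 3 since the second 1-bit is at position 3, and V.select(3) = 4.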

The operations above are defined as zero-indexed. This definition leaves V.select(0) undefined, while it could also refer to the location of the first 1-bit in an inverse zero-indexed table of 1-bit locations. Implementations may differ slightly in indexing, especially with regard to Rank and Select. The formal operation definitions above are pure functions where any modification implies creating a new modified bit vector, while practical implementations generally modify data structures in place to avoid copying data.

Additionally, Rank0 and Select0 may be defined, where 0-bits are counted instead of 1-bits. Generally, the same points are valid for the 0-bit counting operations as for the 1-bit operations. These 0-bit operations will not be discussed further in this thesis.

2.2 O-notation

With O-notation the asymptotic behavior of functions can be explored. Specifically, if f(n) ∈ O(g(n)), then f belongs to the set of functions that are not asymptotically more complex than g. More formally, for functions f : ℕ → ℝ and g : ℕ → ℝ,

f(n) ∈ O(g(n)) iff ∃n_0, c (∀n > n_0 (c · g(n) > f(n))), where c, n, n_0 ∈ ℕ. That is, for a large enough value of n, f(n) is no more than g(n) times some constant. Generally, in a horrible abuse of notation, f(n) = O(g(n)) is used interchangeably with f(n) ∈ O(g(n)), and this standardized abuse will be continued in this thesis. The definition above is targeted at computer science applications. In mathematics and physics, the applications for more general functions and series expansions use slightly different notation and definitions. O-notation conveniently removes many constants and lower order terms to simplify comparison of asymptotic time or space complexities, for example 5n log_6(n) + √n = O(n log(n)). O-notation indicates an asymptotic scaling for a function that can be compared with the asymptotic scaling of another function. The same O-complexity shared by two algorithms essentially guarantees that one of the algorithms will not behave radically differently compared to the other one as the size of the input grows. For implementations of algorithms, an O-complexity guarantee is important, but can often hide information about constant factors and lower order terms that have a significant impact on run time at practical input sizes. In addition to O-notation, o-notation will also be used in this thesis. f(n) ∈ o(g(n)) (or equivalently f(n) = o(g(n))) simply means that f(n) is asymptotically less than g(n), that is

lim_{n→∞} f(n)/g(n) = 0.

For example, if f(n) = o(n), f(n) is strictly sublinear and n + f(n) → n as n → ∞. Again, like O-notation, the o-notation may hide factors and terms that may be important for practical run time or data structure size, while still being very important for scaling guarantees as n is essentially unbounded. Other complexity notation is also used in computer science (Θ, Ω, ω). There is also much more to be said about the O- and o-notations that is not particularly relevant for this thesis. More information on these can be found in, for example, Cormen et al., 2009, Sedgewick and Flajolet, 1996 or Navarro, 2016.

3 Static bit vectors

3.1 Simple bit vector implementations

The simplest implementation of a bit vector is an array of booleans or a set of integers provided by any popular programming language. For example, bool* bit_vector = new bool[n]; in C++ would create a “bit vector” with n elements. This data structure supports access and modification in constant time using normal indexing. Insertion, removal, Rank and Select can be implemented in linear time. In common programming languages the boolean primitive takes 8 bits of space, meaning a bit vector with n elements takes 8n + o(n) bits of space. An implementation like this is very fast in practice for small values of n, as long as access and modification are the main required operations.

A common optimization is to pack the data into the bits of a contiguous array of 32- or 64-bit integers. In C++, uint64_t* bit_vector = new uint64_t[1 + n / 64]; provides a pointer to a contiguous area of memory with n to n + 63 usable bits, that takes up at most n + o(n) bits of space. Constant time access to bit i with (uint64_t)1 & (bit_vector[i >> 6] >> (i & 63)); is still possible, as well as constant time setting of bit i to value v with

bit_vector[i >> 6] &= ~((uint64_t)1 << (i & 63)); // clear bit i
bit_vector[i >> 6] |= ((uint64_t)v << (i & 63));  // set bit i to v

Insertion, removal, Rank and Select take linear time for this version as well. There are some trade-offs between these two implementations. The packed version can benefit from vectorized population counting for significant speedups to Rank, and possible improvements to Select (see Section 3.2). The packed version results in a slight overhead in constant time operations while the unpacked version can be slightly faster for small data structure sizes due to byte-aligned access and modify operations. As the data structure size increases, the improved cache efficiency of the packed version will start to pull ahead in constant time operations, at least for queries with independent locality. Clearly, there is also a fairly significant (even if not asymptotic) difference in space requirement, with approximately one bit used per stored bit in the packed version, versus the approximately 8 bits per stored bit in the byte-aligned implementation.
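To make the linear-time operations concrete, the following is a small sketch of ours (not taken from any particular library) of a Rank query over the packed representation; it scans whole words with the population count intrinsic discussed in the next section and finishes by masking the partial word.

#include <cstdint>

// Linear-time Rank over a packed bit vector: counts the 1-bits in
// bit_vector[0..i] (inclusive, matching the definition in Section 2.1).
uint64_t rank(const uint64_t* bit_vector, uint64_t i) {
    uint64_t count = 0;
    uint64_t word = i >> 6;
    for (uint64_t w = 0; w < word; ++w) {
        count += __builtin_popcountll(bit_vector[w]); // full words
    }
    // Partial word: keep bits 0..(i mod 64) and count them.
    uint64_t mask = (i & 63) == 63 ? ~0ULL
                                   : (((uint64_t)1 << ((i & 63) + 1)) - 1);
    count += __builtin_popcountll(bit_vector[word] & mask);
    return count;
}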

3.2 Population count vectorization

Counting 1-bits has a multitude of applications in cryptography, information retrieval and bioinformatics, among others. Unsurprisingly, most common PC processor architectures support efficient population counting for computer words of up to 64 bits in width. For the GNU family of compilers, the library function __builtin_popcountll will use architecture specific intrinsics for calculating the population count of 64-bit integers. These intrinsics were added to both AMD and Intel microarchitectures in 2007 (Fog, 2020). Modern implementations typically have a practical pipelined throughput of one operation per clock cycle or better.

Further improvements to population counting are possible with Single Instruction, Multiple Data (SIMD) instructions (Muła et al., 2017). Unfortunately, the Streaming SIMD Extensions (SSE) and the first and second versions of the Advanced Vector Extensions (AVX and AVX2) available on common microprocessors do not directly support population counting for 128-, 256- and 512-bit words. The AVX-512 extension does support population count instructions for all of these vector sizes and is the fastest, with some caveats. The AVX-512 extension is available only on top-of-the-line Intel microprocessors, where running AVX-512 instructions lowers the maximum clock speed. The native vectorized population count instructions also have a comparatively high latency, so the benefits of vectorized population count intrinsics compared to single word population count instructions do not scale linearly with data size.

Even without the native AVX population count intrinsics available, SIMD instructions can be utilized to calculate population counts with a higher throughput than with the 64-bit intrinsics made available by __builtin_popcountll. The comparatively high latency of the 256- or 128-bit instructions makes these approaches slightly slower for small blocks of data, as can be seen in Table 3.1.
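To illustrate how population counts can be vectorized without a native wide popcount instruction, the following is a minimal sketch of the nibble-lookup approach described by Muła et al., 2017, assuming AVX2 support. It is our illustration, not the libpopcnt implementation, and it omits the Harley-Seal accumulation that gives the best throughput on large arrays.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// AVX2 population count using a 4-bit lookup table (Muła-style sketch).
uint64_t popcount_avx2(const uint64_t* data, size_t n_words) {
    const __m256i lookup = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 4 <= n_words; i += 4) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        __m256i lo = _mm256_and_si256(v, low_mask);
        __m256i hi = _mm256_and_si256(_mm256_srli_epi32(v, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lookup, lo),
                                      _mm256_shuffle_epi8(lookup, hi));
        // Sum the per-byte counts into 64-bit lanes to avoid overflow.
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
    }
    uint64_t total = (uint64_t)_mm256_extract_epi64(acc, 0) +
                     (uint64_t)_mm256_extract_epi64(acc, 1) +
                     (uint64_t)_mm256_extract_epi64(acc, 2) +
                     (uint64_t)_mm256_extract_epi64(acc, 3);
    for (; i < n_words; ++i) {
        total += __builtin_popcountll(data[i]); // scalar tail
    }
    return total;
}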

3.3 Constant time(ish) Rank and Select

Constant time lookups for Rank and Select are possible with the help of precomputed support data structures taking up o(n) space (see e.g. Vigna, 2008). Here, the actual constant time and o(n) space solutions are touched upon, but we will focus on solutions requiring n/k space, where k ≥ 1, and either constant O(1) time or practically fast linear O(n/k) time complexity.

Array size    popcnt    AVX2 Muła    AVX2 HS
256 B         1.12      1.38         –
512 B         1.06      0.94         –
1 kB          1.03      0.81         0.69
2 kB          1.01      0.73         0.61
4 kB          1.01      0.70         0.54
8 kB          1.01      0.69         0.52
16 kB         1.01      0.69         0.52
32 kB         1.01      0.69         0.52
64 kB         1.01      0.69         0.52

Table 3.1: Number of cycles per 64-bit population count calculation for different data sizes. Tabulated for native 64-bit popcnt, the Muła function utilizing AVX2 instructions and the Harley-Seal algorithm using AVX2 instructions (Muła et al., 2017).

3.3.1 Rank

A fairly simple and practical way to implement constant time Rank support for a bit vector BV of n bits is as follows. An array rbv of ⌊n / (k · 2^12)⌋ 64-bit integers is allocated, where k ∈ ℕ is a tuning parameter. For each 64-bit element i of the rbv array, the value is set to BV.rank((i + 1) · 2^12 · k). This structure is ⌊n / (k · 2^6)⌋ ≤ n bits in size. These cumulative sums can be used to calculate the result of a Rank query in O(k) = O(1) (for a constant k) time using the algorithm outlined in Figure 3.1.

def rank(i):
    r = 0
    block = i // (2**12 * k)
    if block > 0:
        r = rbv[block - 1]
    for j in range(block * 2**12 * k, i):
        r = r + bv[j]
    return r

Figure 3.1: Pseudocode for a constant time rank query

With practical values for k and efficient use of population counting instead of a naive loop, the Rank queries implemented as described above are very fast in practice. If we let the value of k be on the order of log(n) and precalculate solutions in an array with n / (2^12 · log(n)) elements, the support structure takes o(n) space and supports O(log(n)) time rank queries. Using the “Four Russians” algorithm (Aho and Hopcroft, 1974), the asymptotic performance of the support structure for Rank queries can be further improved to o(n) space while supporting constant time execution of queries. Experiments comparing constant time and o(n) space implementations with constant time and linear space implementations have found that for modern data structure sizes, the linear space implementations are both faster and more space efficient (Gog, Beller, et al., 2014).

3.3.2 Select

Given the solution for constant time Rank queries, an O(log(n)) solution to Select queries can be reached simply by binary searching over the bit vector using Rank queries. This is often fairly fast in practice and takes no additional space beyond the structure required for the Rank queries (González et al., 2005). Constant time solutions for Select that utilize o(n) Select-specific additional data structures also exist (Clark, 1997; Vigna, 2008). However, guaranteed constant time Select structures typically require a significant amount of extra space (even if asymptotically o(n)), and the constant factors for the Select operation are high enough for most practical data structure sizes to often make binary searching with Rank or simply precomputing the solutions for sparse arrays more efficient.

Additionally, a “not quite constant time” approach is presented in Vigna, 2008. Essentially, results are precomputed for a subset of possible Select queries. These precomputed results are utilized along with linear population count scanning to calculate results. Technically, since there are no guarantees on the distribution of 1-bits, this solution cannot be considered constant time, unless the bit vector is sparse enough to allow precalculation of all Select queries. Still, implementations based on this approach seem to be the fastest Select implementations in practice (Gog, Beller, et al., 2014).
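For illustration, the binary-search approach can be sketched as follows (our sketch, not code from a specific library; the rank callback and its exclusive counting convention are assumptions made for clarity, while the thesis otherwise uses an inclusive Rank).

#include <cstdint>
#include <functional>

// O(log n)-time Select by binary search over Rank. rank(k) is assumed to
// return the number of 1-bits in positions [0, k). Returns the position of
// the i-th 1-bit (i >= 1); behavior is undefined if fewer than i 1-bits exist.
uint64_t select1(uint64_t i, uint64_t n,
                 const std::function<uint64_t(uint64_t)>& rank) {
    uint64_t lo = 0, hi = n;          // candidate positions in [lo, hi)
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (rank(mid + 1) < i) {
            lo = mid + 1;             // too few 1-bits up to and including mid
        } else {
            hi = mid;                 // the i-th 1-bit is at mid or earlier
        }
    }
    return lo;
}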

3.4 Practical mostly static implementation

The ideas in this chapter are implemented in the sdsl-lite C++ library (Gog, Beller, et al., 2014). Uncompressed bit vectors in the library support access and modification. For Rank and Select queries, additional separate support structures of o(n) size are required. Insertions and removals are not supported, and modification of the bit vector requires rebuilding the support structures. The library is practical, well tested (Navarro, 2016) and used in multiple research applications (for example Sirén et al., 2019; Alipanahi et al., 2020; Gog, Kärkkäinen, et al., 2019; Muggli et al., 2017).

4 Dynamic bit vectors

For static bit vector implementations, insertions and removals require rewriting the entire bit vector as well as any support structures (in the worst case). For applications where insertions and removals are common and interleaved with other queries, a simple contiguous block of bits is impractical. A solution is needed where contiguous blocks are short enough to not make rewrites prohibitively expensive. Such a solution is presented in this chapter.

4.1 B-trees

The original intention of the B-tree was to provide an efficient dynamic data structure for large amounts of data residing partially in secondary storage, typically disk or tape drives (Cormen et al., 2009; Comer, 1979; Bayer and McCreight, 1970). The structure is specifically designed to benefit from the way data is loaded as fairly large pages from disk (or tape). This kind of memory locality is desirable at other levels of the memory hierarchy as well.

The B-tree is a direct alternative to search tree structures like the AVL tree or red-black tree. Binary trees are reference based structures and generally perform poorly in terms of memory locality. B-trees avoid this problem by storing blocks of data with multiple child pointers (Figure 4.1). For frequent queries, this is a significant benefit with regards to memory performance. The root is practically guaranteed to be cached and other blocks that are frequently accessed are also likely to be. While the asymptotic performance of the B-tree is the same as that of AVL or red-black trees, practical performance is typically significantly better.

The search for high performance dynamic data structures has led to the development of several B-tree-like data structures due to the efficiency of B-trees in hierarchical memory systems (Ferragina and Venturini, 2016; Awad et al., 2019, among others). B-trees are also the basis of the dynamic bit vector by Prezza, 2017, described in the next section. Finally, the Bε-tree of Bender et al., 2015, which buffers updates at nodes in the tree instead of committing them immediately, has been used successfully in database and file systems, and is the inspiration for our buffering efforts.

Figure 4.1: Comparison of red-black tree and B-tree (with B = 3) structures with keys (24, 72, 33, 34, 40, 98, 75, 4, 25, 68, 70, 71, 94, 78, 49, 94, 57, 44, 99, 16) inserted in order.

4.2 Practical fully dynamic implementation

The B-tree-like implementation of succinct bit vectors in the DYNAMIC library (Prezza, 2017) has succinct bit vector blocks in the leaves and cumulative sums, subtree sizes and child pointers in the internal nodes. This differs from a “normal” B-tree in that none of the actual data is stored in the internal nodes. The internal nodes take the place of the additional support structures for Rank and Select. This library is used in multiple implementations related to recent studies in dynamic succinct data structures (Alipanahi et al., 2020; Sirén et al., 2019).

In addition to the branching factor B, which is a tuning parameter, another parameter for leaf sizes is used. In general, these tuning parameters have lower values than those used for secondary storage optimized B-trees due to the smaller page size of processor caches compared to main memory. The DYNAMIC library has a default branching factor of 16 and a leaf size of 8192 bits. This leads to node sizes on the order of one to a few kB, which is approximately the size of main memory pages on modern microarchitectures. The tuning parameters also have a significant impact on the performance of different operations. Operations that have to scan leaves benefit from small leaf sizes while operations that only perform constant time operations on leaves benefit from the shallower tree that follows from increased leaf size. Similarly, the branching factor allows a trade-off between tree depth and efficiency of operations on the internal nodes. Figure 4.2 is a simplified illustration of the data structure implemented in DYNAMIC.

Legend: element overhead (O(1) for each element); 1-bit; 0-bit; cumulative sums, child sizes and child pointers.

Figure 4.2: Visualization of data structure as built by the DYNAMIC library implementation with branching factor four and leaf size 2048. These values guarantee that internal nodes besides the root have between four and eight children, and leaves have between 2048 and 4096 elements. The element sizes are scaled to give an intuitive indication of the relative sizes of elements in the data structure with default values for branching factor and leaf size and an input 16 times larger. The input data is the first 32770 binary digits of pi. This image does not reflect in-memory layout and does not account for overhead due to memory fragmentation.

In transitioning from a contiguous memory block model (used in static bit vector implementations) to a tree structure, almost all of the operations become dependent on the tree depth, which is O(log_B(n/L)), where B is the branching factor, L the leaf size in bits, and n the data structure size. For practical values, the tree height will remain small (< 10). Still, traversing the internal nodes implies branching and following pointers, which is costly on modern architectures.

Access, modification, insertion and removal operations essentially only target the leaves, using the child sizes in the internal nodes to traverse to the correct leaf. When a leaf is modified and possibly split, the internal nodes are updated and possibly rebalanced accordingly.

Rank and Select queries additionally use the cumulative sums and do calculations when traversing the tree. Rank traverses to the correct leaf using the child sizes and calculates intermediate Rank results at the branching points, which are summed together with the Rank result from a leaf to generate the final result for the Rank query. Conversely, Select traverses to the correct leaf using the cumulative sums, while collecting child sizes at branching points, which are summed together with the Select result from the leaf. Figure 4.3 is a simple demonstration of how tree traversal is done for queries.


Figure 4.3: Example of tree traversal for queries. A Rank(700 000) query targets the second child of the root since 1 061 957 > 700 000 > 642 901. The query gets translated to a 314 755 + Rank(700 000 − 642 901) query for the child node based on the cumulative sum and child value of the first child of the root. At the internal node, the second branch is targeted since 59 077 > 700 000 − 642 901 > 26 931, and the query gets translated further to 314 755 + 26 247 + Rank(700 000 − 642 901 − 26 931) for the leaf. At the leaf Rank(30 168) is calculated by counting 1-bits up to the desired position.
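To make the traversal concrete, the following sketch of ours shows how a Rank query is translated at a single internal node before recursing into a child; the member names are hypothetical and are not those used in the DYNAMIC library.

#include <cstdint>
#include <cstddef>
#include <vector>

// One internal node: cumulative child sizes (in bits) and cumulative sums
// of 1-bits per child, as in Figure 4.3.
struct InternalNode {
    std::vector<uint64_t> child_sizes; // cumulative bits in children 0..c
    std::vector<uint64_t> child_sums;  // cumulative 1-bits in children 0..c
};

// Result of one traversal step for rank(i): which child to descend into,
// the translated position for that child, and the count accumulated so far.
struct RankStep { size_t child; uint64_t position; uint64_t seed; };

RankStep rank_step(const InternalNode& node, uint64_t i) {
    size_t c = 0;
    while (node.child_sizes[c] <= i) ++c;  // linear branch selection
    uint64_t prev_size = c ? node.child_sizes[c - 1] : 0;
    uint64_t prev_sum  = c ? node.child_sums[c - 1]  : 0;
    return {c, i - prev_size, prev_sum};   // recurse into child c
}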

The time complexity of operations on the tree is given as O(log(n)) in Prezza, 2017. This hides the dependence on the tuning parameters that potentially significantly impact the practical run time. Below is some discussion on the complexity of operations that is part of the basis for our research.

4.2.1 Access and modification

Both access and modification require only constant time in the leaves, thus the asymptotic time complexity is entirely based on traversal and updating of the internal nodes. Branching in the internal nodes is linear in B and updating the cumulative sums on modification is also a linear-time operation. Thus, the asymptotic time complexity is given by O(B · log_B(n/L)) = O(log(n)) for constant B and L parameters. We note that branch selection using cumulative sums in the internal nodes could be an O(log_2(B)) operation using binary search instead of linear scanning. However, binary search typically has higher constant factors, so a logarithmic implementation may not be faster for practical branching factors (from 16 to a few hundred at most). Exploring this potential trade-off is a possible topic for further work. Both access and modification naturally benefit from increasing leaf sizes as this flattens the tree and the operations at the leaf level remain constant time.

4.2.2 Rank and Select

Internal branch selection is essentially the same for Rank and Select operations as for access operations. Additionally, linear time in L is spent at the leaf to calculate population counts. Again, the asymptotic performance of O(B · log_B(n/L) + L) simplifies to O(log(n)), but preprocessing the leaves for faster Rank and/or Select operations or use of more advanced vectorization could improve performance.

4.2.3 Insertion and removal

Insertion and removal typically require rewriting a leaf, and while the complexity of the operations is defined by the same O(log_B(n/L) + L), the impact of the linear O(L) term is significant, since rewriting and potential reallocation is costly on modern architectures. This does not account for operations which cause rebalancing of the tree structure. Generally, rebalancing and leaf splitting are amortized over the simpler insertion and removal operations as long as leaf sizes and branching factors are big enough. Generally, insertions and removals benefit from smaller leaf sizes complemented by a higher branching factor to keep the tree shallow.

4.2.4 Space requirement

Both Prezza, 2017 and the DYNAMIC library documentation quote the practical space requirement of the data structure as approximately 1.2 bits per bit stored. We suspect that this space requirement is based on an earlier implementation where removal had not yet been considered. Our testing indicates that sequences including removals may push the space requirement significantly beyond 1.2 bits per stored bit. Observations of 1.5 bits per bit were not uncommon, while the expected space usage tended to be close to the quoted 1.2 bits per bit. This difference is likely due to allowing leaf sizes to fall below the L parameter to avoid churn in the tree structure for repeated insertion and removal operations.

5 Our contributions

Our research has been focused on optimizing the leaves of the B-tree, which are, essentially, simple contiguous bit vectors, where insertions and removals require rewriting the entire leaf following the insertion or removal point.

Our main contribution has been to examine the possible benefits of insertion and removal buffering at the leaf nodes to reduce the number of rewrites. Additionally, we have found a set of division and modulus calculations in the DYNAMIC library that cannot be optimized away by the compiler but can be manually circumvented. We have also done some preliminary testing on vectorizing Rank and Select operations at the leaf level. Our leaf implementation is available in our fork of DYNAMIC on GitHub¹.

5.1 Division and modulus bypass

In the templated DYNAMIC library, a generic leaf class is used for the bit vector implementation, which necessitates (for bit vectors) redundant integers-per-word and offset calculations on data structure access. Since the bits-per-integer (width_) value is stored in the leaf node as an object member, the compiler cannot optimize away the redundant calculations in the code (Figure 5.1).

int_per_word_ = 64 / width_;
...
uint64_t word_nr = i / int_per_word_;
uint8_t pos = i - int_per_word_ * word_nr;

Figure 5.1: Example of divisions that cannot be optimized by the compiler.

For an implementation where leaves are always succinct bit vectors, these division and modulus calculations are completely redundant. Our own leaf implementation omits these calculations. This optimization should yield a significant constant-factor improvement for insertion and removal operations, and we assume this simple change is responsible for the raw speed-up of our unbuffered implementation compared to the leaves of the DYNAMIC library.
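Concretely, when a leaf always stores one-bit symbols, the generic offset computation of Figure 5.1 collapses to a shift and a mask, which the compiler can emit as single instructions. The sketch below is ours and the names are illustrative; it is not the exact code of our leaf implementation.

#include <cstdint>

// Locate bit i in a packed array of 64-bit words without division or modulus.
inline void locate(uint64_t i, uint64_t& word_nr, uint8_t& pos) {
    word_nr = i >> 6;            // i / 64
    pos = (uint8_t)(i & 63);     // i % 64
}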

¹ https://github.com/saskeli/DYNAMIC

5.2 Background on write optimization

Insertion and removal operations on contiguous data blocks are linear in the block size, since at least the entirety of the block beyond the insertion or removal position needs to be rewritten to keep the block contiguous, even if full reallocation is not required. For B-trees with big node sizes this makes insertions and removals significantly slower than read operations. A significant improvement to the cost of modifications can be made by amortizing the operations. In practice this can be done by buffering modification operations in part of the node blocks (Bender et al., 2015). This does have a negative impact on read operations (such as access), since the number of branch elements in the internal nodes decreases as part of the space is dedicated to a buffer, making the tree deeper, and hence root-to-leaf traversals longer. Additionally, there is some overhead due to scanning the buffers when traversing the tree.

For big B-trees with node sizes on the order of megabytes and significant portions of the data structure in secondary memory, the buffering has a very big positive impact on overall performance. The same principles hold for smaller data structures that fit entirely in main memory and have smaller node sizes.

5.3 Buffering

The main goal of our buffered implementation was to speed up execution times for modification without incurring prohibitive penalties for non-modifying operations. Our buffering approach is simply to allocate a 32·b-bit array for each leaf (where b is the buffer size), where insertion and removal operations are buffered. Each 32-bit buffer element contains the insert/removal location as well as the needed data about the operation. The insert/removal location is encoded in 24 bits and limits the maximum leaf size to 2^24 − 1. Experiments show that a leaf size of 10^6 is sufficient even for operations that most benefit from large leaf sizes, so this is not likely to be a problem in practice. The buffer element size of 32 bits was selected to align with a common power-of-two element size. A 16-bit element size would have left at most 13 bits for storing the insert/removal location, which was deemed too limiting. A 24-bit element size would still align elements to byte boundaries, while providing a sufficient 20 or 21 bits for position storage, but departing from common element sizes may still reduce access speeds to buffer elements, as well as being slightly harder to implement. Optimizing buffer element storage is still work in progress. Figure 5.2 illustrates the space used by the buffers as a comparison to Figure 4.2.
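As an illustration of the element encoding, one possible layout of a 32-bit buffer element is sketched below. The 24-bit index field follows the description above; the exact placement of the type flag and value bit is our assumption for illustration, not necessarily the layout used in our implementation.

#include <cstdint>

// Hypothetical 32-bit buffer element: 24-bit position, operation type and value.
struct BufferElement {
    uint32_t raw;

    uint32_t index() const { return raw & 0x00FFFFFFu; }    // bits 0..23
    bool is_insertion() const { return (raw >> 31) & 1u; }  // bit 31
    bool value() const { return (raw >> 30) & 1u; }         // bit 30

    static BufferElement make(uint32_t index, bool insertion, bool value) {
        return { (index & 0x00FFFFFFu)
                 | ((uint32_t)value << 30)
                 | ((uint32_t)insertion << 31) };
    }
};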

Legend: element overhead (O(1) for each element); buffer; 1-bit; 0-bit; cumulative sums, child sizes and child pointers.

Figure 5.2: Visualization similar to Figure 4.2 but with the additional space allocated to buffering highlighted.

The buffer is kept sorted by always inserting new elements in the correct position. This makes non-modifying access operations faster and simpler to implement. The trade-off is that the total cost of buffer editing for b successive modifying access operations is O(b^2) instead of O(b log(b)). For practical values of b this has not been observed to have any negative impact. This is likely due to the very limited sizes of the buffers. The small constant factors associated with incremental insertions and using memmove to shift elements seem to offset the theoretical asymptotic advantage of only sorting when committing the buffer. Also, assuming a uniform random access pattern, the expected number of buffer elements that need to be modified on insert and remove operations is b/2 for a sorted buffer and b for an unsorted buffer. While uniformly random access patterns may not be common in practice, this illustrates the effect of only having to consider part of the buffer for most of the operations.
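For illustration, inserting a new element into the sorted buffer only needs a single memmove of the tail, as sketched below; buffer and buffer_count are assumed names and the element type matches the sketch above.

#include <cstdint>
#include <cstring>

// Insert element at position idx of a sorted buffer holding buffer_count
// 32-bit elements, shifting the tail one slot to the right.
void insert_buffer(uint32_t* buffer, uint32_t& buffer_count,
                   uint32_t idx, uint32_t element) {
    std::memmove(buffer + idx + 1, buffer + idx,
                 (buffer_count - idx) * sizeof(uint32_t));
    buffer[idx] = element;
    ++buffer_count;
}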

No changes have been made to the internal working of the tree structure. As such the operation descriptions and pseudocode presented here only consider the effect of buffering on operations at the leaf level.

5.3.1 Insertion

Given an insertion position k and a boolean value v, the buffer is updated as follows. For any existing buffer element with an index value k′ > k, the index of the element is set to k′ + 1. For existing insertions in the buffer where the index value k′ = k, the index is also updated to k′ + 1. For removals with index value k′ = k, nothing is done. A new element is created and added into position in the buffer such that the new element is after any elements with index values k′ < k and after any buffered removal where k′ = k, and before any elements where the new index value k′ > k. The procedure is documented in Figure 5.3.

def insert(i, v):
    idx = buffer_count  # Number of elements in buffer.
    for element in reversed(buffer):
        if element.index > i or (element.index == i and element.is_insertion):
            element.index += 1
            idx = idx - 1
        else:
            break
    # idx is now the desired position for the new element in the buffer.
    if idx == buffer_count:
        append_buffer(i, insertion, v)
    else:
        insert_buffer(idx, i, insertion, v)
    if buffer_count == buffer_size:
        commit_buffer()

Figure 5.3: Pseudocode for the buffered insertion procedure

It would be possible to change an existing removal for the same location to a modify operation and commit this modification immediately, enabling the remove operation to be cleared. This has not been shown to be practical in testing to date, as the overhead for these checks seems to outweigh the benefit of more efficient buffer usage.

5.3.2 Removal

Given a removal position k, for any existing buffer element with an index value k′ > k, the index is changed to k′ − 1. If an insertion to index k is found, the insertion is cleared, and the procedure is finished. If no such insertion exists, a new removal operation is added to the buffer. The procedure is detailed in Figure 5.4.

def remove(i):
    x = value_at(i)
    idx = buffer_count  # Number of elements in buffer.
    for element in reversed(buffer):
        if element.index == i:
            if element.is_insertion:
                # An existing insertion can be removed to handle the removal.
                delete_buffer_element(element)
                return
            else:
                break
        elif element.index > i:
            element.index -= 1
        else:
            break
        idx = idx - 1
    # idx is now the desired position for the new element in the buffer.
    if idx == buffer_count:
        append_buffer(i, removal, x)
    else:
        insert_buffer(idx, i, removal, x)
    if buffer_count == buffer_size:
        commit_buffer()

Figure 5.4: Pseudocode for the buffered removal procedure

For removals, the overhead of “cancelling out” a buffered insertion with a removal is minimal. No additional offset calculations are required, and insertion and removal to the same position in the buffer incur the same cost for shuffling the buffer with memmove, as the same number of elements need to be shifted one position to the left or right.

5.3.3 Access and modification

The unbuffered access pattern of words[index >> 6] >> (index & 63) does not directly work with buffered leaves, since any buffered insertion before index effectively decrements the location of the desired bit by one, while a buffered removal increments the desired index. This must be accounted for while scanning the buffer, so that the access to the underlying succinct bit vector targets the correct location.

Given an access or modification position k, the buffer is scanned from the start and for each element with index value k′ < k, an offset into the underlying data structure is incremented or decremented based on the buffered operation. If an insertion with index value k is encountered, the value of that buffered insertion can be returned or modified, completing the operation. If no such insertion was encountered in the buffer, the calculated offset is applied to the original location (k := k + offset), and the operation is carried out at the offset position exactly as it would be in an unbuffered implementation. The access operation is detailed in Figure 5.5, and the modification operation is very similar.

def value_at(i):
    offset_index = i
    for element in buffer:
        if element.index == i:
            if element.is_insertion:
                return element.value
            offset_index = offset_index + 1
        elif element.index < i:
            offset_index = offset_index + (-1 if element.is_insertion else 1)
        else:
            break
    return 1 & (words[offset_index >> 6] >> (offset_index & 63))

Figure 5.5: Code for the buffered access procedure

5.3.4 Rank

Given a position k, the buffer is scanned from the start while keeping track of the change in offset and the values of buffered operations. After all relevant elements of the buffer have been scanned, the population count of the underlying data structure is queried up to the offset location k + offset, and this population count is added to the seeding population count from scanning the buffer.

def rank(n):
    offset_index = n
    count = 0
    for element in buffer:
        if element.index >= n:
            break
        if element.is_insertion:
            offset_index = offset_index - 1
            count = count + element.value
        else:
            offset_index = offset_index + 1
            count = count - element.value
    target_word = offset_index >> 6
    for i in range(target_word):
        count = count + popcount(words[i])
    count = count + popcount(words[target_word] & ((1 << (offset_index & 63)) - 1))
    return count

Figure 5.6: Pseudocode for the buffered Rank procedure

The population count of the underlying bit vector is calculated using the 64-bit popcnt machine instruction by default. The code for the default buffered Rank operation is detailed in Figure 5.6. See Section 5.4 for efforts to optimize Rank and Select operations using AVX.

5.3.5 Select

Given a number m, the buffer and underlying data structure are scanned in parallel, one 64-bit word and its associated buffer elements at a time. Once the population count for the scanned sections is greater than or equal to m, the leaf is scanned backwards one element at a time with access operations, until the correct location has been found. See Figure 5.7 for the detailed implementation. Efforts to optimize the Select operation with binary search and AVX are detailed in the next section.

5.4 AVX experimentation

Muła et al., 2017, describe efficient methods for population counting on modern CPU architectures. A practical implementation of this, libpopcnt, has been made available online¹.

¹ https://github.com/kimwalisch/libpopcnt (accessed 3 March 2021)

def select(n):
    population = 0
    position = 0
    current_buffer = buffer[0]
    position_offset = 0
    for w in words:
        population = population + popcount(w)
        position = position + 64
        while current_buffer and current_buffer.index < position:
            if current_buffer.is_insertion:
                population = population + current_buffer.value
                position = position + 1
                position_offset = position_offset - 1
            else:
                population = population - (
                    1 & (words[(current_buffer.index + position_offset) >> 6]
                         >> ((current_buffer.index + position_offset) & 63)))
                position = position - 1
                position_offset = position_offset + 1
            current_buffer = current_buffer.next_element()
        if population >= n:
            break
    while population >= n:
        position = position - 1
        population = population - value_at(position)
    return position

Figure 5.7: Pseudocode for the buffered Select procedure

We utilized this provided implementation in an attempt to optimize Rank and Select operations at the leaf level. We hypothesize that optimizations may be beneficial with the 256-bit vectors provided by the AVX2 architecture extension, and that the benefits would likely be greater with AVX512 instructions available. Unfortunately, the benchmarking systems currently available to us only support AVX2. AVX2 instructions do not include native population counting, but efficient population counting can still be done faster than with __builtin_popcountll for sufficiently large bit vectors. Since the AVX512 architecture extension provides native population count intrinsics for 128-, 256- and 512-bit words, the expectation is that running on architectures supporting AVX512 instructions would provide additional speedups.

The AVX optimization of the Rank operation detailed in Figure 5.6 simply substitutes the loop of 64-bit __builtin_popcountll calls with a call to the popcnt function provided by libpopcnt.

This is a straightforward substitution, and results should be positive if the blocks to be population counted are sufficiently large.

Since the vectorized population count scales well with the increase in vector size, the vectorized Select implementation was changed to one where the correct target word is computed using binary search instead of probing iteration. This is achieved by calculating the population count for an initial search target using the provided popcnt function, followed by iteratively adding or removing the change in population count as the search target changes. This is essentially an optimized binary search with Rank operations. This implementation adds significant overhead in the binary search, but the hope is that this is offset by the speed of the population counts.
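The following sketch of ours illustrates the idea; popcnt_words stands in for the libpopcnt call and the helper names are assumptions. The running population count is updated by counting only the words between consecutive binary search midpoints instead of recounting a prefix at every probe.

#include <cstdint>
#include <cstddef>

// Stand-in for a vectorized population count over a range of 64-bit words.
static uint64_t popcnt_words(const uint64_t* p, size_t n_words) {
    uint64_t c = 0;
    for (size_t i = 0; i < n_words; ++i) c += __builtin_popcountll(p[i]);
    return c;
}

// Binary search with incremental Rank: index of the 64-bit word containing
// the n-th 1-bit (n >= 1), assuming at least n 1-bits exist in the leaf.
size_t select_target_word(const uint64_t* words, size_t n_words, uint64_t n) {
    size_t lo = 1, hi = n_words;                // candidate prefix lengths
    size_t mid = lo + (hi - lo) / 2;
    uint64_t count = popcnt_words(words, mid);  // 1-bits in words[0, mid)
    while (lo < hi) {
        if (count < n) lo = mid + 1; else hi = mid;
        size_t next = lo + (hi - lo) / 2;
        if (next > mid) count += popcnt_words(words + mid, next - mid);
        else            count -= popcnt_words(words + next, mid - next);
        mid = next;
    }
    return lo - 1;                              // word holding the n-th 1-bit
}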

6 Experiments

In this chapter we present experiments aimed at determining the effect of our changes to the leaf implementation. First, we ran simple scaling experiments to determine the practical performance of the data structures as a function of data structure size, and whether the effect of buffering on operations supports our hypothesis of speeding up modifications while slightly slowing down lookups. After this we ran mixture tests in an attempt to quantify the trade-off between speed for modification versus speed for lookup. Finally, we did test runs with some applications that utilize dynamic bit vector implementations to verify that our implementation works correctly and to get “real world” performance comparisons to the implementation presented in Prezza, 2017 and implemented in the DYNAMIC library.

Experiments were run on multiple systems in an attempt to verify that the results are not architecture dependent. The main results and experimental procedures are presented in this chapter, with additional results detailed in Appendix A for completeness. All of our test systems had modern Linux based operating systems, a minimum of 16 GB of RAM and negligible system load when running the tests. Code was compiled using the GNU Compiler Collection (GCC) version 9.3 or later with the -Ofast and -march=native compiler optimization options. The systems used consist of the following:

1. Desktop PC with an Intel® Core™ i7-4790 CPU with 256 kB 8-way set associative L1D cache, 1024 kB 8-way set associative L2 cache and 8192 kB 16-way set associative L3 cache running Ubuntu Linux.

2. Desktop PC with an AMD® Ryzen™ 5 2600X CPU with 576 kB 8-way set associative L1D cache, 3072 kB 8-way set associative L2 cache and 16384 kB 16-way set associative L3 cache running Ubuntu Linux.

3. Laptop with an Intel® Core™ i5-6200U CPU with 64 kB 8-way set associative L1D cache, 512 kB 4-way set associative L2 cache and 3072 kB 12-way set associative L3 cache, running Arch Linux.

4. HPC cluster Ukko2 with Intel® Xeon™ E5-2680 v4 CPUs with 448 kB 8-way set associative L1D cache, 3584 kB 8-way set associative L2 cache and 35840 kB 20-way set associative

L3 cache¹, running CentOS Linux.

Unless otherwise stated, the results presented in this chapter will be based on the Intel i7 system.

6.1 Scaling experiment

Our test add_bench t N k builds a data structure of type t ∈ {DYNAMIC, buffered(8), unbuffered} up to size N ≥ 10^6 in k steps and does operation timing for each step. Steps are distributed exponentially to provide evenly distributed data points on a logarithmic axis of data structure size. We ran tests with data structure sizes from 10^6 to N = 10^10 in k = 100 steps to ensure that data structures are large enough to require significant cache acquisitions for randomized queries at the larger data structure sizes. The test works as follows. Initially a data structure of size 10^6 is built using random insertions. After this, the following is done for each step:

1. Random insertions are done to extend the data structure to the desired size.
2. 10^5 random insertions are generated and stored (one vector for locations and one for values).
3. The random insertions are committed to the data structure and timed.
4. 10^5 random removals are generated and stored.
5. The random removals are committed to the data structure and timed.
6. 10^5 random access operations are generated and stored.
7. The random access operations are committed to the data structure and timed.
8. 10^5 random Rank queries are generated and stored.
9. The random Rank queries are committed to the data structure and timed.
10. 10^5 random Select queries are generated and stored.
11. The random Select queries are committed to the data structure and timed.

The C++ class std::uniform_int_distribution is used to generate random unsigned 64-bit values, and these are converted to valid locations by taking the modulus of the random value. This does not generate a truly uniform distribution, and queries targeting the 2^64 mod size first elements may be up to ≈ 5 · 10^−8 percent more likely. Figure 6.1 shows an overview of performance scaling.
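The conversion can be sketched as follows (our illustration; the helper name and exact modulus are not from the benchmark code). Since 2^64 is generally not a multiple of the number of valid positions, the reduction is what introduces the tiny bias mentioned above.

#include <cstdint>
#include <random>

// Draw a pseudo-random position in [0, n_positions) by reducing a uniform
// 64-bit value with a modulus, as described in the text.
uint64_t random_position(std::mt19937_64& gen, uint64_t n_positions) {
    std::uniform_int_distribution<uint64_t> dist; // full 64-bit range
    return dist(gen) % n_positions;
}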

¹ According to https://en.wikichip.org/wiki/intel/xeon_e5/e5-2680_v4, as user access to the cluster disallows reading exact hardware specifications.

Figure 6.1: Execution time scaling as a function of data structure size for DYNAMIC, our implementation with buffer size 8 and our implementation with buffer size 0. Tests were run at 100 points in the (10^6, 10^10) range.

The results for this experiment mostly confirmed our hypotheses. Our leaf implementation is a significant improvement for insert and remove operations compared to the DYNAMIC library implementation. This improvement is mainly due to code optimization. The improvement is smaller at large data structure sizes. This is likely due to the poor locality for the buffer and underlying values caused by the memory allocation in our implementation. For non-modifying operations our implementations are generally faster than or as fast as the DYNAMIC library implementation, as demonstrated in Figure 6.2.

Figure 6.2: Execution time scaling of non-modifying operations as a function of data structure size for DYNAMIC , our implementation with buffer size 8 and our implementation with buffer size 0.

The effect of buffering is not very clear in this experiment. Figure 6.3 illustrates the effect of buffering on insert and Rank operations on the Intel i7 system. Buffered insertions are generally slightly faster than unbuffered insertions, while Rank queries are slightly slower for the buffered implementation, as expected. 28

Figure 6.3: Insert and Rank operation execution time scaling comparisons between our buffered and unbuffered leaf implementation on the Intel i7 test system.

Figure 6.4 shows insertion scaling with approximate cache size annotations. The lines show the size after which the data structure is guaranteed not to fit entirely in the cache in question. This does not take into account the effects of memory fragmentation or clashes due to associativity, so the actual transition point for the data structure fitting in cache is lower by some factor. The figure shows that none of our results show any strange scaling behavior based on cache residency, and it seems the main contributor to scaling is the tree height.

In addition to the mean execution time of operations we also collected some data on data structure sizes and memory fragmentation.

Figure 6.4: Insert query execution time scaling on each test system for DYNAMIC and our buffered leaf implementation with cache size annotation.

The DYNAMIC library and our leaf implementation include a bits_size() function to gather data on memory allocation. In addition to the allocated memory we also collected resident set size data during our experiments. The allocated size of the data structure was typically ≈ 20% greater than the number of stored bits and an additional ≈ 30% of memory was used due to memory fragmentation (resident set size / bits_size()). This highlights the importance of efficient memory allocation if minimizing memory footprint is a high priority.

6.2 Mixture test

To further explore the effect of buffering, a mixture test was used, where mean operation times are generated as a function of insertion vs. Rank probability. A test run consists of first seeding the three different data structures at the same time (the DYNAMIC reference implementation, unbuffered, and with buffer size 8) with 10^6 elements to guarantee that the content of all the bit vectors is the same. After seeding, for each of the different insertion probabilities, 10^6 random queries are generated with the desired insertion probability. These queries are run on each of the data structures separately and the mean execution time for each data structure at each insertion probability is output. Timing of individual queries would have been preferred, to give a better understanding of the timing for different queries. It was found, however, that some single query times can be so short that the resolution of the timing device was insufficient.

Figure 6.5: Mean query times for 10^6 element bit vectors as a function of insertion vs. Rank operation probability. Shaded area is standard deviation for the 20 runs.

Figure 6.5 shows the mean of mean query execution times over 20 runs of the experiment. The data for single runs is very noisy and multiple runs were required to visualize the differences in expected execution times. The plot shows that the reference implementation (Prezza, 2017) is slower in every case, except for pure Rank operation sequences, where it performs similarly to our implementation. Further, the plot seems to confirm the expected behavior of buffered vs. unbuffered operations, where the buffered leaf implementation becomes faster once the fraction of insert operations exceeds a fairly small threshold.

On our other test systems, the comparison to the DYNAMIC library implementation was very similar, as can be seen in Appendix A. However, the buffered versus unbuffered comparisons yielded some confusing and unexpected results, depicted in Figure 6.6. Results for the i5 system indicate inferior performance of the buffered implementation compared to the unbuffered implementation at high insertion probabilities. The results for the AMD system are very noisy and seem to indicate that the buffered implementation is slightly faster overall. We suspect this may be related to memory fragmentation or bad cache alignment, but more research is required to ascertain the reason. Clearly, architecture has some impact on data structure behavior.

Figure 6.6: Unexpected results for Intel i5 (left) and AMD Ryzen (right) mixture tests.

We also ran a series of tests with 10^10 element data structures to explore the effect of data structure size on the mixture performance. This experiment only consisted of 10 runs on the Ukko2 HPC cluster, as the runs take long enough to be impractical on our other test systems. The test had very high run-to-run variance, likely due to how the data structure is built for each run. Each run produced similar patterns but with different mean execution times, as can be seen in Appendix A. At approximately 20% insertion probability the buffered implementation overtakes the unbuffered implementation. Figure 6.7 shows that for pure Rank queries, execution times on 10^10 element bit vectors are slightly (≈ 0.04 µs) faster when leaves are not buffered, while pure insertion execution time is ≈ 0.3 µs faster when buffered leaves are used. Scaling is approximately linear between the extremes. This indicates that buffering is beneficial for big data structures in all cases where a significant number of modifying operations are required.

Figure 6.7: Mixture test results for 10^10 element bit vectors on the Ukko2 HPC cluster. Mean and standard deviation of the difference in mean buffered versus unbuffered query times as a function of insertion probability.

6.3 Application testing

The expected result of substituting our leaf implementation into the reference library would be a significant speed increase with a slight increase in data structure size. The DYNAMIC library contains benchmark code for several different use cases of the dynamic bit vector. The available benchmarks are h0_lz77, rle_lz77_v2, rle-bwt, rle-lz77-v1 and cw-bwt. Of these, only the context-wise Burrows-Wheeler transform benchmark cw-bwt produces the correct output. The other benchmarks either terminate before completion or produce inconsistent output. This indicates severe problems with our leaf implementation that need to be addressed in the future. We suspect this may be related to bugs in some leaf operations: the standard operations presented in Section 2.1 are extensively tested in our implementation, but the DYNAMIC library also defines, for example, select0, rank0, serialization and deserialization operations that are implemented but not tested in our leaf implementation. We tested our implementation using the E.coli genome data set from the Canterbury Corpus1 as well as the SOURCES, PROTEINS, DNA and XML data sets from the text collection of the Pizza&Chili corpus2. Both run time and resident set size were collected for all benchmarks, both for the default DYNAMIC implementation and with our own leaf implementation with buffer size eight substituted in. The results are shown in Figure 6.8.

1 https://corpus.canterbury.ac.nz/descriptions/#large (accessed 9 February 2021)
2 http://pizzachili.dcc.uchile.cl/texts.html (accessed 9 February 2021)

Figure 6.8: Results for the cw-bwt benchmark. Comparison of run time and resident set size of DYNAMIC and our buffered leaf implementation.

The results we managed to obtain are promising. Running the experiment with our leaf implementation was on average 42.77% faster, with only a 1.70% penalty in resident set size. The problems with our leaf implementation need to be addressed before rerunning the other available benchmarks to properly evaluate performance. Raw data for the cw-bwt benchmark results can be found in Appendix A. The Xeon E5-2680 tests that were run on the Ukko2 HPC cluster are outliers compared to the other systems. This is likely due to some instability in the cluster resources during testing: there was huge run-to-run variance for this system, and some experiments failed to finish in the allocated 17 hours of CPU time. The data points presented were selected from multiple runs as seemingly good representatives of reasonable performance.

6.4 Vectorization testing

The AVX and non-AVX implementations are compared by generating random procedures that create data structures and run Rank queries. These procedures are then applied separately to empty data structures using the vectorized and unvectorized implementations. A procedure consists of first generating a random bit vector by inserting 10^6 alternating 0 and 1 bits at random locations in the bit vector, after which 10^6 random Rank (or Select) queries are generated. All of these 2·10^6 operation descriptions are stored as a series of integers in a file. The procedure can then be run by reading the file and executing the described operations. For Rank operations the experiments show a significant expected speedup of between approximately 5 and 30% for random queries, depending on architecture. Figure 6.9 shows non-AVX versus AVX execution times for 100 test runs, with a mean speedup of 9.95%.
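To illustrate the kind of vectorization being compared, the following is a minimal sketch of an AVX2-accelerated in-leaf Rank, using the nibble-lookup population count of Muła et al. (2017). The function names and the exact loop structure are illustrative assumptions and not the DYNAMIC code; the point is that four 64-bit words are counted per iteration instead of one (compile with AVX2 enabled, e.g. -mavx2, on GCC or Clang).

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Population count of a 256-bit register using the nibble-lookup technique.
    static inline uint64_t popcount256(__m256i v) {
        const __m256i lookup = _mm256_setr_epi8(
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
        const __m256i low_mask = _mm256_set1_epi8(0x0f);
        __m256i lo = _mm256_and_si256(v, low_mask);
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lookup, lo),
                                      _mm256_shuffle_epi8(lookup, hi));
        // Horizontal sum of the 32 byte counts into four 64-bit partial sums.
        __m256i sums = _mm256_sad_epu8(cnt, _mm256_setzero_si256());
        return _mm256_extract_epi64(sums, 0) + _mm256_extract_epi64(sums, 1) +
               _mm256_extract_epi64(sums, 2) + _mm256_extract_epi64(sums, 3);
    }

    // Rank over the first `bits` bits of a word array: four 64-bit words per
    // AVX2 iteration, scalar popcount for the remaining words and the tail.
    uint64_t rank_avx2(const uint64_t* words, uint64_t bits) {
        uint64_t full_words = bits / 64;
        uint64_t count = 0;
        size_t i = 0;
        for (; i + 4 <= full_words; i += 4) {
            __m256i v = _mm256_loadu_si256(
                reinterpret_cast<const __m256i*>(words + i));
            count += popcount256(v);
        }
        for (; i < full_words; ++i) count += __builtin_popcountll(words[i]);
        uint64_t tail = bits % 64;
        if (tail) {
            count += __builtin_popcountll(words[full_words] &
                                          ((uint64_t(1) << tail) - 1));
        }
        return count;
    }

Handling the last few words with the scalar __builtin_popcountll keeps the sketch correct for leaves shorter than one AVX2 register.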

Figure 6.9: 100 runs of Rank experiments on the Intel i7 test system. Shows mean query times of the vectorized implementation versus the non-vectorized implementation with x = y line for reference. A single point in the left plot is the mean operation time with AVX as a function of mean operation time without AVX for one particular test case.

In the case of Select operations, it seems the added overhead of the binary search far outweighs the benefits of vectorized population counting, at least for AVX2 and the default leaf size. According to our testing, the AVX implementation is approximately 20 to 70% slower than the non-AVX implementation. Figure 6.10 shows our test results for the Intel i7 test system, with an estimated mean slowdown of 71.22% compared to the non-AVX version.
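For context, a leaf-level Select can be implemented as a simple word-by-word scan with scalar population counts, as sketched below; a vectorized variant instead counts larger blocks at once and then still needs a search inside the matching block, which is the binary-search overhead referred to above. The function is an illustrative assumption, not the DYNAMIC implementation.

    #include <cstdint>

    // Scalar in-leaf Select: returns the position of the k'th 1-bit (k >= 1)
    // in a leaf of n_words 64-bit words, assuming at least k set bits exist.
    uint64_t select1(const uint64_t* words, uint64_t n_words, uint64_t k) {
        uint64_t i = 0;
        // Scan word by word until the k'th 1-bit falls inside words[i].
        for (; i < n_words; ++i) {
            uint64_t c = __builtin_popcountll(words[i]);
            if (c >= k) break;
            k -= c;
        }
        // Clear the k-1 lowest set bits, then take the position of the
        // lowest remaining one.
        uint64_t w = words[i];
        for (uint64_t j = 1; j < k; ++j) w &= w - 1;
        return i * 64 + __builtin_ctzll(w);
    }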

Figure 6.10: 100 runs of Select experiments on the Intel i7 test system. Shows mean query times of the non-vectorized implementation versus the vectorized implementation with the x = y line for reference.

7 Conclusions and future work

Our work on implementing buffering, using vectorized operations where applicable, and optimizing existing code clearly shows that there is significant room for improvement in the performance of state-of-the-art dynamic bit vectors. Our fairly simple modifications to the DYNAMIC library make modifying operations significantly faster while not appreciably impacting access operations. All of our efforts are still works in progress, and a natural next step, after conceptually proving that our approaches are worthwhile, would be to create a new optimized implementation from scratch where the issues related to generic header types can be avoided. There may be more issues similar to the modulus/division optimization that are yet to be identified. In addition, a dedicated bit vector library would likely be easier to maintain than the highly templated and generic reference code base. Beyond the improvements we are currently working on, we feel that optimizing memory allocation, query caching, branchless binary search and bit vector compression should be explored.

7.1 Optimizing memory allocation

Memory fragmentation could be reduced by managing how memory is allocated. This could potentially improve data locality as well as data alignment, enabling faster loads and stores. Research has been done on efficient memory allocation for dynamic data structures to minimize memory fragmentation (Klitzke and Nicholson, 2016). In practice, manual memory management would lead to less fragmentation and thus less overhead due to it. However, if attempts are also made to reduce the performance overhead of reallocation, an optimized allocation scheme may lead to a higher overall memory overhead (if node and leaf allocations are made large enough to avoid repeated reallocation). The practical negative effect on space usage may push the space used per stored bit to ≈ 2 bits or more, compared to the claimed 1.2 bits per stored bit of the DYNAMIC library. In any case, optimizing memory allocation will likely lead to improvements in either data structure space requirements or operation timings.

7.2 Query caching

Some access patterns generate the same access queries multiple times. Caching one or more query results at the internal node or leaf level would speed up execution for these cases, at a penalty in space efficiency. Checking the cache would also cause overhead whenever a cache miss occurs. Whether these negatives outweigh the speedup for repeated queries should be evaluated.

7.3 Branchless binary search

An attempt at speeding up branch selection with a traditional binary search would likely reduce performance due to branch mispredictions. By eliminating code path branching from the binary search (Sanders and Winkel, 2004), binary searching for branch selection may become faster than the linear scanning approach of the DYNAMIC library. This could lead to a significant performance improvement, as the branching factor of the internal nodes could be increased without a significant penalty to branch selection, leading to a shallower tree structure with better memory locality. This would incur some added cost for modifying operations, as a larger number of internal cumulative sums would need to be updated.
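A minimal sketch of what such branchless child selection could look like is given below, assuming the internal node stores non-decreasing cumulative element counts and that the fanout is a power of two so the loop can be fully unrolled. This is an illustration of the technique, not code from the DYNAMIC library.

    #include <cstdint>

    // Branchless child selection over an internal node's cumulative counts,
    // in the spirit of Sanders and Winkel (2004). `sums` holds `fanout`
    // non-decreasing cumulative element counts, `fanout` is assumed to be a
    // power of two and pos < sums[fanout - 1]. Returns the index of the
    // first child whose cumulative count exceeds `pos`.
    uint32_t select_child(const uint64_t* sums, uint32_t fanout, uint64_t pos) {
        uint32_t base = 0;
        for (uint32_t half = fanout / 2; half > 0; half /= 2) {
            // Intended to compile to a conditional move rather than a
            // taken/not-taken branch, so there is nothing to mispredict.
            base += (sums[base + half - 1] <= pos) ? half : 0;
        }
        return base;
    }

Because the number of iterations depends only on the fanout and not on the data, the loop can be unrolled and the fanout increased without paying for mispredicted branches.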

7.4 Bit vector compression

For some inputs, compressing the leaf bit vectors may increase performance. If properly implemented, a scheme where elements are either compressed or succinctly stored on a per-leaf basis could, in some cases, significantly reduce the memory footprint of the data structure. Intuitively, the blocking approach presented in Kärkkäinen et al. (2014) should be straightforward to apply to the leaf blocks. However, the overhead of encoding conversions and checks may be prohibitively expensive for all but the most space-constrained applications.

Bibliography

Aho, A. V. and Hopcroft, J. E. (1974). The Design and Analysis of Computer Algorithms. 1st. USA: Addison-Wesley Longman Publishing Co., Inc. isbn: 0201000296.

Alipanahi, B., Kuhnle, A., Puglisi, S. J., Salmela, L., and Boucher, C. (May 2020). “Succinct Dynamic de Bruijn Graphs”. In: Bioinformatics. btaa546. issn: 1367-4803. doi: 10.1093/bioinformatics/btaa546. url: https://doi.org/10.1093/bioinformatics/btaa546.

Awad, M. A., Ashkiani, S., Johnson, R., Farach-Colton, M., and Owens, J. D. (2019). “Engineering a High-Performance GPU B-Tree”. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. PPoPP ’19. Washington, District of Columbia: Association for Computing Machinery, pp. 145–157. isbn: 9781450362252. doi: 10.1145/3293883.3295706. url: https://doi.org/10.1145/3293883.3295706.

Bayer, R. and McCreight, E. (1970). “Organization and Maintenance of Large Ordered Indices”. In: Proceedings of the 1970 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control. SIGFIDET ’70. Houston, Texas: Association for Computing Machinery, pp. 107–141. isbn: 9781450379410. doi: 10.1145/1734663.1734671. url: https://doi.org/10.1145/1734663.1734671.

Bender, M., Farach-Colton, M., Jannen, W., Johnson, R., Kuszmaul, B., Porter, D., Yuan, J., and Zhan, Y. (2015). “An Introduction to Bε-trees and Write-Optimization”. In: ;login: Usenix Mag. 40.

Clark, D. (1997). Compact PAT trees. url: http://hdl.handle.net/10012/64.

Comer, D. (June 1979). “Ubiquitous B-Tree”. In: ACM Comput. Surv. 11.2, pp. 121–137. issn: 0360-0300. doi: 10.1145/356770.356776. url: https://doi.org/10.1145/356770.356776.

Cordova, J. and Navarro, G. (2016). “Practical Dynamic Entropy-Compressed Bitvectors with Applications”. In: Experimental Algorithms. Ed. by A. V. Goldberg and A. S. Kulikov. Cham: Springer International Publishing, pp. 105–117. isbn: 978-3-319-38851-9.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2009). Introduction to Algorithms, Third Edition. 3rd. The MIT Press. isbn: 0262033844.

Ferragina, P. and Venturini, R. (Aug. 2016). “Compressed Cache-Oblivious String B-Tree”. In: ACM Trans. Algorithms 12.4, 52:1–52:17. issn: 1549-6325. doi: 10.1145/2903141. url: http://doi.acm.org/10.1145/2903141.

Fog, A. (Oct. 2020). Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD, and VIA CPUs. Tech. rep. Technical University of Denmark. url: https://www.agner.org/optimize/.

Gog, S., Beller, T., Moffat, A., and Petri, M. (2014). “From Theory to Practice: Plug and Play with Succinct Data Structures”. In: 13th International Symposium on Experimental Algorithms (SEA 2014), pp. 326–337.

Gog, S., Kärkkäinen, J., Kempa, D., Petri, M., and Puglisi, S. J. (Apr. 2019). “Fixed Block Compression Boosting in FM-Indexes: Theory and Practice”. In: Algorithmica 81.4, pp. 1370–1391. issn: 1432-0541. doi: 10.1007/s00453-018-0475-9. url: https://doi.org/10.1007/s00453-018-0475-9.

González, R., Grabowski, S., Mäkinen, V., and Navarro, G. (2005). “Practical Implementation of Rank and Select Queries”. In: Poster Proceedings of the 4th International Workshop on Efficient and Experimental Algorithms (WEA 2005).

Kärkkäinen, J., Kempa, D., and Puglisi, S. J. (2014). “Hybrid Compression of Bitvectors for the FM-Index”. In: 2014 Data Compression Conference, pp. 302–311. doi: 10.1109/DCC.2014.87.

Klitzke, P. and Nicholson, P. K. (2016). “A General Framework for Dynamic Succinct and Compressed Data Structures”. In: 2016 Proceedings of the Meeting on Algorithm Engineering and Experiments (ALENEX), pp. 160–173. doi: 10.1137/1.9781611974317.14. url: https://epubs.siam.org/doi/abs/10.1137/1.9781611974317.14.

Muggli, M. D., Bowe, A., Noyes, N. R., Morley, P. S., Belk, K. E., Raymond, R., Gagie, T., Puglisi, S. J., and Boucher, C. (Feb. 2017). “Succinct colored de Bruijn graphs”. In: Bioinformatics 33.20, pp. 3181–3187. issn: 1367-4803. doi: 10.1093/bioinformatics/btx067. url: https://doi.org/10.1093/bioinformatics/btx067.

Muła, W., Kurz, N., and Lemire, D. (May 2017). “Faster Population Counts Using AVX2 Instructions”. In: The Computer Journal 61.1, pp. 111–120. issn: 0010-4620. doi: 10.1093/comjnl/bxx046. url: https://doi.org/10.1093/comjnl/bxx046.

Navarro, G. (2016). Compact Data Structures: A Practical Approach. 1st. USA: Cambridge University Press. isbn: 1107152380.

Prezza, N. (2017). “A Framework of Dynamic Data Structures for String Processing”. In: International Symposium on Experimental Algorithms. Leibniz International Proceedings in Informatics (LIPIcs).

Sanders, P. and Winkel, S. (2004). “Super Scalar Sample Sort”. In: Algorithms – ESA 2004. Ed. by S. Albers and T. Radzik. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 784–796. isbn: 978-3-540-30140-0.

Sedgewick, R. and Flajolet, P. (1996). An Introduction to the Analysis of Algorithms. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc. isbn: 0-201-40009-X.

Sirén, J., Garrison, E., Novak, A. M., Paten, B., and Durbin, R. (July 2019). “Haplotype-aware graph indexes”. In: Bioinformatics 36.2, pp. 400–407. issn: 1367-4803. doi: 10.1093/bioinformatics/btz575. url: https://doi.org/10.1093/bioinformatics/btz575.

Vigna, S. (2008). “Broadword Implementation of Rank/Select Queries”. In: Experimental Algorithms. Ed. by C. C. McGeoch. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 154–168. isbn: 978-3-540-68552-4.

Zhou, D., Andersen, D. G., and Kaminsky, M. (2013). “Space-Efficient, High-Performance Rank and Select Structures on Uncompressed Bit Sequences”. In: Experimental Algorithms. Ed. by V. Bonifaci, C. Demetrescu, and A. Marchetti-Spaccamela. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 151–163. isbn: 978-3-642-38527-8.

Appendix A Experiment results expanded

Simple scaling comparison for all test systems of the DYNAMIC library solution, our leaf implementation with buffer size zero and our leaf implementation with buffer size 8.

Simple scaling comparison for non-modifying operations.

Insert and Rank operation scaling comparison between buffered and unbuffered implementation on the Intel i7-4790 test system.

Insert and Rank operation scaling comparison between buffered and unbuffered implementation on the AMD Ryzen 5 2600X test system.

Insert and Rank operation scaling comparison between buffered and unbuffered implementation on the Intel i5-6200U test system.

Insert and Rank operation scaling comparison between buffered and unbuffered implementation on the Intel E5-2680 test system.

Mixture tests run on the Intel i7-4790, comparing mean operation time as a function of insertion (vs. Rank) probability between DYNAMIC and our buffered implementation (left), and our buffered implementation and unbuffered implementation (right) on a 10^6 element bit vector.

Mixture tests run on the Intel i5-6200U, comparing mean operation time as a function of insertion (vs. Rank) probability between DYNAMIC and our buffered implementation (left), and our buffered implementation and unbuffered implementation (right) on a 10^6 element bit vector.

Mixture tests run on the Intel E5-2680, comparing mean operation time as a function of insertion (vs. Rank) probability between DYNAMIC and our buffered implementation (left), and our buffered implementation and unbuffered implementation (right) on a 10^6 element bit vector.

Mixture tests run on the AMD Ryzen 5 2600X, comparing mean operation time as a function of insertion (vs. Rank) probability between DYNAMIC and our buffered implementation (left), and our buffered implementation and unbuffered implementation (right) on a 10^6 element bit vector.

Mixture tests run on the AMD Ryzen 5 2600X, comparing mean operation time as a function of insertion (vs. Rank) probability between DYNAMIC and our buffered implementation (left), and our buffered implementation and unbuffered implementation (right) on a 10^7 element bit vector.

Mixture tests run on the Intel E5-2680 with 10^10 element bit vectors, showing mean operation time as a function of insertion (vs. Rank) probability between our buffered and unbuffered implementations.

AVX experiment comparing mean Rank query times over 100 runs on the Intel i7-4790.

AVX experiment comparing mean Rank query times over 100 runs on Intel i5-6200U.

AVX experiment comparing mean Rank query times over 100 runs on Intel Xeon E5-2680.

AVX experiment comparing mean Rank query times over 100 runs on the AMD Ryzen 5 2600X.

AVX experiment comparing mean Select query times over 100 runs on the Intel i7-4790.

AVX experiment comparing mean Select query times over 100 runs on Intel i5-6200U.

AVX experiment comparing mean Select query times over 100 runs on Intel Xeon E5-2680.

AVX experiment comparing mean Select query times over 100 runs on the AMD Ryzen 5 2600X.

Results for cw-bwt benchmark run times in seconds.

System          leaf type   E.coli    SOURCES    PROTEINS    DNA       XML
i5-6200U        DYNAMIC     13.19     2137.65    11527.35    1732.14   2827.79
i5-6200U        buffered     3.72      798.11     5741.48     839.67   1186.25
i7-4790         DYNAMIC      9.88     1529.14     9116.89    1398.38   2014.80
i7-4790         buffered     2.81      555.17     3959.26     686.58    764.55
Xeon E5-2680    DYNAMIC      4.36      715.07     5287.80     845.05    997.70
Xeon E5-2680    buffered     4.08      944.33     3867.85    1125.02    905.74
Ryzen 5 2600X   DYNAMIC      9.04     1437.42     8536.90    1249.29   1940.72
Ryzen 5 2600X   buffered     2.87      583.74     4427.84     660.36    849.84

Results for cw-bwt benchmark maximum resident set sizes in kB.

System          leaf type   E.coli    SOURCES    PROTEINS    DNA       XML
i5-6200U        DYNAMIC     5788      133976     868408      185036    157960
i5-6200U        buffered    6196      138008     893140      189200    162352
i7-4790         DYNAMIC     5416      134324     860192      183360    158000
i7-4790         buffered    5348      138064     886028      187780    162820
Xeon E5-2680    DYNAMIC     3780      133796     867200      183252    157896
Xeon E5-2680    buffered    3796      133800     866964      183180    157892
Ryzen 5 2600X   DYNAMIC     6180      134504     868860      185268    158220
Ryzen 5 2600X   buffered    6372      135748     876352      186564    159676