A Novel GPU Algorithm for Indexing Columnar Databases with Column Imprints

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Manaswi Mannem

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

Eleazar Leal

July 2020

© Manaswi Mannem 2020

Acknowledgements

I would like to thank my advisor, Dr. Eleazar Leal, who gave me the opportunity to work on this research. He has guided and supported me throughout my degree program, including this research. I have really learnt a lot from him. Additionally, I would like to thank Dr. Arshia Khan and Dr. Desineni Subbaram Naidu for being on my defense committee and for taking the time to go over this research work with me. Finally, many thanks to all the Masters of Computer Science students from the graduating classes of 2018, 2019 and 2020 for being an integral part of my growth here at UMD.

Dedication

I dedicate this thesis to my parents, Syam Prasanna and Kamala Mannem, and my brother, Manasseh for being the most remarkable role models to me throughout my life. Their constant support and guidance has always pushed me to pursue my dreams and to excel at them. I would also like to dedicate this thesis to the incredible faculty of the Department of Computer Science at UMD. Working and interacting with each one of them as a student, a researcher and a teaching assistant has been the most rewarding learning experience of my graduate school tenure.

Abstract

Columnar database management systems (CDBMS) are specialized database systems that store data in column-major order, i.e., where all the values of each attribute are stored consecutively in storage, as opposed to row-major order, which stores all the values of each row consecutively and is the strategy most commonly used by systems such as Oracle, SQL Server, PostgreSQL, etc. The column-major approach makes CDBMS more appropriate for data warehouses because query workloads for the latter systems usually involve retrieving all the values of a small subset of attributes. Just like in relational database management systems, query response time performance in CDBMS can greatly benefit from the existence of specially designed data structures, called indexes, that help avoid exhaustively searching the entire database. However, existing indexing techniques for CDBMS, like bitmaps, zonemaps, etc., have been shown to result in large storage overhead and memory traffic between storage and CPU. Column imprints are an indexing approach for CDBMS that deals with these two issues by compressing the data in storage and by storing data in such a way that it reduces expensive data transfers between Random Access Memory (RAM) and Central Processing Unit (CPU) caches. However, data compression and decompression, which are necessary for query processing in CDBMS, impose a significant computational burden. To deal with this problem, parallel architectures, such as Graphics Processing Units (GPUs), can be used. GPUs are co-processors that have been shown to outperform CPUs for highly parallel tasks. Besides this, GPUs are highly energy efficient, affordable, and available in many types of machines, from mobile devices to supercomputers. Despite their advantages, no work has focused on the use of GPUs for column imprints indexing.

To address the gap mentioned before, this thesis introduces the first GPU algorithm, named GPUImprints, for indexing columnar databases with column imprints. We also performed the first experimental study, using one real-world dataset and one synthetic dataset, on the use of GPUs versus CPUs for column imprints in terms of their index creation time, query response times, and index space requirements. Our experiments showed that GPUImprints can speed up the column imprints construction time by a factor of 20X and can speed up the query processing times of range queries by a factor of 3.7X when compared to the CPU imprints algorithm.

Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Outline

2 Background
  2.1 Columnar Databases
    2.1.1 Overview
    2.1.2 Indexing Techniques
    2.1.3 Applications
  2.2 GPUs
    2.2.1 Overview
    2.2.2 CUDA
    2.2.3 Applications
  2.3 Why use GPUs for Columnar Databases?

3 Proposed Algorithm
  3.1 Overview
  3.2 Stage 1: Creation of Imprint Index
  3.3 Stage 2: Imprints Compression
  3.4 Stage 3: Query Evaluation

4 Experimental Results
  4.1 Challenges of Designing a GPU Algorithm for Column Imprints
  4.2 Hardware
  4.3 Datasets and Query Workload
  4.4 Parameters
  4.5 Analysis of GPUImprint Index
  4.6 Performance Analysis of Range Queries

5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work

References

List of Tables

2.1 Example of bitmap indexing

List of Figures

1.1 Examples of OLAP and OLTP
1.2 Example of row and columnar databases

2.1 Example of database cracking
2.2 Example of byte-aligned bitmap code (BBC) compression
2.3 Example of WAH compression
2.4 Example of zonemaps
2.5 APIs of CUDA
2.6 CUDA stack

3.1 Example of GPU column imprints index

4.1 Comparison between imprints creation time of CPU and GPU for 2-byte integer data type (TPC-H)
4.2 Comparison between imprints creation time of CPU and GPU for 4-byte data type (Skyserver)
4.3 Comparison between imprints creation time of CPU and GPU for 8-byte data type (Skyserver)
4.4 GPU imprints index creation time per 10,000 values for different value types with varying widths of the GPU imprint vector
4.5 GPU imprints index creation time per 1000 values for different value types with varying cache line sizes
4.6 Comparison of imprints index size for different column sizes for CPU and GPU
4.7 Comparison of query response times for CPU and GPU
4.8 Query response time for different width GPU imprint vectors with varying cache line sizes

1 Introduction

Database management systems are used to record daily activities. They store data in tables either in row-wise (row databases) or column-wise (columnar databases) fashion. Most systems like order entry, retail sales, and financial transaction systems use databases managed by Online Transaction Processing Systems (OLTP) to record their daily updates [33]. However, for decision support and strategic planning, organizations also need historical data that is aggregated over many years [31]. The systems that store historical data are referred to as data warehouses, also called Online Analytical Processing Systems (OLAP). Some examples of OLAP and OLTP systems are shown in Figure 1.1. In a row-oriented database, the data is stored row by row; that is, after storing the entire first row, the second row is stored.

Figure 1.1: Examples of OLAP and OLTP

Similarly, in columnar databases, data is stored column-wise; that is, all the data associated with a column is stored adjacently. In Figure 1.2, for the row-oriented database, data is organized by row, keeping all of the data associated with a row next to each other in memory. Similarly, in the column-oriented database, data is organized by column, keeping all the data associated with a column next to each other in memory. Some examples of columnar databases are MonetDB [20], MariaDB [17], CrateDB [9], Apache HBase [3], etc. MonetDB is a main-memory database management system that stores data in columns. The architecture of MonetDB has three layers: the front-end layer (top), the back-end layer (middle), and the database kernel (bottom) [20]. The two indexing mechanisms used in MonetDB are Database Cracking [14] and Column Imprints [27].

Figure 1.2: Example of row and columnar databases

1.1 Motivation

Column-oriented databases can scan the entire database faster than row-oriented databases [1]. This is because, in column-oriented databases, a scan operation is performed by traversing an array with loops, and this traversal can be further enhanced by exploiting indexing properly. According to [1], column-oriented database indexes are faster than traditional row-oriented database indexes and are used as a standard for storing and querying large data warehouses. The main feature of the columnar architecture is that compression [35] can be performed efficiently: as most of the values in a column are repetitive and have the same data type, compression algorithms can achieve higher compression ratios by compressing similar and consecutive values into one. An index is a data structure which helps to speed up query search. One example of an index is an online catalog at a library. If a person is searching for the book Database Concepts, then he/she can search for it either by using the name of some chapter in the book, by its author, or by using the title of the book. An index primarily exists to speed up the search for one or more rows of information stored in a database. Indexes cannot be created on all attributes of a table in a database because they consume disk space and system resources. If the title of a book is updated, then the index created on the title must also be updated. This results in additional consumption of storage, I/O, CPU, and memory resources. In a similar way, if indexes are created on every column in the library database, the result is excessive consumption of disk space and system resources. Row-oriented database indexes perform best on queries involving single-row operations, when searching for a particular value, or for queries on a small range of values, i.e., with transactional workloads, because these tend to be updated on a daily basis. Column-oriented database indexes give maximum performance for analytic queries

that scan large amounts of data, especially on large tables. These indexes can be used on data warehousing and analytic workloads, because most of the values in their columns are repetitive and have the same data type. There are many reasons for using column-oriented database indexes over row-oriented database indexes. Some of them are:

• A column-oriented database index can provide a very high level of data compression, typically by 10 times [8], which significantly reduces storage cost. This is because all the values are from the same domain and have similar values. Therefore, memory traffic and I/O consumption are minimized.

• By minimizing memory traffic, high compression rates improve query performance [8].

• It reduces total I/O consumption as queries often select only a few columns from a table.

Some of the columnar database index types that can be created in database management systems are bitmaps [4], zonemaps, etc. Bitmap indexes [4] are best suited for data warehouses or decision support systems [15]. These indexes are generally used on low-cardinality columns and are efficient when there are many queries that join or filter on indexed columns. The key advantages of bitmap indexes are that they can be created very quickly and generally take up much less space. Zonemaps store lightweight metadata on a per-page basis, e.g., min/max values. However, existing indexing techniques for columnar database management systems, like bitmaps, zonemaps, etc. [27], have been shown to result in large storage overhead and memory traffic between storage and CPU. Column imprints are an indexing approach that deals with these two issues by compressing the data in storage and by

storing data in such a way that it reduces expensive data transfers between Random Access Memory (RAM) and Central Processing Unit (CPU) caches. However, data compression and decompression, which are necessary for query processing in columnar databases, impose a significant computational burden. To deal with this problem, parallel architectures, such as Graphics Processing Units (GPUs), can be used. To address the gap mentioned before, this thesis introduces the first GPU algorithm, named GPUImprints, for indexing columnar databases with column imprints.

1.2 Contributions

The contributions of this thesis are the following:

• We introduce a new indexing algorithm called GPUImprints to improve query response times for range queries and decrease the storage overhead of indexes compared to CPU imprints.

• We study the effect of varying value data types and cache line sizes on the index creation times of GPUImprints. We also performed the first experimental study, using one real-world dataset and one synthetic dataset, on the use of GPUs versus CPUs for column imprints in terms of their index creation time, query response times, and index space requirements.

• Our experiments showed that GPUImprints can speed up the column imprints construction time by a factor of 20X and can speed up the query processing times of range queries by a factor of 3.7X when compared to the CPU imprints algorithm.

1.3 Outline

The rest of this thesis is organized as follows: Chapter 2 discusses the background and work related to the proposed technique. Chapter 3 discusses the implementation details and algorithms of the three stages of GPUImprints: creation of the GPU imprint index, compression, and evaluation of range queries. Chapter 4 discusses results from the experimental evaluation of the proposed algorithm against CPU column imprints in terms of index creation time, query response times, and index space requirements. Chapter 5 presents the conclusions and future work of this thesis.

2 Background

This chapter discusses work related to this thesis. Section 2.1 includes an overview, indexing techniques, and applications of columnar databases. Section 2.2 gives an overview of GPUs, their programming model, and their applications. Section 2.3 explains why GPUs are useful for columnar databases.

2.1 Columnar Databases

2.1.1 Overview

In columnar databases, the data for each column is grouped together on disk. The main difference between columnar and row-oriented storage is that, in a columnar database, data is stored column by column, that is, all the data associated with a column is stored adjacently, while in the row-oriented model the data is stored row by row, that is, after storing the entire first row, the second row is stored. As an example, retrieving the sum of the salaries from a row store must scan five blocks, while in a column store only one column needs to be accessed. The main feature of the columnar architecture is that compression [35] can be performed efficiently. As most of the values in a column are repetitive and have the same data type, compression algorithms can achieve higher compression ratios by compressing similar and consecutive values into one [35].
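To make the layout difference concrete, the following is a minimal illustrative C sketch (our own example, not taken from any particular system) that stores the same table in row-major and column-major form and sums the salary attribute; the column-major version scans a single contiguous array.

#include <stdio.h>

#define NROWS 5

/* Row-major: one struct per row, attributes of a row are adjacent. */
struct Row { int id; int age; int salary; };

int main(void) {
    struct Row rows[NROWS] = {
        {1, 30, 100}, {2, 41, 200}, {3, 25, 150}, {4, 38, 120}, {5, 29, 180}
    };
    /* Column-major: one array per attribute, values of a column are adjacent. */
    int salary_col[NROWS] = {100, 200, 150, 120, 180};

    int sum_row = 0, sum_col = 0;
    for (int i = 0; i < NROWS; i++)
        sum_row += rows[i].salary;      /* skips over id and age in every row */
    for (int i = 0; i < NROWS; i++)
        sum_col += salary_col[i];       /* touches only the salary array */
    printf("row-store sum = %d, column-store sum = %d\n", sum_row, sum_col);
    return 0;
}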

• Columnar Write Penalty

The main limitation of columnar databases is that they exhibit low performance in the case of single-row operations. If a query requests the balance of a single customer, then a row-oriented database needs to access only one row, whereas a columnar database has to access the entire database. Therefore, for OLTP operations, columnar databases are considered a poor choice, as such workloads involve unnecessary writes [13].

Some examples of columnar databases are MonetDB [20], MariaDB [17], CrateDB [9], Apache HBase [3], etc.

MonetDB

MonetDB has developed column-store solutions for high-performance data warehouses for business intelligence and e-Science since 1993 [20]. MonetDB has high performance compared to traditional database systems because of its storage model (column store), a modern CPU-tuned query execution architecture, adaptive indexing, and run-time query optimization. In MonetDB, the data is held in memory, which does not require additional buffer maintenance, providing scope for maximum efficiency in executing queries. All the processing associated with the data is partitioned together. MonetDB uses the Decomposed Storage Model (DSM) [20], i.e., each column is stored in a separate Binary Association Table (BAT). It uses two simple memory arrays, and internally columns are stored using memory-mapped files [20][14]. MonetDB is a column-store database kernel where the complete database can be held in main memory.

Figure 2.1: Example of database cracking

2.1.2 Indexing Techniques

Database Cracking

Database cracking is an incremental, partial indexing and/or sorting of the data. It is a mechanism that aims at enhancing query processing by adapting the way data is stored to the query workload. This adaptation happens through physical reorganization of the data. Each query reorganizes the data in such a way that future queries can be answered faster. In this way, query processing improves as more queries are answered, by adapting to the query and data workload. However, the initial query response time is larger with the cracking mechanism compared to other indexing techniques. With database cracking, the columns of a database are progressively partitioned into pieces (cracker indexes) based on user queries. This partitioning of columns can help improve the response time of future queries, because later queries need to look at only certain portions of these columns instead of scanning the whole database [25]. A minimal sketch of one cracking step is given below.
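The following C sketch illustrates, under our own simplifying assumptions, a single cracking step for one range query: the column is physically partitioned around the query's bounds so that later overlapping queries only need to scan the matching piece. The names (crack_in_two, etc.) are illustrative and are not MonetDB's API.

#include <stdio.h>

/* Swap helper. */
static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Partition col[lo, hi) so that values < pivot come first; return the split index. */
static int crack_in_two(int *col, int lo, int hi, int pivot) {
    int i = lo;
    for (int j = lo; j < hi; j++)
        if (col[j] < pivot) swap(&col[i++], &col[j]);
    return i;                      /* everything in [lo, i) is < pivot */
}

int main(void) {
    int col[] = {13, 4, 55, 9, 12, 7, 23, 8, 1, 19};
    int n = 10;
    /* Query [8, 20): crack on the low bound, then on the high bound inside the upper piece. */
    int split_low  = crack_in_two(col, 0, n, 8);           /* [0, split_low) has values < 8   */
    int split_high = crack_in_two(col, split_low, n, 20);  /* [split_low, split_high) in [8,20) */
    for (int j = split_low; j < split_high; j++) printf("%d ", col[j]);
    printf("\n");                  /* values matching the query range */
    return 0;
}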

Bitmap

Bitmap indexing is mostly used in scenarios where the data required by the applications is not modified frequently. The major applications of bitmaps are text matching, similarity computation, visualization, and indexing of scientific data. A bitmap-based index is a collection of bit vectors, where each bit vector is associated with some property. The bit vector will have a 1 in a position if that property is satisfied; otherwise it will be 0. Every value in a column of the database is represented by a bit vector in bitmap indexing. The size of a bitmap index depends on the number of distinct values in the column. If the number of distinct values in the column is less than the number of bits used for representing the original data, then the size of the bitmap index will be smaller than the original data. For example, if the number of distinct values in a column is 8 and integers are stored using 32 bits, then the bitmap index for such an attribute, which only has 8 bit vectors, is 4 times smaller than the original data. Many types of bitmap-based indexes have been proposed, such as interval-encoded bitmaps [7], encoded bitmaps [38], and range-encoded bitmaps [23][34], to minimize the number of bit vectors that must be scanned to answer a query. For applications with a large number of columns, binning [6][16] was proposed. To improve the performance of bitmap indexes, the two compression techniques used are the Byte-aligned Bitmap Code (BBC) [2] and the Word-Aligned Hybrid (WAH) [36] compression method. By using these compression techniques, the efficiency of bitmap indexing for large attributes has increased [37]. Binary matrices are similar to a bitmap table. Each row in the bitmap table represents one row of the database. The columns in the bitmap table represent the bins, where each bin encloses a set of values. A bit vector is encoded for every value in the database based on which bin the value falls into, that is, based on the property associated with that row.

ID   hascar   Bitmap: yes   Bitmap: no
1    yes      1             0
2    no       0             1
3    yes      1             0

Table 2.1: Example of bitmap indexing.

• Bitmap Encoding: For the creation of a bitmap index, if a value falls into a bin, the bin is marked "1", otherwise "0". Since a value can only fall into a single bin, only one "1" can exist for each row of each attribute. After binning, the whole database is converted into a large 0–1 bitmap, where rows correspond to rows of the database and columns correspond to bins. Consider the example shown in Table 2.1: the ID refers to the unique number assigned to each student, hascar is the data to be indexed, and the content of the bitmap index is shown as the two columns under the heading Bitmap. Each of those columns is a bitmap in the bitmap index. In this case, there are two such bitmaps, one for "has car" = yes and one for "has car" = no. It is easy to see that each bit in the yes bitmap shows whether a particular row refers to a person who has a car. This is the simplest form of bitmap index; most columns will have more distinct values. A small illustrative sketch of bitmap construction and range-query evaluation is given at the end of this subsection.

• Query Execution

Evaluation of point and range queries using bitmap indexes can exhibit high performance, as most of the operations involved are logical bit-wise operations. Range queries are evaluated by performing a logical OR, followed by an AND operation, over all the bit vectors of the values specified in the given range. If the

range of values queried covers more than half of the values in the database, then we perform the logical operations on the values which do not fall into the range and take the complement. Their performance can be further improved by using multi-resolution bitmap indexes [28]. Evaluation of point queries using bitmaps is a straightforward procedure: we initially create the bit vectors for the attribute values involved in the query and then perform a logical AND of the two bitmaps.

• Bitmap Compression: The main limitation of bitmap indexing is that it consumes a lot of space. To

Figure 2.2: Example of byte-aligned bitmap code (BBC) compression

overcome this limitation, we can compress the bitmap index and evaluate queries directly over the compressed index. The two most popular run-length compression techniques are the Byte-aligned Bitmap Code (BBC) [2] and the Word-Aligned Hybrid (WAH) compression method [36]. BBC stores the compressed data in bytes while WAH stores it in words. Both techniques represent a sequence of consecutive identical bits, called a fill or a gap, by its bit value and its length. BBC first breaks down the sequence of bits into bytes and then combines bytes into runs. There are four types of runs in BBC. In all types of runs, literal bytes are allocated after the fill bytes. As BBC is represented in

bytes, a run always represents the fill in terms of bytes, and the fill length is an integer, usually a multiple of bytes. Figure 2.2 shows a Type 1 run. In a Type 1 run, the header is one byte, bytes 0–3 are the fill bytes, and bytes 8–15 are the literal bytes. This ensures that during any bit-wise logical operation a tail byte is never broken into individual bits.

Figure 2.3: Example of WAH compression

In WAH, unlike BBC, the compressed data is stored in words. A literal word (0) and a fill word (1) are distinguished by the most significant bit of the word. The lower bits of a literal word contain the bit values from the bit sequence. The second most significant bit of a fill word is the fill bit, and the lower (w-2) bits store the fill length. An example of WAH compression is shown in Figure 2.3. In Figure 2.3, we assume 32-bit words, where each literal word stores 31 bits from the bitmap and each fill represents a multiple of 31 bits. The first line in the figure shows the original bit vector, the second line shows how the bit vector is divided into 31-bit groups, and the third line shows the hexadecimal representation of the groups. In general, bit operations over a compressed WAH bitmap are faster than over BBC, while BBC gives a better compression ratio.

Data manipulation in bitmap indexing is a costly operation. In order to update the table, we have to drop the index first, perform the update operation, and create the index again. However, to insert a row, only the row to be inserted

needs to be accessed and added. To improve update performance, an identifier is associated with the compressed bitmap to encode all the values of the table using the bits which are not set [5].

A bitmap index which uses binning reduces the number of bitmaps used in the index, as one bin encloses a range of values. For example, a column "month" represented by integers between 1 and 12 might be divided into 4 bins uniformly, with bin 1 for "month" between 1 and 3, bin 2 for "month" between 4 and 6, and so on. The query "month between 4 and 8" will touch bins 2 and 3. In traditional bitmap indexing, 12 bit vectors are necessary if all the values are distinct, but by using bitmap indexing with binning the number of bins is reduced to 4. Therefore, bitmap indexing with binning is more efficient when there is a larger range of values.
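The following is a minimal C sketch, under our own simplifying assumptions (at most 64 rows, one bitmap per distinct value rather than per bin), of how a bitmap index is built for a low-cardinality column and how a range query is answered by OR-ing the bit vectors of the qualifying values; all names (build_bitmaps, range_query, etc.) are illustrative.

#include <stdint.h>
#include <stdio.h>

#define NROWS    8
#define NVALUES 12                        /* e.g. "month" in 1..12, one bitmap per value */

/* One 64-bit word per bitmap is enough for NROWS <= 64 rows. */
static uint64_t bitmaps[NVALUES + 1];

static void build_bitmaps(const int *col, int nrows) {
    for (int r = 0; r < nrows; r++)
        bitmaps[col[r]] |= (1ULL << r);   /* set bit r in the bitmap of value col[r] */
}

/* Rows with low <= value <= high: OR the bitmaps of every value in the range. */
static uint64_t range_query(int low, int high) {
    uint64_t res = 0;
    for (int v = low; v <= high; v++)
        res |= bitmaps[v];
    return res;
}

int main(void) {
    int month[NROWS] = {1, 4, 6, 12, 4, 7, 3, 8};
    build_bitmaps(month, NROWS);
    uint64_t hits = range_query(4, 8);    /* months 4..8 */
    for (int r = 0; r < NROWS; r++)
        if (hits & (1ULL << r)) printf("row %d matches\n", r);
    return 0;
}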

Zonemaps

Zonemaps [27] are another choice for indexing secondary attributes. A zonemap index is created by dividing the column to be queried into zones of a fixed size, and then the minimum and the maximum values are calculated for each zone. All the zones in the column have an equal number of values. Figure 2.4 shows an example of zonemaps. In Figure 2.4, the column is divided into 3 zones, where each zone has exactly 3 elements, and the minimum and maximum of each zone are calculated. To evaluate a query using zonemaps, the minimum and maximum values of each zone are compared with the predicates of the query. If the range of the predicates overlaps the range of a zone, then the zone is retrieved. This is the basic technique of zonemaps specified in [27].
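A minimal C sketch of this idea follows, assuming zones of three values each as in Figure 2.4; the structure and function names (Zone, build_zonemap) are illustrative only.

#include <stdio.h>

#define ZONE_SIZE 3

typedef struct { int min, max; } Zone;

/* Build one (min, max) entry per ZONE_SIZE consecutive values. */
static int build_zonemap(const int *col, int n, Zone *zones) {
    int nz = 0;
    for (int i = 0; i < n; i += ZONE_SIZE, nz++) {
        zones[nz].min = zones[nz].max = col[i];
        for (int j = i + 1; j < i + ZONE_SIZE && j < n; j++) {
            if (col[j] < zones[nz].min) zones[nz].min = col[j];
            if (col[j] > zones[nz].max) zones[nz].max = col[j];
        }
    }
    return nz;
}

int main(void) {
    int col[] = {5, 9, 2, 14, 11, 17, 30, 25, 28};
    Zone zones[3];
    int nz = build_zonemap(col, 9, zones);
    int low = 10, high = 20;                          /* range predicate */
    for (int z = 0; z < nz; z++)
        if (zones[z].max >= low && zones[z].min <= high)
            printf("scan zone %d (min=%d, max=%d)\n", z, zones[z].min, zones[z].max);
        else
            printf("skip zone %d\n", z);              /* zone cannot contain results */
    return 0;
}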

Figure 2.4: Example of zonemaps

2.1.3 Applications

Columnar databases are suitable for applications that require decision making and strategic planning, as they allow for fast retrieval of columns of data. Generally, analytical applications prefer columnar databases. Columnar databases are more suitable for query processing in data warehouses, as most of the data has similar values and those operations require only one or two attributes. Columnar databases are highly efficient in big-data processing applications because they reduce the amount of data that needs to be read from disk [10]. For example, consider querying the increase in sales of a bike over a long period of time. In a row-oriented database we would have to access all the records in the database to get the required information, which would involve a lot of unnecessary disk seeks and disk reads, thereby impacting performance, whereas in a columnar database accessing only two columns would be

required. All applications that require high performance in data mining, OLAP, scientific databases, XML querying, and text and multimedia retrieval use MonetDB, a column-oriented database [14].

2.2 GPUs

2.2.1 Overview

A Graphics Processing Unit (GPU) is not a standalone platform but an electronic circuit that assists the CPU in performing specialized operations. GPUs are co-processors that have been shown to outperform CPUs for highly parallel tasks. Besides this, GPUs are highly energy efficient, affordable, and available in many types of machines, from mobile devices to supercomputers. Applications that use both a CPU and a GPU generally consist of two parts: host code and device code. The host code runs on the CPU and the device code runs on the GPU. These applications are implemented in two stages. In the first stage, the CPU, which is accountable for managing the environment, code, and data required for the device, initializes the instructions and loads them onto the device. In the second stage, the device executes those instructions by exploiting data parallelism. GPUs are called hardware accelerators because, although they are physically separated from the CPU, they perform tasks for the same application. The two important features that describe GPU capability are the number of CUDA cores and the memory size. Accordingly, the two different metrics used to describe GPU performance are peak computational performance and memory bandwidth. GPU computing is not meant to replace CPU computing. Each approach has advantages for certain kinds of programs. CPU computing is good for applications

that require control flow, and GPU computing is good for applications that require parallel computations [19]. CPUs can be used for applications which have dynamic workloads; for example, OLTP databases perform short sequences of operations in day-to-day use. GPUs can be used for applications performing complex operations involving large amounts of data. There are two main dimensions that differentiate the scope of applications for CPUs and GPUs [19]:

• Parallelism level

• Data size

Because of its ability to manage control logic, the problems that a CPU solves best are those that involve control-heavy logic, require only basic parallelism, and are small in size. Similarly, the problems for which GPUs exhibit their maximum performance are those that require massive parallelism and multi-threading and are large in size.

2.2.2 CUDA

CUDA (Compute Unified Device Architecture) [22] is an application programming interface and programming model for performing parallel computations using different programming languages like C, C++, and Fortran. It is a model created by NVIDIA and is designed for applications that require both a CPU and a GPU. The two API levels that CUDA provides for managing the GPU device and organizing threads [22] are the driver API and the runtime API, as shown in Figure 2.5. The driver API is a low-level API and is responsible for providing control over how the GPU device is used. In comparison to the runtime API, it is harder to program, because every function in the runtime API is decomposed into smaller operations that are sent to the driver API [22].

Figure 2.5: APIs of CUDA

CUDA has both host code and device code. NVIDIA's CUDA nvcc compiler separates the device code from the host code during the compilation process [22][11]. As shown in Figure 2.6, the host code is standard C code and is further compiled with C compilers. The device code is written using CUDA C, extended with keywords used to define kernels. The device code is further compiled by nvcc [22][26].

Programming Model (CUDA)

A typical CUDA program structure consists of five main steps; a minimal sketch of these steps follows the list below:

1. Allocate GPU memory.

2. Copy data from CPU memory to GPU memory.

Figure 2.6: CUDA stack

3. Invoke the CUDA kernel to perform program specific computation.

4. Copy data back from GPU memory to CPU memory.

5. Destroy GPU memory.
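The following CUDA C sketch walks through the five steps above with a trivial kernel; the kernel and variable names are illustrative, and error checking is omitted for brevity.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;                                        /* program-specific computation */
}

int main() {
    const int n = 1024;
    int host[n];
    for (int i = 0; i < n; i++) host[i] = i;

    int *dev;
    cudaMalloc(&dev, n * sizeof(int));                              /* 1. allocate GPU memory    */
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice); /* 2. copy CPU -> GPU        */
    add_one<<<(n + 255) / 256, 256>>>(dev, n);                      /* 3. invoke the CUDA kernel */
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost); /* 4. copy GPU -> CPU        */
    cudaFree(dev);                                                  /* 5. release GPU memory     */

    printf("host[0] = %d, host[%d] = %d\n", host[0], n - 1, host[n - 1]);
    return 0;
}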

The code that runs on the GPU, called the kernel, is managed by CUDA by scheduling it on GPU threads. All the information required to map work onto the host and the device, based on the application data and the type of GPU, is defined by the programmer. When the code is executed, the CPU becomes independent of the code running on the device, providing scope to perform additional tasks. The CUDA programming model is primarily asynchronous, so that computation performed on the GPU can be overlapped with host-device communication. A typical CUDA program consists of serial code complemented by parallel code

[21][22]. The serial code (as well as task-parallel code) is executed on the host, while the parallel code is executed on the GPU device. The host code is written in ANSI C, and the device code is written using CUDA C. The NVIDIA C Compiler (nvcc) generates the executable code for both the host and the device.

2.2.3 Applications

GPUs are used in various supercomputing and industrial applications. Some of them are:

• Scientific Computing [24]

– Molecular Dynamics, Genome Sequencing, Mechanical Simulation, Quantum Electrodynamics.

• Image Processing [18]

– Registration, interpolation, feature detection, recognition, filtering

• Data Analysis [12]

– Databases, sorting and searching, data mining

2.3 Why use GPUs for Columnar Databases?

GPUs solve complex operations in parallel with the help of thousands of processing cores available on a single circuit. Performing operations like sorting, grouping, and aggregation can take a lot of time on CPUs; in such cases GPUs can work effectively in parallel. Therefore, GPUs exhibit their maximum performance when they are processing large amounts of data, i.e., data warehouses. As columnar databases

are more suitable for data warehouses, GPUs can be used effectively with columnar databases. NVIDIA's CUDA API made it possible to use these GPU cards for high-performance computing on standard hardware. GPUs require specific programming, and operations need to be processed differently to take maximum advantage of their threading model. Modern analytics databases hold data in a columnar store, optimized for feeding the GPU compute capability as fast as possible. One example of a columnar GPU database is SQream DB [30]. It is a GPU-accelerated data warehouse for massive data. Companies are combining Hadoop with SQream DB to maximize the performance of SQream DB for analytics [30].

3 Proposed Algorithm

In this chapter, we present a new GPU-based algorithm, GPUImprints, to speed up query processing in columnar databases. Section 3.1 gives an overview of the proposed algorithm. Sections 3.2, 3.3, and 3.4 explain the pseudo-code of the various stages of GPUImprints.

3.1 Overview

GPUImprints consists of three stages: creating the GPU imprint index, applying a compression scheme to the created GPU imprint index, and evaluating range queries over the GPU imprint index. To create a GPU imprint index for a column, we sample the values of the column and create a histogram. In the second stage, consecutive bit vectors that are identical are compressed together. Then, in the third stage, range queries are evaluated over the GPU imprint index, which returns the ids of the values in the range specified in the query. All the algorithms in this thesis are adapted from [27].

3.2 Stage 1: Creation of Imprint Index

In this stage, a column is first transferred from the CPU to the GPU in order to create the GPU imprint vectors. For the creation of GPU imprint vectors, the values of the column are allocated to cache lines according to their size. GPUImprints is suitable for

both low-cardinality columns and high-cardinality columns. In the case of low-cardinality columns, for every value in the cache line a bit vector is created. The collection of all those bit vectors forms a GPU column imprint vector, and a group of these GPU column imprint vectors forms the GPU imprint index. In the case of high-cardinality columns, we construct a histogram with equal-width bins for the values in every cache line. Every bin represents a bit in the GPU imprint vector of that cache line. The number of bits in the imprint vector depends on the data type of the values in the column, that is, on the number of bins per cache line. Each bin represents a range of values. If a value of the column falls into the range of a bin, then the bit that represents that bin in the imprint vector of the cache line is set to 1, otherwise it is 0. The collection of all such GPU imprint vectors forms a GPU imprint index. The structure imp_i in Algorithms 1, 2, 3, and 4 holds all the constructs needed to maintain the GPU imprints index of one column. It consists of a pointer to the array of the cacheline dictionary (which keeps track of all the cache lines), a pointer to the array of the imprint vectors, an array with the values that hold the bounds of the bins of the histogram, and the actual number of bins of the histogram.

Consider a column col_c; this column is transferred from host memory (CPU) to device memory (GPU). The main difference between bitmaps and column imprints is that, instead of allocating one vector per value, imprints allocate one vector per cache line [27]. The size of a cache line in the GPU is 128 bytes, and the number of values taken per cache line depends on the data type of the column; no more than 4096 values per cache line are considered in the proposed algorithm. The histogram that is created divides the values of the column into equal ranges (bins) according to the data type of the column and the size of the cache line. The bits in each vector correspond to the bins of the histogram. For this, only the bounds of each bin need to be stored in the

GPU imprint index structure. The creation of the equi-width bins of the histogram is done by the function histbin(). Two steps are performed initially: all the values per cache line are sorted, and duplicates are removed from the cache line, as shown in Algorithm 1. Let S be the sampled column with unique values, as used in histbin(), and assume that the data size of a value in the column is 1 byte. If the number of values in the column is less than 126, then the imprint can be adjusted to have as many bits as needed to map columns with low cardinality. If the number of distinct sampled values in S is 126 or more, the domain is divided into ranges, where each range contains the same count of sampled values, counting multiple occurrences of the same value; therefore, each bin of the histogram can contain exactly one value or more than one value in this case. The borders of the histogram are calculated based on the ranges of values per cache line. The value 126 is used because the first bin always holds values from -∞ up to the smallest value found in the sample, and, similarly, the last bin contains all values greater than the largest sampled value up to +∞. The ranges for the first bin and the last bin are allocated in such a way that any future inserts into the database will not affect the way the values are distributed. The remaining bins are assigned their values by dividing the sampled values per cache line by 126; in histbin() this step size is stored in the double variable y. The values are assigned to bins in such a way that the smaller value is included in the bin range and the largest value of the bin is included in the next bin. In order to determine the bin of a particular value, we perform a binary search over all the bins. Three macros are defined for the bin-search function getbin() (used in Algorithm 2). They are:

Algorithm 1: Define the number of bins and the ranges of the bins of the histogram: histbin()
Input: GPU imprints index structure imp_i, column col_c
Output: number of bins imp.nb and the ranges imp.r

coltype *s = sample(col, x);             /* in parallel, sample x values */
sort(s);                                 /* sort the sample */
s_sz = duplicate_elimination(sample);    /* remove duplicates */
if (s_sz < 128) then                     /* less than 126 unique values */
    for i = 0 → s_sz − 1 do
        imp.r[i] = s[i];                 /* populate r with the unique values */
    end for
    if (i < 8) then imp.nb = 8;          /* determine the number of bins */
    else if (i < 16) then imp.nb = 16;
    else if (i < 32) then imp.nb = 32;
    else if (i < 64) then imp.nb = 64;
    else imp.nb = 128;
    end if
    for i = i → 127 do
        imp.r[i] = coltype_MAX;          /* default value */
    end for
else                                     /* more than 126 unique values */
    double x = 0, y = s_sz / 126;
    for i = 0 → 126 do
        imp.r[i] = s[(int)x];            /* set ranges for all bins */
        x += y;
    end for
    imp.r[126] = coltype_MAX;
end if

• middle( ): checks if the value is inside the range of the bin.

• left( ): checks if the value is smaller than the left border of the bin.

• right( ): checks if the value is larger than the right border of the bin.

The function getbin() is then constructed by repeatedly dividing the search space in half and invoking the right, middle, and left macros, in that order. This function is implemented using nested if-statements instead of a for loop.

A bit in the imprint vector is set if at least one value in the cache line falls into the corresponding bin. The resulting bit vector is the imprint vector of the current cache line.

Algorithm 2: Binary search with nested if-statements to locate the bin which a value falls into: getbin()
Input: GPU imprints index structure imp_i, value v
Output: the bin where value v falls into

middle(v, p): if (v ≥ imp.r[p − 1] ∧ v < imp.r[p]) res = p;
left(v, p):   if (v < imp.r[p])
right(v, p):  if (v ≥ imp.r[p − 1])

right(v, 64)                /* divide the bins in half and start searching */
  right(v, 96)              /* divide the second half of the bins to narrow down the bin */
    right(v, 112)
      right(v, 120)
        right(v, 124)
          res = 124;
          right(v, 126) res = 126;
        middle(v, 122)
        left(v, 120) res = 120;
      middle(v, 118)
      left(v, 116) res = 116;
      ...
middle(v, 62)
left(v, 60)
  right(v, 32)
    right(v, 12)
      ...

The collection of all the resulting bit vectors forms a unique column imprint.
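The following CUDA C sketch shows, under our own simplifying assumptions (one thread per cache line, a fixed 64-bin histogram whose upper borders are precomputed, and a linear bin probe instead of the unrolled binary search of Algorithm 2), how the per-cache-line imprint vectors of Stage 1 could be built in parallel. The names (build_imprints, borders, etc.) are illustrative and are not the thesis's actual code.

#include <cstdio>
#include <cuda_runtime.h>

#define NBINS            64
#define VALUES_PER_LINE  32          /* 4-byte values in a 128-byte cache line */

__device__ int getbin(int v, const int *borders) {
    /* simple linear probe standing in for the nested-if binary search */
    for (int b = 0; b < NBINS - 1; b++)
        if (v < borders[b]) return b;
    return NBINS - 1;
}

__global__ void build_imprints(const int *col, int n_lines,
                               const int *borders, unsigned long long *imprints) {
    int line = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per cache line */
    if (line >= n_lines) return;
    unsigned long long v = 0ULL;
    for (int j = 0; j < VALUES_PER_LINE; j++)           /* OR one bit per value */
        v |= 1ULL << getbin(col[line * VALUES_PER_LINE + j], borders);
    imprints[line] = v;                                 /* one imprint vector per cache line */
}

int main() {
    const int n_lines = 4, n = n_lines * VALUES_PER_LINE;
    int host_col[n], host_borders[NBINS];
    for (int i = 0; i < n; i++) host_col[i] = i;                    /* toy column 0..127 */
    for (int b = 0; b < NBINS; b++) host_borders[b] = (b + 1) * 2;  /* toy equi-width borders */

    int *d_col, *d_borders;
    unsigned long long *d_imp, host_imp[n_lines];
    cudaMalloc(&d_col, n * sizeof(int));
    cudaMalloc(&d_borders, NBINS * sizeof(int));
    cudaMalloc(&d_imp, n_lines * sizeof(unsigned long long));
    cudaMemcpy(d_col, host_col, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_borders, host_borders, NBINS * sizeof(int), cudaMemcpyHostToDevice);
    build_imprints<<<1, n_lines>>>(d_col, n_lines, d_borders, d_imp);
    cudaMemcpy(host_imp, d_imp, n_lines * sizeof(unsigned long long), cudaMemcpyDeviceToHost);
    for (int l = 0; l < n_lines; l++) printf("line %d imprint = %llx\n", l, host_imp[l]);
    cudaFree(d_col); cudaFree(d_borders); cudaFree(d_imp);
    return 0;
}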

Algorithm 3: Main function to create the column imprints index: imprints()
Input: column col_c of size col_sz
Output: GPU imprints index structure imp_i for column col_c

struct imp_idx imp_i;                    /* initialize the GPU column imprints index structure */
ulong imprint_v = 0;                     /* the imprint vector */
binning(imp_i);                          /* in parallel, determine the histogram's size and bin borders */
for i = 0 → col_sz − 1 do                /* for all values in col */
    bin = getbin(imp_i, col_c[i]);       /* locate bin */
    imprint_v = imprint_v | (1 << bin);  /* set bit */
    compression(imprint_i);              /* perform compression on the imprints index */
end for

3.3 Stage 2: Imprints Compression

In this stage, compression of the GPU imprint vectors is done by comparing consecutive GPU imprint vectors in the GPU imprint index. If GPU imprint vector i is identical to GPU imprint vector i+1, which is in turn identical to GPU imprint vector i+2, then all three GPU imprint vectors are stored as one vector and the number 3 is stored in the cache line dictionary. If imprint vector i+4 differs from imprint vector i+3, then i+4 is stored as a separate entry in the imprint index structure. The significant feature of this step compared to other indexing techniques is that the index is compressed by row. This stage is illustrated in Figure 3.1. The compression algorithm mainly involves two variables, cnt and repeat, as shown in Algorithm 4: cnt specifies the number of consecutive imprint bit vectors that are identical, and repeat holds only two values, set (1) and unset (0). For example, if imprint vectors i and i+1 are identical, the repeat flag is set to 1; when the repeat flag is set and i+2 is also identical, the counter is updated to 3, indicating that there are 3 identical imprint bit vectors in the imprint index. This process goes through the imprint vectors of all cache lines. After this, the cache line dictionary is correctly updated and the GPU imprint is stored. Finally, a new GPU imprint vector

27 Algorithm 4: compression function of GPUImprints: compression()

function compression()
    char vpc;                                          /* constant: values per cacheline */
    ulong i_cnt = 0;                                   /* imprints count */
    ulong d_cnt = 0;                                   /* dictionary count */
    if (i mod vpc-1 ≡ 0) then                          /* end of cacheline reached */
        if (imp_i.imprints[i_cnt] ≡ imprint_v ∧        /* same imprint */
            imp_i.cd[d_cnt].cnt < max_cnt − 1) then    /* cnt not full */
            if (imp_i.cd[d_cnt].repeat ≡ 0) then
                if (imp_i.cd[d_cnt].cnt ≠ 1) then
                    imp_i.cd[d_cnt].cnt −= 1;          /* decrease count cnt */
                    d_cnt += 1;                        /* increase dictionary count d_cnt */
                    imp_i.cd[d_cnt].cnt = 1;           /* set count to 1 */
                end if
                imp_i.cd[d_cnt].repeat = 1;            /* turn on flag repeat */
            end if
            imp_i.cd[d_cnt].cnt += 1;                  /* increase cnt by 1 */
        else                                           /* different imprint than previous */
            imp_i.imprints[i_cnt] = imprint_v;
            i_cnt += 1;
            if (imp_i.cd[d_cnt].repeat ≡ 0 ∧ imp_i.cd[d_cnt].cnt < max_cnt − 1) then
                imp_i.cd[d_cnt].cnt += 1;              /* increase cnt by 1 */
            else
                d_cnt += 1;                            /* increase dictionary count d_cnt */
                imp_i.cd[d_cnt].cnt = 1;               /* set count to 1 */
                imp_i.cd[d_cnt].repeat = 0;            /* set flag repeat off */
            end if
        end if
        imprint_v = 0;                                 /* reset imprint for next cacheline */
    end if

is created with all the bits off, the next value of the column is fetched, and the process is repeated. The compression function is detailed in Algorithm 4. Because of our compression scheme, some administrative overhead is created to keep the cache lines and the GPU imprint vectors aligned. Figure 3.1 summarizes the creation of the GPU imprint index. The required steps are listed next; a small sketch of the compression performed in step 4 follows the list.

28 Figure 3.1: Example of GPU column imprints index

1. Copy the column from host memory to device memory.

2. Create bit vectors by assigning each GPU thread to one cache line. For larger columns, we build a histogram.

3. Perform a bit-wise operation over all the bit vectors of the cache line to form a single GPU imprint vector.

4. Compress the GPU imprint vectors and create GPU imprint index.
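As a concrete illustration of step 4, the following is a minimal host-side C sketch, under our own simplifications, of the run-length idea behind the compression stage: consecutive identical imprint vectors are stored once, and a small dictionary entry records how many cache lines the stored vector covers. The structures (Entry, compress_imprints) are illustrative and do not reproduce the exact imp_i layout of Algorithm 4.

#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t cnt; uint8_t repeat; } Entry;

static int compress_imprints(const uint64_t *vecs, int n,
                             uint64_t *out, Entry *dict) {
    int d = -1;
    for (int i = 0; i < n; i++) {
        if (d >= 0 && out[d] == vecs[i]) {     /* same imprint as the previous run */
            dict[d].cnt += 1;
            dict[d].repeat = 1;
        } else {                               /* different imprint: start a new run */
            d += 1;
            out[d] = vecs[i];
            dict[d].cnt = 1;
            dict[d].repeat = 0;
        }
    }
    return d + 1;                              /* number of stored imprint vectors */
}

int main(void) {
    uint64_t vecs[] = {0x0f, 0x0f, 0x0f, 0xf0, 0x0f};
    uint64_t out[5]; Entry dict[5];
    int m = compress_imprints(vecs, 5, out, dict);
    for (int d = 0; d < m; d++)
        printf("imprint %llx covers %u cache line(s), repeat=%u\n",
               (unsigned long long)out[d], dict[d].cnt, dict[d].repeat);
    return 0;
}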

3.4 Stage 3: Query Evaluation


It is not mandatory to have all 128 bins if the cardinality is small. Query evaluation is done in a similar way as described in [27]; column imprints are well suited for range queries. Query evaluation is done on the GPU imprints index imp_i of a column. Consider evaluating a range query Q[low, high], where low is the start and high is the maximum value of the range; the result must be the values v between low and high, i.e., those satisfying low ≤ v ≤ high. The steps to evaluate range queries are:

1. The first step is to set the bits in an empty bit vector q that correspond to the bins included in the range of query Q. This bit vector might have multiple bits set to 1, since the query range may fall into multiple bins.

2. The query bit vector is then compared against the imprint vectors present in the GPU imprint index imp_i, using bit-wise intersection. This comparison leads to the following possible outcomes.

• If the query vector q and an imprint vector in the GPU imprint index have common bits, then the corresponding cache line is accessed for further processing: all the values in the cache line are examined in order to eliminate false positives.

• If the two vectors do not have any common bits, then the corresponding cache line cannot contain any qualifying values and is skipped.

The pseudo-code for evaluating range queries in this thesis is adapted from [27]. To maintain the alignment of the GPU imprint index structure and the cache line dictionary, two counters, i_cnt and cache_cnt, are used. In order to differentiate

the minimum and maximum values of a bin from the values that fall strictly inside the range of the bin, two bit vectors, mask and innermask, are produced. Innermask is a bit vector which is set when the values of the range query correspond to the borders of a bin, whereas mask corresponds to the values that fall entirely inside the bin, excluding the border values. The algorithm used in this thesis, proposed in [27], runs over all the records in the cache line dictionary and also checks for false positives efficiently. As the number of values per cache line increases, the probability of false positives also increases, which results in an increase in query response times.
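The following is a minimal C sketch of the core of this evaluation step, under simplifying assumptions (an uncompressed imprint index, 64 bins, host-side code, and no mask/innermask border handling): the query bit vector is built from the bins overlapping [low, high], intersected with each cache line's imprint vector, and only the matching cache lines are scanned to discard false positives. Function and variable names are illustrative.

#include <stdint.h>
#include <stdio.h>

#define NBINS           64
#define VALUES_PER_LINE 32

/* Locate the bin of value v given the upper borders of the bins. */
static int getbin(int v, const int *borders) {
    for (int b = 0; b < NBINS - 1; b++)
        if (v < borders[b]) return b;
    return NBINS - 1;
}

static void range_query(const int *col, int n_lines, const uint64_t *imprints,
                        const int *borders, int low, int high) {
    /* Step 1: set the bits of every bin that overlaps [low, high]. */
    uint64_t q = 0;
    for (int b = getbin(low, borders); b <= getbin(high, borders); b++)
        q |= 1ULL << b;
    /* Step 2: intersect with each imprint vector; scan only candidate cache lines. */
    for (int line = 0; line < n_lines; line++) {
        if ((imprints[line] & q) == 0) continue;        /* no common bits: skip the line */
        for (int j = 0; j < VALUES_PER_LINE; j++) {     /* eliminate false positives */
            int id = line * VALUES_PER_LINE + j;
            if (col[id] >= low && col[id] <= high)
                printf("id %d qualifies (value %d)\n", id, col[id]);
        }
    }
}

int main(void) {
    enum { N_LINES = 2, N = N_LINES * VALUES_PER_LINE };
    int col[N], borders[NBINS];
    uint64_t imprints[N_LINES] = {0, 0};
    for (int i = 0; i < N; i++) col[i] = i;                /* toy column 0..63 */
    for (int b = 0; b < NBINS; b++) borders[b] = b + 1;    /* one value per bin */
    for (int i = 0; i < N; i++)                            /* build imprints on the host */
        imprints[i / VALUES_PER_LINE] |= 1ULL << getbin(col[i], borders);
    range_query(col, N_LINES, imprints, borders, 10, 40);
    return 0;
}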

4 Experimental Results

To study the impacts of different value types, different column sizes, and different value distributions, we used one real-world dataset and one synthetic dataset gathered from [27]. Each experimental run is done by first copying a column into device memory (GPU) and then creating the imprint indexes.

4.1 Challenges of Designing a GPU Algorithm for Column Imprints

In this section we describe the challenges of designing a GPU algorithm for column imprints. They are the following:

• GPUImprints cannot be used for OLTP operations or for row-oriented databases. It exhibits its maximum performance for range queries, and those queries are related to one column. In order to execute queries involving multiple columns, the entire procedure must be performed multiple times.

• The main disadvantage of GPUImprints is that it takes more time for GPUImprints index construction when the columns are updated frequently. This is because, when the column is updated, it must be transferred from CPU to GPU, thereby increasing the transfer cost, and the GPU imprint vectors must be

constructed again. This considerably increases the total creation time of the GPU imprints index.

• Another limitation is that GPUImprints is only implemented for numerical data types; it is not implemented for categorical attributes.

• Another challenge is that an increase in the size of the cache line can increase the total query execution time: more bit-wise operations must be performed during the index construction phase, and there is a high probability of many false positives, since a single bit in the imprint vector then covers many values. So the cache line size must not be set too high or too low.

4.2 Hardware

All experiments were performed on a machine equipped with a four-core Intel Core i5 3470 chip running at 3.2 GHz, an Nvidia GPU with GDDR5 memory, and 3 GB of system RAM. The program was executed and benchmarked on Ubuntu 16.04 with kernel version 4.15.0. We implemented our GPU algorithm using CUDA 10.0, Thrust, and CUDA CUB 1.8.0.

4.3 Datasets and Query Workload

For our experiments, we used one real-world dataset and one synthetic dataset to evaluate about 100,000 range queries with random selectivity. The two datasets used are:

• Skyserver: The Skyserver [29] is a database having a uniform distribution, with double-precision

and floating-point columns. It has 4,008 columns with real, double, and long value data types. The number of rows in this database is 158,000.

• TPC-H: TPC-H [32] is a benchmark published by the Transaction Processing Performance Council (TPC) for decision support. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. It comprises 61 columns with int, date, and string value data types. In our experiments, we used the columns with int data types. The number of rows in TPC-H is 148,000.

4.4 Parameters

We evaluate our algorithm GPUImprints with respect to two parameters:

• Number of bins per cache line, that is the number of bits in the imprint vector

• Number of values per cache line

The number of bins per cache line represents the number of bits in the GPUImprints vector, that is, the width of the imprint vector. As the width of the imprint vector increases, more comparisons need to be performed at index-creation time; due to this, the time taken to create the GPU index increases. However, having wide imprint vectors can decrease the query response time, as fewer values have to be checked during query evaluation, that is, the false-positive ratio is reduced. The second parameter is the number of values per GPU imprint vector. This depends mainly on the data type of the values of the column. As the number of values per cache line increases, the GPU imprint index size decreases. However, having more

values per cache line can have an impact on the precision of the imprints.

We compare our algorithm GPUImprints with scalar column imprints in two aspects: the time taken to create the imprint index and the query response times of range queries.

4.5 Analysis of GPUImprint Index

Figure 4.1: Comparison between imprints creation time of CPU and GPU for 2-byte integer data type (TPC-H)

In stages 1 and 2 of GPUImprints, we transferred the column from the host to the GPU, then created the GPU imprint vectors by creating bins based on the size of the data type of the column values. Then, at the end, we created a GPU imprint index by using

bit-wise compression. Our experiments include a comparison of the created GPU imprints index with the CPU imprints index. Figures 4.1, 4.2, and 4.3 show the imprint creation time for different data type column values for both CPU and GPU.

Figure 4.2: Comparison between imprints creation time of CPU and GPU for 4-byte data type (Skyserver)

From Figures 4.1, 4.2, and 4.3, we can observe that, on average, the GPU is about 20X faster than the CPU. This is because each thread in the GPU is assigned to an individual cache line. A decrease in imprint creation time will ultimately lead to a decrease in query response times. The creation of the GPU imprint index is much faster and takes less than 1 second for a column of size 1000 MB, as shown in all the figures. Figure 4.4 shows how the size of the column data type influences the imprint creation time on the GPU. The maximum number of bins is taken as 128 because the maximum

Figure 4.3: Comparison between imprints creation time of CPU and GPU for 8-byte data type (Skyserver)

number of cache lines in the GPU is taken as 129. Figure 4.4 describes the time taken for creating imprints for 10,000 values, in microseconds, for different data type values. From these figures, we can observe that the time taken to create the index for the 2-byte data type is slightly less than for the 4-byte and 8-byte data types. This is because more values can be enclosed in a single cache line for the 2-byte data type than for the other two data types. Therefore, more column values can be used to create one imprint vector, which decreases the index size and ultimately decreases the imprint creation time. For Figure 4.5, we considered a column of 1000 values, with the size of the cache line (input values per cache line multiplied by the size of their data type) varying from 32 to 128 bytes. From Figure 4.5, we observed that the time taken for

37 Figure 4.4: GPU imprints index creation time per 10,000 values for different value type with varying widths of GPU imprint vector the creation of imprints per 1000 values is larger when the cache line size is 32 bytes and smaller when the cache line size is 128 bytes. This can be observed in figure. Although the time taken for the creation of imprints is smaller when cache line size is large, it might also lead to reduced precision, which has an adverse effect on query run time. While calculating imprints index creation time , we maintained the same compression percentage for both CPU and GPU, that is the compression percentage has a fixed upper limit, and it is always the ratio between the size of an imprint over the size of the cache line.

38 Figure 4.5: GPU imprints index creation time per 1000 values for different value type with varying cache line size

4.6 Performance Analysis of Range Queries

The size of the GPUImprints index is also very small compared to scalar imprints. This is shown in Figure 4.6: the GPUImprints index is about 2.9 times smaller than the scalar imprints index. This is because the size of a single cache line in the GPU is almost twice as large as that of the CPU, so the cache line dictionary in the GPUImprints index occupies fewer bits. Although the imprints index size is smaller, there is no guarantee that query response times will be faster, as the smaller index might lead to the formation of many false positives.

39 Figure 4.6: Comparison of imprints index size for different column sizes for CPU and GPU

For each imprint configuration on each data column, we evaluate range queries with random selectivity. These queries are fired against the GPU index and the CPU index, and their query response times are measured. The result of a query is the list of ids of the qualifying values. Figure 4.7 shows that the proposed algorithm can speed up the query processing times of range queries by a factor of 3.7X on average when compared to the CPU imprints algorithm. However, the GPU takes more time to execute queries than expected because, as the length of the GPU imprint vector grows, there is a chance of an increase in the false-positive ratio, since more values are enclosed in a single cache line. To demonstrate this, we considered 1000 imprint vectors in an imprint index. In Figure 4.8 we can observe that 1000 8-bit GPU imprint vectors are 2.5X faster on average for querying when compared to 1000 128-bit GPU imprint vectors.

Figure 4.7: Comparison of query response times for CPU and GPU

41 Figure 4.8: Query response time for different width GPU imprint vectors with varying cache line size

5 Conclusions and Future Work

In this chapter, we present our findings and conclusions from our study of columnar secondary indexing techniques and our comparison of GPU-based column imprints with CPU column imprints. We also discuss future work in these areas.

5.1 Conclusions

In Chapter 3, we proposed a novel GPU algorithm for indexing columnar databases with column imprints. The following is a summary of our discussion of the proposed GPUImprints algorithm.

• We introduced GPUImprints, a secondary indexing algorithm for columnar databases, implemented using CUDA 10.0.

• The algorithm uses a new strategy to construct GPU imprints indexes by executing hundreds of threads in parallel, thereby creating GPU imprints indexes in less than 1 millisecond for a column size of 1000 MB.

• This algorithm was compared against a single-core CPU imprints algorithm and proved to be 20X faster on average in GPU imprints index construction. This is because of the parallel capabilities of GPUs and the larger cache line size of the GPU, which is twice that of the CPU.

• This algorithm can speed up the query processing times of range queries by a

43 factor of 3.7X on average when compared to the CPU imprints algorithm as more values are enclosed in a single bit of an imprint vector.

• The GPU imprint index is about 2.9X smaller on average than the CPU imprint index because of the difference in their cache line sizes.
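The per-cache-line parallelism referred to in the first construction bullet can be illustrated with the following CUDA sketch (an illustrative kernel under our own assumptions — integer values, at most 64 bins, binary-search binning — not the thesis implementation): each thread builds the imprint vector of one cache line by OR-ing in the bin bit of every value in that line.

    // Minimal CUDA sketch of per-cache-line parallel imprint construction.
    // Names, the 64-bin limit, and the binning scheme are illustrative assumptions.
    #include <cuda_runtime.h>
    #include <cstdint>

    __device__ int findBin(const int *binBounds, int numBins, int v) {
        // Smallest bin whose upper bound is >= v (binary search over sorted bounds).
        int lo = 0, hi = numBins - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (v <= binBounds[mid]) hi = mid; else lo = mid + 1;
        }
        return lo;
    }

    __global__ void buildImprints(const int *values, int numValues,
                                  const int *binBounds, int numBins,
                                  int valuesPerCacheLine, uint64_t *imprints) {
        int line = blockIdx.x * blockDim.x + threadIdx.x;
        int start = line * valuesPerCacheLine;
        if (start >= numValues) return;

        int end = start + valuesPerCacheLine;
        if (end > numValues) end = numValues;

        uint64_t imprint = 0;
        for (int j = start; j < end; ++j) {
            int bin = findBin(binBounds, numBins, values[j]);
            imprint |= (uint64_t)1 << bin;  // set the bit of the bin this value falls in
        }
        imprints[line] = imprint;
    }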

5.2 Future Work

• The GPUImprints algorithm is implemented for range queries evaluated over a single column. It can be extended to evaluate range queries over multiple columns, enabling multi-attribute query processing.

• The proposed algorithm currently supports only numerical values; future work can extend it to handle categorical attributes as well.

• Another direction is to incorporate the proposed column imprints algorithm into SQream, a GPU-based columnar database, and to use it for applications such as data analysis.
