A Novel GPU Algorithm for Indexing Columnar Databases with Column Imprints
A THESIS
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY

Manaswi Mannem

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF SCIENCE

Eleazar Leal

July 2020

© Manaswi Mannem 2020

Acknowledgements

I would like to thank my advisor, Dr. Eleazar Leal, who gave me the opportunity to work on this research. He has guided and supported me throughout my degree program, including this research. I have really learnt a lot from him. Additionally, I would like to thank Dr. Arshia Khan and Dr. Desineni Subbaram Naidu for being on my defense committee and for taking the time to go over this research work with me. Finally, many thanks to all the Masters of Computer Science students from the graduating classes of 2018, 2019 and 2020 for being an integral part of my growth here at UMD.

Dedication

I dedicate this thesis to my parents, Syam Prasanna and Kamala Mannem, and my brother, Manasseh, for being the most remarkable role models to me throughout my life. Their constant support and guidance have always pushed me to pursue my dreams and to excel at them. I would also like to dedicate this thesis to the incredible faculty of the Department of Computer Science at UMD. Working and interacting with each one of them as a student, a researcher and a teaching assistant has been the most rewarding learning experience of my graduate school tenure.

Abstract

Columnar database management systems (CDBMS) are specialized database systems that store data in column-major order, i.e., all the values of each attribute are stored consecutively in storage. This is in contrast to row-major order, which stores all the values of each row consecutively and is the strategy most commonly used by relational database systems such as Oracle, SQLServer, PostgreSQL, etc.
The column-major order approach makes CDBMS more appropriate for data warehouses because query workloads for the latter systems usually involve retrieving all the values of a small subset of attributes. Just like in relational database management systems, query response time in CDBMS can greatly benefit from specially designed data structures, called indexes, that help avoid exhaustively searching the entire database. However, existing indexing techniques for CDBMS, such as bitmaps and zonemaps, have been shown to result in large storage overhead and heavy memory traffic between storage and CPU. Column imprints are an indexing approach for CDBMS that deals with these two issues by compressing the data in storage and by storing data in a way that reduces expensive data transfers between Random Access Memory (RAM) and Central Processing Unit (CPU) caches. However, data compression and decompression, which are necessary for query processing in CDBMS, impose a significant computational burden. To deal with this problem, parallel architectures such as Graphics Processing Units (GPUs) can be used. GPUs are co-processors that have been shown to outperform CPUs for highly parallel tasks. In addition, GPUs are highly energy efficient, affordable, and available in many types of machines, from mobile devices to supercomputers. Despite these advantages, no work has focused on the use of GPUs for column imprints indexing.

To address this gap, this thesis introduces the first GPU algorithm, named GPUImprints, for indexing columnar databases with column imprints. We also performed the first experimental study, using one real-world dataset and one synthetic dataset, on the use of GPUs versus CPUs for column imprints in terms of their index creation time, query response times and index space requirements.
Our experiments showed that GPUImprints can speed up the column imprints construction time by a factor of 20X and can speed up the query processing times of range queries by a factor of 3.7X when compared to the CPU imprints algorithm.

Contents

1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Outline
2 Background
2.1 Columnar Databases
2.1.1 Overview
2.1.2 Indexing Techniques
2.1.3 Applications
2.2 GPUs
2.2.1 Overview
2.2.2 CUDA
2.2.3 Applications
2.3 Why use GPUs for Columnar Databases?
3 Proposed Algorithm
3.1 Overview
3.2 Stage 1: Creation of Imprint Index
3.3 Stage 2: Imprints Compression
3.4 Stage 3: Query Evaluation
4 Experimental Results
4.1 Challenges of Designing GPU Algorithm for Column Imprints
4.2 Hardware
4.3 Datasets and Query Workload
4.4 Parameters
4.5 Analysis of GPUImprint Index
4.6 Performance Analysis of Range Queries
5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References

List of Tables

2.1 Example of bitmap indexing

List of Figures

1.1 Examples of OLAP and OLTP
1.2 Example of row and columnar databases
2.1 Example of database cracking
2.2 Example of byte-aligned bitmap code (BBC) compression
2.3 Example of WAH compression
2.4 Example of zonemaps
2.5 APIs of CUDA
2.6 CUDA stack
3.1 Example of GPU column imprints index
4.1 Comparison between imprints creation time of CPU and GPU for 2-byte integer data type (TPC-H)
4.2 Comparison between imprints creation time of CPU and GPU for 4-byte data type (Skyserver)
4.3 Comparison between imprints creation time of CPU and GPU for 8-byte data type (Skyserver)
4.4 GPU imprints index creation time per 10,000 values for different value type with varying widths of GPU imprint vector
4.5 GPU imprints index creation time per 1000 values for different value type with varying cache line size
4.6 Comparison of imprints index size for different column sizes for CPU and GPU
4.7 Comparison of query response times for CPU and GPU
4.8 Query response time for different width GPU imprint vectors with varying cache line size

1 Introduction

Database management systems are used to record daily activities. They store data in tables in either row-wise (row databases) or column-wise (columnar databases) fashion. Most systems, such as order entry, retail sales, and financial transaction systems, use databases managed by Online Transaction Processing (OLTP) systems to record their daily updates [33]. However, for decision support and strategic planning, organizations also need historical data aggregated over many years [31]. The systems that store historical data are referred to as data warehouses, also called Online Analytical Processing (OLAP) systems. Examples of OLAP and OLTP systems are shown in Figure 1.1.
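As a minimal illustration of the row-wise and column-wise layouts mentioned above, the following Python sketch flattens a small, made-up table both ways. The table contents and the helper names (`to_row_major`, `to_column_major`, `column_slice`) are assumptions chosen for illustration; they are not taken from the thesis or from any database system.

```python
# Hypothetical three-attribute table (id, name, age), for illustration only.
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
]


def to_row_major(table):
    """Flatten row by row: all values of row 0, then all values of row 1, ..."""
    return [value for row in table for value in row]


def to_column_major(table):
    """Flatten column by column: all values of attribute 0, then attribute 1, ..."""
    return [value for column in zip(*table) for value in column]


def column_slice(flat, n_rows, col):
    """Read one attribute from a column-major buffer as a contiguous slice."""
    return flat[col * n_rows:(col + 1) * n_rows]


row_major = to_row_major(rows)
col_major = to_column_major(rows)

# In the column-major buffer, the "age" attribute is one contiguous run.
ages_from_columns = column_slice(col_major, len(rows), 2)

# In the row-major buffer, the same values are spread out with a stride
# equal to the number of attributes per row.
ages_from_rows = row_major[2::len(rows[0])]
```

In this sketch, reading a single attribute from the column-major buffer touches only a contiguous range, whereas the row-major buffer requires stepping over the other attributes of every row.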
Figure 1.1: Examples of OLAP and OLTP

In a row-oriented database, the data is stored row by row; that is, the second row is stored only after the entire first row has been stored. In columnar databases, data is stored column-wise; that is, all the data associated with a column is stored adjacently. Figure 1.2 shows both layouts: in the row-oriented database, data is organized by row, keeping all of the data associated with a row next to each other in memory, while in the column-oriented database, data is organized by column, keeping all the data associated with a column next to each other in memory. Examples of columnar databases include MonetDB [20], MariaDB [17], CrateDB [9], and Apache HBase [3]. MonetDB is a main-memory database management system that stores data in columns. The architecture of MonetDB has three layers: the front-end layer (top), the back-end layer (middle) and the database kernel (bottom) [20]. The two indexing mechanisms used in MonetDB are Database Cracking [14] and Column Imprints [27].

Figure 1.2: Example of row and columnar databases

1.1 Motivation

Column-oriented databases can scan the entire database faster than row-oriented databases [1]. This is because, in column-oriented databases, a scan is performed by traversing an array in a loop. This traversal can be sped up further by making proper use of indexes. According to [1], column-oriented database indexes are faster than traditional row-oriented database indexes and are used as a standard for storing and querying large data warehouses. A main feature of the columnar architecture is that compression [35] can be performed efficiently. Because most of the values in a column are repetitive and share the same data type, compression algorithms can achieve a higher compression ratio by collapsing similar consecutive values into one. An index is a data structure that helps to speed up query search. One example of an index is an online catalog