Enhanced Bitmap Indexes for Large Scale Data Management
DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Guadalupe Canahuate, MS

The Ohio State University
2009

Dissertation Committee:
Dr. Hakan Ferhatosmanoglu, Adviser
Dr. Gagan Agrawal
Dr. P. Sadayappan
Dr. Timothy Long

Approved by: Adviser, Graduate Program in Computer Science and Engineering

© Copyright by Guadalupe Canahuate 2009

ABSTRACT

Advances in technology have enabled the production of massive volumes of data through observations and simulations in many application domains. These new data sets and the associated queries pose a new challenge for efficient storage and data retrieval that requires novel indexing structures and algorithms. We propose a series of enhancements to bitmap indexes to make them account for the inherent characteristics of large scale datasets and to efficiently support the type of queries needed to analyze the data. First, we formalize how missing data should be handled and how queries should be executed in the presence of missing data. Then, we propose an adaptive code ordering as a hybrid between Gray code and lexicographic orderings to reorganize the data and further reduce the size of the already compressed bitmaps. We address the inability of the compressed bitmaps to directly access a given row by proposing an approximate encoding that compresses the bitmap in a hash structure. We also extend the existing run-length encoders of bitmap indexes by adding an extra word to represent future rows with zeros and minimize the insertion overhead of new data. We propose a comprehensive framework to execute similarity searches over the bitmap indexes without changes to the current bitmap structure, without accessing the original data, and using a similarity function that is meaningful in high dimensional spaces.
Finally, we propose a new encoding and query execution for non-clustered bitmap indexes that combines several attributes in one existence bitmap, reduces the storage requirement, and improves query execution and update time for low cardinality attributes.

To my family, my husband and my children.

ACKNOWLEDGMENTS

Numerous people have contributed to my success in pursuing this degree. I could not have done this alone. There are too many names to make an exhaustive list, and a mere thank you is not enough evidence of my deepest appreciation.

Thank you God for all the blessings during all these years. Thank you to my wonderful husband, Jose Maria, who believed in me, left everything and came with me to pursue a dream. To my beautiful children, Jose Enrique and Maria Josefa, who are the joy of my life: everything I do is because you motivate me to be a better person every day. To my parents, Francisco and Josefa, whose support has been essential and whose example has given my life direction. To my brothers and sisters, Omar, Angela, Francisco, and Juanita, thank you for always being there. To my in-laws, Carlos Enrique and Maria Belen, who sacrificed their time together and supported us every step of the way. To my relatives in both the Dominican Republic and Spain, for always encouraging us to continue forward.

I especially want to thank my advisor, Hakan Ferhatosmanoglu. I could not have asked for a better advisor: one of the brightest people I have had the pleasure to meet, an excellent researcher and a better human being. Thank you for your time, patience, and guidance over the years. I also want to thank Dr. Timothy Long for his good advice and endless jokes, and Dr. Donna Byron for encouraging me to pursue a PhD during my Master's program. I want to acknowledge the other members of my dissertation committee: Dr. Gagan Agrawal, Dr. P. Sadayappan, and Dr. Timothy Long.
I would like to extend my sincere appreciation to my classmates and colleagues, especially my co-authors, with whom I have enjoyed working and from whom I have learned a great deal. During my time at Ohio State I built friendships that I will cherish for the rest of my life. Thank you to all my friends, especially Dr. Michael Gibas for making it so easy to work with him and for his true friendship. I also want to acknowledge the administrative staff in the CSE department and the OSU community in general for making me feel welcome and part of the community. I will always keep you close to my heart.

My PhD was supported in part by National Science Foundation grants OCI-0619041 and IIS-0546713, and DOE Award No. DE-FG02-03ER25573.

VITA

Dec 12, 1978 .......... Born - Santo Domingo, Dominican Republic
Aug 1996 - Jul 2000 .......... B.S. Systems Engineering, Pontificia Universidad Católica Madre y Maestra, Santo Domingo, Dominican Republic
Sept 2002 - Dec 2003 .......... M.S. Computer Science and Engineering, Ohio State University, Columbus, Ohio
Sept 2004 - present .......... Graduate Assistant, Department of Computer Science and Engineering, Ohio State University, Columbus, Ohio

PUBLICATIONS

Research Publications

Guadalupe Canahuate, Tan Apaydin, Ahmet Sacan, Hakan Ferhatosmanoglu. Secondary Bitmap Indexes with Vertical and Horizontal Partitioning. International Conference on Extending Database Technology (EDBT). Russia, 2009/March.

Nilgun Ferhatosmanoglu, Theodore Allen, Guadalupe Canahuate. Vector Space Search Engines that Maximizes Expected User Utility. Accepted for publication in the International Journal of Mathematics in Operational Research (IJMOR).

Tan Apaydin, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Ali Saman Tosun. Dynamic Data Organization for Bitmap Indices. International ICST Conference on Scalable Information Systems (INFOSCALE), Vico Equense, Italy, 2008/June.
Michael Gibas, Guadalupe Canahuate, Hakan Ferhatosmanoglu. Online Index Recommendations for High-Dimensional Databases Using Query Workloads. IEEE Transactions on Knowledge and Data Engineering (TKDE) Journal, February 2008 (Vol. 20, no. 2), pp. 246-260.

Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu. Update Conscious Bitmap Indexes. International Conference on Scientific and Statistical Database Management (SSDBM), Banff, Canada, 2007/July.

Tan Apaydin, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Ali Saman Tosun. Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps. International Conference on Very Large Data Bases (VLDB), Seoul, Korea, 2006/September, pp. 846-857.

Hakan Ferhatosmanoglu, Ali Saman Tosun, Guadalupe Canahuate, Aravind Ramachandran. Efficient Parallel Processing of Range Queries through Replicated Declustering. Distributed and Parallel Databases Journal. 2006/July, pp. 117-147.

Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu. Indexing Incomplete Databases. International Conference on Extending Database Technology (EDBT), Munich, Germany, 2006/March, pp. 884-901.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures

Chapters:

1. Introduction
2. Background
    2.1 Bitmap Indexes
    2.2 Bitmap Compression
3. Extending Bitmap Indexes to Handle Missing Data
    3.1 Introduction
    3.2 Related Work
        3.2.1 Missing Data
        3.2.2 Vector Approximation (VA) Files
    3.3 Problem Definition
    3.4 Proposed Solution
        3.4.1 Bitmap Equality Encoding (BEE)
        3.4.2 Bitmap Range Encoding (BRE)
    3.5 Experiments and Results
        3.5.1 Experimental Framework
        3.5.2 Index Size
        3.5.3 Query Execution Time
    3.6 Conclusions
4. Improving Bitmap Index Compression by Data Reorganization
    4.1 Introduction
    4.2 Adaptive Code Ordering
        4.2.1 Tuple Reordering
        4.2.2 Heuristics for Tuple Reordering
        4.2.3 Adaptive Code Ordering (ACO)
        4.2.4 Criteria for Column Ordering
        4.2.5 Expected Size of the ACO Reordered Bitmap Table
    4.3 Experimental Results
        4.3.1 Performance of Adaptive Code Ordering
        4.3.2 Effects of Column Ordering
        4.3.3 Improvements in Query Execution Times
        4.3.4 Column Ordering by Query History Files
    4.4 Conclusions and Future Work
5. Direct Access over Compressed Bitmap Indexes
    5.1 Introduction
    5.2 Related Work
    5.3 Proposed Scheme
        5.3.1 Encoding General Boolean Matrices
        5.3.2 Approximate Bitmap (AB) Encoding
    5.4 Analysis of Proposed Scheme
        5.4.1 False Positive Rate
        5.4.2 Size vs. Precision
    5.5 Experimental Framework
        5.5.1 Data Sets
        5.5.2 Hash Functions
        5.5.3 Queries
        5.5.4 Experimental Setup
    5.6 Experimental Results
        5.6.1 AB Size
        5.6.2 Precision
        5.6.3 Execution Time
        5.6.4 Single Hash Function vs. Independent