Enhanced Bitmap Indexes for Large Scale Data Management

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Guadalupe Canahuate, MS

*****

The Ohio State University

2009

Dissertation Committee:

Dr. Hakan Ferhatosmanoglu, Adviser
Dr. Gagan Agrawal
Dr. P. Sadayappan
Dr. Timothy Long

Approved by: Adviser, Graduate Program in Computer Science and Engineering

© Copyright by

Guadalupe Canahuate

2009

ABSTRACT

Advances in technology have enabled the production of massive volumes of data through observations and simulations in many application domains. These new data sets and the associated queries pose a new challenge for efficient storage and data retrieval that requires novel indexing structures and algorithms. We propose a series of enhancements to bitmap indexes to make them account for the inherent characteristics of large scale datasets and to efficiently support the type of queries needed to analyze the data. First, we formalize how missing data should be handled and how queries should be executed in the presence of missing data. Then, we propose an adaptive code ordering as a hybrid between Gray code and lexicographic orderings to reorganize the data and further reduce the size of the already compressed bitmaps. We address the inability of the compressed bitmaps to directly access a given row by proposing an approximate encoding that compresses the bitmap in a hash structure. We also extend the existing run-length encoders of bitmap indexes by adding an extra word to represent future rows with zeros and minimize the insertion overhead of new data. We propose a comprehensive framework to execute similarity searches over the bitmap indexes without changes to the current bitmap structure, without accessing the original data, and using a similarity function that is meaningful in high dimensional spaces. Finally, we propose a new encoding and query execution for non-clustered bitmap indexes that combines several attributes in one existence bitmap, reduces the storage requirement, and improves query execution and update time for low cardinality attributes.

To my family, my husband, and my children.

ACKNOWLEDGMENTS

Numerous people have contributed to my success in pursuing this degree. I could not have done this alone. There are too many names to make an exhaustive list and a mere thank you is not enough evidence of my deepest appreciation.

Thank you God for all the blessings during all these years. Thank you to my wonderful husband, Jose Maria, who believed in me, left everything, and came with me to pursue a dream. To my beautiful children, Jose Enrique and Maria Josefa, who are the joy of my life: everything I do is because you motivate me to be a better person every day. To my parents, Francisco and Josefa, whose support has been essential and whose example has given direction to my life. To my brothers and sisters, Omar, Angela, Francisco, and Juanita, thank you for always being there. To my in-laws, Carlos Enrique and Maria Belen, who sacrificed their time together and supported us every step of the way. To my relatives both in the Dominican Republic and Spain, for always encouraging us to continue forward.

I especially want to thank my advisor Hakan Ferhatosmanoglu. I could not have asked for a better advisor: one of the brightest persons I have had the pleasure to meet, an excellent researcher, and a better human being. Thank you for your time, patience, and guidance over the years. I also want to thank Dr. Timothy Long for his good advice and endless jokes, and Dr. Donna Byron for encouraging me to pursue a PhD during my Master’s program. I want to acknowledge the other members of my dissertation committee: Dr. Gagan Agrawal, Dr. P. Sadayappan, and Dr. Timothy Long.

I would like to extend my sincere appreciation to my classmates and colleagues, especially my co-authors, with whom I have enjoyed working and from whom I have learned a great deal. During my time at Ohio State I built friendships that I will cherish for the rest of my life. Thank you to all my friends, especially Dr. Michael Gibas, for making it so easy to work with him and for his true friendship. I also want to acknowledge the administrative staff in the CSE department and the OSU community in general for making me feel welcome and part of the community. I will always keep you close to my heart.

My PhD was supported in part by National Science Foundation grants OCI-0619041 and IIS-0546713, and DOE Award No. DE-FG02-03ER25573.

VITA

Dec 12, 1978 ...... Born, Santo Domingo, Dominican Republic
Aug 1996 - Jul 2000 ...... B.S. Systems Engineering, Pontificia Universidad Católica Madre y Maestra, Santo Domingo, Dominican Republic
Sept 2002 - Dec 2003 ...... M.S. Computer Science and Engineering, Ohio State University, Columbus, Ohio
Sept 2004 - present ...... Graduate Assistant, Department of Computer Science and Engineering, Ohio State University, Columbus, Ohio

PUBLICATIONS

Research Publications

Guadalupe Canahuate, Tan Apaydin, Ahmet Sacan, Hakan Ferhatosmanoglu. Secondary Bitmap Indexes with Vertical and Horizontal Partitioning. International Conference on Extending Database Technology (EDBT), Russia, 2009/March.

Nilgun Ferhatosmanoglu, Theodore Allen, Guadalupe Canahuate. Vector Space Search Engines that Maximize Expected User Utility. Accepted for publication in the International Journal of Mathematics in Operational Research (IJMOR).

Tan Apaydin, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Ali Saman Tosun. Dynamic Data Organization for Bitmap Indices. International ICST Conference on Scalable Information Systems (INFOSCALE), Vico Equense, Italy, 2008/June.

Michael Gibas, Guadalupe Canahuate, Hakan Ferhatosmanoglu. Online Index Recommendations for High-Dimensional Databases Using Query Workloads. IEEE Transactions on Knowledge and Data Engineering (TKDE) Journal, February 2008 (Vol. 20, no. 2), pp. 246-260.

Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu. Update Conscious Bitmap Indexes. International Conference on Scientific and Statistical Database Management (SSDBM), Banff, Canada, 2007/July.

Tan Apaydin, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Ali Saman Tosun. Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps. International Conference on Very Large Data Bases (VLDB), Seoul, Korea, 2006/September, pp. 846-857.

Hakan Ferhatosmanoglu, Ali Saman Tosun, Guadalupe Canahuate, Aravind Ramachandran. Efficient Parallel Processing of Range Queries through Replicated Declustering. Distributed and Parallel Databases Journal, 2006/July, pp. 117-147.

Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu. Indexing Incomplete Databases. International Conference on Extending Database Technology (EDBT), Munich, Germany, 2006/March, pp. 884-901.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures

Chapters:

1. Introduction

2. Background
   2.1 Bitmap indexes
   2.2 Bitmap Compression

3. Extending Bitmap Indexes to Handle Missing Data
   3.1 Introduction
   3.2 Related Work
       3.2.1 Missing Data
       3.2.2 Vector Approximation (VA) Files
   3.3 Problem Definition
   3.4 Proposed Solution
       3.4.1 Bitmap Equality Encoding (BEE)
       3.4.2 Bitmap Range Encoding (BRE)
   3.5 Experiments and Results
       3.5.1 Experimental Framework
       3.5.2 Index Size
       3.5.3 Query Execution Time
   3.6 Conclusions

4. Improving Bitmap Index Compression by Data Reorganization
   4.1 Introduction
   4.2 Adaptive code ordering
       4.2.1 Tuple Reordering
       4.2.2 Heuristics for Tuple Reordering
       4.2.3 Adaptive code ordering (ACO)
       4.2.4 Criteria for Column Ordering
       4.2.5 Expected Size of the ACO Reordered Bitmap Table
   4.3 Experimental Results
       4.3.1 Performance of adaptive code ordering
       4.3.2 Effects of column ordering
       4.3.3 Improvements in Query Execution Times
       4.3.4 Column Ordering by Query History Files
   4.4 Conclusions and Future Work

5. Direct Access over Compressed Bitmap Indexes
   5.1 Introduction
   5.2 Related Work
   5.3 Proposed Scheme
       5.3.1 Encoding General Boolean Matrices
       5.3.2 Approximate Bitmap (AB) Encoding
   5.4 Analysis of Proposed Scheme
       5.4.1 False Positive Rate
       5.4.2 Size vs. Precision
   5.5 Experimental Framework
       5.5.1 Data Sets
       5.5.2 Hash Functions
       5.5.3 Queries
       5.5.4 Experimental Setup
   5.6 Experimental Results
       5.6.1 AB Size
       5.6.2 Precision
       5.6.3 Execution Time
       5.6.4 Single Hash Function vs. Independent Hash Functions
   5.7 Conclusion

6. Similarity Search over Bitmap Indexes
   6.1 Introduction
   6.2 Background
       6.2.1 Bit-sliced Indexes
       6.2.2 Similarity Functions for High Dimensional Spaces
       6.2.3 Partition Strategies
   6.3 Proposed Approach
       6.3.1 Object Ranking
       6.3.2 Similarity Function
       6.3.3 BitmapNN Query Execution
       6.3.4 Additional BitmapNN Operations
       6.3.5 Enhanced BitmapNN Functionality
       6.3.6 Analysis of the BitmapNN approach
   6.4 Experimental Results
       6.4.1 BitmapNN Quality of the results
       6.4.2 BitmapNN Query Execution Time
       6.4.3 Index Size
       6.4.4 Comparison with other bitmap-based approaches
   6.5 Extended query types supported by BitmapNN
       6.5.1 Weighted Similarity Search
       6.5.2 Projected Similarity Search
       6.5.3 Constrained Similarity Search
       6.5.4 Complex Similarity Search
   6.6 Related Work
   6.7 Conclusion

7. Update Conscious Bitmap Indexes
   7.1 Introduction
   7.2 Update Conscious Bitmaps (UCB)
       7.2.1 The General Idea
       7.2.2 Bitmap Creation
       7.2.3 Insertions
       7.2.4 Deletions
   7.3 Bitmap Index Cost Model
       7.3.1 Insertions
       7.3.2 Deletions
       7.3.3 Query Execution
   7.4 Experimental Results
       7.4.1 Experimental Setup
       7.4.2 Update Time
       7.4.3 Query Execution Time
       7.4.4 Worst Case Existence Bitmap
       7.4.5 Compressed Size
   7.5 Conclusion

8. Non-clustered Bitmap Indexes
   8.1 Introduction
   8.2 Approach
       8.2.1 Technical motivation
       8.2.2 System Overview
       8.2.3 Index Structure
       8.2.4 Query Engine
       8.2.5 Partitioning
   8.3 Experimental Results
       8.3.1 Number of Rows
       8.3.2 Number of Attributes
       8.3.3 Attribute Cardinality
       8.3.4 Number of Distinct Ranks
       8.3.5 Number of Segments
       8.3.6 NCB Performance over Real Datasets
       8.3.7 Index Size
       8.3.8 Comparison with Secondary Sorted Bitmaps (SSB)
   8.4 Conclusion

9. Conclusions and Future Research

Bibliography

LIST OF FIGURES

2.1 Bitmap Examples for a table with two attributes and three bins per attribute.

2.2 A WAH bit vector.

3.1 Interval Evaluation for Bitmap Equality Encoding

3.2 Interval Evaluation for Bitmap Range Encoding

3.3 Index Size Versus (a) Cardinality and (b) Percent of Missing Data

3.4 Query Execution Time Versus (a) Cardinality, (b) Percent of Missing Data, and (c) Query Dimensionality

4.1 Example for tuple reordering on an equality encoded bitmap table.

4.2 Illustration of Algorithm 1.

4.3 Example for tuple reordering on a range encoded bitmap table.

4.4 Illustration of the adaptive code ordering effectiveness. On the left is the lexicographic (numeric) order of an equality encoded bitmap table with 9 columns (3 attributes with cardinality 3). On the right is the adaptive code ordering of the same table. White and shaded blocks represent runs of 0s and 1s, respectively. Each horizontal split indicates the segments created by the reordering algorithm at each column. Adaptive code produces fewer and longer runs than lexicographic order.

4.5 Execution Time for ACO and WAH Compression.

4.6 Algorithm scalability (a) on the number of rows and (b) on the number of columns.

4.7 Performance of ACO and WAH for different column orderings and varying numbers of attributes

4.8 Performance comparison of ACO and WAH for HEP data. Columns are ordered in their compressed size.

4.9 Performance comparison for varying cardinality

4.10 Performance of ACO for different data distributions and varying cardinality: (a) attributes with cardinality 10 and (b) attributes with cardinality 50.

4.11 Query execution times for (a) point queries and (b) range queries.

5.1 A Bloom Filter

5.2 Insertion Algorithm.

5.3 Example of a Boolean Matrix M.

5.4 AB(M) for F(i, j) = concatenate(i, j), k = 1, and H1(x) = x mod 32.

5.5 Retrieval Algorithm.

5.6 Example of a bitmap table.

5.7 Retrieval Algorithm for Bitmap Queries.

5.8 False Positive Rate vs. α

5.9 False Positive Rate vs. k

5.10 AB Size vs. Precision trade-off. (m = log2 n)

5.11 Precision as a function of α

5.12 Precision as a function of the number of hash functions (k).

5.13 Precision as a function of the # of Rows.

5.14 Exec. Time as a function of α.

5.15 Exec. Time as a function of k.

5.16 Exec. Time Performance: WAH vs. AB

5.17 SHA-1 comparison with our other hash functions: Precision as a function of the number of hash functions, k.

6.1 Simple bitmap example for a table with two attributes and three bins per attribute.

6.2 Query Example using BitmapNN.

6.3 Sample weight matrix used in query rewriting to perform column weighting.

6.4 Query Execution Time comparison as query dimensionality varies (log scale). Data: ur random, k: 1

6.5 Query Execution Time as the dimensionality of the query varies, Access structures designed for high-D. Data: ur random, k: 1.

6.6 Query Execution Time as k varies. Data: landsat, Partitions: 16

6.7 Query Execution Time as the number of partitions varies. Data: landsat, k: 20

6.8 Query Execution Time comparison of BitmapNN as query dimensionality varies. Data: landsat, k: 20

6.9 Average Bitcolumn Size as the number of partitions increases. Data: landsat.

7.1 WAH compressed bit vector for original bitmaps and Update Conscious Bitmaps (UCB).

7.2 Insertion operations for Update Conscious Bitmap Index over Attribute sex.

7.3 Index Update Time vs. Number of Rows Inserted.

7.4 Index Update Time vs. Number of Rows Inserted.

7.5 Index Update Time vs. Cardinality of the Attribute Indexed.

7.6 Index Update Time vs. Cardinality of the Attribute Indexed.

7.7 Index Update Time vs. Number of Attributes Indexed.

7.8 Point Query Execution Time vs. Number of Attributes Queried.

7.9 Point Query Execution Time for best case EB and worst case EB.

8.1 Full Sorted Bitmap Table with 3 attributes and 3 bins per attribute. The white blocks represent 0s and black blocks represent 1s.

8.2 System Overview of the proposed approach.

8.3 Execution of point queries for datasets with 5 attributes each with cardinality 5, different distributions, and varying number of rows.

8.4 Execution of point queries for datasets with varying number of attributes each with cardinality 5, different distributions, and 1M rows.

8.5 Execution of point queries for datasets with 5M rows, 5 attributes, and different distributions with varying cardinality.

8.6 Execution time of point queries for increasing number of distinct ranks in the EB.

8.7 Execution time of point queries for datasets with 5 attributes, each with cardinality 5, 1M rows, and different distributions. NCBs are built with varying number of horizontal partitions (segments).

8.8 Execution of point queries for real datasets as the query dimensionality increases. Query attributes are selected in the same order of the dataset attributes.

8.9 Execution of point queries for real datasets as the query dimensionality increases. Query attributes are randomly selected.

8.10 Execution of range queries with varying percentage of attribute cardinalities queried.

8.11 Index Size Comparison for FastBit, Sorted FastBit and NCB for real datasets.

8.12 Execution of point queries with varying dimensionality in sorting order for Sorted Secondary Bitmaps (SSB) and Non-clustered Bitmaps (NCB).

8.13 Execution of range queries with varying percentage of attribute cardinalities queried for Sorted Secondary Bitmaps (SSB) and Non-clustered Bitmaps (NCB).

CHAPTER 1

Introduction

Advances in technology have enabled the production of massive volumes of data through observations and simulations in many application domains such as biology, high-energy physics, climate modeling, and astrophysics. In computational high-energy physics, simulations are continuously run, and notable events are stored in a database or a file. The number of events that need to be stored in one year is on the order of several million [57]. In astrophysics, technological advances have enabled dedicating several telescopes to observations, the results of which need to be stored for later query processing [59]. Genomic and proteomic technologies are now capable of generating terabytes of data in a single day’s experimentation [81]. These new data sets and the associated queries are significantly different from those of traditional database systems, most importantly due to their enormous size and high dimensionality (more than, for example, 200 attributes in high-energy physics experiments). These pose a new challenge for efficient storage and retrieval of data and require novel indexing structures and algorithms.

Most large scale data is read-mostly: data is written once and rarely updated afterward, although insertion of new data is frequent. The main requirement is to provide the user with fast, efficient access to the data and to facilitate the discovery of patterns and useful information in the data.

As described in [72], all hierarchical multi-dimensional index structures and space partitioning based indexes break down after a certain number of dimensions or partitions indexed. A suitable index for ad-hoc querying in high dimensional spaces needs to maintain dimensionality independence to avoid performance degradation. This is the case for bitmap indexes. Bitmap indexes are binary structures that identify the rows that satisfy a certain criterion. The main advantage of bitmap indexes is that bit operations are performed by hardware, making them extremely fast. The main disadvantage is the amount of space they require. For discrete attributes, one bitmap is created for each category or attribute value. For continuous attributes and attributes with very large cardinality, the range of values is divided into a few classes/intervals. This is known as quantization. For example, we could quantize an integer attribute using three classes: class A is 999 or less, class B is 1000 to 9999, and class C is 10000 and above.
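The three-class quantization above can be sketched in a few lines; the function name `quantize` and the exact class boundaries (class A ending at 999) are illustrative assumptions.

```python
# Sketch of the three-class quantization described above.
# Assumed boundaries: A <= 999, B = 1000..9999, C >= 10000.
def quantize(value: int) -> str:
    """Map an integer attribute value to its quantization class."""
    if value < 1000:
        return "A"
    elif value < 10000:
        return "B"
    else:
        return "C"

print([quantize(v) for v in [42, 5000, 123456]])  # ['A', 'B', 'C']
```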

Since quantization is a lossy technique, it is mainly used in domains where large amounts of data are available and some degree of imprecision is acceptable. In the signal processing domain, for example, the original data is discarded after quantization and only the quantized data is transferred. In the database domain, where users sometimes demand exact answers, quantization is used in a slightly different manner. The goal is to create a compressed representation of the data that fits into main memory and can be accessed faster than the original data. Both the original data and the quantized representation are stored and kept up to date.

Query execution is then performed as a two-step process. In the first step, the quantized data is accessed to obtain a candidate set. In the second step, the original data objects from the candidate set are accessed to determine the exact answer set of the query. In many cases, when the user constraints refer to the quantized classes, the second step is not necessary since the candidate set corresponds to the query answer set. In other cases, the second step is omitted when approximate answers are acceptable in the application at hand; the original data need not be accessed, and queries are answered using only the quantized data. This is preferable since the high I/O cost of accessing the actual data is avoided. In any case, because each dimension is quantized independently, this technique is not subject to the curse of dimensionality.
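The two-step process can be illustrated with a minimal sketch; `two_step_query` and the in-memory lists standing in for the quantized representation and the original data are hypothetical, not part of any system described here.

```python
# Illustrative sketch of two-step query execution over quantized data.
def two_step_query(quantized, original, query_class, exact_predicate=None):
    # Step 1: scan the (memory-resident) quantized representation
    # to obtain a candidate set of row ids.
    candidates = [rid for rid, cls in enumerate(quantized) if cls == query_class]
    # If the query constraint refers to the quantized classes themselves,
    # the candidate set already equals the answer set.
    if exact_predicate is None:
        return candidates
    # Step 2: access the original data only for the candidates
    # to determine the exact answer set.
    return [rid for rid in candidates if exact_predicate(original[rid])]

quantized = ["A", "B", "B", "C", "B"]
original = [42, 1500, 9000, 20000, 3000]
print(two_step_query(quantized, original, "B", lambda v: v > 2000))  # [2, 4]
```

Step 1 narrows five rows down to three candidates; step 2 touches only those three original values.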

In this dissertation, we aim to extend and improve bitmap indexes to manage large scale data efficiently. We focus on generalizing bitmap indexes to address the inherent characteristics of the data and to efficiently support the type of queries users need to analyze the data. Our contributions address three main shortcomings of bitmap indexes: space requirements, incremental updates, and richer query execution types. The issues addressed in this dissertation are the following. In Chapter 2, we provide a background description of bitmap indexes. In Chapter 3, we extend bitmap indexes to handle incomplete/missing data. In Chapter 4, we propose an adaptive code ordering as a hybrid between Gray code and lexicographic orderings to reorganize the data and further reduce the size of the already compressed bitmaps. In Chapter 5, we address the inability of compressed bitmaps to directly access a given row by proposing an approximate encoding that compresses the bitmap into a hash structure to efficiently execute highly selective queries. In Chapter 6, we design a bitmap framework to support similarity search. Our method uses the current bitmap structure without changes and answers the queries using only the index. The proposed similarity function is based on Hamming distance and is meaningful in high dimensional spaces. In Chapter 7, we propose update conscious bitmaps to efficiently handle appends of new data over compressed bitmaps. We add a padword to the existing run-length encoders to represent future rows with zeros and minimize the insertion overhead of new data. In Chapter 8, we propose a new encoding and query evaluation algorithm for non-clustered bitmap indexes that combines several attributes in one existence bitmap, reduces the storage requirement, and improves query execution and update time for low cardinality attributes. Finally, in Chapter 9 we provide concluding remarks and future research plans for this work.

CHAPTER 2

Background

The data generated by scientific experiments is composed of attributes that are numerical or enumerated. Compared to conventional databases, a data record in a scientific database involves many more attributes, up to the order of hundreds, and the number of tuples is huge due to the technological advances that make it possible to generate huge volumes of data on a daily basis. High energy physics simulations generate millions of events to be stored in a single year. Due to such a large volume of data, even simple queries are extremely slow without an effective index structure in place. However, neither the well-known multi-dimensional indexing techniques [56, 24] nor their extensions [38, 36, 8, 15, 10] have been successful in scientific database systems, partly due to the effects of the infamous dimensionality problems [7, 72] and the massive scales of these systems.

2.1 Bitmap indexes

Bitmap indexing, which has been effectively utilized in many major commercial Database Management Systems, e.g., Oracle [5, 6], IBM [51], Informix [27, 44], and Sybase [21, 67], has also been the most popular approach for scientific databases [58, 47, 76, 77, 63]. The reason is that bitmap indexes exploit the fact that each attribute is numeric or enumerated, and they are not exponentially affected by the dimensionality of the data since each dimension is indexed independently. The topic of bitmap indexes was introduced in [45]. The basic idea is that data are partitioned or quantized into several bins, where the number of bins per attribute can vary, and each bin is represented by a bitmap vector. Several bitmap encoding schemes have been developed, such as equality [45], range [16], interval [16], and workload and attribute distribution oriented [31].

        (a) Basic (Equality) Encoding     (b) Range Encoding
        Attribute 1    Attribute 2        Attribute 1    Attribute 2
Tuple   b1  b2  b3     b1  b2  b3         b1  b2  b3     b1  b2  b3
t1       1   0   0      0   0   1          1   1   1      0   0   1
t2       0   1   0      1   0   0          0   1   1      1   1   1
t3       1   0   0      1   0   0          1   1   1      1   1   1
t4       0   0   1      0   0   1          0   0   1      0   0   1
t5       0   1   0      0   1   0          0   1   1      0   1   1
t6       0   0   1      1   0   0          0   0   1      1   1   1

Figure 2.1: Bitmap Examples for a table with two attributes and three bins per attribute.

For the simple bitmap encoding (also called equality encoding) [45], if a value falls into a bin, this bin is marked “1”, otherwise “0”. Since a value can only fall into a single bin, only a single “1” can exist for each row of each attribute. After binning, the whole database is converted into a huge 0-1 bitmap, where rows correspond to tuples and columns correspond to bins. Figure 2.1 shows an example using a table with two attributes, each partitioned into three bins. Figure 2.1(a) shows the equality encoded bitmap for this table. The first tuple t1 falls into the first bin in attribute 1, and the third bin in attribute 2. Note that after binning we can treat each tuple as a binary number. For instance, t1 = 100001 and t2 = 010100.

For range encoding bitmaps [16], a bin is marked “1” if the value falls into it or a smaller bin, and “0” otherwise. Figure 2.1(b) shows the range encoded bitmap corresponding to the same example. Using this encoding, the last bin for each attribute is all 1s; thus, we do not explicitly store this column. Since bin 3 is not stored for either attribute, the first tuple t1 is represented by the binary number 1100.
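Range encoding can be sketched the same way; `range_encode` is a hypothetical helper name, and the all-1s last bin is dropped as described above.

```python
# Sketch of range encoding (Figure 2.1(b)): bin b is 1 if the value
# falls into b or a smaller bin; the all-1s last bin is not stored.
def range_encode(column, num_bins):
    """column[i] is the 0-based bin index of tuple i; returns num_bins-1 vectors."""
    return [[1 if v <= b else 0 for v in column]
            for b in range(num_bins - 1)]

attr1 = [0, 1, 0, 2, 1, 2]  # Attribute 1 of Figure 2.1
print(range_encode(attr1, 3))  # [[1, 0, 1, 0, 0, 0], [1, 1, 1, 0, 1, 0]]
```

The output matches the stored b1 and b2 columns of Attribute 1 in Figure 2.1(b).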

Bitmap indexes can provide very efficient performance for point and range queries thanks to fast AND and OR bit operations over the bitmaps.

With equality encoded bitmaps, a point query is executed by ANDing together the bit vectors corresponding to the values specified in the search key. For example, finding the data points that correspond to a query where Attribute 1 is equal to 3 and Attribute 2 is equal to 5 is only a matter of ANDing the two bitmaps together. Equality encoded bitmaps are optimal for point queries [16]. Range queries are executed by first ORing together all bit vectors specified by each range in the search key and then ANDing the answers together. If the query range for an attribute includes more than half of the cardinality, then we execute the query by taking the complement of the ORed bitmaps that are not included in the range.
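The AND/OR evaluation just described can be sketched with uncompressed bit vectors held in Python integers (bit i corresponds to tuple t(i+1) of Figure 2.1); `point_query` and `range_query_one_attr` are hypothetical helper names, not part of any bitmap library.

```python
# Sketch of query execution over equality encoded bitmaps,
# using Python ints as bit vectors.
def point_query(bitmaps_per_attr, key):
    """AND together the bit vector of the queried bin of each attribute."""
    result = ~0  # all ones
    for attr, b in key.items():
        result &= bitmaps_per_attr[attr][b]
    return result

def range_query_one_attr(bitmaps, bins, n_rows):
    """OR the bins in the range; complement the excluded bins instead
    when the range covers more than half of the attribute's cardinality."""
    if len(bins) > len(bitmaps) // 2:
        others = 0
        for b in range(len(bitmaps)):
            if b not in bins:
                others |= bitmaps[b]
        return ~others & ((1 << n_rows) - 1)
    acc = 0
    for b in bins:
        acc |= bitmaps[b]
    return acc

# Bit vectors of Figure 2.1(a), bit 0 = t1 ... bit 5 = t6.
attr1 = [0b000101, 0b010010, 0b101000]
attr2 = [0b100110, 0b010000, 0b001001]
print(point_query({0: attr1, 1: attr2}, {0: 0, 1: 2}))  # 1 -> only t1 matches
```

The point query "Attribute 1 in bin 1 AND Attribute 2 in bin 3" returns 1 (bit 0 set), i.e., tuple t1, consistent with Figure 2.1(a).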

With range encoded bitmaps, the bitmaps used and the operations performed to execute a query depend on the range being queried. We identify three scenarios, depending on whether the range includes the minimum value, includes the maximum value, or lies within the domain and includes neither the minimum nor the maximum.

Most practical approaches for indexing scientific data are based on bitmap indexing strategies [43, 5, 16, 17, 30, 58, 47, 4, 63, 76, 77, 79, 78]. For example, [79] proposed an effective bitmap indexing technique for large-scale high energy physics data.

2.2 Bitmap Compression

No matter which bitmap encoding we use, the bitmap index table is a 0-1 table. This table needs to be compressed to be effective on a large database. General purpose text compression techniques are clearly not suitable for this purpose since they significantly reduce the efficiency of queries [30, 76]. Specialized bitmap compression schemes have been proposed to overcome this problem. These schemes are based on run-length encoding, i.e., they replace repeated runs of 0’s or 1’s in the columns with a single instance of the symbol and a run count. These methods not only compress the data but also enable fast bitwise logical operations, which translates into faster query processing.

Run length encoding [55] can therefore be used over every column to compress the data when long runs of “0” or “1” blocks become available. Pure run length encoding is not a good strategy because of its access inefficiency. The two most popular compression techniques for bitmaps are the Byte-aligned Bitmap Code (BBC) [5] and the Word-Aligned Hybrid (WAH) code [79]. Unlike traditional run length encoding, these schemes mix run length encoding and direct storage. BBC stores the compressed data in bytes while WAH stores it in words. WAH is simpler because it only has two types of words: literal words and fill words. In our implementation, it is the most significant bit that indicates the type of word we are dealing with. Let w denote the number of bits in a word; the lower (w-1) bits of a literal word contain the bit values from the bitmap. If the word is a fill, then the second most significant

133 bits        1,20*0,4*1,78*0,30*1
31-bit groups   1,20*0,4*1,6*0 | 62*0 | 10*0,21*1 | 9*1
groups in hex   400003C0  00000000  00000000  001FFFFF  000001FF
WAH (hex)       400003C0  80000002  001FFFFF  000001FF

Figure 2.2: A WAH bit vector.

bit is the fill bit, and the remaining (w-2) bits store the fill length. WAH imposes the word-alignment requirement on the fills. This requirement is key to ensuring that logical operations only access words.

Figure 2.2 shows a WAH bit vector representing 133 bits. In this example, we assume 32-bit words. Under this assumption, each literal word stores 31 bits from the bitmap, and each fill word represents a multiple of 31 bits. The second line in Figure 2.2 shows how the bitmap is divided into 31-bit groups, and the third line shows the hexadecimal representation of the groups.

The last line shows the values of the WAH words. Since the first and third groups contain a mixture of 0s and 1s, they are represented as literal words (a 0 followed by the actual bit representation of the group). The fourth group is less than 31 bits and thus is also stored as a literal. The second group contains a multiple of 31 0’s and is therefore represented as a fill word (a 1, followed by the 0 fill value, followed by the number of fills in the last 30 bits). The first three words are regular words: two literal words and one fill word. The fill word 80000002 indicates a 0-fill two words long (containing 62 consecutive zero bits). The fourth word is the active word; it stores the last few bits that could not be stored in a regular word. Another word with the value nine, not shown, is needed to store the number of useful bits in the active word. Logical operations are performed over the compressed bitmaps, resulting in another compressed bitmap.
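The compression side of this scheme can be sketched as follows. This is a simplified sketch, not the actual WAH implementation: it assumes 32-bit words and 31-bit groups, stores the final partial group as a right-aligned literal (the active word of Figure 2.2), and omits the separate word counting the useful bits in the active word.

```python
# Simplified sketch of WAH compression with 32-bit words (31-bit groups).
def wah_compress(bits):
    """Compress a bit string into a list of 32-bit WAH-style words."""
    groups = [bits[i:i + 31] for i in range(0, len(bits), 31)]
    words, i = [], 0
    while i < len(groups):
        g = groups[i]
        if len(g) == 31 and g in ("0" * 31, "1" * 31):
            # Fill word: MSB = 1, next bit = fill value,
            # low 30 bits = number of consecutive identical groups.
            run = 0
            while i < len(groups) and groups[i] == g:
                run += 1
                i += 1
            words.append((1 << 31) | (int(g[0]) << 30) | run)
        else:
            # Literal word (MSB = 0); the final partial group becomes
            # the right-aligned active word, as in Figure 2.2.
            words.append(int(g, 2))
            i += 1
    return words

bitmap = "1" + "0" * 20 + "1" * 4 + "0" * 78 + "1" * 30
print([format(w, "08X") for w in wah_compress(bitmap)])
```

On the bit vector of Figure 2.2, this prints ['400003C0', '80000002', '001FFFFF', '000001FF'], matching the figure's WAH row.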

Bit operations over the compressed WAH bitmap file are faster than over BBC (2-20 times) [77], while BBC gives a better compression ratio.

Recently, reordering has been proposed as a preprocessing step to improve the compression of bitmaps. The objective of reordering is to increase the effectiveness of run length encoding. By reordering columns, the compression ratio of large boolean matrices can be improved [29]. However, matrix reordering is NP-hard, and the authors use traveling salesman heuristics to compute the new order. The reordering idea has also been applied to the compression of bitmap indexes [50]. The authors show that the tuple reordering problem is NP-complete and propose a Gray code ordering heuristic.

CHAPTER 3

Extending Bitmap Indexes to Handle Missing Data

3.1 Introduction

There are a variety of reasons why databases may be missing data. The data may not have been available at the time the record was populated, or it may not have been recorded because of equipment malfunction or adverse conditions. Data may have been unintentionally omitted, or the data may not be relevant to the record at hand. The allowance for and use of missing data may even be intentionally designed into the database. In some cases, the missingness of data is random, i.e., the missingness of some value does not depend on the value of another variable. In that case, the missingness is ignorable, and the way of dealing with it is to "complete" the value using regression or another statistical model and treat the data as if it had never been missing. However, if the data are missing as a function of some other variable, a complete treatment of missing data would have to include a model that accounts for the missingness. Consider the example of an analyte-disease database where diseases are the records and analyte ranges are the attributes.

This database would contain values for analyte ranges if they are relevant for a specific disease, or null values if the analyte readings are not important in the diagnosis of that disease. We may query such a database with a patient's analyte readings to get a list of potential diagnoses. We do not want to discount diseases that do not have a value for an analyte included in the query, because the act of taking an analyte's measurement has no bearing on whether a patient has a disease that is not relevant to that particular analyte. So in this case, missing data should be interpreted as a query match for that attribute. Alternatively, the intent of a query may not be to return records that could match the query criteria, but to return only records that definitely match the query criteria. In this case any missing data for a record in an attribute specified by the query search key means that the record does not match the query. An example of this could be a survey results query that asks for a count of respondents who answered question 5 with answer "A" and question 8 with answer "C".

This chapter deals with data where missingness is not ignorable; in other words, whether a data value is missing or not is important, and we want to be able to distinguish between real values and the absence of such values. In order to achieve this, we could assign to missing fields a specific value that is not in the domain of the particular attribute. For example, if the domain of an attribute is the positive integers, a value of -1 may be used to denote missing data. Traditional hierarchical multi-dimensional indexing techniques experience significant performance degradation when the database contains missing data mapped to a distinguished value. In the case of R-Trees, even when there is only 10% missing data for each attribute, the time performance is 23 times worse than if the data set were complete. However, given the fact that in quantization-based indexes each dimension is indexed independently, there is no degradation in performance. The techniques introduced are evaluated in terms of performance against a number of parameters including database dimensionality, missing data frequency, query selectivity, and query semantics (whether missing data indicates a query match or not).

Contributions of this work include the following:

1. Efficiently indexing databases with missing data using variations of bitmap indexes.

2. Demonstrating that missing data not only causes semantic problems but also degradation in the performance of queries.

3. Formalization of query processing operations for the proposed techniques in the presence of missing data.

4. Empirical study and evaluation of results over several analysis parameters.

3.2 Related Work

3.2.1 Missing Data

Although databases commonly contain missing data, relatively little work has been done on this topic. Formal definitions for imperfect databases, of which databases with missing data are a subset, and for database operations over them are provided in [82]. Two techniques for indexing databases with missing data are introduced and evaluated in [46]; this is the only other previous work that focuses on indexing missing data. These are the bitstring-augmented method and the multiple one-dimensional one-attribute indexes technique, called MOSAIC.

For the bitstring-augmented index, the average of the non-missing values is used as a mapping function for the missing values. The goal is to avoid skewing the data by assigning missing values to several distinct values. However, by applying this method it becomes necessary to transform the initial query involving k attributes into 2^k subqueries, making the technique infeasible for large k. MOSAIC is a set of B+-Trees where missing data is mapped to a distinguished value. Similarly to the previous method, the initial query involving k attributes must be decomposed into multiple subqueries, one set for each attribute.
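The exponential blow-up for the bitstring-augmented index can be illustrated with a short sketch. The names are hypothetical; a subquery here is simply a map from each attribute to either its queried range or the missing-value marker, so a k-attribute query expands into 2^k subqueries.

```python
from itertools import product

def bitstring_augmented_subqueries(search_key, missing_marker=None):
    """Enumerate the 2^k subqueries needed when missing values must be
    matched separately from the queried ranges (illustration only).

    search_key: dict mapping attribute name -> (lo, hi) range.
    """
    attrs = list(search_key)
    subqueries = []
    for choice in product([False, True], repeat=len(attrs)):
        # For each attribute, query either its range or the missing marker.
        sq = {a: (missing_marker if use_missing else search_key[a])
              for a, use_missing in zip(attrs, choice)}
        subqueries.append(sq)
    return subqueries

# A 3-attribute search key expands to 2^3 = 8 subqueries.
qs = bitstring_augmented_subqueries({"A1": (1, 5), "A2": (2, 2), "A3": (1, 9)})
```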

What makes MOSAIC perform better than the bitstring-augmented index for point queries is that it uses independent indexes for each dimension. However, because several B+-Trees are used, the query has to be decomposed, and intersection and union operations need to be performed to obtain the final result. Queries that could gain a greater performance benefit by utilizing multiple-dimension indexes would not achieve it using this technique. Therefore, this method may not be useful for multiple-dimension range queries, or for other queries where the number of matches associated with a single dimension is high.

Our work differs from [46] in that we introduce and evaluate techniques that do not suffer the same weaknesses. In our approach the query need not be transformed into an exponential number of subqueries, and no extra expensive computation, such as set operations, needs to be performed in order to obtain the final result set. In addition, our solution using bitmap indexes is scalable with respect to the data dimensionality.

3.2.2 Vector Approximation (VA) Files

The motivation for VA-files is introduced in [72]. This work showed theoretical limitations, with respect to dimensionality, for both data- and space-partitioning indexing techniques. Since reading all database pages becomes unavoidable when the number of indexed dimensions is high, the authors suggest reading a much smaller approximate version, or vector approximation (VA), of each record in the database. An initial read answers queries approximately, and actual database pages are then read to determine the exact query answer. VA-files are more thoroughly described in [70].

For traditional VA-files, data values are approximated by one of 2^b strings of length b bits. A lookup table provides value ranges for each of the 2^b possible representations. For each attribute Ai in the database we use bi bits to represent 2^bi bins that enclose the entire attribute domain. In general bi ≪ lg Ci when the cardinality is high. We set bi = ⌈lg(Ci + 1)⌉.
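Under this choice of bi, the per-attribute bit budgets can be computed as follows. This is an illustrative sketch; the cardinalities below are made up, and the +1 in the formula reserves one extra bin (e.g., for the missing-value code).

```python
import math

def va_bits_per_attribute(cardinalities):
    """Bits used to approximate each attribute in a VA-file,
    following b_i = ceil(lg(C_i + 1))."""
    return [math.ceil(math.log2(c + 1)) for c in cardinalities]

# Example cardinalities (made up, matching the experimental range used later).
bits = va_bits_per_attribute([2, 5, 10, 20, 50, 100])
```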

3.3 Problem Definition

Let D be a database with a schema of the form (A1, A2, ..., Ad). D is said to be incomplete if tuples in it are allowed to have missing attribute values. Without loss of generality, assume the domain of the attribute values is the integers from 1 to Ci, where Ci is the cardinality of attribute Ai. We assume that data retrieval is based on a k-dimensional search key, where k is less than or equal to d.

In range queries, a lower and an upper bound are specified for each attribute in the search key. Each interval in the query is represented as v1 ≤ Ai ≤ v2, where v1 and v2 are between 1 and Ci. The query is said to be a point query if, for every attribute in the search key, the lower bound equals the corresponding upper bound.

Given a range query Q with a k-dimensional search key, we have two ways to compute the results for Q. When missing data is considered to be a query match, a tuple t in the database is said to be an answer for Q if every attribute of t which appears in the search key either falls in the corresponding range defined in the query or is a missing value. When missing data is not a match, a tuple t in the database is said to be an answer for Q if every attribute of t which appears in the search key is not missing and falls in the corresponding range defined in the query.

The performance of a query can be characterized by the time it takes to perform the query and the accuracy of the result. For this work we only consider techniques that provide accurate query results. The time it takes to perform a query when an index is used is made up of the time to read the index (if the index does not already reside in memory), the time to execute the query over the index, and the time to read the database pages indicated by the index. The goal of this work is to propose indexing techniques that exhibit better performance than existing techniques and sequential scan when the database attributes specified in a search key have missing data.

When measuring query performance we consider two metrics: index size and query execution time. Index size is simply measured as the size of the requisite index files on disk. It is indicative of the time required to initially load the index structures.

Although this metric is not as critical for static read-only databases with ample disk space available, it becomes important as database updates become more frequent or available disk space becomes limited. Query execution time is measured in milliseconds for a query set. Given that the indexes are in memory, this measurement indicates the time required to process a set of queries and arrive at a set of pointers to records in the database that could match the query criteria.

3.4 Proposed Solution

Our proposed solution is to apply bitmap indexes, modified appropriately to account for missing data, and to execute the query according to the query's semantics. To handle missing data using bitmaps, we map missing values to a distinct value, i.e., 0. By doing this we increase the number of bitmaps for each attribute with missing data by 1. While mapping missing data to a distinct value fails for multi-dimensional indexes, it is acceptable for bitmaps because the attributes are indexed independently and we are not creating an exponential number of subspaces that must be searched to answer a query.

Let us denote the bit vectors (bitmap vectors) for attribute Ai by Bi,j, where 0 ≤ j ≤ Ci if Ai has missing values and 1 ≤ j ≤ Ci otherwise. Bi,0 represents the bit vector for missing values. Let us denote by Bi,j[x], where 1 ≤ x ≤ n, the bit value for record x in the bitmap for attribute Ai and value j.

3.4.1 Bitmap Equality Encoding (BEE)

Using equality encoded bitmaps, bit Bi,j[x] is 1 if record x has value j for attribute Ai, and 0 otherwise. Using this encoding, if Bi,j[x] = 1 then Bi,k[x] = 0 for all k ≠ j. If attribute Ai has missing values, we add the bitmap Bi,0, which behaves in the same manner explained above.
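A minimal sketch of this encoding, using plain Python lists in place of compressed bit vectors (the helper names are ours), reproduces the attribute of Table 3.1:

```python
MISSING = None  # hypothetical marker for missing values

def equality_encode(values, cardinality):
    """Build equality encoded bitmaps B[0..C], where B[0] marks
    missing values, following the scheme described above."""
    bitmaps = [[0] * len(values) for _ in range(cardinality + 1)]
    for x, v in enumerate(values):
        j = 0 if v is MISSING else v     # missing maps to the extra bitmap
        bitmaps[j][x] = 1
    return bitmaps

# The attribute from Table 3.1: C = 5, records 4 and 9 missing.
data = [5, 2, 3, MISSING, 4, 5, 1, 3, MISSING, 2]
B = equality_encode(data, 5)
```

The resulting bit vectors B[0] through B[5] match the columns B1,0 through B1,5 shown in Tables 3.1 and 3.2.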

Adding an extra bitmap for each attribute with missing data is not a major burden with few records or few dimensions, but when we consider 1,000,000 records with 100 dimensions we are effectively adding 100,000,000 bits to our index, which corresponds to approximately 12 MB.

An intuitive solution that could be used to encode missing data without adding an extra bitmap would be to use different encodings depending on whether missing data is a match or not. In this alternative, when missing is a match we set Bi,j[x] = 1 for all j if record x has missing data in attribute Ai; and when missing is not a match, we set Bi,j[x] = 0 for all j if record x has missing data in attribute Ai.

However, there are some problems with this approach. We would need to perform more bitmap operations when we use the NOT operator. The reason is that when we negate a bitmap while missing data is considered to be a query match, the resulting bitmap would have 0's for the missing records. In order to recover the records with missing data we would need to AND together two bit columns, and then OR that result with the original negated bitmap to arrive at a correct final result. When missing data does not imply a query match, we would need to OR together two bit columns to ensure we are eliminating the records with missing values, and then AND this result with the negated bitmap to get correct results. Using this approach, it would also be impossible to distinguish between a missing value and a real value when the cardinality of the attribute is 1. In addition, by setting all bits to 1 for the attribute when missing is a match, we interrupt the runs of 0's, and compression decreases dramatically for the attribute bitmaps.

Empirically, we found that after compression using WAH, the addition of an extra bitmap to handle missing data did not introduce much overhead. For the same example of 1,000,000 records with 100 dimensions, and assuming 10,000 records with missing data, each bitmap for missing values would have a compression ratio of approximately 0.47, and the overall compression ratio for the dataset would also improve.

Record  Value    B1,0  B1,1  B1,2  B1,3  B1,4  B1,5
1       5         0     0     0     0     0     1
2       2         0     0     1     0     0     0
3       3         0     0     0     1     0     0
4       missing   1     0     0     0     0     0
5       4         0     0     0     0     1     0
6       5         0     0     0     0     0     1
7       1         0     1     0     0     0     0
8       3         0     0     0     1     0     0
9       missing   1     0     0     0     0     0
10      2         0     0     1     0     0     0

Table 3.1: Equality encoded with missing data

Bitmap Vector   Value
B1,0            0001000010
B1,1            0000001000
B1,2            0100000001
B1,3            0010000100
B1,4            0000100000
B1,5            1000010000

Table 3.2: Bitmap indexes

Query Execution

With equality encoded bitmaps, a point query is executed by ANDing together the bit vectors corresponding to the values specified in the search key. Equality encoded bitmaps are optimal for point queries [16]. However, with missing data, when missing data means a query match we need to use two bitmaps instead of one to answer the query, i.e., the bitmap corresponding to the value queried and the one for missing values.

(a) Missing Data is a Match    (b) Missing Data is not a Match

Figure 3.1: Interval Evaluation for Bitmap Equality Encoding

Range queries are executed by first ORing together all bit vectors specified by each range in the search key and then ANDing the answers together. If the query range for an attribute includes more than half of the cardinality, then we execute the query by complementing the OR of the bitmaps that are not included in the range.

With our approach we execute the query differently depending on whether missing data is a query match or not. Figure 3.1(a) shows how a query interval for one attribute is evaluated when missing data implies a query match. Figure 3.1(b) shows the same evaluation when missing data is not a match. The query execution time is a function of the number of bit vectors used to answer the query. In the worst case, the number of bit vectors used to evaluate a single interval in the query is min(ASi, 1 − ASi) · Ci + 1, where ASi is the attribute selectivity of attribute Ai for this query.
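The interval evaluation for equality encoding can be sketched as follows, using uncompressed list-of-int bitmaps (the helper names are ours; the complement optimization for ranges wider than half the cardinality is omitted for clarity):

```python
def or_(a, b):
    return [x | y for x, y in zip(a, b)]

def bee_interval(bitmaps, v1, v2, missing_is_match):
    """Evaluate v1 <= A <= v2 over equality encoded bitmaps
    bitmaps[0..C], where bitmaps[0] marks missing values."""
    result = [0] * len(bitmaps[1])
    for j in range(v1, v2 + 1):          # OR the bitmaps in the range
        result = or_(result, bitmaps[j])
    if missing_is_match:
        result = or_(result, bitmaps[0])  # missing records also match
    return result

# The attribute of Tables 3.1/3.2 (C = 5); query 2 <= A <= 3.
B = [[0,0,0,1,0,0,0,0,1,0],   # B1,0 (missing)
     [0,0,0,0,0,0,1,0,0,0],   # B1,1
     [0,1,0,0,0,0,0,0,0,1],   # B1,2
     [0,0,1,0,0,0,0,1,0,0],   # B1,3
     [0,0,0,0,1,0,0,0,0,0],   # B1,4
     [1,0,0,0,0,1,0,0,0,0]]   # B1,5
r = bee_interval(B, 2, 3, missing_is_match=True)
```

Under the missing-is-match semantics, records 2, 3, 4, 8, 9, and 10 qualify; dropping the final OR with B1,0 removes the two missing records.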

3.4.2 Bitmap Range Encoding (BRE)

For range encoded bitmaps, bit Bi,j[x] is 1 if record x has a value that is less than or equal to j for attribute Ai, and 0 otherwise. Using this encoding, if Bi,j[x] = 1 then Bi,k[x] = 1 for all k > j. In this case the last bitmap Bi,Ci for each attribute Ai is all 1's; thus, we drop this bitmap and keep only Ci − 1 bitmaps to represent each attribute. If attribute Ai has missing values, we add the bitmap Bi,0, which has Bi,0[x] = 1 if record x has a missing value for attribute Ai. Also, in this case Bi,j[x] = 1 for all j: we are treating missing data as the next smallest possible value outside the lower bound of the domain, in our case the value 0. In total, the set of bitmaps required to represent attribute Ai with missing values is Ci.
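A minimal sketch of range encoding under this treatment of missing data (plain Python lists instead of compressed bit vectors; names are ours) reproduces the attribute of Tables 3.3 and 3.4:

```python
MISSING = None  # hypothetical marker for missing values

def range_encode(values, cardinality):
    """Build range encoded bitmaps B[0..C-1]: B[j][x] = 1 iff record x
    has value <= j, with missing treated as the value 0 (so missing
    rows are 1 in every kept bitmap). B[C], all 1's, is dropped."""
    bitmaps = [[0] * len(values) for _ in range(cardinality)]
    for x, v in enumerate(values):
        lo = 0 if v is MISSING else v
        for j in range(lo, cardinality):   # set B[lo..C-1]
            bitmaps[j][x] = 1
    return bitmaps

# The attribute from Table 3.3: C = 5, records 4 and 9 missing.
data = [5, 2, 3, MISSING, 4, 5, 1, 3, MISSING, 2]
B = range_encode(data, 5)
```

Note that records with value 5 are 0 in every kept bitmap, since their only set bit would be in the dropped all-ones bitmap B1,5.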

We also tried another encoding in which, instead of making missing data the smallest value, we consider the extra bitmap to be a flag indicating whether the data is missing. In this alternative, if record x has a missing value for attribute Ai, then Bi,0[x] = 1 and Bi,j[x] = 0 for all j > 0. However, by making Bi,Ci[x] = 0 when x has a missing value for attribute Ai, we can no longer drop Bi,Ci. This effectively increases the number of bitmaps for attribute Ai to Ci + 1, and provides no advantage to the query evaluation logic.

Query Execution

With range encoded bitmaps, the bitmaps used and the operations performed to execute a query depend on the range being queried. We identify three scenarios, depending on whether the range includes the minimum value, includes the maximum value, or lies within the domain and includes neither the minimum nor the maximum.

Record  Value    B1,0  B1,1  B1,2  B1,3  B1,4  B1,5
1       5         0     0     0     0     0     1
2       2         0     0     1     1     1     1
3       3         0     0     0     1     1     1
4       missing   1     1     1     1     1     1
5       4         0     0     0     0     1     1
6       5         0     0     0     0     0     1
7       1         0     1     1     1     1     1
8       3         0     0     0     1     1     1
9       missing   1     1     1     1     1     1
10      2         0     0     1     1     1     1

Table 3.3: Sample data using Range encoding

Bitmap Vector   Value
B1,0            0001000010
B1,1            0001001010
B1,2            0101001011
B1,3            0111001111
B1,4            0111101111

Table 3.4: Range Encoded Bitmap indexes

Figures 3.2(a) and 3.2(b) show how the interval is evaluated for a single query attribute when missing data implies a match or does not imply a match respectively.

The first three conditions in Figures 3.2(a) and 3.2(b) refer to point queries. The other three refer to range queries.

In the presence of missing data, range encoded bitmaps are more efficient for range queries than equality encoded bitmaps in all but extreme cases.

(a) Missing Data is a Match    (b) Missing Data is not a Match

Figure 3.2: Interval Evaluation for Bitmap Range Encoding

In the case where missing data is a query match, we need to access between 1 and 3 bit vectors per query dimension. In databases without missing data, we would need to access between 1 and 2 bit vectors per query dimension; we introduce some overhead to deal with the missing data case.

In the case where missing data is not a match, we need to access between 1 and 2 bit vectors per query dimension. This is also true for databases without missing data, but there are two conditions, specifically those where the query range includes the minimum domain value, that require 1 extra bit vector access. This is because missing values are encoded as 1's in all bitmaps, and a XOR operation is required to eliminate missing data from the result set.

This technique is easy to apply and requires little or no modification of the query processing. As shown in our empirical experiments, it is also scalable in terms of the number of data dimensions.
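The interval evaluation over range encoded bitmaps can be sketched as a single function (uncompressed list-of-int bitmaps; names are ours). We use AND-NOT, which is equivalent to the XOR mentioned above because range-encoded bit vectors are nested (B[v1-1] is contained in B[v2]):

```python
def bre_interval(bitmaps, v1, v2, cardinality, missing_is_match):
    """Evaluate v1 <= A <= v2 over range encoded bitmaps
    bitmaps[0..C-1] (missing encoded as value 0; B[C] dropped).
    At most 3 bit vectors are touched per interval."""
    ones = [1] * len(bitmaps[0])
    upper = ones if v2 == cardinality else bitmaps[v2]   # B[C] is all 1's
    lower = bitmaps[v1 - 1]                              # v1 >= 1, so index >= 0
    # Rows with v1 <= value <= v2; missing rows are 1 in both bounds,
    # so AND-NOT removes them automatically.
    result = [u & (1 - l) for u, l in zip(upper, lower)]
    if missing_is_match:
        result = [r | m for r, m in zip(result, bitmaps[0])]
    return result

# The attribute of Table 3.4 (C = 5); query 2 <= A <= 3.
B = [[0,0,0,1,0,0,0,0,1,0],   # B1,0
     [0,0,0,1,0,0,1,0,1,0],   # B1,1
     [0,1,0,1,0,0,1,0,1,1],   # B1,2
     [0,1,1,1,0,0,1,1,1,1],   # B1,3
     [0,1,1,1,1,0,1,1,1,1]]   # B1,4
r = bre_interval(B, 2, 3, 5, missing_is_match=False)
```

Only B1,3 and B1,1 are touched here; the missing-is-match case adds B1,0, for the worst case of 3 bit vectors per dimension.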

Synthetic Dataset:

Card     % of Missing Data               Tot Cols
         10    20    30    40    50
2        10    10    10    10    10       50
5        10    10    10    10    10       50
10       20    20    20    20    20      100
20       20    20    20    20    20      100
50       20    20    20    20    20      100
100      10    10    10    10    10       50
Total    90    90    90    90    90      450

Census Dataset:

Card     % of Missing Data               Tot Cols
         0    ≤10   ≤50   ≤90   >90
<10      11    0     2     2     0        15
10-50     7    2     3     5     4        21
51-100    2    0     1     2     2         7
>100      0    0     1     2     2         5
Total    20    2     7    11     8        48

Table 3.5: Synthetic and Census Datasets Distribution

3.5 Experiments and Results

3.5.1 Experimental Framework

We performed experiments using both synthetic and real datasets. By using the synthetic dataset we could control analysis parameters individually and gain insights into the behavior of the indexing techniques. We applied the techniques to a real dataset to verify their effectiveness in real scenarios.

For the synthetic data, we generated a uniformly distributed random dataset with 450 attributes and 100,000 records. For the set of attributes we varied the cardinality and the percent of missing data. Cardinality varied among 2, 5, 10, 20, 50, and 100 values, and percent of missing data among 10, 20, 30, 40, and 50 percent. The real data is census data with 48 attributes and 463,733 records. The attribute cardinalities vary widely from 2 to 165 (average of 37), and the percent of missing data varies from 0% to 98.5% (average of 41%). Table 3.5 details the distribution for the synthetic and the real dataset.

We implemented query executors for both bitmaps and VA-Files in Java. We ran 100 queries for each type of experiment. Queries were executed in both scenarios: when missing data is a query match and when it is not. Since the graphs look very similar in both scenarios, we present only results for queries executed where missing data is a match.

Given that we used the same precision (100%) for our implementations, we compared bitmap indexes and VA-Files in terms of:

• Index Size. Index size is an important factor in any indexing technique. We are interested in indexes that can fit into memory to ensure fast query execution without the overhead introduced when reading from disk.

• Query Execution Time. Query execution time is the time required to produce a query result set.

3.5.2 Index Size

In this section we evaluate how the attribute cardinality and the percentage of missing data affect index size.

Attribute Cardinality

For cardinality less than 10 there is not much room for compression; the index size is equal for both types of bitmap encoding and is not sensitive to the percent of missing data. For equality encoded bitmaps, as the attribute cardinality increases, the compression ratio improves considerably; at the same time, however, bitmap index size increases linearly with cardinality. For VA-Files the index grows very slowly with cardinality, given our current quantization strategy. Index sizes are presented for attributes with 10% missing data in Figure 3.3(a). As can be seen, BRE does not benefit from WAH compression.

Figure 3.3: Index Size Versus (a) Cardinality and (b) Percent of Missing Data

With real data, the compression rate is highly variable with respect to attribute cardinality. Since real data can be far from uniform, an attribute that has low cardinality but frequently takes one value can achieve high compression ratios. With our set of real data, those attributes which have cardinalities between 1 and 10 and are not missing any data have a compression ratio between 0.002 and 1.03 using equality encoding and between 0.001 and 0.82 using range encoding. The wide range is attributable to the bit density (ratio of 1's) in the bit columns. As the bit density approaches 1 or 0, the compression ratio improves. Therefore, if one particular value is frequent, then the bit density for that value's column is close to 1, yielding a good compression ratio for that column, and the bit density for all other bit columns is close to 0, which results in good compression ratios for them.

Percent of Missing Data

For equality encoded bitmaps, as the percent of missing data increases, the compression ratio decreases, making the index smaller. Range encoding does not get significant compression using WAH. The VA-File is not sensitive to the presence of missing data, and its size is independent of it. In any case, the index size for VA-Files is much smaller than for bitmaps. Index sizes are presented for cardinality 50 in Figure 3.3(b).

Good compression is also obtained on the real dataset when an attribute has a high occurrence of missing data. The missing data bit column has a bit density close to 1, and all other columns are close to 0. This leads to very good compression ratios for equality encoded bitmaps (between 0.01 and 0.09 for each of the 8 attributes in our real data set which have more than 90% missing data) and decent compression ratios for range encoded bitmaps (between 0.11 and 0.44). Overall, this real data set had an equality encoded bitmap compression ratio of 0.17 and a range encoded bitmap compression ratio of 0.70.

3.5.3 Query Execution Time

To measure the effect of the various parameters on the query execution time of the 100 queries, we needed to have control over the global query selectivity, i.e., the number of records that match the given query. The following formula relates the Global Selectivity (GS), the Attribute Selectivity (ASi = (v2 − v1 + 1)/Ci), and the Percent of Missing Data (Pmi) of all the attributes involved in the query:

GS = ∏_{i=1}^{k} ((1 − Pmi) ASi + Pmi)

where k is the number of dimensions involved in the query. In order to simplify this formula, we assume equal attribute selectivity on all the attributes in the query.
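The formula and its inversion under the equal-selectivity assumption can be checked numerically. This is an illustrative sketch; the parameter values below are made up.

```python
def global_selectivity(attr_selectivities, missing_fractions):
    """GS = prod_i ((1 - Pm_i) * AS_i + Pm_i), as in the formula above."""
    gs = 1.0
    for asel, pm in zip(attr_selectivities, missing_fractions):
        gs *= (1 - pm) * asel + pm
    return gs

def attribute_selectivity_for(gs, pm, k):
    """Invert GS = ((1 - Pm) * AS + Pm)^k under the equal-selectivity
    assumption to obtain the per-attribute selectivity."""
    return (gs ** (1.0 / k) - pm) / (1 - pm)

# Example: target GS = 1%, 5% missing data, 2 query dimensions.
asel = attribute_selectivity_for(0.01, 0.05, 2)
gs = global_selectivity([asel] * 2, [0.05] * 2)
```

Plugging the computed per-attribute selectivity back into the product recovers the 1% target, up to the rounding imposed by the attribute cardinality discussed next.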

By doing this, individual attribute selectivities are easy to compute, but we lose some precision in the global query selectivity. To measure query execution time we fixed the global query selectivity to 1 percent. Plugging different values for the parameters into GS = [(1 − Pm) AS + Pm]^k, we compute the attribute selectivity for each attribute in the query. Note that the granularity of attribute selectivity is limited by Ci. In general, our estimate was very close to 1 percent, but sometimes the actual global query selectivity went up to 3 percent. Note that when we hold the global selectivity constant and increase the percent of missing data, the attribute selectivity decreases. We tested the effect of attribute selectivity, percent of missing data, and query dimensionality on query execution time.

Attribute Cardinality

Figure 3.4(a) shows the query execution time of 100 queries over attributes with 10 percent missing data and various cardinalities. In this case as well, the execution time for BRE and VA-Files remains roughly constant, with BRE being faster than VA-Files. For BEE, the execution time is linear in the cardinality, since the number of bitmaps used to answer the queries depends on the cardinality of the attribute and its selectivity.

Percent of Missing Data

Figure 3.4(b) shows the results of these experiments for attributes with cardinality 10. For equality encoded bitmaps, the execution time decreases when the percent of missing data increases. This is because when we hold the global selectivity constant and increase the percent of missing data, the attribute selectivity decreases, and for this kind of encoding the number of bitmaps used in the query execution depends on the attribute selectivity. For range encoded bitmaps, the execution time remains roughly constant. The small variations are due to the possibility of using between 1 and 3 bitmaps per dimension during query execution. It turns out that as the percent of missing data increases, the number of bitmaps used per dimension gets closer to 3.

Figure 3.4: Query Execution Time Versus (a) Cardinality, (b) Percent of Missing Data, and (c) Query Dimensionality

For VA-Files, the execution time is also roughly constant. The variations are due to the actual global selectivity for cardinality 10 and 8 dimensions in the query: for cardinality 10 and 50% missing data the global selectivity is 0.84%, for 30% and 40% it is 1.28%, but for 20% it is 1.7%. In general, BRE executes range queries faster than the other two. The only case in which BEE performs better than BRE is at 50% missing data, when the attribute selectivity is 10% and the range query becomes a point query.

Query Dimensionality

Figure 3.4(c) shows the query execution time of 100 queries over attributes with cardinality 10 and 30 percent missing data. For all indexes the execution time is linear in the number of query dimensions. BRE grows very slowly since we are only using between 1 and 3 bitmaps per query dimension. BEE grows much faster since, as we increase the number of dimensions with this percent of missing data, the attribute selectivity gets closer to 50%. For smaller percents of missing data and the same cardinality, the attribute selectivity is greater than 50%, around 70%, so effectively we only access 30% of the bitmaps and therefore the execution time does not increase linearly. For VA-Files the execution time also increases with the query dimensionality.

Results on Real Data

Experiments using this real data set yielded several conclusions. For this data set, the bitmap solutions were significantly faster than the VA-File solution (3 to 10 times faster). This was because the skewness of this particular data set allowed for very good compression of the bitmaps: while the VA-file implementation had to operate over about 500,000 vector approximations of the records, the bitmap implementations performed bit operations over substantially fewer words. The average compression ratio for the equality encoded bitmaps was 0.17 (with 23 attributes compressing to less than 0.1 times their original size). The average compression ratio for the range encoded bitmaps was 0.7 (with 18 attributes compressing to less than 0.5 times their original size and only 3 attributes not compressing at all).

Also of note is that whereas the presence of missing data can introduce a degradation of a couple of orders of magnitude in hierarchical multiple-dimension indexes, as shown in the motivating example, there is not a large degradation associated with the presence of missing data using these techniques. Performance is at most about two times slower with our techniques, and this is attributable to the extra bit operations required to handle the missing data.

In our experiments with real data, the range encoded bitmaps performed faster than the equality encoded bitmaps. In these experiments we used range queries over 20% of the queried attribute's possible values, and we would expect this result since range encoded bitmaps are tailored for range queries.

3.6 Conclusions

The extensions proposed in this chapter are easy to apply and allow the effective indexing of missing data. As opposed to traditional hierarchical indexing structures and previously proposed missing data indexing techniques, bitmaps continue to exhibit linear performance for query execution time with respect to database and query dimensionality.

This is also the first work to compare and contrast bitmap indexes and VA-Files. These techniques exhibit a trade-off between execution time and indexing space. The bit operations used to evaluate queries for bitmaps are fast, but the space required to represent an exact bitmap can be much higher than that of a corresponding exact VA-file.

The range encoded bitmaps typically offer the best time performance but do not compress as much as the corresponding equality encoded bitmaps. Range encoded bitmaps perform faster because there is a limit on the number of bit operations that must be performed to evaluate a query for each dimension.

Equality encoded bitmaps perform a maximum of C/2 + 1 bit operations per query dimension and can perform faster than range encoded bitmaps for point queries or range queries with small ranges. Equality encoded bitmaps can also be compressed much more than range encoded bitmaps.

VA-files require the least space to represent the same information offered by bitmaps, but the operations performed are not bit operations, and they usually do not execute as fast as the range encoded bitmaps.

There are several areas in which the techniques proposed here could be improved. The biggest weakness of the range encoded bitmaps is the inability to compress them. Next we explore the effect of data reorganization on the compression ratio.

CHAPTER 4

Improving Bitmap Index Compression by Data Reorganization

4.1 Introduction

Bitmap indexing has recently been used to index scientific data efficiently [58, 47, 76, 77, 62]. Typical queries in this domain require scanning a large portion of the data. Appends of new data from experiments and simulations are common, but in general, once the data is stored it is rarely updated. Bitmap indexes provide fast query response by utilizing bitwise logical operations, which are well supported by hardware.

With bitmap indexes, one binary vector is created for each value in the attribute domain. Each position in the vector represents one tuple in the dataset. In their most popular form, known as simple encoding or equality encoded bitmaps [45], a set bit in a given position indicates that the tuple falls into the value or range represented by the bitmap vector. All other bitmap vectors for the same attribute have the corresponding bit set to 0.

In order to reduce the number of bit vectors, binning was proposed [31, 62]. With binning, data are partitioned or quantized into several bins, i.e., one bin represents several values of the attribute domain. The number of bins per attribute can vary, and each bin is represented by a bitmap vector.

To execute a query, logical operations are performed over the bitmap vectors. For point queries, all relevant bitmaps are ANDed together. For range queries, all the bitmaps corresponding to the queried range are ORed together, and then the resultant bitmaps from different attributes are ANDed together. Equality encoded bitmaps are very efficient for point queries. However, for range queries, a potentially large number of bitmaps need to be read and operated on in order to answer a query. To reduce the number of bitmaps involved in answering a query, several bitmap encoding schemes have been developed, such as range [16], interval [16], and workload and attribute distribution oriented [31].
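As a concrete illustration of this AND/OR evaluation, the following sketch (not from the dissertation; the attribute values, bin counts, and helper names are made up) uses Python integers as bit vectors, with bit i representing tuple i:

```python
# Sketch of equality encoded bitmap query execution using Python ints as
# bit vectors. Attribute values and bin counts are illustrative only.

def build_bitmaps(values, cardinality):
    """One bit vector per bin: bit i is set iff tuple i falls in that bin."""
    bitmaps = [0] * cardinality
    for i, v in enumerate(values):
        bitmaps[v] |= 1 << i
    return bitmaps

# Two attributes over 6 tuples.
a = build_bitmaps([0, 1, 2, 1, 0, 2], 3)
b = build_bitmaps([2, 2, 0, 1, 1, 0], 3)

# Point query (A = 1 AND B = 2): AND the relevant bitmaps.
point = a[1] & b[2]

# Range query (A in {1, 2} AND B in {0, 1}): OR within an attribute,
# then AND across attributes.
range_result = (a[1] | a[2]) & (b[0] | b[1])

def matching_rows(bm, m):
    return [i for i in range(m) if bm >> i & 1]

print(matching_rows(point, 6))         # [1]
print(matching_rows(range_result, 6))  # [2, 3, 5]
```

Note how the range query over one attribute touches one bitmap per bin in the range, which is what motivates the alternative encodings cited above.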

In addition to binning, bitmaps are also compressed to reduce the amount of space they require and to use them effectively. However, general purpose compression techniques are not suitable, as they require decompressing the bitmaps before executing the query. For this reason, several compression techniques based on run-length encoding have been developed to reduce bitmap index size while retaining the advantage of fast bit operations [5, 77, 4, 62]. These techniques exploit uniform segments of a sequence, so their performance depends directly on the presence of such segments. Their effectiveness varies for different organizations of the database tuples, since the ordering of tuples affects the uniform segments in the columns. The problem of reordering database tuples to achieve higher compression rates was introduced in [50] and was shown to be NP-complete [49] through its connection to the traveling salesperson problem (TSP), a well-studied combinatorial optimization problem.

Recently, traveling salesperson problem solutions have been applied to the reordering of boolean matrices [29]. This approach adapts classical TSP heuristics by means of instance-partitioning and sampling. The work by Richards [52] discusses data compression and Gray code sorting. In [50], the authors proposed an in-place Gray Code Ordering (GCO) algorithm to reorganize the data. GCO was proved to be optimal when all cells of a bitmap table are full, i.e., there exist tuples with all possible combinations of the attribute bins. However, this is an unrealistic assumption, as the number of cells is exponential in the number of columns.

In this chapter, we propose a new ordering mechanism and study the effect of column ordering on the compression of ordered data. Our ordering strategy is called adaptive code ordering (ACO). ACO exhibits the same performance in terms of execution time and memory requirements as GCO and performs better than GCO in terms of the number of runs after reordering. We prove that the ACO algorithm is optimal in more general cases than GCO. For example, ACO is optimal not only when all the tuples in the boolean table are present (as is the case with GCO) but also when all tuples are present in the equality encoded bitmap table, for which GCO is not optimal. The number of possible distinct values in a full bitmap table is given by the product of the cardinalities of all the attributes. We also evaluate the effect of column ordering to further improve query runtime performance beyond the gains due to reduced total table size, by favoring frequently accessed columns during the ordering process. Our techniques remain a preprocessing step before compression, improving performance without affecting the algorithms used for compression and query processing. Our algorithm is also in-place and runs in time linear in the size of the database. Our adaptive code ordering is shown to improve compression rates significantly, and these improvements translate directly into improved query execution times.

The rest of this chapter is organized as follows. Section 4.2 introduces the adaptive code ordering tailored for the tuple reordering problem and several metrics to decide the order in which the columns should be evaluated. Experimental results are presented in Section 4.3. Finally, we discuss future work and conclude in Section 4.4.

4.2 Adaptive code ordering

The goal of reordering the bitmap data is to produce longer uniform segments so that the performance of the run-length encoders used to compress the bitmap indexes improves. In this section, the problem of reorganizing bitmap tuples is described, and an integrated approach to reorder a table that involves both column and tuple ordering is proposed. Our heuristic, called adaptive code ordering (ACO), is suitable for large datasets. In addition, formulas to estimate the size of an ACO reordered compressed bitmap are presented.

4.2.1 Tuple Reordering

The goal is to increase the performance of run-length encoding by reordering tuples so that longer uniform segments, and thus fewer blocks, are generated. Run-length encoding packs each uniform segment into a block and stores the length of the block. Thus the storage size is determined by the number of such blocks. Consider two consecutive tuples in the bitmap table. If the tuples are in the same bin for an attribute, they will be packed into the same block. If not, a new block must start.

t1  001 001          t1  001 001          t1  001 001
t2  001 010          t2  001 010          t2  001 010
t3  010 001          t5  010 100          t4  010 010
t4  010 010          t4  010 010          t3  010 001
t5  010 100          t3  010 001          t5  010 100
t6  100 001          t6  100 001          t8  100 100
t7  100 010          t7  100 010          t6  100 001
t8  100 100          t8  100 100          t7  100 010

(a) Lexicographic    (b) GCO Reordered    (c) ACO Reordered
    Order Table          Table                Table

Figure 4.1: Example for tuple reordering on an equality encoded bitmap table.

Efficiency can be enhanced by reordering tuples so that they fall into the same bins as much as possible.

The following definition is slightly modified from the original problem presented in [50] so that it is more general and therefore applicable to all bitmap encodings.

Let diff(ti, tj) denote the number of columns in which tuples ti and tj fall into different bins, and let πi denote the ith tuple in ordering π. Observe that diff(πi, πi+1) gives the number of new blocks that start after the ith tuple in the reordered file when run-length encoding is used. For example, using the table in Figure 4.1(a), diff(t1, t2) = 2, since tuples t1 and t2 fall into different bins for the last attribute (they differ in two columns). Formally, the tuple reordering problem is defined as follows.

Tuple reordering problem Let π be an ordering of m tuples so that πi denotes the ith tuple in the ordering. The tuple reordering problem [50] is the problem of finding an ordering π that minimizes

    Σ_{i=1}^{m−1} diff(πi, πi+1).    (4.1)

In Equation 4.1, diff values are summed over all consecutive tuples to obtain the number of new runs that start for the whole table. The first tuple requires starting a run for each column; therefore the number of blocks can be computed as n + Σ_{i=1}^{m−1} diff(πi, πi+1), where n is the number of columns. Thus an ordering that minimizes Equation 4.1 also minimizes the number of blocks in the reordered table.

For instance, Equation 4.1 returns 2+4+2+2+4+2+2 = 18 for the lexicographic order in Figure 4.1(a), which means that, with the addition of the number of columns, there will be 6 + 18 = 24 blocks in the compressed table. For the fundamental GCO in the same figure, Equation 4.1 returns 2+4+2+2+2+2+2 = 16, so the reordered table will have 22 blocks after compression. For the ACO reordered table in Figure 4.1(c), however, Equation 4.1 returns 2+2+2+2+2+2+2 = 14, so the reordered table will have only 20 blocks after compression.
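The arithmetic above can be checked mechanically. The small script below (an illustrative aid, with the tuples transcribed from Figure 4.1) counts differing columns between consecutive tuples and applies Equation 4.1:

```python
# Verifying the block counts of Figure 4.1: diff(t_i, t_{i+1}) is the number
# of differing bitmap columns between consecutive tuples, and the compressed
# table has n + sum-of-diffs blocks (Equation 4.1), n being the column count.

def diff(a, b):
    return sum(x != y for x, y in zip(a, b))

def blocks(order):
    n = len(order[0])
    return n + sum(diff(order[i], order[i + 1]) for i in range(len(order) - 1))

lex = ["001001", "001010", "010001", "010010", "010100",
       "100001", "100010", "100100"]
gco = ["001001", "001010", "010100", "010010", "010001",
       "100001", "100010", "100100"]
aco = ["001001", "001010", "010010", "010001", "010100",
       "100100", "100001", "100010"]

print(blocks(lex), blocks(gco), blocks(aco))  # 24 22 20
```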

Note that the problem has different characteristics depending on the bitmap encoding used. For equality encoded bitmaps, diff is always even and at most 2d, where d is the number of attributes, independent of the number of columns in the bitmap. For range encoded bitmaps, diff can be any number from 0 to the total number of columns.

4.2.2 Heuristics for Tuple Reordering

In this section, the necessary conditions for an ordering algorithm to be considered effective are described, and Gray code ordering is briefly reviewed. Then, the proposed strategy for tuple reordering is described and the optimality of the technique under several conditions is shown. Several criteria to decide the order in which the columns should be evaluated are provided, since the proposed heuristic favors the earlier columns, i.e., the impact of our solution decreases with the number of columns.

Desirable properties of a reordering algorithm are:

• Efficient. At most linear in the number of tuples.

• In-place. Should not use any auxiliary memory.

• Locality. It should be sufficient to apply the reordering to portions of the database to improve compression. This provides scalability, since the method can be applied to databases of arbitrary sizes by parts.

Reordering using Gray code ordering [50] exhibits all three properties. We will show that the proposed method has the same complexity and memory requirements and performs better than GCO.

Gray code ordering (GCO)

A Gray code is an encoding of binary numbers so that adjacent numbers differ only by a single digit. For instance (000, 001, 011, 010, 110, 111, 101, 100) is a binary Gray code. Binary Gray code is often referred to as the “reflected” code, because it can be generated by the reflection technique, as described below.

1. Let S = (s1, s2, ..., sn) be a Gray code.

2. First write S forward and then append the same code S written backward, to produce (s1, s2, ..., sn, sn, ..., s2, s1).

3. Append 0 at the beginning of the first n binary numbers, and 1 at the beginning of the last n binary numbers.

As an example, take the Gray code (0, 1). Write it forward, then append the same sequence backward to obtain (0, 1, 1, 0). Then add 0's and 1's to get (00, 01, 11, 10). This new sequence can be used as input to the same procedure: after the reflection step the sequence is (00, 01, 11, 10, 10, 11, 01, 00), and the first digits are appended to attain (000, 001, 011, 010, 110, 111, 101, 100). It is important to note that Gray codes are not unique, and different orders on the same group of numbers might satisfy the Gray code property. The term fundamental Gray code refers to a Gray code generated by the reflection technique described above using (0, 1) as the initial sequence.
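The reflection construction can be sketched in a few lines; this is an illustrative Python version (the helper name `fundamental_gray_code` is ours, not from the dissertation):

```python
# A minimal sketch of the reflection construction for the fundamental Gray
# code: start from (0, 1), then repeatedly mirror the sequence and prefix
# 0s to the first half and 1s to the reflected second half.

def fundamental_gray_code(n_bits):
    code = ["0", "1"]
    for _ in range(n_bits - 1):
        code = ["0" + s for s in code] + ["1" + s for s in reversed(code)]
    return code

print(fundamental_gray_code(3))
# ['000', '001', '011', '010', '110', '111', '101', '100']
```

The 3-bit output matches the example sequence given above, and adjacent codewords differ in exactly one bit.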

In [50], GCO was shown to be optimal when all cells are present in the boolean table. In practice, tables are very rarely full, and only for a small number of attributes with small cardinality. Below are some definitions that will be used in the rest of this section.

Full table A boolean table T with n columns is considered full if it has m = 2^n distinct tuples.

A boolean table T with 30 columns (3 attributes with cardinality 10) would need 2^30 = 1,073,741,824 distinct tuples to be considered full. Nevertheless, the encoding of the bitmaps considerably reduces the number of tuples required to consider the bitmap table full. For example, for the most commonly used encoding, i.e., equality encoded bitmaps, the maximum number of distinct tuples is 10^3 = 1,000 for the aforementioned table T, as there is only one 1 per attribute.

Full equality encoded bitmap table An equality encoded bitmap table T that encodes d attributes with cardinalities {c1, c2, ..., cd} and has n = Σ_{i=1}^{d} ci columns and m = Π_{i=1}^{d} ci distinct tuples is considered to be full.

For range encoded bitmaps, the number of tuples required to consider the bitmap table full is the same as for equality encoded bitmaps, but the number of columns is smaller.

Full range encoded bitmap table A range encoded bitmap table T that encodes d attributes with cardinalities {c1, c2, ..., cd} and has n = Σ_{i=1}^{d} (ci − 1) columns and m = Π_{i=1}^{d} ci distinct tuples is considered to be full.

The fact is that GCO is optimal neither when the equality encoded bitmap table is full nor in the general case for range encoded bitmaps. For this reason, we propose a new adaptive code ordering (ACO) algorithm that is more robust to missing cells. The adaptive code ordering proposed in the subsequent section is an in-place algorithm and thus optimal in terms of memory requirements. We prove that ACO produces the same ordering as GCO if the boolean table is full, so it is optimal in that case. We also prove that ACO produces an optimal ordering when the equality encoded or range encoded bitmap table is full. ACO can be applied to the whole database, since it has a regular access pattern and requires a small number of passes over the bitmap table. Alternatively, conventional TSP solution techniques could be adopted for the tuple reordering problem. However, these techniques almost invariably require additional storage, which is often super-linear in the number of tuples.

4.2.3 Adaptive code ordering (ACO)

In this section, an adaptive code ordering (ACO) is proposed. ACO is tailored to the reordering of bitmap tables and depends on the data itself, imposing no predetermined rank. For the simple encoding, given a table T = A1, A2, ..., Ad with m tuples where each attribute Ai is divided into ci bins, the number of columns in the bitmap table of T is given by n = Σ_{i=1}^{d} ci. Each tuple is represented by a binary string S = b1 b2 ... bc1 bc1+1 ... bn−1 bn.

ACO (A, start, end, n, b)
 1: i ← start
 2: j ← end
 3: first ← S(A, start − 1, b)
 4: while i < j
 5:   Decrement j until S(A, j, b) = first
 6:   Increment i until S(A, i, b) = 1 − first
 7:   if i < j then
 8:     Swap the ith and jth tuples on the table
 9:   end if
10: end while
11: if b < n then
12:   ACO(A, start, j, n, b + 1)
13:   ACO(A, j + 1, end, n, b + 1)
14: end if

Algorithm 1: An adaptive code sorting algorithm. ACO (A, start, end, n, b) sorts numbers between indexes start–end in A according to their least significant b bits using an adaptive code order. S(A, i, j) denotes the jth significant bit of the ith number in table A.

The pseudocode of ACO is given in Algorithm 1. ACO recursively orders the bitmap table one column at a time. The variable first dictates the order in which the segment is ordered: if first = 0 then the order is 0s first, 1s last, and vice versa. The algorithm looks at the previous bit in the current column and makes first equal to that symbol (Line 3). This is the essence of the adaptive nature of our algorithm. In the case where there is no previous bit, i.e., start = 0, first is initialized to 0. The segment between start and end is then ordered in first order (Lines 4-10). The variable j holds the position where the run changes (the position at which the run of the first symbol ends). The next column is ordered recursively before and after position j (Lines 12-13).

[Figure 4.2 shows the successive states S1–S8 of a 4 × 4 table (t1 = 1010, t2 = 1001, t3 = 0110, t4 = 0101) as Algorithm 1 partitions and swaps tuples column by column.]

Figure 4.2: Illustration of Algorithm 1.

It is worth stressing that Algorithm 1 is an in-place algorithm, and similarly to GCO, ACO also tries to maximize the bit-level similarity between consecutive numbers. Contrary to GCO, ACO determines the order of a segment based on the previous bit, i.e. the algorithm tries to extend the previous run given the present data.

Figure 4.2 illustrates this algorithm over a table A with 4 rows and 4 columns. The first call of ACO is done over the entire table (ACO(A,1,4,4,1)) (S1). Variable j is decremented until the bit in the jth position is equal to 0 (j = 4) and variable i is incremented until the bit in the ith position is equal to 1 (i = 1) (S2). The tuples t1 and t4 are swapped (S3). Since i < j we continue decrementing j and incrementing i to swap tuples t2 and t3 (S4). We call ACO over the two halves for the second column (ACO(A,1,2,4,2) and ACO(A,3,4,4,2)), but there is no change as the column is already ordered, and no more splits are created because there is only one symbol in each segment. ACO is called for the first segment on the third column (ACO(A,1,2,4,3)), and a split is created at position 1. ACO is called for the second segment on the third column (ACO(A,3,4,4,3)) (S5). Variable first is set to 1 because the bit in the second position of the third column is equal to 1. j is decremented and i is incremented (S6), and tuples t2 and t1 are swapped (S7). The final ACO ordered table is shown in S8.
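The recursion above can be sketched in Python. This is an illustrative, 0-indexed reimplementation of Algorithm 1 (with explicit bounds checks that the pseudocode leaves implicit), run on the table of Figure 4.2 (t1 = 1010, t2 = 1001, t3 = 0110, t4 = 0101):

```python
# Illustrative 0-indexed sketch of Algorithm 1 (ACO). A is a list of equal
# length bit strings; each call orders A[start..end] (inclusive) on bit
# column b so that the segment starts with the symbol that ends the
# previous run in that column, then recurses on the two sub-segments.

def aco(A, start, end, b):
    if start >= end or b >= len(A[0]):
        return
    # Adaptive step: continue the symbol of the previous row (default '0').
    first = A[start - 1][b] if start > 0 else "0"
    i, j = start, end
    while i < j:
        while j >= start and A[j][b] != first:   # scan up for 'first'
            j -= 1
        while i <= end and A[i][b] == first:     # scan down for the other symbol
            i += 1
        if i < j:
            A[i], A[j] = A[j], A[i]
    # Recurse on the two segments split at position j.
    aco(A, start, j, b + 1)
    aco(A, j + 1, end, b + 1)

table = ["1010", "1001", "0110", "0101"]   # t1..t4 from Figure 4.2
aco(table, 0, len(table) - 1, 0)
print(table)  # ['0101', '0110', '1010', '1001'], i.e. t4, t3, t1, t2
```

The final order reproduces the walkthrough: t1 and t4 swap, then t2 and t3, and finally t2 and t1 within the lower segment.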

In the following lemma, we prove that our algorithm produces the fundamental Gray code ordering if the table is full, i.e., the number of rows is 2^n where n is the number of columns in the table, so it is optimal in this scenario.

Lemma 4.2.1. If the table is full, Algorithm 1 orders the numbers in A to be Gray code sorted when initially invoked with ACO(A, 1, m, n, 1), where m is the number of tuples and n is the number of columns. In other words, ACO produces the same ordering as the fundamental Gray code ordering if the bitmap table is full.

Proof. The proof is based on induction on the number of bits. First observe that recursive calls respect the previous orderings, since after one pass, the recursive calls only operate on a segment of tuples that all start with the same bit prefix. The inductive basis is n = 1, for which the correctness of the algorithm is trivial. It is also easy to see that numbers that start with 0 should precede those that start with 1 for adaptive code sorting. By the inductive hypothesis, the numbers that start with 0 are sorted correctly by the algorithm according to their last n − 1 bits, and adding 0 does not affect their code precedence. Similarly, numbers that start with 1 are code sorted recursively according to their last n − 1 bits; however, putting 1 at the beginning requires the reflected order, which is achieved by looking at the previous bit in the table (Line 3). The previous bit is guaranteed to be 1 as the table is full.

The adaptive nature of our algorithm allows ACO to achieve optimality in other, more realistic cases, such as when all distinct tuples exist in the equality encoded or the range encoded bitmap table. First, optimality of ACO for equality encoded (EE) bitmaps is proved in the following theorem.

Theorem 4.2.2. The adaptive code ordering provides an optimal solution for the tuple reordering problem, when the equality encoded bitmap table is full.

Proof. The algorithm orders identical tuples consecutively. Since the Hamming distance between tuples is always even with equality encoded bitmaps, it suffices to prove that for each pair of consecutive rows diff(πi, πi+1) is at most 2 for all i. First observe that recursive calls respect the previous orderings, since after one pass, the recursive calls only operate on a segment of tuples that all start with the same bit prefix. For a given attribute, a tuple never has two set bits at the same time. The distance between tuples for one attribute is either zero or two.

For a contradiction, let us assume that, after reordering, the difference between two consecutive tuples πi and πi+1 is greater than 2 for some i. This means that the tuples differ on at least two attributes. Suppose diff(πi, πi+1) = 4. Let us denote by bj and bk two columns of the attributes for which πi and πi+1 differ (bj and bk belong to different attributes). The two other columns for which πi and πi+1 differ are omitted, as they would be the duals of bj and bk. Without loss of generality, let us assume that bj comes before bk in the order in which the columns were evaluated. When column bj was ordered, the algorithm made two recursive calls for the next column on two parts (before i and after i). Now, when bk was ordered, πi and πi+1 must have been ordered by two different calls of ACO, as they fall into two different segments. Since the code looks at the previous bit (the kth bit in tuple πi) to decide which symbol to put first, and the symbol from πi+1 differs from the symbol in πi, this implies that there was only one symbol in the segment starting at position i + 1. However, this contradicts the assumption that the bitmap table was full.

t1  00 00            t1  00 00            t1  00 00
t2  00 10            t3  00 11            t2  00 10
t3  00 11            t2  00 10            t3  00 11
t4  10 00            t8  11 10            t6  10 11
t5  10 10            t9  11 11            t5  10 10
t6  10 11            t7  11 00            t4  10 00
t7  11 00            t4  10 00            t7  11 00
t8  11 10            t6  10 11            t8  11 10
t9  11 11            t5  10 10            t9  11 11

(a) Lexicographic    (b) GCO Reordered    (c) ACO Reordered
    Ordered Table        Table                Table

Figure 4.3: Example for tuple reordering on a range encoded bitmap table.

Next, optimality for range encoded (RE) bitmaps is proved. With this encoding, the distance between consecutive tuples tr and tr+1 that fall into different bins for the same attribute Ai ranges from 1 to ci − 1. The maximum distance between two adjacent tuples is diff(tr, tr+1) = n for an RE bitmap table with n columns. The optimal ordering is produced when the distance between all pairs of adjacent rows is 1.

It is worth pointing out that GCO would produce the same ordering as ACO when the columns of the attributes are evaluated in increasing order of their ranges (see the Appendix for the formal proof); however, in other cases GCO may not be optimal. For example, consider the range encoded bitmap table presented in Figure 4.3, which has 2 attributes divided into 3 bins each and thus 4 columns. The number of tuples for this full bitmap table is 9. The number of runs produced by the GCO ordering is 16. However, the optimal number of runs is 12, which is produced by the ACO ordering.

Optimality of ACO for a full range encoded bitmap table is proved in the next theorem.

Theorem 4.2.3. For a bitmap table T that is full under range encoding, ACO gives the best performance.

Proof. It suffices to prove that for each pair of consecutive rows diff(πi, πi+1) is at most 1 for all i. First observe that recursive calls respect the previous orderings, since after one pass, the recursive calls only operate on a segment of tuples that all start with the same bit prefix. For a contradiction, let us assume that, after reordering, the difference between two consecutive tuples πi and πi+1 is greater than 1 for some i. Let us denote by bj and bk two columns for which πi and πi+1 differ. Without loss of generality, let us assume that bj comes before bk in the order in which the columns were evaluated. When column bj was ordered, the algorithm made two recursive calls for the next column on two parts (before i and after i). Now, when bk was ordered, πi and πi+1 must have been ordered by two different calls of ACO, as they fall into two different segments. Since the code looks at the previous bit (the kth bit in tuple πi) to decide which symbol to put first, and the symbol from πi+1 differs from the symbol in πi, this implies that the segment starting at i + 1 in column k had only one symbol: the complement of the symbol in position i. Now, there are two cases:

• bj and bk belong to the same attribute. Without loss of generality, let us assume that the range encoded by bj is [0, r1] and the range encoded by bk is [0, r2] such that r1 < r2 (it is easy to see that the result follows for r2 < r1 as well). Note that in this case, the symbol in position i in both columns needs to be the same, as it is not possible for range encoded bitmaps to have both 0...1 and 1...0, since it is not possible for a value to be in [0, r1] and not in [0, r2]. Let us assume that the consecutive values are x...x and 1−x...1−x, which implies that both 0...1 and 1...0 are missing. However, since the range encoded bitmap table is full, one of the two must be present.

• bj and bk belong to different attributes. In a full range encoded table, the four values 0...0, 0...1, 1...1, and 1...0 would all be present.

By these results, Algorithm 1 gives an optimal solution when bitmap tables are full; in practice, however, some cells may be missing and the solution may not be optimal. Even so, the adaptive nature of adaptive code ordering produces better results than the fundamental Gray code ordering.

Figure 4.4 illustrates how adaptive code ordering reorganizes the database tuples. In this figure, each bar represents one column in the bitmap. Shown is a full bitmap table with three attributes of cardinality 3. The white portion of each bar represents a continuous sequence of 0s, while the shaded portion represents a continuous sequence of 1s. The left part of the figure represents the numerical (lexicographic) sorting of the tuples, the same organization achieved by physically ordering the tuples by all the attributes at the same time. The right part represents the adaptive code ordering. It can be seen that with adaptive code ordering many runs concatenate to form longer runs than with numeric/lexicographic ordering, making adaptive code ordering more effective. Note that the number of runs in each column increases as one moves further to the right of the bitmap table. This means that the first columns evaluated have a very small number of runs, compressing better, but the last columns can have a large number of runs. For this reason, we consider different criteria to select the order in which the columns should be evaluated.

Figure 4.4: Illustration of the adaptive code ordering effectiveness. On the left is the lexicographic (numeric) order of an equality encoded bitmap table with 9 columns (3 attributes with cardinality 3). On the right is the adaptive code ordering of the same table. White and shaded blocks represent runs of 0s and 1s, respectively. Each horizontal split indicates the segments created by the reordering algorithm at each column. Adaptive code ordering produces fewer and longer runs than lexicographic order.

4.2.4 Criteria for Column Ordering

An optimal solution for the TSP formulation of the tuple reordering problem is not affected by the ordering of the columns. Column ordering does not make a difference in the overall compressed size if the bitmap table is full (as the solution from the ACO ordering is optimal). Also, column ordering would not make a significant difference in the overall compressed bitmap size if the expected compressed size is the same for all the columns, i.e., if the number of set bits is the same for all the columns.

However, since in practice the bitmap table is not full and real bitmap tables often follow non-uniform distributions, several criteria for column ordering are presented.

As can be seen in Figure 4.4, ACO aligns the rows so that the runs in the first few columns are perfect, but after some columns, there is no improvement in compression and the size of the compressed column is the expected size, i.e. the size of the randomly ordered column. This naturally raises the question of how to order the columns for maximum compression rates and/or faster query execution times.

In this section two approaches for deciding in which order to evaluate the bitmap columns in the adaptive code ordering algorithm are explored. For the first approach, the bitmap data itself is used to decide which column to evaluate first. For the second approach, the columns are evaluated by their access frequency extracted from the query history files.

Reordering by Bitmap Data

In this section, a solution to the column ordering problem is provided in the absence of query histories. For efficiency considerations, the column selection criteria are restricted to those that remain invariant during the ordering process. The most prominent such property is the number of set bits, i.e., the number of 1s in the column, denoted by s; the number of set bits in column i is denoted by si. Consider the set of columns that correspond to the bins of an attribute. For the basic bitmap encoding, only one of these columns has a set bit for each tuple; the rest are 0s. As a consequence, a run of 1s in a column corresponds to runs of 0s in all other columns for the same attribute. Therefore, selecting the columns with more set bits first is a natural greedy strategy for equality encoded bitmaps. For range encoded bitmaps, on the other hand, all columns have a set bit for tuples that fall into the smallest bin. In this case, ordering by the column for the smallest bin, say column i, generates runs of si tuples on all columns for this attribute. Therefore, one can anticipate that selecting the columns with fewer set bits first (which most likely correspond to the smaller values) would produce better results.

As another criterion, compressibility is defined as a function of the bit density of the column. The bit density of a column bi is computed as

    di = si / m

where si is the number of set bits and m is the total number of bits in the column. The intuition is that the more even the numbers of 0s and 1s in the column (i.e., the closer the bit density is to 0.5 from either side), the harder it is to compress the column, as there are more chances of having interrupted, short runs. It is intuitive to put the harder to compress columns first, since the later columns, which are easier to compress, will compress well because the majority of their bits have the same value. The compressibility of a column bi can be expressed as:

    Ci = |1/2 − di|

where di is the bit density of bi. Therefore, the bigger the number Ci, the easier it is to compress the column.
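These criteria are straightforward to compute. The sketch below (hypothetical helper name and made-up columns, not code from the dissertation) derives the set-bit count si, bit density di, and compressibility Ci for each column, then orders the columns hardest-to-compress first:

```python
# Computing the column ordering criteria: set bits s_i, bit density
# d_i = s_i / m, and compressibility C_i = |1/2 - d_i|.

def column_criteria(columns):
    """columns: list of bit strings (one per bitmap column, each of length m)."""
    stats = []
    for i, col in enumerate(columns):
        s = col.count("1")
        d = s / len(col)
        c = abs(0.5 - d)
        stats.append((i, s, d, c))
    return stats

cols = ["11110000", "10101010", "11111110"]
stats = column_criteria(cols)
for i, s, d, c in stats:
    print(f"col {i}: set bits={s}, density={d:.3f}, compressibility={c:.3f}")

# Evaluate the hardest to compress columns first (compressibility ascending).
order = sorted(range(len(cols)), key=lambda i: stats[i][3])
```

For these made-up columns, the first two have density 0.5 (hardest to compress) and are evaluated before the dense third column.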

Using any of these criteria, the columns can be evaluated in two directions: increasing and decreasing order. When the columns are ordered by the number of set bits in decreasing order, the columns with more set bits are evaluated first. When the columns are ordered by their compressibility in increasing order, the columns that are potentially harder to compress are evaluated first. Note that this ordering does not necessarily improve the overall compression. The average run length after Gray code ordering decreases exponentially if the bits are uniformly distributed. In this case, the first few difficult columns compress well, but the remaining columns are forced to have more segments. Nevertheless, this evaluation order decreases the maximum compressed column size of the bitmap table, since the harder to compress columns come first. This bound on the maximum compressed bitmap column size translates into better query performance, since the worst-case query, i.e., one querying the biggest columns, performs faster, as query time depends on the size of the columns involved in the query. Similarly, ordering the columns by compressibility in decreasing order evaluates the easier to compress columns first, extending the clustering power of ACO to the later columns.

Reordering by Query History Files

A query history file contains a cumulative history of the queries that have been executed by the users. From this file, the access frequency for each column of the bitmap index can be derived. In order to achieve faster query runtimes, frequently accessed columns should be evaluated early in the ordering. Therefore, in the presence of query histories, one can sort the columns based on access frequencies to minimize average query runtime.

Generally, real-world queries exhibit a skewed distribution, i.e. few columns are

accessed very often while many others are rarely accessed. In order to represent

this phenomenon, a query history file following a Zipf distribution was randomly

generated. The classic case of Zipf’s law is a 1/n function. Given a set of frequencies sorted from most common to least common, the second most common

frequency will occur 1/2 as often as the first one. The third most common frequency

will occur 1/3 as often as the first one. The nth most common frequency will occur

1/n as often as the first one.
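Under stated assumptions (single-column queries, illustrative function names), generating such a Zipf-distributed query history can be sketched as:

```python
# Hedged sketch: a synthetic query-history file where column access
# frequencies follow Zipf's 1/n law. Names and representation are
# illustrative, not the dissertation's actual generator.
import random

def zipf_weights(n):
    # The rank-n most common column is accessed 1/n as often as the first.
    return [1.0 / rank for rank in range(1, n + 1)]

def generate_query_history(num_columns, num_queries, seed=42):
    rng = random.Random(seed)
    weights = zipf_weights(num_columns)
    # Each "query" here touches a single column, drawn with Zipf weights.
    return [rng.choices(range(num_columns), weights=weights)[0]
            for _ in range(num_queries)]

history = generate_query_history(num_columns=50, num_queries=100)
freq = {c: history.count(c) for c in set(history)}
# Columns can then be evaluated in order of descending access frequency.
order = sorted(freq, key=freq.get, reverse=True)
```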

4.2.5 Expected Size of the ACO Reordered Bitmap Table

The expected size of a compressed bitmap table using WAH is given in terms of the density of the bitmap table. The bit density is computed as dB = s/N, where s is the number of set bits and N is the total number of bits, which in our case is N = mn.

The expected size of a random bitmap is given by the formula [79]:

mR(dB) ≈ [N/(w − 1)] · [1 − (1 − dB)^(2w−2) − dB^(2w−2)]

where w is the word size, 32 bits in our case.
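The expected-size formula can be evaluated directly; the sketch below (function name is an assumption) shows how sharply the expected number of WAH words drops for sparse bitmaps:

```python
# Sketch of the expected-size formula for a WAH-compressed random bitmap.
# Symbols follow the text: N total bits, bit density d_B, word size w.

def expected_wah_words(N, d_B, w=32):
    """m_R(d_B) ~ N/(w-1) * (1 - (1-d_B)**(2w-2) - d_B**(2w-2))."""
    return N / (w - 1) * (1 - (1 - d_B) ** (2 * w - 2) - d_B ** (2 * w - 2))

# A very sparse random bitmap compresses to far fewer words than a dense one.
print(expected_wah_words(N=10**6, d_B=0.0001) < expected_wah_words(N=10**6, d_B=0.5))  # True
```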

For reordered bitmaps, a new bit density that reflects the long runs created during reordering needs to be computed. Note that half of the r runs in a bitmap are runs of 1s. However, r/2 cannot be taken as the new number of set bits s′, since one run may need more than one word to compress. In the general case

and for long runs, 3 WAH words are needed to compress a run: two literals and one

fill word. The two literal words take care of the beginning and end of the run, and

the fill word compresses the middle of the run. Note that this is a safe estimation.

For several cases, such as when a run is at the beginning or the end of the column, or

the size of the run is less than w − 1, fewer WAH words are needed to compress a run. By multiplying the number of runs of 1s by 3, the number of set bits in a reordered bitmap can be safely estimated. The new number of set bits s′ is then obtained

as:

s′ = min(⌈(3/2)r⌉, s)

where r is the number of runs in the reordered bitmap. Note that, by taking the

minimum between s and the compressible runs, the size of the reordered bitmap is

never greater than the original bitmap.

In order to compute the bit density, a new value for the number of bits in the bitmap, N, is computed as:

N ′ = N − s + s′

The bit density of a reordered bitmap can be then computed as:

d′B = s′/N′
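Putting the three estimates together (s′, N′, and d′B), the size estimate for a reordered bitmap can be sketched as follows; the function name and example numbers are illustrative:

```python
# Sketch of the reordered-bitmap density estimate from the text: given the
# number of runs r, the original set bits s, and total bits N, compute
# s' = min(ceil(3r/2), s), N' = N - s + s', and d'_B = s'/N'.
import math

def reordered_density(r, s, N):
    s_prime = min(math.ceil(1.5 * r), s)   # 3 WAH words per run of 1s, r/2 such runs
    N_prime = N - s + s_prime
    return s_prime / N_prime               # d'_B

# Illustrative example: 1,000,000 bits with 100,000 set bits collapsed by
# reordering into only 200 runs yields a much lower effective density.
d_prime = reordered_density(r=200, s=100_000, N=1_000_000)
```

The `min` with s guarantees the estimate never exceeds the density of the original bitmap, matching the remark above.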

Estimating the expected size of the compressed reordered bitmap reduces to determining the number of runs in the reordered bitmap. The following formulas to

determine the number of runs in the reordered bitmap are based on a full equality

encoded bitmap table.

Let us first consider the basic case where all the columns of an attribute are ordered

together, i.e. they are evaluated one after the other. Let us revisit the example in

Figure 4.4. As can be seen, the number of runs grows exponentially across different attributes but somewhat linearly for the columns of the same attribute.

Note that for each segment in a column, the number of splits produced, if the

bitmap table is full, is equal to the cardinality of the attribute. For a column, the

number of segments is given by the product of the cardinalities of the previously

ordered attributes.

The formula to compute the number of runs of an attribute Aj after all its columns have been reordered is then given by:

r(Aj) = cj + 2(cj − 1) · Π_{k<j} ck

where ck is the cardinality (number of columns) of the k-th attribute ordered. Even though the formula grows unbounded

as we move along in the ordering, the number of runs is bounded by the number of

tuples m in the table. In the worst case, there would be at most m runs in a column.

However, since for equality encoded bitmaps there is only one 1 per tuple (m ones in total), the maximum number of runs for an attribute Aj is:

rmax(Aj)= cj + 2(m − 1)

where cj is the attribute’s cardinality. The term cj indicates that a new run starts with each column, and the number of runs is reduced by 2 because there is one column that starts with 1 and one column that ends with 1 (it is not possible for all columns to be 0 under this bitmap encoding).

For simplicity of the analysis, let us assume the cardinality of all the attributes is the same, i.e. c. Assuming all columns of an attribute are ordered together, the total expected number of runs for all the columns of Aj (the j-th attribute) after all its columns are ordered is:

r(Aj) = c + 2(c − 1)c^(j−1)

The number of attributes for which adaptive code is guaranteed to have a significant effect is the k for which the growth formula reaches the bound, i.e. c + 2(c − 1)c^(k−1) = c + 2(m − 1). Solving for k, we obtain:

k = log_c((m − 1)/(c − 1)) + 1

k ≈ log_c m

The total number of runs for the whole reordered bitmap table can then be estimated as:

rACO = Σ_{j=1..k} [c + 2(c − 1)c^(j−1)] + Σ_{j=k+1..d} [2(m − 1) + c]
     = kc + 2 Σ_{j=1..k} (c^j − c^(j−1)) + (2(m − 1) + c)(d − k)
     = kc + 2(c^k − 1) + (2(m − 1) + c)(d − k)

where d is the number of attributes: the first k attributes follow the growth formula and the remaining d − k attributes are capped at the maximum. However, when d ≤ k, no attribute reaches the cap and the total reduces to:

rACO = dc + 2(c^d − 1)
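A quick numeric check of these formulas (a hedged sketch; function names are mine) reproduces the run counts reported later for the uniform-distribution experiment with c = 10:

```python
# Hedged numeric check of the run-count formulas: uniform cardinality c,
# m tuples, all columns of an attribute ordered together.
import math

def runs_after_ordering(j, c, m):
    """r(A_j) = c + 2(c-1)c^(j-1), capped at r_max = c + 2(m-1)."""
    return min(c + 2 * (c - 1) * c ** (j - 1), c + 2 * (m - 1))

def k_threshold(c, m):
    """k = log_c((m-1)/(c-1)) + 1, roughly log_c(m)."""
    return math.log((m - 1) / (c - 1), c) + 1

c, m = 10, 2_000_000
print(runs_after_ordering(1, c, m))   # 28: matches attribute 1 in Table 4.3
print(runs_after_ordering(2, c, m))   # 190: matches attribute 2 in Table 4.3
print(round(k_threshold(c, m), 2))    # about 6.35, i.e. roughly log_10(2e6)
```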

Relaxing the assumption that all columns of an attribute are ordered together, the number of runs for the ith column can be computed as:

r(bi) = Q(A(bi)) + 1 + Π_{Aj ∈ P − A(bi)} min(Q(Aj) + 1, cj)

where A(bi) is a function that returns the attribute to which column bi belongs, Q(Aj) is the number of columns of Aj that are already ordered, and P is the set of attributes for which at least one column has been reordered.
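Assuming an illustrative representation (Q as a dictionary from attribute name to the number of its columns ordered so far), the relaxed per-column estimate can be sketched as:

```python
# Sketch of the relaxed per-column run estimate r(b_i). The dictionary
# representation and function name are illustrative assumptions.

def column_runs(attr_of_bi, Q, cardinality):
    """r(b_i) = Q(A(b_i)) + 1 + prod over A_j in P - A(b_i) of min(Q(A_j)+1, c_j)."""
    prod = 1
    for attr, ordered in Q.items():
        # P: attributes with at least one column already reordered.
        if attr != attr_of_bi and ordered > 0:
            prod *= min(ordered + 1, cardinality[attr])
    return Q.get(attr_of_bi, 0) + 1 + prod

Q = {"A1": 3, "A2": 1}           # columns already ordered per attribute
card = {"A1": 10, "A2": 10}
r = column_runs("A1", Q, card)   # 3 + 1 + min(1+1, 10) = 6
```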

The impact of using the adaptive code algorithm increases with the number of tuples in the bitmap table and decreases with the number of columns. The number of columns depends on the number of attributes and the number of bins (cardinality)

of each attribute. Increasing these terms increases the number of cells in the bitmap

table, making the table more sparse. Nevertheless, even when the bitmap table has

a lot of missing cells, adaptive code ordering imposes bit-level similarity between

consecutive tuples very effectively as evidenced by our experimental results.

4.3 Experimental Results

In this section we evaluate the performance of our adaptive code ordering over synthetic datasets and real datasets from various application domains. All tables compare WAH compression [79] of the reordered data against WAH compression of the same data in random order. As we will present in detail, the performance improvements are not only in terms of bitmap size but also in terms of faster query execution times. Our algorithm performs better than GCO while maintaining the same complexity. Throughout this section the proposed approach is referred to as

ACO or aGCO interchangeably.

4.3.1 Performance of adaptive code ordering

In this section, we evaluate the performance of the adaptive code ordering. First,

we present the improvements in compression for equality encoded and range encoded

bitmaps. Then we show that these improvements come only at a minor cost, and

finally we demonstrate how our techniques can scale to larger datasets.

Compression Size

In this section, the effectiveness of our methods is presented based on the improvement factor from [50]. The improvement factor is computed as the ratio of the compressed bitmap table size of the original data to the compressed bitmap table size

Compressed sizes are in 32-bit KWords; the compression ratios are for WAH on the original data and ACO on the reordered data.

Dataset           Cols   Rows        Original    Reordered    WAH     ACO    Improv
                                     (KWords)    (KWords)     Ratio   Ratio  Factor
HEP1               122   2,173,762    3,180.85      549.83    2.61    15.07    5.79
HEP2               907   2,173,762   20,615.24    6,956.64    2.99     8.86    2.96
landsat 10x2        20     275,465      161.27       22.44    1.07     7.67    7.19
landsat 10x4        40     275,465      286.44       40.01    1.20     8.61    7.16
landsat 60x4       240     275,465    1,662.47    1,049.78    1.24     1.97    1.58
landsat 10x8        80     275,465      515.50      165.73    1.34     4.16    3.11
histogram 10x3      30     112,360       94.37       14.25    1.12     7.39    6.62
histogram 20x3      57     112,360      182.35       66.94    1.10     2.99    2.72
histogram 65x3     186     112,360      622.94      473.93    1.05     1.38    1.31
irvector16 kmx2     32      19,997        9.26        0.827   2.16    24.18   11.20
irvector16 kmx4     64      19,997       25.79        9.42    1.55     4.25    2.74
irvector16 ew       64      19,997       17.10        3.86    2.34    10.36    4.43
irvector32 kmx2     64      19,997       20.97        6.12    1.90     6.53    3.43
irvector32 kmx4    128      19,997       47.21       28.77    1.69     2.78    1.64

Table 4.1: Improvement in compression of real data sets for Equality Encoded bitmaps

of the reordered data. Thus, an improvement factor of 5 means the compressed reordered data takes one fifth of the space of the compressed original.

For reference, the compression ratio of the original data (WAH) and the reordered data (ACO) is also included.

Table 4.1 and Table 4.2 reveal the effectiveness of our adaptive code reordering algorithm on equality encoded and range encoded bitmaps, respectively. For these experiments, different data sets from various applications are used. In both tables, the first three columns give the name of the problem, the number of columns, and the number of tuples in the bitmap tables, respectively. The next two columns present the sizes of the compressed bitmap tables for the original and reordered data, respectively.

The next two columns present the compression ratio for WAH compressed bitmaps

Compressed sizes are in 32-bit KWords; the compression ratios are for WAH on the original data and ACO on the reordered data.

Dataset           Cols   Rows        Original    Reordered    WAH     ACO    Improv
                                     (KWords)    (KWords)     Ratio   Ratio  Factor
HEP1               110   2,173,762    3,302.09      432.96    2.26    17.26    7.63
landsat 10x4        30     275,465      195.83       23.46    1.32    11.01    8.35
landsat 60x4       180     275,465    1,132.47      675.24    1.37     2.30    1.68
histogram 10x3      20     112,360       62.18        8.34    1.13     8.42    7.46
histogram 65x3     130     112,360      409.02      302.94    1.12     1.51    1.35
irvector16 4        48      19,997       15.31        4.83    1.96     6.21    3.17
irvector32 4        96      19,997       30.11       17.83    1.99     3.37    1.69

Table 4.2: Improvement in compression of real data sets for Range Encoded bitmaps

of the original data and the reordered data, respectively. The last column presents the improvement factor.

The tables present several bitmap tables generated from four different applications. HEP comes from real high-energy physics applications. Landsat contains features from satellite images. Histogram comes from an image database with 112,361 images collected from a commercial CD-ROM; 64-dimensional color histograms are computed as feature vectors. The histogram data set is partially correlated. Irvector is composed of document feature vectors from 20 newsgroups based on TF/IDF (Term Frequency-Inverse Document Frequency) followed by reduction based on SVD (Singular Value Decomposition).

As seen in Table 4.1, compression rates are magnified when the tuples are reordered with respect to adaptive code ordering in all problem instances from all applications.

For this experiment the order of the columns is the original order, i.e. no column ordering criterion is used to evaluate the columns. Comparing the results for these

data sets, one can see that, as discussed in the previous section, improvements are more significant when the number of columns is smaller. Fewer columns mean more room for improvement with our reordering algorithm. Nevertheless, improvements are significant even for larger numbers of columns.

As an alternative to adaptive code ordering, we also tried the 2-switch heuristic on the TSP graphs for tuple reordering. As expected, the runtime was orders of magnitude slower compared to ACO. For instance, adaptive code ordering on HEP1, which has 122 columns and 2,173,762 rows, took only 37.4 seconds, whereas the 2-switch heuristic on the TSP graph took over 1,600 seconds. We observed some improvement in compression (about 1%), but the gap in run time was prohibitive. We observed similar results on the other data sets.

Execution time for adaptive code ordering

In the next set of experiments, we looked at the affordability of our adaptive code ordering as a preprocessing step; our results are presented in Figure 4.5. This figure shows the combined execution time of ACO and WAH compression for different numbers of rows. As can be seen, adaptive code ordering takes only about 25% of the compression time. Therefore we claim that adaptive code ordering should be considered a natural preprocessing step for run-length-encoding-based compression algorithms.

Scalability

We tested the scalability of our algorithm by measuring the run time as a function of the bitmap table size. For these experiments we used HEP1 data. The experiments were run on a Pentium 4 PC with 2.25GHz CPU and 512MB of memory.

Figure 4.5: Execution Time for ACO and WAH Compression.

Figures 4.6(a) and 4.6(b) show the average times of five runs on different problems of the same size. That is, the run time of the algorithm for 1,000 rows is reported as the average run time for 5 randomly selected row sets of size 1,000.

(a) Scalability vs. Number of Rows (b) Scalability vs. Number of Columns

Figure 4.6: Algorithm scalability (a) on the number of rows and (b) on the number of columns

[Figure 4.7: two plots of index size (KWords) versus number of attributes comparing WAH, aGCO, aGCO RR, and aGCO HH; (a) low dimensions, (b) higher dimensions.]

Figure 4.7: Performance of ACO and WAH for different column orderings and varying numbers of attributes

Figure 4.6(a) studies the effect of the number of rows on the run time. For these runs, we used the first 100 columns and varied the number of rows from 1,000 to 50,000. In Figure 4.6(a), the x-axis is the number of rows and the y-axis is the run time in milliseconds; the results clearly show the linear relation between the number of rows and the runtime. Similarly, the effect of the number of columns on the run time can be observed in Figure 4.6(b). Here, we fix the number of rows at 50,000 and vary the number of columns from 10 to 80. All results confirm the linear relation between the runtime of our algorithm and the bitmap table size.

We also test the performance of ACO in terms of solution quality for a varying number of columns over uniform random and Zipf-distributed data. We fixed the number of rows at 2,000,000 and tested the performance of our algorithm by changing the number of attributes and the number of bins per attribute. Figure 4.7(a) shows the compressed bitmap size of the uniform random data in thousands of 32-bit words

[Figure 4.8: plot of column size (KWords) versus column number for uncompressed, WAH, and aGCO bitmaps.]

Figure 4.8: Performance comparison of ACO and WAH for HEP data. Columns are ordered in their compressed size.

Figure 4.9: Performance comparison for varying cardinality

[Figure 4.10: log-scale plots of column size versus column number for uncompressed, aGCO uniform, and aGCO Zipf bitmaps; (a) attributes with cardinality 10, (b) attributes with cardinality 50.]

Figure 4.10: Performance of ACO for different data distributions and varying cardi- nality (a) attributes with cardinality 10 and (b) attributes with cardinality 50.

for attributes with cardinality 10. We vary the number of attributes from 1 to 10 to obtain between 10 and 100 columns. The improvement factors obtained are 6853.34, 1571.35, 6.47, and 1.73 when the number of columns is 10, 20, 50, and 100, respectively. Note that the set bits and/or compressibility would be about the same for all the columns in a uniformly distributed bitmap. In this case, we evaluate the effect of ACO when we take all the columns of each attribute together (ACO), one column of an attribute at a time in a round-robin fashion (ACO RR), or half of the

columns of each attribute first and then the second half in the same order of the

attributes (ACO HH). As can be seen in Figure 4.7(a), when we pick the columns in a round-robin fashion, the compressed size of the first attributes is not as good as if we had picked all the attribute columns together, but the compressed size of the last attribute is smaller in ACO RR than in ACO. ACO HH gives the fairest performance, as earlier attributes compress better than in ACO RR and later attributes better than in ACO.

Figure 4.7(b) shows the compressed bitmap size in thousands of 32-bit words for 5 attributes with varying cardinalities of 2, 10, and 50, yielding 10, 50, and 250 columns, respectively. The improvement factors obtained are 4924.96, 11.47, and 2.35 for uniform random data, and 4664.68, 16.30, and 3.02 for Zipf-distributed data. As the number of columns increases, the room for performance improvement decreases, depending exponentially on the number of attributes and somewhat linearly on the number of bins per attribute, as can be seen in Figure 4.10. Figure 4.10(a) shows the column size on a logarithmic scale for uniformly distributed random data with 5 attributes and cardinality 10; Figure 4.10(b) shows the same experiment for 5 attributes with cardinality 50.

Adaptive Code Ordering (ACO) versus Gray Code Ordering (GCO)

We compare the performance of ACO and GCO in terms of the number of runs.

Consistently, ACO produces fewer runs than GCO. In our experiments, a smaller number of runs has always translated into fewer WAH words.

Table 4.3 shows the number of compressed words and the number of runs for uniform and Zipf-distributed datasets, for 5 attributes with cardinality 10, ordered in the natural order of the columns. As can be seen, the number of runs in GCO and ACO is independent of the data distribution. A uniform data distribution provides the worst-case scenario for compression.

For the first attribute (columns 1-10), ACO and GCO perform the same. For the second attribute (columns 11-20), ACO produces 18 fewer runs than GCO; for the third, 198; for the fourth, 1,998; for the fifth, 19,998. On average, for a uniform

Uniform Distrib
Attribute   GCO Size (Words)   GCO Runs   ACO Size (Words)   ACO Runs
1                     47             28               45           28
2                    405            208              367          190
3                   3965           2008             3553         1810
4                  39455          20008            35445        18010
5                 261337         200008           241500       180010

Zipf Distrib
Attribute   GCO Size (Words)   GCO Runs   ACO Size (Words)   ACO Runs
1                     47             28               47           28
2                    397            208              365          190
3                   3945           2008             3559         1810
4                  33648          20004            29929        18006
5                 172044         172378           157537       152996

Table 4.3: Performance of ACO vs. GCO.

distribution of 50 columns (5 attributes with cardinality 10), ACO produces 444 fewer runs per column than GCO.

In terms of execution time, ACO and GCO have exactly the same time complexity.

4.3.2 Effects of column ordering

Table 4.4 shows the average and maximum compressed column sizes for the HEP1 and HEP1 RE datasets when using the column ordering criteria presented in the previous section. The "Avg Size" and "Max Size" columns correspond to the average and the maximum among the compressed column sizes.

While the overall compression ratio is not significantly affected by the order in which the columns are evaluated, the maximum column size, which translates to the worst case query execution time, is reduced considerably. For the basic encoded

                                       HEP1                  HEP1 RE
Method                          Avg Size   Max Size   Avg Size   Max Size
WAH                               26,072     68,719     30,018     69,534
ACO Alone                          4,507     35,327      4,609     38,026
Set Bits - Decreasing              4,863     18,495      5,211     28,100
Set Bits - Increasing              4,932     39,407      3,926     19,044
Compressibility - Increasing       4,863     18,465      4,505     21,105
Compressibility - Decreasing       4,932     39,394      3,990     37,962

Table 4.4: Adaptive code with column ordering. HEP data.

bitmaps (HEP1), the maximum bitmap size is halved when using the compressibility criterion in increasing order as opposed to the Gray code ordering alone, and it is almost 4 times smaller when compared to the original WAH compressed bitmaps.

This result automatically translates to improved query times, since it has already been reported that query run times are linearly dependent on the compressed bitmap table sizes.

For range encoded bitmaps (HEP1 RE), ordering the columns by the set bits criterion in increasing order improves both the average size and the maximum size, and therefore the overall compression ratio.

In general, when compressibility in increasing order is used as the column ordering criterion, the first and last columns compress better than the middle ones. This produces the smallest maximum size, i.e. the best worst case for a single bitmap column. When compressibility in decreasing order is used, the performance of Gray code is extended further than if the original order of the columns is used.

4.3.3 Improvements in Query Execution Times

The last set of experiments measures the impact of the improved compressed bitmaps on query execution time. For this experiment, HEP1, which has 12 attributes and 2,173,762 rows, was used. Sets of 100 random queries were generated with varying dimensionality. The queries are point queries over 1, 2, 4, 6, 8, 10, and 12 attributes, and range queries over 12 attributes where the number of attribute domain values in the interval varies over 1, 2, 3, 4, and 5, producing queries of 12, 24, 36, 48, and 60 columns. Figures 4.11(a) and 4.11(b) show the results of executing point and range queries, respectively.

In both cases, query execution using the reordered bitmaps requires less time than the corresponding queries using WAH compressed bitmaps over the original dataset. For the smallest performance improvement, measured when the largest number of bitmap columns is queried (half of the range for each attribute), the speedup is over 4 times for ACO bitmaps. For point queries, the minimum speedup is 5 times for ACO bitmaps, with at least one third of the queries being an order of magnitude faster. In addition, for point queries, ordering columns with the set bits criterion consistently yields improvements of up to 40% over using the original order of columns. For range queries, as the number of columns increases and more rows match the query criteria, the execution time for the reordered bitmaps becomes very close to, but still remarkably faster than, that of the compressed bitmaps that do not use reordering.

4.3.4 Column Ordering by Query History Files

To measure the effect of ordering the columns by query history files (QHF) on query execution time, a QHF of 100 queries following a Zipf distribution was randomly

(a) Point Queries (b) Range Queries

Figure 4.11: Query execution times for (a) point queries and (b) range queries.

generated. The bitmap table was compressed using WAH, and the queries were run over the original WAH compressed bitmap and the QHF reordered bitmaps. The compressed

size of the QHF reordered bitmap was 5.5 times smaller than the original WAH

compressed bitmap and only 3% bigger than the Gray code reordered bitmap. The

execution time for the 100 queries over the original WAH compressed bitmaps was 9.5

seconds and the execution time for the same queries over the QHF reordered bitmaps

was 0.063 seconds. This is a speedup of over 150 times.

4.4 Conclusions

We propose a new reordering algorithm for bitmap data called Adaptive Code Ordering (ACO) to improve the compression performance of run-length encoders and achieve faster query execution times. Our algorithm is inspired by "reflected" Gray codes and exhibits the same time and space complexity while producing fewer runs in the reordered data. ACO can be thought of as a hybrid between lexicographic ordering and Gray code ordering. The performance of our algorithm is also affected by the order in which we process the columns. We proposed two criteria for ordering the columns, using the number of set bits and the compressibility of the columns. By evaluating easier-to-compress columns first, we extend the clustering power of the ordering, and by evaluating the harder-to-compress columns first, we reduce the worst-case number of runs in the bitmap table.

CHAPTER 5

Direct Access over Compressed Bitmap Indexes

5.1 Introduction

While space is gained by compressing the bitmap, the ability to directly locate the row identifier by the bit position in the bitmap is lost. The fifth bit in the compressed bitmap does not necessarily correspond to the fifth record. This is acceptable when queries do not involve the row identifier, or a range of identifiers, as part of the query, i.e. all records are considered as potential answers to the query. However, many queries are executed over a subset of the data, which is usually identified by a set of constraints. For example, a typical scientific visualization query focuses on a certain region or time range. In a data warehouse, queries are specified over the dimensions (or the higher level dimensions in the hierarchy) such as date (or month, quarter, year), account (or customer, customer class, profession), or product (or product type, vendor). In many cases, queries pose multiple constraints and the result of executing one of the constraints is a list of candidate row identifiers to be further processed. In a multi-dimensional range query, the rows that do not satisfy a condition in one of the dimensions do not need to be processed in other dimensions. The performance of the current bitmap encoding schemes would suffer under all these scenarios where

relevant rows are provided as part of the query. Moreover, handling queries with WAH that ask for only a few rows requires extra bit operations or decompression of the bitmap.

Queries that only ask for a few rows are very common. The row number could represent time, product, or spatial position. In fact, the row number can indicate any attribute as long as the data set is physically ordered by it. For example, consider a data warehouse where the data is physically ordered by date. A query that asks for the total sales of every Monday for the last 3 months would effectively select twelve rows. Similarly, a query that asks for the sales of the last seven days is asking for seven rows. Moreover, the row number in the bitmap can represent more than one attribute. For example, we could map the x, y, and z coordinates of a data point to a single integer by using a well-known mapping function or space-filling curve and physically

order the points by three attributes at the same time. When users ask for a particular

region, a small cube within the data space, we can map all the points in the query to

their index and evaluate the query conditions over the resulting rows.
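One well-known mapping of this kind is bit interleaving along a Z-order (Morton) curve; the sketch below is an illustrative choice, not necessarily the curve the text has in mind:

```python
# Hedged sketch: interleaving coordinate bits (Morton / Z-order encoding)
# so a 3-D point maps to a single row number. Function name is illustrative.

def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one integer."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)        # x occupies bit positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)    # y occupies positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)    # z occupies positions 2, 5, 8, ...
    return code

# Nearby points tend to receive nearby row numbers, so a small query cube
# maps to a small set of candidate rows.
print(morton3(0, 0, 0))  # 0
print(morton3(1, 0, 0))  # 1
print(morton3(0, 1, 0))  # 2
print(morton3(1, 1, 1))  # 7
```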

While many other approaches, including compressed bitmaps, compute the answer

in O(N) time, where N is the number of points in the grid, we want to be able to

compute the answers in the optimal O(c) time, where c is the number of points in

the region queried. Our goal is to provide a structure that enables direct and efficient

access to any subset of the bitmap, just as the decompressed bitmaps can, and which

does not take as much space, i.e. it is stored in compressed form.

We propose an approximate bitmap encoding that stores the bitmaps in compressed form while maintaining efficient query processing over any subset of rows or

columns. The proposed scheme inserts the set bits of the bitmap matrix into an Approximate Bitmap (AB) through hashing. Retrieval is done by testing the bits in the

AB, where any subset of the bitmap matrix can be retrieved efficiently. It is shown that there are no false negatives, i.e., no rows that satisfy the query constraints are missed. The false positive rate can be controlled by selecting the parameters of the encoding properly. Approximate query answers are tolerable in many typical applications, e.g., visualization and data warehousing. For applications requiring exact answers, false positives can be pruned in a second step of query execution. Thus, the recall is always 100% and the precision depends on the amount of resources we are willing to use. For example, a visualization tool can allow some margin of imprecision, and exact answers can be retrieved for a finer granularity query.

Contributions of this work include the following:

1. We propose a new bitmap encoding scheme that approximates the bitmaps

using hashing of the set bits.

2. Proposed scheme allows efficient retrieval of any subset of rows and columns.

In fact, retrieval cost is O(c) where c is the cardinality of the subset.

3. We provide two ways of specifying the AB parameters: setting a maximum size, in which case the AB is built to achieve the best precision for the available memory; or

setting a minimum precision, where the least amount of space is used to ensure

the minimum precision.

4. The proposed scheme can be applied at three different levels. Building one AB

for each attribute makes compression size independent of the cardinality of the

attributes. Building one AB for the whole data set makes the compressed size

independent of the dimensionality of the data set. Building one AB for each

column makes each AB size dependent on the number of set bits.

5. The approach presented in this chapter can be combined with other structures

to improve even further the query execution time and the precision of the results

using minimal memory.

6. The approximate nature of the proposed approach makes it a privacy preserving

structure that can be used without database access to retrieve query answers.

5.2 Related Work

Our solution is inspired by Bloom Filters. Bloom Filters are used in many applications in databases and networking including query processing [40, 19, 13, 39, 41],

IP traceback [61, 60], per-flow measurements [18, 33, 32], web caching [69, 23, 22] and loop detection [73]. A survey of Bloom Filter (BF) applications is described in [11].

A BF computes k distinct independent uniform hash functions. Each hash function returns an m-bit result, and this result is used as an index into a 2^m-sized bit array. The array is initially set to zeros and bits are set as data items are inserted. Insertion

of a data item is accomplished by computing the k hash function results and setting the corresponding bits to 1 in the BF. Retrieval can be done by computing the k digests

This is called a false positive, i.e., BF returns a result indicating the data item is in the filter but actually it is not a member of the data set. On the other hand, BFs do

Figure 5.1: A Bloom Filter

not cause false negatives. It is not possible to return a result that reports a member item as a non-member, i.e., member data items are always in the filter. Operation of a BF is given in Figure 5.1.
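The insert/test behavior described above can be sketched in a few lines; the hash construction (salted SHA-256) and class name are illustrative choices, not the dissertation's implementation:

```python
# Minimal Bloom filter sketch matching the description above. The salted
# SHA-256 digests stand in for k independent uniform hash functions.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits)          # one byte per bit, for clarity

    def _positions(self, item):
        for t in range(self.k):
            h = hashlib.sha256(f"{t}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def maybe_contains(self, item):
        # False => definitely absent (no false negatives);
        # True  => present with high probability (false positives possible).
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.insert("row:42")
print(bf.maybe_contains("row:42"))   # True: members are always found
```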

5.3 Proposed Scheme

The goal is to encode bitmaps in a way that provides fast and integrated querying over compressed bitmaps with direct access and no need to decompress the bitmap.

Bitmaps can be thought of as a special case of boolean matrices, and in general the solution can be applied to any boolean matrix. First, we describe how the proposed scheme can be applied to encode boolean matrices and then how it can be used to retrieve any subset of the matrix efficiently. Next, we describe the particular solution and its variations for bitmap encoding.

5.3.1 Encoding General Boolean Matrices

Consider the boolean matrix M in Figure 5.3 as an example. To compress this boolean matrix, we encode it into a binary array AB using multiple hash functions.

For each set bit in the matrix, we construct a hashing string x as a function of the row and the column number. Then, k independent hash functions are applied over x and the positions pointed to by the hash values are set to 1 in the binary array. The insertion algorithm is provided in Figure 5.2.

Collisions can happen when a hash function maps two different strings to the same value or two different hash functions map different strings to the same value.

Figure 5.4 presents the 32-bit AB that encodes M with F(i, j) = concatenate(i, j), k = 1, and H1(x) = x mod 32.

Definition: A query Q = {(r1, c1), (r2, c2), ..., (rl, cl)} is a subset of a boolean matrix M, and the query result T = {b1, b2, ..., bl} is a set of bits such that bi = 1 if M(ri, ci) = 1 and bi = 0 if M(ri, ci) = 0.

To obtain the query result set T using AB, we process each cell (ri, ci) specified in the query. We compute the hash string x = F(ri, ci) and, for each of the k hash functions, obtain the hash value ht = Ht(x). If all the bits pointed to by the ht, 1 ≤ t ≤ k, are 1, then the value of (ri, ci) is reported as 1. If a bit pointed to by any of the ht values is 0, the value of (ri, ci) is reported as 0. The retrieval algorithm is given in Figure 5.5.

insert(M)
01 for i = 1 to rows
02   for j = 1 to columns
03     if M(i,j) == 1
04       x = F(i,j)
05       for t = 1 to k
06         b = Ht(x)
07         set AB[b] to 1

Figure 5.2: Insertion Algorithm

     1 2 3 4 5 6
  1  0 0 0 1 0 1
  2  0 1 0 0 1 0
  3  0 0 0 0 0 0
  4  0 0 1 0 0 1
  5  1 0 0 0 0 0
  6  0 0 0 0 1 0
  7  0 1 0 0 0 0
  8  0 0 1 0 0 1

Figure 5.3: Example of a Boolean Matrix M

0100 0001 0001 0101 0010 0100 1000 1000

Figure 5.4: AB(M) for F(i, j) = concatenate(i, j), k = 1, and H1(x) = x mod 32.

As a query example, consider Q1 = {(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)} over M. This query asks for the values in the third row. The answer to this query is T1 = {0, 0, 0, 0, 0, 0}. Answering this query using the AB in Figure 5.4 would produce T1′ = {0, 0, 1, 0, 0, 0}. The third element in the answer set is a false positive set by (6, 5). Similarly, Q2 = {(1, 6), (2, 6), (3, 6), (4, 6), (5, 6), (6, 6), (7, 6), (8, 6)}, which asks for column 6, would produce the approximate result T2′ = {1, 0, 0, 1, 0, 0, 1, 1}, where the seventh element is a false positive. A second step in the query execution can prune the false positives later. The scheme is guaranteed not to produce false negatives.

We want to retrieve any subset of M efficiently. A structure that stores M by rows would retrieve a row efficiently but not a column, and vice versa. By using this encoding we can retrieve any subset efficiently, even a diagonal query for which other structures would need to access the entire matrix. The time complexity of the retrieval algorithm is O(c), where c is the cardinality of the subset being queried. For Q1 and Q2 the cardinalities are 6 and 8, respectively. Using the hash values, AB can be accessed directly to approximate the value of the corresponding cell.

retrieve(Q)
01 T = ∅
02 for each (i,j) in Q
03   in = true
04   x = F(i,j)
05   for t = 1 to k
06     if AB[Ht(x)] == 0
07       in = false
08       T = T ∪ {0}
09       break loop
10   if in == true
11     T = T ∪ {1}
return(T)

Figure 5.5: Retrieval Algorithm
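The insertion and retrieval algorithms of Figures 5.2 and 5.5 can be sketched in Python for the running example, with F(i, j) = concatenate(i, j), k = 1, and H1(x) = x mod 32; M is the matrix of Figure 5.3. The function names are our own, and this sketch assumes 1-indexed rows and columns as in the text.

```python
# Boolean matrix M from Figure 5.3 (8 rows x 6 columns).
M = [
    [0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
]

def F(i, j):
    # Hash string: concatenation of row and column number, read as an integer.
    return int(str(i) + str(j))

hashes = [lambda x: x % 32]  # k = 1 hash function, H1(x) = x mod 32

def insert(M):
    """Figure 5.2: set one AB bit per hash function for every 1-bit in M."""
    AB = [0] * 32
    for i in range(1, len(M) + 1):
        for j in range(1, len(M[0]) + 1):
            if M[i - 1][j - 1] == 1:
                for H in hashes:
                    AB[H(F(i, j))] = 1
    return AB

def retrieve(AB, Q):
    """Figure 5.5: report 1 iff all k checked bits are set (no false negatives)."""
    return [int(all(AB[H(F(i, j))] for H in hashes)) for (i, j) in Q]
```

Retrieving Q1 (the third row) yields {0, 0, 1, 0, 0, 0}: the strings for (3, 3) and (6, 5) are 33 and 65, and both map to position 1 modulo 32, reproducing the false positive discussed in the text.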

5.3.2 Approximate Bitmap (AB) Encoding

Consider the boolean matrix in Figure 5.6. This matrix represents the bitmaps for a table T with three attributes A, B, C and 8 rows. Each attribute is divided into 3 bins. Columns 1, 2, and 3 correspond to Attribute A; columns 4, 5, and 6 correspond to Attribute B; and the rest to Attribute C. Note that, for each row, only one bit is set per attribute; this bit corresponds to the value of that attribute for the given row. Bitmaps are stored by columns, therefore there are nine bitmap vectors for table T.

     1  2  3  4  5  6  7  8  9
     A1 A2 A3 B1 B2 B3 C1 C2 C3
  1  1  0  0  0  0  1  0  0  1
  2  0  1  0  0  1  0  0  1  0
  3  0  0  1  1  0  0  1  0  0
  4  0  0  1  0  0  1  0  0  1
  5  1  0  0  1  0  0  1  0  0
  6  1  0  0  0  1  0  1  0  0
  7  0  1  0  0  1  0  0  1  0
  8  0  0  1  0  0  1  0  0  1

Figure 5.6: Example of a bitmap table

In the case of bitmap matrices we can apply the encoding scheme at three different levels:

• One AB for the whole data set.

• One AB for each attribute.

• One AB for each column.

Construction of the AB for bitmaps is very similar to the construction for general boolean matrices. Depending on the level of encoding, we define different hash string mapping functions and different hash functions to take advantage of the intrinsic characteristics of the bitmaps.

Hash String Mapping Functions

The goal of the mapping function F is to map each cell to a different hash string x.

Mapping different cells to the same string would make the bitmap table cells map to the same cells in the AB for all the hash functions, increasing the number of collisions.

First, we assign a global column identifier to each column in the bitmap table. Then, we use both the column and the row number to construct x = F(i, j), where i is the row number and j is the column number.

When we construct one approximate bitmap per data set, we define F(i, j) = (i << w) || j, where w is a user-defined offset. This string is in fact unique when w is large enough to accommodate all j.

When we construct one approximate bitmap per column, we use F(i, j) = i, since the column number is already encoded in the AB itself.
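The two mapping functions can be sketched as follows. The shift width w = 8 is an arbitrary illustrative choice (any w with 2^w larger than the largest global column identifier works), and the function names are ours.

```python
W = 8  # user-defined offset w; must satisfy (number of columns) < 2**W

def f_dataset(i, j):
    """One AB per data set: x = (i << w) | j, unique when w accommodates all j."""
    return (i << W) | j

def f_column(i, j):
    """One AB per column: the column is implied by which AB we probe, so x = i."""
    return i
```

With w = 8, up to 256 global column identifiers can be encoded without two distinct cells ever colliding on the same hash string.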

Hash Functions

The hash functions used to build the AB have an impact on both the execution time and the accuracy of the results. We explore two hash function approaches. In the first one, we compute a single very large hash function and use k partial values from the hash output to decide which bitmap bits to set. The motivation is to avoid the overhead introduced by computing several hash functions. In addition, such functions usually have hardware support, which makes them very fast. In the second approach, we apply k independent hash functions and use their hash values to set the bits in the approximate bitmap.

Single Hash Function. The main advantage of using a single hash function is that the function is called only once; the result is then divided into pieces and each piece is considered to be the value of a different hash function. Table 5.1 illustrates this approach. However, the hash function selected should be secure in the sense that no patterns or similar runs should be found in the output, so that the mappings of the inputs point to different bits of the AB and the chances of collisions in the AB are smaller. For the single hash function approach we adopt the Secure Hash Algorithm (SHA), developed by the National Institute of Standards and Technology (NIST), along with the NSA, for use with the Digital Signature Standard (DSS) [54].

Hash Function  H1                H2                ...  H10
Bits           159 − 144         143 − 128         ...  15 − 0
SHA Output     0100100010001010  1000010100100001  ...  0000010101110011

Table 5.1: Single Hash Function 160-bit output split into 10 sets (k = 10) of 16 bits (AB Size = 2^16).
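Splitting a single 160-bit SHA-1 digest into k partial hash values can be sketched with Python's hashlib; the 16-bit slice width matches the AB size 2^16 of Table 5.1, and the function name is ours.

```python
import hashlib

def sha1_partial_hashes(x, k=10, bits=16):
    """Split SHA-1's 160-bit output into k slices of `bits` bits each,
    treating each slice as the value of a separate hash function Ht."""
    assert k * bits <= 160, "SHA-1 only provides 160 output bits"
    digest = int.from_bytes(hashlib.sha1(str(x).encode()).digest(), "big")
    out = []
    for t in range(k):
        shift = 160 - (t + 1) * bits  # H1 takes bits 159..144, and so on
        out.append((digest >> shift) & ((1 << bits) - 1))
    return out
```

Each returned value indexes directly into an AB of size 2^16.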

Independent Hash Functions. Hash functions can be designed to take into account the characteristics of the bitmaps being encoded: each row number appears exactly once per attribute, and only one column per attribute is set per row. Some hash functions perform better than others. For example, a hash function that ignores the column number, such as F(i, j) = i, would perform poorly when encoding the bitmaps as one AB per data set or one AB per attribute. Using this function, every hash string is inserted during the insertion step, since each row i has at least one 1 in it. The retrieval algorithm would therefore always find all the checked bits set, and the answer would have a false positive rate of 1, i.e., every cell considered in the query would be reported as an answer. We define two hash functions, namely Column Group and Circular Hash, in Section 5.5.2.

Bitmap Query Processing with AB

Generally, the types of queries executed over bitmaps are point and range queries over columns. Other types of queries would require more bitmap operations or decompression of the final bitmap, because the column-wise storage and the run-length compression prevent direct access to a given bit in the bit vector. Our technique can support these other query types as well, because the compression and storage of the bitmap are not done column-wise. For comparison with current bitmap techniques, we define rectangular queries over the AB and the logic for interval evaluation.

Definition: A query Q = {(A1, l1, u1), ..., (Aqdim, lqdim, uqdim), (R, r1, ..., rx)} is a bitmap query over a bitmap table B, and the query result T = {b1, b2, ..., bx} is a set of bits such that

bk = ⋀_{i=1}^{qdim} ⋁_{j=li}^{ui} B(rk, Aij)

As an example, consider the query Q3 = {(A, 1, 2), (R, 4, 5, 6, 7, 8)} over the bitmap table in Figure 5.6. This query asks for the rows between 4 and 8 where Attribute A falls into bin 1 or 2. Traditional bitmaps would apply the bit-wise OR operation over the bitmaps corresponding to Attribute A value 1 and Attribute A value 2. Then, they would either scan the resulting bitmap to find the values for positions 4 through 8, or perform a bit-wise AND operation between the resulting bitmap and an auxiliary bitmap in which only positions 4 through 8 are set, to produce the final answer. The exact answer of this query is T = {0, 1, 1, 1, 0}, which translates into row numbers 5, 6, and 7. To execute this query using AB, we first find the approximate value for the first cell of the query, (4, 1). If the value is 1, we can skip the rest of the row since the answer is going to be 1. If the value is 0, we approximate the value for the next cell, (4, 2), and OR them together. We then continue processing rows 5, 6, 7, and 8 similarly. Q3 is a one-dimensional query because it only asks for one attribute.

Q4 = {(A, 1, 2), (B, 2, 3), (R, 4, 5, 6, 7, 8)} is a two-dimensional query asking for the rows between 4 and 8 where Attribute A falls into bin 1 or 2 and Attribute B falls into bin 2 or 3. Each interval is evaluated as a one-dimensional query and the results are ANDed together. The retrieval algorithm for bitmap queries using AB is given in Figure 5.7.

5.4 Analysis of Proposed Scheme

In this section, we analyze the false positive rate of the AB and provide the

theoretical foundation to compute the size and the precision for each level of encoding.

5.4.1 False Positive Rate

In this section we analyze the false positive rate of the AB and discuss how to select the parameters. We use the notation given in Table 5.2.

Symbol  Meaning
N       Number of rows in the relation
c       Number of columns in the relation
s       Number of bits that are set
k       Number of hash functions
n       AB Size
m       Hash Function Size (log2 n)
α       AB Size Parameter (⌈n/s⌉)

Table 5.2: Notation

retrieve(Q)
01 T = ∅
02 for each i in R
03   andpart = true
04   for A = 1 to qdim
05     orpart = false
06     for j = A(lA) to A(uA)
07       x = F(i,j)
08       in = true
09       for t = 1 to k
10         if AB[Ht(x)] == 0
11           in = false
12           break loop
13       orpart = orpart OR in
14       if orpart == true
15         break loop
16     andpart = andpart AND orpart
17     if andpart == false
18       T = T ∪ {0}
19       break loop
20   if andpart == true
21     T = T ∪ {1}
return(T)

Figure 5.7: Retrieval Algorithm for Bitmap Queries
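The interval logic of Figure 5.7 (OR across the bins of one attribute, AND across attributes, with early termination) can be sketched as follows. The `member(i, j)` predicate stands in for the k AB probes; in this sketch it is exact, so the code isolates the query logic itself. Names are ours.

```python
def retrieve_rectangular(member, rows, intervals):
    """For each row: AND over attributes of (OR over the attribute's queried
    columns), short-circuiting as soon as the outcome for the row is known.
    `intervals` is a list of (lo_col, hi_col) global column ranges, one per
    queried attribute; `member(i, j)` approximates bitmap cell (i, j)."""
    result = []
    for i in rows:
        andpart = True
        for lo, hi in intervals:
            orpart = False
            for j in range(lo, hi + 1):
                if member(i, j):   # a single 1 settles the OR
                    orpart = True
                    break
            if not orpart:         # a single empty interval settles the AND
                andpart = False
                break
        result.append(int(andpart))
    return result
```

Run against the exact bitmap table of Figure 5.6, the query Q3 = {(A, 1, 2), (R, 4, ..., 8)} corresponds to rows [4..8] with the single interval (1, 2) and returns {0, 1, 1, 1, 0}.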

Assuming the hash transforms are random, the probability of a bit in the AB being set by a single hash function is 1/n, and of not being set is 1 − 1/n. After s elements are inserted into the AB, the probability of a particular bit being zero (non-set) is

(1 − 1/n)^{ks} ≈ e^{−ks/n}

The probability that all the k bits of an element are already marked (a false positive) is

(1 − (1 − 1/n)^{ks})^k ≈ (1 − e^{−ks/n})^k

Since most large scientific data sets are read-only, we know the parameter s, and we have control over the parameters k and n. Ideally we want a data structure whose size depends only on the number of set bits s. For sparse matrices, such as most bitmaps, the size of the data structure should be small. Assume that we use an AB whose size n is αs, where α is an integer denoting how much space is allocated as a multiple of s. The false positive rate of the AB can then be expressed as

(1 − (1 − 1/(αs))^{ks})^k ≈ (1 − e^{−ks/(αs)})^k = (1 − e^{−k/α})^k

Figure 5.8: False Positive Rate vs. α

Figure 5.9: False Positive Rate vs. k

The false positive rate is plotted in Figure 5.8. As α increases, the false positive rate goes down, since collisions are less likely in a large AB. For a fixed α, the false positive rate depends on k. The value of k that minimizes the false positive rate can be found by taking the derivative of the false positive rate with respect to k and setting it to zero. The change in the false positive rate for several values of α is given in Figure 5.9.
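Numerically, the false positive rate (1 − e^{−k/α})^k is minimized near k = α ln 2, the familiar Bloom filter optimum. The following sketch (names ours) evaluates the expression over a range of integer k:

```python
import math

def false_positive_rate(k, alpha):
    """FP rate of an AB with n = alpha * s bits and k hash functions."""
    return (1.0 - math.exp(-k / alpha)) ** k

def best_k(alpha, k_max=30):
    """Integer k in [1, k_max] that minimizes the false positive rate."""
    return min(range(1, k_max + 1), key=lambda k: false_positive_rate(k, alpha))
```

For α = 8 this gives k = 6, close to 8 ln 2 ≈ 5.5, with a false positive rate of about 2.2%.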

5.4.2 Size vs. Precision

In this section, we formulate the size and the precision of AB, as a function of the input parameters.

Let D = {A1, A2, ..., Ad} be a data set with N records and let Ci be the cardinality of attribute Ai. The number of set bits s under various scenarios can be computed as follows:

• One AB for the whole data set: In this case we construct only one approximate bitmap, and the number of set bits is s = dN. The false positive rate of this filter is (1 − e^{−k/α1})^k, where the filter size is n1 = α1 dN.

• One AB for each attribute: In this case we construct d approximate bitmaps, and the number of set bits is s = N for each one. The false positive rate of each filter is (1 − e^{−k/α2})^k, where each filter has size n2 = α2 N. Compared with the above scenario, each filter with α2 = α1 is a factor of d smaller than the single whole-data-set filter and still achieves the same false positive rate.

• One AB for each column: In this case we construct ∑_{i=1}^{d} Ci approximate bitmaps, one for each attribute-value pair, and the number of set bits s may be different for each bitmap, depending on the number of records that have that particular value.

Let the size of the approximate bitmap be 2^m. Using the notation from the previous section, we can derive m to be

m = ⌈log2(sα)⌉    (5.1)

Precision (P) is defined as a function of the false positive rate (FP) as follows:

P = 1 − FP = 1 − (1 − e^{−k/α})^k

Given a maximum filter size 2^{mmax}, α can be computed using Equation 5.1. The largest possible filter size is chosen, since large filters are preferable for their low false positive rate. We then compute the minimum k that maximizes P (minimizes FP).

Given a minimum precision Pmin, the corresponding α can be computed as a function of the number of hash functions k:

α = −k / ln(1 − e^{ln(1 − Pmin)/k}) = −k / ln(1 − (1 − Pmin)^{1/k})

The approximate bitmap size can then be computed using α.

Dataset  Rows       Columns  Bitmaps  Set bits    Uncompressed         WAH Size    Compr.
                                                  Bitmap Size (bytes)  (bytes)     Ratio
Uniform  100,000    2        100      200,000     1,290,800            922,868     0.71
Landsat  275,465    60       900      16,527,900  31,993,200           30,103,376  0.94
HEP      2,173,762  6        66       13,042,572  18,512,472           12,021,328  0.65

Table 5.3: Data Set Descriptions
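The closed form for α can be checked numerically: for a target precision Pmin and a given k, computing α by the formula and substituting it back recovers exactly P = Pmin. A sketch (names ours), including the filter-size exponent m of Equation 5.1:

```python
import math

def alpha_for_precision(p_min, k):
    """alpha = -k / ln(1 - (1 - p_min)^(1/k))."""
    return -k / math.log(1.0 - (1.0 - p_min) ** (1.0 / k))

def filter_exponent(s, alpha):
    """Equation 5.1: m = ceil(log2(s * alpha)), so the AB has 2^m bits."""
    return math.ceil(math.log2(s * alpha))
```

For example, Pmin = 0.99 with k = 6 gives α ≈ 9.6; with s = 16,527,900 set bits (the Landsat data set) this corresponds to m = 28, i.e., a 2^28-bit filter.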

5.5 Experimental Framework

5.5.1 Data Sets

For the experiments we use two real data sets and one synthetic data set. Table 5.3 gives a description of each data set. Uniform is a randomly generated data set following a uniform distribution, HEP contains data from high energy physics experiments, and Landsat is the SVD transformation of satellite images. The uniformly distributed data set is representative for bitmap experiments, since data generally needs to be discretized into bins before constructing the bitmaps, and it is well known that having bins with the same number of points is better than having bins with the same interval size. The intuition is that evenly distributing skewed attributes achieves uniform search times for all queries. Effectively, any data set can be encoded into uniformly distributed bitmaps by dividing the attribute domain into bins that roughly have the same number of data points.
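Equi-depth binning (bins with roughly equal numbers of points, rather than equal-width intervals) can be sketched as follows; the bin boundaries are simply the attribute's quantiles. This is our own illustrative helper, not code from the dissertation.

```python
def equi_depth_bins(values, n_bins):
    """Assign each value a bin id in [0, n_bins) so that every bin holds
    roughly the same number of points (equi-depth binning)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bin_of = [0] * len(values)
    for rank, i in enumerate(order):
        bin_of[i] = rank * n_bins // len(values)
    return bin_of
```

Even a heavily skewed attribute yields bins of near-equal population, which is what makes the uniform data set a reasonable proxy for binned real data.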

Dataset  # ABs  α = 2      α = 4      α = 8       α = 16
Uniform  1      65,536     131,072    262,144     524,288
Landsat  1      4,194,304  8,388,608  16,777,216  33,554,432
HEP      1      4,194,304  8,388,608  16,777,216  33,554,432

Table 5.4: AB Size (in bytes) as a function of α. One AB per data set.

                α = 2                 α = 4                  α = 8                  α = 16
Dataset  # ABs  Avg        Total      Avg        Total       Avg        Total       Avg        Total
Uniform  2      32,768     65,536     65,536     131,072     131,072    262,144     262,144    524,288
Landsat  60     131,072    7,864,320  262,144    15,728,640  524,288    31,457,280  1,048,576  62,914,560
HEP      6      1,048,576  6,291,456  2,097,152  12,582,912  4,194,304  25,165,824  8,388,608  50,331,648

Table 5.5: AB Size (in bytes) as a function of α. One AB per attribute.

5.5.2 Hash Functions

Single Hash Function (SHA-1)

SHA is a cryptographic message digest algorithm similar to the MD4 family of hash functions developed by Rivest (RSA). It differs in that the entire transformation was designed to accommodate the DSS block size for efficiency. The Secure Hash Algorithm takes a message of any length and produces a 160-bit message digest, which is designed so that finding a text matching a given hash is computationally very expensive.

                α = 2              α = 4               α = 8               α = 16
Dataset  # ABs  Avg     Total      Avg      Total      Avg      Total      Avg      Total
Uniform  100    574     57,344     1,147    114,688    2,294    229,376    4,588    458,752
Landsat  900    6,809   6,127,616  13,617   12,255,232 27,234   24,510,464 54,468   49,020,928
HEP      66     67,986  4,487,048  135,972  8,974,096  271,943  17,948,194 543,885  35,896,388

Table 5.6: AB Size (in bytes) as a function of α. One AB per column.

We utilize the SHA-1 [54] algorithm in our single hash function implementation. SHA-1 was a revision to SHA that corrected an unpublished defect in the original algorithm.

Independent Hash Functions

The purpose of using different hash functions is to measure the impact of the hash function in the precision of the results. Here we describe the hash functions we use in the experiments. The rest of the hash functions used in this work come from the General Purpose Hash Function Algorithms Library [48] with small variations to account for the size of the AB.

• Column Group. This hash function splits the AB into groups. The number of groups is the cardinality of the attribute. The group number is selected based on the column number, and the offset is computed using the modulo operation over the row number: H(i, j) = jn + (i mod n), where n here denotes the number of positions in each group. We only use this hash function when we construct one AB per data set or one AB per attribute.

• Circular Hash. This hash function constructs a unique number using the row and column number and maps it to a cell in the AB using the modulo operation: H(x) = x mod n. In the case of one approximate bitmap per column, only the row number is used.
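Both hash functions can be sketched directly from their definitions; here `group_size` plays the role of the group width in the Column Group formula and `ab_size` the role of n in the Circular Hash. Function and parameter names are ours.

```python
def column_group(i, j, group_size):
    """Column Group: group selected by the column number j, offset by the
    row number modulo the group size: H(i, j) = j * n + (i mod n)."""
    return j * group_size + (i % group_size)

def circular_hash(x, ab_size):
    """Circular Hash: H(x) = x mod n over the unique row/column number x."""
    return x % ab_size
```

Column Group guarantees that cells from different columns never collide, at the cost of fixing each column's region of the AB.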

5.5.3 Queries

As the query generation strategy we use sampling. For sampled queries there is at least one row that matches the query criteria. This fact is important for the AB experiments because, if the number of actual query results is 0, the precision of the AB would always be 0: queries producing 2 or 1,000 false positives would have the same precision. The input parameters for the query generator are:

• Number of queries (q): The total number of queries to be generated. We set this parameter to 100.

• Data set (D): The data set for which we generate the queries, D = {A1, A2, ..., Ad}. The number of attributes d, each attribute Ai with 1 ≤ i ≤ d, the cardinality Ci of each attribute, and the number of rows N can all be derived from D.

• Query dimensionality (qdim): The number of attributes in the query, 1 ≤ qdim ≤ d.

• Attribute selectivity (sel): The percentage of values from the attribute cardinality that constitute the interval queried for that attribute.

• Percent of rows (r): The percentage of rows selected in the queries. A hundred percent indicates that the query is executed over the whole data set. The range for the rows is produced using the row number, i.e., the physical order of the data set. The lower value l is picked randomly between 1 and N. The upper bound u is computed as l + (r ∗ N); when this value is greater than N, we set u = N.

For the query generation, we randomly select q rows from the data set. Let us identify those rows by {r1, r2, ..., rq}. Each query qj is based on rj. qdim distinct attributes are selected randomly for each query. For each Ai picked, where 1 ≤ i ≤ qdim, the lower bound li is given by the value of Ai in rj. The upper bound ui is computed as li + (sel ∗ Ci); when this value is greater than Ci, we set ui = Ci.

Data set  qdim  sel  r
Uniform   2     6    .1, .5, 1, 5, 10
Landsat   2     20   .04, .2, .4, 2, 4
HEP       2     25   .005, .01, .05, .1, .5

Table 5.7: Parameter Values for Query Generation
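The sampling-based generator described above can be sketched as follows. In this sketch (names ours), `data` is a list of rows of attribute values, `sel` and `r` are taken as fractions, and the clipping of u and ui mirrors the text.

```python
import random

def generate_query(data, cardinalities, qdim, sel, r):
    """Build one sampled query: attribute intervals anchored at a random row's
    values (guaranteeing at least one match) plus a clipped row range."""
    n_rows, d = len(data), len(cardinalities)
    base = random.choice(data)                 # the sampled anchor row rj
    attrs = random.sample(range(d), qdim)      # qdim distinct attributes
    intervals = []
    for a in attrs:
        lo = base[a]                           # lower bound = anchor row's value
        hi = min(lo + int(sel * cardinalities[a]), cardinalities[a])
        intervals.append((a, lo, hi))
    l = random.randint(1, n_rows)              # row range in physical order
    u = min(l + int(r * n_rows), n_rows)
    return intervals, (l, u)
```

Anchoring each interval at a sampled row's own values is what guarantees the non-empty result set the text requires.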

5.5.4 Experimental Setup

To generate the approximate bitmaps we set α to be a power of 2 between 2 and 16, and k to be between 1 and 10. The results presented in the next section use the largest α for the given data set for which the AB size is smaller than or comparable to the WAH bitmap size.

Table 5.7 gives the list of parameters for query generation. We adjust the parameters to have 2-dimensional queries of 4 columns each, varying the number of rows from 100 to 10,000 (100, 500, 1K, 5K, 10K) for all data sets.

5.6 Experimental Results

In this section we measure the size, precision, and execution time of the Approximate Bitmaps. We compare size and execution time against the WAH-compressed bitmaps [77]. Results are given varying the appropriate parameters:

• The AB size, changing α and m.

• The number of hash functions k.

Figure 5.10: AB Size vs. Precision trade-off (m = log2 n). (a) Different hash functions; (b) Different number of hash functions.

• The number of rows queried.

The results presented in the following subsections are averages over 100 queries of the same type, using the independent hash functions instead of SHA-1. At the end of this section, the SHA-1 results are shown and discussed. We show that by setting the experimental parameters such that the AB size is less than or at least comparable to the WAH-compressed bitmaps, we obtain good precision (more than 90% in most cases) and execution times more than 1,000 times faster than WAH.

5.6.1 AB Size

Tables 5.4, 5.5, and 5.6 show the AB sizes in bytes for the cases when we construct one AB per data set, one AB per attribute, and one AB per column, respectively. The size of each AB is calculated based on the discussion in Section 5.4.2, i.e., finding the lowest power of 2 that is greater than or equal to sα (refer to Equation 5.1). For example, in Table 5.4 the value for the Landsat data for α = 4 is calculated as follows. The number of set bits (s) for Landsat is 16,527,900. The lowest power of 2 that is greater than or equal to sα is 67,108,864 bits, or 8,388,608 bytes. Note that this is also the size we obtain for the HEP data, since we are restricting ourselves to powers of 2. Similarly, in Table 5.5 the Landsat data has 60 ABs. In this case, the parameter s is equal to the number of rows, which is 275,465 (given in Table 5.3). Therefore, the lowest power of 2 we are seeking per AB (for α = 4) is 2,097,152 bits, or 262,144 bytes; for 60 ABs we have 15,728,640 bytes in total. The values in Table 5.6 are calculated similarly, with the only difference that the number of set bits s varies for each column. Recall that the number of set bits for each column is the number of rows that fall into that column; therefore the size is not the same for all ABs. Table 5.6 presents the average and total AB size.

• For the Uniform data, the best size is obtained when constructing one AB per column. The difference between the data set dimensionality (2) and the attribute cardinality (50) is large enough that the same precision (α) is achieved by constructing one AB per column as by constructing one AB for the whole data set; however, the former uses less space than the latter, i.e., 458,752 bytes in Table 5.6 vs. 524,288 bytes in Table 5.4.

• For the Landsat and HEP data sets, which have more attributes, constructing one AB for the whole data set requires less space than constructing one AB per column (refer to Tables 5.6 and 5.4).

Experiments show that an AB can always be constructed using less space than WAH. The following are the values of α for which AB compresses better than, or comparably to, the WAH-compressed bitmaps.

Figure 5.11: Precision as a function of α

• For the uniform data, when we construct one AB per column with α = 16, the total AB size is less than half of the total size required by the WAH-compressed bitmaps.

• For the HEP data, α = 4 produces ABs whose total size is about two thirds of the WAH-compressed bitmaps. For α = 8 the AB size is comparable to the WAH size, needing only one third more space, and is still smaller than the uncompressed bitmaps. Since the theoretical false positive rate for α = 4 is high (see Figure 5.9), results for the HEP data with α = 8 are also included in the following subsections.

• For the Landsat data, α = 8 produces ABs whose total size is about half the size of the WAH-compressed bitmaps.

5.6.2 Precision

Figure 5.10(a) supports our initial claim that the selection of the hash function has a significant impact on the accuracy of the results when only one hash function is utilized. The figure shows that the precision varies for the same m with different hash functions. As m increases, the precision also increases, since there are fewer collisions when a hash function can map elements to more values. For H1, the precision is 1 when there are enough bits to accommodate all rows. However, as we increase k, the false positive rate decreases and the impact of using one set of hash functions over another is no longer evident, as all of them perform similarly with only small variations. Figure 5.10(b) shows the precision as a function of the number of hash functions, i.e., k. As can be seen, the precision increases for larger k. However, after a certain k, the improvement is not very significant and the precision may even start to decrease. The optimal k can be computed using the theoretical foundation given in Section 5.4.

Figure 5.12: Precision as a function of the number of hash functions (k).

Figure 5.13: Precision as a function of the # of Rows.

Figure 5.11 shows the precision for all the data sets as α increases. Note that the precision increases steadily with α and is very close to 1 for larger α. Figure 5.12 shows the precision as a function of the number of hash functions (k) for a specific α for each data set. Up to the optimal k, the precision increases as k increases. After the optimum point, the precision starts decreasing, because a large number of hash functions produces more collisions.

Figure 5.13 shows that the precision is independent of the number of rows queried, remaining constant for each data set. The small variations in the precision are caused by having different columns queried in each case. For the uniform data and queries of 10K rows, on average, the number of tuples retrieved by each query is 59, and AB returns only 3 more tuples on average. For queries of 100 rows, the results are even more accurate: 100 different queries using WAH returned 170 tuples and the same queries using AB returned 179 tuples. For the Landsat data and queries of 10K rows, the number of tuples returned by each query is 723 and the number returned by AB is 821; for queries of 100 rows, the average matches per query are 8.98 for WAH and 9.85 for AB. For HEP and queries of 10K rows, WAH retrieved 3861 and AB retrieved 4039 tuples; for queries of 100 rows, the average number of results is 42 for WAH and 44 for AB.

Figure 5.14: Exec. Time as a function of α.

Figure 5.15: Exec. Time as a function of k.

5.6.3 Execution Time

Figure 5.14 shows the CPU clock time in milliseconds (msec) for different α. As α increases, the execution time decreases because the false positive rate gets smaller. Execution time behaves similarly for all the data sets, being proportional to the number of returned tuples that match the query.

Figure 5.15 shows the CPU clock time in msec as a function of the number of hash functions (k). As k increases, the execution time increases linearly: an iteration of each of the independent hash functions used takes roughly the same time, and the total time consumed by the k functions is their sum.

Figure 5.16: Exec. Time Performance: WAH vs. AB. (a) all data sets; (b) uniform, α = 16; (c) landsat, α = 8; (d) hep, α = 8.

Figure 5.16 depicts the CPU clock time when varying the number of rows queried. The reason the execution time for the HEP data is longer than for the other data sets, in Figure 5.16(a), is that the number of tuples matching the query for the HEP data is higher than for the others (as given at the end of the previous section). For WAH, only the time it takes to execute the query without any row filtering is measured; this is why the WAH execution time is constant for any number of rows. The final answer can be computed by performing an extra bitmap operation or by locating the relevant rows in the compressed bitmap. On the other hand, AB execution time is linear in the number of rows queried. In Figure 5.16(b), for a query targeting 10K rows of the uniform data, AB execution time is 76.09 msec, which is two thirds of the WAH execution time. In Figure 5.16(c), AB time is one fourth of WAH time for the same number of rows. In Figure 5.16(d), AB time is one tenth of WAH time for 10K rows. In addition, for lower numbers of rows, the AB execution time improvement over WAH is 2 orders of magnitude for the uniform and Landsat data sets, and 3 orders of magnitude for the HEP data set. Experiments show that, for all the data sets, executing a query that selects up to around 15% of the rows using AB is still faster than using the WAH-compressed bitmaps.

5.6.4 Single Hash Function vs. Independent Hash Functions

In terms of precision, the SHA-1 results are very similar to those obtained using the independent hash functions, as sketched in Figure 5.17. However, as the main purpose of SHA-1 is to be a secure hash function, its computation is expensive, and thus SHA-1 is slower than the other hash functions used in this work.

5.7 Conclusion

We addressed the problem of efficiently executing queries over compressed bitmaps with no decompression overhead. Our scheme generates an Approximate Bitmap (AB) that stores only the set bits of the exact bitmaps using multiple hash functions. The data is compressed and accessed directly through the hash function values. The scheme is shown to be especially effective when the query retrieves a subset of rows.

Figure 5.17: SHA-1 comparison with our other hash functions: Precision as a function of the number of hash functions, k.

The Approximate Bitmap (AB) encoding can be applied at three different levels: one AB per data set, one AB per attribute, and one AB per column. The precision of the AB is expressed in terms of α; for the same α, the precision is the same at all levels. The size of the AB depends on the number of set bits and α. Skewed distributions in the bitmaps benefit from constructing one AB per column. For high dimensional databases and high cardinality attributes, one AB per data set offers the best size. In all cases, allocating more bits to the AB increases the precision of the results, since the number of collisions is smaller in a larger AB. In general, precision increases with the number of hash functions utilized; however, after an optimal point the precision starts to degrade. That point can be derived using the theoretical analysis provided in this chapter. Another advantage of AB is that its precision is not affected by the number of rows that a query asks for.

We compared our approach with WAH, the state of the art in bitmap compression for fast querying. The size parameter of AB, for the experiments, was chosen such that the total space used by our technique remains less than or comparable to the space used by WAH. AB achieves accurate results (90%-100%) and improves the speed of query processing by up to 3 orders of magnitude compared to WAH. The performance can be further improved by incorporating hardware support for hashing, such as for SHA-1.

CHAPTER 6

Similarity Search over Bitmap Indexes

6.1 Introduction

Bitmap indexes are used in the data warehousing and scientific database domains to efficiently query large-scale data repositories. Typically, bitmap indexes encode both categorical and continuous data. For categorical data with a relatively small number of categories, or when categories are queried independently, one bitmap is built for each category. For continuous attributes, or attributes with a high number of distinct values, binning is used to combine a range or set of values together and one bitmap is built for each bin. Queries over the bitmap index are executed using fast logical bitwise operations supported by hardware. Bitmap indexes constitute a very effective index structure for multi-dimensional data, as an index over a single dimension or a set of dimensions is independent of all other dimensions, yet can be efficiently combined with other indexes to resolve queries.

Bitmaps have successfully been employed in both research and industrial database management systems. They have been implemented in major commercial products such as Oracle [5], Informix [44], and Sybase [67]. They have also been utilized by many applications including scientific data management and visualization [75, 64, 65].

The reasons that bitmaps have made such a significant impact are twofold. First, a single bitmap represents a single logical function which clearly denotes, without ambiguity, those objects that match the function. A minimal set of bitmaps that compose a query function can be retrieved and executed on an as-needed basis. This logical independence avoids the performance breakdown of multi-dimensional index structures. Secondly, the representation and combination of bitmaps are naturally compatible with current computer architectures, as opposed to the expensive set operations required to combine result sets derived from other access structures.

Although bitmap indexes allow for efficient resolution of selection or aggregation queries, their use has been largely limited to these query types. The utility and applicability of bitmap indexes could be greatly enhanced by providing support for more complex types of queries, such as similarity searches. Similarity searches are the core of multimedia retrieval, information retrieval and several data mining applications. Point and range queries are insufficient for data exploration and data mining tasks. Due to the sparseness of data in high dimensions, a point or range query may not return any results. Similarity queries can direct data exploration by finding non-empty results.

Typically, in similarity searches, objects are represented using suitable features that capture the most important characteristics of the objects. As objects become more complex, the number of features required to represent the objects well is on the order of tens to hundreds. Once the objects are represented in the vector space, the similarity search is reduced to computing the distance between query and object vectors and retrieving the closest objects.

This kind of distance function, however, is not well-suited for categorical or high dimensional data. In the case of categorical data, if two points do not fall into the same category as the query, it is not meaningful to penalize/reward them based on how close the category id (number) is to the category queried. With respect to high dimensional data, even when the data is continuous, high dimensional spaces are very sparse and the distances between the different points become relatively the same [9]. This makes the nearest neighbor problem ill-defined for Lp-norms. Therefore it is essential to choose measures which lead to greater contrast between the different points. An understanding of the meaningfulness of the nearest neighbor is critical in developing distance measures which use only a small fraction of the least noisy information available in high dimensional data in order to measure similarity [3]. Several previous studies have used fractional or localized distances, i.e. distance functions that only involve the dimensions close to the query, which results in a more meaningful comparison and improves performance by reducing the noise from the distant dimensions [3, 26, 14, 68, 25].

In this chapter, we propose to perform similarity searches over bitmap indexes using a localized similarity function. Our similarity function produces meaningful results for high dimensional and categorical data, and the similarity computation is performed exclusively using bitwise operations. The proposed approach, called

BitmapNN, can be applied over existing bitmap indexes without any changes to the index structure and without requiring decompression of compressed bitmap structures. In addition, we describe complex bitmap operations, such as column weighting and query widening. These operations are used as building blocks to construct enhanced query flexibility and functionality, including relevance feedback and defining similarity over tuples with missing data.

We conduct performance comparisons against several techniques that operate over either the original data or over an index built on the original data, including the recent FKNM [68], R-Trees, and sequential scan. To measure the benefit of using the efficient bitwise operators, we compare to a VA-file structure modified to use the same scoring function and to yield identical results as our proposed technique. BitmapNN maintains accuracy of the results and significantly outperforms other approaches in terms of query execution time.

The primary contributions of this chapter can be summarized as follows:

• We enable similarity searches over existing bitmap indexes by using a localized similarity function computed using bitwise operations over bitmaps. This search yields meaningful results in high dimensional spaces and for categorical data. The proposed technique operates solely over the index and does not retrieve data objects from disk, avoiding expensive I/O costs.

• We provide a formal description of the bitmap operations required to perform the similarity queries.

• We introduce additional bitmap operations that are used to build complex query functionality. This extended functionality includes user relevance feedback, similarity searches over incomplete data, and weighted similarity searches.

• We compare performance against existing techniques. The gains over existing techniques are significant both in terms of accuracy and efficiency.

              Raw Data          Projection Index             Bit-Sliced Index (BSI)
          Attrib 1  Attrib 2   Attrib 1    Attrib 2        Attrib 1     Attrib 2
  Tuple                        b1 b2 b3    b1 b2 b3        b1 b0        b1 b0
  t1         1         3       1  0  0     0  0  1         0  1         1  1
  t2         2         1       0  1  0     1  0  0         1  0         0  1
  t3         1         1       1  0  0     1  0  0         0  1         0  1
  t4         3         3       0  0  1     0  0  1         1  1         1  1
  t5         2         2       0  1  0     0  1  0         1  0         1  0
  t6         3         1       0  0  1     1  0  0         1  1         0  1

Figure 6.1: Simple bitmap example for a table with two attributes and three bins per attribute.

6.2 Background

In this section we provide background and preliminaries on bitmap indexes, cur- rent similarity/distance functions used for high dimensional data spaces, and data partition strategies.

6.2.1 Bit-sliced Indexes

Bit-sliced indexes (BSI) [45, 53] can be considered a special case of encoded bitmaps [80]. With the bit-sliced index, the bitmaps encode the binary representation of the attribute value. Therefore, only ⌈log2 bins⌉ bitmaps are needed to represent all values. Figure 6.1 shows the projection index and the bit-sliced index for a table with two attributes, each partitioned into three bins. The first tuple t1 falls into the first bin in attribute 1, and the third bin in attribute 2. Since each attribute has three possible values, the number of bits in the BSI is 2. The first tuple t1 has value 1 for attribute 1, therefore only the bit-slice corresponding to the least significant bit, b0, is set. For attribute 2, since the value is 3, the bit is set in both bit-slices. BSI arithmetic for a number of operations is defined in [53]. We adapt two of these bit-slice operations as components of the BitmapNN process: one operation to sum two bit-slices into a new bit-slice, and another to generate a new bitmap indicating the objects with the largest k scores.
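The BSI construction just described can be sketched in a few lines of Python (an illustrative sketch, not code from the dissertation; bitmaps are plain integers and bit t of each slice corresponds to tuple t):

```python
# Sketch: building a bit-sliced index (BSI) with Python integers as bitmaps.

def build_bsi(values, num_bins):
    """Return the list of bit-slices [b0, b1, ...] (least significant first)
    encoding each value in binary."""
    num_slices = max(1, (num_bins - 1).bit_length())  # ceil(log2(num_bins)) slices
    slices = [0] * num_slices
    for t, v in enumerate(values):
        for j in range(num_slices):
            if (v >> j) & 1:
                slices[j] |= 1 << t          # set bit t in slice j
    return slices

# Attribute 1 of the running example (Figure 6.1): values for tuples t1..t6
attr1 = [1, 2, 1, 3, 2, 3]
b0, b1 = build_bsi(attr1, num_bins=3)
# Reading back bit t of each slice recovers the original value:
decoded = [((b1 >> t) & 1) * 2 + ((b0 >> t) & 1) for t in range(6)]
assert decoded == attr1
```

Only ⌈log2 3⌉ = 2 slices are needed here, matching the figure.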

6.2.2 Similarity Functions for High Dimensional Spaces

Let us consider a high dimensional dataset with d dimensions. Each data object is represented by a set of coordinates, i.e. one value for each dimension i. Similarity search algorithms compute a distance function between a query object Q and all other data objects and retrieve the k objects with smallest distance to the query object as the answer to the query. Metric distance functions, such as the Lp norms (e.g.

Manhattan and Euclidean distance), fail to capture accurate similarity in high dimensional datasets because a large difference in one or a few dimensions would dominate the distance function.

As described in [25], for high-dimensional applications where human cognition is the target judge of object similarity, it is more important to closely match a subset of attributes rather than provide some least total distance measurement over all the attributes. With this motivation, there have been several proposed distance functions to capture cumulative object attribute similarity rather than total object dissimilarity.

One such function is the localized distance function [3], which is described as:

PIDist(X, Y, k_d) = \left[ \sum_{i \in S[X,Y,k_d]} \left( 1 - \frac{|x_i - y_i|}{m_i - n_i} \right)^p \right]^{1/p}

where k_d is the number of ranges for each dimension, S[X,Y,k_d] is the set of dimensions for which the two objects lie in the same range, and m_i and n_i are the upper and lower bounds of the corresponding range in dimension i.
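As a concrete reading of this formula, the following Python sketch (hypothetical helper names; `ranges` maps each dimension to its list of (lower, upper) bin boundaries) accumulates proximity only over dimensions where the two objects share a bin:

```python
# Illustrative sketch of the localized PIDist function described above.

def pidist(x, y, ranges, p=2):
    """Accumulate proximity only over dimensions where x and y share a bin."""
    total = 0.0
    for i, bins in enumerate(ranges):
        for (n_i, m_i) in bins:                         # n_i lower, m_i upper bound
            if n_i <= x[i] < m_i and n_i <= y[i] < m_i:  # same range in dimension i
                total += (1.0 - abs(x[i] - y[i]) / (m_i - n_i)) ** p
                break
    return total ** (1.0 / p)

bins = [[(0, 5), (5, 10)]] * 2        # two dimensions, two equal-width bins each
print(pidist([1.0, 6.0], [2.0, 1.0], bins))   # only dimension 0 shares a bin
```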

This function accumulates benefit for each attribute for which a data object maps to the same quantization as the query object. It does not differentiate between data and query objects that do not map to the same quantization. Therefore, a data point is not excessively penalized for a few dissimilar attributes. Note that this function is not a distance metric, as neither the identity nor the triangle inequality properties hold.

The Frequent K-N Match (FKNM) [68] is another method capturing cumulative

object attribute similarity rather than total object dissimilarity. Frequent K-N Match

uses the distance function proposed in [25], called Dynamic Partial distance Function

(DPF), which can be defined as follows:

d_{DPF}(X, Y) = \left( \sum_{\delta_i \in \Delta_N} \delta_i^p \right)^{1/p}

where \delta_i = |x_i - y_i| and \Delta_N is the set of the N smallest distances among all \delta_i. For this distance function, determining the best value of N is a challenge. The Frequent K-N

Match approach proposes to vary N over a range (between 1 and d) and take the most frequent matches as the final answer.
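A minimal sketch of DPF itself, assuming plain Python lists for the vectors:

```python
# Sketch of the Dynamic Partial distance Function (DPF): only the n smallest
# per-dimension differences contribute to the distance.

def dpf(x, y, n, p=2):
    deltas = sorted(abs(a - b) for a, b in zip(x, y))
    return sum(d ** p for d in deltas[:n]) ** (1.0 / p)

x, y = [0.0, 0.0, 0.0], [3.0, 4.0, 100.0]
print(dpf(x, y, n=2))   # ignores the outlying third dimension -> 5.0
```

Frequent K-N Match then varies n and keeps the objects that match most often.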

However, this method exhibits two somewhat undesirable properties as the value of k increases: the set of nearest neighbors retrieved for k = k0 is not guaranteed to be retrieved for k > k0, and the ranks of the nearest neighbor objects are not preserved. To empirically demonstrate this, we ran the frequent k-N-match algorithm for k = 10 and k = 20 over three datasets from the UCI machine learning repository [1]. These datasets were also used by [68] and are commonly used in classification research. The data includes the Wisconsin Diagnostic Breast Cancer (wdbc) data set, the image segmentation data set, and the ionosphere data set. The wdbc data contains 569 30-dimensional points with 2 classes. The segmentation data has 210 19-dimensional points with 7 categories. The ionosphere data contains 351 34-dimensional points with 2 classes. For the wdbc data, in 151 out of 569 of the results, the 20-NN results were not a superset of the 10-NN results (i.e. an object identified as a 10-NN was not identified as a 20-NN). For the segmentation data set this occurs for 32 out of 210 results, and for ionosphere for 25 out of 351 results.

6.2.3 Partition Strategies

The following is not an exhaustive list of the partition (binning) strategies proposed in the literature, but a short sample of the kinds of partition strategies that are commonly used in bitmap index creation and/or are suitable as underlying partitions to better reflect similarity between objects:

• Equi-Width. This is the simplest partition strategy in which the attribute

domain is split into ranges of equal size. The drawback of this partition is that

it does not account for data distribution. The number of data objects among

bins could be very skewed and some ranges may be empty.

• Equi-Populated. This strategy balances the number of data objects in each

partition by creating ranges with roughly the same number of data objects. This

technique is commonly used because it creates a nicely uniform distribution of

the data objects over the bins.

• Lloyd’s clustering. Ranges are determined using Lloyd’s algorithm [37]. The

objective this technique tries to achieve is to minimize the sum of the square

of the error between the given numbers and the mean of the interval they fall

into. This can be formulated for k intervals, I_1 to I_k, as:

\sum_{i=1}^{k} \sum_{o \in I_i} (o - mean(I_i))^2

where mean(I_i) is the average of the numbers that fall into interval I_i.

This powerful technique is an iterative process which quickly converges, at least to a local minimum. In order to ensure that it is not trapped in a poor local minimum, however, one must initialize it carefully. We adopt a splitting approach which starts with a 0-split quantizer and gradually increases the number of splits used in the quantization. The advantage of the splitting approach is that it does not suffer from poor initialization, as the optimal 0-split quantizer simply maps the whole real line to the mean of the distribution.

• Class-Attribute Interdependence Maximization (CAIM). CAIM is a supervised discretization algorithm proposed in [34]. This algorithm generates discretization schemes with the highest interdependence between the class attribute and the discrete intervals. It has been shown to improve the accuracy of the subsequently used machine learning algorithms. It

is very fast and the number of intervals it generates is usually less than the

number of partitions generated by other supervised algorithms.

• User defined. Users can define meaningful ranges over the data to classify

objects as similar or not. This would produce more meaningful results according

to user criteria and domain specific knowledge.

• Query driven. This strategy uses the values queried to decide on the bin

boundaries.
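To make the Lloyd's-clustering strategy above concrete, a one-dimensional quantizer can be sketched as follows (an illustrative sketch with assumed details: equi-width initialization and midpoint-of-means boundary updates, rather than the splitting initialization used in the dissertation):

```python
# Sketch of 1-D Lloyd binning: alternate between assigning values to intervals
# and moving each boundary to the midpoint of adjacent interval means.

def lloyd_bins(values, k, iters=50):
    values = sorted(values)
    lo, hi = values[0], values[-1]
    cuts = [lo + (hi - lo) * j / k for j in range(1, k)]  # equi-width start
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            i = sum(v >= c for c in cuts)        # interval index of v
            groups[i].append(v)
        means = [sum(g) / len(g) if g else None for g in groups]
        new_cuts = []
        for a, b in zip(means, means[1:]):
            if a is not None and b is not None:
                new_cuts.append((a + b) / 2)     # boundary between interval means
        if new_cuts == cuts or len(new_cuts) < k - 1:
            break
        cuts = new_cuts
    return cuts

print(lloyd_bins([1, 2, 3, 10, 11, 12], k=2))   # -> [6.5]
```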

6.3 Proposed Approach

In this section we first describe the object ranking algorithm and similarity function used for BitmapNN. We then present the query execution algorithm and introduce some additional BitmapNN operations used as building blocks to enhance the functionality of BitmapNN, including relevance feedback and incomplete data support.

The cost of executing queries using the proposed approach is also analyzed.

6.3.1 Object Ranking

We use a ranking algorithm to determine the top-k objects associated with a query. First, a metric distance function is computed over the data objects that allows for efficient pruning of non-relevant results. Then a query cognizant ranking of those objects that lie on the border of relevance is performed.

Candidate Selection The similarity metric used by BitmapNN during initial object ranking is based on Hamming distance. The Hamming distance between two objects X and Y can be defined as:

Hamm(X, Y) = \sum_{i=1}^{d} \begin{cases} 0 & \text{if } x_i \text{ and } y_i \text{ lie in the same bin} \\ 1 & \text{otherwise} \end{cases}

where d is the dimensionality of the dataset.
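This quantized Hamming distance, and the equivalent similarity score d − Hamm, can be sketched directly (illustrative names; bins are small integers):

```python
# Sketch: Hamming distance over bin-quantized objects, and the equivalent
# "number of matching bins" similarity score used by BitmapNN.

def hamm(x_bins, y_bins):
    return sum(1 for a, b in zip(x_bins, y_bins) if a != b)

def similarity(x_bins, y_bins):
    return len(x_bins) - hamm(x_bins, y_bins)   # number of matching bins

q  = [2, 2, 2, 1]     # a quantized query over four dimensions
o4 = [1, 2, 1, 1]
print(hamm(q, o4), similarity(q, o4))   # 2 2
```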

The range of the Hamming distance is the set of integers between 0 and d. Hamming distance is a metric distance function over the bitmap-quantized data.

Since the distance function is a metric, we can use the computed distances to prune the result set during query execution. Any data object that yields a distance less than that of the kth best object will clearly be in the top-k result set. Any data object that yields a distance greater than that of the kth best object will clearly be outside the top-k result set. Only the set of objects that yield the same distance as the kth best object, such that their inclusion in the set of results would make the cardinality of the result set greater than k, are ambiguous. We perform a second distance computation over just the ambiguous results.

Note that in our actual implementation we rank objects by maximizing the similarity score (d − Hamm(X, Y)) rather than minimizing the distance. These operations yield identical results.

Query Result Refinement We want to efficiently rank the ambiguous results in a non-arbitrary manner. Since the set of objects under analysis is likely a small subset of the entire data set, we gain minimal pruning advantage from making the second distance function a metric. The goal of the second computation is to meaningfully rank the objects under analysis based on their relevance with respect to the query.

The distance function used to evaluate the set of objects during result refinement is:

D_{BNN}(X, Y) = \sum_{i=1}^{d} \begin{cases} 0 & \text{if } x_i = y_i \\ 1 + \frac{1}{2^{f(i)}} & \text{otherwise} \end{cases}

where d is the dimensionality of the dataset and f(i) is a function that retrieves the ordinal position of the query-matching bitcolumns sorted by some criterion, such as reverse order of popularity. If f(i) is query invariant, then this function is a metric, but cannot adjust relevance based on query characteristics. If f(i) depends on the query, then the function is not a metric, but allows object relevance to be related to query characteristics. Since the primary goal of the query result refinement is to differentiate a small set of objects in a meaningful way, we use a query dependent function. In contrast, PIDist would be equivalent to the

BitmapNN(Q, B, k)

1: if (k > COUNT(EBM) or k < 0)
2:   Error("k is invalid")
3: S = B1,Q[1]
4: for (i = 2; i ≤ d; i++)
5:   S = SUM_BSI(S, Bi,Q[i])
6: G = ∅
7: E = EBM
8: for (i = s − 1; i ≥ 0; i−−)
9:   X = G OR (E AND Si)
10:  if ((n = COUNT(X)) > k)
11:    E = E AND Si
12:  else if (n ...

Algorithm 2: Query execution using BitmapNN. Q is the quantized query and B is the set of all bitmaps. k is the desired number of results.
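The candidate-selection loop (lines 6 onward) can be sketched as follows; this is a plausible reconstruction rather than the dissertation's exact code, with bitmaps represented as Python integers where bit r stands for row r:

```python
# Sketch of BitmapNN candidate selection over the score BSI S.

def top_k_candidates(score_slices, ebm, k):
    """score_slices: slices of the score BSI S, most significant first.
    Returns (G, E): G marks objects scoring strictly above the k-th best
    score, E marks objects tied with the k-th best score."""
    G, E = 0, ebm
    for s_i in score_slices:
        X = G | (E & s_i)
        if bin(X).count("1") >= k:
            E = E & s_i        # the k-th best score must have this bit set
        else:
            G = X              # everything in X is certainly in the top k
            E = E & ~s_i       # remaining ties must have this bit unset
    return G, E

# Four objects with scores 4, 3, 2, 2 (slices S2, S1, S0; bit r = object r):
G, E = top_k_candidates([0b0001, 0b1110, 0b0010], ebm=0b1111, k=3)
print(bin(G), bin(E))   # objects 0 and 1 are certain; objects 2 and 3 are tied
```

The result refinement stage then picks the remaining k − |G| objects out of E.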

Hamming distance for categorical or quantized data. Therefore, it would not be able to meaningfully discriminate among ties.

6.3.2 Similarity Function

If bitmapped objects are distinguishable from each other, then the set of top-k objects determined using the object ranking algorithm is equivalent to the top-k

SUM_BSI(A, B)

1: S0 = A0 XOR B0    // S0 set iff only one bit of A0 or B0 is set
2: C = A0 AND B0     // C is the "Carry" bit-slice
3: for (i = 1; i < s; i++)
4:   Si = Ai XOR C
5:   C = Ai AND C
6: if (COUNT(C) > 0)
7:   Ss = C          // Put Carry into last bit-slice of S

Algorithm 3: Addition of a 1-slice BSI to another BSI. Given two BSIs, A = As−1...A0 and B = B0, a new sum BSI, S = A + B, is constructed using the preceding pseudo-code.
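SUM_BSI translates naturally to Python integers used as bitmaps (a sketch, not the dissertation's implementation; slices are listed least significant first):

```python
# Sketch: add a 1-slice BSI b to an s-slice BSI A (bit r of each int = row r).

def sum_bsi(a_slices, b):
    """a_slices = [A0, A1, ..., A_{s-1}] (least significant first)."""
    s = list(a_slices)
    s[0], carry = a_slices[0] ^ b, a_slices[0] & b   # XOR = sum bit, AND = carry
    for i in range(1, len(a_slices)):
        s[i], carry = a_slices[i] ^ carry, a_slices[i] & carry
    if carry:
        s.append(carry)            # put the final carry into a new slice
    return s

# Rows 0..3 hold values 1, 0, 3, 2; b adds 1 to rows 0 and 2:
A = [0b0101, 0b1100]               # A0, A1
print(sum_bsi(A, 0b0101))          # row values become 2, 0, 4, 2
```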

objects that would be returned by computing the following similarity function over

all the data objects, and returning the objects with the highest k scores:

Sim_{BNN}(X, Y) = \sum_{i=1}^{d} \begin{cases} 1 + \frac{1}{2^{g(i)}} & \text{if } x_i = y_i \\ 0 & \text{otherwise} \end{cases}

where d is the dimensionality of the dataset, and g(i) is a function that retrieves the ordinal position of column i in the reverse of the order used by f(i) in the distance function computation.

They are equivalent because the second term of the similarity score can never sum to more than 1 over all the attributes, and therefore an object with fewer matches can never overtake the score of an object with a greater number of matches. Maximizing the second term of the similarity score also maintains, among objects that have the same number of matches as the kth best object, the same order as minimizing the second term of the distance function.

The intuition behind this function is that two data objects are close in one dimension if and only if they fall into the same partition. For categorical data, objects are penalized equally if they do not fall into the bin queried. As well, those objects that fall on the border of relevance are ordered based on characteristics of the query.

While the simplicity and structure of this similarity function make it ideally suited for efficient computation using our object ranking algorithm over bitmap indexes, this measure of similarity also produces meaningful results in high dimensional spaces, as is evidenced by our experimental results over labeled data. We measure accuracy performance for a set of k-NN classification tasks and compare results against other previously proposed similarity techniques using the class stripping technique [3, 68]. This technique considers a retrieved object to be correct, i.e. similar to the query, if both objects have the same class label. Results are reported in Section 6.4.

In the next section, the BitmapNN query execution algorithm is described.

6.3.3 BitmapNN Query Execution

Let us denote by B_{i,j} the equality encoded bitmap corresponding to the jth partition for dimension i, and by B_j the bitmap corresponding to the jth bit in a bit-sliced index (BSI). A BSI A = A_{s−1}...A_0 has s slices and can represent values from 0 to 2^s − 1. B_{i,j} can be thought of as a 1-slice BSI encoding only two values, 0 and 1.

In order to compute the similarity function we need to add the dimensions where the query object and the data object fall into the same category, i.e. the corresponding bitmap for the query object has a 1 in the bit corresponding to the given data object. A bitmap then indicates which objects fall into the same bin as the query object. The objects that fall into a greater number of bins are considered more similar to the query than other objects. The maximum similarity between a data object and a query is d, i.e. the object falls into the same bins as the query in all dimensions. On the contrary, if an object does not fall into any of the partitions where the query falls, then the similarity between this object and the query is zero.

Given a query Q, the query vector is first quantized and the bitmaps that correspond to the quantized query are determined; Q[i] denotes the bin in which the query falls for dimension i.

The BitmapNN query execution algorithm pseudo-code is given in Algorithm 2.

We identify three main parts in our algorithm:

• Compute the score for all data objects given query Q (Alg. 2, lines 3-5). BSI S holds the sum of all the dimensions where each data object and the query are similar. Initially, S has 1 bit-slice, which corresponds to the bitmap B_{1,Q[1]} (Alg. 2, line 3). Then the bitmaps for the other quantized dimensions (Alg. 2, lines 4-5) are added using the method presented in Algorithm 3. This method was specifically optimized for BitmapNN, where one of the BSIs has only 1 slice, in order to achieve better performance. Note that S has at most ⌈log2(d + 1)⌉ slices, as the similarity is at most d.

• Obtain two bit vectors G and E (Alg. 2, lines 6-17). G has set bits for the k′ ≤ k objects whose score is greater than the kth best value. E is a bit vector that identifies those objects that have the same score as the kth best value.

• The result set refinement (Alg. 2, lines 20-24) is performed over the equally scored objects (bitmap E). The bitmaps queried are sorted by their frequency, i.e. the number of 1s in the bitmap; a bitstring is generated by concatenating query results for each object in E based on the sorted attributes (this forms BSI T); and the TopK algorithm is performed over the bitstrings to select the remaining objects needed to reach k.

EBM is an Existence Bitmap that ensures that only existing rows are returned as results. In our experiments EBM is equal to an all-1s bitmap. The set bits in the final bitmap F indicate the answer set.

As an illustrative example, consider a table with four dimensions, two bins per dimension, and 10 objects. The bin value for each attribute is presented in columns

2-5 of Figure 6.2. We will show how BitmapNN finds the top 3 most similar objects to query object Q = o3 in Figure 6.2. The bitmaps into which the query falls are {b1,1, b2,1, b3,1, b4,2}. Now, these bitmaps are added to obtain the score of the similarity function. The maximum similarity is d, which in this example is 4, so at most ⌈log2(4 + 1)⌉ = 3 bit-slices are needed to represent the scores. To obtain

these 3 bit slices, we follow the steps of Algorithm 3. Addition of the bitmaps is

done using XOR and AND bitwise operations. Columns 6-8 of Figure 6.2 show the

BSI corresponding to the addition of the first two, three, and then four dimensions.

The addition is performed over the current BSI S and the next bitmap corresponding

to the quantized dimension of the query. The bitwise operations used to generate

each bit slice in the BSI S are detailed at the bottom of the figure. For example, to

add the first two dimensions, bitmaps b1,1 and b2,1 are added by performing two bitwise operations. The addition of these two bitmaps would produce two slices (values

between 0 and 2). The least significant slice in S is found by XORing b1,1 and b2,1

(the least significant bit in the sum is 0 if both bits have the same value), while the

most significant slice (the carry bit) is obtained by ANDing b1,1 and b2,1 together (the

second bit in the result is only 1 if both operands were 1). The final BSI (d1 + d2 + d3 + d4) corresponds to the similarity score, i.e. the number of dimensions where the query and the data object fall into the same bin. Next, we construct bitmaps G and E.

All the set bits in the bitmap E correspond to data objects with the same similarity as the kth best object; these are o4 and o5. The function g(i) orders the query-matching bitmaps according to frequency. In this example, o4 yields a bitstring of '1010' and o5 yields '0110', so o5 is dropped from the results. The output of the algorithm is the last column in Figure 6.2.

       Attrib       Matches         X=d1+d2   Y=X+d3   Z=Y+d4
 Id   1 2 3 4    Q1 Q2 Q3 Q4       (1) (2)   (3) (4)  (5) (6) (7)    S    G E F
 o1   1 1 1 1    0  0  0  1         0   0     0   0    0   0   1     1    0 0 0
 o2   1 1 1 1    0  0  0  1         0   0     0   0    0   0   1     1    0 0 0
 o3   2 2 2 1    1  1  1  1         1   0     1   1    1   0   0     4    1 0 1
 o4   1 2 1 1    0  1  0  1         0   1     0   1    0   1   0     2    0 1 1
 o5   2 2 1 2    1  1  0  0         1   0     1   0    0   1   0     2    0 1 0
 o6   1 1 1 1    0  0  0  1         0   0     0   0    0   0   1     1    0 0 0
 o7   1 1 1 1    0  0  0  1         0   0     0   0    0   0   1     1    0 0 0
 o8   1 1 1 1    0  0  0  1         0   0     0   0    0   0   1     1    0 0 0
 o9   2 1 2 1    1  0  1  1         0   1     1   0    0   1   1     3    1 0 1
 o10  1 1 1 2    0  0  0  0         0   0     0   0    0   0   0     0    0 0 0

 Bitwise operations: (1) X1 = Q1 AND Q2; (2) X0 = Q1 XOR Q2;
 (3) Y1 = X1 XOR (X0 AND Q3); (4) Y0 = X0 XOR Q3;
 (5) Z2 = Y1 AND (Y0 AND Q4); (6) Z1 = Y1 XOR (Y0 AND Q4); (7) Z0 = Y0 XOR Q4

Figure 6.2: Query Example using BitmapNN.

6.3.4 Additional BitmapNN Operations

In this section, we define additional BitmapNN operations so that domain knowledge can be incorporated to generate more refined similarity results. For example, depending on the nature of the data it may or may not be appropriate to consider data in bins adjacent to the query bin as a 'near' match. However, depending on the

data/domain it may be desirable to weight the adjacent bins lower than the queried

bin or weight some attributes higher than others if it is known that they contribute

more to the user defined similarity.

We refer to these operations as Column Weighting and Query Widening. When performed during the result refinement stage of BitmapNN, the EBM is set to be the candidate bitmap (E, for example) restricting the data objects considered in the analysis and considerably reducing the cost of these operations.

        Attrib:  1  2  3
 row 1:          0  1  0
 row 2:          0  0  1
 row 3:          1  1  1

Figure 6.3: Sample weight matrix used in query rewriting to perform column weighting.

Column Weighting

The original scoring function assumes no domain knowledge and that all attributes

contribute equally in defining similarity of a data object to a query object. However,

attributes may vary in terms of importance with respect to indicating similarity. The

column weighting operation allows different weights to be assigned to attributes and

enables a richer similarity model.

The similarity score for an attribute with column weighting is defined as:

Sim_i(x_i, y_i) = \begin{cases} w_i & \text{if } x_i \text{ and } y_i \text{ lie in the same bin} \\ 0 & \text{otherwise} \end{cases}    (6.1)

where wi is a non-negative integer value.

The overall object similarity remains as the sum of similarity over the attributes.

Consider W = {w_1, w_2, ..., w_d} the set of user-defined weights on a query Q. The original BitmapNN corresponds to a weight vector of all 1s. A weight w_i = 0 indicates that attribute i does not contribute to the similarity score computation. Column weighting is achieved by rewriting Q into at most ⌈log2(max(w_i) + 1)⌉ sub-queries, where

max(w_i) refers to the maximum weight in W. Each sub-query is defined as a row in a boolean weight matrix W′ with ⌈log2(max(w_i) + 1)⌉ rows and d columns, defined as follows:

w′_{i,j} = \begin{cases} 1 & \text{if the } i\text{th bit in the binary representation of weight } w_j \text{ is set} \\ 0 & \text{otherwise} \end{cases}    (6.2)

Figure 6.3 shows a sample weight matrix for a query Q with weight vector W = {1, 5, 3}. We use each row in W′ as the weight vector for query Q and compute the similarity score using SUM_BSI over the dimensions with weight 1, ignoring the dimensions with weight 0. To combine the scores for each sub-query Q_i, the corresponding BSI is shifted by appending (i − 1) all-zeros bitmaps as the least significant bits of the BSI. Finally, the BSIs are added together and BitmapNN proceeds to compute the candidate set (bitmaps G and E) over the combined BSI.

Note that when the weights are powers of two, the sub-queries are executed over distinct subsets of the attributes, and the procedure is therefore as efficient as the original BitmapNN.
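The sub-query rewriting can be illustrated with the following sketch (assumed helper names; for clarity it computes per-row integer scores directly rather than shifting and adding BSIs, which yields the same totals):

```python
# Sketch of column weighting via query rewriting: sub-query j uses the
# dimensions whose weight has bit j set, and its per-row match count is
# shifted left by j (equivalent to appending j zero slices to its BSI).

def weighted_scores(match_bitmaps, weights, n_rows):
    """Return per-row weighted similarity scores as plain integers."""
    scores = [0] * n_rows
    max_bits = max(w.bit_length() for w in weights)
    for j in range(max_bits):                       # one sub-query per bit level
        sub_dims = [b for b, w in zip(match_bitmaps, weights) if (w >> j) & 1]
        for r in range(n_rows):
            matches = sum((b >> r) & 1 for b in sub_dims)
            scores[r] += matches << j
    return scores

# Three dimensions with weights W = {1, 5, 3}; two rows match different dims:
bitmaps = [0b01, 0b10, 0b11]   # dim 0 matches row 0; dim 1 row 1; dim 2 both
print(weighted_scores(bitmaps, [1, 5, 3], n_rows=2))   # [4, 8]
```

Row 0 matches dimensions 0 and 2 (weight 1 + 3 = 4); row 1 matches dimensions 1 and 2 (weight 5 + 3 = 8).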

Query Widening

The basic similarity search algorithm disregards data that does not fall in the same bin as defined by the query. However, data in bins relevant or adjacent to the query may be important in refining a similarity search. Query widening allows potentially relevant objects to be retrieved and allows for greater similarity search functionality.

With query widening, the original query is modified to incorporate potentially relevant information by ORing other bitmaps for the given attribute. The revised query is re-executed over the entire bitmap index or over the subset of data objects currently under analysis. Different behaviors can be derived by performing query widening before or after a candidate set has been established.

In order to enable advanced functionality, for each attribute we can store two

flags that indicate whether query widening is allowed for that attribute and whether a bitmap is stored for missing data values. For example, in the case of continuous attributes that have been quantized using equi-populated partitions, we can widen the attribute by ORing the two bitmaps that correspond to the bins adjacent to the query bin. In the case of missing data, we can OR the query bin with the missing data bin.

Query widening can also be combined with column weighting to assign ‘partial credit’ to near matches for data where such scoring is appropriate. For example, a weight of 1 can be assigned to an adjacent bin match while some greater weight can be applied to a query bin match.
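The widening operation itself can be sketched as follows (illustrative names; one attribute, with its per-bin bitmaps held as Python integers):

```python
# Sketch of query widening: OR the query bin's bitmap with its adjacent bins'
# bitmaps, and optionally with the attribute's missing-value bitmap.

def widen(bin_bitmaps, query_bin, missing_bitmap=None):
    """bin_bitmaps: one bitmap per bin of an attribute; returns widened bitmap."""
    widened = bin_bitmaps[query_bin]
    if query_bin > 0:
        widened |= bin_bitmaps[query_bin - 1]       # lower adjacent bin
    if query_bin < len(bin_bitmaps) - 1:
        widened |= bin_bitmaps[query_bin + 1]       # upper adjacent bin
    if missing_bitmap is not None:
        widened |= missing_bitmap                   # rows with missing values
    return widened

bins = [0b0001, 0b0110, 0b1000]    # three bins over four rows
print(bin(widen(bins, query_bin=0)))   # rows of bins 0 and 1
```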

6.3.5 Enhanced BitmapNN Functionality

In this section we describe how column weighting and query widening, together with the baseline BitmapNN algorithm, can be combined to offer a powerful model for similarity query resolution. In particular, we define approaches for supporting user relevance feedback and meaningful similarity searches in the presence of missing data.

Relevance Feedback

Relevance feedback is frequently used in information retrieval systems. Typically, a new query is constructed based on the user's initial query given a set of relevant and/or non-relevant examples from the query results. The idea is that in the vector space the query is shifted closer to the relevant objects and away from the non-relevant objects. While such an algorithm works for metric spaces, it does not work for non-metric functions. The reason is that when averaging values to generate the relevance feedback query, the generated query may not be similar to either the original query or the relevant objects.

A relevance feedback solution that uses a localized similarity function should maintain the same sense of similarity and discriminate between important or common features across the relevant data objects, to give preference to those attributes that contribute more to relevance. In addition, a bitmap-based solution should only perform bitwise operations between bitmaps.

With these considerations, we propose a user relevance feedback mechanism using the query widening and column weighting operations. We generate a relevant bitmap R using the positive examples provided by the user. We AND the relevant bitmap with each bitmap in the query and compute the count, i.e. the number of relevant objects that fall into that bin. By doing this, we reshape the query so that it places more importance on those attributes that tend to be better predictors of relevance.

We increase the score of these attributes so they will make a greater contribution in the revised similarity search. Similarly, we can decrease the score of those attributes that match the query over the set of objects marked as non-relevant (in effect, we can decrease an integer score of 1 for these attributes by increasing the integer scores of the other attributes, or ignore them altogether). We then execute the resulting integer-weighted relevance feedback query to find objects that more closely match those attributes identified as important for defining similarity. In our experiments we performed relevance feedback using positive relevant examples. The cost of executing BitmapNN with relevance feedback is roughly twice the cost of executing BitmapNN alone, as two queries are executed instead of one.
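The AND-and-count step can be sketched with Python integers standing in for bit vectors. This is an illustrative sketch, not the actual implementation; the function names and data layout (`feedback_weights`, a dictionary of per-bin bitmaps) are assumptions.

```python
def popcount(x):
    """Number of set bits in an integer bitmap."""
    return bin(x).count("1")

def feedback_weights(query, bin_bitmaps, relevant_rows):
    """query: {attribute: queried bin};
    bin_bitmaps: {attribute: {bin: int bitmap, bit r set if row r falls in the bin}};
    relevant_rows: row ids the user marked as relevant."""
    R = 0
    for r in relevant_rows:          # build the relevant bitmap R
        R |= 1 << r
    # AND R with each query bitmap and count: attributes whose query bin
    # captures many relevant objects receive a larger integer weight.
    return {a: popcount(bin_bitmaps[a][q_bin] & R)
            for a, q_bin in query.items()}
```

The returned counts can then be fed to the column weighting operation as integer scores.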

Similarity over Incomplete Data

Incomplete data refers to attributes that have missing values. Missing data is a common problem in real-world data sets. Similarity for tuples with missing attribute values is challenging both to define and to compute. Data can be missing for a variety of reasons, and the fact that a data element is missing in itself gives no hint of the probability that the missing element in fact matches the query.

In the absence of any knowledge about the relationship between data missingness and query matching, we can use query widening to refine results. Under a missing-data semantic where the missing value for an attribute is unknown and could match the queried value, we prefer a data object with score X and some missing values over an object with the same score X whose known values differ from the query bins. One bitmap is stored for each attribute to indicate whether the data in that attribute is missing for a given tuple. Query widening can then be used to query both the value of interest and the bitmap corresponding to a missing value. In this way, equally scored objects can be differentiated based on the number of missing (and therefore potentially matching) data elements in the tuple.

If more detailed knowledge is available about the relationship between data missingness and query matching, then query widening can be combined with column weighting to approximate this relationship. For example, if missing data implies a matching value 1/4 of the time, a query can be rewritten with a score of 1 for the missing-data bitmap columns and a score of 4 for the desired query values.
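A minimal sketch of this widened, weighted query follows, assuming one missing-value bitmap per attribute and the 4:1 weighting of the example above. The scoring loop and data layout are illustrative; a bitmap engine would realize the same scores with bitsliced arithmetic rather than a per-row loop.

```python
def score_rows(query, bin_bitmaps, missing_bitmaps, n_rows):
    """query: {attribute: queried bin}; bin_bitmaps: {attribute: {bin: int bitmap}};
    missing_bitmaps: {attribute: int bitmap, bit r set if row r's value is missing}."""
    scores = [0] * n_rows
    for attr, q_bin in query.items():
        match_bm = bin_bitmaps[attr][q_bin]
        miss_bm = missing_bitmaps[attr]
        for r in range(n_rows):
            if (match_bm >> r) & 1:
                scores[r] += 4       # known value matches the query bin
            elif (miss_bm >> r) & 1:
                scores[r] += 1       # missing: could match, partial credit
    return scores
```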

6.3.6 Analysis of the BitmapNN approach

In this section we analyze the cost of computing the similarity search using the baseline BitmapNN approach.

To compute the similarity function, Algorithm 3 is called d − 1 times. The cost of Algorithm 3, in terms of the number of slices s in the bitsliced index, is 2s bitwise operations. Since s = ⌈log2(d + 1)⌉, the cost of computing the similarity function is given by

    ∑_{i=2}^{d} 2⌈log2 i⌉

bitwise operations.

In the case where d is a power of 2, the number of bitwise operations is equal to 2(d(log2 d − 1) + 1).

To determine the answer set (bitmaps G and E), 2⌈log2(d + 1)⌉ + 1 bitwise operations need to be performed in the worst case.

Note that the total cost depends only on the size of the bitmap and the dimensionality d, and is completely independent of the number of splits per dimension. For verbatim (uncompressed) bitmaps, the size of each bitmap is equal to the number of objects in the dataset, n bits. So the complexity of BitmapNN in terms of bitwise operations is O(nd log d).
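The operation counts can be checked numerically. The short script below is a verification aid, not part of the proposed method; it computes ⌈log2 i⌉ exactly as `(i - 1).bit_length()` to avoid floating-point rounding and confirms the closed form for powers of two.

```python
def similarity_ops(d):
    # Sum over i = 2..d of 2*ceil(log2 i) bitwise operations
    return sum(2 * (i - 1).bit_length() for i in range(2, d + 1))

def closed_form(d):
    # 2(d(log2 d - 1) + 1), valid when d is a power of two
    lg = d.bit_length() - 1
    return 2 * (d * (lg - 1) + 1)

# The two expressions agree for powers of two
for d in (4, 8, 16, 64):
    assert similarity_ops(d) == closed_form(d)
```

For example, d = 8 gives 34 bitwise operations and d = 16 gives 98.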

Dataset        Rows       Attributes  Classes
wdbc           569        30          2
segmentation   210        19          7
ionosphere     351        32          2
census-income  299,285    40          2
musk           6,598      166         2
landsat        275,465    60          -
irvector       19,997     32          -
uscensus       2,458,285  68          -
ur random      100,000    20          -

Table 6.1: Dataset statistics.

6.4 Experimental Results

In this section, we perform empirical measurements of the BitmapNN technique in terms of quality of the results, query execution time, and index size.

For quality of the results, we use several labeled datasets from the UCI machine learning repository [1]. The data includes the Wisconsin Diagnostic Breast Cancer (wdbc) data set, the image segmentation data set, the ionosphere data set, the census-income data set, and the musk clean2 data set. For other experiments we also use three other real datasets, landsat, irvector, and uscensus, and one synthetic dataset, ur random. Landsat consists of satellite image feature vectors; irvector contains document vectors from an information retrieval application; uscensus also comes from the UCI repository [2] and is a discretized version of raw data obtained from the (U.S. Department of Commerce) Census Bureau website using their Data Extraction System. The ur random data set consists of uniformly distributed random attributes. The statistics of these datasets are provided in Table 6.1. Implementations were written in Java 1.6 and the experiments were run on a Dell Optiplex GX745 with an Intel Core2 CPU 6600 at 2.4GHz and 2GB of RAM, running Windows XP.

6.4.1 BitmapNN Quality of the results

K=10
Dataset        Partition  BitmapNN  PIDist  FKNM   knn
wdbc           8          0.935     0.932   0.911  0.915
wdbc           CAIM       0.942     0.953   0.911  0.915
segmentation   8          0.739     0.735   0.732  0.726
segmentation   CAIM       0.790     0.812   0.732  0.726
ionosphere     8          0.899     0.899   0.895  0.837
ionosphere     CAIM       0.894     0.899   0.895  0.837
census-income  8          0.978     0.988   0.903  0.902
musk           8          0.926     0.919   0.915  0.925

K=20
Dataset        Partition  BitmapNN  PIDist  FKNM   knn
wdbc           8          0.918     0.917   0.905  0.907
wdbc           CAIM       0.937     0.943   0.905  0.907
segmentation   8          0.619     0.613   0.632  0.609
segmentation   CAIM       0.718     0.726   0.632  0.609
ionosphere     8          0.864     0.868   0.862  0.801
ionosphere     CAIM       0.863     0.870   0.862  0.801
census-income  8          0.974     0.978   0.897  0.897
musk           8          0.908     0.903   0.904  0.908

Table 6.2: Classification Accuracy for BitmapNN, PIDist, Frequent K-N Match (FKNM) Function, and Euclidean Distance (knn).

We measure the quality of the results obtained with BitmapNN using the class stripping technique. This technique considers a retrieved object to be correct, i.e. similar to the query, if both the data object and the query object have the same class label.

For each data set, similarity search is performed for a set of queries for k = 10 and k = 20. The number of correct answers is then counted and divided by the total number of answers. For the first three datasets (wdbc, segmentation, and ionosphere) we use all data objects as queries. In [68], the FKNM approach was shown to be more accurate than PIDist when using equi-depth partitions, which is the original setting used by [3], when running 100 randomly selected queries. Since all data points were used as queries in this experiment, the results differ from those in [68] and FKNM does not outperform PIDist. For the last two datasets (census-income and musk) we ran 500 random queries. For building the bitmaps we used equi-depth partitioning and Lloyd's algorithm with the specified number of partitions. We also used the CAIM algorithm. The average number of partitions per dimension found using CAIM was 1.87 for wdbc, 2.87 for segmentation, and 1.91 for ionosphere.

We run the Frequent k-N-Match algorithm with N ranging from 1 to d. Table 6.2 shows the classification accuracy for k=10 and k=20. For reference, results for similarity search using Euclidean distance (knn) are included as well. Even though no domain knowledge has been applied to tune the BitmapNN similarity function, BitmapNN yields very competitive results compared to the other techniques, even in the worst cases.

In many cases BitmapNN performs significantly better.

Note that the presented results for FKNM do not exactly match those found in [68]. FKNM used 100 random samples of data points as queries, which makes their reported results non-repeatable; we used all the data points as queries. Although the results differ slightly, the overall trend is the same as their findings. By using all available data as queries we allow repeatable and fairer comparisons.
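The class stripping measure can be sketched in a few lines. The helper name and input layout (a map from query id to retrieved ids, and a map from id to class label) are illustrative.

```python
def class_stripping_accuracy(results, labels):
    """results: {query_id: [retrieved ids]}; labels: {id: class label}.
    A retrieved object counts as correct when it shares the query's label."""
    correct = total = 0
    for q, retrieved in results.items():
        for r in retrieved:
            correct += labels[r] == labels[q]
            total += 1
    return correct / total
```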

[Plot: average query time (ms), log scale, vs. query dimensionality; series: Seq Scan, R-tree, Mod. R-tree Euc., BNN-Comp, BNN-Verbatim.]

Figure 6.4: Query Execution Time comparison as query dimensionality varies (log scale). Data: ur random, k: 1

6.4.2 BitmapNN Query Execution Time

Figure 6.4 shows the query execution time as dimensionality increases over the ur random dataset for a k=1 similarity query, where the query point is a random point in the data space. BitmapNN is compared to sequential scan and to data-hierarchical index structure techniques that are appropriate for low-dimensional similarity queries. Results are shown for BitmapNN, sequential scan using Euclidean distance, I/O-optimal R-tree traversal using Euclidean distance, and an R-tree using the same scoring function that we use in BitmapNN. The BitmapNN technique is applied over both verbatim (uncompressed) bitmaps and bitmaps compressed using WAH, where the bitmaps are built over 10 equi-width partitions for each attribute. Results are shown in logarithmic scale as the average time over 100 queries. Sequential scan is roughly three orders of magnitude more costly than BitmapNN. The R-tree techniques suffer from the curse of dimensionality and approach the performance of sequential scan when the dimensionality is only 20. In this particular data set, the WAH-compressed bitmaps add computational complexity while offering little compression; accordingly, this scheme is more costly than the verbatim bitmaps.

Figure 6.5: Query Execution Time as the dimensionality of the query varies; access structures designed for high-D. Data: ur random, k: 1

Clearly, sequential scan and R-tree are not good methods for similarity search at higher dimensions. In Figure 6.5, a number of techniques that maintain dimensional independence during similarity computations, and hence do not suffer from the curse of dimensionality, are considered. The same queries are compared to two VA-file implementations and the FKNM technique. The VA-file implementations include one that uses Euclidean distance as a similarity score, while the other uses the same score as BitmapNN. Note that both VA-file techniques include only the first phase of VA-file similarity search. The verbatim bitmaps yield the best execution times, at under 7ms per query for queries of 20 dimensions. Even for the uscensus data, similarity search over the verbatim bitmaps and the whole dataset exhibits subsecond query performance.

[Plot: average query time (ms) vs. k; series: BNN-Verbatim, BNN-Comp, VA-file, PIDIST, Rel Feedback.]

Figure 6.6: Query Execution Time as k varies. Data: landsat, Partitions: 16

[Plot: query time (ms) vs. number of splits; series: BNN-Verbatim, BNN-Comp, VA-file, PIDIST.]

Figure 6.7: Query Execution Time as the number of partitions varies. Data: landsat, k: 20

Figure 6.6 shows average query execution times as k increases over the landsat data, where Lloyd's algorithm was used to quantize the index structures into 16 partitions for each dimension. The query is derived from a random point in the data set. For this data, BitmapNN (BNN) over uncompressed bitmaps yielded results on the order of 90ms per query. BitmapNN over compressed bitmaps resulted in query times on the order of 175ms. The VA-file, adapted to compute Hamming distance instead of more expensive bound computations, produces query execution times of about 280ms. As expected, query execution time is not affected by k, which is not true for traditional NN computations over high-dimensional data, where results are progressively retrieved. The time to execute a sample relevance feedback query using BitmapNN is also shown. Figure 6.7 shows that the performance of the techniques is independent of the number of partitions used per dimension (splits).

Despite the potentially large number of bit operations required to perform BitmapNN, query execution time is faster than competing alternatives that provide equivalent results. We compare performance to the VA-File (i.e. the first phase of VA-File query execution). VA-File computation is more expensive than the bitwise operations performed using bitmaps: it is not possible to avoid decoding the bitstring that represents a data object in the VA-File in order to compare it to the query and obtain the similarity score. We ran 100 nearest neighbor queries randomly sampled from the dataset.

BitmapNN (BNN) shows scalable performance as the dimensionality of the data set, and therefore the query dimensionality, increases. Figure 6.8 shows the query execution time as the number of queried dimensions changes over the landsat data set using k=20. The graph shows slightly greater than linear growth as dimensionality increases.

Verbatim bitmaps perform 4X faster than the modified VA-File for the landsat data set. BNN maintains its advantage over the modified VA-File for different data sets and follows the same behavior as parameters change. As a performance comparison example, using the irvector32 data set while finding the k=20 most similar objects over 4 partitions per dimension, BNN-verbatim yields an average query execution time of 3.6ms, compressed BNN produces a query execution time of 11.9ms, and the VA-File results in 22.7ms. This represents a speedup of 6.3X for BNN-verbatim over the VA-File.

[Plot: average query time (ms) vs. query dimensionality; series: BNN-Verbatim, BNN-Comp.]

Figure 6.8: Query Execution Time comparison of BitmapNN as query dimensionality varies. Data: landsat, k: 20

[Plot: average bitcolumn size (words) vs. number of partitions; series: BNN-Verbatim, BNN-Comp.]

Figure 6.9: Average Bitcolumn Size as the number of partitions increases. Data: landsat

6.4.3 Index Size

In general, the total size of the VA-File will be smaller than the total size of the bitmap index, as only ⌈log2(si + 1)⌉ bits are used when defining si splits for attribute i.

For bitmaps, each partition is represented by one bitmap. However, when executing the query the entire VA-File needs to be accessed, while for bitmap indexes only d bitmaps need to be retrieved. As a function of the number of splits, if each dimension has s partitions, then only a 1/s portion of the index needs to be accessed to answer the query. When four or sixteen splits are used, only 25% or 6.25% of the bitmap index needs to be accessed, respectively, as opposed to the entire VA-File in both cases.

To show the effect of compression on the bitmap size, the average bitcolumn size as the number of partitions increases is plotted in Figure 6.9. As can be seen, as the number of splits increases, the bitmaps become more sparse and compress better.

It is worth noting that the space requirements of FKNM are always larger than the original data if each attribute is represented with 32 bits or fewer. In our implementation of FKNM, we assume that the index fits into memory. We store each attribute in a sorted list in ascending order. The row ids are saved together with the values in each sorted list to allow back-referencing. Hence, the index size of FKNM is almost two times the size of the dataset. For bigger datasets, a disk-based FKNM would need to store two copies of each sorted list, one in ascending order and one in descending order, to be able to retrieve the previous and next element in the sorted list efficiently. In this case, the required space for the whole index becomes four times the size of the dataset. For the landsat data, the size of the dataset is approximately 64MB. The total size of the verbatim bitmap with 8 splits per dimension is less than 16MB. The total size of FKNM is 126MB.

6.4.4 Comparison with other bitmap-based approaches

To the best of our knowledge, this is the first work on meaningful high-dimensional similarity search that exclusively uses bitmap indexes and bitwise operations. Uniquely, this approach requires no new encoding mechanism (it works directly on the commonly used bitmap encoding) and requires no access to the original data for candidate pruning. However, since the approaches proposed in [28] and [14] are bitmap-based, we believe they deserve some discussion and experimental comparison with BitmapNN.

In [28] (Jeong), the proposed bitmap indexing consists of representing each object by a bitstring of length d and processing the query row-wise, comparing the bitstrings for pruning. The critical issue with this approach is the way the bitstring is generated: the median of an object's values is used to decide the bit value for each attribute, and important information about the actual attribute values is lost. As an example, consider data points A=[1,0] and B=[.5,.49], which yield the same bitstring (10), while point C=[.5,.51] yields (01), which is counterintuitive. To answer a NN-query, the query is quantized into its corresponding bitstring representation and XORed with all the data bitstrings. The points with the smallest number of ones are considered as candidates, and the actual distance is computed in a second step to retrieve the final NN set.
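The A/B/C example can be reproduced directly. The sketch below assumes a strict greater-than-median rule for setting bits, which matches the example values; the exact tie-breaking rule in [28] may differ.

```python
from statistics import median

def jeong_bitstring(obj):
    """Bit i is set when attribute i exceeds the median of the object's values."""
    m = median(obj)
    return "".join("1" if v > m else "0" for v in obj)

A, B, C = [1, 0], [0.5, 0.49], [0.5, 0.51]
assert jeong_bitstring(A) == jeong_bitstring(B) == "10"   # distant points collide
assert jeong_bitstring(C) == "01"                         # yet C differs from nearby B
```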

In [14], complex similarity queries, i.e. queries with more than one reference object, are answered using bitmap indexes with what the authors call a Grid Bitmap (GB) index. Data is quantized using K-means clustering, and kNN is translated into projected range queries. In series, a result bitmap is ANDed with matching bitmaps for subsequent dimensions. If there are fewer than k 1s set in the result bitmap, then the adjacent matching bitmaps are ORed together and ANDed with the current result. For the final candidate set, the real distance is computed and the k closest points are returned. The distance function used in this paper is also the one proposed in [3]. With this approach, the first dimensions dominate the similarity score and subsequent dimensions are only used to filter out the current candidate set; there is no accuracy measure or guarantee on the quality of the results. In fact, GB would never retrieve an object that does not match the first dimension.

With these considerations, we expect BitmapNN to outperform GB and Jeong's bitmap in terms of accuracy, as there is no guarantee on the quality of the results for these two approaches.

Dataset       BNN   GB    Jeong
Wdbc          0.92  0.49  0.39
Ionosphere    0.81  0.36  0.56
Segmentation  0.74  0.09  0.03

Table 6.3: Classification Accuracy for BitmapNN (BNN), GB, and Jeong’s bitmap approaches.

Table 6.3 shows the accuracy results obtained over the wdbc, ionosphere, and segmentation data. As can be seen, the accuracy results of these two approaches are, in fact, quite poor.

6.5 Extended query types supported by BitmapNN

BitmapNN can seamlessly support powerful variations of the baseline similarity search by combining BitmapNN operations with traditional bitmap index functionality. A number of these query types are described below.

6.5.1 Weighted Similarity Search

Data domain characteristics may indicate that certain attributes are more important than others in defining data object similarity. As shown in the sample column scoring weight matrix (Figure 6.3), the column scoring operation can be used to score the data objects with arbitrary integer weights on the attributes.
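One plausible way to realize integer column weights is sketched below, under the assumption that each weight is decomposed into binary slices and the baseline per-attribute match count is run once per slice. This is an illustrative sketch, not the dissertation's implementation; a bitmap engine would accumulate the slices with bitsliced addition instead of a per-row loop.

```python
def weighted_scores(query, weights, bin_bitmaps, n_rows):
    """query: {attribute: queried bin}; weights: {attribute: integer weight};
    bin_bitmaps: {attribute: {bin: int bitmap}}."""
    scores = [0] * n_rows
    for bit in range(max(weights.values()).bit_length()):
        for attr, q_bin in query.items():
            if (weights[attr] >> bit) & 1:   # attribute participates in this slice
                bm = bin_bitmaps[attr][q_bin]
                for r in range(n_rows):
                    scores[r] += ((bm >> r) & 1) << bit
    return scores
```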

This operation involves running and accumulating a query for each row in the weight matrix. The execution time required depends on the number of rows in the weight matrix and is roughly a factor of ⌈log2 max(wi)⌉ larger than the baseline BitmapNN operation, where max(wi) is the largest integer weight. The maximum space required for the weighted sum bitslice S is ⌈log2(∑_{i=1}^{d} wi)⌉ bits.

6.5.2 Projected Similarity Search

Bitmaps maintain dimensional independence between attributes, so a quantized query can be executed only over the attributes of interest. Therefore, object similarity within a given subspace can easily be computed without any special adaptation. In contrast, in other multidimensional indexes, similarity computations cannot easily be restricted to a subset of the indexed dimensions.

6.5.3 Constrained Similarity Search

The BitmapNN operations naturally combine with traditional bitmap functionality. Bitmaps are naturally suited to efficiently perform projection queries over a subset of the attributes. Queries that combine projection over one subset of attributes with similarity over another subset are easily performed by first executing the projection query to define the points that fall within the constraints, and then performing BitmapNN over this subset of points. Other techniques that cannot incorporate constraint pruning during the operation cannot easily be adapted to this problem.

6.5.4 Complex Similarity Search

Complex similarity searches refer to queries with more than one reference point, for example, a query that asks for images similar to a set of multiple examples. BitmapNN can support such queries: each query point is quantized and the corresponding bitmaps for each dimension are ORed together, and BitmapNN is performed over the resulting bitmaps as described in the previous sections. The end result is the set of objects with the greatest total similarity to the set of query objects.
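The per-attribute ORing can be sketched as follows; the data layout and function name are illustrative, and BitmapNN then runs unchanged on the merged bitmaps.

```python
def merge_reference_points(queries, bin_bitmaps):
    """queries: list of quantized reference points, each {attribute: bin};
    bin_bitmaps: {attribute: {bin: int bitmap}}."""
    merged = {}
    for attr, bins in bin_bitmaps.items():
        bm = 0
        for q in queries:
            bm |= bins[q[attr]]      # a match against any reference's bin counts
        merged[attr] = bm
    return merged
```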

6.6 Related Work

Several recent works have used distance functions or techniques that are not Lp metrics to measure similarity in high-dimensional spaces. The reason is that localized similarity functions provide a more meaningful criterion in high-dimensional spaces, where all points are almost equidistant. In [68], the k-N-match problem is introduced. The similarity function used in that paper is the one proposed in [25], where only the smallest N distances among all dimensions are considered to compute the similarity. Since the method is very sensitive to N, the authors propose the frequent k-N-match algorithm, which returns the k objects appearing most frequently in the k-N-match solutions over a range of N values.

The IGrid index was introduced in [3]. Here equi-depth ranges are defined over each attribute and inverted lists are used to perform similarity searches. The distance function used in that paper is PIDist. All the previously described functions are non-metric, so pruning based on the triangle inequality is not possible. This makes hierarchical indexes, which have traditionally been used for similarity searches, perform poorly. That is the reason why we only compare with the Vector Approximation (VA)-File.

The Vector Approximation (VA)-File was first proposed in [70]. VA-Files execute NN-queries in two steps. The first step uses distance bounds to compute a set of candidate points from the quantized data. The candidates are then retrieved from disk and the distance metric is computed over the actual data in a second step to return the true nearest neighbors to the query. In [71], an approximate VA-File query execution model is proposed, where the second step is omitted and the query is answered using only the quantized data.

6.7 Conclusion

We propose BitmapNN, an efficient method for similarity searches over existing bitmap index structures that takes advantage of fast bitwise operations and incorporates a similarity function that is efficient and effective for high-dimensional queries. The localized similarity function used by BitmapNN is more robust at high dimensions than Lp metrics, which are dominated by the largest differences between object attribute values. BitmapNN performs best in medium- to high-dimensional spaces and is not restricted to continuous data.

We evaluate the gains achieved by BitmapNN in terms of both query accuracy and efficiency. Accuracy gains are demonstrated by improved quality of query results and classification accuracies. Our proposed approach outperforms sequential scan by several orders of magnitude, outperforms hierarchical index approaches, and can be applied to high-dimensional data without performance degradation.

BitmapNN requires no changes to commonly used bitmap structures and hence can be directly integrated into current systems, e.g. Oracle, as a search algorithm and as a building block for their data mining tools. This is an important property, since it remains possible to execute point, range, aggregate, and many other types of queries using the same structure. Integrating similarity search with bitmaps provides a new direction of research and development with high potential impact.

While the experimental results shown are for a baseline similarity search algorithm, where all attributes are considered to contribute equally toward similarity and no credit is applied for near matches, we provide building-block operations for defining more advanced query functionality. In addition to similarity search, we describe how these operations can be used to build more advanced functionality such as relevance feedback and query resolution over missing data. Such functionality greatly enhances the usefulness and applicability of bitmap indexes.

CHAPTER 7

Update Conscious Bitmap Indexes

7.1 Introduction

Traditionally, the use of bitmaps has been restricted to read-only or read-mostly (data warehouse) environments. Even though commercial RDBMSs, such as Oracle, support updates to bitmap-indexed tables, the general consensus is that updating bitmap indexes takes a lot of resources and time, and the recommendation is to apply changes in nightly batches: drop the index, apply the changes, and rebuild the bitmaps [12, 20, 35].

The reason the use of bitmaps has been confined to largely static domains is that bitmaps were not originally designed to handle updates efficiently. The main goal of a bitmap index is to execute queries quickly and efficiently over large volumes of data. Within a bitmap index, a single bitmap is a bit vector that identifies the tuples whose attribute values match the value, category, or range associated with the bitmap. The entire bitmap index is made up of multiple bitmaps for each indexed attribute. The main reason why updates are so costly is that all the bitmaps, often hundreds of them, need to be updated when a new row is inserted. Insertion is the most common update operation.

As a sample application, consider High Energy Physics (HEP) experiments, which consist of accelerating sub-atomic particles to nearly the speed of light and forcing their collision. Sequences of events notable to physicists are stored with all their details. The number of events stored and analyzed in one year is on the order of several millions, and the number of attributes per event is above 100 [57]. Bitmap indexes provide efficient query support for such data [64]. New events are periodically appended to the database as more experiments are performed. Bitmap updates are done periodically in batches. This means that the new data is not accessible from the index structure until the next scheduled update of the bitmap index. In order to update the bitmap index, the new data needs to be encoded into the bitmap, compressed, and appended to the existing bitmaps.

For the most commonly used bitmap encoding, where bits are set only for tuples that match the bitmap value, and in the case where updates are performed in real time, the insertion process would need to access all the bitmaps for the table, adding a '0' to those bitmaps that are not relevant for the value(s) inserted and a '1' to the relevant bitmaps. In this domain, the number of bitmaps per table makes this insertion process prohibitively costly.

However, if it were possible to propagate database changes to only the relevant bitmaps and still maintain consistency between index and database, then the cost of bitmap updates would be greatly reduced. By updating bitmaps efficiently, we can expand the domain of applications for which bitmap indexes are useful. We could essentially make the same number of changes to the bitmap index structure as would be necessary for an equivalent set of B+-tree indexes over the attributes, while maintaining index-data consistency.

In this chapter, we propose a bitmap index design for equality encoded bitmap indexes in which new records can be added by updating a single bitmap per indexed attribute. Using compressed bitmaps, we add an extra word as the last word of each bitmap. This extra word is a compressed large run of 0's that pads the number of rows encoded in the bitmap to a fixed number, much larger than the actual number of rows in the table. When we insert a new row, we set the bit in the corresponding position of the relevant bitmaps and compress the last words accordingly. Deletions are handled by maintaining a compressed existence bitmap (EB) [45]: when a record is deleted, the appropriate bit is set to 0 in the EB. Query execution is performed in the same manner as with traditional bitmaps, with one additional AND operation with the existence bitmap. Updates of existing data are handled by performing a delete operation followed by an insertion.

These changes expand the applicable domain of bitmap indexes. For record insertions, the number of changes that need to be made to the index structure is reduced by a factor equal to the average cardinality of the indexed attributes of the dataset. These changes make bitmap indexes more feasible for dynamic database applications and more attractive for high-cardinality data domains.

The rest of this chapter is organized as follows. Section 7.2 presents the update conscious bitmap approach. Section 7.3 provides the cost model for bitmap updates and query execution. Section 7.4 shows the experimental results and the performance comparison with traditional bitmap indexes. We conclude in Section 7.5.

7.2 Update Conscious Bitmaps (UCB)

7.2.1 The General Idea

Let us consider update operations over a traditional bitmap-indexed attribute A of table T. We denote by B = {b1, b2, ..., bC} the set of bitmaps over the domain of A and by n the number of tuples in T. The simple uncompressed bitmap encoding would have n bits in each bitmap. To insert a new tuple with a value corresponding to bi, one would need to add a '1' at the end of bitmap bi and a '0' at the end of all other bitmaps.

To delete a record, we use the Existence Bitmap (EB) [45]. Originally, the EB was proposed to handle non-existent rows. The bits corresponding to non-existing tuples are set to zero in all bitmaps. Therefore, the EB needed to be used (ANDed together with another bitmap) after a NOT operation to make sure that only existing rows are reported as answers.

With uncompressed bitmaps, updating a row requires changing two bitmaps: unsetting the corresponding bit in the old value's bitmap and setting it in the new value's bitmap. However, given the amount of space they require, it is not feasible to store the bitmaps in uncompressed form. Several compression techniques have been proposed for bitmap indexes. While these techniques effectively reduce the space required by the bitmaps, they have the disadvantage that there is no longer a direct mapping between the bit position in a bitmap and the position of the row in the table. In this case, updates become even more expensive, because we need to scan (decompress) the bitmap in order to locate the corresponding bit in the two bitmaps. This is why we handle updates by performing a delete followed by an insert operation.

133 bits       1,20*0,4*1,78*0,30*1
31-bit groups  1,20*0,4*1,6*0 | 62*0     | 10*0,21*1    | 9*1
Orig (hex)     400003C0       | 80000002 | 001FFFFF     | 000001FF
UCB (hex)      400003C0       | 80000002 | 001FFFFF     | 7FC00000  | BFFFFFFA
               Literal Word   | Fill Word| Literal Word | Last Word | Pad Word

Figure 7.1: WAH compressed bit vector for original bitmaps and Update Conscious Bitmaps (UCB).

We modify the current equality encoded bitmaps in the following ways:

• To each bitmap we add a number of non-set bits, called pad-bits, that would

be used for the new tuples. By padding the bitmaps with zeros we can insert a

new tuple by setting the corresponding bit only in the bitmap that corresponds

to the inserted value.

• We maintain a total bit counter, TBC, for the number of non-pad bits currently

in the bitmaps.

• We also pad the Existence Bitmap (EB) with non-set bits to indicate non-

existing rows. As in the original approach, the EB is ANDed with the resulting

bitmap after a NOT operation to make sure that only existing rows are reported

as answers.

In the following section we detail the update conscious bitmap index approach.

From here on, we refer to the equality encoded WAH-compressed bitmaps as original or traditional bitmaps.

Symbol       Description
W0           The pad word in a UCB
W1           The last non-pad word in a UCB
W2           The word previous to the last non-pad word in a UCB
Wi.value     The literal value of the i-th word (also written just Wi)
Wi.isFill    Indicates whether the i-th word is a fill word or not
Wi.fill      The fill value of the i-th word
Wi.nWords    The number of words encoded by the i-th fill word
w            The word size; we set it to 32
TBC          The global total bit counter
word[i]      The literal word with only the i-th bit set
T1           The total number of words encoded in each bitmap, including
             the pad word; in our examples T1 = 2^(w-2) - 1
T2           The number of words encoded by the pad word, T2 = W0.nWords
T0           The number of words encoded by the non-pad words in this
             bitmap, T0 = T1 - T2
T0'          The number of literal words needed to encode TBC rows,
             ⌈TBC/(w - 1)⌉

Table 7.1: Notation list and their meaning.

7.2.2 Bitmap Creation

We base our implementation of the Update Conscious Bitmaps on equality encoded bitmaps using Word-Aligned Hybrid (WAH) compression. WAH has been shown to provide faster query execution (2X-20X faster) than Byte-aligned Bitmap Code (BBC), at the cost of a smaller compression ratio (compressed bitmaps are typically 40-60% larger) [66]. WAH is currently considered the state-of-the-art in bitmap compression due to its simplicity and efficiency.

Bitmaps are created as usual and compressed using WAH. For the rest of this chapter the word size is set to w = 32 bits. The number of words encoded in each bitmap is T0 = ⌈n/(w − 1)⌉, where n is the number of rows in the table. We divide by w − 1 because with WAH compression, one bit is reserved to indicate the type of word, whether a literal or fill word. The compressed size of each bitmap can be smaller than T0 words; however, a total of T0 words are encoded in each bitmap. We then add a final word, called a pad-word, which is a fill word representing words of 0s. With one pad-word, each bitmap encodes T1 = 2^(w−2) − 1 words. The pad-word would be the 0-fill word of T2 = T1 − T0 words. In case more bits are needed, one could add one or more pad-words to encode T1 more words for each pad-word. For the rest of the chapter we assume that T1 words are enough to encode all the rows in the table, i.e. at any given time we have fewer than T1 ∗ (w − 1) rows in the indexed table (including the rows that will be inserted in the future).

Figure 7.1 shows an example of a compressed bit vector for both the original WAH compressed bitmap (3rd line) and the update conscious bitmap (4th line). It is worth noting that we changed the encoding of the active word in the WAH implementation so incoming bits of zeros do not require a bitmap update. In the original implementation the active word would be represented as 000001FF, in which case adding a zero would update this word to be 000003FE. However, by encoding the rows from the most significant bits, the active word 7FC00000 in the UCB remains unchanged when we add a zero. The last word in the update conscious bitmap (BFFFFFFA) represents the pad-word, i.e. a 0-fill word for a run of 33,285,996,358 zeros. The total bit counter (TBC), not shown in this example, is set to 133.
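The pad-word arithmetic can be checked directly. In WAH with w = 32, a fill word carries a 1 in its most significant bit, the fill value in the next bit, and a (w−2)-bit count of 31-bit groups. The following sketch (our own code, not the dissertation's implementation) decodes the pad-word 0xBFFFFFFA:

```python
W = 32

def decode_fill(word):
    """Decode a WAH fill word into (fill_bit, number_of_31-bit_groups)."""
    assert word >> (W - 1) == 1, "not a fill word"
    fill_bit = (word >> (W - 2)) & 1           # second most significant bit
    n_groups = word & ((1 << (W - 2)) - 1)     # low 30 bits: the run count
    return fill_bit, n_groups

fill, groups = decode_fill(0xBFFFFFFA)
print(fill, groups * (W - 1))   # -> 0 33285996358  (a 0-fill of that many zeros)
```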

7.2.3 Insertions

Let us consider the insertion of a new row where attribute A is appended with the value encoded by bitmap bi. In this case, only bi needs to be accessed to perform the insertion. The basic idea is to use the TBC and the number of words in the pad word to decide how many zeros we need to put before the new 1 in the current bitmap. We also update the number of words encoded by the pad word so the total number of words in the bitmap remains the same. We change the EB to include the new row and increment the TBC by 1.

We have two cases when inserting a new row for attribute A depending on whether the inserted value is encoded by an existing bitmap. In the first case, when the value is already encoded by a bitmap bi, we compute the number of words encoded by the non-pad words of bi, T0 = T1 − T2. Note that even when T0 is initially the same for all the bitmaps, as we insert new rows, T0 remains the same only for the unchanged bitmaps. If all the bitmaps were updated on an insertion then the number of words encoded by the bitmaps would be T0' = ⌈TBC/(w − 1)⌉ and T0 would always be equal to T0'. In the case when T0 = T0' we only need to change the bit corresponding to position (TBC mod (w − 1)) in the active word. In the case when T0 < T0', which is the case when many insertions have been performed in other bitmaps, we add a 0-fill word to make T0 = T0' − 1, add an active word with the bit corresponding to position (TBC mod (w − 1)) set to 1, and update T2 = T1 − T0. The details of the implementation are given in Algorithm 4. To simplify the pseudocode, we introduce some variable names and their corresponding values in Table 7.1. The extra steps found in the pseudocode which are not described above are necessary to guarantee that the resulting bitmap is correctly compressed in WAH format.

As an illustration, consider the following example. In Figure 7.2 we present the last 7 compressed words for the bitmap index over the attribute sex, which only has two possible values, M or F, for a table with 200 rows. Both bitmaps have 6 regular words, 1 active word and 1 pad-word (BFFFFFF8). The TBC is 200, and the number

UpdateBitmap (Bitmap Index B, Total Bit Counter TBC)

1:  read W2, W1, and W0 from B
2:  if T0 = T0'
3:    W1 = W1 | word[TBC mod (w − 1)]
4:    if (TBC mod (w − 1) = 0) and (W1 is all 1s)
5:      if W2.isFill and W2.fill = 1
6:        W2.nWords++
7:        delete W1
8:      else if W2 is all 1s
9:        W2.isFill = true, W2.fill = 1, W2.nWords = 2
10:       delete W1
11: else if T0 < T0'
12:   W0.nWords = (T1 − T0')
13:   nWords = T0' − T0
14:   if nWords = 1
15:     add word[TBC mod (w − 1)] before W0
16:   else if nWords = 2
17:     add words 0 and word[TBC mod (w − 1)] before W0
18:   else
19:     W.isFill = true, W.fill = 0, W.nWords = (nWords − 1)
20:     add words W and word[TBC mod (w − 1)] before W0

Algorithm 4: Insertion of a new row - Update Conscious Bitmap indexes.

of words needed is 7 (T0' = ⌈200/31⌉). Now consider the insertion of a new row where sex = M. TBC is updated to 201. We only access the bitmap corresponding to sex = M and update the active word (W1) to be 3FCF0000 (since the number of words needed remains 7, i.e. T0 = T0'). After 16 more insertions of rows where sex = M, TBC = 217, and W1 = 3FCFFFFF. When a new insertion comes where sex = M, TBC = 218 and the words needed are now 8 (T0 < T0'), so we need to add one word and subtract 1 from the total words in W0. Since (218 mod 31 = 1), the new word is 40000000 and W0 is updated to be BFFFFFF7.

                 Bitmap  W3        W2        W1          W0
Initially          M     7E95B5FA  6FFFE3AF  3FCE0000    BFFFFFF8
(last 102 rows)    F     016A4A05  10001C50  40300000    BFFFFFF8
Insert 1 row       M     7E95B5FA  6FFFE3AF  3FCF0000    BFFFFFF8
(sex = M)          F     016A4A05  10001C50  40300000    BFFFFFF8
Insert 16 rows     M     7E95B5FA  6FFFE3AF  3FCFFFFF    BFFFFFF8
(sex = M)          F     016A4A05  10001C50  40300000    BFFFFFF8
Insert 1 row       M     6FFFE3AF  3FCFFFFF  *40000000*  BFFFFFF7
(sex = M)          F     016A4A05  10001C50  40300000    BFFFFFF8
Insert 1 row       M     6FFFE3AF  3FCFFFFF  40000000    BFFFFFF7
(sex = F)          F     10001C50  40300000  *20000000*  BFFFFFF7

Figure 7.2: Insertion operations for Update Conscious Bitmap Index over Attribute sex.

Now consider the insertion of a new row where sex = F. TBC = 219 and the number of words needed is 8. Since there are only 7 words encoded in the bitmap (T0 < T0'), we need to add one word to the bitmap and update the word count in W0. Since (219 mod 31 = 2), the new word is 20000000 and W0 is updated to be BFFFFFF7.
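The two insertion cases can be sketched in a few lines of code. The snippet below is our own simplified illustration of Algorithm 4 (names and structure are ours; the steps that merge a full literal word into a 1-fill, lines 4-10 of the algorithm, are omitted). Running it reproduces the sex = M sequence from Figure 7.2:

```python
W = 32
T1 = (1 << (W - 2)) - 1        # words encodable with one pad word: 2^(w-2) - 1
FILL0 = 1 << (W - 1)           # a 0-fill word with a run count of 0

def literal(pos):
    """Literal word with only payload bit `pos` set (pos = 0 is the bit
    right after the flag bit, matching the MSB-first row encoding)."""
    return 1 << (W - 2 - pos)

def run(word):
    """Run count stored in a fill word (its low w-2 bits)."""
    return word & ((1 << (W - 2)) - 1)

def insert(bitmap, row):
    """Set the bit for 0-based `row` in `bitmap`, a list of words whose
    last element is the pad word (simplified sketch of Algorithm 4)."""
    t0 = T1 - run(bitmap[-1])          # words encoded before the pad word
    t0p = row // (W - 1) + 1           # words needed once this row is set
    pos = row % (W - 1)
    if t0 == t0p:                      # active word exists: just OR the bit in
        bitmap[-2] |= literal(pos)
    else:                              # t0 < t0p: grow, then shrink the pad
        gap = t0p - t0 - 1             # all-zero words between old end and bit
        if gap == 1:
            bitmap.insert(-1, 0)               # one literal word of 0s
        elif gap > 1:
            bitmap.insert(-1, FILL0 | gap)     # one 0-fill covering the gap
        bitmap.insert(-1, literal(pos))
        bitmap[-1] = FILL0 | (T1 - t0p)        # pad now covers fewer words

# Last words of the sex = M bitmap from Figure 7.2 (200 rows so far):
m = [0x7E95B5FA, 0x6FFFE3AF, 0x3FCE0000, 0xBFFFFFF8]
for row in range(200, 218):            # 18 consecutive inserts of sex = M
    insert(m, row)
print([hex(wd) for wd in m[-2:]])      # -> ['0x40000000', '0xbffffff7']
```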

For the second and special case of insertions, when the value being inserted is not encoded by an already existing bitmap, i.e. the new row is the first row with such a value, we do not need to access any of the bitmaps for attribute A; we just add a new bitmap with at most two non-pad words. The first word is a 0-fill word covering (almost) all the previous rows and the second one is the active word with a 1 in the position corresponding to (TBC mod (w − 1)). We also add as many pad-words as needed, e.g. 1 in our experiments.

7.2.4 Deletions

We propose to handle deletes by unsetting the corresponding bit in the EB and leaving the rest of the bitmaps unchanged. Since the bit corresponding to the deleted row remains set in some bitmaps, when we execute the query we need to AND the

EB with the final result to ensure that no deleted row is retrieved as an answer.
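A toy illustration of this pruning (our own, using plain integers as uncompressed bitmaps with the least significant bit as row 0; the attribute and value names are hypothetical):

```python
# Six rows; row 3 has been deleted, so its bit is unset in the EB.
N_ROWS = 6
ALL = (1 << N_ROWS) - 1

eb    = 0b110111   # existence bitmap after deleting row 3
b_red = 0b000101   # bitmap of a hypothetical value 'red' (rows 0 and 2)

# Query "value != red": the NOT alone still reports the deleted row 3 ...
assert (~b_red & ALL) == 0b111010        # rows 1, 3, 4, 5
# ... so the final result is ANDed with the EB to prune deleted rows.
assert (~b_red & ALL & eb) == 0b110010   # rows 1, 4, 5
```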

7.3 Bitmap Index Cost Model

The cost to modify a set of indexes associated with a database update is the cost to propagate all database changes to the set of indexes such that all possible operations over the indexes exactly reflect the state of the current database. This section describes the bitmap modifications that need to occur so that the indexes are up to date after a record insertion, for both traditional bitmaps and our update conscious bitmaps.

In our final update cost estimates, we assume that reads and writes dominate the cost of updates compared to the bit comparisons and bit operations that need to be performed. We also assume that I/O operations have a consistent cost, but they may in fact differ based on disk locality. The experimental results will show to what extent these assumptions are valid.

7.3.1 Insertions

Traditional Bitmaps

For traditional bitmaps, when a new row is inserted every bitmap for each indexed attribute needs to be updated in order to maintain consistency with the database.

Assuming the bitmap index is not in memory and bitmap Bi is a pointer to a binary file that contains the compressed bitmap, the insertion routine involves reading at most the last two words of the bitmap file, making the appropriate modifications, and writing the updated file to disk. All the bitmaps, including the existence bitmap (EB), need to be accessed and updated.

The cost to perform this update is:

tins = (tbm + 1) ∗ (rt + ut + wt)

where tins is the estimated time to update the complete bitmap index upon a record insert, tbm is the total number of bitmaps excluding the EB, rt is the time to read

the last words of the bitmap file, ut is the time to update the last words of the bitmap to be consistent with the record insertion, and wt is the time to write the bitmap file.

Update Conscious Bitmap

For update conscious bitmaps, when a new row is inserted, only one bitmap for

each indexed attribute needs to be updated to maintain consistency with the database.

With the same assumptions as with the traditional bitmaps, insertions using UCB

involve reading at most the last three words of the bitmap file, making the appropriate

modifications, and writing the updated file to disk. One bitmap per indexed attribute,

as well as the existence bitmap (EB), needs to be accessed and updated.

The estimated cost associated with a record insert is:

tins = (atts + 1) ∗ (rt + ut + wt)

where atts is the number of attributes.

Assuming that the read, write, and bit manipulation operations require similar times for the traditional and update conscious bitmaps, the speedup for the insertion of a new row is:

speedup = (tbm + 1)/(atts + 1)
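Plugging in the HEP figures used later in Section 7.4 (122 bitmaps over 12 attributes, plus the EB) gives the predicted speedup. A small check, with a helper name of our own choosing:

```python
def insert_speedup(total_bitmaps, n_attributes):
    """Predicted insert speedup of UCB over traditional bitmaps,
    (tbm + 1) / (atts + 1), assuming rt, ut and wt cancel out."""
    return (total_bitmaps + 1) / (n_attributes + 1)

# HEP data set: 122 bitmaps over 12 attributes, plus the existence bitmap.
print(round(insert_speedup(122, 12), 2))   # -> 9.46
```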

7.3.2 Deletions

Traditional Bitmaps

The brute force scheme for handling row deletes in traditional bitmaps would be to remove the bit corresponding to the deleted row in all bitmaps. However, this is clearly very inefficient since it involves decompressing all the bitmaps and altering the (w−1)-bit groups created to compress the bitmaps. A more clever way would be to use the EB to flag the deleted row as nonexistent. There would then be two alternatives. The first one is to keep the rest of the bitmaps unchanged, in which case we would need to exclude the deleted rows during query execution by AND-ing the final bitmap with the EB. The second one is to change the corresponding bits to 0 in the relevant bitmaps, in which case query execution is not affected since deleted rows are 0s in all bitmaps. In order to maintain the query execution performance of traditional bitmaps, we chose the second alternative for traditional bitmaps. When a row is deleted, the bitmaps corresponding to the values of each indexed attribute in the row need to be updated, i.e. the set bit needs to be unset, and then the bit corresponding to the deleted row in the existence bitmap needs to be unset as well.

Now, recall that since the bitmaps are compressed, the bitmaps need to be scanned

(decompressed) and the change in the bit value could mean increasing or decreasing the size of the bitmap by at most three words.

The estimated cost to perform a delete is:

tdel = atts ∗ (rtbc + dt + ut + ct + wtbc)

where tdel is the estimated time required to update the entire index upon a record deletion, rtbc is the time it takes to read the bitmap file, dt is the time required to decompress/scan the bitmap to locate the relevant bit, ut is the time to update the bit, ct is the time to compress the updated bitmap, and wtbc is the time needed to write the file to disk. These times will vary depending on the compressed length of the bitmap.

Note that one could choose the first alternative to handle deletes with traditional bitmaps, in which case the deletion cost and the query execution cost would be the same for both traditional and update-conscious bitmaps.

Update Conscious Bitmaps

For the update conscious bitmaps, we decided to only modify the existence bitmap and handle the extra set bits during query execution by ANDing the final results with the existence bitmap in case there are deletions. Assuming the existence bitmap is not in memory, the operation requires reading the existence bitmap, scanning it, changing the bit associated with the deleted row to ‘0’, compressing the bitmap, and writing it to disk. The estimated cost is:

tdel = rtbc + dt + ut + ct + wtbc

Assuming similar times associated with the compressed existence bitmap and the average bitmap size, the estimated speedup for deleting a row is:

speedup = atts

7.3.3 Query Execution

For both approaches, when a bitmap is negated during query execution the final answer needs to be ANDed with the EB in order to prune nonexistent rows. However, for update conscious bitmaps, this additional operation needs to be performed for every query if there are deletions.

The following analysis refers to queries where no bitmap is negated and there are deletions, i.e. the update conscious bitmaps require one bit-wise operation more than the traditional bitmaps. Note that in other cases, the UCB will exhibit the same query execution performance as the traditional bitmaps. For simplicity, and without loss of generality, we will express the queries in terms of the number of bitmaps that need to be accessed to answer the query.

For point queries, the number of bitmaps that need to be accessed is equal to the number of attributes queried.

For range queries, the number of bitmaps that need to be accessed depends on the range of each attribute queried. However, it is bound by half of the bitmaps for each attribute.

Traditional Bitmaps

For traditional bitmaps, the time required to execute a query over b bitmaps is:

tex =(b − 1) ∗ tbo

where tex is the estimated time to execute a query, b is the number of bitmaps that need to be accessed to answer the query, and tbo is the time to perform a bitwise operation between two bitmaps. The time tbo will vary depending on the size of the bitmaps.

Update Conscious Bitmaps

For update conscious bitmaps, query processing is identical as before with the addition that the final answer obtained from the bitwise operations is ANDed with the existence bitmap. The time to perform a query is:

tex = b ∗ tbo

And the speedup (slowdown) is:

speedup = (b − 1)/b
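This ratio approaches 1 as queries touch more bitmaps, so the extra AND matters least for wide queries. A quick numeric check (our own helper, not part of the dissertation's code):

```python
def query_overhead(b):
    """Ratio of traditional to UCB query time when the extra EB AND is
    needed: (b - 1) bitwise operations vs b."""
    return (b - 1) / b

# The relative slowdown shrinks as the query touches more bitmaps:
print([round(query_overhead(b), 2) for b in (2, 5, 10)])   # -> [0.5, 0.8, 0.9]
```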

7.4 Experimental Results

7.4.1 Experimental Setup

We performed experiments in order to measure the impact on bitmap update time, query execution time, and index storage for update conscious bitmaps compared to traditional bitmaps.

Four data sets were used in the experiments. Each data set contains 2 million rows and attributes of varying cardinalities. The HEP data set is a set of real bitmap data generated from High Energy Physics experiments with 12 attributes. Each attribute has between 2 and 12 bins, for a total of 122 bitmaps. The UR data set is a synthetically generated data set of uniformly distributed random data. The Z1 and Z2 data sets are synthetically generated data sets following the Zipf distribution with parameters 1 and 2 respectively.

Experiments are performed using a Windows XP Professional Operating System on a Pentium 4 2.26 GHz processor with 512 MB of RAM. A Java implementation of traditional bitmaps and update conscious bitmaps was developed to measure the relative differences between techniques and to compare to predicted results. Each bitmap is stored in a separate binary file. The bitmap update time includes time to read the relevant bitmap files, update the last word(s) of the file, and write the updated file back to disk. Query execution time is the time to run point queries using the appropriate bitmap query execution technique. For HEP data, the query points are randomly selected from the last 100,000 rows which were not used during bitmap

Figure 7.3: Index Update Time vs. Number of Rows Inserted.

Figure 7.4: Index Update Time vs. Number of Rows Inserted.

creation. For synthetic data, the point queries correspond to randomly generated data points following the same distribution as the data set.

7.4.2 Update Time

Insertion Time versus Rows Appended

Figure 7.3 shows the index update time required as the number of records appended is varied for the synthetic data sets. Each append is done individually, not in a batch fashion. In this experiment we append one attribute with cardinality of

Figure 7.5: Index Update Time vs. Cardinality of the Attribute Indexed.

50. For both the original bitmap representation and the UCBs, the update of the Existence Bitmap is included in this experiment. Both techniques exhibit linear performance with respect to the number of rows appended. The data sets, which vary in compressibility, have little effect on the append times. The ratio between the update times for the original technique and the UCBs is close to the predicted value. Using our experimental setup, on average, one insertion is done in 7 ms for UCB as opposed to 212 ms for traditional bitmaps.

Figure 7.4 shows index update time for the real HEP data set. This graph shows the linear performance of insertions and includes the time to append all 12 attributes of an inserted record. Appends to the real data show the same pattern of append time as the synthetic data. To insert one new row, the average time is 50 ms using

UCB as opposed to 470 ms using traditional bitmaps. The speedup is 9.46 (123/13).

Figure 7.6: Index Update Time vs. Cardinality of the Attribute Indexed.

Insertion Time versus Attribute Cardinality

Figure 7.5 shows the record append time for the synthetic data as attribute cardinality varies. For each stated cardinality, 1,000 rows of a single attribute are appended. The UCBs show constant update time with respect to attribute cardinality, while the original bitmaps exhibit update times that increase linearly with attribute cardinality.

Figure 7.6 shows the relationship between append time and attribute cardinality for the real HEP data set. It shows similar performance to the synthetic data sets.

Insertion Time versus Number of Attributes

Figure 7.7 shows how insertion time is affected by the number of attributes indexed. Insertion time measures the time needed to append 1,000 rows with a varying number of attributes. The attributes involved are equally taken from the UR, Z1, and Z2 data sets and all have cardinality 10. As expected, the performance for both the original and UCBs is linear with respect to the number of indexed attributes and

Figure 7.7: Index Update Time vs. Number of Attributes Indexed.

Figure 7.8: Point Query Execution Time vs. Number of Attributes Queried.

the ratio between original bitmap and UCB append time is as predicted. The real

HEP data performance is similar and is not shown here due to space considerations.

7.4.3 Query Execution Time

Figure 7.8 shows how the average query execution time compares using original bitmaps and UCB. Times are provided for performing 100 point queries using the indicated bitmap type as the number of attributes involved in the query varies. Times

Figure 7.9: Point Query Execution Time for best case EB and worst case EB.

are provided for both the original bitmaps and UCB, with and without an extra AND operation with a compressed existence bitmap. UCB and original bitmap query execution times are nearly identical, both when the existence bitmap is used and when it is not.

7.4.4 Worst Case Existence Bitmap

As deletes occur over time, the existence bitmap becomes less compressible (the

0’s associated with deleted records interrupt runs of 1’s). Since query execution time over bitmaps is dependent on the compressed length of those bitmaps, queries that involve the existence bitmap can become more time consuming as the existence bitmap increases in size. Figure 7.9 demonstrates this effect and the magnitude of the effect. We compare the results using a best case (all 1’s existence bitmap) to a worst case existence bitmap (alternating 1’s and 0’s, no WAH-compression possible). The graph displays results for the Z2 dataset using cardinality 10 attributes. The results show that a worst case existence bitmap adds less than 20 ms in query execution time for this experiment. Other data sets provide similar results.

7.4.5 Compressed Size

Considering the case when we only add one pad-word, there is no overhead in terms of storage space when the last word of the compressed bitmap is zero, since this word would be subsumed by the pad-word. If the last word of the compressed bitmap has a one, then we are increasing the compressed size by 1 word. In general, if we add p pad-words, the overhead would be between |B| ∗ (p − 1) ∗ w and |B| ∗ p ∗ w bits, where |B| is the number of bitmaps for the given table. For example, if we add one 32-bit pad-word to each bitmap and the number of bitmaps for the table is 1,000, in the worst case we are increasing the size of the bitmap indexes by 4 KB.
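This worst-case bound is easy to verify numerically (the helper name below is our own):

```python
def pad_overhead_bits(n_bitmaps, p, w=32):
    """Worst-case extra storage, in bits, from adding p pad-words of
    width w to each of n_bitmaps bitmaps: |B| * p * w."""
    return n_bitmaps * p * w

# One 32-bit pad-word over 1,000 bitmaps: 32,000 bits = 4 KB worst case.
print(pad_overhead_bits(1000, 1) / 8 / 1000)   # -> 4.0 (KB)
```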

This is clearly a small overhead when compared with the overall benefits achieved using this technique.

7.5 Conclusion

In this chapter, we have introduced an update conscious bitmap index structure.

In order to insert a record, only those bitmaps associated with attribute-value pairs that match the new tuple need to be updated. This means that for each attribute, appends are performed in constant time regardless of the attribute cardinality. This contrasts with traditional bitmaps, which would require all bitmaps to be updated in order to keep the index up-to-date. The speedup of bitmap index insertion using an update conscious bitmap index is on the order of the average indexed cardinality.

In addition, query performance using UCB is as efficient as traditional bitmaps when the updates are in the form of appends. The storage overhead is minimal since the compressed size of each bitmap is increased by at most one word. Even in the case when there have been so many deletes and modifications of existing records

that the Existence Bitmap becomes incompressible, there is only a minor impact on query execution time. Furthermore, there is nothing in the update conscious bitmap architecture that prevents periodic rebuilds of the bitmap index to mitigate this effect.

Update conscious bitmaps are a particularly good choice for domains where new data is added to the data set at a rate which allows for updates to take place, and it is desirable to keep the index up-to-date in real time. This approach expands the applicable realm of bitmap indexes to include a wide range of dynamic data domains, such as scientific applications where data is appended as experiments are conducted.

To the best of our knowledge, this is the first work that addresses the update of bitmap indexes and proposes a solution to handle appends of new data efficiently.

CHAPTER 8

Non-clustered Bitmap Indexes

8.1 Introduction

Traditional bitmap indexes are built using the physical order of the data table, as corresponding bits are set based on the attribute values for the given tuple. For this reason, bitmap indexes can be considered a special case of primary or clustered indexes that does not impose a physical order over the base table but rather uses the physical order of the dataset to build the index. A typical primary or clustered index would physically order the data by the index key. As opposed to primary indexes, secondary or non-clustered indexes do not explicitly use the row ids in the table and use a mapping mechanism to locate the data.

Although uncompressed bitmap indexes involving relatively smaller datasets can be managed efficiently, large scale data sets require bitmap compression to reduce the index size. In addition, the compression technique used needs to enable logical operations over the compressed form of the bitmaps in order to minimize the overhead during query execution [4, 5, 30, 77]. The general approach is to utilize run-length encoding1 because it effectively compresses columns and does not require

1 Run-length encoding is the process of replacing repeated occurrences of a symbol by a single instance of the symbol and a count.

explicit decompression in query processing. The two popular run-length encoding based compression techniques are the Byte-aligned Bitmap Code (BBC) [5] and the Word-Aligned Hybrid (WAH) code [77].
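The basic idea behind both schemes, plain run-length encoding, can be sketched in a few lines (an illustrative toy, not BBC or WAH themselves, which add byte and word alignment on top of it):

```python
def run_length_encode(bits):
    """Replace repeated occurrences of a symbol by (symbol, count) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([b, 1])       # start a new run
    return [(s, c) for s, c in runs]

print(run_length_encode("0000011100"))   # -> [('0', 5), ('1', 3), ('0', 2)]
```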

The performance of run-length encoders depends on the presence of long runs of the same symbol. Reordering of the data has been successfully applied as a preprocessing step to increase the performance of run-length encoding [29, 50]. The overall compression ratio of the bitmaps is considerably improved when data is sorted. The improvements are especially significant in the first few columns, since they contain longer and fewer runs after sorting. However, the sorting has no effect on the later columns, which compress only as if they were in random order. The number of sorted runs grows exponentially with the ordinal position of the column. An immediate implication of this unbalanced compression is that queries involving the first columns gain a much more significant speedup compared to the others. We aim to achieve comparable performance improvement for all the attributes, and therefore for all queries over the dataset. An intuitive solution is a divide-and-conquer approach that distributes the attributes into smaller subsets and sorts the bitmaps for each partition independently. Since the bit positions in the bitmaps no longer represent row ids, a mapping table is needed to translate bit positions into row ids. With the use of mapping tables, one can generate as many sorted secondary bitmap indexes as needed.

With sorted secondary bitmap (SSB) indexes, there is a clear improvement in the compression performance of the bitmap columns. However, we cannot claim the same for the query execution performance. Even when the queries execute faster for a set of attributes sorted together, executing queries involving several partitions would introduce significant overhead. The reason is that each secondary bitmap needs to be translated into a primary bitmap using a one-to-one mapping. The same number of bits is encoded in each bitmap, as there is one bit per tuple in each bitmap. Therefore, in addition to the bitwise operations, we need to account for the mapping of every bit set to 1 in the partial answers.

In this chapter, we investigate whether it is possible to have the best of both worlds: improved compression and improved execution time for all the bitmap columns of the attributes in a table. The key to our answer is to remove all redundancy from the bitmaps.

We propose a new non-clustered bitmap (NCB) index organization utilizing advantages of horizontal and vertical partitioning with no significant overhead. With horizontal partitioning, tuples are segmented to ensure that partitions fit into main memory. With vertical partitioning, the dataset is divided into sets of attributes and one logical index is created for each set. As a result, each partition has its own secondary bitmap index. Instead of storing each bitmap column in the partition, we only store one bitmap column for the whole partition, and instead of having one bit per tuple in the bitmap, we only have one bit per distinct value. The proposed structure indexes the rank of a tuple in a sorted bitmap table. As opposed to clustered bitmaps, the bitmap columns are not stored and there is only one bitmap column per partition. The bitmap corresponds to an existence bitmap (EB) of the distinct ranks of the tuples, i.e. the rows from the full bitmap table present in the data, which after partitioning is considerably smaller than the number of tuples in the dataset. By controlling the number of attributes in each partition, we control the size of the full bitmap table and therefore guarantee faster query execution than primary bitmaps.

We formalize and describe the query execution logic over the proposed architecture to ensure correct results. Queries are executed against the full bitmap table, which is not stored but rather computed on the fly during query execution. The resulting bitmap for each partition is then translated into a primary bitmap. The query result for each partition is exactly the same as with traditional bitmaps. Primary bitmaps are ANDed between partitions to obtain the final answer bitmap.

We compare the performance of the proposed approach with the bitmap indexes supported by FastBit [74]. FastBit is an open source software tool that uses WAH compressed and verbatim bitmaps to support SQL-like queries, such as selection queries. The design choices of FastBit have proved effective when compared to other bitmap indexing methods [42].

The rest of this chapter is organized as follows. The proposed approach is de- scribed in Section 8.2. Experimental results are presented in Section 8.3. Finally, conclusions are presented in Section 8.4.

8.2 Approach

In this section, we present the proposed approach that combines the advantages of the representation of the sorted data and the efficient query execution of bitmap indexes. We first provide the technical motivation for the proposed approach and then formally describe the components of our system.

8.2.1 Technical motivation

Consider a relational table D with d attributes and n tuples. The cardinality, i.e., the number of distinct values in the range, of attribute Ai is denoted by ci. Often, the range of Ai is quantized into bins before creating the bitmap index. In those cases, ci refers to the number of bins for attribute Ai.

For traditional bitmaps, a bitmap table B is created using one column for each attribute bin and one row for each tuple. The number of rows in the bitmap table is denoted by n, and the number of columns Bc is given by the sum of the cardinalities:

B_c = \sum_{i=1}^{d} c_i

A bitmap table B is called a full bitmap table if the number of distinct tuples Bn is equal to the Cartesian product of all the values of the attributes in the table. The number of distinct tuples in a full bitmap table is given by the product of the cardinalities:

B_n = \prod_{i=1}^{d} c_i

For example, Figure 8.1 presents a full bitmap table (all possible distinct tuples present in the table) with 3 attributes, each with cardinality 3. Bc = 3 + 3 + 3 = 9, and the number of distinct tuples (which we call distinct ranks) is Bn = 3 · 3 · 3 = 27.

A full bitmap table B is sorted if the tuples are in lexicographic order of the columns. E.g., the table in Figure 8.1 is a sorted full table. The sorting order of the columns refers to the order in which the columns were evaluated in the ordering. Different sorting orders of the columns produce different permutations of the tuples under the same ordering criteria. Unless otherwise noted, the sorting order of the columns used in this paper is the order in which the attributes appear in the dataset.

In the full bitmap table B, a tuple t is identified by the set of bin numbers of each attribute value, i.e., t = {bt1, bt2, ..., btd}, where bti refers to the bin number of the value of attribute Ai. For instance, t20 = {1, 3, 2} in Figure 8.1.

Figure 8.1: Full Sorted Bitmap Table with 3 attributes and 3 bins per attribute. The white blocks represent 0s and black blocks represent 1s.

The rank π(t) of tuple t refers to the position of t in the full ordered bitmap table B and is given by:

\pi(t) = 1 + \sum_{i=1}^{d} \left( (c_i - b_{ti}) \cdot \prod_{j=i+1}^{d} c_j \right)

For example, the rank of t20 (denoted by r20) is 1 + (3−1)·(3·3) + (3−3)·3 + (3−2)·1 = 20.

The main consideration is that given the attribute cardinalities and their sorting order, the actual value of the attributes can be derived from the rank of the tuple.

In other words, the rank function is 1-1, and this property (i.e., the mapping from a given rank to an attribute value) is used in our query execution mechanism, which is described later.
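As a concrete illustration, the rank computation and its inverse can be sketched as follows. This is a hypothetical Java helper, not the dissertation's implementation; bin numbers are 1-based, as in Figure 8.1.

```java
import java.util.Arrays;

// Illustrative helpers for the rank function of Section 8.2.1.
public class RankDemo {

    // rank(t) = 1 + sum_{i=1}^{d} (c_i - b_ti) * prod_{j>i} c_j
    static long rank(int[] card, int[] bins) {
        long r = 1;
        long suffix = 1; // product of cardinalities of the following attributes
        for (int i = card.length - 1; i >= 0; i--) {
            r += (long) (card[i] - bins[i]) * suffix;
            suffix *= card[i];
        }
        return r;
    }

    // The rank function is 1-1, so it can be inverted to recover the bins.
    static int[] unrank(int[] card, long r) {
        int d = card.length;
        int[] bins = new int[d];
        long rem = r - 1;
        long suffix = 1;
        for (int c : card) suffix *= c;
        for (int i = 0; i < d; i++) {
            suffix /= card[i];
            bins[i] = card[i] - (int) (rem / suffix);
            rem %= suffix;
        }
        return bins;
    }

    public static void main(String[] args) {
        int[] card = {3, 3, 3};               // three attributes, cardinality 3 each
        System.out.println(rank(card, new int[] {1, 3, 2}));   // 20, as in the example
        System.out.println(Arrays.toString(unrank(card, 20))); // [1, 3, 2]
    }
}
```

Running the sketch on the tuple t20 = {1, 3, 2} from Figure 8.1 reproduces rank 20, and inverting rank 20 recovers the bins.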

Summarizing the Sorted Full Bitmap Table: Few elements are needed to summarize the sorted full bitmap table, as there is a clear pattern in the organization of the table when lexicographic order is used. For simplicity of the analysis, let us assume that all the attributes have the same cardinality.

Columns can be considered periodic, i.e., after a certain number of bits, the pattern repeats itself. The number of periods in a column depends on the number of attributes that precede the column. For columns 1-3 (first attribute) in Figure 8.1, there is only one period since the runs never repeat. For columns 4-6 (second attribute) there are three periods (c1), and for columns 7-9 (third attribute) there are 9 periods (c1 · c2). Note that the number of periods pi for an attribute Ai is given by the product of the cardinalities of the preceding attributes. The period length (number of tuples in a period) can be computed as Bn/pi. For example, for the columns of Attribute 2 in Figure 8.1, the period length is 27/3 = 9.

Notice that there is only a single run of 1s in each period. The total number of runs in a period ranges from 2 (when the period starts or ends with 1s) to 3 (when the run of 1s is in the middle of the period). The number of runs in a period is therefore at most 3.

The run length of 1s in each period depends on the cardinalities of the following attributes. For columns 7-9, the run length is 1 (as this is the last attribute), for columns 4-6 the run length is 3 (c3), and for columns 1-3 the run length is 9 (c2 · c3).

So far, the number of periods, the length of the periods, and the length of the runs of 1s are all the same for the columns of an attribute. The only difference between the columns is the position of the runs of 1s.

The summary of the bitmap table consists of the cardinalities of the attributes and the sorting order of the columns. Finding the bit value of a column for a given rank is actually no more complicated than translating a number from decimal to binary format. The main difference is that the base used to translate the number changes as we move between the attributes, i.e., the base is derived from the cardinalities of the following attributes.

To efficiently compute the bit values of a rank, three values are stored for each bitmap column i of attribute Aj:

• Period length (simply Period), which is the number of tuples before the pattern of the runs repeats itself:

Period = \prod_{k=j}^{d} c_k

• StartRun1, which is the start position of the run of 1s within a period:

StartRun1 = (c_j - i) \cdot \prod_{k=j+1}^{d} c_k

• EndRun1, which is the end position of the run of 1s within a period:

EndRun1 = (c_j - i + 1) \cdot \prod_{k=j+1}^{d} c_k - 1

Deriving the bit value b for column i of attribute Aj for rank r then reduces to testing whether the rank is within the run of 1s for the corresponding period:

b = (StartRun1 ≤ (r mod Period) ≤ EndRun1)

Figure 8.2: System Overview of the proposed approach.

Therefore, it is possible to store a summary of the full sorted bitmap table and answer queries about the values of a particular tuple efficiently from the rank of the tuple. If we store the ranks of the tuples and the mapping from the tuples to the original row ids, then we can answer selection queries by scanning the existing ranks in only one pass.
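Under the same assumptions (1-based bins and ranks), the per-column summary and the bit-value test can be sketched in Java; the class and field names are illustrative only, not the dissertation's code. Since ranks are taken as 1-based here, the offset within a period is computed as (r − 1) mod Period:

```java
// Sketch of the column summary for column i (1-based) of attribute j (1-based).
public class ColumnSummary {
    final long period, startRun1, endRun1;

    ColumnSummary(int[] card, int j, int i) {
        long suffix = 1;                          // prod_{k=j+1}^{d} c_k
        for (int k = j; k < card.length; k++) suffix *= card[k];
        period    = (long) card[j - 1] * suffix;  // Period = prod_{k=j}^{d} c_k
        startRun1 = (long) (card[j - 1] - i) * suffix;
        endRun1   = (long) (card[j - 1] - i + 1) * suffix - 1;
    }

    // Bit value of this column for rank r: test whether the rank falls inside
    // the single run of 1s of its period (0-based offset within the period).
    boolean bit(long r) {
        long offset = (r - 1) % period;
        return startRun1 <= offset && offset <= endRun1;
    }

    public static void main(String[] args) {
        int[] card = {3, 3, 3};
        // Attribute 2 in Figure 8.1: period length 27/3 = 9.
        ColumnSummary a2bin3 = new ColumnSummary(card, 2, 3);
        System.out.println(a2bin3.period);   // 9
        System.out.println(a2bin3.bit(20));  // true: t20 = {1, 3, 2} has bin 3 for A2
    }
}
```

For rank 20 (tuple {1, 3, 2}), the column of bin 3 of attribute 2 evaluates to 1 while the column of bin 1 evaluates to 0, matching Figure 8.1.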

8.2.2 System Overview

In this section, we describe the system components of the proposed approach.

Figure 8.2 shows the system overview that includes an index and a query engine.

The index has two parts: a mapping table and an Existence Bitmap (EB). The mapping table translates ranks into row identifiers. The EB indicates which ranks are present in the dataset. The query engine has three functional units: the query planner, the query executor, and the bitmap translator. The query planner identifies the bitmaps needed to answer the query and the partitions involved in the query. The query executor computes the values queried in the logical bitmap on the fly, as the individual bitmaps are not stored. The translator converts the secondary (logical) bitmap into a primary bitmap, i.e., the bit positions in the bitmap refer to the row identifiers in the original data. Each component is described in detail below.

8.2.3 Index Structure

The non-clustered bitmap index (NCB) has two elements: a mapping table and an Existence Bitmap.

The Mapping Table (MT) enables the translation from bit positions to row identifiers. This mapping mechanism is mandatory for any secondary index as, for most queries, it is necessary to access the actual data to answer the query. One MT is built for each partition. There are many possible implementations for the mapping table. A simple and update efficient way is to use a B+Tree. The size of the mapping table is O(N log N).

The Existence Bitmap (EB) indicates which tuples from the full ordered bitmap table exist in the partition. When the EB is dense, it is stored as a WAH compressed bitmap. When the EB is sparse, it is stored as a list of ranks (rankList). Queries are answered with a single scan of the EB. For a partition with Ns distinct ranks, the size of the EB is O(Ns log2 N) bits when stored as a rank list and O(Ns) when stored as a WAH compressed bitmap. Note that we use what is called one-sided compression, i.e., only the zeros are compressed in the EB. The reason is that the query is executed in place over the EB for improved performance.

To create the NCB index, we generate the summary of the full sorted bitmap table using the order of the attributes and the attribute cardinalities. Then, by scanning the dataset we compute the rank of each tuple. The rank is set in the EB and the row id is inserted into the mapping table using the rank as the key value.
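The construction step above can be sketched as follows, with a java.util.BitSet standing in for the compressed EB and a TreeMap standing in for the B+Tree mapping table; both stand-ins, and the rank helper, are assumptions for illustration rather than the dissertation's code:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;

// Sketch of NCB index construction for one partition.
public class NcbBuild {
    final BitSet eb = new BitSet();                          // which ranks exist
    final TreeMap<Long, List<Integer>> mt = new TreeMap<>(); // rank -> row ids

    // Rank function as defined in Section 8.2.1 (1-based bins).
    static long rank(int[] card, int[] bins) {
        long r = 1, suffix = 1;
        for (int i = card.length - 1; i >= 0; i--) {
            r += (long) (card[i] - bins[i]) * suffix;
            suffix *= card[i];
        }
        return r;
    }

    // Scan the dataset: set each tuple's rank in the EB and record its row id
    // in the mapping table, keyed by the rank.
    void build(int[] card, int[][] tuples) {
        for (int row = 0; row < tuples.length; row++) {
            long r = rank(card, tuples[row]);
            eb.set((int) (r - 1));                           // one bit per distinct rank
            mt.computeIfAbsent(r, k -> new ArrayList<>()).add(row);
        }
    }

    public static void main(String[] args) {
        NcbBuild idx = new NcbBuild();
        // Toy partition: two copies of {1,3,2} (rank 20) and one {3,1,3} (rank 7).
        idx.build(new int[] {3, 3, 3}, new int[][] {{1, 3, 2}, {1, 3, 2}, {3, 1, 3}});
        System.out.println(idx.eb.cardinality());  // 2 distinct ranks for 3 tuples
        System.out.println(idx.mt.get(20L));       // [0, 1]: both rows with rank 20
    }
}
```

The toy run shows the key space saving: duplicate tuples collapse to a single bit in the EB, while the mapping table retains all their row identifiers.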

8.2.4 Query Engine

A selection query is a set of conditions of the form A op v, where A is an attribute, op ∈ {=, <, ≤, >, ≥} is the operator, and v is the value queried. We refer to queries that use the equality operator (A = v) for all conditions as point queries, and to queries using a between condition (v1 ≤ A ≤ v2) as range queries.

In a bitmap-based system, a query can be thought of as a set of bitmaps for a subset of the attributes. The queried values, vi, are quantized to identify the corresponding bitmap bins, bi. If the bitmaps correspond to the same attribute, they are ORed together; otherwise they are ANDed together.
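As a toy illustration of this combination rule, using java.util.BitSet for the bitmap columns (the bin bitmaps below are made up for the example):

```java
import java.util.BitSet;

// OR bins of the same attribute, AND the per-attribute results.
public class QueryCombine {
    public static void main(String[] args) {
        // Hypothetical bitmap columns over 6 tuples (bit t set = tuple t in the bin).
        BitSet a1bin1 = BitSet.valueOf(new long[] {0b001011}); // tuples 0,1,3
        BitSet a1bin2 = BitSet.valueOf(new long[] {0b010100}); // tuples 2,4
        BitSet a2bin1 = BitSet.valueOf(new long[] {0b011001}); // tuples 0,3,4

        // Query: (A1 = v1 OR A1 = v2) AND A2 = w1
        BitSet result = (BitSet) a1bin1.clone();
        result.or(a1bin2);   // same attribute -> OR
        result.and(a2bin1);  // different attribute -> AND
        System.out.println(result); // {0, 3, 4}
    }
}
```

The hardware-supported bitwise OR/AND over whole words is what makes this combination step fast in practice.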

The Query Planner quantizes the query and transforms it into an execution plan by identifying the attribute bins that need to be accessed and determining the number of partitions and/or segments involved in the query.

The Query Executor takes the parsed query and scans the corresponding EB. For simplicity of the discussion and without loss of generality, let us assume that the EB is stored as a rank list. In contrast to traditional bitmaps, query execution is done row-wise, not column-wise. When a rank is read from the EB, the query executor decides whether that rank satisfies all query conditions or not. Once the rank fails one of the query conditions, the rank is discarded and no further evaluation is needed. Moreover, since the columns queried are the same for all the ranks, we can take advantage of caching of the column results. We enable caching by returning, together with the bit value of the rank for the given column, another rank, called validResult, which is simply the end position of the run in which the queried rank falls. When the validResult rank is greater than the next evaluated rank, we can use the cached bit value and avoid the rank comparisons. During query parsing, attributes in the query are arranged in the same order as the sorting order to take advantage of the caching mechanism, as earlier columns have large validResult ranks. When the columns queried involve only the last attributes in the sorting order, caching has no effect.

Algorithm 5 presents a simplified pseudocode of the query processing algorithm.

Here we assume only one segment in the table.

Execute(P, Q)
 1: B = all-zeros bitmap
 2: cachedResult = {0}^qDim
 3: validResult = {−1}^qDim
 4: for each rank r in the EB
 5:   isAnswer = true
 6:   for each attribute Ai in Q
 7:     if r > validResult[i]
 8:       p = r / Ai.Period
 9:       start1s = Ai.Period · p + Ai.StartRun1
10:       end1s = Ai.Period · p + Ai.EndRun1
11:       if start1s ≤ r ≤ end1s
12:         cachedResult[i] = 1
13:         validResult[i] = end1s
14:       else if r < start1s
15:         cachedResult[i] = 0
16:         validResult[i] = start1s − 1
17:       else
18:         cachedResult[i] = 0
19:         validResult[i] = start1s + Ai.Period − 1
20:     if cachedResult[i] = 0
21:       isAnswer = false
22:   if isAnswer
23:     C = translate(r)
24:     B = B OR C
25: return B

Algorithm 5: Query execution algorithm. P is the partition and Q is the query representation.

In Line 1, the primary bitmap B is initialized as an all-zeros bitmap. In Lines 2 and 3, cachedResult and validResult are set to 0 and −1, respectively. Then, the EB is scanned (Line 4). For each rank r, the variable isAnswer is initialized to true (Line 5) and is used to indicate whether the tuples with rank r satisfy the query. Then each attribute in the query is evaluated (Line 6). If the cached result is not valid for rank r, then the value is derived (Line 7). The period p in which rank r lies is identified (Line 8), and the start and end positions of the run of 1s within that period are computed (Lines 9-10). If r lies within the run of 1s (Line 11), then the result for this column is set to 1 (Line 12) and the cached value remains valid until the end of the run of 1s (Line 13). However, if r lies outside the run of 1s, then the answer is set to 0, and r is evaluated to decide whether it falls within the first run of 0s (in which case the cachedResult is valid until the start position of the run of 1s) or within the second run of 0s (in which case the cachedResult is valid until the start of the run of 1s in the next period) (Lines 14-19). If the bit value is 0 for a given attribute, then the loop stops and r is not an answer to the query (Lines 20-21). If isAnswer is still true after evaluating all the queried columns, then the rank is translated into a primary bitmap C and ORed into B (Lines 22-24).
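A runnable sketch of Algorithm 5 for a single partition and segment is given below. The data structures are stand-ins rather than the dissertation's implementation (a sorted rank array for the EB, a BitSet for the primary bitmap, a Map for the mapping table), and ranks are taken as 1-based, so the period index and run positions are shifted accordingly; variable names mirror the pseudocode:

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch of the Execute(P, Q) loop with column-result caching.
public class NcbExecute {
    // period/startRun1/endRun1 hold one column summary per queried attribute.
    static BitSet execute(long[] period, long[] startRun1, long[] endRun1,
                          long[] ebRanks, Map<Long, int[]> mappingTable) {
        int q = period.length;
        BitSet B = new BitSet();                      // line 1: all-zeros answer bitmap
        int[] cachedResult = new int[q];              // line 2
        long[] validResult = new long[q];             // line 3
        Arrays.fill(validResult, -1);
        for (long r : ebRanks) {                      // line 4: scan the EB in order
            boolean isAnswer = true;                  // line 5
            for (int i = 0; i < q && isAnswer; i++) { // lines 6, 20-21: early stop
                if (r > validResult[i]) {             // line 7: cached value expired
                    long p = (r - 1) / period[i];     // line 8 (1-based ranks here)
                    long start1s = period[i] * p + startRun1[i] + 1; // lines 9-10,
                    long end1s   = period[i] * p + endRun1[i] + 1;   // shifted to 1-based
                    if (start1s <= r && r <= end1s) {        // line 11: inside run of 1s
                        cachedResult[i] = 1;                 // line 12
                        validResult[i] = end1s;              // line 13
                    } else if (r < start1s) {                // lines 14-16: first run of 0s
                        cachedResult[i] = 0;
                        validResult[i] = start1s - 1;
                    } else {                                 // lines 17-19: second run of 0s
                        cachedResult[i] = 0;
                        validResult[i] = start1s + period[i] - 1;
                    }
                }
                if (cachedResult[i] == 0) isAnswer = false;
            }
            if (isAnswer)                                    // lines 22-24: translate rank
                for (int row : mappingTable.get(r)) B.set(row);
        }
        return B;                                            // line 25
    }

    public static void main(String[] args) {
        // Query A1 = bin 1 AND A2 = bin 3 over Figure 8.1's table with tuples
        // {1,3,2} (rows 0,1, rank 20) and {3,1,3} (row 2, rank 7).
        Map<Long, int[]> mt = new HashMap<>();
        mt.put(7L, new int[] {2});
        mt.put(20L, new int[] {0, 1});
        BitSet ans = execute(new long[] {27, 9}, new long[] {18, 0},
                             new long[] {26, 2}, new long[] {7, 20}, mt);
        System.out.println(ans); // {0, 1}
    }
}
```

In the toy run, rank 7 fails the A1 condition and is discarded without evaluating A2, while rank 20 satisfies both conditions and its two rows are set in the answer bitmap.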

The Bitmap Translator is given a rank and returns a primary bitmap, i.e., a bitmap in the physical order of the table that has set bits for the tuples with that rank. The bitmap translator is the only process that accesses the mapping table (B+Tree). For clarity of the presentation, we made C a bitmap, but for efficiency we implemented it at the word level, and the OR is done only between B and the non-zero words of C. Finally, the primary bitmap B is returned as the answer.

The pseudocode in Algorithm 5 only considers a partition with one segment. Having several segments would increase the number of bitmaps returned, as no operations are done across segments. When more than one partition is involved in the query, the translated bitmaps B from each partition are ANDed together before returning the final answer to the user.

8.2.5 Partitioning

For NCB we have two types of partitioning: vertical and horizontal partitioning.

Vertical partitioning, A = V(D), refers to a subset of the attributes in D, such that |A| ≤ d.

Horizontal partitioning, TA = H(D), refers to a condition to distribute the tuples in D for the attributes in A. Horizontal partitioning can be done, for example, on ranges of some attribute, the row identifier, or hash values.

A Partition P = {A, TA} is a pair of a set of attributes A and a horizontal partitioning criterion TA. Each set of tuples produced by TA is called a segment. For simplicity, we assume that each attribute is in exactly one partition and there is no overlap between partitions; therefore, the union of all partitions produces D. For NCB, the goal of horizontal partitioning is to ensure that each segment fits into main memory, and the goal of vertical partitioning is to guarantee the performance of the ordering for all the attributes in the partition.

The following criteria are used to decide how many vertical partitions to build for NCB. The improved performance of NCB relies on processing smaller EBs than the traditional bitmaps. Two factors can make the EB have fewer ranks/words than the verbatim bitmaps: the number of duplicates in the dataset, and the number of possible distinct ranks in the full bitmap table.

Without considering the time involved in performing the bitwise logical operations over the bitmaps, but focusing solely on the time required to scan one bitmap column, the optimal case is having a small number of distinct ranks in the EB compared to the number of rows in the dataset, as traditional bitmaps would have one bit per row. Thus, the number of distinct ranks in the EB is set to be at most n/k, where k is a user-defined parameter that guarantees improved query execution times, as the evaluation of one rank in the EB simultaneously evaluates k dataset tuples on average.

The size in words of a verbatim bitmap is n/W, where n is the number of rows in the table and W is the word size, e.g., 32 bits². For WAH compressed bitmaps, the maximum size (when the column is incompressible) is n/(W − 1), as 1 bit is used in each word to indicate the type of word. In theory, k = W would guarantee faster query execution time than traditional bitmaps in one partition, at the cost of producing more partitions.

² For our implementation in Java, the word size is set to 31 bits since Java does not have unsigned data types.

In practice, k does not need to be a large number, since the bitwise operations within the bitmap columns also need to be factored in. In our experiments we set k to 5. We want to keep the number of partitions low because of the space requirement of the mapping tables.

Another constraint imposed on NCB is that the number of bits required to encode a given rank in the full bitmap table cannot exceed W + ⌈log2 W⌉ bits. In our experiments W was set to 31 because we used integers to represent the word number in the bitmap.

For all real datasets, our constraints were satisfied using 2 partitions with an equal number of attributes. Since in our case the query patterns are unknown, we produced equi-sized partitions to balance the performance of all queries and avoid worst-case execution times for some queries. When the query patterns are known, attributes that are often queried together can be placed in the same partition to avoid the overhead of translating bitmaps and ANDing the partial results. Moreover, attributes could be replicated into two or more partitions as long as the constraints are met. Another consideration is to place attributes in their order of query frequency within the partition since, for point queries, earlier columns have a slight advantage over subsequent columns, as evidenced in the experimental results.

8.3 Experimental Results

We performed experiments over both real and synthetic datasets. For the real datasets, we used HEP data, which comes from high energy physics experiments, with over 2 million rows and 12 attributes with cardinality ranging from 2 to 12 bins, and three other datasets from the UCI data repository [1]. The Adult dataset is census data with 48,842 rows and 14 attributes with cardinality ranging from 2 to 42. For the Adult dataset, the 5 continuous attributes were quantized into 3 bins using equi-populated partitions. Nursery is data derived from a hierarchical decision model originally developed to rank applications for nursery schools; it has 12,960 rows and 8 attributes with cardinality ranging from 3 to 6. In the poker hand dataset (Poker), each record is an example of a hand consisting of five playing cards drawn from a standard deck of 52; it has 1,025,010 rows and 10 attributes, half with cardinality 4 (suit) and half with cardinality 13 (rank). For the synthetic datasets, we generated uniform (UNI) and Zipf distributions for datasets with varying numbers of attributes (5 to 50), cardinality varying from 5 to 50, and numbers of rows from 1 million to 5 million. Zipf distributions were generated using s = 1 (Zipf1) and s = 2 (Zipf2) as the value of the exponent characterizing the distribution.

Point queries are generated by randomly sampling 100 tuples from the dataset. Range queries are selected the same way, but the attribute value of the tuple is expanded in both directions (higher and lower values) until the number of bitmaps queried equals the specified percentage of cardinality-selectivity. When not specified, point queries and range queries refer to queries over all the attributes in the dataset.

The implementation of the Non-Clustered Bitmaps (NCB) is done in Java. We compare query execution time with FastBit. The queries for FastBit are specified as selects of the first attribute in the dataset, and the time reported corresponds to the CPU time output by the program. This CPU time (as opposed to the elapsed time) does not include I/O time. The experiments are run on a computer with a 3.00GHz CPU, 2GB RAM, and the Linux Ubuntu 7.10 operating system. To measure query execution time for NCB and FastBit, we run each set of 100 queries 5 times and drop the highest value to avoid outliers produced by process preemption or Java virtual machine garbage collection. The remaining 4 runs are averaged together and reported in this section.

Next, we present the experiments conducted over synthetic data to analyze and evaluate the performance of NCB. Then, we present the query execution times for real datasets.

8.3.1 Number of Rows

Figure 8.3 shows the average query execution time as the number of rows increases (from 1 to 5 million). Note that FastBit execution time is linear in the number of rows in the dataset, as the size of each bitmap and the number of hits in the queries increase. However, NCB execution time does not depend on the number of rows but rather on the number of distinct ranks, which remains constant since the number of attributes and their cardinality do not change.

8.3.2 Number of Attributes

Figure 8.4 shows the average query execution time as the number of attributes increases. For this experiment, NCB is built using 5 attributes per partition. As can be seen, both approaches grow linearly but the slope of NCB is smaller than FastBit.

Figure 8.3: Execution of point queries for datasets with 5 attributes each with cardinality 5, different distributions, and varying number of rows.

8.3.3 Attribute Cardinality

Figure 8.5 shows the average query execution time as the cardinality increases. NCB performs considerably faster than FastBit for all distributions. It is worth noting that, depending on the cardinality, NCB uses 1 or 2 partitions. For UNI and Zipf1, two partitions are created after cardinality 20 and 30, respectively. For Zipf2, the number of distinct ranks never violates the partition constraint, and therefore only one partition is used for Zipf2 for all cardinalities. The number of distinct ranks in the dataset for the three distributions is presented in Table 8.1.

8.3.4 Number of Distinct Ranks

Figure 8.6 shows the average query execution time as the number of distinct ranks in the EB increases. As can be seen, the execution time is linear to the number of distinct ranks when the EB is stored as a rankList, however, when the EB is stored

Figure 8.4: Execution of point queries for datasets with varying number of attributes each with cardinality 5, different distributions, and 1M rows.

as a WAH compressed bitmap, the query execution time increases until the EB is no longer compressible and then stabilizes. Recall that we only compress the zeros in the EB.

8.3.5 Number of Segments

Figure 8.7 shows the average query execution time as the number of horizontal partitions (segments) used to create the NCB increases. Partitions are built with 1M, 500K, 250K, 200K, 125K, and 100K rows, to produce 1, 2, 4, 5, 8, and 10 segments, respectively. As can be seen, the execution time is linear with respect to the number of segments.

8.3.6 NCB Performance over Real Datasets

Figure 8.8 presents the average query execution time as the number of queried attributes increases for the real datasets. For these experiments, query attributes are selected in the same order as the dataset attributes. For example, query dimensionality

Figure 8.5: Execution of point queries for datasets with 5M rows, 5 attributes, and different distributions with varying cardinality.

2 means the first two attributes of the dataset are queried. In general, the execution time of NCB decreases as query dimensionality increases because queries are executed row-wise and a rank is discarded as soon as one query condition is not satisfied. Since two partitions are used for each dataset, when the query asks for the first 8 attributes, for the Adult data for example, there is a sudden increase in query execution time. The reason is that partitions are built using 7 attributes for the Adult data; therefore, when a query asks for 8 attributes there is an extra cost of translating two bitmaps and ANDing them together to produce the final answer.

Figure 8.9 presents the same set of experiments, except that the attributes are randomly selected from the dataset. In this case a query with dimensionality 2 can query any two attributes in the data, including two attributes from different partitions. As can be seen when comparing the two graphs, earlier columns in the sorting order have a slight advantage in query execution time over

Cardinality        UNI      Zipf1    Zipf2
     5           2,978      3,125    2,978
    10         100,000     95,293   33,299
    15         758,302    457,842   74,029
    20       2,526,975    974,746  109,797
    30       4,518,787  1,931,186  162,716
    40       4,878,900  2,613,087  199,400
    50       4,960,228  3,082,957  224,688

Table 8.1: Number of distinct ranks for different data distributions as attribute car- dinality increases. The datasets have 5 attributes and 5M rows.

latter columns. Nevertheless, the execution time continues to be consistently faster than FastBit.

                     Whole Dataset                        NCB (P=2) (Aggregated)
                                           Distinct   Possible    Distinct   EB Size
Dataset      Rows         Bn      log2 Bn   Ranks      Ranks       Ranks      (KB)
HEP       2,173,762  505 × 10^9     39     353,889    2,180,178    12,063     30.51
Adult        48,842  148 × 10^9     38      24,057    2,245,320     9,345     24.2
Nursery      12,960  115 × 10^3     17      12,960          792       294      0.35
Poker     1,025,010  380 × 10^6     29   1,022,771       45,968    45,084      5.99

Table 8.2: Size statistics for real datasets and NCB with two partitions.

Figure 8.10 shows the query execution time of range queries when the percentage of cardinality queried is increased from 10 to 50%. NCB execution time is faster than FastBit for all four datasets.

8.3.7 Index Size

Figure 8.11 shows a comparison between the random-ordered compressed bitmaps (FastBit), the sorted compressed bitmaps (FastBit Sorted), and NCB for the real

datasets. The NCB size is also presented for each component: the EBs and the MTs

for each partition. As can be seen, the NCB size is completely dominated by the size

of the mapping tables (MT) and in most cases the NCB size is worse than the ordered

WAH compressed bitmaps (FastBit Sorted) but better than the total size of random

ordered WAH compressed bitmaps.

Table 8.2 shows the effect of vertical partitioning on the size of the full bitmap table. By creating 2 partitions, the size of the full bitmap is not reduced by half, but rather by the product of the cardinalities of the attributes involved. This is why a low number of partitions satisfies the partition constraints that guarantee faster query execution of NCB compared against random-ordered compressed bitmaps.

Figure 8.7: Execution time of point queries for datasets with 5 attributes, each with cardinality 5, 1M rows, and different distributions. NCBs are built with varying number of horizontal partitions (segments).

Figure 8.8: Execution of point queries for real datasets as the query dimensionality increases. Query attributes are selected in the same order of the dataset attributes.

8.3.8 Comparison with Secondary Sorted Bitmaps (SSB)

With this set of experiments we compare the performance of NCB against SSB, where the sorted bitmap columns are explicitly stored. To produce the effect of SSB, we partitioned each dataset using the same partitions as NCB. Then we sorted the data physically and used FastBit to create WAH compressed bitmaps over this dataset. For these experiments, we did not use queries involving attributes across

Figure 8.9: Execution of point queries for real datasets as the query dimensionality increases. Query attributes are randomly selected.

partitions. We ran the same set of queries for NCB and SSB and compared execution time.

Figure 8.12 shows the query execution time of point queries with varying dimensionality for SSB and NCB. For this experiment we ran point queries with dimensionality ranging from 1 to the number of attributes in the partition. We ran queries over the two partitions and averaged the running times. As can be seen, NCB performs better than SSB because there is no redundancy in the EB.

Figure 8.13 shows the query execution time of range queries with varying percentage of cardinality queried. Again, both techniques show the same trend, but NCB performs consistently faster than SSB.

In Table 8.3 we compare the sizes (in KB) of the two approaches. The reported sizes are only for the EB of NCB and the bitmap columns of SSB; mapping table sizes are omitted since they would be the same for both approaches. As can be seen, the EB's compact representation is more space efficient than the compressed bitmaps, even when they are sorted.

Figure 8.10: Execution of range queries with varying percentage of attribute cardinalities queried.

8.4 Conclusion

In this chapter we propose a novel bitmap index design with vertical and horizontal partitioning that serves as the basis for non-clustered bitmap indexes. As opposed to traditional bitmap indexes, where one bitmap column is stored for each attribute value and each tuple is represented by one bit in each bitmap, the proposed scheme generates only one bitmap column for a set of attributes (partition) to encode the

Figure 8.11: Index Size Comparison for FastBit, Sorted FastBit and NCB for real datasets.

rank of the tuples in the full sorted bitmap table. The full sorted bitmap table is not stored, but rather logically queried using the rank of the tuples.

The proposed approach is compared against Sorted Secondary Bitmap Indexes, Sorted Clustered Bitmap Indexes, and FastBit, a recent bitmap index implementation. Experiments show that query execution time is greatly reduced when NCB is used.

One positive side effect of having only one EB per partition as opposed to one bitmap per column in the partition is that the update cost is greatly reduced. In sorted secondary bitmaps, in addition to updating the mapping table, the bitmap

                         SSB                                NCB
Dataset    Partition 1  Partition 2   Total    Partition 1  Partition 2   Total
Adult         64.49        15.51       80          3.95        26.56      30.51
Hep           37.03       118.06      155.09      21.14         3.06      24.2
Nursery        4.98         1.46        6.44       0.29         0.05       0.34
Poker        208.64       447.6       656.24       1.41         4.58       5.99

Table 8.3: Bitmap Size comparison of SSB and NCB (in KB).

Figure 8.12: Execution of point queries with varying dimensionality in sorting order for Sorted Secondary Bitmaps (SSB) and Non-clustered Bitmaps (NCB).

indexes need to be rebuilt with every update or batch update, as the insertion of a new bit in the middle of the bitmap would shift the bit positions and the alignment of the words would no longer be valid. In NCB, the EB is stored as a WAH compressed bitmap on the ranks. When the rank of the new tuple is already in the partition, only the mapping table needs to be updated. Even when the rank of the tuple is new and the word for the rank is compressed in the EB, breaking the run increases the size of the EB by at most 3 words. A rebuild is avoided because the bit position for the rank is already in the EB.

Figure 8.13: Execution of range queries with varying percentage of attribute cardinalities queried for Sorted Secondary Bitmaps (SSB) and Non-clustered Bitmaps (NCB).

CHAPTER 9

Conclusions and Future Research

The work presented herein enhances bitmap indexes for the efficient management of large scale data. Bitmap indexes are at the lowest level of representation. Ones and zeros represent whether a given object satisfies a given property or belong to a certain class. In addition, bitmaps provide a very efficient way of computing and combining partial results by using bitwise operations supported by hardware. We have addressed several important issues as the first steps toward applying bitmap indexes as part of a general and comprehensive framework. In Chapter 3, we extended bitmaps to handle incomplete data, i.e. tuples with missing attribute values. The extensions are easy to apply and require little modifications to the existing query processing algorithms. But the most important thing is that the extended bitmaps continue to exhibit linear per- formance for query execution time with respect to database and query dimensionality, i.e. efficiently handling missing data. In Chapter 4, we designed an integrated scheme to reorganize the data for improved compression ratio using an adaptive algorithm that improves the previously proposed Gray code ordering. The reordering algorithm proposed is a preprocessing step, in-place and linear in the size of the database. The reordered compressed bitmaps are up to an order of magnitude times smaller than

the original compressed bitmaps. This reduction in compressed size translates directly into improved query execution times: reordered bitmaps require up to 7 times less query time than the corresponding original WAH-compressed bitmaps. In Chapter 5, we addressed the inability of compressed bitmaps to directly locate a row identifier from its bit position in the bitmap. We proposed an approximate bitmap encoding based on hashing that improves the processing time of highly selective queries by up to 3 orders of magnitude compared to WAH-compressed bitmaps. In Chapter 6, we proposed a complete framework to execute similarity searches over the bitmap indexes. This framework requires no change to the current bitmap structure and produces meaningful results in high dimensional spaces. There is no need to access the actual data, and the index provides enough granularity to efficiently resolve ties. With Bitmap-NN, query processing remains scalable with respect to the data dimensionality, and it is possible to execute a wide range of similarity searches, such as complex similarity search, projected search, and constrained similarity search. In Chapter 7, we addressed the poor update performance of bitmap indexes by encoding non-existing rows as zeros in the compressed bitmaps. With this, the update cost of bitmap indexes is comparable to that of B+-trees. In Chapter 8, we proposed a new indexing mechanism for non-clustered bitmaps that provides three advantages: faster query execution, smaller space requirements, and improved update times. The reason is that only one bitmap vector is stored for several attributes, and the bitmaps are logically sorted rather than physically stored.
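The intuition behind the Chapter 4 reordering can be shown with a toy example: sorting rows by their binary-reflected Gray code rank makes consecutive rows differ in fewer bit positions, which lengthens the runs in every bitmap column and thus helps run-length encoders such as WAH. The sketch below is illustrative only; `gray_rank`, `count_runs`, and the sample rows are assumptions for the demo, not the adaptive algorithm itself.

```python
def gray_rank(bits):
    # Read a row's bit vector as a Gray code word and return its
    # binary-reflected rank, so rows can be sorted in Gray-code order.
    rank, b = 0, 0
    for bit in bits:
        b ^= bit
        rank = (rank << 1) | b
    return rank

def count_runs(column):
    # Number of runs in one bitmap column; fewer runs means
    # better run-length compression.
    return 1 + sum(1 for a, b in zip(column, column[1:]) if a != b)

# Each tuple is one row of a 3-column bitmap index.
rows = [(0, 1, 1), (1, 0, 0), (0, 0, 1), (1, 1, 0), (0, 1, 0), (1, 0, 1)]
reordered = sorted(rows, key=gray_rank)

before = sum(count_runs(col) for col in zip(*rows))       # 15 runs
after = sum(count_runs(col) for col in zip(*reordered))   # 9 runs
```

Even on this tiny instance the total run count drops from 15 to 9; on real datasets the adaptive ordering additionally decides, per bit position, whether Gray or lexicographic order yields longer runs.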

There is no doubt that we have come far from the initial structure of bitmap indexes.

However, there is still a large amount of future work ahead of us. In our future work we plan to use the Bitmap-NN framework as the basis for data mining tasks such as clustering and classification. For ranked Non-clustered Bitmap Indexes we want to

explore the optimal column ordering and formalize strategies for deciding what would be an appropriate, or even the best, vertical partitioning. In addition, we want to extend our non-clustered bitmap indexes to efficiently support more complex types of queries, such as nearest neighbor queries. Finally, we want to build a hybrid approach that combines our enhanced bitmap indexes with other types of indexes that better handle floating point numbers and continuous attributes, in order to maintain query precision and avoid expensive candidate evaluation.
