Investigating Design Choices Between Bitmap Index and B-Tree Index for a Large Data Warehouse System
Proceedings of the 8th WSEAS International Conference on APPLIED COMPUTER SCIENCE (ACS'08)
ISSN: 1790-5109, ISBN: 978-960-474-028-4

Morteza Zaker, Somnuk Phon-Amnuaisuk, Su-Cheng Haw
Faculty of Information Technology, Multimedia University, Malaysia

Abstract: Building indexes on a database is common practice, and it has an important impact on query performance, especially in large databases such as a Data Warehouse, where queries are usually very complex and ad hoc. If a proper index structure is chosen, query response time can be reduced. Until now, there has been no definite guideline for Data Warehouse analysts to choose the appropriate index. According to conventional wisdom, the Bitmap index is the preferred indexing technique when the indexed attributes have few distinct values (i.e., low cardinality). Query response time is expected to degrade as the cardinality of the indexed columns increases, because the index becomes larger. The B-tree index, on the other hand, is considered good for columns of high cardinality because of its indexing and retrieval mechanisms. In this paper, we show that this may not be true under certain circumstances. Experimental results indicate that although column cardinality determines the index file size, it does not determine the query processing time. Moreover, our results indicate that the Bitmap index is faster than the B-tree index on a large dataset with multi-billion records.

Key-Words: Data warehouse, Bitmap index, B-tree index, Query processing

1 Introduction

A Data Warehouse (DW) is the foundation for Decision Support Systems (DSS), with a large collection of information that can be accessed through an On-line Analytical Processing (OLAP) application. This large database stores current and historical data that come from several external data sources [1, 2, 3]. The queries built on a DW system are complex and usually include join operations that incur computational overhead. This overhead increases the response time, especially when queries are performed on a large dataset. To increase performance, DW analysts commonly use solutions such as indexes, summary tables and partitioning mechanisms [4].

Database vendors support various index techniques, such as the Bitmap index [4], B-tree [3, 5, 6], Projection index [7], Join bitmap index [8], Range-based bitmap index [9] and so on. The Bitmap index, for example, is advisable for a system whose data are not frequently updated by many concurrent processes [10, 11, 12]. This is mainly because the Bitmap index stores a large amount of row information in each block of the index structure. In addition, since Bitmap index locking is at the block level, any insert, update, or delete activity may result in locking an entire range of values [13]. The B-tree index, on the other hand, is good for a system that is frequently updated, because it does not need re-balancing as frequently as other self-balancing search trees. In addition, all leaf blocks of the tree are at the same depth [6]. Thus, choosing the proper type of index structure has a big impact on the DW environment.

The main problem is that there is no definite guideline for DW analysts to choose appropriate indexing methods. According to common practice, the Bitmap index is best suited to columns with low cardinality and should only be considered for low-cardinality data [1, 3, 6]. Strohm [6] concludes that the advantages of using Bitmap indexes are greatest for low-cardinality columns, i.e., columns which have a small number of distinct values compared to the number of rows in the table. If the number of distinct values of a column is less than 1% of the number of rows, then the column is a candidate for a Bitmap index. This assumption may have been correct to some extent for the earlier algorithms and older machines used by database software and hardware, respectively, but as the usage of data is exploding, it may no longer be applicable.

In this paper, we demonstrate that: (i) a Bitmap index on a column with high cardinality is more efficient than a B-tree index, (ii) the response time of multi-dimensional queries does not follow from the time needed for one-dimensional queries, on either the Bitmap index or the B-tree index, and (iii) queries using the Bitmap index that are executed over a range of predicates are affected by the distribution of the data, but not by the cardinality.

The rest of the paper is organized as follows. Section 2 presents the background studies on the Bitmap index, the B-tree index and cardinality concepts. Section 3 defines a case study and performance methodology with a set of queries to compare the performance of the Bitmap index and the B-tree index. Section 4 discusses the experimental results. Lastly, Section 5 concludes the paper.

2 Background and related work

2.1 Bitmap index

The Bitmap index is built to enhance performance on various query types, including range, aggregation and join queries. It is used to index the values of a single column in a table. A Bitmap index consists of a sequence of bitmaps, one for each key value, so the number of bitmaps equals the number of distinct values of the column. Each row in the Bitmap index is numbered sequentially, starting from 0. If a bit is set to "1", the row with the corresponding RowId contains the key value; otherwise the bit is set to "0".

To illustrate how Bitmap indexes work, we show an example based on the one given by E. O'Neil and P. O'Neil [10]. "Table 1 shows a basic Bitmap index on a table containing 9 rows, where Bitmap index is to be created on column C with integer values ranging from 0 to 3. We say that the column cardinality of C is 4 because it has 4 distinct values. Bitmap index for C contains 4 bitmaps, shown as B0, B1, B2 and B3 corresponding to the value represented. For instance, in the first row where RowId = 0, column C has the value 2. Hence, in column B2, the bit is set to "1", while the rest of bitmaps bits are set to "0". Similarly, for the second row, bit of B1 is "1" because the second row of C has the value 1, while the corresponding bits of B0, B2 and B3 are all "0". This process repeats for the rest of the rows [10]."

Table 1: Basic Bitmap index, adopted from [10]

RowId  C  B0  B1  B2  B3
0      2  0   0   1   0
1      1  0   1   0   0
2      3  0   0   0   1
3      0  1   0   0   0
4      3  0   0   0   1
5      1  0   1   0   0
6      0  1   0   0   0
7      0  1   0   0   0
8      2  0   0   1   0

2.2 B-tree index

A B-tree index [5] stores index values and pointers to other index nodes in a recursive tree structure. Data can be retrieved easily by following the pointers. The top-most level of the index is known as the root, while the lowest level is known as the leaf level. All the levels in between are called branches (internal nodes). Both the root and the branches contain entries that point to the next level in the index. Leaf nodes consist of the index key and pointers to the physical location in which the corresponding records are stored. According to some research studies [3, 11], the B-tree index has features that make it a good choice for columns with high-cardinality values, especially in DW design.

2.3 Cardinality

In set theory, cardinality refers to the number of members in a set. In database theory, the cardinality of a table refers to the number of rows it contains, and the same meaning is used for OLAP systems. From a data warehousing point of view, on the other hand, cardinality usually refers to the number of distinct values in a column. Generally, there are four levels of cardinality: low, normal, high and very high. Low cardinality refers to columns which have very few unique values. Normal cardinality refers to columns which have sporadic unique values. High cardinality refers to columns which have a large number of distinct values. Very high cardinality, which the database community has recently also called full cardinality, refers to columns which have a very large number of distinct values.

2.4 Related Work

Several recent research studies have investigated the main limitations of the Bitmap index. New indexing strategies that apply compression schemes to bitmaps require less space and provide performance gains [10, 12, 14, 17, 18, 19].

The authors of [16, 17] showed that WAH compression is effective in reducing Bitmap index size, and that query processing time grows linearly as the index size increases. They also demonstrated that query processing time is linear in the number of hits when a WAH-compressed bitmap index is used.

... table involves low-cardinality columns, while the Order and Product tables involve normal and high car-
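As a minimal sketch of the basic Bitmap index described in Section 2.1, the Python fragment below builds one bitmap per distinct value of column C from Table 1 and answers a range predicate by OR-ing the matching bitmaps. The function names (`build_bitmap_index`, `rows_of`, `range_query`) and the use of plain Python integers as uncompressed bitmaps are assumptions made for illustration; they are not taken from [10] or from any database system, and no compression such as WAH is applied.

```python
# Column C from Table 1, listed by RowId 0..8.
C = [2, 1, 3, 0, 3, 1, 0, 0, 2]


def build_bitmap_index(column):
    """Return {value: bitmap}; bit i of a bitmap is 1 iff column[i] == value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_of(bitmap):
    """List, in ascending order, the RowIds whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def range_query(index, low, high):
    """RowIds with low <= value <= high, found by OR-ing one bitmap per value."""
    result = 0
    for value, bitmap in index.items():
        if low <= value <= high:
            result |= bitmap
    return rows_of(result)


index = build_bitmap_index(C)

# Bits of bitmap B2 listed by RowId, i.e. the B2 column of Table 1.
b2_bits = [(index[2] >> i) & 1 for i in range(len(C))]
```

With this sketch, `len(index)` is the column cardinality of C, and a call such as `range_query(index, 1, 2)` touches only the bitmaps B1 and B2. The cost of such a query therefore depends on how many bitmaps fall inside the range and how many rows they select.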