To Trie Or Not to Trie? Realizing Space-Partitioning Trees Inside Postgresql: Challenges, Experiences and Performance
Total Page:16
File Type:pdf, Size:1020Kb
Purdue University Purdue e-Pubs Department of Computer Science Technical Reports Department of Computer Science 2005 To Trie or Not to Trie? Realizing Space-partitioning Trees inside PostgreSQL: Challenges, Experiences and Performance Mohamed Y. Eltabakh Ramy Eltarras Walid G. Aref Purdue University, [email protected] Report Number: 05-008 Eltabakh, Mohamed Y.; Eltarras, Ramy; and Aref, Walid G., "To Trie or Not to Trie? Realizing Space- partitioning Trees inside PostgreSQL: Challenges, Experiences and Performance" (2005). Department of Computer Science Technical Reports. Paper 1623. https://docs.lib.purdue.edu/cstech/1623 This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information. TO TRIE OR NOT TO TRIE? REALIZING SPACE PARTITIONING TREES INSIDE POSTGRESQL: CHALLENGES, EXPERIENCES AND PERFORMANCE Mohamed Y. Eltaabakh Ramy Eltarras Walid G. Aref Department of Computer Sciences Purdue University West Lafayette, IN 47907 CSD TR #05-008 April 2005 To Trie or Not to Trie? Realizing Space-partitioning Trees inside PostgreSQL: Challenges, Experiences, and Performance l\Iohamed Y. Eltabakh Ramy Eltarras Walid G. Aref Purdue Universit~· West Lafayette, IN. 47907-1398. USA {meltabak, rhassan, aref}@cs.purdue.edu Abstract this need and have initiated efforts to support several non-traditional indexes, e.g., (Oracle [35]. and IBl\l l'l1an~' evolving datilbilse applications warrant DB2 [ID. the use of non-twditional indexing mecha One of the major hurdles in implementing non nisms beyond B+-trees and hash tables. SP traditionill indexes inside a database engine is the GiST is an extensible indexing framework that very wide variety of such indexes. Moreover. there broadens the class of supported indexes to is tremendous overhead that is associated with real include disk-based wrsions of a \vide vari izing and integrating any of these indexes inside the ety of space-partitioning trees, e.g., disk-based engine. Generalized search trees (e.g., GiST [21] and trie variants, quadtree vilriants, and kd-trees. SP-GiST [4. are designed to address this problem. This paper presents il serious attempt at im 3D plementing and realizing SP-GiST-based in Generalized search trees (GiST [21 D and Space dexes inside PostgreSQL. Several index types partitioning Generalized search trees (SP-GiST [4. 3D are realized inside PostgreSQL facilitated by are software engineering frameworks for rapid proto rapid SP-GiST instantiations. Challenge~, typing of indexes inside a database engine. GiST experiences. and performance issues are ad supports the class of balanced trees (B+-tree-like dressed in the paper. Performance compar trees). e.g., R-trees [7, 20, 32], SR-trees [24], and isons are conducted from within PostgreSQL RD-trees [22], while SP-GiST supports the class of to compare updilte ilnd search performances space-partitioning trees. e.g., tries, quadtrees. and k-d of SP-GiST-based indexes against the B+-tree trees. Both frameworks have internal methods that and the R-tree for text string and point data furnish general database functionalities. e.g.. gener sets. Interesting performilnce results are pre alized search and insert algorithms, as well as user sented in the paper. Results highlight the po defined external methods and parameters that tailor tentifd performance gains of SP-GiST-based the generalized index into one instance index from the indexes as well ilS several strengths and weak corresponding index class. GiST has been tested in nesses of SP-GiST-bilsed indexes that need to prototype systems, e.g.. in Predator [34] and in Post be addressed in future research greSQL [37J. and hence is not the focus of this study. The purpose of this study is to demonstrate feasibil 1 Introduction ity and performance issues of SP-GiST-based indexes. This work presents a serious attempt at implement l\Iany emerging database applicat.ions warrant the use ing and realizing SP-GiST-based indexes inside Post of non-traditional indexing mechanisms beyond B+ greSQL. Using rapid SP-GiST instantiations. several trees and hash tables. Diltil base vendors hilve realized index types are realized inside PostgreSQL that in Perrnission to copy wi.thout fee all OJ' part of this material is dex string and point data types. Performance compar grunted pT'Ovided t./wt the copies are nol made or distributed for isons are conducted from within PostgreSQL to com direct com mercial a.dvantage. I he VLDB copyrighl not.ice and pare update and search performances of one disk-based the tit.le of the pllblicaJion and its dale appear. and not.ice is trie variant against the B+-tree for a variety of string gillen Owl copying is by pennission of the Very Large Data. Base Endowment. To copy olherll'ise. or 10 republish. requires a fee dataset collections, and one disk-based kd-tree variant and/or s}i<"ciol permission fmm the Endowment. against the R-tree for two-dimensional point dataset Proceedings of the 31st VLDB Conference, collections. Other more sophisticated index types can Trondheim, Norway, 2005 be instantiated using the same SP-GiST platform. We plan to release the PostgreSQL version of SP-GiST 2 Related Work around the same time we publish this paper. The contributions of this research can be summa I\l ultidimensional searching is a fundamental oper rized as follows: ation for many database applications. Several in dex structures beyond B-trees [6, 11], and hash ta 1. We realized SP-GiST inside PostgreSQL to ex bles [14, 29], have been proposed to manage the update tend the available access methods to include the and search operations over spatial and multidimen class of space partitioning trees. Our implementa sional data, e.g., [17, 27. 31, 33]. These index struc tion methodology makes SP-GiST portable. That tures include the R-tree, and its variants [7. 20, 32]. is. SP-GiST can be realized inside PostgreSQL the quadtree, and its variants [15, 18, 25, 39], the kd without recompiling PostgreSQL code. tree [8], the disk based kd-tree variants, e.g.. [9, 30], 2. \Ve extended the indexing operations of the space and the trie structure, and its variants [2, 10. 16]. Ex partitioning trees, e.g., the trie, to include more tensions to the B-tree have been also proposed to ad challenging search operations, e.g., the prefix dress indexing multidimensional attributes. such as the match search, and the regular expression match universal B-tree [5], and the hB-tree [13]. Extensible search. indexing frameworks have been proposed to instanti ate a variety of index structures in an efficient way and 3. \:Ve conducted extensive experiments from within without modifying the database engine. Extensible in PostgreSQL to compare the performance of the dexing frameworks are first proposed in [36]. GiST trie and the kd-tree against the B+-tree and the (Generalized Search Trees) is an extensible framework R-tree, respectively. Our results illustrate that for B-tree-Iike indexes [21]. SP-GiST (Space Partition the lrie has more than 150% search performance ing Generalized Search Trees) is an extensible frame improvement over the B+-tree performance in the work for the family of space partitioning trees [4, 3]. case of the exact match search. Also, the trie An extensible bulk loading algorithm for SP-GiST is has more than 2 orders of magnitude search per proposed in [19]. Extensible indexing structures gain formance improvement in the case of the regular more importance in the context of the object rela expression match search. Our experiments also tional database management systems (ORDBMS) that demonstrate that the kd-tree has more than 300% are designed primarily to support new types of data search performance improvement over the R-tree and applications. The implementation of GiST in 1n in the case of the point match search. formix Dynamic Server with Universal Data Option 4. We realized the suffix tree index structure using (IDSjUDO) is presented in [26]. Current commercial the SP-GiST framevvork to support the substring databases have also supported the extensible indexing match searching. Our experiments demonstrate frameworks, e.g., IBM DB2 [1], and Oracle [35]. The that the suffix tree improves the search perfor performance of the various index strllctures have been mance by more than 3 orders of magnitude over studied extensively. For example. a model for the R the sequential scan. The other index structures, tree performance is proposed in [38]. A comparison e.g.. B+-tree, and R-tree, do not support the sub between quadtree and R-tree indexes in Oracle spa string match search. tial is presented in [28]. Another comparative study among the R-tree variants and the Pl\lR quadtree is 5. We identified weaknesses of SP-GiST-based in presented in [23]. dexes compared to the B+-tree and the R-tree indexes that need to be addressed in future re 3 Space-partitioning 'Trees: Overview, search. Basically, the insertion time and the index size of the SP-GiST-based indexes involve higher Challenges, and SP-GiST overhead than those of the B+-tree and the R-tree The main characteristic of space-partitioning trees indexes. is that they partition the multi-dimensional space The rest of this paper proceeds as follows. In Sec into disjoint (non-overlapping) regions. Refer to tion 2. we highlight related work in the area of ex Figures I, 2, and 3, for a few examples of space tensible indexes and generalized indexing frameworks. partitioning trees. Partitioning can be either In Section 3. we overview space-partitioning trees, the (1) space-driven (e.g., Figure 2), where we decompose challenges they have from database indexing point of the space into equal-sized partitions regardless of the view, and how these challenges are addressed in SP data distribution, or (2) data-driven (e.g., Figure 3), GiST.