Ordered Indexing Methods for Data Streaming Applications
Total Page:16
File Type:pdf, Size:1020Kb
IT 11 061 Examensarbete 45 hp Augusti 2011 Ordered indexing methods for data streaming applications Sobhan Badiozamany Institutionen för informationsteknologi Department of Information Technology Abstract Ordered indexing methods for data streaming applications Sobhan Badiozamany Teknisk- naturvetenskaplig fakultet UTH-enheten Many data streaming applications need ordered indexing techniques in order to maintain statistics required to answer historical and continuous queries. Conventional Besöksadress: DBMS indexing techniques are not specifically designed for data streaming Ångströmlaboratoriet Lägerhyddsvägen 1 applications; this motivates an investigation on comparing the scalability of different Hus 4, Plan 0 ordered indexing techniques in data streaming applications. This master thesis compares two ordered indexing techniques – tries and B-trees - with each other in Postadress: the context of data streaming applications. Indexing data streams also requires Box 536 751 21 Uppsala supporting high insertion and deletion rates. Two deletion strategies, bulk deletion and incremental deletion were also compared in this project. The Linear Road Telefon: Benchmark, SCSQ-LR and Amos II were comprehensively used to perform the 018 – 471 30 03 scalability comparisons and draw conclusions. Telefax: 018 – 471 30 00 Hemsida: http://www.teknat.uu.se/student Handledare: Tore Risch Ämnesgranskare: Tore Risch Examinator: Anders Jansson IT 11 061 Sponsor: This project is supported by VINNOVA, grant 2007-02916. Tryckt av: Reprocentralen ITC Acknowledgments The idea of this project was suggested by Prof. Tore Risch. Without his continuous valuable supervision, support, guidance and encouragement it was impossible to perform this study. I would also like to thank Erik Zeitler and Thanh Truong. Erik helped me in understanding the LRB and also running the experiments. Thanh generously shared his valuable knowledge on extending Amos with new indexing structures. I owe my deepest gratitude to my wife Elham, who has always been supporting me with her pure love. She went through a lot of lonely moments so that I can finish my studies. I would also like to show my gratitude to my parents for their dedication and the many years of support during my whole life that has provided the foundation for this work. I also want to thank our little Avina who with her beautiful smile gave a new meaning to our family life and provided relaxation from stressful moments. My special thanks go to my best friend Karwan Jacksi, who during last two years was always around for a none-study-related friendly talk. Amanj, Hajar, Salah and other friends, thank you very much for your support and friendship. This project is supported by VINNOVA, grant 2007-02916. Table of Contents 1 Introduction .......................................................................................................................................... 1 2 Background and related work ............................................................................................................... 4 2.1 The Linear Road Benchmark ......................................................................................................... 4 2.1.1 LRB data load ........................................................................................................................ 4 2.1.2 LRB performance requirements ............................................................................................ 5 2.2 Amos II and its extensibility mechanisms ..................................................................................... 6 2.3 SCSQ-LR ......................................................................................................................................... 7 2.3.1 Segment statistics, the SCSQ-LR bottleneck ......................................................................... 8 2.4 Previous applications of tries in data streaming ......................................................................... 10 2.5 Tries and common trie compression techniques ........................................................................ 10 2.5.1 Naïve trie ............................................................................................................................. 11 2.5.2 How to support integer keys in tries ................................................................................... 12 2.5.3 Naïve tries’ sensitivity to key distribution ........................................................................... 12 2.5.4 Using linked lists to reduce memory consumption in tries ................................................ 13 2.5.5 Building a local search tree in every trie node .................................................................... 13 2.5.6 Burst Trie ............................................................................................................................. 14 2.5.7 Judy, a trie based implementation from HP ....................................................................... 14 2.6 A brief evolutionary review on main memory B-trees ............................................................... 18 3 Ordered indexing methods in main memory; Trie vs. B-tree ............................................................. 19 3.1 Insertion ...................................................................................................................................... 20 3.2 Deletion ....................................................................................................................................... 20 3.3 Range Search ............................................................................................................................... 21 3.4 Memory utilization ...................................................................................................................... 22 3.5 B-trees vs. Tries in LRB ................................................................................................................ 24 3.6 Tries vs. B-trees: a more general comparison and analysis ........................................................ 24 4 SCSQ-LR implementation specifics ..................................................................................................... 26 4.1 LRB simplifications ...................................................................................................................... 26 4.2 Exploiting application knowledge to form integer keys ............................................................. 26 4.3 Implementation structure: C, ALisp, AmosQL ............................................................................. 27 4.4 Performance improvements ....................................................................................................... 29 5 Index maintenance strategies ............................................................................................................. 31 5.1 Bulk deletion ............................................................................................................................... 31 5.2 Incremental deletion ................................................................................................................... 32 5.3 Bulk vs. Incremental deletion ..................................................................................................... 33 6 Experimental results ........................................................................................................................... 35 7 Conclusion ........................................................................................................................................... 36 8 Future work ......................................................................................................................................... 37 9 References .......................................................................................................................................... 38 10 Appendixes ...................................................................................................................................... 42 10.1 Appendix A – The original SCSQ-LR segment statistics code ...................................................... 42 10.2 Appendix B – Range-Search-Free segment statistics .................................................................. 43 10.3 Appendix C – Trie-based segment statistics ............................................................................... 47 10.4 Appendix D – Implementation of segment statistics using Incremental deletion ...................... 49 10.5 Appendix E – Sample code for implementing Judy ordered retrieval ........................................ 50 10.6 Appendix F – The Naïve trie data structure ................................................................................ 51 1 Introduction In recent years, the area of stream processing applications has attracted many research activities from the database research community. In contrast to database management systems (DBMSs), in a Data Stream Management System (DSMS) data is not stored in persistent relations; instead it is produced in continuous, transient and rapid data streams. There are various applications for DSMS; Examples include network monitoring, data gathered from high throughput scientific instruments, stock market, data from telecommunications industry, web applications, and sensor networks. DSMS applications require support of continuous queries (1), approximation (2),