
Kite: A Scalable Microblogs Data Management System

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Amr Magdy Ahmed

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

Adviser: Mohamed F. Mokbel

June, 2017

© Amr Magdy Ahmed 2017
ALL RIGHTS RESERVED

Acknowledgements

There are many who have earned my gratitude for their contribution to my time in graduate school. I am extremely thankful to my parents for supporting me over the years. Without their support, it would have been very hard to reach this stage in my graduate studies and career. My doctoral adviser Mohamed Mokbel is the main contributor to all accomplishments I have achieved during the last six years. His guidance, efforts, spirit, time, and commitment were key elements that contributed significantly and positively to my graduate school experience. I am also grateful to my collaborators Sameh Elnikety, Yuxiong He, Suman Nath, Ahmed Aly, Walid Aref, Sohaib Ghani, and Thanaa Ghanem. Their efforts and collaborative work turned my research into reality. I also thank my lab mates for collaborating with me on exciting research ideas that greatly enriched my research and mentoring experience. In particular, Mohamed Sarwat, Jie Bao, Ahmed Eldawy, Mashaal Musleh, Louai Alarabi, Rami Alghamdi, Saif Al-Harthi, and Christopher Jonathan. Finally, I thank my close social circle that supported me throughout my PhD journey. In particular, my wife, Reem, my son, Yousof, and my family friends, Hesham, Eman, Ibrahim, Ahmed, Sara, Jana, Heba, Eyad, and Sara.

Dedication

To my lord, adviser, parents, wife, and kids.

Abstract

Developers, researchers, and practitioners have been building a myriad of applications to analyze microblogs data, e.g., tweets, online reviews, and user comments. Examples of such applications include citizen journalism, event detection and analysis, geo-targeted advertising, medical research, and studying social influences in social sciences. Building such applications requires data management infrastructure to deal with microblogs, including data digestion, indexing, and main-memory management. The lack of such infrastructure hinders the scalability and the widespread adoption of such applications, especially among users who are not computer scientists. This thesis proposes Kite, an end-to-end system that is able to manage microblogs data at a large scale. Using Kite, developers and practitioners can simply write SQL-like queries without worrying about the internal data management issues. Internally, Kite is equipped with scalable indexing and main-memory management techniques to support top-k temporal, spatial, keyword, and trending queries on both very recent data and historical data. The Kite indexer supports scalable digestion and retrieval of fast incoming data in real time. Recent data are digested in efficient main-memory index structures. Kite's in-memory index structures are able to scale up a single machine's indexing capabilities to handle the overwhelming amount of data in real time. Meanwhile, the Kite memory manager monitors the memory contents and smartly decides which data is regularly moved to disk. This is accomplished through effective memory flushing policies that are designed for top-k query workloads, which are popular on microblogs data. Both in-memory and on-disk data are queried seamlessly through efficient retrieval techniques that are encapsulated in the Kite query processor. The query processor exploits the top-k ranking function to prune the search space early and reduce the query latency significantly. Kite is open-sourced and available to the community to build on (http://kite.cs.umn.edu). Extensive experimentation on different Kite components shows the efficiency and the effectiveness of the proposed techniques to manage microblogs data at scale.

Contents

Acknowledgements

Dedication

Abstract

List of Figures

1 Introduction
  1.1 Kite System Overview
    1.1.1 Memory Indexer
    1.1.2 Disk Indexer
    1.1.3 Query Processor
  1.2 Supported Queries
  1.3 Query Language
  1.4 Thesis Organization

2 Indexing
  2.1 Main-memory Indexing
    2.1.1 In-memory Keyword Index
    2.1.2 In-memory Spatial Index
  2.2 Disk-based Indexing
  2.3 Flushing Management
  2.4 Experimental Evaluation
    2.4.1 Experimental Setup
    2.4.2 Real-time Indexing Performance

3 Main-memory Management
  3.1 Flushing Policies Overview
  3.2 Managing Kite Spatial Index Memory
    3.2.1 Spatial Index Size Tuning
    3.2.2 Spatial Index Load Shedding
    3.2.3 Spatial Index Adaptive Load Shedding
  3.3 Managing Kite Keyword Index Memory
    3.3.1 kFlushing Policy
    3.3.2 Extensibility of kFlushing
  3.4 Experimental Evaluation
    3.4.1 Experimental Setup
    3.4.2 Evaluating Managing Spatial Index Memory
    3.4.3 Evaluating Managing Keyword Index Memory

4 Query Processing
  4.1 Query Processor Overview
  4.2 Processing Spatial Queries
    4.2.1 Query Data Structure
    4.2.2 The Initialization Phase
    4.2.3 The Pruning Phase
  4.3 Processing Keyword Queries
  4.4 Processing Spatial-Keyword Queries
  4.5 Processing Queries on Arbitrary Attributes
  4.6 Experimental Evaluation

5 Trend Discovery
  5.1 Trending Measures
    5.1.1 Rate of Increase Measure
    5.1.2 Weighted Count Measure
  5.2 Trending Indexing
    5.2.1 Index Structure
    5.2.2 Index Insertion
    5.2.3 Data Expiration
  5.3 Memory Optimization
    5.3.1 TrendMem Key Ideas
    5.3.2 TrendMem Technique
  5.4 Trending Query Processing
  5.5 Experimental Evaluation
    5.5.1 Evaluation of TrendMem
    5.5.2 Evaluation of Trending Query Processing

6 Related Work
  6.1 Microblogs Data Analysis
  6.2 Microblogs Data Management
  6.3 Microblogs Data Visualization
  6.4 Microblogs Systems

7 Conclusions

References

List of Figures

1.1 Kite System Overview
2.1 Kite In-memory Segmented Index
2.2 Main-memory spatial index segment
2.3 Kite Disk Spatial Index
2.4 Spatial index insertion time
3.1 kFlushing main idea
3.2 Flushing Example
3.3 Memory consumption behavior
3.4 Example of Multiple-Keyword Extension of kFlushing
3.5 Spatial index insertion time - reduced index sizes
3.6 Effect of T on index storage vs. query accuracy
3.7 Effect of α on index storage vs. query accuracy
3.8 Effect of β on index storage vs. query accuracy
3.9 Effect of α on index storage vs. query accuracy - adaptive
3.10 Effect of adaptive load shedding on query latency
3.11 Ranking effect on storage vs. accuracy varying T
3.12 Ranking effect on storage vs. accuracy varying α
3.13 Ranking effect on storage vs. accuracy varying β
3.14 Number of Memory-hit Keywords
3.15 Hit Ratio on Correlated Query Load
3.16 Hit Ratio on Uniform Query Load
3.17 Flushing Overhead vs. k
3.18 kFlushing Performance on Spatial Attribute
3.19 kFlushing Performance on User Attribute
4.1 Microblogs querying environment
4.2 Query Processing Overview
4.3 Average query latency
4.4 Ranking effect on query pruning
5.1 GeoTrend index structure and cell contents
5.2 Example of GeoTrend index insertion and expiration
5.3 Zipf distribution of keywords at different spatial levels
5.4 Impact of GeoTrend memory optimization module
5.5 Impact of keyword replication across GeoTrend index levels
5.6 Impact of maintaining top-k lists
6.1 Microblogs Research Timeline

Chapter 1

Introduction

The last decade has witnessed an explosive growth in user-generated contents on the web, including discussion forums, blogs, wikis, videos, and more recently microblogs. Microblogs is a comprehensive title that describes the micro-length chunks of user-generated data that are posted on the web every second. This includes tweets, online user reviews, user comments, and user check-ins. This data is easy to produce by a user, who can compose a tweet or a comment in just a few seconds, and thus it comes literally in thousands of records every second. Every day, 288+ Million active Twitter users generate 500+ Million tweets [1], while 1.39+ Billion Facebook users post 3.2+ Billion comments [2]. The vast majority of microblogging activity comes from mobile users; specifically, 80+% of Twitter users and 85+% of Facebook users are mobile. Such a large number of data records carries very rich user-generated contents such as news, opinions, and discussions, as well as metadata including location information, language information, and personal information. The rich content and the popularity of microblogging platforms have resulted in microblogs being exploited in a wide variety of important applications. For example, Twitter users propagate news faster than conventional media, e.g., Michael Jackson's death and the Boston Marathon explosions [3]. In addition, rescue teams have used microblogs to locate and save people during China floods [4]. Moreover, microblogs users are actively posting about short-term events, e.g., sports games, and extended events, e.g., the Iranian elections [5], US elections [6], Arab Spring [7], and Ukraine conflict [8]. This rich and important content has motivated several new applications on social media including social journalism, event tracking [9], social media analysis [10, 11], and geo-targeted advertising [12].

Microblogs have also been used in several research disciplines, including machine learning [13], human-computer interaction [14], social sciences [15], and medical sciences [16]. The richness, importance, and distinguished characteristics of microblogs data have triggered plenty of research in the literature to index, query, analyze, and visualize such data. From a data management perspective, microblogs data represents a unique kind of streaming data with new challenges that were not previously addressed in the data management community, as advocated in my research [17, 18, 19]. Microblogs queries require managing incoming data in real time to allow querying data that has just arrived a few seconds ago. This has introduced challenges in the areas of fast data indexing and main-memory management. In addition, several important queries on microblogs also exploit large volumes of historical data, such as querying six months' worth of elections tweets, analyzing Ebola or Zika outbreaks, or studying social behaviors of online communities over several months. Such queries require searching a huge search space that might include hundreds of billions of data records, which is a massive number even for big data management systems. Thus, microblogs data management contributions in the literature come in four categories: indexing, querying, analysis, and visualization. With respect to indexing, various techniques have been proposed to index incoming microblogs in memory for scalable digestion [20, 21, 18, 22, 23, 24, 25] and on disk for continuously accommodating new data in memory [26, 18]. With respect to querying, many queries of different types have been addressed on microblogs data. This includes basic search queries that retrieve individual microblogs [21, 26, 27, 22, 28, 24] or aggregate queries that retrieve information about microblogs, e.g., frequent keywords, spatial densities, or trending topics [20, 29, 30, 31, 32, 23]. This rich set of queries has enabled different types of analysis on microblogs including news extraction [33, 34], event detection [35, 36, 32, 37], event analysis and tracking [9, 11], recommendation [38, 29, 33], sentiment and semantic analysis [39, 40, 41], user-centric analysis [29, 42, 43], topic extraction [44, 31], geo-targeted advertising [12], and generic social media analysis [10, 45]. With the plethora of microblogs data and queries, efforts have been made to provide efficient and effective visualization for microblogs applications [46, 47, 48, 49, 34, 9]. My doctoral research contributions in microblogs data management have spanned four main areas, namely, real-time indexing, main-memory management, trending queries, and system-level integration. The contributions in the three former areas are all integrated under a full-fledged system called Kite. Kite is a data management system tailored to the specific needs of microblogs data and query workloads. Kite aims to be the standard platform for accessing microblogs, allowing other researchers, developers, and practitioners to focus on their data analysis tasks and/or applications without worrying about the underlying management and retrieval of microblogs data. Kite can query arbitrary attributes, featuring a wide variety of queries and analysis capabilities. To this end, Kite is equipped with light spatial and keyword in-memory index structures that can digest the high rate of incoming microblogs.
Then, the in-memory indexes are equipped with a buffer manager policy that is specifically designed to favor microblogs queries' characteristics (as introduced in Section 1.2). When the memory is full, a portion of the data is evicted from memory and goes to secondary storage index structures that are organized temporally and built in the Hadoop Distributed File System (HDFS) for both spatial and keyword queries. A scalable query processor exploits these system infrastructures to provide efficient querying on arbitrary microblogs attributes for both in-memory and on-disk data. All system facilities are accessed through a declarative SQL-like query language to ease building applications on top of microblogs data. Kite incorporates my doctoral contributions in the four areas: real-time indexing, main-memory management, trending queries, and system-level integration. The motivation and summary of the contributions in each of the four areas are briefly outlined below.

1. Real-time Indexing: It is very common to ask about what people are tweeting NOW about the "US Presidential Debate" during and shortly after the debate. Similar queries on the real-time content might include several attributes such as the spatial information (e.g., find recent tweets in New York about a certain topic), or the language information (e.g., find active Spanish speakers tweeting about a certain topic). Supporting such queries on real-time microblogs, i.e., very recent microblogs, essentially requires optimized index structures that are able to digest tens of thousands of records every second to handle peak rates of web users. In my research, I have focused on indexing for spatial [22] and spatial-keyword queries [18] on real-time microblogs, while the research community was tackling real-time indexing techniques for other types of queries. In particular, I have addressed spatial k-nearest-neighbor and spatial range queries on real-time data [22] to efficiently find individual microblogs that satisfy a certain spatial predicate and are recently posted, e.g., a few seconds ago. Both queries are supported through in-memory index structures that are equipped with efficient insertion and deletion techniques so that the overall indexing overhead is reduced to its minimal levels. This enables my techniques to support high digestion rates of up to 64,000 microblog/second, which is an order of magnitude higher than Twitter's rate. I used similar spatial structures in supporting spatial-keyword queries [18], yet blended them with a real-time keyword index that is used to filter data on keyword predicates. A cost model is proposed to select which index to hit based on the dynamics of both data and queries in real time.

2. Main-memory Management: All real-time index structures reside in main-memory. However, memory is naturally a scarce resource that cannot hold data for unlimited periods of time. A primary limitation of existing techniques is that they tend to occupy a considerable percentage of the memory with data that is irrelevant to incoming queries. To overcome this limitation, a main challenge is to selectively pick irrelevant data in real time without disrupting the fast data ingestion of the real-time indexes. In my research, I gave special attention to efficient management of the scarce memory resources [50, 51, 30, 22, 52], while addressing spatial queries [22, 52], trending queries [51, 30], and generic top-k queries [50]. The common theme of all my memory management techniques is to select data to flush from main-memory based on its relevance to the specific supported query or to a family of queries of interest. A major factor that enabled this is the temporal sensitivity of microblogs data, where recent data has much higher importance than older data. Such temporal sensitivity makes the time aspect part of all microblogs queries in some way [19, 17, 53]. I exploited this temporal nature combined with the query predicates to utilize main-memory only for relevant recent data. This is achieved through light techniques that amortize the cost of irrelevant data selection over multiple time instants to minimize any potential contention on the index contents. Thus, my techniques maintain the high ingestion rates of real-time microblogs, significantly improve query performance, and enable microblogs data to be managed in memory-constrained environments. Specifically, when fixing a given memory budget, my techniques double the memory hit ratio [50] so that twice the number of queries retrieve their answers entirely from the main-memory contents without visiting disk contents. In addition, when fixing the memory hit ratio, we achieved up to 50% memory savings without compromising the query accuracy [22], and up to 75% memory savings with a slight reduction in query accuracy [30, 52].

3. Trending Queries: The wide popularity of microblogs data has made it a primary source of information when searching for people's interests and measuring crowd opinions on different topics and places. Thus, finding trending items in microblogs streams is popular in several applications and research efforts. In my research, I took two major steps in the area of supporting trending queries on microblogs. First, I proposed a spatial framework [30] that supports trending queries on real-time microblogs within arbitrary spatial regions. Second, I contributed to a generic framework [54] that supports trending queries on microblogs within any arbitrary context, not only spatial context, so that users can get, for instance, trending items among teenagers (age context), among Spanish speakers (language context), or among economy experts (topic context). Both frameworks employ in-memory structures that efficiently materialize aggregate information about incoming microblogs in real time, while efficiently discarding old information beyond a certain time window. They also provide appropriate replication of the materialized information, either spatial or temporal replication, so that incoming queries with arbitrarily large parameters can still be processed efficiently in real time. Thus, at query time, our query processors aggregate the materialized contents from different spatial regions (or contextual categories) efficiently to provide low-latency queries. Our proposed techniques have achieved high ingestion rates of up to 50,000 microblog/second, which is eight times higher than Twitter's rate, while efficiently utilizing memory resources to keep the memory consumption at its lowest levels. In the rest of the thesis, I consider only the spatial trend discovery framework [30], to which I am the main contributor.

4. System-level Integration: My research went beyond addressing individual queries on microblogs to proposing holistic systems that facilitate end-to-end environments for microblogs data management [17, 18, 19]. The main motivation for providing such a system contribution is the lack of data management systems that are appropriately equipped to handle microblogs requirements. In particular, Database Management Systems (DBMSs) [55] cannot support microblogs as they are not equipped to deal with the high arrival rates that come with microblogs. For example, Twitter's arrival rate is ∼6,000 tweets/second, which is beyond what any DBMS can digest. Such a major limitation in DBMSs was a main reason that the systems community introduced Data Stream Management Systems (DSMSs), which have emerged as research projects (e.g., Aurora [56], Borealis [57], Nile [58], STREAM [59], TelegraphCQ [60], and Trill [61]) and commercial products (e.g., Apache Storm [62], StreamInsight [63], IBM System S [64], and AT&T Gigascope [65]). Although a DSMS can efficiently digest incoming data with high arrival rates, it is mainly designed and optimized to support the concept of continuous queries. Continuous queries register in the system ahead of time while incoming data are processed upon arrival, mostly in a single pass, to provide already registered queries with incremental answers. This is fundamentally different from the needs of microblogs queries, where users are mostly asking about data that has already arrived through posing snapshot (i.e., one-time) queries. Hence, data needs to be digested and indexed for answering future incoming queries. Though some DSMSs support data archiving, they do not support indexing, which is a major need for microblogs queries, especially in-memory indexing of recent data that receive a high fraction of queries. A recent trend is the development of various Big Data Management Systems (BDMSs), e.g., Apache Spark [66] and AsterixDB [67]. BDMSs are primarily designed and optimized for efficient processing of big volume data, which can be supported through either in-memory lightweight distributed processing, e.g., Spark, or disk-resident index-based distributed processing, e.g., AsterixDB. Although supporting big volume data is necessary to handle the large number of microblogs, it is not sufficient, as microblogs need inherent support for fast streaming data as well. Lately, BDMSs have provided facilities to support ingestion of fast data. Specifically, Apache Spark has provided Spark Streaming, and AsterixDB has been adapted for fast data ingestion [68]. Nevertheless, both systems facilitate fast data ingestion without supporting any indexing on streaming data, which is essential to support scalable microblogs queries (see [20, 21, 22, 23, 24] for examples). Generally, systems that are primarily designed for handling big volume data have been shown in [69] to be limited in practice in supporting fast data. Thus, handling big velocity has to be inherent in the system design from the very beginning. Specifically, microblogs need in-memory index-based distributed processing and disk-resident index-based scalable organization, which is not currently supported in any system.

My first contribution on a system level was architecting Taghreed [18], a proprietary system that is designed to handle both recent and historical geotagged Twitter data, to support spatio-temporal keyword queries, and to enable large-scale social media analytics [46, 48, 70]. Taghreed is patented [71] and it is the main technology used in a startup company, called Lucidya (http://www.lucidya.com/), that provides social media analytics based on Twitter data, and recently signed a partnership agreement with Twitter to have access to the whole Twitter firehose.

Beyond Taghreed, I presented to the research community a vision to build Kite, the first Microblogs Data Management System [19] that supports generic queries on arbitrary microblogs attributes, and works as a scalable backend on which developers can build microblogs applications, just like a database management system works as a backend for relational data. This vision is realized by releasing the first version of the Kite system [17] (http://kite.cs.umn.edu). Kite is a distributed system that can work efficiently on commodity hardware clusters. Kite exploits the in-memory infrastructure of Apache Ignite and extends it so that it supports both spatio-temporal and keyword-temporal index structures based on my research ideas [18, 22, 52]. In addition, it introduces disk-based spatio-temporal and keyword-temporal index structures based on the Hadoop Distributed File System (HDFS). HDFS is used to organize microblogs data that is flushed from main-memory into scalable disk-based index structures that enable efficient querying of large volumes of historical data.

Figure 1.1: Kite System Overview.

1.1 Kite System Overview

Figure 1.1 depicts the Kite system overview. The system consists of three main components, namely, Memory Indexer, Disk Indexer, and Query Processor. The system components are realized by exploiting the in-memory infrastructure of the Apache Ignite system. Kite receives streams of microblogs that are digested in main-memory indexes at high arrival rates. Whenever the allocated memory budget of a certain index is filled, its data is subject to flushing to a corresponding disk index. Indexes are created and/or dropped by system users on arbitrary attributes. Meanwhile, application developers exploit the rich features of Kite through either Java programming APIs, just like Hadoop or Spark, or an SQL-like query language. The different system components are briefly discussed below.

1.1.1 Memory Indexer

The Memory Indexer component organizes incoming microblogs in main-memory index structures to achieve: (i) scalable digestion of incoming data with high arrival rates, and (ii) efficient in-memory query processing on recent data, which represents a high fraction of incoming queries to Kite. The Memory Indexer encapsulates Stream Connectors to digest and pre-process the incoming data streams. Then, the resulting batches of data are inserted in main-memory indexes that are primarily organized temporally, as the most recent microblogs are likely to be queried more than older ones. Kite currently supports a temporal inverted index for the keyword attribute, a temporal partial quad tree for the spatial attribute, and a temporal hash index that is used for other microblogs attributes, all accessing data on Apache Ignite as an in-memory key-value store. Each index is allocated a memory budget. Once the index memory is filled, a Flushing Manager selects a subset of in-memory data to spill to a corresponding disk-resident index.
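The following minimal sketch illustrates this budget-triggered flushing loop. It is illustrative only: all type and method names (MemoryIndex, FlushingManager, selectVictims, and so on) are hypothetical stand-ins, not Kite's actual Java API.

    import java.util.List;

    // Hypothetical sketch of the memory-budget check that triggers flushing.
    interface Microblog { }

    interface MemoryIndex {
        long sizeInBytes();
        void removeAll(List<Microblog> microblogs);
    }

    interface DiskIndex {
        void appendBatch(List<Microblog> microblogs);
    }

    interface FlushingManager {
        // The flushing policy (Chapter 3) decides which microblogs to evict.
        List<Microblog> selectVictims(MemoryIndex index);
    }

    final class IndexBudgetMonitor {
        private final long memoryBudgetBytes;

        IndexBudgetMonitor(long memoryBudgetBytes) {
            this.memoryBudgetBytes = memoryBudgetBytes;
        }

        // Called after each insertion batch: once the index exceeds its memory
        // budget, a subset of its data is spilled to the corresponding disk index.
        void afterInsert(MemoryIndex index, FlushingManager flusher, DiskIndex disk) {
            if (index.sizeInBytes() > memoryBudgetBytes) {
                List<Microblog> victims = flusher.selectVictims(index);
                disk.appendBatch(victims);
                index.removeAll(victims);
            }
        }
    }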

1.1.2 Disk Indexer

Kite maintains a set of disk-resident index structures that digest the data flushed from the main-memory ones. The Disk Index Structures receive the flushed data from the Flushing Manager and insert it as one batch into the corresponding disk indexes. The index structures are built on top of the Hadoop Distributed File System (HDFS). Each index consists of a set of HDFS blocks, where data in each block is grouped based on the index key attribute. Similar to the in-memory structures, the disk-based structures are an append-only temporal inverted index, a temporal quad tree index, and a temporal hash index. Each disk index is organized in temporal slices to efficiently support temporal queries, which are dominant in microblogs queries.

1.1.3 Query Processor

The Kite query processor provides a set of generic operators that can be combined to support arbitrary queries on arbitrary microblogs attributes. Specifically, it provides SELECT, PROJECT, TIME, TOP-K, and (COUNT, GROUP BY) operators. The first two operators are similar to the ones supported in standard SQL. TIME determines the temporal horizon of the query, due to the importance of temporal queries in microblogs. TOP-K provides native support for top-k queries based on temporal order, with potential extension to use different ranking functions. Subsequent versions of the system are planned to support user-defined ranking functions for more flexibility for application developers. Finally, the combination of COUNT and GROUP BY is used to submit aggregate queries that find the most frequent items for a certain attribute, e.g., find the most frequent keywords in Minneapolis. Such count queries are very popular in today's microblogs applications to find trendy topics among microblogs users.

1.2 Supported Queries

Queries on microblogs have three main distinguishing characteristics [19]: (1) All queries are temporal. For example, if a user issues a query like "find tweets about Obama" without explicitly specifying a temporal interval, the underlying system will add a default temporal interval, e.g., last week; otherwise we would get tweets from ten years ago, which is not practical. (2) All queries are top-k. For any issued query, even if it is not mentioned explicitly, the underlying system will add a default k to limit the size of the answer to the top-k according to a specified or default ranking function; otherwise we would get an excessive number of tweets for every query. If the user needs more than k results, there is an option to retrieve the next k microblogs according to the same ranking function. (3) Keyword and spatial queries are very popular. A significantly high ratio of queries posed on microblogs asks either for microblogs containing certain keyword(s) or posted within a certain area. These distinguishing characteristics are considered in all Kite supported queries, which fall into four categories: (1) Spatial Queries, (2) Keyword Queries, (3) Spatial-Keyword Queries, and (4) Trending Queries. Each query is formally defined below.
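Before the formal definitions, the sketch below shows how such defaults (a temporal interval and a k) could be injected into an incoming query. The class names and the concrete default values (one week, k = 100) are illustrative assumptions, not values prescribed by Kite.

    // Hypothetical sketch of default injection for microblog queries.
    final class Query {
        Long startTimeMs, endTimeMs; // temporal horizon; null if user omitted it
        Integer k;                   // answer size limit; null if user omitted it
        // keyword / spatial predicates omitted for brevity
    }

    final class QueryDefaults {
        static final long DEFAULT_HORIZON_MS = 7L * 24 * 60 * 60 * 1000; // "last week"
        static final int DEFAULT_K = 100;

        // Inject the default temporal interval and default k when absent, so
        // every query is effectively temporal and top-k.
        static Query normalize(Query q, long nowMs) {
            if (q.startTimeMs == null) q.startTimeMs = nowMs - DEFAULT_HORIZON_MS;
            if (q.endTimeMs == null) q.endTimeMs = nowMs;
            if (q.k == null) q.k = DEFAULT_K;
            return q;
        }
    }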

Definition 1.1 Microblog Spatial Query: Given an integer k, an arbitrary spatial region R, a time horizon of T time units, and a ranking function Fα, a microblog spatial query posted by user u, located at u.loc, finds k microblogs such that: (1) The k microblogs are posted in the last T time units, (2) The (center) locations of the k microblogs are within range R around u.loc, and (3) The k microblogs are the top ranked ones according to the ranking function Fα.

The Microblog Spatial Query is a natural extension to traditional spatial range and k-nearest-neighbor queries, used extensively in spatial and spatio-temporal databases [72, 73]. A range query finds all items within certain spatial and temporal boundaries. With the large number of microblogs that can make it to the result, it becomes natural to limit the result size to k, and hence a ranking function Fα is provided. Similarly, a k-nearest-neighbor query finds the closest k items to the user location. As the relevance of a microblog is determined by both its time and location, we change the term closest to most relevant; hence we define a ranking function Fα to score each microblog within our spatial and temporal boundaries. Given a user u, located at u.loc, a microblog M, issued at time M.time and associated with location M.loc, and a parameter 0 ≤ α ≤ 1, the following ranking function Fα(u, M) combines generic spatial and temporal scores in a weighted summation to give the relevance score of M to u, where lower scores are favored:

$$F_\alpha(u, M) = \alpha \times SpatialScore(D_s(M.loc,\, u.loc)) + (1 - \alpha) \times TemporalScore(D_t(M.time,\, NOW))$$

Ds and Dt are the spatial and temporal distances, respectively. In our work, we use the Euclidean distance and the absolute timestamp difference, though any other monotonic distance functions can be used without changing the presented techniques. The largest possible value takes place when M is posted exactly T time units ago and on the boundary of region R. α = 1 indicates that the user cares only about the spatial proximity of microblogs, i.e., the query result includes the k closest microblogs issued in the last T time units. α = 0 gives the k most recent microblogs within range R. A compromise between the two extreme values gives a weight of importance to the spatial proximity over the temporal recency. TemporalScore and SpatialScore can be any functions that are: (1) monotonic, (2) have an inverse function with respect to the spatial and temporal distance, and (3) normalized to the same range of values, where smaller values indicate more relevant microblogs. The inverse function is used in pruning the search space and optimizing the memory footprint, as we discuss in the corresponding chapters. The normalization within the same range is not a correctness condition. However, as the scoring functions determine the decay pattern of microblog relevance over time and space, normalization ensures that both spatial and temporal dimensions have the same effect on the final relevance score. In our work, we employ the two most recognized scoring functions in the literature, the linear function (see [22]) and the exponential function (see [24]), to show the adaptivity of different system components to different ranking functions. However, other scoring functions that satisfy the above conditions can be adapted using exactly the same procedure that is explained in each component throughout the thesis. The scores are defined by the following equations. The linear scoring functions are:

$$TemporalScore(D_t(M.time, NOW)) = \begin{cases} \dfrac{D_t(M.time,\, NOW)}{T} & \text{if } D_t(M.time, NOW) \le T \\ \text{N/A} & \text{if } D_t(M.time, NOW) > T \end{cases}$$

$$SpatialScore(D_s(M.loc, u.loc)) = \begin{cases} \dfrac{D_s(M.loc,\, u.loc)}{R} & \text{if } D_s(M.loc, u.loc) \le R \\ \text{N/A} & \text{if } D_s(M.loc, u.loc) > R \end{cases}$$

Both functions are bounded in the range [0, 1]. The exponential scoring functions are:

$$TemporalScore(D_t(M.time, NOW)) = \begin{cases} e^{\,w \times D_t(M.time,\, NOW)/T} & \text{if } D_t(M.time, NOW) \le T,\; w > 0 \\ \text{N/A} & \text{if } D_t(M.time, NOW) > T \end{cases}$$

$$SpatialScore(D_s(M.loc, u.loc)) = \begin{cases} e^{\,w \times D_s(M.loc,\, u.loc)/R} & \text{if } D_s(M.loc, u.loc) \le R,\; w > 0 \\ \text{N/A} & \text{if } D_s(M.loc, u.loc) > R \end{cases}$$

Both functions must have the same value of w to be bounded in the same range [1, e^w].
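As a concrete illustration, the following sketch computes the linear variant of Fα. The class and parameter names are ours, not Kite's API, and NaN stands in for the N/A case (a microblog outside the spatial range R or the temporal horizon T).

    // Sketch of the linear ranking function F_alpha from Section 1.2.
    final class LinearRanking {
        // F_alpha(u, M) = alpha * SpatialScore + (1 - alpha) * TemporalScore,
        // where lower scores are favored and both scores are normalized to [0, 1].
        static double score(double alpha,
                            double spatialDist, double maxRange,   // D_s and R
                            double temporalDist, double horizon) { // D_t and T
            if (spatialDist > maxRange || temporalDist > horizon) {
                return Double.NaN; // N/A: outside the spatial range or time horizon
            }
            double spatialScore = spatialDist / maxRange;
            double temporalScore = temporalDist / horizon;
            return alpha * spatialScore + (1 - alpha) * temporalScore;
        }
    }

Consistent with the definition, alpha = 1 reduces the score to pure spatial proximity, while alpha = 0 reduces it to pure temporal recency.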

Definition 1.2 Microblog Keyword Query: Given an integer k, a set of keywords W, and a ranking function F, a microblog keyword query finds k microblogs such that: (1) The k microblogs contain one (or all) of the keywords in the set W, and (2) The k microblogs are the top ranked ones according to the ranking function F.

The Microblog Keyword Query can search for a single keyword, when the set W contains one item, or multiple keywords, when W contains multiple items. For multi-keyword queries, the query semantics can be OR or AND. In the OR semantics, a microblog that appears in the answer can contain any of the keywords in W. In the AND semantics, a microblog that appears in the answer must contain all of the keywords in W. In both cases, answer microblogs must be top ranked based on the ranking function F. In fact, this query is generalizable to other search attributes beyond the keyword attribute, for example user id. The constraint on a suitable search attribute is that a hash function value can be easily computed based on the attribute value. More details on that generalization will be discussed in subsequent chapters. The ranking function F can be any function that can be computed based on the microblog attributes regardless of the querying user information. One example of an appropriate ranking function is temporal ranking, where the top ranked microblog is the most recent one. Another example is authority ranking, where the top ranked microblog is the one that comes from the most authoritative user. Authoritative users can be defined as users who have a number of followers above a certain threshold, e.g., top politicians, celebrities, and official accounts. In general, any ranking function that depends only on the microblog attributes can be adapted. For the rest of this thesis, we consider the temporal ranking function due to its widespread use in microblogs applications.

Definition 1.3 Microblog Spatial-Keyword Query: Given an integer k, a set of keywords W, an arbitrary spatial region R, and a ranking function F, a microblog spatial-keyword query finds k microblogs such that: (1) The k microblogs contain one (or all) of the keywords in the set W, (2) The k microblogs lie inside the spatial range R, and (3) The k microblogs are the top ranked ones according to the ranking function F.

The Microblog Spatial-Keyword Query has a similar definition to the Microblog Keyword Query, yet it adds a spatial range R to its input parameters. All answer microblogs must lie inside the spatial range R. Other than that, all query semantics and constraints on the ranking function are exactly the same as for the Microblog Keyword Query.

Definition 1.4 Microblog Spatial Trending Query: Given an arbitrary spatial region R, an integer k, a time span of T time units, and a trending measure Trend, a microblog spatial trending query finds k keywords such that: (1) The k keywords are posted within the region R, (2) The k keywords are posted within the last T time units, and (3) The k keywords are the highest ranked based on the Trend measure among keywords that are posted within R and T.

The trending query limits its answer size to k as a natural consequence of the plethora of keywords that come with microblogs, which calls for selectively providing end users with the most relevant results (top-k items) based on a certain ranking function. In fact, for the same reason, all research efforts on microblogs limit their answer size to k [20, 21, 22, 23, 24] to be useful for end users. Furthermore, the query retrieves its answer from only recent keywords that are posted within the last T time units. This promotes the real-time nature of microblogs as a first-class citizen, which is a distinguishing property of today's microblogging services, to discover trends that are happening now on social media websites.

1.3 Query Language

Kite system features are accessible through a high-level SQL-like query language, called Microblogs Query Language (MQL). The Kite query language consists of three main statements: (1) CREATE (STREAM|INDEX), (2) SELECT, and (3) DROP (STREAM|INDEX), in addition to eight auxiliary statements and commands: SHOW, UNSHOW, DESC, PAUSE, RESUME, ACTIVATE, DEACTIVATE, and RESTART. Each of the auxiliary commands manipulates a certain microblog stream or index structure that exists in the system. The full description of each command's usage is detailed in the system documentation website. We then provide examples of the main statements of MQL.

⋆ SELECT attr_list
  FROM stream_name
  [WHERE condition]
  TOP-K k
  ORDER BY F(arg_list)
  TIME (T_start, T_end)

⋆ SELECT grouping_attr_list, COUNT(attr_list)
  FROM stream_name
  [WHERE condition]
  GROUP BY grouping_attr_list
  TOP-K k
  TIME (T_start, T_end)

The SELECT statement supports basic search queries that retrieve individual microblogs (the first variation) and aggregate queries that retrieve aggregate counts on microblogs (the second variation, to be realized in a later version of the system). Both types of queries are top-k queries and include a temporal aspect due to their exceptional importance in microblogs, as introduced before. If a query needs to omit declaring a specific time range or k, it should use the special values ∞ and −∞ to intentionally show the need to process all stored data or return all matching items. This prevents users from mistakenly submitting poorly performing queries. Basic search queries rank answers based on a user-defined function F, while aggregate queries rank answers based on the count. We next present a few examples of MQL queries.

Example 1. The following basic search query retrieves the most recent 20 tweets that mention both keywords Obama and Care:

SELECT *
FROM twitter_stream
WHERE keyword CONTAINS ALL (Obama, Care)
ORDER BY Max(timestamp)
LIMIT 20
TIME (1 Jan 1970, ∞)

Example 2. The following aggregate query retrieves the most frequent 10 keywords from tweets in Ukraine since February 18, 2014:

SELECT keyword, COUNT(*)
FROM twitter_stream
WHERE location WITHIN (52, 44.7, 39.91, 21.8)
GROUP BY keyword
LIMIT 10
TIME (18 Feb 2014, ∞)

1.4 Thesis Organization

The rest of this thesis is organized as follows:

• Chapter 2 presents microblogs indexing in the Kite system.

• Chapter 3 presents the main-memory management schemes used in the Kite system.

• Chapter 4 presents the query processing in the Kite system.

• Chapter 5 presents the trend discovery techniques employed in the Kite system.

• Chapter 6 presents the literature of microblogs data management and analysis.

• Chapter 7 concludes the thesis and discusses potential future work directions.

Chapter 2

Indexing

The plethora of microblogs data, combined with the newly motivated queries on this rich data stream, e.g., spatial-keyword search, makes data indexing an essential part of managing and querying microblogs data. In other words, microblogs come in very large numbers, i.e., millions and even billions, so that interactive queries, where the answer is needed instantly, cannot be answered efficiently by just scanning the big data. Thus, supporting indexing on microblogs is essential for the efficient retrieval of microblogs data that is needed to answer interactive queries. Next in this chapter, we will discuss the selection of attributes to be indexed along with the organization and the details of the index structures. As user-generated data, microblogs are rich data that contain several attributes to pose queries on, e.g., timestamp, location, keywords, user attributes, and language. Kite identifies the dominant attributes in microblogs queries to be the temporal, spatial, and keyword attributes [19, 17]. However, Kite provides index structures that are able to support any microblog attribute. Specifically, Kite provides a spatio-temporal index structure and a hash-temporal index structure. The spatio-temporal index structure is used for spatial attributes and the hash-temporal index structure is used for any non-spatial attribute, including the keyword attribute. In this thesis, the terms hash-temporal, keyword-temporal, and keyword index structure are used interchangeably.

In addition to supporting index structures on different attributes, Kite also identifies the necessary features in each index structure. As a fast stream that comes with rapidly increasing arrival rates, microblogs real-time digestion must be handled in light main-memory indexes that are able to cope with the increasing arrival rates (Twitter's rate in October 2013 was 5,200 tweet/second compared to 4,000 tweet/second in April 2012 [1]). In addition, keeping the most recent data in main-memory speeds up query responses, as most real-world queries access the most recent data [21]. On the other hand, Kite would not be able to manage billions of microblogs, spanning several months, in main-memory due to the scarcity of this resource. Consequently, Kite indexing includes both main-memory and disk-resident indexes. As Figure 1.1 shows, the streaming microblogs are handled by real-time stream connectors for preprocessing (e.g., keyword and location extraction). Then, the main-memory indexes digest the preprocessed microblogs in real time at high arrival rates. When the memory becomes full, a flushing manager transfers main-memory contents to the disk indexes. Disk indexes are then responsible for managing a very large number of microblogs over several months. It is worth noting that some attributes are not effective to index in real scenarios. The most effective attributes for indexing should satisfy two conditions: (1) the attribute is involved in many of the important incoming queries on microblogs, and (2) the attribute has a wide domain of values, i.e., can take a large number of distinct values, and hence would provide the most effective pruning of the microblogs search space. An example of an attribute that has a limited domain of values is the language attribute. The language attribute in Twitter data has a domain of only 56 values. If the language attribute is indexed, all of the incoming microblogs, which are billions, will be divided into at most fifty-six chunks, each of which will be huge to search within. Thus, the language attribute will not provide very effective pruning for billions of microblogs. Kite system administrators are encouraged to consider attributes for indexing based on these two factors. Temporal, spatial, and keyword attributes are prime examples that satisfy these conditions, and thus they are natively supported and promoted as first-class attributes in Kite. In the rest of this chapter we describe the Kite indexer. First, we describe the main-memory indexing of real-time microblogs in Section 2.1. Second, we describe the disk-based indexing in Section 2.2. Finally, the flushing management from main-memory to disk is highlighted in Section 2.3.


Figure 2.1: Kite In-memory Segmented Index.

2.1 Main-memory Indexing

Kite provides lightweight in-memory indexes that employ efficient index update techniques to be able to digest high arrival rates of microblogs in real time. The indexes are supported on the spatial, temporal, and keyword attributes. In this section, we start by describing the index organization. Then, we discuss the details of the index structure and the real-time update operations for each type of index, both the keyword and spatial indexes. Figure 2.1 shows the organization of Kite in-memory indexes. Kite employs two segmented indexes in main-memory: a keyword index and a spatial index. Both of them are temporally partitioned into successive disjoint index segments. Each segment indexes the data of T hours, where T is a parameter to be optimized by the flushing manager (see Section 2.3). Newly incoming microblogs are digested in the most recent segment (Segment 1 in Figure 2.1). Once the segment spans T hours of data, the segment is concluded and a new empty segment is introduced to digest the new data. Index segmentation has two main merits: (a) new microblogs are digested in a smaller index, which is the most recent segment, so digestion becomes more efficient, and (b) it eases flushing data from memory to disk under certain flushing policies.
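The segment rotation just described can be sketched as follows. The generic class and its names are illustrative assumptions, not Kite's actual code; each segment S would be an inverted-index or pyramid segment, per Sections 2.1.1 and 2.1.2.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.function.Supplier;

    // Hypothetical sketch of segment rotation: digestion always targets the
    // most recent segment, and a new segment is opened every T hours.
    final class SegmentedIndex<S> {
        private final Deque<S> segments = new ArrayDeque<>(); // head = most recent
        private final Supplier<S> segmentFactory;
        private final long segmentSpanMs; // T hours, in milliseconds
        private long currentSegmentStart;

        SegmentedIndex(long segmentSpanMs, Supplier<S> segmentFactory, long nowMs) {
            this.segmentSpanMs = segmentSpanMs;
            this.segmentFactory = segmentFactory;
            this.currentSegmentStart = nowMs;
            segments.addFirst(segmentFactory.get());
        }

        // Returns the segment that digests data arriving at nowMs, concluding
        // the current segment and opening an empty one once it spans T hours.
        S digestionSegment(long nowMs) {
            if (nowMs - currentSegmentStart >= segmentSpanMs) {
                segments.addFirst(segmentFactory.get());
                currentSegmentStart = nowMs;
            }
            return segments.peekFirst();
        }
    }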

2.1.1 In-memory Keyword Index

Each segment of the in-memory keyword index is a hash index structure that organizes incoming data based on the hash value of the search attribute. This hash index can be used for the keyword attribute as well as other attributes, e.g., user id. Thus, it is used as a generic index structure to support indexing on arbitrary attributes. The structure and the update operations of this hash index are solicited from the Earlybird system [21] that exists in the literature, with no major contributions from our research here. Our main contribution to the hash index structure is proposing an efficient technique to manage its main-memory storage, which will be detailed in Chapter 3. Briefly, the keyword index segment is an inverted index that organizes the data in a light hashtable. The hashtable maps a single keyword (the key) to a list of microblogs that contain the keyword. The list of microblogs of each keyword is ordered reverse-chronologically so that the insertion is always at the front of the list in O(1). Thus, the whole insertion in the index is O(1). With such optimization, the index is able to digest up to 100,000 microblog/second.
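A minimal sketch of such a segment follows. It is simplified for illustration (real Earlybird-style segments use more compact, concurrency-friendly structures), and the class and method names are ours.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch only: a hashtable from keyword to a
    // reverse-chronological posting list, so every insertion is an O(1)
    // prepend at the front of the list.
    final class KeywordSegment {
        private final Map<String, Deque<Long>> postings = new HashMap<>();

        // Microblogs arrive in timestamp order, so prepending each id keeps
        // every posting list ordered from most recent to oldest.
        void insert(long microblogId, Iterable<String> keywords) {
            for (String keyword : keywords) {
                postings.computeIfAbsent(keyword, k -> new ArrayDeque<>())
                        .addFirst(microblogId);
            }
        }

        Deque<Long> lookup(String keyword) {
            return postings.getOrDefault(keyword, new ArrayDeque<>());
        }
    }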

2.1.2 In-memory Spatial Index

Kite employs a partial pyramid structure [74] (Figure 2.2) that decomposes the space into H levels. For a given level h, the space is partitioned into 4^h equal-area grid cells. At the root, one grid cell represents the entire geographic area, level 1 partitions the space into four equi-area cells, and so forth. Dark cells in Figure 2.2 represent leaf cells, which could lie in any pyramid level, light gray cells indicate non-leaf cells that are already decomposed into four children, while white cells are not actually maintained and are presented just for illustration. We favor the pyramid structure over quad-trees as it involves storing data in non-leaf nodes, which significantly helps in query processing. Each maintained pyramid cell C has a list of microblogs M_List that have arrived within the cell boundary in the last T time units, ordered by their timestamps. This is in compliance with the spatial query definition formulated in Definition 1.1.

Figure 2.2: Main-memory spatial index segment.

A microblog with location coordinates is stored in the leaf cell containing its location, while a microblog with an MBR is stored in the lowest-level enclosing cell, which could be non-leaf. The pyramid index is spatio-temporal, where the whole space is spatially indexed (partitioned) into cells, and within each cell, microblogs are temporally indexed (sorted) based on timestamp. Though it is most suitable for Kite, existing pyramid index structures [74] are not equipped to accommodate the high-arrival insertion/deletion rates of microblogs. To support high-rate insertions, we furnish the pyramid structure with a bulk insertion technique that efficiently digests incoming microblogs at their high arrival rates, a speculative cell splitting technique that avoids skewed cell splitting, and a lazy cell merging technique that saves 90% of redundant cell split/merge operations in real time.

Bulk insertion. Inserting a microblog M (with a point location) in the pyramid structure can be done traditionally [74] by traversing the pyramid from the root to find the leaf cell that includes M's location. If M has an MBR location instead of a point location, we do the same except that we may end up inserting M in a non-leaf node. Unfortunately, such an insertion procedure is not applicable to microblogs due to their high arrival rates. While inserting a single item, newly arriving items may get lost, as items arrive faster than a single microblog can be inserted. This makes it almost infeasible to insert incoming microblogs, as they arrive, one by one. To overcome this issue, we employ the bulk insertion technique described below. The main idea is to buffer incoming microblogs in a memory buffer B, while maintaining a minimum bounding rectangle B_MBR that encloses the locations of all microblogs in B. Then, the bulk insertion module is triggered every t time units to flush all microblogs in B to the pyramid index. This is done by traversing the pyramid structure from the root to the lowest cell C that encloses B_MBR. If C is a leaf node, we append the contents of B to the top of the list of microblogs in C (C.M_List). This still ensures that M_List is sorted by timestamp, as the oldest microblog in B is more recent than the most recent entry in M_List. On the other hand, if C is a non-leaf node, we: (a) extract from B those microblogs that are represented by MBRs and cannot be enclosed by any of C's children, (b) append the extracted MBRs to the list of microblogs in C (C.M_List), (c) distribute the rest of the microblogs in B, based on their locations, into four quadrant buffers that correspond to C's children, and (d) execute bulk insertion recursively for each child cell of C using its corresponding buffer. The parameter t is a tuning parameter that trades off insertion overhead against the time until an incoming microblog becomes searchable. A microblog is searchable (i.e., can appear in a search result) only if it is inserted in the pyramid structure. So, the larger the value of t, the more efficient the insertion, yet an incoming microblog may be held in the buffer for a while before becoming searchable. A typical value of t is a couple of seconds, which is enough to have a few thousand microblogs inside B. Since a typical arrival rate in Twitter is 6K+ microblogs/second, setting t = 2 means that every two seconds we insert 12,000 microblogs into the pyramid structure, instead of inserting them one by one as they arrive.
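The recursive step of this procedure can be sketched as follows. The types and methods here (Cell, prependBatch, childEnclosing, and so on) are hypothetical stand-ins for Kite's internal structures, shown only to make the recursion concrete.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the recursive bulk-insertion step.
    interface Microblog { }

    interface Cell {
        boolean isLeaf();
        void prependBatch(List<Microblog> batch); // attach batch at head of M_List
        int childEnclosing(Microblog m);          // quadrant 0..3, or -1 if m's
                                                  // MBR spans multiple children
        Cell child(int quadrant);
    }

    final class BulkInserter {
        // A leaf absorbs the whole buffer at once; a non-leaf keeps MBR
        // microblogs that span its children and distributes the rest into
        // four quadrant buffers for the recursive calls.
        void bulkInsert(Cell c, List<Microblog> buffer) {
            if (buffer.isEmpty()) return;
            if (c.isLeaf()) {
                c.prependBatch(buffer); // buffer is newer than all of M_List
                return;
            }
            List<Microblog> keepHere = new ArrayList<>();
            List<List<Microblog>> quadrants = new ArrayList<>();
            for (int q = 0; q < 4; q++) quadrants.add(new ArrayList<>());
            for (Microblog m : buffer) {
                int q = c.childEnclosing(m);
                if (q < 0) keepHere.add(m);
                else quadrants.get(q).add(m);
            }
            c.prependBatch(keepHere);
            for (int q = 0; q < 4; q++) bulkInsert(c.child(q), quadrants.get(q));
        }
    }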
A microblog may thus stay for up to two seconds after its arrival before becoming searchable, which is a reasonable time and acceptable in most microblogs applications. Bulk insertion significantly reduces insertion time: instead of traversing the pyramid for each single microblog, we group thousands of microblogs into MBRs and use them as our traversal unit. Also, instead of inserting each single microblog in its destination cell, we insert a batch of microblogs by attaching a buffer list to the head of the microblog list.

Speculative cell splitting. Each pyramid index cell has a maximum capacity, set as an index parameter. If a leaf cell C has exceeded its capacity, a traditional cell splitting module would split C into four equi-area quadrants and distribute C's contents to the new quadrants according to their locations. Unfortunately, such a traditional splitting procedure may not be suitable for microblogs. The main reason is that microblog locations are highly skewed, where several microblogs may have the same exact location, e.g., microblogs tagged with a hot-spot location like a stadium. Hence, when a cell splits, all its contents may end up going to the same quadrant and another split is triggered. The split may continue forever unless there is a limit on the maximum pyramid height, allowing cells with higher capacity at the lowest level. This gives very poor insertion and retrieval performance due to highly skewed pyramid branches with fat cells at the lowest level. To avoid long skewed tree branches, we employ a speculative cell splitting module, where a pyramid cell C is split into four quadrants only if two conditions are satisfied: (1) C exceeds its maximum capacity, and (2) if split, microblogs in C will span at least two quadrants. While it is easy to check the first condition, checking the second condition is more expensive. To this end, we maintain in each pyramid cell a set of split bits (SplitBits) as a four-bit variable, one per cell quarter (initialized to zero). We use the SplitBits as a proxy for inexpensive checking of the second condition. After each bulk insertion operation in a cell C, we first check if C is over capacity. If this is the case, we check the second condition, where there can be only two cases for SplitBits: (1) Case 1: The four SplitBits are zeros. In this case, we know that C has just exceeded its capacity during this insertion operation. So, for each microblog in C, we check which quadrant it belongs to, and set its corresponding bit in SplitBits to one. Once we set two different bits, we stop scanning the microblogs and split the cell, as we now know that the cell contents will span more than one quadrant. If we end up scanning all microblogs in C with only one set bit, we decide not to split C, as we are sure that a split would end up having all entries in one quadrant. (2) Case 2: One of the SplitBits is one. In this case, we know that C was already over capacity before this bulk insertion operation, yet C was not split as all its microblogs belong to the same quadrant (the one with a set bit in SplitBits). So, we only need to scan the new microblogs that will be inserted in C and set their corresponding SplitBits. Then, as in Case 1, we split C only if two different bits are set. In both cases, when splitting C, we reset its SplitBits, create four new cells with zero SplitBits, and distribute the microblogs in C to their corresponding quadrants.
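The check in both cases boils down to the following sketch. The names are hypothetical; the key point is that the scan stops as soon as two distinct SplitBits are set.

    // Hypothetical sketch of the SplitBits-based splitting check.
    interface LeafCell {
        boolean overCapacity();
        int quadrantOf(Object microblog); // 0..3
        void setSplitBit(int quadrant);   // sets one of the four SplitBits
        int splitBitsSet();               // how many distinct bits are set
    }

    final class SpeculativeSplitter {
        // Split only if (1) the cell is over capacity and (2) its microblogs
        // span at least two quadrants; the scan stops as soon as two distinct
        // bits are set.
        static boolean shouldSplit(LeafCell c, Iterable<Object> microblogsToScan) {
            if (!c.overCapacity()) return false;
            for (Object m : microblogsToScan) { // all microblogs in Case 1,
                c.setSplitBit(c.quadrantOf(m)); // only the new ones in Case 2
                if (c.splitBitsSet() >= 2) return true;
            }
            return false; // everything falls in one quadrant: do not split
        }
    }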
Note that we never face a case where two (or more) of the SplitBits are ones: once two bits are set, we immediately split the cell and reset all bits to zeros. Using SplitBits significantly reduces insertion and query processing time as: (a) we do not have dangling skewed tree branches, and (b) we avoid the expensive checking step of whether cell contents belong to the same quadrant or not, as the check is now done infrequently on a set of bits. In the meantime, maintaining the integrity of SplitBits in each cell C comes with very little overhead. First, as long as C is under capacity, we do not read or set the value of SplitBits. Second, deleting entries from C has no effect on its SplitBits, unless C becomes empty, where we reset all bits to zero. The rationale here is that deletion does not change the status of SplitBits: if all entries in C are from the same quadrant, or if C is under capacity, then this will not change after deletion.

Lazy cell merging. When contents are deleted from a cell C during the memory flushing operation, if the total size of C and its siblings is less than the maximum cell capacity, a traditional cell merging algorithm would merge C with its siblings into one cell. However, with the high arrival rates of microblogs, we may end up spending most of the insertion and deletion overhead on splitting and merging pyramid cells, as the children of a newly split cell may soon merge again after deleting a few items. To avoid such overhead, we employ a lazy merging strategy, where we merge four sibling cells into their parent only if three out of the four quadrant siblings are empty. The idea is that once a cell C becomes empty, we check its siblings. If two of them are also empty, we move the contents of the third sibling to its parent, mark the parent as a leaf node, and remove C and its siblings from the pyramid index. This is lazy merging, where in many cases it may happen that four siblings include few items that could all fit into their parent. However, we avoid merging in this case to provide more stability for our highly dynamic index. Hence, once a cell C is created, it is guaranteed to survive for at least T time units before it can be merged again. This is because C will not be empty, i.e., eligible for merging, unless there are no insertions in C within T time units. Although lazy merging causes underutilized cells, this has a slight effect on storage and query processing, compared to saving 90% of redundant split/merge operations (as measured in practice), which leads to a significant reduction in index update overhead. Deletion from the index is handled by the Kite flushing manager; the details are covered in Section 2.3 and Chapter 3.
[Figure 2.3: Kite disk spatial index: temporally partitioned segments replicated at three levels, namely daily segments (one per calendar day), weekly segments (each consolidating seven daily segments), and monthly segments (each consolidating four weekly segments), absorbing the microblogs flushed from memory.]
Although lazy merging causes underutilized cells, this has only a slight effect on storage and query processing, compared to saving 90% of the redundant split/merge operations (as measured experimentally), which leads to a significant reduction in index update overhead. Deletion from the index is handled by the Kite flushing manager; the details are covered in Section 2.3 and Chapter 3.

2.2 Disk-based Indexing

To support large amounts of microblogs data over long periods, up to several months, Kite employs disk indexes to manage the microblogs that are expelled from main-memory. Similar to the main-memory indexes, disk indexes are supported on the spatial, temporal, and keyword attributes. However, the organization and structure of the disk indexes are different from the main-memory ones. In this section, we describe the index organization, structure, and update operations.

Index organization. Kite employs two disk indexes: a keyword index and a spatial index. Figure 2.3 shows the organization of the Kite disk spatial index, which is similar to the organization of the keyword index. Each index is organized in temporally partitioned segments. The temporal segments are replicated in a hierarchy of three levels, namely, daily segments, weekly segments, and monthly segments. The daily segments level stores the data of each calendar day in a separate segment. The weekly segments level consolidates each seven successive daily segments, which form the data of one calendar week, into a single weekly segment. Similarly, the monthly segments level consolidates each four successive weekly segments into a single segment that manages the data of a whole calendar month. This temporal hierarchy is arbitrary and can be tuned by the system administrator to serve the incoming query workloads. The main reason behind replicating the indexed data on multiple temporal levels is to minimize the number of accessed index segments while processing queries over different temporal periods. For example, for an incoming query asking about two months of data, if only daily segments were stored, the query processor would need to access sixty indexes to answer the query. With the described setting, the query processor needs to access only the two monthly indexes covering the query's time horizon. This significantly reduces the query processing time, so that Kite is able to support queries on relatively long periods.

Index structure and update. Kite disk keyword index segments are inverted indexes, while spatial index segments are partial quad-trees. Assume the temporal hierarchy is daily, weekly, and monthly, as shown in Figure 2.3. At any point in time, only a single daily segment is active to absorb the microblogs expelled from main-memory. Once a full day passes, the current active daily segment is concluded and a new empty segment is introduced to absorb the next incoming data. Upon concluding seven successive daily segments, a weekly segment is created in the background to merge the data of the whole week into a single index. The same repeats for monthly segments upon creating four successive weekly segments. Updating the disk indexes is performed in isolation from the employed flushing policy. Regardless of the flushing policy, disk indexes are always temporally disjoint from the main-memory indexes. In other words, whenever Kite inserts data in the disk indexes, there exists a checkpoint timestamp tcp such that all the data in the main-memory indexes is more recent than tcp and all the data in the disk-based indexes is older than or equal to tcp. With this guarantee, consolidating data from the main-memory keyword index into the disk keyword index is done very efficiently through bucket-to-bucket mapping, without deforming the temporal organization of the data.
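For illustration only, the following sketch shows one possible greedy segment-selection procedure over such a three-level hierarchy. The function names and calendar conventions (weeks starting on Monday) are assumptions of this example, not Kite's exact logic:

```python
from datetime import date, timedelta

def month_end(d: date) -> date:
    nxt = date(d.year + (d.month == 12), d.month % 12 + 1, 1)
    return nxt - timedelta(days=1)

def covering_segments(start: date, end: date):
    """Greedily cover [start, end] with monthly, weekly, then daily segments."""
    segments, cur = [], start
    while cur <= end:
        if cur.day == 1 and month_end(cur) <= end:        # a whole month fits
            segments.append(("monthly", cur))
            cur = month_end(cur) + timedelta(days=1)
        elif cur.weekday() == 0 and cur + timedelta(days=6) <= end:
            segments.append(("weekly", cur))               # a whole week fits
            cur += timedelta(days=7)
        else:
            segments.append(("daily", cur))                # fall back to days
            cur += timedelta(days=1)
    return segments

# A two-month query, e.g., Jan 1 to Feb 28, 2017, touches only two monthly
# segments instead of dozens of daily ones.
```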
In the rest of this section, we explain the consolidation process from main-memory to disk for both indexes, keyword and spatial. To consolidate data from the main-memory keyword index to the disk keyword index, we perform three steps. First, we check if the new data requires creating a new daily segment. For this, we check the oldest and the newest timestamps of the data to be flushed, which are provided by the flushing manager. If the two timestamps span two different days, a new daily segment is created. The second and third steps are performed repeatedly for each keyword slot in the index. Each slot contains a list of microblogs stored in reverse chronological order. The second step maps each slot from main-memory to the corresponding slot in the active disk index segment, based on the keyword hash value. The third step then merges the main-memory data list, L, into the existing microblogs list on disk. L is first checked for whether it spans two days, in which case the list is divided into two sublists and two index segments may be accessed. After that, each list (or sublist) of microblogs is merged into the corresponding slot by prepending it to the existing disk list. This is an O(1) operation due to the temporal order and disjointness of the two lists. The same operation applies to consolidating data from the main-memory pyramid index to the corresponding disk index.
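The per-slot merge can be sketched as follows, assuming reverse-chronological lists and a known day boundary; consolidate_slot and disk_slot are illustrative names. List concatenation here merely models the constant-time prepend that the temporally disjoint on-disk layout allows:

```python
def consolidate_slot(mem_list, disk_slot, day_boundary):
    """mem_list: reverse-chronological microblogs flushed from one hash slot."""
    newer = [m for m in mem_list if m.timestamp >= day_boundary]
    older = [m for m in mem_list if m.timestamp < day_boundary]
    # If the flushed list spans two days, each sublist goes to its own daily
    # segment; otherwise, one of the two sublists is simply empty.
    if newer:
        disk_slot.active_day.mlist = newer + disk_slot.active_day.mlist
    if older:
        disk_slot.previous_day.mlist = older + disk_slot.previous_day.mlist
```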

2.3 Flushing Management

Having data in main-memory and on disk simultaneously requires managing the movement of data from memory to disk. Once main-memory becomes full, a part of the in-memory data is selected by the flushing manager to be moved to the disk-based structures. The details of the selection criteria for the in-memory data to be flushed are presented in Chapter 3.

2.4 Experimental Evaluation

This section provides an experimental evaluation of Kite indexing. Section 2.4.1 describes the experimental setup, and the subsequent section shows the performance of the Kite index.

2.4.1 Experimental Setup

Our experiments are based on a real system deployment on a local cluster of commodity hardware machines, where a single microblog stream is digested on one machine and the system can digest multiple streams on multiple machines. Each machine has an Intel Core i7 processor with a clock speed of 3.40GHz, 64GB of main-memory, and 2TB of disk storage.

Dataset. We have continuously crawled Twitter data through the Twitter Streaming APIs since October 2013. Our dataset contains 10+ billion real tweets. Crawled data is stored in files; tweets are then read and timestamped to simulate an incoming stream of real microblogs at different high arrival rates.

Query workloads. For spatial queries, we use one million query points from the Bing Mobile search engine, obtained through our collaboration with Microsoft Research. We use these real query locations to query our system with point queries.

Default values. Unless mentioned otherwise, the default value of k is 100, the microblog arrival rate λ is 1000 microblogs/second, the range R is 30 miles, T is 6 hours, α is 0.2, β is 0.3, w is 1, the cell capacity is 150 microblogs, and the spatial and temporal scoring functions are linear. The default values of cell capacity, α, and β are selected experimentally and are shown to work best for query performance and result significance, respectively, while the default λ is the effective rate of US tweets.

[Figure 2.4: Spatial index insertion time, with four panels: (a) varying arrival rate, (b) varying k, (c) varying α, and (d) varying R. Each panel compares KT-B, KST-B, KT-I, and KST-I.]
As microblogs are so timely that Twitter gives only the most recent tweets (i.e., α=0), we set α to 0.2, as the temporal dimension is more important than the spatial dimension. All results are collected in the steady state, i.e., after running the system for at least T time units.

Performance measures. Our main performance measure is the insertion time consumed to digest one batch of data. A data batch is collected every second, and all tweets in the batch are inserted at once into the index structure.

2.4.2 Real-time Indexing Performance

The core of our indexing contributions is encapsulated in the in-memory spatial index. Thus, the experiments in this section extensively evaluate this part. We evaluate the index with and without the bulk insertion process. In particular, two alternatives are evaluated: (a) a Kite index that stores all microblogs of the last T time units (denoted KT), and (b) a Kite index that stores only the data needed for incoming queries using the index size tuning module (denoted KST). The index size tuning module is described in the main-memory management chapter; for the time being, all the reader needs to know is that the KST index stores less data than KT based on a technique described later, so the amount of data and the number of index cells in KST are less than in KT. Both alternatives are evaluated with and without bulk insertion: the bulk insertion variants are denoted with the suffix B, and the variants that insert microblogs individually in the index are denoted with the suffix I.

Figure 2.4 shows the index performance in terms of insertion time in milliseconds while varying different parameter values. The figure contrasts our bulk insertion techniques, along with the lazy split/merge criteria, against a non-bulk insertion technique that inserts microblogs one by one, for microblogs collected every second. Figure 2.4(a) shows the insertion time of both batch and individual insertion techniques while varying the tweet arrival rate per second. The bulk insertion techniques, KT-B and KST-B, show a significant performance boost of four times faster insertion compared to inserting one tweet at a time through KT-I and KST-I. Specifically, KT-B can digest up to 32,000 tweets/second, while KT-I cannot sustain 8,000 tweets/second and can only handle a few hundred less than this number. Similarly, KST-B can digest 64,000 tweets/second within half a second, while KST-I can sustain only up to a few hundred less than 16,000 tweets/second. This clearly shows the effectiveness of Kite's bulk insertion techniques, which reduce the amortized insertion time per microblog and so can sustain much higher arrival rates. Figures 2.4(b), 2.4(c), and 2.4(d) show the insertion time while varying k, α, and R, respectively. In all these figures, and for different parameter values, KT-B and KST-B show superior performance over KT-I and KST-I, with three times faster insertion in most cases. This supports the findings of Figure 2.4(a) and takes them a step further to show that Kite can handle a much larger number of microblogs per second regardless of the system parameter settings, which shows the robustness of Kite for different query workloads.

Chapter 3

Main-memory Management

Kite has two types of assets: data streams and index structures. Kite allocates a certain main-memory budget for each asset. When a system asset exceeds its allocated memory budget, part of its memory contents is moved to disk. In this chapter, we discuss the memory-management schemes for Kite's system assets.

Microblog data streams are the only data source in Kite. Each stream is allocated a certain amount of memory in the form of an in-memory buffer that continuously absorbs incoming data records. Managing a stream's memory buffer in Kite is fairly straightforward. Incoming data records are appended to the stream's in-memory buffer in temporal order. In addition, they are added to a recovery buffer that is much smaller than the stream buffer. The recovery buffer is flushed periodically to a corresponding disk buffer. The main purpose of the recovery buffer is to keep all data records of the stream in a disk-based repository, accounting for any failure and recovery operations. The recovery buffer is also used in managing the stream's in-memory buffer as follows. Once the stream's in-memory buffer fills up, a background thread is invoked to clean up P% of the buffer memory, where P is a system parameter with a typical value of 10-30%. The background thread scans the buffer from the oldest to the newest record. If a record is not indexed by any of the existing in-memory index structures, it is added to a discarding buffer. Once the discarding buffer's contents add up to the required P%, its data is discarded entirely from main-memory without moving it to disk. This is safe because this data was already flushed to disk from the recovery buffer shortly after its arrival. Having data flushed from the recovery buffer assures the correct temporal order of the on-disk data, as records are appended in the same order of arrival.
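The cleanup pass can be sketched as follows, assuming (for illustration) that each record tracks how many in-memory index entries reference it:

```python
P = 0.2  # fraction of the stream buffer to clean up (typical range 10-30%)

def cleanup(stream_buffer):
    """Background pass over the stream buffer, oldest record first."""
    target = int(P * len(stream_buffer))
    discarding = []
    for record in stream_buffer:          # records are kept in temporal order
        if record.index_refs == 0:        # not referenced by any in-memory index
            discarding.append(record)
            if len(discarding) >= target:
                break
    for record in discarding:             # drop with no disk write: the records
        stream_buffer.remove(record)      # were already persisted on arrival
```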

Note that the discarding buffer does not guarantee the temporal order of the data: a data record that is older in time might be kept in memory while a newer record is discarded, depending on which index structures point to them. The recovery buffer thus has two benefits. First, it backs up data for potential failure and recovery operations. Second, it keeps the on-disk data temporally ordered, avoiding significantly expensive reordering of on-disk data.

The other asset type in Kite is the index structure. Each microblog stream can have one or more index structures on its different attributes. System administrators are the ones who create or drop index structures. Once an index is created, it is allocated a memory budget which cannot be exceeded. If an in-memory index fills up its memory budget, a flushing policy is triggered to select a portion of its contents and move them to a corresponding disk-based index. Although the task looks trivial at first glance, it highly affects the query performance [75], as it controls the main-memory contents. Incoming queries to Kite are answered from both main-memory and disk contents: the more relevant data resides in main-memory, the less disk access is needed to answer queries, and hence the lower the query response time. Thus, the Kite flushing manager enables the system administrator to employ one of multiple available flushing policies. A flushing policy trades off the indexing and flushing overhead against the availability of data relevant to incoming queries in main-memory. In this chapter, we describe three categories of flushing policies, namely, Flush-All, Flush-Temporal, and Flush-Query-Based. Section 3.1 gives an overview of the different flushing policies; the subsequent sections then detail the major flushing policies that we contribute to the literature.

3.1 Flushing Policies Overview

The Kite flushing manager considers three categories of flushing policies, namely, Flush-All, Flush-Temporal, and Flush-Query-Based, introduced below.

Flush-All. The simplest flushing policy is to dump the whole memory contents to disk. This makes the main-memory indexing very flexible, as any number of segments can be used without a dramatic effect on the flushing process. It also minimizes the disk access overhead, as fewer flushing operations are performed. Obviously, Flush-All preserves the temporal disjointness between main-memory contents and disk contents, as it always dumps all the old data to disk before receiving new recent data in main-memory. However, Flush-All is not a recommended flushing policy, as it causes system slowdowns in query responses: once the flushing operation is performed, all the memory indexes become empty, and hence all incoming queries are answered only from disk contents. This causes a significant sudden slowdown, which is undesirable from a system point of view.

Flush-Temporal. An alternative flushing policy, currently employed by existing techniques in the literature, is to expel a certain portion of the oldest microblogs to make room for the newly arriving real-time microblogs. To reduce the flushing overhead, this policy requires the main-memory indexing to partition the data into segments matching the flushing unit. Referring to the main-memory index organization in Figure 2.1, the flushing unit is defined as T hours, i.e., the oldest T hours of data are flushed periodically. In this policy, T is a system parameter adjusted by the system administrator based on the available memory resources, the rate of incoming microblogs, and the desired flushing frequency. Flush-Temporal also preserves the temporal disjointness between main-memory contents and disk contents, as it always dumps the data of a certain period of time; this moves the temporal checkpoint tcp forward by exactly T hours without causing any temporal overlap. In addition, Flush-Temporal addresses the limitations of Flush-All and does not cause sudden significant system slowdowns. This comes at the cost of more frequent flushing operations, and hence more frequent disk access. However, as a background process, the flushing disk access does not significantly affect the system performance.

Flush-Query-Based. A main problem of Flush-Temporal is that it ignores the characteristics of incoming query workloads. It selects flushing victims based on the data record timestamp without considering whether a data record might be reported in any of the incoming queries. Consequently, a large portion of the memory contents, up to 60%, could be irrelevant to incoming queries, as we elaborate later in this chapter and in our published work [22, 30]. This means poor utilization of memory resources and frequent disk access for incoming queries to retrieve their complete answers, while in fact there is plenty of memory space to accumulate relevant data. To overcome this problem, we contribute to the literature the Flush-Query-Based category of flushing policies, which takes into account the incoming query workloads to provide high memory utilization and more efficient query processing by increasing the memory hit ratio, i.e., the percentage of queries that can retrieve their entire answers from main-memory contents.
The main idea of Flush-Query-Based policies is to figure out the characteristics of data that cannot contribute to any query answer, and expel such data. For example, suppose the query asks for the most recent k microblogs that contain a certain keyword. Then, if the inverted index slot of any keyword contains more than k microblogs, all microblogs older than the kth one will never make it to any query answer. Thus, those microblogs can be safely expelled to make space for more relevant microblogs to reside in main-memory. In the subsequent sections, we propose two flushing policies, one for the Kite spatial index and one for the keyword index.

3.2 Managing Kite Spatial Index Memory

In this section, we present main-memory management techniques for one segment of the Kite in-memory spatial index presented in Section 2.1.2, which serves the spatial microblog query formulated in Definition 1.1. First, we present the flushing (or deletion) process that is performed on the index contents. Then, we propose three main-memory techniques for this index structure, namely, index size tuning, index load shedding, and index adaptive load shedding, which are used to select which data records are moved to disk through the flushing process.

Removing an item M from the pyramid structure can be done in a traditional way [74] by traversing the pyramid from its root to the cell C that encloses M, and then removing M from C's list. Unfortunately, such a traditional deletion procedure cannot scale to Kite's needs. Since we need to limit the index contents to objects from the last T time units, we would need to keep pointers to all microblogs and chase them one by one as they fall out of the temporal window T, which is a prohibitively expensive operation. To overcome this issue, we employ a bulk deletion module where all deletions are done in bulk. Bulk deletion starts with the trigger of the flushing operation in the index, so that index contents are cleaned up lazily without slowing down the system operations. We exploit two strategies for bulk deletion, namely, piggybacking and sweeping bulk deletions, described below.

Piggybacking Bulk Deletion. The idea is to piggyback the deletion operation on insertion. Once a microblog is inserted in a cell C, we check if C has any items older than T time units in its microblog list (MList). As MList is ordered by timestamp, we use binary search to find its most recent item M that is older than T. If M exists, we trim MList by removing everything from M onward. Piggybacking deletion on insertion saves significant time, as we share the pyramid traversal and cell access with the insertion operation.

Sweeping Bulk Deletion. With piggybacking bulk deletion, a cell C may still hold some useless microblogs that have not been deleted yet, due to a lack of recent insertions in C. To avoid such cases, we trigger a light-weight sweeping bulk deletion process every T′ time units (we use T′ = 0.5T). In this process, we go through each cell C and check only the first (i.e., most recent) item M ∈ C.MList. If M arrived more than T time units ago, we wipe C.MList. If M arrived within the last T time units, we do nothing and skip C. It may be the case that C still has some expired items, yet we intentionally overlook them in order to keep the deletion light-weight. Such items will be deleted either on the next insertion or in the next sweeping cleanup.
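Both strategies can be sketched as follows, assuming each cell's MList is ordered newest-first; all names are illustrative:

```python
def piggyback_trim(cell, now, T):
    """Run as part of an insertion into `cell`: trim entries older than T."""
    cutoff = now - T
    lo, hi = 0, len(cell.mlist)           # binary search for the most recent
    while lo < hi:                        # item that is older than T
        mid = (lo + hi) // 2
        if cell.mlist[mid].timestamp >= cutoff:
            lo = mid + 1
        else:
            hi = mid
    del cell.mlist[lo:]                   # bulk-trim the expired suffix

def sweeping_pass(cells, now, T):
    """Run every T' = 0.5T: wipe cells whose newest item has expired."""
    for cell in cells:
        if cell.mlist and now - cell.mlist[0].timestamp > T:
            cell.mlist.clear()            # the whole list is expired
        # otherwise skip the cell, even if it holds a few expired items
```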

3.2.1 Spatial Index Size Tuning

Our discussion in this section and in Section 2.1.2 has assumed that all microblogs posted in the last T time units are stored in the in-memory pyramid structure. Hence, a query with any temporal boundary ≤ T is guaranteed to find its answer entirely in memory. In this section, we introduce the index size tuning module that takes advantage of the natural skewness of data arrival rates over different pyramid cells to achieve storage savings (∼50% less storage) without sacrificing answer quality (accuracy >99%), as presented in the experimental results. Our index size tuning is motivated by two main observations: (1) The top-k microblogs in areas with high microblog arrival rates can be obtained from a much shorter time window than in areas of low arrival rates, e.g., the top-k microblogs in downtown Chicago may be obtained from the last 30 minutes, while it may take ten hours' worth of data in a suburban area. (2) The α weighting factor in the Fα ranking function (see Definition 1.1) plays a major role in how far we need to go back in time to look for microblogs. If α = 1, the top-k microblogs are the closest ones to the user location, regardless of their arrival time within T. If α = 0, the top-k microblogs are the most recent ones posted within R, so we look back only for the time needed to accumulate k microblogs.

Then, for each cell C, we find the minimum search time horizon Tc ≤ T such that an incoming query to C finds its answer in memory. Assume the microblog arrival rate for cell C is λc and that we use the linear scoring functions. Then, Tc is given by the following equation:

For linear scoring functions:

$$T_c = \min\left(T,\ \frac{\alpha}{1-\alpha}\,T + \frac{k}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c}\right) \qquad (3.1)$$

The derivation of the Tc value is detailed in Lemma 3.1 below. Following the same steps, we can derive Tc for the exponential scoring function:

For exponential scoring functions:

$$T_c = \min\left(T,\ \frac{T}{w}\,\ln\left[\frac{\alpha}{1-\alpha}\left(e^{w}-1\right) + e^{\frac{wk}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c T}}\right]\right) \qquad (3.2)$$
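For concreteness, the following sketch evaluates Equations 3.1 and 3.2 numerically; the parameter names mirror the text, and the example values are hypothetical:

```python
import math

def tc_linear(T, alpha, k, area_R, area_C, lam_c):
    lam_R = min(area_R / area_C, 1.0) * lam_c     # arrival rate seen inside R
    return min(T, (alpha / (1 - alpha)) * T + k / lam_R)

def tc_exponential(T, alpha, k, area_R, area_C, lam_c, w):
    lam_R = min(area_R / area_C, 1.0) * lam_c
    inner = (alpha / (1 - alpha)) * (math.exp(w) - 1) \
            + math.exp(w * k / (lam_R * T))
    return min(T, (T / w) * math.log(inner))

# Hypothetical example with the thesis defaults alpha=0.2, k=100: a cell
# whose area is covered by R (so lam_R = lam_c) with lam_c = 50 per hour
# and T = 6 hours gives tc_linear = min(6, 1.5 + 2.0) = 3.5 hours.
```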

We next discuss the impact of the index size tuning module on the various operations of the Kite index. Equations 3.1 and 3.2 mean that in order for a microblog M in cell C to make it to the top-k answer, M has to arrive within the last Tc time units, where Tc ≤ T, and so any older microblog can be safely shed without affecting the query accuracy. Therefore, we save memory space by storing fewer microblogs. We next discuss the impact of employing the Tc values on the Kite index.

Index Structure. Each pyramid cell C keeps track of two additional variables: (1) λc, the arrival rate of microblogs in C, which is continuously updated on the arrival of new microblogs, and (2) Tc, the temporal boundary in cell C computed from Equation 3.1 or 3.2, and updated with every update of λc.

Index Operations. Insertion in the pyramid index has the following two changes: (1) For all cells visited during the insertion process, we update the values of λc and Tc. (2) If Tc is updated with a new value, we have one of two cases: (a) The value of Tc decreased. In this case, microblogs that were posted in the time interval between the old and new values of Tc are immediately deleted. (b) The value of Tc increased. In this case, we have a temporal gap between the new and old values of Tc where there are no microblogs. However, given the rate of updates to Tc, such a gap fills up quickly and hence has very little impact on query answers. On the other side, the flushing module deletes microblogs whose timestamps are older than Tc from each cell C, rather than based on one global value T for all cells.

Index Maintenance. When a cell C splits into four quadrant cells, the value of λc in each new child cell Ci is set based on the ratio of microblogs from cell C that go to cell Ci. As Kite employs a lazy merging policy, i.e., four cells are merged into a parent cell C only if three of them are empty, the value of λc at the parent cell C is set to the arrival rate of its only non-empty child.

It is also worth mentioning that optimizing the index for preset default parameter values does not prevent Kite from adapting to changes in these values in the middle of operations. If a system administrator changes the default values mid-operation, the index contents adapt to the new values in the following data insertion cycles (based on the newly computed values of Tc). So, all that is needed is to plan changing the values ahead of time if the new queries require more data to fulfill their answers, i.e., if they lead to increasing Tc values.

Lemma 3.1 Given query parameters k, R, T, and α, and the average arrival rate λc of microblogs in cell C, the spatio-temporal query answer from cell C can be retrieved from those microblogs that arrived in the last Tc time units, where:

$$T_c = \min\left(T,\ \frac{\alpha}{1-\alpha}\,T + \frac{k}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c}\right)$$

Proof: The proof is composed of three steps. First, we compute the value of λR as the expected arrival rate of microblogs in the query area R, among the microblogs in cell C with arrival rate λc. This depends on the ratio of the two areas Area(R) and Area(C). If Area(R) < Area(C), then $\lambda_R = \frac{Area(R)}{Area(C)}\,\lambda_c$; otherwise, all microblogs from C contribute to R, and hence λR = λc. This can be put formally as:

$$\lambda_R = \min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c$$

Second, we compute the shortest time Tk needed to form a set of k microblogs as an initial answer. This corresponds to the time to receive the first k microblogs that arrive within cell C and area R. Since λR is the rate of microblog arrival in R, i.e., we receive one microblog every $\frac{1}{\lambda_R}$ time units, we need $T_k = \frac{k}{\lambda_R}$ time units to receive the first k microblogs.

Finally, we compute the maximum time interval Tc within which a microblog M in cell C and area R can make it to the list of top-k microblogs according to our ranking function F. In order for M to make it to the top-k list, M has to have a better (i.e., lower) score than the microblog Mk that has the kth (i.e., worst) score of the initial top-k, i.e., F(M) < F(Mk). To be conservative in our analysis, we assume that: (a) M has the best possible spatial score, zero, i.e., M has the same location as the user location. In this case, F(M) relies only on its temporal score, i.e., $F(M) = (1-\alpha)\frac{T_c}{T}$, where Tc = NOW − M.time indicates the search time horizon Tc that we are looking for; and (b) Mk has the worst possible spatial and temporal scores among the initial k ones. While the worst spatial score would be one, i.e., Mk lies on the boundary of R, the worst temporal score would take place if Mk arrived Tk time units ago. So, the score of Mk can be set as $F(M_k) = \alpha + (1-\alpha)\frac{k}{\lambda_R T}$. Accordingly, to satisfy the condition F(M) < F(Mk), the following should hold:

$$(1-\alpha)\,\frac{T_c}{T} \;<\; \alpha + (1-\alpha)\,\frac{k}{\lambda_R T}$$

This means that in order for M to make it to the answer list, Tc should satisfy:

$$T_c \;<\; \frac{\alpha}{1-\alpha}\,T + \frac{k}{\lambda_R}$$

By substituting the value of λR and bounding Tc by the value of T, as we cannot go further back in time than T, the maximum value of Tc would be:

$$T_c = \min\left(T,\ \frac{\alpha}{1-\alpha}\,T + \frac{k}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c}\right) \qquad\blacksquare$$

In the case of the exponential ranking function presented in query Definition 1.1, the equation for Tc is given as follows:

$$T_c = \min\left(T,\ \frac{T}{w}\,\ln\left[\frac{\alpha}{1-\alpha}\left(e^{w}-1\right) + e^{\frac{wk}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c T}}\right]\right) \qquad (3.3)$$

This equation can be derived using exactly the same steps as in the case of the linear ranking function. The proof is as follows.

Proof: The proof is composed of exactly the same three steps as the previous one. The first and second steps are independent of the ranking function and give:

$$\lambda_R = \min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c \qquad\qquad T_k = \frac{k}{\lambda_R}$$

Then, we compute the maximum time interval Tc within which a microblog M in cell C and area R can make it to the list of top-k microblogs according to the exponential ranking function F. In order for M to make it to the top-k list, M has to have a better (i.e., lower) score than the microblog Mk that has the kth (i.e., worst) score of the initial top-k, i.e., F(M) < F(Mk). To be conservative in our analysis, we assume that: (a) M has the best possible spatial score, zero, i.e., M has the same location as the user location. In this case, F(M) relies only on its temporal score, i.e., $F(M) = (1-\alpha)\,\frac{e^{w\frac{T_c}{T}}-1}{e^{w}-1}$, where Tc = NOW − M.time indicates the search time horizon Tc that we are looking for; and (b) Mk has the worst possible spatial and temporal scores among the initial k ones. While the worst spatial score would be one, i.e., Mk lies on the boundary of R, the worst temporal score would take place if Mk arrived Tk time units ago. So, the score of Mk can be set as $F(M_k) = \alpha + (1-\alpha)\,\frac{e^{w\frac{k}{\lambda_R T}}-1}{e^{w}-1}$. Accordingly, to satisfy the condition F(M) < F(Mk), the following should hold:

$$(1-\alpha)\,\frac{e^{w\frac{T_c}{T}}-1}{e^{w}-1} \;<\; \alpha + (1-\alpha)\,\frac{e^{w\frac{k}{\lambda_R T}}-1}{e^{w}-1}$$

By rearranging the two sides and substituting the value of λR, and bounding Tc by the value of T, as we cannot go further back in time than T, the maximum value of Tc would be:

$$T_c = \min\left(T,\ \frac{T}{w}\,\ln\left[\frac{\alpha}{1-\alpha}\left(e^{w}-1\right) + e^{\frac{wk}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c T}}\right]\right) \qquad (3.4) \qquad\blacksquare$$

3.2.2 Spatial Index Load Shedding

Even with the index size tuning module, there could be cases where there is not enough memory to hold all microblogs from the last Tc time units in each cell, e.g., very scarce memory or time intervals with very high arrival rates. Also, some applications are willing to trade a slight decrease in query accuracy for a large saving in memory consumption. In such cases, Kite triggers a load shedding module that smartly selects and expires a set of microblogs from memory such that the effect on query accuracy is minimal. The main idea of the load shedding module is to use a less conservative analysis than that of the index size tuning module. In particular, Equations 3.1 and 3.2 consider the very conservative case where every stored microblog M may have a query that comes exactly at M.loc, i.e., Ds(M.loc, u.loc) = 0. The load shedding module relaxes this assumption and assumes that queries are posted βR miles away from M, i.e., Ds(M.loc, u.loc) = βR, where 0 ≤ β ≤ 1. Using this relaxed assumption, we can revise the time horizon per cell to be:

For linear scoring functions:

$$T_{c,\beta} = \min\left(T,\ \frac{\alpha(1-\beta)}{1-\alpha}\,T + \frac{k}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c}\right)$$

For exponential scoring functions:

$$T_{c,\beta} = \min\left(T,\ \frac{T}{w}\,\ln\left[\frac{\alpha}{1-\alpha}\left(e^{w}-e^{w\beta}\right) + e^{\frac{wk}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c T}}\right]\right) \qquad (3.5)$$

We use the term Tc,β instead of Tc to indicate the search time horizon for each cell C when the load shedding module is employed. The derivation of Tc,β is detailed below.
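Analogously, the load-shedding horizon can be computed as in the sketch below (illustrative code, under the same assumptions as the earlier Tc sketch):

```python
import math

def tcb_linear(T, alpha, beta, k, area_R, area_C, lam_c):
    lam_R = min(area_R / area_C, 1.0) * lam_c
    return min(T, (alpha * (1 - beta) / (1 - alpha)) * T + k / lam_R)

def tcb_exponential(T, alpha, beta, k, area_R, area_C, lam_c, w):
    lam_R = min(area_R / area_C, 1.0) * lam_c
    inner = (alpha / (1 - alpha)) * (math.exp(w) - math.exp(w * beta)) \
            + math.exp(w * k / (lam_R * T))
    return min(T, (T / w) * math.log(inner))

# Continuing the earlier hypothetical example (T=6, alpha=0.2, k=100,
# lam_R=50) with beta=0.3: tcb_linear = min(6, 1.05 + 2.0) = 3.05 hours,
# a tighter horizon than the 3.5 hours of T_c.
```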

Per the two equations above, Tc,β gives a tighter temporal coverage for each cell, as Tc,β ≤ Tc. β acts as a tuning parameter that trades off significant storage savings against a slight loss of accuracy. As shown below, a storage saving of β results in an accuracy loss of at most β³. For example, if β = 0.3, a 30% storage saving is traded for a maximum of 2.7% accuracy loss. The experimental evaluation shows even better performance in practice.

Next, we present the derivation of the value Tc,β for each cell C, such that only those microblogs that arrived in C in the last Tc,β time units are kept in memory, and analyze the accuracy loss of the load shedding module. We first derive the value of Tc,β and then analyze the accuracy loss.

Storage Saving

Building on the derivation of Tc in Section 3.2.1, and assuming the linear scoring functions, we relax the very conservative assumption of having a query location exactly at the location of a microblog M. Hence, the condition in the proof of Lemma 3.1 is re-formulated as:

$$\alpha\beta + (1-\alpha)\,\frac{T_{c,\beta}}{T} \;<\; \alpha + (1-\alpha)\,\frac{k}{\lambda_R T}$$

Then, in order for a microblog M to make it to the answer list, Tc,β should satisfy:

$$T_{c,\beta} \;<\; \frac{\alpha(1-\beta)}{1-\alpha}\,T + \frac{k}{\lambda_R}$$

By substituting the value of λR and bounding Tc,β by the value of T, the maximum value of Tc,β would be:

$$T_{c,\beta} = \min\left(T,\ \frac{\alpha(1-\beta)}{1-\alpha}\,T + \frac{k}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c}\right)$$

Following the same steps, we can derive Tc,β for the exponential scoring functions to be:

$$T_{c,\beta} = \min\left(T,\ \frac{T}{w}\,\ln\left[\frac{\alpha}{1-\alpha}\left(e^{w}-e^{w\beta}\right) + e^{\frac{wk}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c T}}\right]\right)$$

Accuracy Loss

Given the less conservative assumption above, there is a chance to miss microblogs that could have made it to the final result. In particular, there is an area Ax in the spatio-temporal space that is not covered by Tc,β. A microblog M in area Ax satisfies two conditions: (1) the spatial score of M is less than β, and (2) the temporal distance of M is between Tc,β and Tc. We measure the accuracy loss as the ratio of the area covered by Ax to the whole spatio-temporal area covered by R and T, i.e., R × T. This is measured by multiplying the ratios of Ax's temporal and spatial dimensions, Tratio and Rratio, to the whole space. The temporal ratio Tratio can be measured as:

$$T_{ratio} = \frac{T_c - T_{c,\beta}}{T_c} = \frac{\left(\frac{\alpha}{1-\alpha}\,T + \frac{k}{\lambda_R}\right) - \left(\frac{\alpha(1-\beta)}{1-\alpha}\,T + \frac{k}{\lambda_R}\right)}{\frac{\alpha}{1-\alpha}\,T + \frac{k}{\lambda_R}}$$

This leads to:

$$T_{ratio} = \beta \times \frac{\frac{\alpha}{1-\alpha}\,T}{\frac{\alpha}{1-\alpha}\,T + \frac{k}{\lambda_R}} \;\le\; \beta$$

This means that the temporal ratio is bounded by β.

For the spatial ratio, consider that Ax and R are represented by circular areas around the querying user's location with radii Radius(Ax) and Radius(R), respectively. Since a microblog M at distance Radius(Ax) has a spatial score of β while a microblog at distance Radius(R) has a spatial score of 1, we have Radius(Ax) = β·Radius(R). Hence, the ratio of the spatial dimension is:

$$R_{ratio} = \frac{Area(A_x)}{Area(R)} = \frac{\pi\,Radius(A_x)^2}{\pi\,Radius(R)^2} = \frac{\beta^2\,Radius(R)^2}{Radius(R)^2} = \beta^2$$

Hence, the accuracy loss can be formulated as:

$$AccuracyLoss_\beta = T_{ratio} \times R_{ratio} \;\le\; \beta^3$$

This shows a cubic accuracy loss in terms of β, e.g., if β = 0.3, we have a maximum of 2.7% loss in accuracy for a 30% storage saving.

For the exponential scoring function, the area Ax in the spatio-temporal space is bounded by two exponential curves, and hence its area can be calculated by integrating under the bounded region. Roughly, however, expelling exponentially scored microblogs leads to much less accuracy loss, as a slight increase in either spatial or temporal distance leads to an exponential decay in the relevance score. The experimental evaluation clearly verifies this observation.

3.2.3 Spatial Index Adaptive Load Shedding

As shown in Section 3.2.2, load shedding in the Kite spatial index uses a global parameter β that represents the minimum spatial distance, as a ratio of R, between query locations and microblog locations. Choosing the right value for β is challenging, as it should change across space and time: microblog queries change dynamically over time [76], and a single value limits the cost-benefit trade-off of load shedding. More importantly, the spatial distribution of both microblogs data and queries changes substantially across regions [22], and therefore a global parameter may poorly treat sparse regions for which few queries are issued while aggressively treating dense regions where most queries are issued. In this section, we introduce two methods, β-load shedding and γ-load shedding, which tune the load shedding process. Both methods extend the β-parameterized load shedding module in three aspects: (a) they tune the load shedding automatically, so that the system administrator does not need to preset a fixed value for β; (b) they keep one load shedding parameter value per spatial index cell, instead of one global value for all regions, to adapt to the localized distributions of incoming data and queries; and (c) they update the load shedding parameter values dynamically over time to reflect the changes in both data and queries. In the rest of this section, we develop the two methods and discuss their impact on the spatial index components.

β-Load Shedding

In β-Load Shedding (β-LS for short), each index cell stores a parameter β that is used to determine its temporal horizon Tc,β (per the equations above). β has exactly the same meaning as described in Section 3.2.2; however, it is distinct per index cell instead of being a global parameter for the whole index. In addition, β-LS automatically tunes β values based on the incoming query loads. Each cell C then uses its own auto-tuned β value to keep only microblogs from the last Tc,β time units and shed older microblogs. In the rest of this section, we describe the automatic tuning of β along with the impact of β-LS on the Kite spatial index components.

Main idea. The main idea is to distinguish the spatial regions based on the percentage of queries they receive. Regions that receive a large percentage of the incoming queries are considered important spatial regions, and so a small portion of their data is shed, i.e., a small β value is assigned, to reduce the likelihood of missing answer microblogs for many queries. On the contrary, cells that receive a small percentage of queries are considered less important, so shedding more data will not significantly affect the query accuracy, and a large β value is assigned. Thus, we use only the spatial distribution of incoming queries to estimate the per-cell β value, which is bounded in the range [0, 1].

Implementation. For each index cell C, we keep the percentage of queries that process C's microblogs out of all queries posted to the index. Specifically, for the whole index, we keep a single long integer Qtotal that counts the total number of queries posted to the index so far. In addition, for each cell C, we maintain an integer C.Qc that counts the number of queries that process one or more microblogs from C.MList. Then, whenever the C.Tc,β value is updated, on insertions in C, the value of β is estimated by β = 1 − C.Qc/Qtotal. Consequently, cells that did not receive any queries, i.e., C.Qc = 0, are assigned a β value of 1, and a large amount of data is shed. On the other hand, cells that receive a large percentage of queries are assigned a small β value and hence shed much less data. Both Qtotal and C.Qc are reset every T time units, measured from the system start timestamp, so that β values are estimated based only on the recent queries, to adapt to the dynamic changes in query loads over time. By definition, 0 ≤ C.Qc ≤ Qtotal for all C, so the β value is bounded in the range [0, 1].

Impact on Kite index components. The β-LS implementation impacts the index contents: each index cell C maintains an additional integer C.Qc, which ends up having little impact on the overall index storage (less than 0.5MB extra, which does not exceed 1-4% of the overall storage) compared to the big storage saving that comes from shedding more microblogs. During query processing, Qtotal is maintained for the whole index and C.Qc is maintained for each cell. Although they are concurrently accessed from multiple query threads, the concurrent operation on both of them is only a single atomic increment, which causes little overhead in query latency, as our experiments show.
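A minimal sketch of this bookkeeping is shown below; in Kite, Qtotal and C.Qc would be atomic counters shared by query threads, which this single-threaded illustration does not model:

```python
class BetaLS:
    def __init__(self):
        self.q_total = 0                      # queries posted to the index so far

    def on_query(self, touched_cells):
        self.q_total += 1                     # a single atomic increment in practice
        for cell in touched_cells:            # cells whose microblogs were processed
            cell.q_c += 1

    def beta_of(self, cell):
        if self.q_total == 0:
            return 1.0                        # no queries yet: shed aggressively
        return 1.0 - cell.q_c / self.q_total  # popular cells get a small beta

    def reset(self, cells):
        """Called every T time units so beta tracks only recent query loads."""
        self.q_total = 0
        for cell in cells:
            cell.q_c = 0
```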

γ-Load Shedding

In the γ-Load Shedding method (γ-LS for short), we go one step beyond using only the query spatial distribution (as in β-LS) and use the access pattern of the microblogs data inside the cell. β-LS increases the importance measure of a cell C as long as one or more of its microblogs is processed by a query, regardless of the actual number of processed microblogs from C.MList. On the contrary, γ-LS considers which microblogs are actually processed from the cell, so that each cell stores only the useful data. To illustrate, one of the findings of traditional load shedding (presented in Section 3.2.2) is that the analytical values of Tc and Tc,β do not comply with the theoretical expectations. Specifically, Tc achieves < 100% query accuracy while it is expected to provide accurate results, whereas Tc,β achieves query accuracy much higher than the theoretical bound of ((1−β³)×100)%. This means that each cell C stores either less or more data than is needed. Motivated by this finding, γ-LS aims to adjust the cell storage so that only the microblogs that are sufficient to answer all incoming queries accurately are stored.

Main idea. The main idea is to estimate for each cell C the minimum search time horizon Tc,γ ≤ T such that C keeps only the data useful to answer incoming queries from main-memory contents. Unlike Tc, which is calculated analytically based only on the default query parameters as discussed in Section 3.2.1, Tc,γ is calculated adaptively with the incoming query load. As Tc and Tc,β are shown to be close to the optimal time horizon, we make use of the Tc and Tc,β equations to calculate Tc,γ. Particularly, we replace the parameter β in Tc,β with another parameter γ. Unlike β, γ can take any value rather than being bounded in the range [0, 1]. Thus, γ is a tuning parameter whose values have three possible cases: (1) γ ∈ [0, 1]: in this case, γ has the same effect as β (see Section 3.2.2) and controls the amount of shed data through controlling the value of Tc,γ ≤ Tc ≤ T. (2) γ > 1: in this case, the value of Tc,β at β = 1 is too large for the queries incoming to this cell; the term (1 − γ) then gives a negative value and decreases the cell's temporal coverage, shedding the useless data that increases the storage overhead without contributing to query answers. (3) γ < 0: in this case, the value of Tc,β at β = 0, i.e., Tc, is too small to answer all the queries incoming to this cell; the term (1 − γ) then gives a value larger than 1 and increases the cell's temporal coverage to answer all the incoming queries accurately. Although Case 3 may lead to a slight increase in the storage overhead for some parameter setups, e.g., at α ≈ 0, it consequently fills the gap between the theoretical assumptions behind the Tc value and the practical data distribution that leads to loss in query accuracy, as shown in the experimental results.

Implementation. To implement γ-LS, the Kite index maintains a γ value in each index cell that changes adaptively with the incoming queries. To this end, the γ value is calculated as follows. For an incoming query, we measure the time horizon Tc,γ that spans all the processed microblogs, i.e., the useful data, in C. Obviously, Tc,γ equals the difference between NOW and the oldest processed microblog. Based on the measured value of Tc,γ, we calculate a value γ using the following equations:

For linear scoring functions:

$$\gamma = 1 - \frac{1-\alpha}{\alpha T}\left(T_{c,\gamma} - \frac{k}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c}\right)$$

For exponential scoring functions

$$\gamma = \frac{1}{w}\,\ln\left[e^{w} + \frac{1-\alpha}{\alpha}\left(e^{\frac{wk}{\min\left(\frac{Area(R)}{Area(C)},\,1\right)\times\lambda_c T}} - e^{\frac{wT_{c,\gamma}}{T}}\right)\right] \qquad (3.6)$$

Equations 3.2.3 and 3.6 are derived from Equations 3.2.2 and 3.5 by replacing β with γ and separating γ in the left hand side. If the ln parameter has a negative value, the negative sign is omitted and multiplied by the final result. Using a series of γ values, from subsequent queries, one estimated value of γ is calculated for each cell C. Then, the estimated γ value is used in Equations 3.2.2 and 3.5, replacing β, to calculate the actual cell temporal coverage Tc,γ. To estimate γ value per cell, we use a sample of the incoming queries to the cell. This sample is chosen randomly and independently per cell. For each query in the sample, a γ value is calculated, during the query processing, as described above. Then, the estimated γ is calculated by one of two methods: min or average where the minimum or the average value, respectively, so far is used. In both cases, γ value is reset every T time units, measured from the system start timestamp, so that it is estimated only based on the recent queries to adapt with the dynamic changes in query loads over time. The query sample is chosen randomly per cell for two reasons: (a) As γ is calculated during the query processing, then using all the incoming queries may be overwhelming to the query latency with a heavy query load in real time, so only a sample of queries are being used to reduce this overhead. (b) The query sample is chosen randomly and independently for each cell to eliminate any bias for a certain subset of the queries. As calculating γ value in each cell is independent from all other cells contents, choosing a different query sample for each cell is valid and leads to highly reliable load shedding as almost all incoming queries have a chance to contribute to tuning the load shedding in some cells. Impact on Kite index components. γ-LS implementation mainly impacts the query processing and slightly impacts index contents. For each index cell C, a single estimated value C.γ is maintained incrementally. In addition, an additional integer is maintained per cell in case C.γ is estimated using the incremental average. Both end 47 up with maximum of 1MB extra storage on the average which is much less than storage saving of the shed microblogs and presents a negligible percentage of 1−4% out of the overall storage consumption. C.γ is incrementally maintained when incoming queries access some microblogs from C.M List. For each query, a new γ value is calculated as described above and its estimated value C.γ is updated accordingly. Although C.γ is concurrently accessed from multiple query threads, only a single concurrent operation is needed to set the new value which is a little overhead compared to the expensive computation of γ values. The experimental results show the query latency overhead of γ-LS compared to its storage saving and accuracy enhancement. In both β-LS and γ-LS, the search time horizon is calculated based on Equa- tions 3.2.2 and 3.5. To prevent the cancellation of the major term when α = 0, which totally discards the automatic tuning of the adaptive load shedding module, we replace α in the numerator of these equations by (α + ǫ) so that the values of β and γ work for adjusting the amount of load shedding. We set ǫ = 0.0001. The experimental results show that this heuristic increases the query accuracy, at α = 0, from ≈ 95%, as in traditional load shedding, to over 99% in adaptive load shedding.

3.3 Managing Kite Keyword Index Memory

In this section, we present a main-memory management technique, called kFlushing, for one segment of the Kite in-memory hash (keyword) index presented in Section 2.1.1, which serves the keyword microblog query formulated in Definition 1.2. We first present the rationale behind the kFlushing technique, then we present its details.

Existing work on (top-k) search queries on microblogs [21, 27, 24] mainly focuses on building scalable main-memory indexing techniques to digest incoming microblogs at their high arrival rates. Existing index structures, along with their query processing techniques, either explicitly or implicitly make the following two assumptions: (1) Memory is so large that almost all queries of interest will be answered from in-memory contents. In case the answer is not found in memory, the search continues in another disk-based index structure; however, only the in-memory query response time is reported for performance evaluation, ignoring the disk access. (2) Once the memory is filled up, a chunk of the oldest in-memory microblogs is flushed to disk, leaving their valuable memory space for new incoming microblogs.

Unfortunately, the implications of these two implicit assumptions are severely underestimated in all prior work. Such assumptions are geared only towards in-memory query response time, while ignoring another critical performance measure, the memory hit ratio, i.e., the ratio of queries that are completely answered from in-memory contents. Under these two implicit assumptions, existing techniques may perform badly due to a low memory hit ratio, as many of the incoming queries may not be answerable from in-memory contents. Such queries are answered from disk at a very high cost. For example, the query "Find the most recent k tweets that have the keyword Obama" would most likely be answered from in-memory contents, because Obama is a popular (i.e., high-frequency) keyword. However, if the same query asks for the keyword "concurrency", which is not common in tweets, it is unlikely to find the answer in memory, and hence the cost of a visit to the disk index has to be paid, resulting in poor query latency. The rationale of existing techniques is that most queries ask for popular keywords and will be answered from memory; hence, it is reasonable to support such queries efficiently and largely ignore queries that do not ask about popular words. However, such a rationale is not always favorable in practical scenarios. For example, web search engines optimize their performance to serve 95% of their search queries within a certain threshold, e.g., 50-100ms. So, it is important to optimize for the worst-case scenario, i.e., we need to ensure that 95% of our queries are answered below a certain threshold, which is preferable to optimizing for the average query response time. Considering the memory hit ratio as a major query performance measure in searching microblogs ensures that more queries are answered efficiently from memory, which matches the optimization goal of major web search engines.

To illustrate the memory management problem in microblogs data management systems, Figure 3.1(a) depicts a typical snapshot of the memory contents. The figure shows nine keywords, kw1 to kw9, on the horizontal axis, along with the number of microblogs containing each keyword on the vertical axis. The figure also has a horizontal line corresponding to the number k, where k is the default value used in any top-k query.
Only three keywords, kw1, kw2, and kw3, appear more than k times, while the rest of the keywords appear fewer than k times. Existing index structures and their query processors for search queries on microblogs work with such memory contents as-is to retrieve their answers. Therefore, among incoming queries on the nine keywords in Figure 3.1(a), only the ones asking about the first three keywords can be answered from memory very efficiently, as there are k in-memory microblogs for each of them. However, any query asking about the other keywords has to encounter a disk access to retrieve k items, resulting in very poor performance. Unfortunately, existing techniques focus entirely on how to query and index the first three keywords very efficiently while ignoring queries coming in on the rest of the keywords. The implicit assumption is that there is a background process that regularly evicts old memory contents to make room for new incoming ones. However, such a process would still maintain memory contents similar to Figure 3.1(a).

To address this limitation, we introduce kFlushing, a new flushing policy that is triggered once memory is full. The goal is to evict part of the in-memory contents to disk storage, allowing new incoming microblogs to be digested in memory. kFlushing spots the problem in Figure 3.1(a), where a major part of the memory is consumed by useless microblogs that will not help in answering any top-k query. For example, consider the set of microblogs that include the first three keywords in Figure 3.1(a) but are ranked above the k level according to the underlying ranking function. Such microblogs would never show up in a query answer for any top-k query with the same ranking function. Our observations on real Twitter data show that for k=20 and a temporal ranking function based on tweet arrival time, more than 75% of the memory contents are consumed by tweets that will never show up in a query answer for a top-k keyword search query.

The goal of our proposed kFlushing policy is to ensure that all memory contents are useful. This is done by getting rid of the useless microblogs and using their space for the keywords that have fewer than k microblogs. Ultimately, with kFlushing, the memory contents should look like Figure 3.1(b), where each keyword has exactly k microblogs in memory. In that case, a query on any of the nine keywords kw1 to kw9 will be fully answered from in-memory contents, which significantly increases the system's memory hit ratio. kFlushing enables existing algorithms for top-k microblog search queries (e.g., [21, 26, 27, 23, 24]) to reach their full potential by significantly increasing their memory hit ratio.

[Figure 3.1: kFlushing main idea: (a) before kFlushing, only keywords kw1, kw2, and kw3 have more than k microblogs in memory; (b) after kFlushing, each keyword keeps exactly k microblogs.]

The concept of adjusting memory contents to increase the memory hit ratio has been studied in different contexts under different terminologies, e.g., buffer management in database management systems (DBMSs) [77], anti-caching in main-memory databases [78, 79, 80], and load shedding in data stream management systems (DSMSs) [81, 82, 83]. However, neither buffer management nor anti-caching techniques exploit top-k queries, as they decide on flushing an item based on its latest access time regardless of other items. When optimizing for top-k queries, the decision to evict or keep an item in main-memory depends on the presence or absence of other items that satisfy the query. Meanwhile, the main focus of load shedding techniques is to drop a portion of the incoming data to optimize memory contents for a set of registered continuous queries. This is different from the case of microblogs, where we remove from the existing indexed data to optimize for any query that may come later on.

kFlushing employs a parameter B (default 10%) that represents the ratio of memory contents that need to be flushed. The main idea of our kFlushing policy is to employ a three-phase strategy. In the first phase, we try to get the B% from those microblogs whose keywords have more than k microblogs. However, repeatedly doing so will eventually result in memory saturation, where we cannot get B% out. In that case, we employ the second phase, which gets rid of keywords that have fewer than k microblogs, as they would require disk access in all cases. Again, repetitive execution would result in another memory saturation case. In that case, we employ our third and final phase, which checks the query access pattern, with the aim of shaping the memory contents as in Figure 3.1(b). We show that the kFlushing policy is extensible to: (a) various search attributes beyond the keyword search query, (b) various ranking functions, and (c) multiple-keyword search queries. Extensive experimental evaluation using real Twitter data and various realistic query workloads shows that kFlushing improves the memory hit ratio by up to 330%, while keeping the in-memory query performance intact.

The rest of this section is organized as follows. We first present the kFlushing policy for keyword search queries and a temporal ranking function. Then, we show the extensibility of kFlushing to other query types, e.g., user id queries, and other ranking functions.

3.3.1 kFlushing Policy

This section introduces our proposed kFlushing policy. kFlushing is triggered once main-memory is full to decide which microblogs to flush from memory to disk. kFlushing flushes a specified percentage B of the memory contents to ensure a minimum amount of free memory, and hence continuity in digesting incoming data without re-invoking the flushing process frequently. kFlushing is composed of three consecutive phases, namely, regular flushing, aggressive flushing, and forced flushing. Each phase is invoked only if its preceding phase(s) cannot flush enough memory to meet the budget B. We next describe the details of each phase.
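To make the phase cascade concrete, the following minimal sketch (in Python, with hypothetical phase callables and byte-level budget accounting that are illustrations, not Kite's actual code) shows how a flush request falls through the three phases:

```python
def kflush(phases, budget_bytes):
    """Run the flushing phases in order until `budget_bytes` are freed.

    `phases` is the ordered list [regular, aggressive, forced]; each
    callable takes the remaining budget and returns the bytes it freed.
    A later phase runs only if the earlier ones fell short of the budget.
    """
    freed = 0
    for phase in phases:
        if freed >= budget_bytes:
            break  # the budget B is already met; skip the remaining phases
        freed += phase(budget_bytes - freed)
    return freed
```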

Phase 1: Regular Flushing

Motivation. Regular flushing is motivated by the large amount of underutilized memory under the temporal flushing scheme currently used in microblogs systems [21]. As described in Figure 3.1(a), the frequency distribution of keywords in microblogs is very skewed. Thus, a few keywords have very high frequency, much more than k, while the rest of the keywords have low frequency, below k. For incoming top-k queries, microblogs that are beyond k in any keyword are useless microblogs, as they would not contribute to any query answer. Such useless microblogs are observed to be 75% of the memory contents, for k=20, in real Twitter data. This means that only one quarter of the available memory is utilized. Our goal is to make use of the remaining three quarters to include more high-frequency keywords in memory and increase the memory hit ratio.

Main idea. The main idea of the regular flushing phase is to trim the extra useless microblogs that are above the k threshold line in Figure 3.1(a).


Figure 3.2: Flushing Example

For a trimmed microblog M, if M has only a single keyword, then M is removed from both the index and the in-memory buffer and flushed to disk right away. In case M has more than one keyword, Phase 1 removes it only from the keyword index entries in which M is not among the top-k microblogs. Yet, M's data record might remain in the in-memory buffer if it is still referenced by another index entry. This would mean that M is still among the top-k microblogs in other in-memory index entries. Whenever M is not referenced by any in-memory index entry, its record is removed from the in-memory buffer and flushed to disk right away. For example, a microblog M that has two keywords kw1 and kw2 is evicted if it is above the k threshold for both kw1 and kw2. If M is above the k threshold for only kw1, we still need to keep it in memory, as it will be needed to answer queries on keyword kw2. By removing useless microblogs, we clear significant memory space that can be better utilized for low-frequency keywords to increase the memory hit ratio of incoming top-k search queries. Optimally, we aim to reach a memory snapshot that looks like Figure 3.1(b), where all keywords have appeared in exactly k microblogs, i.e., there are no extra useless microblogs and no shortage that requires retrieving from disk.

Example. Figure 3.2 gives an example of a simple hash index that contains five entries. Each entry has: (1) a keyword, (2) the latest arrival time of any microblog that includes the keyword (to be used in Phase 2), and (3) a list of microblog IDs that include the keyword, ordered by their arrival time. Two keywords (obama and nba) are considered popular as they have more than k=5 microblogs. In this case, the regular flushing phase removes from the index all microblogs that are beyond the most recent k in obama and nba. If a removed microblog is not referenced in other index entries, it is removed from the in-memory buffer as well and flushed to disk right away. Otherwise, it remains in the in-memory buffer until all its references are removed from the index.

Algorithm. Incoming data is continuously digested in a main-memory buffer and its corresponding keyword index. On arrival of a new microblog M, it is stored in the in-memory buffer with an auxiliary attribute M.pcount initialized to the number of M's keywords. Then, M is inserted in the keyword index in each entry that corresponds to any of its keywords. If any of M's keywords kw has more than k microblogs, a pointer to kw's index entry is added to a list L. The list L maintains pointers to keyword index entries that have more than k microblogs, i.e., that have useless data. Practically speaking, due to the high skew in the keyword distribution of microblogs, L is a very short list, as few keywords dominate the memory contents. Maintaining L saves the significant effort of iterating over all keywords when Phase 1 is invoked. On full memory, Phase 1 is invoked. Each keyword index entry W in the list L contains more than k microblogs. Phase 1 shrinks W to contain only k microblog ids and trims the rest of its microblogs from the index. A trimmed microblog M is removed from the index entry W altogether and its M.pcount is decreased by one. In case M.pcount > 0, M is still referenced by other index entries. Hence, in that case, M's id is removed from the list of microblog ids in W, while it is still kept in the in-memory buffer as other index entries may need to retrieve it.
This means that M's data record is still physically in memory; however, M's id is not associated with W anymore. Whenever M.pcount reaches zero, M is no longer referenced by any index entry, and hence M's entry in the in-memory buffer is flushed as well. All flushed data are collected in a temporary main-memory buffer before writing them to disk, mainly to reduce the number of I/O operations. The list L is wiped out after the completion of Phase 1. It is important to note that shrinking an index entry does not disturb the continuous digestion of incoming microblogs within the same index entry. This is mainly because incoming microblog IDs are added to the list head while the trimmed IDs are removed from the list tail. The separation between insertion and deletion positions allows Phase 1 to be invoked in a separate thread without causing contention on index entries. This ensures continuous digestion of incoming microblogs in real time at high rates, as shown in our experiments.
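A minimal sketch of the Phase 1 trimming loop is given below (Python; the dictionary-based index layout and field names such as pcount mirror the description above but are hypothetical simplifications of Kite's actual structures):

```python
def phase1_regular(index, buffer, overfull, k):
    """Trim microblogs ranked beyond top-k in the entries listed in
    `overfull` (the list L of keyword entries with more than k ids).

    index:  dict keyword -> list of microblog ids, newest first
    buffer: dict id -> record dict carrying a "pcount" reference count
    Returns the evicted records, batched for a single bulk write to disk.
    """
    flushed = []
    for kw in overfull:
        entry = index[kw]
        while len(entry) > k:
            mid = entry.pop()        # trim from the tail; the head keeps digesting
            rec = buffer[mid]
            rec["pcount"] -= 1       # one fewer index entry references this microblog
            if rec["pcount"] == 0:   # no index entry references it anymore
                flushed.append(buffer.pop(mid))
    overfull.clear()                 # the list L is wiped out after Phase 1
    return flushed
```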

Phase 2: Aggressive Flushing

Motivation. Figure 3.3(a) shows the effect of employing only Phase 1 (regular flushing) on memory consumption over time. The horizontal axis is a timeline, while the vertical axis is the percentage of memory consumption. In the beginning, it takes about 10 time units to fill 100% of the memory. The first execution of Phase 1 flushes 60% of the memory contents, leaving only 40% of the memory consumed. It then takes only six time units to fill the memory again. On a second call to Phase 1, there are fewer microblogs beyond top-k, and hence only 45% of the memory contents are flushed. As time goes by, memory is filled much faster, and the amount of flushed memory becomes much smaller. This is because Phase 1 tunes the memory contents to make more keywords have exactly k items, and hence there are not many microblogs left to flush. Continuing like this reaches a saturation point, where each invocation flushes only a few microblogs, making it very costly to invoke the flushing process very frequently. Even before that saturation point, it becomes very costly to invoke a flushing process that ends up spilling out only a few microblogs. Overall, Figure 3.3(a) shows that we cannot rely solely on Phase 1 for flushing in-memory contents. Ultimately, we would like to have a memory behavior similar to Figure 3.3(b), where in a steady state a fixed percentage, e.g., 20%, of the memory is flushed. This ensures that flushing is invoked at regular intervals and flushes a reasonable part of the memory every time, which eliminates the overhead of running the flushing process very frequently. To achieve this goal, if Phase 1 fails to flush B percent of the memory contents, we employ Phase 2 (aggressive flushing).

Main idea. Triggering the execution of Phase 2 means that all in-memory microblogs are useful, as we have already trimmed all microblogs that do not participate in any top-k list in Phase 1. In this case, keyword entries in the in-memory keyword index fall in one of two categories: (1) keywords that have exactly k microblogs, and (2) keywords that have less than k microblogs. Phase 2 focuses only on the keywords of the second category.

Figure 3.3: Memory consumption behavior. (a) With only Phase 1; (b) With Phase 1 and Phase 2.

The rationale is that queries that come on a keyword in the second category would not find their answers in memory anyway and would encounter an expensive disk access. Thus, flushing these microblogs does not cause additional disk accesses and does not degrade the memory hit ratio. Phase 2 flushes microblogs with keywords of the second category until we reach our target memory budget B. Since there may be many keywords in this category, we select a subset of them that barely achieves the target B. Keywords are flushed in least-recently-arrived order, so keywords that did not receive any microblogs for the longest time are flushed to disk first. These keywords are less likely to accumulate k microblogs soon, and hence have the least effect on the memory hit ratio. This flushing order comes with the small overhead of assigning a single timestamp to each keyword, rather than a timestamp per data item as in traditional DBMS policies, e.g., LRU. This reduces both the memory and CPU overheads of tracking flushing candidates, as the experimental evaluation shows. For each keyword flushed in Phase 2, we remove its entry altogether from the in-memory index. This includes trimming all its microblog ids from the index. For each trimmed microblog M, its reference count M.pcount is decreased by one. Whenever M.pcount reaches zero, M's data record is removed from the raw-data in-memory buffer and flushed to disk right away. This repeats until we flush a total of the requested budget B.

Example. Following the example in Figure 3.2, after Phase 1, keywords obama and nba have k=5 microblogs, coly and prinky have two microblogs each, and locia has one microblog. Since the first two keywords have exactly k microblogs, they are not considered by Phase 2. Assuming that Phase 2 needs to flush three more microblogs to reach B, we select the three microblogs associated with locia and prinky, as they have the least arrival timestamps.

Algorithm. A straightforward implementation of Phase 2 is to sort the list of in-memory keywords based on their last arrival time and then flush keywords from the top of the list until we reach our target B. That takes O(n log n), where n is the number of in-memory keywords, which is still expensive given the large number of keyword entries in memory, in the millions. Hence, we employ a smarter algorithm that is only O(n). The main idea is to traverse all keywords that have less than k microblogs while maintaining an on-the-go buffer of keywords such that: (1) the total memory consumption of the buffered keywords at least equals the target B, and (2) the buffer contains the keywords with the least arrival times. We maintain a max-heap H of keywords and their memory consumption, sorted on the keywords' arrival times. First, we add to H the first traversed keywords whose memory consumption adds up to at least the requested memory budget. Then, for each remaining keyword kw, if kw is less recent than H's most recent keyword, kw replaces the most recent keyword in H. This is repeated until all keywords are exhausted. With each keyword replacement, the total memory consumption of H's keywords must equal or exceed the requested budget; otherwise, the new keyword is inserted without removing H's most recent keyword. At the end, H contains the final set of keywords to be flushed, along with their microblogs. For each keyword W in H, W's entry is removed altogether from the in-memory index.
This includes removing W and trimming all its associated microblog ids from the index. For each trimmed microblog M, M.pcount is decreased by one. If M.pcount = 0, then M's record is removed from the in-memory buffer and flushed to disk. If M.pcount > 0, M's record remains in the in-memory buffer until M.pcount falls to zero. This repeats for all trimmed microblogs and keywords. Phase 2 is executed in a separate thread so that it does not noticeably interrupt the continuous digestion of incoming data. In particular, while selecting its victims, Phase 2 only reads the data structures that digest new data and makes all its changes to temporary data structures, e.g., the heap H. In addition, while flushing its victims, Phase 2 interrupts the index as little as possible. To illustrate, a new insertion may come on a keyword that is being removed at the same time. To avoid data inconsistency, each keyword's entry is moved from the index to a temporary buffer in a single atomic step, i.e., the entry is locked so that no microblogs can be inserted at the same time. The entries are locked one at a time so that the atomicity overhead is negligible, especially for least recent entries, which are less likely to receive new data at the time of flushing. Thus, data integrity is preserved with minimal overhead on data digestion.
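The single-pass victim selection can be sketched as follows (Python; the entry fields and byte-level budget are hypothetical). A min-heap over negated arrival times emulates the max-heap H, whose root is always the most recent keyword in the buffer:

```python
import heapq

def select_victims(candidates, budget):
    """Single pass over `candidates`, keeping a buffer of the
    least-recently-arrived keywords whose total memory just covers `budget`.

    candidates: iterable of (last_arrival, mem_bytes, keyword) tuples.
    Returns the list of keywords to flush.
    """
    heap, total = [], 0                      # max-heap on arrival time via negation
    for arrival, mem, kw in candidates:
        if total < budget:                   # still filling up to the budget
            heapq.heappush(heap, (-arrival, mem, kw))
            total += mem
        elif heap and arrival < -heap[0][0]: # kw is less recent than H's most recent
            heapq.heappush(heap, (-arrival, mem, kw))
            total += mem
            if total - heap[0][1] >= budget: # replace only if the budget stays covered
                total -= heapq.heappop(heap)[1]
    return [kw for _, _, kw in heap]
```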

Phase 3: Forced Flushing

Main idea. Triggering the execution of Phase 3 means that: (a) both Phases 1 and 2 failed to flush at least B percent of the memory contents, and hence it is the goal of Phase 3 to flush more microblogs to reach the memory budget B, and (b) all keywords in memory have exactly k microblogs, where a snapshot of the in-memory keyword frequency looks like Figure 3.1(b). As a result, Phase 3 has no option other than removing keywords with exactly k microblogs. Consequently, any flushed data could reduce the memory hit ratio. To limit such reduction, we flush those microblogs that are less likely to be queried. This is accomplished by flushing the least recently queried microblogs, i.e., microblogs that are associated with the least recently queried keywords. With this preference order, Phase 3 keeps recently popular keywords in main-memory. This preference order is based on a previous study [76] showing that the real-time distribution of microblogs queries exhibits strong temporal locality, so recent query behavior predicts the near future effectively. Similar to the preceding phases, trimmed microblogs in this phase are removed from the index, and their reference counts pcount are decreased. A microblog M is removed from the in-memory buffer and flushed to disk whenever its M.pcount falls to zero. Like Phase 2, the flushing order in Phase 3 comes with the small overhead of assigning a single timestamp to all microblogs associated with each keyword. This reduces both the memory and CPU overhead of tracking flushing candidates.

Algorithm. The algorithm of Phase 3 is similar to the single-pass algorithm of Phase 2 except that: (a) flushed entries are selected based on last querying time instead of arrival time, and (b) all keywords are candidates for flushing instead of only the low-frequency keywords. It is worth mentioning that although the newly attached timestamp can be updated from multiple querying threads simultaneously, it does not need any concurrency control overhead. The reason is that if two queries try to update this timestamp simultaneously, both of them would be trying to assign it the same value, which is NOW. Thus, any race that happens would not cause problems.
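Given this similarity, the selection routine sketched for Phase 2 can simply be reused by swapping the candidate set and the timestamp it is keyed on; a hypothetical usage fragment (entry field names and the remaining-budget variable are assumed for illustration):

```python
# Phase 2: only keywords with fewer than k microblogs, keyed on last ARRIVAL time.
victims = select_victims(
    ((e.last_arrival, e.mem_bytes, e.keyword) for e in entries if e.count < k),
    remaining_budget)

# Phase 3: ALL keywords are candidates, keyed on last QUERY time.
victims = select_victims(
    ((e.last_query, e.mem_bytes, e.keyword) for e in entries),
    remaining_budget)
```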

3.3.2 Extensibility of kFlushing

We have discussed the kFlushing policy assuming keyword search queries that retrieve the most recent k microblogs. However, kFlushing is a generic flushing policy and is designed to work for top-k queries in general, regardless of their search attributes, ranking function, and/or value of k. kFlushing can also support single-keyword and multiple-keyword queries. In this section, we discuss the extensibility of kFlushing to search attributes beyond the keyword attribute and to ranking functions beyond the most-recent one. In addition, we discuss the possibility of changing the value of k at run time. Finally, we discuss supporting multiple-keyword queries through kFlushing.

Supporting Different Attributes

kFlushing is a generic concept that can be applied to any search attribute other than the keyword attribute. Similar to the case of the keyword index, we assume the existence of an index structure on the search attribute of our top-k queries. This is a practical assumption, as current platforms already include hash index structures on user IDs to support user timeline search queries of the form: "Find k microblogs that are posted by a certain user" [24]. For the case of user IDs, kFlushing aims to flush those microblogs that are not among the most recent k posts of any user. The kFlushing algorithm can actually be applied regardless of the underlying index structure. In particular, Phase 1 mainly keeps track of pointers to index entries that contain data beyond top-k answers. Such tracking can be used as is within the insertion procedure of any index structure. So, when inserting new items in any index cell C, e.g., a spatial index cell or a user index cell, C is checked for having useless data. In Phases 2 and 3, the algorithm mainly iterates over all index entries to select its victims. This also has nothing specific to do with our hash index and can be used with any index structure. In a nutshell, regardless of the employed index and the indexed attribute, our described kFlushing algorithm is generic enough to support them. Our experiments show the effectiveness and scalability of kFlushing with different attributes.

Supporting Different Ranking Functions

kFlushing is discussed so far in the context of a temporal ranking function, i.e., queries are looking for the most recent k microblogs that satisfy the query predicates. Though temporal ranking is the most widely used in microblogs [21], microblogs queries can still use other ranking functions. For example, a query may ask about tweets that are recent and posted by the most popular users, where popularity is measured by the number of followers on Twitter. So, when looking for the hashtag #ObamaCare, a tweet posted by the Barack Obama account is more important to the user than other tweets. A ranking function that combines both timestamp and user popularity is preferred in that case. Other ranking functions combine timestamp with spatial attributes [22], timestamp with microblog popularity and textual relevance [24], or timestamp with the user social graph and textual relevance [27]. kFlushing can accommodate any ranking function, based on either a single attribute or multiple attributes, given that the ranking score can be fully computed upon the microblog's arrival. In this case, we already know the top-k items in each index entry upon their arrival, before any query comes. Thus, we can order the data inside each index cell, e.g., the list of microblog IDs in the index structure, so that the top-k items are quickly accessible, as sketched below. Hence, Phase 1 would still keep only the top-k microblogs and trim the rest that are beyond top-k. Phases 2 and 3 have nothing to do with the underlying ranking function, as they work only with the last arrival time and last query time, respectively.
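A sketch of this ordering trick follows (Python; the list-of-pairs entry layout is an illustrative assumption, not Kite's actual index):

```python
import bisect

def insert_scored(entry, mid, score):
    """Insert microblog `mid` into an index entry kept sorted by ranking
    score. Storing (-score, mid) pairs in ascending order places the
    best-scored microblogs first, so entry[:k] is always the top-k list.
    For a temporal ranking, `score` is simply the arrival timestamp."""
    bisect.insort(entry, (-score, mid))

def trim_beyond_top_k(entry, k):
    """Phase 1 trimming under any arrival-computable ranking function."""
    trimmed = entry[k:]
    del entry[k:]
    return trimmed
```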

Supporting Dynamic k Values

The kFlushing policy can easily adapt itself to changes in the value of k in the middle of system operation. The only constraint is that k should be fixed across all phases of each single execution of kFlushing. This means that if k is changed during a flushing operation, the change takes effect the next time the flushing procedure is triggered. In case k is decreased, kFlushing adapts to the new k instantly, as existing in-memory data can still fulfill new query answers, since they ask for less data. Existing microblogs that are beyond the new k and below the old k are marked to be flushed in the next flushing cycle. In case k is increased, kFlushing's adaptation to the new k lags a bit. In particular, as the new k is greater than the old one, existing in-memory data would not instantly fulfill incoming queries with the new value of k. However, as microblogs arrive at high rates, the missing data are caught up quickly.

Supporting Multiple-keyword Queries

Our discussion up to this point considers queries that search for a single keyword. In practical scenarios, it is not uncommon to have multi-keyword queries. Thus, we introduce an extension of the proposed kFlushing to effectively support queries that search for any number of keywords. This shows the applicability of kFlushing to all practical scenarios with minor tweaks. In our discussion, we consider both types of multiple-keyword queries that are supported in the major web services: (i) OR queries that return a microblog if it has any of the keywords ("Find the most recent k microblogs that contain any of the keywords W1 OR W2 OR ... Wn"), and (ii) AND queries that return a microblog only if it has all the keywords ("Find the most recent k microblogs that contain all the keywords W1 AND W2 AND ... Wn"). Both queries use the most-recent ranking function F. For the rest of this discussion, we refer to them as OR queries and AND queries.

OR queries. kFlushing works perfectly fine with OR queries without any modifications. The reason is that the in-memory contents under kFlushing are enough to find all answers that could exist in memory for OR queries. To illustrate, in order to answer an OR query with two keywords W1 and W2, we retrieve the two index entries of W1 and W2 and compute the union of their microblogs in a chronologically ordered list Lm. If both keywords have k microblogs, Lm is guaranteed to contain the k answer microblogs and causes a memory hit. If any of the keywords has less than k microblogs, there is a possibility that Lm may not contain the final answer. This is mainly caused by the low-frequency keywords that have less than k microblogs. As kFlushing's goal is to maintain k microblogs for each keyword, kFlushing achieves the maximum hit ratio for OR queries.

AND queries. The described kFlushing policy has a limitation in achieving the maximum possible memory hit ratio for AND queries. The policy keeps with each index entry a maximum of k microblog ids. This leads many AND query answers to have less than k microblogs from the in-memory contents, and hence obligates visiting disk contents and causes memory misses. To illustrate, when an AND query comes on two keywords W1 and W2, an answer microblog must exist in both W1 and W2.

Figure 3.4: Example of Multiple-Keyword Extension of kFlushing. (a) Snapshot at time t1; (b) Snapshot at time t2 > t1.

Thus, we retrieve the in-memory index entries of W1 and W2, scan their microblog id lists, and add any microblog that is associated with both W1 and W2 to a chronologically ordered list Lm. In most cases, Lm is shorter than either list, as Lm represents the intersection of the two lists. As kFlushing keeps each list at a maximum length of k, then, in most cases, Lm contains less than k microblogs and causes a memory miss. To overcome this limitation and increase memory hits for AND queries, we slightly extend kFlushing so that, in certain cases, an index entry is allowed to hold more than k microblog ids. The main idea is that microblogs that are allowed to be indexed while they are beyond the top-k must be potential candidates to increase the memory hit ratio of AND queries. Such a candidate microblog is one that is still ranked among the top-k in other index entries. So, the extended kFlushing keeps a microblog in all index entries as long as it is among the top-k microblogs of any of its keywords.

An illustrative example is given in Figure 3.4. Figure 3.4(a) shows a microblog M1 with two keywords W1 and W2. M1 is outside the top-k microblogs in W1 and among the top-k microblogs in W2, and M1.pcount = 2. If the original Phase 1 (Section 3.3.1) were executed, M1's id would be trimmed from the index entry of W1 and kept in W2, and M1.pcount would become 1. Now, assume an AND query comes on W1 and W2. The intersection of the W1 and W2 microblogs would not find M1 in memory, because its id is not associated with W1 anymore. So, we have to visit the in-disk entry of W1 to get M1 in the answer, while M1 is actually still physically in the main-memory buffer, as it is still referenced by at least one in-memory keyword, i.e., M1.pcount > 0. On the contrary, if we keep M1's id associated with the W1 entry, M1 satisfies the AND condition and appears in the answer list without a need to access disk contents. This causes the W1 entry to have more than k microblogs. Yet, M1 increases the memory hit ratio without significantly degrading memory utilization, because it is memory resident as long as M1.pcount > 0.

This extension affects the three phases of kFlushing as follows. In Phase 1, the flushing rule is extended so that a microblog id M is trimmed from an index entry W if it satisfies two conditions: (1) M is beyond the top-k microblogs in W, and (2) M is not among the top-k microblogs in any other index entry. The second condition prevents M from being trimmed from any index entry as long as it would remain in the in-memory data store. This means that M.pcount does not decrease until M is outside the top-k microblogs of all its keywords. Once this happens, M.pcount falls to zero in the following execution of Phase 1, and M is trimmed from all index entries and from the in-memory data store. Continuing the example in Figure 3.4(a), when the extended Phase 1 is executed, it keeps M1 in W1 as is, and M1.pcount = 2 remains intact. When M1 becomes outside the top-k for both keywords, as in Figure 3.4(b), it is trimmed from all keywords, its M1.pcount falls to zero, and it is flushed from the memory contents. In Phase 2, the flushing rule is extended so that a microblog M is trimmed from an index entry W if it satisfies three conditions: (1) W has less than k microblogs, (2) W is selected based on least-recently-arrived order, and (3) M does not exist in any index entry that has ≥ k microblogs. The third condition is added because trimming M in that case may cause a memory miss and an additional disk access, violating the assumption of the original Phase 2 that flushing all microblogs of low-frequency keywords would not cause additional disk accesses. Elaborating on M1 in Figure 3.4(a), assume that the extended Phase 2 is invoked and W2 is selected for flushing. Then, all W2 microblogs are trimmed except M1, as it exists in the frequent keyword W1. So, when an AND query comes on W1 and W2, M1 appears in the in-memory answer list. This Phase 2 extension prevents low-frequency keywords from hurting the memory hit ratio of frequent keywords when they are involved in the same AND query. Phase 3 is kept intact as described before. The reason is that the original assumption of Phase 3 is still valid. Specifically, Phase 3 is executed upon reaching a saturation point in which all in-memory microblogs could cause memory hits. Thus, Phase 3 already flushes microblogs that may hurt the hit ratio, however, with minimal probability. This assumption is still valid with the extended Phase 1 and Phase 2. The difference here is that when Phase 3 is executed, not all in-memory keywords have exactly k microblogs. Instead, it might find keywords that have either more than k microblogs (left by the extended Phase 1) or less than k microblogs (left by the extended Phase 2). However, this does not affect Phase 3, as all these microblogs could still cause memory hits. Thus, Phase 3 remains intact and considers all in-memory keywords for flushing in least-recently-queried order. Although the proposed modifications do not guarantee that all multiple-keyword queries are answered entirely from memory contents, they improve the memory hit ratio and utilization, as shown in our experimental evaluation with various realistic query workloads. Also, applying this extension only slightly degrades the efficiency of the kFlushing phases, as they are invoked in separate threads that keep minimal interaction with the real-time digestion thread, as described in Section 3.3.1.
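The extended Phase 1 rule can be expressed compactly as a predicate (Python sketch; the index layout, with each entry's best-k ids first as in the earlier sketches, and the per-record keyword list are hypothetical):

```python
def trimmable_extended_phase1(mid, kw, index, buffer, k):
    """Extended Phase 1 rule: trim `mid` from the entry of `kw` only if
    (1) it is beyond the top-k in that entry, and
    (2) it is not among the top-k in ANY entry of its keywords."""
    beyond_here = mid not in index[kw][:k]
    in_some_top_k = any(mid in index[w][:k]
                        for w in buffer[mid]["keywords"])
    return beyond_here and not in_some_top_k
```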

3.4 Experimental Evaluation

This section provides an extensive experimental evaluation of Kite's memory management techniques. Section 3.4.1 describes the experimental setup. The subsequent sections then show the performance of the proposed memory management techniques and their effect on both the spatial index (Section 3.4.2) and the keyword index (Section 3.4.3).

3.4.1 Experimental Setup

Experiments in this section are based on the same system deployment and datasets described in Section 2.4.1. As a recap, the system is deployed on a local cluster of commodity hardware machines, where a single microblog stream is digested on one machine and the system can digest multiple streams on multiple machines. Each machine has an Intel Core i7 processor with a clock speed of 3.40 GHz, 64GB of main-memory, and 2TB of disk storage. Our dataset contains 10+ billion real tweets that are read from files and timestamped to simulate an incoming stream of real microblogs at different high arrival rates.

Query workloads. For spatial queries, we use one million query points from the Bing Mobile search engine, obtained through our collaboration with Microsoft Research. We use these real query locations to query our system with point queries. For keyword queries, we generate the following two workloads out of our real Twitter dataset (a sketch of both sampling schemes is given at the end of this subsection):

1. Correlated Query Load: a query workload where keyword queries are selected at random from all keywords associated with our tweets without removing duplicates. Hence, the probability of a certain keyword being queried equals its occurrence probability in the dataset. This query workload favors frequent keywords, which is a realistic assumption, as active topics are likely to be the ones being queried.

2. Uniform Query Load: a query workload where keyword queries are selected from the whole pool of possible keywords with equal probability, regardless of their frequency in the incoming data. Although such a query workload does not simulate the actual behavior of real users, it is practically used for testing the quality of performance of major systems, e.g., Twitter, and major search engines, e.g., Google and Bing. The rationale is that such systems measure their performance on extreme cases to guarantee a minimum level of quality of service. In other words, the objective of such systems is not only to make query search faster on average, but also to guarantee that 99% of their queries are answered within reasonable latency [21].

Each of the two workloads consists of ten million queries. Each workload has one third single-keyword queries, one third 2-keyword AND queries, and one third 2-keyword OR queries. Queries are posted as a stream at a high rate of 25,000 queries/second, similar to Twitter's high query rates [84]. For the extensibility experiments, similar query workloads are generated for the spatial and user attributes, replacing keywords with latitude/longitude coordinates and user ids, respectively. Yet, all queries on the user attribute are single-key queries, as they are in practice.

Default values. Unless mentioned otherwise, the default value of k is 100, the microblog arrival rate λ is 1000 microblogs/second, the range R is 30 miles, T is 6 hours, α is 0.2, β is 0.3, w is 1, the cell capacity is 150 microblogs, the spatial and temporal scoring functions are linear, and γ-LS uses min estimation. The default values of cell capacity, α, and β are selected experimentally and are shown to work best for query performance and result significance, respectively, while the default λ is the effective rate of US tweets. As microblogs are so timely that Twitter gives only the most recent tweets (i.e., α=0), we set α to 0.2, as the temporal dimension is more important than the spatial dimension. All results are collected in the steady state, i.e., after running the system for at least T time units.
For keyword queries, the default k value is 20, the main-memory budget is 30 GB, and the flushing budget is 10% of the memory budget.

Performance measures. Our performance measures include insertion time, storage overhead, query accuracy, and query latency. Query accuracy is calculated as the percentage of correct microblogs in the obtained answer compared to the true answer. The true answer is calculated when all microblogs of the last T time units are stored in the index. The performance measures for memory flushing include the memory hit ratio of incoming queries and the flushing overhead, in terms of memory overhead and the effect on the digestion rate of incoming data.
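A minimal sketch of the two query-workload sampling schemes described above (Python; tweet_keywords is assumed to be the flat list of keyword occurrences extracted from the dataset):

```python
import random

def correlated_queries(tweet_keywords, n):
    """Draw query keywords directly from the keyword occurrences
    (duplicates kept), so each keyword is queried with probability
    equal to its occurrence probability in the dataset."""
    return random.choices(tweet_keywords, k=n)

def uniform_queries(tweet_keywords, n):
    """Draw query keywords uniformly from the pool of distinct
    keywords, regardless of their frequency in the data."""
    pool = list(set(tweet_keywords))
    return [random.choice(pool) for _ in range(n)]
```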

3.4.2 Evaluating Managing Spatial Index Memory

This section evaluates our main-memory management techniques on the in-memory spatial index. The evaluated alternatives are variations of Kite's in-memory spatial index with and without employing the different main-memory management techniques. In particular, five alternatives are evaluated: (a) storing all microblogs of the last T time units (denoted KT), (b) using the index size tuning module (denoted KST), (c) using the load shedding module (denoted KLS), and (d) using the adaptive load shedding modules (denoted KLS-β and KLS-γ). We also evaluate the effect of different ranking functions (both linear and exponential) on the index memory. The linear ranking function variation is denoted with the suffix Lin, and the exponential ranking function variation with the suffix Exp.

Effect on Real-time Indexing

Figure 3.5 shows the index performance in terms of insertion time in milliseconds while varying different parameter values. A batch of data is collected every 1 second. The figure shows the performance of applying the bulk insertion techniques along with the different main-memory management techniques: KST, KLS, KLS-β, and KLS-γ. Figure 3.5(a) gives the performance when varying the arrival rate from 250 to 64,000 microblogs/second. The figure presents the time of Kite's bulk insertion every 1 second, i.e., all microblogs that have arrived in the last second are inserted in bulk. KT is omitted from the figure, for a clear comparison of the other alternatives, as it performs the worst and incurs 1000 ms insertion time at a 32,000 microblogs/second arrival rate.

Figure 3.5: Spatial index insertion time - reduced index sizes. (a) Varying arrival rate; (b) Varying k; (c) Varying α; (d) Varying R.
All other alternatives perform much better than KT. KLS and KST are able to bulk insert 64K microblogs in less than 0.5 second, while KLS-γ digests them in 0.4 second. KLS-β outperforms all alternatives and can digest 64K microblogs in 0.15 second, which is efficient and scales well beyond current Twitter arrival rates. Such significant improvement in insertion performance comes from reducing the index size, and hence fewer cells are visited in bulk insertion. Figure 3.5(b) gives the same experiment varying k from 10 to 100. The performance is stable for all alternatives at 28, 12, 10, 7.8, and 4.3 ms for KT, KST, KLS, KLS-β, and KLS-γ, respectively. This shows the practical dominance of the first term ((α/(1−α))T) in the equations of Tc and Tc,β over the term that contains k.

Figure 3.6: Effect of T on index storage vs. query accuracy. (a) Storage Overhead; (b) Query Accuracy.

Figure 3.5(c) gives the bulk insertion time varying α from 0 to 1. KT has a stable performance of 28 ms, as α does not affect its storage. On the other side, lower values of α strongly favor all other alternatives, as α plays a major role in deciding the values of Tc, Tc,β, and Tc,γ, and hence the storage consumed. With increasing α, KST and KLS degenerate to be equivalent to KT at α = 0.5 and 0.6, respectively, while KLS-β and KLS-γ continue to perform better until α = 0.9. This shows the superiority of KLS-β and KLS-γ, and that the decrease in insertion time is proportional to the storage savings: the lighter the index contents, the more efficiently it digests more data. Figure 3.5(d) shows that increasing R significantly enhances the performance of all alternatives except KT, which remains stable at 28 ms. The reason is that increasing R gives more search space to look for the result; hence, there is no need to look far back in time, which leads to storing less data in the index. The smaller index has fewer cells and a lower height, which incurs less insertion overhead in real time.

Effect on Storage Overhead and Query Accuracy

We evaluate the impact of index size tuning (KST) and load shedding (KLS) on storage savings and query accuracy. Storing the whole dataset of 340+ million tweets (KT) consumes more than 8GB of memory just to store the tweet id and latitude/longitude coordinates, and incurs much higher storage overhead, ∼56GB, with text and .NET framework overhead; yet, it guarantees 100% query accuracy. We show the effect of varying T, α, and β on the storage overhead and accuracy of KST and KLS relative to KT.

Figure 3.7: Effect of α on index storage vs. query accuracy. (a) Storage Overhead; (b) Query Accuracy.

Figure 3.6 gives the storage overhead ratio and query accuracy of KST and KLS while varying T from 3 to 12 hours. We depict two curves for KLS that correspond to β values of 0.3 and 0.7, while KT is depicted in the figure as 100% storage overhead and query accuracy. For all values of T, KLS with β = 0.3 consumes only 35% of the storage required by KT (Figure 3.6(a)). This takes place with a very high accuracy of 98% to 99.5% (Figure 3.6(b)). Similarly, KLS with β = 0.7 consumes only 25% of the storage with accuracy of 97.5% to 99.3%. KST consumes 40% of the storage with almost 100% accuracy. These practical results confirm our earlier analysis in Section 3.2.2, where we anticipated that a linear reduction in storage (β) results in a cubic loss of accuracy (β³). As shown in the figure, KST gives less than 100% accuracy for T = 3, where we have only 99.2%. Theoretically, KST should be able to provide 100% accuracy regardless of the parameter values. However, the theoretical model assumed spatial uniformity of microblogs within individual pyramid cells, which is not 100% true. This leads to missing a few microblogs and hence a slight drop in KST accuracy. Figure 3.7 shows the effect of varying α from 0 to 1 on the storage overhead and query accuracy of KST and KLS with β = 0.3 and 0.7. Figure 3.7(a) shows that increasing α increases the storage overhead of both KST and KLS.


Figure 3.8: Effect of β on index storage vs. query accuracy. (a) Storage Overhead; (b) Query Accuracy.

While KST storage degenerates to that of KT when α ≥ 0.5, KLS still keeps its storage gain until α approaches 0.6 and 0.8, respectively, for the two values of β. This shows that load shedding can still find reasonable storage savings even for large values of α. Meanwhile, Figure 3.7(b) shows that all alternatives have a query accuracy of more than 95%. The worst case takes place at α = 0, in which the value of β does not play any role. For all alternatives, once we have the same storage overhead as KT, we obtain 100% accuracy. Figure 3.8 focuses only on load shedding (KLS), studying the effect of varying β from 0 to 1 on the storage overhead and query accuracy. Per Figure 3.8(a), the storage overhead saving is linear in β, with a line slope that depends on the value of α.

The lower α, the lower Tc,β, and hence the more storage savings can be achieved. The extreme case α = 1 makes KLS run exactly as KT, while with α = 0.2, we can achieve 60% to 90% storage savings. Figure 3.8(b) shows that the query accuracy is directly proportional to the storage overhead for different values of β. With α = 1, Tc,β = T, and hence the accuracy is 100% regardless of β. The accuracy shown is much higher than the theoretical expectations and has a practical lower bound of 52% at β = 1 with α = 0.9, which has a storage saving of 90% (Figure 3.8(a)). This happens because our theoretical model uses very conservative assumptions on the maximum spatial and temporal distances. Thus, even with significant load shedding, in-memory microblogs can provide good-quality query answers, which shows the applicability of Kite's techniques in memory-constrained environments.


Figure 3.9: Effect of α on index storage vs. query accuracy - adaptive. (a) Storage Overhead; (b) Query Accuracy.

Figure 3.9 shows the effect of α for all load shedding techniques, including KLS-β and KLS-γ. For a wide range of α values, Figure 3.9 shows the superiority of KLS-β and KLS-γ, for α > 0, in saving a significant amount of storage (up to 80%) while keeping almost perfect accuracy (more than 99%). This holds even for large values of α (up to 0.9), which is a significant enhancement over KST and KLS. With increasing α, KST and KLS keep more data, as the spatial dimension gets increasing weight in the relevance score, and hence older data are kept to account for being spatially close to incoming queries. However, as KLS-β and KLS-γ take the query spatial distribution into account and monitor the actual useful data localized per region, instead of using a global parameter, they can smartly figure out almost all data that do not contribute to query answers and hence shed up to 80% without affecting the query accuracy. KST and KLS cannot sustain such large savings for α ≥ 0.4. In contrast to large α values, at α = 0, KLS-β and KLS-γ come with a bit of extra storage overhead, 14% and 17%, respectively, compared to 11% for both KST and KLS, to increase the accuracy from ∼95% to more than 99%. This is a result of the heuristic discussed in Section 3.2.3, which prevents cancellation of the KLS-β and KLS-γ effect so that they can automatically discover which data are useful for the incoming queries and keep them. This specific point, at α = 0, shows that KLS-β and KLS-γ are adaptive: they keep more data when needed and shed useless data when it exists.


Figure 3.10: Effect of adaptive load shedding on query latency. (a) k vs. Query Latency; (b) R vs. Query Latency.

Effect on Query Latency

KLS-β and KLS-γ come with an overhead during query processing. Figure 3.10 shows the query latency varying k and R. Both figures show higher query latency for KLS-β and KLS-γ over KST and KLS. It is also noticeable that KLS-γ encounters higher latency than KLS-β due to the computational cost of calculating γ. However, the latency increase is acceptable and does not exceed 3 ms for large values of R = 256 miles, where many index cells are involved in the β and γ computations. For average values of k and R, the increase is on the order of 1 ms on average. In a nutshell, KLS-β and KLS-γ incur a 12-14% increase in query latency to save up to 80% of the storage, for wide ranges of parameter values, without compromising the accuracy. The 90th, 95th, and 99th percentiles of query latencies for all alternatives are under 15, 30, and 50 ms, respectively. For KLS-γ, the presented results use the min estimation method, which is more conservative than the average method and leads to higher storage overhead. Generally, for all parameter values, the average method behaves quite similarly to the min method, and thus the same analysis of the results applies.

Figure 3.11: Ranking effect on storage vs. accuracy varying T. (a) Storage Overhead; (b) Query Accuracy.

Figure 3.12: Ranking effect on storage vs. accuracy varying α. (a) Storage Overhead; (b) Query Accuracy.

Effect of Different Ranking Functions

Mathematically, the values Tc and Tc,β of F-Lin (Equations 3.1 and 3.2.2) give tighter temporal coverage than those of F-Exp (Equations 3.2 and 3.5). Consequently, Figures 3.11-3.13 show that F-Exp incurs more storage overhead than F-Lin for varying T, α, and β. In Figure 3.11(a), F-Exp incurs larger storage overhead for smaller values of T while approaching F-Lin storage with increasing T. However, for all T values, F-Exp achieves almost 100% accuracy for both KST and KLS. The same observations hold for varying α and β in Figures 3.12 and 3.13, respectively. For T, α, and β, the increase in F-Exp storage overhead is between 7% and 15% over F-Lin.

Figure 3.13: Ranking effect on storage vs. accuracy varying β. (a) Storage Overhead; (b) Query Accuracy.

Two interesting points to discuss are α = 0 and β = 1 in Figures 3.12 and 3.13, respectively. At these points, both Tc and Tc,β almost vanish and the index barely stores the most recent k microblogs in each region, which makes the query accuracy of F-Lin drop significantly, as shown in Figures 3.12(b) and 3.13(b), especially at β = 1 for large values of α, where the spatial score is more important than the temporal score and so old microblogs matter. However, in all these cases, F-Exp accuracy remains almost perfect. This shows that the F-Exp accuracy improvement is not merely a result of storing more data in the index. Instead, the exponential scoring quickly demotes farther microblogs, in space, time, or both, and hence fewer microblogs are needed to get the accurate answer. Finally, it is worth mentioning that employing KLS-β and KLS-γ with F-Exp gives quite similar numbers to those in Section 3.4.2 in both storage overhead and query accuracy.

3.4.3 Evaluating Managing Keyword Index Memory

This section provides an experimental evaluation of the kFlushing policy and its multi-key query extension, denoted kFlushing-MK, to show their effect in increasing the memory hit ratio without sacrificing the performance of the underlying index. We compare our proposed policy with two policies: (1) The default temporal flushing policy (denoted FIFO) used implicitly or explicitly in all existing techniques for microblogs [21, 27, 24]. FIFO always flushes the oldest data and is implemented based on a temporally-segmented hash index that consists of multiple temporally disjoint segments. On full memory, the oldest index segments are completely flushed out of memory. (2) The popular least recently used policy (denoted LRU), implemented as in the H-Store anti-cache [78], where a global doubly-linked list is maintained to order microblogs in least-recently-used order. To reduce memory overhead, the pointers of the LRU list are embedded in the index entry of each microblog. H-Store is selected as it is designed for fast data environments, similar to microblogs environments. By default, the presented experiments are performed using the keyword attribute and the most-recent ranking function, where we use hashtags, if available, as keywords. All results are collected only in the steady state, i.e., after filling the main-memory budget and performing multiple data flushes. In the rest of this section, we analyze a snapshot of the memory contents and evaluate the memory hit ratio and the flushing overhead, respectively. In addition, we evaluate kFlushing extensibility.

Snapshot of In-Memory Contents

As indicated earlier in Figure 3.1, the optimal scenario is to remove useless microblogs in a way that allows other keywords to accumulate k microblogs, so they would not need a disk access if queried. Figure 3.14 gives the effect of running the FIFO, kFlushing, kFlushing-MK, and LRU policies on the number of keywords that have accumulated at least k microblogs, at a steady-state point. Queries on these keywords cause memory hits, and hence the more such keywords, the better the flushing policy. Figure 3.14(a) gives the number of k-filled keywords when varying k from 5 to 100. With increasing k, the number of k-filled keywords noticeably decreases for all policies, as fewer keywords can accumulate k microblogs with larger k. However, for all k values, both kFlushing and kFlushing-MK outperform both FIFO and LRU. Specifically, kFlushing accumulates at least 7 times more k-filled keywords than FIFO and up to 3 times more than LRU. kFlushing-MK always accumulates slightly fewer than kFlushing due to intentionally overlooking potential posts and keeping them in memory, which reduces the amount of memory available for low-frequency keywords to accumulate k microblogs.

Figure 3.14: Number of Memory-hit Keywords. (a) Varying k; (b) Varying Flushing Budget; (c) Varying Memory Budget.

Yet, kFlushing-MK still outperforms the other two competitors. This experiment translates to up to 700K queries that cause a memory hit with the kFlushing variations but would miss their answers with LRU, and similarly 1800K queries with FIFO, which is a significant improvement over both policies. Figure 3.14(b) gives the number of k-filled keywords when varying the flushing budget from 20 to 100% of the allocated memory. With an increasing flushing budget, the number of k-filled keywords decreases as memory loses more content. Only LRU shows somewhat unexpected behavior with varying flushing budget, as it depends on incoming queries in real time and does not really follow a certain pattern. However, different flushing budgets give 8 to 10 times more k-filled keywords with the kFlushing variations compared to FIFO, and 2 to 9 times compared to LRU, so kFlushing at least doubles the number of k-filled keywords in its worst cases, which shows superiority over both policies. Finally, Figure 3.14(c) gives the number of k-filled keywords when varying the memory size from 10GB to 50GB. For 10GB memory, both kFlushing variations accumulate ∼13 times more k-filled keywords than FIFO and ∼50 times more than LRU. This ratio decreases with increasing memory budget, as FIFO and LRU accumulate more k-filled keywords when given more memory space, while kFlushing gives consistently superior performance for different memory budgets. This shows the robustness of kFlushing in giving high performance in tight memory environments. LRU still shows an unpredictable pattern as a result of depending on the query distribution in real time, which is arbitrary.

Memory Hit Ratio

In this section, we evaluate the effectiveness of kFlushing in improving the memory hit ratio, i.e., the ratio of queries that find their k microblogs in the memory contents. As the memory hit ratio is heavily dependent on the incoming query workload, we perform our experiments twice, once for the correlated query workload (Figure 3.15) and once for the uniform query workload (Figure 3.16). Figure 3.15 gives the memory hit ratio for the correlated query workload. For all parameters, the kFlushing variations consistently achieve a 12 to 20% higher hit ratio than FIFO, which represents a 20 to 44% improvement, and a 2 to 18% higher hit ratio than LRU, which represents a 3 to 35% improvement. Thus, with 10 million queries in our query load, the kFlushing variations hit 1.2 to 2 million queries in main-memory that are not hit using FIFO, and 200 thousand to 1.8 million queries that are not hit using LRU, which is a significant improvement. In addition, kFlushing-MK is always superior to kFlushing, with a 7 to 9% increase in hit ratio, which represents a 9 to 15% improvement. This clearly shows the effectiveness of the proposed multiple-key extension in answering hundreds of thousands more queries (an average of 800 thousand) from in-memory contents. Figure 3.15(a) gives the memory hit ratio when varying k from 5 to 100. With increasing k, the hit ratio of all policies decreases as queries ask for more data. Yet, kFlushing-MK answers 68 to 84% of all queries from in-memory contents, which is significantly higher than all other alternatives, where kFlushing achieves 61 to 77%, LRU achieves 50 to 74%, and FIFO achieves 46 to 60%. The superiority of the kFlushing variations with large values of k shows the positive effect of accumulating many more k-filled keywords, as shown in the previous experiment.

Figure 3.15: Hit Ratio on Correlated Query Load. (a) Varying k; (b) Varying Flushing Budget; (c) Varying Memory Budget.

Figure 3.15(b) shows that with an increasing flushing budget, the hit ratio of all policies decreases as memory loses more data. Still, kFlushing-MK has up to a 20% increase in hit ratio over FIFO and LRU. Finally, Figure 3.15(c) confirms the superior performance of kFlushing-MK and kFlushing over all alternatives, especially with tight memory budgets. kFlushing-MK always achieves ∼10% improvement over kFlushing. For 10 GB memory, kFlushing gives an 18% increase in hit ratio over FIFO, going up from only 52% to 70%. Figure 3.16 evaluates the memory hit ratio on the uniform query workload. It is noticeable that the hit ratio of the uniform workload is consistently low, below 9%, due to the low percentage of frequent keywords in Twitter data. Both kFlushing-MK and kFlushing give almost similar performance for the different parameters.

Figure 3.16: Hit Ratio on Uniform Query Load. (a) Varying k; (b) Varying Flushing Budget; (c) Varying Memory Budget.

However, for all parameter values, the kFlushing variations are superior and provide a significant relative improvement in memory hit ratio, ranging from 100% to 330% compared to FIFO and 26% to 240% compared to LRU. Specifically, Figure 3.16(a), at k=40, shows a 0.42% hit ratio for FIFO and 1.41% for kFlushing, which means 3.3 times more queries are answered from memory. Even with such a low hit ratio, this 1% improvement, for a 10 million query workload, gives 100,000 more queries answered from memory, which is a significant improvement. Similar to the results on the correlated workload, Figures 3.16(a) and 3.16(b) show a decreasing hit ratio with increasing k and flushing budget, respectively, while Figure 3.16(c) shows an increasing hit ratio with an increasing memory budget. This experiment also confirms efficient kFlushing performance in tight memory environments (10GB).

[Plots for FIFO, kFlushing, kFlushing-MK, and LRU vs. k; (a) Memory Overhead: index memory (GB), (b) Digestion Rate (K microblog/sec)]

Figure 3.17: Flushing Overhead vs. k

Flushing Overhead

In this section, we evaluate the overhead encountered by the flushing policy along with its effect on the system scalability to digest incoming tweets at high rates. Figure 3.17 gives the flushing process overhead in terms of indexing memory overhead (Figure 3.17(a)) and the effect on the underlying index digestion rate (Figure 3.17(b)), when varying k. Figure 3.17(a) shows that the different policies give stable memory overhead for k ranging from 5 to 100. kFlushing overhead decreases very slowly with increasing k due to the decrease in the total number of keyword entries in the index. Yet, LRU gives the highest overhead, 4.9 GB, which is around 2 times the kFlushing-MK and 2.5 times the kFlushing overhead, while FIFO gives the lowest overhead, ∼0.75 GB, for all k values. This is explained by the large overhead of the LRU list that tracks individual microblogs, while kFlushing variations do not track individual items. Instead, they use the natural index grouping, based on keywords, to track the usage of microblogs in groups, which significantly reduces the tracking overhead. Yet, during the flushing process, a large amount of temporary buffering memory, ∼2 GB, is needed to collect the scattered victim items to flush to disk. In FIFO, this temporary buffer is not even needed, as the index is segmented based on arrival timestamps and hence the oldest index segment is used as the buffer.

[Plots vs. memory budget (GB) for uniform and correlated workloads; (a) Number of Memory-hit Keys: k-filled spatial tiles (K), (b) Hit Ratio (%)]

Figure 3.18: kFlushing Performance on Spatial Attribute

Figure 3.17(b) shows the effect of the flushing policies on the digestion rate of incoming microblogs into the underlying index. For all values of k, FIFO allows its underlying index to digest ∼120K tweets/second. Due to their insertion and bookkeeping overhead, the two variations of our kFlushing policy perform worse than FIFO. This is mainly because the index is accessed from two threads simultaneously, which requires a minimum level of concurrency control. However, kFlushing can still digest ∼100K tweets/second and kFlushing-MK digests ∼80K tweets/second, for all k values. This is 13 to 17 times higher than the arrival rate of the Twitter Firehose, a stream that contains all Twitter data. This shows that the kFlushing policy efficiently isolates its CPU overhead from the underlying index and keeps its performance high. kFlushing-MK consumes more effort in the flushing threads, so it causes more contention and a lower digestion rate. On the contrary, LRU encounters significant contention on the global LRU list, which limits the digestion rate to only 29K tweets/second. This list is accessed by querying threads, which move recently used tweets to the head of the list, and by the insertion thread, where new tweets are also inserted at the list head. In real-time operation, both insertions and querying are running most of the time, and hence significant contention is introduced that limits index scalability. This shows the superiority of kFlushing, which significantly improves the memory hit ratio over LRU while sustaining high digestion scalability.

[Plots vs. memory budget (GB) for uniform and correlated workloads; (a) Number of Memory-hit Keys: k-filled user IDs (K), (b) Hit Ratio (%)]

Figure 3.19: kFlushing Performance on User Attribute

kFlushing Extensibility

This section shows the extensibility and effectiveness of the kFlushing policy when employed with different attributes. To evaluate kFlushing extensibility, we use two commonly used microblog attributes: the spatial and user attributes. These two attributes are used to support the queries "Find the most recent k microblogs that are posted in a certain location" [22] and "Find the most recent k microblogs that are posted by a certain user" [24], respectively. For the spatial attribute, we use a spatial grid index that is composed of equal-area spatial tiles, each of 4 square miles. For the user attribute, we use the hash index to index user ids instead of keywords. In this section, kFlushing-MK is omitted because all user queries are single-key queries and spatial queries have no AND queries, as they are semantically invalid in the spatial context: an AND query would look for a microblog that exists in two spatial regions at the same time, while each of our tweets is associated with only one point location. Therefore, kFlushing-MK performs exactly as kFlushing on all queries of this section. Figure 3.18 evaluates memory hit on the spatial attribute with different memory budgets. Figure 3.18(a) shows the number of keys, i.e., spatial tiles, that cause memory hits. kFlushing outperforms both FIFO and LRU with 2 to 5 times more memory-hit keys. In addition, kFlushing gives high performance even with tight memory budgets (10 GB), which confirms the superiority in tight memory environments shown in the previous experiments. Figure 3.18(b) shows the memory hit ratio for both uniform and correlated query workloads. Both query workloads show stable superior performance for kFlushing over the other policies, even with low memory budgets. The hit ratio of the other policies starts relatively low with tight memory budgets (≤ 30 GB) and then improves noticeably with memory > 30 GB. At memory ≤ 30 GB, kFlushing improves 15 to 100% over FIFO and 30 to 138% over LRU for uniform queries, and 10 to 20% over FIFO and 8 to 33% over LRU for correlated queries. This is still a significant improvement that causes over two million queries to hit memory with kFlushing and miss with its competitors. We omit the flushing overhead and digestion rate scalability results due to space limitations; however, the results are exactly the same as shown for the keyword attribute. Figure 3.19 evaluates memory hit on the user attribute with different memory budgets. The figure shows performance improvements similar to the conclusions drawn for both the keyword and spatial attributes. However, it is noticeable that kFlushing gives a much better improvement for the correlated query workload on the user attribute compared to the keyword and spatial attributes. This reflects the higher skewness of the data with respect to the user attribute. In other words, highly active users, who tweet frequently, cause a higher percentage of useless microblogs than popular keywords and popular spatial regions. Otherwise, the improvement patterns and conclusions are largely the same. kFlushing still gives a scalable digestion rate of 100K microblog/second. In a nutshell, the extensibility experiments show the superiority of kFlushing on the three attributes, keyword, spatial, and user, for different parameters and performance measures. This shows the generality and effectiveness of kFlushing.

Chapter 4

Query Processing

Kite query processor exploits both in-memory and in-disk data and employs efficient pruning techniques to support scalable processing of incoming user queries. Our core novel contributions are mainly in providing scalable techniques for spatial query processing, in addition to providing a holistic framework for processing queries on arbitrary microblog attributes. This chapter introduces an overview of Kite query processor in Section 4.1. Then, Sections 4.2, 4.3, and 4.4 present the processing of spatial queries, keyword queries, and spatial-keyword queries, respectively. Finally, Section 4.5 discusses querying arbitrary microblog attributes.

4.1 Query Processor Overview

Figure 4.1 gives an overview of the Kite querying environment. The data input to this environment is a stream of microblogs, with high arrival rates, that is directly digested into an in-memory data structure. Once the memory becomes full, a flushing policy is triggered to select parts of the memory contents and flush them to disk storage. When incoming top-k search queries are posed to the system, the query processor first tries to get the answer from the in-memory contents. If the answer cannot be found in memory, e.g., fewer than k items are found, then the disk contents are accessed to retrieve the final answer. Figure 4.2 shows an overview of the query processing in Kite. To process an incoming query, the query processor first checks the query attributes. All queries can be categorized, based on their number of attributes, into:



Figure 4.1: Microblogs querying environment

1. Single-attribute queries: When a query has one attribute, the query processor checks whether the system has an index on this attribute. If such an index exists, it is accessed and the final answer is retrieved from the index data, either from the in-memory component, the in-disk component, or both, based on the query attribute values.

2. Multi-attribute queries: When a query has more than one attribute, a prime attribute is determined. To determine the prime attribute, the query processor first checks the indexes that exist on the query attributes. If only one attribute is indexed, it is marked as the prime attribute. If more than one attribute is indexed, then hash-indexed attributes are given priority over spatially-indexed attributes: if one or more hash indexes exist, any hash-indexed attribute is marked as the prime attribute; if no hash index exists, the spatially-indexed attribute is marked as the prime attribute. If no index exists at all, any attribute is considered the prime attribute. Once the prime attribute is determined, the query processor retrieves an initial set of candidates based on the value of this prime attribute (whether indexed or not). Then, ad-hoc filters are applied on the other query attributes on the fly to return the final answer (see the sketch after Figure 4.2).


Figure 4.2: Query Processing Overview
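To make the prime-attribute selection above concrete, the following is a minimal sketch of the decision procedure in Python. It is illustrative only: the names (`choose_prime_attribute`, `IndexKind`) are hypothetical and not part of Kite's actual codebase.

```python
from enum import Enum

class IndexKind(Enum):
    HASH = 1     # e.g., keyword or user-id index
    SPATIAL = 2  # e.g., pyramid/grid index

def choose_prime_attribute(query_attrs, indexes):
    """Pick the prime attribute for a multi-attribute query.

    query_attrs: list of attribute names in the query.
    indexes: dict mapping attribute name -> IndexKind for indexed attributes.
    """
    indexed = [a for a in query_attrs if a in indexes]
    if not indexed:
        return query_attrs[0]          # no index exists: any attribute works
    # Hash-indexed attributes take priority over spatially-indexed ones.
    for a in indexed:
        if indexes[a] == IndexKind.HASH:
            return a
    return indexed[0]                   # fall back to a spatially-indexed one

# Example: a spatial-keyword query where both attributes are indexed.
attrs = ["keyword", "location"]
idx = {"keyword": IndexKind.HASH, "location": IndexKind.SPATIAL}
assert choose_prime_attribute(attrs, idx) == "keyword"
```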

4.2 Processing Spatial Queries

A spatial query in Kite is posed by a user u with spatial and temporal boundaries, R and T, and returns the top-k microblogs according to a spatio-temporal ranking function Fα that weights the importance of the spatial proximity and time recency of each microblog to u (Definition 1.1). A simple approach is to exploit the pyramid index structure to compute the ranking score for all microblogs within R and T and return only the top-k ones. Unfortunately, such an approach is prohibitively expensive due to the large number of microblogs within R and T. Instead, Kite uses the ranking function to prune the search space and minimize the number of visited microblogs through a two-phase query processor. The initialization phase (Section 4.2.2) finds an initial set of k microblogs that form a basis of the final answer. The pruning phase (Section 4.2.3) keeps tightening the initial boundaries R and T to enhance the initial result and reach the final answer.

4.2.1 Query Data Structure

The query processor employs two main data structures: a priority queue of cells and a sorted list of microblogs. Priority queue of cells H: A priority queue of all index cells that overlap with the query spatial boundary R. An entry in H has the form (C, index, BestScore), where C is a pointer to the cell, index is the position of the first non-visited microblog in C (initialized to one), and BestScore is the best (i.e., lowest) possible score, with respect to user u, that any non-visited microblog in C may have. Cells are inserted in H ordered by BestScore, computed as:

$$BestScore(u, C) = \alpha \times SpatialScore(D_s(u.loc, C)) + (1-\alpha) \times TemporalScore(C.MList[index].time, NOW)$$

where $D_s(u.loc, C)$ is the minimum distance between u and C, and C.MList[index] is the most recent non-visited microblog in C. Sorted list of microblogs AnswerSet: A sorted list of k microblogs of the form (MID, Score), i.e., the microblog id and its score, sorted on score. Upon completion of query processing, AnswerSet contains the final answer.
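The following minimal Python sketch illustrates how these two structures could be represented, using a min-heap for H. All names here (`HeapEntry`, `best_score`, the score functions) are illustrative assumptions, not Kite's actual implementation; the linear score functions are simple placeholders consistent with the linear ranking discussed in Section 4.2.3.

```python
import heapq
from dataclasses import dataclass, field

# Placeholder linear scores: lower is better, normalized by R and T.
def spatial_score(dist, R):
    return dist / R

def temporal_score(age, T):
    return age / T

@dataclass(order=True)
class HeapEntry:
    best_score: float
    cell: object = field(compare=False)   # pointer to the index cell C
    index: int = field(compare=False)     # first non-visited microblog in C

def best_score(u_loc, cell, index, alpha, R, T, now):
    """Best (lowest) score any non-visited microblog in `cell` may have."""
    d = cell.min_distance(u_loc)          # assumed helper: D_s(u.loc, C)
    age = now - cell.mlist[index].time    # most recent non-visited microblog
    return alpha * spatial_score(d, R) + (1 - alpha) * temporal_score(age, T)

# H is a min-heap of HeapEntry objects; AnswerSet is a list of (mid, score)
# pairs kept sorted by score and trimmed to k items.
H = []
# heapq.heappush(H, HeapEntry(best_score(...), cell, 1))
```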

4.2.2 The Initialization Phase

The initialization phase gets an initial set of k microblogs that form the basis of pruning in the next phase. One approach is to get the most recent k microblogs from the pyramid cell C that includes the user location. Yet, this is inefficient as: (1) C may contain fewer than k microblogs within T, and (2) other microblogs outside C may provide tighter bounds for the initial k items, which leads to faster pruning later. Main Idea. The main idea is to consider all cells within the spatial boundary R in constructing the initial set of k microblogs. We initialize the heap H with one entry for each cell C within R. Entries are ordered based on best scores computed as discussed in Section 4.2.1. Then, we take the top entry's cell C in H as our strongest candidate to contribute to the initial top-k list. We remove C from H and check its microblogs one by one in their temporal order. For each microblog M, we compare its score against the best score of the current top cell C′ in H. If M has a smaller (better) score, we insert M into our initial top-k list, and check the next microblog in C. Otherwise, (a) we conclude that the next entry's cell C′ in H has a stronger chance to contribute to the top-k, so we repeat the same procedure for C′, and (b) if M is still within the temporal boundary T, we insert a new entry for C into H with a new best score. We continue doing so till we collect k items in the top-k list. Algorithm. Algorithm 1 starts by populating the heap H with an entry for each cell C that overlaps with the query boundary R. Each entry has its cell pointer, the index of the first non-visited microblog (set to one), and the best score that any entry in C can have (Lines 2 to 6). Then, we remove the top entry TopH from H, and keep retrieving microblogs from the cell TopH.C and inserting them into our initial answer set till any of these three stopping conditions takes place: (1) We collect k items, where we conclude the initialization phase at Line 16, (2) The next microblog in C is either outside T or does not exist, where we set M to NULL (Line 25) and retrieve a new top entry TopH from H (Line 28), or (3) The next microblog M in C is within T, yet it has a higher score than the current top entry in H. So, we insert a new entry for C with a new score and the current index of M into H, and retrieve a new top entry TopH from H (Lines 27 to 28). The conditions at Lines 8 and 14 are always true in this phase as MIN is set to ∞.

4.2.3 The Pruning Phase

The pruning phase takes the AnswerSet from the initialization phase and enhances its contents to reach the final k. Main Idea. The pruning phase keeps tightening the original search boundaries R and T to new boundaries, R′ ≤ R and T′ ≤ T, till all microblogs within the tightened boundaries are exhausted. Microblogs outside the tightened boundaries are pruned early without looking at their scores. The idea is to maintain a threshold MIN as the minimum acceptable score for a microblog to be included in AnswerSet, which corresponds to the current kth score in AnswerSet. Assuming the linear scoring functions (as in Section 1.2), for a microblog M to be included in AnswerSet, M has to have a lower score than MIN, i.e.:

$$\alpha \, \frac{D_s(u.loc, M.loc)}{R} + (1-\alpha) \, \frac{NOW - M.time}{T} < MIN$$

This formula is used for spatial and temporal boundary tightening as follows: (1) Spatial boundary tightening. Assume that M has the best possible temporal score, i.e., M.time = NOW. In order for M to make it to AnswerSet, we should have $\alpha \, \frac{D_s(u.loc, M.loc)}{R} < MIN$, i.e., M has to be within distance $\frac{MIN}{\alpha} R$ from the user. We call the value $\frac{MIN}{\alpha}$ the spatial pruning ratio, for short $PruneRatio_s$. Hence, we tighten our spatial boundary to $R' = Min(R, PruneRatio_s \times R)$. (2) Temporal boundary tightening. Assume that M has the best possible spatial score, i.e., $D_s(u.loc, M.loc) = 0$. In order for M to make it to AnswerSet, we should have $(1-\alpha) \, \frac{NOW - M.time}{T} < MIN$, i.e., M has to be issued within the last $\frac{MIN}{1-\alpha} T$ time units. We call the value $\frac{MIN}{1-\alpha}$ the temporal pruning ratio, for short $PruneRatio_t$. Hence, we tighten our temporal boundary to $T' = Min(T, PruneRatio_t \times T)$. Following the same steps, we can derive the values of $PruneRatio_s$ and $PruneRatio_t$ for the exponential scoring functions to be: $PruneRatio_s = \frac{1}{w} \ln\left(\frac{MIN-(1-\alpha)}{\alpha}\right)$ and $PruneRatio_t = \frac{1}{w} \ln\left(\frac{MIN-\alpha}{1-\alpha}\right)$. Algorithm. Line 16 in Algorithm 1 is the entry point for the pruning phase, where we already have k microblogs in AnswerSet. We first set MIN to the kth score in AnswerSet. Then, we check whether we can apply spatial and/or temporal pruning based on the values of MIN and α as described above. Pruning and bound tightening are continuously applied every time we find a new microblog M with a lower score than MIN, where we insert M into AnswerSet and update MIN (Lines 14 to 22). The algorithm then continues exactly as in the initialization phase by checking whether there are more entries in the current cell or we need to get another entry from the heap. The algorithm concludes and returns the final answer list if either of two conditions takes place (Line 8): (a) Heap H is empty, which means that we have exhausted all microblogs in the boundaries, or (b) The best score of the top entry of H is larger than MIN, which means no microblog in H can make it to the final answer.
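As an illustration of the boundary tightening above, here is a small Python sketch that computes the pruning ratios for the linear scoring functions and tightens (R, T) accordingly. The function names are hypothetical and the code is a sketch of the derivation above, not Kite's implementation.

```python
def linear_prune_ratios(min_score, alpha):
    """Pruning ratios derived from the linear scoring inequality."""
    prune_s = min_score / alpha if alpha > 0 else float("inf")
    prune_t = min_score / (1 - alpha) if alpha < 1 else float("inf")
    return prune_s, prune_t

def tighten(R, T, min_score, alpha):
    """Tighten spatial/temporal boundaries; ratios > 1 cannot enlarge them."""
    prune_s, prune_t = linear_prune_ratios(min_score, alpha)
    return min(R, prune_s * R), min(T, prune_t * T)

# With k-th score MIN = 0.3 and alpha = 0.2 (temporal-heavy ranking),
# the spatial boundary cannot be tightened (0.3/0.2 = 1.5 > 1),
# while the temporal one shrinks to 0.375*T (0.3/0.8 = 0.375).
R2, T2 = tighten(R=30.0, T=12.0, min_score=0.3, alpha=0.2)
assert (R2, T2) == (30.0, 4.5)
```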

4.3 Processing Keyword Queries

A keyword query in Kite comes with keywords W and a ranking function F, and returns the top-k microblogs that contain W, ranked based on F (Definition 1.2). For presentation simplicity, we assume F to be a temporal ranking function that gives the highest score to the most recent microblog. However, F can be any ranking function that is computable on the arrival of a microblog record and has a total order. This way, the data can be organized in the hash index structure ordered based on the ranking score at indexing time. The keyword query in Kite has been addressed before in the literature, on real-time microblogs data in [21] and on disk-resident data in traditional information retrieval literature. Thus, our in-memory hash index is the same hash index that is presented in [21], and we use a traditional hash index on disk as well. To process an incoming keyword query, the corresponding index entry is located and the data is retrieved in the order of the F ranking score.

4.4 Processing Spatial-Keyword Queries

The spatial-keyword queries come with two predicates, a spatial predicate and a keyword predicate (as indicated in Definition 1.3). A traditional way to process spatial-keyword queries is to employ a spatial-keyword index, e.g., IR-tree or IQ-tree, to filter data based on the two predicates. However, spatial-keyword index structures are impractical to employ in Kite. On one hand, a spatial-keyword index is too heavy for microblog environments for two reasons. First, it has slow insertion times, as the two-stage indexing (spatial then keyword or vice versa) fragments the data into very small chunks, which makes bulk insertion ineffective. Also, indexing based on two attributes significantly increases the insertion overhead and makes the index less suitable for digesting high data arrival rates. Second, a spatial-keyword index employed in main memory is too memory intensive due to the small data fragments it stores in the index cells. Thus, the index scales on neither insertion nor main-memory management. In addition, a spatial-keyword index increases the system overhead incurred from employing a third index structure at the system level. To avoid such overhead in real time, we omit employing a spatial-keyword index and exploit our spatial and keyword index structures separately to process spatial-keyword queries. As introduced in Section 4.1, a spatial-keyword query is considered a multi-attribute query and is processed through either the spatial index or the keyword index. An initial set of candidates is obtained from the index structure and then an ad-hoc filter is applied on the other attribute to get the final answer. As all queries in Kite are top-k queries, this approach still gives scalable query response times, as k in practice is a small number.
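The sketch below illustrates this candidate-then-filter strategy for a spatial-keyword query with the keyword index as the prime attribute. It is an assumption-laden illustration: `keyword_index` and `Microblog` are hypothetical stand-ins, and candidates are assumed to stream out of the index already ordered by the ranking score.

```python
from dataclasses import dataclass

@dataclass
class Microblog:
    mid: int
    lat: float
    lon: float
    score: float  # ranking score; the index yields candidates in this order

def spatial_keyword_query(keyword_index, word, region, k):
    """Scan keyword-index candidates in rank order; keep those inside `region`.

    region is (min_lat, min_lon, max_lat, max_lon). Because candidates
    arrive ranked, the first k survivors are the final top-k answer.
    """
    answer = []
    min_lat, min_lon, max_lat, max_lon = region
    for m in keyword_index.scan(word):   # assumed: yields in rank order
        if min_lat <= m.lat <= max_lat and min_lon <= m.lon <= max_lon:
            answer.append(m)
            if len(answer) == k:
                break
    return answer
```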

4.5 Processing Queries on Arbitrary Attributes

In our discussion so far, we have considered only temporal, spatial, and keyword attributes for our queries. However, Kite supports queries on arbitrary microblog attributes. This is accomplished through supporting a hash index, both in-memory and in-disk, that is used to index the keyword attribute as well as any other attribute, e.g., user id. Thus, to support a query workload that finds the top-k microblogs posted by a certain user, the Kite administrator can create a hash index on the user id attribute and treat the user query just like the keyword query. A major consideration in indexing an attribute with the hash index is the selectivity of this attribute. A hash-indexed attribute should have a wide domain of values, i.e., it can take a large number of distinct values, and hence provides the most effective pruning of the microblogs search space. An example of an attribute with a limited domain of values is the language attribute: in Twitter data, it has a domain of only 56 values. If the language attribute is indexed, the billions of incoming microblogs will be divided into at most fifty-six chunks, each of which is huge to search within. Thus, the language attribute will not provide very effective pruning for billions of microblogs.

4.6 Experimental Evaluation

The core of our contributions in query processing is mainly in the spatial query processing. Thus, the experiments in this section evaluate this part. The experimental setup is the same as presented in Section 2.4.1.

Effect of Spatial and Temporal Pruning

In this section, we show the effect of the query processing techniques, where we contrast Kite spatio-temporal pruning (denoted as P) against: (a) NoPruning, where all microblogs within R and T are processed, (b) InitPhase, where only the initialization phase of Kite query processor is employed, (c) PR, where only spatial pruning is employed, and (d) PT, where only temporal pruning is employed.

[Plots: query latency (ms) for P, PR, PT, and InitPhase; (a) Varying k, (b) Varying R, (c) Varying T (hr), (d) Varying α]

Figure 4.3: Average query latency.

Figure 4.3(a) gives the effect of varying k from 10 to 100 on the query latency. It is clear that the variants of Kite give an order of magnitude better performance than NoPruning, which shows the effectiveness of the employed strategies. Given this, we do not show any further results for NoPruning, as it is clearly non-competitive. Also, InitPhase gives much worse performance than Kite pruning, which shows the strong effect of the pruning phase. Finally, it is important to note that with k = 100, Kite gives a query latency of only 3 ms. Figures 4.3(b) and 4.3(c) give the effect of varying R and T, respectively, on the query latency for P, PR, and PT. Both figures show that P takes advantage of both spatial and temporal pruning to get to a query latency of at most 4 ms for 12-hour and 64-mile ranges. Increasing R and T increases the query latency of all alternatives; however, P still performs much better when using its two pruning techniques. It is also clear that PT achieves better performance than PR, i.e., temporal pruning is more effective than spatial pruning, which is a direct result of the default value of α=0.2 that favors the temporal dimension. Figure 4.3(d) gives the effect of varying α from 0 to 1 on the query latency, where P consistently has a query latency under 4 ms, while InitPhase has an unacceptable performance that varies from 15 to 35 ms. This shows the strong effect of the pruning phase in P. Meanwhile, with increasing α, the temporal boundary of PR increases and hence it visits more microblogs inside each cell. For low values of α (< 0.5), the number of additional microblogs visited due to increasing the temporal boundary is larger than the number of microblogs that are pruned by spatial pruning. This increases the overall latency of PR. When α ≥ 0.5, the number of microblogs that PR prunes spatially becomes larger than the additional microblogs visited due to enlarging the temporal horizon. Hence, PR latency quickly improves and beats PT at α > 0.8. This means that for all values of α < 0.8, temporal pruning is still more effective than spatial pruning. PT has a stable performance with respect to varying α. In all cases, P takes advantage of both spatial and temporal pruning to achieve its overall performance of around 4 ms.

Effect of Different Ranking Functions

Figure 4.4 compares the performance of Kite query processor variations for both ranking functions (linear F-Lin and exponential F-Exp). In Figure 4.4(a), the query latency of all alternatives of F-Exp is bounded between PT-Lin and P-Lin, except for large values of R (> 64), where P-Exp has a lower latency than P-Lin. This behavior can be interpreted by discussing two contradicting factors: (1) the computation cost, and (2) the pruning effectiveness of each ranking function. First, the cost of computing the exponential score in F-Exp is higher than that of the linear score in F-Lin due to the higher mathematical complexity. As this operation repeats for every single microblog, its cost is not negligible. Second, F-Exp is much more powerful in pruning the search space. For the same increase in either spatial or temporal distance, the exponential score is demoted rapidly, and thus the search can quit much earlier than with the linear score.

[Plots: query latency (ms) for PR-Lin, PT-Lin, P-Lin, PR-Exp, PT-Exp, and P-Exp; (a) R (miles) vs. Query Latency, (b) α vs. Query Latency]


Figure 4.4: Ranking effect on query pruning

Consequently, in Figure 4.4(a), for R values ≤ 64, the expensive computation cost of F-Exp makes all its alternatives have higher latency than P-Lin, while its pruning power makes them better than both PT-Lin and PR-Lin. For larger values of R (> 64), when many cells are involved in the query, the pruning power of F-Exp makes more difference and gives P-Exp a latency of 11 ms for R = 256 compared to 16 ms for P-Lin. Consistently, both PR-Exp and PT-Exp have query latency as low as P-Lin, which shows two conclusions: (a) Pruning a single dimension using the exponential score gives the same latency as pruning both dimensions using the linear score. (b) Unlike F-Lin, all F-Exp alternatives have query latency within a small margin, which shows that pruning either the spatial or the temporal dimension has the same effectiveness, contrary to F-Lin, in which temporal pruning is much more effective than spatial pruning. Figure 4.4(b) shows the query latency varying α. In this figure, the computation cost of F-Exp dominates its pruning power (as the default R value is 30 miles), and so all alternatives of F-Exp have slightly higher latency than P-Lin but still lower than PR-Lin and PT-Lin. The figure also shows the effectiveness of both spatial and temporal pruning using F-Exp.

Algorithm 1 Query Processor
1: Function Query Processor (u, k, T, R, α)
2: H ← φ; AnswerSet ← φ; MIN ← ∞; R′ ← R; T′ ← T
3: for each leaf cell C that overlaps with R do
4:   BestScore ← α SpatialScore(Ds(u.loc, C)) + (1−α) TemporalScore(Dt(C.MList[1].time, NOW))
5:   Insert (C, 1, BestScore) into H
6: end for
7: TopH ← Get (and remove) first entry in H
8: while TopH is not NULL and TopH.score < MIN do
9:   Score ← TopH.score; M ← TopH.C.MList[TopH.index]
10:  NextScore ← score of current top entry in H
11:  while Score < NextScore and M is not NULL do
12:    if M.loc inside R′ then
13:      Score ← α SpatialScore(Ds(u.loc, M.loc)) + (1−α) TemporalScore(Dt(M.time, NOW))
14:      if Score < MIN then
15:        Insert (M, Score) in AnswerSet
16:        if |AnswerSet| ≥ k then
17:          Trim AnswerSet size to k
18:          MIN ← AnswerSet[k].score
19:          R′ ← Min(R′, PruneRatios × R′)
20:          T′ ← Min(T′, PruneRatiot × T′)
21:        end if
22:      end if
23:    end if
24:    M ← Next microblog in TopH.C.MList
25:    if M.time outside T′ then M ← NULL
26:  end while
27:  if M ≠ NULL then Insert (C, index(M), BestScore) in H
28:  TopH ← Get (and remove) first entry in H
29: end while
30: Return AnswerSet

Chapter 5

Trend Discovery

This chapter addresses the microblog spatial trending query that is defined in Definition 1.4 in Chapter 1. This query is defined on the recent time horizon to discover timely trending topics from microblogs data, e.g., tweets, comments, and check-ins. Timely discovering and understanding localized trending events from online microblogs has become a reality. In fact, news agencies and people have referred to Twitter (a prime microblogging service) to get timely news about various events, e.g., Michael Jackson's death [34], the Boston explosions [3], tracking health issues [85], and the China floods [4]. This is so popular that Twitter outstrips TV as a news source for young people [86]. As a result, Twitter has released its own feature of localized trending hashtags [87], which shows the current trending hashtags in a country or a city. Following the needs and importance of such a feature, various research efforts were dedicated to online local event discovery from microblogs [35, 20, 36, 88, 23]. Unfortunately, current efforts are tailored to finding events in pre-defined areas, where one needs to first specify the areas of interest, e.g., Minneapolis, and then start detecting the events in these areas. In order to have worldwide high-resolution coverage of such a feature, there is a real need for an event detection technique that: (a) covers arbitrary ad-hoc areas that are not pre-specified to the system, and (b) covers high-resolution areas, e.g., finding events within part of a city, or events at the street level. To the best of our knowledge, there are two main attempts to support localized trend discovery with arbitrary spatial regions [54, 23]. However, one of these techniques ([23]) is built on two simplistic assumptions: (1) It assumes a very simplistic definition of

"trending" queries as "frequent" queries, which can be computed through simple counting techniques, and (2) It assumes that the underlying system has unlimited memory. Hence, it does not account for expiring data from memory, which is crucial to ensure the accuracy of trending queries on recent data. Meanwhile, the second approach ([54]) is designed in a generic way to support trending queries for various contexts, where location can be considered as a context. Due to its generic nature, it has two main drawbacks: (1) It does not take advantage of the distinguishing characteristics of the spatial dimension, and (2) It is mainly designed to handle queries on arbitrarily large historical time periods, which makes it poor in handling queries on recent data in terms of both query performance and memory consumption, while recent data is the most important in discovering timely trends. We present in this chapter GeoTrend; a scalable technique that supports online trending queries on arbitrary ad-hoc areas with limited memory resources. GeoTrend abstracts localized trending queries to be of the form: "Find the top-k trending keywords in the last T time units in area R", where R is an arbitrary ad-hoc area and the keyword search is a proxy for trending events. GeoTrend adopts a wide definition of trending keywords that goes beyond the simple counting assumption (i.e., frequent keywords) to consider trending as the growth in the number of appearances over the query period T. It is likely that trending keywords are not among the frequent ones. For example, the keyword "love" is consistently frequent on Twitter, and it appears much more frequently than the keyword "elections", while the latter is considered trending over the election week. This particular property, along with the focus on supporting recent trending queries (i.e., the last T time units), constitutes the main distinctions of GeoTrend over its main competitors [54, 23]. GeoTrend exploits the same in-memory spatial index that is used in Kite spatial queries (the incomplete pyramid structure [74]), which is able to digest incoming real-time microblogs with high arrival rates. The incomplete pyramid hierarchy divides the entire space into a set of multi-layer cells, where cells in each layer are non-overlapping. To accommodate incoming data in limited memory resources, each index cell is equipped with a novel and efficient count aggregation technique that maintains count-based measures over the last T time units and expires data that is outside T. Injecting the concept of expiration in our aggregation is key to GeoTrend's success, as it ensures discovering trends from only recent data and ensures continuous digestion of fresh microblogs in the limited memory. GeoTrend's count aggregation technique distinguishes itself from all previous sliding-window counting techniques (e.g., [89, 90, 91, 92]) by its simple and efficient structure that uses low-overhead update techniques to digest/expire microblogs at high rates; up to an order of magnitude higher than the Twitter rate. In particular, it uses constant memory per keyword regardless of the length of the time span T. This is in contrast to existing techniques that have memory overhead proportional to T. This enables GeoTrend to support arbitrarily large time spans with millions of keywords while using much less memory. For scalable query processing, each GeoTrend index cell maintains a materialized list of the top-k trending keywords that appear within the cell's spatial boundaries.
Then, incoming queries with arbitrary spatial regions efficiently merge the materialized top-k lists to come up with a final top-k list. In system peak times, with more keywords arriving within the query time T, GeoTrend employs a memory optimization technique that exploits the nature of user-generated data to smartly select and shed less important keywords that are unlikely to contribute to any incoming query. Our experiments show that GeoTrend digests microblogs at high rates of up to 50K microblog/second, provides an average query latency of 3 milliseconds, and achieves much lower memory consumption than its competitors with 90+% query accuracy. The rest of this chapter is organized as follows. Section 5.1 introduces our trending measures. GeoTrend indexing, memory optimization, and query processing are discussed in Sections 5.2, 5.3, and 5.4, respectively.

5.1 Trending Measures

Discovering trending items in microblogs currently depends on keyword counts [20, 93, 94] within a limited time period, due to their simple computations that scale for massive numbers of microblogs. However, an absolute count measure does not capture trending items effectively. In fact, it promotes keywords that are perpetually among the top frequent ones, e.g., job and love, while ignoring other keywords that encounter a considerable increase in count over time but are not among the top frequent ones. For example, consider two keywords, love and elections. Taking their counts on an hourly basis over the last three hours, love has appeared 1000, 1150, and 950 times, while elections has appeared 200, 400, and 600 times. While love is the most frequent, it is clear that elections is the trending one. Yet, depending on absolute counts does not capture this. To overcome such a limitation, trending items in the broader context of streaming data [95, 96] are detected based on changes in item behavior over time. This correctly detects rising keywords even if they are not top frequent. However, existing popular measures usually include expensive computations, e.g., Singular Value Decomposition, which are not efficient to maintain incrementally. In fact, efficient incremental computation is crucial for the scalability of microblog environments, so that measures are not recomputed with each new arrival of keywords that come at fast rates. For this, GeoTrend uses an efficient and effective measure that is based on the keyword's rate of count increase over time. Counts are easy to compute and maintain incrementally over time, so measures that depend on counts are suitable to scale in microblog environments. GeoTrend can adopt several trending measures as long as each of them is based on counting. Thus, GeoTrend is equipped with two measures: either the rate of count increase over time, or the weighted count over a recent time period, introduced in Sections 5.1.1 and 5.1.2, respectively.

5.1.1 Rate of Increase Measure

The rate of count increase over time is measured using a trend line slope that is computed based on statistical linear regression [97]. Assuming the last T time units are divided into N equal time intervals, the trend line slope gauges the increase in keyword count in recent intervals compared to the oldest interval as follows:

$$Trend_{reg} = \frac{6 \sum_{i=1}^{N-1} [i \times (c_i - c_0)]}{N(N+1)(2N+1)} \qquad (5.1)$$

where N is the number of time intervals over which the count change is gauged during the last T time units, and $c_i$, $0 \le i \le N-1$, is the keyword count in interval i. For $i > j$, interval i is more recent than interval j, so that $c_0$ is the oldest counter and $c_{N-1}$ is the most recent counter. The detailed derivation of Equation 5.1 based on the linear regression slope is shown below in Lemma 5.1. The value of N controls the accuracy of discovering trending items, as it represents the number of counts for which the regression slope is calculated. The higher the value of N, the more accurate the regression output. Setting N=T gives the highest accuracy, yet it is the most expensive option computationally and memory-wise. On the contrary, setting N=2 is the least expensive option that divides the whole T time units into two intervals, yet it provides the least accuracy and might miss actually rising keywords.

The Trend_reg measure presented in Equation 5.1 is also efficiently maintainable in an incremental way on the arrival of new appearances of the keyword. As a new keyword appearance increases the count of just the most recent counter $c_{N-1}$, the only affected term in Trend_reg is $(N-1) \times (c_{N-1} - c_0)$. With $c_{N-1}$ increased by one, this term is increased by $(N-1)$, and thus the whole Trend_reg value is increased by $\frac{6(N-1)}{N(N+1)(2N+1)}$ (per Equation 5.1). When the value of N is fixed throughout the processing of a microblog stream, which is the realistic case, this increase in Trend_reg is a constant value, which guarantees efficient incremental maintenance of Trend_reg in real-time environments.
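A minimal sketch of this incremental maintenance in Python follows, using the love/elections counts from the beginning of this section (with N=3). The class name `TrendCounter` is hypothetical; the code only illustrates Equation 5.1 and its constant-increment update.

```python
class TrendCounter:
    """Maintains Trend_reg (Eq. 5.1) incrementally for one keyword."""

    def __init__(self, n_intervals):
        self.N = n_intervals
        self.c = [0] * n_intervals
        self.trend = 0.0
        # Constant added to Trend_reg per arrival in the newest interval.
        self.delta = 6 * (self.N - 1) / (self.N * (self.N + 1) * (2 * self.N + 1))

    def record_arrival(self):
        self.c[-1] += 1           # only c_{N-1} changes ...
        self.trend += self.delta  # ... so Trend_reg grows by a constant

    def recompute(self):
        """Full Eq. 5.1, used to verify the incremental value."""
        num = 6 * sum(i * (self.c[i] - self.c[0]) for i in range(1, self.N))
        return num / (self.N * (self.N + 1) * (2 * self.N + 1))

def trend_of(counts):
    t = TrendCounter(len(counts))
    t.c = list(counts)
    return t.recompute()

# Hourly counts from Section 5.1: elections rises, love does not.
assert round(trend_of([200, 400, 600]), 2) == 71.43   # 6*(200+800)/84
assert round(trend_of([1000, 1150, 950]), 2) == 3.57  # 6*(150-100)/84
```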

Lemma 5.1 Given a keyword’s consecutive count vector c = [c0, c1,..., cN−1], the keyword trend line can be estimated with the following formula:

$$Trend_{reg} = \frac{6 \sum_{i=1}^{N-1} [i \times (c_i - c_0)]}{N(N+1)(2N+1)} \qquad (5.2)$$

Proof: The simple linear regression slope Trend_reg of x and y is given by the following equation:

$$Trend_{reg} = \frac{Mean(xy)}{Mean(x^2)} \qquad (5.3)$$

where Mean(x) is the average value of the vector x, and xy is the vector that results from the element-wise multiplication of the vectors x and y. In GeoTrend, the values of vector x are always constants, while vector y contains the set of counts of a keyword W. Thus, the values of vector x are always $[1, 2, 3, \ldots, N]$, while the values of vector y are $[c_1, c_2, c_3, \ldots, c_N]$. The term $Mean(x^2)$ equals $\frac{\sum_{i=1}^{N} i^2}{N}$ and can be simplified to $\frac{(N+1)(2N+1)}{6}$. Also, the term $Mean(xy)$ equals $\frac{\sum_{i=1}^{N-1} i \times c_i}{N}$. Substituting both terms in Equation 5.3:

$$Trend_{reg} = \frac{\frac{1}{N}\sum_{i=1}^{N-1} i \times c_i}{\frac{(N+1)(2N+1)}{6}} = \frac{6 \sum_{i=1}^{N-1} i \times c_i}{N(N+1)(2N+1)} \qquad (5.4)$$

The equation above assumes that the measurement is used from the start of the stream and each keyword W starts with frequency zero. However, in GeoTrend, we need to consider the start position of a keyword W by using the previous frequency, namely $c_0$. Thus, the equation above can be modified to:

$$Trend_{reg} = \frac{6 \sum_{i=1}^{N-1} [i \times (c_i - c_0)]}{N(N+1)(2N+1)} \qquad (5.5)$$

5.1.2 Weighted Count Measure

As an extensible framework for any count-based aggregate measure, GeoTrend can employ a weighted count over a recent time period to detect frequent keywords in different spatial regions. Assuming the last T time units are divided into N equal time intervals, the keyword weighted count can be measured as follows:

$$Trend_{freq} = \sum_{i=0}^{N-1} c_i \times w^{N-1-i} \qquad (5.6)$$

where $0 < w \le 1$ is a weighting parameter, and N is the number of time intervals over which the count is gauged during the last T time units. $c_i$, $0 \le i \le N-1$, is the keyword count in interval i. For $i > j$, interval i is more recent than interval j, so that $c_0$ is the oldest counter and $c_{N-1}$ is the most recent counter.

Trend_freq is an exponentially weighted sum of the N counters, where recent keyword counts have higher weights than older ones. The weight of counter $c_i$ is $w^{N-1-i}$; the most recent time period, $i = N-1$, has the highest weight $w^0 = 1$, regardless of the value of w. A smaller w gives lower weights to older counts, and setting w to 1 gives equal weights to all counts and produces the total count over the last T time units. Similar to Trend_reg, the value of Trend_freq is also efficiently maintainable in an incremental way, where each new instance of a keyword simply adds one to both $c_{N-1}$ and Trend_freq. For presentation simplicity, we assume to maintain a single trending measure,

Trend_reg (Equation 5.1). However, GeoTrend can easily maintain more than one measure simultaneously to support queries that get either recently rising keywords or absolutely frequent keywords using the same indexing data structures.

5.2 Trending Indexing

GeoTrend exploits the Kite spatial index structure (the spatial pyramid index [74]) to efficiently support queries in arbitrary spatial regions. The index divides the space into multi-layer cells of different spatial granularity, where each layer consists of a set of non-overlapping cells. For each incoming microblog in real time, GeoTrend stores only its keywords and their aggregate information rather than the microblog itself. To support fast digestion of fast microblog streams and low query latency, the index is wholly resident in main memory. However, main-memory resources are limited and cannot accommodate microblogs' aggregate information for infinite time. Consequently, GeoTrend limits its index contents to aggregate information of data that arrived only in the last T time units, where old data that is outside the time span T is expired. The length of the window T depends on the available memory resources, and typical values range from several hours to a few days of microblogs data. Indexing data of multiple days, which comprise hundreds of millions of microblogs, is feasible as the index does not store individual data records, but only aggregate information. Yet, such fast data rates impose scalability challenges on both index insertion and expiration. Insertion and expiration in microblog environments are so challenging that residing in main memory is not enough to scale for high microblog arrival rates. In particular, inserting keywords can be performed in a traditional way [74], where each new keyword traverses the pyramid structure from its root cell, passing by all intermediate cells, reaching the leaf cell that includes its point location, to update the cells' contents. However, this is expensive given the large number of keywords that arrive every second in microblogs. To overcome this, GeoTrend employs a bulk insertion technique that reduces the insertion cost so that it scales to digest high arrival rates. Similarly, expiring contents from the GeoTrend index should ideally be performed at rates similar to insertion so that the index storage is stable in the system's steady state. With a large number of cells and high data rates, proactive expiration that iterates over all index cells is very expensive and puts significant overhead on the index performance. To scale, GeoTrend employs a lazy expiration that dramatically reduces the expiration cost. The rest of the section presents the index structure (Section 5.2.1), insertion (Section 5.2.2), and expiration (Section 5.2.3).

[Diagram: the spatial pyramid index with example leaf and intermediate cell contents; per-keyword counters c0 to c3, Trend values, rotating pointer p, and materialized Top-2 lists]

Figure 5.1: GeoTrend index structure and cell contents.

5.2.1 Index Structure

As a recall, the GeoTrend pyramid index structure is similar to a partial quad tree: it consists of a single root cell that represents the entire geographic area, level 1 partitions the space into four equi-area disjoint cells, and so forth. As a partial tree structure, any index level can have both leaf and intermediate cells. Figure 5.1 depicts an instance of the GeoTrend pyramid index. The figure shows a partial pyramid that divides the space into three levels, where light gray cells indicate intermediate cells, dark cells indicate leaf cells, and white cells replace areas that are not actually maintained at that level. Keyword aggregate information is stored in both leaf and intermediate cells, so that information of the same keyword is aggregated at different levels of spatial granularity. Each index cell C, both leaf and non-leaf, stores four data structures: a hash table

H, a sorted list TopK, a rotating pointer p, and a timestamp t_last, described below. Hash table H. Each hash entry h ∈ H represents a single keyword that arrived to cell C in the last T time units. With each hash entry h, we maintain the following:

1. A set of N counters, $c_0$ to $c_{N-1}$. The N counters divide the time window T into a set of equal temporal intervals, each of T/N time units. Each counter maintains the number of times that the hash entry h has appeared in its corresponding T/N time units. N is a system parameter that trades query accuracy for computation efficiency, as discussed in Section 5.1. A larger value of N gives more accurate results, yet it comes with processing and storage overhead for maintaining more counters.

2. A trending value Trend that is calculated based on hash entry h's counters $c_i$ according to Equation 5.1.

List TopK. A sorted list of size k that maintains the top-k trending keywords in this cell, ranked based on the trending value Trend. This mainly materializes the top-k answer of this cell to speed up query processing significantly. Rotating pointer p. An integer value, in the range of 0 to (N − 1), that points to the current (i.e., most recent) counter. Thus, the most recent counter is $c_p$ and the oldest counter is $c_{(p-1)\%N}$. Maintaining p saves huge effort in shifting counter values and expiring old counters every T/N time units, as discussed in Section 5.2.3.

Timestamp t_last. The starting timestamp of the time interval of the last expiration of C's contents; it is used to decide which counters need to be expired in the following expiration cycle. Figure 5.1 shows the contents of two index cells, one intermediate cell and one leaf cell. Both cells enclose exactly the same data structures. The intermediate cell encounters more keyword arrivals as it lies one level higher than the leaf cell, and so it covers a four times larger area. The intermediate cell in Figure 5.1 contains five hashtags, Summer, CLS, Refugee, CampRoc, and HopeHick, each maintaining four counters, N=4, and a Trend value. It also maintains a top-2 list sorted on Trend values and an integer pointer p=3. The leaf cell in Figure 5.1 contains three hashtags, Summer, ENGSLO, and Tronc, which also maintain four counters per hashtag and a top-2 list. So, the N and k values are fixed for all index cells. Yet, its integer pointer is p=0, a different value than in the other cell, as the p value is updated on data expiration, which happens at different times in different cells, as discussed in Section 5.2.3.
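The following Python sketch shows one way such a cell could be laid out. It is an illustrative assumption (the class names `KeywordEntry` and `Cell` are invented here), covering only the four structures just described.

```python
from dataclasses import dataclass, field

@dataclass
class KeywordEntry:
    counts: list        # N counters, c_0 .. c_{N-1}, accessed via the cell's p
    trend: float = 0.0  # Trend value per Equation 5.1

@dataclass
class Cell:
    N: int                                    # counters per keyword
    k: int                                    # size of the materialized top-k
    H: dict = field(default_factory=dict)     # keyword -> KeywordEntry
    topk: list = field(default_factory=list)  # [(keyword, trend)], sorted desc
    p: int = 0                                # rotating pointer: c_p is current
    t_last: float = 0.0                       # start of last-expired interval

    def oldest_index(self):
        # Per the text: c_p is the most recent counter, c_{(p-1) % N} the oldest.
        return (self.p - 1) % self.N
```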

5.2.2 Index Insertion

To reduce the index update cost and scale for digesting high arrival rates, the GeoTrend spatial index employs an efficient bulk insertion technique that saves thousands of comparison operations between keyword locations and spatial cell boundaries compared to the traditional way of inserting individual data records. The bulk insertion process consists of two steps: (1) traversing the pyramid index cells with batches of keywords, and (2) while traversing, inserting the keywords in their corresponding cells. Each step is described below.

[Tables: example leaf-cell contents, with per-keyword counters, Trend values, rotating pointer p, and Top-2 list, before and after the operations; (a) Index Insertion, (b) Index Expiration]

Figure 5.2: Example of GeoTrend index insertion and expiration.

Pyramid traversal. To reduce the pyramid traversal cost, the incoming keywords are buffered for t seconds before being inserted in bulk. t represents a trade-off between the insertion overhead and the delay between a microblog's arrival and its availability to search results. A typical value of t is 1-2 seconds, which is an acceptable delay for real-time applications and still sufficient to collect several thousands of keywords to insert as a batch. For example, Twitter receives around 12,000 tweets every 2 seconds, which is a reasonable batch size that saves significant insertion cost. During the buffering, a spatial minimum bounding rectangle (MBR) is maintained around the point locations that are associated with the keywords. We then traverse the pyramid levels through comparing the MBR boundaries, instead of the locations of individual microblogs, and insert keywords in their corresponding cells. The buffered keywords are first inserted in the root cell C, as shown in cell insertion below. If C is not a leaf cell, the new keywords are recursively inserted in C's children cells: the new keywords are divided based on their locations into four MBRs, each enclosing the subset of the keywords that corresponds to one of the children cells. Then, the same cell insertion process is applied to each of the children cells. This leads to replicating all keywords across all index levels. Such replication significantly reduces the query latency for large query areas as it minimizes the number of processed cells for large query regions. On the other hand, it increases both the index insertion time and the memory consumption. Our experiments study the impact of this replication on indexing overhead, query processing, and memory consumption. Cell insertion. On the arrival of new keywords to any cell C, two steps are performed: (1) inserting the new keywords in the hashtable C.H, and (2) updating the list C.TopK that maintains C's top-k keywords. (1) Insertion in hashtable C.H. For each newly arrived keyword, if there is no corresponding hash entry in the hashtable C.H, it is added to C.H with zero-initialized N counters and Trend value. Then, regardless of whether there was a prior hash entry or not, its most recent counter $c_p$ is incremented by one, which leads its Trend value to be incremented by $\frac{6(N-1)}{N(N+1)(2N+1)}$ (per Equation 5.1). Such a constant increment to the Trend value makes it very efficient to maintain incrementally, as discussed in Section 5.1. (2) Updating list C.TopK. For each new keyword inserted in C.H, we check its Trend value to update the C.TopK list, if needed, so that it keeps maintaining the k most trending keywords in C. If C.TopK has fewer than k keywords, the new keyword is inserted in C.TopK directly. Once C.TopK has k keywords, the Trend value of each new keyword is compared to Trend_min, the lowest trending value in C.TopK. If the new keyword's Trend is larger than Trend_min, then it is inserted in C.TopK, replacing the keyword that corresponds to Trend_min. Example: Figure 5.2(a) shows an example of index insertion. The figure shows the content of the leaf cell of Figure 5.1 after inserting the hashtag Brexit. As the hashtag is not previously present in the cell, a new entry is added to the hashtable H with zero-initialized counters. Then, the most recent counter, $c_0$, is incremented and the Trend value is computed. As the new Trend value is eligible for the top-2 list, the hashtag Brexit is inserted into the list.
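Continuing the illustrative `Cell` sketch from Section 5.2.1, the cell-insertion step could look as follows; again, this is a hedged sketch of the described logic, not GeoTrend's actual code.

```python
def insert_keyword(cell, word):
    """Insert one keyword occurrence into `cell` (hash table + top-k update)."""
    entry = cell.H.get(word)
    if entry is None:
        entry = KeywordEntry(counts=[0] * cell.N)
        cell.H[word] = entry
    # Only the current counter c_p changes, so Trend grows by a constant.
    entry.counts[cell.p] += 1
    entry.trend += 6 * (cell.N - 1) / (cell.N * (cell.N + 1) * (2 * cell.N + 1))

    # Maintain the materialized top-k list, sorted descending by trend.
    cell.topk = [(w, t) for (w, t) in cell.topk if w != word]
    if len(cell.topk) < cell.k or entry.trend > cell.topk[-1][1]:
        cell.topk.append((word, entry.trend))
        cell.topk.sort(key=lambda wt: -wt[1])
        del cell.topk[cell.k:]
```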

5.2.3 Data Expiration

As the GeoTrend index limits its contents to data of the last T time units, it needs to periodically expire old data that is outside the time span T. Thus, every T/N time units, GeoTrend should hold on inserting new data, iterate over all index cells, and expire the old contents. However, this causes a significant interruption of the index's real-time insertion and terribly reduces its digestion rates. To prevent such interruption, GeoTrend skips such an expensive expiration that expires all cells at once and employs a lazy expiration technique that postpones expiring any index cell's contents until: (1) an insertion occurs in this cell, or (2) a query comes to this cell, and hence an expiration is necessary so as not to consider old data in the query answer. In both cases, expiration is necessary, and performed, only in a single cell. This minimally interrupts the real-time insertion of the GeoTrend index, as it expires only one cell at a time, and it even consumes no index traversal cost, as it piggybacks this cost on either insertion or query processing. The effect of putting this overhead on the query response is minimal, as expiration is performed once and pays off for all incoming queries. However, this lazy expiration does not guarantee to expire all old contents. In fact, cells that encounter neither insertions nor queries during the T time units, e.g., low-density spatial regions like suburbs, would keep very old contents. To overcome this, GeoTrend runs an additional cleaning process, every T time units, that is very light and efficient, so that it does not put an overhead on the index performance. Both lazy expiration and periodic light cleaning are described below. Lazy expiration. The contents of a cell C are expired only if they were last expired more than a complete period of T/N time units ago. This is checked through the C.t_last timestamp, which is the starting timestamp of the period when C.H was last expired. If

$n_c = \left\lfloor \frac{NOW - t_{last}}{T/N} \right\rfloor \ge 1$, then the oldest $n_c$ counters need to be expired and $C.t_{last}$ is updated to $t_{last} = t_{last} + n_c \times \frac{T}{N}$. For presentation simplicity, assume $n_c$=1, i.e., we expire only the oldest counter. Then, the oldest counter $c_{(p-1)\%N}$ should expire for all entries in the hashtable C.H. This requires setting the value of $c_{(p-1)\%N}$ to zero; the value of the pointer p is decremented to $p = (p-1)\%N$, and the aggregate Trend value is recomputed. This is repeated $n_c$ times when $n_c > 1$. Maintaining p saves huge effort in expiration. A traditional way is to shift the counter values for each hash entry. With p, we keep all counter values intact in their positions, and we just rotate the value of p so that the oldest expiring counter is replaced with a new one. With this, it is always the case that counter $c_p$ represents the current T/N time units while counter $c_{(p-1)\%N}$ represents the oldest T/N time units within the time span T. Expiring the contents of hashtable C.H invalidates the contents of the C.TopK list. Thus, C.TopK is recomputed with each expiration of C.H contents. However, recomputing the C.TopK list adds very little overhead to the lazy expiration process. While updating the Trend value of each hash entry h, h is considered as a potential candidate for C.TopK. If C.TopK has fewer than k keywords, then h is inserted in C.TopK right away. If C.TopK has k keywords, then h.Trend is compared to Trend_min, the lowest trending value in C.TopK. If h.Trend is larger than Trend_min, then h is inserted in C.TopK, replacing the keyword that corresponds to Trend_min. This repeats for each hash entry h while its counters are updated.

Example: Figure 5.2(b) gives the contents of the cell of Figure 5.2(a) after $c_0$'s time period expires. In this case, (a) $c_0$ is concluded, (b) the oldest counter $c_3$ expires its old values and is reset to zero for all keywords, (c) the current pointer p becomes 3 as $c_3$ becomes the current active counter, and (d) Trend values are recomputed based on the new counter positions, where $c_2$ is now the oldest counter. Meanwhile, the top-2 list is recomputed, based on the new Trend values, to include the Brexit and ENGSLO keywords. Light cleaning. To account for sparse cells that rarely encounter insertions and queries, and hence do not encounter any lazy expiration, we run a light periodic cleaning. Every T time units, a light expiration process traverses all index cells. If a cell was last expired more than T time units ago, then all cell contents are wiped; otherwise, nothing is done. This process intentionally overlooks contents that are within the last T time units but still old enough to be expired, i.e., older than T/N time units. This is intended to make it very light and efficient, while these contents are left for the next lazy or periodic expiration in the cell. Although some cells would contain unneeded contents for up to T time units, practically this does not cause much overhead as they are very sparse cells. As the light cleaning process wipes all of a cell's contents, no TopK update is needed, as TopK is wiped as well.
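A sketch of the lazy expiration step, continuing the same hypothetical `Cell` layout, could look like this (assuming `now`, `T`, and `t_last` are in the same time unit):

```python
import math

def logical_counts(entry, p, N):
    """Return counters ordered oldest -> newest (c_0 .. c_{N-1} in Eq. 5.1)."""
    return [entry.counts[(p + N - 1 - j) % N] for j in range(N)]

def trend_reg(c):
    N = len(c)
    return 6 * sum(i * (c[i] - c[0]) for i in range(1, N)) / (N * (N + 1) * (2 * N + 1))

def lazy_expire(cell, now, T):
    """Expire the oldest counter(s) of `cell` once a T/N interval has passed."""
    interval = T / cell.N
    n_c = math.floor((now - cell.t_last) / interval)
    if n_c < 1:
        return
    for _ in range(min(n_c, cell.N)):
        oldest = (cell.p - 1) % cell.N
        for entry in cell.H.values():
            entry.counts[oldest] = 0   # reset the oldest counter ...
        cell.p = oldest                # ... which becomes the new current one
    cell.t_last += n_c * interval
    # Recompute Trend and rebuild TopK in the same pass over the hash entries,
    # mirroring how recomputation piggybacks on the expiration pass.
    for entry in cell.H.values():
        entry.trend = trend_reg(logical_counts(entry, cell.p, cell.N))
    cell.topk = sorted(((w, e.trend) for w, e in cell.H.items()),
                       key=lambda wt: -wt[1])[:cell.k]
```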

5.3 Memory Optimization

As the GeoTrend index is wholly resident in main-memory, it might be the case that, during peak times, e.g., local events in major cities, the available memory resources are insufficient to store the vast amount of incoming data. In that case, some applications are willing to remove a portion of the memory contents that minimally affects query accuracy while still sustaining the system's real-time performance during peak times.


Figure 5.3: Zipf distribution of Twitter keywords at different spatial levels.

Thus, GeoTrend employs a main-memory optimization technique, called TrendMem, that reduces the memory footprint significantly while keeping query answers highly accurate. TrendMem is based on a key observation that identifies a very interesting spatial property of microblogs data. This property is used to smartly identify victim data to expel from main-memory without sacrificing query accuracy. In the rest of this section, Section 5.3.1 presents the key observation and key idea behind TrendMem. Then, Section 5.3.2 presents the details of the TrendMem realization inside the GeoTrend index.

5.3.1 TrendMem Key Ideas

Key observation. Memory optimization in GeoTrend takes advantage of the observation that keyword popularity in microblogs follows a Zipf distribution [98, 99, 100, 101], i.e., a small percentage of keywords appear with high frequency while the majority of keywords appear very few times. Interestingly, the Zipf distribution holds not only for the entire microblogs collection over the entire world, but also for microblogs appearing in smaller spatial regions. We demonstrate this property in Figure 5.3, which shows the frequency distribution of millions of real tweets at four different levels of spatial granularity (Level 1 is the entire USA, Level 2 is the four quarters of the USA, and so on). The figure shows that the majority of keywords in the Twitter stream are infrequent across all levels of spatial granularity. These infrequent keywords consume a large percentage of the memory for their counters. Our TrendMem technique exploits the existence of such infrequent keywords in a smart way to identify a subset of them that is very unlikely to contribute to trending query answers. This subset is shed from main-memory without hurting the accuracy of query answers.

Key idea. The key idea of our TrendMem technique is that some keywords with low frequency are unlikely to be trending. Those keywords must satisfy a crucial condition: they must have low frequency in all sub-intervals of the last T time units. This condition is sufficient because it accounts for count change over time, which is exactly what our trending measures capture (Section 5.1). To elaborate, if we judged a keyword's importance only through its total count during the last T time units, it might be the case that a keyword has a low total count while its count is rising significantly over time. Thus, we might end up removing trending keywords from main-memory. However, if we ensure that the keyword count is low in all sub-intervals of the last T time units, then it is very unlikely that the growth of this keyword's count makes it a potential trending keyword. Then, it is unlikely to contribute to query results and it can be removed without affecting query accuracy.

5.3.2 TrendMem Technique

Main idea. In each cell C of the GeoTrend index, TrendMem periodically removes keywords that are ǫ-infrequent in all the N time intervals of the last T time units. An ǫ-infrequent keyword is a keyword that has count less than ǫ · n, where ǫ is a small fraction, e.g., 0.001, and n is the total number of keyword arrivals in cell C in the corresponding time interval. For example, if C received a total of ni keyword arrivals during time interval i, 0 ≤ i ≤ (N − 1), then a keyword W is considered ǫ-infrequent if its counter ci < ǫ · ni for all 0 ≤ i ≤ (N − 1). Removing infrequent items from a cell C is invoked every 1/ǫ insertion cycles in C. This limits the size of the hashtable C.H to O((1/ǫ) log(ǫ · n)) entries (inspired by the ideas of the LossyCounting algorithm [102]). Also, any keyword with total count > (ǫ · n) in any sub-interval of T is guaranteed to be maintained. In fact, checking a keyword to be infrequent in each of the N sub-intervals independently ensures the consistency of the keyword's infrequency along the whole time window T, and thus guarantees not to expel any possibly trending keywords, as discussed in Section 5.3.1. In addition, the threshold ǫ is relative: keyword importance is identified based on a percentage of the frequencies of its neighboring keywords within the same spatial locality. This guarantees that dense spatial areas do not affect suburban areas, and leads to maintaining an accurate top-k keyword list in each spatial locality. Consequently, TrendMem provides highly accurate query answers.

Impact on the index. To realize TrendMem inside the GeoTrend index, two main operations are added to index insertion: (1) periodic cleaning of infrequent keywords inside each cell every 1/ǫ insertion cycles in the cell, and (2) checking for ǫ-infrequent keywords in each sub-interval to decide which keywords to remove. To this end, each index cell maintains an insertion cycles counter that is initialized to zero. The counter is incremented by one with every insertion in the cell. Once it reaches 1/ǫ, the cleaning procedure is triggered and the counter is reset to zero. The cleaning procedure goes through a complete scan of all hash entries in hashtable H and removes any keyword that is consistently infrequent during all N intervals. To check for keyword infrequency in each sub-interval independently, each cell maintains N additional counters ni, 0 ≤ i ≤ (N − 1), that keep the total number of keyword arrivals in each of the N sub-intervals of the time window T. Thus, with each insertion into the cell, the counter of the current interval is incremented by the number of new keywords. Using this, the infrequency check is performed very cheaply by comparing each keyword counter ci to ǫ · ni, for all 0 ≤ i ≤ (N − 1). A typical value of ǫ is around 0.001, which is large enough to limit the memory footprint without really affecting the accuracy of the query result.

Although introducing ǫ saves significant storage, executing the periodic cleaning procedure incurs additional computational overhead during the index insertion operation. Since we trigger our cleaning procedure every 1/ǫ insertions, a lower value of ǫ implies less frequent cleaning, i.e., less insertion overhead and less storage saving, but higher query accuracy. For example, when ǫ is 0.01, we run the cleaning procedure every 100 insertions. Yet, when ǫ is 0.0001, we perform the cleaning every 10,000 insertions, which is cheaper in computation cost and achieves higher query accuracy, but consumes more memory. Our experimental evaluation studies in detail the effect of varying ǫ on the insertion overhead, storage saving, and query accuracy, showing that GeoTrend provides a reasonable compromise that achieves excellent performance for all of them.
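The following Python sketch illustrates the ǫ-cleaning procedure. It assumes the Cell layout sketched earlier, extended with per-interval arrival totals cell.n[i] and an insertion-cycle counter cell.cycles; these attributes and the function names are illustrative assumptions, not the actual GeoTrend code.

```python
EPSILON = 0.001  # example value of the fraction epsilon

def trendmem_clean(cell):
    """Remove keywords that are epsilon-infrequent in ALL N sub-intervals,
    i.e., c_i < epsilon * n_i for every 0 <= i <= N-1 with arrivals."""
    doomed = [keyword for keyword, counters in cell.H.items()
              if all(counters[i] < EPSILON * cell.n[i]
                     for i in range(cell.N) if cell.n[i] > 0)]
    for keyword in doomed:
        del cell.H[keyword]

def on_insert(cell, keywords, now):
    """Insertion hook: count arrivals per sub-interval and trigger the
    cleaning procedure every 1/epsilon insertion cycles in the cell."""
    cell.lazy_expire(now)             # piggyback lazy expiration
    cell.n[cell.p] += len(keywords)   # arrivals in the current interval
    cell.cycles += 1
    if cell.cycles >= 1 / EPSILON:
        trendmem_clean(cell)
        cell.cycles = 0
```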

5.4 Trending Query Processing

This section discusses query processing in GeoTrend. As the GeoTrend index already materializes the top-k items in each spatial cell, processing top-k queries is simple, efficient, and provides low response time. In fact, GeoTrend query processing finds the top-k keywords in the query region R by manipulating only the top-k lists maintained in the index cells that overlap with R. Our hypothesis is that it is highly unlikely that a keyword that did not make it to any of the top-k lists of these cells would make it to the final answer. The main reason is that our trending measures are additive (per Equations 5.1 and 5.6), which means the trending value of a certain keyword W over an arbitrary region R equals the summation of W's trending values in all index cells that overlap with R. Thus, top-k items within each cell have much better chances to be the global top-k items in R. This hypothesis is supported empirically by our experimental results, where the vast majority of queries can get the true top-k trending keywords in R from the ones that appear in some top-k list.

GeoTrend query processing is composed of two main steps. In the first step, GeoTrend finds a set of pyramid index cells S that cover the query spatial region R in a way that minimizes the number of cells in S while maximizing the coverage ratio with R. In the second step, it finds the top-k keywords in R by aggregating the values from only the top-k lists maintained in the cells of S. Details of the two steps are described below.

Step 1 takes the query spatial region R and the root cell of the spatial pyramid index as input and outputs a set of cells S that completely cover R, such that: (a) the number of cells in S is minimal, which reduces the aggregation cost in Step 2, and (b) the cells in S have the highest overlap ratio with R, which maximizes the accuracy of the retrieved results. We define the overlap ratio between a cell C and the query region R as the area of the part of C that is inside R divided by the area of C, i.e., area(C ∩ R) / area(C). Starting at the pyramid root cell, we recursively visit the children overlapping with R. A cell C is added to S if one of the following two conditions is satisfied: (1) C is a leaf cell, or (2) C is completely inside R, i.e., an overlap ratio of 100%. In both cases, we know that C has the best covering area, which is the same coverage we can get from C's children. So, to minimize the number of cells in S, we just add C and skip all its children. Otherwise, we visit the children cells applying the same procedure.

Step 2 takes the set of cells S from Step 1 as input and produces the final answer of the top-k keywords that appear in S. In this step, we only consider keywords that appear in at least one top-k list among the cells in S. Following the spirit of Fagin's TA algorithm [103], the main idea of this step is to employ a max-heap priority queue, initialized with the top item of each list in S. The key of the priority queue is the trending value. Then, we keep extracting items from the queue one by one. For each extracted item Top, we do the following: (1) We compute the total trending value of Top as the sum of its values in all cells in S. (2) If the total value of Top is among the highest k found so far, we update our final answer accordingly. (3) We replace Top in the priority queue by the next item in the top-k list of its cell, if any. This is repeated until either all top-k lists in S are exhausted or the maximum possible total value for any remaining keyword is less than the kth entry in the current final answer. This maximum value is upper bounded by the summation of the existing keys in the max-heap.
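A compact Python sketch of the two steps is given below. The cell interface (children, is_leaf, overlaps, inside) and the helper trend_in are illustrative assumptions; the early-termination bound follows the heap-key summation described above.

```python
import heapq

def select_cells(cell, R, S):
    """Step 1: recursively collect a minimal set of cells S covering R.
    A cell is taken whole if it is a leaf or lies completely inside R."""
    if cell.is_leaf or cell.inside(R):
        S.append(cell)          # same coverage as its children; skip them
        return
    for child in cell.children:
        if child.overlaps(R):
            select_cells(child, R, S)

def aggregate_topk(S, k, trend_in):
    """Step 2: TA-style aggregation over the materialized top-k lists.
    trend_in(w, cell) returns w's trending value in cell (0 if absent);
    cell.topk is a list of (trend, keyword) sorted in descending order."""
    heap = []                   # max-heap via negated keys
    for ci, cell in enumerate(S):
        if cell.topk:
            t, w = cell.topk[0]
            heapq.heappush(heap, (-t, w, ci, 0))
    totals = {}                 # keyword -> total trend over all cells in S
    while heap:
        neg_t, w, ci, pos = heapq.heappop(heap)
        if w not in totals:     # additivity: sum w's value over all of S
            totals[w] = sum(trend_in(w, cell) for cell in S)
        if pos + 1 < len(S[ci].topk):   # advance this cell's list
            t2, w2 = S[ci].topk[pos + 1]
            heapq.heappush(heap, (-t2, w2, ci, pos + 1))
        if len(totals) >= k:
            kth = sorted(totals.values(), reverse=True)[k - 1]
            # upper bound on any unseen keyword: sum of current heap keys
            if sum(-key for key, _, _, _ in heap) < kth:
                break
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]
```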

5.5 Experimental Evaluation

This section evaluates the GeoTrend technique experimentally. We compare GeoTrend with AFIA [23] and GARNET [54], which are the state-of-the-art and the closest work in the literature. Our AFIA implementation uses two spatial grid levels of granularity, 1km × 1km and 10km × 10km, and four levels of temporal resolution: hour, day, week, and month. GARNET [54] is primarily proposed for queries of any generic context; we instantiated the context as location to use a one-level spatial grid index with a resolution of 10km × 10km per cell. We use GARNET memory components and limit our evaluation to its in-memory performance, which is the main focus of GeoTrend queries. We also evaluate different design choices and modules of GeoTrend, including the memory optimization technique, replicating keywords across index levels, and materializing the top-k list at indexing time.

Query workloads. We use one million query point locations from the Bing Mobile search engine, obtained through our collaborations with Microsoft Research. We use these real query locations to query our system for point queries. Trending queries use spatial regions where the point locations are used to compose a default query load of 1000 MBR queries (centered around the point locations) with different area sizes that range from 4mi2 to 400Kmi2, including 15% with large areas (40Kmi2 to 400Kmi2).

Default values. Unless mentioned otherwise, the default value of k is 100, the microblog arrival rate λ is 1000 microblogs/second, T is 24 hours, N is 8 counters per hash entry, ǫ is 0.001, and the cell capacity is 1000 microblogs.

The rest of this section is organized as follows. Section 5.5.1 evaluates the memory overhead of different alternatives and its effect on index scalability. Section 5.5.2 evaluates trending query processing.

5.5.1 Evaluation of TrendMem

Figure 5.4 shows the memory usage of different techniques, studying the impact of the memory optimization module on both index scalability and query accuracy. We evaluate the GeoTrend pyramid index with and without the memory optimization module (denoted as GT-ǫ and GT, respectively). We also compare with AFIA [23] (denoted as AFIA) and GARNET [54] with and without its ǫ memory cleaning process (denoted as GRN-ǫ and GRN, respectively). The GARNET ǫ-cleaning process is similar to GeoTrend ǫ-cleaning.

Figures 5.4(a) and 5.4(b) depict the memory usage for different values of k and ǫ, respectively. For different values of k (Figure 5.4(a)), only AFIA's memory usage increases significantly while the rest of the techniques exhibit relatively stable memory usage. The highest AFIA memory usage (at k=1000) is around 24GB, excluding programming language overhead. This large overhead has two causes. With increasing k, the number of items in the archived dynamic summaries increases significantly and hence consumes more memory. In addition, these dynamic summaries are replicated at multiple resolutions over both the spatial and temporal dimensions per its index structure, which amplifies the effect of increasing k and leads to high memory consumption. GT and GRN-ǫ consume nearly 40% of AFIA's memory, while GT-ǫ improves on this significantly and consumes less than 10% of AFIA's memory. The amount of memory saving actually changes with different ǫ values, as Figure 5.4(b) shows.

Figure 5.4: Impact of the GeoTrend memory optimization module. (a) Memory usage vs. k. (b) Memory usage vs. ǫ. (c) Query accuracy. (d) Supported arrival rate.

The figure shows that the memory usage of GT-ǫ decreases dramatically with increasing ǫ as more keywords are removed from all index cells. However, GRN-ǫ consumes relatively high memory due to the large number of cells it maintains. Also, the value of ǫ does not have a significant effect on its memory overhead: its spatial cells are much smaller, so each cell receives far fewer keywords and ǫ removes a relatively stable amount of keywords.

The effect of reducing memory overhead on query accuracy and on the supported arrival rates of incoming microblogs is shown in Figures 5.4(c) and 5.4(d). AFIA is not included in the query accuracy comparison as it supports only top-k frequent queries and cannot accommodate our trending measure. For all evaluated values of ǫ (up to 0.01), query accuracy exceeds 90% for both GT-ǫ and GRN-ǫ. In Figure 5.4(d), GRN-ǫ supports the highest arrival rate due to its simple index structure (a one-level grid index) while AFIA supports the lowest arrival rates due to its cell replication over both the spatial and temporal dimensions. The GeoTrend alternatives come in between and can still support up to 50K microblogs/second, which is an order of magnitude higher than the current Twitter rate.

Figure 5.5: Impact of keyword replication across GeoTrend index levels. (a) Query latency vs. k. (b) Query latency vs. R. (c) Supported arrival rate. (d) Memory overhead.

5.5.2 Evaluation of Trending Query Processing

This section evaluates the GeoTrend index design decisions that affect query processing. Particularly, we evaluate the effect of replicating keywords across index levels and the effect of maintaining a top-k list inside each index cell.

Keyword Replication

In this section, we evaluate the replication of keywords in all GeoTrend index levels. To this end, we compare the pyramid index with a partial quad-tree index [104] that has a similar cell structure to the pyramid, yet maintains keywords only in leaf cells (denoted as GT-QT). The two indexing structures favor different objectives: (1) The pyramid index maintains keyword aggregates in all leaf and non-leaf cells, increasing both memory and insertion overhead, but its query processor accesses far fewer cells, from higher levels, to compute the final answer. (2) The quad-tree index maintains keyword aggregates only in leaf cells, reducing both memory and insertion overhead, but increasing query latency as the query processor accesses many cells to compute the final answer. The experimental results show that the quad-tree cannot provide low query latency although it has much lower memory and insertion overhead.

Figure 5.5 denotes the pyramid index, with and without ǫ-cleaning, as GT and GT-ǫ, the quad-tree as GT-QT, and GARNET as GRN-ǫ, excluding AFIA from query evaluation due to its different aggregate measure. Figures 5.5(a) and 5.5(b) show one to three orders of magnitude better query latency for GT and GT-ǫ than GT-QT and GRN-ǫ with varying answer size k and query region area R, respectively. GT and GT-ǫ consistently outperform both GT-QT and GRN-ǫ for different k values. However, with changing values of the area R, the improvement ratio changes: For small values of R, all indexes have almost the same average query latency as the number of processed cells is similar or close. When R increases, GT and GT-ǫ use far fewer cells than both GT-QT and GRN-ǫ, as they have a chance to use larger non-leaf cells contained in R, and therefore they give much lower query latency. This lower query latency comes at the cost of higher insertion overhead and a larger memory footprint than GT-QT. Figures 5.5(c) and 5.5(d) show that this is a favorable trade-off with affordable indexing overhead and memory footprint. For different values of k, GT and GT-ǫ still support an arrival rate up to an order of magnitude higher than the Twitter rate. Furthermore, GT-ǫ incurs only around three times the memory overhead of GT-QT. On the contrary, GRN-ǫ still has a high memory footprint due to the large number of cells in its finely divided, high-resolution grid index. This shows the effectiveness of the GeoTrend design decisions in providing an excellent compromise between memory overhead and query latency.

Figure 5.6: Impact of maintaining top-k lists. (a) Query latency. (b) Query accuracy. (c) Supported arrival rate.

Materializing Top-k Lists

The query answer can be computed either by using all keywords within the cell, which are expected to be numerous, or by exploiting only the top-k items in each cell as introduced earlier in this chapter. We show that maintaining these lists reduces query latency significantly, at the cost of an acceptable overhead to store and maintain the sorted lists, while continuously inserting new keywords and deleting old information, and an acceptable reduction in query accuracy. In this section we evaluate the effect of maintaining top-k lists on query latency, query accuracy, and insertion overhead, excluding the memory overhead effect as the storage of the top-k lists is negligible compared to the storage of all keywords in the cell. The experimental results show two orders of magnitude improvement in query latency with a sublinear increase in insertion overhead.

Figure 5.6 compares GeoTrend (denoted as GT), AFIA (denoted as AFIA), and GARNET (denoted as GRN), with and without maintaining top-k lists (denoted with suffixes K and NK, respectively). Note that AFIA has only the top-k option as this is the only data structure maintained in its index cells. It is also excluded from the query measures as it supports only top-k frequent queries and cannot accommodate our trending measure. Figure 5.6(a) depicts the query latency of all alternatives for different k values. We observe that maintaining the top-k lists reduces the query latency of the GeoTrend alternatives from 850 msec for all values of k to between 1 and 3 msec, which is a two orders of magnitude reduction. GRN query latency is consistently much higher than GeoTrend for two reasons. First, the large number of cells processed from its finely divided grid index, compared to the cells of the high levels of the GeoTrend index, which are much fewer in number. Such an inefficient division of the space results from GRN's generic framework for any context: it is not tailored for location queries and thus cannot make maximum use of the spatial properties of the data. Second, GRN computes its aggregate measures from different temporal cells, as it is originally proposed and optimized for arbitrary temporal periods, which increases the aggregation time. Figure 5.6(b) shows that for k ≥ 100, aggregating from top-k lists provides at least 90% accuracy, providing empirical evidence for the effectiveness of using top-k lists with an acceptable accuracy loss. Figure 5.6(c) shows the overhead of maintaining the top-k lists on index insertion. AFIA still has the lowest arrival rate for the same reasons detailed before. For GeoTrend and GARNET, the significant reduction in query latency comes at the cost of a 50% reduction in the supported arrival rate. For the worst case (k=1000) in Figure 5.6(c), the GeoTrend index supports at least 40,000 microblogs/sec, which is seven times the current Twitter rate.

Chapter 6

Related Work

The richness of microblogs data has motivated developers and data practitioners worldwide to take advantage of microblogs to support a wide variety of practical applications, including social media analysis [11], discovering health-related issues [85], real-time news delivery [3], rescue services [4], and geo-targeted advertising [12]. The distinguishing nature of microblogs data, including large data sizes and high velocity, has motivated researchers to develop new data management techniques to support microblogs analysis and visualization at scale.

This chapter provides a comprehensive review of almost all existing techniques and systems for microblogs data management, analysis, and visualization since the inception of Twitter in 2006. Figure 6.1 depicts a summary of the techniques and systems that support microblogs data analysis and visualization in a timeline format. The horizontal axis in Figure 6.1 represents the year of publication or system release for each technique/system, while the vertical axis represents the research topic. The techniques are classified into three categories: (1) techniques that deal with real-time data, i.e., very recent data, depicted by a filled black circle, (2) techniques that deal with historical data, depicted by a blank circle, and (3) techniques that deal with both real-time and historical data, depicted by a blank triangle. The rest of this chapter divides the related work into four parts: (1) microblogs data analysis, (2) microblogs data management, (3) microblogs data visualization, and (4) microblogs systems, each outlined in the following sections.


Figure 6.1: Microblogs Research Timeline.

6.1 Microblogs Data Analysis

The literature on microblogs data analysis is depicted in the first five rows of Figure 6.1. The five rows correspond to five major areas of analysis on microblogs, each briefly outlined below.

Event detection and analysis [35, 36, 32, 37, 9, 11]: This work exploits the fact that microblogs users post many updates on ongoing events. Such updates are identified, grouped, and analyzed to discover events in real time [35, 32] or to analyze long-term events [36, 37, 9, 11], e.g., elections. Event detection in the literature can be categorized into two main types: (1) Detecting arbitrary events: in this type, the event detection module has no prior idea about the events to be detected. Thus, it employs a pipeline of four operations. First, incoming microblogs go through a feature extraction module that extracts spatial, temporal, and keyword features from the input data. Second, the preprocessed data goes to a grouping module that groups similar data together. This grouping could be clustering, lexical matching, a latent variable model, or graph partitioning. Third, the data groups are scored, based on either keyword correlation or spatial-keyword pairs, to distinguish actual events from noisy data. Finally, the highest-scored events are visualized to end users. (2) Detecting specific types of events: another line of work detects certain types of events, e.g., earthquakes, crimes, or festivals. This work employs a pipeline of operations similar to the first type in all steps except the second, where grouping is replaced by classification. The classifier takes information about the type of events to be discovered, e.g., keywords that identify crimes or earthquakes, and classifies the preprocessed data into the predefined classes of events. Event analysis usually goes through a simpler pipeline of operations. The pipeline has a filtering module that takes an event's features and filters incoming data based on them to extract the event's microblogs. Then, the filtered microblogs are forwarded to a visualization and analysis module that presents them to end users in various forms.

Recommendation [38, 29, 33]: This work exploits microblogs user-generated content as a means of capturing user preferences. In fact, recommendation work on microblogs is diverse and cannot be outlined in a single framework. However, the common theme is using microblogs, as a major source of user-generated data, to capture user preferences. The tasks vary from recommending users of similar interests [29] to recommending news to read [33], authority users to follow [38], products to buy, tweets to read, and events to attend.

Sentiment and semantic analysis [39, 40, 41]: This work quantifies positive and negative opinions in social media discussions as a pre-processing step for other analysis tasks, e.g., product reviews or election candidate surveying. Sentiment analysis on microblogs can be effectively performed using traditional techniques and achieves even better results than on long documents, e.g., articles. The reason is that microblogs are brief, which makes detecting positive or negative sentiment easier for automated algorithms. On the contrary, traditional semantic analysis techniques perform poorly on microblogs. The reason is that microblogs come with short text and many abbreviations, which makes it challenging to identify entities, persons, locations, etc.
To overcome this limitation, new techniques have been proposed that use machine learning modules to analyze the semantics of short microblogs data. Both classification and clustering are used along with linguistic feature extraction that helps to detect concepts in the input data. The new techniques achieve significant improvements in precision and recall.

User analysis [29, 42, 18, 43]: This work is mainly interested in analyzing user information to identify top influential users in certain regions or topics [42, 18, 43], or to discover users with similar interests [29]. Such users, or groups of users, can be used in several scenarios, including posting ads and enhancing their social graph. The common framework of this work goes through two main steps. First, user features are extracted, such as temporal behavior, friend connections, and top keywords. Then, the user features are used to model or index users in a way that serves the incoming queries, e.g., user recommendation queries or top influential users queries.

Automatic geotagging [105, 106, 107, 108]: This work tries to attach more location information to microblogs data based on analyzing their contents. This is mainly motivated by the small percentage of geotagged microblogs (only 2% of tweets), which conflicts with the needs of the many location-aware applications that can be built on top of microblogs [22]. The majority of work in this area goes through two steps: identifying entities and keywords and then classifying them to obtain the microblog location. However, this work provides low precision for estimating exact locations (within 100 meters error) and high precision for estimating approximate locations (within 30-100 km). A state-of-the-art technique [106] overcomes this problem and achieves high precision for exact location estimation. This technique extracts top-k locations from both the microblog and its posting user's profile. Then, both sets of locations go through a location refinement process that determines a final set of potential locations to attach to the microblog.

Other analysis tasks have been addressed on microblogs data in both the academic community, e.g., news extraction [33, 34], topic extraction [44, 31], and automatic geotagging [105, 106, 107, 108], and the industrial community, e.g., geo-targeted advertising [12] and generic social media analysis [10, 45].

6.2 Microblogs Data Management

Microblogs data management literature is depicted in the sixth to eighth rows of Figure 6.1. We categorize this literature into three main areas, briefly outlined below.

Query languages [19, 28]: This work provides generic query languages that support SQL-like queries on top of microblogs. This facilitates basic Select-Project-Join operators that are able to support a variety of queries, either on top of Twitter APIs [28] or in core systems that support microblogs data management [19].

Indexing and query processing: This work includes various indexing techniques that have been proposed to index incoming microblogs either in memory [20, 21, 18, 22, 23, 24, 25] or on disk [26, 18]. This includes keyword search based on temporal ranking [21, 26], single-attribute search based on generic ranking functions [24], spatial-aware search that exploits location information in microblogs [22], personalized social-aware search that exploits the social graph and produces user-specific search results [27], and aggregate queries [20, 23] that find frequent keywords and correlated locations instead of individual microblog items. Each search technique comes with its own index and query processor, and tackles certain limitations of its preceding techniques.

Main-memory management [30, 22, 23]: This work includes techniques that optimize main-memory consumption and utilization. Almost all recent microblogs indexing and query processing techniques depend on main-memory to host microblogs in their index structures. Thus, some techniques are equipped with main-memory management such that memory resources are efficiently utilized, either for aggregate queries [23] or for basic search queries that retrieve individual data items [30, 22].

This work is directly related to our work in this thesis. Our presented work contributes to all three areas of microblogs data management. Our work has three main aspects that distinguish it from other related work in the literature: (1) We give special attention to supporting spatial queries on microblogs and motivate their prime importance in microblogs data analysis and applications. (2) We give special attention to main-memory management and propose several effective techniques to efficiently utilize memory resources and improve answering incoming queries from main-memory data. (3) We propose a holistic system with all the necessary components to manage microblogs data at scale.

6.3 Microblogs Data Visualization

Microblogs data visualization literature is depicted in the ninth to eleventh rows of Figure 6.1. Existing microblogs visualization techniques fall into three main categories, briefly outlined below.

Aggregation-based visualization [46, 49, 109]: This work overcomes the difficulty of visualizing tremendous amounts of data by showing only aggregate information that summarizes the large amounts of microblogs data and its contents. Such aggregation is application-dependent and is usually performed on major attributes, like temporal aggregation [109], spatial aggregation [46, 109], or keyword aggregation [49].

Sampling-based visualization [18, 34]: This work selects only a sample of microblogs and visualizes it to users. The sample is of interest to the application and can be selected based on some relevance criteria, e.g., tweets with news stories [34], or based on interest in certain attribute(s) [18].

Hybrid visualization [110, 9]: This work combines both aggregation and sampling in visualizing microblogs of interest. The two steps are either used simultaneously when the interest is in certain data items, e.g., analyzing certain events [9], or used sequentially [110, 111] with the purpose of bounding the number of microblogs to visualize.

6.4 Microblogs Systems

New big data systems have been invented and extended to address the challenges of managing microblogs data [67, 21, 69, 18, 112]. A sample of these systems is depicted in the last two rows of Figure 6.1. We next highlight the data management technologies of the following four systems.

Twitter [21, 69]: Twitter is the most famous microblogging service provider worldwide. Every now and then, Twitter reveals technical details about its underlying system components [21, 69]. The fact that Twitter spends major effort on developing new data management technology clearly shows that existing data management systems have major limitations in supporting microblogs at scale. In one of its recent papers [69], Twitter discussed in detail the challenges of supporting high-velocity data in large-volume-oriented systems, through the case of real-time query suggestion in Twitter keyword search. The conclusion was that big data systems need to radically consider fast data in their design so that all system components are equipped to handle fast data at scale.

AsterixDB [67]: AsterixDB is a generic big data management system that can support various data sources. Recently, AsterixDB has extended its components to support fast data [68], e.g., microblogs, natively in the system. Thus, AsterixDB is suitable for managing microblogs data. A main limitation of AsterixDB is its complete dependence on disk-resident index structures. Any data ingested into AsterixDB has to be disk-resident before it appears in any search results. This raises questions regarding its suitability for memory-intensive applications like microblogs applications. The recent support for data feed ingestion in AsterixDB makes it more suitable for fast data streams. Each LSM index structure in AsterixDB has two components, a memory component and several disk components. Yet, the memory component is used as a large buffer to reduce the cost of indexing large volumes of data.

VoltDB [112]: VoltDB is an emerging big data management system that is mainly optimized for database transactions on fast data. VoltDB identifies four major sources of overhead in transaction management: multi-threading overhead, buffer management overhead, locking overhead (concurrency control), and logging overhead. The system addresses each of these overheads to provide scalable transaction management on fast data in real time. The first three sources of overhead are totally eliminated through innovative techniques that provide pure in-memory data management without any concurrent processing. The logging to disk is performed based on data images, i.e., logging a copy of the data instead of logging the individual operations, on a lazy basis so that it reduces the disk write overhead to a minimum level.

Taghreed [18]: Taghreed is the first end-to-end holistic system to support microblogs data management. It mainly supports spatio-temporal keyword queries on both very recent data and historical data. Taghreed is considered an earlier version of the Kite system, with a focus only on spatio-temporal data processing without supporting queries on arbitrary attributes of microblogs. Taghreed is patented and commercialized in a startup company that provides social media analysis services, and recently became a Twitter partner with access to the whole Twitter data since its inception. Taghreed consists of a preprocessor, an indexer, a query engine, and an interactive visualizer. The indexer has an in-memory indexer and an in-disk indexer.
It also encapsulates several flushing policies to deal with different query workloads. The query engine mainly handles three predicates, temporal, spatial, and keyword, to retrieve all individual microblogs that satisfy these predicates.

Chapter 7

Conclusions

This thesis has proposed Kite, a scalable system for managing microblogs data, e.g., tweets, comments, online reviews, and check-ins. Kite encapsulates techniques for indexing, main-memory management, and query processing for various queries on microblogs data. Kite works as a scalable backend for microblogs applications so that developers can exploit microblogs data through SQL-like queries without worrying about the internal data management issues. The thesis has presented a holistic approach to manage microblogs data at scale, including managing recent data in main-memory and older data on disk, through temporal, spatial, and keyword index structures. The in-memory data management infrastructure is equipped with smart memory management techniques that efficiently utilize main-memory resources to answer many of the incoming queries from in-memory contents. Both in-memory and in-disk data are exploited through a query processor that employs efficient pruning techniques to provide final query answers with low query latency. In particular, the thesis chapters have introduced the following.

Chapter 1 has introduced microblogs data applications and their important use cases. In addition, it gives an overview of the Kite system, its supported queries, and its query language.

Chapter 2 has presented the indexing details of the Kite system. Kite provides index structures on arbitrary microblogs attributes, promoting temporal, spatial, and keyword attributes as first-class attributes. In fact, Kite provides both spatial and keyword index structures, where both embed the temporal aspect in their data organization.

Besides, each of Kite's index structures has an in-memory component and an in-disk component. The in-memory indexes are designed to digest data with high arrival rates in real time, while the in-disk index structures are designed to organize data of arbitrarily large temporal periods to support efficient queries on them. Kite index structures have been shown to digest high arrival rates of incoming data in real time, an order of magnitude higher than the Twitter rate.

Chapter 3 has presented main-memory management techniques that are used to tune the memory contents of the in-memory index structures so that Kite can efficiently utilize the available memory resources in the system and work in tight memory environments as well. The presented memory management techniques span both the spatial and hash index structures that are used to index spatial and non-spatial attributes, respectively. The proposed memory management techniques have shown their effectiveness in increasing the memory hit ratio so that many more queries can retrieve their entire answers from main-memory. They have also been shown to boost the index performance in tight memory environments without sacrificing query accuracy.

Chapter 4 has presented query processing in the Kite system. As all queries on microblogs are top-k, the Kite query processor exploits the top-k ranking function to prune the search space early and provide the final answer without exhaustively scanning all microblogs, which is always very expensive. The proposed techniques have shown an order of magnitude better performance compared to traditional techniques.

Chapter 5 has presented trend discovery techniques in the Kite system. The chapter has shown indexing, main-memory management, and query processing for the trending queries in Kite. The supported trending queries mainly include spatial trending queries that find trending topics in arbitrary spatial regions from real-time data. The presented techniques have shown scalable indexing for real-time data with efficient query processing and a low memory footprint.

Chapter 6 has presented the literature on microblogs data management, analysis, visualization, and systems. The chapter draws the map of microblogs data research, shows the position of Kite within the data management literature, and highlights the distinction of Kite's contributions compared to existing techniques.

References

[1] Twitter Statistics. https://about.twitter.com/company, 2015.

[2] Facebook Statistics. http://newsroom.fb.com/company-info/, 2015.

[3] After Boston Explosions, People Rush to Twitter for Breaking News. http://www.latimes.com/business/technology/la-fi-tn-after-boston-explosions-people-rush-to-twitter-for-breaking-news-20130415,0,3729783.story, 2013.

[4] Sina Weibo, China Twitter, comes to rescue amid flooding in Beijing. http://thenextweb.com/asia/2012/07/23/sina-weibo-chinas-twitter-comes-to-rescue-amid-flooding-in-beijing/, 2012.

[5] A Nobel Peace Prize for Twitter? www.csmonitor.com/Commentary/Opinion/2009/0706/p09s02-coop.html, 2009.

[6] Twitter a Big Winner in 2012 Presidential Election. www.computerworld.com/article/2493332/social-media/twitter-a-big-winner-in-2012-presidential-election.html, 2012.

[7] New Study Quantifies Use of Social Media in Arab Spring. www.washington.edu/news/2011/09/12/new-study-quantifies-use-of-social-media-in-arab-spring/, 2011.

[8] The Twitter War: Social Media’s Role in Ukraine Unrest. news.nationalgeographic.com/news/2014/05/140510-ukraine-odessa-russia-kiev-twitter-world/, 2014.

[9] TweetTracker: track, analyze, and understand activity on Twitter. tweettracker.fulton.asu.edu/, 2014.

[10] Topsy Analytics: Find the insights that matter. www.topsy.com, 2014.

[11] Topsy Analytics for Twitter Political Index. topsylabs.com/election/.

[12] New Enhanced Geo-targeting for Marketers. https://blog.twitter.com/2012/new-enhanced-geo-targeting-for-marketers.

[13] Jimmy Lin and Alek Kolcz. Large-scale machine learning at twitter. In SIGMOD, 2012.

[14] Anne Archambault and Jonathan Grudin. A Longitudinal Study of Facebook, LinkedIn, & Twitter Use. In CHI, 2012.

[15] Tyler McCormick, Hedwig Lee, Nina Cesare, and Ali Shojaie. Using Twitter for Demographic and Social Science Research: Tools for Data Collection. Sociological Methods and Research, 2013.

[16] Katherine C. Chretien, S. Ryan Greysen, Jean-Paul Chretien, and Terry Kind. Online Posting of Unprofessional Content by Medical Students. The Journal of the American Medical Association, 2009.

[17] Amr Magdy and Mohamed Mokbel. Demonstration of Kite: A Scalable System for Microblogs Data Management. In ICDE, 2017.

[18] Amr Magdy, Louai Alarabi, Saif Al-Harthi, Mashaal Musleh, Thanaa Ghanem, Sohaib Ghani, and Mohamed Mokbel. Taghreed: A System for Querying, Analyzing, and Visualizing Geotagged Microblogs. In SIGSPATIAL, 2014.

[19] Amr Magdy and Mohamed Mokbel. Towards a Microblogs Data Management System. In MDM, 2015.

[20] Ceren Budak, Theodore Georgiou, Divyakant Agrawal, and Amr El Abbadi. GeoScope: Online Detection of Geo-Correlated Information Trends in Social Networks. In VLDB, 2014.

[21] Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. Earlybird: Real-Time Search at Twitter. In ICDE, 2012.

[22] Amr Magdy, Mohamed F. Mokbel, Sameh Elnikety, Suman Nath, and Yuxiong He. Mercury: A Memory-Constrained Spatio-temporal Real-time Search on Microblogs. In ICDE, 2014.

[23] Anders Skovsgaard, Darius Sidlauskas, and Christian S. Jensen. Scalable Top-k Spatio-temporal Term Querying. In ICDE, 2014.

[24] Lingkun Wu, Wenqing Lin, Xiaokui Xiao, and Yabo Xu. LSII: An Indexing Structure for Exact Real-Time Search on Microblogs. In ICDE, 2013.

[25] Junjie Yao, Bin Cui, Zijun Xue, and Qingyun Liu. Provenance-based Indexing Support in Micro-blog Platforms. In ICDE, 2012.

[26] Chun Chen, Feng Li, Beng Chin Ooi, and Sai Wu. TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets. In SIGMOD, 2011.

[27] Yuchen Li, Zhifeng Bao, Guoliang Li, and Kian-Lee Tan. Real Time Personalized Search on Social Networks. In ICDE, 2015.

[28] Adam Marcus, Michael S. Bernstein, Osama Badar, David R. Karger, Samuel Madden, and Robert C. Miller. Tweets as Data: Demonstration of TweeQL and TwitInfo. In SIGMOD, 2011.

[29] John Hannon, Mike Bennett, and Barry Smyth. Recommending Twitter Users to Follow Using Content and Collaborative Filtering Approaches. In RecSys, 2010.

[30] Amr Magdy, Rami Alghamdi, and Mohamed F. Mokbel. On Main-memory Flushing in Microblogs Data Management Systems. In ICDE, 2016.

[31] Daniel Ramage, Susan T. Dumais, and Daniel J. Liebling. Characterizing Microblogs with Topic Models. In ICWSM, 2010.

[32] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake Shakes Twitter Users: Real-Time Event Detection by Social Sensors. In WWW, 2010.

[33] Owen Phelan, Kevin McCarthy, and Barry Smyth. Using Twitter to Recommend Real-Time Topical News. In RecSys, 2009.

[34] Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman, and Jon Sperling. TwitterStand: News in Tweets. In GIS, 2009.

[35] Hamed Abdelhaq, Christian Sengstock, and Michael Gertz. EvenTweet: Online Localized Event Detection from Twitter. In VLDB, 2013.

[36] Wei Feng, Jiawei Han, Jianyong Wang, Charu Aggarwal, and Jianbin Huang. STREAMCUBE: Hierarchical Spatio-temporal Hashtag Clustering for Event Exploration Over the Twitter Stream. In ICDE, 2015.

[37] Vivek K. Singh, Mingyan Gao, and Ramesh Jain. Situation Detection and Control using Spatio-temporal Analysis of Microblogs. In WWW, 2010.

[38] Caleb Chen Cao, Jieying She, Yongxin Tong, and Lei Chen. Whom to Ask? Jury Selection for Decision Making Tasks on Micro-blog Services. PVLDB, 5(11), 2012.

[39] Adam Bermingham and Alan F. Smeaton. Classifying Sentiment in Microblogs: Is Brevity an Advantage? In CIKM, 2010.

[40] Edgar Meij, Wouter Weerkamp, and Maarten de Rijke. Adding Semantics to Microblog Posts. In WSDM, 2012.

[41] Gilad Mishne and Jimmy Lin. Twanchor Text: A Preliminary Study of the Value of Tweets as Anchor Text. In SIGIR, 2012.

[42] Jinling Jiang, Hua Lu, Bin Yang, and Bin Cui. Finding Top-k Local Users in Geo-Tagged Social Media Data. In ICDE, 2015.

[43] Norases Vesdapunt and Hector Garcia-Molina. Identifying Users in Social Networks with Limited Information. In ICDE, 2015.

[44] Liangjie Hong, Amr Ahmed, Siva Gurumurthy, Alexander J. Smola, and Kostas Tsioutsiouliklis. Discovering Geographical Topics In The Twitter Stream. In WWW, 2012.

[45] Junzhou Zhao, John C.S. Lui, Don Towsley, Pinghui Wang, and Xiaohong Guan. Sampling Design on Hybrid Social-Affiliation Networks. In ICDE, 2015.

[46] Thanaa Ghanem, Amr Magdy, Mashaal Musleh, Sohaib Ghani, and Mohamed Mokbel. VisCAT: Spatio-Temporal Visualization and Aggregation of Categorical Attributes in Twitter Data. In SIGSPATIAL, 2014.

[47] Harvard Tweet Map. worldmap.harvard.edu/tweetmap/, 2013.

[48] Amr Magdy, Louai Alarabi, Saif Al-Harthi, Mashaal Musleh, Thanaa Ghanem, Sohaib Ghani, Saleh Basalamah, and Mohamed Mokbel. Demonstration of Taghreed: A System for Querying, Analyzing, and Visualizing Geotagged Microblogs. In ICDE, 2015.

[49] Adam Marcus, Michael S. Bernstein, Osama Badar, David R. Karger, Samuel Madden, and Robert C. Miller. Twitinfo: Aggregating and Visualizing Microblogs for Event Exploration. In CHI, 2011.

[50] Amr Magdy, Rami Alghamdi, and Mohamed Mokbel. On Main-memory Flushing in Microblogs Data Management Systems. In ICDE, 2016.

[51] Amr Magdy, Ahmed M. Aly, Mohamed F. Mokbel, Sameh Elnikety, Yuxiong He, and Suman Nath. Mars: Real-time Spatio-temporal Queries on Microblogs. In ICDE, 2014.

[52] Amr Magdy, Mohamed F. Mokbel, Sameh Elnikety, Suman Nath, and Yuxiong He. Venus: Scalable Real-time Spatial Queries on Microblogs with Adaptive Load Shedding. TKDE, 2016.

[53] Amr Magdy and Mohamed Mokbel. Microblogs Data Management Systems: Querying, Analysis, and Visualization (Tutorial). In SIGMOD, 2016.

[54] Christopher Jonathan, Amr Magdy, Mohamed Mokbel, and Albert Jonathan. GARNET: A Holistic System Approach for Trending Queries in Microblogs. In ICDE, 2016.

[55] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems (3rd ed.). McGraw-Hill, 2003.

[56] Daniel J. Abadi, Donald Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stanley B. Zdonik. Aurora: A New Model and Architecture for Data Stream Management. VLDB Journal, 12(2):120–139, 2003.

[57] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Çetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stanley B. Zdonik. The Design of the Borealis Stream Processing Engine. In CIDR, pages 277–289, 2005.

[58] Moustafa A. Hammad, Mohamed F. Mokbel, Mohamed H. Ali, Walid G. Aref, Ann Christine Catlin, Ahmed K. Elmagarmid, Mohamed Y. Eltabakh, Mohamed G. Elfeky, Thanaa M. Ghanem, Robert Gwadera, Ihab F. Ilyas, Mirette S. Marzouk, and Xiaopeng Xiong. Nile: A Query Processing Engine for Data Streams. In ICDE, page 851, 2004.

[59] Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Keith Ito, Itaru Nishizawa, Justin Rosenstein, and Jennifer Widom. STREAM: The Stanford Stream Data Manager. In SIGMOD, page 665, 2003.

[60] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel Madden, Frederick Reiss, and Mehul A. Shah. TelegraphCQ: Continuous Dataflow Processing. In SIGMOD, page 668, 2003.

[61] Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. Trill: A High-Performance Incremental Query Processor for Diverse Analytics. In VLDB, 2015.

[62] Apache Storm. https://storm.apache.org/, 2014.

[63] Mohamed H. Ali, Badrish Chandramouli, Balan Sethu Raman, and Ed Katibah. Spatio-Temporal Stream Processing in Microsoft StreamInsight. IEEE Data Engineering Bulletin, 33(2):69–74, 2010.

[64] Henrique Andrade, Bugra Gedik, Kun-Lung Wu, and Philip S. Yu. Processing high data rate streams in System S. Journal of Parallel and Distributed Computing, 71(2):145–156, 2011.

[65] Charles D. Cranor, Theodore Johnson, Oliver Spatscheck, and Vladislav Shkapenyuk. The Gigascope Stream Database. IEEE Data Engineering Bulletin, 26(1):27–32, 2003.

[66] Apache Spark. https://spark.apache.org/, 2014.

[67] Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak R. Borkar, Yingyi Bu, Michael J. Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis Tsotras, Rares Vernica, Jian Wen, and Till Westmann. AsterixDB: A Scalable, Open Source BDMS. PVLDB, 7(14), 2014.

[68] Raman Grover and Michael Carey. Data Ingestion in AsterixDB. In EDBT, 2015.

[69] Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, and Jimmy Lin. Fast Data in the Era of Big Data: Twitter’s Real-time Related Query Suggestion Architecture. In SIGMOD, 2013.

[70] Amr Magdy, Mashaal Musleh, Kareem Tarek, Louai Alarabi, Saif Al-Harthi, Hicham G. Elmongui, Thanaa M. Ghanem, Sohaib Ghani, and Mohamed F. Mokbel. Taqreer: A System for Spatio-temporal Analysis on Microblogs. IEEE Data Engineering Bulletin, 38(2), 2015.

[71] Mohamed Mokbel and Amr Magdy. Towards a Microblogs Data Management System, Provisionally filed in U.S. Patent and Trademark Office on September 10, 2014, Ref number: 434819US8, 09 2014.

[72] Manolis Koubarakis, Timos Sellis, Andrew U. Frank, Stéphane Grumbach, Ralf Hartmut Güting, Christian S. Jensen, and Nikos Lorentzos. Spatio-Temporal Databases: The CHOROCHRONOS Approach. Springer, 2003.

[73] Shashi Shekhar and Sanjay Chawla. Spatial Databases: A Tour. Prentice Hall, 2003.

[74] Walid G. Aref and Hanan Samet. Efficient Processing of Window Queries in the Pyramid Data Structure. In PODS, 1990.

[75] Mohamed Mokbel, Ming Lu, and Walid G. Aref. Hash-Merge Join: A Non-blocking Join Algorithm for Producing Fast and Early Join Results. In ICDE, 2004.

[76] Jimmy Lin and Gilad Mishne. A Study of ”Churn” in Tweets and Real-Time Search Queries. In ICWSM, 2012.

[77] Wolfgang Effelsberg and Theo Härder. Principles of Database Buffer Management. TODS, 9(4):560–595, 1984.

[78] Justin DeBrabant, Andrew Pavlo, Stephen Tu, Michael Stonebraker, and Stanley B. Zdonik. Anti-Caching: A New Approach to Database Management System Architecture. In VLDB, 2013.

[79] Justin J. Levandoski, Per-Åke Larson, and Radu Stoica. Identifying Hot and Cold Data in Main-memory Databases. In ICDE, 2013.

[80] Hao Zhang, Gang Chen, Beng Chin Ooi, Weng-Fai Wong, Shensen Wu, and Yubin Xia. ”Anti-Caching”-based Elastic Memory Management for Big Data. In ICDE, 2015.

[81] Brian Babcock, Mayur Datar, and Rajeev Motwani. Load Shedding for Aggregation Queries Over Data Streams. In ICDE, pages 350–361, 2004.

[82] Bugra Gedik, Kun-Lung Wu, Philip S. Yu, and Ling Liu. A Load Shedding Framework and Optimizations for M-way Windowed Stream Joins. In ICDE, pages 536–545, 2007.

[83] Yeye He, Siddharth Barman, and Jeffrey F. Naughton. On Load Shedding in Complex Event Processing. In ICDT, pages 213–224, 2014.

[84] The Engineering Behind Twitter New Search Experience. blog.twitter.com/2011/engineering-behind-twitter%E2%80%99s-new-search-experience, 2011.

[85] Public Health Emergency, Department of Health and Human Services. http://nowtrending.hhs.gov/, 2015.

[86] Social media ’outstrips TV’ as news source for young people. http://www.bbc.com/news/uk-36528256, 2016.

[87] Twitter Location Trends. https://support.twitter.com/articles/101125#Trend Location.

[88] Rui Li, Kin Hou Lei, Ravi Khadiwala, and Kevin Chen-Chuan Chang. TEDAS: A Twitter-based Event Detection and Analysis System. In ICDE, 2012.

[89] Arvind Arasu and Gurmeet Singh Manku. Approximate Counts and Quantiles over Sliding Windows. In PODS, 2004.

[90] Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining Stream Statistics over Sliding Windows (extended abstract). In SODA, 2002.

[91] Lukasz Golab, David DeHaan, Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Identifying Frequent Items in Sliding Windows over On-line Packet Streams. In Internet Measurement Conference, 2003.

[92] Lap-Kei Lee and H. F. Ting. A Simpler and More Efficient Deterministic Scheme for Finding Frequent Items over Sliding Windows. In PODS, 2006.

[93] Michael Mathioudakis and Nick Koudas. TwitterMonitor: Trend Detection over the Twitter Stream. In SIGMOD, 2010.

[94] Trends 24. http://trends24.in.

[95] Yun Chi, Belle L. Tseng, and Jun'ichi Tatemura. Eigen-Trend: Trend Analysis in the Blogosphere Based on Singular Value Decompositions. In CIKM, pages 68–77, 2006.

[96] Piotr Indyk, Nick Koudas, and S. Muthukrishnan. Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. In VLDB, pages 363–372, 2000.

[97] John Francis Kenney and Ernest Sydney Keeping. Mathematics of Statistics, Part 1, chapter 15, pages 252–285. van Nostrand, 3rd edition, 1962.

[98] Evandro Cunha, Gabriel Magno, Giovanni Comarela, Virgilio Almeida, Marcos André Gonçalves, and Fabrício Benevenuto. Analyzing the Dynamic Evolution of Hashtags on Twitter: a Language-Based Approach. In Proceedings of the Workshop on Languages in Social Media, pages 58–65, 2011.

[99] Huiji Gao, Jiliang Tang, and Huan Liu. Exploring Social-Historical Ties on Location-Based Social Networks. In The 6th Intl. AAAI Conf. on Weblogs and Social Media, 2012.

[100] Suman Nath, Felix Lin, Lenin Ravindranath, and Jitu Padhye. SmartAds: Bringing Contextual Ads to Mobile Apps. In ACM MobiSys, 2013.

[101] Khanh Nguyen and Duc A. Tran. An analysis of activities in Facebook. In IEEE Consumer Communications and Networking Conference (CCNC), 2011.

[102] Gurmeet Singh Manku and Rajeev Motwani. Approximate Frequency Counts over Data Streams. In VLDB, 2002.

[103] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.

[104] R. A. Finkel and J. L. Bentley. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica, 4(1), 1974.

[105] Yohei Ikawa, Miki Enoki, and Michiaki Tatsubori. Location Inference Using Microblog Messages. In WWW, 2012.

[106] Guoliang Li, Jun Hu, Jianhua Feng, and Kian-Lee Tan. Effective Location Identification from Microblogs. In ICDE, 2014.

[107] Christian Safran, Victor Manuel García-Barrios, and Martin Ebner. The Benefits of Geo-Tagging and Microblogging in m-Learning: a Use Case. In MindTrek, 2009.

[108] Rong Zhang, Xiaofeng He, Aoying Zhou, and Chaofeng Sha. Identification of Key Locations based on Online Social Network Activity. In ASONAM, 2014.

[109] Ingmar Weber and Venkata Rama Kiran Garimella. Visualizing User-Defined, Discriminative Geo-Temporal Twitter Activity. In ICWSM, 2014.

[110] MapD. http://www.mapd.com/, 2013.

[111] One Million Tweet Map. http://onemilliontweetmap.com/, 2016.

[112] Michael Stonebraker and Ariel Weisberg. The VoltDB Main Memory DBMS. IEEE Data Engineering Bulletin, 36(2):21–27, 2013.