School of Computing Science

Antares: A Scalable, Efficient Platform for Stream, Historic, Combined and Geospatial Querying

Rebecca Simmonds

Submitted for the degree of Doctor of Philosophy in the School of Computing Science, Newcastle University June 2016

2016, Rebecca Simmonds

Abstract

Traditional methods for storing and analysing data are proving inadequate for processing “Big Data”. This is due to its volume, and the rate at which it is being generated. The limitations of current technologies are further exacerbated by the increased demand for applications which allow users to access and interact with data as soon as it is generated. Near real-time analysis such as this can be partially supported by stream processing systems, however they currently lack the ability to store data for efficient historic processing: many applications require a combination of near real-time and historic data analysis. This thesis investigates this problem, and describes and evaluates a novel approach for addressing it. Antares is a layered framework that has been designed to exploit and extend the scalability of NoSQL databases to support low latency querying and high throughput rates for both stream and historic data analysis simultaneously.

Antares began as a company-funded project sponsored by Red Hat; the motivation was to identify a new technology which could provide scalable analysis of both stream and historic data, and to explore new methods for supporting scale and efficiency, such as a layered approach. A layered approach would exploit the scale of historic stores and the speed of in-memory processing. New technologies were investigated to identify current mechanisms and to suggest a means of improvement.

Antares supports a layered approach to analysis; the motivation for the platform was to provide scalable, low latency querying of Twitter data for other researchers, to help automate their analysis. Antares needed to provide temporal and spatial analysis of Twitter data using each Tweet's timestamp and geotag. The approach used Twitter as a use case and derived requirements from social scientists working on a broader research project called Tweet My Street.

Many data streaming applications have a location-based aspect, using geospatial data to enhance the functionality they provide. However, geospatial data is inherently difficult to process at scale due to its multidimensional nature. To address these difficulties, this thesis proposes Antares as a new solution providing scalable and efficient mechanisms for querying geospatial data. The thesis describes the design of Antares and evaluates its performance on a range of scenarios taken from a real social media analytics application. The results show significant performance gains when compared to existing approaches, for particular types of analysis.

The approach is evaluated by executing experiments across Antares and similar systems to show the improved results. Antares demonstrates that a layered approach can be used to improve performance for inserts and searches, as well as increasing the ingestion rate of the system.

Declaration

I declare that this thesis is my own work unless otherwise stated. No part of this thesis has previously been submitted for a degree or any other qualification at Newcastle University or any other institution.

Total word count: approx. 45,000

Rebecca Simmonds

Publications

Portions of the work within this thesis have been documented in the following publications:

Simmonds, R. M., Watson, P., Halliday, J. and Missier, P. (2014). A Platform for Analysing Stream and Historic Data with Efficient and Scalable Design Patterns. 2014 IEEE World Congress on Services, 174-181. doi:10.1109/SERVICES.2014.40

Mearns, G., Simmonds, R., Richardson, R., Turner, M., Watson, P. and Missier, P. (2014). Tweet My Street: A Cross-Disciplinary Collaboration for the Analysis of Local Twitter Data. Future Internet, 6(2), 378-396. doi:10.3390/fi6020378

Simmonds, R., Watson, P. and Halliday, J. (2015). Antares: A Scalable, Real-Time, Fault Tolerant Data Store for Spatial Analysis. 2015 IEEE World Congress on Services, 105-112. doi:10.1109/SERVICES.2015.24

Acknowledgements

Firstly I would like to thank my supervisory team Paul Watson and Paolo Missier for their continued support throughout and for the opportunity to be a part of such a great project.

I would like to give a big thanks to Stephen Dowsland and Mark Turner for their collaboration and contributions to the visualisations and Tweet My Street project - it was a pleasure working as a development team with them.

Tweet My Street would not have been possible without the collaborative work and support of Graeme Mearns and Ranald Richardson - who it was a pleasure to work with - many thanks to them.

I would also like to thank Dawn Branley for her support and collaboration.

I would like to thank the following people for their continued support and for re- viewing the thesis: Matthew Forshaw, Jonathan Halliday, Derek Mortimer and Simon Woodman.

I would also like to thank all of my colleagues and friends from Newcastle who have supported me and helped to make this project possible, they are: Michael Bell, Ryan Emerson, Barry Hodgson, Adnan Hussain, Beth Lawry, Stephen Mackenzie, Oonagh Mcgee, Dominic Searson, David Stokes, Jennifer Warrender and Ginny Wooley.

Lastly I would like to acknowledge and give huge thanks to my family, especially Alison Simmonds, Ingrid Simmonds and Mark Simmonds, for their continued support and strength through the whole process; it would not have been possible without them.

Contents

1 Introduction 1

1.1 Thesis Contributions ...... 4

1.2 User Requirements ...... 5

1.3 Dissertation Outline ...... 6

2 Technical Context 7

2.1 Introduction...... 8

2.2 Stream...... 8

2.3 Stream Processing ...... 8

2.4 NoSQL Database ...... 8

2.5 Social Media Analytics ...... 9

2.6 Twitter ...... 9

2.7 Combined Querying ...... 9

3 Architecture 10

3.1 Introduction...... 11

3.2 Related Work ...... 13

3.2.1 Scalable Stream Processing ...... 13

3.2.1.1 Storm ...... 14

3.2.1.2 Drools ...... 14

3.2.1.3 S4 ...... 15

3.2.1.4 System S ...... 16

3.2.1.5 ESPER ...... 17

3.2.1.6 Rainbird ...... 18

3.2.1.7 Millwheel ...... 18

3.2.1.8 InfoSphere ...... 18

3.2.1.9 Apache Samza ...... 19

3.2.1.10 Heron ...... 19

3.2.1.11 Conclusion ...... 19

3.2.2 Historic Processing ...... 20

3.2.2.1 Relational Cloud ...... 23

3.2.2.2 HBase ...... 25

3.2.2.3 MongoDB ...... 27

3.2.2.4 BigTable ...... 29

3.2.2.5 HadoopDB ...... 30

3.2.2.6 Infinispan ...... 32

3.2.2.7 Titan ...... 34

3.2.2.8 SimpleDB...... 35

3.2.2.9 Neo4j ...... 36

3.2.2.10 Riak ...... 38

3.2.2.11 Conclusion ...... 40

3.2.2.12 Cassandra: in Detail ...... 40

3.2.3 Combined Processing ...... 43

3.2.3.1 StreamInsight ...... 43

3.2.3.2 Truviso ...... 44

3.2.3.3 Cloud Dataflow ...... 45

3.2.3.4 Spark ...... 46

3.2.3.5 Summingbird ...... 47

3.2.3.6 Lambda ...... 47

3.2.3.7 Conclusion ...... 48

3.3 System Architecture ...... 50

3.3.1 Stream Querying ...... 51

3.3.2 Historic Querying ...... 53

3.3.3 Combined Querying ...... 55

3.3.4 Insert Rate ...... 59

3.4 Data Model ...... 61

3.4.1 Case Studies ...... 62

3.4.2 Queries ...... 63

3.4.3 Schema ...... 69

3.4.4 Stream Schema ...... 72

3.5 Evaluation...... 74

3.5.1 Ingestion Rate ...... 74

3.5.2 Batch Experiment ...... 76

3.5.2.1 Comparison of Antares and the Synchronous Cluster . 77

3.5.2.2 Conclusion ...... 79

3.5.3 Read Performance ...... 79

3.5.3.1 Exact Match Experiment ...... 82

3.5.3.2 Range Query Experiment ...... 85

3.5.3.3 Stress Experiments ...... 88

3.5.3.4 Conclusion ...... 90

3.5.4 Stream Queries ...... 91

3.5.4.1 Conclusion ...... 91

3.5.5 Combined Queries ...... 92

3.5.5.1 Conclusion ...... 94

3.6 Conclusion ...... 94

4 Spatial Extensions 96

4.1 Introduction...... 98

4.2 Related Work ...... 99

4.3 Research into Spatial Databases and GIS ...... 99

4.3.1 B-Trees ...... 99

4.3.2 R-Trees ...... 100

4.3.3 Kd-Tree ...... 101

4.3.4 QuadTree...... 101

4.3.5 Geohashing ...... 101

4.3.6 Distributed Trees ...... 101

4.3.7 NoSQL system and Geospatial Processing ...... 102

4.3.8 Conclusion ...... 104

4.4 Geospatial Querying ...... 105

4.4.1 Solr Overview ...... 106

4.4.2 Dynamic Index Layer ...... 108

4.4.3 An Unbalanced Tree ...... 116

4.4.4 Kd-Tree ...... 120

4.4.5 Quad-Tree...... 122

4.4.6 Geohashing ...... 123

4.4.7 Schema ...... 123

4.4.8 Query Mapping ...... 124

4.4.9 Spatial Algorithms ...... 125

4.4.9.1 Insertion Algorithm ...... 125

4.4.9.2 Point and Range Algorithms ...... 128

4.4.10 Extension to Querying Model ...... 130

4.5 Consistency ...... 130

4.5.1 Scenarios ...... 131

4.6 Evaluation...... 137

4.6.1 Read Performance ...... 138

4.6.2 Scale Experiments Comparing Different Structures in Antares with Different Systems ...... 141

4.6.3 Writes ...... 148

4.6.4 Conclusion ...... 149

4.7 Conclusion ...... 150

5 Application to Twitter 151

5.1 Introduction...... 152

5.2 Social Media Analysis ...... 152

5.2.1 Conclusion ...... 158

5.3 User Interface ...... 159

5.4 Tweet My Street ...... 161

5.4.1 West End of Newcastle ...... 162

5.4.2 International Day Against Homophobia and Transphobia . . . . 164

5.4.3 Psychology ...... 165

5.4.4 Conclusion ...... 165

6 Conclusion 167

References 171

List of Figures

3.1 A Cassandra column family ...... 41

3.2 A Cassandra cluster ...... 42

3.3 Antares: abstract architecture ...... 51

3.4 Antares: stream querying architecture ...... 52

3.5 Antares: historic querying architecture ...... 55

3.6 Antares: combined querying architecture (type 1) ...... 56

3.7 Antares: combined querying architecture (type 2) ...... 56

3.8 Query language mapped to the back-end ...... 59

3.9 Twitter data model ...... 64

3.10 Tweets per second written to the Antares one-node cluster using increasing batch sizes ...... 77

3.11 Tweets per second written to Antares and the Synchronous System . . 78

3.12 Percentage savings for Antares ...... 79

3.13 Range query ...... 80

3.14 Exact match query ...... 80

3.15 Schema for the unindexed column family ...... 81

3.16 Example schema ...... 81

3.17 Unindexed exact match query ...... 82

3.18 Unindexed range query ...... 82

3.19 Mean response time for exact match queries executed across Antares and an unindexed cluster ...... 83

3.20 Mean response time for exact match executed across Antares ...... 83

3.21 Percentage savings for Antares ...... 84

3.22 Mean response time for range queries executed across Antares and an unindexed cluster ...... 86

3.23 Mean response time for range queries executed across Antares ..... 87

3.24 Percentage savings of the Antares system ...... 87

3.25 Mean response time for exact match query with increasing concurrent queries...... 89

3.26 Mean response time for Query One with increasing concurrent queries . 90

3.27 Tweets processed per second by ESPER ...... 92

3.28 Mean response time of query four with increasing concurrent queries while simultaneously inserting to the database ...... 94

4.1 PrefixTree...... 107

4.2 Antares: geospatial architecture ...... 110

4.3 Antares: abstract geospatial architecture ...... 111

4.4 The mapping between the world and grid regions which represent tree nodes...... 112

4.5 Node B being split ...... 113

4.6 Quad-Tree mapped from the grid ...... 114

4.7 The grid changing for a split ...... 114

4.8 The indexing layer converted to Cassandra schema ...... 115

4.9 A viewport intersecting multiple grids ...... 116

4.10 An unbalanced tree being balanced ...... 118

4.11 Nodes which are not queried as they hold no data ...... 119

4.12 A node and its values ...... 121

4.13 Grid represents a Kd-Tree ...... 121

4.14 Kd-Tree ...... 121

4.15 Grid representation of a Quad-Tree ...... 122

4.16 Quad-Tree ...... 123

4.17 Scenario One ...... 132

4.18 Scenario Two ...... 134

4.19 Scenario Three ...... 136

4.20 Scenario Four ...... 137

4.21 Mean response times for geospatial querying (up to 2000 km2)..... 139

4.22 Mean response times for geospatial querying (up to 1000 km2) .... 140

4.23 Percentage saving Antares ...... 140

4.24 Mean response times for map queries as the number of nodes increases 141

4.25 Mean response time (ms) for geospatial queries executed across different systems on a one-node cluster ...... 144

4.26 Mean response time (ms) for geospatial queries executed across different systems on a two-node cluster ...... 145

4.27 Mean response time (ms) for geospatial queries executed across different systems on a three-node cluster ...... 146

4.28 Mean response time (ms) for geospatial queries executed across different systems on a four-node cluster ...... 147

4.29 Number of inserts per second as the number of nodes is increased . . . 149

5.1 Antares: map ...... 160

5.2 Antares: map with modal ...... 160

5.3 Antares: word cloud ...... 161

5.4 Antares: activity graph showing the rate of tweets in the West End of Newcastle ...... 163

5.5 Antares: map showing community opinion ...... 163

5.6 Antares: obvious traffic routes ...... 164

List of Tables

3.1 Database characteristics and descriptions ...... 21

3.2 Parameters and values in the queries ...... 65

3.3 Query definitions ...... 65

3.4 Query parameters ...... 65

3.5 Questions and matching queries ...... 66

3.6 Tweets per second written to Antares with increasing batch sizes . . . 76

3.7 Tweets per second written to Antares and the Synchronous System . . 78

3.8 Mean response time for exact match queries executed across Antares and an unindexed cluster ...... 82

3.9 Mean response time for range queries executed across Antares and an unindexed cluster ...... 86

3.10 Mean response time for exact match query with increasing concurrent queries...... 89

3.11 Mean response time for Query Four with increasing concurrent queries 90

3.12 Tweets processed per second by ESPER ...... 91

3.13 Mean response time for read and write load ...... 93

4.1 Mean response time (ms) for Solr compared with Antares for increasingly large areas (km2)...... 139

4.2 Mean response time (ms) for Antares as the number of nodes increases 141

4.3 Mean response times (ms) for each framework across a one-node cluster 143

4.4 Mean response times (ms) for each framework across a two-node cluster 145

4.5 Mean response times (ms) for each framework across a three-node cluster 146

4.6 Mean response times (ms) for each framework across a four-node cluster 147

4.7 Mean response time (ms) for Antares as the number of nodes increases 149

Acronyms

List of Acronyms

API: Application Program Interface
CEP: Complex Event Processing
CQL: Cassandra Query Language
DBMS: Database Management System
DRAM: Dynamic Random Access Memory
EPL: Event Processing Language
GFS: Google File System
HDFS: Hadoop Distributed File System
JDBC: Java Database Connectivity technology
JPA: Java Persistence API
JVM: Java Virtual Machine
MBR: Minimum Bounding Rectangle
MTTF: Mean Time To Failure
MVCC: Multi-version Concurrency Control
RDBMS: Relational Database Management System
SQL: Structured Query Language
SSD: Solid State Drive
TTL: Time To Live

1 Introduction

Contents
1.1 Thesis Contributions ...... 4
1.2 User Requirements ...... 5
1.3 Dissertation Outline ...... 6


The emergence of “Big Data”, including data from social media, sensors and online resources, has created the need for novel processing and storage technologies. Traditional methods for storing and analysing data have proved inadequate due to the sheer volume of data, and the rate at which it is being generated. Given the real-time nature of many of these data sources the success of an application often depends on the stream of data being analysed very quickly after it is collected. Additionally, with the growth of mobile devices, data often has location tags assigned to it, and the ability to process this in a timely manner is the basis for many applications. For example, direction applications need to respond in near real-time. However, large scale location data can be challenging to process with new scalable technologies due to the rich querying functionality required to query latitude, longitude and metadata (multidimensional data).

Near real-time analysis of these types of datasets can be partially supported by stream processing systems; however, many applications require a combination of stream processing and historic data analysis. Stream processing can often be limited by the size of memory available, which can be expensive to increase. Therefore being able to process the stream as it is collected and then store it in a database supports a more affordable and scalable solution. It also means that querying the data at a later date is possible: for example, identifying an event such as an earthquake in real-time, then querying the historic store later to try to identify the epicentre.

Twitter analytics is used as a generalisable example to drive the design and evaluation of a scalable system that can achieve this (however, Antares is capable of processing any temporal and geospatial data). The Twitter firehose emits Tweets at an average rate of 5000 per second [1]. The data used was based on temporal and geotagged information.

Each Tweet is essentially a JSON object made up of a set of tags representing the metadata of the Tweet itself. The data model taken from this metadata and used for Antares contains a user, retweet, geotag, timestamp, hashtag and status:

• User: who posted the Tweet.
• Retweet: whether the Tweet has been retweeted; if so the Tweet is given a new id, as well as retaining the original one.
• Geotag: the location from which the Tweet was posted, as a latitude and a longitude.
• Timestamp: when the Tweet was posted.
• Hashtag: the topic of the Tweet, one word preceded by a hash symbol.
• Status: the content of the Tweet.
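
The sketch below is illustrative only: it shows one way these fields might be represented as a plain Java class. The class, field names and types are assumptions made for exposition, not the actual Antares schema.

```java
// Illustrative only: a minimal representation of the Tweet metadata fields
// used by the Antares data model (user, retweet, geotag, timestamp, hashtag,
// status). Field names and types are assumptions, not the actual schema.
import java.time.Instant;
import java.util.List;

public class Tweet {
    String id;               // Tweet identifier
    String user;             // who posted the Tweet
    String retweetOfId;      // original Tweet id if this is a retweet, otherwise null
    Double latitude;         // geotag: latitude where the Tweet was posted (may be null)
    Double longitude;        // geotag: longitude where the Tweet was posted (may be null)
    Instant timestamp;       // when the Tweet was posted
    List<String> hashtags;   // topics, each a single word preceded by '#'
    String status;           // content of the Tweet

    public boolean isRetweet() {
        return retweetOfId != null;
    }

    public boolean isGeotagged() {
        return latitude != null && longitude != null;
    }
}
```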

Identifying events as they occur, for example all Tweets with a specific hashtag, requires the ability to query data arriving at the full rate of the Twitter firehose. All Tweets then need to be stored in a database, which must be able to simultaneously run queries against the stored (i.e. historic) data, potentially spanning billions of items. For example, at the current rate Twitter will see around 150 billion Tweets generated in 2015, and it has witnessed bursts as high as 143,199 Tweets per second [1].

The system must also support “combined” queries that use both stream and historic analysis techniques together, in order to answer questions such as identifying all Tweets that have been retweeted more than a certain number of times. A combined query searches both stream and historic data over a time period that stretches from the past into the future. A single query is submitted by the user, and the system provides a novel mechanism for transparently translating it into, and executing, the correct type of query.
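
As an illustration of this kind of transparent routing, the following is a minimal sketch which assumes each query carries an explicit time range; the class, enum and method names are hypothetical and this is not Antares' actual implementation:

```java
// Illustrative sketch of transparent query routing: a query whose time range
// lies wholly in the past goes to the historic store, one wholly in the
// future goes to the stream layer, and one spanning "now" goes to both.
// Class, enum and method names are hypothetical, not part of Antares.
import java.time.Instant;

enum QueryTarget { HISTORIC, STREAM, COMBINED }

final class QueryRouter {
    static QueryTarget route(Instant rangeStart, Instant rangeEnd) {
        Instant now = Instant.now();
        if (!rangeEnd.isAfter(now)) {
            return QueryTarget.HISTORIC;   // range is entirely in the past
        }
        if (!rangeStart.isBefore(now)) {
            return QueryTarget.STREAM;     // range is entirely in the future
        }
        return QueryTarget.COMBINED;       // range spans past and future
    }
}
```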

Designing a system that can scale to handle the full arrival rate of events from the Twitter firehose, while simultaneously answering user queries against the stream and historic data, is challenging in terms of providing reliability and availability. The difficulty lies in the ingestion rate: the system must remain available for querying while data is being written, which means investigating concurrent solutions, and it must maintain reliability while supporting that availability.

This challenge is made yet more complicated by the importance of geotagged data for processing Twitter and other types of streaming data. Existing solutions use spatial indexing techniques, including tree and hashing structures, to accelerate geospatial queries. However spatial data can still be challenging to process at scale due to its multidimensional nature as each point contains a latitude, longitude and metadata (related to the content posted). The complexity of processing stream, historic and multidimensional data has driven the design and implementation of Antares to support scalable queries that range over both time and space.


1.1 Thesis Contributions

This thesis investigates the problem of scalably supporting applications that require a combination of near real-time and historic data analysis, and describes and evaluates a novel approach for addressing it. The approach taken in this thesis is a layered one, which achieves a scale that a non-layered approach could not, given the volume of data being ingested and the timeouts this would otherwise cause. Any solution to this problem must provide:

1. A stream processing system for near real-time analysis of the high-velocity in- coming data.

2. A historic data store capable of high data ingestion rates, that reliably stores incoming data even while simultaneously executing queries.

3. Scalable, high throughput, low-latency processing of queries over the historic data store for geospatial and temporal data.

4. An investigation of the viability of a layered approach to scalable querying of Twitter data.

A survey of different data processing technologies was undertaken to investigate and evaluate a layered approach, which sought to retain the benefits of NoSQL systems and the scale they can provide. New research was then needed to design methods to support low latency querying and high throughput rates for both near real-time and historic data analysis simultaneously.

This thesis addresses these problems and introduces a framework called Antares: a scalable system which processes data streams and historic data, and combines the two mechanisms to provide analysis of multidimensional (e.g. temporal and geospatial) data. In particular, it provides the new methods required to support high performance querying of geospatial data. Antares runs in the Cloud, exploiting its potential for scaling out and elasticity (supporting the simple addition and removal of nodes for a burst of events). The thesis describes the design of Antares and evaluates its performance on a range of scenarios taken from a real social media analytics application.


The results show significant performance gains when compared to existing approaches, for particular types of analysis.

The thesis’ contributions are therefore:

1. The design, implementation and evaluation of Antares, which is used to develop techniques of scalable querying of Twitter data.

2. The design, development and evaluation of optimisations for the high throughput ingestion of data streams, specifically by a historic store, to improve stream processing through the use of a layered approach.

3. The design, development and evaluation of a novel in-memory cache to reduce query response time and increase the scalability for querying spatial data.

1.2 User Requirements

The motivation for designing and developing such a system can be demonstrated through the user requirements of a cross-disciplinary project called Tweet My Street. This project used Antares to analyse geotagged and temporal data to derive data-driven insight from Twitter data.

Antares needed to provide combined querying to support the identification of events on the stream, which would in turn support the querying of a much larger stored dataset. Results from the large stored datasets had to be returned in near real-time so that combining them with the stream data was possible. The novel solution provided by Antares allows users to write one query for both of these processing techniques rather than two; this abstraction supports easier and more efficient execution of combined queries. Additionally, Antares ingests data while simultaneously executing queries. This gives the user the ability to query current data without dropping other data that could be investigated later. If the user queried an event, for example a conference, new hashtags may become popular later in the conference; Antares supports the storage and querying of that data after the hashtag has been identified, giving the user rich temporal querying.


Antares also supports higher-performance geospatial querying. The requirement was to use a simplified append-only model to enable large improvements in the efficiency of geospatial querying. Antares required the ability to zoom in and out of the display area and to pan around it. The system uses its append-only qualities to simplify queries and reduce query execution time; it is append only because of the requirements above, as all data must be stored to support querying at a later date. This resulted in a large increase in performance compared with current commercial systems. Antares supports querying of large geospatial datasets, using a map visualisation to enable user interaction with the dataset.

1.3 Dissertation Outline

The scope of the remaining chapters is as follows:

• Chapter 2 outlines the technical aspects of the thesis and terms that will be used in later chapters.

• Chapter 3 discusses the design and implementation of Antares. It focuses on comparing related systems. It then describes the low-latency of the combined processing and the optimisations to enable this. This is then evaluated and conclusions drawn from these experiments.

• Chapter 4 then moves on to the extension of Antares to cope with large amounts of geospatial data processing. Following the same structure as mentioned: related work, architecture, evaluation and conclusion.

• Chapter 5 describes the application of Antares and how it scalably analyses Twitter data in a project called Tweet My Street. Related work is compared, the user interface is described and a critical analysis of the system is given.

• Chapter 6 concludes the thesis, reflecting on how well the objectives were met, and proposing topics for future work.

2 Technical Context

Contents
2.1 Introduction ...... 8
2.2 Stream ...... 8
2.3 Stream Processing ...... 8
2.4 NoSQL Database ...... 8
2.5 Social Media Analytics ...... 9
2.6 Twitter ...... 9
2.7 Combined Querying ...... 9


2.1 Introduction

This chapter introduces technical aspects that are used within the thesis. The aim of the chapter is to give an expanded explanation of the terms and their meaning within the context of this thesis.

2.2 Stream

Stream: a sequence of data elements which are given to the computer in a continuous, real-time flow as the data is produced. This differs from batch processing: each item is produced individually and then given to the system to process, and there is always incoming data (unless the data source has stopped producing), so there is a constant intake of data rather than a batch of items which needs processing at periodic intervals.

2.3 Stream Processing

Stream processing: the ability to analyse data as it is produced, ingesting a stream of data and producing some output, which may or may not itself be a stream, a set of tuples or a table.

2.4 NoSQL Database

These databases do not use conventional relational schemas to store data. They promote denormalisation of the data, on the basis that the systems scale easily enough that the extra duplicated data does not compromise the execution time of read or write queries. The schema is usually more flexible, with resultant trade-offs between consistency, availability and partition tolerance.
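
As a small illustrative example of denormalisation, the hypothetical CQL table definitions below store the same Tweet once per query pattern so that reads never need a join; the table and column names are assumptions for exposition, not the Antares schema:

```java
// Hypothetical illustration of denormalisation in a column store: the same
// Tweet is written once per query pattern, so reads never require a join.
// Table and column names are made up for this example, not the Antares schema.
final class DenormalisedSchema {
    // Query pattern 1: "all Tweets by a given user, newest first".
    static final String TWEETS_BY_USER =
        "CREATE TABLE tweets_by_user ("
      + "  posted_by text, posted_at timestamp, tweet_id text, status text,"
      + "  PRIMARY KEY (posted_by, posted_at)"
      + ") WITH CLUSTERING ORDER BY (posted_at DESC)";

    // Query pattern 2: "all Tweets for a given hashtag, newest first".
    // The status text is duplicated here rather than referenced via a join.
    static final String TWEETS_BY_HASHTAG =
        "CREATE TABLE tweets_by_hashtag ("
      + "  hashtag text, posted_at timestamp, tweet_id text, status text,"
      + "  PRIMARY KEY (hashtag, posted_at)"
      + ") WITH CLUSTERING ORDER BY (posted_at DESC)";
}
```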


2.5 Social Media Analytics

Social media includes microblogs and social networking services such as Facebook, Twitter and YouTube. The analysis of this data using techniques which focus on the structure and terms specific to this media (rather than traditional techniques) is called social media analytics.

2.6 Twitter

Twitter is a social media microblogging platform which allows users to post Tweets: messages of up to 140 characters visible to other users. The platform allows users to choose the content they see by following other users, and users in turn have followers. Hashtags are used to express a topic within a Tweet; they help users search the platform and provide popular topics to suggest to users, known as “trending” topics.

2.7 Combined Querying

Within the context of this thesis the term combined querying means one or more queries which use both historic and stream processing. The order of execution, whether sequential or parallel, does not affect the term, so long as both layers (historic and stream) have been used in the processing to produce some output.

3 Architecture

Contents
3.1 Introduction ...... 11
3.2 Related Work ...... 13
3.2.1 Scalable Stream Processing ...... 13
3.2.2 Historic Processing ...... 20
3.2.3 Combined Processing ...... 43
3.3 System Architecture ...... 50
3.3.1 Stream Querying ...... 51
3.3.2 Historic Querying ...... 53
3.3.3 Combined Querying ...... 55
3.3.4 Insert Rate ...... 59
3.4 Data Model ...... 61
3.4.1 Case Studies ...... 62
3.4.2 Queries ...... 63
3.4.3 Schema ...... 69
3.4.4 Stream Schema ...... 72
3.5 Evaluation ...... 74
3.5.1 Ingestion Rate ...... 74
3.5.2 Batch Experiment ...... 76
3.5.3 Read Performance ...... 79
3.5.4 Stream Queries ...... 91
3.5.5 Combined Queries ...... 92
3.6 Conclusion ...... 94


3.1 Introduction

Antares is a scalable, cloud-based framework designed to support the storage and querying of high-velocity, high-volume Twitter data. Antares supports low latency combined, historic, streaming and geospatial processing. Antares has been designed and implemented to enable other researchers to use the tooling to analyse Twitter data, and as a result it has successfully supported published research projects [2], [3] and [4]. Antares is currently being deployed in a cross-disciplinary project called Tweet My Street [3].

Antares uses a layered approach to implement these querying techniques. The layered approach helps to extend and exploit the desirable characteristics of a stream processing system and a historic data store to provide scalable Twitter analysis. Antares builds on two existing components: Cassandra and ESPER.

The key features of the Antares design are:

1. Database optimisations to enable efficient stream and historic data querying. The resulting low query latency facilitates combining the querying of historic data with stream event processing.

2. Enabling transparent submission of queries - the user should only enter one query whether it be a stream, historic or combined - the system will translate this into the correct query submission and execution.

3. Optimisations needed to ingest high velocity, high volume data streams. Consistency can be lowered to provide higher ingestion rates; therefore all data written to the database will be eventually consistent.

4. A distributed scalable caching layer which enables fast, efficient geospatial query- ing.

5. A set of algorithms which are required to map the caching layer to the storage layer, which enables efficient querying to exploit scalable data storage.

6. Consistency checks for the distributed geospatial cache.


7. A web interface for interaction with and visualisation of large volumes of data.

This section describes the architecture on which Antares is based and which it extends. Antares supports scalable querying of the Twitter firehose. This required high ingestion rates and a “combined” querying mechanism supporting both stream and historic querying, as well as exploiting low latency to provide a combination of the two techniques. Antares provides easy querying of Twitter data by combining both querying mechanisms behind one query stated by the user; it takes these queries and translates them into the correct type of query, whether that be stream, historic or combined. This chapter argues that Antares uses a layered approach to support scalable Twitter analysis, providing the user with low latency querying and a single interface for entering query parameters for all querying mechanisms. The scale Antares provides would not be possible with a non-layered approach, as queries and insertions are distributed throughout the system to allow a greater intake of requests. The motivation for Antares was to:

1. provide researchers with a scalable means of analysing Twitter data

2. investigate different technologies to provide scalable Twitter analysis.

A literature review was undertaken to identify a suitable technology which could scale out to exploit the elasticity of cloud computing. This chapter describes the different technologies which were considered for the layered approach within Antares; these are then compared with current “combined” systems to show how Antares differs. Antares and the data model used to design the queries are then described in detail, covering the querying mechanisms, query language and mapping, followed by the evaluation of the system. The evaluation of Antares used Twitter data and simulated firehose rates to measure the latency of queries and how combined querying and writing to the system affect query response times.


3.2 Related Work

This section reviews technologies which could be used for the layered approach to analysing Twitter data. These technologies are compared and the desirable features of each identified. The stream processing systems are reviewed for scalability, to validate whether they could be used within the system to achieve feature three.

A set of databases which are regarded, or have been described, as scalable is also investigated, and a database with the desired qualities is then chosen. The last part of the related work looks at systems which, like Antares, combine layers; these are reviewed and compared. The related work is used to examine the technology space which currently exists for layered approaches, to identify technology which can be extended to provide scalable Twitter querying and analysis, and to compare Antares with current systems, identifying the differences and the improvements it makes.

3.2.1 Scalable Stream Processing

Stream processing systems are becoming increasingly important with the spread of sensor and mobile technologies. The need for distributed and/or fault tolerant systems is leading to new approaches. Additionally, the growth of Big Data has changed the direction of data analysis, requiring larger and larger datasets to be processed within restrictive time limits. This section therefore reviews and evaluates current technologies that provide stream analysis.

A literature review was carried out on different stream processing systems, which reviewed the scalability of the system to ingest and process data and the querying mechanisms that are used by these systems. These were compared and then a suitable stream processing system was chosen to support the scalability of the layered approach taken within Antares.

To identify a stream processing system capable of ingesting a large amount of data, a literature review was undertaken describing and comparing current technologies. These were then used to choose one that would support the layered and scalable approach of Antares.


3.2.1.1 Storm

Storm [5], [6] is an open source real-time processing system, which executes continuous queries across stream data to identify events. Storm provides efficient and quick querying for real-time applications. It is a stream processing system that was recently acquired by Twitter and has just become open source.

Streams of data are input, processed, and a resultant stream is output. Storm provides complex message processing of stream data, as well as continuous queries. Antares likewise supports real-time querying: queries are continuously executed across the data stream being processed. Antares also provides low latency query responses.

Storm has the ability to process millions of messages per second by supporting scale-out functionality. It supports distributed RPC, parallel search querying and set operations on large-scale data sets.

Distributed RPC allows queries to be divided and executed in parallel, which supports efficient response times. However, Storm shreds (drops) data if the ingestion rate becomes too high; Antares avoids this problem by achieving high ingestion rates through asynchronous calls and batching, as sketched below.
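
A minimal sketch of that kind of write path is shown below, assuming the DataStax Java driver (3.x-style API) and reusing the illustrative Tweet class sketched in Chapter 1; the keyspace, table, batch type and consistency level are assumptions rather than Antares' actual configuration.

```java
// Sketch of high-rate ingestion via batching and asynchronous writes with a
// relaxed consistency level (DataStax Java driver, 3.x-style API). The
// keyspace, table, batch type and consistency level are illustrative
// assumptions, not Antares' actual configuration.
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.Date;
import java.util.List;

final class TweetWriter {
    private final Session session;
    private final PreparedStatement insert;

    TweetWriter(String contactPoint) {
        Cluster cluster = Cluster.builder().addContactPoint(contactPoint).build();
        this.session = cluster.connect("twitter");   // hypothetical keyspace
        this.insert = session.prepare(
            "INSERT INTO tweets (tweet_id, posted_by, posted_at, status) VALUES (?, ?, ?, ?)");
    }

    // Writes a batch without blocking, so the caller can keep consuming the stream.
    void writeBatch(List<Tweet> tweets) {
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (Tweet t : tweets) {
            batch.add(insert.bind(t.id, t.user, Date.from(t.timestamp), t.status));
        }
        batch.setConsistencyLevel(ConsistencyLevel.ONE);  // trade consistency for throughput
        session.executeAsync(batch);                      // asynchronous, non-blocking write
    }
}
```

Relaxing the consistency level and batching writes is the kind of trade-off referred to in the design features above: data becomes eventually consistent in exchange for a higher ingestion rate.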

The framework is held completely in-memory and therefore does not provide standalone persistent data storage, which is similar to the ESPER layer within Antares - Cassandra is used to provide persistence.

Storm uses “topologies” to execute queries, similar to map reduce, across the stream. These topologies are continuous queries and can be executed for an unbounded amount of time. Joins are not typically supported. Antares does not use joins, as they could decrease performance; denormalisation is used instead to improve performance.

3.2.1.2 Drools

Drools Fusion is a rule based, forward chaining inference based rule engine and complex event processing system (CEP), which ingests multiple asynchronous streams [7].

Drools supports user-defined sliding windows and is schemaless in order to provide flexibility. Drools evaluates rules against facts, the data items asserted into its working memory.


Queries are constructed of rules and facts, which are used to examine the data. Multiple streams can be ingested and joined. The system is provided by Red Hat and is therefore open source. Multiprocessor hardware supports partitioning. Optional logging is supported to provide persistence, as well as pluggable storage through the use of the Java Persistence API (JPA). If JPA is used then this provides additional consistency, as the system itself already uses transactions.

Antares does not need strict consistency rules, unlike, for example, a banking system which must ensure that account balances are consistent; a slight delay in consistency for Twitter data is acceptable in order to improve performance.

Drools has the capability to scale to 900,000 facts on a 64bit JVM [8] with reasonable performance.

The system supports elasticity and has been used in the cloud to automate the increase and reduction of resources in a Java EE application.

The system is suited to algorithmic trading, telecom rating, fraud detection, content-based routing, credit approval, insurance and risk assessment. Antares does not need the functionality that applications like these require: ACID transactions are useful for them, but would only reduce the performance of Antares with no gain for the low latency queries it requires.

Writing rules so that they are effective can be hard. Drools is less useful for smaller projects where the rules do not change over time, and rule-based systems can be more difficult to debug.

3.2.1.3 S4

S4 [9] is a stream processing system which is used to process unbounded data streams. It is distributed and partially fault tolerant, depending on the pluggable storage used.

Each node processes the information and then messages are used to communicate this between nodes. This could cause performance issues, as each node within the cluster must know which nodes have processed what. This is not a problem when using ESPER in Antares, as it provides scalability on one machine. Antares uses a scalable historic store to provide collection of big data, whereas the stream processing system is used to provide scalable real-time analysis of current data only, for the Twitter firehose.

It is suited to applications such as click stream analysis, evaluation of online algorithms and marketing campaigns.

S4 does not have a buffering mechanism therefore if there is too much data the excess is lost (data shredding). As there is no buffering it would not be suitable for large Twitter streams. Antares aims to process large scale data therefore loss of data from shredding would not meet the desirable characteristics of the system. ESPER provides a scalable means of ingesting the entire firehose without loss of data.

The system uses modified map-reduce functions to execute queries across the stream. Antares supports real-time queries with results being continuously pushed back to the user.

3.2.1.4 System S

System S [10] is a stream processing system which responds to information quickly and can then dynamically change requirements. It continuously analyses data at high rates and rapidly adapts to changing data forms and types. Antares currently focuses on Twitter data and therefore does not require quick dynamic changes to different inputs; supporting them is left as future work, as it may affect performance and scale.

The system is highly available, heterogeneous and distributed. The system is used for providing security and information confidentiality for shared information by IBM. It executes a process similar to a continuous query that can be executed across multiple streams. Security is not a specified feature of Antares, therefore there would be no benefit in trading performance for secure data storage.

The system is used for anomaly detection, telescope data retrieval and analysis, energy trading services, financial services, health monitoring and manufacturing. System S is not an open source product and focuses on security features, which reduces performance.


3.2.1.5 ESPER

ESPER [11] is a CEP system which ingests stream data and executes continuous queries over it to identify events. It is a standalone system. It supports sliding, tumbling and combined time-windows for temporal querying. Events are filtered and analysed and then a response is returned in real-time. An event occurs when a constraint in the continuous query is met.

ESPER has a flexible schema with no constraints. Operations that are supported are: grouping, aggregation, sorting, filtering, merging, splitting or duplicating of event streams. The query language supported is called EPL and is a derivative of SQL, which supports complex querying. The system supports inner and outer joins to combine streams.

A variety of pluggable database storage is available. ESPER is multi-threaded and provides a concurrency-safe iterator with read-write locking. The system has a performance-focused design, which includes query strategy analysis and index building.

Antares opted to use ESPER as it is suited to complex computations, high throughput and low latency real-time applications. ESPER also supports complex queries and user-defined time-windows, which means it is suited to performance workloads. Antares requires high throughput and low latency querying for handling stream processing.

In ESPER a time-window signifies that an event must happen within that time period for the query to trigger some action. In the case of Antares this could trigger the execution of a query against the database or the return of results to the user, depending on the type of query executed by the user: if it is a stream query then the results from a query executed across the stream data are returned. As mentioned, some of the combined queries contain a “trigger event”, which means that when an event is identified by ESPER a historic query is also executed across the database and the results are combined for the user. ESPER was chosen as it can handle large numbers of events and specify time-windows proficiently.
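
A minimal sketch of this mechanism, using the pre-8 Esper Java API with a map-based event type, is shown below; the EPL statement, property names and threshold are illustrative assumptions rather than Antares' actual queries.

```java
// Sketch: register a map-based "Tweet" event type with ESPER, run a continuous
// EPL query over a 60-second sliding time window, and treat the listener
// callback as the "trigger event" that could launch a historic query. The EPL
// text, property names and threshold are illustrative, not Antares' queries.
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;
import java.util.HashMap;
import java.util.Map;

public class StreamTrigger {
    public static void main(String[] args) {
        // Define a simple "Tweet" event with only the properties used below
        // (one hashtag per event is a simplification for this sketch).
        Map<String, Object> tweetType = new HashMap<>();
        tweetType.put("user", String.class);
        tweetType.put("hashtag", String.class);
        tweetType.put("status", String.class);

        Configuration config = new Configuration();
        config.addEventType("Tweet", tweetType);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // Continuous query: count Tweets per hashtag over a sliding 60-second
        // window and fire only for hashtags seen more than 100 times.
        EPStatement stmt = engine.getEPAdministrator().createEPL(
            "select hashtag, count(*) as n from Tweet.win:time(60 sec) "
          + "group by hashtag having count(*) > 100");

        stmt.addListener((EventBean[] newEvents, EventBean[] oldEvents) -> {
            if (newEvents == null) return;
            for (EventBean e : newEvents) {
                // In a combined query, this is the point at which a historic
                // query against the data store could be issued for the hashtag.
                System.out.println("Trending: " + e.get("hashtag") + " -> " + e.get("n"));
            }
        });

        // Incoming Tweets from the stream are pushed into the engine like this:
        Map<String, Object> tweet = new HashMap<>();
        tweet.put("user", "alice");
        tweet.put("hashtag", "newcastle");
        tweet.put("status", "example tweet text");
        engine.getEPRuntime().sendEvent(tweet, "Tweet");
    }
}
```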


3.2.1.6 Rainbird

Real-time counting (e.g. of URLs and Twitter clicks) was introduced by a commercial system called Rainbird [12]. This uses Cassandra as a persistent store for durability and executes stream processing in the application layer. This is similar to Antares, which uses a layered approach with a historic store that scales to support the collection and processing of large datasets.

It uses tokenisation for textual analysis of Tweets, counting the occurrence of tokens. The system supports write rates of hundreds of thousands of writes per second and a read volume of tens of thousands of reads per second. These queries are all executed with low latency, at approximately under 100 ms. Rainbird supports horizontal scaling to terabytes. However, the query language is limited to counting operations only, so it would be impractical to use in this project.

3.2.1.7 Millwheel

Millwheel [13] is a fault tolerant stream processing framework used for Internet-scale applications. The system is used widely at Google for low latency data processing. Users specify a directed computational graph and a continuous flow of records, and the framework remains fault tolerant throughout. Data is sent along the edges and through nodes to be processed. Any node or edge can fail at any time and the result will still be correct. Millwheel prioritises high fault tolerance, a characteristic which may decrease performance and would therefore not help achieve the features stated for Antares.

3.2.1.8 InfoSphere

InfoSphere [14] is IBM's commercial offering for stream analysis. It provides user-defined operations and a high throughput rate of up to millions of events per second. Frameworks like InfoSphere and Millwheel provide new stream processing technologies; however, they are not open source and could not be considered for this project.


3.2.1.9 Apache Samza

Apache Samza [15] is a distributed stream processing framework, which uses Kafka for messaging and Hadoop YARN [16] for fault tolerance. It supports a simple API for access. This is used within Twitter's Heron [17]. It has managed state, snapshotting and restoration to support fault tolerance. For scalability it partitions data to distribute it across a cluster and uses YARN to manage the containers.

3.2.1.10 Heron

Heron [17] provides stream processing at scale and is the replacement for Storm to help support easier debugging and increased scale. Heron was adopted by Twitter on June 4th 2015. It provides a solution for real-time user counts and real-time engagement with Tweets and advertisements. Antares provides aggregate functions as well, but it supports a wider functionality of querying.

Heron runs topologies (these are the same topologies as Storm executes – mentioned previously) and allows “backpressure” to adjust the flow of data. Spout backpressure allows the ingest rate to be monitored and modified to ensure that the ingestion rate is always equal to or less than the maximum ingestion capacity of the system. This means that no tuples are dropped and ensures no work is lost. This is different to Storm which drops tuples if it is unable to handle them. Backpressure is used so debugging is easier as you can see when a skew in data has occurred and identify the root cause of the error [17].

A Heron instance does most of the work and runs only a single task, therefore debugging is easier as there is only one task executing. Experiments show that Heron reduces latency and increases throughput.

3.2.1.11 Conclusion

This section has reviewed and identified stream processing engines that could be used within the Antares system. After reviewing the literature it was identified that Rainbird [12] and Storm [5] do not have the SQL-like capabilities of ESPER (for complex querying beyond GET, SET and DELETE, which is a trade-off for performance), while Drools [7] supports ACID transactions, which can decrease performance and therefore focuses on the wrong characteristics for Antares. Antares is based around scalable, high-performing querying and analysis of Twitter data.

Security is the main focus of System S [10], which is not necessary for Antares. Reliability and the ability to ingest high rates of data are among the major contributions of Antares; a system like S4, which uses messaging to support consistency and shredding to reduce ingestion when a node becomes over-saturated, was therefore not suitable for the desired characteristics of Antares, namely the scalability and performance required by the system. There is a lot of choice, but ESPER was chosen due to its support for complex querying, low latency and scale, which is essential for the Twitter analytics applications that Antares must support.

The next section identifies different historic stores which are regarded as scalable, examines their characteristics and then identifies the most appropriate data store to extend.

3.2.2 Historic Processing

To choose a technology which would support the scalability, flexibility and features needed for a layered and scalable approach to querying and processing Twitter data received as a stream, a set of technologies was investigated. This involved identifying different scalable technologies and comparing them to see which would be fit for use within Antares and best support the thesis contributions.

The historic store should provide a high insertion rate, low latency querying and scale-out functionality. The ability to support large-scale data processing is vital due to the petabytes of data being produced by sources such as social media and sensors. Traditional processing techniques do not typically meet the needs for processing such large datasets. Traditional relational databases do not provide the scale, and batch processing databases do not support timely enough responses to meet the requirements of real-time applications. Therefore new technologies and mechanisms are being explored.

This section reviews different database technologies aimed at solving these problems, by either using a new database approach or by extending traditional RDBMS systems. By reviewing the different technologies available, a conclusion is drawn about the applications and workloads suited to each. Each of the databases is described and compared using common database features; the features compared are shown in Table 3.1.

Schema: Describes the schema, if there is one, and how the database structures the data held within it.
Joins: Whether the data layer supports joins.
Open Source: Whether the database source code is freely available.
Partition Type: Describes how and if the database is partitioned (sharded).
Logging: Whether the database logs operations.
Locking: Describes the locks used by the database.
Storage: Whether the database is in-memory or held on disk.
Consistency: Whether the database uses ACID transactions or an eventual consistency model.
Availability: The probability that the database is operational at a given time.
Partition Tolerance: Whether the cluster continues to work even without communication between all the partitions.
Performance: The performance of queries executed over the database.
Scalability: Whether, and by how much, the database performance can be scaled out.

Table 3.1: Database characteristics and descriptions

Before classifying databases in these terms, some of the terminology used in the classification is introduced.

Blanket Statement Removal: Removal of occasional statements (queries) that span the entire database.

CAP Theorem: concerns consistency (a read sees all previously completed writes), availability (reads and writes always succeed) and partition tolerance (the system continues to operate during a network partition). The theorem states that a distributed system can never fully provide all three of these properties at any one time, so there must be a trade-off; for example, a system that is highly available and partition tolerant may implement eventual consistency [18].


Check and Set: Uses version stamps, so that when an object is retrieved, so is the version stamp, and this is then passed to the set method. The system will then verify whether the version number is the latest and allow the update, or, if not, fail it.

Cold Cache: When the cache starts it is empty, so there is initially no speed-up advantage to having it.

Consistent Hashing: When a hash table is resized, only k/n keys need to be remapped, where k = number of keys and n = number of slots, as opposed to all of them (a small illustrative sketch appears at the end of this glossary).

Chubby: Loosely coupled distributed lock service [19].

Decision Tree: A tree structure, which shows the consequences of a chain of decisions.

Hash Partitioning: A hashing function is used on the key (or another attribute) to calculate the location in the database the data is written to.

Dominant Workloads: Workloads that are major resource consumers.

HQL: Hibernate query language - fully object oriented similar to SQL.

Master-Master: All nodes accept requests and each node will propagate changes if replication is used.

MVCC: This is a concurrency mechanism, which marks old data “out-dated” rather than deleting it. There are multiple versions of data, which allows readers to see the data as it was when it was requested - even if it was updated during the read.

Optimistic Locking: Multiple transactions can compete without interfering with each other. No lock is used while writing and before committing the transaction verifies if any other transaction has modified the data during the read. If there is a conflict the transaction is rolled back.

Pig: High level programming language to abstract from Java.

Quorum: n machines must be available to make a read/write.

Referential partitioning: Used to facilitate joins across a shared-nothing network. It is a method of aggressive hash partitioning that attempts to take into account the foreign key relationships of the tables. During the data load, referential partitioning includes an additional step, which involves joining with the parent table to find the foreign key.


Shared Nothing: Each node is independent of all others; this includes independent memory and disk space. This means there is no contention for resources and that the nodes are self-sufficient.

Sloppy Quorum: All reads and writes are performed on the first n healthy nodes.

Split Execution Environment: Divides the environment that queries are executed in, providing the database layer and the application layer (a map reduce layer).

SSTable: This is a key-value file, which is used to provide persistence.

Vector Clocks: An algorithm for creating a partial ordering of events in a distributed system. Messages hold the state of the sending processes clock.

Workload-aware partitioning: By monitoring query patterns and data accesses, data is moved (using graph partitioning) to ensure fewer cross-server transactions, reducing the cost and time to query. When the workload is analysed, the analysis is output as a graph whose nodes are tuples and whose edges record which transactions span which nodes; the edges are weighted by how many transactions occur. This graph is then used to find a partitioning that balances the load and decreases the weight of the cut edges.
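
To make the consistent hashing entry above concrete, the following is a minimal Java sketch; the hash function and node names are illustrative assumptions, and this is not the scheme of any particular database discussed here.

```java
// Minimal consistent-hash ring: nodes and keys are hashed onto the same ring
// and each key is owned by the first node clockwise from its position, so
// adding or removing one node remaps only about k/n keys. The hash function
// (CRC32) and node names are illustrative assumptions.
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;
import java.util.zip.CRC32;

final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    private static long hash(String s) {
        CRC32 crc = new CRC32();                       // simple stand-in hash function
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    void addNode(String node)    { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); }

    // The owning node is the first ring position at or after the key's hash,
    // wrapping around to the start of the ring if necessary.
    String nodeFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no nodes in the ring");
        Long position = ring.ceilingKey(hash(key));
        return ring.get(position != null ? position : ring.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-1");
        ring.addNode("node-2");
        ring.addNode("node-3");
        System.out.println(ring.nodeFor("tweet:12345"));  // owning node for this key
    }
}
```

A sorted map stands in for the ring, so finding the owning node is a single ceiling-key lookup, and adding or removing a node only affects the keys in the adjacent ring segment.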

New database technologies which may be considered for use in Antares are now outlined, using the terms described in Table 3.1.

3.2.2.1 Relational Cloud

Relational Cloud: A relational database as a service, which provides strong consistency using ACID transactions. It uses a central coordinator to manage the system for the user and understand different workloads. To manage the workloads, the system hosts multiple databases on one server; it then analyses these and moves them around as necessary [20].

Schema: A relational schema.

Query Language: SQL. The system uses a decision tree to identify the correct nodes to search for the data, unless there are a lot of requests, in which case it uses a lookup table.


Joins: A coordinator is used to consolidate the database to reduce the amount of distributed joins, but they are possible. Therefore joins are generally executed within the database on a single node.

Open Source: No – MIT’s investigation of cloud based technologies.

Partitioning Type: Workload-aware partitioning: by monitoring query patterns and data accesses, the system moves data (using graph partitioning) to ensure fewer cross-server transactions, reducing the cost and time to query. The workload analysis is output as a graph whose edges record which transactions span which nodes (tuples), weighted by how many transactions occur; this graph is then used to find a partitioning that balances the load and decreases the weight of the cut edges. This results in fewer multi-node transactions as well as load balancing.

Logging: The databases have a combined log, which means they all commit together (group committing).

Locking: Provides locks for the transactions within the system (but tries to reduce multi-node locks).

Storage: Uses a standard DBMS with database servers, supports MySQL, Postgres and JDBC.

Consistency: The database supports strong consistency by using transactions. It uses multi-node transactions but tries to partition the data intelligently, creating as few multi-node transactions as possible since these are more costly and carry a larger overhead.

Availability: Replicates data by partition for availability.

Partition Tolerance: The system is consistent and available; when there are failures the system will repartition and move data to bring it back online.

Scalability: Relational Cloud allows multiple databases on one server. Scalability is achieved through workload aware partitioning, which uses graph partitioning. This limits the scale, but heuristic methods such as blanket statement removal and sampling of tuples and transactions help scalability. The system has not yet been tested on the cloud; however, with 128 tables on eight machines it showed throughput increasing with scale [21].


Performance: Latency is increased because of the privacy and security features in the coordinator; however, this reflects the intended application of the database. To improve performance a router is used to decide the node a query should be executed on and the distribution plan (to help load balancing and scalability). The distribution plan enhances performance by improving the flow of the queries.

Fault Tolerance: Failover is supported through the use of replicas, which are handled by the coordinator while it is distributing the workload.

Elasticity: Graph partitioning of data provides near linear elastic scale-out even for complex transactions.

Applications: Due to the improvements in security this database could be used in the public or private cloud. It is also designed for applications that require high stability and elasticity.

Workloads: The database should be used for skewed and standard workloads. Currently it is suited to OLTP and web applications.

Conclusion: This database provides very good security methods (designed and implemented by MIT). CryptDB is used to encrypt the data in the database and the user can then query the encrypted data. Currently the system is in development and various elements need to be put together for it to work on the cloud. Performance is enhanced using its workload-aware approach to querying and distributing transactions and partitions.

Antares focuses on performance, scalability and high ingestion rates for the Twitter firehose, whereas Relational Cloud focuses on security. The privacy and encryption used in Relational Cloud increase latency, so it is not suited to a system such as Antares which is focused on performance enhancements.

3.2.2.2 HBase

HBase: Apache HBase is a NoSQL column store, which is based on BigTable. Nodes can be added to scale out. The architecture is master-slave; however, multiple masters are used so there are no single points of failure. HDFS is used as storage and the system is tightly integrated with Zookeeper. The database is heavily used and contributed to by Facebook [22].

Schema: There is no schema.

Query language: B-Trees are used to support quick range queries. Additionally map reduce and a HIVE interface are supported. The Hive interface is used to communicate with the underlying Hadoop architecture.

Joins: Joins are not supported.

Open source: The software is open source.

Partitioning type: The partitioning and distribution are transparent. The data is divided into key ranges, called blocks, which are distributed to the nodes in the cluster. When querying, the correct key range is identified using a look-up table, which is denormalised to prevent bottlenecks.

Logging: Any updates to the system are logged to support crash recovery (WAL, write-ahead log).

Locking: There are no global locks, however locks and transactions are available at row level.

Storage: Updates are written in-memory and then periodically committed to disk (HDFS).

Consistency: There is no global consistency; however, the system does offer row-level atomic transactions. Additionally the system supports wider-scope user-defined transactions. Optimistic concurrency control is used, aborting a transaction if there is a conflict.

Availability: Highly available, but if a machine goes down there is a short period of unavailability.

Partition Tolerance: Replication over multiple datacenters supports partition tol- erance.

Scalability: Parallel batch processing using map-reduce supports scalability.

Performance: Facebook uses the database for real-time messaging, relying on load balancing, random sequential access and compression. It is a high performance system; however, this performance is prioritised over durability.


Elasticity: HBase uses a gossip protocol to communicate and a peer-to-peer architecture, which allows the system to add and remove nodes quickly. However this can only be done easily for reads. The data is split into regions, which are saved onto HDFS; each region has multiple replicas across different nodes. Region servers are used to manage the regions held on different datacenter nodes. When new data is added, the current and newly added data may be split across nodes and replication must be considered. To rebalance this data the HDFS has to be re-balanced and the ownership of the data decided within HBase.

Applications: It is used for Facebook’s messaging system, and for big data applications that require random, real-time read/write access and cloud elasticity.

Workloads: It has been designed for write dominant workloads.

Conclusion: HBase is used for Facebook messaging. The system processes large datasets efficiently and quickly. It is eventually consistent and should be used for systems that are focused on availability.

3.2.2.3 MongoDB

MongoDB: MongoDB is a leading document store designed as an available, partition tolerant system. The system uses collections and stores data in the common JSON format. The network is a master-slave architecture [23].

Schema: The database has no schema and uses JSON objects.

Query Language: MongoDB supports the execution of map-reduce functions and employs a rich declarative query language. Additionally it supports ad-hoc queries.

Joins: There are no joins supported.

Open source: The system is open source.

Partition type: There is automatic sharding on a user-defined attribute which is transparent.

Logging: OpLog [24] is used, which resides on a local server; updates and modifications are written to this log.

Locking: There are no locks used.


Storage: Memory mapped B-Trees are saved to memory and then flushed to disk. It uses GridFS [25] for storing objects.

Consistency: The system is characteristically eventually consistent, however ACID transactions can be executed on fields.

Availability: High availability is supported by the weaker consistency model and by never refusing writes, even when there are failures in the system. Additionally the system replicates the data and each of the nodes storing the replicated data may also answer queries.

Partition Tolerance: The master-slave architecture supports partition tolerance as the slaves with replicated data can answer the query.

Scalability: If the system is partitioned it can support very scalable reads and writes.

Performance: If large map-reduce jobs are executed then execution is carried out in batch. This can result in longer return times, which is dependent on the complexity of the query and the size of the data. However this is to be expected when executing a map-reduce job. By using asynchronous replicas, query performance is improved because the master can send requests out at any time.

Fault tolerance: Replication of the nodes for failover and recovery is automatic. Replication is asynchronous and happens at the shard level. The system recovers from a failure by electing a new master. Server failure while a slave is reading from the log to replicate means that data is lost.

Elasticity: The cluster is elastic and supports load balancing.

Applications: The database is specialised for large datasets and supporting simple scaling to process and store it. The system is currently being used by Craigslist, which uses the database to archive information. Foursquare also uses the database for check-ins as it provides a mechanism for geospatial indexing and small location-based updates.

Workload: Reads and writes are scalable so both workloads are suitable for the database.

Conclusion: MongoDB is a popular document store, which provides a rich querying language as well as being very scalable and available. It provides strong local consistency, which reduces conflicts and stale reads and writes.


3.2.2.4 BigTable

BigTable: BigTable is a column-oriented store, structured as one big table addressed by column keys. It uses a distributed storage system, which is partitioned into tablets, each covering a range of row keys. BigTable is a consistent and partition tolerant system, sacrificing availability. The architecture is best suited to many smaller sets of data and has a limit of 100 column families with no restriction on the number of columns. It has no schema, automatically load balances and the structure can be modified dynamically [26].

Schema: The system is made of column families but is very flexible. There is one keyspace (the table) and data is denormalised to reduce response time. The schema is saved in Chubby.

Query Language: It allows simple queries and uses SSTables for easy lookup.

Joins: There are no joins allowed.

Open Source: It is not open source.

Partition type: The data is divided into shards using the row key and these are called tablets.

Logging: There is a commit log per tablet, which is stored in GFS (Google file system).

Locking: It uses a distributed locking service called Chubby, which uses a master to serve requests only as long as there is a majority of active servers.

Storage: All data is stored in GFS and SSTables provide persistence of immutable objects for lookup.

Consistency: The system supports strong consistency with ACID semantics and uses one tablet server per data shard. Transactions are available across entities and entity groups – the latter is limited to five groups to keep performance up. It uses version control to try to reduce stale data, which can be controlled in two ways: the first is by keeping N copies, the second is by keeping data for a set amount of time.

Availability: Availability is limited as a trade off for strong consistency. If a tablet fails then the data contained in that tablet is unavailable until it is fixed.


Partition Tolerance: Replicas support partition tolerance for node failures.

Scalability: The system is very scalable and petabytes of data can be processed across thousands of commodity servers. Simply adding servers to the cluster can support scaling to different capacities [26].

Performance: Petabytes of data can be processed with high performance. The performance is tunable through the master, and the bigger the scale the higher the performance.

Fault tolerance: Replicas are deployed using GFS; execution is synchronous and the master maintains these. Recovery and error detection are also the responsibility of the master, which periodically polls for lock state. If the master receives no communication back, or the server returns with a lost-lock exception, then the master attempts to acquire an exclusive lock on that file. If the server cannot acquire a lock then the server file is deleted.

Elasticity: The master supports elasticity by splitting tablets when the size limit is exceeded and by load balancing the database. Performance is sacrificed in favour of durability, therefore operations may be slower but data is not lost.

Applications: Applications suited to the database are web indexing, Google Earth, Google analytics, Orkut, Personalised search, Writely and Google finance.

Workloads: The system is best suited to large-scale datasets, heavy write loads, throughput-oriented batch processing jobs and latency-sensitive serving workloads.

Conclusion: BigTable is a popular NoSQL database that provides scalability and elasticity. ACID transactions are supported, which sacrifices some of the availabil- ity. The database was created by Google to solve scale problems and is used by the company.

3.2.2.5 HadoopDB

HadoopDB: This system creates a hybrid of map reduce and a DBMS. The database has a shared-nothing architecture, which can easily be executed across a cluster containing commodity hardware. A cluster consists of multiple independent nodes, which are connected using Hadoop. Originally PostgreSQL was used, but the framework changed to a columnar database to increase performance. The system uses HDFS and a map reduce layer. The system uses database connectors for MySQL, Postgres and VectorWise; these allow SQL queries to execute across the database and transform the results to key-value pairs for the Hadoop layer [27].

Schema: It uses tables, so that a derivative of SQL can be used to query the database; joins, aggregation, selection, etc. are therefore included. The most recent version uses a columnar database, where content is stored by column instead of by row to improve performance.

Query Language: HadoopDB uses Pig and HQL; it has a flexible query interface which accepts SQL or map reduce functions. Queries are parallelised using Hadoop, which serves as a coordination layer. To improve performance most queries execute on a single node, which is easier to arrange after the map reduce layer. The map reduce layer is a master-slave architecture, in which the master controls the jobs and the slaves track the tasks.

Joins: Allows joins but these are executed in the application layer by Hadoop.

Open Source: HadoopDB is open source, however there is also a commercial version called Hadapt.

Partitioning Type: Referential partitioning, this is used to facilitate joins across a shared-nothing network. It is a method of aggressive hash partitioning that attempts to take into account the foreign key relationships of the tables. During the data load, referential partitioning includes an additional step, which involves joining with the parent table to find the foreign key. This can be extended to an arbitrary number of tables.

Logging: A catalogue stores information about the partitions, datasets in the cluster and replica locations.

Locking: It is assumed that storage will be PostgreSQL or VectorWise, therefore there would be locks used for transactions.

Storage: Uses HDFS as a distributed file system.

Consistency: It is assumed that the framework has strong consistency using ACID transactions as the underlying databases do.


Availability: The framework is highly available across a cluster [27].

Partition Tolerance: If a partition fails then the performance degrades dramatically and can even fail until it is brought back up again.

Scalability: HadoopDB uses map reduce to scale-out at a lower price and is highly scalable by allowing distributed processing of large data sets. Hadoop’s ability to schedule tasks supports scaling out to thousands of nodes.

Performance: The framework provides high performance and efficiency by pushing the query processing into the underlying database system. Similar to the parallel database, it provides load balancing, flexibility and extensibility.

Fault Tolerance: HadoopDB achieves fault tolerance by restarting tasks that have failed on other nodes. Its fault tolerant properties are similar to that of Hadoop.

Elasticity: There is a global and local hasher which partitions and load balances the partitions between the nodes in the cluster, which additionally provides a dynamic method for repartitioning data.

Applications: HadoopDB is best suited to cloud and big data applications and has been used for analysis of data within the semantic web for biological data analysis [28].

Workloads: Workloads that are analytical and use structured data are best suited for HadoopDB.

Conclusion: This is a system that continues to be extended and improved to generate a database using Hadoop technologies. It started by using PostgreSQL, but it is now looking into using a columnar DBMS.

3.2.2.6 Infinispan

Infinispan: Infinispan is a NoSQL, data-as-service, key-value store with a data grid architecture. It has a REST interface and is a consistent available system [29].

Schema: It is a key-value store and so has no schema. It uses Hibernate OGM (for persistence), which does not impose any particular schema either, although JPA does have a class schema.


Query language: It uses map reduce functions and supports querying using Hibernate Search and Apache Lucene [30].

Joins: There are no joins supported.

Open source: It is open source.

Partitioning Type: It is a data grid acting as a main-memory cache and so does not have partitions; it extends cloud elasticity to the data layer.

Logging: It uses a log for transactions.

Locking: Locks are used, but with non-blocking algorithms. Lock acquisition has been improved so that only locks on a single node are taken; this prevents deadlocks across multiple nodes, which would make further updates impossible.

Storage: Data is stored in-memory in a distributed cache. It is persistent and provides a very large heap.

Consistency: It provides ACID transactions and is highly concurrent.

Availability: Highly available. Nodes can fail but the system will still work. It has replicas across the network, as well as the ability to persist state to configurable cache stores.

Partition Tolerance: Consistency is chosen over partition tolerance.

Scalability: It is extremely scalable, as data is distributed equally and there is no limit to the size.

Performance: Response times scale linearly and latency is low because data is held in-memory.

Fault tolerance: There are replicas across the system, which supports high fault tolerance.

Elasticity: It is an elastic system, which is best suited for elastic data and the cloud.

Applications: Enterprise and cloud applications.

Workloads: Big data applications with data flows, which contain bursts of events that require consistency.

Conclusion: Infinispan is an in-memory key-value store, which is highly scalable with transactions for consistency. It is elastic, which makes it very suitable for the cloud.

3.2.2.7 Titan

Titan: Titan is a distributed graph database, which can be used in conjunction with other databases. The database supports complex algorithms, which can be executed across the system to provide complex analysis of large datasets [31].

Schema: It uses a graph representation.

Query Language: It uses Gremlin, Frames and Rexster for traversal.

Joins: It does not support global graph operations, so it does not provide joins. “Titan is a scalable OLTP graph database focused on handling a large number of concurrent transactions against a single graph” [31].

Open source: The system is open source.

Partition type: The partitioning of the database depends on the database used for storage.

Logging: The system has no logging.

Locking: Locking is user-defined.

Storage: The storage is pluggable and a variety of databases is supported such as Cassandra and HBase.

Consistency: This is again dependent on the database being used for storage.

Availability: Once again this is dependent on the storage used.

Partition Tolerance: Once again this is dependent on the storage used.

Scalability: It is very scalable and capable of supporting 3 billion nodes, 100 million vertices, 10,000 concurrent users and 50 machines [31].

Performance: The database supports concurrent transactions and can execute 49 million transactions in 2.3 hours.

Fault Tolerance: Fault tolerance is tunable and user-defined.


Elasticity: The database is designed for simple addition and removal of nodes.

Applications: It should be used for large-scale data that has complex relations that would be best represented by a graph.

Workloads: Quick reads would be the appropriate workload for the database as it is graph based.

Conclusion: This framework is best suited for extending existing systems’ analytical capabilities. The graph can be used for quick reads and analysing complex relations. However the characteristics of the database are largely based on the system that it is extending.

3.2.2.8 SimpleDB

SimpleDB: This is Amazon’s key-value store. Each key has a document, which supports multiple indexes. Documents are held in domains, and there is the ability to have domains in domains. The domain indexes are automatically updated [32].

Schema: There is no schema.

Query Language: The querying language is simple, with get, set and delete.

Joins: There are no joins allowed.

Open source: The system has no open source version.

Partitioning type: There is no automatic sharding.

Logging: There are no logs as there are not any transactions. Additionally the system does not use MVCC so there is no mechanism for identifying clashes on the client side.

Locking: There are no locks as the system is eventually consistent and executes concurrent writes.

Storage: The data is stored in commodity servers in a data centre.

Consistency: The system is eventually consistent. It provides both a consistent read and an eventual read; however, [33] found that there were more stale reads when the consistency was set to “consistent read”. Eventual read also has lower latency.

Availability: It is highly available, even if some of the network is down.


Partition Tolerance: Even if there are failures or connectivity issues the system will still be available, therefore the system is partition tolerant.

Scalability: Due to its simplistic approach the system is very scalable.

Performance: Performance is tunable, and exploiting its advantages, such as sorting attributes, will support high performance.

Fault tolerance: Fault tolerance management is automatic and replication is asynchronous.

Elasticity: Amazon handles scaling and automatic elasticity.

Applications: The system is used for web and cloud applications.

Workloads: The system is suitable for high write loads that are not bursty.

Conclusion: Amazon SimpleDB is a cloud service, which does not require user management. Amazon provides the database as software as a service. It provides the user with a lot of support, as well as a simple and easy way of storing data.

3.2.2.9 Neo4j

Neo4j: Neo4j is a leading graph database implemented in Java, which uses node and relationship properties. It provides the user with a REST interface for easy access and is a consistent and available system [34].

Schema: There is no predefined schema so it can be easily evolved.

Query language: It is a graph database so employs traversal and pattern matching.

Joins: The graph traversal is the equivalent of a recursive join.

Open source: The database has a dual license, so there is an open source version and an enterprise version.

Partitioning Type: The database can be partitioned, however this must be done manually.

Logging: Neo4j uses logging for transactions. The slaves log the transaction they have committed, which is then propagated to the master.


Locking: Locks are kept on the nodes being modified when adding, moving or removing data; each of these operations is treated as an ACID transaction.

Storage: The storage is a disk based Java persistence engine and there is an optimised storage manager for storing graphs. It is SSD ready.

Consistency: Neo4j uses ACID transactions for mutable operations and provides strong consistency.

Availability: The enterprise server has a high availability feature, which enables a fault tolerant database architecture, and horizontally scaling read-mostly architecture. It can be made fault tolerant by ensuring the slaves contain an exact replica of the master.

Partition Tolerance: Graph databases are inherently hard to partition, therefore the database is not partition tolerant as there are no partitions. If a section becomes unreachable then the whole system becomes unavailable until the problem is fixed; this ensures there are no inconsistent writes.

Scalability: The database scales horizontally to billions of nodes and relationships and allows multiple databases on one server.

Performance: High performance, with fast queries through traversals. Frequently outperforms relational back ends.

Fault tolerance: Slaves are used to make replicas of the master nodes to provide fault tolerance, which makes it a very robust system.

Elasticity: New nodes can be added dynamically, and relationships can be created between them.

Applications: The database is used for enterprise computing applications and data with relations. It performs efficiently for graph traversals when querying data, which means it is effective for connected data which would require multiple joins in a relational database.

Workloads: Read heavy workloads and analytical workloads where traversal and connectivity of data is important. External caching is needed for low latency reads.

Conclusion: Graph databases are becoming increasingly popular for use by social media sites to support networking and operations such as “a friend of a friend”. Neo4j provides users with ACID transactions and powerful querying capabilities for recording relationships between data.

3.2.2.10 Riak

Riak: Riak is an advanced key-value store, which can be treated like a document store. The system has no master, which means that all of the nodes in the system are equal. The database is distributed and a derivative of Dynamo, and uses a REST interface for connections. The system is networked in a ring and divided into partitions, each of which is managed by a vnode. Each data point is stored with a primary key, which is used for querying; there are no secondary indexes, and the rest of the data is stored in JSON format [35].

Schema: Riak has no schema and there is no set data type as all data is converted to JSON, which also means that any language can be supported.

Query Language: The database supports map reduce. It does not support secondary indexing; however, objects can be linked (the number of these is limited), which does support range queries. Possible querying operations are limited to get, put and delete; full text search is optional but can increase latency.

Joins: No joins are supported; they must be executed in the application layer.

Open source: The database is open source.

Partitioning Type: A hash partition is used on the primary key to partition the database. The network ring is divided into equal partitions to support load balancing. A gossip protocol is used to manage which nodes are in charge of which partition.

Logging: Statebox [36] can be used to store write state to increase concurrency.

Locking: The database does not have locks to prevent write conflicts and instead supports high availability by allowing writes to the database at any time. Consistency is maintained using semantic reconciliation, where the statebox merges differences.

Storage: The storage is in-memory, and pluggable storage backends are also supported.

Consistency: The system is eventually consistent and therefore does not support ACID transactions. The consistency is tunable and depends on the number of replicas; each read can request a different degree of consistency across the replicas. It has optimistic concurrency and uses an MVCC derivative. Vector clocks are used for versioning.

Availability: If a vnode ceases to execute due to a failure then the other nodes take over the failed node’s tasks, which provides the user with high availability. Any node within the network can service a request, so it is possible for the system to always accept a read or write; however, this does depend on the consistency level.

Partition Tolerance: Riak is a highly available system and if a partition is down consistency will be sacrificed to still allow writes.

Scalability: Riak is a distributed, horizontally scalable database, which scales to 10s of nodes for the enterprise version.

Performance: Riak can use pluggable storage engines; the recommended one is Bitcask, which supports tunable performance and durability. There is no single point of failure and maintenance can be carried out with a rolling restart.

Fault tolerance: Replicas are deployed which support fault tolerance by taking over the tasks of failed nodes. Additionally it supports hinted handoff, so when the node is back online the new data can be written there. There is also no single point of failure. It uses the gossip protocol to identify who is alive.

Elasticity: Riak is elastic; it allows the addition and removal of nodes automatically, as well as providing automatic load balancing. This works well for small loads but is completely unresponsive for larger loads. When a new node is added partition ownership is redistributed and data is transferred immediately.

Applications: This is a scalable database with a REST interface and is therefore suited to web applications. It is fully elastic so can be used in the cloud. It should only be used if consistency isn’t the main concern and availability is the priority.

Workload: The system is suited to intensive read and write workloads because of its tunable consistency.

Conclusion: Riak provides a more advanced key-value store, which is scalable and provides a REST interface. Although it is essentially an available, partition tolerant system, consistency can be altered and tuned for the application’s specific requirements. Storage for the system is also pluggable, which supports more flexibility.

3.2.2.11 Conclusion

This literature review identified many different technologies; these were evaluated and one was chosen to extend and use within Antares.

Relational Cloud provides an SQL solution to scalable processing however the emphasis is on security making a trade off between that and performance. Antares does not require high security and instead aims to improve performance.

SimpleDB provides scalable database solutions, however no complex querying is supported and it was therefore not appropriate for Antares.

Databases [35] and [22] predominantly use map-reduce and have no secondary indexing, therefore making geospatial querying more complex and not suitable for Antares.

HadoopDB focuses on partitioning the database to support scale-out functionality for the cloud. This did not meet Antares’s requirement for high-rate ingestion of large scale data to be stored and processed.

Graph databases can be notoriously hard to partition therefore Titan and Neo4j were not the optimal solution for the system.

After reviewing all of the different technologies, the first decision was to use an open source technology so that experimentation was possible. The database selected was Cassandra due to its scalability. In particular, it supports high write rates, which is important given the rate at which Tweets from the firehose need to be stored. Additionally the data store is simply and easily scaled horizontally, with straightforward indexing for more complex querying. This gives it the potential to provide low latency responses to queries. Cassandra is now examined in more detail.

3.2.2.12 Cassandra: in Detail

In Cassandra each dataset is contained within a keyspace; these are the equivalent of a database in an RDBMS. A keyspace contains column families (tables), shown in Figure 3.1, which hold a set of rows. Each row has a unique identifier – a row key. Each row can have a different number of columns; there are static and dynamic column families. Static column families use a static set of column names, although the number of these used in each row can vary depending on the data ingested. Dynamic column families store arbitrary column names taken from the data.

Figure 3.1: A Cassandra column family
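To make the keyspace and column family concepts concrete, the following sketch creates both through the DataStax Java driver (2.x era). The keyspace name, table and columns are hypothetical placeholders rather than the actual Antares schema, which follows the Tweet data model in Section 3.9:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Hypothetical sketch: a keyspace plays the role of a database and a table
// (column family) holds rows identified by a row key, here the day a Tweet
// was posted, with clustering columns ordering the rows inside a partition.
public class CreateTweetTable {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // "Simple strategy" replication, described later in this section:
            // each row is also copied to the next node clockwise on the ring.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS tweets_ks WITH replication = " +
                "{'class': 'SimpleStrategy', 'replication_factor': 2}");

            // A static column family: a fixed set of column names, although a
            // given row may leave some of them unset.
            session.execute(
                "CREATE TABLE IF NOT EXISTS tweets_ks.tweets_by_day (" +
                "  day text, created_at timestamp, tweet_id bigint," +
                "  user_name text, body text, lat double, lon double," +
                "  PRIMARY KEY (day, created_at, tweet_id))");
        }
    }
}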

A Cassandra cluster is assembled using a ring structure as shown in Figure 3.2. There is no master node and each node is responsible for managing the data stored on that node only. Adding nodes to the cluster supports horizontal scaling. To improve performance, as a node is added, the data is automatically load balanced across the cluster in milliseconds.

Data is partitioned across nodes in Cassandra to improve performance and dependability. How this is done depends on the partitioner used and the replication factor. The default partitioner for version 2.x of Cassandra is a Murmur3Partitioner, which uniformly distributes data across the nodes using MurmurHash hash values. The row key is then used to distribute column family data across nodes in a cluster.


Figure 3.2: A Cassandra cluster

The hash of the row key is calculated and used to give the data a token. The token is used to place the data on a node, as shown in Figure 3.2. By using the row key to partition data the rows are never divided, therefore decreasing the response time of queries as only one node will need to be queried. The set of nodes can be viewed as a ring divided into different ranges. Each hashed key is part of a range within the set of nodes. For the cluster used by Antares, the ranges are equal, as each of the machines has the same specification.

Tokens are used to load balance the cluster. Each token informs the node of the range of data it is responsible for. Queries are distributed to any node in the cluster and then directed to the node which owns the data – this can be determined using the token. When an insert is accepted by a node in the cluster the row key is hashed to calculate the token. The token is then used to locate the node the query needs to be executed on. If it is data which needs to be appended to a row currently residing in the store then the request is submitted to the node which owns that data. If it is a new row it can be written directly to the node as rows are not stored sequentially.
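The placement logic can be sketched in a few lines of Java. This is an illustration of the idea only: Cassandra's Murmur3Partitioner is replaced by a stand-in hash, and the node names and tokens are invented for the example:

import java.util.TreeMap;

// Illustrative sketch only: each node owns the range of tokens up to its
// position on the ring; a row key hashes to a token, and the owner is the
// first node at or after that token, wrapping back around to the start.
public class TokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String nodeName, long token) {
        ring.put(token, nodeName);
    }

    long tokenFor(String rowKey) {
        // Stand-in for MurmurHash: any deterministic hash works for the sketch.
        return rowKey.hashCode();
    }

    String ownerOf(String rowKey) {
        long token = tokenFor(rowKey);
        // First node clockwise from the token, wrapping to the start of the ring.
        Long owner = ring.ceilingKey(token);
        return ring.get(owner != null ? owner : ring.firstKey());
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode("node-1", Long.MIN_VALUE / 2);
        ring.addNode("node-2", 0L);
        ring.addNode("node-3", Long.MAX_VALUE / 2);
        System.out.println("row key 'tweet:42' is owned by " + ring.ownerOf("tweet:42"));
    }
}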

A gossip algorithm is used to distribute information around the cluster about additional nodes or failures. The replication strategy used in Antares is “simple strategy” – this means that each row is saved on a designated node, and also on the next node along the ring, moving clockwise. This provides fault tolerance for the cluster by ensuring no single point of failure. The cluster uses the token-aware load balancing policy, which evenly distributes queries across the cluster. They are then either satisfied by the receiving node or sent to the correct one for execution.

Without the correct design, using Cassandra for scalability, fault tolerance and speed becomes inefficient and impractical, limiting data analysis rather than empowering it. Antares optimises the database structure for efficient querying. Stream processing and database systems have now been reviewed; the next section identifies current combined querying systems.

3.2.3 Combined Processing

The ability to combine stream and historic processing quickly and efficiently can be a difficult problem to solve. This has been exacerbated by the sudden increase in the velocity and volume of datasets such as those generated by sensors and social media. New mechanisms and technologies are therefore being used to support “hybrid” processing. This section investigates different current systems which use combined processing of voluminous data, which are then compared with Antares to describe the differences, similarities and contributions of Antares.

3.2.3.1 StreamInsight

StreamInsight analyses event data being streamed in from multiple sources and gains insight through historical data mining [37]. It provides a framework that sits on top of Microsoft SQL Server and is free to download. A dashboard is available for users to interact with and gain support from; it provides graphical displays of the results. The system has a flexible schema based on the stream data it is collecting, however the underlying database is restricted to SQL relations.

Antares extends Cassandra and is therefore not restricted by SQL relations; it supports a mechanism for storing data where performance is not limited by relations and joins. The data is denormalised in the database, which allows querying of only one “table” – this is because an index is created on the data relation and stored in one table. The database then only needs to read from that table, as opposed to reading from two tables and joining the correct subset of data. If another relation is required, a table (or index) is created for that too; this duplication is not a problem as the database can be scaled out. The system is made durable by logging operations to an event log.
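As a hypothetical illustration of this denormalisation (the keyspace, table names and columns are invented for the example, not the Antares schema), a second table keyed by the new relation is created and every Tweet is written to each table it is relevant to, so each query reads from exactly one table:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Sketch only: instead of joining, a further "index" table is created whose
// partition key is the relation being queried (here, hashtag). Each Tweet is
// written to every such table, trading duplicated storage for single-table reads.
public class CreateHashtagIndexTable {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("tweets_ks")) {

            session.execute(
                "CREATE TABLE IF NOT EXISTS tweets_by_hashtag (" +
                "  hashtag text, created_at timestamp, tweet_id bigint, body text," +
                "  PRIMARY KEY (hashtag, created_at, tweet_id))");
        }
    }
}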

StreamInsight uses query checkpointing to provide high availability, whereas Antares uses asynchronous execution of queries for this. This allows queries to execute in parallel, so availability is increased as more queries can be accepted rather than waiting for one query to finish before accepting another (synchronously).

StreamInsight has no simple way of scaling out: although additional nodes can be added, there is no suggested mechanism for how a query would be executed over multiple nodes with the relational data. The data may be related to data in tables on another node, which would therefore need to be joined in the application layer, decreasing performance and increasing query response time. In contrast, Antares supports the addition of nodes to the cluster even while the cluster is running; the data is then automatically load balanced by Cassandra to support distributed execution for increased performance. StreamInsight data can be partitioned but it does not support partition tolerance.

It is a high performance system supporting response times of milliseconds. Applications it is suited to are financial trade feeds, operational data from sensor networks, manufacturing equipment and data center monitoring infrastructure. This is a commercial system that provides stream and historical analysis for businesses. It provides analysis of complex events and exploits the advantages of the underlying Microsoft SQL Server [38].

3.2.3.2 Truviso

Truviso [39] is an append-only stream processing system, which uses continuous queries to incrementally update input streams. It uses CQL from the Stanford STREAM project [40]; this is the same query language that Antares uses as the basis for describing its own querying language. Truviso is used for counting and aggregate functions, whereas Antares supports temporal and geospatial querying, for which incremental updates are too simplistic.

The system ingests a stream and produces a processed stream as output. The system uses raw and historical streams: the raw streams come from sensors (real-time) and the historic ones from databases, which allows the two to be combined. A typical query divides the stream into smaller streams; these are processed and historical data is included in the analysis if required. The result is then converted into a result stream. Antares uses a similar technique, but it is based on the temporal nature of the query, whereas Truviso divides the streams to help with performance only – Antares does include the stream analysis to support better performance as well.

Truviso is used for applications in the financial market, monitoring systems, real-time decision support, fraud detection (cross-channel fraud) and online gaming (detecting malicious behavior and monitoring QoS). The project is not open source, but provides an interesting solution to streaming in conjunction with the SQL database PostgreSQL. These applications take advantage of the transactions used within the system; however, Antares has no requirement for transactions as they would only decrease performance, which is the key characteristic of the Antares system.

3.2.3.3 Cloud Dataflow

Cloud Dataflow [41], [42] is a Google product for unified programming; the SDK is available open source. This product was introduced in 2014 by Google and emphasises the importance of combining batch processing and real-time stream analytics. It uses windowing techniques to add temporal querying. This is the same as Antares, which also uses time windows, however Antares uses the time window to identify which query to execute. Dataflow uses a layered technique as well to provide stream and historic querying, however the time windows specified must be for either historic or stream and different queries must be specified for each. Antares supports one querying language for both and uses a query monitor to execute these by identifying the time window the user has specified (past, current, future or both). Antares also supports combined querying which supports triggered and simultaneous querying.

It supports a wide range of data processing scenarios including session analysis, anomaly detection and funnel analysis. Antares focuses on Twitter analysis in this thesis but has the ability to be used for other applications.

The service is fully elastic as it is managed and deployed in Google’s cloud. Therefore resources are added and removed automatically as required for processing. This is different to Antares, where nodes must be added and removed manually; however, load balancing is automatic and there is no requirement to shut down the cluster for this adjustment. These features mean Dataflow and Antares are both horizontally scalable.

Fault tolerance is managed by the service as well, and fault tolerant, consistent execution is guaranteed regardless of the size of the data or cluster and the complexity of the data. Antares supports fault tolerance through re-tries and works with eventual consistency – every write will eventually be correct.

The system is best suited to applications with high-volume computation, workflow synthesis, and extract, transform, load (ETL). The system provides a managed service for users to use quickly and efficiently.

3.2.3.4 Spark

Spark [43] combines a stack of high-level tools and supports their combined use. The tools are: Spark SQL, MLlib, GraphX and Spark Streaming. These tools support streaming and historic analysis, and Spark can be used to manage and create workflows for them. This is similar to Antares, which uses a layered approach to provide improved specific features – in this case stream, historic and combined analysis. However, Antares uses the query monitor to provide a transparent layer through which any query is mapped to the correct querying mechanism.

Spark allows pluggable storage to hold the data before and after processing. It is a high performance system, which executes programs up to 100 times faster than Hadoop MapReduce [44] for in-memory computations, and ten times faster for disk-based analytics. Applications can be written using a variety of languages, exploiting a set of inbuilt operators. It can be deployed in most environments and supports a wide range of data sources. Spark is predominantly a workflow engine which combines sources, not a store itself. Spark is a similar system to Antares, supporting more tooling for analysis, but with no transparent input of a variety of different querying techniques.

3.2.3.5 Summingbird

Summingbird [45] is a hybrid system which combines streaming and map reduce functions to execute on Storm and Scalding. It has three modes: batch, stream and hybrid (which uses both). This is the same as Antares; however, as mentioned for other combined systems, it is missing the transparent submission that Antares has.

It sacrifices fault tolerance to provide quicker, more efficient responses, eradicating the slow response times of batch and map reduce functions executed in Hadoop. Summingbird is a layer on top of Storm and Hadoop which supports the mapping of both stream and batch processing. As has been mentioned, Storm shreds data if the ingestion rate becomes too high; this is not a feature that Antares implements – it uses buffers and has a higher ingestion rate so that data does not need to be shredded. Therefore Storm, the underlying technology of Summingbird, would not be suitable. Storm has also been changed and extended to create Heron, which adds to the functionality and is a replacement for Storm, so it too would not be suitable.

State is updated incrementally because of the large batch jobs that are executed over the data. Antares identified the need for faster combined processing of large datasets, however it uses optimisations to index and ingest data using a NoSQL store to allow for low latency querying and efficient combination of stream and historic processing.

3.2.3.6 Lambda

Lambda [46] is an architecture that uses Spark, Kafka [47], Akka [48], Cassandra [49] and Scala [50]. Like Antares, it provides a layered approach to support specific queries. However, as has been mentioned for other systems, there is no transparent submission, whereas Antares does have that feature.

Lambda supports a variety of different analytics, supporting the user across a wide range of processing, whereas Antares focuses on providing scalable and low latency querying for Twitter data, providing the user with easy and transparent methods of submitting any query through a user interface; the system itself identifies which query to execute.

It supports fast access to historical data on the fly for predictive modelling with real-time data from the stream. It is designed to handle massive amounts of data by taking advantage of both batch and stream processing. This is similar to Antares, which provides stream, historic and combined querying. It uses Spark to combine both types of processing. It supports complex analytics using Kafka for stream processing and Cassandra to store historic data. Cassandra is used within Antares too for its scalable and flexible characteristics.

This became an Apache project in 2014 and demonstrates the importance of combining stream and historic data efficiently.

3.2.3.7 Conclusion

Antares required scalable storage and efficient ingestion of data. [43], although a high performing system, required additional tools to meet this aim and is predominantly a workflow engine, which does not meet the requirements of Antares. [45] uses map reduce, which requires large batches of data to be analysed for a longer period than was suitable for use within Antares; Antares aimed to return results in near real-time when querying the database so that they can be combined with the stream in a timely manner. [46] and [41] were not available when Antares was first designed and have been added to the related work. [51] does not provide the scale necessary for Antares without complex partitioning; therefore it did not make sense to use a tool which, unlike Cassandra, is not designed specifically for scaling horizontally. [52] is not open source and requires the addition of a database to allow for storage.

Antares utilises open source software which can quickly and efficiently process streams and query a database. Therefore it was decided to exploit the scale of ESPER and Cassandra, to extend them and provide low latency querying for stream, historic and combined analysis. Antares also supports querying of complex data types such as geospatial data; therefore the next section identifies techniques for processing these and scaling out.


Antares provides a quick and efficient mechanism for entering user-defined parameters into an interface for interaction and visualisation. The framework is an optimisation of existing technologies used to overcome slow response times from query execution over large datasets that have been saved to disk.

Truviso decreases performance for transactional characteristics, which does not meet the requirements of Antares – which are based around scale and performance. Therefore, it is not a suitable system for use.

Dataflow is a high performance stream and historic analysis system designed and implemented by Google and based in the cloud with elastic features for scaling in and out automatically. However it does not feature the combined querying mechanism of Antares or the simplistic transparency of inputting one query for any querying mechanism - stream, historic or combined.

Spark is a similar system to Antares, supporting more tooling, however, it has no transparent input of a variety of different querying techniques.

Summingbird did not have the desired features of Antares: it did not provide the buffering and high ingestion rates of Antares. It is also based on Storm, which has now been replaced by Heron, and implements the shredding technique.

Lambda is very similar to Antares and provides a layered approach with a wide selection of tools; however, it does not have the ease of use and transparent submission of Antares. Antares also focuses on and supports scalable Twitter analysis with low latency querying and high throughput. This is different to Lambda, which allows the user more freedom to design and build their own system, whereas Antares has already been designed and implemented to provide large scale ingestion and low latency querying.

Antares provides a transparent means of submitting stream, historic or combined queries through a user interface to support scalable and low latency Twitter analysis. It is different from other layered approaches as it gives the user optimised functionality and high performance with little effort from them. It exploits temporal indexing features to enable large scale Twitter analysis.


3.3 System Architecture

The previous section justified the choice of a combination of ESPER and Cassandra to provide the scalability, low latency and high ingestion rates described in features one to three of the Antares requirements. The literature review also identified combined analysis systems and compared them to Antares: Antares takes a more focused approach to scale and high ingestion rates of temporal and geospatial data than other systems, and it supports transparent submission of queries independent of what type of query is being executed. This section describes the architecture which supports this, then moves on to the mapping between the architecture and the querying mechanisms.

Antares is a scalable, low latency analysis tool, which provides complex processing of historic and stream data and the ability to combine both. One motivation behind the design and development of the system was to exploit the scalability of NoSQL technologies to ingest large data streams. Scaling out databases so they can support fast querying of large-scale data can be challenging in a system which is continuously ingesting new data. Having the ability to return query results in near real-time allows the user to combine these results with stream processing results in a timely manner, supporting insights from both past and current events. The queries executed by Antares can be divided into three categories: stream, historic and combined.

Antares is based around two technologies: ESPER and Cassandra. Optimisations were designed to remove the inefficiencies of querying both stream and historic data. There is a web interface that accepts queries from the user. Antares has a query monitor for interpreting whether a query is stream, historic or combined; this solves the problem of how to query the different technologies. The system accepts stream data for storage and querying simultaneously. When a Tweet is ingested it can be processed in ESPER if a stream query is executing. Each Tweet is also stored in Cassandra to enable historic queries; this happens concurrently as queries are executed over the store.


Figure 3.3: Antares: abstract architecture

As a query enters Antares it is passed to the query monitor shown in Figure 3.3. Here the time parameters are extracted to identify whether a stream, historic or combined query should be submitted. As a Tweet enters Antares it is stored in the historic store; this is shown with the dotted arrow in Figure 3.3. The Tweet is also passed to the stream query if one is being executed.

3.3.1 Stream Querying

A stream query is executed when the time window both starts and ends at or after the current time. The query is sent to the query monitor, which identifies that this is a stream query and sends it to ESPER. The query monitor uses the parameters set by the user through the web interface, together with the time-window, to construct a query for ESPER. Once the query is sent to ESPER a continuous query is executed over the time-window and results which meet the constraint specified by the query are returned. This is shown in Figure 3.4.

A Java class was designed and implemented to represent all of the values specified in the Tweet data model in Section 3.9. This class enables the ESPER query engine to use Java objects (the pre-defined Tweet object) to identify different parts of the Tweet metadata and query them. As each Tweet enters Antares, if a stream query is executing then the Tweet is passed to ESPER. ESPER then uses the Java class to convert the Tweet into a Tweet Java object. Once this has been done the Tweet object is inspected for any information that is related to the query constraint. For example, when an aggregate query is being executed, the count is increased if the Tweet meets the query constraint.
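As an illustration of this flow (an Esper 5.x-style sketch; the Tweet class and the EPL statement are simplified stand-ins rather than the actual Antares code), a continuous aggregate query over a sliding window can be registered and then fed Tweet objects as they arrive from the stream:

import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

// Hypothetical sketch of a continuous stream query in Esper.
public class StreamQuerySketch {

    // Simplified stand-in for the Tweet class built from the data model in Section 3.9.
    public static class Tweet {
        private final String text;
        public Tweet(String text) { this.text = text; }
        public String getText() { return text; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("Tweet", Tweet.class);
        EPServiceProvider esper = EPServiceProviderManager.getDefaultProvider(config);

        // Continuous aggregate query over a sliding time window: count matching
        // Tweets and push the running total to the listener whenever it changes.
        EPStatement stmt = esper.getEPAdministrator().createEPL(
            "select count(*) as tweetCount " +
            "from Tweet(text like '%#newcastle%').win:time(30 sec)");
        stmt.addListener((newEvents, oldEvents) ->
            System.out.println("matching tweets in window: " + newEvents[0].get("tweetCount")));

        // Each Tweet entering Antares would be converted to a Tweet object and
        // sent into the engine while the continuous query is running.
        esper.getEPRuntime().sendEvent(new Tweet("hello from #newcastle"));
    }
}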

All continuous query metadata is held in memory by ESPER for each query executing at the time. Once the query constraints are met the result is taken out of memory and passed back to the query monitor to be displayed to the user. The size of the dataset is constrained by the amount of RAM; this is why the historic store is used to support scale. It is assumed that the user will execute stream queries more frequently and for shorter periods of time, whereas the database is used for wide-ranging and scalable execution across long time periods.

ESPER supports 10,000 events per second [53]. It is therefore assumed that for Antares this rate is high enough to provide reliability, so ESPER will not block or drop messages.

Figure 3.4: Antares: stream querying architecture


Now the syntax of the language will be described. Queries can be constructed in the form of:

SELECT a1...an FROM s <WHERE p> <HAVING p>

where a1...an is a projection list, s is a stream of Tweets with a time window which represents a period of time from the current time to a point in the future, and p is a predicate. The angular brackets <> denote an optional term such as WHERE and HAVING; one or none of these may be contained in the query. There are restrictions on stream queries: the time window is constrained from the present to any point in the future. The syntax for this language in this thesis is focused on Twitter analysis so must use the data model specified (this is due to the focus of the thesis: to explore scalable Twitter analysis framework solutions). This also means the FROM clause should only contain one stream. The projection list must also not contain aggregate functions other than COUNT. The evaluation of the stream results in a stream of the projection list variables that adhere to the predicate.

3.3.2 Historic Querying

A query is historic if both time parameters are specified in the past. The query monitor, which examines the parameters entered by the user through the web interface, verifies this. A query is then constructed consisting of the time period and any other parameters specified by the user. The query is then executed by Cassandra, as shown in Figure 3.5, and the results are returned to the user.
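A sketch of this historic path using the DataStax Java driver (2.x era) is shown below. The keyspace, table and bind values are hypothetical placeholders; the real schema follows the Tweet data model in Section 3.9:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.Date;

// Hypothetical sketch of an historic query: both time parameters are in the
// past, so the query monitor builds a CQL statement over a time-partitioned
// Tweet table and returns the matching rows to the user.
public class HistoricQuerySketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("tweets_ks")) {

            Date start = new Date(System.currentTimeMillis() - 2 * 60 * 60 * 1000L); // two hours ago
            Date end = new Date(System.currentTimeMillis() - 60 * 60 * 1000L);       // one hour ago

            // The partition key (day) restricts the read to one partition, and the
            // clustering column created_at supports the time-window slice.
            ResultSet results = session.execute(
                "SELECT tweet_id, body FROM tweets_by_day " +
                "WHERE day = ? AND created_at >= ? AND created_at <= ?",
                "2016-06-01", start, end);

            for (Row row : results) {
                System.out.println(row.getLong("tweet_id") + ": " + row.getString("body"));
            }
        }
    }
}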

Now the syntax of historic querying will be described. Queries can be constructed in the form of:

SELECT a1...an FROM h <WHERE p> <HAVING p>

where a1...an is a projection list, h is a database of stored Tweets with a time period from a point in the past up to the current time, and p is a predicate. The angular brackets <> denote that a term is optional, for example WHERE and HAVING; one or none of these can be used in the query. There are restrictions on the historic queries: the time window must start in the past and end no later than the current time. The FROM clause is primarily constrained to the schema used in this thesis to improve performance. The projection list must also not contain aggregate functions other than COUNT. The evaluation of the historic store results in the projection list variables that adhere to the predicate.

The following pseudocode routes a query to ESPER or Cassandra depending on the time-window specified by the user. This is done transparently to the user: the Query Monitor determines whether the query should be a stream, historic or combined query. The condition could be a threshold or a hashtag; the pseudocode generalises the process of mapping queries to the appropriate layer.

Algorithm 1 Execute Query
input:     date starttime    # beginning of the time window
           date endtime      # end of the time window
variables: date now = currenttime()
if endtime <= now then
    find schema
    execute query over Cassandra
    return query result
else if starttime >= now then
    execute query in ESPER
    return results each time the condition is met
else if starttime < now AND endtime > now then
    compile ESPER query
    compile Cassandra query
    execute query over Cassandra and ESPER
    return query results
end if
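The same routing can be sketched in Java as follows. This is an illustrative outline only: the Query type and the executeHistoric and executeStream methods are hypothetical stand-ins for the historic driver and the ESPER layer described in the following sections.

import java.util.Date;

public class QueryMonitorSketch {

    // Routes a user query to Cassandra, ESPER or both, based only on its time window.
    public void dispatch(Query query, Date startTime, Date endTime) {
        Date now = new Date();
        if (!endTime.after(now)) {
            executeHistoric(query, startTime, endTime);       // window entirely in the past: Cassandra only
        } else if (!startTime.before(now)) {
            executeStream(query, startTime, endTime);         // window entirely in the future: ESPER only
        } else {
            executeStream(query, now, endTime);               // combined: stream part starts immediately
            executeHistoric(query, startTime, now);           // historic part covers start time until now
        }
    }

    // hypothetical stand-ins for the Antares drivers
    void executeHistoric(Query q, Date from, Date to) { /* build CQL and query Cassandra */ }
    void executeStream(Query q, Date from, Date to)   { /* build EPL and register it with ESPER */ }
}

class Query { /* user-specified parameters: hashtag, threshold, location, ... */ }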


Figure 3.5: Antares: historic querying architecture

3.3.3 Combined Querying

Combined queries have a time-window that spans the past and the future: the start time is in the past and the end time is in the future. The query monitor constructs a stream query and a historic query by decomposing the time period into the portion in the past (start time until now, the time-window for the historic query) and the portion from now until the future end time (the time-window for the stream query). There are two types of combined query.

The first type submits a query to ESPER. ESPER executes a continuous query, processing each Tweet ingested from the stream, and uses the mechanism described previously to identify whether the query constraint has been met. If it has, a historic query is executed, as shown in Figure 3.6: the parameters of the query are passed to the query monitor, an event is triggered from ESPER to the historic driver, which queries Cassandra, and the combined results are returned to the user.


Figure 3.6: Antares: combined querying architecture (type 1)

Figure 3.7: Antares: combined querying architecture (type 2)

The second type of combined query executes a stream query and a historic query at the same time, as shown in Figure 3.7, and the results are then returned to the user. The query monitor receives a query request and identifies whether it is a combined query. If it is, two queries need to be constructed, which introduces a slight delay in the current time used for the queries; however, both queries are given the same current time, so this does not cause any problems. The historic query uses the current time as its end time and the stream query uses it as its start time.

The continuous query is executed first to ensure that the current time for the stream starts as quickly as possible. The historic query is then executed; it does not matter how late this is submitted, as all Tweets are stored as they enter Antares, so there is no loss of data and the query searches the user-defined time-period independently of the submission time. If the stream query fails then the historic query is not executed and the query is retried. This method of submission and execution ensures there are no race conditions, as the two queries are always executed in the correct order.

The stream processing system is required to support the scale of the system and maintain the intake of requests. Without ESPER the number of requests to the database would approximately double, placing more strain on the database. When a query to the database is unsuccessful there is a timeout of 5 seconds before the query is retried. This timeout was chosen to allow enough time for multiple queries to execute before trying again: if the timeout were too short, for example 100 milliseconds, the system would still be overloaded when the retry was made, whereas 5 seconds is long enough to retry without introducing a noticeable delay. Overload becomes increasingly probable as the number of requests increases, and users may then be blocked waiting as the system saturates; this is a significant problem when writing petabytes of data while listening for requests from thousands of users. ESPER allows the stream queries to be executed on another layer of the system, which reduces the load on the database and ensures that queries can be returned in near real-time. Timeouts are avoided, so there is no substantial loss of service. The design decision to include ESPER, rather than executing all queries over the database, therefore provided higher ingest rates and lower latency queries, supporting features one and three.


Now the syntax of the combined querying will be described. Queries can be constructed in the form of:

SELECT a1...an FROM sh <WHERE p> || <HAVING p>

where a1...an is a projection list, sh is a stream/database of Tweets with a time period running from a point in the past to a point in the future, and p is a predicate. The angular brackets <> denote that a term is optional; in the case of WHERE and HAVING, one or none of them may be used. The combined query's time window may span any point in the past or future and determines the way in which the query is executed. The projection list must not contain aggregate functions other than COUNT. The evaluation over the stream and the historic store results in the projection list variables that adhere to the predicate.

The query language is mapped to the back end by a series of steps and functions; Figure 3.8 shows these functions. Two functions are described below: time window and collect. The time window function constructs a query using the time period specified, sending the query to s (stream), h (database) or sh (both). The collect function takes the cross product of all of the results after they have been evaluated and returns it to the user.

Time_Window(starttime, endtime):
    if (endtime <= now) send to h
    else if (starttime >= now) send to s
    else send to sh

Collect description:

SELECTION(h, s) => STREAM(sh)

In Figure 3.8 the user inputs a select query over a time period which starts in the past and ends in the future: a combined query. The predicate defined in the example is a hashtag, so the query will return all Tweets which match this hashtag. Once the query has been submitted, the time window function determines which queries to compile and which layers the query should be submitted to. As the figure shows, the example is a combined query, so a query is compiled for each layer. Both queries are selects containing a status in the projection list, with a hashtag as the predicate. Once the queries have been executed, the results from both layers are collected as a stream of values (statuses in this example) and returned to the user.

Figure 3.8: Query language mapped to the back-end

3.3.4 Insert Rate

The Twitter firehose can exceed 500 million Tweets per day [1], an average of over 5,000 per second, so developing a scalable, available and consistent framework to store all of these is challenging. Scaling to such large volumes of data can lead to consistency and availability problems if the throughput rate is not high enough. Antares implements novel techniques to solve these challenges efficiently.

Cassandra executes queries synchronously, which results in queries being executed serially: the database locks until a request has returned, increasing response times because the client is blocked until the lock is released, and decreasing availability. Antares uses the Datastax library to make asynchronous calls to the database, eliminating these throughput problems. A higher insert rate is possible as more queries are executed simultaneously.

Executing the queries asynchronously means the database does not block, increasing the throughput rate; this resulted in Antares being able to support tens of thousands of data insertions per second. Asynchronous execution results in concurrent query execution; however, when executing queries in parallel the order in which they return is not guaranteed. This can reduce consistency, as there is no guarantee a write will have succeeded as soon as a Tweet is ingested by Antares. However, this trade-off is acceptable in order to support the scale-out features of Antares.

Antares uses a Java class called Future, which represents the result of an asynchronous task; in the case of Antares the task is a query executed against the database, and these futures are stored in a list. The query is executed and a listener waits for the return of the query, which is either null (the execution failed) or the results of the query. The list acts as a log of incomplete and unsuccessful query requests; when a successful result returns, the task is removed from the future list. Unsuccessful queries are retried, which ensures eventual database consistency and that all writes will eventually succeed. For example, if a query result returns with an exception, there is a timeout of 5 seconds and then the query is retried. When there is a spike in the volume of data being ingested there may be more requests in contention, meaning more clashes or time-outs, so retries are necessary.

The number of tasks in the list is limited to the protocol depth (approximately 128 requests per node) to ensure no overflow exceptions occur and that all the data is written to the database. Once this limit is exceeded the futures are forced to finish: if a future is successful it is taken off the list until the size of the list is under the limit; if it returns with an exception it is retried. The list continues to receive requests in the meantime and is limited only by the size of memory. When a future reaches the front of the list it is submitted to the database and executed if possible; otherwise it is stored in the server buffer. When the server buffer becomes full, queries return with exceptions and response times increase. Antares is therefore configured and tuned to prevent this by limiting the number of outstanding requests, as shown in Section 3.5.
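A minimal sketch of this pattern is shown below. It assumes the DataStax Java driver 2.x API (Session.executeAsync returning a ResultSetFuture) together with Guava's Futures.addCallback; the bound on in-flight requests, the 5-second retry delay and the class name are illustrative, and a production implementation would schedule retries rather than sleep in the callback thread.

import com.datastax.driver.core.*;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class AsyncWriter {
    private static final int MAX_IN_FLIGHT = 128;           // roughly the protocol depth per node
    private static final long RETRY_DELAY_MS = 5000;        // 5-second wait before a retry

    private final Session session;
    private final Queue<ResultSetFuture> inFlight = new ConcurrentLinkedQueue<>();

    public AsyncWriter(Session session) { this.session = session; }

    public void write(final Statement insert) {
        // block new submissions until the list of incomplete futures is back under the limit
        while (inFlight.size() >= MAX_IN_FLIGHT) {
            ResultSetFuture oldest = inFlight.poll();
            if (oldest == null) break;
            try { oldest.getUninterruptibly(); }             // force the oldest request to complete
            catch (RuntimeException e) { /* its callback below will retry it */ }
        }

        final ResultSetFuture future = session.executeAsync(insert);
        inFlight.add(future);

        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            public void onSuccess(ResultSet rs) { inFlight.remove(future); }   // completed: drop from the log
            public void onFailure(Throwable t) {                               // failed: wait, then retry
                inFlight.remove(future);
                try { Thread.sleep(RETRY_DELAY_MS); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                write(insert);
            }
        });
    }
}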


To support the high throughput rate of the asynchronous calls there must be sufficient connections to the database; to maintain these connections there is a database thread pool. This maintains the number of connections and ensures it does not grow too large, as too many connections cause the database to block. The database thread pool ensures only 9 connections per machine are ever established at any one time, as defined in the configuration settings.
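If the DataStax driver is used for these connections, the cap can be expressed through its pooling options. The sketch below assumes the driver 2.x PoolingOptions API; the contact point is a placeholder and the figure of 9 connections per host simply mirrors the configuration value quoted above.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.Session;

public class ConnectionPoolConfig {
    public static Session connect() {
        PoolingOptions pooling = new PoolingOptions()
            .setCoreConnectionsPerHost(HostDistance.LOCAL, 9)   // keep 9 connections per machine open
            .setMaxConnectionsPerHost(HostDistance.LOCAL, 9);   // and never allow more than 9

        Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")                       // placeholder contact point
            .withPoolingOptions(pooling)
            .build();
        return cluster.connect();
    }
}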

Another optimisation is to batch query execution. Queries are batched and then submitted to the database asynchronously. This results in parallel execution of queries, potentially increasing performance. It also ensures the data flow is high enough that resources are not underutilised.
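The batching itself can be sketched as follows, again assuming the DataStax Java driver and a tweet column family like the one defined later in Section 3.4.3. The batch size of 200 reflects the experimental results in Section 3.5.2, and the unlogged batch type and class name are illustrative choices rather than the exact Antares configuration.

import com.datastax.driver.core.*;
import java.util.List;

public class BatchWriter {
    private static final int BATCH_SIZE = 200;               // batch size suggested by the experiments in Section 3.5.2

    private final Session session;
    private final PreparedStatement insertTweet;

    public BatchWriter(Session session) {
        this.session = session;
        this.insertTweet = session.prepare("INSERT INTO tweet (tweetid, status) VALUES (?, ?)");
    }

    // Groups Tweets into unlogged batches and submits each batch asynchronously.
    public void writeBatch(List<String[]> tweets) {           // each element: {tweetid, status}
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (String[] t : tweets) {
            batch.add(insertTweet.bind(t[0], t[1]));
            if (batch.size() >= BATCH_SIZE) {
                session.executeAsync(batch);                  // non-blocking submission
                batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            }
        }
        if (batch.size() > 0) session.executeAsync(batch);
    }
}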

Antares implements a novel query monitor that translates a single user-defined query into two separate queries if the query is combined; if it is a stream or historic query, the query monitor passes it to the appropriate module of Antares.

Antares implements asynchronous querying and batching to extend Cassandra to support high insertion rates while scaling out. The next section describes the data model and queries that Antares supports.

3.4 Data Model

The architecture design, motivation and implementation have been described; the mapping between the architecture and the querying is now described.

Antares [2] was designed and implemented to support querying against streaming and historic data for the analysis of Twitter.

A literature review was undertaken to understand the domain area - Section 3.4.1. From this, a comprehensive set of queries were defined to encompass the relations in the Twitter data model - Section 3.4.2.

Its goal was high throughput and low latency. Antares supports real-time, historic and combined querying by optimising two technologies: ESPER and Cassandra. Antares implements an index structure that enhances Cassandra and enables the low latency querying required for stream, historic and combined processing. This is described in Section 3.4.3.


Streaming data can arrive at high rates and volumes, therefore optimisations are re- quired to allow high-rate ingestion while querying the database and stream simulta- neously. These are described in Section 3.4.4.

Antares has been used in a research project called Tweet My Street; Section 5.4 describes some of the use cases from this project.

The next section describes each of the technologies and how they were extended within Antares.

3.4.1 Case Studies

A literature review was undertaken to investigate which topics are being analysed using social media, focusing mainly on Twitter. The questions collated from this review were used to identify domains in which social media data can reveal information about world events. The aim was to identify a set of unrelated domains and then use the same generic queries and query patterns to derive information from the Tweets. The three domain areas found by the literature review were marketing and advertising, conferences, and emergency response. From these areas a set of queries was derived.

Marketing and Advertising: The need to understand the popularity and spread of Tweets and topics contained within them is important commercially [54]. For example, [55] highlights some of the important questions that companies want answering. This includes monitoring Tweets with specific keywords, and using retweets to identify the popularity of certain marketing items. Examples of questions that have been identified within this domain are:

1. What is currently trending? [56]

2. Did this hashtag trend yesterday? [54]

3. What retweets will be popular in the next hour? [56]

Conferences: In conferences and presentations, the use of a common hashtag, along with a dedicated Twitter profile, is commonplace. This is used as a method of publicising the event, as well as monitoring attendee and external engagement [57]. Some common questions from this application domain are:


1. What has been posted about this conference? [58]

2. What are people’s opinions on this paper? [59]

3. Which aspects of the conference have been the most popular? [60]

Emergency Response: Emergency response encompasses a range of types of in- cidents, e.g. from natural disasters such as floods and earthquakes, to crimes in your neighbourhood. [61] explains in depth the important role that social media has when responding to emergencies. Key questions that have been asked in this domain are:

1. What information is available about this flood? [62]

2. What Tweets were sent from the city that the earthquake originated from? [63]

3. Which retweets were posted most during the riots? [64]

These are just three examples. The next section describes the Twitter data model and queries for investigating Tweets.

3.4.2 Queries

To enable the querying of Twitter data to answer common questions, a data model was designed to capture the relations between the data types in Twitter, and is illustrated in Figure 3.9.

Antares needed to provide a scalable and efficient means of answering these questions. The questions needed to be wide-ranging enough to cover multiple domains while being representable as stream, historic and combined queries, so that Antares could translate one request into the correct query type and keep usage simple for the user. These issues were addressed by identifying a small core set of generic queries that could answer the questions from the case studies. The queries are now described.

In these descriptions, the terms used are listed in Table 3.2. Each of the queries can be a stream, historic or combined query; the difference is described in Table 3.3. Type a queries are historic: they have a time window held entirely within the past. Type b queries are combined: they cover a time window which begins in the past and finishes in the future. The last type, c, are stream queries, which execute over a time window in the future. Parameters to be used in the system and passed to the query monitor are declared in the parentheses next to the query number.

Figure 3.9: Twitter data model

The query language is based on CQL (continuous query language), discussed in [40]. Each query is based on a time window, which is specified inside the square brackets. As each of the queries can be a stream, historic or combined query, the store over which the query is executed – stream, database or both – is represented by the place-holder “Tweets”. The query language can therefore be used to specify a particular query type, or the word Tweets can be used to mean that the query may be executed across both layers of the system. The language uses an SQL-like declaration for condition statements.

Each query is used to describe a relationship in the data model; this is achieved by changing the parameters in the queries to produce different results. Table 3.4 shows the six different parameters used in the queries and in which queries they are used. The data model describes the data and the relations that Antares focuses on, the queries provide a comprehensive means of searching it, and each query defines a relation in the data model.

Trends were a common theme in the questions asked of the use cases. Hashtags are a prominent feature of Tweets; they provide a means of categorising Tweets into themes without very complex textual analysis. They help to group Tweets and trends for users to express their opinions or to find out about topics they are interested in.


Parameter    Description
hashtag      topics in a Tweet prefixed by #
count        a tool used to store the number of times an event has occurred
threshold    an integer that represents a total the count is constrained by
location     a string representing the location of a Tweet
retweet      a Tweet that has been reposted by another user
time         a timestamp of when a Tweet was posted
tweetid      a unique number given to the Tweet to identify it

Table 3.2: Parameters and values in the queries

Type           Lower Threshold (startTime)   Upper Threshold (endTime)
a  Historic    Past                          Past
b  Combined    Past                          Future
c  Real Time   Future                        Future

Table 3.3: Query definitions

Queries   threshold   startTime   endTime   location   hashtag   tweetid
Q1        x           x           x
Q2        x           x           x
Q3                    x           x         x          x
Q4                    x           x                    x
Q5                    x           x                    x
Q6                                                                x
Q7                    x           x                               x

Table 3.4: Query parameters


Domain                      Question     Query
Marketing and Advertising   Question 1   Query One
                            Question 2   Query Four
                            Question 3   Query Two
Conference                  Question 1   Query Five
                            Question 2   Query Five
                            Question 3   Query One
Emergency Response          Question 1   Query Five
                            Question 2   Query Three
                            Question 3   Query Seven

Table 3.5: Questions and matching queries

Being able to analyse these can help researchers filter the amount of information they are analysing and increase the probability of that information being relevant. Query One therefore identifies trends and returns the resulting hashtags and their counts. This query takes three parameters, as shown below: a pair of upper and lower time bounds and a threshold limit. The value of threshold specifies how many times a hashtag must occur between the time bounds before it is identified as a trend. The counter is incremented each time a hashtag within the same time-bound is included in a Tweet. A counter is a Cassandra column type that supports atomic increments.

Q1(startTime, endTime, threshold)
SELECT hashtag, counter
FROM Tweets[$startTime-$endTime]
HAVING COUNT(hashtag) > $threshold

Another feature of Tweets is the “retweet” function, which allows people to identify more popular Tweets quickly and easily. The retweet function re-posts a Tweet that was posted by another user, so the Tweet is shown to additional users, increasing its reach. It follows that the more popular a Tweet, the more relevant it is likely to be to current world events or opinions. Query Two therefore identifies popular retweeted Tweets. Here the variable threshold provides the minimum number of times a Tweet must have been retweeted; if the retweet count exceeds the threshold then the Tweet is considered a popular retweet and a result is returned to the user.


Q2(startTime, endTime, threshold)
SELECT counter, tweetid
FROM Tweets[$startTime-$endTime]
HAVING COUNT(retweet) > $threshold

Location is a very important part of the metadata which provides additional infor- mation. By looking at the area from which a post originated, additional insights can be derived about marketing locations, the epicentres of natural disasters and much more. Query Three finds all Tweets with a hashtag and location specified by the user in a time-window. The location parameter is a user-defined string. This was used for simplicity and is extended to include latitude and longitude in Section 4.4. This is because the user-defined location is not accurate.

Q3(startTime, endTime, location, hashtag)
SELECT tweet
FROM Tweets[$startTime-$endTime]
WHERE tweet.location=$location AND tweet.hashtag=$hashtag

Query Four returns a count of how many times the specified hashtag has appeared in tweets within the time bounds.

Q4(startTime, endTime, hashtag)
SELECT COUNT(tweetID)
FROM Tweets[$startTime-$endTime]
WHERE tweet.hashtag = $hashtag

Query Five allows users to specify a hashtag of interest to them; the query returns the tweets containing the hashtag within the time-bound.

Q5(startTime, endTime, hashtag)
SELECT tweet
FROM Tweets[$startTime-$endTime]
WHERE tweet.hashtag=$hashtag

Query Six identifies the number of users that follow the user that posted the Tweet.

Q6(tweetID)
SELECT userID as userid
FROM Tweets
WHERE $tweetID=tweet.id

SELECT noOfFollowers
FROM Tweets
WHERE userID = userid

Query Seven returns a count of the times that a Tweet has been posted or retweeted within the time bounds.

Q7(tweetID, startTime, endTime)
SELECT counter
FROM Tweets[$startTime-$endTime]
WHERE $tweetID=tweet.id

As mentioned, each query describes a relation in the data model; the queries and the relations they describe are listed below.

Query One: multiple hashtags are in multiple statuses.

Query Two: users retweet Tweets.

Query Three: Tweets contain a geotag.

Query Four: one hashtag can be mentioned in multiple Tweets (returns the count instead of status).

Query Five: one hashtag is mentioned in multiple statuses.

Query Six: users have followers (relation follows).

Query Seven: Tweets are posted at a specific point in time.

The hashtag has a more complex relationship as it is a many-to-many relationship. Therefore more queries are dedicated to querying this variable.

A set of use cases were used to define a data model for the relations between the data types in Tweets. A set of comprehensive queries for searching the relations within the data model were then defined and used to support an exhaustive search of the domain. The next section describes the database design to enable low latency querying of the data model.


3.4.3 Schema

To support storing the data in a more easily retrievable way, an efficient and novel global indexing system was designed and developed to provide support for the queries mentioned above. One important design decision was to denormalise data to reduce execution times and increase throughput by eliminating the cost of joins for the generic queries.

The two key design challenges for creating a scalable system were:

1. How to efficiently access Tweets by time.

2. How to efficiently perform aggregations.

The database design is now described, showing how these challenges were met. One of the design choices faced when constructing schemas in a NoSQL store is whether or not to replicate data. Denormalising data can make querying more efficient by producing indexes specific to each query. Replication increases the storage space required; however, given the low cost of hardware and the ease of scaling out, this is an acceptable price for the performance gained by denormalising the data.

An example is the column family tweet, which is indexed by tweetid with the status used as the column name. This means other indexes can store just the tweetid and then use this index to retrieve the status if it is required; as has been shown, some queries use tweetid to identify Tweets. This indexing technique allows the status to be retrieved with increased efficiency and performance. It also improves write performance, as Cassandra can write without reading: the data is not written sequentially, so Antares does not need to find the correct place for inserting data and can write it directly. The system takes the tweetid and writes to the next space on disk, so no read is required beforehand. This also means no other keys need to be changed and consistency is not an issue.

CREATE TABLE tweet (
    tweetid text,
    status text,
    PRIMARY KEY (tweetid))


The hashtag time column family was designed for Queries One and Four. All Tweets arrive from the firehose with metadata that includes their creation time. For efficient access to data from Tweets created within particular time-bounds, the schema uses time buckets, each of which stores data created within a particular time range. A timestamp is extracted from the metadata of the Tweet and split into two values: day and minute. Day is the primary key and minute the column name. Each column in the row represents a minute within a day (0-1439). The value of the day is determined by dividing the timestamp by 86,400,000 (the number of milliseconds in a day, as timestamps are stored in milliseconds).
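A sketch of this bucketing, following the millisecond-timestamp convention just described, is shown below; the class and method names are illustrative rather than taken from the Antares code.

public final class TimeBuckets {
    private static final long MS_PER_DAY = 86400000L;        // milliseconds in a day
    private static final long MS_PER_MINUTE = 60000L;

    // Row key: the number of whole days since the epoch.
    public static long day(long timestampMs) {
        return timestampMs / MS_PER_DAY;
    }

    // Column key component: the minute within that day (0-1439).
    public static long minute(long timestampMs) {
        return (timestampMs % MS_PER_DAY) / MS_PER_MINUTE;
    }
}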

The granularity of the time buckets was chosen by analysis of typical queries within the case studies, for example, a conference may be held over a week. Therefore a daily analysis of the Tweets about that day - with an hourly granularity - would enable a user to delve into appropriate detail. The schema also takes advantage of Cassandra’s composite column keys: this column family uses minute and hashtag as a composite column key (the PRIMARY KEY keywords in the schema introduce the keys: the first value is the row key and the rest form the column name; when the column name consists of multiple columns, it is called a composite column key).

To deal with the challenge of efficiently handling aggregations, a count is maintained of the number of times that a hashtag has occurred within the period identified by the day and minute. When a new Tweet arrives from the firehose, the composite key is used to increment the appropriate counter. Queries One and Four exploit this schema. Query One performs a range scan over the rows, using the minutes to identify the correct range, selecting the hashtags and their counts; once these have been retrieved, the count for each distinct hashtag is totalled and returned to the user if it exceeds the threshold. Query Four instead uses the day and the whole composite key to retrieve the count of a specific hashtag in that time period.

CREATE TABLE hashtag_time (
    day bigint,
    minute bigint,
    hashtag text,
    value counter,
    PRIMARY KEY (day, minute, hashtag))
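Incrementing the appropriate counter when a Tweet arrives can then be sketched as a single CQL update issued through the driver; the prepared statement below illustrates the pattern against the hashtag_time schema rather than reproducing the exact Antares code.

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class HashtagCounter {
    private final Session session;
    private final PreparedStatement increment;

    public HashtagCounter(Session session) {
        this.session = session;
        // counter columns are updated with "value = value + 1"; the composite key locates the cell
        this.increment = session.prepare(
            "UPDATE hashtag_time SET value = value + 1 WHERE day = ? AND minute = ? AND hashtag = ?");
    }

    public void onTweet(long timestampMs, String hashtag) {
        long day = timestampMs / 86400000L;
        long minute = (timestampMs % 86400000L) / 60000L;
        session.executeAsync(increment.bind(day, minute, hashtag));   // asynchronous write, as in Section 3.3.4
    }
}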


The column family tweet time follows an identical pattern to hashtag time, with hashtags replaced by tweetids. Queries Two and Seven use the schema to access data based on tweetid.

CREATE TABLE tweet_time (
    day bigint,
    minute bigint,
    tweetid text,
    value counter,
    PRIMARY KEY (day, minute, tweetid))

The column family user is indexed using the same pattern as tweet, but metadata from the Tweet is replaced by information about the user.

CREATE TABLE user (
    userid bigint,
    handle text,
    PRIMARY KEY (userid))

Tweet geo is based on the same time bucket structure as the previous column families; it is used to answer Query Three. The column name is a composite column key of timebucket, hashtag and location, allowing for an exact key match to efficiently retrieve the status of the Tweet based on location and hashtag. It has the same pattern as Query Two. The location and hashtag from the metadata are checked for a match. location is taken as the user-defined string, as previously mentioned, and extended in Section 4.4.

CREATE TABLE tweet_geo (
    day bigint,
    minute bigint,
    location text,
    hashtag text,
    status text,
    PRIMARY KEY (day, minute, location, hashtag))

Hashtag status is indexed by time; it allows Query Five to retrieve a status using the hashtag. It adds redundant data (the status) rather than creating a relation between hashtag and status, for more efficient querying.


CREATE TABLE hashtag_status (
    day bigint,
    minute bigint,
    hashtag text,
    tweetid text,
    status text,
    PRIMARY KEY (day, minute, hashtag))

User follower uses the same pattern as tweet user: given a userid, it finds the number of followers that was saved at that time. This is an exact key match query and performs with the same efficiency as the previously mentioned column families.

CREATE TABLE tweet_user (
    tweetid text,
    userid bigint,
    PRIMARY KEY (tweetid, userid))

CREATE TABLE user_follower (
    userid bigint,
    follower_count bigint,
    PRIMARY KEY (userid, follower_count))

This section has described the database schema, which supports efficient low latency querying and enables combined querying with the stream processing system. The next section describes how the queries and the schema are translated into stream processing queries.

3.4.4 Stream Schema

Each of the queries described in Section 3.4.2 can be executed across the stream or the database. Here the execution of queries across the stream is explained. For ESPER to execute a continuous query a Java object was defined to represent a Tweet. The Java object represented the data model and contained attributes included in Figure 3.9, which supported the analysis of the relations described in Section 3.4.2. Therefore each Java object has the attributes: status, hashtag, number of followers, timestamp, geotag, number of retweets and userid.
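A simplified version of such a class is sketched below; the field names follow the attributes listed above, but the exact names and types used in Antares may differ.

public class Tweet {
    private String status;             // the text of the Tweet
    private String hashtag;            // a hashtag contained in the Tweet
    private long numberOfFollowers;    // followers of the posting user
    private long timestamp;            // creation time in milliseconds
    private String geotag;             // location metadata
    private long numberOfRetweets;     // retweet count at ingestion time
    private long userId;               // identifier of the posting user

    // JavaBean-style getters so that ESPER can resolve the property names used in EPL statements
    public String getStatus() { return status; }
    public String getHashtag() { return hashtag; }
    public long getNumberOfFollowers() { return numberOfFollowers; }
    public long getTimestamp() { return timestamp; }
    public String getGeotag() { return geotag; }
    public long getNumberOfRetweets() { return numberOfRetweets; }
    public long getUserId() { return userId; }
}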

When Query One is executed in ESPER, the hashtags and their counts are stored in an in-memory hashmap. The hashtags are the keys and the value is a counter, which is incremented each time the hashtag is included in a Tweet. As the queries are time-bounded, the space occupied in the hashmap by a record can be recovered once the query reaches the end of the time-window, so the memory requirements are limited.
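Conceptually, the state maintained for this query behaves like the plain Java sketch below; this is an illustration of the counting and window-reset behaviour, not the engine's internal implementation.

import java.util.HashMap;
import java.util.Map;

public class StreamHashtagCount {
    private final Map<String, Long> counts = new HashMap<>();   // hashtag -> occurrences in the current window
    private final long threshold;

    public StreamHashtagCount(long threshold) { this.threshold = threshold; }

    // called for every Tweet hashtag that falls inside the time window
    public boolean onHashtag(String hashtag) {
        long c = counts.merge(hashtag, 1L, Long::sum);
        return c > threshold;                                    // true once the hashtag qualifies as a trend
    }

    // called when the time window ends, so the memory can be reclaimed
    public void reset() { counts.clear(); }
}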

When Query Two executes over the stream, the retweet count is extracted from each Tweet and checked against the stated threshold. It therefore differs from the historic query: there is no aggregate function, as the value is taken from the live Tweet.

Query Three is similar to Query Two in that it performs an exact match, extracting the location and hashtag from the Tweet. If the values match the parameters specified by the user then the results are returned to the user. The query is executed over the user-defined time-period, but results are returned each time the constraint is met, as soon as it is met.

An aggregate function is executed across the stream for Query Four, which counts each time a Tweet mentions a hashtag. This is a counter held in-memory and each time the hashtag is mentioned the counter is increased. The result is returned after the time-period has finished.

Query Five returns each Tweet in which the user-defined hashtag is mentioned. Each time a Tweet contains the hashtag, the Tweet is returned while the query is executing.

The execution of Query Six varies on the stream when compared to the database. The user specifies a tweetid; if that Tweet appears during the time window then it is returned, otherwise the query returns no result.

Query Seven uses the tweetid: if a Tweet with this tweetid is identified then the user object (contained in the Tweet metadata) is accessed and the number of followers extracted. This is returned to the user; however, the query is executed for the full time period in case the Tweet is retweeted. If the Tweet has been retweeted then the userid may be different, and therefore the number of followers will be different.

Antares provides a novel mechanism for combining both of these querying techniques by executing one query. This section has explicitly described the schema for stream processing.

3.5 Evaluation

Antares provides low latency, scalable query execution over large volumes of stream data. It has been employed in research projects to allow automated and efficient analysis of Twitter data [3]. This section evaluates its performance, in particular two key measures: the query response times and the data ingest rates of the system.

The experiments described here explore and evaluate the various design choices, extensions and optimisations that support efficient, scalable stream, historic and combined queries.

3.5.1 Ingestion Rate

To evaluate the system, a collection of 10 million Tweets was compiled using the Twitter4J streaming API [65]. These Tweets are then replayed at high ingestion rates to replicate the Twitter firehose, which can average 5,000 Tweets per second [1]. The Tweets were replayed at the highest throughput rate the hardware specification would allow. Twitter4J is a free API which allows access to Twitter data, with both searching and streaming APIs. The search API allows the user to search for Tweets historically, but is restricted to 150 queries every 15 minutes. The streaming API limits the number of Tweets collected from the firehose to 1% of all Tweets at any given point in time.

The experiments described in this section were created to prove the system could scale to support high ingestion rates – allowing, for example, the Twitter firehose to be written to disk without any loss of data.

The experiments were executed in the Amazon cloud, using m1.large machines with the following configuration: memory 7GB, 2 cores, storage 840GB, network performance moderate, operating system Ubuntu.

The dataset of Tweets used is 14GB in size; however, it is replayed to simulate the firehose. The system's memory limits are kept to a minimum so that the operating system does not cache data, giving a correct representation of Antares using the database and disk storage. The Tweets are replayed from another machine in the same datacentre, so as not to include network delays, but not from the same machine, so that the replay does not affect how the database uses the memory on the machine.

Cassandra has two different methods of storing data. The first is the Memtable, which is in-memory storage: as data enters the Cassandra cluster it is stored in a Memtable. Each Memtable has a tunable storage limit; once this is reached, the column families are flushed to disk. SSTables are used for disk storage. These are immutable data files and each one corresponds to a column family. Additionally there are temporary SSTables, which are merged into one large SSTable periodically and asynchronously. This supports fault tolerance, as there is a non-volatile store of the Memtables even if they have not been flushed to disk. As data enters the system for the first time it is saved in-memory, as this is quicker than writing to disk. Once the amount of data in the Memtable surpasses the storage limit it is written to disk, which takes longer and can reduce the ingest rate. Creating a faster initial ingestion rate does not affect the overall average rate of ingestion.

Each Cassandra node was started in an empty state, with only the indexes described above defined in the database. Two optimisations were designed and implemented to increase the ingestion rate: a) batching of queries and b) asynchronous execution of queries. Batching queries ensures the flow of data is sufficient to utilise the full capacity of the database: if the flow of data is too low, resources are underutilised and efficiency is lost; if the flow of data is too high, the database becomes over-saturated, causing queries to fail. Cassandra executes queries synchronously, which can cause a bottleneck as a result of sequential execution and callback waiting time. By executing queries asynchronously the ingestion rate of Antares can be increased dramatically, as queries are executed in parallel.

Cassandra was designed to support highly scalable writes but the extensions in Antares aimed to increase the ingestion rate significantly so that it can handle much higher velocity streaming data.


3.5.2 Batch Experiment

The experiments were designed to evaluate the performance and scalability of the extensions implemented by Antares. The first experiment is executed over one Amazon machine and measures the effect that batching the Tweets to modify the data flow has on the ingestion rate of the Cassandra instance.

A batch is a set of insert queries, where one query represents one Tweet to be inserted; the batch is sent to the database as a list of these inserts. When the batch is submitted, the inserts are executed in parallel if CPU capacity is available.

The experiment consisted of one client (an m1.large machine), which sends batches of varying sizes to the one-node Cassandra cluster to be written to disk. Batches of Tweets are replayed at the highest rate the Amazon machine specification can sustain. The number of Tweets processed within a given time is counted and the Tweets-per-second rate is calculated as numberOfTweets / numberOfSeconds.

Figure 3.10 demonstrates that as the batch size is increased from 1 to 200, the number of Tweets ingested per second more than doubles, from 3,047 to 6,895. The ingestion rate begins to tail off after this, with a decrease for a batch of 400 Tweets. This is because the batch has become too big and the number of queries has over-saturated the database. This causes queries to time out (the default time-out is 10,000 ms), so another query must be sent to ensure the data is written. This increases the time taken for queries to execute, reducing the number of Tweets written within the time period. This demonstrates that batching the Tweets has a direct impact on the ingestion rate, and batching is used to tune Antares to its most efficient ingestion rate.

Batch size (Number of Tweets)   Tweets per second
1                               3046.8
100                             5101.13
200                             6895.39
300                             6928.61
400                             6824.68

Table 3.6: Tweets per second written to Antares with increasing batch sizes


Figure 3.10: Tweets per second written to the Antares one-node cluster using increas- ing batch sizes

3.5.2.1 Comparison of Antares and the Synchronous Cluster

The next experiment replays the Tweets again, but varies the number of nodes in the cluster. Each run starts with the servers in a completely empty state. The experiment was executed over clusters ranging from one node to eight nodes.

Antares is compared with a Cassandra cluster, which is indexed using the same design pattern. However the “synchronous” cluster does not use batching or asynchronous writes to improve performance.

Figure 3.11 demonstrates a comparison between Antares and a cluster with no asyn- chronous queries or batching. As is demonstrated in Figure 3.11, the ingest rate increased linearly and Antares has the ability to consume over five times the average Twitter firehose with an eight-node cluster. The four-node cluster supports an inges- tion rate of 20,470 Tweets per second, by doubling the number of nodes the ingestion rate is increased to 34,881 Tweets per second. As the average rate is 5,000 Tweets per second a four-node cluster is more than sufficient for ingestion. Using a larger cluster enables Antares to cope with spikes in the data flow. Therefore the number of nodes used for the experiments is four as it has the capability of capturing the entire firehose and potential bursts of activity from users or applications.

The difference between the clusters is 6,928 Tweets per second for Antares compared with 208 for the synchronous cluster on one machine, and this difference only increases with cluster size: 34,881 Tweets per second can be ingested by Antares, whereas the synchronous cluster can only ingest 349 Tweets per second on an eight-node cluster. This is an improvement of approximately 100 times in data ingestion and demonstrates the efficiency of the optimisations made to Antares compared with a standard Cassandra cluster. 34,881 Tweets per second is more than five times the average Twitter firehose, meaning Antares is capable of handling the full firehose as well as bursts of events.

The addition of nodes provides greater hardware support, however the results show that the software extensions support higher ingestion rates and therefore less hard- ware is required for a given ingestion rate. The percentage savings shown in Figure 3.12 demonstrates that Antares improves performance by nearly 100% for a four-node cluster.

Number of Machines   Antares (Tweets per second)   Synchronous (Tweets per second)   Percentage Savings (%)
1                    6929                          208.06                            97
2                    10768                         247.18                            97.7
3                    20470                         283.19                            98.61
4                    34076                         349.09                            99.98

Table 3.7: Tweets per second written to Antares and the Synchronous System

Figure 3.11: Tweets per second written to Antares and the Synchronous System


Figure 3.12: Percentage savings for Antares

3.5.2.2 Conclusion

Query batching and asynchronous execution have been shown to improve the ingestion rate. Twitter data was used to evaluate Antares and show that the optimisations support an ingestion rate of over 30,000 Tweets per second. Batching ensures the resources of Antares are fully used, and asynchronous queries provide a means of query execution that does not block and reduce availability. Antares can therefore ingest the entire Twitter firehose without reducing availability.

3.5.3 Read Performance

The previous experiments determined that Antares can support an ingestion rate of five times the Twitter firehose. However the ability to search and process this data efficiently and quickly at scale was vital to support the combination of real-time stream analysis and historic processing. This was implemented through the design and de- velopment of indexing mechanisms, described in Section 3.4. This section evaluates the execution time for these indexes against a non-indexed system, which includes an evaluation of the effect of increasing query loads on the data store.

To evaluate the historic store, different types of query were executed over the Cassandra cluster and the mean response times plotted against the number of machines used. The first set of experiments uses workloads of 1,000 queries: one workload of exact match queries and one of range queries. The system was deployed on the cloud using the same specification of Amazon machines as previously mentioned.


Antares was deployed over clusters ranging from one node to four nodes (the optimal cluster size for ingestion rate) to evaluate the effect of indexing on query response times as the cluster grows. The queries were taken from the generic queries described in Section 3.4.2: the range query is Query One and the exact match query is Query Four. These were executed sequentially over the database and the mean response time measured; after each result was recorded an extra m1.large machine was added to the cluster.

The data was automatically load balanced by Cassandra, and the queries were executed afterwards to ensure load balancing had no effect on the execution time; load balancing a cluster can take up to a few seconds and stops calls to the database. The workloads were executed across an Antares cluster with indexing, to establish its efficiency, and across an unindexed cluster (schema shown in Figure 3.15) to compare and evaluate the optimisations made to Antares.

The start and end time parameters were randomly generated values, each entered into an individual query. Fifty percent were generated to fall within the earlier half of the day and fifty percent in the later half, with values placed at the start (earlier in the day) and end (later in the day) of the row. Hashtags were generated for the exact match query: 80% were hashtags included in the dataset and 20% were not.

Q1(startTime, endTime, threshold)
SELECT hashtag, counter
FROM hashtag_time[$startTime-$endTime]
HAVING COUNT(hashtag) > $threshold

Figure 3.13: Range query

Q4(startTime, endTime, hashtag)
SELECT COUNT(tweetID)
FROM hashtag_time[$startTime-$endTime]
WHERE tweet.hashtag = $hashtag

Figure 3.14: Exact match query


CREATE TABLE unindexed(
    tweetid text,
    hashtag text,
    time bigint,
    userid text,
    PRIMARY KEY(tweetid));

Figure 3.15: Schema for the unindexed column family

The unindexed cluster stores the information about a Tweet in one column family. Counters cannot be used without indexing the column family, so the values that are collected must also be counted in the application layer.

Before describing the query mechanism for the unindexed cluster, an example schema is shown in Figure 3.16 to illustrate the constraints on querying a Cassandra cluster. In the example table there must be an equality operator on a before a range operator such as less-than can be used on b, and there must be equality operators on both a and b before the other operators can be used on c. This ensures that query responses are timely and that searches do not have to span the entire cluster. To query outside this pattern, the ALLOW FILTERING statement is used.

CREATE TABLE example(
    a text,
    b text,
    c text,
    value text,
    PRIMARY KEY(a,b,c));

Figure 3.16: Example schema
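As an illustration of this restriction, the two statements below (issued here through the Java driver with placeholder values) show a query that the example schema supports directly and one that requires ALLOW FILTERING because it skips the leading key components. These statements are illustrative and are not taken from Antares.

import com.datastax.driver.core.Session;

public class FilteringExample {
    public static void query(Session session) {
        // allowed: equality on a and b, then a range on the last clustering column c
        session.execute("SELECT value FROM example WHERE a = ? AND b = ? AND c > ?", "a1", "b1", "c0");

        // only possible with ALLOW FILTERING: c is constrained while a and b are not
        session.execute("SELECT value FROM example WHERE c > ? ALLOW FILTERING", "c0");
    }
}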

The query shown in Figure 3.17 is an exact match using the unindexed schema. It queries for information about a Tweet which contains a hashtag and was posted at a certain time. As the unindexed schema uses the tweetid as a primary key –and does not index on hashtag– ALLOW FILTERING is used to enable querying on a non-primary key component. Therefore each Tweet must be queried to identify whether it was posted in the time-window and if it contained the hashtag.


SELECT * FROM unindexed WHERE hashtag = ? AND time = ? ALLOW FILTERING;

Figure 3.17: Unindexed exact match query

The query shown in Figure 3.18 is a range query executed over the unindexed cluster. Again all the Tweets must be searched, returned and then the count totalled.

SELECT * FROM unindexed WHERE time > ? AND time < ? ALLOW FILTERING;

Figure 3.18: Unindexed range query

3.5.3.1 Exact Match Experiment

The query from Figure 3.14 was executed across Antares increasing the cluster size for each experiment.

Number of Machines   Antares (ms)   Unindexed (ms)   Percentage Savings (%)
1                    60.38          10679.84         99.43
2                    45.39          14686.33         99.69
3                    40.35          18179.95         99.78
4                    34.66          32950.1          99.89

Table 3.8: Mean response time for exact match queries executed across Antares and an unindexed cluster


Figure 3.19: Mean response time for exact match queries executed across Antares and an unindexed cluster

Figure 3.20: Mean response time for exact match executed across Antares


Figure 3.21: Percentage savings for Antares

Figure 3.19 shows the dramatic difference in mean response times between Antares and an unindexed cluster. For a one-node Antares cluster the mean response time of a query is 60 ms, which is very efficient, particularly when compared to the unindexed cluster's mean response time of 10,679 ms. For a four-node Antares cluster the mean response time is reduced further, to 34 ms. Antares achieves this by using an index that guarantees the data for a key will be on one node: when a query is executed it is accepted by any node in the cluster and sent on to the node containing the data using a hashed value of the key, as previously explained in Section 3.2.2.12. This reduces execution time compared with the unindexed cluster, as only one hop is required. As hardware is added to the cluster, performance increases, as shown in Figure 3.20, because there are more resources to accept queries.

The unindexed cluster has far higher response times than Antares: the response time for one node rises to over 10,000 ms, compared with 60 ms for Antares, and response times increase further as the number of searches increases. In the unindexed cluster each Tweet is stored separately by its tweetid, so any query with constraints other than the tweetid requires searching the entire dataset. The queries used in this experiment have temporal and textual (hashtag) constraints, so each Tweet must be examined to identify whether it meets the constraints. Furthermore, when Tweets are stored in the unindexed cluster, each Tweet is stored in a new row for each hashtag it contains, so n rows may be searched for a single Tweet, where n is the number of hashtags contained in the Tweet.

For the unindexed cluster, adding nodes increases the response time to nearly 3 times that of the one-node cluster. This is because the database has to examine the data held on each node, which introduces network latency and increases the overall response time. Performance also decreased because of cluster load balancing: when a node is added to the cluster it becomes responsible for a shard of data. The row key is used to partition the data, and for the unindexed cluster each Tweet has its own row key, so Tweets are distributed across the machines with no order. This increases the response time, as the query has to execute on every machine. Overall there is an improvement of a factor of roughly 1000 for indexed queries compared with the unindexed schema.

Antares scaled well: as the cluster grew from one node to four, its mean response time nearly halved, while the unindexed cluster's response time grew. These results demonstrate that the index distributes the data so as to exploit the parallelism of the cluster. As the number of nodes increases, more queries can execute concurrently because there are more available CPUs. As can be seen in Figure 3.21, the saving in query execution time is nearly 100% for all cluster sizes.

3.5.3.2 Range Query Experiment

The range query shown in Figure 3.13 was executed across both types of cluster, the results are shown in Figure 3.22 and Table 3.9. The range query has an increased response time when compared with the exact match query as there are more compar- isons made within the row. Additionally more data is returned from a typical range query. The query response time dropped by a factor of 4 when additional machines were added as shown in Figure 3.23. When Antares is compared with the unindexed cluster the response time is reduced by a factor of 190 for a four-node cluster. This was because the index exploited the additional hardware resources and parallelism.

The indexed cluster uses a composite key to allow efficient querying beyond the partition key alone. This means the ALLOW FILTERING keyword, which degrades performance because execution becomes unpredictable, does not need to be used; unpredictable execution can lead to additional disk seeks and increased latency. Figure 3.24 shows the savings Antares provides: the optimisations deliver over 90% savings for each cluster size. This demonstrates the improved performance of the querying and provides near real-time results, so the results from each query can be combined seamlessly. This allows a user to execute one query which combines the results, a novel approach to combined querying.

Number of Machines   Antares (ms)   Unindexed (ms)   Percentage Savings (%)
1                    1041.36        10970.13         90.51
2                    610.28         24195.63         97.48
3                    595.58         36316.36         98.36
4                    255.72         48359.02         99.47

Table 3.9: Mean response time for range queries executed across Antares and an unindexed cluster

Figure 3.22: Mean response time for range queries executed across Antares and an unindexed cluster


Figure 3.23: Mean response time for range queries executed across Antares

Figure 3.24: Percentage savings of the Antares system

The range query and the exact match query had similar response times when executed across the unindexed cluster, because in both cases the query still has to search through all of the stored Tweets. Once again, as nodes were added to the cluster the mean response time decreased on Antares by a factor of 4, whereas it increased for the unindexed schema by a factor of 4. The unindexed schema's response time increases because the row key is the tweetid, so each node must be searched, which increases the number of queries and the network time.

3.5.3.3 Stress Experiments

The next experiment evaluates the performance of Antares under stress. The experi- ment executes increasing numbers of concurrent queries, evaluating the query through- put rate. Each experiment is again executed across the same four-node Amazon cluster as previously mentioned. To test scalability, the experiment uses a JMeter [66] plug- in to create threads that generate queries. The cluster was pre-populated with the Twitter dataset before the queries were executed across Antares.

The time-window parameters were randomly generated values, each entered into an individual query. Fifty percent were generated to fall within the earlier half of the day and fifty percent in the later half. Using randomly ordered values means reads to the database are not sequential. To return larger amounts of data – stressing the range query – values were chosen that placed queries at the start (earlier in the day) and end (later in the day) of the row. This also simulates real-world use: as the case studies suggest, queries would be written based on events, which do not have to be chronological. A randomly generated set of days was used to represent queries that span more than one day, which require multiple queries or an IN statement. Multiple queries are required because each day is a new row, so there is a separate row key to search for each day, as described in Section 3.4.3. Cassandra holds the most recent data in Memtables until the storage limit is reached and the data is flushed to SSTables; newer data will therefore still be in memory while older data will have been written to disk. Having a spread of dates thus measures the system's response time when reads must come from both memory and disk.

Hashtags were generated for the exact match query; 80% were hashtags included in the dataset and 20% were not. The database behaves differently when no results are found: Antares need only read once from disk and, as there is no result, it returns straight away.


Number of Queries   Mean response time (ms)
0                   0
10                  99
100                 137
200                 140
500                 141
1000                200
2000                201
3000                206
10000               402

Table 3.10: Mean response time for exact match query with increasing concurrent queries

Figure 3.25: Mean response time for exact match query with increasing concurrent queries

Figure 3.25 shows a graph of the average response time for Query Four when executed across the database. Antares supports a high query throughput: response times only start to increase rapidly at around 10,000 concurrent queries. Antares therefore supports large numbers of users querying the system simultaneously without losing data; even with 10,000 queries the response times are under 400 milliseconds.

The next experiment used Query One. This is a range query whose parameters select all hashtags within a time-bound. The parameters for this query were generated with the same distribution as the previous one. This query differs from Query Four in that it uses the minutes to bind the query to a time range. The queries for this experiment were generated by the same mechanism as previously described. The average response time for Query One is displayed in Table 3.11, which shows 1,000 queries executing with a response time of under 2,000 milliseconds. Figure 3.26 demonstrates the linear scaling of the reads as the number of queries increases. This is because the indexing reduces processing, increasing the efficiency and performance. This query is slower than Query Four, as a range query queries more data. The experiment was stopped at this point, as the longest response time for a user approaches 2 seconds.

Number of Queries    Mean response time (ms)
0                    0
10                   299
100                  383
200                  415
500                  1072
1000                 1891

Table 3.11: Mean response time for Query One with increasing concurrent queries

Figure 3.26: Mean response time for Query One with increasing concurrent queries

3.5.3.4 Conclusion

The low latency of the query responses ensures that stream querying can be combined with the historic store. Cassandra requires indexes to be designed and implemented to support low latency querying; temporal indexes help to reduce latency and give percentage savings of over 90%.


3.5.4 Stream Queries

As described in Section 3.2.1.5, the ESPER complex event processing system was used to process the stream data. The following experiment evaluates whether ESPER can ingest the event stream at the full rate, and whether triggering queries has an effect on performance. Figure 3.27 shows ESPER executing both with and without a trigger to the historic store. As described in Section 3.3, when a user specifies a time period which starts in the past and continues into the future, Antares recognises this and executes a combined query: Antares executes a continuous query and, once a constraint is met, another query is executed across Cassandra.

The first dataset shows the throughput of Tweets for the exact match query, with over 160,000 Tweets in 18 seconds (approximately 9,000 per second). This demonstrates ESPER's ability to process the Twitter firehose quickly and efficiently. The same holds for the stream with read load; the dip at the beginning is ESPER reaching a stable state after starting up.

This experiment shows that the CEP system can meet the performance requirements. For each Tweet the hashtag is identified and counted: the hashtags are saved, a count for each is kept in-memory, and the counts are checked until a threshold is exceeded.
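The sketch below illustrates how such a count-until-threshold could be expressed in Esper EPL (using the Esper 5.x API); the TweetEvent type, its hashtag field and the 20-second window with a threshold of 20 are assumptions based on the description here and in Section 3.5.5, not the exact Antares statements.

    import com.espertech.esper.client.*;

    public class HashtagThreshold {
        public static class TweetEvent {
            private final String hashtag;
            public TweetEvent(String hashtag) { this.hashtag = hashtag; }
            public String getHashtag() { return hashtag; }
        }

        public static void main(String[] args) {
            Configuration config = new Configuration();
            config.addEventType("TweetEvent", TweetEvent.class);
            EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

            // Count hashtags over a 20-second window and fire once a hashtag exceeds the threshold.
            String epl = "select hashtag, count(*) as cnt "
                       + "from TweetEvent.win:time(20 sec) "
                       + "group by hashtag having count(*) >= 20";
            EPStatement stmt = engine.getEPAdministrator().createEPL(epl);

            stmt.addListener((newEvents, oldEvents) -> {
                if (newEvents != null) {
                    String tag = (String) newEvents[0].get("hashtag");
                    long cnt = (Long) newEvents[0].get("cnt");
                    System.out.println("Threshold exceeded for #" + tag + " (" + cnt + " occurrences)");
                }
            });

            // Feed events into the engine (in Antares this would be the simulated firehose).
            engine.getEPRuntime().sendEvent(new TweetEvent("example"));
        }
    }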

Time elapsed (s)    ESPER     ESPER with read load
0                   0         0
2                   30100     7379
8                   100000    82139
15                  142000    119217
20                  170000    170000

Table 3.12: Tweets processed per second by ESPER

3.5.4.1 Conclusion

ESPER supports a scalable stream processing system which can ingest the entire firehose without having to shed data. This provides an implementation for the scalable features stated in the introduction.


Figure 3.27: Tweets processed per second by ESPER

3.5.5 Combined Queries

This section evaluates how Antares performs for combined queries (Section 3.3). Combining stream and historic processing can be challenging, as historic queries must return results with low latency so that up-to-date results arrive in time to be combined with the stream query being executed. It would be a major drawback to receive the historic results after the stream query had returned and another was executing, because the streaming results may by then be irrelevant.

For a combined query the time-window specifies a lower time-bound – a time in the past – and an upper time-bound in the future. The query monitor identifies that it is a composite query and sends a query to ESPER (to execute a historic query if the threshold is exceeded). For this combined query, the exact match query (Figure 3.13) and the range query (Figure 3.14) have been used. As the Tweets are received from the stream, ESPER keeps state about the hashtags and counts each time they appear, until they exceed the threshold specified in the query. If this occurs, the exact match query is executed over the database, passing the hashtag and the time-window for the historic data. This returns a result specifying whether the hashtag has been mentioned before in that time-bound and selects the contents of the Tweets it was contained within. The time-bound is 20 seconds with a threshold of 20 hashtags. As previously shown, the range query executes efficiently in ESPER, but whether combining it with a query to the historic store affects its performance is now explored. This is shown in Figure 3.28.
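A minimal sketch of the trigger from the stream layer to the historic store is given below, using the DataStax Java driver; the keyspace, table and column names are illustrative placeholders rather than the actual Antares schema, and the method would be invoked from the ESPER listener once the threshold is exceeded.

    import com.datastax.driver.core.*;

    public class CombinedQueryTrigger {
        private final Session session;
        private final PreparedStatement historicQuery;

        public CombinedQueryTrigger(String contactPoint) {
            Cluster cluster = Cluster.builder().addContactPoint(contactPoint).build();
            this.session = cluster.connect("antares");   // keyspace name is an assumption
            // Hypothetical exact-match query: Tweets containing a hashtag within a time-bound.
            this.historicQuery = session.prepare(
                "SELECT content FROM hashtag_index WHERE hashtag = ? AND day = ? AND time >= ? AND time <= ?");
        }

        // Called by the stream-layer listener once the hashtag count exceeds the threshold.
        public void onThresholdExceeded(String hashtag, String day, String start, String end) {
            ResultSet results = session.execute(historicQuery.bind(hashtag, day, start, end));
            for (Row row : results) {
                System.out.println(row.getString("content"));
            }
        }
    }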


The experiment is executed over the same four-node Amazon cluster as previously described. The collected set of Tweets is again used to simulate the Twitter firehose with a peak load. Each Tweet is then written to the database. While the data is being written, varying numbers of queries are executed across the database. The exact match query was used as it had the most efficient execution, and the parameters were kept the same as in the previous experiment to simulate an identical read load. Figure 3.28 illustrates the same linear shape as the previous results as the number of queries increases. The mean response time is under 600 milliseconds for 2,000 queries, eventually reaching 1.5 seconds for 5,000. This illustrates the system's ability to process a heavy simultaneous read and write load.

Number of Queries    Mean response time (ms)
0                    0
10                   100
100                  146
300                  197
421                  361
916                  518
1997                 523
5000                 1535

Table 3.13: Mean response time for read and write load


Figure 3.28: Mean response time of Query Four with increasing concurrent queries while simultaneously inserting to the database

3.5.5.1 Conclusion

Antares supports combined querying while simultaneously inserting data into the historic store. The performance does not decrease significantly, and the system supports the collection and processing of large scale data. The mean query response times for searches in the database are low enough to provide a mechanism for combining stream processing with historic processing.

3.6 Conclusion

This chapter provided a literature review of different technologies, which led to the combination of ESPER and Cassandra being used in Antares. These technologies were used to provide a scalable and low latency layered approach to the processing of Twitter data. Cassandra allowed feature one to be implemented, using indexing techniques to map the Twitter data to the database with temporal patterns - this reduced the query response time. The evaluation demonstrated that temporal indexing supported low latency queries, so they could be combined with stream data in a timely manner while simultaneously writing to the database.


The query monitor maps the time window given by a user to the correct query - whether historic, stream or combined. This satisfies feature two, as the user enters a time window without having to know the kind of query they are executing.

The layered approach shown in this chapter supported scalable stream and historic queries by load balancing requests across both layers. Because requests can be handled by either layer, the number of time-outs is reduced. A time-out blocks for 5 seconds; with a steady request rate the resulting delay grows linearly, but with bursts of requests it can grow exponentially. The use of two layers therefore supports the load balancing of requests and reduces the number of re-tries. The layered approach takes advantage of in-memory speed and the scale of disk to provide scalable near real-time querying of Twitter data.

Antares uses batching and asynchronous execution to improve the ingestion rate and support the scalability of the system - this achieved feature three, as shown in the evaluation section, where the ingestion rate was approximately 35,000 for a cluster of only four machines.
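The sketch below shows the style of batched, asynchronous writes this refers to, using the DataStax Java driver; the table and column names are placeholders rather than the exact Antares schema.

    import com.datastax.driver.core.*;
    import java.util.List;

    public class AsyncBatchWriter {
        private final Session session;
        private final PreparedStatement insert;

        public AsyncBatchWriter(Session session) {
            this.session = session;
            // Placeholder insert statement; the real schema is described earlier in this chapter.
            this.insert = session.prepare(
                "INSERT INTO tweets (day, time, tweetid, content) VALUES (?, ?, ?, ?)");
        }

        // Groups inserts into an unlogged batch and submits it without blocking the caller.
        public ResultSetFuture writeBatch(List<Object[]> rows) {
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (Object[] r : rows) {
                batch.add(insert.bind(r));
            }
            return session.executeAsync(batch);   // the future can be checked later, keeping ingestion non-blocking
        }
    }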

Antares supports an indexing structure for efficient querying, which was derived from a literature review. Antares implements a comprehensive set of queries to analyse the relations of a data model representing Twitter. It provides a novel mechanism for combining queries while allowing the user to transparently enter only one query. Asynchronous queries and batching are used for high insertion rates, extending Cassandra.

The rest of the features are described in the following chapters.

4 Spatial Extensions

Contents

4.1 Introduction ...... 98
4.2 Related Work ...... 99
4.3 Research into Spatial Databases and GIS ...... 99
    4.3.1 B-Trees ...... 99
    4.3.2 R-Trees ...... 100
    4.3.3 Kd-Tree ...... 101
    4.3.4 Quad Tree ...... 101
    4.3.5 Geohashing ...... 101
    4.3.6 Distributed Trees ...... 101
    4.3.7 NoSQL Systems and Geospatial Processing ...... 102
    4.3.8 Conclusion ...... 104
4.4 Geospatial Querying ...... 105
    4.4.1 Solr Overview ...... 106
    4.4.2 Dynamic Index Layer ...... 108
    4.4.3 An Unbalanced Tree ...... 116
    4.4.4 Kd-Tree ...... 120
    4.4.5 Quad-Tree ...... 122
    4.4.6 Geohashing ...... 123
    4.4.7 Schema ...... 123
    4.4.8 Query Mapping ...... 124
    4.4.9 Spatial Algorithms ...... 125
    4.4.10 Extension to Querying Model ...... 130
4.5 Consistency ...... 130
    4.5.1 Scenarios ...... 131
4.6 Evaluation ...... 137
    4.6.1 Read Performance ...... 138
    4.6.2 Scale Experiments Comparing Different Structures in Antares with Different Systems ...... 141
    4.6.3 Writes ...... 148
    4.6.4 Conclusion ...... 149
4.7 Conclusion ...... 150


4.1 Introduction

This chapter extends the system architecture design and implementation described previously, demonstrating scalable analysis of geospatial data. The previous functionality of Antares is improved to help an ongoing project, Tweet My Street. As has been mentioned, Tweet My Street is a cross-disciplinary project with social scientists who required scalable geospatial analysis. Scalable geospatial analysis is a growing area of research, with many trying to identify a solution to support the analysis of data from mobile devices, where location services and GPS are used to "tag" a user's location. Twitter allows the user to add their location as they post a Tweet, so the analysis and visualisation of this tag can help users add context to their analysis. For example, using the use cases described in the previous chapter: if a company is running an advertising campaign for a new product they can monitor the hype on Twitter by tracking keywords - but if they also use location they can monitor the location of the campaign, and determine whether people passing the advert are commenting on the product, which shows the effect of the advertisement itself.

Antares aimed to provide the user with a scalable means of displaying the data in a browser, allowing a more automated means of analysis. This chapter describes the extensions that were designed and implemented to support this. Spatial analysis can be complex due to its multi-dimensional nature; there are traditional techniques for it, but these do not have the scale of current NoSQL solutions. The research identifies whether using a NoSQL database with spatial processing techniques can improve the scale of processing for spatial data - which is needed due to the large amount of data produced by mobile devices, a volume that will grow as they increase in number. A literature review was undertaken to identify processing techniques and the different NoSQL approaches to processing spatial data. A solution was then designed and implemented, which is described here. This was then evaluated against different NoSQL solutions using a simulated stream of Tweets.


4.2 Related Work

This section describes different geospatial structures which support efficient spatial storage and processing, and then leads on to current solutions which use these techniques with NoSQL databases to enable scalable geospatial analysis. Antares uses a simpler querying model to support a scalable solution for querying Twitter data with low latency query responses. The aim of Antares is to provide a scalable map view of Twitter data that may be panned and zoomed in and out quickly, with no limitation on the size of the data; the size of the data should not hinder performance. The literature review undertaken reviews current solutions and compares the technologies used.

4.3 Research into Spatial Databases and GIS

This section reviews and evaluates various geospatial processing structures and how these are being extended to exploit the scalable nature of NoSQL databases. Spatial data is intrinsically complex to process, which is attributable to its multidimensional nature: multi-key searches are required to support range queries across such data. There are multiple well-established structures employed for storing and processing spatial data; this section reviews and compares a subset of indexing structures for geospatial analysis, the most common of which are tree structures.

4.3.1 B-Trees

A B-Tree [67] can be defined as a hierarchical structure of nodes beginning with a root node, which references sub-trees of linked child nodes. These direct searches to the leaf nodes containing data. These structures are employed to store and process hierarchical data. When identifying or inserting data, each level of the tree is traversed, comparing nodes until a leaf node is discovered and returned or inserted. If spatial data is stored using a hierarchical structure, an n-dimensional rectangle is used as a bounding box for the objects indexed. Each node has a maximum number of entries. All leaves appear on the same level, and if m is the maximum number of entries then ⌈m/2⌉ is the minimum number of entries in a node. Each node represents a bounding box, which represents a geographical area. If data is inserted and the node's limit is surpassed then it is split into two new leaf nodes. Rebalancing the tree is carried out when necessary to ensure the levels below the root are of the same depth, or within a set limit. This reduces the length of branches, thereby reducing the number of traversals and the query execution time.

A B+ Tree is an n-ary tree [67]. The root node can be a leaf or a node with child nodes, and the number of child nodes is variable. Each node in the tree holds only a key; the data is appended to an additional layer after the leaf nodes. The data is appended at the position with the correct key, so when the tree is traversed for that data it finds the leaf node that points to that piece of data.

4.3.2 R-Trees

A fundamental type of tree is the R-Tree [68], a dynamic indexing structure. R-Trees are typically height balanced, therefore the main cost incurred when using this structure is rebalancing the tree. This tree structure uses proximity to group clusters of points together for querying. Each node in the tree has a capacity limit; when this is exceeded the node is split and the data contained by that parent node is divided between the two new child nodes. A node is split based on the position and closeness of the points contained in the node. There are different methods for this division: Linear – choose far-apart points as ends and keep the points that are held within; Random – choose nodes randomly and assign them so that they require the smallest minimum bounding rectangle (MBR) enlargement; Quadratic – choose two nodes so the dead space between them is maximised; Exponential – search all possible groupings.

A variant of the R-Tree is the R+ Tree [68]. This structure completely avoids overlapping bounding boxes, which is achieved by adding objects to multiple buckets; as a result, the depth of an R+ Tree can be substantially larger than that of an R-Tree constructed from identical data. R-Trees are a useful data structure for geospatial processing, however compared with structures like geohashing they are overly complex.


4.3.3 Kd-Tree

Kd-Trees [69] are binary trees which compute the median of all the points stored in a node and then split the node at that location in the bounding box. Kd-Trees also support the simple addition of another dimension, such as time. Nodes split alternately between each dimension in the structure, so each level represents a different dimension of the tree. Kd-Trees are particularly interesting as they can use a third dimension and are easy to implement.

4.3.4 Quad Tree

A related structure is the Quad Tree [70]: instead of dividing the buckets of the tree into two, each node is split into quadrants. This structure is used for two-dimensional data, e.g. "find all the cities which are within 300 miles of Chicago or north of Seattle". Inserting data follows the same principle as a binary tree, with a comparison at each level to decide which quadrant the data is inserted into. Deleting and merging two trees can be difficult. Quad Trees support straightforward geospatial processing and allow more fine-grained querying when compared with the Kd-Tree.

4.3.5 Geohashing

Geohashing [71] is a hierarchical spatial data structure which uses base 32 to encode geospatial data. The longer the common prefix of two geohashes, the closer the data points generally are. It is a simple mechanism for storing geospatial data: the algorithm is straightforward, and prefix matching of the hashes gives a quick approximate test of closeness.
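To make the base-32 prefix property concrete, the following sketch implements the textbook geohash encoding; it is a generic illustration rather than the implementation used by any particular system.

    public class Geohash {
        private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

        // Encodes a latitude/longitude pair into a geohash of the given length by interleaving
        // longitude and latitude bits and mapping each group of 5 bits to a base-32 character.
        public static String encode(double lat, double lon, int length) {
            double[] latRange = { -90.0, 90.0 };
            double[] lonRange = { -180.0, 180.0 };
            StringBuilder hash = new StringBuilder();
            boolean evenBit = true;   // start with a longitude bit
            int bit = 0, ch = 0;
            while (hash.length() < length) {
                double[] range = evenBit ? lonRange : latRange;
                double value = evenBit ? lon : lat;
                double mid = (range[0] + range[1]) / 2;
                ch <<= 1;
                if (value >= mid) { ch |= 1; range[0] = mid; } else { range[1] = mid; }
                evenBit = !evenBit;
                if (++bit == 5) {              // every 5 bits emit one base-32 character
                    hash.append(BASE32.charAt(ch));
                    bit = 0;
                    ch = 0;
                }
            }
            return hash.toString();
        }

        public static void main(String[] args) {
            // Two nearby points typically share a long common prefix.
            System.out.println(encode(54.9783, -1.6178, 8));
            System.out.println(encode(54.9784, -1.6180, 8));
        }
    }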

4.3.6 Distributed Trees

As the size of datasets increases, distributed structures have been developed, such as the SD-R Tree [72], an in-memory distributed R-Tree. It is an extension of the R-Tree: each leaf node stores a subset of the indexed data and the height differs by at most one. The cluster that holds the distributed R-Tree is accessed by an application on the client, via an image of the tree stored on the client. That image (and the structure of the tree held in it) can become outdated by splits to the data (the same splits described above for the R-Tree). The cluster periodically sends messages to the client to prevent it operating on an outdated image. Distributing the tree in this way is more complex than the indexing in Antares and unnecessary to improve query execution time. The messages may increase wait time, and with only one client this will cause a bottleneck when ingesting the data.

A P-Tree [73] is a distributed B+ Tree and is an index structure for distributed range queries. It maintains sections of semi-independent B+ Trees at each peer, with no primary copy replication. Each peer views its key as the smallest key and is responsible for maintaining the left-most path in the tree from root to leaf, relying on a subset of its peers to complete the tree. The P-Tree is a distributed tree, but lookups can be expensive if the entire path of the tree is required, as many nodes may need to be searched.

The Oct-Tree [74] is a three-dimensional, distributed variant of the Quad Tree. It employs a geohashing algorithm to identify the node containing the requested points. A disadvantage of using this tree is that the root can become overloaded.

4.3.7 NoSQL Systems and Geospatial Processing

Social media, sensors and portable devices generate large volumes of spatial data; this has led to a deviation from traditional processing techniques. NoSQL databases support scalable, flexible and fault tolerant solutions for processing data. However, they do not have the well-established spatial querying techniques that RDBMSs do. Therefore new mechanisms are required to combine spatial querying with distributed, fault tolerant NoSQL systems. Some NoSQL based approaches are now described.

Cassandra is a column-oriented database which provides support for spatial querying through a framework called Solr [75]. Solr uses a textual search system called Lucene [30] for indexing the data and employs Cassandra as the data store. A Prefix Tree is used to index the spatial data; the tree divides the world into a grid, and each grid cell is further decomposed into smaller grid cells as the depth of the tree increases.


Solr's node distribution techniques do not efficiently exploit Cassandra's scale-out capabilities: each node in the cluster has its own tree, which maintains its own data. Therefore, any query being executed across a multi-node cluster has to search each node, which increases query response time.

Neo4j [34], the graph database, supports spatial querying using an R-Tree. It uses this structure for low latency searching of geometric shapes and topology operations.

MD-HBase [76] enables multidimensional spatial querying using the NoSQL database HBase. It linearises co-ordinates by using bit interleaving and the Z-order curve algorithm (which maintains locality) to index the data. This is referred to as the index layer and references the key used to query the database. MD-HBase builds two standard indexes, a Kd-Tree and a Quad Tree. MD-HBase can handle hundreds of thousands of inserts per second on a modest 16-node cluster and supports response times as low as 100 ms [76]. It uses linearisation techniques to transform multidimensional points into one-dimensional data points. Additionally, a novel naming scheme called longest common prefix naming is used, and a look-up array is stored and used for searches; it is similar to a Prefix Hash Tree. Concurrent reads while splitting may be inconsistent; the authors [76] leave consistency as future work.
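To make the linearisation idea concrete, the sketch below interleaves the bits of two 16-bit grid co-ordinates into a single Z-order (Morton) key, the kind of transformation MD-HBase applies before indexing; it is a generic illustration, not MD-HBase's actual code.

    public class ZOrder {
        // Interleaves the lower 16 bits of x and y so that nearby (x, y) cells map to
        // nearby one-dimensional keys, preserving locality.
        public static long interleave(int x, int y) {
            long key = 0;
            for (int i = 0; i < 16; i++) {
                key |= (long) ((x >> i) & 1) << (2 * i);        // x bits go to even positions
                key |= (long) ((y >> i) & 1) << (2 * i + 1);    // y bits go to odd positions
            }
            return key;
        }

        public static void main(String[] args) {
            System.out.println(interleave(3, 5));   // 0b011 and 0b101 interleave to 0b100111 = 39
        }
    }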

MongoDB [23] employs spatial querying; it uses geohashing to represent different quadrants of the region. This database suits sparse spatial point data because of its flexible schema.

Hadoop GIS [77] is a high performance spatial data warehousing system which uses MapReduce to process large datasets. It supports multiple spatial queries on MapReduce through spatial partitioning, a customisable spatial query engine called RESQUE [78], implicitly parallel spatial query execution on MapReduce, and effective methods for correcting query results by handling boundary objects. It supports Hive for declarative queries. Spatial querying involves geometric computations, which can be very compute intensive. Datasets are broken down into tiles and then processed in parallel; this is achieved by spatial partitioning – when tiles are too dense the process is executed again. The real-time spatial query engine supports generic querying, simple parallelism and the leveraging of existing query mechanisms. It builds tile-based indexes on the fly, supporting dynamic indexing.


GISQF [79] is a system for spatial analysis which extends Hadoop to implement MapReduce for spatial querying of GDELT [80] datasets. The system achieves faster response times than plain Hadoop.

Another Hadoop solution is CLoST [81], a scalable framework for spatio-temporal data. Its main objective is to avoid scanning a whole dataset: it is based on hierarchical partitions that use core attributes to process range scans efficiently in parallel. Antares provides scalable spatial analysis but with real-time responses, unlike the batch MapReduce functions of Hadoop based systems.

Another spatial NoSQL database is HBaseSpatial [82], a further system that extends HBase; it has been evaluated against MongoDB and MySQL. HBaseSpatial uses a static grid index for querying, with vectors for searching, inserting, and identifying whether the latitude and longitude of point data are contained within a grid.

The MHB-Tree [83] combines spatial querying with the NoSQL database OrientDB. It uses a B-Tree based structure to process real-time insertion of spatial data. It stores the data using a geohash, and is evaluated against MySQL.

4.3.8 Conclusion

The approaches described here use more complex querying mechanisms; however, this sacrifices the scale and performance that Antares requires. Antares exploits the scalability of a NoSQL database to increase ingestion and search sizes with low latency, which is achieved by using a simpler querying model for displaying the data in a view port. The system is designed to ingest and process large scale datasets, so there is no requirement to delete data; this also improves the performance of Antares and differentiates it from the other systems described in this literature review. This thesis aims to exploit the scalability, fault tolerance and flexibility of NoSQL technology and combine it with spatial querying techniques to develop novel algorithms and an indexing structure that support low latency querying and outperform existing systems.


4.4 Geospatial Querying

Geospatial data is inherently difficult to process due to its two-dimensional nature – point data consists of a latitude and a longitude. For example, querying for all points contained within a rectangular area on a map could require comparing every point held in the database. Reducing the number of points queried is therefore beneficial and will reduce query response time. Traditional spatial querying techniques use tree structures to reduce the amount of data queried and so reduce response time. With the increase in dataset volumes and the need for faster and more efficient processing, geospatial processing faces a new challenge.

Antares supports a simple querying model for retrieving points contained within a rectangle [4]. This simplicity supports dramatic improvements in querying response times as will be further explained in this section. This query model is only possible because the system is append-only. The aim of Antares was to insert and retrieve large-scale data. No modification functions were required as the aim was to collect all data to support its retrieval at any point in the future.

Antares provides a point-in-rectangle querying model to support the querying of Twitter data in a view port for social scientists' research. This has been driven by the cross-disciplinary project Tweet My Street, where large scale datasets have been used to analyse different use cases. The data in these datasets provides geo-tags which, if displayed on a map, offer a richer analysis technique. The aim of Antares was to provide scalable and low latency querying of this data - no matter how many points were contained within the rectangle. The implementation of Antares supports this functionality and provides the user with a scalable and low latency mechanism for querying geospatial Twitter data.

Antares supports scalable geospatial querying through a user interface. The main aim of Antares was to provide functionality for the user to be able to zoom in and out quickly in a browser using the viewport (the user’s visible area of the web page) as the region being queried.

Antares aimed to use the scalability of Cassandra and geospatial querying techniques to provide a scalable and efficient means of achieving this. Datastax uses Solr (with Cassandra), as mentioned in Section 4.3.7, for geospatial querying. However, it is limited in performance.

Solr allows users to query geospatial regions, but with high latency and timeouts for larger regions (investigated in more detail in Section 4.6.2). Antares therefore aimed to reduce query response time, support large-region querying using Cassandra, and enable the user to pan around a browser quickly and efficiently while querying the data. The problems with Solr are now described, alongside the design of a new system to produce the required performance.

4.4.1 Solr Overview

Cassandra is a column-oriented database which provides support for spatial querying by using Solr [75], a search server built on Lucene Core. However, the response time for retrieving large areas of data points is too long, as shown in Section 4.6.2. Query response times must be near real-time to support quick zoom in/out functions. This is overcome by the new approach described in this section, which uses an in-memory cache and novel algorithms to significantly improve performance and scalability.

Solr has three different types for spatial data: "LatLon", "PointType" and the "Recursive Prefix Tree Field Type" (RPT) [84]. LatLon represents a two-dimensional point of latitude and longitude, based on the spherical earth model; the latitude and longitude are saved as separate numbers within the LatLon construct. PointType is the same, except that it does its calculations using the flat earth model. RPT uses the Prefix Tree for storage and searching. Geospatial types are added to an XML file to register with Solr that they are being used. Solr uses bbox, which generates a box using a specified co-ordinate and a distance; the distance is then used to create the region being queried from that point.

Solr uses a prefix tree to index the spatial data. This tree employs geohashing to search using prefixes. It is an ordered dynamic tree structure which uses keys of type string, as shown in Figure 4.1.

Figure 4.1: Prefix Tree

Unlike a binary tree, no node holds the id of another. Instead, the position in the tree defines the key it is associated with. All descendant nodes have a common string prefix associated with the parent node, with the exception of the root node, which is an empty string. Values are not associated with every node, only with the leaves and with inner nodes whose key is a prefix of a stored key. The Solr tree divides the world into grid cells as described for other tree structures. Each grid cell is then further decomposed into smaller grid cells, the larger ones being level one and the smaller ones level two; this continues for n levels depending on the size of the dataset.

Each grid is given a prefix as a key using a Geohash – in Figure 4.1 the first node in level one has a prefix of “T”. As nodes are split the prefix is extended – as can be seen in Figure 4.1 the child nodes of “T” become “to” and “te”.

When a query is executed to search for a region of points, the Lucene filter is employed to identify which nodes to search for and which to disregard. Once these have been identified, each Geohash is compared to the user’s query and matching points are returned.

As nodes are added to the cluster a tree is created for each one. The structure is then responsible for maintaining data on that node only, so adding more nodes provides additional storage but does not utilise the added CPU power by executing commands in parallel. Therefore, any query being executed across a multi-node cluster has to search each node, which increases query response time. This results in increased delays to response time and limits the querying power of Solr.

Lucene is not an eventually consistent system, and this is why the data cannot be distributed. If the data were distributed across the cluster and a global index were used then the system would have to implement a global lock: each time a read or write was executed, every node would block until the query returned, yielding an increase in response time.

4.4.2 Dynamic Index Layer

This section describes the investigation of a dynamic grid to help support scalable and near real-time spatial querying of Twitter data. The design decisions, structures and algorithms employed to support low latency querying are discussed.

Antares employs an indexing layer, which uses a cache to store pointers to geospatial data in Cassandra. This enables efficient querying of the spatial data to reduce response times. The indexing layer constructs and executes queries using the cache; this ensures that Cassandra does not know anything about the structure of the cache. This enables efficient querying and takes advantage of Cassandra's column family schema for indexing data.

A common performance problem when viewing a selection of points on a map is the volume of data being returned to the browser. Typical static grid methods return all the points in each grid that has been specified. If a view port intersects multiple grids then the number of points returned is multiplied by the number of intersecting grids, even if they are not all shown in the view port. This is exacerbated by the density of points in a grid, as a static grid method has no means of dividing grids that become highly populated. This can increase query time and even make browser viewing impossible if responses exceed the time-out.

It was believed that a dynamic grid would reduce the number of points outside the viewport being returned, and therefore reduce query time. By implementing a dynamic grid structure, more densely populated areas can be split into smaller grids, supporting more specific and fine-grained querying. Limiting the number of points in a grid also improves performance and ensures that response times are reduced, as there is a maximum on the number of data points that can be returned at any point. A dynamic grid indexing structure therefore supports the aim of zooming in and out and panning around.

The cache used by Antares is a tree structure. It improves spatial querying performance by extending Cassandra and leveraging its scalable characteristics. The index layer is used to manage and construct queries to the database to ensure low latency querying. Antares supports spatial querying by employing algorithms which are executed across different tree structures.

The index layer has different user-defined configuration parameters that can be tuned: the height of the tree can be specified to avoid tree nodes representing excessively small geographic areas (e.g. GPS is only accurate to around four metres), and the capacity of each tree node is also configurable, providing fine-grained or broader tree nodes.

Figure 4.2 demonstrates the distributed nature of Antares – both the client nodes and the Cassandra nodes are distributed to increase efficiency and scalability. A client node is the server holding the tree structure and not a client holding a browser. As can be seen in Figure 4.2 the clients ingest data, execute queries and the indexing layer is also stored in-memory here. Storing the structure in-memory means that traversing the tree takes very little time. Each of the clients ingests and accepts queries simultaneously. The queries to the Cassandra cluster can be accepted by any node that is available.


Figure 4.2: Antares: geospatial architecture

Figure 4.3 shows the abstract layers of the structure of the geospatial querying for Antares. Each time new data is received by a client node it is sent to the indexing layer. Here an algorithm is executed to traverse the tree and identify the correct tree node to insert the data into. A query is then constructed and sent to the database to write the data. When a query is submitted to the client node this is also passed on to the indexing layer. However, the indexing layer then executes a search algorithm which traverses the tree to identify the tree nodes that must be queried. A query is then constructed and submitted to the database to retrieve the data.


Figure 4.3: Antares: abstract geospatial architecture

Antares supports the interchange of different structures but employs the same algorithms over them. The system takes advantage of well-established tree querying techniques but removes some of the more "traditional" storage and processing techniques employed when considering tree processing. This is possible due to the immutable nature of the data used, as it is append only. The requirements of the tree also change due to its distributed nature, as distributed locking can cause problems and deadlocks; to avoid these, algorithms were designed to avoid the need for locks, and as a result, most operations can be done concurrently.

The indexing layer is held on a client node and the tree is copied across each client node in the cluster, which decreases the chance of bottlenecks and supports scalability by increasing the number of nodes ingesting data. All spatial queries utilise this layer, rather than accessing Cassandra directly, as shown in Figure 4.2.

The tree currently supports three operations: one to add data into the tree and split tree nodes if they have reached capacity, a search algorithm for point data, and a range algorithm.


Each node in the tree represents a geographic region on a map, as demonstrated in Figure 4.4. The first grid is the whole world; as the number of Tweets stored increases, the grid is decomposed into smaller grid cells. Figure 4.5 shows the division of the grid cells using a Quad-Tree. The grid is divided into four new cells for this tree structure, so node B transfers data into nodes F, G, H and I, which become the child nodes of node B. Each node in a tree structure has a maximum number of entries (in this case Tweets) that it can contain before new child nodes are created and the data is "split" between them.

Figure 4.4: The mapping between the world and grid regions which represent tree nodes


Figure 4.5: Node B being split

Figure 4.6 shows a Quad-Tree composed of the grid cells shown in Figure 4.5. Each node in the tree represents a region of the world. Figure 4.6 shows the root node A down to the bottom level of the tree, with child nodes F, G, H and I. As a grid is split, new nodes are created and the data divided between them. Comparing Figure 4.5 and Figure 4.6, it is clear that the data is divided by region, not size, and pushed to the new child nodes. This means the tree grows dynamically in data-intense locations.


Figure 4.6: Quad-Tree mapped from the grid

Figure 4.7 demonstrates the mapping between the in-memory tree structure and the Cassandra cluster. Each tree node is written to the database non-sequentially; this results in faster writes and larger ingestion rates. This is because the write queries can be accepted by any Cassandra node and therefore wait time is reduced.

Figure 4.7: The grid changing for a split

Each tree node contains data about the region it maps to in the grid, as shown in Figure 4.8. This includes:

Id: A unique identifier for the tree node, used as the row key in the database.

Number of children: The number of children the node owns; used for consistency checks, described later in Section 4.5.

Entries: The number of data items already stored in the node.

Capacity: The limit on how much information the tree node can store before it needs to be split.

Bounding box: Identifies the region the tree node owns.

Figure 4.8 demonstrates all the information held by each node within the tree. The id is used as a key in the database and is used to signify the record that should be read or written. As the id is the key this ensures that all records for one node are held on the same Cassandra node. This is because a key is the identifier for a row and rows by default are always held on the same Cassandra node.
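A simple sketch of what such a tree node might look like in code is given below; the field names follow the list above, while the class itself and its bounding-box helper are illustrative assumptions rather than the exact Antares implementation.

    public class TreeNode {
        // Unique identifier for the node, also used as the Cassandra row key.
        private final String id;
        private int numberOfChildren;   // used for consistency checks (Section 4.5)
        private int entries;            // number of data items already stored in this node
        private final int capacity;     // limit before the node must be split
        private final BoundingBox region;

        public TreeNode(String id, int capacity, BoundingBox region) {
            this.id = id;
            this.capacity = capacity;
            this.region = region;
        }

        public boolean isFull() { return entries >= capacity; }
        public void recordInsert() { entries++; }
        public void childrenCreated(int n) { numberOfChildren += n; }
        public String getId() { return id; }
        public BoundingBox getRegion() { return region; }
    }

    class BoundingBox {
        // Top-left and bottom-right corners of the geographic region this node owns.
        final double topLat, leftLon, bottomLat, rightLon;

        BoundingBox(double topLat, double leftLon, double bottomLat, double rightLon) {
            this.topLat = topLat; this.leftLon = leftLon;
            this.bottomLat = bottomLat; this.rightLon = rightLon;
        }

        boolean contains(double lat, double lon) {
            return lat <= topLat && lat >= bottomLat && lon >= leftLon && lon <= rightLon;
        }

        boolean intersects(BoundingBox other) {
            return !(other.leftLon > rightLon || other.rightLon < leftLon
                  || other.bottomLat > topLat || other.topLat < bottomLat);
        }
    }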

When a new Cassandra node is added to the cluster because existing nodes are full, this does not affect the indexing layer: the cluster automatically rebalances and the id (rowkey) is hashed to determine the node on which the data is now stored. However, during this time the cluster is not available for querying.

Figure 4.8: The indexing layer converted to Cassandra schema


The data contained within each tree node changes as data is ingested. Once the tree has been traversed and the correct tree node identified for insertion the number of entries is increased. The indexing layer also validates whether the number of entries has exceeded the capacity of the tree node. If the tree node has exceeded the capacity it is then split and new child nodes are generated. The number of children field is then increased.

When a query is accepted by the client to display a set of points in a viewport, the viewport can intersect multiple tree nodes (grids), so a point could be held in any of the tree nodes intersected by that viewport. This is shown in Figure 4.9, where the red grid is the viewport and it covers grids B, C, G, H, L and M. The client therefore sends this request to the indexing layer, where the search algorithm is executed. This traverses the tree for the intersected grids and constructs a query containing the correct tree node ids. The query is then submitted to Cassandra to retrieve the points.

Figure 4.9: A viewport intersecting multiple grids

4.4.3 An Unbalanced Tree

The tree has no balancing algorithm. This reduces query execution time by removing the blocking time that would have been required to balance the tree. This is only possible due to the type of data being used, which meant an append-only system could be implemented. With the new "Big Data" mentality, most use-cases do not require the ability to delete or modify data; the new challenge is collecting all data and being able to process it at any point in the future.

Balancing a tree structure can support lower latency querying, as fewer levels will be queried, as shown in Figure 4.10. Fewer levels being queried results in fewer disk seeks, as less data needs to be searched. However, this is only an advantage for trees with a large number of levels and that are not distributed.

The tree structures used in Antares have a limited height: GPS is only accurate to about four metres, so there is never any need to divide grids/nodes further than that, meaning the tree does not get any deeper and data is simply entered into the leaf nodes from then on. It is therefore advantageous to remove the cost of balancing the tree; having some branches which are longer will not increase query response time to the database.

For example, consider a Kd-Tree covering the whole world that has been split to the maximum depth: the height would be 15,000. Each degree is approximately 111 km (111.2 km for latitude), so 4 m is taken as equivalent to 0.036 degrees (rounded to two significant figures). A world grid would then consist of 10,000 x 5,000 cells, so if each grid cell were split down to 0.036 degrees there would be 15,000 levels in the tree. Searching in-memory down one branch is not data intensive and is done quickly, and only the leaves are added to the query sent to the database; the database will query the same number of nodes whether the tree is balanced or not. The cache search may take a few milliseconds longer in the worst case, but this is negligible.


Figure 4.10: An unbalanced tree being balanced

To reduce the chance of bottlenecks the tree is copied and distributed across a number of clients, as shown in Section 4.4.2; keeping it unbalanced therefore helps to reduce query execution time and increases the ingestion rate. If the tree were balanced, a global lock would be required, blocking all queries and ingestion until balancing completed. This would reduce ingestion rates and increase query response times.

Antares exploits the eventually consistent nature of Cassandra to yield increased availability and avoid data loss. This is done by always accepting writes and then checking and updating consistency eventually using a time-out – this is explained in detail in Section 4.5. Consistency can only be eventual because the data is immutable and only the pointer to the data location changes.

A typical disadvantage of using tree structures is query hot spots, where positions in the tree such as the root node become query heavy. This increases waiting times for queries to start executing and increases query response times. Antares avoids these hot spots by exploiting the distributed nature of Cassandra. The divides are based on write hot spots rather than load. This is because each node is distributed across the cluster non-sequentially, therefore the reads will also be distributed.

Each query to a tree node uses one CPU, so queries can be executed in parallel if there is more than one CPU. Consequently, as there is only a fixed number of CPUs, the fewer queries executed (the fewer tree nodes searched) the lower the response time, as the execution will have higher parallelism. The parallelism of the query is directly dependent on the number of CPUs: once every CPU is executing a query, further queries must wait to be executed.

Antares was designed to exploit this advantage. As data is entered into the tree it is split and pushed down to the leaf nodes. Therefore the root node and other internal (non-leaf) nodes do not need to be queried. This is demonstrated in Figure 4.11, where the nodes marked with crosses are not included in the query submitted to the database.

Figure 4.11: Nodes which are not queried as they hold no data

Therefore there are fewer queries to execute and the response time is reduced.

Another advantage of Antares is that the tree nodes are not all written sequentially to the Cassandra cluster. This results in queries being distributed across the cluster. As a result, there is not a single point (e.g. the root node) where queries are concentrated, which prevents hotspots. Potential consistency issues and solutions are described in detail in Section 4.5.


The three structures evaluated in Section 4.6.2 are a Kd-Tree, a Quad-Tree and a Geohash structure.

4.4.4 Kd-Tree

A tree structure was chosen as optimal for append-only geospatial data: no lock is required, the tree does not need to be balanced, and it holds only keys rather than the data itself. The first structure implemented in Antares was a Kd-Tree [69], which offers a simple tree structure over which to execute the algorithms. A Kd-Tree is a derivative of a binary tree, so each parent has two child nodes. Each node contains the number of data entry points added, the number of children and their ids, the capacity of the node and the node id.

Each node also contains a bounding box. A bounding box is the region (or grid) that a node owns, and is used to dictate where spatial data is stored in Cassandra – the node id is the key used to access the data. The id, constructed from the top left and bottom right co-ordinates as shown in Figure 4.12, is used to determine if a point is contained in the bounding box's geographic region. Given a world map, the area of this geographical region is described by the co-ordinates of the top left corner and the co-ordinates of the bottom right corner; this region is a "grid". This "world grid", shown in Figure 4.12, represents the root node's bounding box. When the tree is first instantiated it contains only one node (the root node); as this represents the "world grid", all points are contained in its bounding box. This results in all data being stored to the data store with the root node id. Once the capacity of the root node is exceeded, the bucket and node are "split".


Figure 4.12: A node and its values

Figure 4.13: Grid represents a Kd-Tree

Figure 4.14: Kd-Tree

Figure 4.13 shows the grid of a Kd-Tree. Node A is the original root node, which has now been split twice: first into nodes A and B, and then node B has been split in two because the node capacity was exceeded. The new nodes now own the data in that grid, not the parent. Figure 4.14 shows the Kd-Tree which is mapped from the grid shown in Figure 4.13.

4.4.5 Quad-Tree

The second structure implemented was the Quad-Tree [70]. This differs from the Kd-Tree in that each grid is split into quadrants, resulting in the parent node having four child nodes. As shown in Figure 4.15, the root node A again contains all of the data until the capacity is exceeded. Once this happens the grid is split into quadrants, which become nodes B, C, D and E. In this figure node B is also split, as it has exceeded the node capacity. This is then mapped to the tree shown in Figure 4.16.

Figure 4.15: Grid representation of a Quad-Tree


Figure 4.16: Quad-Tree
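A sketch of how a node's bounding box could be divided into four quadrants is shown below, reusing the hypothetical BoundingBox class sketched in Section 4.4.2; it illustrates the quadrant division rather than Antares's exact split routine.

    import java.util.Arrays;
    import java.util.List;

    public class QuadSplit {
        // Splits a bounding box (top-left / bottom-right corners) into its four quadrants.
        public static List<BoundingBox> split(BoundingBox box) {
            double midLat = (box.topLat + box.bottomLat) / 2;
            double midLon = (box.leftLon + box.rightLon) / 2;
            return Arrays.asList(
                new BoundingBox(box.topLat, box.leftLon, midLat, midLon),      // north-west
                new BoundingBox(box.topLat, midLon, midLat, box.rightLon),     // north-east
                new BoundingBox(midLat, box.leftLon, box.bottomLat, midLon),   // south-west
                new BoundingBox(midLat, midLon, box.bottomLat, box.rightLon)); // south-east
        }
    }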

4.4.6 Geohashing

The third structure is a geohashing algorithm [71], which is essentially a tree structure where parents have 32 children. It uses base 32 to encode the latitude and longitude taken from the input data. This involves recursively splitting each node into 32 grids, which represent the child nodes. This means there are very fine-grained grids and the tree is much shallower than the others. The tree starts with the same root node as the other two, with a grid covering the whole world, and the parent node is split into 32 children. This continues as data is ingested and stops when splitting the grid would become irrelevant (i.e. when the size of the grid becomes smaller than the accuracy of the device; for example, GPS is accurate to 4 metres). The structure is usually used for proximity queries, as the data is stored predominantly by locality: for a proximity search the data points are saved closer together on disk, so there is no requirement for large scans and the query response time is reduced.

4.4.7 Schema

A schema was designed, shown below, and developed to store the geospatial data in Cassandra. It uses the traditional column family model, as Cassandra is unaware of the indexing layer. The row is made up of a rowkey, lat (latitude), lon (longitude) and value. The column name can also optionally include time, which adds a third dimension to the data. The composite column of latitude, longitude and optionally time provides a secondary index for querying, which is referred to as wide rows. The value is a JSON object of the data point content.

CREATE TABLE geo (
    rowkey text,
    lat text,
    lon text,
    time text,    -- optional
    value text,
    PRIMARY KEY (rowkey, lat, lon, *time)
);

*time is an optional attribute; if it is not included then it is not part of the primary key.

The schema was modified from the previous tweet geo table; it was extended to include latitude and longitude to support more complex querying.

The rowkey is a special unique key which identifies the row (the equivalent of a grid in the "world grid", i.e. a tree node). The tree uses this rowkey as a pointer to the row in the database. The tree holds only this pointer, not the actual data – that is stored in the database.

4.4.8 Query Mapping

The architecture maps queries from the tree structure in-memory to the Cassandra database without the database knowing anything about the indexing layer. The advantage of using this layered approach is that searching for index keys can be done in memory and the database is only required for the retrieval of the data itself.

This section discusses the mechanisms used for this and demonstrates how the original user-entered parameters are converted into a query to be executed across Cassandra.

The user chooses a viewport in the browser to display the points contained within that rectangle. The top left and bottom right co-ordinates are given to the indexing layer, and these are then used to search the tree using the range algorithm described in the next section. These searches are executed in-memory over a tree structure held on a separate "client-server", which is not part of the Cassandra cluster.


Once the correct node or nodes holding data contained in the user-specified grid are found, the keys from each of these nodes are used to compile a query to be executed over Cassandra. Single key querying is used where possible, as it improves the performance of Cassandra, which is optimised for single index queries. If multiple keys are found (i.e. the viewport crosses more than one region) then the additional keys are added to the query using the IN keyword. The IN keyword allows the queries to be executed as if they were a set of separate queries, in parallel, so response time is reduced. The keys are entered into the following queries to be executed over Cassandra.

Q1(topLeft, bottomRight):
SELECT * FROM geo WHERE rowkey = topLeft + bottomRight

Q2(topLeft, bottomRight, starttime, endtime):
SELECT * FROM geo WHERE rowkey = topLeft + bottomRight AND time > starttime AND time < endtime

These are executed across Cassandra and the results are returned and displayed in the browser; a REST interface is used to enable this.
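A sketch of how the keys returned by the tree search might be compiled into a single statement with the DataStax Java driver is shown below; it assumes the geo table and rowkey column from Section 4.4.7, and the class itself is illustrative.

    import com.datastax.driver.core.*;
    import java.util.List;

    public class ViewportQuery {
        private final Session session;
        private final PreparedStatement single;
        private final PreparedStatement multi;

        public ViewportQuery(Session session) {
            this.session = session;
            this.single = session.prepare("SELECT * FROM geo WHERE rowkey = ?");
            this.multi = session.prepare("SELECT * FROM geo WHERE rowkey IN ?");
        }

        // keys are the row keys of the tree nodes whose regions intersect the viewport.
        public ResultSet fetch(List<String> keys) {
            return keys.size() == 1
                ? session.execute(single.bind(keys.get(0)))
                : session.execute(multi.bind(keys));   // the bound list expands into an IN clause
        }
    }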

4.4.9 Spatial Algorithms

Sets of novel algorithms were implemented for searching and inserting data into Cassandra. Data was collected from streams and inserted efficiently and quickly into the database using the novel cache and algorithms designed and developed for Antares.

There is a basic set of algorithms used to search the data and provide efficient, low latency response times so that a viewport can be rendered in a browser for a user to analyse. The algorithms' principal constraint was to return in a few milliseconds to enable quick and simple rendering of data points at the client.

4.4.9.1 Insertion Algorithm

The Insertion Algorithm, as shown in Algorithm 2, searches the tree to identify the correct node to insert the data into Cassandra. This is done by recursively traversing the tree for the node containing the minimum-bounding box. The data is inserted into the data store (Cassandra) using the co-ordinates from the minimum-bounding box as a row key.

Algorithm 2 Insert
  geo point          # an object made of a latitude and a longitude
  geo region         # the bounding box which represents a geographic area on a map
                     # a key is the identifier for a record in the database, made of a
                     # top left co-ordinate and a bottom right co-ordinate
  Object node        # contains all keys for that region
  int maxHeight      # the maximum height of the tree
  int height         # the height of the tree at runtime
  int numberOfNodes  # the number of nodes contained in the tree (the structure the nodes are contained in)
  int numberOfPoints # the amount of data that a node contains at runtime
  int capacity       # the limit of the number of key entries per node
  list a1...an       # a list of predicates
  predicate p        # the query predicate
  database h         # the historic store where tweets have been stored
  tweet x            # Twitter data containing a latitude, longitude and optionally time,
                     # taken from a stream of tweets s

  input: tweet x
  if numberOfNodes < 1 then
      node = new node
  end if
  if x.point contained in node.region then
      if node.numberOfPoints < capacity then
          INSERT INTO h VALUES (x)
          return true
      else if height < maxHeight then
          Split(node, data)
      else
          INSERT INTO h VALUES (x)
      end if
  else
      Insert(x)
  end if

The next algorithm described is the Split Algorithm, displayed in Algorithm 3. Algorithm 3 requires the node to split, the data being entered, and the number of new nodes to instantiate. This example uses the Kd-Tree structure, for simplicity. When a bucket exceeds the capacity, the node's bounding box is split into two new bounding boxes and two new nodes are created to contain them. The grid is divided by alternating between latitude and longitude.

Algorithm 3 Split
  # This algorithm splits a node into two nodes; points from the original node are
  # inserted into the new nodes and then removed from the original node
  Require: node, data
  Object node1
  Object node2
  geo point    # contains a latitude and longitude
  int count    # counts each pass through the tree so that the bounding boxes are split alternately
  geo region   # geographical area represented by a top left co-ordinate and a bottom right co-ordinate
  Object key   # represents a row in the database; made of the top left and bottom right co-ordinates

  if count % 2 == 0 then
      # latitude halved and two new regions created
      region.latitude/2
      node1.region = region1
      node2.region = region2
  else
      # longitude halved and two new regions created
      region.longitude/2
      node1.region = region1
      node2.region = region2
  end if
  if point contained in node1.region then
      database is updated to ensure: node.key = node1.key
  else
      database is updated to ensure: node.key = node2.key
  end if
  delete points from node
  return true

- 127 - Chapter 4: Spatial Extensions The data must then be moved from the parent node to the two new child nodes.

Comparing the bounding box co-ordinates with the row keys of these child nodes allows the data to be inserted into the matching grid. After that, the data that was contained in the parent node is deleted. While this happens, queries are still accepted. Each client node can access the data at any point or split the data at any point, which can cause consistency issues; how these are dealt with is described in Section 4.5.
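The split itself can be sketched as below, reusing the Node class from the insertion sketch above; the alternation-by-depth rule follows the Kd-Tree example, and move_rows() is an assumed stand-in for re-keying the parent's rows in Cassandra rather than the real Antares code.

    def move_rows(from_key, to_keys):
        # Stand-in for re-inserting the parent's points under the child keys
        # and then deleting them from the parent row.
        print("move rows from", from_key, "to", to_keys)

    def split(node):
        # Halve the node's bounding box, alternating latitude/longitude by depth.
        min_lat, min_lon, max_lat, max_lon = node.region
        if node.height % 2 == 0:                      # even depth: halve latitude
            mid = (min_lat + max_lat) / 2
            regions = [(min_lat, min_lon, mid, max_lon),
                       (mid, min_lon, max_lat, max_lon)]
        else:                                         # odd depth: halve longitude
            mid = (min_lon + max_lon) / 2
            regions = [(min_lat, min_lon, max_lat, mid),
                       (min_lat, mid, max_lat, max_lon)]
        node.children = [Node(r, node.height + 1) for r in regions]
        move_rows(node.region, [c.region for c in node.children])
        node.count = 0                                # parent no longer holds points
        return node.children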

4.4.9.2 Point and Range Algorithms

Two algorithms for searching the tree were also implemented. The first, Algorithm 4 (Point Query), is used to find any data associated with a specific point (represented by a latitude and longitude) in the database. When it is executed it traverses the tree to identify the minimum-bounding box. Once this has been identified, a query using the bounding box co-ordinates and the point's latitude and longitude is sent to Cassandra. Any node can accept the query; the correct node, which holds the data, is then sent the query by checking the token.

Algorithm 5, Range Query, allows any four points to represent a region on a map: the query then returns all the points in this area. This is more complex than a point query as it requires querying multiple buckets to find the points from a user-defined grid, which may intersect many or few of the grids defined in the tree. The index is traversed to identify the path of the tree which contains the minimum-bounding boxes that intersect the user-defined grid. Once identified, the ids of the nodes containing the points are used to query Cassandra. As described earlier, the data points in the tree are pushed down to the leaf nodes; once the path of the intersecting grids has been calculated, nodes which do not contain data can be pruned from the path. Therefore the root node and other inner nodes can be pruned, as the data they contained has been pushed down the tree to the leaf nodes. When a query is decomposed in Cassandra, the parallelism depends on the number of CPUs available to execute each query parameter; querying fewer buckets therefore exploits the database's parallelism and decreases response times.

- 128 - Chapter 4: Spatial Extensions

Algorithm 4 Point Query
    # Variables:
    #   geo point           - contains a latitude and longitude
    #   geo region          - geographical area represented by a top-left co-ordinate and a bottom-right co-ordinate
    #   Object node1
    #   Object id           - each node id corresponds to a row id in the database
    #   int capacity        - the limit of the number of key entries per node
    #   int numberOfPoints  - the amount of data that a node contains at runtime
    #   Object Tweet        - a Tweet, which contains a status, timestamp, co-ordinate and user information
    #   geo tweet.point     - the point (latitude and longitude) contained in a tweet
    input: point
    if point contained in node1.region AND node1.numberOfPoints

Algorithm 5 Range Query
    # Variables:
    #   Object node1
    #   geo region    - geographical area represented by a top-left co-ordinate and a bottom-right co-ordinate
    #   int capacity  - the limit of the number of key entries per node
    #   Object key    - the identifier for a record in the database, made of a top-left co-ordinate and a bottom-right co-ordinate
    input: region
    if region intersects node1.region AND node1.numberOfPoints
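A sketch of the two lookups is shown below, again reusing the Node class from the insertion sketch; select_rows() is an assumed stand-in for the per-key Cassandra read, not the real Antares query layer.

    def leaf_keys_for_point(node, lat, lon):
        # Follow the single path whose bounding boxes contain the point.
        if not node.contains(lat, lon):
            return []
        if not node.children:
            return [node.region]
        keys = []
        for child in node.children:
            keys.extend(leaf_keys_for_point(child, lat, lon))
        return keys

    def intersects(a, b):
        a_min_lat, a_min_lon, a_max_lat, a_max_lon = a
        b_min_lat, b_min_lon, b_max_lat, b_max_lon = b
        return (a_min_lat < b_max_lat and b_min_lat < a_max_lat and
                a_min_lon < b_max_lon and b_min_lon < a_max_lon)

    def leaf_keys_for_range(node, region):
        # Prune subtrees that do not intersect the user-defined region and
        # leaves that hold no data, as described for the Range Query.
        if not intersects(node.region, region):
            return []
        if not node.children:
            return [node.region] if node.count > 0 else []
        keys = []
        for child in node.children:
            keys.extend(leaf_keys_for_range(child, region))
        return keys

    def select_rows(keys):
        # Stand-in for one Cassandra query per surviving row key.
        for key in keys:
            print("SELECT * FROM tweets WHERE key =", key)

    # e.g. select_rows(leaf_keys_for_range(root, (54.9, -1.7, 55.0, -1.5)))

The fewer keys that survive the pruning, the fewer buckets Cassandra has to read, which is the effect measured in Section 4.6.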

- 129 - Chapter 4: Spatial Extensions

4.4.10 Extension to Querying Model

Antares uses a simplified querying model to support scalable and low latency querying of geospatial and temporal Twitter data. Extensions to the query model could support a wider range of queries. This section discusses geospatial querying options implemented by the NoSQL database MongoDB, and describes how the system could be extended to include them.

A radius query, or nearby query, could be implemented to support finding nearest neighbours or nearby places. The current indexing layer and database would still be utilised for scalability and low latency. When such a query was executed it would return the grids that intersect the radius; an in-memory search would then remove the points which are not contained within the radius.

Whether a point lies within the radius can be calculated using the formula (xi − x)² + (yi − y)²: if this is less than radius² the point is within the circle, if it is equal the point is on the edge of the circle, and if it is greater the point is not in the circle. Performance would decrease slightly because of the extra filtering step, but as the processing is handled in-memory the overhead is small. The database will also still provide low latency and scalable querying, as the structure has not changed.
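A minimal sketch of this in-memory filter is given below; it treats latitude and longitude as planar co-ordinates, exactly as the formula above does, and the names are illustrative rather than the Antares API.

    def within_radius(point, centre, radius):
        # (xi - x)^2 + (yi - y)^2 <= radius^2 keeps points inside or on the circle.
        (xi, yi), (x, y) = point, centre
        return (xi - x) ** 2 + (yi - y) ** 2 <= radius ** 2

    def radius_filter(candidates, centre, radius):
        # Applied to the points returned for the grids that intersect the radius.
        return [p for p in candidates if within_radius(p, centre, radius)]

    # Example: keep only the points within 0.05 degrees of (54.97, -1.61).
    points = [(54.98, -1.60), (55.10, -1.20), (54.95, -1.63)]
    print(radius_filter(points, centre=(54.97, -1.61), radius=0.05))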

Another function would be polygon querying. This would be implemented in the same way, except that a point-in-polygon test would replace the circle equation; other polygon shapes could be introduced in the same way, using the same indexing structure.

Adding deletion or modification to the index would be complex and would reduce performance unnecessarily, as the aim of the system is to provide scalable storage of Twitter data so that it can be replayed and processed. Because Antares provides scalable and low latency querying there is no need to delete data; if data is modified it can simply be re-added, and the timestamp provides a means of deciding which version is newer. The requirement was to allow all data within a given time bound to be processed, so removing data would mean that aim was not met.

4.5 Consistency

As previously mentioned, Antares uses a cache for storing keys (pointers to the geospatial data), querying them and constructing queries. This cache is distributed over clients to

- 130 - Chapter 4: Spatial Extensions stop bottlenecks; however, this distribution can lead to consistency issues. This section addresses each of these issues and the solutions implemented by Antares. Because the system is append-only and uses this simplified querying model, the following scenarios cover all read and write use cases.

4.5.1 Scenarios

This section describes different consistency scenarios and the solutions implemented by Antares.

Read:

The first set of scenarios covers consistency challenges while retrieving data.

Scenario One:

Problem: Client 2 owns a leaf node (Node B in Figure 4.17) that has exceeded the limit; therefore the client splits the node and starts writing to the child nodes. However only Client 2 executes a split function, the rest of the clients do not. Then because Client 1’s cache is stale, it writes to Node B. Then Client 2 reads from Node D. In this scenario the stale cache has resulted in new data being written to an old source, therefore not all data related to this query is read and becomes lost in nodes that are no longer queried.

Solution: When Client 2 reads from Node D, it also reads from Node B so that, if any data is left over there, Antares still reads all the data that has been written to Cassandra. This ensures that even if other clients are stale the system is not affected, as data is read from the old nodes too.

- 131 - Chapter 4: Spatial Extensions

Figure 4.17: Scenario One

- 132 - Chapter 4: Spatial Extensions

Scenario Two:

Problem: Client 2 owns a leaf node (Node B shown in Figure 4.18) that has exceeded the limit; therefore the client splits the node and starts writing to the child nodes. However only Client 2 executes a split function, the rest of the clients do not. However this time Client 2 writes to Node D, then Client 1 reads from Node B. This is the same problem: the queries will not return all of the data, as some of the caches are stale. However this time the problem is with the downward path of the tree.

Solution: As Client 1 does not know that Node D exists, it does not query the new node in Scenario Two. If Client 1 also queried potential child nodes it would be able to return the data, but how can it do this when it does not know the node exists? As Antares splits the grids at each level of the tree, the key for a node can be calculated at every level, and the potential height of the tree is known from the user-defined configuration. Therefore the child nodes' co-ordinates can be calculated by splitting the nodes using the Split Algorithm. This means a new node's id can be calculated and queried by the indexing layer without knowing whether the node has been created yet.

Antares also knows the potential height of the tree and therefore the number of potential leaf nodes. The potential new node id is worked out and added to the query; if it returns data then the client knows its cache is stale. A split is therefore performed on the node, and another query is sent to check whether the newly split node is a child: if it returns true, nothing more is done; if it returns false then another split is executed. As shown in Figure 4.18, Client 1 would need to split Node B to become consistent. This continues until the client is no longer stale; the trade-off for this consistency is the extra queries to the database. However, all the data will be returned whether the cache is stale or not. The system also has no global locking, so it does not block while this is being executed: only one client blocks, and all other clients still accept data and queries.
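Because the split is deterministic, the candidate child keys can be derived without contacting any other client. The sketch below shows one way of doing this; the tuple-shaped keys and the alternation rule are assumptions consistent with the Kd-Tree example, not the exact Antares key format.

    def child_keys(region, depth):
        # Return the two row keys a split at this depth would create.
        min_lat, min_lon, max_lat, max_lon = region
        if depth % 2 == 0:                            # even depth: halve latitude
            mid = (min_lat + max_lat) / 2
            return [(min_lat, min_lon, mid, max_lon),
                    (mid, min_lon, max_lat, max_lon)]
        mid = (min_lon + max_lon) / 2                 # odd depth: halve longitude
        return [(min_lat, min_lon, max_lat, mid),
                (min_lat, mid, max_lat, max_lon)]

    # A client with a stale cache can add these candidate keys to its read;
    # any key that returns data reveals that a split has already happened.
    root_region = (-90.0, -180.0, 90.0, 180.0)
    print(child_keys(root_region, 0))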

Write:

The next scenarios describe consistency problems while writing data.

Scenario Three:

- 133 - Chapter 4: Spatial Extensions

Figure 4.18: Scenario Two

- 134 - Chapter 4: Spatial Extensions

Problem: As points are written to Antares the index can become inconsistent because each client is a standalone machine and responsible for its own data. If the client has a node which has exceeded its limit then the node is split. This however does not mean that the node is split on any other client, as demonstrated in Figure 4.19. If the client had a heavy write load and no reads came through then there would never be a check to see if the cache was stale or not and the clients could all be out of sync by varying levels in the tree.

Solution: Antares employs a time-out period during which the cache may be stale; once this is exceeded, a query is sent to the database to check whether the current leaf node is still a leaf node, and if it is not, the node is split. Every client node executes this check. This ensures that all stale caches become eventually consistent and the scenario described cannot happen.

In Scenario Three, if Client 2 had timed out then the node ids of Nodes B and C would be queried to identify whether the database held a true boolean value for “is a leaf node”. In this scenario the database would return false for Node B (as Clients 1 and 3 have split it) and true for Node C. Client 2 would therefore execute the Split Algorithm on Node B, pushing all of the data down to the new Nodes D and E and returning Client 2 to a consistent state. Queries to Client 2 are blocked while the split is executed but can be accepted by other clients. The client also continues to ask the database “am I a leaf node?” until that branch returns true.
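The timeout-driven check can be sketched as follows; is_leaf_in_store() stands in for the database query that answers “am I a leaf node?”, and the function names and timeout value are assumptions rather than the Antares implementation.

    import time

    STALE_TIMEOUT_SECONDS = 30          # assumed configuration value

    def reconcile(node, is_leaf_in_store, split, last_checked):
        # Split the local copy repeatedly until the store agrees the node is a leaf.
        if time.time() - last_checked < STALE_TIMEOUT_SECONDS:
            return node                              # cache still treated as fresh
        while not is_leaf_in_store(node.region):
            children = split(node)                   # push local data down a level
            # A real client would descend into the child covering the region it is
            # currently writing to; the first child is used here for illustration.
            node = children[0]
        return node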

- 135 - Chapter 4: Spatial Extensions

Figure 4.19: Scenario Three

Scenario Four:

Problem: If Client 2 has written to Node C during the time-out then data has been written to the wrong node higher up the branch. That data is therefore held in the wrong tree node and becomes lost to the user as the tree grows.

Solution: To solve the problem described above, when the time-out is exceeded all data is pushed down the tree to the leaf nodes, ensuring the trees are consistent across the client nodes. Client 2 realises it is inconsistent by querying the database; it then splits Node C into Node D and Node E, and all extra data is pushed down the tree. If the limit is exceeded then the tree is split again into Nodes F and G. This is done recursively until the data has reached the leaf nodes. Each split adds a round trip to the database; this is the price of consistency, as each time a leaf is split another query is sent to the database to check that it is consistent.

A query is executed that selects all data points with Node D's id, and these are then split into Node E. The number of data points in Node E is then checked and, if it exceeds the limit, the node is split into two new nodes. This is executed recursively until all nodes hold fewer data points than the configured node capacity, as

- 136 - Chapter 4: Spatial Extensions shown in Figure 4.20.

Figure 4.20: Scenario Four

4.6 Evaluation

The next section of this chapter evaluates the new geospatial querying mechanism designed and deployed in Antares. The motivation was to support the geospatial querying

- 137 - Chapter 4: Spatial Extensions research being undertaken by the social scientists. This led to a novel geospatial querying mechanism designed to exploit the scalability of NoSQL technology. The following experiments demonstrate the scalability of Antares and the efficiency of its indexing.

4.6.1 Read Performance

This experiment was designed to show the efficiency and scalability of the new mechanisms to support searching for geospatial data to display within a browser.

Managing the markers in a viewport was problematic: the original solution returned response times so long that requests would time out. This was addressed by the design of the new tree structure and algorithms described in Section 4.4.

To evaluate the new geospatial mechanisms, increasingly large geographic areas were queried. The area was increased to replicate the zoom out functionality required for the user to analyse the data. The first experiment was executed on one machine to prove response times for Antares were quick enough to render the map without a browser time-out. The second set of experiments evaluated the scalability of Antares by increasing the size of the cluster.

The first experiment, shown in Figure 4.21, used a one-node cluster of the previously defined Amazon cloud machines. Twitter data was written to the database so that the queries were executed across a pre-populated database, evaluating the search capabilities of the cache and algorithms. For this experiment the client was not distributed, as distributing the client has little effect on reads. However, multiple levels of the tree were still queried as if they had been distributed (as described in more detail in Section 4.4.9).

As demonstrated in Table 4.1, all of the results are returned in under 32 milliseconds for the Kd-Tree implementation. This dramatically improves response times and provides querying fast enough to visualise the data for the user to analyse, compared with the Solr implementation for geospatial querying with Cassandra. This is demonstrated in Figure 4.21; however, as the difference in mean response times is so large, Figure 4.22 shows the mean response times for areas between 100 and 1,000 km2.

- 138 - Chapter 4: Spatial Extensions


The experiment ranged from 100 km2 to 2,000 km2; the area was not extended further because Solr could not query a larger area without timing out. This demonstrates the capability of Antares to support large area searches while zooming in, zooming out and panning. The optimisations reduce the response time by a factor of 10 for smaller areas and by approximately a factor of 5,000 for larger areas. This vast increase in performance was achieved by simplifying the query model. Antares is required to store all data so that it may be queried at a later date, and is therefore an append-only system. This allows Antares to out-perform other systems such as Solr, as the querying model is simpler, which supports more efficient writes. This is how such a large gain in performance is achieved.

km2      Solr         Antares: Kd-Tree   Percentage Saving (%)
100      69.8         6.9                90.11
250      102.44       15.16              85.2
500      259.96       17.5               93.26
1000     570.32       31.8               94.42
1500     78729.36     23.02              99.97
2000     130813.12    26.48              99.98

Table 4.1: Mean response time (ms) for Solr compared with Antares for increasingly large areas (km2)
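The “Percentage Saving” column follows directly from the two mean response times; for example, the 100 km2 row can be checked as below (values taken from Table 4.1).

    solr, antares = 69.8, 6.9                   # 100 km2 row, milliseconds
    saving = (1 - antares / solr) * 100
    print(round(saving, 2), "% saving,", round(solr / antares, 1), "x speed-up")
    # 90.11 % saving, 10.1 x speed-up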

Figure 4.21: Mean response times for geospatial querying (up to 2000 km2)

- 139 - Chapter 4: Spatial Extensions

Figure 4.22: Mean response times for geospatial querying (up to 1000 km2)

Figure 4.23: Percentage saving Antares

The second experiment was executed across the Amazon cluster; after each experiment an additional node was added to the cluster. The cluster was already populated, and the experiments were used to evaluate searches across Antares. As the response times were so small, the scale tests were executed using 1,000 km2 areas. The scale tests were executed across Antares clusters which implemented the Kd-Tree, the Quad-Tree and Geohashing. The experiments in Figure 4.24 show that the increase in nodes provides

- 140 - Chapter 4: Spatial Extensions scalable and quicker response times for querying. The percentage saving in Figure 4.23 shows an improvement of over 85% for each cluster, which demonstrates the huge improvements made by Antares.

Number of machines   Antares: Kd-Tree   Antares: Quad-Tree   Antares: Geohashing
1                    31.80              3.56                 2.00
2                    8.55               1.21                 2.13
3                    4.61               1.50                 1.54
4                    8.55               1.16                 1.40

Table 4.2: Mean response time (ms) for Antares as the number of nodes increases

Figure 4.24: Mean response times for map queries as the number of nodes increases

4.6.2 Scale Experiments Comparing Different Structures in Antares with Different Systems

In this section Antares was deployed on the cloud and its performance evaluated by comparing it with current NoSQL spatial querying frameworks, Solr and MD-Hbase. Solr uses Lucene on top of Cassandra to index spatial data. MD-Hbase is an extension of Hbase. A spatial dataset of 10 million geotagged Tweets was inserted into these NoSQL databases in preparation for experiments that compare the read performance.

Solr uses a Prefix Tree to index the spatial data. This tree employs geohashing to search the tree using prefixes. It is an ordered, dynamic tree structure which uses keys of type string, and is described in greater detail in Section 4.4.1. As nodes are added to the cluster a tree is created for each one. The structure is then responsible for maintaining

- 141 - Chapter 4: Spatial Extensions data on that node only. The queries were also executed across an MD-Hbase cluster (as previously mentioned in Chapter 3). This spatial querying system also uses an in-memory structure, but employs a Z-order curve for querying the data.
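For comparison, a Z-order (Morton) curve linearises two dimensions by interleaving the bits of the two co-ordinates, so that a single range scan over the resulting key covers a spatial region. The sketch below is a generic illustration of the idea, not MD-Hbase's implementation.

    def z_order(x, y, bits=16):
        # Interleave the bits of two non-negative integers: ... y1 x1 y0 x0
        code = 0
        for i in range(bits):
            code |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
            code |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
        return code

    print(z_order(3, 5))    # 0b100111 == 39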

The specification of the machines used for this experiment is as follows: 15 GB RAM, 4 cores, 4 x 420 GB of storage, 64-bit, and 1,000 Mbps I/O performance. Each of the nodes in the cluster was pre-populated before the experiments were executed across the cluster.

The experiments compared the performance of the Antares system with Solr and MD- Hbase. The experiments were executed over an increasing number of machines. The query area is also increased and the response time measured.

The first experiment deploys a single-node cluster for Solr, Antares and MD-Hbase. For each of the experiments the Tweets were replayed and written to the database, so that each node was pre-populated with spatial data. Each query was then executed specifying a geographical region; the region was increased in size for each execution and the response time recorded. Antares executed the queries across the Kd-Tree, Quad-Tree and Geohashing structures. The full set of results is given in Table 4.3. The table contains n/a for some of the MD-HBase values because its queries only accepted integer values, not doubles: a degree is approximately equal to 111 km and MD-HBase uses co-ordinates to specify the region being queried, so the experiments could only be executed in increments of approximately 100 km.

- 142 - Chapter 4: Spatial Extensions

km2     MD-Hbase   Antares: Kd-Tree   Antares: Quad-Tree   Antares: Geohashing   Solr
50      n/a        6.96               5.05                 3.08                  35.28
100     103.95     6.36               2.3                  3.12                  34.76
150     n/a        6.95               2.30                 3.375                 38.72
200     99.5       6.17               2.44                 2.4                   42.04
250     n/a        15.16              2.6                  4                     52.36
300     n/a        12.18              1.72                 3.72                  61.96
350     125.1      14.52              1.52                 1.71                  72
400     n/a        13.13              2.84                 3.48                  62.88
450     223.4      14.32              6.14                 3.04                  99.24
500     n/a        17.6               1.52                 2.72                  80.48
550     365.55     18.05              1.52                 5.5                   139.76
600     n/a        13.67              3.55                 2.5                   130.32
650     448.9      14.57              1.83                 2.04                  155.04
700     n/a        14.16              1.52                 1.68                  187.96
750     n/a        12.55              2                    1.72                  224.64
800     417.8      13.045             2.32                 2                     251.56
850     n/a        19.25              3.28                 1.6                   304.6
900     463.05     19                 4.08                 1.44                  333.68
950     n/a        19.88              5.4                  2.92                  368.52
1000    413.2      31.8               3.56                 2                     458.6

Table 4.3: Mean response times (ms) for each framework across a one-node cluster

As shown in Figure 4.25, a one-node cluster employing the Kd-Tree decreases the response time for a small region query by a factor of 6 when compared with Solr. The response time is decreased by a factor of 14 for a query region of 1,000 km2. The performance of Antares improves by over 50% for a larger region, because Solr searches additional buckets when compared with Antares. When searching the bucket for a user-defined set of points the Lucene filter must then compare each geohash to the required geographical area to identify whether the point is contained within it. Antares uses the search algorithm to prune the tree path (number of buckets queried) so fewer buckets are queried in the data store, which results in fewer disk seeks and quicker response times. As the nodes in Solr’s Prefix Tree are filled the hash becomes longer and Lucene has to do more textual comparisons for each point it is trying to identify.

- 143 - Chapter 4: Spatial Extensions

Figure 4.25: Mean response time (ms) for geospatial queries executed across different systems on a one-node cluster

As can be seen in Figure 4.25, MD-Hbase’s response time is over a factor of 10 greater than Antares when using the least efficient structure (Kd-Tree) on one node. MD- Hbase’s response time was approximately as long as Solr’s for large regions, but in- creased by a third for smaller regions on the one-node cluster. This is because the indexing layer for MD-Hbase has to first be queried to return a starting point for the query. It then uses this key to execute the query across the database and return the results. This additional disk seek costs time and can be seen to increase response time in the results.

The next experiment deployed a two-node cluster for each of the frameworks, as shown in Figure 4.26. Additional nodes decreased the response time for both Antares and MD-Hbase; however, they increased Solr's response time. Antares improved response times by a factor of approximately 300 when querying a large region and a factor of approximately 80 when querying a smaller region compared with Solr, as demonstrated in Table 4.4. Solr's response time more than doubles for a small region and increases by over 100 ms for a large region when compared with the one-node cluster.

- 144 - Chapter 4: Spatial Extensions

km2     MD-Hbase   Antares: Kd-Tree   Antares: Quad-Tree   Antares: Geohashing   Solr
50      n/a        5.63               1.13                 1.24                  84.2
100     108        4.25               1.54                 1.5                   67.08
150     n/a        4.22               1.04                 1.58                  62.52
200     136        4.88               1.29                 1.20                  66.76
250     n/a        3.48               1.76                 1.08                  69.36
300     n/a        4.22               1.17                 1.48                  69.72
350     111        3.65               1.56                 1.5                   73
400     n/a        3.42               1.28                 1.38                  63.52
450     231        3.43               3.21                 1.13                  70.44
500     n/a        4.91               1.32                 1.16                  173.08
550     240        3.08               1.24                 1.42                  282.44
600     n/a        3.48               1.5                  1.29                  239.84
650     243        5                  1.64                 1.38                  255.04
700     n/a        3.17               1.4                  1.42                  252.68
750     n/a        3.54               1                    1.32                  239.68
800     389        3.38               0.96                 1.16                  491.2
850     n/a        5.09               1.42                 1.36                  519.84
900     391        5                  1.08                 1.38                  585.80
950     n/a        5.13               0.96                 1.28                  566.68
1000    391        8.55               1.21                 2.13                  608.68

Table 4.4: Mean response times (ms) for each framework across a two-node cluster

Figure 4.26: Mean response time (ms) for geospatial queries executed across different systems on a two-node cluster

As shown in Figure 4.27 and Figure 4.28, adding a third and a fourth node keeps the response times of Antares at between 3 and 5 ms for the Kd-Tree (and lower for the other structures) but only increases the response time of Solr. This is due to the distribution of the Solr indexing structure over the servers - it has to query each node separately. The system described here takes advantage of

- 145 - Chapter 4: Spatial Extensions the distribution and the extra CPU power, executing the searches in parallel for each bucket and returning the values. For a three-node cluster this system is a factor of 9,280 faster than Solr and for four nodes it is a factor of 15,774 faster.

km2     MD-Hbase   Antares: Kd-Tree   Antares: Quad-Tree   Antares: Geohashing   Solr
50      n/a        5.79               1.92                 1.92                  5569.08
100     68         3.88               1.46                 1.75                  5016.08
150     n/a        3.70               1.42                 1.58                  5027.92
200     49         4.38               1.83                 1.79                  5475.68
250     n/a        4.46               1.96                 1.8                   6002.80
300     n/a        4.21               1.5                  1.68                  6032.64
350     52         3.54               1.46                 1.63                  5962.56
400     n/a        4                  1.46                 1.79                  5949.88
450     201        3.3                1.29                 1.54                  7369.88
500     n/a        3.67               2.29                 1.68                  12346.24
550     188        4.39               1.88                 1.58                  24758.80
600     n/a        3.83               1.63                 1.64                  22175.52
650     219        3.39               1.58                 1.54                  22567.80
700     n/a        3.58               1.63                 1.71                  23424.92
750     n/a        4.00               2.79                 1.88                  24738.80
800     254        3.74               1.29                 1.8                   46077.92
850     n/a        5.63               1.33                 1.6                   56477.00
900     282        4.46               1.5                  1.42                  61770.96
950     n/a        5.57               1.46                 1.50                  64109.56
1000    243        4.61               1.5                  1.54                  78871.88

Table 4.5: Mean response times (ms) for each framework across a three-node cluster

Figure 4.27: Mean response time (ms) for geospatial queries executed across different systems on a three-node cluster

- 146 - Chapter 4: Spatial Extensions

km2     MD-Hbase   Antares: Kd-Tree   Antares: Quad-Tree   Antares: Geohashing   Solr
50      n/a        4.54               1.54                 1.46                  5569.08
100     25         5.047              1.58                 1.4                   5016.08
150     n/a        3.25               1.5                  1.36                  5027.92
200     68         4.52               1.33                 1.6                   5475.68
250     n/a        3.5                1.24                 1.72                  6002.8
300     n/a        4.71               1.33                 1.79                  6032.64
350     61         6.38               1.38                 1.63                  5962.56
400     n/a        3.75               1.32                 1.75                  5949.88
450     84         3.63               1.33                 1.33                  7369.88
500     n/a        5.25               1.58                 1.64                  12346.24
550     138        6.13               1.32                 1.46                  24758.8
600     n/a        3.79               1.29                 1.80                  22175.52
650     256        5                  1.46                 1.56                  22567.8
700     n/a        3.17               1.52                 1.58                  23424.92
750     n/a        3.54               1.58                 1.44                  24738.8
800     246        3.38               1.40                 1.64                  46077.92
850     n/a        5.09               1.32                 1.63                  56477
900     255        5                  1.29                 1.72                  61770.96
950     n/a        5.13               1.48                 1.6                   64109.56
1000    275        8.55               1.16                 1.4                   78871.88

Table 4.6: Mean response times (ms) for each framework across a four-node cluster

Figure 4.28: Mean response time (ms) for geospatial queries executed across different systems on a four-node cluster

The four-node Hbase cluster reduces the difference between the Kd-Tree and MD-Hbase to just over a factor of 6 for small region searches. The Quad-Tree improved the response time of Antares when compared with the Kd-Tree chiefly for larger regions: this is because the structure allows more fine-grained querying, therefore the path will

- 147 - Chapter 4: Spatial Extensions be shorter and fewer buckets are queried. Antares queries fewer buckets by employing pruning techniques in the search algorithm, which reduces the response time. This is also true for Geohashing, which reduced the response time to little over a millisecond. For each of the frameworks the response time decreases as additional hardware is added to the cluster, except for Solr, as previously mentioned. When compared with Solr, the performance of MD-Hbase is similar, if not slightly worse, for smaller regions when executed on a one-node cluster. However, once hardware is added and MD-Hbase takes advantage of the parallelism available with more CPU power, the results diverge. Solr performs increasingly poorly as nodes are added and is approximately a factor of 290 slower than MD-Hbase; the full set of results is shown in Tables 4.5 and 4.6.

4.6.3 Writes

The geospatial data also needs to be written to the database; this section evaluates the efficiency and scalability of writes. The cluster was as previously described and contained four nodes. It was expected that increasing the number of clients would increase performance, because the rate of Tweets being streamed to Cassandra could be increased and the client bottleneck removed. Additional clients were therefore added to the system to evaluate whether, and by how much, performance increased; each additional client is a new node that generates and sends queries and stores a local instance of the tree.

As can be seen in Figure 4.29, the number of writes increases linearly: adding clients scales up the number of writes and prevents a bottleneck. Additionally, the consistency checks mentioned in Section 4.5 do not affect performance. Each additional client increases the data flow to the servers, and Antares takes advantage of this by using all available resources, increasing efficiency and allowing the addition of hardware to increase ingestion linearly. The consistency checks do not affect performance because, when a client stops taking requests in order to become consistent, the other client nodes can keep writing to and reading from the database.

- 148 - Chapter 4: Spatial Extensions

Number of machines   Writes per second
1                    7339.704969
2                    13409.23856
3                    19628.13933
4                    26342.15256
5                    33443.591

Table 4.7: Mean number of writes per second for Antares as the number of client nodes increases

Figure 4.29: Number of inserts per second as the number of nodes is increased

4.6.4 Conclusion

The geospatial extension that Antares uses provides a scalable and near real-time mechanism for querying spatial data. The simplified model allows queries to return with reduced response times when compared with other NoSQL geospatial systems. Antares meets the requirement for near real-time viewport querying and has been used in the Tweet My Street project. This, however, is only possible because of the append-only style of the system and the requirement that it be scalable, allowing data capture to be continual with no time-to-live limitations. The chapter has also discussed ways in which this could be extended to provide further functionality in the future.

- 149 - Chapter 4: Spatial Extensions

4.7 Conclusion

This chapter has described the spatial extensions of Antares. These extensions support scalable and low latency analysis of spatial and temporal data. The querying model has been simplified to support large-scale data viewing and processing; this was a requirement of Antares and is only possible due to its append-only behaviour. The system does make a trade-off by providing a simpler querying model; however, suggestions for how this could be extended were given earlier in the chapter. The scale of the data displayed and processed is the advantage Antares gains over other systems which use NoSQL for geospatial querying. Antares also achieved a large reduction in query execution time, providing the low latency analysis that supports the real-time querying requirement of the system. Antares additionally accounts for consistency problems and uses mechanisms to ensure that the system is eventually consistent. This chapter has demonstrated the scale and low latency querying of the Antares cache for geospatial querying, which is the third contribution described in this thesis.

- 150 - 5 Application to Twitter

Contents
5.1 Introduction ...... 152
5.2 Social Media Analysis ...... 152
    5.2.1 Conclusion ...... 158
5.3 User Interface ...... 159
5.4 Tweet My Street ...... 161
    5.4.1 West End of Newcastle ...... 162
    5.4.2 International Day Against Homophobia and Transphobia ...... 164
    5.4.3 Psychology ...... 165
    5.4.4 Conclusion ...... 165

- 151 - Chapter 5: Application to Twitter

5.1 Introduction

A layered approach to analysing temporal and spatial Twitter data using a system called Antares has been described in earlier chapters. Antares uses streaming, historic and combined querying mechanisms to analyse Twitter data. It provides scalable ingestion of the Twitter firehose and low latency querying of temporal and spatial data, extending current streaming and historic processing technologies. This layered approach supports the scale and low query response times required to handle high volume and velocity stream data. Antares has been used within a cross-disciplinary project called Tweet My Street to analyse Twitter data collected around different events. This chapter discusses the different technologies and mechanisms used to process social media data and compares these approaches with Antares. The user interface which displays the Twitter data for user analysis is then described, and the projects that have used Antares are explained to demonstrate how the querying mechanisms were derived and used in the real world.

5.2 Social Media Analysis

Social media generates vast amounts of data per second and these datasets can be used to derive useful insights. This section investigates different uses for social media and then focuses on Twitter analytics. Antares uses Twitter analytics both to evaluate the system and to support research projects, so establishing the current technologies and research helps to identify improvements in this area.

YouTube is popular for its video content; [85] looks at how content affects the popularity of a video. Social media can also be used to support the improvement of health issues: Facebook is being used to help arthritis sufferers by promoting educational programmes [86]. Social media is even used to monitor the outbreak of war and civil unrest [87].

The emergence of cloud computing has provided cost-effective, pay-as-you-go capacity to process big data economically. It has encouraged the exploration and analysis of datasets collected from the Internet, which has led to research investigating

- 152 - Chapter 5: Application to Twitter social media. Twitter is a huge producer of data, which is available by connecting to the firehose. Research investigating Twitter analysis has previously concentrated on various components of the Tweet itself: [39] rates news sites by counting their URL mentions, [88] counts the number of followers that a user has, exploring celebrities and their followers, [89] focuses on defining trends by the co-occurrence of words within Tweets and [60] describes a mechanism for determining popular messages on Twitter so they can be forwarded to people who are following fewer users. Other components of Tweet metadata being investigated are gender and location [90]. Twitter can be used as an opinion-sharing network to establish on-line tensions, with [91] using sentiment analysis to track racism in football; that work uses Cosmos [92], a system that applies sentiment analysis and machine learning to detect tensions. As an opinion-sharing network Twitter does not guarantee the reliability of the content of its messages; [93] questions this by asking how far Tweets mirrored the real-life results of the NBA playoffs. Exploring the content of Tweets through analysis of hashtags allows topics and trends to be identified [94]. Timely and consistent information is of high importance in emergency events, so low-latency querying and near real-time results are vital, as shown in [61].

Stream analysis can be used to detect events, as in [95] and [6]. TEDAS, described in [96], also uses stream data to rank Tweets and predict locations in real-time. The work in [97] analysed data collected from Twitter around the Egyptian and Tunisian revolutions, showing that Twitter is an additional source of information about ongoing events around the world, alongside the news. A public radio presenter, Andy Carvin, used Twitter to make sense of different information about the Arab Spring uprisings and to link demonstrations and the people involved.

Bollen et al [98] delve into sentiment analysis to discover how that information can be used to predict the stock market. Investigating how Twitter feeds correlate with the Dow Jones Industrial Average, they found that changes in public sentiment can be tracked from the content of large scale Twitter feeds using relatively simple text processing, and that changes in sentiment were responsive to a number of socio-cultural drivers. Data from social media is not simply background noise: there are strong correlations and relationships with activities, actions and issues in the offline

- 153 - Chapter 5: Application to Twitter world.

The availability of geo-located social media and rise of mobile usage is presenting both an opportunity and a challenge in terms of the analysis of the data produced. Geotagged Twitter data gives an understanding of what may be occurring in a given geographic location at a given time, providing a dialogue of what is happening within communities on a day to day basis or during significant events. [99] found that analysing geographic data over time can give interesting insight into events such as natural disasters like Hurricane Irene or public disorder events like the London Riots of 2011. Tweets were also found to be useful in gauging the severity of the situation on the ground.

There is recognition of the quantity of social media data available and the value it has in supporting research in the social sciences. Analysis of said data has proven to be possible and congruent with events and systems in the offline world. There is still scope for improved access, tooling and scalability to aid exploration of Twitter data in order to better understand the data and aid further research.

The idea of a Microblogs Data Management System is introduced in a theoretical paper [60], which discusses the requirement for a management system tuned to control microblogging data. It highlights the importance of temporal, spatial and keyword searches - these are the characteristics on which Antares bases its queries. The temporal aspect of querying is defined as the most important and should be included in all queries - this is the approach Antares uses to ensure recency of data and to decrease the result set size, supporting quicker querying and relevant data. The proposed architecture would use a query engine and a memory indexer, providing a combination of in-memory and historic querying to support lower latency. The system must also be able to support high ingestion rates so that data is not lost from the high volume and velocity streams of microblogging sites; Antares shares this requirement and has been designed to meet it. The design in [60] introduces a memory indexer which monitors the hits to memory and the usage of the data stored in memory to optimise querying. Antares uses a different approach, layering the system to use a stream processing system for immediate processing and low latency database queries for historic and wider-ranging querying. The paper also

- 154 - Chapter 5: Application to Twitter suggests that the most prominent queries are top-k queries, used to reduce query time and enable scalable processing, as the datasets produced by microblogging sites are so vast. Antares instead uses its scalable nature to return the entire result set a user requests, bounded by time, space or keywords - a feature requirement of the system. The Microblogs Data Management System is similar to Antares in some ways, but is more focused on optimising queries in memory and with smaller result sets; Antares focuses on scaling Twitter analysis so that query result sets are not constrained but still return with low enough latency to support the combination of historic and stream processing. Both systems, however, are designed for high ingestion rates, which Antares has been shown to support by evaluation using a simulated stream of Twitter data.

There are other applications that collect and analyse Twitter data, a selection of the most important of these is discussed below.

1. Collaborative On-line Social Media Observatory (COSMOS): A product of an Economic and Social Research Council (ESRC) strategic “Big Data” investment that seeks to bring together researchers from different areas to explore the different dimensions of social media data [92]. COSMOS is an example of an application which is designed for academic users. It is presented in both a desktop application and also an on-line version that benefits from scalability. The desktop application does not have the ability to scale but does provide a suite of visualisation and data exploration tools. Visualisation is customisable and provides textual/trend analysis, geographical mapping, relationship mapping and other more generic charting tools. Data sources on the desktop edition are limited to imported historical data and data taken from the Twitter Search API [100]. COSMOS makes use of other contextual data such as a persistent connection to the UK police API. Use of the Hadoop environment allows the application to scale, whilst storing the tweets in MongoDB. Instantaneous searches are available over the MongoDB datasets, however this is done through the use of in-memory indexes that are costly in terms of resources, with 24GB RAM required to search across one month of Twitter Search API output.

- 155 - Chapter 5: Application to Twitter

2. TweeQL and Twitinfo: Twitinfo is a Twitter analysis web platform which sits on top of TweeQL, a SQL-like querying language for Twitter that turns unstructured Twitter data into structured output for underlying systems. Twitinfo processes geospatial, temporal and keyword data. It has a novel algorithm for detecting peaks in the Twitter data, which are displayed on a rate graph. The query language allows expressive queries to be submitted about any one of these key features. These techniques are similar to Antares; however, they do not have the scalability of Antares or the combination of both historic and stream processing - TweeQL is for stream processing. Twitinfo does not store data, so data cannot be replayed or re-analysed, which differs from Antares, which allows that functionality. Instead the focus is on the querying language and its mapping from Tweets to structured data - the query language is more complex than that of Antares, although both systems have different requirements. Antares supports scalable ingestion and combined querying, whereas Twitinfo and TweeQL provide a querying language for mapping a Twitter stream to structured output. The interfaces have similar functionality to iterations of Antares, including the rate of Tweets graph, the map of Tweets (although Antares builds on this information rather than providing only geospatial analysis) and the list of Tweets, which was used for specific hashtags. Twitinfo has the added feature of peak detection and sentiment analysis. The system also uses third-party software to find geospatial co-ordinates, whereas Antares uses its own scalable and low latency algorithms and caching mechanism to map this from memory to the database.

3. SensePlace2: Developed at Penn State University, SensePlace2 is a temporal geospatial analysis tool for visualising Tweets. The visualisation techniques are different to Antares, which provides clustering and a time/location graph view.

“geovisual analytics application which forages place-time-attribute based information from the Twitterverse and supports crisis management through visually-enabled sensemaking of the information derived. SensePlace2 integrates computational methods for capturing, storing, and indexing tweets with visual query and analysis methods” [101].

- 156 - Chapter 5: Application to Twitter

4. Geosocial Gauge: Developed by an interdisciplinary team at George Mason University, Geosocial Gauge [102] is a research platform that seeks to make use of social media data to provide insight into such areas as use of language on Twitter during natural disasters, and virtual boundaries that exist across traditional boundaries. Geosocial Gauge was created within or in consultation with a multi-disciplinary team to guide iterative development to produce an open tool for research on a wide range of topics.

5. Sentiment Viz: An online sentiment analysis tool created at North Carolina State University. Sentiment is measured using a sentiment dictionary to measure sentiment for the evaluated Tweets. Tweets are pulled directly from Twitter for a user’s chosen keyword [103].

6. Radian6: A commercial offering from SalesForce marketed at companies who want to monitor and manage their on-line brand and direct their customer service and on-line marketing efforts [104].

7. Twitonomy: Another web-based tool that allows monitoring of individual accounts, lists, hashtags or keywords. The focus is on the individuals and businesses who want to manage their online presence on Twitter. Broader search tools are available; however they appear to suffer high latency when returning queries and provide only basic visualisation [105].

8. Tweet Ping: A web-based tool that displays tweets as they are posted in real-time around the world. Tweet Ping is less an interactive analysis tool and more a visualisation tool that shows the global distribution of Tweets as they occur [106]. Tweet Ping provides what can be seen as a shallow view into the data provided by social media. Despite this it provides an engaging visual that is useful for attracting interest and attention to a particular subject area. Tweet Ping provides visualisation but lacks the interactivity to provide meaningful analysis. Likewise the application is able to handle a stream of Twitter data but does not show potential to scale beyond the Twitter Search API or to other social media sources.

- 157 - Chapter 5: Application to Twitter

5.2.1 Conclusion

Much of the literature focuses on application specific querying of Twitter data, unlike Antares, which provides non-application specific querying. The queries enable a core set of questions, which can be used in different application areas. Antares is solely an on-line tool so there is always access to the scalability of the cloud unlike systems like COSMOS. In comparison, Antares is able to quickly return queries from a six-week dataset taken from the Twitter firehose while consuming less than 3GB memory.

Radian6 and Twitonomy are just two examples of applications that are designed for individuals and businesses to analyse social media interactions to better understand how the public view them and their brand. This type of tool also allows users to decide how best to direct their on-line marketing and social media efforts and, as a result, see better returns on investment. Radian6 and Twitonomy provide useful analytics and visualisations; these include analysis over time, metrics regarding users and their influence, and engagement with other users. Mapping of Tweets is also available. Twitonomy has limited scope in terms of high volume analysis, and there is little interaction available in the visualisation. The search tools also appear to be making use of the Twitter Search API rather than any high volume source such as the Twitter firehose.

Applications in this category can be adapted through usage and feedback; however the visualisations do not have the level of interactivity that would provide a useful research tool. Antares has the ability to provide this functionality, given its array of visualisation tools.

Unlike the tools mentioned in this section, Antares focuses on scalability and low latency querying, aiming to improve the ingestion rate, query execution time and scalability offered by these systems. Antares provides visualisations to support the analysis of Twitter data, giving the user spatial and temporal displays of the data to process; these are described in the next section.

- 158 - Chapter 5: Application to Twitter

5.3 User Interface

To enable non-expert users to analyse data there is a web interface which was designed and developed with colleagues. This is used to enter parameters to queries and support the visualisation of the data. When data from the web interface arrives at the query monitor, it is analysed and turned into queries against the streaming data and/or the historic data depending on the time-bounds.

In Figure 5.1 the web interface's map feature is shown. The interface allows the user to zoom in and out and pan around a world map. The Tweets are displayed on the map as clusters (the circled numeric values), where the number is the count of Tweets in that area. This method is used to make the display less cluttered for users and enable easier analysis. By clicking on one of these clusters the Tweets for the area are displayed, and the user can then click on a Tweet marker, shown on the right hand side of the figure with the Twitter bird as the symbol. Once the Tweet marker has been clicked the content of the Tweet is displayed in a modal - this is shown in Figure 5.2. From there the user can access information about the user, the place, nearby Tweets and events. This is provided by using third party APIs.

A graph of the rate of Tweets is also displayed on the bottom of Figure 5.2. The graph displays the rate of Tweets over time for that viewport. It enables the user to replay Tweets as they were posted. By clicking on the play button the Tweets for each day are displayed, growing in number and displaying the distribution of Tweets in the area covered by the viewport. Additionally specific days can be chosen to display on the map and the playback can be paused. It can be used to predict future peaks or explore historic peaks (e.g. for marketing it would be useful to know when to put an advert on television or social media). This uses the tree structure but indexes on time to add an extra dimension to the schema for querying.

- 159 - Chapter 5: Application to Twitter

Figure 5.1: Antares: map

Figure 5.2: Antares: map with modal

Figure 5.3 displays the word cloud implemented in Antares. Here you can see there are multiple hashtags displayed in the viewport. These hashtags have all co-occurred in the same Tweet as the hashtag the user clicked on to display the word cloud. The larger the hashtag the more often it has been posted together with the hashtag selected. This allows users to identify what topics are connected and can support further investigation of a topic using wider search terms.

- 160 - Chapter 5: Application to Twitter

Figure 5.3: Antares: word cloud

5.4 Tweet My Street

Once Antares had been designed and developed it was deployed in a cross-disciplinary project called Tweet My Street [3], which is a collaboration of social geographers and computer scientists. The project drew on the interest in processing big data and how it can be used to provide insights into social, economic and political data.

The queries derived earlier drive the web UI to support social scientists in answering “how are more deprived areas using Twitter?” and “can social media endanger people in homophobic countries?”

Antares was used to visualise geographical data by placing a marker on a map to represent where Tweets were posted. The map also combines data about world events that happened within that time period, to compare whether people were interested in world events and if that was having an effect on people’s opinions. It uses the location to highlight the five closest Tweets to challenge whether the location is directly involved in similar topics and themes in that area.

Through the collaboration an iterative approach has been employed for the design and development of the system, using user feedback to modify Antares. It has supported research and analysis for humanities-based projects.

Antares has been deployed and used in different case studies for the Tweet My Street project, to enable interactive visualisations and near real-time processing of Twitter

- 161 - Chapter 5: Application to Twitter data. Data has been collected from the full firehose and the comprehensive set of queries have been applied for the case studies.

5.4.1 West End of Newcastle

The West End is one of the more deprived urban areas in Newcastle, which would suggest the inhabitants may not have the technology or means to access and use social media. This case study aimed to identify whether or not this was true, and if not then what information could be derived about the area from social media posts.

The study collected a set of Tweets from the area over a four-week period. Antares was employed to visualise the rate of Tweets on an hourly basis, as shown in Figure 5.4. This demonstrated that a high number of people were tweeting from the area, even though, according to Leetaru et al, only 1.6% of users have geotagging turned on [107]. It was discovered that people within the area would often economise in order to buy technology that facilitates access to social media, emphasising how important the platform, and access to it, is to people.

Further analysis of the data led to findings about the ethnicity of the area, which corresponded to its demographic. It was also established that people were using the platform to express concerns about the local community, as can be seen in Figure 5.5, where a user mentions the lack of heating within a council residence.

The link between transportation and social media posts can also be identified, shown in Figure 5.6. Therefore collection of more real-time data would give an idea of traffic delays and service quality. This study demonstrated how even a smaller collection of Twitter data can be employed to analyse local communities.

- 162 - Chapter 5: Application to Twitter

Figure 5.4: Antares: activity graph showing the rate of tweets in the West End of Newcastle

Figure 5.5: Antares: map showing community opinion

- 163 - Chapter 5: Application to Twitter

Figure 5.6: Antares: obvious traffic routes

5.4.2 International Day Against Homophobia and Trans- phobia

Tweet My Street also analysed a collection of ten million tweets. This is a much larger dataset and has been collected at a global level, using a list of hashtags to filter the data instead. The collection was based on the International Day Against Homophobia and Transphobia (IDAHOT), with data being collected for the period three weeks before and after the event. The aim of this study is to support the campaign by analysing the reaction on Twitter and provide insights into the tensions and opinions around the day. Antares was used to analyse the data with the aim of relaying this information to the campaigners themselves.

The study focused on countries deemed the “worst places to be homosexual” (WPTBH): Saudi Arabia, Sudan, Mauritius, Somalia and Nigeria. The greatest number of Tweets were posted in Nigeria; 99% were in English and 82.89% were posted by males. Not one of the Tweets contained a

- 165 - Chapter 5: Application to Twitter reference to IDAHOT, demonstrating that the western campaign had little impact on the WPTBH countries in the global south - this observation held across all the countries mentioned. 11.8% of the data was used to “name” or “out” people, suggesting there is a real danger attached to social media in such areas. This information will go towards helping the campaign to grow and supporting places such as the WPTBH countries.

5.4.3 Psychology

The system is being used to establish whether self-harm cases are on the increase. The study aims to demonstrate whether content from the Internet and specifically social media provides a platform for people to connect and support each other and provides an additional type of support group. It is thought the increase in self-harm may just be the increase of awareness about the topic through platforms such as social media.

Antares collected a dataset containing self-harm hashtags. An example within the dataset is the strong co-occurrence of “one direction”, “harry styles” and “depression”, which was because of One Direction fans not being able to go to the show. The data showed a large number of people tweeting about self-harm and eating disorders: during only a 24 hour period over 4,000 Tweets were collected. Analysis of the data revealed a positive element to the sharing of self-harm content on social media, with supportive posts being shared more than content encouraging self-harm behaviour. Some examples of positive posts included those offering support to others, or raising awareness of self-harm in the general population. Blogs dedicated to self-harm (i.e., those where the users solely blogged about self-harm and no other topics) were in the minority, and those that did exist tended to be of a positive, pro-recovery nature rather than encouraging self-harm. Overall, this suggests that social media may play a positive role for the majority of users sharing self-harm related posts.

5.4.4 Conclusion

Antares has provided social scientists with a visualisation tool that enables them to explore and analyse large Twitter data sets in a scalable and timely manner. This

- 166 - Chapter 5: Application to Twitter provides them with tooling to collect large data sets and then explore different use cases around events, analysis which would previously have been carried out manually. The system itself has shown ingestion rates beyond the size of the average firehose and queries that support the combination of historic and streaming analysis. The spatial functionality supports low latency querying, which is an improvement on current technologies, allowing visualisations of large sets of data to be mapped. As can be seen from the related work, many systems focus on specific querying techniques or specific application areas. Antares provides a non-application-specific mechanism for scalably processing and ingesting Twitter data. It extends the layered approach of historic and streaming data to support greater scalability and to combine different temporal aspects of analysis - reprocessing past events, processing current and future data, and combining the two. Antares covers a specific domain area led by the research undertaken by the social scientists; however, this does not hinder its performance, and it uses this subset of questions to form a model and basis to prove the scalability of the system and to show how it can be used to derive insights for researchers.

6 Conclusion


The aim of this thesis was to explore and identify possible solutions to support efficient and near real-time querying of large volume stream and stored Twitter data. The solution would support stream, historic, combined and geospatial analysis with the aim of producing low-latency responses when querying large-scale datasets, even as data was being ingested at high velocity. The motivation came from work in a multi-disciplinary project focused on Twitter analytics.

An extensive literature review identified opportunities for improvement in the combined processing of stream and stored data, especially large geospatial data. Antares explores how to build on the strengths of NoSQL databases to meet these requirements. This research is timely given the recent explosion in streaming data from social media and other sources such as sensors. Antares improved on current techniques for geospatial analysis, focusing on improving performance and increasing scalability. The evaluation demonstrated that Antares supports simpler geospatial querying that is up to a factor of 1000 faster in some cases. The querying functionality could be extended; however, for the purpose of fast viewport analysis in a browser, Antares provides a new contribution for querying geospatial data using an in-memory cache and a NoSQL database.

It was identified that improvements could be made by combining the scale of NoSQL databases with the efficiency of traditional geospatial querying techniques, adapting those techniques to support querying of new datasets such as Twitter. Antares supports a novel caching mechanism for efficient geospatial data analysis, which was evaluated and shown to outperform the current state of the art. This enabled users of Antares to interactively explore large temporal and geospatial datasets quickly and efficiently, something not possible without Antares. For example, Antares was evaluated against Solr for geospatial queries, and shown to be 15,000 times faster for the range of experiments in Chapter ??.
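To illustrate the general pattern, rather than Antares' exact implementation, the following minimal Java sketch shows a viewport query that derives coarse grid-cell keys covering a bounding box, serves repeated cells from an in-memory cache and falls back to a backing store on a miss. The grid-cell keying, class names and store interface are simplifying assumptions made purely for illustration.

    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: viewport querying with an in-memory cache in front of a slower
    // historic store. Coarse grid cells stand in for the real spatial index.
    public class ViewportQuerySketch {

        // A Tweet reduced to the fields needed for map plotting.
        record GeoTweet(long id, double lat, double lon) {}

        // Cache of grid-cell key -> tweets already fetched from the historic store.
        private final Map<String, List<GeoTweet>> cache = new ConcurrentHashMap<>();

        // Stand-in for the NoSQL historic store (a database call in practice).
        private final Map<String, List<GeoTweet>> historicStore;

        public ViewportQuerySketch(Map<String, List<GeoTweet>> historicStore) {
            this.historicStore = historicStore;
        }

        // Key a location to a coarse one-degree grid cell (simplified spatial key).
        static String cellKey(double lat, double lon) {
            return (int) Math.floor(lat) + ":" + (int) Math.floor(lon);
        }

        // Return all tweets inside the viewport, serving repeated cells from memory.
        public List<GeoTweet> query(double minLat, double maxLat, double minLon, double maxLon) {
            List<GeoTweet> result = new ArrayList<>();
            for (int lat = (int) Math.floor(minLat); lat <= Math.floor(maxLat); lat++) {
                for (int lon = (int) Math.floor(minLon); lon <= Math.floor(maxLon); lon++) {
                    // A cache hit avoids a round trip to the historic store.
                    List<GeoTweet> cell = cache.computeIfAbsent(cellKey(lat, lon),
                            k -> historicStore.getOrDefault(k, List.of()));
                    for (GeoTweet t : cell) {
                        if (t.lat() >= minLat && t.lat() <= maxLat
                                && t.lon() >= minLon && t.lon() <= maxLon) {
                            result.add(t);
                        }
                    }
                }
            }
            return result;
        }
    }

The key point of the design is that panning or zooming within an area already viewed touches only the cache, which is why interactive browser exploration remains responsive even over large datasets.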

A layered approach was explored, together with its effect on performance and scalability. The layered approach supported scalability and enabled larger ingest rates: maintaining both a stream layer and a historic layer meant that processing could be undertaken in each. Requests made to the system used both layers, which balanced the load between them. This reduced the number of time-outs, and therefore query execution time. The time-out is 5 seconds; however, with large volumes of data, waits would otherwise have kept increasing and the ingestion rate of the system would have dropped, as the system would have blocked while waiting on the time-out. The layered approach supports higher ingest rates because the stream processing layer and the historic store can both ingest data, effectively load balancing the system.

The layered approach also provides real-time analysis, using the speed of in-memory processing and the scale of historic (disk-based) stores, which are cheaper and more durable, while helping to expand and scale out the system by adding new nodes.

As demonstrated in Chapter ??, the design of Antares achieved its aims: it is capable of executing tens of thousands of queries concurrently, and in some cases its novel mechanisms halved response times.

The first contribution was the analysis of use cases to derive the design and implementation of a set of queries providing Antares with functionality for querying Twitter data. This was then used both to support the evaluation of scalability and low-latency querying and, as tooling, to support research on large Twitter datasets. These queries were derived from the data model described in Section 3.4. Their implementation, together with an index to optimise data retrieval, supported the transparent combination of stream and historic analysis, with percentage savings approaching 100%. A single query can be submitted, with Antares determining whether stream, historic or combined execution is required.
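As a minimal illustration of this routing step (the enum and method names below are assumptions, not Antares' actual API), a query's time window can be compared against the current time to decide which layer, or combination of layers, must execute it:

    import java.time.Instant;

    // Sketch: a single query with a time window is mapped to the historic
    // layer, the stream layer, or both, depending on where the window falls.
    public class QueryRouterSketch {

        enum Target { HISTORIC, STREAM, COMBINED }

        // Decide which layer(s) must execute a query covering [from, to].
        static Target route(Instant from, Instant to, Instant now) {
            if (to.isBefore(now)) {
                return Target.HISTORIC;   // window lies entirely in the past
            }
            if (!from.isBefore(now)) {
                return Target.STREAM;     // window lies entirely at or after now
            }
            return Target.COMBINED;       // window spans past and future
        }

        public static void main(String[] args) {
            Instant now = Instant.now();
            System.out.println(route(now.minusSeconds(3600), now.minusSeconds(60), now)); // HISTORIC
            System.out.println(route(now, now.plusSeconds(600), now));                    // STREAM
            System.out.println(route(now.minusSeconds(600), now.plusSeconds(600), now));  // COMBINED
        }
    }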

Antares uses indexing techniques to improve the performance of query execution by distributing the data to enable scale-out capability. However, the datasets also need to be ingested at high rates to accommodate large data streams. Antares makes its second contribution by optimising data ingestion using asynchronous batching: for example, an eight-node cluster can cope with over 30,000 Tweets per second. This supports scalable Twitter ingestion for analysis at around five times the rate of the average Twitter firehose.
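The sketch below gives a minimal Java illustration of the asynchronous batching idea: producers append Tweets to a queue and return immediately, while a background worker drains the queue and writes batches to the store. The batch size, class names and store interface are assumptions for illustration rather than Antares' actual implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    // Sketch: asynchronous batched ingestion so producers never block on
    // individual inserts into the historic store.
    public class BatchIngestSketch {

        interface TweetStore { void insertBatch(List<String> tweets); }

        private static final int BATCH_SIZE = 500;
        private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        public BatchIngestSketch(TweetStore store) {
            worker.submit(() -> {
                List<String> batch = new ArrayList<>(BATCH_SIZE);
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        // Block for the first tweet, then drain up to a full batch.
                        batch.add(buffer.take());
                        buffer.drainTo(batch, BATCH_SIZE - 1);
                        store.insertBatch(new ArrayList<>(batch));
                        batch.clear();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // Called on the ingest path; returns immediately.
        public void ingest(String rawTweet) {
            buffer.add(rawTweet);
        }

        public void shutdown() {
            worker.shutdownNow();
        }
    }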

The Split Algorithm was designed so that no global locking is required in the system; this reduces execution time and exploits the asynchronous features added to Antares. As Antares is an append-only system, there is no need to rebalance the tree. This design decision also supported quicker reads and writes, as there was no need to hold a global lock on the system while it was being rebalanced.
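A minimal sketch of this idea follows, assuming a deliberately simplified node structure that is not Antares' actual index: a full node is split while holding only its own lock, and because the index is append-only no sibling nodes need to be rebalanced or locked.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.locks.ReentrantLock;

    // Sketch: a node split that takes only a lock on the node being split,
    // never a global lock, so other nodes remain readable and writable.
    public class LocalSplitSketch {

        static class Node {
            final ReentrantLock lock = new ReentrantLock(); // per-node lock only
            final List<Long> keys = new ArrayList<>();
            final int capacity;

            Node(int capacity) { this.capacity = capacity; }

            // Append a key; if the node overflows, split it locally and return
            // the new right-hand node, otherwise return null. The caller links
            // the returned node into the parent.
            Node append(long key) {
                lock.lock();
                try {
                    keys.add(key);
                    if (keys.size() <= capacity) {
                        return null;
                    }
                    // Move the upper half of the keys into a new node. Because
                    // the index is append-only, no other node is rebalanced.
                    Node right = new Node(capacity);
                    int mid = keys.size() / 2;
                    right.keys.addAll(keys.subList(mid, keys.size()));
                    keys.subList(mid, keys.size()).clear();
                    return right;
                } finally {
                    lock.unlock();
                }
            }
        }
    }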

Antares was used within a cross-disciplinary project called Tweet My Street. This collaboration meant that real user requirements had to be met. It allowed the system to be evaluated as a research tool by social geographers, while simultaneously generating interesting computing problems. It also meant Antares underwent rigorous user testing, which caught errors and generated new requirements that had not originally been anticipated. This helped Antares in its aim to provide insights into interesting events by analysing Twitter data. It meant that key aims of the system, such as low response times, became even more crucial, as it was immediately obvious to users when they were not met.

Future work to continue research in this area would be to identify a means of making ESPER distributed. ESPER currently executes only on a standalone machine; if it could be distributed over many machines, its scale would be very powerful. Antares could also be extended to notify users of events that have been identified for them. A query could be executed over a prolonged period and, when an event is identified on the stream and/or in the historic store, the user would be notified. The next step would be to allow a user to reply to this notification, either to trigger another query execution or to request more information.

Antares is a scalable platform utilised in a cross-disciplinary project, where it provides researchers with an advanced analytical tool. The system improves response times for combined, historic and stream querying. Additionally, Antares supports geospatial querying with efficient response times, and the results show that it outperformed the current state of the art in this respect. It has exploited the scale of NoSQL databases to support geospatial querying at scale and so support a rich, interactive user interface. Designing Antares has provided insight into the technical aspects of NoSQL databases and social media analysis. It now forms the basis of future work to extend its capabilities and further improve its performance.
