Doctoral Thesis

A Scalable Data Store and Analytic Platform for Real-Time Monitoring of Data-Intensive Scientific Infrastructure

A thesis submitted to Brunel University London in accordance with the requirements for award of the degree of Doctor of Philosophy

By

Uthayanath Suthakar

in

the Department of Electronic and Computer Engineering, College of Engineering, Design and Physical Sciences

November 23, 2017

Declaration of Authorship

I, Uthayanath Suthakar, declare that the work in this dissertation was carried out in accordance with the requirements of the University’s Regulations and Code of Practice for Research Degree Programmes and that it has not been submitted for any other academic award. Except where indicated by specific reference in the text, the work is the candidate’s own. Work done in collaboration with, or with the assistance of, others is indicated as such. Any views expressed in the dissertation are those of the author.

SIGNED: ...... DATE: ...... (Signature of student)

“Progress is made by trial and failure; the failures are generally a hundred times more numerous than the successes; yet they are usually left unchronicled.”

Sir William Ramsay, Chemist

Abstract

Monitoring data-intensive scientific infrastructures in real time, covering jobs, data transfers and hardware failures, is vital for efficient operation. Due to the high volume and velocity of the events produced, traditional methods are no longer optimal. Several techniques, as well as enabling architectures, are available to address the Big Data challenge. In this respect, this thesis complements existing survey work by contributing an extensive literature review of both traditional and emerging Big Data architectures.

Scalability, low latency, fault tolerance and intelligence are key challenges for the traditional architecture. Big Data technologies and approaches have become increasingly popular for use cases that demand scalable, data-intensive (parallel) processing, fault tolerance (through data replication) and support for low-latency computation. In the context of a scalable data store and analytics platform for monitoring data-intensive scientific infrastructure, the Lambda Architecture was adapted and evaluated on the Worldwide LHC Computing Grid, where it proved effective, especially for computationally and data-intensive use cases.

In this thesis, an efficient strategy for the collection and storage of large volumes of data for computation is presented. Moving the transformation logic out of the data pipeline and into the analytics layer simplifies the architecture and the overall process: processing time is reduced, untampered raw data are kept at the storage level for fault tolerance, and the required transformation can be performed when needed.

An optimised Lambda Architecture (OLA) is presented, which involved modelling an efficient way of joining the batch layer and the streaming layer with minimal code duplication in order to support scalability, low latency and fault tolerance. A few models were evaluated: a pure streaming layer, a pure batch layer and the combination of both batch and streaming layers. Experimental results demonstrate that the OLA performed better than the traditional architecture as well as the Lambda Architecture. The OLA was also enhanced by adding an intelligence layer for predicting data access patterns. The intelligence layer actively adapts and updates the model built by the batch layer, which eliminates the retraining time while providing a high level of accuracy using a Deep Learning technique.

The fundamental contribution to knowledge is a scalable, low-latency, fault-tolerant, intelligent and heterogeneous architecture for monitoring a data-intensive scientific infrastructure that can benefit from Big Data technologies and approaches.

Acknowledgements

I would like to express my sincere gratitude to Dr. David Ryan Smith, who supervised my research and guided me with great dedication throughout my time as a PhD student. His view of research has proved valuable to my research and supporting work. His personality, together with his high standards for research, makes him an ideal advisor. I would like to thank Prof. Akram Khan for introducing me to the Support for Distributed Computing group at the CERN IT department, and for providing me with support and advice throughout my time as a PhD student.

I would like to express special thanks to Dr. Luca Magnoni for supervising me at CERN and for sharing his extensive knowledge with me while conducting this research. I am thankful to all the members of the Support for Distributed Computing group at the CERN IT department for guiding and supporting me while I was on site.

I am grateful to the Thomas Gerald Gray Trust for funding and supporting my research.

I would like to thank all my family members for supporting me while conducting this research. Without them, I could not have reached this stage in my life. Finally, I have made a few exceptional friends at Brunel University London: Nikki Berry, Rhys Gardener, Hassan Ahmad, Asim Jan and Sema Zahid. I would like to thank them for making my experience enjoyable while conducting this research.

List of Publications

The following papers have been accepted for publication, or are to be submitted, as a direct or indirect result of the research discussed in this thesis.

Suthakar, U., Magnoni, L., Smith, D.R., Khan, A., Andreeva, J., An efficient strategy for the collection and storage of large volumes of data for computation, Springer: Journal of Big Data, (2016) [This paper corresponds to Chapter 3].

Magnoni, L., Suthakar, U., Cordeiro, C., Georgiou, M., Andreeva, J., Khan, A., Smith, D.R., Monitoring WLCG with lambda-architecture: a new scalable data store and analytics platform for monitoring at petabyte scale. Journal of Physics: Conference Series 664(5), 052023 (2015) [This paper corresponds to Chapter 4].

Suthakar, U., Magnoni, L., Smith, D.R., Khan, A., Optimised lambda-architecture for monitoring WLCG using Spark and Spark Streaming, IEEE Nuclear Science Symposium, (2016) [This paper corresponds to Chapter 5].

Suthakar, U., Magnoni, L., Smith, D.R., Khan, A., Optimised Lambda Architecture for monitoring scientific infrastructure, IEEE Transactions on Parallel and Distributed Systems, (2017) [This paper corresponds to Chapter 5].

Contents

1 Introduction ...... 1
1.1 Motivations ...... 4
1.2 Methodology ...... 5
1.3 Major Contributions to Knowledge ...... 6
1.4 Thesis Organisation ...... 8

2 Review of Literature ...... 10
2.1 Architecture ...... 10
2.1.1 Review ...... 11
2.1.2 Summary ...... 18
2.2 Technology ...... 18
2.2.1 Batch Process ...... 18
2.2.2 Interactive ad-hoc query engine ...... 22
2.2.3 Real-Time Processing ...... 26
2.2.4 Summary ...... 33
2.3 Machine Learning techniques ...... 33
2.3.1 Machine Learning libraries and techniques ...... 34
2.3.2 Summary ...... 37

3 An efficient strategy for the collection and storage of large volumes of data for computation ...... 38
3.1 Introduction ...... 39
3.2 Background ...... 39
3.3 Design and methodology ...... 46
3.3.1 Implementation ...... 48
3.4 Results and discussion ...... 51
3.4.1 Performance results of data ingestion with and without data transformation ...... 54
3.4.2 Performance results of intermediate data transformation using a MapReduce job ...... 56
3.4.3 Performance results of a simple analytic computation with and without data transformation ...... 56
3.4.4 Summary of the performance results ...... 57
3.4.5 Evaluation of Apache Flume ...... 58
3.5 Summary ...... 59

4 Monitoring scientific infrastructure with the Lambda architecture ...... 61
4.1 Introduction ...... 61
4.2 The Lambda Architecture ...... 62
4.2.1 Difference between common scientific use case and the classic Lambda use case ...... 63
4.3 A new data store and analytics platform for monitoring scientific infrastructure ...... 63
4.3.1 Data transport: Message Broker ...... 64
4.3.2 Data collection: Apache Flume ...... 64
4.3.3 Batch processing: Apache Hadoop ...... 65
4.3.4 Archiving: HDFS ...... 65
4.3.5 The common data access service layer ...... 65
4.3.6 The serving layer: Elasticsearch ...... 66
4.3.7 Real-time processing: Esper ...... 66
4.4 Implementation of WLCG analytics on the new platform ...... 66
4.4.1 WLCG data activities use case ...... 67
4.4.2 Implementation of the batch layer ...... 67
4.4.3 Data representation ...... 69
4.4.4 Implementation of the real-time layer ...... 70
4.4.5 Implementation of the serving layer ...... 73
4.5 Performance results for WDT computation on the new platform ...... 74
4.5.1 Experiment setup ...... 75
4.5.2 The performance of batch computations with scaling dataset ...... 77
4.5.3 The performance of batch computations with scaling nodes ...... 90
4.5.4 The performance of batch computations with parallelisation ...... 91
4.5.5 The performance of the serving layer ...... 93
4.5.6 The performance of the real-time processing ...... 94
4.6 Summary ...... 94

5 Optimised Lambda Architecture using Apache Spark technology ...... 96
5.1 Introduction ...... 97
5.2 Background ...... 98
5.3 Architecture and design ...... 100
5.3.1 Merging and synchronising Optimised Lambda Architecture layers ...... 102
5.4 Performance evaluation of the Optimised Lambda Architecture ...... 108
5.4.1 Experiment setup ...... 108
5.4.2 Illustration of the workflow ...... 110
5.4.3 Performance evaluation of WLCG environment and WDT use case ...... 113
5.4.4 Evaluating the accuracy of monitoring computations ...... 125
5.4.5 Evaluation of scalability on the Amazon EC2 cloud cluster ...... 127
5.5 Summary ...... 130

6 Real-time processing and Machine Learning for forecasting data access pattern ...... 132
6.1 Introduction ...... 133
6.2 Data Analysis and Data Modelling ...... 134
6.2.1 Data Acquisition ...... 135
6.2.2 Data Pre-Processing ...... 135
6.2.3 Initial DAP dataset analysis ...... 135
6.2.4 Auto-regression analysis ...... 140
6.2.5 Comparison of auto-regression and mean forecasting model ...... 142
6.3 Machine Learning Techniques and Algorithms ...... 143
6.3.1 Sparks K-Nearest Neighbours (KNN) ...... 144
6.3.2 Sparkling Water (H2O) and Deep Learning ...... 144
6.4 Design of the Data Access Pattern Intelligence Layer ...... 147
6.4.1 Deep Learning Model for the Data Access Pattern Study ...... 147
6.4.2 KNN Model for the Data Access Pattern Study ...... 148
6.4.3 Batch and Online Models for the Data Access Pattern Study ...... 149
6.4.4 Active and Adaptive Learning ...... 151
6.4.5 Dataset Access Pattern Training Algorithm ...... 151
6.5 Evaluation of the Adaptive Data Access Pattern Model ...... 152
6.5.1 Experiment setup ...... 152
6.5.2 Data Preparation for Evaluation ...... 153
6.5.3 Performance Evaluation of the DAP on a Traditional Model ...... 153
6.5.4 Evaluation of Accuracy of DAP on a Traditional System ...... 154
6.5.5 Performance Evaluation of the Intelligence Layer for the DAP Study ...... 155
6.5.6 An Accuracy Evaluation of the Intelligence Layer for the DAP Study ...... 157
6.5.7 Scalability Evaluation of the Intelligence Layer for Studying DAP ...... 162
6.6 Summary ...... 164

7 Conclusions and Future Work ...... 166
7.1 Conclusions ...... 166
7.2 Future Work ...... 169

Bibliography ...... 178

List of Figures

1.1 Executives’ interpretation of Big Data based on an online survey [1] ...... 2
1.2 An example of a traditional monitoring architecture [2] ...... 3

2.1 The high level perspective of Lambda Architecture (adapted from Hausenblas and Bijnens [3]) ...... 13
2.2 The Druid Architecture (adapted from Hausenblas and Bijnens [3]) ...... 14
2.3 The Kappa architecture (adapted from [4]) ...... 15
2.4 Hadoop architecture ...... 20
2.5 Apache Drill architecture [5] ...... 23
2.6 Cloudera Impala [6] ...... 24
2.7 Presto architecture [7] ...... 25
2.8 Storm topology [8] ...... 27
2.9 S4 architecture [9] ...... 28
2.10 Kinesis architecture [10] ...... 29
2.11 Samza architecture [11] ...... 30
2.12 Spark Streaming data flow [12] ...... 31

3.1 Size of CERN LHC experimental data sets over the past years. The total disk and tape storage amounts aggregated for all Tier-1 locations in the CERN grid (adapted from [13]) ...... 41
3.2 WLCG Tier-1 and Tier-2 connections [14] ...... 44
3.3 Configuration of current data pipeline in WLCG (a) and the configuration of the proposed data pipeline for WLCG (b) ...... 53
3.4 Data ingestion from message queue to HDFS with and without data transformation ...... 55


3.5 Data size of the messages that were stored into HDFS with and without data transformation ...... 55
3.6 Intermediate MapReduce job for data transformation. Only the raw JSON messages are transformed with the MapReduce job ...... 56
3.7 Performance measurements of the statistic computation were done on pre-transformed and the raw 100,000 messages dataset ...... 57
3.8 Spikes of messages with a rate greater than 1 kHz. The red line is the messages received from the broker, green denotes the messages stored in old consumers, and blue denotes the messages stored in Apache Flume ...... 59

4.1 The new data analytics platform for monitoring a scientific infrastructure ...... 64
4.2 Daily volume of monitoring events from Federated ATLAS storage systems using XRootD (FAX), Anytime, Anywhere CMS storage systems using XRootD (AAA), ATLAS Distributed Data Management using Rucio (DDM rucio), ATLAS Distributed Data Management using Don Quijote (DDM DQ2) and File Transfer Service (FTS) for WDT dashboards [15] ...... 68
4.3 MapReduce computations diagram ...... 68
4.4 HDFS data partitioning ...... 69
4.5 Data format comparison (Avro versus CSV versus JSON for 1 Day FAX data) ...... 69
4.6 An overview of the system architecture with the Esper module [16] ...... 71
4.7 An example of the Map-Reduce approach on EPL statements implementation [16] ...... 72
4.8 WLCG Hadoop cluster workload [15] ...... 75
4.9 Computation of uncompressed Avro, CSV and JSON files over different date ranges. The primary axis (a) shows the execution time that is represented by lines, whereas the secondary axis (b) represents the input data size in MB, which is represented by bars ...... 78
4.10 Memory allocated/used for computing Avro, CSV and JSON files over scaling dataset ...... 80
(a) The plot shows the allocated memory ...... 80
(b) The plot represents the used memory ...... 80
(c) The plot represents the stacked up used memory versus allocated memory ...... 80


4.11 Average memory allocated for each task for computing Avro, CSV and JSON files over scaling dataset. The primary axis (a) shows the average allocated memory in MB that is represented by lines, whereas the secondary axis (b) represents the number of allocated tasks represented by bars ...... 81
4.12 The percentage of memory used for GC from the overall used memory by the tasks. The primary axis (a) shows the percentage of allocated memory for the job that is represented by lines, whereas the secondary axis (b) represents the percentage of memory used for GC represented by stacked bars ...... 82
4.13 CPU time used/allocated for computing Avro, CSV and JSON files over 30 day dataset ...... 83
(a) The plot represents the stacked up used and allocated CPU time ...... 83
(b) The plot represents used/allocated CPU time ...... 83
4.14 Shuffled intermediate results from mappers to reducer nodes ...... 84
4.15 Computation of compressed (Snappy) Avro, CSV and JSON files over scaling dataset. The primary axis (a) shows the execution time represented by lines, whereas the secondary axis (b) represents the input data size in MB represented by bars ...... 85
4.16 Memory allocated and memory used for computing compressed (Snappy) Avro, CSV and JSON files over scaling dataset ...... 86
(a) The plot shows the allocated memory ...... 86
(b) The plot represents the used memory ...... 86
(c) The plot represents the stacked up used memory versus allocated memory ...... 86
4.17 Average memory allocated for each task for computing Avro, CSV and JSON files over scaling dataset. The primary axis (a) shows the average allocated memory in MB that is represented by lines, whereas the secondary axis (b) represents the number of allocated tasks represented by bars ...... 87
4.18 The percentage of memory used for GC from the overall used memory by the tasks for computing compressed (Snappy) dataset. The primary axis (a) shows the percentage of allocated memory that is represented by lines, whereas the secondary axis (b) represents the percentage of memory used for GC represented by stacked bars ...... 88


4.19 CPU time used/allocated for computing compressed (Snappy) Avro, CSV and JSON files over scaling dataset ...... 89
(a) The plot represents the stacked up used and allocated CPU time ...... 89
(b) The plot represents used/allocated CPU time ...... 89
4.20 Shuffled intermediate results from mappers to reducer nodes ...... 89
4.21 Execution time versus the number of worker nodes ...... 91
4.22 Evaluation of parallelisation of reducer tasks when computing 5 GB dataset. Execution time versus the number of reducers ...... 92
4.23 Evaluation of parallelisation of reducer tasks when computing 4 million monitoring events. Execution time versus the number of reducers ...... 93

5.1 WDT dataset size [14] ...... 97
5.2 Pure stateless batch computation. Monitoring events were sent to the HDFS for batch computation, which can be scheduled to run at any preferred time interval ...... 100
5.3 Pure stateful streaming and combination of both batch and streaming computations. Monitoring events were duplicated with one sent to the HDFS for batch computation, while the other streamed straight into the streaming receiver for incremental computation ...... 102
5.4 Sequential jobs execution. Jobs were executed one at a time ...... 110
5.5 Parallel jobs execution. Multiple jobs were executed at a time ...... 111
5.6 Sequential tasks execution ...... 111
5.7 Overview of job stages ...... 112
5.8 Cached stages were reused by parallel jobs. The green circle denotes that an RDD is cached from the previous stage. The greyed stage (cached) was skipped by the following concurrent jobs ...... 112
5.9 Concurrent tasks execution. A job was split into multiple tasks and executed in each executor CPU core concurrently ...... 113
5.10 Insight into Stage 2 timeline ...... 113


5.11 Computation of Avro, CSV and JSON files over augmented dataset (day 1 to 30 days). The primary axis (a) shows the execution time that is represented by lines, whereas the secondary axis (b) represents the input data size in Megabytes (MB), which is represented by bars ...... 115
5.12 Comparison of the MapReduce versus the Spark framework against various data types ...... 116
5.13 Execution time versus the number of partitions of various data types ...... 117
5.14 Comparison of parallel and sequential jobs with cached and uncached datasets. Execution time versus parallel, sequential cached and uncached jobs ...... 118
5.15 Comparison of various cache types. Execution time versus computation of data cached in memory, disk, and memory and disk (also a combination of replicated and serialised dataset) ...... 119
5.16 Execution time versus the number of executors ...... 121
5.17 Execution time versus the amount of memory size ...... 122
5.18 Execution time versus the number of CPU cores ...... 123
5.19 Streaming data input rate. Streaming job receiving data at a rate of 116 events/second on average ...... 124
5.20 Streaming data processing time. Processing time shows that these batches have been processed within 88 ms on average ...... 124
5.21 Schedule delay in processing next batch ...... 125
5.22 Total delay in scheduling and processing streaming data ...... 125
5.23 The Spark batch computations for WLCG monitoring (some statistics are missing as highlighted) ...... 126
5.24 The Spark batch and streaming computations for WLCG monitoring (statistical data are in near real-time as highlighted) ...... 126
5.25 Execution time versus the memory size on the cloud infrastructure ...... 128
5.26 Execution time versus the number of executors on the cloud infrastructure ...... 129
5.27 Execution time versus the number of cores on the cloud infrastructure ...... 130

6.1 Monthly popular dataset trends ...... 136
6.2 Popular dataset trends in January. It shows all datasets accessed in January, with each box scaled by the number of accesses ...... 137


6.3 Evaluation of frequency for top five maximally requested datasets ...... 138
6.4 Mean forecasting the top 100 datasets ...... 140
6.5 Auto-regression on 100 datasets ...... 141
6.6 Auto-regression on 10,000 datasets ...... 141
6.7 Auto-regression on 100,000 datasets ...... 142
6.8 Deep Learning versus other Machine Learning techniques [17] ...... 145
6.9 Single neuron unit [17] ...... 146
6.10 Multilayer of interconnected neuron units [17] ...... 146
6.11 Deep Learning model for the DAP study ...... 147
6.12 Traditional KNN model on Orange Canvas ...... 148
6.13 Streaming online model. It shows that online model uses the previous seven days of data point for prediction (a), whereas (b) shows the incremental movement of each new day data points to the model ...... 150
6.14 Batch and Streaming online model ...... 151
6.15 Traditional KNN training and prediction time over various data instances ...... 154
6.16 Traditional KNN prediction error rate ...... 155
6.17 Distributed batch KNN and Traditional KNN training and prediction time over various data instances ...... 156
6.18 Distributed batch DL training and prediction for different sized training datasets ...... 157
6.19 Distributed batch KNN and Traditional KNN prediction error rate ...... 158
6.20 Distributed batch DL prediction error rate ...... 159
6.21 Plot showing the actual against the predicted dataset accesses for Linear Regression ...... 160
6.22 Plot showing the actual against the predicted dataset accesses for Decision Tree ...... 161
6.23 Plot showing the actual against the predicted dataset accesses for Random Forest ...... 162
6.24 Scalability of the distributed batch KNN training and prediction ...... 163
6.25 Scalability of the distributed batch DL training and prediction ...... 164

List of Tables

2.1 Operators comparison ...... 22

3.1 Summary of advantages and disadvantages of the proposed approaches ...... 49
3.2 Total sum of execution time for 100,000 messages dataset from Sections 3.4.1, 3.4.2 and 3.4.3 ...... 58

4.1 A mapping between Esper and the relational database model ...... 67
4.2 Performance of querying and creating documents (records) in Elasticsearch and Oracle ...... 93
4.3 Throughput results obtained from evaluating the real-time layer ...... 94

5.1 Used variables and their corresponding expressions in this section ...... 102

6.1 Index numbers of top ten accessed datasets each day ...... 138
6.2 Summed error for mean model ...... 142
6.3 Summed error for auto-regression model ...... 142
6.4 An example of raw data and prepared data (in vector form) for KNN training and prediction ...... 149

List of Abbreviations

AAA Anytime, Anywhere CMS storage systems using XRootD protocol
AI Artificial Intelligence
CEP Complex Event Processing
CERN Conseil Européen pour la Recherche Nucléaire
CPU Central Processing Unit
CSV Comma-Separated Value
DAG Directed Acyclic Graph
DAP Data Access Pattern
DDM ATLAS Distributed Data Management
DL Deep Learning
DT Decision Tree
DQ2 Distributed Data Management using Don Quijote
EC2 Amazon Elastic Compute Cloud
ED Experiment Dashboard
EPL Event Processing Language
ETL Extraction, Transformation and Loading
FAX Federated ATLAS storage systems using XRootD protocol
FTS File Transfer Service
GB Gigabyte
GBM Gradient Boosting Model
GC Garbage Collection
GFS Google File System
GUI Graphical User Interface
H2O Sparkling Water Machine Learning library


HDFS Hadoop Distributed File System
HTTP Hypertext Transfer Protocol
JSON JavaScript Object Notation
JVM Java Virtual Machine
KNN K-Nearest Neighbors
LA Lambda Architecture
LHC Large Hadron Collider
LR Linear Regression
MB Megabyte
ML Machine Learning
MLLIB Apache Spark Machine Learning library
OLA Optimised Lambda Architecture
OOM Out of Memory
PB Petabyte
PD2P PanDA Dynamic Data Placement
PL/SQL Procedural Language/Structured Query Language
POJO Plain Old Java Object
R2 R-Squared
RDBMS Relational Database Management System
RDD Resilient Distributed Dataset
RDDM ATLAS Rucio Distributed Data Management
RF Random Forests
RMSE Root-Mean-Square Error
UI User Interface
WDT WLCG Data acTivities
WLCG Worldwide LHC Computing Grid
YARN Yet Another Resource Negotiator

Chapter 1

Introduction

Life-changing scientific discoveries are vital for human existence. Try to imagine life without antibiotics: people would not live nearly as long without them, so these findings are essential. A few decades ago the speed of scientific discoveries was limited by the available technologies, computing power being one example. Fortunately, things have changed in recent years; in particular, recent developments in the field of computing have led to the creation of platforms that support and speed up scientific discoveries. An example of such a scientific infrastructure is located at CERN, one of the largest in the world, where physicists are trying to understand how the universe evolved. In order to obtain an answer to this question, they are using the Large Hadron Collider (LHC) to accelerate protons and steer billions of proton-proton collisions. The Worldwide LHC Computing Grid (WLCG) provides the computing resources to store, distribute and analyse the ∼30 Petabytes of data generated annually by the LHC, serving a community of more than 3,000 physicists distributed among 170 computing centres around the world [14]. Every day, thousands of files are transferred, and hundreds of thousands of processing jobs are executed at the different computing sites to perform scientific analysis of LHC data. This is the case for many other scientific infrastructures, such as the EarthScope and Fermilab experiments, to name a few. Monitoring such a distributed, geographically sparse and data-intensive infrastructure is a core functionality and one of the greatest challenges in providing a reliable scientific platform.

Scientific infrastructure has to cope with increases in the volume, velocity and variety of the data being monitored; these are the characteristics of what has become known as Big Data. Although Big Data has emerged rapidly, there is still some uncertainty about the term, as is evident from the results of an online survey [1]. Figure 1.1 shows executives’ interpretations of Big Data, which vary. However, in the context of this thesis the appropriate definition is provided by the TechAmerica foundation: “Big Data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information” [1].

Figure 1.1: Executives’ interpretation of Big Data based on an online survey [1].

The traditional architecture, where relational database management systems (RDBMS) are used to store, process and serve monitoring events, has clear limitations and has become inadequate for obtaining timely intelligence; an example can be seen in Figure 1.2 [2]. Scalability, the capacity of the system to accommodate heightened demands in terms of data processing, is hard to achieve with the traditional system, as the architecture cannot handle the overwhelming amount of data that needs to be processed. Fault tolerance, the quality of a system to remain in accurate operation when one or more of its components malfunction, is also tricky to achieve. A technique like sharding, a method of splitting up a large dataset into numerous, much smaller datasets that can be dispersed across multiple servers for load balancing, can help, but it pushes complexity to the application level, which leads to higher maintenance, increased operational costs and an increased possibility of human error. Moreover, effective monitoring requires low-latency read access to real-time data, which is not possible using the traditional architecture.

Figure 1.2: An example of a traditional monitoring architecture [2].

In recent years, the challenge of handling big volume (referring to the size of data), velocity (referring to the speed at which data are produced and the rate at which the data should be interpreted) and variety (referring to the structural diversity in the data) has been taken on by many companies, particularly in the Internet domain, leading to a full paradigm shift in data archiving, processing and visualisation. A number of new technologies have emerged, each one targeting specific aspects of large-scale distributed data processing. However, all these technologies, such as batch computation systems (which offer an efficient method of processing large volumes of data where the collection of events is accumulated over a span of time) and unstructured databases (which support text, images, audio, and video), forgo the structural guarantees provided by a traditional database management system. They can handle very large data volumes at little economic cost using commodity hardware (moderately inexpensive, broadly available and compatible with other hardware of its standard), but with serious trade-offs.


For example, Hadoop is a distributed processing framework that can run large-scale batch computations on very large data volumes, but with high latency [18]. Another example is non-relational database systems, such as Cassandra, which can offer scalability but only support a very limited data model with no or relaxed consistency.

1.1 Motivations

The challenges highlighted above form the basis of this research study. The aim was to tackle a number of critical constraints of the traditional architecture for monitoring scientific infrastructure, with regard to developing an efficient, large-scale, real-time and intelligent system using distributed commodity infrastructures. In doing this it was important to prevent and minimise any system failures. The knowledge gained in the development of a monitoring system can be used for automating or improving the infrastructure, such as efficient job allocation, resource deployment and so forth.

The main contribution of the presented thesis work was a scalable data store and analytics architecture for real-time monitoring of data-intensive, heterogeneous scientific infrastructure, designed to profit from parallel processing on the collaborative infrastructure provided by the WLCG group. Distributed and parallel computing methods are considered fundamental for handling Big Data. Although distributed and parallel programming in dispersed environments is not a trivial task, several frameworks and models are available that make distributed and parallel computing more convenient. The Spark framework is one such Big Data programming framework, designed to reduce large-scale computation challenges and facilitate automatic distribution and parallelisation; it is analysed in depth in Chapter 2. However, the examination of Big Data technology-based approaches in the context of distributed scientific infrastructure is not yet well established. It is a field that can be recognised as being relatively new and that is evolving rapidly with regard to capitalising on its potential for application in the scientific field. From this standpoint, Big Data technology can be used to deal with both data-intensive (e.g. data mining for identifying patterns in unstructured data) and computationally intensive challenges (e.g. processing data produced at high velocity) in monitoring the scientific infrastructure. The fundamental aims of the thesis work were to explore the required accuracy, efficiency, intelligence, scalability, simplicity and cost effectiveness of the monitoring architecture.

1.2 Methodology

The data (monitoring events in the form of metadata) for the presented work were collected from three major experiments carried out at CERN: ATLAS, CMS and LHCb. This investigation takes the form of case studies of some important scenarios, such as checking jobs, data management, system failures and so forth, for evaluating the proposed monitoring architecture. A limitation of this study was the lack of evaluation of the proposed architecture on a scientific infrastructure other than the WLCG, due to a lack of association with other infrastructures. However, it is reassuring that the WLCG approved evaluation of the proposed architecture on its experiments, such as ATLAS and CMS, which are immense and isolated in their own right.

An experimental Hadoop cluster with fifteen nodes was set up to evaluate the proposed work presented in this thesis. The specifications and configurations of the nodes are presented in Chapter 4. Eight nodes were assigned 32 CPU cores, 64 GB RAM and 500 GB hard disk storage, and seven nodes were assigned 4 CPU cores, 8 GB RAM and 500 GB hard disk storage. The Linux 2.6 operating system was installed on all fifteen nodes. Hadoop 2.6, Flume 1.6 and Spark 1.6 were configured on all nodes. A custom-made metrics application was developed for collecting statistics from the proposed architecture.

To further evaluate the work presented in this study, a virtual cluster was created in the Amazon Elastic Compute Cloud (EC2) using a general purpose instance type, “m4.2xlarge”, that had eight virtual CPUs, 32 GB of memory and 20 GB of storage per instance. The cluster was configured with four nodes: one name node and three data nodes. The same Hadoop, Flume and Spark versions, and the same operating system, were installed on each instance.


1.3 Major Contributions to Knowledge

The principal contributions to knowledge from the work presented in this thesis can be summarised as follows:

The thesis presents an efficient approach for the collection and storage of high volumes of data for analytics, which is the discovery and interpretation of meaningful patterns in data. The proposed data pipeline is responsible for processing and moving data between different elements at specified intervals, where the output of one component is the input of the next one, and is based on a custom-made daemon using the Hadoop framework. The approach was also tried with a custom Flume implementation. A custom Java daemon was developed to read monitoring events from the message queues and write them into the Hadoop Distributed File System (HDFS), with and without data transformation and serialisation, a process of translating data into binary formats for efficiency. Also, a custom-made Flume implementation was developed to take advantage of the Hadoop appending mechanism to flush data into HDFS. A simple data analytics algorithm was implemented on MapReduce to process large volumes of monitoring events. The analysis was evaluated in terms of execution time and scalability in comparison with the traditional method used in the data pipeline. A working prototype based on the custom-Flume pipeline was deployed on the WLCG Hadoop cluster and made available during the LHC second run, in a context that reads events from multiple sources and writes them into the HDFS ready for analytics.
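A minimal sketch of such an ingestion daemon is given below, using the Hadoop FileSystem API. The blocking queue stands in for the broker client, and the HDFS path is an illustrative placeholder; this is a sketch of the idea, not the exact daemon deployed on the WLCG cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of a consumer daemon that drains monitoring messages from an
 * in-memory queue (standing in for the real message broker client) and
 * appends them, untransformed, to a file on HDFS.
 */
public class HdfsIngestionDaemon implements Runnable {

    private final BlockingQueue<String> messages;                      // fed by the broker client
    private final Path target = new Path("/monitoring/raw/events.json"); // hypothetical path

    public HdfsIngestionDaemon(BlockingQueue<String> messages) {
        this.messages = messages;
    }

    @Override
    public void run() {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            FSDataOutputStream out = fs.exists(target)
                    ? fs.append(target)   // requires append support on the cluster
                    : fs.create(target);
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    String event = messages.take();           // block until a message arrives
                    out.write((event + "\n").getBytes(StandardCharsets.UTF_8));
                    out.hflush();                             // make the event visible to readers
                }
            } finally {
                out.close();
            }
        } catch (Exception e) {
            throw new RuntimeException("HDFS ingestion failed", e);
        }
    }
}
```

Keeping the raw, untransformed message as the stored record is what allows the transformation step to be deferred to the analytics layer, as described above.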

The thesis presents an architecture for monitoring scientific infrastructure using the Lambda approach [19], a Big Data processing architecture. This method has been theoretically and practically tested on a smaller scale. While some studies have been carried out on the Lambda approach, there have been few empirical investigations conducted on a larger scale, and there has been little quantitative analysis. Therefore, in this thesis, the Lambda approach has been adopted, implemented and evaluated on one of the largest scientific infrastructures for monitoring, and an in-depth comparison was made with the traditional architecture. The first objective was the creation of the batch layer, to store a constantly growing dataset while providing the ability to compute arbitrary functions on it. The second objective was the creation of the serving layer, to store the batch-processed views, using indexing techniques to make them efficiently queryable. The third objective was the creation of the real-time processing layer, able to perform analytics on fresh data with incremental algorithms to compensate for batch processing latency.

The evaluation of the Lambda architecture indicated that it performed well, but the complexity of joining various types of technology together is very time consuming in implementation and maintenance, and requires a dual code base for streaming and batch analysis. The Lambda approach also relies heavily on the client side for joining the statistics produced by the batch and streaming processes. This thesis presents an optimised Lambda architecture using Apache Spark technology, which involved modelling an efficient way of joining batch computation and real-time computation without the need to add complexity to the User Interface (UI). A few models were explored for WLCG Data acTivities (WDT): pure streaming, pure batch computation and the combination of both batch and streaming. Evaluation of these methods was conducted on a Hadoop cluster and the results are presented.
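The sketch below illustrates, conceptually, the single-code-base idea behind this optimisation using the Spark and Spark Streaming Java APIs: one shared transformation is applied both to the archived events and to each streaming micro-batch, and the two results are merged inside the platform rather than on the client. The input path, socket source and batch interval are illustrative assumptions, not the thesis implementation.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

/**
 * Conceptual sketch: the same Spark transformation serves both the batch path
 * (historical events on HDFS) and the streaming path (fresh events), avoiding
 * the dual code base and client-side merging of the classic Lambda approach.
 */
public class OlaSketch {

    /** One shared computation used by both the batch and the streaming path. */
    static JavaPairRDD<String, Long> transfersPerSite(JavaRDD<String> events) {
        return events.mapToPair(line -> new Tuple2<>(line.split(",")[0], 1L))
                     .reduceByKey(Long::sum);
    }

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("ola-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.minutes(1));

        // Batch path: historical monitoring events already archived on HDFS.
        JavaPairRDD<String, Long> batchView =
                transfersPerSite(sc.textFile("hdfs:///monitoring/raw")).cache();

        // Streaming path: fresh events arriving on a socket (stand-in for the broker).
        JavaPairDStream<String, Long> merged = ssc.socketTextStream("localhost", 9999)
                .transformToPair(rdd -> transfersPerSite(rdd)
                        .union(batchView)             // merge fresh and historical statistics
                        .reduceByKey(Long::sum));

        merged.print();   // in the real platform this would be written to the serving layer
        ssc.start();
        ssc.awaitTermination();
    }
}
```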

In a scientific experiment, there is usually a lot of data movement; information on these movements can be collected for analysis. Using this information, models can be built to forecast data popularity for efficient data management and placement. This thesis presents how intelligence can be integrated into the monitoring infrastructure. It presents some insight studies of data access patterns within the ATLAS experiment. This study evaluated the distributed and parallel processing Machine Learning library from the Apache Spark stack, using supervised regression algorithms to forecast dataset popularity and the number of data replicas required for efficient resource utilisation. Based on this information a decision can be made as to whether enough (do nothing), too few (add) or too many (delete) replicas have been found. It also presents both a batch learning technique, which makes predictions based on historical training data using Spark, and an online learning technique, which dynamically adapts and makes predictions as the data are streamed in from Spark Streaming. Also, it presents a complex Deep Learning technique using the Sparkling Water (H2O) library for data access pattern prediction. Accuracy and performance are the primary properties for effective data popularity prediction for efficient data management; hence, they were evaluated and presented.
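As a brief illustration of the batch-trained regression path, the following sketch uses the Spark MLlib (Spark 1.x) Java API to fit a linear regression model on historical access counts. The feature layout (previous days' counts predicting the next day's count), the file path and the parameter values are illustrative assumptions rather than the models actually evaluated in Chapter 6.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

/**
 * Sketch: supervised regression on dataset access counts with Spark MLlib.
 * Input lines are assumed to be "label,f1,f2,..." where the label is the next
 * day's access count and the features are counts from the preceding days.
 */
public class DatasetPopularityRegression {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dataset-popularity");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<LabeledPoint> training = sc.textFile("hdfs:///popularity/training.csv")
                .map(line -> {
                    String[] parts = line.split(",");
                    double label = Double.parseDouble(parts[0]);
                    double[] features = new double[parts.length - 1];
                    for (int i = 1; i < parts.length; i++) {
                        features[i - 1] = Double.parseDouble(parts[i]);
                    }
                    return new LabeledPoint(label, Vectors.dense(features));
                })
                .cache();

        // Batch-trained model; 100 iterations is an illustrative choice.
        LinearRegressionModel model = LinearRegressionWithSGD.train(training.rdd(), 100);

        // Predict the expected number of accesses for one (hypothetical) feature vector.
        double predicted = model.predict(Vectors.dense(3.0, 5.0, 8.0, 13.0, 21.0, 34.0, 55.0));
        System.out.println("Predicted accesses: " + predicted);

        sc.stop();
    }
}
```

The predicted access count can then drive the replica decision described above: leave the number of replicas unchanged, add replicas, or delete surplus ones.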


The tasks presented in this thesis have been intensively evaluated on a Hadoop cluster provided by the WLCG. The optimised Lambda Architecture was also assessed on the EC2 cloud as well as on the WLCG cluster. Evaluation results are presented and analysed extensively.

1.4 Thesis Organisation

Following this introductory chapter, the thesis is organised as follows:

Chapter 2 provides a general background on Big Data architectures for batch and real-time processing, Big Data technologies and machine learning libraries.

Chapter 3 presents the design, implementation and evaluation of an efficient strategy for the collection and storage of large volumes of data for computation. The performance of the strategy is evaluated from the aspects of data ingestion with and without data transformation, intermediate data transformation using a MapReduce job, and a simple statistical analytic computation using a MapReduce job and a Procedural Language/Structured Query Language (PL/SQL) procedure, with and without data transformation.

Chapter 4 presents the design, implementation and evaluation of an architecture using the Lambda approach for monitoring a scientific infrastructure. The architecture models three core layers for supporting real-time monitoring. The architecture was evaluated from the aspects of scalability, data I/O rate, fault tolerance, real-time processing speed, data size/partitioning, data format and data compression.

Chapter 5 presents the design, implementation and evaluation of an optimised Lambda approach for monitoring the scientific infrastructure, which improved upon the analytics and real-time processing performance of the standard Lambda Architecture.


Chapter 6 evaluates machine learning algorithms for forecasting dataset popularity and the number of data replicas required for efficient resource utilisation. Also, both a batch learning technique, which makes predictions based on historical training data, and an online learning technique, which dynamically adapts and predicts as the data are streamed in, are examined.

Chapter 7 concludes the thesis and discusses some limitations of the research. In addition, suggested future research is pointed out for further improvements and extensions to the thesis work.

Chapter 2

Review of Literature

There have been numerous efforts to review Big Data architectures with various scopes, such as scalability, fault tolerance and low latency. In the past few years, a number of Big Data technologies have emerged to support these architectures. These technologies support the development of distributed analytics platforms for solving large-scale, low-latency and intelligent problems. This chapter provides an extensive review of Big Data architectures, Big Data technologies and Machine Learning libraries.

2.1 Architecture

In modern times, there are numerous architectures that are used in monitoring scientific infrastructure systems for the storage and analysis of data. These range from traditional architectures, which can only perform the basic activities and functionalities of a monitoring infrastructure, to modern approaches that are able to support additional necessities emerging from the monitoring infrastructure. The traditional architectures used relational database systems for the basic purposes of storing, processing and serving the monitoring events in an infrastructure system. On the other hand, the current and more modernised architectural approaches have been able to cover these functionalities and also overcome a number of limitations existing in the use of the traditional architectures [20].

This chapter acknowledges the presence of both the old and the new architectures in modern times by providing a review of their use.


2.1.1 Review

In modern times, there are numerous architectures that are used in monitoring scientific infrastructures for the storage and analysis of data. Some of the traditional architectures are still in use in many monitoring infrastructures, regardless of the fact that numerous new and more effective approaches have been introduced [21]. However, these traditional approaches are slowly being replaced with more advanced and modern approaches. This is because, while the traditional architectures use relational database systems for the basic purposes of storing, processing and serving the monitoring events in an infrastructure, they impose a number of limitations on the monitoring infrastructure. For instance, they make the system fail to cope with growth in the volume and variety of data, and in the velocity of the monitoring events. This is according to Hellerstein, Stonebraker, and Hamilton [20], who explain that most of the current and more modernised architectural approaches have been able to cover these functionalities and also overcome a number of limitations existing in the use of the traditional architectures.

In agreement with the above assertions, Marik, Schirrmann, Trentesaux and Vrba [22] state that, at present, monitoring infrastructure requires large volumes of heterogeneous data to be gathered for analysis. These data normally come from varied frameworks and services, such as data transfers, job monitoring and site tests, and the aim is to provide a flexible but rather uniform interface for use by scientists and the monitoring sites. Martinez, Garcia, and Marin-Lopez [23] explain that the architecture design used in modern times has a number of limitations. This architecture has a number of relational database systems that are vital for the processing, storage and serving of monitoring data. However, the architectural design cannot cope with an increase in the volume of data transfers, nor is it able to foresee the variety of monitoring events that may be required.

Andreeva et al. [24] argue that the current monitoring systems are able to deliver solutions that are reliable for the purposes of supporting the operations and functionalities of large-volume datasets. However, in the near future, monitoring systems will be forced to be more robust and able to cope with the changes likely to take place in the next generation of architectures. This is because the current traditional architecture designs cannot provide for scalability, and they are too costly to maintain, especially after implementation of techniques such as sharding (the method of splitting up a large dataset into numerous, much smaller datasets that can be dispersed across multiple servers for load balancing). Further, traditional architecture designs leave room for possible human faults and carry high maintenance costs. Current architecture designs also fail to offer up-to-date monitoring of the infrastructure; therefore, it is critical to have a real-time system with access to real-time data and low-latency reads. The database procedures built into the monitoring system should also be able to impose constraints on the timeliness of the monitoring data, granularity, and the format.

Lambda Architecture

In examining the performance of various traditional architectures, as well as some of the new ones introduced in modern times, Hausenblas and Bijnens [3] discuss the Lambda architecture. They observe that the term was coined by Nathan Marz to describe a generic, scalable data processing architecture that is also fault-tolerant. They note that Nathan Marz introduced this term while working on various distributed data processing systems at social media companies such as Twitter and BackType.

According to Hausenblas and Bijnens [3], the Lambda architecture's basic objective is to fulfil the requirements of any infrastructure that needs to be fault-tolerant and robust against both human and hardware faults. It is able to function in a wide variety of use cases and workloads in situations where it is critical to ensure that the system has low latency and provides regular updates to users. Therefore, the final system developed using the Lambda architecture is linearly scalable and scales out horizontally.


Figure 2.1: The high level perspective of Lambda Architecture (adapted from Hausenblas and Bijnens [3]).

As shown in Figure 2.1, the Lambda architecture has five critical layers, or stages, for servicing a system. The first stage involves the entrance of raw data into the system. At this stage, the data are dispatched to two different layers, the speed and the batch layers, where they are processed. In the batch layer, the data are managed within the master dataset and pre-computed into batch views. Then, they are forwarded to the serving layer, where the batch views are indexed to allow the data to be queried with low latency. In the speed layer, only recent data are processed; in this layer, the Lambda architecture is able to compensate for the high latency of updates to the data in the system. Queries entering the system are answered by merging the results from the batch views in the serving layer and from the speed layer [3].
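A conceptual sketch of that final query step is given below: a statistic is served by merging whatever the batch layer has pre-computed with the increments accumulated by the speed layer since the last batch run. The plain in-memory maps stand in for the real serving-layer and speed-layer stores and are purely illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the Lambda query step for one monitoring statistic (e.g. the
 * number of transfers per site): the answer is the batch view plus the
 * real-time view for the requested key.
 */
public class LambdaQuery {

    private final Map<String, Long> batchView = new HashMap<>();    // rebuilt by the batch layer
    private final Map<String, Long> realtimeView = new HashMap<>(); // updated by the speed layer

    /** Answer a query by merging both views for the requested key (site name). */
    public long transfersForSite(String site) {
        long fromBatch = batchView.getOrDefault(site, 0L);    // everything up to the last batch run
        long fromSpeed = realtimeView.getOrDefault(site, 0L); // events seen since that run
        return fromBatch + fromSpeed;
    }

    /** Called by the speed layer for each fresh monitoring event. */
    public void onEvent(String site) {
        realtimeView.merge(site, 1L, Long::sum);
    }

    /** Called when a new batch view is published; the speed view is then reset. */
    public void onBatchViewPublished(Map<String, Long> newBatchView) {
        batchView.clear();
        batchView.putAll(newBatchView);
        realtimeView.clear();
    }
}
```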

Druid architecture

In examining a real-time analytical data store, Yang, Tschetter, and Léauté [25] discuss the Druid architectural approach. To them, this open-source data store is able to support faster processing of aggregated data compared to the Lambda architecture; it is also flexible for filtering information and offers a low-latency data ingestion process. The system combines a column-oriented storage layer with a distributed, advanced indexing structure that allows exploration of tables with very large numbers of rows at sub-second latency.

Figure 2.2: The Druid Architecture (adapted from Hausenblas and Bijnens [3]).

In discussing the Druid architecture, Yang, Tschetter, and Léauté [25] explain that this approach is made up of various nodes that are designed for specific functions. In agreement with this view, Mukhopadhyay [26] notes that the nodes are crucial for the Druid architecture, as they support the complex nature of the entire system and separate the concerns of delivering queries from those of latency. In any case, as shown in Figure 2.2, there is evidently little interaction between the nodes, thus lowering the chances of encountering communication failures and ensuring high data availability [25].

With regard to solving complex data analysis issues, Yang, Tschetter, and Léauté [25] note that the nodes of the Druid approach come together to form a complete, fully functional system. Each node is able to provide real-time information by encapsulating the operations for ingesting and querying a stream of events. In order to deliver queries immediately, the nodes also ensure that they index events, because events that are indexed together via the nodes of a system can quickly respond to queries. For any modern architecture, scalability is very important but very hard to achieve, as sharding pushes complexity into the system. However, in order for effective monitoring to be achieved in any architecture, there is a need for low-latency (real-time) support.

Kappa architecture

In agreement with the views of Chen, Yang, Tschetter, and Léauté, Forgeat [4] explains that another approach to a real-time system and analytics platform is the Kappa architecture. This approach avoids the implementation of relational databases; instead, it has an immutable, append-only log. According to a report by Uesugi [27], it is from this log that data are streamed and fed into stores for serving via a computational system. Seconding the assertions by Uesugi, the report in [28] describes Kappa as a simpler architecture compared to the Lambda approach; in fact, it is a simpler and easier version of the Lambda approach. This is because, excluding the batch processing part of the Lambda architecture, the parts and functionality of the Kappa architecture are very similar to those of the Lambda approach.

Figure 2.3: The Kappa architecture (adapted from [4]).

In explaining some of the benefits associated with the use of the Kappa architecture, the report in [28] explains that it was initially invented in order to avoid maintaining two different code bases for the speed and batch layers respectively, as is the case with the Lambda architecture. This implies that the main idea behind the Kappa architecture is to ensure that the real-time data processing system and the continuous data reprocessing system are integrated into one effective and efficient system. This is because both parts are very critical to analytics platforms [4].

As shown in Figure 2.3, the Kappa architecture has only two layers: one for stream processing and one for serving. It also has a data section that is able to support other basic functionalities of a real-time system by storing results and historical data. In this approach, the stream processing jobs are tackled first; the data reprocessing jobs are carried out when some stream processing jobs need to be modified, altered or reprocessed. Kappa is mainly stream processing, and is reliable when coupled with tools offering certain guarantees (e.g. Kafka); however, Kafka is a temporary buffer, where the retention policy may be hours, or days at most. Processing an arbitrary set of historical data, a feature which is relevant in scientific infrastructure to handle recomputation, is therefore limited with this architecture.
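Reprocessing in Kappa amounts to replaying the retained portion of the log through the same streaming job, as in the minimal sketch below using a recent Kafka Java consumer client. The broker address, topic name, consumer group and processing logic are hypothetical placeholders, not part of any implementation discussed in this thesis.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

/** Sketch: Kappa-style reprocessing by rewinding the log and replaying events. */
public class MonitoringReplay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");  // hypothetical broker address
        props.put("group.id", "monitoring-replay");     // hypothetical consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Replay is bounded by the broker's retention policy (hours or days).
            TopicPartition partition = new TopicPartition("monitoring-events", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 0L); // rewind to the earliest retained offset

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // The same streaming logic handles both live and replayed events.
                    process(record.value());
                }
            }
        }
    }

    private static void process(String event) {
        System.out.println(event); // placeholder for the real computation
    }
}
```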

The serving layer in the Kappa architecture, just as in the Lambda architecture, is used for serving queries based on the results of the processing carried out. With regard to its application and use, Hellerstein, Stonebraker, and Hamilton [20] argue that where the algorithms for the real-time and historical data processing systems differ from each other, the Lambda architecture should be used. Ellis [29] notes that the main benefit associated with the Kappa architecture is the fact that it allows developers of real-time systems to operate, test and debug their systems on a single processing framework.

An exploration of a scientific infrastructure

To explore how monitoring of infrastructure is carried out in modern times, the monitoring of the Worldwide LHC Computing Grid (WLCG) infrastructure was examined. It is necessary for the heterogeneous data to be gathered and analysed; for instance, this process supports data transfers, testing of sites, and monitoring of processing jobs, among other requirements of the heterogeneous infrastructure. In providing a flexible, uniform interface, the monitoring process handles these data regardless of whether they come from the service framework or the experiment framework. The current architectural approach for storing, processing and serving monitoring data uses relational database systems accessed through Structured Query Language, among others [24].

While this architecture manages to support most of the functionalities where monitoring of the infrastructure is concerned, Andreeva et al. [24] note that this is not adequate. It has limitations in coping with higher Large Hadron Collider luminosity, new data transfer protocols, and cloud computing, among other newly introduced techniques of the WLCG events monitoring process. Therefore, Ellis and Jung argue that there is a need for the development of a new scalable data store and analytics platform for use in the current times [23].

Chen [30] agrees with the views discussed above and notes that, in a scientific workflow system, it is critical to track the status of the workflow in real time while the user is notified of any anomalies and failures automatically. In order to add real-time monitoring and troubleshooting capability, one could integrate a monitoring infrastructure such as the Stampede monitoring infrastructure [31] into a generic system. This infrastructure is able to address interoperable monitoring jobs through a three-layer architecture. It is also useful for describing workflow and job executions, as well as delivering high-performance tools to load workflow logs that conform to the data model within the data store system. Stampede is also described as an infrastructure that offers an interface via which the user can query events and extract data from the storage unit [32]. However, this infrastructure uses a traditional relational database for storage.


2.1.2 Summary

Traditional architectural approaches to real-time monitoring of computing infrastructure are slowly being replaced by modern approaches for the storage and analysis of data. This is because these modern architectures support additional necessities emerging from Big Data, such as low latency, fault tolerance and scalability, unlike the traditional architectures that use relational database systems for the basic purposes of storing, processing and serving the monitoring events in an infrastructure system. In addition, the current approaches also overcome the limitations (e.g. high latency) existing in the use of the traditional architectures. This section has examined the scalability of data storage and analytics platforms, reviewed the use of the architectures, and discussed scientific experiments and how data as well as analytics jobs are used in the infrastructures.

2.2 Technology

The aim of this section is to review some of the state-of-the-art technologies that can be used to design a monitoring system for storing and handling Big Data and performing real-time analytics on them.

2.2.1 Batch Process

A batch process is required to store the constantly growing Big Data and to perform historical data analysis that identifies patterns such as job failures, popular data, busy sites and so forth. Hadoop is widely considered the de facto framework for analysing Big Data. However, many technologies available for distributed systems go beyond MapReduce, the programming model for processing Big Data introduced by Google [33]. This section reviews such technologies.


Apache Hadoop: MapReduce and HDFS

The Hadoop ecosystem has been used for many research and commercial products [34][35]. It has gone through rigorous implementation and testing, which makes it robust. There are many components in the Hadoop ecosystem, but MapReduce and HDFS are the most relevant here, as they support parallel processing and large-scale data storage with high data availability; therefore, they are analysed in this section. MapReduce is a programming model that was designed to remove the complexity of processing data that are geographically scattered around a distributed infrastructure [33, 36]. It hides from the users the complexity of parallel computation, load balancing and fault tolerance over a large range of inter-connected machines. The MapReduce programming model predefines two simple parallel methods, map and reduce, whose implementations are user-specified, so users retain control over how the data should be processed [36]. Hadoop was designed on the premise that moving computation to where the data reside is better than the reverse, as it reduces network bottlenecks, especially when the data being transferred are of the size of terabytes to petabytes [33]. Therefore, map and reduce jobs are allocated to where the data reside, scheduled by a job task manager as shown in Figure 2.4. The data are read from a local disk (file system); mapped, with all records being independently processed and key/value pairs assigned; the intermediate results are stored on a local disk and shuffled (transferred to where the reduce jobs are located); and reduced, so that records with identical keys are processed together, with the output written back to disk (this output can be the input to another MapReduce job) [33]. Fault tolerance in MapReduce is supported by periodically checking whether the worker nodes are active. Master failures can be protected against by using check-pointing, an approach that enables applications to recover from failure by loading the last check-pointed state at start-up.
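To make the map/shuffle/reduce flow described above concrete, the following minimal sketch shows a word-count job written in the Hadoop Streaming style, where the mapper emits key/value pairs and the reducer processes all records with identical keys together. The script name, the mode argument and the use of Hadoop Streaming (rather than the native Java API) are illustrative assumptions and not part of the WLCG pipeline described in this thesis.

#!/usr/bin/env python
# wordcount.py -- illustrative Hadoop Streaming job: run with "map" for the
# map phase, otherwise it acts as the reducer on the sorted mapper output.
import sys

def mapper():
    # Emit one "<word> TAB 1" pair per word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    # Hadoop Streaming delivers the mapper output sorted by key, so all
    # counts for the same word arrive consecutively and can be summed.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()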


Figure 2.4: Hadoop architecture.

The MapReduce framework is built to run on HDFS and executes I/O operations on it. HDFS guarantees scalability on commodity hardware, fault tolerance, high throughput, load balancing, data integrity and portability [37]. It employs a master-slave architecture, which is prone to a single point of failure; failover to a standby server is supported, but this incurs downtime. Data are replicated across nodes for load balancing, fault tolerance and high availability [37].

Apache Spark

Apache Spark is an in-memory distributed computing framework [38, 39]. It provides a general programming model supporting iterative classes of algorithms, interactive applications and algorithms containing common operators such as Map, Reduce, Join, Filter, GroupBy, Sort, LeftOuterJoin, RightOuterJoin, Count, Union, Cross and so forth [38].


Spark allows the dataset to be kept in memory by introducing a new memory abstraction called Resilient Distributed Datasets (RDDs) [39]. Instead of repeated I/O operations, Spark fetches the dataset once from the file system, stores it in memory and accesses it directly from memory thereafter, which improves performance. By storing intermediate results in memory, it provides a mechanism for reusing the data to perform other operations, such as iterations (e.g. deriving many results from the same data).
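A minimal PySpark sketch of this reuse pattern is shown below; the input path and the two derived results are illustrative assumptions, but the point is that the cached RDD is materialised once and then served from memory for both actions.

from pyspark import SparkContext

sc = SparkContext(appName="rdd-reuse-sketch")

# Load the dataset once and keep it in memory (the RDD abstraction).
events = sc.textFile("hdfs:///data/monitoring/events.txt").cache()

# Both actions below reuse the cached partitions instead of re-reading HDFS.
total_events = events.count()
error_events = events.filter(lambda line: "ERROR" in line).count()

print(total_events, error_events)
sc.stop()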

Comparison

MapReduce is largely a solution for batch processing [33]. However, MapReduce copes poorly with workloads that require an arbitrary mixture of operations, iterative jobs or multiple inputs. Such workloads can still be achieved by chaining multiple Map and Reduce operations, but reloading the same data from disk multiple times seriously degrades performance. The Spark framework provides a mechanism for overcoming this issue through its built-in in-memory processing, and it extends the MapReduce framework with many operators such as Join, Group, Union and Cartesian, as shown in Table 2.1.


Table 2.1: Operators comparison.

Spark: flatMap, map, mapPartition, union, rightOuterJoin, leftOuterJoin, join, intersection, filter, groupByKey, sortByKey, updateByKey, reduceByKey, combineByKey

MapReduce: map, combine, reduce

In brief, MapReduce and Spark share common features such as load balancing and fault tolerance. Spark additionally provides an in-memory processing mechanism, interactive analysis and a richer set of operators.

2.2.2 Interactive ad-hoc query engine

Batch processing jobs are expected to run for hours, weeks, months or even years, and implementing them requires technical skills. A simpler framework is therefore necessary for non-technical users, hence interactive ad-hoc query engines were explored. A few commercial companies and research communities have developed tools to address this need, and these are reviewed in the following sections.


Apache Drill

Apache Drill is a distributed execution engine that facilitates interactive, ad-hoc querying of heterogeneous data sources on a large scale, and was inspired by Google's Dremel [40, 5]. Its design goal is to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds [5]. As shown in Figure 2.5, Drill's architecture is made up of the following major components: the query languages, responsible for parsing the user's query and constructing an execution plan; a low-latency distributed execution engine that provides the scalability and fault tolerance needed to efficiently query petabytes of data; nested data formats, responsible for supporting various data formats (the initial goal was to support the column-based format used by Dremel [5]); and scalable data sources, responsible for supporting a variety of data sources, with the initial focus on leveraging Hadoop as a data source [5].

Figure 2.5: Apache Drill architecture [5].

From a distribution perspective, a Drillbit, a specific node instance of Drill, uses local memory and data. Queries can be made from any such instance [5]. The co-ordination, query planning and optimisation, scheduling, and execution are then distributed.


Cloudera Impala

Cloudera Impala is a massively parallel processing (MPP) architecture for performing SQL-like queries on HDFS and HBase (a column-based data store built to run on top of HDFS) storage, as shown in Figure 2.6, and unlike alternatives such as Hive (a data warehouse tool for querying and analysing large datasets) it does not employ the MapReduce model [6]. It leverages techniques such as columnar storage to perform very fast in-memory scans, in the order of seconds, over huge amounts of data. Data in HDFS or HBase do not require Extraction, Transformation and Loading (ETL) and can be queried directly, without any data movement or predefined schemas, using SQL-like commands. Impala inherits Hadoop's built-in security by integrating with Kerberos (a protocol for validating service requests between hosts) for authentication and role-based authorisation [6].

Figure 2.6: Cloudera Impala [6].

Presto

Presto is a distributed, low-latency, interactive and SQL-compliant query engine optimised for ad-hoc analysis, developed by Facebook [7]. It supports most of the ANSI SQL standard, including complex queries, aggregations, joins, and window functions [7]. All processing is carried out in-memory and pipelined across the network between steps, which reduces read/write time to disk and thus improves performance. The shortcoming of the system is its inability to write output data back to tables, as it only supports a read-only mode. In the Presto architecture, as shown in Figure 2.7, a coordinator receives SQL queries from the client, which it then analyses, parses and plans for execution [7]. The scheduler then connects to the execution pipeline and assigns the jobs to worker nodes that reside closer to the data [7]. The client then fetches the results.

Figure 2.7: Presto architecture [7].

The Presto framework is extendable so that a variety of storage back-ends can be plugged in; however, each requires a connector that provides Presto with metadata, information on which nodes hold the data, and a way to fetch the data as a stream. A few of the storage plug-ins supported by Presto are HDFS, Hive, HBase and Scribe [7].

Comparison

Hadoop was not built for interactive ad-hoc querying; it mainly focuses on offline batch processing. This has resulted in a need for new technologies that satisfy the demand for interactive ad-hoc querying, and in recent years the tools discussed above have emerged to address this issue. In brief, Drill, Impala and Presto were developed to take advantage of in-memory temporary data locality. Drill supports both long-running and ad-hoc queries, whereas Impala and Presto do not support long-running queries.

No fault tolerance is implemented in Impala or Presto; when a node fails at execution time, the affected queries need to be re-executed. Impala was designed to take advantage of the existing Hive infrastructure and uses the same metadata. In contrast, Drill and Presto were developed to provide distributed query abilities across various data stores.

2.2.3 Real-Time Processing

Real-time processing is required to perform analytics on fresh data as they are received, so that an infrastructure can be monitored proactively and actions can be triggered to keep operations running smoothly.

Apache Storm

Apache Storm is a distributed real-time data system [8]. It is considered an alternative to high-latency batch processing, as it processes data in low-latency near real-time, and it can be integrated with queueing and database technologies. It facilitates scalability by enabling users to determine how many worker nodes are required to execute a job and the amount of parallelism (number of threads) required in the topology configuration. It uses an independent Apache technology called ZooKeeper for coordinating the cluster, which also supports cluster scaling [8]. The architecture employs a master-slave model [8]. The master node runs a daemon called Nimbus, which is responsible for distributing user applications to worker nodes, allocating jobs to the worker queue and monitoring the status of the worker nodes; on failure it will restart the node or reassign the task to other nodes [8]. The slave nodes run a daemon called the supervisor, which is responsible for checking the queue for new jobs [8].


Figure 2.8: Storm topology [8].

Storm uses tuples as its data model, which consist of a list of values. Groups of spouts (sources of streams) and bolts (computational processes) are packaged into a topology, which is then deployed to a cluster and runs indefinitely until killed manually. As shown in Figure 2.8, a topology consists of spouts, which are the sources of streams; bolts, which consume the streams and process them; and stream groupings, which state how the data should flow [8]. Storm also provides a tool called Distributed Remote Procedure Call (a protocol that a program can use to communicate with a program residing on another system on a network), which enables developers to implement complex functions and execute them in Storm utilising parallelism.

Simple Scalable Streaming System (S4)

The Simple Scalable Streaming System (S4) is a distributed general-purpose platform that processes continuous unbounded streams of data [9]. S4 employs the MapReduce and Actor (a mathematical model of concurrent computation) programming models [9]. S4 therefore utilises a concurrent, decentralised and symmetric architecture, with each node sharing the same functionality and responsibility, which is enforced by utilising Apache ZooKeeper to coordinate the cluster; there are no special nodes with special functions. The S4 model facilitates high availability and scalability on commodity hardware, low latency by utilising local memory, fault tolerance by check-pointing and summoning a standby server to take over the failed server's tasks, and a pluggable framework so that it is more generic and new components can be plugged in [9].

Figure 2.9: S4 architecture [9].

As shown in Figure 2.9, processing nodes are logical clusters of Processing Elements (PEs), entities that perform computation and transmit messages between PEs using data events. The processing nodes are responsible for listening to events, executing functions on the incoming events, transmitting events and emitting output events. An event listener in the Processing Node (PN) passes incoming events to the processing element container, which invokes the correct PEs corresponding to a unique key or generates new instances of PEs [9]. An application can be defined in terms of PEs with simple processing logic, and the framework instantiates one PE for each unique key in the stream. The communication layer provides load balancing, failover management and transport management [9]. There are numerous special PEs available for performing tasks such as Count, Aggregate, Join and so forth [9].

Amazon Kinesis

Amazon Kinesis is a cloud-based service for real-time processing of high-volume stream data [10]. As with any cloud service, Kinesis is based on a metering system, meaning that one pays for the amount of throughput and the number of Hypertext Transfer Protocol (HTTP) PUT transactions used [10]. Kinesis can consume any amount of data from any number of sources, scaling up and down as needed. The Kinesis client library handles load balancing, coordination and error handling automatically, so that the developer only needs to focus on processing the data as they become available.

Figure 2.10: Kinesis architecture [10].

As shown in Figure 2.10, Kinesis expects two components, the producer and the worker [10]. The producer accepts data from a source and converts them into Kinesis stream records of up to 1 MB, which are then transferred into the stream using the HTTP PUT method [10]. The worker then takes the data from the Kinesis stream and processes them. For scalability, the user has to take care of two things: adding or removing shards, depending on the required throughput capacity, and using the Kinesis client library and deploying the application onto Elastic Compute Cloud (EC2) instances within an auto-scaling group.

Apache Samza

Apache Samza is a pluggable distributed stream-processing framework for running continuous computation on infinite streams of data [11]. It is designed to sit on top of the Kafka messaging queue (a distributed publish-subscribe messaging system) for stream processing. It also utilises Apache Yet Another Resource Negotiator (YARN) for resource management and execution, which is responsible for deploying tasks in a distributed cluster, stream processor locality, co-partitioning of streams and providing security [11]. The Samza framework is similar to batch processing, as shown in Figure 2.11.

Figure 2.11: Samza architecture [11].

Samza partitions the messages, assigns the partition key and sequence ID, and orders them in strict sequence. All messages matching a partition key go into that particular partition. It also provides a replay mechanism so that a message can be re-read when required. The stream processing is done by a Samza Job, which performs a logical transformation on a set of input data and emits outputs [11]. Fault tolerance is managed by check-pointing, which enables failure recovery, and by state management, which maintains the state of the intermediate data that need to be passed between processes; this state is kept on the local disk with each task [11].


Spark Streaming

Spark Streaming is an extension of Spark that supports continuous processing [41, 12]. As shown in Figure 2.12, Spark Streaming is inspired by batch systems: processing is divided into small sets so that they can be replayed, failed tasks are reassigned to other nodes, and batch sizes are decreased to achieve low latency [41].

Figure 2.12: Spark Streaming data flow [12].

Spark Streaming provides transformation functions, which produce a new Discretized Stream (a continuous sequence of micro-RDDs) from one or more parent streams, and cumulative computation functions, which support cumulative computations as new data are streamed in [41]. Spark Streaming supports all operators that are supported in Spark, such as Map, Reduce, GroupBy, Join and so forth. It also provides a mechanism to aggregate data within a given window of time, and allows the developer to apply Spark's built-in machine learning and graph processing algorithms to data streams [12]. It supports check-pointing and fault tolerance, which it inherits from Spark.
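The following minimal PySpark Streaming sketch illustrates the micro-batch and windowing model described above; the socket source on localhost:9999, the 5-second batch interval and the 30/10-second window are illustrative assumptions only.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-window-sketch")
ssc = StreamingContext(sc, batchDuration=5)     # 5-second micro-batches

# Each line received on the socket becomes a record in the current micro-batch.
lines = ssc.socketTextStream("localhost", 9999)

# Count words over a 30-second window, recomputed every 10 seconds.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10))

counts.pprint()
ssc.start()
ssc.awaitTermination()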


Comparison

The existing large-scale MapReduce data-processing platforms are highly optimised for batch processing, which typically operates on static data. Therefore, a new paradigm was required to process data in real-time so that business-critical decisions can be made on time. This is what motivated the evolution of the stream processing technologies discussed above.

Storm, S4, Samza, Spark Streaming and Amazon Kinesis share the same aim, which is to provide a distributed, scalable and fault-tolerant infrastructure for processing continuous streams of data. Storm, Kinesis, Samza and S4 are fundamentally pipelines in which the source pushes discrete messages, which are then processed a record at a time. Spark Streaming, on the other hand, follows a batch processing model in which messages are collected and then processed at short time intervals in a batch manner; this makes it prone to latencies of the order of seconds compared with the former technologies. Nevertheless, Spark does not replicate messages as a mechanism for fault tolerance as the other systems do, an operation that incurs high disk I/O, network bandwidth usage and its own overheads. Spark instead utilises its in-memory storage abstraction (RDD), which tracks the "lineage" (series of recreation instructions) used to build it, so in the event of failure it can recompute the lost data using the cached steps. Storm does not support managing state, whereas S4, Samza and Kinesis provide tools to manage it locally or remotely. Storm is user-oriented, giving full control to users over how it should be configured, so an external database can be used to store state; however, this is costly in terms of performance. Samza provides a better mechanism that minimises remote communication by keeping state located locally with the tasks and invoking a remote update only when the state is modified. All the stream platforms discussed above utilise an in-memory mechanism for processing, except Amazon Kinesis. Although in-memory processing improves latency, it raises a new issue in terms of flushing out the memory, particularly for S4. A complex application utilising the S4 framework will generate a large number of unique processing elements, as it is designed to do, which will occupy a large portion of the memory and could degrade performance. There is a mechanism called Time-to-Live that explicitly configures how long a PE should live without any event communication before its memory is reclaimed, but this results in the loss of the PE's state. This can be mitigated by applying a priority or importance to the PE object, which can be customised by the user.

2.2.4 Summary

This section has given a brief overview of the different technologies that support batch processing, interactive ad-hoc queries and real-time analytics. Most of these technologies were developed by companies concentrating on their own specific use cases. For some, performance is paramount, whereas for others fault tolerance and recovery matter most, and one can often only be achieved by trading off the other. It is therefore not practical to expect a single perfect technology tailored to every requirement; it is important to distinguish and prioritise what is essential for the desired system and what can be compromised.

When a separate technology is used for each layer (batch, serving and speed layers), the infrastructure becomes very difficult and complex to maintain. The Spark stack, however, supports both batch processing and stream processing, using the same processing model and data structures for batch jobs and Spark Streaming.

2.3 Machine Learning techniques

One of the problems associated with Big Data is data access and placement. Efficient data placement strategies have a large impact on computation speed, and data placement, along with data movement and task assignment, is one of the factors that increase the cost of data centres. The problem gets worse when the data are located in distributed data centres. Therefore, in this section, some machine learning libraries and techniques are explored to understand how an adaptive data placement method could be implemented.


2.3.1 Machine Learning libraries and techniques

When Facebook suggests a new page or connects a user to new friends, when Netflix recommends a new movie to its users, or when Amazon suggests a new product to its customers, all these activities involve Machine Learning (ML). ML is one of the most rapidly growing subfields of Artificial Intelligence (AI), with the aim of machines adopting human-like intelligence. Machine learning lies at the intersection of computer science and statistics and is used in numerous real-world applications, including speech recognition, computer vision, biosurveillance and empirical science [42]. Machine learning tasks can be divided into three major categories [43]:

• Supervised learning: Labelled input data is presented to the system with the desired output and the goal is to train the system so that when unlabelled input data are provided, the system should predict the desired output.

• Unsupervised learning: Where the input data are not labelled and the goal is to discover the hidden pattern in the data.

• Reinforcement learning: Reinforcement learning enables a machine to learn from the feedback of a specific environment, for example in game theory or in video games.

Supervised learning tackles classification and regression problems. Classification is the task of assigning the correct class label to new data based on a training dataset, whereas regression deals with estimating the relationships between variables; a number of algorithms have been proposed for each type of problem.
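As a minimal illustration of supervised classification, the sketch below trains a classifier on labelled data and then predicts labels for unseen data; it uses the scikit-learn toolkit purely as an example (scikit-learn is not one of the Big Data libraries reviewed later in this section), and the toy dataset is a built-in sample rather than monitoring data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labelled input data (features X) with the desired output (labels y).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on labelled examples, then predict labels for unseen examples.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))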

The world is besieged with data, and the volume of data is increasing constantly. A huge amount of data is produced on the Internet every day, which the research community has named Big Data, and there is always a need to understand those data. It is beyond human capability to analyse and understand Big Data manually: Big Data is a collection of huge datasets requiring complex processing that is not possible on traditional computational systems. According to a study conducted by Dell EMC [44], the amount of digital content doubles every two years, and they predict that by 2020 digital data will reach 44 zettabytes, or 44 trillion gigabytes, in total size. They also recorded that 22% of the total data in 2013 was useful, that less than 5% was analysed, and that by 2020, 35% of data will be useful. With such a large increase in data, the machine learning research community faces the challenge of using this Big Data efficiently. Traditional machine learning toolkits like R or WEKA are not capable of handling Big Data. Hadoop [45], a framework for the Big Data problem, offers distributed storage and provides Big Data processing solutions, but alongside data processing there is also a need for machine learning algorithms that can be implemented in Big Data analysis systems.

A Hadoop project comprises four modules [18]: HDFS to store data; MapReduce to process the data; YARN, a resource negotiator; and Common, a set of common utilities. Most machine learning tasks are performed by the processing engine. MapReduce [46] was introduced by Google in 2004 as a programming model for processing huge datasets. Programs are executed in parallel on distributed machines, and the model is responsible for partitioning the input data, scheduling, inter-machine communication and handling machine failures. However, MapReduce does not support machine learning out of the box, so the open-source community has implemented MapReduce-based machine learning on multicore frameworks. It follows the batch learning approach, where a set of training data is read off HDFS and a list of key-value pairs is written to disk. MapReduce has lost its importance in the machine learning community as it is very slow in terms of processing time and requires a lot of computational resources and network bandwidth, although it is still much better than a standalone approach. Another problem associated with MapReduce is its fault tolerance mechanism, where intermediate data are written to disk and replicated in case of a machine failure; data replication and disk I/O consume 90% of the total running time on MapReduce [46].

To overcome the above-mentioned problems, the University of California, Berkeley, developed a new model named Spark [38], which is based on MapReduce but eliminates these shortcomings. To resolve the replication problem Spark provides a new storage primitive, the RDD [47], which lets the user store data in-memory [48]. An RDD is read-only shared memory [49], designed for iterative algorithms (machine learning and graph applications) and interactive data mining tools. Spark performs better than MapReduce and can also support real-time data processing [50].


Storm [8] is another Apache open-source distributed real-time computation system. Storm has received attention in the research community and in the commercial sector for its real-time processing, but it also supports batch processing using the Trident API, as well as on-line machine learning, which is of growing interest to the research community. Spouts and bolts are the two major components of the Storm architecture, and a network of spouts and bolts is called a topology. Flink [51] is another open-source framework for distributed environments that supports both batch and stream processing; it can run completely independently or be integrated with HDFS and YARN on Hadoop. In terms of machine learning, Flink-ML was introduced in April 2015, and the Scalable Advanced Massive Online Analysis (SAMOA) library can also be used for applying learning algorithms to stream data. A comparison between Spark and Flink has been performed by [49], concluding that Spark surpasses Flink in terms of fault tolerance, while in terms of computation time Flink is ahead of Spark but takes a long time when the required resources are not available. Sparkling Water (H2O) [52] is another open-source framework providing machine learning libraries along with data processing and evaluation tools. The H2O platform supports different interfaces for R, Java, Scala, Python and JSON [53], and provides a Graphical User Interface (GUI), which none of the above-mentioned platforms offers.

Different ML libraries are available for Big Data analytics; some of them are presented here. One of the most well-known ML libraries is Mahout, which provides clustering, classification, collaborative filtering, topic modelling, text vectorisation, similarity measurement and so forth. For classification, Mahout implements basic algorithms including logistic regression, the multilayer perceptron, hidden Markov models, Naive Bayes and random forest. For clustering, Mahout implements the K-Means algorithm, including Fuzzy K-Means and Streaming K-Means, and also supports spectral clustering. For collaborative filtering, Mahout supports both user-based and item-based recommendation. The major problem associated with Mahout is its slow runtime, due to the slowness of MapReduce. MLlib [54] is a Spark project providing machine learning functionality. MLlib is built on Apache Spark, which uses the benefits of in-memory computation for fast processing, and it covers all the algorithms supported by Mahout as well as regression models that are not included in Mahout. For classification MLlib supports SVM, nearest neighbour, random forest and so on; for clustering, K-Means, spectral clustering and others; and for regression, SVR (support vector regression), ridge regression, Lasso and logistic regression. Additionally, MLlib supports PCA, non-negative matrix factorisation, independent component analysis and more.

H2O provides deep learning, including clustering, classification, ensembles, statistical analysis, deep neural networks, data processing options and optimisation tools [53]. Deeplearning4j [55] is another deep learning tool, but it is aimed at business use rather than research. SAMOA [56] was developed by Yahoo, Barcelona; it is a framework that can also be used as a library [56], providing streaming algorithms for machine learning tasks such as clustering, classification and regression, along with the opportunity for users to develop new algorithms. Flink-ML is a machine learning library used with the Flink processing engine; it supports logistic regression, K-Means clustering and a DSL for linear algebra. Other machine learning libraries for Big Data include Oryx [57] and Vowpal Wabbit [58]. Oryx provides some basic machine learning tasks to perform on large-scale data, while Vowpal Wabbit, developed by Yahoo, is a fast online learner.
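To illustrate how such a library is driven from Spark, the sketch below clusters a handful of points with MLlib's K-Means; the in-line toy data, the choice of two clusters and the RDD-based mllib API (rather than the newer DataFrame-based spark.ml API) are illustrative assumptions.

from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="mllib-kmeans-sketch")

# Toy two-dimensional points; in practice these would be parsed from HDFS.
points = sc.parallelize([
    array([0.0, 0.0]), array([0.1, 0.2]),
    array([9.0, 8.5]), array([9.2, 9.1]),
])

# Train a K-Means model with two clusters on the distributed dataset.
model = KMeans.train(points, k=2, maxIterations=10)

print("cluster centres:", model.clusterCenters)
print("cluster of (0.05, 0.1):", model.predict(array([0.05, 0.1])))
sc.stop()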

2.3.2 Summary

This section has given a brief introduction to machine learning techniques and to the different technologies that support distributed machine learning.

Chapter 3

An efficient strategy for the collection and storage of large volumes of data for computation

In recent years, an increasing amount of data has been produced and stored. Social networks, the Internet of Things, scientific experiments and commercial services all play a significant role in generating vast amounts of data. Three main factors are important in Big Data: Volume, Velocity and Variety. All three must be considered when designing a platform to support Big Data. Data produced by a scientific infrastructure (e.g. the WLCG) are propagated at extremely high velocity, and traditional methods of collecting, storing and analysing data have become insufficient for managing this rapidly growing volume. It is therefore essential to have an efficient strategy for capturing these data as they are produced. In this chapter, a number of models are explored to determine the best approach for collecting and storing Big Data for analytics. An evaluation of the performance of full execution cycles of these approaches for collecting, storing and analysing monitoring data on the WLCG infrastructure is presented. Moreover, the models discussed are applied to a community-driven software solution, Apache Flume, to show how they can be integrated seamlessly.


3.1 Introduction

The field of data science has become a widely discussed topic in recent years due to a data explosion, driven especially by scientific experiments such as those that are part of the LHC at CERN, and by commercial businesses keen to enhance their competitiveness by learning about their customers in order to provide tailor-made products and services, which has dramatically increased the usage of sensor devices. Traditional techniques of collecting (e.g. a lightweight Python framework), storing (e.g. Oracle) and analysing (e.g. PL/SQL) data are no longer optimal for the overwhelming amount of data being generated. The challenge of handling big volumes of data has been taken on by many companies, particularly those in the Internet domain, leading to a full paradigm shift in methods of data archiving, processing and visualisation. A number of new technologies have appeared, each targeting specific aspects of large-scale distributed data-processing. These technologies, such as batch computation systems (e.g. Hadoop) and non-structured databases (e.g. MongoDB), can handle very large data volumes at little financial cost. Hence, it is necessary to have a good understanding of the currently available technologies in order to develop a framework that can support efficient data collection, storage and analytics.

The core aims of the presented study were the following:

• To propose and design efficient approaches for collecting and storing data for analytics that can also be integrated with other data pipelines seamlessly.

• To implement and test the performance of the approaches to evaluate their design.

3.2 Background

Over the past several years there has been a tremendous increase in the amount of data being transferred between Internet users. Escalating usage of streaming multimedia and other Internet-based applications has contributed to this surge in data transmission. Another facet of the increase is the expansion of Big Data, which refers to data sets that are many orders of magnitude larger than the standard files transmitted via the Internet; Big Data can range in size from hundreds of gigabytes to petabytes [59].


Within the past decade, everything from banking transactions to medical history has migrated to digital storage. This change from physical documents to digital files has necessitated the creation of large data sets and consequently the transfer of large amounts of data. There is no sign that the continued increase in the amount of data being stored and transmitted by users is slowing down. Every year Internet users move more and more data through their Internet connections. With the growth of Internet-based applications, cloud computing, and data mining, the amount of data being stored in distributed systems around the world is growing rapidly. Depending on the connection bandwidth available and the size of the data sets being transmitted, the duration of data transfers can be measured in days or even weeks. There exists a need for an efficient transfer technique that can move large amounts of data quickly and easily without impacting other users or applications [59].

In addition to corporate and commercial data sets, academic data are also being produced in similarly large quantities [60]. To give an example of the size of the data sets utilised by some scientific research experiments, a recent study observed a particle physics experiment (DZero) taking place at the Fermilab research center. While observing the DZero experiment between January 2013 and May 2015, Aamnitchi et al. [60] analysed the data usage patterns of users. They found that 561 users processed more than 5 petabytes of data with 13 million file accesses to more than 1.3 million distinct data files. An individual file was requested by at most 45 different users during the entire analysed time period.

There are many scientific research facilities that have similar data demands. The most popular and well-known example today is the LHC at CERN, where thousands of researchers in the fields of physics and computer science are involved with the various experiments based there. The experiments being conducted at the LHC generate petabytes of data annually [61, 62]; one experiment, ALICE, can generate data at a rate of 1.25 GB/s. Figure 3.1 illustrates the growth in the size of the data sets being created and stored by CERN. This graph shows the total amount of storage (both disk and tape) utilised by all of the top-level servers in the CERN organisation. The amount of data stored in the system has grown at a steady pace over the past three years and is expected to grow faster now that the intensity of the experiments is increasing, which will result in more data being collected per second [13].

Geographically dispersed researchers eagerly await access to the newest datasets as they become available. The task of providing and maintaining fast and efficient data access for these users is a major undertaking. Monitoring computing behaviours in the WLCG, such as data transfer, data access, and job processing, is also crucial for efficient resource allocation. This requires gathering metadata which describe the data (e.g. transfer time) from geographically distributed sources and processing such information to extract the relevant information for the WLCG group [63]. Since the LHC experiments are well known and many studies have been conducted on their demands and requirements, they can be used as a suitable case study for this research.

To meet the computing demands of experiments like those at the LHC, a specialised distributed computing environment is needed. Grid computing fits the needs of the LHC experiments and other similar research initiatives.

Figure 3.1: Size of CERN LHC experimental data sets over the past years. The total disk and tape storage amounts aggregated for all Tier-1 locations in the CERN grid (adapted from [13]).

The WLCG was created by CERN in 2002 in order to facilitate the access and dissemination of experimental data. The goal of the WLCG is to develop, build, and maintain a distributed computing infrastructure for the storage and analysis of data from LHC experiments [64]. The WLCG is composed of over a hundred physical computing centers with more than 100,000 processors [14]. Since the datasets produced by the LHC are extremely large and highly desired, the WLCG utilises replication to help meet the demands of users: copies of raw, processed, and simulated data are made at several locations throughout the grid.

The WLCG utilises a four-tiered model for data dissemination. The original raw data is acquired and stored in the Tier-0 center at CERN. This data is then forwarded in a highly controlled fashion on dedicated network connections to all Tier-1 sites. The Tier-1 sites are located in Canada, Germany, Spain, France, Italy, Nordic countries, Netherlands, Republic of Korea, Russian Federation, Taipei, United Kingdom and USA (Fermilab-CMS and BNL ATLAS).

The role of the Tier-1 sites varies according to the particular experiment, but in general they are responsible for managing permanent data storage (of raw, simulated, and processed data) and providing computational capacity for processing and analysis [64]. The Tier-1 centers are connected with CERN through dedicated links (Figure 3.2) to ensure high reliability and high-bandwidth data exchange, but they are also connected to many research networks and to the Internet [14]. The underlying components of a Tier-1 site consist of online (disk) storage, archival (tape) storage, computing (process farms), and structured information (database) storage. Tier-1 sites are independently managed and have pledged specific levels of service to CERN.

The Tier-2 sites were originally used for Monte Carlo event simulation and for end-user analysis. Any data generated at Tier-2 sites are forwarded back to Tier-1 centers for archival storage.

Other computing facilities in universities and research laboratories are able to retrieve data from Tier-2 sites for processing and analysis. These sites constitute the Tier-3 centers, which are outside the scope of the controlled WLCG project and are individually maintained and governed. Tier-3 sites allow researchers to retrieve, host, and analyse specific datasets of interest. Freed from the reprocessing and simulation responsibilities of Tier-1 and Tier-2 centers, these Tier-3 sites can devote their resources to their own desired analyses and are allowed more flexibility with fewer constraints [65]. As there are thousands of researchers eagerly waiting for new data to analyse, many users will find less competition for time and resources at Tier-3 sites than at Tier-2 sites.

It is important to note that users connecting to either Tier-2 or Tier-3 sites will use public, shared network connections, including the Internet. Grid traffic and normal World Wide Web traffic will both be present on these shared links. A user will also be sharing the site that they access with multiple other users. These factors can affect the performance of the data transfer between the selected retrieval site and the user. Retrieving these large data files also places a burden on shared resources and impacts other grid and non-grid users. When it comes to retrieving data in the WLCG, a normal user (depending on their security credentials) can access data on either Tier-2 or Tier-3 sites. Selecting a site to utilise can be a complicated task, with the performance a user obtains being dependent on the location chosen.


Figure 3.2: WLCG Tier-1 and Tier-2 connections [14].

Grid computing has emerged as a framework for aggregating geographically distributed, heterogeneous resources that enables secure and unified access to computing, storage and networking resources for Big Data [66]. Grid applications have vast datasets and/or carry out complex computations that require secure resource sharing among geographically distributed systems.

Grids offer coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations [67]. A virtual organisation (VO) comprises a set of individuals and/or institutions having access to computers, software, data, and other resources for collaborative problem-solving or other purposes [68]. A grid can also be defined as a system that coordinates resources that are not subject to centralised control, using standard, open, general-purpose protocols and interfaces, in order to deliver nontrivial qualities of service [69].

A number of new technologies have emerged for handling large-scale distributed data-processing (e.g. Hadoop), built on the belief that moving computation to where the data reside is less time-consuming than moving the data to a different location for computation when dealing with Big Data. This is certainly true when the volume of data is very large, because this approach reduces network congestion and improves the overall performance of the system. However, a key grid principle contradicts this, as in the grid approach computing elements (CE) and storage elements (SE) should be independent, although this is changing in modern grid systems. Currently, many scientific experiments are beginning to adopt the "new" Big Data technologies, in particular for metadata analytics at the LHC, hence the reason for the presented study.

Parallelisation is used to enhance Big Data computations. The well-known MapReduce [46] framework used in this chapter is well established in the area of Big Data science and supports parallelisation; its other key features are its inherent data management and fault-tolerance capabilities.

The Hadoop framework, an open-source MapReduce software framework, has also been employed in this chapter. It relies on HDFS [70], which is a derivative of the Google File System (GFS) [71]. Acting as a fault-tolerance and data-management system, HDFS splits and replicates the input data across a number of cluster nodes as the user provides data to the framework.

The approaches for collecting and storing Big Data for analytics described in this chapter were implemented on a community-driven software solution, Apache Flume, in order to understand how the approaches can be integrated seamlessly into the data pipeline. Apache Flume is used for effectively gathering, aggregating, and transporting large amounts of data. It has a flexible and simple architecture, which makes it fault tolerant and robust, with tunable reliability and data recovery mechanisms. It uses a simple extensible data model that allows online analytic applications [72].


3.3 Design and methodology

When data messages are consumed from a data transport layer and written into storage, some sort of data transformation will most likely be carried out before the data are stored in the storage layer. Such a transformation could be extracting the body from the message and removing the header, as it is not required, or serialisation or compression of the data. The WLCG uses a Python agent, the Dashboard consumer, to collect infrastructure status updates, transform them, and store them in the Data Repository, which is implemented in Oracle; it uses Procedural Language/Structured Query Language (PL/SQL) procedures for analytics. This is an example of a commonly used traditional approach. However, these technologies and methods are no longer optimal for data collection, storage and analytics, as they were not primarily designed for handling Big Data. A strategy needs to be in place to carry out the required transformation, as this plays a significant role in the performance of subsequent computations. In this chapter three different approaches were explored:

1. Implement the data transformation logic within the data pipeline. The messages, $M$, are read by the consumer, the transformation $\langle T\rangle$ is applied, and the results are written into the storage layer, $S$, for analytics $\langle A\rangle$:

\[ M \xrightarrow{\langle T\rangle} S \rightarrow \langle A\rangle \qquad (3.1) \]

2. Write the raw messages, $M$, directly into the storage layer, $S$, without any modification. An intermediate transformation $\langle iT\rangle$ then reads the raw data from storage, transforms the data and writes the results to a new path within the same storage layer for analytics $\langle A\rangle$:

\[ M \rightarrow S \xrightarrow{\langle iT\rangle} S \rightarrow \langle A\rangle \qquad (3.2) \]

3. Write the raw messages, $M$, into the storage layer, $S$, without any modification, and let the analytics $\langle A\rangle$ carry out the transformation $\langle T\rangle$:

\[ M \rightarrow S \rightarrow \langle \langle T\rangle + \langle A\rangle \rangle \qquad (3.3) \]

The first approach is the traditional way of transforming, storing and computing data as has been already described for the WLCG use case. However, this method relies too much on the data pipeline. If the data pipeline is replaced then the transformation logic would need to be re-implemented. Therefore, it is an inefficient design. Nevertheless, this method needs to be tested on the technology that supports Big Data.

The second approach has two benefits: the transformation logic is moved to a centralised location, and the untampered raw data are stored alongside the transformed data. Therefore, in the case of any inaccuracy in the transformed data, the correct transformed data can be recreated from the raw data. This is not possible with the first method because as soon as the data are transformed the raw data are discarded. Nevertheless, the second approach is more complex, as it requires a dedicated job to transform the data rather than the consumer carrying out the transformation, and it raises the question of when and how this job should be scheduled. This approach also requires more storage, as both raw and transformed data are kept; a transformation job could compress and archive the raw data to reduce the amount of storage required.

The third option is very simple and straightforward, as the raw data are written into the storage layer without any modification. The transformation only takes place at data analytics time: the transformation logic is implemented in a shared library, which can be imported into any analytics job, so the transformation is carried out as and when it is required. In this way, the untampered raw data are still kept in the storage layer and no additional job or storage is needed for data transformation. This approach does add an extra execution-time overhead to the analytics jobs and repeats the data transformation every time an analytics job is carried out. This should, however, not be too much of a problem, as Big Data technologies are built to enhance computation speed by parallelising jobs; hence, this arrangement should not significantly affect the execution time. A summary of the advantages and disadvantages of the three proposed approaches is given in Table 3.1, and a minimal sketch of the shared-library idea follows below.
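The sketch below illustrates the third approach in plain Python: a hypothetical shared module exposes the transformation, and an analytics job applies it while reading the untouched raw records. The module and field names are illustrative assumptions and not the actual WLCG transformation logic.

import json

# --- shared transformation library (hypothetical module "transforms") ------
def extract_body(raw_record):
    """Drop the message header and keep only the monitoring payload."""
    message = json.loads(raw_record)
    return message.get("body", {})

# --- analytics job: transformation happens at read time --------------------
def count_failed_transfers(raw_lines):
    failed = 0
    for line in raw_lines:
        event = extract_body(line)          # transform as and when required
        if event.get("transfer_status") == "FAILED":
            failed += 1
    return failed

if __name__ == "__main__":
    # Raw, untampered records as they would be read back from storage.
    raw = [
        '{"header": {"ts": 1}, "body": {"transfer_status": "FAILED"}}',
        '{"header": {"ts": 2}, "body": {"transfer_status": "DONE"}}',
    ]
    print(count_failed_transfers(raw))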

3.3.1 Implementation

The data pipeline presented in this chapter uses the Dirq library, which offers a queue system that uses the underlying file system for storage and allows concurrent read and write operations [73]. It can therefore support a variety of heterogeneous applications and services writing messages while multiple readers read them simultaneously. The data pipeline was developed using the Hadoop native library; it reads messages from the Dirq queue and writes them into HDFS using an appending mechanism. The Hadoop software framework was originally designed as a write-once-read-many system [18], so appending was not available in the initial software release, but later versions (2.0 onwards) support this mechanism. Hadoop also works well with a few large files but is not as efficient when working with a large number of small files; the appending method is convenient because it allows a single large file to be created.
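The consumer loop below sketches this pattern in Python using the python-dirq package and the HdfsCLI client, rather than the Hadoop native library used in the thesis; the queue path, HDFS URL and target file are illustrative assumptions, and the exact client options should be checked against the library documentation.

from dirq.QueueSimple import QueueSimple
from hdfs import InsecureClient

QUEUE_PATH = "/var/spool/consumer"                    # hypothetical Dirq queue
TARGET = "/data/monitoring/raw/events.jsonl"          # single large HDFS file
client = InsecureClient("http://namenode:50070", user="monitoring")

def drain_queue_once():
    """Consume every pending message and append it, untransformed, to HDFS.

    Assumes TARGET already exists in HDFS; create it once beforehand.
    """
    queue = QueueSimple(QUEUE_PATH)
    name = queue.first()
    while name:
        if queue.lock(name):
            payload = queue.get(name)                 # raw message body
            client.write(TARGET, data=payload + "\n",
                         append=True, encoding="utf-8")
            queue.remove(name)
        name = queue.next()

if __name__ == "__main__":
    drain_queue_once()   # in the thesis this ran as a daemon every 5 minutes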

For the first approach, the data are consumed from the Dirq queue, transformed and written into HDFS. The implementation of the second approach is similar to the first, except that no transformation is carried out in the pipeline. Instead, it requires chained MapReduce jobs in a centralised Hadoop cluster to take the raw data that have not previously been processed, apply the appropriate data transformation, merge the transformed data with previously transformed data, delete the old transformed data, mark the raw data as processed, and merge and compress the raw data. An issue was encountered during testing of this second approach: data that had not yet been processed by the transformation job were not available for analytics. The third approach is again like the second in that no transformation is carried out in the data pipeline, but the transformation logic is implemented in a common library that can be imported into any analytics job, so the transformation is carried out as and when it is required.


Table 3.1: Summary of advantages and disadvantages of the proposed approaches.

Approach 1 - Data transformation occurs within the data pipeline.
Advantages: Well tested approach: the typical scenario in most data analytics platforms.
Disadvantages: Complex: the transformation logic is kept in the data pipeline, so in the case of data pipeline replacement the transformation logic needs to be re-implemented. Lost data authenticity: the data are transformed by the data pipeline, so the raw data are lost.

Approach 2 - Data transformation occurs within the storage layer.
Advantages: Easy to migrate/replace: the transformation logic is moved to a centralised location, so it is easier to migrate or replace the data pipeline. Raw data are intact: meets regulatory standards of storing the raw data both before and after transformation.
Disadvantages: Complex: an intermediate job is required for transformation. Large storage needed: both raw and transformed data are stored.

Approach 3 - Data transformation occurs within the analytics jobs.
Advantages: Clean and simple: no complexity added to the data pipeline. Less storage needed: only the raw data are stored. Easy to migrate or replace: the transformation logic is moved to a centralised location.
Disadvantages: Increased execution overhead: the analytics job will transform the data. Repetition: transformation will take place every time an analytics job is executed.

This approach does not have the data unavailability issue present in the other two approaches, as all written data will be picked up by the analytics jobs and the transformation will be done as and when required. All three approaches were implemented as daemons that ran continuously on the WLCG test infrastructure, checking for new data every 5 minutes.

In order to decrease the data aggregation delay in the data pipeline, and to evaluate how easy it is to migrate these approaches to a different data pipeline, Apache Flume was used. Apache Flume is a community-driven software solution that receives messages from the transport layer and writes them into HDFS. There are three ways to flush consumed data into HDFS: periodically based on the elapsed time, on the size of the data, or on the number of events [72].

As expected, the first approach was complex, as all the transformation logic was in the custom data pipeline and therefore had to be re-implemented in Apache Flume. The second and third approaches made the migration to Apache Flume extremely simple, as all the transformation logic was implemented within the storage layer. But, as noted before, the second approach added complexity to the storage layer, as it required a chain of actions for data transformation. The third approach was the simplest to implement, as no transformation was carried out on the Apache Flume side and no transformation was carried out in the storage layer, keeping the complexity low.

All three approaches did encounter a common problem: Apache Flume pushes the events but does not flush the file until the configured file roll time is met (e.g. every 1 hour), making the data unavailable for computation in between. While HDFS and the custom data pipeline support appending, Apache Flume does not. Therefore, the appending functionality was taken from the custom pipeline and implemented in Apache Flume as a custom library (see Algorithm 3.1). With this amendment, Apache Flume was able to write a single file and append to it, while analytics jobs were simultaneously able to read the data as they were being written into HDFS.


Algorithm 3.1 File appending algorithm for Apache Flume: adding a close and reopen at every push to get the required append behaviour. 1: procedure create-global-data-file-writer

2: declare a global DataFileWriter object

3: create a file in HDFS

4: initialise the file to global DataFileWriter

1: procedure consume-messages-and-sync-flush

2: create a temp DataFileWriter reflecting (reopening) the global DataFileWriter

3: consume all messages

4: append messages using temp DataFileWriter

5: close temp DataFileWriter WHEN messages <= 0

1: procedure roll-files

2: close the global DataFileWriter
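A minimal Java sketch of the close-and-reopen pattern behind Algorithm 3.1 is shown below, assuming Avro's DataFileWriter and HDFS append support; the class and method names are illustrative and do not reproduce the exact custom Flume library.

import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch of Algorithm 3.1: append Avro records to a single HDFS file by
 *  reopening a writer for every batch and closing it immediately afterwards,
 *  so that analytics jobs can read the flushed data while it is still growing. */
public class AppendingAvroSink {

    private final FileSystem fs;
    private final Path file;
    private final Schema schema;

    public AppendingAvroSink(Configuration conf, Path file, Schema schema) throws IOException {
        this.fs = FileSystem.get(conf);
        this.file = file;
        this.schema = schema;
        if (!fs.exists(file)) {                       // create-global-data-file-writer
            try (FSDataOutputStream out = fs.create(file);
                 DataFileWriter<GenericRecord> writer =
                         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, out);           // write the Avro header once
            }
        }
    }

    /** consume-messages-and-sync-flush: reopen, append the batch, close. */
    public void appendBatch(Configuration conf, List<GenericRecord> batch) throws IOException {
        if (batch.isEmpty()) {
            return;                                   // nothing to flush
        }
        try (FsInput in = new FsInput(file, conf);
             FSDataOutputStream out = fs.append(file);
             DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))
                             .appendTo(in, out)) {
            for (GenericRecord record : batch) {
                writer.append(record);
            }
        }                                             // closing makes the data visible to readers
    }
}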

3.4 Results and discussion

The three approaches developed for the collection, storage and analytics of Big Data described in this chapter were evaluated on the WLCG infrastructure, which provides the computing resources to store, distribute and analyse the 30 petabytes of data generated annually by the LHC and distributed to 170 computing centres around the world [74]. Furthermore, the current method used by the WLCG group for the collection and storage of data for analytics was evaluated to benchmark the new approaches.

It was very complicated to carry out performance measurements on the proposed approaches and the current approach, as they employ different methods for consuming, writing and transforming the data in each case. Therefore, in order to get a meaningful performance measurement, a full computation cycle was carried out, including: consuming messages, writing to HDFS and carrying out a simple analytics job on those data. The full cycle comprised three segments:


1. Data ingestion with data transformation and without data transformation.

2. Intermediate data transformation using a MapReduce job.

3. A simple statistical analytic computation using a MapReduce job and a PL/SQL procedure with and without data transformation.

The configurations of the current and proposed data pipelines in the WLCG are shown in Figure 3.3 (a) and (b) respectively. For both configurations, the monitoring events are pushed as JavaScript Object Notation (JSON) records through the STOMP protocol to the ActiveMQ message broker. However, the two data pipelines differ on the consumer side. The current configuration uses Python collectors for reading the monitoring events, transforming them and writing them into an Oracle storage database. On the other hand, the proposed configuration uses a custom data pipeline daemon, as explained in Section 3.3.1, that reads monitoring events and writes them into a Hadoop cluster. This configuration can be modified to support the three proposed approaches, i.e. to transform and serialise the messages into Avro format.


Figure 3.3: Configuration of current data pipeline in WLCG (a) and the configuration of the proposed data pipeline for WLCG (b).

In order to evaluate the proposed approaches, it was decided to push messages from the broker in batch sizes ranging from 10,000 to 100,000 messages. Data ingestion and analytics were conducted ten times for each batch of messages in order to capture an average performance time. The performance measurements were carried out on a heterogeneous Hadoop cluster that consisted of 15 nodes (8 nodes: 32 cores/64 GB, 7 nodes: 4 cores/8 GB).


3.4.1 Performance results of data ingestion with and without data trans- formation

The first approach had to consume all messages from Dirq, apply a simple data transformation, which involved taking the source and destination IP addresses from the message and using a topology mapping file to determine the corresponding domains and replace the IP addresses with them, and finally convert the data into Avro format, a data serialisation framework that serialises the data into a compact binary format, before writing the file into HDFS. As shown in Figure 3.4, this approach (pre-trans-avro) is slower than the second approach (raw-json), which simply reads the raw messages in JSON, an easy-to-read format, and writes them into HDFS. The second approach is the fastest compared with the first and third approaches (raw-avro), which read raw data, convert them into Avro format and write them into HDFS. The third approach was faster than the first approach because it does not do any transformation.
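As an illustration, the per-message transformation amounts to a lookup in the topology map before serialisation. The sketch below is a simplified stand-alone version with hypothetical field names and entries; the real topology map is loaded from the WLCG topology mapping file.

import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the IP-to-domain transformation (hypothetical names).
public class TopologyMapper {

    private final Map<String, String> ipToDomain = new HashMap<>();

    public TopologyMapper() {
        // In the real pipeline these entries come from the topology mapping file.
        ipToDomain.put("128.142.0.1", "cern.ch");
        ipToDomain.put("192.108.45.1", "gridka.de");
    }

    /** Replace source and destination IP addresses with their domain names. */
    public Map<String, String> transform(Map<String, String> message) {
        message.put("src_domain", ipToDomain.getOrDefault(message.remove("src_ip"), "n/a"));
        message.put("dst_domain", ipToDomain.getOrDefault(message.remove("dst_ip"), "n/a"));
        return message; // the transformed record is then serialised to Avro
    }
}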

The current approach (pyth-traditional-plsql) used by the WLCG is similar to the proposed first approach (pre-trans-avro), but the difference is that it uses the Python agent for collection and the Oracle database for storing the transformed data, so no serialisation is involved. Although the current approach is similar to the first of the three proposed approaches, its performance was slower than all three of the newly proposed approaches. This is due to the connection and communication limitations that occur between the database and the collectors.


Figure 3.4: Data ingestion from message queue to HDFS with and without data transformation.

Figure 3.5 shows data representing unprocessed messages from the broker, raw JSON messages, a pre-transformed Avro file and a raw Avro file written into HDFS by the custom data pipeline. The Avro files are smaller than the JSON file holding the same unprocessed data because they are serialised into a binary format. However, the pre-transformed Avro file is larger than the raw Avro file because transformation was applied.

Figure 3.5: Data size of the messages that were stored into HDFS with and without data trans- formation.


3.4.2 Performance results of intermediate data transformation using a MapReduce job

A test was designed to measure the performance of an intermediate MapReduce transformation done on a centralised Hadoop cluster. As shown in Figure 3.6, only the raw JSON data go through this transformation, as the pre-transformed Avro file has already been transformed at the data pipeline level and the raw Avro data will be transformed at analytics time, when required. Likewise, the data stored in the database by the Python agent do not require an intermediate transformation, as it has already been performed in the data pipeline. Transforming the data using an intermediate job is very expensive in terms of execution time, as the process is carried out by chained MapReduce jobs that transform, aggregate and merge the data. The majority of the execution time overhead was spent on finding resources and submitting the chained jobs to the Hadoop cluster.

Figure 3.6: Intermediate MapReduce job for data transformation. Only the raw JSON messages are transformed with the MapReduce job.

3.4.3 Performance results of a simple analytic computation with and without data transformation

The final step of the evaluation cycle was to carry out a simple computation on the 100,000 messages dataset and measure the performance. Two sets of analytics jobs were implemented to compute a summary view of the XRootD operations performed by the different users for each WLCG site belonging to the XRootD federation [74]. An analytics job was modified to include the data transformation prior to the computation. The modified job was executed on the raw Avro data. As shown in Figure 3.7, an extra execution time overhead was added to the modified analytics job when compared with the unmodified job that computed pre-transformed data, but the computation was seamless, as the MapReduce framework adopts a parallel programming model: the jobs are split into multiple tasks and sent to the data nodes where the data reside. The current approach used by the WLCG (pyth-traditional-plsql) for analytics was very slow compared with the proposed approaches due to the constraints imposed by the database being used and its lack of scalability.

Figure 3.7: Performance measurements of the statistics computation on the pre-transformed and raw 100,000 messages datasets.

3.4.4 Summary of the performance results

In order to understand which approach performed better, the execution times for the largest dataset of 100,000 messages were taken from Sections 3.4.1, 3.4.2 and 3.4.3 and the totals are presented in Table 3.2. It is clear that in this case writing the raw Avro data into HDFS and allowing the analytics to execute the transformation outperforms the other two proposed approaches. The slowest of the proposed approaches is the second approach, where there is an intermediate job for transformation. This is understandable as the transformation is carried out by chained MapReduce jobs, which add extra execution time overhead. The first approach is comparable in terms of performance to the third approach, but it is beneficial to keep a copy of the untampered raw data file in HDFS and let the analytics job do the transformation; this is better than carrying out the transformation in the data pipeline, as authenticity is lost once the transformed data are stored in HDFS. Although the current approach used by the WLCG employs the same pre-transformation approach, it performs inadequately compared with the new approaches presented in this chapter, primarily due to database communication and scalability constraints, as the current approach cannot handle the increasing data volume and workload.

Table 3.2: Total sum of execution time for 100,000 messages dataset from Sections 3.4.1, 3.4.2 and 3.4.3.

Data transformation              Section 3.4.1        Section 3.4.2        Section 3.4.3        Total execution
                                 execution time (s)   execution time (s)   execution time (s)   time (s)
pre-trans-avro-mr                100                  0                    43                   143
raw-json-2-pre-trans-avro-mr     83                   155                  44                   282
raw-avro-mr                      80                   0                    51                   131
pyth-traditional                 166                  0                    224                  390

3.4.5 Evaluation of Apache Flume

During the evaluation of all three proposed approaches there was still a 5 minute delay in polling data from the message queue. In order to eliminate this polling latency, custom-made Apache Flume data collectors (as explained in Section 3.3.1) that utilise an appending mechanism were put in place of the consumer shown in Figure 3.3 (b). The performance test results showed that the third approach is optimal; therefore, the Apache Flume agents were configured to consume messages and flush them into HDFS directly. Figure 3.8 shows spikes in the total number of messages propagated at a rate greater than 1 kHz, and it can be seen that Apache Flume seamlessly absorbs the load on its single virtual machine. Meanwhile, the current Python-Oracle based consumers used by the WLCG, running on two production virtual machines, were struggling to keep up, causing a backlog of messages stored in the broker.

Figure 3.8: Spikes of messages with a rate greater than 1 kHz. The red line is the messages received from the broker, green denotes the messages stored in old consumers, and blue denotes the messages stored in Apache Flume.

3.5 Summary

The proposed approaches for collecting and storing Big Data for analytics presented in this chapter show how important it is to select the correct model for efficient performance and technology migration. It is clear from the study that keeping the main logic in a centralised location will simplify technological and architectural migration. The performance test results show that eliminating any transformation at the data ingestion level and moving it to the analytics layer is beneficial: the overall process time is reduced, untampered raw data are kept at the storage level for fault-tolerance, and the required transformation can be done as and when required using a framework such as MapReduce. The presented results show that this proposed approach outperformed the approach employed at the WLCG; following this work the new approach was adopted by the WLCG and has been used for collecting, storing, and analysing metadata at CERN since April 2015 [63]. This approach can be easily applied to other use cases (e.g. in commercial businesses for collecting customer interest datasets) and is not restricted to scientific applications.

Chapter 4

Monitoring scientific infrastructure with the Lambda architecture

Monitoring computing activities in a scientific infrastructure, such as job processing, data access and transfers or site availability, requires the gathering of monitoring data from geographically-distributed sources and the processing of such information to extract the relevant value for scientists, computing teams and site operators. Traditional monitoring systems have proven to be a solid and reliable solution in the past for monitoring scientific infrastructure. Nevertheless, the current architecture, where relational database systems are used to store, to process and to serve monitoring data, has limitations in coping with the foreseen increase in the volume, the variety (e.g. new data-transfer protocols and new resource types, such as cloud computing) and the velocity of monitoring events. This chapter presents a new data store and analytics platform using the Lambda approach [19], which was evaluated with the WLCG activities monitoring use case on the WLCG infrastructure.

4.1 Introduction

In general, a traditional architecture is used for monitoring scientific infrastructure (e.g. the WLCG), which relies on an Oracle database to store, to process and to serve the monitoring data. Raw monitoring events are archived in structured tables for several years, and periodic PL/SQL jobs run at regular intervals (e.g. every 10 minutes) to transform the fresh raw data into summarised time-series statistics and feed them into dedicated tables, from where they are exposed to the web framework for user visualisation. For data intensive use cases, this approach has several limitations. Scalability is difficult to achieve, with PL/SQL execution times fluctuating from tens of seconds to minutes as a consequence of input rate spikes, affecting user interface latency. Advanced processing algorithms are complex to implement in PL/SQL within the dashboard's 10 minute time constraint, and reprocessing of the full raw data can take days or weeks. Moreover, the other components involved in data collection, pre-processing and insertion suffer from fragility and complexity, leading to higher maintenance and operational costs and an increased possibility of human error. Considering the foreseen increase in the volume, variety and velocity of monitoring events, data storage and processing technologies that scale horizontally (the capability to expand capacity by joining various hardware or software components so that they work as a single unit), and that have low latency and high efficiency by design, such as Hadoop, are suitable candidates for the evolution of the monitoring infrastructure. However, a careful evaluation of efficient approaches that can be employed to benefit from these Big Data technologies is required, as they work very differently to the traditional technology. The aim of the work presented in this chapter was to architect a new data store and analytics platform able to cope with the scalability, flexibility and fault-tolerance requirements foreseen in the near future of data monitoring applications.

4.2 The Lambda Architecture

In recent years, the challenge of handling a big volume of data has been taken on by many companies, particularly in the Internet domain, leading to a full paradigm shift in data archiving, processing and visualisation. A number of new technologies have appeared, each one targeting specific aspects of large-scale distributed data processing. All these technologies, such as batch computation systems and non-structured databases, can handle very large data volumes, but with serious trade-offs such as high latency.

The Lambda Architecture, presented by Marz [19], identified three main components needed to build a scalable and reliable data processing system:

• the batch layer, to store a steadily growing dataset providing the ability to compute arbitrary functions on it;

• the serving layer, to save the processed views, using indexing techniques to make them efficiently query-able;

• the real-time layer, able to perform analytics on fresh data with incremental algorithms to compensate for batch-processing latency.

4.2.1 Difference between common scientific use case and the classic Lambda use case

In the classic Lambda application each monitoring event only contributes to the most recent view (e.g. a web server user access for a web analytics application only affects the user count in the last time bin). For the scientific monitoring use case, this is not true. A monitoring event, such as a completed file transfer lasting several hours from site A to site B, contributes also to several time bins in the past, so that the information about the average traffic from site A to site B has to be updated accordingly with the new monitoring information. Without this initial hypothesis, the merging of batch and real-time processing becomes more complex.
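To make the difference concrete, a long-running transfer has to be split into every time bin it overlaps before the statistics can be updated. A simplified sketch of such a split, assuming 10-minute bins and hypothetical names, is given below.

import java.util.ArrayList;
import java.util.List;

// Simplified sketch: split one monitoring event into the 10-minute time bins it spans.
public final class TimeBins {

    private static final long BIN_MS = 10 * 60 * 1000L;

    /** Returns the start timestamp of every 10-minute bin overlapped by [startMs, endMs]. */
    public static List<Long> binsFor(long startMs, long endMs) {
        List<Long> bins = new ArrayList<>();
        for (long bin = (startMs / BIN_MS) * BIN_MS; bin <= endMs; bin += BIN_MS) {
            bins.add(bin);
        }
        return bins;
    }
}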

4.3 A new data store and analytics platform for monitoring scientific infrastructure

The new data store and analytics platform for monitoring is presented in Figure 4.1 and it builds on a number of existing technologies and tools. The scientific monitoring problem has been treated as a pure analytics scenario where the driving concepts, as per the Lambda principles, are to collect and to store the raw data, to minimise pre-processing, and to concentrate analysis and transformation in the same framework with batch and real-time components.


Figure 4.1: The new data analytics platform for monitoring a scientific infrastructure.

4.3.1 Data transport: Message Broker

The transport layer plays a key role in the new monitoring architecture. Firstly, it decouples the producer and the consumer of the monitoring data. Given that the information is produced by a variety of heterogeneous applications and services in the scientific infrastructure, this is a fundamental part of the system functionality. Secondly, it allows multiple consumers to use the same data via an on-demand publish/subscribe API, which is often required for monitoring data in a scientific environment. Thirdly, the architecture can rely on message brokers as they provide durable virtual queues for recovery in the case of failure. The broker technology used was ActiveMQ and the monitoring events were reported as JSON records via the STOMP protocol.

4.3.2 Data collection: Apache Flume

Apache Flume is used as the data collector agent. It receives monitoring events from the transport layer and creates HDFS files in the archive layer for later processing. It used the custom Flume data pipeline described in Chapter 3, providing better performance and reliability. Flume connects to the brokers using the standard JMS source and writes to the storage layer via the standard HDFS sink.


4.3.3 Batch processing: Apache Hadoop

Hadoop is a distributed processing framework which allows the computation of large data sets on computer clusters built from commodity hardware. Initially focused mainly on batch processing via MapReduce primitives [75], modern Hadoop supports multiple processing technologies (e.g. Apache Samza). MapReduce is the de-facto standard for batch processing and its computation paradigm fits extremely well with the Lambda approach. The batch processing was implemented as periodic MapReduce jobs running on the Hadoop infrastructure. The job algorithm was stateless and idempotent, with the full dataset that can contribute to the results (e.g. three days of data) being re-processed at each run. Job results were written into the serving layer (i.e. Elasticsearch) in the form of an index (equivalent to an RDBMS table) and documents (equivalent to RDBMS records), which were then used to build the web visualisation.

4.3.4 Archiving: HDFS

The Hadoop framework is built on the HDFS and executes I/O operations on it. HDFS is designed for large data files, in the order of GB in size. The data is broken into blocks and replicated across multiple hosts in the cluster. This guarantees scalability on commodity hardware, fault tolerance and high throughput. HDFS is data format independent, supporting multiple data representations, from simple text to structured binary.

4.3.5 The common data access service layer

A common drawback of the dual processing nature of the Lambda architecture is code duplication in the real-time and the batch processing layer. In order to limit this effect a Java data access service layer was developed to abstract the common functionalities. The data access service layer provides data parsing, supporting marshalling and un-marshalling in several formats, such as JSON, CSV and Avro, and also provides data validation and compression. Most importantly, it implements the algorithm to emit key-value pairs for each monitoring event received. The data access service layer played a major role in porting the jobs to different processing technologies (e.g. MapReduce, Spark) with minimal code change.
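A possible shape of such a layer is sketched below; the interface and type names are illustrative rather than the actual classes developed for the platform.

import java.util.List;
import java.util.Map;

// Illustrative sketch of a shared data-access service used by both the batch
// and the real-time layers (names are hypothetical).
public interface MonitoringEventService {

    enum Format { JSON, CSV, AVRO }

    /** Parse a raw payload into a generic event representation. */
    Map<String, Object> parse(byte[] payload, Format format);

    /** Validate a parsed event before it enters the processing layers. */
    boolean isValid(Map<String, Object> event);

    /** Emit one (timeBin|srcDomain|dstDomain, metrics) pair per time bin the event spans. */
    List<Map.Entry<String, long[]>> toKeyValuePairs(Map<String, Object> event);
}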


4.3.6 The serving layer: Elasticsearch

With the data archiving and processing delegated to the other components, the serving layer is solely responsible for serving the computed statistics to the web framework. In light of this simplified requirement, the serving layer can be easily implemented via non-relational technology. In the new data analytics platform the serving layer was implemented using Elasticsearch.

4.3.7 Real-time processing: Esper

The requirement for fast in-memory computation is not unique; it is needed in several fields such as financial analysis and wireless networks. Complex event processing (CEP) technologies were created in order to serve this need by processing streams of events at a high rate with low latency. The most widely adopted open source engine for this purpose is Esper, which is the main tool used for the development of the real-time layer. Esper, an open source in-memory event processing library [76], provides a streaming API for processing a continuous flow of events. Esper analyses data with the Event Processing Language (EPL), an SQL-like language (see Table 4.1) that offers much more advanced control over processing streams over time. In contrast to the relational database model, Esper stores the query, which is continuously executed; it also stores the results if necessary, but not the input data. Esper was used to compute statistics on fresh monitoring events via an incremental algorithm. Being incremental, and hence not idempotent, special care is required in handling event duplication and multiple processing, which leads to a more error-prone computation. Esper was incorporated in the existing workflow to allow real-time computation and visualisation of fresh data and to speed up the generation of all statistics.

4.4 Implementation of WLCG analytics on the new platform

In this section an implementation of a WLCG use case is presented.


Table 4.1: A mapping between Esper and the relational database model.

RDBMS          Esper
SQL            EPL
Rows           Events
Tables         Event Streams

4.4.1 WLCG data activities use case

The Experiment Dashboard (ED) [77] is a generic monitoring framework which provides uniform and customisable web-based interfaces for scientists and sites. Monitoring events, such as data transfers or data processing job reports, are collected and analysed to produce summary time-series plots used by operators and experts to evaluate WLCG computing activities. The WLCG Data acTivities (WDT) dashboards are a set of monitoring tools based on the ED framework which are used to monitor data access and transfer across WLCG sites via different protocols and services. Monitored services include the ATLAS Distributed Data Management (DDM) system, XRootD and HTTP federations and the File Transfer Service (FTS). The WDT use case is one of the most data intensive ED applications. Figure 4.2 presents the daily volume of monitoring information handled by WDT, with an overall average of more than 20 million daily monitoring events. Today, WDT dashboards are suffering from the limitations of the current processing infrastructure. For this reason, WDT was taken as a case study for the new analytics platform.

4.4.2 Implementation of the batch layer

The current WLCG monitoring architecture uses PL/SQL procedures for aggregating and computing raw data into statistics at different time granularities and stores them in a statistics table. WLCG data servers can produce monitoring logs at 1 kHz, a rate at which the PL/SQL procedure cannot cope with the volume of data: it takes over ten minutes to process every ten minutes' worth of data. The new analytics platform relies on Hadoop and its MapReduce framework, Elasticsearch and Esper to overcome the current latency and scalability issues. MapReduce is a programming paradigm that was designed to remove the complexity of processing data that are geographically scattered around a distributed infrastructure [75]. It hides the complexity of computing in parallel, load balancing and fault tolerance over an extensive range of interconnected machines. There are two simple parallel methods, map and reduce, which are predefined in the MapReduce programming model as user-specified methods and which are used to develop the analytics platform.

Figure 4.2: Daily volume of monitoring events from Federated ATLAS storage systems using XRootD (FAX), Anytime, Anywhere CMS storage systems using XRootD (AAA), ATLAS Distributed Data Management using Rucio (DDM rucio), ATLAS Distributed Data Management using Don Quijote (DDM DQ2) and File Transfer Service (FTS) for WDT dashboards [15].

Figure 4.3: MapReduce computations diagram.


Figure 4.3 shows how WDT MapReduce jobs are carried out by each component within the MapReduce framework; a simplified code sketch follows the list:

1. A splitter will split the monitoring data into lines and feed them into mappers.

2. A mapper will process each line, breaking it into the time bins to which it belongs and calculating the transfer metrics. Finally, it will emit key/value pairs for each time bin.

3. A combiner will run after each map task and aggregate the map output, decreasing the number of metrics sent to the reducer.

4. The output of the combiner is then shuffled and transferred to a reducer that is responsible for processing the key and carrying out the final summing.

5. A reducer will aggregate and output the final results.
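A simplified mapper and reducer for this flow, assuming the Hadoop MapReduce API and a hypothetical pre-split input where each line carries a source domain, a destination domain, a time bin and a byte count, might look as follows; it is a sketch of the pattern rather than the WDT job itself.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a WDT-style summary job: sum transferred bytes per (time bin, src, dst).
public class TransferSummary {

    public static class BinMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Hypothetical input layout: srcDomain,dstDomain,timeBin,bytes
            String[] fields = line.toString().split(",");
            Text key = new Text(fields[2] + "|" + fields[0] + "|" + fields[1]);
            context.write(key, new LongWritable(Long.parseLong(fields[3])));
        }
    }

    // The same class can be reused as the combiner to pre-aggregate map output.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable value : values) {
                total += value.get();
            }
            context.write(key, new LongWritable(total));
        }
    }
}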

4.4.3 Data representation

In the new architecture the data are partitioned in HDFS, as shown in Figure 4.4, for efficient processing, as this supports the processing of data by specified date ranges.

Figure 4.4: HDFS data partitioning.

Figure 4.5: Data format comparison (Avro versus CSV versus JSON for 1 day of FAX data).

Three different data formats were evaluated for storing WDT monitoring data on HDFS:

1. Avro is a data serialisation framework that serialises the data into a compact binary format, so that it can be transferred efficiently across the network.

2. Comma-Separated Value (CSV) is a table format that maps very well to the tabular data.


3. JavaScript Object Notation (JSON) is primarily a way to store simple object trees.

Figure 4.5 shows data representing one (average) day of monitoring events for the ATLAS XRootD federation (FAX) on HDFS, which occupies 1211 MB in Avro, 1367 MB in CSV and 2669 MB in JSON format. As expected, the Avro format is more compact than CSV and JSON. This is because the binary version of Avro is used, whereas CSV comprises human-readable comma-separated columns and JSON contains tree-structured data. The JSON format was the largest because it holds both column names and data, whereas the CSV format only holds the data separated by commas. This resulted in a 122.21% increase in volume for JSON data and a 12.88% increase in volume for CSV data compared with Avro, while there is a 96.85% increase in volume for JSON data compared with CSV. The data were also compressed using the Snappy compression library, which is very fast although its compression ratio is lower than that of other libraries. Again, compressed Avro data take up less storage than compressed CSV, which is 20.90% larger in volume. It can also be seen that the compressed data took up over five times less space than the uncompressed data. The test results, combined with the additional benefits of being schema-based and having multi-language support, make Avro the preferred option for the WDT use case.
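For reference, writing records as Snappy-compressed Avro only requires selecting the codec before the file is created. The snippet below is a minimal sketch with an illustrative schema and values, not the WDT schema.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Sketch: writing monitoring records as Snappy-compressed Avro (illustrative schema).
public class AvroSnappyWriter {

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Transfer\",\"fields\":["
          + "{\"name\":\"src_domain\",\"type\":\"string\"},"
          + "{\"name\":\"dst_domain\",\"type\":\"string\"},"
          + "{\"name\":\"bytes\",\"type\":\"long\"}]}");

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.snappyCodec()); // must be set before create()
            writer.create(schema, new File("transfers.avro"));

            GenericRecord record = new GenericData.Record(schema);
            record.put("src_domain", "praguelcg2");
            record.put("dst_domain", "n/a");
            record.put("bytes", 158387637L);
            writer.append(record);
        }
    }
}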

4.4.4 Implementation of the real-time layer

Architecture

The Esper layer architecture for monitoring scientific infrastructure is shown in Figure 4.6.


Figure 4.6: An overview of the system architecture with the Esper module [16].

JSON to POJO transformation

Data movement and job execution between different sites across collaborating countries are recorded in log messages [16]. These log messages store information such as the transaction start time, end time, source site, destination site, read bytes and write bytes, to name a few [16]. Log messages distributed by the message broker are in JSON form, as shown in Figure 4.6, which is not ready to be processed by Esper [16]. Therefore, some pre-processing is required before injecting them into the Esper engine [16]. The first step is to extract only the parts of the message that are needed and the second step is to transform the JSON message into a Plain Old Java Object (POJO) [16].

EPL processing

This is the module where the implemented EPL statements will continuously run. A listener is invoked periodically in order to check for incoming events and process them according to the EPL statements [16]. The implemented EPL procedures follow a MapReduce approach [16].


Figure 4.7: An example of the Map-Reduce approach on EPL statements implementation [16].

Figure 4.7 shows an example of the Map-Reduce implementation. Three different events, which represent transfers from the CERN site to two other sites (Karlsruhe Grid computing centre (FZK) and Institut national de physique nucleaire et de physique des particules (IN2P3) respectively), are injected into the Map statement [16]. The Map statement splits the incoming events into smaller pieces according to the time bins that they belong to [16]. For example, the first event belongs to the time bins 18:10, 18:20, 18:30, thus it is split into three different events, which are then injected into the Reduce statement, which aggregates them into the final results [16].
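A minimal sketch of this wiring, assuming the Esper 5.x API and illustrative event and field names, is shown below; the production EPL statements are considerably richer than this aggregation example.

import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;

// Sketch of the Map/Reduce-style EPL processing (Esper 5.x API, illustrative names).
public class TransferStatisticsEsper {

    // Simple POJO representing one already-split, per-time-bin event (the "Map" output).
    public static class LogMapEvent {
        private final String srcDomain, dstDomain;
        private final long timeBin, bytes;

        public LogMapEvent(String src, String dst, long timeBin, long bytes) {
            this.srcDomain = src; this.dstDomain = dst; this.timeBin = timeBin; this.bytes = bytes;
        }
        public String getSrcDomain() { return srcDomain; }
        public String getDstDomain() { return dstDomain; }
        public long getTimeBin()     { return timeBin; }
        public long getBytes()       { return bytes; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("LogMapEvent", LogMapEvent.class);
        EPServiceProvider esper = EPServiceProviderManager.getDefaultProvider(config);

        // "Reduce" statement: aggregate the per-time-bin events every 10 seconds.
        String epl = "select srcDomain, dstDomain, timeBin, sum(bytes) as bytes, count(*) as active "
                   + "from LogMapEvent#time_batch(10 sec) "
                   + "group by srcDomain, dstDomain, timeBin";
        EPStatement reduce = esper.getEPAdministrator().createEPL(epl);
        reduce.addListener((EventBean[] newEvents, EventBean[] oldEvents) -> {
            // Push the aggregated statistics to the serving layer here.
        });

        // The "Map" step (splitting a log message into per-time-bin events) is done
        // in plain Java before the events are injected into the engine.
        esper.getEPRuntime().sendEvent(new LogMapEvent("cern.ch", "gridka.de", 1810L, 4096L));
    }
}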

Example of a Transfer Statistics EPL statement

1. Inject Log Message event into Map statement.

2. Split the Log Message into several Log Map Events according to the time bins to which the initial event belongs.

3. Inject each of the Log Map Events into a Single Log Statistic Event and compute the following:

(a) If write bytes at close > 0 then we have a client domain else we have a server domain and set it as srcDomain.

(b) If read bytes at close > 0 then we have a server domain else we have a client domain and set it as dstDomain.

(c) If client domain = server domain then set remote access = 0 else set it as 1.

(d) If write bytes at close + read bytes at close = file size then set is transfer = 1 else is transfer = 0.


(e) If read bytes at close > 0 then set activity='r'.

(f) If write bytes at close > 0 then set activity='w'.

(g) If write bytes at close <= 0 and read bytes at close <= 0 then set activity='u'.

4. Aggregate all the single log statistics:

(a) If there is not already a time bin for the injected Single log statistic event then create it and insert:

i. srcDomain

ii. dstDomain

iii. isRemoteAccess

iv. usrProtocol

v. isTransfer

vi. Activity

vii. periodEndTime

viii. active

ix. bytes

x. activeTime

xi. updateTime

(b) Else update the existing bin:

i. active=active+newSingleLogStatistic.active

ii. bytes=bytes+newSingleLogStatistic.bytes

iii. activeTime=activeTime+newSingleLogStatistic.activeTime

4.4.5 Implementation of the serving layer

The serving layer is used purely for storing and serving statistics. The documents (records) are stored in Elasticsearch using the bulk API and the upsert operation for efficient insertion and merging of records. The merging of computed batch and real-time results is done on the client side to serve the computed statistics on the UI. An example of a statistics document inserted into Elasticsearch is shown below:


{ " _index ":"dashboard_xrootd_2015", " _type ":"transfer", " _id ":"ATLAS|praguelcg2|n-a|1|0|r|1433251800000", " _score ":1, "_source": { "vo":"ATLAS", "src_domain":"praguelcg2", "dst_domain":"n/a", "is_remote_access":1, "is_transfer":0, "activity":"r", "period_end_time":"2015-06-02T09:30:00-04:00", " active ":1, "finished":1, " bytes ": 158387637, "active_time": 34, "update_time":"2015-07-13T11:41:06.762-04:00" } }

4.5 Performance results for WDT computation on the new platform

The Lambda approach was evaluated and used to compare the performance of different data types, data compression and data sizes with various cluster configurations, such as scaling the nodes horizontally and parallelising the tasks. The performance was evaluated on monitoring events collected from the ATLAS FAX (which clusters Tier-1, Tier-2 and Tier-3 storage resources together into a common namespace and allows remote accessibility from geographically separated sites), from the CMS implementation of a generic XRootD (a protocol for accessing data) federation, Any Data, Anytime, Anywhere (AAA), and from EOS, an XRootD managed disk pool. The data contained information about dataset movement (e.g. transferred bytes).

4.5.1 Experiment setup

The performance measurements were carried out on a shared heterogeneous Hadoop cluster provided by the WLCG, which consisted of 15 nodes of Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (8 nodes: 32 cores/64 GB, 7 nodes: 4 cores/8 GB). The analysis of the current architecture was carried out on a high-performance physical setup that was a shared service between all IT applications at the WLCG, whereas the Lambda approach was run on the 15 node cluster. Hadoop version 2.6 was configured on all machines, one of which was the Name Node while the rest were Data Nodes (the Name Node was also used as a Data Node). The HDFS data block size was set to 256 MB and the replication level of a given data block was set to 3. The transfer statistics computation described in the implementation section is a CPU, memory and I/O intensive use case; therefore, it was ideal for this evaluation. It should be noted that the cluster was being used by other members while the evaluation was carried out, as it is a shared cluster, as can be seen in Figure 4.8.

Figure 4.8: WLCG Hadoop cluster workload [15].


Metrics calculation

In this section the metrics used for the evaluation are explained.

The calculation of the percentage of time spent in Garbage Collection (GC) is shown in Equation 4.1, where $P_{gc}$ was the calculated percentage of time spent in GC, $T_{gc}$ was the time (ms) spent in GC, $T_{mappers}$ was the time (ms) utilised by all map tasks and $T_{reducers}$ was the time utilised by all reduce tasks:

\[ P_{gc} = \frac{T_{gc}}{T_{mappers} + T_{reducers}} \times 100 \tag{4.1} \]

Calculating the memory usage of the tasks from the allocated memory was complicated. Therefore, the steps below were taken.

First, the total number of tasks allocated to a job, $C_{tasks}$, was calculated, as shown in Equation 4.2, where $C_{total}^{mappers}$ was the total number of allocated map tasks and $C_{total}^{reducers}$ was the total number of allocated reduce tasks:

\[ C_{tasks} = C_{total}^{mappers} + C_{total}^{reducers} \tag{4.2} \]

Then the average task memory usage, $M_{avg}^{tasks}$, was calculated as shown in Equation 4.3, where $M_{physical}$ was the RAM (in bytes) consumed by all the tasks, and $C_{tasks}$ was multiplied by $1024 \times 1024$ to obtain the memory in MB:

\[ M_{avg}^{tasks} = \frac{M_{physical}}{C_{tasks} \times 1024 \times 1024} \tag{4.3} \]

The used memory in time, $X_{total}^{used}$, was calculated using Equation 4.4, where $T_{mappers}$ was the time (ms) utilised by all map tasks, $T_{reducers}$ was the time utilised by all reduce tasks and $M_{avg}^{tasks}$ was the average task memory from Equation 4.3:

\[ X_{total}^{used} = (T_{mappers} + T_{reducers}) \times M_{avg}^{tasks} \tag{4.4} \]

The total allocated memory in time, $X_{total}^{allocated}$, was calculated using Equation 4.5, where $X_{maps}$ was the memory time of RAM (MB $\times$ ms) allocated to map tasks and $X_{reducers}$ was the memory time of RAM (MB $\times$ ms) allocated to reduce tasks:

\[ X_{total}^{allocated} = X_{maps} + X_{reducers} \tag{4.5} \]

Finally, the percentage of the memory allocation that was used was calculated as shown in Equation 4.6:

\[ P_{used} = \frac{X_{total}^{used}}{X_{total}^{allocated}} \times 100 \tag{4.6} \]

The CPU time used versus the allocated time was calculated using Equation 4.7, where $T_{used}^{cpu}$ was the CPU time used relative to the allocation, $T_{total}^{cpu}$ was the total CPU time used by all tasks, $T_{mappers}^{cpu}$ was the total virtual core time allocated for map tasks and $T_{reducers}^{cpu}$ was the total virtual core time allocated for reduce tasks:

\[ T_{used}^{cpu} = \frac{T_{total}^{cpu}}{T_{mappers}^{cpu} + T_{reducers}^{cpu}} \tag{4.7} \]
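A small worked helper for these metrics, written directly from Equations 4.1 to 4.7, is given below as a sketch; the counter values would come from the Hadoop job counters, and the method names are illustrative.

// Sketch: evaluating the metrics of Equations 4.1 to 4.7 from Hadoop job counters.
public final class JobMetrics {

    /** Equation 4.1: percentage of task time spent in GC. */
    public static double gcPercentage(long gcMs, long mapperMs, long reducerMs) {
        return (double) gcMs / (mapperMs + reducerMs) * 100.0;
    }

    /** Equations 4.2 and 4.3: average memory (MB) used per allocated task. */
    public static double avgTaskMemoryMb(long physicalBytes, long mapperTasks, long reducerTasks) {
        long tasks = mapperTasks + reducerTasks;                        // Eq. 4.2
        return (double) physicalBytes / (tasks * 1024L * 1024L);        // Eq. 4.3
    }

    /** Equations 4.4 to 4.6: percentage of the allocated memory-time actually used. */
    public static double memoryUsedPercentage(long mapperMs, long reducerMs, double avgTaskMemoryMb,
                                              long allocatedMapMbMs, long allocatedReduceMbMs) {
        double used = (mapperMs + reducerMs) * avgTaskMemoryMb;         // Eq. 4.4
        double allocated = allocatedMapMbMs + allocatedReduceMbMs;      // Eq. 4.5
        return used / allocated * 100.0;                                // Eq. 4.6
    }

    /** Equation 4.7: ratio of CPU time used to virtual-core time allocated. */
    public static double cpuUsedRatio(long cpuMsTotal, long vcoreMsMaps, long vcoreMsReduces) {
        return (double) cpuMsTotal / (vcoreMsMaps + vcoreMsReduces);
    }
}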

4.5.2 The performance of batch computations with scaling dataset

In order to evaluate the execution time and the RAM, CPU, disk and network utilisation of the job on the WLCG platform using the Lambda approach, it was decided to use the dataset from November 2015, ranging from 1 day to 30 days of data, incremented at each test to scale the dataset. Jobs were submitted five times for each dataset in order to capture an average performance time.

Analysis of uncompressed dataset

The main computation result, as presented in the plot in Figure 4.9, demonstrates that the jobs were able to successfully process all data ranges in at most a few minutes. This result alone satisfies the requirement of a monitoring system, in particular for the WDT use case: 24 hours of data can be processed in under a minute when the Avro or CSV data format is used, a performance the traditional architecture cannot match. Moreover, it demonstrates the scalability of the system over different data volumes, as performance remains relatively stable once the dataset grows beyond six days. It should be noted that the execution time on the 30-day dataset is only about a factor of 2 higher than on the one-day dataset, despite the input being many times larger. This demonstrates the scalability of the Hadoop system, which distributes the computation across several nodes. Nevertheless, there is always a fixed overhead added to each job for finding appropriate resources and submitting the tasks.

Avro and CSV jobs were processed faster than JSON jobs, with the Avro format processed fastest because the data are serialised in a binary format and are therefore quicker to read. Although the JSON data were processed the slowest, the execution time was still better than that of the current architecture used by the WLCG group for monitoring.

Figure 4.9: Computation of uncompressed Avro, CSV and JSON files over different date ranges. The primary axis (a) shows the execution time, represented by lines, whereas the secondary axis (b) shows the input data size in MB, represented by bars.


Figure 4.10 illustrates the memory allocation and the memory usage of the jobs. Figure 4.10 (a) shows the memory allocated to tasks for each job (days). The resource manager allocated in total an average of 307% more memory for JSON compared with Avro. However, Avro jobs on average were allocated 5% more memory than CSV, which is not significant compared with JSON jobs. It is very important to understand how the memory is used in an infrastructure by an application, because it plays a huge role in efficient and effective computation. Over-allocated memory means the resource manager does not allocate as many tasks onto the nodes as they can handle; hence, the cluster is underutilised. Figure 4.10 (b) shows that JSON used 294% more memory than Avro and Avro used 5% more memory than CSV. On the other hand, JSON used 45% less memory than was allocated, and both Avro and CSV used 44% less memory than was allocated by the resource manager, as can be seen in the stacked plot shown in Figure 4.10 (c). All jobs used a lot less memory than was allocated, but what is clear is that the JSON job requires more memory compared with the other two jobs.


(a) The plot shows the allocated memory. (b) The plot represents the used memory.

(c) The plot represents the stacked up used memory versus allocated memory.

Figure 4.10: Memory allocated/used for computing Avro, CSV and JSON files over scaling dataset.

The average memory allocated to each task for all jobs was very consistent, at between 500 MB and 600 MB, as seen in Figure 4.11. The JSON tasks were allocated less memory on average when compared with the other two jobs, but this can be explained by the number of tasks allocated for the JSON job: 4 times more tasks were allocated, so the memory is spread out. This also explains why, overall, a lot of memory is allocated to the JSON job, as shown in Figure 4.10. The resource manager assigns a task to each HDFS data block; each block is 256 MB, so more data mean more blocks, which results in more tasks, as is the case with the JSON job.

Figure 4.11: Average memory allocated for each task for computing Avro, CSV and JSON files over the scaling dataset. The primary axis (a) shows the average allocated memory in MB, represented by lines, whereas the secondary axis (b) shows the number of allocated tasks, represented by bars.

In contrast to over-allocated memory, under-allocated memory causes issues such as Out Of Memory (OOM) errors, although this failure is usually recoverable (supported by fault-tolerance mechanisms, which do, however, have a limit on the number of retries). The second, more serious, issue caused by a lack of memory is GC. When a Java application approaches full heap (the runtime data storage area in memory) utilisation, the Java Virtual Machine (JVM) will often run a full GC, which blocks the other tasks in order to use the CPU (it is a CPU-intensive task). Figure 4.12 shows the percentage of memory used for GC out of the total memory used by the Avro, JSON and CSV jobs respectively. All the jobs used a tiny portion of the memory for GC. However, it appears that the Avro job used more memory for garbage collection than the other two, which may be related to cleaning up the memory footprint left by the serialisation.


Figure 4.12: The percentage of memory used for GC out of the overall memory used by the tasks. The primary axis (a) shows the percentage of allocated memory for the job, represented by lines, whereas the secondary axis (b) shows the percentage of memory used for GC, represented by stacked bars.

While the resource manager strongly enforces memory limits, the CPU limit is not strictly enforced, which is useful in maximising total CPU utilisation. The GC process is multi-threaded, and tasks requiring fewer cores could end up using more cores due to the dynamic method used in allocating them; therefore, a rogue job can degrade the performance of every other application on the cluster. Figure 4.13 (a) illustrates that in general more CPU time is used than allocated. Again, JSON was allocated 319% more CPU time than Avro, while Avro was allocated 5% more CPU time than CSV. Figure 4.13 (b) shows that on average Avro used 140% of the allocated CPU time, CSV used 130% and JSON used 120%. However, JSON used 294% more CPU time than Avro, which in turn used 16% more CPU time than CSV.


(a) The plot represents the stacked up used and allocated CPU time. (b) The plot represents the used/allocated CPU time.

Figure 4.13: CPU time used/allocated for computing Avro, CSV and JSON files over 30 day dataset.

Even though the job is split into multiple map tasks and sent to data nodes where the data reside in order to reduce the movement of large data files over the network, the intermediate results of these tasks still need to be shuffled and shifted around to reducers (most likely to different nodes). Figure 4.14 indicates how much of the intermediate results (MB) from the mappers were transferred to the reducers. JSON appears to have shuffled more data. A greater data shuffle will make the job go slower as the shuffle process utilises network connection bandwidth in order to transfer the intermediate results from mapper nodes to reducer nodes.


Figure 4.14: Shuffled intermediate results from mappers to reducer nodes.

Analysis of compressed dataset

Data compressed using the Snappy compressor were evaluated against their uncompressed counterparts, as shown in Figure 4.15. The test was identical to the one carried out previously on the uncompressed data. In general, the computation of compressed data was slower than that of the uncompressed data. This is understandable, as the data need to be uncompressed before processing, which adds overhead to the computation time. Although the uncompressed Avro and CSV jobs were fast, CSV appears to be the fastest of the compressed formats; it is possible that processing serialised and compressed data adds slightly more overhead than parsing a compressed CSV dataset. Processing the compressed JSON dataset took on average 94% more execution time than the uncompressed JSON. In contrast, the compressed Avro dataset took 21% more execution time and the compressed CSV took 5% more execution time than the corresponding uncompressed Avro and CSV respectively. CSV therefore has the smallest deviation between compressed and uncompressed of the three jobs.


Figure 4.15: Computation of compressed (Snappy) Avro, CSV and JSON files over scaling dataset. The primary axis (a) shows the execution time represented by lines, whereas the secondary axis (b) represents the input data size in MB represented by bars.

The resource manager has allocated less memory for computing the compressed data than the uncompressed data. It has allocated 2% less memory for Avro, 26% less memory for JSON and 4% less memory for CSV for computing the compressed dataset when compared with the uncompressed dataset as shown in Figure 4.16 (a). It has used 2% less memory for Avro, 18% less memory for JSON and 4% less memory for CSV for computing the compressed dataset when compared with the uncompressed dataset as shown in Figure 4.16 (b). Figure 4.16 (c) represents stacked used versus allocated memory for the compressed dataset and it can be observed that less memory is used than allocated, just like the uncompressed dataset. Overall less memory was allocated and used for computing a compressed dataset compared with the corresponding uncompressed dataset.


(a) The plot shows the allocated memory. (b) The plot represents the used memory.

(c) The plot represents the stacked up used memory versus allocated memory.

Figure 4.16: Memory allocated and memory used for computing compressed (Snappy) Avro, CSV and JSON files over scaling dataset.

On average, all three jobs were allocated a comparable amount of memory per task, as can be seen in Figure 4.17. However, the JSON job was allocated 12% more memory per task, but 286% fewer tasks, for the compressed dataset than for the uncompressed dataset, which is due to the decrease in the data size; hence the average memory balances out. The number of tasks was identical for all three jobs because a task was allocated to each partition (i.e. each day) of the dataset, which, being compressed, is smaller than the default 256 MB block size. Therefore, all three jobs had an identical number of tasks, allocated on smaller-sized partitions for processing.


Figure 4.17: Average memory allocated for each task for computing Avro, CSV and JSON files over the scaling dataset. The primary axis (a) shows the average allocated memory in MB, represented by lines, whereas the secondary axis (b) shows the number of allocated tasks, represented by bars.

The compressed-data tasks for Avro, JSON and CSV used a very tiny portion of the used memory for GC, which is consistent with the uncompressed tasks, as can be seen in Figure 4.18.


Figure 4.18: The percentage of memory used for GC out of the overall memory used by the tasks when computing the compressed (Snappy) dataset. The primary axis (a) shows the percentage of allocated memory, represented by lines, whereas the secondary axis (b) shows the percentage of memory used for GC, represented by stacked bars.

On average, the uncompressed and compressed jobs were allocated, and used, a comparable amount of CPU time, with the exception of CSV, which appeared to use slightly more CPU time, as can be seen in Figure 4.19 (a) and (b).


(a) The plot represents the stacked up used and allocated CPU time. (b) The plot represents the used/allocated CPU time.

Figure 4.19: CPU time used/allocated for computing compressed (Snappy) Avro, CSV and JSON files over scaling dataset.

Figure 4.20 shows the intermediate data transferred from the mappers to the reducer nodes. The compressed Avro and CSV jobs shuffled the same amount of intermediate results to the reducers, but JSON decreased the data shuffle by 23%, which is due to the decrease in mapper tasks as a result of the smaller input dataset.

Figure 4.20: Shuffled intermediate results from mappers to reducer nodes


4.5.3 The performance of batch computations with scaling nodes

In order to evaluate the scalability of the architecture, it was necessary to see whether the workload was balanced out as the number of worker nodes was increased. However, there were difficulties in evaluating the scalability on the WLCG cluster due to a lack of administrator privileges, as this task requires decommissioning nodes and activating them incrementally. Therefore, this evaluation was conducted on the Azureus Cloud infrastructure. Each node was configured to run Ubuntu version 14, with 13.5 GB of memory and a 4 core CPU, which is less powerful than the WLCG cluster. However, there was no other workload on the cluster, unlike the shared cluster provided by the WLCG. Nodes were activated incrementally, one at each evaluation, starting from a single node cluster. The evaluation was carried out on the same number of monitoring events as used on the WLCG cluster (3.3 million events), but the file size varied due to the difference in data format. Figure 4.21 shows that, in general, the execution time was high when there was a lower number of nodes but the performance improved as the number of nodes was increased. The execution time became stable after a certain point, as the resource manager did not allocate the job to all the nodes because doing so would break the primary Hadoop principle of sending the computation to where the data reside (data locality) by moving data to the computing nodes. The performance would have degraded if the number of nodes had been increased further and the resource manager had decided to use those nodes, as it would have had to move large datasets over the network. The Avro and CSV jobs did not see a huge improvement in performance despite the increasing number of nodes; however, they performed much better when compared with the current architecture used by the WLCG. The JSON job shows a large improvement as the number of nodes was increased, which is due to its large file size being distributed across the computing nodes.


Figure 4.21: Execution time versus the number of worker nodes.

4.5.4 The performance of batch computations with parallelisation

An evaluation of the parallelisation available using the Lambda approach was carried out on the reducer tasks to understand how it impacts performance. The evaluation was carried out on a 5 GB dataset for each data type (Avro, CSV and JSON), but the number of monitoring events in the dataset varied for each data type. The parallelisation of the reducer was incremented by two at each evaluation and the results can be seen in Figure 4.22. Initially, the performance improved as the parallelisation was increased, but it gradually degraded due to the large amount of movement of intermediate results from the mapper tasks to the reducer tasks. Additional time was also added because the resource manager needed to find appropriate resources as the parallelisation was increased. In any case, Avro and CSV performed better, as their intermediate data size was a lot smaller than that of JSON, so it was quicker to transfer the data to the reducers.


Figure 4.22: Evaluation of parallelisation of reducer tasks when computing 5 GB dataset. Exe- cution time versus the number of reducers.

Figure 4.23 shows the result of the parallelisation evaluation, which is the same as the above test but this time evaluated on the same dataset (4 million monitoring events) for each data type (i.e. Avro, CSV and JSON), so the dataset size varied due to the different data formats. Again, the number of reducers was incremented by two at each evaluation. There was a large difference in execution time between JSON and the other two data types even though they were computing on the same dataset. Regarding the performance of parallelisation, there was a minor improvement at the beginning but the performance degraded as the partitions were increased further. In order to minimise moving large datasets over the network, the intermediate results were compressed, which improved the performance slightly.


Figure 4.23: Evaluation of parallelisation of reducer tasks when computing 4 million monitoring events. Execution time versus the number of reducers.

4.5.5 The performance of the serving layer

The Elasticsearch serving layer was evaluated in the Azureus Cloud infrastructure. The layer was configured with three nodes and a replication factor of two. Each node was configured to run Ubuntu version 14, with 13.5 GB of memory and a 4-core CPU. The Elasticsearch-Hadoop library was used for reading from, and creating indices and documents in, the Elasticsearch serving layer. A small test was performed to see how long it would take to query and create 500,000 documents (records) in the serving layer, the results of which can be seen in Table 4.2. Better performance was achieved with Elasticsearch than with the current Oracle storage (a 4-core CPU and 32 GB of memory).

Table 4.2: Performance of querying and creating documents (records) in Elasticsearch and Oracle

Serving layer     No. documents (records)     Query time (s)     Create time (s)
Elasticsearch     500,000                     18                 47
Oracle            500,000                     93                 485
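
For illustration, the sketch below shows the Elasticsearch-Hadoop connector being driven from Spark in Scala; the node address, index name, and document fields are placeholders rather than the actual WDT schema, and the thesis prototype used the Hadoop bindings of the same library.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs / esRDD to RDDs and the SparkContext

// Minimal sketch: write computed statistics into an Elasticsearch index and read them back.
val conf = new SparkConf()
  .setAppName("serving-layer-sketch")
  .set("es.nodes", "es-node-1:9200")          // placeholder cluster address
  .set("es.index.auto.create", "true")
val sc = new SparkContext(conf)

// Each Map becomes one document in the "wdt/transfer-stats" index/type (placeholder name).
val stats = sc.parallelize(Seq(
  Map("src_site" -> "SITE_A", "dst_site" -> "SITE_B", "bytes" -> 1024L)))
stats.saveToEs("wdt/transfer-stats")

// Query documents back as an RDD of (id, fields) pairs.
val docs = sc.esRDD("wdt/transfer-stats", "?q=src_site:SITE_A")
println(docs.count())
```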


4.5.6 The performance of the real-time processing

The streaming version of the transfer statistics algorithm described in the implementation section was tested for throughput, i.e. the number of input events processed per second by the Esper real-time layer. A single Esper node was configured to read events from the JMS. The node was configured with 8 GB of memory and a 4-core CPU. Several datasets of test messages were created and injected into the Esper engine. The throughput results obtained from evaluating the real-time layer are shown in Table 4.3. The current architecture does not support real-time computation, so the result observed from Esper cannot be compared with it. However, the Esper performance is adequate for the WDT use case: it was estimated that the message queues propagate on average 250 events per second, whereas the real-time layer was able to process ∼600 events per second (simulated).

Table 4.3: Throughput results obtained from evaluating the real-time layer.

Events (throughput)     Processed per second (s)
∼600                    1
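
For illustration, a continuous query of the kind used by this real-time layer can be expressed in Esper's EPL. The sketch below (Scala over Esper's Java client API; the TransferEvent class, its fields, and the statement itself are assumptions, not the exact WDT statements) registers a one-second aggregation over streamed transfer events.

```scala
import com.espertech.esper.client.{Configuration, EPServiceProviderManager, EventBean, UpdateListener}
import scala.beans.BeanProperty

// Hypothetical event type carrying one monitoring message.
case class TransferEvent(@BeanProperty srcSite: String,
                         @BeanProperty dstSite: String,
                         @BeanProperty bytes: Long)

val config = new Configuration()
config.addEventType("TransferEvent", classOf[TransferEvent])
val engine = EPServiceProviderManager.getDefaultProvider(config)

// Aggregate transferred bytes per site pair over 1-second batches (EPL).
val epl = "select srcSite, dstSite, sum(bytes) as totalBytes " +
          "from TransferEvent.win:time_batch(1 sec) group by srcSite, dstSite"
val statement = engine.getEPAdministrator.createEPL(epl)
statement.addListener(new UpdateListener {
  override def update(newEvents: Array[EventBean], oldEvents: Array[EventBean]): Unit =
    Option(newEvents).getOrElse(Array.empty[EventBean]).foreach(e =>
      println(s"${e.get("srcSite")} -> ${e.get("dstSite")}: ${e.get("totalBytes")} bytes"))
})

// Events read from the JMS broker would be injected like this:
engine.getEPRuntime.sendEvent(TransferEvent("SITE_A", "SITE_B", 2048L))
```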

4.6 Summary

The new data store and analytics platform presented in this chapter has been shown to be a scalable, low-latency and effective solution. The serving layer (Elasticsearch) and the real-time layer also performed well in supporting low-latency monitoring of the infrastructure, which was not viable with the existing system used by the WLCG. The advantage of using Esper is that it can carry out incremental updates, providing low-latency statistics computation when compared with full re-computation, which causes high latency. Esper by design is a single-node engine, something that raises issues regarding the scalability and potential application of this solution. Nevertheless, Esper is designed to process thousands of events per second, and in this specific use case it processed 600 events per second (simulated), whereas in a real-world scenario this use case receives on average 250 events per second. The Esper processor did, however, become very sluggish when several thousand messages were pushed in at one time.

Chapter 5

Optimised Lambda Architecture using Apache Spark technology

Within scientific infrastructures scientists execute millions of computational jobs daily, resulting in the movement of petabytes of data over the heterogeneous infrastructure. Monitoring the computing and user activities over such a complex infrastructure is incredibly demanding. Whereas present solutions are traditionally based on the Oracle RDBMS for data storage and processing, recent developments evaluate the LA, as shown in Chapter 4. In particular, these studies evaluated data storage and batch processing for large-scale monitoring datasets using Hadoop and its MapReduce framework. Although the LA performed better than the RDBMS-based architecture, it was fairly complex to implement and maintain. This chapter presents an Optimised Lambda Architecture (OLA) using the Apache Spark ecosystem, which involves modelling an efficient way of joining batch computation and real-time computation transparently without adding complexity. A few models were explored: pure streaming, pure batch computation, and the combination of both batch and streaming. An evaluation of the OLA on the CERN IT on-premises Hadoop cluster and on the public Amazon cloud infrastructure for the WDT monitoring use case is presented, demonstrating how the new architecture can offer benefits by combining both batch and real-time processing to compensate for batch-processing latency.


5.1 Introduction

Monitoring a scientific experiment requires the gathering of a large volume of data that is produced at a rapid rate. This is illustrated in Figure 5.1, which shows the dataset size produced over several days.

Scientific infrastructures can be highly distributed and heterogeneous platforms with various middleware characteristics, job submission and execution tools, and diverse methods of transferring and accessing datasets. The high computational activity and distributed nature of such infrastructures make the system extremely complex. Efficient monitoring is necessary in order to recognise and resolve any potential issues within the infrastructure that may cause failures or inefficiencies. This is also an important determinant in the overall effective utilisation of the resources.

Figure 5.1: WDT dataset size [14].

There are already a few architectures that have been introduced to support Big Data, as reviewed in Chapter 2. However, the main objective of the scientific use case is the need to process an arbitrary set of historical data, and to handle recomputation or an old backlog of data injected by producers. In the scientific realm it is normal to have jobs running for a long period of time, so old backlog injection is very common when a long-running job is completed. It is therefore necessary to have both a batch layer and a streaming (real-time) layer, as presented in Chapter 4. However, this requires better mechanisms to simplify the system. In this chapter an OLA is presented and evaluated.


5.2 Background

Monitoring events, metadata about the jobs, and information on data transfers are collected and analysed to produce summary plots used by operators and experts to evaluate computing activities [77]. Due to the high volume and velocity of the events that are produced, the traditional methods are not optimal. Therefore, the LA [19], an architecture leveraging many different technologies for supporting Big Data, was employed to support the WDT use case. However, by combining and synchronising many technologies, the issue of high complexity became a significant concern.

The LA presented in Chapter 4 demonstrated that it has the ability to work well for monitoring. Most notably, the WDT use case has shown that it outperforms the traditional architecture. However, the complexity of a three-layer structure that includes various technologies comes at a price when integrating all three layers together to serve several main goals (monitoring the infrastructure in real-time, supporting scalability, and ease of implementation, maintenance and migration). Having different technologies for each layer makes them difficult to integrate, implement, and maintain. There is a pressing need to identify a single solution that can accommodate and integrate the batch layer as well as the streaming layer for monitoring events seamlessly. Apache Spark [38] is a new parallel processing paradigm similar to MapReduce [18], but with improved analytical performance. By exercising in-memory computation, it has the ability to support iterative computation [39][48]. It can also support data streaming, which is useful in optimising the LA to limit code differences between the batch and streaming layers. It can also support SQL-like commands, an interactive command line, machine learning, and GraphX [78]. Having Spark batch and streaming under one stack is useful in optimising the LA. The Spark streaming and batch computations adopt the RDD, an abstract data collection that is distributed across nodes for parallel processing [78],[79]. Transformation and computation logic can therefore be reused between the batch and streaming layers.

Spark processes are 'lazy' [78], and no action is carried out until it is required. An example is the RDD, which does not physically hold data; it contains instructions on what to do when an action is called. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return values to the driver program after running a computation on the dataset. For example, mapPartitions is a transformation that passes each partition of the dataset through a function and returns a new RDD representing the results. In contrast, reduce is an action that aggregates all the items of the RDD using a function and returns the final result to the driver program.

By default, each transformed RDD may be recomputed every time an action is run on it. It is also possible to persist an RDD in memory using the persist (or cache) method, in which case Spark keeps the computed data in memory for expedited access the next time it is queried. There is also support for persisting RDDs on disk, or replicating them across multiple nodes [79]. When monitoring a scientific infrastructure it is typical that various statistics are derived from the same monitoring events. In-memory storage and computation are beneficial, as multiple distinct computations can be carried out by a job on the cached data. This also makes the job easier to maintain. The LA evaluated in Chapter 4 employs the MapReduce framework, which does not support in-memory persistence [18], so data cannot be shared between jobs. Implementing a complex algorithm in the MapReduce framework requires the creation of chained MapReduce jobs: the output of one job must be directly connected to the input of the next. Spark does not require this, due to in-memory processing. Spark can also support global data sharing using Tachyon (licensed under Apache), a memory-centric distributed storage system that can be used for data sharing.
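
As a concrete illustration of these concepts, the minimal Scala sketch below (with a made-up input path and line format, not the actual WDT schema) builds an RDD lazily, caches it, and then derives two different results from the same cached data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

// Transformations are lazy: nothing is read or computed at this point.
val raw    = sc.textFile("hdfs:///monitoring/raw/")                   // placeholder path
val parsed = raw.mapPartitions(_.map { line =>
  val Array(site, bytes) = line.split(",")                            // assumed "site,bytes" lines
  (site, bytes.toLong)
})

// Persist the parsed events so that several statistics can reuse them.
parsed.persist(StorageLevel.MEMORY_ONLY)

val bytesPerSite = parsed.reduceByKey(_ + _)                          // still a transformation
bytesPerSite.saveAsTextFile("hdfs:///monitoring/stats/bytes")         // action: triggers the DAG
val totalEvents = parsed.count()                                      // action reusing the cache
```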

Spark Streaming supports three notable functions:

1. Cumulative Computations, which supports cumulative statistics computation while streaming in new data (incremental calculations). Spark Streaming supports maintenance of the state (which is stored information at a given instant in time) for those statistics. The Spark Streaming library has a function called updateStateByKey for maintaining and manipulating the state [78].

2. Windowed Computations, which is useful when the data received in the last n amount of time is non-trivial. Spark Streaming readily splits the input data into the desired time windows for easy processing, using the window function of the streaming library [78]. A function such as foreachRDD allows access to the RDDs created at each time interval.

3. Transformation, which returns a new DStream (stream of events) by applying an RDD-to-RDD function to every RDD of the source DStream [78]. This is where code can be reused between the batch and streaming layers using the transform() function, as both frameworks support the RDD as the core component. This feature also supports merging (i.e. joining) the batch RDDs with the streaming RDDs, which optimises the LA. A sketch combining these three functions is given below.
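
A minimal Scala sketch of these three functions is shown below; the socket source, the two-second batch interval, the "site,bytes" line format, and the reference file are illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("streaming-sketch"), Seconds(2))
ssc.checkpoint("hdfs:///monitoring/checkpoint")                       // required for stateful ops

// Assumed input: text lines of the form "site,bytes" arriving on a socket receiver
// (the real pipeline would use a JMS/Flume receiver instead).
val events = ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(site, bytes) = line.split(","); (site, bytes.toLong) }

// 1. Cumulative computation: a running total per site, maintained across micro-batches.
val runningTotals = events.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + newValues.sum)
}

// 2. Windowed computation: bytes per site over the last 60 s, evaluated every 10 s.
val windowed = events.window(Seconds(60), Seconds(10)).reduceByKey(_ + _)

// 3. Transformation: reuse RDD-level (batch) logic and join with a batch-side RDD.
val reference = ssc.sparkContext.textFile("hdfs:///monitoring/sites.csv")  // placeholder
  .map(site => (site, 1))
val enriched = events.transform(rdd => rdd.join(reference))

runningTotals.print(); windowed.print(); enriched.print()
ssc.start()
ssc.awaitTermination()
```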

5.3 Architecture and design

The core part of the OLA inherits the technologies and approaches from Chapters 3 and 4 such as a message broker, data pipeline (Flume), storage (HDFS), and serving layer (Elasticsearch). This is outlined in Figures 5.2 and 5.3.

Figure 5.2: Pure stateless batch computation. Monitoring events were sent to the HDFS for batch computation, which can be scheduled to run at any preferred time interval.

The main requirement of any monitoring architecture is that it be able to provide information about the infrastructure in near-real-time so appropriate action can be taken. Therefore, the following approaches were designed and implemented:

1. Pure stateless batch computation, as seen in Figure 5.2, which can be scheduled to run at a preferred interval. The system will not have any knowledge of previous jobs. This does not support real-time computation, but the Spark framework provides in-memory computation. Therefore, the execution time can be compared with the MapReduce framework that was used in Chapter 4. The batch computation can also be used for historical computation (i.e. high-latency).

2. Pure stateful streaming computation, as seen in Figure 5.3, which carries out incremental computation on continuously streaming data 24 hours a day, 7 days a week. From this, it can maintain the state of the computed statistics. It also has a checkpoint mechanism to dump the state to disk, so in case of job failure it can pick up from where it stopped. This method on its own is enough for real-time computation. This allows the complexities of merging multiple technologies, as in the LA, to be eliminated.

3. A combination of batch and streaming computation, also shown in Figure 5.3. Pure streaming would be enough, but there is the potential of receiving duplicate events from the message brokers due to failures. Pure streaming computation cannot address this issue, as the raw events are dropped once they are processed, and the state of the streaming job cannot keep the unique ID of the events once they are aggregated by a key (e.g. sites). Incorporating batch computation can correct the inaccurate statistics, as it will recompute whole datasets from the storage layer, eliminating duplicate events. Having the streaming layer do continuous calculation, while scheduling the batch layer to run at specified intervals to override the results and ultimately validate the statistics, therefore seems most appropriate. As pointed out previously, historical computation is necessary in scientific domains, so it is important to incorporate a batch layer in the architecture. To support this approach, the monitoring events were duplicated, with one copy sent to the HDFS for batch computation, while the other was streamed straight into the streaming receiver. However, there are a number of complexities that need to be addressed in synchronising these approaches, including: informing the streaming job about newly available data (computed by the batch job) so that it may use them to override the streaming state as well as the serving layer that stores the computed statistics for serving the UI; and a mechanism to eliminate the network communication bottleneck at the serving layer, to make sure only the newly streamed data are updated/inserted into the serving layer. This is discussed further in Section 5.3.1.

Figure 5.3: Pure stateful streaming and combination of both batch and streaming computations. Monitoring events were duplicated with one sent to the HDFS for batch computation, while the other streamed straight into the streaming receiver for incremental computation.

5.3.1 Merging and synchronising Optimised Lambda Architecture layers

This section explores how the batch, serving, and streaming layers are merged and synchronised. Table 5.1 defines the variables used in the equations below.

Table 5.1: Used variables and their corresponding expressions in this section.

Variable    Expression
B           Batch
D           Dataset
F           Function
K           Key
M           Memory
S           Storage
T           Time
V           Value


Batch Layer

The batch layer is a high-latency mechanism, so computing an enormous volume of data results in a delay that is reflected in the monitoring statistics. It is therefore important that the batch layer discards some input data; the amount discarded is linked to how often the batch process is executed and how large the dataset is. These 'missing statistics' are compensated for by the streaming layer.

\[ D_{filtered} = F_{discard}\big( D_{raw} \,\exists\, (B_{time} - B_{interval}) \big) \tag{5.1} \]

Equation 5.1 expresses how monitoring events are discarded from the computation. In this equation, $F_{discard}()$ is the function for discarding events, $D_{raw}$ is the set of raw events prior to event selection (filtering), $B_{time}$ is the batch execution time, $B_{interval}$ is the time interval for discarding events from the batch, and $(B_{time} - B_{interval})$ defines the time frame for selecting events and for emitting all existing events, $\exists$, that match the condition. Assuming a batch job runs at $B_{time}$ with a specified $B_{interval}$ of 1 hour, the batch should discard all events in a time $> (B_{time} - B_{interval})$. This prevents partially computed results, which will instead be compensated for by the streaming layer.

\[ D_{batch} = D_{filtered} \xrightarrow{F_{map}(K,V)} \xrightarrow{F_{reduce}(K,V)} \xrightarrow{data} S_{batch} \tag{5.2} \]

Equation 5.2 describes how the selected (filtered) events, $D_{filtered}$, go through a mapping process, $F_{map}()$, to generate (K)ey (a unique ID for the statistics) / (V)alue (the metric values associated with the key) pairs. Subsequently, they go through the reduce process, $F_{reduce}()$, to aggregate the values by key from all distributed nodes; the results are then stored in a designated storage folder, $S_{batch}$. The batch process writes the result (i.e. a new file) into a known folder on HDFS (this can be replaced by any storage layer, e.g. Tachyon).
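
As an illustration of Equations 5.1 and 5.2, the Scala sketch below applies the time-based discard and then the map/reduce aggregation before writing to the batch storage folder; the event fields, timestamps in epoch seconds, and the output path are assumptions.

```scala
import org.apache.spark.rdd.RDD

// Assumed raw event: arrival timestamp (epoch seconds), source site, transferred bytes.
case class RawEvent(timestamp: Long, srcSite: String, bytes: Long)

// Equation 5.1: discard events newer than (B_time - B_interval) so that the
// streaming layer, not the batch layer, covers the most recent interval.
def discard(raw: RDD[RawEvent], batchTime: Long, batchInterval: Long): RDD[RawEvent] =
  raw.filter(_.timestamp <= batchTime - batchInterval)

// Equation 5.2: map the filtered events to (key, value) pairs, reduce by key,
// and store the aggregated statistics in the designated batch folder on HDFS.
def computeBatch(filtered: RDD[RawEvent], outputPath: String): Unit =
  filtered
    .map(e => (e.srcSite, e.bytes))   // F_map(K,V)
    .reduceByKey(_ + _)               // F_reduce(K,V)
    .saveAsTextFile(outputPath)       // -> S_batch (e.g. hdfs:///monitoring/batch/<B_time>)
```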


Streaming Layer

In the consolidated streaming and batch layers, previously computed statistics (if they exist) need to be loaded from the serving layer, which can be represented by Equation 5.3.

Serving Layer process

\[ D^{storage}_{stats} = F^{storage}_{load}(T_{current}, T_{from}) = \big\{\, D^{storage}_{filtered} = D^{storage}_{input} > (T_{current} - T_{from}) \,\big\} \tag{5.3} \]

where $D^{storage}_{stats}$ are the pre-computed monitoring statistics loaded from the serving layer, $T_{current}$ is the current timestamp, $T_{from}$ is the timestamp from which statistics will be loaded, and $F^{storage}_{load}()$ is the function for loading data from the serving layer. If the input data, $D^{storage}_{input}$, which are all the statistics passed from the serving layer to the load function, are greater than $(T_{current} - T_{from})$, then they are selected and returned as the statistics $D^{storage}_{filtered}$.

\[ D^{storage}_{processed} = D^{storage}_{stats} \xrightarrow{F_{map}(K,V)} M^{storage}_{stats} \tag{5.4} \]

Equation 5.4 is an expression for $D^{storage}_{processed}$, the statistics selected from the serving layer, $D^{storage}_{stats}$, mapped and stored into memory: they go through a mapping process, $F_{map}()$, which generates (K)ey/(V)alue pairs that are then stored in memory and/or on disk, $M^{storage}_{stats}$, for later use (i.e. for merging with the other layers).

Streaming Layer process

In the streaming layer, the computation is defined as:

\[ D^{stream}_{processed} = F_{transformation}(D^{data}_{stream}) \xrightarrow{F_{map}(K,V)} \xrightarrow{F_{reduce}(K,V)} \tag{5.5} \]


where $D^{stream}_{processed}$ are the mapped, aggregated, and computed statistics from the streaming monitoring events, $D^{data}_{stream}$ is the set of streaming monitoring events, and $F_{transformation}()$ is the function filtering and transforming events, which then go through the mapping process, $F_{map}()$, generating (K)ey/(V)alue pairs. Finally, they go through the reduce process, $F_{reduce}()$, to aggregate the values by key.

Batch Layer process

The batch reading implementation is defined as:

\[ D^{batch}_{loaded} = F^{load}_{batch}(D_{batch}) \xrightarrow{F_{map}(K,V)} \tag{5.6} \]

where $D^{batch}_{loaded}$ are the statistics read from storage and mapped, $D_{batch}$ are the pre-computed statistics from Equations 5.1 and 5.2, and $F^{load}_{batch}()$ is a function that loads only the "new" pre-computed batch statistics and flags the file as "old" once it has been loaded successfully; the result then goes through the mapping process, $F_{map}()$. The mapping process does not require any reduction of the statistics, as this has already been done by the batch process.

Synchronise and update

The implementation of joining, merging and synchronising statistics from all three layers is defined as:

\[ D_{joined} = \big( D^{storage}_{processed} \cup D^{batch}_{loaded} \cup D^{stream}_{processed} \big) \tag{5.7} \]

where $D^{storage}_{processed}$ are the statistics loaded from the serving layer, $D^{batch}_{loaded}$ are the data loaded from the batch computations, and $D^{stream}_{processed}$ are the data computed from the streaming data; these are unioned (joined) and returned as a new dataset, $D_{joined}$.


The implementation of the statistics state is defined as:

 0  insert, if storage=1 ∧ state (5.8)  memory update  Dstate = Fstate (Djoined) = overwrite, if batch=1 (5.9)   update ∨ insert, if storage0 ∨ batch0 (5.10)

where $D^{memory}_{state}$ is where the state of new and old statistics is kept in memory for incremental calculation, and $D_{joined}$ are the joined $D^{storage}_{processed}$, $D^{batch}_{loaded}$, and $D^{stream}_{processed}$ statistics. $F^{update}_{state}()$ is the state update function for updating the statistics and keeping them in memory. If the statistics are from the serving layer, $storage$, and are not already in the state, $state'$, then they are inserted into the state memory. If the data are from the batch layer, $batch$, then the state memory is overwritten with the batch statistics. If the statistics are neither from the serving layer, $storage'$, nor the batch layer, $batch'$, then they are from the streaming layer (relatively new statistics), so they are aggregated with the statistics in the state memory and updated if they already exist (or inserted into the state memory if they do not exist, i.e. entirely fresh statistics).

Update serving layer

Only the new and altered statistics are inserted/updated into the serving layer, which is defined as:

\[ \big( D^{stream}_{processed} \cup D^{batch}_{loaded} \big) \ltimes D^{memory}_{state} \;\;\forall\;\; F^{upsert}_{serving\ layer} \tag{5.11} \]

The expression in Equation 5.11 says: join $D^{stream}_{processed}$ and $D^{batch}_{loaded}$ and then left-join, $\ltimes$, with $D^{memory}_{state}$, to insert/update only the new and updated statistics from the batch (if $D_{batch}$ exists in the spooling location) and the streamed statistics into the serving layer. Statistics from $D^{storage}_{processed}$ are not required because they are already in the serving layer. For each, $\forall$, statistics partition, establish a connection to the serving layer and bulk upsert: update the records that already exist in the serving layer, otherwise insert new records.


Finally, set up a checkpoint at a specified interval for recovery in case of any failure.
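
To make the synchronisation described above concrete, the sketch below (Scala, Spark Streaming) follows the same flow: statistics pre-loaded from the serving layer, batch results picked up from a spool folder at every micro-batch, and streamed statistics, merged through updateStateByKey and upserted to the serving layer per partition. All paths, the Stat/origin encoding, the simple line format, and the upsert stand-in are illustrative assumptions rather than the exact production code.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// A statistic value tagged with the layer it came from: "storage", "batch" or "stream".
case class Stat(value: Long, origin: String)

// D_stats^storage (Eqs. 5.3/5.4): statistics pre-loaded once from the serving layer.
def loadFromServingLayer(ssc: StreamingContext): RDD[(String, Stat)] =
  ssc.sparkContext.textFile("hdfs:///monitoring/serving-snapshot")          // placeholder path
    .map { line => val Array(k, v) = line.split(","); (k, Stat(v.toLong, "storage")) }
    .cache()

// D_loaded^batch (Eq. 5.6): pick up "new" batch output from a spool folder; flagging the
// files as "old" after a successful load is elided in this sketch.
def loadNewBatch(ssc: StreamingContext): RDD[(String, Stat)] =
  ssc.sparkContext.textFile("hdfs:///monitoring/batch-spool/*")             // placeholder path
    .map { line => val Array(k, v) = line.split(","); (k, Stat(v.toLong, "batch")) }

def run(ssc: StreamingContext, events: DStream[(String, Long)]): Unit = {
  val fromServing = loadFromServingLayer(ssc)

  // Eqs. 5.5 and 5.7: streamed statistics unioned with serving-layer and batch statistics.
  val joined = events
    .reduceByKey(_ + _)
    .map { case (k, v) => (k, Stat(v, "stream")) }
    .transform(rdd => rdd.union(fromServing).union(loadNewBatch(ssc)))

  // Eqs. 5.8-5.10: batch values overwrite the state, storage values only initialise it,
  // and streamed values are aggregated on top of whatever is already there.
  val state: DStream[(String, Long)] = joined.updateStateByKey[Long] {
    (incoming: Seq[Stat], current: Option[Long]) =>
      val batch   = incoming.filter(_.origin == "batch").map(_.value).reduceOption(_ + _)
      val storage = incoming.filter(_.origin == "storage").map(_.value).reduceOption(_ + _)
      val stream  = incoming.filter(_.origin == "stream").map(_.value).sum
      Some(batch.orElse(current).orElse(storage).getOrElse(0L) + stream)
  }

  // Eq. 5.11: per partition, open one connection and bulk-upsert into the serving layer.
  state.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // val client = connectToServingLayer()              // hypothetical helper
      partition.foreach { case (key, value) =>
        println(s"upsert $key -> $value")                  // stand-in for client.upsert(key, value)
      }
    }
  }

  ssc.checkpoint("hdfs:///monitoring/checkpoint")           // recovery in case of failure
  ssc.start()
  ssc.awaitTermination()
}
```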

Summary

In short, the functions explained above are:

1. The batch layer will write the result in a known folder on HDFS (this can be replaced by any storage layer e.g. Tachyon). The streaming layer will initially load specified data from the serving layer to start incremental calculations from old statistics. Then, at each micro-batch loop, and at the end of the statistics computation, it will check if there are any data in the “batch” folder. If yes, load the computed data, join with its last computed results from history, and insert the newly computed results into the serving layer while updating the history.

2. The batch layer should discard some statistics; the amount discarded is linked to how often the batch process is executed and the expected delay in processing the ever-growing dataset. Assuming a batch runs at time t and discards the last hour of data, the batch should discard all statistics referring to a time > t − 1 hour; this prevents partially computed results.

3. The streaming layer will run forever, and the batch process can be executed regularly, or on demand. Broker queues can be used for ingesting messages from the data pipeline so that, if the streaming fails, the data will be retained on the broker indefinitely.

4. The serving layer insertion time will be reasonably short due to micro-batch computation. When a streaming iteration reads the full batch and inserts it into the serving layer, it will stop processing new data. This has the potential of being noticeable on the UI, but it would only be a short-lived temporary glitch: the data would still be present and the system would quickly recover when the insertion is over (scaling nodes and parallelising the tasks would improve performance).


5.4 Performance evaluation of the Optimised Lambda Architecture

5.4.1 Experiment setup

For the evaluation of the OLA, the same CERN IT on-premises Hadoop cluster that was used in Chapter 4 for evaluating LA was used. The cluster consisted of 15 nodes of Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60 GHz (8 nodes: 32 cores/64 GB, 7 nodes: 4 cores/8 GB). Hadoop-2.6 and Spark-1.6.0 were configured on all machines. The OLA was also evaluated on the EC2 cloud infrastructure [80], as described in Section 5.4.5.

Three different computing and data intensive algorithms from the WDT use case were used for evaluating the OLA:

Access Pattern - This algorithm works out which data are hot (popular) and which are cold (unpopular). Hot data are very popular among physicists, so they need to be replicated and distributed across many nodes for load balancing and better accessibility. The steps are as follows (a Spark sketch of this computation is given after the list):

1. Inject Log Message event into Map statement.

2. Split the Log Message into several Log Map Events according to the time bins the initial event belongs.

3. Inject each of the Log Map Events into a single Log Statistic Event and compute the following:

(a) If (client domain == server domain) then remote access = 1 else 0

(b) If (read bytes + write bytes == file size) then is transfer = 1 else 0

4. Aggregate all the single log statistics:

(a) If user domain == null then replace it with server username. In case that server username is also null replace it with “n/a”

(b) If file name == null then replace it with “n/a”

(c) AVG(file size)

(d) If (read bytes > 0) then number of read = 1 else 0


(e) SUM(read bytes)

(f) If (read bytes > 0) then sum(end time - start time) else read time = 0

(g) If (write bytes > 0) then number of write = 1 else 0

(h) SUM(write bytes)

(i) If (write bytes > 0) then sum(end time - start time) else write time = 0.

5. Aggregate and Reduce all the log statistics:

(a) If there is not already a time bin for the injected log statistic events then create it and insert (establish connection to Elasticsearch and Bulk insert):

(b) Else update the existing bin (update the Elasticsearch document version):
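
A condensed Scala/Spark sketch of this access-pattern aggregation follows; the field names, the one-hour time binning, and the reduced set of metrics are simplified assumptions rather than the full production logic, and the Elasticsearch upsert of step 5 is only noted in a comment.

```scala
import org.apache.spark.rdd.RDD

// Assumed subset of the fields carried by one log message.
case class LogMessage(start: Long, end: Long, clientDomain: String, serverDomain: String,
                      userDomain: String, serverUsername: String, fileName: String,
                      fileSize: Long, readBytes: Long, writeBytes: Long)

val binSize = 3600L   // one-hour time bins (assumption)

// Steps 1-5 of the access-pattern algorithm, keyed by (time bin, file name).
def accessPattern(logs: RDD[LogMessage]): RDD[((Long, String), (Long, Long, Long, Long))] =
  logs
    // Split each log message into the time bins it overlaps (steps 1-2).
    .flatMap { m =>
      val bins = (m.start / binSize) to (m.end / binSize)
      bins.map(b => (b * binSize, m))
    }
    // Per-event statistics with null handling (steps 3-4).
    .map { case (bin, m) =>
      val file         = Option(m.fileName).getOrElse("n/a")
      val remoteAccess = if (m.clientDomain == m.serverDomain) 1L else 0L
      val isTransfer   = if (m.readBytes + m.writeBytes == m.fileSize) 1L else 0L
      val nReads       = if (m.readBytes > 0) 1L else 0L
      ((bin, file), (remoteAccess, isTransfer, nReads, m.readBytes))
    }
    // Aggregate and reduce all single statistics per (bin, file) (step 5); the result
    // would then be bulk-upserted into the Elasticsearch serving layer.
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3, a._4 + b._4))
```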

Transfer Statistics - This algorithm has already been used in LA evaluation (Section 4.4.4) and works out the average data transfer rate from site A to B. A completed file transfer lasting several hours from site A to site B, also contributes to several time bins in the past. Information about the average traffic from site A to site B has to be updated.

User Statistics - This algorithm works out the number of active users, and how much data they have downloaded within a specified time interval.

1. Inject Log Message event into Map statement.

2. Split the Log Message into several Log Map Events according to the time bins the initial event belongs.

3. Inject each of the Log Map Events into a Single Log Statistic Event and compute the following:

(a) If (client domain == server domain) then remote access = 1 else 0

(b) If (read bytes + write bytes == file size) then is transfer = 1 else 0

(c) If user domain == null then replace it with server username

4. Aggregate all the single log statistics:

(a) SUM(read single bytes)

(b) SUM(read vector bytes)


(c) SUM(file size)

5. Aggregate and Reduce all the log statistics:

(a) If there is not already a time bin for the injected log statistic events then create it and insert (establish connection to Elasticsearch and Bulk insert):

(b) Else update the existing bin (update the Elasticsearch document version):

5.4.2 Illustration of the workflow

An evaluation of the workflow in the OLA is presented in this section.

The timeline in Figure 5.4 shows sequential job execution, where jobs were performed one at a time. The next job will only be initiated once the previous job has been completed. This workflow is useful when the later job is dependent on the previous job, e.g. when the second job relies on the results computed by the previous (similar to the MapReduce framework).

Figure 5.4: Sequential jobs execution. Jobs were executed one at a time.

Figure 5.5 shows parallel job execution, where multiple concurrent jobs are executed at the same time. This is not achievable with the MapReduce framework presented in Chapter 4. This is the workflow that is beneficial for carrying out in-memory computation, meaning data can be loaded into memory, and used by concurrent jobs rather than having each job load data from the storage layer.


Figure 5.5: Parallel jobs execution. Multiple jobs were executed at a time.

The course of action shown in Figure 5.6 is reasonably straightforward. After all executors required for a job have been registered, the job commences execution. The executors were removed when the job was complete, in order to make resources available for other jobs.

Figure 5.6: Sequential tasks execution.

In Spark, a job is associated with a chain of RDD dependencies arranged in a directed acyclic graph (DAG), as can be seen in Figure 5.7. From the DAG, it can be seen that the evaluated WDT use case (i.e. transfer statistics) first executed a textFile operation to read data from HDFS, then called the mapPartitions operation to transform the data into Java objects, calling another mapPartitions operation to extract the required data and to carry out an initial transformation. Subsequently, it called a reduceByKey function (in the second stage, which is dependent on the first stage) to aggregate the final results, and finally the saveAsTextFile operation was used to save the data into HDFS. It can be seen that each executor immediately applied the subsequent mapPartitions operation to the dataset partition after reading it from HDFS in the same task, minimising the data shuffles between nodes. The black dots in the boxes represent RDDs created by the corresponding operations [81].

Figure 5.7: Overview of job stages.

Spark supports caching stages in memory so that data may be reused rather than recomputed. Figure 5.8 illustrates that Stage 3 reads data from HDFS and carries out the initial transformation discussed in Chapter 3, caching it in memory (shown greyed out). The subsequent jobs can easily recover the stage from memory, thereby reducing recomputation time.

Figure 5.8: Cached stages were reused by parallel jobs. The green circle denotes that an RDD is cached from the previous stage. The greyed stage (cached) was skipped by the following concurrent jobs.

Figure 5.9 shows an insight into Stage 3 of the event timeline from Figure 5.8. It shows that tasks were distributed to two worker nodes. Most of the execution time was spent on computing the statistics rather than on scheduler delay, network, or I/O overheads. This is not unexpected since the job involves shuffling very little data. Each executor performed three tasks concurrently, due to the number of CPU cores explicitly configured at job submission. The parallelism can be increased or decreased in direct relation to the number of cores, which has an effect on performance.

Figure 5.9: Concurrent tasks execution. A job was split into multiple tasks and executed in each executor CPU core concurrently.

Figure 5.10 shows an insight into the Stage 1 event timeline which aggregates the results, and writes them into the HDFS. It can be seen that there is only one task in Stage 1, which spent most time on computation. There is a noticeable delay in task deserialisation, which is used for deserialising the result sent from the mapper node.

Figure 5.10: Insight into Stage 2 timeline.

5.4.3 Performance evaluation of WLCG environment and WDT use case

In this section the batch computation, as well as the real-time computation of the OLA, were evaluated.


Evaluation of Spark’s batch computation

In this section, an evaluation of Spark batch computation over increasing dataset sizes was carried out, similar to the evaluation detailed in Chapter 4. This evaluation was carried out on the same dataset so that it could be compared with the MapReduce framework computation. Although Spark supports in-memory computation, it was not used in this evaluation as the job consisted of a single algorithm (transfer statistics); it was unnecessary to persist the dataset into memory as there were no follow-up jobs that could benefit from it. The evaluation results are shown in Figure 5.11. It can be seen that computing 30 days of the dataset was completed in ∼2 minutes by the Avro, CSV and JSON jobs. It can also be seen that the execution time increases linearly as the dataset size is increased. Nevertheless, the performance was improved when compared with the current approach used by the WDT, which occasionally took longer than 10 minutes. Again, the performance pattern of the data types is similar to the MapReduce job, as Avro performed better overall compared to the other jobs. On the other hand, JSON performed poorly when compared with the other jobs. In total, the JSON and CSV jobs took on average 64% and 14% more execution time than Avro, respectively.


Figure 5.11: Computation of Avro, CSV and JSON files over augmented dataset (day 1 to 30 days). The primary axis (a) shows the execution time that is being represented by lines, whereas the secondary axis (b) represents the input data size in Megabytes (MB) which is represented by bars.

A comparison of the job execution time using the MapReduce framework from Chapter 4 and using Spark is presented in Figure 5.12. It can be observed that the Spark jobs performed much better than the MapReduce jobs, although data persistence was not used in either framework. The first day's dataset (the smallest) took far less time to compute using Spark, whereas computing it using MapReduce took significantly longer. From this observation it can be concluded that Spark incurred less overhead in allocating resources than MapReduce. Nevertheless, the MapReduce job execution time stabilised as the dataset size was increased further, with only minor oscillations seen, whereas the Spark execution time increased linearly as the dataset size was increased. This can be explained by the computing limitation of the cluster due to its heterogeneous setup. The Spark-Avro job's total execution time was on average 43% less than the MR-Avro job's, whereas Spark-CSV improved by 38% and Spark-JSON by 23% compared with their counterparts. All Spark jobs performed better than their counterparts, with the greatest improvement seen in the Spark-Avro computation.


Figure 5.12: Comparison of the MapReduce versus the Spark framework against various data types.

Figure 5.13 shows the execution time over various data partition sizes (i.e. parallelisation, which is the process of splitting the dataset into a number of partitions and allocating tasks to process each of those split portions). It can be seen that the execution time improved as the number of partitions was increased. It can also be observed that, after the job reached a certain number of partitions, the execution time stabilised. This can be explained by the fact that more partitions require a correspondingly larger number of tasks, each of which needs resources to be found and allocated and adds garbage-collection overhead; it also requires shuffling data over the network. It can be observed that the Avro job performed better compared to the other two jobs.


Figure 5.13: Execution time versus the number of partitions of various data types.

In the previous evaluation, only a single job (algorithm) was evaluated, so persisting the dataset into memory was of no benefit. In order to benefit from in-memory computation, it was necessary to evaluate multiple jobs (multiple statistics algorithms) on the same dataset (i.e. deriving various statistics from the same dataset). The single job assessed previously only parallelises the tasks, but as multiple jobs may be deployed to profit from in-memory computation, it was essential to evaluate parallel job execution versus sequential job execution. Figure 5.14 illustrates that parallel jobs performed better than sequential jobs; in particular, the cached parallel job performed exceptionally well. When comparing the uncached parallel job with the cached sequential job, it is evident that the parallel job performed better, which can only be explained by the simultaneous job execution. A sequential run submits one job at a time, so the next one in the queue can only be submitted when the previous job has been completed; this is not the case for the parallel run, which submits all jobs at one time. This should not be a problem in the OLA, as it can scale dynamically when there is more demand for resources.
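
The parallel-versus-sequential comparison above can be reproduced with a pattern like the following Scala sketch, which submits several actions on one cached RDD from separate threads so that the Spark scheduler interleaves them; the input format, paths, and the three statistics are placeholders.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("parallel-jobs-sketch"))

// Parse once and cache, so every concurrent job reuses the in-memory dataset.
val events = sc.textFile("hdfs:///monitoring/raw/")                         // placeholder path
  .map { line => val Array(site, user, bytes) = line.split(","); (site, user, bytes.toLong) }
  .cache()

// Hypothetical statistics jobs submitted from separate threads; the Spark scheduler
// runs them concurrently instead of one after the other.
val transferStats = Future {
  events.map(e => (e._1, e._3)).reduceByKey(_ + _).saveAsTextFile("hdfs:///stats/transfer")
}
val userStats = Future {
  events.map(e => (e._2, e._3)).reduceByKey(_ + _).saveAsTextFile("hdfs:///stats/user")
}
val eventCount = Future { println(s"events: ${events.count()}") }

// Wait for all three concurrent jobs to finish.
Await.result(Future.sequence(Seq(transferStats, userStats, eventCount)), 30.minutes)
```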


Figure 5.14: Comparison of parallel and sequential jobs with cached and uncached datasets. Execution time versus parallel, sequential cached and uncached jobs.

Figure 5.15 shows the evaluation of execution time over the various types of data persistence used in parallel job submission. These options correspond to Spark's StorageLevel settings (a sketch follows the list):

• Memory only (MEMORY ONLY), which only uses the memory for caching the dataset. In the case of a dataset being larger than the memory capacity, the partitions that do not fit are not cached and are recomputed when needed.

• Memory only with two replications (MEMORY ONLY 2), which is similar to MEMORY ONLY but replicates the dataset twice for improved data availability.

• Memory only with serialisation (MEMORY ONLY SER), which is similar to MEMORY ONLY, but uses serialisation to compact the data so that more information can be stored in memory, as memory space is very limited. However, serialisation and deserialisation add computation overhead to the job.

• Memory only with two serialised replications (MEMORY ONLY SER 2), which is similar to MEMORY ONLY SER but replicates the dataset twice.

• Disk only (DISK ONLY), which spills the dataset onto the disk.

• Disk only with two replications (DISK ONLY 2), which is similar to DISK ONLY but replicates the dataset twice.

• Memory and disk (MEMORY AND DISK), which uses both memory and disk for storage, spilling to disk the portion of the data that does not fit in memory.

• Memory and disk with two replications (MEMORY AND DISK 2), which is similar to MEMORY AND DISK but with two replications of the dataset.

• Memory and disk with serialisation (MEMORY AND DISK SER), which is analogous to MEMORY AND DISK but uses serialisation to compact the data so that more data can be stored in memory.

• Memory and disk with two serialised replications (MEMORY AND DISK SER 2), which is similar to MEMORY AND DISK SER but with two replications of the dataset.
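
These options map directly onto Spark's StorageLevel constants; a minimal Scala sketch is shown below (the dataset and path are placeholders).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc  = new SparkContext(new SparkConf().setAppName("persistence-sketch"))
val rdd = sc.textFile("hdfs:///monitoring/raw/")       // placeholder path

// An RDD takes exactly one storage level; the alternatives evaluated above are shown commented out.
rdd.persist(StorageLevel.MEMORY_ONLY)
// StorageLevel.MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2,
// StorageLevel.DISK_ONLY, DISK_ONLY_2,
// StorageLevel.MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2

rdd.count()        // the first action materialises the data under the chosen level
rdd.unpersist()    // release the cached blocks when they are no longer needed
```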

Figure 5.15: Comparison of various cache types. Execution time versus computation of data cached in memory, disk, and memory and disk (also a combination of replicated and serialised dataset).

As shown in Figure 5.15, it is evident that in-memory persistence outperformed the other methods. Having two replications of the dataset in memory did not improve the performance compared to the single copy. In general, serialisation did not perform well, which is understandable as extra overhead is required for serialising and deserialising the dataset. Using memory and disk performed better than the pure disk option; it was better to recompute from the source rather than to read the cached data from disk. When the execution times of all memory-only, disk-only, and memory-and-disk options were clustered and compared, the disk-only options took 104% more execution time than the memory-only options, whereas the memory-and-disk options took 83% more execution time than the memory-only options.

The scalability was evaluated by incrementing the number of executor nodes. Executors were incremented one at a time. The memory size was fixed at 1024 MB. The evaluated dataset size was 7.5 GB, which was used for the following evaluations unless otherwise stated. The total amount of memory allocated for the jobs can be calculated by multiplying the number of executors by the amount of memory allocated to each executor (i.e. 1024 MB). The previous evaluations showed that the Avro job performed better than the CSV and JSON jobs; therefore, it was used for the node scalability analysis. Figure 5.16 shows that the execution time improved as the number of executors was increased. However, there was a dramatic improvement in performance only up to three executors; with more than three executors, there was no significant further improvement. The execution time decreased by 64% when three executors were used compared to the initial single-executor execution time. When nine more executors were used, compared with the single executor, the performance improved by 84%. This shows only a 20% improvement using nine executors over three executors. Over-allocating resources (in this case executors) can be wasteful, displacing resources that could have been used for other jobs.


Figure 5.16: Execution time versus the number of executors.

For the evaluation of memory usage, the number of executors was fixed at four (for comparison with the former analysis), the number of CPU cores was fixed at one, and the memory was increased by 1024 MB at each evaluation. The total amount of memory allocated for the jobs can be calculated by multiplying the amount of memory allocated to each executor by the number of executors (i.e. 4). The memory allocated to each core was the same as for its executor, since the number of cores was fixed at one and the memory therefore did not need to be shared. The dataset size was the same as in the previous evaluation, 7.5 GB. The performance improved rapidly as the memory was increased, as seen in Figure 5.17. What was indisputable from the results was that memory plays a significant role in performance. With four executors, a better result was achieved by just increasing the memory rather than by using ten executors, as can be seen from the previous analysis.


Figure 5.17: Execution time versus the amount of memory size.

In order to evaluate CPU core utilisation in optimising the performance, the number of executors was fixed at four, and the memory was fixed at 2048 MB. The number of cores was increased by one at each execution. Each executor was allocated 2048 MB of memory, but there were four executors, so the total amount of memory allocated to these jobs was 8192 MB. The amount of memory allocated to each core was calculated by dividing the memory allocated to each executor (i.e. 2048 MB) by the total number of cores allocated to each executor. Again, the performance improved as the number of cores was increased, as seen in Figure 5.18. The performance improved steadily as the number of cores increased; the observed improvement was caused by the parallelisation of the tasks.
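
The three knobs varied in these evaluations (executors, executor memory, executor cores) are ordinary job-submission parameters; a sketch of how they would be set programmatically is shown below, with the spark-submit equivalents noted in a comment (the values are illustrative, taken from one of the runs above).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent spark-submit flags: --num-executors 4 --executor-memory 2048m --executor-cores 3
val conf = new SparkConf()
  .setAppName("wdt-scalability-sketch")
  .set("spark.executor.instances", "4")     // number of executors (YARN)
  .set("spark.executor.memory", "2048m")    // memory per executor
  .set("spark.executor.cores", "3")         // concurrent tasks per executor

val sc = new SparkContext(conf)
// ... job logic as in the earlier sketches ...
```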


Figure 5.18: Execution time versus the number of CPU cores.

Evaluation of Spark’s Streaming computation

In Chapter 4, Esper was used for carrying out real-time computation, but it did not support scalability. Spark Streaming, however, supports scalability just as the batch computation does. The performance observed with Esper was reasonable for the WDT as it computed the events as soon as they were received. Nevertheless, to support the foreseen explosion in the volume and speed of the data, scalability is required, so Spark Streaming was investigated. It still needed to be evaluated on a real-life scientific application, using the same algorithm as for the batch computation (i.e. transfer statistics). A few metrics are important in evaluating the streaming layer: one is the input rate at which events are received, and the other is the processing time of each micro-batch. The streaming layer was deployed with three executors, each with 2048 MB of memory and three cores. The streaming layer was evaluated on the last 1000 batches of streamed data. It had been running for ∼15 hours at a two-second batch interval prior to the evaluation; at the time of the evaluation, the streaming layer had completed ∼27 thousand batches and computed ∼5 million records.


Figure 5.19: Streaming data input rate. Streaming job receiving data at a rate of 116 events/second on average.

Figure 5.19 shows that the streaming layer was receiving data at a rate of about 116 events/second on average across all its sources. The streaming layer is capable of handling a much larger rate than the one shown in Figure 5.19. However, the source was sending a relatively low load of events at the time of evaluation.

Figure 5.20: Streaming data processing time. Processing time shows that these batches have been processed within 88 ms on average.

Figure 5.20 presents the processing time, which shows that these micro-batches were processed within 88 ms of being received on average. A processing time lower than the batch interval means that the scheduling delay (which is the time a batch waits for previous batches to complete [81]) was almost zero, as seen in Figure 5.21. It can also be noted that there were a few spikes in the scheduling delay, including one when a sudden peak in the data input rate increased the scheduling delay by 16 ms. The scheduling delay is the key indicator of whether the streaming layer is stable or not [81]; in this particular evaluation it indicated that the streaming layer was very steady.


Figure 5.21: Schedule delay in processing next batch.

Figure 5.22 shows that the total delay in scheduling and processing the batches was 105 ms on average. This means the transfer statistics can be presented to the end user within a second.

Figure 5.22: Total delay in scheduling and processing streaming data.

5.4.4 Evaluating the accuracy of monitoring computations

To evaluate how accurately the architecture was able to compute the WLCG sites' throughput in time-series, all three OLA approaches were tested. As shown in Figure 5.23, the stateless batch job was scheduled to run every five minutes and carry out batch computations on the data stored in HDFS. However, the plot highlighted in Figure 5.23 shows that some data are missing. This is due to the latency of the batch computation and the unavailability of the data when the job started.


Figure 5.23: The Spark batch computations for WLCG monitoring (some statistics are missing as highlighted).

Figure 5.24 represents the combined batch and streaming approach. This approach shows the computation in real-time, as highlighted in the plot. This shows that the combination of batch and streaming is capable of providing up-to-date statistics that are beneficial to the users in comparison with the pure batch computation approach. The Spark batch computation also performed better than the MapReduce job presented in Chapter 4 due to the use of in-memory processing: the intermediate results were cached in memory, in contrast with the former approach, which uses the disk for reading and writing.

Figure 5.24: The Spark batch and streaming computations for WLCG monitoring (statis- tical data are in near real-time as highlighted).


5.4.5 Evaluation of scalability, on the Amazon EC2 cloud cluster

In the previous section, the OLA was evaluated on the CERN IT on-premises Hadoop analytics infrastructure (shared). The architecture was also evaluated on a public cloud infrastructure to understand how portable the architecture is. The purpose of the evaluation was not to compare the performance of the on-premises Hadoop analytics infrastructure and the cloud infrastructure, but solely to understand the flexibility of the OLA model. The metadata from the ATLAS datasets were used for the evaluation of the on-premises Hadoop analytics infrastructure, whereas metadata from the CMS datasets were used for the evaluation of the cloud infrastructure. In this section, various scalability properties were evaluated on the Amazon cloud cluster, such as the number of cores, the memory size, and the number of executors. All three algorithms discussed in the previous sections of this chapter were used to evaluate the parallelism. Taking this into account, the performance on the cloud may vary in comparison with the CERN IT Hadoop cluster.

Cloud computing is an approach that enables the sharing of computer resources over the Internet. The resources are shared among consumers, and they can be dynamically allocated and de-allocated on demand. Consumers obtain the resources from cloud service providers based on a metering system and are only charged for the resources that are utilised (the pay-as-you-go model). To assess the portability of the OLA presented in this chapter, a virtual cluster was created in the Amazon Elastic Compute Cloud (EC2) using a general-purpose instance, "m4.2xlarge", with eight virtual CPUs, 32 GB of memory, and 20 GB of storage per instance. The cluster was configured with four nodes: one name node and three data nodes. The same Hadoop, Flume and Spark versions, and the same operating system, were installed on each instance. The OLA was ported onto the EC2 cluster with ease, as it is a software architecture that can run on any cluster that supports Hadoop. For conducting the tests, the job was submitted with various scalability properties; at each execution, one parameter was changed while the rest remained fixed.

Executor memory

Although there were 32 GB of memory available in each node, it is possible to limit how much memory is allocated to a job. For this evaluation, the number of executors was fixed at four. The jobs were then submitted with varying memory sizes: 2 GB, 4 GB, 6 GB, 8 GB, and 20 GB for each executor. Since there were four executors running, the total memory used for each test was 8 GB, 16 GB, 24 GB, 32 GB and 80 GB. In general, the performance improved as the memory was increased; in particular, from 2 GB to 8 GB the execution time improved by 48%. What is evident from Figure 5.25 is that the difference in execution time is not substantial when increasing from 8 GB to 20 GB; in fact, it varies by just 10 seconds. This can be explained by the fact that when 8 GB per node is used, the total available memory for the jobs is 32 GB, more than enough to accommodate the 24 GB of data that needed to be processed. Any additional memory would not have a large impact on job performance, as it would mostly remain unused.

Figure 5.25: Execution time versus the memory size on the cloud infrastructure.

Executor instances

The evaluation of the executors was carried out in order to measure how performance would be impacted when changing the number of executors. For this test, the amount of memory used for each executor was fixed at 4 GB, and the number of cores per executor was fixed at two. As seen in Figure 5.26, the execution time was improved by 76% using four executors when compared with just one. The execution time was seven seconds slower when five executors were used in comparison to four. This was in part due to there only being four virtual nodes available in the cluster: with five executors, one of the nodes would run more than one executor, contributing to an uneven distribution of the job and ultimately an overloaded node.

Figure 5.26: Execution time versus the number of executors on the cloud infrastructure.

Executor cores

The executor cores parameter defines the number of tasks that each executor can run concurrently. In this test, the number of cores per executor was analysed as shown in Figure 5.27. The amount of memory used for each executor was fixed at 4 GB, and the number of executors was fixed at four, so all nodes would be utilised. The performance improvement of using eight cores over 2 cores was 69%. No difference was observed between using eight or ten cores. This is due to the fact that the maximum number of virtual CPU cores available in each node is eight.


Figure 5.27: Execution time versus the number of cores on the cloud infrastructure.

5.5 Summary

The three data monitoring approaches presented in this chapter outperform the RDBMS-based system and the Lambda Architecture used by the WDT in terms of execution time, low latency, maintenance, and scalability. In particular, the streaming approach provides the up-to-date state of the infrastructure. The evaluation also shows that the Optimised Lambda Architecture can be ported to other computing infrastructures with ease, as was demonstrated on the CERN IT on-premises Hadoop cluster and on a cloud infrastructure. On completion of the work described in this chapter, the WLCG monitoring group adopted the Optimised Lambda Architecture, the combined batch and streaming approach, and has been using it for monitoring the WLCG data activities since October 2015 [82]. Since the deployment of the Optimised Lambda Architecture, the WDT tools have been able to monitor infrastructures (e.g. the EOS data storage) that it was once assumed would be impractical to monitor. It has also saved operational time as well as computation time in comparison to the traditional architecture formerly used. With the traditional workflow, consisting of local filesystems (dirq), local collectors, and PL/SQL computations, several hours of operational time per week were dedicated to cleaning up the partitions of the machines and maintaining services. With the Optimised Lambda Architecture now implemented, the corresponding operational time has been reduced to almost zero; an estimated 0.5 days/week is saved through use of the Optimised Lambda Architecture. In terms of computation time, the Optimised Lambda Architecture utilises real-time computation, whereas the traditional architecture required recomputation at regular intervals. The Optimised Lambda Architecture batch layer reduced computation time by a factor of five when compared with the traditional PL/SQL system, and by a factor of two when compared with the original Lambda Architecture.

Chapter 6

Real-time processing and Machine Learning for forecasting data access pattern

In most data-intensive experiments it is normal to collect, replicate, distribute, and store petabytes of data on a heterogeneous computing infrastructure. Data management systems are designed to handle the demand of users for a wide range of scientific analyses. Typically, data movements and accesses are recorded in the form of metadata by a collection system. Using this information, a model can be built to forecast data access patterns for efficient data management and placement. This chapter presents insights into studies on data popularity and access patterns, evaluating the Machine Learning Library from the Apache Spark stack with a supervised K-Nearest Neighbours algorithm for forecasting the data access pattern. In recent years there has been an explosion of Deep Learning techniques, attracting considerable interest from the scientific community as well as the commercial sector. Based on the data access pattern prediction, the number of data replicas required for efficient resource utilisation can be calculated, and a decision can be made as to whether there are enough replicas (do nothing), too few (add), or too many (delete). This chapter presents both a batch learning technique, which makes predictions based on historical training data using Spark, and a combination of batch and online learning techniques, actively adapting and updating the model as new data are streamed in from Spark Streaming for prediction. The techniques are evaluated as an additional layer of the OLA, which was presented in Chapter 5. This intelligence layer (the data access pattern model) is the final addition required for building a modern, state-of-the-art monitoring system for scientific infrastructure.

6.1 Introduction

Monitoring scientific infrastructure involves many vital activities, including following specific analysis jobs and tasks, identifying and investigating inefficiencies and failures, and identifying trends while predicting future requirements. This information can be used for efficient resource allocation. Although real-time monitoring of scientific infrastructure is a stepping stone towards a more effective monitoring system, it is not a complete solution on its own. Without an auto-adaptive mechanism human intervention is required, which is not ideal, as outlined in this section. The OLA presented in Chapter 5 has demonstrated that batch and streaming layers can be synchronised to perform real-time analytics on monitoring events. It would be beneficial to add an intelligence mechanism to this architecture to understand how easily an adaptive model could be implemented. Various use cases can be studied for an adaptive model, such as adopting a classical pattern-matching approach to promptly detect errors and failures in the stream of monitoring events, including the detection of unauthorised access. However, since the research presented in this thesis is related to Big Data, the focus is on managing large volumes of data. In scientific experiments, large volumes of data are generated, transferred, stored, and analysed. Efficient data placement is critical when dealing with large-scale infrastructure as it plays a significant role in efficient resource allocation and deployment. An efficient data placement strategy based on the Data Access Pattern (DAP) was implemented on the batch and streaming layers. This is of great importance, given that the scale of the computing problem will increase far faster than the resources available to the experiments.

In the modern distributed scientific community it is common for raw data and simulated data to be accessed via analysis jobs at computing sites. The details of these accesses can be collected and saved for analytics (e.g. the timestamp of the access, the name of the site hosting the data, the number of existing replicas, the location of such replicas on the computing infrastructure, and the amount of CPU time used to access such datasets). The basic idea is to profit from the experience that can be gained from studying the DAP in the past and present in order to predict data accesses in the future.

To date, no studies have been carried out on the use of the DAP with streaming data. However, a few studies already exist that examine the DAP using traditional data mining techniques [83], which are single-node models. These patterns, if understood, can be used as input to the simulation of computing models to optimise existing systems, and to explore next-generation CPU/storage/network co-scheduling solutions. This study aims to use the information coming from DAP studies to improve the use of computing resources (i.e. optimising disk occupancy, minimising the number of dataset replicas, etc.). The motive behind this modelling is to be proactive as well as adaptive, since past models cannot be applied to future prediction for an extended amount of time; only adaptive models would equip scientific experiments with up-to-date predictive power in the long term. The DAP, despite being relatively simple in definition, is a complex use case once it is investigated in depth.

6.2 Data Analysis and Data Modelling

The DAP study was conducted on a data-intensive data management system known as the ATLAS Rucio Distributed Data Management (RDDM) system [84]. The RDDM operates on files, which usually contain many events. Events taken under the same detector conditions can be distributed over multiple files, necessitating some sort of aggregation. For this, the RDDM has the concept of datasets. Datasets consist of one or more files [84] and they are the operational unit of the RDDM. Only datasets can be transferred between sites; individual files cannot. The copies of a dataset on a given site are called replicas. To create a new replica, all files in a dataset are copied to a site in the grid network and registered as a new replica in the system. A dataset in the RDDM can have multiple physical replicas. The distribution of datasets is based on policies defined by the Computing Resource Management system [84], guided by the ATLAS computing model and operational constraints. For each type of dataset, a minimum number of replicas is defined. To run an analysis on this data, users send their jobs to a workload management system called PanDA [85]. PanDA takes the user jobs and schedules them to run on a site that hosts a replica of the dataset needed for the analysis, based on the availability and workload of the site as well as the jobs' priority. Each time PanDA accesses a file in the dataset it sends a report to the RDDM tracer system [84]. This report includes information about the corresponding dataset, the site it was accessed on, the user, the starting and ending time, and the download status.

6.2.1 Data Acquisition

The raw data for the DAP study were acquired from the ATLAS experiment and used for both training and testing the DAP model. As the dataset was in raw form, it needed to be pre-processed.

6.2.2 Data Pre-Processing

A number of pre-processing methods were applied to handle noisy and missing data which might disrupt the decision-making process. The pre-processing involved converting raw data into a machine-understandable vector format and removing data that did not have the required features. Low-quality data will produce poor-quality results, so it was important that the quality of the training data was not compromised.

6.2.3 Initial DAP dataset analysis

The aim of this study was to demonstrate and evaluate how an intelligence layer can be deployed on the OLA. However, the OLA is a distributed architecture, which makes building a DAP prediction model a complex and time-consuming process without first understanding the DAP dataset. The dataset was therefore first evaluated with traditional machine learning tools (Orange Canvas [86] and R [87]) in place of a distributed machine learning library. This analysis assisted in determining the features for the independent and dependent variables of the DAP model.

Figure 6.1 illustrates the most popular datasets in each month of 2016 from January to June. It is notable that most of the popular datasets consistently remain among the top accessed data each month. Figure 6.2 details dataset access in January. It is apparent that many datasets were not accessed frequently, but would become popular due to seasonal or sudden interest, which happens consistently in the scientific domain. This plot is useful in determining which datasets are popular, but does not assist in determining how popular these datasets will become. It is therefore important to use a regression algorithm to forecast the DAP, so that the prediction output is a continuous number instead of a binary option such as popular or unpopular.

Figure 6.1: Monthly popular dataset trends.

The predicted DAP gained from the analysis can be used to add and remove dataset replicas to enhance data distribution. Normally, data arrangements consist of two chains that have to operate hand-in-hand: the reclaiming of disk space at sites (i.e. clearing disk space), and the creation of new replicas. For the creation of replica datasets, the system must rely on the results of the dataset access prediction, which can be used to place replicas evenly across the available computing resources. This would enable jobs to be executed simultaneously on different data nodes instead of having to wait for another occupying job to be completed. By adding new replicas the number of job slots able to process a particular dataset becomes more manageable, and the workload management system has more options for where to place jobs. Ideally, this would lead to lower overall waiting times for the user and better resource utilisation.

Figure 6.2: Popular dataset trends in January. It shows all datasets accessed in January, with each box scaled by the number of accesses.

In order to understand the DAP dataset and its trends, the following studies were conducted: a mean calculation on the dataset access history, and a simple auto-regression model.


Mean Analysis: Five Days of Data Points

In order to find the top ten most requested datasets (indexes) for each day, an aggregation was carried out, which is shown in Table 6.1.

Table 6.1: Index numbers of top ten accessed datasets each day.

Day 1   259953  179873  232440  47193   101738  52285   164062  30944   174116  234065
Day 2   259953  179873  101738  232440  47193   52285   164062  30944   174116  140942
Day 3   259953  179873  101738  232440  47193   52285   164062  30944   174116  140942
Day 4   259953  179873  232440  101738  47193   52285   164062  30944   174116  140942
Day 5   259953  179873  232440  101738  47193   52285   164062  30944   174116  140942

A pattern can be identified in the frequency of dataset access, though some datasets appear in different orders. The plot in Figure 6.3 shows the evaluation of the frequency of data access for the top five maximally requested datasets over the five days, taken from the list found above. These datasets were mostly stable with little oscillation.

Figure 6.3: Evaluation of frequency for top five maximally requested datasets.

It cannot be concluded that there was a stable frequency of data access just by inspecting the top ten datasets, due to the small number of data points (five days). An analysis over a wider time range (more days of data points) was required to extract further information about the patterns, which can be found in Section 6.2.3. If it were to be assumed that the behaviour would be the same for other days (small oscillation around some stable level), then these "stable datasets" would not need a prognostic model, as their frequency of data access is constant.

Mean Analysis: Thirty Days of Data Points

Stability across a wider time range (30 days) was analysed in order to further understand the DAP. To assess the quality of the prediction the Sum of the Squared Errors (SSE) was used. SSE equation:

SSE = \sum_{i=1}^{N} (x_i - \hat{x}_i)^2    (6.1)

where x_i is the actual observation and \hat{x}_i is the forecasted value.

The mean was calculated for the first 15 days and used as the forecast, and the SSE was then calculated over the remaining days. The plot in Figure 6.4 shows the SSE of the mean prediction for 100 datasets. The idea was that if the behaviour remained the same, then using the mean frequency for forecasting would be appropriate. However, the plot demonstrates that the predictions were poor, with high error rates.


Figure 6.4: Mean forecasting the top 100 datasets.
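The mean-based forecast and its SSE (Equation 6.1) can be reproduced with a few lines of Python. The sketch below is purely illustrative: the 30-day access series is invented, and the first 15 days are used to compute the constant (mean) forecast, as in the analysis above.

    import numpy as np

    def mean_forecast_sse(daily_counts, train_days=15):
        """Forecast with the mean of the first `train_days` daily access
        counts and return the Sum of Squared Errors (Equation 6.1)
        over the remaining days."""
        history = np.asarray(daily_counts, dtype=float)
        mean_level = history[:train_days].mean()      # constant forecast
        errors = history[train_days:] - mean_level    # x_i - x_hat_i
        return float(np.sum(errors ** 2))

    # Example with a synthetic 30-day access series for one dataset.
    counts = [12, 11, 13, 12, 14, 12, 11, 13, 12, 12,
              13, 12, 11, 14, 12, 30, 28, 25, 27, 29,
              31, 28, 26, 30, 29, 27, 28, 30, 26, 29]
    print(mean_forecast_sse(counts))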

6.2.4 Auto-regression analysis

An auto-regression model functions under the assumption that past values have an effect on current values. This suggests that the past DAP might predict the current DAP. In an auto-regression model, future values are estimated based on a weighted sum of past values, as shown in Equation 6.2, where \phi_1 y_{t-1}, \ldots, \phi_p y_{t-p} are the weighted past observations of the model, \mu is a constant, and u_t is the error term.

Y_t = \mu + \sum_{i=1}^{p} \phi_i y_{t-i} + u_t    (6.2)

An auto-regression analysis was performed on the DAP dataset. Results of the auto-regression for the 100, 10,000 and 100,000 most accessed datasets are shown in Figures 6.5, 6.6 and 6.7 respectively. The left plot in each figure highlights the application of the auto-regression model to the study set (the set of data that was used to study the model), while the right plot in each figure shows the application of the auto-regression model to the testing set (the set of data that was not used to study the model). It is notable that some of the forecasting is good: the goodness of fit of the auto-regression model in Figures 6.5 and 6.7 shows that the model built on the training datasets fits the test datasets well, while Figure 6.6 shows a poor goodness of fit.

Figure 6.5: Auto-regression on 100 datasets.

Figure 6.6: Auto-regression on 10,000 datasets.


Figure 6.7: Auto-regression on 100,000 datasets.
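For reference, the AR(p) model of Equation 6.2 can be fitted by ordinary least squares over lagged values. The sketch below is a minimal, self-contained illustration (the access series is synthetic); the analysis in this section was carried out with traditional statistical tooling rather than this code.

    import numpy as np

    def fit_ar(series, p):
        """Least-squares fit of an AR(p) model following Equation 6.2
        (the error term is left implicit).  Returns [mu, phi_1..phi_p]."""
        y = np.asarray(series, dtype=float)
        # Build the lagged design matrix: one row per target value y_t.
        rows = [np.r_[1.0, y[t - p:t][::-1]] for t in range(p, len(y))]
        X, target = np.array(rows), y[p:]
        coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
        return coeffs

    def forecast_next(series, coeffs):
        """One-step-ahead forecast using the fitted coefficients."""
        p = len(coeffs) - 1
        lags = np.asarray(series[-p:], dtype=float)[::-1]
        return float(coeffs[0] + coeffs[1:].dot(lags))

    history = [12, 14, 13, 15, 14, 16, 15, 17, 16, 18, 17, 19, 18, 20, 19]
    coeffs = fit_ar(history, p=3)
    print(forecast_next(history, coeffs))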

6.2.5 Comparison of auto-regression and mean forecasting model

To compare the results of the mean forecasting model and the auto-regression model, the SSE for the datasets presented in Figures 6.1 and 6.2 is given in Tables 6.2 and 6.3. From the results, it can be noted that both prediction models were poor, as the error rates were very high.

Table 6.2: Summed error for mean model.

Dataset index       SUM(err d mean)
1-100               74.3357
10000-10100         134.2859
1000000-1000100     118.5729

Table 6.3: Summed error for auto-regression model.

Dataset index       SUM(err d mean)
1-100               58.0791
10000-10100         203.3239
1000000-1000100     324.8026

From this analysis it was apparent that time-series analysis was required. The idea for the DAP study was to aggregate accesses for each dataset by year, month, week, day, or hour (a time series) and subsequently feed this to a machine learning algorithm in order to obtain an access prediction for the next year, month, week, day, or hour. A regression algorithm was required for forecasting the DAP. However, an appropriate machine learning technique was also required, as the mean and auto-regression models did not provide good predictions. A suitable algorithm needed to be selected that could identify deep connections between the independent variables, which were not apparent in the analysis presented in Section 6.2. Additional independent variables were also required for finding a coherent pattern.

6.3 Machine Learning Techniques and Algorithms

It was important that the chosen Machine Learning algorithms supported the underlying technology used by the OLA, which was Apache Spark. Therefore, the Machine Learning Library (MLlib) from the Apache Spark ecosystem [54] and the Sparkling Water (H2O) [52] framework were used in the DAP study. These frameworks were expected to reduce the DAP training time significantly while assuring the needed level of accuracy. The adaptive model does not require retraining, as the model is actively updated. The Spark batch offline learning of the DAP should reduce the training time and handle the large volume of data, which has been an issue with the traditional solution as it was limited by the machine specification (disk space, memory, and CPU). Online training for the DAP, on the other hand, introduces the concept of incremental prediction and updating of the model. The main distinction between the two approaches is that while the batch approach supports historical DAP analysis, the streaming approach supports the study of current trends and has reduced training time requirements. The online method avoids retraining, which is accomplished by an incremental update approach using new training samples as they become available. Newly acquired training samples are used in conjunction with the batch model to form an adaptive model which not only looks at the past, but also the present, in order to create predictions. This approach fits harmoniously with the OLA, while maintaining an up-to-date model.

Many distributed machine learning methods exist that can be utilised for the DAP study. Supervised machine learning techniques such as Naive Bayesian, Random Forest, Gradient-Boosted Trees, and Decision Trees are well-used regression techniques [88]. However, the supervised K-Nearest Neighbour (KNN) algorithm is a simple technique which does not generalise the model, but instead uses the available data directly to make a decision [89]. The Deep Learning (DL) technique was also utilised in the DAP study as it is capable of identifying hidden interconnections between variables [17]. These qualities justified the decision to use them as supervised machine learning regression techniques for this research work, especially in terms of guaranteeing that the volume and velocity aspects of the data access pattern challenge were tackled.

6.3.1 Sparks K-Nearest Neighbours (KNN)

The Spark KNN is very simple yet works well in practice. The KNN algorithm is a learning algorithm that is non-parametric (it does not involve any assumptions about the underlying data distribution) and lazy (it does not use the training dataset to make any generalisations) [89]. The non-parametric characteristic is very useful in real-world applications, as practical data frequently do not obey typical theoretical assumptions [89]. The KNN algorithm makes a decision based on the entire training dataset. Therefore, it needs more processing time to predict, as it must go through the whole dataset, and more memory is required to store the whole dataset for processing. Each of the training datasets consists of a set of vectors (features) and their corresponding labels (expected predictions). There is also the "K" parameter (a numeric value) in KNN which needs to be passed. This parameter decides how many neighbours (identified using a distance metric with respect to the test data point) influence the regression [89].
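Spark MLlib does not ship a KNN implementation, so the distributed batch KNN used in this study is not reproduced here. The sketch below only illustrates the general idea of a parallel KNN regression on an RDD: the query point is broadcast, distances are computed in parallel, and the labels of the k nearest neighbours are averaged. The feature vectors and access counts are invented for illustration.

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("knn-dap-sketch").getOrCreate()
    sc = spark.sparkContext

    # Training data: (feature_vector, access_count) pairs, invented here
    # purely for illustration.
    train = sc.parallelize([
        (np.array([199.0, 0.0, 14.0]), 1.0),
        (np.array([239.0, 1.0, 16.0]), 55.0),
        (np.array([187.0, 3.0, 4.0]),  4.0),
        (np.array([10.0,  3.0, 4.0]),  25.0),
        (np.array([43.0,  1.0, 1.0]),  7.0),
    ])

    def knn_predict(query, k=3):
        """Broadcast the query point, score every training vector in
        parallel, and average the labels of the k nearest neighbours."""
        q = sc.broadcast(query)
        nearest = (train
                   .map(lambda fl: (float(np.linalg.norm(fl[0] - q.value)), fl[1]))
                   .takeOrdered(k, key=lambda dl: dl[0]))
        return sum(label for _, label in nearest) / k

    print(knn_predict(np.array([200.0, 1.0, 15.0])))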

6.3.2 Sparkling Water (H2O) and Deep Learning

DL is an evolving technique that models high-level patterns in data as complex multi-layered networks. DL can tackle the most challenging prediction problems as it employs general modelling [17]. Deep neural network techniques apply an artificial neural network as the baseline, with multiple hidden layers [17]. Traditional neural networks are very challenging to train and usually do not perform better than other machine learning techniques [17]. DL, on the other hand, provides an optimisation framework to improve performance and accuracy compared with the other available techniques, as shown in Figure 6.8.

Figure 6.8: Deep Learning versus other Machine Learning techniques [17].

The four diagrams in Figure 6.8 illustrate how different techniques were used to model a complicated scenario. The Generalised Linear Model (GLM) fits a straight line through the data. Tree-based ensemble methods, such as Distributed Random Forests (DRF) and Gradient Boosted Machines (GBM), perform better than GLM as these methods fit many straight lines through the data, improving the model fit. DL fits complex curves to the data, delivering the most accurate model [17].

In DL the essential unit in the model is the neuron, as shown in Figure 6.9 [17]. In this model, the weighted combination \alpha = \sum_{i=1}^{n} w_i x_i + b of the inputs is aggregated, and then an output f(\alpha) is transmitted by the connected neuron. Each neuron has a set of inputs x_i, each of which is given a specific weight w_i. The function f represents the non-linear activation function (which defines the output of a neuron given an input or set of inputs) utilised throughout the neural network, and the bias b depicts the neuron's activation threshold [17].

Figure 6.9: Single neuron unit [17].
Figure 6.10: Multilayer of interconnected neuron units [17].

Multilayer, feed-forward neural networks contain many layers of interconnected neuron units, as shown in Figure 6.10 [17]. They consist of an input layer, multiple layers of non-linearity, and a linear regression (or classification) layer to match the output. The inputs and outputs of the multilayer model's units obey the fundamental logic of the single neuron shown in Figure 6.9. DL can recognise hidden intercommunication between features, learn low-level features from basic processed raw data, and work with large numbers of features and with unlabelled data. Therefore, it can be used to build more accurate predictive models. DL detects interactions among features (two or more variables acting in combination) automatically, which improves the model. Traditional predictive modelling techniques can detect these interactions, but only with manual intervention. However, a DL model is very complex to understand, as it can have many layers and nodes, and it utilises a computationally intensive technique. Due to the high volume of data and the computing requirements, however, machine learning techniques are being developed to support distributed computing frameworks, and the computing issue with DL can be handled with such a framework. Sparkling Water (H2O) is such a framework, supporting scalable machine learning and DL techniques on a distributed computing cluster. H2O supports in-memory computation and is designed to run on a Hadoop cluster (and it supports Spark clusters), hence it can be deployed on the OLA architecture. H2O's DL is based on a multilayer feed-forward artificial neural network, in which data flows in the forward direction only, trained with stochastic gradient descent (an iterative optimisation algorithm) using backpropagation, so it repeatedly processes data to improve the accuracy of its predictions [17].
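A single neuron and a small feed-forward pass can be written directly from the formula above. The sketch below (plain NumPy, random weights) is purely illustrative of Figures 6.9 and 6.10 and is not the H2O implementation.

    import numpy as np

    def neuron(x, w, b, f=np.tanh):
        """Single unit from Figure 6.9: output f(alpha) with
        alpha = sum_i w_i * x_i + b."""
        return f(np.dot(w, x) + b)

    def feed_forward(x, layers, f=np.tanh):
        """Minimal multilayer pass (Figure 6.10): each layer is a
        (weight_matrix, bias_vector) pair; the last layer is linear
        to produce a regression output."""
        h = x
        for i, (W, b) in enumerate(layers):
            z = W.dot(h) + b
            h = z if i == len(layers) - 1 else f(z)
        return h

    x = np.array([0.2, 0.7, 0.1])                    # toy input features
    layers = [(np.random.randn(4, 3), np.zeros(4)),  # hidden layer
              (np.random.randn(1, 4), np.zeros(1))]  # linear output layer
    print(feed_forward(x, layers))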

6.4 Design of the Data Access Pattern Intelligence Layer

This section demonstrates a few important components of the intelligence layer for the DAP study.

6.4.1 Deep Learning Model for the Data Access Pattern Study

Figure 6.11 shows the DL model for the DAP. The method adopted was to train the DL model with each dataset individually. Each variable (i.e. day, week, month or year) is represented by a neuron for the dataset selected for prediction. The following features are passed into the input neurons: eventType, userID, typeOfDay, dayGroup, day, month, year and timeInTerms. Providing as many input neurons as possible will improve the knowledge of each dataset in the model. The hidden layers are used for learning the interconnections between neurons that are not visible at a high level. A prediction is made based on the learned patterns. This process is carried out on each individual input dataset. The training time would be very long if the model were trained for each dataset individually; with the help of streaming online learning, the training time can be reduced, as the model is updated when new data become available.

Figure 6.11: Deep Learning model for the DAP study.
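A hedged sketch of how such a model could be trained with H2O's Python API is shown below. The file path and the label column name (accessCount) are placeholders; in the OLA the data would be supplied through Sparkling Water from Spark rather than from a local CSV file, and the layer sizes and number of epochs are arbitrary illustrative values.

    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()  # in the OLA this would attach to the Sparkling Water cluster

    # Training frame with the input features named in the text; the CSV
    # path and the label column name are placeholders for the prepared DAP data.
    frame = h2o.import_file("dap_training.csv")
    features = ["eventType", "userID", "typeOfDay", "dayGroup",
                "day", "month", "year", "timeInTerms"]

    model = H2ODeepLearningEstimator(hidden=[50, 50],   # two hidden layers
                                     epochs=10,
                                     activation="Rectifier")
    model.train(x=features, y="accessCount", training_frame=frame)

    predictions = model.predict(frame)   # forecasts of the access count
    print(model.rmse())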


6.4.2 KNN Model for the Data Access Pattern Study

Figure 6.12 shows a high-level overview of the KNN model for the DAP. The selected and pre-processed training data instances are passed into the KNN algorithm, and sample data are extracted and used for prediction and model validation.

Figure 6.12: Traditional KNN model on Orange Canvas.

Table 6.4 shows the raw data and their corresponding ML features, converted into a format supported by the KNN model. The KNN model uses the following features in training and prediction, as shown in Table 6.4: dataset ID (datasetIn), event type (eventTypeIn) and user ID (usrIn). The count column shows the number of accesses for each dataset, which was used as the labelled point in supervised training.


Table 6.4: An example of raw data and prepared data (in vector form) for KNN training and prediction.

dataset               eventType   usr                   count   datasetIn   eventTypeIn   usrIn   features vector
panda.um.user.tik...  put sm a    b30926ba0a47ea7fb...  1.0     199.0       0.0           14.0    [199.0 0.0 14.0]
user.dfreebor.117...  put sm a    d41d8cd98f00b204e...  1.0     156.0       0.0           0.0     [156.0 0.0 0.0]
hc test.gangarbt....  put sm      d41d8cd98f00b204e...  1.0     121.0       2.0           0.0     [121.0 2.0 0.0]
mc12 8TeV.206751....  put sm      d41d8cd98f00b204e...  1.0     42.0        2.0           0.0     [42.0 2.0 0.0]
mc15 13TeV:mc15 1...  get sm      f876cbbf7bd93659a...  1.0     10.0        3.0           10.0    [10.0 3.0 10.0]
panda.um.user.mak...  put sm a    d41d8cd98f00b204e...  1.0     5.0         0.0           0.0     [5.0 0.0 0.0]
panda.um.user.mak...  put sm a    a42a65c5e4dff8422...  1.0     111.0       0.0           1.0     [111.0 0.0 1.0]
data11 7TeV:data1...  get sm a    fc987d5e531046f1a...  1.0     248.0       1.0           2.0     [248.0 1.0 2.0]
user.gangarbt.hc2...  put sm a    d41d8cd98f00b204e...  1.0     219.0       0.0           0.0     [219.0 0.0 0.0]
data15 13TeV:data...  get sm a    76bcbb8d6fed0b7e1...  1.0     200.0       1.0           33.0    [200.0 1.0 33.0]
panda.0228214916....  get sm a    a42a65c5e4dff8422...  1.0     8.0         1.0           1.0     [8.0 1.0 1.0]
mc15 13TeV.361824...  put sm      d41d8cd98f00b204e...  1.0     208.0       2.0           0.0     [208.0 2.0 0.0]
mc12 8TeV:mc12 8T...  get sm a    6284a2a2697608593...  55.0    239.0       1.0           16.0    [239.0 1.0 16.0]
mc15 13TeV:mc15 1...  get sm      84850d67f30bee551...  4.0     187.0       3.0           4.0     [187.0 3.0 4.0]
mc15 13TeV:mc15 1...  get sm      84850d67f30bee551...  25.0    10.0        3.0           4.0     [10.0 3.0 4.0]
panda.0228154359....  get sm a    a42a65c5e4dff8422...  7.0     43.0        1.0           1.0     [43.0 1.0 1.0]
user.dfreebor.117...  put sm a    d41d8cd98f00b204e...  1.0     236.0       0.0           0.0     [236.0 0.0 0.0]
mc15 13TeV.361800...  put sm      d41d8cd98f00b204e...  1.0     21.0        2.0           0.0     [21.0 2.0 0.0]
panda.0227074942....  get sm a    6284a2a2697608593...  5.0     215.0       1.0           16.0    [215.0 1.0 16.0]
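The conversion from raw trace records to indexed features and a feature vector, as illustrated in Table 6.4, can be expressed with Spark ML's StringIndexer and VectorAssembler. The records in the sketch below are truncated, invented examples in the shape of the table; the exact indexing scheme used in the study may differ.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    spark = SparkSession.builder.appName("dap-vectorise").getOrCreate()

    # A few raw trace records shaped like Table 6.4 (values truncated).
    raw = spark.createDataFrame(
        [("panda.um.user.tik", "put sm a", "b30926ba0a47ea7fb", 1.0),
         ("mc12 8TeV:mc12 8T", "get sm a", "6284a2a2697608593", 55.0),
         ("mc15 13TeV:mc15 1", "get sm",   "84850d67f30bee551", 4.0)],
        ["dataset", "eventType", "usr", "count"])

    # Map each categorical column to a numeric index (datasetIn,
    # eventTypeIn, usrIn in the table), then assemble the feature vector.
    df = raw
    for col in ["dataset", "eventType", "usr"]:
        df = StringIndexer(inputCol=col, outputCol=col + "In").fit(df).transform(df)

    assembler = VectorAssembler(inputCols=["datasetIn", "eventTypeIn", "usrIn"],
                                outputCol="features")
    training = assembler.transform(df).select("features", "count")
    training.show(truncate=False)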

6.4.3 Batch and Online Models for the Data Access Pattern Study

As the intelligence layer adopts the OLA, it inherits the pure streaming layer, the pure batch layer, and the combination of batch and streaming layers. Therefore, streaming was used for developing a pure online model for access prediction. For example, Figure 6.13 (a) shows that the online model uses the previous seven days of data points (D1...Dn) for training and for predicting the eighth day. In order to be an active and adaptive online model, at the end of each day the streaming job incrementally adds the new day to the model while removing the oldest day, as shown in Figure 6.13 (b). The principle of this model is that it is a long-lived online predictor which continuously trains, predicts, and produces reports on streaming events. Although used for daily prediction in this study, this model can be used for hourly prediction in a rapidly changing environment or monthly prediction in a more slowly changing environment.


Figure 6.13: Streaming online model. (a) shows that the online model uses the previous seven days of data points for prediction, whereas (b) shows the incremental movement of each new day's data points into the model.

Figure 6.14 shows the combination of batch and online active adaptive prediction models. The batch training uses historical data to build a model. Then the streaming online prediction job loads the batch model, stores it into the state, updates the model as the new events are streamed, and then makes prediction. Regular batch training is not required as the model is actively shaped as new events arrive eliminating the time required for retraining. This model was built in such a way that it can support year, month, week, day, or hour predictions, and the D1...Dn range can be anything based on the application scenario.


Figure 6.14: Batch and Streaming online model.
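The batch-seeded online update pattern of Figure 6.14 can be illustrated with Spark Streaming's built-in streaming regression. The sketch below substitutes a streaming linear regressor for the KNN/DL models used in this study, purely to show the mechanism of loading batch-layer weights and refining them on each micro-batch; the HDFS paths, record format, and hyperparameters are assumptions.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

    sc = SparkContext(appName="dap-online-sketch")
    ssc = StreamingContext(sc, 60)   # one micro-batch per minute

    def parse(line):
        # Assumed record format: "count,feat1,feat2,feat3"
        values = [float(v) for v in line.split(",")]
        return LabeledPoint(values[0], values[1:])

    train_stream = ssc.textFileStream("hdfs:///dap/stream/train").map(parse)
    test_stream = ssc.textFileStream("hdfs:///dap/stream/test").map(parse)

    # Seed the online model with weights produced by the batch layer, then
    # let every new micro-batch refine them (no periodic retraining).
    model = StreamingLinearRegressionWithSGD(stepSize=0.01, numIterations=20)
    model.setInitialWeights([0.0, 0.0, 0.0])   # replace with batch-layer weights

    model.trainOn(train_stream)
    model.predictOnValues(test_stream.map(lambda lp: (lp.label, lp.features))).pprint()

    ssc.start()
    ssc.awaitTermination()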

6.4.4 Active and Adaptive Learning

An active and adaptive learning algorithm collects new examples streamed in to update the model rather than only using the initial batch processing training model. This model takes raw data as input and preprocesses it to remove unnecessary parts of the data (e.g. IP address) as there are more than 50 elements in each instance of the DAP metadata. Then it converts the selected features into a vector with a labelled point. At a regular interval, it updates the access pattern model using an online process.

6.4.5 Dataset Access Pattern Training Algorithm

Each compute node in the cluster trains a copy of the global model parameters on its local DAP dataset concurrently, and regularly shares the weights to the global model through model averaging across all the nodes in the network (as shown in Algorithm 6.1).


Algorithm 6.1 Training the DAP model
1: Initialise a global model
2: Access the DAP training dataset from the distributed nodes
3: Iterate over the DAP dataset in parallel and do:
4:    Get a copy of the global model
5:    Select sample data from each iteration
6:    Train and update all weights locally
7: Update the global model by combining and averaging the local weights across all nodes
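A minimal sketch of this model-averaging scheme is given below: each partition refines a copy of the global weights with local stochastic gradient descent on a toy linear model, and the driver averages the results over several rounds. It illustrates the structure of Algorithm 6.1 rather than the actual DAP training code.

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dap-model-averaging").getOrCreate()
    sc = spark.sparkContext

    def local_train(partition, global_w, lr=0.1, epochs=10):
        """Steps 4-6 of Algorithm 6.1: copy the global weights and refine
        them with SGD on the local slice of the data."""
        w = np.array(global_w, dtype=float)
        rows = list(partition)
        for _ in range(epochs):
            for features, label in rows:
                x = np.asarray(features, dtype=float)
                w -= lr * (w.dot(x) - label) * x
        yield w

    # Toy (features, access_count) samples spread over four partitions.
    data = sc.parallelize(
        [([1.0, 0.2], 1.5), ([1.0, 0.4], 2.1), ([1.0, 0.6], 2.4),
         ([1.0, 0.8], 3.0), ([1.0, 1.0], 3.6), ([1.0, 1.2], 4.1)],
        numSlices=4)

    global_weights = np.zeros(2)
    for round_ in range(10):                      # repeated averaging rounds
        w_broadcast = sc.broadcast(global_weights)
        local = data.mapPartitions(
            lambda part: local_train(part, w_broadcast.value)).collect()
        global_weights = np.mean(local, axis=0)   # step 7: combine and average

    print(global_weights)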

6.5 Evaluation of the Adaptive Data Access Pattern Model

In this section an evaluation of the proposed adaptive DAP model is presented.

6.5.1 Experiment setup

The following quality metrics were used to evaluate the prediction model:

1. The Root Mean Squared Error (RMSE) shows how close a fitted line is to a set of data points. The calculation is done by taking the distances from the data points to the regression line and squaring them. These distances are referred to as "errors". The squaring is done to prevent negative values cancelling positive values. The average of the set of squared errors is then taken and its square root calculated. The smaller the RMSE, the closer the fit is to the data [90]. In Equation 6.3, y_i is the actual value for a given input i, N is the total number of fitted data points, and \hat{y}_i is the fitted forecast value for the input i.

RMSE = \sqrt{ \frac{ \sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2 }{ N } }    (6.3)

2. The coefficient of determination (R^2) is applied in order to analyse how variations in one variable can be explained by variations in a second variable. The coefficient of determination is related to the correlation coefficient (R), which illustrates how strong the linear relationship between two variables is; R^2 is the square of the correlation coefficient [91]. It can be used to estimate how many data points fall on the regression line fitted by the model. In Equation 6.4, y_i is the actual value for a given input i and \bar{y} is the mean of the actual values. A short computational sketch of both metrics is given after Equation 6.4.

R^2 = 1 - \frac{ \sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2 }{ \sum_{i=0}^{N-1} (y_i - \bar{y})^2 }    (6.4)
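Both quality metrics follow directly from Equations 6.3 and 6.4, for example (the sample values are invented):

    import numpy as np

    def rmse(actual, predicted):
        """Root Mean Squared Error, Equation 6.3."""
        actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
        return float(np.sqrt(np.mean((actual - predicted) ** 2)))

    def r_squared(actual, predicted):
        """Coefficient of determination, Equation 6.4."""
        actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        return float(1.0 - ss_res / ss_tot)

    y_true = [3.0, 5.0, 2.5, 7.0]
    y_pred = [2.8, 5.3, 2.9, 6.6]
    print(rmse(y_true, y_pred), r_squared(y_true, y_pred))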

6.5.2 Data Preparation for Evaluation

On average, five million data access event logs are recorded each day (approximately 1.5 GB in size after compression). Due to this volume, using the whole dataset for training with the traditional data mining tool is impossible, and sampling of the dataset is necessary. However, sampling is not required with the proposed intelligence layer as it inherits the OLA, so it can support the processing of large volumes of data by splitting the dataset and distributing it to multiple nodes while executing the training algorithm on each portion of the data concurrently. The formulated dataset has a series of independent variables, such as the dataset ID and event type. It also contains a continuous label (the actual access value, i.e. the access count), which is the dependent variable.

Many experiments were carried out to identify the accuracy and efficiency of the DAP prediction. The Orange Canvas [86], a data mining toolbox, was used to evaluate the DAP on a traditional model. For this test the training dataset had to be sampled, so a dataset for training as well as a separate dataset for testing and validating the technique was required. A random sampling of the data was carried out for the traditional model training and prediction. The same training dataset was used for predictions on both the traditional model and on the distributed intelligence layer.

6.5.3 Performance Evaluation of the DAP on a Traditional Model

Orange Canvas's KNN regression algorithm was employed for the performance and accuracy evaluation of the traditional model. The same dataset used for training the model was used to test the prediction; however, the label used in supervised learning was removed. Figure 6.15 shows the training and testing times of the traditional KNN regression algorithm. It can be observed that as the number of training data instances varied from 1,000 to 150,000, the training time ranged from one second to 2,750 seconds. The traditional machine struggled (freezing) as the number of training and testing instances was increased.

Figure 6.15: Traditional KNN training and prediction time over various data instances.

6.5.4 Evaluation of Accuracy of DAP on a Traditional System

Figure 6.16 shows the prediction error rate (RMSE) ranging from a lower error rate of 0.26 to a higher error rate of 0.37. It was evaluated on the same dataset used for the performance evaluation. It also shows that the error rate dropped as a high coefficient of determination (R2) was identified.


Figure 6.16: Traditional KNN prediction error rate.

6.5.5 Performance Evaluation of the Intelligence Layer for the DAP Study

This section presents a performance evaluation of the pure batch model and of the combination of the batch and online models.

Evaluation of the batch training

A KNN algorithm and a DL algorithm for the DAP study were developed as described in Section 6.4 to support the distributed nature of the OLA. Figure 6.17 shows the training and prediction time of the distributed batch KNN. The amount of time it took to train and test the DAP on the traditional KNN model using 100,000 data instances on a single node was 1981 seconds, while the distributed batch KNN took ∼541 seconds using four executors with two cores and 4096 MB of memory for each executor (i.e. 4096 × 4 = 16384 MB in total). The distributed batch KNN was able to train and predict about three times faster than the traditional KNN. However, the traditional KNN took less time to train and predict than the distributed batch KNN when the number of data instances was less than 5,000. This is due to the overhead incurred by the distributed batch KNN in finding and allocating resources.

Figure 6.17: Distributed batch KNN and Traditional KNN training and prediction time over various data instances.

In order to evaluate the DL approach, the training data was split by day for each dataset and subsequently fed into the model. The training data ranged from seven days to 48 days. For example, for the seven-day evaluation the DL model was trained with seven days of data and predicted the 8th day. As expected, 48 days of events was large (62 GB), and it took a long time to pre-process the data for input to the DL model. The data were pre-processed before conducting the performance evaluation (so pre-processing time has been discarded in this evaluation). The prepared training data of 48 days was reduced to 1.1 GB (2.8 million instances). Figure 6.18 shows the performance of the training. Eight executors were used to train the DL model and each executor was allocated 12 GB of memory and four CPU cores. The DL model has to train each dataset individually, and it must train on many instances, so it is a resource-intensive process. Training the DL model was expensive in terms of execution time even though large computing resources were allocated.


Figure 6.18: Distributed batch DL training and prediction for different sized training datasets.

Assessment of the adaptive model training

The training of the distributed batch KNN and DL models was faster than that of the traditional model, particularly in the case of KNN. However, distributed batch training requires human intervention to update the model, and it is a time-consuming process as all historical data need to be used for retraining. With the adoption of the OLA, the streaming layer was able to process the incoming events and update the model as they were streamed in, eliminating the retraining time.

6.5.6 An Accuracy Evaluation of the Intelligence Layer for the DAP Study

This section presents an evaluation of the accuracy of the pure batch and the combination of batch and online models. Accuracy comparisons were made between the simplest KNN method and the advanced DL approach.


Evaluation of the batch training

Figure 6.19 demonstrates that the accuracy of the distributed batch KNN is comparable to that of the traditional KNN. However, the distributed batch KNN has a slightly higher error rate (lowest: 0.29 and highest: 0.39) and lower variance than the traditional model. It also shows that the error rate (RMSE) dropped as a high coefficient of determination (R^2) was identified.

Figure 6.19: Distributed batch KNN and Traditional KNN prediction error rate.

Figure 6.20 shows the prediction accuracy of the DL model. In general, the DL model predicted well, with a lower error rate of 0.095 in comparison to the other models, though it is expensive in terms of resources and time. The DL model also found a high variance between variables. It is understandable why this model had the lowest error rate, as each dataset was trained and predicted individually.


Figure 6.20: Distributed batch DL prediction error rate.

In addition to the KNN and DL techniques, the Linear Regression (LR), Decision Tree (DT), and Random Forest (RF) methods were evaluated in batch processing for the DAP study, using 100,000 training data instances.

LR is a very basic method used for predictive analysis. Regression estimates are used to describe the data and the relationship between a dependent variable and one or more independent variables. The LR analysis consists of three main stages: analysing the correlation and directionality of the data, computing the model (i.e. fitting the line), and evaluating the validity and usefulness of the model [88]. Figure 6.21 shows the actual number of data accesses against the predicted number of data accesses for a random sample of data using LR. The RMSE for the LR prediction was 0.51, which is clearly a high error rate.


Figure 6.21: Plot showing the actual against the predicted dataset accesses for Linear Regression.

A DT is a structure that uses a branching approach in order to demonstrate every potential outcome of a decision, and includes a root node, branches, and leaf nodes. In a DT, each internal node signifies a test on an attribute, each branch signifies the outcome of a test, and each leaf node holds a continuous value (regression) or class label (classification) [88]. Figure 6.22 shows the actual data accesses against the predicted accesses for a random sample of data using DT. The RMSE for the DT prediction was 0.32, a better error rate than that of the LR. The plot shows that the predicted access pattern is much closer to the actual pattern than the LR prediction.


Figure 6.22: Plot showing the actual against the predicted dataset accesses for Decision Tree.

An RF approach consists of a large number of DTs and is known as an ensemble approach. All information is supplied to every DT, and the most frequent outcome for each piece of information is accepted as the final output. The RF corrects the DTs' overfitting issue by discarding random noise in the data and by picking up the underlying relationship between the data. The principle behind an ensemble approach is that a collection of weak models can be put together to form a strong model [88]. Figure 6.23 shows the actual data accesses against the predicted accesses for a random sample of data using RF. The RMSE for the RF prediction was 0.26, which is a better error rate than those of the LR and DT. The plot shows that the predicted access pattern is much better than the LR and DT predictions, and the prediction is well aligned with the actual number of data accesses.


Figure 6.23: Plot showing the actual against the predicted dataset accesses for Random Forest.
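The batch evaluation of these regressors can be expressed with Spark ML's standard estimators and the RMSE evaluator. The sketch below is illustrative only: the Parquet path is a placeholder for the prepared DAP training data (a DataFrame with "features" and "count" columns), and the train/test split and tree count are arbitrary.

    from pyspark.sql import SparkSession
    from pyspark.ml.regression import (LinearRegression, DecisionTreeRegressor,
                                       RandomForestRegressor)
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.appName("dap-batch-regressors").getOrCreate()

    # Placeholder path for the prepared DAP training data.
    training = spark.read.parquet("hdfs:///dap/prepared/training")
    train, test = training.randomSplit([0.8, 0.2], seed=42)

    evaluator = RegressionEvaluator(labelCol="count", predictionCol="prediction",
                                    metricName="rmse")

    # Fit each regressor and report its RMSE on the held-out split.
    for estimator in (LinearRegression(labelCol="count"),
                      DecisionTreeRegressor(labelCol="count"),
                      RandomForestRegressor(labelCol="count", numTrees=50)):
        model = estimator.fit(train)
        print(type(estimator).__name__, evaluator.evaluate(model.transform(test)))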

Assessment of the adaptive model training

The pure streaming model of both KNN and DL did not give a good prediction as it had to build a model purely based on streaming events. With time, however, it was able to improve the model. A pure online model can be used when there is no historical data available. It is best to use the OLA model for prediction, which is the combination of both batch and streaming layers. As already demonstrated, a distributed batch model provides good predictions, particularly when using the DL technique. Therefore, the role of the streaming layer was to load the batch model and to actively adapt the model as new events were streamed in. The observation of the OLA-based model prediction showed it was comparable to the pure batch-based model, though minor differences were present due to the dynamic training.

6.5.7 Scalability Evaluation of the Intelligence Layer for Studying DAP

The OLA evaluation presented in Chapter 5 has already demonstrated that it scales well. Therefore, the deployed distributed batch algorithms inherit the characteristics of the OLA. Allocating an appropriate number of processing executors and parallel tasks, and an appropriate amount of memory, reduces the training and prediction time considerably for the proposed approaches.


Figure 6.24 shows the basic throughput of the distributed batch KNN, in terms of the number of data instances trained and predicted per second and the total execution time as the number of processing executors was increased. The evaluation was carried out with 100,000 data instances.

Figure 6.24: Scalability of the distributed batch KNN training and prediction.

Figure 6.25 shows the scalability of the DL training. The plot shows how changing the number of executors impacts the performance. For this test, the amount of memory used for each executor was fixed at 4 GB with two CPU cores. The evaluation was carried out using 14 days of training data. The plot demonstrates that performance improved roughly linearly as the number of executors was increased. However, between eight and ten executors the performance stabilised due to resource allocation and intermediate data distribution between executors.


Figure 6.25: Scalability of the distributed batch DL training and prediction.

6.6 Summary

To distribute data more efficiently across a network for analysis, prediction of the future data access pattern is needed. The underlying assumption was that historic user behaviour could be used to make predictions about future data access patterns. This chapter presented a data access pattern study using an intelligence layer which employs an active and adaptive model for prediction. The batch model was built using historical training data and was then actively adapted and updated using an online model as new events were streamed in. The built model was deployed on the Optimised Lambda Architecture for evaluating data from the ATLAS distributed data management system. The model was explored with a simple algorithm, K-Nearest Neighbour, and an advanced technique, Deep Learning. The data access pattern model was developed with the Spark and Spark Streaming stack with the intention of inheriting their parallelism, scalability, low-latency computation, and performance properties. The performance and accuracy of the proposed prediction models show that they can be used to provide efficient data distribution decisions. The Deep Learning prediction presented a lower error rate in comparison to the other models studied. The ATLAS group's current prediction model error rate is 0.24 (RMSE) [83], whereas the proposed Deep Learning model demonstrated that it was able to predict with a 0.095 error rate. However, the proposed model did not use the same training dataset as that used by the ATLAS group.

Chapter 7

Conclusions and Future Work

In this chapter, a summary of the foremost contributions and limitations of the research carried out for this thesis is presented. Opportunities for future work are highlighted.

7.1 Conclusions

Scientific infrastructures are highly distributed and heterogeneous platforms with various middleware characteristics, job submission and execution tools, and diverse methods of transferring and accessing datasets. The high level of computation activity and the distributed nature of the infrastructure make the system extremely complex. Taking this into consideration, the probability of failures or inefficiencies is high compared with more traditional systems. Efficient monitoring is necessary in order to present a comprehensive strategy to recognise and resolve any issues within the infrastructure. This is an important determinant in the overall effective utilisation of resources.

In this research, an investigation into evolving approaches to architecting a scalable data store and analytics platform for real-time monitoring of data-intensive scientific infrastructure was conducted. Various methods were studied and developed into a state-of-the-art monitoring system. While numerous techniques exist, the Lambda Architecture drew substantial interest and was adopted on a smaller scale.

In real-world scientific settings, monitoring systems traditionally use a relational database for storage and analytics. However, the traditional method is no longer optimal due to the aforementioned characteristics of Big Data. The use of Big Data technologies and approaches, such as distributed computing and parallel frameworks, for a large-scale real-time monitoring system for scientific infrastructure is considered an innovative research area.

In this thesis, the Lambda Architecture (LA) has been recognised as an empowering approach for a monitoring system that supports scalability and real-time operation. This was deemed an essential baseline for research work performed towards optimising, establishing, presenting, and evaluating a flexible, collaborative, distributed, scalable, real-time monitoring system. This architecture is designed to be able to scale at the rate mandated by the continuously evolving volume, velocity, and variety of monitoring events. However, the LA was complex to implement and maintain, and its layers had to be synchronised in order to serve the computed statistics. To mitigate this complexity, this thesis proposes and evaluates an Optimised Lambda Architecture (OLA): a scalable distributed monitoring system built on the real-time Spark ecosystem. This was achieved by utilising Spark batch computation and a Spark streaming layer, which enables code to be reused between the layers. Streaming was used for incremental calculation to serve the real-time state of the infrastructure, while batch computation was scheduled to run to correct any inaccuracies caused by the real-time computations. The OLA delivered a five-fold speed improvement when compared to its traditional counterpart and a two-fold improvement when compared with the LA.

Parts of the infrastructure could not be monitored by the conventional system due to the high volume and velocity of the data. However, the OLA demonstrated that it had no issue with accommodating and processing large datasets at high speed while the data were propagated. While monitoring scientific infrastructure is a computationally intensive process, the parallel processing framework and in-memory computation of the OLA allows for faster computation, with execution times that were not feasible with the traditional system.

The OLA is highly efficient in supporting data-intensive use cases, and it scales well with increasing dataset size. The OLA has saved operational time as well as computation time in comparison to the traditional architecture used by the WLCG. With the traditional workflow, consisting of local filesystems (dirq), local collectors, and PL/SQL computations, several hours per week were dedicated to cleaning up the partitions of the machines and maintaining services. However, the resiliency, stability, low latency, and ease of recovery and recomputation of the OLA have reduced the operational time to almost zero; an estimated 0.5 days/week is saved with the OLA now implemented by the WLCG. In terms of performance, the OLA does not require recomputation because it utilises real-time computation, whereas the traditional architecture recomputes at regular intervals. The OLA batch layer reduced computation time by a factor of five in comparison with the traditional PL/SQL system.

In order to further improve the OLA with an intelligence layer, a data access pattern prediction study utilising an active and adaptive model was presented. A data access pattern study was selected solely based on its ability to complement Big Data characteristics. This study should open avenues to automating other use cases such as the detection of failures and anomalous accesses. The intelligence layer for studying the data access pattern provides heterogeneous machine learning techniques on an offline dataset (batch layer) and an online dataset (streaming layer). The intelligence layer has been incorporated into the OLA and it tries to achieve a balance between maximising resources, minimising economic cost, and performance when predicting the dataset access pattern with high accuracy.

The combination of the proposed efficient data acquisition techniques, the OLA, and the intelligence layer designed and evaluated in this thesis is believed to contribute a unified approach towards a scalable, low-latency, intelligent, high-performance, and commodity-based monitoring architecture for data-intensive scientific infrastructures. The outputs of this research will be applicable to any scientific infrastructure as well as any commercial infrastructure, as was demonstrated with the Amazon Cloud System.


7.2 Future Work

In this thesis, a single scientific infrastructure was evaluated, though the evaluation was carried out on two independent experiment datasets (CMS and ATLAS). For a further understanding of the proposed architecture, it would be interesting to evaluate another scientific or commercial infrastructure, for example, a smart grid for which Big Data has become a significant concern due to the rapid deployment and use of digital devices such as smart meters (electronic devices that register consumption of electric energy) and phasor measurement units (devices that measure electrical waves).

Further research opportunities also exist in terms of the financial perspective of the architecture, especially with regard to the use of a cloud system where a pay-per-use metering system could be employed. An evaluation could be carried out to see whether deploying the architecture in a cloud is better than building it on an in-house cluster. This study would require a detailed evaluation of current requirements as well as future needs.

The intelligence layer could be enhanced by using an alerting mechanism to notify users when it predicts there will be a sudden demand for a dataset, automatically replicating the dataset in demand.

Bibliography

[1] Amir Gandomi and Murtaza Haider. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2):137–144, 2015.

[2] Julia Andreeva, Max Boehm, Benjamin Gaidioz, Edward Karavakis, Lukasz Kokoszkiewicz, Elisa Lanciotti, Gerhild Maier, William Ollivier, Ricardo Rocha, Pablo Saiz, et al. Experiment dashboard for monitoring computing activities of the lhc virtual organizations. Journal of Grid Computing, 8(2):323–339, 2010.

[3] Michael Hausenblas and Nathan Bijnens. Lambda architecture. http://lambda-architecture.net/, 2014.

[4] J. Forgeat. Data processing architectures: lambda and kappa. http://www.ericsson.com/research-blog/data-knowledge/data-processing-architectures-lambda-and-kappa, 2015. [Online; accessed 25-05-2016].

[5] Apache. Drill. http://drill.apache.org, 2015. [Online; accessed 25-05-2014].

[6] Cloudera. Impala. http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html, 2015. [Online; accessed 25-05-2014].

[7] Facebook. Presto. http://prestodb.io, 2015. [Online; accessed 25-05-2014].

[8] Apache. Apache storm. https://storm.apache.org, 2012. [Online; accessed 12-04- 2016].


[9] Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. S4: Distributed stream computing platform. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, pages 170–177. IEEE, 2010.

[10] Amazon. Kinesis. 2014. [Online; accessed 25-05-2014].

[11] Apache. Samza. 2014. [Online; accessed 25-05-2014].

[12] Apache. Spark streaming. 2014. [Online; accessed 25-05-2014].

[13] CERN. LHC physics data taking gets underway at new record collision energy of 8TeV. http://press.web.cern.ch, 2012. [Online; accessed 18-12-2015].

[14] WLCG. Worldwide LHC Computing Grid. http://wlcg.web.cern.ch. [Online; accessed 20-12-2015].

[15] WLCG. WLCG Experiment Dashboard. http://dashboard.cern.ch, 2012. [Online; accessed 01-10-2016].

[16] M.V Georgiou and L. Magnoni. Real-time statistic analytics for the WLCG Transfers Dashboard with Esper. Technical report, CERN, Geneva, Sept 2014.

[17] Arno Candel, Viraj Parmar, Erin LeDell, and Anisha Arora. Deep learning with h2o. H2O. ai Inc.,, 2016.

[18] T. White. Hadoop: The Definitive Guide, 3rd ed. Yahoo press, 2012.

[19] Nathan Marz and James Warren. Big Data. Principles and best practices of scalable realtime data systems. Manning Publications, 2015.

[20] Joseph M Hellerstein, Michael Stonebraker, and James Hamilton. Architecture of a database system. Now Publishers Inc, 2007.

[21] Jim Brandt, Karen Devine, and Ann Gentile. Infrastructure for in situ system moni- toring and application data analysis. In Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, pages 36–40. ACM, 2015.

171 Bibliography

[22] Vladim´ır Mar´ık, JL Martinez Lastra, and Petr Skobelev. Industrial applications of holonic and multi-agent systems. In 6th International Conference, HoloMAS. Springer, 2013.

[23] Fernando P. Garcia Martinez, Antonio R and Rafael Marin-Lopez. Architectures and protocols for secure information technology infrastructures, 2014.

[24] J Andreeva, A Beche, S Belov, I Dzhunov, I Kadochnikov, E Karavakis, P Saiz, J Schovancova, and D Tuckett. Processing of the wlcg monitoring data using nosql. In Journal of Physics: Conference Series, volume 513, page 032048. IOP Publishing, 2014.

[25] Fangjin Yang, Eric Tschetter, Xavier L´eaut´e,Nelson Ray, Gian Merlino, and Deep Ganguli. Druid: a real-time analytical data store. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 157–168. ACM, 2014.

[26] Subhas Chandra Mukhopadhyay. New Developments in Sensing Technology for Struc- tural Health Monitoring. Springer, 2011.

[27] S Uesugi. What is kappa architecture? http://www.ericsson.com/research-blog/ data-knowledge/data-processing-architectures-lambda-and-kappa, 2015. [Online; accessed 25-05-2016].

[28] Who’s who in technology, 2005.

[29] Byron Ellis. Real-time analytics: Techniques to analyze and visualize streaming data. John Wiley & Sons, 2014.

[30] Hesheng Chen. Large Research Infrastructures Development in China: A Roadmap to 2050. Springer, 2011.

[31] Karan Vahi, Ian Harvey, Taghrid Samak, Dan Gunter, Kim Evans, David Rogers, Ian Taylor, Monte Goode, Francisco Silva, Eddie Al-Shakarchi, et al. A general approach to real-time workflow monitoring. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:, pages 108–118. IEEE, 2012.

172 Bibliography

[32] Pethuru Raj. Handbook of research on cloud infrastructures for big data analytics. IGI Global, 2014.

[33] Bent J Mackey G, Sehrish S. Introducing map-reduce to high end computing. petas- cale data storage. PDSW, 2008.

[34] Ronald C Taylor. An overview of the hadoop/mapreduce/hbase framework and its current applications in bioinformatics. BMC bioinformatics, 11(12):S1, 2010.

[35] Gang-Hoon Kim, Silvana Trimi, and Ji-Hyong Chung. Big-data applications in the government sector. Communications of the ACM, 57(3):78–85, 2014.

[36] Fox G. Ekanayake J, Pallickara S. Mapreduce for data intensive scientific analyses. IEEE, 2008.

[37] et al. Attebury G, Baranovski A. Hadoop distributed file system for the grid. IEEE, 2009.

[38] Apache. Apache spark. https://spark.apache.org. [Online; accessed 12-04-2016].

[39] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 10:10–10, 2010.

[40] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: interactive analysis of web-scale datasets. Communications of the ACM, 54(6):114–123, 2011.

[41] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. Discretized streams: A fault-tolerant model for scalable stream processing. Technical report, DTIC Document, 2012.

[42] T. M. Mitchell. The Discipline of Machine Learning. Technical Report 9, Carnegie Mellon University, School of Computer Science, Machine Learning Department, 2006.

[43] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach, 3rd edition, volume 9. Pearson, 2009.

173 Bibliography

[44] EMC Digital Universe with Research Analysis by IDC. The digital uni- verse of opportunities: Rich data and the increasing value of the inter- net of things. http://www.emc.com/leadership/digital-universe/2014iview/ executive-summary.htm, 2014. [Online; accessed 11-04-2016].

[45] Apache. Apache hadoop. https://hadoop.apache.org, 2008. [Online; accessed 11-04-2016].

[46] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 51, 1:107–113, 2008.

[47] S. Shenker, I. Stoica, T. Hunter, M. Zaharia, T. Das, and H. Li. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. Usenix, 37, 4:45–51, 2012.

[48] M. Zaharia, M. Chowdhury, T. Das, and A. Dave. Fast and interactive analytics over hadoop data with spark. Usenix, 37, 4:45–51, 2012.

[49] Z. Ni. Comparative evaluation of spark and stratosphere. KTH Information and Communication Technology, 2013.

[50] S. Shenker and I. Stoica and T. Hunter and M. Zaharia and T. Das and H. Li. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. Technical Report UCB/EECS-2012-259, Univ. California, Berkeley, Apr 2012.

[51] Apache. Apache flink. https://flink.apache.org, 2014. [Online; accessed 12-04- 2016].

[52] Apache. Apache h2o. http://www.h2o.ai, 2014. [Online; accessed 12-04-2016].

[53] Anisha Arora, Arno Candel, Jessica Lanford, Erin LeDell, and Viraj Parmar. Deep learning with h2o, 2015.

[54] Apache. Apache spark machine learning library. http://spark.apache.org/mllib, 2014. [Online; accessed 12-05-2016].

[55] skymind. Deeplearning4j. http://deeplearning4j.org, 2014. [Online; accessed 14-05-2016].

174 Bibliography

[56] Gianmarco De Francisci Morales and Albert Bifet. Samoa: Scalable advanced massive online analysis. The Journal of Machine Learning Research, 16(1):149–153, 2015.

[57] Cloudera. Cloudera oryx. https://github.com/cloudera/oryx, 2014. [Online; accessed 18-05-2016].

[58] John Langford, L Li, and A Strehl. Vowpal wabbit. URL https://github.com/JohnLangford/vowpalwabbit/wiki, 2011.

[59] C. Snijders, U. Matzat, and U.-D Reips. Big Data: Big Gaps of Knowledge in the Field of Internet Science. International Journal of Internet Science, 7, 1:1–5, 2012.

[60] A. Aamnitchi, S. Doraimani, and G. Garzoglio. Filecules in highenergy physics: Characteristics and impact on resource management. High Performance Distributed Computing, pages 69–80, 2016.

[61] D. Minoli. A Networking Approach to Grid Computing. John Wiley & Sons, Inc., 2015.

[62] C. Nicholson and et al. Dynamic data replication in LCG 2008. Concurrency and Computation: Practice and Experience, 20, 11:1259–1271, 2008.

[63] L. Magnoni, U. Suthakar, C. Cordeiro, M. Georgiou, J. Andreeva, A. Khan, and D.R Smith. Monitoring WLCG with lambda-architecture: a new scalable data store and analytics platform for monitoring at petabyte scale. Journal of Physics: Conference Series, 664(5):052023, 2015.

[64] J. Knobloch and L. Robertson. LHC Computing Grid:Technical Design Report-LCG- TDR-001, 2005.

[65] K. Grim. Tier-3 computing centers expand options for physicists, International Science Grid This Week (iSGTW), 2009.

[66] I. Foster and C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastruc- ture. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2013.

[67] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the Grid: Enabling scalable virtual organizations. The International Journal of High Performance Computing Applications, 15, 3:200–222, 2011.

175 Bibliography

[68] E. Laure, S.M Fisher, A. Frohner, C. Grandi, P.Z Kunszt, A. Krenek, O. Mulmo, F. Pacini, F. Prelz, J. White, M. Barroso, P. Buncic, F. Hemmer, A. Di Meglio, and A. Edlund. Programming the Grid with gLite. Computational Methods in Science and Technology, 12, 1:33–45, 2006.

[69] I. Foster. What is the grid? A three point checklist, GRIDToday, 2011.

[70] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, 2010.

[71] S. Ghemawat, H. Gobioff, and S.T Leung. The google file system. ACM SIGOPS Operating Systems Review, 37, 5:29–43, 2003.

[72] Apache Flume project. https://flume.apache.org, 2012. [Online; accessed 02-01- 2016].

[73] K. Skaburska. The Dirq project. http://dirq.readthedocs.org, 2013. [Online; accessed 27-12-2015].

[74] R. Gardner, S. Campana, G. Duckeck, J. Elmsheuser, A. Hanushevsky, F.G Hnig, J. Iven, F. Legger, I. Vukotic, W. Yang, and the Atlas Collaboration. Data feder- ation strategies for ATLAS using XRootD. Journal of Physics: Conference Series, 513(4):042049, 2014.

[75] Jeffrey Dean and Sanjay Ghemawat. In MapReduce: simplified data processing on large clusters. OSDI04: Proceedings of the 6th conference on symposium on operating systems design and implementation - USENIX Association, 2004.

[76] Thomas Bernhardt. The Esper project, http://www.espertech.com/esper.

[77] Julia Andreeva, Benjamin Gaidioz, Juha Herrala, Gerhild Maier, Ricardo Rocha, Pablo Saiz, and Irina Sidorova. Dashboard for the LHC experiments. In International Symposium on Grid Computing, ISGC 2007, pages 131–139, 2009.

[78] Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning spark: lightning-fast big data analysis. ” O’Reilly Media, Inc.”, 2015.

176 Bibliography

[79] Zaharia, Matei et all. In Resilient Distributed Datasets: A Fault-tolerant Abstrac- tion for In-memory Cluster Computing, NSDI’12, pages 2–2, Berkeley, CA, USA, 2012. Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation - USENIX Association.

[80] Amazon. https://aws.amazon.com, 2012. [Online; accessed 12-04-2016].

[81] Databricks. https://databricks.com. [Online; accessed 12-04-2016].

[82] WLCG. WLCG Data acTivities dashboard. http://dashb-wdt-xrootd.cern.ch/ ui, 2014. [Online; accessed 27-04-2016].

[83] T Beermann, P Maettig, G Stewart, M Lassnig, V Garonne, M Barisits, R Vigne, C Serfon, L Goossens, A Nairz, et al. Popularity prediction tool for atlas distributed data management. In Journal of Physics: Conference Series, volume 513, page 042004. IOP Publishing, 2014.

[84] Vincent Garonne, R Vigne, G Stewart, M Barisits, M Lassnig, C Serfon, L Goossens, A Nairz, Atlas Collaboration, et al. Rucio–the next generation of large scale dis- tributed system for atlas data management. In Journal of Physics: Conference Series, volume 513, page 042021. IOP Publishing, 2014.

[85] Tadashi Maeno. Panda: distributed production and distributed analysis system for atlas. In Journal of Physics: Conference Series, volume 119, page 062036. IOP Publishing, 2008.

[86] Janez Demˇsar,BlaˇzZupan, Gregor Leban, and Tomaz Curk. Orange: From exper- imental machine learning to interactive data mining. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 537–539. Springer, 2004.

[87] Brian D Ripley. The r project in statistical computing. MSOR Connections. The newsletter of the LTSN Maths, Stats & OR Network, 1(1):23–25, 2001.

[88] Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd international conference on Machine learning, pages 161–168. ACM, 2006.

177 Bibliography

[89] Florent Burba, Fr´ed´ericFerraty, and Philippe Vieu. k-nearest neighbour method in functional nonparametric regression. Journal of Nonparametric Statistics, 21(4):453– 469, 2009.

[90] Cort J Willmott. Some comments on the evaluation of model performance. Bulletin of the American Meteorological Society, 63(11):1309–1313, 1982.

[91] A Colin Cameron and Frank AG Windmeijer. An r-squared measure of goodness of fit for some common nonlinear regression models. Journal of Econometrics, 77(2):329– 342, 1997.

178