An Evaluation of Real-Time Processing of Call Detail Records Using Stream Processing

UNIVERSITY OF NAIROBI
COLLEGE OF BIOLOGICAL AND PHYSICAL SCIENCES
SCHOOL OF COMPUTING AND INFORMATICS

CATHERINE KITHUSI WAMBUA
P53/73389/2014

A research project report submitted to the School of Computing and Informatics in partial fulfillment of the requirements for the award of the Degree of Master of Science in Distributed Computing Technology at the University of Nairobi.

December 2017

DECLARATION

I certify that this research project report is, to the best of my knowledge, my original authorial work except as acknowledged therein, and that it has not been submitted for any other degree or professional qualification award in this or any other University.

Signature: __________________ Date: __________________
Catherine Kithusi Wambua (P53/73389/2014)

This research report has been submitted in partial fulfillment of the requirements for the Degree of Master of Science in Distributed Computing Technology at the University of Nairobi with my approval as the University supervisor.

Signature: __________________ Date: __________________
Dr. Christopher Chepken

DEDICATION

To my beloved parents, for their unrelenting dedication to ensuring that my siblings and I acquired the best education despite all odds.

To my dear sister and brothers, for being the best cheering squad anyone could ever ask for.

To my friends and colleagues, for their advice and support and for affording me time to pursue this degree.

ACKNOWLEDGEMENTS

I would like to sincerely thank the almighty God, who has been a source of strength, wisdom and perseverance during this project. This research project could not have been done without his ever-present sustenance.

My deepest gratitude goes to my research project advisor and supervisor, Dr. Christopher Chepken, for his invaluable advice and motivation.
His guidance, feedback and expert opinions were instrumental to the completion of this project. I would also like to thank the members of the research panel for their constructive criticism, input and corrections, which aided in the completion of this research project.

ABSTRACT

A common problem plaguing the telecommunication industry is how to process the gigantic amounts of Call Detail Record (CDR) data it generates. Currently, telecommunication companies use batch processing systems to process CDR data at intervals ranging from 5 minutes to 24 hours, and even then, not all data is processed. Present batch processing platforms are vendor-based, requiring proprietary software, specialized hardware and licenses. As a result, processing CDR data is expensive, and this cost has prevented telecommunication companies from gaining all the benefits that effective and total processing of CDR data could provide.

The recent strides made in big data, and especially in stream processing, make total processing of CDR data possible. Furthermore, stream processing facilitates the real-time processing of data. This research primarily focuses on stream processing of CDR data, which would benefit telecommunication companies seeking to gain complex, intricate and speedy insights into their customers and networks. This research also involves a feature comparison of several stream processing platforms in use today, for the purpose of selecting a single suitable platform for this project. The selected platform is then evaluated in terms of performance and resource usage, in an effort to determine whether it is suitable for the real-time processing of CDR data.

TABLE OF CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS

CHAPTER ONE
INTRODUCTION
1.0. Background Information
    1.0.1. SMSC CDR Processing
    1.0.2. Batch Processing vs Big Data Processing
    1.0.3. Implementing Big Data Projects
1.1. Problem Statement
1.2. Significance of the Research
1.3. Research Objectives
1.4. Research Questions
1.5. Research Assumptions

CHAPTER TWO
LITERATURE REVIEW
2.0. Introduction
2.1. Stream Processing
2.2. Selecting a Stream Processing Platform
2.3. Apache Spark Streaming
    2.3.1. Apache Spark Streaming Architecture
2.4. Apache Hadoop Yarn
2.5. Apache Hadoop HDFS
2.6. Apache Cassandra
2.7. Apache Kafka
2.8. Apache Zookeeper
2.9. Graphite and Grafana
2.10. Related Work and Research Gap

CHAPTER THREE
METHODOLOGY
3.0. Introduction
3.1. System Infrastructure Setup
3.2. Prototype Development
    3.2.1. Software Development Model
        3.2.1.1. Requirements Definition and Analysis
        3.2.1.2. Prototype Design and Development
        3.2.1.3. System Testing
        3.2.1.4. Software Implementation
3.3. Experimentation
    3.3.1. Fixed Parameters

