Technological University Dublin ARROW@TU Dublin

Dissertations School of Computer Sciences

2018

A Comparison of Real Time Stream Processing Frameworks

Jonathan Curtis, Technological University Dublin, Ireland


Recommended Citation
Curtis, J. (2018). A Comparison of Real Time Stream Processing Frameworks. Dissertation, M.Sc. in Computing (Advanced Software Development), DIT, 2018.


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 License.

A Comparison of Real Time Stream Processing Frameworks

Jonathan Curtis

C06038514

A dissertation submitted in partial fulfilment of the requirements of Dublin Institute of Technology for the degree of M.Sc. in Computing (Advanced Software Development)

2018

I certify that this dissertation, which I now submit for examination for the award of MSc in Computing (Advanced Software Development), is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

This dissertation was prepared according to the regulations for postgraduate study of the Dublin Institute of Technology and has not been submitted in whole or part for an award in any other Institute or University.

The work reported on in this dissertation conforms to the principles and requirements of the Institute’s guidelines for ethics in research.

Signed: ______

Date: 04 March 2018


ABSTRACT

The need to process the ever-expanding volumes of information being generated daily in the modern world is driving radical changes in traditional data analysis techniques. As a result of this, a number of open source tools for handling real-time data streams have become available in recent years. Four, in particular, have gained significant traction: Apache Flink, Apache Samza, Apache Spark and Apache Storm. Despite the rising popularity of these frameworks, however, there are few studies that analyse their performance in terms of important metrics, such as throughput and latency. This study aims to address that gap by running several benchmarks against these frameworks.

Key words: Stream Processing, Apache Flink, Apache Samza, Apache Spark, Apache Storm, Latency, Throughput


ACKNOWLEDGEMENTS

I would like to thank my supervisor Andrea Curley for all her help. I would also like to thank my family for supporting me through this process.


Contents

1 INTRODUCTION ...... 1

1.1 Background ...... 1

1.2 Research Project/Problem ...... 2

1.3 Research Objectives...... 3

1.4 Research Methodologies ...... 3

1.5 Scope and Limitations ...... 3

1.6 Document Outline ...... 5

2 LITERATURE REVIEW ...... 6

2.1 Big Data ...... 6

2.2 Traditional Data Processing ...... 7

2.2.1 Hadoop ...... 8

2.2.2 YARN ...... 9

2.3 Streaming and Stream-Processing ...... 11

2.4 Streaming Architecture ...... 12

2.4.1 Scaling ...... 12

2.4.2 Lambda Architecture ...... 13

2.4.3 Streaming Challenges ...... 14

2.5 Streaming Platforms ...... 14

2.5.1 Apache Kafka ...... 14

2.5.2 Apache Flink ...... 16

2.5.3 Apache Samza ...... 17

2.5.4 Apache Spark ...... 18

2.5.5 Apache Storm ...... 20

2.6 Benchmarking ...... 22

2.6.1 Latency ...... 23

2.6.2 Throughput ...... 24

2.6.3 Existing Comparative Benchmarking Studies ...... 25

3 DESIGN AND METHODOLOGY ...... 28

3.1 Experiment Environment ...... 28

3.2 Proposed Architecture ...... 29

3.3 Metrics Gathering ...... 32

3.3.1 Capturing Latency ...... 33

3.3.2 Capturing Throughput ...... 35

3.4 Benchmarking Applications ...... 36

3.4.1 Application Selection Criteria ...... 38

3.4.2 Experiment Dataset ...... 39

3.4.3 Identity ...... 41

3.4.4 Extraction ...... 43

3.4.5 Grep ...... 44

3.4.6 WordCount ...... 45

4 IMPLEMENTATION AND RESULTS...... 47

4.1 Experiment Setup...... 49

4.1.1 Kafka Cluster Setup ...... 49

4.1.2 Computational Cluster Setup ...... 51

4.2 Implementation Details ...... 52

4.2.1 Extracting Kafka Timestamp ...... 53

4.2.2 Framework Configuration ...... 54

4.3 Experiment Results ...... 56


4.3.1 Identity ...... 58

4.3.2 Extraction ...... 65

4.3.3 Grep ...... 70

4.3.4 WordCount ...... 76

5 ANALYSIS ...... 82

5.1 Flink and Spark Standalone ...... 82

5.2 Storm and Spark on YARN ...... 85

6 CONCLUSION ...... 88

6.1 Research Overview and Problem Definition ...... 88

6.2 Findings ...... 88

6.3 Limitations ...... 89

6.4 Contributions ...... 90

6.5 Future Work ...... 91

BIBLIOGRAPHY ...... 92


Table of Figures

Anatomy of a Kafka topic...... 15

Proposed experiment architecture...... 30

Common stream processing environment ...... 31

DAG of aggregate wordcount ...... 37

Operations of the Identity benchmark ...... 42

Operations of the Extraction benchmark ...... 44

Operations of the Grep benchmark ...... 45

Operations of the WordCount benchmark ...... 46

Spark’s processing time ...... 58

The throughput of Flink Identity benchmark...... 59

The throughput of Spark (standalone) Identity benchmark ...... 60

Spark UI screen...... 61

Throughput of Spark YARN (default) Identity benchmark...... 62

Throughput of Spark YARN(max) Identity benchmark ...... 63

Throughput of Storm Identity benchmark ...... 64

Throughput of Flink Extraction benchmark ...... 65

Throughput of Spark (standalone) Extraction benchmark ...... 66

Throughput of Spark YARN (default) Extraction benchmark ...... 67

Throughput of Spark YARN (max) Extraction benchmark ...... 68

Throughput of Storm Extraction benchmark ...... 69

Throughput of Flink Grep benchmark ...... 70

Throughput of Spark (standalone) Grep benchmark...... 71

Throughput of Spark YARN (default) Grep benchmark ...... 72


Throughput of Spark YARN (max) Grep benchmark ...... 73

Throughput of Storm Grep benchmark ...... 74

Throughput of Flink WordCount benchmark ...... 76

Throughput of Spark (standalone) WordCount benchmark ...... 77

Throughput of Spark YARN (default) WordCount benchmark ...... 78

Throughput of Spark YARN (max) WordCount benchmark ...... 79

Storm UI for the WordCount benchmark ...... 80

Throughput of Storm WordCount benchmark ...... 81

Mean Latency of Flink and Spark Standalone ...... 83

Mean Throughput of Flink and Spark Standalone ...... 84

Mean Latency of Storm and Spark on YARN ...... 85

Mean Throughput of Storm and Spark on YARN ...... 86


Table of Tables

Kafka cluster installed software versions ...... 50

Computational cluster installed software versions ...... 52

Statistical summary Flink Identity benchmark ...... 59

Statistical summary Spark (standalone) Identity benchmark ...... 61

Statistical summary Spark YARN (default) Identity benchmark ...... 62

Statistical summary Spark YARN (max) Identity benchmark ...... 63

Statistical summary Storm Identity benchmark ...... 64

Statistical summary Flink Extraction benchmark ...... 65

Statistical summary Spark (standalone) Extraction benchmark ...... 66

Statistical summary Spark YARN (default) Extraction benchmark ...... 67

Statistical summary Spark YARN (max) Extraction benchmark ...... 68

Statistical summary Storm Extraction benchmark ...... 69

Statistical summary Flink Grep benchmark ...... 70

Statistical summary Spark (standalone) Grep benchmark ...... 71

Statistical summary Spark YARN (default) Grep benchmark ...... 72

Statistical summary Spark YARN (max) Grep benchmark ...... 73

Statistical summary Storm Grep benchmark ...... 75

Statistical summary Flink WordCount benchmark...... 76

Statistical summary Spark (standalone) WordCount benchmark ...... 77

Statistical summary Spark YARN (default) WordCount benchmark ...... 78

Statistical summary Spark YARN (max) WordCount benchmark ...... 79

Statistical summary Storm WordCount benchmark ...... 81

Latency for Flink and Spark Standalone ...... 83


Throughput for Flink and Spark Standalone...... 84

Latency for Storm and Spark on YARN...... 86

Throughput for Storm and Spark on YARN ...... 87


1 INTRODUCTION

1.1 Background

Huge repositories of data, oftentimes terabytes in size, are generated daily by modern information systems and digital technologies (Acharjya & P, 2016, p. 511). We now live in an age where data is growing at orders of magnitude greater than ever before. By 2020, the amount of data on the planet is expected to reach 44 zettabytes (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 2). Furthermore, many individual companies are now processing quantities of data that, only a few years ago, seemed unimaginable. Alibaba, for example, generates tens of terabytes daily in online trading transactions, while Google processes hundreds of petabytes per month (Sakr, 2017, p. 34).

The need to process these ever-expanding volumes of information is driving radical changes in traditional data analysis techniques. For many use cases, conventional batch processing, which is oftentimes based on the MapReduce programming model, is no longer sufficient, with businesses now requiring that information be processed in real-time. This is because, for many applications, the value of data may decrease in proportion to the time that has passed since it was produced (Maarala, Rautiainen, Salmi, Pirttikangas & Riekki, 2015, p. 2855). As a result of this, a variety of open source tools for handling real-time data streams have become available in recent years.

Of the myriad frameworks available, four, in particular, have gained significant traction: Apache Flink, Apache Samza, Apache Spark and Apache Storm. While these tools fundamentally perform the same function, in that they process incoming streamed data, each takes a different approach to realise its goal. As a result, each has its own strengths and weaknesses. These tend to be directly correlated with the processing model used by the individual framework. Two approaches are taken in general: the micro-batching model and the record-by-record model. The micro-batching model tends to favour higher throughput, while the record-by-record model benefits from reduced processing latency.


1.2 Research Project/Problem

Latency, according to Chandramouli et al, is a significant user metric in many commercial real-time applications (Chandramouli, Goldstein, Barga, Riedewald, & Santos, 2011, p. 255). Low latency, they argue, is important for avoiding stale decisions. High throughput, meanwhile, is important for ensuring that messages get processed in a timely manner. While both throughput and latency are of the utmost concern in real-time data stream processing systems, there are few studies that provide definitive metrics with regards to the performance of the numerous platforms available for handling streamed data. Hesse and Lorenz support this assertion, arguing that, in relation to throughput and latency, "there are no numbers that… create a clear ranking for these quantifiable attributes" (Hesse & Lorenz, 2015, p. 800).

Qian et al also argue that choosing a streaming platform is incredibly difficult. This is because of “the diversity of choices and the complexity of configurations for each platform” (Qian, Wu, Huang, & Das, 2016, p. 592). A comprehensive reference on how to choose a streaming platform, they argue, is still unavailable. In their work, focusing on Spark, Storm and Samza, they attempt to address this problem. They do so by performing a number of evaluations on their chosen stream processing frameworks. These evaluations, however, were done using specific versions of the frameworks. Given the ever-increasing popularity of stream processing, many of the systems have undergone significant changes since Qian et al’s study was published. For example, Spark went from version 1.6.1 to version 2.1.1 in a little over one year. The 2.0.0 release brought significant changes to the framework.

Pääkkönen acknowledges that framework releases can potentially result in significant performance improvements. He found that in previous benchmarking studies, the processing latency of Spark executing a word counting benchmark was between 2-3 seconds. However, the reported word count latency in his paper was in a “sub-second” range. The difference, he argues, could potentially be explained by the newer version of Spark, coupled with the different dataset. For reference, he used version 1.3.1 in his study, whilst the other benchmarks used version 0.9. Interestingly, none of the benchmarking works examined for this study used Spark version 2.0.0 or greater. Thus, there exists a need to test the latest versions of the various open source Apache stream processing projects, to determine their current performance with regards to throughput and latency.

The specific research question that will form the basis of this work is as follows:

Does Apache Spark perform better than Apache Flink, Apache Samza, and Apache Storm in terms of throughput and latency when processing streamed data?

1.3 Research Objectives

The primary objective of this study is to evaluate the performance of Spark in relation to Flink, Samza, and Storm. The performance evaluation will be conducted in terms of two metrics: throughput and latency. These values will be obtained across a multitude of different scenarios, each of which will consist of common stream processing operations. In this way, it is hoped that the scope of the results will be wide enough to provide individuals and organisations who are looking to implement a streaming system with meaningful insights that will, ultimately, help inform their decision as to which framework to deploy.

1.4 Research Methodologies

Statistics related to the throughput and latency of each of the frameworks will be gathered as they execute various benchmarking scenarios. The configuration and implementation details from these scenarios will be presented, along with a quantitative analysis of the collected statistics. The mean and standard deviation will be calculated from these statistics for each framework/scenario combination. Various statistical tests will then be used to answer the aforementioned research question.

1.5 Scope and Limitations

Nowadays, many Big Data applications are facilitated by Infrastructure as a Service (IaaS) cloud computing vendors, such as Google, Amazon, and Microsoft (Li, Wu, Jiang, Li, & Wei, 2017, p. 792). In their 2014 work, Jiamin and Jun discuss the use of various IaaS cloud computing vendors in independent research studies that focus on the evaluation of software platforms (Jiamin & Jun, 2014, p. 152).


They posit that it is unrealistic to expect individual research groups – and even small enterprises – to purchase large quantities of commodity hardware for the purposes of academic or temporary analysis of platforms (Jiamin & Jun, 2014, p. 152). Thus, they argue that the use of cloud computing vendors is a reasonable expectation and compromise in such studies, especially when considering that many providers offer research grants that enable researchers to build up large-scale clusters free of charge (Jiamin & Jun, 2014, p. 152).

With the above said, it is important to highlight the limitations of such an approach to evaluating software platforms. Focusing on Amazon Web Services in particular - where various virtual machines are provided to end users via Amazon’s Elastic Compute Cloud service - Jiamin and Jun highlight the fact that there is significantly more variance with regards to performance than is evident when physical clusters are used for testing. This variance, they argue, is partially the result of different types of processors being used on the hardware within the cloud centres (for example, AMD Opteron is 35% slower than the Intel Xeon) and it is – at least at the time their study was published – impossible for an end-user to choose the processor type in advance (Jiamin & Jun, 2014, p. 152). Furthermore, network performance can vary based on the geographical location of the cloud centre, particularly when the centre is in a different time zone, as it may be affected by the local working time and other such factors.

Chandramouli et al argue that capturing reliable metrics, particularly latency, is a difficult task. This, they argue, is because of the complexity of “dataflow plan[s] and the non-trivial interactions between the components of a distributed system” (Chandramouli, Goldstein, Barga, Riedewald, & Santos, 2011, p. 256). Thus, a novel approach for capturing latency and throughput will need to be employed. This will be built upon knowledge garnered from the review of other benchmarking works. By taking this approach, it might transpire that the results collected in this study cannot be easily compared to the results found in other works.

Finally, the relative newness of the field of stream processing presents numerous challenges for this work. Firstly, many of the frameworks under evaluation are undergoing significant development as attempts are made to adapt to newly emerging requirements. Therefore, there is a risk that a new version of one or more of the frameworks will be released during the period in which this study is conducted. This, quite evidently, could result in the findings of this study being outdated. As such, it is important for readers of this work to note that the results of the evaluation are only valid for the versions of the frameworks outlined below. Furthermore - and also related to the issue of newness - some of the systems being tested are quite immature (in particular Samza). Qian, Wu, Huang and Das, for example, had difficulty when it came to benchmarking Samza; specifically, they were not able to effectively test Samza’s latency. This, they argued, was due to the immaturity of the platform (Qian, Wu, Huang, & Das, 2016, p. 594).

1.6 Document Outline

This study will be structured as follows:

• Chapter 2 examines the existing literature surrounding stream processing frameworks.

• Chapter 3 outlines the design of a benchmarking suite that forms the basis of the various experiments that will be conducted in this study. It also discusses the approach that will be used for capturing the latency and throughput of the frameworks.

• Chapter 4 discusses the implementation of the experiments and the results collected during their running.

• Chapter 5 analyses the results from the experiments and determines whether Spark performs better than Flink, Samza, and Storm in terms of throughput and latency.

• Chapter 6 concludes the study, providing an overview of the findings, limitations, contributions and future work.


2 LITERATURE REVIEW

This chapter examines the existing literature surrounding stream processing frameworks. As distributed systems have a natural overhead associated with their use that makes them unsuited for all but the largest of processing tasks, it makes sense to begin the discussion by analysing and defining the concept of Big Data. Following on from this, traditional forms of data processing are discussed. A significant proportion of the chapter is then dedicated to the streaming platforms that form the basis of this study. This is because it is important for readers to be able to distinguish between the different processing models and terminology used by each framework. After this, the metrics of interest to the study will be discussed briefly. Finally, the chapter concludes with a review of existing benchmarking studies.

2.1 Big Data

Landset et al argue that the term Big Data is still the subject of disagreement. They posit that it “generally refers to data that is too big or too complex to process on a single machine” (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 2). Acharjya and P support this assertion (Acharjya & P, 2016, p. 511). They argue that there is no exact definition for Big Data. Rather, they suggest that it is “problem-specific” (Acharjya & P, 2016, p. 511). As such, the more widely accepted definitions of Big Data, Landset, Khoshgoftaar, Richter and Hasanin argue, tend to define it in terms of the challenges it presents. These challenges - oftentimes referred to as the “big data problem” - were first characterised in 2001 and fall into three categories: volume, velocity and variety (commonly referred to as the 3Vs) (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 3). In the years since this initial characterisation, additional aspects have been identified.

Hu, Wen, Chua and Li use a “4V” approach for distinguishing between Big Data and traditional data. These Vs are: volume, velocity, variety and value (Hu, Wen, Chua, & Li, 2014, p. 654). Volume refers to the sheer size of datasets. As an example, Hu et al point to Facebook, which reported 2.7 billion “likes” and comments registered per day by users in 2012. Velocity relates to the speed at which data is generated and collected (Hu, Wen, Chua, & Li, 2014, p. 654). Examples of such systems are radio-frequency identification data management applications, which can generate massive quantities of data every second. Variety refers to the format of data (Hu, Wen, Chua, & Li, 2014, p. 654). There are, Hu, Wen, Chua and Li posit, three types of data: structured, semi-structured and unstructured (Hu, Wen, Chua, & Li, 2014, p. 654). Traditional data is typically structured, whereas Big Data, such as user-generated content on YouTube and Twitter, is, for the most part, unstructured (Hu, Wen, Chua, & Li, 2014, p. 654). The final V is value. Big Data tends to have a low value density, but, by employing a variety of mining methods, significant value can be derived when large enough datasets are analysed (Hu, Wen, Chua, & Li, 2014, p. 654).

Acharjya and P also use a 4V approach for defining Big Data (Acharjya & P, 2016, p. 511). Their Vs, however, differ slightly from Hu, Wen, Chua and Li’s. While they agree on the original 3Vs - volume, velocity, and variety - as well as their definitions, they assert that the final V is veracity, not value (Acharjya & P, 2016, p. 511). Veracity, they argue, relates to the accountability of the data; essentially how trustworthy (or untrustworthy) the data is. Unlike both Hu, Wen, Chua and Li, and Acharjya and P, Landset, Khoshgoftaar, Richter and Hasanin are sceptical that the addition of a fourth V – either value or veracity – adds to the overall understanding of Big Data (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 4). As such, they focus only on the original 3Vs in their 2015 work. This study will take the same approach. The term Big Data, as used in this study, will refer to data whose challenges conform to the original 3Vs.

2.2 Traditional Data Processing

Traditional data management and analysis applications tend to use a process-after-store model for computing data and are usually built upon relational database management systems (RDBMSs) (Hu, Wen, Chua, & Li, 2014, p. 653; Stonebraker, Çetintemel, & Zdonik, 2005, p. 45). Under this model, incoming data is first persisted in the RDBMS - where it is potentially indexed – before being processed by the analysis application (Stonebraker, Çetintemel, & Zdonik, 2005, p. 45). Given the emergence of Big Data and its accompanying challenges, coupled with the need for real-time data handling from multiple services, these types of systems are no longer viable for many organisations.


The disconnect between the traditional approach and the emerging Big Data paradigm can be classified into two aspects: firstly, in terms of data structure, RDBMSs can only support structured data. When it comes to semi-structured or unstructured data, these systems offer little to no support (Hu, Wen, Chua, & Li, 2014, p. 653). Secondly, from the perspective of scalability, RDBMSs can only be scaled up with expensive hardware. Given the rates at which data production is growing, RDBMSs are simply not able to cope with the ever-increasing volumes of data being generated in today’s world (Hu, Wen, Chua, & Li, 2014, p. 653). As a result of these limitations, new technologies began emerging that attempted to account for the challenges of Big Data.

2.2.1 Hadoop

The most established software platform for traditional Big Data analysis, according to Acharjya and P, is Apache Hadoop (Acharjya & P, 2016, p. 511). This assertion is also supported by Huang, Meng, Zhang, and Zhang (Huang, Meng, Zhang, & Zhang, 2017, p. 4). Hadoop was initially introduced in 2007 as an open source implementation of the MapReduce processing engine, which itself was introduced in 2004 by Google engineers (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 5, 9). MapReduce is a programming model for processing large datasets via a batch processing mechanism (Acharjya & P, 2016, p. 511). This model, as its name suggests, is implemented in two steps: the map step and the reduce step. The map step involves dividing the input into smaller subsets and distributing them to the various nodes in the cluster, while the reduce step involves merging the output from the processing of the nodes into a single dataset (Acharjya & P, 2016, p. 511).
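To make the two steps concrete, the sketch below shows the canonical word-count example written against the Hadoop MapReduce Java API. It is a minimal illustration only: the class names are hypothetical, and the driver code that configures the job and its input and output paths is omitted.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical word-count classes illustrating the two MapReduce steps.
public class WordCountSteps {

    // Map step: split each input line into words and emit a (word, 1) pair per word.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: merge the mappers' output by summing the counts for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}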

Hadoop (and MapReduce), according to Patel, Sakaria and Bhadane, is not suitable for the ad hoc data exploration and real-time analytics required of stream processing frameworks (Patel, Sakaria & Bhadane, 2015, p. 50). This, they argue, is because Hadoop simply wasn’t built for real-time processing (Patel, Sakaria & Bhadane, 2015, p. 50). This is evident in the fact that, with MapReduce, queries can take anywhere from several seconds to many hours - or even days - to complete.

Since its inception, Hadoop has evolved into a “vast web” of projects related to every step of a Big Data workflow (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 5). The number of projects that have been developed to either complement or replace the original elements of Hadoop has, thus, made a current definition of the project unclear (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 5). As a result of this, many people use the term Hadoop Ecosystem when referring to the project, instead of just Hadoop. The project currently consists of four core modules. The Common module contains a set of utilities needed by the other Hadoop modules. The MapReduce module, as mentioned, is the core data processing engine used by Hadoop. The Hadoop Distributed File System (HDFS) module is a file system designed to store large amounts of data across multiple nodes of commodity hardware.

The final component within the Hadoop ecosystem is YARN (Yet Another Resource Negotiator). Prior to Hadoop version 2.0, Hadoop and MapReduce were tightly coupled, with MapReduce responsible for both cluster resource management and data processing (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 5). YARN has now taken over these responsibilities and, by doing so, has improved upon many of the deficiencies present in the old MapReduce. YARN, for example, is able to run on larger clusters, more than doubling the number of jobs and tasks it can handle before running into bottlenecks. It also allows for a more generalised Hadoop, which makes MapReduce just one type of YARN application. This means it can be left out entirely in favour of a different processing engine (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 5). As such, YARN is used in conjunction with a number of the stream-processing frameworks under evaluation in this study. It is for this reason that we will discuss YARN in more depth in the next section.

2.2.2 YARN

YARN, as mentioned, forms the “architectural epicentre” of Hadoop (Patel, Sakaria & Bhadane, 2015, p. 50). It is primarily responsible for tasks such as job scheduling and the management of cluster resources (Hesse & Lorenz, 2015, p. 800). A central component of a YARN setup is the Resource Manager, which is realised as a daemon process running on a dedicated machine (Hesse & Lorenz, 2015, p. 800). The Resource Manager monitors the status of nodes within the cluster. It is also responsible for resource distribution amongst applications (Hesse & Lorenz, 2015, p. 800; Huang, Meng, Zhang, & Zhang, 2017, p. 4). It allocates resources in the form of “containers.” A container, which is usually a UNIX process, can be seen as a “logical bundle of resources” bound to a particular node (Hesse & Lorenz, 2015, p. 800; Huang, Meng, Zhang, & Zhang, 2017, p. 4).

Another component within YARN is the Node Manager (Hesse & Lorenz, 2015, p. 800; Huang, Meng, Zhang, & Zhang, 2017, p. 4). A Node Manager daemon runs on each node within the cluster. The task of the Node Manager is to keep track of the node’s resources, notifying the Resource Manager about failures. This communication is implemented using a heartbeat system. Node Managers may also communicate with each other, again via a heartbeat system. This message exchange, however, can only occur at the application level, specifically between so-called Application Masters and their assigned containers (Hesse & Lorenz, 2015, p. 800; Huang, Meng, Zhang, & Zhang, 2017, p. 4).

Application Masters are created by the Resource Manager once an application has been accepted. The Application Master handles various aspects of program execution, including resource needs management and fault handling. YARN delegates the scheduling of applications to each individual Application Master (Huang, Meng, Zhang, & Zhang, 2017, p. 4). In order to acquire resources, an Application Master has to send a request to the Resource Manager. Prior to sending this request, the Application Master creates a logical execution plan for the application. Once the Resource Manager has allocated various resources, the Application Master is responsible for generating a physical execution plan based on the assigned resources (Huang, Meng, Zhang, & Zhang, 2017, p. 4). As soon as a resource lease on behalf of an Application Master is created, the corresponding container is pulled by the Application Master’s next heartbeat (Huang, Meng, Zhang, & Zhang, 2017, p. 4).

Huang, Meng, Zhang, and Zhang acknowledge that the centralised resource manager approach taken by YARN could potentially be viewed as a bottleneck in a Big Data system (Huang, Meng, Zhang, & Zhang, 2017, p. 4). This, they argue, is because Application Masters may often require large numbers of containers to process their data, and the Masters themselves have limited capacity when it comes to scheduling large numbers of tasks. However, many “sophisticated works,” they continue, have been proposed and implemented within YARN that ultimately enable it to overcome a number of limitations with regard to this approach.

2.3 Streaming and Stream-Processing

Stream-processing systems operate on continuous data streams (Kambatla, Kollias, Kumar & Grama, 2014, p. 2568). Hesse and Lorenz define a data stream as a “continuous flow of incoming data records, comparable to a data feed” (Hesse & Lorenz, 2015, p. 797). The data within a stream could be, for example, sensor data, network data, stock prices or postings on social networks, to name just a few. Unlike with a traditional data analytics platform, incoming data is not usually persisted before query execution (Hesse & Lorenz, 2015, p. 797). As such, the stream processing model has an impact on the types of queries executed. Typically, so-called continuous queries are used within a streaming system (Hesse & Lorenz, 2015, p. 797). As the name suggests, these queries are continuously executed against a dataset. The datasets, against which the queries are executed, are defined by a trigger function. This function specifies when query processing should occur. Generally, queries are executed based on a time interval or the number of newly arrived data tuples (Hesse & Lorenz, 2015, p. 797).

At a high level, streaming applications can be represented by directed acyclic graphs (DAGs), where the vertices, called either processing elements (PEs) or tasks, represent operators, and the edges/links - the streams - represent the data flow from one processing element or task to the next (Anis Uddin Nasir, De Francisci Morales, Garcia-Soriano, Kourtellis, & Serafini, 2015, p. 137; Chandramouli, Goldstein, Barga, Riedewald, & Santos, 2011, p. 255; Li, Wu, Jiang, Li, & Wei, 2017, p. 784). Each task performs a predefined function on the data from an input stream (or streams) and then emits data to an output stream. A task without any incoming link is called a source, and one without any outgoing link is called a sink (Li, Wu, Jiang, Li, & Wei, 2017, p. 784). There can be multiple sources and sinks in a streaming application.

Hu, Wen, Chua and Li argue that under the stream processing paradigm, the value of data depends on data freshness (Hu, Wen, Chua and Li, 2014, p. 656). This assertion is supported by Hesse and Lorenz, who argue that, as streaming systems tend to execute queries on “sliding windows” of data - where only the newest data is considered - it is generally easier to discover emerging trends than if a database table were being evaluated, such as in a traditional data analytics platform (Hesse & Lorenz, 2015, p. 797). The window boundaries can be either time-based or count-based (De Matteis & Mencagli, 2016, p. 303).

There are, as mentioned in Chapter 1, two approaches to stream processing: micro-batching and record-by-record processing. With micro-batching, the stream is treated as a sequence of “small batch chunks of data” (Shahrivari, 2014, p. 123). Essentially, at specified intervals, the incoming data is grouped into single collections and sent for processing. With the record-by-record model, each incoming message is handled as a standalone event.
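The distinction can be sketched in a few lines of framework-agnostic Java. The class and method names below are purely illustrative and do not belong to any of the frameworks under evaluation; real systems add buffering, scheduling and fault-tolerance concerns that are ignored here.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative contrast between the two processing models; not framework code.
public class ProcessingModels {

    // Record-by-record: the operator is invoked as soon as each message arrives.
    static void onMessageRecordByRecord(String message, Consumer<String> operator) {
        operator.accept(message);
    }

    // Micro-batching: messages are buffered as they arrive...
    private static final List<String> buffer = new ArrayList<>();

    static void onMessageMicroBatch(String message) {
        buffer.add(message);
    }

    // ...and handed to the operator as one collection when the batch interval elapses.
    static void onBatchInterval(Consumer<List<String>> operator) {
        operator.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}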

2.4 Streaming Architecture

Most traditional streaming systems, according to Patel, Sakaria and Bhadane, involve an “update and pass” methodology that handles a single record at a time (Patel, Sakaria & Bhadane, 2015, p. 57). There are two approaches for fault tolerance in these types of systems: replication, which tends to be fast, but entails significant hardware costs, and upstream backup/buffered records, which, in comparison to replication, is relatively slow but is significantly less expensive with regards to hardware costs (approximately half the price, in some cases) (Patel, Sakaria & Bhadane, 2015, p. 57). Importantly, neither of these approaches scales efficiently in large clusters that consist of hundreds or, even, thousands of nodes.

2.4.1 Scaling

Scaling refers to the ability of a system to adapt to increased data processing demands (Singh & Reddy, 2014, p. 3). In general, there are two approaches to scaling: horizontal scaling and vertical scaling. Horizontal scaling involves distributing the workload across ever-increasing numbers of servers. These servers are, oftentimes, “commodity” machines; machines that are relatively inexpensive to acquire (Singh & Reddy, 2014, p. 3). Vertical scaling, on the other hand, refers to the act of installing more processors, more memory and generally faster hardware within a single machine, with the goal of increasing processing power.


Both approaches to scaling offer numerous benefits and drawbacks. Vertical scaling, for example, can be quite costly initially. Single machines, given their nature, are limited by the number of expansion slots available. As such, users will need to buy hardware that is not only powerful enough to cover current needs, but also powerful enough to cover any foreseeable future needs (Singh & Reddy, 2014, p. 3). In contrast, horizontal scaling provides users with the ability to increase performance in small increments. This ultimately lowers upfront costs. There is also no limit to the extent to which a system can be scaled out; that is, depending, of course, on the software being used for data processing (Singh & Reddy, 2014, p. 3). With that said, managing and upgrading a single machine is a relatively straightforward endeavour. Also, in a lot of cases, the power provided by current generation hardware might be sufficient for an organisation’s data processing needs.

2.4.2 Lambda Architecture

Attempts have been made to merge the more traditional batch processing approach to Big Data analytics with the increasingly popular field of stream processing (Heidrich, Trendowicz, & Ebert, 2016, p. 113; Pathirage, Hyde, Pan, & Plale, 2016, p. 1627). Presented as a software design pattern, the Lambda Architecture unifies online and batch processing within a single framework (Kiran, Murphy, Monga, Dugan, & Baveja, 2015, p. 2786; Pathirage, Hyde, Pan, & Plale, 2016, p. 1627). Within the Lambda Architecture, there are three layers, each of which has a specific responsibility. The first layer, called the batch layer, is dedicated to analysing persisted data. Tools such as Hadoop or Spark are used at this stage. The second layer, called the speed layer, handles stream processing (Heidrich, Trendowicz, & Ebert, 2016, p. 113). Tools such as those being evaluated in this study are used at this layer. Finally, the last layer, called the serving layer, is responsible for responding to queries (Kiran, Murphy, Monga, Dugan, & Baveja, 2015, p. 2786).

Pathirage et al argue that there are a number of flaws with the Lambda Architecture. One of these relates to the fact that two codebases must be maintained within Lambda Architecture systems: the one that deals with the batch layer and the one that deals with the speed layer. Maintaining code that needs to produce the same result in two complex distributed systems is, they argue, “exactly as painful as it seems like it would be” (Pathirage, Hyde, Pan, & Plale, 2016, p. 1627). Kiran et al also admit that the approach has shortcomings. Setting up jobs to execute and produce results using multiple projects, they argue, requires more skill from developers than when using simply one layer (Kiran, Murphy, Monga, Dugan, & Baveja, 2015, p. 2786). Pathirage et al suggest that the same results can be achieved using only stream processing. This is accomplished by “retaining the input data unchanged… and reprocessing the data through increased parallelism” (Pathirage, Hyde, Pan, & Plale, 2016, p. 1627). This unified model, which benefits from requiring the management of only a single codebase, is commonly referred to as the Kappa Architecture.

2.4.3 Streaming Challenges

One of the main challenges for streaming systems relates to the ordering of incoming data tuples (Hesse & Lorenz, 2015, p. 797). While there is a natural order based on the arrival time of the data values to the streaming system, this order does not necessarily correspond to the chronological order of the “value occurrences,” or the creation time of the raw data (Hesse & Lorenz, 2015, p. 797). Due to varying factors, such as, for example, network latency or connection outages, the data sequence of incoming values could be upset, in relation to their creation time, with the potential for some values being lost entirely en route to the streaming system. If the occurrence of such scenarios is particularly high, the stream representing the source of data could be referred to as a “noisy” data stream (Hesse & Lorenz, 2015, p. 797). In such cases, a trade-off must be made for certain scenarios between performance and correctness, as it is possible to wait for missing values or, alternatively, execute queries with missing data (Hesse & Lorenz, 2015, p. 798).

2.5 Streaming Platforms

2.5.1 Apache Kafka

Wang, Liu, and Zhou describe Kafka as “a distributed, partitioned, replicated commit log service” that combines the “benefits of traditional log aggregators and messaging systems” (Wang, Liu, & Zhou, 2016, p. 364). Yadranjiaghdam, Yasrobi, and Tabrizi, on the other hand, define Kafka as a “distributed streaming platform that uses publish-subscribe messaging and is developed to be a distributed, partitioned, replicated service.” Kafka’s architecture is, according to Hesse and Lorenz, “comparatively simple,” as it consists only of a set of Brokers (Hesse & Lorenz, 2015, p. 801). A Broker is essentially a server that has an instance of Kafka running on it. A Kafka cluster can be comprised of one or more Brokers.

Data streams with Kafka are defined by topics, which are divided into partitions that are distributed over Broker instances (Hesse and Lorenz, 2015, p. 801). Each partition holds a number of records. A record consists of a key, a value, and a timestamp (Yadranjiaghdam, Yasrobi, & Tabrizi, 2017, p. 331). Partitions, thus, are essentially sequences of records which are ordered and immutable, and may be continually appended to a commit log.

Figure 1. Anatomy of a Kafka topic (source: https://kafka.apache.org/intro).

In terms of terminology within Kafka, the process that publishes messages to a topic is called a producer, while the process that subscribes to a topic to consume messages is called a consumer (Wang, Liu, & Zhou, 2016, p. 364). Kafka provides guaranteed message delivery with proper ordering, so messages sent by a producer to a particular topic will be delivered in the order they are sent (Solaimani, Iftekhar, Khan, Thuraisingham, & Ingram, 2014, p. 1). Topics are also multi-subscriber. As such, they may have zero, one or many consumers who can access the data contained within (Yadranjiaghdam, Yasrobi, & Tabrizi, 2017, p. 331).

Kafka retains all published records, without consideration for their consumption. However, as Kafka's performance is effectively constant with respect to data size, storing data for a long time does not affect overall system performance (Yadranjiaghdam, Yasrobi, & Tabrizi, 2017, p. 331). With that said, Yadranjiaghdam, Yasrobi and Tabrizi suggest disabling this feature when persistence of records is not required, in order to avoid access latency issues with hard disks (Yadranjiaghdam, Yasrobi, & Tabrizi, 2017, p. 331).
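As an illustration of these concepts, the following minimal sketch publishes and then consumes records using the Java kafka-clients library (the Duration-based poll call assumes a 2.x client). The broker address, topic name and consumer group are placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
    public static void main(String[] args) {
        // Producer side: publish a record (key, value) to a topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("example-topic", "key-1", "some payload"));
        }

        // Consumer side: subscribe to the topic and read records from its partitions.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "example-readers");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("example-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                // Each record carries a key, a value and a timestamp, as described above.
                System.out.printf("partition=%d key=%s value=%s timestamp=%d%n",
                        record.partition(), record.key(), record.value(), record.timestamp());
            }
        }
    }
}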

2.5.2 Apache Flink

Flink’s roots can be found in an open-source big data analytics research project called Stratosphere that was developed at the Technical University of Berlin (Hesse & Lorenz, 2015, p. 799; Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 15; Sakr, 2017, p. 39). In January, 2015, Flink became a top-level Apache project. Similar to Spark, Flink supports both batch processing and stream processing. Unlike Spark, the Flink streaming API is based on individual events, rather than the micro-batch approach that Spark takes (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 16). Various connectors are offered that allow for processing data streams from, among others, Kafka and RabbitMQ.

Flink is mainly written in Java and Scala, and offers client APIs for both languages (Hesse & Lorenz, 2015, p. 799). Flink’s runtime makes use of a master-worker pattern (similar to Storm). There are two node types within a Flink cluster: the Job Manager (master) and one or more Task Manager(s) (the workers) (Hesse & Lorenz, 2015, p. 799). The Job Manager is the interface to client applications. It receives assignments from clients and, subsequently, schedules the work of Task Managers. It is also responsible for keeping track of the overall execution of jobs, as well as the state of every worker. It accomplishes this last task using a heartbeat mechanism.

Each Task Manager provides a certain number of processing slots to the cluster (Hesse & Lorenz, 2015, p. 799). These slots are used for parallelising tasks. The number of slots provided is configurable but, according to Hesse and Lorenz, it is recommended to use as many slots as there are CPU cores in each Task Manager node. The degree to which a job is parallelised can be defined in multiple ways (Hesse & Lorenz, 2015, p. 799).

Flink’s processing model applies transformations to parallel data collections (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 15). Such transformations, Landset, Khoshgoftaar, Richter and Hasanin argue, generalise map and reduce functions, as well as various other functions, such as join, group, and iterate. Flink also includes a cost-based optimiser which automatically selects the best execution strategy for each job (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 15).
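The following is a minimal sketch of a Flink DataStream job in Java, roughly in line with the 1.x APIs available at the time of writing: a socket source feeds a flatMap transformation, and the stream is then keyed and summed to maintain a running word count. The host, port and job name are placeholders.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2); // the degree of parallelism maps onto Task Manager slots

        // Placeholder source: one line of text per socket message.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split("\\s+")) {
                            out.collect(Tuple2.of(word, 1)); // map-like transformation
                        }
                    }
                })
                .keyBy(0)  // group the stream by the word field
                .sum(1);   // maintain a running count per word

        counts.print();
        env.execute("word-count-sketch");
    }
}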

Landset, Khoshgoftaar, Richter and Hasanin discuss various comparisons between Flink and Spark in their work (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 16). While Spark, in one comparison, was found to be superior in the areas of fault tolerance and the handling of iterative algorithms, Flink’s advantages, they argue, were the presence of an optimisation mechanism and better integration with other projects (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 16). Flink, however, used more resources, but was able to finish jobs in less time than Spark. It must be noted, however, that Landset, Khoshgoftaar, Richter and Hasanin do qualify their statement by saying that Flink, at the time of the study, was quite immature, and has since undergone significant changes that may affect the results of any new tests (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 16).

Hesse and Lorenz claim that, with regard to performance, Flink is considered to have very low latency as well as high throughput (Hesse & Lorenz, 2015, p. 799).

2.5.3 Apache Samza

Apache Samza is a distributed, “lightweight” stream-processing engine that was initially developed at LinkedIn before being open-sourced in 2013 (Feng, Zhuang, Pan, & Ramachandra, 2015, p. 2601; Riccomini, 2013; Zhuang, Feng, Pan, Ramachandra & Sridharan, 2016, p. 268). Samza is mainly written in Scala and Java (Hesse & Lorenz, 2015, p. 800). In a typical Samza setup, two other Apache projects are used: YARN and Kafka. YARN provides Samza’s execution environment while Kafka provides the streaming layer. The architecture of Samza, Qian, Wu, Huang, and Das argue, makes it “simple and pluggable” (Qian, Wu, Huang, & Das, 2016, p. 592).

Samza’s unit of deployment is a YARN job that can contain “hundreds of streaming tasks, all executing the same processing logic on messages from one or more stream partitions belonging to one or more streams” (Pathirage, Hyde, Pan, & Plale, 2016, p. 1628). Samza's stream primitive is a message rather than a tuple (Morshed, Rana, & Milrad, 2016, p. 1483). Samza divides streams into partitions. An individual partition represents an ordered sequence of read-only messages with distinct IDs. Samza uses a single thread internally to handle reading and writing messages, flushing metrics, checkpointing, windowing, etc. (Feng, Zhuang, Pan, & Ramachandra, 2015, p. 2601). The Samza container creates an associated task to process the messages of each input stream topic partition.
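To illustrate the task model, the sketch below implements Samza’s low-level StreamTask interface, under which each call to process handles a single message from one input partition. The system name, output stream name and class name are placeholders, and the job configuration file that wires the task to its input streams is omitted.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Minimal pass-through task: forwards every incoming message to an output stream.
// The "kafka" system name and the stream name below are placeholders.
public class PassThroughTask implements StreamTask {

    private static final SystemStream OUTPUT = new SystemStream("kafka", "output-topic");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // One invocation per message from the input partition assigned to this task.
        Object message = envelope.getMessage();
        collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
    }
}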

Samza does not natively support DAG stream topologies, as Storm does (Pathirage, Hyde, Pan, & Plale, 2016, p. 1628). Instead, Samza’s architecture encourages the formation of directed acyclic graphs through the connecting of multiple Samza jobs via intermediate Kafka streams. With this approach, a failure of a downstream job will not affect an upstream job that is producing data; it further facilitates sharing across stream processing stages, by allowing the addition of jobs that consume an intermediate stream (Pathirage, Hyde, Pan, & Plale, 2016, p. 1628).

Samza provides developers with a message serialisation and deserialisation API called Serde to support different message formats, such as JSON and AVRO (Pathirage, Hyde, Pan, & Plale, 2016, p. 1628).

Hesse and Lorenz claim Samza has “a relatively high throughput as well as a somewhat increased latency compared to Storm” (Hesse & Lorenz, 2015, p. 800).

2.5.4 Apache Spark

Spark began life as a research project at the University of California, Berkeley in 2009 before becoming an Apache incubated project in 2010 (Acharjya & P, 2016, p. 516; Hesse & Lorenz, 2015, p. 799; Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 13; Sakr, 2017, p. 37). Spark is based on MapReduce, but addresses a number of MapReduce's deficiencies (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 13). In particular, it improves on speed and resource issues by utilising in-memory computation. The Spark project is mainly written in Java, Scala and Python, and provides APIs for each of these languages, as well as R (Acharjya & P, 2016, p. 516; Hesse & Lorenz, 2015, p. 799).

A core pattern used within Spark is the resilient distributed dataset (RDD) (Acharjya & P, 2016, p. 516; Hesse & Lorenz, 2015, p. 799; Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 13). Hesse and Lorenz define an RDD as a “read-only collection of Java or Python objects partitioned across a cluster” (Hesse & Lorenz, 2015, p. 799). In 2015, the RDD API was extended to include DataFrames, which allow users to group a distributed collection of data by column, similar to a relational database table (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 13).

To cover specialised scenarios, Spark supports several libraries that are built on top of it, such as Spark SQL and Spark Streaming; the latter provides stream-processing functionality to the core Spark project and will be the focus of this study. These libraries extend Spark’s core package, and most of the data structures used within them are based on the same central data structure (the RDD) as Spark’s core batch processing framework (Patel, Sakaria & Bhadane, 2015, p. 59).

Spark’s architecture consists of a Driver Program, a Cluster Manager and Worker Nodes (Acharjya & P, 2016, p. 516). The Driver Program represents the application that is executed on top of Spark (Hesse & Lorenz, 2015, p. 799). While there may be numerous applications distributed within a Spark cluster, only the main program creates a SparkContext. This object is responsible for coordinating all of the existing client processes. The Driver Program is typically connected to a Cluster Manager (Acharjya & P, 2016, p. 516; Hesse & Lorenz, 2015, p. 799). There are currently three cluster managers supported by Spark: the Spark Standalone manager, the Apache Mesos manager, and the Apache Hadoop YARN manager (Hesse & Lorenz, 2015, p. 799). The main purpose of the Cluster Manager is to provide executors to applications as soon as a SparkContext has established a connection.

Spark’s Worker Nodes are responsible for performing the calculations required as part of the Driver Program (Hesse & Lorenz, 2015, p. 799). One or more executors can run on a Worker Node and each executor, in turn, can contain one or more tasks (Acharjya & P, 2016, p. 516; Hesse & Lorenz, 2015, p. 799). Executor processes only work for one program at a time, and stay alive until the program has finished (Hesse & Lorenz, 2015, p. 799). This reduces the complexity of task scheduling, as each Driver Program can schedule the tasks of its exclusive executors independently. A downside to this approach, however, according to Hesse and Lorenz, is that data exchange between different Driver Programs can only be accomplished through “indirections,” such as by writing data to a file system or a database (Hesse & Lorenz, 2015, p. 800).

In terms of stream processing, as mentioned, Spark Streaming builds upon Spark’s core framework. Data streams represent the input for Spark Streaming, which, in turn, creates micro-batches out of the stream in the form of RDDs. These batches are then passed to the Spark Engine for processing (Hesse & Lorenz, 2015, p. 800). Spark also uses an abstraction for data streams called discretized streams (or D-Streams, for short). D-Streams consist of RDD sequences, whereby each RDD contains data pertaining to a particular stream interval. This approach, Hesse and Lorenz claim, was taken to provide better handling for faults and slow nodes within a cluster (Hesse & Lorenz, 2015, p. 800).

Calculations are structured as “a set of short, stateless, deterministic tasks instead of continuous, stateful operators” (Hesse & Lorenz, 2015, p. 800). These calculations, in the form of small batch processes, allow for the earlier identification of the above-mentioned issues, and provide for exactly-once message processing. However, this redundancy, according to Hesse and Lorenz, comes at a cost in terms of latency. With that said, Spark’s throughput, they argue, is “comparatively” higher than that of other stream processing frameworks (Hesse & Lorenz, 2015, p. 800).
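The micro-batch model can be seen directly in the Spark Streaming API, as in the minimal Java sketch below (assuming a 2.x version of Spark): every batch interval, the lines received from the socket source are packaged into an RDD of the underlying D-Stream and processed as a unit. The source host, port and batch interval are placeholders.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SparkStreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]");

        // Each 1-second interval of input becomes one RDD in the underlying D-Stream.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999); // placeholder source
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());

        words.print(); // executed once per micro-batch as it is formed

        ssc.start();
        ssc.awaitTermination();
    }
}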

2.5.5 Apache Storm

Storm is a “real-time, fault-tolerant and distributed” stream processing framework that is mainly written in Java and Clojure (Hesse & Lorenz, 2015, p. 798). It was originally developed by Nathan Marz and his team at BackType (Hesse & Lorenz, 2015, p. 798; Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 14). In 2011, Twitter acquired BackType and the project was subsequently open-sourced the following year. To this day, Storm is still used to power various systems at Twitter (Patel, Sakaria & Bhadane, 2015, p. 60).

A Storm cluster contains three node types. The first of these is the master node (called Nimbus, after the daemon that runs on it). The master node receives topologies from connecting clients and manages their execution (Acharjya & P, 2016, p. 516; Hesse & Lorenz, 2015, p. 798). This includes the distribution and scheduling of execution, as well as monitoring overall progress with regard to throughput. Regardless of the size of the cluster, there is always a single, fixed master node (Patel, Sakaria & Bhadane, 2015, p. 52).

In addition to the master node, Storm also contains a cluster of Apache Zookeeper nodes (Hesse & Lorenz, 2015, p. 798). Zookeeper, according to Hesse and Lorenz, can be seen as a service for coordinating processes within distributed applications. It essentially acts as a “transmitter of communication” between the other two node types (Hesse & Lorenz, 2015, p. 798). Zookeeper is also the only one of the node types within Storm that maintains any type of state. When a failed instance is restarted, processing continues from the last saved state, which is stored in and retrieved from Zookeeper (Hesse & Lorenz, 2015, p. 799).

The final node type within the Storm architecture is the worker node, also known as supervisor node (Acharjya & P, 2016, p. 516; Hesse & Lorenz, 2015, p. 798). Many supervisors can exist within a Storm cluster. The task of the supervisor is to spawn worker processes based on instructions received from the master node. The supervisor manages the status of the workers using a heartbeat mechanism (Hesse & Lorenz, 2015, p. 798). If a worker process terminates unexpectedly, the Supervisor can restart the process. The status of supervisors is also managed via a heartbeat mechanism. Each supervisor periodically sends a heartbeat, as well as information about free resources, to the master node to indicate that it is still operational.

Persistent queries in the context of Storm are called topologies (Hesse & Lorenz, 2015, p. 798). Hesse and Lorenz define a topology as a “directed graph where the vertices represent computation and the edges represent data flow between computation components” (Hesse & Lorenz, 2015, p. 798). There are two distinct types of vertices: spouts and bolts. A spout represents a source of data tuples that is used within the topology. A spout could, for example, be responsible for pulling messages from a message queue (Hesse & Lorenz, 2015, p. 798). In contrast to spouts, bolts contain most of the computation logic and are responsible for processing incoming data tuples.

Storm guarantees that each unit of data (tuple) will be fully processed at least once (Patel, Sakaria & Bhadane, 2015, p. 51). Spouts will keep messages in their output queues until the bolts acknowledge them (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 15). Messages will continue to be sent out until they are acknowledged, at which point they will then be dropped out of the queue (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 15).
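Bringing these concepts together, the sketch below wires a very small topology using the Storm Java API (assuming Storm 1.x): a test spout that ships with Storm emits words, and a bolt applies a trivial transformation to each tuple. A production deployment would submit the topology to a Nimbus node via StormSubmitter rather than running a LocalCluster.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class StormTopologySketch {

    // A bolt holds the processing logic applied to each incoming tuple.
    public static class UpperCaseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getString(0).toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("upper"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // The spout is the source of tuples; TestWordSpout is bundled with Storm for testing.
        builder.setSpout("words", new TestWordSpout(), 1);

        // The bolt subscribes to the spout's stream using a shuffle grouping.
        builder.setBolt("upper", new UpperCaseBolt(), 2).shuffleGrouping("words");

        Config config = new Config();
        config.setNumWorkers(2);

        // Local in-process cluster for illustration only.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("topology-sketch", config, builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}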

Scaling Storm is achieved by adding additional nodes to the cluster when more processing power is required (Patel, Sakaria & Bhadane, 2015, p. 52). In general, a single machine is used in a development environment, while a cluster of three to five nodes running in production can be considered a medium to large Storm cluster (Patel, Sakaria & Bhadane, 2015, p. 52).

Patel, Sakaria and Bhadane claim Storm can process “one million 100 byte messages” per second on each node in a Storm cluster (Patel, Sakaria & Bhadane, 2015, p. 50). It must be noted that they do not clarify what this processing entails. Hesse and Lorenz claim Storm “usually has a lower throughput as well as a lower latency” when compared to Spark Streaming (Hesse & Lorenz, 2015, p. 799). This, they qualify, depends on the versions and the algorithms used during comparison (Hesse & Lorenz, 2015, p. 799).

2.6 Benchmarking

Definitions of benchmarking, according to Anand and Kodali, can vary significantly (Anand & Kodali, 2008, p. 258). Key themes within the literature include measurement, comparison, identification of best practices, implementation and improvement.

Of particular concern to this study are latency and throughput.


2.6.1 Latency

In their 2011 work, Chandramouli et al posit that latency within a distributed event processing system (EPS) - equivalent to a stream processing system - denotes “the delay that is introduced by the EPS from the point of event arrival to result generation” (Chandramouli, Goldstein, Barga, Riedewald, & Santos, 2011, p. 257). Unfortunately, they do not readily clarify what is meant by “result generation.” They argue that, in event-based models, each dataflow consists of a DAG of operators. Each operator “consumes events from one or more input queues, performs computation, and produces new events to be output or placed on the input queue of other operators.” Confusion arises as to whether “result generation” represents the end of computation or the stage at which the data has been output.

Landset et al seem to support the latter interpretation. They argue that latency can be defined as “the amount of time between starting a job and getting initial results” (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 10). Thus, under this interpretation, we can assume that latency represents the time from when a record arrives to the stage at which it has been output to, for example, a message queue. De Matteis and Mencagli also seem to support this. They define latency as “the time elapsed from the reception of a tuple triggering the operator internal processing logic and the delivering of the corresponding result” (De Matteis & Mencagli, 2016, p. 303).

Hesse and Lorenz, on the other hand, seem to support the former interpretation. They define latency as “the time difference between consuming a message and the end of its processing” (Hesse & Lorenz, 2015, p. 798). Qian et al also support this assertion. They define latency as the “average time span from the arrival of a record till the end of processing this record” (Qian, Wu, Huang, & Das, 2016, p. 594).

The author of this study would tend to agree with both Hesse and Lorenz’s and Qian, Wu, Huang, and Das’ definitions. Assuming latency represents the time difference between record arrival and the delivery of results introduces a layer of uncertainty into benchmarking studies. This is because the work becomes reliant on the performance of external systems. What if, for example, the output sink were a database and the database server were running slowly? This would adversely affect the results of the benchmarking test. However, for this study - as will be discussed in the Design and Methodology section - there is a need to rely on external systems, specifically a messaging system. With that said, the following definition of latency is used in this study:

• Latency is the time difference between the arrival of a record into the relevant messaging queue and the end of its processing within the framework under test.

2.6.2 Throughput

The literature offers a multitude of definitions for throughput in the context of stream processing. As with latency, these definitions deviate noticeably from one another. At a conceptual level, Landset et al define throughput as “the amount of work done over a given time period” (Landset, Khoshgoftaar, Richter & Hasanin, 2015, p. 10). Throughput in this regard, they argue, can be thought of as a measure of efficiency. They do not, however, elaborate on what exactly the term “amount of work done” encompasses. Nor do they give any indication of what a valid “time period” might be.

De Matteis and Mencagli provide a more grounded definition when they argue that throughput is “the number of results delivered per time unit” (De Matteis & Mencagli, 2016, p. 303). They focus on the value element of streaming frameworks with their definition, as value cannot be derived until results have been delivered. While, arguably, this may be true in a production environment, this definition is not ideal for a benchmarking study. Given that there are a multitude of possibilities for handling data that has gone through a stream processing framework (some of which will be discussed in the coming Design and Implementation section), if we were to take De Matteis and Mencagli’s definition literally, we would need to test each of these options in order to provide a complete picture of how each system interacts with the external systems to which their results are delivered. It is also worth mentioning that, as with Landset et al, De Matteis and Mencagli are also vague with regards to what constitutes a valid time period.

Qian et al provide a more refined definition in their work. They define throughput as representing “data size in terms of bytes processed per second” (Qian, Wu, Huang, & Das, 2016, p. 594). While this definition is certainly more focused, it is questionable as to whether there is truly any value in measuring the number of bytes processed per second. It is the individual messages arriving within a stream processing system that hold value, so the throughput should measure the processing of these messages, not the bytes that constitute them. Given that messages may vary drastically in size, a throughput of 100MB/s could potentially represent the processing of only a single record. This assertion is supported by Hesse and Lorenz, who define throughput as “the number of external input messages that are completed per time unit” (Hesse & Lorenz, 2015, p. 798).

While Hesse and Lorenz’s definition goes some way towards covering the issue of value, it still has several shortcomings, most notable of which is: what is meant by the term completed? Does it relate to when processing ends? Or is it when the results become available, as with De Matteis and Mencagli’s definition? This author would argue that, for the purposes of this study, a combination of the various definitions observed in the literature should form the basis of a definition of throughput. That definition is as follows:

• Throughput is the number of external input messages that are processed per second.

2.6.3 Existing Comparative Benchmarking Studies

Hesse and Lorenz present a “conceptual survey” of various data stream processing frameworks in their 2015 work (Hesse & Lorenz, 2015, p. 797). In this survey, a multitude of elements, or “dimensions,” are considered during the comparison, including: the general architecture of each system, the languages the systems are written in, message treatment, and, of interest to this study, throughput and latency. However, in terms of throughput and latency, Hesse and Lorenz do not provide any definitive metrics or information related to performed tests. This, they argue, is because “there are no benchmarks or measurements that compare all systems with each other” (Hesse & Lorenz, 2015, p. 801). As such, they provide a general rating scale, consisting of low and high, when mentioning the performance of one framework in relation to another. They qualify this scoring system by citing other works in which the frameworks were tested independently (Hesse & Lorenz, 2015, p. 797). They end their study by actively calling for work in the area of benchmarks for stream processing frameworks.

Several benchmarking systems have been created or proposed in recent years for the purposes of evaluating stream processing frameworks. One such system is the BiCEP benchmark suite (Mendes, Bizzaro, & Marques, 2013, p. 307). Mendes et al argue in their work that the performance requirements of the application domains to which stream processing frameworks are applied make it “virtually impossible for a single benchmark, with a single metric” to be representative of the entire spectrum of applications. For this reason, they continue, BiCEP is composed of a number of “smaller, domain-specific” benchmarks, each with its own workload, dataset and metrics. One such example is the Pairs benchmark, which “exercises a wide range of features commonly found in most… processing application.” These features include filtering, aggregation and the detection of event patterns, among others. It is worth mentioning that Mendes et al do not disclose to which frameworks BiCEP has been applied.

StreamBench, created by Lu, Wu, Xie, and Hu, is another benchmarking system (Lu, Wu, Xie, & Hu, 2014, p. 69). StreamBench leverages a message system as a data feed that, in turn, streams data at a consistent rate into the stream processing framework under test (Lu, Wu, Xie, & Hu, 2014, p. 69). StreamBench consists of seven benchmarking programs that cover both basic operations and common real-world use cases. Some of the programs contained within StreamBench include: Wordcount, a simple application that, as the name suggests, counts the number of words in a stream; DistinctCount, an application that “first extracts a target field from the record, puts it in a set containing all the words seen and outputs the size of the set”; and Grep, an application that searches for the presence of specific patterns in a stream (Lu, Wu, Xie, & Hu, 2014, p. 74).

Lu, Wu, Xie, and Hu applied StreamBench to Apache Spark and Apache Storm (versions 0.9.0-incubating and 0.9.1-incubating, respectively), and found that Spark had greater throughput than Storm, but also significantly higher latency (Lu, Wu, Xie, & Hu, 2014, p. 75). For example, in the Grep test, Spark’s throughput was around 20MB/s, whereas Storm’s was approximately 3.5MB/s. On the other hand, Storm’s latency for that same test was 5ms per data record versus 500ms for Spark (Lu, Wu, Xie, & Hu, 2014, p. 75).

While Lu, Wu, Xie, and Hu’s work produced valid results in the tests they ran, it must be noted that, of the seven benchmarking programs within StreamBench, only one has a non-text data type (the Statistics test, which processes numeric values) (Lu, Wu, Xie, & Hu, 2014, p. 72).

In their 2016 work, Qian, Wu, Huang and Das “modify and extend” the definition of StreamBench and subsequently apply it, as mentioned previously, to Spark, Storm and Samza (Qian, Wu, Huang, & Das, 2016, p. 593). They took this approach, of modifying and extending it, because StreamBench, they argue, is not “mature enough” (Qian, Wu, Huang, & Das, 2016, p. 593). In their setup, they use Apache Kafka as a message broker to mediate between message generation and consumption. Interestingly, they use only two data sources to simulate “real-world cases” (Qian, Wu, Huang, & Das, 2016, p. 593). These are, firstly, web logs and, secondly, network traffic data, both of which cover text and numeric data. Overall, 7 benchmark programs are used in their experiments. The first of these is called Identity and serves as a baseline for other benchmarks. Identity does not perform any operation on the data stream; it merely takes an input stream and outputs that same stream. The rest of the tests involve basic operations, such as counting the number of distinct words in a stream and searching a stream for the occurrence of a particular string sequence.

In their experiments, Qian, Wu, Huang and Das used a six-node, two-cluster setup to test the various frameworks they were evaluating. One of the clusters consisted of three Kafka nodes, while the final three nodes were responsible for handling the stream computation (Qian, Wu, Huang, & Das, 2016, p. 594). They found that the throughput of Spark was significantly higher than that of Storm (nearly 1,800MB/s vs 100MB/s), and quite a bit higher than that of Samza (which was around 1,000MB/s). Storm’s latency, on the other hand, was far lower than Spark’s (less than 100ms vs around 25s in some scenarios). They did not collect any latency metrics for Samza.

The next chapter of this study outlines the design of a benchmarking suite that seeks not only to fill the gaps identified in the analysis of existing benchmarking works, but also to answer the research question outlined in Chapter 1.


3 DESIGN AND METHODOLOGY

This chapter outlines the design of a benchmarking suite that forms the basis of the various experiments that will be conducted in this study. The aim of these experiments is to capture the latency and throughput of the frameworks outlined in Chapter 1 when executing common stream processing operations. The information gathered will then be used to answer the research question at the centre of this study. The chapter begins by discussing the environment upon which the experiments will be executed. It then presents the proposed architecture of the benchmarking suite and discusses the ways in which the latency and throughput will be captured. The chapter concludes by discussing the benchmarking applications that will form the individual experiments.

3.1 Experiment Environment

The experiment environment that will be used in this study is provided by Amazon Web Services (AWS). AWS is the “umbrella term” for the numerous remote services that make up Amazon’s cloud-computing offering (Axelrod, 2015, p. 27). Cloud computing, for reference, can be defined as:

“…a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” (Serrano, Gallardo, & Hernantes, 2015, p.30).

AWS was chosen because it is the “dominant player” in cloud computing, having been the first company to offer cloud services, back in 2006 (Serrano, Gallardo, & Hernantes, 2015, p.33). In their 2015 work, Serrano et al found that AWS had over 54% reported usage amongst interviewed companies. This is significantly greater than the 6% figure for Microsoft’s Azure offering, or the 4% for Google Cloud. In 2017, AWS had a market share of 47.1%, with the nearest competitor being Microsoft Azure at 10.0%, followed by Google Cloud at 3.95% (Coles, 2018). As such, using AWS makes sense, as it most closely mirrors the production environments to which these frameworks are likely to be deployed.

The specific service that will be used for the experiments in this work is the Amazon Elastic Compute Cloud, known as EC2. EC2 is a web service that provides “secure, resizable compute capacity in the cloud”2. Essentially, it allows users to spin up virtual machines of varying sizes and purposes. There are a number of benefits to using a cloud-based platform like EC2. In particular, compute nodes (called instances in EC2) can be started with the same software configuration, including installed tools, directories and users (Axelrod, 2015, p.27). This is because instances can be launched as direct clones of an existing virtual machine (otherwise known as an Amazon Machine Image in AWS). This not only saves time, it also reduces the likelihood of human error, such as misconfiguration, impacting the veracity of results.

2 See https://aws.amazon.com/ec2/

There are a number of different instance types available on EC2. These are grouped firstly by “Family” and then by “Type.” Family refers to the use case of the instance. There are, among others, the following categories within the family grouping: General Purpose, GPU Compute, Memory Optimised, and Storage Optimised. Within the individual family groups, there are a number of types. In General Purpose, for example, the types available range from the t2.nano option – with 1 virtual CPU and 500MB of RAM – to the m5.24xlarge – with 96 virtual CPUs and 384GB RAM. Each instance type is charged on a per hour basis. In the US West (Oregon) region, for example, the t2.nano is charged at a rate of $0.0058 per hour, while the m5.24xlarge is $4.608 per hour. For this study, the general purpose t2.large instance type will be used, running Ubuntu Server 16.04 LTS. This offering comes with 2 vCPUs and 8GB RAM (at a cost of $0.0928 per hour).

3.2 Proposed Architecture

The proposed architecture for this study can be seen in figure 2. This diagram presents a high-level overview of the setup. Kafka is being used to generate the streams of data that will be consumed by the frameworks during the benchmark experiments. To conform to best practices, the architecture consists of two separate clusters, one for running Kafka and one for running the stream processing frameworks. Each framework has its own set of dependencies, such as, for example, YARN. These are not indicated on the diagram. The frameworks are displayed in figure 2 as communicating with the Kafka cluster as a whole, rather than with individual nodes. This is because Kafka consumers and producers (which, in this case, are the individual stream processing framework nodes) make use of all brokers in a cluster automatically. This architecture was chosen for several reasons, each of which will be discussed in the following paragraphs.

Figure 2. The proposed architecture for this study.

In their 2017 work, Imai et al present a “commonly used” stream processing environment that works for a multitude of frameworks, including those being tested in this study (Imai, Patterson, & Varela, 2017, p. 505). Figure 3 below details this setup. Data flows from left to right, starting with the data producer and ending in a data store. The producer sends events at a rate of λ, which are subsequently appended to message queues in Kafka. The consumer (in this case, a stream processing system) pulls data from Kafka at a throughput of τ (Imai, Patterson, & Varela, 2017, p. 505). After the consumer processes events, it optionally stores results in the data store. It could also, potentially, emit the records back to a Kafka queue for further processing downstream. In this setup, there can be multiple producers and consumers working simultaneously.


Figure 3. Common stream processing environment (Imai, Patterson, & Varela, 2017, p. 505).

Lu et al, in their 2014 work, also use a message system as a mediator for data generation in an architecture similar to that proposed by Imai et al (Lu, Wu, Xie, & Hu, 2014, p. 73). They justify this decision for two reasons. Firstly, this design, they argue, provides for a higher level of abstraction and decouples data generation from data consumption “for better feasibility” (Lu, Wu, Xie, & Hu, 2014, p. 73). Secondly, they posit that this approach is the actual usage pattern of many companies that deploy stream processing frameworks. For these reasons, it is this architecture that will be used for this study. It is important to note that there are some factors that should be considered when utilising this architecture, specifically: it is possible that the messaging system could become a bottleneck if data consumption speed is greater than data production speed; and the messaging system may have a negative impact on overall cluster performance.

In order to minimise the impact that the messaging system might have on the results of benchmarking tests, Qian, Wu, Huang, and Das suggest that data preparation should be done “offline.” Offline data preparation, they argue, involves generating the streaming data first, and then using fetch and process steps to retrieve that data (Qian, Wu, Huang, & Das, 2016, p. 594). In this way, full network utilisation among the cluster can be achieved and the impact of the message system on computational performance is minimal (Qian, Wu, Huang, & Das, 2016, p. 594). For this study, the experiment environment will consist of two separate clusters. The first cluster will be a Kafka cluster, responsible for data/message generation, while the second will be a compute cluster that is responsible for executing the various streaming operations required for the benchmarks.

In terms of data production, Wang, Liu, and Zhou suggest that a single Kafka producer can reach speeds of 114MB/s (Wang, Liu, & Zhou, 2016, p. 364). To achieve these speeds, some fine tuning of Kafka’s configuration must be made. They find that two values in particular have a significant effect on a producer’s throughput. The first of these is the producer.type value, which specifies whether messages are sent synchronously or asynchronously. When the value is set to synchronous, the throughput of a producer is 20MB/s. This increases to 100MB/s when set to asynchronous (Wang, Liu, & Zhou, 2016, p. 364). The second value is acks. This refers to the number of acknowledgements that the producer must receive before considering a request complete. Wang, Liu, and Zhou find that setting acks to 0 results in the highest throughput and greatest stability (Wang, Liu, & Zhou, 2016, p. 364). For this study, acks will be set to 0. With regards to producer.type, in the current version of Kafka, the send method of Kafka producers is asynchronous by default3.
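
As a point of reference, a hedged sketch of how the acks setting might be applied when constructing a producer with Kafka's Java client is shown below; the broker address, topic name and message content are placeholders rather than the actual experiment configuration.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AsyncProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka1:9092"); // placeholder broker address
            props.put("acks", "0");                        // do not wait for broker acknowledgements
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // send() is asynchronous by default in the current client: it returns
                // immediately and records are batched and transmitted in the background.
                producer.send(new ProducerRecord<>("input-topic", "example message"));
            }
        }
    }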

In terms of the number of EC2 instances that will be used for the experiments, six in total, split into two separate three-node clusters, should suffice. The first cluster will be dedicated to running Kafka (with one node also running Apache Zookeeper for coordination amongst the cluster), while the second will be responsible for stream computation. This setup is similar to the one used by Qian et al in their 2016 work (though, granted, the hardware used in their study is more powerful than the EC2 instances used here) (Qian, Wu, Huang, & Das, 2016, p. 595).

3.3 Metrics Gathering

For capturing the metrics relevant to this study, there are several different approaches that could potentially be taken. The first of these involves using the frameworks’ in-built metric-tracking functionality. Flink, for example, provides a mechanism that allows for the capturing and exposing of various metrics to external systems4. This is accessed via the RichFunction interface in Flink’s API. A number of metric types are supported. These are represented by the Counter, Gauge, Histogram and Meter interfaces. Of interest to this study is the Meter interface, which is used to measure the average

3 See https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html

4 See https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html


throughput of a job. This is accomplished by registering occurrences of events with the relevant concrete Meter implementation. With regards to latency, a configuration property – latencyTrackingInterval – can be enabled in Flink’s execution configuration that results in Flink periodically issuing a special record called a LatencyMarker.
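
For illustration only, a minimal sketch of how a throughput Meter might be registered inside a Flink RichMapFunction is shown below; the metric name is arbitrary and the snippet assumes Flink's standard metrics API.

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.metrics.Meter;
    import org.apache.flink.metrics.MeterView;

    public class MeteredMap extends RichMapFunction<String, String> {
        private transient Meter meter;

        @Override
        public void open(Configuration parameters) {
            // Register a meter reporting the per-second rate of processed events,
            // averaged over the last 60 seconds.
            this.meter = getRuntimeContext()
                    .getMetricGroup()
                    .meter("eventsPerSecond", new MeterView(60));
        }

        @Override
        public String map(String value) {
            meter.markEvent(); // record one processed event
            return value;
        }
    }

Latency tracking, by contrast, would be switched on via the execution configuration (for example, env.getConfig().setLatencyTrackingInterval(1000), assuming the standard ExecutionConfig API), although, as discussed next, neither mechanism is used in this study.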

While utilising each framework’s included metrics-gathering functionality would, undeniably, simplify the process of capturing the required metrics, there are a number of limitations to this approach that, ultimately, make it an unviable option for this study. Firstly, each framework’s implementation of metrics-tracking will differ. When measuring latency, Flink’s LatencyMarker, for example, does not account for the time records spend in operators, since the markers bypass them (Flink, 2018). The subtle differences between each framework in this regard, therefore, make a fair comparison across them, when using internal systems, virtually impossible.

The second issue relates to resource utilisation. There is an overhead associated with using in-built metrics-tracking functionality. As the main goal of benchmarking is to produce a common reference for the evaluation of the tools under test, it is imperative that this study avoids adding any undue strain to the frameworks, where possible. To this end, the metrics should be gathered independently of the system and, where possible, the metrics-gathering functionality of the framework under test should be disabled, to prevent any resources being utilised by the framework, which would, consequently, have a negative impact on the performance of the system. This, therefore, leads us to a second possible approach, one that involves using timestamps.

3.3.1 Capturing Latency

To calculate latency in a distributed event processing system, Chandramouli et al suggest using a “deterministic stimulus time.” Essentially, each arriving event is assigned a new field, called stimulus time, which represents the “wall-clock” time of its arrival into the event-processing system (Chandramouli, Goldstein, Barga, Riedewald, & Santos, 2011, p. 257). As the event progresses along the DAG path that represents the dataflow being executed at that time, multiple intermediate events may be produced. However, the stimulus time of an event e is the maximum timestamp across all source events in the lineage of e.

In conjunction with stimulus time, egress time must also be captured. Egress time represents the wall-clock time when the event exits the event processing system. Thus, for each output event e, its latency is the difference between the event’s egress time and its stimulus time (Chandramouli, Goldstein, Barga, Riedewald, & Santos, 2011, p. 257).

With the above in mind, coupled with the definition of latency used in this study – which states that latency is the time difference between the arrival of a message and the end of its processing - for this work, a timestamp will be added to each incoming event as it is appended to the Kafka topic. This timestamp will represent the stimulus time. Ideally, the timestamp should be added when the message is consumed by the framework under test. But, given the nature of this study, in which different computational processing models are compared, such an approach would not provide a true representation of latency. This is because of Spark’s computational model.

In micro-batching stream processing systems, the framework groups records into batches at predefined intervals (possible Spark settings and how they affect performance will be discussed in the next chapter). If the batching interval, for example, is set to two seconds, a timestamp can only be programmatically appended to a record once the Spark RDD containing that record has been created, which will occur every two seconds. But that record may have been sitting in a Kafka topic for up to two seconds prior to the creation of the RDD. Thus, to accurately measure latency, we need to factor in the time that it has been sitting in the queue waiting for processing.
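
To illustrate the role of the batch interval, a skeleton Spark Streaming setup with a two-second interval might look as follows; the application name, master URL and the socket stand-in source are placeholders rather than the actual benchmark code, which consumes from Kafka.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class BatchIntervalExample {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("batch-interval-example")
                    .setMaster("spark://computational1:7077"); // placeholder master URL

            // Records are grouped into one RDD every two seconds; a record arriving
            // early in the interval waits until the batch is formed before processing.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

            // A stand-in source; the real benchmarks consume from Kafka instead.
            jssc.socketTextStream("localhost", 9999).print();

            jssc.start();
            jssc.awaitTermination();
        }
    }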

Appending a timestamp to a Kafka record is a relatively straightforward procedure. In fact, each record already contains a timestamp, along with a key and a value. This timestamp, however, is, by default, the time the message was generated by the producer. Changing this value to the time the record is appended to a topic requires modifying a single configuration parameter. Specifically, log.message.timestamp.type on the various brokers that make up the Kafka cluster needs to be set to LogAppendTime (the default value is CreateTime, which, as mentioned, is the time the message was generated by the producer).

With stimulus time now covered, another timestamp, representing the egress time, will need to be added. This will be done as the record exits the test. The latency for that particular event will then be calculated as the final egress time minus the maximum stimulus time. It is worth mentioning here that the egress time will be a standard Java Date API value - rather than a value unique to each of the individual frameworks. Therefore, the overhead associated with its creation should not have any negative impact on the results of this study, as it will be common across all platforms.

In order to calculate the latency for each system, the results from the processing of individual messages will need to be output for analysis (which, as specified, will consist of subtracting the stimulus time from the egress time). There are multiple approaches that could be taken for outputting the data, such as writing to a database, a distributed cache, or even a local log file. Yahoo, for example, created a stream benchmarking tool, called Yahoo Streaming Benchmarks5, that uses Redis, an in-memory data store, to keep track of metrics, before writing the final results to a .txt file on a local filesystem. One downside of this tool, however, is that the frameworks are responsible for the computation involved in producing the collected metrics, which, as discussed above, is not ideal.

For this study, the tools already available to hand will be leveraged to calculate latency, specifically Kafka. A new Kafka topic will be created, to which each stream computation node will publish its processed records. A custom-built lightweight Kafka Streams application, called metrics-writer, will then read each message from the topic, calculate the latency for that particular message, and output its results to a file in comma-separated format for subsequent manual analysis and visualisation6.
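
A hedged sketch of what such a metrics-writer application might look like is given below; the topic name, output path and the assumption that each result message is a comma-separated stimulus/egress pair are illustrative rather than a description of the actual implementation.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class MetricsWriter {
        public static void main(String[] args) throws IOException {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-writer");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092"); // placeholder
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            PrintWriter out = new PrintWriter(new FileWriter("results.csv", true));

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> results = builder.stream("results-topic"); // placeholder topic
            results.foreach((key, value) -> {
                // Assumed message format: "<stimulusTime>,<egressTime>"
                String[] parts = value.split(",");
                long stimulus = Long.parseLong(parts[0]);
                long egress = Long.parseLong(parts[1]);
                long latency = egress - stimulus;
                out.println(stimulus + "," + egress + "," + latency);
                out.flush();
            });

            new KafkaStreams(builder.build(), props).start();
        }
    }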

3.3.2 Capturing Throughput

Throughput - as has already been defined - is the number of external input messages that are processed per second. A message can be considered processed, in the context of this work, once it reaches the stage at which the egress timestamp has been appended (as this represents the end of processing for that record within the stream framework under test).

5 See https://github.com/yahoo/streaming-benchmarks

6 See https://kafka.apache.org/10/documentation/streams/


As described in the previous section, all of the processed records will be written to an output file in comma-separated format. As such, measuring throughput is a relatively straightforward process. Essentially, starting at the very earliest stimulus time in the result dataset (as this represents the time that the experiment began), the records will be grouped into intervals of one second based on their egress time from this initial start period. A simple Java application was written to accomplish this goal. It takes, as parameters, the location of the result dataset and an initial start time. It then groups and counts the records as required.
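
A simplified sketch of that grouping logic is shown below; it assumes the comma-separated column order used in the latency sketch above and is not the exact application used in the experiments.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class ThroughputCalculator {
        public static void main(String[] args) throws IOException {
            // args[0]: path to the result CSV; args[1]: experiment start time in epoch millis
            List<String> lines = Files.readAllLines(Paths.get(args[0]));
            long start = Long.parseLong(args[1]);

            // Count records per one-second interval, keyed by the offset from the start time.
            Map<Long, Long> perSecond = new TreeMap<>();
            for (String line : lines) {
                String[] parts = line.split(",");
                long egress = Long.parseLong(parts[1]); // assumed column order: stimulus,egress,latency
                long bucket = (egress - start) / 1000;
                perSecond.merge(bucket, 1L, Long::sum);
            }

            perSecond.forEach((second, count) ->
                    System.out.println("second " + second + ": " + count + " records"));
        }
    }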

Before proceeding to discuss the benchmarking applications that will be used in this study, it is worth mentioning that there is a caveat with the approaches taken here for capturing latency and throughput. When using timestamps within a distributed system, it is important, as Chandramouli et al highlight, to ensure that the internal clocks of the nodes on which the tests are being executed are synchronised using a standard protocol, such as Network Time Protocol (NTP) (see http://www.ntp.org/ for further information) (Chandramouli, Goldstein, Barga, Riedewald, & Santos, 2011, p. 257). If the clocks are not synchronised, there may be errors within the delivered results. Fortunately, for this study, Amazon Web Services’ EC2 uses the Amazon Time Sync Service to ensure a consistent and accurate time reference across all nodes in a cluster7.

7 See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html

3.4 Benchmarking Applications

The majority of stream processing applications tend to consist of relatively basic stream operations that are chained together to produce a result – for example, filter, map and reduce. From reviewing the literature, it is evident that there are a number of common benchmarking applications that are applied to stream processing frameworks. However, some of these are not suited for this study. Lu et al, for example, argue that Word Count is “widely accepted as a standard micro-benchmark for big data” (Lu, Wu, Xie, & Hu, 2014, p. 72). This benchmark involves, as the name suggests, counting the occurrences of words in a stream. Generally, there can be two types of Word Count: one that counts the words in an individual record, and one that counts the words across the entire stream.

In Lu et al’s study, they take the second approach. Their implementation of Word Count, “first splits the words, then aggregates the total count of each word and updates an in- memory map,” where the word represents the key and the current total count for that word is the value. Aggregative word counting can, generally, be represented with the below DAG. While Word Count is prevalent in many benchmarking works, there is an issue with using it in this study, one that pertains to Spark’s micro-batching computational model and, specifically, the effects it has on capturing throughput and latency.

Figure 4. The directed acyclic graph of Aggregate Wordcount.

StreamBench, as set forth by Lu et al, measures throughput in MB/s, while latency is measured in milliseconds. The throughput metric, they argue, is “the average count of records as well as data size in terms of bytes processed.” Latency, they continue, is “the average time span from the arrival of a record till the end of processing of the record” (Lu, Wu, Xie, & Hu, 2014, p. 72). With tuple-by-tuple stream processing models, measuring these values is relatively simple. However, with Spark’s computational model, the task becomes much harder, if not impossible.

In the micro-batching model, all arriving records are periodically grouped into batches, where a batch is represented as an RDD. Actions are then taken on the entire batch, not single events. When calling each operation in the above Wordcount DAG, a new RDD is created. Therefore, once the first operation is called, it becomes almost impossible to track individual events as they go through the system. As a result of this, the only approach for identifying when an individual record’s processing has been completed is to assume that the end of processing is the stage at which the entire batch completes. This, however, may not give an accurate representation of performance.

In StreamBench, it is entirely possible for the in-memory map to be updated with all of the words that constitute a single event early in the updateStateByKey() stage. Yet, if we assume that the batch end time is the end of processing for all records, the latency for that event may appear significantly higher than it otherwise would be, even though computation has been completed and the results delivered. It’s worth noting here that, as StreamBench’s source code is not publicly available and Lu et al do not go into detail about how they capture metrics, this may not be how they handle the issue of capturing the end of processing for a record. However, Wordcount, as described by Lu et al, is not a suitable benchmarking application for this work. With that said, it does provide insight into the types of applications that can and should be used.

3.4.1 Application Selection Criteria

Given the definitions of latency and throughput as set forth in this study - coupled with the methods being used to capture those metrics - there are several requirements that must be met by any applications used for testing. These primarily relate to results. The application must output its results to an external system (Kafka in this case). Using an in-memory map, as StreamBench’s Wordcount does, is not compatible with the metric-collection approach used here. With that said, a distributed in-memory data structure, such as Redis, would suffice, but as Lu et al do not elaborate on what is meant by “in-memory map” in the context of StreamBench, we cannot assume that it refers to a distributed system.

A second requirement is that the results must have a one-to-one mapping, such that a single input event has a corresponding output result. As specified previously, timestamps appended to events as they move through the stream processing frameworks are the primary means by which latency and throughput are measured. Operations that aggregate results from multiple events cannot be tracked as they progress through Spark. Granted, this excludes a number of common stream processing operations from this benchmarking study, such as filter, and also excludes, as mentioned, tests that are considered standards in the area of Big Data benchmarking, but, ultimately, this is the approach needed to fairly compare the frameworks’ performance.

With the above in mind, the following sections will discuss the benchmarking applications that will be used in this study. Four, in total, have been selected, with each conforming to the above-specified criteria. These applications will be implemented in Java, as each of the frameworks provides a Java API. While this is fewer than the seven used by both Lu et al and Qian et al in their works, this study attempts to apply its applications to four frameworks, as opposed to Lu’s two (Spark and Storm) and Qian’s three (Samza, Spark and Storm). Before discussing the applications, the next section will briefly talk about the dataset that will be used for the experiments.

3.4.2 Experiment Dataset

The dataset that will be used in this study consists of a collection of Twitter messages that were sourced from the general Twitter stream over a period of three days in October, 2017. The dataset was downloaded from archive.org8 as three separate archive files. Each archive consists of a number of individual JSON files that represent the data retrieved from the Twitter stream for a single minute of the day. Each of these JSON files is approximately 12MB in size and contains roughly 2200 Twitter messages. There are several message types that may be delivered via a Twitter stream, including maintenance messages, compliance messages and standard tweet messages9. From a cursory glance, it is evident that the majority of messages within the dataset are standard tweets, with some compliance messages also included.

Tweets, according to the Twitter developer documentation, are the “basic atomic building blocks of all things Twitter”10. The tweet object contains a number of root-level attributes, including created_at, which represents the timestamp of when the object was created, and text, which stores the actual UTF-8 text of the status update. Coupled with this, there are also several child objects, such as, for example, the User object, which contains information about the poster of the tweet. The size of Tweet messages can vary greatly, ranging from around 2KB to 17KB. Status deletion notices, on the other hand - which are compliance messages that indicate a given Tweet has been deleted - contain simply a user id and the id of the deleted tweet. As such, they are relatively small, at about 150 bytes each.

8 See https://archive.org/details/archiveteam-twitter-stream-2017-10

9 See https://developer.twitter.com/en/docs/tweets/filter-realtime/guides/streaming-message-types

10 See https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

It was originally intended that Kafka Connect would be used to feed the data into the relevant topics during the experiments. Connect is a framework included with Kafka that allows for the easy import and export of data to and from topics11. To achieve this, various connectors are offered out-of-the-box. These can fall into one of two categories: the first, source connectors, import data into Kafka, while the second, sink connectors, are used for export. For this study, the FileStreamSource connector was selected. This connector takes an input file location and an output topic name as parameters, and then, once started, streams the input file, line by line, into the configured topic.

Unfortunately, while the FileStreamSource connector worked as intended for smaller files, it was not able to handle larger files (those in the gigabyte range) and would throw an OutOfMemoryError shortly after starting. Upon further investigation, specifically of the issues section of the Apache website12, it became apparent that this was a bug in the connector. A pull request to resolve the problem had been raised and closed13, meaning the issue had been fixed, but, at the time of writing, it had not yet been added into the latest release of Kafka. As a result of this, the initial plan – which consisted of having one uber file (approximately 18GB in size) and a dedicated FileStreamSource connector living on each of the Kafka nodes - was no longer possible.

11 See https://kafka.apache.org/quickstart#quickstart_kafkaconnect

12 See https://issues.apache.org/jira/browse/KAFKA-4335

13 See https://github.com/apache/kafka/pull/4356

To compensate for FileStreamSource’s shortcomings, a custom Java application was built. Called the kafka-file-loader, this Java application reads a file from the local file system and produces it, line by line, to a specified Kafka input topic. At a glance, this new application seems to outperform the connector in terms of messages produced per second. This may be because the connector uses throttling to limit the rate of produced messages. Regardless, only a single instance of the kafka-file-loader is needed for the entire Kafka cluster.

Finally, before moving on to discuss the chosen applications, it is worth briefly mentioning why this dataset in particular was chosen. The justification is twofold. Firstly, social network data processing is one of the main use cases of stream processing frameworks. This is evident in the fact that, as mentioned, several of the frameworks under test either originated from, or were partially developed by, social networking companies. Secondly, the selected data represents the contents of an actual stream. Thus, its use here more closely aligns this study to real world use cases than those other works in which randomly generated data is used. With that said, the following sections describe, in detail, each of the benchmarking applications that will be run for the chosen frameworks.

3.4.3 Identity

The first application selected for this benchmarking study is called Identity. Identity is based on the application of the same name from StreamBench. Its use here, however, differs slightly. In StreamBench, Identity reads an input and takes no operation on it (Lu, Wu, Xie, & Hu, 2014, p. 71). For this study, when the event arrives, it goes through a mapping operation that extracts the stimulus timestamp from the record’s metadata. Incoming events, for reference, are represented by Kafka ConsumerRecord objects14. A ConsumerRecord is essentially a key/value pair object that contains some additional metadata, such as the name of the topic from where the record originated, the offset of the record (its position) in the corresponding Kafka partition, and also its timestamp. Extracting the stimulus time requires simply calling the ConsumerRecord’s timestamp() method.

14 See https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/ConsumerRecord.html


Once the stimulus time has been extracted, the next step is to create the output event that will be emitted to the Kafka results topic. Output events are represented by Kafka ProducerRecords15. As with ConsumerRecords, the producer equivalent is a key/value pair object that contains some additional metadata. For our use case, there are only two fields of the ProducerRecord that are relevant: the first, topic, is used to specify the topic that the record should be sent to, and the second, value, is the value that is included with the record. It is worth clarifying here that, as the definitions of latency and throughput used in this study relate to the “end of processing,” rather than the delivery of results, we do not need to emit the results of processing. As such, the value of the ProducerRecord – not only in this application, but also in the others - consists simply of a list containing the stimulus time and the egress time.

Another point worth mentioning in relation to Kafka’s record objects relates to the key element of the key/value pair combination. A key is purposely not included here, in the benchmarking applications, and also in the Java program that produces the input for these experiments. This is because of the way Kafka assigns records to partitions. If no partition number or key is included in a record, Kafka will distribute the records in a round-robin fashion across the various partitions of a topic. If a partition number is not included, but a key is, Kafka will use a hash of the key to determine which partition to send the record to. Each partition has a single server that acts as the leader, which means it is responsible for handling all read and write requests to the partition. Thus, if we were to include a key, we could affect the load balancing across the Kafka brokers. While this may not have much impact on the results topic, it could have a negative effect on the performance at the input stage. Thus, the key here is left empty to ensure there is no impact on performance during benchmarking.
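
Because the exact operator APIs differ from framework to framework, the framework-agnostic sketch below illustrates the Identity logic using the plain Kafka consumer and producer clients; topic names and broker addresses are placeholders, and the real benchmarks express the same steps through each framework's own operators.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class IdentitySketch {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "kafka1:9092"); // placeholder broker
            consumerProps.put("group.id", "identity-sketch");
            consumerProps.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "kafka1:9092");
            producerProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
            KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
            consumer.subscribe(Collections.singletonList("input-topic")); // placeholder topic

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(100)) {
                    long stimulus = record.timestamp();        // broker append time (LogAppendTime)
                    long egress = System.currentTimeMillis();  // end of processing for this record
                    // No key is set, so results are spread round-robin across partitions.
                    producer.send(new ProducerRecord<>("results-topic", stimulus + "," + egress));
                }
            }
        }
    }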

15 See https://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/producer/ProducerRecord.html


Figure 5. The various operations of the Identity benchmark.

3.4.4 Extraction

The next application used in this study is Extraction. Extraction, as the name suggests, involves extracting a field from an event. It - like all of the other applications in this study - builds upon the steps taken in the previous application. As with Identity, Extraction begins by reading an input from Kafka and then performing a mapping operation on the event. It, too, extracts the timestamp from the event’s metadata, but it also needs to handle the event’s payload, which, for this study, is a Twitter message object. To accomplish this, it first needs to deserialise the value into a more useable format. It does this by creating a JSON object from the event’s JSON string payload.

Once the deserialisation process has been completed, the next step is to attempt to retrieve the ‘text’ field from the object. The word “attempt” is used here because, as mentioned previously, there are numerous different message types delivered via Twitter streams. Some of these messages do not have a text value, so, if the text field is not present, an empty string is returned instead. Regardless of the content of the text value, once the field has been extracted, the final step, as with Identity, is to create a ProducerRecord that includes the stimulus timestamp and the egress timestamp. The record is then emitted to the result topic.
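
The mapping stage of Extraction could be sketched roughly as follows; the snippet assumes the org.json library for parsing, which is one reasonable choice rather than the definitive implementation.

    import org.json.JSONException;
    import org.json.JSONObject;

    public class ExtractionMapper {
        /**
         * Parses the raw JSON payload of a tweet and returns its 'text' field,
         * or an empty string for message types that carry no text.
         */
        public static String extractText(String payload) {
            try {
                JSONObject tweet = new JSONObject(payload);
                return tweet.optString("text", "");
            } catch (JSONException e) {
                // Malformed payloads are treated the same as messages without text.
                return "";
            }
        }
    }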

While the operations for the Extraction application might appear similar to those of Identity - in that both extract fields - there are several important extra steps being performed within the mapping stage, specifically deserialisation and extraction. Deserialisation of complex objects is a significant component of real world use cases of stream processing frameworks. Yahoo’s Streaming Benchmarks, for example, uses a similar deserialisation technique in its ad campaign application. This is important because their framework, they argue, was created to fill a gap in existing benchmarking works. Other benchmarks, they found, did not test “anything close” to real world use cases16. This, as a result, limits the usefulness of those benchmarks. Thus, while the Extraction benchmark is relatively simple in its design, its use here adds additional value to the output of this study.

Figure 6. The various operations of the Extraction benchmark (order of execution may differ slightly across frameworks).

3.4.5 Grep

The third application comprising this study’s benchmarking suite is Grep. Grep is based on the Unix command-line utility of the same name. Like its namesake, the Grep benchmark searches the text of incoming records for the presence of a specified regular expression, specifically, for this study, the hashtag symbol (#). This symbol plays an important role within Twitter. It is used to index keywords or topics, making it more likely for a Tweet to be shown in a Twitter search of the word17. A hashtag can be included anywhere within the text of a Tweet, but not all Tweets contain hashtags. In their 2016 work, Otsuka et al found that, of 8.3 million sampled Tweets, approximately 13% contained at least one hashtag (Otsuka, Wallace, & Chiu, 2016, p. 10).

16 See Background section of README on https://github.com/yahoo/streaming-benchmarks

17 See https://help.twitter.com/en/using-twitter/how-to-use-hashtags

Grep expands upon the Extraction benchmark, in that it not only extracts the ‘text’ field from incoming events, but also evaluates whether the field contains any hashtags. In a production environment, this type of check might be used to filter out messages from a data stream. However, given the need for a one-to-one mapping between input and output in this study for metrics collection, rather than filter the record, a mapping function is instead used. If the record is found to contain a hashtag, a boolean true value is returned. If no hashtag is found, or the event is of a type that does not contain a ‘text’ field, false is instead returned. As with the previous application, the final step is to output an event containing the required timestamps.
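
The additional check performed by Grep amounts to little more than a pattern match on the extracted text, as the hedged fragment below suggests; it reuses the hypothetical extractText helper from the previous sketch.

    import java.util.regex.Pattern;

    public class GrepMapper {
        private static final Pattern HASHTAG = Pattern.compile("#");

        /** Returns true if the tweet's text contains at least one hashtag symbol. */
        public static boolean containsHashtag(String payload) {
            String text = ExtractionMapper.extractText(payload); // hypothetical helper from above
            return HASHTAG.matcher(text).find();
        }
    }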

Grep - and the hashtag symbol - was chosen for several reasons. Firstly, it is an application that is commonly present in Big Data benchmarks. This is because, as Lu et al argue, “it is a simple yet common operation,” one that is likely to be used in many real-world use cases (Lu, Wu, Xie, & Hu, 2014, p. 72). The second reason is because of the growing interest in the development of hashtag recommendation systems, as posited by Otsuka et al. They argue that, as tagging culture becomes more widely adopted - thanks in part to the development of the hashtag search feature in Twitter - many individual users and business marketers have started applying tagging to organise posts into related conversations. This has created a need to develop recommendation systems. And while Otsuka et al’s study uses Hadoop and MapReduce, as Twitter exposes its Tweets via a stream, using stream processing frameworks is the logical choice when it comes to handling the computation of any recommendation systems.

Figure 7. The various operations of the Grep benchmark (order of execution may differ slightly across frameworks).

3.4.6 WordCount

The last of the four applications comprising this study’s benchmarking suite is WordCount. Word counting, as mentioned previously, is a standard benchmark in the field of Big Data processing. Unlike the Wordcount used in StreamBench - which keeps an aggregate total of the words across an entire stream - the implementation used for this study counts the occurrence of words on a per record basis. This approach is taken to conform to the requirement of a one-to-one mapping of input and output events. With that said, many of the operations executed for the aggregate version of the benchmark are also used here. Indeed, in several of the frameworks, modifying the benchmark to run the aggregate version requires changing only a few lines of code.

Of the four benchmarks in this study, WordCount is the most complex in terms of number of steps executed. WordCount, as with Grep, builds upon both the Identity and Extraction benchmarks. When a record arrives, the first step is to parse the event into a more usable JSON object. Once deserialised, the ‘text’ field, if present, is extracted. If there is no ‘text’ field, as with the previous applications, an empty string is returned. If one is present, the field is split into a stream of words using the regex \\W+, which splits on “non-word” characters18. Each value in this stream is then mapped to a key/value pair, where the word represents the key and the value is set to 1. The stream is then reduced by the key, with the value being summed for cases where there are two or more instances of the same word.
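
A per-record word count of this kind can be expressed compactly with the Java Streams API, as in the hedged sketch below; each framework's implementation follows the same split, map and reduce steps using its own operators, and the empty-token filter is a practical addition not described above.

    import java.util.Arrays;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class WordCountMapper {
        /** Counts word occurrences within a single tweet's text. */
        public static Map<String, Long> countWords(String text) {
            return Arrays.stream(text.split("\\W+"))      // split on non-word characters
                    .filter(word -> !word.isEmpty())      // ignore empty tokens from leading symbols
                    .collect(Collectors.groupingBy(
                            word -> word,                 // the word is the key
                            Collectors.counting()));      // occurrences are summed per key
        }
    }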

Figure 8. The various operations of the WordCount benchmark (order of execution may differ slightly across frameworks).

18 See https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html


4 IMPLEMENTATION AND RESULTS

This chapter discusses the implementation of the experiments and the results collected during their running. In total, 20 experiments were executed across three of the frameworks: Flink, Spark and Storm. Unfortunately, whilst implementing the various benchmarks for Samza, it quickly became apparent that it could not be tested using the approach proposed in this study. This is because of the way Samza is integrated with Kafka. Samza provides an interface, called SystemConsumer19, that must be implemented by any systems wishing to integrate with Samza. This interface contains four methods that the implementing class must define: poll(), register(), start() and stop().

KafkaSystemConsumer20 is the concrete implementation of the SystemConsumer interface that is included out-of-the-box with Samza. This class reads messages from Kafka and parses them into IncomingMessageEnvelopes21. These envelopes are then passed to StreamTask22 implementations, which are the basic classes upon which Samza jobs are built. The issue pertaining to this study lies with the fact that the KafkaSystemConsumer implementation does not pass the Kafka timestamp into the IncomingMessageEnvelopes, which themselves do not actually contain any timestamp property to hold that value. Furthermore, the specific section of the KafkaSystemConsumer that is responsible for passing the Kafka message is private, meaning that it cannot be easily extended to modify the functionality.

19 See http://samza.apache.org/learn/documentation/0.14/api/javadocs/index.html?org/apache/samza/system/SystemConsumer.html

20 See https://github.com/apache/samza/blob/master/samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemConsumer.scala

21 See http://samza.apache.org/learn/documentation/0.14/api/javadocs/org/apache/samza/system/IncomingMessageEnvelope.html

22 See https://samza.apache.org/learn/documentation/0.14/api/javadocs/org/apache/samza/task/StreamTask.html

The option of creating a new custom consumer was investigated. Consumers are configured in Samza using the systems.samza.factory value of the individual jobs’ properties files. This parameter tells Samza which SystemFactory23 it should instantiate. The SystemFactory is used by Samza to construct the relevant SystemConsumer (and SystemProducer). Thus, for implementing a custom consumer, a new factory implementation, coupled with a consumer implementation and a new message implementation (required to hold the timestamp), would need to be created. Whilst the factory and envelope implementations are relatively straightforward, the consumer implementation is far more involved.

Generally, according to the API’s documentation (see footnote 19), there are three SystemConsumer implementation styles. These are: thread-based, selector-based and synchronous. Thread-based implementations use a series of threads to read from the underlying system asynchronously. The resulting messages are then put onto a queue, which is read from when the poll() method is invoked. Selector-based implementations, on the other hand, set up NIO-based24 non-blocking sockets that can be selected for new data whenever poll() is called. Finally, synchronous implementations fetch directly from the underlying system upon poll() invocation.

Given the complexity of creating a custom consumer (dealing with multithreading and buffering, etc...), it was decided that the best approach would be to set aside the Samza experiments. This is because any results gained from testing would not be objective, as they would be overly dependent on this researcher’s ability to implement an efficient consumer. Furthermore, the KafkaSystemConsumer, from some quick testing, appears to work as well as one might expect. Thus, there is no obvious reason - aside from the lack of the Kafka timestamp - why an organisation would not use this implementation in production. And as this study seeks to test production-like use cases, this is a further reason not to implement a custom solution.

23 See http://samza.apache.org/learn/documentation/0.14/api/javadocs/org/apache/samza/system/SystemFactory.html

24 See https://docs.oracle.com/javase/8/docs/api/java/nio/package-summary.html


With the above clarified, the following section will discuss the experiment setup, as pertaining to Flink, Spark and Storm. Briefly, however - on a final note regarding Samza - as the framework is still relatively immature (being currently in version 0.14) and timestamps were only recently added to the Kafka platform25, it is possible that coming Samza versions may handle the value. This is highly likely, especially when considering that the Amazon Kinesis26 – a system similar to Kafka – integration currently contains a timestamp value in its IncomingMessageEnvelope implementation27. Thus, a follow-up study could potentially be conducted when the value is added, in which the Samza experiments, as proposed in this study, are run, and the relevant metrics collected.

4.1 Experiment Setup

The experiments, as mentioned in Chapter 3, were executed across two three-node clusters: one dedicated to running Kafka, and one to running the stream processing frameworks. The Kafka cluster’s nodes were labelled as kafka1, kafka2, and kafka3, while the computational nodes were computational1, computational2 and computational3. The following sections will briefly discuss the setup of these clusters and the configuration of relevant applications.

4.1.1 Kafka Cluster Setup

On each of the Kafka nodes, Kafka version 1.0.0, built for Scala version 2.11, was installed. On kafka1, Zookeeper version 3.4.11 was installed. kafka1 also hosted the experiment dataset, along with the lightweight kafka-file-loader application that was used to feed the dataset into Kafka. Finally, kafka2 hosted the metrics-writer application, which was responsible for consuming the results output by the computational nodes, calculating latency, and then writing the results to the local filesystem.

25 See https://kafka.apache.org/0100/documentation.html#upgrade

26 See https://aws.amazon.com/kinesis/

27 See https://github.com/apache/samza/blob/master/samza-aws/src/main/java/org/apache/samza/system/kinesis/consumer/KinesisIncomingMessageEnvelope.java


Node      Installed Software Versions
kafka1    kafka_2.11-1.0.0, zookeeper-3.4.11, kafka-file-loader
kafka2    kafka_2.11-1.0.0, metrics-writer
kafka3    kafka_2.11-1.0.0

Table 1. Kafka cluster installed software versions.

In terms of Kafka broker setup, the following configuration was used: num.partitions, which represents the default number of log partitions per topic, was set to 30. Partitions essentially represent the unit of parallelism within Kafka. This is because Kafka always gives a single partition’s data to one consumer thread. Therefore, the degree of parallelism for consumers is bounded by the number of partitions being consumed. According to information posted on the Confluent website28 (which was founded by the team that built Kafka), more partitions can lead to higher throughput, but this comes at the cost of increased end-to-end latency, which is defined by “the time from when a message is published by the producer to when the message is read by the consumer.”

As a rule of thumb, when trying to balance latency and throughput considerations, Confluent recommends limiting the maximum number of partitions per broker to 100 * b * r, where b is the number of brokers in a Kafka cluster and r is the replication factor. Given the setup of the computational cluster used in this study, coupled with the individual configurations of the frameworks under test (both of which will be discussed shortly), 30 partitions was identified as being an adequate number, providing the required throughput, whilst also maintaining a sufficiently low end-to-end latency.
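To put the rule of thumb in concrete terms: with the three-broker cluster used here (b = 3) and, purely for the sake of the arithmetic, a replication factor of r = 2, the recommended ceiling would be 100 × 3 × 2 = 600 partitions per broker, so the 30 partitions configured for this study sit comfortably within that guideline.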

Aside from num.partitions, the only other values changed from their default settings within Kafka were auto.create.topics.enable and log.message.timestamp.type, the latter of which, as discussed previously, was set to LogAppendTime.

28 See https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/


auto.create.topics.enable relates to Kafka's behaviour when a consumer or producer tries to read from, or write to, a topic that doesn't exist. This was set to true as a convenience, as each experiment requires the creation of two new topics: one for the incoming data feed and one for the outgoing results. Fresh topics were created for each experiment to ensure that no messages from previous experiments were inadvertently consumed by subsequent ones. Finally, Zookeeper, for reference, was left at its default configuration.
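Taken together, the non-default broker settings amount to only a few entries in each broker's server.properties file. The sketch below lists just those entries, not the full configuration used:

    # Default number of log partitions per topic
    num.partitions=30
    # Create topics automatically on first read or write
    auto.create.topics.enable=true
    # Stamp records with the broker's append time rather than the producer's create time
    log.message.timestamp.type=LogAppendTime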

4.1.2 Computational Cluster Setup

On the computational cluster, Flink version 1.4.1 (bundled with binaries for Hadoop 2.8), Spark version 2.2.1 (prebuilt for Apache Hadoop 2.7 and later), and Storm version 1.2.1 were installed on each of the three nodes. It was intended that Flink and Spark would be tested in both standalone cluster mode and also running on YARN, so the latest release of Hadoop, version 3.0.0, was also installed across the cluster. Upon testing Flink and YARN, however, it emerged that there was a bug in version 1.4.1 (and also 1.4.0), which prevented jobs from running on YARN. The specific error related to vcore detection. Flink would throw an exception whenever a job was submitted, stating that the number of configured vcores for the job was greater than the number available on YARN.

Several proposed solutions were attempted to correct the issue, such as explicitly declaring the number of vcores in YARN’s configuration (using yarn.nodemanager.resource.cpu-vcores), but the problem persisted. A quick test showed that Flink version 1.3.2 worked as expected, but as the benchmarking applications had been built using Flink’s 1.4.1 API, they no longer functioned with the older version. Furthermore, version 1.3.2 is currently two versions behind the latest release. With version 1.5.0 on the horizon, it would make little sense to test the older version of Flink, just to see how it performs when running on YARN. As such, it was decided that only Flink’s standalone cluster mode would be tested for this study.
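For reference, the attempted override is an entry in yarn-site.xml of the following shape. The value of 8 matches the per-node default that YARN reported for this cluster; the snippet is illustrative only, since the override did not resolve the bug.

    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>8</value>
    </property>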

In terms of node designation, computational1 was configured as the master node for Flink and Spark's standalone cluster modes. It was also configured as the nimbus node for Storm, and the resource manager for YARN.


computational2 and computational3, meanwhile, were listed as slave/worker nodes for each framework. This was also the case for computational1 because, in relatively small clusters such as the one used in this study, the work required of any master process is minimal. This therefore allows the node to double as an additional worker.

Node              Installed Software Versions
computational1    flink-1.4.1, hadoop-3.0.0, spark-2.2.1-bin-hadoop2.7, apache-storm-1.2.1
computational2    flink-1.4.1, hadoop-3.0.0, spark-2.2.1-bin-hadoop2.7, apache-storm-1.2.1
computational3    flink-1.4.1, hadoop-3.0.0, spark-2.2.1-bin-hadoop2.7, apache-storm-1.2.1

Table 2. Computational cluster installed software versions.

With the setup of the experiment clusters now clarified, the following section will discuss some observations and details regarding the implementation of the benchmarking applications, the specific configuration of each framework, and any other points deemed noteworthy.

4.2 Implementation Details

The various applications created for this study can be found on GitHub29. Each framework's benchmarking applications are bundled into individual projects, which, when built, produce single executable Java jar files.

29 See https://github.com/jonathan-curtis/stream-benchmark


The Factory Design Pattern30 is used within these projects to determine, at runtime, which benchmark to select, based on passed-in parameters. There are several implementation details that should be elaborated upon, pertaining mainly to the way in which the Kafka timestamp is extracted from incoming messages. There are also various configurations that must be discussed.

4.2.1 Extracting Kafka Timestamp

Each framework required a different approach for extracting the Kafka timestamp. For Storm, some minor development work was required, while for Flink, extra configuration needed to be set. Spark, on the other hand, provided the functionality for extracting the timestamp straight out of the box. Beginning with Flink, the documentation related to retrieving the Kafka timestamp is somewhat vague. Timestamp extraction is mentioned briefly in the Kafka connector section of the website31. The documentation states that the FlinkKafkaConsumer010 class will emit records with the timestamp attached if the “time characteristic” in Flink is set to event time. This requires calling setStreamTimeCharacteristic() on the StreamExecutionEnvironment and passing in TimeCharacteristic.EventTime.

One might initially assume that calling the Kafka consumer, with the time characteristic set, returns a Kafka ConsumerRecord object, or something similar, that has the timestamp included. This, however, is not the case. Rather, calling the consumer returns a deserialised instance of the Kafka message’s value. Accessing the timestamp requires additional knowledge of Flink’s processing model.

The basic building blocks of streaming applications in Flink consist of events, state and timers32.

30 See https://github.com/jonathan-curtis/stream-benchmark/blob/master/flink-applications/src/main/java/com/curtis/benchmarking/BenchmarkFactory.java as an example

31 See https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/kafka.html

32 See https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/process_function.html


Each record ingested into Flink has some metadata attached to it, which falls into the state and timers categories. This metadata is accessed via the low-level ProcessFunction operation, which exposes the data via a Context object. It is from this context that the Kafka timestamp must be retrieved. Thus, extracting the timestamp for this study required the implementation of a custom ProcessFunction33, which was passed into the process() method called on the stream returned by the FlinkKafkaConsumer010.
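A minimal sketch of this arrangement is given below. It is illustrative rather than a copy of the benchmarking code: the broker address, topic and output handling are placeholders, but setStreamTimeCharacteristic(), FlinkKafkaConsumer010, process() and Context.timestamp() are the API calls described above.

    import java.util.Properties;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
    import org.apache.flink.util.Collector;

    public class TimestampExtractionSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka1:9092"); // placeholder broker address
            props.setProperty("group.id", "flink-benchmark");      // placeholder consumer group

            DataStream<String> stream = env.addSource(
                    new FlinkKafkaConsumer010<>("benchmark-input", new SimpleStringSchema(), props));

            // The Kafka timestamp is only reachable through the low-level ProcessFunction's Context.
            DataStream<Tuple2<Long, String>> withTimestamps =
                    stream.process(new ProcessFunction<String, Tuple2<Long, String>>() {
                        @Override
                        public void processElement(String value, Context ctx, Collector<Tuple2<Long, String>> out) {
                            out.collect(new Tuple2<>(ctx.timestamp(), value)); // broker LogAppendTime
                        }
                    });

            withTimestamps.print(); // the real applications forward results to a Kafka sink instead
            env.execute("timestamp-extraction-sketch");
        }
    }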

Storm's processing model warranted a different approach for extracting the timestamp. As mentioned in Chapter 2, Storm consists of spouts and bolts: nodes that are responsible for ingesting and processing data, respectively. Storm passes data between these nodes in the form of tuples, which are essentially ordered lists of values. Storm's Kafka integration is provided via the KafkaSpout class34. This class requires that a RecordTranslator35 be provided in order to deserialise each ConsumerRecord into a tuple. The default RecordTranslator implementation, DefaultRecordTranslator36, does not extract the timestamp, so a custom translator needed to be created37. This TimestampRecordTranslator is almost identical to the default implementation and, therefore, should have no negative effect on Storm's performance.
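For illustration, a translator of this kind can take the following shape. The field names here are assumptions rather than those used in the actual TimestampRecordTranslator, but apply() and getFieldsFor() are the methods defined by the storm-kafka-client RecordTranslator interface.

    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.storm.kafka.spout.RecordTranslator;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    // Emits the record's broker timestamp alongside the usual fields.
    public class TimestampTranslatorSketch<K, V> implements RecordTranslator<K, V> {

        private static final Fields FIELDS =
                new Fields("topic", "partition", "offset", "timestamp", "key", "value");

        @Override
        public List<Object> apply(ConsumerRecord<K, V> record) {
            // record.timestamp() carries LogAppendTime when the broker is configured as in 4.1.1
            return new Values(record.topic(), record.partition(), record.offset(),
                    record.timestamp(), record.key(), record.value());
        }

        @Override
        public Fields getFieldsFor(String stream) {
            return FIELDS;
        }
    }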

4.2.2 Framework Configuration

In total, 20 benchmarking experiments were executed across 5 different configurations of the frameworks under test. These configurations were:

33 See https://github.com/jonathan-curtis/stream-benchmark/blob/master/flink-applications/src/main/java/com/curtis/utils/ExtractKafkaTimestamp.java

34 See https://storm.apache.org/releases/1.2.1/storm-kafka.html

35 See https://github.com/apache/storm/blob/master/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/RecordTranslator.java

36 See https://github.com/apache/storm/blob/master/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/DefaultRecordTranslator.java

37 See https://github.com/jonathan-curtis/stream-benchmark/blob/master/storm-applications/src/main/java/com/curtis/kafka/TimestampRecordTranslator.java


• Flink running in Standalone Cluster mode
• Spark Standalone Cluster mode
• Spark running on YARN with default configuration
• Spark running on YARN with maximum requested executors
• Storm

As mentioned, when running Flink in standalone mode, the cluster was started with one JobManager (master) and three TaskManagers (slaves/workers). Spark Standalone was started with one master and three slaves. The configuration of these components, in both cases, was left at its default values. This was done to ensure that a fair comparison was conducted between them because, ultimately, there are far more configuration options than a single researcher could possibly test within a reasonable timeframe, and each option could potentially have a drastic effect on performance. Providing no configuration beyond the most basic required for the cluster to function also removes the possibility of errors being introduced that could affect the objectivity of the results.

For Spark running on YARN, two different configurations were used. When initially set up on the cluster, YARN identified that there were 24 vcores available across the three instances, coupled with 24GB of memory. The concept of a vcore in YARN differs from that of AWS: YARN's vcore represents the “usage share of a host CPU38,” not an abstraction of a physical core. For Spark on YARN's first configuration, the benchmarking applications were simply submitted to YARN without overriding any values. This resulted in YARN assigning only two vcores to the application (one for the Spark driver process and one for the executor performing the application logic). For the second configuration, the num-executors property was overridden from its default value of 1; 20 executors were requested, but YARN would only assign a maximum of 11.
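The second configuration was requested along the following lines. The main class, jar name and benchmark argument below are placeholders; --master and --num-executors are the standard spark-submit options.

    # Placeholders: the main class, jar and benchmark argument are illustrative only.
    spark-submit \
      --master yarn \
      --deploy-mode client \
      --num-executors 20 \
      --class com.curtis.benchmarking.BenchmarkRunner \
      spark-applications.jar identity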

With regards to Storm's configuration, one nimbus node and three supervisor nodes were started. In order to allow a comparison between Storm and Spark running on YARN, Storm's default values were overridden to take full advantage of the resources on the cluster.

38 See https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cdh_ig_yarn_tuning.html


Specifically, the number of workers was set to 3 programmatically within the code, meaning that one worker process should, in theory, be instantiated for each of the supervisor nodes. Furthermore, the number of executors for the various components making up the topologies in the benchmarking applications was set to 6. According to the documentation39, this means that, again in theory, each of the EC2 vcores should spawn between two and four threads, depending on the topology being run. With 6 cores, that means there should be a maximum of 24 threads, which in some ways equates to YARN's maximum vcore value of 24. As a side note, while Storm can be run on YARN, the integration project40 is - at the time of writing - self-described as a “work in progress.” As such, it was excluded from this study.
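A sketch of how these values are set programmatically is given below. The component names, broker address and topic are placeholders, and the no-op bolt merely stands in for the benchmark bolts; Config.setNumWorkers() and the parallelism-hint arguments to setSpout()/setBolt() are the standard Storm APIs.

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.kafka.spout.KafkaSpout;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    public class ParallelismSketch {

        // Stand-in for the benchmark bolts; it does nothing and declares no output fields.
        public static class NoOpBolt extends BaseBasicBolt {
            @Override public void execute(Tuple tuple, BasicOutputCollector collector) { }
            @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // 6 executors each for the spout and the bolt (broker address and topic are placeholders).
            builder.setSpout("kafka-spout",
                    new KafkaSpout<>(KafkaSpoutConfig.builder("kafka1:9092", "benchmark-input").build()), 6);
            builder.setBolt("benchmark-bolt", new NoOpBolt(), 6).shuffleGrouping("kafka-spout");

            Config conf = new Config();
            conf.setNumWorkers(3); // one worker process per supervisor node, in theory

            StormSubmitter.submitTopology("identity-benchmark", conf, builder.createTopology());
        }
    }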

4.3 Experiment Results

The experiments were executed over a three-day period in February, 2018. The execution time of individual experiments varied from a minimum of just under 12 minutes to a maximum of slightly over 34 minutes. Each experiment required a significant amount of preparation time beforehand, and some time afterwards, which was used to ensure that the results were extracted from the output topic. Some experiments needed to be executed multiple times because of failures or other issues. Each experiment was conducted against a dataset consisting of 13,124,700 individual records. It took, on average, approximately 5 minutes for the kafka-file-loader application to feed the data into the input topics on Kafka. This was verified by examining the metadata of the logs for the kafka-file-loader application (which in one instance, for example, showed a creation time of 07:54:24 and a final modified time of 07:59:04).

It is worth briefly clarifying that the approach taken in this study, unlike in others, does not limit the Kafka event generation rate (which is approximately 44,000 events per second). This is done to ensure that the maximum throughput can be identified.

39 See https://storm.apache.org/releases/1.2.1/Understanding-the-parallelism-of-a-Storm-topology.html

40 See https://github.com/yahoo/storm-yarn


As a result of this approach, coupled with the method used for measuring latency, the observed latency within this study might seem incredibly high in some cases, especially when compared to other works. This is because the latency is measured from when the event is added to the Kafka topic, not from when it is ingested into the framework. As the records are added at a faster rate (as will be seen in the results below) than the rate at which the frameworks can consume them, for many of the benchmarks the records can sit in the message queue for a significant amount of time before being processed. This, however, demonstrates that the message generation process is not acting as a bottleneck for the system, something which was identified as a concern in the methodology section.
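In concrete terms, one plausible form of the calculation performed by the metrics-writer is sketched below. It assumes that each result on the output topic carries the original ingest timestamp in its payload and that the result record's own timestamp is the broker's LogAppendTime; the topic name and record layout are placeholders rather than the exact ones used.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class LatencyCalculationSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka2:9092"); // placeholder broker address
            props.put("group.id", "metrics-writer");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("benchmark-output")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    for (ConsumerRecord<String, String> record : records) {
                        // Assumption: the first field of the result value is the original ingest timestamp.
                        long ingestTimestamp = Long.parseLong(record.value().split(",")[0]);
                        long latencyMs = record.timestamp() - ingestTimestamp; // LogAppendTime of the result
                        System.out.println(latencyMs); // the real metrics-writer persists this to disk
                    }
                }
            }
        }
    }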

Arguably, the definition of latency used in this study is a more relevant interpretation of the value. This is because, in many cases – such as for targeted advertising systems – results are needed in real-time. As such, the time records spend in message queues waiting for processing should be taken into consideration. With that said, when the latency is measured in this study via the frameworks’ internal reporting tools (where applicable), it appears to be more in line with what other works have reported when executing similar benchmarking applications.

Figure 9 below shows the reported processing time for Spark, which was taken from the Spark UI during the Identity benchmark (where the framework was running on YARN using the default configuration). It shows an average latency in milliseconds, rather than the seconds, and even minutes, that were observed for the metric using the definition adopted in this study. In fact, the value of 599ms is almost identical to the value reported by Lu et al in their work (Lu, Wu, Xie, & Hu, 2014, p. 75). This makes sense, as they define latency as the average time span from the arrival of a record until the end of its processing. One thing worth noting, however, is that, as a consequence of the approach taken, the results of this work cannot necessarily be compared with those of other studies, unless a similar methodology has been used.


Figure 9. Spark's processing time, which shows values in milliseconds.

The following sections will now present the results that were obtained from executing the various experiments. The results are presented in two formats. The first is a line chart that plots the throughput per second captured while executing the experiment. The second is a table that shows a statistical summary of the values captured for both latency and throughput. Where applicable, certain results or configuration details may be discussed.

4.3.1 Identity

4.3.1.1 Flink

The mean throughput rate for Flink while executing the Identity benchmark was 15,928 events per second. All 13.1 million records were processed in a little over 800 seconds. Figure 10 below shows the throughput per second as captured throughout the entire length of the experiment. In terms of latency, table 3 shows a statistical summary of the observed values. The mean latency was 64,777ms, with a minimum of 0ms and a maximum of 337,939ms. The standard deviation for the latency was 104,111ms.


[Line chart: events per second against elapsed time in seconds.]

Figure 10. The throughput of Flink for the Identity benchmark.

Flink – Identity Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 824 (seconds)

mean 64777.22277 (ms) 15928.03398 (events/s)

standard deviation 104111.61869 (ms) 4752.17413 (events/s)

min 0 (ms) 6779 (events/s)

max 337939 (ms) 29092 (events/s)

Table 3. The statistical summary of the observed latencies and throughput collected from Flink during the Identity benchmark.

4.3.1.2 Spark Standalone Cluster

In contrast to Flink, Spark running in standalone cluster mode only managed to process a fraction of the total records. Of the 13 million records fed into the system, fewer than 5.5 million results were delivered to the output topic. As fault tolerance was not in scope for this study, the exact reason why so many records failed was not thoroughly investigated, but from a brief search of the documentation, several possibilities appeared likely.


For example, the problem may be related to a thread - or threads - not being given enough CPU time. Alternatively, it could be caused by a timeout when reading from the Kafka input topic. Regardless, the mean throughput rate was 7,825 events per second, with the experiment running for a total of 702 seconds. The mean latency was approximately 195,306ms, with a minimum of 1,342ms and a maximum of 425,351ms (see the figure and table below for a more precise breakdown).

It is worth pointing out here that for each of the experiments conducted across all three of the Spark configurations, the batch interval was set to one second. This value was selected because the Spark documentation41 advises that the interval should be set to a value where the batch processing time is less than the interval. As can be seen in figure 12, which shows the UI screen for the Spark standalone Identity benchmark, the processing time is well below the one-second mark. Therefore, this value provides an optimal balance between throughput and latency.
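For reference, the batch interval is fixed when the streaming context is created, roughly as follows (a sketch assuming the DStream-based Spark Streaming API; the application name is a placeholder):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class BatchIntervalSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("identity-benchmark"); // placeholder name
            // The one-second batch interval discussed above is fixed here.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

            // ... build the Kafka input stream and benchmark logic on jssc ...

            jssc.start();
            jssc.awaitTermination();
        }
    }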

[Line chart: events per second against elapsed time in seconds.]

Figure 11. The throughput of Spark running in standalone cluster mode for the Identity benchmark.

41 See https://spark.apache.org/docs/2.2.1/streaming-programming-guide.html#setting-the-right-batch-interval


Spark Standalone Cluster – Identity Benchmark

Summary Latency Statistics Throughput Statistics

count 5493085 (events) 702 (seconds)

mean 195306.04521 (ms) 7824.90883 (events/s)

standard deviation 130141.35115 (ms) 3954.18812 (events/s)

min 1342 (ms) 0 (events/s)

max 425351 (ms) 22394 (events/s)

Table 4. The statistical summary of the observed latencies and throughput collected from Spark (standalone) during the Identity benchmark.

Figure 12. The Spark UI screen showing the various parameters for the Identity benchmark being run on the Spark standalone cluster.


4.3.1.3 Spark YARN Default Configuration

In contrast to the standalone cluster mode, Spark running on YARN with the default configuration performed significantly worse in terms of both throughput and latency. This, realistically, was to be expected, considering that the entire process ran via a single vcore within YARN. Like in standalone mode, it too only processed a fraction of the overall dataset. The mean latency for the Identity benchmark was also the highest of all the framework/configuration combinations, coming in at 494,368ms.

[Line chart: events per second against elapsed time in seconds.]

Figure 13. The throughput of Spark running on YARN with default configuration for the Identity benchmark.

Spark YARN (default config) Throughput – Identity Benchmark

Summary Latency Statistics Throughput Statistics

count 5715831 (events) 1230 (seconds)

mean 494368.41218 (ms) 4647.01707 (events/s)

standard deviation 259433.21035 (ms) 785.86049 (events/s)

min 1094 (ms) 0 (events/s)

max 932905 (ms) 6695 (events/s)

Table 5. The statistical summary of the observed latencies and throughput collected during the Identity benchmark from Spark running on YARN with default configuration.


4.3.1.4 Spark YARN Max Executor Configuration

[Line chart: events per second against elapsed time in seconds.]

Figure 14. The throughput of Spark running on YARN with maximum requested executors for the Identity benchmark.

Spark YARN (max executors) Throughput – Identity Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 836 (seconds)

mean 38109.28682 (ms) 15699.40191 (events/s)

standard deviation 19954.36698 (ms) 4069.13669 (events/s)

min 363 (ms) 116 (events/s)

max 68394 (ms) 31859 (events/s)

Table 6. The statistical summary of the observed latencies and throughput collected during the Identity benchmark from Spark running on YARN with maximum requested executors.


4.3.1.5 Storm

[Line chart: events per second against elapsed time in seconds.]

Figure 15. The throughput of Storm for the Identity benchmark.

Storm – Identity Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 1178 (seconds)

mean 57.68940 (ms) 11141.51104 (events/s)

standard deviation 111.03162 (ms) 2830.84451 (events/s)

min 0 (ms) 3113 (events/s)

max 1211 (ms) 22922 (events/s)

Table 7. The statistical summary of the observed latencies and throughput collected from Storm during the Identity benchmark.


4.3.2 Extraction

4.3.2.1 Flink

[Line chart: events per second against elapsed time in seconds.]

Figure 16. The throughput of Flink for the Extraction benchmark.

Flink – Extraction Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 1116 (seconds)

mean 288980.28997 (ms) 11760.48387 (events/s)

standard deviation 146899.08974 (ms) 1014.57978 (events/s)

min 18 (ms) 1550 (events/s)

max 551300 (ms) 14358 (events/s)

Table 8. The statistical summary of the observed latencies and throughput collected from Flink during the Extraction benchmark


4.3.2.2 Spark Standalone Cluster

[Line chart: events per second against elapsed time in seconds.]

Figure 17. The throughput of Spark running in standalone cluster mode for the Extraction benchmark.

Spark Standalone Cluster – Extraction Benchmark

Summary Latency Statistics Throughput Statistics

count 5895172 (events) 1014 (seconds)

mean 384066.12264 (ms) 5813.77909 (events/s)

standard deviation 190177.01530 (ms) 3085.29367 (events/s)

min 1224 (ms) 0 (events/s)

max 701777 (ms) 11787 (events/s)

Table 9. The statistical summary of the observed latencies and throughput collected from Spark (standalone) during the Extraction benchmark.


4.3.2.3 Spark YARN Default Configuration

[Line chart: events per second against elapsed time in seconds.]

Figure 18. The throughput of Spark running on YARN with default configuration for the Extraction benchmark.

Spark YARN (default config) Throughput – Extraction Benchmark

Summary Latency Statistics Throughput Statistics

count 7761199 (events) 1827 (seconds)

mean 734043.84565 (ms) 4248.05638 (events/s)

standard deviation 416098.89801 (ms) 684.93543 (events/s)

min 922 (ms) 1 (events/s)

max 1446107 (ms) 7099 (events/s)

Table 10. The statistical summary of the observed latencies and throughput collected during the Extraction benchmark from Spark running on YARN with default configuration.


4.3.2.4 Spark YARN Max Executor Configuration

[Line chart: events per second against elapsed time in seconds.]

Figure 19. The throughput of Spark running on YARN with maximum requested executors for the Extraction benchmark.

Spark YARN (max executors) Throughput – Extraction Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 977 (seconds)

mean 155671.71113 (ms) 13433.67451 (events/s)

standard deviation 74724.56296 (ms) 2906.90378 (events/s)

min 1231 (ms) 0 (events/s)

max 254238 (ms) 22990 (events/s)

Table 11. The statistical summary of the observed latencies and throughput collected during the Extraction benchmark from Spark running on YARN with maximum requested executors.


4.3.2.5 Storm

[Line chart: events per second against elapsed time in seconds.]

Figure 20. The throughput of Storm for the Extraction benchmark.

Storm – Extraction Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 1182 (seconds)

mean 51.812021 (ms) 11103.80711 (events/s)

standard deviation 101.25425 (ms) 2699.95988 (events/s)

min 2 (ms) 1683 (events/s)

max 1115 (ms) 22804 (events/s)

Table 12. The statistical summary of the observed latencies and throughput collected from Storm during the Extraction benchmark.


4.3.3 Grep

4.3.3.1 Flink

[Line chart: events per second against elapsed time in seconds.]

Figure 21. The throughput of Flink for the Grep benchmark.

Flink – Grep Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 1108 (seconds)

mean 286008.042914 (ms) 11845.39711 (events/s)

standard deviation 148620.72165 (ms) 1090.00641 (events/s)

min 30 (ms) 2015 (events/s)

max 576923 (ms) 14179 (events/s)

Table 13. The statistical summary of the observed latencies and throughput collected from Flink during the Grep benchmark.


4.3.3.2 Spark Standalone Cluster

[Line chart: events per second against elapsed time in seconds.]

Figure 22. The throughput of Spark running in standalone cluster mode for the Grep benchmark.

Spark Standalone Cluster – Grep Benchmark

Summary Latency Statistics Throughput Statistics

count 5455689 (events) 959 (seconds)

mean 385411.54713 (ms) 5688.93535 (events/s)

standard deviation 178091.83558 (ms) 3211.756189 (events/s)

min 1247 (ms) 0 (events/s)

max 674593 (ms) 12891 (events/s)

Table 14. The statistical summary of the observed latencies and throughput collected from Spark (standalone) during the Grep benchmark


4.3.3.3 Spark YARN Default Configuration

[Line chart: events per second against elapsed time in seconds.]

Figure 23. The throughput of Spark running on YARN with default configuration for the Grep benchmark.

Spark YARN (default config) Throughput – Grep Benchmark

Summary Latency Statistics Throughput Statistics

count 6129071 (events) 1471 (seconds)

mean 604132.30503 (ms) 4166.60163 (events/s)

standard deviation 324902.24041 (ms) 671.48590 (events/s)

min 1135 (ms) 0 (events/s)

max 1154017 (ms) 6281 (events/s)

Table 15. The statistical summary of the observed latencies and throughput collected during the Grep benchmark from Spark running on YARN with default configuration.


4.3.3.4 Spark YARN Max Executor Configuration

[Line chart: events per second against elapsed time in seconds.]

Figure 24. The throughput of Spark running on YARN with maximum requested executors for the Grep benchmark.

Spark YARN (max executors) Throughput – Grep Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 977 (seconds)

mean 152515.67566 (ms) 13433.67451 (events/s)

standard deviation 75051.73238 (ms) 2924.32218 (events/s)

min 478 (ms) 112 (events/s)

max 251879 (ms) 24472 (events/s)

Table 16. The statistical summary of the observed latencies and throughput collected during the Grep benchmark from Spark running on YARN with maximum requested executors.


4.3.3.5 Storm

When executing the Grep benchmark for Storm, some unusual behaviour was observed. Just past the midway point of the experiment, the latency increased dramatically, while the throughput decreased. The KafkaSpout then began reporting numerous failures (approximately 500,000). After several minutes, the problem resolved itself. This was initially assumed to be an issue with the cluster, so the experiment was rerun. Shortly into the new execution, a similar issue occurred. This one, however, did not resolve itself and resulted in an OutOfMemoryError being thrown, which ultimately ended the experiment.

[Line chart: events per second against elapsed time in seconds.]

Figure 25. The throughput of Storm for the Grep benchmark.

The issue appeared to be related solely to the KafkaSpout. However, no additional configuration parameters had been passed to it during creation that might cause the problem. Furthermore, the Storm documentation42 explicitly claims that the spout's default configuration has been shown to give good performance. A quick search of a number of Storm forums made it apparent that many other users have experienced a similar issue.

42 See http://storm.apache.org/releases/1.2.1/storm-kafka-client.html


Interestingly, the results show that, by the time the experiment finished, more output events had been received than input events. This is down to the way Storm guarantees message processing and handles message failures43. When a tuple fails, depending on the configuration, Storm will retry that tuple. A tuple is considered failed when the messages that originate from it are not fully processed within a specified timeout, which defaults to 30 seconds. What likely happened, therefore, is that the sudden drop in throughput and increase in latency caused Storm to mark the messages sent during the period of the issue as failed, even though they were eventually delivered. This resulted in over half a million additional results being received.
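For reference, the timeout in question is a topology-level setting. The sketch below shows how it could be raised when building the topology configuration; no such override was applied in these experiments, and the value shown is purely illustrative.

    import org.apache.storm.Config;

    public class TimeoutConfigSketch {
        // Returns a Config with a longer tuple timeout than the 30-second default.
        public static Config withLongerTimeout() {
            Config conf = new Config();
            // Tuples not fully acked within this window are marked as failed and replayed.
            conf.setMessageTimeoutSecs(120); // illustrative value, not one used in this study
            return conf;
        }
    }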

Storm – Grep Benchmark

Summary Latency Statistics Throughput Statistics

count 13679588 (events) 2041 (seconds)

mean 420701.02527 (ms) 6702.39490 (events/s)

standard deviation 508313.53364 (ms) 3450.19869 (events/s)

min 4 (ms) 256 (events/s)

max 1540202 (ms) 17118 (events/s)

Table 17. The statistical summary of the observed latencies and throughput collected from Storm during the Grep benchmark.

43 See http://storm.apache.org/releases/1.2.1/Guaranteeing-message-processing.html


4.3.4 WordCount

4.3.4.1 Flink

[Line chart: events per second against elapsed time in seconds.]

Figure 26. The throughput of Flink for the WordCount benchmark.

Flink – WordCount Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 1247 (seconds)

mean 381147.70771 (ms) 10525.02005 (events/s)

standard deviation 230201.81532 (ms) 980.68934 (events/s)

min 191 (ms) 838 (events/s)

max 786456 (ms) 14279 (events/s)

Table 18. The statistical summary of the observed latencies and throughput collected from Flink during the WordCount benchmark.


4.3.4.2 Spark Standalone Cluster

[Line chart: events per second against elapsed time in seconds.]

Figure 27. The throughput of Spark running in standalone cluster mode for the WordCount benchmark.

Spark Standalone Cluster – WordCount Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 2039 (seconds)

mean 657112.34765 (ms) 6436.83178 (events/s)

standard deviation 410260.88862 (ms) 2965.60391 (events/s)

min 858 (ms) 0 (events/s)

max 1389500 (ms) 14807 (events/s)

Table 19. The statistical summary of the observed latencies and throughput collected from Spark (standalone) during the WordCount benchmark


4.3.4.3 Spark YARN Default Configuration

[Line chart: events per second against elapsed time in seconds.]

Figure 28. The throughput of Spark running on YARN with default configuration for the WordCount benchmark.

Spark YARN (default config) Throughput – WordCount Benchmark

Summary Latency Statistics Throughput Statistics

count 5590635 (events) 1363 (seconds)

mean 566854.65469 (ms) 4101.71313 (events/s)

standard deviation 299176.64160 (ms) 720.52493 (events/s)

min 1000 (ms) 0 (events/s)

max 1074834 (ms) 6273 (events/s)

Table 20. The statistical summary of the observed latencies and throughput collected during the WordCount benchmark from Spark running on YARN with default configuration.


4.3.4.4 Spark YARN Max Executor Configuration

[Line chart: events per second against elapsed time in seconds.]

Figure 28. The throughput of Spark running on YARN with maximum requested executors for the WordCount benchmark.

Spark YARN (max executors) Throughput – WordCount Benchmark

Summary Latency Statistics Throughput Statistics

count 13124700 (events) 994 (seconds)

mean 164852.14372 (ms) 13203.92354 (events/s)

standard deviation 79218.75152 (ms) 2884.82487 (events/s)

min 536 (ms) 206 (events/s)

max 274131 (ms) 23158 (events/s)

Table 21. The statistical summary of the observed latencies and throughput collected during the WordCount benchmark from Spark running on YARN with maximum requested executors.


4.3.4.5 Storm

As was the case with Storm's Grep benchmark, some unusual behaviour was seen during the execution of WordCount. The Storm UI, figure 29, showed that the issue began relatively early in the process, with over 400,000 messages being marked as failed out of the first 2,000,000. For reference, the UI shows that no tuples have been emitted by the bolt responsible for forwarding the results to the Kafka output topic. The documentation is a little vague on this, but the reason appears to be that the emitted metric represents the number of times the emit() method is called on the OutputCollector class. This method is responsible for emitting new tuples to the default stream44. As these tuples are being forwarded to Kafka, rather than emitted to a stream for further processing, the emit() method is not called, resulting in the 0 evident in figure 29.
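To illustrate why the metric reads zero, a bolt of this shape writes its results directly to Kafka and acks the incoming tuple without ever calling emit(). The sketch below is an illustration only - the topic, field name and producer settings are placeholders, not the exact bolt used in the benchmarking applications.

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    public class KafkaForwardingBoltSketch extends BaseRichBolt {

        private transient KafkaProducer<String, String> producer;
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka1:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            this.producer = new KafkaProducer<>(props);
        }

        @Override
        public void execute(Tuple input) {
            // The result goes straight to the output topic; collector.emit() is never invoked,
            // which is why the Storm UI reports zero emitted tuples for this bolt.
            producer.send(new ProducerRecord<>("benchmark-output", input.getStringByField("result")));
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // No downstream streams are declared.
        }
    }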

Figure 29. Storm UI for the WordCount benchmark

44 See https://storm.apache.org/releases/current/javadocs/index.html?org/apache/storm/task/OutputCollector.html


[Line chart: events per second against elapsed time in seconds.]

Figure 30. The throughput of Storm for the WordCount benchmark.

Storm – WordCount Benchmark

Summary Latency Statistics Throughput Statistics

count 15067104 (events) 1656 (seconds)

mean 408677.19104 (ms) 9098.49275 (events/s)

standard deviation 240276.15795 (ms) 5202.73881 (events/s)

min 6 (ms) 40 (events/s)

max 1169082 (ms) 21622 (events/s)

Table 22. The statistical summary of the observed latencies and throughput collected from Storm during the WordCount benchmark.


5 ANALYSIS

The purpose of this chapter is to analyse and discuss the results from the various experiments that were conducted. Two separate comparison groups have been identified. These are:

• Flink and Spark in Standalone Cluster mode
• Spark on YARN with maximum requested executors and Storm

While the experiments involving Spark on YARN using the default configuration were informative in their own right, it would not be objective to compare that configuration to any of the others. This is because of how limited the assigned resources were in comparison to the other setups. Had Flink on YARN worked as expected, it would have made an ideal candidate for comparison, as both frameworks could then have been compared based on their default YARN configurations.

5.1 Flink and Spark Standalone

A comparison of means test was performed between Flink and Spark Standalone's mean latency values, as calculated for each of the benchmarking applications. The significance value was also calculated. In each case, a P-value of < 0.0001 was returned, indicating a highly significant difference between the two frameworks' latencies. A similar process was applied to the frameworks' mean throughput values. Again, the results indicated a highly significant difference between the two frameworks. The graphs and tables below provide more detail regarding the results. With regards to this study's research question, the measured values show that Flink outperforms Spark in terms of both lower latency and higher throughput.
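The comparison of means itself is a standard two-sample t-test over the collected latency samples (and, separately, the per-second throughput samples). A minimal sketch of how such a test can be computed, here using the Apache Commons Math library rather than necessarily the tool used to produce the figures reported below, is:

    import org.apache.commons.math3.stat.inference.TTest;

    public class MeanComparisonSketch {
        // Returns the two-sided p-value for the difference between two sample means
        // (Welch's t-test, which does not assume equal variances).
        public static double compare(double[] flinkSamples, double[] sparkSamples) {
            TTest tTest = new TTest();
            double tStatistic = tTest.t(flinkSamples, sparkSamples);
            double pValue = tTest.tTest(flinkSamples, sparkSamples);
            System.out.printf("t = %.3f, p = %.6f%n", tStatistic, pValue);
            return pValue;
        }
    }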


[Bar chart: mean latency (ms) of Flink and Spark Standalone for the Identity, Extraction, Grep and WordCount benchmarks.]

Figure 31. The mean latency of Flink and Spark Standalone for each benchmarking application.

Latency Summary

Results              Identity                      Extraction                    Grep                          WordCount
difference           130528.822                    95085.833                     99403.504                     275964.640
standard error       57.129                        80.101                        80.407                        129.853
95% CI               130416.8520 to 130640.7928    94928.8375 to 95242.8278      99245.9104 to 99561.0981      275710.1323 to 276219.1476
t-statistic          2284.816                      1187.074                      1236.262                      2125.204
DF                   18617783                      19019870                      18580387                      26249398
significance level   P < 0.0001                    P < 0.0001                    P < 0.0001                    P < 0.0001

Table 23. Difference between the observed latency means for Flink and Spark Standalone. Also includes a significance value and 95% Confidence Interval.


[Bar chart: mean throughput (events/s) of Flink and Spark Standalone for the Identity, Extraction, Grep and WordCount benchmarks.]

Figure 32. The mean throughput of Flink and Spark Standalone for each benchmarking application.

Throughput Summary

Results              Identity                      Extraction                    Grep                          WordCount
difference           -8103.125                     -5946.705                     -6156.462                     -4088.188
standard error       226.155                       97.695                        102.704                       86.748
95% CI               -8546.7325 to -7659.5178      -6138.2933 to -5755.1162      -6357.8768 to -5955.0467      -4258.2737 to -3918.1029
t-statistic          -35.830                       -60.870                       -59.943                       -47.127
DF                   1524                          2128                          2065                          3284
significance level   P < 0.0001                    P < 0.0001                    P < 0.0001                    P < 0.0001

Table 24. Difference between the observed throughput means for Flink and Spark Standalone. Also includes a significance value and 95% Confidence Interval.


5.2 Storm and Spark on YARN

As was the case with Flink and Spark Standalone, a comparison of means test was performed between Storm and Spark's mean latency and throughput values. The significance value, again, was also calculated. In each case, a P-value of < 0.0001 was returned, indicating a highly significant difference between the two frameworks' latencies and throughputs. The latency for Storm was incredibly low for the Identity and Extraction benchmarks (so much so that the values do not appear on the chart below). The high latency seen in the Grep and WordCount benchmarks was discussed previously in Chapter 4. In contrast, Spark on YARN's throughput was, in all cases, significantly higher than Storm's.

In terms of the research question, the results show that Spark performs better than Storm in terms of throughput. The issue of latency is more complicated, and it is difficult to give a definitive answer as to which framework performed better. It is likely that there is a fix or workaround for the bug experienced with the KafkaSpout, but given the limitations of this study, it was not possible to investigate the issue further. For now, the results show that Storm's latency is significantly lower than Spark's, but there is the possibility that the value could spike under certain circumstances.

[Bar chart: mean latency (ms) of Storm and Spark on YARN for the Identity, Extraction, Grep and WordCount benchmarks.]

Figure 32. The mean latency of Storm and Spark on YARN for each benchmarking application.


Latency Summary

Results              Identity                      Extraction                      Grep                              WordCount
difference           38051.597                     155619.899                      -268185.350                       -243825.047
standard error       5.508                         20.626                          141.769                           69.392
95% CI               38040.8018 to 38062.3931      155579.4725 to 155660.3257      -268463.2122 to -267907.4870      -243961.0537 to -243689.0410
t-statistic          6908.327                      7544.771                        -1891.704                         -3513.720
DF                   26249398                      26249398                        26804286                          28191802
significance level   P < 0.0001                    P < 0.0001                      P < 0.0001                        P < 0.0001

Table 25. Difference between the observed latency means for Storm and Spark on YARN. Also includes a significance value and 95% Confidence Interval.

[Bar chart: mean throughput (events/s) of Storm and Spark on YARN for the Identity, Extraction, Grep and WordCount benchmarks.]

Figure 33. The mean throughput of Storm and Spark on YARN for each benchmarking application.


Throughput Summary

Results              Identity                    Extraction                  Grep                        WordCount
difference           4557.891                    2329.867                    6731.280                    4105.431
standard error       153.754                     120.873                     127.963                     179.611
95% CI               4256.3578 to 4859.4239      2092.8277 to 2566.9071      6480.3752 to 6982.1840      3753.2381 to 4457.6235
t-statistic          29.644                      19.275                      52.603                      22.857
DF                   2012                        2157                        3016                        2648
significance level   P < 0.0001                  P < 0.0001                  P < 0.0001                  P < 0.0001

Table 26. Difference between the observed throughput means for Storm and Spark on YARN. Also includes a significance value and 95% Confidence Interval.


6 CONCLUSION

6.1 Research Overview and Problem Definition

This research project examined the performance of various open source stream processing frameworks in terms of throughput and latency. Its goal was to answer the following research question:

Does Apache Spark perform better than Apache Flink, Apache Samza, and Apache Storm in terms of throughput and latency when processing streamed data?

In order to answer this question, a benchmarking suite was defined and implemented that used a novel approach for capturing the required metrics. In total, 20 experiments were executed across 5 different framework/configuration combinations. These consisted of:

• Apache Flink Standalone Cluster mode
• Apache Spark Standalone Cluster mode
• Apache Spark running on YARN with default configuration
• Apache Spark running on YARN with maximum requested executors
• Apache Storm

The experiments were conducted on a virtual environment provided by Amazon Web Services. The environment consisted of six t2.large instances, grouped into two three-node clusters. One cluster was responsible for running Apache Kafka, while the other ran the stream processing frameworks. This setup was chosen because it not only mirrored production-like environments, but it also allowed for the independent collection of metrics, thus preventing any undue strain from being put on the computational nodes.

6.2 Findings

The results of the experiments show quite clearly that Flink, when running in standalone cluster mode, performs significantly better than Spark running in its own standalone cluster mode. This is the case for both latency and throughput. This conclusion supports Hesse and Lorenz's assertion that Flink's latency, in general, is relatively low, while its throughput is high (Hesse & Lorenz, 2015, p. 799).

With that said, it is difficult to compare the findings of this study to other works in this case. This is because, firstly, no other academic study found during the literature review contained figures for Flink's performance and, secondly, the approach taken in this study for measuring latency and throughput differs from that seen in other works. This makes it difficult to conduct a comparative analysis of findings.

With regards to Spark and Storm, the results show that Storm's latency is many orders of magnitude lower than Spark's. This finding, in general, supports both Lu et al and Qian et al's assertions (Lu, Wu, Xie, & Hu, 2014, p. 74; Qian, Wu, Huang, & Das, 2016, p. 596). The issue related to the latency and throughput for Storm's Grep and WordCount benchmarks undeniably mars the final results slightly. With that said, if a reader of this study becomes aware of the issue before deploying Storm to production, then documenting it here will have been worthwhile.

6.3 Limitations

There are a number of limitations with this study. The first of these relates to the cluster size used for the experiments. While a six-node cluster is not necessarily small, it is by no means anywhere near the maximum number of nodes that these frameworks can execute on. Unfortunately, running the experiments on a larger cluster was beyond the scope of this work, but readers should be aware that the setup used here is definitely not pushing the frameworks to their limits.

Another limitation relates to the implementation of the benchmarking suite. While every effort was made to conform to best practices when coding the experiments, each framework has its own nuances and quirks that can only be learned after a substantial period of time. Looking at various forums related to the frameworks highlighted this fact. For example, retrieving the timestamp from Flink proved to be more troublesome than initially expected; it was only after many hours of investigation that the correct method became apparent. Extra care was taken to highlight these details within this study, so that other researchers or interested parties who want to implement elements of this study can do so in as straightforward a manner as possible.

Given the range of configuration options available within each framework, the scope of this study is relatively small, in that it only tested 5 different configurations. Again, it was beyond the scope of this study to attempt to execute any more than was done initially. Readers need to be aware that this study and its results pertain only to the configurations specified in Chapter 4. As an example of how fine-grained users can get with these frameworks, Storm allows developers to specify programmatically the parallelism of individual spouts and bolts within a topology. In production environments, many hours might be dedicated to fine-tuning individual applications. The resulting configuration may only be applicable for that particular topology executing on that particular cluster. This highlights a significant flaw in benchmarking studies: the results cannot be easily generalised to fit other use cases.

6.4 Contributions

With the above limitations clarified, there are several contributions made by this study. The first of these relates to the benchmarking suite design. The approach for measuring latency in this study differs from that of the other benchmarking studies that were analysed. Whereas those studies measure latency from when the message is ingested into the framework, this study does so at the point where the message is added to the Kafka topic. This is done to ensure that a true value is captured for batch processing systems. Interestingly, Lu et al's study measured a similar latency for Spark in the Identity benchmark to that observed in the Spark UI for Identity in this study. This, however, is not an accurate representation of latency, as the message, as was seen in this study, could have been sitting in the Kafka topic for many minutes before being consumed (Lu, Wu, Xie, & Hu, 2014, p. 75).

Another contribution relates to the comparisons presented between Flink and Spark in standalone cluster mode, and between Spark on YARN and Storm. These results give a general overview as to how these frameworks are likely to perform on clusters similar to the one used in this study. The Flink and Spark comparison is particularly useful because both frameworks were in their standalone modes using out-of-the-box configuration values. The results of these benchmarks might be useful for small to medium-sized businesses who are considering setting up a small-scale stream processing system. The configuration and implementation details mentioned in Chapter 4 should also help make that process easier.

6.5 Future Work

Given the small cluster size used in the experiments, it is recommended that the benchmarking applications defined in this study be executed on a larger cluster with more custom-tailored configuration. Coupled with this, additional benchmarking applications could be defined in order to expand the suite of executed tests.

Another possible area of future work relates to the issue of fault tolerance. This study did not factor in the possibility of failing messages. As was seen in the results, several of the frameworks either returned too few results at the end of processing or else returned too many. The benchmarking suite defined here could benefit from having a result validation module added to it, which is able to precisely identify how many records were lost or retried.

Finally, two of the framework configurations that were intended to be tested for this study were not. They were Flink on YARN and Samza. It is likely that the issues that prevented these frameworks from being added to the benchmarking suite will be resolved in the near future. Both Flink and Samza are gaining significant traction, so there is a need to see some definitive values around the performance potential of these two frameworks.


BIBLIOGRAPHY

Acharjya, D. P., & P, K. A. (2016). A Survey on Big Data Analytics: Challenges, Open Research Issues and Tools. International Journal of Advanced Computer Science and Applications, 7(2), 511-518. doi:10.14569/IJACSA.2016.070267.

Anand, G., & Kodali, R. (2008). Benchmarking the benchmarking models. Benchmarking: An International Journal, 15(3), 257-291. doi:10.1108/14635770810876593.

Anis Uddin Nasir, M., De Francisci Morales, G., Garcia-Soriano, D., Kourtellis, N., & Serafini, M. (2015). The power of both choices: Practical load balancing for distributed stream processing engines. Paper presented at the 31st International Conference on Data Engineering, 137-148. doi:10.1109/ICDE.2015.7113279.

Axelrod, A. (2015). Box: Natural language processing research using amazon web services. The Prague Bulletin of Mathematical Linguistics, 104(1), 27-38. 10.1515/pralin-2015-0011.

Chandramouli, B., Goldstein, J., Barga, R., Riedewald, M., & Santos, I. (2011). Accurate latency estimation in a distributed event processing system. Paper presented at the 27th International Conference on Data Engineering, 255-266. doi:10.1109/ICDE.2011.5767926.

Coles, C. (2018). Overview of Cloud Market in 2017 and Beyond. Available from: https://www.skyhighnetworks.com/cloud-security-blog/microsoft-azure-closes-iaas-adoption-gap-with-amazon-aws/. Accessed 12/08/2018.

De Matteis, T., & Mencagli, G. (2016). Proactive elasticity and energy awareness in data stream processing. Journal of Systems and Software, 127, 302. doi:10.1016/j.jss.2016.08.037.

Feng, T., Zhuang, Z., Pan, Y., & Ramachandra, H. (2015). A memory capacity model for high performing data-filtering applications in samza framework. Paper presented at the 2015 IEEE International Conference on Big Data, 2600-2605. doi:10.1109/BigData.2015.7364058.

Heidrich, J., Trendowicz, A., & Ebert, C. (2016). Exploiting big data's benefits. IEEE Software, 33(4), 111-116. doi:10.1109/MS.2016.99.


Hesse, G. & Lorenz, M. (2015). Conceptual Survey on Data Stream Processing Systems. Paper presented at the 21st International Conference on Parallel and Distributed Systems, 797-802. doi:10.1109/ICPADS.2015.106.

Huang, W., Meng, L., Zhang, D., & Zhang, W. (2017). In-memory parallel processing of massive remotely sensed data using an apache spark on Hadoop YARN model. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(1), 3-19. doi:10.1109/JSTARS.2016.2547020.

Imai, S., Patterson, S., & Varela, C. A. (2017). Maximum sustainable throughput prediction for data stream processing over public clouds. Paper presented at the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 504-513. doi:10.1109/CCGRID.2017.105

Jiamin, L., & Jun, F. (2014). A survey of mapreduce based parallel processing technologies. China Communications, 11(14), 146-155. doi:10.1109/CC.2014.7085615.

Kambatla, K., Kollias, G., Kumar, V., & Grama, A. (2014). Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7), 2561-2573. doi:10.1016/j.jpdc.2014.01.003.

Kiran, M., Murphy, P., Monga, I., Dugan, J., & Baveja, S. (2015). Lambda Architecture for Cost-effective Batch and Speed Big Data processing. Paper presented at the 2015 IEEE International Conference on Big Data, 2785-2792.

Landset, S., Khoshgoftaar, T. M., Richter, A. N., & Hasanin, T. (2015). A survey of open source tools for machine learning with big data in the hadoop ecosystem. Journal of Big Data, 2(1), 1-36. doi:10.1186/s40537-015-0032-1.

Li, H., Wu, J., Jiang, Z., Li, X., & Wei, X. (2017). Minimum backups for stream processing with recovery latency guarantees. IEEE Transactions on Reliability, 66(3), 783-794. doi:10.1109/TR.2017.2712563.

Lu, R., Wu, G., Xie, B., & Hu, J. (2014). Stream bench: Towards benchmarking modern distributed stream computing frameworks. Paper presented at the 7th International Conference on Utility and Cloud Computing, 69-78. doi:10.1109/UCC.2014.15.

Maarala, A. I., Rautiainen, M., Salmi, M., Pirttikangas, S., & Riekki, J. (2015). Low Latency Analytics for Streaming Traffic Data with Apache Spark. Paper presented at the 2015 IEEE International Conference on Big Data, 2855-2858. doi:10.1109/BigData.2015.7364101.

Mendes, M. R. N., Bizarro, P., & Marques, P. (2013). Towards a standard event processing benchmark. Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, 307-310. doi:10.1145/2479871.2479913.

Morshed, S. J., Rana, J., & Milrad, M. (2016). Open source initiatives and frameworks addressing distributed real-time data analytics. Paper presented at the 2016 International Parallel and Distributed Processing Symposium Workshops, 1481-1484. doi:10.1109/IPDPSW.2016.152.

Otsuka, E., Wallace, S. A., & Chiu, D. (2016). A hashtag recommendation system for twitter data streams. Computational Social Networks, 3(1), 1-26. doi:10.1186/s40649-016-0028-9.

Pääkkönen, P. (2016). Feasibility analysis of AsterixDB and spark streaming with cassandra for stream-based processing. Journal of Big Data, 3(1), 1-25. doi:10.1186/s40537-016-0041-8.

Qian, S., Wu, G., Huang, J., & Das, T. (2016). Benchmarking modern distributed streaming platforms. Paper presented at the International Conference on Industrial Technology, 592-598. doi:10.1109/ICIT.2016.7474816.

Riccomini, C. (2013). Apache Samza: LinkedIn's Real-time Stream Processing Framework. Available from: https://engineering.linkedin.com/data-streams/apache-samza-linkedins-real-time-stream-processing-framework. Accessed 07/12/2017.

Sakr, S. (2017). Big data processing stacks. IT Professional Magazine, 19(1), 34-41. doi:10.1109/MITP.2017.6.

Serrano, N., Gallardo, G., & Hernantes, J. (2015). Infrastructure as a service and cloud technologies. IEEE Software, 32(2), 30-36. doi:10.1109/MS.2015.43.

Shahrivari, S. (2014). Beyond batch processing: Towards real-time and streaming big data. Computers, 3(4), 117-129. doi:10.3390/computers3040117.

Singh, D., & Reddy, C. K. (2015). A Survey on Platforms for Big Data Analytics. Journal of Big Data, 2(1). doi:10.1186/s40537-014-0008-6.

Solaimani, M., Iftekhar, M., Khan, L., Thuraisingham, B., & Ingram, J. B. (2014). Spark-Based Anomaly Detection over Multi-Source VMware Performance Data in Real-Time. Paper presented at the 2014 IEEE Symposium on Computational Intelligence in Cyber Security, 1-8. doi:10.1109/CICYBS.2014.7013369.

Wang, M., Liu, J., & Zhou, W. (2016). Design and implementation of a high-performance stream-oriented big data processing system. Paper presented at the 8th International Conference on Intelligent Human-Machine Systems and Cybernetics, 1, 363-368. doi:10.1109/IHMSC.2016.64.

Yadranjiaghdam, B., Yasrobi, S., & Tabrizi, N. (2017). Developing a real-time data analytics framework for twitter streaming data. Paper presented at the IEEE International Congress on Big Data, 329-336. doi:10.1109/BigDataCongress.2017.49.

Zhuang, Z., Feng, T., Pan, Y., Ramachandra, H., & Sridharan, B. (2016). Effective multi-stream joining in apache samza framework. Paper presented at the 2016 IEEE International Congress on Big Data, 267-274. doi:10.1109/BigDataCongress.2016.41.
