Performance Modelling of Distributed Stream Processing Topologies
Thomas Cooper

Submitted for the degree of Doctor of Philosophy in the School of Computing, Newcastle University

August 2020

© 2020, Thomas Cooper

For Amanda and Gillian

Abstract

Distributed stream processing systems (such as Apache Storm, Heron and Flink) allow the processing of massive amounts of data with low latency and high throughput. Part of the power of these systems is their ability to scale the separate sections (operators) of a stream processing query to adapt to changes in the incoming workload and maintain end-to-end latency or throughput requirements.

Whilst many of the popular stream processing systems provide the functionality to scale their queries, none of them suggests to the user how best to do so to ensure better performance. The user is required to deploy the query, wait for it to stabilise, assess its performance, alter its configuration and repeat this loop until the required performance is achieved. This scaling decision loop is intensely time consuming, and for large deployments the process can take days to complete. The time and effort involved also discourage users from changing a configuration once one is found that satisfies peak load. This leads to the over-provisioning of resources.

To solve these issues a performance modelling system for stream processing queries is required. This would allow the performance effect of changes to a query's configuration to be assessed before they are deployed, reducing the time spent within the scaling decision loop. Previous research on auto-scaling systems using performance models has focused either on queueing theory based approaches, which require small amounts of historical performance data but suffer from poor accuracy, or on machine learning based approaches, which have better accuracy but require large amounts of historical performance data to produce their predictions.
The research detailed in this thesis is focused on showing that an approach based on queueing theory and discrete event simulation can be used, with only a relatively small amount of performance data, to provide accurate performance modelling results for stream processing queries. We have analysed the many aspects involved in the operation of modern stream processing systems, in particular Apache Storm. From this we have created processes and models to predict end-to-end latencies for proposed configuration changes to running streaming queries. These predictions are validated against a diverse range of example streaming queries, under various workload environments and configurations. The evaluations show that, for most query configurations, our approach can produce performance predictions with similar or better accuracy than the best performing machine learning based approaches, over a more diverse range of query types, using only a fraction of the performance data those approaches require.

Declaration

I declare that this thesis is my own work unless otherwise stated. No part of this thesis has previously been submitted for a degree or any other qualification at Newcastle University or any other institution.

Thomas Cooper
August 2020

Acknowledgements

I would like to start by thanking my supervisor, Dr. Paul Ezhilchelvan, for his advice, patience, good humour and honesty (even if sometimes I didn't want to hear it). It has been a journey. Thanks to my fellow MSc Computer Science students for making my introduction to the field so much fun and for their continued friendship and support. Special thanks to Soo Darcy for her excellent proofreading skills. My thanks to the staff of the Newcastle University Centre for Doctoral Training (CDT) in Cloud Computing for Big Data: to Drs. Paul Watson and Darren Wilkinson for leading the CDT and taking a chance on a masters student with some funny ideas; to Drs.
Matt Foreshaw and Sarah Heaps for giving me a great start and support along the way; and thanks especially to Oonagh McGee and Jen Wood, without whom chaos would have reigned. Thanks to Karthik Ramasamy, Ning Wang and all the Twitter Real Time Compute team for giving me a chance to test my ideas and learn all I could from them. Thanks to my fellow CDT PhD students for sharing their knowledge, experience and friendship. Special mentions go to Naomi Hannaford and Lauren Roberts for being far more patient with the loud guy sitting next to them than they needed to be; and to Dr. Hugo Firth, who is a good friend and fellow shaver of yaks.

Thank you to my family, who have supported me throughout my (multiple) academic endeavours: to my father and grandfathers, for giving me my love of technology; and to my mother Amanda, who gave me the means to start my journey toward this PhD and who never wavered in her belief in me. Finally, love and thanks to Gillian, without whom none of this would have been possible and who already knows everything I want to say.

Contents

Glossary
1 Introduction
  1.1 Distributed Stream Processing
  1.2 Topology Scaling Decisions
  1.3 Model-Based Scaling Decisions
    1.3.1 Faster convergence on a valid plan
    1.3.2 Feedback for scheduling decisions
    1.3.3 Pre-emptive scaling
    1.3.4 N-version scaling
  1.4 Summary
    1.4.1 Research aims
  1.5 Thesis Structure
  1.6 Related Publications
2 Distributed Stream Processing Architecture
  2.1 Choosing an Example System
  2.2 Apache Storm Overview
    2.2.1 Background
    2.2.2 Overview
  2.3 Storm Elements
    2.3.1 Storm cluster
    2.3.2 Topology
  2.4 Parallelism in Storm
    2.4.1 Executors
    2.4.2 Tasks
    2.4.3 Worker processes
  2.5 Stream Groupings
  2.6 Topology Plans
    2.6.1 Query plan
    2.6.2 Configuration
    2.6.3 Logical plan
    2.6.4 Physical plan
  2.7 Internal Queues
    2.7.1 Overview
    2.7.2 Queue input
    2.7.3 Arrival into the queue
    2.7.4 Timer flush interval completes
    2.7.5 Service completes
  2.8 Tuple Flow
    2.8.1 Executor tuple flow
    2.8.2 Worker process tuple flow
  2.9 Guaranteed Message Processing
    2.9.1 Acker
    2.9.2 Tuple tree
  2.10 Topology Scheduling
  2.11 Rebalancing
    2.11.1 Worker processes
    2.11.2 Executors
    2.11.3 Tasks
  2.12 Windowing
    2.12.1 Tumbling windows
    2.12.2 Sliding windows
    2.12.3 Windowing in Apache Storm
  2.13 Storm Metrics
    2.13.1 Accessing metrics
    2.13.2 Component metrics
    2.13.3 Queue metrics
    2.13.4 Custom metrics
    2.13.5 Metrics sample rates
  2.14 Summary
3 Related Work
  3.1 Threshold Based Auto-scaling
  3.2 Performance Model Based Auto-scaling
    3.2.1 Queueing theory
    3.2.2 Machine learning
    3.2.3 Other approaches
  3.3 Summary
4 Topology Performance Modelling
  4.1 Performance Modelling Procedure
    4.1.1 Modelling the topology
    4.1.2 Tuple flow plan
    4.1.3 Elements to be modelled
  4.2 Executor Latency Modelling
    4.2.1 Queue simulation
    4.2.2 Executor simulator
  4.3 Incoming Workload
  4.4 Routing Probabilities
    4.4.1 Stream routing probability
    4.4.2 Global routing probabilities
  4.5 Predicting Routing Probabilities
    4.5.1 Predicting stream routing probabilities
    4.5.2 Predicting global routing probabilities
  4.6 Input to Output Ratios
    4.6.1 Calculating input to output coefficients for source physical plans
  4.7 Predicting Input to Output Ratios
    4.7.1 Input streams containing only shuffle groupings
    4.7.2 Input streams containing at least one fields grouping
  4.8 Arrival Rates
    4.8.1 Predicting executor arrival rates
    4.8.2 Predicting worker process arrival rates
  4.9 Service Times
    4.9.1 Executor co-location effects on service time
  4.10 Predicting Service Times …