Distributed Stream Processing in a Global Sensor Grid for Scientific Simulations
A dissertation approved by the Faculty of Computer Science, Electrical Engineering and Information Technology and the Stuttgart Research Centre for Simulation Technology of the University of Stuttgart for the conferral of the degree of Doctor of Natural Sciences (Dr. rer. nat.)

Submitted by Andreas Martin Georg Benzing, from Stuttgart – Bad Cannstatt

Main referee: Prof. Dr. rer. nat. Kurt Rothermel
Co-referees: Prof. Dr. Frank Leymann, Prof. Dr. Pedro José Marrón
Date of the oral examination: 19.10.2015

Institute of Parallel and Distributed Systems (Institut für Parallele und Verteilte Systeme), University of Stuttgart, 2015

Acknowledgments

I thank everyone who accompanied and supported me during my doctoral project. First, I thank my doctoral advisor, Prof. Dr. Kurt Rothermel, for numerous valuable suggestions, constructive discussions, and the opportunity to carry out my research at the Distributed Systems chair. I thank my co-advisor, Prof. Dr. Frank Leymann, for his helpful comments. I thank Prof. Dr. Pedro José Marrón for his interest in my work as a reviewer. For their supervision during my stay in Toronto, I thank Prof. Dr. Hans-Arno Jacobsen and the members of the Middleware Systems Research Group.

I thank all my colleagues at the chair for a very pleasant working atmosphere. Dr. Boris Koldehofe was always available as a point of contact and had an open ear for every problem. I would also like to mention explicitly Beate Ottenwälder and Björn Schilling, my companions since the beginning of my studies. Dr. Marcus Reble, Dr. Jan Hasenauer, and their colleagues from the Institut für Systemtheorie und Regelungstechnik showed me new perspectives through their view of the world. I thank Dr. Tudor Ionescu from the Institut für Kernenergetik und Energiesysteme for the constructive collaboration. Beyond that, the variety of topics across the entire SimTech Cluster of Excellence always inspired and motivated me, for which I would also like to thank everyone involved.

Just as important to me as the professional environment was always the personal one: my parents Margot Benzing and Martin Benzing as well as Michael Müller, my siblings Julia Klenk and Marc Benzing, my girlfriend Dani Hildebrand, and my nephews Felix and Fabian. My heartfelt thanks for your support, your patience, and your love.

Many thanks to Bernd Oczko and Katrin Hagelstein for proofreading the thesis. Finally, I would like to thank Christian Heigele, Dr. Stefanie Wöhrle, Marco Völz, Ulrich Gösele, and all my other friends for many stimulating and motivating conversations.

Contents

Contents
List of Acronyms
Glossary of Terms
Abstract
Deutsche Zusammenfassung (German Summary)
1 Introduction
    1.1 Motivation
    1.2 Contributions
    1.3 Structure
2 Background
    2.1 Scientific Simulation Workflows
    2.2 Distributed Stream Processing
    2.3 Wired and Wireless Sensor Networks
    2.4 Stuttgart Research Center & Cluster of Excellence Simulation Technology
3 System Overview
    3.1 System Components
        3.1.1 Sensors and Gateways
        3.1.2 Broker Network
        3.1.3 Clients and Simulations
    3.2 Query Abstractions
        3.2.1 Direct Sensor Queries
        3.2.2 Simulation Queries
    3.3 Data Processing Workflow
        3.3.1 Acquiring Raw Sensor Data
        3.3.2 Preprocessing using Diagnostic Simulations
        3.3.3 Distribution of Sensor Streams
4 Real-Time Monitoring of Measurements
    4.1 Preliminaries
    4.2 System Model
    4.3 Distributed Aggregation of Sensor Data
        4.3.1 Basic Indexing Structure
        4.3.2 Query Routing and Insertion
        4.3.3 Data Routing and Aggregate Calculation
    4.4 Prediction in Multi-Hop Environments
        4.4.1 Data Reduction using Predictors
        4.4.2 Integrated Aggregation and Prediction
        4.4.3 Multi-Hop Update Strategy
    4.5 Evaluation
        4.5.1 Data Reduction
        4.5.2 Prediction Error
    4.6 Related Work
    4.7 Summary
5 Data Stream Distribution
    5.1 Preliminaries
    5.2 System Model
    5.3 Multi-Resolution Query Processing
        5.3.1 Spatial Indexing
        5.3.2 Query Processing and Routing
    5.4 Maximizing Utility of the Broker Network
        5.4.1 Problem Statement
        5.4.2 Query Load Estimation and Region Selection
        5.4.3 Directed Load Balancing
    5.5 Evaluation
        5.5.1 System Capacity
        5.5.2 Bandwidth Usage
    5.6 Related Work
    5.7 Summary
6 Minimizing Network Usage
    6.1 Preliminaries
    6.2 System Model
        6.2.1 System Components
        6.2.2 Cost Model
    6.3 Sensor Stream Distribution Problem
    6.4 Minimizing Overall Network Usage
        6.4.1 Operation Example
        6.4.2 Optimizing Distribution Structures
        6.4.3 Underlay Aware Overlay Management
        6.4.4 Heuristic Online-Adaptation of Streams
    6.5 Evaluation
        6.5.1 Bandwidth Usage
        6.5.2 Network Stretch
        6.5.3 Client Perceived Delay
        6.5.4 Scalability Improvement
    6.6 Related Work
    6.7 Summary
7 Conclusion
    7.1 Summary
    7.2 Future Work
List of Figures
List of Tables
Bibliography

List of Acronyms

ALM     application layer multicast.
API     application programming interface.
CDN     content distribution network.
DSP     distributed stream processing.
EXC     Cluster of Excellence.
GPS     global positioning system.
GSG     Global Sensor Grid.
GSGM    Global Sensor Grid Middleware.
GSN     global sensor network.
ISP     Internet service provider.
LMS     least mean squares.
MBR     minimum bounding rectangle.
MST     minimum Steiner tree problem.
SARIMA  seasonal autoregressive integrated moving average.
SDN     software defined networking.
SN      sensor network.
SRC     Stuttgart Research Center.
WSN     wireless sensor network.
Glossary of Terms

broker
    A dedicated server which executes an instance of the GSGM.
diagnostic simulation
    A simulation that continuously integrates measurements and runs in real time to create a digital representation of the physical world.
direct sensor query
    A query for directly monitoring measurements of a certain geographic location and sensor type.
full sensor stream
    A continuous stream containing, in each update, a regular grid of data points equally distributed over a geographic region.
relay broker
    A broker which participates in distributing full sensor streams.
simulation query
    A query for requesting a full sensor stream from a diagnostic simulation which preprocesses measurements.
source broker
    A broker which generates full sensor streams from sensor measurements.
target broker
    A broker which delivers full sensor streams to simulation clients.

Abstract

With today's large number of sensors deployed all around the globe, an enormous number of measurements has become available for integration into applications. Scientific simulations of environmental phenomena in particular can benefit greatly from detailed information about the physical world. The problem in integrating sensor data into simulations is automating both the monitoring of geographical regions for interesting data and the provision of continuous data streams from the identified regions.

Current simulation setups use hard-coded information about sensors or even manual data transfer using external memory to bring data from sensors to simulations. This solution is very robust, but adding new sensors to a simulation requires manually setting up the sensor interaction and changing the source code of the simulation, which incurs extremely high cost. Manual transmission allows an operator to drop obvious outliers but prohibits real-time operation due to the long delay between measurement and simulation.
For more generic applications that operate on sensor data, these problems have been partially solved by approaches that decouple the sensing from the application, thereby allowing the sensing process to be automated. However, these solutions focus on small-scale wireless sensor networks rather than the global scale and therefore optimize for the lifetime of these networks instead of providing high-resolution data streams.

In order to provide sensor data for scientific simulations, two tasks are required: i) continuous monitoring of sensors to trigger simulations and ii) high-resolution measurement streams of the simulated area during the simulation. Since a simulation is not aware of the deployed sensors, the sensing interface must work without an explicit specification of individual sensors. Instead, the interface must work only on the geographical region, the sensor type, and the resolution used by the simulation. The challenges in these tasks are to efficiently identify relevant sensors among the large number of sources around the globe, to detect when the current measurements are of relevance, and to scale data stream distribution to a potentially large number of simulations. Furthermore, the process must adapt to complex network structures and dynamic network conditions as found in the Internet.

The Global Sensor Grid (GSG) presented in this thesis attempts to close this gap by addressing three core problems. First, a distributed aggregation scheme has been developed which allows for the monitoring of geographic areas for sensor data of interest. The reuse of partial aggregates thereby ensures highly efficient operation and relieves the sensor sources of individually providing numerous clients with measurements. Second, the distribution of