Data Quality Management in Large-Scale Cyber-Physical Systems
Total Page:16
File Type:pdf, Size:1020Kb
Data Quality Management in Large-Scale Cyber-Physical Systems Ahmed Abdulhasan Alwan School of Architecture, Computing and Engineering University of East London A thesis presented for the degree of Doctor of Philosophy July 19, 2021 Abstract Cyber-Physical Systems (CPSs) are cross-domain, multi-model, advance informa- tion systems that play a significant role in many large-scale infrastructure sectors of smart cities public services such as traffic control, smart transportation control, and environmental and noise monitoring systems. Such systems, typically, involve a substantial number of sensor nodes and other devices that stream and exchange data in real-time and usually are deployed in uncontrolled, broad environments. Thus, unexpected measurements may occur due to several internal and external factors, including noise, communication errors, and hardware failures, which may compromise these systems quality of data and raise serious concerns related to safety, reliability, performance, and security. In all cases, these unexpected measurements need to be carefully interpreted and managed based on domain knowledge and computational models. Therefore, in this research, data quality challenges were investigated, and a com- prehensive, proof of concept, data quality management system was developed to tackle unaddressed data quality challenges in large-scale CPSs. The data quality management system was designed to address data quality challenges associated with detecting: sensor nodes measurement errors, sensor nodes hardware failures, and mismatches in sensor nodes spatial and temporal contextual attributes. De- tecting sensor nodes measurement errors associated with the primary data quality dimensions of accuracy, timeliness, completeness, and consistency in large-scale CPSs were investigated using predictive and anomaly analysis models via utilising statistical and machine-learning techniques. Time-series clustering techniques were investigated as a feasible mean for detecting long-segmental outliers as an indicator of sensor nodes’ continuous halting and incipient hardware failures. Fur- thermore, the quality of the spatial and temporal contextual attributes of sensor nodes observations was investigated using timestamp analysis techniques. The different components of the data quality management system were tested and calibrated using benchmark time-series collected from a high-quality, temperature sensor network deployed at the University of East London. Furthermore, the effectiveness of the proposed data quality management system was evaluated using a real-world, large-scale environmental monitoring network consisting of more than 200 temperature sensor nodes distributed around London. i The data quality management system achieved high accuracy detection rate us- ing LSTM predictive analysis technique and anomaly detection associated with DBSCAN. It successfully identified timeliness and completeness errors in sensor nodes’ measurements using periodicity analysis combined with a rule engine. It achieved up to 100% accuracy in detecting potentially failed sensor nodes using the characteristic-based time-series clustering technique when applied to two days or longer time-series window. Timestamp analysis was adopted effectively for evaluating the quality of temporal and spatial contextual attributes of sensor nodes observations, but only within CPS applications in which using gateway modules is possible. ii Acknowledgements I would like to express my heartfelt appreciation to Professor Allan J Brimicombe, Dr. Mihaela Anca Ciupala, Dr. Paolo Falcarin and Dr. Andres Baravalle for their professional guidance and support throughout this research. This research was sponsored by the University of East London (UEL) under the PhD scholarship scheme 2017 and through the UEL Research Internship pro- gramme 2019. Therefore, I am very grateful for the generous sponsorship and kind support from UEL staff, in particular Richard Bottoms and Charlotte Forbes. I would like to thank all my friends and staff in the Ministry of Planning - Iraq for their continuous support. I am deeply grateful for the collaboration with Thingful. Ltd, UK, in particular, I thank Usman Haque for his insight and expertise that assisted the research. iii To Suad, My parents and Rawia iv Contents Abstract i Acknowledgements iii Table of Contents iv List of Figures xi List of Tables xxi List of Abbreviations xxv Author’s declaration xxvii 1 Introduction 1 1.1 Cyber-Physical Systems . 1 1.2 Smart Cities as Large-Scale CPSs . 6 1.3 Research Motivation . 11 1.4 The Research Aim . 15 1.5 The Research Questions and Objectives . 16 1.6 Novelty of Research . 18 1.7 Research Boundaries . 19 1.8 Research Structure . 20 2 Literature Review 22 2.1 Data Quality Concepts and Terminology . 23 2.1.1 Accuracy . 24 v 2.1.2 Time-Related Accuracy (Timeliness) . 25 2.1.3 Completeness (Completability) . 28 2.1.4 Consistency . 30 2.2 Data Quality Challenges in Large-Scale CPSs, a Systematic Litera- ture Review . 31 2.2.1 Review Motivation / Introduction . 32 2.2.2 Review Process and Methodology . 34 2.2.3 Review Conduct and Primary Studies Selection . 40 2.2.4 RQ1: Data Quality Challenges in Large-Scale CPSs. 57 2.2.5 RQ2: Data Mining and Data Quality Management in Large- Scale CPSs. 61 2.2.6 RQ3: Unaddressed Data Quality Management Challenges in Large-Scale CPSs and The Research Gap. 67 2.3 The Research Questions . 70 2.4 Summary . 71 3 System Design (Methodology) 72 3.1 Overview of Research Paradigms . 72 3.2 Empirical Research Methods . 74 3.2.1 Quantitative . 74 3.2.2 Qualitative . 74 3.2.3 Mixed-Method (Triangulation) . 75 3.3 Empirical Research Strategy . 76 3.3.1 Experimental . 76 3.3.2 Case Study . 77 3.3.3 Choosing the Research Methodology . 77 3.4 System Design and Development Phases . 81 3.4.1 System Analysis and Design . 85 3.4.2 Data Acquisition Unit . 87 3.4.3 Data Quality Assessment Unit . 91 3.4.4 Selecting Data Analysing Methods . 96 3.4.5 Time-Series Decomposition . 96 vi 3.5 Online Mode - Predictive Analysis Models . 99 3.5.1 Simple Forecasting Methods . 101 3.5.2 Holt-Winters Seasonal . 102 3.5.3 ARMA, ARIMA and Seasonal ARIMA Models . 104 3.5.4 Gaussian Process Regression . 110 3.5.5 Long Short-Term Memory Networks . 112 3.6 Online Mode – Anomaly Analysis Models . 114 3.6.1 Distance-Based Spatial Clustering (K-means) . 121 3.6.2 Density-Based Spatial Clustering (DBSCAN) . 123 3.7 Online Mode - Timestamp Analysis (Temporal Consistency) . 127 3.8 Offline Mode - Time-series Clustering . 130 3.8.1 Dynamic Time Warping . 134 3.8.2 K-Shape . 135 3.8.3 Characteristic-Based Time-Series Clustering . 135 3.9 Offline Mode – Timestamp Analysis (Spatial Attributes Consistency)136 3.10 Summary . 142 4 Implementation and Results 143 4.1 Data Acquisition and Data Process . 144 4.1.1 Sensor Node Networks . 144 4.1.2 Datasets . 150 4.1.3 Software Framework . 156 4.2 Online-Mode Data Quality Assessment . 164 4.2.1 Predictive Analysis Models . 165 4.2.2 Anomaly Analysis Models . 207 4.2.3 Timestamp Analysis (Temporal Consistency) . 220 4.3 Offline-Mode Data Quality Assessment . 223 4.3.1 Time-Series Clustering Models . 224 4.3.2 Timestamp Analysis Model (Spatial Attributes Consistency) 235 4.4 Discussion and Summary . 243 4.4.1 The Data Acquisition Unit (Layers 1 and 2) . 243 vii 4.4.2 The Data Quality Assessment Unit . 244 5 Conclusions and Future Work 253 5.1 Revisiting the Research Questions and Objectives . 254 5.1.1 Review Question-1: . 254 5.1.2 Review Question-2: . 260 5.1.3 Review Question-3: . 260 5.2 Contribution to Knowledge . 262 5.3 Conclusions . 263 5.4 Future Work . 266 References 267 Appendices 291 A Technical and Implementation Details 292 A.1 Sensor Nodes Anatomy and Data Quality . 292 A.1.1 Hardware Components . 293 A.1.2 Operating System . 294 A.2 The Technical Details of the Local Sensor Node Network . 296 A.3 Technical Details of The Data Acquisition Unit . 301 A.3.1 JSON Parsing and Duplication Prevention Trigger . 307 A.3.2 The Periodicity Analysis Rule Engine. 308 A.4 Configuration and programming details . 310 A.4.1 Holt-Winters predictive model . 310 A.4.2 ARMA and ARIMA predictive models . 313 A.4.3 SARIMA predictive model . 320 A.4.4 GPR predictive model . 322 A.4.5 LSTM predictive model . 326 A.4.6 K-means and DBSCAN Partitioning Models . 329 A.4.7 DTW and K-Shape Models . 335 A.4.8 Characteristics (features)-based Model . 338 viii B Research Integrity and Technology RI Completion Certificate 342 ix x List of Figures 1.1 Cyber-Physical Systems structure diagram. 3 2.1 A holistic overview of the processes adopted in this systematic literature review. 35 2.2 The number of SLR primary studies by the year of publication, (October 2020). 42 2.3 The number of SLR primary studies by the country of publication. 43 2.4 The ratio of the data quality dimensions addressed by the SLR primary studies. 59 2.5 The key data quality challenges in large-scale CPSs. 60 2.6 The most popular data quality assessment/management methods or techniques in large-scale CPS applications based on the number (left) and the ratio (right) of the SLR studies. 62 2.7 The most popular data mining techniques in large-scale CPSs based on the No. of the SLR studies utilising these techniques. 63 2.8 The key data mining techniques used to assess the main data quality dimensions in large-scale CPSs. 63 2.9 A holistic diagram of the main data quality management/assess- ment methods and techniques and their associated data quality dimensions that these techniques are addressing, based on the SLR results. 64 3.1 The main processes of the experimental research approach. 77 3.2 The main processes of the case study research approach.