Information Theory for the Analysis of Large Spatio-Temporal Datasets
Yannick Allard
Michel Mayrand
Dan Radulescu

Prepared by:
OODA Technologies Inc.
4580 Circle Rd.
Montréal (Qc), H3W 1Y7
514-476-4773

Prepared for:
Defence Research & Development Canada, Atlantic Research Centre
9 Grove Street, PO Box 1012
Dartmouth, NS B2Y 3Z7
902-426-3100 ext. 359

PWGSC Contract Number: W7707-145677
Call Up Number: 14
Technical Authority: Bruce McArthur, Defence Scientist
Project: Vessel Traffic Analysis of Large Spatio-Temporal Datasets
Report Delivery Date: February 17, 2017

Contract Report DRDC-RDDC-2017-C051
February 2017

The scientific or technical validity of this Contract Report is entirely the responsibility of the contractor and the contents do not necessarily have the approval or endorsement of the Department of National Defence of Canada.

© Her Majesty the Queen in Right of Canada, as represented by the Minister of National Defence, 2017
Executive Summary

One important aspect of Maritime Domain Awareness (MDA) is the aggregation of data and information to help formulate an accurate description of vessel activity, with the Automatic Identification System (AIS) being a key data source. However, the introduction of AIS, together with other sources of MDA-related data, has also posed a challenge: the overabundance of data makes manual analysis prohibitively time-consuming. In response, there has been growing interest in techniques for automated or semi-automated analysis of vessel traffic from large-volume datasets. These techniques draw from research areas such as data mining (the computational process of discovering patterns in large data sets); spatio-temporal trajectory analysis; information theory; and visual analytics (analytical reasoning supported by highly interactive visual interfaces). All of these approaches have been used in support of MDA-related capabilities, which include anomaly detection, traffic route extraction, area of interest analysis, collision risk analysis, vessels of interest analysis, vessel tracking and data fusion. In this study, we focus our investigation on information theory, which has so far seen little overlap with the maritime domain despite having significant potential.

In this report, we present standard measures and techniques in information theory and their application to large spatio-temporal datasets, in particular maritime AIS datasets. Based on the literature survey conducted and on current challenges in MDA, information theory should, by its very nature, be able to handle problems specific to AIS data as well as problems that arise when performing spatio-temporal data mining. A demonstration of a possible application over a one-month dataset has been implemented.
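As a rough illustration of what a local information-theoretic measure over AIS reports can look like, the following Python sketch computes the Shannon entropy of a binned course over ground attribute within each cell of a latitude/longitude grid, at more than one grid size. It is a minimal sketch under stated assumptions and not the implementation used in this study: the function names, the `(lat, lon, course_deg)` tuple layout, the 45-degree course bins and the sample values are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    # p * log2(1 / p), summed over the observed symbols
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def local_entropy(reports, cell_size_deg, bin_width_deg=45.0):
    """Entropy of binned course over ground inside each grid cell.

    `reports` is an iterable of (lat, lon, course_deg) tuples; this layout
    is an assumption made for the example, not an AIS message format.
    """
    cells = defaultdict(list)
    for lat, lon, course in reports:
        cell = (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        cells[cell].append(int((course % 360.0) // bin_width_deg))
    return {cell: shannon_entropy(bins) for cell, bins in cells.items()}


# Made-up sample: one cell with a single dominant heading (route-like),
# one cell with headings spread over four directions (mixed activity).
SAMPLE_REPORTS = [
    (44.1, -63.2, 91.0), (44.2, -63.3, 95.0), (44.3, -63.4, 100.0),
    (46.1, -60.1, 10.0), (46.2, -60.2, 100.0),
    (46.3, -60.3, 190.0), (46.4, -60.4, 280.0),
]

if __name__ == "__main__":
    for cell_size in (1.0, 0.5):  # grid sizes in degrees
        for cell, h in sorted(local_entropy(SAMPLE_REPORTS, cell_size).items()):
            print(f"scale={cell_size} cell={cell} entropy={h:.2f} bits")
```

In this toy sample, the cell whose reports share one heading bin has zero entropy, while the cell whose headings fall into four equally frequent bins has an entropy of two bits. The bin width and the grid sizes are free parameters of the sketch and would need to be chosen for the data at hand.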
The entropy and spatial diversity measures were applied as local measures at multiple scales and on selected attributes within Canada's Exclusive Economic Zone. Their potential was assessed through visual inspection. Transit zones, or shipping routes, were found to exhibit fairly stable characteristics, resulting in lower entropy and spatial diversity than areas where multiple activities occur. Temporal behaviour was not investigated; however, the results suggest that implementing a spatio-temporal scale selection process should lead to the extraction of meaningful maritime processes from large datasets. Where processing is concerned, one should consider implementing information-theoretic measures within a framework designed for large-scale cloud computing to speed up information extraction, even in an exploratory study.

In conclusion, it was shown that information-theoretic measures have the potential to be used over large maritime datasets in a data mining context and that, based on the available literature, this potential remains largely untapped.

OODA Technologies Inc. The use or disclosure of the information on this sheet is subject to the restrictions on the title page of this document.
Study Report RISOMIA Call-up 14

Contents

Executive Summary
Contents
List of Figures
List of Tables
1 Introduction
2 Methodology
  2.1 Summary of the Literature Survey
  2.2 Summary of Search Terms
3 AIS Data
  3.1 AIS Data and Information Product Sources
  3.2 AIS Capabilities and MDA Usage
  3.3 Problems with AIS Data
  3.4 Other Maritime Data Sources
    3.4.1 Earth Observation Data and Information Product Sources
    3.4.2 Earth Observation Capabilities and MDA Usage
    3.4.3 Other Public Information Products
  3.5 Spatio-Temporal and AIS Data Mining
4 Information Theory Basic Concepts
  4.1 Information-Theoretic Measures
    4.1.1 Entropy
      4.1.1.1 Shannon Entropy
      4.1.1.2 Rényi Entropy
    4.1.2 Conditional Entropy
    4.1.3 Mutual Information
    4.1.4 Kullback-Leibler Divergence
    4.1.5 Self-Information or Surprisal
  4.2 Spatio-Temporal Extension of Information Theory
    4.2.1 Co-occurrence-based Spatial Entropy
    4.2.2 Distance Ratios Spatial Diversity
    4.2.3 Markov Random Field-based Spatial Entropy
  4.3 Summary
5 Applications of Information Theory
  5.1 Maritime Domain Applications
    5.1.1 Pattern Discovery in Maritime Data
    5.1.2 Measures of Diversity Based on Distance and Co-occurrence
    5.1.3 Vessel Imaging
    5.1.4 Feature Extraction and Categorization
    5.1.5 Vessel Report Quality Assessment
  5.2 Non-Maritime Domain Applications
    5.2.1 Psychology
    5.2.2 Image Processing
    5.2.3 Information Theory for KDD
    5.2.4 Decision Trees
    5.2.5 Entropy Based Time Series Analysis in Biomedical Applications
  5.3 Application in Visualization and Interactive Analysis
  5.4 Summary
6 Proof-of-Concept Demonstration
  6.1 Potential Applications of Information Theory for Maritime Domain Awareness
    6.1.1 Global and Local Entropy Computation
    6.1.2 Co-occurrence-based Spatial Entropy for Pattern Extraction
    6.1.3 Spatial Diversity Applied to Vessel/Data Attributes Using Distance Ratios
    6.1.4 Other Avenues and Considerations
    6.1.5 Space-time Scale Selection for Spatio-Temporal Analysis
      6.1.5.1 Multigrid Methods
      6.1.5.2 Space-time Permutation Scan Statistic
      6.1.5.3 Scale-Space Analysis
      6.1.5.4 Number of Empty Grid Cells for Spatial Entropy
      6.1.5.5 Entropy-based Scale Saliency
    6.1.6 Summary of the Proposed Proof-of-Concept Demonstration
  6.2 Application of Information-Theoretic Measures over a Large Maritime Dataset
    6.2.1 Description of the Dataset
    6.2.2 Global Entropy of the Dataset
    6.2.3 Multiscale Analysis and Measures Comparison
7 Conclusion
Bibliography
Appendix A Conferences and books
  A.1 Conferences
  A.2 Books
Appendix B Administration and User Guide
  B.1 Administration guide
    B.1.1 Requirements
    B.1.2 Quick Installation Guide Using Docker
  B.2 User guide
    B.2.1 Creating a MSARI database from raw AIS compressed files
    B.2.2 Java script
    B.2.3 SQL Scripts
    B.2.4 Bash Script

List of Figures

4.1 Venn diagram representing the different entropies and the mutual information.
4.2 Two neighborhoods of the same equivalence class (4-2-1-1), as the actual values of the neighbours are not considered, only their numbers. Source: reference [76].
6.1 Multi-level application of information theory for analysis of large spatio-temporal datasets
6.2 A multigrid example
6.3 Entropy of the course over ground attributes over multiple scales
6.4 Entropy of the speed attribute over multiple scales
6.5 Entropy of the ship type attribute over multiple scales
6.6 Spatial diversity of the course over ground attributes over multiple scales
6.7 Spatial diversity of the speed attribute over multiple scales
6.8 Spatial diversity of the ship type attribute over multiple scales
6.9 Entropy and spatial diversity of the course over ground attributes at 0.5 and 0.1 degrees