Topological Data Analysis on Road Network Data

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Mathematical Science in the Graduate School of The Ohio State University

By

Xiao Zha, B.S.

Graduate Program in Mathematical Science

The Ohio State University

2019

Master’s Examination Committee:

Facundo Mémoli, Advisor

Yusu Wang, Co-Advisor

© Copyright by

Xiao Zha

2019

Abstract

Many problems in science and engineering involve signal analysis, and engineers and scientists have developed many approaches to study signals. Recently, researchers proposed a new framework, combining the time-delay embedding with tools from computational topology, for the study of periodic signals. After applying the time-delay embedding to a periodic signal, its periodic behavior expresses itself as topological cycles, and we can use persistent homology to detect these topological features. In this thesis, we apply this method to analyze road network data, specifically vehicle flow data recorded by detectors placed on highways. First, we apply the time-delay embedding to project the vehicle flow data into point cloud data in a high-dimensional space. Then, we use persistent homology to detect the topological features and obtain a persistence diagram. Next, we repeat the same experiment on vehicle

flow data of different periods. For example, in our experiments, we use the vehicle flow data of different weeks and months. Therefore, we obtain persistence diagrams corresponding to the vehicle flow data of different periods. Finally, we calculate the bottleneck distance and Wasserstein distance between these persistence diagrams and perform hierarchical clustering. The dendrograms of the hierarchical clustering reveal the patterns behind the vehicle flow data.

This thesis is dedicated to everyone ever

Acknowledgments

First and foremost, I would like to thank my advisors Yusu Wang and Facundo Mémoli for introducing me to the field of computational topology and topological data analysis, for mentoring me through the whole project, and for their endless patience.

I also want to thank Dayu Shi for helping me with SimBa and Jiayuan Wang for helping me with the denoising algorithm.

Vita

July 10, 1993 ...... Born - Anhui, China

2015 ...... B.S. Applied Mathematics, China University of Petroleum, Beijing

2016-present ...... Graduate Student, The Ohio State University

Fields of Study

Major Field: Mathematical Science

Table of Contents

Page

Abstract ...... ii

Dedication ...... iii

Acknowledgments ...... iv

Vita...... v

List of Figures ...... viii

1. Introduction ...... 1

1.1 Topological Data Analysis (TDA) ...... 1
1.2 Road Network Data ...... 4
1.3 Outline ...... 5

2. Persistent Homology ...... 6

2.1 Complexes ...... 6
2.2 Homology ...... 9
2.3 Persistent Homology ...... 11
2.4 Persistence Module ...... 13

3. Road Network Data Analysis ...... 16

3.1 Time-delay Embedding ...... 16
3.2 Data Visualization ...... 17
3.3 Denoising ...... 25
3.4 Experiments ...... 27
3.5 Results and Comparison ...... 27
3.6 Extension and future work ...... 33

Appendices ...... 36

A. More results and plots ...... 36

B. Main Code ...... 41

Bibliography ...... 43

List of Figures

Figure Page

1.1 Topological data analysis workflow from Wikipedia [1]...... 3

2.1 0-simplex, 1-simplex, 2-simplex, and 3-simplex [2]...... 7

3.1 Time series data visualization ...... 18

3.2 project the point cloud to 2D plane ...... 20

3.3 project the high-dimensional point cloud to 2D plane ...... 21

3.4 Using Bottleneck Distance ...... 22

3.5 Time series data of Week 4 ...... 23

3.6 project the point cloud of week 4 to 2D plane ...... 24

3.7 project the denoised point cloud to 2D plane ...... 26

3.8 Persistence Diagrams for week 4 ...... 26

3.9 Persistence Diagram ...... 28

3.10 Hierarchical Clustering of Weekly Data of Detector: 409529 ...... 29

3.11 Time series data of Week 11 ...... 30

3.12 Time series data of Week 1 and Week 8 ...... 31

3.13 Hierarchical Clustering of Weekly Data of Detector: 409528 ...... 32

3.14 Time series data of Week 11 ...... 33

3.15 Hierarchical Clustering of Monthly Data ...... 34

A.1 project the denoised point cloud of week 8 to 2D plane ...... 36

A.2 Persistence Diagram for Week 11 ...... 37

A.3 project the generated point cloud to 2D plane (M = 4 and τ = 50) . . . . . 38

A.4 Visualization of major barcodes (M = 4 and τ = 50) ...... 38

A.5 Hierarchical Clustering of Weekly Data for Detector 409529 with k = 25 . . 40

Chapter 1: Introduction

Signal analysis is a fundamental problem for many engineers and scientists. There are numerous methods to analyze signals and many associated applications. In this thesis, we apply a new approach, the time-delay embedding (see Chapter 3 for the definition), to study periodicity in signals. In particular, we apply the typical workflow of topological data analysis to the point clouds obtained by applying the time-delay embedding to the signal.

1.1 Topological Data Analysis (TDA)

Topological Data Analysis is an emerging field that traces back to the development of computational topology during the first decade of this century [3]. Geometric approaches to data analysis have been used for quite a long time. In 2002, the concept of persistent homology was introduced by Edelsbrunner et al. [5], who also put forward an efficient algorithm to compute persistent homology and its visualization as a persistence diagram [5].

Then, in 2004, Carlsson et al. reformulated the initial definition and gave a visualization method called the persistence barcode, which is equivalent to the persistence diagram and interprets persistence in the language of commutative algebra [4]. Finally, in 2009, TDA was popularized in a milestone paper of Carlsson [6].

The past decade has witnessed the success of Topological Data Analysis (TDA) as an approach to the analysis of datasets using techniques from topology. Extracting information from datasets that are high-dimensional, large, and complex is always challenging. TDA aims at providing a general framework to unravel and analyze the complex topological and geometrical structures underlying data. Usually, the data are represented in the form of point clouds in Euclidean or more general metric spaces. The basic and standard pipeline in TDA is:

1. Clouds of Data. In many instances, the input dataset is an unordered sequence of points coming with a notion of distance. The underlying logic of TDA is that shape matters, and the global "shape" of the data can help us unravel the underlying patterns or phenomena indicated by the data.

2. Nested Complexes. The most natural way to construct a global structure from the point cloud is to take the points as the vertices of a combinatorial graph whose edges are determined by proximity (vertices within some specified distance ε) [7]. As we increase the parameter ε continuously, a sequence of structures is constructed on top of the point cloud and highlights the underlying topology or geometry. This process converts the point cloud into a parametrized and nested family of simplicial complexes, namely a filtration of simplicial complexes.

3. Persistence Module. The parametrized and nested family of simplicial complexes is called a filtration of simplicial complexes. Taking the homology of each complex in the filtration gives a persistence module. The underlying topological or geometric features can be discovered from the homology groups of each complex. We use the terms birth time and death time for the parameter values at which a feature is created and disappears, respectively.

4. Barcode or Diagram. A barcode is a graphical representation of the birth times and death times of the topological features as a collection of horizontal line segments in a plane, whose horizontal axis corresponds to the parameter and whose vertical axis represents an (arbitrary) ordering of the topological features [7]. A persistence diagram is another visualization of the topological features. It is created by drawing a collection of points in the plane; each point represents a topological feature, and its two coordinates are the birth time and death time of that feature, respectively.

Overall the typical workflow in TDA is:

point cloud → nested complexes → persistence module → barcode or diagram
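To make this workflow concrete, the following short Python sketch runs the whole pipeline on a toy point cloud. It is only an illustration and is not the toolchain used in our experiments (there we use SimBa, Ripser, and the R package TDA); it assumes the third-party Python packages ripser (ripser.py) and persim are installed.

import numpy as np
from ripser import ripser          # Vietoris-Rips persistence (ripser.py)
from persim import bottleneck      # bottleneck distance between diagrams

# Sample two noisy circles as toy point clouds.
rng = np.random.default_rng(0)
def noisy_circle(n=200, r=1.0, noise=0.05):
    theta = rng.uniform(0, 2 * np.pi, n)
    pts = np.c_[r * np.cos(theta), r * np.sin(theta)]
    return pts + noise * rng.standard_normal(pts.shape)

X, Y = noisy_circle(r=1.0), noisy_circle(r=2.0)

# Point cloud -> Rips filtration -> persistence diagrams (dimensions 0 and 1).
dgm_X = ripser(X, maxdim=1)['dgms'][1]
dgm_Y = ripser(Y, maxdim=1)['dgms'][1]

# Compare the 1-dimensional diagrams with the bottleneck distance.
print(bottleneck(dgm_X, dgm_Y))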

Figure 1.1: Topological data analysis workflow from Wikipedia [1].

TDA has many successful applications, including shape study [8], material science [9], sensor networks [10], progression analysis of disease [11], and so on. Perhaps the best-known example of TDA's success is the identification of a new subgroup of breast cancers from an old dataset which had been analyzed via other techniques for decades [11]. In this project, we will apply tools from topological data analysis to analyze road network data and unravel the underlying patterns and features.

1.2 Road Network Data

Caltrans PeMS (the California Transportation Performance Measurement System) is a web-based software tool that is designed and tailored to Caltrans data and users. It collects traffic data from Caltrans traffic sensors placed on highways throughout California, as well as other Caltrans and partner agency data sets. It archives raw sensor data in a database, quality controls and processes it, and outputs it to users on the web in many useful formats to help managers, engineers, planners, and researchers understand transportation performance, identify problems, and formulate solutions [12]. PeMS also processes data from other Caltrans and Caltrans-acquired data sources to support other types of data analysis. PeMS can be accessed via a standard internet browser by anyone who has established an online account.

The Data Clearinghouse (http://pems.dot.ca.gov/?dnode=Clearinghouse) is a tool offered by the PeMS website. It provides a single access point for downloading PeMS data sets. We can use the Data Clearinghouse page to quickly locate data by district, month, and format. After selecting the district and the type of data set and clicking the submit button, it presents a calendar for that data set showing which months are available (and how complete they are). In this project, we used the data set of "District 4" and of type "Station 5-minute". This data set has 17 fields, including "Timestamp", "Station", "Total Flow", and so on. The "Timestamp" field indicates the date and time of the beginning of the summary interval. For example, a time of 08:00:00 indicates that the aggregates contain measurements collected between 08:00:00 and 08:04:59. Note that the second values are always 0 for five-minute aggregations. The format is MM/DD/YYYY HH24:MI:SS. In this thesis, we mainly pay attention to the 5-minute vehicle flow, which is time-series data.

1.3 Outline

For a given signal, as in the attractor-reconstruction method in chaos theory, we consider the time series to be an observation function of a dynamical system. The results of Takens' theorem and its subsequent generalizations prove that the associated time-delay embedding of a 1-dimensional measurement (time series) can recover the underlying dynamics of the system [13].

In this thesis, based on this idea and recent work in computational topology, we apply the time-delay embedding method to analyze the time-series data of vehicle flow and try to unravel the patterns in the vehicle flow data. Thanks to Caltrans PeMS, we have access to the vehicle flow data recorded by many detectors over a long period. First, we downloaded the time-series data of the 5-minute vehicle flow recorded by a specific detector for a specific week. After applying the time-delay embedding to the 5-minute vehicle flow data, we get a point cloud in a higher-dimensional space. Then we apply the TDA workflow to the point cloud and get the persistence diagram. Finally, we can compute persistence diagrams for the data of different weeks and compute the distances between these persistence diagrams. The distances indicate the similarity between the data of different weeks and therefore might unravel the different patterns. Our results show that holidays and the occurrence of traffic accidents can notably disturb the normal vehicle flow pattern recorded by detectors over a period. Besides, in the traffic accident case, the size of the disturbance at each detector could indicate the distance from the detector to the site of the accident.

Next, in Chapter 2, we review the background on persistent homology. Chapter 3 introduces the time-delay embedding, discusses our approach in detail, and shows our experimental results.

Chapter 2: Persistent Homology

In this chapter, we introduce the background and basic concepts of persistent homology.

2.1 Complexes

In order to represent and study topological spaces, it is convenient and usual to decompose these topological spaces into the union of many small pieces that are topologically simple and glued together in a simple manner. These simple pieces should be simple geometric objects like points, line segments, triangles and so on.

Simplex. In geometry, a simplex is a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions. A (k+1)-tuple of points (u_0, u_1, ..., u_k) in R^k, where each u_i ∈ R^k, is called affinely independent when the vectors {u_1 − u_0, u_2 − u_0, ..., u_k − u_0} are linearly independent.

Geometrically, an n-simplex σ = (u_0, u_1, ..., u_n) is the convex hull of an affinely independent (n+1)-tuple of points (u_0, u_1, ..., u_n). The n+1 points u_0, u_1, ..., u_n are called the vertices of the n-simplex. For instance, a 0-simplex is a single vertex, a 1-simplex is a line segment, a 2-simplex is a triangle, a 3-simplex is a tetrahedron, and a 4-simplex is a 5-cell.

Any subset of the affinely independent n+1 points is also affinely independent and therefore also defines a simplex of lower dimension. Geometrically, a face of σ is the convex hull of a non-empty subset of (u_0, u_1, ..., u_n). In particular, the convex hull of a subset of size m+1 is an m-simplex, called an m-face of the n-simplex.

Figure 2.1: 0-simplex, 1-simplex, 2-simplex, and 3-simplex [2].
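The affine independence condition is easy to check numerically: we simply test whether the difference vectors u_1 − u_0, ..., u_k − u_0 have full rank. A minimal sketch in Python:

import numpy as np

def affinely_independent(points):
    """Check whether the rows of `points` (a (k+1) x d array) are affinely independent,
    i.e. whether u_1 - u_0, ..., u_k - u_0 are linearly independent."""
    diffs = points[1:] - points[0]
    return np.linalg.matrix_rank(diffs) == len(diffs)

# The three vertices of a triangle in R^2 are affinely independent ...
print(affinely_independent(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])))  # True
# ... but three collinear points are not.
print(affinely_independent(np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])))  # False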

Simplicial Complex. In topology and combinatorics, it is common to "glue together" simplices to form larger structures; the resulting combinatorial structure is called a simplicial complex. A simplicial complex K is a set of simplices that satisfies the following conditions:

1. Every face of a simplex from K is also in K.

2. The intersection of any two simplices σ_1, σ_2 ∈ K is a face of both σ_1 and σ_2.

Note that the empty set is a face of every simplex. The dimension of K is the maximum dimension of any of its simplices. The underlying space, denoted |K| and sometimes called the carrier of the simplicial complex, is the union of its simplices. Simplicial complexes are extremely useful for providing triangulations of topological spaces, in particular manifolds.

Given a topological space T, a triangulation of T consists of:

1. A simplicial complex K

2. A homeomorphism ϕ : |K| → T

Abstract Simplicial Complex. In mathematics, an abstract simplicial complex is a purely combinatorial description of the geometric notion of a simplicial complex. Given a set A of elements, an abstract simplicial complex is a finite collection S of non-empty subsets of A such that σ ∈ S (with σ ⊆ A) and σ′ ⊆ σ imply σ′ ∈ S. The elements of S, namely subsets of A, are its simplices. For each simplex σ, we have dim(σ) = |σ| − 1, namely the dimension of σ is its cardinality minus 1. The dimension of the abstract simplicial complex is the maximum dimension of any of its simplices. Given a geometric simplicial complex K, we can construct an abstract simplicial complex S by throwing away all simplices and retaining only their sets of vertices. We call S a vertex scheme of K, and, symmetrically, K a geometric realization of S. Constructing a geometric realization of an abstract simplicial complex is trivial when the dimension of the ambient space is high enough. In fact, any abstract simplicial complex of dimension d has a geometric realization in Euclidean space R^{2d+1}.

Complexes. Typically, our input data is a point cloud. We want to use homology to describe data, but homology operates on simplicial complexes. In this part, we will see two methods by which one can construct a simplicial complex on a finite set of points, and see how these two methods are related.

Čech Complexes. In algebraic topology and topological data analysis, the Čech complex is an abstract simplicial complex constructed from a point cloud in any metric space, meant to capture topological information about the point cloud or the distribution it is drawn from. Given a collection of points {x_α} in Euclidean space E^n, the Čech complex C_ε is the abstract simplicial complex whose k-simplices are determined by unordered (k+1)-tuples of points {x_{α_0}, ..., x_{α_k}} whose closed ε/2-ball neighborhoods have a point of common intersection. The nerve theorem states that C_ε is homotopy equivalent to the union of the closed balls of radius ε/2 about the point set {x_α}.

Vietoris-Rips Complexes. In topology, the Vietoris-Rips complex, also called the Vietoris complex or Rips complex, is an abstract simplicial complex that can be defined from any metric space and distance ε. Given a collection of points {x_α} in Euclidean space E^n, the Vietoris-Rips complex R_ε is the abstract simplicial complex whose k-simplices correspond to unordered (k+1)-tuples of points {x_{α_0}, ..., x_{α_k}} which are pairwise within distance ε. The Vietoris-Rips complex characterizes the topology of a point set. This complex is popular in topological data analysis as its construction extends easily to higher dimensions.
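The Vietoris-Rips condition is simple enough to state directly in code. The following brute-force Python sketch enumerates the simplices of R_ε up to a given dimension by checking the pairwise distance condition; it is only meant to illustrate the definition, since practical tools such as Ripser use far more efficient constructions.

import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform

def rips_simplices(points, eps, max_dim=2):
    """Enumerate the simplices of the Vietoris-Rips complex R_eps up to max_dim:
    a (k+1)-tuple of points spans a k-simplex iff all pairwise distances are <= eps."""
    dist = squareform(pdist(points))
    n = len(points)
    simplices = [(i,) for i in range(n)]
    for k in range(1, max_dim + 1):
        for sigma in combinations(range(n), k + 1):
            if all(dist[i, j] <= eps for i, j in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.8], [3.0, 0.0]])
print(rips_simplices(pts, eps=1.2))   # the first three points span a 2-simplex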

2.2 Homology

Chain. Let K be a simplicial complex. Then for any dimension p, we define a p-chain c in K to be a formal sum of p-simplices in K with some coefficients, namely $c = \sum_{i=1}^{k} \alpha_i \sigma_i$, where the σ_i are p-simplices and the α_i are the coefficients. The coefficients α_i can be integers, real numbers, rational numbers, elements of a field, or elements of a ring. Two p-chains can be added together to get another p-chain. So, for example, for two 1-chains (sums of edges) with integer coefficients and integer addition, we have

$$(2e_1 + 3e_2 + 6e_4) + (e_2 + 4e_3) = 2e_1 + 4e_2 + 4e_3 + 6e_4$$

In computational topology, we will be mostly interested in Z_2 coefficients and modulo-2 addition. Namely, the coefficients can only be 0 or 1, and for addition we have 0 + 0 = 0, 0 + 1 = 1, 1 + 1 = 0. Under this setting, for example, we have

$$(e_1 + e_2 + e_4) + (e_2 + e_3 + e_4) = e_1 + e_3$$

The set of p-chains together with the addition operator forms the group of p-chains, denoted (C_p, +), or simply C_p if the operation is understood. This group is an abelian group, with the identity being the chain 0 and the inverse of a chain c being c itself.
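With Z_2 coefficients, a p-chain is determined by the set of simplices that carry coefficient 1, and modulo-2 addition is just the symmetric difference of these sets. A small Python sketch reproducing the example above:

# With Z2 coefficients a p-chain is the set of simplices that have coefficient 1,
# and modulo-2 addition is the symmetric difference of these sets.
def add_chains_mod2(c1, c2):
    return c1 ^ c2   # symmetric difference of two sets of simplices

c1 = {'e1', 'e2', 'e4'}
c2 = {'e2', 'e3', 'e4'}
print(sorted(add_chains_mod2(c1, c2)))   # ['e1', 'e3'], matching the example above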

Boundary map and chain complex. To relate these p-chain groups of different dimensions, we define a boundary operator ∂_p that, for a given p-simplex, returns the (p − 1)-chain of its boundary (p − 1)-simplices, namely its (p − 1)-dimensional faces. For example, suppose σ = {v_0, v_1, ..., v_p} is a p-simplex; then its boundary is

$$\partial_p \sigma = \sum_{i=0}^{p} \{v_0, ..., \hat{v}_i, ..., v_p\}$$

where $\{v_0, ..., \hat{v}_i, ..., v_p\}$ denotes the simplex spanned by all vertices except v_i. We can extend the boundary operator to p-chains by defining the boundary of a p-chain to be the sum of the boundaries of its simplices. Therefore, we have ∂_p : C_p → C_{p−1}. It is easy to check that ∂_p(c_1 + c_2) = ∂_p c_1 + ∂_p c_2; namely, ∂_p : C_p → C_{p−1} is a homomorphism. It is also straightforward to check that ∂_p ∂_{p+1} = 0.
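Over Z_2, the boundary operator and the identity ∂_p ∂_{p+1} = 0 can be illustrated with a few lines of Python, representing a simplex as a tuple of vertices and a chain as a set of simplices:

from itertools import combinations

def boundary_mod2(simplex):
    """Boundary of a p-simplex (a tuple of vertices) over Z2: the set of its (p-1)-faces."""
    return {tuple(face) for face in combinations(simplex, len(simplex) - 1)}

def boundary_of_chain_mod2(chain):
    """Extend the boundary operator to a chain (a set of simplices) by mod-2 addition."""
    result = set()
    for simplex in chain:
        result ^= boundary_mod2(simplex)   # symmetric difference = Z2 sum
    return result

tri = (0, 1, 2)                                                # a 2-simplex
print(boundary_of_chain_mod2({tri}))                           # its three edges
print(boundary_of_chain_mod2(boundary_of_chain_mod2({tri})))   # empty set: boundary of boundary is 0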

In mathematics, a chain complex is an algebraic structure that consists of a sequence of abelian groups and a sequence of homomorphisms between consecutive groups such that the image of each homomorphism is included in the kernel of the next. Associated to a chain complex is its homology, which describes how the images sit inside the kernels. In our case, the chain complex can be written out as:

$$\cdots \xrightarrow{\partial_{p+1}} C_p \xrightarrow{\partial_p} C_{p-1} \xrightarrow{\partial_{p-1}} C_{p-2} \xrightarrow{\partial_{p-2}} \cdots \xrightarrow{\partial_1} C_0 \rightarrow 0$$

Usually the chain complex is denoted by (C_∗, ∂_∗).

Homology groups. Homology groups are algebraic tools to quantify topological features in a space. They do not capture all topological aspects of a space, in the sense that two spaces with the same homology groups may not be topologically equivalent. However, two spaces that are topologically equivalent must have isomorphic homology groups. It turns out that homology groups are computationally tractable in many cases, thus making them attractive in topological data analysis. The homology groups classify the cycles in a cycle group by putting into the same class those cycles that differ by a boundary. From a group-theoretic point of view, this is done by taking the quotient of the cycle group by the boundary group, which is allowed since the boundary group is a subgroup of the cycle group.

We call the kernel ker(∂_p) the group of p-cycles of C_∗, denoted Z_p, and the image im(∂_{p+1}) the group of p-boundaries of C_∗, denoted B_p. Given a chain complex (C_∗, ∂_∗), its homology is given by

$$H_p(C_*, \partial_*) = \frac{\ker(\partial_p)}{\mathrm{im}(\partial_{p+1})} = \frac{Z_p}{B_p}, \quad p \geq 0$$

2.3 Persistent Homology

Persistent homology [14] is one of the most prominent tools in the field of topological data analysis. Specifically, persistent homology is an algebraic method for measuring topological features of shapes and functions. Persistence was introduced by Edelsbrunner, Letscher, and Zomorodian [5] and refined by Carlsson and Zomorodian [15]. Given a parameterized family of simplicial complexes, those topological features which persist over a significant range of scales are regarded as signal, while the short-lived ones are treated as noise.

Filtration. In order to differentiate the topological features into signal and noise, we need to construct a family of 'growing' simplicial complexes. This is exactly what the so-called filtration does. For a given simplicial complex K, a filtration of K is an increasing sequence of subcomplexes of K:

$$\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_n = K$$

There is a natural sequence of inclusion maps $\iota^i : K_i \to K_{i+1}$ for i = 0, 1, ..., n − 1. In fact, for any i < j we have $\iota^{i,j} : K_i \to K_j$, which is simply $\iota^{i,j} = \iota^{j-1} \circ \cdots \circ \iota^{i}$.

Instead of the sequence of complexes itself, we are more interested in the evolution of the topological features, expressed by the corresponding sequence of homology groups. Each inclusion map $\iota^{i,j}$ induces a homomorphism $f_p^{i,j} : H_p(K_i) \to H_p(K_j)$ for each dimension p. Therefore, for the filtration, we have a corresponding sequence of homology groups connected by homomorphisms:

$$0 = H_p(K_0) \to H_p(K_1) \to \cdots \to H_p(K_n) = H_p(K)$$

For example, given a fixed point cloud, we can obtain a sequence of Rips complexes $(R_{\epsilon_i})_{i=1}^{N}$ for a sequence of increasing radii $(\epsilon_i)_{i=1}^{N}$ of balls centered at each point. Instead of the homology of each individual complex $R_{\epsilon_i}$, we examine the evolution of the homology classes. Specifically, we are interested in the birth time and death time of the topological features.

The p-th persistent homology groups are the images of the homomorphisms induced by inclusion, $H_p^{i,j} = \mathrm{im}\, f_p^{i,j}$. The p-th persistent Betti numbers are the ranks of these groups, namely $\beta_p^{i,j} = \mathrm{rank}(H_p^{i,j})$. Intuitively, $H_p^{i,j}$ is the set of homology classes that exist in $H_p(K_i)$ and survive until $H_p(K_j)$, and the persistent Betti number counts the number of independent homology classes that survive.

Persistence and persistence diagram. Let us consider the following quantity:

$$\mu_p^{i,j} := (\beta_p^{i,j-1} - \beta_p^{i,j}) - (\beta_p^{i-1,j-1} - \beta_p^{i-1,j})$$

The quantity $\mu_p^{i,j}$ is called the pairing number. It records the number of p-dimensional homology classes born at $K_i$ and dying entering $K_j$. The p-th persistence diagram Dgm is then a planar point set with multiplicities, where a point (x, y) belongs to Dgm if $\mu_p^{x,y} \neq 0$, and the multiplicity of (x, y) is $\mu_p^{x,y}$. Therefore, each point q ∈ Dgm represents the lifespan of a particular topological feature (connected component, loop, void, etc.), with its birth and death times as coordinates.

Distance between persistence diagrams. We now define the p-th diagram distance between persistence diagrams. Let p ∈ N and let Dgm_1, Dgm_2 be two persistence diagrams. Let Γ : Dgm_1 ⊇ A → B ⊆ Dgm_2 be a partial bijection between Dgm_1 and Dgm_2. Then, for any point x ∈ A, the p-cost of x is defined as $c_p(x) = \|x - \Gamma(x)\|_\infty^p$, and for any point y ∈ (Dgm_1 ⊔ Dgm_2) \ (A ⊔ B), the p-cost of y is defined as $c'_p(y) = \|y - \pi_\Delta(y)\|_\infty^p$, where π_Δ is the projection onto the diagonal Δ = {(x, x) : x ∈ R}. The cost of Γ is defined as

$$c_p(\Gamma) = \Big( \sum_x c_p(x) + \sum_y c'_p(y) \Big)^{1/p}.$$

We then define the p-th diagram distance $d_p$ as the cost of the best partial bijection:

$$d_p(\mathrm{Dgm}_1, \mathrm{Dgm}_2) = \inf_{\Gamma} c_p(\Gamma).$$

The p-th diagram distance is also termed the p-th Wasserstein distance, denoted $w_p$. In the particular case p = +∞, the cost of Γ is defined as $c_\infty(\Gamma) = \max\{\max_x c_1(x), \max_y c'_1(y)\}$. The corresponding distance $d_\infty$ is often called the bottleneck distance. One can show that $d_p \to d_\infty$ as p → +∞.
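For completeness, here is a small Python sketch that computes these distances for two tiny diagrams. It assumes the persim package, whose wasserstein function may use slightly different conventions (choice of p and of the ground metric) than the definition given above; in our experiments we instead use the R package TDA.

import numpy as np
from persim import bottleneck, wasserstein

# Two tiny persistence diagrams, given as (birth, death) pairs.
dgm1 = np.array([[0.0, 1.0], [0.2, 0.4]])
dgm2 = np.array([[0.0, 1.1]])

print(bottleneck(dgm1, dgm2))    # the bottleneck distance d_infinity
print(wasserstein(dgm1, dgm2))   # a Wasserstein-type diagram distance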

2.4 Persistence Module

The persistence module is a critical algebraic concept arising in topological data analysis. In the TDA workflow, we start with some data, construct a growing family of simplicial complexes, and obtain a persistence module by applying to this filtration a functor from topological spaces to vector spaces. For example, the functor can be k-dimensional singular homology with coefficients in some fixed field.

A persistence module M (up to re-indexing) of length n is defined to be an indexed family of vector spaces

$$(V_i, \ 1 \leq i \leq n)$$

and a doubly-indexed family of linear maps

$$f_t^s : V_s \rightarrow V_t, \quad s \leq t,$$

which satisfy the composition law

$$f_t^s \circ f_s^r = f_t^r$$

whenever r ≤ s ≤ t, and where $f_t^t$ is the identity map on $V_t$. In applications, the finite indexing set usually looks like an increasing sequence of non-negative real numbers $r_1 < r_2 < \cdots < r_n$. The corresponding persistence modules are called R-indexed persistence modules, which will be our setting in the following.

Interleaving Distance and Stability of Persistence Modules. A key notion arising from the study of persistence modules is the concept of interleaving, which can be employed to measure the distance between persistence modules. In 2009, Chazal et al. [16] proposed ε-interleavings between persistence modules. The purpose of introducing ε-interleavings is to provide a pseudometric $d_I$, the interleaving distance, on persistence modules.

Given ε ≥ 0, an ε-interleaving between two persistence modules $\mathbb{V} = \{V^\delta \xrightarrow{\nu^{\delta,\delta'}} V^{\delta'}\}_{\delta \leq \delta'}$ and $\mathbb{U} = \{U^\delta \xrightarrow{\mu^{\delta,\delta'}} U^{\delta'}\}_{\delta \leq \delta'}$ is given by two families of linear maps

$$\{\varphi_{\delta,\delta+\varepsilon} : V^\delta \rightarrow U^{\delta+\varepsilon}\}_{\delta \in \mathbb{R}}, \qquad \{\psi_{\delta,\delta+\varepsilon} : U^\delta \rightarrow V^{\delta+\varepsilon}\}_{\delta \in \mathbb{R}}$$

that commute with the internal maps of $\mathbb{V}$ and $\mathbb{U}$ and satisfy $\psi_{\delta+\varepsilon,\delta+2\varepsilon} \circ \varphi_{\delta,\delta+\varepsilon} = \nu^{\delta,\delta+2\varepsilon}$ and $\varphi_{\delta+\varepsilon,\delta+2\varepsilon} \circ \psi_{\delta,\delta+\varepsilon} = \mu^{\delta,\delta+2\varepsilon}$ for all δ ∈ R. In that case, $\mathbb{V}$ and $\mathbb{U}$ are also called ε-interleaved. Then, the interleaving distance between persistence modules $\mathbb{V}$ and $\mathbb{U}$ is defined to be

$$d_I(\mathbb{U}, \mathbb{V}) := \inf\{\varepsilon \geq 0 : \mathbb{U} \text{ and } \mathbb{V} \text{ are } \varepsilon\text{-interleaved}\}$$

Then we have the Algebraic Stability Theorem: let $\mathbb{U}$, $\mathbb{V}$ be two persistence modules; then

$$d_B(\mathrm{Dgm}(\mathbb{U}), \mathrm{Dgm}(\mathbb{V})) \leq d_I(\mathbb{U}, \mathbb{V})$$

Besides, it is worthwhile to point out that this stability result is not restricted to the case of continuous functions defined over triangulable spaces. It is a more general form of stability results for persistence modules compared with the classical bottleneck stability [17].

Chapter 3: Road Network Data Analysis

Recurrence and periodicity are typical of dynamical systems observed in nature. We apply computational topology tools to analyze the recurrent behavior of the vehicle flow system. In this chapter, we first introduce the time-delay embedding. After applying the time-delay embedding to the time-series data, we obtain point clouds in a high-dimensional space. Since our real data are highly corrupted with noise, we introduce and apply a denoising algorithm. After that, we apply the typical topological data analysis workflow to the denoised data and obtain the persistence diagrams corresponding to the time-series data of different weeks. Then we compute the distances between persistence diagrams and perform hierarchical clustering to unravel the patterns among the time-series data of different weeks.

3.1 Time-delay Embedding

The time-delay embedding, or sliding window embedding, has been used mostly in the field of dynamical systems to study the nature of their attractors. Takens' theorem [13] provides conditions under which a smooth attractor can be reconstructed from observations made with a generic function. This methodology has been applied to study chaotic dynamical systems arising in fields such as electroencephalography and electrocardiography [18].

Then, in [19], V. de Silva et al. proposed a new framework, combining the time-delay embedding with tools from computational topology, to study the periodic behavior of recurrent systems. After that, in [20], J. A. Perea et al. employed the time-delay embedding and tools from topological data analysis to study time series arising in the study of gene expression data.

Definition 3.1.1. Given a time series f : t ↦ f(t) ∈ R, a delay τ, and a dimension parameter M, the time-delay embedding is the lift to the time series φ : t ↦ φ(t) ∈ R^{M+1} defined by

φ(t) = (f(t), f(t + τ), ..., f(t + Mτ))

Note that the window size Mτ is critical for the embedding. We will elaborate more on this in the next section.

Using the time-delay embedding, we lift the time-series data into a point cloud in (M + 1)-dimensional space. The points of the lifted time series then cluster around some submanifold or other subspace of R^{M+1}. Besides, since real data are always corrupted, we must choose the parameters M and τ carefully for a 'good' embedding.

For periodic time-series data, the lifted point cloud traces one or more loops, namely topological circles, in the (M + 1)-dimensional space [19]. For example, the simplest periodic system is a sinusoid: if we apply the time-delay embedding t ↦ (sin(t), sin(t + τ)), the lifted points trace a loop in the plane. Then, we can use tools from computational topology to detect these topological circles.
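A minimal Python sketch of the time-delay embedding, with the delay τ measured in samples, is given below; it mirrors the MATLAB loop listed in Appendix B.

import numpy as np

def time_delay_embedding(signal, M, tau):
    """Lift a 1-D signal f(t) to points phi(t) = (f(t), f(t+tau), ..., f(t+M*tau)).
    `tau` is measured in samples; the output is an array of (M+1)-dimensional points."""
    signal = np.asarray(signal, dtype=float)
    n_points = len(signal) - M * tau
    return np.array([signal[i : i + M * tau + 1 : tau] for i in range(n_points)])

# A quick check on a short ramp signal with M = 2, tau = 3.
print(time_delay_embedding(np.arange(10), M=2, tau=3))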

3.2 Data Visualization

As mentioned previously, we downloaded the dataset from Caltrans PeMS. Each dataset records the 5-minute vehicle flow detected by a detector over one week (from Sunday to Saturday). Some examples are shown in Figure 3.1.

Figure 3.1: Time series data visualization

The first plot is the time-series data for the week from 2017/10/01 to 2017/10/07, while the second plot is the time-series data for the week from 2017/11/19 to 2017/11/25. They are both for the same detector (Detector ID: 409529). The second week is the Thanksgiving week of 2017, while the first week is just a usual week. As the figure shows, in the first plot there are five higher peaks in the center and one lower peak at each end. This is reasonable: the five higher peaks correspond to the busy weekdays, while the two lower peaks indicate quiet, stay-at-home weekends. For the second plot, things are a little different. The first lower peak indicates the usual Sunday. After that there are three higher peaks corresponding to usual weekdays, followed by a dramatically decreased peak accounting for the first day of the Thanksgiving holiday. Finally, over the last two peaks we can observe a gradual increase. Perhaps, after a one-day rest, more people started driving out to enjoy the holiday.

This is basically what the time-series data look like. There are two typical patterns, namely the usual week pattern and the holiday week pattern.

In this project, I initially collected the time-series data of 14 different weeks from 2017/10/01 to 2018/01/06. The week numbers and the corresponding dates are shown below.

Week 1: 2017/10/01 ∼ 2017/10/07
Week 2: 2017/10/08 ∼ 2017/10/14
Week 3: 2017/10/15 ∼ 2017/10/21
Week 4: 2017/10/22 ∼ 2017/10/28
Week 5: 2017/10/29 ∼ 2017/11/04
Week 6: 2017/11/05 ∼ 2017/11/11
Week 7: 2017/11/12 ∼ 2017/11/18
Week 8: 2017/11/19 ∼ 2017/11/25 (Thanksgiving week)
Week 9: 2017/11/26 ∼ 2017/12/02
Week 10: 2017/12/03 ∼ 2017/12/09
Week 11: 2017/12/10 ∼ 2017/12/16
Week 12: 2017/12/17 ∼ 2017/12/23
Week 13: 2017/12/24 ∼ 2017/12/30 (Christmas week)
Week 14: 2017/12/31 ∼ 2018/01/06 (New Year week)

Then, for each time-series dataset, namely each week's time-series data, I applied the time-delay embedding method to project the time-series data to a corresponding point cloud in a high-dimensional space. First, let us look at an example to gain some insight and intuition about what the point cloud looks like when we project periodic time-series data into a high-dimensional space. We sample one thousand points on the function f(x) = sin(x/2) + 2, with x-coordinates evenly distributed over the interval [0, 16π], and then apply the time-delay embedding to this dataset. According to the time-delay embedding method, the optimal window size Mτ should be close to the intrinsic period of the system. In this test, we choose M = 5 and τ = 50 × 16π/999, and therefore the window size is Mτ = 4000π/999, which is very close to the intrinsic period of the function, namely 4π. Hence, in this way, we projected the time-series data sampled from a periodic function into a point cloud in 6-dimensional space. After doing PCA for dimension reduction, we get Figure 3.2. It is basically a circle in the 2D plane.
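The synthetic test just described can be reproduced with a few lines of Python; the sketch below assumes scikit-learn for the PCA step and NumPy for the sampling and embedding.

import numpy as np
from sklearn.decomposition import PCA

# One thousand samples of f(x) = sin(x/2) + 2 on [0, 16*pi].
x = np.linspace(0, 16 * np.pi, 1000)
f = np.sin(x / 2) + 2

# Time-delay embedding with M = 5 and a delay of 50 samples,
# so the window covers roughly one period (about 4*pi) of the signal.
M, tau = 5, 50
cloud = np.array([f[i : i + M * tau + 1 : tau] for i in range(len(f) - M * tau)])

# Project the 6-dimensional point cloud to the plane with PCA; the projected
# points should trace out (approximately) a circle, as in Figure 3.2.
cloud_2d = PCA(n_components=2).fit_transform(cloud)
print(cloud_2d.shape)   # (750, 2)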

Figure 3.2: project the point cloud to 2D plane

Now let us look at our real dataset. The period of our real data is clearly one day, or 1440 minutes. Therefore, in the time-delay embedding, we set M = 5 and τ = 250 minutes, and the window size is Mτ = 1250 minutes, which is close to the real period of 1440 minutes. Hence, in this way, we projected the real time-series data into a point cloud in 6-dimensional space. We again use the time-series data of Week 1 and Week 8 of Detector 409529, shown above, as an example. After doing PCA for dimension reduction, we get Figure 3.3.

20 (a) Week 1 (b) Week 8

Figure 3.3: project the high-dimensional point cloud to 2D plane

As the figure indicates, we can intuitively recognize one circle in the plot of Week 1 and roughly two circles in the plot of Week 8. We also notice that our data are corrupted by noise. After applying the standard topological data analysis workflow to the noisy data of all fourteen weeks, we get the persistence diagrams. Then we compute the bottleneck distances and perform hierarchical clustering; the dendrogram is shown in Figure 3.4.

Figure 3.4: Using Bottleneck Distance

It is quite unexpected that Week 4 is dissimilar to the rest of the weeks. To look into this phenomenon, we plot the time-series data of Week 4 (Figure 3.5).

Figure 3.5: Time series data of Week 4

We also plot the projection of the point cloud data of Week 4 (Figure 3.6).

Figure 3.6: project the point cloud of week 4 to 2D plane

As the plot shows, there are some abnormalities in the time-series data: in some time intervals, there is a dramatic decrease in the vehicle flow. In the point cloud, this dramatic decrease generates some outliers after we apply the time-delay embedding. One possible explanation is that there was a traffic incident near the detector around that period. Searching Google News, we found that there was indeed a major injury collision nearby; here is the news link: https://patch.com/california/paloalto/crash-blocks-oregon-expressway-palo-alto-police. According to the news report, the police closed the Oregon Expressway for 45 minutes and reopened it at about 9:50 a.m. The Oregon Expressway is not too far away from detector 409529 (coordinates: 37.366666, -121.921519). The abnormality in the time series lasts for about 45 minutes, corresponding to the closure time of the expressway.

This abnormality, caused by the possible car accident, results in outliers in the point cloud data after the time-delay embedding. Persistent homology results are sensitive to noise and outliers in the data. Therefore, before applying the standard topological data analysis workflow to the point cloud data, we denoise the point cloud data.

24 3.3 Denoising

Data from real life are inevitably corrupted. Noisy data can adversely affect the results of any data analysis if not handled properly. In this project, we use the denoising algorithms proposed by M. Buchet et al. in [21].

In many data analysis applications, we assume that the given point set represents an underlying ground truth K in a metric space. However, it often happens that some outliers, far away from K, corrupt the data set. Such outliers are also termed ambient noise or background noise. Therefore, we apply a denoising algorithm to eliminate the ambient noise so that the curated data lie within a bounded Hausdorff distance of K [21]. In the paper, M. Buchet et al. proposed two denoising algorithms: the first is a simple denoising algorithm that requires only a single parameter but provides a theoretical guarantee on the quality of the output on general input points; the second algorithm avoids even this parameter, paying for it with a slight (and not unrealistic) strengthening of the sampling condition on the input points [21].

Specifically, the first simple yet effective denoising algorithm, termed the Declutter algorithm, takes a set of points P and a parameter k as input and outputs a set of points Q ⊆ P. Suppose P is a noisy sample of a hidden compact space K; then the algorithm guarantees that Q lies within a small tubular neighborhood of K and that outliers are eliminated [21]. The second algorithm, termed the ParfreeDeclutter algorithm, is parameter free. The Declutter algorithm is simple and effective, but it also "sparsifies" the input points. The second algorithm fixes this problem by combining the first algorithm with a resampling process that enriches the output set Q, bringing some of the discarded points back and therefore obtaining a denser sampling of the ground truth [21].
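To convey the idea of removing ambient noise, the following Python sketch implements a naive k-nearest-neighbor distance filter. We emphasize that this is not the Declutter or ParfreeDeclutter algorithm of [21]; it is only a simplified illustration, and the threshold factor is an arbitrary assumption.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def knn_distance_filter(points, k=15, factor=2.0):
    """Naive outlier filter (illustration only, not the algorithm of [21]):
    drop points whose average distance to their k nearest neighbors is much
    larger than is typical for the whole cloud."""
    dist = squareform(pdist(points))
    # Average distance to the k nearest neighbors (excluding the point itself).
    knn_avg = np.sort(dist, axis=1)[:, 1 : k + 1].mean(axis=1)
    keep = knn_avg <= factor * np.median(knn_avg)
    return points[keep]

# Example: a circle with a few far-away outliers added.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 300)
circle = np.c_[np.cos(theta), np.sin(theta)]
outliers = rng.uniform(-10, 10, size=(10, 2))
cleaned = knn_distance_filter(np.vstack([circle, outliers]))
print(len(cleaned))   # most of the 10 outliers should be removed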

In this project, we use the second algorithm, namely the ParfreeDeclutter algorithm, to denoise our point cloud. We use the point cloud data of Week 1 and Week 4 of Detector 409529 as examples, and the denoised point cloud data are shown in Figure 3.7.

(a) Week 1 (b) Week 4

Figure 3.7: project the denoised point cloud to 2D plane

Besides, let’s take a look at the persistence diagrams for week 4 before and after denoising.

(a) Persistence Diagram Before Denoising (b) Persistence Diagram After Denoising

Figure 3.8: Persistence Diagrams for week 4

As the persistence diagrams in Figure 3.8 show, some outliers are eliminated and therefore the topological features become more persistent.

26 3.4 Experiments

Based on the concept of power distance proposed by M. Buchet et al. in [22], we adopt a more general notion of distance on the point cloud instead of the Euclidean distance. Specifically, we define the power distance between the p-th and q-th points in the point cloud to be

$$f(p, q) = \sqrt{d^2(p, q) + w_p + w_q}$$

where d(p, q) is the Euclidean distance between the p-th and q-th points, and w_p and w_q are the values of w at the points p and q. In our project, we set w to be the average distance of the nearest 15 points; namely, w_i stands for the average distance of the nearest 15 points to the i-th point.
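A Python sketch of this power distance, following the formula above (with w_i the average distance to the k = 15 nearest neighbors), is given below; the MATLAB version actually used in our experiments is listed in Appendix B.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def power_distance_matrix(points, k=15):
    """Power distance f(p, q) = sqrt(d(p, q)^2 + w_p + w_q), where w_i is the
    average distance from point i to its k nearest neighbors (as defined above)."""
    d = squareform(pdist(points))                       # Euclidean distance matrix
    w = np.sort(d, axis=1)[:, 1 : k + 1].mean(axis=1)   # exclude the point itself
    return np.sqrt(d ** 2 + w[:, None] + w[None, :])

# Usage: feed the resulting matrix to a persistence tool that accepts distance matrices.
cloud = np.random.default_rng(2).normal(size=(100, 6))
D = power_distance_matrix(cloud)
print(D.shape)   # (100, 100)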

Then we apply the topological data analysis workflow to the point clouds with the power distance. In this part, we used Ripser [23] for barcode visualization, SimBa [24] for computing the persistence diagrams, and the R package TDA [25] for computing the bottleneck distance and Wasserstein distance between diagrams. Finally, based on the bottleneck distance and Wasserstein distance, we did hierarchical clustering.
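As an illustration of the last step, the following Python sketch builds dendrograms from a precomputed pairwise distance matrix using SciPy; the matrix D here is a random symmetric placeholder standing in for the 14 × 14 matrix of bottleneck or Wasserstein distances between the weekly persistence diagrams.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

# D[i, j] stands for the distance between the diagrams of week i+1 and week j+1.
rng = np.random.default_rng(3)
A = rng.random((14, 14))
D = (A + A.T) / 2
np.fill_diagonal(D, 0.0)

labels = [f'Week {i}' for i in range(1, 15)]
for method in ['complete', 'single', 'average']:
    Z = linkage(squareform(D), method=method)   # condensed distance vector as input
    plt.figure()
    dendrogram(Z, labels=labels)
    plt.title(f'{method} linkage')
plt.show()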

3.5 Results and Comparison

First, let us again use some results for Week 1 and Week 8 of Detector 409529 as examples. Since we are only interested in the 1-cycles traced by the point clouds, only the Vietoris-Rips persistence diagrams in dimension 1 are shown. We show the 1D persistence diagrams in Figure 3.9.

27 (a) Week 1 (b) Week 8

Figure 3.9: Persistence Diagram

We also compute the persistence diagram for the point cloud of each week's time-series data, as well as the bottleneck distance and Wasserstein distance between each pair of persistence diagrams. Finally, we perform hierarchical clustering based on the bottleneck distance and the Wasserstein distance, respectively. The resulting dendrograms are shown in Figure 3.10.

28 (a) Complete Linkage

(b) Single Linkage

(c) Average Linkage

Figure 3.10: Hierarchical Clustering of Weekly Data of Detector: 409529

As the hierarchical clustering dendrograms indicate, Week 8, Week 13, and Week 14 seem to be significantly dissimilar to the rest, which is consistent with the fact that Week 8 is the Thanksgiving week, Week 13 is the Christmas week, and Week 14 is the New Year week.

Besides, the hierarchical clustering using the bottleneck distance indicates that Week 11 is also dissimilar. To understand this result, we plot the time-series data of Week 11 for Detector 409529, shown in Figure 3.11.

Figure 3.11: Time series data of Week 11

As we can observe, there is an obvious irregularity in the fifth period of the time-series data. This should account for the dissimilarity of Week 11. A possible explanation for the irregularity is that there was a traffic accident during that period.

All of the above are results for the specific detector 409529. Now let us do the same experiment for another detector, 409528. The time-series data of Week 1 and Week 8 of detector 409528 are shown in Figure 3.12.

30 Figure 3.12: Time series data of Week 1 and Week 8

The final results of the hierarchical clustering are shown in Figure 3.13.

31 (a) Complete Linkage

(b) Single Linkage

(c) Average Linkage

Figure 3.13: Hierarchical Clustering of Weekly Data of Detector: 409528

The results are almost the same as those for detector 409529: Week 8, Week 11, Week 13, and Week 14 are significantly dissimilar to the others. Now we plot the time-series data of Week 11 of detector 409528 in Figure 3.14.

Figure 3.14: Time series data of Week 11

As we can see, there is a period during which the vehicle flow recorded by the detector is zero. This contributes to the dissimilarity of Week 11. During that period, there might have been a traffic accident at a location near the two detectors, and perhaps closer to detector 409528.

3.6 Extension and future work

In addition to the analysis of weekly data, we also apply the workflow to analyze monthly data. Specifically, we collect the monthly data for 2017. Each dataset records the hourly vehicle flow of a calendar month. Finally, we get the hierarchical clustering dendrograms shown in Figure 3.15.

33 (a) Complete Linkage

(b) Single Linkage

(c) Average Linkage

Figure 3.15: Hierarchical Clustering of Monthly Data

As the dendrograms show, Month 3 and Month 12, namely March and December, are significantly dissimilar to the rest. The explanation might be that spring break and the Christmas week notably disturbed the normal pattern of the vehicle flow and therefore made March and December dissimilar.

Besides, one possible direction for future work is to analyze the data of the same week or month for different detectors and see if there is any cluster structure among those detectors. This might help extract geographical information about the detectors and the transportation infrastructure.

Appendix A: More results and plots

First, we present some additional plots of the results.

Figure A.1: project the denoised point cloud of week 8 to 2D plane

Figure A.2: Persistence Diagram for Week 11

According to the time-delay embedding method, the optimal window size Mτ should be close to the intrinsic period of the system. The period of our real data is clearly one day, or 1440 minutes. Therefore, previously in the time-delay embedding, we set M = 5 and τ = 250 minutes, and the window size was Mτ = 1250 minutes, which is very close to the real period of 1440 minutes. But what if we choose the parameters such that the window size is far from the intrinsic period? In this part, we set M = 4 and τ = 50 and run the pipeline. The following are some results and plots.

37 (a) Week 1 (b) Week 4

Figure A.3: project the generated point cloud to 2D plane (M = 4 and τ = 50)

As the plots show, the point clouds now seem to be squashed onto a line. We can also visualize the persistence barcodes of dimension 1 using the online Ripser tool [23].

(a) Week 1 (b) Week 4

Figure A.4: Visualization of major barcodes (M = 4 and τ = 50)

As the figures show, under this time-delay embedding setting, the topological features are not significantly persistent. As J. A. Perea et al. point out in [18], "the maximum persistence, as a measure of the roundness of the point cloud, occurs when the window size corresponds to the natural frequency of the signal".

Besides, for the weekly data of Detector 409529, if we set k = 25 in the power distance step, we get the hierarchical clustering dendrograms shown in Figure A.5. As the plots indicate, setting k = 25 does not affect the final dendrograms much.

39 (a) Complete Linkage

(b) Single Linkage

(c) Average Linkage

Figure A.5: Hierarchical Clustering of Weekly Data for Detector 409529 with k = 25

Appendix B: Main Code

Python code for processing the downloaded text data file:

# Keep the timestamp, station ID, and total flow of one detector from the raw
# PeMS "Station 5-Minute" text file; missing flow values are written as -1.
f = open('filePath', 'r')    # path to the raw PeMS text file
g = open('filePath', 'a')    # path to the processed output file
for line in f:
    temp = line.split(',')
    if temp[1] == '409529':
        if str(temp[9]) == '':
            g.write(str(temp[0]) + ' ' + str(temp[1]) + ' ' + str(-1) + '\n')
        else:
            g.write(str(temp[0]) + ' ' + str(temp[1]) + ' ' + str(temp[9]) + '\n')
f.close()
g.close()

MATLAB code for the time-delay embedding:

% Read the processed flow values (one value per line after a 27-character prefix).
fid = fopen('processedDataPath', 'r');
tline = fgetl(fid);
allTDS2 = [str2num(tline(28:end))];
while ischar(tline)
    tline = fgetl(fid);
    if ischar(tline)
        allTDS2 = [allTDS2, str2num(tline(28:end))];
    end
end
fclose(fid);

% Sliding-window (time-delay) embedding: step = 50 samples (250 minutes) and
% pCD = 6 coordinates per point, i.e. M = 5.
pCloud = [];
step = 50;
pCD = 6;
for i = 1:length(allTDS2) - step*(pCD-1)
    pCloud = vertcat(pCloud, allTDS2(i:step:i+step*(pCD-1)));
end

MATLAB code for computing the power distance:

% Pairwise Euclidean distance matrix of the point cloud.
pCloud2 = pCloud;
[row, col] = size(pCloud2);
distMat = zeros(row);
for i = 1:row
    for j = i:row
        if i ~= j
            distMat(i, j) = norm(pCloud2(i, :) - pCloud2(j, :), 2);
            distMat(j, i) = distMat(i, j);
        end
    end
end

% w(i): average squared distance to the k = 15 nearest neighbors.
Wp2 = zeros(1, row);
k = 15;
for i = 1:row
    temp = sort(distMat(i, :));
    Wp2(i) = sum(temp(2:(k + 1)).^2) / k;
end

% Power distance: f(p, q) = sqrt(d(p, q)^2 + w_p + w_q).
for i = 1:row
    for j = 1:row
        distMat(i, j) = sqrt(distMat(i, j)^2 + Wp2(i) + Wp2(j));
    end
end

Bibliography

[1] https://en.wikipedia.org/wiki/Topological_data_analysis.

[2] H. Edelsbrunner and J. Harer. Computational Topology. An Introduction. American Mathematical Society, Providence, 2010.

[3] F. Chazal and B. Michel. An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists. arXiv preprint arXiv:1710.04019, 2017.

[4] G. Carlsson, A. Collins, L. Guibas, and A. Zomorodian. Persistence barcodes for shapes. Internat. J. Shape Modeling, 2005.

[5] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28:511–533, 2002.

[6] G. Carlsson. Topology and data. Bull. Amer. Math. Soc. (N.S.), 46(2):255–308, 2009.

[7] R. Ghrist. Barcodes: the persistent topology of data. Bull. Amer. Math. Soc. (N.S.), 45(1):61–75, 2008.

[8] A. Cerri, M. Ferri, and D. Giorgi. Retrieval of trademark images by means of size functions. Graphical Models, 68(5):451–471, 2006.

[9] T. Nakamura, Y. Hiraoka, A. Hirata, E. G. Escolar, and Y. Nishiura. Persistent homology and many-body atomic structure for medium-range order in the glass. Nanotechnology, 26(304001), 2015.

[10] V. de Silva and R. Ghrist. Coverage in sensor networks via persistent homology. Alg. Geom. Topology, 7:339–358, 2007.

[11] M. Nicolau, A. J. Levine, and G. Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proc. Nat. Acad. Sci., 108(17), 2011.

[12] http://pems.dot.ca.gov/PeMS_Intro_User_Guide_v5.pdf. An Introduction to the California Department of Transportation Performance Measurement System (PeMS).

[13] F. Takens. Detecting strange attractors in turbulence. In D. A. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, Lecture Notes in Mathematics, Springer-Verlag, 898:366–381, 1981.

[14] S. Y. Oudot. Persistence Theory: From Quiver Representations to Data Analysis (Mathematical Surveys and Monographs). American Mathematical Society, 2017.

[15] A. Zomorodian and G. Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.

[16] F. Chazal, D. Cohen-Steiner, M. Glisse, L. J. Guibas, and S. Y. Oudot. Proximity of persistence modules and their diagrams. Proceedings of the twenty-fifth annual symposium on Computational geometry, ACM, pages 237–246, 2009.

[17] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.

[18] J. A. Perea and J. Harer. Sliding windows and persistence: An application of topological methods to signal analysis. Foundations of Computational Mathematics, pages 1–40, 2013.

[19] V. de Silva, P. Skraba, and M. Vejdemo-Johansson. Topological Analysis of Recurrent Systems. Workshop on Algebraic Topology and Machine Learning, NIPS, 2012.

[20] J. A. Perea, A. Deckard, S. B. Haase, and J. Harer. Sw1pers: Sliding window and 1-persistence scoring; discovering periodicity in gene expression time series data. BMC bioinformatics, 16(1), 2015.

[21] M. Buchet, T. K. Dey, J. Wang, and Y. Wang. Declutter and resample: Towards parameter free denoising. ArXiv e-prints, 2015.

[22] M. Buchet, F. Chazal, S. Y. Oudot, and D. R. Sheehy. Efficient and robust topological data analysis on metric spaces. ArXiv, 2013.

[23] http://live.ripser.org.

[24] http://web.cse.ohio-state.edu/dey.8/SimBa/Simba.html.

[25] https://cran.r-project.org/web/packages/TDA/index.html.

44