Topological Data Analysis on Road Network Data

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Mathematical Science in the Graduate School of The Ohio State University

By

Xiao Zha, B.S.

Graduate Program in Mathematical Science

The Ohio State University

2019

Master’s Examination Committee:

Facundo Mémoli, Advisor

Yusu Wang, Co-Advisor

© Copyright by

Xiao Zha

2019

Abstract

Many problems in science and engineering involve signal analysis, and engineers and scientists have developed many approaches to study signals. Recently, researchers proposed a new framework, combining the time-delay embedding with tools from computational topology, for the study of periodic signals. After applying the time-delay embedding to a periodic signal, its periodic behavior expresses itself as topological cycles, and we can use persistent homology to detect these topological features. In this thesis, we apply this method to analyze road network data, specifically vehicle flow data recorded by detectors placed on highways. First, we apply the time-delay embedding to project the vehicle flow data into point cloud data in a high-dimensional space. Then, we use persistent homology to detect the topological features and obtain a persistence diagram. Next, we repeat the same experiment on vehicle

flow data of different periods. For example, in our experiments, we use the vehicle flow data of different weeks and months. Therefore, we obtain persistence diagrams corresponding to the vehicle flow data of different periods. Finally, we calculate the bottleneck distance and Wasserstein distance between these persistence diagrams and perform hierarchical clustering. The dendrograms of the hierarchical clustering reveal the patterns behind the vehicle flow data.

This thesis is dedicated to everyone ever

Acknowledgments

First and foremost, I would like to thank my advisors Yusu Wang and Facundo Mémoli for introducing me to the field of computational topology and topological data analysis, for mentoring me through the whole project, and for their endless patience.

I also want to thank Dayu Shi for helping me with SimBa and Jiayuan Wang for helping me with the denoising algorithm.

Vita

July 10, 1993 ...... Born - Anhui, China

2015 ...... B.S. Applied Mathematics, China University of Petroleum, Beijing

2016-present ...... Graduate Student, The Ohio State University

Fields of Study

Major Field: Mathematical Science

Table of Contents

Page

Abstract ...... ii

Dedication ...... iii

Acknowledgments ...... iv

Vita...... v

List of Figures ...... viii

1. Introduction ...... 1

1.1 Topological Data Analysis (TDA) ...... 1
1.2 Road Network Data ...... 4
1.3 Outline ...... 5

2. Persistent Homology ...... 6

2.1 Complexes ...... 6
2.2 Homology ...... 9
2.3 Persistent Homology ...... 11
2.4 Persistence Module ...... 13

3. Road Network Data Analysis ...... 16

3.1 Time-delay Embedding ...... 16
3.2 Data Visualization ...... 17
3.3 Denoising ...... 25
3.4 Experiments ...... 27
3.5 Results and Comparison ...... 27
3.6 Extension and future work ...... 33

Appendices ...... 36

A. More results and plots ...... 36

B. Main Code ...... 41

Bibliography ...... 43

List of Figures

Figure Page

1.1 Topological data analysis workflow from Wikipedia [1]...... 3

2.1 0-simplex, 1-simplex, 2-simplex, and 3-simplex [2]...... 7

3.1 Time series data visualization ...... 18

3.2 project the point cloud to 2D plane ...... 20

3.3 project the high-dimensional point cloud to 2D plane ...... 21

3.4 Using Bottleneck Distance ...... 22

3.5 Time series data of Week 4 ...... 23

3.6 project the point cloud of week 4 to 2D plane ...... 24

3.7 project the denoised point cloud to 2D plane ...... 26

3.8 Persistence Diagrams for week 4 ...... 26

3.9 Persistence Diagram ...... 28

3.10 Hierarchical Clustering of Weekly Data of Detector: 409529 ...... 29

3.11 Time series data of Week 11 ...... 30

3.12 Time series data of Week 1 and Week 8 ...... 31

3.13 Hierarchical Clustering of Weekly Data of Detector: 409528 ...... 32

3.14 Time series data of Week 11 ...... 33

3.15 Hierarchical Clustering of Monthly Data ...... 34

A.1 project the denoised point cloud of week 8 to 2D plane ...... 36

A.2 Persistence Diagram for Week 11 ...... 37

A.3 project the generated point cloud to 2D plane (M = 4 and τ = 50) . . . . . 38

A.4 Visualization of major barcodes (M = 4 and τ = 50) ...... 38

A.5 Hierarchical Clustering of Weekly Data for Detector 409529 with k = 25 . . 40

Chapter 1: Introduction

Signal analysis is a fundamental problem for many engineers and scientists. There are numerous methods to analyze signals and many associated applications. In this thesis, we apply a new approach, the time-delay embedding (see Chapter 3 for the definition), to study periodicity in signals. In particular, we apply the typical workflow of topological data analysis to the point clouds obtained by applying the time-delay embedding to the signal.

1.1 Topological Data Analysis (TDA)

Topological Data Analysis is an emerging field that traces back to the development of computational topology during the first decade of this century [3]. Geometric approaches to data analysis have been used for quite a long time. In 2002, the concept of persistent homology was introduced by Edelsbrunner et al. [5], who also put forward an efficient algorithm to compute persistent homology and its visualization as a persistence diagram [5].

Then, in 2004, Carlsson et al. reformulated the initial definition and gave a visualization method called the persistence barcode, which is equivalent to the persistence diagram and interprets persistence in the language of commutative algebra [4]. Finally, in 2009, TDA was popularized in a milestone paper of Carlsson [6].

The past decade has witnessed the success of Topological Data Analysis (TDA) as an approach to the analysis of datasets using techniques from topology. Extracting information from datasets that are high-dimensional, large, and complex is always challenging. TDA aims at providing a general framework to unravel and analyze the complex topological and geometrical structures underlying data. Usually, the data are represented in the form of point clouds in Euclidean or more general metric spaces. The basic and standard pipeline in TDA is:

1. Clouds of Data. In many instances, the input dataset is an unordered sequence of points coming with a notion of distance. The underlying logic of TDA is that shape matters, and the global "shape" of the data can help us unravel the underlying patterns or phenomena indicated by the data.

2. Nested Complexes. The most natural way to construct a global structure from the point cloud is to take the points as the vertices of a combinatorial graph whose edges are determined by proximity (vertices within some specified distance ε) [7]. As we increase the parameter ε continuously, a sequence of structures is constructed on top of the point cloud and highlights the underlying topology or geometry. This process converts the point cloud into a parametrized and nested family of simplicial complexes, namely a filtration of simplicial complexes.

3. Persistence Module. The parametrized and nested family of simplicial complexes is called a filtration of simplicial complexes. Taking the homology of each complex in the filtration gives a persistence module. The underlying topological or geometric features can be discovered from the homology groups of each complex. We use the terms birth time and death time for the parameter values at which a feature is created and disappears, respectively.

4. Barcode or Diagram. A barcode is a graphical representation of the birth times and death times of the topological features as a collection of horizontal line segments in a plane, whose horizontal axis corresponds to the parameter and whose vertical axis represents an (arbitrary) ordering of the topological features [7]. A persistence diagram is another visualization of the topological features. It is created by drawing a collection of points in the plane; each point represents a topological feature, and its two coordinates are the birth time and death time of that feature, respectively.

Overall the typical workflow in TDA is:

point cloud → nested complexes → persistence module → barcode or diagram
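To make this workflow concrete, the following short Python sketch runs the whole pipeline on a toy point cloud. It is only an illustration and is not the toolchain used in our experiments (there we use SimBa, Ripser, and the R package TDA); it assumes the third-party Python packages ripser (ripser.py) and persim are installed.

import numpy as np
from ripser import ripser          # Vietoris-Rips persistence (ripser.py)
from persim import bottleneck      # bottleneck distance between diagrams

# Sample two noisy circles as toy point clouds.
rng = np.random.default_rng(0)
def noisy_circle(n=200, r=1.0, noise=0.05):
    theta = rng.uniform(0, 2 * np.pi, n)
    pts = np.c_[r * np.cos(theta), r * np.sin(theta)]
    return pts + noise * rng.standard_normal(pts.shape)

X, Y = noisy_circle(r=1.0), noisy_circle(r=2.0)

# Point cloud -> Rips filtration -> persistence diagrams (dimensions 0 and 1).
dgm_X = ripser(X, maxdim=1)['dgms'][1]
dgm_Y = ripser(Y, maxdim=1)['dgms'][1]

# Compare the 1-dimensional diagrams with the bottleneck distance.
print(bottleneck(dgm_X, dgm_Y))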

Figure 1.1: Topological data analysis workflow from Wikipedia [1].

TDA has many successful applications, including shape study [8], material science [9], sensor networks [10], progression analysis of disease [11], and so on. Perhaps the best-known example of TDA's success is the identification of a new subgroup of breast cancers from an old dataset which had been analyzed via other techniques for decades [11]. In this project, we will apply tools from topological data analysis to analyze road network data and unravel the underlying patterns and features.

1.2 Road Network Data

Caltrans PeMS (the California Transportation Performance Measurement System) is a web-based software tool that is designed and tailored to Caltrans data and users. It collects traffic data from Caltrans traffic sensors placed on highways throughout California, as well as other Caltrans and partner agency data sets. It archives raw sensor data in a database, quality controls and processes it, and outputs it to users on the web in many useful formats to help managers, engineers, planners, and researchers understand transportation performance, identify problems, and formulate solutions [12]. PeMS also processes data from other Caltrans and Caltrans-acquired data sources to support other types of data analysis. PeMS can be accessed via a standard internet browser by anyone who has established an online account.

The Data Clearinghouse (http://pems.dot.ca.gov/?dnode=Clearinghouse) is a tool offered by the PeMS website. It provides a single access point for downloading PeMS data sets. We can use the Data Clearinghouse page to quickly locate data by district, month, and format. After selecting the district and the type of data set and clicking the submit button, it presents a calendar for that data set showing which months are available (and how complete they are). In this project, we used the data set of "District 4" and of type "Station 5-minute". This data set has 17 fields, including "Timestamp", "Station", "Total Flow", and so on. The "Timestamp" field indicates the date and time of the beginning of the summary interval. For example, a time of 08:00:00 indicates that the aggregates contain measurements collected between 08:00:00 and 08:04:59. Note that the second values are always 0 for five-minute aggregations. The format is MM/DD/YYYY HH24:MI:SS. In this thesis, we mainly pay attention to the 5-minute vehicle flow, which is time-series data.

1.3 Outline

For a given signal, as in the attractor-reconstruction method in chaos theory, we consider the time series to be an observation function of a dynamical system. The results of Takens' theorem and its subsequent generalizations prove that the associated time-delay embedding of a 1-dimensional measurement (time series) can recover the underlying dynamics of the system [13].

In this thesis, based on this idea and recent work in computational topology, we apply the time-delay embedding method to analyze the time-series data of vehicle flow and try to unravel the patterns in the vehicle flow data. Thanks to Caltrans PeMS, we have access to the vehicle flow data recorded by many detectors over a long period. First, we downloaded the time-series data of the 5-minute vehicle flow recorded by a specific detector for a specific week. After applying the time-delay embedding to the 5-minute vehicle flow data, we get a point cloud in a higher-dimensional space. Then we apply the TDA workflow to the point cloud and get the persistence diagram. Finally, we can compute persistence diagrams for the data of different weeks and compute the distances between these persistence diagrams. The distances indicate the similarity between the data of different weeks and therefore might unravel the different patterns. Our results show that holidays and the occurrence of traffic accidents can notably disturb the normal vehicle flow pattern recorded by detectors over a period. Besides, in the traffic accident case, the size of the disturbance at each detector could indicate the distance from the detector to the site of the accident.

Next, in Chapter 2, we review the background on persistent homology. Chapter 3 introduces the time-delay embedding, discusses our approach in detail, and shows our experimental results.

Chapter 2: Persistent Homology

In this chapter, we introduce the background and basic concepts of persistent homology.

2.1 Complexes

In order to represent and study topological spaces, it is convenient and usual to decompose these topological spaces into the union of many small pieces that are topologically simple and glued together in a simple manner. These simple pieces should be simple geometric objects like points, line segments, triangles and so on.

Simplex. In geometry, a simplex is a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions. A (k+1)-tuple of points (u_0, u_1, ..., u_k) in R^k, where each u_i ∈ R^k, is called affinely independent when the vectors {u_1 − u_0, u_2 − u_0, ..., u_k − u_0} are linearly independent.

Geometrically, an n-simplex σ = (u_0, u_1, ..., u_n) is the convex hull of an affinely independent (n+1)-tuple of points (u_0, u_1, ..., u_n). The n+1 points u_0, u_1, ..., u_n are called the vertices of the n-simplex. For instance, a 0-simplex is a single vertex, a 1-simplex is a line segment, a 2-simplex is a triangle, a 3-simplex is a tetrahedron, and a 4-simplex is a 5-cell.

Any subset of the affinely independent n+1 points is also affinely independent and therefore also defines a simplex of lower dimension. Geometrically, a face of σ is the convex hull of a non-empty subset of (u_0, u_1, ..., u_n). In particular, the convex hull of a subset of size m+1 is an m-simplex, called an m-face of the n-simplex.

Figure 2.1: 0-simplex, 1-simplex, 2-simplex, and 3-simplex [2].
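The affine independence condition is easy to check numerically: we simply test whether the difference vectors u_1 − u_0, ..., u_k − u_0 have full rank. A minimal sketch in Python:

import numpy as np

def affinely_independent(points):
    """Check whether the rows of `points` (a (k+1) x d array) are affinely independent,
    i.e. whether u_1 - u_0, ..., u_k - u_0 are linearly independent."""
    diffs = points[1:] - points[0]
    return np.linalg.matrix_rank(diffs) == len(diffs)

# The three vertices of a triangle in R^2 are affinely independent ...
print(affinely_independent(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])))  # True
# ... but three collinear points are not.
print(affinely_independent(np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])))  # False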

Simplicial Complex. In topology and combinatorics, it is common to "glue together" simplices to form larger structures; the resulting combinatorial structure is called a simplicial complex. A simplicial complex K is a set of simplices that satisfies the following conditions:

1. Every face of a simplex from K is also in K.

2. The intersection of any two simplices σ_1, σ_2 ∈ K is a face of both σ_1 and σ_2.

Note that the empty set is a face of every simplex. The dimension of K is the maximum dimension of any of its simplices. The underlying space, denoted |K| and sometimes called the carrier of the simplicial complex, is the union of its simplices. Simplicial complexes are extremely useful for providing triangulations of topological spaces, in particular manifolds.

Given a topological space T, a triangulation of T consists of:

1. A simplicial complex K

2. A homeomorphism ϕ : |K| → T

Abstract Simplicial Complex. In mathematics, an abstract simplicial complex is a purely combinatorial description of the geometric notion of a simplicial complex. Given a set A of elements, an abstract simplicial complex is a finite collection S of non-empty subsets of A such that σ ∈ S (with σ ⊆ A) and σ′ ⊆ σ imply σ′ ∈ S. The elements of S, namely subsets of A, are its simplices. For each simplex σ, we have dim(σ) = |σ| − 1, namely the dimension of σ is its cardinality minus 1. The dimension of the abstract simplicial complex is the maximum dimension of any of its simplices. Given a geometric simplicial complex K, we can construct an abstract simplicial complex S by throwing away all simplices and retaining only their sets of vertices. We call S a vertex scheme of K, and, symmetrically, K a geometric realization of S. Constructing a geometric realization of an abstract simplicial complex is trivial when the dimension of the ambient space is high enough. In fact, any abstract simplicial complex of dimension d has a geometric realization in Euclidean space R^{2d+1}.

Complexes. Typically, our input data is a point cloud. We want to use homology to describe data, but homology operates on simplicial complexes. In this part, we will see two methods by which one can construct a simplicial complex on a finite set of points, and see how these two methods are related.

Čech Complexes. In algebraic topology and topological data analysis, the Čech complex is an abstract simplicial complex constructed from a point cloud in any metric space, meant to capture topological information about the point cloud or the distribution it is drawn from. Given a collection of points {x_α} in Euclidean space E^n, the Čech complex C_ε is the abstract simplicial complex whose k-simplices are determined by unordered (k+1)-tuples of points {x_{α_0}, ..., x_{α_k}} whose closed ε/2-ball neighborhoods have a point of common intersection. The nerve theorem states that C_ε is homotopy equivalent to the union of the closed balls of radius ε/2 about the point set {x_α}.

Vietoris-Rips Complexes. In topology, the Vietoris-Rips complex, also called the Vietoris complex or Rips complex, is an abstract simplicial complex that can be defined from any metric space and distance ε. Given a collection of points {x_α} in Euclidean space E^n, the Vietoris-Rips complex R_ε is the abstract simplicial complex whose k-simplices correspond to unordered (k+1)-tuples of points {x_{α_0}, ..., x_{α_k}} which are pairwise within distance ε. The Vietoris-Rips complex characterizes the topology of a point set. This complex is popular in topological data analysis as its construction extends easily to higher dimensions.
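The Vietoris-Rips condition is simple enough to state directly in code. The following brute-force Python sketch enumerates the simplices of R_ε up to a given dimension by checking the pairwise distance condition; it is only meant to illustrate the definition, since practical tools such as Ripser use far more efficient constructions.

import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform

def rips_simplices(points, eps, max_dim=2):
    """Enumerate the simplices of the Vietoris-Rips complex R_eps up to max_dim:
    a (k+1)-tuple of points spans a k-simplex iff all pairwise distances are <= eps."""
    dist = squareform(pdist(points))
    n = len(points)
    simplices = [(i,) for i in range(n)]
    for k in range(1, max_dim + 1):
        for sigma in combinations(range(n), k + 1):
            if all(dist[i, j] <= eps for i, j in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.8], [3.0, 0.0]])
print(rips_simplices(pts, eps=1.2))   # the first three points span a 2-simplex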

2.2 Homology

Chain. Let K be a simplicial complex. Then for any dimension p, we define a p-chain c in K to be a formal sum of p-simplices in K with some coefficients, namely $c = \sum_{i=1}^{k} \alpha_i \sigma_i$, where the σ_i are p-simplices and the α_i are the coefficients. The coefficients α_i can be integers, real numbers, rational numbers, elements of a field, or elements of a ring. Two p-chains can be added together to get another p-chain. So, for example, for two 1-chains (sums of edges) with integer coefficients and integer addition, we have

$$(2e_1 + 3e_2 + 6e_4) + (e_2 + 4e_3) = 2e_1 + 4e_2 + 4e_3 + 6e_4$$

In computational topology, we will be mostly interested in Z_2 coefficients and modulo-2 addition. Namely, the coefficients can only be 0 or 1, and for addition we have 0 + 0 = 0, 0 + 1 = 1, 1 + 1 = 0. Under this setting, for example, we have

$$(e_1 + e_2 + e_4) + (e_2 + e_3 + e_4) = e_1 + e_3$$

The set of p-chains together with the addition operator forms the group of p-chains, denoted (C_p, +), or simply C_p if the operation is understood. This group is an abelian group, with the identity being the chain 0 and the inverse of a chain c being c itself.
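With Z_2 coefficients, a p-chain is determined by the set of simplices that carry coefficient 1, and modulo-2 addition is just the symmetric difference of these sets. A small Python sketch reproducing the example above:

# With Z2 coefficients a p-chain is the set of simplices that have coefficient 1,
# and modulo-2 addition is the symmetric difference of these sets.
def add_chains_mod2(c1, c2):
    return c1 ^ c2   # symmetric difference of two sets of simplices

c1 = {'e1', 'e2', 'e4'}
c2 = {'e2', 'e3', 'e4'}
print(sorted(add_chains_mod2(c1, c2)))   # ['e1', 'e3'], matching the example above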

Boundary map and chain complex. To relate these p-chain groups of different dimensions, we define a boundary operator ∂_p that, for a given p-simplex, returns the (p − 1)-chain of its boundary (p − 1)-simplices, namely its (p − 1)-dimensional faces. For example, suppose σ = {v_0, v_1, ..., v_p} is a p-simplex; then its boundary is

$$\partial_p \sigma = \sum_{i=0}^{p} \{v_0, ..., \hat{v}_i, ..., v_p\}$$

where $\{v_0, ..., \hat{v}_i, ..., v_p\}$ denotes the simplex spanned by all vertices except v_i. We can extend the boundary operator to p-chains by defining the boundary of a p-chain to be the sum of the boundaries of its simplices. Therefore, we have ∂_p : C_p → C_{p−1}. It is easy to check that ∂_p(c_1 + c_2) = ∂_p c_1 + ∂_p c_2; namely, ∂_p : C_p → C_{p−1} is a homomorphism. It is also straightforward to check that ∂_p ∂_{p+1} = 0.
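Over Z_2, the boundary operator and the identity ∂_p ∂_{p+1} = 0 can be illustrated with a few lines of Python, representing a simplex as a tuple of vertices and a chain as a set of simplices:

from itertools import combinations

def boundary_mod2(simplex):
    """Boundary of a p-simplex (a tuple of vertices) over Z2: the set of its (p-1)-faces."""
    return {tuple(face) for face in combinations(simplex, len(simplex) - 1)}

def boundary_of_chain_mod2(chain):
    """Extend the boundary operator to a chain (a set of simplices) by mod-2 addition."""
    result = set()
    for simplex in chain:
        result ^= boundary_mod2(simplex)   # symmetric difference = Z2 sum
    return result

tri = (0, 1, 2)                                                # a 2-simplex
print(boundary_of_chain_mod2({tri}))                           # its three edges
print(boundary_of_chain_mod2(boundary_of_chain_mod2({tri})))   # empty set: boundary of boundary is 0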

In mathematics, a chain complex is an algebraic structure that consists of a sequence of abelian groups and a sequence of homomorphisms between consecutive groups such that the image of each homomorphism is included in the kernel of the next. Associated to a chain complex is its homology, which describes how the images sit inside the kernels. In our case, the chain complex can be written out as:

$$\cdots \xrightarrow{\partial_{p+1}} C_p \xrightarrow{\partial_p} C_{p-1} \xrightarrow{\partial_{p-1}} C_{p-2} \xrightarrow{\partial_{p-2}} \cdots \xrightarrow{\partial_1} C_0 \rightarrow 0$$

Usually the chain complex is denoted by (C_∗, ∂_∗).

Homology groups. Homology groups are algebraic tools to quantify topological features in a space. They do not capture all topological aspects of a space, in the sense that two spaces with the same homology groups may not be topologically equivalent. However, two spaces that are topologically equivalent must have isomorphic homology groups. It turns out that homology groups are computationally tractable in many cases, thus making them attractive in topological data analysis. The homology groups classify the cycles in a cycle group by putting into the same class those cycles that differ by a boundary. From a group-theoretic point of view, this is done by taking the quotient of the cycle group by the boundary group, which is allowed since the boundary group is a subgroup of the cycle group.

We call the kernel ker(∂_p) the group of p-cycles of C_∗, denoted Z_p, and the image im(∂_{p+1}) the group of p-boundaries of C_∗, denoted B_p. Given a chain complex (C_∗, ∂_∗), its homology is given by

$$H_p(C_*, \partial_*) = \frac{\ker(\partial_p)}{\mathrm{im}(\partial_{p+1})} = \frac{Z_p}{B_p}, \quad p \geq 0$$

2.3 Persistent Homology

Persistent homology [14] is one of the most prominent tools in the field of topological data analysis. Specifically, persistent homology is an algebraic method for measuring topological features of shapes and functions. Persistence was introduced by Edelsbrunner, Letscher, and Zomorodian [5] and refined by Carlsson and Zomorodian [15]. Given a parameterized family of simplicial complexes, those topological features which persist over a significant range of scales are regarded as signal, while the short-lived ones are treated as noise.

Filtration. In order to differentiate the topological features into signal and noise, we need to construct a family of 'growing' simplicial complexes. This is exactly what the so-called filtration does. For a given simplicial complex K, a filtration of K is an increasing sequence of subcomplexes of K:

$$\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_n = K$$

There is a natural sequence of inclusion maps $\iota^i : K_i \to K_{i+1}$ for i = 0, 1, ..., n − 1. In fact, for any i < j we have $\iota^{i,j} : K_i \to K_j$, which is simply $\iota^{i,j} = \iota^{j-1} \circ \cdots \circ \iota^{i}$.

Instead of the sequence of complexes itself, we are more interested in the evolution of the topological features, expressed by the corresponding sequence of homology groups. Each inclusion map $\iota^{i,j}$ induces a homomorphism $f_p^{i,j} : H_p(K_i) \to H_p(K_j)$ for each dimension p. Therefore, for the filtration, we have a corresponding sequence of homology groups connected by homomorphisms:

$$0 = H_p(K_0) \to H_p(K_1) \to \cdots \to H_p(K_n) = H_p(K)$$

For example, given a fixed point cloud, we can obtain a sequence of Rips complexes $(R_{\epsilon_i})_{i=1}^{N}$ for a sequence of increasing radii $(\epsilon_i)_{i=1}^{N}$ of balls centered at each point. Instead of the homology of each individual complex $R_{\epsilon_i}$, we examine the evolution of the homology classes. Specifically, we are interested in the birth time and death time of the topological features.

The p-th persistent homology groups are the images of the homomorphisms induced by inclusion, $H_p^{i,j} = \mathrm{im}\, f_p^{i,j}$. The p-th persistent Betti numbers are the ranks of these groups, namely $\beta_p^{i,j} = \mathrm{rank}(H_p^{i,j})$. Intuitively, $H_p^{i,j}$ is the set of homology classes that exist in $H_p(K_i)$ and survive until $H_p(K_j)$, and the persistent Betti number counts the number of independent homology classes that survive.

Persistence and persistence diagram. Let us consider the following quantity:

$$\mu_p^{i,j} := (\beta_p^{i,j-1} - \beta_p^{i,j}) - (\beta_p^{i-1,j-1} - \beta_p^{i-1,j})$$

The quantity $\mu_p^{i,j}$ is called the pairing number. It records the number of p-dimensional homology classes born at $K_i$ and dying entering $K_j$. The p-th persistence diagram Dgm is then a planar point set with multiplicities, where a point (x, y) belongs to Dgm if $\mu_p^{x,y} \neq 0$, and the multiplicity of (x, y) is $\mu_p^{x,y}$. Therefore, each point q ∈ Dgm represents the lifespan of a particular topological feature (connected component, loop, void, etc.), with its birth and death times as coordinates.

Distance between persistence diagrams. We now define the p-th diagram distance between persistence diagrams. Let p ∈ N and let Dgm_1, Dgm_2 be two persistence diagrams. Let Γ : Dgm_1 ⊇ A → B ⊆ Dgm_2 be a partial bijection between Dgm_1 and Dgm_2. Then, for any point x ∈ A, the p-cost of x is defined as $c_p(x) = \|x - \Gamma(x)\|_\infty^p$, and for any point y ∈ (Dgm_1 ⊔ Dgm_2) \ (A ⊔ B), the p-cost of y is defined as $c'_p(y) = \|y - \pi_\Delta(y)\|_\infty^p$, where π_Δ is the projection onto the diagonal Δ = {(x, x) : x ∈ R}. The cost of Γ is defined as

$$c_p(\Gamma) = \Big( \sum_x c_p(x) + \sum_y c'_p(y) \Big)^{1/p}.$$

We then define the p-th diagram distance $d_p$ as the cost of the best partial bijection:

$$d_p(\mathrm{Dgm}_1, \mathrm{Dgm}_2) = \inf_{\Gamma} c_p(\Gamma).$$

The p-th diagram distance is also termed the p-th Wasserstein distance, denoted $w_p$. In the particular case p = +∞, the cost of Γ is defined as $c_\infty(\Gamma) = \max\{\max_x c_1(x), \max_y c'_1(y)\}$. The corresponding distance $d_\infty$ is often called the bottleneck distance. One can show that $d_p \to d_\infty$ as p → +∞.
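For completeness, here is a small Python sketch that computes these distances for two tiny diagrams. It assumes the persim package, whose wasserstein function may use slightly different conventions (choice of p and of the ground metric) than the definition given above; in our experiments we instead use the R package TDA.

import numpy as np
from persim import bottleneck, wasserstein

# Two tiny persistence diagrams, given as (birth, death) pairs.
dgm1 = np.array([[0.0, 1.0], [0.2, 0.4]])
dgm2 = np.array([[0.0, 1.1]])

print(bottleneck(dgm1, dgm2))    # the bottleneck distance d_infinity
print(wasserstein(dgm1, dgm2))   # a Wasserstein-type diagram distance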

2.4 Persistence Module

The persistence module is a critical algebraic concept arising in topological data analysis. In the TDA workflow, we start with some data, construct a growing family of simplicial complexes, and obtain a persistence module by applying to this filtration a functor from topological spaces to vector spaces. For example, the functor can be k-dimensional singular homology with coefficients in some fixed field.

A persistence module M (up to re-indexing) of length n is defined to be an indexed family of vector spaces

$$(V_i, \ 1 \leq i \leq n)$$

and a doubly-indexed family of linear maps

$$f_t^s : V_s \rightarrow V_t, \quad s \leq t,$$

which satisfy the composition law

$$f_t^s \circ f_s^r = f_t^r$$

whenever r ≤ s ≤ t, and where $f_t^t$ is the identity map on $V_t$. In applications, the finite indexing set usually looks like an increasing sequence of non-negative real numbers $r_1 < r_2 < \cdots < r_n$. The corresponding persistence modules are called R-indexed persistence modules, which will be our setting in the following.

Interleaving Distance and Stability of Persistence Modules. A key notion arising from the study of persistence modules is the concept of interleaving, which can be employed to measure the distance between persistence modules. In 2009, Chazal et al. [16] proposed ε-interleavings between persistence modules. The purpose of introducing ε-interleavings is to provide a pseudometric $d_I$, the interleaving distance, on persistence modules.

Given ε ≥ 0, an ε-interleaving between two persistence modules $\mathbb{V} = \{V^\delta \xrightarrow{\nu^{\delta,\delta'}} V^{\delta'}\}_{\delta \leq \delta'}$ and $\mathbb{U} = \{U^\delta \xrightarrow{\mu^{\delta,\delta'}} U^{\delta'}\}_{\delta \leq \delta'}$ is given by two families of linear maps

$$\{\varphi_{\delta,\delta+\varepsilon} : V^\delta \rightarrow U^{\delta+\varepsilon}\}_{\delta \in \mathbb{R}}, \qquad \{\psi_{\delta,\delta+\varepsilon} : U^\delta \rightarrow V^{\delta+\varepsilon}\}_{\delta \in \mathbb{R}}$$

that commute with the internal maps of $\mathbb{V}$ and $\mathbb{U}$ and satisfy $\psi_{\delta+\varepsilon,\delta+2\varepsilon} \circ \varphi_{\delta,\delta+\varepsilon} = \nu^{\delta,\delta+2\varepsilon}$ and $\varphi_{\delta+\varepsilon,\delta+2\varepsilon} \circ \psi_{\delta,\delta+\varepsilon} = \mu^{\delta,\delta+2\varepsilon}$ for all δ ∈ R. In that case, $\mathbb{V}$ and $\mathbb{U}$ are also called ε-interleaved. Then, the interleaving distance between persistence modules $\mathbb{V}$ and $\mathbb{U}$ is defined to be

$$d_I(\mathbb{U}, \mathbb{V}) := \inf\{\varepsilon \geq 0 : \mathbb{U} \text{ and } \mathbb{V} \text{ are } \varepsilon\text{-interleaved}\}$$

Then we have the Algebraic Stability Theorem: let $\mathbb{U}$, $\mathbb{V}$ be two persistence modules; then

$$d_B(\mathrm{Dgm}(\mathbb{U}), \mathrm{Dgm}(\mathbb{V})) \leq d_I(\mathbb{U}, \mathbb{V})$$

Besides, it is worthwhile to point out that this stability result is not restricted to the case of continuous functions defined over triangulable spaces. It is a more general form of stability results for persistence modules compared with the classical bottleneck stability [17].

Chapter 3: Road Network Data Analysis

Recurrence and periodicity are typical of dynamical systems observed in nature. We apply computational topology tools to analyze the recurrent behavior of the vehicle flow system. In this chapter, we first introduce the time-delay embedding. After applying the time-delay embedding to the time-series data, we obtain point clouds in a high-dimensional space. Since our real data are highly corrupted with noise, we introduce and apply a denoising algorithm. After that, we apply the typical topological data analysis workflow to the denoised data and obtain the persistence diagrams corresponding to the time-series data of different weeks. Then we compute the distances between persistence diagrams and perform hierarchical clustering to unravel the patterns among the time-series data of different weeks.

3.1 Time-delay Embedding

The time-delay embedding, or sliding window embedding, has been used mostly in the field of dynamical systems to study the nature of their attractors. Takens' theorem [13] provides conditions under which a smooth attractor can be reconstructed from observations made with a generic function. This methodology has been applied to study chaotic dynamical systems arising in fields such as electroencephalography and electrocardiography [18].

Then, in [19], V. de Silva et al. proposed a new framework, combining the time-delay embedding with tools from computational topology, to study the periodic behavior of recurrent systems. After that, in [20], J. A. Perea et al. employed the time-delay embedding and tools from topological data analysis to study time series arising in the study of gene expression data.

Definition 3.1.1. Given a time series f : t ↦ f(t) ∈ R, a delay τ, and a dimension parameter M, the time-delay embedding is the lift to the time series φ : t ↦ φ(t) ∈ R^{M+1} defined by

φ(t) = (f(t), f(t + τ), ..., f(t + Mτ))

Note that the window size Mτ is critical for the embedding. We will elaborate more on this in the next section.

Using the time-delay embedding, we lift the time-series data into a point cloud in (M + 1)-dimensional space. The points of the lifted time series then cluster around some submanifold or other subspace of R^{M+1}. Besides, since real data are always corrupted, we must choose the parameters M and τ carefully for a 'good' embedding.

For periodic time-series data, the lifted point cloud traces one or more loops, namely topological circles, in the (M + 1)-dimensional space [19]. For example, the simplest periodic system is a sinusoid: if we apply the time-delay embedding t ↦ (sin(t), sin(t + τ)), the lifted points trace a loop in the plane. Then, we can use tools from computational topology to detect these topological circles.
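A minimal Python sketch of the time-delay embedding, with the delay τ measured in samples, is given below; it mirrors the MATLAB loop listed in Appendix B.

import numpy as np

def time_delay_embedding(signal, M, tau):
    """Lift a 1-D signal f(t) to points phi(t) = (f(t), f(t+tau), ..., f(t+M*tau)).
    `tau` is measured in samples; the output is an array of (M+1)-dimensional points."""
    signal = np.asarray(signal, dtype=float)
    n_points = len(signal) - M * tau
    return np.array([signal[i : i + M * tau + 1 : tau] for i in range(n_points)])

# A quick check on a short ramp signal with M = 2, tau = 3.
print(time_delay_embedding(np.arange(10), M=2, tau=3))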

3.2 Data Visualization

As mentioned previously, we downloaded the dataset from Caltrans PeMS. Each dataset records the 5-minute vehicle flow detected by a detector over one week (from Sunday to Saturday). Some examples are shown in Figure 3.1.

Figure 3.1: Time series data visualization

The first plot is the time-series data for the week from 2017/10/01 to 2017/10/07, while the second plot is the time-series data for the week from 2017/11/19 to 2017/11/25. They are both for the same detector (Detector ID: 409529). The second week is the Thanksgiving week of 2017, while the first week is just a usual week. As the figure shows, in the first plot there are five higher peaks in the center and one lower peak at each end. This is reasonable: the five higher peaks correspond to the busy weekdays, while the two lower peaks indicate quiet, stay-at-home weekends. For the second plot, things are a little different. The first lower peak indicates the usual Sunday. After that there are three higher peaks corresponding to usual weekdays, followed by a dramatically decreased peak accounting for the first day of the Thanksgiving holiday. Finally, over the last two peaks we can observe a gradual increase. Perhaps, after a one-day rest, more people started driving out to enjoy the holiday.

This is basically what the time-series data look like. There are two typical patterns, namely the usual week pattern and the holiday week pattern.

In this project, I initially collected the time-series data of 14 different weeks from 2017/10/01 to 2018/01/06. The week numbers and the corresponding dates are shown below.

Week 1: 2017/10/01 ∼ 2017/10/07
Week 2: 2017/10/08 ∼ 2017/10/14
Week 3: 2017/10/15 ∼ 2017/10/21
Week 4: 2017/10/22 ∼ 2017/10/28
Week 5: 2017/10/29 ∼ 2017/11/04
Week 6: 2017/11/05 ∼ 2017/11/11
Week 7: 2017/11/12 ∼ 2017/11/18
Week 8: 2017/11/19 ∼ 2017/11/25 (Thanksgiving week)
Week 9: 2017/11/26 ∼ 2017/12/02
Week 10: 2017/12/03 ∼ 2017/12/09
Week 11: 2017/12/10 ∼ 2017/12/16
Week 12: 2017/12/17 ∼ 2017/12/23
Week 13: 2017/12/24 ∼ 2017/12/30 (Christmas week)
Week 14: 2017/12/31 ∼ 2018/01/06 (New Year week)

Then, for each time-series dataset, namely each week's time-series data, I applied the time-delay embedding method to project the time-series data to a corresponding point cloud in a high-dimensional space. First, let us look at an example to gain some insight and intuition about what the point cloud looks like when we project periodic time-series data into a high-dimensional space. We sample one thousand points on the function f(x) = sin(x/2) + 2, with x-coordinates evenly distributed over the interval [0, 16π], and then apply the time-delay embedding to this dataset. According to the time-delay embedding method, the optimal window size Mτ should be close to the intrinsic period of the system. In this test, we choose M = 5 and τ = 50 × 16π/999, and therefore the window size is Mτ = 4000π/999, which is very close to the intrinsic period of the function, namely 4π. Hence, in this way, we projected the time-series data sampled from a periodic function into a point cloud in 6-dimensional space. After doing PCA for dimension reduction, we get Figure 3.2. It is basically a circle in the 2D plane.
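The synthetic test just described can be reproduced with a few lines of Python; the sketch below assumes scikit-learn for the PCA step and NumPy for the sampling and embedding.

import numpy as np
from sklearn.decomposition import PCA

# One thousand samples of f(x) = sin(x/2) + 2 on [0, 16*pi].
x = np.linspace(0, 16 * np.pi, 1000)
f = np.sin(x / 2) + 2

# Time-delay embedding with M = 5 and a delay of 50 samples,
# so the window covers roughly one period (about 4*pi) of the signal.
M, tau = 5, 50
cloud = np.array([f[i : i + M * tau + 1 : tau] for i in range(len(f) - M * tau)])

# Project the 6-dimensional point cloud to the plane with PCA; the projected
# points should trace out (approximately) a circle, as in Figure 3.2.
cloud_2d = PCA(n_components=2).fit_transform(cloud)
print(cloud_2d.shape)   # (750, 2)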

Figure 3.2: project the point cloud to 2D plane

Now let us look at our real dataset. The period of our real data is clearly one day, or 1440 minutes. Therefore, in the time-delay embedding, we set M = 5 and τ = 250 minutes, and the window size is Mτ = 1250 minutes, which is close to the real period of 1440 minutes. Hence, in this way, we projected the real time-series data into a point cloud in 6-dimensional space. We again use the time-series data of Week 1 and Week 8 of Detector 409529, shown above, as an example. After doing PCA for dimension reduction, we get Figure 3.3.

20 (a) Week 1 (b) Week 8

Figure 3.3: project the high-dimensional point cloud to 2D plane

As the figure indicates, we can intuitively recognize one circle in the plot of Week 1 and roughly two circles in the plot of Week 8. We also notice that our data are corrupted by noise. After applying the standard topological data analysis workflow to the noisy data of all fourteen weeks, we get the persistence diagrams. Then we compute the bottleneck distances and perform hierarchical clustering; the dendrogram is shown in Figure 3.4.

Figure 3.4: Using Bottleneck Distance

It is quite unexpected that Week 4 is dissimilar to the rest of the weeks. To look into this phenomenon, we plot the time-series data of Week 4 (Figure 3.5).

Figure 3.5: Time series data of Week 4

We also plot the projection of the point cloud data of Week 4 (Figure 3.6).

Figure 3.6: project the point cloud of week 4 to 2D plane

As the plot shows, there are some abnormalities in the time-series data: in some time intervals, there is a dramatic decrease in the vehicle flow. In the point cloud, this dramatic decrease generates some outliers after we apply the time-delay embedding. One possible explanation is that there was a traffic incident near the detector around that period. Searching Google News, we found that there was indeed a major injury collision nearby; here is the news link: https://patch.com/california/paloalto/crash-blocks-oregon-expressway-palo-alto-police. According to the news report, the police closed the Oregon Expressway for 45 minutes and reopened it at about 9:50 a.m. The Oregon Expressway is not too far away from detector 409529 (coordinates: 37.366666, -121.921519). The abnormality in the time series lasts for about 45 minutes, corresponding to the closure time of the expressway.

This abnormality, caused by the possible car accident, results in outliers in the point cloud data after the time-delay embedding. Persistent homology results are sensitive to noise and outliers in the data. Therefore, before applying the standard topological data analysis workflow to the point cloud data, we denoise the point cloud data.

24 3.3 Denoising

Data from real life are inevitably corrupted. Noisy data can adversely affect the results of any data analysis if not handled properly. In this project, we use the denoising algorithms proposed by M. Buchet et al. in [21].

In many data analysis applications, we assume that the given point set represents an underlying ground truth K in a metric space. However, it often happens that some outliers, far away from K, corrupt the data set. Such outliers are also termed ambient noise or background noise. Therefore, we apply a denoising algorithm to eliminate the ambient noise so that the curated data lie within a bounded Hausdorff distance of K [21]. In the paper, M. Buchet et al. proposed two denoising algorithms: the first is a simple denoising algorithm that requires only a single parameter but provides a theoretical guarantee on the quality of the output on general input points; the second algorithm avoids even this parameter, paying for it with a slight (and not unrealistic) strengthening of the sampling condition on the input points [21].

Specifically, the first simple yet effective denoising algorithm, termed the Declutter algorithm, takes a set of points P and a parameter k as input and outputs a set of points Q ⊆ P. Suppose P is a noisy sample of a hidden compact space K; then the algorithm guarantees that Q lies within a small tubular neighborhood of K and that outliers are eliminated [21]. The second algorithm, termed the ParfreeDeclutter algorithm, is parameter free. The Declutter algorithm is simple and effective, but it also "sparsifies" the input points. The second algorithm fixes this problem by combining the first algorithm with a resampling process that enriches the output set Q, bringing some of the discarded points back and therefore obtaining a denser sampling of the ground truth [21].
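To convey the idea of removing ambient noise, the following Python sketch implements a naive k-nearest-neighbor distance filter. We emphasize that this is not the Declutter or ParfreeDeclutter algorithm of [21]; it is only a simplified illustration, and the threshold factor is an arbitrary assumption.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def knn_distance_filter(points, k=15, factor=2.0):
    """Naive outlier filter (illustration only, not the algorithm of [21]):
    drop points whose average distance to their k nearest neighbors is much
    larger than is typical for the whole cloud."""
    dist = squareform(pdist(points))
    # Average distance to the k nearest neighbors (excluding the point itself).
    knn_avg = np.sort(dist, axis=1)[:, 1 : k + 1].mean(axis=1)
    keep = knn_avg <= factor * np.median(knn_avg)
    return points[keep]

# Example: a circle with a few far-away outliers added.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 300)
circle = np.c_[np.cos(theta), np.sin(theta)]
outliers = rng.uniform(-10, 10, size=(10, 2))
cleaned = knn_distance_filter(np.vstack([circle, outliers]))
print(len(cleaned))   # most of the 10 outliers should be removed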

In this project, we use the second algorithm, namely the ParfreeDeclutter algorithm, to denoise our point cloud. We use the point cloud data of Week 1 and Week 4 of Detector 409529 as examples, and the denoised point cloud data are shown in Figure 3.7.

(a) Week 1 (b) Week 4

Figure 3.7: project the denoised point cloud to 2D plane

Besides, let’s take a look at the persistence diagrams for week 4 before and after denoising.

(a) Persistence Diagram Before Denoising (b) Persistence Diagram After Denoising

Figure 3.8: Persistence Diagrams for week 4

As the persistence diagrams in Figure 3.8 show, some outliers are eliminated and therefore the topological features become more persistent.

26 3.4 Experiments

Based on the concept of power distance proposed by M. Buchet et al. in [22], we adopt a more general notion of distance on the point cloud instead of the Euclidean distance. Specifically, we define the power distance between the p-th and q-th points in the point cloud to be

$$f(p, q) = \sqrt{d^2(p, q) + w_p + w_q}$$

where d(p, q) is the Euclidean distance between the p-th and q-th points, and w_p and w_q are the values of w at the points p and q. In our project, we set w to be the average distance of the nearest 15 points; namely, w_i stands for the average distance of the nearest 15 points to the i-th point.
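A Python sketch of this power distance, following the formula above (with w_i the average distance to the k = 15 nearest neighbors), is given below; the MATLAB version actually used in our experiments is listed in Appendix B.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def power_distance_matrix(points, k=15):
    """Power distance f(p, q) = sqrt(d(p, q)^2 + w_p + w_q), where w_i is the
    average distance from point i to its k nearest neighbors (as defined above)."""
    d = squareform(pdist(points))                       # Euclidean distance matrix
    w = np.sort(d, axis=1)[:, 1 : k + 1].mean(axis=1)   # exclude the point itself
    return np.sqrt(d ** 2 + w[:, None] + w[None, :])

# Usage: feed the resulting matrix to a persistence tool that accepts distance matrices.
cloud = np.random.default_rng(2).normal(size=(100, 6))
D = power_distance_matrix(cloud)
print(D.shape)   # (100, 100)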

Then we apply the topological data analysis workflow to the point clouds with the power distance. In this part, we used Ripser [23] for barcode visualization, SimBa [24] for computing the persistence diagrams, and the R package TDA [25] for computing the bottleneck distance and Wasserstein distance between diagrams. Finally, based on the bottleneck distance and Wasserstein distance, we did hierarchical clustering.
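As an illustration of the last step, the following Python sketch builds dendrograms from a precomputed pairwise distance matrix using SciPy; the matrix D here is a random symmetric placeholder standing in for the 14 × 14 matrix of bottleneck or Wasserstein distances between the weekly persistence diagrams.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

# D[i, j] stands for the distance between the diagrams of week i+1 and week j+1.
rng = np.random.default_rng(3)
A = rng.random((14, 14))
D = (A + A.T) / 2
np.fill_diagonal(D, 0.0)

labels = [f'Week {i}' for i in range(1, 15)]
for method in ['complete', 'single', 'average']:
    Z = linkage(squareform(D), method=method)   # condensed distance vector as input
    plt.figure()
    dendrogram(Z, labels=labels)
    plt.title(f'{method} linkage')
plt.show()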

3.5 Results and Comparison

First, let us again use some results for Week 1 and Week 8 of Detector 409529 as examples. Since we are only interested in the 1-cycles traced by the point clouds, only the Vietoris-Rips persistence diagrams in dimension 1 are shown. We show the 1D persistence diagrams in Figure 3.9.

27 (a) Week 1 (b) Week 8

Figure 3.9: Persistence Diagram

We also compute the persistence diagram for the point cloud of each week's time-series data, as well as the bottleneck distance and Wasserstein distance between each pair of persistence diagrams. Finally, we perform hierarchical clustering based on the bottleneck distance and the Wasserstein distance, respectively. The resulting dendrograms are shown in Figure 3.10.

28 (a) Complete Linkage

(b) Single Linkage

(c) Average Linkage

Figure 3.10: Hierarchical Clustering of Weekly Data of Detector: 409529

As the hierarchical clustering dendrograms indicate, Week 8, Week 13, and Week 14 seem to be significantly dissimilar to the rest, which is consistent with the fact that Week 8 is the Thanksgiving week, Week 13 is the Christmas week, and Week 14 is the New Year week.

Besides, the hierarchical clustering using the bottleneck distance indicates that Week 11 is also dissimilar. To understand this result, we plot the time-series data of Week 11 for Detector 409529, shown in Figure 3.11.

Figure 3.11: Time series data of Week 11

As we can observe, there is an obvious irregularity in the fifth period of the time-series data. This should account for the dissimilarity of Week 11. A possible explanation for the irregularity is that there was a traffic accident during that period.

All of the above are results for the specific detector 409529. Now let us do the same experiment for another detector, 409528. The time-series data of Week 1 and Week 8 of detector 409528 are shown in Figure 3.12.

30 Figure 3.12: Time series data of Week 1 and Week 8

The final results of the hierarchical clustering are shown in Figure 3.13.

31 (a) Complete Linkage

(b) Single Linkage

(c) Average Linkage

Figure 3.13: Hierarchical Clustering of Weekly Data of Detector: 409528

The results are almost the same as those for detector 409529: Week 8, Week 11, Week 13, and Week 14 are significantly dissimilar to the others. Now we plot the time-series data of Week 11 of detector 409528 in Figure 3.14.

Figure 3.14: Time series data of Week 11

As we can see, there is a period during which the vehicle flow recorded by the detector is zero. This contributes to the dissimilarity of Week 11. During that period, there might have been a traffic accident at a location near the two detectors, and perhaps closer to detector 409528.

3.6 Extension and future work

In addition to the analysis of weekly data, we also apply the workflow to analyze monthly data. Specifically, we collect the monthly data for 2017. Each dataset records the hourly vehicle flow of a calendar month. Finally, we get the hierarchical clustering dendrograms shown in Figure 3.15.

33 (a) Complete Linkage

(b) Single Linkage

(c) Average Linkage

Figure 3.15: Hierarchical Clustering of Monthly Data

As the dendrograms show, Month 3 and Month 12, namely March and December, are significantly dissimilar to the rest. The explanation might be that spring break and the Christmas week notably disturbed the normal pattern of the vehicle flow and therefore made March and December dissimilar.

Besides, one possible direction for future work is to analyze the data of the same week or month for different detectors and see if there is any cluster structure among those detectors. This might help extract geographical information about the detectors and the transportation infrastructure.

Appendix A: More results and plots

First, we present some additional plots of the results.

Figure A.1: project the denoised point cloud of week 8 to 2D plane

Figure A.2: Persistence Diagram for Week 11

According to the time-delay embedding method, the optimal window size Mτ should be close to the intrinsic period of the system. The period of our real data is clearly one day, or 1440 minutes. Therefore, previously in the time-delay embedding, we set M = 5 and τ = 250 minutes, and the window size was Mτ = 1250 minutes, which is very close to the real period of 1440 minutes. But what if we choose the parameters such that the window size is far from the intrinsic period? In this part, we set M = 4 and τ = 50 and run the pipeline. The following are some results and plots.

37 (a) Week 1 (b) Week 4

Figure A.3: project the generated point cloud to 2D plane (M = 4 and τ = 50)

As the plots show, the point clouds now seem to be squashed onto a line. We can also visualize the persistence barcodes of dimension 1 using the online Ripser tool [23].

(a) Week 1 (b) Week 4

Figure A.4: Visualization of major barcodes (M = 4 and τ = 50)

As the figures show, under this time-delay embedding setting, the topological features are not significantly persistent. As J. A. Perea et al. point out in [18], "the maximum persistence, as a measure of the roundness of the point cloud, occurs when the window size corresponds to the natural frequency of the signal".

Besides, for the weekly data of Detector 409529, if we set k = 25 in the power distance step, we get the hierarchical clustering dendrograms shown in Figure A.5. As the plots indicate, setting k = 25 does not affect the final dendrograms much.

39 (a) Complete Linkage

(b) Single Linkage

(c) Average Linkage

Figure A.5: Hierarchical Clustering of Weekly Data for Detector 409529 with k = 25

Appendix B: Main Code

Python code for processing the downloaded text data file:

# Keep the timestamp, station ID, and total flow of one detector from the raw
# PeMS "Station 5-Minute" text file; missing flow values are written as -1.
f = open('filePath', 'r')    # path to the raw PeMS text file
g = open('filePath', 'a')    # path to the processed output file
for line in f:
    temp = line.split(',')
    if temp[1] == '409529':
        if str(temp[9]) == '':
            g.write(str(temp[0]) + ' ' + str(temp[1]) + ' ' + str(-1) + '\n')
        else:
            g.write(str(temp[0]) + ' ' + str(temp[1]) + ' ' + str(temp[9]) + '\n')
f.close()
g.close()

MATLAB code for the time-delay embedding:

% Read the processed flow values (one value per line after a 27-character prefix).
fid = fopen('processedDataPath', 'r');
tline = fgetl(fid);
allTDS2 = [str2num(tline(28:end))];
while ischar(tline)
    tline = fgetl(fid);
    if ischar(tline)
        allTDS2 = [allTDS2, str2num(tline(28:end))];
    end
end
fclose(fid);

% Sliding-window (time-delay) embedding: step = 50 samples (250 minutes) and
% pCD = 6 coordinates per point, i.e. M = 5.
pCloud = [];
step = 50;
pCD = 6;
for i = 1:length(allTDS2) - step*(pCD-1)
    pCloud = vertcat(pCloud, allTDS2(i:step:i+step*(pCD-1)));
end

MATLAB code for computing the power distance:

% Pairwise Euclidean distance matrix of the point cloud.
pCloud2 = pCloud;
[row, col] = size(pCloud2);
distMat = zeros(row);
for i = 1:row
    for j = i:row
        if i ~= j
            distMat(i, j) = norm(pCloud2(i, :) - pCloud2(j, :), 2);
            distMat(j, i) = distMat(i, j);
        end
    end
end

% w(i): average squared distance to the k = 15 nearest neighbors.
Wp2 = zeros(1, row);
k = 15;
for i = 1:row
    temp = sort(distMat(i, :));
    Wp2(i) = sum(temp(2:(k + 1)).^2) / k;
end

% Power distance: f(p, q) = sqrt(d(p, q)^2 + w_p + w_q).
for i = 1:row
    for j = 1:row
        distMat(i, j) = sqrt(distMat(i, j)^2 + Wp2(i) + Wp2(j));
    end
end

Bibliography

[1] https://en.wikipedia.org/wiki/Topological_data_analysis.

[2] H. Edelsbrunner and J. Harer. Computational Topology. An Introduction. American Mathematical Society, Providence, 2010.

[3] F. Chazal and B. Michel. An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists. arXiv preprint arXiv:1710.04019, 2017.

[4] G. Carlsson, A. Collins, L. Guibas, and A. Zomorodian. Persistence barcodes for shapes. Internat. J. Shape Modeling, 2005.

[5] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28:511–533, 2002.

[6] G. Carlsson. Topology and data. Bull. Amer. Math. Soc. (N.S.), 46(2):255–308, 2009.

[7] R. Ghrist. Barcodes: the persistent topology of data. Bull. Amer. Math. Soc. (N.S.), 45(1):61–75, 2008.

[8] A. Cerri, M. Ferri, and D. Giorgi. Retrieval of trademark images by means of size functions. Graphical Models, 68(5):451–471, 2006.

[9] T. Nakamura, Y. Hiraoka, A. Hirata, E. G. Escolar, and Y. Nishiura. Persistent homology and many-body atomic structure for medium-range order in the glass. Nanotechnology, 26(304001), 2015.

[10] V. de Silva and R. Ghrist. Coverage in sensor networks via persistent homology. Alg. Geom. Topology, 7:339–358, 2007.

[11] M. Nicolau, A. J. Levine, and G. Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proc. Nat. Acad. Sci., 108(17), 2011.

[12] http://pems.dot.ca.gov/PeMS_Intro_User_Guide_v5.pdf. An Introduction to the California Department of Transportation Performance Measurement System (PeMS).

[13] F. Takens. Detecting strange attractors in turbulence. In D. A. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, Lecture Notes in Mathematics, Springer-Verlag, 898:366–381, 1981.

[14] S. Y. Oudot. Persistence Theory: From Quiver Representations to Data Analysis (Mathematical Surveys and Monographs). American Mathematical Society, 2017.

[15] A. Zomorodian and G. Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.

[16] F. Chazal, D. Cohen-Steiner, M. Glisse, L. J. Guibas, and S. Y. Oudot. Proximity of persistence modules and their diagrams. Proceedings of the twenty-fifth annual symposium on Computational geometry, ACM, pages 237–246, 2009.

[17] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.

[18] J. A. Perea and J. Harer. Sliding windows and persistence: An application of topological methods to signal analysis. Foundations of Computational Mathematics, pages 1–40, 2013.

[19] V. de Silva, P. Skraba, and M. Vejdemo-Johansson. Topological Analysis of Recurrent Systems. Workshop on Algebraic Topology and Machine Learning, NIPS, 2012.

[20] J. A. Perea, A. Deckard, S. B. Haase, and J. Harer. Sw1pers: Sliding window and 1-persistence scoring; discovering periodicity in gene expression time series data. BMC bioinformatics, 16(1), 2015.

[21] M. Buchet, T. K. Dey, J. Wang, and Y. Wang. Declutter and resample: Towards parameter free denoising. ArXiv e-prints, 2015.

[22] M. Buchet, F. Chazal, S. Y. Oudot, and D. R. Sheehy. Efficient and robust topological data analysis on metric spaces. ArXiv, 2013.

[23] http://live.ripser.org.

[24] http://web.cse.ohio-state.edu/dey.8/SimBa/Simba.html.

[25] https://cran.r-project.org/web/packages/TDA/index.html.

44