Numerical Methods for Spectral Clustering: A Spectral Cluster Analysis of the European Air Traffic Network, Using Schur-Wielandt Deflation
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

JOHAN LARSSON
ISAK ÅGREN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES

Abstract

The aviation industry is important to the European economy and its development, so a study of the sensitivity of the European flight network is of interest. If clusters exist within the network, they could indicate possible vulnerabilities or bottlenecks, since a cluster would represent a group of airports that is poorly connected to other parts of the network. In this paper a cluster analysis using spectral clustering is performed on flight data from 34 different European countries. The report also looks at how to implement the spectral clustering algorithm for large data sets. After performing the spectral clustering, it appears that the European flight network is not clustered, and thus it does not appear to be sensitive.

Sammanfattning

The aviation industry is important for the European economy and its development, which makes a study of the sensitivity of the European flight network interesting. If there are clusters in the network, this may indicate possible vulnerabilities or bottlenecks, since a cluster would represent a group of airports that are poorly connected to other parts of the network. In this report a cluster analysis with spectral clustering is carried out on flight data from 34 different European countries. The report also looks at how spectral clustering is implemented for large data sets. After carrying out the spectral clustering, it seems that the European flight network is not clustered, and therefore it does not seem to be sensitive.

Acknowledgments

This thesis has been written under the supervision of Assoc. Prof. Elias Jarlebring and Assoc. Prof. Mattias Sandberg. We are thankful for their guidance and support throughout the entirety of this degree project.
Contents

1 Introduction
  1.1 Background
    1.1.1 Aviation industry
    1.1.2 Clustering
  1.2 Problem
2 Clustering
  2.1 Graphs
    2.1.1 Methods for Constructing a Graph
  2.2 K-means
  2.3 Unnormalized Laplacian
  2.4 Normalized Laplacian
  2.5 Algebraic connectivity
  2.6 Cuts
  2.7 Perturbation Viewpoint
  2.8 Spectral Clustering Algorithm
    2.8.1 Spectral Clustering (RatioCut)
    2.8.2 Spectral Clustering (NCut)
3 Eigenvectors
  3.1 Eigenvectors and Rayleigh-Ritz Quotients
  3.2 Single Eigenvalue Algorithms
    3.2.1 Power Iteration
    3.2.2 Inverse Iteration
  3.3 Deflation
    3.3.1 Wielandt Deflation
    3.3.2 Hotelling's Deflation
    3.3.3 Schur-Wielandt Deflation
  3.4 LU-Decomposition
    3.4.1 Inverse Power Iteration using LU-decomposition
  3.5 Eigenvector Algorithm for Spectral Clustering
4 Result
  4.1 Data
    4.1.1 Base Model and Similarity Matrix
  4.2 Unweighted Graph
  4.3 Passenger Weighted Graph
  4.4 2018
5 Analysis and Conclusions
A Notation
B Airports

1. Introduction

1.1 Background

1.1.1 Aviation industry

Air traffic and airports play an important role in our modern society, connecting cities all over the world. The European Commission concludes that "The European aviation sector is one of the best performing parts of the European economy, and is a world leading industry. 900 million air passengers travel each year to, from and within the European Union, making up one third of the world market." [2] The economic contribution of the aviation industry comes not only from flights travelling between airports, but also from jobs created at airports by shops, cafeterias, mechanics and transport to and from the airport [5]. The combination of these two factors, bringing people together and driving economic growth, is of great importance to regional development and will in itself generate regional development as well as high-tech industries [10].
But it will also lead to a more globalised Europe.

1.1.2 Clustering

Clustering is a method in which a set is divided into subsets based on some characteristics of its elements. It is often used in science to group objects, for example in machine learning, where the main goal is that objects within one cluster should be similar to each other but dissimilar to objects in other clusters [11].

The idea of clustering is not new; it dates back to the early 1900s, when it was first used in psychology. More modern, numerical approaches came later, and in 1963 Principles of Numerical Taxonomy proposed the use of clusters in biological classification [1]. Since then, clustering has been the subject of a great deal of research. There is no single method or algorithm for finding clusters in data; instead there are several, with different properties [11]. In this paper we look at spectral clustering, which was proposed in 1973 [4] and uses the Laplacian and its spectrum to perform clustering.

1.2 Problem

The great importance of the aviation industry, and of airports in general, was discussed in Section 1.1. This raises several questions about how vulnerable the aviation industry, and especially the airports, are to sudden changes. What would happen if some airports shut down or if some routes closed? In this report we investigate these cases using spectral cluster analysis. In particular, we want to see whether Europe can be divided into several clusters of airports, and how those clusters are connected. If some routes or airports were to shut down suddenly, how would that affect the ability to travel between different regions of Europe?

2. Clustering

Clustering can be performed with many algorithms, one of which is spectral clustering.
To do this, basic knowledge of graphs and of their representation through matrices is important, since spectral clustering uses the spectrum of the Laplacian matrix to perform clustering. Basic knowledge of standard machine learning algorithms such as K-means is also needed.

2.1 Graphs

A graph is a set of points represented as nodes, which are connected to each other via edges. The edges can be either weighted, meaning that some connections are stronger than others, or unweighted, meaning that all connections are equally strong. There are different ways of representing graphs. In this report the notation $G = (V, W)$ is used, where $V$ is a set of nodes and $W$ is a set of edges between the nodes. This set of edges $W$ can itself be represented by a matrix. Using this matrix notation, one can say that the edge between node $v_i$ and $v_j$ has the weight $w_{i,j}$, where $W = (w_{i,j})$.

Definition 2.1.1 (Weighted Graph)
Let $G = (V, W)$ be a graph with a set of nodes $V = \{v_1, \dots, v_n\}$ and a set of edges $W \in \mathbb{R}^{n \times n}$, where $w_{i,j}$ is the weight between node $i$ and node $j$.

Definition 2.1.2 (Symmetric Graph)
A graph $G = (V, W)$ is said to be symmetric if $W_{i,j} = W_{j,i}$ for all $i, j$.

Definition 2.1.3 (Degree Matrix)
The degree matrix $D$ is defined as the diagonal matrix with entries
\[ D_{i,i} := d_i := \sum_{j=1}^{n} W_{i,j}. \]

Definition 2.1.4 (Number of nodes in Graph)
The number of nodes in the graph $G$ is written as $|G|$.

Definition 2.1.5 (Volume of Graph)
The volume of a graph is written as $\mathrm{vol}(G)$ and is defined as
\[ \mathrm{vol}(G) := \sum_{i=1}^{n} d_i. \]

2.1.1 Methods for Constructing a Graph

Data can be represented as points $x^{(1)}, \dots, x^{(n)} \in \mathbb{R}^p$, where $p \in \mathbb{N}$, and these points can in turn be represented as nodes in a graph. A norm can be used to define a distance between all data points. There are several norms, for example the Euclidean norm. We denote the distance or similarity from point $i$ to point $j$ as $s_{i,j}$. Using the norm, one can decide whether or not two nodes should be connected.
This is often done using the $\varepsilon$-neighbourhood (Definition 2.1.6), the $k$-nearest neighbours (Definition 2.1.7), or by setting $w_{i,j} = s_{i,j}$. A graph is said to be fully connected if all nodes have nonzero edges to all other nodes, that is, $w_{i,j} > 0$ for $i \neq j$.

Definition 2.1.6 ($\varepsilon$-neighbourhood graph)
Given a value $\varepsilon$, all pairs of nodes with a distance $s_{i,j} < \varepsilon$ are connected, so
\[ w_{i,j} = \begin{cases} 1, & \text{if } s_{i,j} < \varepsilon, \\ 0, & \text{otherwise.} \end{cases} \]

Definition 2.1.7 ($k$-nearest neighbour graph)
Two vertices are connected if the vertex $v_i$ is among the $k$ nearest neighbours of the vertex $v_j$. An undirected graph can be created from this relation in several ways. The two most common are to connect two vertices only if $v_i$ is among the $k$ nearest neighbours of $v_j$ and $v_j$ is among the $k$ nearest neighbours of $v_i$, or to connect them if either one is among the $k$ nearest neighbours of the other.

Gaussian Similarity
In a data set with the data points $x^{(1)}, \dots, x^{(n)}$, one can use a Gaussian function to compare the similarity between two points $i$ and $j$:
\[ s_{i,j} = \exp\left( -\frac{\| x^{(i)} - x^{(j)} \|^2}{2\sigma^2} \right), \]
where $\sigma \in \mathbb{R}$.

Cosine Similarity
In a data set with data points $x^{(1)}, \dots, x^{(n)}$, the angle $\theta$ between the points $i$ and $j$ can be measured using $\cos \theta$.
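
To illustrate the graph constructions of Section 2.1.1 together with the degree and volume definitions of Section 2.1, the following Python sketch builds small graphs from a handful of points. It is a minimal sketch and not the implementation used in this report: the function names, the use of the Euclidean norm as the distance, the default value of sigma, and the choice to exclude self-loops (setting the diagonal weights to zero) are all assumptions made for the example.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian similarity exp(-||x - y||^2 / (2 sigma^2)); sigma=1.0 is an assumed default."""
    return math.exp(-euclidean(x, y) ** 2 / (2.0 * sigma ** 2))


def epsilon_graph(points, eps):
    """Epsilon-neighbourhood graph: weight 1 if the distance is below eps, else 0.

    Assumption: self-loops are excluded, so the diagonal is zero.
    """
    n = len(points)
    return [[1.0 if i != j and euclidean(points[i], points[j]) < eps else 0.0
             for j in range(n)] for i in range(n)]


def knn_graph(points, k, mutual=False):
    """Undirected k-nearest-neighbour graph.

    mutual=False connects i and j if either is among the other's k nearest
    neighbours; mutual=True requires both directions.
    """
    n = len(points)
    nearest = []
    for i in range(n):
        # Sort all other nodes by distance to node i and keep the k closest.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: euclidean(points[i], points[j]))
        nearest.append(set(others[:k]))
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            i_to_j = j in nearest[i]
            j_to_i = i in nearest[j]
            connected = (i_to_j and j_to_i) if mutual else (i_to_j or j_to_i)
            if connected:
                W[i][j] = 1.0
    return W


def degrees(W):
    """Diagonal entries d_i of the degree matrix: the row sums of W."""
    return [sum(row) for row in W]


def volume(W):
    """Volume of the graph: the sum of all degrees."""
    return sum(degrees(W))


# Two well-separated pairs of points in the plane (made-up example data).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
W_eps = epsilon_graph(points, eps=2.0)
W_knn = knn_graph(points, k=1)
print(W_eps)
print(degrees(W_eps), volume(W_eps))
```

In this made-up example both constructions connect only the points within each pair, so the resulting graph consists of two disconnected components. Replacing the 0/1 weights with `gaussian_similarity` values would instead give a fully connected weighted graph.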